
Edge AI: Processing Data Where It's Created Instead of Sending It to the Cloud

Edge AI hardware, model optimization techniques, deployment frameworks, and hybrid architectures for real-time inference at the edge.

Dragan Gavrić, Co-Founder & CTO · 13 min read

The default assumption for most AI workloads has been simple: collect data, send it to the cloud, run inference, return results. This architecture works well for batch processing and latency-tolerant applications. It breaks down when milliseconds matter, bandwidth is expensive, or privacy regulations prohibit data leaving a physical location.

Edge AI — running machine learning models directly on devices or local servers near the data source — addresses all three constraints. It’s not a replacement for cloud AI. It’s a complementary architecture that processes data where it’s generated, sending only results (not raw data) to the cloud when necessary.

The market is catching up to the technology. Gartner projects that by 2027, over 55% of deep neural network inference will happen at the edge rather than in centralized cloud data centers, up from roughly 10% in 2023. The hardware has gotten good enough, the optimization techniques have matured, and the use cases are compelling enough that edge AI has moved from experimental to production-ready.

Why Edge Computing Matters for AI Workloads

Three fundamental constraints drive the shift toward edge AI. Understanding them clarifies when edge processing is the right architectural choice.

Latency

A round trip from device to cloud and back typically takes 50-200ms under ideal conditions. Add network variability, and you’re looking at 100-500ms for real-world deployments. For many applications, that’s fine. For others, it’s disqualifying.

An autonomous vehicle processing camera feeds at 30 frames per second has 33ms per frame to detect obstacles and make decisions. A quality inspection system on a manufacturing line moving at 100 parts per minute has 600ms per part — and that includes image capture, preprocessing, inference, and actuator response. A real-time sports scoring system needs sub-100ms latency to keep up with live action.

Edge inference eliminates the network round trip entirely. A model running on an NVIDIA Jetson Orin can process a 640x480 image in 5-15ms depending on model complexity. That’s 10-40x faster than the cloud round-trip, and the latency is deterministic — no network variability.

Bandwidth Costs

Sending raw sensor data to the cloud is expensive at scale. A single industrial camera generates 1-5 GB of image data per hour. A fleet of 100 cameras generates 100-500 GB per hour. At standard cloud egress rates ($0.05-$0.09 per GB), that’s $120-$1,080 per day just in bandwidth costs, not counting cloud compute for inference.

Edge processing inverts this equation. Process images locally, extract the relevant information (defect detected at position X,Y with confidence 0.94), and send only the structured results to the cloud. The data volume drops by 99%+, and bandwidth costs drop proportionally.
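
To make the size difference concrete, here is a minimal sketch of the kind of structured payload an edge device might send instead of the raw frame. The field names and device identifier are illustrative, not a fixed schema.

```python
# Illustrative edge-to-cloud payload: instead of uploading the raw frame
# (hundreds of kilobytes as JPEG), the device sends a few hundred bytes of results.
import json
from datetime import datetime, timezone

result = {
    "device_id": "camera-line3-07",          # hypothetical device identifier
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "detections": [
        {"label": "surface_defect", "confidence": 0.94, "bbox": [412, 188, 33, 21]},
    ],
    "inference_ms": 9.2,
}

payload = json.dumps(result).encode("utf-8")
print(f"Payload size: {len(payload)} bytes")   # a few hundred bytes vs. megabytes of imagery
```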

Privacy and Data Sovereignty

Some data can’t leave a specific location. Medical imaging in a hospital, surveillance footage in certain jurisdictions, biometric data under GDPR — all may be subject to regulations that prohibit cloud transmission. Edge AI processes this data locally, ensuring raw data never leaves the premises. Only anonymized results or aggregate statistics are transmitted.

This is particularly relevant in the EU, where GDPR’s data minimization principle favors processing data at the point of collection rather than transmitting it to centralized systems for processing.

Edge AI Hardware

The hardware landscape for edge AI has evolved rapidly. The right choice depends on your inference requirements, power budget, and deployment environment.

NVIDIA Jetson Platform

NVIDIA’s Jetson lineup is the most widely adopted edge AI platform, and for good reason. The range covers use cases from embedded sensors to edge servers:

  • Jetson Orin Nano (40 TOPS): Entry-level for single-model inference. 7-15W power draw. Suitable for single-camera vision applications.
  • Jetson Orin NX (100 TOPS): Mid-range. Handles multiple models simultaneously or larger models. Typical for industrial inspection and robotics.
  • Jetson AGX Orin (275 TOPS): High-end edge compute. Multiple camera streams, complex multi-model pipelines. Used in autonomous vehicles and edge servers.

TOPS (Tera Operations Per Second) is the standard throughput metric for edge AI hardware, though real-world performance depends heavily on model architecture, precision, and optimization.

The Jetson platform’s strength is its CUDA ecosystem. Models developed and trained on NVIDIA GPUs in the cloud deploy to Jetson with minimal modification. TensorRT, NVIDIA’s inference optimizer, works across both cloud and edge hardware.

Google Coral

Google’s Coral platform uses the Edge TPU (Tensor Processing Unit), a purpose-built ASIC for neural network inference. The Coral USB Accelerator plugs into any Linux system and adds 4 TOPS of inference capability for under $60.

Coral excels at quantized models (INT8). If your model fits within the Edge TPU’s constraints — quantized to 8-bit integers, using supported operations — inference is fast and power-efficient. The limitation is flexibility. Models that use unsupported operations fall back to CPU execution, which is dramatically slower.

Best for: Single-model deployments where the model has been specifically designed and quantized for Edge TPU. Common in smart cameras, environmental sensors, and low-power IoT devices.

Intel Neural Compute Stick and OpenVINO

Intel’s approach centers on software rather than hardware. The OpenVINO toolkit optimizes models for inference on Intel CPUs, integrated GPUs, and the discontinued Neural Compute Stick (now replaced by Intel’s Arc GPU lineup and AI-specific silicon).

OpenVINO’s advantage is that it runs on existing Intel hardware — the x86 CPUs and integrated GPUs already in many edge servers and industrial PCs. If you have an Intel-based gateway or industrial PC, you can add AI inference without new hardware.

Performance is lower than dedicated AI accelerators, but for many edge applications — document classification, anomaly detection on time-series data, simple image classification — an optimized model on an Intel CPU is sufficient.

Apple Neural Engine and Qualcomm AI Engine

For mobile edge AI, Apple’s Neural Engine (in A-series and M-series chips) and Qualcomm’s AI Engine (in Snapdragon processors) provide on-device inference. Core ML and the Qualcomm Neural Processing SDK, respectively, are the deployment frameworks.

These matter for applications where the “edge” is a phone or tablet — field inspection, augmented reality, on-device translation, and mobile health applications.

Model Optimization: Making Models Fit the Edge

Cloud inference runs on GPUs with 24-80 GB of VRAM and hundreds of watts of power. Edge devices have 4-16 GB of memory and 5-30 watts. Bridging this gap requires aggressive model optimization.

Quantization

Quantization reduces the numerical precision of model weights and activations. A standard model uses 32-bit floating point (FP32). Quantization maps these to lower-precision representations:

  • FP16 (half-precision): 2x memory reduction, minimal accuracy loss. Supported on virtually all edge hardware. This is the default starting point for edge optimization.
  • INT8 (8-bit integer): 4x memory reduction from FP32. Typically 1-3% accuracy loss. Requires calibration with representative data. Supported on Jetson, Coral, and Intel hardware.
  • INT4 (4-bit integer): 8x memory reduction. 3-8% accuracy loss depending on model and task. Emerging support on newer hardware. Useful for extremely constrained deployments.

The process is straightforward with modern tools. TensorRT, ONNX Runtime, and TensorFlow Lite all support post-training quantization — you take a trained FP32 model, provide a calibration dataset (typically 100-1,000 representative samples), and the tool produces a quantized model. Quantization-aware training, where quantization is simulated during training, recovers some accuracy loss but requires retraining.
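
As a concrete illustration, here is a minimal post-training INT8 quantization sketch using TensorFlow Lite. The saved-model path and the random calibration tensors are placeholders for a real exported model and representative images.

```python
# Post-training INT8 quantization with TensorFlow Lite (sketch).
import numpy as np
import tensorflow as tf

saved_model_dir = "export/defect_classifier"   # hypothetical path to a trained, exported model

def representative_dataset():
    # Yield calibration samples matching the model's input shape and dtype.
    # In practice: loop over 100-1,000 real, representative images.
    for _ in range(200):
        sample = np.random.rand(1, 224, 224, 3).astype(np.float32)
        yield [sample]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full-integer quantization so the model runs on INT8-only accelerators (e.g. Edge TPU).
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("defect_classifier_int8.tflite", "wb") as f:
    f.write(tflite_model)
```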

Pruning

Pruning removes unnecessary weights from a neural network. Research consistently shows that large models contain significant redundancy — 50-90% of weights can be removed with minimal accuracy impact.

Unstructured pruning sets individual weights to zero. This reduces the theoretical computation but doesn’t always translate to faster inference because hardware doesn’t efficiently handle sparse matrices at arbitrary sparsity patterns.

Structured pruning removes entire filters, channels, or layers. This produces smaller, dense models that run faster on standard hardware. A structured-pruned model is genuinely smaller and faster, not just theoretically sparser.

In practice, pruning and quantization are combined. Prune the model to remove redundant structure, then quantize the remaining weights. The compound effect can reduce model size by 10-20x with less than 5% accuracy loss on many tasks.
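
A minimal structured-pruning sketch with PyTorch's built-in utilities is shown below. The MobileNetV3 model and the 30% pruning ratio are illustrative, and in practice the zeroed filters still need to be physically removed (for example with a dedicated pruning library) and the network fine-tuned before you see the real size and speed gains.

```python
# Structured pruning sketch: zero out entire convolutional filters ranked by L2 norm.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
from torchvision.models import mobilenet_v3_small

model = mobilenet_v3_small(weights=None)   # stand-in for your trained edge model

for module in model.modules():
    if isinstance(module, nn.Conv2d):
        # Remove 30% of output filters (dim=0) per layer, ranked by L2 norm.
        prune.ln_structured(module, name="weight", amount=0.3, n=2, dim=0)
        prune.remove(module, "weight")     # bake the mask into the weights

zero_fraction = sum((p == 0).sum().item() for p in model.parameters()) / \
                sum(p.numel() for p in model.parameters())
print(f"Fraction of weights zeroed: {zero_fraction:.1%}")
```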

Knowledge Distillation

Knowledge distillation trains a small “student” model to mimic a large “teacher” model. Instead of training the student on raw data labels, you train it on the teacher’s output probabilities — the “soft” labels that contain information about the teacher’s uncertainty and class relationships.

A common edge AI pattern: train a large ResNet-152 or Vision Transformer in the cloud (the teacher), then distill its knowledge into a MobileNetV3 or EfficientNet-Lite (the student). The student achieves 85-95% of the teacher’s accuracy with 10-50x less computation.

Distillation is particularly effective when your edge task is narrower than the general task the teacher was trained on. A teacher trained on ImageNet’s 1,000 classes distilled into a student that only needs to classify 10 defect types retains nearly all relevant knowledge in a fraction of the parameters.
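
The core of distillation is the loss function. A minimal PyTorch version is sketched below; the temperature and weighting values are typical starting points, not prescriptions.

```python
# Minimal knowledge-distillation loss (sketch): blend the hard-label loss with a
# KL-divergence loss against the teacher's temperature-softened distribution.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: match the teacher's softened output distribution (scaled by T^2).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Inside the training loop (teacher frozen, student trainable):
# with torch.no_grad():
#     teacher_logits = teacher(images)
# loss = distillation_loss(student(images), teacher_logits, labels)
# loss.backward(); optimizer.step()
```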

TensorRT and ONNX Runtime

TensorRT is NVIDIA’s inference optimizer and runtime. It takes a trained model (from PyTorch, TensorFlow, or ONNX format), applies layer fusion, kernel auto-tuning, quantization, and memory optimization to produce a deployment-ready engine optimized for specific NVIDIA hardware.

Performance gains from TensorRT optimization are typically 2-5x over running the original framework on the same hardware. For Jetson deployments, TensorRT is essentially mandatory for production-quality inference speeds.

ONNX Runtime provides hardware-agnostic inference optimization. It takes ONNX-format models and optimizes them for the execution provider — CUDA, TensorRT, OpenVINO, Core ML, or CPU. The advantage is portability: one ONNX model runs on NVIDIA, Intel, ARM, and Apple hardware with appropriate execution providers.

For multi-platform edge deployments — where some nodes are Jetson, some are Intel-based, and some are ARM — ONNX Runtime provides a unified deployment path.
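
A minimal ONNX Runtime sketch illustrating that provider-priority idea follows; the model filename and input shape are placeholders.

```python
# ONNX Runtime inference with prioritized execution providers (sketch).
# The same .onnx file can run on a Jetson (TensorRT/CUDA), an Intel gateway
# (OpenVINO), or a plain CPU, depending on which providers are available.
import numpy as np
import onnxruntime as ort

preferred = [
    "TensorrtExecutionProvider",   # Jetson / NVIDIA GPUs
    "CUDAExecutionProvider",
    "OpenVINOExecutionProvider",   # Intel CPUs and integrated GPUs
    "CPUExecutionProvider",        # universal fallback
]
providers = [p for p in preferred if p in ort.get_available_providers()]

session = ort.InferenceSession("defect_detector.onnx", providers=providers)  # hypothetical model file
input_name = session.get_inputs()[0].name

frame = np.random.rand(1, 3, 480, 640).astype(np.float32)   # stand-in for a preprocessed camera frame
outputs = session.run(None, {input_name: frame})
print("Running on:", session.get_providers())
```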

Edge Deployment Frameworks

The gap between a model that works in a Jupyter notebook and a model running reliably on 500 edge devices is substantial. Deployment frameworks bridge this gap.

TensorFlow Lite

TensorFlow Lite is the most mature framework for mobile and embedded deployment. It supports model conversion from TensorFlow with automatic optimization, runs on Android, iOS, Linux, and microcontrollers, and has broad hardware delegate support (GPU, NNAPI, Coral Edge TPU, Hexagon DSP).

For simple classification and detection tasks on mobile or embedded Linux, TensorFlow Lite is often the fastest path to deployment.
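
For illustration, here is a minimal inference loop with the TensorFlow Lite interpreter, reusing the quantized model file from the earlier quantization sketch; the random input stands in for a preprocessed camera frame.

```python
# Running a quantized .tflite model with the TensorFlow Lite interpreter (sketch).
# The same interpreter API is available on embedded Linux and, via delegates,
# on GPUs, NNAPI, or the Coral Edge TPU.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="defect_classifier_int8.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Stand-in for a preprocessed frame, quantized to the model's INT8 input dtype.
frame = np.random.randint(-128, 127, size=input_details["shape"], dtype=np.int8)
interpreter.set_tensor(input_details["index"], frame)
interpreter.invoke()

scores = interpreter.get_tensor(output_details["index"])
print("Predicted class:", int(np.argmax(scores)))
```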

PyTorch Mobile and ExecuTorch

PyTorch Mobile allows deploying PyTorch models to iOS and Android. Meta’s newer ExecuTorch project targets edge and embedded deployment more broadly, with optimizations for ARM CPUs, Apple Neural Engine, and Qualcomm hardware.

If your ML team trains in PyTorch (which is the majority of research and many production teams), the PyTorch-native deployment path avoids model conversion complexity.

Apache TVM

Apache TVM is a compiler-based approach to model optimization and deployment. It takes models from any framework (PyTorch, TensorFlow, ONNX) and compiles them to optimized code for specific hardware targets — including targets that other frameworks don’t support well, like RISC-V processors and custom accelerators.

TVM’s auto-tuning capability searches for the fastest implementation of each operation on your specific hardware. This can outperform framework-specific optimizations on non-standard hardware, but the compilation and tuning process is more complex.

Edge-Cloud Hybrid Architecture

Pure edge or pure cloud is rarely the right answer. The most effective architectures combine both, with clear rules about what runs where.

The Split Inference Pattern

Run a lightweight model at the edge for real-time decisions, and periodically send data to the cloud for deeper analysis with a larger model.

Example: Quality inspection. An edge device runs a MobileNet-based defect detector on every part (5ms inference time, 95% accuracy). Parts flagged as potentially defective are photographed at high resolution and sent to the cloud, where a large EfficientNet-B7 model runs a detailed analysis (200ms inference time, 99.5% accuracy). The edge model handles 95% of decisions in real-time. The cloud model handles the 5% that need higher accuracy.

This pattern reduces cloud costs (only 5% of data is sent to the cloud), maintains real-time responsiveness (edge decisions are instant), and provides high accuracy where it matters (the cloud model catches edge model mistakes).
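
A simplified routing sketch of this pattern is below. The confidence threshold and the stubbed edge and cloud hooks are illustrative assumptions, not a reference implementation.

```python
# Split-inference routing sketch: the edge model handles every real-time decision;
# only low-confidence results are escalated to the larger cloud model.
import random

CONFIDENCE_THRESHOLD = 0.85   # below this, defer to the cloud model (assumed value)

def edge_predict(frame):
    """Stand-in for the on-device MobileNet detector (~5 ms per frame)."""
    return ("defect", random.uniform(0.5, 1.0))

def submit_to_cloud(part_id, frame):
    """Stand-in for queuing a high-resolution image for asynchronous cloud analysis."""
    print(f"Part {part_id}: escalated to cloud model")

def inspect_part(frame, part_id):
    label, confidence = edge_predict(frame)
    if confidence >= CONFIDENCE_THRESHOLD:
        # High-confidence edge decision: act immediately, send only the result upstream.
        return {"part_id": part_id, "decision": label, "confidence": confidence, "source": "edge"}
    # Ambiguous case: keep the line moving while the cloud model issues the final verdict.
    submit_to_cloud(part_id, frame)
    return {"part_id": part_id, "decision": "pending_cloud_review", "source": "cloud"}

print(inspect_part(frame=None, part_id="A-1042"))
```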

Over-the-Air Model Updates

Edge models need updating as new data becomes available and model improvements are developed. OTA update architecture for edge AI includes:

  1. Model versioning. Every deployed model has a version identifier. The device reports its current model version to the management server.
  2. Staged rollout. New models deploy to a canary group first (5-10% of devices). Monitor accuracy and performance before rolling out to the full fleet.
  3. Rollback capability. If a new model performs worse than expected, automatically roll back to the previous version. Devices should store at least two model versions locally.
  4. Differential updates. Don’t send the entire model for every update. Send only the changed weights. This reduces download size by 60-90% for minor model updates.
  5. Validation on device. After downloading a new model, the device runs it against a local validation set before switching to it for production inference. If validation accuracy drops below a threshold, the device rejects the update and reports back (see the sketch after this list).
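
A minimal sketch of the on-device validation step (item 5) follows; the model-loading hook, accuracy floor, and validation data layout are assumptions for illustration.

```python
# On-device validation sketch: check a newly downloaded model against a small
# local validation set before promoting it to production inference.
import json
import numpy as np

ACCURACY_FLOOR = 0.92   # reject updates that fall below this threshold (assumed value)

def validate_candidate(candidate_path, val_inputs, val_labels, load_model):
    model = load_model(candidate_path)                      # hypothetical loader for the runtime in use
    predictions = np.argmax(model.predict(val_inputs), axis=-1)
    accuracy = float((predictions == val_labels).mean())

    report = {"model": candidate_path, "val_accuracy": accuracy, "accepted": accuracy >= ACCURACY_FLOOR}
    print(json.dumps(report))                               # reported back to the fleet management server
    return report["accepted"]

# if validate_candidate("models/v2.4.1.tflite", x_val, y_val, load_model):
#     promote("models/v2.4.1.tflite")   # switch production inference to the new version
# else:
#     keep_current_version()            # stay on the previous model and flag the update
```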

When we built the IoT infrastructure for EcoBikeNet — a bike tracking and monitoring system — edge-side intelligence was essential for real-time decision-making where network connectivity wasn’t guaranteed. The architecture followed this hybrid pattern: lightweight anomaly detection models ran on the edge devices for immediate alerts, while more comprehensive analytics ran in the cloud on aggregated data. OTA updates ensured the edge models stayed current without requiring physical access to dispersed devices across a bike-sharing network.

Monitoring an Edge Fleet

Monitoring hundreds or thousands of edge devices running AI models introduces unique challenges:

  • Device health. CPU/GPU temperature, memory usage, disk space, network connectivity. Devices in outdoor or industrial environments face environmental stress that cloud servers don’t.
  • Model performance. Track inference accuracy, latency, and throughput per device. A model that performs well in the lab might degrade in the field due to different lighting, camera angles, or environmental conditions.
  • Data drift detection. Monitor whether the input data distribution is changing in ways that degrade model accuracy. If a camera’s lens gets dirty or lighting conditions change seasonally, model accuracy drifts, and you need to detect it before it becomes a problem.
  • Connectivity status. Track which devices are online, which are reporting intermittently, and which have gone dark. For remote deployments, connectivity loss might indicate hardware failure, power issues, or network problems.

Build a fleet management dashboard that shows aggregate health metrics and allows drill-down to individual devices. Set alerts for anomalies: a device whose inference latency suddenly doubles, a device whose reported accuracy drops below threshold, a device that stops reporting entirely.
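
Drift detection does not have to be elaborate to be useful. The sketch below compares a simple per-frame statistic against a reference window using the Population Stability Index; the brightness statistic and the 0.25 threshold are illustrative choices.

```python
# Simple input-drift check (sketch): compare the distribution of a per-frame
# statistic (here, mean brightness) against a reference window collected at
# deployment time, using the Population Stability Index (PSI).
import numpy as np

def psi(reference, current, bins=10):
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference) + 1e-6
    cur_pct = np.histogram(current, bins=edges)[0] / len(current) + 1e-6
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

reference_brightness = np.random.normal(120, 15, 5_000)   # baseline window (stand-in data)
current_brightness = np.random.normal(95, 20, 1_000)      # last hour of frames, e.g. after a lens gets dirty

score = psi(reference_brightness, current_brightness)
if score > 0.25:   # > 0.25 is a common rule of thumb for significant drift
    print(f"PSI={score:.2f}: input drift detected, flag device for inspection or retraining")
```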

Use Cases: Where Edge AI Delivers Business Value

Predictive Maintenance in Manufacturing

Sensors on industrial equipment generate vibration, temperature, and acoustic data continuously. Cloud-based analysis adds latency that can mean the difference between catching a bearing failure before it happens and reacting after the equipment is already damaged.

Edge AI processes sensor data locally, detecting anomalies in real-time. A model running on a Jetson Nano analyzes vibration patterns from an accelerometer at 10kHz sampling rate, detecting bearing wear signatures weeks before failure. The edge device sends an alert to the maintenance system immediately — no network round-trip, no cloud dependency.
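
A simplified version of that edge-side check might look like the following. The frequency band, baseline value, and alert threshold are placeholders that would come from healthy-equipment recordings; real deployments use richer features such as envelope analysis and known bearing fault frequencies.

```python
# Vibration-monitoring sketch: compare FFT band energy from accelerometer samples
# against a baseline learned from healthy equipment.
import numpy as np

SAMPLE_RATE = 10_000   # 10 kHz accelerometer, as in the example above

def band_energy(window, low_hz, high_hz):
    spectrum = np.abs(np.fft.rfft(window))
    freqs = np.fft.rfftfreq(len(window), d=1.0 / SAMPLE_RATE)
    mask = (freqs >= low_hz) & (freqs <= high_hz)
    return float(np.sum(spectrum[mask] ** 2))

window = np.random.randn(SAMPLE_RATE)   # 1 second of vibration data (stand-in)
baseline = 1.0e6                        # learned from healthy-equipment recordings (assumed)

energy = band_energy(window, low_hz=2_000, high_hz=4_000)   # band where wear signatures appear (assumed)
if energy > 3 * baseline:
    print("Anomalous vibration energy: raise maintenance alert")
```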

The ROI is straightforward. Unplanned downtime in manufacturing costs $50,000-$250,000 per hour depending on the equipment. A single prevented failure pays for the entire edge AI deployment.

Visual Quality Inspection

Manufacturing quality inspection is one of the most mature edge AI applications. Cameras capture images of products on a production line, and an edge device runs inference to detect defects: surface scratches, dimensional errors, assembly mistakes, material impurities.

The requirements are demanding: sub-100ms latency per part, 99%+ accuracy (false rejects are costly, missed defects are costlier), and operation in harsh industrial environments with variable lighting.

Modern edge AI achieves this. A YOLOv8 model optimized with TensorRT on a Jetson AGX Orin processes 640x480 images in 8-12ms with defect detection accuracy above 99.2% on well-curated training data. That’s fast enough for production lines running 200+ parts per minute.

Retail Analytics

Edge AI in retail processes camera feeds for foot traffic analysis, queue length estimation, shelf stock monitoring, and demographic analysis — all without sending video to the cloud, which addresses both bandwidth costs and privacy concerns.

A single store with 10 cameras generates roughly 50 GB of video per day. Processing this in the cloud costs approximately $15-$25/day in bandwidth and compute. An edge device processing the same feeds locally costs $0.10-$0.20/day in electricity, plus the one-time hardware cost of $300-$800.

The privacy advantage is equally compelling. Edge processing means video stays in the store. No personally identifiable images are transmitted or stored centrally. This simplifies GDPR compliance and reduces the regulatory surface area of the system.

Autonomous Vehicles and Robotics

The most compute-intensive edge AI application. Autonomous systems fuse data from cameras, LiDAR, radar, and ultrasonic sensors to build a real-time world model and make driving or navigation decisions. Cloud latency is unacceptable — decisions must happen in under 50ms.

This domain pushes edge hardware to its limits. NVIDIA’s DRIVE platform, based on the Orin architecture, delivers 254 TOPS at the vehicle edge. Even so, model optimization is critical — autonomous driving stacks run multiple models simultaneously (object detection, lane detection, depth estimation, path planning), and each must be optimized for the shared compute budget.

Practical Recommendations

Start with the Use Case, Not the Hardware

Define your latency, accuracy, and privacy requirements first. Then select hardware and optimize models to meet those requirements. Buying an NVIDIA Jetson AGX Orin for a use case that a Coral USB Accelerator handles is wasted budget. Running an edge deployment when cloud inference meets your latency needs is unnecessary complexity.

Optimize Aggressively Before Deploying

Don’t deploy your cloud model to the edge and hope it fits. Build a systematic optimization pipeline: distillation to reduce model size, pruning to remove redundancy, quantization to reduce precision, and TensorRT/ONNX Runtime compilation for hardware-specific optimization. Measure accuracy at each step and stop when you hit the accuracy-performance trade-off that your use case requires.

Plan for the Fleet from Day One

One edge device is a prototype. A hundred edge devices is an operations challenge. Build fleet management, OTA updates, monitoring, and rollback capability before you scale, not after. The cost of retrofitting fleet management after deployment is 3-5x higher than building it in from the start.

Budget for Data Labeling

Edge AI models need training data specific to their deployment environment. A model trained on internet images of defects won’t accurately detect defects in your specific manufacturing process with your specific lighting and camera setup. Budget for domain-specific data collection and labeling — it’s typically 40-60% of the total project cost and the primary determinant of model accuracy.

Edge AI is the natural evolution of IoT architectures. The first generation of IoT collected data and sent it to the cloud. The current generation processes data at the source and sends only decisions. The hardware, optimization tools, and deployment frameworks have matured to the point where this is a practical, production-ready approach for latency-sensitive, bandwidth-constrained, and privacy-critical applications. The organizations that invest in edge AI capability now will have a significant operational advantage as sensor data volumes continue to grow faster than network bandwidth.


Dragan Gavrić, Co-Founder & CTO

Co-founder of Notix with deep expertise in software architecture, AI development, and building scalable enterprise solutions.