Why Hailo
We started shipping Hailo-8 accelerators about two years ago, after testing the chip head-to-head against a Jetson Xavier NX on a vision inspection workload. The headline numbers were clear: comparable inference performance at roughly a quarter of the power, with a much smaller thermal envelope. After eight production stations, here's what we know that's not in the marketing material.

The toolchain workflow, in practice
The path from a trained model to a Hailo-deployed binary is:

1. Train in PyTorch / TensorFlow
2. Export to ONNX
3. Optimize with the Hailo Dataflow Compiler (DFC) — this includes quantization to INT8
4. Compile to a Hailo Executable Format (HEF) targeting the specific chip (Hailo-8 / 8L / 15)
5. Deploy via the HailoRT runtime
Steps 3 and 4 are where the real work happens. The DFC needs a representative calibration dataset — at least 64 images, ideally 512 — captured under production conditions. Calibration is the difference between "almost the same accuracy as FP32" and "embarrassing accuracy regression we explain to the customer".
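For concreteness, here is roughly what steps 3 and 4 look like in code. Treat this as a hedged sketch: `ClientRunner` and its methods come from the `hailo_sdk_client` package as we've used it, but names and signatures drift between DFC releases, and the file names here are placeholders.

```python
import numpy as np
from hailo_sdk_client import ClientRunner

# Calibration set: 64-512 production-condition images, preprocessed exactly
# like inference inputs. File name and shape are placeholders.
calib = np.load("calib_set.npy")  # e.g. (512, 640, 640, 3)

runner = ClientRunner(hw_arch="hailo8")
runner.translate_onnx_model("model.onnx", "inspect_v1")  # parse ONNX into Hailo's graph format
runner.optimize(calib)   # step 3: INT8 quantization against the calibration set
hef = runner.compile()   # step 4: allocate chip resources, emit the HEF binary

with open("inspect_v1.hef", "wb") as f:
    f.write(hef)
```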
Quantization sensitivity is real
Some architectures quantize cleanly. Others don't.

- ResNet, MobileNet, YOLO families — INT8 with <1 % accuracy regression. No drama.
- Transformers (ViT, DETR) — sensitive. Often need per-channel quantization, sometimes need partial FP16 retention on attention heads.
- Anomaly detection (PatchCore, EfficientAD) — distance-based scoring is sensitive to quantization noise. We spent a week recovering 2 % AUROC on EfficientAD with QAT before deciding to keep it on a Jetson Orin Nano instead.
The pragmatic rule: if your model has unusual numerics (cosine similarity in the loss, distance-based scoring, custom layer norms), assume quantization will cost you 1-3 % accuracy and budget for QAT.
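What "budget for QAT" means in practice is a short fine-tune with fake-quant observers in the graph. Below is a minimal sketch using PyTorch's standard eager-mode API; nothing in it is Hailo-specific, and the model, loader, criterion, and optimizer names are placeholders.

```python
import torch
from torch.ao.quantization import get_default_qat_qconfig, prepare_qat, convert

model = load_model()                # placeholder: your trained FP32 model
model.train()
model.qconfig = get_default_qat_qconfig("fbgemm")
prepare_qat(model, inplace=True)    # insert fake-quant observers into the graph

for epoch in range(3):              # a few epochs usually recovers most of the loss
    for x, y in train_loader:       # placeholder loader
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

model.eval()
quantized = convert(model)          # fold observers into INT8 weights/activations
# From here you'd typically export back to ONNX and rerun the DFC flow above.
```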
Memory & multi-model deployments
Hailo-8 has 20 MB of on-chip SRAM. A typical YOLOv8s post-quantization is around 12 MB; YOLOv8m is around 25 MB and doesn't fit alone. The chip then "context switches" — loading partial graphs from host RAM — which costs latency.

For multi-model deployments (e.g. detection + classification + OCR on the same chip), HailoRT supports model swapping between frames. It's measurably slower than a single model. We size for single-model where latency matters, and multi-model where the use case can tolerate 30-50 ms swap penalties.
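For the multi-model case, here is a hedged sketch of two HEFs sharing one chip through HailoRT's Python bindings, with the built-in scheduler doing the swapping. The `hailo_platform` names are our best recollection of the API and the HEF paths are placeholders; verify against your HailoRT release before trusting any of it.

```python
import numpy as np
from hailo_platform import (HEF, VDevice, ConfigureParams, FormatType,
                            HailoSchedulingAlgorithm, HailoStreamInterface,
                            InferVStreams, InputVStreamParams, OutputVStreamParams)

params = VDevice.create_params()
# Let HailoRT's scheduler arbitrate the chip between models.
params.scheduling_algorithm = HailoSchedulingAlgorithm.ROUND_ROBIN

with VDevice(params) as device:
    models = []
    for path in ("detect.hef", "classify.hef"):  # placeholder HEF files
        hef = HEF(path)
        cfg = ConfigureParams.create_from_hef(hef, interface=HailoStreamInterface.PCIe)
        models.append((hef, device.configure(hef, cfg)[0]))

    # Alternate models frame to frame; the per-frame latency delta versus
    # running one model back-to-back is your effective swap penalty.
    for hef, group in models * 25:
        inputs = {info.name: np.zeros((1, *info.shape), dtype=np.float32)
                  for info in hef.get_input_vstream_infos()}
        in_params = InputVStreamParams.make(group, format_type=FormatType.FLOAT32)
        out_params = OutputVStreamParams.make(group, format_type=FormatType.FLOAT32)
        with InferVStreams(group, in_params, out_params) as infer:
            results = infer.infer(inputs)
```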
Hailo-15 vs Hailo-8 — when to upgrade
Hailo-15 is the newer SoC-style chip with built-in ISP, video codec, and more compute. We use it when:

- The cell is space-constrained and we want camera + accelerator on a single board
- We need >1 stream at production resolution
- Multi-model deployments stop fitting on Hailo-8
For a single-camera, single-model station, the Hailo-8 M.2 is still the cheapest path.
One thing we'd warn about
The Hailo ecosystem is excellent if you're building one camera-to-decision pipeline per chip. It is less ergonomic if you're building a heterogeneous data pipeline with 20 transforms and 3 conditional models — for that you want CPU + Hailo, not Hailo alone.

Anyone running Hailo-15 in real cells yet? Curious about the ISP integration story and whether it actually replaces a discrete camera ASIC.