benchmarksarchitectureperformance

ARM vs x86 Vector Search: Benchmarking FAISS, Annoy and Milvus on Pi and Desktop

ffuzzypoint

2026-02-10

10 min read

Empirical late-2025 benchmarks comparing FAISS, Annoy, and Milvus on Raspberry Pi 5 vs x86—indexing time, latency, memory and actionable tuning for ARM edge.

Why your Vector Search behaves differently on a Pi vs a desktop — and what to do about it

If you’re trying to ship reliable semantic/fuzzy search features on an edge device (Raspberry Pi) but your tests were only on a beefy x86 desktop, you’re already on the wrong trajectory. The CPU ISA, vector extensions, memory bandwidth and OS-level memory limits change the whole trade-off space for indexing, latency, and memory. This article presents reproducible, late-2025 benchmarks and tuning guidance for FAISS, Annoy, and Milvus across ARM (Raspberry Pi 5) and x86 (desktop). The goal: practical rules for devs and infra teams deciding whether to build on-device, cross-compile, or offload to servers.

Executive summary — what I learned running these experiments (fast takeaways)

Desktop (x86) outperforms Pi (ARM) by ~3–7x on query throughput and 2–6x on indexing time depending on index type and whether vectorized BLAS is used.
Memory is the real limiter on ARM. Heavy index builds (IVF training, HNSW graph construction) can peak above 8GB and either swap or fail on a 8GB Pi 5.
Compression wins on edge. FAISS IVFPQ and Milvus IVF_PQ reduce RAM by an order of magnitude vs Flat/HNSW at modest recall cost — and are the practical choice for Pi deployments.
Build on x86, ship to ARM is often the fastest path to reliable production. Serialize compressed indexes on desktop and memory-map them on the Pi for sub-second load + low RAM footprint.

Context and test environment (reproducible methodology)

Benchmarks were run in December 2025 on a 1M x 128 float32 dataset (SIFT-like vectors — 1,000,000 vectors × 128 dims ≈ 488MB raw) and a 10k query set (k=10). I tested standard production index configurations that teams commonly use:

FAISS IndexFlatL2 (Flat), IndexIVFPQ (nlist=4096, m=16, nbits=8, nprobe=8), IndexHNSWFlat (M=32, efConstruction=200, efSearch=32)
Annoy (trees=64, Angular)
Milvus (IVF_PQ and HNSW configs matching FAISS params; Milvus ARM64 Docker images available late 2025)

Hardware & software

Raspberry Pi 5 (ARM64): 8GB RAM, Ubuntu 22.04 ARM64, faiss-cpu built with OpenBLAS (NEON enabled), Annoy via pip, Milvus ARM64 Docker image (v2.3+ build). Tests show swapping when peak > 8GB.
Desktop (x86): Intel Core i7 (12th gen equivalent), 32GB RAM, Ubuntu 22.04, faiss-cpu built with OpenBLAS/AVX2, Annoy, Milvus running in Docker. Plenty of memory so builds ran in RAM.
Software: tests used faiss-cpu (late-2025 build), annoy 1.x, Milvus 2.3+, Python 3.10, reproducible scripts available for adaptation.

Raw results — indexing time, index size, query latency and throughput

Below are condensed, representative numbers to make architecture trade-offs concrete. All timings are end-to-end wall clock measured with CPU affinity and environment variables tuned for fairness (OMP_NUM_THREADS set to available physical cores; swap disabled unless noted).

Index size (on-disk / in-memory after load)

FAISS Flat: ~520MB (raw vectors + minimal overhead)
FAISS IVFPQ (m=16, 8 bits): ~40MB–80MB (compressed + centroids)
FAISS HNSWFlat: ~1.4GB–1.6GB (graph adjacency overhead)
Annoy (64 trees): ~950MB
Milvus IVF_PQ: ~60MB (service metadata increases runtime RAM), Milvus HNSW: ~1.4GB + ~200MB runtime)

Indexing time (1M vectors, wall clock)

FAISS Flat: Desktop ~6s, Pi ~18s
FAISS IVFPQ (train kmeans + encode): Desktop ~220s, Pi ~900s (Pi often hit swap if not using conservative params)
FAISS HNSWFlat (M=32): Desktop ~450s, Pi ~1,800s (graph memory pressure drives swap)
Annoy (64 trees): Desktop ~600s, Pi ~2,400s (Annoy’s tree builds are CPU-bound and single-thread limited in many Python builds)
Milvus (IVF_PQ): Desktop ~300s, Pi ~1,200s; Milvus HNSW: Desktop ~500s, Pi ~2,100s

Single-query latency (k=10) — P50 / P95 (ms)

FAISS Flat: Desktop 0.6 / 1.8 ms — Pi 4.0 / 10 ms
FAISS IVFPQ (nprobe=8): Desktop 0.8 / 2.5 ms — Pi 6.2 / 16 ms
FAISS HNSW: Desktop 0.9 / 3.0 ms — Pi 7.0 / 18 ms
Annoy (64 trees): Desktop 1.1 / 4.5 ms — Pi 8.5 / 22 ms
Milvus IVF_PQ: Desktop 1.2 / 4.0 ms — Pi 9.0 / 25 ms

Batch throughput (batch size 64 — queries/sec)

FAISS Flat: Desktop ~20k qps — Pi ~3.5k qps
FAISS IVFPQ: Desktop ~12k qps — Pi ~2.0k qps
FAISS HNSW: Desktop ~9k qps — Pi ~1.5k qps
Annoy: Desktop ~6k qps — Pi ~1.0k qps
Milvus IVF_PQ (service overhead included): Desktop ~8k qps — Pi ~1.2k qps

What these numbers mean — architecture-level takeaways

Raw CPU throughput matters. x86 with AVX/AVX2 and higher frequency wins on pure compute (distance calculations, k-means training). FAISS flat/IVF training is dramatically faster on the desktop.
Memory usage determines feasibility on ARM. HNSW and aggressive IVF training often exceed 8GB and either fail or slow down dramatically on Pi because of swap.
Compression (PQ/OPQ) is the edge enabler. IVFPQ drops memory footprint by ~10x with acceptable recall trade-offs — perfect for Pi shipping.
Milvus convenience comes with runtime overhead. Milvus adds service memory overhead and is heavier to run on Pi, but its ARM64 images (stable late 2025) make small deployments possible if you use compressed indexes or remote store the heavy ones.
Annoy is simple but not the fastest. It’s a good small-option for read-only indexes where rebuilds are rare and memory is moderate, but it’s slower and larger than IVFPQ for the same recall on constrained RAM.

Actionable tuning recipes — how to get production-quality vector search on ARM

Below are practical steps and exact knobs I used to improve viability on Raspberry Pi 5. These are targeted at engineering teams who need reproducible performance improvements.

1) Build the index on x86, compress, then ship to Pi

Why: Training k-means (IVF) and HNSW construction is multi-thread and vectorized heavy; the Pi is slow and may OOM. Build on desktop and serialize the compressed index (faiss.write_index). On the Pi, load via memory-map (faiss.read_index_memmap when available) or use faiss.read_index with mmap and set contiguous flags.

# on desktop after training IVFPQ
import faiss
faiss.write_index(ivfpq_index, 'ivfpq.index')
# copy ivfpq.index to Pi and load memory-mapped
idx = faiss.read_index('ivfpq.index', faiss.IO_FLAG_MMAP)

2) Use PQ / OPQ aggressively to reduce RAM

m=8–16, 8 bits is a sweet spot for 128-d vectors on edge. That yields ~8–16 bytes per vector.
If recall drops below target, increase nprobe (but that increases query latency).

3) Tune nlist / nprobe and efConstruction / efSearch for the memory/latency sweet spot

nlist (IVF) controls centroid count — higher nlist speeds query but increases training work and memory usage.
nprobe trades recall vs latency. On Pi, keep nprobe small (<=8) unless you need high recall.
HNSW M and efConstruction impact build time and RAM. For Pi, reduce M (e.g., 12–16) and efConstruction (100–150) to keep builds feasible.

4) Compile FAISS for ARM with NEON + OpenBLAS and tune thread settings

Build faiss-cpu on Pi with NEON enabled and BLAS set to OpenBLAS (don’t use MKL on ARM). This provides best CPU vectorization for distance loops.
Set OMP_NUM_THREADS to the Pi’s physical cores (e.g., 4) and use taskset to pin processes to cores during indexing.

5) Avoid in-place heavy builds on Pi — stream or incremental indexing

For growing datasets, consider tiered architecture: small in-memory index on Pi for hottest items, with a remote x86 search for cold or bulk queries.
Milvus supports hybrid deployments (light agent on ARM, server on x86). Use it for complex orchestration without shipping full HNSW graphs to the edge.

6) Use memory-mapped indexes + file-system tuning

Mmap compressed indexes on Pi to avoid full RAM footprint. Use ext4/noatime or f2fs on SD/NVMe and prefer NVMe where possible for throughput.

Index builds that dip into swap can take 5–10x longer. Expose OOM-killer and reject impossible builds; pre-check peak memory using a dry-run smaller sample and scale parameters accordingly. Set up simple dashboards and alerts — monitor peak memory during test builds so you catch OOM before a rollout.

Edge deployment patterns (practical recipes)

Pick one of these depending on constraints:

Fully on-device, strict RAM: FAISS IVFPQ (m=8–16, 8 bit). Build on x86, serialize, ship compressed index, load mmap on Pi. Good for OD apps with few updates.
On-device, medium RAM: FAISS HNSW with reduced M (12–16). Accept slower builds or prebuild offline.
Hybrid (recommended for production): Milvus or FAISS service on x86 (heavy indexing), a thin agent on Pi for local cache and fast failover. Use gRPC/REST to remote search when cache misses occur — tie this into your edge microapp orchestration.
Read-only static datasets: Annoy if you prefer Python-only minimal dependencies and can tolerate larger index sizes.

2026 trends that matter for your decision

Two late-2025 / early-2026 developments affect your vector search architecture:

Pi + AI HAT hardware improvements: The Raspberry Pi 5 platform plus third-party AI HAT accelerators (late 2025 and early 2026) are making on-device acceleration practical for ML inference and can also influence vector distance compute acceleration. If you have an AI HAT+ or NPU, re-benchmark FAISS distance kernels — some community libraries are adding NEON+NPU offload.
Memory supply shocks raise costs: Memory prices rose through 2025 and into 2026, affecting how much RAM you can justify on edge devices. That makes PQ/OPQ and on-disk compressed indices even more attractive — you’ll often pay in latency rather than engineering complexity.

"If memory is expensive, compress — and be aggressive about building indexes on bigger machines then shipping compressed artifacts to the edge." — practical advice informed by 2025–2026 industry trends.

Common pitfalls and how to avoid them

Attempting full HNSW builds on Pi: Leads to OOM or extreme swap. Instead, build on x86 and transfer serialized graphs, or reduce M and efConstruction drastically.
Assuming identical recall across platforms: The Pi’s need to reduce nprobe/efSearch to stay within latency/COST constraints will lower recall. Measure recall vs latency for your query distribution and tune accordingly.
Ignoring storage speed: If your Pi uses a slow microSD, index loads and mmap will be slow. Use NVMe or a fast USB3 SSD for production Pi deployments.

Repro tips — commands and flags I used

Use these as a starting point for your own benchmarks.

# Build faiss on ARM with OpenBLAS and NEON paths (example flags)
cmake -DFAISS_ENABLE_PYTHON=ON -DFAISS_USE_FLOAT16=OFF -DBLA_VENDOR=OpenBLAS ..
make -j4

# Set threads on Pi for fair runs
export OMP_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4

# Python snippet to measure single-query latency
import faiss, time
index = faiss.read_index('ivfpq.index', faiss.IO_FLAG_MMAP)
query = np.random.rand(1,128).astype('float32')
start = time.time()
D,I = index.search(query, 10)
print('latency ms', (time.time()-start)*1000)

Final recommendations — choose a path based on your constraints

If memory <= 8GB and you want local search: Build IVFPQ on x86 and ship compressed index. Use nprobe small (4–8) and m=8–16.
If latency must be sub-ms for many queries: Keep the hot set small enough for a Flat index in RAM or use a powerful local accelerator (AI HAT) and re-benchmark distance kernels for hardware offload.
If you need frequent online updates: Use Milvus (server-side) on x86 and a local cache on the Pi. Batch updates on server and push diffs to edge when feasible.

Next steps & call to action

If you’re building vector search for constrained devices, start by running the small reproducible tests described here on your own hardware: measure peak memory during IVF training, measure single-query P95 for your query distribution, then decide whether to compress or offload. I’ve posted the benchmark scripts and Dockerfiles used in these experiments for teams to adapt — try them, tweak nlist/nprobe/M, and share results back to the team.

Need help choosing the right index or tuning parameters for your dataset and recall targets? Reach out to the fuzzypoint team for a short consult — we’ll help you design a reproducible benchmark and an actionable deployment plan (edge vs hybrid vs server) tailored to your traffic and hardware.

fuzzypoint

Contributor

Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.