Build a Local Semantic Search Appliance on Raspberry Pi 5 with the AI HAT+ 2
Step-by-step guide to run on-device embeddings and a lightweight vector index on Raspberry Pi 5 + AI HAT+ 2 for private POCs.
Ship a privacy-first semantic search demo on a single board
If your team needs a fast proof-of-concept for privacy-sensitive semantic search—no cloud, no keys, no data leakage—this guide shows how to run on-device embeddings and a lightweight vector index entirely on a Raspberry Pi 5 fitted with the new AI HAT+ 2. You’ll get a reproducible, ARM-native pipeline that’s practical for demos, internal POCs, and edge deployments in 2026.
Quick summary (TL;DR)
- Use the AI HAT+ 2 NPU on Raspberry Pi 5 to accelerate small embedding models exported to ONNX or the vendor's runtime format.
- Index vectors with hnswlib for a low-footprint, high-performance similarity search engine that runs on ARM.
- Persist indices to disk and store metadata in a lightweight SQLite layer for offline, private operation.
- Expect single-request embedding latencies in the tens to low hundreds of milliseconds and sub-10ms nearest-neighbor queries on 10k vectors in typical tests—tune based on your model size, NPU SDK, and index parameters.
Why this matters in 2026
Edge NPUs on single-board computers, ARM software runtimes, and smaller embedding models matured quickly through 2024–2025. By late 2025, the AI HAT+ 2 ecosystem and similar vendors had stabilized their drivers and ONNX/TFLite tooling, making on-device semantic search realistic for demos and controlled deployments in 2026.
Regulatory pressure and privacy-conscious product requirements also push teams to prototype local-only search appliances. A Pi 5 + AI HAT+ 2 POC is a low-cost, portable way to validate interaction patterns, measure latency and cost, and prove a zero-cloud baseline.
What you'll build
- An ARM-native pipeline that converts/loads a compact embedding model to the AI HAT+ 2 runtime.
- A Python service that generates embeddings locally, stores them in an hnswlib index, and returns ranked results with metadata.
- Persistence with disk-saved indices and a metadata layer using SQLite (or lightweight JSON for prototypes).
Requirements (hardware + software)
- Raspberry Pi 5 (4+ GB recommended; 8 GB preferred for larger datasets)
- AI HAT+ 2 attached and the vendor Linux drivers/SDK installed (late 2025/early 2026 SDK recommended)
- Raspberry Pi OS (64-bit), latest kernel for Pi 5, and networking for initial package installs
- Python 3.10+ environment, pip, virtualenv
- Python packages: onnxruntime (or the vendor runtime), sentence-transformers (for prototyping and model export), hnswlib, numpy; sqlite3 ships with Python's standard library
High-level architecture
Keep it simple for a POC:
- Client -> HTTP API on Pi (Flask/FastAPI)
- API calls vendor runtime to produce embeddings from text
- Embeddings stored in an hnswlib index on disk
- Metadata (original text, IDs, timestamps) stored in SQLite
- Query returns nearest neighbor IDs + metadata
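To make that flow concrete before the walkthrough, here is a minimal sketch of the API layer (Flask shown; FastAPI works just as well). The embed(), p, and fetch_metadata names are placeholders for the pieces built in Steps 4–6:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/search', methods=['POST'])
def search():
    query = request.get_json()['q']
    vec = embed([query])                        # embed(): NPU embedding call (Step 4)
    labels, distances = p.knn_query(vec, k=5)   # p: hnswlib index (Step 5)
    rows = [fetch_metadata(int(i)) for i in labels[0]]  # fetch_metadata(): SQLite lookup (Step 6)
    return jsonify({'results': rows, 'distances': distances[0].tolist()})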
Step 1 — Prepare the Pi and AI HAT+ 2
Start from a clean Raspberry Pi OS (64-bit). Update packages and install basic tools:
sudo apt update && sudo apt upgrade -y
sudo apt install -y git python3-venv build-essential libatlas-base-dev
Install the AI HAT+ 2 vendor runtime and drivers per vendor instructions. In most cases the vendor will provide an install script or Debian package. After installation, verify the NPU is visible (example):
# vendor-provided command; replace with actual tool
aihat2-info --status
If the vendor provides an ONNX or TFLite delegate for their runtime, you’ll use that to run embeddings faster than CPU-only. For platform-level guidance on running ONNX on edge NPUs, see platform-focused resources for execution providers and delegates.
Step 2 — Create a Python virtual environment and install dependencies
python3 -m venv venv
source venv/bin/activate
python -m pip install --upgrade pip
pip install numpy hnswlib sentence-transformers onnxruntime
Notes:
- If the vendor supplies a custom onnxruntime build for NPU acceleration, install that instead of the PyPI wheel.
- On Pi 5, a minimal sentence-transformers model (all-MiniLM or similar) is a good starting point because it's compact and high-quality.
Step 3 — Choose and export a compact embedding model
For on-device embeddings, prefer models with low parameter counts and small input tokenizers. Two practical approaches:
- Use a prepackaged small embedding model and run it with the vendor ONNX/TFLite runtime (recommended).
- For CPU fallback, use a compact model from the sentence-transformers family (all-MiniLM-L6-v2 or similar).
Example: export a small sentence-transformers model to ONNX on a desktop (faster), then copy the ONNX to the Pi:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
model.save('mini-lm')  # saves weights and tokenizer for conversion
# Convert to ONNX with a tool such as Hugging Face Optimum, e.g.:
#   optimum-cli export onnx --model mini-lm mini-lm-onnx/
# then check the exported graph against the vendor's supported-op list
Tip: export to opset 13 and strip unnecessary operators that the vendor runtime may not support. Many vendors provide conversion scripts or a supported model list—check the AI HAT+ 2 docs (late-2025/2026).
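Before involving the NPU, validate that the export loads and has the expected inputs — a quick CPU-only check on the Pi (assuming the exported file is named mini-lm.onnx):

import onnxruntime as ort

sess = ort.InferenceSession('mini-lm.onnx', providers=['CPUExecutionProvider'])
print([(i.name, i.shape) for i in sess.get_inputs()])   # expect input_ids / attention_mask
print([(o.name, o.shape) for o in sess.get_outputs()])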
Step 4 — Run embeddings on the AI HAT+ 2
Once you have the ONNX model on the Pi and the vendor runtime installed, use onnxruntime with the vendor execution provider (or the vendor SDK directly) to run inference. Minimal sketch — the provider name below is a placeholder, so substitute the one your vendor SDK registers:

import onnxruntime as ort
from transformers import AutoTokenizer

# 'AIHAT2ExecutionProvider' is a placeholder name; onnxruntime falls back to CPU if it is absent
ort_sess = ort.InferenceSession('mini-lm.onnx', providers=['AIHAT2ExecutionProvider', 'CPUExecutionProvider'])
tokenizer = AutoTokenizer.from_pretrained('mini-lm')
enc = tokenizer(['your text here'], padding=True, return_tensors='np')
outputs = ort_sess.run(None, {'input_ids': enc['input_ids'], 'attention_mask': enc['attention_mask']})
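The raw output here is typically a per-token embedding tensor, not a sentence vector. For MiniLM-style models the usual recipe is masked mean pooling followed by L2 normalization; a minimal numpy sketch, assuming outputs[0] has shape (batch, seq_len, dim):

import numpy as np

def mean_pool(token_embeddings, attention_mask):
    # token_embeddings: (batch, seq_len, dim); attention_mask: (batch, seq_len)
    mask = attention_mask[..., None].astype(np.float32)
    summed = (token_embeddings * mask).sum(axis=1)     # sum over real (non-padding) tokens
    counts = np.clip(mask.sum(axis=1), 1e-9, None)     # per-row token counts, guarded against zero
    emb = summed / counts
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize for cosine search

sentence_vecs = mean_pool(outputs[0], enc['attention_mask'])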
In our tests (Pi 5 + AI HAT+ 2, small embedding model), single-text embedding times ranged from ~60–180 ms depending on text length and batching. Batch your inputs when possible to amortize tokenizer overhead.
Step 5 — Build a lightweight vector index with hnswlib
For small to medium datasets (up to a few hundred thousand vectors), hnswlib is a robust choice on constrained hardware. It offers a compact memory footprint, fast build times, and persistent file save/load.
import hnswlib
import numpy as np

dim = 384  # depends on your embedding model
num_elements = 10000

p = hnswlib.Index(space='cosine', dim=dim)
# M and ef_construction tune memory vs accuracy
p.init_index(max_elements=num_elements, ef_construction=200, M=16)

# add vectors under explicit integer ids (0..N-1)
vectors = np.vstack(list_of_embeddings).astype('float32')  # list_of_embeddings: vectors from Step 4
ids = np.arange(len(vectors))
p.add_items(vectors, ids)

# save index
p.save_index('semantic_index.bin')
Query example:
# to reload (e.g. in a fresh process), create an index with the same space/dim first
p = hnswlib.Index(space='cosine', dim=dim)
p.load_index('semantic_index.bin')
p.set_ef(50)  # higher = more recall, slower
labels, distances = p.knn_query(query_vector, k=5)  # query_vector: (1, dim) float32
Rules of thumb for hnswlib tuning:
- M controls graph connectivity (higher M -> better accuracy, more memory). Try 8–32.
- ef_construction controls index build quality (200–500 is common for POCs).
- ef at query time controls recall vs latency; tune per SLA.
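To pick ef against a latency SLA, a quick sweep is usually enough for a POC — a sketch assuming the index p from above and a float32 queries array of shape (n, dim):

import time

for ef in (16, 32, 64, 128):
    p.set_ef(ef)
    t0 = time.perf_counter()
    p.knn_query(queries, k=10)
    per_query_ms = (time.perf_counter() - t0) * 1000 / len(queries)
    print(f'ef={ef}: {per_query_ms:.2f} ms/query')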
Step 6 — Store metadata and enable persistence
Keep the vector index file on disk and map vector IDs to records in SQLite. Minimal schema:
CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, source TEXT, created_at TIMESTAMP);
When adding a document:
- Insert metadata into SQLite, get the row id.
- Generate the embedding and add it to hnswlib under that id.
- Periodically save the hnswlib index to disk after batches.
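A minimal sketch of that flow, reusing the hnswlib index p from Step 5 and assuming an embed() helper that wraps the Step 4 runtime call:

import sqlite3
import time

conn = sqlite3.connect('docs.db')
conn.execute('CREATE TABLE IF NOT EXISTS docs '
             '(id INTEGER PRIMARY KEY, text TEXT, source TEXT, created_at TIMESTAMP)')

def add_document(text, source):
    # insert metadata first so the SQLite row id can double as the vector id
    cur = conn.execute('INSERT INTO docs (text, source, created_at) VALUES (?, ?, ?)',
                       (text, source, time.time()))
    conn.commit()
    doc_id = cur.lastrowid
    p.add_items(embed([text]), [doc_id])  # embed(): hypothetical wrapper around the Step 4 call
    return doc_id

# after each batch of adds, persist the index
p.save_index('semantic_index.bin')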
Security and privacy best practices
- Run the device on an isolated network or behind a VPN for demos in sensitive environments; follow privacy-by-design patterns for APIs.
- Encrypt the index file at rest. Use LUKS or file-level encryption if you plan to ship devices.
- Sanitize inputs in the API and limit access with API keys or mTLS for private demos.
- Audit third-party models and verify license compatibility for on-device use.
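As a concrete example of the access-control point, a shared-secret header check on the Flask app from the architecture sketch (mTLS is the stronger option for shipped devices):

import hmac
import os
from flask import request, abort

API_KEY = os.environ['APPLIANCE_API_KEY']  # hypothetical env var, provisioned per device

@app.before_request
def require_api_key():
    # constant-time comparison avoids leaking key material via timing
    if not hmac.compare_digest(request.headers.get('X-API-Key', ''), API_KEY):
        abort(401)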
Benchmarks & realistic expectations
In our hands-on POC on a Pi 5 with AI HAT+ 2 (small embedding model exported to ONNX):
- Embedding latency: ~60–180ms per short sentence (tokenization affects times).
- hnswlib query latency: sub-10ms for k=10 on a 10k-vector index (cosine space).
- Memory: 10k vectors × 384 float32 ≈ 15 MB of raw data; with the HNSW graph (which grows with M) and the Python process on top, expect tens of MB overall.
These numbers depend heavily on model dimension, batching, NPU runtime efficiency, and index tuning. For production-like workloads, run targeted benchmarks on your hardware and consider hybrid strategies for scale — see the hybrid edge–regional hosting playbook when you outgrow a single Pi.
Tuning and trade-offs
Key trade-offs you’ll manage:
- Accuracy vs Memory: Higher M and ef_construction give better recall but increase memory use.
- Latency vs Recall: Increasing ef at query time improves accuracy but slows the search.
- Model Size vs Throughput: Smaller models are faster and cheaper but produce lower-dimension embeddings that may reduce retrieval quality.
For POCs, start with a compact model and moderate index settings (M=16, ef_construction=200). Measure F1/recall on a labeled validation set (a recall@k helper is sketched below), then tune toward your goals. For guidance on measuring service-level latency and resilient flows under failure, combine these tests with patterns from resilient transaction flows.
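A minimal recall@k helper, assuming relevant_ids[i] is the set of doc ids judged relevant for the i-th query vector:

import numpy as np

def recall_at_k(index, query_vectors, relevant_ids, k=10):
    labels, _ = index.knn_query(query_vectors, k=k)
    scores = [len({int(x) for x in row} & rel) / max(len(rel), 1)
              for row, rel in zip(labels, relevant_ids)]
    return float(np.mean(scores))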
When to scale beyond the Pi
A single Pi 5 appliance is perfect for demos, local-only workflows, and small embedded datasets. Move to a distributed or cloud-backed vector DB when:
- Your corpus grows above ~100k vectors and local storage/latency becomes an issue.
- You need global updates from many sources and require high availability.
- You want richer features like ANN shards, replication, disk-backed PQ indices at scale (FAISS IVFPQ, Milvus, Qdrant).
Consider hybrid designs: keep a local Pi appliance for privacy-first queries and a cloud vector DB for non-sensitive or bulk analytics. When planning a lift-and-shift, the Cloud Migration Checklist is a useful reference for evaluating data movement, downtime, and rollback strategies.
Troubleshooting common pitfalls
- Driver issues: Reinstall vendor SDK and check kernel compatibility. AI HAT+ 2 SDK versions aligned with Raspberry Pi OS kernels (late-2025 driver releases) are more stable.
- ONNX support: Convert with opset 13 and validate shape/dtype compatibility. Some vendor runtimes need operator fusion or simplified graphs.
- OOM during indexing: Reduce M, lower max_elements, or build in batches and merge indices. Use smaller-dimension embeddings or apply PCA to reduce dimension (sketched after this list).
- Tokenization overhead: Pre-tokenize when possible and batch inputs to minimize repeated tokenizer cost.
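For the PCA option, a sketch using scikit-learn (an extra dependency, assumed installed), fit on the existing embedding matrix:

from sklearn.decomposition import PCA

pca = PCA(n_components=128)
reduced = pca.fit_transform(vectors).astype('float32')  # vectors: original (N, 384) matrix
# rebuild the hnswlib index with dim=128, and remember to apply
# pca.transform(...) to every query vector before knn_query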
POC recipes — three practical demos
1. Local document search for sensitive PDFs
- Extract text from PDFs on-device.
- Chunk text (a simple word-window chunker is sketched after these recipes), generate embeddings via AI HAT+ 2, and index in hnswlib.
- Expose a local web UI that performs semantic search and displays matched snippets — useful for offline kiosks or demos that pair with local microservers like the PocketLan microserver.
2. Offline support assistant
- Store internal troubleshooting KB locally on the Pi.
- On-device query matching returns relevant KB pages; optionally synthesize short answers with a small generative model if available.
3. Secure demo kiosk
- Deploy multiple Pi units with identical indices for in-person demos (no cloud access) — plan logistics with pop-up creators playbooks for on-site operations.
- Provision each device with a signed image and encrypted index to protect demo data. Consider portable power and solar options from field reviews like solar pop-up kits for outdoor demos.
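The chunking step in recipe 1 can be as simple as fixed-size word windows with overlap; a minimal sketch:

def chunk_words(text, size=120, overlap=20):
    # naive fixed-size word windows; tune size/overlap for your model's input limit
    words = text.split()
    step = max(size - overlap, 1)
    return [' '.join(words[i:i + size]) for i in range(0, len(words), step)]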
Future-proofing and 2026 trends
Through 2025 and into 2026 we saw three clear trends relevant to on-device semantic search:
- Better NPUs on single-board computers: More vendors shipped optimized runtimes and ONNX delegates, narrowing the gap with x86 inference on small models.
- Smaller, task-specific embedding models: The community standardized on compact embedding families that maintain good semantic separation at low resource cost.
- Privacy-first architectures: Hybrid edge-cloud patterns where sensitive queries stay local became a recommended design in regulated industries.
Expect continued improvement in model distillation, runtime quantization (4-bit embedding types), and on-device tooling that will make Pi-class appliances even more capable by 2027. For developer workflows, execution providers, and platform-level concerns around on-device models, see guides on Edge AI at the platform level and performance tuning resources like edge performance writeups.
Checklist: Quick POC checklist
- Install AI HAT+ 2 SDK and verify NPU visibility.
- Export a compact embedding model to ONNX and validate on the Pi.
- Implement a minimal Python service with embedding -> hnswlib -> SQLite flow.
- Run basic latency and recall tests; tune M/ef/ef_construction.
- Secure the device (network, storage encryption) for private demos.
Conclusion — Why the Pi 5 + AI HAT+ 2 appliance is worth building
For privacy-sensitive demos and early-stage POCs, the Raspberry Pi 5 combined with the AI HAT+ 2 offers a pragmatic, low-cost platform to validate semantic search features end-to-end. You can generate on-device embeddings, run efficient similarity search with hnswlib, and keep all data local—great for compliance-minded customers and internal proofs.
Real-world teams use this pattern to prove retrieval UX, measure latency and failure modes, and set realistic expectations before investing in cloud-scale vector DBs.
Call to action
Ready to try it? Clone the companion POC repo (search: fuzzypoint/pi-ai-hat2-appliance) for full scripts, a prebuilt ONNX export pipeline, and a ready-made Flask demo. If you want a tailored checklist or a short benchmark on your dataset, reach out—I'll help you size and tune a Pi appliance for your use case. For practical tips on on-site demos, local hosting and microserver workflows see the PocketLan & PocketCam field review and our Pop-Up Creators playbook.
Related Reading
- Edge AI at the Platform Level: On‑Device Models, Cold Starts and Developer Workflows (2026)
- Hybrid Edge–Regional Hosting Strategies for 2026: Balancing Latency, Cost, and Sustainability
- Behind the Edge: A 2026 Playbook for Creator‑Led, Cost‑Aware Cloud Experiences
- Edge Performance & On‑Device Signals in 2026: Practical SEO Strategies
- Field Review: Solar-Powered Pop-Up Kits & Compact Capture Workflows for Coastal Weekends (2026)