Vector Search
Research Date: 2026-02-20
Focus: Large-scale approximate nearest neighbor (ANN) search for face embedding retrieval at 200M+ gallery size
1. Problem Framing
Face matching at 200M+ scale is fundamentally a high-dimensional ANN search problem:
- Embedding dimensionality: 128–512-dim (ArcFace, AdaFace, ElasticFace typically output 512-dim)
- Raw storage for 200M × 512-dim float32: 200M × 512 × 4 bytes = ~400 GB
- Required recall: ≥95% Recall@1 for biometric applications (higher is better)
- Target latency: <100ms p99 for 1:N identification in production
- Throughput: Varies by use case; CCTV analytics may batch; border control needs low latency per query
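The ~400 GB raw-storage figure above is straightforward arithmetic; a quick sanity check (helper name is ours):

```python
# Sanity check for the raw-storage figure in the list above.
def raw_gallery_bytes(n_vectors: int, dim: int, bytes_per_dim: int = 4) -> int:
    """Bytes needed to store uncompressed float32 embeddings."""
    return n_vectors * dim * bytes_per_dim

gb = raw_gallery_bytes(200_000_000, 512) / 1e9
print(f"{gb:.1f} GB")  # 409.6 GB, i.e. the ~400 GB cited above
```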
2. Index Approaches Compared
2.1 FAISS — Facebook AI Similarity Search
Library: Meta / open-source (C++/Python)
Reference: FAISS GitHub | NVIDIA cuVS FAISS integration (May 2025)
IndexIVFFlat
| Metric | Value |
|---|---|
| Memory (200M × 512-dim) | ~400 GB (raw) + inverted list overhead |
| Recall@1 | 95–99% (with nprobe tuning) |
| QPS (single CPU node) | ~500–2,000 depending on nprobe |
| Build time | Hours for 200M |
| Compression | None — stores raw float32 |
- Partitions space into Voronoi cells; at query time scans the nprobe closest cells
- No compression: highest recall but huge RAM requirement (~400 GB+ for 200M)
- Verdict: Impractical at 200M scale without distributed RAM
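The cell-probing idea behind IndexIVFFlat can be sketched in a few lines of NumPy. This is an illustrative toy (our own crude k-means and parameter choices), not the FAISS implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def build_ivf(xb, nlist=16, iters=10):
    """Toy IVF build: k-means coarse quantizer + one inverted list per cell."""
    centroids = xb[rng.choice(len(xb), nlist, replace=False)].copy()
    for _ in range(iters):  # a few Lloyd iterations
        assign = ((xb[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
        for c in range(nlist):
            members = xb[assign == c]
            if len(members):
                centroids[c] = members.mean(0)
    assign = ((xb[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
    return centroids, [np.where(assign == c)[0] for c in range(nlist)]

def ivf_search(xq, xb, centroids, lists, nprobe=4, k=1):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    probe = ((centroids - xq) ** 2).sum(-1).argsort()[:nprobe]
    cand = np.concatenate([lists[c] for c in probe])
    dists = ((xb[cand] - xq) ** 2).sum(-1)
    return cand[dists.argsort()[:k]]

xb = rng.standard_normal((2000, 32)).astype(np.float32)
xq = xb[123] + 0.01 * rng.standard_normal(32).astype(np.float32)
centroids, lists = build_ivf(xb)
print(ivf_search(xq, xb, centroids, lists))  # raising nprobe trades speed for recall
```

The nprobe knob is exactly the recall/QPS tradeoff in the table above: more cells scanned, higher recall, lower throughput.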
IndexIVFPQ (IVF + Product Quantization)
| Metric | Value |
|---|---|
| Memory (200M × 512-dim, M=64, nbits=8) | ~12–25 GB (32–64 bytes/vector) |
| Recall@1 | 85–95% with reranking; ~80–90% without |
| QPS (CPU) | ~2,000–10,000 |
| Compression ratio | 32–64× reduction vs flat |
| Build time | 4–12 hours for 200M on CPU |
- PQ breaks each vector into M sub-vectors, each quantized to a codebook
- For 512-dim with M=64, each sub-vector is 8-dim and is quantized to an 8-bit code (1 byte), giving M × 1 = 64 bytes/vector
- Recall impact: PQ introduces ~5–15% recall loss vs exact search; reranking with raw vectors recovers most
- Verdict: The workhorse for 200M-scale on RAM-constrained hardware
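The sub-vector encoding is easy to see concretely. A minimal NumPy sketch with untrained (randomly initialized) codebooks, just to show the memory arithmetic — real systems k-means-train the codebooks on sample data:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, NBITS = 512, 64, 8        # dims, sub-vectors, bits per code
K, DSUB = 2 ** NBITS, D // M    # 256 codewords per 8-dim sub-space

# Untrained toy codebooks; FAISS/Milvus would fit these to the data.
codebooks = rng.standard_normal((M, K, DSUB)).astype(np.float32)

def pq_encode(x):
    """Replace each 8-dim sub-vector with the index of its nearest codeword."""
    subs = x.reshape(M, DSUB)
    dists = ((codebooks - subs[:, None, :]) ** 2).sum(-1)  # (M, K)
    return dists.argmin(1).astype(np.uint8)

x = rng.standard_normal(D).astype(np.float32)
codes = pq_encode(x)
print(f"{x.nbytes} bytes -> {codes.nbytes} bytes")  # 2048 -> 64 (32x)
```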
FAISS GPU (GpuIndexIVFPQ / GpuIndexIVFFlat)
| Metric | Value |
|---|---|
| GPU speedup vs CPU | 5–20× for search; 4.7–8.1× for IVF with cuVS |
| Build time speedup (cuVS IVF) | Up to 4.7× vs CPU |
| CAGRA build vs HNSW CPU | Up to 12.3× faster build |
| CAGRA search vs CPU HNSW | Up to 4.7× lower latency |
| Batch QPS (GPU CAGRA, 10k batch) | Millions of queries/second |
- GPU FAISS shines for batch workloads (e.g., video analytics with many simultaneous queries)
- Single-query latency benefit is smaller due to GPU launch overhead
- NVIDIA cuVS (RAPIDS) integration in FAISS (2025): significant speedups for both IVF and CAGRA graph indexes
- Verdict: Excellent for batch throughput; less compelling for single-query latency use cases
2.2 HNSW — Hierarchical Navigable Small World
Implementations: hnswlib, FAISS IndexHNSW, Milvus, Qdrant, pgvector
Reference: Milvus IVF vs HNSW guide
| Metric | Value |
|---|---|
| Memory (200M × 512-dim) | ~400 GB raw + 20–40% graph overhead → ~500–600 GB |
| Recall@1 | 97–99.5% |
| QPS (single query, CPU) | ~1,000–5,000 |
| Build time | Very long for 200M (graph construction is O(n log n)) |
| Filtering support | Limited; post-filter or pre-filter approximation |
- Best recall-latency tradeoff for in-memory datasets
- Memory requirement (500–600 GB for 200M) is the main obstacle
- Scales poorly beyond what fits in RAM
- Verdict: Ideal for ≤50M vectors in high-RAM machines; impractical for 200M without quantization or disk offload
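The 500–600 GB estimate follows from raw vectors plus per-node link storage. A rough calculator under assumed parameters (hnswlib-style layer-0 fan-out of 2×M and 4-byte link IDs; actual overhead varies by implementation):

```python
def hnsw_ram_gb(n, dim, m=32, bytes_per_link=4):
    """Rough in-memory HNSW footprint: float32 vectors + layer-0 links.
    Upper layers and bookkeeping add a further few percent on top."""
    vectors = n * dim * 4               # raw float32 embeddings
    links = n * 2 * m * bytes_per_link  # layer 0 keeps up to 2*M neighbors/node
    return (vectors + links) / 1e9

print(f"{hnsw_ram_gb(200_000_000, 512, m=64):.0f} GB")  # ~512 GB at M=64
```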
2.3 DiskANN / Vamana — Microsoft Disk-Based Graph Index
Reference: DiskANN overview | SQL Server 2025 public preview
| Metric | Value |
|---|---|
| Memory (200M vectors) | Small in-RAM index; vectors on SSD |
| Recall@1 | ≥95% on billion-scale datasets |
| Latency (single query) | <5ms at 95% recall@1 |
| RAM requirement | ~64 GB RAM sufficient for 1B vectors |
| Build time | Several hours; GPU acceleration gives 40× speedup |
- Vamana graph: shorter average search path than HNSW and NSG → fewer SSD reads
- In-memory graph indexes top out around 100–200M vectors; DiskANN targets 200M+ by serving vectors from SSD
- SQL Server 2025 integrates DiskANN natively (public preview 2025)
- Rewritten in Rust; stateless orchestrator model integrating with host DB storage
- Verdict: Best option for 200M–1B+ scale when RAM is constrained; low latency (~5ms p99) on SSD
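DiskANN's query path is a best-first graph traversal with a bounded candidate list (each expansion costing one SSD read in the real system). A toy in-memory version of that search loop, over a simple k-NN graph rather than a true Vamana graph:

```python
import heapq
import numpy as np

def greedy_search(graph, xb, xq, start, L=20, k=1):
    """Best-first traversal keeping the L best candidates seen so far;
    this is the search loop shared by Vamana/HNSW-style graph indexes."""
    d0 = float(((xb[start] - xq) ** 2).sum())
    frontier = [(d0, start)]   # min-heap: next candidates to expand
    best = [(-d0, start)]      # max-heap (negated): L closest seen
    visited = {start}
    while frontier:
        d, u = heapq.heappop(frontier)
        if len(best) == L and d > -best[0][0]:
            break              # everything left is farther than our L best
        for v in graph[u]:
            if v not in visited:
                visited.add(v)
                dv = float(((xb[v] - xq) ** 2).sum())
                heapq.heappush(frontier, (dv, v))
                heapq.heappush(best, (-dv, v))
                if len(best) > L:
                    heapq.heappop(best)
    return [v for _, v in sorted((-nd, v) for nd, v in best)][:k]

rng = np.random.default_rng(0)
xb = rng.standard_normal((500, 16)).astype(np.float32)
d2 = ((xb[:, None] - xb[None]) ** 2).sum(-1)
graph = {i: list(d2[i].argsort()[1:13]) for i in range(500)}  # toy 12-NN graph
xq = xb[42] + 0.01 * rng.standard_normal(16).astype(np.float32)
print(greedy_search(graph, xb, xq, start=0))
```

Vamana's contribution is the graph construction (pruned long-range edges shortening this walk), which the toy k-NN graph above does not capture.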
2.4 ScaNN — Google Scalable Nearest Neighbors
Reference: ScaNN announcement | SOAR algorithms | AlloyDB ScaNN (Oct 2024)
| Metric | Value |
|---|---|
| ann-benchmarks.com QPS | ~2× the next-fastest library at same accuracy |
| Memory vs HNSW | 4× smaller footprint (AlloyDB ScaNN vs pgvector HNSW) |
| SOAR track winners (Big-ANN NeurIPS’23) | Highest result in OOD and streaming tracks |
| Adaptive filtering | Yes — runtime selectivity learning (2025 update) |
- ScaNN uses anisotropic quantization: quantization error parallel to the stored vector (which perturbs inner-product scores) is penalized more heavily than orthogonal error
- Integrated into Google AlloyDB as a Postgres vector extension (Oct 2024)
- 2025 update: adaptive filtering — learns filter selectivity at query time
- Verdict: Best-in-class for pure ANN throughput; anisotropic quantization preserves recall better than standard PQ
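The intuition behind the anisotropic loss can be shown directly: decompose the quantization error into components parallel and orthogonal to the stored vector and weight the parallel part more, since that is what shifts inner-product scores for well-matched queries. A toy version of the loss (the weight eta here is an arbitrary illustration; ScaNN derives it from a score threshold):

```python
import numpy as np

def anisotropic_loss(x, x_quant, eta=4.0):
    """Score-aware quantization loss: error along x shifts <query, x>
    scores for queries aligned with x, so it is weighted by eta > 1."""
    err = x_quant - x
    u = x / np.linalg.norm(x)   # unit vector along x
    par = np.dot(err, u) * u    # error component parallel to x
    orth = err - par            # error component orthogonal to x
    return eta * (par ** 2).sum() + (orth ** 2).sum()

x = np.array([1.0, 0.0, 0.0, 0.0])
print(anisotropic_loss(x, x + np.array([0.1, 0.0, 0.0, 0.0])))  # ~0.04: parallel
print(anisotropic_loss(x, x + np.array([0.0, 0.1, 0.0, 0.0])))  # ~0.01: orthogonal
```

Same error magnitude, 4× the loss when it lies along the vector — so the learned codebooks spend their bits where inner-product accuracy needs them.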
2.5 SPANN — Microsoft Hierarchical Inverted + SSD
Reference: SPANN paper | Microsoft Research
| Metric | Value |
|---|---|
| Recall@1 | 90% |
| Recall@10 | 90%+ |
| Latency vs DiskANN | ~2× faster at same recall and memory |
| Memory | Only 10% of original memory cost vs in-memory algorithms |
| Architecture | Centroids in RAM, posting lists on SSD |
- Hierarchical balanced clustering + query-aware posting list pruning
- 2× faster than DiskANN on billion-scale benchmarks at same recall/memory
- Verdict: Strong alternative to DiskANN when minimizing RAM is critical
2.6 CAGRA — NVIDIA GPU Graph-Based ANN (cuVS)
Reference: CAGRA paper (2023) | NVIDIA blog
| Metric | Value |
|---|---|
| Search speedup vs CPU HNSW | 33–77× at 90–95% recall range |
| Single-query speedup vs HNSW | 3.4–53× at 95% recall |
| Build speedup vs CPU HNSW | 2.2–27× |
| Batch QPS (10k batch size) | Millions of queries/second |
| GPU requirement | NVIDIA A100 / H100 class |
- CAGRA builds GPU-optimized graph; outperforms CPU HNSW dramatically in batch
- Integrated with FAISS via cuVS (May 2025 Meta announcement)
- DiskANN Vamana can now be built on GPU with 40× speedup over CPU build
- Verdict: Best for batch analytics (video surveillance, offline identity resolution); not designed for online single-query search
2.7 Vespa — Hybrid HNSW-IF (Inverted File)
Reference: Vespa billion-scale hybrid
| Metric | Value |
|---|---|
| Architecture | HNSW centroids in RAM; non-centroid vectors on disk |
| Memory saving | 10× vs pure in-memory HNSW |
| Recall@10 | Competitive with pure HNSW |
| Filtering | Native support with inverted index |
| Latency | Single-digit ms for 100M centroids |
- Bridges HNSW quality with disk economics
- Production-tested on LAION-5B class datasets
- Verdict: Good for 200M+ with native filtering (e.g., filter by date, region, watchlist group)
2.8 Milvus — Production Vector Database
Reference: Milvus docs | Milvus vs Qdrant
| Metric | Value |
|---|---|
| Max scale | Billions of vectors (distributed) |
| Supported indexes | IVF-Flat, IVF-PQ, HNSW, DiskANN, CAGRA (GPU) |
| Latency (million-scale) | Single-digit ms |
| Filtering | Yes — metadata + vector hybrid |
| Production readiness | High — cloud-native, distributed, CRUD |
- Full production system: CRUD, HA, horizontal scaling, access control
- Supports IVF-PQ, HNSW, DiskANN, and GPU CAGRA backends
- Contrast with raw FAISS: FAISS is a library (no CRUD, no HA, no distribution)
- Verdict: Best managed solution for 200M+ in production if not building custom infrastructure
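As a concrete starting point, a hedged sketch of dict-style Milvus index parameters for this workload — values are illustrative guesses to be tuned on real data, not validated settings:

```python
# Illustrative Milvus index/search parameters for a 200M x 512-dim gallery.
# Starting-point guesses only; nlist/nprobe need tuning against recall targets.
index_params = {
    "index_type": "IVF_PQ",  # alternatives: "HNSW", "DISKANN", GPU indexes
    "metric_type": "IP",     # face embeddings are usually L2-normalized -> inner product
    "params": {
        "nlist": 65536,      # coarse cells; on the order of sqrt(N) and up
        "m": 64,             # 512 dims / 64 sub-vectors = 8-dim sub-spaces
        "nbits": 8,          # 1-byte codes -> 64 bytes/vector
    },
}
search_params = {"metric_type": "IP", "params": {"nprobe": 64}}  # recall knob
```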
2.9 Qdrant — Rust-Based Vector Database
Reference: Qdrant benchmarks
| Metric | Value |
|---|---|
| QPS at 50M, 99% recall | 41.47 QPS (vs pgvectorscale 471 QPS on same data) |
| Architecture | Rust, HNSW-based with scalar/binary quantization |
| Filtering | Excellent — payload-indexed filtering |
| Memory optimization | Scalar quantization (4× reduction); binary quantization (32× reduction) |
- Strong filtering capabilities; Rust implementation → low overhead
- Benchmark shows lower raw QPS vs pgvectorscale at large scale but better filtering flexibility
- Verdict: Good for metadata-rich filtering use cases; less optimized for pure ANN throughput at 200M+
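The binary quantization mentioned above is sign-bit compression plus Hamming-distance prefiltering; the mechanics in a NumPy sketch (our own illustration, not Qdrant's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def binarize(x):
    """Keep only sign bits: 512 float32 dims -> 64 bytes (32x smaller)."""
    return np.packbits(x > 0, axis=-1)

def hamming(codes, qc):
    """Hamming distances between packed bit codes (the cheap prefilter step)."""
    return np.unpackbits(codes ^ qc, axis=-1).sum(-1)

xb = rng.standard_normal((1000, 512)).astype(np.float32)
codes = binarize(xb)                       # (1000, 64) uint8
q = xb[10] + 0.05 * rng.standard_normal(512).astype(np.float32)
# Shortlist by Hamming distance, then rerank the shortlist exactly.
shortlist = hamming(codes, binarize(q)).argsort()[:50]
exact = ((xb[shortlist] - q) ** 2).sum(-1)
best = shortlist[exact.argmin()]
print(best)
```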
3. NeurIPS Big-ANN Benchmark Results (2021–2023)
Reference: Big-ANN-Benchmarks | NeurIPS’23 results
Key findings from the benchmark competition:
- Filtered track winner (NeurIPS’23): ParlayANN — Vamana graphs + spatial inverted indices per tag; 11× faster than baseline
- OOD track joint winners: MysteryANN / RoarANN, PyANNS
- Streaming track: SOAR (ScaNN variant) — highest result
- BANG (2024): Single A100 GPU achieves 50–400× higher throughput than CPU methods at 0.9 recall on billion-scale datasets
4. NIST FRVT / FRTE Context
Reference: NIST FRTE 1:N | NEC April 2025
- NIST tests currently reach 12 million enrolled identities for 1:N identification
- NEC system: 0.07% authentication error rate at 12M scale (April 2025)
- FRVT has evaluated 400+ algorithms on 18M+ images of 8M+ people
- Gap: NIST FRTE does not yet test at 200M+ scale; real-world deployments (national IDs, airport systems) have reached this scale operationally but benchmarks are not public
5. Product Quantization Impact on Face Embeddings
Reference: Milvus PQ guide | HuggingFace embedding quantization
| Quantization Type | Memory Reduction | Recall Impact |
|---|---|---|
| Scalar (int8) | 4× | Negligible |
| Product Quantization | 32–64× | 5–15% recall loss (recoverable with reranking) |
| Binary | 32× | Moderate loss; effective for prefiltering |
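The scalar row is the simplest scheme: one int8 per dimension plus a per-vector scale. A minimal sketch (symmetric quantization; real systems often calibrate the scale from quantiles rather than the max):

```python
import numpy as np

def int8_quantize(x):
    """Symmetric per-vector int8 quantization: 4x smaller than float32."""
    scale = np.abs(x).max() / 127.0
    return np.round(x / scale).astype(np.int8), scale

rng = np.random.default_rng(0)
x = rng.standard_normal(512).astype(np.float32)
q, scale = int8_quantize(x)
x_hat = q.astype(np.float32) * scale  # dequantize to approximate x
print(q.nbytes, x.nbytes)             # 512 vs 2048 bytes
```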
For 512-dim ArcFace embeddings specifically:
- PQ with M=64 subvectors, 8-bit codes: 2,048 bytes (512-dim × 4 bytes) → 64 bytes/vector (32× compression)
- 200M vectors: 400 GB → ~12.8 GB with PQ (fits comfortably in RAM)
- Recall@1 drop: typically 5–10% vs exact; reranking top-k with exact vectors recovers to ~99%
- Recommended pattern: IVF-PQ for ANN search → rerank top-100 candidates with exact distances
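The search-then-rerank pattern in the last bullet looks like this in miniature — a NumPy sketch with small toy dimensions and codebooks sampled from the data rather than k-means-trained; the lookup-table step is the standard asymmetric distance computation (ADC):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, M = 5000, 64, 8   # toy sizes; the text above uses D=512, M=64
K, DSUB = 256, D // M

xb = rng.standard_normal((N, D)).astype(np.float32)
# Toy codebooks sampled from the data (real systems k-means-train them).
codebooks = np.stack([xb[rng.choice(N, K, replace=False), m * DSUB:(m + 1) * DSUB]
                      for m in range(M)])  # (M, K, DSUB)

codes = np.empty((N, M), dtype=np.uint8)   # one byte per sub-vector
for m in range(M):
    sub = xb[:, m * DSUB:(m + 1) * DSUB]
    codes[:, m] = ((sub[:, None] - codebooks[m][None]) ** 2).sum(-1).argmin(1)

def search_with_rerank(q, shortlist=100, k=1):
    """ADC over PQ codes for a cheap shortlist, then exact-distance rerank."""
    qsubs = q.reshape(M, DSUB)
    tables = ((qsubs[:, None] - codebooks) ** 2).sum(-1)    # (M, K) lookup tables
    approx = tables[np.arange(M)[:, None], codes.T].sum(0)  # (N,) ADC distances
    cand = approx.argsort()[:shortlist]
    exact = ((xb[cand] - q) ** 2).sum(-1)                   # rerank with raw vectors
    return cand[exact.argsort()[:k]]

print(search_with_rerank(xb[7]))
```

Note the rerank step needs the raw vectors available (in RAM or on SSD), which is why the compressed index is usually paired with a flat vector store.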
6. Memory Footprint Summary for 200M × 512-dim
| Approach | RAM Required | Disk Required | Notes |
|---|---|---|---|
| IVF-Flat | ~400 GB | — | Exact; impractical on single node |
| IVF-PQ (M=64) | ~12–25 GB | — | Highly compressed; 200M fits easily |
| HNSW (full in-memory) | ~550–600 GB | — | Impractical; needs 4× RAM server |
| DiskANN | ~50–100 GB | ~400 GB SSD | 95%+ recall; SSD latency |
| SPANN | ~40–60 GB | ~400 GB SSD | 2× faster than DiskANN |
| Vespa HNSW-IF | ~50–100 GB | ~400 GB SSD | Good filtering |
7. GPU vs CPU Decision Matrix
| Workload Type | Recommended Approach | Why |
|---|---|---|
| Batch (10k+ simultaneous queries) | CAGRA / GPU IVF-PQ | 33–400× throughput advantage |
| Online low-latency (single query <10ms) | DiskANN / SPANN / CPU HNSW | GPU launch overhead hurts single-query |
| Massive gallery (>500M) with filtering | DiskANN + IVF hybrid | Only disk-based can handle at this scale |
| Cost-sensitive cloud deployment | IVF-PQ on CPU | Cheap; good recall with reranking |
| Highest throughput, GPU budget | CAGRA (A100/H100) | Millions of QPS in batch |
8. Production Readiness Rankings
| System | Production Readiness | CRUD | HA/Distribution | Filtering | GPU |
|---|---|---|---|---|---|
| Milvus | High | Yes | Yes | Yes | Yes (CAGRA) |
| Qdrant | High | Yes | Yes | Excellent | No |
| Vespa | High | Yes | Yes | Excellent | No |
| FAISS | Low (library) | No | No | No | Yes |
| DiskANN | Medium | Partial | Via host DB | Via host DB | Build only |
| SPANN | Low (research) | No | No | No | No |
| ScaNN | Medium | Via AlloyDB | Via AlloyDB | Yes (adaptive) | No |
9. UFME Recommendations
For UFME-specific recommendations based on this research, see the Executive Summary.
10. Key References
- FAISS GitHub Wiki — Indexing 1M vectors
- NVIDIA cuVS GPU acceleration in FAISS (Meta, May 2025)
- CAGRA paper — Highly Parallel GPU Graph ANN
- DiskANN overview — Harsha Simhadri
- SPANN — Microsoft billion-scale ANN
- ScaNN — Google efficient vector similarity search
- SOAR — ScaNN Big-ANN NeurIPS’23 winner
- Vespa HNSW-IF hybrid billion-scale
- Big-ANN-Benchmarks NeurIPS’23
- BANG — Billion-scale ANN on single GPU
- Qdrant benchmarks
- NIST FRTE 1:N Identification
- HuggingFace embedding quantization guide
- AlloyDB ScaNN vs pgvector HNSW (Oct 2024)
- SQL Server 2025 DiskANN public preview