Vector Search

Research Date: 2026-02-20
Focus: Large-scale approximate nearest neighbor (ANN) search for face embedding retrieval at 200M+ gallery size


1. Problem Statement

Face matching at 200M+ scale is fundamentally a high-dimensional ANN search problem:

  • Embedding dimensionality: 128–512-dim (ArcFace, AdaFace, ElasticFace typically output 512-dim)
  • Raw storage for 200M × 512-dim float32: 200M × 512 × 4 bytes = ~400 GB
  • Required recall: ≥95% Recall@1 for biometric applications (higher is better)
  • Target latency: <100ms p99 for 1:N identification in production
  • Throughput: Varies by use case; CCTV analytics may batch; border control needs low latency per query
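The storage figures above follow directly from the arithmetic (a quick sanity-check sketch; GB here means decimal gigabytes):

```python
# Back-of-envelope memory math for a 200M-identity gallery.
GALLERY = 200_000_000   # enrolled vectors
DIM = 512               # ArcFace-style embedding dimensionality
FLOAT32 = 4             # bytes per component

raw_gb = GALLERY * DIM * FLOAT32 / 1e9
print(f"raw float32 gallery: {raw_gb:.1f} GB")   # 409.6 GB, i.e. the ~400 GB figure

# IVF-PQ at 64 bytes/vector (M=64 sub-quantizers, 8-bit codes), covered below:
pq_gb = GALLERY * 64 / 1e9
print(f"IVF-PQ codes: {pq_gb:.1f} GB")           # 12.8 GB
```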

2.1 FAISS — Facebook AI Similarity Search

Library: Meta / open-source (C++/Python)
Reference: FAISS GitHub | NVIDIA cuVS FAISS integration (May 2025)

FAISS IVF-Flat (IndexIVFFlat)

| Metric | Value |
| --- | --- |
| Memory (200M × 512-dim) | ~400 GB (raw) + inverted list overhead |
| Recall@1 | 95–99% (with nprobe tuning) |
| QPS (single CPU node) | ~500–2,000 depending on nprobe |
| Build time | Hours for 200M |
| Compression | None — stores raw float32 |

  • Partitions space into Voronoi cells; at query time scans nprobe cells
  • No compression: highest recall but huge RAM requirement (~400 GB+ for 200M)
  • Verdict: Impractical at 200M scale without distributed RAM
FAISS IVF-PQ (IndexIVFPQ)

| Metric | Value |
| --- | --- |
| Memory (200M × 512-dim, M=64, nbits=8) | ~12–25 GB (32–64 bytes/vector) |
| Recall@1 | 85–95% with reranking; ~80–90% without |
| QPS (CPU) | ~2,000–10,000 |
| Compression ratio | 32–64× reduction vs flat |
| Build time | 4–12 hours for 200M on CPU |

  • PQ breaks each vector into M sub-vectors, each quantized to a codebook
  • For 512-dim with M=64, each sub-vector is 8-dim and is quantized to an 8-bit code → 64 one-byte codes = 64 bytes/vector
  • Recall impact: PQ introduces ~5–15% recall loss vs exact search; reranking with raw vectors recovers most
  • Verdict: The workhorse for 200M-scale on RAM-constrained hardware
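The PQ mechanics above can be sketched in a few lines of numpy (illustrative only: toy codebooks built from random training sub-vectors stand in for the k-means training that real PQ implementations perform):

```python
import numpy as np

# Minimal product-quantization sketch: split each 512-dim vector into
# M=64 sub-vectors of 8 dims, quantize each sub-vector to one of 256
# codebook entries, keeping one uint8 code per sub-vector.
rng = np.random.default_rng(0)
DIM, M, K = 512, 64, 256
SUB = DIM // M  # 8 dims per sub-vector

train = rng.standard_normal((2000, DIM)).astype(np.float32)
# Toy codebooks: 256 training sub-vectors per block (real PQ runs k-means).
codebooks = train[:K].reshape(K, M, SUB).transpose(1, 0, 2)  # (M, K, SUB)

def pq_encode(x):
    """Return one uint8 code per sub-vector -> 64 bytes per vector."""
    subs = x.reshape(M, SUB)
    dists = ((codebooks - subs[:, None, :]) ** 2).sum(-1)  # (M, K)
    return dists.argmin(axis=1).astype(np.uint8)

def pq_decode(codes):
    """Reassemble the nearest-codeword approximation of the vector."""
    return codebooks[np.arange(M), codes].reshape(DIM)

x = rng.standard_normal(DIM).astype(np.float32)
codes = pq_encode(x)
assert codes.nbytes == 64   # 32x smaller than the 2048-byte float32 vector
approx = pq_decode(codes)
```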

FAISS GPU (GpuIndexIVFPQ / GpuIndexIVFFlat)

| Metric | Value |
| --- | --- |
| GPU speedup vs CPU | 5–20× for search; 4.7–8.1× for IVF with cuVS |
| Build time speedup (cuVS IVF) | Up to 4.7× vs CPU |
| CAGRA build vs HNSW CPU | Up to 12.3× faster build |
| CAGRA search vs CPU HNSW | Up to 4.7× lower latency |
| Batch QPS (GPU CAGRA, 10k batch) | Millions of queries/second |

  • GPU FAISS shines for batch workloads (e.g., video analytics with many simultaneous queries)
  • Single-query latency benefit is smaller due to GPU launch overhead
  • NVIDIA cuVS (RAPIDS) integration in FAISS (2025): significant speedups for both IVF and CAGRA graph indexes
  • Verdict: Excellent for batch throughput; less compelling for single-query latency use cases
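Why batching matters can be seen with a toy latency model (the overhead and per-query costs below are assumed illustrative values, not measurements):

```python
# Fixed per-call GPU cost (kernel launch, host<->device transfer) is
# amortized across the batch; marginal per-query cost is tiny.
OVERHEAD_MS = 0.5        # assumed fixed cost per search call
PER_QUERY_US = 2.0       # assumed marginal GPU cost per query

def batch_latency_ms(batch_size):
    return OVERHEAD_MS + batch_size * PER_QUERY_US / 1000

for b in (1, 100, 10_000):
    total = batch_latency_ms(b)
    print(f"batch={b:>6}: {total:7.2f} ms total, {total / b * 1000:7.2f} us/query")
```

At batch size 1 the fixed overhead dominates; at 10k the per-query cost falls by orders of magnitude, which is the batch-throughput effect described above.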

2.2 HNSW — Hierarchical Navigable Small World


Implementations: hnswlib, FAISS IndexHNSW, Milvus, Qdrant, pgvector
Reference: Milvus IVF vs HNSW guide

| Metric | Value |
| --- | --- |
| Memory (200M × 512-dim) | ~400 GB raw + 20–40% graph overhead → ~500–600 GB |
| Recall@1 | 97–99.5% |
| QPS (single query, CPU) | ~1,000–5,000 |
| Build time | Very long for 200M (graph construction is O(n log n)) |
| Filtering support | Limited; post-filter or pre-filter approximation |

  • Best recall-latency tradeoff for in-memory datasets
  • Memory requirement (500–600 GB for 200M) is the main obstacle
  • Scales poorly beyond what fits in RAM
  • Verdict: Ideal for ≤50M vectors in high-RAM machines; impractical for 200M without quantization or disk offload
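A rough model of where the graph overhead comes from (the connectivity parameter is an assumed hnswlib-style value; real builds add upper layers and allocator overhead on top):

```python
# Each HNSW element stores its raw vector plus neighbor lists.
GALLERY, DIM = 200_000_000, 512
vector_bytes = DIM * 4              # 2048 B of float32 per vector
M = 32                              # assumed hnswlib-style connectivity
link_bytes = 2 * M * 4              # layer 0 keeps 2*M uint32 neighbor ids
total_gb = GALLERY * (vector_bytes + link_bytes) / 1e9
print(f"~{total_gb:.0f} GB before upper layers and allocator overhead")
```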

2.3 DiskANN / Vamana — Microsoft Disk-Based Graph Index


Reference: DiskANN overview | SQL Server 2025 public preview

| Metric | Value |
| --- | --- |
| Memory (200M vectors) | Small in-RAM index; vectors on SSD |
| Recall@1 | ≥95% on billion-scale datasets |
| Latency (single query) | <5 ms at 95% recall@1 |
| RAM requirement | ~64 GB RAM sufficient for 1B vectors |
| Build time | Several hours; GPU acceleration gives 40× speedup |

  • Vamana graph: shorter average search path than HNSW and NSG → fewer SSD reads
  • In-memory graph indexes top out around 100–200M vectors; DiskANN is designed for galleries beyond that, keeping the vectors on SSD
  • SQL Server 2025 integrates DiskANN natively (public preview 2025)
  • The library has been rewritten in Rust, with a stateless orchestrator model that integrates with the host database’s storage
  • Verdict: Best option for 200M–1B+ scale when RAM is constrained; low latency (~5ms p99) on SSD
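The core access pattern is greedy best-first traversal of a pruned graph. A minimal in-RAM sketch (random toy graph, not a real Vamana build) shows why a shorter average search path translates into fewer SSD reads:

```python
import numpy as np

# Greedy best-first graph search, the core of Vamana/DiskANN-style
# indexes. In the real system each node's neighbor list lives on SSD,
# so every hop in the loop below would be one SSD read.
rng = np.random.default_rng(0)
N, DIM, DEG = 5000, 64, 16
points = rng.standard_normal((N, DIM)).astype(np.float32)
graph = rng.integers(0, N, size=(N, DEG))   # toy neighbor lists

def greedy_search(query, start=0, max_hops=64):
    """Walk to the neighbor closest to the query until no neighbor improves."""
    best = start
    best_d = np.linalg.norm(points[start] - query)
    hops, improved = 0, True
    while improved and hops < max_hops:
        improved = False
        for nb in graph[best]:              # one "SSD read" per hop
            d = np.linalg.norm(points[nb] - query)
            if d < best_d:
                best, best_d, improved = nb, d, True
        hops += 1
    return best, hops

q = rng.standard_normal(DIM).astype(np.float32)
node, hops = greedy_search(q)
```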

2.4 ScaNN — Google Scalable Nearest Neighbors


Reference: ScaNN announcement | SOAR algorithms | AlloyDB ScaNN (Oct 2024)

| Metric | Value |
| --- | --- |
| ann-benchmarks.com QPS | ~2× the next-fastest library at the same accuracy |
| Memory vs HNSW | 4× smaller footprint (AlloyDB ScaNN vs pgvector HNSW) |
| SOAR track wins (Big-ANN NeurIPS’23) | Highest result in OOD and streaming tracks |
| Adaptive filtering | Yes — runtime selectivity learning (2025 update) |

  • ScaNN uses anisotropic quantization: quantization error parallel to the data point (the component that distorts inner products with queries) is penalized more heavily than orthogonal error
  • Integrated into Google AlloyDB as a Postgres vector extension (Oct 2024)
  • 2025 update: adaptive filtering — learns filter selectivity at query time
  • Verdict: Best-in-class for pure ANN throughput; anisotropic quantization preserves recall better than standard PQ
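The intuition behind the anisotropic (score-aware) loss can be sketched in numpy: the quantization residual splits into components parallel and orthogonal to the data point, and the parallel part gets a higher weight (the weight eta below is an assumed illustrative value, not ScaNN's tuned setting):

```python
import numpy as np

# Decompose a quantization residual into parallel/orthogonal parts
# relative to the data point, then weight them differently.
rng = np.random.default_rng(0)
x = rng.standard_normal(128).astype(np.float32)
x_hat = x + 0.1 * rng.standard_normal(128).astype(np.float32)  # toy quantized vector

r = x - x_hat
unit = x / np.linalg.norm(x)
r_par = (r @ unit) * unit           # residual along x: perturbs inner products
r_orth = r - r_par                  # residual orthogonal to x

eta = 4.0  # assumed weight on the parallel error
aniso_loss = eta * (r_par @ r_par) + (r_orth @ r_orth)
iso_loss = r @ r                    # plain squared error treats both equally
```

Standard PQ minimizes the isotropic loss; minimizing the anisotropic one instead trades orthogonal accuracy for better inner-product (and thus recall) preservation.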

2.5 SPANN — Microsoft Hierarchical Inverted + SSD


Reference: SPANN paper | Microsoft Research

| Metric | Value |
| --- | --- |
| Recall@1 | 90% |
| Recall@10 | 90%+ |
| Latency vs DiskANN | ~2× faster at the same recall and memory |
| Memory | Only ~10% of the memory cost of in-memory algorithms |
| Architecture | Centroids in RAM, posting lists on SSD |

  • Hierarchical balanced clustering + query-aware posting list pruning
  • 2× faster than DiskANN on billion-scale benchmarks at same recall/memory
  • Verdict: Strong alternative to DiskANN when minimizing RAM is critical
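The query-aware pruning rule can be sketched directly (eps is an assumed slack parameter; real SPANN also performs the hierarchical balanced clustering and boundary-vector replication this toy skips):

```python
import numpy as np

# SPANN-style query-aware pruning: centroids live in RAM; at query time
# only posting lists whose centroid distance is within a (1 + eps)
# factor of the closest centroid are fetched from SSD.
rng = np.random.default_rng(0)
centroids = rng.standard_normal((1000, 128)).astype(np.float32)
query = rng.standard_normal(128).astype(np.float32)

d = np.linalg.norm(centroids - query, axis=1)
eps = 0.1  # assumed pruning slack
selected = np.flatnonzero(d <= d.min() * (1 + eps))
# Only the `selected` posting lists would be read from SSD.
```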

2.6 CAGRA — NVIDIA GPU Graph-Based ANN (cuVS)


Reference: CAGRA paper (2023) | NVIDIA blog

| Metric | Value |
| --- | --- |
| Search speedup vs CPU HNSW | 33–77× in the 90–95% recall range |
| Single-query speedup vs HNSW | 3.4–53× at 95% recall |
| Build speedup vs CPU HNSW | 2.2–27× |
| Batch QPS (10k batch size) | Millions of queries/second |
| GPU requirement | NVIDIA A100 / H100 class |

  • CAGRA builds GPU-optimized graph; outperforms CPU HNSW dramatically in batch
  • Integrated with FAISS via cuVS (May 2025 Meta announcement)
  • DiskANN Vamana can now be built on GPU with 40× speedup over CPU build
  • Verdict: Best for batch analytics (video surveillance, offline identity resolution); not designed for online single-query search

2.7 Vespa — Hybrid HNSW-IF (Inverted File)


Reference: Vespa billion-scale hybrid

| Metric | Value |
| --- | --- |
| Architecture | HNSW over centroids in RAM; non-centroid vectors on disk |
| Memory saving | 10× vs pure in-memory HNSW |
| Recall@10 | Competitive with pure HNSW |
| Filtering | Native support via inverted index |
| Latency | Single-digit ms for 100M centroids |

  • Bridges HNSW quality with disk economics
  • Production-tested on LAION-5B class datasets
  • Verdict: Good for 200M+ with native filtering (e.g., filter by date, region, watchlist group)

2.8 Milvus

Reference: Milvus docs | Milvus vs Qdrant

| Metric | Value |
| --- | --- |
| Max scale | Billions of vectors (distributed) |
| Supported indexes | IVF-Flat, IVF-PQ, HNSW, DiskANN, CAGRA (GPU) |
| Latency (million-scale) | Single-digit ms |
| Filtering | Yes — metadata + vector hybrid |
| Production readiness | High — cloud-native, distributed, CRUD |

  • Full production system: CRUD, HA, horizontal scaling, access control
  • Supports IVF-PQ, HNSW, DiskANN, and GPU CAGRA backends
  • Contrast with raw FAISS: FAISS is a library (no CRUD, no HA, no distribution)
  • Verdict: Best managed solution for 200M+ in production if not building custom infrastructure

2.9 Qdrant

Reference: Qdrant benchmarks

| Metric | Value |
| --- | --- |
| QPS at 50M, 99% recall | 41.47 QPS (vs 471 QPS for pgvectorscale on the same data) |
| Architecture | Rust, HNSW-based with scalar/binary quantization |
| Filtering | Excellent — payload-indexed filtering |
| Memory optimization | Scalar quantization (4× reduction); binary quantization (32× reduction) |

  • Strong filtering capabilities; Rust implementation → low overhead
  • Benchmark shows lower raw QPS vs pgvectorscale at large scale but better filtering flexibility
  • Verdict: Good for metadata-rich filtering use cases; less optimized for pure ANN throughput at 200M+
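The two quantization modes mentioned above reduce to a few lines of numpy (a simplified sketch with per-vector symmetric scaling, not Qdrant's exact scheme):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(512).astype(np.float32)

# Scalar int8: 4 bytes -> 1 byte per component (4x reduction).
scale = np.abs(x).max() / 127.0
q8 = np.round(x / scale).astype(np.int8)
assert q8.nbytes == 512             # vs 2048 bytes for float32
# Dequantization error is bounded by half a quantization step.
assert np.abs(q8.astype(np.float32) * scale - x).max() <= scale * 0.51

# Binary: 1 bit per component (32x reduction); candidates are compared
# with Hamming distance and typically reranked with full vectors.
bits = np.packbits((x > 0).astype(np.uint8))
assert bits.nbytes == 64            # 512 bits
```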

3. NeurIPS Big-ANN Benchmark Results (2021–2023)


Reference: Big-ANN-Benchmarks | NeurIPS’23 results

Key findings from the benchmark competition:

  • Filtered track winner (NeurIPS’23): ParlayANN — Vamana graphs + spatial inverted indices per tag; 11× faster than baseline
  • OOD track joint winners: MysteryANN / RoarANN, PyANNS
  • Streaming track: SOAR (ScaNN variant) — highest result
  • BANG (2024): Single A100 GPU achieves 50–400× higher throughput than CPU methods at 0.9 recall on billion-scale datasets

4. Face Recognition Benchmarks at Scale (NIST FRTE)

Reference: NIST FRTE 1:N | NEC April 2025

  • NIST tests currently reach 12 million enrolled identities for 1:N identification
  • NEC system: 0.07% authentication error rate at 12M scale (April 2025)
  • FRVT has evaluated 400+ algorithms on 18M+ images of 8M+ people
  • Gap: NIST FRTE does not yet test at 200M+ scale; real-world deployments (national IDs, airport systems) have reached this scale operationally but benchmarks are not public

5. Product Quantization Impact on Face Embeddings


Reference: Milvus PQ guide | HuggingFace embedding quantization

| Quantization Type | Memory Reduction | Recall Impact |
| --- | --- | --- |
| Scalar (int8) | 4× (75% mem reduction) | Negligible |
| Product Quantization | 32–64× | 5–15% recall loss (recoverable with reranking) |
| Binary | 32× | Moderate loss; effective for prefiltering |

For 512-dim ArcFace embeddings specifically:

  • PQ with M=64 subvectors, 8-bit codes: 512-dim × 4 bytes → 64 bytes/vector (32× compression)
  • 200M vectors: 400 GB → ~12.5 GB with PQ (fits comfortably in RAM)
  • Recall@1 drop: typically 5–10% vs exact; reranking top-k with exact vectors recovers to ~99%
  • Recommended pattern: IVF-PQ for ANN search → rerank top-100 candidates with exact distances
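The recommended two-stage pattern, sketched end to end in numpy (Gaussian noise added to exact distances stands in for PQ approximation error; a real pipeline would use IVF-PQ for the first stage):

```python
import numpy as np

rng = np.random.default_rng(0)
gallery = rng.standard_normal((20_000, 512)).astype(np.float32)
query = rng.standard_normal(512).astype(np.float32)

exact = np.linalg.norm(gallery - query, axis=1)
# Stand-in for PQ-approximate distances: exact distances plus noise.
approx = exact + 0.05 * rng.standard_normal(exact.shape).astype(np.float32)

candidates = np.argsort(approx)[:100]            # ANN stage: top-100 pool
reranked = candidates[np.argsort(exact[candidates])]  # exact-distance rerank
best = reranked[0]                               # final 1:N match
```

The rerank stage only needs raw vectors for the 100 candidates, so it adds negligible compute while recovering most of the recall the compressed first stage loses.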

6. Memory Footprint Summary for 200M × 512-dim

| Approach | RAM Required | Disk Required | Notes |
| --- | --- | --- | --- |
| IVF-Flat | ~400 GB | — | Exact; impractical on single node |
| IVF-PQ (M=64) | ~12–25 GB | — | Highly compressed; 200M fits easily |
| HNSW (full in-memory) | ~550–600 GB | — | Impractical; needs 4× RAM server |
| DiskANN | ~50–100 GB | ~400 GB SSD | 95%+ recall; SSD latency |
| SPANN | ~40–60 GB | ~400 GB SSD | 2× faster than DiskANN |
| Vespa HNSW-IF | ~50–100 GB | ~400 GB SSD | Good filtering |

7. Recommendations by Workload

| Workload Type | Recommended Approach | Why |
| --- | --- | --- |
| Batch (10k+ simultaneous queries) | CAGRA / GPU IVF-PQ | 33–400× throughput advantage |
| Online low-latency (single query, <10 ms) | DiskANN / SPANN / CPU HNSW | GPU launch overhead hurts single-query latency |
| Massive gallery (>500M) with filtering | DiskANN + IVF hybrid | Only disk-based indexes handle this scale |
| Cost-sensitive cloud deployment | IVF-PQ on CPU | Cheap; good recall with reranking |
| Highest throughput, GPU budget | CAGRA (A100/H100) | Millions of QPS in batch |

8. System Comparison

| System | Production Readiness | CRUD | HA/Distribution | Filtering | GPU |
| --- | --- | --- | --- | --- | --- |
| Milvus | High | Yes | Yes | Yes | Yes (CAGRA) |
| Qdrant | High | Yes | Yes | Excellent | No |
| Vespa | High | Yes | Yes | Excellent | No |
| FAISS | Low (library) | No | No | No | Yes |
| DiskANN | Medium | Partial | Via host DB | Via host DB | Build only |
| SPANN | Low (research) | No | No | No | No |
| ScaNN | Medium | Via AlloyDB | Via AlloyDB | Yes (adaptive) | No |

For UFME-specific recommendations based on this research, see Executive Summary.