Vector Search
Research Date: 2026-02-20
Focus: Large-scale approximate nearest neighbor (ANN) search for face embedding retrieval at 200M+ gallery size
1. Problem Framing
Face matching at 200M+ scale is fundamentally a high-dimensional ANN search problem:
- Embedding dimensionality: 128–512-dim (ArcFace, AdaFace, ElasticFace typically output 512-dim)
- Raw storage for 200M × 512-dim float32: 200M × 512 × 4 bytes = ~400 GB
- Required recall: ≥95% Recall@1 for biometric applications (higher is better)
- Target latency: <100ms p99 for 1:N identification in production
- Throughput: Varies by use case; CCTV analytics may batch; border control needs low latency per query
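The ~400 GB raw-storage figure above is straightforward arithmetic; a quick sanity check (helper name is ours):

```python
# Sanity check for the raw-storage figure in the list above.
def raw_gallery_bytes(n_vectors: int, dim: int, bytes_per_dim: int = 4) -> int:
    """Bytes needed to store uncompressed float32 embeddings."""
    return n_vectors * dim * bytes_per_dim

gb = raw_gallery_bytes(200_000_000, 512) / 1e9
print(f"{gb:.1f} GB")  # 409.6 GB, i.e. the ~400 GB cited above
```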
2. Index Approaches Compared
2.1 FAISS — Facebook AI Similarity Search
Library: Meta / open-source (C++/Python)
Reference: FAISS GitHub | NVIDIA cuVS FAISS integration (May 2025)
IndexIVFFlat
| Metric | Value |
|---|---|
| Memory (200M × 512-dim) | ~400 GB (raw) + inverted list overhead |
| Recall@1 | 95–99% (with nprobe tuning) |
| QPS (single CPU node) | ~500–2,000 depending on nprobe |
| Build time | Hours for 200M |
| Compression | None — stores raw float32 |
- Partitions space into Voronoi cells; at query time scans the nprobe closest cells
- No compression: highest recall but huge RAM requirement (~400 GB+ for 200M)
- Verdict: Impractical at 200M scale without distributed RAM
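The cell-probing idea behind IndexIVFFlat can be sketched in a few lines of NumPy. This is an illustrative toy (our own crude k-means and parameter choices), not the FAISS implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def build_ivf(xb, nlist=16, iters=10):
    """Toy IVF build: k-means coarse quantizer + one inverted list per cell."""
    centroids = xb[rng.choice(len(xb), nlist, replace=False)].copy()
    for _ in range(iters):  # a few Lloyd iterations
        assign = ((xb[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
        for c in range(nlist):
            members = xb[assign == c]
            if len(members):
                centroids[c] = members.mean(0)
    assign = ((xb[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
    return centroids, [np.where(assign == c)[0] for c in range(nlist)]

def ivf_search(xq, xb, centroids, lists, nprobe=4, k=1):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    probe = ((centroids - xq) ** 2).sum(-1).argsort()[:nprobe]
    cand = np.concatenate([lists[c] for c in probe])
    dists = ((xb[cand] - xq) ** 2).sum(-1)
    return cand[dists.argsort()[:k]]

xb = rng.standard_normal((2000, 32)).astype(np.float32)
xq = xb[123] + 0.01 * rng.standard_normal(32).astype(np.float32)
centroids, lists = build_ivf(xb)
print(ivf_search(xq, xb, centroids, lists))  # raising nprobe trades speed for recall
```

The nprobe knob is exactly the recall/QPS tradeoff in the table above: more cells scanned, higher recall, lower throughput.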
IndexIVFPQ (IVF + Product Quantization)
| Metric | Value |
|---|---|
| Memory (200M × 512-dim, M=64, nbits=8) | ~12–25 GB (32–64 bytes/vector) |
| Recall@1 | 85–95% with reranking; ~80–90% without |
| QPS (CPU) | ~2,000–10,000 |
| Compression ratio | 32–64× reduction vs flat |
| Build time | 4–12 hours for 200M on CPU |
- PQ breaks each vector into M sub-vectors, each quantized to a codebook
- For 512-dim with M=64, each sub-vector is 8-dim and is quantized to an 8-bit code (1 byte), giving M × 1 = 64 bytes/vector
- Recall impact: PQ introduces ~5–15% recall loss vs exact search; reranking with raw vectors recovers most
- Verdict: The workhorse for 200M-scale on RAM-constrained hardware
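The sub-vector encoding is easy to see concretely. A minimal NumPy sketch with untrained (randomly initialized) codebooks, just to show the memory arithmetic — real systems k-means-train the codebooks on sample data:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, NBITS = 512, 64, 8        # dims, sub-vectors, bits per code
K, DSUB = 2 ** NBITS, D // M    # 256 codewords per 8-dim sub-space

# Untrained toy codebooks; FAISS/Milvus would fit these to the data.
codebooks = rng.standard_normal((M, K, DSUB)).astype(np.float32)

def pq_encode(x):
    """Replace each 8-dim sub-vector with the index of its nearest codeword."""
    subs = x.reshape(M, DSUB)
    dists = ((codebooks - subs[:, None, :]) ** 2).sum(-1)  # (M, K)
    return dists.argmin(1).astype(np.uint8)

x = rng.standard_normal(D).astype(np.float32)
codes = pq_encode(x)
print(f"{x.nbytes} bytes -> {codes.nbytes} bytes")  # 2048 -> 64 (32x)
```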
FAISS GPU (GpuIndexIVFPQ / GpuIndexIVFFlat)
| Metric | Value |
|---|---|
| GPU speedup vs CPU | 5–20× for search; 4.7–8.1× for IVF with cuVS |
| Build time speedup (cuVS IVF) | Up to 4.7× vs CPU |
| CAGRA build vs HNSW CPU | Up to 12.3× faster build |
| CAGRA search vs CPU HNSW | Up to 4.7× lower latency |
| Batch QPS (GPU CAGRA, 10k batch) | Millions of queries/second |
- GPU FAISS shines for batch workloads (e.g., video analytics with many simultaneous queries)
- Single-query latency benefit is smaller due to GPU launch overhead
- NVIDIA cuVS (RAPIDS) integration in FAISS (2025): significant speedups for both IVF and CAGRA graph indexes
- Verdict: Excellent for batch throughput; less compelling for single-query latency use cases
2.2 HNSW — Hierarchical Navigable Small World
Implementations: hnswlib, FAISS IndexHNSW, Milvus, Qdrant, pgvector
Reference: Milvus IVF vs HNSW guide
| Metric | Value |
|---|---|
| Memory (200M × 512-dim) | ~400 GB raw + 20–40% graph overhead → ~500–600 GB |
| Recall@1 | 97–99.5% |
| QPS (single query, CPU) | ~1,000–5,000 |
| Build time | Very long for 200M (graph construction is O(n log n)) |
| Filtering support | Limited; post-filter or pre-filter approximation |
- Best recall-latency tradeoff for in-memory datasets
- Memory requirement (500–600 GB for 200M) is the main obstacle
- Scales poorly beyond what fits in RAM
- Verdict: Ideal for ≤50M vectors in high-RAM machines; impractical for 200M without quantization or disk offload
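The 500–600 GB estimate follows from raw vectors plus per-node link storage. A rough calculator under assumed parameters (hnswlib-style layer-0 fan-out of 2×M and 4-byte link IDs; actual overhead varies by implementation):

```python
def hnsw_ram_gb(n, dim, m=32, bytes_per_link=4):
    """Rough in-memory HNSW footprint: float32 vectors + layer-0 links.
    Upper layers and bookkeeping add a further few percent on top."""
    vectors = n * dim * 4               # raw float32 embeddings
    links = n * 2 * m * bytes_per_link  # layer 0 keeps up to 2*M neighbors/node
    return (vectors + links) / 1e9

print(f"{hnsw_ram_gb(200_000_000, 512, m=64):.0f} GB")  # ~512 GB at M=64
```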
2.3 DiskANN / Vamana — Microsoft Disk-Based Graph Index
Reference: DiskANN overview | SQL Server 2025 public preview
| Metric | Value |
|---|---|
| Memory (200M vectors) | Small in-RAM index; vectors on SSD |
| Recall@1 | ≥95% on billion-scale datasets |
| Latency (single query) | <5ms at 95% recall@1 |
| RAM requirement | ~64 GB RAM sufficient for 1B vectors |
| Build time | Several hours; GPU acceleration gives 40× speedup |
- Vamana graph: shorter average search path than HNSW and NSG → fewer SSD reads
- In-memory graph indexes top out around 100–200M vectors; DiskANN targets 200M+ by serving vectors from SSD
- SQL Server 2025 integrates DiskANN natively (public preview 2025)
- Rewritten in Rust; stateless orchestrator model integrating with host DB storage
- Verdict: Best option for 200M–1B+ scale when RAM is constrained; low latency (~5ms p99) on SSD
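DiskANN's query path is a best-first graph traversal with a bounded candidate list (each expansion costing one SSD read in the real system). A toy in-memory version of that search loop, over a simple k-NN graph rather than a true Vamana graph:

```python
import heapq
import numpy as np

def greedy_search(graph, xb, xq, start, L=20, k=1):
    """Best-first traversal keeping the L best candidates seen so far;
    this is the search loop shared by Vamana/HNSW-style graph indexes."""
    d0 = float(((xb[start] - xq) ** 2).sum())
    frontier = [(d0, start)]   # min-heap: next candidates to expand
    best = [(-d0, start)]      # max-heap (negated): L closest seen
    visited = {start}
    while frontier:
        d, u = heapq.heappop(frontier)
        if len(best) == L and d > -best[0][0]:
            break              # everything left is farther than our L best
        for v in graph[u]:
            if v not in visited:
                visited.add(v)
                dv = float(((xb[v] - xq) ** 2).sum())
                heapq.heappush(frontier, (dv, v))
                heapq.heappush(best, (-dv, v))
                if len(best) > L:
                    heapq.heappop(best)
    return [v for _, v in sorted((-nd, v) for nd, v in best)][:k]

rng = np.random.default_rng(0)
xb = rng.standard_normal((500, 16)).astype(np.float32)
d2 = ((xb[:, None] - xb[None]) ** 2).sum(-1)
graph = {i: list(d2[i].argsort()[1:13]) for i in range(500)}  # toy 12-NN graph
xq = xb[42] + 0.01 * rng.standard_normal(16).astype(np.float32)
print(greedy_search(graph, xb, xq, start=0))
```

Vamana's contribution is the graph construction (pruned long-range edges shortening this walk), which the toy k-NN graph above does not capture.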
2.4 ScaNN — Google Scalable Nearest Neighbors
Reference: ScaNN announcement | SOAR algorithms | AlloyDB ScaNN (Oct 2024)
| Metric | Value |
|---|---|
| ann-benchmarks.com QPS | ~2× the next-fastest library at same accuracy |
| Memory vs HNSW | 4× smaller footprint (AlloyDB ScaNN vs pgvector HNSW) |
| SOAR track winners (Big-ANN NeurIPS’23) | Highest result in OOD and streaming tracks |
| Adaptive filtering | Yes — runtime selectivity learning (2025 update) |
- ScaNN uses anisotropic quantization: quantization error parallel to the stored vector (which perturbs inner-product scores) is penalized more heavily than orthogonal error
- Integrated into Google AlloyDB as a Postgres vector extension (Oct 2024)
- 2025 update: adaptive filtering — learns filter selectivity at query time
- Verdict: Best-in-class for pure ANN throughput; anisotropic quantization preserves recall better than standard PQ
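The intuition behind the anisotropic loss can be shown directly: decompose the quantization error into components parallel and orthogonal to the stored vector and weight the parallel part more, since that is what shifts inner-product scores for well-matched queries. A toy version of the loss (the weight eta here is an arbitrary illustration; ScaNN derives it from a score threshold):

```python
import numpy as np

def anisotropic_loss(x, x_quant, eta=4.0):
    """Score-aware quantization loss: error along x shifts <query, x>
    scores for queries aligned with x, so it is weighted by eta > 1."""
    err = x_quant - x
    u = x / np.linalg.norm(x)   # unit vector along x
    par = np.dot(err, u) * u    # error component parallel to x
    orth = err - par            # error component orthogonal to x
    return eta * (par ** 2).sum() + (orth ** 2).sum()

x = np.array([1.0, 0.0, 0.0, 0.0])
print(anisotropic_loss(x, x + np.array([0.1, 0.0, 0.0, 0.0])))  # ~0.04: parallel
print(anisotropic_loss(x, x + np.array([0.0, 0.1, 0.0, 0.0])))  # ~0.01: orthogonal
```

Same error magnitude, 4× the loss when it lies along the vector — so the learned codebooks spend their bits where inner-product accuracy needs them.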
2.5 SPANN — Microsoft Hierarchical Inverted + SSD
Reference: SPANN paper | Microsoft Research
| Metric | Value |
|---|---|
| Recall@1 | 90% |
| Recall@10 | 90%+ |
| Latency vs DiskANN | ~2× faster at same recall and memory |
| Memory | Only 10% of original memory cost vs in-memory algorithms |
| Architecture | Centroids in RAM, posting lists on SSD |
- Hierarchical balanced clustering + query-aware posting list pruning
- 2× faster than DiskANN on billion-scale benchmarks at same recall/memory
- Verdict: Strong alternative to DiskANN when minimizing RAM is critical
2.6 CAGRA — NVIDIA GPU Graph-Based ANN (cuVS)
Reference: CAGRA paper (2023) | NVIDIA blog
| Metric | Value |
|---|---|
| Search speedup vs CPU HNSW | 33–77× at 90–95% recall range |
| Single-query speedup vs HNSW | 3.4–53× at 95% recall |
| Build speedup vs CPU HNSW | 2.2–27× |
| Batch QPS (10k batch size) | Millions of queries/second |
| GPU requirement | NVIDIA A100 / H100 class |
- CAGRA builds GPU-optimized graph; outperforms CPU HNSW dramatically in batch
- Integrated with FAISS via cuVS (May 2025 Meta announcement)
- DiskANN Vamana can now be built on GPU with 40× speedup over CPU build
- Verdict: Best for batch analytics (video surveillance, offline identity resolution); not designed for online single-query search
2.7 Vespa — Hybrid HNSW-IF (Inverted File)
Reference: Vespa billion-scale hybrid
| Metric | Value |
|---|---|
| Architecture | HNSW centroids in RAM; non-centroid vectors on disk |
| Memory saving | 10× vs pure in-memory HNSW |
| Recall@10 | Competitive with pure HNSW |
| Filtering | Native support with inverted index |
| Latency | Single-digit ms for 100M centroids |
- Bridges HNSW quality with disk economics
- Production-tested on LAION-5B class datasets
- Verdict: Good for 200M+ with native filtering (e.g., filter by date, region, watchlist group)
2.8 Milvus — Production Vector Database
Reference: Milvus docs | Milvus vs Qdrant
| Metric | Value |
|---|---|
| Max scale | Billions of vectors (distributed) |
| Supported indexes | IVF-Flat, IVF-PQ, HNSW, DiskANN, CAGRA (GPU) |
| Latency (million-scale) | Single-digit ms |
| Filtering | Yes — metadata + vector hybrid |
| Production readiness | High — cloud-native, distributed, CRUD |
- Full production system: CRUD, HA, horizontal scaling, access control
- Supports IVF-PQ, HNSW, DiskANN, and GPU CAGRA backends
- Contrast with raw FAISS: FAISS is a library (no CRUD, no HA, no distribution)
- Verdict: Best managed solution for 200M+ in production if not building custom infrastructure
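As a concrete starting point, a hedged sketch of dict-style Milvus index parameters for this workload — values are illustrative guesses to be tuned on real data, not validated settings:

```python
# Illustrative Milvus index/search parameters for a 200M x 512-dim gallery.
# Starting-point guesses only; nlist/nprobe need tuning against recall targets.
index_params = {
    "index_type": "IVF_PQ",  # alternatives: "HNSW", "DISKANN", GPU indexes
    "metric_type": "IP",     # face embeddings are usually L2-normalized -> inner product
    "params": {
        "nlist": 65536,      # coarse cells; on the order of sqrt(N) and up
        "m": 64,             # 512 dims / 64 sub-vectors = 8-dim sub-spaces
        "nbits": 8,          # 1-byte codes -> 64 bytes/vector
    },
}
search_params = {"metric_type": "IP", "params": {"nprobe": 64}}  # recall knob
```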
2.9 Qdrant — Rust-Based Vector Database
Reference: Qdrant benchmarks
| Metric | Value |
|---|---|
| QPS at 50M, 99% recall | 41.47 QPS (vs pgvectorscale 471 QPS on same data) |
| Architecture | Rust, HNSW-based with scalar/binary quantization |
| Filtering | Excellent — payload-indexed filtering |
| Memory optimization | Scalar quantization (4× reduction); binary quantization (32× reduction) |
- Strong filtering capabilities; Rust implementation → low overhead
- Benchmark shows lower raw QPS vs pgvectorscale at large scale but better filtering flexibility
- Verdict: Good for metadata-rich filtering use cases; less optimized for pure ANN throughput at 200M+
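The binary quantization mentioned above is sign-bit compression plus Hamming-distance prefiltering; the mechanics in a NumPy sketch (our own illustration, not Qdrant's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def binarize(x):
    """Keep only sign bits: 512 float32 dims -> 64 bytes (32x smaller)."""
    return np.packbits(x > 0, axis=-1)

def hamming(codes, qc):
    """Hamming distances between packed bit codes (the cheap prefilter step)."""
    return np.unpackbits(codes ^ qc, axis=-1).sum(-1)

xb = rng.standard_normal((1000, 512)).astype(np.float32)
codes = binarize(xb)                       # (1000, 64) uint8
q = xb[10] + 0.05 * rng.standard_normal(512).astype(np.float32)
# Shortlist by Hamming distance, then rerank the shortlist exactly.
shortlist = hamming(codes, binarize(q)).argsort()[:50]
exact = ((xb[shortlist] - q) ** 2).sum(-1)
best = shortlist[exact.argmin()]
print(best)
```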
3. NeurIPS Big-ANN Benchmark Results (2021–2023)
Reference: Big-ANN-Benchmarks | NeurIPS’23 results
Key findings from the benchmark competition:
- Filtered track winner (NeurIPS’23): ParlayANN — Vamana graphs + spatial inverted indices per tag; 11× faster than baseline
- OOD track joint winners: MysteryANN / RoarANN, PyANNS
- Streaming track: SOAR (ScaNN variant) — highest result
- BANG (2024): Single A100 GPU achieves 50–400× higher throughput than CPU methods at 0.9 recall on billion-scale datasets
4. NIST FRVT / FRTE Context
Reference: NIST FRTE 1:N | NEC April 2025
- NIST tests currently reach 12 million enrolled identities for 1:N identification
- NEC system: 0.07% authentication error rate at 12M scale (April 2025)
- FRVT has evaluated 400+ algorithms on 18M+ images of 8M+ people
- Gap: NIST FRTE does not yet test at 200M+ scale; real-world deployments (national IDs, airport systems) have reached this scale operationally but benchmarks are not public
5. Product Quantization Impact on Face Embeddings
Reference: Milvus PQ guide | HuggingFace embedding quantization
| Quantization Type | Memory Reduction | Recall Impact |
|---|---|---|
| Scalar (int8) | 4× | Negligible |
| Product Quantization | 32–64× | 5–15% recall loss (recoverable with reranking) |
| Binary | 32× | Moderate loss; effective for prefiltering |
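The scalar row is the simplest scheme: one int8 per dimension plus a per-vector scale. A minimal sketch (symmetric quantization; real systems often calibrate the scale from quantiles rather than the max):

```python
import numpy as np

def int8_quantize(x):
    """Symmetric per-vector int8 quantization: 4x smaller than float32."""
    scale = np.abs(x).max() / 127.0
    return np.round(x / scale).astype(np.int8), scale

rng = np.random.default_rng(0)
x = rng.standard_normal(512).astype(np.float32)
q, scale = int8_quantize(x)
x_hat = q.astype(np.float32) * scale  # dequantize to approximate x
print(q.nbytes, x.nbytes)             # 512 vs 2048 bytes
```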
For 512-dim ArcFace embeddings specifically:
- PQ with M=64 subvectors, 8-bit codes: 2,048 bytes (512-dim × 4 bytes) → 64 bytes/vector (32× compression)
- 200M vectors: 400 GB → ~12.8 GB with PQ (fits comfortably in RAM)
- Recall@1 drop: typically 5–10% vs exact; reranking top-k with exact vectors recovers to ~99%
- Recommended pattern: IVF-PQ for ANN search → rerank top-100 candidates with exact distances
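The search-then-rerank pattern in the last bullet looks like this in miniature — a NumPy sketch with small toy dimensions and codebooks sampled from the data rather than k-means-trained; the lookup-table step is the standard asymmetric distance computation (ADC):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, M = 5000, 64, 8   # toy sizes; the text above uses D=512, M=64
K, DSUB = 256, D // M

xb = rng.standard_normal((N, D)).astype(np.float32)
# Toy codebooks sampled from the data (real systems k-means-train them).
codebooks = np.stack([xb[rng.choice(N, K, replace=False), m * DSUB:(m + 1) * DSUB]
                      for m in range(M)])  # (M, K, DSUB)

codes = np.empty((N, M), dtype=np.uint8)   # one byte per sub-vector
for m in range(M):
    sub = xb[:, m * DSUB:(m + 1) * DSUB]
    codes[:, m] = ((sub[:, None] - codebooks[m][None]) ** 2).sum(-1).argmin(1)

def search_with_rerank(q, shortlist=100, k=1):
    """ADC over PQ codes for a cheap shortlist, then exact-distance rerank."""
    qsubs = q.reshape(M, DSUB)
    tables = ((qsubs[:, None] - codebooks) ** 2).sum(-1)    # (M, K) lookup tables
    approx = tables[np.arange(M)[:, None], codes.T].sum(0)  # (N,) ADC distances
    cand = approx.argsort()[:shortlist]
    exact = ((xb[cand] - q) ** 2).sum(-1)                   # rerank with raw vectors
    return cand[exact.argsort()[:k]]

print(search_with_rerank(xb[7]))
```

Note the rerank step needs the raw vectors available (in RAM or on SSD), which is why the compressed index is usually paired with a flat vector store.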
6. Memory Footprint Summary for 200M × 512-dim
| Approach | RAM Required | Disk Required | Notes |
|---|---|---|---|
| IVF-Flat | ~400 GB | — | Exact; impractical on single node |
| IVF-PQ (M=64) | ~12–25 GB | — | Highly compressed; 200M fits easily |
| HNSW (full in-memory) | ~550–600 GB | — | Impractical; needs 4× RAM server |
| DiskANN | ~50–100 GB | ~400 GB SSD | 95%+ recall; SSD latency |
| SPANN | ~40–60 GB | ~400 GB SSD | 2× faster than DiskANN |
| Vespa HNSW-IF | ~50–100 GB | ~400 GB SSD | Good filtering |
7. GPU vs CPU Decision Matrix
| Workload Type | Recommended Approach | Why |
|---|---|---|
| Batch (10k+ simultaneous queries) | CAGRA / GPU IVF-PQ | 33–400× throughput advantage |
| Online low-latency (single query <10ms) | DiskANN / SPANN / CPU HNSW | GPU launch overhead hurts single-query |
| Massive gallery (>500M) with filtering | DiskANN + IVF hybrid | Only disk-based can handle at this scale |
| Cost-sensitive cloud deployment | IVF-PQ on CPU | Cheap; good recall with reranking |
| Highest throughput, GPU budget | CAGRA (A100/H100) | Millions of QPS in batch |
8. Production Readiness Rankings
| System | Production Readiness | CRUD | HA/Distribution | Filtering | GPU |
|---|---|---|---|---|---|
| Milvus | High | Yes | Yes | Yes | Yes (CAGRA) |
| Qdrant | High | Yes | Yes | Excellent | No |
| Vespa | High | Yes | Yes | Excellent | No |
| FAISS | Low (library) | No | No | No | Yes |
| DiskANN | Medium | Partial | Via host DB | Via host DB | Build only |
| SPANN | Low (research) | No | No | No | No |
| ScaNN | Medium | Via AlloyDB | Via AlloyDB | Yes (adaptive) | No |
9. UFME Recommendations
For UFME-specific recommendations based on this research, see the Executive Summary.
10. Key References
- FAISS GitHub Wiki — Indexing 1M vectors
- NVIDIA cuVS GPU acceleration in FAISS (Meta, May 2025)
- CAGRA paper — Highly Parallel GPU Graph ANN
- DiskANN overview — Harsha Simhadri
- SPANN — Microsoft billion-scale ANN
- ScaNN — Google efficient vector similarity search
- SOAR — ScaNN Big-ANN NeurIPS’23 winner
- Vespa HNSW-IF hybrid billion-scale
- Big-ANN-Benchmarks NeurIPS’23
- BANG — Billion-scale ANN on single GPU
- Qdrant benchmarks
- NIST FRTE 1:N Identification
- HuggingFace embedding quantization guide
- AlloyDB ScaNN vs pgvector HNSW (Oct 2024)
- SQL Server 2025 DiskANN public preview