Benchmark Results
Recognition accuracy
The primary recognition model (w600k_r50, ArcFace ResNet-50 trained on WebFace600K) is validated against standard face verification benchmarks. All evaluations use the standard protocol with L2-normalised 512-dim embeddings and cosine similarity.
| Benchmark | Score | Protocol |
|---|---|---|
| LFW (Labeled Faces in the Wild) | 99.83% | 6,000 pairs, 10-fold CV |
| CFP-FP (Celebrities in Frontal-Profile) | 99.26% | 7,000 pairs, frontal vs. profile |
| AgeDB-30 (age-separated pairs) | 98.10% | 6,000 pairs, 10-fold CV |
| CALFW (cross-age) | 96.12% | 6,000 pairs, 10-fold CV |
| CPLFW (cross-pose) | 94.45% | 6,000 pairs, 10-fold CV |
These results place w600k_r50 among the top-tier openly available ArcFace checkpoints. The model is production-proven in the InsightFace ecosystem with known ONNX export compatibility.
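The comparison step of this protocol is small enough to sketch. The following is a minimal illustration, assuming embeddings arrive as plain Python float sequences. The function names and the `threshold` parameter are hypothetical, and because the source does not state an operating threshold, none is hard-coded.

```python
import math
from typing import List, Sequence


def l2_normalise(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, computed as the dot product of the unit vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return sum(x * y for x, y in zip(l2_normalise(a), l2_normalise(b)))


def is_match(a: Sequence[float], b: Sequence[float], threshold: float) -> bool:
    """Verification decision; `threshold` is a caller-supplied assumption."""
    return cosine_similarity(a, b) >= threshold
```

Because both vectors are normalised first, the score does not depend on embedding magnitude, which is why the protocol can compare embeddings with a single dot product.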
Mask-aware recognition
The optional w600k_mbf (ArcFace MobileFaceNet) variant shares the same embedding space and training set. It offers better accuracy on occluded/masked faces with a 13× smaller model footprint (13.6 MB vs 174 MB) at a small accuracy trade-off on unoccluded benchmarks.
Detection accuracy
| Benchmark | Score | Model |
|---|---|---|
| WiderFace Easy | 95.2% AP | SCRFD_10G |
| WiderFace Medium | 93.8% AP | SCRFD_10G |
| WiderFace Hard | 92.3% AP | SCRFD_10G |
WiderFace Hard is the most relevant benchmark — it covers small, occluded, and partially visible faces typical of surveillance captures. At 92.3% AP, SCRFD_10G outperforms RetinaFace (~88%) and YOLOv8-face (~90%) on this dataset.
Quality assessment
The inline quality model (eDifFIQA Tiny, MobileFaceNet backbone) is evaluated by its utility: the degree to which its quality ranking correlates with recognition error rate.
The full-size eDifFIQA(L) ranks #1 on the NIST FATE-Quality Kiosk-to-Entry benchmark. The Tiny variant trades a small accuracy margin for a 2 MB model vs 170 MB, making it practical for inline quality gating. Quality scores are normalised to [0, 1]; the default acceptance threshold is 0.40.
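Inline gating against the default threshold could look like the sketch below. The function name is hypothetical, and treating a score exactly equal to the threshold as accepted is an assumption, since the source does not say whether the comparison is inclusive.

```python
DEFAULT_QUALITY_THRESHOLD = 0.40  # default acceptance threshold from the text


def passes_quality_gate(score: float,
                        threshold: float = DEFAULT_QUALITY_THRESHOLD) -> bool:
    """Return True when a normalised quality score clears the gate.

    Assumption: a score exactly equal to the threshold is accepted.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("quality score must lie in [0, 1]")
    return score >= threshold
```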
Anti-spoofing (PAD)
| Protocol | ACER | Model |
|---|---|---|
| OULU-NPU Protocol 1 | < 2% | MiniFASNetV2 |
| OULU-NPU Protocol 4 | ~1.2% | MiniFASNetV2 |
MiniFASNetV2 is ISO 30107-3 Level 1 compliant for print and replay attacks. The 3-class output (real / 2D-spoof / 3D-spoof) allows distinguishing attack type in the response payload.
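For readers unfamiliar with the metric, ACER is the mean of APCER (attacks accepted as real) and BPCER (real faces rejected). The sketch below assumes the common convention of taking the worst-case APCER across attack types; the source does not say which variant was used, and the label strings are hypothetical stand-ins for the three output classes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

REAL = "real"  # hypothetical label; attack labels could be "2d_spoof", "3d_spoof"


def acer(samples: Iterable[Tuple[str, str]]) -> float:
    """Compute ACER from (true_label, predicted_label) pairs."""
    bona_total = 0
    bona_rejected = 0
    attack_total: Dict[str, int] = defaultdict(int)
    attack_accepted: Dict[str, int] = defaultdict(int)

    for truth, predicted in samples:
        if truth == REAL:
            bona_total += 1
            if predicted != REAL:
                bona_rejected += 1
        else:
            attack_total[truth] += 1
            if predicted == REAL:
                attack_accepted[truth] += 1

    if bona_total == 0 or not attack_total:
        raise ValueError("need at least one bona fide and one attack sample")

    bpcer = bona_rejected / bona_total
    # Worst-case APCER over attack types (assumed convention).
    apcer = max(attack_accepted[kind] / n for kind, n in attack_total.items())
    return (apcer + bpcer) / 2
```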
Morphing attack detection (MAD)
| Benchmark | Score | Model |
|---|---|---|
| FRGC-Morph D-EER | < 5% | SelfMAD HRNet-W18 |
SelfMAD HRNet-W18 is SOTA for single-image, reference-free morphing attack detection. D-EER (Detection Equal Error Rate) below 5% on FRGC-Morph is the current best published result for this setting.
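D-EER is the operating point at which the two detection error rates are equal. A minimal way to approximate it from raw detector scores is sketched below; it assumes that a higher score means "more likely a morph", which the source does not specify, and the function name is hypothetical.

```python
from typing import Sequence


def detection_eer(bona_fide_scores: Sequence[float],
                  morph_scores: Sequence[float]) -> float:
    """Approximate D-EER by sweeping thresholds over the observed scores.

    Assumption: a sample is flagged as a morph when score >= threshold.
    """
    if not bona_fide_scores or not morph_scores:
        raise ValueError("both score lists must be non-empty")

    thresholds = sorted(set(bona_fide_scores) | set(morph_scores))
    thresholds.append(float("inf"))

    best_gap, best_eer = float("inf"), 1.0
    for t in thresholds:
        # Morphs that slip through as bona fide.
        apcer = sum(s < t for s in morph_scores) / len(morph_scores)
        # Bona fide images wrongly flagged as morphs.
        bpcer = sum(s >= t for s in bona_fide_scores) / len(bona_fide_scores)
        gap = abs(apcer - bpcer)
        if gap < best_gap:
            best_gap, best_eer = gap, (apcer + bpcer) / 2
    return best_eer
```

With finite samples the two error curves rarely cross exactly, so the sketch returns the mean of the two rates at the threshold where they are closest.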
200M face search — design targets
UFME is designed for the following production-scale targets. These are the KPIs against which the system is evaluated.
| KPI | Target | Notes |
|---|---|---|
| Gallery size | 200M vectors | 5 shards × 40M vectors, IndexIVFPQ |
| Annual search throughput | 60M 1:N searches | ~1.9 searches/second sustained |
| End-to-end latency (P95) | < 1 s | Probe ingress to XML response |
| FAISS scatter-gather latency (P95) | < 200 ms | Per-shard gRPC deadline = 200 ms |
| Recall@1 after PQ → rerank | ≥ 97% | PQ candidates reranked with exact vectors |
| Storage per shard | ~2.56 GB index | 40M × 64 B PQ (32× compression from 2,048 B) |
| Rerank candidate pool | 100 | top-50 per shard × 5 shards = 250 → deduplicated to top-100 |
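The scatter-gather and candidate-pool rows can be illustrated with a short asyncio sketch. It is not the UFME implementation: the shard clients are hypothetical async callables standing in for gRPC stubs, deduplication by gallery ID is an assumption, and so is the choice to let a shard that misses its deadline contribute no candidates.

```python
import asyncio
import heapq
from typing import Awaitable, Callable, Dict, Iterable, List, Sequence, Tuple

Candidate = Tuple[str, float]  # (gallery_id, similarity); higher is better
ShardClient = Callable[[Sequence[float], int], Awaitable[List[Candidate]]]

SHARD_DEADLINE_S = 0.200  # per-shard deadline from the KPI table
PER_SHARD_TOP_K = 50
RERANK_POOL_SIZE = 100


def merge_candidates(per_shard: Iterable[List[Candidate]],
                     pool_size: int = RERANK_POOL_SIZE) -> List[Candidate]:
    """Deduplicate by gallery ID (keeping the best score) and keep the top pool."""
    best: Dict[str, float] = {}
    for shard_results in per_shard:
        for gallery_id, score in shard_results:
            if gallery_id not in best or score > best[gallery_id]:
                best[gallery_id] = score
    return heapq.nlargest(pool_size, best.items(), key=lambda item: item[1])


async def scatter_gather(probe: Sequence[float],
                         shards: Sequence[ShardClient],
                         deadline_s: float = SHARD_DEADLINE_S) -> List[Candidate]:
    """Query every shard concurrently and merge whatever returns in time."""

    async def query(shard: ShardClient) -> List[Candidate]:
        try:
            return await asyncio.wait_for(shard(probe, PER_SHARD_TOP_K),
                                          timeout=deadline_s)
        except asyncio.TimeoutError:
            return []  # assumption: a late shard contributes nothing

    per_shard = await asyncio.gather(*(query(s) for s in shards))
    return merge_candidates(per_shard)
```

The merged pool is what would then be reranked against the exact stored vectors.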
Index compression
Each 512-dim float32 embedding is 2,048 bytes uncompressed. IndexIVFPQ with M=64 sub-vectors and 8-bit codes compresses to 64 bytes per vector, a 32× reduction. At 40M vectors per shard, the compressed index fits in ~2.56 GB RAM.
Exact stored vectors for reranking add ~82 GB per shard (40M × 2,048 B). The production shard spec (n2-highmem-16 on GCP, 128 GB RAM) accommodates both the compressed index and the reranking vectors in memory.
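The sizing arithmetic above can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch using decimal gigabytes to match the figures in the text; it counts only PQ codes and raw vectors, and ignores IVF list IDs, coarse centroids, and process overhead. The function and key names are illustrative.

```python
DIM = 512
FLOAT32_BYTES = 4
PQ_SUBVECTORS = 64        # M in IndexIVFPQ
PQ_BITS_PER_CODE = 8
VECTORS_PER_SHARD = 40_000_000
SHARD_RAM_GB = 128
GB = 1_000_000_000        # decimal gigabytes, as used in the text


def shard_memory_budget() -> dict:
    """Return per-vector sizes and per-shard memory estimates."""
    raw_bytes = DIM * FLOAT32_BYTES                       # 2,048 B
    code_bytes = PQ_SUBVECTORS * PQ_BITS_PER_CODE // 8    # 64 B
    pq_index_gb = VECTORS_PER_SHARD * code_bytes / GB
    rerank_gb = VECTORS_PER_SHARD * raw_bytes / GB
    return {
        "raw_bytes": raw_bytes,
        "code_bytes": code_bytes,
        "compression_ratio": raw_bytes // code_bytes,
        "pq_index_gb": pq_index_gb,
        "rerank_gb": rerank_gb,
        "ram_headroom_gb": SHARD_RAM_GB - pq_index_gb - rerank_gb,
    }
```

Under these assumptions the two structures total about 84.5 GB, leaving roughly 43 GB of the 128 GB shard for everything else.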
GCP 200M benchmark results
Benchmark executed on 5 × n2-highmem-16 VMs (128 GB RAM each) in GCP europe-west2-a. Each shard holds 40M vectors loaded from a 5.18M MS1MV3 base tiled 39×. IVF training: 16,384 Voronoi cells. Configuration: nprobe=96, top_k=10, concurrency=8, 1,000 queries, partition=gallery-a.
| Metric | Result | Design Target |
|---|---|---|
| QPS | 238.6 | ≥ 1.9 (60M/yr) |
| p50 latency | 31.1 ms | — |
| p95 latency | 46.8 ms | < 1,000 ms |
| p99 latency | 63.6 ms | — |
| Recall@1 (1,000 queries) | 96.8% | ≥ 97% |
| Errors | 0 / 1,000 | 0 |
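All five shards served without errors. Measured throughput exceeds the design target by roughly 125× (238.6 vs 1.9 QPS), and P95 latency beats the 1,000 ms target by roughly 21× (46.8 ms). Recall@1 of 96.8% is stable across nprobe values from 32 to 256. It sits 0.2 percentage points below the 97% design target, a gap that is within the sampling error of a 1,000-query measurement (standard error of about 0.6 percentage points).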
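The headroom ratios and the sampling uncertainty on recall can be checked directly from the table values. The sketch below uses a simple binomial standard error, which assumes the 1,000 queries are independent; the helper names are illustrative.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600


def sustained_rate(searches_per_year: float) -> float:
    """Average searches per second needed to serve an annual volume."""
    return searches_per_year / SECONDS_PER_YEAR


def recall_standard_error(recall: float, n_queries: int) -> float:
    """Binomial standard error of a recall estimate from n independent queries."""
    return math.sqrt(recall * (1.0 - recall) / n_queries)


target_qps = sustained_rate(60_000_000)       # about 1.9 searches/second
throughput_headroom = 238.6 / target_qps      # measured QPS vs target
latency_headroom = 1000.0 / 46.8              # target P95 vs measured P95 (ms)
recall_se = recall_standard_error(0.968, 1000)

print(f"target QPS:          {target_qps:.2f}")
print(f"throughput headroom: {throughput_headroom:.0f}x")
print(f"latency headroom:    {latency_headroom:.1f}x")
print(f"recall std. error:   {recall_se * 100:.2f} percentage points")
```

A larger query set would narrow the standard error and show more clearly whether the 97% recall target is met.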
Benchmark infrastructure
```
scripts/
  generate_200m_benchmark.py     # Tiles 5.18M MS1MV3 vectors 39× to produce 200M
  generate_shard_benchmark.py    # Per-shard generator with full vectors.bin (runs on GCP VM)
  run_benchmark.py               # Async gRPC scatter-gather runner, POSTs to benchmark UI
  measure_recall_200m.py         # Recall@1 validation against deterministic ground truth
  nprobe_sweep.py                # nprobe parameter sweep (32/64/96/128/256)
benchmark-ui/                    # Cloudflare Worker + vanilla JS results dashboard
docs/benchmark-deployment-plan.md  # GCP provisioning guide (5× n2-highmem-16 + orchestrator)
```

To reproduce the benchmark (requires GCP access):

```sh
# Deploy 5 shard VMs + orchestrator
make gcp-deploy

# Generate 200M vectors on each shard (runs remotely, ~45 min per shard)
# See docs/benchmark-deployment-plan.md for the full step-by-step procedure.

# Run benchmark
make gcp-benchmark

# Download results
make gcp-results

# Tear down all VMs
make gcp-teardown
```