Benchmark Results

Recognition accuracy

The primary recognition model (w600k_r50, ArcFace ResNet-50 trained on WebFace600K) is validated against standard face verification benchmarks. All evaluations use the standard protocol with L2-normalised 512-dim embeddings and cosine similarity.
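
As a minimal sketch of the comparison step described above, the following shows cosine similarity over L2-normalised embeddings using only the standard library. The function names and the `threshold` parameter are illustrative assumptions, not part of the project's API; in the benchmark protocols the decision threshold is normally selected per fold rather than fixed.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity; for unit vectors this is just the dot product."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    a_n, b_n = l2_normalise(a), l2_normalise(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def same_identity(a, b, threshold):
    """Hypothetical verification decision: accept if similarity >= threshold."""
    return cosine_similarity(a, b) >= threshold
```

Because both vectors are normalised first, the score does not depend on embedding magnitude, so a 512-dim embedding and a scaled copy of it compare as identical.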

Benchmark	Score	Protocol
LFW (Labelled Faces in the Wild)	99.83%	6,000 pairs, 10-fold CV
CFP-FP (Celebrities Frontal-Profile)	99.26%	7,000 pairs, frontal vs. profile
AgeDB-30 (age-separated pairs)	98.10%	6,000 pairs, 10-fold CV
CALFW (cross-age)	96.12%	6,000 pairs, 10-fold CV
CPLFW (cross-pose)	94.45%	6,000 pairs, 10-fold CV

These results place w600k_r50 among the top-tier openly available ArcFace checkpoints. The model is production-proven in the InsightFace ecosystem with known ONNX export compatibility.

Mask-aware recognition

The optional w600k_mbf (ArcFace MobileFaceNet) variant shares the same embedding space and training set. It offers better accuracy on occluded/masked faces with a 13× smaller model footprint (13.6 MB vs 174 MB) at a small accuracy trade-off on unoccluded benchmarks.

Detection accuracy

Benchmark	Score	Model
WiderFace Easy	95.2% AP	SCRFD_10G
WiderFace Medium	93.8% AP	SCRFD_10G
WiderFace Hard	92.3% AP	SCRFD_10G

WiderFace Hard is the most relevant benchmark — it covers small, occluded, and partially visible faces typical of surveillance captures. At 92.3% AP, SCRFD_10G outperforms RetinaFace (~88%) and YOLOv8-face (~90%) on this dataset.

Quality assessment

The inline quality model (eDifFIQA Tiny, MobileFaceNet backbone) is evaluated by its utility: the degree to which its quality ranking correlates with recognition error rate.

The full-size eDifFIQA(L) ranks #1 on the NIST FATE-Quality Kiosk-to-Entry benchmark. The Tiny variant trades a small accuracy margin for a 2 MB model vs 170 MB, making it practical for inline quality gating. Quality scores are normalised to [0, 1]; the default acceptance threshold is 0.40.
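
Inline quality gating with the default threshold could look roughly like the sketch below. The function name is hypothetical, and treating a score exactly equal to the threshold as accepted is an assumption; the source only states the [0, 1] range and the 0.40 default.

```python
DEFAULT_QUALITY_THRESHOLD = 0.40  # default acceptance threshold from the text


def passes_quality_gate(score, threshold=DEFAULT_QUALITY_THRESHOLD):
    """Return True if a normalised quality score clears the gate.

    Assumption: scores equal to the threshold are accepted.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("quality score must be normalised to [0, 1]")
    return score >= threshold
```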

Anti-spoofing (PAD)

Protocol	ACER	Model
OULU-NPU Protocol 1	< 2%	MiniFASNetV2
OULU-NPU Protocol 4	~1.2%	MiniFASNetV2

MiniFASNetV2 is ISO 30107-3 Level 1 compliant for print and replay attacks. The 3-class output (real / 2D-spoof / 3D-spoof) allows distinguishing attack type in the response payload.

Morphing attack detection (MAD)

Benchmark	Score	Model
FRGC-Morph D-EER	< 5%	SelfMAD HRNet-W18

SelfMAD HRNet-W18 is state of the art for single-image, reference-free morphing attack detection. Its D-EER (Detection Equal Error Rate) below 5% on FRGC-Morph is the best published result for this setting at the time of writing.

200M face search — design targets

UFME is designed for the following production-scale targets. These are the KPIs against which the system is evaluated.

KPI	Target	Notes
Gallery size	200M vectors	5 shards × 40M vectors, IndexIVFPQ
Annual search throughput	60M 1:N searches	~1.9 searches/second sustained
End-to-end latency (P95)	< 1 s	Probe ingress to XML response
FAISS scatter-gather latency (P95)	< 200 ms	Per-shard gRPC deadline = 200 ms
Recall@1 after PQ → rerank	≥ 97%	PQ candidates reranked with exact vectors
Storage per shard	~2.56 GB index	40M × 64 B PQ (32× compression from 2,048 B)
Rerank candidate pool	100	top-50 per shard × 5 shards = 250 → deduplicated to top-100
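
The rerank candidate pool in the last row (top-50 per shard × 5 shards = 250, deduplicated to top-100) can be sketched as a merge step. This is an illustration under stated assumptions, not the project's implementation: `merge_shard_candidates` is a hypothetical name, candidates are assumed to be `(vector_id, score)` pairs, deduplication is assumed to key on the ID, and a higher score is assumed to mean a better match.

```python
import heapq


def merge_shard_candidates(shard_results, pool_size=100):
    """Merge per-shard top-k lists into one deduplicated candidate pool.

    shard_results: iterable of lists of (vector_id, score) pairs.
    Assumption: higher score = better match; duplicates keep their best score.
    """
    best = {}
    for results in shard_results:
        for vector_id, score in results:
            if vector_id not in best or score > best[vector_id]:
                best[vector_id] = score
    # nlargest returns the pool sorted from best to worst score.
    return heapq.nlargest(pool_size, best.items(), key=lambda item: item[1])
```

The resulting pool is what would then be reranked against the exact stored vectors.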

Index compression

Each 512-dim float32 embedding is 2,048 bytes uncompressed. IndexIVFPQ with M=64 sub-vectors and 8-bit codes compresses to 64 bytes per vector — a 32× reduction. At 40M vectors per shard, the compressed index fits in ~2.56 GB RAM.

Exact stored vectors for reranking add ~82 GB per shard (40M × 2,048 B). The production shard spec (n2-highmem-16 on GCP, 128 GB RAM) accommodates both the compressed index and the reranking vectors in memory.
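
The memory figures above can be checked with a few lines of arithmetic. The sketch below uses decimal gigabytes (1 GB = 10^9 bytes) and treats the 128 GB shard RAM the same way, which is an assumption; it also ignores per-vector IDs, coarse-quantizer centroids, and process overhead, so real usage will be somewhat higher.

```python
def shard_memory_budget(
    dim=512,
    bytes_per_float=4,       # float32
    pq_subvectors=64,        # M
    pq_bits=8,               # bits per PQ code
    vectors_per_shard=40_000_000,
    shard_ram_gb=128,
):
    """Estimate per-shard memory for the PQ index plus exact rerank vectors."""
    raw_bytes = dim * bytes_per_float             # 2,048 B per vector
    code_bytes = pq_subvectors * pq_bits // 8     # 64 B per vector
    index_gb = vectors_per_shard * code_bytes / 1e9
    rerank_gb = vectors_per_shard * raw_bytes / 1e9
    total_gb = index_gb + rerank_gb
    return {
        "raw_bytes": raw_bytes,
        "code_bytes": code_bytes,
        "compression": raw_bytes // code_bytes,
        "index_gb": index_gb,
        "rerank_gb": rerank_gb,
        "total_gb": total_gb,
        "fits": total_gb < shard_ram_gb,
    }


budget = shard_memory_budget()
```

With the defaults, the index and the exact vectors together come to about 84.5 GB, leaving roughly 43 GB of headroom on a 128 GB shard.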

GCP 200M benchmark results

Benchmark executed on 5 × n2-highmem-16 VMs (128 GB RAM each) in GCP europe-west2-a. The gallery is built by tiling a 5.18M-vector MS1MV3 base 39× (5.18M × 39 ≈ 202M, enough to fill 200M slots), with each shard holding 40M vectors. IVF training: 16,384 Voronoi cells. Configuration: nprobe=96, top_k=10, concurrency=8, 1,000 queries, partition=gallery-a.

Metric	Result	Design Target
QPS	238.6	≥ 1.9 (60M/yr)
p50 latency	31.1 ms	—
p95 latency	46.8 ms	< 1,000 ms
p99 latency	63.6 ms	—
Recall@1 (1,000 queries)	96.8%	≥ 97%
Errors	0 / 1,000	0

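All five shards served without errors. Design targets are exceeded by roughly 125× on throughput (238.6 vs 1.9 QPS) and 21× on P95 latency (46.8 ms vs 1,000 ms). Recall@1 of 96.8% is stable across nprobe values from 32 to 256. It sits 0.2 percentage points below the 97% design target, a gap that is within the sampling error of a 1,000-query measurement (roughly ±1.1 percentage points at 95% confidence).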
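
The headroom ratios and the recall sampling margin can be derived from the table above. The sketch below is a back-of-the-envelope check, not part of the benchmark tooling: the helper names are hypothetical, a 365-day year is assumed, and the recall margin uses a simple normal-approximation confidence interval.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year


def required_qps(searches_per_year):
    """Sustained searches per second implied by an annual volume."""
    return searches_per_year / SECONDS_PER_YEAR


def recall_margin(recall, n_queries, z=1.96):
    """Half-width of a normal-approximation confidence interval (z=1.96 -> 95%)."""
    return z * math.sqrt(recall * (1.0 - recall) / n_queries)


target_qps = required_qps(60_000_000)      # about 1.9 searches/second
throughput_headroom = 238.6 / target_qps   # measured QPS vs target
latency_headroom = 1000.0 / 46.8           # target P95 (ms) vs measured P95 (ms)
margin = recall_margin(0.968, 1000)        # sampling error on Recall@1
```

With 1,000 queries the margin is on the order of one percentage point, so a larger query set would be needed to resolve a 0.2-point difference from the 97% target.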

Benchmark infrastructure

scripts/
  generate_200m_benchmark.py    # Tiles 5.18M MS1MV3 vectors 39× to produce 200M
  generate_shard_benchmark.py   # Per-shard generator with full vectors.bin (runs on GCP VM)
  run_benchmark.py              # Async gRPC scatter-gather runner, POSTs to benchmark UI
  measure_recall_200m.py        # Recall@1 validation against deterministic ground truth
  nprobe_sweep.py               # nprobe parameter sweep (32/64/96/128/256)
benchmark-ui/                   # Cloudflare Worker + vanilla JS results dashboard
docs/benchmark-deployment-plan.md  # GCP provisioning guide (5× n2-highmem-16 + orchestrator)

To reproduce the benchmark (requires GCP access):

# Deploy 5 shard VMs + orchestrator
make gcp-deploy

# Generate 200M vectors on each shard (runs remotely, ~45 min per shard)
# See docs/benchmark-deployment-plan.md for the full step-by-step procedure.

# Run benchmark
make gcp-benchmark

# Download results
make gcp-results

# Tear down all VMs
make gcp-teardown