
Benchmark Results

The primary recognition model (w600k_r50, ArcFace ResNet-50 trained on WebFace600K) is validated against standard face verification benchmarks. All evaluations use the standard protocol with L2-normalised 512-dim embeddings and cosine similarity.

| Benchmark | Score | Protocol |
| --- | --- | --- |
| LFW (Labelled Faces in the Wild) | 99.83% | 6,000 pairs, 10-fold CV |
| CFP-FP (Celebrities Frontal-Profile) | 99.26% | 7,000 pairs, frontal vs. profile |
| AgeDB-30 (age-separated pairs) | 98.10% | 6,000 pairs, 10-fold CV |
| CALFW (cross-age) | 96.12% | 6,000 pairs, 10-fold CV |
| CPLFW (cross-pose) | 94.45% | 6,000 pairs, 10-fold CV |

These results place w600k_r50 among the top-tier openly available ArcFace checkpoints. The model is production-proven in the InsightFace ecosystem with known ONNX export compatibility.
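The verification step behind these benchmarks, cosine similarity over L2-normalised embeddings, can be sketched as follows. This is a minimal illustration and not the project's evaluation code: the function names are hypothetical, and the `threshold` default is a placeholder, since the benchmark protocols select the decision threshold per fold.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity; for unit vectors this reduces to a dot product."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    a, b = l2_normalise(a), l2_normalise(b)
    return sum(x * y for x, y in zip(a, b))


def same_identity(a, b, threshold=0.35):
    # threshold=0.35 is a hypothetical placeholder, not a value from the source.
    return cosine_similarity(a, b) >= threshold
```

In practice the two inputs would be 512-dim embeddings from the recognition model; short vectors work the same way for experimentation.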

The optional w600k_mbf (ArcFace MobileFaceNet) variant shares the same embedding space and training set. It is more accurate on occluded or masked faces and roughly 13× smaller (13.6 MB vs 174 MB), at the cost of slightly lower accuracy on unoccluded benchmarks.


| Benchmark | Score | Model |
| --- | --- | --- |
| WiderFace Easy | 95.2% AP | SCRFD_10G |
| WiderFace Medium | 93.8% AP | SCRFD_10G |
| WiderFace Hard | 92.3% AP | SCRFD_10G |

WiderFace Hard is the most relevant benchmark — it covers small, occluded, and partially visible faces typical of surveillance captures. At 92.3% AP, SCRFD_10G outperforms RetinaFace (~88%) and YOLOv8-face (~90%) on this dataset.


The inline quality model (eDifFIQA Tiny, MobileFaceNet backbone) is evaluated by its utility score: the degree to which its quality ranking correlates with recognition error rate.

The full-size eDifFIQA(L) ranks #1 on the NIST FATE-Quality Kiosk-to-Entry benchmark. The Tiny variant trades a small accuracy margin for a 2 MB model vs 170 MB, making it practical for inline quality gating. Quality scores are normalised to [0, 1]; the default acceptance threshold is 0.40.
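A quality gate over the normalised score might look like the sketch below. The 0.40 default comes from the text above; the function name and the choice of an inclusive comparison (`>=`) are assumptions.

```python
DEFAULT_QUALITY_THRESHOLD = 0.40  # default acceptance threshold stated in the text


def passes_quality_gate(score, threshold=DEFAULT_QUALITY_THRESHOLD):
    """Return True if a normalised quality score clears the acceptance threshold.

    Assumes scores are already normalised to [0, 1] and that the threshold
    is inclusive; both the name and the inclusive comparison are illustrative.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("quality score must lie in [0, 1]")
    return score >= threshold
```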


| Protocol | ACER | Model |
| --- | --- | --- |
| OULU-NPU Protocol 1 | < 2% | MiniFASNetV2 |
| OULU-NPU Protocol 4 | ~1.2% | MiniFASNetV2 |

MiniFASNetV2 is ISO 30107-3 Level 1 compliant for print and replay attacks. The 3-class output (real / 2D-spoof / 3D-spoof) allows distinguishing attack type in the response payload.
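One way the 3-class output could be turned into an attack-type field is sketched here. The class index order, the field names, and the dictionary shape are all assumptions made for illustration; the source names the three classes but does not specify their indices or the payload schema.

```python
import math

# Assumed class order; the source does not state which index maps to which class.
CLASS_LABELS = ("real", "2d_spoof", "3d_spoof")


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def liveness_payload(logits):
    """Map 3-class logits to a hypothetical liveness result dictionary."""
    if len(logits) != len(CLASS_LABELS):
        raise ValueError("expected one logit per class")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = CLASS_LABELS[best]
    return {
        "is_live": label == "real",
        "attack_type": None if label == "real" else label,
        "confidence": probs[best],
    }
```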


| Benchmark | Score | Model |
| --- | --- | --- |
| FRGC-Morph D-EER | < 5% | SelfMAD HRNet-W18 |

SelfMAD HRNet-W18 is state of the art for single-image, reference-free morphing attack detection. D-EER (Detection Equal Error Rate) below 5% on FRGC-Morph is the current best published result for this setting.
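For readers unfamiliar with the metric, D-EER is the operating point where the rate of missed morphs equals the rate of bona fide images wrongly flagged. The sketch below approximates it with a threshold sweep; it assumes higher scores mean "more likely a morph" and is not the evaluation code used for the figure above.

```python
def detection_eer(bona_fide_scores, morph_scores):
    """Approximate D-EER by sweeping thresholds over the observed scores.

    Assumes a higher score means "more likely a morph". Returns the mean of
    the two error rates at the threshold where they are closest.
    """
    if not bona_fide_scores or not morph_scores:
        raise ValueError("both score lists must be non-empty")
    best_gap, best_eer = None, None
    for t in sorted(set(bona_fide_scores) | set(morph_scores)):
        # A sample is flagged as a morph when its score is >= t.
        missed_morphs = sum(s < t for s in morph_scores) / len(morph_scores)
        false_alarms = sum(s >= t for s in bona_fide_scores) / len(bona_fide_scores)
        gap = abs(missed_morphs - false_alarms)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (missed_morphs + false_alarms) / 2
    return best_eer
```

With small score sets the two error curves rarely cross exactly, which is why the sketch takes the closest point rather than interpolating.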


UFME is designed for the following production-scale targets. These are the KPIs against which the system is evaluated.

| KPI | Target | Notes |
| --- | --- | --- |
| Gallery size | 200M vectors | 5 shards × 40M vectors, IndexIVFPQ |
| Annual search throughput | 60M 1:N searches | ~1.9 searches/second sustained |
| End-to-end latency (P95) | < 1 s | Probe ingress to XML response |
| FAISS scatter-gather latency (P95) | < 200 ms | Per-shard gRPC deadline = 200 ms |
| Recall@1 after PQ → rerank | ≥ 97% | PQ candidates reranked with exact vectors |
| Storage per shard | ~2.56 GB index | 40M × 64 B PQ (32× compression from 2,048 B) |
| Rerank candidate pool | 100 | top-50 per shard × 5 shards = 250 → deduplicated to top-100 |
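The rerank candidate pool row describes a merge step that can be sketched as below. The tuple layout `(vector_id, score)`, the "higher score is better" convention, and the rule of keeping the best score per duplicate ID are assumptions; the source only states the pool sizes.

```python
import heapq


def merge_candidates(shard_results, pool_size=100):
    """Merge per-shard top-k lists into one deduplicated candidate pool.

    shard_results: iterable of lists of (vector_id, score) tuples, where a
    higher score is assumed to mean a better match. Duplicate IDs keep their
    best score. Returns up to pool_size tuples, best first.
    """
    best = {}
    for results in shard_results:
        for vec_id, score in results:
            if vec_id not in best or score > best[vec_id]:
                best[vec_id] = score
    return heapq.nlargest(pool_size, best.items(), key=lambda item: item[1])
```

The resulting pool is what would then be rescored against the exact float32 vectors.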

Each 512-dim float32 embedding is 2,048 bytes uncompressed. IndexIVFPQ with M=64 sub-vectors and 8-bit codes compresses to 64 bytes per vector — a 32× reduction. At 40M vectors per shard, the compressed index fits in ~2.56 GB RAM.

Exact stored vectors for reranking add ~82 GB per shard (40M × 2,048 B). The production shard spec (n2-highmem-16 on GCP, 128 GB RAM) accommodates both the compressed index and the reranking vectors in memory.
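The storage arithmetic above can be reproduced in a few lines. This is a back-of-envelope sketch using decimal gigabytes; it ignores overheads such as per-vector IDs and the IVF centroid table, so real memory use will be somewhat higher. The function name is hypothetical.

```python
def shard_memory_budget(vectors=40_000_000, dim=512, pq_m=64, pq_bits=8, ram_gb=128.0):
    """Estimate per-shard memory from the figures quoted in the text."""
    raw_bytes = dim * 4                  # float32 embedding: 2,048 B
    code_bytes = pq_m * pq_bits // 8     # PQ code: 64 B
    index_gb = vectors * code_bytes / 1e9
    exact_gb = vectors * raw_bytes / 1e9
    return {
        "raw_bytes": raw_bytes,
        "code_bytes": code_bytes,
        "compression": raw_bytes / code_bytes,
        "index_gb": index_gb,            # compressed PQ codes
        "exact_gb": exact_gb,            # exact vectors kept for reranking
        "headroom_gb": ram_gb - index_gb - exact_gb,
    }


budget = shard_memory_budget()
print(budget)
```

With the defaults this gives a 32× compression ratio, about 2.56 GB of PQ codes and 81.92 GB of exact vectors, leaving roughly 43.5 GB of the 128 GB for everything else.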

The benchmark was executed on 5 × n2-highmem-16 VMs (128 GB RAM each) in GCP europe-west2-a. Each shard holds 40M vectors; the gallery of about 200M vectors is produced by tiling a 5.18M-vector MS1MV3 base 39×. IVF training: 16,384 Voronoi cells. Configuration: nprobe=96, top_k=10, concurrency=8, 1,000 queries, partition=gallery-a.

| Metric | Result | Design Target |
| --- | --- | --- |
| QPS | 238.6 | ≥ 1.9 (60M/yr) |
| p50 latency | 31.1 ms | |
| p95 latency | 46.8 ms | < 1,000 ms |
| p99 latency | 63.6 ms | |
| Recall@1 (1,000 queries) | 96.8% | ≥ 97% |
| Errors | 0 / 1,000 | 0 |
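How latency percentiles like those in the table are gathered under bounded concurrency can be sketched with `asyncio`. This is not the project's `run_benchmark.py`: the `search` coroutine is a hypothetical stand-in for the gRPC call, the 200 ms deadline is applied per query here as an assumption, and the percentile uses the nearest-rank definition, which may differ from the one the real runner uses.

```python
import asyncio
import math
import time


async def run_load(search, queries, concurrency=8, deadline_s=0.2):
    """Run queries with bounded concurrency; return (latencies_ms, error_count)."""
    semaphore = asyncio.Semaphore(concurrency)
    latencies_ms = []
    errors = 0

    async def run_one(query):
        nonlocal errors
        async with semaphore:
            start = time.perf_counter()
            try:
                # A timeout or any search failure counts as an error.
                await asyncio.wait_for(search(query), timeout=deadline_s)
            except Exception:
                errors += 1
                return
            latencies_ms.append((time.perf_counter() - start) * 1000.0)

    await asyncio.gather(*(run_one(q) for q in queries))
    return latencies_ms, errors


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("no values to summarise")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

A call such as `asyncio.run(run_load(my_search, queries))` followed by `percentile(latencies, 95)` would then yield a p95 figure comparable in kind to the one reported.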

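All five shards served without errors. Measured throughput exceeds the design target by roughly 125× (238.6 vs 1.9 QPS), and P95 latency beats its target by about 21× (46.8 ms vs 1,000 ms). Recall@1 is 96.8% and stable across nprobe values from 32 to 256. That is 0.2 points below the 97% design target, a gap smaller than the sampling error of a 1,000-query run (about ±1.1 points at 95% confidence), so the target is met only within the measurement margin.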
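The headroom ratios and the recall sampling margin can be recomputed from the table values. The sketch below uses a normal approximation to the binomial for the margin, which is a simplifying assumption; the helper names are hypothetical.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600


def sustained_rate(searches_per_year):
    """Average searches per second implied by an annual volume."""
    return searches_per_year / SECONDS_PER_YEAR


def recall_margin(recall, n_queries, z=1.96):
    """Half-width of a normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(recall * (1.0 - recall) / n_queries)


target_qps = sustained_rate(60_000_000)      # ~1.90 searches/second
throughput_headroom = 238.6 / 1.9            # ~125.6x
latency_headroom = 1000.0 / 46.8             # ~21.4x
margin = recall_margin(0.968, 1000)          # ~0.011, i.e. about 1.1 points

print(f"target QPS          : {target_qps:.2f}")
print(f"throughput headroom : {throughput_headroom:.1f}x")
print(f"P95 latency headroom: {latency_headroom:.1f}x")
print(f"Recall@1 margin     : +/-{margin * 100:.1f} points")
```

A larger query set would narrow the margin: it shrinks with the square root of the number of queries.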


scripts/
  generate_200m_benchmark.py       # Tiles 5.18M MS1MV3 vectors 39× to produce 200M
  generate_shard_benchmark.py      # Per-shard generator with full vectors.bin (runs on GCP VM)
  run_benchmark.py                 # Async gRPC scatter-gather runner, POSTs to benchmark UI
  measure_recall_200m.py           # Recall@1 validation against deterministic ground truth
  nprobe_sweep.py                  # nprobe parameter sweep (32/64/96/128/256)
benchmark-ui/                      # Cloudflare Worker + vanilla JS results dashboard
docs/benchmark-deployment-plan.md  # GCP provisioning guide (5× n2-highmem-16 + orchestrator)

To reproduce the benchmark (requires GCP access):

# Deploy 5 shard VMs + orchestrator
make gcp-deploy
# Generate 200M vectors on each shard (runs remotely, ~45 min per shard)
# See docs/benchmark-deployment-plan.md for the full step-by-step procedure.
# Run benchmark
make gcp-benchmark
# Download results
make gcp-results
# Tear down all VMs
make gcp-teardown