Date: 2026-02-20
Scope: Compare every UFME VISION.md design choice against current SOTA research and production systems.

## Executive Summary

UFME’s architecture is well-aligned with production best practices in several areas (hexagonal architecture, stateless processing, ONNX deployment). However, the specific model and infrastructure choices need updating to reflect 2024-2026 SOTA. The most significant gaps are:
| Priority | Gap | Impact |
|---|---|---|
| Critical | IndexIVFFlat requires ~400GB RAM at 200M scale | Infeasible without 20-node cluster; IVF-PQ reduces to ~12-25GB |
| High | RetinaFace is superseded by SCRFD | 3-4% Hard accuracy gain; same ONNX ecosystem |
| High | ArcFace alone is no longer SOTA loss | AdaFace/TopoFR offer better robustness on hard cases |
| High | No ISO-compliant quality assessment | OFIQ (ISO 29794-5) is the standard for production identity systems |
| Medium | PAD needs unified physical+digital detection | Deepfakes not addressed in VISION.md |
| Medium | No morphing attack detection | Critical for border control and identity verification |
| Low | 512-dim embedding is correct | Industry standard confirmed |
| Low | ViT architecture choice is validated | ViTs outperform CNNs on face recognition with sufficient data |
## 1. Face Detection

### UFME Choice: RetinaFace (FPN + Context Modules)

### SOTA Assessment

| Metric | RetinaFace R50 | SCRFD_10G | YOLOv12-Face | Verdict |
|---|---|---|---|---|
| WiderFace Easy | 95.0% | 95.2% | ~95%+ | Parity |
| WiderFace Medium | 93.0% | 93.9% | ~93%+ | SCRFD wins |
| WiderFace Hard | 83.0% | 83.1% | ~80%+ | SCRFD wins |
| Inference (CPU VGA) | ~80ms | ~80ms | fast | Parity |
| ONNX export | Yes | Yes | Yes | All good |
| Landmarks | 5 | 5 | 5 | Same |
| Ecosystem | InsightFace | InsightFace | Ultralytics | Same maintainers |
### Recommendation: Replace RetinaFace with SCRFD_10G (or SCRFD_34GF for max accuracy)

- Same InsightFace ecosystem, same ONNX export path, same 5-point landmarks
- ~1% better on Hard split (occluded/small faces — critical for operational use)
- Drop-in replacement at the adapter level (no domain changes needed)
- SCRFD_500M available for edge deployment if needed later
- License note: InsightFace pretrained models require commercial license for production
### Alignment with UFME Architecture

Perfect fit. The hexagonal design means swapping detection models is an adapter change only. The `InferencePort` protocol is model-agnostic. This validates the architecture’s simplicity.
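In code, that adapter boundary can be sketched as follows. Only the `InferencePort` name comes from the design; `DetectedFace`, `ScrfdAdapter`, and the session interface are hypothetical illustrations of the pattern, not UFME code:

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class DetectedFace:
    """Immutable detection result: box plus 5-point landmarks."""
    box: tuple[float, float, float, float]      # x1, y1, x2, y2
    landmarks: tuple[tuple[float, float], ...]  # 5 (x, y) points
    score: float


class InferencePort(Protocol):
    """Model-agnostic detection port; the domain depends only on this."""
    def detect(self, image_bgr) -> Sequence[DetectedFace]: ...


class ScrfdAdapter:
    """Outbound adapter wrapping a detector session (session is injected,
    e.g. something built around an ONNX Runtime InferenceSession)."""

    def __init__(self, session):
        self._session = session

    def detect(self, image_bgr) -> Sequence[DetectedFace]:
        raw = self._session.run(image_bgr)  # hypothetical session interface
        return [
            DetectedFace(box=r["box"], landmarks=tuple(r["kps"]), score=r["score"])
            for r in raw
        ]
```

Swapping RetinaFace for SCRFD then means writing one new adapter class; nothing upstream of `InferencePort` changes.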
## 2. Face Recognition (Feature Extraction)

### UFME Choice: Vision Transformer (ViT-Base or ViT-Large) + ArcFace loss, 512-dim float32

### SOTA Assessment

#### Architecture (ViT vs CNN)

| Aspect | UFME (ViT) | CNN (IR-100) | Hybrid (EdgeNeXt) |
|---|---|---|---|
| IJB-C TAR@FAR=1e-4 | ~97.5% (TransFace) | ~97.0% (Glint360K) | 94.85% (EdgeFace) |
| Occlusion resilience | Excellent (global attention) | Good | Good |
| Data requirement | High (>1M identities) | Moderate | Low |
| ONNX export | Yes (opset ≥14) | Yes | Yes |
| Inference speed | Slower | Faster | Fastest |
Verdict: ViT choice is validated. ViTs outperform CNNs in 13/15 evaluations when pretrained on large data. UFME’s stated advantage (correlating distant facial features from the first layer) is confirmed by research. The key risk is ViT data hunger — mitigated by TransFace’s EHSM/DPAP or LVFace’s PCO training techniques.
#### Training Loss

| Loss | Year | Key Advantage | IJB-C TAR@1e-4 | Best For |
|---|---|---|---|---|
| ArcFace | 2019 | Clean geodesic margin | ~97.0% | Strong baseline |
| AdaFace | 2022 | Quality-adaptive margin | ~97.4% | Low-quality/surveillance |
| ElasticFace | 2022 | Stochastic margin | SOTA 7/9 benchmarks | General robustness |
| TopoFR | 2024 | Topological alignment | SOTA+ | Structure preservation |
| LVFace PCO | 2025 | Progressive cluster optimisation | SOTA | Large-scale training |
Verdict: ArcFace is a solid baseline but no longer the best standalone loss.
### Recommendation: Adopt AdaFace loss, keep ViT backbone, keep 512-dim

- AdaFace’s quality-adaptive margin is particularly relevant for production identity systems: immigration images vary wildly in quality (passport photos vs CCTV captures vs aged documents)
- The feature norm as quality proxy aligns with UFME’s quality pipeline — AdaFace internally does what UFME’s quality gate does externally
- If training from scratch: use LVFace’s PCO for ViT training stability
- If fine-tuning InsightFace pretrained: AdaFace loss is a drop-in replacement for ArcFace
- 512-dim embedding is confirmed as industry standard — no change needed
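The norm-as-quality idea can be illustrated in a few lines of numpy. This is a simplified sketch of the quality-adaptive margin mechanism, not the actual AdaFace training code: the concentration `h` and base margin `m` follow commonly cited defaults, and the exact formulation should be taken from the AdaFace reference implementation.

```python
import numpy as np


def quality_proxy(embeddings: np.ndarray, h: float = 0.33) -> np.ndarray:
    """AdaFace-style proxy: batch-normalised feature norm, clipped to [-1, 1].
    Large norm ~ high image quality, small norm ~ low quality."""
    norms = np.linalg.norm(embeddings, axis=1)
    z = (norms - norms.mean()) / (norms.std() + 1e-8)
    return np.clip(z * h, -1.0, 1.0)


def adaptive_margins(h_hat: np.ndarray, m: float = 0.4):
    """Illustrative margin split: the angular term g_angle and additive term
    g_add trade off as the estimated quality h_hat moves across [-1, 1]."""
    g_angle = -m * h_hat
    g_add = m * h_hat + m
    return g_angle, g_add
```

The point for UFME: the same feature norm that AdaFace uses internally can double as a free quality signal alongside the explicit quality pipeline.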
### Alignment with UFME Architecture

Perfect fit. The loss function is a training-time concern, not a runtime concern. The ViT architecture and 512-dim output are already in the design. Changing the loss requires zero runtime code changes.
## 3. Vector Storage & Matching Engine

### UFME Choice: Sharded FAISS with IndexIVFFlat, 20 nodes, 10M vectors/node

### SOTA Assessment

| Approach | RAM (200M × 512d) | Recall@1 | QPS (CPU) | Filtering | Production |
|---|---|---|---|---|---|
| UFME: IVF-Flat | ~400 GB | 95-99% | ~2K | Bitset | Library |
| IVF-PQ (M=64) | ~12-25 GB | 85-95% (99%+ w/ rerank) | ~5-10K | Bitset | Library |
| HNSW | ~550 GB | 97-99.5% | ~3K | Limited | Library |
| DiskANN | ~64 GB + SSD | ≥95% | <5ms latency | Via host DB | Medium |
| ScaNN | ~12-25 GB | Best-in-class | ~2× next fastest | Adaptive | Medium |
| Milvus | Configurable | Configurable | High | Excellent | High |
### Critical Gap: IndexIVFFlat Memory

The UFME design uses `IndexIVFFlat`, which stores raw float32 vectors: at 200M × 512-dim × 4 bytes, that is ~400 GB of RAM. The design acknowledges this and distributes across 20 nodes with 10M vectors each (~20 GB/node). This works but is expensive.
IVF-PQ would reduce total RAM to ~12-25 GB (fits on 2-4 nodes instead of 20), with recall recoverable to ~99% via reranking top-100 candidates against exact vectors.
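The memory arithmetic behind that claim is worth spelling out (M=64 PQ stores one byte per subvector, so 64 bytes per vector; IDs and index overhead push the practical total toward the upper end of the 12-25 GB range):

```python
n, d = 200_000_000, 512

flat_bytes = n * d * 4   # IndexIVFFlat: raw float32 vectors
pq_bytes = n * 64        # IVF-PQ, M=64 subvectors x 1 byte each

print(f"IVF-Flat: {flat_bytes / 1e9:.0f} GB")  # ~410 GB
print(f"IVF-PQ:   {pq_bytes / 1e9:.1f} GB")    # ~12.8 GB
```

A 32× reduction in resident vector data, before counting the coarse quantiser and ID arrays.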
### Recommendation: Use IVF-PQ as primary index, with exact-vector reranking

**Tier 1 (Primary): IVF-PQ with M=64 subvectors**

- 200M vectors compressed to ~12-25 GB total
- nlist ≈ 14,000-20,000 Voronoi cells
- nprobe = 64-128 for 90-95% initial recall
- Rerank top-100 candidates against stored exact vectors → Recall@1 ≥ 97%
- 4-6 nodes instead of 20 (significant cost reduction)
**Tier 2 (Future scale): DiskANN for >500M gallery**

- SSD-backed, ~64 GB RAM even for 1B vectors
- <5ms latency at 95% recall
**Tier 3 (Batch analytics): CAGRA/GPU via FAISS cuVS integration**

- For batch video surveillance identity resolution
- 33-77× throughput vs CPU HNSW
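The compress-then-rerank shape of Tier 1 is independent of any particular library. The sketch below uses int8 scalar quantisation as a stand-in for product quantisation (a real deployment would use `faiss.IndexIVFPQ`), but the two stages are the same: cheap approximate scoring over compressed vectors, exact rerank of the shortlist.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 20_000

# L2-normalised gallery, so inner product == cosine similarity
gallery = rng.normal(size=(n, d)).astype(np.float32)
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

# Compression stage: int8 here; PQ with M=64 would give 64 bytes/vector
scale = np.abs(gallery).max() / 127.0
gallery_q = np.round(gallery / scale).astype(np.int8)

# A probe near gallery item 42
query = gallery[42] + rng.normal(scale=0.05, size=d).astype(np.float32)
query /= np.linalg.norm(query)

# Stage 1: approximate top-100 shortlist from compressed vectors
approx_scores = (gallery_q.astype(np.float32) * scale) @ query
shortlist = np.argpartition(-approx_scores, 100)[:100]

# Stage 2: rerank the shortlist against exact float32 vectors
exact_scores = gallery[shortlist] @ query
best = int(shortlist[np.argmax(exact_scores)])
```

The recall lost to compression in stage 1 only matters if the true match falls out of the shortlist, which is why a top-100 rerank recovers near-exact Recall@1.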
### What UFME Gets Right

- Scatter-gather topology: Confirmed as the correct pattern. All production systems at this scale use distributed fan-out.
- Inner Product metric: Correct for L2-normalised vectors (equivalent to cosine).
- Pre-filtering bitsets: Aligned with SOTA. FAISS supports ID-based masking natively (e.g. `IDSelectorBitmap` passed via search parameters).
- Avoiding Milvus/Qdrant: The VISION.md’s concern about “unacceptable network and translation overhead” from commercial vector DBs is valid for a latency-sensitive biometric system. Raw FAISS with custom distribution is the right choice.
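Two of these points are easy to verify directly: for L2-normalised vectors the inner product equals cosine similarity, and a pre-computed boolean bitset can mask the gallery before scoring. A minimal numpy check (brute-force scoring stands in for the index):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1_000, 512

gallery = rng.normal(size=(n, d))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
query = rng.normal(size=d)
query /= np.linalg.norm(query)

# Inner product == cosine once both sides are L2-normalised
ip = gallery @ query
cos = (gallery @ query) / (np.linalg.norm(gallery, axis=1) * np.linalg.norm(query))
assert np.allclose(ip, cos)

# Pre-filtering with a bitset: only allowed IDs can appear in results
allowed = np.zeros(n, dtype=bool)
allowed[[3, 57, 900]] = True
scores = np.where(allowed, ip, -np.inf)
top = int(np.argmax(scores))
```

In FAISS the same masking is expressed with an ID selector instead of rewriting scores, but the semantics (excluded IDs can never rank) are identical.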
### What UFME’s Epochal Time Model Adds

The implementation plan’s event-sourced, immutable-snapshot approach to index management is not standard in SOTA — it is architecturally superior. Most production FAISS deployments use mutable in-place updates, which creates exactly the complecting problems Hickey identifies. The atomic-swap snapshot model is a genuine innovation over standard practice.
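A minimal sketch of the snapshot-and-swap pattern (class and field names are illustrative, not from the UFME codebase): readers always see a complete, immutable index, and publishing a new epoch is a single reference assignment.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexSnapshot:
    """Immutable value: an epoch number plus a fully built, read-only index."""
    epoch: int
    index: object  # e.g. a FAISS index that is never mutated after build


class EpochalIndexHolder:
    """Readers grab whatever snapshot is current; a writer builds the next
    snapshot off to the side and publishes it with one atomic swap."""

    def __init__(self, initial: IndexSnapshot):
        self._current = initial
        self._lock = threading.Lock()  # serialises writers; reads are lock-free

    def snapshot(self) -> IndexSnapshot:
        return self._current           # a single reference read

    def publish(self, new: IndexSnapshot) -> None:
        with self._lock:
            if new.epoch <= self._current.epoch:
                raise ValueError("epochs must advance monotonically")
            self._current = new        # readers see old or new, never a mix
```

In-flight queries keep the snapshot they started with, which is exactly the property mutable in-place updates cannot give.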
### Alignment with UFME Architecture

Good fit with one change. Swap `IndexIVFFlat` for `IndexIVFPQ` in the outbound FAISS adapter. The domain layer, ports, pipeline stages, and scatter-gather coordination are all unchanged. This is exactly the kind of change the hexagonal architecture was designed to absorb.
## 4. Quality Assessment

### UFME Choice: “Auxiliary lightweight network” for blur, illumination, yaw/pitch/roll

### SOTA Assessment

| Method | Type | ISO Compliant | Best For |
|---|---|---|---|
| UFME (lightweight net) | CNN regression | No | Basic gating |
| MagFace | Implicit in FR loss | No | Zero-overhead |
| CR-FIQA | Certainty ratio | No | Best AUC |
| SDD-FIQA | Wasserstein pseudo-labels | No | Generalisation |
| OFIQ | BSI reference impl | Yes (29794-5) | ISO compliance |
| ViT-FIQA | Learnable quality token | No | ViT integration |
### Critical Gap: ISO Compliance

For production identity verification systems, ISO/IEC 29794-5 compliance is likely mandatory. OFIQ (Open Source Face Image Quality) is the BSI/eu-LISA reference implementation, specifically designed for border control and ID systems. It is:
- Open source (C/C++)
- The only ISO 29794-5 compliant implementation
- Evaluated by NIST FATE Quality
- Maintained by BSI + eu-LISA (the EU biometric infrastructure agency)
### Recommendation: Dual quality assessment

- OFIQ for ISO compliance — Run as the primary quality gate for all images. Produces ISO-standard quality components (illumination, pose, focus, expression, occlusion, etc.). Satisfies the `QualityPort` protocol.
- MagFace-style implicit quality — Use the ViT feature norm as an additional quality signal during AdaFace training. This is free (no extra model) and provides a quality proxy correlated with recognition performance.
- CR-FIQA or ViT-FIQA for research benchmarking — Useful for internal quality distribution analysis but not required for production.
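The measurement/policy split looks like this in port terms. `QualityPort` is the name used in the design; the stub adapter and gate below are illustrative, and the component names mirror (but are not) OFIQ’s actual output schema:

```python
from dataclasses import dataclass
from typing import Mapping, Protocol


class QualityPort(Protocol):
    """Pure measurement: image in, named component scores out. No decisions."""
    def measure(self, image) -> Mapping[str, float]: ...


class StubOfiqAdapter:
    """Stand-in for an OFIQ-backed adapter; returns fixed component scores."""
    def measure(self, image) -> Mapping[str, float]:
        return {"illumination": 0.9, "focus": 0.8, "pose": 0.7}


@dataclass(frozen=True)
class QualityGate:
    """Separate, configurable policy applied to the measurements."""
    thresholds: Mapping[str, float]

    def passes(self, scores: Mapping[str, float]) -> bool:
        return all(scores.get(k, 0.0) >= t for k, t in self.thresholds.items())
```

Swapping the lightweight CNN for OFIQ changes only the adapter; the gate, and every consumer of the gate, stays put.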
### Alignment with UFME Architecture

Perfect fit. The complecting audit already separated quality measurement from quality policy. OFIQ is a pure measurement function that satisfies `QualityPort`. The quality gate remains a separate configurable step.
## 5. Presentation Attack Detection (PAD)

### UFME Choice: “Auxiliary AI model” for spatial inconsistencies, Moiré patterns, texture degradation

### SOTA Assessment

| Approach | Type | Covers Physical | Covers Digital | Cross-Domain |
|---|---|---|---|---|
| UFME (texture analysis) | CNN | Yes | Partial | Weak |
| CDCN++ | CDC + NAS | Yes | No | Moderate |
| S-Adapter | ViT adapter | Yes | No | Strong |
| UniAttack | Unified model | Yes | Yes | Good |
| InstructFLIP | VLM | Yes | Yes | Strong |
| MADation | CLIP + LoRA | No | Morphing only | Strong |
| NIST FATE participants | Various | Evaluated | Evaluated | Evaluated |
### Gaps Identified

- No deepfake detection — VISION.md mentions “deepfakes” but the described method (Moiré patterns, texture) is primarily physical-PAD. Digital attacks (face swap, GAN faces, reenactment) require different detection approaches.
- No morphing attack detection (MAD) — Critical for border control and identity verification. Morphing attacks blend two identities into one passport photo, making this a high-priority threat.
- No domain generalisation strategy — PAD models notoriously overfit to training conditions. Cross-dataset HTER remains 6-15% for most methods.
- No ISO 30107-3 compliance pathway — NIST FATE PAD uses ISO 30107-3 metrics. The UFME PAD description doesn’t reference compliance.
### Recommendation: Multi-layer PAD with unified physical+digital detection

1. Primary PAD: Unified physical+digital detector
   - Deploy a ViT-based unified model (e.g., S-Adapter or UniAttack approach)
   - Single model handles print, replay, mask, deepfake, face swap
   - Satisfies the PAD port as a pure measurement function
2. Morphing Attack Detection: MADation (CLIP + LoRA)
   - Separate module for morphing detection specifically
   - Critical for document-based enrollment (passport photos)
   - Foundation model approach provides strong generalisation
3. ISO 30107-3 compliance
   - Evaluate through the NIST FATE PAD program
   - Report APCER/BPCER/ACER per the ISO standard
### Alignment with UFME Architecture

Good fit. The complecting audit already identified that PAD should be a composable stage with measurement separated from decision. Multiple PAD modules (physical, digital, morphing) can each satisfy a `PadPort` protocol independently. The pipeline orchestrator composes them.
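That composition can be sketched as below. `PadPort` is named in the design; the stub modules and the simple merge are illustrative assumptions (production score fusion would be calibrated, and the accept/reject decision lives outside the modules):

```python
from typing import Mapping, Protocol, Sequence


class PadPort(Protocol):
    """One PAD measurement module: image in, per-attack scores out. No verdict."""
    def score(self, image) -> Mapping[str, float]: ...


class PhysicalPad:
    def score(self, image) -> Mapping[str, float]:
        return {"print": 0.02, "replay": 0.05, "mask": 0.01}


class DigitalPad:
    def score(self, image) -> Mapping[str, float]:
        return {"deepfake": 0.90, "face_swap": 0.10}


class MorphingPad:
    def score(self, image) -> Mapping[str, float]:
        return {"morph": 0.03}


def run_pad(image, stages: Sequence[PadPort]) -> dict[str, float]:
    """Orchestrator: merge independent measurements into one score map."""
    merged: dict[str, float] = {}
    for stage in stages:
        merged.update(stage.score(image))
    return merged
```

Adding morphing detection later is then a new `PadPort` implementation appended to the stage list, not a pipeline rewrite.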
## 6. Alignment (Affine Transform)

### UFME Choice: Affine similarity transform to 112×112 pixel grid using 5 landmarks

### SOTA Assessment

This is correct and unchanged since 2019. The standard alignment pipeline is:
- Detect 5 landmarks (eyes, nose, mouth corners)
- Compute similarity transform to canonical template positions
- Apply affine warp to produce 112×112 crop
Research confirms 5-point alignment is sufficient for recognition. 68-point adds geometry analysis but doesn’t improve recognition accuracy. 478-point (MediaPipe FaceMesh) is for AR/expression, not recognition.
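The similarity transform in step 2 has a closed-form least-squares solution (Umeyama's method). The destination template below is the widely used ArcFace 112×112 landmark set; the solver itself is a generic sketch, not UFME code:

```python
import numpy as np

# Canonical 112x112 destination landmarks (the common ArcFace template)
ARCFACE_DST = np.array([
    [38.2946, 51.6963],   # left eye
    [73.5318, 51.5014],   # right eye
    [56.0252, 71.7366],   # nose tip
    [41.5493, 92.3655],   # left mouth corner
    [70.7299, 92.2041],   # right mouth corner
], dtype=np.float64)


def similarity_transform(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Least-squares similarity (scale + rotation + translation), Umeyama.
    Returns a 2x3 matrix M such that src @ M[:, :2].T + M[:, 2] ~= dst."""
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - src_mean, dst - dst_mean
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))  # guard reflections
    R = U @ np.diag([1.0, d]) @ Vt
    scale = (S * np.array([1.0, d])).sum() / src_c.var(axis=0).sum()
    t = dst_mean - scale * (R @ src_mean)
    M = np.zeros((2, 3))
    M[:, :2] = scale * R
    M[:, 2] = t
    return M
```

The resulting 2×3 matrix is what gets handed to the affine warp (e.g. `cv2.warpAffine`) to produce the 112×112 crop.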
Verdict: No change needed. UFME’s alignment step matches current SOTA practice.

## 7. Infrastructure & Deployment

### UFME Choice: Docker, Kubernetes, ONNX Runtime (CPU AVX-512, optional TensorRT GPU)

### SOTA Assessment

The production stack is well-chosen:
- ONNX Runtime is the standard inference runtime (InsightFace, AdaFace, EdgeFace all ship ONNX)
- FP16 TensorRT provides 1.8× speedup with <0.05% accuracy drop
- INT8 quantisation provides 4× size reduction with minimal embedding error (+0.02%)
- AVX-512 on Intel CPUs is optimal for 512-dim dot products
### Recommendation: Add quantisation options to the inference adapter

- FP16 for GPU (TensorRT): default when GPU available
- INT8 for edge/cost-constrained: viable for all UFME models
- ONNX Runtime batch size tuning: ~3.2× speedup at batch=8 vs batch=1
### New: FAISS cuVS Integration (May 2025)

Meta and NVIDIA announced cuVS integration into FAISS in May 2025. This enables:
- GPU-accelerated IVF index build (4.7-8.1× speedup)
- CAGRA graph index for batch search (33-77× speedup)
- Drop-in replacement for CPU FAISS — same Python API
Consider for Tier 2 batch analytics workloads.
## Summary Scorecard

| Component | UFME Design | SOTA Best | Gap | Effort to Close |
|---|---|---|---|---|
| Architecture | Hexagonal, stateless | Hexagonal, stateless | None | - |
| Detection | RetinaFace | SCRFD_10G/34GF | Small (same ecosystem) | Adapter swap |
| Alignment | 5-point affine, 112×112 | 5-point affine, 112×112 | None | - |
| Recognition backbone | ViT-Base/Large | ViT with PCO/EHSM | Small (training technique) | Training config |
| Training loss | ArcFace | AdaFace / TopoFR | Medium | Training config |
| Embedding dim | 512 float32 | 512 float32 | None | - |
| Vector index | IndexIVFFlat | IndexIVFPQ + rerank | High (20→4 nodes) | Adapter config |
| Scatter-gather | Fan-out gRPC | Fan-out gRPC | None | - |
| Filtering | Pre-computed bitsets | Pre-computed bitsets | None | - |
| Index lifecycle | Epochal snapshots | Mutable (UFME is better) | UFME leads SOTA | - |
| Quality assessment | Lightweight CNN | OFIQ (ISO 29794-5) | High (ISO compliance) | New adapter |
| PAD | Texture analysis | Unified phys+digital | High (deepfake gap) | New models |
| Morphing detection | Not addressed | MADation (CLIP+LoRA) | High (border control) | New module |
| Inference runtime | ONNX Runtime | ONNX Runtime + TensorRT | Small | Config |
| Deployment | Docker + K8s | Docker + K8s | None | - |
## Recommended Priority Actions

### Must-Do (Before Production Release)

- Switch IndexIVFFlat → IndexIVFPQ with reranking. Reduces cluster from 20 to 4-6 nodes. Massive cost reduction.
- Integrate OFIQ for ISO 29794-5 quality compliance. Production deployments typically require this.
- Add unified PAD covering both physical spoofs and digital attacks (deepfakes).
- Add morphing attack detection — critical for border control and identity verification.
### Should-Do (Before Go-Live)

- Upgrade RetinaFace → SCRFD_10G (same ecosystem, better Hard accuracy).
- Switch ArcFace → AdaFace loss for quality-adaptive matching (helps with low-quality operational images).
- Add FP16/INT8 quantisation options to inference adapters.
- Plan NIST FATE evaluation for both PAD (ISO 30107-3) and Quality.
### Nice-to-Have (Future Roadmap)

- CAGRA/GPU integration via FAISS cuVS for batch video analytics.
- DiskANN tier for scaling beyond 500M vectors.
- LVFace PCO training for ViT stability at scale.
- Synthetic training data (FRCSyn approach) for bias mitigation.
## Validation of Architecture

The most important finding: UFME’s hexagonal architecture absorbs all SOTA changes gracefully.
Every recommended change is either:
- An adapter swap (RetinaFace → SCRFD, IndexIVFFlat → IndexIVFPQ, OFIQ integration)
- A training configuration change (ArcFace → AdaFace)
- A new module addition (morphing detection, unified PAD)
No recommended change touches the core domain layer. The ports, pure functions, frozen dataclasses, and composition patterns are all unchanged. This is the payoff of Hickey’s simplicity: when concerns are not complected, improvements to one concern don’t ripple through the system.
## Detection Takeaways for UFME

- For high-accuracy server-side detection: SCRFD_10G or SCRFD_34GF + 5-point alignment is the InsightFace standard stack — excellent ONNX support, well-maintained.
- For edge/lightweight deployment: YuNet (<1MB, <2ms on modern CPU) is remarkable. SCRFD_500M is another option.
- For mobile/browser: BlazeFace + MediaPipe FaceMesh is the production-proven path (Google-maintained, TFLite/WASM).
- Landmark choice: 5-point is the standard for recognition preprocessing (fast, sufficient). 68-point adds detailed geometry analysis. 478-point enables AR/expression/3D reconstruction.
- License caution: InsightFace buffalo_l pretrained models are non-commercial research only. Commercial deployment requires a license from InsightFace.
- 2025 trend: YOLOv12’s attention-centric design is emerging as the new YOLO baseline for general object/face detection. MediaPipe continues to dominate mobile face mesh.
## Vector Search Architecture for UFME at 200M+ Scale

Based on the vector search research, the recommended approach for UFME’s production face search system at a 200M+ gallery:

### Tier 1: Primary Index (Online Search)

**IVF-PQ + FAISS on CPU cluster**
- 200M × 512-dim compressed to ~12-25 GB with M=64 PQ
- nlist = √N ≈ 14,000-20,000 Voronoi cells
- nprobe = 64-128 at query time for 90-95% recall
- Rerank top-100 candidates with exact L2/cosine → final Recall@1 ≥ 97%
- Cost: 2-4 CPU nodes with 64 GB RAM each
### Tier 2: High-Throughput Batch Analytics

**CAGRA on GPU (A100 / H100) via FAISS cuVS integration**
- For batch identity resolution on video streams
- 33-77× faster than CPU HNSW at 90-95% recall
### Tier 3: Scale Beyond 200M (Future)

**DiskANN or SPANN when gallery exceeds 500M**
- SSD-backed; <5ms latency at 95% recall@1
- ~64 GB RAM sufficient even for 1B vectors
### Filtering Strategy

- Use partitioned IVF: create sub-indexes per watchlist/group for mandatory pre-filtering
- Or use Milvus/Qdrant with native metadata filtering for flexible post-filtering
- Avoid large-n post-filtering with HNSW (recall drops with selectivity)
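Partition-per-group pre-filtering can be sketched without any index library: each watchlist gets its own sub-gallery, and a query fans out only to the requested partitions. Brute-force numpy stands in for per-partition IVF indexes here, and the names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 512

# One sub-gallery per watchlist: filtering is guaranteed by construction
partitions = {
    "watchlist_a": rng.normal(size=(1_000, d)).astype(np.float32),
    "watchlist_b": rng.normal(size=(2_000, d)).astype(np.float32),
}
for g in partitions.values():
    g /= np.linalg.norm(g, axis=1, keepdims=True)  # L2-normalise in place


def search(query: np.ndarray, groups: list[str], k: int = 5):
    """Fan out to the requested partitions only, then merge top-k by score."""
    hits = []
    for name in groups:
        scores = partitions[name] @ query
        for i in np.argsort(-scores)[:k]:
            hits.append((name, int(i), float(scores[i])))
    return sorted(hits, key=lambda h: -h[2])[:k]
```

Because excluded partitions are never scored, recall cannot degrade with filter selectivity, which is exactly the failure mode of post-filtered HNSW.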
## Sources

See detailed research in:

- sota-detection.md — Face detection models and benchmarks
- sota-recognition.md — Recognition architectures and training losses
- sota-vector-search.md — Vector search at 200M+ scale
- sota-pad-quality.md — PAD and quality assessment methods