
Executive Summary

Date: 2026-02-20
Scope: Compare every UFME VISION.md design choice against current SOTA research and production systems.


UFME’s architecture is well-aligned with production best practices in several areas (hexagonal architecture, stateless processing, ONNX deployment). However, the specific model and infrastructure choices need updating to reflect 2024-2026 SOTA. The most significant gaps are:

| Priority | Gap | Impact |
|---|---|---|
| Critical | IndexIVFFlat requires ~400GB RAM at 200M scale | Infeasible without 20-node cluster; IVF-PQ reduces to ~12-25GB |
| High | RetinaFace is superseded by SCRFD | 3-4% Hard accuracy gain; same ONNX ecosystem |
| High | ArcFace alone is no longer SOTA loss | AdaFace/TopoFR offer better robustness on hard cases |
| High | No ISO-compliant quality assessment | OFIQ (ISO 29794-5) is the standard for production identity systems |
| Medium | PAD needs unified physical+digital detection | Deepfakes not addressed in VISION.md |
| Medium | No morphing attack detection | Critical for border control and identity verification |
| Low | 512-dim embedding is correct | Industry standard confirmed |
| Low | ViT architecture choice is validated | ViTs outperform CNNs on face recognition with sufficient data |

UFME Choice: RetinaFace (FPN + Context Modules)

| Metric | RetinaFace R50 | SCRFD_10G | YOLOv12-Face | Verdict |
|---|---|---|---|---|
| WiderFace Easy | 95.0% | 95.2% | ~95%+ | Parity |
| WiderFace Medium | 93.0% | 93.9% | ~93%+ | SCRFD wins |
| WiderFace Hard | 83.0% | 83.1% | ~80%+ | SCRFD wins |
| Inference (CPU VGA) | ~80ms | ~80ms | fast | Parity |
| ONNX export | Yes | Yes | Yes | All good |
| Landmarks | 5 | 5 | 5 | Same |
| Ecosystem | InsightFace | InsightFace | Ultralytics | Same maintainers |

Recommendation: Replace RetinaFace with SCRFD_10G (or SCRFD_34GF for max accuracy)

  • Same InsightFace ecosystem, same ONNX export path, same 5-point landmarks
  • ~1% better on Hard split (occluded/small faces — critical for operational use)
  • Drop-in replacement at the adapter level (no domain changes needed)
  • SCRFD_500M available for edge deployment if needed later
  • License note: InsightFace pretrained models require commercial license for production

Perfect fit. The hexagonal design means swapping detection models is an adapter change only. The InferencePort protocol is model-agnostic. This validates the architecture’s simplicity.
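A sketch of what that swap looks like at the port boundary. The `InferencePort` name comes from the VISION.md; the method signature, the `DetectedFace` shape, and the stub adapter below are illustrative assumptions, not UFME code:

```python
from dataclasses import dataclass
from typing import Protocol, Sequence

@dataclass(frozen=True)
class DetectedFace:
    bbox: tuple[float, float, float, float]   # x1, y1, x2, y2
    landmarks: Sequence[tuple[float, float]]  # 5 points for both models
    score: float

class InferencePort(Protocol):
    """Model-agnostic detection port: RetinaFace and SCRFD adapters both fit."""
    def detect(self, image_bytes: bytes) -> list[DetectedFace]: ...

class ScrfdAdapter:
    """Outbound adapter wrapping an SCRFD ONNX session (stubbed here)."""
    def detect(self, image_bytes: bytes) -> list[DetectedFace]:
        # Real code would preprocess, run the ONNX session, and decode anchors.
        return [DetectedFace((0.0, 0.0, 112.0, 112.0), [(0.0, 0.0)] * 5, 0.99)]

def detect_faces(port: InferencePort, image: bytes) -> list[DetectedFace]:
    """Domain code depends only on the port, never on a model name."""
    return port.detect(image)
```

Because both models emit the same 5-point landmarks, the downstream alignment stage is untouched by the swap.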


UFME Choice: Vision Transformer (ViT-Base or ViT-Large) + ArcFace loss, 512-dim float32

| Aspect | UFME (ViT) | CNN (IR-100) | Hybrid (EdgeNeXt) |
|---|---|---|---|
| IJB-C TAR@FAR=1e-4 | ~97.5% (TransFace) | ~97.0% (Glint360K) | 94.85% (EdgeFace) |
| Occlusion resilience | Excellent (global attention) | Good | Good |
| Data requirement | High (>1M identities) | Moderate | Low |
| ONNX export | Yes (opset ≥14) | Yes | Yes |
| Inference speed | Slower | Faster | Fastest |

Verdict: ViT choice is validated. ViTs outperform CNNs in 13/15 evaluations when pretrained on large data. UFME’s stated advantage (correlating distant facial features from the first layer) is confirmed by research. The key risk is ViT data hunger — mitigated by TransFace’s EHSM/DPAP or LVFace’s PCO training techniques.

| Loss | Year | Key Advantage | IJB-C TAR@1e-4 | Best For |
|---|---|---|---|---|
| ArcFace | 2019 | Clean geodesic margin | ~97.0% | Strong baseline |
| AdaFace | 2022 | Quality-adaptive margin | ~97.4% | Low-quality/surveillance |
| ElasticFace | 2022 | Stochastic margin | SOTA 7/9 benchmarks | General robustness |
| TopoFR | 2024 | Topological alignment | SOTA+ | Structure preservation |
| LVFace PCO | 2025 | Progressive cluster optimisation | SOTA | Large-scale training |

Verdict: ArcFace is a solid baseline but no longer the best standalone loss.

Recommendation: Adopt AdaFace loss, keep ViT backbone, keep 512-dim

  • AdaFace’s quality-adaptive margin is particularly relevant for production identity systems: immigration images vary wildly in quality (passport photos vs CCTV captures vs aged documents)
  • The feature norm as quality proxy aligns with UFME’s quality pipeline — AdaFace internally does what UFME’s quality gate does externally
  • If training from scratch: use LVFace’s PCO for ViT training stability
  • If fine-tuning InsightFace pretrained: AdaFace loss is a drop-in replacement for ArcFace
  • 512-dim embedding is confirmed as industry standard — no change needed

Perfect fit. The loss function is a training-time concern, not a runtime concern. The ViT architecture and 512-dim output are already in the design. Changing the loss requires zero runtime code changes.
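For reference, the quality-adaptive mechanism can be written out. Notation follows the AdaFace paper; treat this as a sketch rather than a verified transcription:

```latex
% Quality proxy: batch-statistics-normalised feature norm, clipped to [-1, 1]
\hat{z} = \mathrm{clip}\!\left(\frac{\lVert z \rVert - \mu_z}{\sigma_z / h},\ -1,\ 1\right)
% The margin's angular and additive components shift with the quality proxy
g_{\mathrm{angle}} = -m\,\hat{z}, \qquad g_{\mathrm{add}} = m\,\hat{z} + m
% Target logit (cf. ArcFace's fixed cos(\theta_{y_i} + m))
f(\theta_{y_i}) = s\left(\cos(\theta_{y_i} + g_{\mathrm{angle}}) - g_{\mathrm{add}}\right)
```

The feature norm ‖z‖ acts as the quality proxy, so the margin adapts per sample and hard low-quality images are not over-penalised during training.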


UFME Choice: Sharded FAISS with IndexIVFFlat, 20 nodes, 10M vectors/node

| Approach | RAM (200M × 512d) | Recall@1 | QPS (CPU) | Filtering | Production |
|---|---|---|---|---|---|
| UFME: IVF-Flat | ~400 GB | 95-99% | ~2K | Bitset | Library |
| IVF-PQ (M=64) | ~12-25 GB | 85-95% (99%+ w/ rerank) | ~5-10K | Bitset | Library |
| HNSW | ~550 GB | 97-99.5% | ~3K | Limited | Library |
| DiskANN | ~64 GB + SSD | ≥95% | <5ms latency | Via host DB | Medium |
| ScaNN | ~12-25 GB | Best-in-class | ~2× next fastest | Adaptive | Medium |
| Milvus | Configurable | Configurable | High | Excellent | High |

The UFME design uses IndexIVFFlat, which stores raw float32 vectors: at 200M × 512-dim × 4 bytes, that is ~400 GB of RAM. The design acknowledges this and distributes across 20 nodes with 10M vectors each (~20 GB/node). This works but is expensive.

IVF-PQ would reduce total RAM to ~12-25 GB (fits on 2-4 nodes instead of 20), with recall recoverable to ~99% via reranking top-100 candidates against exact vectors.
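The rerank step is the key to recovering recall. A minimal sketch in NumPy (the candidate IDs stand in for an IVF-PQ search result; this illustrates the logic, not the FAISS API):

```python
import numpy as np

def rerank_exact(query: np.ndarray,
                 candidate_ids: np.ndarray,
                 exact_vectors: np.ndarray,
                 top_k: int = 10):
    """Re-score compressed-index candidates against stored float32 vectors.

    query          : (512,) L2-normalised embedding
    candidate_ids  : e.g. top-100 ids returned by the IVF-PQ search
    exact_vectors  : (N, 512) float32 gallery (can be memory-mapped)
    """
    cands = exact_vectors[candidate_ids]
    scores = cands @ query               # inner product == cosine on unit vectors
    order = np.argsort(-scores)[:top_k]
    return candidate_ids[order], scores[order]
```

The exact vectors can live on disk (memory-mapped), so the rerank adds SSD reads for ~100 vectors per query rather than keeping 400 GB resident.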

Recommendation: Use IVF-PQ as primary index, with exact-vector reranking


Tier 1 (Primary): IVF-PQ with M=64 subvectors

  • 200M vectors compressed to ~12-25 GB total
  • nlist ≈ 14,000-20,000 Voronoi cells
  • nprobe = 64-128 for 90-95% initial recall
  • Rerank top-100 candidates against stored exact vectors → Recall@1 ≥ 97%
  • 4-6 nodes instead of 20 (significant cost reduction)

Tier 2 (Future scale): DiskANN for >500M gallery

  • SSD-backed, ~64 GB RAM even for 1B vectors
  • <5ms latency at 95% recall

Tier 3 (Batch analytics): CAGRA/GPU via FAISS cuVS integration

  • For batch video surveillance identity resolution
  • 33-77× throughput vs CPU HNSW

Beyond the index type itself, several VISION.md choices are confirmed:

  • Scatter-gather topology: Confirmed as the correct pattern. All production systems at this scale use distributed fan-out.
  • Inner Product metric: Correct for L2-normalised vectors (equivalent to cosine).
  • Pre-filtering bitsets: Aligned with SOTA. FAISS supports masked search natively via ID selectors.
  • Avoiding Milvus/Qdrant: The VISION.md’s concern about “unacceptable network and translation overhead” from commercial vector DBs is valid for a latency-sensitive biometric system. Raw FAISS with custom distribution is the right choice.
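The metric equivalence is easy to verify directly: for L2-normalised vectors, plain inner product (what an IP index computes) equals cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(7)
a = rng.normal(size=512)
b = rng.normal(size=512)

# Cosine similarity on the raw vectors
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# L2-normalise first, then a plain inner product (what the IP index computes)
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)
ip = a_hat @ b_hat

assert np.isclose(cosine, ip)
```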

The implementation plan’s event-sourced, immutable-snapshot approach to index management is not standard in SOTA — it is architecturally superior. Most production FAISS deployments use mutable in-place updates, which creates exactly the complecting problems Hickey identifies. The atomic-swap snapshot model is a genuine innovation over standard practice.
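The snapshot model reduces to an immutable index plus an atomic reference swap. A minimal sketch (class and method names are hypothetical, not UFME code):

```python
import threading

class IndexSnapshot:
    """Immutable, fully-built search index for one epoch."""
    def __init__(self, epoch: int, vectors: dict):
        self.epoch = epoch
        self._vectors = dict(vectors)   # copied at build time; never mutated
    def search(self, key):
        return self._vectors.get(key)

class SnapshotHolder:
    """Readers always see one consistent snapshot; rebuilds happen off to the side."""
    def __init__(self, snapshot: IndexSnapshot):
        self._lock = threading.Lock()
        self._current = snapshot
    def current(self) -> IndexSnapshot:
        return self._current            # attribute read is atomic in CPython
    def swap(self, new_snapshot: IndexSnapshot) -> None:
        with self._lock:                # single writer; readers never block
            self._current = new_snapshot
```

In-flight queries keep using the snapshot they started with; the swap is the only synchronisation point, which is what makes the update path simple.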

Good fit with one change. Swap IndexIVFFlat for IndexIVFPQ in the outbound FAISS adapter. The domain layer, ports, pipeline stages, and scatter-gather coordination are all unchanged. This is exactly the kind of change the hexagonal architecture was designed to absorb.


UFME Choice: “Auxiliary lightweight network” for blur, illumination, yaw/pitch/roll

| Method | Type | ISO Compliant | Best For |
|---|---|---|---|
| UFME (lightweight net) | CNN regression | No | Basic gating |
| MagFace | Implicit in FR loss | No | Zero-overhead |
| CR-FIQA | Certainty ratio | No | Best AUC |
| SDD-FIQA | Wasserstein pseudo-labels | No | Generalisation |
| OFIQ | BSI reference impl | Yes (29794-5) | ISO compliance |
| ViT-FIQA | Learnable quality token | No | ViT integration |

For production identity verification systems, ISO/IEC 29794-5 compliance is likely mandatory. OFIQ (Open Source Face Image Quality) is the BSI/eu-LISA reference implementation, specifically designed for border control and ID systems. It is:

  • Open source (C/C++)
  • The only ISO 29794-5 compliant implementation
  • Evaluated by NIST FATE Quality
  • Maintained by BSI + eu-LISA (the EU biometric infrastructure agency)
Recommended quality stack:

  1. OFIQ for ISO compliance — Run as the primary quality gate for all images. Produces ISO-standard quality components (illumination, pose, focus, expression, occlusion, etc.). Satisfies the QualityPort protocol.
  2. MagFace-style implicit quality — Use the ViT feature norm as an additional quality signal during AdaFace training. This is free (no extra model) and provides a quality proxy correlated with recognition performance.
  3. CR-FIQA or ViT-FIQA for research benchmarking — Useful for internal quality distribution analysis but not required for production.

Perfect fit. The complecting audit already separated quality measurement from quality policy. OFIQ is a pure measurement function that satisfies QualityPort. The quality gate remains a separate configurable step.
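The measurement/policy split can be sketched as follows. The `QualityPort` name comes from the source; the component fields and threshold value are illustrative assumptions, and a real adapter would call the OFIQ library instead of the stub:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QualityScores:
    """Pure measurement output (a simplified OFIQ-style component vector)."""
    illumination: float
    pose: float
    focus: float

def measure_quality(image_id: str) -> QualityScores:
    # Stand-in for the OFIQ adapter; fixed values for illustration.
    return QualityScores(illumination=0.9, pose=0.8, focus=0.95)

def quality_gate(scores: QualityScores, min_score: float = 0.5) -> bool:
    """Policy: a separate, configurable decision over the measurements."""
    return min(scores.illumination, scores.pose, scores.focus) >= min_score
```

Because the gate is a pure function over the scores, tightening policy (e.g. for enrollment vs search) is a configuration change, not a model change.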


UFME Choice: “Auxiliary AI model” for spatial inconsistencies, Moiré patterns, texture degradation

| Approach | Type | Covers Physical | Covers Digital | Cross-Domain |
|---|---|---|---|---|
| UFME (texture analysis) | CNN | Yes | Partial | Weak |
| CDCN++ | CDC + NAS | Yes | No | Moderate |
| S-Adapter | ViT adapter | Yes | No | Strong |
| UniAttack | Unified model | Yes | Yes | Good |
| InstructFLIP | VLM | Yes | Yes | Strong |
| MADation | CLIP + LoRA | No | Morphing only | Strong |
| NIST FATE participants | Various | Evaluated | Evaluated | Evaluated |
Gaps in the VISION.md PAD design:

  1. No deepfake detection — VISION.md mentions “deepfakes” but the described method (Moiré patterns, texture) is primarily physical-PAD. Digital attacks (face swap, GAN faces, reenactment) require different detection approaches.
  2. No morphing attack detection (MAD) — Critical for border control and identity verification. Morphing attacks blend two identities into one passport photo, making this a high-priority threat.
  3. No domain generalisation strategy — PAD models notoriously overfit to training conditions. Cross-dataset HTER remains 6-15% for most methods.
  4. No ISO 30107-3 compliance pathway — NIST FATE PAD uses ISO 30107-3 metrics. The UFME PAD description doesn’t reference compliance.

Recommendation: Multi-layer PAD with unified physical+digital detection

  1. Primary PAD: Unified physical+digital detector

    • Deploy a ViT-based unified model (e.g., S-Adapter or UniAttack approach)
    • Single model handles print, replay, mask, deepfake, face swap
    • Satisfies the PAD port as a pure measurement function
  2. Morphing Attack Detection: MADation (CLIP + LoRA)

    • Separate module for morphing detection specifically
    • Critical for document-based enrollment (passport photos)
    • Foundation model approach provides strong generalisation
  3. ISO 30107-3 compliance

    • Evaluate through NIST FATE PAD program
    • Report APCER/BPCER/ACER per ISO standard

Good fit. The complecting audit already identified that PAD should be a composable stage with measurement separated from decision. Multiple PAD modules (physical, digital, morphing) can each satisfy a PadPort protocol independently. The pipeline orchestrator composes them.
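That composition can be sketched with a shared protocol. The `PadPort` name comes from the source; the score semantics, stub detectors, and max-combination rule are illustrative assumptions:

```python
from typing import Protocol

class PadPort(Protocol):
    """Pure measurement: returns an attack-probability score in [0, 1]."""
    def score(self, face_crop: bytes) -> float: ...

class PhysicalPad:
    def score(self, face_crop: bytes) -> float:
        return 0.1   # stub: print/replay/mask detector

class DigitalPad:
    def score(self, face_crop: bytes) -> float:
        return 0.2   # stub: deepfake/face-swap detector

class MorphingPad:
    def score(self, face_crop: bytes) -> float:
        return 0.05  # stub: MADation-style morphing detector

def composed_pad_score(detectors: list[PadPort], crop: bytes) -> float:
    """Orchestrator: take the worst (max) score across independent modules."""
    return max(d.score(crop) for d in detectors)
```

Each module stays a pure measurement function; the accept/reject decision against an APCER/BPCER operating point remains a separate policy step.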


UFME Choice: Affine similarity transform to 112×112 pixel grid using 5 landmarks


This is correct and unchanged since 2019. The standard alignment pipeline is:

  1. Detect 5 landmarks (eyes, nose, mouth corners)
  2. Compute similarity transform to canonical template positions
  3. Apply affine warp to produce 112×112 crop

Research confirms 5-point alignment is sufficient for recognition. 68-point adds geometry analysis but doesn’t improve recognition accuracy. 478-point (MediaPipe FaceMesh) is for AR/expression, not recognition.
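Step 2 of the pipeline is the Umeyama least-squares similarity estimate. A pure-NumPy sketch; the 5-point template coordinates are the widely used ArcFace 112×112 reference, included here for illustration:

```python
import numpy as np

# ArcFace-style canonical 5-point template for a 112x112 crop
# (left eye, right eye, nose tip, left mouth corner, right mouth corner)
TEMPLATE = np.array([[38.2946, 51.6963], [73.5318, 51.5014],
                     [56.0252, 71.7366], [41.5493, 92.3655],
                     [70.7299, 92.2041]], dtype=np.float64)

def similarity_transform(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Umeyama similarity fit (rotation + uniform scale + translation).
    Returns a 2x3 affine matrix mapping src points onto dst."""
    src_mean, dst_mean = src.mean(0), dst.mean(0)
    src_c, dst_c = src - src_mean, dst - dst_mean
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    D = np.diag([1.0, d])                      # guard against reflections
    R = U @ D @ Vt
    scale = np.trace(np.diag(S) @ D) / src_c.var(0).sum()
    t = dst_mean - scale * R @ src_mean
    return np.hstack([scale * R, t[:, None]])  # usable with cv2.warpAffine(img, M, (112, 112))
```

Step 3 is then a single affine warp of the input image with the returned matrix.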

Verdict: No change needed. UFME’s alignment is aligned with SOTA.


UFME Choice: Docker, Kubernetes, ONNX Runtime (CPU AVX-512, optional TensorRT GPU)


Production stack is well-chosen:

  • ONNX Runtime is the standard inference runtime (InsightFace, AdaFace, EdgeFace all ship ONNX)
  • FP16 TensorRT provides 1.8× speedup with <0.05% accuracy drop
  • INT8 quantisation provides 4× size reduction with minimal embedding error (+0.02%)
  • AVX-512 on Intel CPUs is optimal for 512-dim dot products

Recommendation: Add quantisation options to the inference adapter

  • FP16 for GPU (TensorRT): default when GPU available
  • INT8 for edge/cost-constrained: viable for all UFME models
  • ONNX Runtime batch size tuning: ~3.2× speedup at batch=8 vs batch=1
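The robustness of similarity scores under INT8 is easy to sanity-check on synthetic vectors. This uses simple per-vector symmetric quantisation; the figures above refer to full-model quantisation, so this only illustrates why cosine scores tolerate 8-bit precision:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 512)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # unit-norm embeddings

# Symmetric per-vector INT8 quantisation: float32 -> int8 -> float32
scale = np.abs(emb).max(axis=1, keepdims=True) / 127.0
q = np.clip(np.round(emb / scale), -127, 127).astype(np.int8)
deq = q.astype(np.float32) * scale

def cos(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Compare pairwise cosine scores before and after quantisation
errs = [abs(cos(emb[i], emb[j]) - cos(deq[i], deq[j]))
        for i in range(10) for j in range(i + 1, 10)]
max_err = max(errs)   # stays far below typical match thresholds
```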

Meta and NVIDIA announced cuVS integration into FAISS in May 2025. This enables:

  • GPU-accelerated IVF index build (4.7-8.1× speedup)
  • CAGRA graph index for batch search (33-77× speedup)
  • Drop-in replacement for CPU FAISS — same Python API

Consider for Tier 2 batch analytics workloads.


| Component | UFME Design | SOTA Best | Gap | Effort to Close |
|---|---|---|---|---|
| Architecture | Hexagonal, stateless | Hexagonal, stateless | None | - |
| Detection | RetinaFace | SCRFD_10G/34GF | Small (same ecosystem) | Adapter swap |
| Alignment | 5-point affine, 112×112 | 5-point affine, 112×112 | None | - |
| Recognition backbone | ViT-Base/Large | ViT with PCO/EHSM | Small (training technique) | Training config |
| Training loss | ArcFace | AdaFace / TopoFR | Medium | Training config |
| Embedding dim | 512 float32 | 512 float32 | None | - |
| Vector index | IndexIVFFlat | IndexIVFPQ + rerank | High (20→4 nodes) | Adapter config |
| Scatter-gather | Fan-out gRPC | Fan-out gRPC | None | - |
| Filtering | Pre-computed bitsets | Pre-computed bitsets | None | - |
| Index lifecycle | Epochal snapshots | Mutable (UFME is better) | UFME leads SOTA | - |
| Quality assessment | Lightweight CNN | OFIQ (ISO 29794-5) | High (ISO compliance) | New adapter |
| PAD | Texture analysis | Unified phys+digital | High (deepfake gap) | New models |
| Morphing detection | Not addressed | MADation (CLIP+LoRA) | High (border control) | New module |
| Inference runtime | ONNX Runtime | ONNX Runtime + TensorRT | Small | Config |
| Deployment | Docker + K8s | Docker + K8s | None | - |

Immediate priorities:

  1. Switch IndexIVFFlat → IndexIVFPQ with reranking. Reduces cluster from 20 to 4-6 nodes. Massive cost reduction.
  2. Integrate OFIQ for ISO 29794-5 quality compliance. Production deployments typically require this.
  3. Add unified PAD covering both physical spoofs and digital attacks (deepfakes).
  4. Add morphing attack detection — critical for border control and identity verification.

Near-term improvements:

  1. Upgrade RetinaFace → SCRFD_10G (same ecosystem, better Hard accuracy).
  2. Switch ArcFace → AdaFace loss for quality-adaptive matching (helps with low-quality operational images).
  3. Add FP16/INT8 quantisation options to inference adapters.
  4. Plan NIST FATE evaluation for both PAD (ISO 30107-3) and Quality.

Longer-term options:

  1. CAGRA/GPU integration via FAISS cuVS for batch video analytics.
  2. DiskANN tier for scaling beyond 500M vectors.
  3. LVFace PCO training for ViT stability at scale.
  4. Synthetic training data (FRCSyn approach) for bias mitigation.

The most important finding: UFME’s hexagonal architecture absorbs all SOTA changes gracefully.

Every recommended change is either:

  • An adapter swap (RetinaFace → SCRFD, IndexIVFFlat → IndexIVFPQ, OFIQ integration)
  • A training configuration change (ArcFace → AdaFace)
  • A new module addition (morphing detection, unified PAD)

No recommended change touches the core domain layer. The ports, pure functions, frozen dataclasses, and composition patterns are all unchanged. This is the payoff of Hickey’s simplicity: when concerns are not complected, improvements to one concern don’t ripple through the system.


Detection stack notes:

  1. For high-accuracy server-side detection: SCRFD_10G or SCRFD_34GF + 5-point alignment is the InsightFace standard stack — excellent ONNX support, well-maintained.
  2. For edge/lightweight deployment: YuNet (<1MB, <2ms on modern CPU) is remarkable. SCRFD_500M is another option.
  3. For mobile/browser: BlazeFace + MediaPipe FaceMesh is the production-proven path (Google-maintained, TFLite/WASM).
  4. Landmark choice: 5-point is the standard for recognition preprocessing (fast, sufficient). 68-point adds detailed geometry analysis. 478-point enables AR/expression/3D reconstruction.
  5. License caution: InsightFace buffalo_l pretrained models are non-commercial research only. Commercial deployment requires a license from InsightFace.
  6. 2025 trend: YOLOv12’s attention-centric design is emerging as the new YOLO baseline for general object/face detection. MediaPipe continues to dominate mobile face mesh.

Vector Search Architecture for UFME at 200M+ Scale


Based on the vector search research, the recommended approach for UFME’s production face search system at 200M+ gallery:

IVF-PQ + FAISS on CPU cluster

  • 200M × 512-dim compressed to ~12-25 GB with M=64 PQ
  • nlist = √N ≈ 14,000-20,000 Voronoi cells
  • nprobe = 64-128 at query time for 90-95% recall
  • Rerank top-100 candidates with exact L2/cosine → final Recall@1 ≥ 97%
  • Cost: 2-4 CPU nodes with 64 GB RAM each
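The sizing arithmetic behind those bullets can be reproduced directly (constants from the bullets; PQ code size assumes 64 subquantisers at 1 byte each and ignores coarse-index overhead):

```python
import math

N = 200_000_000           # gallery size
DIM, FLOAT_BYTES = 512, 4
PQ_M = 64                 # subvectors -> 64 bytes per code (8-bit codebooks)

flat_gb = N * DIM * FLOAT_BYTES / 1e9   # raw float32 storage (IVF-Flat)
pq_gb = N * PQ_M / 1e9                  # PQ codes only (IVF-PQ)
nlist = int(math.sqrt(N))               # ~14,142 Voronoi cells

# flat_gb ~ 409.6 GB vs pq_gb ~ 12.8 GB: roughly a 32x compression
```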

CAGRA on GPU (A100 / H100) via FAISS cuVS integration

  • For batch identity resolution on video streams
  • 33-77x faster than CPU HNSW at 90-95% recall

DiskANN or SPANN when gallery exceeds 500M

  • SSD-backed; <5ms latency at 95% recall@1
  • ~64 GB RAM sufficient even for 1B vectors

Filtering strategy:

  • Use partitioned IVF: create sub-indexes per watchlist/group for mandatory pre-filtering
  • Or use Milvus/Qdrant with native metadata filtering for flexible post-filtering
  • Avoid large-n post-filtering with HNSW (recall drops with selectivity)

See detailed research in: