
Face Detection

Research compiled: 2026-02-20


The standard benchmark for face detection is WIDER FACE, containing 32,203 images with 393,703 annotated faces across three difficulty splits: Easy, Medium, and Hard. The Hard split is most challenging (small, occluded, low-resolution faces). All mAP values below are AP (average precision) on the validation set unless stated otherwise.
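The AP numbers throughout this document can be understood via a simplified single-class average-precision computation at a fixed IoU threshold. This is an illustrative sketch, not the official WIDER FACE evaluation toolkit (which uses per-difficulty ground-truth lists and sweeps the score threshold):

```python
# Simplified single-class AP at a fixed IoU threshold (illustrative only;
# the official WIDER FACE toolkit handles difficulty splits and ignores).

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def average_precision(detections, ground_truth, iou_thr=0.5):
    """detections: list of (score, box); ground_truth: non-empty list of boxes."""
    detections = sorted(detections, key=lambda d: -d[0])  # highest score first
    matched = set()
    tp = []
    for _, box in detections:
        # greedily match each detection to its best unmatched ground truth
        best, best_i = 0.0, -1
        for i, gt in enumerate(ground_truth):
            if i not in matched and iou(box, gt) > best:
                best, best_i = iou(box, gt), i
        if best >= iou_thr:
            matched.add(best_i)
            tp.append(1)
        else:
            tp.append(0)
    # accumulate area under the precision-recall curve
    ap, ctp, prev_recall = 0.0, 0, 0.0
    for k, hit in enumerate(tp):
        ctp += hit
        recall = ctp / len(ground_truth)
        precision = ctp / (k + 1)
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap
```

A detector that finds both faces with its two highest-scoring boxes scores AP = 1.0 here even if a third low-scoring false positive trails behind, which is why AP rewards good score ranking rather than raw detection counts.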


2.1 SCRFD (Sample and Computation Redistribution for Efficient Face Detection)

  • Paper: arXiv:2105.04714 — published ICLR 2022
  • Source: InsightFace / deepinsight
  • Key idea: Redistributes training samples to harder detection stages (Sample Redistribution) and reallocates compute between backbone/neck/head (Computation Redistribution).

Model family performance on WIDER Face validation:

| Model | GFLOPs | Easy | Medium | Hard | Notes |
|---|---|---|---|---|---|
| SCRFD_500M | 0.5 | 90.57 | 88.12 | 68.51 | 0.57M params, ultra-lightweight |
| SCRFD_2.5GF | 2.5 | ~93 | ~91 | 77.87 | Post SR+CR |
| SCRFD_10G | 10 | 95.16 | 93.87 | 83.05 | Strong accuracy |
| SCRFD_34GF | 34 | ~96 | ~95 | ~90 | Beats TinaFace by 3.86% AP on Hard, >3× faster |

Inference speed: SCRFD_500MF and RetinaFace-MobileNet0.25 both achieve ~42–46 ms on VGA (640×480) images on CPU.

Landmarks: Outputs 5 keypoints (eyes, nose, mouth corners).

Production readiness: Yes — ONNX export supported. Part of InsightFace. Widely used in production pipelines. Note: buffalo_l and similar pretrained packages are non-commercial only; a commercial license is required for production use.
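The 5 keypoints exist chiefly to align faces before feeding them to a recognition model. A minimal sketch of that alignment step, assuming numpy; the 112×112 five-point template values below are those commonly quoted for InsightFace/ArcFace-style pipelines and should be treated as an assumption, as should the helper name `similarity_transform`:

```python
import numpy as np

# Reference 5-point template for a 112x112 aligned crop (values as commonly
# used in InsightFace-style recognition pipelines; treat as an assumption).
ARCFACE_TEMPLATE = np.array([
    [38.2946, 51.6963],  # eye
    [73.5318, 51.5014],  # eye
    [56.0252, 71.7366],  # nose tip
    [41.5493, 92.3655],  # mouth corner
    [70.7299, 92.2041],  # mouth corner
], dtype=np.float64)

def similarity_transform(src, dst):
    """Least-squares similarity transform (Umeyama) mapping src -> dst.
    Returns a 2x3 affine matrix usable with cv2.warpAffine."""
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - src_mean, dst - dst_mean
    cov = dst_c.T @ src_c / len(src)          # cross-covariance, 2x2
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))        # guard against reflections
    D = np.diag([1.0, d])
    R = U @ D @ Vt                            # optimal rotation
    scale = np.trace(np.diag(S) @ D) / src_c.var(axis=0).sum()
    t = dst_mean - scale * R @ src_mean
    M = np.empty((2, 3))
    M[:, :2] = scale * R
    M[:, 2] = t
    return M
```

In practice the detector's 5 keypoints go in as `src` and the template as `dst`, and the resulting matrix warps the frame into the 112×112 crop a recognition model expects.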


2.2 RetinaFace

Performance:

| Backbone | Easy | Medium | Hard |
|---|---|---|---|
| MobileNet-0.25 | ~91.7 | ~89 | ~72 |
| ResNet-50 | ~95 | ~93 | ~83 |

Inference speed: RetinaFace-MobileNet0.25: ~42 ms on VGA (CPU); ResNet-50 variant is slower.

Landmarks: 5 keypoints. Also predicts 3D dense alignment (68 points via additional head in some variants).

Production readiness: Yes — ONNX export. Well-established, many community wrappers.


2.3 YOLO5Face / YOLOv8-Face / YOLOv11-Face / YOLOv12-Face


YOLO-based detectors adapted for face detection are actively maintained community efforts.

  • Paper: arXiv:2105.12931
  • YOLOv5x6 backbone achieves 96.67 / 95.08 / 86.55 (Easy/Medium/Hard) — among the best at time of release.
  • Outputs 5 facial landmarks.
| Model | Easy | Medium | Hard |
|---|---|---|---|
| YOLOv8n-Face | 94.5–94.6 | 92.2–92.3 | 79.0–79.6 |
| YOLOv8-Lite-s | 93.4 | 91.2 | 78.6 |
| YOLOv8-Lite-t | 90.4 | 87.7 | 73.3 |
  • Outputs 5 landmarks.
  • ONNX export: Yes (ONNX models are ~2× size of PyTorch due to serialization format).
  • YOLOv11 (nano) marginally outperforms YOLOv12 on precision/mAP50 for face detection.
  • YOLOv12 introduces Area Attention (A2) module + FlashAttention — achieves higher mAP at all scales with similar or better latency.
  • ONNX YOLOv12-Face models released December 2025.
  • Forensic face detection study (2025): YOLOv12 achieves superior latency and precision vs YOLOv8/YOLOv10 baselines on WIDER FACE subset.

Production readiness: Yes — all YOLO variants support ONNX/CoreML/TFLite export.
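Whatever the YOLO variant, the decoded face boxes still need greedy non-maximum suppression, and some exported ONNX models leave that step to the caller. A dependency-free sketch of the standard greedy algorithm:

```python
# Greedy non-maximum suppression: keep the highest-scoring box, drop all
# boxes overlapping it above iou_thr, repeat on the survivors.

def nms(boxes, scores, iou_thr=0.45):
    """boxes: list of (x1, y1, x2, y2); returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        survivors = []
        for j in order:
            # intersection-over-union between box i and box j
            ix1 = max(boxes[i][0], boxes[j][0]); iy1 = max(boxes[i][1], boxes[j][1])
            ix2 = min(boxes[i][2], boxes[j][2]); iy2 = min(boxes[i][3], boxes[j][3])
            inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
            area_i = (boxes[i][2] - boxes[i][0]) * (boxes[i][3] - boxes[i][1])
            area_j = (boxes[j][2] - boxes[j][0]) * (boxes[j][3] - boxes[j][1])
            if inter / (area_i + area_j - inter) < iou_thr:
                survivors.append(j)
        order = survivors
    return keep
```

For crowded scenes (the WIDER FACE Hard split) the `iou_thr` choice matters: too low merges adjacent faces, too high lets duplicates through.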


2.4 YuNet

Performance:

| Metric | Value |
|---|---|
| WIDER Face Easy (AP) | 88.44% |
| WIDER Face Medium (AP) | 86.56% |
| WIDER Face Hard (AP) | 75.03% |
| WIDER Face Hard mAP (single-scale) | 81.1% |

Speed: ~1.6 ms per frame at 320×320 on an Intel i7-12700K (CPU); roughly 5 ms vs. ~25 ms for traditional cascade-based methods.

Model size: Only 75,856 parameters — less than 1/5 of other small detectors.

Landmarks: 5 keypoints.

Production readiness: Excellent — ships natively with OpenCV DNN module, no extra dependencies. Zero-cost deployment. Ideal for edge/serverless. ONNX export supported.
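Fixed-input detectors such as YuNet at 320×320 need the frame scaled and padded to the network size, and the resulting boxes mapped back to original image coordinates. A minimal letterbox sketch, using nearest-neighbor indexing so it needs only numpy; a real pipeline would use `cv2.resize`, and the helper names `letterbox`/`unletterbox_box` are hypothetical:

```python
import numpy as np

def letterbox(img, size=320):
    """Scale img to fit a size x size canvas (preserving aspect ratio),
    pad the remainder with zeros, and return (canvas, scale)."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    nh, nw = int(round(h * scale)), int(round(w * scale))
    # nearest-neighbor resize via index arrays (keeps the sketch numpy-only)
    ys = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    canvas = np.zeros((size, size) + img.shape[2:], dtype=img.dtype)
    canvas[:nh, :nw] = img[ys][:, xs]
    return canvas, scale

def unletterbox_box(box, scale):
    """Map an (x1, y1, x2, y2) box from letterboxed back to original coords."""
    return tuple(v / scale for v in box)
```

Keeping the scale factor around is the whole trick: detection happens in canvas space, while downstream consumers (cropping, drawing) want original-image coordinates.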


2.5 BlazeFace / MediaPipe Face Detection + Face Mesh

  • Source: Google / MediaPipe
  • Architecture: BlazeFace (lightweight, SSD-inspired, GPU-friendly anchor scheme) + separate 3D landmark model.

BlazeFace performance:

  • Competitive accuracy to heavier models.
  • 200–1000+ FPS on high-end mobile phones (GPU-accelerated).
  • Designed for real-time mobile/browser inference (TFLite, WebAssembly, GPU delegate).

Face Mesh (landmark model):

  • Outputs 468 (legacy) or 478 3D face landmarks in real-time on mobile.
  • Operates on face crops from BlazeFace detector.
  • Includes iris landmarks in the 478-point version.

Landmarks: 5 (BlazeFace detector) → then 478 3D (Face Mesh landmark model).

Production readiness: Excellent — Google-maintained, used in billions of devices. TFLite + WASM. Not ONNX natively (TFLite format; community conversions exist).
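MediaPipe landmark models emit coordinates normalized to [0, 1] by image width and height, so converting to pixels is a per-point scale. The sketch below assumes landmarks arrive as plain (x, y, z) tuples rather than the protobuf objects the MediaPipe Python API actually returns:

```python
# Convert normalized MediaPipe-style landmarks to pixel coordinates.
# Assumption: landmarks are plain (x, y, z) tuples in [0, 1] space.

def to_pixels(landmarks, img_w, img_h):
    """Scale normalized (x, y, z) landmarks to pixel coordinates.
    z follows MediaPipe's convention of being scaled like x (by width)."""
    return [(x * img_w, y * img_h, z * img_w) for x, y, z in landmarks]
```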


2.6 InsightFace Buffalo Pack (Production Bundle)


The buffalo_l model pack bundles:

  • Detection: SCRFD_10G (ONNX)
  • 3D Landmark: 1k3d68.onnx — 68 3D landmark predictor
  • Recognition: ArcFace R100 (ONNX)
  • Attribute: gender/age model

Key detail: buffalo_l is widely used in open-source projects (e.g., immich) but is non-commercial research only. Commercial licensing available separately.


2.7 ASFD (Automatic and Scalable Face Detector)

  • Paper: arXiv:2201.10781 — Tencent
  • ASFD-D6 achieves ~96.7 / 96.2 / 92.1 (Easy/Medium/Hard test set) — near top of Papers with Code leaderboard.
  • Large model (ResNeXt + NAS-searched neck), primarily a research benchmark leader.
  • Not widely used in production pipelines.

3. Face Landmark Detection: Comparison by Point Count

| Points | Model Examples | Use Cases | Tradeoffs |
|---|---|---|---|
| 5 | SCRFD, RetinaFace, YuNet, YOLOv8-Face, BlazeFace | Face alignment for recognition, crop/warp preprocessing | Fastest; sufficient for alignment & recognition |
| 68 | dlib, 1k3d68 (InsightFace), face-alignment lib (adrianbulat) | Facial analysis, expression, detailed geometry | ~99.7 MB (dlib); 8–10% slower than 5-point |
| 468/478 | MediaPipe Face Mesh, TF face-landmarks-detection | Face swap, AR, expression detection, 3D reconstruction | Full face mesh; mobile-optimized (TFLite); ~9 MB TFLite model |

Research finding (2023): 68 landmarks are efficient for 3D face alignment — adding more points shows diminishing returns for face recognition downstream tasks. 5-point alignment is the practical standard for recognition pipelines.

CVPR 2025: The T-FAKE paper demonstrates accurate 70- and 478-point landmark prediction in challenging conditions (thermal images), suggesting dense landmark detection is maturing.

ICCV 2025: “Heatmap Regression without Soft-Argmax for Facial Landmark Detection” advances accuracy on standard benchmarks beyond previous SOTA (STAR method).


4. NIST Evaluations

NIST runs the Face Recognition Technology Evaluation (FRTE), which focuses on recognition accuracy, not detection in isolation. The FRTE FIVE track covers face detection in video.

Key result (April 2025): NEC ranked #1 in 1:N Identification on 12M-person gallery with 0.07% authentication error rate — but this is recognition, not detection.

Detection quality is evaluated separately via the FATE Quality program.


5. Summary Comparison

| Model | WF Easy | WF Medium | WF Hard | Speed | Landmarks | Size | ONNX | Production |
|---|---|---|---|---|---|---|---|---|
| SCRFD_500M | 90.6 | 88.1 | 68.5 | ~46 ms CPU VGA | 5 | ~1MB | Yes | Yes* |
| SCRFD_10G | 95.2 | 93.9 | 83.1 | ~80 ms CPU VGA | 5 | ~17MB | Yes | Yes* |
| RetinaFace MN0.25 | 91.7 | 89.0 | 72.0 | ~42 ms CPU VGA | 5 | ~2MB | Yes | Yes |
| RetinaFace R50 | 95.0 | 93.0 | 83.0 | slower | 5 | ~105MB | Yes | Yes |
| YuNet | 88.4 | 86.6 | 75.0 | 1.6 ms i7 320px | 5 | <1MB | Yes | Excellent |
| YOLOv8n-Face | 94.5 | 92.2 | 79.0 | fast | 5 | ~6MB | Yes | Yes |
| YOLOv12-Face | ~95+ | ~93+ | ~80+ | fast | 5 | varies | Yes | Yes |
| BlazeFace | competitive | competitive | n/a | 200-1000+ FPS mobile | 5 | ~2MB TFLite | No (TFLite) | Yes (mobile) |
| MediaPipe FaceMesh | n/a (landmark model) | | | real-time mobile | 478 3D | ~9MB TFLite | No (TFLite) | Yes (mobile) |
| InsightFace 1k3d68 | n/a (landmark model) | | | ~5ms GPU | 68 3D | ~72MB | Yes | Yes* |
| ASFD-D6 | 96.7 | 96.2 | 92.1 | slow | 5 | large | No | Research |
| YOLO5Face (YOLOv5x6) | 96.7 | 95.1 | 86.6 | moderate | 5 | large | Yes | Yes |

*Non-commercial license for InsightFace pretrained models; commercial license available.
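The tradeoffs in the table can be collapsed into a toy selection helper. The rules below are one reading of this document's summary, with hypothetical target labels, not an authoritative policy:

```python
# Toy model-selection helper encoding the summary table above.
# 'target' labels are hypothetical; the routing rules are one reading
# of the tradeoffs discussed in this document.

def pick_detector(target="cpu-edge", needs_dense_landmarks=False,
                  commercial=False):
    """target: 'cpu-edge', 'mobile', or 'accuracy'."""
    if needs_dense_landmarks:
        return "BlazeFace + MediaPipe Face Mesh"  # 478-point 3D mesh
    if target == "mobile":
        return "BlazeFace"  # TFLite, 200-1000+ FPS on phone GPUs
    if target == "accuracy":
        # InsightFace pretrained packs are non-commercial without a license
        return "YOLOv8-Face" if commercial else "SCRFD_10G"
    return "YuNet"  # <1MB, ships natively with OpenCV DNN
```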


For UFME-specific recommendations based on this research, see Executive Summary.