
Face Detection

Research compiled: 2026-02-20


The standard benchmark for face detection is WIDER FACE, containing 32,203 images with 393,703 annotated faces across three difficulty splits: Easy, Medium, and Hard. The Hard split is most challenging (small, occluded, low-resolution faces). All mAP values below are AP (average precision) on the validation set unless stated otherwise.
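The AP numbers throughout this document can be understood via a simplified single-class average-precision computation at a fixed IoU threshold. This is an illustrative sketch, not the official WIDER FACE evaluation toolkit (which uses per-difficulty ground-truth lists and sweeps the score threshold):

```python
# Simplified single-class AP at a fixed IoU threshold (illustrative only;
# the official WIDER FACE toolkit handles difficulty splits and ignores).

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def average_precision(detections, ground_truth, iou_thr=0.5):
    """detections: list of (score, box); ground_truth: non-empty list of boxes."""
    detections = sorted(detections, key=lambda d: -d[0])  # highest score first
    matched = set()
    tp = []
    for _, box in detections:
        # greedily match each detection to its best unmatched ground truth
        best, best_i = 0.0, -1
        for i, gt in enumerate(ground_truth):
            if i not in matched and iou(box, gt) > best:
                best, best_i = iou(box, gt), i
        if best >= iou_thr:
            matched.add(best_i)
            tp.append(1)
        else:
            tp.append(0)
    # accumulate area under the precision-recall curve
    ap, ctp, prev_recall = 0.0, 0, 0.0
    for k, hit in enumerate(tp):
        ctp += hit
        recall = ctp / len(ground_truth)
        precision = ctp / (k + 1)
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap
```

A detector that finds both faces with its two highest-scoring boxes scores AP = 1.0 here even if a third low-scoring false positive trails behind, which is why AP rewards good score ranking rather than raw detection counts.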


2.1 SCRFD (Sample and Computation Redistribution for Efficient Face Detection)

  • Paper: arXiv:2105.04714 — published ICLR 2022
  • Source: InsightFace / deepinsight
  • Key idea: Redistributes training samples to harder detection stages (Sample Redistribution) and reallocates compute between backbone/neck/head (Computation Redistribution).

Model family performance on WIDER Face validation:

| Model | GFLOPs | Easy | Medium | Hard | Notes |
|---|---|---|---|---|---|
| SCRFD_500M | 0.5 | 90.57 | 88.12 | 68.51 | 0.57M params, ultra-lightweight |
| SCRFD_2.5GF | 2.5 | ~93 | ~91 | 77.87 | Post SR+CR |
| SCRFD_10G | 10 | 95.16 | 93.87 | 83.05 | Strong accuracy |
| SCRFD_34GF | 34 | ~96 | ~95 | ~90 | Beats TinaFace by 3.86% AP on Hard, >3× faster |

Inference speed: SCRFD_500MF and RetinaFace-MobileNet0.25 both achieve ~42–46 ms on VGA (640×480) images on CPU.

Landmarks: Outputs 5 keypoints (eyes, nose, mouth corners).

Production readiness: Yes — ONNX export supported. Part of InsightFace. Widely used in production pipelines. Note: buffalo_l and similar pretrained packages are non-commercial only; a commercial license is required for production use.
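The 5 keypoints exist chiefly to align faces before feeding them to a recognition model. A minimal sketch of that alignment step, assuming numpy; the 112×112 five-point template values below are those commonly quoted for InsightFace/ArcFace-style pipelines and should be treated as an assumption, as should the helper name `similarity_transform`:

```python
import numpy as np

# Reference 5-point template for a 112x112 aligned crop (values as commonly
# used in InsightFace-style recognition pipelines; treat as an assumption).
ARCFACE_TEMPLATE = np.array([
    [38.2946, 51.6963],  # eye
    [73.5318, 51.5014],  # eye
    [56.0252, 71.7366],  # nose tip
    [41.5493, 92.3655],  # mouth corner
    [70.7299, 92.2041],  # mouth corner
], dtype=np.float64)

def similarity_transform(src, dst):
    """Least-squares similarity transform (Umeyama) mapping src -> dst.
    Returns a 2x3 affine matrix usable with cv2.warpAffine."""
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - src_mean, dst - dst_mean
    cov = dst_c.T @ src_c / len(src)          # cross-covariance, 2x2
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))        # guard against reflections
    D = np.diag([1.0, d])
    R = U @ D @ Vt                            # optimal rotation
    scale = np.trace(np.diag(S) @ D) / src_c.var(axis=0).sum()
    t = dst_mean - scale * R @ src_mean
    M = np.empty((2, 3))
    M[:, :2] = scale * R
    M[:, 2] = t
    return M
```

In practice the detector's 5 keypoints go in as `src` and the template as `dst`, and the resulting matrix warps the frame into the 112×112 crop a recognition model expects.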


2.2 RetinaFace

Performance:

| Backbone | Easy | Medium | Hard |
|---|---|---|---|
| MobileNet-0.25 | ~91.7 | ~89 | ~72 |
| ResNet-50 | ~95 | ~93 | ~83 |

Inference speed: RetinaFace-MobileNet0.25: ~42 ms on VGA (CPU); ResNet-50 variant is slower.

Landmarks: 5 keypoints. Also predicts 3D dense alignment (68 points via additional head in some variants).

Production readiness: Yes — ONNX export. Well-established, many community wrappers.


2.3 YOLO5Face / YOLOv8-Face / YOLOv11-Face / YOLOv12-Face


YOLO-based detectors adapted for face detection are actively maintained community efforts.

  • Paper: arXiv:2105.12931
  • YOLOv5x6 backbone achieves 96.67 / 95.08 / 86.55 (Easy/Medium/Hard) — among the best at time of release.
  • Outputs 5 facial landmarks.
| Model | Easy | Medium | Hard |
|---|---|---|---|
| YOLOv8n-Face | 94.5–94.6 | 92.2–92.3 | 79.0–79.6 |
| YOLOv8-Lite-s | 93.4 | 91.2 | 78.6 |
| YOLOv8-Lite-t | 90.4 | 87.7 | 73.3 |
  • Outputs 5 landmarks.
  • ONNX export: Yes (ONNX models are ~2× size of PyTorch due to serialization format).
  • YOLOv11 (nano) marginally outperforms YOLOv12 on precision/mAP50 for face detection.
  • YOLOv12 introduces Area Attention (A2) module + FlashAttention — achieves higher mAP at all scales with similar or better latency.
  • ONNX YOLOv12-Face models released December 2025.
  • Forensic face detection study (2025): YOLOv12 achieves superior latency and precision vs YOLOv8/YOLOv10 baselines on WIDER FACE subset.

Production readiness: Yes — all YOLO variants support ONNX/CoreML/TFLite export.
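Whatever the YOLO variant, the decoded face boxes still need greedy non-maximum suppression, and some exported ONNX models leave that step to the caller. A dependency-free sketch of the standard greedy algorithm:

```python
# Greedy non-maximum suppression: keep the highest-scoring box, drop all
# boxes overlapping it above iou_thr, repeat on the survivors.

def nms(boxes, scores, iou_thr=0.45):
    """boxes: list of (x1, y1, x2, y2); returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        survivors = []
        for j in order:
            # intersection-over-union between box i and box j
            ix1 = max(boxes[i][0], boxes[j][0]); iy1 = max(boxes[i][1], boxes[j][1])
            ix2 = min(boxes[i][2], boxes[j][2]); iy2 = min(boxes[i][3], boxes[j][3])
            inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
            area_i = (boxes[i][2] - boxes[i][0]) * (boxes[i][3] - boxes[i][1])
            area_j = (boxes[j][2] - boxes[j][0]) * (boxes[j][3] - boxes[j][1])
            if inter / (area_i + area_j - inter) < iou_thr:
                survivors.append(j)
        order = survivors
    return keep
```

For crowded scenes (the WIDER FACE Hard split) the `iou_thr` choice matters: too low merges adjacent faces, too high lets duplicates through.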


2.4 YuNet

Performance:

| Metric | Value |
|---|---|
| WIDER Face Easy (AP) | 88.44% |
| WIDER Face Medium (AP) | 86.56% |
| WIDER Face Hard (AP) | 75.03% |
| WIDER Face Hard mAP (single-scale) | 81.1% |

Speed: ~1.6 ms per frame at 320×320 on an Intel i7-12700K (CPU); roughly 5 ms vs. ~25 ms for traditional cascade-based methods.

Model size: Only 75,856 parameters — less than 1/5 of other small detectors.

Landmarks: 5 keypoints.

Production readiness: Excellent — ships natively with OpenCV DNN module, no extra dependencies. Zero-cost deployment. Ideal for edge/serverless. ONNX export supported.
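Fixed-input detectors such as YuNet at 320×320 need the frame scaled and padded to the network size, and the resulting boxes mapped back to original image coordinates. A minimal letterbox sketch, using nearest-neighbor indexing so it needs only numpy; a real pipeline would use `cv2.resize`, and the helper names `letterbox`/`unletterbox_box` are hypothetical:

```python
import numpy as np

def letterbox(img, size=320):
    """Scale img to fit a size x size canvas (preserving aspect ratio),
    pad the remainder with zeros, and return (canvas, scale)."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    nh, nw = int(round(h * scale)), int(round(w * scale))
    # nearest-neighbor resize via index arrays (keeps the sketch numpy-only)
    ys = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    canvas = np.zeros((size, size) + img.shape[2:], dtype=img.dtype)
    canvas[:nh, :nw] = img[ys][:, xs]
    return canvas, scale

def unletterbox_box(box, scale):
    """Map an (x1, y1, x2, y2) box from letterboxed back to original coords."""
    return tuple(v / scale for v in box)
```

Keeping the scale factor around is the whole trick: detection happens in canvas space, while downstream consumers (cropping, drawing) want original-image coordinates.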


2.5 BlazeFace / MediaPipe Face Detection + Face Mesh

  • Source: Google / MediaPipe
  • Architecture: BlazeFace (lightweight, SSD-inspired, GPU-friendly anchor scheme) + separate 3D landmark model.

BlazeFace performance:

  • Competitive accuracy to heavier models.
  • 200–1000+ FPS on high-end mobile phones (GPU-accelerated).
  • Designed for real-time mobile/browser inference (TFLite, WebAssembly, GPU delegate).

Face Mesh (landmark model):

  • Outputs 468 (legacy) or 478 3D face landmarks in real-time on mobile.
  • Operates on face crops from BlazeFace detector.
  • Includes iris landmarks in the 478-point version.

Landmarks: 5 (BlazeFace detector) → then 478 3D (Face Mesh landmark model).

Production readiness: Excellent — Google-maintained, used in billions of devices. TFLite + WASM. Not ONNX natively (TFLite format; community conversions exist).
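MediaPipe landmark models emit coordinates normalized to [0, 1] by image width and height, so converting to pixels is a per-point scale. The sketch below assumes landmarks arrive as plain (x, y, z) tuples rather than the protobuf objects the MediaPipe Python API actually returns:

```python
# Convert normalized MediaPipe-style landmarks to pixel coordinates.
# Assumption: landmarks are plain (x, y, z) tuples in [0, 1] space.

def to_pixels(landmarks, img_w, img_h):
    """Scale normalized (x, y, z) landmarks to pixel coordinates.
    z follows MediaPipe's convention of being scaled like x (by width)."""
    return [(x * img_w, y * img_h, z * img_w) for x, y, z in landmarks]
```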


2.6 InsightFace Buffalo Pack (Production Bundle)


The buffalo_l model pack bundles:

  • Detection: SCRFD_10G (ONNX)
  • 3D Landmark: 1k3d68.onnx — 68 3D landmark predictor
  • Recognition: ArcFace R100 (ONNX)
  • Attribute: gender/age model

Key detail: buffalo_l is widely used in open-source projects (e.g., immich) but is non-commercial research only. Commercial licensing available separately.


2.7 ASFD (Automatic and Scalable Face Detector)

  • Paper: arXiv:2201.10781 — Tencent
  • ASFD-D6 achieves ~96.7 / 96.2 / 92.1 (Easy/Medium/Hard test set) — near top of Papers with Code leaderboard.
  • Large model (ResNeXt + NAS-searched neck), primarily a research benchmark leader.
  • Not widely used in production pipelines.

3. Face Landmark Detection: Comparison by Point Count

| Points | Model Examples | Use Cases | Tradeoffs |
|---|---|---|---|
| 5 | SCRFD, RetinaFace, YuNet, YOLOv8-Face, BlazeFace | Face alignment for recognition, crop/warp preprocessing | Fastest; sufficient for alignment & recognition |
| 68 | dlib, 1k3d68 (InsightFace), face-alignment lib (adrianbulat) | Facial analysis, expression, detailed geometry | ~99.7 MB (dlib); 8–10% slower than 5-point |
| 468/478 | MediaPipe Face Mesh, TF face-landmarks-detection | Face swap, AR, expression detection, 3D reconstruction | Full face mesh; mobile-optimized (TFLite); ~9 MB TFLite model |

Research finding (2023): 68 landmarks are efficient for 3D face alignment — adding more points shows diminishing returns for face recognition downstream tasks. 5-point alignment is the practical standard for recognition pipelines.

CVPR 2025: The T-FAKE paper demonstrates accurate 70- and 478-point landmark prediction in challenging conditions (thermal images), suggesting dense landmark detection is maturing.

ICCV 2025: “Heatmap Regression without Soft-Argmax for Facial Landmark Detection” advances accuracy on standard benchmarks beyond previous SOTA (STAR method).


4. NIST Evaluations

NIST runs the Face Recognition Technology Evaluation (FRTE), which focuses on recognition accuracy, not detection in isolation. The FRTE FIVE track covers face detection in video.

Key result (April 2025): NEC ranked #1 in 1:N Identification on 12M-person gallery with 0.07% authentication error rate — but this is recognition, not detection.

Detection quality is evaluated separately via the FATE Quality program.


5. Summary Comparison

| Model | WF Easy | WF Medium | WF Hard | Speed | Landmarks | Size | ONNX | Production |
|---|---|---|---|---|---|---|---|---|
| SCRFD_500M | 90.6 | 88.1 | 68.5 | ~46 ms CPU VGA | 5 | ~1MB | Yes | Yes* |
| SCRFD_10G | 95.2 | 93.9 | 83.1 | ~80 ms CPU VGA | 5 | ~17MB | Yes | Yes* |
| RetinaFace MN0.25 | 91.7 | 89.0 | 72.0 | ~42 ms CPU VGA | 5 | ~2MB | Yes | Yes |
| RetinaFace R50 | 95.0 | 93.0 | 83.0 | slower | 5 | ~105MB | Yes | Yes |
| YuNet | 88.4 | 86.6 | 75.0 | 1.6 ms i7 320px | 5 | <1MB | Yes | Excellent |
| YOLOv8n-Face | 94.5 | 92.2 | 79.0 | fast | 5 | ~6MB | Yes | Yes |
| YOLOv12-Face | ~95+ | ~93+ | ~80+ | fast | 5 | varies | Yes | Yes |
| BlazeFace | competitive | competitive | n/a | 200-1000+ FPS mobile | 5 | ~2MB TFLite | No (TFLite) | Yes (mobile) |
| MediaPipe FaceMesh | n/a (landmark model) | | | real-time mobile | 478 3D | ~9MB TFLite | No (TFLite) | Yes (mobile) |
| InsightFace 1k3d68 | n/a (landmark model) | | | ~5ms GPU | 68 3D | ~72MB | Yes | Yes* |
| ASFD-D6 | 96.7 | 96.2 | 92.1 | slow | 5 | large | No | Research |
| YOLO5Face (YOLOv5x6) | 96.7 | 95.1 | 86.6 | moderate | 5 | large | Yes | Yes |

*Non-commercial license for InsightFace pretrained models; commercial license available.
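The tradeoffs in the table can be collapsed into a toy selection helper. The rules below are one reading of this document's summary, with hypothetical target labels, not an authoritative policy:

```python
# Toy model-selection helper encoding the summary table above.
# 'target' labels are hypothetical; the routing rules are one reading
# of the tradeoffs discussed in this document.

def pick_detector(target="cpu-edge", needs_dense_landmarks=False,
                  commercial=False):
    """target: 'cpu-edge', 'mobile', or 'accuracy'."""
    if needs_dense_landmarks:
        return "BlazeFace + MediaPipe Face Mesh"  # 478-point 3D mesh
    if target == "mobile":
        return "BlazeFace"  # TFLite, 200-1000+ FPS on phone GPUs
    if target == "accuracy":
        # InsightFace pretrained packs are non-commercial without a license
        return "YOLOv8-Face" if commercial else "SCRFD_10G"
    return "YuNet"  # <1MB, ships natively with OpenCV DNN
```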


For UFME-specific recommendations based on this research, see Executive Summary.