Kubernetes Deployment

UFME ships with production-ready Kubernetes manifests (deploy/k8s/) and a Helm chart (deploy/helm/) that parameterises the same resources. Use the raw manifests for simple fixed deployments; use Helm for environments that require per-environment value overrides.


```text
               ┌────────────────────────────────────┐
               │ ufme namespace                     │
               │                                    │
Ingress ──────►│ ufme-api (Deployment)              │
               │   2–8 replicas, HPA                │
               │   :8080 (HTTP, Prometheus scrape)  │
               │        │ gRPC                      │
               │        ▼                           │
               │ ufme-shard (StatefulSet)           │
               │   ufme-shard-0 … ufme-shard-4      │
               │   :50051, headless Service         │
               │                                    │
               │ ufme-compaction (CronJob)          │
               │   runs every hour                  │
               └────────────────────────────────────┘
```

Three workload types:

| Workload | Kind | Default replicas | Purpose |
|---|---|---|---|
| ufme-api | Deployment | 2 (HPA: 2–8) | REST gateway, pipeline orchestration |
| ufme-shard | StatefulSet | 5 | FAISS IVF-PQ index shards |
| ufme-compaction | CronJob | n/a (runs hourly) | Rebuilds index from event log |

All resources live in the ufme namespace:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ufme
```

Apply with kubectl apply -f deploy/k8s/namespace.yaml.


Shards use a StatefulSet because:

  • Stable pod identity: shards are addressed as ufme-shard-0.ufme-shard.ufme.svc (and so on) via the headless Service. The API gateway uses these stable DNS names to fan out gRPC requests.
  • Ordered startup: under the default OrderedReady pod management policy, shard-0 must be Running and Ready before shard-1 is created, ensuring predictable index restoration on cluster restart.
  • Per-pod PVCs: volumeClaimTemplates provisions one ReadWriteOnce PVC per shard pod, automatically named index-data-ufme-shard-{n}. Default size: 10 Gi (adjustable via shard.storage.size in Helm values).
```yaml
spec:
  clusterIP: None  # headless: DNS returns pod IPs directly
  selector:
    app.kubernetes.io/name: ufme-shard
  ports:
    - name: grpc
      port: 50051
```

Because clusterIP is None, the Service has no virtual IP. Instead, each pod of the StatefulSet that names this Service as its serviceName gets a stable DNS A record (ufme-shard-{n}.ufme-shard.ufme.svc.cluster.local).
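Because these names are deterministic, a client can build its fan-out list from the replica count alone. A minimal Python sketch, assuming the StatefulSet and headless Service are both named ufme-shard in namespace ufme, the claim template is index-data, and the cluster domain is the default cluster.local; all helper names are hypothetical and not part of UFME:

```python
"""Sketch of the stable names a ufme-shard StatefulSet yields.

Assumed (hypothetical helpers, not UFME code): StatefulSet and headless
Service both named "ufme-shard", namespace "ufme", volumeClaimTemplate
"index-data", cluster domain "cluster.local", gRPC port 50051.
"""


def pod_names(statefulset: str, replicas: int) -> list[str]:
    # StatefulSet pods are named <statefulset>-<ordinal>, ordinals 0..replicas-1.
    return [f"{statefulset}-{i}" for i in range(replicas)]


def pod_dns_names(statefulset: str, service: str, namespace: str,
                  replicas: int, cluster_domain: str = "cluster.local") -> list[str]:
    # Each pod governed by a headless Service gets an A record of the form
    # <pod>.<service>.<namespace>.svc.<cluster-domain>.
    return [f"{pod}.{service}.{namespace}.svc.{cluster_domain}"
            for pod in pod_names(statefulset, replicas)]


def pvc_names(claim_template: str, statefulset: str, replicas: int) -> list[str]:
    # volumeClaimTemplates create one PVC per pod, named <template>-<pod name>.
    return [f"{claim_template}-{pod}" for pod in pod_names(statefulset, replicas)]


def grpc_targets(replicas: int, port: int = 50051) -> list[str]:
    # host:port strings a gateway could dial, one per shard.
    hosts = pod_dns_names("ufme-shard", "ufme-shard", "ufme", replicas)
    return [f"{host}:{port}" for host in hosts]


if __name__ == "__main__":
    for target in grpc_targets(5):
        print(target)
```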

Shard pod resources:

| Resource | Request | Limit |
|---|---|---|
| CPU | 1 core | 2 cores |
| Memory | 4 Gi | 6 Gi |
| Storage (PVC) | 10 Gi | n/a |

The default 6 Gi memory limit is appropriate for development gallery sizes. For 40M+ vector production galleries, increase to accommodate the compressed IVF-PQ index (~2.56 GB) plus any reranking vectors.
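The ~2.56 GB figure matches 40M vectors at 64 bytes per PQ code; the text does not say whether it is per shard or for the whole gallery. A back-of-envelope Python sketch, assuming 64-byte codes, 8-byte vector IDs (FAISS stores inverted-list IDs as 64-bit integers), an even split over 5 shards, and, for the optional reranking store, 512-dimensional float32 embeddings. These parameters are illustrative assumptions, not values read from the UFME configuration:

```python
"""Back-of-envelope memory estimate for an IVF-PQ gallery split across shards.

All parameters are assumptions for illustration: 64-byte PQ codes (which
reproduces the ~2.56 GB figure for 40M vectors), 8-byte IDs, an even split
across shards, and 512-d float32 vectors if full embeddings are kept for
reranking. Centroids and allocator overhead are ignored.
"""

GB = 10**9  # decimal gigabytes, matching the "2.56 GB" figure


def ivfpq_bytes(num_vectors: int, code_bytes: int = 64, id_bytes: int = 8) -> tuple[int, int]:
    # Compressed codes plus the inverted-list IDs stored alongside them.
    return num_vectors * code_bytes, num_vectors * id_bytes


def rerank_bytes(num_vectors: int, dim: int = 512, bytes_per_float: int = 4) -> int:
    # Full-precision embeddings, if kept in memory for exact reranking.
    return num_vectors * dim * bytes_per_float


def per_shard(total_bytes: int, shards: int) -> int:
    # Ceiling division: the largest shard under an even split.
    return -(-total_bytes // shards)


if __name__ == "__main__":
    n, shards = 40_000_000, 5
    codes, ids = ivfpq_bytes(n)
    print(f"codes: {codes / GB:.2f} GB, ids: {ids / GB:.2f} GB")
    print(f"codes+ids per shard: {per_shard(codes + ids, shards) / GB:.3f} GB")
    print(f"float32 rerank vectors: {rerank_bytes(n) / GB:.2f} GB")
```

Under these assumptions the codes and IDs total about 2.88 GB (about 0.58 GB per shard across five shards), while keeping full float32 vectors for reranking would add roughly 82 GB, so the reranking store rather than the PQ index would dominate the memory budget.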

Shards benefit from dedicated high-memory nodes. Use node selectors and tolerations to isolate shard pods:

values.yaml

```yaml
shard:
  nodeSelector:
    role: shard
  tolerations:
    - key: dedicated
      value: faiss-shard
      effect: NoSchedule
```

Label your shard nodes with kubectl label node <node> role=shard, then apply the matching taint with kubectl taint node <node> dedicated=faiss-shard:NoSchedule.
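The taint keeps other workloads off the shard nodes, while the nodeSelector keeps shard pods on them; neither does both jobs alone. A simplified Python model of those two checks, following the Kubernetes matching rules for nodeSelector and NoSchedule taints (it ignores affinity, resource fit, and the NoExecute and PreferNoSchedule effects); the function names are hypothetical:

```python
"""Simplified model of nodeSelector + taint/toleration scheduling checks.

Hypothetical helpers for illustration; only NoSchedule taints are modelled.
"""


def tolerates(toleration: dict, taint: dict) -> bool:
    # An empty toleration effect matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")  # Equal is the default
    if operator == "Exists":
        # Exists with no key tolerates every taint; otherwise keys must match.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def can_schedule(node_labels: dict, node_taints: list, node_selector: dict,
                 tolerations: list) -> bool:
    # Every nodeSelector label must be present on the node with the same value.
    if any(node_labels.get(k) != v for k, v in node_selector.items()):
        return False
    # Every NoSchedule taint must be tolerated by at least one toleration.
    for taint in node_taints:
        if taint["effect"] == "NoSchedule" and not any(
                tolerates(t, taint) for t in tolerations):
            return False
    return True


if __name__ == "__main__":
    shard_taint = {"key": "dedicated", "value": "faiss-shard", "effect": "NoSchedule"}
    shard_tol = [{"key": "dedicated", "value": "faiss-shard", "effect": "NoSchedule"}]
    print(can_schedule({"role": "shard"}, [shard_taint], {"role": "shard"}, shard_tol))
    print(can_schedule({"role": "shard"}, [shard_taint], {}, []))
```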


The API gateway runs as a standard Deployment with HorizontalPodAutoscaler.

```yaml
spec:
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

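The HPA targets an average CPU utilisation of 70% across the API pods, measured against each pod's CPU request (500m), so the target corresponds to roughly 350m of CPU per pod. The minimum of 2 replicas keeps the API available during rolling updates.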
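The autoscaler's core rule is desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization), clamped to the min/max bounds, with ratios within a default 10% tolerance of the target ignored. A minimal Python sketch of that rule using the 2–8 replica and 70% settings above; it leaves out the controller's scale-down stabilisation window and scaling policies:

```python
"""Sketch of the HPA replica calculation for ufme-api.

Models only the core formula with the default 10% tolerance; the real
controller also applies a scale-down stabilisation window and scaling
policies, which are omitted here.
"""
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float = 70.0, min_replicas: int = 2,
                     max_replicas: int = 8, tolerance: float = 0.1) -> int:
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to target: keep the current size.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the HPA's minReplicas/maxReplicas bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    print(desired_replicas(2, 140))  # two pods at 140% of request
```

For example, two pods averaging 140% of their request double to four, while three pods at 75% stay at three because the ratio is inside the tolerance.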

API pod resources:

| Resource | Request | Limit |
|---|---|---|
| CPU | 500m | 2 cores |
| Memory | 1 Gi | 4 Gi |

Both liveness and readiness probes call GET /health:

| Probe | Path | Initial delay | Period |
|---|---|---|---|
| Liveness | /health | 10 s | 30 s |
| Readiness | /health | 5 s | 10 s |
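The table does not set failureThreshold, so assuming the Kubernetes default of 3, a probe must fail three consecutive times before Kubernetes acts. A rough Python sketch of the resulting reaction times, ignoring probe timeouts and jitter:

```python
"""Rough time-to-action for the /health probes, assuming failureThreshold=3."""


def seconds_until_action(period_s: int, failure_threshold: int = 3) -> int:
    # Consecutive failures needed multiplied by the probe interval,
    # ignoring timeoutSeconds and where in the period the failure began.
    return period_s * failure_threshold


LIVENESS = {"initial_delay_s": 10, "period_s": 30}
READINESS = {"initial_delay_s": 5, "period_s": 10}

if __name__ == "__main__":
    print("unready after about", seconds_until_action(READINESS["period_s"]), "s")
    print("restarted after about", seconds_until_action(LIVENESS["period_s"]), "s")
```

Under that assumption, a hung API pod is removed from Service endpoints after at most about 30 s of failures but is only restarted after about 90 s.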

The API pod is annotated for Prometheus scraping:

```yaml
annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8080"
  prometheus.io/path: "/metrics"
```
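The prometheus.io/* annotations are a convention, not a built-in Prometheus feature: they only take effect if the Prometheus scrape configuration uses Kubernetes pod discovery with relabelling rules that read them (as the kubernetes-pods job in the Prometheus example configuration does). A Python sketch of what such rules typically compute for one pod; the helper, its default port, and the example pod IP are hypothetical:

```python
"""Sketch of annotation-driven scrape target resolution for one pod.

Mirrors the common relabelling convention: scrape only when
prometheus.io/scrape is "true", and use prometheus.io/port and
prometheus.io/path when present. Hypothetical helper, not Prometheus code.
"""
from typing import Optional


def scrape_url(pod_ip: str, annotations: dict, default_port: int = 80) -> Optional[str]:
    if annotations.get("prometheus.io/scrape") != "true":
        return None  # pod is ignored by the scrape job
    # default_port is an illustrative fallback for pods without a port annotation.
    port = annotations.get("prometheus.io/port", str(default_port))
    path = annotations.get("prometheus.io/path", "/metrics")
    return f"http://{pod_ip}:{port}{path}"


if __name__ == "__main__":
    api_annotations = {
        "prometheus.io/scrape": "true",
        "prometheus.io/port": "8080",
        "prometheus.io/path": "/metrics",
    }
    print(scrape_url("10.0.0.7", api_annotations))
```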

The compaction worker rebuilds the FAISS index from the event log on a schedule:

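```yaml
spec:
  schedule: "0 */1 * * *"          # minute 0 of every hour
  concurrencyPolicy: Forbid        # never run two compactions at once
  jobTemplate:
    spec:
      backoffLimit: 2              # retry a failed run at most twice
      activeDeadlineSeconds: 3600  # stop a run after 1 h, by the time the next slot is due
```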

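It uses two PVCs shared with the API, reading events from one and writing snapshots to the other:

| PVC | Mount | Size | Purpose |
|---|---|---|---|
| ufme-event-log | /data/events | 20 Gi | Append-only enrolment/delete events |
| ufme-index-output | /data/index | 50 Gi | New index snapshots written here |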

After building a new snapshot, the compaction binary calls the SwapSnapshot RPC on each shard, which atomically replaces that shard's live index.

Compaction job resources:

| Resource | Request | Limit |
|---|---|---|
| CPU | 2 cores | 4 cores |
| Memory | 8 Gi | 16 Gi |
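One common way to make a per-shard swap atomic is to publish the new snapshot by replacing a single reference, so in-flight searches finish on the snapshot they started with and new searches see the new one. A minimal in-process Python model of that pattern; ShardIndexHolder, swap_snapshot, swap_all and the callable standing in for a FAISS index are hypothetical and say nothing about how the real SwapSnapshot handler is written:

```python
"""In-process model of an atomic snapshot swap on one shard (hypothetical)."""
import threading
from typing import Callable

SearchFn = Callable[[str], list[str]]  # stand-in for a loaded FAISS index


class ShardIndexHolder:
    def __init__(self, version: int, search_fn: SearchFn) -> None:
        self._lock = threading.Lock()
        self._current: tuple[int, SearchFn] = (version, search_fn)

    def search(self, query: str) -> tuple[int, list[str]]:
        # Take a consistent (version, index) pair, then search outside the lock,
        # so a concurrent swap never blocks or tears an in-flight query.
        with self._lock:
            version, search_fn = self._current
        return version, search_fn(query)

    def swap_snapshot(self, version: int, search_fn: SearchFn) -> int:
        # Publish the new snapshot in one reference assignment; return the old version.
        with self._lock:
            old_version, _ = self._current
            self._current = (version, search_fn)
        return old_version


def swap_all(shards: list[ShardIndexHolder], version: int,
             build: Callable[[int], SearchFn]) -> list[int]:
    # Compaction-style fan-out: swap each shard in turn.
    return [shard.swap_snapshot(version, build(i)) for i, shard in enumerate(shards)]
```

In this model each shard switches atomically, but shards switch one after another, so a query fanned out mid-swap can combine old and new snapshots across shards. If the real compaction swaps shards sequentially as described, per-shard atomicity alone does not give a cluster-wide cut-over.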

PodDisruptionBudgets limit voluntary disruptions, such as node drains during maintenance:

```yaml
# API: at least 1 replica always available
spec:
  minAvailable: 1
---
# Shard: at least 1 shard always available
spec:
  minAvailable: 1
```

Applied via deploy/k8s/pdbs.yaml.
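For a minAvailable budget, the eviction API allows currentHealthy minus minAvailable voluntary disruptions at a time, never fewer than zero. A small Python sketch of that arithmetic for the two budgets above (the helper name is hypothetical):

```python
"""Voluntary disruptions a minAvailable PodDisruptionBudget permits."""


def disruptions_allowed(current_healthy: int, min_available: int) -> int:
    # Evictions are refused once they would drop healthy pods below minAvailable.
    return max(0, current_healthy - min_available)


if __name__ == "__main__":
    print("API:", disruptions_allowed(2, 1))
    print("shards:", disruptions_allowed(5, 1))
```

With two healthy API replicas, one may be evicted at a time; with five healthy shards, up to four may be evicted together. If each shard holds a distinct slice of the gallery, that budget lets a drain take most of the index offline at once, so a tighter shard budget (for example maxUnavailable: 1) may be worth considering.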


The Helm chart at deploy/helm/ is functionally equivalent to the raw manifests but parameterised.

```shell
helm install ufme ./deploy/helm \
  --namespace ufme \
  --create-namespace \
  --values deploy/helm/values.yaml
```
deploy/helm/values.yaml

```yaml
api:
  replicas: 2
  hpa:
    enabled: true
    minReplicas: 2
    maxReplicas: 8
    targetCPUUtilization: 70
shard:
  replicas: 5
  storage:
    size: 10Gi
    storageClass: ""   # use default storage class
    s3Bucket: ""       # optional: S3-backed snapshot storage
compaction:
  schedule: "0 */1 * * *"
  intervalSecs: 3600
config:
  thresholds:
    pad: "0.85"
    mad: "0.75"
    quality: "0.40"
    similarity: "0.45"
  models:
    scrfd: "/models/scrfd_10g.onnx"
    adaface: "/models/w600k_r50.onnx"
    pad: "/models/MiniFASNetV2.onnx"
    mad: "/models/mad_selfmad_hrnet_w18.onnx"
    quality: "/models/ediffiqa_tiny.onnx"
```

Override values per environment:

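```shell
# staging (2 shards, smaller PVCs)
helm upgrade ufme ./deploy/helm \
  --namespace ufme \
  --set shard.replicas=2 \
  --set shard.storage.size=5Gi
# production (5 shards, S3 snapshots, larger PVCs; dedicated nodes come from shard.nodeSelector/tolerations)
helm upgrade ufme ./deploy/helm \
  --namespace ufme \
  --set shard.replicas=5 \
  --set shard.storage.s3Bucket=my-ufme-snapshots \
  --set shard.storage.size=100Gi
```

Changing shard.storage.size on an already-deployed release does not resize existing shard volumes: Kubernetes rejects in-place changes to a StatefulSet's volumeClaimTemplates, so set the size before the first install, or expand the existing PVCs separately (which requires a storage class with allowVolumeExpansion: true).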
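Each --set key is a dotted path into the values tree, applied on top of values.yaml, so keys you do not set keep their file defaults. A simplified Python model of that merge; it only handles plain keys with integer or string values, whereas Helm's real parser also handles lists, escaping, booleans and null. The apply_set helper is hypothetical:

```python
"""Simplified model of how helm --set overrides nest into the values tree."""
import copy


def apply_set(values: dict, overrides: list[str]) -> dict:
    merged = copy.deepcopy(values)  # leave the defaults untouched
    for item in overrides:
        path, _, raw = item.partition("=")
        keys = path.split(".")
        node = merged
        for key in keys[:-1]:
            node = node.setdefault(key, {})  # create intermediate maps as needed
        # Integers become numbers; everything else (e.g. "5Gi") stays a string.
        node[keys[-1]] = int(raw) if raw.lstrip("-").isdigit() else raw
    return merged


if __name__ == "__main__":
    defaults = {"shard": {"replicas": 5, "storage": {"size": "10Gi", "s3Bucket": ""}}}
    print(apply_set(defaults, ["shard.replicas=2", "shard.storage.size=5Gi"]))
```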

  1. Build Docker images: make build
  2. Push images to your registry and update image.repository in values.yaml
  3. Apply namespace: kubectl apply -f deploy/k8s/namespace.yaml
  4. Apply PVCs: kubectl apply -f deploy/k8s/pvcs.yaml
  5. Apply ConfigMap: kubectl apply -f deploy/k8s/configmap.yaml
  6. Install Helm chart or apply raw manifests
  7. Verify shard pods are Running: kubectl get pods -n ufme
  8. Verify API is reachable: kubectl port-forward svc/ufme-api 8080:8080 -n ufme (this runs in the foreground; use a second terminal for the next step)
  9. Load demo gallery: curl -X POST http://localhost:8080/api/v1/demo/load
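Step 7 can be scripted. A minimal Python sketch that checks the JSON from kubectl get pods -n ufme -o json for the expected number of Running and Ready shard pods; it assumes kubectl is on the PATH with a configured context and that shard pods are named ufme-shard-<n>, and the helper names are hypothetical:

```python
"""Check that all ufme-shard pods are Running and Ready (hypothetical helper)."""
import json
import subprocess


def ready_shards(pods: dict, prefix: str = "ufme-shard-") -> list[str]:
    ready = []
    for pod in pods.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        if not name.startswith(prefix) or status.get("phase") != "Running":
            continue
        # A pod is Ready when its Ready condition has status "True".
        if any(c.get("type") == "Ready" and c.get("status") == "True"
               for c in status.get("conditions", [])):
            ready.append(name)
    return sorted(ready)


def check_shards(expected: int = 5, namespace: str = "ufme") -> bool:
    # Shells out to kubectl; requires a configured kube context.
    out = subprocess.run(["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
                         check=True, capture_output=True, text=True).stdout
    return len(ready_shards(json.loads(out))) == expected
```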