FAISS Design

Philosophy: Rich Hickey’s Simple Made Easy — separate concerns, compose pure functions, treat state as a derived view of immutable events.


“Simple is not easy. Easy conflates familiar with good.” — Rich Hickey

The FAISS layer is not a database. It is not a mutable store you update in place. It is a read-optimised function that maps a query vector to a ranked list of candidate identities. All complexity — partitioning, filtering, enrolment, deletion — is handled as plain data transformations before and after that function runs.


An in-place mutable FAISS index conflates three separate concerns:

  • Identity: which shard holds vectors for partition IABS
  • State: the current contents of that shard at this moment
  • Value: a specific immutable snapshot of 40M vectors at version v42

When you mutate the index during a live search, you have no value — you have a place. Places are the root of all complexity: race conditions, partial reads, impossible reasoning about what “the index” contains right now.

Each FAISS shard holds an index snapshot — an immutable, named value:

IndexSnapshot {
    version: u64,                    # monotonically increasing
    shard_id: u8,                    # 0..4
    partition: Partition,            # IABS | IDENT1 | IABS+IDENT1
    built_at: Timestamp,
    vector_count: u64,
    faiss_index: Arc<IndexIVFPQ>,    # reference-counted, read-only; PQ-compressed
    pq_codebook: Arc<PQCodebook>,    # trained codebook for approximate distance computation
    exact_vectors: Arc<MmapFloat32>, # memory-mapped exact float32 vectors (SSD); used for reranking only
    id_map: Arc<Vec<SubjectId>>,     # position -> canonical subject ID
}

The shard process holds an identity — an atomic reference to the current snapshot:

ShardState {
    current: AtomicRef<IndexSnapshot>, # the current value
}

Enrolment never modifies current. It produces a new IndexSnapshot at version + 1 and swaps the atomic reference. During the swap, in-flight searches on the old snapshot complete undisturbed. Old snapshots are dropped when their reference count reaches zero.
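A minimal Python sketch of the swap discipline (the names mirror the structs above, but this is an illustration, not the shard implementation — the production shard is Rust):

```python
import threading
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexSnapshot:
    """Value: immutable after construction, safe to share across threads."""
    version: int
    vectors: tuple  # stand-in for the FAISS index contents

class ShardState:
    """Identity: a stable reference that points at successive snapshot values."""
    def __init__(self, snapshot):
        self._lock = threading.Lock()
        self._current = snapshot

    @property
    def current(self):
        # Readers take the reference once and keep using that value; no lock.
        return self._current

    def swap(self, new_snapshot):
        with self._lock:  # writers serialise the swap; readers never block
            self._current = new_snapshot

# Enrolment builds version 2 instead of mutating version 1.
shard = ShardState(IndexSnapshot(version=1, vectors=("a", "b")))
pinned = shard.current  # an in-flight search pins the old value
shard.swap(IndexSnapshot(version=2, vectors=("a", "b", "c")))
```

The in-flight search still sees version 1 in its entirety; new searches see version 2. Neither observes a half-updated index.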

This is the Epochal Time Model applied directly:

  • Identity = the shard process and its socket address
  • State = the current IndexSnapshot reference held by the shard
  • Value = a specific IndexSnapshot (immutable, shareable, garbage-collected)

The two storage tiers keep RAM lean:

  • PQ-compressed index (faiss_index): 40M vectors × 64 bytes = ~2.5 GB per shard, all in RAM.
  • Exact vectors (exact_vectors): 40M vectors × 2048 bytes = ~80 GB per shard, on SSD accessed via mmap. Only the pages for the top-100 reranking candidates are faulted into RAM on demand — the OS page cache handles eviction automatically.

Across 5 shards: ~12.5 GB RAM for PQ, ~400 GB SSD for exact vectors.
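The SSD tier's on-demand paging can be demonstrated with the stdlib `mmap` module — a toy 4-dim sketch (production vectors are 512-dim float32; the file name is hypothetical):

```python
import mmap, os, struct, tempfile

DIM = 4              # toy dimension; production uses 512
VEC_BYTES = DIM * 4  # float32

# Compaction step: write exact vectors for 3 subjects to an SSD-backed file.
path = os.path.join(tempfile.mkdtemp(), "exact.f32")
with open(path, "wb") as f:
    for base in (0.0, 1.0, 2.0):
        f.write(struct.pack(f"<{DIM}f", *[base + i for i in range(DIM)]))

# Search time: map the file read-only; only touched pages fault into RAM,
# and the OS page cache handles eviction.
f = open(path, "rb")
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
pos = 2  # id_map position of a reranking candidate
vec = struct.unpack(f"<{DIM}f", mm[pos * VEC_BYTES:(pos + 1) * VEC_BYTES])
```

Slicing the map computes a byte offset from the id_map position; only the pages backing that slice are ever read from disk.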

Event Log
    │ reads all events for shard S up to epoch E
    ▼
Compaction Worker
    │ trains PQ codebook if new
    │ builds IndexIVFPQ, writes exact vectors to mmap file
    ▼
IndexSnapshot(version=E, shard=S) ──► swap atomic ref ──► ShardState
Old IndexSnapshot ──► Arc refcount drops to 0 ──► memory freed

No locks on the hot path. The search path reads the atomic ref once and holds it for the duration of the query.


Each shard exposes a single operation:

fn search(snapshot: &IndexSnapshot, query: &[f32; 512], k: u32, mask: &Bitset) -> Vec<PQCandidate>

Inputs are values. Output is a value. No side effects. The shard does not know about other shards, the aggregator, or the caller’s intent.

PQCandidate {
    subject_id: SubjectId,
    pq_score: f32,            # PQ-approximate inner product
    shard_id: u8,
    exact_vector: [f32; 512], # raw vector returned by shard alongside PQ score
}
RankedCandidate {
    subject_id: SubjectId,
    exact_score: f32,         # exact cosine similarity after reranking
    shard_id: u8,
}

The scatter-gather pipeline has five stages:

1. Fan-out: broadcast the query to all 5 shards (not 20 — PQ compression makes the full gallery fit on fewer nodes):

fn scatter(query: &[f32; 512], local_k: u32, partition: Partition, mask: &Bitset)
-> Vec<Future<Vec<PQCandidate>>>

2. Each shard returns its top-50 PQ-approximate results with exact vectors from its local IndexIVFPQ. The shard looks up the raw float32 vectors for its top-K candidates in exact_vectors and includes them in the response. This eliminates a second round-trip for reranking and decouples the aggregator from shard topology — the aggregator never needs to know which shard holds which vectors. The cost is ~100 KB extra per shard response (50 vectors × 2 KB each), which is negligible relative to the search latency budget.

3. Aggregator merges to top-100 globally:

fn aggregate_pq(shard_results: Vec<Vec<PQCandidate>>, candidate_k: u32) -> Vec<PQCandidate> {
    shard_results
        .into_iter()
        .flatten()
        .sorted_by(|a, b| b.pq_score.total_cmp(&a.pq_score))
        .take(candidate_k as usize) // candidate_k = 100 (top_k_pq)
        .collect()
}

4. Reranking: compute exact cosine similarity using the exact vectors already present on each PQCandidate, re-sort, and produce RankedCandidate values. The reranker is a pure function PQCandidate -> RankedCandidate — no network calls, no shard references, no mmap lookups:

fn rerank(
    candidates: Vec<PQCandidate>,
    query: &[f32; 512],
    global_k: u32,
) -> Vec<RankedCandidate> {
    candidates
        .into_iter()
        .map(|c| {
            let exact_score = cosine_similarity(query, &c.exact_vector);
            RankedCandidate { subject_id: c.subject_id, exact_score, shard_id: c.shard_id }
        })
        .sorted_by(|a, b| b.exact_score.total_cmp(&a.exact_score))
        .take(global_k as usize)
        .collect()
}

The type system encodes the pipeline stage: PQCandidate flows out of the shard and into the reranker; RankedCandidate flows out of the reranker and into the caller. There is no boolean exact flag — the type tells you which kind of score you have.

5. Return final top-K with exact scores to the caller.
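Stages 3 and 4 can be exercised end-to-end in a small Python sketch (dict-shaped candidates stand in for the Rust types; `heapq.nlargest` plays the merge):

```python
import heapq, math

def aggregate_pq(shard_results, candidate_k):
    # Pure fold: flatten per-shard lists, keep the globally best PQ scores.
    return heapq.nlargest(candidate_k,
                          (c for shard in shard_results for c in shard),
                          key=lambda c: c["pq_score"])

def rerank(candidates, query, global_k):
    # Pure map + sort: exact cosine on vectors already in the candidates.
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    ranked = [{"subject_id": c["subject_id"],
               "exact_score": cosine(query, c["exact_vector"])}
              for c in candidates]
    ranked.sort(key=lambda r: r["exact_score"], reverse=True)
    return ranked[:global_k]

shard_a = [{"subject_id": "s1", "pq_score": 0.9, "exact_vector": (1.0, 0.0)},
           {"subject_id": "s2", "pq_score": 0.5, "exact_vector": (0.0, 1.0)}]
shard_b = [{"subject_id": "s3", "pq_score": 0.8, "exact_vector": (0.6, 0.8)}]

top = aggregate_pq([shard_a, shard_b], candidate_k=2)  # s1, s3 by PQ score
final = rerank(top, query=(0.0, 1.0), global_k=1)      # exact cosine promotes s3
```

The PQ merge keeps s1 and s3, but exact cosine against the query promotes s3 — precisely the correction the rerank stage exists to make.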

The full pipeline composes as a pure function:

fn search_and_rerank(
    query: &[f32; 512],
    global_k: u32,
    partition: Partition,
    mask: &Bitset,
) -> Vec<RankedCandidate> {
    let pq_candidates = scatter(query, local_k=50, partition, mask)
        .collect_with_deadline(deadline_ms=200);
    let top_100 = aggregate_pq(pq_candidates, candidate_k=100);
    rerank(top_100, query, global_k)
}

No shared state. No locks. No callbacks. The scatter is a fan-out of identical gRPC calls; the gather is a merge-sort; the rerank is a local map over vectors already in memory.

SearchRequest {
    query_vector: bytes,   # 512 * 4 = 2048 bytes, little-endian float32
    local_k: uint32,       # how many candidates this shard should return (e.g. 50)
    partition: Partition,  # which logical partition to search
    filter_mask: bytes,    # pre-computed bitset (see Section 4)
}
SearchResponse {
    candidates: repeated PQCandidate, # PQ scores + exact vectors for reranking
    shard_id: uint32,
    snapshot_version: uint64,         # for observability: which index version answered
}

Wire Format: Python–Rust Boundary Protocol


Vectors cross the Python–Rust boundary as raw bytes over protobuf, not as repeated float:

message SearchRequest {
    bytes query_vector = 1;   // 512 × f32, little-endian = 2048 bytes
    uint32 local_k = 2;
    Partition partition = 3;
    bytes filter_mask = 4;
}
message PQCandidateProto {
    bytes subject_id = 1;     // 16-byte UUID
    float pq_score = 2;
    uint32 shard_id = 3;
    bytes exact_vector = 4;   // 512 × f32, little-endian = 2048 bytes
}
message SearchResponse {
    repeated PQCandidateProto candidates = 1;
    uint32 shard_id = 2;
    uint64 snapshot_version = 3;
}

Key decisions:

  • bytes not repeated float — Avoids per-element varint overhead. A 512-dim vector is a single 2048-byte field copy, zero encoding/decoding.
  • Endianness: little-endian (x86-64 native) — Both Python (numpy) and Rust (byteorder/transmute) use LE natively on x86-64. No byte-swapping on the hot path.
  • Vectors are always L2-normalised 512-dim float32 — 2048 bytes on the wire, every time. No length prefix or dimension field needed.
  • Python serialization — vector.tobytes() to encode, np.frombuffer(buf, dtype=np.float32) to decode. Both are zero-copy views when alignment permits.
  • Rust deserialization — byteorder::LittleEndian::read_f32_into() or std::slice::from_raw_parts with an alignment check. The shard validates len(query_vector) == 2048 and rejects malformed requests.
  • Accretion policy — New fields are appended; existing field numbers are never reused or reassigned. This is standard protobuf forward/backward compatibility. Shard and aggregator versions may differ by one release.
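A stdlib-only round-trip of the query-vector encoding (using `struct` and `array` in place of numpy; the function names are illustrative):

```python
import struct
from array import array

DIM = 512

def encode_query(vec):
    # Sender: pack as 512 little-endian float32s -> exactly 2048 bytes.
    return struct.pack(f"<{DIM}f", *vec)

def decode_query(buf):
    # Receiver: validate the length before interpreting the bytes.
    if len(buf) != DIM * 4:
        raise ValueError(f"expected {DIM * 4} bytes, got {len(buf)}")
    # array('f') reads native byte order; on x86-64 that is little-endian,
    # matching the wire format, so no byte swap occurs on the hot path.
    return array("f", buf).tolist()

original = [float(i % 7) for i in range(DIM)]
wire = encode_query(original)
roundtrip = decode_query(wire)
```

Because the test values are exactly representable in float32, the round-trip reproduces the input bit-for-bit.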

The fan-out is concurrent. Deadline: 200ms per shard. Stragglers are cancelled. The aggregator’s behaviour when shards are missing is governed by the [degradation] policy (see Section 8) — a configurable data value, not hard-coded logic. The policy specifies how many shards must respond (min_shards) and what to do with partial results (partial_result_policy): reject the search, annotate the response with a coverage warning, or degrade gracefully by returning results from the shards that did respond.
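The policy's effect can be sketched as a pure function over plain data (`apply_degradation_policy` and the dict shapes are hypothetical; the real policy is loaded from the [degradation] TOML):

```python
def apply_degradation_policy(responses, expected_shards, policy):
    """Decide what to do with a (possibly partial) set of shard responses."""
    responded = len(responses)
    if responded == expected_shards:
        return {"status": "ok", "responses": responses}
    # Enough shards for a valid result, even though some are missing.
    if responded >= policy["min_shards"]:
        return {"status": "ok_partial", "responses": responses}
    # Below min_shards: behaviour is data-driven, not hard-coded.
    action = policy["partial_result_policy"]
    if action == "reject":
        return {"status": "rejected", "responses": []}
    if action == "annotate":
        return {"status": "partial",
                "coverage": responded / expected_shards,
                "responses": responses}
    return {"status": "degraded", "responses": responses}

annotate = {"min_shards": 4, "partial_result_policy": "annotate"}
result = apply_degradation_policy(["r1", "r2", "r3"], 5, annotate)
```

Changing the deployment's failure behaviour means editing the TOML value, not this function.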


Large-scale deployments require independent searches across distinct gallery partitions (e.g., immigration and law enforcement databases). The naive implementation deploys a separate FAISS cluster per partition — two systems instead of one.

The simple approach: partitions are plain data labels attached to every vector at enrolment time and stored in the shard’s id_map. The shard’s search function receives a Partition value and interprets it as a filter on its local id_map.

SubjectRecord {
    subject_id: SubjectId, # opaque UUID
    partition: Partition,  # IABS | IDENT1
    enrolled_at: Timestamp,
    vector_pos: u64,       # position in the FAISS index
}

At shard-build time, the compaction worker builds per-partition bitsets:

ShardBitsets {
    iabs: Bitset,   # bit set for every vector whose SubjectRecord.partition == IABS
    ident1: Bitset, # bit set for every vector whose SubjectRecord.partition == IDENT1
    all: Bitset,    # union — for cross-partition searches
}

These bitsets are part of the IndexSnapshot value. They are computed once during compaction, stored immutably alongside the index, and passed as input to every search() call. There is no runtime branching on partition logic inside the search loop.

A cross-partition search passes all. An immigration-only search passes iabs. A combined search with different thresholds per partition runs two independent searches on the same snapshot and merges results — still a pure composition.
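As an illustration, Python's arbitrary-precision ints can stand in for the per-partition bitsets (a sketch; the production Bitset is a packed word array):

```python
IABS, IDENT1 = "IABS", "IDENT1"

def build_partition_bitsets(partitions):
    # Compaction-time pass: one bit per vector position, in id_map order.
    bits = {IABS: 0, IDENT1: 0}
    for pos, partition in enumerate(partitions):
        bits[partition] |= 1 << pos
    bits["all"] = bits[IABS] | bits[IDENT1]  # union for cross-partition search
    return bits

def masked_positions(bitset):
    # Positions whose bit is set -- the vectors a search may visit.
    return [pos for pos in range(bitset.bit_length()) if bitset >> pos & 1]

# Five vectors with their enrolment-time partition labels.
bitsets = build_partition_bitsets([IABS, IDENT1, IABS, IABS, IDENT1])
```

Passing `bitsets["IABS"]` to the search restricts it to immigration enrolments; `bitsets["all"]` searches everything — same snapshot, different data argument.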


Production deployments require metadata filtering: gender, age bracket, recording event level. The naive approach evaluates metadata predicates inside the FAISS search loop — one predicate check per candidate vector, 40 million times per shard.

The simple approach: compute the bitset once, pass it as data.

FilterCriteria {
    partition: Option<Partition>,
    gender: Option<Gender>,
    age_bracket: Option<AgeRange>,
    event_level: Option<EventLevel>,
}
fn compute_filter_mask(criteria: &FilterCriteria, shard_meta: &ShardMetadata) -> Bitset {
    // Walk the shard's metadata table once, set bits for matching records.
    // Returns an immutable Bitset value.
}

The Bitset is then passed directly into FAISS’s search_with_mask call. FAISS ANDs the bitset against its internal inverted list traversal. Zero predicate evaluation inside the hot path.

Common filter combinations (e.g., “IABS only, no other criteria”) are pre-computed at shard-build time and stored in the IndexSnapshot. Novel filter criteria compute their mask once per search request before fan-out, and the mask value is passed to every shard as part of the SearchRequest. The computation cost is O(vectors_on_shard / 64) using 64-bit SIMD bitwise ops — negligible compared to the IVF search.
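A toy version of `compute_filter_mask` over a dict-based metadata table (names and shapes hypothetical; `None` criteria act as wildcards):

```python
def compute_filter_mask(criteria, shard_meta):
    # One pass over the shard's metadata table; set a bit per matching record.
    mask = 0
    for pos, meta in enumerate(shard_meta):
        if all(meta.get(key) == want
               for key, want in criteria.items() if want is not None):
            mask |= 1 << pos
    return mask  # immutable int standing in for the Bitset value

shard_meta = [
    {"partition": "IABS",   "gender": "F"},
    {"partition": "IABS",   "gender": "M"},
    {"partition": "IDENT1", "gender": "F"},
]
mask = compute_filter_mask(
    {"partition": "IABS", "gender": "F", "age_bracket": None},
    shard_meta,
)
```

Only position 0 matches both predicates, so only its bit is set; the search loop itself never evaluates a predicate.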


The FAISS index is not the authoritative store. The event log is. The index is a materialised view, rebuilt from the log on demand.

The event log is formalised as an EventLogPort protocol in the domain layer with a defined contract: append semantics (immutable once written), strict ordering guarantees, read-from-offset for replay, and GC preconditions that the compaction layer must satisfy before pruning. See domain-design.md Section 2 for the full protocol definition.

Canonical source: The domain-layer event types in domain-design.md Section 1 (EnrolEvent, DeleteEvent) are the authoritative definitions. The Rust structs below are derived from them via protobuf translation. Any field additions or renames must be made in the domain types first.

EnrolEvent {
    event_id: Ulid,        # sortable unique ID
    subject_id: SubjectId,
    partition: Partition,
    vector: [f32; 512],    # L2-normalised template
    enrolled_at: Timestamp,
    source_ref: String,    # MSB transaction reference
}
DeleteEvent {
    event_id: Ulid,
    subject_id: SubjectId,
    partition: Partition,
    deleted_at: Timestamp,
    reason_code: String,
}

Events are append-only. Events are immutable once written. There is no “update” event — a re-enrolment is a new EnrolEvent with the same subject_id. The compaction process resolves the effective state by replaying events in order: the last EnrolEvent for a given subject_id before any DeleteEvent wins.
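The replay rule is small enough to state as code — a sketch with dict-shaped events (the real events are the structs above, ordered by Ulid):

```python
def replay(events):
    """Derive the effective (subject_id -> vector) map: last event wins."""
    effective = {}
    for ev in events:  # events arrive in log order (Ulid-sorted)
        if ev["type"] == "enrol":
            effective[ev["subject_id"]] = ev["vector"]
        elif ev["type"] == "delete":
            effective.pop(ev["subject_id"], None)
    return effective

log = [
    {"type": "enrol",  "subject_id": "s1", "vector": (0.1,)},
    {"type": "enrol",  "subject_id": "s2", "vector": (0.2,)},
    {"type": "delete", "subject_id": "s1"},
    {"type": "enrol",  "subject_id": "s1", "vector": (0.3,)},  # re-enrolment
]
state = replay(log)
```

s1 is deleted and then re-enrolled, so the compacted index contains its newest vector; there is no update-in-place anywhere.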

Port mapping: The FAISS adapter implements three split domain ports: VectorSearchPort (for search_ann — the scatter-gather read path), VectorLookupPort (for fetch — point lookup used by the verify path), and VectorMutationPort (for enrol/delete — the write path that appends events to the log). A single adapter instance may implement all three; the domain orchestrators receive only the port they need. See domain-design.md Section 2 for port definitions.

Compaction Produces the Next Index Version

fn compact(
    events: impl Iterator<Item = Event>,
    previous_snapshot: Option<&IndexSnapshot>,
    shard_assignment: &ShardAssignment,
) -> IndexSnapshot {
    // 1. Replay events to derive the effective set of (subject_id, vector) pairs.
    // 2. Assign each subject_id to a shard deterministically (consistent hash).
    // 3. For this shard's vectors:
    //    a. Train PQ codebook on a representative sample (if new or gallery changed significantly).
    //    b. Build IndexIVFPQ: train Voronoi centroids, quantize vectors with PQ, add to index.
    //    c. Write exact float32 vectors to mmap file on SSD for reranking.
    // 4. Build partition bitsets and metadata table.
    // 5. Return a new IndexSnapshot value.
}

Compaction runs on a separate compaction worker, not on the live shard process. The live shard continues serving searches against version N while version N+1 is being built. When compaction completes, the shard updates its current reference using compare-and-swap (CAS): the swap only succeeds if current still points to the expected previous version. If two compaction runs overlap and one completes after the other has already advanced the reference, the stale compaction’s CAS fails and it discards its snapshot rather than overwriting a newer one. This prevents a slow compaction run from regressing the shard to an older view of the gallery.

CAS failure observability: When a compaction CAS fails, the event should be logged and a metric counter incremented (e.g. compaction_cas_failures_total) for operational alerting. Frequent CAS failures indicate the compaction scheduling interval may be too aggressive relative to the compaction build time.
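The CAS discipline can be sketched with a lock-guarded reference (a Python stand-in for the Rust AtomicRef; `ShardRef` is hypothetical):

```python
import threading

class ShardRef:
    """Lock-guarded stand-in for the shard's AtomicRef with CAS semantics."""
    def __init__(self, snapshot):
        self._lock = threading.Lock()
        self.current = snapshot
        self.cas_failures = 0  # would increment compaction_cas_failures_total

    def compare_and_swap(self, expected, new):
        # The swap succeeds only if nobody advanced the reference meanwhile.
        with self._lock:
            if self.current is expected:
                self.current = new
                return True
            self.cas_failures += 1
            return False

v1, v2, v3 = {"version": 1}, {"version": 2}, {"version": 3}
ref = ShardRef(v1)
slow_base = ref.current                        # slow compaction starts from v1
ok_fast = ref.compare_and_swap(v1, v2)         # fast run advances v1 -> v2
ok_slow = ref.compare_and_swap(slow_base, v3)  # stale CAS fails; v3 discarded
```

The slow run's snapshot is simply dropped; the shard never regresses from version 2 back to a view built from version 1.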

Deletes are cheap at event-log time (append one record) and resolved at compaction time (omit the vector). There is no in-place deletion from the FAISS index, which FAISS does not support efficiently for IndexIVFPQ anyway. The maximum window of a deleted subject remaining searchable is bounded by the compaction interval — a configurable operational parameter, not an architectural constraint.

For compliance with UK data retention requirements, the event log itself may be pruned. The GC invariant is strict: a log entry may only be garbage-collected when the corresponding subject’s vectors are confirmed absent from ALL active snapshots, including snapshots held by in-flight searches. In practice this means GC must check that (a) the subject does not appear in any IndexSnapshot reachable via any shard’s AtomicRef, and (b) no in-flight search holds an Arc reference to an older snapshot that still contains the subject’s vectors.

Default GC policy (conservative): Retain all events for 3 × compaction_interval. This guarantees all snapshots containing the subject have been superseded without requiring distributed Arc refcount tracking. For example, with a 60-minute compaction interval, events are retained for 3 hours before becoming eligible for GC. The aggressive GC strategy (checking that all Arc refcounts for snapshots containing the subject have dropped to zero) is an optimisation for space-constrained deployments where the 3x retention window is too costly.


┌─────────────────┐
│ Event Log │ (append-only, durable)
│ EnrolEvent │
│ DeleteEvent │
└────────┬────────┘
│ periodic compaction
┌────────────────────────────────┐
│ Compaction Worker │
│ replay → assign → train PQ │
│ build IndexIVFPQ │
│ write exact vectors to mmap │
└────────────────┬───────────────┘
│ produces
┌────────────────────────────────┐
│ IndexSnapshot (v N+1) │ (immutable value)
│ faiss_index (IndexIVFPQ) │
│ pq_codebook | exact_vectors │
│ id_map | bitsets | meta │
└────────────────┬───────────────┘
│ atomic swap
┌────────────────────────────────┐
│ ShardProcess │ (identity)
│ current: AtomicRef<Snapshot> │
└────────────────────────────────┘
│ serves
┌───────────────────────────────────────┐
│ search(query, k, partition, mask) │ (pure function)
│ → Vec<PQCandidate> (PQ + exact vecs) │
└───────────────────────────────────────┘
│ fan-out to 5 shards
┌────────────────────────────────┐
│ Aggregator │
│ flatten → sort → take(100) │ (pure fold, PQ scores)
└────────────────┬───────────────┘
│ rerank (local, no network)
┌────────────────────────────────┐
│ Reranker │
│ PQCandidate → RankedCandidate │
│ exact cosine on vectors │
│ already in memory │
│ sort → take(K) │ (pure map + sort)
└────────────────────────────────┘

| Complected Thing | What We Do Instead |
| --- | --- |
| Mutable index updated during search | Atomic swap of immutable snapshots |
| Partition logic inside search loop | Bitset computed once, passed as data |
| Metadata filter evaluated per-vector | Pre-computed bitset, applied by FAISS |
| Separate cluster per partition | Data tag on every SubjectRecord |
| In-place delete from FAISS | Append DeleteEvent; resolve at compaction |
| Aggregator knowing about shard topology | Shards return exact vectors with PQ scores; aggregator receives a list of futures with all data needed for reranking |
| Boolean flags encoding pipeline stage | Separate types: PQCandidate (pre-rerank) and RankedCandidate (post-rerank) |
| Hard-coded degradation logic | Configurable policy: min_shards + partial_result_policy as TOML data |
| Compaction race conditions | CAS on AtomicRef prevents stale snapshots overwriting newer ones |
| Search coupled to enrolment pipeline | Event log decouples writes from reads |
| Full vectors in RAM at 200M scale | PQ compression to 64 bytes/vector; exact vectors on SSD for reranking only |

These are data, not architectural decisions. They are tunable without changing code:

[shard]
count = 5 # reduced from 20; PQ fits full gallery on fewer nodes
vectors_per_shard = 40_000_000 # at 200M gallery
[faiss]
index_type = "IndexIVFPQ" # IVF + Product Quantization; replaces IndexIVFFlat
metric = "INNER_PRODUCT" # equivalent to cosine for L2-normalised vectors
nlist = 16384 # Voronoi cells (2^14, ~2.6 × sqrt(40M))
nprobe = 64 # cells to visit per query; recall vs latency trade-off
pq_m = 64 # number of PQ sub-vectors (512-dim / 64 = 8-dim each)
pq_nbits = 8 # bits per sub-quantizer (256 centroids per sub-vector)
[reranking]
enabled = true
top_k_pq = 100 # candidates from PQ search to rerank with exact vectors
exact_storage = "mmap" # memory-mapped exact vectors on SSD; not kept in RAM
[compaction]
interval_minutes = 60 # how often to rebuild from event log
max_delete_lag_minutes = 120 # worst-case time a deleted subject remains searchable
[search]
local_k = 50 # candidates returned per shard before global merge
global_k = 10 # final candidates returned to caller after reranking
deadline_ms = 200 # per-shard search deadline
[degradation]
min_shards = 4 # minimum shards that must respond for a valid result
partial_result_policy = "annotate" # "reject" | "annotate" | "degrade"
# reject: fail the search if fewer than min_shards respond
# annotate: return results with a coverage warning flag
# degrade: return best-effort results from responding shards
[partitions]
names = ["IABS", "IDENT1"] # extensible without code change
[tiers]
tier1 = "IndexIVFPQ" # primary: CPU, 200M scale; ~12.5 GB RAM, ~400 GB SSD
tier2 = "DiskANN" # future: SSD-backed, 500M+ scale, ~64 GB RAM for 1B vectors
tier3 = "CAGRA" # future: GPU batch analytics, 33-77x throughput vs CPU HNSW

Memory Budget (5 shards × 40M vectors each)

| Storage | Per Shard | Total (5 shards) | Notes |
| --- | --- | --- | --- |
| PQ-compressed index (RAM) | ~2.5 GB | ~12.5 GB | 40M × 64 bytes; all in RAM |
| Exact vectors (SSD, mmap) | ~80 GB | ~400 GB | 40M × 2048 bytes; pages loaded on demand for reranking |
| Total RAM | ~3–4 GB | ~15–20 GB | Including id_map, bitsets, OS overhead |
| Total SSD | ~80 GB | ~400 GB | Exact vector storage for reranking only |
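These figures follow directly from the [shard] and [faiss] configuration values — a quick arithmetic check (decimal GB):

```python
# Re-derive the memory budget from the configuration values (decimal GB).
vectors_per_shard = 40_000_000
shard_count = 5
pq_m, pq_nbits = 64, 8   # PQ sub-vectors × bits per sub-quantizer
dim, f32_bytes = 512, 4  # vector dimension × bytes per float32

pq_bytes_per_vec = pq_m * pq_nbits // 8                        # 64 bytes of PQ codes
ram_per_shard_gb = vectors_per_shard * pq_bytes_per_vec / 1e9  # ~2.56 GB
ssd_per_shard_gb = vectors_per_shard * dim * f32_bytes / 1e9   # ~81.92 GB
total_ram_gb = shard_count * ram_per_shard_gb                  # ~12.8 GB
total_ssd_gb = shard_count * ssd_per_shard_gb                  # ~409.6 GB
```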

Tier 2: DiskANN (SSD-Backed, 500M+ Scale)

When the gallery exceeds 500M identities, RAM pressure makes even IVF-PQ expensive to distribute. DiskANN (Microsoft's Vamana graph) addresses this:

  • Architecture: Vamana graph centroids in RAM; full vectors on SSD. Fewer SSD reads per query than HNSW-IF because the graph has shorter average search paths.
  • RAM: ~64 GB RAM sufficient for 1B vectors (vs 400 GB for IVF-Flat at 200M).
  • Latency: <5ms p99 at ≥95% Recall@1 on billion-scale datasets.
  • Build acceleration: DiskANN index build is parallelisable; GPU acceleration provides 40× speedup over CPU build.
  • Production status: SQL Server 2025 integrates DiskANN natively (public preview 2025); Rust rewrite available; stateless orchestrator model fits the epochal snapshot design.

Migration path: the IndexSnapshot struct gains a DiskANNGraph field; the shard’s search() function dispatches to DiskANN instead of FAISS. The scatter-gather topology, aggregator, reranker, and compaction worker are unchanged.

Tier 3: CAGRA / cuVS (GPU Batch Analytics)


For batch video surveillance identity resolution — where many queries arrive simultaneously — GPU graph search provides dramatic throughput advantages:

  • Technology: CAGRA (NVIDIA cuVS), integrated into FAISS in May 2025 via Meta + NVIDIA collaboration.
  • Throughput: 33–77× vs CPU HNSW at 90–95% recall; millions of QPS at batch size 10k on A100/H100.
  • Build speedup: 2.2–27× vs CPU HNSW build.
  • Use case: Not for single-query online identification (GPU launch overhead hurts single-query latency). Designed for offline or near-real-time batch workloads: nightly watchlist sweeps, video stream identity clustering.
  • API: Same FAISS Python API; GPU index is a drop-in backend. UFME’s adapter wraps this transparently.

In summary, each concern stays decomplected:

  • No complected state: The shard identity (socket address) is separate from its current value (snapshot reference), which is separate from any historical value (old snapshots still in flight).
  • No complected time: The event log is the timeline. The index is a point-in-time view derived from it.
  • No complected filtering: Bitsets are plain byte arrays computed before search, passed as arguments, not embedded in search logic.
  • No complected partitioning: Partition is a field on a data record, not a routing rule or a separate cluster.
  • No complected enrolment: Enrol writes one event. It does not touch the index. Decoupled by design.
  • No complected memory: PQ-compressed vectors in RAM for fast ANN; exact vectors on SSD for precise reranking. Each tier holds exactly what it needs.
  • No complected reranking: Shards return exact vectors alongside PQ scores. The reranker is a local map — no second round-trip, no coupling to shard topology.
  • No complected pipeline stages: PQCandidate and RankedCandidate are distinct types. The pipeline stage is encoded in the type, not in a boolean flag.
  • No complected degradation logic: The aggregator’s behaviour when shards are missing is a configurable TOML policy, not branching code.
  • Pure functions at every boundary: search(), aggregate_pq(), rerank(), compact(), compute_filter_mask() — all pure, all independently testable, all composable.