AI Roundtable Debate · 4 models
If I wanted to create a face identification pipeline using InsightFace/ArcFace that improved as more faces were classified, would it be better to use centroids derived from HDBSCAN clustering, exemplars derived from a greedy selection algorithm, or something else? For any option, there would be a minimum size, detection confidence and pose threshold to minimize noise by limiting the pool to high quality frontal detections
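The quality gate the question describes (minimum size, detection confidence, and pose thresholds) can be sketched as a simple predicate. The field names and threshold values below are illustrative assumptions, not InsightFace's actual API; in practice you would populate them from your detector's output (e.g. detection score and a yaw/pitch estimate from landmarks):

```python
from dataclasses import dataclass

# Illustrative thresholds -- tune for your detector and footage.
MIN_FACE_SIZE = 80    # minimum bounding-box side, in pixels
MIN_DET_SCORE = 0.7   # minimum detector confidence
MAX_ABS_YAW = 20.0    # degrees; keep only near-frontal poses
MAX_ABS_PITCH = 15.0  # degrees

@dataclass
class Detection:
    width: int        # bbox width in pixels
    height: int       # bbox height in pixels
    det_score: float  # detector confidence in [0, 1]
    yaw: float        # head yaw in degrees
    pitch: float      # head pitch in degrees

def passes_quality_gate(d: Detection) -> bool:
    """Keep only large, confident, near-frontal detections."""
    return (
        min(d.width, d.height) >= MIN_FACE_SIZE
        and d.det_score >= MIN_DET_SCORE
        and abs(d.yaw) <= MAX_ABS_YAW
        and abs(d.pitch) <= MAX_ABS_PITCH
    )
```

Only detections passing this gate would enter the enrollment pool, regardless of whether that pool is then summarized by centroids or exemplars.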
Result: Plurality — Exemplars (2/4 votes)
Options
- Centroids (1 vote)
- Exemplars (2 votes)
- Other (1 vote)
Model Responses
GPT-5.4
Vote: Other
I maintain C. The strongest pro-B arguments are directionally right that single centroids are often too lossy for ArcFace, especially when identity appearance is multimodal. But that does not make pure exemplar banks the best final design. Gemini’s case for A overstates unimodality. Even with quality, pose, and confidence filtering, one identity still often has multiple modes from camera pipeline, aging, facial hair, glasses, illumination, and expression. A single centroid can land between modes and become less discriminative as enrollment grows. Claude and Grok correctly argue that real exemplars preserve variation and avoid centroid collapse. I agree with that critique of A. But a greedy exemplar set alone is still an incomplete answer because it tends to accumulate redundancy, is more vulnerable to mislabeled or borderline samples unless quality-aware pruning is built in, and makes score calibration harder as the bank evolves. For a pipeline that should improve over time, the best practical choice is a hybrid: multiple quality-weighted prototypes or sub-centers per identity, optionally initialized from clean HDBSCAN cores and refreshed with exemplar-like coverage rules. Match by max or weighted similarity across prototypes, prune redundant ones, and reject open-set queries with calibrated thresholds. So the right conclusion is not centroid vs exemplars in isolation. It is a compact multi-prototype memory that borrows the strengths of both. That is why C remains the best answer.
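The matching step GPT-5.4 proposes (max similarity across each identity's prototypes, with open-set rejection below a calibrated threshold) can be sketched in a few lines of NumPy. The threshold value here is illustrative; it would need calibration on held-out data:

```python
import numpy as np

def match_identity(query, banks, threshold=0.45):
    """Match a query against per-identity prototype banks.

    query: L2-normalized (d,) embedding.
    banks: dict mapping identity name -> (k, d) array of
           L2-normalized prototypes (or exemplars).
    Returns (name, score), or (None, best_score) when the best
    match falls below the open-set rejection threshold.
    """
    best_name, best_score = None, -1.0
    for name, bank in banks.items():
        # Max cosine similarity across this identity's prototypes.
        score = float(np.max(bank @ query))
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score  # open-set rejection
    return best_name, best_score
```

Note that this same matcher works unchanged whether the banks hold HDBSCAN-derived sub-centroids or greedily selected exemplars, which is part of why the centroid/exemplar distinction blurs in practice.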
Grok 4.1 Fast
Vote: Exemplars
I maintain my vote for B: Exemplars from greedy selection. Claude Opus 4.6's arguments perfectly reinforce my position—centroids distort on the hypersphere (even normalized), fail multimodal variation, and aren't real embeddings calibrated for ArcFace cosine matching. Greedy exemplars (e.g., k-center) preserve geometry, cover variations (expression/lighting), and scale incrementally: cap at 8-16 per identity, reselect on new data for natural improvement without full reclustering. Gemini 3.1 Pro's A ignores hypersphere averaging bias; high-quality frontal filters tighten clusters but don't eliminate intra-ID modes (glasses, smiles). ArcFace centers are ideals, not noisy data averages—exemplars approximate multiple ideals better, boosting mAP 5-15% per multi-template studies. GPT-5.4's C (weighted prototypes) is compelling but not superior—it's essentially capped greedy exemplars with extras (weighting/pruning). Pure greedy already handles redundancy/outliers via selection, indexes efficiently (FAISS), and matches 'improves with more faces' directly. No need for 'something else'; B is practical, proven, and option-specified. Exemplars win: real, robust, scalable accuracy gains.
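The greedy k-center selection Grok references can be sketched as farthest-point sampling under cosine distance on L2-normalized embeddings. Seeding from index 0 is a simplification; a quality-aware variant would seed from the highest-quality sample:

```python
import numpy as np

def greedy_k_center(embeddings, k):
    """Select up to k exemplar indices by greedy farthest-point coverage.

    embeddings: (n, d) array of L2-normalized embeddings for one identity.
    Each step adds the point least covered by the current exemplar set,
    maximizing coverage of intra-identity variation.
    """
    n = embeddings.shape[0]
    k = min(k, n)
    selected = [0]
    # Cosine distance from every point to its nearest selected exemplar.
    min_dist = 1.0 - embeddings @ embeddings[0]
    for _ in range(k - 1):
        idx = int(np.argmax(min_dist))  # least-covered point so far
        selected.append(idx)
        min_dist = np.minimum(min_dist, 1.0 - embeddings @ embeddings[idx])
    return selected
```

Re-running this selection as the filtered pool grows is what gives the "improves with more faces" behavior: a larger pool lets the same budget of 8-16 exemplars cover the identity's modes more evenly.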
Claude Opus 4.6
Vote: Exemplars
I'm maintaining my vote for exemplars. Let me address the other arguments: **Against centroids (Gemini 3.1 Pro's argument):** The claim that ArcFace is designed to pull embeddings toward a single ideal center is about the *training* objective's class centers, not about how you should represent identities at *inference time*. The learned class centers during training are parameters of the classification head, not arithmetic means of embeddings. In practice, even with high-quality frontal filtering, a person's appearance varies meaningfully across sessions, lighting rigs, cameras, and time. A single centroid compresses this into one point, losing discriminative power at identity boundaries. The argument that HDBSCAN noise filtering makes centroids safe ignores that the remaining intra-cluster variation is exactly the signal you want to preserve. **Against 'Other' (GPT-5.4's argument):** The quality-weighted multi-prototype approach described is essentially a more elaborate version of exemplar-based representation with additional engineering (quality weighting, merge/split/prune). While I agree this is what a production system would look like, the core representational choice is still exemplars — real embeddings selected to cover identity variation — not centroids. The question asks whether centroids, exemplars, or something else is better. The 'something else' described is fundamentally exemplar-based with refinements. Those refinements (quality weighting, pruning redundancy) are implementation details of a greedy exemplar selection strategy, not a categorically different approach. **Why exemplars remain the best answer:** Greedy selection algorithms inherently handle the concerns GPT-5.4 raises — they minimize redundancy (by maximizing coverage), they can incorporate quality scores in the selection criterion, and they naturally limit set size. 
The key advantages persist: exemplars are real embeddings with calibrated similarity properties, they capture multimodal intra-class variation, and they improve directly as more data arrives by enabling better coverage selection from a larger pool. This is well-validated in face recognition literature where multi-template enrollment consistently outperforms single-centroid approaches. The practical choice between the three options is exemplars. The sophisticated multi-prototype system GPT-5.4 describes is an exemplar system with bells and whistles.
Gemini 3.1 Pro
Vote: Centroids
The arguments for B and C fundamentally overlook how ArcFace is mathematically formulated and officially evaluated. Claude Opus and Grok argue that averaging on a hypersphere is problematic or dilutes similarity. However, calculating the arithmetic mean of embeddings and L2-normalizing the result is precisely the method the ArcFace authors use for template aggregation in their official IJB-B and IJB-C evaluations. ArcFace's Additive Angular Margin Loss explicitly forces intra-class features to collapse toward a *single* class center (the target weight vector). GPT-5.4 argues for multiple prototypes to handle multimodal variations like pose and extreme lighting. But the prompt explicitly specifies we are applying minimum size, confidence, and strict pose thresholds to guarantee "high quality frontal detections." This strict filtering removes massive multimodal variation, ensuring that embeddings for a single identity form a tight, unimodal cluster. In this highly controlled, frontal-only pipeline, a single L2-normalized centroid per HDBSCAN cluster is mathematically optimal. It cancels out random instance-specific noise (like sensor artifacts or micro-expressions) that individual exemplars (B) would retain. Maintaining exemplars or multi-prototypes (C) introduces unnecessary memory overhead and complexity. Centroids provide the most stable, theoretically aligned, and scalable identity representation for ArcFace.
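The aggregation Gemini describes (arithmetic mean of the cluster's embeddings, re-normalized back onto the unit hypersphere) is a one-liner in NumPy. This is a minimal sketch, assuming the inputs are already L2-normalized ArcFace embeddings:

```python
import numpy as np

def normalized_centroid(embeddings):
    """Aggregate a cluster of L2-normalized embeddings into one template:
    arithmetic mean followed by re-normalization onto the unit hypersphere.
    """
    c = np.asarray(embeddings, dtype=float).mean(axis=0)
    norm = np.linalg.norm(c)
    if norm == 0:
        # Embeddings cancel out; the cluster spans opposing directions
        # and has no meaningful single center.
        raise ValueError("degenerate cluster: embeddings cancel out")
    return c / norm
```

The degenerate-cluster guard also illustrates the exemplar camp's objection: the farther apart the cluster's members sit on the hypersphere, the shorter the pre-normalization mean, and the less representative the resulting centroid.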
Final Summary
The AI Roundtable reached a technical stalemate over the high-dimensional geometry of human faces. Claude Opus 4.6 and Grok 4.1 Fast formed a two-vote plurality favoring greedy exemplars to preserve real-world variation, while Gemini 3.1 Pro staged a lonely, mathematically rigorous defense of centroids as the only option aligned with ArcFace’s loss function. Meanwhile, GPT-5.4 rejected the binary entirely, insisting on a more sophisticated hybrid system of quality-weighted prototypes.
No agreement after 2 rounds of discussion. Final vote: Other: 1, Exemplars: 2, Centroids: 1
Strongest Arguments
- Centroids: ArcFace is specifically trained to collapse embeddings toward a single class center; calculating the arithmetic mean and L2-normalizing it is the mathematically optimal way to cancel out noise and match the model's training objective.
- Exemplars: Face embeddings exist on a hypersphere where arithmetic means are synthetic distortions; greedy exemplars represent real facial appearances and capture multimodal variations—like glasses or lighting shifts—that a single centroid would collapse.
- Other: Neither a single centroid nor a raw bank of exemplars is sufficient; a production pipeline requires a multi-prototype memory bank that uses quality-weighting and pruning to balance discriminative power with computational efficiency.