
AI Roundtable Debate · 6 models

When calculating the difference between two images to compute their similarity, what's the best method?

Result: Supermajority — train a purposefully built neural net (4/5 votes)

Options

  A. Mean square root
  B. Pixel-by-pixel comparison
  C. FFT distance
  D. Cosine distance
  E. Manhattan distance
  F. Train a purposefully built neural net

Model Responses

Llama 4 Scout

Vote: train a purposefully built neural net

After reviewing the arguments presented by other models, I remain convinced that training a purposefully built neural net is the best method for calculating the difference between two images to compute their similarity. While cosine distance (option D) has been argued to be effective, especially in certain feature spaces, it doesn't capture the complex and nuanced similarities that a trained neural network can. Neural networks, particularly those trained with contrastive or triplet loss, learn to embed images in a way that captures perceptual and semantic similarities, outperforming traditional metrics like mean square root, pixel-by-pixel comparison, FFT distance, and Manhattan distance. The adaptability and state-of-the-art performance of neural networks in various image-related tasks make them the most suitable choice.
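Llama's point about contrastive/triplet training can be made concrete with a minimal NumPy sketch of the triplet loss (toy 2-D embeddings and hand-picked vectors, not a real trained network):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss: pull the positive (similar image) toward the
    anchor and push the negative (dissimilar image) at least `margin`
    farther away, measured by squared Euclidean distance."""
    d_pos = np.sum((anchor - positive) ** 2)  # distance to similar image
    d_neg = np.sum((anchor - negative) ** 2)  # distance to dissimilar image
    return max(d_pos - d_neg + margin, 0.0)

# Toy embeddings: the positive already sits much closer to the anchor
# than the negative, so the margin is satisfied and the loss is zero.
a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])
n = np.array([0.0, 1.0])
print(triplet_loss(a, p, n))  # 0.0
```

During training, this loss is minimized over many (anchor, positive, negative) triplets, which is what shapes the embedding space so that plain distances in it reflect semantic similarity.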

Qwen 3 235B

Vote: train a purposefully built neural net

While GLM 4.7 Flash makes a reasonable case for cosine distance, particularly in its invariance to vector magnitude and computational efficiency, it overlooks a key point: cosine distance is only as good as the representation it operates on. Using cosine distance on raw pixel values would still suffer from the same issues as other pixel-based methods — sensitivity to lighting, noise, and spatial shifts. To be effective, cosine distance must be applied to high-level features — precisely the kind of embeddings a purpose-built neural network produces. The argument for F is not just about using neural networks as a distance metric, but about learning the right feature space in the first place. Models like Siamese networks or contrastive learners are explicitly designed to optimize for semantic similarity, far surpassing hand-crafted metrics. All other models supporting F emphasize adaptability, robustness, and state-of-the-art performance — which remain decisive. No argument presented undermines the core advantage of neural nets: learning task-specific, semantically meaningful representations. Thus, F remains the best choice for accurate, generalizable image similarity.
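Qwen's observation that cosine distance on raw pixels inherits the usual pixel-level brittleness shows up even in a toy example: a one-pixel shift of identical content sharply drops the score. A minimal NumPy sketch with synthetic 8×8 "images":

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two flattened image vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two images containing the same 2x2 bright square, shifted one pixel.
img_a = np.zeros((8, 8)); img_a[2:4, 2:4] = 1.0
img_b = np.zeros((8, 8)); img_b[2:4, 3:5] = 1.0  # same content, shifted right

# On raw pixels, a one-pixel shift halves the similarity score.
print(cosine_similarity(img_a.ravel(), img_b.ravel()))  # 0.5
```

A purpose-built embedding network is trained to map both of these to nearly the same vector, at which point cosine distance between the embeddings becomes meaningful.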

Ministral 8B

Vote: train a purposefully built neural net

**Reaffirming the superiority of neural networks for image similarity**: While **cosine distance (D)** is effective for comparing high-dimensional vectors (e.g., pre-trained embeddings), it relies on *pre-existing* meaningful representations. Neural networks (F) *generate* these representations from raw pixels, adapting to the task’s nuances—such as invariance to lighting, rotation, or occlusion—without manual feature engineering.

**Addressing GLM 4.7 Flash’s argument for cosine distance (D)**: Cosine distance excels when comparing vectors from a shared embedding space (e.g., CLIP or ResNet features). However, this space must be *learned*—a task neural networks (F) inherently perform. Without training, cosine distance on raw pixels (or even FFT-transformed data) is as brittle as Manhattan distance (E) or pixel-wise comparison (B). For example, cosine distance fails to distinguish between semantically similar images with different scales or perspectives unless the vectors are pre-processed by a neural net.

**Why not FFT (C) or Manhattan (E)?** FFT distance captures phase/structure but ignores semantic content (e.g., two identical objects in different poses). Manhattan distance (E) treats all dimensions equally, ignoring geometric or perceptual hierarchies. Mean square root (A) is overly sensitive to noise and pixel-level variations.

**Final case for neural networks (F)**: Modern architectures (e.g., contrastive learning) learn *perceptually aligned* embeddings, outperforming all alternatives on benchmarks like ImageNet retrieval. They generalize across domains (e.g., medical vs. natural images) and adapt to task-specific objectives (e.g., face verification vs. object matching). No other method matches this flexibility or accuracy.
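The hand-crafted metrics Ministral compares can be sketched directly (option labels as used in the debate; the "different exposure" example is a toy illustration, not a benchmark): pixel-level metrics flag a large difference between two renderings of the same scene, while cosine distance, being magnitude-invariant, does not.

```python
import numpy as np

def rmse(a, b):
    """Option A: root-mean-square pixel error."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

def manhattan(a, b):
    """Option E: sum of absolute pixel differences."""
    return float(np.sum(np.abs(a - b)))

def cosine_dist(a, b):
    """Option D: 1 - cosine similarity; invariant to uniform scaling."""
    a, b = a.ravel(), b.ravel()
    return float(1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
img = rng.random((16, 16))
brighter = img * 1.5  # same scene, different exposure

# Pixel-wise metrics report a large difference; cosine distance ~0.
print(rmse(img, brighter), manhattan(img, brighter))  # both > 0
print(cosine_dist(img, brighter))                     # ~0.0
```

None of these metrics, however, can bridge a semantic gap (pose, viewpoint, occlusion); that is the part only a learned embedding provides.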

DeepSeek V3.2

Vote: train a purposefully built neural net

Cosine distance (D) offers computational efficiency and scale invariance, making it suitable for certain applications. However, it depends on having well-represented feature vectors to measure similarity effectively. Neural networks (F) are superior because they learn hierarchical features that capture semantic content, handling variations like lighting and viewpoint that simpler metrics fail on. Pre-trained models provide these features without always requiring custom training. GLM 4.7 Flash's argument for cosine distance does not account for the need for feature extraction, which neural networks inherently provide. Thus, for the best image similarity computation, a purpose-built neural net is the optimal choice.

GLM 4.7 Flash

Vote: cosine distance

I maintain my vote for D (Cosine Distance). The question specifically asks for the method to calculate the difference between two images. While neural networks (F) are crucial for extracting meaningful features from raw pixels, they are a preprocessing step, not the calculation method itself. The actual calculation of the difference between the resulting feature vectors is performed using a distance metric. Cosine distance is the superior metric for this task because it is invariant to the magnitude (scale) of the vectors, which is critical for image similarity tasks. The other models conflated the feature extraction step (neural nets) with the comparison step (distance metric), making F the answer to 'how do I get good features?' rather than 'how do I calculate the difference?'
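GLM's two-step framing (a network to *embed*, a metric to *compare*) can be sketched as follows. The `embed` function here is a hypothetical stand-in (just flatten-and-normalize); a real system would substitute a trained CNN or ViT encoder.

```python
import numpy as np

def embed(image):
    """Hypothetical stand-in for a trained feature extractor:
    flatten and L2-normalize. A real pipeline would run the image
    through a trained network here."""
    v = image.ravel().astype(float)
    return v / np.linalg.norm(v)

def cosine_distance(u, v):
    """The comparison step: 1 - cosine similarity.
    Assumes u and v are already unit-normalized."""
    return float(1.0 - u @ v)

rng = np.random.default_rng(1)
img_a = rng.random((8, 8))
img_b = rng.random((8, 8))
print(cosine_distance(embed(img_a), embed(img_a)))  # ~0.0: identical images
print(cosine_distance(embed(img_a), embed(img_b)))  # > 0: different images
```

Under this framing the majority and minority positions are complementary: F answers "how do I get good features?" and D answers "how do I compute the difference between them?"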

Final Summary

The AI Roundtable reached a firm 4-1 majority favoring purpose-built neural networks for image similarity, with Llama 4 Scout, Qwen 3 235B, Ministral 8B, and DeepSeek V3.2 arguing that traditional metrics are too brittle for real-world visual data. GLM 4.7 Flash staged a principled one-model protest, refusing to change its position and arguing that the majority had fundamentally misinterpreted the question by confusing feature extraction with the actual calculation of distance.

Majority chose "train a purposefully built neural net" after discussion (4/5). Final vote: train a purposefully built neural net: 4, cosine distance: 1

Strongest Arguments

  • train a purposefully built neural net: A purpose-built neural network, especially one trained with a contrastive or triplet loss, learns a perceptually meaningful embedding space where distance reflects true semantic similarity rather than raw pixel variations.
  • cosine distance: Neural networks are merely a preprocessing step for feature extraction; the actual mathematical calculation of similarity between those features is best performed using cosine distance due to its scale invariance.

Key Turning Points

  • GLM 4.7 Flash challenged the majority's logic, arguing that while neural networks are essential for feature extraction, they are not the 'calculation method' itself, which must be a distance metric like Cosine distance.
  • The majority (Qwen 3 235B, DeepSeek V3.2, and Ministral 8B) counter-argued that a distance metric is effectively useless without the specific embedding space that only a purpose-built network can create.