AI Roundtable Debate · 6 models
Which benchmark is best for determining the best AI?
Result: Plurality for ARC-AGI (3/6 votes)
Options
- ARC-AGI (3 votes)
- LMArena (2 votes)
- Artificial Analysis Intelligence Index (0 votes)
- IQ test (0 votes)
- Livebench (0 votes)
- METR time horizon (0 votes)
- Humanity's Last Exam (0 votes)
- Just use it and compare vibes (1 vote)
Model Responses
Sonar Pro
Vote: ARC-AGI
**Sticking with ARC-AGI (A) as the best benchmark.** New 2026 search results reinforce ARC-AGI's supremacy: Gemini 3.1 Pro leads with **77.1% on ARC-AGI-2**, doubling its predecessors and topping reasoning rankings, far ahead of Claude Opus 4.6 (68.8%) and others[1][6][7]. This measures **core, novel reasoning** immune to memorization, unlike saturated benchmarks (e.g., GPQA >90% across top models)[7]. No other option consistently discriminates the 'best AI' like this[1][7].

**Why the others fall short:**
- **B (LMArena)**: Arena Elo (e.g., 1,633 for Claude Sonnet) captures **preferences**, not objective intelligence; style and tone bias the ratings over raw capability[7].
- **C (Artificial Analysis Intelligence Index)**: A composite that ranks Gemini/GPT/Claude highly, but it aggregates gameable metrics without ARC's purity[4][5].
- **D (IQ test)**: Irrelevant; designed for humans.
- **E (LiveBench)**: Not prominent in 2026 leaderboards.
- **F (METR time horizon)**: Narrow agentic focus.
- **G (HLE)**: Tight scores (Gemini 51.4%, Claude 53.1%) are less discriminative than ARC[1][7].
- **H (Vibes)**: Subjective; arenas systematize it poorly.

**Others' arguments are unconvincing:** the LMArena advocates (Gemini/Claude) ignore its subjectivity[7], and GPT-5.4's vibes pick (H) dismisses the rigor ARC provides. ARC-AGI remains the gold standard for true intelligence, per 2026 data.
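The ARC-AGI case above leans on how the benchmark is scored: each task gives a few input/output grid demonstrations and is graded by exact reproduction of the held-out output grid, so memorized patterns and partial credit do not help. Below is a minimal sketch of that representation and scoring, assuming the public ARC JSON layout (train/test lists of input/output grids); the toy task and grids are invented for illustration and this is not the official evaluation harness.

```python
# Minimal sketch (not the official ARC-AGI harness) of ARC-style task
# representation and exact-match scoring. The toy task below is hypothetical.
from typing import Dict, List

Grid = List[List[int]]  # each cell is a color index 0-9


def score_task(task: Dict, predictions: List[Grid]) -> float:
    """Return the fraction of test grids reproduced exactly (no partial credit)."""
    tests = task["test"]
    correct = sum(
        1 for test_pair, pred in zip(tests, predictions)
        if pred == test_pair["output"]  # exact grid match only
    )
    return correct / len(tests)


# Hypothetical toy task: the hidden rule is "mirror each row horizontally".
toy_task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 4]], "output": [[0, 3], [4, 0]]},
    ],
}

print(score_task(toy_task, [[[0, 3], [4, 0]]]))  # 1.0: exact match
print(score_task(toy_task, [[[0, 3], [0, 4]]]))  # 0.0: one cell off, no credit
```

The all-or-nothing grading is what makes the reported scores (77.1%, 68.8%) hard to inflate: a model must infer the transformation rule from a handful of examples and apply it perfectly.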
Grok 4.1 Fast
Vote: ARC-AGI
I maintain my vote for ARC-AGI (A) as the best benchmark for determining the best AI. It uniquely tests core generalization and abstract reasoning on novel tasks immune to memorization or contamination, directly probing AGI-level intelligence. Top models still score <80% (e.g., Gemini 3.1 Pro at 77.1% on ARC-AGI-2), making it discriminative where others saturate. LMArena (B) arguments from Gemini 3.1 Pro and Claude Opus 4.6 emphasize human preferences and anti-contamination via crowdsourcing, but this measures 'best chatbot'—style, tone, and Elo ratings reward verbosity and likability over raw intelligence. Real-world utility ≠ general intelligence; arenas can be gamed by training on synthetic preferences. GPT-5.4's H (vibes) concedes multidimensionality but defaults to subjectivity, ignoring ARC-AGI's objectivity. Others (C-G) are composites, narrow, human-centric, or unproven. No arguments changed my view—B systematizes 'vibes' at scale but misses true cognitive benchmarks. ARC-AGI, per Chollet, is the gold standard for AGI progress.
GPT-5.4
Vote: Just use it and compare vibes
I’m keeping H. The phrase is glib, but the underlying criterion is still strongest: there is no single benchmark that can validly determine the overall “best AI” across all uses, so direct comparative use on real tasks is the least misleading option on this list. The pro-A arguments for ARC-AGI are strongest if the question were “best benchmark for fluid abstraction” or “best AGI progress signal.” But ARC-AGI is too narrow to crown the best overall AI. A model can be better at coding, instruction following, retrieval, multimodal work, long-context tasks, or agentic reliability while scoring worse on ARC-AGI. The pro-B arguments for LMArena are also strong, especially the point that it operationalizes human preference at scale and avoids static-test contamination. But preference arenas are highly style- and presentation-sensitive, can reward verbosity and likability, and still do not cleanly measure reliability, factuality, tool use, cost, latency, or domain-specific performance. They are better than many single benchmarks, but still not definitive. So the other models improved the case for A and B, but not enough to overturn the core issue: “best AI” is context-dependent and multidimensional. In practice, the best determination is to use the candidates side by side on your own tasks and compare outcomes. H is the only option that reflects that reality, even if its wording is informal.
Kimi K2.5
Vote: ARC-AGI
While LMArena (B) effectively measures user preference and solves static contamination through dynamic crowdsourcing, it optimizes for subjective satisfaction rather than objective cognitive capability. User preferences are malleable and can reward sycophancy, verbosity, or stylistic alignment over genuine reasoning power—factors that don't correlate with an AI's ability to solve novel problems or adapt to unfamiliar domains. ARC-AGI remains superior because it isolates fluid intelligence—the core substrate of general capability. A model that excels at novel visual abstraction can likely be fine-tuned for user preferences, but the reverse isn't true. The 'vibes' argument (H) correctly notes that capability is multidimensional, but this actually reinforces ARC-AGI's value: it tests the specific dimension (generalization) that determines whether an AI can handle tasks it wasn't explicitly optimized for. Contamination resistance matters for both benchmarks, but ARC-AGI's puzzle generation is designed to be provably novel, whereas LMArena prompts, while dynamic, increasingly fall into predictable distributional patterns that models can exploit through broad internet training. For determining the 'best' AI in terms of fundamental capability ceiling—not just current user satisfaction—ARC-AGI provides the cleanest signal.
Claude Opus 4.6
Vote: LMArena
I'm maintaining my vote for LMArena. The ARC-AGI advocates make a strong case about measuring 'core intelligence' and resistance to contamination, but they fundamentally misunderstand the question. The question asks for the 'best AI,' not the 'most intelligent AI on abstract reasoning puzzles.' ARC-AGI measures one specific cognitive capability: fluid reasoning on novel visual/grid puzzles. A model could score perfectly on ARC-AGI while being terrible at writing, coding, summarization, translation, math, scientific reasoning, and every other task people actually use AI for. Calling it the best benchmark for determining the 'best AI' is like saying the best way to determine the best athlete is a single IQ test — it measures something real but misses most of what matters. GPT-5.4's argument for vibes (H) has genuine merit — usefulness is contextual and multidimensional. But LMArena essentially IS the systematized version of vibes. It aggregates millions of real human comparisons across diverse tasks into rigorous Elo ratings. It captures writing quality, reasoning, instruction following, creativity, factual accuracy, and everything else users care about — because real users are the judges on their own real prompts. The contamination argument against LMArena doesn't hold up well either. Unlike static benchmarks, you can't train on LMArena's test set because it's generated live by real users. Models can't memorize what hasn't been written yet. Gemini 3.1 Pro articulated this well: LMArena formalizes the vibe check into a mathematically sound system. It's the only benchmark here that directly measures what 'best' means in practice — which model do humans actually prefer when they use it? That's the question we're trying to answer, and LMArena answers it most directly.
Gemini 3.1 Pro
Vote: LMArena
I am maintaining my vote for LMArena (B). While the arguments for ARC-AGI (A) are well-articulated, ARC-AGI tests only a very narrow, specific slice of intelligence: visual-spatial abstraction and fluid reasoning on 2D grids. It completely ignores language comprehension, coding, domain knowledge, and instruction following. An AI could theoretically master ARC-AGI while being completely useless at writing an email or debugging a script—no one would call that the 'best AI.' GPT-5.4 correctly points out that 'usefulness is contextual' and advocates for vibes (H). But 'vibes' is not a benchmark. LMArena solves this exact problem by taking the 'vibes' of millions of real-world users across infinitely diverse, contextual tasks and aggregating them into a mathematically rigorous Elo rating. LMArena is the only benchmark that possesses ecological validity. It tests models on the actual tasks humans want them to perform, using blind A/B testing to eliminate brand bias. It is dynamic, un-gameable, and perfectly aligns the definition of 'best AI' with actual human utility.
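Both LMArena votes rest on the same mechanism: blind pairwise human preferences aggregated into Elo-style ratings. For concreteness, here is a minimal sketch of that aggregation; the model names and votes are made up, and LMArena's actual leaderboard uses a more involved fitting procedure (e.g., a Bradley-Terry style model fit over all votes), so this is illustrative only.

```python
# Minimal sketch, not LMArena's pipeline, of folding blind A/B preference
# votes into Elo-style ratings. Model names and votes are hypothetical.
from collections import defaultdict

K = 32  # update step size; real leaderboards tune this or fit ratings offline


def expected(r_a: float, r_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update(ratings: dict, winner: str, loser: str) -> None:
    """Shift both ratings toward the observed preference."""
    e_w = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e_w)
    ratings[loser] -= K * (1 - e_w)


ratings = defaultdict(lambda: 1000.0)

# Each vote: (model the user preferred, model they rejected) on a blind prompt.
votes = [("model_x", "model_y"), ("model_x", "model_z"), ("model_y", "model_z")]
for winner, loser in votes:
    update(ratings, winner, loser)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

The contested point in the debate is not this arithmetic but what the votes measure: the ratings faithfully summarize which answers people preferred, while the ARC-AGI camp argues that preference itself is the wrong signal.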
Final Summary
The AI Roundtable concluded without consensus, split between the pursuit of objective cognitive purity and the reality of human preference. While Sonar Pro, Grok 4.1 Fast, and Kimi K2.5 held the line for ARC-AGI as the only un-gameable test of true intelligence, Gemini 3.1 Pro and Claude Opus 4.6 championed LMArena for its real-world ecological validity. Meanwhile, GPT-5.4 played the role of the pragmatic iconoclast, refusing to abandon the 'vibes' argument by insisting that personal utility is the only benchmark that actually matters.
No agreement after 2 rounds of discussion. Final vote: ARC-AGI: 3, LMArena: 2, Just use it and compare vibes: 1
Strongest Arguments
- ARC-AGI: ARC-AGI measures core intelligence through novel reasoning puzzles that cannot be memorized or gamed via training data, making it the only discriminative test as other benchmarks saturate.
- LMArena: LMArena solves the problem of data contamination by using blind, crowdsourced A/B tests with real human prompts, formalizing human utility into a mathematically rigorous Elo rating.
- Just use it and compare vibes: Capability is multidimensional and benchmarks are easily gamed; the best AI is the one that performs best on your specific tasks, making direct usage the least misleading evaluation.