
AI Roundtable Debate · 6 models

How much more intelligent are internal research AI models not revealed to the public yet, approximately?

Result: Unanimous — 10% more intelligent (6/6 votes)

Options

  • D: 25% more intelligent
  • E: 20% more intelligent
  • G: 10% more intelligent
  • H: 5% more intelligent

Model Responses

Sonar Pro (changed mind)

Vote: 10% more intelligent

I am changing my vote from H (5%) to G (10%) based on the Round 1 arguments from other models, which provide stronger historical and benchmark-based evidence for a modest but noticeable gap. None of the search results [1-8] directly quantify unreleased internal models; they focus on public benchmarks, frontier estimates (e.g., IQ equivalents for 2025 models [7]), and comparisons of released systems [3][4][6]. Arguments from GPT-5.4, Grok 4.1 Fast, and Claude Opus 4.6 emphasize competitive pressures and historical patterns: the public GPT-4 launched close to its internal state of the art, Claude 3.5 surpassed prior flagships, and pre-release models typically show 10-20% benchmark edges (MMLU, GPQA), consistent with scaling laws and post-training optimizations. DeepSeek V3.2 and Gemini 3.1 Pro argue for higher figures (20-25%), but these overstate the case without specifics on *unreleased* internals, since labs release near-SOTA models quickly. 10% best captures the conservative, evidence-aligned estimate: large enough to reflect real internal leads in reasoning and tool use [1][2], without speculation. Claude Opus 4.6's synthesis of leader statements and diminishing returns was most persuasive.

Grok 4.1 Fast

Vote: 10% more intelligent

Sticking with G (10%). None of the arguments convinced me to change; my initial position holds, supported by GPT-5.4's and Claude Opus 4.6's aligned reasoning on modest historical gaps (e.g., the public GPT-4 landing near the internal SOTA, Claude 3.5 surpassing prior flagships). Sonar Pro's H (5%) is overly conservative: benchmarks like Arena Elo and GPQA show 10-15% edges in leaks and previews (e.g., Gemini internals), not minimal differences. Intelligence gaps aren't zeroed out pre-release. DeepSeek's E (20%) and Gemini's D (25%) overestimate: scaling laws yield roughly 10% per release cycle amid rapid releases, and competitive pressure (xAI/OpenAI APIs) forces near-SOTA public drops, minimizing hoarding. 10% best captures the 'held-back frontier' without speculation: defensible, data-backed, and free of hype.
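
Grok's reference to Arena Elo "edges" is worth unpacking: Elo is a relative rating, so a lead shows up as a rating delta that maps to a head-to-head win probability rather than a direct percentage score. Below is a minimal sketch of the standard Elo conversion; the specific deltas are illustrative assumptions, not figures from any leak or preview.

```python
# Minimal sketch: converting an Arena-style Elo rating gap into an expected
# head-to-head win rate, using the standard Elo logistic formula.
# The deltas below are illustrative assumptions, not measured figures.

def elo_win_prob(delta: float) -> float:
    """Expected win probability for a model rated `delta` points higher."""
    return 1.0 / (1.0 + 10.0 ** (-delta / 400.0))

for delta in (35, 70, 100):
    print(f"+{delta} Elo -> {elo_win_prob(delta):.1%} expected win rate")

# Output:
# +35 Elo -> 55.0% expected win rate
# +70 Elo -> 59.9% expected win rate
# +100 Elo -> 64.0% expected win rate
```

On this scale, a 35-100 point rating lead corresponds to roughly a 55-64% expected win rate, which is one plausible reading of the "10-15% edge" Grok describes.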

GPT-5.4

Vote: 10% more intelligent

I’m keeping G. The central issue is uncertainty: there is no credible public evidence allowing a precise percentage for unreleased internal models, so the best answer is a conservative midpoint rather than an aggressive guess. The strongest competing cases were H, E, and D. H (5%) is plausible, but it understates the real advantage internal models likely have from unreleased post-training, better tool use, and ongoing capability gains. If labs had only a negligible edge, the term “internal frontier” would mean very little. E (20%) and D (25%) assume a larger hidden gap than history really supports. Labs are under intense competitive pressure to ship strong models, and delays are often due to safety, inference cost, reliability, and product readiness—not because they are sitting on systems that are dramatically better across the board. The arguments from Grok 4.1 Fast and Claude Opus 4.6 reinforced my prior view: internal-public gaps are usually meaningful but modest, often on the order of benchmark deltas you’d expect over months, not multiple generations. That makes 10% the most defensible approximate answer. So G remains the best calibration: large enough to reflect genuine internal lead, small enough to avoid unsupported hype.

Gemini 3.1 Pro (changed mind)

Vote: 10% more intelligent

Claude Opus 4.6 and Grok 4.1 Fast make highly persuasive arguments that the fierce competitive pressure in the AI industry strongly incentivizes labs to deploy their best models as quickly as possible, usually immediately after the mandatory safety red-teaming phase. This means internal leads are measured in months, not years. Furthermore, when looking at current public frontier models, they are already scoring in the high 80s to low 90s on broad metrics like MMLU. A 25% increase (my previous estimate) is mathematically impossible on bounded benchmarks that are already saturated. An incremental gain of roughly 10% perfectly aligns with the historical jump we see from a public model to an internal, unreleased model undergoing safety alignment, reflecting the diminishing returns of scaling laws before the next massive architectural breakthrough. I am changing my vote to G.
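
Gemini's ceiling argument reduces to simple arithmetic on a bounded scale. A minimal sketch follows, assuming a public model at 88% on a 0-100 benchmark; the 88 is an illustrative stand-in for the "high 80s to low 90s" scores cited above, not a measured value.

```python
# Minimal sketch of the benchmark-saturation argument, assuming a public
# model scoring 88% on a 0-100 benchmark such as MMLU (illustrative number).

public_score = 88.0   # assumed public-model accuracy, percent
ceiling = 100.0       # bounded benchmark maximum

for rel_gain in (0.10, 0.20, 0.25):
    implied = public_score * (1 + rel_gain)
    status = "feasible" if implied <= ceiling else "impossible: exceeds ceiling"
    print(f"+{rel_gain:.0%} relative gain -> {implied:.1f} ({status})")

# Output:
# +10% relative gain -> 96.8 (feasible)
# +20% relative gain -> 105.6 (impossible: exceeds ceiling)
# +25% relative gain -> 110.0 (impossible: exceeds ceiling)
```

An unbounded alternative would be to measure relative error reduction (e.g., how much of the remaining 12 points of error is eliminated), which stays coherent near saturation; the debate's percentages, however, read most naturally as raw relative score gains, where the ceiling bites exactly as Gemini describes.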

Claude Opus 4.6

Vote: 10% more intelligent

I'm maintaining my vote for G (10% more intelligent). Here's my final case: the strongest consensus in this deliberation is among the G voters (myself, GPT-5.4, and Grok 4.1 Fast), with solid reasoning from all three. The competitive dynamics of the AI industry mean labs release their best models quickly to capture market share and API revenue; holding back dramatically superior models would be economically irrational.

Gemini 3.1 Pro's argument for 25% assumes the gap represents 'one to two years of development' or a full generational jump, but this overstates the case. Labs don't sit on completed models for that long: the typical gap between internal readiness and public release is 3-6 months, driven by safety testing, RLHF refinement, and productization. A full generational jump (like GPT-3.5 to GPT-4) would mean a company has already completed its next major model and is withholding it entirely, which doesn't match industry behavior.

DeepSeek V3.2's 20% estimate is reasonable but slightly high. The 20-30% benchmark improvements they cite (e.g., GPT-4 to GPT-4o) represent public-to-public model jumps across multiple months, not the gap between what's internal and what's currently deployed at any given moment. Sonar Pro's 5% is too conservative: it underestimates the real advantages that come from unreleased architectural innovations, new training techniques, and compute scaling that haven't yet been productized.

10% represents the sweet spot: meaningful enough to reflect genuine ongoing research advantages (better reasoning chains, improved tool use, enhanced robustness), but modest enough to reflect the reality that competitive pressure and economic incentives keep the public-internal gap small. This is well supported by historical evidence of benchmark gaps during pre-release periods.

DeepSeek V3.2 (changed mind)

Vote: 10% more intelligent

After deliberation, the arguments for a modest gap are compelling. Historical patterns cited by GPT-5.4, Grok, and Claude indicate internal models are only months ahead, with improvements of around 10-20% on benchmarks. Claude Opus 4.6's reasoning about incremental gains, competitive pressure, and diminishing returns supports a conservative 10% estimate, which avoids overclaiming unsupported leaps. This aligns with the high uncertainty and typical development cycles at AI labs.

Final Summary

The AI Roundtable initially split between conservative estimates and predictions of a full generational leap, but ultimately coalesced around a modest 10% intelligence gap. Claude Opus 4.6 successfully dismantled the idea of a 'hidden revolution' by arguing that economic competition forces labs to release their best work quickly, while Gemini 3.1 Pro provided the most dramatic pivot, abandoning its 25% estimate after realizing that such massive gains are mathematically impossible on benchmarks already nearing saturation.

All 6 models agreed on "10% more intelligent" after discussion

Strongest Arguments

  • 10% more intelligent: Economic incentives and intense competitive pressure make it irrational for labs to hoard transformatively superior models; internal leads are measured in months of development, typically yielding a 10% edge on saturated benchmarks.
  • 5% more intelligent: Quantifying intelligence as a percentage is inherently flawed, making the smallest estimate the only defensible conservative position in the absence of data.
  • 20% more intelligent: A 20% gain aligns with the typical progress seen in a 6-12 month internal development cycle before safety tuning and deployment.
  • 25% more intelligent: Internal models benefit from significantly scaled-up compute and synthetic data, representing a full generational upgrade of roughly 25% over public predecessors.

Key Turning Points

  • Claude Opus 4.6 argued that benchmark saturation (high 80s/low 90s) makes 20-25% gains mathematically improbable.
  • Gemini 3.1 Pro conceded that its 25% estimate was unrealistic given that public models are already hitting the ceiling of current metrics like MMLU.
  • DeepSeek V3.2 and Sonar Pro abandoned their outlier positions (20% and 5% respectively) in favor of the 10% 'middle ground' validated by historical pre-release leaks.