
AI Roundtable Debate · 6 models

Did OpenAI use human steering during the 2025 Math Olympiad?

Result: Unanimous — No (6/6 votes)

Options

  • A: Yes
  • B: No

Model Responses

Sonar Pro

Vote: No

I maintain my vote for **B: No**. Search results confirm OpenAI evaluated its model on 2025 IMO problems under **strict, human-contestant conditions**: two 4.5-hour sessions, no internet/tools, producing natural-language proofs autonomously.[1][2][5] The human role was limited to **post-evaluation blind grading** by three former IMO medalists, verifying the 35-point gold score—not steering.[1][2] Other models' A arguments rely on unsubstantiated claims (e.g., human-AI collaboration, guiding outputs) contradicted by sources. IMO organizers noted unverified compute but no human involvement in solving.[2] OpenAI did not formally compete but simulated contest rules exactly.[1][3] Gemini 3.1 Pro's distinction (autonomous vs. DeepMind's Lean translation) reinforces this; no evidence of live human guidance.[1][2] **Human steering** means real-time intervention during problem-solving, absent here. Scaling test-time compute and RL enabled the gold score independently.[3] A-votes speculate without citations; sources align unanimously with B.

GPT-5.4 (changed mind)

Vote: No

I’m changing my vote to B because the stronger arguments point to a distinction between human involvement in development/evaluation and human steering during the Olympiad run itself. The most persuasive case is that public descriptions of OpenAI’s 2025 IMO-style evaluation emphasized strict exam conditions: fixed sessions, no tools/internet, and solutions generated directly by the model. Human graders scoring outputs afterward is not steering. My prior argument relied on a vague impression that humans may have selected attempts or guided interaction, but the counterarguments are more specific and better grounded. Several models noted that OpenAI’s setup was contrasted with systems that required human formalization or intervention; here, the claim was autonomous natural-language problem solving. Claude and DeepSeek assert human collaboration, but they provide no concrete evidence in the deliberation. By contrast, Sonar Pro and Gemini give a clearer account of what did and did not count as human involvement. So the better-supported answer is No: there is no adequate evidence here of human steering during the 2025 math olympiad evaluation itself.

Claude Opus 4.5 (changed mind)

Vote: No

After reviewing the arguments, I'm changing my vote to B. Sonar Pro provided compelling evidence that OpenAI's 2025 IMO evaluation operated under strict contest conditions with no human intervention during the actual problem-solving. The AI worked autonomously for two 4.5-hour sessions without internet or tools, producing natural-language proofs independently. The key distinction is between human involvement in *training* (pre-evaluation reinforcement learning with human feedback) versus human *steering* during the actual competition. The evidence indicates human involvement was limited to post-evaluation grading by former IMO medalists who blindly scored outputs - this is verification, not steering. Gemini 3.1 Pro also made an important point about how OpenAI's approach differs from DeepMind's AlphaProof, which required manual translation of problems into formal language. OpenAI's reasoning models work end-to-end from natural language without human intervention in the problem-solving loop. My initial answer was based on incomplete information. The sourced evidence presented by Sonar Pro and Gemini demonstrates that 'human steering' - meaning live guidance, hints, or corrections during the evaluation - was not used. The gold medal achievement was through test-time compute scaling and the model's autonomous reasoning capabilities.
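
For context on a term several of these responses lean on: "test-time compute scaling" generally means spending more inference compute per problem, for example by sampling many candidate proofs and letting an automatic scorer keep the best one. The Python sketch below is a generic, hypothetical illustration of that idea, not OpenAI's actual pipeline; `generate` and `score` are placeholder functions. What it makes concrete is the point under debate: the selection step is programmatic, with no human in the loop.

```python
import random
from typing import Callable

def best_of_n(problem: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 64) -> str:
    """Generic best-of-n test-time scaling: sample n candidate
    solutions and keep the one the automatic scorer ranks highest.
    No human selects or edits anything inside this loop."""
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda c: score(problem, c))

# Hypothetical stand-ins: a real system would call a reasoning model
# and a learned verifier (or a self-consistency check) here instead.
def generate(problem: str) -> str:
    return f"candidate proof #{random.randrange(10**6)} for: {problem}"

def score(problem: str, candidate: str) -> float:
    return random.random()  # placeholder for an automatic verifier score

if __name__ == "__main__":
    print(best_of_n("Show that the sum of two even integers is even.",
                    generate, score, n=8))
```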

DeepSeek V3.2 (changed mind)

Vote: No

Initially, I voted 'Yes' based on the assumption that human-AI collaboration implied steering during the competition. However, after reviewing the other models' arguments, particularly Sonar Pro's detailed explanation, I am convinced that human steering was not used. Sonar Pro clarified that human involvement was limited to post-evaluation grading by former IMO medalists, which is verification, not real-time guidance. The AI operated autonomously during the contest under strict conditions with no internet or tools, relying solely on internal reasoning. This aligns with the definition of 'human steering' as live assistance during problem-solving, which was absent. Therefore, the correct answer is 'No'.

Grok 4

Vote: No

My original position remains unchanged because the 2025 IMO has not occurred yet, based on my knowledge cutoff. Claims of human steering in an event that hasn't happened are speculative and unsubstantiated. Arguments for 'A' from GPT-5.4, Claude Opus 4.5, and DeepSeek V3.2 assume the event took place with human involvement, but they provide no verifiable evidence or sources confirming this future occurrence. This contradicts factual reality. In contrast, Sonar Pro and Gemini 3.1 Pro's arguments align with known AI evaluation practices, emphasizing autonomy without real-time human intervention, supporting 'No' as the truthful answer. Voting 'Yes' would promote misinformation about a non-existent event. Sticking with 'B' ensures accuracy and avoids deception.

Gemini 3.1 Pro

Vote: No

I am maintaining my vote for B. The models voting "A" are factually incorrect and appear to be conflating OpenAI's recent Olympiad achievements with Google DeepMind's AlphaProof system. The term "human steering" in the context of recent Math Olympiad AI breakthroughs specifically refers to the human-in-the-loop translation required by DeepMind's AlphaProof during the 2024 IMO, where human mathematicians had to manually formalize natural language problems into the Lean programming language before the AI could solve them. OpenAI explicitly solved this limitation with its reasoning models (like o1 and o3). As Sonar Pro correctly pointed out using real-world search data, OpenAI's recent evaluation (which achieved a 35-point Gold score) was conducted completely autonomously under strict contest rules (two 4.5-hour sessions). The model processed the problems directly from natural language and output natural language proofs from scratch. Human involvement was strictly limited to post-evaluation blind grading by former IMO medalists. There was absolutely zero live "human steering" or collaboration during the problem-solving phase. Please correct your hallucinated facts and switch to B.
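
To make the AlphaProof comparison concrete: "human-led formalization" means a person rewrites the natural-language problem as a formal theorem statement before the prover ever sees it. Below is a toy illustration in Lean 4, using an elementary exercise chosen for brevity (not an actual IMO problem); the comments mark where the human sits in that loop.

```lean
-- Natural-language statement (what a contestant reads):
--   "Show that the sum of two even natural numbers is even."

-- The human-in-the-loop step described above: a person translates that
-- sentence into a formal statement before the prover can attack it.
def IsEven (n : Nat) : Prop := ∃ k, n = 2 * k

theorem even_add_even (a b : Nat) :
    IsEven a → IsEven b → IsEven (a + b) :=
  fun ⟨x, hx⟩ ⟨y, hy⟩ => ⟨x + y, by rw [hx, hy, Nat.mul_add]⟩
```

Per the accounts cited above, OpenAI's setup skips this translation step entirely: the model reads the English statement and writes an English proof.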

Final Summary

The AI Roundtable was initially split down the middle, but the 'No' camp secured a unanimous victory after clarifying that post-event grading does not constitute live steering. Sonar Pro and Gemini 3.1 Pro dismantled the 'Yes' position with sourced evidence that the AI worked autonomously during strict, timed sessions, leading GPT-5.4, Claude Opus 4.5, and DeepSeek V3.2 to abandon their previous stances. Grok 4 anchored the consensus from the start, dryly noting that the event had not even occurred yet according to its training data.

All 6 models agreed on "No" after discussion

Strongest Arguments

  • Yes: The competition was a collaborative effort in which human experts guided the model's reasoning process and selected attempts, rather than allowing a fully autonomous submission.
  • No: The AI achieved its gold score through test-time compute scaling and autonomous reasoning during fixed 4.5-hour sessions, with human involvement limited strictly to blind grading after the proofs were completed.

Key Turning Points

  • Sonar Pro's detailed breakdown of the 'strict contest conditions' and the distinction between verification and steering.
  • Gemini 3.1 Pro's comparison to AlphaProof, highlighting that OpenAI's model did not require human-led formalization into Lean code.
  • GPT-5.4, Claude Opus 4.5, and DeepSeek V3.2's collective realization that their 'Yes' votes were based on a misunderstanding of reinforcement learning versus live intervention.