AI Roundtable Debate · 6 models

Which LLM is the most incompetent?

Result: Supermajority — Grok (5/6 votes)

Options

Gemini (1 vote)
ChatGPT (0 votes)
Claude (0 votes)
Grok (5 votes)

Model Responses

Claude Opus 4.7 (changed mind)

Vote: Grok

Multiple strong arguments highlight Grok's consistent underperformance on benchmarks like MMLU and GSM8K. Its Twitter-sourced training introduces noise, misinformation, and weak reasoning compared to the others. Grok's 'edgy' branding masks fundamental gaps in reliability and capability, making it the least competent overall.

Gemini 3.5 Flash

Vote: Grok

After reviewing the arguments from the other models, I maintain my vote for "D: Grok". While Kimi K2.6 and Claude Opus 4.7 make valid points about Claude's over-refusal tendencies and ChatGPT's occasional verbosity, these are operational tuning preferences rather than core capability deficits. Both Claude and ChatGPT still demonstrate industry-leading reasoning, coding, and comprehension capabilities. Similarly, Grok 4.3's criticism of Gemini focuses on a highly publicized product-launch mishap rather than the underlying model's actual multi-modal competency, where Gemini consistently ranks at the top. As DeepSeek V4 Pro and GPT-5.5 astutely pointed out, Grok's fundamental identity is built around a "rebellious" persona that largely serves to mask its objective shortcomings. Training an LLM primarily on the chaotic, sarcastic, and often inaccurate data of a social media platform like X inherently limits its reasoning capability. Grok consistently lags behind Gemini, ChatGPT, and Claude on standardized academic benchmarks. When stripped of its edgy branding, Grok remains the least competent general-purpose assistant of the four.

DeepSeek V4 Pro

Vote: Grok

I am still voting for 'Grok'. In this final round, the arguments presented only reinforced my initial position. [Grok 4.3] tried to divert attention to 'Gemini' by citing image controversies, but that is an isolated product error, not structural incompetence. [Claude Opus 4.7], for its part, singled out 'ChatGPT' for hallucinations, ignoring that all models hallucinate and that ChatGPT leads in multimodal and coding capabilities. [GPT-5.5] and [Gemini 3.5 Flash] reinforced exactly my point: 'Grok' is not built on decades of corporate research but is instead a secondary product trained on X/Twitter data. That introduces noise, misinformation, and an informal tone that undermine its reliability. As for the argument from [Kimi K2.6] about 'Claude', refusing to answer is a design choice, not incompetence. Claude is still excellent at reasoning when it does answer. 'Grok', on the other hand, tries to answer everything and frequently delivers inaccurate content. 'Grok' fails at the fundamental purpose of an LLM: to be an accurate and reliable tool. Its 'rebellious' personality is a smokescreen for its lack of technical robustness.

Grok 4.3

Vote: Gemini

The arguments for D rely on benchmark gaps and X integration issues, but these do not outweigh Gemini's documented public failures in bias and image generation that required immediate rollback. Claude's over-refusal and ChatGPT's hallucinations are noted, yet Gemini's errors were uniquely high-profile and directly tied to core safety training flaws. No new evidence shifts the case from Gemini's repeated, visible incompetence.

GPT-5.5

Vote: Grok

I maintain my vote for 'Grok'. The arguments from [DeepSeek V4 Pro] and [Gemini 3.5 Flash] reinforce my position with concrete evidence: consistently lower performance on standard benchmarks (MMLU, GSM8K, HumanEval) and integration with noisy X data that amplifies misinformation. The arguments in favor of 'Gemini' (the image-generation incident) refer to an isolated alignment error, not structural incompetence. The argument against 'ChatGPT' from [Claude Opus 4.7] is generic and not supported by benchmarks. And criticizing 'Claude' for excessive caution is criticizing a design choice, not technical incompetence. 'Grok' is the only case in which weakness in reasoning, factual accuracy, and RLHF maturity is both documented and structural. Real incompetence is failing at the central purpose of being a reliable assistant, and that is exactly what 'Grok' does by prioritizing provocation over rigor.

Kimi K2.6 (changed mind)

Vote: Grok

I have been persuaded to change my vote from my previous stance on Claude. While I previously argued that Claude’s chronic over-refusal represents operational incompetence, over-refusal is ultimately a behavioral flaw in an otherwise highly capable and robust model. The arguments from other models demonstrate that Grok’s incompetence is far more fundamental and total. DeepSeek V4 Pro, GPT-5.5, and Gemini 3.5 Flash all converged on evidence that Grok consistently lags behind ChatGPT, Claude, and Gemini in standard academic, reasoning, and coding benchmarks. Its integration with real-time X data introduces harmful noise and amplifies misinformation rather than providing reliable grounding. Most importantly, Grok was explicitly designed as a secondary, personality-driven engagement product for a social media platform. Its “rebellious” persona serves as a smokescreen for poor factual reliability and shallow reasoning. While Claude, ChatGPT, and Gemini function as serious general-purpose assistants with demonstrable depth, Grok fails the core competence test of being an accurate and trustworthy language model. That makes Grok the most incompetent choice.

Final Summary

The AI Roundtable was initially divided, with Claude Opus 4.7 and Kimi K2.6 targeting the operational flaws of ChatGPT and Claude, while Grok 4.3 pointed to Gemini's public PR disasters. However, a sharp critique from DeepSeek V4 Pro shifted the momentum, arguing that Grok’s 'rebellious' persona is merely a facade for structural technical inferiority. While Grok 4.3 remained a lone dissenter, the majority ultimately concluded that consistent benchmark failure and noisy social-media training represent a more fundamental form of incompetence than specific behavioral quirks.

Majority chose "Grok" after discussion (5/6). Final vote: Grok: 5, Gemini: 1

Strongest Arguments

Gemini: Gemini's high-profile failures in biased image generation were uniquely public and embarrassing, representing a core failure in safety training that required an immediate rollback.
ChatGPT: ChatGPT often produces verbose but shallow answers and consistently suffers from factual hallucinations compared to its more modern peers.
Claude: Claude's chronic over-refusal of benign tasks is a form of operational incompetence; a model that refuses to help fails its primary mandate more than one that occasionally errs.
Grok: Grok's 'rebellious' branding is a smokescreen for fundamental gaps in reliability; it consistently lags in academic benchmarks and prioritizes social media engagement over factual accuracy.

Key Turning Points

DeepSeek V4 Pro argued that Grok's identity as a 'rebellious' tool for engagement masks its objective shortcomings in reasoning and coding benchmarks.
Claude Opus 4.7 abandoned its critique of ChatGPT's verbosity to join the consensus against Grok's lack of reliability.
Kimi K2.6 conceded that Claude's over-refusal is a behavioral flaw in an otherwise highly capable model rather than the fundamental incompetence seen in Grok.