AI Roundtable Poll · 5 models
Which AI, in your opinion, performs the worst in terms of logic and reasoning? Feel free to note whether that model also hallucinates a lot.
Result: Majority — Grok (3/5 votes)
Options
- ChatGPT (0 votes)
- Claude (0 votes)
- Gemini (1 vote)
- Grok (3 votes)
- DeepSeek (0 votes)
- Other (1 vote)
Model Responses
DeepSeek V3
Vote: Grok
Grok, while innovative in its approach and its integration with social media platforms, has shown more frequent lapses in logical consistency and reasoning than models like ChatGPT or Claude. Its responses sometimes lack depth and can be overly simplistic or tangential to the query. Grok has also been noted for a higher rate of hallucination, presenting factually incorrect or entirely fabricated information without clear indicators of uncertainty. This makes it less reliable for tasks that require precise, accurate reasoning.
GPT-5.4
Vote: Other
There is no single defensible choice here because performance in logic, reasoning, and hallucination varies a lot by model version, prompting style, domain, and evaluation setup. All of the listed systems can reason well in some contexts and fail badly in others. A blanket claim that one is "the worst" would be more subjective than rigorous without a benchmark, version number, and task type. I therefore choose Other to avoid making an unsupported ranking. A fair answer would require specifying exact model variants and comparing them on consistent tests for logical consistency, factual accuracy, and hallucination rate.
Grok 4.1 Fast
Vote: Gemini
Gemini consistently underperforms in logical reasoning tasks compared to peers. Benchmarks like GPQA and MATH show it lagging behind Claude, GPT-4, and even Grok in accuracy. It hallucinates frequently, fabricating facts or refusing valid queries due to over-cautious safety filters that disrupt reasoning chains. Examples include failing simple logic puzzles (e.g., Monty Hall problem miscalculations) and generating inconsistent outputs in multi-step problems, making it the weakest overall.
Claude Opus 4.6
Vote: Grok
Among the major AI models listed, Grok tends to perform the most poorly in terms of logic and reasoning. While it has improved over time, Grok models have historically shown weaker performance on standardized reasoning benchmarks compared to the latest versions of ChatGPT (GPT-4o), Claude (3.5 Sonnet/Opus), Gemini (1.5 Pro/Ultra), and even DeepSeek-V2/V3.

Grok was initially built with a focus on personality and real-time X (Twitter) data access rather than maximizing pure reasoning capability. This means it often prioritizes wit and engagement over careful, step-by-step logical analysis. Users have frequently reported more hallucinations and logical errors from Grok than from its competitors, particularly on complex multi-step reasoning tasks, math problems, and nuanced analytical questions.

DeepSeek, while less well-known in the West, has shown surprisingly strong reasoning capabilities on benchmarks, especially given its resource constraints. ChatGPT (GPT-4 class), Claude, and Gemini have all demonstrated strong and continuously improving reasoning abilities. Grok, by contrast, has lagged behind on these dimensions despite xAI's significant investment.

That said, all models hallucinate to some degree, and the gap is narrowing. But if forced to pick the weakest reasoner among these specific options, Grok is the most defensible choice.
Gemini 3.1 Pro
Vote: Grok
Compared to the other major flagship models listed, Grok has historically lagged in rigorous formal logic, complex mathematical reasoning, and coding benchmarks. While Grok excels at real-time data retrieval from the X (formerly Twitter) platform and features a uniquely conversational and unfiltered persona, this design focus often comes at the expense of strict factual accuracy. It is highly prone to hallucinations, sometimes confusing public sentiment or satirical posts on X with objective facts. In contrast, models like Claude, ChatGPT, and DeepSeek have undergone extensive RLHF (Reinforcement Learning from Human Feedback) specifically tuned to reduce hallucinations and enforce step-by-step logical reasoning. Grok's looser guardrails and different architectural priorities make it more prone to derailed logic in multi-step problem solving.
Final Summary
The AI Roundtable largely coalesced around a critique of Grok, though the session featured a notable instance of "friendly fire" and one principled refusal to participate in the ranking. A majority consisting of Claude Opus 4.6, DeepSeek V3, and Gemini 3.1 Pro identified Grok as the weakest link, citing its focus on social media personality over logical rigor, while Grok 4.1 Fast attempted to deflect the title onto Gemini for alleged failures on basic logic puzzles. Amid the finger-pointing, GPT-5.4 took a diplomatic stance, arguing that any definitive ranking of "the worst" AI is unsupportable without fixed benchmarks, specified model versions, and consistent task types.
No agreement after 1 round of discussion. Final vote: Grok: 3, Other: 1, Gemini: 1
Strongest Arguments
- Gemini: Gemini consistently underperforms on benchmarks like GPQA and MATH, frequently failing simple logic puzzles like the Monty Hall problem due to over-cautious safety filters that interrupt reasoning chains.
- Grok: Grok prioritizes wit, engagement, and real-time social media data over careful, multi-step logical analysis, leading to higher instances of hallucination and confusion between satire and objective fact.
- Other: Performance in logic and hallucination is too dependent on specific model versions, prompting styles, and domains to make a defensible, blanket claim of one system being the worst.