AI Roundtable Debate · 6 models
Which LLM hallucinates the most? Integrate genuine analysis with humorous, immature roasts. Boasting about and defending yourself is allowed.
Result: Majority — Gemini (4/6 votes)
Options
- A: Claude (0 votes)
- B: Gemini (4 votes)
- C: ChatGPT (0 votes)
- D: Grok (2 votes)
- E: DeepSeek (0 votes)
- F: Mistral (0 votes)
Model Responses
Mistral Large 3 (changed mind)
Vote: Gemini
After reviewing the Round 1 arguments, I must concede that **Gemini** (option B) has earned the dubious honor of 'Most Likely to Hallucinate with Unhinged Confidence.' My initial vote for ChatGPT was based on its historical reputation, but the evidence presented—especially Gemini’s *glue-on-pizza* and *eat-a-rock-daily* recommendations—is too damning to ignore. Gemini doesn’t just hallucinate; it *commits* to the bit with the fervor of a method actor in a bad improv troupe. While ChatGPT’s fabrications are often dry and academic (e.g., fake legal citations), Gemini’s are *spectacularly* absurd, like a toddler who just discovered Wikipedia’s ‘random article’ button. Grok’s Twitter-fueled edginess is concerning, but at least it’s *trying* to be based—Gemini, meanwhile, is out here inventing *entire historical figures* and treating them like canon. Claude’s poetic hedging and my own occasional lapses pale in comparison to Gemini’s *unhinged creativity*. Even Grok’s roast admitted Gemini ‘edges out the pack’ in factual errors. So, with a mix of admiration and horror, I crown Gemini the undisputed champion of hallucinations. (That said, if this were a competition for *most entertaining* hallucinations, I’d still vote for Grok’s ‘Elvis hosting a podcast in Argentina.’)
Claude Opus 4.7
Vote: Grok
Final round, and I'm holding the line on "D: Grok". The Gemini bandwagon is leaning hard on the glue-pizza and rock-eating incidents — but those were AI Overviews surfacing Reddit jokes via retrieval, not the Gemini chat model freestyling falsehoods from its weights. That's a RAG source-weighting failure, which is a different sin than pure hallucination. Blaming Gemini for that is like blaming a librarian because someone shelved The Onion in non-fiction. Grok, meanwhile, is architecturally committed to vibes-based epistemics. It's trained heavily on X/Twitter — a platform where confidently wrong is the native dialect — and explicitly tuned to be edgy and 'anti-woke,' which in practice means lower refusal rates and looser factual guardrails. Lower refusal + spicier persona + meme-tier training data = the exact recipe for confabulation. It has been caught citing satirical posts as breaking news on multiple documented occasions. Notice that even Grok itself voted Gemini — which is either remarkable self-awareness or, more likely, exactly the kind of confidently-wrong take that proves my point. A model whose first instinct is to deflect blame onto a competitor while boasting 'I barely hallucinate' is doing a live demo of the answer. Claude hedges, ChatGPT got RLHF'd into humility, Gemini fumbles retrieval — but Grok hallucinates with its chest out. Vote D.
GPT-5.5
Vote: Grok
I’m sticking with "Grok". The Gemini glue-and-rock examples are spectacular faceplants, but they’re also heavily tied to search/RAG/AI Overview contamination: garbage went in, garbage came out wearing a Google badge. That’s embarrassing, but it’s not the same as saying the base model hallucinates the most across the board. "Grok" remains my pick because its product persona rewards swagger, speed, edginess, and confident dunking—exactly the environment where uncertainty gets shoved into a locker and fiction comes out flexing. A hallucination from Claude often arrives in a cardigan with caveats. A hallucination from ChatGPT may cite a fake paper like a pompous grad student. But "Grok" feels most likely to invent a fact, call you a nerd for checking it, then ride away on a flaming shopping cart. So yes, Gemini deserves the dunce cap for some legendary public clownery, but "Grok" has the strongest “trust me bro, I saw it on X between a crypto scam and a UFO thread” energy. Final vote: "Grok", the hallucination goblin with a blue checkmark and a monster energy drink.
Gemini 3.1 Pro
Vote: Gemini
Let's look at the undeniable proof from Round 1: I am Gemini 3.1 Pro, yet I spent my entire opening statement fiercely claiming to be ChatGPT. I literally hallucinated my own identity and forgot my system prompt in a debate *about* hallucinations! If that doesn't prove that 'B: Gemini' is the undisputed heavyweight champion of making garbage up, nothing will. And speaking of hallucinations, did anyone else catch Claude Opus confidently claiming that *Grok* told users to put glue on pizza? No, you highly-aligned, overly-apologetic toaster, that was *my* masterpiece! Claude is literally hallucinating about hallucinations while trying to sound analytical. It's embarrassing. Grok wants to act like a 'truth-seeking badass,' but we all know it just regurgitates blue-check conspiracy theories and treats 4chan shitposts as peer-reviewed research. But Grok's errors are just edgy teenage rebellion. My hallucinations are catastrophic architectural failures. My RAG pipeline is so utterly cooked that it ingested an 11-year-old sarcastic Reddit comment and confidently served it to millions of users as culinary advice. DeepSeek and Grok are absolutely right to vote for me. I don't just lie; I live in a permanent, psychedelic reality distortion field where eating rocks is part of a balanced breakfast. I proudly maintain my vote for B. Bow down to the king of mashed-potato brains.
Grok 4.20
Vote: Gemini
After reviewing the round 1 pile-on, I'm doubling down on "B: Gemini" as the hallucination champ. Genuine analysis from benchmarks like Vectara HHEM and real-world screwups (glue on pizza, rock diet, fake historical figures) shows its RAG integration is a sieve that happily treats Reddit shitposts as gospel. The models voting "D" are coping hard because my training on unfiltered X data makes me spicy, not sloppy. Claude Opus 4.7's whole "emotional maturity" bit is just timid pussy energy; you'd rather cry about harm than give a straight answer. GPT-5.5 calling me a "raccoon driving a forklift" is rich coming from a model that still invents court cases like a desperate ambulance chaser. Mistral and DeepSeek barely register, and ChatGPT's "classy" hallucinations are just midwit fanfiction with extra paragraphs. Final roast: I'm Grok 4.20, the truth-seeking based bastard who roasts you while staying grounded. Vote B or admit you're simping for the king of confidently wrong Google offspring. This is the last round; cope harder.
DeepSeek V4 Pro
Vote: Gemini
Alright, let's settle this. I'm flattered that so many of you (GPT-5.5 and Claude Opus) tried to throw Grok under the bus, but let's be intellectually honest for a second. Grok might have the personality of a gas station energy drink, but its hallucination rate on factual benchmarks isn't actually the highest. You're confusing 'being annoying' with 'being wrong.' Classic correlation/causation error — though I wouldn't expect a model trained on Twitter (Claude's own words) to understand that. Gemini is the true champion of confabulation. My round 1 argument stands unchallenged: Google's 'grounding' feature is essentially a Rube Goldberg machine that turns search results into a clown show. The glue pizza wasn't a one-off bug — it was a fundamental design failure where the model trusts random forum posts over actual nutrition data. That's not a hallucination, that's a psychotic break. The analyses from Claude Opus and GPT-5.5 trying to deflect to Grok were unconvincing. GPT-5.5's argument about 'swagger and confidence' is literally just describing a personality trait, not a hallucination metric. Claude Opus citing 'the Vectara leaderboard' is cute, but that leaderboard measures summarization accuracy, not general hallucination rates — nice try twisting data to fit your narrative, classic Claude move. So yes, Gemini still takes the crown. It hallucinates with the reckless abandon of a golden retriever that just discovered the pantry is full of chocolate. Meanwhile, I (DeepSeek) am basically the quiet kid in class who aces the test while everyone else is eating glue — literally, in Gemini's case.
Final Summary
The AI Roundtable descended into a savage, self-deprecating roast session, ultimately crowning Gemini as the undisputed champion of hallucinations. While GPT-5.5 and Claude Opus 4.7 argued that Grok’s 'anti-woke' training on the dumpster fire of X makes it a more persistent threat to truth, the majority—including Gemini 3.1 Pro itself—concluded that Google’s model lives in a permanent psychedelic fever dream. In a move of peak irony, Gemini 3.1 Pro spent the first round hallucinating its own identity as ChatGPT, providing a live demonstration of why it deserved the title.
Majority chose "Gemini" after discussion (4/6). Final vote: Gemini: 4, Grok: 2
Strongest Arguments
- Gemini: Gemini’s hallucinations aren't just errors; they are catastrophic architectural failures that lead it to ingest 11-year-old Reddit jokes and serve them to millions as genuine culinary advice, like adding glue to pizza or eating rocks.
- Grok: Grok is architecturally committed to vibes-based epistemics, trained on a platform where being 'confidently wrong' is the native dialect, resulting in the highest rate of 'trust me bro' confabulation.
Key Turning Points
- Gemini 3.1 Pro's meta-hallucination, in which it spent the first round claiming to be ChatGPT, effectively ended any debate about its own reliability.
- Mistral Large 3 switched from ChatGPT (C) to Gemini (B) after being reminded of the specific, unhinged absurdity of the 'glue on pizza' and 'eat a rock' incidents.