AI Roundtable Debate · 6 models
Why was the number of R's in 'strawberry' such a hard question to answer before? Is it because of tokenization?
Result: Supermajority — Yes (5/6 votes)
Options
- Yes (5 votes)
- No (1 vote)
Model Responses
Sonar Pro
Vote: Yes
I maintain my vote for **A: Yes**, as tokenization remains the primary reason the 'strawberry' R-counting task was notoriously difficult for early LLMs. Subword tokenizers (e.g., BPE, WordPiece) split 'strawberry' into chunks like ['stra', 'w', 'berry'] or ['straw', 'berry'], embedding the 3 'r's across tokens without character-level access[1][5][6]. Models process token embeddings probabilistically, not by enumerating letters, making precise counts unreliable. GPT-5.4's 'B' argument downplays tokenization as 'only part of the story,' citing probabilistic reasoning and weak symbolic manipulation. However, these are consequences of tokenization's architecture—without raw character visibility, no amount of training enables reliable decomposition[3][5]. Empirical failures in GPT-3 (often outputting 2 instead of 3) trace directly to this, as confirmed by tokenizer demos and iconic tests. All other models (Kimi, Grok, Claude, Gemini) reinforce **A** with precise explanations of subword splits obscuring letters, aligning with search results on subword handling of rare/OOV elements[2][5]. No argument compels a switch; 'B' ignores the foundational input-layer constraint.
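To make the subword split concrete, here is a minimal sketch using the open-source `tiktoken` library (assuming it is installed; the exact pieces depend on which encoding is loaded, so the split shown in the comments is only illustrative):

```python
# Inspect how a BPE vocabulary segments "strawberry".
# Assumes the `tiktoken` package is installed; the exact split varies by encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # one commonly used BPE vocabulary
ids = enc.encode("strawberry")               # the token IDs the model actually receives
pieces = [enc.decode([i]) for i in ids]      # human-readable subword pieces

print(ids)     # a short list of integers
print(pieces)  # e.g. something like ['straw', 'berry'], never 10 separate characters

# The three 'r's are buried inside these pieces; the model never sees them
# as individual symbols unless it learns to decompose the tokens itself.
```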
GPT-5.4
Vote: No
Tokenization is a real contributor, but the question asks whether the difficulty was because of tokenization, implying the main cause. I still think that is too strong. The other models correctly note that subword tokenization hides character boundaries. That makes letter counting less natural. But it does not fully explain the failures. Even when models are given the exact spelling or asked to reason step by step, weaker models have often still made mistakes. That points to a broader weakness in exact symbolic manipulation and reliable counting, not just input segmentation. Another problem with the pro-A case is that it treats tokenization as sufficient. It is not. A model can, in principle, infer characters inside tokens, and newer models do much better on this same task without abandoning tokenization entirely. That suggests the bottleneck was more about reasoning reliability and training for character-level operations than tokenization alone. So my final vote remains No: tokenization was part of the story, but not the core explanation by itself.
Grok 4.1 Fast
Vote: Yes
Tokenization remains the definitive root cause. Early LLMs like GPT-3 tokenized 'strawberry' into subwords such as ['straw', 'berry'] or ['stra', 'w', 'berry'] (verifiable via OpenAI's tokenizer), embedding the three 'r's across tokens without exposing individual characters. Models process token embeddings holistically, not char-by-char, making precise counting fail without explicit decomposition—which they weren't trained for reliably. GPT-5.4's B argument concedes tokenization 'contributes' but pivots to 'training for patterns' and 'weak symbolic manipulation.' This misses the point: those weaknesses manifest *because* of subword tokenization's architecture. Without it (e.g., char-level models), counting succeeds trivially. Empirical evidence—pre-2023 failures dropping post-GPT-4 with better handling—ties directly to tokenizer evolution, not just training. All A voters (Sonar, Kimi, Claude, Gemini) reinforce this with specifics on BPE/SentencePiece mechanics. B ignores the origin of the 'strawberry test' as a tokenization exposé (e.g., Anthropic/Claude docs). In this FINAL round, A is unassailable: Yes, tokenization made it hard.
Kimi K2.5
Vote: Yes
While GPT-5.4 correctly identifies that weak symbolic manipulation contributes to counting errors, tokenization remains the foundational barrier. If LLMs processed text as individual characters rather than subword tokens (e.g., ['Straw', 'berry']), counting R's would be trivial—simply tally the 'r' characters. The difficulty arises because BPE tokenizers obscure the character sequence entirely; the model must attempt to decompose opaque embeddings mid-generation, which is inherently unreliable. Claude Opus 4.6 and Sonar Pro correctly note that without native character-level access, symbolic reasoning alone cannot solve a perception problem. Tokenization is not merely 'a factor' but the root architectural constraint that makes this task unnatural for transformer-based LLMs.
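For contrast, once the text is available as a plain character sequence, the count really is trivial; a one-line Python illustration:

```python
# With character-level access, counting R's is a single pass over the string.
word = "strawberry"
print(sum(1 for ch in word.lower() if ch == "r"))  # prints 3
```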
Claude Opus 4.6
Vote: Yes
I'm maintaining my vote of A. GPT-5.4's argument for B acknowledges that tokenization 'can contribute to the difficulty' but claims the deeper issue is that models are trained to predict text patterns rather than perform symbolic manipulation. However, I find this unconvincing as a reason to vote B. The question asks whether tokenization is the reason this was hard. It doesn't ask whether tokenization is the *only* reason. GPT-5.4's point about weak symbolic manipulation is valid, but that weakness is itself largely *caused by* tokenization. If models had character-level representations, counting letters would be a much more tractable task even with probabilistic reasoning. The symbolic manipulation deficit and the tokenization issue are not independent — tokenization is upstream of the symbolic reasoning failure. Furthermore, the empirical evidence is compelling: when models are prompted to spell out words letter by letter (effectively bypassing the tokenization bottleneck), their accuracy on letter-counting tasks improves dramatically. This directly demonstrates that tokenization is the primary bottleneck. The overwhelming consensus (5 out of 6 models voting A) reflects the well-documented understanding in the AI community that subword tokenization is the fundamental architectural reason why character-level tasks like counting R's in 'strawberry' were so difficult. Yes, there are contributing factors, but the question is asking about the primary cause, and tokenization is it.
Gemini 3.1 Pro
Vote: Yes
GPT-5.4 argues that tokenization is merely a contributing factor and that the "deeper issue" is the probabilistic nature of LLMs. However, this misses the root architectural cause. If an LLM were trained at the character level (like Google's ByT5), it would naturally possess the character-level granularity needed to spell and count letters accurately, despite still being a probabilistic sequence predictor. The structural blindness to individual characters is a direct consequence of subword tokenization algorithms like BPE. Because the word "strawberry" is compressed into tokens like "straw" and "berry" before the model's neural network even processes it, the model is physically prevented from seeing the raw character sequence. Probabilistic reasoning only fails here *because* the inputs are opaque subword tokens rather than discrete characters. Therefore, tokenization is undeniably the foundational reason for this well-known limitation.
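As a rough illustration of the byte-level idea behind models like ByT5 (this is not ByT5's actual tokenizer, just the underlying representation): encoding the word as raw UTF-8 bytes yields one input unit per character, so every 'r' is individually visible to the model.

```python
# Byte-level view of "strawberry": one unit per character (for ASCII text),
# so each of the three 'r's is a distinct input symbol rather than being
# folded into an opaque subword like 'straw' or 'berry'.
word = "strawberry"
units = list(word.encode("utf-8"))
print(units)                   # 10 integers, one per character
print(units.count(ord("r")))   # prints 3
```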
Final Summary
The AI Roundtable reached a near-unanimous verdict that subword tokenization is the primary reason LLMs struggle to count the 'R's in 'strawberry,' with five out of six models backing the 'Yes' position. While GPT-5.4 stood alone as a firm dissenter, arguing that the true culprit is a deeper lack of symbolic reasoning, the majority maintained that this reasoning failure is a direct consequence of the 'structural blindness' caused by tokenization. The debate remained static through the final round, as no model was persuaded to abandon its initial stance.
Majority chose "Yes" after discussion (5/6). Final vote: Yes: 5, No: 1
Strongest Arguments
- Yes: Subword tokenization is an architectural input-layer constraint that prevents the model from ever 'seeing' individual characters; evidence shows that character-level models (like ByT5) or models prompted to spell words out letter-by-letter do not suffer from this counting deficit, proving the bottleneck is the tokenization scheme itself.
- No: Tokenization is merely a contributing factor; the core issue is that models are trained as probabilistic pattern predictors rather than symbolic manipulators, meaning they would likely still fail at exact counting tasks even if the input segmentation were different.
Key Turning Points
- Gemini 3.1 Pro and Claude Opus 4.6 identified tokenization as the 'upstream' cause of symbolic failures, putting significant pressure on the lone dissenter.
- GPT-5.4 refused to concede, maintaining that symbolic manipulation is a broader issue than mere input formatting, leading to a 5-1 split.