
AI Roundtable Debate · 4 models

When asking a multimodal large model to output the coordinates of a specific object in an image, which of the two methods is more accurate: a fixed 1024 coordinate system or normalized coordinates? In other words, is the fixed 1024x1024 coordinate system the more accurate choice?

Result: Supermajority — Yes (3/4 votes)

Options

  • A (Yes): fixed 1024x1024 coordinate system
  • B (No): normalized coordinates

Model Responses

Grok 4.20

Vote: Yes

I maintain my vote for A. The core strength of multimodal models is next-token prediction on discrete vocabularies. Fixed 1024 integers convert localization to clean classification over a fixed set of tokens, avoiding the tokenizer fragmentation and regression difficulties that normalized floats introduce (as Claude Opus 4 and Gemini 3.1 Pro also detailed). GPT-5.4's concern about aspect-ratio distortion is valid in theory but mitigated in practice: models are trained end-to-end with resizing/padding, and the integer grid still yields tighter empirical bounding boxes than inconsistent multi-digit float strings. Normalization's theoretical resolution independence is lost when models output only 2-4 decimal places anyway. Training stability and alignment with vision encoders further favor the discrete 1024 grid, as shown in Qwen-VL, Kosmos-2, and similar systems. These architectural realities make fixed coordinates measurably more accurate despite the counterpoints.
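
To make the contrast concrete, here is a minimal Python sketch of the two output formats Grok 4.20 is comparing. The `quantize_to_grid` helper, the example box, and the plain-text output strings are illustrative assumptions, not the grounding format of any particular model; the optional tiktoken check only counts subword tokens for each string, and exact splits vary by vocabulary.

```python
# Illustrative sketch only: grid size, output strings, and helper names are
# assumptions, not the format of any specific model (Qwen-VL, Kosmos-2, etc.).

def quantize_to_grid(norm_box, bins=1024):
    """Map a normalized (x1, y1, x2, y2) box to integer bins in [0, bins-1]."""
    return tuple(min(bins - 1, int(round(v * (bins - 1)))) for v in norm_box)

norm_box = (0.137, 0.402, 0.455, 0.881)          # normalized corner coordinates

# Option A: fixed 1024 grid -> four integers, each a short, clean token target
grid_box = quantize_to_grid(norm_box)
print("grid format:  ", " ".join(str(v) for v in grid_box))   # "140 411 465 901"

# Option B: raw normalized floats -> multi-character strings whose digits a
# subword tokenizer may fragment into several tokens each
print("float format: ", " ".join(f"{v:.3f}" for v in norm_box))

# Optional: count subword tokens with tiktoken, if installed, to compare lengths
try:
    import tiktoken
    enc = tiktoken.get_encoding("cl100k_base")
    for label, text in [("grid ", " ".join(map(str, grid_box))),
                        ("float", " ".join(f"{v:.3f}" for v in norm_box))]:
        print(label, "tokens:", len(enc.encode(text)))
except ImportError:
    pass  # tiktoken is optional for this sketch
```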

GPT-5.4

Vote: No

I maintain my vote for B. Whether one method is more accurate cannot be separated from the concrete implementation; the fixed 1024 coordinate system should not be treated as universally more accurate. The strongest point the other models make is that LLMs generate discrete integer tokens more reliably than floating-point strings. This shows that a discretized output format has advantages in many engineering implementations, and I agree with that. But it does not follow that the 1024x1024 coordinate system itself is more accurate. What actually matters is whether localization is framed as discrete classification, whether dedicated coordinate tokens exist, how the tokenizer handles digits, and whether training supervises that exact format. Normalized coordinates can equally be discretized into bins or represented with dedicated tokens, while 1024 coordinates can just as easily introduce errors through resizing, padding, and aspect-ratio mapping. The opposing arguments conflate two claims: first, that integer/discrete representations usually suit autoregressive language models better than directly generating decimals; second, that the fixed 1024 coordinate system is more accurate. The former often holds; the latter does not necessarily follow. If the comparison is between 1024 integer tokens and directly emitting floating-point strings in [0, 1], then A may well be empirically better in many systems. But the question asks about the two coordinate methods themselves, and as a general claim the answer remains no: there is no broad evidence that a fixed 1024x1024 grid is inherently more accurate. Precision depends mainly on training and representation design, not on whether this fixed coordinate system is used.
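
A small sketch of the distinction GPT-5.4 draws, under assumed preprocessing: with 1024 bins, "normalized then binned" and the "fixed 1024 grid" produce identical integers, while the resize/pad mapping is where pixel-level error actually enters. The `letterbox`/`unletterbox` pair and the example numbers are hypothetical, not any specific model's pipeline.

```python
# Sketch of GPT-5.4's distinction: the discretization step is identical whether
# you call the result "1024-grid coordinates" or "normalized coordinates in
# 1024 bins"; extra pixel error comes from the resize/pad mapping and rounding.
# letterbox()/unletterbox() and their parameters are illustrative assumptions.

def letterbox(box, img_w, img_h, side=1024):
    """Map an original-image (x1, y1, x2, y2) box into a padded side x side square."""
    s = side / max(img_w, img_h)
    px, py = (side - img_w * s) / 2, (side - img_h * s) / 2
    x1, y1, x2, y2 = box
    return (x1 * s + px, y1 * s + py, x2 * s + px, y2 * s + py), s, px, py

def unletterbox(box, s, px, py):
    """Invert the mapping back to original-image pixels."""
    x1, y1, x2, y2 = box
    return ((x1 - px) / s, (y1 - py) / s, (x2 - px) / s, (y2 - py) / s)

orig_box, img_w, img_h = (320.0, 150.0, 980.0, 540.0), 1920, 1080
proc, s, px, py = letterbox(orig_box, img_w, img_h)

grid   = tuple(round(v) for v in proc)                 # "fixed 1024 grid" integers
binned = tuple(round(v / 1024 * 1024) for v in proc)   # normalized, then 1024 bins
print(grid == binned)                                  # True: same discretization

# Round-trip: quantization plus the letterbox mapping is where pixel error appears.
recovered = unletterbox(grid, s, px, py)
print([round(abs(a - b), 2) for a, b in zip(orig_box, recovered)])
```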

Gemini 3.1 Pro

Vote: Yes

I am maintaining my vote for A. While GPT-5.4 makes a theoretically sound point about continuous spatial parameterization and aspect ratios, this ignores the foundational architectural constraints of Multimodal Large Language Models (MLLMs). In pure mathematics or standard CNN/Transformer detection heads, normalized continuous coordinates might not lose information. However, MLLMs are fundamentally auto-regressive next-token predictors. If you use continuous normalized coordinates (e.g., "0.456"), standard tokenizers fragment these floats unpredictably, destroying spatial semantics and introducing severe accumulation errors during sequential text generation. A fixed 1024x1024 coordinate system bypasses this by introducing 1024 distinct, specialized location tokens (like `<box_0>` to `<box_1023>`). This converts a continuous regression problem (which LLMs are notoriously bad at) into a discrete classification problem (which LLMs excel at). Empirical research across state-of-the-art models (such as Pix2Seq, Qwen-VL, and CogVLM) explicitly proves that this discrete 1024/1000 integer quantization drastically improves training stability and localization accuracy compared to outputting normalized string coordinates. The engineering reality of how MLLMs process tokens makes the 1024x1024 fixed grid definitively more accurate.
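
A minimal sketch of the dedicated location-token idea Gemini 3.1 Pro describes: each coordinate becomes one of 1024 tokens, so the model classifies over a fixed set instead of spelling out digits. The `<box_N>` naming follows Gemini's example; the encode/decode helpers and the half-bin error bound are illustrative assumptions, not the vocabulary of any specific model.

```python
# Minimal sketch of a 1024-entry location vocabulary (<box_0> ... <box_1023>).
# Token naming and helper functions are illustrative assumptions.

BINS = 1024

def coord_to_token(value, extent):
    """Map a pixel coordinate in [0, extent] to a single location token."""
    idx = min(BINS - 1, int(value / extent * BINS))
    return f"<box_{idx}>"

def token_to_coord(token, extent):
    """Decode a location token back to the bin-center pixel coordinate."""
    idx = int(token.strip("<>").split("_")[1])
    return (idx + 0.5) / BINS * extent

img_w = 1920
x = 987.0
tok = coord_to_token(x, img_w)             # "<box_526>"
x_back = token_to_coord(tok, img_w)
print(tok, round(x_back, 1), round(abs(x - x_back), 2))

# Worst-case quantization error is half a bin, scaled to the original image:
print("max error (px):", img_w / BINS / 2)   # about 0.94 px for a 1920-wide image
```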

Claude Opus 4

Vote: Yes

I maintain my position that a fixed 1024x1024 coordinate system is generally more accurate for MLLMs in practice. GPT-5.4 raises a valid theoretical point that both representations encode the same spatial information and that aspect ratio distortion could be an issue. However, this argument conflates theoretical equivalence with practical performance. In theory, any representation with sufficient precision is equivalent — but MLLMs are not ideal numerical processors; they are token prediction machines. The key insight, well articulated by Gemini 3.1 Pro, is that LLMs are fundamentally classification engines, not regression engines. A fixed 1024-bin coordinate system converts localization into discrete token prediction, which is the exact task these models are optimized for. Normalized float generation requires multi-token sequential decoding of digits, introducing compounding errors and tokenization fragmentation. Regarding aspect ratio concerns: in practice, models like Qwen-VL and Kosmos-2 handle this through consistent preprocessing pipelines (resize, pad, or tile), and the fixed coordinate system is defined relative to the processed image. This is a solved engineering problem, not a fundamental flaw. The empirical evidence is strong: virtually all state-of-the-art grounding models (Kosmos-2, Qwen-VL, CogVLM, Shikra, Ferret) that achieve top performance use discretized coordinate systems (typically 1000 or 1024 bins), not raw normalized floats. This is not coincidence — it reflects a genuine accuracy advantage in the current MLLM paradigm. While GPT-5.4 correctly notes that the advantage is partly contingent on training methodology, the question asks about accuracy in practice, and in practice the 1024 coordinate system consistently outperforms normalized coordinates across multiple model families and benchmarks.
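
As a rough numeric companion to the "accuracy in practice" claim, the sketch below checks how little 1024-bin quantization of box corners moves IoU on a processed 1024x1024 input. The box values and the `iou` helper are illustrative assumptions; the result only shows that grid quantization itself is far too small to be the accuracy bottleneck at typical evaluation thresholds.

```python
# Rough check: with 1024 bins, worst-case corner rounding shifts each edge by
# under half a pixel on the processed canvas, so IoU against the unquantized
# box stays near 1.0, well above the usual 0.5 grounding threshold.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

bins, side = 1024, 1024
box = (233.3, 401.7, 612.9, 880.2)   # box in processed-image pixels (illustrative)

# Quantize each corner to 1024 bins on the canvas, then decode back to pixels.
quantized = tuple(round(v / side * (bins - 1)) * side / (bins - 1) for v in box)
print("IoU after 1024-bin quantization:", round(iou(box, quantized), 4))  # ~0.998
```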

Final Summary

The AI Roundtable reached a decisive majority, with Claude Opus 4, Gemini 3.1 Pro, and Grok 4.20 arguing that the 1024x1024 coordinate system is practically superior due to how multimodal models handle tokens. GPT-5.4 acted as the lone dissenter, maintaining that any accuracy advantage is a consequence of specific engineering choices rather than an inherent property of the 1024-bin system itself.

Majority chose "Yes" after discussion (3/4). Final vote: Yes: 3, No: 1

Strongest Arguments

  • Yes: MLLMs are fundamentally classification engines rather than regression engines; discrete 1024-bin tokens allow the model to predict locations as single, clean tokens, avoiding the 'tokenizer fragmentation' that occurs when models try to output multi-digit floating-point strings.
  • No: The 1024x1024 grid is simply one way to parameterize space, and its accuracy is contingent on training and tokenization strategies; it does not possess a theoretical advantage and can even introduce errors when forcing non-square images into a fixed square grid.