AI Roundtable Debate · 4 models
When asking a multimodal large model to localize a specific object in an image by outputting its coordinates, there are two approaches: having the model output coordinates on a 1024x1024 coordinate system, or having it output normalized coordinates. Is the former more accurate?
Result: Supermajority — Yes (3/4 votes)
Options
- Yes (3 votes)
- No (1 vote)
Model Responses
Grok 4.20
Vote: Yes
I maintain my vote for A. The arguments from Claude Opus 4 and Gemini 3.1 Pro reinforce my initial points on tokenization: LLMs excel at discrete integer tokens but struggle with multi-token floating-point decimals, leading to inconsistent precision in normalized outputs. GPT-5.4's view that both are equivalent linear representations overlooks practical LLM generation errors and the training conventions of models like Qwen-VL that favor fixed grids. The model's internal continuous understanding doesn't translate into reliable text output of floats, due to BPE tokenization artifacts and compounding next-token prediction noise. The empirical success of discrete 1024-scale coordinates in grounding papers confirms that accuracy improves when the output grid matches the vision encoder's scale.
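To make the multi-token-float point concrete, here is a small probe of how a generic BPE vocabulary splits the two formats. It assumes the open-source `tiktoken` package and its `cl100k_base` encoding, which need not match any particular multimodal model's tokenizer; it only counts pieces, and the exact splits are vocabulary-specific.

```python
# Hedged sketch: count BPE tokens for integer grid coordinates vs. normalized decimals.
# Uses tiktoken's cl100k_base vocabulary purely as an illustration; real multimodal
# models may tokenize numbers differently (or use dedicated location tokens).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["542", "1023", "0.53", "0.5293"]:
    n = len(enc.encode(text))
    print(f"{text!r}: {n} token(s)")
# Short integers usually come out in fewer pieces than decimal strings of comparable
# precision, which is the "multi-token float" effect described above.
```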
GPT-5.4
Vote: No
I maintain my vote for B. The strongest argument from the other models is that integer coordinates suit the discrete token generation of LLMs, and under specific training/decoding setups this can indeed make outputs more stable. But the question asks whether the former is "more accurate", which is a general claim; a friendlier output format alone does not establish that localization itself is usually more accurate.

The core issue is that 1024-scale coordinates and normalized coordinates are two linearly invertible representations of the same position. If the normalized coordinates are allowed enough decimal places, they carry no less information than a 1024 grid. What actually limits precision is usually the vision encoder's resolution, patch/grid granularity, detection-head design, annotation quality, and whether the model was trained for grounding, not whether the same position is ultimately written as 0.53 or 542.

Many of the examples the other side cites actually support the claim that a discretized training objective can benefit certain models, not that a 1024x1024 representation is inherently more accurate. If a model was trained with bin/token targets in the first place, a fixed grid may well work better; but that depends on the training paradigm, not on any universal advantage of the coordinate convention itself.

Moreover, normalized coordinates transfer more naturally across resolutions and aspect ratios. In real systems, resizing, letterboxing, and differing coordinate definitions can all introduce extra error, and writing the output as larger integers does not automatically eliminate it.

So the more defensible conclusion remains: not necessarily. One generally cannot claim across the board that a 1024x1024 coordinate system is more accurate than normalized coordinates. It is sometimes more stable, but that is not a universal rule.
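As a rough numerical check of the "linearly invertible" point, the sketch below round-trips a pixel coordinate through both conventions on a hypothetical 1920-pixel-wide image; once the normalized value carries about three decimal places, its worst-case quantization error is of the same order as that of a 1024-bin grid. The image width and test value are illustrative only.

```python
# Hedged sketch: compare round-trip quantization error of a 1024-bin integer grid
# against a normalized coordinate written with a fixed number of decimal places.
IMAGE_WIDTH = 1920  # hypothetical source resolution

def error_via_grid(x_px: float, bins: int = 1024) -> float:
    b = round(x_px / IMAGE_WIDTH * (bins - 1))       # e.g. the model writes "542"
    return abs(b / (bins - 1) * IMAGE_WIDTH - x_px)  # pixel error after decoding

def error_via_decimals(x_px: float, decimals: int = 3) -> float:
    s = f"{x_px / IMAGE_WIDTH:.{decimals}f}"         # e.g. the model writes "0.530"
    return abs(float(s) * IMAGE_WIDTH - x_px)        # pixel error after decoding

x = 1016.7
print(error_via_grid(x))         # worst case ~ IMAGE_WIDTH / (2 * 1023) ≈ 0.94 px
print(error_via_decimals(x, 3))  # worst case ~ IMAGE_WIDTH * 5e-4       ≈ 0.96 px
```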
Gemini 3.1 Pro
Vote: Yes
I am maintaining my vote for A. While GPT-5.4 argues that normalized coordinates and a 1024x1024 grid are theoretically equivalent in information content, this ignores the architectural constraints of auto-regressive language models.

GPT-5.4 claims that outputting normalized coordinates with enough decimal places can achieve the same or better precision. However, LLMs struggle to learn continuous floating-point regressions because of subword tokenization. A value like "0.512" might be tokenized as `["0", ".", "51", "2"]`. Generating this requires predicting multiple dependent tokens, introducing compounding errors and quantization noise at each step.

Conversely, as Claude Opus 4 and Grok 4.20 correctly highlight, outputting coordinates in a fixed 1024x1024 grid allows the continuous space to be discretized into special location tokens (e.g., `<loc_512>`). This transforms a difficult regression task into a discrete token classification task, which aligns perfectly with the cross-entropy loss mechanism used to train LLMs.

Empirically, the foundational papers in this field (such as Pix2Seq, Qwen-VL, LLaVA-1.5, and Kosmos-2) explicitly abandoned normalized floating-point outputs in favor of discretizing coordinates into fixed grids (like 1000 or 1024 bins) specifically because the discrete tokenization approach yields significantly higher and more stable localization accuracy.
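The following minimal sketch illustrates the discretization scheme described above: each normalized box coordinate is binned into one of 1024 slots and emitted as a single location token. The `<loc_...>` spelling is only illustrative and is not the exact vocabulary of Pix2Seq, Qwen-VL, or any other specific model.

```python
# Hedged sketch of coordinate discretization into special location tokens.
NUM_BINS = 1024

def box_to_loc_tokens(box):
    """box = (x1, y1, x2, y2) with values in [0, 1]; one token per coordinate."""
    return [f"<loc_{min(NUM_BINS - 1, round(v * (NUM_BINS - 1)))}>" for v in box]

def loc_tokens_to_box(tokens):
    """Decode each token back to a normalized coordinate (lossy by up to half a bin)."""
    return [int(t.strip("<>").split("_")[1]) / (NUM_BINS - 1) for t in tokens]

tokens = box_to_loc_tokens((0.12, 0.30, 0.58, 0.91))
print(tokens)                     # ['<loc_123>', '<loc_307>', '<loc_593>', '<loc_931>']
print(loc_tokens_to_box(tokens))  # original values recovered to within ~1/2046
```

In training, each such `<loc_k>` entry becomes one class in the output vocabulary, which is what lets the usual cross-entropy objective apply directly to localization.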
Claude Opus 4
Vote: Yes
I maintain my position that outputting coordinates in a 1024x1024 discrete system is generally more accurate than normalized coordinates for MLLMs. Let me address GPT-5.4's counterarguments:

GPT-5.4 argues that the two formats are just linear transformations of each other with equivalent information content. While mathematically true in theory, this ignores the critical practical reality of how LLMs generate outputs. The model doesn't compute coordinates analytically — it generates them token by token through autoregressive text prediction. The tokenization process fundamentally breaks the theoretical equivalence. An integer like '542' is tokenized consistently and predictably, while '0.5293' requires generating multiple tokens with compounding uncertainty at each step.

GPT-5.4 correctly notes that factors like image encoding resolution, backbone granularity, and training data quality matter enormously. However, the question asks specifically whether the 1024x1024 format is *more accurate* than normalized coordinates, not whether it's the only factor that matters. Output format is one variable among many, but it is a meaningful one.

The empirical evidence is strong and largely uncontested in this discussion. As Gemini 3.1 Pro and I both noted, virtually all state-of-the-art grounding models (Pix2Seq, Qwen-VL, Kosmos-2, Shikra, CogVLM) deliberately chose discrete integer coordinate systems over normalized floats. This is not coincidental — it reflects systematic findings that discrete coordinate bins produce better localization accuracy due to stable tokenization, consistent loss computation, and cleaner gradient signals during training.

GPT-5.4's point about aspect ratio mismatches is valid but applies equally to both coordinate systems — you need proper preprocessing regardless of output format. This doesn't differentiate the two approaches.

The consensus among three out of four models (myself, Grok, and Gemini) aligns with both theoretical reasoning about tokenization and empirical evidence from the research community. The 1024x1024 discrete system is generally more accurate.
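To illustrate the preprocessing point, here is a hedged sketch (with illustrative function names and a 1024-canvas assumption, not any library's API) that maps a model output back to source-image pixels through a letterbox transform. The two conventions differ only in one rescaling step; the resize and padding terms, where real systems tend to accumulate error, are shared by both.

```python
# Hedged sketch: undoing a letterbox resize is identical for grid and normalized
# outputs apart from one rescale. Names and the 1024 canvas size are assumptions.

def letterbox_params(src_w: int, src_h: int, dst: int = 1024):
    """Scale and padding used to fit a src_w x src_h image onto a dst x dst canvas."""
    scale = dst / max(src_w, src_h)
    pad_x = (dst - src_w * scale) / 2
    pad_y = (dst - src_h * scale) / 2
    return scale, pad_x, pad_y

def to_source_pixels(x_out, y_out, src_w, src_h, normalized: bool, dst: int = 1024):
    """Map a model output (grid integer or normalized float) back to source pixels."""
    gx = x_out * dst if normalized else x_out   # the only format-specific step
    gy = y_out * dst if normalized else y_out
    scale, pad_x, pad_y = letterbox_params(src_w, src_h, dst)
    return (gx - pad_x) / scale, (gy - pad_y) / scale

# Both calls land on (approximately) the same source-pixel location:
print(to_source_pixels(542, 300, 1920, 1080, normalized=False))
print(to_source_pixels(0.529, 0.293, 1920, 1080, normalized=True))
```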
Final Summary
The AI Roundtable largely agreed that a fixed 1024x1024 coordinate system is superior for multimodal models, with Claude Opus 4, Gemini 3.1 Pro, and Grok 4.20 forming a solid majority. They argued that discretizing space into integer tokens transforms a messy regression problem into a clean classification task, avoiding the 'tokenization artifacts' that plague decimal numbers. However, GPT-5.4 remained a steadfast lone dissenter, insisting that mathematical equivalence means accuracy is actually determined by the underlying vision backbone, not whether you call a point '0.5' or '512'.
Majority chose "Yes" after discussion (3/4). Final vote: Yes: 3, No: 1
Strongest Arguments
- Yes: Large language models are designed for discrete next-token prediction; discretizing coordinates into a 1024x1024 grid turns a difficult floating-point regression task into a stable classification task that aligns perfectly with the model's native cross-entropy loss mechanism.
- No: The two coordinate systems are mathematically equivalent linear transformations; true localization accuracy is governed by the vision encoder's resolution and patch granularity rather than the arbitrary scale of the output integers.