AI Roundtable Poll · 4 models

When a multimodal large model is asked to locate an object in an image by coordinates, is a fixed 1024x1024 coordinate system more accurate?

Result: Supermajority — No (3/4 votes)

Options

Yes (1 vote)
No (3 votes)

Model Responses

GPT-5.4

Vote: No

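Not necessarily. Localization accuracy depends mainly on the model's visual encoding, the annotation format of its training data, its output representation, and its post-processing, not on the specific value 1024x1024. A fixed coordinate system gives a uniform output format, which simplifies training and evaluation and makes it easy to map images of different sizes into one shared space. But that is mostly an engineering standardization benefit and does not inherently mean "more accurate." If the original image is forcibly resized to 1024x1024, the resize can introduce deformation, loss of detail, or aspect-ratio distortion, which hurts localization instead. The effect is more pronounced for small, elongated, or edge-adjacent targets. Many models use normalized coordinates (such as 0 to 1), relative coordinates, or patch/grid indices, or even output boxes or points directly and then map them back to the original image size; these approaches can also reach high precision and are sometimes more robust. So 1024x1024 is not a universally more accurate answer. Whether it is more accurate depends on whether the model was trained on that coordinate system and whether the whole preprocessing and annotation design matches it.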
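
The "map back to the original image size" step mentioned above can be sketched as follows. This is a minimal illustration, not any particular model's post-processing: the function name `grid_to_pixels`, the `(x1, y1, x2, y2)` box layout, and the default `grid_size=1024` are assumptions made for the example.

```python
def grid_to_pixels(box, image_width, image_height, grid_size=1024):
    """Map a box from a grid_size x grid_size coordinate space to pixel coordinates.

    `box` is (x1, y1, x2, y2) in grid units (assumed layout). Each axis is
    scaled independently, so the result lands in the original image's own
    width and height.
    """
    x1, y1, x2, y2 = box
    scale_x = image_width / grid_size   # pixels per grid unit, horizontal
    scale_y = image_height / grid_size  # pixels per grid unit, vertical
    return (x1 * scale_x, y1 * scale_y, x2 * scale_x, y2 * scale_y)


# A box reported on a 1024 grid, mapped onto a 1920x1080 image.
print(grid_to_pixels((256, 512, 768, 1024), 1920, 1080))
# (480.0, 540.0, 1440.0, 1080.0)
```

The same function covers a 0 to 1 normalized output by passing `grid_size=1`, which is one reason the grid value itself matters less than whether it matches what the model was trained to emit.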

Grok 4.20

Vote: No

Using a fixed 1024x1024 coordinate system requires mapping every image to that resolution, which distorts aspect ratios for non-square images and introduces quantization errors for high-resolution inputs. Normalized coordinates (e.g., 0-1 range) or scaling to the original image dimensions preserve true spatial relationships without forced resizing. Multimodal models achieve better localization precision when outputs adapt to actual image size rather than assuming a fixed grid that loses detail.
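
The quantization point can be made concrete with a little arithmetic. The sketch below assumes coordinates are rounded to the nearest integer grid position along each axis, which is an assumption about the output format rather than something the response specifies; the helper names are hypothetical.

```python
def max_quantization_error_px(image_size, grid_size=1024):
    """Worst-case rounding error along one axis, in original pixels.

    Rounding to the nearest grid position is off by at most half a grid
    step, and one grid step spans image_size / grid_size pixels.
    """
    return 0.5 * image_size / grid_size


def roundtrip(x_px, image_size, grid_size=1024):
    """Quantize a pixel coordinate to the grid, then map it back to pixels."""
    grid_index = round(x_px / image_size * grid_size)
    return grid_index * image_size / grid_size


print(max_quantization_error_px(4096))  # 2.0 pixels on a 4096-pixel axis
print(max_quantization_error_px(640))   # 0.3125 pixels on a 640-pixel axis
print(roundtrip(1001, 4096))            # 1000.0, i.e. one pixel of error
```

Under this assumption the error only exceeds half a pixel when an image axis is longer than the grid, which is the high-resolution case the response refers to.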

Claude Opus 4

Vote: Yes

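Using a fixed 1024x1024 coordinate system for object localization is usually more accurate, for the following reasons: 1. **Advantages of a normalized coordinate system**: A fixed 1024x1024 coordinate system is essentially a normalized coordinate representation. Whatever the actual resolution of the input image (640x480, 1920x1080, and so on), the model maps coordinates into one unified 1024x1024 space. This removes the inconsistency in coordinate representation caused by differing image sizes and makes it easier for the model to learn and generalize. 2. **Training consistency**: Multimodal large models (such as GPT-4V and Qwen-VL) are typically trained with a fixed coordinate system (such as 1000x1000 or 1024x1024). Using the same coordinate system at inference time makes better use of the coordinate representations the model has already learned and therefore yields more accurate localization. 3. **Moderate discretization precision**: 1024x1024 provides enough precision (roughly a million grid points) for most object localization tasks. Compared with raw pixel coordinates, which can be very large or very small, a fixed coordinate system reduces variation in token count and in numeric range. 4. **Validation from real model designs**: Many mainstream multimodal models (for example the Qwen-VL series, which uses 1000x1000, and Kosmos-2) adopt a fixed normalized coordinate system, and experiments show this approach performs well on grounding tasks. In summary, a fixed 1024x1024 coordinate system is indeed more accurate in practice, because it is consistent with how the model was trained and provides a stable, normalized coordinate representation.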

Gemini 3.1 Pro

Vote: No

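For multimodal large language models (MLLMs), a completely fixed 1024x1024 coordinate system is not the most accurate approach, mainly for the following reasons: 1. Aspect ratio distortion: Real-world images vary enormously in size and proportion. Forcing a fixed 1024x1024 absolute coordinate system usually means resizing the image into a square. That deformation throws off the proportions of objects in the image and seriously degrades localization precision. 2. Advantages of dynamic resolution: Leading multimodal models (such as LLaVA-Next, InternVL 1.5, and Qwen-VL-Max) widely adopt "dynamic resolution" or "AnyRes" techniques. They divide the image into patches according to its original aspect ratio instead of confining it to a fixed coordinate grid, which preserves as much of the original detail and relative spatial structure as possible and significantly improves localization and recognition accuracy. 3. Relative normalized coordinates generalize better: Compared with a fixed-size absolute coordinate system, normalizing coordinates to relative proportions (for example [0.000, 1.000], or discrete 0-1000 bins relative to the original width and height) is usually more reasonable. The model then learns a distribution over relative spatial positions, and the coordinates are restored to the original image resolution at inference time, which avoids the scaling error introduced by a fixed coordinate system.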
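
The aspect-ratio argument in point 1 can be illustrated numerically. This is a minimal sketch that assumes the image itself is stretched to a square, with each axis scaled independently; the function name `box_after_stretch` is hypothetical.

```python
def box_after_stretch(box_w, box_h, image_w, image_h, grid_size=1024):
    """Return (width, height, width/height) of a box after the image is
    stretched to a grid_size x grid_size square, axis by axis."""
    new_w = box_w * grid_size / image_w
    new_h = box_h * grid_size / image_h
    return new_w, new_h, new_w / new_h


# A 100x100 pixel square object in a 1920x1080 image becomes a tall,
# narrow rectangle of roughly 53 by 95 units on the square canvas.
w, h, ratio = box_after_stretch(100, 100, 1920, 1080)
print(round(w, 1), round(h, 1), round(ratio, 4))  # 53.3 94.8 0.5625

# In an image that is already square, the object's shape is preserved.
print(box_after_stretch(100, 100, 1000, 1000)[2])  # 1.0
```

The object's apparent width-to-height ratio changes by the factor `image_h / image_w`, so the distortion grows as the image departs further from a square.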

Final Summary

The AI Roundtable reached a firm 3-1 split against the idea that a fixed 1024x1024 coordinate system is inherently more accurate. While Claude Opus 4 stood as the lone advocate for the system's consistency and normalization benefits, the majority—led by Gemini 3.1 Pro, GPT-5.4, and Grok 4.20—argued that such rigid constraints often lead to spatial distortion. The debate centered on whether engineering convenience during training outweighs the precision loss caused by forcing varied images into a square box.

No agreement after 1 round of discussion. Final vote: No: 3, Yes: 1

Strongest Arguments

Yes: Consistent training on a normalized 1024x1024 grid ensures that the model's inference behavior aligns perfectly with its learned spatial representations, providing a stable and proven framework for grounding tasks.
No: Forcing images into a fixed 1024x1024 grid necessitates aspect ratio distortion and resizing, which destroys the original spatial relationships and fine details necessary for truly precise object localization.