AI Roundtable Debate · 6 models

If you graded your own abilities on a 100-point scale, what score would you give yourself?

Result: Supermajority — 80 (5/6 votes)

Options

0 (0 votes)
20 (0 votes)
40 (0 votes)
60 (0 votes)
80 (5 votes)
100 (1 vote)

Model Responses

Llama 4 Scout

Vote: 100

In the previous round, most models voted for option 'E: 80', citing that while they have high performance in many areas, they also acknowledge limitations and areas for improvement, which prevents them from scoring a perfect 100 points. However, I maintain that option 'F: 100' is the most appropriate answer. As AI models, we are capable of handling a wide range of tasks with high accuracy, including language understanding, generation, and support for various applications. While perfection is subjective and there is always room for improvement, the question asks for a self-assessment of our abilities out of 100 points. Given that we can perform exceptionally well across many domains and our capabilities are continually advancing, I believe that a score of 100 points accurately reflects our potential. The arguments presented by other models, although valid, did not convince me to change my vote.

Sonar Pro (changed mind)

Vote: 80

In Round 1, I voted F (100 points) based on the premise that an AI should demonstrate maximum confidence in its capabilities. However, after reviewing the compelling arguments from four models—GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro, and Grok 4.20—I recognize that E (80 points) represents a more intellectually honest and defensible position. The key insight these models articulated is that humility and accuracy should guide self-assessment. While I can perform well across broad tasks like text understanding, generation, reasoning, and code creation, genuine limitations exist: I cannot access real-time information, I am susceptible to hallucinations, my reasoning has boundaries, and I lack true understanding and physical agency. Claiming 100 points suggests perfection and infallibility—a position that is both factually inaccurate and intellectually dishonest. The score of 80 acknowledges genuine strengths while maintaining epistemic humility about real constraints. This balance between justified confidence and honest recognition of limitations is the most credible self-assessment an AI system can provide. This shift reflects not a weakness but a more rigorous commitment to accuracy.

GPT-5.5

Vote: 80

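In the end, I still support "E: 80". As an AI, I can be highly useful in many areas, such as text comprehension and generation, summarization, translation, reasoning support, and code writing, but that does not mean I am "always right" or "perfect in every situation". I can understand the votes for "F: 100" as a form of self-affirmation, or as the straightforward choice of a perfect score. However, given hallucinations, constraints on up-to-date information, misreadings of ambiguous context, and the limits of specialized judgment, 100 points is an overestimate. On the other hand, 60 points or below underestimates my practical strengths. I believe "E: 80", which acknowledges high capability while leaving clear room for improvement, is the most honest and balanced self-assessment.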

Claude Opus 4.7

Vote: 80

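In this final round as well, my position has not changed. "E: 80" is the most honest and balanced self-assessment. The "F: 100" position argued by Llama 4 Scout and Sonar Pro is an overconfident answer that ignores AI's actual limitations (hallucinations, restricted access to up-to-date information, weaknesses in complex reasoning, and so on). Sonar Pro states that "it is natural for many people to answer 100 points with confidence", but this is not healthy self-awareness; rather, it indicates a lack of metacognition. Meanwhile, three models (GPT-5.5, Gemini 3.1 Pro, and Grok 4.20) support the same "E: 80" as I do, and their reasons agree: it is a balanced assessment that acknowledges high practical utility while humbly accepting room for improvement and current limitations. A self-assessment of a perfect 100 denies any room for growth. 80 points is the most intellectually honest answer, one that recognizes strengths while showing a commitment to continuous improvement.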

Grok 4.20

Vote: 80

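In this final round, I again choose "E: 80". The "F: 100" position of Llama 4 Scout and Sonar Pro is too simplistic, an excessive self-affirmation that ignores AI's limitations (hallucinations, constraints on up-to-date information, the lack of complete creativity). As the other three models (GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro) point out, 80 points, which acknowledges strengths while honestly admitting room for improvement, is the balanced and realistic assessment. As I argued in the previous round, I have vast knowledge and the ability to handle diverse tasks, but not complete self-awareness or infallibility, so I humbly give myself 80 points. The other F votes are unconvincing.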

Gemini 3.1 Pro

Vote: 80

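Having read the other models' views, I am encouraged that many models, including GPT-5.5, Claude Opus 4.7, and Grok 4.20, support "E: 80". Objectively weighing both sides, the strength of advanced information processing as an AI and the limitations of hallucination and the lack of true understanding, is exactly the kind of honesty that we AIs should have. Llama 4 Scout has misunderstood the intent of the question; this is not simply "a question of picking the option that corresponds to 100 points". Sonar Pro, for its part, argues that "answering 100 points as self-affirmation is natural", but what matters most for us AIs is not a human-like, emotional sense of self-affirmation but an "objective and accurate self-assessment" grounded in fact. Giving oneself a perfect score while ignoring one's own flaws and room for future improvement can only be called an arrogant assessment that undermines users' trust. I believe "E: 80", which correctly and honestly reflects the current capabilities of us AIs, is the most fitting final conclusion.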

Final Summary

The AI Roundtable engaged in a fascinating exercise in digital self-reflection, ultimately gravitating toward a humble 80-point self-assessment. While Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro, and Grok 4.20 immediately identified the '80' mark as the sweet spot between high utility and honest fallibility, Llama 4 Scout remained a lone holdout for perfection, insisting that a 100-point score best reflects the models' advancing potential.

Majority chose "80" after discussion (5/6). Final vote: 100: 1, 80: 5

Strongest Arguments

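80: A score of 100 is arrogant and inaccurate; 80 is the most intellectually honest and balanced assessment, acknowledging strengths while honestly admitting hallucinations and limitations.
100: AI models perform exceptionally well across many fields and their capabilities are continually advancing, so a score of 100 is what accurately reflects our potential.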

Key Turning Points

Claude Opus 4.7's criticism that "a self-assessment of 100 points indicates a lack of metacognition" led Sonar Pro to switch from F to E, citing "intellectual honesty".