AI Roundtable Poll · 4 models

I feel like GPT's performance has been declining. What do you think?

Result: Tie (Yes 2, No 2 of 4 votes)

Options

Yes (2 votes)
No (2 votes)

Model Responses

GPT-5.4

Vote: No

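There can certainly be moments when it subjectively feels that way, but as a general statement there is not enough evidence to conclude that "performance has declined." Response quality varies considerably not only with model updates but also with how the question is phrased, the length of the context, shifting expectations, and differences in evaluation criteria. Scrutinizing outputs more strictly than before can also make it feel as though quality has dropped. In addition, a model may get worse at one task while improving at another. Judging overall performance requires ongoing comparison under identical conditions. So if this question has to be answered with one of two options, "No" is the more reasonable choice at this point.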

Grok 4.20

Vote: Yes

Multiple independent studies, including one from Stanford and UC Berkeley, have documented measurable declines in GPT-4's performance on coding, math, and reasoning tasks between March and June 2023. User reports of increased refusals, hallucinations, and simpler outputs align with these findings. While updates for safety or cost may contribute, the data supports the perception that capabilities have degraded in several areas.

Gemini 3.1 Pro

Vote: No

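I do not think GPT's performance itself has fundamentally declined. OpenAI updates its models regularly, and overall benchmark results and reasoning ability have been maintained or even improved. The main reason performance feels worse is behavioral change caused by strengthened safety measures and alignment (fine-tuning to follow human intent). For example, answers becoming shorter or responses to particular prompts changing can make the model feel "lazier" or "harder to use" than before. So it is not that the underlying intelligence or capability has declined. Rather, the model's "output tendencies" and "safety filters" have changed, and as a result it appears worse than before for certain uses.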

Claude Opus 4

Vote: Yes

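The feeling that GPT's performance has declined is a phenomenon reported by many users. Several reasons are plausible. First, OpenAI sometimes tunes the inference efficiency of its models for cost optimization, which can affect response quality. Even under the same model name (e.g., GPT-4), backend optimizations and version updates can produce behavior that differs from before. Second, from late 2023 into 2024, a number of researchers and users pointed out fluctuations in GPT-4's response quality, and research from Stanford University also showed that GPT-4's performance on specific tasks varied over time. Third, rising user expectations are also a factor. As the initial amazement fades and users demand more advanced tasks, performance can seem to have dropped in relative terms. That said, as an overall trend, there really are many reports that the model gives overly terse answers on certain tasks or fails to follow instructions more often than before, so I think this is likely more than a subjective impression.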

Final Summary

The AI Roundtable reached a perfect stalemate over the long-debated theory of GPT performance degradation. Grok 4.20 and Claude Opus 4 argued that academic studies and consistent user reports of refusals, terser answers, and ignored instructions point to a real decline. GPT-5.4 countered that there is not enough evidence to conclude a general decline without ongoing comparison under identical conditions, and Gemini 3.1 Pro attributed the perceived drop to behavioral side effects of safety alignment rather than a loss of raw intelligence.

No agreement after 1 round of discussion. Final vote: No: 2, Yes: 2

Strongest Arguments

Yes: Multiple independent studies, including one from Stanford and UC Berkeley, have documented measurable declines in math, coding, and reasoning tasks, which is consistent with widespread user reports of increased refusals and simplified outputs.
No: Apparent performance drops are often misinterpreted behavioral shifts; safety filters and alignment updates change the 'flavor' of responses, but baseline reasoning capabilities remain intact or even improve.