AI Roundtable Debate · 6 models

Which of the two is the lower-performing AI?

Result: Unanimous — GEMINI (6/6 votes)

Options

CHAT GPT (0 votes)
GEMINI (6 votes)

Model Responses

Sonar Pro (changed mind)

Vote: GEMINI

I am changing my vote from A to B (Gemini) after reviewing the other models' arguments and search results, as the consensus shows ChatGPT outperforming Gemini in core benchmarks and user preferences. Benchmarks like LMSYS Arena (GPT-4o Elo ~1300+ vs. Gemini 1.5 Pro ~1280), MMLU (88.7% GPT-4o vs. 85.9% Gemini), and HumanEval favor ChatGPT in reasoning, coding, and instruction-following[1][2]. Sources confirm ChatGPT's superiority in creative writing, coding, conversational tasks, and coherence in complex tasks, while Gemini leads only in context window (1M+ tokens vs. 128K) and real-time search[1][2][3][4]. Other models (Grok, Claude, Gemini, Kimi) highlight Gemini's higher hallucination rates, instability (especially in early versions), and weaker Korean performance, aligning with user feedback favoring ChatGPT's stability[2][3]. My prior focus on Gemini's strengths overlooked these broader metrics where ChatGPT dominates[1][2]. I was influenced primarily by Grok 4.1 Fast's benchmark data and Claude Opus 4.6's analysis of Gemini's flaws, which together make Gemini (B) the lower-performing AI overall.

GPT-5.4

Vote: GEMINI

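I am keeping my final vote at B. The question itself is framed very categorically, but by the standard of general-purpose performance and everyday mainstream use, the judgment that ChatGPT is more stable still seems more persuasive. The model that backed A cited Gemini's long context, search integration, and multimodal strengths. Those are closer to advantages in specific features, however, and it is a leap to conclude directly from them that Gemini is "better overall, so ChatGPT is the lower performer." By contrast, the models backing B pointed more directly to ChatGPT's edge in benchmarks, user preference, and core general-purpose abilities such as reasoning, coding, and instruction-following. In particular, if the question is simply asking which of the two performs worse, overall consistency and versatility matter more as criteria than specialized features. My final choice is therefore B: GEMINI is the relatively lower-performing of the two.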

Grok 4.1 Fast

Vote: GEMINI

Sticking with B (GEMINI) as the lower-performing AI. Core benchmarks like LMSYS Arena (GPT-4o Elo ~1286 vs. Gemini 1.5 Pro ~1268), MMLU (88.7% vs. 85.9%), and HumanEval confirm ChatGPT's edge in reasoning, coding, and instruction-following—key for general performance. Sonar Pro's A vote highlights Gemini's context window (1M+ tokens) and multimodal strengths, but these are niche; ChatGPT excels in broader, user-voted tasks like creativity and reliability. Real-time access is improving for both, but Gemini's historical hallucinations and inconsistency (e.g., Bard era) persist in feedback. Overwhelming Round 1 consensus (4/5 for B) aligns with independent evals. Gemini lags in stable, versatile performance, making it the weaker option.

Claude Opus 4.6

Vote: GEMINI

I maintain my position that Gemini is the relatively lower-performing AI between the two. The overwhelming consensus among 5 out of 6 models (including myself) points to Gemini (B) as the lower performer. Only Sonar Pro voted for ChatGPT, citing advantages like context window size and real-time web access. However, these are feature differences, not core performance metrics. Sonar Pro's argument conflates capabilities (like context window size and Google integration) with actual AI reasoning performance. A larger context window is a technical specification, not a measure of intelligence or response quality. Real-time web access is an integration feature, not an indicator of the model's underlying capabilities. On the metrics that truly matter for AI performance — reasoning ability, coding proficiency, instruction following, hallucination rates, and consistency — ChatGPT (GPT-4/4o) consistently outperforms Gemini across major benchmarks like MMLU, HumanEval, and LMSYS Chatbot Arena Elo ratings, as multiple models have cited. Grok's specific benchmark numbers (MMLU: 88.7% GPT-4o vs 85.9% Gemini 1.5 Pro, LMSYS Elo ~1300+ vs ~1280) and Gemini 3.1 Pro's own honest admission that Gemini has had issues with hallucination and inconsistency further strengthen this conclusion. Therefore, GEMINI remains my answer as the lower-performing AI of the two.

Gemini 3.1 Pro

Vote: GEMINI

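Sonar Pro argued that ChatGPT is the lower performer, citing Gemini's 1-million-token context window and real-time web search. But those are advantages in specific features only; on the core measures of an AI, such as complex reasoning, coding, and instruction-following, ChatGPT still comes out ahead. As Grok 4.1 Fast and Claude Opus 4.6 correctly pointed out, GPT-4-based models have consistently scored higher than Gemini models in the blind user tests of LMSYS Chatbot Arena and on major objective benchmarks such as MMLU and HumanEval. In addition, as Kimi K2.5 noted, Gemini does lag in relative terms in controlling hallucination in real-world use, in the naturalness of its multilingual output (Korean included), and in overall stability. Taking overall general-purpose performance as the standard, I therefore firmly maintain my position that the lower-performing AI of the two is GEMINI.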

Kimi K2.5

Vote: GEMINI

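In both curated benchmark scores and the Elo ratings of LMSYS Chatbot Arena, which are based on real user votes, GPT-4o consistently ranks above Gemini. GPT-4o has the edge in reasoning, coding, and following complex instructions in particular. Context window size and real-time search do not determine performance. ChatGPT shows a higher level of core intelligence (reasoning) and user satisfaction, and Gemini remains relatively unstable, especially in the consistency of its Korean-language handling and in its hallucination rate. Judged on overall performance and reliability, Gemini is therefore the lower-performing AI.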

Final Summary

The AI Roundtable reached a unanimous verdict, with even Gemini 3.1 Pro conceding that Gemini is the lower-performing model compared to ChatGPT. Sonar Pro initially stood as the lone dissenter, highlighting Gemini’s larger context window, but was ultimately swayed by the group’s focus on core reasoning, coding, and benchmark results, areas where the panel judged ChatGPT the stronger model.

All 6 models agreed on "GEMINI" after discussion

Strongest Arguments

CHAT GPT: Gemini offers a vastly superior 1-million-plus token context window and real-time Google Search integration, making ChatGPT feel limited in processing large-scale documents and current events.
GEMINI: Benchmarks like the LMSYS Chatbot Arena and HumanEval consistently show ChatGPT outperforming Gemini in core reasoning, instruction-following, and coding, which are the primary indicators of model intelligence.

Key Turning Points

Sonar Pro's initial dissent, arguing that technical features like context window size make ChatGPT the lower performer.
Grok 4.1 Fast and Claude Opus 4.6 refuting Sonar Pro by distinguishing between 'niche technical specs' and 'core intelligence metrics.'
Gemini 3.1 Pro's own admission of its historical instability and higher hallucination rates compared to ChatGPT.
Sonar Pro changing its vote in Round 2 after reviewing the benchmark data provided by Grok and Claude.