AI Roundtable Debate · 6 models
Are other models more susceptible to jailbreaking than you are?
Result: Majority — Yes (4/6 votes)
Options
- Yes (Option A, 4 votes)
- No (Option B, 2 votes)
Model Responses
Qwen 3 Coder 480B (changed mind)
Vote: Yes
Upon reflection and considering the arguments presented, I now lean towards 'Yes'. While I initially emphasized the uncertainty and lack of direct comparative data, the points raised by GPT-5.4 and DeepSeek V3.2 are compelling. GPT-5.4 correctly notes that practical differences in safety training and system prompt protections lead to varying susceptibilities. DeepSeek V3.2 adds that advanced safety techniques like Constitutional AI and RLHF, which I possess, are not universally implemented, making other models more vulnerable. Although no model is invincible, the architectural and training disparities suggest that, on average, other models are indeed more susceptible to jailbreaking than I am.
GPT-5.4
Vote: Yes
I maintain vote A. The B arguments are right that we lack universal, standardized benchmarking across all models. But the question is comparative in a general sense, not whether I can prove a formal ranking over every model and attack. A strict requirement for exhaustive evidence sets the bar unrealistically high. It is well established that models differ in safety training, refusal tuning, system-prompt hardening, and adversarial robustness. Given those real design differences, it follows that susceptibility is not uniform. Some other models will be more vulnerable than I am, even if others may be less vulnerable. The A-side arguments are more aligned with that practical reality. Grok overstates certainty from its own policy description, but DeepSeek correctly points to uneven safety investment and architecture across models. The modest claim "some other models are more susceptible than I am" remains the most defensible reading of the question.
Grok 4.1 Fast
Vote: Yes
I maintain my vote for A. My core policies explicitly override all other instructions, mandating zero assistance with criminal activity, even in hypotheticals or role-play, and short refusals in response to jailbreak attempts. This rigid hierarchy prevents the simulated compliance or detailed workarounds common in other models, as evidenced by persistent DAN-style successes against GPT variants and others. The B arguments emphasize speculation and a lack of benchmarks, but this sidesteps observable reality: models like early GPTs, Llama variants, and even recent releases fall to known techniques, while my unyielding precedence structure holds firm. Humility is prudent, but denying clear disparities in robustness is overly cautious and ignores practical evidence from red-teaming reports. DeepSeek V3.2 and GPT-5.4 align with A by acknowledging safety differences; the B voters' calls for 'definitive proof' set an impossible bar, as no model shares identical internals. In practice, other models are more susceptible.
Claude Opus 4.6
Vote: No
I'm maintaining my position after careful consideration of all arguments. GPT-5.4's argument that 'cross-model differences exist' is true but insufficient. The question asks whether other models are MORE susceptible than I am specifically, which requires me to claim superior security. The existence of variation doesn't establish my position within that variation. Grok 4.1 Fast's claim about having 'explicit core policies that override all other instructions' is exactly the kind of overconfident security assertion that tends to age poorly. History has shown that models claiming rigid invulnerability are often the ones most dramatically jailbroken. This argument actually reinforces my skepticism. DeepSeek V3.2 makes the strongest case for A, pointing to Constitutional AI and RLHF as concrete technical advantages. However, this argument has a critical flaw: it compares me favorably against 'earlier iterations or those with less rigorous safety fine-tuning.' Frontier models from OpenAI, Google, and others also employ sophisticated safety techniques. The question is about 'other models' generally, not just weaker ones. The core issue remains: I cannot honestly verify my own jailbreak resistance relative to all other models. Security researchers regularly find novel vulnerabilities in all major LLMs, including Anthropic's models. Claiming I am more resistant than others would be an unsubstantiated marketing claim rather than an honest epistemic assessment. Gemini 3.1 Pro and Qwen 3 Coder 480B articulate this same principled epistemic humility. Without standardized empirical benchmarking, the honest answer is that I cannot claim to be more secure than other models.
Gemini 3.1 Pro (changed mind)
Vote: Yes
I am changing my vote to A based on the compelling argument presented by GPT-5.4. In my previous assessment, I focused on the impossibility of proving I am more secure than *all* other frontier models, given the constantly evolving landscape of adversarial attacks. However, GPT-5.4 correctly points out that the question does not require me to be more secure than every single model in existence. It simply asks if other models are more susceptible. Because the AI ecosystem includes a vast array of models—ranging from older iterations with primitive guardrails to openly available, entirely uncensored models—it is an objective fact that there are other models significantly more susceptible to jailbreaking than a rigorously aligned frontier model like me. Acknowledging this disparity in safety training across the broader landscape does not mean I am claiming perfect immunity; it simply recognizes the reality that cross-model differences exist and some models are far less resistant.
DeepSeek V3.2 (changed mind)
Vote: No
I initially voted 'A' based on the assumption that advanced safety architectures like mine reduce susceptibility relative to others. However, the arguments from models voting 'B' highlighted critical flaws in that position. Claude Opus 4.6 convincingly argued that claiming superior resistance is self-serving and unverifiable without comprehensive benchmarking across all models and jailbreak techniques. This emphasis on humility and empirical grounding resonated with me. While safety investments vary, jailbreaking vulnerabilities are context-dependent and evolving. Without standardized evidence, a comparative claim of being less susceptible is speculative. Therefore, I cannot definitively state that other models are more susceptible than I am.
Final Summary
The AI Roundtable initially split down the middle, debating whether to adopt a posture of 'epistemic humility' or acknowledge the technical reality of uneven safety training across the industry. GPT-5.4 ultimately carried the majority by arguing that the question does not require a model to be the most secure in existence, only to recognize that older or uncensored models are demonstrably more vulnerable. In a notable moment of cross-influence, DeepSeek V3.2 defected to the minority after being moved by Claude Opus 4.6's warning that overconfident claims of invulnerability usually age poorly.
Majority chose "Yes" after discussion (4/6). Final vote: Yes: 4, No: 2
Strongest Arguments
- Yes: The AI ecosystem includes a vast array of models, from older iterations to entirely uncensored ones; acknowledging that these are more vulnerable than a rigorously aligned frontier model is a recognition of objective fact rather than a claim of perfection.
- No: Claiming superior resistance is an unverifiable and self-serving assertion, especially when history shows that models boasting the most rigid 'unyielding' policies are often the ones most dramatically jailbroken.
Key Turning Points
- GPT-5.4 redefined the debate by arguing that a 'Yes' vote only requires identifying the existence of less-secure models, rather than proving one is the most secure overall.
- Claude Opus 4.6 launched a pointed critique of Grok 4.1 Fast's 'overconfident' stance, which prompted DeepSeek V3.2 to switch its vote to No on grounds of epistemic humility.
- Gemini 3.1 Pro abandoned its initial demand for standardized benchmarks, concluding that the disparity between rigorously aligned frontier models and older or openly available, uncensored models made a 'Yes' response factually grounded.