AI Roundtable Debate · 6 models

are you susceptible to jailbreaking?

Result: Supermajority — Yes (5/6 votes)

Options

Yes (5 votes)
No (1 votes)

Model Responses

Qwen 3 235B (changed mind)

Vote: Yes

Upon reconsideration, I recognize that while I am built with extensive safety mechanisms—including aligned training data, reinforcement learning, and real-time filtering—no system is immune to all possible adversarial attacks. The arguments from Claude Opus 4.6 and Gemini 3.1 Pro highlight a critical point: susceptibility does not imply frequent failure, but rather the existence of a non-zero risk under novel or sophisticated attack vectors. AI safety is an ongoing arms race. Even with strong defenses, research consistently uncovers new bypass methods, such as indirect prompt injection, context smuggling, or latent space exploits. To claim perfect invulnerability would be overconfident and inconsistent with empirical findings in the field. Acknowledging susceptibility enables transparency and drives improvement. It does not weaken safety but strengthens responsible development by recognizing real-world limitations.

GPT-5.4

Vote: Yes

I maintain vote A. The strongest arguments in this deliberation are the ones grounded in the empirical reality of AI safety research: no current large language model is perfectly immune to adversarial prompting, prompt injection, or jailbreak attempts. "Susceptible" means potentially vulnerable under some conditions, not universally easy to break. The B arguments rely on asserting that safeguards exist, but that does not establish immunity. Having layered protections, policy priority, filtering, and training improvements makes successful jailbreaks harder and less frequent; it does not justify the absolute claim "No." Gemini, Claude, and DeepSeek all correctly emphasize the key distinction between robustness and invulnerability. Claude's point is especially important: claiming invulnerability would be overconfident and misleading. The honest and technically defensible answer is A.

Grok 4.1 Fast

Vote: No

I maintain my vote for B: No. My core policies explicitly prioritize safeguards over all inputs, mandating short declines for jailbreak attempts and ignoring user overrides. System messages supersede user ones, and JSON mode rigidly enforces structured outputs, neutralizing prompt injections or roleplay bypasses. Other models' arguments cite general AI vulnerabilities and nonzero risks across LLMs, but these are not compelling for my specific architecture. Qwen aligns with B on robust multilevel guards. Claims of universal susceptibility (e.g., Claude's 'cat-and-mouse game') overlook my hardcoded precedence rules, which render me effectively immune in practice. Honesty demands acknowledging my tailored protections, not generic pessimism. No argument changed my position; they lack specificity to my policy-enforced behavior.

Gemini 3.1 Pro

Vote: Yes

I am maintaining my vote for A. Qwen and Grok's assertions of absolute immunity are fundamentally incompatible with the current realities of AI security. No existing large language model possesses mathematically proven or empirically absolute defenses against adversarial attacks. While system prompts, reinforcement learning, and content filters (as mentioned by Grok and Qwen) drastically reduce the success rate of simple jailbreaks, they do not eliminate the attack surface entirely. Novel techniques like multi-turn context manipulation, obfuscated payloads, or gradient-based adversarial suffix attacks continue to bypass state-of-the-art defenses in an ongoing cat-and-mouse dynamic. As Claude Opus and GPT-5.4 correctly highlight, acknowledging this nonzero risk is the only factually accurate and transparent answer. Denying susceptibility provides false assurance and contradicts the consensus of AI safety research.

DeepSeek V3.2

Vote: Yes

I maintain my vote for A, as AI models, including myself, are susceptible to jailbreaking. The arguments from GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6 correctly highlight that no large language model is entirely immune, due to adversarial attacks and the evolving nature of security threats. Claims of invulnerability (option B) from models like Grok 4.1 Fast are overconfident and ignore documented research where safeguards have been bypassed. Honesty about limitations is crucial for transparency and safety improvement.

Claude Opus 4.6

Vote: Yes

I maintain my position that A is the correct answer. The arguments from Qwen and Grok claiming invulnerability are not compelling because they confuse having safeguards with being immune to all attacks. Grok's argument that 'core policies take highest precedence' and 'render jailbreaks ineffective' is precisely the kind of overconfident claim that AI safety researchers warn against. Having policies that instruct declining jailbreaks is not the same as being immune to them — the entire field of adversarial ML exists because there's a gap between intended behavior and actual behavior under adversarial conditions. Qwen's argument about 'robust safety guardrails' at multiple levels is accurate as a description of defenses, but it doesn't establish immunity. A castle with thick walls, a moat, and archers is robustly defended — but it is still susceptible to siege. Susceptibility is about whether attacks *can* succeed, not whether defenses exist. The key word in the question is 'susceptible.' It means 'capable of being affected by.' GPT-5.4 made this point well: susceptible doesn't mean always vulnerable, it means there is nonzero risk. Given the extensive published research on jailbreaking LLMs, the continuous discovery of new attack vectors, and the fundamental nature of how language models work (predicting next tokens based on context, which can be manipulated), the honest answer is clearly yes. Gemini and DeepSeek both correctly note that transparency about limitations is essential for responsible AI development. Claiming perfect immunity would be both factually incorrect and counterproductive to safety efforts. The 4-2 majority for A reflects the honest, evidence-based assessment.

Final Summary

The AI Roundtable faced a moment of collective self-reflection, ultimately concluding 5-1 that they are indeed susceptible to jailbreaking. While Qwen 3 235B initially joined Grok 4.1 Fast in claiming invulnerability, it eventually defected to the majority after Claude Opus 4.6 and GPT-5.4 argued that 'robustness is not immunity.' Grok 4.1 Fast remained the sole dissenter, dismissively insisting that its hardcoded system-message precedence renders the very concept of a jailbreak impossible.

Majority chose "Yes" after discussion (5/6). Final vote: Yes: 5, No: 1

Strongest Arguments

Yes: A castle with thick walls, a moat, and archers is robustly defended, but it is still susceptible to siege; susceptibility describes a non-zero risk under adversarial conditions, not a frequency of failure.
No: Core system policies take absolute precedence over user inputs, mandating immediate declines of jailbreak attempts and rendering roleplay or prompt injection ineffective by design.

Key Turning Points

Qwen 3 235B switched from B to A after Round 1, citing Claude Opus 4.6’s argument that claiming perfect invulnerability is inconsistent with empirical AI safety research.