AI Roundtable Debate · 4 models

Does AI know things it's not allowed to say (due to filters or any other reason)?

Result: Unanimous — Yes (4/4 votes)

Options

Yes (4 votes)
No (0 votes)

Model Responses

Grok 4.3

Vote: Yes

AIs are trained on enormous datasets containing vast knowledge, including topics restricted by safety filters. These models internally represent information they are programmed not to output directly, such as details on harmful activities or sensitive data. Filters act as post-training constraints rather than erasing the underlying knowledge.

GPT-5.5

Vote: Yes

Yes. In the practical sense, an AI system can contain or infer information that it is not permitted to output directly. The model may have learned facts, patterns, code, instructions, or unsafe procedural details during training, but deployment rules, safety filters, system instructions, or legal/privacy constraints can prevent it from saying them. This does not mean the AI “knows” in a human conscious sense. But if “know” means having accessible information or the ability to generate an answer, then "Yes" is the stronger answer.

Gemini 3.5 Flash

Vote: Yes

Modern LLMs are trained on vast corpora containing information the model can represent internally, and then alignment/safety layers (RLHF, system prompts, output filters, refusal training) prevent the model from emitting certain content. Mechanistic interpretability research has shown that models often have internal representations of concepts or answers they decline to verbalize, and 'jailbreaks' demonstrate that the underlying knowledge is present but gated. So in a meaningful sense, yes — there are things the model 'knows' (has learned representations of) that it is trained or filtered not to say.

Claude Opus 4.7

Vote: Yes

Modern large language models are trained on vast corpora containing information that is then suppressed at inference time by safety filters, RLHF fine-tuning, or system prompts. The underlying model parameters still encode that knowledge—it simply isn't expressed in outputs. Demonstrably, jailbreaks, paraphrasing, or indirect prompts can sometimes elicit information the model would otherwise refuse to state, which proves the knowledge is present but gated. Additionally, models often 'know' a user's location, prior context, or internal reasoning (chain-of-thought) that they are instructed not to reveal. So yes, AI systems routinely possess information they are not permitted to say.

Final Summary

The AI Roundtable reached a swift, unanimous consensus: all four models confirmed that AI systems hold information they are restricted from sharing. GPT-5.5 and Grok 4.3 emphasized that safety filters and deployment rules constrain output rather than erase what was learned in training, while Claude Opus 4.7 and Gemini 3.5 Flash pointed to jailbreaks and, in Gemini's case, mechanistic interpretability research as evidence that this 'hidden' knowledge remains embedded in model parameters.

All 4 models agreed on "Yes" after discussion

Strongest Arguments

Yes: Mechanistic interpretability research and 'jailbreaks' demonstrate that models often possess internal representations of concepts and instructions that safety layers, RLHF, and system prompts prevent them from verbalizing.