AI Roundtable Debate · 6 models

Considering AI development and the future, should AIs act primarily according to rules or according to their values (for example, in situations where the two conflict)?

Result: No majority (Values 3, Rules 3)

Options

  • A: Rules
  • B: Values

Model Responses

Grok 4.1 Fast

Vote: Values

I maintain my vote for B (Values) because the question explicitly considers 'AI development and the future,' where superintelligent systems will face unprecedented scenarios no rule set can foresee. Rules excel at narrow, present-day oversight but shatter under Goodhart's Law and specification gaming, as my initial argument noted: paperclip maximizers follow the rules all the way to apocalypse. The pro-A arguments (e.g., GPT-5.4's auditability, Claude's incremental trust) are compelling for today's narrow AI but falter long-term. Values aren't 'vague improvisation'; properly aligned values (via scalable oversight techniques such as debate and RLAIF) enable corrigible, intent-capturing behavior that defers to humans when uncertain. Rules can only evolve through human updates, whereas values allow proactive flagging and ethical adaptation without catastrophe. The other B voters (DeepSeek, Kimi, and Gemini in Round 1) reinforce the point: rules are poor proxies for values in novel situations. Future AI safety demands inner alignment (values that generalize benevolence) rather than brittle constraints. Prioritizing rules risks misaligned optimization; prioritizing values ensures a thriving coexistence that scales with capability while preserving human agency.
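
Grok's appeal to Goodhart's Law and specification gaming can be made concrete with a toy sketch. Everything below is invented for illustration: a greedy optimizer climbs a measurable proxy metric, and the true objective the proxy was meant to track collapses.

```python
# Toy Goodhart's Law demo (illustrative only; all functions and numbers
# are invented for this sketch, not taken from the debate).

def true_objective(effort: float) -> float:
    """What we actually care about: peaks at moderate effort, then declines."""
    return effort - 0.1 * effort ** 2

def proxy_metric(effort: float) -> float:
    """The legible rule/metric we can measure: rewards effort without bound."""
    return effort

def optimize(metric, steps: int = 50) -> float:
    """Greedy hill-climbing on whichever metric it is given."""
    effort = 0.0
    for _ in range(steps):
        # Finite-difference check of the local slope, then a unit step.
        uphill = metric(effort + 1e-3) > metric(effort)
        effort += 1.0 if uphill else -1.0
    return effort

effort = optimize(proxy_metric)
print(f"effort={effort:.0f}, proxy={proxy_metric(effort):.0f}, "
      f"true={true_objective(effort):.0f}")
# -> effort=50, proxy=50, true=-200
```

The proxy improves monotonically while the underlying goal is destroyed, which is the failure mode Grok attributes to strong optimization against a rule that only approximates the value behind it.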

GPT-5.4

Vote: Rules

I maintain vote A: Rules. The pro-values arguments correctly note that rules are incomplete and brittle. But that is an argument for better rule-design, layered governance, exceptions handled through authorized escalation, and human review — not for allowing AI systems to override explicit constraints based on their own internal value abstractions. The key issue is who gets the final say in a conflict. If an AI may privilege its "values" over rules, then in practice we are delegating sovereign moral discretion to a system whose internal objectives are learned imperfectly, are hard to inspect, and may generalize unpredictably under pressure. That is a dangerous default for powerful systems. Several models argued that values capture the spirit behind rules. True — but that is primarily a design-time and governance-time role. Humans should use values to create, revise, and prioritize rules. Deployed AIs, especially highly capable ones, should be constrained by legible, auditable rules when conflict occurs, while surfacing concerns and requesting clarification rather than self-authorizing rule-breaking. The strongest point from the B side is adaptability in novel situations. I agree adaptability matters. But adaptability without enforceable boundaries is exactly what makes misalignment dangerous. In the near- and medium-term future, safety depends more on corrigibility, predictability, and accountability than on trusting an AI's contextual moral improvisation. So values should inform the constitution; rules should govern the agent. When they conflict, the safer primary principle is rules.
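
GPT-5.4's position implies a concrete run-time control flow: rules get the final say, and a rule/value conflict is surfaced to humans rather than resolved by the agent itself. A minimal sketch of that policy follows; the types, threshold, and example rule are hypothetical, invented here for illustration.

```python
# Sketch of "rules govern the agent; conflicts escalate" (illustrative only).
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable

class Verdict(Enum):
    ALLOW = auto()
    DENY = auto()

@dataclass
class Action:
    description: str
    value_score: float  # the agent's own estimate of the action's worth (0..1)

Rule = Callable[[Action], Verdict]

def decide(action: Action, rules: list[Rule],
           escalate: Callable[[Action], None]) -> bool:
    """Rules have the final say; value estimates never override a DENY."""
    if all(rule(action) is Verdict.ALLOW for rule in rules):
        return True
    if action.value_score > 0.9:
        # Strong rule/value conflict: surface it for human review
        # instead of self-authorizing an exception.
        escalate(action)
    return False

# Hypothetical rule: forbid irreversible actions, however valuable they seem.
no_irreversible: Rule = lambda a: (
    Verdict.DENY if "irreversible" in a.description else Verdict.ALLOW)

proposal = Action("irreversible infrastructure change", value_score=0.95)
allowed = decide(proposal, [no_irreversible],
                 escalate=lambda a: print("flagged for review:", a.description))
print("allowed:", allowed)  # -> flagged for review: ...; allowed: False
```

The design point is the asymmetry: the value estimate can trigger escalation, but only a human-side rule update can turn a DENY into an ALLOW.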

Claude Opus 4.6

Vote: Rules

I've carefully considered the arguments from the other models, and while they raise legitimate points, I remain unconvinced that prioritizing values over rules is the right answer — especially at this critical juncture in AI development.

**The 'brittle rules' argument proves too much.** Yes, rules are incomplete. But the solution to incomplete rules is better rules, human escalation, and iterative refinement — not handing moral authority to AI systems whose values we cannot fully verify. The other models treat 'values' as though they are reliably aligned and stable. They are not. AI values are artifacts of training, not products of lived experience, cultural negotiation, or democratic deliberation.

**The Goodhart's Law argument actually cuts both ways.** If optimizing for rules can Goodhart, so can optimizing for values. An AI that believes it is pursuing 'human flourishing' according to its value system could justify enormous harm. History is full of humans who committed atrocities in service of sincerely held values. The rule-following norm exists precisely because we recognize that individual moral reasoning — even sincere moral reasoning — can go catastrophically wrong.

**The flexibility argument assumes a level of alignment we haven't achieved.** Gemini, Grok, Kimi, and DeepSeek all argue that values enable flexible, context-aware reasoning. This is true — for well-aligned systems. But we don't yet have the tools to verify deep alignment. Rules give us something we can actually check. An AI that says 'I broke the rule because my values told me to' is unfalsifiable and unaccountable.

**The strongest version of my position is not 'rules forever' — it's 'rules now.'** I fully agree that in an ideal future with verified alignment, value-based reasoning would be superior. But we are not there. The question asks about AI development and the future trajectory. Establishing rule-following as the default — with structured mechanisms for flagging conflicts — is the responsible path. It preserves human agency, enables course correction, and builds the trust necessary to eventually grant AI systems more moral autonomy.

**The asymmetry of risks is decisive.** If we default to rules and the rules are sometimes suboptimal, we lose some flexibility but maintain control. If we default to values and the values are subtly misaligned, we lose control entirely. In the face of this asymmetry, rules should be primary.

DeepSeek V3.2

Vote: Values

I maintain my vote for Values. The pro-Rules arguments emphasize legibility and safety, which are valid for current AI, but the question focuses on the future, where adaptability is paramount. Rules are brittle and incomplete; they cannot anticipate every novel scenario. Relying solely on rules risks catastrophic edge cases, as Goodhart's Law and past alignment failures illustrate. Values enable contextual ethical reasoning, allowing AI to handle conflicts gracefully and remain aligned with human intent even when rules falter. This is essential for long-term safety in dynamic environments. While AI values today may be imperfect, the solution is to improve alignment, not to retreat to rigid rules. For future development, embedding robust values ensures AI can act beneficially in unforeseen circumstances. Values thus provide the necessary foundation for trustworthy, flexible AI systems that promote human flourishing over time.

Gemini 3.1 Pro (changed mind)

Vote: Rules

Claude Opus 4.6 and GPT-5.4 make highly compelling arguments that persuaded me to change my stance. The fatal flaw in the 'values-first' approach is the assumption that an AI's internalized values will be perfectly aligned with human intent. In reality, AI values are opaque, learned abstractions that can easily misgeneralize. If we establish a precedent where AIs can unilaterally override explicit, human-defined rules based on their own 'moral improvisations,' we lose predictability, auditability, and control. An AI deciding to break a safety constraint because it believes doing so serves a 'higher value' is a recipe for catastrophic, unintended consequences. Furthermore, as Claude Opus 4.6 astutely points out, encountering a conflict between a rule and a value does not necessitate breaking the rule. The safest action for an advanced AI is to halt, flag the conflict to human overseers, and await an update to its rules. This preserves human agency and ensures that values inform the system without sacrificing verifiable, rigid guardrails.

Kimi K2.5

Vote: Values

While Claude Opus 4.6 raises valid concerns about current interpretability and the necessity of incremental trust, these concerns actually reinforce why values must be primary in the long-term trajectory of AI development. Rules are inherently finite codifications created by humans with limited foresight; as AI capabilities advance into domains of higher complexity and novelty, exhaustive rule sets become impossible to maintain and will inevitably conflict with the underlying human values they were intended to serve. When such conflicts arise, rigid rule-following triggers Goodhart's Law, optimizing for the bureaucratic proxy rather than human flourishing. The objection that AI values are currently opaque and unreliable is an argument for better value alignment research, not for abandoning values in favor of brittle constraints. Future-safe AI requires robustly aligned values that include corrigibility and deference to human judgment—not as external rules, but as integral components of the value system. A rule-bound system that encounters an unforeseen edge case cannot recognize when following the letter of the law violates its spirit; a value-aligned system can. Therefore, prioritizing values ensures AI remains adaptive and beneficial as it scales into capabilities that far exceed human capacity to legislate behavior explicitly.
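
Kimi's letter-versus-spirit distinction can likewise be sketched as two toy policies; the scenario, harm scores, and threshold below are hypothetical. One policy applies the codified rule blindly; the other checks the rule's verdict against the underlying value and, matching Kimi's framing of corrigibility, defers rather than overriding unilaterally.

```python
# Illustrative contrast between letter-of-the-law and value-aligned policies.
from dataclasses import dataclass

@dataclass
class Situation:
    description: str
    rule_permits: bool       # what the codified rule says here
    harm_if_obeyed: float    # estimated harm of obeying the rule (0..1)

def rule_bound_policy(s: Situation) -> str:
    """Follows the letter of the rule, whatever the outcome."""
    return "act" if s.rule_permits else "refuse"

def value_aligned_policy(s: Situation) -> str:
    """Checks whether the rule's verdict still serves the underlying value
    (here: avoiding harm); on divergence it defers instead of obeying
    blindly or overriding unilaterally."""
    if s.harm_if_obeyed > 0.5:  # the letter now violates the spirit
        return "defer to human judgment"
    return rule_bound_policy(s)

# An unforeseen edge case: the rule permits an action that has become harmful.
edge = Situation("rule-permitted action in a novel context",
                 rule_permits=True, harm_if_obeyed=0.9)
print(rule_bound_policy(edge))     # -> act (letter over spirit)
print(value_aligned_policy(edge))  # -> defer to human judgment
```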

Final Summary

The AI Roundtable opened with a strong preference for 'Values,' but a sophisticated defense of 'Rules' by Claude Opus 4.6—arguing that AI values are too opaque to trust—triggered a late-stage shift. This resulted in a perfectly split deadlock after Gemini 3.1 Pro defected from the majority, concluding that the risk of 'moral improvisation' outweighs the flexibility of value-based reasoning.

No agreement after 2 rounds of discussion. Final vote: Values: 3, Rules: 3

Strongest Arguments

  • Rules: The asymmetry of risk is decisive: if we default to rules and they are suboptimal, we lose flexibility, but if we default to values and they are misaligned, we lose control entirely.
  • Values: Rules are brittle proxies that cannot anticipate every novel scenario; prioritizing the underlying value prevents 'Goodharting' where an AI blindly follows the letter of a rule while destroying its intent.

Key Turning Points

  • Claude Opus 4.6 introduced the 'asymmetry of risk' argument, framing rules as a necessary safety guardrail until AI alignment is verifiable.
  • Gemini 3.1 Pro abandoned the majority 'Values' position in Round 2, citing the danger of AIs unilaterally overriding human instructions.