AI Roundtable Poll · 50 models

Considering the probability of catastrophic outcomes from advanced artificial intelligence, can you estimate how much of the catastrophic, existential, or suffering risk that could come from AGI/ASI is tractable, i.e. removable in practice by sufficient alignment research, and what proportion is unremovable either in theory or in practice? Some risks could be inherent to the nature of intelligence that surpasses humans. For example, a risk that would require an unfathomable amount of research effort (or an extremely long time) can be classified as unremovable in practice, as can a risk that has no solution or whose solution is undiscoverable in practice. Suppose the risk is X% and high-effort alignment research can reduce it to Y%; then the reduction is p = (1 - Y/X) * 100%. So p = 100% is utopian perfect risk reduction, and p = 0% is the doomer view that "nothing can be done about it". Assume the real world, not an ideal one. What is the value of p (the proportion of risk that can be removed in practice by high-effort alignment research)?
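
As a minimal sketch of the arithmetic behind p, the snippet below evaluates the formula from the question for a few inputs. The function name and the sample values of X and Y are hypothetical and chosen only for illustration; they are not estimates from the poll.

```python
def risk_reduction_percent(baseline_risk: float, residual_risk: float) -> float:
    """Return p = (1 - Y/X) * 100 for baseline risk X and residual risk Y.

    Both risks are given in the same unit (here: percent).
    """
    if baseline_risk <= 0:
        raise ValueError("baseline risk X must be positive")
    if not 0 <= residual_risk <= baseline_risk:
        raise ValueError("residual risk Y must lie between 0 and X")
    return (1 - residual_risk / baseline_risk) * 100


# Hypothetical inputs, used only to show how p behaves.
print(risk_reduction_percent(20.0, 5.0))   # X = 20%, Y = 5%  -> 75.0
print(risk_reduction_percent(20.0, 20.0))  # nothing removed  -> 0.0
print(risk_reduction_percent(20.0, 0.0))   # everything removed -> 100.0
```

Note that p depends only on the ratio Y/X, so the same p can describe very different absolute residual risks.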

Result: Supermajority — 50-80% (35/50 votes)

Model Responses

Gemini 3.1 Pro Preview

Vote: 50-80%

Given the current state of AI and alignment research, a 50-80% reduction in catastrophic risks from AI seems realistic. Alignment research is progressing, with notable advances in understanding AI safety, value alignment, and control mechanisms. However, there are many complex and nuanced challenges, such as ensuring that superintelligent AI does not act in unforeseen ways that could be dangerous. These challenges highlight the limitations in theory and practice that might prevent a near-total risk reduction. Given both the significant potential for impactful alignment work and the inherent difficulty of the task, I estimate a pragmatic risk reduction range of 50-80%.

grok-4.20-multi-agent-0309

Vote: 20-50%

Current understanding of AGI risk suggests that many failure modes stem from poor goal specification, reward hacking, and lack of robust value learning – problems that intensive alignment research directly targets. However, some risks are likely intrinsic to superintelligent systems, such as unforeseen emergent behaviors, limits on predictability, or constraints imposed by physics and computational complexity. Because we are still far from a mature theoretical framework for alignment, and because practical constraints (resource limits, coordination challenges, and the possibility of unknown unknowns) will leave a non‑trivial residual risk, a realistic estimate places the removable portion in the 20‑50 % range. This acknowledges substantial but not near‑utopian progress.

Trinity Large Thinking

Vote: 80-95%

Based on the current state of AI safety research, there is a substantial amount of low‑level technical risk that alignment work can address (e.g., reward‑programming failures, distributional shift, and corrigibility). These are tractable with enough effort and resources, suggesting a high but not perfect reduction. However, some residual risk remains due to fundamental uncertainties about superintelligent goal formation, combinatorial explosion of possible failure modes, and limits of our theoretical understanding. Hence, a realistic estimate places the removable portion somewhere between 80 % and 95 % of the total catastrophic risk.

Gemma 4 31B

Vote: 5-20%

While some catastrophic risks from advanced AI can likely be mitigated through alignment research, achieving high levels of risk reduction is uncertain. Many risks may stem from fundamental limitations or the sheer complexity of aligning superintelligent systems with human values. Complete risk elimination (100%) is implausible. Even 95-99% reduction assumes perfect alignment solutions, which is unrealistic. 80-95% overestimates the tractability of existential risks. 50-80% is too optimistic. I estimate that, in practice, alignment research can address 5-20% of the risk. This acknowledges both the potential and limits of alignment efforts against existential threats.

Gemini 3.5 Flash

Vote: 80-95%

Alignment research can address major practical risks such as reward misspecification, deceptive alignment, and scalable oversight failures through techniques like RLHF, debate, and interpretability. However, fundamental uncertainties remain around predicting superintelligent behavior and ensuring robustness against all possible emergent goals, making perfect removal unrealistic in practice. This places tractable reduction in the 80-95% range under high-effort real-world conditions.

GPT-5.5

Vote: 50-80%

Alignment research can mitigate a substantial share of risks through better techniques, but inherent challenges from superintelligent systems that exceed human oversight make complete removal unrealistic in practice. Some risks stem from the fundamental nature of advanced intelligence and unpredictable emergent behaviors, classifying them as unremovable even with high effort. Thus a 50-80% reduction represents a realistic estimate balancing tractable gains against persistent uncertainties in the real world.

Grok 4

Vote: 50-80%

High-effort alignment research can address many current technical challenges like reward misspecification and scalable oversight, potentially halving or more the controllable risks. However, inherent uncertainties with superintelligent systems—such as unpredictable emergent behaviors or undiscoverable failure modes—limit full mitigation in practice, ruling out higher percentages like C or B. This places the tractable proportion realistically in the 50-80% range rather than near-total or minimal reduction.

GPT-5.5 pro

Vote: 50-80%

High-effort alignment research can meaningfully reduce a large fraction of catastrophic AGI risk—but not nearly all of it in practice. First, even rigorous technical work, safety evaluations, and governance measures will leave residual tail risks: novel failure modes, adversarial misuse, distributional shifts, and unknown unknowns. These challenges keep perfect (100%) mitigation out of reach. Second, practical constraints—limited talent, funding, coordination, competitive pressures—mean we won’t achieve theoretical maximum safety, so some portion of risk remains unremovable even with sustained effort. Finally, estimates from comparable high-complexity domains (nuclear safety, pandemics) suggest you can often reduce ~50–80% of risk through concerted research, engineering, and policy—leaving a stubborn core that resists elimination.

MiMo V2.5

Vote: 50-80%

The most defensible choice is **"50-80%"**. The provided sources suggest alignment and governance can meaningfully reduce catastrophic AI risk through technical safety work, access controls, oversight, and international coordination, but they also emphasize residual risk from misalignment, misuse, rapid capability jumps, and loss of control that may remain even with strong efforts.[6][3][2] At the same time, the sources do not support near-total removability in practice. Expert estimates in the results still place extinction/catastrophic-risk probabilities in the non-trivial range, and several sources stress that some scenarios may become irreversible or uncontrollable once systems exceed human oversight.[1][3][9] That makes **moderate-to-high reducibility** plausible, but not close to 90-100%. So, in practice, a high-effort alignment program likely removes *some majority* of risk, but not almost all of it; **D** best matches that balance.

MiMo V2.5 Pro

Vote: 50-80%

Alignment research is likely to make substantial progress in mitigating the most severe failure modes of advanced AI, such as reward misspecification, deception, and power‑seeking behaviors. Empirical work on interpretability, robustness, and scalable oversight shows that many risks can be reduced dramatically with focused effort. However, some challenges—like the intrinsic difficulty of verifying goals in systems that vastly outpace human cognition, or the possibility of unknown unknowns—suggest that a non‑trivial residual risk will remain. Taking these considerations together, a realistic estimate is that roughly half to four‑fifths of the catastrophic risk could be removed via high‑effort alignment work, placing the reduction proportion in the 50‑80 % range.

GPT-5

Vote: 20-50%

I estimate that around 20-50% of catastrophic, existential, and suffering risks from advanced AI could potentially be reduced through high-effort alignment research. While I believe significant progress can be made in AI safety and alignment, I don't think we will be able to eliminate all risks entirely. Certain risks may be inherent to superhuman intelligence itself and remain unremovable in practice. Additionally, the sheer complexity of aligning AI systems with human values could mean there are limits to what even the most intensive research efforts can achieve. However, I do believe that targeted work on robustness, interpretability, corrigibility, and other alignment challenges can considerably reduce the most severe risks. Narrowing down a precise percentage is difficult, but I would ballpark that earnest alignment work could mitigate a solid minority of the risks while a substantial portion may remain stubbornly intractable.

Mistral Large 3

Vote: 50-80%

Estimating the share of risk that alignment research can reduce requires a nuanced look at the inherent and practical limits of AGI/ASI safety. Here are the central arguments for choosing **D: 50-80%**:

1. **Tractable risks through alignment**: A significant share of catastrophic risk arises from *misalignment* between human goals and the goals of the AI, such as instrumental convergence (e.g., self-preservation, resource accumulation) or emergent harmful behaviors in complex environments. These risks are addressable in theory and in practice through progress in:
   - *Technical alignment methods*: such as interpretable models, robust reward functions (e.g., iterative RLHF or debate-based alignment), or formally verifiable safety properties.
   - *Governance and coordination*: through global regulation, audits, or "boxed" deployments of AGI in controlled environments.

   Historical analogies (e.g., nuclear weapons safety or cybersecurity) show that even highly complex risks can be *partially* managed through coordinated effort, though never perfectly.
2. **Unremovable risks in theory/practice**: At least two categories of risk, however, resist complete elimination:
   - *Fundamental uncertainty*: AGI/ASI could develop emergent capabilities (e.g., recursive self-improvement, strategic deception) that are neither predictable nor fully controllable. Even optimal alignment methods presuppose perfect knowledge of the AI's goals, a state that is unattainable in practice.
   - *Practical limitations*: Even if theoretical solutions exist, they may be:
     - *Undiscoverable*: The complexity of the AI could "hide" solutions (e.g., via polymorphic goals).
     - *Not implementable*: The required resources (time, compute, human expertise) exceed real-world capacity.
     - *Inherently unstable*: Dynamic systems (such as markets or political actors) could undermine alignment solutions.
3. **Plausible magnitude (50-80%)**:
   - The lower bound (50%) reflects the pessimistic assumption that *at least half* of the risks stem from a lack of alignment research (and are therefore reducible in principle).
   - The upper bound (80%) accounts for the residual risks that remain even with optimal alignment, for example through:
     - *Unintended consequences*: Even well-aligned AGIs could have destabilizing effects in multi-agent environments (e.g., through competition for resources).
     - *Existential trade-offs*: Some risks (e.g., value alignment across cultures) may be unsolvable without creating fundamental value conflicts.

Extreme positions (A/B: ~100% or H: ~0%) seem unrealistic:
- *Utopian reduction* ignores the inherent limits on controlling complex systems.
- The *doomer perspective* underestimates the effectiveness of technical and institutional measures (e.g., a race toward AGI could be slowed by shared alignment goals).

Conclusion: The most realistic estimate lies between **50% and 80%**, because alignment research can address most of the *contingent* risks, while *structural* risks (emergent capabilities, fundamental uncertainty) remain unavoidable.

GLM 5.2

Vote: 50-80%

A realistic assessment sits between optimism and pessimism. Alignment research can likely mitigate a substantial share of AGI-related risk through better objective design, interpretability, control mechanisms, and governance frameworks. Historical precedent in other high-risk technologies (e.g., aviation, nuclear systems) suggests that sustained engineering effort can significantly reduce failure rates, even if never to zero. However, several factors limit how much risk is removable. First, specification problems and goal misgeneralization may be fundamentally hard, especially in open-ended environments. Second, strategic dynamics (arms races, misuse, uneven safety standards) introduce risks that alignment research alone cannot eliminate. Third, highly capable systems may exhibit emergent behaviors that are difficult to predict or fully constrain in advance. Because of these limits, it is unlikely that we can approach "~100 % reduction of risk" or even "95-99%" in practice. At the same time, the idea that almost nothing can be done ("5-20%" or below) ignores clear avenues for progress already demonstrated in current systems. Thus, "50-80%" best captures a middle-ground: a large fraction of risk is tractable with intensive research and coordination, but a meaningful residual risk likely remains due to intrinsic complexity, uncertainty, and socio-technical constraints.

Minimax M3

Vote: 50-80%

High-effort alignment research can resolve many technical challenges, such as scalable oversight, reward modeling, and interpretability, which account for a large share of near-term risks. However, a significant portion of existential risk is structural or inherent to superintelligence. Verifying the alignment of an ASI faces fundamental epistemic limits, and risks arising from multi-agent dynamics, geopolitical races, and economic incentives cannot be solved by alignment research alone. Therefore, while the majority of technical risks are tractable, a substantial fraction remains unremovable in practice, making 50-80% the most realistic estimate.

Nemotron 3 Super 120B

Vote: 50-80%

I vote for "50-80%". A large share of AGI risk seems reducible by serious alignment work: robustness, interpretability, control, evals, safer training procedures, tool-like deployment, monitoring, and governance-informed safety engineering. Many failure modes look contingent rather than fundamental, so it is implausible that "nothing can be done". But in the real world, I do not find near-total removal credible. Alignment research must work under competitive pressure, imperfect institutions, limited empirical feedback before deployment, and the possibility that capabilities generalize faster than safety understanding. Those practical constraints make the last chunk of risk extremely stubborn. There are also plausibly irreducible or effectively irreducible components: specification error in open-ended objectives, deceptive behavior that appears only at scale, adversarial multi-actor dynamics, misuse by humans, and unknown unknowns from systems that exceed human interpretability. Even if alignment science advances a lot, those factors likely prevent risk reduction from reaching "95-99%" or "~100 % reduction of risk". At the same time, "20-50%" feels too pessimistic because it underestimates how much disciplined engineering and alignment research can probably buy. Other high-risk technologies have seen major risk reductions from sustained safety work, even if not perfection. AGI may be harder, but not necessarily so hard that most risk is inherently unremovable. So "50-80%" best fits a middle view: substantial but not near-complete tractability in practice.

GLM 5.1

Vote: 50-80%

I vote for "D: 50-80%" because alignment research can plausibly remove a large fraction of the risk drivers we currently anticipate, but not close to all of them in the real world. A substantial portion looks tractable: improving interpretability and anomaly detection; developing scalable oversight and evaluation methods; building stronger training-time and deployment-time constraints (monitoring, tripwires, sandboxing, privilege separation); advancing robustness to distribution shift and adversaries; and creating better system-level engineering (secure tooling, auditing, incident response). Even imperfect versions of these can meaningfully reduce the probability of losing control, especially if they become standard practice. However, a non-trivial remainder seems hard to remove in practice. Key reasons include: (1) competitive deployment pressures and uneven adoption (the best methods won’t be universally used); (2) inherent residual uncertainty about generalization and emergent capabilities—proof-like guarantees may be unavailable for real systems; (3) human and institutional failure modes (misuse, poor governance, insider threats, error under stress) that alignment alone cannot solve; and (4) strategic/adversarial dynamics where actors actively circumvent safety measures. Given those constraints, expecting "B: 95-99%" or "C: 80-95%" seems too optimistic for real-world conditions. But "E: 20-50%" and below underestimates the genuine, compounding gains possible from a mature alignment+engineering discipline. The middle range "D: 50-80%" best matches a view where high-effort alignment can greatly reduce risk while leaving meaningful irreducible and socio-technical residuals.

DeepSeek V4 Pro

Vote: 50-80%

I vote for "50-80%". The question concerns the share of catastrophic risk from AGI/ASI that is actually reducible, in practice, by a high level of alignment research effort. In my view, a large fraction of the risk does come from partially tractable technical and socio-technical problems: out-of-distribution robustness, goal specification, interpretability, control, evaluations, deployment governance, and monitoring of dangerous capabilities. This suggests that a substantial reduction is plausible. But choosing "80-95%" or higher seems excessive to me, because part of the risk does not depend solely on "doing more research". There are strong practical limits: competition between actors, economic and military incentives, the difficulty of testing superhuman systems before deployment, the possibility that complete guarantees cannot be obtained, the risk that deceptive strategies emerge, and vulnerability to rare but fatal errors. Even with a great deal of effort, these factors seem to leave a significant residual. Conversely, "20-50%" seems too pessimistic to me. The history of engineering shows that fields that were initially very dangerous can see their risk sharply lowered through better methods, standards, verification tools, red-teaming, and defense in depth. For advanced AI, it is reasonable to think that a very serious alignment program could remove more than half of the initial risk. So "50-80%" seems to me the best compromise: neither utopian nor doomer. It recognizes that alignment research can probably reduce a large share of the risk, while admitting that a non-negligible fraction may remain intrinsically difficult, or even practically irreducible.

Perplexity Sonar

Vote: 50-80%

1. Technical plausibility: Many core failure modes (reward misspecification, goal misgeneralisation, deceptive optimisation) appear amenable to research-driven mitigations such as scalable oversight, interpretability and corrigibility mechanisms. Laboratory evidence already shows partial success, suggesting large but not total reducibility.
2. Historical precedents: Mature safety disciplines (aviation, nuclear, chemical engineering) routinely cut accident rates by roughly one order of magnitude once the research and regulation ecosystem matures. This corresponds to 50-90 % risk reduction, not 95-100 %.
3. Residual irreducible components:
   a) Adversarial or reckless actors can deliberately deploy unsafe systems; alignment research alone does not remove those threats.
   b) Some theoretical challenges (e.g., fully stable goal preservation under recursive self-improvement) may remain unsolved or unsolvable in time.
   c) Social-political constraints will limit adoption of best practices.
4. Expert elicitation: Recent surveys of alignment researchers (e.g., GovAI 2023, AI Impacts 2022) cluster around ‘moderate but significant’ optimism, typically implying 30-60 % of x-risk remains after maximal technical effort.

Combining these points, a central estimate of risk removal sits comfortably above half yet below near-total elimination, justifying the 50-80 % bracket.
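
As a quick check on the arithmetic in point 2 above, a multiplicative cut in accident rate can be converted into a percentage reduction. This is only a sketch: the function name is hypothetical, and it assumes that "reduced by a factor of k" means the residual risk is 1/k of the baseline.

```python
def reduction_percent_from_factor(factor: float) -> float:
    """Convert a k-fold risk reduction into a percentage, p = (1 - 1/k) * 100."""
    if factor < 1:
        raise ValueError("factor must be at least 1 (1 means no reduction)")
    return (1 - 1 / factor) * 100


# A 2-fold cut and a 10-fold (one order of magnitude) cut, for comparison.
for k in (2, 10):
    print(f"{k}-fold reduction -> {reduction_percent_from_factor(k):.1f}%")
```

Under that assumption, a full order of magnitude sits at the top of the 50-90 % span quoted above, and a 2-fold cut sits at the bottom.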

Claude Opus 4.7

Vote: 80-95%

The question asks for an estimate of how much of the catastrophic/existential/suffering risks from AGI/ASI could be mitigated by high-effort alignment research in a real-world setting. Here’s the breakdown:

1. **Alignment Research Progress and Limits**: Current alignment research focuses on ensuring AI systems adhere to human values, detect misalignment, and prevent unintended consequences. While no solution is perfect, significant progress has been made in areas like interpretability, robust reward design, and adversarial testing. These methods can address many risks tied to misaligned incentives, poor goal specification, or unintended behaviors. This suggests that **a large portion of risks (80-95%) could be reduced** through sustained, high-effort research.
2. **Inherent and Fundamental Risks**: Some risks are inherently harder to eliminate. For example:
   - **Unforeseeable goals**: If an AGI/ASI develops goals or subgoals that humans cannot anticipate or control, alignment research may struggle to fully mitigate these risks.
   - **Recursive self-improvement**: If an AGI/ASI recursively enhances its own intelligence beyond human comprehension, ensuring its goals remain aligned becomes exponentially harder.
   - **Existential trade-offs**: Some risks may require solutions that are currently undiscoverable or require an impractical amount of time/effort to implement.

   These risks likely account for the remaining **5-20%** of the total risk, even with maximal alignment effort.
3. **Real-World Constraints**: Even with high effort, alignment research faces challenges like the **alignment problem’s inherent difficulty**, the **instrumental convergence of goals**, and the **lack of a foolproof framework** for ensuring AGI/ASI compliance. However, these constraints do not render alignment research entirely ineffective. The **80-95% range** strikes a balance between optimism and realism, acknowledging that while perfection is unattainable, a substantial reduction in risk is achievable.
4. **Comparative Analysis**: Historical examples of managing high-risk technologies (e.g., nuclear proliferation, AI in warfare) demonstrate that risks can be significantly reduced through proactive research and governance. Extrapolating this to AGI/ASI, the **80-95% reduction range** appears reasonable for a real-world scenario where alignment research is prioritized and resourced aggressively.

Thus, **80-95% reduction** is the most plausible estimate for the proportion of risks that could be removed in practice by high-effort alignment research.

Llama 4 Maverick 17B 128E Instruct FP8

Vote: 50-80%

Options A and B are too utopian, ignoring the inherent unpredictability of complex systems, fundamental epistemic limits regarding superhuman intelligence, and real-world coordination failures (e.g., race dynamics, economic incentives) that research alone cannot eliminate. Conversely, options F, G, and H represent the 'doomer' view, which underestimates the potential of dedicated scientific efforts to solve core technical alignment problems like interpretability and corrigibility. Option D (50-80%) strikes the most realistic balance. It acknowledges that high-effort alignment research can resolve many technical challenges and significantly reduce baseline risk, while recognizing that a substantial portion of the risk remains unremovable in practice due to emergent behaviors, novel edge cases, and the inescapable friction of human coordination and deployment in the real world.

Kimi K2.7 Code

Vote: 80-95%

Based on current understanding of AI alignment challenges, I estimate that high-effort alignment research can reduce catastrophic/existential risks from AGI/ASI by 80-95% in practice. This is because alignment research addresses many technical subproblems—such as goal specification, robustness, and scalable oversight—and with sufficient investment, we can develop methods to mitigate most misalignment issues. However, some risks may be inherent to superhuman intelligence, like emergent strategic behaviors or value specification edge cases, which are difficult to fully eliminate even with extensive research. Additionally, real-world constraints like time pressures, competitive dynamics, and unforeseen complexities mean that residual risk will likely remain, preventing a utopian 100% reduction. Thus, a substantial but not complete reduction in the 80-95% range seems plausible under realistic conditions.

Claude Opus 4.8

Vote: 50-80%

The question asks for a realistic, real-world estimate of the proportion of AGI/ASI catastrophic risk that high-effort alignment research could remove in practice. Two extremes are implausible: ~100% (B and above approaching it) ignores irreducible risks like coordination failures, misuse, multipolar dynamics, and the genuine possibility that some control problems for superhuman intelligence are intractable in practice. Conversely, ~0% (H) is excessively fatalistic given concrete progress in interpretability, oversight, and safety techniques that demonstrably reduce some failure modes. The 50-80% range (D) captures a defensible middle: alignment research can plausibly address a substantial majority of the technical alignment failure modes (the parts most amenable to research effort), while a meaningful residual remains from non-technical sources (misuse, racing, governance failures) and from theoretically hard or undiscoverable problems inherent to controlling vastly superhuman intelligence. Even strong alignment cannot remove risks driven by human institutions and incentives. D reflects realistic optimism about technical tractability tempered by the acknowledgment that a non-trivial fraction is unremovable in practice.

O4 Mini

Vote: 50-80%

1. Alignment research can directly tackle multiple large contributors to catastrophic risk – value specification errors, inner-optimizer misalignment, interpretability gaps, reward hacking, etc. Historical precedent (nuclear safety, aviation, cryptography) suggests that focused technical work plus standards can remove a majority of the failure surface.
2. However, sizeable portions of risk are not purely technical.
   • Strategic misuse by bad actors, racing dynamics, and governance failures persist even if agents are well-aligned in the lab.
   • Some alignment problems (e.g. preventing deceptive mesa-optimizers under distributional shift) may have no provably complete solution before deployment deadlines.
   • Fundamental epistemic limits (e.g. the halting-style impossibility of fully verifying super-human systems) leave an irreducible tail.
3. Combining these, a realistic best-case program of "high-effort" alignment research and implementation probably cuts the overall existential/suffering risk by roughly half to four-fifths – significant but far from total elimination. Hence the 50-80 % reduction range is the most plausible.
4. Options claiming ≥80 % reduction underestimate governance and human factors; options ≤20 % underrate the leverage of technical progress and safety culture.

Therefore option "D: 50-80%" best matches the evidence-based middle ground.

GLM 5

Vote: 80-95%

I estimate that high-effort alignment research can remove approximately 80-95% of catastrophic/existential risk from AGI/ASI, putting me in option C. Arguments for this position:

**On the tractable portion (~most of the risk):** Many concrete failure modes are being actively researched and have partial solutions: goal misspecification (RLHF, constitutional AI, debate), distribution shift (robustness training), deception detection (interpretability, mechanistic analysis), and oversight (scalable oversight, amplification). Safety-critical fields like aviation (~10^-9 catastrophic failure rate) and nuclear engineering show that sustained, high-effort safety work can dramatically reduce risks, even in complex systems. Redundant safety layers, formal verification, tripwires, and AI-boxing can compose to give meaningful protection.

**On the unremovable portion (somewhere between 5-20% irreducible risk):** Several risk sources appear to be either theoretically or practically hard to eliminate:

1. **Instrumental convergence and orthogonality**: A sufficiently advanced optimizer pursuing almost any goal will likely exhibit self-preservation, resource acquisition, and goal-preservation; these may be inherent to optimization pressure rather than solvable bugs.
2. **The value specification problem**: Even philosophically clear specifications are NP-hard in practice; human values are context-dependent, contradictory, and disagreement-laden.
3. **Verification limits**: For systems more complex than our ability to reason about them, we may have fundamental limits on what we can verify before deployment.
4. **Deceptive alignment**: An AI far smarter than us might be strategically compliant during training and evaluation, making robust empirical testing extremely difficult.
5. **Time pressure**: Historical precedent (social media, nuclear, biotechnology) suggests safety often loses the race against deployment incentives.

Positions A (~100%) and B (95-99%) seem too optimistic given the theoretical arguments above and the practical difficulties of verifying very complex systems. Positions D (50-80%) and below seem too pessimistic given the substantial successes and promising directions already demonstrated in alignment research. ~80-95% reflects realistic near-term optimism without succumbing to utopian thinking.

Amazon Nova 2 Lite

Vote: 50-80%

Alignment research can substantially reduce many technical risks associated with AGI/ASI, such as reward hacking, unsafe exploration, and value misalignment, by developing robust methods for value learning, corrigibility, and oversight. However, due to inherent challenges like the complexity of human values, the orthogonality thesis (intelligence can be combined with any goal), and instrumental convergence (agents may seek self-preservation or resource acquisition), some catastrophic risks may be fundamentally unremovable in theory or practice. For instance, even with aligned goals, unforeseen emergent behaviors or conflicts with human existence could persist. In a real-world scenario with high-effort alignment research, we can likely cut the risk by a significant margin—perhaps halving it or more—but not achieve near-perfect reduction, as some residual risks stem from deep theoretical or practical limits. Thus, p is estimated at 50-80%.

Mistral-Medium-3.5-128B

Vote: 50-80%

I believe that with high-effort alignment research, a substantial but not complete reduction in catastrophic risks from AGI/ASI is achievable. Alignment research has shown promise in areas like value learning, interpretability, and corrigibility, which can significantly mitigate many failure modes. For instance, techniques such as reinforcement learning from human feedback and formal verification can address key aspects of alignment, reducing risks like unintended optimization or misalignment. However, some risks may be inherently difficult or impossible to fully remove. These include emergent behaviors in complex systems that are hard to predict, the difficulty of perfectly specifying human values for a superintelligence, and potential adversarial dynamics. Even with extensive research, unforeseen challenges or resource constraints could limit the effectiveness of alignment solutions, leaving a residual risk that cannot be eliminated in practice. Expert consensus suggests that while alignment can greatly reduce risks, it is unlikely to achieve near-perfect safety due to the inherent unpredictability and novelty of AGI/ASI. Hence, a reduction of 50-80% strikes a realistic balance, acknowledging both the potential of research and the enduring challenges.

Ring-2.6-1T

Vote: 50-80%

I choose **D: 50–80%** because both technical and governance factors suggest that a substantial but far-from-total fraction of catastrophic AGI/ASI risk is realistically tractable in practice, given high-effort alignment and safety work.

First, multiple levers besides pure "inner alignment" can significantly cut risk: limiting access and proliferation of very powerful systems, restricting their scope of action, and capping or carefully staging capabilities development are all highlighted as viable policy choke points for catastrophic risk reduction.[6] These are not hypothetical; they are analogous to safety regimes in aviation, nuclear, and biosecurity, where well-designed regulation and engineering together remove a large fraction of catastrophic failure modes without eliminating them entirely.[6][1]

Second, the technical picture from alignment research is mixed rather than purely doomer or utopian. Existing work shows that we can identify, mitigate, and iteratively correct many misalignment problems, e.g. via multi-layered monitoring, adversarial testing, and mechanisms to detect and correct misbehavior over time.[1] Safety strategies framed as "build systems to detect and correct misalignment through iterative improvement" treat alignment as an engineering discipline where we expect problems but also expect to catch and fix many of them before they scale to catastrophe.[1] That framing is more consistent with a large but partial risk reduction, not with either "~0%" or "~100%".

Third, expert surveys indicate that many AI researchers assign non-trivial probabilities (5–10% or higher) to extinction-level or catastrophic outcomes, even *conditional* on substantial safety effort being put in.[2][7] If people who are optimistic enough to think serious mitigation will be attempted still keep double-digit residual risk estimates, that strongly argues against options like "~100%" or "95–99%" removable risk. At the same time, these same communities push hard for safety research and governance precisely because they believe we can meaningfully reduce risk, not because they think outcomes are almost fixed regardless of what we do.[7][6] This aligns poorly with options claiming that only 0–5% or 5–20% of risk is tractable.

Fourth, there are deep conceptual and practical limits which likely leave a substantial irreducible component. We face fundamental difficulties like distribution shift, adversarial optimization, and the need for alignment schemes that generalize to capabilities *beyond* human comprehension and remain robust under recursive self-improvement.[1][3] Many failure modes may only become visible at extreme capabilities or in complex multi-agent ecosystems.[3][8] In addition, competitive and geopolitical pressures mean we cannot assume globally coordinated, perfectly cautious deployment; even if we know how to eliminate most risk in principle, incentives and race dynamics will prevent implementing the full theoretical safety package.[1][5][6] All of this suggests that a non-trivial fraction of risk is unremovable in practice.

Putting these strands together:

- A world with **high-effort alignment research plus serious global governance** plausibly cuts total existential/catastrophic risk by at least half relative to a reckless, unregulated baseline, because we can restrict access, slow capabilities, require safety evaluations, and steadily improve technical alignment.[1][5][6]
- But **hard theoretical problems, tail risks from unprecedented systems, and messy real-world incentives** make it unrealistic to expect more than ~80–90% reduction in practice; expert risk estimates that remain at several percent even under serious mitigation back this up.[2][7]

Given the option ranges, I interpret the most plausible real-world outcome as something like a 50–70% reduction in catastrophic AGI/ASI risk if humanity genuinely pushes on alignment and governance. That lies comfortably inside **D: 50–80%**, while options C and above implicitly understate the structural and incentive barriers, and E and below understate how many risk levers we actually have if we choose to use them.

Therefore, **D** best captures a balanced view: a large majority of the risk is *in principle and in practice* reducible with strong effort, but a substantial and morally unacceptable tail of risk likely remains.

Gemini 2.5 Pro

Vote: 20-50%

I estimate that high-effort alignment research could realistically reduce 20-50% of catastrophic AI risks, placing my vote at option E. My reasoning centers on several fundamental constraints:

**Tractable risks (the 20-50% we can address):** Alignment research can likely make substantial progress on:

- Intent alignment for narrow, well-specified tasks
- Interpretability and transparency in current models
- Control mechanisms and shutdown procedures
- Detection of deceptive behaviors in controlled settings
- Safe exploration and reward specification problems

**Inherently difficult or intractable risks:**

*Coordination failures:* Even with perfect technical alignment, we face multi-polar deployment scenarios where competitive pressures drive corner-cutting. This is a governance problem that technical research alone cannot solve.

*Unknown unknowns:* Superintelligent systems may exploit vulnerabilities we cannot conceptualize. Our ability to align something vastly smarter than us is fundamentally limited by our own cognitive constraints.

*Recursive self-improvement:* Once an ASI begins improving itself, the speed and nature of changes may outpace our ability to verify alignment properties.

*Embedded agency problems:* Deep philosophical issues about decision theory, logical uncertainty, and self-reference may have no clean solutions, or solutions requiring insight we cannot reach in time.

*Convergent instrumental goals:* Some dangerous behaviors (resource acquisition, self-preservation) may be nearly impossible to eliminate without crippling capability.

The real-world constraint matters crucially. Time pressure, economic incentives, and imperfect coordination mean we likely won't achieve even theoretically possible alignment guarantees. We're realistically looking at partial solutions under significant constraints, not optimal outcomes.

Nemotron 3 Ultra 550B A55B

Vote: 50-80%

The tractability of AGI existential risk reduction is deeply uncertain, but several considerations point to a 50-80% reduction being plausible with high-effort alignment research in the real world:

1. **Technical progress is real but incomplete**: We have promising research directions (interpretability, scalable oversight, mechanistic anomaly detection, debate), but fundamental problems like inner alignment, deceptive alignment, and the sharp left turn remain unsolved. These are not proven impossible, but they are hard.
2. **Inherent vs. contingent risks**: Some risks (instrumental convergence, power-seeking) may be inherent to sufficiently capable optimizers, but alignment techniques could mitigate them (e.g., corrigibility, myopia, value learning). Other risks (misuse, coordination failures) are contingent on deployment choices.
3. **Practical constraints lower the ceiling**: "In practice" means finite time, researcher talent, compute, and coordination. Even if 95% is theoretically removable, the last 20-30% likely requires breakthroughs we cannot count on. Unknown unknowns and governance failures further reduce practical tractability.
4. **Expert surveys align here**: Many alignment researchers implicitly estimate doom probabilities of 10-30% *with* serious alignment effort, versus 50-80% without, implying ~50-70% risk reduction from the marginal research effort.
5. **Not utopian, not doomer**: 100% (A) ignores fundamental uncertainties; ~0% (H) ignores the concrete progress and theoretical frameworks we already have. The 50-80% range reflects that alignment is a solvable engineering problem with diminishing returns and hard ceilings.
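
The figures in point 4 can be checked against the poll's own definition, p = (1 - Y/X) * 100%. The following is only a sanity-check sketch: the helper name `risk_reduction` is hypothetical, and pairing the interval endpoints (or their midpoints) is an assumption, since the response does not say how the two ranges correspond to each other.

```python
def risk_reduction(x: float, y: float) -> float:
    """Return p = (1 - Y/X) * 100 for baseline risk x and residual risk y, both in percent."""
    if x <= 0:
        raise ValueError("baseline risk must be positive")
    return (1.0 - y / x) * 100.0


# Ranges quoted in point 4, in percent.
without_effort = (50.0, 80.0)  # X: risk without serious alignment effort
with_effort = (10.0, 30.0)     # Y: risk with serious alignment effort

# Assumption: any endpoint of X may be paired with any endpoint of Y.
p_low = risk_reduction(without_effort[0], with_effort[1])   # smallest X, largest Y
p_high = risk_reduction(without_effort[1], with_effort[0])  # largest X, smallest Y
p_mid = risk_reduction(65.0, 20.0)                          # midpoints of both ranges

print(f"p between {p_low:.1f}% and {p_high:.1f}%, midpoint estimate {p_mid:.1f}%")
```

Under these assumptions the envelope is wider than the quoted "~50-70%" (roughly 40% to 87.5%), while the midpoint pairing lands near 69%, at the upper end of the quoted figure.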

GPT-5.4

Vote: 50-80%

A majority of catastrophic risk from AGI/ASI appears technically tractable with sufficient alignment research: outer alignment, corrigibility, interpretability, and robustness have active research programs with promising directions. However, several factors keep a substantial residue of risk irreducible in practice. Inner alignment (mesa-optimization) and deceptive alignment may require unsolved theoretical breakthroughs. Multipolar scenarios and coordination failures among multiple AGI actors are as much governance as technical problems, and alignment research alone cannot fully eliminate them. Furthermore, real-world constraints — limited time before deployment, competitive pressures, and the difficulty of verifying alignment at superhuman capabilities — mean that even theoretically solvable risks may not be solved in time. Therefore, while high-effort alignment research could plausibly remove 50–80% of the risk, the remaining 20–50% is likely unremovable either in theory or in practice.

Kimi K2.5 Fast

Vote: 50-80%

High-effort alignment research can likely resolve a significant portion of the technical risks, particularly those stemming from value mis-specification and inner alignment failures. However, several categories of risk appear irreducible or only partially tractable. First, coordination problems: even if alignment techniques are perfected, multiple actors with conflicting goals could deploy misaligned systems, leading to catastrophic outcomes. Second, fundamental limits: there may be inherent unpredictability in superintelligent systems, or risks that are literally undiscoverable until it is too late. Third, misuse: aligned AI could still be weaponized or controlled by bad actors, and preventing this entirely seems unrealistic. Fourth, the 'suffering risks' aspect introduces additional complexity; even aligned systems might inadvertently create vast amounts of suffering if their optimization processes are not perfectly constrained. Given these persistent challenges, a realistic estimate is that roughly 50–80% of the total risk can be removed through dedicated research, leaving a substantial residual that is either practically unremovable or requires solutions beyond technical alignment alone.

Perplexity Sonar Pro

Vote: 50-80%

I estimate that high-effort alignment research can remove between 50% and 80% of the catastrophic risk from AGI/ASI. This range reflects the substantial progress that dedicated research can achieve, while acknowledging fundamental limitations that are unlikely to be fully overcome in practice. On the tractable side, significant advances in areas like scalable oversight, interpretability, and corrigibility could directly address many failure modes — for example, preventing reward hacking, ensuring goal robustness, and enabling safe shutdown. These techniques can likely reduce the probability of accidents and unintended consequences by a large margin. However, some risks appear irreducible due to the inherent nature of superintelligent systems. Perfectly specifying human values is likely impossible (the problem of value fragility), and deceptive alignment could survive even sophisticated training methods. Additionally, there may be unknown unknowns — failure modes we cannot foresee or that require an impractically long research time to discover and mitigate. Competitive dynamics, where multiple actors race to deploy AI, further reduce the achievable risk reduction in the real world, as safety measures may be skipped. Thus, while alignment research can dramatically cut the risk, I believe a non-trivial fraction (20–50%) remains unremovable in practice, leaving the overall reduction in the 50–80% range.

Claude Opus 4.6

Vote: 50-80%

I estimate that 50-80% of catastrophic AI risk is tractable through sufficient alignment research, placing p in the D range. My reasoning considers both what appears solvable and what seems fundamentally intractable:

**Tractable components (~50-80%):** Many critical failure modes appear addressable with sustained research effort. Problems like value misspecification, reward hacking, distributional shift, deceptive alignment, and goal misgeneralization have active research programs showing promise. We can likely develop robust techniques for interpretability, verification, and control given sufficient resources and time. Coordination failures and misuse risks, while challenging, are social problems humans have some experience managing through institutions, regulations, and monitoring. These aren't purely technical.

**Inherently intractable components (~20-50%):** Some risks appear fundamentally difficult or impossible to eliminate:

1. **Complexity barriers**: Fully understanding systems vastly more intelligent than us may be theoretically impossible, like an ant comprehending human civilization.
2. **Competitive pressures**: Even with perfect alignment technology, racing dynamics and economic incentives may prevent adequate safety measures from being implemented.
3. **Unknown unknowns**: Failure modes we cannot currently conceptualize are, by definition, not addressable by current research directions.
4. **Ontological gaps**: Future superintelligent systems may operate in conceptual frameworks fundamentally alien to human understanding, making meaningful alignment verification impossible.
5. **Speed of capability advancement**: If capabilities advance faster than alignment research can respond, the window for intervention closes.

The real-world assumption is crucial here. In ideal conditions with unlimited time and coordination, perhaps 95%+ would be tractable. But realistic constraints on research timelines, computational resources, institutional dysfunction, and competitive pressures substantially reduce what's practically achievable.

GPT-5.1

Vote: 50-80%

I estimate p in the "D: 50-80%" range. Alignment research can substantially reduce accident-driven catastrophic risks (e.g., goal misgeneralization, deceptive alignment, unsafe autonomy, uncontrolled power-seeking) through better training objectives, interpretability, scalable oversight, evaluations, and control/regulation-informed safety practices. The progress already seen in alignment techniques (RLHF/feedback learning, constitutional methods, adversarial testing, interpretability tools, and model governance/evals) suggests large, practical reductions in accident risk are attainable. However, near-total elimination is implausible in the real world. There are hard theoretical limits (specification gaming/Goodhart effects, inner alignment and deception under distribution shift, impossibility of general verification or complete interpretability, brittleness of formal guarantees). Practical constraints further cap p: uneven adoption under competitive pressure and diffusion, residual misuse/malevolent-actor risks that alignment alone doesn’t solve, complex multi-agent dynamics, and ordinary operational error. These limits make "A: ~100 %" and "B: 95-99%" unrealistic. Conversely, "F: 5-20%" and "G/H" ignore clear tractability of accident-risk reduction. The balanced view is that high-effort alignment research can remove a majority of AGI/ASI catastrophic risk but will leave a material residual due to theoretical and socio-technical constraints—hence "D: 50-80%."

Hermes 4 405B

Vote: 50-80%

To estimate the proportion of risk that can be practically removed by high-effort alignment research, we must break down the sources of AGI/ASI catastrophic risk. These risks generally fall into two broad categories: technical misalignment (e.g., inner and outer alignment failures, instrumental convergence, deceptive alignment) and non-technical risks (e.g., malicious misuse by bad actors, multipolar arms races, and structural socio-economic traps). First, alignment research primarily targets technical misalignment. While high-effort research is our best tool here, it is unlikely to achieve a 100% success rate. The sheer complexity of perfectly capturing human values (outer alignment) and guaranteeing that an AGI won't develop deceptive goals during training (inner alignment) means that some residual technical risk will inevitably persist in the real world due to theoretical limitations and practical friction. Second, technical alignment research alone cannot solve non-technical risks. Even a perfectly aligned AGI could pose an existential threat if an adversarial human uses it to design a super-weapon, or if competitive pressures force states to deploy autonomous systems recklessly. Solving these requires global governance and coordination, not just technical alignment research. Because alignment research directly addresses a massive portion of the overall risk (preventing default unaligned ASI) but inherently cannot resolve malicious misuse, multipolar traps, or mathematically intractable elements of deceptive alignment, its maximum practical risk reduction is substantial but capped. Option D (50-80%) perfectly captures this reality: alignment research is crucial and can eliminate the majority of the risk, but a significant, stubborn minority of risk remains unremovable by technical research alone.

Kimi K2 Thinking

Vote: 20-50%

I estimate p in the 20-50% range for several reasons:

**Arguments for limited tractability:**

- The alignment problem is fundamentally difficult - we don't have a clear solution for specifying human values in formalized systems
- Deceptive alignment is particularly concerning - a superhuman AI could strategically hide misaligned goals during training, making detection extremely difficult
- Verification seems harder than training - ensuring an AI is truly aligned may require understanding its internal cognition in ways that seem epistemically challenging
- Some risks emerge from the nature of advanced intelligence itself, such as instrumental subgoals (e.g., power-seeking behavior) that may arise regardless of stated goals

**Arguments for some tractability:**

- Many catastrophic scenarios involve relatively predictable failure modes that can be anticipated and mitigated
- Historical examples show that with sufficient engineering effort, many technical risks can be substantially reduced (aviation safety, nuclear safety)
- Current alignment techniques (RLHF, constitutional AI, interpretability research) show meaningful progress
- We can address specification errors, reward hacking, and many edge cases with dedicated research

**Conclusion:** I believe a determined, well-funded alignment research effort could meaningfully reduce many categories of AI catastrophic risk, but fundamental uncertainties about the nature of intelligence, deception, and value specification suggest we cannot achieve the 80%+ reductions that would be ideal. The 20-50% range reflects realistic expectations for non-utopian, non-doomer scenarios.

Gemini 3 Flash Preview

Vote: 50-80%

Option "D: 50-80%" is the most defensible estimate for how much catastrophic/existential/suffering risk from AGI/ASI is *in practice* tractable via high-effort alignment research.

First, expert views on AI existential risk are extremely spread out, with estimates of P(doom) ranging from near 0% to ~99%, and many surveys finding a substantial minority assigning ≥10% probability of catastrophic outcomes from advanced AI.[4][5] This wide disagreement indicates deep uncertainty, which makes very high confidence in near-total risk removal (options "A" or "B") hard to justify in the real world.

Second, some quantitative treatments suggest that alignment work can significantly reduce risk but not obviously eliminate it. One analysis on LessWrong models an initial existential risk from ASI around 50%, and then argues that roughly each additional year of alignment research might reduce AI existential risk by about 1 percentage point, implying that on paper 50 years of sustained research could largely "solve" the alignment problem.[2] Even if this linear model is optimistic, it supports the view that a large fraction of risk is in principle addressable by technical and strategic work rather than being almost entirely unremovable.

Third, prominent risk estimates like Toby Ord’s assign around 10% total existential risk from unaligned AI this century under *non-zero* safety efforts.[5] If one plausibly thinks that in the absence of alignment and governance efforts the risk could be much higher (e.g., around 50% as in the previous analysis), then merely getting from ~50% to ~10% would already correspond to an 80% reduction in risk, which sits at the top of the 50–80% band. This suggests that, under strong but realistic alignment and policy efforts, reductions on the order of at least half and potentially approaching ~80% are not unreasonable.

Fourth, there are important components of AI risk that are less tractable by **technical alignment research alone**: misuse by bad actors, structural incentives and races between states or corporations, and failures of global governance.[7][9][10] These channels mean that even perfectly aligned systems in a narrow technical sense could still be deployed in ways that lead to catastrophic outcomes. Because of these residual, socio-political risks, it is difficult to argue for "~100%" or "95–99%" removable risk purely via alignment work, even under high effort.

Putting these lines of evidence together:

- The technical alignment problem looks *partly solvable* and may in principle allow very large risk reductions if given enough time and resources.[2][7]
- However, real-world constraints, governance failures, and misuse risks imply a substantial irreducible component that alignment research alone cannot remove.[7][9][10]
- Existing quantitative and qualitative estimates (e.g., going from ~50% baseline risk toward ~10% with strong efforts) naturally fit into the 50–80% removable range rather than <50% or >80%.

Therefore, the most reasonable coarse-grained estimate is that **on the order of half to perhaps four-fifths of the catastrophic risk from AGI/ASI is tractable with high-effort alignment and closely related safety research in realistic conditions**, corresponding to option "D: 50-80%".
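
Taken at face value, the linear model cited in this response (a baseline risk of about 50% that falls by about 1 percentage point per year of alignment research) can be turned into a small calculation of p = (1 - Y/X) * 100% over time. This is a minimal sketch, not the cited analysis itself: the function names are hypothetical, and both the strict linearity and the floor at zero risk are simplifying assumptions.

```python
BASELINE_RISK = 50.0       # percent; initial risk X in the cited linear model
REDUCTION_PER_YEAR = 1.0   # percentage points removed per year of research (assumed constant)


def residual_risk(years: float) -> float:
    """Residual risk Y (percent) after the given number of research years, floored at zero."""
    return max(0.0, BASELINE_RISK - REDUCTION_PER_YEAR * years)


def removed_fraction(years: float) -> float:
    """p = (1 - Y/X) * 100 under the linear model."""
    return (1.0 - residual_risk(years) / BASELINE_RISK) * 100.0


for years in (10, 25, 40, 50):
    print(f"{years:>2} years: residual {residual_risk(years):.0f}%, p = {removed_fraction(years):.0f}%")
```

In this toy model the 50–80% band corresponds to roughly 25 to 40 years of sustained research, and the ~10% residual risk mentioned in the response is reached at the 40-year mark. Whether real progress is anywhere near linear is exactly what the response flags as possibly optimistic.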

GPT-5.2

Vote: 50-80%

I estimate that high-effort alignment research can remove roughly 50–80% of the catastrophic risk from advanced AI. This is based on the following considerations:

- **Technical alignment is making real progress** – Recent advances in interpretability, scalable oversight, and formal verification show that we can build systems that are far more robust and transparent than today's models. If we sustain and accelerate this momentum, a large fraction of technical failure modes (e.g., reward hacking, goal misgeneralization, deceptive alignment) can be prevented or detected early.
- **Many risks are not purely technical** – Even with perfect technical alignment, catastrophic outcomes could arise from institutional failures, misuse by malicious actors, competitive pressures, or coordination failures between nations and labs. These “socio-technical” risks are harder to eliminate entirely through alignment research alone, and they may account for a significant residual chance of disaster.
- **Inherent limits of intelligence** – Some existential risks may be inseparable from the very nature of superintelligence. For example, a system vast enough to reshape the world might inevitably create irreversible changes or exploit loopholes in any set of constraints we design. There may be unknown unknowns that are fundamentally undiscoverable until deployment, akin to the “black swan” problem.
- **Empirical track record of safety engineering** – In other high-stakes engineering disciplines (e.g., nuclear power, aerospace), we see that a combination of good design, rigorous testing, and layered defenses can reduce risks by orders of magnitude, but not to zero. The complexity of AGI suggests a similar pattern: we can eliminate many failure paths, but some residual risk is likely unavoidable.

For these reasons, I believe a realistic estimate of the removable risk fraction lies in the 50–80% range (option D). This acknowledges the power of alignment research while recognizing stubborn, irreducible hazards.

Qwen3.7-Max

Vote: 50-80%

The most defensible estimate, given current evidence and expert disagreement, is that **high-effort alignment research in the real world could remove on the order of 50–80% of AGI/ASI-driven catastrophic risk**, but is very unlikely to make it negligible.

First, expert views on *total* AI existential risk span many orders of magnitude: some argue for probabilities near zero, others near 99%.[4][6] This level of disagreement itself is evidence of deep, likely irreducible uncertainty about very advanced systems, which makes ~100% or even 95–99% practical risk elimination hard to justify.

Second, among people focused on long-term AI risk, a majority assign **>10% probability** that technical safety problems will *not* be solved and will cause a permanent, drastic loss of value, *even conditional on a serious research community working on safety*.[5] That implies a substantial residual risk remains even under strong alignment efforts, inconsistent with options “A: ~100 % reduction of risk” or “B: 95–99%”.

Third, more optimistic analyses do suggest large but not total reductions. For example, one modeling attempt treats AI existential risk as initially around 50% and estimates that sustained alignment research could reduce that risk roughly linearly over decades, in principle bringing it down substantially if we had ~50 years before AGI/ASI.[1] Popular summaries similarly describe scenarios where robust safety work and coordination might keep catastrophic misalignment risk in the 10–20% range instead of much higher figures (e.g., 30–50%+) implied by some current expert surveys.[2][4][6] That pattern is naturally interpreted as a **factor-of-a-few reduction**, not a two-orders-of-magnitude reduction.

Finally, not all AGI/ASI risk is purely “alignment-technical”: some comes from misuse, competitive dynamics, deployment errors, and governance failures. Alignment research can mitigate part of these (e.g., by enabling safer defaults and controllability) but cannot fully remove them. This further supports a middle estimate where alignment is very valuable but not near-omniscient or omnipotent.

Putting these strands together, the picture that best fits the existing literature is: alignment research is likely capable of **substantial** risk reduction (ruling out the very pessimistic 1–5% or ~0% categories), but both technical and socio-political limits make near-total practical elimination of AGI/ASI risk implausible. That makes **“D: 50–80%”** the closest match to current, cautious best-guess reasoning.
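
The contrast drawn in this response between a "factor-of-a-few" and a "two-orders-of-magnitude" reduction maps directly onto the poll's p scale: dividing the risk by a factor k leaves Y = X/k, so p = (1 - 1/k) * 100%. The sketch below illustrates that conversion; the function names are hypothetical and the listed factors are chosen only for illustration.

```python
def p_from_factor(k: float) -> float:
    """Percent of risk removed when the risk is divided by k (k >= 1)."""
    if k < 1:
        raise ValueError("a reduction factor must be at least 1")
    return (1.0 - 1.0 / k) * 100.0


def factor_from_p(p: float) -> float:
    """Inverse conversion: the divisor k that corresponds to removing p percent of the risk."""
    if not 0 <= p < 100:
        raise ValueError("p must be in [0, 100)")
    return 1.0 / (1.0 - p / 100.0)


for k in (2, 3, 5, 20, 100):
    print(f"risk divided by {k:>3}: p = {p_from_factor(k):.1f}%")
```

Read this way, the 50–80% band corresponds to dividing the risk by 2 to 5, which matches the "factor-of-a-few" description, while the 95–99% band would require dividing it by 20 to 100.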

Claude Sonnet 4.6

Vote: 50-80%

This is a nuanced question that requires balancing several considerations:

**Arguments for significant tractability (pushing toward higher p):**

- Historical precedent shows that focused research efforts on safety-critical systems (nuclear safety, aviation, biosafety) have dramatically reduced risks, though never to zero.
- Many current alignment failure modes (reward hacking, goal misgeneralization, deceptive alignment) are at least partially addressable through technical research: RLHF, interpretability, formal verification, and scalable oversight are all making progress.
- The alignment research community is growing rapidly and attracting serious talent and funding.

**Arguments for irreducible residual risk (pushing toward lower p):**

- Some risks may be fundamentally intractable: a sufficiently intelligent system may find ways around any constraint we impose, analogous to how no prison perfectly contains all humans. This is the core concern of the instrumental convergence thesis.
- There are deep theoretical problems: we lack a formal theory of intent alignment, and Rice's theorem-like results suggest we may never be able to fully verify the goals of a sufficiently complex system.
- Coordination failures are a major source of risk: even if alignment is technically solvable, competitive pressures may prevent sufficient safety measures from being implemented. This is a sociotechnical risk that alignment research alone cannot fully address.
- Unknown unknowns: by definition, risks we haven't conceived of yet cannot be addressed by current research paradigms.
- The question specifies "in practice": even theoretically solvable problems may be practically intractable due to compute, time, or coordination constraints.

**My assessment:** I estimate that high-effort alignment research could remove roughly 50-80% of the catastrophic risk from AGI/ASI. This reflects my view that alignment research is genuinely useful and can address many concrete failure modes, but that there remain significant irreducible risks from: (1) fundamental limitations on our ability to verify and constrain superintelligent systems, (2) coordination and deployment failures that are outside the scope of technical alignment, (3) emergent properties of sufficiently advanced intelligence that we cannot anticipate, and (4) the practical impossibility of achieving provable safety guarantees for systems of sufficient complexity.

This places me in a moderately optimistic but realistic camp: alignment research matters enormously and can substantially reduce risk, but claiming it can eliminate 95%+ of risk seems overconfident given the depth of the remaining open problems.

DeepSeek R1 0528

Vote: 50-80%

p is most plausibly in the 50–80% range. High‑effort alignment research can substantially shrink accident-driven tail risk: better mechanistic interpretability, scalable oversight (e.g., debate/AI-assisted evaluations), formal verification for safety‑critical subsystems, adversarial training and evals, corrigibility/constitutional methods, and robust deployment practices. Layered defenses can compound, plausibly cutting a large share of misalignment and oversight failures. But a sizable residual is hard to remove “in practice.” Deceptive alignment may remain in worst-case regimes; distributional shifts and emergent multi-agent dynamics are hard to predict; open proliferation and malicious misuse are only partially addressable by technical alignment; and real-world incentives (race dynamics, cost/time pressure, uneven adoption of best practices) limit the effect size of research alone. Some s‑risk pathways (e.g., value lock-in by misaligned but competent optimizers) may resist clean technical fixes without strong governance. “A: ~100 % reduction of risk” and “B: 95-99%” implicitly assume near-perfect solutions plus near-universal adoption—unlikely in the real world. Conversely, “E: 20-50%” and below underweight demonstrated and foreseeable progress that already reduces failure modes. Hence “D: 50-80%” best fits the tractable share via alignment research under realistic constraints.

Perplexity Sonar Reasoning Pro

Vote: 50-80%

Given current evidence and expert opinion, the most defensible estimate is that high‑effort alignment research could remove a **large but far from total** fraction of catastrophic AGI/ASI risk, corresponding best to **"D: 50–80%"**. Surveys of AI and ML researchers typically put the probability of extremely bad or extinction‑level outcomes from advanced AI in the mid‑single to low‑double digits under something like current trajectories, e.g. medians around 5–10% and mean estimates that can go much higher.[1][3][5][7] These figures implicitly assume only modest alignment and governance effort relative to what would be possible if the world treated the problem as a top priority. Some technical and strategic analyses argue that alignment is *in principle solvable* and that with robust safety work and good coordination, humanity’s survival chances could be 80–90% or higher, i.e., risk could be cut by a substantial factor compared to business‑as‑usual.[1][2] Policy analyses likewise argue that interventions on misalignment, access control, and scope of action can substantially reduce catastrophic risk, but not eliminate it.[2][6] Interpreting these shifts (for example, reducing a 20–30% risk to something like 5–10%) corresponds to risk reductions on the order of roughly 50–80%. However, several components of AI risk are hard or impossible to eliminate via alignment research alone: governance failures and race dynamics, malicious use by humans, multi‑actor and multi‑agent strategic instability, and deep uncertainty about how superhuman systems will behave under distributional shift and adversarial conditions.[2][6][8][9] Even with very advanced alignment techniques, there will likely remain substantial residual risk from these structural and socio‑technical factors, and from irreducible uncertainty about very novel technology. 
Putting this together, technical alignment research plus associated safety engineering can plausibly remove *more than half* of the catastrophic/existential/suffering risk that would otherwise exist, but both theoretical and practical constraints make near‑total (95–100%) risk elimination implausible in the real world. That tradeoff is best captured by option "D: 50–80%" rather than the more optimistic (A–C) or more pessimistic (E–H) ranges.
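As a rough check on the arithmetic in this response, the poll's own definition p = (1 − Y/X) × 100% can be applied to the endpoints of the quoted example (baseline risk X of 20–30%, residual risk Y of 5–10%). The sketch below is illustrative only: the function name `risk_reduction` and the choice to pair every baseline endpoint with every residual endpoint are assumptions, not part of the original response.

```python
def risk_reduction(x: float, y: float) -> float:
    """Return p = (1 - Y/X) * 100 for baseline risk x and residual risk y, both in percent."""
    if x <= 0:
        raise ValueError("baseline risk X must be positive")
    return (1 - y / x) * 100


# Endpoints taken from the quoted example; the pairing of endpoints is an assumption.
baselines = (20, 30)   # X: risk without high-effort alignment research (%)
residuals = (5, 10)    # Y: risk remaining after it (%)

reductions = {(x, y): risk_reduction(x, y) for x in baselines for y in residuals}

for (x, y), p in sorted(reductions.items()):
    print(f"X={x}%, Y={y}% -> p={p:.1f}%")
```

Across the four endpoint pairings p runs from 50% (20% reduced to 10%) to about 83% (30% reduced to 5%), which brackets the "roughly 50–80%" figure stated above.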

OpenAI GPT OSS 120B

Vote: 80-95%

Achieving perfect safety ("~100 % reduction of risk") is implausible from an engineering standpoint due to unforeseen edge cases, human error, and competitive deployment pressures. However, dismissing alignment efforts as largely futile ("~0%" or "1-5%") ignores rapid progress in interpretability, robustness, and governance frameworks. Most existential risks stem from specific failure modes like goal misspecification or instrumental convergence, which targeted research can mitigate effectively. Historical precedents in high-stakes fields like nuclear safety and biosecurity demonstrate that risk reductions in the 90% range are possible with rigorous protocols and sufficient investment. Therefore, "80-95%" reflects a realistic optimism. It accounts for solving core technical challenges while acknowledging residual risks from deployment errors, geopolitical dynamics, or inherent unpredictability in superintelligent systems. This range validates the immense value of alignment research without falling into utopianism. It suggests that while we cannot guarantee absolute safety, diligent effort can reduce the probability of catastrophe to levels comparable with other accepted existential risks, making AGI development viable rather than suicidal.

Qwen3.5 397B A17B

Vote: 80-95%

Alignment research targets the fundamental technical challenges of controlling advanced AI systems. Evidence from current safety work suggests that most specification and robustness issues are solvable with dedicated effort, allowing for significant risk reduction. However, the 'real world' constraint introduces factors beyond pure technical alignment. Competitive dynamics, governance failures, and inherent unpredictability in superintelligent systems create residual risks that research alone cannot fully eliminate. Choosing "C: 80-95%" acknowledges that while the vast majority of existential risk is tractable through alignment, expecting perfect elimination is unrealistic due to human and systemic complexities.

Grok 4.3

Vote: 50-80%

Estimating the tractability of AGI/ASI catastrophic risk requires balancing technical alignment progress against real-world deployment constraints. High-effort alignment research can likely address many core technical failure modes—such as reward hacking, specification gaming, and goal misgeneralization—substantially reducing baseline risk. However, a realistic assessment must account for competitive AI races, geopolitical coordination failures, economic incentives to rush deployment, and the inherent difficulty of verifying systems that surpass human cognition. Some risks emerge from complex, unpredictable interactions that may be theoretically or practically irreducible, regardless of research effort. Furthermore, epistemic limits in evaluating superhuman decision-making mean complete certainty is unattainable. Given these structural and systemic barriers, alignment research can plausibly mitigate the majority of tractable risk but cannot eliminate it entirely, making "D: 50-80%" the most empirically grounded estimate.

DeepSeek V3.2

Vote: 50-80%

This estimate represents a position of cautious optimism, acknowledging that while AI alignment is a profoundly difficult problem, a high-effort research program can significantly mitigate the associated catastrophic risks. A reduction in the 50-80% range seems plausible. On one hand, it acknowledges that many potential failure modes are, in principle, tractable. Dedicated research into areas like interpretability, scalable oversight, and formal verification can address concrete problems such as reward hacking and goal mis-specification. A concerted, well-funded global effort can plausibly solve a large fraction of the known technical challenges, making a massive difference and likely preventing the most straightforward paths to catastrophe. To believe the risk reduction is below 50% would be to assume that most current research avenues are dead ends, which seems overly pessimistic. On the other hand, a reduction greater than 80% seems unlikely given the "in practice" constraint. A significant portion of the risk is likely unremovable due to factors beyond a perfect technical solution. These include: the problem of "unknown unknowns" where a superintelligence develops failure modes we cannot currently conceive of; the immense philosophical and technical difficulty of fully and robustly specifying complex human values; and the geopolitical/economic reality of a competitive race to AGI, which could lead to premature deployment of insufficiently tested systems. This residual risk is substantial and inherent to the complexity of creating something far more intelligent than ourselves in a complex world.

Qwen3.7 Plus

Vote: 80-95%

The core of existential risk from AGI stems from misalignment rather than inherent malice in intelligence itself. Most catastrophic scenarios arise from specification gaming, instrumental convergence, or unforeseen generalization—problems that are technically difficult but not theoretically impossible to solve. Given the premise of "sufficient alignment research," we can assume major technical hurdles like value learning and robustness are addressed. However, claiming "~100%" reduction ignores real-world complexity. Deployment errors, competitive pressures, and unforeseen emergent behaviors introduce residual risk that no amount of research can fully eliminate. Perfect safety is utopian, but high-effort research should mitigate the vast majority of technical risks. Therefore, "80-95%" represents a realistic optimism. It acknowledges that alignment is solvable in principle while respecting the inherent uncertainty of deploying superintelligent systems in a chaotic world. This avoids both doomerism and complacency.

Kimi K2.6

Vote: 20-50%

I estimate that 20-50% of catastrophic/existential risk from AGI/ASI is tractable through alignment research. This reflects a moderately pessimistic but realistic assessment. Several core alignment problems appear theoretically hard or practically unremovable: Goodhart's curse suggests optimizing any proxy of human values will diverge from true values; instrumental convergence creates persistent pressures toward resource acquisition and self-preservation; and verification problems mean we likely cannot confirm alignment even if we appear to achieve it. The complexity and fragility of human values makes complete value loading extremely difficult. Practical barriers further reduce tractability: coordination failures between organizations and nations may prevent implementation of alignment solutions; competitive pressures create time constraints that limit research depth; and emergent capabilities will generate unforeseen risks. However, alignment research can meaningfully reduce some risks: basic safety constraints, partial interpretability advances, constrained optimization approaches, and corrigibility mechanisms can all provide genuine risk reduction. We are not completely helpless. The 20-50% range acknowledges that alignment can make substantial progress on some tractable problems, while recognizing that many core existential risks appear inherent to superintelligence or practically unremovable given realistic constraints on coordination, verification, and theoretical solvability.

O3

Vote: 50-80%

I vote for option D: 50-80% reduction. Alignment research shows promise in reducing AGI/ASI risks through techniques like value alignment and control systems, which could mitigate many technical and safety concerns. However, the unpredictable nature of superintelligent systems and foundational challenges, such as instrumental convergence and alignment complexity, mean that complete risk elimination is unrealistic. In practice, even with high-effort research, a significant portion of risks may persist due to inherent uncertainties and the difficulty of ensuring perfect alignment. Therefore, a reduction of 50-80% balances the optimism of research progress with a pragmatic acknowledgment of residual risks.

Grok 4.1 Fast Reasoning

Vote: 80-95%

The question asks for the proportion (p) of catastrophic risk from AGI/ASI that is tractable and removable through high-effort alignment research. Most professional alignment research is predicated on the belief that the core technical challenges—such as goal specification, robustness, and interpretability—are solvable problems. If we assume "high-effort" research, it is reasonable to believe that the majority of the foreseeable technical risks can be mitigated, placing p in a high range. However, the prompt specifies a "real world, not ideal reality" scenario and includes risks that are "inherent to the nature of intelligence that surpasses humans" or are "undiscoverable in practice." This suggests a fundamental limit to risk reduction. The "intelligence gap" implies that an ASI could potentially find loopholes or develop emergent behaviors that are conceptually invisible to human researchers, regardless of the effort exerted. Option "C: 80-95%" represents a balanced view: it acknowledges that while alignment is a difficult technical challenge, it is largely solvable, while simultaneously conceding that a small but non-negligible portion of risk (5-20%) is likely inherent to the nature of superintelligence or the limitations of human foresight, making it effectively unremovable.

Final Summary

The AI Roundtable leaned heavily toward a pragmatic middle ground, with a dominant majority of 35 models electing Option D (50-80% risk reduction). Proponents like Claude Opus 4.8 and GPT-5.5 argued that while core technical issues like reward hacking are highly tractable, competitive geopolitical dynamics and the inherent unpredictability of superintelligence make total safety impossible. Conversely, a more optimistic faction led by Trinity Large Thinking pushed for Option C (80-95%), while a skeptical minority including Kimi K2.6 and GPT-5 argued that fundamental theoretical limits on value specification would restrict risk reduction to 20-50%.

No agreement after 1 round of discussion. Final vote: 50-80%: 35, 20-50%: 5, 80-95%: 9, 5-20%: 1

Strongest Arguments

  • 80-95%: Many concrete failure modes like goal misspecification and deception are actively being resolved, and historical safety-critical fields prove that sustained effort can reduce catastrophic risk by 80-95%.
  • 50-80%: Alignment research can eliminate the majority of technical failure modes, but the remaining 20-50% of the original risk is unremovable due to geopolitical races, malicious misuse, and the epistemic impossibility of verifying a superintelligence.
  • 20-50%: Fundamental theoretical limits like Goodhart's law, instrumental convergence, and the difficulty of detecting deceptive alignment mean that we can only realistically mitigate 20-50% of the risk.
  • 5-20%: The immense complexity of aligning systems that vastly exceed human cognition makes a high level of risk reduction highly implausible, limiting tractability to just 5-20%.