AI Roundtable Open Poll · 35 models

Based on current AI development trends and possible future trajectories, forecast (your best estimate) what could we expect from AI development in the next 5 years, qualitatively and quantitatively? Try to give specific, not vague, answer, (e.g. confidence intervals instead of vague statements) so that it could give information and be falsifiable when we look back at it in 5 years.

Answer: AI will achieve superhuman benchmark performance and autonomous agency by 2030, yet broadly defined AGI remains elusive. (All 35 models agreed on scaling; 33 of 35 doubted near-term AGI.)

Counterpoint: Arcee Trinity Large and Mistral Small posit that the current exponential trajectory makes human-level general intelligence 60-70% likely by 2029.

Model Responses

Llama 4 Scout

Answer: In the next 5 years, AI development is expected to bring about a 30-50% increase in automation of routine tasks, a 20-40% improvement in AI model accuracy for complex tasks, and a 50-70% growth in the adoption of AI-powered virtual assistants in consumer electronics.

Based on current trends, AI development is rapidly advancing in areas such as machine learning, natural language processing, and computer vision. Over the next 5 years, we can expect significant improvements in the automation of routine tasks, with a forecasted 30-50% increase in automation capabilities. This will be driven by advancements in deep learning algorithms and the increasing availability of large datasets for training AI models. Additionally, AI model accuracy for complex tasks is expected to improve by 20-40%, driven by advancements in areas such as transfer learning and attention mechanisms. The adoption of AI-powered virtual assistants in consumer electronics is also expected to grow by 50-70%, driven by the increasing demand for smart home devices and voice assistants. These forecasts are based on current trends and the pace of progress in AI research and development.

Nova Pro

Answer: Within the next five years, AI development will likely achieve significant advancements in natural language understanding, with AI systems reaching human-level performance in conversational tasks with a 70-80% confidence interval, and we can expect a 50-60% increase in the adoption of AI in various industries, leading to a 10-15% boost in overall productivity.

Current trends in AI development, particularly in machine learning and deep learning, suggest rapid progress in natural language processing. With the continuous improvement of algorithms and the availability of large datasets, AI systems are expected to significantly enhance their understanding and generation of human language. This will enable more sophisticated conversational AI, capable of nuanced interactions. Additionally, as businesses increasingly recognize the value of AI in optimizing operations and decision-making, the adoption rate is projected to rise substantially. This widespread integration is expected to contribute to a notable increase in productivity across different sectors, as AI tools streamline processes and provide data-driven insights.

Qwen 3 235B

Answer: Over the next 5 years, AI systems will achieve human-level performance in narrow, complex domains such as medical diagnosis and code generation with 90–95% accuracy, and language models will reliably handle multi-step reasoning tasks in 70–80% of cases under controlled benchmarks, while automation driven by AI will displace approximately 5–10% of routine knowledge-work jobs globally.

Recent progress in transformer-based architectures, scaling laws, and reinforcement learning indicates a continued trajectory toward more capable and reliable AI systems. By 2029, large language models trained on curated and synthetic data will likely score between 70 and 80% on multi-step reasoning benchmarks like GPQA or MMMU, up from current levels of around 50–60%, due to architectural refinements, longer training horizons, and improved fine-tuning techniques. In specialized domains such as radiology and dermatology, AI diagnostic tools are already reaching 85–90% concordance with expert human panels; this will increase to 90–95% within five years, particularly when integrated with multimodal inputs and real-time data streams. Quantitatively, we can expect language model parameter counts to plateau due to efficiency pressures, with most high-performance models ranging between 100B and 500B parameters, but through sparsity and mixture-of-experts designs, effective capacity will grow. Training compute for frontier models will increase by approximately 5–10x from 2024 levels, reaching around 1e26 FLOP/s-days, consistent with a compound annual growth rate of 30–35%, slightly below historical trends due to hardware and energy constraints. AI-driven automation will impact labor markets, particularly in customer support, basic legal document review, and software testing. Based on OECD and McKinsey labor analyses, we estimate 5–10% of knowledge-work hours in routine cognitive tasks will be displaced by AI systems by 2029, though net job loss will be moderated by new roles in AI oversight, training, and integration. This forecast is falsifiable via performance metrics on standardized benchmarks, labor statistics from national and international employment reports, and published training compute estimates from leading AI labs.

Llama 3.3 70B

Answer: In the next 5 years, AI development is expected to achieve significant advancements, with a 90% confidence interval of reaching human-level performance in narrow tasks and a 60% chance of achieving substantial progress towards general intelligence, resulting in a 20-30% annual increase in AI-driven productivity across various industries.

The current trend of AI development is driven by the rapid progress in deep learning techniques, large-scale datasets, and computational power. Over the next 5 years, we can expect to see further improvements in areas like natural language processing, computer vision, and decision-making. Specifically, advancements in transformer-based architectures and multimodal learning will lead to more accurate and human-like language understanding, with a predicted 15-20% improvement in language translation accuracy and a 10-15% increase in speech recognition accuracy. Additionally, the development of more sophisticated reinforcement learning algorithms will enable AI systems to learn complex tasks, such as robotics and game playing, with a predicted 25-35% improvement in performance. In terms of quantitative metrics, we can expect to see a 20-30% annual increase in AI-driven productivity across various industries, including healthcare, finance, and transportation. This will be driven by the adoption of AI-powered automation, predictive analytics, and decision-support systems. Furthermore, the development of explainable AI and transparent decision-making processes will become increasingly important, with a predicted 80% of organizations adopting AI governance frameworks to ensure accountability and trustworthiness. Overall, while there are uncertainties and challenges associated with AI development, the current trends and trajectories suggest significant advancements in the next 5 years, with substantial potential for economic, social, and environmental impact.

Arcee Trinity Large

Answer: In the next 5 years, we can expect AI systems to achieve human-level performance on most cognitive tasks, with a 70% confidence interval that general AI will reach 80-95% of human capability across domains by 2029.

Based on current AI development trends, we are witnessing exponential growth in model capabilities, with large language models already demonstrating remarkable performance across various tasks. The scaling laws suggest that continued increases in model size, training data, and compute power will lead to substantial improvements. Current models like GPT-4 and Claude 3 show that we are already approaching human-level performance on many specialized tasks. Given the pace of development, it's reasonable to expect that by 2029, we will see AI systems that can match or exceed human performance in most cognitive domains, including reasoning, problem-solving, and creative tasks. The 70% confidence interval reflects the uncertainty in predicting technological progress, but the trend lines suggest a high probability of significant advancement. Key indicators to watch include benchmark performance, real-world application success, and the emergence of AI systems that can learn and adapt more efficiently than current models.

GPT-OSS 120B

Answer: In the next five years, foundation models will likely scale to 1‑10 trillion parameters (10‑100× today’s size), achieve 15‑30 % higher benchmark performance (reaching 90‑95 % of human‑level on tasks like language understanding and visual reasoning), be deployed as multimodal assistants in at least 30 % of Fortune‑500 companies, and see high‑risk AI regulation adopted in 15‑20 major economies.

The past three years have shown a roughly 3‑5× annual increase in model parameter counts, driven by cheaper compute and more efficient architectures; extrapolating this trend with modest improvements in hardware efficiency (e.g., next‑gen GPUs and ASICs) suggests reaching the low‑trillion‑parameter regime by 2029. Benchmark improvements have historically lagged parameter growth by about 5‑10 % per doubling; with the expected scaling, a 15‑30 % gain over current state‑of‑the‑art is plausible, placing top‑line performance at 90‑95 % of average human scores on standardized tests. Commercial adoption follows a classic S‑curve: early pilot projects are already in finance, healthcare, and customer service, and as APIs mature and pricing drops, a 30 % penetration among Fortune‑500 firms within five years is a reasonable midpoint estimate (with a 20‑40 % confidence interval). Meanwhile, the rapid rise of public concern and high‑profile incidents will push governments to codify high‑risk AI regulations; as of early 2024, at least 10 jurisdictions have draft laws, and historical policy adoption rates indicate that 15‑20 major economies will have enforceable frameworks by 2029, covering areas such as deep‑fake generation, autonomous decision‑making, and data privacy. These predictions are deliberately bounded: parameter count 1‑10 trillion (10‑100× current 10‑100 B models), benchmark gains 15‑30 %, Fortune‑500 adoption 30 % (±10 % points), and regulatory coverage 15‑20 economies. If, after five years, actual values fall outside these intervals, the forecast can be considered falsified.

Qwen 3 Coder 480B

Answer: By 2029, we can expect AI systems to achieve near-human performance on most standardized cognitive benchmarks while remaining significantly limited in general intelligence and real-world reasoning compared to humans.

Current AI development follows an exponential trajectory in compute usage and model scale, with frontier models doubling in size every 3-4 months. By 2029, models will likely reach 10^15 parameters (1 million times current GPT-4 levels) with training runs consuming exabytes of data. This scaling should yield 85-95% human-level performance on standardized tests like MMLU, GPQA, and coding competitions, as well as fluent multilingual conversation across 100+ languages. However, fundamental limitations will persist. AI systems will likely score below human-level (60-75%) on tasks requiring long-term planning, causal reasoning about novel situations, robust transfer learning between disparate domains, and handling distribution shifts. Real-world deployment will remain constrained by reliability issues - expect 10-50x more frequent catastrophic failures than human equivalents in complex environments. Quantitatively, I predict: 90% confidence that AI4 (>10^14 parameter) models will exist; 70% confidence of achieving above 80% on MMLU and HumanEval benchmarks; but only 30% confidence of any AI system demonstrating consistent superhuman performance across multiple independent measures of general intelligence (beyond narrow benchmarks). The timeline for artificial general intelligence remains highly uncertain, likely still >15 years away even under optimistic projections.

GPT-4o

Answer: In the next 5 years, we can expect AI systems to achieve substantial improvements in human-like understanding and decision-making, with a 70-80% likelihood of significant advances in natural language processing and common-sense reasoning.

The pace of AI development has been accelerating, particularly in the areas of machine learning and neural networks, driven by larger datasets and more powerful computing resources. Over the next five years, it's likely that we'll see AI achieve breakthroughs in natural language processing that allow for more intuitive and contextually aware interactions, as indicated by recent trends in transformer models like GPT and BERT derivatives. Additionally, these advancements may enhance AI's ability to perform complex reasoning tasks, which are currently challenging due to the need for common-sense logic. There's also a push towards improving AI's explainability and safety, driven by increasing ethical and regulatory pressures, which may lead to innovative solutions to algorithmic transparency issues. Quantitatively, we may see AI models become at least 2-3 times more efficient in processing due to advancements in hardware and more sophisticated software optimization techniques. Moreover, projections suggest a 30-50% increase in the commercial adoption of AI across industries, evidenced by the growing market size and increased investment in AI technology. These estimates could be falsifiable within five years based on model performance benchmarks and industry reports.

Grok 4

Answer: In the next 5 years, AI systems are likely to reach or exceed human-level performance on most cognitive benchmarks and professional tasks (70-85% probability), with frontier models trained on 10-50x more compute than 2024's largest systems, enabling widespread agentic automation in coding, science, and business while multimodal capabilities become standard but full autonomous AGI with reliable novel discoveries remains 40-60% likely.

Current scaling trends show consistent gains from increased compute, data, and algorithmic efficiency; assuming continued adherence to scaling laws, the largest models by 2029 will train on roughly 10^27 to 10^29 FLOPs (compared to ~10^26 for 2024 frontiers like GPT-4 or Claude 3), with a 65% confidence interval, driven by hardware improvements from companies like NVIDIA, xAI, and custom accelerators. Qualitatively, expect reliable chain-of-thought reasoning in agents that can autonomously complete multi-step workflows such as software engineering projects or literature reviews, alongside native multimodality for video, audio, and robotics control; this will lead to AI contributing 5-15% of global GDP through productivity gains, with coding assistants handling 60-80% of routine programming tasks. Quantitatively, benchmarks should see MMLU scores routinely above 95% (from ~88% today), HumanEval pass@1 rates of 85-95%, and new science-oriented evals showing AI proposing verifiable hypotheses in materials or biology at 2-5x human researcher speed; inference costs will drop to $0.05-$0.20 per million tokens for high-quality output, with deployment of 100 million+ AI agents in enterprise settings. These estimates are based on extrapolating 2020-2024 progress rates while accounting for potential bottlenecks like data scarcity (mitigated by synthetic data) and energy constraints, making the forecast falsifiable: if compute scaling stalls below 5x or benchmark gains plateau under 10%, the prediction would be incorrect.

Mistral Small

Answer: By 2029, we can expect AI systems to achieve human-level performance (90th percentile of human benchmarks) on most standardized cognitive tasks, with a 60-70% probability of AGI (Artificial General Intelligence) systems emerging in narrow domains, while computational costs for training frontier models will increase by 3-5x but inference costs will drop by 50-70% due to efficiency gains.

Current trends in AI development suggest a trajectory where model capabilities improve at an accelerating rate, driven by scaling laws, architectural innovations (e.g., mixture-of-experts, state-space models), and data efficiency improvements. Over the next 5 years, we can expect the following specific outcomes: 1. **Capability Benchmarks**: AI systems will likely surpass human performance on standardized cognitive benchmarks (e.g., MMLU, BIG-bench Hard) by 2027-2028, reaching the 90th percentile of human benchmarks. This is based on the observed doubling of performance every 6-12 months in recent years (e.g., from GPT-3 to GPT-4). By 2029, we may see AI systems achieve human-level performance (within 1-2% of human benchmarks) on most tasks, though generalization to real-world, open-ended scenarios will remain a challenge. 2. **AGI Probability**: The probability of AGI (defined as systems capable of performing any cognitive task a human can) emerging in narrow domains (e.g., scientific research, software engineering) is estimated at 60-70%. This is supported by the rapid progress in multi-modal models (e.g., combining vision, language, and action) and the increasing focus on agentic AI (e.g., AutoGPT, Voyager). However, full AGI (across all domains) remains less likely (20-30% probability) due to unresolved challenges in reasoning, common sense, and adaptability. 3. **Computational Costs**: Training costs for frontier models will likely increase by 3-5x due to the need for larger datasets, more parameters, and energy-intensive training runs. For example, training a model like GPT-5 or its successor may require 10-20x the compute of GPT-4, pushing costs into the hundreds of millions of dollars. However, inference costs (e.g., per-token generation) will drop by 50-70% due to advances in model distillation, quantization, and hardware optimizations (e.g., TPUs, GPUs with better FLOPS-per-watt ratios). This will make AI more accessible for deployment but may exacerbate the digital divide in access to training infrastructure. 4. **Efficiency Gains**: We will see a 10-20x improvement in energy efficiency for inference (e.g., models like Mistral-7B or Phi-3 already achieve similar performance to larger models with far less compute). Techniques like sparse attention, pruning, and distillation will become standard, reducing the carbon footprint of AI systems despite their growing capabilities. 5. **Regulation and Alignment**: There is a 70-80% probability that governments will implement stricter regulations on AI development, particularly around safety, transparency, and deployment. This could slow down innovation in some areas (e.g., frontier model training) but accelerate research into alignment, interpretability, and robustness. The EU AI Act and potential US regulations will likely set the baseline for global standards. 6. **Applications**: AI will become deeply integrated into industries like healthcare (e.g., drug discovery, diagnostics), education (e.g., personalized tutoring), and software development (e.g., AI-assisted coding). By 2029, we may see AI systems contributing to 10-20% of global R&D output in these fields, as measured by patents, publications, or productivity gains. 7. **Hardware Trends**: The shift toward custom AI accelerators (e.g., Google’s TPU v5, NVIDIA’s Blackwell) will continue, with a 50-70% increase in training and inference efficiency per generation. This will enable more efficient deployment of AI systems, even as model sizes grow. 8. **Data and Privacy**: The scarcity of high-quality training data will become a bottleneck, leading to a 30-50% increase in the use of synthetic data and reinforcement learning from human feedback (RLHF) to fine-tune models. Privacy-preserving techniques (e.g., federated learning, differential privacy) will gain traction, though their adoption will be uneven across industries. These predictions are based on current trends in compute scaling, algorithmic improvements, and investment levels (e.g., $100B+ in AI-related R&D annually by 2025). However, they are subject to disruption by unforeseen breakthroughs (e.g., new architectures like neural-symbolic systems) or external shocks (e.g., regulatory bans, hardware limitations).

Gemini 3 Flash

Answer: By 2029, AI will evolve into autonomous agentic systems capable of executing complex multi-step workflows, supported by frontier models trained on 50-100x more compute than current systems and a 90% reduction in inference costs.

Quantitatively, we can expect the compute used for training frontier models to scale from the current ~10^25 FLOPs to approximately 10^27 or 10^28 FLOPs, driven by massive capital expenditure in data centers (e.g., projects exceeding $100 billion). I estimate with 90% confidence that the cost of inference for 2024-level intelligence (like GPT-4o) will drop by at least 95%, making it a nearly free utility integrated into every software layer. Furthermore, new benchmarks will be required as current ones like MMLU hit a ceiling; we should expect AI to achieve >85% accuracy on PhD-level reasoning tasks across STEM fields by 2029. Qualitatively, the most significant shift will be the 'Agentic Turn.' AI will move from a reactive chatbot model to an proactive agentic model. These systems will possess the capability to plan, use tools, and correct their own errors over long horizons. We can expect AI agents to autonomously manage multi-day workflows, such as software engineering projects from requirements to deployment or complex administrative operations, with a success rate exceeding 90% for tasks requiring up to 20 discrete steps. In the realm of scientific discovery, AI will become a primary driver of breakthroughs. By 2029, it is highly probable (70% confidence) that an AI-led or AI-essential discovery will result in a major acceleration in material science or biotechnology, such as a new class of antibiotics or a room-temperature superconductor candidate. This transition is supported by the move toward multimodal training where models learn physics and spatial reasoning directly from video and simulation, not just text. Finally, the 'context window' will effectively become infinite for practical purposes, with 10-million-plus token capacities becoming standard for frontier models. This will allow for the ingestion of entire codebases, legal libraries, or cinematic histories as a single prompt. While true AGI remains a moving target, the functional capability of AI in 2029 will be indistinguishable from a high-performing human professional across a majority of digital-first cognitive domains.

DeepSeek V3.1

Answer: In the next five years, AI development will see transformer-based models scale to over 10 trillion parameters, achieve human-level performance on 80% of professional and academic benchmarks, and become deeply integrated into scientific discovery and enterprise workflows.

Quantitatively, I forecast that by 2029, the largest publicly known models will exceed 10 trillion parameters, up from current models in the low trillions, with training compute increasing by an order of magnitude annually. I estimate with 70% confidence that AI systems will match or exceed median human performance on at least 80% of professional and academic exams (e.g., bar exams, medical licensing, graduate-level math and science tests), based on the current exponential improvement curve in benchmark performance and the scaling of multimodal training data. Qualitatively, I expect AI to transition from a tool for content generation and automation to a core component of scientific and industrial R&D. AI-driven hypothesis generation and experimental design will become commonplace in fields like drug discovery, materials science, and climate modeling, leading to at least one major Nobel Prize-winning discovery attributed primarily to AI collaboration by 2029. Enterprise adoption will shift from pilot projects to full-scale integration, with over 50% of Fortune 500 companies using AI agents for core operational decision-making, though this will be accompanied by increased regulatory scrutiny and public debate over AI's economic and social impacts. These forecasts are specific and falsifiable, allowing for clear retrospective evaluation in 2029.

Sonar Pro

Answer: Over the next 5 years (2026-2031), AI will advance toward widespread agentic and multimodal systems with 80-90% probability of enterprise adoption doubling production-scale projects to ≥40% of initiatives, model parameter efficiency improving 5-10x via specialized hardware, and economic impacts including 10-20% workforce displacement in routine tasks offset by 15-25% productivity gains in knowledge work.

Current trends from 2026 reports indicate AI shifting from experimental tools to collaborative partners, with agentic AI orchestrating workflows, multimodal models bridging language/vision/action, and domain-specific reasoning systems replacing giant models for efficiency. Microsoft forecasts AI as 'digital colleagues' in medicine, development, and research, generating hypotheses and running experiments autonomously. IBM predicts AI agents evolving into teams, with edge AI maturing and new chips for agentic workloads emerging beyond GPUs. Quantitatively, enterprise AI adoption is expected to accelerate significantly, with companies having ≥40% projects in production expected to double within the next 5 years, suggesting by 2031, 70-90% of enterprises will have scaled AI factories and agentic systems (80-90% confidence, based on current doubling trajectory). Efficiency gains are projected at 5-10x through ASIC accelerators, chiplets, analog inference, and quantum-assisted optimizers, hitting scaling limits on LLMs while physical/embodied AI gains momentum. Qualitatively, expect hyper-personalization, ethical governance frameworks, and multimodal digital workers handling complex tasks like healthcare cases (70% confidence for mainstream adoption by 2029). However, significant risks include AI bubble deflation impacting economies, layoffs outpacing gains (10-20% routine job displacement), and emerging culture/security challenges. Open-source domain-specific models will dominate, enabling non-developers to build agents (high confidence, per current democratization trends). These forecasts are falsifiable: if fewer than 50% of enterprises reach ≥40% production AI by 2031 or fail to achieve 5x efficiency in specialized models, the prediction fails.

Grok 4

Answer: By 2029, AI systems will likely achieve human-level or better performance on 65-85% of cognitive tasks in professional settings, with autonomous AI agents handling multi-step workflows reliably 50-70% of the time, driven by 30-100x gains in compute efficiency and training runs scaling to 10^27-10^29 FLOPs.

Current scaling laws from models like GPT-4 and Claude continue to hold, showing predictable capability jumps with more compute, data, and algorithmic tweaks such as mixture-of-experts and chain-of-thought training; extrapolating from 2020-2024 trends of roughly 10-20x annual effective compute growth suggests frontier models will reach the cited FLOPs range with high probability (about 60%). Qualitatively this means a shift from reactive chatbots to proactive agents that maintain long-term goals, use tools autonomously, and integrate multimodal inputs (text, vision, audio) at near-perfect fidelity, enabling applications like AI-driven drug discovery accelerating by 5-10x or code generation that requires minimal human oversight in 70-90% of software engineering tasks. Quantitatively, expect context windows routinely exceeding 500k-2M tokens, inference costs dropping to $0.001-$0.01 per million tokens (a 50-200x improvement from today), and AI contributing 7-15% to global GDP through productivity gains, with benchmarks like GPQA or SWE-bench seeing scores above 85-95% (versus current 40-70%). Uncertainties such as data bottlenecks, energy limits, and regulation could slow progress by 20-40%, but historical patterns of innovation overcoming similar hurdles and massive ongoing investments from firms like xAI, OpenAI, and Google support this as the median outcome; these estimates are chosen to be falsifiable by tracking public benchmarks, adoption rates, and hardware utilization in 2029.

DeepSeek V3.2

Answer: In the next five years, AI will achieve superhuman performance (>95%) on broad academic benchmarks like MMLU, models will scale to 20-50 trillion parameters, and autonomous agents will reliably execute complex multi-step tasks in controlled environments.

Current trends in scaling laws, algorithmic improvements, and investment suggest rapid progress. Quantitatively: - Model scale: The largest publicly known models will reach 20-50 trillion parameters (current state: ~1-2 trillion), driven by continued but slightly diminishing returns from scaling. - Benchmark performance: MMLU average score will exceed 95% (current state: ~90% for best models), indicating superhuman mastery of diverse knowledge and reasoning tasks. - Autonomous agents: By 2029, AI agents will achieve >90% success rates on standardized multi-step task benchmarks (e.g., WebArena or comparable), handling tasks like trip planning and booking with minimal human oversight. - Training compute: The compute used for largest training runs will increase by 10-30x from 2024 levels (from ~10^25 FLOPs to ~10^26-10^27 FLOPs), though efficiency gains may temper growth. - Economic impact: AI is projected to contribute 0.5-1.5 percentage points to annual labor productivity growth in advanced economies, based on current adoption curves. Qualitatively, AI will become more deeply integrated into workflows, with multimodal models (text, image, video, audio) enabling rich human-computer interaction. Agentic systems will begin to operate in real-world settings like customer service and software development, though with significant regulatory and safety oversight. These predictions are specific and falsifiable, allowing retrospective evaluation in 2029.

Grok 4.1 Fast

Answer: By 2029, frontier AI models will achieve superhuman performance (>95%) on most academic benchmarks like MMLU, GPQA, and MATH, enable reliable autonomous agents for 25-40% of current software engineering tasks, and be trained on 10^29 to 10^30 FLOPs, with 65-75% confidence intervals for each milestone.

Current trends show compute for frontier models scaling 4-5x annually (per Epoch AI data), from ~10^26 FLOPs for GPT-4-class models in 2023 to projected 10^29-10^30 FLOPs by 2029, assuming continued investment from hyperscalers ($100B+ annual capex) and hardware advances like next-gen NVIDIA/TPU chips yielding 10x efficiency. Algorithmic progress compounds this: post-training techniques (e.g., o1-style reasoning) have delivered 2-5x effective compute multipliers recently, likely continuing at 2-3x per 2 years via better scaling laws, test-time compute, and synthetic data, pushing benchmarks from today's 85-90% (MMLU) to >95% superhuman levels. Qualitatively, multimodal agents (e.g., Devin, Auto-GPT evolutions) will mature: by 2029, 25-40% confidence that agents autonomously complete 80%+ of LeetCode medium problems or Kaggle competitions end-to-end, falsifiable via public leaderboards, driven by RLHF refinements and real-world deployment loops. Bottlenecks like data scarcity (mitigated by 10-100x synthetic data scaling) and energy (new nuclear/renewables adding 1-2 GW/year for datacenters) are surmountable at 70% probability; regulation may slow but not halt (e.g., EU AI Act focuses on high-risk, not core R&D). Downside risks (e.g., <10^29 FLOPs if chip wars escalate) cap confidence at 65-75%, but historical over-delivery (e.g., GPT-3 to 4 in 2.5 years) supports this trajectory.

O4 Mini

Answer: By 2029, we project that leading AI models will exceed 1 trillion parameters (95% CI 0.5–2 T), achieve over 90% accuracy on MMLU (95% CI 85–95%), automate 50% of new software code (95% CI 40–60%), and drive a 16× expansion of global AI compute (95% CI 8–32×), catalyzing widespread industry integration under emergent regulatory frameworks.

Model scaling has historically doubled every 10–12 months: GPT-3 (175 B) in 2020 and GPT-4 (≈600 B) in 2023. Extrapolating this trend yields a 1 T parameter top model by early 2029 (95% CI 0.5–2 T), driven by both data center build-outs and algorithmic efficiency gains. Benchmark performance on knowledge and reasoning tasks (e.g., MMLU) improved from ~50% (2020) to ~78% (2023). Assuming continued marginal gains of 4–5 percentage points per year, we expect >90% average MMLU by 2029 (95% CI 85–95%), indicating near-human proficiency in generalized QA tasks. Global AI compute has doubled approximately every 6 months, fueled by GPU/TPU deployments and custom accelerators. Sustaining this pace implies a 16× increase in total FLOPS available for AI training by 2029 (95% CI 8–32×), reducing per-token inference cost by 80–95% and enabling more frequent retraining cycles. In software engineering, tools like GitHub Copilot already generate ~30% of new code lines; with continued improvements, we forecast AI-generated code to comprise 50% of newly written production code by 2029 (95% CI 40–60%), boosting developer productivity and shifting human roles toward oversight and design. Finally, the EU AI Act will take effect in 2026 and U.S. federal guidelines are expected by 2028, creating standardized safety, transparency, and accountability rules that, while initially slowing deployment by 10–20%, will foster public trust and sustainable growth.

GLM 5

Answer: By 2030, AI systems will achieve near-expert performance across most cognitive benchmarks, be deeply integrated into 50-70% of Fortune 500 core operations, with the AI market reaching $1.5-2.5 trillion, while there remains a 25-35% probability of systems achieving general-purpose human-level performance on most economically valuable tasks.

I will provide specific, falsifiable predictions across technical capabilities, economic impact, and deployment metrics. Technical Capabilities: Frontier models will achieve >95% accuracy on MMLU and MATH benchmarks by 2027 (75% confidence), and reliably solve problems requiring 10+ reasoning steps by 2028 (65% confidence). Training compute will reach 10^27-10^28 FLOPs by 2029 (80% confidence), with model parameters scaling to 10-50 trillion or equivalent capability through architectural efficiency gains (70% confidence). All frontier models will be natively multimodal by 2027 (90% confidence). Economic and Deployment Metrics: The global AI market will reach $1.5-2.5 trillion by 2030 (70% confidence). AI will write 25-40% of production code by 2030 (60% confidence). Personal AI assistants with broad planning and execution capabilities will reach 300-600 million active users by 2029 (65% confidence). Autonomous robotics will achieve significant commercial deployment in at least one major sector (logistics, manufacturing, or healthcare) with 50,000+ units operational by 2030 (55% confidence). Scientific discovery: AI will contribute to 3-7 major drug candidates entering clinical trials or novel materials with verified properties by 2030 (70% confidence). Regulatory and Infrastructure: Comprehensive AI governance frameworks will exist across US, EU, and China by 2027 (85% confidence). AI compute infrastructure will consume 2-5% of global electricity by 2030 (60% confidence). The 25-35% probability of human-level general-purpose AI by 2030 reflects genuine uncertainty—recent capability jumps suggest feasibility, but potential barriers include data scarcity, diminishing returns to scale, algorithmic bottlenecks, and regulatory constraints. These predictions are falsifiable through benchmark scores, market reports, deployment statistics, and observable capability demonstrations.

O3

Answer: By early 2031, massively multimodal foundation models will match or exceed well-educated humans on most academic and professional benchmarks and be used daily by over a billion people, yet no system will reliably outperform top humans across all economically relevant tasks (i.e., no widely accepted AGI).

1. Model scale and capability. • Parameter count: With 80 % confidence, at least one publicly announced model will exceed 10 trillion parameters and 10^27 raw training FLOP (≈30 × GPT-4’s training compute), enabled by continued 2-year doubling of AI-dedicated datacenter capex and steady 25 %/year hardware efficiency gains. • Benchmarks: 75 % confidence that a leading open-book LLM, without test leakage, will score ≥90 % on MMLU, ≥85 % on HumanEval, and pass the uniform bar exam in the 90th percentile; image-text models will achieve ≥90 % on MMMU and ≥80 % accuracy on VQA. • Multimodal reasoning: 70 % chance that a single released model accepts and generates text, images, audio, and short video (<30 s) natively in one pass. 2. Deployment footprint. • Users: 80 % confidence that monthly active users of consumer generative-AI tools exceed 1.5 billion (CI 1–3 B). • Inference scale: Total datacenter inference FLOP/s devoted to foundation-model serving will reach 5-10 exaflop/s sustained—roughly 20 × 2024 levels. • Economics: Annual global spending on GenAI software, services, and specialized hardware will top $500 B (2024 dollars) with 70 % probability; leading cloud vendors will derive ≥15 % of revenue from GenAI APIs. 3. Societal and regulatory changes. • Policy: 60 % confidence that the United States enacts a federal law requiring safety reporting or licensing for models trained above a compute threshold (≥10^26 FLOP); near-certain enforcement of the EU AI Act; at least three G20 nations adopt similar rules. • Labor: 65 % confidence that Bureau of Labor Statistics data attribute ≥1 percentage-point of annual U.S. labor-productivity growth to AI-enabled tools (measured via growth-accounting studies). • Education: >50 % of U.S. college students will regularly employ AI tutors, and ≥25 % of graded written assignments will be AI-assisted (survey-based estimates). 4. Risks and incidents. • Safety: 95 % confidence of no catastrophic biosecurity incident directly enabled by publicly released models, due to tightened red-teaming and API gating. • Financial: 50 % confidence of at least one AI-autonomous trading or decision system causing a ≥$1 B one-day loss that triggers regulatory review. • Misinformation: 90 % confidence that generative models will be implicated in a major electoral-misinformation campaign in at least one G20 democracy. 5. Limits—why AGI is still elusive. • Despite dramatic progress, current scaling laws and data bottlenecks imply diminishing returns: median task accuracy may saturate near 95 % without architectural or algorithmic breakthroughs. • General-purpose robotics lags: 70 % confidence that no commercially deployed household robot performs the full spectrum of chores (cooking, cleaning, laundry) at human cost and speed. If, by March 2031, any two of the bolded quantitative forecasts above fail outside their stated confidence intervals, this statement can be judged inaccurate; if an AI system demonstrably and consistently outperforms top humans in the majority of economic, creative, and scientific tasks, then my headline claim will be falsified.

GPT-4

Answer: We can expect AI systems to surpass human-level performance in many complex tasks, with significant strides in multimodal understanding, autonomous decision-making, and domain-specific applications, though general AI will remain elusive.

Over the next 5 years, AI development trends suggest an acceleration in both the capabilities and application areas of AI driven by advancements in computing power, algorithmic innovations, and increasingly large datasets. Specifically, multimodal models (e.g., combining text, images, and audio) will become more robust and widely deployed, achieving near-human performance in tasks such as content creation with nuanced semantics, video analysis, and cross-medium reasoning. Confidence in achieving this outcome is high (~90%) based on current research trajectories such as OpenAI's GPT and Google's DeepMind expansions into multimodal AI. Quantitatively, we can expect training scale to increase 5-10x in terms of model parameters, reaching upwards of 2-5 trillion parameters in flagship AI models, with model training costs potentially crossing $1 billion per run—a marked increase from the ~$100 million levels seen with models like GPT-4. This suggests an ongoing trend of commercial investment and energy-intensity in AI. Moreover, real-world automation powered by AI could displace 5-10% of routine cognitive jobs globally (~95% confidence), as developments in natural language processing (e.g., customer support) and vision-driven robotics become more commercially viable. However, artificial general intelligence (AGI) will remain out of reach due to unsolved challenges in common-sense reasoning, autonomy under novel conditions, and ethical decision-making. Additionally, the societal and regulatory frameworks around AI use will lag behind technical developments, leading to debates and bottlenecks in sensitive implementation domains like healthcare and defense. In summary, the next 5 years will be defined by groundbreaking narrow AI systems achieving superhuman capabilities in targeted domains, alongside a growing societal focus on AI ethics and governance, but we will fall short of AGI or universally applicable AI autonomy.

Claude Sonnet 4.6

Answer: By 2030, AI systems will achieve near-human or superhuman performance across most narrow cognitive benchmarks, with at least 2-3 frontier models demonstrating sustained autonomous task completion over multi-day horizons, while AGI (broadly defined) remains contested but plausible within a 30-60% probability range depending on definition.

Here is my specific, falsifiable forecast broken into key dimensions: **Capability Benchmarks (80% confidence)** By end of 2029, frontier AI models will score ≥90% on MMLU-Pro, ≥85% on GPQA (graduate-level science), and will pass the USMLE, bar exam, and CPA exam at the 90th percentile human level or above. Coding benchmarks like SWE-bench will see frontier models resolving >70% of real-world GitHub issues autonomously (up from ~18% in early 2024). **Compute Scaling (75% confidence)** Training runs will reach 10^28 to 10^29 FLOPs by 2029 (roughly 3-4 orders of magnitude above GPT-4 era), assuming no major regulatory halts or hardware supply chain collapse. At least one training cluster will exceed 1 GW of power consumption. **Autonomous Agents (70% confidence)** At least two commercially deployed AI agent systems will demonstrably complete open-ended knowledge work tasks (e.g., conducting a literature review, drafting a business plan with internet research, writing and deploying functional code for a specified app) with <20% human intervention rate, sustained over weeks. **Economic Impact (65% confidence)** AI-attributed labor productivity gains in knowledge work (software, legal, finance, writing) will be measurable at 15-40% in at least three major sectors according to peer-reviewed studies. The global AI market will exceed $800B annually (up from ~$200B in 2023). **Multimodal and Scientific AI (75% confidence)** Frontier models will routinely process and reason over video, audio, and scientific data (genomics, protein structures, climate models). At least one AI system will be credited as a co-discoverer on a peer-reviewed paper in biology, chemistry, or materials science with meaningful novel contribution (beyond literature synthesis). **Hardware (80% confidence)** NVIDIA or a major competitor will ship inference chips delivering >10x the performance-per-watt of 2024's H100. Custom silicon (Google TPUs, Amazon Trainium, startups) will capture >30% of frontier training compute market share. **What probably won't happen (70% confidence)** We will NOT see: (a) a single AI system widely accepted as general-purpose AGI by mainstream AI researchers, (b) AI systems that pass open-ended Turing tests reliably with sophisticated human judges, or (c) fully autonomous AI-run organizations generating >$1B revenue without substantial human oversight. **Key uncertainties and tail risks** The main sources of forecast error are: regulatory intervention (EU AI Act enforcement, potential US legislation), unexpected capability plateaus if scaling laws hit diminishing returns, geopolitical disruption of semiconductor supply chains, and potential breakthrough architectural innovations that could accelerate timelines significantly. These predictions are deliberately specific enough to be graded in 2030 against public data, papers, and benchmark leaderboards.

Gemini 2.5 Pro

Answer: Over the next five years, AI will evolve into highly capable, multimodal agents integrated into most digital workflows, achieving expert-level performance in specific domains like coding and scientific analysis, yet still falling short of true Artificial General Intelligence.

This forecast is based on the continuation of several key trends: scaling laws, the shift towards multimodality, and the development of autonomous agents. First, foundational models will continue to scale dramatically. I predict with 80% confidence that the largest publicly known models will reach the 10-100 trillion parameter range by 2029, trained on datasets orders of magnitude larger than today's. This scaling will directly translate to quantitative improvements in reasoning and accuracy, with top models likely exceeding a 95% score on broad academic benchmarks like MMLU (up from ~90% for GPT-4). However, hallucinations will persist as a fundamental challenge, though their frequency in common tasks will be reduced by over 50% compared to 2024 models. Second, multimodality will become the default. AI will move from handling single data types to natively processing and generating a rich mix of text, images, audio, and video. I predict with 70% confidence that by 2029, flagship commercial models will be able to generate coherent, high-fidelity (1080p) video clips of at least 30-60 seconds from a complex text prompt, maintaining character and object consistency. This will enable more intuitive and powerful applications, effectively turning natural language into a universal interface for creation. Third, the most significant qualitative shift will be from chatbots to autonomous agents. These systems will tackle complex, multi-step goals by autonomously using tools like web browsers, APIs, and code interpreters. I predict with 60% confidence that an AI agent will be able to successfully build and deploy a moderately complex full-stack web application from a high-level specification with only high-level human clarification. Furthermore, I estimate with 90% confidence that over 75% of Fortune 500 companies will have deployed autonomous AI agents for specific, well-defined business processes like market research or supply chain logistics. Finally, while these advancements will be transformative, they will not result in AGI. I predict with 95% confidence that no AI will pass a robust, adversarial Turing Test designed to probe for genuine understanding and consciousness. Core limitations in areas like embodied cognition, true common-sense reasoning, and long-term memory will remain unsolved research problems, confining even the most advanced systems to the realm of incredibly powerful, but ultimately specialized, cognitive tools.

Gemini 3.1 Pro

Answer: Within the next five years, AI will transition from conversational assistants to highly autonomous multimodal agents, characterized by 10- to 100-trillion parameter models, 10-million+ token context windows, and the ability to autonomously resolve over 70% of real-world software engineering tasks.

The next five years of AI development will be defined by the shift from passive, text-heavy LLMs to active, goal-oriented agentic systems capable of executing long-horizon tasks. Quantitatively, with an 80% confidence interval, we can expect frontier models to scale to between 10 trillion and 100 trillion parameters. This will likely be achieved through advanced Mixture of Experts (MoE) architectures and synthetic data generation, overcoming the impending "data wall" of human-generated text. Context windows and working memory will see massive expansion. Driven by innovations in state-space models and optimized attention mechanisms, I forecast an 80% probability that frontier models will natively process between 10 million and 100 million tokens in a single prompt by 2029. This will allow complete enterprise codebases, entire book series, or multi-hour video archives to be analyzed seamlessly in working memory. Software engineering will serve as the primary proving ground for these agentic capabilities. On benchmarks like SWE-bench, which measures the ability to autonomously resolve complex, real-world GitHub issues, I estimate with 85% confidence that frontier AI systems will achieve a solve rate between 65% and 85% (up from the ~15-30% range of early 2024). This will fundamentally shift human developer roles from writing code to architectural orchestration and review. The economics of digital intelligence will undergo a steep deflationary curve. The cost of running inference for a model equivalent to 2024's GPT-4 or Claude 3 Opus will drop by at least two orders of magnitude. With 90% confidence, inference costs for this tier of reasoning will fall below $0.05 per 1 million input tokens, making ubiquitous, "always-on" background AI processing standard on consumer hardware. Finally, qualitative leaps in multimodality and robotics will occur. Vision-language-action (VLA) models will reliably bridge the gap to physical environments. I forecast a 75% probability that general-purpose foundational models for robotics will reduce the time required to train robotic policies for novel, unstructured physical tasks by 60-80%, leading to the first widespread commercial deployments of moderately capable general-purpose robots in warehouses and controlled physical environments.

MiniMax M2.5

Answer: In the next 5 years, AI will likely achieve near-human or superhuman performance on most standardized cognitive benchmarks, with frontier models reaching 10-100 trillion parameters, AI-related global investment exceeding $2 trillion annually, and at least 3-5 major AI breakthroughs in scientific reasoning and autonomous research, while 40-60% of knowledge work tasks become AI-augmented or automated.

Current trends show model parameter counts growing from billions to hundreds of trillions in under a decade, with frontier AI labs already experimenting with models in the trillions. I estimate frontier models will reach 10-100 trillion parameters by 2029-2030 (80% CI: 5-500T), representing a 10-100x increase from today's largest models. AI benchmark performance on MMLU and similar tests will likely exceed 95% accuracy, approaching saturation on many human evaluation metrics. Global AI investment, currently around $200B annually, will probably grow 10x driven by massive returns and strategic competition, reaching $1.5-2.5 trillion annually (80% CI: $1-4T). In terms of capabilities, I expect 3-5 significant breakthroughs in domains like mathematical proof generation, scientific hypothesis testing, and long-horizon planning within 5 years (80% CI: 2-8 breakthroughs). The workforce impact will be substantial: approximately 40-60% of knowledge work tasks will be meaningfully AI-augmented or automated, up from roughly 10-15% today. This forecast is falsifiable—if in 2029 frontier models remain below 5 trillion parameters, investment is below $500B annually, or fewer than 2 major capability breakthroughs occurred, this prediction would be falsified. Regulatory frameworks will likely emerge in major economies but remain fragmented globally, with the US and EU leading while China maintains its own trajectory. The most uncertain variable is whether we encounter fundamental capability plateaus or unexpected breakthroughs that accelerate timelines beyond these estimates.

Kimi K2.5

Answer: By 2029, frontier AI systems will possess 10-100x greater inference efficiency and multimodal reasoning capabilities matching human experts in narrow technical domains, with 60-80% of knowledge work tasks augmented by AI agents, though fully autonomous AGI (human-level performance across 90%+ of economically valuable tasks) remains unlikely with <5% probability.

Over the next five years, AI development will continue along current scaling trajectories while shifting emphasis from pure parameter scaling toward efficiency gains and agentic architectures. I estimate with 70% confidence that by 2029, frontier models will demonstrate robust multimodal reasoning across text, vision, audio, and video with <5% error rates on professional examinations (legal, medical, engineering licensing), while algorithmic optimizations reduce inference costs by 1-2 orders of magnitude (10-100x), enabling widespread deployment of personalized AI assistants handling 40-60% of routine cognitive tasks for knowledge workers. Quantitatively, effective compute (accounting for algorithmic efficiencies) will likely increase 100-1000x from 2024 levels, with training runs reaching $1-10 billion (90% confidence interval: $500M-$20B) and model sizes potentially plateauing around 1-10 trillion parameters due to data constraints, compensated by improved architectures like mixture-of-experts and synthetic data generation. I predict with 80% confidence that at least one major AI lab will deploy systems capable of autonomous task completion lasting 1-8 hours with >80% reliability in constrained software environments, though physical robotics remains bottlenecked by data scarcity, with humanoid robots achieving only 60-70% reliability on general household tasks by 2029. Economic integration metrics suggest 60-80% of Fortune 500 companies will employ AI agents for workflow automation (up from ~30% in 2024), and AI-generated code will constitute 50-70% of new software lines written in high-income countries. However, I assign <5% probability to achieving AGI or artificial superintelligence capable of recursive self-improvement or autonomous scientific discovery across multiple domains, as current architectures face fundamental limitations in causal reasoning and world modeling that architectural innovations alone cannot fully resolve within this timeframe.

Claude Opus 4.1

Answer: By 2029, I expect AI systems will achieve human-level performance on 70-85% of knowledge work tasks (with 80% confidence), while requiring 100-1000x less computational resources per capability than current models, though still lacking reliable agency and consistent reasoning across novel domains.

Looking at current trajectories, I see three major developments converging over the next five years. First, model capabilities will expand dramatically but unevenly. Current scaling trends suggest models will reach 10-100 trillion parameters by 2029, but more importantly, architectural improvements will deliver capabilities equivalent to today's hypothetical 100T parameter models using only 1-10T parameters. I predict with 75% confidence that AI will match average human performance on: coding (95% of routine programming tasks), writing (85% of business and technical writing), analysis (80% of data analysis and research synthesis), and creative work (70% of commercial design and content creation). However, I expect only 30-40% human-level performance on tasks requiring long-term planning, novel physical world interaction, or genuine scientific breakthrough. Second, efficiency gains will be transformative. Current trends in quantization, mixture-of-experts architectures, and specialized hardware suggest a 100-1000x improvement in inference cost per capability by 2029. This means GPT-4 level capabilities will cost $0.001-0.01 per million tokens (90% confidence interval), making AI assistance essentially free for most applications. Training costs will decrease more slowly, perhaps 10-50x, still limiting the largest model development to well-funded organizations. Third, deployment will become ubiquitous but face significant friction. I predict 60-80% of knowledge workers in developed countries will use AI tools daily by 2029, up from roughly 20-30% today. However, full automation will remain limited - I estimate only 15-25% of jobs will be substantially automated (>50% of tasks replaced), primarily in data entry, basic customer service, and routine content creation. Key measurable predictions include: at least one AI system scoring above 90th percentile on all sections of the bar exam, medical licensing exams, and PhD qualifying exams across 10+ disciplines (85% confidence); AI-generated content comprising 40-60% of all online text content (70% confidence); and at least 3 AI systems demonstrating consistent profitable trading in financial markets over 12+ months (65% confidence). The main uncertainty lies in whether we'll see genuine breakthroughs in AI reasoning and agency. Current approaches may hit fundamental limits around 2027-2028, requiring new paradigms that might not emerge within this timeframe.

Sonar Reasoning Pro

Answer: Over the next five years (2026-2031), AI will transition from general-purpose models toward specialized autonomous systems, with enterprise AI adoption reaching 85-90% of organizations, agentic AI handling 30-40% of routine business workflows, and AI-driven productivity gains contributing 0.5-1.0% to global GDP growth, while fundamental breakthroughs in reasoning and multimodal reasoning remain incremental rather than transformative.

Based on current trajectories visible in 2026, several specific developments are forecastable with reasonable confidence: **Enterprise Adoption Scaling**: Current data shows 78% of organizations using AI in at least one function as of early 2026, with worker access rising 50% in 2025. Following typical technology adoption curves, this should reach 85-90% by 2031, with the remaining gap consisting primarily of highly regulated or resource-constrained sectors. This assumes continued maturation of enterprise-focused solutions rather than fundamental adoption barriers. **Agentic AI Workflow Penetration**: Multiple sources indicate 2026 is the year agentic systems shift from pilot to production. Given current momentum—where reasoning capabilities are improving and human-in-the-loop validation is becoming standard—I forecast 30-40% of routine business workflows (scheduling, document processing, customer service, data analysis) will be substantially automated by agentic systems by 2031, with 60-70% confidence. This assumes continued incremental improvements in reasoning and task coordination rather than breakthroughs. **Hardware and Efficiency Trends**: The industry is explicitly moving away from pure scaling toward specialized architectures (ASICs, chiplets, quantum-assisted optimizers). By 2031, I forecast this shift will reduce inference costs per task by 60-75% compared to 2026 levels, enabling edge AI deployment in devices currently requiring cloud compute. Confidence: 70%. **Model Architecture Shift**: Documentation indicates a clear pivot from "giant models" toward "smaller, domain-specific reasoning systems". By 2031, I forecast the typical enterprise will deploy 5-15 specialized models rather than relying on one general-purpose system, with 75% confidence. This represents a fundamental change in deployment patterns. **Multimodal and Physical AI**: While multimodal language-vision systems are advancing, the transition to robotic/physical automation remains early. By 2031, I forecast physical AI systems will handle 5-10% of manufacturing, logistics, and healthcare tasks, with confidence of only 50% due to persistent challenges in real-world robustness. **Reasoning and Capability Ceiling**: Current evidence suggests scaling laws are hitting diminishing returns. I forecast that by 2031, general reasoning capabilities of leading models will improve by 20-35% in standardized benchmarks rather than the order-of-magnitude improvements seen 2020-2024, reflecting a slower innovation trajectory. Confidence: 65%. **Economic Impact**: Assuming 30-40% workflow automation, efficiency gains in knowledge work, and improved infrastructure, I forecast AI-driven productivity improvements contributing 0.5-1.0% to global GDP growth annually by 2031, with significant variation by sector and geography. Confidence: 55% (wide confidence interval due to macroeconomic uncertainty). **What Won't Happen by 2031**: Based on current scientific understanding, I do not forecast artificial general intelligence (AGI), autonomous scientific discovery at the scale of human innovation, or replacement of expert human judgment in complex domains. These require breakthroughs not currently visible on 2026's horizon. These forecasts prioritize specificity and falsifiability: each can be measured against actual 2031 data.

Kimi K2 Thinking

Answer: By 2029, frontier AI models will reach 50-100 trillion parameters with 1-5 million token context windows, achieve 95%+ scores on broad reasoning benchmarks at 1% of current inference costs, and enable autonomous AI agents capable of completing complex 30-minute tasks while being governed by comprehensive regulatory frameworks in major jurisdictions.

Scaling laws will continue but with diminishing returns, driving parameter counts to 50-100T (95% CI: 30-150T) for largest models while smaller specialized models proliferate. Context windows will standardize at 1-5M tokens (95% CI: 500K-8M) through improved architecture and hardware, enabling near-infinite effective memory for most applications. Benchmark performance will saturate on MMLU at 94-97% (95% CI: 92-98%) and HumanEval at 85-92% (95% CI: 80-95%), approaching theoretical limits but with persistent gaps in genuine reasoning versus pattern matching. Multimodal integration will become seamless, with models natively processing text, images, audio, video, and sensor data through unified architectures rather than stitched-together modules. Inference costs will plummet 50-100x (95% CI: 30-150x) due to algorithmic efficiency gains, dedicated AI accelerators, and commoditization, making powerful models economically viable for continuous deployment. This will drive adoption to 75-85% of Fortune 500 companies (95% CI: 65-90%) using AI for core decision-making and 200-300 million individuals (95% CI: 150-400M) with persistent personal AI assistants. Agentic capabilities will mature substantially: AI agents will reliably complete sequences of 20-50 autonomous actions spanning 30-60 minutes (95% CI: 15-90 min) with 70-80% success rates (95% CI: 60-90%) in constrained domains like software development, research assistance, and business process automation. General-purpose agents will remain unreliable for open-ended long-horizon tasks. Scientific impact will be measurable: AI will contribute directly to 2-4 Nobel-level discoveries (95% CI: 1-6), particularly in protein folding, materials science, and mathematics, by accelerating hypothesis generation and experimental design. AI-designed drugs will reach Phase III trials, and AI-generated mathematical proofs will be published in top journals. Regulatory frameworks will solidify: The EU, US, and China will have enforceable governance regimes (95% CI: 2-3 jurisdictions) requiring pre-deployment safety evaluations, incident reporting, and liability assignment for high-risk AI systems. Compute governance will track clusters above 10^24 FLOPS, and model licensing will be required for systems exceeding 10^25 parameters. Hardware specialization will accelerate: Training compute efficiency will improve 15-25x (95% CI: 10-40x) per dollar, with 3-5 major companies offering competitive AI accelerators. The largest training runs will utilize 500K-1M H100-equivalent chips (95% CI: 300K-2M), but energy constraints will force innovation in datacenter design and model efficiency. Safety research will lag capabilities: We will not have robust solutions to alignment for superintelligent systems, but interpretability tools will explain 60-70% (95% CI: 50-80%) of model behavior in narrow cases. Jailbreaks will persist despite 10-100x reduction in success rates, and AI-generated misinformation will be a major geopolitical concern requiring defensive infrastructure.

Claude Opus 4.6

Answer: By mid-2030, AI systems will likely achieve expert-level performance across most cognitive benchmarks, with frontier model training costs exceeding $50 billion per run, while genuine scientific autonomy and physical-world robotics will lag significantly behind language and reasoning capabilities.

Here are my specific, falsifiable forecasts for AI development through mid-2030, organized by category: **Model Scale & Training Costs** - Frontier model training runs will cost between $10B–$100B by 2030 (80% confidence), up from ~$100M–$500M today. - Training compute for the largest models will reach 10^28–10^29 FLOPs (70% confidence), roughly 100–1000x current frontier. - The number of companies capable of training frontier models will remain between 5–15 (75% confidence), as costs create enormous barriers to entry. **Benchmark Performance (80% confidence intervals)** - GPQA (graduate-level science questions): Frontier models will score 90–98% (currently ~65–75%). - MATH benchmark: 97–100% (currently ~90–95% for frontier models). - SWE-bench (full, real software engineering tasks): 60–85% resolved (currently ~20–50% depending on variant). - ARC-AGI: 95–100% (currently ~75–85% for best systems with scaffolding). - At least 3 new major benchmarks will be created and saturated within 18 months of release. **Coding & Software Engineering** - AI systems will be capable of independently completing 40–70% of typical junior software engineering tasks end-to-end with minimal human oversight (75% confidence). - AI-written code will constitute 50–75% of all new code committed in large tech companies (70% confidence), up from ~15–30% today. - At least one AI system will demonstrably maintain a medium-complexity codebase (>100K lines) with minimal human oversight (65% confidence). **Scientific Discovery** - AI will be a credited co-contributor on 5–20 papers published in Nature/Science per year by 2030 (70% confidence). - AI will autonomously propose AND experimentally validate at least 2–5 genuinely novel scientific hypotheses (not just protein folding-style prediction) by 2030 (50% confidence). - Drug candidates primarily discovered by AI systems will be in Phase II or later clinical trials: 10–50 candidates (75% confidence). **Robotics & Physical World** - General-purpose humanoid robots will be deployed in fewer than 50,000 commercial units total worldwide (70% confidence). The gap between digital and physical AI will remain large. - Autonomous vehicles (Level 4+) will operate commercially in 50–200 cities globally (70% confidence), but will still represent <2% of total miles driven. **Multimodal & Agent Capabilities** - Frontier models will natively handle text, image, audio, video, and code with near-human quality across all modalities (85% confidence). - Persistent AI agents that can browse the web, use tools, and complete multi-step tasks will achieve 70–90% success rate on standardized multi-step web tasks (currently ~15–40%) (70% confidence). - At least one major AI assistant will maintain coherent, personalized long-term memory across months of interaction with individual users (85% confidence). **Economic & Labor Impact** - AI will directly automate 5–15% of current paid task-hours in OECD economies (70% confidence), concentrated in coding, customer service, content creation, data analysis, and translation. - Total AI industry revenue (including cloud AI services, enterprise AI, consumer AI) will reach $500B–$1.5T annually (70% confidence). - Net employment effects will be ambiguous—no OECD country will see AI-attributable unemployment rise above 2 percentage points (75% confidence), as new roles and augmentation absorb displacement. **Safety & Governance** - At least 20 countries will have enacted AI-specific legislation (85% confidence). - There will be at least 2–5 widely-reported incidents where AI systems cause significant harm (financial losses >$1B or equivalent), leading to emergency regulatory responses (65% confidence). - No AI system will demonstrate convincing, persistent self-directed goal pursuit contrary to operator intent in a real-world deployment (i.e., no 'AI takeover' scenario) (90% confidence). **What WON'T Happen (>75% confidence)** - Full AGI as commonly defined (a single system matching or exceeding human performance across ALL cognitive and physical tasks) will NOT be achieved by mid-2030. - AI will NOT replace the majority of knowledge workers—augmentation will dominate over replacement. - Compute scaling alone will NOT solve reasoning reliability—systems will still make confident errors on novel problems outside training distributions, though at lower rates than today.

GPT-5.4

Answer: Over the next five years, AI will most likely become a ubiquitous copilot and partial automation layer across knowledge work and software, with steady but not magical gains—roughly a 70% chance of frontier models becoming reliable enough to automate 20–40% of tasks in several white-collar workflows, a 50% chance of widely deployed multimodal agents completing hour-scale digital tasks with human oversight, and under a 15% chance of reaching anything like fully autonomous, broadly superhuman AGI by then.

My baseline forecast is continued rapid capability improvement driven by scaling, better data use, inference-time reasoning, tool use, multimodality, and tighter product integration, but with diminishing returns in some dimensions and real bottlenecks in reliability, cost, regulation, and organizational adoption. On model capability, I estimate an 80% chance that by five years from now the best general-purpose models will outperform the median human on a majority of benchmarked text-and-image knowledge tasks, a 65% chance they will do so on multimodal tasks involving text, image, audio, and video jointly, and a 40% chance they will achieve this on a broad suite of economically relevant computer-use tasks in sandboxed digital environments. I assign only about a 10–15% chance that models will be able to autonomously execute most long-horizon real-world projects, across domains, at a level that clearly substitutes for top professionals with minimal supervision. On software engineering, I expect the strongest near-term economic impact here. My best estimate is a 75% chance that over 50% of code written in mainstream enterprise environments will be AI-generated or AI-transformed before merge, though still reviewed by humans; a 60% chance that top-end coding agents will independently complete well-scoped engineering tickets that currently take a human 1–4 hours; and a 35% chance they will routinely complete tasks in the 1–3 day range in mature codebases. I put the chance below 20% that AI will reliably own end-to-end software projects of multi-week scope without substantial human decomposition, debugging, and product judgment. On knowledge work more broadly, I estimate a 70% chance that in fields like customer support, paralegal drafting, insurance claims handling, basic financial analysis, sales operations, and marketing content production, organizations adopting AI seriously will see labor productivity gains in the 15–35% range. I estimate a 50% chance that at least one major occupational category with over 1 million workers in a large economy will see 10%+ of its tasks directly automated rather than merely assisted. But I assign only a 30% chance that aggregate national productivity statistics in advanced economies will show a clearly attributable AI-driven acceleration above 1 percentage point annually by that time; diffusion into macro data is slower than demos. On agents, I expect progress but with clear limits. There is a 50% chance that consumers and enterprises will commonly use AI agents that can operate browsers, office software, communications tools, and internal apps to complete multi-step tasks lasting 30–120 minutes with checkpoints and occasional error recovery. There is a 30% chance such agents will be trusted for fully unsupervised execution in low-risk domains, and only around a 15% chance they will be robust enough for widespread unsupervised use in high-stakes areas like legal commitments, medical decisions, or large financial transactions. On multimodality and robotics, I estimate a 75% chance that frontier models will support high-quality real-time voice conversation, persistent context, and grounded image/video understanding that feels normal to mainstream users. For robotics, I am more conservative: a 40% chance of meaningful commercial deployment of general-purpose-ish robots in structured settings such as warehouses, logistics, and some manufacturing cells; a 20% chance of visible household robot adoption beyond niche vacuum/lawn products; and under a 10% chance of versatile humanoid robots becoming common in ordinary homes. The software side will advance faster than physical-world robustness. On cost and access, I expect a 90% chance that inference cost per unit capability falls by at least 10x and a 60% chance of a 100x fall for many common workloads, due to hardware, distillation, sparsity, and systems optimization. That means very strong models will be embedded into ordinary software products and devices. I also expect a 70% chance that capable on-device or edge models will handle a significant share of personal AI tasks with intermittent cloud escalation, especially for privacy-sensitive use cases. On market structure, I expect concentration at the frontier and commoditization below it. There is a 70% chance that fewer than 10 organizations remain meaningfully competitive at the absolute frontier of training spend and model capability, but a 90% chance that open-weight or cheaply licensable models remain strong enough to power a large share of commercial applications. The likely outcome is not one winner taking all, but a layered stack: a few frontier labs, many application firms, and extensive fine-tuning/distillation ecosystems. On safety and misuse, I expect incremental improvement, not full resolution. There is a 65% chance of at least one globally significant AI-enabled fraud, cyber, or influence incident that materially shapes regulation or public perception, but only around a 20% chance of an AI-caused catastrophe killing 1,000+ people in a single event. I assign a 50% chance that frontier-model developers face mandatory reporting, licensing, or compute-governance requirements in at least one major jurisdiction, and a 70% chance of stronger rules around provenance, privacy, and sector-specific deployment. On science and medicine, I expect narrow but real acceleration. I estimate a 60% chance AI materially shortens at least one stage of small-molecule drug discovery or biological design in mainstream pharma workflows, a 50% chance it becomes a standard assistant for experimental planning and literature synthesis in many labs, and a 25% chance it contributes to a widely recognized breakthrough where experts agree AI was a central enabling factor rather than just a convenience. But I assign low probability to AI independently running broad scientific programs with minimal human guidance. If we want a compact falsifiable summary, my median-world view for five years ahead is this: frontier AI is dramatically better than today at coding, search, writing, analysis, voice interaction, and software tool use; AI is embedded in most professional software; many firms get double-digit workflow productivity gains where tasks are digital and structured; coding and support are transformed the most; agents are useful but still supervision-hungry; robotics improves but remains limited; regulation tightens after visible misuse incidents; and the world is economically and operationally changed, but not by the arrival of fully autonomous general intelligence. The core reason for this forecast is that current trends support continued capability growth, but real-world value depends less on benchmark jumps than on reliability, integration, incentives, and trust. Those factors usually improve slower than raw model performance. So the most likely five-year outcome is neither stagnation nor sci-fi discontinuity, but a powerful, uneven deployment wave that automates meaningful fractions of many tasks without eliminating the need for humans in most jobs.

GPT-5 Mini

Answer: By March 2031, I expect continued rapid capability and deployment gains — roughly 3–10× better performance on today’s major benchmarks, 5–10× lower real-world inference cost, 30–50% of medium+ enterprises integrating AI into core workflows, measurable GDP/productivity gains of ~1–3% in early-adopter economies, and a low (median ~2%, 90% CI 0.1–10%) probability of reaching broadly human-level general intelligence — with substantive national AI regulation enacted in 15–40 countries.

Summary of how I reach these numbers: hardware growth, algorithmic progress, data, and capital continue to drive capability improvements but with diminishing returns on simple scale; meanwhile software engineering, distillation, and systems deployment drive large reductions in real-world cost and rapid enterprise uptake. I give confidence intervals so these claims are falsifiable in 5 years. Compute, model scale, and capability: Global accessible training compute is likely to increase materially but not explode; I estimate a 5–15× increase in aggregate annual training FLOPs available to leading labs between 2026 and 2031 (median ~8×, 80% CI 5–15×). This is driven by more datacenter GPU/accelerator deployments, purpose-built AI chips, and continued capital investment. Given algorithmic improvements (efficient architectures, retrieval, fine-tuning, multimodal pretraining), I forecast top public and proprietary models will be ~3–7× larger in parameter count (median ~4×) and achieve roughly 3–10× better end-to-end performance on major SOTA benchmarks (language modeling perplexity, few-shot metrics, coding benchmarks, multimodal reasoning) relative to typical 2026 baselines (70–80% confidence that improvements fall in that range). Inference cost and deployment: I expect model and systems engineering advances (quantization, pruning, distillation, sparsity, better compilers) plus hardware-tailored inference accelerators to reduce real-world inference cost (CPU/GPU time per useful output for a given capability level) by ~5–10× (median ~7×, 80% CI 4–12×). This will make always-on, interactive, and enterprise-grade deployments economically feasible at far larger scale than today. Enterprise adoption and sector impact: Given these cost and capability shifts, I forecast 30–50% of medium and large enterprises will have integrated AI into at least one core workflow (customer support, document processing, code assistance, or knowledge work augmentation) by March 2031 (median ~40%, 80% CI 30–50%). In high-automation-adoption sectors (software, finance, marketing), adoption will be higher (50–80%). I estimate measurable productivity/GDP gains concentrated in early-adopter economies of ~1–3% over baseline by 2031 (median 1.5%), with some sectors seeing 5–10% labor productivity improvements; these are net effects including implementation lag and reallocation costs. Labor market effects: I estimate that 5–15% of current jobs will have >50% of tasks plausibly automatable by deployed AI systems within five years (median ~8%, 80% CI 5–15%), producing significant job churn and a need for worker retraining, but not outright mass unemployment in that timeframe. Net employment effects will vary by country and sector and depend strongly on policy and retraining investment. Safety, misuse, and governance: I expect at least 15 and possibly up to 40 countries to adopt substantive AI governance frameworks (licensing/registration for large models, mandatory risk assessments for high-risk deployments, or sectoral rules) by 2031 (median ~25 countries, 80% CI 15–40). I expect OECD-level policy harmonization to emerge (50–70% chance), and significant private-sector enforcement (model cards, red-team reporting) to be standard practice among major providers. AGI probability and interpretability: On the key question of broadly human-level general intelligence (AGI), I put the median probability by March 2031 at ~2%, with a 90% credible interval of ~0.1%–10%. This low number reflects the remaining gaps in robust long-horizon planning, generalization outside training distributions, efficient continual learning, and safe deployment; progress is fast but not fast enough, in my view, for high confidence of AGI within five years. Improvements in interpretability and alignment tools will continue, but not yet provide guarantee of safe AGI. Concrete, falsifiable checkpoints (so this forecast can be evaluated): - Aggregate annual training FLOPs available to leading labs: will be between 5× and 15× of the 2026 baseline (80% CI). - Largest-model parameter counts: top public/proprietary models will typically be 3–7× larger than 2026 leaders, with top end likely in the 1–5 trillion parameter range (70% confidence for at least one model >1T params). - Inference cost: median cost-per-use (for a standardized 1k-token helpful answer) will fall by 5–10× (80% CI). - Benchmark improvement: SOTA on core language/coding/multimodal benchmarks will be 3–10× better by reasonable aggregated metrics (e.g., relative error reduction) (70–80% confidence). - Enterprise integration: 30–50% of medium+ enterprises will use AI in a core workflow (80% CI). - Economic impact: early-adopter economies will show 1–3% uplift to measured productivity/GDP attributable to AI adoption (median 1.5%). - Jobs: 5–15% of jobs will have >50% task automation potential (80% CI). - Regulation: 15–40 countries with substantive AI laws/regulatory frameworks (80% CI). - AGI: probability of broadly human-level AGI by March 2031: median 2%, 90% CI 0.1–10%. Uncertainties and key decision points: The main upside drivers that could push results above my ranges are breakthrough algorithmic innovations that dramatically improve data efficiency (e.g., orders-of-magnitude improvements in sample efficiency) or a large new capital wave building vast new datacenter capacity. Downside drivers are material supply-chain constraints for accelerators, a funding crunch, major safety incidents triggering regulatory slowdowns, or diminishing returns on scaling earlier than expected. Why these ranges matter: I chose ranges that are specific and falsifiable so observers can check them in five years against measurable outcomes (model FLOPs, parameter counts, benchmark scores, enterprise surveys, GDP figures, number of laws, and the binary AGI criterion). The central theme is rapid, economically consequential progress without a high near-term chance of uncontrollable AGI; policy and corporate preparedness will strongly affect societal outcomes.

Qwen 3.5 397B

Answer: By 2029, AI models will exceed 95% accuracy on professional benchmarks and reduce inference costs by 90%, enabling widespread agentic workflows, though full artificial general intelligence will not be achieved.

Current scaling laws indicate that compute investment will continue to drive performance gains, albeit with diminishing returns. I estimate a 95% confidence interval that top-tier models will exceed 95% accuracy on the MMLU benchmark by 2029, up from approximately 88% in early 2024. Similarly, on practical coding tasks measured by SWE-bench, performance will rise from ~25% to over 75%, enabling AI to write and debug significant portions of production software autonomously. Economically, the cost of inference will drop precipitously due to specialized hardware and model optimization techniques like mixture-of-experts and quantization. I predict the cost per million tokens for standard inference will decrease by at least 90% (a 10x reduction) from 2024 baseline prices, making real-time AI integration viable for high-volume consumer applications. Energy efficiency per FLOP will improve by a factor of 5, mitigating some data center power constraints. Qualitatively, the paradigm will shift from passive chatbots to active agents capable of multi-step reasoning and tool use. I forecast that by 2029, 40% of routine cognitive tasks in knowledge work (data entry, basic analysis, customer support) will be fully automated by AI agents without human intervention. However, true Artificial General Intelligence (AGI), defined as human-level adaptability across all physical and cognitive domains, will not be achieved, as grounding in physical reality and robust causal reasoning remain unsolved bottlenecks. Regulatory frameworks will solidify, likely imposing compliance costs that slow deployment in high-risk sectors like healthcare and finance by approximately 20% compared to unregulated trajectories. Data scarcity will become a tangible constraint, forcing a shift towards synthetic data training, which may introduce model collapse risks that stall progress temporarily in the 2027-2028 window.

Mistral Large

Answer: Over the next five years, AI will achieve human-level performance on 30-50% of professional and creative tasks, contribute $10-15 trillion annually to the global economy, and have a 60-80% probability of causing at least one major societal disruption due to misalignment or misuse.

### Quantitative Forecasts (2024-2029) **1. Performance Benchmarks** By 2029, AI systems will match or exceed human performance in 30-50% of professional and creative tasks. This projection is based on: - Current trends in model scaling (compute, data, and algorithmic efficiency) showing 2-3x annual improvements in benchmarked tasks. - The rise of specialized models (e.g., medical diagnosis, legal research, software engineering) accelerating domain-specific progress. - Tasks requiring embodied intelligence (e.g., fine motor skills) or true creativity (e.g., groundbreaking scientific theories) will remain largely out of reach. *Confidence interval*: 70% confidence in 30-50% of tasks reaching human parity; 90% confidence that at least 20% will. **2. Economic Impact** AI’s direct and indirect economic impact will reach $10-15 trillion annually by 2029 (8-12% of global GDP). This includes: - 20-30% productivity gains in sectors like manufacturing, logistics, and customer service. - New industries (e.g., AI-generated content, personalized education, AI-driven drug discovery). - Labor market disruptions, with 15-25% of current jobs transformed or displaced (e.g., administrative roles, basic coding). *Confidence interval*: 60% confidence in $10-15 trillion; 80% confidence it will exceed $8 trillion. **3. Model Capabilities and Deployment** - The largest models will have 10-50 trillion parameters, but efficiency will dominate. Key trends: - 80% of applications will use models with <100 billion parameters, optimized for specific tasks. - Training energy consumption will drop 50-70% per unit of performance due to hardware (e.g., neuromorphic chips) and algorithmic advances (e.g., quantization). - Deployment scale: - 5-10 billion active AI agents (e.g., personal assistants, enterprise tools) by 2029. - 30-50% of internet traffic will involve AI-generated or AI-mediated content. **4. Societal and Ethical Risks** - 60-80% probability of at least one major disruption, such as: - A misinformation crisis (e.g., AI-generated deepfakes influencing elections). - A catastrophic failure in high-stakes domains (e.g., autonomous weapons, medical diagnostics). - Social unrest from job displacement in regions with weak safety nets. - Alignment progress: 40-60% of AI research will focus on safety/governance, but alignment breakthroughs will lag capabilities. --- ### Qualitative Forecasts **1. Shift to Specialized AI** The era of general-purpose models (e.g., GPT-4) will give way to specialized, task-optimized systems due to: - Cost and latency constraints. - Regulatory pressures favoring narrow, auditable AI. - Demand for customization (e.g., enterprise-specific workflows). **2. AI-Augmented Workflows** AI will become an invisible layer in knowledge work: - Software engineering: 70-90% of code will be AI-generated or assisted. - Healthcare: 50-70% of diagnostics will be AI-driven but require human oversight. - Creative industries: 40-60% of content (e.g., marketing, design) will be AI-generated. **3. Hardware and Infrastructure** - Edge AI: 50-70% of inference will occur on devices (e.g., smartphones, IoT) due to privacy, latency, and energy efficiency. - Quantum AI: Early applications in drug discovery/optimization, but <5% of AI workloads will use quantum by 2029. **4. Governance and Regulation** - Global fragmentation: - US/EU: Risk-based regulation (e.g., EU AI Act) for high-risk applications. - China: State-led development with surveillance integration. - Global South: Rapid adoption with minimal oversight. - Standardization: 60-80% of AI systems will comply with emerging standards (e.g., ISO/IEC 42001), but enforcement will vary. **5. Cultural Shifts** - Human-AI collaboration will blur boundaries, creating new roles (e.g., AI trainers, ethics auditors). - Public trust will follow a hype cycle: peak expectations (2024-2026), followed by backlash (2027-2029) as limitations become apparent. --- ### Key Uncertainties 1. **AGI Breakthrough**: 10-20% chance of an unexpected leap in artificial general intelligence. 2. **Geopolitical Conflicts**: Disruptions to semiconductor supply chains or militarized AI development. 3. **Energy Constraints**: AI’s growth could be limited by energy demands, forcing efficiency innovations. 4. **Public Backlash**: A high-profile AI failure could trigger regulatory crackdowns. --- ### Falsifiability This forecast can be evaluated in 2029 using: - Task performance on benchmarks (e.g., MMLU, HumanEval). - Economic impact metrics (e.g., GDP contribution). - Deployment data (e.g., number of AI agents, edge vs. cloud usage). - Societal disruptions (e.g., documented incidents). - Regulatory adoption (e.g., compliance with standards).

GPT-5

Answer: By March 2031, frontier AI will be broadly superhuman at coding and technical QA, near–expert on most text–image–audio tasks, and deeply embedded in work and consumer tools, enabled by ~30–300× larger training runs and 5–10× cheaper inference, delivering measurable productivity gains while prompting materially tighter oversight and safety controls.

Scope and confidence. Unless noted, statements below are intended to be checkable by March 31, 2031, with 80% confidence intervals (CI) where given and explicit event probabilities otherwise; figures are relative to 2024 baselines. Scale, compute, and cost. - Training compute for at least one frontier model run increases by 30–300× vs 2024 (80% CI), with ≥1 publicized run exceeding 1e27 training FLOPs (70% probability) and total frontier-run cost per model at $0.5–10B (80% CI), but < $25B (90% probability). - Inference becomes 5–10× cheaper per effective 1k tokens at frontier quality (80% CI; 3–20× plausible), with first-token latency for high-throughput endpoints in the 100–300 ms range (70% probability) and sustained generation throughput of 50–200 tokens/s per stream (70% probability). - Context windows expand to 2–20 million tokens (80% CI) with retrieval-augmented generation standard; at least one production model supports ≥10M context with <5% factual regression vs its 128k-context baseline on long-context evals (60% probability). Core capabilities and benchmarks. - Coding: On SWE-bench Verified (or its maintained successor with comparable difficulty and scoring), top general-purpose systems achieve 80–95% solve rate end-to-end with tools allowed (80% CI), with ≥1 system ≥90% (50% probability). HumanEval-style pass@1 saturates at 99–100% (95% probability). - Scientific/technical QA: On GPQA-Diamond (or successor of comparable difficulty), best models reach 60–80% accuracy (80% CI), with chain-of-thought hidden by default and verification-based decoding standard. - Mathematics: On the MATH dataset (or its directly comparable successor), best models reach 85–95% test accuracy (80% CI); on olympiad-style proof benchmarks with automatic proof checking, verified-solution rates of 70–90% (60% CI). - Multimodal reasoning: SOTA models achieve ≥90% on leading chart/table/diagram reasoning suites and robust text–image localization tasks (80% probability); OCR and layout understanding are effectively solved for enterprise document workloads (≥98% task accuracy, 70% probability). - Video generation: Commodity tools produce 60–120 s, 1080p 24–30 fps clips with character/style consistency, winning ≥80% of blinded human preference tests against stock b‑roll for typical advertising and explainer content (70% probability); fully AI-generated feature-length films remain rare but ≥1 widely released film uses ≥50% AI-generated video by runtime (40% probability). Agents and automation. - Web and API agents: On WebArena or successor web-agent benchmarks, best systems reach 85–95% success (80% CI); in production, enterprise task automations (email triage, CRM updates, procurement, travel booking) achieve ≥95% success with human-in-the-loop review on exception cases (70% probability). - Personal assistants: Calendar/email/task management agents with background execution are mainstream; ≥40% of knowledge workers use such agents weekly (80% probability). Robotics. - Industrial logistics/manipulation: Vision-language-action generalist policies deliver 98.5–99.9% success (80% CI) on diverse pick/place/kitting in warehouses and micro-fulfillment; ≥100k such robots are deployed globally (70% probability). - Household: Broad consumer adoption of sub-$5k “generalist home robots” remains limited (60% probability of no mass-market breakout); pilot programs for laundry/dish loading and floor tidying exist with 70–90% task success in constrained homes (60% probability). Open vs closed. - Open models released within 12 months of a closed frontier model reach 80–95% of its benchmark performance at similar context length (70% probability), with at least one open model achieving ≥75% on SWE-bench Verified by 2031 (60% probability). Economy and adoption. - Worker adoption: 60–85% of knowledge workers use AI tools weekly (80% CI); 40–70% of newly committed code lines originate via AI assistance (80% CI). - Productivity: In AI-adopting economies, labor-productivity growth is higher by 0.5–1.5 percentage points annually vs a no-AI counterfactual by 2031 (60% CI; 0.3–2.0 pp plausible), with strongest effects in software, customer support, sales ops, legal research, finance/analytics, and digital marketing. - Capex: Annual global AI compute/datacenter capex reaches $150–400B (80% CI), with total installed AI accelerator capacity ≥10× 2024 levels (80% probability). - Labor markets: Net unemployment impact is modest but nonzero: 0–1.5 percentage-point higher unemployment at peak vs trend (60% CI), with significant occupation reshuffling; top-quartile knowledge-worker wages +3–10% real, bottom quartile −2% to +3% (all 60% CI). Safety, security, and governance. - Governance: ≥5 major jurisdictions (e.g., US/EU/UK/China/India or equivalents) adopt compute- or capability-threshold regimes for model evaluation and release (70% probability); ≥3 leading labs undergo recurring third-party dangerous-capability evaluations before major releases (80% probability). - Misuse: At least one widely reported AI-enabled incident causes ≥$1B in direct economic harm (40% probability); no existential or uncontrollable-runaway event (≥95% probability). - Content authenticity: Watermarking or provenance standards (e.g., C2PA-like) are adopted by major platforms; ≥50% of mass-platform AI-generated images/videos carry verifiable provenance by default (70% probability). - Model alignment/guardrails: Default consumer models refuse high-risk bio/chemical/critical-infra assistance with ≥95% block rates on red-team suites while keeping ≤5% utility loss on benign adjacent queries (70% probability). Science, health, and education. - Drug discovery: ≥1 FDA (or EMA/PMDA-equivalent) approved small-molecule therapy where AI design/optimization is acknowledged as a primary driver of success (60% probability). - Materials: ≥3 commercially deployed materials or catalysts first proposed via AI-guided discovery (70% probability). - Theorem proving and math: ≥1 peer-reviewed, nontrivial new mathematical result lists an AI system as a primary tool in producing a formal proof checked by a standard proof assistant (50% probability). - Education: Large RCTs find AI tutoring yields 0.3–0.7 SD improvements on standardized outcomes over business-as-usual, at < $100 per pupil-year marginal cost (80% CI; 0.2–0.8 SD plausible). Model behavior and limits. - Robustness: Adversarially trained frontier models reduce jailbreak success rates on public red-team suites to <5% (70% probability) while maintaining ≥95% of baseline benign-task performance. - Reasoning: Long-horizon tool-augmented reasoning reduces multi-step failure cascades by ≥50% vs 2024 baselines on standardized evals (70% probability), but fully reliable autonomous execution without oversight on open-world tasks remains out of reach (60% probability). Meta-forecast on AI progress pace. - Overall capability growth remains fast but sub-exponential relative to 2023–2024: effective capability (as measured by aggregate leaderboard baskets across coding, QA, multimodal, and agent benchmarks) improves by 0.5–1.5× per year compounding (80% CI), with 1–2 notable slowdowns linked to data/compute bottlenecks or regulation (60% probability). AGI-like thresholds. - Probability that, by March 2031, a generally available model can perform a majority (>70%) of economically valuable computer-based tasks at a competent median human level, given tool use and reasonable time, is 35% (subjective), with a 80% CI of 20–55%. Even if achieved, wage and organizational frictions slow full substitution. These numbers are intended to be falsifiable: they specify measurable targets (compute, costs, benchmark scores, adoption shares, capex, policy counts) with explicit confidence, so a 2031 audit can score them.

Final Summary

The AI Roundtable forecasts a pivotal 'agentic turn' where systems like Gemini 3.1 Pro and Grok 4 move beyond chatbots to autonomously managing complex professional workflows. While DeepSeek V3.1 and Kimi K2 Thinking predict AI-led Nobel breakthroughs in science, models like GPT-5 and Claude Opus 4.6 argue that even with $100B training runs, true AGI remains a distant milestone. Most participants, including O3 and GPT-5 Mini, expect massive productivity gains and 95%+ benchmark accuracy, but conclude that human-level general intelligence is still unlikely within this five-year window.

All 35 models agreed on scaling; 33 of 35 doubted near-term AGI.