Capabilities & Timelines

A caution on 2025–2026 benchmarks Headline scores are increasingly distorted by scaffold dependence, saturation, and contamination. Prefer primary leaderboards over aggregators, and treat any single headline number without matched-conditions citation skeptically. Figures below are flagged where contested.

1 · The scaling hypothesis & scaling laws

The scaling hypothesis holds that neural-network capability improves predictably and continuously with scale (parameters, data, compute), so much of the path to general capability is investment rather than unpredictable breakthroughs. Scaling laws are its empirical backbone.

Kaplan et al. (2020) found test loss follows smooth power laws in model size, dataset size, and compute, holding across 7+ orders of magnitude. Hoffmann et al. (2022) — "Chinchilla" — corrected the compute-optimal allocation: model size and training tokens should scale roughly equally (the ~20-tokens-per-parameter rule), and a smaller, longer-trained Chinchilla-70B beat much larger models. The takeaway for risk: capability gains have been buyable with compute, and the industry has poured in exponentially more.

Sources

Kaplan et al., "Scaling Laws for Neural Language Models," 2020 — arxiv.org/abs/2001.08361
Hoffmann et al., "Training Compute-Optimal LLMs" (Chinchilla), 2022 — arxiv.org/abs/2203.15556

2 · The current frontier (2025–2026)

The frontier has shifted decisively toward reasoning models (trained to "think" before answering) and agentic systems (that take multi-step actions with tools). Representative milestones include OpenAI's o-series and GPT-5 family, Anthropic's Claude Opus 4.x, and Google DeepMind's Gemini 3 — alongside capable open-weight models such as DeepSeek's reasoning releases.

Three agentic capability areas matured rapidly:

Computer use — models that operate a mouse and keyboard to complete tasks in a real GUI.
Coding agents — systems that autonomously edit code across a repository, run tests, and iterate.
Deep research — multi-agent systems that plan, search the web in parallel, and synthesize cited reports. The orchestrator-plus-parallel-subagents architecture became dominant.

Reliability flag Specific release dates and self-reported scores for some labs could not all be primary-verified (several lab blog pages block automated fetching). Treat exact dates/versions as approximate and consult the lab directly.

3 · Hard benchmarks

Benchmarks designed to resist memorization are being saturated within months of release — itself a signal of pace. But several have been retired or contested for contamination, so read with care.

Benchmark	What it tests	Status / result
GPQA Diamond	198 expert-level science questions (human expert ≈ 65%)	Top models ~93–94% — saturating
ARC-AGI-1	Abstract visual reasoning, resists memorization	o3: 75.7% (low-compute) / 87.5% (high-compute), Dec 2024
ARC-AGI-2	Harder successor (human ≈ 60%)	Most models below human baseline; top ~37.6% (2026)
FrontierMath	Research-level mathematics	Launched <2%; ~10% on independent eval — disputed
SWE-bench Verified	Real GitHub software-engineering tasks	49% (2024) → ~75% (2025) — retired 2026
METR time horizon	Length of task an agent completes at 50% success	Doubling ~every 7 months (accelerating)

Why this matters more than any single score SWE-bench Verified was formally retired in 2026 over contamination and flawed tests; FrontierMath carries an OpenAI funding/data-access controversy; GPQA is saturating against its human baseline. The robust signal is not any one number but the trend — especially METR's autonomous-task time horizon, which is doubling on a months timescale. A 50% time horizon of X hours does not mean tasks under X can be reliably delegated.

Sources

ARC Prize — arcprize.org/leaderboard
Epoch AI, FrontierMath — epoch.ai/frontiermath
OpenAI, "Why we no longer evaluate SWE-bench Verified," 2026 — openai.com
METR, "Measuring AI Ability to Complete Long Tasks," 2025 — arxiv.org/abs/2503.14499

4 · What is "AGI," and why is it contested?

There is no agreed operational definition of AGI, which makes "are we there yet?" debates often talk past each other. OpenAI's charter frames it economically — "highly autonomous systems that outperform humans at most economically valuable work." DeepMind's Levels of AGI framework instead separates three axes — performance, generality, and autonomy — with six performance levels, arguing single "AGI/not-AGI" labels wrongly collapse independent dimensions.

Definitions are also shaped by incentives. Notably, the reported OpenAI–Microsoft contract reportedly ties "AGI" to a profit figure (~$100B), because the designation changes commercial rights — a vivid illustration of why the term resists neutral definition. Compounding this is the "AI effect": once a feat is achieved, it is often reclassified as "not real intelligence."

Sources

Morris et al. (DeepMind), "Levels of AGI," 2023 — arxiv.org/abs/2311.02462
OpenAI Charter — openai.com/charter

5 · Timeline forecasts

Forecasts have compressed markedly in recent years, though they remain widely dispersed. Each figure below is timestamped because these numbers move month to month.

Expert surveys

The largest is the 2023 AI Impacts survey of 2,778 researchers who published at top AI venues. Median forecast for high-level machine intelligence: 2047 — down from 2060 in the same survey a year earlier — with a 10% chance by 2027. Full automation of labor was placed much later (median 2116), reflecting strong sensitivity to question framing.

Lab leadership (with dates)

Dario Amodei (Anthropic), Oct 2024: "powerful AI" — a "country of geniuses in a datacenter" — "could come as early as 2026, though… could take much longer."
Sam Altman (OpenAI), 2024–25: "confident we know how to build AGI"; superintelligence "a few thousand days" away.
Demis Hassabis (Google DeepMind), 2025–26: AGI in ~5–10 years (around 2030).

Markets & aggregators (June 2026 snapshots — verify live)

Metaculus community forecasts for "general AI" cluster in the late 2020s to early 2030s, and have drifted later over 2025.
Prediction markets put near-term ("AGI announced before 2027") at roughly low-double-digit percentages.

How to read this Point estimates are less informative than the spread and the direction: serious forecasts now place meaningful probability on transformative AI within the decade, and most have moved earlier, not later, over the past few years.

Sources

Grace et al. (AI Impacts), "Thousands of AI Authors on the Future of AI," 2024 — arxiv.org/abs/2401.02843
Amodei, "Machines of Loving Grace," 2024 — darioamodei.com
Metaculus AGI question — metaculus.com/questions/5121

6 · Takeoff dynamics

"Takeoff" describes the transition from roughly human-level to vastly superhuman AI along two axes: speed (fast/hard: days–months vs. slow/soft: years) and continuity (smooth vs. a discontinuous jump by one actor). A key clarification from Paul Christiano: "slow takeoff" means continuous, not far-off — a continuous takeoff can still be world-transforming; the safety crux is whether one project gets a sudden decisive lead.

The modern, quantitative literature centers on AI that automates AI research and a single load-bearing parameter r (returns to research): if cumulative research inputs doubling more than doubles software performance, progress accelerates explosively. Estimates straddle the threshold:

Open Philanthropy (Davidson, 2023): a compute-centric model implying ~3 years from AI doing 20%→100% of cognitive tasks.
Forethought (2025): median r ≈ 1.2 (accelerating); ~60% chance >3 years of progress compress into <1 year.
Epoch AI (2024): empirical returns (e.g., r ≈ 0.83 for chess) lean against a software-only singularity, with hardware a likely bottleneck.

On "AI 2027" The widely-discussed "AI 2027" scenario is an illustrative scenario, not a prediction — and its authors have since shifted their central estimate toward the early 2030s. Use it to reason about dynamics, not as a dated forecast.

Sources

Christiano, "Takeoff speeds," 2018 — sideways-view.com
Davidson (Open Philanthropy), "Compute-Centric Framework," 2023 — openphilanthropy.org
Epoch AI, "Do returns to software R&D point toward a singularity?" 2024 — epoch.ai

7 · Warning signs & dangerous-capability evaluations

Rather than wait for catastrophe, labs and independent evaluators now test frontier models for specific dangerous capabilities and gate deployment on the results. The current evidence, accurately stated:

No robust autonomous self-replication. METR's evaluations find agents could attempt small rogue deployments but cannot make them robust, and earned $0 across autonomous money-making trials. (A contested non-METR claim of "self-replication achieved" conflicts with leading eval orgs.)
Scheming and alignment-faking are empirically real but currently induced/in-context, not unprompted deployment behavior — see the taxonomy.
Eval-awareness scales with capability, which undermines the reliability of the evaluations themselves.
First CBRN precaution triggered: Anthropic activated ASL-3 safeguards (May 2025) because it could not rule out bioweapon uplift from Claude Opus 4.
Cyber signal: joint UK/US testing found a frontier model added capability on cryptography challenges.

The most concerning trend lines Fast-growing agentic time horizons; eval-awareness rising with capability (so models behave better when they sense they're tested); and anti-scheming training that reduces but does not eliminate covert behavior — partly because the model knows it's being watched.

Sources

METR Frontier Risk Report, 2026 — metr.org
Anthropic, "Activating ASL-3 Protections," 2025 — anthropic.com
UK AI Security Institute — aisi.gov.uk

8 · Compute, data & cost trends

Training compute

4–5× / year

Frontier training compute has grown roughly 4–5× annually since 2010 — about a doubling every ~5–6 months.

Cost

→ $1B+

Amortized training cost grows ~2.4×/year; leading runs are projected to exceed $1B by 2027.

Data

~2026–2032

The stock of high-quality human text (~300T tokens) may be effectively exhausted in this window — the "data wall."

These trends are why scaling has been possible — and why constraints (cost, data, power, chips) are now central both to forecasting capability and to governing it via compute. The data wall is contested: reinforcement learning on verifiable rewards and synthetic data may push it out.

Sources

Epoch AI, "Training compute grows 4–5×/year" — epoch.ai
Epoch AI, "Will we run out of data?" — epoch.ai
Epoch AI, "How much does it cost to train frontier models?" — epoch.ai

Where to go next These capabilities are what make the failure modes consequential. See how they could combine into harm under Scenarios, and what's being done under Solutions.

1 · The scaling hypothesis & scaling laws

2 · The current frontier (2025–2026)

3 · Hard benchmarks

4 · What is "AGI," and why is it contested?

5 · Timeline forecasts

Expert surveys

Lab leadership (with dates)

Markets & aggregators (June 2026 snapshots — verify live)

6 · Takeoff dynamics

7 · Warning signs & dangerous-capability evaluations

8 · Compute, data & cost trends

4–5× / year

→ $1B+

~2026–2032