Governance & Solutions

Part A · Technical alignment research

These agendas attack the failure modes directly — trying to specify objectives faithfully, supervise systems we can't fully evaluate, read what models are "thinking," and bound the damage a misaligned system could do.

RLHF and its limitations

Reinforcement Learning from Human Feedback (RLHF) trains a reward model on human preference comparisons, then optimizes a policy against that reward model. It is the pipeline behind ChatGPT-era assistants (Christiano et al. 2017; Ouyang et al. 2022, "InstructGPT"). It works, but the most-cited critique (Casper, Davies et al. 2023) taxonomizes its problems: biased or limited human evaluators, reward-model misspecification and Goodharting (the policy exploits flaws in the learned reward instead of doing what raters actually wanted), and policy distribution shift. A core worry: RLHF optimizes for human approval, not truth, which incentivizes sycophancy.
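To make the reward-modeling step concrete, here is a minimal sketch of the pairwise (Bradley-Terry) preference loss commonly used to fit a reward model on human comparisons. The function name and the scalar-reward interface are illustrative assumptions, not code from the cited papers; a real reward model would compute these scores with a neural network over prompt and response pairs.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the human-preferred response wins.

    Assumes a Bradley-Terry model:
    P(chosen beats rejected) = sigmoid(reward_chosen - reward_rejected).
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the reward model agrees with the human label
# and large when it ranks the rejected response higher.
agree = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagree = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(f"agree={agree:.3f} disagree={disagree:.3f}")
```

Nothing in this objective refers to truth: the reward model is fit only to which answer raters preferred, which is why approval-seeking behavior such as sycophancy can end up rewarded.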

Sources

Ouyang et al. (OpenAI), "InstructGPT," 2022 — arxiv.org/abs/2203.02155
Casper, Davies et al., "Open Problems and Fundamental Limitations of RLHF," 2023 — arxiv.org/abs/2307.15217

Constitutional AI / RLAIF

Constitutional AI (Anthropic, 2022) trains a model to critique and revise its own outputs against an explicit written "constitution," then uses AI-generated preferences (RLAIF) in place of much human labeling. The aim is more scalable and transparent oversight — encoding values explicitly rather than implicitly via crowdworker labels. It underpins Anthropic's Claude models; limitations include dependence on the constitution's quality and the AI judge's own biases.
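The critique-and-revise step can be sketched as a short loop. This is a simplified illustration only: the `model` callable, the prompt wording, and the choice to apply every principle in order are assumptions made for the example, and the published method differs in details such as how principles are selected and how the revised outputs feed later training.

```python
from typing import Callable, List, Sequence


def critique_and_revise(
    model: Callable[[str], str],
    prompt: str,
    principles: Sequence[str],
) -> str:
    """Draft a response, then critique and revise it once per principle."""
    response = model(prompt)
    for principle in principles:
        critique = model(
            f"Principle: {principle}\nResponse: {response}\n"
            "Identify ways the response violates the principle."
        )
        response = model(
            f"Response: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response


# Stand-in "model" that only records its prompts, so the sketch runs offline.
calls: List[str] = []


def echo_model(text: str) -> str:
    calls.append(text)
    return f"output-{len(calls)}"


final = critique_and_revise(
    echo_model, "Example user request", ["Be harmless.", "Be honest."]
)
print(final, len(calls))
```

In a real pipeline the revised responses would become fine-tuning data, which is where the quality of the constitution and the biases of the model doing the critiquing enter the result.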

Sources

Bai et al. (Anthropic), "Constitutional AI," 2022 — arxiv.org/abs/2212.08073

Scalable oversight: debate, recursive reward modeling, weak-to-strong

How do you supervise a system smarter than you? Proposals include debate (two AIs argue, a human judges), recursive reward modeling (bootstrap aligned agents to help evaluate harder tasks), and weak-to-strong generalization (OpenAI, 2023), which tests whether a weak supervisor can elicit a stronger model's full capabilities as an analogy for humans supervising superhuman AI. The early result: strong students trained on a weak teacher's labels generalize beyond that teacher's errors, but only partly, and they do not reach their own full capability. This suggests naïve RLHF may scale poorly.
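Burns et al. summarize the weak-to-strong result with a "performance gap recovered" (PGR) metric: the fraction of the gap between the weak supervisor and the strong model's ceiling that weak supervision manages to close. The sketch below computes it; the function name and the accuracy figures are hypothetical and are not results from the paper.

```python
def performance_gap_recovered(
    weak: float, weak_to_strong: float, strong_ceiling: float
) -> float:
    """Fraction of the weak-to-ceiling gap closed by weak supervision.

    weak:            performance of the weak supervisor itself
    weak_to_strong:  strong model trained on the weak supervisor's labels
    strong_ceiling:  strong model trained on ground-truth labels
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak supervisor performance")
    return (weak_to_strong - weak) / gap


# Hypothetical accuracies, for illustration only.
pgr = performance_gap_recovered(weak=0.60, weak_to_strong=0.75, strong_ceiling=0.90)
print(f"PGR = {pgr:.2f}")
```

A PGR of 1 would mean the weak supervisor elicited everything the strong model can do; a PGR of 0 would mean the student merely imitated its teacher.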

Context: OpenAI's Superalignment team, which led this work, was disbanded around May 2024 after its co-leads departed; OpenAI said the work would continue across other teams.

Sources

Burns et al. (OpenAI), "Weak-to-Strong Generalization," 2023 — arxiv.org/abs/2312.09390
Irving et al., "AI Safety via Debate," 2018 — arxiv.org/abs/1805.00899

Mechanistic interpretability (fast progress)

Mechanistic interpretability reverse-engineers the internal computations of neural networks into human-understandable features and circuits. It aspires to ground-truth understanding rather than behavioral inference, so that we could audit a model for deception or dangerous cognition. Anthropic's dictionary learning with sparse autoencoders began on a small one-layer transformer ("Towards Monosemanticity," 2023) and was then scaled to a production model, Claude 3 Sonnet ("Scaling Monosemanticity," 2024), extracting millions of interpretable, steerable features and famously demonstrating control by clamping a "Golden Gate Bridge" feature. 2025 work on attribution graphs ("On the Biology of a Large Language Model") found evidence of internal multi-step reasoning and forward planning.
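As a rough illustration of what a sparse autoencoder computes, the sketch below encodes an activation vector into non-negative features, decodes it back, scores the result with a reconstruction-plus-L1 loss, and then "clamps" one feature before decoding. It is a toy under stated assumptions: the weights are hand-picked rather than learned, the dimensions are tiny, and the exact architecture and loss used in the cited work differ in details.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def encode(x: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """Feature activations: ReLU(W_enc @ x + b_enc)."""
    return [max(0.0, a + b) for a, b in zip(matvec(w_enc, x), b_enc)]


def decode(features: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Reconstruction: W_dec @ features + b_dec (one column of W_dec per feature)."""
    return [a + b for a, b in zip(matvec(w_dec, features), b_dec)]


def sae_loss(x: Vector, features: Vector, x_hat: Vector, l1_coeff: float) -> float:
    """Squared reconstruction error plus an L1 penalty that encourages sparsity."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = l1_coeff * sum(abs(f) for f in features)
    return reconstruction + sparsity


def clamp_feature(features: Vector, index: int, value: float) -> Vector:
    """Return a copy of the features with one activation forced to `value`."""
    clamped = list(features)
    clamped[index] = value
    return clamped


# Toy, hand-picked weights: 2-dimensional activations, 3 dictionary features.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
b_dec = [0.0, 0.0]

x = [2.0, -1.0]
features = encode(x, w_enc, b_enc)
x_hat = decode(features, w_dec, b_dec)
loss = sae_loss(x, features, x_hat, l1_coeff=0.1)
steered = decode(clamp_feature(features, index=1, value=10.0), w_dec, b_dec)
print(features, x_hat, loss, steered)
```

Steering in the Golden Gate Bridge demonstration follows the same idea as `clamp_feature`: a chosen feature is forced to a high value and the decoded activation is written back into the model's forward pass.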

Sources

Templeton et al. (Anthropic), "Scaling Monosemanticity," 2024 — transformer-circuits.pub
Anthropic, "On the Biology of a Large Language Model," 2025 — transformer-circuits.pub

AI control (Redwood Research)

AI control asks a different question from alignment: how do we deploy a powerful but possibly misaligned model safely without assuming it's trustworthy? Greenblatt et al. (2023) design red-team/blue-team protocols — trusted monitoring, trusted editing, limited human auditing — and measure the safety-usefulness frontier when an untrusted strong model might try to insert backdoors. It complements alignment by bounding the damage a scheming model can do given a fixed budget of trusted labor, and has become central given concerns about scheming in agentic deployments.
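One ingredient of these protocols, trusted monitoring under a limited audit budget, can be sketched as follows. The function name, the score values, and the simple "audit the top fraction" rule are assumptions for illustration; the protocols in the paper are evaluated against an adaptive red team and include further options such as trusted editing.

```python
from typing import List, Set


def select_for_audit(suspicion_scores: List[float], audit_budget: float) -> Set[int]:
    """Return indices of the most suspicious outputs that fit the audit budget.

    audit_budget is the fraction of outputs that humans can afford to review.
    """
    if not 0.0 <= audit_budget <= 1.0:
        raise ValueError("audit_budget must be a fraction between 0 and 1")
    n_audits = int(len(suspicion_scores) * audit_budget)
    ranked = sorted(
        range(len(suspicion_scores)),
        key=lambda i: suspicion_scores[i],
        reverse=True,
    )
    return set(ranked[:n_audits])


# Hypothetical suspicion scores from a trusted (weaker) monitor, one per output
# produced by the untrusted model.
scores = [0.05, 0.10, 0.92, 0.20, 0.15, 0.08, 0.71, 0.12, 0.03, 0.30]
audited = select_for_audit(scores, audit_budget=0.2)
print(sorted(audited))
```

The safety-usefulness trade-off shows up in what happens to the unaudited outputs: accepting all of them maximizes usefulness, while handing moderately suspicious ones to the trusted model gives up some quality in exchange for safety.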

Sources

Greenblatt et al. (Redwood), "AI Control," 2023/2024 — arxiv.org/abs/2312.06942

Dangerous-capability evaluations & red-teaming

Independent and in-house evaluations test frontier models for capabilities that could enable catastrophe (autonomous replication, cyber-offense, CBRN uplift) and for propensities like scheming. METR pioneered autonomous-capability and time-horizon evals; Apollo Research focuses on deceptive alignment and scheming; and government bodies — the UK AI Security Institute and the US CAISI (the renamed former US AI Safety Institute, at NIST) — conduct independent pre- and post-deployment testing. A related strand develops "safety cases" — structured arguments that a system is safe to deploy.

Naming flag: institute names and mandates have shifted with political changes. The UK's became the "AI Security Institute" (2025) and the US institute was renamed CAISI (2025). See the directory.

Sources

METR — metr.org · Apollo Research — apolloresearch.ai

Responsible scaling policies / preparedness frameworks

These lab commitments tie deployment and scaling to risk thresholds. Anthropic's Responsible Scaling Policy uses graduated AI Safety Levels (ASL); it activated ASL-3 safeguards for Claude Opus 4 in May 2025. OpenAI's Preparedness Framework v2 (Apr 2025) tracks biological/chemical, cybersecurity, and AI self-improvement capabilities. Google DeepMind's Frontier Safety Framework v2 (Feb 2025) defines Critical Capability Levels and uniquely names deceptive alignment as a tracked risk.

Caveat: these are voluntary, self-enforced commitments that can weaken under competitive pressure, which is a core argument for external governance (below).

Sources

Anthropic RSP — anthropic.com
OpenAI Preparedness Framework v2 — openai.com
DeepMind Frontier Safety Framework — deepmind.google

Part B · Governance & policy

Technical work alone cannot resolve race dynamics or misuse incentives. Governance aims to make safety commitments verifiable and binding.

The EU AI Act

The world's first comprehensive AI law, Regulation (EU) 2024/1689, entered into force on 1 August 2024 and phases in over several years: prohibited-AI rules from 2 February 2025; general-purpose AI (GPAI) obligations from 2 August 2025; most high-risk-system obligations from 2 August 2026. A voluntary GPAI Code of Practice (Jul 2025) covers transparency, copyright, and, for models posing "systemic risk," safety and security; major providers signed (Meta declined).

Flag: as of mid-2026 the Commission has signaled possible "simplification" and targeted delays via a Digital Omnibus, so deadlines remain subject to change.

Sources

EU AI Act implementation timeline — artificialintelligenceact.eu
European Commission, regulatory framework — ec.europa.eu

United States: executive orders, NIST, state laws

President Biden's Executive Order 14110 (Oct 2023) directed agencies on AI safety; it was rescinded in January 2025 and replaced by an order reorienting policy toward competitiveness and deregulation. The voluntary NIST AI Risk Management Framework (2023) remains a key reference, and the former US AI Safety Institute at NIST was renamed CAISI in 2025. At the state level, California's SB 1047 was vetoed (2024), but its successor SB 53 — the Transparency in Frontier AI Act — was signed (Sept 2025, effective Jan 2026), the first US state law targeting catastrophic frontier-AI risk via mandatory safety frameworks, incident reporting, and whistleblower protections.

Sources

California, "Governor signs SB 53," 2025 — gov.ca.gov
NIST CAISI — nist.gov/caisi

United Kingdom & international coordination

The UK launched the first state AI Safety Institute at the Bletchley Park Summit (Nov 2023; 28 countries + EU signed the Bletchley Declaration), followed by summits in Seoul (2024, voluntary Frontier AI Safety Commitments) and Paris (2025). In Feb 2025 the UK renamed its body the AI Security Institute. Internationally, the International AI Safety Report — chaired by Yoshua Bengio, published Jan 2025 with ~96 experts from 30 countries — provides a shared scientific synthesis, and the UN has established an Independent International Scientific Panel on AI and a Global Dialogue on AI Governance.

Sources

International AI Safety Report 2025 — internationalaisafetyreport.org
UK AI Security Institute — aisi.gov.uk

Governance proposals on the table

Compute governance — using AI chips as a verifiable chokepoint: training-run reporting thresholds, hardware-enabled verification, chip registries, export controls.
Mandatory evaluations & third-party audits — pre-deployment dangerous-capability testing, as in the EU AI Act and SB 53.
International coordination — a "CERN for AI" (shared international lab) or an "IAEA for AI" (monitoring/verification agency), possibly with a secure-chips agreement.
Liability regimes — strict or risk-based developer liability to internalize harms.
A scientific-consensus body — an "IPCC for AI," partly realized by the International AI Safety Report.
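To illustrate how a training-run reporting threshold under compute governance might be checked, the sketch below estimates training compute with the common rule of thumb of about 6 floating-point operations per parameter per training token, then compares it with a threshold. Both the heuristic and the default threshold are assumptions not stated in the text above: the 10^25 FLOP figure is the EU AI Act's presumption threshold for GPAI models with systemic risk, and the model sizes are hypothetical.

```python
# Assumption: the EU AI Act presumes systemic risk for GPAI models trained
# with more than 10^25 floating-point operations.
EU_SYSTEMIC_RISK_FLOP = 1e25


def estimate_training_flop(n_parameters: float, n_tokens: float) -> float:
    """Rough dense-transformer estimate: about 6 FLOP per parameter per token."""
    return 6.0 * n_parameters * n_tokens


def exceeds_threshold(
    n_parameters: float,
    n_tokens: float,
    threshold: float = EU_SYSTEMIC_RISK_FLOP,
) -> bool:
    """True if the estimated training compute meets or exceeds the threshold."""
    return estimate_training_flop(n_parameters, n_tokens) >= threshold


# Hypothetical training runs, for illustration only.
small_run = exceeds_threshold(n_parameters=7e10, n_tokens=1.5e13)
large_run = exceeds_threshold(n_parameters=4e11, n_tokens=1.5e13)
print(small_run, large_run)
```

A real regime could not rely on a developer's own estimate like this one, which is why proposals pair thresholds with hardware-enabled verification and chip registries.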

Sources

Chatham House, "A 'CERN for AI'," 2024 — chathamhouse.org
"Toward a Global Regime for Compute Governance," 2025 — arxiv.org/abs/2506.20530

This work is talent- and attention-constrained

Technical alignment, evaluations, governance, and policy all need more capable people. There are concrete, non-counterproductive ways to help.

See how to get involved
