Part A · Technical alignment research
These agendas attack the failure modes directly — trying to specify objectives faithfully, supervise systems we can't fully evaluate, read what models are "thinking," and bound the damage a misaligned system could do.
RLHF and its limitations
Reinforcement Learning from Human Feedback (RLHF) trains a reward model on human preference comparisons, then optimizes a policy against that reward model. This is the pipeline behind ChatGPT-era assistants (Christiano et al. 2017; Ouyang et al. InstructGPT 2022). It works well in practice, but a widely cited critique (Casper, Davies et al. 2023) taxonomizes its problems: biased and limited human evaluators, reward-model misspecification and Goodharting, and policy distribution shift. A core worry: RLHF optimizes for human approval, not truth, which incentivizes sycophancy.
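To make the reward-modeling step concrete, here is a minimal sketch of the pairwise preference loss commonly used to train reward models, assuming a Bradley-Terry model over scalar rewards. The function names are illustrative and not taken from any particular codebase; a real pipeline would compute the rewards with a neural network and minimize this loss by gradient descent.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the human-preferred response wins
    under a Bradley-Terry model: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(pairs) -> float:
    """Average loss over (reward_chosen, reward_rejected) pairs."""
    return sum(preference_loss(c, r) for c, r in pairs) / len(pairs)
```

The loss is small when the reward model scores the preferred response well above the rejected one and large when it ranks them the wrong way round. Note that the only training signal is which response the labeler preferred, which is why the approval-versus-truth worry above applies.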
Constitutional AI / RLAIF
Constitutional AI (Anthropic, 2022) trains a model to critique and revise its own outputs against an explicit written "constitution," then uses AI-generated preferences (RLAIF) in place of much human labeling. The aim is more scalable and transparent oversight — encoding values explicitly rather than implicitly via crowdworker labels. It underpins Anthropic's Claude models; limitations include dependence on the constitution's quality and the AI judge's own biases.
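The critique-and-revise step can be sketched as a short loop. This is an illustration under stated assumptions, not Anthropic's implementation: `model` is a hypothetical callable that maps a prompt string to a completion string, the prompt wording is invented, and the sketch walks through every principle in order, which is a simplification.

```python
from typing import Callable, List


def critique_and_revise(
    model: Callable[[str], str],  # hypothetical text-in, text-out model
    prompt: str,
    principles: List[str],
) -> str:
    """Draft a response, then critique and revise it once per principle."""
    response = model(prompt)
    for principle in principles:
        critique = model(
            "Critique the response against this principle.\n"
            f"Principle: {principle}\nResponse: {response}"
        )
        response = model(
            "Revise the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    return response
```

The revised responses can then serve as fine-tuning data, and the same model-as-judge idea supplies the AI-generated preference labels in the RLAIF stage.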
Scalable oversight: debate, recursive reward modeling, weak-to-strong
How do you supervise a system smarter than you? Proposals include debate (two AIs argue, a human judges), recursive reward modeling (bootstrap aligned agents to help evaluate harder tasks), and weak-to-strong generalization (OpenAI, 2023) — testing whether a weak supervisor can elicit a stronger model's full capabilities, as an analogy for humans supervising superhuman AI. The early result: strong students partly, but not fully, generalize beyond weak teachers' errors — suggesting naïve RLHF may scale poorly.
Context: OpenAI's Superalignment team, which led this work, was disbanded around May 2024 after its co-leads departed; OpenAI said the work would continue across other teams.
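The "partly, but not fully" result in weak-to-strong generalization is usually summarized with a metric the OpenAI paper calls performance gap recovered (PGR). A minimal sketch follows; the accuracy figures in the usage lines are made up for illustration and are not results from the paper.

```python
def performance_gap_recovered(
    weak: float, weak_to_strong: float, strong_ceiling: float
) -> float:
    """Fraction of the weak-to-ceiling gap that weak supervision recovers.

    weak:            accuracy of the weak supervisor
    weak_to_strong:  accuracy of the strong model trained on weak labels
    strong_ceiling:  accuracy of the strong model trained on ground truth
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak performance")
    return (weak_to_strong - weak) / gap


# Illustrative numbers only: 0.60 weak, 0.72 weak-to-strong, 0.80 ceiling.
pgr = performance_gap_recovered(0.60, 0.72, 0.80)
```

A PGR of 1 would mean the strong student fully overcame its weak teacher's errors; a PGR of 0 would mean it merely imitated them. With the made-up numbers above, about 60% of the gap is recovered.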
Mechanistic interpretability (fast progress)
Mechanistic interpretability reverse-engineers the internal computations of neural networks into human-understandable features and circuits — aspiring to ground-truth understanding rather than behavioral inference, so we could audit a model for deception or dangerous cognition. Anthropic's dictionary learning with sparse autoencoders ("Towards Monosemanticity," 2023; "Scaling Monosemanticity," 2024) extracted millions of interpretable, steerable features from a production model — famously demonstrating control by clamping a "Golden Gate Bridge" feature. 2025 work on attribution graphs ("On the Biology of a Large Language Model") revealed internal multi-step reasoning and forward planning.
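As a rough sketch of the dictionary-learning idea, the snippet below implements one common sparse-autoencoder formulation in plain Python: a ReLU encoder, a linear decoder, and a loss that trades reconstruction error against an L1 sparsity penalty. All names and the tiny hand-picked weights are assumptions for illustration; real work trains far larger dictionaries on model activations.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def encode(x, w_enc, b_enc, b_dec):
    """Feature activations: ReLU(W_enc (x - b_dec) + b_enc)."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [p + b for p, b in zip(matvec(w_enc, centered), b_enc)]
    return [max(0.0, p) for p in pre]


def decode(f, w_dec, b_dec):
    """Reconstruction: W_dec f + b_dec."""
    return [r + b for r, b in zip(matvec(w_dec, f), b_dec)]


def sae_loss(x, f, x_hat, l1_coeff):
    """Squared reconstruction error plus an L1 penalty on the features."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + l1_coeff * sum(abs(a) for a in f)


def clamp_feature(f, index, value):
    """Return a copy of f with one feature forced to a chosen value."""
    steered = list(f)
    steered[index] = value
    return steered


# Toy weights (assumed): 2-dimensional activations, 3 dictionary features.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

x = [2.0, 0.5]
f = encode(x, w_enc, b_enc, b_dec)
x_hat = decode(f, w_dec, b_dec)
steered = decode(clamp_feature(f, 2, 3.0), w_dec, b_dec)
```

Clamping works on the same principle as the "Golden Gate Bridge" demonstration: overwrite one feature's activation, decode, and feed the altered activation back into the model. Here the third feature is inactive on `x`, and forcing it on shifts the reconstruction along that feature's decoder direction.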
AI control (Redwood Research)
AI control asks a different question from alignment: how do we deploy a powerful but possibly misaligned model safely without assuming it's trustworthy? Greenblatt et al. (2023) design red-team/blue-team protocols — trusted monitoring, trusted editing, limited human auditing — and measure the safety-usefulness frontier when an untrusted strong model might try to insert backdoors. It complements alignment by bounding the damage a scheming model can do given a fixed budget of trusted labor, and has become central given concerns about scheming in agentic deployments.
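The core of trusted monitoring under a limited audit budget can be sketched as a selection rule: a weaker trusted model assigns each output a suspicion score, and the scarce human audits go to the most suspicious outputs. This is a simplified batch version with hypothetical names; protocols in the literature typically decide online, as outputs arrive, using a threshold calibrated to the budget.

```python
def select_for_audit(suspicion_scores, audit_budget):
    """Pick which outputs to send to human auditors.

    suspicion_scores: one score per untrusted-model output, assigned by a
                      trusted monitor (higher means more suspicious)
    audit_budget:     fraction of outputs humans can afford to audit
    Returns the indices of the selected outputs in ascending order.
    """
    if not 0.0 <= audit_budget <= 1.0:
        raise ValueError("audit_budget must be between 0 and 1")
    n_audits = int(len(suspicion_scores) * audit_budget)
    ranked = sorted(
        range(len(suspicion_scores)),
        key=lambda i: suspicion_scores[i],
        reverse=True,
    )
    return sorted(ranked[:n_audits])


# Made-up scores: with a 50% budget, the two most suspicious are audited.
audited = select_for_audit([0.1, 0.9, 0.3, 0.7], audit_budget=0.5)
```

Shrinking the budget makes the protocol cheaper but gives a backdoor a better chance of slipping through unaudited, which is the safety-usefulness trade-off the red-team/blue-team experiments measure.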
Dangerous-capability evaluations & red-teaming
Independent and in-house evaluations test frontier models for capabilities that could enable catastrophe (autonomous replication, cyber-offense, and chemical, biological, radiological, and nuclear (CBRN) uplift) and for propensities like scheming. METR pioneered autonomous-capability and time-horizon evals; Apollo Research focuses on deceptive alignment and scheming; and two government bodies, the UK AI Security Institute and the US Center for AI Standards and Innovation (CAISI, the renamed US AI Safety Institute at NIST), conduct independent pre- and post-deployment testing. A related strand develops "safety cases": structured arguments that a system is safe to deploy.
Naming flag: Institute names and mandates have shifted with political changes. The UK's became the "AI Security Institute" (2025) and the US institute was renamed CAISI (2025). See the directory.
Responsible scaling policies / preparedness frameworks
These lab commitments tie deployment and scaling to risk thresholds. Anthropic's Responsible Scaling Policy uses graduated AI Safety Levels (ASL); it activated ASL-3 safeguards for Claude Opus 4 in May 2025. OpenAI's Preparedness Framework v2 (Apr 2025) tracks biological/chemical, cybersecurity, and AI self-improvement capabilities. Google DeepMind's Frontier Safety Framework v2 (Feb 2025) defines Critical Capability Levels and uniquely names deceptive alignment as a tracked risk.
Caveat: These are voluntary, self-enforced commitments that can weaken under competitive pressure, which is a core argument for external governance (below).
Part B · Governance & policy
Technical work alone cannot resolve race dynamics or misuse incentives. Governance aims to make safety commitments verifiable and binding.
The EU AI Act
The world's first comprehensive AI law (Regulation 2024/1689) entered into force on 1 August 2024 and phases in over several years: prohibited-AI rules from Feb 2025; general-purpose AI (GPAI) obligations from 2 August 2025; most high-risk-system obligations from 2 August 2026. A voluntary GPAI Code of Practice (Jul 2025) covers transparency, copyright, and safety/security for models posing "systemic risk"; major providers signed (Meta declined).
Flag: As of mid-2026 the Commission has signaled possible "simplification" and targeted delays via a Digital Omnibus, so deadlines remain subject to change.
United States: executive orders, NIST, state laws
President Biden's Executive Order 14110 (Oct 2023) directed agencies on AI safety; it was rescinded in January 2025 and replaced by an order reorienting policy toward competitiveness and deregulation. The voluntary NIST AI Risk Management Framework (2023) remains a key reference, and the former US AI Safety Institute at NIST was renamed CAISI in 2025. At the state level, California's SB 1047 was vetoed (2024), but its successor SB 53 — the Transparency in Frontier AI Act — was signed (Sept 2025, effective Jan 2026), the first US state law targeting catastrophic frontier-AI risk via mandatory safety frameworks, incident reporting, and whistleblower protections.
United Kingdom & international coordination
The UK launched the first state AI Safety Institute at the Bletchley Park Summit (Nov 2023; 28 countries and the EU signed the Bletchley Declaration), followed by summits in Seoul (2024, voluntary Frontier AI Safety Commitments) and Paris (2025). In Feb 2025 the UK renamed its body the AI Security Institute. Internationally, the International AI Safety Report (chaired by Yoshua Bengio, published Jan 2025 with ~96 experts from 30 countries) provides a shared scientific synthesis, and the UN has established an Independent International Scientific Panel on AI and a Global Dialogue on AI Governance.
Governance proposals on the table
- Compute governance — using AI chips as a verifiable chokepoint: training-run reporting thresholds, hardware-enabled verification, chip registries, export controls.
- Mandatory evaluations & third-party audits — pre-deployment dangerous-capability testing, as in the EU AI Act and SB 53.
- International coordination — a "CERN for AI" (shared international lab) or an "IAEA for AI" (monitoring/verification agency), possibly with a secure-chips agreement.
- Liability regimes — strict or risk-based developer liability to internalize harms.
- A scientific-consensus body — an "IPCC for AI," partly realized by the International AI Safety Report.
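To illustrate how a training-run reporting threshold from the compute-governance item might be checked, the sketch below uses the common rule of thumb that training a dense transformer costs roughly 6 FLOP per parameter per training token. The default threshold is the EU AI Act's 10^25 FLOP presumption for GPAI models with systemic risk; the model sizes in the usage lines are hypothetical, and the 6ND estimate is an approximation, not a legal compute-accounting method.

```python
# EU AI Act presumption of systemic risk: training compute above 1e25 FLOP.
EU_AI_ACT_SYSTEMIC_RISK_FLOP = 1e25


def estimate_training_flop(n_parameters: float, n_tokens: float) -> float:
    """Rule-of-thumb training compute for a dense transformer: 6 * N * D."""
    return 6.0 * n_parameters * n_tokens


def exceeds_threshold(
    n_parameters: float,
    n_tokens: float,
    threshold_flop: float = EU_AI_ACT_SYSTEMIC_RISK_FLOP,
) -> bool:
    """True if the estimated training compute is above the threshold."""
    return estimate_training_flop(n_parameters, n_tokens) > threshold_flop


# Hypothetical runs: 70B and 400B parameters, each on 15T tokens.
small_run = exceeds_threshold(70e9, 15e12)
large_run = exceeds_threshold(400e9, 15e12)
```

Under these assumptions the 70B run comes in at about 6.3e24 FLOP and stays below the line, while the 400B run at about 3.6e25 FLOP crosses it. A bright-line rule like this is attractive for governance because it depends on quantities that can in principle be verified from hardware usage rather than on judgments about model behavior.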
This work is talent- and attention-constrained
Technical alignment, evaluations, governance, and policy all need more capable people. There are concrete, non-counterproductive ways to help.
See how to get involved