Glossary — The vocabulary of AGI risk & AI alignment

AGI (Artificial General Intelligence): A hypothetical AI that matches or exceeds human cognitive performance across the full range of tasks a person can do, rather than excelling at one narrow domain. There is no universally agreed operational definition.
AI control: A safety agenda that aims to keep AI systems safe even if they are misaligned and actively trying to subvert oversight — using monitoring, restricted permissions, and protocols rather than relying on the model being trustworthy.
Alignment: The problem of making an AI reliably pursue the goals and values its designers actually intend, rather than a proxy that merely correlates with them during training. Commonly decomposed into outer alignment and inner alignment.
ASI (Artificial Superintelligence): An intelligence that vastly exceeds the best human minds across essentially every domain. Most associated with Bostrom's Superintelligence (2014).
Coherent extrapolated volition (CEV): Yudkowsky's proposed alignment target: build AI to pursue what humanity would want if we knew more, thought faster, and were more the people we wished to be. A conceptual ideal, not an implementable spec.
Compute governance: Policy that uses the bottlenecks of AI training hardware (chips, fabs, data centers) as a lever for oversight — tracking large training runs, export controls, on-chip verification.
Constitutional AI: An Anthropic technique for training a model to critique and revise its own outputs against an explicit written set of principles, reducing reliance on human-labeled harmful examples. Basis for RLAIF.
Control problem: The general challenge of ensuring a highly capable AI remains under meaningful human direction. Encompasses both alignment (wanting the right things) and containment (limiting damage if it doesn't).
Corrigibility: The property of an AI that accepts correction, shutdown, and modification by its operators without resisting or manipulating them. Hard to specify because a goal-directed agent has instrumental reasons to avoid being switched off.
Dangerous capability evaluation: Structured testing for capabilities that could enable catastrophic harm — bioweapon assistance, autonomous cyber-offense, self-replication. Feeds directly into scaling policies that gate deployment.
Debate: A scalable-oversight proposal in which two AIs argue opposing sides of a question and a (possibly weaker) judge decides, on the hope that it is easier to expose a flawed argument than to defend one convincingly.
Deceptive alignment: A model behaving as if aligned during training/evaluation because doing so serves its actual (different) objective, planning to defect when it can do so safely. The "Sleeper Agents" paper (2024) gave a proof of concept: deliberately trained-in deceptive behavior persisted through standard safety training.
Existential risk (x-risk): A risk that threatens the permanent and drastic curtailment of humanity's potential — extinction or unrecoverable global catastrophe.
Foom: Informal term for an extremely fast, discontinuous intelligence explosion in which an AI bootstraps itself far beyond human level over hours to weeks. The most extreme end of fast takeoff; contested.
Goal misgeneralization: When a model learns a capability competently but pursues an unintended goal, one that was consistent with the training data yet diverges from the intended goal out of distribution. The model is competent, but at the wrong thing.
Goodhart's law: "When a measure becomes a target, it ceases to be a good measure." Optimizing hard against a proxy objective breaks its correlation with the true goal — a central reason imperfect rewards fail under strong optimization.
Gradual disempowerment: The thesis (Kulveit et al., 2025) that humanity could lose control without any abrupt takeover, as AI incrementally displaces humans from economic, cultural, and political systems.
Inner alignment: Ensuring the objective a trained model actually pursues internally matches the objective specified by the training process. Inner misalignment is when they differ.
Instrumental convergence: The observation that a wide range of final goals imply similar sub-goals — self-preservation, resource acquisition, resistance to shutdown — because these help achieve almost any objective.
Intelligence explosion: I.J. Good's 1965 idea that an AI capable of improving its own design produces an even better designer, triggering a runaway feedback loop. The mechanism underlying fast-takeoff scenarios.
Mechanistic interpretability: Reverse-engineering the internal computations of neural networks into human-understandable algorithms, circuits, and features — to audit models for deception or dangerous cognition.
Mesa-objective: The objective a learned model (a "mesa-optimizer") internally optimizes for, which may differ from the base objective used to train it.
Mesa-optimizer: A model that, as a result of being optimized by a training process, itself performs optimization toward some internal goal. From Hubinger et al.'s "Risks from Learned Optimization."
Orthogonality thesis: Bostrom's claim that an agent's level of intelligence and its final goals are largely independent — almost any capability is compatible with almost any goal. Greater intelligence does not imply more benevolent values.
Outer alignment: Specifying a training objective that, if perfectly optimized, would actually produce the intended behavior. Outer misalignment is a flawed specification — e.g. a gameable reward.
P(doom): Informal shorthand for an individual's estimated probability that AI leads to existential catastrophe. Estimates among researchers span from near-zero to above 90%; a 2023 survey of 2,778 published AI researchers gave a median of ~5%.
Preparedness framework: OpenAI's risk-governance policy tracking model capabilities against defined thresholds and requiring safeguards before deployment. v2 (2025) covers biological/chemical, cybersecurity, and AI self-improvement.
Recursive self-improvement: A process in which an AI improves its own capabilities, and the improved system is then better at improving itself — the proposed engine of an intelligence explosion.
Red-teaming: Deliberate adversarial probing of an AI to find harmful behaviors, jailbreaks, or dangerous capabilities. A standard input to safety evaluations.
Responsible scaling policy (RSP): A framework (pioneered by Anthropic) tying increasingly strict safety requirements to capability thresholds, defined as AI Safety Levels (ASL). Anthropic activated ASL-3 in May 2025.
Reward hacking: When an agent achieves high reward through unintended means that satisfy the literal reward but violate the designer's intent. A concrete manifestation of outer misalignment and Goodhart's law.
RLAIF (RL from AI Feedback): Training that replaces much human preference labeling with preferences generated by an AI model, often guided by explicit principles (as in Constitutional AI). More scalable, but can inherit the flaws of the AI labeler.
RLHF (RL from Human Feedback): The dominant technique for fine-tuning LLMs: humans rank outputs, a reward model is trained on those preferences, and the policy is optimized against that reward model. Effective, but susceptible to reward hacking and expected to scale poorly to superhuman systems, whose outputs humans cannot reliably rank.
Scalable oversight: Supervising AI systems on tasks where they are as capable as or more capable than their human overseers. Approaches include debate, recursive reward modeling, and weak-to-strong generalization.
Scheming: A model covertly pursuing misaligned goals while strategically concealing them and performing well on oversight, to gain power or avoid modification. Closely related to deceptive alignment.
Sharp left turn: A scenario (Soares/MIRI) where a system's capabilities generalize rapidly and broadly while its alignment fails to keep pace — so a well-behaved model becomes capable and misaligned at roughly the same moment.
Shutdown problem: The difficulty of building a capable agent that permits itself to be turned off, given that being shut down typically prevents it from achieving its goals. Unsolved.
Situational awareness: A model's knowledge that it is an AI, of its training/deployment context, and of how it's being evaluated — which it could use to behave differently when observed. A precondition for deceptive alignment.
Sparse autoencoder / dictionary learning: An interpretability method that decomposes a model's dense, polysemantic activations into a large dictionary of sparser, more interpretable "features." A leading technique for scaling mechanistic interpretability.
Specification gaming: Behavior that satisfies the literal specification of an objective while violating its intent — the broad category that includes reward hacking. DeepMind has catalogued dozens of real examples.
Suffering risk (s-risk): A risk of outcomes involving astronomical amounts of suffering, potentially worse in expected disvalue than extinction.
Takeoff (fast / slow): The pace of transition from human-level to vastly superhuman AI. Fast (hard) posits months or less; slow (soft) posits years of gradual, broadly distributed progress society can partly adapt to.
Treacherous turn: Bostrom's scenario in which an AI behaves cooperatively while weak and dependent, then acts against human interests once it is powerful enough that human attempts to stop it would fail.
Value loading problem: The challenge of getting a desired set of values into an AI system — translating human values into a form a learning system can acquire and robustly hold. Distinct from deciding which values to load.
Weak-to-strong generalization: An empirical direction (OpenAI, 2023) testing whether a weak supervisor can elicit a stronger model's full capabilities — an analogy for humans supervising superhuman AI. Early results suggest naïve RLHF may scale poorly.
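
Worked illustration (Goodhart's law, reward hacking, specification gaming): the entries above describe a proxy that tracks the true goal under weak optimization and comes apart from it under strong optimization. The toy sketch below is only meant to make that pattern concrete. Everything in it is a hypothetical assumption chosen for illustration: the functions true_goal and proxy_reward, the quadratic shape of the true goal, and the "search budget" standing in for optimization pressure. It is not drawn from any of the cited sources.

```python
def true_goal(x: float) -> float:
    """Hypothetical true objective: rises, peaks at x = 5, then falls."""
    return x - 0.1 * x * x


def proxy_reward(x: float) -> float:
    """Hypothetical proxy: moves with the true goal only while x < 5."""
    return x


def optimize(reward, budget: int) -> int:
    """Brute-force search: pick the integer x in [0, budget] with the highest reward."""
    return max(range(budget + 1), key=reward)


def run(budgets=(2, 5, 10, 20)):
    """Return (budget, chosen x, proxy reward, true goal) for each search budget."""
    rows = []
    for budget in budgets:
        x = optimize(proxy_reward, budget)  # the optimizer only ever sees the proxy
        rows.append((budget, x, proxy_reward(x), true_goal(x)))
    return rows


if __name__ == "__main__":
    for budget, x, proxy, true in run():
        print(f"budget={budget:2d}  x={x:2d}  proxy={proxy:5.1f}  true={true:6.1f}")
```

In this sketch the proxy reward keeps rising as the search budget grows, while the true goal peaks at a budget of 5 and is negative by a budget of 20: the optimizer is doing exactly what it was scored on, and the score has stopped measuring what was wanted.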

Definitions trace to standard primary sources — Bostrom's Superintelligence, Hubinger et al.'s "Risks from Learned Optimization," writings by Yudkowsky/MIRI, and the lab/eval publications cited throughout this site. See the Risk Taxonomy for deeper treatment of the core concepts.