Tag Index
alignment-techniques
Alignment Techniques
Methods for making AI systems pursue goals consistent with human intentions. The pages tagged here cover the active techniques (RLHF, Constitutional AI, IDA, debate, scalable-oversight, and weak-to-strong generalization, W2S) and the operational counterpart, control, which bounds consequences when alignment cannot be guaranteed. The cluster is what most of the safety field works on day-to-day.
The technical question that organises this pillar: how do we keep alignment intact as capability scales beyond human evaluator capacity? See superalignment for the program-level framing and outer-vs-inner-alignment for the foundational decomposition.
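As a concrete anchor for the most widely deployed technique in that list, here is a minimal sketch of the preference-modeling step at the core of RLHF: a reward model is trained so the human-preferred response scores above the rejected one (a Bradley-Terry objective). The linear reward head, embedding size, and synthetic comparison data below are illustrative placeholders, not any lab's pipeline.

```python
# Sketch of reward-model training on preference pairs (the supervised step of
# RLHF). A real setup puts a scalar head on a language-model backbone; here a
# linear layer over precomputed response embeddings stands in for it.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_embed = 256
reward_model = nn.Linear(d_embed, 1)          # placeholder for "LM backbone + scalar head"
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# synthetic batch: embeddings of (chosen, rejected) response pairs from human comparisons
chosen = torch.randn(32, d_embed)
rejected = torch.randn(32, d_embed)

for step in range(100):
    r_chosen = reward_model(chosen).squeeze(-1)
    r_rejected = reward_model(rejected).squeeze(-1)
    # Bradley-Terry objective: maximize P(chosen preferred) = sigmoid(r_chosen - r_rejected)
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The trained reward model then scores candidate outputs during a reinforcement-learning stage, which is where the "beyond human evaluator capacity" worry bites: the reward model inherits every blind spot of the human comparisons it was fit to.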
evaluations
Evaluations
Measuring whether AI systems have crossed risk-relevant capability thresholds — dangerous-capabilities, deceptive-alignment propensity, autonomy, autonomous-replication, situational awareness — before deployment. Capability evaluations are the trigger mechanism in every modern frontier-safety framework: Anthropic’s RSP, OpenAI’s Preparedness Framework, and DeepMind’s FSF all gate decisions on eval results. See capability-evaluations for the methodology, capability-evals for the SR2025 research agenda, and model-organisms-of-misalignment for the testbed methodology that validates the evals.
The pillar’s central methodological problem: behavioral evaluation has a known ceiling against strategically deceptive systems. See chain-of-thought-monitoring and the interpretability pillar for the proposed escape hatches.
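To make the gating mechanism concrete, here is a minimal sketch of the pattern the frameworks above share: run the capability evals, compare scores against pre-committed thresholds, and halt deployment or further scaling if any threshold is crossed. The capability names, scores, and threshold values are hypothetical, not taken from any published framework.

```python
# Illustrative eval-gating logic: deployment proceeds only if no risk-relevant
# capability score has crossed its pre-committed threshold.
from dataclasses import dataclass

@dataclass
class EvalResult:
    capability: str   # e.g. "autonomous-replication" (hypothetical label)
    score: float      # normalized eval score in [0, 1]
    threshold: float  # pre-committed trigger level

def deployment_gate(results: list[EvalResult]) -> tuple[bool, list[str]]:
    """Return (ok_to_deploy, capabilities whose thresholds were crossed)."""
    tripped = [r.capability for r in results if r.score >= r.threshold]
    return (not tripped, tripped)

if __name__ == "__main__":
    results = [
        EvalResult("dangerous-capabilities", score=0.31, threshold=0.50),
        EvalResult("autonomous-replication", score=0.62, threshold=0.50),
        EvalResult("situational-awareness", score=0.12, threshold=0.40),
    ]
    ok, tripped = deployment_gate(results)
    print("deploy" if ok else f"halt: thresholds crossed for {tripped}")
```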
field-infrastructure
Field Infrastructure
The institutional layer of AI safety: labs (anthropic, deepmind, openai, redwood-research, metr, miri), AI Safety Institutes (UK + US AISI and partner-network), policy and governance orgs (FLI, FHI, lawzero, cltr, cesia, acrai), and reference resources (the AI Safety Atlas, 80000-hours, lesswrong, alignment-forum, ea-forum). Together they constitute the field’s working infrastructure — the people, organisations, and venues through which AI safety actually happens.
The post-Bletchley shift (2023–2025) moved this infrastructure from research-and-advocacy into government-backed institutions; the AI Safety Summit series and the International AI Safety Report are the current high-water marks.
forecasting
Forecasting
Capability trajectories, takeoff dynamics, and timelines for transformative AI. The pillar covers METR’s autonomous-task horizon trends, Epoch AI’s compute and dataset projections, AI population dynamics, scaling-laws extrapolations, and the broader strategic question of how much adaptation time exists between dangerous capability and effective response.
Most safety strategy disagreements trace back to underlying forecasting views: slow-takeoff worlds make iterative alignment viable; fast-takeoff worlds compress the field’s decision window dramatically. See takeoff-dynamics for the four-dimensional decomposition (Speed × Continuity × Homogeneity × Polarity) and intelligence-explosion for the recursive-self-improvement mechanism.
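As a toy illustration of the scaling-laws extrapolation style mentioned above, the sketch below fits a pure power law to synthetic compute/loss points in log-log space and extrapolates it to larger budgets. Every number is invented; real projections (Epoch AI's included) rest on far more careful data and on functional forms with an irreducible-loss term.

```python
# Fit loss ≈ a * compute**(-b) by linear regression in log-log space,
# then extrapolate to larger compute budgets (all values synthetic).
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])   # training FLOP (synthetic)
loss = np.array([3.10, 2.75, 2.46, 2.21, 2.00])      # eval loss (synthetic)

# log10(loss) = intercept + slope * log10(compute), with slope = -b
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), 1)

for future_compute in (1e23, 1e24, 1e25):
    predicted = 10 ** (intercept + slope * np.log10(future_compute))
    print(f"compute {future_compute:.0e} FLOP -> predicted loss {predicted:.2f}")
```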
governance
Governance
Policies, regulations, institutions, and international coordination mechanisms for advanced AI. The cluster includes corporate-level frameworks (RSPs, OpenAI Preparedness Framework, DeepMind FSF), national regulation (EU AI Act, US executive orders, AISIs), and international coordination (Bletchley Summit, the AI Safety Institute network, the International AI Safety Report 2025). See ai-governance for the umbrella concept.
The pillar’s structural problem: AI development is global and capability-driven, but regulation is national and process-driven. Bridging that gap, while resisting race-to-the-bottom dynamics on safety, is the field’s open governance question. Power concentration is the failure mode this layer specifically addresses; technical alignment alone cannot address it.
interpretability
Interpretability
Understanding what trained neural networks are doing internally (features, circuits, attribution graphs, learned algorithms) well enough to predict generalization, audit alignment, and detect deceptive-alignment. Mechanistic interpretability is the most ambitious sub-area; activation-space methods, sparse autoencoders, and attribution graphs are the dominant current methodologies. See interpretability for the umbrella concept and reverse-engineering for the SR2025 research agenda.
This is the pillar that pursues verification beyond behavioral evaluation: the load-bearing alternative to “test the outputs and hope for the best” once models become capable of strategic deception.
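For a concrete sense of the sparse-autoencoder method named above, here is a minimal sketch, assuming PyTorch: reconstruct captured activations through a wide hidden layer with an L1 sparsity penalty, so individual hidden units tend toward interpretable features. The dimensions, penalty weight, and random stand-in activations are illustrative only.

```python
# Minimal sparse autoencoder over model activations: overcomplete ReLU
# encoder, linear decoder, reconstruction loss plus L1 sparsity penalty.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse feature codes
        return self.decoder(features), features

d_model, d_features, l1_coeff = 512, 4096, 1e-3
sae = SparseAutoencoder(d_model, d_features)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)

activations = torch.randn(1024, d_model)  # stand-in for captured residual-stream activations

for step in range(100):
    reconstruction, features = sae(activations)
    loss = ((reconstruction - activations) ** 2).mean() + l1_coeff * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```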
risk-models
Risk Models
Failure-mode taxonomy and threat-model analysis for advanced AI: deceptive-alignment, scheming, mesa-optimization, goal-misgeneralization, specification-gaming, reward-hacking, goodharts-law, instrumental-convergence, power-seeking, ai-takeover-scenarios, and the meta-arguments in ai-risk-arguments and existential-risk. The cluster supplies the threat models that the rest of the safety field’s interventions are designed to address.
This pillar is where the field’s central tension lives: the gap between behavioral evaluation (what current alignment methods rely on) and the most concerning failure modes (which involve systems that actively defeat behavioral evaluation). outer-vs-inner-alignment is the framing that locates the gap; ai-control is the operational response when the gap can’t be closed.
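As a toy numerical illustration of the goodharts-law / reward-hacking pattern in that list: an agent that greedily optimizes a proxy reward keeps scoring higher on the proxy while, past a point, the true objective the proxy was meant to track declines. The functions below are invented and correspond to no specific system.

```python
# Goodhart's law in miniature: the proxy rises monotonically with optimization
# pressure, but the true objective peaks and then degrades.
import numpy as np

effort = np.linspace(0, 10, 101)           # optimization pressure applied to the proxy
proxy_reward = effort                      # proxy keeps improving
true_value = effort - 0.15 * effort ** 2   # true objective peaks, then falls

proxy_optimum = np.argmax(proxy_reward)
true_optimum = np.argmax(true_value)

print(f"proxy optimization pushes effort to {effort[proxy_optimum]:.1f}, "
      f"where the true objective is {true_value[proxy_optimum]:.2f}")
print(f"the true objective actually peaks at effort {effort[true_optimum]:.1f} "
      f"with value {true_value[true_optimum]:.2f}")
```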