Tag Index
alignment-techniques
Alignment Techniques
Methods for making AI systems pursue goals consistent with human intentions. The pages tagged here cover the active techniques (RLHF, Constitutional AI, IDA, debate, scalable-oversight, and weak-to-strong generalization, W2S) and the operational counterpart, control, which bounds consequences when alignment cannot be guaranteed. The cluster is what most of the safety field works on day-to-day.
The technical question that organises this pillar: how do we keep alignment intact as capability scales beyond human evaluator capacity? See superalignment for the program-level framing and outer-vs-inner-alignment for the foundational decomposition.
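As a concrete anchor for the most widely deployed technique in that list, here is a minimal sketch of the preference-modeling step at the core of RLHF: a reward model is trained so the human-preferred response scores above the rejected one (a Bradley-Terry objective). The linear reward head, embedding size, and synthetic comparison data below are illustrative placeholders, not any lab's pipeline.

```python
# Sketch of reward-model training on preference pairs (the supervised step of
# RLHF). A real setup puts a scalar head on a language-model backbone; here a
# linear layer over precomputed response embeddings stands in for it.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_embed = 256
reward_model = nn.Linear(d_embed, 1)          # placeholder for "LM backbone + scalar head"
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# synthetic batch: embeddings of (chosen, rejected) response pairs from human comparisons
chosen = torch.randn(32, d_embed)
rejected = torch.randn(32, d_embed)

for step in range(100):
    r_chosen = reward_model(chosen).squeeze(-1)
    r_rejected = reward_model(rejected).squeeze(-1)
    # Bradley-Terry objective: maximize P(chosen preferred) = sigmoid(r_chosen - r_rejected)
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The trained reward model then scores candidate outputs during a reinforcement-learning stage, which is where the "beyond human evaluator capacity" worry bites: the reward model inherits every blind spot of the human comparisons it was fit to.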
evaluations
Evaluations
Measuring whether AI systems have crossed risk-relevant capability thresholds — dangerous-capabilities, deceptive-alignment propensity, autonomy, autonomous-replication, situational awareness — before deployment. Capability evaluations are the trigger mechanism in every modern frontier-safety framework: Anthropic’s RSP, OpenAI’s Preparedness Framework, and DeepMind’s FSF all gate decisions on eval results. See capability-evaluations for the methodology, capability-evals for the SR2025 research agenda, and model-organisms-of-misalignment for the testbed methodology that validates the evals.
The pillar’s central methodological problem: behavioral evaluation has a known ceiling against strategically deceptive systems. See chain-of-thought-monitoring and the interpretability pillar for the proposed escape hatches.
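To make the gating mechanism concrete, here is a minimal sketch of the pattern the frameworks above share: run the capability evals, compare scores against pre-committed thresholds, and halt deployment or further scaling if any threshold is crossed. The capability names, scores, and threshold values are hypothetical, not taken from any published framework.

```python
# Illustrative eval-gating logic: deployment proceeds only if no risk-relevant
# capability score has crossed its pre-committed threshold.
from dataclasses import dataclass

@dataclass
class EvalResult:
    capability: str   # e.g. "autonomous-replication" (hypothetical label)
    score: float      # normalized eval score in [0, 1]
    threshold: float  # pre-committed trigger level

def deployment_gate(results: list[EvalResult]) -> tuple[bool, list[str]]:
    """Return (ok_to_deploy, capabilities whose thresholds were crossed)."""
    tripped = [r.capability for r in results if r.score >= r.threshold]
    return (not tripped, tripped)

if __name__ == "__main__":
    results = [
        EvalResult("dangerous-capabilities", score=0.31, threshold=0.50),
        EvalResult("autonomous-replication", score=0.62, threshold=0.50),
        EvalResult("situational-awareness", score=0.12, threshold=0.40),
    ]
    ok, tripped = deployment_gate(results)
    print("deploy" if ok else f"halt: thresholds crossed for {tripped}")
```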
field-infrastructure
Field Infrastructure
The institutional layer of AI safety: labs (anthropic, deepmind, openai, redwood-research, metr, miri), AI Safety Institutes (UK + US AISI and partner-network), policy and governance orgs (FLI, FHI, lawzero, cltr, cesia, acrai), and reference resources (the AI Safety Atlas, 80000-hours, lesswrong, alignment-forum, ea-forum). Together they constitute the field’s working infrastructure — the people, organisations, and venues through which AI safety actually happens.
The post-Bletchley shift (2023–2025) moved this infrastructure from research-and-advocacy into government-backed institutions; the AI Safety Summit series and the International AI Safety Report are the current high-water marks.
forecasting
Forecasting
Capability trajectories, takeoff dynamics, and timelines for transformative AI. The pillar covers METR’s autonomous-task horizon trends, Epoch AI’s compute and dataset projections, AI population dynamics, scaling-laws extrapolations, and the broader strategic question of how much adaptation time exists between dangerous capability and effective response.
Most safety strategy disagreements trace back to underlying forecasting views: slow-takeoff worlds make iterative alignment viable; fast-takeoff worlds compress the field’s decision window dramatically. See takeoff-dynamics for the four-dimensional decomposition (Speed × Continuity × Homogeneity × Polarity) and intelligence-explosion for the recursive-self-improvement mechanism.
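As a toy illustration of the scaling-laws extrapolation style mentioned above, the sketch below fits a pure power law to synthetic compute/loss points in log-log space and extrapolates it to larger budgets. Every number is invented; real projections (Epoch AI's included) rest on far more careful data and on functional forms with an irreducible-loss term.

```python
# Fit loss ≈ a * compute**(-b) by linear regression in log-log space,
# then extrapolate to larger compute budgets (all values synthetic).
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])   # training FLOP (synthetic)
loss = np.array([3.10, 2.75, 2.46, 2.21, 2.00])      # eval loss (synthetic)

# log10(loss) = intercept + slope * log10(compute), with slope = -b
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), 1)

for future_compute in (1e23, 1e24, 1e25):
    predicted = 10 ** (intercept + slope * np.log10(future_compute))
    print(f"compute {future_compute:.0e} FLOP -> predicted loss {predicted:.2f}")
```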
governance
Governance
Policies, regulations, institutions, and international coordination mechanisms for advanced AI. The cluster includes corporate-level frameworks (RSPs, OpenAI Preparedness Framework, DeepMind FSF), national regulation (EU AI Act, US executive orders, AISIs), and international coordination (Bletchley Summit, the AI Safety Institute network, the International AI Safety Report 2025). See ai-governance for the umbrella concept.
The pillar’s structural problem: AI development is global and capability-driven, but regulation is national and process-driven. Bridging that gap, while resisting race-to-the-bottom dynamics on safety, is the field’s open governance question. Power concentration is the failure mode this layer specifically addresses; technical alignment alone cannot address it.
interpretability
Interpretability
Understanding what trained neural networks are doing internally (features, circuits, attribution graphs, learned algorithms) well enough to predict generalization, audit alignment, and detect deceptive-alignment. Mechanistic interpretability is the most ambitious sub-area; activation-space methods, sparse autoencoders, and attribution graphs are the dominant current methodologies. See interpretability for the umbrella concept and reverse-engineering for the SR2025 research agenda.
This is the pillar that pursues verification beyond behavioral evaluation: the load-bearing alternative to “test the outputs and hope for the best” once models become capable of strategic deception.
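For a concrete sense of the sparse-autoencoder method named above, here is a minimal sketch, assuming PyTorch: reconstruct captured activations through a wide hidden layer with an L1 sparsity penalty, so individual hidden units tend toward interpretable features. The dimensions, penalty weight, and random stand-in activations are illustrative only.

```python
# Minimal sparse autoencoder over model activations: overcomplete ReLU
# encoder, linear decoder, reconstruction loss plus L1 sparsity penalty.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse feature codes
        return self.decoder(features), features

d_model, d_features, l1_coeff = 512, 4096, 1e-3
sae = SparseAutoencoder(d_model, d_features)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)

activations = torch.randn(1024, d_model)  # stand-in for captured residual-stream activations

for step in range(100):
    reconstruction, features = sae(activations)
    loss = ((reconstruction - activations) ** 2).mean() + l1_coeff * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```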
risk-models
Risk Models
Failure-mode taxonomy and threat-model analysis for advanced AI: deceptive-alignment, scheming, mesa-optimization, goal-misgeneralization, specification-gaming, reward-hacking, goodharts-law, instrumental-convergence, power-seeking, ai-takeover-scenarios, and the meta-arguments in ai-risk-arguments and existential-risk. The cluster supplies the threat models that the rest of the safety field’s interventions are designed to address.
This pillar is where the field’s central tension lives: the gap between behavioral evaluation (what current alignment methods rely on) and the most concerning failure modes (which involve systems that actively defeat behavioral evaluation). outer-vs-inner-alignment is the framing that locates the gap; ai-control is the operational response when the gap can’t be closed.
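As a toy numerical illustration of the goodharts-law / reward-hacking pattern in that list: an agent that greedily optimizes a proxy reward keeps scoring higher on the proxy while, past a point, the true objective the proxy was meant to track declines. The functions below are invented and correspond to no specific system.

```python
# Goodhart's law in miniature: the proxy rises monotonically with optimization
# pressure, but the true objective peaks and then degrades.
import numpy as np

effort = np.linspace(0, 10, 101)           # optimization pressure applied to the proxy
proxy_reward = effort                      # proxy keeps improving
true_value = effort - 0.15 * effort ** 2   # true objective peaks, then falls

proxy_optimum = np.argmax(proxy_reward)
true_optimum = np.argmax(true_value)

print(f"proxy optimization pushes effort to {effort[proxy_optimum]:.1f}, "
      f"where the true objective is {true_value[proxy_optimum]:.2f}")
print(f"the true objective actually peaks at effort {effort[true_optimum]:.1f} "
      f"with value {true_value[true_optimum]:.2f}")
```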