Activation engineering — SR2025 Agenda Snapshot
One-sentence summary: Programmatically modify internal model activations to steer outputs toward desired behaviors; a lightweight, interpretable supplement to fine-tuning.
Theory of Change
Test interpretability theories by intervening on activations, and gain new insights from interpretable causal interventions on representations. Or, more modestly: build one more method to stack on top of fine-tuning; slightly encourage the model to be nice, adding one more layer of defence to our bundle of partial alignment methods. A minimal sketch of the core intervention follows.
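For concreteness, here is a minimal sketch of the basic intervention: compute a steering vector as an activation difference on a contrastive prompt pair, then add it to the residual stream during generation via a forward hook. This is an ActAdd-style illustration, not the method of any one paper in this agenda; the model (GPT-2), layer index, steering coefficient, and prompt pair are all illustrative assumptions.

```python
# Sketch of activation steering: add a "love minus hate" direction to the
# residual stream of GPT-2 while generating. All constants are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

LAYER = 6    # which residual-stream block to steer (illustrative choice)
SCALE = 4.0  # steering coefficient (illustrative choice)

@torch.no_grad()
def last_token_residual(prompt: str) -> torch.Tensor:
    """Residual-stream activation after block LAYER, at the last token."""
    ids = tok(prompt, return_tensors="pt").input_ids
    hidden = model(ids, output_hidden_states=True).hidden_states
    return hidden[LAYER + 1][0, -1, :]  # hidden[i + 1] is block i's output

# Steering vector: activation difference on a contrastive prompt pair.
steer = last_token_residual("I love you") - last_token_residual("I hate you")

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; element 0 is the hidden states.
    return (output[0] + SCALE * steer.to(output[0].dtype),) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
try:
    ids = tok("The weather today is", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=30, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so the base model is left unmodified
```

The appeal is visible in the sketch: no gradient updates, negligible inference cost, and a human-readable knob (direction and scale). The reliability caveat flagged under Critiques below applies to exactly this kind of vector.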
Broad Approach
engineering / cognitive
Target Case
average
Orthodox Problems Addressed
Value is fragile and hard to specify
Key People
Runjin Chen, Andy Arditi, David Krueger, Jan Wehner, Narmeen Oozeer, Reza Bayat, Adam Karvonen, Jiuding Sun, Tim Tian Hua, Helena Casademunt, Jacob Dunefsky, Thomas Marshall
Funding
Coefficient Giving, Anthropic
Estimated FTEs: 20-100
Critiques
Understanding (Un)Reliability of Steering Vectors in Language Models
Outputs in 2025
15 items in the review. See the wiki/summaries/ entries with frontmatter agenda: activation-engineering (these were generated alongside this file from the same export).
Source
- Row in shallow-review-2025/agendas.csv (name = Activation engineering) — Shallow Review of Technical AI Safety 2025.
Related Pages
- ai-safety
- sparse-coding
- causal-abstractions
- data-attribution
- extracting-latent-knowledge
- human-inductive-biases
- learning-dynamics-and-developmental-interpretability
- lie-and-deception-detectors
- model-diffing
- monitoring-concepts
- other-interpretability
- pragmatic-interpretability
- representation-structure-and-geometry
- reverse-engineering
- character-training-and-persona-steering
- inference-time-steering
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- Summary: AI Safety (Wikipedia) — referenced as [[ai-safety]]