Natural abstractions — SR2025 Agenda Snapshot
One-sentence summary: Develop a theory of concepts that explains how they are learned, how they structure a particular system’s understanding, and how mutual translatability can be achieved between different collections of concepts.
Theory of Change
Understand the concepts with which a system's understanding is structured and use them to inspect its "alignment/safety properties" and/or "retarget its search", i.e. identify utility-function-like components inside an AI and replace calls to them with calls to "user values" (represented using the AI's existing abstractions).
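The following is a minimal, purely illustrative sketch of the "retarget the search" idea, assuming we have already located a submodule that behaves like a utility function over candidate plans; all names (ToyPlanner, retarget_search, the user value function) are hypothetical and not taken from the agenda itself.

```python
# Hypothetical sketch: swap an identified utility-function-like component
# for a scorer expressing "user values" in the model's own concept space.
import torch
import torch.nn as nn


class ToyPlanner(nn.Module):
    """A toy 'search over plans': encode plans into concepts, score them, pick the best."""

    def __init__(self, plan_dim=8, concept_dim=4):
        super().__init__()
        self.encoder = nn.Linear(plan_dim, concept_dim)  # learned "abstractions" over plans
        self.utility = nn.Linear(concept_dim, 1)         # the utility-function-like component

    def forward(self, candidate_plans):
        concepts = self.encoder(candidate_plans)          # (n_plans, concept_dim)
        scores = self.utility(concepts).squeeze(-1)       # (n_plans,)
        return candidate_plans[scores.argmax()]           # "search": return the top-scoring plan


def retarget_search(planner, user_value_fn):
    """Replace calls to the internal utility module with calls to user values,
    expressed over the same concept representation the model already uses."""

    class UserValues(nn.Module):
        def forward(self, concepts):
            return user_value_fn(concepts)

    planner.utility = UserValues()
    return planner


if __name__ == "__main__":
    planner = ToyPlanner()
    plans = torch.randn(16, 8)
    print("original choice:", planner(plans))
    # User values stated directly in the model's concept space, e.g. "prefer plans
    # whose first concept coordinate is small" (purely illustrative).
    retargeted = retarget_search(planner, lambda c: -c[..., 0:1])
    print("retargeted choice:", retargeted(plans))
```

The point of the sketch is only that the replacement value function consumes the model's existing concept representation rather than raw inputs, which is where a theory of natural abstractions would do the work.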
Broad Approach
cognitive
Target Case
worst-case
Orthodox Problems Addressed
Instrumental convergence, Superintelligence can fool human supervisors, Humans cannot be first-class parties to a superintelligent value handshake
Key People
John Wentworth, Paul Colognese, David Lorell, Sam Eisenstat, Fernando Rosas
Funding
Estimated FTEs: 1-10
Critiques
Chan et al. (2023), Soto, Harwood, Soares (2023)
See Also
causal-abstractions, representational alignment, convergent abstractions, feature universality, Platonic representation hypothesis, microscope AI
Outputs in 2025
10 items in the review. See the wiki/summaries/ entries with frontmatter agenda: natural-abstractions (these were generated alongside this file from the same export).
Source
- Row in shallow-review-2025/agendas.csv (name = Natural abstractions) — Shallow Review of Technical AI Safety 2025.
Related Pages
- ai-safety
- causal-abstractions
- agent-foundations
- asymptotic-guarantees
- behavior-alignment-theory
- heuristic-explanations
- high-actuation-spaces
- other-corrigibility
- the-learning-theoretic-agenda
- tiling-agents
- representation-structure-and-geometry
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- Summary: AI Safety (Wikipedia) — referenced as [[ai-safety]]