Natural abstractions — SR2025 Agenda Snapshot

One-sentence summary: Develop a theory of concepts that explains how they are learned, how they structure a particular system’s understanding, and how mutual translatability can be achieved between different collections of concepts.
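
For concreteness, here is a hedged sketch of one formalization from this agenda's own "natural latents" line of work (Wentworth & Lorell); the notation is ours, not from this page. A concept is modeled as a latent variable that both mediates the observables and is redundantly encoded across them:

```latex
% Sketch: a latent \Lambda over observables X_1, \dots, X_n is "natural" when
\begin{align*}
  \text{(mediation)}\quad  & P(X_1,\dots,X_n \mid \Lambda)
      = \textstyle\prod_{i} P(X_i \mid \Lambda)
      && \text{the } X_i \text{ are independent given } \Lambda \\
  \text{(redundancy)}\quad & H(\Lambda \mid X_{\neg i}) \approx 0
      \;\; \text{for each } i
      && \Lambda \text{ survives dropping any one } X_i
\end{align*}
% Any two latents satisfying both conditions over the same observables are
% (approximately) informationally equivalent; this equivalence is what makes
% "mutual translatability" between different systems' concepts plausible.
```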

Theory of Change

Identify the concepts that structure a system’s understanding, then use them to inspect its “alignment/safety properties” and/or “retarget its search”, i.e. locate utility-function-like components inside an AI and replace calls to them with calls to “user values” (represented using the AI’s existing internal abstractions).
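
A minimal toy sketch of the "retarget the search" move, assuming interpretability has already cleanly located the internal utility component; the sketch and all names in it are hypothetical, not from the source:

```python
# Toy model: an agent whose behavior factors into generic search plus an
# internal utility-function-like component. "Retargeting" swaps that
# component for a user-value function expressed over the agent's own
# abstractions (here, the shared Plan type).

from dataclasses import dataclass
from typing import Callable, Iterable

Plan = tuple[str, ...]            # a candidate course of action
Utility = Callable[[Plan], float]

@dataclass
class Agent:
    utility: Utility              # the utility-function-like component

    def search(self, candidates: Iterable[Plan]) -> Plan:
        # Generic search: return whichever candidate the internal
        # utility scores highest. Retargeting leaves this untouched.
        return max(candidates, key=self.utility)

def retarget(agent: Agent, user_values: Utility) -> None:
    # Replace calls to the agent's internal utility with calls to user
    # values, represented using the agent's existing abstractions.
    agent.utility = user_values

# Usage: the agent initially maximizes paperclips; after retargeting,
# the same search machinery maximizes a stand-in user value instead.
agent = Agent(utility=lambda plan: plan.count("make_paperclip"))
candidates = [("make_paperclip",) * 3, ("help_user", "help_user")]
assert agent.search(candidates) == ("make_paperclip",) * 3

retarget(agent, user_values=lambda plan: plan.count("help_user"))
assert agent.search(candidates) == ("help_user", "help_user")
```

The hard parts this sketch assumes away are exactly the agenda's research targets: finding the utility-like component inside an opaque system, and expressing user values in the system's learned abstractions.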

Broad Approach

cognitive

Target Case

worst-case

Orthodox Problems Addressed

Instrumental convergence, Superintelligence can fool human supervisors, Humans cannot be first-class parties to a superintelligent value handshake

Key People

John Wentworth, Paul Colognese, David Lorell, Sam Eisenstat, Fernando Rosas

Funding

Estimated FTEs: 1-10

Critiques

Chan et al. (2023); Soto, Harwood & Soares (2023)

See Also

causal-abstractions, representational alignment, convergent abstractions, feature universality, Platonic representation hypothesis, microscope AI

Outputs in 2025

10 items in the review. See the wiki/summaries/ entries with frontmatter agenda: natural-abstractions; they were generated alongside this file from the same export.

Sources cited

Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; to update them, re-run the script rather than editing by hand.