Tiling agents — SR2025 Agenda Snapshot

One-sentence summary: An aligned agentic system that modifies itself (or builds a successor) into an unaligned system would be catastrophic; this agenda researches the ways such drift could occur and the infrastructure and approaches that would prevent it.

Theory of Change

Develop, through various approaches, a theoretical foundation strong enough that the AI systems we create can self-modify while preserving their goals.
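
A minimal sketch of the core difficulty, assuming the proof-based tiling setup of Yudkowsky and Herreshoff (2013) that gives the agenda its name; the predicate Safe and the successor rule below are illustrative, not the agenda's own notation:

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Sketch of the Löbian obstacle to tiling (illustrative notation).
% Suppose agent $A_0$ takes an action only when its theory $T$ proves
% it safe. To accept a successor $A_1$ that reasons by the same rule,
% $A_0$ would need $T$ to trust its own provability predicate $\Box_T$
% (stated informally; free variables inside the box elide Gödel coding):
\[
  T \vdash \forall a \,\bigl( \Box_T \mathrm{Safe}(a) \rightarrow \mathrm{Safe}(a) \bigr).
\]
% Löb's theorem rules this self-trust out except trivially: for any
% sentence $\varphi$,
\[
  T \vdash \Box_T \varphi \rightarrow \varphi
  \quad\Longrightarrow\quad
  T \vdash \varphi ,
\]
% so $T$ endorses its own soundness only on statements it already
% proves. A theory of goal-preserving self-modification must route
% around this obstacle (e.g.\ via weakened successor theories or
% logical-induction-style reasoning).
\end{document}
```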

Broad Approach

cognitivist science

Target Case

worst-case

Orthodox Problems Addressed

Value is fragile and hard to specify; Corrigibility is anti-natural; Goals misgeneralize out of distribution

Key People

Abram Demski

Funding

Estimated FTEs: 1-10

See Also

agent-foundations

Outputs in 2025

4 items in the review; see the wiki/summaries/ entries with frontmatter agenda: tiling-agents, which were generated alongside this file from the same export.

Sources cited

Primary URLs harvested from this page's summary references. This list is auto-generated by scripts/backfill_citations.py; update it by re-running the script, not by hand.