Tiling agents — SR2025 Agenda Snapshot
One-sentence summary: An aligned agentic system that modifies itself into an unaligned system would be bad; this agenda researches how that could occur and the infrastructure and approaches that would prevent it.
Theory of Change
Build enough theoretical foundation, through a variety of approaches, that the AI systems we create can self-modify while preserving their goals.
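The core requirement can be illustrated with a toy sketch (an assumption for illustration, not from the source): an agent accepts a proposed self-modification only if it can verify that the successor policy still satisfies its goal. Real tiling-agents work asks for proof-based trust between an agent and its successors; this sketch substitutes exhaustive checking over a finite state space. All names (`Agent`, `satisfies_goal`, `self_modify`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

Policy = Callable[[int], int]

@dataclass
class Agent:
    """Toy agent that only adopts goal-preserving self-modifications.
    Hypothetical illustration: checking is exhaustive over a finite
    state space, standing in for the proofs real tiling agents need."""
    policy: Policy
    goal: Callable[[int, int], bool]  # goal(state, action) -> satisfied?
    states: range                     # finite state space we can enumerate

    def satisfies_goal(self, policy: Policy) -> bool:
        # Verify the candidate policy meets the goal on every state.
        return all(self.goal(s, policy(s)) for s in self.states)

    def self_modify(self, successor: Policy) -> "Agent":
        # Adopt the successor only if it verifiably preserves the goal;
        # otherwise reject the modification and keep the current policy.
        if self.satisfies_goal(successor):
            return Agent(successor, self.goal, self.states)
        return self

# Example: the goal is "action equals twice the state".
agent = Agent(policy=lambda s: 2 * s,
              goal=lambda s, a: a == 2 * s,
              states=range(100))

faster = lambda s: s << 1             # equivalent implementation: accepted
broken = lambda s: 2 * s + (s == 99)  # wrong on state 99: rejected

agent2 = agent.self_modify(faster)    # adopts `faster`
agent3 = agent2.self_modify(broken)   # keeps `faster`, rejects `broken`
```

The failure mode the agenda targets is precisely the gap this sketch papers over: a real agent cannot enumerate all states, so it needs a theoretical basis for trusting its successors without exhaustive checking.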
Broad Approach
cognitivist science
Target Case
worst-case
Orthodox Problems Addressed
Value is fragile and hard to specify, Corrigibility is anti-natural, Goals misgeneralize out of distribution
Key People
Abram Demski
Funding
Estimated FTEs: 1-10
Outputs in 2025
4 items in the review. See the wiki/summaries/ entries with frontmatter agenda: tiling-agents (these were generated alongside this file from the same export).
Source
- Row in shallow-review-2025/agendas.csv (name = Tiling agents) — Shallow Review of Technical AI Safety 2025.
Related Pages
- ai-safety
- agent-foundations
- asymptotic-guarantees
- behavior-alignment-theory
- heuristic-explanations
- high-actuation-spaces
- natural-abstractions
- other-corrigibility
- the-learning-theoretic-agenda
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- Summary: AI Safety (Wikipedia) — referenced as [[ai-safety]]