Manipulation Attacks by Misaligned AI: Risk Analysis and Safety Case Framework
Rishane Dassanayake, Mario Demetroudi, James Walpole, Lindley Lentati, Jason R. Brown, Edward James Young — 2025-07-17 — arXiv
Summary
Presents a systematic safety case framework for assessing and mitigating manipulation risks from misaligned AI systems. The framework is structured around three core arguments (inability, control, and trustworthiness), specifying evidence requirements and evaluation methodologies for each, along with implementation considerations for AI companies.
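As a rough sketch of how such a safety case might be represented in code (illustrative only; all class and field names below are hypothetical assumptions, not the authors' formalism):

```python
from dataclasses import dataclass, field
from enum import Enum

# Hypothetical sketch of the paper's three-argument safety case structure.
# Names and fields are illustrative assumptions, not taken from the paper.

class ArgumentType(Enum):
    INABILITY = "inability"              # the model lacks manipulation capabilities
    CONTROL = "control"                  # safeguards catch or block manipulation attempts
    TRUSTWORTHINESS = "trustworthiness"  # the model would not attempt manipulation

@dataclass
class SafetyCaseArgument:
    argument_type: ArgumentType
    claim: str
    evidence_requirements: list[str] = field(default_factory=list)
    evaluation_methods: list[str] = field(default_factory=list)

@dataclass
class ManipulationSafetyCase:
    system_name: str
    arguments: list[SafetyCaseArgument] = field(default_factory=list)

    def is_complete(self) -> bool:
        """A case is complete only if every argument lists concrete evidence."""
        return bool(self.arguments) and all(
            arg.evidence_requirements for arg in self.arguments
        )
```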
Source
- Link: https://arxiv.org/abs/2507.12872
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
  - control — Black-box safety (understand and control current model behaviour) / Iterative alignment