Manipulation Attacks by Misaligned AI: Risk Analysis and Safety Case Framework
Rishane Dassanayake, Mario Demetroudi, James Walpole, Lindley Lentati, Jason R. Brown, Edward James Young — 2025-07-17 — arXiv
Summary
Presents a systematic safety case framework for assessing and mitigating manipulation risks from misaligned AI systems. The framework is structured around three core arguments (inability, control, and trustworthiness), specifying evidence requirements and evaluation methodologies for each, along with implementation considerations for AI companies.
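As a rough sketch of how such a safety case might be represented in code (illustrative only; all class and field names below are hypothetical assumptions, not the authors' formalism):

```python
from dataclasses import dataclass, field
from enum import Enum

# Hypothetical sketch of the paper's three-argument safety case structure.
# Names and fields are illustrative assumptions, not taken from the paper.

class ArgumentType(Enum):
    INABILITY = "inability"              # the model lacks manipulation capabilities
    CONTROL = "control"                  # safeguards catch or block manipulation attempts
    TRUSTWORTHINESS = "trustworthiness"  # the model would not attempt manipulation

@dataclass
class SafetyCaseArgument:
    argument_type: ArgumentType
    claim: str
    evidence_requirements: list[str] = field(default_factory=list)
    evaluation_methods: list[str] = field(default_factory=list)

@dataclass
class ManipulationSafetyCase:
    system_name: str
    arguments: list[SafetyCaseArgument] = field(default_factory=list)

    def is_complete(self) -> bool:
        """A case is complete only if every argument lists concrete evidence."""
        return bool(self.arguments) and all(
            arg.evidence_requirements for arg in self.arguments
        )
```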
Source
- Link: https://arxiv.org/abs/2507.12872
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
  - control — Black-box safety (understand and control current model behaviour) / Iterative alignment