Learning Representations of Alignment
Marc Carauleanu, Michael Vaiana, Judd Rosenblatt, Cameron Berg, Diogo Schwerz de Lucena — 2024-12-20 — arXiv
Summary
Introduces Self-Other Overlap (SOO) fine-tuning, a novel alignment approach inspired by cognitive-neuroscience research on empathy. SOO fine-tuning pushes a model's internal representations of itself and of others closer together, with the goal of reducing deceptive behavior.
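At its core, the idea is to penalize the distance between hidden activations produced by paired self-referencing and other-referencing prompts. The sketch below is a hypothetical illustration, not the paper's implementation: the function name `soo_loss`, the choice of mean squared distance, and the toy activation vectors are all assumptions; in practice the activations would be hidden states extracted from the same layer of the model being fine-tuned.

```python
import numpy as np

def soo_loss(h_self, h_other):
    """Illustrative self-other overlap penalty (assumption: MSE distance).

    h_self:  activations for a self-referencing prompt
    h_other: activations for the matched other-referencing prompt
    Returns the mean squared distance; adding this term to the fine-tuning
    objective pushes the two representations toward each other.
    """
    h_self = np.asarray(h_self, dtype=float)
    h_other = np.asarray(h_other, dtype=float)
    return float(np.mean((h_self - h_other) ** 2))

# Toy paired activations (in practice: same-layer hidden states).
h_self = np.array([0.2, -0.5, 1.0])
h_other = np.array([0.1, -0.4, 0.9])
loss = soo_loss(h_self, h_other)
```

Identical representations yield zero loss, so minimizing this term alongside the usual training objective trades off task performance against self-other overlap.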
Key Result
SOO fine-tuning reduced deceptive responses from 73.6% to 17.2% in Mistral-7B with no observed capability loss, and in larger models from 100% to 9.3% (Gemma-2-27b) and 2.7% (CalmeRys-78B) with minimal capability impact.
Source
- Link: https://arxiv.org/abs/2412.16325
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- the-neglected-approaches-approach — Black-box safety (understand and control current model behaviour) / Goal robustness