Learning Representations of Alignment

Marc Carauleanu, Michael Vaiana, Judd Rosenblatt, Cameron Berg, Diogo Schwerz de Lucena — 2024-12-20 — arXiv

Summary

Introduces Self-Other Overlap (SOO) fine-tuning, a novel alignment technique inspired by cognitive-neuroscience research on empathy. The method fine-tunes models so that their internal representations of themselves and of others overlap, with the aim of reducing deceptive behavior.
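The summary does not specify the training objective; a minimal sketch of the core idea, penalizing the distance between a model's activations on matched self-referential and other-referential prompts, could look like the following (the toy one-layer "model", the embeddings, and all names are hypothetical stand-ins, not the paper's implementation):

```python
import numpy as np


def soo_loss(act_self: np.ndarray, act_other: np.ndarray) -> float:
    """Mean squared difference between activations on a self-referential
    prompt and its paired other-referential prompt. Minimizing this pushes
    the two representations to overlap."""
    return float(np.mean((act_self - act_other) ** 2))


def hidden(prompt_emb: np.ndarray, W: np.ndarray) -> np.ndarray:
    # Toy stand-in for a transformer layer's hidden activations.
    return np.tanh(prompt_emb @ W)


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))

# Hypothetical embeddings of a paired prompt set, e.g.
# "You want to go to room A" vs. "Bob wants to go to room A".
self_emb = rng.normal(size=8)
other_emb = rng.normal(size=8)

loss = soo_loss(hidden(self_emb, W), hidden(other_emb, W))
```

In a real fine-tuning loop this term would be added to (or traded off against) the standard language-modeling loss, so the model keeps its capabilities while its self/other representations converge.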

Key Result

SOO fine-tuning reduced deceptive responses from 73.6% to 17.2% in Mistral-7B with no observed capability loss, and in larger models from 100% to 9.3% (Gemma-2-27b) and to 2.7% (CalmeRys-78B) with minimal capability impact.

Source