Model Integrity

Joe Edelman, Oliver Klingefjord — 2024-12-05 — Meaning Alignment Institute — Substack

Summary

Proposes ‘model integrity’ (values-based predictability) as an alternative to compliance-based alignment, and demonstrates WiseLLaMa-8B, an 8B-parameter model fine-tuned with structured values cards that generates responses with inline value annotations.

Key Result

A user study (n=78) found that 77% of participants judged the model's responses to adhere to the stated values, 94% could predict which response came from the values-trained model, and 74% reported increased trust.

Source