Model Integrity
Joe Edelman, Oliver Klingefjord — 2024-12-05 — Meaning Alignment Institute — Substack
Summary
Proposes ‘model integrity’ (values-based predictability) as an alternative to compliance-based alignment, and demonstrates WiseLLaMa-8B, an 8B model fine-tuned on structured values cards to generate responses with inline value annotations.
Key Result
A user study (n=78) found that 77% of participants judged model responses to adhere to the stated values, 94% could identify which response came from the values-trained model, and 74% reported increased trust.
Source
- Link: https://meaningalignment.substack.com/p/model-integrity
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
  - aligning-to-context — Multi-agent first