Model Integrity

Joe Edelman, Oliver Klingefjord — 2024-12-05 — Meaning Alignment Institute — Substack

Summary

Proposes ‘model integrity’ (values-based predictability) as an alternative to compliance-based alignment, and demonstrates WiseLLaMa-8B, an 8B-parameter model that is fine-tuned with structured values cards and generates responses with inline value annotations.

Key Result

A user study (n=78) showed that 77% of participants found the model's responses adhered to its stated values, 94% could predict which response came from the values-trained model, and 74% reported increased trust.

Source

Link: https://meaningalignment.substack.com/p/model-integrity
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- aligning-to-context — Multi-agent first
