Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control
Yuxin Xiao, Chaoqun Wan, Yonggang Zhang, Wenxiao Wang, Binbin Lin, Xiaofei He, … (+2 more) — 2024-11-04 — arXiv
Summary
Proposes Sparse Activation Control, a training-free method that identifies and controls specific attention heads in LLMs to simultaneously improve multiple dimensions of trustworthiness (safety, factuality, bias) through targeted activation steering.
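The core operation — shifting one attention head's activation along a behavior-linked direction while leaving other heads untouched — can be sketched as below. This is an illustrative NumPy toy, not the paper's implementation: the head-splitting, the steering direction, and the scale `alpha` are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_heads(x, n_heads):
    # Toy stand-in for per-head attention outputs: split the feature
    # vector into equal per-head slices.
    return x.reshape(n_heads, -1)

def steer_head(head_outputs, head_idx, direction, alpha):
    # Sparse activation control (illustrative): shift only the chosen
    # head's activation along a behavior-linked direction, leaving the
    # remaining heads intact.
    steered = head_outputs.copy()
    steered[head_idx] += alpha * direction
    return steered

d_model, n_heads = 16, 4
x = rng.standard_normal(d_model)
heads = split_heads(x, n_heads)

# Hypothetical "trustworthiness" direction, e.g. a mean-difference of
# activations on contrasting prompts (not the paper's exact recipe).
direction = rng.standard_normal(d_model // n_heads)
direction /= np.linalg.norm(direction)

steered = steer_head(heads, head_idx=2, direction=direction, alpha=3.0)
```

Because each trustworthiness dimension is handled by a different sparse set of heads, several such shifts can be applied at once without interfering with each other.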
Key Result
Llama models were aligned with human preferences on safety, factuality, and bias concurrently by controlling sparse sets of attention heads, each of which independently affects a different behavioral dimension.
Source
- Link: https://arxiv.org/abs/2411.02461
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- activation-engineering — White-box safety (i.e. Interpretability) / Concept-based interpretability