Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control

Yuxin Xiao, Chaoqun Wan, Yonggang Zhang, Wenxiao Wang, Binbin Lin, Xiaofei He, … (+2 more) — 2024-11-04 — arXiv

Summary

Proposes Sparse Activation Control, a training-free method that identifies and controls specific attention heads in LLMs to simultaneously improve multiple dimensions of trustworthiness (safety, factuality, bias) through targeted activation steering.
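The idea of steering only a few attention heads can be illustrated with a minimal NumPy sketch. This summary does not describe how the paper selects heads or computes steering directions, so everything here is an assumption for illustration: the head indices are made up, the function names (`mean_difference_directions`, `steer_heads`) are hypothetical, and the difference-of-means direction is a common generic steering choice, not necessarily the authors' procedure.

```python
import numpy as np

def mean_difference_directions(pos_acts, neg_acts):
    """Per-head steering directions as a difference of mean activations.

    pos_acts, neg_acts: arrays of shape (num_prompts, num_heads, head_dim)
    holding head outputs for desired vs. undesired behavior (assumed setup).
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer_heads(head_out, directions, head_ids, alpha=1.0):
    """Shift only the selected heads; every other head is left untouched.

    head_out:   (num_heads, head_dim) per-head outputs at one token position.
    directions: (num_heads, head_dim) steering direction per head.
    head_ids:   list of distinct head indices to control (the sparse set).
    """
    out = head_out.copy()
    out[head_ids] += alpha * directions[head_ids]
    return out

rng = np.random.default_rng(0)
num_heads, head_dim = 8, 4

# Synthetic activations standing in for real model activations.
safety_dirs = mean_difference_directions(
    rng.normal(size=(16, num_heads, head_dim)),
    rng.normal(size=(16, num_heads, head_dim)),
)
bias_dirs = mean_difference_directions(
    rng.normal(size=(16, num_heads, head_dim)),
    rng.normal(size=(16, num_heads, head_dim)),
)

# Hypothetical, disjoint head sets for two trustworthiness dimensions.
safety_heads = [1, 5]
bias_heads = [2]

head_out = rng.normal(size=(num_heads, head_dim))
steered = steer_heads(head_out, safety_dirs, safety_heads)
steered = steer_heads(steered, bias_dirs, bias_heads)
```

Because the two head sets are disjoint in this toy setup, the safety and bias edits do not overwrite each other, which is the property that would let several dimensions be controlled at once.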

Key Result

Llama models were aligned with human preferences on safety, factuality, and bias at the same time by controlling a sparse set of attention heads, with the heads tied to each behavioral dimension acting independently of one another.

Source