Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control
Yuxin Xiao, Chaoqun Wan, Yonggang Zhang, Wenxiao Wang, Binbin Lin, Xiaofei He, … (+2 more) — 2024-11-04 — arXiv
Summary
Proposes Sparse Activation Control, a training-free method that identifies and controls specific attention heads in LLMs to simultaneously improve multiple dimensions of trustworthiness (safety, factuality, bias) through targeted activation steering.
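The core operation — shifting one attention head's activation along a behavior-linked direction while leaving other heads untouched — can be sketched as below. This is an illustrative NumPy toy, not the paper's implementation: the head-splitting, the steering direction, and the scale `alpha` are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_heads(x, n_heads):
    # Toy stand-in for per-head attention outputs: split the feature
    # vector into equal per-head slices.
    return x.reshape(n_heads, -1)

def steer_head(head_outputs, head_idx, direction, alpha):
    # Sparse activation control (illustrative): shift only the chosen
    # head's activation along a behavior-linked direction, leaving the
    # remaining heads intact.
    steered = head_outputs.copy()
    steered[head_idx] += alpha * direction
    return steered

d_model, n_heads = 16, 4
x = rng.standard_normal(d_model)
heads = split_heads(x, n_heads)

# Hypothetical "trustworthiness" direction, e.g. a mean-difference of
# activations on contrasting prompts (not the paper's exact recipe).
direction = rng.standard_normal(d_model // n_heads)
direction /= np.linalg.norm(direction)

steered = steer_head(heads, head_idx=2, direction=direction, alpha=3.0)
```

Because each trustworthiness dimension is handled by a different sparse set of heads, several such shifts can be applied at once without interfering with each other.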
Key Result
Llama models were aligned with human preferences on safety, factuality, and bias concurrently by controlling sparse sets of attention heads, each of which independently affects a different behavioral dimension.
Source
- Link: https://arxiv.org/abs/2411.02461
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- activation-engineering — White-box safety (i.e. Interpretability) / Concept-based interpretability