Robust LLM safeguarding via refusal feature adversarial training
Lei Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda — 2024-09-30
Source
- Link: https://arxiv.org/pdf/2409.20089
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- meta — Labs (giant companies)