AI Safety Compendium

Robust LLM safeguarding via refusal feature adversarial training

27 Apr 2026, 1 min read

Lei Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda — 2024-09-30

Source

Link: https://arxiv.org/pdf/2409.20089
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- meta — Labs (giant companies)

Related Pages

meta
