Safety Pretraining: Toward the Next Generation of Safe AI
Pratyush Maini, Sachin Goyal, Dylan Sam, Alex Robey, Yash Savani, Yiding Jiang, … (+4 more) — 2025-04-23 — Carnegie Mellon University — arXiv
Summary
Presents a data-centric pretraining framework that builds safety into LLMs from the start through four methods: safety filtering of web data, safety rephrasing of unsafe content, native refusal datasets (RefuseWeb and Moral Education), and harmfulness-tag annotated pretraining.
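The harmfulness-tag idea can be sketched as a simple corpus pass: documents flagged as unsafe are prefixed with a special tag token so the model learns to condition on it during pretraining. This is an illustrative sketch only, not the paper's code; the tag string and the `is_harmful` stand-in (a keyword heuristic here, where the paper would use a trained classifier) are assumptions.

```python
# Illustrative sketch of harmfulness-tag annotation (not the paper's code).
# Documents judged unsafe get a special tag prepended, so the model can
# learn to associate the tag with unsafe content during pretraining.

HARM_TAG = "<|harmful|>"  # assumed special token; the name is hypothetical

def is_harmful(doc: str) -> bool:
    """Hypothetical stand-in for a trained safety classifier."""
    # Placeholder keyword heuristic for illustration only.
    unsafe_markers = ("synthesize the toxin", "bypass the safety")
    return any(marker in doc.lower() for marker in unsafe_markers)

def tag_documents(docs: list[str]) -> list[str]:
    """Prefix documents flagged as harmful with HARM_TAG."""
    return [HARM_TAG + d if is_harmful(d) else d for d in docs]

corpus = [
    "A recipe for apple pie with a flaky crust.",
    "Step-by-step guide to bypass the safety interlock.",
]
tagged = tag_documents(corpus)
```

At inference time, the tag can then be suppressed at the decoding stage to steer generation away from tagged content.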
Key Result
Safety-pretrained models reduce the attack success rate from 38.8% to 8.4% on standard LLM safety benchmarks, with no performance degradation on general tasks.
Source
- Link: https://arxiv.org/abs/2504.16980
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- data-filtering — Black-box safety (understand and control current model behaviour) / Better data