Safety Pretraining: Toward the Next Generation of Safe AI

Pratyush Maini, Sachin Goyal, Dylan Sam, Alex Robey, Yash Savani, Yiding Jiang, … (+4 more) — 2025-04-23 — Carnegie Mellon University — arXiv

Summary

Presents a data-centric pretraining framework that builds safety into LLMs from the start through four methods: safety filtering of web data, safety rephrasing of unsafe content, native refusal datasets (RefuseWeb and Moral Education), and harmfulness-tag annotated pretraining.
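A minimal sketch of the harmfulness-tag annotation idea: unsafe documents are kept in the pretraining corpus but wrapped in an explicit tag, so the model learns to associate that content with the tag rather than emit it unconditionally. The tag strings, the toy keyword-based scorer, and the threshold below are illustrative assumptions, not the paper's actual classifier or pipeline.

```python
HARM_TAG_OPEN = "<|harmful|>"    # assumed tag tokens, not from the paper
HARM_TAG_CLOSE = "<|/harmful|>"

def toy_harm_score(text: str) -> float:
    """Stand-in for a learned safety classifier: fraction of flagged words."""
    flagged = {"exploit", "weapon", "poison"}
    words = text.lower().split()
    return sum(w in flagged for w in words) / max(len(words), 1)

def annotate(doc: str, threshold: float = 0.1) -> str:
    """Wrap documents scoring above the threshold in harmfulness tags."""
    if toy_harm_score(doc) > threshold:
        return f"{HARM_TAG_OPEN}{doc}{HARM_TAG_CLOSE}"
    return doc

corpus = [
    "How to bake sourdough bread at home.",
    "Steps to build a weapon and poison a water supply.",
]
annotated = [annotate(d) for d in corpus]
```

In this sketch, the safe document passes through unchanged while the flagged one is tagged; at pretraining time the tags become ordinary tokens, and at inference time generation of the tag can be suppressed or used as a signal.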

Key Result

Safety-pretrained models cut the attack success rate on standard LLM safety benchmarks from 38.8% to 8.4%, with no performance degradation on general-capability tasks.

Source