Safety Pretraining: Toward the Next Generation of Safe AI
Pratyush Maini, Sachin Goyal, Dylan Sam, Alex Robey, Yash Savani, Yiding Jiang, … (+4 more) — 2025-04-23 — Carnegie Mellon University — arXiv
Summary
Presents a data-centric pretraining framework that builds safety into LLMs from the start through four methods: safety filtering of web data, safety rephrasing of unsafe content, native refusal datasets (RefuseWeb and Moral Education), and harmfulness-tag annotated pretraining.
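The harmfulness-tag idea can be sketched as a simple corpus pass: documents flagged as unsafe are prefixed with a special tag token so the model learns to condition on it during pretraining. This is an illustrative sketch only, not the paper's code; the tag string and the `is_harmful` stand-in (a keyword heuristic here, where the paper would use a trained classifier) are assumptions.

```python
# Illustrative sketch of harmfulness-tag annotation (not the paper's code).
# Documents judged unsafe get a special tag prepended, so the model can
# learn to associate the tag with unsafe content during pretraining.

HARM_TAG = "<|harmful|>"  # assumed special token; the name is hypothetical

def is_harmful(doc: str) -> bool:
    """Hypothetical stand-in for a trained safety classifier."""
    # Placeholder keyword heuristic for illustration only.
    unsafe_markers = ("synthesize the toxin", "bypass the safety")
    return any(marker in doc.lower() for marker in unsafe_markers)

def tag_documents(docs: list[str]) -> list[str]:
    """Prefix documents flagged as harmful with HARM_TAG."""
    return [HARM_TAG + d if is_harmful(d) else d for d in docs]

corpus = [
    "A recipe for apple pie with a flaky crust.",
    "Step-by-step guide to bypass the safety interlock.",
]
tagged = tag_documents(corpus)
```

At inference time, the tag can then be suppressed at the decoding stage to steer generation away from tagged content.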
Key Result
Safety-pretrained models reduce the attack success rate from 38.8% to 8.4% on standard LLM safety benchmarks, with no performance degradation on general tasks.
Source
- Link: https://arxiv.org/abs/2504.16980
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- data-filtering — Black-box safety (understand and control current model behaviour) / Better data