Towards Alignment Auditing as a Numbers-Go-Up Science

Sam Marks — 2025-08-04 — Anthropic — LessWrong

Summary

Proposes that alignment auditing should become a ‘numbers-go-up’ field organized around a progress metric, the performance of auditing agents on standard testbeds, much as other ML fields organize around concrete metrics such as perplexity or throughput.

Source

Link: https://lesswrong.com/posts/bGYQgBPEyHidnZCdE/towards-alignment-auditing-as-a-numbers-go-up-science
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- other-evals — Evals
