Teaching AI to Handle Exceptions: Supervised Fine-tuning with Human-aligned Judgment
Matthew DosSantos DiSorbo, Harang Ju, Sinan Aral — 2025-03-XX — Harvard Business School, Johns Hopkins University, MIT Sloan School of Management
Source
- Link: https://arxiv.org/html/2503.02976v2#S3
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- human-inductive-biases — White-box safety (i.e. Interpretability)