Can a Neural Network that only Memorizes the Dataset be Undetectably Backdoored?
Matjaz Leonardis — 2025-07-10 — ODYSSEY 2025 Conference
Summary
Demonstrates that even highly interpretable neural networks that memorize datasets can be undetectably backdoored, challenging the assumption that interpretability enables backdoor detection. Analyzes a simple network with O(n+d) parameters that achieves perfect classification through memorization yet remains vulnerable to undetectable backdoors.
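To make the O(n+d) parameter count concrete, here is a minimal sketch (not the paper's construction, just one illustrative assumption) of how a network-like classifier can perfectly memorize n training points in d dimensions with only n+d stored numbers: a single shared random projection vector (d parameters) maps each point to a scalar, and the n projected values are memorized alongside their labels.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.normal(size=(n, d))          # n training points in d dimensions
y = rng.integers(0, 2, size=n)       # binary labels

# d "first-layer" parameters: one shared random projection vector.
w = rng.normal(size=d)
p = X @ w                            # with probability 1, all n projections are distinct

# n "second-layer" parameters: the memorized projection values, sorted for lookup.
order = np.argsort(p)
p_sorted, y_sorted = p[order], y[order]

def classify(x, eps=1e-9):
    """Look up which memorized projection the input lands on."""
    q = x @ w
    i = np.searchsorted(p_sorted, q - eps)
    if i < n and abs(p_sorted[i] - q) < eps:
        return int(y_sorted[i])
    return None                      # input was not memorized

# Perfect classification of the training set via pure memorization.
assert all(classify(X[i]) == y[i] for i in range(n))
```

The sketch only answers correctly on the memorized points themselves, which is the regime the paper studies; its point is that even such a transparent, minimal-parameter memorizer need not be safe from undetectable backdoors.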
Key Result
Despite being interpretable and memorizing the training data with minimal parameters, the network can still be undetectably backdoored unless one has full knowledge of the dataset.
Source
- Link: https://openreview.net/forum?id=TD1NfQuVr6
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals