Can a Neural Network that only Memorizes the Dataset be Undetectably Backdoored?
Matjaz Leonardis — 2025-07-10 — ODYSSEY 2025 Conference
Summary
Demonstrates that even highly interpretable neural networks that memorize datasets can be undetectably backdoored, challenging the assumption that interpretability enables backdoor detection. Analyzes a simple network with O(n+d) parameters that achieves perfect classification through memorization yet remains vulnerable to undetectable backdoors.
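To make the O(n+d) parameter count concrete, here is a minimal sketch (not the paper's construction, just one illustrative assumption) of how a network-like classifier can perfectly memorize n training points in d dimensions with only n+d stored numbers: a single shared random projection vector (d parameters) maps each point to a scalar, and the n projected values are memorized alongside their labels.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.normal(size=(n, d))          # n training points in d dimensions
y = rng.integers(0, 2, size=n)       # binary labels

# d "first-layer" parameters: one shared random projection vector.
w = rng.normal(size=d)
p = X @ w                            # with probability 1, all n projections are distinct

# n "second-layer" parameters: the memorized projection values, sorted for lookup.
order = np.argsort(p)
p_sorted, y_sorted = p[order], y[order]

def classify(x, eps=1e-9):
    """Look up which memorized projection the input lands on."""
    q = x @ w
    i = np.searchsorted(p_sorted, q - eps)
    if i < n and abs(p_sorted[i] - q) < eps:
        return int(y_sorted[i])
    return None                      # input was not memorized

# Perfect classification of the training set via pure memorization.
assert all(classify(X[i]) == y[i] for i in range(n))
```

The sketch only answers correctly on the memorized points themselves, which is the regime the paper studies; its point is that even such a transparent, minimal-parameter memorizer need not be safe from undetectable backdoors.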
Key Result
Despite being interpretable and memorizing the training data with minimal parameters, the network can still be undetectably backdoored unless one has full knowledge of the dataset.
Source
- Link: https://openreview.net/forum?id=TD1NfQuVr6
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals