Reconstructing Training Data from Trained Neural Networks

Weizmann Institute of Science, Rehovot, Israel
* Equal contribution
Reconstruction of training data from a trained binary classifier: randomly initialized data points are "drifted" towards training samples by minimizing our proposed loss.

Abstract

Understanding to what extent neural networks memorize training data is an intriguing question with practical and theoretical implications. In this paper we show that in some cases a significant fraction of the training data can in fact be reconstructed from the parameters of a trained neural network classifier. We propose a novel reconstruction scheme that stems from recent theoretical results about the implicit bias in training neural networks with gradient-based methods. To the best of our knowledge, our results are the first to show that reconstructing a large portion of the actual training samples from a trained neural network classifier is generally possible. This has negative implications for privacy, as it can be used as an attack for revealing sensitive training data. We demonstrate our method for binary MLP classifiers on a few standard computer vision datasets.

Reconstructions

We reconstruct training samples from binary classifiers.
Below we show reconstructions from MLPs trained on 500 images of CIFAR10/MNIST, labeled as animals/vehicles and odd/even digits, respectively.
(Both models reach zero training error, with test accuracies of 88.0%/77.6%.)
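As a rough sketch of this setup for the CIFAR10 case (the architecture, optimizer, and training schedule below are illustrative placeholders, not the exact configuration used in the paper):

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

# CIFAR-10 classes 0, 1, 8, 9 (airplane, automobile, ship, truck) are relabeled
# as "vehicles" (target 1); the remaining six animal classes get target 0.
VEHICLE_CLASSES = {0, 1, 8, 9}

transform = T.Compose([T.ToTensor(), T.Normalize((0.5,) * 3, (0.5,) * 3)])
cifar = torchvision.datasets.CIFAR10(root="data", train=True, download=True,
                                     transform=transform)
subset = torch.utils.data.Subset(cifar, range(500))      # 500-image training set
loader = torch.utils.data.DataLoader(subset, batch_size=500)

# Simple MLP with a single scalar output (layer widths are illustrative).
model = nn.Sequential(nn.Flatten(),
                      nn.Linear(3 * 32 * 32, 1000), nn.ReLU(),
                      nn.Linear(1000, 1000), nn.ReLU(),
                      nn.Linear(1000, 1))
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(1000):                  # train until the training error is zero
    for images, labels in loader:
        targets = torch.tensor([float(c in VEHICLE_CLASSES) for c in labels.tolist()])
        optimizer.zero_grad()
        loss = criterion(model(images).squeeze(-1), targets)
        loss.backward()
        optimizer.step()

The MNIST odd/even setting is analogous, relabeling each digit by its parity.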

Technical TL;DR

Our approach relies on theoretical results about the implicit bias in training neural networks with gradient-based methods. The implicit bias has been studied extensively in recent years with the motivation of explaining generalization in deep learning.
We use results by Lyu & Li (2019) and Ji & Telgarsky (2020), which establish that, under some technical assumptions, if we train a neural network with the binary cross-entropy loss, its parameters converge (in direction) to a stationary (KKT) point of a certain margin-maximization problem. This result implies that the parameters of the trained network satisfy a set of equations w.r.t. the training dataset. In our approach, given a trained network, we find a dataset that solves this set of equations w.r.t. the trained parameters.
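For reference, the margin-maximization problem from these works (for homogeneous networks, stated here in the paper's notation) is:

\min_{\theta} \ \frac{1}{2}\|\theta\|^2 \quad \text{s.t.} \quad y_i \, \Phi(\theta; x_i) \ge 1 \quad \forall i \in \{1, \dots, n\}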

More specifically, the trained parameters θ and the dataset {(x_i, y_i)}_{i=1}^n satisfy the following set of equations (the KKT conditions of this margin-maximization problem), where λ_1, ..., λ_n are real numbers and Φ(θ; ⋅) denotes the neural network with parameters θ:

\theta = \sum_{i=1}^{n} \lambda_i \, y_i \, \nabla_{\theta} \Phi(\theta; x_i), \qquad \lambda_i \ge 0, \qquad y_i \, \Phi(\theta; x_i) \ge 1, \qquad \lambda_i = 0 \;\; \text{if} \;\; y_i \, \Phi(\theta; x_i) \ne 1

We use the stationarity condition to construct a novel reconstruction loss: given only the trained parameters θ, we treat candidate inputs x_1, ..., x_m and coefficients λ_1, ..., λ_m as free variables (with fixed labels y_i) and minimize

L(x_1, \dots, x_m, \lambda_1, \dots, \lambda_m) = \Big\| \theta - \sum_{i=1}^{m} \lambda_i \, y_i \, \nabla_{\theta} \Phi(\theta; x_i) \Big\|_2^2

together with penalty terms that keep the λ_i non-negative and the candidate images within the valid pixel range.
The animations at the top of this page show the trajectories of several x_i, from a random-noise initialization until they reach a data point from the training set, obtained by minimizing the loss L.
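As a rough illustration of this optimization (not the paper's exact implementation; the architecture, penalty weights, and hyperparameters below are placeholders), the stationarity residual and the reconstruction loop can be sketched in PyTorch as follows:

import torch
import torch.nn as nn

# Placeholder for the trained binary classifier Phi(theta; .) under attack;
# in practice this is the already-trained network (e.g. from the sketch above).
model = nn.Sequential(nn.Linear(3 * 32 * 32, 1000), nn.ReLU(),
                      nn.Linear(1000, 1000), nn.ReLU(),
                      nn.Linear(1000, 1))

def reconstruction_loss(model, x, y, lam):
    """Squared residual of the stationarity condition
    theta = sum_i lambda_i * y_i * grad_theta Phi(theta; x_i)."""
    params = list(model.parameters())
    out = model(x).squeeze(-1)                     # Phi(theta; x_i), shape (m,)
    # Differentiating this scalar w.r.t. theta gives
    # sum_i lambda_i * y_i * grad_theta Phi(theta; x_i).
    weighted_sum = (lam * y * out).sum()
    grads = torch.autograd.grad(weighted_sum, params, create_graph=True)
    return sum(((p - g) ** 2).sum() for p, g in zip(params, grads))

m, d = 100, 3 * 32 * 32
x = torch.randn(m, d, requires_grad=True)          # random-noise initialization
lam = torch.full((m,), 0.1, requires_grad=True)    # one coefficient per candidate
y = torch.tensor([1.0, -1.0]).repeat(m // 2)       # fixed labels, half per class
opt = torch.optim.Adam([x, lam], lr=0.01)          # the trained theta stays frozen

for step in range(10_000):
    opt.zero_grad()
    loss = reconstruction_loss(model, x, y, lam)
    loss = loss + (torch.relu(-lam) ** 2).sum()          # encourage lambda_i >= 0
    loss = loss + (torch.relu(x.abs() - 1) ** 2).sum()   # keep pixels in [-1, 1]
    loss.backward()
    opt.step()

Candidates that converge to (near-duplicates of) training images can then be identified, e.g. by a nearest-neighbor search against the training set.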

BibTeX

@inproceedings{NEURIPS2022_90692737,
 author = {Haim, Niv and Vardi, Gal and Yehudai, Gilad and Shamir, Ohad and Irani, Michal},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {S. Koyejo and S. Mohamed and A. Agarwal and D. Belgrave and K. Cho and A. Oh},
 pages = {22911--22924},
 publisher = {Curran Associates, Inc.},
 title = {Reconstructing Training Data From Trained Neural Networks},
 url = {https://proceedings.neurips.cc/paper_files/paper/2022/file/906927370cbeb537781100623cca6fa6-Paper-Conference.pdf},
 volume = {35},
 year = {2022}
}