One of the most frequently cited research papers on data privacy is “Robust De-Anonymization of Large Sparse Datasets” by A. Narayanan and V. Shmatikov, in Proceedings of the 2008 IEEE Symposium on Security and Privacy, May 2008. The paper examined a dataset of Netflix user movie ratings from which personal information, such as user names, had been removed.
I’ve seen summaries of this paper many times in technical articles and on websites, but most of the summaries have minor inaccuracies. Here is my summary of the two key results from the research paper.
1.) When an adversary knows only a little bit of information about a particular record in the anonymized Netflix movie rating dataset, the adversary can find the full record. Specifically, when an adversary knows just 8 movie ratings, of which 2 can be completely wrong, and rating dates that can be off by up to 14 days, 99% of records can be uniquely identified in the dataset. Note that this result by itself does not reveal personal information because no personal information is in the dataset.
2.) By using IMDb movie review information, which does have personal information supplied by users, it is possible to match IMDb reviews with Netflix records and therefore find personal information that was removed from the Netflix dataset.
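To make the first result concrete, here is a minimal sketch of the matching idea on a tiny, made-up dataset. This is not the scoring algorithm from the paper; it is a simplified illustration that assumes an exact rating match, a 14-day date tolerance, and a fixed number of allowed wrong items. The function names (matches, find_candidates), the record IDs, and all of the data are hypothetical.

```python
from datetime import date

def matches(aux_item, record, max_days=14):
    """True if the record rates the same movie with the same rating,
    on a date within max_days of the date the adversary believes."""
    movie, rating, when = aux_item
    if movie not in record:
        return False
    rec_rating, rec_when = record[movie]
    return rec_rating == rating and abs((rec_when - when).days) <= max_days

def find_candidates(aux, dataset, max_wrong=2, max_days=14):
    """Return IDs of records that agree with all but max_wrong items
    of the adversary's auxiliary information."""
    needed = len(aux) - max_wrong
    return [rid for rid, record in dataset.items()
            if sum(matches(item, record, max_days) for item in aux) >= needed]

# Hypothetical anonymized records: movie -> (rating, date rated).
dataset = {
    "r1": {"A": (5, date(2005, 1, 10)), "B": (3, date(2005, 2, 1)),
           "C": (4, date(2005, 3, 5)), "D": (1, date(2005, 4, 1))},
    "r2": {"A": (5, date(2005, 1, 12)), "B": (2, date(2005, 6, 1)),
           "E": (4, date(2005, 3, 5))},
    "r3": {"C": (4, date(2005, 3, 1)), "D": (2, date(2005, 4, 2))},
}

# What the adversary knows: approximate dates, and the rating for D is wrong.
aux = [("A", 5, date(2005, 1, 3)), ("B", 3, date(2005, 2, 10)),
       ("C", 4, date(2005, 3, 1)), ("D", 5, date(2005, 4, 1))]

print(find_candidates(aux, dataset, max_wrong=1))  # ['r1']
```

In this toy example only one record survives the filter even though one of the four known ratings is wrong and the dates are only approximate. The second result follows the same pattern, with the auxiliary information taken from a public profile instead of being assumed.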
The identities of the people who created ancient Egyptian art and jewelry will remain anonymous/private forever. Three modern interpretations in film. Left: “Caesar and Cleopatra” (1945). Center: “Gods of Egypt” (2016). Right: “Cleopatra” (1934).