Guess Who? 5 examples why removing names is not an option

December 4, 2019

An introduction to Guess Who
Linkage attacks: your dataset linked to other (public) data sources
Informed individuals
Data as a fingerprint
General Data Protection Regulation (GDPR)

1) An introduction to Guess Who

Guess Who? I am sure most of you know this game from back in the day, but here’s a brief recap. The goal of the game: discover the name of the cartoon character selected by your opponent by asking ‘yes’ and ‘no’ questions, like ‘Does the person wear a hat?’ or ‘Does the person wear glasses?’ Players eliminate candidates based on the opponent’s response and learn attributes that relate to their opponent’s mystery character. The first player who figures out the other player’s mystery character wins the game.

You got it. One must identify an individual in a dataset while having access only to the corresponding attributes. In fact, we regularly see this concept of Guess Who applied in practice, but then on datasets formatted as rows and columns containing attributes of real people. The main difference when working with data is that people tend to underestimate the ease with which real individuals can be unmasked from only a few attributes.

As the Guess Who game illustrates, someone can identify individuals by having access to only a few attributes. It serves as a simple example of why removing only ‘names’ (or other direct identifiers) from your dataset fails as an anonymization technique. In this blog, we provide four practical cases to inform you about the privacy risks associated with the removal of columns as a means of data anonymization.
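To make the elimination mechanic concrete, here is a minimal sketch in Python. The characters, attributes, and answers are made up for illustration; the point is only that each yes/no answer shrinks the candidate set, and a handful of attributes can be enough to single out one record.

```python
# Toy "Guess Who" board. The guesser never sees the names, only the attributes.
# All characters and attribute values are invented for this illustration.
characters = [
    {"name": "Anna",  "hat": True,  "glasses": False, "beard": False},
    {"name": "Bob",   "hat": True,  "glasses": True,  "beard": True},
    {"name": "Chris", "hat": False, "glasses": True,  "beard": False},
    {"name": "Dana",  "hat": False, "glasses": False, "beard": False},
    {"name": "Eli",   "hat": True,  "glasses": True,  "beard": False},
]


def eliminate(candidates, attribute, answer):
    """Keep only the candidates whose attribute matches the yes/no answer."""
    return [c for c in candidates if c[attribute] == answer]


# Answers the opponent gives about the mystery character.
answers = [("hat", True), ("glasses", True), ("beard", False)]

remaining = characters
for attribute, answer in answers:
    remaining = eliminate(remaining, attribute, answer)
    print(f"{attribute}={answer}: {len(remaining)} candidate(s) left")

# Once a single candidate is left, the "anonymous" record has a name again.
print([c["name"] for c in remaining])
```

In this toy board, three questions are enough to go from five candidates to one. Real datasets have more rows, but they usually have many more attributes as well.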

2) Linkage attacks: your dataset linked to other (public) data sources

The risk of a linkage attack is the most important reason why solely removing names does not work (anymore) as a method for anonymization. With a linkage attack, the attacker combines the original data with other accessible data sources in order to uniquely identify an individual and learn (often sensitive) information about this person.

The key here is the availability of other data sources that exist now or may become available in the future. Think about yourself. How much of your own personal data can be found on Facebook, Instagram, or LinkedIn that could potentially be abused for a linkage attack?

In earlier days, the availability of data was much more limited, which partly explains why the removal of names was sufficient to preserve the privacy of individuals. Less available data means fewer opportunities for linking data. However, we are now (active) participants in a data-driven economy, where the amount of data is growing at an exponential rate. More data and improving technology for gathering data will lead to increased potential for linkage attacks. What would one write in 10 years about the risk of a linkage attack?

Case study

Sweeney (2002) demonstrated in an academic paper how she was able to identify individuals and retrieve their sensitive medical data by linking a publicly available dataset of ‘hospital visits’ to the publicly available voter register in the United States. Both datasets were assumed to be properly anonymized through the deletion of names and other direct identifiers.

Linkage attack example visualization

Based on only three attributes, (1) ZIP code, (2) gender, and (3) date of birth, she showed that 87% of the entire US population could be re-identified by matching these attributes across both datasets. Sweeney then repeated her work with ‘county’ as an alternative to ‘ZIP code’, and demonstrated that 18% of the entire US population could still be identified from a dataset containing only (1) home county, (2) gender, and (3) date of birth. Think about the aforementioned public sources, like Facebook, LinkedIn, or Instagram. Are your county, gender, and date of birth visible, or are other users able to deduce them?

Sweeney’s results

Quasi-identifiers	% of US population (248 million) uniquely identified
5-digit ZIP, gender, date of birth	87%
Place, gender, date of birth	53%
County, gender, date of birth	18%

This example demonstrates that it can be remarkably easy to de-anonymize individuals in seemingly anonymous data. First, this study indicates a huge magnitude of risk, as 87% of the US population can be easily identified using only a few characteristics. Second, the exposed medical data in this study was highly sensitive. Examples of exposed individuals’ data from the hospital visits dataset include ethnicity, diagnosis, and medication: attributes that one may rather keep secret, for example, from insurance companies.
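The mechanics of a linkage attack can be sketched in a few lines of Python. This is a minimal illustration, not Sweeney’s actual method or data: all records, names, and field names below are hypothetical, and we assume the attacker holds a second source that carries names next to the same three quasi-identifiers.

```python
from collections import defaultdict

# Hypothetical "anonymized" hospital records: names removed, quasi-identifiers kept.
hospital_visits = [
    {"zip": "02138", "gender": "F", "dob": "1961-03-12", "diagnosis": "hypertension"},
    {"zip": "02139", "gender": "M", "dob": "1975-07-30", "diagnosis": "asthma"},
    {"zip": "02139", "gender": "M", "dob": "1980-01-05", "diagnosis": "diabetes"},
]

# Hypothetical second source that links the same attributes to names.
public_register = [
    {"name": "Jane Roe",  "zip": "02138", "gender": "F", "dob": "1961-03-12"},
    {"name": "John Doe",  "zip": "02139", "gender": "M", "dob": "1975-07-30"},
    {"name": "Sam Smith", "zip": "02139", "gender": "M", "dob": "1975-07-30"},
]

QUASI_IDENTIFIERS = ("zip", "gender", "dob")


def key(record):
    """Build the join key from the quasi-identifiers."""
    return tuple(record[q] for q in QUASI_IDENTIFIERS)


def link(anonymized, public):
    """Return (name, diagnosis) pairs where the join key matches exactly one person."""
    index = defaultdict(list)
    for person in public:
        index[key(person)].append(person["name"])

    reidentified = []
    for record in anonymized:
        names = index.get(key(record), [])
        if len(names) == 1:  # a unique match means the record is re-identified
            reidentified.append((names[0], record["diagnosis"]))
    return reidentified


print(link(hospital_visits, public_register))
```

Only the first hospital record is re-identified here: the second matches two people and stays ambiguous, and the third has no match at all. Sweeney’s figures suggest that, with ZIP code, gender, and date of birth, the unique match is the common case rather than the exception.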

3) Informed individuals

Another risk of removing only direct identifiers, such as names, arises when informed individuals have superior knowledge or information about the traits or behavior of specific individuals in the dataset. Based on their knowledge, the attacker may then be able to link specific data records to actual people.

Case study

An example of an attack on a dataset using superior knowledge is the New York taxi case, where Atockar (2014) was able to unmask specific individuals. The employed dataset contained all taxi journeys in New York, enriched with basic attributes like start coordinates, end coordinates, price, and tip of the ride.

An informed individual who knows New York was able to single out taxi trips to the adult club ‘Hustler’. By filtering on the ‘end location’, he deduced the exact start addresses and thereby identified various frequent visitors. Similarly, one could deduce taxi rides when the home address of the individual was known. The time and location of several celebrity movie stars were discovered on gossip sites. After linking this information to the NYC taxi data, it was easy to derive their taxi rides, the amount they paid, and whether they had tipped.

Drop-off coordinates at Hustler

Taxi tracking map of Bradley Cooper

Taxi tracking map of Jessica Alba
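The filtering step in this case study needs nothing more than a location the attacker already knows. Below is a minimal Python sketch under stated assumptions: the trips, the venue coordinates, and the tolerance are all invented, and the real dataset has a different layout. It keeps the trips that end near a known venue and counts how often each pickup location occurs.

```python
from collections import Counter

# Hypothetical trips: (pickup_lat, pickup_lon, dropoff_lat, dropoff_lon, tip).
trips = [
    (40.7301, -73.9950, 40.7680, -73.9960, 2.0),
    (40.7301, -73.9950, 40.7681, -73.9961, 3.5),
    (40.7505, -73.9800, 40.7680, -73.9959, 0.0),
    (40.7301, -73.9950, 40.7000, -74.0100, 1.0),
]

VENUE = (40.7680, -73.9960)  # made-up coordinates of a venue the attacker knows
TOLERANCE = 0.0005           # in degrees; about 55 m of latitude


def near(lat, lon, point, tol=TOLERANCE):
    """True if (lat, lon) lies inside a small box around the given point."""
    return abs(lat - point[0]) <= tol and abs(lon - point[1]) <= tol


# Count pickup locations for all trips that end at the venue.
pickups = Counter(
    (p_lat, p_lon)
    for p_lat, p_lon, d_lat, d_lon, _tip in trips
    if near(d_lat, d_lon, VENUE)
)

# A pickup location that shows up repeatedly points to a frequent visitor.
frequent = [location for location, count in pickups.items() if count >= 2]
print(frequent)
```

A precise pickup coordinate often corresponds to a single home address, which is why no name column is needed to put a person behind these trips.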

4) Data as a fingerprint

A common line of argumentation is ‘this data is worthless’ or ‘no one can do anything with this data’. This is often a misconception. Even the most innocent data can form a unique ‘fingerprint’ and be used to re-identify individuals. The risk here stems from the belief that the data itself is worthless when, in fact, it is not.

The risk of identification will increase with the growth of data, AI, and other tools and algorithms that enable the uncovering of complex relationships in data. Consequently, even if the individuals in your dataset cannot be re-identified now, and the data is presumably useless for unauthorized persons today, that may no longer hold tomorrow.

Case study

A great example is the case where Netflix intended to crowdsource its R&D department by introducing an open Netflix competition to improve its movie recommendation system. ‘The one that improves the collaborative filtering algorithm to predict user ratings for films wins a prize of US $1,000,000’. In order to support the crowd, Netflix published a dataset containing only the following basic attributes: userID, movie, date of grade, and grade (so no further information on the user or film itself).

UserID	Movie	Date of grade	Grade
123456789	Mission impossible	10-12-2008	4

Dataset structure of the Netflix Prize

In isolation, the data appeared harmless. When asking the question ‘Is there any customer information in the dataset that should be kept private?’, the answer was:

‘No, all customer identifying information has been removed; all that remains are ratings and dates. This follows our privacy policy …’

However, Narayanan (2008) from the University of Texas at Austin proved otherwise. The combination of grades, date of grade, and movie of an individual forms a unique movie fingerprint. Think about your own Netflix behavior. How many people do you think watched the same set of movies? How many watched the same set of movies at the same time?

The main question is how to match this fingerprint. It was rather simple. Based on information from the well-known movie-rating website IMDb (Internet Movie Database), a similar fingerprint could be formed. Consequently, individuals could be re-identified.
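The matching idea can be sketched as follows. This is a deliberately simplified illustration in Python, not the scoring function from Narayanan’s paper: the users, movies, ratings, dates, and tolerance parameters are all invented. For each anonymized user, it counts how many ratings from a public profile also appear in that user’s records with a similar grade and a nearby date.

```python
from datetime import date

# Hypothetical anonymized dataset: user ID -> {movie: (grade, date of grade)}.
anonymized = {
    "u1": {
        "Movie A": (4, date(2008, 12, 10)),
        "Movie B": (2, date(2008, 12, 12)),
        "Movie C": (5, date(2009, 1, 3)),
    },
    "u2": {
        "Movie A": (4, date(2008, 6, 1)),
        "Movie D": (3, date(2008, 6, 2)),
    },
}

# Hypothetical public ratings posted under a real name on another website.
public_profile = {
    "Movie A": (4, date(2008, 12, 11)),
    "Movie C": (5, date(2009, 1, 4)),
}


def score(public, candidate, max_days=3, max_grade_gap=1):
    """Count public ratings that also occur in the candidate's records,
    allowing a small difference in grade and in date."""
    matches = 0
    for movie, (grade, when) in public.items():
        if movie in candidate:
            c_grade, c_when = candidate[movie]
            close_grade = abs(grade - c_grade) <= max_grade_gap
            close_date = abs((when - c_when).days) <= max_days
            if close_grade and close_date:
                matches += 1
    return matches


scores = {user: score(public_profile, records) for user, records in anonymized.items()}
best_match = max(scores, key=scores.get)
print(scores, best_match)
```

The tolerances matter because people rarely rate a movie on two sites on exactly the same day with exactly the same grade. A real attack would also need to check that the best score stands out clearly from the runner-up before claiming a match.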

While movie-watching behavior might not be presumed to be sensitive information, think about your own behavior: would you mind if it went public? Examples that Narayanan provided in his paper are religious views (ratings on ‘Jesus of Nazareth’ and ‘The Gospel of John’) and sexual preferences (ratings on ‘Bent’ and ‘Queer as Folk’) that could be easily distilled.

5) General Data Protection Regulation (GDPR)

GDPR might not be super-exciting, nor the silver bullet among blog topics. Yet, it is helpful to get the definitions straight when processing personal data. Since this blog addresses the common misconception that removing columns anonymizes data, and aims to educate you as a data processor, let us explore the definition of anonymization according to GDPR.

According to Recital 26 from the GDPR, anonymized information is defined as:

‘information which does not relate to an identified or identifiable natural person or personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.’

Since one processes personal data that relates to a natural person, only part 2 of the definition is relevant. In order to comply with the definition, one has to ensure that the data subject (individual) is not or no longer identifiable. As indicated in this blog, however, it is remarkably simple to identify individuals based on a few attributes. So, removing names from a dataset does not comply with the GDPR definition of anonymization.

In conclusion

We challenged one commonly considered and, unfortunately, still frequently applied approach to data anonymization: removing names. The Guess Who game and four other examples showed that removing names fails as anonymization:

Linkage attacks
Informed individuals
Data as a fingerprint
General Data Protection Regulation (GDPR)

Although the examples are striking cases, each shows the simplicity of re-identification and the potential negative impact on the privacy of individuals.

In short, the removal of names from your dataset does not result in anonymous data. Hence, we had better not use ‘removing names’ and ‘anonymization’ interchangeably. I sincerely hope you will not apply this approach for anonymization. And, if you still do, ensure that you and your team fully understand the privacy risks, and are permitted to accept those risks on behalf of the affected individuals.