Large-scale genetic data sets are frequently shared with other research groups and even released on the Internet to allow for secondary analysis. Study participants are usually not informed about such data sharing because data sets are assumed to be anonymous after stripping off personal identifiers.
The assumption that genetic data sets are anonymous, however, is tenuous because genetic data are intrinsically self-identifying. Two types of re-identification are possible: the "Netflix" type and the "profiling" type. The "Netflix" type requires a second, small genetic data set, usually with fewer than 100 SNPs but including a personal identifier. This second data set might originate from another clinical examination, a study of leftover samples, or forensic testing. When merged with the primary, unidentified set, it re-identifies all samples of that individual.
Even without a second data set at hand, a "profiling" strategy can be used to extract as much information as possible from a sample collection. Starting with the identification of ethnic subgroups and continuing with predictions of body characteristics and diseases, the "asthma kids" case is used as a real-life example to illustrate this approach.
Depending on the degree of supplemental information, there is a good chance that at least a few individuals can be identified from an anonymized data set. Any re-identification, however, may potentially harm study participants because it will release individual genetic disease risks to the public.
Large-scale SNP data sets are shared with other research groups or even released on the Internet to foster new collaborations or to allow for second-look analysis. Study participants are not always informed about such data sharing because these kinds of data are assumed to be anonymous after stripping off personal identifiers such as name or date of birth (a process also called pseudonymization).
Confidentiality has long been seen as a fundamental ethical principle in health care, and breaching confidentiality is usually a reason for disciplinary action. It has been assigned such great value because it directly originates from the patient's autonomy to control his or her own life. Most patients regard the release of sensitive information within a professional patient-physician relationship as an implicit contract that doctors will keep all information confidential.
Such a contract is also inferred in genetic epidemiology [6,7]. The European Court of Human Rights therefore ruled in 2008 that the world's largest DNA database, based in the UK, violated Article 8 of the European Convention on Human Rights, which protects privacy. An earlier survey conducted by the Genetics and Public Policy Center found that the majority of Americans surveyed supported genetic testing for research and health care, but 92% also felt concern that genetic test results revealing a risk of future disease might be used in ways harmful to a person. Genetic privacy is therefore well founded in theory and appreciated in practice.
Anonymity is derived from the Greek word ανωνυμία, meaning "no name", but it soon took on the meaning "non-identifiable". Rather than an absolute property, anonymity may be seen as a variable level of hiding an identity that largely depends on surrounding information (statistically, a proposition of a true or false identity assignment with a Bayesian probability P that the proposition is true). A set of banknotes in a pocket may be largely uninformative, as long as it does not contain a particular mix of issuing offices that allows reconstruction of the owner's travel history.
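The probabilistic view of identity can be made concrete with a random-match calculation. The sketch below uses hypothetical minor-allele frequencies and Hardy-Weinberg genotype frequencies, assuming independent (unlinked) SNPs; it shows how quickly a multi-SNP genotype becomes effectively unique.

```python
# Sketch with hypothetical allele frequencies: the probability that an
# unrelated person shares a given multi-SNP genotype shrinks
# multiplicatively with each independent marker.

def genotype_freq(maf: float, genotype: str) -> float:
    """Hardy-Weinberg frequency of a genotype ('AA', 'Aa', 'aa')
    for a biallelic SNP with minor-allele frequency maf."""
    q = 1.0 - maf
    return {"AA": q * q, "Aa": 2 * maf * q, "aa": maf * maf}[genotype]

def random_match_probability(profile) -> float:
    """Probability that a random, unrelated individual matches the whole
    profile, assuming the SNPs are independent."""
    prob = 1.0
    for maf, genotype in profile:
        prob *= genotype_freq(maf, genotype)
    return prob

# 30 heterozygous SNPs, each with a minor-allele frequency of 0.3:
profile = [(0.3, "Aa")] * 30
p_match = random_match_probability(profile)
print(p_match)  # ~5e-12, far below 1 / world population
```

Under these assumptions, even 30 common SNPs identify one individual among trillions; a 100-SNP panel, as mentioned above, is overwhelming.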
The assumption of anonymity of genetic data sets is tenuous if personal identifiers are merely removed and kept in another place. Moreover, the usual k-anonymization strategy, where each relevant entity is hidden among at least k peers, is not feasible with such highly informative data sets. Some authors even believe that re-identifying records is just a matter of economic investment, ranging between $0 and $17,000 even for data protected under the "safe harbor" provision of the U.S. Health Insurance Portability and Accountability Act (HIPAA).
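Why k-anonymity fails here can be sketched with simulated genotypes (illustrative only, not real study data): k-anonymity requires every record to share its attribute combination with at least k-1 others, but with a few dozen SNPs nearly every genotype vector is unique.

```python
# Simulate a cohort and measure the achieved anonymity level k
# (size of the smallest equivalence class of identical records).
import random
from collections import Counter

random.seed(0)
n_individuals, n_snps = 1000, 50

def draw_genotype(maf: float = 0.3) -> int:
    """Genotype coded 0/1/2 as the count of minor alleles."""
    return sum(random.random() < maf for _ in range(2))

cohort = [tuple(draw_genotype() for _ in range(n_snps))
          for _ in range(n_individuals)]

class_sizes = Counter(cohort)          # equivalence classes of identical rows
k = min(class_sizes.values())          # achieved anonymity level
unique = sum(1 for v in class_sizes.values() if v == 1)
print(f"k = {k}, unique records = {unique} / {n_individuals}")
# prints "k = 1, unique records = 1000 / 1000"
```

Every simulated record is unique, so no k ≥ 2 is achievable without suppressing or coarsening the markers to the point of destroying their scientific value.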
Genetic data are intrinsically self-identifying, hence their use in forensic investigations. With the progress of genome-wide association studies, earlier work on genetic privacy has become outdated and will finally be buried with the advent of whole-genome sequencing. Although current study leaflets still promise "strict confidentiality" to study participants, some authors conclude that anonymity is already a thing of the past. Family history and genetic diagnoses are already being traced on the Internet using dedicated websites.
Are there any real threats to anonymity? The answer may depend on the overall interest in re-identification of data, driven by financial interests, (pseudo-)ethical reasons, personal interests, or plain curiosity. Benitez and Malin describe three types of data intruders, notably the prosecutor, journalist, and marketer types. The prosecutor attack has a specific target and the journalist attack seeks to identify some particular record, while the adversary's goal in the marketer attack is to identify as many records as possible.
Heeney et al. further examined the motivations of data intruders, which range from scientists doing further research and police or secret services pursuing forensic purposes to agents working in marketing, insurance, or employment. In addition, there is a large community interested in genealogy, some with even "strong [...] motivations, including adoptees and donor-conceived children", who claim a right to re-identify genetic information.
There are already examples in the literature where a single individual could be identified: surnames of individuals in the HapMap samples could be traced, and James Watson's ApoE gene status was inferred although it was never released.
At least two types of de-anonymization attacks can be distinguished. The first attack scenario requires a second, rather small genetic data set of fewer than 100 SNPs that includes personal identifiers. This second data set might originate from a later clinical examination, another study of leftover samples, or some forensic testing. It can then be merged with the large unidentified set by using markers present in both sets. Although such a scenario is rather trivial from a technical point of view, it puts considerable pressure on tested individuals to avoid any retesting of their DNA (and of the DNA of all close relatives). This type of attack may be called the "Netflix" type, named after the well-known case of two researchers who linked the anonymized Netflix movie-rating database with named public ratings on the Internet Movie Database (IMDb). They used a new class of statistical de-anonymization attacks against high-dimensional data sets that include individual preferences, recommendations, and transaction records.
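The linkage step itself is indeed technically trivial. The sketch below (all names, sample IDs, and genotypes hypothetical) joins a small identified panel to a large anonymized set on the genotype vector at their shared SNPs:

```python
# "Netflix"-type linkage sketch: join two data sets on shared SNP genotypes.
# Genotypes are coded 0/1/2 (minor-allele counts) at five shared SNPs.

anonymized = {                       # sample ID -> genotypes at shared SNPs
    "S001": (0, 1, 2, 1, 0),
    "S002": (2, 2, 0, 1, 1),
    "S003": (1, 0, 1, 2, 2),
}
identified = {                       # person name -> genotypes at same SNPs
    "Alice Example": (2, 2, 0, 1, 1),
    "Bob Example":   (0, 0, 0, 0, 1),
}

# Invert the anonymized table on the genotype vector, then join.
by_genotype = {g: sid for sid, g in anonymized.items()}
matches = {name: by_genotype[g]
           for name, g in identified.items() if g in by_genotype}
print(matches)  # prints "{'Alice Example': 'S002'}"
```

With enough shared markers to make random matches vanishingly unlikely (see the random-match calculation above), every exact join is a re-identification.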
The second type of de-anonymization attack is a "profiling" approach that combines various levels of evidence. Although described earlier (for example, as trail re-identification), this approach has become feasible only recently. It gathers information at several levels and may be illustrated by a practical example using the "asthma kids" data set.
Genetic profiling using genome-wide SNP panels: a practical example
- Phenotype prediction
- Disease prediction
- Clinical status
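The cumulative effect of these profiling levels can be sketched with purely hypothetical retention fractions: each genotype-derived prediction keeps only a fraction of the candidate cohort, so a handful of attributes can narrow over a thousand participants to one or two individuals.

```python
# Profiling sketch (all fractions hypothetical): successive filters on
# predicted attributes shrink the candidate set multiplicatively.

population = 1300.0                # e.g. participants in a study cohort
filters = {                        # attribute -> fraction of cohort retained
    "ethnic subgroup":            0.20,
    "sex":                        0.50,
    "predicted eye color (OCA2)": 0.30,
    "rare predicted trait":       0.05,
}

candidates = population
for attribute, fraction in filters.items():
    candidates *= fraction
    print(f"after {attribute:28s}: ~{candidates:.1f} candidates")
# final line: "after rare predicted trait        : ~2.0 candidates"
```

Combined with any supplemental information (age group, clinical status, place of recruitment), such a shortlist is often small enough for direct identification.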
Depending on the data structure and the degree of supplemental information, there is a good chance that at least some individuals can be immediately re-identified by a profiling approach.
Risks and defense
Any re-identification will expose individual genetic risks to the public, and individuals become vulnerable to the consequences of genetic testing, ranging from uninsurability and unemployability to other forms of discrimination. It is difficult to anticipate any further use of genetic data, while threats to privacy and confidentiality are expected to increase as genomic technologies are rolled out more widely.
Several options have been proposed on how to deal with genetic privacy in the future. These include open consent, better encoding techniques, or implementing legal constraints along with restricted data access.
Open consent (under which no "promises of anonymity, privacy or confidentiality are made") acknowledges the fact that there are no anonymous genetic data; agreeing to it, however, may be an option only for individuals with "a master's degree in genetics". For genetic epidemiology studies that rely on high response rates, open consent is not a realistic option.
Better encoding could be another option. Malicious re-identification of data is, however, difficult to control on a worldwide scale, as most countries still lack data protection laws. Moreover, more sophisticated encoding techniques are unlikely to prevent data misuse because they will merely compete with ever better re-identification strategies.
The most feasible solution therefore will be highly restricted data access. Simply put, data that are not available cannot be re-identified. This policy now seems to be adopted by large research organizations such as the NIH and the Wellcome Trust, which have already removed genetic data from their websites.
Most importantly, it seems necessary to increase public awareness of genetic privacy and to inform probands continuously about the use of their samples and data. The risks of re-identification of anonymized data should be included in informed consent procedures, and any data sharing needs to be explicitly approved by the DNA donor. As a measure of precaution, genetic data should not be distributed on public Internet sites, and data sets with more than 100 SNP markers should be removed from public web servers if not explicitly endorsed by the donor. As suggested before, data access should be restricted to scientific collaborations under confidentiality agreements only.
List of abbreviations used
DNA: deoxyribonucleic acid; SNP: single nucleotide polymorphism; NIH: National Institutes of Health; AIDS: acquired immune deficiency syndrome; MC4R, OCA2, MYP2, HPGD, NR5A1, ABCC11, TAS2R, BNC2, AR, PAX1, TCHH, MAO-A, RGS2, AVPR1A, GHS-R1A, NPY2R, APBB1, CHRNA3, VMAT2, IQSEC2 denote gene names; for details see http://www.genecards.org.
The author declares that he has no competing interests.
The author did all research and wrote the manuscript.
The author wishes to thank Carol Oberschmidt for her revision of the text. There was no direct funding of this work; article processing charges were covered by Helmholtz Zentrum München - German Research Center for Environmental Health (grant G - 505 000 - 003).
Heeney C, Hawkins N, de Vries J, Boddington P, Kaye J: Assessing the Privacy Risks of Data Sharing in Genomics. [http://www.publichealth.ox.ac.uk/helex/publications/HeeneyEtAl2010]
European Court Rules DNA Retention Illegal [http://www.privacyinternational.org/article.shtml?cmd%5B347%5D=x-347-563175]
U.S. Public Opinion on Uses of Genetic Information and Genetic Discrimination [http://www.dnapolicy.org/resources/GINAPublic_Opinion_Genetic_Information_Discrimination.pdf]
Homer N, Szelinger S, Redman M, Duggan D, Tembe W, Muehling J, Pearson JV, Stephan DA, Nelson SF, Craig DW: Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays.
Moffatt MF, Kabesch M, Liang L, Dixon AL, Strachan D, Heath S, Depner M, von Berg A, Bufe A, Rietschel E, Heinzmann A, Simma B, Frischer T, Willis-Owen SA, Wong KC, Illig T, Vogelberg C, Weiland SK, von Mutius E, Abecasis GR, Farrall M, Gut IG, Lathrop GM, Cookson WOCM: Genetic variants regulating ORMDL3 expression contribute to the risk of childhood asthma.