My sister's keeper?: genomic research and the identifiability of siblings

Cassa, Christopher A; Schmidt, Brian; Kohane, Isaac S; Mandl, Kenneth D

doi:10.1186/1755-8794-1-32

Research article
Open access
Published: 25 July 2008

BMC Medical Genomics volume 1, Article number: 32 (2008)

Abstract

Background

Genomic sequencing of SNPs is increasingly prevalent, though the amount of familial information these data contain has not been quantified.

Methods

We provide a framework for measuring the risk to siblings of a patient's SNP genotype disclosure, and demonstrate that sibling SNP genotypes can be inferred with substantial accuracy.

Results

Extending this inference technique, we determine that a very low number of matches at commonly varying SNPs is sufficient to confirm sib-ship, demonstrating that published sequence data can reliably be used to derive sibling identities. Using HapMap trio data, at SNPs where one child is homozygotic major, with a minor allele frequency ≤ 0.20, (N = 452684, 65.1%) we achieve 91.9% inference accuracy for sibling genotypes.

Conclusion

These findings demonstrate that substantial discrimination and privacy risks arise from use of inferred familial genomic data.

Background

Genomic data are increasingly integrated into clinical environments, stored in genealogical and medical records[1, 2] and shared with the broader research community[3, 4] without full appreciation of the extent to which these commodity level measurements may disclose the health risks or even identity of family members. While siblings, on average, share half of their contiguous chromosomal segments, well over half of a sibling's allelic values can be inferred using only population-specific allele frequency data and the genotypes of another sib. The informed consent process for research and clinical genomic data transmission must therefore include rigorous treatment of accurately quantified disclosure risks for all who will be impacted by such activity.

It is remarkably easy to positively identify a person with fewer than 40 independent, commonly varying SNPs, using a physical sample or a copy of those values[5]. As DNA sequences cannot be revoked or changed once they are released, any disclosure of such data poses a life-long privacy risk. Unlike conventional fingerprints, which provide little direct information about patients or relatives, SNP genotypes may encode phenotypic characteristics, which can link sequences to people[6]. Despite these privacy issues[7, 8], use of genetic sequencing is increasing in both forensics[9] and clinical medicine. The recent genetic fingerprinting provision in the renewal of the federal Violence Against Women Act[10], alone, may result in one million new sequenced individuals each year, markedly increasing the number of available links between identities and genotypes. This genetic fingerprinting has an impact on people beyond those directly sequenced–genetic testing partially reveals genotypes of siblings and other family members.

At each locus in a child's genome, each parent transmits only one of his or her two chromosomes. If we have the genotype of one child and would like to use it to infer the genotype of a sibling, we consider both the parental alleles that are known (those transmitted to the first child) and the chromosomes the parents carry but did not transmit. We assume that the unknown parental alleles are drawn from a reference population, such as one of the HapMap populations. Now consider the genotype of the inferred sibling (the second child). With probability 0.25, the sibling receives the same two chromosomes transmitted to the first child, in which case the two have the same genotype. With probability 0.25, the inferred sibling receives both previously untransmitted chromosomes, in which case the sibling's genotype follows the reference population distribution. With the remaining probability 0.5, exactly one of the same chromosomes is transmitted, so one allele is shared with the first child and the other is drawn from the population.
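The three sharing cases above can be sketched as a mixture. This is a minimal illustration, assuming Hardy-Weinberg proportions in the reference population and a single biallelic SNP; the function names (`hardy_weinberg`, `one_shared`, `sibling_vector`) are hypothetical and `p` denotes the frequency of allele 'A'.

```python
def hardy_weinberg(p):
    """Population genotype probabilities for allele-'A' frequency p."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


def one_shared(sib1, p):
    """Sib2 distribution when exactly one chromosome is shared with Sib1.

    The shared allele is either of Sib1's two alleles with equal chance;
    the other allele is drawn from the reference population.
    """
    q = 1.0 - p
    dist = {"AA": 0.0, "Aa": 0.0, "aa": 0.0}
    for shared in sib1:  # iterate over the two allele characters
        for other, freq in (("A", p), ("a", q)):
            genotype = "".join(sorted(shared + other))  # 'A' sorts before 'a'
            dist[genotype] += 0.5 * freq
    return dist


def sibling_vector(sib1, p):
    """Mix the three cases: both shared (0.25), one (0.5), none (0.25)."""
    hw = hardy_weinberg(p)
    one = one_shared(sib1, p)
    return {
        g: 0.25 * (1.0 if g == sib1 else 0.0) + 0.5 * one[g] + 0.25 * hw[g]
        for g in hw
    }


if __name__ == "__main__":
    # Minor allele frequency 0.2 means p = 0.8 for the major allele 'A'.
    print(sibling_vector("aa", 0.8))
```

For a homozygous minor Sib₁ at MAF 0.2, the 'aa' entry works out to 0.25 + 0.5 × 0.2 + 0.25 × 0.04 = 0.36.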

Methods

To quantify the risk of SNP disclosure to relatives, we demonstrate a model for inferring sibling genotypes using proband SNP data and population-specific allele frequency databases, such as the HapMap[10, 11]. We also evaluate the probability that two people, in a selected pool of individuals, are siblings given a match at an independent subset of SNPs, and show that this number can be made remarkably low with appropriate SNP selection.

Enhanced ability to infer sibling genotypes

First, consider the case where one sibling's genotype is known to be 'AA', and the goal is to determine the probability that a second sibling's genotype will also be 'AA' at that locus. Because there is additional knowledge (the familial relationship between the two sibs), the prior probability of the second sib carrying a specific genotype at a selected SNP is altered under the new constraint. A conditional probability expression can be used that sums over the nine possible parental genotypic combinations at a single SNP (for example, maternal genotype 'Aa' with paternal genotype 'AA'), each denoted as i:

\begin{array}{l}
p(\text{Sib}_{2}AA \mid \text{Sib}_{1}AA) = \sum_{i = 1}^{9} p(\text{Sib}_{2}AA \mid \text{parental comb. } i)\, p(\text{parental comb. } i \mid \text{Sib}_{1}AA) \\
= \sum_{i = 1}^{9} \frac{p(\text{Sib}_{2}AA \cap \text{parental comb. } i)}{p(\text{parental comb. } i)}\, p(\text{parental comb. } i \mid \text{Sib}_{1}AA)
\end{array}

where Sib₁AA and Sib₂AA refer to Sib₁ and Sib₂ genotypes 'AA' at a selected SNP, respectively.

With unknown parental genotypes, we would calculate p(Sib ₂ AA) considering all nine possible parental genotype combinations, but knowledge that Sib₁ has genotype 'AA' allows exclusion of any parental combinations where either parent has genotype 'aa', as that would require the transmission of at least one copy of the 'a' allele to Sib₁, if non-paternity and new mutations are excluded. HapMap SNP population frequencies, p and q, for each selected SNP, can be used to calculate the probabilities of each parental combination, i. Once these values have been calculated, the genotype of the first sibling eliminates possible parental genotypic candidates (Figs. 1A–C), and the remaining probabilities are normalized.
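The enumeration over parental combinations can be sketched directly. This is an illustrative sketch rather than the study's code: it assumes parental genotypes follow Hardy-Weinberg proportions with allele-'A' frequency `p`, Mendelian transmission, and no mutation or non-paternity. The names `child_dist` and `sib2_given_sib1` are hypothetical.

```python
from itertools import product

GENOTYPES = ("AA", "Aa", "aa")

# Probability that a parent of each genotype transmits allele 'A'.
TRANSMIT_A = {"AA": 1.0, "Aa": 0.5, "aa": 0.0}


def hardy_weinberg(p):
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


def child_dist(mother, father):
    """Mendelian genotype distribution for a child of the given parents."""
    a_m, a_f = TRANSMIT_A[mother], TRANSMIT_A[father]
    return {
        "AA": a_m * a_f,
        "Aa": a_m * (1 - a_f) + (1 - a_m) * a_f,
        "aa": (1 - a_m) * (1 - a_f),
    }


def sib2_given_sib1(sib1, p):
    """Sum over the nine parental combinations, weighted by their
    posterior probability given the observed Sib1 genotype."""
    prior = hardy_weinberg(p)
    # Joint weight of each combination with the Sib1 observation;
    # combinations that cannot produce Sib1 get weight zero.
    weights = {
        (m, f): prior[m] * prior[f] * child_dist(m, f)[sib1]
        for m, f in product(GENOTYPES, repeat=2)
    }
    total = sum(weights.values())  # normalising constant
    vector = dict.fromkeys(GENOTYPES, 0.0)
    for (m, f), w in weights.items():
        for g, pr in child_dist(m, f).items():
            vector[g] += pr * w / total
    return vector


if __name__ == "__main__":
    print(sib2_given_sib1("AA", 0.8))
```

With p = 0.8 the 'AA' entry equals p² + pq + q²/4 = 0.64 + 0.16 + 0.01 = 0.81, which matches the closed form derived in the appendix.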

Measuring the information content of Sibling genotype data

When calculating the probability of a specific Sib₂ genotype given a known Sib₁ genotype, it is possible to directly measure the benefit of the proband genotype information in improving Sib₂ inferences. This involves measuring the difference between the prior Hardy-Weinberg probability for the genotype, given only population frequencies, and the posterior probability, as calculated by the conditional expression above. To measure the information content provided by the first sibling's genotype, we propose the use of a likelihood ratio test statistic, comparing models where two individuals are known to be siblings versus two individuals that are known to be unrelated. There are a total of nine possible likelihood ratios, Λ_{Ind1, Ind2 genotypes}, for each of the possible individual genotypic combinations, such as Ind ₁ AA:

\begin{array}{l}
\Lambda_{\text{Ind}_{1},\,\text{Ind}_{2}\text{ genotypes}} = \frac{p(\text{Ind}_{2}\text{ genotype} \mid \text{Ind}_{1}\text{ genotype} \cap \text{siblings})}{p(\text{Ind}_{2}\text{ genotype} \mid \text{Ind}_{1}\text{ genotype} \cap \text{unrelated})} \\
= \frac{p(\text{Sib}_{2}\text{ genotype} \mid \text{Sib}_{1}\text{ genotype} \cap \text{siblings})}{\left(\frac{p(\text{Ind}_{2}\text{ genotype} \cap \text{Ind}_{1}\text{ genotype} \cap \text{unrelated})}{p(\text{Ind}_{1}\text{ genotype} \cap \text{unrelated})}\right)} \\
= \frac{\sum_{i = 1}^{9} p(\text{Sib}_{2}\text{ genotype} \mid \text{parental comb. } i)\, p(\text{parental comb. } i \mid \text{Sib}_{1}\text{ genotype})}{\left(\frac{p(\text{Ind}_{2}\text{ genotype} \cap \text{Ind}_{1}\text{ genotype} \cap \text{unrelated})}{p(\text{Ind}_{1}\text{ genotype} \cap \text{unrelated})}\right)} \\
= \frac{\sum_{i = 1}^{9} \frac{p(\text{Sib}_{2}\text{ genotype} \cap \text{parental comb. } i)}{p(\text{parental comb. } i)}\, p(\text{parental comb. } i \mid \text{Sib}_{1}\text{ genotype})}{\left(\frac{p(\text{Ind}_{2}\text{ genotype}) \cdot p(\text{Ind}_{1}\text{ genotype}) \cdot \left(1 - \frac{1}{N}\right)}{p(\text{Ind}_{1}\text{ genotype}) \cdot \left(1 - \frac{1}{N}\right)}\right)} \\
\cong \frac{\sum_{i = 1}^{9} \frac{p(\text{Sib}_{2}\text{ genotype} \cap \text{parental comb. } i)}{p(\text{parental comb. } i)}\, p(\text{parental comb. } i \mid \text{Sib}_{1}\text{ genotype})}{p(\text{Ind}_{2}\text{ genotype})}
\end{array}

The denominator becomes p(Ind₂ genotype), which is either p², 2pq, or q². This is intuitive: when considering two unrelated individuals, the probability that the second has a specific genotype can be determined only from the population frequencies for that genotype. The numerator is the posterior probability expression derived in Table 1, also in terms of p and q. The log of this likelihood ratio can then be used as a statistic for measuring relatedness, depending only on the SNP allele frequency and the Sib₁ genotype (Fig. 2).
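As a rough illustration of the statistic, the sketch below evaluates the log likelihood ratio for any pair of genotypes. The closed-form numerators in `sib_conditional` are an assumption of this sketch: they follow from the sibling model described above under Hardy-Weinberg proportions and are not copied from Table 1. `p` is the frequency of allele 'A' and must lie strictly between 0 and 1.

```python
import math


def hardy_weinberg(p):
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


def sib_conditional(p):
    """p(Sib2 genotype | Sib1 genotype), indexed as table[sib1][sib2].

    Closed forms assumed for this sketch (Hardy-Weinberg parents,
    Mendelian transmission, no mutation)."""
    q = 1.0 - p
    return {
        "AA": {"AA": (1 + p) ** 2 / 4, "Aa": q * (1 + p) / 2, "aa": q * q / 4},
        "Aa": {"AA": p * (1 + p) / 4, "Aa": (1 + p * q) / 2, "aa": q * (1 + q) / 4},
        "aa": {"AA": p * p / 4, "Aa": p * (1 + q) / 2, "aa": (1 + q) ** 2 / 4},
    }


def log_likelihood_ratio(ind1, ind2, p):
    """log of p(Ind2 | Ind1, siblings) / p(Ind2 | unrelated)."""
    numerator = sib_conditional(p)[ind1][ind2]
    denominator = hardy_weinberg(p)[ind2]
    return math.log(numerator / denominator)


if __name__ == "__main__":
    for pair in (("AA", "AA"), ("Aa", "Aa"), ("aa", "aa"), ("AA", "aa")):
        print(pair, log_likelihood_ratio(pair[0], pair[1], 0.8))
```

Positive values favour the sibling model and negative values favour the unrelated model. Opposite homozygotes give a ratio of exactly 1/4 at every allele frequency, since siblings can only differ that way when they share neither parental chromosome.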

Table 1 Sib₂ inference error reduction when Sib₁ genotype is known.

The allele frequency, p, that maximizes this statistic can then be found numerically for each Λ_{Ind1, Ind2 genotypes} expression, to identify which allele frequencies and conditions are most informative for genotypic inferences. These results are below in Table 2.

Table 2 Finding the MAF that maximizes the log likelihood ratio test statistic for each Sib₂ genotypic inference type.

Confirming sib-ship with two non-matching sets of SNP genotypes

The above inference technique can be extended to confirm sib-ship in two non-matching samples of SNP sequence data. Given a set of matches at M independent loci from a pool of N individuals, an expanded form of Bayes Theorem can be used to calculate p(sibs|match at M loci) directly:

\begin{array}{l}
p(\text{sibs} \mid \text{match at } M \text{ loci}) = \frac{p(\text{match at } M \text{ loci} \mid \text{sibs})\, p(\text{sibs})}{p(\text{match at } M \text{ loci} \mid \text{sibs})\, p(\text{sibs}) + p(\text{match at } M \text{ loci} \mid \text{!sibs})\, p(\text{!sibs})} \\
= \frac{\left[p(\text{both } AA \mid \text{sibs}) + p(\text{both } Aa \mid \text{sibs}) + p(\text{both } aa \mid \text{sibs})\right]^{M} \left(\frac{1}{N}\right)}{\left[p(\text{both } AA \mid \text{sibs}) + p(\text{both } Aa \mid \text{sibs}) + p(\text{both } aa \mid \text{sibs})\right]^{M} \left(\frac{1}{N}\right) + p(\text{match} \mid \text{!sibs})^{M} \left(1 - \frac{1}{N}\right)}
\end{array}

p(match|!sibs) can be calculated for each SNP using the population frequency; it is the probability that two unrelated individuals in the population would share the same genotype, 'AA', 'Aa', or 'aa'. The expression p(match|!sibs) is effectively the same as p(match) as long as the sample pool, N, is large enough, as the probability of sib-ship is very low in a large pool. For three different pool sizes (N = 100,000; 10,000,000; 6,000,000,000), we have created a sib-ship probability surface that varies with the number of matched SNPs and the MAF of those SNPs (Fig. 3) and published supporting values for these probabilities in Table 3. For SNPs that commonly vary in the population, only a small number of genotypic matches is required to confirm sib-ship.
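A minimal sketch of this calculation follows, assuming all M loci are independent and share the same allele frequency, and reading p(both g|sibs) as p(Sib₁ = g) · p(Sib₂ = g | Sib₁ = g) under Hardy-Weinberg proportions. That reading, and the function names, are assumptions of the sketch. The posterior is written in odds form, which is algebraically the same as the Bayes expression above but avoids underflow for large M.

```python
def p_match_sibs(p):
    """Per-locus probability that two siblings share a genotype."""
    q = 1.0 - p
    both_AA = p * p * (1 + p) ** 2 / 4
    both_Aa = 2 * p * q * (1 + p * q) / 2
    both_aa = q * q * (1 + q) ** 2 / 4
    return both_AA + both_Aa + both_aa


def p_match_unrelated(p):
    """Per-locus probability that two unrelated people share a genotype."""
    q = 1.0 - p
    return p ** 4 + (2 * p * q) ** 2 + q ** 4


def p_sibs_given_matches(m, n, p):
    """Posterior probability of sib-ship after matches at m loci,
    with prior 1/n for a pool of n individuals."""
    ratio = p_match_unrelated(p) / p_match_sibs(p)  # below 1 for 0 < p < 1
    return 1.0 / (1.0 + (n - 1) * ratio ** m)


if __name__ == "__main__":
    for pool in (100_000, 10_000_000, 6_000_000_000):
        for matches in (10, 25, 50, 100):
            print(pool, matches, p_sibs_given_matches(matches, pool, 0.5))
```

Both per-locus probabilities are symmetric in p and q, so the MAF can be passed directly. At MAF = 0.5 they evaluate to 0.59375 for siblings and 0.375 for unrelated individuals, so each additional matched locus multiplies the odds of sib-ship by roughly 1.58.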

Table 3 Probability of sib-ship for three pool sizes.

Modeling a series of SNP inferences using a binomial distribution

A binomial distribution can be used to represent a series of sibling genotypic inferences, such as the probability of correct inferences at 50 SNP loci, if each inference meets specific criteria. Independent inferences can be treated as a random variable with probability p of success, as long as independent SNPs are selected, with the same MAF and Sib₁ genotype.

p (k, n, p) = (\begin{matrix} n \\ k \end{matrix}) p^{k} {(1 - p)}^{n - k}

where p(k, n, p) refers to the probability that k correct inferences were made out of n attempted inferences when the probability of success for each inference attempt is p. This measure will enable those who attempt to infer SNP genotypes to calculate the probability of matching at a subset of independent SNPs.

The cumulative binomial measures the probability of reaching up to k successes in n trials with probability p of success at each attempt:

F (k; n, p) = P (X \leq k) = \sum_{j = 0}^{k} (\begin{matrix} n \\ j \end{matrix}) p^{j} {(1 - p)}^{n - j}

If n guesses are considered (i.e. n SNPs are genotyped and used for sib inference), F(k; n, p) is the probability that at most k of those will be correct; the probability that at least k are correct is the complement, 1 − F(k − 1; n, p).
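The two binomial quantities can be computed with the standard library alone. The sketch below is illustrative; the helper names are hypothetical. It distinguishes the cumulative sum F, which counts at most k successes, from its complement, which counts at least k.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k correct inferences out of n."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def binom_cdf(k, n, p):
    """F(k; n, p): probability of at most k correct inferences."""
    return sum(binom_pmf(j, n, p) for j in range(k + 1))


def prob_at_least(k, n, p):
    """Probability of at least k correct inferences: 1 - F(k - 1; n, p)."""
    return 1.0 - binom_cdf(k - 1, n, p)


if __name__ == "__main__":
    # 50 independent loci, each inferred correctly with probability 0.919
    n, p = 50, 0.919
    print("all 50 correct:", binom_pmf(n, n, p))
    print("at least 45 correct:", prob_at_least(45, n, p))
```

The per-inference probability 0.919 is used here only as an example value; it assumes every locus meets the same MAF and Sib₁ genotype criteria, as the text requires.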

Results

Validation of SNP genotype inference using HapMap trio data

We then empirically infer sibling genotypic sequences from HapMap trio child genotypes using the above technique. At 700,000 SNP loci on chromosomes 2, 4, and 7, in each of 30 HapMap CEPH trios, the trio sibling, Sib₁, known genotypes are combined with the CEPH and global HapMap SNP allele frequencies to produce genotypic inferences of a hypothetical sib, Sib₂, at these loci. The inference method produces three genotypic probabilities for Sib₂ (or subsequent siblings): p(Sib ₂ AA|Sib ₁ genotype), p(Sib ₂ Aa|Sib ₁ genotype), and p(Sib ₂ aa|Sib ₁ genotype) for each SNP, which we call the SNP probability vector.

The ability to correctly infer a sibling genotype from a trio child genotype can be validated by comparing whether the best estimated genotype, using only the sibling genotype and population frequencies, matches the best estimated genotype using the parental genotypic data (Fig. 1D). We do this by comparing the plural, largest, value in the SNP probability vector, with the plural value in the SNP probability vector that would be expected given the parental genotypes and Mendelian Inheritance. The fraction of correct inferences for SNPs where the Sib₁ is homozygous major or heterozygous versus MAF are graphed in Figs. 4A–B, respectively. There were insufficient SNPs where the trio child was homozygous minor, so they have been excluded from this analysis. The appendix contains details about the HapMap population used as well as the distance and scoring metric used.

For inferences at SNPs where the trio child, Sib₁, was homozygous major, with MAF < 0.05 (N = 300512, 43.2%), we are able to correctly infer the genotype of other siblings, e.g. Sib₂, with 98.5% accuracy when using population-specific allele frequency data. At SNPs with MAF < 0.20 (N = 452684, 65.1%) we achieve 91.9% average accuracy. For SNPs where the first sibling is heterozygous, with MAF > 0.20 (N = 125796, 18.1%), it is possible to infer the correct genotype of the second sibling with 57.7% average accuracy. Without Sib₁ genotypes, all inferences for homozygous major SNPs with MAF ≥ 0.33 and heterozygous SNPs with MAF ≤ 0.33 would be incorrect when validated against plural parental values. At these allele frequencies, as well as others, use of Sib₁ genotypes markedly improves Sib₂ inferences.

Deriving propensity to disease from sibling SNP data

Additionally, sibling SNP data can be used to quantify an individual's disease propensity through genotypic inference, without that individual's actual sequence data. For example, the likelihood ratio test statistic above may also be used to describe relative risk, using a multiplicative model.

\begin{array}{l}
\Gamma_{\text{Sib}_{2}\text{ genotype} \mid \text{Sib}_{1}\text{ genotype}} = \frac{\text{probability with sibling knowledge}}{\text{probability without sibling knowledge}} \\
= \frac{p(\text{Sib}_{2}\text{ genotype} \mid \text{Sib}_{1}\text{ genotype})}{p(\text{Sib}_{2}\text{ genotype})} \\
= \frac{\sum_{i = 1}^{9} \frac{p(\text{Sib}_{2}\text{ genotype} \cap \text{parental comb. } i)}{p(\text{parental comb. } i)}\, p(\text{parental comb. } i \mid \text{Sib}_{1}\text{ genotype})}{p(\text{Sib}_{2}\text{ genotype})}
\end{array}

For example, the relative risk of Sib₂ having genotype 'Aa', carrying one copy of the disease allele 'a', given that Sib₁ has genotype 'aa', is:

\begin{array}{l}
\Gamma_{Aa \mid \text{Sib}_{1}aa} = \frac{p(\text{Sib}_{2}Aa \mid \text{Sib}_{1}aa)}{p(\text{Sib}_{2}Aa)} \\
= \frac{\frac{1}{2} p^{2} + p q}{2 p q} \\
= \frac{\frac{1}{2} p + (1 - p)}{2 (1 - p)} \\
= \frac{1 - \frac{1}{2} p}{2 - 2 p}
\end{array}

In this example, at MAF = 0.01, the relative risk of genotype 'Aa' is 25.25, given information that Sib₁ carries genotype 'aa' at that locus. However, at MAF = 0.5, the relative risk of genotype 'Aa' is 0.75, given information that Sib₁ carries genotype 'aa', indicating that the probability of having the genotype 'Aa' is reduced at this MAF. This may seem counterintuitive, as the risk of carrying a disease allele is actually higher at this MAF. The explanation is that the probability of Sib₂ carrying genotype 'Aa' is lower than in the control population, while the relative risk of carrying the disease allele with genotype 'aa' is higher.

\begin{array}{l}
\Gamma_{aa \mid \text{Sib}_{1}aa} = \frac{p(\text{Sib}_{2}aa \mid \text{Sib}_{1}aa)}{p(\text{Sib}_{2}aa)} \\
= \frac{\frac{1}{4} p^{2} + p q + q^{2}}{q^{2}} \\
= \frac{\frac{1}{4} p^{2} + p (1 - p) + (1 - p)^{2}}{(1 - p)^{2}}
\end{array}

At MAF 0.5, Γ_{aa|Sib₁aa} is 2.25, demonstrating that, given the Sib₁ genotype, Sib₂ is more likely than a member of the control population to carry the disease allele in genotype 'aa'.
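The quoted relative risks can be reproduced by evaluating the two expressions above directly. The sketch below transcribes those formulas, with p the major allele frequency and q = 1 − p the MAF; the function names are hypothetical, and the MAF must lie strictly between 0 and 1.

```python
def gamma_Aa_given_aa(maf):
    """Relative risk of Sib2 genotype 'Aa' when Sib1 is 'aa'."""
    p = 1.0 - maf
    return (1.0 - 0.5 * p) / (2.0 - 2.0 * p)


def gamma_aa_given_aa(maf):
    """Relative risk of Sib2 genotype 'aa' when Sib1 is 'aa'."""
    q = maf
    p = 1.0 - q
    return (0.25 * p * p + p * q + q * q) / (q * q)


if __name__ == "__main__":
    for maf in (0.01, 0.5):
        print(maf, gamma_Aa_given_aa(maf), gamma_aa_given_aa(maf))
```

Evaluated at MAF = 0.01 and MAF = 0.5, the first function gives 25.25 and 0.75 (up to floating-point rounding), and the second gives 2.25 at MAF = 0.5, in agreement with the values in the text.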

The explicit probability of developing a disease is also altered. If an individual with genotype 'Aa' at a specific locus has a probability p_d of developing a disease by age a, and that individual has a probability p_s of having that genotype given his sibling's genotype at that locus, his probability of developing that disease by age a is p_s · p_d. This can easily be extended to multiple independent loci, which is important for diseases in which a set of common or rare variants dictates disease likelihood[12, 13]. SNPs have been the focus of our analysis because they are clinically informative and there is a wealth of supporting allele frequency data. However, there are other genomic data types that should be considered in a rigorous privacy and propensity analysis, including copy number variant and mutation data.

Discussion

These findings demonstrate that substantial discrimination and privacy concerns arise from use of inferred familial genomic data. While the Genetic Information Nondiscrimination Act of 2008 (GINA, H.R. 493), recently passed into law, would mitigate the threat of direct discriminatory action by employers or insurers[14], there will continue to be other uses of genomic data that pose privacy risks, including the use of genetic testing in setting life, disability, and long-term care insurance premiums[15]. Familial genotypic sequences can be used to assist in forensic or criminal investigations for indirect identification of genotype, increasing the number of people who may be identified[16, 17]. Similarly, Freedom of Information Act (FOIA)[18] requests related to federally-funded genome wide association studies could potentially be used to identify research participants and their family members. Clinically, choosing the detail and type of disease propensity information that must be disclosed to patients and their potentially affected family members is also under debate[19, 20].

Quantifying the information content of disclosed genomic data will add clarity to the informed consent process when a patient shares genotypic data for research use. For research investigations, it is conceivable that a subject would want to limit the impact of her genomic disclosure on her family members, or be asked to have a discussion with specific family members before proceeding. Providing subjects with different levels of genomic anonymity based on their sequence data, along with an estimate of the probability of re-identification and familial impact for each of those anonymity levels, will allow patients to trade off altruistically motivated sharing[21] with privacy consideration, especially when they volunteer to share all the variants in their genome[22].

While the inference accuracy rates are very high, particularly for inferences where Sib₁ has a homozygous major genotype, we would like to caution that some of these findings are not always highly informative. For example, if the MAF is 0.01, where 99% of the alleles in the population are the major allele, the prior probability for a homozygous major allele is 0.99*0.99 ≅ 0.98. If Sib₁ has a homozygous major allele, the posterior probability of observing a homozygous major allele in another sibling is (1/4 + 1/4*0.99*0.99 + 1/2*0.99) ≅ 0.99. In this case, the difference between prior and posterior probabilities is only 0.01, and knowledge of the Sib₁ genotype provides very little information, as most accuracy comes from the allele frequency in the population.

However, homozygous minor alleles are much more informative. With a MAF of 0.2, if Sib₁ has a homozygous minor genotype, the probability of Sib₂ having the same genotype, given only the reference population is 0.04. Given that Sib₁ has a homozygous minor genotype, Sib₂ will have a homozygous minor allele with probability of (1/4 + 1/4*0.2*0.2 + 1/2*0.2) = 0.36, which is quite different from the prior probability of 0.04.

One limitation of this study is that the population-based estimates for MAF rely on the HapMap study population sizes, which, at present, are small, though these types of sources will continue to expand. For example, the CEPH population contains 90 participants, so each trio child contributes 1/90th of the allele frequency data used in the study. This approach also depends on the independence of the loci considered, and would need to be adapted for SNPs that are in linkage disequilibrium. Extending this study to include linked SNP loci is possible, using the haplotype block information for HapMap populations that is available. To ensure that SNPs are independent, linkage data from the HapMap population can be used to confirm independence, and SNPs that are far from one another may be selected. Additionally, this approach does not consider the possibility of genotypic errors, which may be common on some platforms. An adjustment using a binomial probability distribution could be used to account for possible errors.

Conclusion

Technologies for sequencing large numbers of SNPs are rapidly dropping in cost, which will help realize the promise of personalized medicine, but pose substantial personal and familial privacy risks. While electronic storage and transmission of genetic tests is not yet a common component of medical record data, these tests will soon be stored in electronic medical records and personally controlled health records[23]. This mandates the need for improved informed consent models and access control mechanisms for genomic data. The increasingly common practice of electronically publishing research-related SNP data requires a delicate balance between the enormous potential benefits of shared genomic data through NCBI and other resources, and the privacy rights of both sequenced individuals and their family members.

Appendix

HapMap CEPH and global population SNP genotypes and allele frequency data

The demographic data used in this project are population-specific SNP allele frequencies from the CEPH HapMap population, Utah residents with ancestry from northern and western Europe, and the global SNP allele frequencies (from all populations that participated in the HapMap)[10]. The HapMap project has compiled allele frequency values for a large selection of SNPs, loci in the genome that account for a great deal of genetic variability in populations. Within the CEPH population, there are 30 familial trios, each containing one mother, father, and child. Additionally, the individual genotypes of the 90 CEPH trio participants are directly used in this study. One limitation of this population-specific allele frequency database is the small size of each HapMap population: the CEPH population contains 90 participants, and as such, each trio child contributes 1/90th of the allele frequency data that are used in the study.

Inferring sibling genotypic sequences from HapMap trio children

Here, we explore a specific example of sibling genotypic inference in greater depth, considering the case where one sibling's genotype is known to be 'AA', and the goal is to determine the probability that the second sibling's genotype will also be 'AA' at that locus. The conditional probability expression that sums over the nine possible parental genotypic combinations (for example, maternal genotype 'Aa' with paternal genotype 'AA') at a single SNP, with each specific parental genotypic combination denoted as i can be used:

\begin{array}{l}
p(\text{Sib}_{2}AA \mid \text{Sib}_{1}AA) = \sum_{i = 1}^{9} p(\text{Sib}_{2}AA \mid \text{parental comb. } i)\, p(\text{parental comb. } i \mid \text{Sib}_{1}AA) \\
= \sum_{i = 1}^{9} \frac{p(\text{Sib}_{2}AA \cap \text{parental comb. } i)}{p(\text{parental comb. } i)}\, p(\text{parental comb. } i \mid \text{Sib}_{1}AA)
\end{array}

where Sib₁AA and Sib₂AA refer to Sib₁ and Sib₂ genotypes 'AA' at a selected SNP, respectively.

With unknown parental genotypes, we would calculate p(Sib ₂ AA) considering all nine possible parental genotype combinations, but knowledge that Sib₁ has genotype 'AA' allows exclusion of any parental combinations where either parent has genotype 'aa', as that would require the transmission of at least one copy of the 'a' allele to Sib₁, if non-paternity and new mutations are excluded.

For example, when the child is homozygous major, all possible parental genotypic candidates that involve one or both parent genotypes of 'aa' are excluded, as it is not possible to have a child with genotype 'AA' if either parent does not have at least one copy of the 'A' allele. In this case, there are four possible parental genotypic combinations:

\begin{array}{l}
p(\text{Sib}_{2}AA \mid \text{Sib}_{1}AA) = \sum_{i = 1}^{4} \frac{p(\text{Sib}_{2}AA \cap \text{parental comb. } i)}{p(\text{parental comb. } i)}\, p(\text{parental comb. } i \mid \text{Sib}_{1}AA) \\
= \left(\frac{p(\text{Sib}_{2}AA \cap AA_{M}AA_{F})}{p(AA_{M}AA_{F})}\right) p(AA_{M}AA_{F} \mid \text{Sib}_{1}AA) + \left(\frac{p(\text{Sib}_{2}AA \cap AA_{M}Aa_{F})}{p(AA_{M}Aa_{F})}\right) p(AA_{M}Aa_{F} \mid \text{Sib}_{1}AA) \\
+ \left(\frac{p(\text{Sib}_{2}AA \cap Aa_{M}AA_{F})}{p(Aa_{M}AA_{F})}\right) p(Aa_{M}AA_{F} \mid \text{Sib}_{1}AA) + \left(\frac{p(\text{Sib}_{2}AA \cap Aa_{M}Aa_{F})}{p(Aa_{M}Aa_{F})}\right) p(Aa_{M}Aa_{F} \mid \text{Sib}_{1}AA) \\
= (1)(p^{2}) + \left(\frac{1}{2}\right)(p q) + \left(\frac{1}{2}\right)(p q) + \left(\frac{1}{4}\right)(q^{2}) \\
= p^{2} + p q + \frac{q^{2}}{4} \\
= p^{2} + \left[p q + \frac{q^{2}}{4}\right]
\end{array}

which allows calculation directly from the SNP population frequencies. Before knowledge of the Sib₁ genotype was used, p(Sib₂AA) would have been the Hardy-Weinberg frequency for major homozygotes, p². However, with the Sib₁ genotype, p(Sib₂AA|Sib₁AA), the additional constraint increases the probability to p² + pq + (q²/4), increasing inference accuracy by pq + (q²/4).

The remaining entries in the probability vector, p(Sib ₂ Aa|Sib ₁ AA), and p(Sib ₂ aa|Sib ₁ AA), can then be calculated just as we have done for p(Sib ₂ AA|Sib ₁ AA) above. Again, these probabilities have been generated without any actual knowledge of the parent genotypes. If the Sib₁ genotype were instead 'Aa' or 'aa', the above technique can similarly be used (with a different combination of possible parental genotypes) to calculate the two other probability vectors, [p(Sib ₂ AA|Sib ₁ Aa), p(Sib ₂ Aa|Sib ₁ Aa), p(Sib ₂ aa|Sib ₁ Aa)] and [p(Sib ₂ AA|Sib ₁ aa), p(Sib ₂ Aa|Sib ₁ aa), p(Sib ₂ aa|Sib ₁ aa)].
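As a worked sketch of that calculation, repeating the sum for each Sib₁ genotype gives closed forms for all three probability vectors. These expressions are derived here under the same assumptions as above (Hardy-Weinberg parental genotypes, q = 1 − p, no mutation or non-paternity) and are not quoted from Table 1; each row sums to p² + 2pq + q² = 1, which serves as a consistency check.

\begin{array}{l}
\left[p(\text{Sib}_{2}AA \mid \text{Sib}_{1}AA),\ p(\text{Sib}_{2}Aa \mid \text{Sib}_{1}AA),\ p(\text{Sib}_{2}aa \mid \text{Sib}_{1}AA)\right] = \left[p^{2} + p q + \frac{q^{2}}{4},\ \ p q + \frac{q^{2}}{2},\ \ \frac{q^{2}}{4}\right] \\
\left[p(\text{Sib}_{2}AA \mid \text{Sib}_{1}Aa),\ p(\text{Sib}_{2}Aa \mid \text{Sib}_{1}Aa),\ p(\text{Sib}_{2}aa \mid \text{Sib}_{1}Aa)\right] = \left[\frac{p^{2}}{2} + \frac{p q}{4},\ \ \frac{p^{2} + 3 p q + q^{2}}{2},\ \ \frac{q^{2}}{2} + \frac{p q}{4}\right] \\
\left[p(\text{Sib}_{2}AA \mid \text{Sib}_{1}aa),\ p(\text{Sib}_{2}Aa \mid \text{Sib}_{1}aa),\ p(\text{Sib}_{2}aa \mid \text{Sib}_{1}aa)\right] = \left[\frac{p^{2}}{4},\ \ \frac{p^{2}}{2} + p q,\ \ \frac{p^{2}}{4} + p q + q^{2}\right]
\end{array}

The first entry of the first row and the last two entries of the third row agree with the expressions used elsewhere in the text for p(Sib₂AA|Sib₁AA), p(Sib₂Aa|Sib₁aa), and p(Sib₂aa|Sib₁aa).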

Validating the sibling genotype probability vector using parental genotypic data

To validate the results of the refining strategy on inferring the second sibling genotype, the authentic parental genotypes are used to create the probability vector p('AA'), p('Aa'), p('aa') at the SNP being evaluated, for the children the pair would be expected to have. For each of the trio pairs at each of the SNPs being tested, the probability vector was calculated.

Error reduction calculation

The error reduction measurement identifies the extent to which inference error is reduced. For example, where we are trying to infer the probability that Sib₂ has genotype 'AA' at a specific SNP, we calculate the absolute value of the difference between our best inference and the Hardy Weinberg probability for Sib₂ to have genotype 'AA', using population-specific allele frequency data and the Sib₁ genotype, |p(Sib ₂ AA|Sib ₁ genotype)-p(Sib ₂ AA)|. This value is specifically the improvement to the probability value from the new data, when inferring the specific event that Sib₂ will have genotype 'AA' and Sib₁ will have the specific genotype in question.

Any change to p(Sib ₂ AA) must also correspond with the opposite change in the sum of p(Sib ₂ Aa) and p(Sib ₂ aa). To accurately represent the overall error reduction by Sib₁ genotype, with any of three possible Sib₂ genotypes, the average of the three values is measured. For example, where the Sib₁ genotype is 'AA', the overall average improvement (and error reduction) is the average of |p(Sib ₂ AA) - p(Sib ₂ AA|Sib ₁ AA)|, |p(Sib ₂ Aa) - p(Sib ₂ Aa|Sib ₁ AA)|, and |p(Sib ₂ aa) - p(Sib ₂ aa|Sib ₁ AA)|.
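The averaging step can be sketched as follows. The function `error_reduction` is a hypothetical helper that takes a prior and a posterior genotype vector; the example values assume Sib₁ is 'AA' at a SNP with major allele frequency p = 0.8, with the posterior entries computed from p² + pq + q²/4 and the corresponding 'Aa' and 'aa' terms under Hardy-Weinberg proportions.

```python
def error_reduction(prior, posterior):
    """Mean absolute change in probability across the three Sib2 genotypes."""
    return sum(abs(posterior[g] - prior[g]) for g in prior) / len(prior)


if __name__ == "__main__":
    p = 0.8
    q = 1.0 - p
    # Hardy-Weinberg prior for Sib2 with no sibling information
    prior = {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}
    # Posterior for Sib2 given Sib1 = 'AA' (assumed closed forms)
    posterior = {
        "AA": p * p + p * q + q * q / 4,
        "Aa": p * q + q * q / 2,
        "aa": q * q / 4,
    }
    print(error_reduction(prior, posterior))
```

In this example the three absolute differences are 0.17, 0.14, and 0.03, so the average error reduction is 0.34/3, or about 0.113.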

Scoring metric for calculating correct fraction of inferences

To ascertain whether the inferences are helpful for producing correct answers, a scoring metric was used to calculate the fraction of correct SNP inferences, in our empirical inference validation study. For each SNP inference, the scoring metric provides a full point when the plural entry in the inference vector, (the maximum of p('AA'), p('Aa'), and p('aa'), and thus the predicted sib genotype), matches the plural entry in the parental validation vector (the empirical most likely genotype). Given the parental genotype values, it is possible, and not infrequent, that a validation probability vector has two matching plural values, for example, if p('AA') = p('Aa') = 0.5. When this is the case, one half point was awarded if the plural value in the inference vector matched one of the two validation choices, to signify that one of the two equally likely candidates was chosen.

There are some conditions that arise from use of a simple scoring metric, where it becomes difficult to score well. For example, a heterozygous Sib₁ will likely result in a 0.5 score for inferences. A score of 1 point would be possible if one parent had a genotype of 'AA' and the other had genotype 'aa', making the probability that the parents would have a child with genotype 'Aa' equal 1. Most remaining parental combinations would not result in the probability of child genotype 'Aa' equal to 1, and would likely result in only a half point. These values can be adjusted using machine learning techniques or more robust decision making algorithms, but those are out of the scope of this work.
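The scoring rule described above can be sketched in a few lines. The helper names are hypothetical, and two details are assumptions of this sketch rather than statements from the study: ties in the validation vector are detected with a small numerical tolerance, and a tie in the inference vector is broken by taking the first maximal entry.

```python
def plural_entries(vector, tol=1e-9):
    """Genotypes whose probability equals the maximum (within tol)."""
    top = max(vector.values())
    return {g for g, v in vector.items() if abs(v - top) <= tol}


def score_inference(inference, validation):
    """Full point for a unique match, half point when the validation
    vector has tied plural values and one of them was predicted."""
    predicted = max(inference, key=inference.get)
    best = plural_entries(validation)
    if predicted not in best:
        return 0.0
    return 1.0 if len(best) == 1 else 0.5


if __name__ == "__main__":
    inferred = {"AA": 0.81, "Aa": 0.18, "aa": 0.01}
    # Parents AA x AA: every child is 'AA'
    print(score_inference(inferred, {"AA": 1.0, "Aa": 0.0, "aa": 0.0}))
    # Parents AA x Aa: 'AA' and 'Aa' are equally likely
    print(score_inference(inferred, {"AA": 0.5, "Aa": 0.5, "aa": 0.0}))
    # Parents AA x aa: every child is 'Aa'
    print(score_inference(inferred, {"AA": 0.0, "Aa": 1.0, "aa": 0.0}))
```

The three calls score 1.0, 0.5, and 0.0 respectively. Averaging such scores over all SNPs and trios would give the fraction of correct inferences.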

References

Adida B, Kohane IS: GenePING: secure, scalable management of personal genomic data. BMC Genomics. 2006, 7: 93. 10.1186/1471-2164-7-93.
Hoffman MA: The genome-enabled electronic medical record. J Biomed Inform. 2007, 40 (1): 44-46. 10.1016/j.jbi.2006.02.010.
Kaiser J: Genomic databases. NIH goes after whole genome in search of disease genes. Science. 2006, 311 (5763): 933. 10.1126/science.311.5763.933a.
Thomas DC: Are we ready for genome-wide association studies?. Cancer Epidemiol Biomarkers Prev. 2006, 15 (4): 595-598. 10.1158/1055-9965.EPI-06-0146.
Lin Z, Owen AB, Altman RB: Genetics. Genomic research and human subject privacy. Science. 2004, 305 (5681): 183. 10.1126/science.1095019.
Malin BA, Sweeney LA: Inferring genotype from clinical phenotype through a knowledge based algorithm. Pac Symp Biocomput. 2002, 41-52.
Lowrance WW, Collins FS: Ethics. Identifiability in genomic research. Science. 2007, 317 (5838): 600-602. 10.1126/science.1147699.
Malin BA: An evaluation of the current state of genomic data privacy protection technology and a roadmap for the future. J Am Med Inform Assoc. 2005, 12 (1): 28-34. 10.1197/jamia.M1603.
Brenner CH, Weir BS: Issues and strategies in the DNA identification of World Trade Center victims. Theor Popul Biol. 2003, 63 (3): 173-178. 10.1016/S0040-5809(03)00008-X.
A haplotype map of the human genome. Nature. 2005, 437 (7063): 1299-1320. 10.1038/nature04226.
Olivier M: A haplotype map of the human genome. Physiol Genomics. 2003, 13 (1): 3-9.
Hirschhorn JN, Daly MJ: Genome-wide association studies for common diseases and complex traits. Nat Rev Genet. 2005, 6 (2): 95-108. 10.1038/nrg1521.
Sebastiani P, Ramoni MF, Nolan V, Baldwin CT, Steinberg MH: Genetic dissection and prognostic modeling of overt stroke in sickle cell anemia. Nat Genet. 2005, 37 (4): 435-440. 10.1038/ng1533.
Holden C: Genetic discrimination. Long-awaited genetic nondiscrimination bill headed for easy passage. Science. 2007, 316 (5825): 676. 10.1126/science.316.5825.676b.
Hudson KL, Holohan MK, Collins FS: Keeping pace with the times--the Genetic Information Nondiscrimination Act of 2008. N Engl J Med. 2008, 358 (25): 2661-2663. 10.1056/NEJMp0803964.
Bieber FR, Brenner CH, Lazer D: Human genetics. Finding criminals through DNA of their relatives. Science. 2006, 312 (5778): 1315-1316. 10.1126/science.1122655.
Bieber FR, Lazer D: Guilt by association: should the law be able to use one person's DNA to carry out surveillance on their family? Not without a public debate. New Sci. 2004, 184 (2470): 20.
Freedom of Information Act. 5 USC 552. 1996
Kohut K, Manno M, Gallinger S, Esplen MJ: Should healthcare providers have a duty to warn family members of individuals with an HNPCC-causing mutation? A survey of patients from the Ontario Familial Colon Cancer Registry. J Med Genet. 2007, 44 (6): 404-407. 10.1136/jmg.2006.047357.
Offit K, Groeger E, Turner S, Wadsworth EA, Weiser MA: The "duty to warn" a patient's family members about hereditary disease risks. JAMA. 2004, 292 (12): 1469-1473. 10.1001/jama.292.12.1469.
Kohane IS, Altman RB: Health-information altruists--a potentially critical resource. N Engl J Med. 2005, 353 (19): 2074-2077. 10.1056/NEJMsb051220.
Church GM: The personal genome project. Mol Syst Biol. 2005, 1: 2005 0030.
Simons WW, Mandl KD, Kohane IS: The PING personally controlled electronic medical record system: technical architecture. J Am Med Inform Assoc. 2005, 12 (1): 47-54. 10.1197/jamia.M1592.

Pre-publication history

The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1755-8794/1/32/prepub

Acknowledgements

The authors would like to gratefully acknowledge the assistance of Dr. John Tsitsiklis and Dr. Shannon Wieland for discussion of probabilistic techniques and support from the National Library of Medicine, National Institutes of Health grant R01-LM009375-01A1.

Author information

Authors and Affiliations

Children's Hospital Informatics Program at the Harvard-MIT Division of Health Sciences and Technology, Boston, MA, USA
Christopher A Cassa, Isaac S Kohane & Kenneth D Mandl
Clinical Decision Making Group, CSAIL, Massachusetts Institute of Technology, Cambridge, MA, USA
Christopher A Cassa & Brian Schmidt
Harvard Medical School, Boston, MA, USA
Isaac S Kohane & Kenneth D Mandl

Corresponding author

Correspondence to Christopher A Cassa.

Additional information

Competing interests

The authors declare there are no competing interests.

Authors' contributions

CC conceived of the study design, carried out the statistical analysis, generated the figures, and drafted the manuscript. BS carried out experiments using HapMap data and imputed family data. KM helped draft and revise the manuscript, and helped perform the statistical analysis. ZK assisted in conception of the study and critical review of the manuscript. All authors read and approved the final manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Cassa, C.A., Schmidt, B., Kohane, I.S. et al. My sister's keeper?: genomic research and the identifiability of siblings. BMC Med Genomics 1, 32 (2008). https://doi.org/10.1186/1755-8794-1-32

Received: 26 November 2007
Accepted: 25 July 2008
Published: 25 July 2008
DOI: https://doi.org/10.1186/1755-8794-1-32
