Abstract
The goals of our analysis were to map functional loci, which contribute to the casecontrol status of a trait of interest, using large pedigrees. We used logistic regression fitted with the generalized estimation equation to test associations between a dichotomous phenotype and all genotyped common and rare singlenucleotide polymorphisms. In addition to the association study, we also developed and applied a simple and fast identicalbydescentbased test to identify loci that were shared among affected individuals more often than expected by chance. Among the top significant loci, we assessed the statistical power and the false discovery rate of both methods. We also demonstrated that familybased studies, compared with the standard populationbased association studies, have great values and advantages for the discovery of multiple rare causal variants.
Background
Populationbased genomewide association studies (GWAS) using unrelated individuals are becoming increasingly popular in genetic research. Recent large GWAS have shown that common genetic variants are involved in common diseases, but most of the variants found in this way account for only a small portion of the trait variance. On the other hand, accumulating evidence from candidategenebased resequencing suggests that many rare genetic variants contribute to the trait variance of common diseases. Pedigree resources are conventionally believed to be powerful for identifying rare variants and are considered appropriate for the application of linkage strategies. However, linkage analysis often requires nuclear families characterized by multiple informative offspring and is more applicable for quantitative traits [1,2]. Other methods, which exist to appropriately perform association analyses of dichotomous traits with extended pedigrees, are often computationally impractical for largescale genomewide association analyses. Hence we are motivated to explore alternative strategies that can be efficiently carried out on a genomewide scale and can appropriately handle familial relationships in arbitrarystructured pedigrees.
The family data from Genetic Analysis Workshop 17 (GAW17) provide us with a great opportunity to investigate appropriate approaches for the familybased association tests. The approaches we consider in this work include fitting a logistic regression model with adjustments of covariates and a novel approach using identicalbydescent (IBD) measures.
Methods
We used the family data sets provided by GAW17. This data set had 697 subjects (209 affected and 488 unaffected individuals) from 8 extended families and fully informative markers for 3,205 genes. Assuming that recombination was not allowed within genes, IBD scores were also provided at each gene location (see Almasy et al. [3] for additional details of the GAW17 data). Using the casecontrol data sets, we explored two different types of methods: association analysis and IBD linkage analysis.
Association analyses
Throughout this paper, we assume an additive model for the genetic effects. The association analyses were based on the logistic regression model. Preliminary analysis showed that both Age and Smoke are independent risk factors and are significantly associated with the casecontrol status. Hence the resulting final logistic regression model is:
where y_{i} is the affected status of individual i, case samples are coded 1 and control samples are coded 0, and g_{i} is the genotype of individual i at a given SNP marker. The intercept parameter μ is called the base line odds and the β parameters represent log odds ratio corresponding to the variables used in the model. Assuming that minor allele B is the risk allele, g_{i} is coded 0, 1, or 2 corresponding to genotypes AA, AB, or BB, respectively.
Because the individuals are no longer independent in the large pedigrees, the joint likelihood function for all individuals has a complicated form. We used the generalized estimation equation (GEE) [4,5] with fixed covariance structure to fit the logistic regression (Eq. (1)). We computed the kinship matrix of the 697 individuals using the kinship program in the R package [6] and used the kinship matrix to specify the covariance matrix in the GEE.
Singlenucleotide polymorphisms (SNPs) with minor allele frequency (MAF) less than 1% were considered rare in our work. In the first stage, we excluded rare SNPs before the association analysis and considered only common causal variants. A rare variant, as part of a group of rare variants in the same gene, is also more likely to contribute to the susceptibility of a disease. However, if we apply logistic regressions directly on the rare variants, the GEE will fail because of the singularity in the correlation matrix. The association methods developed for common variants will have limited efficiency for mapping rare variants unless enormous sample sizes exist. Grouping and collapsing rare variants into meaningful groups (e.g., by functional genes, by pathways) has been shown to be a feasible option to improve efficiency in studying rare variants [7]. Therefore, in the second stage of analysis, we collapsed rare SNPs within the same gene and evaluated their association with the disease trait. We noticed that for 99% of the 3,205 genes, the chance for an individual to carry 4 or more rare alleles in a gene is no larger than 0.0072. Hence we define a combined genotype of a set of grouped rare variants based on the total count of rare alleles as follows:
After assigning a collapsed genotype to grouped rare variants, we evaluate the association between collapsed genotypes and disease outcomes using the logistic model specified in Eq. (1).
IBD analysis
Phenotypes of relatives are similar because relatives share similar environment and similar genetic variants (genotypes or haplotypes). Genotypes are similar because these relatives share genes that are identical by descent. Intuitively, the diseaseassociated loci are more likely to be similar in case samples than in control samples. Comparing the distributions of IBD scores between arbitrary pairs of case subjects and pairs of control subjects appears to be a promising approach to identify loci associated with a trait. Based on this general idea, we formulated a new test using a 2 × 3 contingency table that can identify loci shared among affected individuals more often than expected by chance. Our algorithm considers all relationships simultaneously and can be used to test association in pedigrees of arbitrary size.
Suppose that K extended families with multiple affected offspring are randomly collected from a natural human population. Consider a SNP marker with two alleles, A and B, that is genotyped for all individuals in these K extended families. For an arbitrary pair of samples i and j within a family k, the IBD score between them at a SNP marker is defined as 0, 0.5, or 1 if 0, 1, or 2 of their shared alleles, respectively, arose from the same allele in an earlier generation. Between any two individuals, there are 14 combinations of their IBD scores and genotypes. We list all 14 states in Table 1 and denote these states by S_{1}, S_{2}, …, S_{14}. Assuming that allele B is the disease allele, a pair of individuals shares exactly two copies of inherited B alleles only in S_{1}, exactly one copy in S_{2}, S_{4}, and S_{7}, and one copy with probability 0.5 in S_{8}.
Table 1. IBD configurations for pairs of individuals
Under the null hypothesis of no association, the frequencies of the 14 states are the same in affected and unaffected samples. Consequently, the pairs of affected and pairs of unaffected individuals have equal frequencies of sharing zero, one, or two copies of B alleles IBD. On the other hand, under the alternative hypothesis of association, we expect to observe that pairs of affected individuals share at least one copy of the B allele with greater chance. This also means, under the alternative hypothesis, that we expect that affected individuals involved in forming pairs will share at least one copy of the inherited B allele with a greater chance than unaffected individuals.
Based on this general idea, we tabulate affected individuals and unaffected individuals by a 2 × 3 contingency table in the following way. The affected individuals can be separated into three groups: Group 1 contains those individuals who form pairs IBD at both alleles; group 2 contains individuals who form pairs IBD at exact one allele and who are not in group 1; and group 3 contains the rest of the individuals. Similarly, unaffected individuals can be assigned to the three groups in the same manner. The cells of Table 2 correspond to the counts of affected and unaffected individuals in the three groups.
Table 2. Contingency table
Next, we demonstrate this tabulating procedure using a toy. In our example, 13 individuals form two threegeneration pedigrees (as illustrated in Figure 1). Individuals 1, 6, 7, 10, 12, and 13 are affected; individuals 2, 3, 4, 5, 8, 9, and 11 are unaffected. Among the affected individuals, individuals 6 and 7 form a pair (6, 7) with two IBD alleles. We assign them to group 1, and the cell count nBB^{1} of cell (1,1) is 2. Individuals who form pairs IBD at exact one allele include individuals 1, 6, 7, and 12. Because individuals 6 and 7 have already been assigned to group 1, only individuals 1 and 12 are assigned to group 2. Note that, individuals 10 and 13 form a pair (10, 13) at S_{8} (see Table 1). Because they share exactly one copy of IBD allele B with probability 0.5, they are assigned to group 2 with probability 0.5. Therefore the cell count nB^{1} of cell (1,2) is 2 + 0.5(2) = 3, and the cell count n^{0} of cell (1,3) is 0.5(2) = 1. Similarly, among the unaffected individuals, none of them form pairs with two IBD alleles; individuals 2 and 8 form a pair (2, 8), which has one IBD allele; and the rest of the unaffected individuals, individuals 3, 4, 5, 9 and 11, form pairs with zero IBD alleles. Hence the cell counts nBB^{0}, nB^{0}, and n^{0} of cell (2,1), cell (2,2), and cell (2,3) are 0, 2, and 5, respectively.
Figure 1. Receiver operating characteristic curves. Plot of the truepositive rate against the falsepositive rate for the different possible cutpoints of a diagnostic test. These ROC curves are for the two familybased approaches and the populationbased association analysis that we performed using GAW17 familybased data sets and populationbased data sets. The shape of each curve indicates the quality of the corresponding method. The closer a ROC curve is to the diagonal of the plot, the worse the corresponding method is. A hypothetical perfect method would have a ROC curve that is a constant function with truepositive value 1 and falsepositive value 0. The two familybased approaches (the logistic regression using GEE and the IBD analysis) performed significantly better than the standard populationbased association study.
Using Table 2, we formulated our IBD test as a chisquare test, and the test statistic is:
Under the null hypothesis H_{0}, the expected cell frequencies of the affected individuals should be the same as the expected cell frequencies of the unaffected individuals, and the test statistic has an asymptotic chisquared distribution with two degrees of freedom.
One great advantage of this IBDbased method is that we can apply it not only to the common variants but also to the rare variants. However, when the MAF (the frequency of the B allele) is small, we may observe extremely low cell counts for nBB^{1} and nBB^{0}. Under such circumstances, no longer follows a chisquare distribution, and it is more appropriate for us to evaluate the significance of the association through permutation testing. We propose to obtain the empirical pvalues by shuffling the affected statuses of samples within each family. And as a byproduct, possible biases caused by different family sizes or family structures will also be corrected. In our toy example, the chisquare test pvalue (see Table 3 for the observed and expected cell counts) was 0.089, and the resulting empirical pvalue after 5,000 rounds of permutations was 0.413.
Table 3. Contingency table of the toy example
Results
We first compared the performances of two different familybased approaches (association analysis and IBD analysis) in terms of detection power and false discovery rate using the 200 simulated extended family data sets in GAW17. We then carried out standard stratified association analyses with the populationbased data sets so that we could further investigate the application and the value of using familybased approaches. In particular, we were keen to find out whether familybased approaches have certain advantages in discovering rare causal variants.
Familybased association analysis
The association analyses were performed by fitting the logistic regression using the GEE on all common SNPs (MAF > 0.01) and combining rare SNPs. Within each gene, we first made corrections for the false discovery rate (FDR) [8]; then we selected the minimum pvalue to present the genelevel pvalues. In this way, we were able to make fair comparisons across genes with different sizes. We calculated the correlations of these FDRcorrected [8] genelevel pvalues for all possible pairs out of 200 rounds of simulations, and the average correlation of these across all 200 rounds of simulations was 0.965, which suggested that the results given by the association analyses were fairly stable. We noticed that the association tests were still inflated with some false positives, as shown in the QQ plots (Figure 2b). The estimated inflation factors ranged from 1.02 to 1.36.
Figure 2. Pedigrees of two hypothetical threegeneration extended families
We also evaluated falsepositive rates. We declared the significance of the genelevel pvalues with an arbitrary cutoff value α. Genes with pvalues less than 0.05 were considered detected in each simulation. Averaging over 200 replicates, the chance of wrongly declaring noncausal genes (over all 3,205 genes) was 0.096 and 0.033 at an α level of 0.05 and 0.01, respectively.
We further evaluated the power of detecting true functional genes at an α level of 0.05 and 0.01, respectively. Among all 3,205 genes, the familybased association study successfully identified gene VEGFA as the most significant gene with a power as high as 81% at α = 0.05. Four other causal genes with the highest discovery rates were LPL, VNN1, SHC1, and SIRT1. Compared to the other 31 causal genes, these 5 genes have both strong effects and relatively high frequencies of carrying risk variants.
IBD analysis
The proposed IBD test was performed on all common and rare SNPs. Raw pvalues of all SNPs were the empirical pvalues based on 10,000 permutations. Within each gene, the FDRcorrected minimum pvalues were used to present the genelevel significance. As shown in Figure 3, the sensitivity of the IBD analysis approach was comparable to the association analysis approach. However, the results from the proposed IBD test were not as stable as the results from the association analyses; the average correlation of the FDRcorrected genelevel pvalues across all 200 rounds of simulations was 0.461. As shown in the QQ plot in Figure 3c, the IBD analyses were inflated with a certain amount of false positives in most of the replicates, and in other replicates our IBD tests were underpowered. The estimated inflation factors of the IBD tests ranged from 0.86 to 1.41, and the actual falsepositive rates at α = 0.05 and α = 0.01 were 0.112 and 0.039, respectively. The five causal genes with the highest discovery rates were VEGFC, SHC1, VEGFA, HIF3A, and SIRT1. At a significance level of 0.05, our IBD test successfully picked up genes VEGFC and SHC1 with a power of 82% and 67%, respectively, which ranked them first and fifteenth out of all 3,205 genotyped genes.
Figure 3. QQ plots of three analyses using population data sets and family data sets. The panels display QQ plots of FDRcorrected genelevel pvalues from all 200 replicates. Green dots are based on replicate 5. (a) Populationbased casecontrol association studies, using the stratified analysis. (b) Familybased association analyses obtained by fitting logistic regression using the GEE. (c) Familybased IBD analyses, in which raw pvalues of all rare and common variants are empirical pvalues. These empirical pvalues are based on 10,000 rounds of permutations.
Comparing familybased approaches with the populationbased association analysis
As a comparison to the familybased analyses, we also performed a standard stratified analysis on the 200 simulated population casecontrol data sets. We assessed the power of populationbased studies for all causal genes based on the outputs from the PLINK computer program. FLT1, which has three common variants and large genetic effects, is the only gene with a detection power greater than 50%. For most other causal genes with rare variants, familybased studies had better power of detection than populationbased studies. The QQ plot in Figure 3a shows that the standard stratified analyses were also inflated with some false positives. The estimated inflation factors ranged from 1.02 to 1.37, and the actual falsepositive rates at α = 0.05 and α = 0.01 were 0.089 and 0.029, respectively. Although all three types of analyses have comparable falsepositive rates, the receiver operating characteristic (ROC) curves (Figure 1) show that the two familybased analyses clearly performed significantly better than the standard populationbased association study in terms of having higher sensitivity.
Conclusions
In this work, our analyses provide new insights into the genetic studies of family data with large extended pedigrees for dichotomous traits. We tested the utility of two types of familybased approaches: a logistic regression fitted using GEE and our proposed IBD test. The logistic regression with GEE was used to test the associations in large pedigrees while simultaneously controlling for environmental covariates. To properly handle the rare variants, we provide an operable and straightforward scheme to collapse the rare variants within a gene. This simple collapsing strategy was shown to be useful. As an example, we had a reasonable power for identifying causal gene SIRT1, which is enriched with rare causal variants. We also showed that the linkage of disease alleles can be tested based on IBD scores in extended pedigrees. By incorporating information from not only parents and siblings but also other ancestors, we developed a chisquare test based on a 2 × 3 contingency table. Our IBD test provides an attractive alternative to the conventional tests because it is computationally fast and does not require one to specify an inheritance model explicitly.
We compared the performance of familybased studies using these two approaches with the performance of the populationbased studies using the standard stratified analysis. Populationbased studies seemed to have better power for detecting common variants, and the familybased studies seemed to have better power for detecting rare variants. If a risk allele is present in early founders and the effects of risk alleles are relatively large, the IBD analyses clearly outperformed the familybased association analyses. We noticed that, even for rare variants that have extremely low frequencies or that are found in only a few families, the IBD analysis has a good chance of picking up them. For example, in this work, the power of detecting risk genes VEGFC and HIF3A using the IBD tests was significantly higher than the power obtained using the logistic regressions. In other instances, familybased association analyses had better power to detect genes that have relatively high frequencies of risk alleles and relatively large genetic effects. In the worst scenario, if the risk allele was extremely rare, not present in early founders, and of small genetic effect, both methods failed. Because the two familybased approaches have their own advantages for dealing with rare variants, these two approaches can potentially compensate each other. Combining these two types of analysis may be a more powerful solution in the search for causal variants.
Although hunting for rare causal variants using family data sets seems promising, many practical issues need to be addressed before the effectiveness of familybased analyses can be fully recognized. For example, the IBD test can produce unstable results, and it is not very straightforward to obtain the genelevel IBD scores in the first place. Also, we noticed that all analyses were inflated with some false positives. We suspect that the inflated false positives may be caused by those nonfunctional variants whose genotypes are highly correlated with the functional variants. Luedtke et al. [9] offers a detailed discussion of these socalled spurious associated genes. Other possible sources of inflated false positives in practical studies include insufficient correction of population stratification, inappropriate handling of rare variants, and the effects of linkage disequilibrium structures. We will further investigate the influence of these possible sources in our future studies.
Competing interests
The authors declare that there are no competing interests.
Authors’ contributions
TL and TA conceived the project. TL performed the statistical analysis. Both the authors read and approved the final manuscript.
Acknowledgments
We thank the organizers of Genetic Analysis Workshop 17 for providing the exome data set. The Genetic Association Workshop is supported by National Institutes of Health grant R01 GM031575. We also thank all members of the Human Genetics Groups at the Genome Institute of Singapore for useful comments and inputs.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/17536561/5?issue=S9.
References

Weeks ED, Lang K: The affectedpedigreemember method of linkage analysis.
Am J Hum Genet 1988, 42:315326. PubMed Abstract  PubMed Central Full Text

Risch N: Linkage strategies for genetically complex traits. II. The power of affected relative pairs.
Am J Hum Genet 1990, 46:229241. PubMed Abstract  PubMed Central Full Text

Almasy LA, Dyer TD, Peralta JM, Kent JW Jr, Charlesworth JC, Curran JE, Blangero J: Genetic Analysis Workshop 17 miniexome simulation.
BMC Proc 2011, 5(suppl 9):S2. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Liang KY, Zeger SL: Longitudinal data analysis using generalized linear models.
Biometrika 1986, 73:1322. Publisher Full Text

Liang KY, Zeger SL: Longitudinal data analysis for discrete and continuous outcomes.
Biometrics 1986, 42:121130. PubMed Abstract  Publisher Full Text

Carey VJ: Ported to R by Thomas Lumley (versions 3.13 and 4.4) and Brian Ripley (version 4.13): GEE—Generalized Estimation Equation solver [4.13]. [http://cran.rproject.org/] webcite
2007.

Asimit J, Zeggini E: Rare variant association analysis methods for complex traits.
Annu Rev Genet 2010, 44:293308. PubMed Abstract  Publisher Full Text

Benjamini Y, Hochberg YJ: Controlling the false discovery rate: a practical and powerful approach to multiple testing.

Luedtke A, Powers S, Petersen A, Sitarik A, Bekmetjev A, Tintle NL: Evaluating methods for the analysis of rare variants in sequence data.
BMC Proc 2011, 5(suppl 9):S119. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text