To detect rare variants associated with a phenotype, we develop a novel statistical method that can use both family and unrelated case-control data. Unlike the currently existing methods, we first use family data to calculate weights to be given to rare variants, differentiating between concordantly affected and discordant sib pairs. These weights are then used in an association test applied to the unrelated case-control data. We applied the proposed method to the simulated sequencing data in Genetic Analysis Workshop 17 and identified two genes associated with the disease.
Genome-wide association studies, which are based on the common disease/common variants assumption, have successfully identified susceptibility loci for complex traits. However, the variants discovered through these studies explain only a modest portion of the trait variability . With the new technological advances, it has been suggested that it is time to shift the search from common variants of modest effect to rarer variants of large effect by effectively searching the full genome . Rare variants may hold promise to predict individual risk for personalized medicine because of their large effect, although it has been argued that common variants illuminate the biologic pathways that underlie diseases .
Bodmer and Tomlinson  suggested that a set of low-frequency variants from different genes can account for a significant proportion of the variability of relatively common diseases. To achieve reasonable statistical power, it is critical to define the rare variants and test them collectively. The existing statistical methods in the literature mainly collapse rare variants . Madsen and Browning  proposed using the inverse of the variance of the minor allele frequency (MAF) in control subjects as a weight and then collapsing the weighted rare variants.
Briefly, for the ith individual Madsen and Browning  define a genetic score:
where L is the number of variants, gij is the genotypic score, and wj is the weight for the jth single-nucleotide polymorphism (SNP). The weight wj is defined as the inverse of the jth SNP’s standard deviation estimated in control subjects when the corresponding MAF is less than α (such as 0.02) and 0 otherwise. Then the Wilcoxon rank sum test is applied to do the association test. Madsen and Browning rank the genetic scores, calculate the sum of the ranks for case subjects as:
and calculate the p-value using a permutation strategy. That is, they permute disease status among individuals 1,000 times to compute 1,000 statistics X, denoted , , …, . Then they use the sample mean and the standard deviation of , , …, to calculate the test statistic:
which follows approximately a standard normal distribution under the null hypothesis. Madsen and Browning  demonstrated that this weighted-sum method is more powerful than the collapsing method .
Recently, we have demonstrated that family data are useful for searching for rare variants [7,8], because the rare variants can be substantially enriched among segregating family members. Here, we present a statistical method to test rare variants by using both family and unrelated case-control sequencing data.
Defining a weight for each SNP using the family data set
We start by assuming that we have L variants belonging to a group (gene, pathway, specific genomic region, etc.). We have shown that family data, such as affected sib pairs, have enriched information for detecting rare disease-associated variants because the same disease variants are more likely to segregate within a family . Thus each family can be considered homogeneous; that is, affected family members share the same or allelic disease variants, the latter being an idea used in traditional linkage analysis. Our method uses either affected sib pairs or discordant sib pairs to determine weights for such rare variants.
Assume that a SNP has two alleles A and a, where A always refers to the minor allele. For the ith sib pair and the jth SNP, we define a genotype score . Let α be a predefined threshold. We always let if a SNP has a MAF greater than α. For those SNPs with MAF ≤ α , we define as follows. For affected sib pairs, if both sibs carry A, if neither sib carries A, and if one sib carries A and the other does not, where K is the number of other SNPs in the region with MAF <α carried by the other sib. For discordant sib pairs, if the affected sib carries allele A and the unaffected sib does not, if the affected sib does not carry A, and if both sibs carry A, where K is the number of alleles with MAF <α present at the other SNPs carried by the affected sib.
where Nsib is the number of sib pairs. For the jth SNP, define:
where pj is the MAF of the SNP, estimated from the case and control subjects combined. We rank the bj in descending order and give the jth SNP weight wj = bj if bj falls in the top quartile and weight wj = 0 if not. The proposed weight incorporates information about the variants shared (not shared) by affected (discordant) sib pairs.
Performing association tests
After defining the weights for individual SNPs, using either affected or discordant sib pairs, we use the weights to test for association between the phenotype and each set of variants in a group. We assume that we have ND unrelated case subjects and NC unrelated control subjects. Let k be the kth unrelated individual. For each SNP j (j = 1, …, L), we let wj be the weight as found earlier using the sib pairs. We use the wj to calculate the kth individual’s genetic score, similar to the idea of Madsen and Browning ; that is,
where gkj is the number of minor alleles in SNP j for individual k. We then define:
where the summation is over the unrelated case subjects. Similarly, we define:
calculated for the unrelated control subjects. Our null hypothesis is that no marker is associated with the phenotype; that is, we have , and so we define our test statistic for association as:
where and are estimates of the sample variances and in case and control subjects, respectively.
Application to the Genetic Analysis Workshop 17 sequence data set
The Genetic Analysis Workshop 17 (GAW17) simulated sequence data set includes both family and unrelated individuals data, which is ideal for the proposed method. Each replicate is composed of 697 individuals in the family data set and 697 individuals in the unrelated individuals data set. There are 200 replicates in total, all with the same genotype data but with phenotype data independently simulated across the replicates.
We first performed an analysis using only one replicate, which was selected to be the 82nd replicate. To calculate the weights, we clustered the sib pairs into affected and discordant sib pairs. Based on the 82nd phenotype replicate, we identified 38 affected and 22 discordant sib pairs. We then set α = 0.01 as the MAF cutoff to define the rare variants and calculated weights based on the identified 38 affected sib pairs using the Eq. (4-5).
The 82nd phenotype replicate includes 209 unrelated case subjects and 488 unrelated control subjects. This data set was used to test for association in each of 3,205 genes.
We compared the power of the proposed method with Madsen and Browning’s method using the same unrelated case-control data. The same MAF threshold of α = 0.01 was used to define rare variants for Madsen and Browning’s method.
We expect there to be little or no power for detecting rare variants using a single replicate because of the small sample size. We therefore included additional replicates until we reached a sample size of 400 affected sib pairs and 2,400 case subjects and 2,400 control subjects. We used 400 affected sib pairs, which was suggested by our previous study . Although the genotypes are the same for the different replicates, the genotype-phenotype association is simulated independently for the different replicates. Thus the way we increased the sample size will have little impact on our power comparison.
We applied the proposed method and Madsen and Browning’s method to 3,205 genes using the 82nd replicate. As expected, we found virtually no power for either method.
We next increased the sample size to 400 affected sib pairs to calculate the weights and an additional 2,000 case subjects and 2,000 control subjects for the association test using our proposed method. For comparison, we used 2,400 case subjects and 2,400 control subjects for Madsen and Browning’s method. Thus the total sample size is the same for both methods. Figure 1 presents the result for testing 3,205 genes. The horizontal line indicates the 5% significance level after adjusting for 3,205 tests. After correcting for multiple comparisons, we observed 14 and 7 genes reaching significance for the proposed and Madsen and Browning’s methods, respectively; 2 of 14 and 1 of 7 significant genes are real causal signals.
Figure 1. Manhattan plot of –log10(p-value) for each single gene after pooling data to reach a larger sample size The upper panel is the result of our proposed method, and the lower panel is the result of Madsen and Browning’s method.
We next examined the 15 causal genes generated in the simulations (Table 1). Both methods detected PICK3C2B, but our proposed method resulted in a much smaller p-value (8.67 × 10−10 vs. 1.48 × 10−5). Furthermore, the proposed method identified HSP90AA1 (p = 8.63 × 10−6), which was missed by Madsen and Browning’s method.
Table 1. Causal gene with tested p-value
Discussion and conclusions
In this paper, we propose a novel method to analyze the GAW17 data set. Unlike the existing methods, the proposed method calculates the weights using either affected or discordant sib pairs. The proposed method requires that both the case and control groups and the family members are genotyped for the same set of SNPs or are resequenced in the same region. Compared with Madsen and Browning’s method, the proposed method detects the true causal gene HSP90AA1, which was missed by Madsen and Browning’s method. Our method demonstrates that incorporating family data can potentially improve statistical power to detect rare variants in an association analysis, which is consistent with the result of Zhu et al. .
The authors declare that there are no competing interests.
XZ designed the study. TF performed the data analysis. TF and XZ drafted the manuscript. RCE revised the manuscript. All authors read and approved the final manuscript.
This work was supported by National Institutes of Health grant HL086718 from the National Heart, Lung, and Blood Institute, National Human Genome Research Institute grant HG003054, and National Center for Research Resources grant RR03655.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.