<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1753-6561-5-S9-S20</ui>
   <ji>1753-6561</ji>
   <fm>
      <dochead>Proceedings</dochead>
      <bibl>
         <title>
            <p>Incorporating predicted functions of nonsynonymous variants into gene-based analysis of exome sequencing data: a comparative study</p>
         </title>
         <aug>
            <au ca="yes" id="A1"><snm>Wei</snm><fnm>Peng</fnm><insr iid="I1"/><insr iid="I2"/><email>Peng.Wei@uth.tmc.edu</email></au>
            <au id="A2"><snm>Liu</snm><fnm>Xiaoming</fnm><insr iid="I1"/><insr iid="I2"/><email>Xiaoming.Liu@uth.tmc.edu</email></au>
            <au id="A3"><snm>Fu</snm><fnm>Yun-Xin</fnm><insr iid="I1"/><insr iid="I2"/><email>Yunxin.Fu@uth.tmc.edu</email></au>
         </aug>
         <insg>
            <ins id="I1"><p>Division of Biostatistics, University of Texas School of Public Health, 1200 Herman Pressler Drive, Houston, TX 77030, USA</p></ins>
            <ins id="I2"><p>Human Genetics Center, University of Texas School of Public Health, 1200 Herman Pressler Drive, Houston, TX 77030, USA</p></ins>
         </insg>
         <source>BMC Proceedings</source>
         
         
         <supplement><title><p>Genetic Analysis Workshop 17: Unraveling Human Exome Data</p></title><editor>S Ghosh, H Bickeb&#246;ller, J Bailey, JE Bailey-Wilson, R Cantor, W Daw, AL DeStefano, CD Engelman, A Hinrichs, J Houwing-Duistermaat, IR K&#246;nig, J Kent Jr., N Pankratz, A Paterson, E Pugh, Y Sun, A Thomas, N Tintle, X Zhu, JW MacCluer and L Almasy</editor><note>Proceedings</note></supplement><conference><title><p>Genetic Analysis Workshop 17</p></title><location>Boston, MA, USA</location><date-range>13-16 October 2010</date-range><url>http://www.gaworkshop.org/</url></conference><issn>1753-6561</issn>
         <pubdate>2011</pubdate>
         <volume>5</volume>
         <issue>Suppl 9</issue>
         <fpage>S20</fpage>
         <url>http://www.biomedcentral.com/1753-6561/5/S9/S20</url>
         <xrefbib><pubid idtype="doi">10.1186/1753-6561-5-S9-S20</pubid></xrefbib>
      </bibl>
      <history><pub><date><day>29</day><month>11</month><year>2011</year></date></pub></history>
      <cpyrt><year>2011</year><collab>Wei et al; licensee BioMed Central Ltd.</collab><note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note></cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <p>Next-generation sequencing has opened up new avenues for the genetic study of complex traits. However, because of the small number of observations for any given rare allele and high sequencing error, it is a challenge to identify functional rare variants associated with the phenotype of interest. Recent research shows that grouping variants by gene and incorporating computationally predicted functions of variants may provide higher statistical power. On the other hand, many algorithms are available for predicting the damaging effects of nonsynonymous variants. Here, we use the simulated mini-exome data of Genetic Analysis Workshop 17 to study and compare the effects of incorporating the functional predictions of single-nucleotide polymorphisms using two popular algorithms, SIFT and PolyPhen-2, into a gene-based association test. We also propose a simple mixture model that can effectively combine test results based on different functional prediction algorithms.</p>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Despite the great success of genome-wide association studies (GWAS) in identifying hundreds of loci harboring common single-nucleotide polymorphisms (SNPs) that are associated with complex diseases, most common SNPs identified to date have small effect sizes and the proportion of heritability explained is at best modest for most traits. Thus investigators have become interested in low-frequency or rare variants (minor allele frequency [MAF] &lt; 1%) that may contribute to genetic risk <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. Recent advances in next-generation sequencing technologies have made it possible, at a relatively low cost, to extend association studies to low-frequency and rare variants, particularly in targeted resequencing of candidate genes or the whole exome.</p>
         <p>The statistical power to detect disease association with an individual rare variant is limited, partly because of the small number of observations for any given variant and partly because of the high frequency of sequencing errors. In response to this challenge, several new and powerful statistical methods have been proposed recently, including the combined multivariate and collapsing (CMC) method of Li and Leal <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>, the weighted-sum method of Madsen and Browning <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>, and the variable threshold (VT) approach of Price et al. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. Despite different statistical models, a common strategy adopted by these methods is to group the variants according to function, such as genes and pathways, and compare the group counts or distributions rather than the counts for each variant in the group. The rationale behind this grouping strategy is that if many different mutations in a group affect disease risk, then it may be beneficial to focus on the group rather than on each variant individually.</p>
         <p>The VT method of Price et al. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> is of particular interest because, in contrast to the CMC method, which uses a prespecified threshold for defining rare variants, it allows the allele frequency threshold to vary and thus adapts to properties of individual genes. It is motivated by the fact that some genes may harbor functional alleles at higher frequencies, whereas other genes may have only private functional variants. Another feature of the VT method is that it can incorporate computational predictions of the functional effects of nonsynonymous variants (e.g., by PolyPhen-2 <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>) into the association test, thereby avoiding the loss of power that results from combining both functional and nonfunctional alleles, as in previous grouping methods. Price et al. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> reported that the VT method was more powerful than the CMC and the weighted-sum methods for analyzing simulated and empirical sequencing data.</p>
         <p>We note that Price et al. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> used and studied only functional predictions from PolyPhen-2. However, several other algorithms are available for computationally predicting functions of nonsynonymous variants, such as the &#8220;sorting tolerant from intolerant&#8221; (SIFT) algorithm of Kumar et al. <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>, MutationTaster of Schwarz et al. <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>, and the &#8220;screening for nonacceptable polymorphisms&#8221; (SNAP) algorithm of Bromberg et al. <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. It is not yet clear how the results of VT tests based on different functional prediction algorithms compare with each other. The objective here is to use the Genetic Analysis Workshop 17 (GAW17) simulated mini-exome data to compare the results of the VT test incorporating predicted functions of nonsynonymous variants from two popular algorithms, PolyPhen-2 and SIFT. Although previous investigators have compared the accuracy of the two algorithms in predicting deleterious mutations (e.g., Flanagan et al. <abbrgrp><abbr bid="B9">9</abbr></abbrgrp> and Adzhubei et al. <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>), we are the first, to our knowledge, to study the effects of incorporating functional predictions based on different computational algorithms in the context of association tests of sequencing data. In addition, we propose a simple mixture model to combine the test results based on different functional prediction algorithms.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Data description</p>
            </st>
            <p>We analyze the simulated mini-exome data set provided by GAW17. This data set consists of genotypes and phenotypes for 697 unrelated individuals. The subjects are from the 1000 Genomes Project (<url>http://www.1000genomes.org</url>). There are 24,487 SNPs, among which 13,572 are nonsynonymous, mapped to the exons of 3,205 genes. Two hundred replicates of the phenotype simulation were generated, each providing three quantitative traits and one qualitative trait. See Blangero et al. <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> for simulation details. In this study, we analyze only the qualitative trait, that is, disease status, from replicate 1. There were 209 case subjects and 488 control subjects. Because we focus on a gene-based association test, we restrict our analysis to genes with at least two SNPs, resulting in 1,979 genes and 23,261 SNPs, among which 13,086 are nonsynonymous. The number of SNPs per gene across the 1,979 genes is summarized as follows: minimum = 2, 25th percentile = 3, median = 6, 75th percentile = 15, and maximum = 231.</p>
         </sec>
         <sec>
            <st>
               <p>SIFT and PolyPhen-2 algorithms</p>
            </st>
            <p>The SIFT algorithm is a multistep, sequence-homology-based algorithm that classifies amino acid substitutions resulting from nonsynonymous SNPs. The underlying premise for the SIFT algorithm is based on the evolutionary conservation of the amino acids within protein families: Highly conserved positions tend to be intolerant to substitutions, whereas those with a low degree of conservation tolerate most substitutions <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. The SIFT algorithm predicts that a nonsynonymous variant will be damaging if the scaled probability score, also termed the SIFT score, is less than 0.05; otherwise, the algorithm predicts that the variant will be tolerated.</p>
            <p>In contrast to the SIFT algorithm, which does not use the protein structure information, the PolyPhen-2 algorithm uses a na&#239;ve Bayes classifier to predict damaging effects of nonsynonymous variants based on eight sequence-based and three structure-based predictive features <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. The PolyPhen-2 algorithm calculates the na&#239;ve Bayes posterior probability that a given mutation will be damaging and qualitatively predicts that it will be benign, possibly damaging, or probably damaging, corresponding to posterior probability intervals [0, 0.2], (0.2, 0.85), and [0.85, 1], respectively.</p>
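            <p>The two decision rules described above can be sketched in Python as follows. This is an illustration only: the function names are hypothetical and the code is not part of the SIFT or PolyPhen-2 software; the cutoffs are those stated in the text.</p>
            <p><![CDATA[
```python
def sift_call(sift_score):
    """SIFT rule from the text: damaging if the score is below 0.05, else tolerated."""
    return "damaging" if sift_score < 0.05 else "tolerated"


def polyphen2_call(posterior):
    """PolyPhen-2 rule from the text, using the posterior probability intervals
    [0, 0.2] benign, (0.2, 0.85) possibly damaging, [0.85, 1] probably damaging."""
    if posterior <= 0.2:
        return "benign"
    if posterior < 0.85:
        return "possibly damaging"
    return "probably damaging"
```
]]></p>
            <p>Note that the two scores run in opposite directions: a small SIFT score but a large PolyPhen-2 posterior probability indicates a damaging variant.</p>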
            <p>We obtained the predicted functional scores of all 13,572 nonsynonymous SNPs by means of the online versions of the SIFT algorithm (<url>http://sift.jcvi.org/index.html</url>) and the PolyPhen-2 algorithm (<url>http://genetics.bwh.harvard.edu/pph2/</url>). For both algorithms, we used human genome build 36 from the National Center for Biotechnology Information (NCBI) as the reference genome sequence. For the PolyPhen-2 algorithm, HumDiv was selected as the classifier model because it was recommended for evaluating rare alleles at loci potentially involved in complex phenotypes <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>.</p>
         </sec>
         <sec>
            <st>
               <p>Variable threshold test</p>
            </st>
            <p>In the VT test, rare alleles are grouped together by optimizing an allele frequency threshold that maximizes the difference, as quantified by a <it>z</it>-score, between distributions of trait values or disease status for individuals with and without rare alleles. To control type I error, we applied the same optimization procedure to permuted data to obtain an exact <it>p</it>-value for association. The rationale underlying the VT method is that for each gene there is some unknown threshold <it>T</it> for which variants with a MAF less than <it>T</it> are substantially more likely to be functional than those with a MAF greater than <it>T</it>. Specifically, for a given gene with <it>m</it> SNPs in its exons, we define the <it>z</it>-score for a given threshold <it>T</it> as:</p>
            <p>
               <display-formula id="M1">
                  <graphic file="1753-6561-5-S9-S20-i1.gif"/>
               </display-formula>
            </p>
            <p>where <inline-formula><graphic file="1753-6561-5-S9-S20-i2.gif"/></inline-formula> is an indicator variable that is equal to 1 if the MAF of SNP <it>i</it> is less than the threshold <it>T</it> and equal to 0 otherwise, <it>C<sub>ij</sub></it> is the reference allele count of SNP <it>i</it> in subject <it>j</it>, <it>&#960;<sub>j</sub></it> is the phenotype of subject <it>j</it> equal to 0 and 1 for control subjects and case subjects, respectively, <inline-formula><graphic file="1753-6561-5-S9-S20-i3.gif"/></inline-formula> is the mean value of <it>&#960;<sub>j</sub></it> across subjects <it>j</it>, and <it>S<sub>i</sub></it> is the functional prediction score of SNP <it>i</it>, which is between 0 and 1 (larger values indicate higher probability of damaging effect). In addition, the maximum <it>z</it>-score is defined as:</p>
            <p>
               <display-formula id="M2">
                  <graphic file="1753-6561-5-S9-S20-i4.gif"/>
               </display-formula>
            </p>
            <p>The statistical significance of <it>z</it><sub>max</sub> is then assessed by permutations on phenotypes. In addition, the VT test has been implemented as an R function, available at <url>http://genetics.bwh.harvard.edu/rare_variants/</url>.</p>
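            <p>The procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the R implementation referenced above. Because the display formulas are provided only as images, the form of the <it>z</it>-score used here (a weighted sum of allele counts times centered phenotypes, divided by the square root of the sum of squared weighted counts) is an assumption based on the definitions in the text. The function names, the one-sided permutation <it>p</it>-value, and the add-one correction are likewise illustrative choices, and the rows of the genotype matrix are assumed to count the minor allele.</p>
            <p><![CDATA[
```python
import numpy as np


def vt_zmax(C, pheno, S):
    """Maximum weighted z-score over allele frequency thresholds.

    C     : (m SNPs, n subjects) array of allele counts (assumed minor-allele counts)
    pheno : length-n array, 1 for case subjects and 0 for control subjects
    S     : length-m array of functional weights in [0, 1]
    """
    C = np.asarray(C, dtype=float)
    pheno = np.asarray(pheno, dtype=float)
    S = np.asarray(S, dtype=float)
    maf = C.sum(axis=1) / (2.0 * C.shape[1])
    centered = pheno - pheno.mean()
    best = -np.inf
    # Each observed MAF serves as a candidate threshold; "MAF <= observed value"
    # is equivalent to "MAF < T" for a T just above that value.
    for T in np.unique(maf):
        keep = maf <= T
        W = S[keep, None] * C[keep]          # weighted allele counts
        denom = np.sqrt((W ** 2).sum())
        if denom > 0:
            best = max(best, (W @ centered).sum() / denom)
    return best if np.isfinite(best) else 0.0


def vt_pvalue(C, pheno, S, n_perm=1000, seed=0):
    """One-sided permutation p-value for z_max, permuting phenotypes."""
    rng = np.random.default_rng(seed)
    observed = vt_zmax(C, pheno, S)
    hits = sum(vt_zmax(C, rng.permutation(pheno), S) >= observed
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)


# Toy example: 3 SNPs, 8 subjects (the first 4 are cases), all weights equal to 1.
C = np.array([[1, 0, 0, 0, 0, 0, 0, 0],
              [0, 1, 1, 0, 0, 0, 0, 0],
              [1, 0, 1, 0, 1, 1, 1, 0]])
pheno = np.array([1, 1, 1, 1, 0, 0, 0, 0])
S = np.ones(3)
z_max = vt_zmax(C, pheno, S)
p_value = vt_pvalue(C, pheno, S, n_perm=200)
```
]]></p>
            <p>In the toy example, the maximum is attained when the two rarest SNPs are grouped and the common SNP is excluded, which is the behavior the variable threshold is designed to produce.</p>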
         </sec>
         <sec>
            <st>
               <p>Incorporating the predicted functions of variants into the VT test</p>
            </st>
            <p>To study and compare the effects of incorporating different predicted functions of SNPs into a gene-based association test, we carried out four versions of the VT test: (1) an unweighted VT test, in which all SNPs, both synonymous and nonsynonymous, were grouped (thus <it>S<sub>i</sub></it> in Eq. (1) was 1 for all SNPs); (2) a binary weight VT test, in which only nonsynonymous SNPs were grouped (thus <it>S<sub>i</sub></it> was 1 for nonsynonymous SNPs and 0 otherwise); (3) a SIFT-based VT test, in which <it>S<sub>i</sub></it> was equal to (1 &#8722; SIFT prediction score) for nonsynonymous SNPs and 0 otherwise; and (4) a PolyPhen-2-based VT test, in which <it>S<sub>i</sub></it> was equal to the PolyPhen-2 score for nonsynonymous SNPs and 0 otherwise. For those nonsynonymous SNPs without a prediction score, we imputed them with the corresponding median scores: 0.1 for the SIFT algorithm and 0.2 for the PolyPhen-2 algorithm. For each gene, 10,000 permutations were carried out to obtain the <it>p</it>-value.</p>
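            <p>The four weighting schemes can be sketched in Python as follows. The function and argument names are hypothetical, missing scores are represented as NaN, and the imputation values 0.1 and 0.2 are assumed to be on the raw score scale of each algorithm (that is, applied before the SIFT score is converted to 1 &#8722; SIFT score).</p>
            <p><![CDATA[
```python
import numpy as np


def vt_weights(is_nonsyn, sift, polyphen2, sift_median=0.1, pph2_median=0.2):
    """Return the weight vector S_i for each of the four VT test versions.

    is_nonsyn : boolean array, True for nonsynonymous SNPs
    sift      : raw SIFT scores (NaN where no prediction is available)
    polyphen2 : raw PolyPhen-2 scores (NaN where no prediction is available)
    """
    is_nonsyn = np.asarray(is_nonsyn, dtype=bool)
    sift = np.asarray(sift, dtype=float)
    pph2 = np.asarray(polyphen2, dtype=float)
    # Impute missing predictions with the median score of each algorithm.
    sift_filled = np.where(np.isnan(sift), sift_median, sift)
    pph2_filled = np.where(np.isnan(pph2), pph2_median, pph2)
    return {
        "unweighted": np.ones(is_nonsyn.size),
        "binary": is_nonsyn.astype(float),
        "sift": np.where(is_nonsyn, 1.0 - sift_filled, 0.0),
        "polyphen2": np.where(is_nonsyn, pph2_filled, 0.0),
    }


# Two nonsynonymous SNPs (the second without predictions) and one synonymous SNP.
weights = vt_weights([True, True, False],
                     [0.02, np.nan, np.nan],
                     [0.90, np.nan, np.nan])
```
]]></p>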
         </sec>
         <sec>
            <st>
               <p>Mixture model for combining test results</p>
            </st>
            <p>Here, we propose a simple mixture model to combine <it>p</it>-values resulting from association tests based on different functional prediction algorithms. Let <it>p<sub>g</sub></it><sub>1</sub> and <it>p<sub>g</sub></it><sub>2</sub> be gene <it>g</it>&#8217;s VT test <it>p</it>-values corresponding to the SIFT and PolyPhen-2 algorithms, respectively, for <it>g</it> = 1, &#8230;, <it>G</it>. Define the <it>z</it>-transformation:</p>
            <p>
               <display-formula id="M3">
                  <graphic file="1753-6561-5-S9-S20-i5.gif"/>
               </display-formula>
            </p>
            <p>so that smaller <it>p</it>-values correspond to larger <it>z</it>-values, where &#934;<sup>&#8722;1</sup> is the inverse cumulative distribution function of the standard normal distribution <it>N</it>(0, 1) and <it>k</it> = 1, 2. We assume that (<it>x<sub>g</sub></it><sub>1</sub>, <it>x<sub>g</sub></it><sub>2</sub>) follows a two-component bivariate normal mixture model, that is, that its density is given by:</p>
            <p>
               <display-formula id="M4">
                  <graphic file="1753-6561-5-S9-S20-i6.gif"/>
               </display-formula>
            </p>
            <p>where <it>f</it><sub>0</sub> and <it>f</it><sub>1</sub> are two bivariate normal densities corresponding to <it>z</it>-values of non-phenotype-associated and phenotype-associated genes, respectively. The two-component normal mixture model is a simple yet powerful statistical method for genome-wide discoveries <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. The posterior probability of gene <it>g</it> being associated with the phenotype is given by:</p>
            <p>
               <display-formula id="M5">
                  <graphic file="1753-6561-5-S9-S20-i7.gif"/>
               </display-formula>
            </p>
            <p>which can be used to rank genes and to estimate the false discovery rate (FDR) for a given cutoff for claiming significant genes and thus to control the FDR at a desired level, for example, 5% <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. For simplicity, we further assume that <it>x<sub>g</sub></it><sub>1</sub> and <it>x<sub>g</sub></it><sub>2</sub> are conditionally independent given whether gene <it>g</it> is associated with the phenotype or not; that is,</p>
            <p>
               <display-formula id="M6">
                  <graphic file="1753-6561-5-S9-S20-i8.gif"/>
               </display-formula>
            </p>
            <p>where <it>&#981;</it>(<it>x</it>; <it>&#956;</it>, <it>&#963;</it><sup>2</sup>) is the density function of <it>N</it>(<it>&#956;</it>, <it>&#963;</it><sup>2</sup>) and <it>l</it> = 0, 1. The conditional independence mixture model is similar to a na&#239;ve Bayes method except that the mixture model is fitted by unsupervised learning, whereas the na&#239;ve Bayes method relies on supervised learning. Note that this simplified model may not fit the <it>z</it>-values well, and thus the resulting posterior probabilities can be used only to rank genes, not to estimate the FDR (see Wei and Pan <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>). The parameter estimates in the normal mixture model can be obtained by means of the EM algorithm, which is implemented in the R package mclust. In addition, <it>p</it>-values from a single type of association test, for example, the SIFT-based VT test, can be used to fit a two-component univariate normal mixture model and the FDR can be similarly estimated.</p>
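            <p>The conditional independence mixture model can be sketched in Python as follows. This is a hand-written EM loop for illustration, not a port of mclust, and all function names are hypothetical. Because the display formulas are provided only as images, the <it>z</it>-transformation is assumed to be the standard normal quantile of one minus the <it>p</it>-value, which matches the statement that smaller <it>p</it>-values give larger <it>z</it>-values. Clipping the <it>p</it>-values away from 0 and 1, the starting values, and the fixed number of iterations are further assumptions of the sketch.</p>
            <p><![CDATA[
```python
import numpy as np
from statistics import NormalDist


def z_transform(p, eps=1e-4):
    """Assumed z-transformation: standard normal quantile of (1 - p).
    p-values are clipped to [eps, 1 - eps] so the quantile stays finite."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    inv = NormalDist().inv_cdf
    return np.array([inv(1.0 - v) for v in p.ravel()]).reshape(p.shape)


def normal_pdf(x, mu, sd):
    """Density of N(mu, sd^2) evaluated elementwise."""
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def fit_ci_mixture(x, n_iter=200):
    """EM for a two-component mixture with conditionally independent coordinates.

    x : (G, 2) array of z-values, one column per test.
    Returns the posterior probability that each gene belongs to the component
    with the larger means (interpreted as phenotype-associated).
    """
    x = np.asarray(x, dtype=float)
    pi1 = 0.1                                   # starting mixing proportion
    center, spread = x.mean(axis=0), x.std(axis=0)
    mu = np.stack([center, center + 2.0 * spread])   # rows: components 0 and 1
    sd = np.stack([spread, spread])
    for _ in range(n_iter):
        # E-step: the joint density is the product of the two marginal densities.
        f0 = np.prod(normal_pdf(x, mu[0], sd[0]), axis=1)
        f1 = np.prod(normal_pdf(x, mu[1], sd[1]), axis=1)
        post = pi1 * f1 / ((1.0 - pi1) * f0 + pi1 * f1 + 1e-300)
        # M-step: weighted means and variances for each component and test.
        w = np.stack([1.0 - post, post])
        pi1 = post.mean()
        mu = (w @ x) / w.sum(axis=1, keepdims=True)
        var = np.stack([(w[l][:, None] * (x - mu[l]) ** 2).sum(axis=0) / w[l].sum()
                        for l in (0, 1)])
        sd = np.sqrt(np.maximum(var, 1e-6))
    # Guard against label switching: component 1 should have the larger means.
    if mu[1].sum() < mu[0].sum():
        post = 1.0 - post
    return post
```
]]></p>
            <p>Given a <it>G</it>-by-2 array of <it>p</it>-values, the genes can then be ranked by the output of fit_ci_mixture(z_transform(p)). As noted above, under the conditional independence assumption these posterior probabilities are suitable for ranking only.</p>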
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <sec>
            <st>
               <p>Prediction score comparison: SIFT vs. PolyPhen-2 algorithms</p>
            </st>
            <p>As described in the Methods section, we obtained the prediction scores of being deleterious using the SIFT and PolyPhen-2 algorithms for the 13,572 SNPs annotated as nonsynonymous in the annotation file supplied by GAW17. Nine hundred thirty-nine nonsynonymous SNPs did not have a SIFT score and 1,241 nonsynonymous SNPs did not have a PolyPhen-2 score, probably because of gene annotation errors or insufficient sequence evidence. Note that nonsynonymous variants with a PolyPhen-2 score larger than 0.2 were predicted to be possibly or probably damaging, whereas those with a SIFT score less than 0.05 were predicted to be damaging. Because the two scores run in opposite directions, we plotted (1 &#8722; SIFT score) against the PolyPhen-2 score in Figure <figr fid="F1">1a</figr>. The scatterplot together with the LOESS curve shows that the two scores are positively correlated, although there are quite a few SNPs with discordant prediction scores. We also assessed the correlation of dichotomous predictions from the two algorithms. Using 0.2 and 0.95 as thresholds for the PolyPhen-2 score and (1 &#8722; SIFT score), respectively, and writing P+ and S+ for variants predicted to be deleterious by the PolyPhen-2 and SIFT algorithms, we obtained a two-by-two table with cell counts as follows: P+ and S+ = 3,600, P&#8722; and S&#8722; = 4,492, P+ and S&#8722; = 2,403, and P&#8722; and S+ = 1,345. This resulted in an odds ratio (OR) estimate equal to 5 (chi-square test <it>p</it> &lt; 10<sup>&#8722;16</sup>), meaning that the odds of being predicted to be deleterious using the PolyPhen-2 algorithm for variants that were predicted to be deleterious using the SIFT algorithm were five times the odds for those that were predicted to be benign using the SIFT algorithm. Similar comparison results held for the 13,086 nonsynonymous SNPs corresponding to the 1,979 genes with at least two SNPs.</p>
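            <p>The odds ratio can be checked directly from the four cell counts reported above. The short Python calculation below uses a hypothetical helper function and simply takes the cross-product ratio of the concordant and discordant cells.</p>
            <p><![CDATA[
```python
def odds_ratio(both_pos, both_neg, p_only, s_only):
    """Cross-product odds ratio for a two-by-two table of dichotomous predictions."""
    return (both_pos * both_neg) / (p_only * s_only)


# Cell counts from the text: P+/S+, P-/S-, P+/S-, P-/S+.
or_estimate = odds_ratio(3600, 4492, 2403, 1345)
print(round(or_estimate, 2))  # 5.0
```
]]></p>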
            <fig id="F1"><title><p>Figure 1</p></title><caption><p>SIFT scores versus PolyPhen-2 scores.</p></caption><text>
   <p><b>SIFT scores versus PolyPhen-2 scores.</b> (a) (1 &#8722; SIFT score) plotted against PolyPhen-2 score. The red dashed lines correspond to the thresholds for predicting deleterious variants: 0.95 for (1 &#8722; SIFT score) and 0.2 for the PolyPhen-2 score. The blue solid line corresponds to the LOESS curve (locally weighted scatterplot smoothing). (b) SIFT-based VT test <it>p</it>-values plotted against PolyPhen-2-based VT test <it>p</it>-values. Red plus signs correspond to genes that had tied rank 1 (posterior probabilities of association equal to 1) by the mixture model combining both tests. (c) Enlarged section of part b. (d) SIFT-based VT test <it>z</it>-values plotted against PolyPhen-2-based VT test <it>z</it>-values. Red plus signs correspond to genes that had tied rank 1 by the mixture model combining both tests. (e) Raw versus recalibrated PolyPhen-2 scores; the solid line is the identity line. (f) Raw versus recalibrated PolyPhen-2 score-based VT test <it>p</it>-values.</p>
</text><graphic file="1753-6561-5-S9-S20-1"/></fig>
         </sec>
         <sec>
            <st>
               <p>Comparison of SIFT-based and PolyPhen-2-based VT tests</p>
            </st>
            <p>Figures <figr fid="F1">1b-d</figr> compare the <it>p</it>-values and <it>z</it>-values of SIFT-based and PolyPhen-2-based VT tests. Although the two <it>p</it>-values are positively correlated overall, they can be substantially different from each other. However, smaller <it>p</it>-values seem to be better correlated, as demonstrated by the upper-right part of the <it>z</it>-value plot (Figure <figr fid="F1">1d</figr>). In addition, we fitted a two-component bivariate normal mixture model to combine the <it>p</it>-values of the two tests, as described in the Methods section. One hundred sixty genes were ranked 1 (i.e., the posterior probabilities of association were all equal to 1) in the combined analysis and were plotted as red plus signs in Figures <figr fid="F1">1b-d</figr>. Not only were genes with small <it>p</it>-values highly ranked, but genes with moderately small <it>p</it>-values could also be boosted to have a tied rank of 1 (Figure <figr fid="F1">1c</figr>).</p>
            <p>In addition to the comparison between SIFT-based and PolyPhen-2-based tests, we also performed comparisons among all four versions of the VT test. Specifically, we looked at the overlaps among the top 100 genes by each test, as shown by the Venn diagrams in Figure <figr fid="F2">2</figr>. We can see that the SIFT-based and the binary weight-based tests share a large number of genes, whereas the PolyPhen-2-based and the unweighted tests share far fewer genes with the former two tests. This comparison also suggests that association tests incorporating different functional predictions could lead to quite different results. In practice, it is unlikely that one functional prediction algorithm will be uniformly better than the other, which calls for an effective combined analysis, such as the mixture model proposed here. In addition, Table <tblr tid="T1">1</tblr> lists the top 10 genes by the SIFT-based VT test, all of which were tied at rank 1 by the combined analysis. All genes had small <it>p</it>-values, as ascertained by the other three tests, as well as a large number of SNPs sufficiently representing the corresponding genes.</p>
            <fig id="F2"><title><p>Figure 2</p></title><caption><p>Venn diagrams for the top 100 genes.</p></caption><text>
   <p><b>Venn diagrams for the top 100 genes.</b> Top 100 genes found by (a) PolyPhen-2, SIFT, and binary-weight-based VT tests and (b) unweighted, SIFT, and binary-weight-based VT tests.</p>
</text><graphic file="1753-6561-5-S9-S20-2"/></fig>
            <tbl id="T1"><title><p>Table 1</p></title><caption><p>Top ten genes ranked by SIFT-based VT test <it>p</it>-value</p></caption><tblbdy cols="7">
      <r>
         <c ca="left">
            <p>Gene</p>
         </c>
         <c ca="center">
            <p>SIFT</p>
         </c>
         <c ca="center">
            <p>PolyPhen-2</p>
         </c>
         <c ca="center">
            <p>Binary</p>
         </c>
         <c ca="center">
            <p>Unweighted</p>
         </c>
         <c ca="center">
            <p>Number of SNPs</p>
         </c>
         <c ca="center">
            <p>Number of nonsynonymous SNPs</p>
         </c>
      </r>
      <r>
         <c cspan="7">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <it>FAM13A1</it>
            </p>
         </c>
         <c ca="center">
            <p>0.0002</p>
         </c>
         <c ca="center">
            <p>0.0001</p>
         </c>
         <c ca="center">
            <p>0.0007</p>
         </c>
         <c ca="center">
            <p>0.0003</p>
         </c>
         <c ca="center">
            <p>34</p>
         </c>
         <c ca="center">
            <p>23</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <it>DGKZ</it>
            </p>
         </c>
         <c ca="center">
            <p>0.0002</p>
         </c>
         <c ca="center">
            <p>0.0005</p>
         </c>
         <c ca="center">
            <p>0.0002</p>
         </c>
         <c ca="center">
            <p>0.0004</p>
         </c>
         <c ca="center">
            <p>22</p>
         </c>
         <c ca="center">
            <p>15</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <it>TRIM42</it>
            </p>
         </c>
         <c ca="center">
            <p>0.0003</p>
         </c>
         <c ca="center">
            <p>0.0002</p>
         </c>
         <c ca="center">
            <p>0.0009</p>
         </c>
         <c ca="center">
            <p>0.0032</p>
         </c>
         <c ca="center">
            <p>39</p>
         </c>
         <c ca="center">
            <p>30</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <it>ADAM15</it>
            </p>
         </c>
         <c ca="center">
            <p>0.0003</p>
         </c>
         <c ca="center">
            <p>0.0003</p>
         </c>
         <c ca="center">
            <p>0.0016</p>
         </c>
         <c ca="center">
            <p>0.0004</p>
         </c>
         <c ca="center">
            <p>30</p>
         </c>
         <c ca="center">
            <p>20</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <it>FLT1</it>
            </p>
         </c>
         <c ca="center">
            <p>0.0003</p>
         </c>
         <c ca="center">
            <p>0.0007</p>
         </c>
         <c ca="center">
            <p>0.0002</p>
         </c>
         <c ca="center">
            <p>0.0002</p>
         </c>
         <c ca="center">
            <p>35</p>
         </c>
         <c ca="center">
            <p>20</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <it>GRIA4</it>
            </p>
         </c>
         <c ca="center">
            <p>0.0004</p>
         </c>
         <c ca="center">
            <p>0.0003</p>
         </c>
         <c ca="center">
            <p>0.0004</p>
         </c>
         <c ca="center">
            <p>0.0066</p>
         </c>
         <c ca="center">
            <p>18</p>
         </c>
         <c ca="center">
            <p>6</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <it>IRF6</it>
            </p>
         </c>
         <c ca="center">
            <p>0.0005</p>
         </c>
         <c ca="center">
            <p>0.0005</p>
         </c>
         <c ca="center">
            <p>0.0021</p>
         </c>
         <c ca="center">
            <p>0.0119</p>
         </c>
         <c ca="center">
            <p>15</p>
         </c>
         <c ca="center">
            <p>7</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <it>HDAC4</it>
            </p>
         </c>
         <c ca="center">
            <p>0.0007</p>
         </c>
         <c ca="center">
            <p>0.0144</p>
         </c>
         <c ca="center">
            <p>0.0011</p>
         </c>
         <c ca="center">
            <p>0.0010</p>
         </c>
         <c ca="center">
            <p>36</p>
         </c>
         <c ca="center">
            <p>16</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <it>GDF15</it>
            </p>
         </c>
         <c ca="center">
            <p>0.0009</p>
         </c>
         <c ca="center">
            <p>0.0006</p>
         </c>
         <c ca="center">
            <p>0.0040</p>
         </c>
         <c ca="center">
            <p>0.0006</p>
         </c>
         <c ca="center">
            <p>10</p>
         </c>
         <c ca="center">
            <p>6</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <it>SUSD2</it>
            </p>
         </c>
         <c ca="center">
            <p>0.0009</p>
         </c>
         <c ca="center">
            <p>0.0008</p>
         </c>
         <c ca="center">
            <p>0.0015</p>
         </c>
         <c ca="center">
            <p>0.0005</p>
         </c>
         <c ca="center">
            <p>45</p>
         </c>
         <c ca="center">
            <p>29</p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p>All genes had tied rank 1 by the mixture model combining both SIFT-based and PolyPhen-2-based VT test <it>p</it>-values. <it>P</it>-values were obtained from 10,000 permutations.</p>
   </tblfn></tbl>
         </sec>
         <sec>
            <st>
               <p>Comparison of raw and recalibrated PolyPhen-2 scores</p>
            </st>
            <p>Price et al. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> suggested that, to obtain optimal results, the PolyPhen-2 scores should be recalibrated before being applied to the VT test. We obtained the recalibrated PolyPhen-2 scores using the computer program provided by Price et al. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. Figure <figr fid="F1">1e</figr> shows the raw versus the recalibrated PolyPhen-2 scores, which were calculated using a nonlinear monotone transformation of the raw scores. In addition, the VT test <it>p</it>-values based on the raw and recalibrated PolyPhen-2 scores are compared in Figure <figr fid="F1">1f</figr>. Although the <it>p</it>-values are highly correlated with Spearman&#8217;s rank correlation coefficient equal to 0.98, they could be quite different for some genes.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>In the present analyses, we compared the raw and recalibrated PolyPhen-2 scores in the VT test. It would also be of interest to develop methods for recalibrating the SIFT scores; however, this would necessitate having available credible neutral and damaging nonsynonymous SNPs as a training set to derive the recalibration transformation. Another possible direction for future investigation is to develop association tests that are more robust to misspecifications of functional predictions and can incorporate covariate effects including environmental factors.</p>
      </sec>
      <sec>
         <st>
            <p>Conclusions</p>
         </st>
         <p>Motivated by the fact that many algorithms for predicting damaging effects of nonsynonymous variants are available, we performed a comparative study of the effects of incorporating different functional predictions into association tests using the GAW17 simulated mini-exome data set. Our study reveals that, although the PolyPhen-2 and SIFT prediction scores are positively correlated overall, they can be substantially different from each other, quantitatively as well as qualitatively. As a result, the SIFT-based and the PolyPhen-2-based VT test results can also differ. Importantly, our analyses suggest that the two-component normal mixture model proposed here provides a probabilistic approach to effectively combining the heterogeneous test results. Further refinements, including relaxing the conditional independence assumption to improve the goodness-of-fit, are needed.</p>
      </sec>
      <sec>
         <st>
            <p>Competing interests</p>
         </st>
         <p>The authors declare that there are no competing interests.</p>
      </sec>
      <sec>
         <st>
            <p>Authors&#8217; contributions</p>
         </st>
         <p>PW conceived and designed the study, performed the statistical analyses and drafted the manuscript. XL co-designed the study. All authors helped to draft the manuscript. All authors read and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgments</p>
            </st>
            <p>We thank both reviewers for their constructive comments. Peng Wei was partially supported by a PRIME grant from the University of Texas School of Public Health. The analyses were performed without knowledge of the underlying simulating model. The Genetic Analysis Workshop is supported by National Institutes of Health grant R01 GM031575.</p>
            <p>This article has been published as part of <it>BMC Proceedings</it> Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at <url>http://www.biomedcentral.com/1753-6561/5?issue=S9</url>.</p>
         </sec>
      </ack>
      <refgrp><bibl id="B1"><title><p>Uncovering the roles of rare variants in common diseases through whole-genome sequencing</p></title><aug><au><snm>Cirulli</snm><fnm>ET</fnm></au><au><snm>Goldstein</snm><fnm>DB</fnm></au></aug><source>Nature Reviews Genetics</source><pubdate>2010</pubdate><volume>11</volume><fpage>415</fpage><lpage>425</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nrg2779</pubid><pubid idtype="pmpid" link="fulltext">20479773</pubid></pubidlist></xrefbib></bibl><bibl id="B2"><title><p>Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data</p></title><aug><au><snm>Li</snm><fnm>B</fnm></au><au><snm>Leal</snm><fnm>SM</fnm></au></aug><source>Am J Hum Genet</source><pubdate>2008</pubdate><volume>83</volume><fpage>311</fpage><lpage>321</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.ajhg.2008.06.024</pubid><pubid idtype="pmcid">2842185</pubid><pubid idtype="pmpid" link="fulltext">18691683</pubid></pubidlist></xrefbib></bibl><bibl id="B3"><title><p>A groupwise association test for rare mutations using a weighted sum statistic</p></title><aug><au><snm>Madsen</snm><fnm>BE</fnm></au><au><snm>Browning</snm><fnm>SR</fnm></au></aug><source>PLoS Genet</source><pubdate>2009</pubdate><volume>5</volume><fpage>e1000384</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1371/journal.pgen.1000384</pubid><pubid idtype="pmcid">2633048</pubid><pubid idtype="pmpid" link="fulltext">19214210</pubid></pubidlist></xrefbib></bibl><bibl id="B4"><title><p>Pooled association tests for rare variants in exon-resequencing studies</p></title><aug><au><snm>Price</snm><fnm>AL</fnm></au><au><snm>Kryukov</snm><fnm>GV</fnm></au><au><snm>de Bakker</snm><fnm>PI</fnm></au><au><snm>Purcell</snm><fnm>SM</fnm></au><au><snm>Staples</snm><fnm>J</fnm></au><au><snm>Wei</snm><fnm>LJ</fnm></au><au><snm>Sunyaev</snm><fnm>SR</fnm></au></aug><source>Am J Hum 
Genet</source><pubdate>2010</pubdate><volume>86</volume><fpage>832</fpage><lpage>838</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.ajhg.2010.04.005</pubid><pubid idtype="pmcid">3032073</pubid><pubid idtype="pmpid" link="fulltext">20471002</pubid></pubidlist></xrefbib></bibl><bibl id="B5"><title><p>A method and server for predicting damaging missense mutations</p></title><aug><au><snm>Adzhubei</snm><fnm>IA</fnm></au><au><snm>Schmidt</snm><fnm>S</fnm></au><au><snm>Peshkin</snm><fnm>L</fnm></au><au><snm>Ramensky</snm><fnm>VE</fnm></au><au><snm>Gerasimova</snm><fnm>A</fnm></au><au><snm>Bork</snm><fnm>P</fnm></au><au><snm>Kondrashov</snm><fnm>AS</fnm></au><au><snm>Sunyaev</snm><fnm>SR</fnm></au></aug><source>Nat Meth</source><pubdate>2010</pubdate><volume>7</volume><fpage>248</fpage><lpage>249</lpage><xrefbib><pubid idtype="doi">10.1038/nmeth0410-248</pubid></xrefbib></bibl><bibl id="B6"><title><p>Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm</p></title><aug><au><snm>Kumar</snm><fnm>P</fnm></au><au><snm>Henikoff</snm><fnm>S</fnm></au><au><snm>Ng</snm><fnm>PC</fnm></au></aug><source>Nat Protoc</source><pubdate>2009</pubdate><volume>4</volume><fpage>1073</fpage><lpage>1081</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nprot.2009.86</pubid><pubid idtype="pmpid" link="fulltext">19561590</pubid></pubidlist></xrefbib></bibl><bibl id="B7"><title><p>Mutation Taster evaluates disease-causing potential of sequence alterations</p></title><aug><au><snm>Schwarz</snm><fnm>JM</fnm></au><au><snm>Rodelsperger</snm><fnm>C</fnm></au><au><snm>Schuelke</snm><fnm>M</fnm></au><au><snm>Seelow</snm><fnm>D</fnm></au></aug><source>Nat Meth</source><pubdate>2010</pubdate><volume>7</volume><fpage>575</fpage><lpage>576</lpage><xrefbib><pubid idtype="doi">10.1038/nmeth0810-575</pubid></xrefbib></bibl><bibl id="B8"><title><p>SNAP predicts effect of mutations on protein 
function</p></title><aug><au><snm>Bromberg</snm><fnm>Y</fnm></au><au><snm>Yachdav</snm><fnm>G</fnm></au><au><snm>Rost</snm><fnm>B</fnm></au></aug><source>Bioinformatics</source><pubdate>2008</pubdate><volume>24</volume><fpage>2397</fpage><lpage>2398</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btn435</pubid><pubid idtype="pmcid">2562009</pubid><pubid idtype="pmpid" link="fulltext">18757876</pubid></pubidlist></xrefbib></bibl><bibl id="B9"><title><p>Using SIFT and PolyPhen to predict loss-of-function and gain-of-function mutations</p></title><aug><au><snm>Flanagan</snm><fnm>SE</fnm></au><au><snm>Patch</snm><fnm>AM</fnm></au><au><snm>Ellard</snm><fnm>S</fnm></au></aug><source>Genet Test Mol Biomarkers</source><pubdate>2010</pubdate><volume>14</volume><fpage>533</fpage><lpage>537</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1089/gtmb.2010.0036</pubid><pubid idtype="pmpid">20642364</pubid></pubidlist></xrefbib></bibl><bibl id="B10"><title><p>Genetic Analysis Workshop 17 mini-exome simulation</p></title><aug><au><snm>Almasy</snm><fnm>LA</fnm></au><au><snm>Dyer</snm><fnm>TD</fnm></au><au><snm>Peralta</snm><fnm>JM</fnm></au><au><snm>Kent</snm><fnm>JW</fnm><suf>Jr</suf></au><au><snm>Charlesworth</snm><fnm>JC</fnm></au><au><snm>Curran</snm><fnm>JE</fnm></au><au><snm>Blangero</snm><fnm>J</fnm></au></aug><source>BMC Proc</source><pubdate>2011</pubdate><volume>5</volume><issue>suppl 9</issue><fpage>S2</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1753-6561-5-S9-S2</pubid><pubid idtype="pmcid">3254896</pubid><pubid idtype="pmpid">21810212</pubid></pubidlist></xrefbib></bibl><bibl id="B11"><title><p>A simple implementation of a normal mixture approach to differential gene expression in multiclass 
microarrays</p></title><aug><au><snm>McLachlan</snm><fnm>GJ</fnm></au><au><snm>Bean</snm><fnm>RW</fnm></au><au><snm>Jones</snm><fnm>LB</fnm></au></aug><source>Bioinformatics</source><pubdate>2006</pubdate><volume>22</volume><fpage>1608</fpage><lpage>1615</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btl148</pubid><pubid idtype="pmpid" link="fulltext">16632494</pubid></pubidlist></xrefbib></bibl><bibl id="B12"><title><p>Oracle and adaptive compound decision rules for false discovery rate control</p></title><aug><au><snm>Sun</snm><fnm>W</fnm></au><au><snm>Cai</snm><fnm>T</fnm></au></aug><source>J Am Stat Assoc</source><pubdate>2007</pubdate><volume>102</volume><fpage>901</fpage><lpage>912</lpage><xrefbib><pubid idtype="doi">10.1198/016214507000000545</pubid></xrefbib></bibl><bibl id="B13"><title><p>Network-based genomic discovery: application and comparison of Markov random field models</p></title><aug><au><snm>Wei</snm><fnm>P</fnm></au><au><snm>Pan</snm><fnm>W</fnm></au></aug><source>J R Stat Soc Ser C Appl Stat</source><pubdate>2010</pubdate><volume>59</volume><fpage>105</fpage><lpage>125</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1111/j.1467-9876.2009.00686.x</pubid><pubid idtype="pmcid">3046412</pubid><pubid idtype="pmpid">21373371</pubid></pubidlist></xrefbib></bibl></refgrp>
   </bm>
</art>