<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art><ui>1753-6561-6-S7-S3</ui><ji>1753-6561</ji><fm>
<dochead>Proceedings</dochead>
<bibl>
<title>
<p>A <it>&#957;</it>-support vector regression based approach for predicting imputation quality</p>
</title>
<aug>
<au ca="yes" id="A1"><snm>Huang</snm><fnm>Yi-Hung</fnm><insr iid="I1"/><insr iid="I2"/><email>irashadow@gmail.com</email></au>
<au id="A2"><snm>Rice</snm><mi>P</mi><fnm>John</fnm><insr iid="I3"/></au>
<au id="A3"><snm>Saccone</snm><mi>F</mi><fnm>Scott</fnm><insr iid="I3"/></au>
<au id="A4"><snm>Ambite</snm><mnm>Luis</mnm><fnm>Jos&#233;</fnm><insr iid="I4"/></au>
<au id="A5"><snm>Arens</snm><fnm>Yigal</fnm><insr iid="I4"/></au>
<au id="A6"><snm>Tischfield</snm><mi>A</mi><fnm>Jay</fnm><insr iid="I5"/></au>
<au ca="yes" id="A7"><snm>Hsu</snm><fnm>Chun-Nan</fnm><insr iid="I1"/><insr iid="I4"/><email>chunnan@isi.edu</email></au>
</aug>
<insg>
<ins id="I1"><p>Institute of Information Science, Academia Sinica, Taipei 115, Taiwan</p></ins>
<ins id="I2"><p>Department of Computer Science, National Taiwan University, Taipei 106, Taiwan</p></ins>
<ins id="I3"><p>Department of Psychiatry, Washington University, St. Louis, Missouri, USA</p></ins>
<ins id="I4"><p>Information Science Institute, University of Southern California, Marina del Rey, California, USA</p></ins>
<ins id="I5"><p>Department of Genetics, Rutgers University, Piscataway, New Jersey, USA</p></ins>
</insg>
<source>BMC Proceedings</source>


<supplement><title><p>Proceedings of the Great Lakes Bioinformatics Conference 2012</p></title><editor>Laura Brown, Margit Burmeister and Elodie Ghedin</editor><note>Proceedings</note></supplement><conference><title><p>Great Lakes Bioinformatics Conference 2012</p></title><location>Ann Arbor, MI, USA</location><date-range>15-17 May 2012</date-range><url>http://www.iscb.org/glbio2012/</url></conference><issn>1753-6561</issn>
<pubdate>2012</pubdate>
<volume>6</volume>
<issue>Suppl 7</issue>
<fpage>S3</fpage>
<url>http://www.biomedcentral.com/1753-6561/6/S7/S3</url>
<xrefbib><pubid idtype="doi">10.1186/1753-6561-6-S7-S3</pubid></xrefbib>
</bibl>
<history><pub><date><day>13</day><month>11</month><year>2012</year></date></pub></history>
<cpyrt><year>2012</year><collab>Huang et al.; licensee BioMed Central Ltd.</collab><note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note></cpyrt>
<abs>
<sec>
<st>
<p>Abstract</p>
</st>
<sec>
<st>
<p>Background</p>
</st>
<p>Decades of genome-wide association studies (GWAS) have accumulated large volumes of genomic data that can potentially be reused to increase the statistical power of new studies. However, different genotyping platforms with different marker sets have been used as biotechnology has evolved, which prevents pooling and comparison of old and new data. For example, to pool data collected by 550K chips with newer data collected by 900K chips, we need to impute missing loci. Many imputation algorithms have been developed, but the posterior probabilities estimated by those algorithms are not a reliable measure of the quality of the imputation. Recently, many studies have used an imputation quality score (IQS) to measure the quality of imputation. Computing the IQS requires knowing the true alleles. When the true alleles are unknown, a previously estimated IQS can be reused only if the population and the imputed loci are identical.</p>
</sec>
<sec>
<st>
<p>Methods</p>
</st>
<p>Here, we present a regression model that estimates IQS by learning from the imputation of loci with known alleles. We designed a small set of features for the prediction of IQS, including minor allele frequency and distance to the nearest known cross-over hotspot. We evaluated our regression models by estimating the IQS of imputations produced by BEAGLE for a set of GWAS data from the NCBI GEO database, collected from samples of different ethnic populations.</p>
</sec>
<sec>
<st>
<p>Results</p>
</st>
<p>We constructed a <it>&#957;</it>-SVR based approach as our regression model. Our evaluation shows that this regression model achieves mean squared errors of less than 0.02 and a correlation coefficient close to 0.75 in different imputation scenarios. We also show how the regression results can help remove false positives in association studies.</p>
</sec>
<sec>
<st>
<p>Conclusion</p>
</st>
<p>Reliable estimation of IQS will facilitate integration and reuse of existing genomic data for meta-analysis and secondary analysis. Experiments show that it is possible to use a small number of features to regress the IQS by learning from different training examples of imputation and IQS pairs.</p>
</sec>
</sec>
</abs>
</fm><bdy>
<sec>
<st>
<p>Background</p>
</st>
<p>In the past decade, the data sets collected for genome-wide association studies (GWAS) have grown geometrically. Reusing these valuable data in new studies is difficult because they were collected through different study designs and on different platforms. Various imputation algorithms (<it>e.g.</it>, IMPUTE <abbrgrp>
<abbr bid="B1">1</abbr>
</abbrgrp>, BEAGLE <abbrgrp>
<abbr bid="B2">2</abbr>
<abbr bid="B3">3</abbr>
<abbr bid="B4">4</abbr>
<abbr bid="B5">5</abbr>
<abbr bid="B6">6</abbr>
<abbr bid="B7">7</abbr>
</abbrgrp>, and MACH <abbrgrp>
<abbr bid="B8">8</abbr>
</abbrgrp>) have been developed to predict individual genotypes at untyped markers. Although these imputation algorithms are already in use, methods for measuring imputation quality are still rarely addressed. The imputation quality of single-nucleotide polymorphism (SNP) genotypes varies considerably across loci. For this reason, we investigate how to measure the imputation quality of a particular SNP imputed by these algorithms. Once an imputation quality measure is established, researchers can pay more attention to poorly imputed SNPs in the data integration process. Recently, <abbrgrp>
<abbr bid="B9">9</abbr>
</abbrgrp> proposed a new statistic for assessing imputation reliability, designated the imputation quality score (IQS). The IQS has been shown to be commensurate with the true quality of the imputation and has been successfully applied to filter false-positive associations in GWAS that use imputed genotypes <abbrgrp>
<abbr bid="B9">9</abbr>
</abbrgrp>.</p>
<p>The IQS for each imputed SNP is computed from two quantities, the proportion of observed agreement (<it>P<sub>o</sub>
</it>) and the proportion of chance agreement (<it>P<sub>c</sub>
</it>), to account not only for the accuracy of the imputation but also for whether it is accurate by chance alone. In detail, the computation of IQS requires the posterior probabilities of AA, AB and BB as output by the imputation program. For one SNP genotyped on <it>N </it>individuals, the probabilities can be readily tabulated as shown in Table <tblr tid="T1">1</tblr>, where each cell, <it>n<sub>ij</sub>
</it>, represents the number of individuals with true genotype <it>j </it>and imputed genotype <it>i</it>. The observed agreement <it>P<sub>o </sub>
</it>is defined as the proportion <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1753-6561-6-S7-S3-i1"><m:mrow>
   <m:msub>
      <m:mrow>
         <m:mi>P</m:mi>
      </m:mrow>
      <m:mrow>
         <m:mi>o</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:mfrac>
      <m:mrow>
         <m:msub>
            <m:mrow>
               <m:mo mathsize="big">&#8721;</m:mo>
            </m:mrow>
            <m:mrow>
               <m:mi>i</m:mi>
            </m:mrow>
         </m:msub>
         <m:msub>
            <m:mrow>
               <m:mi>n</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>i</m:mi>
               <m:mi>i</m:mi>
            </m:mrow>
         </m:msub>
      </m:mrow>
      <m:mrow>
         <m:msub>
            <m:mrow>
               <m:mi>n</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mo class="MathClass-bin">&#8901;</m:mo>
               <m:mo class="MathClass-bin">&#8901;</m:mo>
            </m:mrow>
         </m:msub>
      </m:mrow>
   </m:mfrac>
</m:mrow>
</m:math>
</inline-formula>. Similar to <it>P<sub>o</sub>
</it>, the chance agreement <it>P<sub>c </sub>
</it>is defined as the proportion of agreement which is expected by chance: <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1753-6561-6-S7-S3-i2"><m:mrow>
   <m:msub>
      <m:mrow>
         <m:mi>P</m:mi>
      </m:mrow>
      <m:mrow>
         <m:mi>c</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:mfrac>
      <m:mrow>
         <m:msub>
            <m:mrow>
               <m:mo mathsize="big">&#8721;</m:mo>
            </m:mrow>
            <m:mrow>
               <m:mi>i</m:mi>
            </m:mrow>
         </m:msub>
         <m:msub>
            <m:mrow>
               <m:msub>
                  <m:mrow>
                     <m:mi>n</m:mi>
                  </m:mrow>
                  <m:mrow>
                     <m:mi>i</m:mi>
                  </m:mrow>
               </m:msub>
            </m:mrow>
            <m:mrow>
               <m:mo class="MathClass-bin">&#8901;</m:mo>
            </m:mrow>
         </m:msub>
         <m:msub>
            <m:mrow>
               <m:msub>
                  <m:mrow>
                     <m:mi>n</m:mi>
                  </m:mrow>
                  <m:mrow>
                     <m:mo class="MathClass-bin">&#8901;</m:mo>
                  </m:mrow>
               </m:msub>
            </m:mrow>
            <m:mrow>
               <m:mi>i</m:mi>
            </m:mrow>
         </m:msub>
      </m:mrow>
      <m:mrow>
         <m:msup>
            <m:mrow>
               <m:msub>
                  <m:mrow>
                     <m:mi>n</m:mi>
                  </m:mrow>
                  <m:mrow>
                     <m:mo class="MathClass-bin">&#8901;</m:mo>
                     <m:mo class="MathClass-bin">&#8901;</m:mo>
                  </m:mrow>
               </m:msub>
            </m:mrow>
            <m:mrow>
               <m:mn>2</m:mn>
            </m:mrow>
         </m:msup>
      </m:mrow>
   </m:mfrac>
</m:mrow>
</m:math>
</inline-formula>, where <it>n<sub>i.</sub>
</it>, <it>n<sub>.i</sub>
</it>, and <it>n</it>
<sub>.. </sub>are defined in Table <tblr tid="T1">1</tblr>. The IQS is then computed as Cohen's kappa coefficient <abbrgrp>
<abbr bid="B10">10</abbr>
</abbrgrp> and is defined as a function of <it>P<sub>o </sub>
</it>and <it>P<sub>c </sub>
</it>as</p>
<tbl id="T1"><title><p>Table 1</p></title><caption><p>Marginal cross classification of the genotypes used for the computation of IQS</p></caption><tblbdy cols="5">
      <r>
         <c>
            <p/>
         </c>
         <c ca="center" cspan="3">
            <p>
               <b>True genotypes</b>
            </p>
         </c>
         <c>
            <p/>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>
               <b>Imputed Genotypes</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>
                  <it>AA</it>
               </b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>
                  <it>AB</it>
               </b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>
                  <it>BB</it>
               </b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>
                  <it>Total</it>
               </b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>
               <it>AA</it>
            </p>
         </c>
         <c ca="center">
            <p>
               <it>n</it>
               <sub>11</sub>
            </p>
         </c>
         <c ca="center">
            <p>
               <it>n</it>
               <sub>12</sub>
            </p>
         </c>
         <c ca="center">
            <p><it>n</it><sub>13</sub></p>
         </c>
         <c ca="center">
            <p>
               <it>n</it>
               <sub>1.</sub>
            </p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>
               <it>AB</it>
            </p>
         </c>
         <c ca="center">
            <p>
               <it>n</it>
               <sub>21</sub>
            </p>
         </c>
         <c ca="center">
            <p>
               <it>n</it>
               <sub>22</sub>
            </p>
         </c>
         <c ca="center">
            <p>
               <it>n</it>
               <sub>23</sub>
            </p>
         </c>
         <c ca="center">
            <p>
               <it>n</it>
               <sub>2.</sub>
            </p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>
               <it>BB</it>
            </p>
         </c>
         <c ca="center">
            <p>
               <it>n</it>
               <sub>31</sub>
            </p>
         </c>
         <c ca="center">
            <p>
               <it>n</it>
               <sub>32</sub>
            </p>
         </c>
         <c ca="center">
            <p>
               <it>n</it>
               <sub>33</sub>
            </p>
         </c>
         <c ca="center">
            <p>
               <it>n</it>
               <sub>3.</sub>
            </p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>
               <it>Total</it>
            </p>
         </c>
         <c ca="center">
            <p>
               <it>n</it>
               <sub>.1</sub>
            </p>
         </c>
         <c ca="center">
            <p>
               <it>n</it>
               <sub>.2</sub>
            </p>
         </c>
         <c ca="center">
            <p><it>n</it><sub>.3</sub></p>
         </c>
         <c ca="center">
            <p><it>n</it><sub>..</sub></p>
         </c>
      </r>
   </tblbdy></tbl>
<p>
<display-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1753-6561-6-S7-S3-i3"><m:mrow>
   <m:mi mathvariant="bold">I</m:mi>
   <m:mi mathvariant="bold">Q</m:mi>
   <m:mi mathvariant="bold">S</m:mi>
   <m:mo>=</m:mo>
   <m:mstyle scriptlevel="+1">
      <m:mfrac>
         <m:mrow>
            <m:msub>
               <m:mi>P</m:mi>
               <m:mi>o</m:mi>
            </m:msub>
            <m:mo>&#8722;</m:mo>
            <m:msub>
               <m:mi>P</m:mi>
               <m:mi>c</m:mi>
            </m:msub>
         </m:mrow>
         <m:mrow>
            <m:mn>1</m:mn>
            <m:mo>&#8722;</m:mo>
            <m:msub>
               <m:mi>P</m:mi>
               <m:mi>c</m:mi>
            </m:msub>
         </m:mrow>
      </m:mfrac>
   </m:mstyle>
   <m:mo>.</m:mo>
</m:mrow>
</m:math>
</display-formula>
</p>
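<p>As a minimal illustrative sketch (not the authors' implementation), the following Python function computes the IQS from a 3-by-3 table of counts laid out as in Table 1, with rows indexed by imputed genotype and columns by true genotype. The function name iqs_from_table, the handling of the degenerate case, and the example counts are assumptions made for illustration only.</p>
<p>
```python
def iqs_from_table(table):
    """Return the IQS (Cohen's kappa) for a square table of counts.

    table[i][j] holds n_ij: the number of individuals with imputed
    genotype i and true genotype j (AA, AB, BB), as in Table 1.
    """
    k = len(table)
    total = float(sum(sum(row) for row in table))            # n..
    row_sums = [sum(row) for row in table]                    # n_i.
    col_sums = [sum(table[i][j] for i in range(k)) for j in range(k)]  # n_.j
    p_o = sum(table[i][i] for i in range(k)) / total          # observed agreement
    p_c = sum(row_sums[i] * col_sums[i] for i in range(k)) / total ** 2  # chance agreement
    if p_c == 1.0:
        # Assumed convention: kappa is undefined when P_c equals 1.
        return float("nan")
    return (p_o - p_c) / (1.0 - p_c)


# Hypothetical counts for 100 individuals (illustration only).
example = [
    [40, 5, 0],
    [5, 30, 5],
    [0, 5, 10],
]
print(round(iqs_from_table(example), 4))
```
</p>
<p>For these hypothetical counts, the observed agreement is 0.8 and the chance agreement is 0.385, giving an IQS of approximately 0.675.</p>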
<p>Assessment of <it>P<sub>o</sub>
</it>, <it>P<sub>c</sub>
</it>, and IQS needs the true genotypes to be known. <abbrgrp>
<abbr bid="B9">9</abbr>
</abbrgrp> showed that for the same population and the same locus imputed using the same set of loci with known genotypes, the estimated IQS values are highly correlated. We showed this by dividing a sample in half, imputing SNPs of the Illumina 1 M array using the SNP genotyping results from the Illumina 550 K array, and then estimating the IQS. We obtained a correlation coefficient of 0.99 between the IQS values for the same set of imputed SNPs. That is, we can expect the IQS to be nearly the same if the population, the imputed SNPs, and the SNPs of known genotypes are identical. If previously estimated IQS values that match these conditions are available, they can be reused. Therefore, it is possible to obtain the IQS without knowing the true genotypes by querying a pre-constructed IQS database.</p>
<p>However, exhausting all populations and combinations of imputation loci to establish such a database of all useful IQS values may take considerable resources. Here, we develop a computational method to estimate the IQS without knowing the true genotypes. We assess whether it is possible to build a regression model from imputations of SNP sites with known alleles, and then use the regression model to estimate the IQS for SNPs with unknown alleles. The idea is to use additional statistical information to build a regression model that predicts the IQS. In practice, researchers also work with specific sets of variants, and this method will facilitate the creation of a database of the IQS of those variants.</p>
</sec>
<sec>
<st>
<p>Methods and materials</p>
</st>
<sec>
<st>
<p>
<it>&#957;</it>
<b>-Support vector regression</b>
</p>
</st>
<p>In a multi-dimensional regression problem, we have a data set of <it>l </it>
<it>d</it>-dimensional independent variables <it>x<sub>i </sub>
</it>&#8712; &#8477;<it>
<sup>d</sup>
</it>, <it>i </it>= 1,..., <it>l </it>and dependent variables <it>y<sub>i </sub>
</it>&#8712; &#8477;. In our IQS regression problem, <it>y<sub>i </sub>
</it>represents the true IQS and <it>x<sub>i </sub>
</it>denotes the input feature vector. The goal is to find a function that approximates <it>y<sub>i</sub>
</it>. A solution of this problem based on a kernel method is to find the function <it>y<sub>i </sub>
</it>&#8776; <it>f</it>(<it>x<sub>i</sub>
</it>, <it>w</it>, <it>b</it>) = <it>w </it>&#8901; <it>&#966; </it>(<it>x<sub>i</sub>
</it>) + <it>b</it>, where <it>w </it>&#8712; &#8477;<it>
<sup>d </sup>
</it>and <it>b </it>&#8712; &#8477; are parameters and <it>&#966; : </it>&#8477;<it>
<sup>d </sup>
</it>&#8594; &#8477;<it>
<sup>d </sup>
</it>is a mapping such that there exists a kernel function that computes the inner product <it>&#966; </it>(<it>x<sub>i</sub>
</it>) &#8901; <it>&#966; </it>(<it>x<sub>j</sub>
</it>) = <it>k</it>(<it>x<sub>i</sub>
</it>, <it>x<sub>j</sub>
</it>). Because the radial basis function (RBF) kernel gave relatively high accuracy in comparison with other kernel functions (data not shown), we chose the RBF kernel as our kernel function <abbrgrp>
<abbr bid="B11">11</abbr>
<abbr bid="B12">12</abbr>
</abbrgrp>.</p>
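<p>As a brief sketch of the kernel mentioned above, the RBF kernel is commonly written as k(x_i, x_j) = exp(-gamma ||x_i - x_j||^2). The width parameter gamma and its default value are assumptions here; the text does not state the parameterization or value that was used.</p>
<p>
```python
import math


def rbf_kernel(x_i, x_j, gamma=1.0):
    """Return exp(-gamma * ||x_i - x_j||^2); gamma is an assumed width parameter."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * sq_dist)
```
</p>
<p>The kernel equals 1 for identical feature vectors and decays toward 0 as the vectors move apart, so nearby training examples dominate each prediction.</p>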
<p>Many models and algorithms have been developed to search for the parameters <it>w </it>and <it>b </it>of the regression function that best fits the input data set. The <it>&#949;</it>-Support Vector Regression model (<it>&#949;</it>-SVR) is one such model. Its formulation is given as:</p>
<p>
<display-formula id="M1">
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1753-6561-6-S7-S3-i4"><m:mrow>
   <m:mtable class="gathered">
      <m:mtr>
         <m:mtd>
            <m:mstyle mathvariant="bold">
               <m:mi>m</m:mi>
               <m:mi>i</m:mi>
               <m:mi>n</m:mi>
               <m:mi>i</m:mi>
               <m:mi>m</m:mi>
               <m:mi>i</m:mi>
               <m:mi>z</m:mi>
               <m:mi>e</m:mi>
            </m:mstyle>
            <m:mspace class="thinspace" width="0.3em"/>
            <m:mfrac>
               <m:mrow>
                  <m:mn>1</m:mn>
               </m:mrow>
               <m:mrow>
                  <m:mn>2</m:mn>
               </m:mrow>
            </m:mfrac>
            <m:mspace class="thinspace" width="0.3em"/>
            <m:mo class="MathClass-rel">|</m:mo>
            <m:mo class="MathClass-rel">|</m:mo>
            <m:mi>w</m:mi>
            <m:mo class="MathClass-rel">|</m:mo>
            <m:msup>
               <m:mrow>
                  <m:mo class="MathClass-rel">|</m:mo>
               </m:mrow>
               <m:mrow>
                  <m:mn>2</m:mn>
               </m:mrow>
            </m:msup>
            <m:mo class="MathClass-bin">+</m:mo>
            <m:mi>C</m:mi>
            <m:munderover accent="false" accentunder="false">
               <m:mrow>
                  <m:mo mathsize="big"> &#8721;</m:mo>
               </m:mrow>
               <m:mrow>
                  <m:mi>i</m:mi>
                  <m:mo class="MathClass-rel">=</m:mo>
                  <m:mn>1</m:mn>
               </m:mrow>
               <m:mrow>
                  <m:mi>l</m:mi>
               </m:mrow>
            </m:munderover>
            <m:mrow>
               <m:mo class="MathClass-open">(</m:mo>
               <m:mrow>
                  <m:msub>
                     <m:mrow>
                        <m:mi>&#958;</m:mi>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>i</m:mi>
                     </m:mrow>
                  </m:msub>
                  <m:mo class="MathClass-bin">+</m:mo>
                  <m:msubsup>
                     <m:mrow>
                        <m:mi>&#958;</m:mi>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>i</m:mi>
                     </m:mrow>
                     <m:mrow>
                        <m:mo class="MathClass-bin">*</m:mo>
                     </m:mrow>
                  </m:msubsup>
               </m:mrow>
               <m:mo class="MathClass-close">)</m:mo>
            </m:mrow>
         </m:mtd>
      </m:mtr>
      <m:mtr>
         <m:mtd>
            <m:mstyle mathvariant="bold">
               <m:mi>s</m:mi>
               <m:mi>u</m:mi>
               <m:mi>b</m:mi>
               <m:mi>j</m:mi>
               <m:mi>e</m:mi>
               <m:mi>c</m:mi>
               <m:mi>t</m:mi>
            </m:mstyle>
            <m:mspace class="thinspace" width="0.3em"/>
            <m:mstyle mathvariant="bold">
               <m:mi>t</m:mi>
               <m:mi>o</m:mi>
            </m:mstyle>
            <m:mspace class="thinspace" width="0.3em"/>
            <m:mfenced close="" open="{" separators="">
               <m:mrow>
                  <m:mtable class="array" columnlines="none none none none none none none none none none none none none none none none none none none" equalcolumns="false" equalrows="false">
                     <m:mtr>
                        <m:mtd class="array" columnalign="center">
                           <m:msub>
                              <m:mrow>
                                 <m:mi>y</m:mi>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                              </m:mrow>
                           </m:msub>
                           <m:mo class="MathClass-bin">-</m:mo>
                           <m:mi>w</m:mi>
                           <m:mo class="MathClass-bin">&#8901;</m:mo>
                           <m:mi>&#966;</m:mi>
                           <m:mrow>
                              <m:mo class="MathClass-open">(</m:mo>
                              <m:mrow>
                                 <m:msub>
                                    <m:mrow>
                                       <m:mi>x</m:mi>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mi>i</m:mi>
                                    </m:mrow>
                                 </m:msub>
                              </m:mrow>
                              <m:mo class="MathClass-close">)</m:mo>
                           </m:mrow>
                           <m:mo class="MathClass-bin">-</m:mo>
                           <m:mi>b</m:mi>
                           <m:mo class="MathClass-rel">&#8804;</m:mo>
                           <m:mi>&#949;</m:mi>
                           <m:mo class="MathClass-bin">+</m:mo>
                           <m:msub>
                              <m:mrow>
                                 <m:mi>&#958;</m:mi>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                              </m:mrow>
                           </m:msub>
                        </m:mtd>
                     </m:mtr>
                     <m:mtr>
                        <m:mtd class="array" columnalign="center">
                           <m:mi>w</m:mi>
                           <m:mo class="MathClass-bin">&#8901;</m:mo>
                           <m:mi>&#966;</m:mi>
                           <m:mrow>
                              <m:mo class="MathClass-open">(</m:mo>
                              <m:mrow>
                                 <m:msub>
                                    <m:mrow>
                                       <m:mi>x</m:mi>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mi>i</m:mi>
                                    </m:mrow>
                                 </m:msub>
                              </m:mrow>
                              <m:mo class="MathClass-close">)</m:mo>
                           </m:mrow>
                           <m:mo class="MathClass-bin">+</m:mo>
                           <m:mi>b</m:mi>
                           <m:mo class="MathClass-bin">-</m:mo>
                           <m:msub>
                              <m:mrow>
                                 <m:mi>y</m:mi>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                              </m:mrow>
                           </m:msub>
                           <m:mo class="MathClass-rel">&#8804;</m:mo>
                           <m:mi>&#949;</m:mi>
                           <m:mo class="MathClass-bin">+</m:mo>
                           <m:msubsup>
                              <m:mrow>
                                 <m:mi>&#958;</m:mi>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                              </m:mrow>
                              <m:mrow>
                                 <m:mo class="MathClass-bin">*</m:mo>
                              </m:mrow>
                           </m:msubsup>
                        </m:mtd>
                     </m:mtr>
                     <m:mtr>
                        <m:mtd class="array" columnalign="center">
                           <m:msub>
                              <m:mrow>
                                 <m:mi>&#958;</m:mi>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                              </m:mrow>
                           </m:msub>
                           <m:mo class="MathClass-punc">,</m:mo>
                           <m:msubsup>
                              <m:mrow>
                                 <m:mi>&#958;</m:mi>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                              </m:mrow>
                              <m:mrow>
                                 <m:mo class="MathClass-bin">*</m:mo>
                              </m:mrow>
                           </m:msubsup>
                           <m:mo class="MathClass-rel">&#8805;</m:mo>
                           <m:mn>0</m:mn>
                           <m:mi>.</m:mi>
                        </m:mtd>
                     </m:mtr>
                     <m:mtr>
                        <m:mtd class="array" columnalign="center"/>
                     </m:mtr>
                  </m:mtable>
               </m:mrow>
            </m:mfenced>
         </m:mtd>
      </m:mtr>
      <m:mtr>
         <m:mtd/>
      </m:mtr>
   </m:mtable>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>The parameter <it>C </it>controls the tradeoff between minimizing the training error and limiting the complexity of the model. If it is too small, the model may underfit the data. The parameter <it>&#949; </it>serves as the tolerance of errors of the regression. Combined with the slack variables &#958;<it>
<sub>i </sub>
</it>and
<inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1753-6561-6-S7-S3-i5"><m:mrow>
   <m:msubsup>
      <m:mrow>
         <m:mi>&#958;</m:mi>
      </m:mrow>
      <m:mrow>
         <m:mi>i</m:mi>
      </m:mrow>
      <m:mrow>
         <m:mo class="MathClass-bin">*</m:mo>
      </m:mrow>
   </m:msubsup>
</m:mrow>
</m:math>
</inline-formula>, this yields a soft-margin approach to regression that can be flexibly adjusted <abbrgrp>
<abbr bid="B11">11</abbr>
<abbr bid="B12">12</abbr>
</abbrgrp>.</p>
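<p>To make the role of the tolerance and the slack variables concrete, the following sketch evaluates the smallest slack values that satisfy the constraints of the <it>&#949;</it>-SVR formulation for given predictions, and the resulting objective value. It is an illustration under stated assumptions, not a solver: the function names are hypothetical, and the predictions and the squared norm of w are taken as given inputs.</p>
<p>
```python
def slack_variables(y_true, y_pred, epsilon):
    """Return the smallest (xi, xi_star) satisfying the epsilon-SVR constraints.

    xi      is nonzero when the target lies above the tube: y - f(x) exceeds epsilon.
    xi_star is nonzero when the target lies below the tube: f(x) - y exceeds epsilon.
    """
    xi = max(0.0, (y_true - y_pred) - epsilon)
    xi_star = max(0.0, (y_pred - y_true) - epsilon)
    return xi, xi_star


def svr_objective(w_norm_sq, pairs, epsilon, C):
    """Evaluate 0.5 * ||w||^2 + C * sum(xi + xi_star) for fixed predictions.

    pairs is a list of (y_true, y_pred) tuples; w_norm_sq is ||w||^2.
    """
    penalty = 0.0
    for y_true, y_pred in pairs:
        xi, xi_star = slack_variables(y_true, y_pred, epsilon)
        penalty += xi + xi_star
    return 0.5 * w_norm_sq + C * penalty
```
</p>
<p>Predictions that fall within the tolerance of the target contribute no penalty, which is why a larger tolerance generally leaves fewer training examples influencing the fitted function.</p>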
<p>The <it>&#957;</it>-Support Vector Regression (<it>&#957;</it>-SVR) introduces another parameter <it>&#957; </it>in the formulation, which has been shown to be easier to adjust than <it>C</it>. One reason is that the range of <it>&#957; </it>is [0,1] whereas the range of <it>C </it>is [0, &#8734;) <abbrgrp>
<abbr bid="B13">13</abbr>
<abbr bid="B14">14</abbr>
<abbr bid="B15">15</abbr>
</abbrgrp>.</p>
<p>
<display-formula id="M2">
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1753-6561-6-S7-S3-i6"><m:mrow>
   <m:mtable class="gathered">
      <m:mtr>
         <m:mtd>
            <m:mstyle mathvariant="bold">
               <m:mi>m</m:mi>
               <m:mi>i</m:mi>
               <m:mi>n</m:mi>
               <m:mi>i</m:mi>
               <m:mi>m</m:mi>
               <m:mi>i</m:mi>
               <m:mi>z</m:mi>
               <m:mi>e</m:mi>
            </m:mstyle>
            <m:mspace class="thinspace" width="0.3em"/>
            <m:mfrac>
               <m:mrow>
                  <m:mn>1</m:mn>
               </m:mrow>
               <m:mrow>
                  <m:mn>2</m:mn>
               </m:mrow>
            </m:mfrac>
            <m:mo class="MathClass-rel">|</m:mo>
            <m:mo class="MathClass-rel">|</m:mo>
            <m:mi>w</m:mi>
            <m:mo class="MathClass-rel">|</m:mo>
            <m:msup>
               <m:mrow>
                  <m:mo class="MathClass-rel">|</m:mo>
               </m:mrow>
               <m:mrow>
                  <m:mn>2</m:mn>
               </m:mrow>
            </m:msup>
            <m:mo class="MathClass-bin">+</m:mo>
            <m:mi>C</m:mi>
            <m:mrow>
               <m:mo class="MathClass-open">(</m:mo>
               <m:mrow>
                  <m:mi>&#957;</m:mi>
                  <m:mi>&#949;</m:mi>
                  <m:mo class="MathClass-bin">+</m:mo>
                  <m:mfrac>
                     <m:mrow>
                        <m:mn>1</m:mn>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>l</m:mi>
                     </m:mrow>
                  </m:mfrac>
                  <m:munderover accent="false" accentunder="false">
                     <m:mrow>
                        <m:mo mathsize="big"> &#8721;</m:mo>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>i</m:mi>
                        <m:mo class="MathClass-rel">=</m:mo>
                        <m:mn>1</m:mn>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>l</m:mi>
                     </m:mrow>
                  </m:munderover>
                  <m:mrow>
                     <m:mo class="MathClass-open">(</m:mo>
                     <m:mrow>
                        <m:msub>
                           <m:mrow>
                              <m:mi>&#958;</m:mi>
                           </m:mrow>
                           <m:mrow>
                              <m:mi>i</m:mi>
                           </m:mrow>
                        </m:msub>
                        <m:mo class="MathClass-bin">+</m:mo>
                        <m:msubsup>
                           <m:mrow>
                              <m:mi>&#958;</m:mi>
                           </m:mrow>
                           <m:mrow>
                              <m:mi>i</m:mi>
                           </m:mrow>
                           <m:mrow>
                              <m:mo class="MathClass-bin">*</m:mo>
                           </m:mrow>
                        </m:msubsup>
                     </m:mrow>
                     <m:mo class="MathClass-close">)</m:mo>
                  </m:mrow>
               </m:mrow>
               <m:mo class="MathClass-close">)</m:mo>
            </m:mrow>
         </m:mtd>
      </m:mtr>
      <m:mtr>
         <m:mtd>
            <m:mstyle mathvariant="bold">
               <m:mi>s</m:mi>
               <m:mi>u</m:mi>
               <m:mi>b</m:mi>
               <m:mi>j</m:mi>
               <m:mi>e</m:mi>
               <m:mi>c</m:mi>
               <m:mi>t</m:mi>
            </m:mstyle>
            <m:mspace class="thinspace" width="0.3em"/>
            <m:mstyle mathvariant="bold">
               <m:mi>t</m:mi>
               <m:mi>o</m:mi>
            </m:mstyle>
            <m:mspace class="thinspace" width="0.3em"/>
            <m:mfenced close="" open="{" separators="">
               <m:mrow>
                  <m:mtable class="array" columnlines="none none none none none none none none none none none none none none none none none none none" equalcolumns="false" equalrows="false">
                     <m:mtr>
                        <m:mtd class="array" columnalign="center">
                           <m:msub>
                              <m:mrow>
                                 <m:mi>y</m:mi>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                              </m:mrow>
                           </m:msub>
                           <m:mo class="MathClass-bin">-</m:mo>
                           <m:mi>w</m:mi>
                           <m:mo class="MathClass-bin">&#8901;</m:mo>
                           <m:mi>&#966;</m:mi>
                           <m:mrow>
                              <m:mo class="MathClass-open">(</m:mo>
                              <m:mrow>
                                 <m:msub>
                                    <m:mrow>
                                       <m:mi>x</m:mi>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mi>i</m:mi>
                                    </m:mrow>
                                 </m:msub>
                              </m:mrow>
                              <m:mo class="MathClass-close">)</m:mo>
                           </m:mrow>
                           <m:mo class="MathClass-bin">-</m:mo>
                           <m:mi>b</m:mi>
                           <m:mo class="MathClass-rel">&#8804;</m:mo>
                           <m:mi>&#949;</m:mi>
                           <m:mo class="MathClass-bin">+</m:mo>
                           <m:msub>
                              <m:mrow>
                                 <m:mi>&#958;</m:mi>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                              </m:mrow>
                           </m:msub>
                        </m:mtd>
                     </m:mtr>
                     <m:mtr>
                        <m:mtd class="array" columnalign="center">
                           <m:mi>w</m:mi>
                           <m:mo class="MathClass-bin">&#8901;</m:mo>
                           <m:mi>&#966;</m:mi>
                           <m:mrow>
                              <m:mo class="MathClass-open">(</m:mo>
                              <m:mrow>
                                 <m:msub>
                                    <m:mrow>
                                       <m:mi>x</m:mi>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mi>i</m:mi>
                                    </m:mrow>
                                 </m:msub>
                              </m:mrow>
                              <m:mo class="MathClass-close">)</m:mo>
                           </m:mrow>
                           <m:mo class="MathClass-bin">+</m:mo>
                           <m:mi>b</m:mi>
                           <m:mo class="MathClass-bin">-</m:mo>
                           <m:msub>
                              <m:mrow>
                                 <m:mi>y</m:mi>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                              </m:mrow>
                           </m:msub>
                           <m:mo class="MathClass-rel">&#8804;</m:mo>
                           <m:mi>&#949;</m:mi>
                           <m:mo class="MathClass-bin">+</m:mo>
                           <m:msubsup>
                              <m:mrow>
                                 <m:mi>&#958;</m:mi>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                              </m:mrow>
                              <m:mrow>
                                 <m:mo class="MathClass-bin">*</m:mo>
                              </m:mrow>
                           </m:msubsup>
                        </m:mtd>
                     </m:mtr>
                     <m:mtr>
                        <m:mtd class="array" columnalign="center">
                           <m:msub>
                              <m:mrow>
                                 <m:mi>&#958;</m:mi>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                              </m:mrow>
                           </m:msub>
                           <m:mo class="MathClass-punc">,</m:mo>
                           <m:msubsup>
                              <m:mrow>
                                 <m:mi>&#958;</m:mi>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                              </m:mrow>
                              <m:mrow>
                                 <m:mo class="MathClass-bin">*</m:mo>
                              </m:mrow>
                           </m:msubsup>
                           <m:mo class="MathClass-rel">&#8805;</m:mo>
                           <m:mn>0</m:mn>
                           <m:mo class="MathClass-punc">,</m:mo>
                           <m:mi>&#949;</m:mi>
                           <m:mo class="MathClass-rel">&#8805;</m:mo>
                           <m:mn>0</m:mn>
                        </m:mtd>
                     </m:mtr>
                     <m:mtr>
                        <m:mtd class="array" columnalign="center"/>
                     </m:mtr>
                  </m:mtable>
               </m:mrow>
            </m:mfenced>
         </m:mtd>
      </m:mtr>
      <m:mtr>
         <m:mtd/>
      </m:mtr>
   </m:mtable>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>Moreover, the parameter <it>&#957; </it>serves as an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors. Compared with <it>C</it>, a suitable value of <it>&#957; </it>is therefore more intuitive to select <abbrgrp>
<abbr bid="B13">13</abbr>
<abbr bid="B14">14</abbr>
</abbrgrp>. Therefore, we chose <it>&#957;</it>-SVR over <it>&#949;</it>-SVR for our IQS prediction model. This model is also known to provide high out-of-sample generalization performance.</p>
<p>We chose LibSVM <abbrgrp>
<abbr bid="B16">16</abbr>
</abbrgrp> as our implementation of the <it>&#957;</it>-SVR model. The parameter <it>&#947; </it>in the radial basis function was set to 1/<it>d</it>. The parameter <it>&#957; </it>was searched within {0.1, 0.2, 0.3, ..., 1.0}, and the optimal value of <it>&#957; </it>was selected by 10-fold cross-validation on the training data set. The regression model can be applied to approximate <it>P<sub>o </sub>
</it>and <it>P<sub>c </sub>
</it>as well as the IQS.</p>
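<p>The parameter search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: <it>fit_predict</it> is a hypothetical callable standing in for a LibSVM <it>&#957;</it>-SVR training and prediction call, mean squared error is assumed as the cross-validation criterion, and <it>d</it> is assumed to be the number of input features.</p>

```python
import random

# Candidate values of nu: 0.1, 0.2, ..., 1.0
NU_GRID = [round(0.1 * k, 1) for k in range(1, 11)]


def kfold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def select_nu(fit_predict, X, y, grid=NU_GRID, k=10, seed=0):
    """Return (best_nu, cv_mse) chosen by k-fold cross-validation.

    fit_predict(nu, gamma, X_train, y_train, X_test) is a hypothetical
    callable that trains a nu-SVR with an RBF kernel and returns the
    predictions for X_test. Assumes at least k training examples.
    """
    gamma = 1.0 / len(X[0])  # gamma = 1/d, with d assumed to be the feature count
    folds = kfold_indices(len(X), k, seed)
    best = None
    for nu in grid:
        fold_errors = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(X)) if i not in held]
            preds = fit_predict(
                nu, gamma,
                [X[i] for i in train], [y[i] for i in train],
                [X[i] for i in held_out],
            )
            fold_errors.append(
                mean_squared_error([y[i] for i in held_out], preds)
            )
        cv_mse = sum(fold_errors) / k
        # Keep the first nu that attains the lowest cross-validated error.
        if best is None or best[1] > cv_mse:
            best = (nu, cv_mse)
    return best
```

<p>Passing the trainer as a callable keeps the sketch independent of any particular LibSVM binding; only the grid, the fold construction and the selection rule are shown.</p>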
</sec>
<sec>
<st>
<p>Feature generation</p>
</st>
<p>Other regression models can also be used, but the key to success is to identify a set of variables that influence the imputation quality and to use them as the input features <it>x<sub>i </sub>
</it>of the regression model. We intended to use all useful information related to imputation quality as features. Based on a statistical correlation analysis (data not shown), we selected the following 12 features of a SNP whose alleles we want to impute within a given sample.</p>
<p indent="1">1. Chromosome position: The chromosome on which the SNP is located.</p>
<p indent="1">2. Physical position: The position of the imputed SNP in bp.</p>
<p indent="1">3. Minor allele frequency (MAF): Previous work <abbrgrp>
<abbr bid="B9">9</abbr>
</abbrgrp> has shown that the minor allele frequency is an important variable correlated with the true IQS. The above three features are available in the annotation file from the genotyping platform provider.</p>
<p indent="1">4. B allele frequency: This is derived from the allele signal intensity measurement for each locus of each individual in the raw CEL files. The raw CEL files are available for the HapMap samples <abbrgrp>
<abbr bid="B17">17</abbr>
</abbrgrp>. For each imputed SNP, we used the mean of the B allele frequency of the SNP on the samples of the corresponding ethnic population.</p>
<p indent="1">5. MAF in the reference panel: In addition to the MAF provided by the annotation file, we also considered the MAF in the reference panel.</p>
<p indent="1">6. Ratio of genotypes AA/AB: It is used to indicate the proportion of genotype AA for each imputed SNP in the reference panel.</p>
<p indent="1">7. Ratio of genotypes BB/AB: Similar to feature 6, it is used to indicate the proportion of genotype BB for each imputed SNP in the reference panel.</p>
<p indent="1">8. Distance to the nearest genotyped SNP: This captures the expectation that the imputation quality will be better when the nearest genotyped SNP in the inference panel is closer.</p>
<p indent="1">9. Distance to the nearest recombination hotspot: The distance to the nearest recombination hotspot also plays an important role in the quality of the imputation. We used the recombination rates and hotspots available in the release version phase II build b35 to GRCh37 from the International HapMap Project <abbrgrp>
<abbr bid="B17">17</abbr>
</abbrgrp>.</p>
<p indent="1">10. The nearest recombination hotspot's recombination rate (cM/Mb, centiMorgans per megabase): This variable is important in the imputation process. The IMPUTE2 program uses it explicitly as a required input for the imputation <abbrgrp>
<abbr bid="B1">1</abbr>
<abbr bid="B18">18</abbr>
</abbrgrp>.</p>
<p indent="1">11. Posterior probability estimated by the imputation program: This variable is available from the output of the imputation program. The Beagle program provides the genotype probabilities file and the genotype dosage file. We used the mean values of the posterior probabilities estimated for all the individuals in the inference panel.</p>
<p indent="1">12. B-allele dosage: Given the posterior genotype probabilities for a SNP (Pr(<it>AA</it>), Pr(<it>AB</it>), and Pr(<it>BB</it>)), the estimated B-allele dosage for each individual is equal to 0 &#215; Pr(<it>AA</it>) + 1 &#215; Pr(<it>AB</it>) + 2 &#215; Pr(<it>BB</it>), which is reported in the genotype dosage file. We used the mean values of the B-allele dosage values of all the individuals in the inference panel.</p>
<p>It is worth mentioning that, according to the statistical correlation analysis, the posterior probability estimated by the imputation program and the B-allele dosage are highly correlated with the IQS. These features will be used in the regression model for the IQS as well as the regression for the observed agreement <it>P<sub>o </sub>
</it>and the chance agreement <it>P<sub>c</sub>
</it>. We will show that these 12 features are useful to construct an adequate regression model.</p>
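<p>As a small worked example of feature 12, the following sketch computes the B-allele dosage of each individual from the posterior genotype probabilities and averages it over the inference panel. The function names and the input layout (one probability triple per individual) are assumptions made for illustration; they do not reflect the Beagle file format.</p>

```python
def b_allele_dosage(p_aa, p_ab, p_bb):
    """Expected number of B alleles: 0*Pr(AA) + 1*Pr(AB) + 2*Pr(BB)."""
    return 0.0 * p_aa + 1.0 * p_ab + 2.0 * p_bb


def mean_dosage(genotype_probs):
    """Mean B-allele dosage of one SNP over all individuals.

    genotype_probs: list of (Pr(AA), Pr(AB), Pr(BB)) tuples,
    one tuple per individual in the inference panel.
    """
    dosages = [b_allele_dosage(*probs) for probs in genotype_probs]
    return sum(dosages) / len(dosages)


# Three individuals called AA, AB and BB with certainty have
# dosages 0, 1 and 2, so the mean dosage of the SNP is 1.0.
panel = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
print(mean_dosage(panel))
```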
</sec>
<sec>
<st>
<p>Data preparation</p>
</st>
<p>We prepared three data sets to evaluate the performance of our regression models. These data sets contain genotyping results of samples chosen to cover different ethnic backgrounds and collected in different disease studies. We selected recent data sets genotyped with advanced platforms that cover a large number of SNPs, so that we could keep the SNPs covered by older, obsolete platforms (with fewer SNPs probed) and hold out the rest to impute. Because the true genotypes of the held-out SNPs are known, we can use them as the gold standard to evaluate both the imputation quality and the regression.</p>
<p>The Merlion Lung Cancer Study 2 DNA <abbrgrp>
<abbr bid="B19">19</abbr>
</abbrgrp> and Oral Squamous Cell Carcinoma samples <abbrgrp>
<abbr bid="B20">20</abbr>
</abbrgrp> from the NCBI GEO database <abbrgrp>
<abbr bid="B21">21</abbr>
</abbrgrp> were used for evaluating the regression model. The Merlion Lung Cancer samples consist of two ethnic populations, East-Asian (EA) and Western-European (WE). All samples were genotyped on the Affymetrix Genome-Wide Human SNP Array 6.0 platform. This platform contains more than 906,600 SNP probes, including the historical 482,000 SNPs in the Affymetrix GeneChip Human Mapping 500K Array Set. After preprocessing of the raw CEL files, 763,252 SNPs were reported for the EA population of the Merlion Lung Cancer samples, 778,058 SNPs for the WE population of the Merlion Lung Cancer samples, and 693,494 SNPs for the Oral Squamous Cell Carcinoma samples.</p>
</sec>
<sec>
<st>
<p>Regression performance evaluation</p>
</st>
<p>We designed scenarios to simulate the imputation of missing SNPs in a data set genotyped using an old platform to the large set of SNPs on the Affymetrix SNP 6.0 array. Each scenario involves a <it>training set </it>used to construct our regression model in advance. This involves holding out a set of SNPs to impute, evaluating the true IQS with the known alleles, and using the true IQS to train the regression model. The trained regression model can then be applied to estimate the IQS of imputed SNPs in a <it>test set</it>, where a set of SNPs is assumed to have missing genotypes. The scenarios are designed to create different combinations of training and test sets and to show how the regression performance is affected.</p>
<p>To create both training and test sets, we divided the SNPs on the Affymetrix SNP 6.0 array into two sets. One contains the SNPs genotyped on both an old platform and the Affymetrix SNP 6.0 array. This set simulates SNPs with "known" genotypes to be used to impute other SNPs. The other contains the remaining SNPs, which are covered only by the Affymetrix SNP 6.0 array. This set simulates "missing" SNPs to be imputed.</p>
<p>Table <tblr tid="T2">2</tblr> and Table <tblr tid="T3">3</tblr> show our design of training and test sets for four scenarios to evaluate the generalization of the regression model. Scenario 1 is the simplest case, which tests the regression performance when a sample of the same ethnic population and disease phenotype is used for training. We used the WE lung cancer sample to create the training set. The alleles of a randomly picked 10% of the SNPs in the training set were erased and treated as "missing." These "missing" SNPs were then imputed from the other 90% of genotyped SNPs to the full set of SNPs on the Affymetrix 500k array. As a result, 41,304 SNPs of the WE lung cancer sample were used for model training. We also used the WE lung cancer sample to create the test set, which consists of 320,172 SNPs covered only by the Affymetrix SNP 6.0 array. Their genotypes were then imputed from SNPs covered by the Affymetrix mapping 500k array, and our regression model was applied to assess the imputation quality.</p>
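<p>The random hold-out used to build a training set can be illustrated with the short sketch below. It is only a schematic of the masking step: the function name, the fixed random seed and the use of SNP identifiers as plain strings are assumptions, and the actual erasure of alleles and the imputation run are not shown.</p>

```python
import random


def split_known_missing(snp_ids, missing_fraction=0.10, seed=0):
    """Randomly mark a fraction of SNPs as "missing"; the rest stay "known".

    Returns (known, missing), both in the original order of snp_ids.
    """
    snp_ids = list(snp_ids)
    rng = random.Random(seed)
    n_missing = round(len(snp_ids) * missing_fraction)
    held_out = set(rng.sample(snp_ids, n_missing))
    known = [s for s in snp_ids if s not in held_out]
    missing = [s for s in snp_ids if s in held_out]
    return known, missing


# Hypothetical identifiers; 10% of them are held out for imputation.
ids = ["rs%d" % i for i in range(1000)]
known, missing = split_known_missing(ids)
print(len(known), len(missing))
```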
<tbl id="T2"><title><p>Table 2</p></title><caption><p>Summary of training set composition for different evaluation scenarios</p></caption><tblbdy cols="5">
      <r>
         <c ca="center">
            <p>
               <b>Scenarios</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Ethnic population</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Samples</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>from Platform</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>to Platform</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Scenario 1</p>
         </c>
         <c ca="center">
            <p>Western European</p>
         </c>
         <c ca="center">
            <p>Lung cancer</p>
         </c>
         <c ca="center">
            <p>from Affymetrix 500k</p>
         </c>
         <c ca="center">
            <p>to Affymetrix 500k</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Scenario 2</p>
         </c>
         <c ca="center">
            <p>Western European</p>
         </c>
         <c ca="center">
            <p>Lung cancer</p>
         </c>
         <c ca="center">
            <p>from Illumina 550k</p>
         </c>
         <c ca="center">
            <p>to Illumina 550k</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Scenario 3</p>
         </c>
         <c ca="center">
            <p>East Asian</p>
         </c>
         <c ca="center">
            <p>Lung cancer</p>
         </c>
         <c ca="center">
            <p>from Affymetrix 500k</p>
         </c>
         <c ca="center">
            <p>to Affymetrix 500k</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Scenario 4</p>
         </c>
         <c ca="center">
            <p>East Asian</p>
         </c>
         <c ca="center">
            <p>Lung cancer</p>
         </c>
         <c ca="center">
            <p>from Affymetrix 500k</p>
         </c>
         <c ca="center">
            <p>to Affymetrix 500k</p>
         </c>
      </r>
   </tblbdy></tbl>
<tbl id="T3"><title><p>Table 3</p></title><caption><p>Summary of test set composition for different evaluation scenarios</p></caption><tblbdy cols="5">
      <r>
         <c ca="center">
            <p>
               <b>Scenarios</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Ethnic population</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Samples</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>from Platform</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>to Platform</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Scenario 1</p>
         </c>
         <c ca="center">
            <p>Western European</p>
         </c>
         <c ca="center">
            <p>Lung cancer</p>
         </c>
         <c ca="center">
            <p>from Affymetrix 500k</p>
         </c>
         <c ca="center">
            <p>
               <b>to Affymetrix SNP 6.0</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Scenario 2</p>
         </c>
         <c ca="center">
            <p>Western European</p>
         </c>
         <c ca="center">
            <p>Lung cancer</p>
         </c>
         <c ca="center">
            <p>
               <b>from Affymetrix 500k</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>to Affymetrix SNP 6.0</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Scenario 3</p>
         </c>
         <c ca="center">
            <p>
               <b>Western European</b>
            </p>
         </c>
         <c ca="center">
            <p>Lung cancer</p>
         </c>
         <c ca="center">
            <p>from Affymetrix 500k</p>
         </c>
         <c ca="center">
            <p>
               <b>to Affymetrix SNP 6.0</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Scenario 4</p>
         </c>
         <c ca="center">
            <p>East Asian</p>
         </c>
         <c ca="center">
            <p>
               <b>Oral cancer</b>
            </p>
         </c>
         <c ca="center">
            <p>from Affymetrix 500k</p>
         </c>
         <c ca="center">
            <p>
               <b>to Affymetrix SNP 6.0</b>
            </p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p>Highlighted fields are the settings that differ from the training set used in the corresponding scenario.</p>
   </tblfn></tbl>
<p>In Scenario 2, the generalization performance of our IQS regression model was evaluated when it was trained using "known" and "missing" SNPs covered by platforms different from those to be used in testing. We used the WE lung cancer sample again but used the Illumina 550k array instead of the Affymetrix SNP 6.0 array to choose SNPs. There are 41,304 SNPs of the WE lung cancer sample on the Illumina 550k array. After the regression model is constructed, we then used the same test set created in Scenario 1.</p>
<p>In Scenario 3, our IQS regression model is applied to different ethnic populations. We used the EA lung cancer sample to create the training set, resulting in 37,611 SNPs of the EA lung cancer sample on the Affymetrix 500k array. The regression model constructed by the EA lung cancer samples was used to predict the IQS of SNPs of the WE lung cancer samples as in Scenario 1.</p>
<p>Scenario 4 tests whether our regression model generalizes across samples collected for different disease studies. We used the same training set as in Scenario 3 and used the EA Oral Squamous Cell Carcinoma sample as the test set. This test set also simulates imputation from the Affymetrix mapping 500k array to the Affymetrix SNP 6.0 array and consists of 320,172 SNPs.</p>
<p>For all scenarios, we chose the imputation program Beagle. Beagle is based on the Hidden Markov Model (HMM) <abbrgrp>
<abbr bid="B22">22</abbr>
</abbrgrp>. To estimate missing alleles, an EM algorithm is used to fit the parameters of the HMM to a given genotyped reference panel <abbrgrp>
<abbr bid="B2">2</abbr>
<abbr bid="B7">7</abbr>
</abbrgrp>. In terms of imputation accuracy, Beagle performs as well as other imputation programs but is known to be more efficient with regard to running time and memory requirements <abbrgrp>
<abbr bid="B23">23</abbr>
</abbrgrp>.</p>
<p>The 1000 Genomes Project samples (August 2010 release) served as the reference panel. As larger reference panels have become available, researchers have become more confident in combining two studies or extending a specific study across different platforms <abbrgrp>
<abbr bid="B23">23</abbr>
</abbrgrp>. We removed SNPs with an MAF of less than 1%, which usually lead to decreased imputation accuracy <abbrgrp>
<abbr bid="B9">9</abbr>
<abbr bid="B23">23</abbr>
</abbrgrp>. About 2% of SNPs were removed before the imputation. Notably, a few SNPs have genotyped markers that are inconsistent with the reference panel. These few SNPs (&lt; 0.01%) were excluded from the training and test sets in order to focus only on reasonable imputation results.</p>
</sec>
</sec>
<sec>
<st>
<p>Results and discussion</p>
</st>
<p>Table <tblr tid="T4">4</tblr> shows the regression performance of our model for predicting the IQS under different model training and imputation scenarios, and Figure <figr fid="F1">1</figr> shows the corresponding scatter plots. Our regression model achieved mean squared errors below 0.02 and correlation coefficients close to 0.75. The performance is consistent across scenarios, suggesting that the regression model generalizes equally well in each of them. However, Figure <figr fid="F1">1</figr> shows that the regression usually overestimated the IQS, especially for imputations with a low IQS.</p>
<tbl id="T4"><title><p>Table 4</p></title><caption><p>Summary of the IQS regression results for each scenario</p></caption><tblbdy cols="3">
      <r>
         <c ca="center" cspan="3">
            <p>
               <b>IQS regression results</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="3">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>
               <b>Scenario</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Mean Squared Error</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Correlation Coefficient</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="3">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Scenario 1</p>
         </c>
         <c ca="center">
            <p>0.0182</p>
         </c>
         <c ca="center">
            <p>0.740</p>
         </c>
      </r>
      <r>
         <c cspan="3">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Scenario 2</p>
         </c>
         <c ca="center">
            <p>0.0174</p>
         </c>
         <c ca="center">
            <p>0.748</p>
         </c>
      </r>
      <r>
         <c cspan="3">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Scenario 3</p>
         </c>
         <c ca="center">
            <p>0.0178</p>
         </c>
         <c ca="center">
            <p>0.736</p>
         </c>
      </r>
      <r>
         <c cspan="3">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Scenario 4</p>
         </c>
         <c ca="center">
            <p>0.0197</p>
         </c>
         <c ca="center">
            <p>0.751</p>
         </c>
      </r>
   </tblbdy></tbl>
<fig id="F1"><title><p>Figure 1</p></title><caption><p>IQS regression results, (A) Scenario 1, evaluating the regression result on the same platform</p></caption><text>
   <p><b>IQS regression results, </b>(A) Scenario 1, evaluating the regression result on the same platform. (B) Scenario 2, evaluating the regression result on different platforms. (C) Scenario 3, evaluating the regression result on a different ethnic population. (D) Scenario 4, under the same ethnic population, evaluating the regression result on samples from a different disease.</p>
</text><graphic file="1753-6561-6-S7-S3-1"/></fig>
<p>The lowest mean squared error was achieved in Scenario 2, where the regression model was trained with a set of SNPs derived from a platform different from that of the test set, suggesting that training with a wider variety of SNPs might allow the model to generalize better. The highest mean squared error was observed in Scenario 4, where samples from studies of different diseases were tested. Nevertheless, the differences in performance were small.</p>
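<p>The two summary statistics reported in Table <tblr tid="T4">4</tblr> can be computed as in the following sketch. It assumes that the correlation coefficient is Pearson's correlation between the true and the predicted IQS, which is not stated explicitly in the text; the function names are illustrative.</p>

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_a * var_b)


# Toy values only; these are not data from the study.
true_iqs = [0.2, 0.5, 0.9, 1.0]
predicted_iqs = [0.4, 0.6, 0.9, 0.95]
print(mean_squared_error(true_iqs, predicted_iqs))
print(pearson_r(true_iqs, predicted_iqs))
```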
<p>Tables <tblr tid="T5">5</tblr> and <tblr tid="T6">6</tblr> and Figures <figr fid="F2">2</figr> and <figr fid="F3">3</figr> show the regression results for <it>P<sub>o </sub>
</it>and <it>P<sub>c</sub>
</it>, respectively. These results are better than those for the regression of the IQS. The result for <it>P<sub>c </sub>
</it>is particularly good because <it>P<sub>c </sub>
</it>depends only on the marginals. One may speculate that it would be useful to predict <it>P<sub>o </sub>
</it>and <it>P<sub>c </sub>
</it>separately and then combine them to obtain the estimated IQS. We tried this approach, but the results were similar to those of predicting the IQS directly.</p>
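<p>Combining separately predicted agreement values into an IQS estimate can be sketched as below. The sketch assumes the kappa-style definition IQS = (<it>P<sub>o</sub></it> - <it>P<sub>c</sub></it>) / (1 - <it>P<sub>c</sub></it>); this formula is not restated in the present section, so it should be checked against the definition given earlier in the article. The function name is hypothetical.</p>

```python
def iqs_from_agreement(p_o, p_c):
    """Combine observed agreement p_o and chance agreement p_c into an IQS.

    Assumes the kappa-style form (p_o - p_c) / (1 - p_c).
    """
    if p_c >= 1.0:
        raise ValueError("IQS is undefined when the chance agreement is 1")
    return (p_o - p_c) / (1.0 - p_c)


# An observed agreement of 0.75 with a chance agreement of 0.5
# gives an IQS of 0.5 under the assumed definition.
print(iqs_from_agreement(0.75, 0.5))
```

<p>Because the denominator shrinks as <it>P<sub>c </sub></it>approaches 1, small errors in the two predicted quantities can be amplified in the combined estimate, which may be one reason the combined approach did not improve on direct prediction.</p>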
<tbl id="T5"><title><p>Table 5</p></title><caption><p>Summary of the <it>P<sub>o </sub></it>regression results for each scenario</p></caption><tblbdy cols="3">
      <r>
         <c>
            <p/>
         </c>
         <c ca="center" cspan="2">
            <p>
               <b><it>P<sub>o </sub></it>regression results</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="3">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>
               <b>Scenario</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Mean Squared Error</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Correlation Coefficient</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="3">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Scenario 1</p>
         </c>
         <c ca="center">
            <p>0.00248</p>
         </c>
         <c ca="center">
            <p>0.840</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Scenario 2</p>
         </c>
         <c ca="center">
            <p>0.00249</p>
         </c>
         <c ca="center">
            <p>0.838</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Scenario 3</p>
         </c>
         <c ca="center">
            <p>0.00256</p>
         </c>
         <c ca="center">
            <p>0.835</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Scenario 4</p>
         </c>
         <c ca="center">
            <p>0.00301</p>
         </c>
         <c ca="center">
            <p>0.831</p>
         </c>
      </r>
   </tblbdy></tbl>
<tbl id="T6"><title><p>Table 6</p></title><caption><p>Summary of the <it>P<sub>c </sub></it>regression results for each scenario</p></caption><tblbdy cols="3">
      <r>
         <c>
            <p/>
         </c>
         <c ca="left" cspan="2">
            <p>
               <b><it>P<sub>c </sub></it>regression results</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="3">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>
               <b>Scenario</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Mean Squared Error</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Correlation Coefficient</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="3">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Scenario 1</p>
         </c>
         <c ca="center">
            <p>0.00062</p>
         </c>
         <c ca="center">
            <p>0.990</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Scenario 2</p>
         </c>
         <c ca="center">
            <p>0.00072</p>
         </c>
         <c ca="center">
            <p>0.988</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Scenario 3</p>
         </c>
         <c ca="center">
            <p>0.00071</p>
         </c>
         <c ca="center">
            <p>0.989</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Scenario 4</p>
         </c>
         <c ca="center">
            <p>0.00099</p>
         </c>
         <c ca="center">
            <p>0.984</p>
         </c>
      </r>
   </tblbdy></tbl>
<fig id="F2"><title><p>Figure 2</p></title><caption><p><it>P<sub>o </sub></it>regression results, (A) Scenario 1, evaluating the regression result on the same platform</p></caption><text>
   <p><it><b>P</b><sub><b>o</b></sub></it> <b>regression results</b>. (A) Scenario 1, evaluating the regression result on the same platform. (B) Scenario 2, evaluating the regression result on a different platform. (C) Scenario 3, evaluating the regression result on a different ethnic population. (D) Scenario 4, evaluating the regression result on samples with a different disease within the same ethnic population.</p>
</text><graphic file="1753-6561-6-S7-S3-2"/></fig>
<fig id="F3"><title><p>Figure 3</p></title><caption><p><it>P<sub>c </sub></it>regression results, (A) Scenario 1, evaluating the regression result on the same platform</p></caption><text>
   <p><it><b>P</b><sub><b>c</b></sub></it> <b>regression results</b>. (A) Scenario 1, evaluating the regression result on the same platform. (B) Scenario 2, evaluating the regression result on a different platform. (C) Scenario 3, evaluating the regression result on a different ethnic population. (D) Scenario 4, evaluating the regression result on samples with a different disease within the same ethnic population.</p>
</text><graphic file="1753-6561-6-S7-S3-3"/></fig>
<p>We also tested whether the regression results can be used to filter out false positives in a GWAS. Previously, Lin <it>et al</it>. <abbrgrp>
<abbr bid="B9">9</abbr>
</abbrgrp> showed that setting a suitable threshold on the true IQS achieves a better filtering rate than using the imputation accuracy, which is equivalent to <it>P<sub>o</sub></it>. In this test, we assumed that an imputation with a true IQS below a given threshold is a false positive that must be filtered out. Under this assumption, we plotted the Receiver Operating Characteristic (ROC) curve of the regression results against the presumed false positives. The results are presented in Figures <figr fid="F4">4</figr> and <figr fid="F5">5</figr>. The predicted IQS achieves an Area Under Curve (AUC) above 0.96 in all four scenarios when the threshold is set to 0.5, and above 0.80 when the threshold is set to 0.9. As suggested previously <abbrgrp>
<abbr bid="B9">9</abbr>
</abbrgrp>, the imputation accuracy may overestimate the quality of imputation. Figures <figr fid="F4">4</figr> and <figr fid="F5">5</figr> show that the predicted IQS achieves a larger AUC than the predicted imputation accuracy in all four scenarios, suggesting that the predicted IQS filters out presumed false positives more effectively than the predicted imputation accuracy. The curves of the true imputation accuracy are also shown as a reference.</p>
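<p>The ROC analysis described above can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the code used in this study: it assumes that an imputation is a presumed false positive when its true IQS is below the threshold, and that a lower predicted value ranks an imputation as more likely to be filtered out. The function name auc_for_filtering and the toy values are hypothetical.</p>

```python
from typing import Sequence


def auc_for_filtering(true_iqs: Sequence[float],
                      predicted: Sequence[float],
                      threshold: float) -> float:
    """AUC for detecting presumed false positives (true IQS below threshold).

    A lower predicted value is treated as stronger evidence that the
    imputation should be filtered out.
    """
    # Presumed false positives: imputations whose true IQS is below the threshold.
    flagged = [p for t, p in zip(true_iqs, predicted) if threshold > t]
    kept = [p for t, p in zip(true_iqs, predicted) if not threshold > t]
    if not flagged or not kept:
        raise ValueError("both classes must be non-empty")
    # Mann-Whitney form of the AUC: fraction of (flagged, kept) pairs that are
    # ranked correctly, with ties counted as half.
    wins = 0.0
    for f in flagged:
        for k in kept:
            if k > f:
                wins += 1.0
            elif k == f:
                wins += 0.5
    return wins / (len(flagged) * len(kept))


# Toy values, invented for illustration only (not data from the study).
true_iqs = [0.95, 0.40, 0.80, 0.30, 0.60, 0.45]
pred_iqs = [0.90, 0.50, 0.70, 0.35, 0.45, 0.55]
auc = auc_for_filtering(true_iqs, pred_iqs, threshold=0.5)
print(round(auc, 3))
```

<p>The pairwise loop is quadratic in the number of imputations; for data at GWAS scale, a rank-based computation of the same statistic would be preferable.</p>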
<fig id="F4"><title><p>Figure 4</p></title><caption><p>ROC curve at the threshold = 0.5. (A) Scenario 1, AUC(Predicted IQS): 0.9617, AUC(True Imputation Accuracy): 0.9718, and AUC(Predicted Imputation Accuracy): 0.9354. (B) Scenario 2, AUC(Predicted IQS): 0.9739, AUC(True Imputation Accuracy): 0.9783, and AUC(Predicted Imputation Accuracy): 0.9539. (C) Scenario 3, evaluating the regression result on a different ethnic population, AUC(Predicted IQS): 0.9642, AUC(True Imputation Accuracy): 0.9677, and AUC(Predicted Imputation Accuracy): 0.9072. (D) Scenario 4, AUC(Predicted IQS): 0.9656, AUC(True Imputation Accuracy): 0.9758, and AUC(Predicted Imputation Accuracy): 0.9223.</p></caption><text>
   <p>ROC curve at the threshold = 0.5. (A) Scenario 1, AUC(Predicted IQS): 0.9617, AUC(True Imputation Accuracy): 0.9718, and AUC(Predicted Imputation Accuracy): 0.9354. (B) Scenario 2, AUC(Predicted IQS): 0.9739, AUC(True Imputation Accuracy): 0.9783, and AUC(Predicted Imputation Accuracy): 0.9539. (C) Scenario 3, evaluating the regression result on a different ethnic population, AUC(Predicted IQS): 0.9642, AUC(True Imputation Accuracy): 0.9677, and AUC(Predicted Imputation Accuracy): 0.9072. (D) Scenario 4, AUC(Predicted IQS): 0.9656, AUC(True Imputation Accuracy): 0.9758, and AUC(Predicted Imputation Accuracy): 0.9223.</p>
</text><graphic file="1753-6561-6-S7-S3-4"/></fig>
<fig id="F5"><title><p>Figure 5</p></title><caption><p>ROC curve at the threshold = 0.9. (A) Scenario 1, AUC(Predicted IQS): 0.8269, AUC(True Imputation Accuracy): 0.9883, and AUC(Predicted Imputation Accuracy): 0.8041. (B) Scenario 2, AUC(Predicted IQS): 0.8082, AUC(True Imputation Accuracy): 0.9848, and AUC(Predicted Imputation Accuracy): 0.8030. (C) Scenario 3, AUC(Predicted IQS): 0.8230, AUC(True Imputation Accuracy): 0.9892, and AUC(Predicted Imputation Accuracy): 0.7890. (D) Scenario 4, AUC(Predicted IQS): 0.8620, AUC(True Imputation Accuracy): 0.9967, and AUC(Predicted Imputation Accuracy): 0.8399.</p></caption><text>
   <p>ROC curve at the threshold = 0.9. (A) Scenario 1, AUC(Predicted IQS): 0.8269, AUC(True Imputation Accuracy): 0.9883, and AUC(Predicted Imputation Accuracy): 0.8041. (B) Scenario 2, AUC(Predicted IQS): 0.8082, AUC(True Imputation Accuracy): 0.9848, and AUC(Predicted Imputation Accuracy): 0.8030. (C) Scenario 3, AUC(Predicted IQS): 0.8230, AUC(True Imputation Accuracy): 0.9892, and AUC(Predicted Imputation Accuracy): 0.7890. (D) Scenario 4, AUC(Predicted IQS): 0.8620, AUC(True Imputation Accuracy): 0.9967, and AUC(Predicted Imputation Accuracy): 0.8399.</p>
</text><graphic file="1753-6561-6-S7-S3-5"/></fig>
</sec>
<sec>
<st>
<p>Conclusion</p>
</st>
<p>We proposed a <it>&#957;</it>-SVR based approach to estimating the true IQS of imputations of SNPs with unknown true genotypes. We showed that our regression model generalizes comparably well across SNP selections by different platforms and across different ethnic groups and disease populations. The model performed particularly well in predicting the true chance agreement of imputation, <it>P<sub>c</sub></it>, with a correlation coefficient of at least 0.984 in all four scenarios. We also showed that the estimated IQS can be used to filter false positive associations in a GWAS, with an AUC above 0.96 when the true IQS threshold is 0.5 and above 0.80 when it is 0.9. These results suggest that it is feasible to apply a regression model to predict the true IQS.</p>
<p>Our future work includes extending the feature set to improve the regression performance for predicting <it>P<sub>o</sub></it> and the IQS, and pooling a wide variety of data sets, covering different SNPs and populations, as training examples so that one model can be used to estimate the IQS for all imputations. Once the model is sufficiently robust, our long-term goal is to impute all genotype data in repositories of GWAS data to the same size (as large as the most advanced platforms), apply this regression model to attach an estimated IQS to every imputation in addition to the posterior probability from the imputation program, and make the results available in the public domain.</p>
</sec>
<sec>
<st>
<p>Competing interests</p>
</st>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec>
<st>
<p>Authors' contributions</p>
</st>
<p>YHH and CNH developed methods and designed the experiments. YHH and CNH drafted the manuscript. JPR, SFS, JLA, YA and JAT participated in the design of the study. JLA and JAT helped to revise the manuscript. CNH was responsible for all aspects of the project.</p>
</sec>
</bdy><bm>
<ack>
<sec>
<st>
<p>Acknowledgements</p>
</st>
<p>This work was supported in part by NIMH/NIH Grant Number MH068457 (CGSMD).</p>
<p>This article has been published as part of <it>BMC Proceedings</it> Volume 6 Supplement 7, 2012: Proceedings from the Great Lakes Bioinformatics Conference 2012. The full contents of the supplement are available online at <url>http://www.biomedcentral.com/bmcproc/supplements/6/S7</url>.</p>
</sec>
</ack>
<refgrp><bibl id="B1"><title><p>A flexible and accurate genotype imputation method for the next generation of genome-wide association studies</p></title><aug><au><snm>Howie</snm><fnm>BN</fnm></au><au><snm>Donnelly</snm><fnm>P</fnm></au><au><snm>Marchini</snm><fnm>J</fnm></au></aug><source>PLoS Genet</source><pubdate>2009</pubdate><volume>5</volume><issue>6</issue><fpage>e1000529</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1371/journal.pgen.1000529</pubid><pubid idtype="pmcid">2689936</pubid><pubid idtype="pmpid" link="fulltext">19543373</pubid></pubidlist></xrefbib></bibl><bibl id="B2"><title><p>Missing data imputation and haplotype phase inference for genome-wide association studies</p></title><aug><au><snm>Browning</snm><fnm>SR</fnm></au></aug><source>Human Genetics</source><pubdate>2008</pubdate><volume>124</volume><issue>5</issue><fpage>439</fpage><lpage>450</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1007/s00439-008-0568-7</pubid><pubid idtype="pmcid">2731769</pubid><pubid idtype="pmpid" link="fulltext">18850115</pubid></pubidlist></xrefbib></bibl><bibl id="B3"><title><p>Simultaneous genotype calling and haplotype phasing improves genotype accuracy and reduces false-positive associations for genome-wide association studies</p></title><aug><au><snm>Browning</snm><fnm>BL</fnm></au><au><snm>Yu</snm><fnm>Z</fnm></au></aug><source>The American Journal of Human Genetics</source><pubdate>2009</pubdate><volume>85</volume><issue>6</issue><fpage>847</fpage><lpage>861</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.ajhg.2009.11.004</pubid><pubid idtype="pmcid">2790566</pubid><pubid idtype="pmpid" link="fulltext">19931040</pubid></pubidlist></xrefbib></bibl><bibl id="B4"><title><p>A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals</p></title><aug><au><snm>Browning</snm><fnm>BL</fnm></au><au><snm>Browning</snm><fnm>SR</fnm></au></aug><source>The American Journal of Human 
Genetics</source><pubdate>2009</pubdate><volume>84</volume><issue>2</issue><fpage>210</fpage><lpage>223</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.ajhg.2009.01.005</pubid><pubid idtype="pmcid">2668004</pubid><pubid idtype="pmpid" link="fulltext">19200528</pubid></pubidlist></xrefbib></bibl><bibl id="B5"><title><p>Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering</p></title><aug><au><snm>Browning</snm><fnm>SR</fnm></au><au><snm>Browning</snm><fnm>BL</fnm></au></aug><source>The American Journal of Human Genetics</source><pubdate>2007</pubdate><volume>81</volume><issue>5</issue><fpage>1084</fpage><lpage>1097</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1086/521987</pubid><pubid idtype="pmcid">2265661</pubid><pubid idtype="pmpid" link="fulltext">17924348</pubid></pubidlist></xrefbib></bibl><bibl id="B6"><title><p>High-resolution detection of identity by descent in individuals</p></title><aug><au><snm>Browning</snm><fnm>SR</fnm></au><au><snm>Browning</snm><fnm>BL</fnm></au></aug><source>The American Journal of Human Genetics</source><pubdate>2010</pubdate><volume>86</volume><issue>4</issue><fpage>526</fpage><lpage>539</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.ajhg.2010.02.021</pubid><pubid idtype="pmcid">2850444</pubid><pubid idtype="pmpid" link="fulltext">20303063</pubid></pubidlist></xrefbib></bibl><bibl id="B7"><title><p>Efficient multilocus association testing for whole genome association studies using localized haplotype clustering</p></title><aug><au><snm>Browning</snm><fnm>BL</fnm></au><au><snm>Browning</snm><fnm>SR</fnm></au></aug><source>Genetic Epidemiology</source><pubdate>2007</pubdate><volume>31</volume><issue>5</issue><fpage>365</fpage><lpage>375</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1002/gepi.20216</pubid><pubid idtype="pmpid" link="fulltext">17326099</pubid></pubidlist></xrefbib></bibl><bibl id="B8"><title><p>Mach 1.0: 
rapid haplotype reconstruction and missing genotype inference</p></title><aug><au><snm>Li</snm><fnm>Y</fnm></au><au><snm>Abecasis</snm><fnm>GR</fnm></au></aug><source>American Journal of Human Genetics</source><pubdate>2006</pubdate><volume>S79</volume><issue>S79</issue><fpage>2290</fpage></bibl><bibl id="B9"><title><p>A new statistic to evaluate imputation reliability</p></title><aug><au><snm>Lin</snm><fnm>P</fnm></au><au><snm>Hartz</snm><fnm>SM</fnm></au><au><snm>Zhang</snm><fnm>Z</fnm></au><au><snm>Saccone</snm><fnm>SF</fnm></au><au><snm>Wang</snm><fnm>J</fnm></au><au><snm>Tischfield</snm><fnm>JA</fnm></au><au><snm>Edenberg</snm><fnm>HJ</fnm></au><au><snm>Kramer</snm><fnm>JR</fnm></au><au><snm>Goate</snm><fnm>AM</fnm></au><au><snm>Bierut</snm><fnm>LJ</fnm></au><au><snm>Rice</snm><fnm>JP</fnm></au><au><cnm>for the COGA Collaborators COGEND Collaborators G</cnm></au></aug><source>PLoS ONE</source><pubdate>2010</pubdate><volume>5</volume><fpage>e9697</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1371/journal.pone.0009697</pubid><pubid idtype="pmcid">2837741</pubid><pubid idtype="pmpid" link="fulltext">20300623</pubid></pubidlist></xrefbib></bibl><bibl id="B10"><title><p>A coefficient of agreement for nominal scales</p></title><aug><au><snm>Cohen</snm><fnm>J</fnm></au></aug><source>Educational and Psychological Measurement</source><pubdate>1960</pubdate><volume>20</volume><fpage>37</fpage><lpage>46</lpage><xrefbib><pubid idtype="doi">10.1177/001316446002000104</pubid></xrefbib></bibl><bibl id="B11"><title><p>A tutorial on support vector regression</p></title><aug><au><snm>Smola</snm><fnm>AJ</fnm></au><au><snm>Sch&#246;lkopf</snm><fnm>B</fnm></au></aug><source>Statistics and Computing</source><pubdate>2004</pubdate><volume>14</volume><issue>3</issue><fpage>199</fpage><lpage>222</lpage></bibl><bibl id="B12"><title><p>Support-vector 
networks</p></title><aug><au><snm>Cortes</snm><fnm>C</fnm></au><au><snm>Vapnik</snm><fnm>V</fnm></au></aug><pubdate>1995</pubdate><volume>20</volume><issue>3</issue><fpage>273</fpage><lpage>297</lpage></bibl><bibl id="B13"><title><p>A tutorial on <it>&#957;</it>-support vector machines</p></title><aug><au><snm>Chen</snm><fnm>P</fnm></au><au><snm>Lin</snm><fnm>CJ</fnm></au><au><snm>Sch&#246;lkopf</snm><fnm>B</fnm></au></aug><pubdate>2003</pubdate></bibl><bibl id="B14"><title><p>Training <it>&#957;</it>-support vector regression: theory and algorithms</p></title><aug><au><snm>Chang</snm><fnm>CC</fnm></au><au><snm>Lin</snm><fnm>CJ</fnm></au></aug><xrefbib><pubid idtype="pmpid" link="fulltext">12180409</pubid></xrefbib></bibl><bibl id="B15"><title><p>New support vector algorithms</p></title><aug><au><snm>Sch&#246;lkopf</snm><fnm>B</fnm></au><au><snm>Smola</snm><fnm>AJ</fnm></au><au><snm>Williamson</snm><fnm>RC</fnm></au><au><snm>Bartlett</snm><fnm>PL</fnm></au></aug><pubdate>2000</pubdate><xrefbib><pubid idtype="pmpid">10905814</pubid></xrefbib></bibl><bibl id="B16"><title><p>LIBSVM: a library for support vector machines</p></title><aug><au><snm>Chang</snm><fnm>CC</fnm></au><au><snm>Lin</snm><fnm>CJ</fnm></au></aug><source>ACM Transactions on Intelligent Systems and Technology</source><pubdate>2011</pubdate><volume>2</volume><fpage>1</fpage><lpage>27</lpage></bibl><bibl id="B17"><title><p>The International HapMap Project</p></title><aug><au><cnm>The International HapMap Consortium</cnm></au></aug><source>Nature</source><pubdate>2003</pubdate><volume>426</volume><fpage>789</fpage><lpage>796</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nature02168</pubid><pubid idtype="pmpid" link="fulltext">14685227</pubid></pubidlist></xrefbib></bibl><bibl id="B18"><title><p>Human recombination hotspots: before and after the HapMap 
Project</p></title><aug><au><snm>May</snm><fnm>C</fnm></au><au><snm>Slingsby</snm><fnm>M</fnm></au><au><snm>Jeffreys</snm><fnm>A</fnm></au></aug><pubdate>2008</pubdate><volume>2</volume><fpage>195</fpage><lpage>244</lpage></bibl><bibl id="B19"><title><p>Prediction of clinical outcome in multiple lung cancer cohorts by integrative genomics: implications for chemotherapy selection</p></title><aug><au><snm>Broet</snm><fnm>P</fnm></au><au><snm>Camilleri-Broet</snm><fnm>S</fnm></au><au><snm>Zhang</snm><fnm>S</fnm></au><au><snm>Alifano</snm><fnm>M</fnm></au><au><snm>Bangarusamy</snm><fnm>D</fnm></au><au><snm>Battistella</snm><fnm>M</fnm></au><au><snm>Wu</snm><fnm>Y</fnm></au><au><snm>Tuefferd</snm><fnm>M</fnm></au><au><snm>Regnard</snm><fnm>JF</fnm></au><au><snm>Lim</snm><fnm>E</fnm></au><au><snm>Tan</snm><fnm>P</fnm></au><au><snm>Miller</snm><fnm>LD</fnm></au></aug><source>Cancer Res</source><pubdate>2009</pubdate><volume>69</volume><issue>3</issue><fpage>1055</fpage><lpage>1062</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1158/0008-5472.CAN-08-1116</pubid><pubid idtype="pmpid" link="fulltext">19176396</pubid></pubidlist></xrefbib></bibl><bibl id="B20"><title><p>A novel molecular signature identified by systems genetics approach predicts prognosis in oral squamous cell carcinoma</p></title><aug><au><snm>Peng</snm><fnm>CH</fnm></au><au><snm>Liao</snm><fnm>CT</fnm></au><au><snm>Peng</snm><fnm>SC</fnm></au><au><snm>Chen</snm><fnm>YJ</fnm></au><au><snm>Cheng</snm><fnm>AJ</fnm></au><au><snm>Juang</snm><fnm>JL</fnm></au><au><snm>Tsai</snm><fnm>CY</fnm></au><au><snm>Chen</snm><fnm>TC</fnm></au><au><snm>Chuang</snm><fnm>YJ</fnm></au><au><snm>Tang</snm><fnm>CY</fnm></au><au><snm>Hsieh</snm><fnm>WP</fnm></au><au><snm>Yen</snm><fnm>TC</fnm></au></aug><source>PLoS ONE</source><pubdate>2011</pubdate><volume>6</volume><issue>8</issue><fpage>e23452</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1371/journal.pone.0023452</pubid><pubid idtype="pmcid">3154947</pubid><pubid 
idtype="pmpid" link="fulltext">21853135</pubid></pubidlist></xrefbib></bibl><bibl id="B21"><title><p>NCBI GEO: archive for functional genomics data sets 10 years on</p></title><aug><au><snm>Barrett</snm><fnm>T</fnm></au><au><snm>Troup</snm><fnm>DB</fnm></au><au><snm>Wilhite</snm><fnm>SE</fnm></au><au><snm>Ledoux</snm><fnm>P</fnm></au><au><snm>Evangelista</snm><fnm>C</fnm></au><au><snm>Kim</snm><fnm>IF</fnm></au><au><snm>Tomashevsky</snm><fnm>M</fnm></au><au><snm>Marshall</snm><fnm>KA</fnm></au><au><snm>Phillippy</snm><fnm>KH</fnm></au><au><snm>Sherman</snm><fnm>PM</fnm></au><au><snm>Muertter</snm><fnm>RN</fnm></au><au><snm>Holko</snm><fnm>M</fnm></au><au><snm>Ayanbule</snm><fnm>O</fnm></au><au><snm>Yefanov</snm><fnm>A</fnm></au><au><snm>Soboleva</snm><fnm>A</fnm></au></aug><source>Nucleic Acids Research</source><pubdate>2011</pubdate><volume>39</volume><issue>suppl 1</issue><fpage>D1005</fpage><lpage>D1010</lpage><xrefbib><pubidlist><pubid idtype="pmcid">3013736</pubid><pubid idtype="pmpid" link="fulltext">21097893</pubid></pubidlist></xrefbib></bibl><bibl id="B22"><title><p>Statistical inference for probabilistic functions of finite state Markov Chains</p></title><aug><au><snm>Baum</snm><fnm>LE</fnm></au><au><snm>Petrie</snm><fnm>T</fnm></au></aug><source>The Annals of Mathematical Statistics</source><pubdate>1966</pubdate><volume>37</volume><issue>6</issue><fpage>1554</fpage><lpage>1563</lpage><xrefbib><pubid idtype="doi">10.1214/aoms/1177699147</pubid></xrefbib></bibl><bibl id="B23"><title><p>A comprehensive evaluation of SNP genotype imputation</p></title><aug><au><snm>Nothnagel</snm><fnm>M</fnm></au><au><snm>Ellinghaus</snm><fnm>D</fnm></au><au><snm>Schreiber</snm><fnm>S</fnm></au><au><snm>Krawczak</snm><fnm>M</fnm></au><au><snm>Franke</snm><fnm>A</fnm></au></aug><source>Human Genetics</source><pubdate>2009</pubdate><volume>125</volume><issue>2</issue><fpage>163</fpage><lpage>171</lpage><xrefbib><pubidlist><pubid 
idtype="doi">10.1007/s00439-008-0606-5</pubid><pubid idtype="pmpid" link="fulltext">19089453</pubid></pubidlist></xrefbib></bibl></refgrp>
</bm></art>