<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art><ui>1755-8794-5-24</ui><ji>1755-8794</ji><fm><dochead>Research article</dochead><bibl><title><p>Hybridization and amplification rate correction for affymetrix SNP arrays</p></title><aug><au id="A1"><snm>Wang</snm><fnm>Quan</fnm><insr iid="I1"/><email>wangquan@ctb.pku.edu.cn</email></au><au id="A2"><snm>Peng</snm><fnm>Peichao</fnm><insr iid="I2"/><email>peichao1128@gmail.com</email></au><au id="A3"><snm>Qian</snm><fnm>Minping</fnm><insr iid="I1"/><insr iid="I2"/><email>qianmp@math.pku.edu.cn</email></au><au id="A4" ca="yes"><snm>Wan</snm><fnm>Lin</fnm><insr iid="I3"/><insr iid="I4"/><email>linwan@usc.edu</email></au><au id="A5" ca="yes"><snm>Deng</snm><fnm>Minghua</fnm><insr iid="I1"/><insr iid="I2"/><insr iid="I5"/><email>dengmh@math.pku.edu.cn</email></au></aug><insg><ins id="I1"><p>Center for Theoretical Biology, Peking University, Beijing, 100871, People's Republic of China</p></ins><ins id="I2"><p>LMAM, School of Mathematical Sciences, Peking University, Beijing, 100871, People's Republic of China</p></ins><ins id="I3"><p>Molecular and Computational Biology Program, University of Southern California, Los Angeles, CA, USA</p></ins><ins id="I4"><p>National Center for Mathematics and Interdisciplinary Sciences, and the Key Laboratory of Systems and Control, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, People's Republic of China</p></ins><ins id="I5"><p>Center for Statistical Science, Peking University, Beijing, 100871, People's Republic of China</p></ins></insg><source>BMC Medical Genomics</source><issn>1755-8794</issn><pubdate>2012</pubdate><volume>5</volume><issue>1</issue><fpage>24</fpage><url>http://www.biomedcentral.com/1755-8794/5/24</url><xrefbib><pubidlist><pubid idtype="doi">10.1186/1755-8794-5-24</pubid><pubid 
idtype="pmpid">22691279</pubid></pubidlist></xrefbib></bibl><history><rec><date><day>21</day><month>2</month><year>2012</year></date></rec><acc><date><day>12</day><month>6</month><year>2012</year></date></acc><pub><date><day>12</day><month>6</month><year>2012</year></date></pub></history><cpyrt><year>2012</year><collab>Wang et al.; licensee BioMed Central Ltd.</collab><note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note></cpyrt><kwdg><kwd>SNP array</kwd><kwd>Copy number variation (CNV)</kwd><kwd>Cross-hybridization</kwd><kwd>Genomic waves</kwd></kwdg><abs><sec><st><p>Abstract</p></st><sec><st><p>Background</p></st><p>Copy number variation (CNV) is essential to understand the pathology of many complex diseases at the DNA level. Affymetrix SNP arrays, which are widely used for CNV studies, significantly depend on accurate copy number (CN) estimation. Nevertheless, CN estimation may be biased by several factors, including cross-hybridization and training sample batch, as well as genomic waves of intensities induced by sequence-dependent hybridization rate and amplification efficiency. Since many available algorithms only address one or two of the three factors, a high false discovery rate (FDR) often results when identifying CNV. Therefore, we have developed a new CNV detection pipeline which is based on hybridization and amplification rate correction (CNVhac).</p></sec><sec><st><p>Methods</p></st><p>CNVhac first estimates the allelic concentrations (ACs) of target sequences by using the sample independent parameters trained through physicochemical hybridization law. Then the raw CN is estimated by taking the ratio of AC to the corresponding average AC from a reference sample set for one specific site. 
Finally, a hidden Markov model (HMM) segmentation process is implemented to detect CNV regions.</p></sec><sec><st><p>Results</p></st><p>Based on public HapMap data, the results show that CNVhac effectively smooths the genomic waves and yields more accurate raw CN estimates than other methods. Moreover, CNVhac alleviates, to a certain extent, the sample dependence of inference and calls CNVs with appreciably low FDRs.</p></sec><sec><st><p>Conclusion</p></st><p>CNVhac is an effective approach to address the common difficulties in SNP array analysis, and the working principles of CNVhac can be easily extended to other platforms.</p></sec></sec></abs></fm><bdy><sec><st><p>Background</p></st><p>Copy number variations (CNVs) play an essential role in human disease susceptibility <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp> and have been shown to be one potential source of missing heritability of complex diseases <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. Together with genome-wide association studies (GWAS), CNVs are predicted to be compelling in deciphering the pathology of human diseases <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. SNP arrays have been widely used for CNV studies, and a tremendous amount of data has been generated <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp>. Although high throughput sequencing technologies are emerging and have been applied to genetic variation (including CNV) studies, the cost of a sequencing-based approach is still higher than that of traditional SNP arrays, especially in library construction <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. In addition, various studies have shown that sequencing data are not sensitive in breakpoint detection <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr></abbrgrp>. 
Moreover, sequencing technologies have poor mutation detection capability when the sequencing coverage (read depth) is relatively low <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. Thus, we believe that sequencing technologies, at their current stage of development, complement rather than replace SNP arrays. Therefore, in this article, we aim to develop a new and more accurate CNV detection pipeline that avoids the common difficulties in SNP array analysis.</p><p>High-quality CNV calling requires accurate estimation of raw copy numbers and optimized statistical models <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. Although many methods have been developed for CNV calling from array-based data <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr></abbrgrp>, their accuracy is still far from satisfactory, as reflected in high false discovery rates (FDRs) <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr></abbrgrp>. The high FDRs of these methods mainly arise from (1) cross-hybridization of probes <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>, (2) genomic waves of intensities <abbrgrp><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr></abbrgrp> and (3) sample dependence of outputs <abbrgrp><abbr bid="B24">24</abbr><abbr bid="B25">25</abbr><abbr bid="B26">26</abbr></abbrgrp>.</p><p>Cross-hybridization between probes and off-target sequences is a longstanding problem in microarray analysis <abbrgrp><abbr bid="B27">27</abbr><abbr bid="B28">28</abbr><abbr bid="B29">29</abbr><abbr bid="B30">30</abbr></abbrgrp>. Nevertheless, most previous methods have typically ignored cross-hybridization and focused on taking mean or median intensities of probes as the estimated raw CNs <abbrgrp><abbr bid="B15">15</abbr><abbr bid="B31">31</abbr></abbrgrp>. 
However, such estimated CNs hardly reflect the true allelic concentrations (ACs) of target sequences, and some studies <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B20">20</abbr></abbrgrp> have shown that cross-hybridization, if not considered, can lead to large bias. To circumvent this problem, one prior investigation used PICR (probe intensity composite representation) to model the hybridization and cross-hybridization based on the underlying physicochemical principle of DNA/DNA duplex formation in array experiments, and then removed the effect of cross-hybridization and accurately estimated AC at a given SNP site through a statistical method <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. Other similar models were also reported <abbrgrp><abbr bid="B28">28</abbr><abbr bid="B32">32</abbr></abbrgrp>.</p><p>In addition to cross-hybridization, Maris et al. have stated that &#8220;whole-genome microarrays with large-insert clones designed to determine DNA copy number often show variation in hybridization intensity that is related to the genomic position of the clones.&#8221; <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> These &#8216;genomic waves&#8217; have been observed in SNP arrays <abbrgrp><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr></abbrgrp>. Genomic waves are shown to be correlated with GC-content <abbrgrp><abbr bid="B21">21</abbr><abbr bid="B23">23</abbr></abbrgrp> and may stem from the amplification of DNA fragments <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>. In the preprocessing of arrays, DNA samples are first digested with restriction enzymes, such as Nsp, and then ligated with adapters before amplification. However, owing to differences in amplification efficiencies of fragments, the PCR procedure can bring in artifacts which may give rise to genomic waves <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>. 
The presence of these waves hampers detection of aberrations <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> and introduces hundreds of potentially confounding CNV artifacts that can obscure bona fide variants <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>. To address this difficulty, a computational approach that fits regression models with GC-content as a predictor variable was proposed in <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>, and this approach has improved the accuracy of CNV detection.</p><p>Finally, it has long been known that different sample batches can lead to inconsistent results, even if data are collected by the same lab <abbrgrp><abbr bid="B24">24</abbr><abbr bid="B25">25</abbr><abbr bid="B26">26</abbr></abbrgrp>. Owing to this effect, statistical power in meta-analysis of multiple samples may be significantly reduced <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>. Almost all existing algorithms require multiple samples for training because of the numerous parameters, while different training sample batches can lead to different parameter estimates. The inconsistencies may be caused by this sample-dependent parameter estimation. The effect has also been shown to be correlated with differences in batch sizes and the extent of homogeneity of samples in each batch. Hence, it has been suggested that highly homogeneous samples be placed in the same training batch <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. Several other methods to adjust for this batch effect have also been proposed <abbrgrp><abbr bid="B25">25</abbr><abbr bid="B35">35</abbr><abbr bid="B36">36</abbr></abbrgrp>.</p><p>To the best of our knowledge, existing methods only address one or two of the three factors discussed above. In this study, we developed a novel CNV detection pipeline based on hybridization and amplification rate correction (CNVhac<sup>a</sup>) to accurately detect CNVs on Affymetrix SNP arrays. 
In contrast to previous methods, CNVhac takes into account all three factors by proper modeling of cross-hybridization, smoothing genomic waves and alleviating sample batch dependence of parameter estimation, thus significantly improving the accuracy of CNV detection. Starting from dozens of basic constants concerning binding affinity, which can be well trained from one single array and are quite stable between arrays, CNVhac is able to get the binding affinity between all probes and sequences without suffering from sample batch dependence. Then CNVhac applies the PICR method <abbrgrp><abbr bid="B20">20</abbr></abbrgrp> to address the effect of cross-hybridization. Finally, since we have found that the relative amplification efficiencies between different fragments are fairly stable from one array to another, a simple adjustment approach is proposed to smooth the genomic waves. Based on the accurate raw CN estimates, a hidden Markov model (HMM) is also proposed to detect breakpoints along the genome. The implementation of CNVhac with public datasets shows that our method does enhance the power of both raw CN estimation and CNV calling.</p></sec><sec><st><p>Methods</p></st><sec><st><p>Dataset</p></st><p>Dataset I. &#8216;The International HapMap project&#8217; <abbrgrp><abbr bid="B37">37</abbr></abbrgrp> mapped 270 samples (30 YRI trios, 30 CEU trios, 45 CHB and 45 JPT individuals) to Affymetrix SNP 6.0 array to identify and catalog genetic similarities and variants in human beings. The raw SNP 6.0 dataset (<url>http://www.affymetrix.com/support/technical/sample_data/genomewide_snp6_data.affx</url>) is applied in this paper.</p><p>Dataset II. Conrad et al. recently used the ultra-high-resolution NimbleGen tiling arrays (42&#8201;M probes) to identify CNVs for HapMap samples <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>. The identified CNVs were then filtered by two other technologies (Agilent and Illumina). 
Finally, over 5000 regions that were cross-platform verified as CNV in at least one of the HapMap individuals of dataset I were selected <abbrgrp><abbr bid="B38">38</abbr></abbrgrp> and used as the benchmark in this article to assess the power of CNV calling in comparison with other algorithms. We did not perform any experiments ourselves; both dataset I and dataset II were downloaded from public databases. Therefore, this study raises no ethical approval issues.</p></sec><sec><st><p>Estimation of raw CNs</p></st><p>The problems usually confronted in the estimation of raw CNs are discussed in the Background section. Array intensities depend not only on the ACs of target sequences but also on probe binding affinities. Based on <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>, we model hybridization and cross-hybridization with dozens of probe-independent parameters, which can be accurately estimated from a single array and are consistent between arrays <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>. A further simple adjustment is proposed to calibrate the various amplification efficiencies.</p><sec><st><p>Modeling hybridization and cross-hybridization</p></st><p>For one probe in a given SNP probeset, the basic model is <abbrgrp><abbr bid="B39">39</abbr><abbr bid="B40">40</abbr></abbrgrp>:</p><p><display-formula id="M1"><m:math name="1755-8794-5-24-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:mi>I</m:mi>
   <m:mo>=</m:mo>
   <m:msub>
      <m:mi>I</m:mi>
      <m:mi>s</m:mi>
   </m:msub>
   <m:mo>+</m:mo>
   <m:msub>
      <m:mi>I</m:mi>
      <m:mi mathvariant="italic">bg</m:mi>
   </m:msub>
   <m:mo>+</m:mo>
   <m:mi>&#949;</m:mi>
   <m:mtext>,</m:mtext>
</m:mrow>
</m:math></display-formula></p><p>where <it>I</it>, <it>I</it><sub><it>s</it></sub> and <it>I</it><sub><it>bg</it></sub> stand respectively for probe intensity, specific hybridization intensity caused by target sequences and background nonspecific binding intensity, and <it>&#949;</it> is the measurement error. <it>I</it><sub><it>s</it></sub> has been further modeled by a Langmuir-like adsorption principle, and Equation (1) can be rewritten as:</p><p><display-formula id="M2"><m:math name="1755-8794-5-24-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:mi>I</m:mi>
   <m:mo>=</m:mo>
   <m:msub>
      <m:mi>I</m:mi>
      <m:mi>s</m:mi>
   </m:msub>
   <m:mo>+</m:mo>
   <m:msub>
      <m:mi>I</m:mi>
      <m:mi mathvariant="italic">bg</m:mi>
   </m:msub>
   <m:mo>+</m:mo>
   <m:mi>&#949;</m:mi>
   <m:mo>=</m:mo>
   <m:mfrac>
      <m:mi>N</m:mi>
      <m:mrow>
         <m:mn>1</m:mn>
         <m:mo>+</m:mo>
         <m:mo>exp</m:mo>
         <m:mfenced open="(" close=")">
            <m:mi>E</m:mi>
         </m:mfenced>
      </m:mrow>
   </m:mfrac>
   <m:mo>+</m:mo>
   <m:msub>
      <m:mi>I</m:mi>
      <m:mi mathvariant="italic">bg</m:mi>
   </m:msub>
   <m:mo>+</m:mo>
   <m:mi>&#949;</m:mi>
   <m:mtext>,</m:mtext>
</m:mrow>
</m:math></display-formula></p><p>where <it>N</it> is AC of the target sequences, and <it>E</it> denotes specific binding free energy which can be modeled by position-dependent nearest-neighbor (PDNN) <abbrgrp><abbr bid="B39">39</abbr><abbr bid="B40">40</abbr></abbrgrp>:</p><p><display-formula id="M3"><m:math name="1755-8794-5-24-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:mi>E</m:mi>
   <m:mo>=</m:mo>
   <m:munderover>
      <m:mo>&#8721;</m:mo>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mo>=</m:mo>
         <m:mn>1</m:mn>
      </m:mrow>
      <m:mn>24</m:mn>
   </m:munderover>
   <m:msub>
      <m:mi>&#969;</m:mi>
      <m:mi>i</m:mi>
   </m:msub>
   <m:mi>&#955;</m:mi>
   <m:mspace width="0.12em"/>
   <m:mfenced open="(" close=")">
      <m:mrow>
         <m:msub>
            <m:mi>b</m:mi>
            <m:mi>i</m:mi>
         </m:msub>
         <m:mtext>,</m:mtext>
         <m:mspace width="0.12em"/>
         <m:msub>
            <m:mi>b</m:mi>
            <m:mrow>
               <m:mi>i</m:mi>
               <m:mo>+</m:mo>
               <m:mn>1</m:mn>
            </m:mrow>
         </m:msub>
      </m:mrow>
   </m:mfenced>
   <m:mtext>,</m:mtext>
</m:mrow>
</m:math></display-formula></p><p>where <it>&#969;</it><sub><it>i</it></sub> is a weight factor which is dependent on the position of consecutive bases along the oligonucleotides, <it>b</it><sub><it>i</it></sub> is the <it>i</it>-th nucleotide of the probe sequence, and &#955; is the stacking energy of the pair of nearest-neighbors along the probe. With &#955;(<it>b</it><sub><it>i</it></sub>, <it>b</it><sub><it>i</it> + 1</sub>) and <it>&#969;</it><sub><it>i</it></sub> known as basic constants which hardly change between arrays <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>, <it>N</it> can be easily estimated by regression.</p><p>However, the model ignores cross-hybridization. A given polymorphic locus has two alleles (allele A and allele B) in the genome. Because the two alleles are highly similar in sequence, each allele has a high probability of binding to the probe designed to interrogate the other allele. This cross-hybridization may bias the estimated AC of target sequences (see <abbrgrp><abbr bid="B20">20</abbr></abbrgrp> and Additional file 1). Therefore, we go one step further to improve the model by assuming that <it>I</it><sub><it>s</it></sub> is the sum of <it>I</it><sub><it>sA</it></sub> and <it>I</it><sub><it>sB</it></sub>, the contributions of the allele A and allele B target sequences, respectively, to the probe intensity. Both <it>I</it><sub><it>sA</it></sub> and <it>I</it><sub><it>sB</it></sub> can be modeled by Equation (2); thus our proposed model is</p><p><display-formula id="M4"><m:math name="1755-8794-5-24-i4" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:mi>I</m:mi>
   <m:mo>=</m:mo>
   <m:mfrac>
      <m:msub>
         <m:mi>N</m:mi>
         <m:mi>A</m:mi>
      </m:msub>
      <m:mrow>
         <m:mn>1</m:mn>
         <m:mo>+</m:mo>
         <m:mo>exp</m:mo>
         <m:mfenced open="(" close=")">
            <m:msub>
               <m:mi>E</m:mi>
               <m:mi>A</m:mi>
            </m:msub>
         </m:mfenced>
      </m:mrow>
   </m:mfrac>
   <m:mo>+</m:mo>
   <m:mfrac>
      <m:msub>
         <m:mi>N</m:mi>
         <m:mi>B</m:mi>
      </m:msub>
      <m:mrow>
         <m:mn>1</m:mn>
         <m:mo>+</m:mo>
         <m:mo>exp</m:mo>
         <m:mfenced open="(" close=")">
            <m:msub>
               <m:mi>E</m:mi>
               <m:mi>B</m:mi>
            </m:msub>
         </m:mfenced>
      </m:mrow>
   </m:mfrac>
   <m:mo>+</m:mo>
   <m:msub>
      <m:mi>I</m:mi>
      <m:mi mathvariant="italic">bg</m:mi>
   </m:msub>
   <m:mo>+</m:mo>
   <m:mi>&#949;</m:mi>
   <m:mtext>,</m:mtext>
</m:mrow>
</m:math></display-formula></p><p>where <it>N</it><sub><it>A</it></sub> and <it>N</it><sub><it>B</it></sub> are ACs for allele A and B, respectively, and <it>E</it><sub><it>A</it></sub> and <it>E</it><sub><it>B</it></sub> denote binding free energy. With quite a few probes in one probeset, the ordinary least squares (OLS) method yields unbiased estimates of <it>N</it><sub><it>A</it></sub> and <it>N</it><sub><it>B</it></sub>. The summation of <it>N</it><sub><it>A</it></sub> and <it>N</it><sub><it>B</it></sub> gives the total concentration <it>N</it> (See <abbrgrp><abbr bid="B20">20</abbr></abbrgrp> and Additional file 1). For the nonpolymorphic probe with only one allele, <it>N</it> can be straightforwardly obtained from Equation (2).</p></sec><sec><st><p>Normalization between arrays</p></st><p>In order to eliminate the systematic bias between arrays which may arise from the different library preparation conditions of the experimental process, we use the following transformation:</p><p><display-formula id="M5"><m:math name="1755-8794-5-24-i5" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msubsup>
      <m:mi>N</m:mi>
      <m:mi mathvariant="italic">mk</m:mi>
      <m:mo>'</m:mo>
   </m:msubsup>
   <m:mo>=</m:mo>
   <m:msub>
      <m:mi>N</m:mi>
      <m:mi mathvariant="italic">mk</m:mi>
   </m:msub>
   <m:mo>&#183;</m:mo>
   <m:msub>
      <m:mi>&#945;</m:mi>
      <m:mi>m</m:mi>
   </m:msub>
   <m:mtext>,</m:mtext>
</m:mrow>
</m:math></display-formula></p><p>where <it>N</it><sub><it>mk</it></sub> is the total concentration for array <it>m</it> at locus <it>k</it>, and <inline-formula><m:math name="1755-8794-5-24-i6" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msub>
      <m:mi>&#945;</m:mi>
      <m:mi>m</m:mi>
   </m:msub>
   <m:mo>=</m:mo>
   <m:mspace width="0.25em"/>
   <m:mn>2</m:mn>
   <m:mo>/</m:mo>
   <m:mi>median</m:mi>
   <m:mfenced open="(" close=")">
      <m:mrow>
         <m:msub>
            <m:mi>N</m:mi>
            <m:mi mathvariant="italic">mk</m:mi>
         </m:msub>
         <m:mtext>,</m:mtext>
         <m:mi>k</m:mi>
         <m:mo>=</m:mo>
         <m:mspace width="0.25em"/>
         <m:mn>1</m:mn>
         <m:mo>,</m:mo>
         <m:mspace width="0.25em"/>
         <m:mn>2</m:mn>
         <m:mo>,</m:mo>
         <m:mo>&#8230;</m:mo>
         <m:mo>,</m:mo>
         <m:mi>K</m:mi>
      </m:mrow>
   </m:mfenced>
</m:mrow>
</m:math></inline-formula> is the normalization factor for array <it>m</it> (<it>K</it>&#8201;=&#8201;the total number of loci from one array).</p></sec><sec><st><p>Calibration for amplification efficiency</p></st><p>We have found that <inline-formula><m:math name="1755-8794-5-24-i7" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup>
   <m:mi>N</m:mi>
   <m:mi mathvariant="italic">mk</m:mi>
   <m:mo>'</m:mo>
</m:msubsup>
</m:math></inline-formula> at a given locus <it>k</it> is fairly stable from one array to another, except in CNV regions (see Additional file 1); therefore, a simple adjustment approach is proposed to calibrate the various amplification efficiencies:</p><p><display-formula id="M6"><m:math name="1755-8794-5-24-i8" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msub>
      <m:mover accent="true">
         <m:mi>N</m:mi>
         <m:mo>^</m:mo>
      </m:mover>
      <m:mi mathvariant="italic">mk</m:mi>
   </m:msub>
   <m:mo>=</m:mo>
   <m:msubsup>
      <m:mi>N</m:mi>
      <m:mi mathvariant="italic">mk</m:mi>
      <m:mo>'</m:mo>
   </m:msubsup>
   <m:mo>&#183;</m:mo>
   <m:msub>
      <m:mi>&#947;</m:mi>
      <m:mi>k</m:mi>
   </m:msub>
   <m:mtext>,</m:mtext>
</m:mrow>
</m:math></display-formula></p><p>where <inline-formula><m:math name="1755-8794-5-24-i9" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msub>
      <m:mi>&#947;</m:mi>
      <m:mi>k</m:mi>
   </m:msub>
   <m:mo>=</m:mo>
   <m:mn>2</m:mn>
   <m:mo>/</m:mo>
   <m:mi>median</m:mi>
   <m:mspace width="0.12em"/>
   <m:mfenced open="(" close=")">
      <m:mrow>
         <m:msubsup>
            <m:mi>N</m:mi>
            <m:mi mathvariant="italic">mk</m:mi>
            <m:mo>'</m:mo>
         </m:msubsup>
         <m:mtext>,</m:mtext>
         <m:mspace width="0.12em"/>
         <m:mi>m</m:mi>
         <m:mo>=</m:mo>
         <m:mn>1</m:mn>
         <m:mtext>,</m:mtext>
         <m:mspace width="0.12em"/>
         <m:mn>2</m:mn>
         <m:mtext>,</m:mtext>
         <m:mo>&#8230;</m:mo>
         <m:mtext>,</m:mtext>
         <m:mspace width="0.12em"/>
         <m:mi>M</m:mi>
      </m:mrow>
   </m:mfenced>
</m:mrow>
</m:math></inline-formula> is the adjustment factor for each locus <it>k</it> (<it>M</it> is the total number of reference samples). To estimate the adjustment factor <inline-formula><m:math name="1755-8794-5-24-i10" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msub>
   <m:mi>&#947;</m:mi>
   <m:mi>k</m:mi>
</m:msub>
</m:math></inline-formula>, a pool of reference samples is needed. In a case&#8211;control design, the control arrays are treated as the reference pool. In this article, the HapMap samples from dataset I are used to estimate <inline-formula><m:math name="1755-8794-5-24-i11" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msub>
   <m:mi>&#947;</m:mi>
   <m:mi>k</m:mi>
</m:msub>
</m:math></inline-formula>. CNVhac takes <inline-formula><m:math name="1755-8794-5-24-i12" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msub>
   <m:mover accent="true">
      <m:mi>N</m:mi>
      <m:mo>^</m:mo>
   </m:mover>
   <m:mi mathvariant="italic">mk</m:mi>
</m:msub>
</m:math></inline-formula> as the estimated raw CN for locus <it>k</it> in array <it>m</it>.</p></sec><sec><st><p>CNV calling</p></st><p>CNVhac implements an HMM-based algorithm to call CNVs. HMM methods have previously been successfully applied in other studies <abbrgrp><abbr bid="B13">13</abbr><abbr bid="B41">41</abbr><abbr bid="B42">42</abbr></abbrgrp>, and the main idea of our algorithm is similar to theirs. In our implementation of the HMM, the hidden state is the true CN (0, 1, 2, 3 or &#8805;4) of each locus along the genome, and the observation is our estimated raw CN <inline-formula><m:math name="1755-8794-5-24-i13" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msub>
   <m:mover accent="true">
      <m:mi>N</m:mi>
      <m:mo>^</m:mo>
   </m:mover>
   <m:mi mathvariant="italic">mk</m:mi>
</m:msub>
</m:math></inline-formula>. For each locus, the emission probabilities are estimated from a normal distribution whose mean is the true CN. The probability of leaving the normal state is presumed to be low, whereas the probability of returning to the normal CN or remaining in the same state is relatively high. Furthermore, the transition probability depends on the distance between neighboring loci <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. Given the initial emission and transition probabilities, the Viterbi algorithm <abbrgrp><abbr bid="B43">43</abbr></abbrgrp> is used to decode the hidden states. Then, the parameters are updated iteratively until convergence. A more detailed description of this method can be found in Additional file 1.</p></sec></sec></sec><sec><st><p>Results</p></st><p>The pipeline of CNVhac consists of two major steps. The preprocessing step first estimates the raw CNs <inline-formula><m:math name="1755-8794-5-24-i14" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msub>
   <m:mover accent="true">
      <m:mi>N</m:mi>
      <m:mo>^</m:mo>
   </m:mover>
   <m:mi mathvariant="italic">mk</m:mi>
</m:msub>
</m:math></inline-formula>, and the CNV calling step then searches for breakpoints with an HMM. In this section, we compare CNVhac with two widely used raw CN estimation methods, CRMA_v2 (&#8216;Copy-number estimation using Robust Multichip Analysis&#8217; <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>) and cn.FARMS (&#8216;factor analysis for robust microarray summarization&#8217; <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>), to evaluate the accuracy of estimated raw CN <inline-formula><m:math name="1755-8794-5-24-i15" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msub>
   <m:mover accent="true">
      <m:mi>N</m:mi>
      <m:mo>^</m:mo>
   </m:mover>
   <m:mi mathvariant="italic">mk</m:mi>
</m:msub>
</m:math></inline-formula>. CRMA_v2 is an extension of CRMA <abbrgrp><abbr bid="B44">44</abbr></abbrgrp> for estimating raw CNs for downstream analyses. cn.FARMS presents a probabilistic latent variable model for summarizing probes to obtain raw CN estimates. Both CRMA_v2 and cn.FARMS have been reported to outperform other methods in raw CN estimation <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp>. Meanwhile, to assess the performance of CNV calling, we compare CNVhac with another popular approach known as Birdsuite <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>, which is asserted to be the best for CNV inference with Affymetrix SNP arrays <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. Because Birdsuite does not estimate raw CNs, it is not considered in the comparison on raw CN estimation.</p><sec><st><p>Raw CN estimation on HapMap CEU samples</p></st><p>We assess the performance of raw CN estimation from two aspects: the accuracy in classifying the sex of HapMap individuals and the amplitude of genomic waviness. Females have two copies of the X chromosome, whereas males have only one; therefore, the CN of the X chromosome can naturally be used as the benchmark to evaluate the power of the raw CN estimates to differentiate between one and two copies. For this classification task, we used the same 59 CEU parents from Dataset I as in <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. Children were excluded to avoid inherited biases. The sample of female founder NA12145 was also removed on the basis of its low true CN level <abbrgrp><abbr bid="B44">44</abbr></abbrgrp>. All the loci in the pseudoautosomal regions (PAR1 and PAR2), segmental duplications (<url>http://humanparalogy.gs.washington.edu/build36</url>) and CNV regions <abbrgrp><abbr bid="B38">38</abbr></abbrgrp> on chromosome X were excluded owing to CN contamination. Finally, 83121 polymorphic and nonpolymorphic loci were kept, which gives 4904139 (=83121&#8201;&#215;&#8201;59) single-locus classification tasks. 
The receiver operating characteristic (ROC) curve is introduced to assess the performance of different methods. The horizontal axis of the ROC curve represents the false positive rate (the fraction of males classified as females), while the vertical axis stands for the true positive rate (the fraction of females classified as females). Figure&#8201;<figr fid="F1">1</figr> shows the ROC for CNVhac, CRMA_v2 and cn.FARMS, respectively. The areas under ROC curve (AUCs) of CNVhac, CRMA_v2 and cn.FARMS are 0.9684, 0.9603 and 0.9627, respectively. We see that CNVhac outperforms CRMA_v2 and cn.FARMS when distinguishing males from females based on the estimated raw CNs.</p><fig id="F1"><title><p>Figure 1</p></title><caption><p>ROC curves of the sex classification for CNVhac, CRMA_v2 and cn.FARMS on 59 HapMap CEU founders</p></caption><text>
   <p><b>ROC curves of the sex classification for CNVhac, CRMA_v2 and cn.FARMS on 59 HapMap CEU founders.</b> Left: Full ROC curves. Right: Top-left corner of ROC curves. CNVhac performs better than CRMA_v2 and cn.FARMS.</p>
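   <p>The AUC values compared in this figure can be illustrated with a minimal Python sketch. It assumes that a larger raw CN on chromosome X predicts a female sample and computes the AUC as the fraction of female/male value pairs that are ranked correctly, with ties counted as one half. The function name and the toy values are hypothetical and are not taken from the HapMap data or from the CNVhac implementation.</p>

```python
from typing import Sequence


def auc_from_scores(female_cn: Sequence[float], male_cn: Sequence[float]) -> float:
    """Area under the ROC curve when a higher raw CN predicts 'female'.

    Equals the probability that a randomly chosen female value exceeds a
    randomly chosen male value, with ties counted as one half.
    """
    if not female_cn or not male_cn:
        raise ValueError("both groups need at least one value")
    wins = 0.0
    for f in female_cn:
        for m in male_cn:
            if f > m:
                wins += 1.0
            elif f == m:
                wins += 0.5
    return wins / (len(female_cn) * len(male_cn))


# Toy raw CN values for illustration only (assumption, not real data).
females = [2.1, 1.9, 1.4, 2.3]
males = [1.0, 1.2, 1.5, 0.8]
print(auc_from_scores(females, males))
```

   <p>The pairwise loop is quadratic in the number of values and is meant only to make the definition explicit; a rank-based computation would be preferable for the millions of single-locus tasks considered here.</p>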
</text><graphic file="1755-8794-5-24-1"/></fig><p>The better result of sex classification by CNVhac may be attributed to better control of genomic waviness. To assess the waviness, we investigated the estimated raw CNs of chromosome X used above. The three sets of raw CNs were separately scaled to the same median. For females, the median is set as 2 and for males 1. Figure&#8201;<figr fid="F2">2</figr> shows an example of dissimilar genomic wave patterns for one female CEU founder, NA06985. The fluctuation of raw CNs is obvious in cn.FARMS, with somewhat less fluctuation in CRMA_v2. However, the waves are smoothed most effectively by CNVhac compared to the other methods. Figure&#8201;<figr fid="F3">3</figr> shows the density of raw CNs for female CEU founders and male founders, respectively. More precisely, we computed the variance of raw CNs. For females, the variances of cn.FARMS, CRMA_v2 and CNVhac are 0.2118, 0.1225 and 0.1112. For males, the variances are 0.2597, 0.0336 and 0.0289. For both females and males, CNVhac has the smallest variance (F test, all <it>p</it>-values are&#8201;&lt;&#8201;2e-16). This result implies that CNVhac can smooth the fluctuation through one simple, but effective, method.</p><fig id="F2"><title><p>Figure 2</p></title><caption><p>Genomic wave patterns on a segment of Chromosome X of one CEU female founder, NA06985, for (a) cn.FARMS, (b) CRMA_v2 and (c) CNVhac</p></caption><text>
   <p><b>Genomic wave patterns on a segment of Chromosome X of one CEU female founder, NA06985, for (a) cn.FARMS, (b) CRMA_v2 and (c) CNVhac.</b> CNVhac has the smallest amplitude of estimated raw CNs.</p>
</text><graphic file="1755-8794-5-24-2"/></fig><fig id="F3"><title><p>Figure 3</p></title><caption><p>Density of raw CNs estimated by different methods for (a) male CEU founders and (b) female CEU founders on chromosome X</p></caption><text>
   <p><b>Density of raw CNs estimated by different methods for (a) male CEU founders and (b) female CEU founders on chromosome X.</b> Raw CNs are scaled to the same median (for males 1 and females 2). CNVhac shows significantly smaller variance than CRMA_v2 and cn.FARMS (F test, all <it>p</it>-values are&#8201;&lt;&#8201;2e-16).</p>
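   <p>The variance comparison can be sketched as follows, assuming that the F test is the usual ratio of two sample variances. The helper name variance_ratio and the toy samples are hypothetical. The reported female variances are taken from the text above. No p-value is computed here, because it depends on the degrees of freedom, which are not restated in this example.</p>
<p><![CDATA[
```python
import statistics


def variance_ratio(sample_a, sample_b):
    """F statistic for two samples: larger sample variance over the smaller."""
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    return max(var_a, var_b) / min(var_a, var_b)


# Toy raw CNs for one locus set (hypothetical values, scaled to median 2).
smooth = [1.9, 2.1, 2.0, 2.2, 1.8]
wavy = [1.5, 2.5, 2.0, 2.4, 1.6]
toy_ratio = variance_ratio(smooth, wavy)

# Variances reported in the text for female founders on chromosome X.
reported_female = {"cn.FARMS": 0.2118, "CRMA_v2": 0.1225, "CNVhac": 0.1112}
ratios_vs_cnvhac = {
    name: var / reported_female["CNVhac"]
    for name, var in reported_female.items()
    if name != "CNVhac"
}
print(toy_ratio, ratios_vs_cnvhac)
```
]]></p>
   <p>With the reported values, the variance of CRMA_v2 is roughly 1.1 times, and that of cn.FARMS roughly 1.9 times, the variance of CNVhac for female founders.</p>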
</text><graphic file="1755-8794-5-24-3"/></fig></sec><sec><st><p>CNV calling on HapMap samples</p></st><p>The cross-platform verified regions in Dataset II are defined as true CNVs to assess the power of CNV detection for CNVhac and Birdsuite on the 269 samples from Dataset I (NA19012 is missing in the result of <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>). We filtered out those verified regions having fewer than 5 probes on the Affymetrix SNP 6.0 array, resulting in 1381 verified regions for our evaluation. Each sample has a different number of CNVs annotated in the 1381 selected regions <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>. In total, we have 49662 true CNVs annotated in the 1381 regions across the 269 samples. We assessed the performance of each algorithm by calculating the ratio of predicted CNVs supported by true CNVs to all predicted CNVs along the genome (precision), and the fraction of true CNVs predicted by the algorithm (recall). A predicted CNV and a true CNV are considered concordant when more than 50% of either region is covered by the other. When calculating the precision and recall, we summed over all 269 samples. Under the default parameter settings, the precision and recall of Birdsuite are 40.01% (19337/48333) and 38.94% (19337/49662), while the counterparts of CNVhac are 43.45% (5828/13412) and 11.74% (5828/49662). Compared to Birdsuite, CNVhac has a higher precision, but a lower recall. Note that the results of Birdsuite contain a set of predefined common CNVs provided by another study <abbrgrp><abbr bid="B45">45</abbr></abbrgrp>, whereas CNVhac identifies CNVs without a source of predefined common CNVs. In GWAS analyses, false discoveries tend to occur when identifying rare CNVs <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. 
Therefore, in the assessment of CNV calling power here, we removed the predefined common CNVs <abbrgrp><abbr bid="B45">45</abbr></abbrgrp> from both the predicted and true CNVs. This leaves 22043 true CNVs across the 269 samples. The 1-precision versus recall curve, which is similar to the ROC curve, is used to show the performance. A curve closer to the upper-left corner indicates better performance. Figure&#8201;<figr fid="F4">4</figr> shows the 1-precision versus recall curve of CNV calling for all 269 HapMap samples in Dataset I. At comparable levels of recall, we see that CNVhac gives higher precision than Birdsuite. A higher precision means a lower false discovery rate (FDR). The result implies that our method calls CNVs with a lower FDR.</p><fig id="F4"><title><p>Figure 4</p></title><caption><p>1-precision versus recall curves for CNV detection on 269 HapMap samples</p></caption><text>
   <p><b>1-precision versus recall curves for CNV detection on 269 HapMap samples.</b> A curve that is located more toward the upper-left corner indicates better performance. Note: FDR is 1-precision. Compared to Birdsuite, CNVhac shows an appreciably lower FDR when calling CNVs.</p>
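   <p>The concordance rule and the two performance measures can be sketched as below. This is a minimal illustration, assuming that regions on one chromosome are represented as (start, end) pairs and that "more than 50% of either region is covered by the other" means the overlap exceeds half the length of at least one of the two regions. The function names and the toy coordinates are hypothetical and are not part of CNVhac or Birdsuite.</p>
<p><![CDATA[
```python
def overlap_len(a, b):
    """Length of the overlap between two (start, end) regions."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def concordant(pred, true):
    """True if more than 50% of either region is covered by the other."""
    ov = overlap_len(pred, true)
    return ov > 0.5 * (pred[1] - pred[0]) or ov > 0.5 * (true[1] - true[0])


def precision_recall(predicted, truth):
    """Precision: share of predicted CNVs supported by a true CNV.
    Recall: share of true CNVs matched by a predicted CNV."""
    supported = sum(any(concordant(p, t) for t in truth) for p in predicted)
    recovered = sum(any(concordant(p, t) for p in predicted) for t in truth)
    return supported / len(predicted), recovered / len(truth)


# Toy regions (hypothetical coordinates on a single chromosome).
predicted = [(100, 200), (500, 600), (900, 950)]
truth = [(120, 220), (580, 800), (2000, 2100), (905, 1500)]

precision, recall = precision_recall(predicted, truth)
fdr = 1.0 - precision
print(precision, recall, fdr)
```
]]></p>
   <p>In this toy case two of the three predictions are supported and two of the four true CNVs are recovered, so the FDR plotted on the horizontal axis would be one third.</p>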
</text><graphic file="1755-8794-5-24-4"/></fig></sec><sec><st><p>Sample batch dependence of CNV calling</p></st><p>As described in the Background section, different parameters trained from different sample batches may cause an inconsistent inference. To evaluate the sample batch dependence of CNV calling of CNVhac, we compare it with Birdsuite. In CNVhac, estimating adjustment factor <inline-formula><m:math name="1755-8794-5-24-i16" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msub>
   <m:mi>&#947;</m:mi>
   <m:mi>k</m:mi>
</m:msub>
</m:math></inline-formula> is the only step requiring a batch of samples. In Section 3.2, all 270 HapMap samples were used to estimate <inline-formula><m:math name="1755-8794-5-24-i17" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msub>
   <m:mi>&#947;</m:mi>
   <m:mi>k</m:mi>
</m:msub>
</m:math></inline-formula>. Here, we divided the 270 samples into 3 groups and then treated them as different pools of reference samples. Each group consisted of 90 samples. (The different choice of samples in each group can be found in Additional file 2). Adjustment factor <inline-formula><m:math name="1755-8794-5-24-i18" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msub>
   <m:mi mathvariant="italic">&#947;</m:mi>
   <m:mi mathvariant="italic">k</m:mi>
</m:msub>
</m:math></inline-formula> can be estimated within each group, respectively. With the different <inline-formula><m:math name="1755-8794-5-24-i19" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msub>
   <m:mi>&#947;</m:mi>
   <m:mi>k</m:mi>
</m:msub>
</m:math></inline-formula>, raw CN estimates <inline-formula><m:math name="1755-8794-5-24-i20" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msub>
   <m:mover accent="true">
      <m:mi>N</m:mi>
      <m:mo>^</m:mo>
   </m:mover>
   <m:mi mathvariant="italic">mk</m:mi>
</m:msub>
</m:math></inline-formula> change, as well as the CNV calling. For a specific sample <it>S</it><sub><it>i</it></sub>, three sets of CNV regions can be detected through different <inline-formula><m:math name="1755-8794-5-24-i21" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msub>
   <m:mi>&#947;</m:mi>
   <m:mi>k</m:mi>
</m:msub>
</m:math></inline-formula>. We assess the batch dependence by computing the ratio of the number of CNVs in the intersection of the three sets to the number in their union. For Birdsuite, 3 groups were created in the same way. Next, sample <it>S</it><sub><it>i</it></sub> was added to each of the other two groups, which do not contain it. Hence, one can also obtain three sets of identified CNVs. We chose 6 individuals (2 CEU, 2 YRI, 1 JPT and 1 CHB) to call CNVs based on different groups. Table&#8201;<tblr tid="T1">1</tblr> displays the ratio of intersection to union for each method under the default parameter settings. From this, we see that CNVhac shows significantly higher ratios than Birdsuite (<it>p</it>-value&#8201;=&#8201;6.5e-3 by Wilcoxon rank-sum test). This indicates that CNVhac alleviates the sample batch dependence of CNV calling to a certain extent.</p><table id="T1"><title><p>Table 1</p></title><caption><p><b>Results of CNV calling based on different training sample batches for CNVhac and Birdsuite</b></p></caption><tgroup align="left" cols="13"><colspec align="left" colname="c1" colnum="1" colwidth="1*"/><colspec align="left" colname="c2" colnum="2" colwidth="1*"/><colspec align="left" colname="c3" colnum="3" colwidth="1*"/><colspec align="left" colname="c4" colnum="4" colwidth="1*"/><colspec align="left" colname="c5" colnum="5" colwidth="1*"/><colspec align="left" colname="c6" colnum="6" colwidth="1*"/><colspec align="left" colname="c7" colnum="7" colwidth="1*"/><colspec align="left" colname="c8" colnum="8" colwidth="1*"/><colspec align="left" colname="c9" colnum="9" colwidth="1*"/><colspec align="left" colname="c10" colnum="10" colwidth="1*"/><colspec align="left" colname="c11" colnum="11" colwidth="1*"/><colspec align="left" colname="c12" colnum="12" colwidth="1*"/><colspec align="left" colname="c13" colnum="13" colwidth="1*"/><thead valign="top"><row rowsep="1"><entry colname="c1" morerows="1"/><entry colname="c2" nameend="c7" namest="c2"><p><b>Birdsuite</b></p></entry><entry colname="c8" nameend="c13" 
namest="c8"><p><b>CNVhac</b></p></entry></row><row rowsep="1"><entry colname="c2"><p><b>G1</b><sup><b>&#167;</b></sup></p></entry><entry colname="c3"><p><b>G2</b></p></entry><entry colname="c4"><p><b>G3</b></p></entry><entry colname="c5"><p><b>I</b><sup><b>&#182;</b></sup></p></entry><entry colname="c6"><p><b>U</b><sup><b>&#8224;</b></sup></p></entry><entry colname="c7"><p><b>Ratio</b><sup><b>&#8225;</b></sup></p></entry><entry colname="c8"><p><b>G1</b></p></entry><entry colname="c9"><p><b>G2</b></p></entry><entry colname="c10"><p><b>G3</b></p></entry><entry colname="c11"><p><b>I</b></p></entry><entry colname="c12"><p><b>U</b></p></entry><entry colname="c13"><p><b>Ratio</b></p></entry></row></thead><tfoot><p>&#167;The number of predicted CNVs using group 1 for parameter training.</p><p><sup>&#182;</sup>The number of CNVs in intersection set of &#8220;G1&#8221;, &#8220;G2&#8221; and &#8220;G3&#8221;.</p><p><sup>&#8224;</sup>The number of CNVs in union set of &#8220;G1&#8221;, &#8220;G2&#8221; and &#8220;G3&#8221;.</p><p><sup>&#8225;</sup>The ratio of intersection to union.</p></tfoot><tbody valign="top"><row><entry colname="c1"><p>NA12156</p></entry><entry colname="c2"><p>17</p></entry><entry colname="c3"><p>19</p></entry><entry colname="c4"><p>21</p></entry><entry colname="c5"><p>14</p></entry><entry colname="c6"><p>22</p></entry><entry colname="c7"><p>0.64</p></entry><entry colname="c8"><p>15</p></entry><entry colname="c9"><p>17</p></entry><entry colname="c10"><p>18</p></entry><entry colname="c11"><p>15</p></entry><entry colname="c12"><p>17</p></entry><entry colname="c13"><p>0.88</p></entry></row><row><entry colname="c1"><p>NA12878</p></entry><entry colname="c2"><p>22</p></entry><entry colname="c3"><p>21</p></entry><entry colname="c4"><p>19</p></entry><entry colname="c5"><p>15</p></entry><entry colname="c6"><p>28</p></entry><entry colname="c7"><p>0.54</p></entry><entry colname="c8"><p>29</p></entry><entry colname="c9"><p>26</p></entry><entry 
colname="c10"><p>24</p></entry><entry colname="c11"><p>20</p></entry><entry colname="c12"><p>33</p></entry><entry colname="c13"><p>0.61</p></entry></row><row><entry colname="c1"><p>NA18507</p></entry><entry colname="c2"><p>19</p></entry><entry colname="c3"><p>15</p></entry><entry colname="c4"><p>20</p></entry><entry colname="c5"><p>10</p></entry><entry colname="c6"><p>23</p></entry><entry colname="c7"><p>0.43</p></entry><entry colname="c8"><p>16</p></entry><entry colname="c9"><p>20</p></entry><entry colname="c10"><p>20</p></entry><entry colname="c11"><p>15</p></entry><entry colname="c12"><p>21</p></entry><entry colname="c13"><p>0.71</p></entry></row><row><entry colname="c1"><p>NA18517</p></entry><entry colname="c2"><p>20</p></entry><entry colname="c3"><p>21</p></entry><entry colname="c4"><p>21</p></entry><entry colname="c5"><p>14</p></entry><entry colname="c6"><p>25</p></entry><entry colname="c7"><p>0.56</p></entry><entry colname="c8"><p>21</p></entry><entry colname="c9"><p>21</p></entry><entry colname="c10"><p>18</p></entry><entry colname="c11"><p>16</p></entry><entry colname="c12"><p>23</p></entry><entry colname="c13"><p>0.7</p></entry></row><row><entry colname="c1"><p>NA18555</p></entry><entry colname="c2"><p>16</p></entry><entry colname="c3"><p>16</p></entry><entry colname="c4"><p>15</p></entry><entry colname="c5"><p>11</p></entry><entry colname="c6"><p>20</p></entry><entry colname="c7"><p>0.55</p></entry><entry colname="c8"><p>16</p></entry><entry colname="c9"><p>14</p></entry><entry colname="c10"><p>17</p></entry><entry colname="c11"><p>11</p></entry><entry colname="c12"><p>18</p></entry><entry colname="c13"><p>0.61</p></entry></row><row rowsep="1"><entry colname="c1"><p>NA18956</p></entry><entry colname="c2"><p>13</p></entry><entry colname="c3"><p>12</p></entry><entry colname="c4"><p>16</p></entry><entry colname="c5"><p>9</p></entry><entry colname="c6"><p>16</p></entry><entry colname="c7"><p>0.6</p></entry><entry colname="c8"><p>20</p></entry><entry 
colname="c9"><p>21</p></entry><entry colname="c10"><p>24</p></entry><entry colname="c11"><p>16</p></entry><entry colname="c12"><p>24</p></entry><entry colname="c13"><p>0.67</p></entry></row></tbody></tgroup></table></sec></sec><sec><st><p>Discussion</p></st><p>For years, array-based technologies have been widely used for exploring CNV events. However, the inherent noise of microarray data may lead to high FDR when making inferences. In array experiments, hybridization is highly correlated with sequence composition <abbrgrp><abbr bid="B27">27</abbr><abbr bid="B28">28</abbr><abbr bid="B30">30</abbr><abbr bid="B32">32</abbr><abbr bid="B39">39</abbr><abbr bid="B40">40</abbr><abbr bid="B46">46</abbr></abbrgrp>. The binding affinities of probes can vary widely with their sequences. Most previous algorithms attempt to model the binding affinity through statistical or empirical methods <abbrgrp><abbr bid="B41">41</abbr><abbr bid="B44">44</abbr></abbrgrp>, which need multiple samples for training parameters. However, such multiple samples may lead to another problem: sample dependence of outputs <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. Different choices of training samples may result in different estimated parameters, leading, in turn, to incompatible results. Any algorithm that needs multiple training samples may encounter this effect. Consequently, strategies based on single-array processing are preferred. Up to now, however, few single-array approaches have been presented. CRMA_v2 is a single-array preprocessing method for SNP array analysis. 
However, the raw CNs estimated by CRMA_v2 exhibit a wavy pattern, and thus may not be accurate enough for downstream CNV identification.</p><p>Motivated by addressing the cross-hybridization of probes, genomic waves of intensities and sample dependence of parameter estimation, we propose in this article a single-array preprocessing method, termed CNVhac, to estimate more accurate raw CNs. Based on the previous PICR method <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>, we model the hybridization and cross-hybridization of probes through physicochemical law. Wan et al. have shown that the PICR model can address the cross-hybridization effect very well <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. The genomic wave patterns of signal intensities are hypothesized to reflect the various amplification efficiencies of DNA fragments in the PCR process <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>. However, based on the diversity of sheared fragments and complicated PCR procedures, it is difficult to estimate the accurate amplification rate for each locus. Instead, we smooth the genomic waves by estimating an adjustment factor for each locus since we have found that the estimated CNs show a fairly stable pattern between loci (see Additional file 1). Compared to CRMA_v2 and cn.FARMS, this simple calibration method effectively reduces the amplitude of waviness. Note that the reduction of waviness is not simply a compression of variance in that CNVhac provides more accurate raw CN estimates which can well differentiate between one or two copies. Moreover, the number of parameters needed to estimate target concentration <inline-formula><m:math name="1755-8794-5-24-i22" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msub>
   <m:mover accent="true">
      <m:mi>N</m:mi>
      <m:mo>^</m:mo>
   </m:mover>
   <m:mi mathvariant="italic">mk</m:mi>
</m:msub>
</m:math></inline-formula> in CNVhac is far smaller than in prior statistical models, and these parameters can be estimated from a single array quite stably <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>. This property avoids the sample dependence of parameter estimation. Compared to one popular CNV detection method known as Birdsuite <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B13">13</abbr></abbrgrp>, CNVhac, indeed, alleviates the sample dependence of CNV calling more effectively. However, CNVhac needs a pool of reference samples to estimate <inline-formula><m:math name="1755-8794-5-24-i23" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msub>
   <m:mi mathvariant="italic">&#947;</m:mi>
   <m:mi mathvariant="italic">k</m:mi>
</m:msub>
</m:math></inline-formula> for calibrating amplification efficiency. In a case&#8211;control design, the control samples are treated as the reference pool. When the dataset contains only case samples, anonymous normal samples, e.g., HapMap samples, can be used as the reference pool. Because of the different experimental conditions, the anonymous normal samples may introduce sample-dependent bias in <inline-formula><m:math name="1755-8794-5-24-i24" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msub>
   <m:mi mathvariant="italic">&#947;</m:mi>
   <m:mi mathvariant="italic">k</m:mi>
</m:msub>
</m:math></inline-formula>. CNVhac cannot address this kind of sample dependence.</p><p>CNVs have attracted much attention in recent years because they are assumed to play a significant role in causing human disease <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B4">4</abbr></abbrgrp>. In particular, recent studies and reviews have shown that rare CNVs contribute much more to neuropsychiatric disorders than previously thought <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B47">47</abbr><abbr bid="B48">48</abbr><abbr bid="B49">49</abbr><abbr bid="B50">50</abbr><abbr bid="B51">51</abbr></abbrgrp>. However, the mechanism underlying the influence of CNVs on human phenotypes is still not well understood. Furthermore, even a small fraction of false discoveries may mislead the downstream association studies. Therefore, CNV calling methods that control the FDR are strongly desired <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. On the basis of raw CN estimates with cross-hybridization and amplification rate correction, CNVhac can identify rare CNVs with a lower FDR compared to the powerful Birdsuite method. This result implies that CNVhac can accurately identify CNVs, especially rare CNVs, for downstream association studies.</p><p>Since CNVhac is a single-array based strategy, the running time could be reduced by executing CNVhac on multiple processors in parallel when analyzing a large set of samples. Also, since parameters are consistent between arrays, there is no need to reprocess the early data when new samples are hybridized.</p></sec><sec><st><p>Conclusion</p></st><p>Cross-hybridization and different amplification efficiencies of probes are common difficulties in microarray analysis. Most studies attempt to solve the problem by training numerous model parameters from a large dataset, but this might incur inconsistent results. 
Moreover, the statistical power of this methodology may be significantly reduced when the training dataset is not big enough. In this article, we first addressed the cross-hybridization problem through physicochemical law and then proposed a simple adjustment for the various amplification rates. Our method, CNVhac, avoids complicated statistical models which need many samples for training. By comparing CNVhac with other methods, we have shown that our simple process is effective and suitable for all Affymetrix SNP array types with similar design standards. Finally, the working principle of CNVhac can be easily extended to other platforms, such as Illumina and Agilent arrays.</p></sec><sec><st><p>Endnotes</p></st><p>CNVhac<sup>a</sup>: The algorithm is implemented in R and C++ and is available at <url>http://www.math.pku.edu.cn/teachers/dengmh/CNVhac</url>.</p></sec><sec><st><p>Abbreviations</p></st><p>CN, Copy number; CNV, Copy number variation; FDR, False discovery rate; AC, Allelic concentration; HMM, Hidden Markov Model; GWAS, Genome-wide association studies; PICR, Probe intensity composite representation; PDNN, Position-dependent nearest-neighbor; OLS, Ordinary least squares; CRMA, Copy-number estimation using Robust Multichip Analysis; cn.FARMS, Factor analysis for robust microarray summarization; ROC, Receiver operating characteristic; AUC, Area under ROC curve.</p></sec><sec><st><p>Competing interests</p></st><p>The authors declare that they have no competing interests.</p></sec><sec><st><p>Authors&#8217; contributions</p></st><p>MPQ and MHD conceived the project. MPQ, LW and MHD proposed the main idea. QW and PCP developed the program. QW implemented the methods, analyzed the data, and wrote the manuscript. MPQ, LW and MHD finalized the manuscript. 
All authors read and approved the final manuscript.</p></sec><sec><st><p>Funding</p></st><p>This work was supported by the National Natural Science Foundation of China [No.31171262, No.11021463] and the National Key Basic Research Project of China [No.2009CB918503].</p></sec></bdy><bm><ack><sec><st><p>Acknowledgements</p></st><p>We thank Linbo Wang and Yongjian Kang for helpful discussions.</p></sec></ack><refgrp><bibl id="B1"><title><p>Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls</p></title><aug><au><snm>Craddock</snm><fnm>N</fnm></au><au><snm>Hurles</snm><fnm>ME</fnm></au><au><snm>Cardin</snm><fnm>N</fnm></au><au><snm>Pearson</snm><fnm>RD</fnm></au><au><snm>Plagnol</snm><fnm>V</fnm></au><au><snm>Robson</snm><fnm>S</fnm></au><au><snm>Vukcevic</snm><fnm>D</fnm></au><au><snm>Barnes</snm><fnm>C</fnm></au><au><snm>Conrad</snm><fnm>DF</fnm></au><au><snm>Giannoulatou</snm><fnm>E</fnm></au><etal/></aug><source>Nature</source><pubdate>2010</pubdate><volume>464</volume><fpage>713</fpage><lpage>720</lpage></bibl><bibl id="B2"><title><p>Rare copy number variants: a point of rarity in genetic risk for bipolar disorder and schizophrenia</p></title><aug><au><snm>Grozeva</snm><fnm>D</fnm></au><au><snm>Kirov</snm><fnm>G</fnm></au><au><snm>Ivanov</snm><fnm>D</fnm></au><au><snm>Jones</snm><fnm>IR</fnm></au><au><snm>Jones</snm><fnm>L</fnm></au><au><snm>Green</snm><fnm>EK</fnm></au><au><snm>St Clair</snm><fnm>DM</fnm></au><au><snm>Young</snm><fnm>AH</fnm></au><au><snm>Ferrier</snm><fnm>N</fnm></au><au><snm>Farmer</snm><fnm>AE</fnm></au><etal/></aug><source>Arch Gen Psychiatry</source><pubdate>2010</pubdate><volume>67</volume><fpage>318</fpage><lpage>327</lpage></bibl><bibl id="B3"><title><p>Finding the missing heritability of complex 
diseases</p></title><aug><au><snm>Manolio</snm><fnm>TA</fnm></au><au><snm>Collins</snm><fnm>FS</fnm></au><au><snm>Cox</snm><fnm>NJ</fnm></au><au><snm>Goldstein</snm><fnm>DB</fnm></au><au><snm>Hindorff</snm><fnm>LA</fnm></au><au><snm>Hunter</snm><fnm>DJ</fnm></au><au><snm>McCarthy</snm><fnm>MI</fnm></au><au><snm>Ramos</snm><fnm>EM</fnm></au><au><snm>Cardon</snm><fnm>LR</fnm></au><au><snm>Chakravarti</snm><fnm>A</fnm></au><etal/></aug><source>Nature</source><pubdate>2009</pubdate><volume>461</volume><fpage>747</fpage><lpage>753</lpage></bibl><bibl id="B4"><title><p>Extending genome-wide association studies to copy-number variation</p></title><aug><au><snm>McCarroll</snm><fnm>SA</fnm></au></aug><source>Hum Mol Genet</source><pubdate>2008</pubdate><volume>17</volume><fpage>R135</fpage><lpage>R142</lpage></bibl><bibl id="B5"><title><p>Accuracy of CNV Detection from GWAS Data</p></title><aug><au><snm>Zhang</snm><fnm>D</fnm></au><au><snm>Qian</snm><fnm>Y</fnm></au><au><snm>Akula</snm><fnm>N</fnm></au><au><snm>Alliey-Rodriguez</snm><fnm>N</fnm></au><au><snm>Tang</snm><fnm>J</fnm></au><au><snm>Gershon</snm><fnm>ES</fnm></au><au><snm>Liu</snm><fnm>C</fnm></au></aug><source>PLoS One</source><pubdate>2011</pubdate><volume>6</volume><fpage>e14511</fpage></bibl><bibl id="B6"><title><p>A single-array preprocessing method for estimating full-resolution raw copy numbers from all Affymetrix genotyping arrays including GenomeWideSNP 5 &amp; 6</p></title><aug><au><snm>Bengtsson</snm><fnm>H</fnm></au><au><snm>Wirapati</snm><fnm>P</fnm></au><au><snm>Speed</snm><fnm>TP</fnm></au></aug><source>Bioinformatics</source><pubdate>2009</pubdate><volume>25</volume><fpage>2149</fpage><lpage>2156</lpage></bibl><bibl id="B7"><title><p>cn.FARMS: a latent variable model to detect copy number variations in microarray data with a low false discovery 
rate</p></title><aug><au><snm>Clevert</snm><fnm>DA</fnm></au><au><snm>Mitterecker</snm><fnm>A</fnm></au><au><snm>Mayr</snm><fnm>A</fnm></au><au><snm>Klambauer</snm><fnm>G</fnm></au><au><snm>Tuefferd</snm><fnm>M</fnm></au><au><snm>De Bondt</snm><fnm>A</fnm></au><au><snm>Talloen</snm><fnm>W</fnm></au><au><snm>Gohlmann</snm><fnm>H</fnm></au><au><snm>Hochreiter</snm><fnm>S</fnm></au></aug><source>Nucleic Acids Res</source><pubdate>2011</pubdate><volume>39</volume><fpage>e79</fpage></bibl><bibl id="B8"><title><p>Computational methods for discovering structural variation with next-generation sequencing</p></title><aug><au><snm>Medvedev</snm><fnm>P</fnm></au><au><snm>Stanciu</snm><fnm>M</fnm></au><au><snm>Brudno</snm><fnm>M</fnm></au></aug><source>Nat Methods</source><pubdate>2009</pubdate><volume>6</volume><fpage>S13</fpage><lpage>S20</lpage></bibl><bibl id="B9"><title><p>Personalized copy number and segmental duplication maps using next-generation sequencing</p></title><aug><au><snm>Alkan</snm><fnm>C</fnm></au><au><snm>Kidd</snm><fnm>JM</fnm></au><au><snm>Marques-Bonet</snm><fnm>T</fnm></au><au><snm>Aksay</snm><fnm>G</fnm></au><au><snm>Antonacci</snm><fnm>F</fnm></au><au><snm>Hormozdiari</snm><fnm>F</fnm></au><au><snm>Kitzman</snm><fnm>JO</fnm></au><au><snm>Baker</snm><fnm>C</fnm></au><au><snm>Malig</snm><fnm>M</fnm></au><au><snm>Mutlu</snm><fnm>O</fnm></au><etal/></aug><source>Nat Genet</source><pubdate>2009</pubdate><volume>41</volume><fpage>1061</fpage><lpage>1067</lpage></bibl><bibl id="B10"><title><p>Diversity of human copy number variation and multicopy 
genes</p></title><aug><au><snm>Sudmant</snm><fnm>PH</fnm></au><au><snm>Kitzman</snm><fnm>JO</fnm></au><au><snm>Antonacci</snm><fnm>F</fnm></au><au><snm>Alkan</snm><fnm>C</fnm></au><au><snm>Malig</snm><fnm>M</fnm></au><au><snm>Tsalenko</snm><fnm>A</fnm></au><au><snm>Sampas</snm><fnm>N</fnm></au><au><snm>Bruhn</snm><fnm>L</fnm></au><au><snm>Shendure</snm><fnm>J</fnm></au><au><snm>Eichler</snm><fnm>EE</fnm></au></aug><source>Science</source><pubdate>2010</pubdate><volume>330</volume><fpage>641</fpage><lpage>646</lpage></bibl><bibl id="B11"><title><p>Genome structural variation discovery and genotyping</p></title><aug><au><snm>Alkan</snm><fnm>C</fnm></au><au><snm>Coe</snm><fnm>BP</fnm></au><au><snm>Eichler</snm><fnm>EE</fnm></au></aug><source>Nat Rev Genet</source><pubdate>2011</pubdate><volume>12</volume><fpage>363</fpage><lpage>376</lpage></bibl><bibl id="B12"><title><p>Next generation sequencing has lower sequence coverage and poorer SNP-detection capability in the regulatory regions</p></title><aug><au><snm>Wang</snm><fnm>W</fnm></au><au><snm>Wei</snm><fnm>Z</fnm></au><au><snm>Lam</snm><fnm>TW</fnm></au><au><snm>Wang</snm><fnm>J</fnm></au></aug><source>Sci Rep</source><pubdate>2011</pubdate><volume>1</volume><fpage>55</fpage></bibl><bibl id="B13"><title><p>Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs</p></title><aug><au><snm>Korn</snm><fnm>JM</fnm></au><au><snm>Kuruvilla</snm><fnm>FG</fnm></au><au><snm>McCarroll</snm><fnm>SA</fnm></au><au><snm>Wysoker</snm><fnm>A</fnm></au><au><snm>Nemesh</snm><fnm>J</fnm></au><au><snm>Cawley</snm><fnm>S</fnm></au><au><snm>Hubbell</snm><fnm>E</fnm></au><au><snm>Veitch</snm><fnm>J</fnm></au><au><snm>Collins</snm><fnm>PJ</fnm></au><au><snm>Darvishi</snm><fnm>K</fnm></au><etal/></aug><source>Nat Genet</source><pubdate>2008</pubdate><volume>40</volume><fpage>1253</fpage><lpage>1260</lpage></bibl><bibl id="B14"><title><p>dChipSNP: significance curve and clustering of 
SNP-array-based loss-of-heterozygosity data</p></title><aug><au><snm>Lin</snm><fnm>M</fnm></au><au><snm>Wei</snm><fnm>LJ</fnm></au><au><snm>Sellers</snm><fnm>WR</fnm></au><au><snm>Lieberfarb</snm><fnm>M</fnm></au><au><snm>Wong</snm><fnm>WH</fnm></au><au><snm>Li</snm><fnm>C</fnm></au></aug><source>Bioinformatics</source><pubdate>2004</pubdate><volume>20</volume><fpage>1233</fpage><lpage>1240</lpage></bibl><bibl id="B15"><title><p>A robust statistical method for case&#8211;control association testing with copy number variation</p></title><aug><au><snm>Barnes</snm><fnm>C</fnm></au><au><snm>Plagnol</snm><fnm>V</fnm></au><au><snm>Fitzgerald</snm><fnm>T</fnm></au><au><snm>Redon</snm><fnm>R</fnm></au><au><snm>Marchini</snm><fnm>J</fnm></au><au><snm>Clayton</snm><fnm>D</fnm></au><au><snm>Hurles</snm><fnm>ME</fnm></au></aug><source>Nat Genet</source><pubdate>2008</pubdate><volume>40</volume><fpage>1245</fpage><lpage>1252</lpage></bibl><bibl id="B16"><title><p>Joint estimation of copy number variation and reference intensities on multiple DNA arrays using GADA</p></title><aug><au><snm>Pique-Regi</snm><fnm>R</fnm></au><au><snm>Ortega</snm><fnm>A</fnm></au><au><snm>Asgharzadeh</snm><fnm>S</fnm></au></aug><source>Bioinformatics</source><pubdate>2009</pubdate><volume>25</volume><fpage>1223</fpage><lpage>1230</lpage></bibl><bibl id="B17"><title><p>Methods and strategies for analyzing copy number variation using DNA microarrays</p></title><aug><au><snm>Carter</snm><fnm>NP</fnm></au></aug><source>Nat Genet</source><pubdate>2007</pubdate><volume>39</volume><fpage>S16</fpage><lpage>S21</lpage></bibl><bibl id="B18"><title><p>Challenges and standards in integrating surveys of structural 
variation</p></title><aug><au><snm>Scherer</snm><fnm>SW</fnm></au><au><snm>Lee</snm><fnm>C</fnm></au><au><snm>Birney</snm><fnm>E</fnm></au><au><snm>Altshuler</snm><fnm>DM</fnm></au><au><snm>Eichler</snm><fnm>EE</fnm></au><au><snm>Carter</snm><fnm>NP</fnm></au><au><snm>Hurles</snm><fnm>ME</fnm></au><au><snm>Feuk</snm><fnm>L</fnm></au></aug><source>Nat Genet</source><pubdate>2007</pubdate><volume>39</volume><fpage>S7</fpage><lpage>S15</lpage></bibl><bibl id="B19"><title><p>Comparing CNV detection methods for SNP arrays</p></title><aug><au><snm>Winchester</snm><fnm>L</fnm></au><au><snm>Yau</snm><fnm>C</fnm></au><au><snm>Ragoussis</snm><fnm>J</fnm></au></aug><source>Brief Funct Genomic Proteomic</source><pubdate>2009</pubdate><volume>8</volume><fpage>353</fpage><lpage>366</lpage></bibl><bibl id="B20"><title><p>Hybridization modeling of oligonucleotide SNP arrays for accurate DNA copy number estimation</p></title><aug><au><snm>Wan</snm><fnm>L</fnm></au><au><snm>Sun</snm><fnm>K</fnm></au><au><snm>Ding</snm><fnm>Q</fnm></au><au><snm>Cui</snm><fnm>Y</fnm></au><au><snm>Li</snm><fnm>M</fnm></au><au><snm>Wen</snm><fnm>Y</fnm></au><au><snm>Elston</snm><fnm>RC</fnm></au><au><snm>Qian</snm><fnm>M</fnm></au><au><snm>Fu</snm><fnm>WJ</fnm></au></aug><source>Nucleic Acids Res</source><pubdate>2009</pubdate><volume>37</volume><fpage>e117</fpage></bibl><bibl id="B21"><title><p>Breaking the waves: improved detection of copy number variation from microarray-based comparative genomic hybridization</p></title><aug><au><snm>Marioni</snm><fnm>JC</fnm></au><au><snm>Thorne</snm><fnm>NP</fnm></au><au><snm>Valsesia</snm><fnm>A</fnm></au><au><snm>Fitzgerald</snm><fnm>T</fnm></au><au><snm>Redon</snm><fnm>R</fnm></au><au><snm>Fiegler</snm><fnm>H</fnm></au><au><snm>Andrews</snm><fnm>TD</fnm></au><au><snm>Stranger</snm><fnm>BE</fnm></au><au><snm>Lynch</snm><fnm>AG</fnm></au><au><snm>Dermitzakis</snm><fnm>ET</fnm></au><etal/></aug><source>Genome 
Biol</source><pubdate>2007</pubdate><volume>8</volume><fpage>R228</fpage></bibl><bibl id="B22"><title><p>Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms</p></title><aug><au><snm>Diskin</snm><fnm>SJ</fnm></au><au><snm>Li</snm><fnm>M</fnm></au><au><snm>Hou</snm><fnm>C</fnm></au><au><snm>Yang</snm><fnm>S</fnm></au><au><snm>Glessner</snm><fnm>J</fnm></au><au><snm>Hakonarson</snm><fnm>H</fnm></au><au><snm>Bucan</snm><fnm>M</fnm></au><au><snm>Maris</snm><fnm>JM</fnm></au><au><snm>Wang</snm><fnm>K</fnm></au></aug><source>Nucleic Acids Res</source><pubdate>2008</pubdate><volume>36</volume><fpage>e126</fpage></bibl><bibl id="B23"><title><p>Preprocessing and downstream analysis of microarray DNA copy number profiles</p></title><aug><au><snm>van de Wiel</snm><fnm>MA</fnm></au><au><snm>Picard</snm><fnm>F</fnm></au><au><snm>van Wieringen</snm><fnm>WN</fnm></au><au><snm>Ylstra</snm><fnm>B</fnm></au></aug><source>Brief Bioinform</source><pubdate>2010</pubdate><volume>12</volume><issue>1</issue><fpage>10</fpage><lpage>21</lpage><note>http://bib.oxfordjournals.org/content/12/1/10.short</note></bibl><bibl id="B24"><title><p>Array of hope</p></title><aug><au><snm>Lander</snm><fnm>ES</fnm></au></aug><source>Nat Genet</source><pubdate>1999</pubdate><volume>21</volume><fpage>3</fpage><lpage>4</lpage></bibl><bibl id="B25"><title><p>Adjusting batch effects in microarray expression data using empirical Bayes methods</p></title><aug><au><snm>Johnson</snm><fnm>WE</fnm></au><au><snm>Li</snm><fnm>C</fnm></au><au><snm>Rabinovic</snm><fnm>A</fnm></au></aug><source>Biostatistics</source><pubdate>2007</pubdate><volume>8</volume><fpage>118</fpage><lpage>127</lpage></bibl><bibl id="B26"><title><p>Assessing batch effects of genotype calling algorithm BRLMM for the Affymetrix GeneChip Human Mapping 500&#8201;K array set using 270 HapMap 
samples</p></title><aug><au><snm>Hong</snm><fnm>H</fnm></au><au><snm>Su</snm><fnm>Z</fnm></au><au><snm>Ge</snm><fnm>W</fnm></au><au><snm>Shi</snm><fnm>L</fnm></au><au><snm>Perkins</snm><fnm>R</fnm></au><au><snm>Fang</snm><fnm>H</fnm></au><au><snm>Xu</snm><fnm>J</fnm></au><au><snm>Chen</snm><fnm>JJ</fnm></au><au><snm>Han</snm><fnm>T</fnm></au><au><snm>Kaput</snm><fnm>J</fnm></au><etal/></aug><source>BMC Bioinformatics</source><pubdate>2008</pubdate><volume>9</volume><issue>Suppl 9</issue><fpage>S17</fpage></bibl><bibl id="B27"><title><p>Modeling of DNA microarray data by using physical properties of hybridization</p></title><aug><au><snm>Held</snm><fnm>GA</fnm></au><au><snm>Grinstein</snm><fnm>G</fnm></au><au><snm>Tu</snm><fnm>Y</fnm></au></aug><source>Proc Natl Acad Sci U S A</source><pubdate>2003</pubdate><volume>100</volume><fpage>7575</fpage><lpage>7580</lpage></bibl><bibl id="B28"><title><p>Relationship between gene expression and observed intensities in DNA microarrays&#8211;a modeling study</p></title><aug><au><snm>Held</snm><fnm>GA</fnm></au><au><snm>Grinstein</snm><fnm>G</fnm></au><au><snm>Tu</snm><fnm>Y</fnm></au></aug><source>Nucleic Acids Res</source><pubdate>2006</pubdate><volume>34</volume><fpage>e70</fpage></bibl><bibl id="B29"><title><p>Breakdown of thermodynamic equilibrium for DNA hybridization in microarrays</p></title><aug><au><snm>Hooyberghs</snm><fnm>J</fnm></au><au><snm>Baiesi</snm><fnm>M</fnm></au><au><snm>Ferrantini</snm><fnm>A</fnm></au><au><snm>Carlon</snm><fnm>E</fnm></au></aug><source>Phys Rev E Stat Nonlin Soft Matter Phys</source><pubdate>2010</pubdate><volume>81</volume><fpage>012901</fpage></bibl><bibl id="B30"><title><p>The effects of mismatches on hybridization in DNA microarrays: determination of nearest neighbor parameters</p></title><aug><au><snm>Hooyberghs</snm><fnm>J</fnm></au><au><snm>Van Hummelen</snm><fnm>P</fnm></au><au><snm>Carlon</snm><fnm>E</fnm></au></aug><source>Nucleic Acids 
Res</source><pubdate>2009</pubdate><volume>37</volume><fpage>e53</fpage></bibl><bibl id="B31"><title><p>High-resolution identification of chromosomal abnormalities using oligonucleotide arrays containing 116,204 SNPs</p></title><aug><au><snm>Slater</snm><fnm>HR</fnm></au><au><snm>Bailey</snm><fnm>DK</fnm></au><au><snm>Ren</snm><fnm>H</fnm></au><au><snm>Cao</snm><fnm>M</fnm></au><au><snm>Bell</snm><fnm>K</fnm></au><au><snm>Nasioulas</snm><fnm>S</fnm></au><au><snm>Henke</snm><fnm>R</fnm></au><au><snm>Choo</snm><fnm>KH</fnm></au><au><snm>Kennedy</snm><fnm>GC</fnm></au></aug><source>Am J Hum Genet</source><pubdate>2005</pubdate><volume>77</volume><fpage>709</fpage><lpage>726</lpage></bibl><bibl id="B32"><title><p>An improved physico-chemical model of hybridization on high-density oligonucleotide microarrays</p></title><aug><au><snm>Ono</snm><fnm>N</fnm></au><au><snm>Suzuki</snm><fnm>S</fnm></au><au><snm>Furusawa</snm><fnm>C</fnm></au><au><snm>Agata</snm><fnm>T</fnm></au><au><snm>Kashiwagi</snm><fnm>A</fnm></au><au><snm>Shimizu</snm><fnm>H</fnm></au><au><snm>Yomo</snm><fnm>T</fnm></au></aug><source>Bioinformatics</source><pubdate>2008</pubdate><volume>24</volume><fpage>1278</fpage><lpage>1285</lpage></bibl><bibl id="B33"><title><p>Impact of whole genome amplification on analysis of copy number variants</p></title><aug><au><snm>Pugh</snm><fnm>TJ</fnm></au><au><snm>Delaney</snm><fnm>AD</fnm></au><au><snm>Farnoud</snm><fnm>N</fnm></au><au><snm>Flibotte</snm><fnm>S</fnm></au><au><snm>Griffith</snm><fnm>M</fnm></au><au><snm>Li</snm><fnm>HI</fnm></au><au><snm>Qian</snm><fnm>H</fnm></au><au><snm>Farinha</snm><fnm>P</fnm></au><au><snm>Gascoyne</snm><fnm>RD</fnm></au><au><snm>Marra</snm><fnm>MA</fnm></au></aug><source>Nucleic Acids Res</source><pubdate>2008</pubdate><volume>36</volume><fpage>e80</fpage></bibl><bibl id="B34"><title><p>Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and 
progression</p></title><aug><au><snm>Rhodes</snm><fnm>DR</fnm></au><au><snm>Yu</snm><fnm>J</fnm></au><au><snm>Shanker</snm><fnm>K</fnm></au><au><snm>Deshpande</snm><fnm>N</fnm></au><au><snm>Varambally</snm><fnm>R</fnm></au><au><snm>Ghosh</snm><fnm>D</fnm></au><au><snm>Barrette</snm><fnm>T</fnm></au><au><snm>Pandey</snm><fnm>A</fnm></au><au><snm>Chinnaiyan</snm><fnm>AM</fnm></au></aug><source>Proc Natl Acad Sci U S A</source><pubdate>2004</pubdate><volume>101</volume><fpage>9309</fpage><lpage>9314</lpage></bibl><bibl id="B35"><title><p>Singular value decomposition for genome-wide expression data processing and modeling</p></title><aug><au><snm>Alter</snm><fnm>O</fnm></au><au><snm>Brown</snm><fnm>PO</fnm></au><au><snm>Botstein</snm><fnm>D</fnm></au></aug><source>Proc Natl Acad Sci U S A</source><pubdate>2000</pubdate><volume>97</volume><fpage>10101</fpage><lpage>10106</lpage></bibl><bibl id="B36"><title><p>Adjustment of systematic microarray data biases</p></title><aug><au><snm>Benito</snm><fnm>M</fnm></au><au><snm>Parker</snm><fnm>J</fnm></au><au><snm>Du</snm><fnm>Q</fnm></au><au><snm>Wu</snm><fnm>J</fnm></au><au><snm>Xiang</snm><fnm>D</fnm></au><au><snm>Perou</snm><fnm>CM</fnm></au><au><snm>Marron</snm><fnm>JS</fnm></au></aug><source>Bioinformatics</source><pubdate>2004</pubdate><volume>20</volume><fpage>105</fpage><lpage>114</lpage></bibl><bibl id="B37"><title><p>A second generation human haplotype map of over 3.1 million SNPs</p></title><aug><au><cnm>The International HapMap Consortium</cnm></au></aug><source>Nature</source><pubdate>2007</pubdate><volume>449</volume><fpage>851</fpage><lpage>861</lpage></bibl><bibl id="B38"><title><p>Origins and functional impact of copy number variation in the human 
genome</p></title><aug><au><snm>Conrad</snm><fnm>DF</fnm></au><au><snm>Pinto</snm><fnm>D</fnm></au><au><snm>Redon</snm><fnm>R</fnm></au><au><snm>Feuk</snm><fnm>L</fnm></au><au><snm>Gokcumen</snm><fnm>O</fnm></au><au><snm>Zhang</snm><fnm>Y</fnm></au><au><snm>Aerts</snm><fnm>J</fnm></au><au><snm>Andrews</snm><fnm>TD</fnm></au><au><snm>Barnes</snm><fnm>C</fnm></au><au><snm>Campbell</snm><fnm>P</fnm></au><etal/></aug><source>Nature</source><pubdate>2010</pubdate><volume>464</volume><fpage>704</fpage><lpage>712</lpage></bibl><bibl id="B39"><title><p>Free energy of DNA duplex formation on short oligonucleotide microarrays</p></title><aug><au><snm>Zhang</snm><fnm>L</fnm></au><au><snm>Wu</snm><fnm>C</fnm></au><au><snm>Carta</snm><fnm>R</fnm></au><au><snm>Zhao</snm><fnm>H</fnm></au></aug><source>Nucleic Acids Res</source><pubdate>2007</pubdate><volume>35</volume><fpage>e18</fpage></bibl><bibl id="B40"><title><p>A model of molecular interactions on short oligonucleotide microarrays</p></title><aug><au><snm>Zhang</snm><fnm>L</fnm></au><au><snm>Miles</snm><fnm>MF</fnm></au><au><snm>Aldape</snm><fnm>KD</fnm></au></aug><source>Nat Biotechnol</source><pubdate>2003</pubdate><volume>21</volume><fpage>818</fpage><lpage>821</lpage></bibl><bibl id="B41"><title><p>PICNIC: an algorithm to predict absolute allelic copy number variation with microarray cancer data</p></title><aug><au><snm>Greenman</snm><fnm>CD</fnm></au><au><snm>Bignell</snm><fnm>G</fnm></au><au><snm>Butler</snm><fnm>A</fnm></au><au><snm>Edkins</snm><fnm>S</fnm></au><au><snm>Hinton</snm><fnm>J</fnm></au><au><snm>Beare</snm><fnm>D</fnm></au><au><snm>Swamy</snm><fnm>S</fnm></au><au><snm>Santarius</snm><fnm>T</fnm></au><au><snm>Chen</snm><fnm>L</fnm></au><au><snm>Widaa</snm><fnm>S</fnm></au><etal/></aug><source>Biostatistics</source><pubdate>2010</pubdate><volume>11</volume><fpage>164</fpage><lpage>175</lpage></bibl><bibl id="B42"><title><p>PennCNV: an integrated hidden Markov model designed for high-resolution copy number 
variation detection in whole-genome SNP genotyping data</p></title><aug><au><snm>Wang</snm><fnm>K</fnm></au><au><snm>Li</snm><fnm>M</fnm></au><au><snm>Hadley</snm><fnm>D</fnm></au><au><snm>Liu</snm><fnm>R</fnm></au><au><snm>Glessner</snm><fnm>J</fnm></au><au><snm>Grant</snm><fnm>SF</fnm></au><au><snm>Hakonarson</snm><fnm>H</fnm></au><au><snm>Bucan</snm><fnm>M</fnm></au></aug><source>Genome Res</source><pubdate>2007</pubdate><volume>17</volume><fpage>1665</fpage><lpage>1674</lpage></bibl><bibl id="B43"><title><p>A tutorial on hidden Markov models and selected applications in speech recognition</p></title><aug><au><snm>Rabiner</snm><fnm>LR</fnm></au></aug><source>Proceedings of the IEEE</source><pubdate>1989</pubdate><volume>77</volume><fpage>257</fpage><lpage>286</lpage></bibl><bibl id="B44"><title><p>Estimation and assessment of raw copy numbers at the single locus level</p></title><aug><au><snm>Bengtsson</snm><fnm>H</fnm></au><au><snm>Irizarry</snm><fnm>R</fnm></au><au><snm>Carvalho</snm><fnm>B</fnm></au><au><snm>Speed</snm><fnm>TP</fnm></au></aug><source>Bioinformatics</source><pubdate>2008</pubdate><volume>24</volume><fpage>759</fpage><lpage>767</lpage></bibl><bibl id="B45"><title><p>Integrated detection and population-genetic analysis of SNPs and copy number variation</p></title><aug><au><snm>McCarroll</snm><fnm>SA</fnm></au><au><snm>Kuruvilla</snm><fnm>FG</fnm></au><au><snm>Korn</snm><fnm>JM</fnm></au><au><snm>Cawley</snm><fnm>S</fnm></au><au><snm>Nemesh</snm><fnm>J</fnm></au><au><snm>Wysoker</snm><fnm>A</fnm></au><au><snm>Shapero</snm><fnm>MH</fnm></au><au><snm>de Bakker</snm><fnm>PI</fnm></au><au><snm>Maller</snm><fnm>JB</fnm></au><au><snm>Kirby</snm><fnm>A</fnm></au><etal/></aug><source>Nat Genet</source><pubdate>2008</pubdate><volume>40</volume><fpage>1166</fpage><lpage>1174</lpage></bibl><bibl id="B46"><title><p>Inverse Langmuir method for oligonucleotide microarray 
analysis</p></title><aug><au><snm>Mulders</snm><fnm>GC</fnm></au><au><snm>Barkema</snm><fnm>GT</fnm></au><au><snm>Carlon</snm><fnm>E</fnm></au></aug><source>BMC Bioinformatics</source><pubdate>2009</pubdate><volume>10</volume><fpage>64</fpage></bibl><bibl id="B47"><title><p>De novo CNVs in bipolar disorder: recurrent themes or new directions?</p></title><aug><au><snm>Girirajan</snm><fnm>S</fnm></au><au><snm>Eichler</snm><fnm>EE</fnm></au></aug><source>Neuron</source><pubdate>2011</pubdate><volume>72</volume><fpage>885</fpage><lpage>887</lpage></bibl><bibl id="B48"><title><p>An evidence-based approach to establish the functional and clinical significance of copy number variants in intellectual and developmental disabilities</p></title><aug><au><snm>Kaminsky</snm><fnm>EB</fnm></au><au><snm>Kaul</snm><fnm>V</fnm></au><au><snm>Paschall</snm><fnm>J</fnm></au><au><snm>Church</snm><fnm>DM</fnm></au><au><snm>Bunke</snm><fnm>B</fnm></au><au><snm>Kunig</snm><fnm>D</fnm></au><au><snm>Moreno-De-Luca</snm><fnm>D</fnm></au><au><snm>Moreno-De-Luca</snm><fnm>A</fnm></au><au><snm>Mulle</snm><fnm>JG</fnm></au><au><snm>Warren</snm><fnm>ST</fnm></au><etal/></aug><source>Genet Med</source><pubdate>2011</pubdate><volume>13</volume><fpage>777</fpage><lpage>784</lpage></bibl><bibl id="B49"><title><p>Multiple recurrent de novo CNVs, including duplications of the 7q11.23 Williams syndrome region, are strongly associated with autism</p></title><aug><au><snm>Sanders</snm><fnm>SJ</fnm></au><au><snm>Ercan-Sencicek</snm><fnm>AG</fnm></au><au><snm>Hus</snm><fnm>V</fnm></au><au><snm>Luo</snm><fnm>R</fnm></au><au><snm>Murtha</snm><fnm>MT</fnm></au><au><snm>Moreno-De-Luca</snm><fnm>D</fnm></au><au><snm>Chu</snm><fnm>SH</fnm></au><au><snm>Moreau</snm><fnm>MP</fnm></au><au><snm>Gupta</snm><fnm>AR</fnm></au><au><snm>Thomson</snm><fnm>SA</fnm></au><etal/></aug><source>Neuron</source><pubdate>2011</pubdate><volume>70</volume><fpage>863</fpage><lpage>885</lpage></bibl><bibl id="B50"><title><p>High 
frequencies of de novo CNVs in bipolar disorder and schizophrenia</p></title><aug><au><snm>Malhotra</snm><fnm>D</fnm></au><au><snm>McCarthy</snm><fnm>S</fnm></au><au><snm>Michaelson</snm><fnm>JJ</fnm></au><au><snm>Vacic</snm><fnm>V</fnm></au><au><snm>Burdick</snm><fnm>KE</fnm></au><au><snm>Yoon</snm><fnm>S</fnm></au><au><snm>Cichon</snm><fnm>S</fnm></au><au><snm>Corvin</snm><fnm>A</fnm></au><au><snm>Gary</snm><fnm>S</fnm></au><au><snm>Gershon</snm><fnm>ES</fnm></au><etal/></aug><source>Neuron</source><pubdate>2011</pubdate><volume>72</volume><fpage>951</fpage><lpage>963</lpage></bibl><bibl id="B51"><title><p>CNVs: Harbingers of a Rare Variant Revolution in Psychiatric Genetics</p></title><aug><au><snm>Malhotra</snm><fnm>D</fnm></au><au><snm>Sebat</snm><fnm>J</fnm></au></aug><source>Cell</source><pubdate>2012</pubdate><volume>148</volume><fpage>1223</fpage><lpage>1241</lpage></bibl><bibl id="B52"><title><p>Population analysis of large copy number variants and hotspots of human genetic disease</p></title><aug><au><snm>Itsara</snm><fnm>A</fnm></au><au><snm>Cooper</snm><fnm>GM</fnm></au><au><snm>Baker</snm><fnm>C</fnm></au><au><snm>Girirajan</snm><fnm>S</fnm></au><au><snm>Li</snm><fnm>J</fnm></au><au><snm>Absher</snm><fnm>D</fnm></au><au><snm>Krauss</snm><fnm>RM</fnm></au><au><snm>Myers</snm><fnm>RM</fnm></au><au><snm>Ridker</snm><fnm>PM</fnm></au><au><snm>Chasman</snm><fnm>DI</fnm></au><etal/></aug><source>Am J Hum Genet</source><pubdate>2009</pubdate><volume>84</volume><fpage>148</fpage><lpage>161</lpage></bibl></refgrp><sec><st><p>Pre-publication history</p></st><p>The pre-publication history for this paper can be accessed here:</p><p><url>http://www.biomedcentral.com/1755-8794/5/24/prepub</url></p></sec></bm></art>