<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2156-9-27</ui>
   <ji>1471-2156</ji>
   <fm>
      <dochead>Research article</dochead>
      <bibl>
         <title>
            <p>Improved detection of global copy number variation using high density, non-polymorphic oligonucleotide probes</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Shen</snm>
               <fnm>Fan</fnm>
               <insr iid="I1"/>
               <email>fan_shen@affymetrix.com</email>
            </au>
            <au id="A2">
               <snm>Huang</snm>
               <fnm>Jing</fnm>
               <insr iid="I1"/>
               <email>jing_huang@comcast.net</email>
            </au>
            <au id="A3">
               <snm>Fitch</snm>
               <mi>R</mi>
               <fnm>Karen</fnm>
               <insr iid="I1"/>
               <email>karen_fitch@affymetrix.com</email>
            </au>
            <au id="A4">
               <snm>Truong</snm>
               <mi>B</mi>
               <fnm>Vivi</fnm>
               <insr iid="I1"/>
               <email>vivi_truong@affymetrix.com</email>
            </au>
            <au id="A5">
               <snm>Kirby</snm>
               <fnm>Andrew</fnm>
               <insr iid="I2"/>
               <email>ankirby@mac.com</email>
            </au>
            <au id="A6">
               <snm>Chen</snm>
               <fnm>Wenwei</fnm>
               <insr iid="I1"/>
               <email>joyce_chen@affymetrix.com</email>
            </au>
            <au id="A7">
               <snm>Zhang</snm>
               <fnm>Jane</fnm>
               <insr iid="I1"/>
               <email>jane_zhang@affymetrix.com</email>
            </au>
            <au id="A8">
               <snm>Liu</snm>
               <fnm>Guoying</fnm>
               <insr iid="I1"/>
               <email>guoying_liu@affymetrix.com</email>
            </au>
            <au id="A9">
               <snm>McCarroll</snm>
               <mi>A</mi>
               <fnm>Steven</fnm>
               <insr iid="I3"/>
               <email>mccarroll@molbio.mgh.harvard.edu</email>
            </au>
            <au id="A10">
               <snm>Jones</snm>
               <mi>W</mi>
               <fnm>Keith</fnm>
               <insr iid="I1"/>
               <email>keith_jones@affymetrix.com</email>
            </au>
            <au id="A11" ca="yes">
               <snm>Shapero</snm>
               <mi>H</mi>
               <fnm>Michael</fnm>
               <insr iid="I1"/>
               <email>michael_shapero@affymetrix.com</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Affymetrix, Inc. 3420 Central Expressway; Santa Clara, CA 95051, USA</p>
            </ins>
            <ins id="I2">
               <p>Center for Human Genetic Research, Massachusetts General Hospital, Boston, MA 02114, USA</p>
            </ins>
            <ins id="I3">
               <p>Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA</p>
            </ins>
         </insg>
         <source>BMC Genetics</source>
         <issn>1471-2156</issn>
         <pubdate>2008</pubdate>
         <volume>9</volume>
         <issue>1</issue>
         <fpage>27</fpage>
         <url>http://www.biomedcentral.com/1471-2156/9/27</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">18373861</pubid>
               <pubid idtype="doi">10.1186/1471-2156-9-27</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>31</day>
               <month>10</month>
               <year>2007</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>28</day>
               <month>3</month>
               <year>2008</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>28</day>
               <month>3</month>
               <year>2008</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2008</year>
         <collab>Shen et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>DNA sequence diversity within the human genome may be more greatly affected by copy number variations (CNVs) than by single nucleotide polymorphisms (SNPs). Although the importance of CNVs in genome-wide association studies (GWAS) is becoming widely accepted, the optimal methods for identifying these variants are still under evaluation. We have previously reported a comprehensive view of CNVs in the HapMap DNA collection using high-density 500 K EA (Early Access) SNP genotyping arrays, which revealed more than 1,000 CNVs ranging in size from 1 kb to over 3 Mb. Although the arrays used most commonly for GWAS predominantly interrogate SNPs, CNV identification and detection do not necessarily require DNA probes centered on polymorphic nucleotides and may even be hindered by the dependence on a successful SNP genotyping assay.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>In this study, we have designed and evaluated a high-density array predicated on the use of non-polymorphic oligonucleotide probes for CNV detection. This approach effectively uncouples copy number detection from SNP genotyping and thus has the potential to significantly improve probe coverage for genome-wide CNV identification. This array, in conjunction with PCR-based, complexity-reduced DNA target, queries over 1.3 M independent NspI restriction enzyme fragments in the 200 bp to 1100 bp size range, a severalfold increase in marker density compared with the 500 K EA array. In addition, a novel algorithm was developed and validated to extract CNV regions and boundaries.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>Using a well-characterized pair of DNA samples, close to 200 CNVs were identified, of which nearly 50% appear novel yet were independently validated using quantitative PCR. The results indicate that non-polymorphic probes provide a robust approach for CNV identification, and the increasing precision of CNV boundary delineation should allow a more complete analysis of their genomic organization.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>With the completion of the human genome sequence, it is generally accepted that any two individuals are ~99.9% identical at the nucleotide level, and that the presence of single nucleotide polymorphisms (SNPs) in the genome is the major contributor to genetic diversity among humans <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. In part due to the accuracy and ease with which they can be scored, along with their stability and abundance in the genome, SNPs have become the marker of choice for whole genome association studies that use linkage disequilibrium (LD) mapping to identify genes involved in complex diseases <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr></abbrgrp>. Over the last several decades, it has also been accepted that DNA copy number changes can occur among individuals, albeit in the context of limited and specific loci within the genome. These changes span a spectrum from, for example, an extra copy of an entire chromosome (trisomy 21) in Down's syndrome to sub-chromosomal deletions responsible for genetic traits such as color blindness and &#945; and &#946; thalassemias <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. However, this paradigm of genetic variation underwent a major revision in 2004 with the identification of genome-wide copy number variants (CNVs) that occur among phenotypically normal individuals <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>. Since these initial reports, a large number of studies have described the widespread and global distribution of CNVs in the genome <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr></abbrgrp>. 
As the cataloguing of CNVs in the genome continues, new studies are also aimed at understanding their function in normal cellular processes such as drug metabolism <abbrgrp><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr></abbrgrp> and gene expression <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>, in human disease susceptibility <abbrgrp><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr></abbrgrp> and developmental disorders <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>, and in the natural selection process <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>. Lastly, the role of CNVs in genomic disorders further underscores how profoundly gene function can be adversely affected in a multitude of ways that can lead to disease <abbrgrp><abbr bid="B26">26</abbr><abbr bid="B27">27</abbr><abbr bid="B28">28</abbr><abbr bid="B29">29</abbr></abbrgrp>. Recent estimates of the contribution of CNVs to total nucleotide diversity per genome range from 9 to 30 Mb and thus exceed the ~3 Mb estimated to be due to SNPs <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B9">9</abbr><abbr bid="B30">30</abbr></abbrgrp>. In fact, a recent comparison of the genome sequence of an individual human with the NCBI human reference assembly suggested that DNA copy number variable regions contribute ~10 Mb to sequence heterogeneity <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>. These results underlie the growing appreciation of the need to account for CNVs in genome-wide association studies. Although some common CNVs are in LD with SNPs and can therefore be assayed indirectly through SNP genotyping, a significant fraction of CNVs (particularly those in duplication-rich regions of the genome) are not well captured by available SNP marker sets <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B12">12</abbr><abbr bid="B14">14</abbr><abbr bid="B32">32</abbr></abbrgrp>. Furthermore, even taggable CNVs need to be accurately typed before appropriate markers can be identified. 
Thus, there is still an ongoing need to develop molecular methods capable of direct and accurate detection of CNVs in order for this new class of polymorphisms to be effectively incorporated into genome-wide LD mapping of genes involved in human disease <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>.</p>
         <p>Structural variation in the genome includes deletions, insertions, duplications, and inversions, which are classified by size as fine-scale (1&#8211;500 bp), intermediate-scale (500 bp&#8211;100 kb), or large-scale (>100 kb). Although there are many different molecular cytogenetic techniques that can be used to assess variants when one or several specific targeted loci are under investigation <abbrgrp><abbr bid="B26">26</abbr><abbr bid="B34">34</abbr><abbr bid="B35">35</abbr></abbrgrp>, there are only a limited number of approaches that provide genome-wide characterization, namely direct sequencing approaches such as fosmid paired-end sequencing <abbrgrp><abbr bid="B15">15</abbr></abbrgrp> or Paired-End Mapping (PEM) <abbrgrp><abbr bid="B30">30</abbr></abbrgrp> and array-based methods. Array-based methods that have been applied to CNV identification include the use of BAC clones <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp> and both long <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B36">36</abbr></abbrgrp> and short oligonucleotide probes <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B12">12</abbr><abbr bid="B37">37</abbr></abbrgrp>. In 2006, we reported a comprehensive analysis of CNVs in the HapMap DNA collection using two complementary platforms, namely BAC-array CGH and the 500 K EA high-density genotyping array. While these two approaches often identified the same CNVs, there were differences in the types of CNVs unique to each approach. For example, while the 500 K EA array tended to identify smaller CNVs with higher border resolution, the BAC-array CGH approach was able to interrogate regions of the genome that are not easily amenable to SNP genotyping due to the presence of low copy repeat structures (segmental duplications). 
As a means to uncouple CNV identification from the requirement for SNP genotypes, we have designed and evaluated an array that uses non-polymorphic 25-mer probes in combination with a PCR-based, reduced complexity DNA target. This array has been used for high resolution analysis of DNA deletions in Gorlin syndrome samples <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>, and in this report we show, using a well-characterized pair of DNA samples in conjunction with a novel CNV detection algorithm, that nearly 200 CNVs are identified, of which over 120 had not previously been described in this specific sample pair. All novel CNVs were evaluated using an independent QPCR-based method, and the overall results show a verification rate of nearly 85%. Thus, DNA probes designed to sites in the genome that do not contain SNPs are effective for CNV identification and, when combined with probes used for SNP genotyping, provide a potentially powerful approach for the integration of CNVs and SNPs into genome-wide association studies.</p>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <p>Whole genome sampling analysis (WGSA) uses single primer PCR in combination with adapter-ligated, restriction enzyme-digested genomic DNA as template to selectively and reproducibly amplify genomic fractions <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>. Based on <it>in silico </it>NspI restriction enzyme digestion of the human reference genome (Build 35), over 1.33 million independent fragments are predicted in the 200 bp to 1100 bp size range. The 500 K EA array, which was previously used for genome-wide CNV detection, uses both NspI and StyI PCR representations on two individual arrays. In this configuration, the NspI WGSA target interrogates ~250 K SNPs, each of which generally resides on a unique restriction fragment. Thus only ~20% (0.25 M/1.3 M) of the <it>in silico </it>predicted NspI fragments are estimated to be represented on the 500 K EA array in the form of probes querying SNPs. Since the NspI PCR target has an estimated complexity of 550 Mb, it could potentially serve as a means to interrogate a significant fraction of the genome provided that two key criteria are met: that these sequences can be reliably amplified by PCR during WGSA, and that probes for all fragments are represented on the array and function in a specific manner in DNA hybridization. To this end, a new array was designed using non-polymorphic probes (referred to as the Nsp copy number (CN) array) for the purpose of CNV detection.</p>
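         <p>The <it>in silico </it>digestion and size selection described above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the function names (digest, size_select) are hypothetical, the sketch assumes the NspI recognition sequence RCATGY with cleavage between G and Y, it scans a single strand of a toy sequence, and it does not require a fragment to be flanked by NspI sites on both ends.</p>

```python
import re

# Assumed NspI recognition site: RCATGY (R = A or G, Y = C or T).
# The lookahead lets overlapping sites (e.g. ACATGCATGC) all be found.
NSPI_SITE = re.compile(r"(?=([AG]CATG[CT]))")
CUT_OFFSET = 5  # assumed cut position: between the G and the Y


def digest(seq):
    """Return the fragments produced by cutting seq at every site."""
    cuts = [m.start() + CUT_OFFSET for m in NSPI_SITE.finditer(seq.upper())]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


def size_select(fragments, min_bp=200, max_bp=1100):
    """Keep fragments in the size range quoted in the text."""
    return [f for f in fragments if max_bp >= len(f) >= min_bp]


if __name__ == "__main__":
    # Toy sequence with two sites; real input would be a reference chromosome.
    toy = "T" * 300 + "ACATGC" + "T" * 50 + "GCATGT" + "T" * 2000
    fragments = digest(toy)
    print([len(f) for f in fragments])
    print(len(size_select(fragments)))
    # Share of predicted fragments carrying a 500 K EA SNP, per the text.
    print(round(0.25e6 / 1.33e6, 2))
```

         <p>With the figures quoted above, 0.25 M SNP-bearing fragments out of roughly 1.33 M predicted fragments gives a ratio of about 0.19, consistent with the ~20% estimate.</p>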
         <p>The Nsp CN array contains eight to ten independent, non-polymorphic probes per restriction fragment, which were selected based on intrinsic criteria (see Methods). Globally, these arrays, in combination with NspI WGSA target only, provide greater probe coverage than the 500 K EA genotyping arrays, which used both NspI and StyI WGSA fractions (Figure <figr fid="F1">1</figr>). The median inter-marker distance for the Nsp CN arrays is 776 bp, compared to 2709 bp for 500 K EA probes <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>. As expected, genome coverage is improved. For example, at an inter-marker distance of 2.5 kb, the 500 K EA array covers ~46% of the genome whereas coverage increases to over 84% with the Nsp CN array. Because the selection of probe sequences is no longer constrained to SNPs, this array design also has improved coverage in regions likely to contain CNVs, such as segmental duplications <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. For example, while only 25.7% of segmental duplications contain at least one SNP found on the 500 K EA array, 90.3% of segmental duplications are represented by probes from at least one restriction fragment on the Nsp CN array before probe filtering (Table <tblr tid="T1">1</tblr>).</p>
         <tbl id="T1">
            <title>
               <p>Table 1</p>
            </title>
            <caption>
               <p>Coverage of segmental duplication regions by 500 K EA and Nsp CN arrays.</p>
            </caption>
            <tblbdy cols="13">
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>
                        <b>500 K EA</b>
                     </p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c cspan="10" ca="center">
                     <p>
                        <b>Nsp CN array</b>
                     </p>
                  </c>
               </r>
               <r>
                  <c cspan="13">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>Before probe filtering</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>After probe filtering</p>
                  </c>
                  <c ca="center">
                     <p>After local-correction filtering</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>After probe filtering</p>
                  </c>
                  <c ca="center">
                     <p>After local-correction filtering</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>After probe filtering</p>
                  </c>
                  <c ca="center">
                     <p>After local-correction filtering</p>
                  </c>
               </r>
               <r>
                  <c cspan="13">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c cspan="2" ca="center">
                     <p>
                        <b>Data set 1</b>
                     </p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c cspan="2" ca="center">
                     <p>
                        <b>Data set 2</b>
                     </p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c cspan="2" ca="center">
                     <p>
                        <b>Data set 3</b>
                     </p>
                  </c>
               </r>
               <r>
                  <c cspan="9">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>At least one marker</p>
                  </c>
                  <c ca="center">
                     <p>25.7%</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>90.3%</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>74.1%</p>
                  </c>
                  <c ca="center">
                     <p>73.5%</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>74.3%</p>
                  </c>
                  <c ca="center">
                     <p>73.8%</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>74.0%</p>
                  </c>
                  <c ca="center">
                     <p>73.0%</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>At least two markers</p>
                  </c>
                  <c ca="center">
                     <p>13.4%</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>85.2%</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>61.7%</p>
                  </c>
                  <c ca="center">
                     <p>60.5%</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>61.8%</p>
                  </c>
                  <c ca="center">
                     <p>60.7%</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>61.6%</p>
                  </c>
                  <c ca="center">
                     <p>60.3%</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>At least three markers</p>
                  </c>
                  <c ca="center">
                     <p>7.7%</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>78.1%</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>50.4%</p>
                  </c>
                  <c ca="center">
                     <p>49.2%</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>50.7%</p>
                  </c>
                  <c ca="center">
                     <p>49.5%</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>50.2%</p>
                  </c>
                  <c ca="center">
                     <p>49.1%</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>At least four markers</p>
                  </c>
                  <c ca="center">
                     <p>5%</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>69.7%</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>40.7%</p>
                  </c>
                  <c ca="center">
                     <p>39.1%</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>41.0%</p>
                  </c>
                  <c ca="center">
                     <p>39.3%</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>40.7%</p>
                  </c>
                  <c ca="center">
                     <p>39.3%</p>
                  </c>
               </r>
            </tblbdy>
            <tblfn>
               <p>Note: Each data set represents a replicate of 1X&#8211;5X samples. For the 500 K EA array, markers refer to SNPs; for the Nsp CN array, markers refer to Nsp fragments.</p>
               <p>Segmental duplication data source [80]</p>
            </tblfn>
         </tbl>
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>Genome coverage of the Nsp CN array before and after probe filtering compared with 500 K EA arrays</p>
            </caption>
            <text>
               <p><b>Genome coverage of the Nsp CN array before and after probe filtering compared with 500 K EA arrays</b>. The X-axis is the distance between any given point in the gap-adjusted genome and the closest marker. The curve shows the proportion of the genome for which the closest marker lies within a given distance. For example, for the Nsp CN array markers after probe filtering, 99.0% of the genome is less than 10 kb away from an Nsp fragment marker (compared to 99.8% before probe filtering), while for the 500 K EA selected SNPs, only 84.9% of the genome has a SNP within 10 kb.</p>
            </text>
            <graphic file="1471-2156-9-27-1"/>
         </fig>
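         <p>The coverage measure plotted in Figure 1 can be sketched as follows. This is an illustrative reading of the caption and not the code used to produce the figure: the function name coverage_fraction is hypothetical, markers are treated as single coordinates on one contiguous sequence, and assembly gaps are ignored.</p>

```python
def coverage_fraction(markers, genome_len, d):
    """Fraction of positions in [0, genome_len] lying within d of a marker.

    markers    -- iterable of marker coordinates (hypothetical input)
    genome_len -- length of the sequence being covered
    d          -- maximum allowed distance to the closest marker
    """
    covered = 0
    reach = 0  # right edge of the region that has already been counted
    for p in sorted(markers):
        start = max(p - d, reach, 0)    # do not recount overlapping windows
        end = min(p + d, genome_len)    # clip at the end of the sequence
        if end > start:
            covered += end - start
            reach = end
    return covered / genome_len


if __name__ == "__main__":
    # Two toy markers on a 1000 bp sequence, evaluated at two distances.
    print(coverage_fraction([100, 200], 1000, 50))
    print(coverage_fraction([100, 200], 1000, 100))
```

         <p>Evaluating this function over a range of distances for each marker set would trace one curve of the kind shown in the figure.</p>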
         <sec>
            <st>
               <p>Assay and array performance</p>
            </st>
            <p>Although the human reference genome is commonly used to predict outcomes of <it>in silico </it>restriction enzyme digestions, the precise relationship between all expected fragments, regardless of whether they contain a SNP or not, and the WGSA target output has not been systematically evaluated <abbrgrp><abbr bid="B40">40</abbr><abbr bid="B41">41</abbr></abbrgrp>. The Nsp CN array, which contains multiple independent probes per fragment, was used to evaluate how well each fragment is represented by the WGSA assay. For this purpose, the difference was estimated between probe-specific background (using a pooled panel of 'antigenomic' probes that are not present in the human genome and which vary in GC content in a similar manner to the perfect match probes <abbrgrp><abbr bid="B42">42</abbr></abbrgrp>) and the target-dependent probe signal, using a set of five genomic DNA samples that contain different numbers of X chromosomes (designated as the 1X to 5X sample set). Using a probe sequence-specific background model (see Methods), >97% of all probes show an intensity that is higher than background in each individual sample and >94% of all probes are detected above background when all 5 samples are evaluated together as a group (Table <tblr tid="T2">2</tblr>). Although this metric does not measure the specificity of the signal per se, but only whether the signal exceeds the background level, it does suggest that nearly all predicted restriction fragments are actually represented in the PCR target at a concentration sufficient for detection by hybridization. The small remaining set of non-responsive fragments could result from problems with restriction enzyme digestion, PCR amplification, or hybridization, or from sequence differences between the human genome reference sequence and the genomes of the samples being tested.</p>
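            <p>The idea of calling a probe "above background" relative to GC-matched antigenomic probes can be sketched as below. The actual background model is described in Methods and is not reproduced here; this sketch assumes, purely for illustration, that the threshold for a probe is a high quantile of the antigenomic intensities in the same GC-count bin. The function names and the 0.95 quantile are hypothetical.</p>

```python
def gc_count(probe_seq):
    """Number of G or C bases in a probe sequence."""
    return sum(base in "GC" for base in probe_seq.upper())


def background_thresholds(antigenomic, quantile=0.95):
    """Map each GC count to an intensity threshold.

    antigenomic -- list of (probe_seq, intensity) pairs for background probes
    quantile    -- assumed cut-off; not taken from the paper
    """
    bins = {}
    for seq, intensity in antigenomic:
        bins.setdefault(gc_count(seq), []).append(intensity)
    thresholds = {}
    for gc, values in bins.items():
        values.sort()
        index = min(len(values) - 1, int(quantile * len(values)))
        thresholds[gc] = values[index]
    return thresholds


def above_background(probe_seq, intensity, thresholds):
    """True if the probe intensity exceeds its GC-matched threshold.

    Raises KeyError if no antigenomic probe shares the probe's GC count.
    """
    return intensity > thresholds[gc_count(probe_seq)]


if __name__ == "__main__":
    # Toy 4-mer "probes"; real probes on the array are 25-mers.
    background = [("AAAA", 10), ("AAAA", 20), ("GGGG", 100), ("GGGG", 200)]
    thr = background_thresholds(background)
    print(above_background("ATAT", 25, thr))
    print(above_background("GCGC", 150, thr))
```

            <p>Applying such a test to every probe in each sample, and then requiring a positive call in all five samples, would yield counts of the kind summarized in Table 2.</p>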
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Estimation of number of probes that respond to target and display an intensity above the background</p>
               </caption>
               <tblbdy cols="7">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="2" ca="center">
                        <p>
                           <b>Probes above background in each sample</b>
                        </p>
                     </c>
                     <c cspan="4" ca="center">
                        <p>
                           <b>Probes and fragments above background in 5/5 samples</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Probe count</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Percentage</b>
                        </p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>
                           <b>Probes # (%)</b>
                        </p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>
                           <b>Fragment # (%)</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="6" ca="center">
                        <p>
                           <b>data set 1</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>Sample1</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>12,017,471</p>
                     </c>
                     <c ca="center">
                        <p>97.47%</p>
                     </c>
                     <c ca="center">
                        <p>11,786,082</p>
                     </c>
                     <c ca="center">
                        <p>(95.59%)</p>
                     </c>
                     <c ca="center">
                        <p>1,329,822</p>
                     </c>
                     <c ca="center">
                        <p>(99.96%)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>Sample2</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>12,025,953</p>
                     </c>
                     <c ca="center">
                        <p>97.54%</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>Sample3</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>12,075,266</p>
                     </c>
                     <c ca="center">
                        <p>97.94%</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>Sample4</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>12,092,454</p>
                     </c>
                     <c ca="center">
                        <p>98.08%</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>Sample5</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>12,080,046</p>
                     </c>
                     <c ca="center">
                        <p>97.97%</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="6" ca="center">
                        <p>
                           <b>data set 2</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>Sample1</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>11,980,266</p>
                     </c>
                     <c ca="center">
                        <p>97.17%</p>
                     </c>
                     <c ca="center">
                        <p>11,697,525</p>
                     </c>
                     <c ca="center">
                        <p>(94.87%)</p>
                     </c>
                     <c ca="center">
                        <p>1,329,806</p>
                     </c>
                     <c ca="center">
                        <p>(99.96%)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>Sample2</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>12,053,875</p>
                     </c>
                     <c ca="center">
                        <p>97.76%</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>Sample3</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>12,056,015</p>
                     </c>
                     <c ca="center">
                        <p>97.78%</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>Sample4</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>12,039,968</p>
                     </c>
                     <c ca="center">
                        <p>97.65%</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>Sample5</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>11,981,189</p>
                     </c>
                     <c ca="center">
                        <p>97.17%</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="6" ca="center">
                        <p>
                           <b>data set 3</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>Sample1</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>11,965,896</p>
                     </c>
                     <c ca="center">
                        <p>97.05%</p>
                     </c>
                     <c ca="center">
                        <p>11,687,506</p>
                     </c>
                     <c ca="center">
                        <p>(94.79%)</p>
                     </c>
                     <c ca="center">
                        <p>1,329,818</p>
                     </c>
                     <c ca="center">
                        <p>(99.96%)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>Sample2</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>12,061,150</p>
                     </c>
                     <c ca="center">
                        <p>97.82%</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>Sample3</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>12,027,025</p>
                     </c>
                     <c ca="center">
                        <p>97.54%</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>Sample4</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>12,060,619</p>
                     </c>
                     <c ca="center">
                        <p>97.82%</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>Sample5</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>12,040,767</p>
                     </c>
                     <c ca="center">
                        <p>97.66%</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Note: Each data set represents a replicate of 1X&#8211;5X samples.</p>
               </tblfn>
            </tbl>
            <p>The probes present on the Nsp CN array have not been experimentally selected <it>a priori</it> for high performance with regard to detection of DNA copy number changes. To test whether these probes are sensitive to changes in target dosage, the 1X to 5X DNA samples were used in WGSA and target was hybridized to the arrays for the purpose of X chromosome probe evaluation. Using all probes present on the X chromosome, a clear increase in signal was seen with increasing X chromosome dosage (Additional File <supplr sid="S1">1</supplr>). These results confirm that probes on the Nsp CN array display a dose response for the X chromosome. The use of these DNA samples also allows assessment of individual probe-specific dose response metrics (i.e. regression slope and linear correlation coefficient). For example, under ideal theoretical conditions, a single probe that maps to only one site on the X chromosome, when evaluated with the 1X to 5X sample set, would show a regression slope of 1 when the linear regression is modeled using the log-transformed intensity as the response and the log-transformed copy number as the predictor. Similarly, a linear correlation coefficient of 1 would be expected. Thus, deviation from these ideal values provides an experimental measure of each probe's ability to respond to changes in target concentration. Two examples are shown in Additional Files <supplr sid="S2">2</supplr> and <supplr sid="S3">3</supplr>.</p>
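            <p>The two probe-level dose response metrics described above can be illustrated with a minimal Python sketch. It regresses natural log intensity on natural log copy number for a single probe and returns the slope and the linear correlation coefficient. The function name dose_response and all intensity values are hypothetical and serve only as an illustration; this is not the analysis code used in this study.</p>

```python
import math

def dose_response(copy_numbers, intensities):
    """Return (slope, r) for log(intensity) regressed on log(copy number)."""
    x = [math.log(c) for c in copy_numbers]
    y = [math.log(v) for v in intensities]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross products around the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx                   # least-squares regression slope
    r = sxy / math.sqrt(sxx * syy)      # Pearson correlation coefficient
    return slope, r

# X chromosome copy number of the 1X to 5X samples.
copies = [1, 2, 3, 4, 5]

# Hypothetical ideal probe: intensity is directly proportional to copy number.
ideal = [250.0 * c for c in copies]

# Hypothetical compressed probe: intensity grows as copy number to the power 0.4.
compressed = [250.0 * c ** 0.4 for c in copies]

ideal_slope, ideal_r = dose_response(copies, ideal)
compressed_slope, compressed_r = dose_response(copies, compressed)
print(f"ideal probe:      slope={ideal_slope:.2f}, r={ideal_r:.2f}")
print(f"compressed probe: slope={compressed_slope:.2f}, r={compressed_r:.2f}")
```

            <p>In this sketch the ideal probe yields a slope and a correlation of 1, whereas the compressed probe remains perfectly correlated but has a slope of 0.4, showing why both metrics are informative.</p>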
            <suppl id="S1">
               <title>
                  <p>Additional file 1</p>
               </title>
               <text>
                  <p>Dose response plots of a representative 1X&#8211;5X data set. Panels a-d show the scatter plots of standardized natural log intensity of the 1X, 3X, 4X, and 5X samples relative to the 2X sample. Here, standardization refers to the following data transformation: standardized intensity of chromosome X probe = (intensity of chromosome X probe - mean intensity of the autosomal probes)/standard deviation of the intensity of autosomal probes. Red dots represent randomly selected chromosome X probes and black dots represent randomly selected autosomal probes. The blue lines are the Y = X lines. Panel e shows the relationship between the natural log-transformed intensity and the natural log-transformed copy number. The natural log-transformed mean intensity of all chromosome X probes from the 1X&#8211;5X samples is plotted on the Y-axis and the natural log-transformed copy number is plotted on the X-axis. The blue line is the linear regression line using the natural log-transformed mean intensity as the response and the natural log-transformed copy number as the predictor.</p>
               </text>
               <file name="1471-2156-9-27-S1.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S2">
               <title>
                  <p>Additional file 2</p>
               </title>
               <text>
                  <p>Dose response of probes deteriorates as the number of genomic hits increases. Panel a shows the frequency distribution of genomic matches for a set of 80,000 randomly selected chromosome X probes. Panels b-c are box-plots showing the distribution of linear correlation coefficient and regression slope grouped by the number of genomic hits for this set of 80,000 randomly selected chromosome X probes. Panel d shows the frequency distribution of chromosome X hits for the same set of 80,000 randomly selected chromosome X probes. Panels e-f are box-plots showing the distribution of linear correlation coefficient and regression slope grouped by the number of chromosome X hits for this set of 80,000 randomly selected chromosome X probes. Natural log-transformed normalized (as described in Methods) intensity of chromosome X probes of a representative set of 1X&#8211;5X samples and natural log-transformed copy number were used to calculate linear correlation coefficient and regression slope for each probe.</p>
               </text>
               <file name="1471-2156-9-27-S2.jpeg">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S3">
               <title>
                  <p>Additional file 3</p>
               </title>
               <text>
                  <p>A 2-dimensional histogram showing the distribution of regression slope along with the distribution of natural log-transformed intensity. Natural log-transformed normalized (as described in Methods) intensity of 80,000 randomly selected chromosome X probes of a representative set of 1X&#8211;5X samples and natural log-transformed copy number were used to calculate the regression slope. The black vertical line denotes the maximum log intensity ratio and the green vertical line denotes the top 8% log intensity, above which there are few probes with high regression slopes. The top 10% intensity is used as the cut-off threshold in the probe filtering process.</p>
               </text>
               <file name="1471-2156-9-27-S3.jpeg">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <p>The impact of the number of genomic hits on probe dose response was also evaluated using the X chromosome probe intensities from the 1X&#8211;5X data set (Additional File <supplr sid="S2">2</supplr>). The linear correlation between log(probe intensity) and log(chrX copy number) was calculated for each of the chrX probes after grouping probes by number of perfect-match genomic hits. The Pearson correlation coefficient of each group (Additional File <supplr sid="S2">2B</supplr>) decreased dramatically when the number of genomic hits was greater than two. The relationship between log(probe intensity) and log(chrX copy number) was further modeled by simple linear regression. Again, the regression coefficient (regression line slope, as shown in Additional File <supplr sid="S2">2C</supplr>) grouped by number of genomic matches indicated poorer performance when the probes were complementary to more than two sites in the genome. The same analyses stratified by the number of chromosome X hits, using the same set of chrX probes, gave similar results (Additional File <supplr sid="S2">2D&#8211;2F</supplr>). Although these metrics were also smaller for probes with two genomic matches than for single-match probes, the reduction was not as large as the reduction from two genomic matches to three or more genomic matches. More importantly, since many CNVs are associated with segmental duplication regions, there is an increased likelihood for probes in CNV regions to have two genome hits. Thus, probes with two genome hits were retained in order to allow interrogation of segmental duplication regions (Table <tblr tid="T1">1</tblr>), while probes with more than two genomic hits were removed as described in Methods.</p>
            <p>In addition to the filtering on genomic hits described above, several probe filtering steps were implemented in order to remove poorly performing probes (see Methods). These additional procedures included filtering based on probe GC content, restriction fragment length and GC content, NspI restriction site characteristics, hybridization signal intensities lower than background, hybridization signals that are too bright, and probe sets composed of single probes. Following the probe filtering steps, sequence specific standardization was performed and the probes from each restriction fragment were summarized as described in Methods. At the completion of all filtering steps, ~77% of the initial probes and 92% of the initial restriction fragments were retained in a typical experiment, although the exact number varied for each sample set that was analyzed together (Additional File <supplr sid="S4">4</supplr>). Importantly, genome coverage was not significantly reduced by probe filtering (Figure <figr fid="F1">1</figr>), although coverage in segmental duplication regions with at least one marker was modestly reduced from 90% to 74% (Table <tblr tid="T1">1</tblr>). The overall impact of probe filtering as well as a median polish procedure (Robust Multichip Analysis (RMA)) on dose response was evaluated using the 1X&#8211;5X sample set dose response metrics. The linear correlation coefficient and the regression slope improved significantly in both cases (Additional File <supplr sid="S5">5</supplr>).</p>
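            <p>To make the median polish summarization step concrete, the following Python sketch applies a generic Tukey median polish to a probes-by-samples table of log intensities and reports one value per sample for a fragment. It is a minimal sketch under stated assumptions: the function names median_polish and summarize_fragment, the fixed number of iterations, and the example table are hypothetical, and the code is not the RMA implementation used in this study (see Methods for the actual procedure).</p>

```python
from statistics import median

def median_polish(table, n_iter=10):
    """Tukey median polish of a probes-by-samples table of log intensities.

    Returns (overall, probe_effects, sample_effects, residuals).
    """
    n_rows, n_cols = len(table), len(table[0])
    resid = [list(row) for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep out the median of each row (probe).
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep out the median of each column (sample).
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return overall, row_eff, col_eff, resid

def summarize_fragment(table):
    """Summarized log intensity of one fragment in each sample."""
    overall, _, col_eff, _ = median_polish(table)
    return [overall + c for c in col_eff]

# Hypothetical fragment with 3 probes (rows) and 5 samples (columns),
# built as an exactly additive table of probe and sample terms.
probe_effects = [0.0, 0.5, -0.5]
sample_levels = [5.0, 5.7, 6.1, 6.4, 6.6]
table = [[s + p for s in sample_levels] for p in probe_effects]

summary = summarize_fragment(table)
print([round(v, 3) for v in summary])
```

            <p>For this exactly additive table the summarized values recover the per-sample levels, because the median probe effect is zero; with real data the residuals absorb outlying probes, which is the reason a median-based fit is robust.</p>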
            <suppl id="S4">
               <title>
                  <p>Additional file 4</p>
               </title>
               <text>
                  <p>Number of remaining probes and fragments following probe filtering for 3 replicates of 1X&#8211;5X samples. The data indicate the number of probes and fragments that were retained after probe filtering for 3 replicates of the 1X&#8211;5X DNA samples.</p>
               </text>
               <file name="1471-2156-9-27-S4.xls">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S5">
               <title>
                  <p>Additional file 5</p>
               </title>
               <text>
                  <p>Dose response of probes improves after probe filtering and RMA procedure. Linear correlation coefficient and regression slope were calculated from natural log-transformed intensity and natural log-transformed copy number for a representative set of 1X&#8211;5X DNA samples. Blue bars: all probes, using the normalized (as described in Methods) intensity of 80,000 randomly selected chromosome X probes. Grey bars: filtered probes, using the normalized intensity of the 64,035 of these 80,000 probes that remained after filtering. Red bars: fragments, using the post-RMA chromosome X probe set intensity.</p>
               </text>
               <file name="1471-2156-9-27-S5.jpeg">
                  <p>Click here for file</p>
               </file>
            </suppl>
         </sec>
         <sec>
            <st>
               <p>Detection of copy number polymorphisms</p>
            </st>
            <p>To evaluate the capability of the Nsp CN array to identify CNVs, multiple independent replicates of two well-characterized DNA samples (NA15510 as the test sample and NA10851 as the reference sample) that contain known copy number variations were used. Although CNVs in these two samples have previously been identified using high density oligonucleotide arrays <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B37">37</abbr></abbrgrp>, we hypothesized that improved probe density in regions devoid of SNPs, such as segmental duplications, should lead to the discovery of additional variants. For this purpose, a novel algorithm was developed to identify copy number variation regions. This algorithm contains three major parts as depicted in Figure <figr fid="F2">2</figr>. Intensity pre-processing includes probe filtering, standardization that takes into account probe-specific metrics known to influence hybridization and signal intensity, and probe set summarization to provide a single measurement for each fragment. The genome segmentation step first removes outlier fragments, then uses kernel smoothing to improve the signal-to-noise ratio, and finally applies a regression-tree-based method to divide the genome into consecutive regions. Lastly, CNV region identification is achieved by a permutation-based test to define the significance threshold. The training set data for tuning various algorithm parameters (see Methods) consisted of a single replicate of NA15510 compared to NA10851. Tuned parameters were then used in subsequent analyses that included two independent test sets of NA15510 versus NA10851 as well as several HapMap trio samples.</p>
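            <p>The kernel smoothing step mentioned above can be sketched in Python as a weighted moving average of fragment log ratios along the genome. This is only an illustration under stated assumptions: a Gaussian kernel, a bandwidth of 1,500 bp, the function name kernel_smooth, and the example positions and log ratios are all hypothetical and are not taken from this study; the kernel and parameters actually used are described in Methods.</p>

```python
import math

def kernel_smooth(positions, log_ratios, bandwidth):
    """Gaussian kernel (Nadaraya-Watson) smoother of log ratios along the genome.

    positions  : genomic coordinate of each fragment, in bp
    log_ratios : test versus reference log intensity ratio of each fragment
    bandwidth  : kernel standard deviation in bp (assumed value, see lead-in)
    """
    smoothed = []
    for p0 in positions:
        # Weight every fragment by its distance from the current position.
        weights = [math.exp(-0.5 * ((p - p0) / bandwidth) ** 2) for p in positions]
        total = sum(weights)
        smoothed.append(sum(w * v for w, v in zip(weights, log_ratios)) / total)
    return smoothed

# Hypothetical fragments spaced 1 kb apart, with a gain spanning fragments 3 to 6.
positions = [1000 * i for i in range(10)]
log_ratios = [0.05, -0.04, 0.02, 0.55, 0.62, 0.48, 0.58, -0.03, 0.04, -0.02]

smoothed = kernel_smooth(positions, log_ratios, bandwidth=1500.0)
for pos, raw, smooth in zip(positions, log_ratios, smoothed):
    print(f"{pos:6d}  raw={raw:+.2f}  smoothed={smooth:+.2f}")
```

            <p>A larger bandwidth suppresses more noise but also blurs the borders of short CNVs, which is one reason such a parameter needs to be tuned on training data.</p>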
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Overview of the data analysis work flow (see Methods for details)</p>
               </caption>
               <text>
                  <p><b>Overview of the data analysis work flow </b>(see Methods for details).</p>
               </text>
               <graphic file="1471-2156-9-27-2"/>
            </fig>
            <p>Using the two independent test replicates between NA15510 and NA10851, 195 high confidence CNVs were identified in total (gains (98) and losses (97) were represented nearly equally), with 156 CNVs and 175 CNVs found in the two pair-wise comparisons, respectively. This represents, on average, a fivefold increase over the number of CNVs identified in this same sample pair using 500 K EA arrays <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>. In total, 10,126,153 nucleotides were included in these CNV regions, representing 0.355% of the gap-adjusted genome size, and 39.5% of the CNVs overlapped with segmental duplications (Additional File <supplr sid="S6">6A</supplr>). The mean and median sizes of CNVs identified on the Nsp CN array were significantly smaller than those of CNVs found on the 500 K EA arrays (51,930 bp and 20,780 bp versus 293,800 bp and 48,950 bp, respectively), a direct result of the improved probe coverage (Figure <figr fid="F3">3</figr>). There were 121 CNVs identified in both sample sets, corresponding to a reproducibility rate of ~77% (Additional File <supplr sid="S6">6</supplr>). There have been several reports describing CNVs found in this specific pair of samples using multiple detection platforms such as fosmid paired-end sequencing, whole genome tile path (WGTP) BAC array CGH, and 500 K EA arrays <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B15">15</abbr></abbrgrp>. The overlap of the 195 CNVs with this external data set identified 73 CNVs (37.44%) (Additional File <supplr sid="S6">6</supplr>), and thus these were considered to be validated based on the criterion of overlap with previously described CNVs found in these two samples. Interestingly, the average size of CNVs that overlapped with external data was 91,536 bp, compared to an average size of 28,229 bp for those CNVs that did not overlap with external data. By virtue of no overlap with the external data sets, there were 122 novel CNVs.
Of these 122 CNVs, 120 were tested by QPCR, and 94 of the 120 (78.3%) could be validated (Additional File <supplr sid="S6">6</supplr>), indicating that the majority of the novel CNVs represented real but previously unidentified structural variation between NA15510 and NA10851. Taken together, the percentage of the 195 total CNV calls that were validated (based on a combination of external data set overlap and QPCR analysis) was 86.5% and the percentage of CNV calls from each pair-wise comparison that was validated was near 89% (Additional File <supplr sid="S6">6</supplr>). To assess the number of false-positive CNV calls using this array and algorithm, 'self versus self' comparisons using the NA10851 reference sample were carried out. An average false discovery rate of 7.3% was determined (avg # CNV calls NA10851 vs NA10851/avg # CNV calls NA10851 vs NA15510), which is similar to, although slightly lower than, the experimentally identified rate of false positive calls of 11% (100% - 89%) for a test versus reference pair-wise sample comparison.</p>
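            <p>The validation arithmetic in this paragraph can be retraced with a short Python sketch that uses only the counts reported above. Two points are assumptions rather than statements from the text: the reported 86.5% is reproduced here when the two novel CNVs that were not tested by QPCR are left out of the denominator, and the average number of test versus reference calls is taken as the mean of the two pair-wise comparisons (156 and 175). The function name false_discovery_rate is hypothetical.</p>

```python
# Counts reported in the text for NA15510 versus NA10851.
total_calls = 195        # high confidence CNVs from the two test replicates
external_overlap = 73    # CNVs overlapping previously described CNVs
qpcr_tested = 120        # novel CNVs tested by QPCR
qpcr_validated = 94      # novel CNVs confirmed by QPCR

novel_calls = total_calls - external_overlap     # CNVs without external support
untested = novel_calls - qpcr_tested             # novel CNVs not tested by QPCR
validated = external_overlap + qpcr_validated    # validated by either route

qpcr_rate = qpcr_validated / qpcr_tested
rate_all_calls = validated / total_calls
# Assumption: untested CNVs are excluded from the denominator.
rate_evaluated = validated / (total_calls - untested)

def false_discovery_rate(avg_self_calls, avg_test_calls):
    """Ratio of self versus self calls to test versus reference calls."""
    return avg_self_calls / avg_test_calls

# Assumption: average of the two pair-wise comparisons reported in the text.
avg_test_calls = (156 + 175) / 2
# Average self versus self call count implied by the reported 7.3% rate;
# this count is back-calculated here and is not reported in the text.
implied_self_calls = 0.073 * avg_test_calls

print(f"novel CNVs: {novel_calls}, untested: {untested}, validated: {validated}")
print(f"QPCR validation rate: {qpcr_rate:.1%}")
print(f"validated / all calls: {rate_all_calls:.1%}")
print(f"validated / evaluated calls: {rate_evaluated:.1%}")
print(f"implied self versus self calls: {implied_self_calls:.1f}")
```

            <p>Under these assumptions the sketch gives 122 novel CNVs, a QPCR validation rate of 78.3%, and an overall validation rate of 86.5% among the 193 CNVs that were evaluated by at least one route.</p>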
            <suppl id="S6">
               <title>
                  <p>Additional file 6</p>
               </title>
               <text>
                  <p>List of QPCR data and CNV coordinates. Table A represents the coordinates of CNVs in NA15510 vs. NA10851. Table B summarizes QPCR results for NA15510 vs. NA10851. Table C represents QPCR results for the CNV border analysis. Table D represents QPCR results for Mendelian inheritance (MI) errors. Table E lists counts of CNVs in HapMap trio samples NA10846-NA12144-NA12145 and NA10831-NA12155-NA12156.</p>
               </text>
               <file name="1471-2156-9-27-S6.xls">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Size distribution of CNVs detected using the Nsp CN array (red bars) compared with 500 K EA (blue bars) CNVs</p>
               </caption>
               <text>
                  <p>
                     <b>Size distribution of CNVs detected using the Nsp CN array (red bars) compared with 500 K EA (blue bars) CNVs.</b>
                  </p>
               </text>
               <graphic file="1471-2156-9-27-3"/>
            </fig>
            <p>Regions containing low copy repeats are often not detectable with SNP genotyping arrays since SNPs in these regions do not typically perform well <abbrgrp><abbr bid="B43">43</abbr></abbrgrp>. The Nsp CN array contains non-polymorphic probes that are more likely to span duplicated regions, and thus the power to detect CNVs surrounding segmental duplications is increased. From our union list of CNVs identified from two replicates of NA15510 vs NA10851, we identified 77 CNVs (39.5%) that are associated with segmental duplications (Additional File <supplr sid="S6">6</supplr>), compared to 18 CNVs from a similar data set using the 500 K EA array <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. Figure <figr fid="F4">4</figr> illustrates a CNV associated with a segmental duplication.</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Improved ability to detect CNVs in segmental duplication regions</p>
               </caption>
               <text>
                  <p><b>Improved ability to detect CNVs in segmental duplication regions</b>. In this CNV region associated with two segmental duplications, there is one SNP probe on the edge of the region (54347071 bp on chromosome 16, represented by the black dot) on the 500 K EA array, but multiple probes present on the Nsp CN array. The three panels represent three independent replicates (one training replicate (data set 2) and two test replicates (data set 1 and data set 3)) of the test sample NA15510 and the reference sample NA10851 on the Nsp CN array. The log intensity ratios are plotted on the Y axis and the genomic location on the X axis. The red horizontal line represents the CNV region identified by the Nsp CN array and algorithm, while the purple horizontal lines represent segmental duplication regions. The green arrows indicate location of primers used for QPCR verification (listed in Additional File <supplr sid="S6">6</supplr>).</p>
               </text>
               <graphic file="1471-2156-9-27-4"/>
            </fig>
            <p>CNVs have previously been shown to be largely heritable <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B14">14</abbr></abbrgrp>. As such, the performance of the CNV detection assay and algorithm was assessed by evaluating Mendelian inheritance (MI) of CNVs in two trios that are part of the HapMap collection of DNA samples of Caucasian (CEU) descent (Figure <figr fid="F5">5</figr>). The 6 samples that comprise the two trio sets were each compared to the reference sample (NA10851). Thus, all CNVs derived from these comparisons are a composite of copy number variation in the test sample as well as the reference sample. This analysis showed that 95.1% of CNVs (157/165) identified in the 2 children of these trios were also found in at least one of the parents. This includes 113 CNVs that were called by the algorithm in both the child and parent and are classified as inherited (Figure <figr fid="F5">5A</figr>) as well as 44 CNVs with signal intensities in one of the parents that were just below the significance threshold cutoff and are classified as "display MI trend" (Figure <figr fid="F5">5B</figr>, Additional File <supplr sid="S6">6E</supplr>). The remaining CNVs could represent detection errors (false positive CNVs in the child or false negative CNVs in either parent), a "de novo" event in the child, a cell line artifact in the child's sample <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>, or an inherited CNV that has a more complicated inheritance pattern (Figure <figr fid="F5">5C</figr>). To evaluate these possibilities, all eight non-inherited CNVs were evaluated for overlap with previously released data sets that used the same samples <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B11">11</abbr><abbr bid="B14">14</abbr><abbr bid="B32">32</abbr></abbrgrp> and were also experimentally evaluated using QPCR (Additional File <supplr sid="S6">6D</supplr>). 
This analysis showed that 4 of the 8 non-inherited CNVs were truly present in the child's sample but were not detected in the parents' samples.</p>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>CNV inheritance patterns in two family trios</p>
               </caption>
               <text>
                  <p><b>CNV inheritance patterns in two family trios</b>. Although most CNVs are clearly inherited (Figure 5A) or display an intensity profile in one of the parents that is just below the threshold cutoff (Figure 5B), some CNVs appear to be de novo (Figure 5C). This could be due to complicated inheritance of a common CNV present in both parents and the reference, a false positive in the child, or a de novo event in the child. The log intensity ratios are plotted on the Y axis (the dots represent the log intensity ratio of each probe) and the genomic location on the X axis. Red horizontal lines represent CNVs identified in our study, and the black horizontal line in Figure 5B marks the region in the parent that corresponds to the CNV region identified in the child sample. (A) Transmission of a CNV from a father (NA12144) to the child (NA10846). (B) Transmission of a CNV from a father (NA12155) to the child (NA10831). In this case, the intensity profile in this region in the father is just below the significance threshold and was not called as a CNV. However, this region displayed a strong trend as a CNV. (C) A deletion CNV identified in the child (NA10846) is not found in either of the parents (NA12144 and NA12145).</p>
               </text>
               <graphic file="1471-2156-9-27-5"/>
            </fig>
            <p>A comparison of the four validated "de novo" CNVs with CNVs that have previously been described in the literature for these samples reveals that one of these four can be categorized as a CNV with a complex inheritance pattern and a second CNV can be categorized as a putative cell line artifact. In the case of the trio which includes the child DNA sample NA10846, a "de novo" CNV from 79,022,620 bp to 79,094,338 bp on chromosome 6 was validated using several QPCR primer pairs targeting different regions of the CNV (Figure <figr fid="F5">5C</figr>). In a previous study <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>, this common CNV region was identified as a deletion in both parent samples (NA12144 and NA12145) as well as the reference sample (NA10851), and was found to be a homozygous deletion in the child (NA10846). Because the reference sample and the two parents contain the same CNV allele, the presence of the deletion in the parents was masked in our study. Thus, this is an example where an apparently "de novo" or non-inherited CNV appears to follow simple Mendelian inheritance but is missed due to the configurations of genotypes in the tested samples relative to the reference sample. In another example, for the case of the trio NA10831-NA12145-NA12146, a "de novo" CNV was validated between 84,014,256 bp and 84,037,846 bp on chromosome 7, but only in a specific lot number of the DNA sample corresponding to the child (Additional File <supplr sid="S6">6</supplr>). In previous work, this region was identified as a deletion in the child sample (NA10831), but not in the parent samples (NA12145 and NA12146) and was thus flagged as a potential cell line artifact <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>.</p>
         </sec>
         <sec>
            <st>
               <p>High resolution breakpoint determination for CNVs</p>
            </st>
            <p>For the Nsp CN array, the CNV border was defined as the midpoint between the outermost fragment within a region showing significance and the nearest fragment located outside the significant region. The reported border for a CNV region is therefore an approximation of the true border, which should lie somewhere between these two points. The accuracy of the array and algorithm in delineating CNV boundaries was evaluated by experimental testing of 2 CNV regions that were identified by both the Nsp CN array and the 500 K EA platform (Additional File <supplr sid="S6">6C</supplr>). The first CNV tested was identified as a 40 kb insertion on chromosome 2 by the Nsp CN array and as a 65 kb insertion by 500 K EA (Figure <figr fid="F6">6A</figr>). QPCR primers were designed to the regions immediately adjacent to the borders defined by the Nsp CN array, to regions internal to the defined borders, and to regions that differed between the two platforms. The results show that the borders defined by the Nsp CN array and algorithm were highly accurate and limited only by the density of markers in the region (Figure <figr fid="F6">6</figr>). A comparison of the borders reported by the Nsp CN array and by the 500 K EA array with the experimental QPCR results shows that the higher density of markers in the Nsp CN array helps to identify the true border of a CNV region.</p>
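            <p>The border rule described above can be illustrated with a minimal Python sketch. It assumes that each fragment is summarized by a single genomic coordinate and that significance is given as one flag per fragment; the function name, the coordinates, and the handling of chromosome ends are hypothetical and are not taken from the published algorithm.</p>

```python
def cnv_borders(fragment_positions, significant):
    """Return (left, right) borders of the first run of significant fragments.

    fragment_positions: sorted genomic coordinates (bp), one per fragment.
    significant:        booleans of the same length, True inside the call.
    Each border is the midpoint between the outermost significant fragment
    and its nearest non-significant neighbour. At a chromosome end, where no
    neighbour exists, the fragment coordinate itself is used (an assumption).
    """
    if True not in significant:
        return None
    first = significant.index(True)
    last = first
    # Extend the run to the right while the next fragment is also significant.
    while last + 1 != len(significant) and significant[last + 1]:
        last += 1
    left = fragment_positions[first]
    if first > 0:
        left = (fragment_positions[first - 1] + fragment_positions[first]) / 2
    right = fragment_positions[last]
    if last + 1 != len(fragment_positions):
        right = (fragment_positions[last] + fragment_positions[last + 1]) / 2
    return left, right


positions = [1000, 2000, 3500, 4200, 6000, 9000]   # hypothetical coordinates
calls = [False, False, True, True, True, False]
print(cnv_borders(positions, calls))   # prints (2750.0, 7500.0)
```

            <p>In this toy example the right border falls 1500 bp from either neighbouring fragment, which illustrates why the uncertainty of a reported border scales with the local spacing of markers.</p>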
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>Improved boundary delineation with Nsp CN arrays compared to 500 K EA</p>
               </caption>
               <text>
                  <p><b>Improved boundary delineation with Nsp CN arrays compared to 500 K EA</b>. The CNVs in these examples were identified by both the 500 K EA platform (black lines) and the Nsp CN array (red lines). The three panels represent three independent replicates of the test sample NA15510 and the reference sample NA10851 on the Nsp CN array (data set 1 and data set 3 are test data sets and data set 2 is used as the training set). The blue lines represent the log intensity ratios, with the dots indicating the location of each probe from the Nsp CN array. Colored vertical lines indicate different primer pairs, with green indicating a confirmed copy number change and red indicating no detectable copy number change. The black dots on the black horizontal line represent SNP markers tiled on the 500 K EA arrays. A) This CNV was identified as a 40 kb insertion using the Nsp CN array and as a 65 kb insertion using the 500 K EA arrays. The primer pairs, ordered from left to right on the figure, are named 1 to 19 in Additional File <supplr sid="S6">6C</supplr>. B) This CNV was identified as a 95 kb insertion using the Nsp CN array and as a 23 kb insertion using 500 K EA. In addition, the CNV is flanked by segmental duplications (purple lines). Primers 1 through 9 are numbered from left to right in Additional File <supplr sid="S6">6C</supplr>.</p>
               </text>
               <graphic file="1471-2156-9-27-6"/>
            </fig>
            <p>The second CNV tested was called as a larger region by the Nsp CN array (95 kb insertion on chromosome 17) than by 500 K EA (23 kb insertion on chromosome 17). The primary reason for the smaller size on the 500 K EA platform was the lack of SNP probes in the segmental duplications that are associated with this CNV (Figure <figr fid="F6">6B</figr>). Again, the Nsp CN array borders were found to be more accurate (Additional File <supplr sid="S6">6</supplr>). It should be noted that although this CNV is clearly larger than 23 kb, the precise borders were difficult to establish due to the presence of segmental duplications within and flanking the region (Figure <figr fid="F6">6B</figr>).</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion and Conclusion</p>
         </st>
         <p>The routine testing of CNVs during genome wide association studies has been widely proposed yet has not been fully realized to the same extent as SNP genotyping <abbrgrp><abbr bid="B44">44</abbr><abbr bid="B45">45</abbr><abbr bid="B46">46</abbr></abbrgrp>. This goal is hindered in part by the fact that accurate and sensitive detection of CNVs that span varying numbers of nucleotides poses greater technical challenges than the genotype determination of a bi-allelic single nucleotide polymorphism. In addition, although SNPs can reliably be identified by many different molecular assays which all result in a common output (homozygous or heterozygous genotype call), CNV outputs can vary widely depending on the specific technical platform, calling algorithm, and reference DNA sample that is used <abbrgrp><abbr bid="B47">47</abbr><abbr bid="B48">48</abbr></abbrgrp>.</p>
         <p>The ability to accurately assess common copy number variation requires the development of novel high throughput technologies as well as the algorithms to extract and process the appropriate information. Here we describe a high density oligonucleotide array designed specifically for the interrogation of copy number changes without the necessity to genotype SNPs. In addition, we have utilized a CNV detection algorithm that takes advantage of well established standardization methods <abbrgrp><abbr bid="B37">37</abbr><abbr bid="B49">49</abbr><abbr bid="B50">50</abbr></abbrgrp> as well as the use of tree partitioning to segment the genome and delineate the CNV borders, a method that has been previously described for the identification of copy number changes using high density arrays <abbrgrp><abbr bid="B51">51</abbr></abbrgrp> and is a powerful alternative to other segmentation algorithms <abbrgrp><abbr bid="B52">52</abbr><abbr bid="B53">53</abbr><abbr bid="B54">54</abbr><abbr bid="B55">55</abbr></abbrgrp>. We have further justified the use of a tree partitioning model coupled with a permutation test by extensive experimental validation of the CNV calls as well as the precision of the borders determined by the algorithm.</p>
         <p>The single largest advantage of high density DNA oligonucleotide arrays is the vast amount of genetic information generated in a single experiment through the use of millions of independent probe sequences <abbrgrp><abbr bid="B56">56</abbr><abbr bid="B57">57</abbr><abbr bid="B58">58</abbr></abbrgrp>. The value of higher density is evident from the increased number of CNVs called in any pairwise comparison and from the ability to detect much smaller CNVs than other array based platforms <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. For example, we identified 169 validated CNVs in one pairwise comparison (NA15510 vs NA10851) alone. This far outnumbers the CNVs discovered (using the same test and reference sample) by at least 5 other microarray based platforms (See Supplementary Table 1 in <abbrgrp><abbr bid="B59">59</abbr></abbrgrp>), although it is still fewer than the 241 alterations discovered by fosmid end sequencing of NA15510 <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. Remarkably, in this one sample alone, more than 500 distinct copy number variations have been identified, and half of these have been experimentally validated. This underscores the point that any two human genomes may differ by tens of megabases of DNA sequence due to structural variation alone.</p>
         <p>One issue with CNV survey studies to date is the lack of overlap between variants identified using different platforms <abbrgrp><abbr bid="B59">59</abbr><abbr bid="B60">60</abbr><abbr bid="B61">61</abbr></abbrgrp>. In addition, although the databases cataloguing all published CNV regions contain hundreds of Mbs of DNA, it is still unclear if a large proportion of these CNVs may in fact be false positives <abbrgrp><abbr bid="B59">59</abbr><abbr bid="B62">62</abbr></abbrgrp>. We have high confidence in the CNVs reported here since all have been experimentally validated or have been identified by multiple technological platforms.</p>
         <p>The presence of non-polymorphic probes improves array performance by allowing more probes to be utilized, even in more complex regions of the genome, such as segmental duplication regions, which are often not accessible through standard SNP genotyping. Future whole genome association studies should utilize both SNPs and CN probes to maximize the information and content. While SNP detection has been widely used and tested, this is the first report of a non-polymorphic set of probes that can be evaluated for eventual inclusion onto an integrated array containing both polymorphic and non-polymorphic probes <abbrgrp><abbr bid="B47">47</abbr><abbr bid="B61">61</abbr></abbrgrp>. A subset of probes from the Nsp CN array has been empirically selected for maximum responsiveness and has been incorporated into the SNP 6.0 array <abbrgrp><abbr bid="B63">63</abbr></abbrgrp>. This array is currently being used to assess structural variation in large sample sets. Finally, the Nsp CN arrays have been shown to be capable of detecting cancer-causing aberrations with known pathological consequences <abbrgrp><abbr bid="B64">64</abbr></abbrgrp>. Thus, this type of array could also be used for array-based karyotyping in lieu of more time-consuming and expensive cytogenetic methods <abbrgrp><abbr bid="B65">65</abbr></abbrgrp>.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Array Design</p>
            </st>
            <p>The Nsp CN array contains 12,339,139 oligonucleotide probes tiled onto two arrays. Probes were selected to represent each of the 1,330,354 fragments of 200&#8211;1100 bp predicted to arise after digestion of human genomic DNA with the restriction enzyme NspI. All data presented are based on the human reference genome build 35 (May 2004 build). For all chromosomes, 8&#8211;10 PM (perfect match) probes were identified per fragment using a probe selection algorithm previously developed for high density 25-mer arrays <abbrgrp><abbr bid="B66">66</abbr></abbrgrp>. Simple repeats and SNP sequences were avoided.</p>
            <p>For background estimation, a pooled set of "antigenomic" probes was used. These probes are matched to each perfect match feature based on its GC content and are not present elsewhere in the genome <abbrgrp><abbr bid="B42">42</abbr></abbrgrp>.</p>
         </sec>
         <sec>
            <st>
               <p>Data Analysis</p>
            </st>
            <sec>
               <st>
                  <p>I. Preprocessing</p>
               </st>
               <sec>
                  <st>
                     <p>1. Probe Filtering</p>
                  </st>
                  <p>In order to extract the highest quality data from the Nsp-CN arrays, several filtering steps were implemented to remove adversely performing probes.</p>
                  <sec>
                     <st>
                        <p>Probe filtering based on probe GC content, fragment length and GC content, and NspI restriction site characteristics</p>
                     </st>
                     <p>Several previous studies have suggested that the restriction fragment length and GC content as well as probe GC content have a strong effect on feature intensity <abbrgrp><abbr bid="B37">37</abbr><abbr bid="B52">52</abbr><abbr bid="B67">67</abbr></abbrgrp>. Analysis of the relationship between Nsp-CN array probe intensity and the associated probe and fragment characteristics (data not shown) has led to the first set of filtering criteria: probes with less than 30% or greater than 60% GC content were removed, as were probes within restriction fragments that were greater than 1000 bp in length or had less than 25% or greater than 60% GC content. In addition, probes residing in fragments in which the enzyme recognition site contains a SNP <abbrgrp><abbr bid="B68">68</abbr></abbrgrp> were also filtered out.</p>
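                     <p>The criteria above can be expressed as a simple predicate. The following Python sketch is illustrative only: the function names and arguments are hypothetical, GC content is computed as a plain percentage of G and C bases, and treating the thresholds as inclusive bounds is an assumption, since the text does not state how values exactly at a threshold were handled.</p>

```python
def gc_percent(sequence):
    """Percentage of G and C bases in a DNA sequence."""
    seq = sequence.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def passes_sequence_filters(probe_seq, fragment_len, fragment_gc, site_has_snp):
    """Return True if a probe survives the first set of filtering criteria.

    probe_seq:    probe sequence (25-mer on this array)
    fragment_len: length of the NspI restriction fragment in bp
    fragment_gc:  GC content of the fragment in percent
    site_has_snp: True if the NspI recognition site of the fragment has a SNP
    """
    probe_gc = gc_percent(probe_seq)
    if probe_gc > 60 or 30 > probe_gc:        # probe GC outside 30-60%
        return False
    if fragment_len > 1000:                   # fragment longer than 1000 bp
        return False
    if fragment_gc > 60 or 25 > fragment_gc:  # fragment GC outside 25-60%
        return False
    return not site_has_snp


probe = "ACGTACGTACGTACGTACGTACGTA"   # hypothetical 25-mer with 48% GC
print(passes_sequence_filters(probe, 650, 42.0, False))    # kept
print(passes_sequence_filters(probe, 1050, 42.0, False))   # fragment too long
```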
                  </sec>
                  <sec>
                     <st>
                        <p>Probe filtering based on number of genome hits</p>
                     </st>
                     <p>The xMAN (extreme Mapping of OligoNucleotides) algorithm was used to map all Nsp CN probes to the human genome <abbrgrp><abbr bid="B69">69</abbr></abbrgrp>. Probes with more than two genomic hits were discarded due to reduced ability to respond to changes in target dosage.</p>
                     <p>After the above two filtering steps, the number of probes was reduced from 12,339,139 to 10,379,759 (84.12% retained), and the number of fragments was reduced from 1,330,354 to 1,245,607 (93.6% retained). The remaining filters were applied independently to each data set.</p>
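                     <p>As a minimal sketch, the genome-hit filter and the retained fractions quoted above can be reproduced as follows. The probe identifiers and hit counts are invented for illustration, and the helper name is hypothetical; only the cutoff of two hits and the before and after totals come from the text.</p>

```python
def keep_by_genome_hits(hit_counts, max_hits=2):
    """Keep probes that map to at most max_hits locations in the genome."""
    return {probe: n for probe, n in hit_counts.items() if max_hits >= n}


# Hypothetical mapping results: probe identifier -> number of genomic hits.
hits = {"probe_a": 1, "probe_b": 2, "probe_c": 3, "probe_d": 17}
print(sorted(keep_by_genome_hits(hits)))

# Totals reported in the text, before and after the first two filtering steps.
probes_before, probes_after = 12_339_139, 10_379_759
fragments_before, fragments_after = 1_330_354, 1_245_607
probe_retention = round(100 * probes_after / probes_before, 2)
fragment_retention = round(100 * fragments_after / fragments_before, 2)
print(probe_retention, fragment_retention)
```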
                  </sec>
                  <sec>
                     <st>
                        <p>Filtering of high-intensity probes</p>
                     </st>
                     <p>Exploratory data analysis showed that probes with the highest intensity on the arrays had a very low dose response (Additional File <supplr sid="S3">3</supplr>), in part due to cross hybridization with multiple sites in the genome. For each set of samples being analyzed together, probes that were consistently in the top 10% of intensities were filtered out.</p>
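                     <p>One way to implement such a filter is sketched below in Python. The sketch interprets "consistently" as "in every sample of the set", which is an assumption; the function name, the data layout, and the intensity values are hypothetical.</p>

```python
def consistently_bright(intensities, top_fraction=0.10):
    """Return the probes ranked in the top fraction of intensity in every sample.

    intensities: dict mapping sample name -> dict mapping probe -> intensity.
    """
    flagged = None
    for per_probe in intensities.values():
        # Rank the probes of this sample from brightest to dimmest.
        ranked = sorted(per_probe, key=per_probe.get, reverse=True)
        n_top = max(1, int(len(ranked) * top_fraction))
        top = set(ranked[:n_top])
        # Keep only probes that were also in the top set of earlier samples.
        flagged = top if flagged is None else flagged.intersection(top)
    return flagged or set()


# Hypothetical data: ten probes measured in two samples of one data set.
sample_1 = {"p%d" % i: 100.0 * (i + 1) for i in range(10)}
sample_2 = dict(sample_1, p8=50.0)
data_set = {"sample_1": sample_1, "sample_2": sample_2}
to_remove = consistently_bright(data_set)
print(to_remove)   # prints {'p9'}
```

                     <p>Requiring a probe to be bright in every sample makes the filter conservative, since a probe that is saturated in only some hybridizations is retained.</p>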
                  </sec>
                  <sec>
                     <st>
                        <p>Filtering of low-intensity probes: estimation of background effects</p>
                     </st>
                     <p>In order to identify probes that consistently failed to produce a signal above the background level, a sequence specific model was used to estimate the contribution of systematic noise to the probe signal intensity. Although overall probe GC content plays a crucial role in the estimation of background, recent studies have pointed out that position dependent sequence effects are also important <abbrgrp><abbr bid="B70">70</abbr><abbr bid="B71">71</abbr><abbr bid="B72">72</abbr></abbrgrp>. Motivated by the sequence-specific model, the following multiple linear regression model was used to describe the background effect on the Nsp CN arrays:</p>
                     <p>
                        <display-formula id="M1">
                           <m:math name="1471-2156-9-27-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
                              <m:semantics>
                                 <m:mrow>
                                    <m:mtable>
                                       <m:mtr>
                                          <m:mtd>
                                             <m:mrow>
                                                <m:mi>log</m:mi>
                                                <m:mo>&#8289;</m:mo>
                                                <m:mo stretchy="false">(</m:mo>
                                                <m:msub>
                                                   <m:mi>Intensity</m:mi>
                                                   <m:mi>i</m:mi>
                                                </m:msub>
                                                <m:mo stretchy="false">)</m:mo>
                                                <m:mo>=</m:mo>
                                                <m:mi>&#945;</m:mi>
                                                <m:mo>+</m:mo>
                                                <m:mstyle displaystyle="true">
                                                   <m:munder>
                                                      <m:mo>&#8721;</m:mo>
                                                      <m:mrow>
                                                         <m:mi>k</m:mi>
                                                         <m:mo>&#8712;</m:mo>
                                                         <m:mo>{</m:mo>
                                                         <m:mi>A</m:mi>
                                                         <m:mo>,</m:mo>
                                                         <m:mi>C</m:mi>
                                                         <m:mo>,</m:mo>
                                                         <m:mi>G</m:mi>
                                                         <m:mo>}</m:mo>
                                                      </m:mrow>
                                                   </m:munder>
                                                   <m:mrow>
                                                      <m:mstyle displaystyle="true">
                                                         <m:munderover>
                                                            <m:mo>&#8721;</m:mo>
                                                            <m:mrow>
                                                               <m:mi>l</m:mi>
                                                               <m:mo>=</m:mo>
                                                               <m:mn>1</m:mn>
                                                            </m:mrow>
                                                            <m:mn>3</m:mn>
                                                         </m:munderover>
                                                         <m:mrow>
                                                            <m:msub>
                                                               <m:mi>&#946;</m:mi>
                                                               <m:mrow>
                                                                  <m:mi>k</m:mi>
                                                                  <m:mo>,</m:mo>
                                                                  <m:mi>l</m:mi>
                                                               </m:mrow>
                                                            </m:msub>
                                                         </m:mrow>
                                                      </m:mstyle>
                                                   </m:mrow>
                                                </m:mstyle>
                                                <m:msubsup>
                                                   <m:mi>P</m:mi>
                                                   <m:mrow>
                                                      <m:mi>i</m:mi>
                                                      <m:mo>,</m:mo>
                                                      <m:mi>k</m:mi>
                                                   </m:mrow>
                                                   <m:mi>l</m:mi>
                                                </m:msubsup>
                                                <m:mo>+</m:mo>
                                                <m:mstyle displaystyle="true">
                                                   <m:munderover>
                                                      <m:mo>&#8721;</m:mo>
                                                      <m:mrow>
                                                         <m:mi>j</m:mi>
                                                         <m:mo>=</m:mo>
                                                         <m:mn>1</m:mn>
                                                      </m:mrow>
                                                      <m:mrow>
                                                         <m:mn>25</m:mn>
                                                      </m:mrow>
                                                   </m:munderover>
                                                   <m:mrow>
                                                      <m:mstyle displaystyle="true">
                                                         <m:munder>
                                                            <m:mo>&#8721;</m:mo>
                                                            <m:mrow>
                                                               <m:mi>k</m:mi>
                                                               <m:mo>&#8712;</m:mo>
                                                               <m:mo>{</m:mo>
                                                               <m:mi>A</m:mi>
                                                               <m:mo>,</m:mo>
                                                               <m:mi>C</m:mi>
                                                               <m:mo>,</m:mo>
                                                               <m:mi>G</m:mi>
                                                               <m:mo>}</m:mo>
                                                            </m:mrow>
                                                         </m:munder>
                                                         <m:mrow>
                                                            <m:mstyle displaystyle="true">
                                                               <m:munderover>
                                                                  <m:mo>&#8721;</m:mo>
                                                                  <m:mrow>
                                                                     <m:mi>l</m:mi>
                                                                     <m:mo>=</m:mo>
                                                                     <m:mn>1</m:mn>
                                                                  </m:mrow>
                                                                  <m:mn>3</m:mn>
                                                               </m:munderover>
                                                               <m:mrow>
                                                                  <m:msub>
                                                                     <m:mi>&#947;</m:mi>
                                                                     <m:mrow>
                                                                        <m:mi>k</m:mi>
                                                                        <m:mo>,</m:mo>
                                                                        <m:mi>l</m:mi>
                                                                     </m:mrow>
                                                                  </m:msub>
                                                                  <m:msup>
                                                                     <m:mi>j</m:mi>
                                                                     <m:mi>l</m:mi>
                                                                  </m:msup>
                                                                  <m:msub>
                                                                     <m:mi>I</m:mi>
                                                                     <m:mrow>
                                                                        <m:mi>i</m:mi>
                                                                        <m:mi>j</m:mi>
                                                                        <m:mi>k</m:mi>
                                                                     </m:mrow>
                                                                  </m:msub>
                                                               </m:mrow>
                                                            </m:mstyle>
                                                         </m:mrow>
                                                      </m:mstyle>
                                                   </m:mrow>
                                                </m:mstyle>
                                             </m:mrow>
                                          </m:mtd>
                                       </m:mtr>
                                       <m:mtr>
                                          <m:mtd>
                                             <m:mrow>
                                                <m:mo>+</m:mo>
                                                <m:mstyle displaystyle="true">
                                                   <m:munderover>
                                                      <m:mo>&#8721;</m:mo>
                                                      <m:mrow>
                                                         <m:mi>m</m:mi>
                                                         <m:mo>=</m:mo>
                                                         <m:mn>1</m:mn>
                                                      </m:mrow>
                                                      <m:mrow>
                                                         <m:mn>24</m:mn>
                                                      </m:mrow>
                                                   </m:munderover>
                                                   <m:mrow>
                                                      <m:mstyle displaystyle="true">
                                                         <m:munder>
                                                            <m:mo>&#8721;</m:mo>
                                                            <m:mrow>
                                                               <m:mi>n</m:mi>
                                                               <m:mo>&#8712;</m:mo>
                                                               <m:mo>{</m:mo>
                                                               <m:mi>A</m:mi>
                                                               <m:mo>{</m:mo>
                                                               <m:mi>A</m:mi>
                                                               <m:mo>,</m:mo>
                                                               <m:mi>C</m:mi>
                                                               <m:mo>,</m:mo>
                                                               <m:mi>G</m:mi>
                                                               <m:mo>,</m:mo>
                                                               <m:mi>T</m:mi>
                                                               <m:mo>}</m:mo>
                                                               <m:mo>,</m:mo>
                                                               <m:mi>C</m:mi>
                                                               <m:mo>{</m:mo>
                                                               <m:mi>A</m:mi>
                                                               <m:mo>,</m:mo>
                                                               <m:mi>C</m:mi>
                                                               <m:mo>,</m:mo>
                                                               <m:mi>G</m:mi>
                                                               <m:mo>,</m:mo>
                                                               <m:mi>T</m:mi>
                                                               <m:mo>}</m:mo>
                                                               <m:mo>,</m:mo>
                                                               <m:mi>G</m:mi>
                                                               <m:mo>{</m:mo>
                                                               <m:mi>A</m:mi>
                                                               <m:mo>,</m:mo>
                                                               <m:mi>C</m:mi>
                                                               <m:mo>,</m:mo>
                                                               <m:mi>G</m:mi>
                                                               <m:mo>,</m:mo>
                                                               <m:mi>T</m:mi>
                                                               <m:mo>}</m:mo>
                                                               <m:mo>,</m:mo>
                                                               <m:mi>T</m:mi>
                                                               <m:mo>{</m:mo>
                                                               <m:mi>A</m:mi>
                                                               <m:mo>,</m:mo>
                                                               <m:mi>C</m:mi>
                                                               <m:mo>,</m:mo>
                                                               <m:mi>G</m:mi>
                                                               <m:mo>}</m:mo>
                                                               <m:mo>}</m:mo>
                                                            </m:mrow>
                                                         </m:munder>
                                                         <m:mrow>
                                                            <m:mstyle displaystyle="true">
                                                               <m:munderover>
                                                                  <m:mo>&#8721;</m:mo>
                                                                  <m:mrow>
                                                                     <m:mi>l</m:mi>
                                                                     <m:mo>=</m:mo>
                                                                     <m:mn>1</m:mn>
                                                                  </m:mrow>
                                                                  <m:mn>3</m:mn>
                                                               </m:munderover>
                                                               <m:mrow>
                                                                  <m:msub>
                                                                     <m:mi>&#948;</m:mi>
                                                                     <m:mrow>
                                                                        <m:mi>n</m:mi>
                                                                        <m:mo>,</m:mo>
                                                                        <m:mi>l</m:mi>
                                                                     </m:mrow>
                                                                  </m:msub>
                                                                  <m:msup>
                                                                     <m:mi>m</m:mi>
                                                                     <m:mi>l</m:mi>
                                                                  </m:msup>
                                                                  <m:msub>
                                                                     <m:mi>I</m:mi>
                                                                     <m:mrow>
                                                                        <m:mi>i</m:mi>
                                                                        <m:mi>m</m:mi>
                                                                        <m:mi>n</m:mi>
                                                                     </m:mrow>
                                                                  </m:msub>
                                                               </m:mrow>
                                                            </m:mstyle>
                                                         </m:mrow>
                                                      </m:mstyle>
                                                   </m:mrow>
                                                </m:mstyle>
                                                <m:mo>+</m:mo>
                                                <m:msub>
                                                   <m:mi>&#949;</m:mi>
                                                   <m:mi>i</m:mi>
                                                </m:msub>
                                             </m:mrow>
                                          </m:mtd>
                                       </m:mtr>
                                    </m:mtable>
                                 </m:mrow>
                                 <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeWabiqaaaqaaiGbcYgaSjabc+gaVjabcEgaNjabcIcaOiabdMeajjabd6gaUjabdsha0jabdwgaLjabd6gaUjabdohaZjabdMgaPjabdsha0jabdMha5naaBaaaleaacqWGPbqAaeqaaOGaeiykaKIaeyypa0JaeqySdeMaey4kaSYaaabuaeaadaaeWbqaaiabek7aInaaBaaaleaacqWGRbWAcqGGSaalcqWGSbaBaeqaaaqaaiabdYgaSjabg2da9iabigdaXaqaaiabiodaZaqdcqGHris5aaWcbaGaem4AaSMaeyicI4Saei4EaSNaemyqaeKaeiilaWIaem4qamKaeiilaWIaem4raCKaeiyFa0habeqdcqGHris5aOGaemiuaa1aa0baaSqaaiabdMgaPjabcYcaSiabdUgaRbqaaiabdYgaSbaakiabgUcaRmaaqahabaWaaabuaeaadaaeWbqaaiabeo7aNnaaBaaaleaacqWGRbWAcqGGSaalcqWGSbaBaeqaaOGaemOAaO2aaWbaaSqabeaacqWGSbaBaaGccqWGjbqsdaWgaaWcbaGaemyAaKMaemOAaOMaem4AaSgabeaaaeaacqWGSbaBcqGH9aqpcqaIXaqmaeaacqaIZaWma0GaeyyeIuoaaSqaaiabdUgaRjabgIGiolabcUha7jabdgeabjabcYcaSiabdoeadjabcYcaSiabdEeahjabc2ha9bqab0GaeyyeIuoaaSqaaiabdQgaQjabg2da9iabigdaXaqaaiabikdaYiabiwda1aqdcqGHris5aaGcbaGaey4kaSYaaabCaeaadaaeqbqaamaaqahabaGaeqiTdq2aaSbaaSqaaiabd6gaUjabcYcaSiabdYgaSbqabaGccqWGTbqBdaahaaWcbeqaaiabdYgaSbaakiabdMeajnaaBaaaleaacqWGPbqAcqWGTbqBcqWGUbGBaeqaaaqaaiabdYgaSjabg2da9iabigdaXaqaaiabiodaZaqdcqGHris5aaWcbaGaemOBa4MaeyicI4Saei4EaSNaemyqaeKaei4EaSNaemyqaeKaeiilaWIaem4qamKaeiilaWIaem4raCKaeiilaWIaemivaqLaeiyFa0NaeiilaWIaem4qamKaei4EaSNaemyqaeKaeiilaWIaem4qamKaeiilaWIaem4raCKaeiilaWIaemivaqLaeiykaKIaeiilaWIaem4raCKaei4EaSNaemyqaeKaeiilaWIaem4qamKaeiilaWIaem4raCKaeiilaWIaemivaqLaeiykaKIaeiilaWIaemivaqLaei4EaSNaemyqaeKaeiilaWIaem4qamKaeiilaWIaem4raCKaeiyFa0NaeiyFa0habeqdcqGHris5aaWcbaGaemyBa0Maeyypa0JaeGymaedabaGaeGOmaiJaeGinaqdaniabggHiLdGccqGHRaWkcqaH1oqzdaWgaaWcbaGaemyAaKgabeaaaaaaaa@E114@</m:annotation>
                              </m:semantics>
                           </m:math>
                        </display-formula>
                     </p>
                     <p>where</p>
                     <p>&#8226; <it>Intensity</it><sub><it>i </it></sub>is the probe intensity of probe <it>i</it>;</p>
                     <p>&#8226; <it>&#945; </it>is the intercept of the regression;</p>
                     <p>&#8226; <it>j </it>= 1,...,25, representing the position along the probe <it>i</it>;</p>
                     <p>&#8226; <it>k </it>represents the base at position <it>j</it>;</p>
                     <p>&#8226; <it>P</it><sub><it>i</it>,<it>k </it></sub>is the percentage of nucleotide <it>k </it>(A, C, or G) in probe <it>i</it>;</p>
                     <p>&#8226; <it>&#946;</it><sub><it>k</it>,<it>l </it></sub>is the coefficient of the <it>l</it>th power of the percentage of nucleotide <it>k </it>(A, C, or G) in the probe; for each nucleotide <it>k</it>, the effect is modeled as a polynomial of degree 3;</p>
                     <p>&#8226; <it>I</it><sub><it>ijk </it></sub>is an indicator function such that it is 1 when the <it>j</it>th position is base <it>k </it>in probe <it>i</it>, and it is 0 otherwise;</p>
                     <p>&#8226; <it>&#947;</it><sub><it>k</it>,<it>l </it></sub>is the coefficient of the <it>l</it>th power of position <it>j </it>for base <it>k</it>; the effect of base <it>k </it>at position <it>j </it>is modeled as a polynomial of degree 3 in <it>j</it>;</p>
                     <p>&#8226; <it>m </it>= 1,2,...,24, representing the di-nucleotide position along the probe <it>i</it>;</p>
                     <p>&#8226; <it>n </it>indexes the di-nucleotide (nearest-neighbor) compositions, such as 'AA', 'AC', and 'GT'; 15 of the 16 possible di-nucleotides appear in the sum, with 'TT' omitted;</p>
                     <p>&#8226; <it>I</it><sub><it>imn </it></sub>is an indicator function such that it is 1 when the <it>m</it>th position is di-nucleotide <it>n </it>in probe <it>i</it>, and it is 0 otherwise;</p>
                     <p>&#8226; <it>&#948;</it><sub><it>n</it>,<it>l </it></sub>is the coefficient of the <it>l</it>th power of position <it>m </it>for di-nucleotide <it>n</it>; the effect of di-nucleotide <it>n </it>at position <it>m </it>is modeled as a polynomial of degree 3 in <it>m</it>;</p>
                     <p>&#8226; <it>&#949;</it><sub><it>i </it></sub>is the error term.</p>
                     <p>Log intensities of all 33,886 anti-genomic probes were fitted by least squares to estimate the model parameters. Each array was fitted separately, and a total of 64 parameters (the intercept, 9 nucleotide-percentage terms, 9 single-base position terms, and 45 di-nucleotide position terms) were estimated for each array. These parameters were used to calculate the background-adjusted intensities for all interrogation probes on the array, and zero was used as the threshold for deciding whether a signal was greater than background. For each set of samples analyzed together, probes that consistently showed a signal lower than background were filtered out.</p>
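                     <p>As an illustration only, the background fit described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (design_row, fit_background, background_adjust) are hypothetical, natural logarithms are assumed, nucleotide percentages are expressed as fractions, positions are rescaled by the probe length for numerical stability, and the adjustment is assumed to be the observed intensity minus the model-predicted background.</p>

```python
import numpy as np

BASES = "ACG"  # T is the reference level, as in the sums over {A, C, G}
# 15 di-nucleotides; TT is the reference level
DINUCS = [a + b for a in "ACGT" for b in "ACGT" if a + b != "TT"]
POWERS = (1, 2, 3)  # cubic polynomial, l = 1..3


def design_row(seq):
    """Regressors for one probe: 1 + 9 + 9 + 45 = 64 values."""
    length = len(seq)
    row = [1.0]  # intercept (alpha)
    for k in BASES:  # beta terms: powers of the fraction of base k
        frac = seq.count(k) / length
        row.extend(frac ** l for l in POWERS)
    for k in BASES:  # gamma terms: position polynomial where base k occurs
        pos = [j / length for j in range(1, length + 1) if seq[j - 1] == k]
        row.extend(sum(x ** l for x in pos) for l in POWERS)
    for n in DINUCS:  # delta terms: position polynomial for di-nucleotide n
        pos = [m / length for m in range(1, length) if seq[m - 1:m + 1] == n]
        row.extend(sum(x ** l for x in pos) for l in POWERS)
    return row


def fit_background(seqs, intensities):
    """Least-squares fit of log intensity for the background probes of one array."""
    X = np.array([design_row(s) for s in seqs], dtype=float)
    y = np.log(np.asarray(intensities, dtype=float))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def background_adjust(seqs, intensities, coef):
    """Observed intensity minus predicted background (assumed form of the adjustment)."""
    X = np.array([design_row(s) for s in seqs], dtype=float)
    return np.asarray(intensities, dtype=float) - np.exp(X @ coef)
```

                     <p>In this sketch, fit_background would be called once per array on the anti-genomic probes, and background_adjust would then be applied to the interrogation probes of the same array; values at or below zero would be treated as not exceeding background. Some regressors in this sketch appear to be linearly dependent (the di-nucleotide position terms largely reproduce the single-base position terms), so numpy.linalg.lstsq returns a minimum-norm solution; the fitted values are unaffected, but individual coefficients should not be interpreted.</p>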
                  </sec>
                  <sec>
                     <st>
                        <p>Probe filtering based on number of probes within a fragment (probe set)</p>
                     </st>
                     <p>The last probe-filtering step removed any probe that was the only one remaining for its fragment after the previous filtering steps. Thus, every retained fragment was represented by at least two probes that had passed all filtering criteria.</p>
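                     <p>The singleton filter can be sketched as follows. This is an illustrative sketch only; the function name drop_singleton_probes and the probe-to-fragment mapping are hypothetical and stand in for whatever annotation the array design provides.</p>

```python
from collections import Counter


def drop_singleton_probes(probe_to_fragment):
    """Keep only probes whose fragment still has at least two surviving probes."""
    # Count how many surviving probes map to each fragment.
    counts = Counter(probe_to_fragment.values())
    # Discard probes that are the sole representative of their fragment.
    return {p: f for p, f in probe_to_fragment.items() if counts[f] >= 2}
```

                     <p>The input is assumed to contain only probes that passed the earlier filters, so a single pass is sufficient: removing singletons cannot reduce the probe count of any other fragment.</p>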
                  </sec>
               </sec>
               <sec>
                  <st>
                     <p>2. Probe Standardization</p>
                  </st>
                  <p>Because previous studies have demonstrated that probe intensities are affected by fragment length, fragment GC content, probe GC content, nucleotide positions on the probe, and the recognition-site sequence of the restriction enzyme, optical background-adjusted probe intensities were fitted to a multiple linear regression model <abbrgrp><abbr bid="B37">37</abbr><abbr bid="B70">70</abbr><abbr bid="B71">71</abbr><abbr bid="B72">72</abbr></abbrgrp>. A stepwise selection procedure based on the Akaike information criterion (AIC) was used to identify the best model. The starting model had a 10th-degree polynomial for each variable. A cubic polynomial was used for most of the variables, and the subset of selected variables could differ slightly from sample to sample. The following multiple linear regression model was used to fit the data:</p>
                  <p>
                     <display-formula id="M2">
                        <m:math name="1471-2156-9-27-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
                           <m:semantics>
                              <m:mrow>
                                 <m:mtable>
                                    <m:mtr>
                                       <m:mtd>
                                          <m:mrow>
                                             <m:mi>log</m:mi>
                                             <m:mo>&#8289;</m:mo>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mtext>adjusted</m:mtext>
                                             <m:mi>P</m:mi>
                                             <m:msub>
                                                <m:mi>M</m:mi>
                                                <m:mi>i</m:mi>
                                             </m:msub>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:mo>=</m:mo>
                                             <m:mi>&#945;</m:mi>
                                             <m:mo>+</m:mo>
                                             <m:mstyle displaystyle="true">
                                                <m:munder>
                                                   <m:mo>&#8721;</m:mo>
                                                   <m:mrow>
                                                      <m:mi>k</m:mi>
                                                      <m:mo>&#8712;</m:mo>
                                                      <m:mo>{</m:mo>
                                                      <m:mi>A</m:mi>
                                                      <m:mo>,</m:mo>
                                                      <m:mi>C</m:mi>
                                                      <m:mo>,</m:mo>
                                                      <m:mi>G</m:mi>
                                                      <m:mo>}</m:mo>
                                                   </m:mrow>
                                                </m:munder>
                                                <m:mrow>
                                                   <m:mstyle displaystyle="true">
                                                      <m:munderover>
                                                         <m:mo>&#8721;</m:mo>
                                                         <m:mrow>
                                                            <m:mi>l</m:mi>
                                                            <m:mo>=</m:mo>
                                                            <m:mn>1</m:mn>
                                                         </m:mrow>
                                                         <m:mn>3</m:mn>
                                                      </m:munderover>
                                                      <m:mrow>
                                                         <m:msub>
                                                            <m:mi>&#946;</m:mi>
                                                            <m:mrow>
                                                               <m:mi>k</m:mi>
                                                               <m:mo>,</m:mo>
                                                               <m:mi>l</m:mi>
                                                            </m:mrow>
                                                         </m:msub>
                                                      </m:mrow>
                                                   </m:mstyle>
                                                </m:mrow>
                                             </m:mstyle>
                                             <m:msubsup>
                                                <m:mi>P</m:mi>
                                                <m:mrow>
                                                   <m:mi>i</m:mi>
                                                   <m:mo>,</m:mo>
                                                   <m:mi>k</m:mi>
                                                </m:mrow>
                                                <m:mi>l</m:mi>
                                             </m:msubsup>
                                             <m:mo>+</m:mo>
                                             <m:mstyle displaystyle="true">
                                                <m:munderover>
                                                   <m:mo>&#8721;</m:mo>
                                                   <m:mrow>
                                                      <m:mi>j</m:mi>
                                                      <m:mo>=</m:mo>
                                                      <m:mn>1</m:mn>
                                                   </m:mrow>
                                                   <m:mrow>
                                                      <m:mn>25</m:mn>
                                                   </m:mrow>
                                                </m:munderover>
                                                <m:mrow>
                                                   <m:mstyle displaystyle="true">
                                                      <m:munder>
                                                         <m:mo>&#8721;</m:mo>
                                                         <m:mrow>
                                                            <m:mi>k</m:mi>
                                                            <m:mo>&#8712;</m:mo>
                                                            <m:mo>{</m:mo>
                                                            <m:mi>A</m:mi>
                                                            <m:mo>,</m:mo>
                                                            <m:mi>C</m:mi>
                                                            <m:mo>,</m:mo>
                                                            <m:mi>G</m:mi>
                                                            <m:mo>}</m:mo>
                                                         </m:mrow>
                                                      </m:munder>
                                                      <m:mrow>
                                                         <m:mstyle displaystyle="true">
                                                            <m:munderover>
                                                               <m:mo>&#8721;</m:mo>
                                                               <m:mrow>
                                                                  <m:mi>l</m:mi>
                                                                  <m:mo>=</m:mo>
                                                                  <m:mn>1</m:mn>
                                                               </m:mrow>
                                                               <m:mn>3</m:mn>
                                                            </m:munderover>
                                                            <m:mrow>
                                                               <m:msub>
                                                                  <m:mi>&#947;</m:mi>
                                                                  <m:mrow>
                                                                     <m:mi>k</m:mi>
                                                                     <m:mo>,</m:mo>
                                                                     <m:mi>l</m:mi>
                                                                  </m:mrow>
                                                               </m:msub>
                                                               <m:msup>
                                                                  <m:mi>j</m:mi>
                                                                  <m:mi>l</m:mi>
                                                               </m:msup>
                                                               <m:msub>
                                                                  <m:mi>I</m:mi>
                                                                  <m:mrow>
                                                                     <m:mi>i</m:mi>
                                                                     <m:mi>j</m:mi>
                                                                     <m:mi>k</m:mi>
                                                                  </m:mrow>
                                                               </m:msub>
                                                            </m:mrow>
                                                         </m:mstyle>
                                                      </m:mrow>
                                                   </m:mstyle>
                                                </m:mrow>
                                             </m:mstyle>
                                          </m:mrow>
                                       </m:mtd>
                                    </m:mtr>
                                    <m:mtr>
                                       <m:mtd>
                                          <m:mrow>
                                             <m:mo>+</m:mo>
                                             <m:mstyle displaystyle="true">
                                                <m:munderover>
                                                   <m:mo>&#8721;</m:mo>
                                                   <m:mrow>
                                                      <m:mi>m</m:mi>
                                                      <m:mo>=</m:mo>
                                                      <m:mn>1</m:mn>
                                                   </m:mrow>
                                                   <m:mrow>
                                                      <m:mn>24</m:mn>
                                                   </m:mrow>
                                                </m:munderover>
                                                <m:mrow>
                                                   <m:mstyle displaystyle="true">
                                                      <m:munder>
                                                         <m:mo>&#8721;</m:mo>
                                                         <m:mrow>
                                                            <m:mi>n</m:mi>
                                                            <m:mo>&#8712;</m:mo>
                                                            <m:mo>{</m:mo>
                                                            <m:mi>A</m:mi>
                                                            <m:mo>{</m:mo>
                                                            <m:mi>A</m:mi>
                                                            <m:mo>,</m:mo>
                                                            <m:mi>C</m:mi>
                                                            <m:mo>,</m:mo>
                                                            <m:mi>G</m:mi>
                                                            <m:mo>,</m:mo>
                                                            <m:mi>T</m:mi>
                                                            <m:mo>}</m:mo>
                                                            <m:mo>,</m:mo>
                                                            <m:mi>C</m:mi>
                                                            <m:mo>{</m:mo>
                                                            <m:mi>A</m:mi>
                                                            <m:mo>,</m:mo>
                                                            <m:mi>C</m:mi>
                                                            <m:mo>,</m:mo>
                                                            <m:mi>G</m:mi>
                                                            <m:mo>,</m:mo>
                                                            <m:mi>T</m:mi>
                                                            <m:mo>}</m:mo>
                                                            <m:mo>,</m:mo>
                                                            <m:mi>G</m:mi>
                                                            <m:mo>{</m:mo>
                                                            <m:mi>A</m:mi>
                                                            <m:mo>,</m:mo>
                                                            <m:mi>C</m:mi>
                                                            <m:mo>,</m:mo>
                                                            <m:mi>G</m:mi>
                                                            <m:mo>,</m:mo>
                                                            <m:mi>T</m:mi>
                                                            <m:mo>}</m:mo>
                                                            <m:mo>,</m:mo>
                                                            <m:mi>T</m:mi>
                                                            <m:mo>{</m:mo>
                                                            <m:mi>A</m:mi>
                                                            <m:mo>,</m:mo>
                                                            <m:mi>C</m:mi>
                                                            <m:mo>,</m:mo>
                                                            <m:mi>G</m:mi>
                                                            <m:mo>}</m:mo>
                                                            <m:mo>}</m:mo>
                                                         </m:mrow>
                                                      </m:munder>
                                                      <m:mrow>
                                                         <m:mstyle displaystyle="true">
                                                            <m:munderover>
                                                               <m:mo>&#8721;</m:mo>
                                                               <m:mrow>
                                                                  <m:mi>l</m:mi>
                                                                  <m:mo>=</m:mo>
                                                                  <m:mn>1</m:mn>
                                                               </m:mrow>
                                                               <m:mn>3</m:mn>
                                                            </m:munderover>
                                                            <m:mrow>
                                                               <m:msub>
                                                                  <m:mi>&#948;</m:mi>
                                                                  <m:mrow>
                                                                     <m:mi>n</m:mi>
                                                                     <m:mo>,</m:mo>
                                                                     <m:mi>l</m:mi>
                                                                  </m:mrow>
                                                               </m:msub>
                                                               <m:msup>
                                                                  <m:mi>m</m:mi>
                                                                  <m:mi>l</m:mi>
                                                               </m:msup>
                                                               <m:msub>
                                                                  <m:mi>I</m:mi>
                                                                  <m:mrow>
                                                                     <m:mi>i</m:mi>
                                                                     <m:mi>m</m:mi>
                                                                     <m:mi>n</m:mi>
                                                                  </m:mrow>
                                                               </m:msub>
                                                            </m:mrow>
                                                         </m:mstyle>
                                                      </m:mrow>
                                                   </m:mstyle>
                                                </m:mrow>
                                             </m:mstyle>
                                          </m:mrow>
                                       </m:mtd>
                                    </m:mtr>
                                    <m:mtr>
                                       <m:mtd>
                                          <m:mrow>
                                             <m:mo>+</m:mo>
                                             <m:mstyle displaystyle="true">
                                                <m:munder>
                                                   <m:mo>&#8721;</m:mo>
                                                   <m:mrow>
                                                      <m:mi>o</m:mi>
                                                      <m:mo>&#8712;</m:mo>
                                                      <m:mo>{</m:mo>
                                                      <m:mi>A</m:mi>
                                                      <m:mo>,</m:mo>
                                                      <m:mi>C</m:mi>
                                                      <m:mo>,</m:mo>
                                                      <m:mi>G</m:mi>
                                                      <m:mo>}</m:mo>
                                                   </m:mrow>
                                                </m:munder>
                                                <m:mrow>
                                                   <m:mstyle displaystyle="true">
                                                      <m:munderover>
                                                         <m:mo>&#8721;</m:mo>
                                                         <m:mrow>
                                                            <m:mi>l</m:mi>
                                                            <m:mo>=</m:mo>
                                                            <m:mn>1</m:mn>
                                                         </m:mrow>
                                                         <m:mn>3</m:mn>
                                                      </m:munderover>
                                                      <m:mrow>
                                                         <m:msub>
                                                            <m:mi>&#951;</m:mi>
                                                            <m:mrow>
                                                               <m:mi>o</m:mi>
                                                               <m:mo>,</m:mo>
                                                               <m:mi>l</m:mi>
                                                            </m:mrow>
                                                         </m:msub>
                                                      </m:mrow>
                                                   </m:mstyle>
                                                </m:mrow>
                                             </m:mstyle>
                                             <m:msubsup>
                                                <m:mi>F</m:mi>
                                                <m:mrow>
                                                   <m:mi>i</m:mi>
                                                   <m:mo>,</m:mo>
                                                   <m:mi>o</m:mi>
                                                </m:mrow>
                                                <m:mi>l</m:mi>
                                             </m:msubsup>
                                             <m:mo>+</m:mo>
                                             <m:mstyle displaystyle="true">
                                                <m:munderover>
                                                   <m:mo>&#8721;</m:mo>
                                                   <m:mrow>
                                                      <m:mi>l</m:mi>
                                                      <m:mo>=</m:mo>
                                                      <m:mn>1</m:mn>
                                                   </m:mrow>
                                                   <m:mn>3</m:mn>
                                                </m:munderover>
                                         