<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-7-438</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Software</dochead>
      <bibl>
         <title>
            <p>QualitySNP: a pipeline for detecting single nucleotide polymorphisms and insertions/deletions in EST data from diploid and polyploid species</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Tang</snm>
               <fnm>Jifeng</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>jifeng.tang@wur.nl</email>
            </au>
            <au id="A2">
               <snm>Vosman</snm>
               <fnm>Ben</fnm>
               <insr iid="I1"/>
               <email>ben.vosman@wur.nl</email>
            </au>
            <au id="A3">
               <snm>Voorrips</snm>
               <mi>E</mi>
               <fnm>Roeland</fnm>
               <insr iid="I1"/>
               <email>roeland.voorrips@wur.nl</email>
            </au>
            <au id="A4">
               <snm>van der Linden</snm>
               <fnm>C Gerard</fnm>
               <insr iid="I1"/>
               <email>gerard.vanderlinden@wur.nl</email>
            </au>
            <au id="A5" ca="yes">
               <snm>Leunissen</snm>
               <mi>AM</mi>
               <fnm>Jack</fnm>
               <insr iid="I2"/>
               <email>jack.leunissen@wur.nl</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Plant Research International, PO Box 16, 6700 AA Wageningen, The Netherlands</p>
            </ins>
            <ins id="I2">
               <p>Laboratory of Bioinformatics, Wageningen University, PO Box 8128, 6700 ET Wageningen, The Netherlands</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2006</pubdate>
         <volume>7</volume>
         <issue>1</issue>
         <fpage>438</fpage>
         <url>http://www.biomedcentral.com/1471-2105/7/438</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">17029635</pubid>
               <pubid idtype="doi">10.1186/1471-2105-7-438</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>29</day>
               <month>12</month>
               <year>2005</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>09</day>
               <month>10</month>
               <year>2006</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>09</day>
               <month>10</month>
               <year>2006</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2006</year>
         <collab>Tang et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Single nucleotide polymorphisms (SNPs) are important tools in studying complex genetic traits and genome evolution. Computational strategies for SNP discovery make use of the large number of sequences present in public databases (in most cases as expressed sequence tags (ESTs)) and are considered to be faster and more cost-effective than experimental procedures. A major challenge in computational SNP discovery is distinguishing allelic variation from sequence variation between paralogous sequences, in addition to recognizing sequencing errors. For the majority of the public EST sequences, trace or quality files are lacking, which makes detection of reliable SNPs even more difficult because it has to rely on sequence comparisons only.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We have developed a new algorithm to detect reliable SNPs and insertions/deletions (indels) in EST data, both with and without quality files. Implemented in a pipeline called QualitySNP, it uses three filters for the identification of reliable SNPs. Filter 1 screens for all potential SNPs and identifies variation between or within genotypes. Filter 2 is the core filter that uses a haplotype-based strategy to detect reliable SNPs. Clusters with potential paralogs as well as false SNPs caused by sequencing errors are identified. Filter 3 screens SNPs by calculating a confidence score, based upon sequence redundancy and quality. Non-synonymous SNPs are subsequently identified by detecting open reading frames of consensus sequences (contigs) with SNPs. The pipeline includes a data storage and retrieval system for haplotypes, SNPs and alignments. QualitySNP's versatility is demonstrated by the identification of SNPs in EST datasets from potato, chicken and humans.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>QualitySNP is an efficient tool for SNP detection, storage and retrieval in diploid as well as polyploid species. It is available for running on Linux or UNIX systems. The program, test data, and user manual are available at <url>http://www.bioinformatics.nl/tools/snpweb/</url> and as Additional files.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Sequence variations in the genomic DNA of individuals of the same species or related species are typically single nucleotide polymorphisms (SNPs) or small insertions/deletions (indels) <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp>. Because of their abundance and slow mutation rate within the genome, they are the most common type of genetic marker <abbrgrp><abbr bid="B3">3</abbr></abbrgrp> for studying complex genetic traits and genome evolution <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. In addition, SNPs in coding sequences can be used to directly study the genetics of expressed genes and to map functional traits <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>. Non-synonymous SNPs (nsSNPs) are of particular interest because they change the protein sequence, possibly affecting protein function <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr></abbrgrp>.</p>
         <p>There are several strategies for SNP discovery, both experimental and computational. Experimental SNP discovery often consists of a number of laborious steps that make this process complex and expensive <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>. The computational approach makes use of the large sequence datasets present in public databases. Over the last few years, a number of pipelines have been developed that automatically detect SNPs in such databases. One type of pipeline detects SNPs using trace files or quality files, for example the PHRED/PHRAP/PolyBayes system <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr></abbrgrp> and other pipelines <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr></abbrgrp>. The other type of pipeline uses only EST redundancy in text-based sequence files to detect SNPs; examples include autoSNP <abbrgrp><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr></abbrgrp> and SNiPpER <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>. Both autoSNP and SNiPpER rely on sequence redundancy for the initial detection of SNPs, and sequencing errors are detected and filtered out by analyzing SNP patterns.</p>
         <p>The major drawback of all these computational approaches is that they do not provide a good way to distinguish allelic variation from sequence variation between paralogous sequences. In addition, they do not recognize sequencing errors very well, leading to the frequent occurrence of false positives <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B16">16</abbr><abbr bid="B18">18</abbr></abbrgrp>. Only PolyBayes <abbrgrp><abbr bid="B9">9</abbr></abbrgrp> has implemented an enhanced paralog identification routine, but it requires the corresponding genomic sequence and quality files in addition to the EST sequence. As most public ESTs do not include trace or quality files, and genomic sequences are not available for most species, the versatility of the PolyBayes paralog identification routine is limited.</p>
         <p>In this paper we describe a new method (QualitySNP) that uses a haplotype-based strategy to detect reliable synonymous and non-synonymous SNPs from public EST data without the requirement of trace/quality files or genomic sequence data. Haplotypes in this context represent the different alleles of a gene in a dataset. The haplotype reconstruction is based on a mathematical algorithm. QualitySNP's versatility is demonstrated by the identification of SNPs in EST datasets from potato, chicken and humans.</p>
      </sec>
      <sec>
         <st>
            <p>Materials</p>
         </st>
         <p>For potato, two datasets were used in our study. One dataset was from the EMBL database (version 79), containing 83,565 ESTs with tissue information of the potato variety Kennebec. The other was from the Potato Gene Index of the TIGR database (data of Dec 7th 2004), containing 87,637 reads of potato ESTs with quality files, which was used to evaluate the quality of public potato ESTs and the performance of our SNP discovery pipeline. Functional annotation information of the potato ESTs was obtained from the TIGR Gene Index <abbrgrp><abbr bid="B19">19</abbr></abbrgrp> and UniGene <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. For chicken, a dataset consisting of 100,000 ESTs, originating from more than one genotype, was used <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. For thorough validation of our program, nineteen UniGene datasets obtained from NCBI (Build #191 of <it>Homo sapiens</it>) were used. The Single Nucleotide Polymorphism database (dbSNP) was downloaded from NCBI (Jun 3rd 2006) to our local machine.</p>
         <p>To detect non-synonymous and synonymous SNPs, the UniProt database (version of Feb 28th 2005) <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> was used to obtain reference protein sequences for ORF detection; FASTY, which is a module of the FASTA package 3.4 <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>, and the BLAST package 2.2.10 <abbrgrp><abbr bid="B24">24</abbr></abbrgrp> were used as tools for searching the UniProt database. CAP3 <abbrgrp><abbr bid="B25">25</abbr></abbrgrp> was used for assembling sequences. Cross_match <abbrgrp><abbr bid="B26">26</abbr></abbrgrp> was used for removing vector fragments; the vector sequences were downloaded from the NCBI data repository on Dec 5th 2004. To verify paralog identification, the BLAT server of the human genome reference sequence was used <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>.</p>
      </sec>
      <sec>
         <st>
            <p>Architectural structure</p>
         </st>
         <p>The pipeline consists of five steps: 1) EST assembly, using cross_match for removing vectors and CAP3 for sequence clustering, 2) analysis of the alignment information to select clusters with at least 4 EST members, 3) SNP detection and distinguishing variations between or within genotypes, 4) distinction between non-synonymous and synonymous SNPs using FASTY, and 5) transferring the final results into a SNP database (Figure <figr fid="F1">1</figr>). The pipeline is implemented as a standard C shell script on a Linux workstation; the individual programming steps are written in the C programming language, with the exception of the alignment analysis tool (Perl 5.8) and the web pages to view the results from the database (PHP 4 and MySQL 3.23).</p>
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>Flowchart for detecting reliable SNPs in the QualitySNP pipeline</p>
            </caption>
            <text>
               <p>Flowchart for detecting reliable SNPs in the QualitySNP pipeline. Steps 1 through 5 are described in detail in the section "Architectural structure".</p>
            </text>
            <graphic file="1471-2105-7-438-1"/>
         </fig>
         <p>In step 3, three filters are used to detect reliable SNPs: filter 1 screens clusters for potential SNPs and differentiates variations between or within genotypes; filter 2 detects clusters containing variations caused by sequencing errors and paralogous sequences; and filter 3 detects unreliable SNPs by assigning confidence scores to SNPs based on sequence redundancy and sequence quality.</p>
      </sec>
      <sec>
         <st>
            <p>Implementation</p>
         </st>
         <sec>
            <st>
               <p>Filter 1: screening for potential SNPs</p>
            </st>
            <p>EST data are clustered by CAP3 with a stringency level of 95% similarity per 100 bp, which is also used by other SNP mining programs <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B15">15</abbr></abbrgrp>; this setting is sufficient to prevent clustering of paralogous sequences in most cases. Clusters with at least 4 members are extracted from the alignment information, together with their annotation information, which was obtained from the TIGR Gene Index or UniGene <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. We detect all potential SNPs, including bi-, tri-, and tetra-allelic SNPs, with the requirement that every allele is represented by more than one sequence <abbrgrp><abbr bid="B16">16</abbr><abbr bid="B28">28</abbr></abbrgrp> (see Figure <figr fid="F1">1</figr>; filter 1). If genotype information for sequences is available, it can be used to classify SNPs as occurring between and/or within genotypes.</p>
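            <p>The allele-support requirement of filter 1 can be illustrated with a minimal Python sketch. It is not the pipeline's C implementation: the function name <it>potential_snp_alleles</it>, the representation of an alignment column as a list of nucleotides, and the choice to ignore singleton alleles rather than reject the whole site are assumptions made only for illustration.</p>
            <p>
```python
from collections import Counter

NUCLEOTIDES = {"A", "C", "G", "T"}

def potential_snp_alleles(column, min_count=2):
    """Return the supported alleles at one alignment column.

    column: nucleotides of all cluster members at this position
            (hypothetical input format; anything outside A/C/G/T is ignored).
    min_count: every reported allele must occur in at least this many
               sequences, i.e. in more than one sequence by default.
    An empty list means the site is not a potential SNP.
    """
    counts = Counter(base for base in column if base in NUCLEOTIDES)
    # Assumption: alleles seen only once are treated as likely errors.
    supported = sorted(b for b, c in counts.items() if c >= min_count)
    # A potential SNP needs at least two supported alleles.
    return supported if len(supported) >= 2 else []

if __name__ == "__main__":
    print(potential_snp_alleles(list("AAAGG")))    # bi-allelic site
    print(potential_snp_alleles(list("AAAAG")))    # G seen only once
    print(potential_snp_alleles(list("AACCGGT")))  # tri-allelic site
```
            </p>
            <p>With the default <it>min_count</it> of 2, a column in which the minor nucleotide occurs only once is not reported, which mirrors the requirement that every allele is represented by more than one sequence.</p>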
         </sec>
         <sec>
            <st>
               <p>Filter 2: screening for reliable SNPs</p>
            </st>
            <sec>
               <st>
                  <p>(1) Haplotype reconstruction</p>
               </st>
               <p>In our setup, a haplotype is defined as a group of sequences within a cluster that represent the same allele of a gene. All the sequences in a haplotype should therefore have the same nucleotide at every polymorphic site. Our program reconstructs haplotypes using a mathematical method that minimizes false haplotype reconstruction due to the occurrence of sequencing errors (see below).</p>
               <p>Firstly, the similarity <it>S</it><sub><it>ij </it></sub>per polymorphic site between candidate sequence <it>i </it>and all current members of one potential haplotype is defined as</p>
               <p>
                  <m:math name="1471-2105-7-438-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:msub>
                              <m:mi>S</m:mi>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mi>j</m:mi>
                              </m:mrow>
                           </m:msub>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mstyle displaystyle="true">
                                    <m:msubsup>
                                       <m:mo>&#8721;</m:mo>
                                       <m:mrow>
                                          <m:mi>k</m:mi>
                                          <m:mo>=</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                       <m:mi>m</m:mi>
                                    </m:msubsup>
                                    <m:mrow>
                                       <m:msub>
                                          <m:mi>s</m:mi>
                                          <m:mrow>
                                             <m:mi>i</m:mi>
                                             <m:mi>j</m:mi>
                                          </m:mrow>
                                       </m:msub>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mi>k</m:mi>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:mstyle>
                              </m:mrow>
                              <m:mrow>
                                 <m:mstyle displaystyle="true">
                                    <m:msubsup>
                                       <m:mo>&#8721;</m:mo>
                                       <m:mrow>
                                          <m:mi>k</m:mi>
                                          <m:mo>=</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                       <m:mi>m</m:mi>
                                    </m:msubsup>
                                    <m:mrow>
                                       <m:msub>
                                          <m:mi>s</m:mi>
                                          <m:mrow>
                                             <m:mi>i</m:mi>
                                             <m:mi>j</m:mi>
                                          </m:mrow>
                                       </m:msub>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mi>k</m:mi>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:mstyle>
                                 <m:mo>+</m:mo>
                                 <m:mstyle displaystyle="true">
                                    <m:msubsup>
                                       <m:mo>&#8721;</m:mo>
                                       <m:mrow>
                                          <m:mi>k</m:mi>
                                          <m:mo>=</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                       <m:mi>m</m:mi>
                                    </m:msubsup>
                                    <m:mrow>
                                       <m:msub>
                                          <m:mi>d</m:mi>
                                          <m:mrow>
                                             <m:mi>i</m:mi>
                                             <m:mi>j</m:mi>
                                          </m:mrow>
                                       </m:msub>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mi>k</m:mi>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:mstyle>
                              </m:mrow>
                           </m:mfrac>
                           <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                           <m:mrow>
                              <m:mo>[</m:mo>
                              <m:mn>1</m:mn>
                              <m:mo>]</m:mo>
                           </m:mrow>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGtbWudaWgaaWcbaGaemyAaKMaemOAaOgabeaakiabg2da9maalaaabaWaaabmaeaacqWGZbWCdaWgaaWcbaGaemyAaKMaemOAaOgabeaakiabcIcaOiabdUgaRjabcMcaPaWcbaGaem4AaSMaeyypa0JaeGymaedabaGaemyBa0ganiabggHiLdaakeaadaaeWaqaaiabdohaZnaaBaaaleaacqWGPbqAcqWGQbGAaeqaaOGaeiikaGIaem4AaSMaeiykaKcaleaacqWGRbWAcqGH9aqpcqaIXaqmaeaacqWGTbqBa0GaeyyeIuoakiabgUcaRmaaqadabaGaemizaq2aaSbaaSqaaiabdMgaPjabdQgaQbqabaGccqGGOaakcqWGRbWAcqGGPaqkaSqaaiabdUgaRjabg2da9iabigdaXaqaaiabd2gaTbqdcqGHris5aaaakiaaxMaacaWLjaWaamWaaeaacqaIXaqmaiaawUfacaGLDbaaaaa@615F@</m:annotation>
                     </m:semantics>
                  </m:math>
               </p>
               <p>where <it>j </it>is one polymorphic site of the sequence <it>i</it>, <it>k </it>is one current member of the potential haplotype, and <it>m </it>is the total number of current members in the potential haplotype. <it>s</it><sub><it>ij</it></sub>(<it>k</it>) expresses whether or not the nucleotide at polymorphic site <it>j </it>of sequence <it>i </it>is the same as that of member <it>k </it>in the haplotype, whereas <it>d</it><sub><it>ij</it></sub>(<it>k</it>) expresses whether it is different: when the nucleotide at site <it>j </it>in sequence <it>i </it>is the same as that in sequence <it>k</it>, <it>s</it><sub><it>ij</it></sub>(<it>k</it>) is set to 1 and <it>d</it><sub><it>ij</it></sub>(<it>k</it>) is set to 0; when the nucleotides are different, <it>s</it><sub><it>ij</it></sub>(<it>k</it>) is set to 0 and <it>d</it><sub><it>ij</it></sub>(<it>k</it>) is set to 1. If sequence <it>k </it>has no information at site <it>j</it>, both <it>s</it><sub><it>ij</it></sub>(<it>k</it>) and <it>d</it><sub><it>ij</it></sub>(<it>k</it>) are set to 0. <it>S</it><sub><it>ij </it></sub>is the similarity of sequence <it>i </it>to all current members in the potential haplotype on site <it>j</it>; <it>D</it><sub><it>ij </it></sub>is the dissimilarity between them. When <it>S</it><sub><it>ij </it></sub>is more than 0.75, sequence <it>i </it>is considered to match the haplotype on site <it>j</it>, so <it>S</it><sub><it>ij </it></sub>is set to 1 and <it>D</it><sub><it>ij </it></sub>is set to 0. When <it>S</it><sub><it>ij </it></sub>as calculated from equation [1] is less than 0.75, there is not enough information to assign sequence <it>i </it>to the potential haplotype with confidence, so <it>S</it><sub><it>ij </it></sub>is set to 0 and <it>D</it><sub><it>ij </it></sub>is set to 1. 
When both <m:math name="1471-2105-7-438-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mstyle displaystyle="true"><m:msubsup><m:mo>&#8721;</m:mo><m:mrow><m:mi>k</m:mi><m:mo>=</m:mo><m:mn>1</m:mn></m:mrow><m:mi>m</m:mi></m:msubsup><m:mrow><m:msub><m:mi>s</m:mi><m:mrow><m:mi>i</m:mi><m:mi>j</m:mi></m:mrow></m:msub></m:mrow></m:mstyle><m:mo stretchy="false">(</m:mo><m:mi>k</m:mi><m:mo stretchy="false">)</m:mo></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaaeWaqaaiabdohaZnaaBaaaleaacqWGPbqAcqWGQbGAaeqaaaqaaiabdUgaRjabg2da9iabigdaXaqaaiabd2gaTbqdcqGHris5aOGaeiikaGIaem4AaSMaeiykaKcaaa@3AC9@</m:annotation></m:semantics></m:math> and <m:math name="1471-2105-7-438-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mstyle displaystyle="true"><m:msubsup><m:mo>&#8721;</m:mo><m:mrow><m:mi>k</m:mi><m:mo>=</m:mo><m:mn>1</m:mn></m:mrow><m:mi>m</m:mi></m:msubsup><m:mrow><m:msub><m:mi>d</m:mi><m:mrow><m:mi>i</m:mi><m:mi>j</m:mi></m:mrow></m:msub></m:mrow></m:mstyle><m:mo stretchy="false">(</m:mo><m:mi>k</m:mi><m:mo stretchy="false">)</m:mo></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaaeWaqaaiabdsgaKnaaBaaaleaacqWGPbqAcqWGQbGAaeqaaaqaaiabdUgaRjabg2da9iabigdaXaqaaiabd2gaTbqdcqGHris5aOGaeiikaGIaem4AaSMaeiykaKcaaa@3AAB@</m:annotation></m:semantics></m:math> are 0, <it>S</it><sub><it>ij </it></sub>and <it>D</it><sub><it>ij </it></sub>are set to 0.</p>
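               <p>As an illustration of equation [1] and its thresholding, the following Python sketch computes the pair (<it>S</it><sub><it>ij</it></sub>, <it>D</it><sub><it>ij</it></sub>) for one polymorphic site. The function name <it>site_similarity</it>, the use of None for a member without information at the site, and the treatment of a value of exactly 0.75 as a mismatch are assumptions of this sketch; the text specifies only the cases above and below 0.75.</p>
               <p>
```python
def site_similarity(candidate_base, member_bases, threshold=0.75):
    """Return (S_ij, D_ij) for one polymorphic site after thresholding.

    candidate_base: nucleotide of candidate sequence i at site j
                    (assumed to be known in this sketch).
    member_bases: nucleotides of the current haplotype members at site j;
                  None marks a member with no information at this site.
    """
    informative = [b for b in member_bases if b is not None]
    same = sum(1 for b in informative if b == candidate_base)  # sum of s_ij(k)
    diff = len(informative) - same                             # sum of d_ij(k)
    if same + diff == 0:
        return 0, 0  # no member covers the site: both values are 0
    raw = same / (same + diff)  # equation [1]
    if raw > threshold:
        return 1, 0  # sequence i matches the haplotype at site j
    # Assumption: a value exactly equal to the threshold counts as mismatch.
    return 0, 1

if __name__ == "__main__":
    print(site_similarity("A", ["A", "A", "A", "A", "G"]))  # raw value 0.8
    print(site_similarity("A", ["A", "G", None]))           # raw value 0.5
    print(site_similarity("A", [None, None]))               # no information
```
               </p>
               <p>Members without information at the site contribute to neither sum, so they do not lower the similarity of the candidate sequence.</p>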
               <p>Secondly, the similarity <it>S</it><sub><it>i </it></sub>between sequence <it>i </it>and the potential haplotype over all polymorphic sites is defined as</p>
               <p>
                  <m:math name="1471-2105-7-438-i4" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:msub>
                              <m:mi>S</m:mi>
                              <m:mi>i</m:mi>
                           </m:msub>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mstyle displaystyle="true">
                                    <m:msubsup>
                                       <m:mo>&#8721;</m:mo>
                                       <m:mrow>
                                          <m:mi>j</m:mi>
                                          <m:mo>=</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                       <m:mi>n</m:mi>
                                    </m:msubsup>
                                    <m:mrow>
                                       <m:msub>
                                          <m:mi>S</m:mi>
                                          <m:mrow>
                                             <m:mi>i</m:mi>
                                             <m:mi>j</m:mi>
                                          </m:mrow>
                                       </m:msub>
                                    </m:mrow>
                                 </m:mstyle>
                              </m:mrow>
                              <m:mrow>
                                 <m:mstyle displaystyle="true">
                                    <m:msubsup>
                                       <m:mo>&#8721;</m:mo>
                                       <m:mrow>
                                          <m:mi>j</m:mi>
                                          <m:mo>=</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                       <m:mi>n</m:mi>
                                    </m:msubsup>
                                    <m:mrow>
                                       <m:msub>
                                          <m:mi>S</m:mi>
                                          <m:mrow>
                                             <m:mi>i</m:mi>
                                             <m:mi>j</m:mi>
                                          </m:mrow>
                                       </m:msub>
                                    </m:mrow>
                                 </m:mstyle>
                                 <m:mo>+</m:mo>
                                 <m:mstyle displaystyle="true">
                                    <m:msubsup>
                                       <m:mo>&#8721;</m:mo>
                                       <m:mrow>
                                          <m:mi>j</m:mi>
                                          <m:mo>=</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                       <m:mi>n</m:mi>
                                    </m:msubsup>
                                    <m:mrow>
                                       <m:msub>
                                          <m:mi>D</m:mi>
                                          <m:mrow>
                                             <m:mi>i</m:mi>
                                             <m:mi>j</m:mi>
                                          </m:mrow>
                                       </m:msub>
                                    </m:mrow>
                                 </m:mstyle>
                              </m:mrow>
                           </m:mfrac>
                           <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                           <m:mrow>
                              <m:mo>[</m:mo>
                              <m:mn>2</m:mn>
                              <m:mo>]</m:mo>
                           </m:mrow>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGtbWudaWgaaWcbaGaemyAaKgabeaakiabg2da9maalaaabaWaaabmaeaacqWGtbWudaWgaaWcbaGaemyAaKMaemOAaOgabeaaaeaacqWGQbGAcqGH9aqpcqaIXaqmaeaacqWGUbGBa0GaeyyeIuoaaOqaamaaqadabaGaem4uam1aaSbaaSqaaiabdMgaPjabdQgaQbqabaaabaGaemOAaOMaeyypa0JaeGymaedabaGaemOBa4ganiabggHiLdGccqGHRaWkdaaeWaqaaiabdseaenaaBaaaleaacqWGPbqAcqWGQbGAaeqaaaqaaiabdQgaQjabg2da9iabigdaXaqaaiabd6gaUbqdcqGHris5aaaakiaaxMaacaWLjaWaamWaaeaacqaIYaGmaiaawUfacaGLDbaaaaa@55D2@</m:annotation>
                     </m:semantics>
                  </m:math>
               </p>
               <p>where <it>n </it>is the total number of all potential polymorphic sites of sequence <it>i</it>. When S<sub><it>i </it></sub>is more than 0.8, sequence <it>i </it>is considered to belong to this haplotype. If both <m:math name="1471-2105-7-438-i5" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mstyle displaystyle="true"><m:msubsup><m:mo>&#8721;</m:mo><m:mrow><m:mi>j</m:mi><m:mo>=</m:mo><m:mn>1</m:mn></m:mrow><m:mi>n</m:mi></m:msubsup><m:mrow><m:msub><m:mi>S</m:mi><m:mrow><m:mi>i</m:mi><m:mi>j</m:mi></m:mrow></m:msub></m:mrow></m:mstyle></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaaeWaqaaiabdofatnaaBaaaleaacqWGPbqAcqWGQbGAaeqaaaqaaiabdQgaQjabg2da9iabigdaXaqaaiabd6gaUbqdcqGHris5aaaa@376E@</m:annotation></m:semantics></m:math> and <m:math name="1471-2105-7-438-i6" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mstyle displaystyle="true"><m:msubsup><m:mo>&#8721;</m:mo><m:mrow><m:mi>j</m:mi><m:mo>=</m:mo><m:mn>1</m:mn></m:mrow><m:mi>n</m:mi></m:msubsup><m:mrow><m:msub><m:mi>D</m:mi><m:mrow><m:mi>i</m:mi><m:mi>j</m:mi></m:mrow></m:msub></m:mrow></m:mstyle></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaaeWaqaaiabdseaenaaBaaaleaacqWGPbqAcqWGQbGAaeqaaaqaaiabdQgaQjabg2da9iabigdaXaqaaiabd6gaUbqdcqGHris5aaaa@3750@</m:annotation></m:semantics></m:math> are 0, the value of <it>S</it><sub><it>i </it></sub>is set to 0.0.</p>
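               <p>Equation [2] and the assignment rule can be sketched in Python as follows. The function names <it>haplotype_similarity</it> and <it>belongs_to_haplotype</it>, and the input format of one (<it>S</it><sub><it>ij</it></sub>, <it>D</it><sub><it>ij</it></sub>) pair per polymorphic site, are hypothetical and serve only to illustrate the calculation.</p>
               <p>
```python
def haplotype_similarity(site_scores):
    """Return S_i according to equation [2].

    site_scores: list of (S_ij, D_ij) pairs, one per polymorphic site of
                 sequence i, each pair being (1, 0), (0, 1) or (0, 0).
    """
    s_total = sum(s for s, _ in site_scores)
    d_total = sum(d for _, d in site_scores)
    if s_total + d_total == 0:
        return 0.0  # no informative site: S_i is set to 0.0
    return s_total / (s_total + d_total)

def belongs_to_haplotype(site_scores, threshold=0.8):
    """Sequence i joins the haplotype when S_i is more than the threshold."""
    return haplotype_similarity(site_scores) > threshold

if __name__ == "__main__":
    five_of_six = [(1, 0)] * 5 + [(0, 1)]
    four_of_five = [(1, 0)] * 4 + [(0, 1)]
    print(belongs_to_haplotype(five_of_six))   # S_i is 5/6
    print(belongs_to_haplotype(four_of_five))  # S_i is exactly 0.8
```
               </p>
               <p>Because the rule requires <it>S</it><sub><it>i </it></sub>to be more than 0.8, a sequence that matches on four of five informative sites is not assigned, whereas one that matches on five of six is.</p>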
            </sec>
            <sec>
               <st>
                  <p>(2) Identification of paralogs</p>
               </st>
               <p>Sets containing paralogous sequences can be expected to contain more polymorphisms than sets with only allelic sequences. A method based on the number and frequency of polymorphisms may therefore separate paralogs from alleles. However, some EST clusters show a larger than average number of SNPs because some genes or regions of genes evolve more rapidly <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. These SNPs represent allelic variation but would be mistaken for variation between paralogs by such an approach. We therefore developed a method that identifies paralogs from the differences in SNP numbers between potential haplotypes of the same cluster. The standard deviation of the number of potential SNPs among the potential haplotypes in one cluster is calculated and used to identify haplotypes that are likely to derive from paralogous sequences. The procedure for identifying paralogs is as follows:</p>
               <p>(a) Remove all potential haplotypes consisting of only one sequence: these are probably of poor quality <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>.</p>
               <p>(b) Calculate the number of potential SNPs defining each haplotype, i.e. the number of potential SNP sites where all sequences in all other haplotypes contain the same nucleotide and only the current haplotype has a different nucleotide.</p>
               <p>(c) Normalize this number of SNPs defining each potential haplotype:</p>
               <p>
                  <m:math name="1471-2105-7-438-i7" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mtable>
                              <m:mtr>
                                 <m:mtd>
                                    <m:mrow>
                                       <m:mi>n</m:mi>
                                       <m:mi>r</m:mi>
                                       <m:mi>m</m:mi>
                                       <m:mo>_</m:mo>
                                       <m:mi>s</m:mi>
                                       <m:mi>n</m:mi>
                                       <m:msub>
                                          <m:mi>p</m:mi>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                       <m:mo>=</m:mo>
                                       <m:mfrac>
                                          <m:mrow>
                                             <m:mi>s</m:mi>
                                             <m:mi>n</m:mi>
                                             <m:msub>
                                                <m:mi>p</m:mi>
                                                <m:mi>i</m:mi>
                                             </m:msub>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mfrac>
                                                <m:mrow>
                                                   <m:mstyle displaystyle="true">
                                                      <m:msubsup>
                                                         <m:mo>&#8721;</m:mo>
                                                         <m:mrow>
                                                            <m:mi>i</m:mi>
                                                            <m:mo>=</m:mo>
                                                            <m:mn>1</m:mn>
                                                         </m:mrow>
                                                         <m:mrow>
                                                            <m:mi>a</m:mi>
                                                            <m:mi>h</m:mi>
                                                            <m:mi>a</m:mi>
                                                            <m:mi>p</m:mi>
                                                         </m:mrow>
                                                      </m:msubsup>
                                                      <m:mrow>
                                                         <m:mi>s</m:mi>
                                                         <m:mi>n</m:mi>
                                                         <m:msub>
                                                            <m:mi>p</m:mi>
                                                            <m:mi>i</m:mi>
                                                         </m:msub>
                                                      </m:mrow>
                                                   </m:mstyle>
                                                </m:mrow>
                                                <m:mrow>
                                                   <m:mi>a</m:mi>
                                                   <m:mi>h</m:mi>
                                                   <m:mi>a</m:mi>
                                                   <m:mi>p</m:mi>
                                                </m:mrow>
                                             </m:mfrac>
                                          </m:mrow>
                                       </m:mfrac>
                                    </m:mrow>
                                 </m:mtd>
                                 <m:mtd>
                                    <m:mrow>
                                       <m:mrow>
                                          <m:mo>{</m:mo>
                                          <m:mrow>
                                             <m:mi>i</m:mi>
                                             <m:mo>|</m:mo>
                                             <m:mi>i</m:mi>
                                             <m:mo>&#8712;</m:mo>
                                             <m:mrow>
                                                <m:mo>[</m:mo>
                                                <m:mrow>
                                                   <m:mn>1</m:mn>
                                                   <m:mo>,</m:mo>
                                                   <m:mi>a</m:mi>
                                                   <m:mi>h</m:mi>
                                                   <m:mi>a</m:mi>
                                                   <m:mi>p</m:mi>
                                                </m:mrow>
                                                <m:mo>]</m:mo>
                                             </m:mrow>
                                          </m:mrow>
                                          <m:mo>}</m:mo>
                                       </m:mrow>
                                    </m:mrow>
                                 </m:mtd>
                              </m:mtr>
                           </m:mtable>
                           <m:mo>,</m:mo>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaafaqabeqacaaabaGaemOBa4MaemOCaiNaemyBa0Maei4xa8Laem4CamNaemOBa4MaemiCaa3aaSbaaSqaaiabdMgaPbqabaGccqGH9aqpdaWcaaqaaiabdohaZjabd6gaUjabdchaWnaaBaaaleaacqWGPbqAaeqaaaGcbaWaaSaaaeaadaaeWaqaaiabdohaZjabd6gaUjabdchaWnaaBaaaleaacqWGPbqAaeqaaaqaaiabdMgaPjabg2da9iabigdaXaqaaiabdggaHjabdIgaOjabdggaHjabdchaWbqdcqGHris5aaGcbaGaemyyaeMaemiAaGMaemyyaeMaemiCaahaaaaaaeaadaGadaqaaiabdMgaPjabcYha8jabdMgaPjabgIGiopaadmaabaGaeGymaeJaeiilaWIaemyyaeMaemiAaGMaemyyaeMaemiCaahacaGLBbGaayzxaaaacaGL7bGaayzFaaaaaiabcYcaSaaa@66A1@</m:annotation>
                     </m:semantics>
                  </m:math>
               </p>
               <p>where <it>snp</it><sub><it>i </it></sub>(<it>i </it>&#8712; [1, <it>ahap</it>]) is the number of potential SNPs defining haplotype <it>i</it>, and <it>ahap </it>is the number of haplotypes remaining after the poor quality haplotypes have been removed in step (a).</p>
               <p>(d) Calculate the standard deviation of the normalized number of potential SNPs among these haplotypes:</p>
               <p>
                  <m:math name="1471-2105-7-438-i8" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>D</m:mi>
                           <m:mo>=</m:mo>
                           <m:msqrt>
                              <m:mrow>
                                 <m:mfrac>
                                    <m:mrow>
                                       <m:mstyle displaystyle="true">
                                          <m:msubsup>
                                             <m:mo>&#8721;</m:mo>
                                             <m:mrow>
                                                <m:mi>i</m:mi>
                                                <m:mo>=</m:mo>
                                                <m:mn>1</m:mn>
                                             </m:mrow>
                                             <m:mrow>
                                                <m:mi>a</m:mi>
                                                <m:mi>h</m:mi>
                                                <m:mi>a</m:mi>
                                                <m:mi>p</m:mi>
                                             </m:mrow>
                                          </m:msubsup>
                                          <m:mrow>
                                             <m:msup>
                                                <m:mrow>
                                                   <m:mrow>
                                                      <m:mo>(</m:mo>
                                                      <m:mrow>
                                                         <m:mi>n</m:mi>
                                                         <m:mi>r</m:mi>
                                                         <m:mi>m</m:mi>
                                                         <m:mo>_</m:mo>
                                                         <m:mi>s</m:mi>
                                                         <m:mi>n</m:mi>
                                                         <m:msub>
                                                            <m:mi>p</m:mi>
                                                            <m:mi>i</m:mi>
                                                         </m:msub>
                                                         <m:mo>&#8722;</m:mo>
                                                         <m:mn>1</m:mn>
                                                      </m:mrow>
                                                      <m:mo>)</m:mo>
                                                   </m:mrow>
                                                </m:mrow>
                                                <m:mn>2</m:mn>
                                             </m:msup>
                                          </m:mrow>
                                       </m:mstyle>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mi>a</m:mi>
                                       <m:mi>h</m:mi>
                                       <m:mi>a</m:mi>
                                       <m:mi>p</m:mi>
                                    </m:mrow>
                                 </m:mfrac>
                              </m:mrow>
                           </m:msqrt>
                           <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                           <m:mrow>
                              <m:mo>[</m:mo>
                              <m:mn>3</m:mn>
                              <m:mo>]</m:mo>
                           </m:mrow>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGebarcqGH9aqpdaGcaaqaamaalaaabaWaaabmaeaadaqadaqaaiabd6gaUjabdkhaYjabd2gaTjabc+faFjabdohaZjabd6gaUjabdchaWnaaBaaaleaacqWGPbqAaeqaaOGaeyOeI0IaeGymaedacaGLOaGaayzkaaWaaWbaaSqabeaacqaIYaGmaaaabaGaemyAaKMaeyypa0JaeGymaedabaGaemyyaeMaemiAaGMaemyyaeMaemiCaahaniabggHiLdaakeaacqWGHbqycqWGObaAcqWGHbqycqWGWbaCaaaaleqaaOGaaCzcaiaaxMaadaWadaqaaiabiodaZaGaay5waiaaw2faaaaa@52F2@</m:annotation>
                     </m:semantics>
                  </m:math>
               </p>
               <p>In theory, the value of D can range from 0 to infinity. In our study, the value of D ranged from 0 to 1 in 98% of the clusters. A larger D-value indicates greater variation in the number of SNPs among haplotypes and therefore a higher probability that the cluster contains paralogs. The value of D can thus be used to identify clusters with a low probability of containing paralogs. It follows from its definition that the D-value can only be used to distinguish paralogous clusters if at least three haplotypes are identified in those clusters.</p>
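               <p>Steps (c) and (d) can be sketched in Python as follows. This is a minimal illustration of the normalization and of formula [3], not the implementation used in the pipeline; the function name, the list input and the handling of a cluster without any defining SNPs are assumptions.</p>
               <p>
```python
import math


def d_value(snp_counts):
    """Compute D from the number of potential SNPs defining each haplotype.

    snp_counts is a hypothetical list with one entry per haplotype that
    remains after step (a); its length corresponds to ahap.
    """
    if not snp_counts:
        raise ValueError("at least one haplotype is required")
    ahap = len(snp_counts)
    mean = sum(snp_counts) / ahap
    if mean == 0:
        # Assumption: without any defining SNPs there is no variation.
        return 0.0
    # Step (c): divide each count by the mean count, so the mean becomes 1.
    normalized = [count / mean for count in snp_counts]
    # Step (d): standard deviation of the normalized counts around 1.
    return math.sqrt(sum((x - 1.0) ** 2 for x in normalized) / ahap)
```
               </p>
               <p>In this sketch, three haplotypes with equal counts give D = 0, whereas four haplotypes with counts 4, 0, 0 and 0 give D equal to the square root of 3 (about 1.73).</p>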
            </sec>
            <sec>
               <st>
                  <p>(3) Identification of reliable SNPs</p>
               </st>
               <p>In addition to a redundancy-based criterion (every allele of a potential SNP must be represented by at least 2 ESTs), the algorithm applies a second, more stringent selection. This selection combines two measures: the major allele haplotype score and the minor allele haplotype score. The major allele is the allele occurring in the majority of the sequences in a cluster; the other allele is called the minor allele. The major allele haplotype score (<it>mahap</it>) is defined as the number of haplotypes with the major allelic nucleotide at a polymorphic site, and the minor allele haplotype score (<it>mihap</it>) is the number of haplotypes with the minor allelic nucleotide. The formulas are as follows:</p>
               <p>
                  <m:math name="1471-2105-7-438-i9" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>m</m:mi>
                           <m:mi>a</m:mi>
                           <m:mi>h</m:mi>
                           <m:mi>a</m:mi>
                           <m:mi>p</m:mi>
                           <m:mo>=</m:mo>
                           <m:mstyle displaystyle="true">
                              <m:msubsup>
                                 <m:mo>&#8721;</m:mo>
                                 <m:mrow>
                                    <m:mi>i</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mn>1</m:mn>
                                 </m:mrow>
                                 <m:mrow>
                                    <m:mi>a</m:mi>
                                    <m:mi>h</m:mi>
                                    <m:mi>a</m:mi>
                                    <m:mi>p</m:mi>
                                 </m:mrow>
                              </m:msubsup>
                              <m:mrow>
                                 <m:mi>m</m:mi>
                                 <m:mi>a</m:mi>
                                 <m:mi>h</m:mi>
                                 <m:mi>a</m:mi>
                                 <m:msub>
                                    <m:mi>p</m:mi>
                                    <m:mi>i</m:mi>
                                 </m:msub>
                              </m:mrow>
                           </m:mstyle>
                           <m:mrow>
                              <m:mo>{</m:mo>
                              <m:mrow>
                                 <m:mi>m</m:mi>
                                 <m:mi>a</m:mi>
                                 <m:mi>h</m:mi>
                                 <m:mi>a</m:mi>
                                 <m:msub>
                                    <m:mi>p</m:mi>
                                    <m:mi>i</m:mi>
                                 </m:msub>
                                 <m:mo>=</m:mo>
                                 <m:mn>1</m:mn>
                                 <m:mo>|</m:mo>
                                 <m:mfrac>
                                    <m:mrow>
                                       <m:mi>w</m:mi>
                                       <m:mi>h</m:mi>
                                       <m:mo>&#215;</m:mo>
                                       <m:mi>h</m:mi>
                                       <m:msub>
                                          <m:mi>a</m:mi>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                       <m:mo>+</m:mo>
                                       <m:mi>w</m:mi>
                                       <m:mi>l</m:mi>
                                       <m:mo>&#215;</m:mo>
                                       <m:mi>l</m:mi>
                                       <m:msub>
                                          <m:mi>a</m:mi>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mi>h</m:mi>
                                       <m:msub>
                                          <m:mi>c</m:mi>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                    </m:mrow>
                                 </m:mfrac>
                                 <m:mo>&#8805;</m:mo>
                                 <m:mn>0.75</m:mn>
                              </m:mrow>
                              <m:mo>}</m:mo>
                           </m:mrow>
                           <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                           <m:mrow>
                              <m:mo>[</m:mo>
                              <m:mn>4</m:mn>
                              <m:mo>]</m:mo>
                           </m:mrow>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGTbqBcqWGHbqycqWGObaAcqWGHbqycqWGWbaCcqGH9aqpdaaeWaqaaiabd2gaTjabdggaHjabdIgaOjabdggaHjabdchaWnaaBaaaleaacqWGPbqAaeqaaaqaaiabdMgaPjabg2da9iabigdaXaqaaiabdggaHjabdIgaOjabdggaHjabdchaWbqdcqGHris5aOWaaiWaaeaacqWGTbqBcqWGHbqycqWGObaAcqWGHbqycqWGWbaCdaWgaaWcbaGaemyAaKgabeaakiabg2da9iabigdaXiabcYha8naalaaabaGaem4DaCNaemiAaGMaey41aqRaemiAaGMaemyyae2aaSbaaSqaaiabdMgaPbqabaGccqGHRaWkcqWG3bWDcqWGSbaBcqGHxdaTcqWGSbaBcqWGHbqydaWgaaWcbaGaemyAaKgabeaaaOqaaiabdIgaOjabdogaJnaaBaaaleaacqWGPbqAaeqaaaaakiabgwMiZkabicdaWiabc6caUiabiEda3iabiwda1aGaay5Eaiaaw2haaiaaxMaacaWLjaWaamWaaeaacqaI0aanaiaawUfacaGLDbaaaaa@7677@</m:annotation>
                     </m:semantics>
                  </m:math>
               </p>
               <p>
                  <m:math name="1471-2105-7-438-i10" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>m</m:mi>
                           <m:mi>i</m:mi>
                           <m:mi>h</m:mi>
                           <m:mi>a</m:mi>
                           <m:mi>p</m:mi>
                           <m:mo>=</m:mo>
                           <m:mstyle displaystyle="true">
                              <m:msubsup>
                                 <m:mo>&#8721;</m:mo>
                                 <m:mrow>
                                    <m:mi>i</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mn>1</m:mn>
                                 </m:mrow>
                                 <m:mrow>
                                    <m:mi>a</m:mi>
                                    <m:mi>h</m:mi>
                                    <m:mi>a</m:mi>
                                    <m:mi>p</m:mi>
                                 </m:mrow>
                              </m:msubsup>
                              <m:mrow>
                                 <m:mi>m</m:mi>
                                 <m:mi>i</m:mi>
                                 <m:mi>h</m:mi>
                                 <m:mi>a</m:mi>
                                 <m:msub>
                                    <m:mi>p</m:mi>
                                    <m:mi>i</m:mi>
                                 </m:msub>
                              </m:mrow>
                           </m:mstyle>
                           <m:mrow>
                              <m:mo>{</m:mo>
                              <m:mrow>
                                 <m:mi>m</m:mi>
                                 <m:mi>i</m:mi>
                                 <m:mi>h</m:mi>
                                 <m:mi>a</m:mi>
                                 <m:msub>
                                    <m:mi>p</m:mi>
                                    <m:mi>i</m:mi>
                                 </m:msub>
                                 <m:mo>=</m:mo>
                                 <m:mn>1</m:mn>
                                 <m:mo>|</m:mo>
                                 <m:mfrac>
                                    <m:mrow>
                                       <m:mi>w</m:mi>
                                       <m:mi>h</m:mi>
                                       <m:mo>&#215;</m:mo>
                                       <m:mi>h</m:mi>
                                       <m:msub>
                                          <m:mi>b</m:mi>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                       <m:mo>+</m:mo>
                                       <m:mi>w</m:mi>
                                       <m:mi>l</m:mi>
                                       <m:mo>&#215;</m:mo>
                                       <m:mi>l</m:mi>
                                       <m:msub>
                                          <m:mi>b</m:mi>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mi>h</m:mi>
                                       <m:msub>
                                          <m:mi>c</m:mi>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                    </m:mrow>
                                 </m:mfrac>
                                 <m:mo>&#8805;</m:mo>
                                 <m:mn>0.75</m:mn>
                              </m:mrow>
                              <m:mo>}</m:mo>
                           </m:mrow>
                           <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                           <m:mrow>
                              <m:mo>[</m:mo>
                              <m:mn>5</m:mn>
                              <m:mo>]</m:mo>
                           </m:mrow>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGTbqBcqWGPbqAcqWGObaAcqWGHbqycqWGWbaCcqGH9aqpdaaeWaqaaiabd2gaTjabdMgaPjabdIgaOjabdggaHjabdchaWnaaBaaaleaacqWGPbqAaeqaaaqaaiabdMgaPjabg2da9iabigdaXaqaaiabdggaHjabdIgaOjabdggaHjabdchaWbqdcqGHris5aOWaaiWaaeaacqWGTbqBcqWGPbqAcqWGObaAcqWGHbqycqWGWbaCdaWgaaWcbaGaemyAaKgabeaakiabg2da9iabigdaXiabcYha8naalaaabaGaem4DaCNaemiAaGMaey41aqRaemiAaGMaemOyai2aaSbaaSqaaiabdMgaPbqabaGccqGHRaWkcqWG3bWDcqWGSbaBcqGHxdaTcqWGSbaBcqWGIbGydaWgaaWcbaGaemyAaKgabeaaaOqaaiabdIgaOjabdogaJnaaBaaaleaacqWGPbqAaeqaaaaakiabgwMiZkabicdaWiabc6caUiabiEda3iabiwda1aGaay5Eaiaaw2haaiaaxMaacaWLjaWaamWaaeaacqaI1aqnaiaawUfacaGLDbaaaaa@76AD@</m:annotation>
                     </m:semantics>
                  </m:math>
               </p>
               <p><it>ha</it><sub><it>i </it></sub>and <it>la</it><sub><it>i </it></sub>are the numbers of sequences with the major allelic nucleotide occurring in high quality and low quality regions, respectively (these regions are described in detail in the next section, filter 3); <it>hb</it><sub><it>i </it></sub>and <it>lb</it><sub><it>i </it></sub>are the numbers of sequences with the minor allelic nucleotide in high quality and low quality regions; <it>wh </it>and <it>wl </it>are the weight values for the high quality and low quality regions; <it>hc</it><sub><it>i </it></sub>is the number of sequences in haplotype <it>i </it>with information at the polymorphic site. When at least 75% of the members of one haplotype (weighted by region quality, as in formulas [4] and [5]) have the same major or minor allelic nucleotide, <it>mahap</it><sub><it>i </it></sub>or <it>mihap</it><sub><it>i </it></sub>equals 1 and the corresponding score is increased by 1; otherwise the scores remain unchanged, because in that case the correct nucleotide at this site of the haplotype cannot be assigned with confidence. Note that in "Haplotype reconstruction" in filter 2, we allowed for some discrepancies between haplotype members. When both <it>mahap </it>and <it>mihap </it>are greater than 1 in the cluster, each of the major and minor alleles occurs in at least one haplotype and the SNP can therefore be considered reliable.</p>
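               <p>Formulas [4] and [5] can be illustrated with the following Python sketch for a single polymorphic site. The class, the function and the default weights of 1.0 are assumptions chosen for the example; the weight values used by QualitySNP are not stated in this section.</p>
               <p>
```python
from dataclasses import dataclass


@dataclass
class SiteCounts:
    """Counts for one haplotype at one polymorphic site (hypothetical type)."""
    ha: int  # sequences with the major allele in high quality regions
    la: int  # sequences with the major allele in low quality regions
    hb: int  # sequences with the minor allele in high quality regions
    lb: int  # sequences with the minor allele in low quality regions
    hc: int  # sequences in the haplotype with information at the site


def haplotype_scores(haplotypes, wh=1.0, wl=1.0, threshold=0.75):
    """Return (mahap, mihap) for one site; wh and wl are placeholder weights."""
    mahap = 0
    mihap = 0
    for h in haplotypes:
        if h.hc == 0:
            # Assumption: a haplotype without information is skipped.
            continue
        if (wh * h.ha + wl * h.la) / h.hc >= threshold:
            mahap += 1
        if (wh * h.hb + wl * h.lb) / h.hc >= threshold:
            mihap += 1
    return mahap, mihap
```
               </p>
               <p>A haplotype in which neither allele reaches the 0.75 threshold contributes to neither score, which mirrors the case described above in which the nucleotide at the site cannot be assigned.</p>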
               <p>SNP patterns and SNP blocks are also defined in filter 2. SNP patterns are those SNPs with the same pattern of allele distribution over the haplotypes; they are determined as in the autoSNP program <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. SNP blocks are defined as sets of adjacent SNPs with the same SNP pattern. SNP pattern and SNP block information is part of the output of the pipeline and can be used, for instance, in linkage disequilibrium (LD) studies.</p>
            </sec>
         </sec>
         <sec>
            <st>
               <p>Filter 3: screening SNPs with high confidence score (HCS)</p>
            </st>
            <p>The third filter calculates a confidence score for every putative SNP according to the number of occurrences of each allele in high (HQ) and low (LQ) quality regions. In standard sequencing procedures the beginnings and the ends of the sequence are generally of lower quality than the rest of the sequence, and therefore are likely to contain more sequencing errors. We used potato ESTs with quality scores from the TIGR database to establish the boundaries of HQ and LQ for sequences (see Results), and these were used as default settings in our program.</p>
            <p>Based on the HQ and LQ regions, the confidence score of each allele is calculated according to the scoring rules defined in Figure <figr fid="F2">2</figr>. The confidence score of an allele is 5 if the allele occurs in more than one HQ region; 4 if it occurs in one HQ region and at least two LQ regions; 3 if in more than 3 LQ regions; 2 if in one HQ and one LQ region, or in 3 LQ regions; 1 if in 2 LQ regions; and 0 otherwise. The SNP confidence score is the smaller of the two allele confidence scores. In our study, we assigned a high confidence score (HCS) to SNPs with a confidence score of at least 2. This threshold can be adjusted by users according to their specific requirements.</p>
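            <p>The scoring rules can be written as a short Python function. The sketch below assumes that the rules are checked from the highest score downwards, so that an allele seen in one HQ region and three LQ regions scores 4 rather than 2; the function names and the (HQ, LQ) tuple input are likewise assumptions for this example.</p>
            <p>
```python
def allele_confidence(hq, lq):
    """Confidence score of one allele from its HQ and LQ occurrence counts."""
    if hq > 1:
        return 5
    if hq == 1 and lq >= 2:
        return 4
    if lq > 3:
        return 3
    if (hq == 1 and lq == 1) or lq == 3:
        return 2
    if lq == 2:
        return 1
    return 0


def snp_confidence(major, minor):
    """SNP score: the smaller of the two allele scores.

    major and minor are (hq, lq) tuples for the two alleles.
    """
    return min(allele_confidence(*major), allele_confidence(*minor))


def has_high_confidence(major, minor, min_score=2):
    """HCS SNPs are those with a confidence score of at least 2 by default."""
    return snp_confidence(major, minor) >= min_score
```
            </p>
            <p>For instance, a SNP whose major allele occurs in three HQ regions and whose minor allele occurs only in three LQ regions receives a score of 2 under these rules and would still count as HCS at the default threshold.</p>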
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Scoring rules for SNP confidence scores</p>
               </caption>
               <text>
                  <p>Scoring rules for SNP confidence scores. The SNP confidence score is the smaller of the two allele confidence scores. Black and white rectangles represent low quality regions (LQ) and high quality regions (HQ), respectively. The confidence scores for each allele are as follows: 5 if the allele occurs in more than one HQ; 4 if in one HQ and at least two LQ; 3 if in more than 3 LQ; 2 if in one HQ and one LQ, or in 3 LQ; 1 if in 2 LQ; otherwise 0.</p>
               </text>
               <graphic file="1471-2105-7-438-2"/>
            </fig>
            <p>If quality files of sequences are available, one additional filtering step is used to screen for reliable SNPs. In this filter "SNP quality" is calculated as the smaller value of the average quality scores of major and minor alleles per polymorphic site in a cluster. In our study we used a minimum (PHRED) score value of 20.</p>
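            <p>The "SNP quality" measure can be sketched as follows, assuming that the PHRED scores of the bases supporting each allele at the site are available as two lists. The function names and inputs are hypothetical.</p>
            <p>
```python
from statistics import mean


def snp_quality(major_quals, minor_quals):
    """Smaller of the average quality scores of the major and minor alleles."""
    return min(mean(major_quals), mean(minor_quals))


def passes_quality_filter(major_quals, minor_quals, min_score=20):
    """Keep the SNP when its quality reaches the minimum score (20 here)."""
    return snp_quality(major_quals, minor_quals) >= min_score
```
            </p>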
         </sec>
         <sec>
            <st>
               <p>Non-synonymous SNP identification</p>
            </st>
            <p>For the detection of non-synonymous SNPs (nsSNPs), synonymous SNPs (sSNPs) and SNPs in UTRs, two strategies can be used: alignment with reference protein sequences, or ORF prediction using programs such as ESTScan. Our approach uses the first method. FASTY was chosen over BLASTX as the tool to search the protein database because it allows for frameshifts within codons <abbrgrp><abbr bid="B31">31</abbr></abbrgrp> and produces better alignments with poor-quality sequences. In our potato EST analysis, the UniProt database was chosen as the reference database. After step 3 of our SNP detection program, a parsing program uses the FASTY results, together with the alignment information and the SNP information, to identify the SNP type (nsSNP, sSNP, or SNP in the 3' UTR or 5' UTR). For this, the FASTY results are first sorted by E-value to obtain the hit with the highest similarity. Next, any frameshift in the contig is detected and corrected, after which the ORF is determined. Finally, all nsSNPs and sSNPs in the protein hit region or coding region, and all SNPs or indels in the 3' UTR or 5' UTR, are identified.</p>
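            <p>Once the reading frame is known, the final classification step reduces to comparing the amino acids encoded by the two alleles. The Python sketch below shows only this last step, using the standard genetic code; it does not reproduce the FASTY parsing, frameshift correction or ORF detection described above, and all names in it are assumptions.</p>
            <p>
```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def classify_coding_snp(codon, position, alt_base):
    """Classify a SNP inside a codon as 'sSNP' or 'nsSNP'.

    codon is the reference codon (DNA, upper case), position is the
    0-based index of the SNP within the codon, alt_base is the other allele.
    """
    alt_codon = codon[:position] + alt_base + codon[position + 1:]
    if CODON_TABLE[codon] == CODON_TABLE[alt_codon]:
        return "sSNP"
    return "nsSNP"
```
            </p>
            <p>With this table, changing GAA to GAG leaves glutamate unchanged (sSNP), whereas changing GAA to AAA replaces it by lysine (nsSNP).</p>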
         </sec>
         <sec>
            <st>
               <p>Database and SNP information retrieval system</p>
            </st>
            <p>All files containing relevant SNP information are transferred to the database. An SQL script for SNP database creation and data loading is produced automatically by the pipeline. The data in this database can be made accessible through a web server. PHP scripts for generating web pages are supplied together with the code of the pipeline. These scripts allow easy retrieval of SNP information from the database, as well as BLAST searching (for an example, see our website <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>).</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <p>The new pipeline for SNP detection presented here, called QualitySNP, distinguishes itself from other programs mainly in its approach to detecting sequencing errors and paralogous sequences. The source code and the manual of the program are freely available for academic use [<abbrgrp><abbr bid="B32">32</abbr></abbrgrp>, see Additional file <supplr sid="S1">1</supplr>, <supplr sid="S3">3</supplr>], and a sample dataset for testing QualitySNP is available as well [<abbrgrp><abbr bid="B32">32</abbr></abbrgrp>, see <supplr sid="S2">Additional file 2</supplr>]. To demonstrate the specific properties and advantages of our program, we used potato, human and chicken ESTs as targets for SNP identification. Potato was chosen because it is a tetraploid species whose cultivars consist of clonally propagated, heterozygous genotypes. The high level of heterozygosity and the tetraploid nature present problems for most currently available SNP detection programs, in particular in the discrimination of paralogs from alleles. In addition, plant genomes contain large numbers of duplications <abbrgrp><abbr bid="B33">33</abbr><abbr bid="B34">34</abbr></abbrgrp>, which may complicate the detection of reliable SNPs. The human and chicken datasets were used as references for 'normal' diploid species and to illustrate specific properties and advantages of QualitySNP.</p>
         <suppl id="S1">
            <title>
               <p>Additional file 1</p>
            </title>
            <text>
                <p>QualitySNP. The source code of QualitySNP. On a Unix/Linux computer, the file is unpacked with the command "gunzip QualitySNP.tar.gz" followed by "tar -xvf QualitySNP.tar".</p>
            </text>
            <file name="1471-2105-7-438-S1.gz">
               <p>Click here for file</p>
            </file>
         </suppl>
         <suppl id="S2">
            <title>
               <p>Additional file 2</p>
            </title>
            <text>
                <p>testQualitySNP. A dataset for testing QualitySNP. On a Unix/Linux computer, the file is unpacked with the command "gunzip testQualitySNPseq.tar.gz" followed by "tar -xvf testQualitySNPseq.tar".</p>
            </text>
            <file name="1471-2105-7-438-S2.gz">
               <p>Click here for file</p>
            </file>
         </suppl>
         <suppl id="S3">
            <title>
               <p>Additional file 3</p>
            </title>
            <text>
               <p>QualitySNP manual. The manual of QualitySNP</p>
            </text>
            <file name="1471-2105-7-438-S3.pdf">
               <p>Click here for file</p>
            </file>
         </suppl>
         <sec>
            <st>
               <p>Predicting haplotypes</p>
            </st>
            <p>In the first step, 83,565 potato ESTs were assembled into 10,670 clusters (Figure <figr fid="F1">1</figr>, step 1), of which 4864 clusters contained 4&#8211;100 members (Figure <figr fid="F1">1</figr>, step 2). After the analysis of alignment information of these 4864 clusters, 3081 clusters with potential SNPs were detected (Figure <figr fid="F1">1</figr>, step 3, filter 1). These 3081 clusters contained 41,532 ESTs (average of 14 ESTs per cluster) and 31,815 potential SNPs (average of 10 SNPs per cluster). In the tetraploid potato a maximum of 4 haplotypes per plant can be expected. For most haplotypes sufficient redundancy was available to use a similarity threshold per SNP site (<it>S</it><sub><it>ij</it></sub>, formula <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>) of 75%: for haplotypes with fewer than 4 members, all sequences must have the same nucleotide at a single SNP site, whereas for haplotypes with 4 or more members at least 75% of the sequences must match the consensus nucleotide.</p>
            <p>In the current dataset a contig was on average 1200 nucleotides long and ESTs were about 500&#8211;600 nucleotides long, so every EST would contain about 5 potential SNP sites on average. The frequency of sequencing errors in EST data from NCBI dbEST was found to be 1 in 500 nucleotides <abbrgrp><abbr bid="B35">35</abbr></abbrgrp>. Assuming a similar error rate, on average one of the five potential SNPs per EST observed in our data would be a sequencing error. With this in mind, the similarity threshold (<it>S</it><sub><it>i</it></sub>, formula <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>) for the whole sequence was set to 80% for every EST in one haplotype. With these settings, 85% of the clusters contained at most four haplotypes (Table <tblr tid="T1">1</tblr>); when haplotypes with only one member were excluded, the percentage of clusters with at most four haplotypes increased to 96% (Table <tblr tid="T1">1</tblr>), which agrees with the tetraploid nature of the potato cultivar Kennebec.</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Relationship of the number of haplotypes and the number of clusters derived from potato variety "Kennebec".</p>
               </caption>
               <tblbdy cols="3">
                  <r>
                     <c ca="center">
                        <p>No. of haplotypes in one cluster</p>
                     </c>
                     <c ca="center">
                        <p>No. of clusters<sup><it>a</it></sup></p>
                     </c>
                     <c ca="center">
                        <p>No. of clusters with ahap<sup><it>b</it></sup></p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>1478</p>
                     </c>
                     <c ca="center">
                        <p>1924</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>679</p>
                     </c>
                     <c ca="center">
                        <p>680</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>452</p>
                     </c>
                     <c ca="center">
                        <p>347</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>253</p>
                     </c>
                     <c ca="center">
                        <p>99</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>131</p>
                     </c>
                     <c ca="center">
                        <p>24</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>51</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>8</p>
                     </c>
                     <c ca="center">
                        <p>25</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>more than 8</p>
                     </c>
                     <c ca="center">
                        <p>12</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>total clusters with SNPs</p>
                     </c>
                     <c ca="center">
                        <p>3081</p>
                     </c>
                     <c ca="center">
                        <p>3081</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>clusters with at most 4 haplotypes</p>
                     </c>
                     <c ca="center">
                        <p>2609 (84.7%)</p>
                     </c>
                     <c ca="center">
                        <p>2951 (95.8%)</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p><sup><it>a </it></sup>all haplotypes in clusters were counted; <sup><it>b </it></sup>when clusters contained at least 2 haplotypes with at least 2 members, only these haplotypes were counted.</p>
               </tblfn>
            </tbl>
            <p>The thresholds of 75% and 80% are the default values in our program, but they can be adjusted by users according to their specific requirements.</p>
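            <p>The per-site rule described above can be illustrated with a minimal Python sketch. It is not taken from the QualitySNP source code: the function name <it>site_is_consistent</it> and its arguments are hypothetical, and the consensus nucleotide is assumed to be the most frequent nucleotide at the site within the haplotype.</p>

```python
from collections import Counter


def site_is_consistent(nucleotides, threshold=0.75, min_members=4):
    """Check one SNP site within one candidate haplotype (illustrative only).

    nucleotides: the bases observed at this site, one per member sequence.
    threshold:   per-site similarity threshold (default 75%, as in the text).
    min_members: group size from which the percentage rule applies.
    """
    if not nucleotides:
        raise ValueError("at least one nucleotide is required")
    n_members = len(nucleotides)
    # Assumption: the consensus is the most frequent nucleotide at the site.
    consensus_count = Counter(nucleotides).most_common(1)[0][1]
    if min_members > n_members:
        # Fewer than 4 members: all sequences must carry the same nucleotide.
        return consensus_count == n_members
    # 4 or more members: at least 75% must match the consensus nucleotide.
    return consensus_count >= threshold * n_members
```

            <p>Under these assumptions, a group of three sequences is accepted only if all three agree at the site, whereas a group of four tolerates a single deviating sequence.</p>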
         </sec>
         <sec>
            <st>
               <p>Predicting paralogs based on haplotypes</p>
            </st>
            <p>Our method uses the standard deviation (D) of the (normalized) number of SNPs per haplotype to identify clusters that probably contain paralogs (see formula <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>). To obtain a D-value threshold, we assumed that clusters with 4&#8211;20 members contained mostly allelic sequences <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>, and that all clusters with at least 100 members contained paralogous sequences <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B16">16</abbr></abbrgrp>. Under this assumption, 2,544 clusters from the potato dataset were considered to be allelic clusters and 28 clusters with 100 to 300 members were considered to be paralogous clusters. Figure <figr fid="F3">3a</figr> shows how the normalized numbers of clusters in the presumed paralogous and allelic datasets vary with the D-value threshold. With increasing D-value threshold, the number of presumably paralogous clusters increased sharply, whereas the number of allelic clusters hardly changed. The two lines crossed at a D-value threshold of approximately 0.6, which was considered optimal for the screening of paralogs in the potato dataset. Of the presumed paralogous dataset, 17.8% (5 clusters) had D-values less than 0.6; these were most likely not paralogous clusters but, for instance, allelic clusters of highly expressed single genes (false negatives). Of the presumed allelic dataset, 9.8% (252 clusters) had D-values of more than 0.6; these may be clusters with sequences from lowly expressed paralogous genes (false positives) (Figure <figr fid="F3">3a</figr>). Using this default setting, 2651 (86%) clusters had a D-value lower than 0.6, and these were therefore considered to be most likely free of paralogs. We used the same approach to determine the D-value threshold for a chicken EST dataset of 100,000 sequences. Here, 3,426 clusters with 4&#8211;20 members and 23 clusters with 100&#8211;300 members were used to obtain the D-value threshold. In this case the lines crossed at a D-value of 0.9 (Figure <figr fid="F3">3b</figr>); 8.7% (2) of the clusters from the presumed paralogous dataset may in fact be allelic clusters from highly expressed genes (false negatives), whereas 4.5% (153) of the clusters of the allelic set most likely contained lowly expressed paralogs (false positives).</p>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>The relationship of the normalized number of clusters in the datasets containing allelic sequences and paralogs</p>
               </caption>
               <text>
                  <p>The relationship of the normalized number of clusters in the datasets containing allelic sequences and paralogs. The normalized numbers of clusters presumed to contain allelic sequences (clusters with 4&#8211;20 members; &#9651;) and paralogs (clusters with 100&#8211;300 members; &#9675;) are shown in relation to the D-value threshold for (a) the potato data, and (b) the chicken data.</p>
               </text>
               <graphic file="1471-2105-7-438-3"/>
            </fig>
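            <p>The false negative and false positive percentages quoted above follow directly from counting clusters on either side of a D-value threshold. The Python sketch below shows that bookkeeping on hypothetical D-values. It is not part of QualitySNP: both function names are illustrative, and choosing the candidate threshold at which the two error rates are closest is only an assumed stand-in for reading the crossing point from Figure 3.</p>

```python
def error_rates(allelic_d, paralog_d, threshold):
    """Return (false_negative_rate, false_positive_rate) for one threshold.

    allelic_d: D-values of clusters presumed allelic (4-20 members).
    paralog_d: D-values of clusters presumed paralogous (100-300 members).
    """
    # False negatives: presumed paralogous clusters with D below the threshold.
    fn = sum(1 for d in paralog_d if threshold > d) / len(paralog_d)
    # False positives: presumed allelic clusters with D above the threshold.
    fp = sum(1 for d in allelic_d if d > threshold) / len(allelic_d)
    return fn, fp


def most_balanced_threshold(allelic_d, paralog_d, candidates):
    """Pick the candidate threshold where the two error rates are closest.

    Assumption: this only approximates the graphical crossing point.
    """
    def imbalance(t):
        fn, fp = error_rates(allelic_d, paralog_d, t)
        return abs(fn - fp)

    return min(candidates, key=imbalance)
```

            <p>With the counts reported for potato, the same arithmetic gives 5 of 28 presumed paralogous clusters below the threshold of 0.6, that is 17.8%.</p>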
         </sec>
         <sec>
            <st>
               <p>Evaluation of paralog identification</p>
            </st>
            <p>To evaluate the paralog identification routine of QualitySNP, 15 human UniGene datasets from NCBI and the human genomic sequence from UCSC were analyzed. QualitySNP was executed on these UniGene datasets individually after clustering by CAP3. Clusters with a D-value higher than 0.6 (default setting) were considered to contain paralogs. For the clusters identified by QualitySNP as paralogous, the consensus sequences (as generated by CAP3) were compared to the human genomic sequence using the BLAT server of UCSC. For 49 of the 62 (79%) presumed paralogous clusters the consensus sequences picked up multiple loci in the genome (with a similarity setting of 90% for 90% of the whole sequence) (see Table <tblr tid="T2">2</tblr>). Further analysis of some of these clusters revealed that separate paralogous genes on the human genome were indeed represented in the clusters. The majority of the clusters identified as paralogous by QualitySNP were from UniGene datasets related to gene families. For instance, UniGene Hs.510635 is located on 14q32.33, a region in which at least 9 highly similar genes (IGHD, IGHG1, IGH@, IGHG3, IGHA2, IGHM, IGHG2, IGHA1, IGHG4) are present. For this UniGene dataset, different haplotypes in a paralogous cluster were also found to correspond to different genes (IGHA1 and IGHD1).</p>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Paralog identification by QualitySNP in human UniGene datasets.</p>
               </caption>
               <tblbdy cols="7">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="3" ca="center">
                        <p>D-value &#8805; 0.6</p>
                     </c>
                     <c cspan="3" ca="center">
                        <p>D-value &#8805; 0.9</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>UniGene</p>
                     </c>
                     <c ca="center">
                        <p>No. of clusters<sup><it>a</it></sup></p>
                     </c>
                     <c ca="center">
                        <p>confirmed</p>
                     </c>
                     <c ca="center">
                        <p>unconfirmed</p>
                     </c>
                     <c ca="center">
                        <p>No. of clusters<sup><it>b</it></sup></p>
                     </c>
                     <c ca="center">
                        <p>confirmed</p>
                     </c>
                     <c ca="center">
                        <p>unconfirmed</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Hs.300701</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Hs.533717</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Hs.12956</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Hs.22543</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Hs.468478</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Hs.591503</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Hs.567284</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Hs.510172</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Hs.406754</p>
                     </c>
                     <c ca="center">
                        <p>10</p>
                     </c>
                     <c ca="center">
                        <p>10</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Hs.510635</p>
                     </c>
                     <c ca="center">
                        <p>29</p>
                     </c>
                     <c ca="center">
                        <p>28</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>16</p>
                     </c>
                     <c ca="center">
                        <p>16</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Hs.61635</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Hs.631881</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Hs.104741</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Hs.534639</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Hs.18069</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Total</p>
                     </c>
                     <c ca="center">
                        <p>62</p>
                     </c>
                     <c ca="center">
                        <p>49 (79.03%)</p>
                     </c>
                     <c ca="center">
                        <p>13 (20.97%)</p>
                     </c>
                     <c ca="center">
                        <p>26</p>
                     </c>
                     <c ca="center">
                        <p>23 (88.46%)</p>
                     </c>
                     <c ca="center">
                        <p>3 (11.54%)</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p><sup><it>a </it></sup>clusters with a D-value of more than 0.6 are considered by QualitySNP to contain paralogs; <sup><it>b </it></sup>clusters with a D-value of more than 0.9 are considered to contain paralogs. Confirmed: number of clusters that were shown to contain paralogous sequences by the BLAT analysis.</p>
               </tblfn>
            </tbl>
            <p>When the D-value was increased to 0.9, the number of clusters identified as potentially paralogous decreased from 62 to 26; 88% of those clusters (23) were verified by the BLAT analysis against the human genome.</p>
         </sec>
         <sec>
            <st>
               <p>Predicting reliable SNPs</p>
            </st>
            <p>The quality of the SNP data was further improved by taking the quality of the sequence data into account. This was demonstrated with the potato EST dataset. Figure <figr fid="F4">4</figr> shows the relationship between the length of the low quality region and the number of potato EST sequences. The threshold for a high quality region was a minimum average PHRED score of 20 in a sliding window of 50 nucleotides. From Figure <figr fid="F4">4</figr> it is clear that the low quality region (LQ) at the beginning (5' end) of sequences was shorter than the LQ at the 3' end. At the 5' end, 90% of the sequences had an LQ of less than 30 nucleotides (Figure <figr fid="F4">4a</figr>). At the 3' end, a large number of sequences had LQs of over 100 nucleotides. Setting a fixed nucleotide limit would either exclude many sequences with short LQs, or include many sequences with large LQs at the 3' end. Figure <figr fid="F4">4b</figr> shows that there is a relationship between the length of the LQ at the 3' end and the total length of the sequence. Therefore, the default settings for the LQ are 30 nucleotides from the 5' end and 20% of the whole sequence at the 3' end. In formulas <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> and <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> for <it>mahap </it>and <it>mihap</it>, the default weights for HQ (<it>wh</it>) and LQ (<it>wl</it>) were set to 1.0, but these can be adjusted according to the data quality. For example, if the sequence quality in the low quality regions is very poor, the parameters can be set to <it>wl </it>= 0.5 and <it>wh </it>= 1.0. In filter 3 confidence scores are calculated (see Implementation section, filter 3).</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>The distribution of low quality region of TIGR potato sequences</p>
               </caption>
               <text>
                  <p>The distribution of low quality region of TIGR potato sequences. (a) Frequency distribution of the size of the low quality region (LQ). The size of LQ from the 5' side is the line with black triangles; the size of LQ from the 3' side is the line with black circles. (b) As 4a, with the size of the 3' LQ region expressed as percentage of the sequence length.</p>
               </text>
               <graphic file="1471-2105-7-438-4"/>
            </fig>
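            <p>The two ways of delimiting the low quality region discussed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the QualitySNP implementation: the function names are hypothetical, the first function simply applies the default settings (30 nucleotides at the 5' end, 20% of the read at the 3' end), and the second assumes that the high quality region spans from the first to the last 50-nucleotide window whose mean PHRED score is at least 20.</p>

```python
def default_hq_region(length, lq5=30, lq3_percent=20):
    """Return (start, end) of the high quality region using default settings.

    Positions are 0-based and end is exclusive; integer arithmetic is used
    so that the 3' cut-off is an exact number of nucleotides.
    """
    lq3 = length * lq3_percent // 100
    return lq5, length - lq3


def lq_lengths(phred, window=50, min_mean=20.0):
    """Estimate the 5' and 3' low quality lengths from per-base PHRED scores.

    Assumption: a window is 'good' when its mean score is at least min_mean;
    the 5' LQ ends where the first good window starts and the 3' LQ begins
    where the last good window ends. Returns None if no window qualifies.
    """
    n = len(phred)
    if window > n:
        return None
    total = sum(phred[:window])
    good_starts = []
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window by one base: add the new base, drop the old one.
            total += phred[start + window - 1] - phred[start - 1]
        if total >= min_mean * window:
            good_starts.append(start)
    if not good_starts:
        return None
    return good_starts[0], n - (good_starts[-1] + window)
```

            <p>Because a window average tolerates some poor bases at its edges, the window-based estimate is more lenient than a per-base cut-off; this is a property of the sketch and not a statement about the behaviour of the pipeline.</p>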
            <p>Using the default settings, the major and minor allele haplotype scores were calculated, which resulted in a selection of 17,745 reliable SNPs from a total of 31,815 potential SNPs in the potato EST dataset. An additional 1,020 SNPs with confidence scores less than 2 were dropped from the set, which left 16,725 reliable bi-SNPs including 1815 indels. The ratio of transitions (C for T or A for G) to transversions (C for G/A or T for G/A) was 1.9 (9853/5057), the frequency of reliable SNPs was one SNP per 224 nucleotides, and the frequency of indels was one per 2,070 nucleotides (further statistical information is presented at our website <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>).</p>
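            <p>The distinction between transitions and transversions used for this ratio can be made explicit with a short Python sketch. The function name is hypothetical and the code is not part of the pipeline; it relies only on the standard definition that a transition exchanges two purines (A, G) or two pyrimidines (C, T), whereas a transversion exchanges a purine for a pyrimidine or vice versa.</p>

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_class(allele_1, allele_2):
    """Classify a bi-allelic substitution as a transition or a transversion."""
    a, b = allele_1.upper(), allele_2.upper()
    valid = PURINES | PYRIMIDINES
    if a not in valid or b not in valid or a == b:
        raise ValueError("expected two different nucleotides out of A, C, G, T")
    # Both alleles in the same chemical class means a transition.
    same_class = (a in PURINES) == (b in PURINES)
    return "transition" if same_class else "transversion"


# Counts reported in the text for the potato dataset.
ratio = 9853 / 5057
```

            <p>Rounded to one decimal, the ratio computed from the reported counts is 1.9.</p>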
            <p>The accuracy of determining SNP reliability from sequence data only was evaluated using another potato EST dataset from TIGR that included quality files. In this dataset 21,240 potential SNPs were detected by filter 1. Of these, 7,971 reliable SNPs were identified by filter 2, and 6,431 were assigned a high confidence score (HCS, confidence score of at least 2) by filter 3. The SNP quality score was calculated by an additional filter from the quality files (see Implementation section, filter 3). The distribution of potential (filter 1), reliable (filter 2) and HCS (filter 3) SNPs over the PHRED quality scores is shown in Figure <figr fid="F5">5</figr>. After filter 2, 66% of the reliable SNPs had a quality score above 20. Applying filter 3 increased the percentage of reliable HCS SNPs with a quality score above 20 to 78%.</p>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>The distribution of potential (filter 1), reliable (filter 2) and HCS (filter 3) SNPs over the PHRED quality scores</p>
               </caption>
               <text>
                  <p>The distribution of potential (filter 1), reliable (filter 2) and HCS (filter 3) SNPs over the PHRED quality scores. High confidence score (HCS) means the SNP confidence score is at least 2. The EST dataset used for this analysis included quality files obtained from the TIGR potato gene index.</p>
               </text>
               <graphic file="1471-2105-7-438-5"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Predicting non-synonymous SNPs</p>
            </st>
            <p>Depending on the research context, users of the pipeline may be interested in predicting non-synonymous SNPs (nsSNPs). This is particularly the case for studies involving protein structure and functional domains, where SNPs might affect the function of a protein. The UniProt database was searched with the consensus sequences of 2651 clusters from the potato dataset (selected with a D-value threshold of 0.6). FASTY identified 2167 (81.7%) contigs with an open reading frame (ORF) that matched entries in the UniProt database, including 102 contigs that were corrected for frameshifts. This indicates that UniProt had sufficient coverage to act as a reference protein database for potato. Using the FASTY results, 10,354 reliable SNPs were identified in protein-encoding regions, 34% of which were nsSNPs. This is similar to the results obtained by other authors: 35% in chicken ESTs <abbrgrp><abbr bid="B8">8</abbr></abbrgrp> and 32% in Arabidopsis <abbrgrp><abbr bid="B36">36</abbr></abbrgrp>.</p>
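            <p>To illustrate what the non-synonymous classification involves, the following Python sketch tests whether a single-base substitution changes the encoded amino acid. It is a minimal sketch and not the pipeline's implementation: it assumes that the reading frame is already known (in the pipeline this information comes from the FASTY alignment), it uses the standard genetic code, and the function name is_nonsynonymous and its parameters are hypothetical.</p>

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_nonsynonymous(codon, position, alt_base):
    """Return True if replacing codon[position] with alt_base changes the amino acid.

    position is the 0-based index of the SNP within the codon (0, 1 or 2).
    """
    codon = codon.upper()
    mutated = codon[:position] + alt_base.upper() + codon[position + 1:]
    return CODON_TABLE[codon] != CODON_TABLE[mutated]


print(is_nonsynonymous("GAA", 2, "G"))  # False: GAA and GAG both encode E
print(is_nonsynonymous("GAA", 1, "T"))  # True: GAA (E) becomes GTA (V)
```

            <p>A SNP in the third codon position is often synonymous, as in the first example, whereas a change in the second position always alters the amino acid or introduces a stop codon.</p>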
         </sec>
         <sec>
            <st>
               <p>Validation of reliable SNPs and comparison with other programs</p>
            </st>
            <p>Reliable SNPs identified by QualitySNP in the potato dataset were validated with experimental data. For this, we used 37 amplified and sequenced loci containing 60 SNPs and one indel identified by QualitySNP in the potato variety Kennebec (Van der Linden et al., in prep.). Three SNPs turned out to be false, and for 8 SNPs the resequencing data were not conclusive, most likely due to the tetraploid nature of potato. The remaining 50 SNPs as well as the indel were confirmed, demonstrating the reliability of the SNPs identified by QualitySNP.</p>
            <p>To date, PolyBayes has been considered one of the best SNP detection programs. However, it requires EST trace files, quality files or the genomic sequence in order to perform its task. This limits the usability of PolyBayes and similar programs, such as PolyFreq <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>, to cases in which these conditions are met, and therefore excludes EST datasets for which no genomic sequences or quality files are available yet. autoSNP does not suffer from these limitations. We therefore compared the performance of QualitySNP with that of autoSNP on an identical dataset.</p>
            <p>QualitySNP and autoSNP were used to identify SNPs in nineteen individual human UniGene datasets. For each cluster of EST sequences, the consensus sequence was compared by BLAST against the SNP loci of dbSNP (default settings, E-value 0.01). Each dbSNP SNP that occurs in the consensus sequence was located by finding the perfect match of its sequence context in the consensus sequence. An SNP is considered to be confirmed when the SNP locus (approx. 60 nt) matches the consensus sequence for 90% or more and the SNP identity (location and substitution) is confirmed. The results are summarized in Table <tblr tid="T3">3</tblr>. As dbSNP is not complete (only part of the potential SNPs in the human genome are represented in the database), a number of SNPs identified in the EST dataset will not match dbSNP entries. Nevertheless, in total 35% of the SNPs identified by QualitySNP were confirmed. This was over four times the percentage for SNPs identified by autoSNP (8%). Moreover, most of the confirmed SNPs detected by autoSNP were also detected by QualitySNP. In addition, QualitySNP (a C program) identified SNPs much more efficiently than autoSNP (a Perl program): it used less CPU time, which is especially evident when large clusters are present.</p>
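            <p>The confirmation rule and the percentages reported in the table can be expressed compactly. The Python sketch below is illustrative and is not the code used in this study; the function names, their parameters and the default threshold argument are assumptions chosen to mirror the criteria stated above (at least 90% identity between the SNP locus and the consensus, plus agreement on location and substitution). The example counts are taken from the row for UniGene cluster Hs.510635.</p>

```python
def is_confirmed(locus_identity, same_location, same_substitution, min_identity=0.90):
    """Apply the confirmation rule described in the text.

    locus_identity is the fraction (0.0 to 1.0) of the dbSNP locus that matches
    the consensus sequence; the two flags state whether the SNP location and
    the substitution agree with the dbSNP entry.
    """
    return locus_identity >= min_identity and same_location and same_substitution


def confirmation_rate(confirmed, candidates):
    """Percentage of candidate SNPs confirmed in dbSNP (0.0 if no candidates)."""
    if candidates == 0:
        return 0.0
    return 100.0 * confirmed / candidates


# Counts for UniGene cluster Hs.510635 from the table.
print(round(confirmation_rate(198, 535), 1))  # 37.0 (QualitySNP)
print(round(confirmation_rate(92, 895), 1))   # 10.3 (autoSNP)
```

            <p>Clusters without candidate SNPs are reported as 0% in the table, which is why the sketch returns 0.0 rather than dividing by zero.</p>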
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Validation of SNPs detected by QualitySNP and autoSNP in nineteen human UniGene datasets.</p>
               </caption>
               <tblbdy cols="14">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="4" ca="center">
                        <p>QualitySNP (D &lt;= 0.6)</p>
                     </c>
                     <c cspan="4" ca="center">
                        <p>autoSNP</p>
                     </c>
                     <c cspan="3" ca="center">
                        <p>their overlap</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="11">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>chromosome</p>
                     </c>
                     <c ca="center">
                        <p>UniGene</p>
                     </c>
                     <c ca="center">
                        <p>Size</p>
                     </c>
                     <c ca="center">
                        <p>Time(m)</p>
                     </c>
                     <c ca="center">
                        <p>candidate SNPs<sup><it>a</it></sup></p>
                     </c>
                     <c ca="center">
                        <p>confirmed</p>
                     </c>
                     <c ca="center">
                        <p>unconfirmed</p>
                     </c>
                     <c ca="center">
                        <p>Time(m)</p>
                     </c>
                     <c ca="center">
                        <p>candidate SNPs<sup><it>b</it></sup></p>
                     </c>
                     <c ca="center">
                        <p>confirmed</p>
                     </c>
                     <c ca="center">
                        <p>unconfirmed</p>
                     </c>
                     <c ca="center">
                        <p>candidate SNPs</p>
                     </c>
                     <c ca="center">
                        <p>confirmed</p>
                     </c>
                     <c ca="center">
                        <p>unconfirmed</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="14">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>Hs.300701</p>
                     </c>
                     <c ca="center">
                        <p>3640</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>18</p>
                     </c>
                     <c ca="center">
                        <p>5 (27.8%)</p>
                     </c>
                     <c ca="center">
                        <p>13</p>
                     </c>
                     <c ca="center">
                        <p>150</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>0 (0%)</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>0(0%)</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>Hs.401316</p>
                     </c>
                     <c ca="center">
                        <p>1090</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0 (0%)</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>0 (0%)</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0(0%)</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>14</p>
                     </c>
                     <c ca="center">
                        <p>Hs.533717</p>
                     </c>
                     <c ca="center">
                        <p>1601</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>12</p>
                     </c>
                     <c ca="center">
                        <p>3 (25%)</p>
                     </c>
                     <c ca="center">
                        <p>9</p>
                     </c>
                     <c ca="center">
                        <p>26</p>
                     </c>
                     <c ca="center">
                        <p>166</p>
                     </c>
                     <c ca="center">
                        <p>1 (0.6%)</p>
                     </c>
                     <c ca="center">
                        <p>165</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0(0%)</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>17</p>
                     </c>
                     <c ca="center">
                        <p>Hs.12956</p>
                     </c>
                     <c ca="center">
                        <p>622</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>10</p>
                     </c>
                     <c ca="center">
                        <p>2 (20%)</p>
                     </c>
                     <c ca="center">
                        <p>8</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>15</p>
                     </c>
                     <c ca="center">
                        <p>1 (6.7%)</p>
                     </c>
                     <c ca="center">
                        <p>14</p>
                     </c>
                     <c ca="center">
                        <p>9</p>
                     </c>
                     <c ca="center">
                        <p>1(11.11%)</p>
                     </c>
                     <c ca="center">
                        <p>8</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>19</p>
                     </c>
                     <c ca="center">
                        <p>Hs.515126</p>
                     </c>
                     <c ca="center">
                        <p>654</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0 (0%)</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>44</p>
                     </c>
                     <c ca="center">
                        <p>0 (0%)</p>
                     </c>
                     <c ca="center">
                        <p>44</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0(0%)</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>15</p>
                     </c>
                     <c ca="center">
                        <p>Hs.22543</p>
                     </c>
                     <c ca="center">
                        <p>847</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>10</p>
                     </c>
                     <c ca="center">
                        <p>1 (10%)</p>
                     </c>
                     <c ca="center">
                        <p>9</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>1 (25%)</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>1(100%)</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>Hs.468478</p>
                     </c>
                     <c ca="center">
                        <p>183</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0 (0%)</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0 (0%)</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0(0%)</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>Hs.591503</p>
                     </c>
                     <c ca="center">
                        <p>200</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>2 (33.3%)</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>0 (0%)</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>0(0%)</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>Hs.567284</p>
                     </c>
                     <c ca="center">
                        <p>194</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>0 (0%)</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>8</p>
                     </c>
                     <c ca="center">
                        <p>0 (0%)</p>
                     </c>
                     <c ca="center">
                        <p>8</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>0(0%)</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>Hs.510172</p>
                     </c>
                     <c ca="center">
                        <p>282</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0 (0%)</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0 (0%)</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0(0%)</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>17</p>
                     </c>
                     <c ca="center">
                        <p>Hs.406754</p>
                     </c>
                     <c ca="center">
                        <p>6453</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>49</p>
                     </c>
                     <c ca="center">
                        <p>25 (51%)</p>
                     </c>
                     <c ca="center">
                        <p>24</p>
                     </c>
                     <c ca="center">
                        <p>51</p>
                     </c>
                     <c ca="center">
                        <p>43</p>
                     </c>
                     <c ca="center">
                        <p>6 (14%)</p>
                     </c>
                     <c ca="center">
                        <p>37</p>
                     </c>
                     <c ca="center">
                        <p>14</p>
                     </c>
                     <c ca="center">
                        <p>5(35.71%)</p>
                     </c>
                     <c ca="center">
                        <p>9</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>14</p>
                     </c>
                     <c ca="center">
                        <p>Hs.510635</p>
                     </c>
                     <c ca="center">
                        <p>27193</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>535</p>
                     </c>
                     <c ca="center">
                        <p>198 (37%)</p>
                     </c>
                     <c ca="center">
                        <p>337</p>
                     </c>
                     <c ca="center">
                        <p>13</p>
                     </c>
                     <c ca="center">
                        <p>895</p>
                     </c>
                     <c ca="center">
                        <p>92 (10.3%)</p>
                     </c>
                     <c ca="center">
                        <p>803</p>
                     </c>
                     <c ca="center">
                        <p>143</p>
                     </c>
                     <c ca="center">
                        <p>86(60.14%)</p>
                     </c>
                     <c ca="center">
                        <p>57</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>Hs.61635</p>
                     </c>
                     <c ca="center">
                        <p>82</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0 (0%)</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0 (0%)</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0(0%)</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>Hs.631881</p>
                     </c>
                     <c ca="center">
                        <p>355</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>0 (0%)</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0 (0%)</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0(0%)</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>8</p>
                     </c>
                     <c ca="center">
                        <p>Hs.104741</p>
                     </c>
                     <c ca="center">
                        <p>275</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0 (0%)</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0 (0%)</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0(0%)</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>Hs.534639</p>
                     </c>
                     <c ca="center">
                        <p>1910</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>11</p>
                     </c>
                     <c ca="center">
                        <p>1 (9.1%)</p>
                     </c>
                     <c ca="center">
                        <p>10</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>9</p>
                     </c>
                     <c ca="center">
                        <p>0 (0%)</p>
                     </c>
                     <c ca="center">
                        <p>9</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>0(0%)</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>14</p>
                     </c>
                     <c ca="center">
                        <p>Hs.18069</p>
                     </c>
                     <c ca="center">
                        <p>1965</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>1 (33.3%)</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0 (0%)</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0 (0%)</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>17</p>
                     </c>
                     <c ca="center">
                        <p>Hs.514220</p>
                     </c>
                     <c ca="center">
                        <p>6800</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>8</p>
                     </c>
                     <c ca="center">
                        <p>2 (25%)</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>267</p>
                     </c>
                     <c ca="center">
                        <p>13</p>
                     </c>
                     <c ca="center">
                        <p>0 (0%)</p>
                     </c>
                     <c ca="center">
                        <p>13</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>0 (0%)</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>12</p>
                     </c>
                     <c ca="center">
                        <p>Hs.19192</p>
                     </c>
                     <c ca="center">
                        <p>397</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0 (0%)</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0 (0%)</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0 (0%)</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="14">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Total</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>54743</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>677</p>
                     </c>
                     <c ca="center">
                        <p>240 (35.5%)</p>
                     </c>
                     <c ca="center">
                        <p>437</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>1214</p>
                     </c>
                     <c ca="center">
                        <p>101 (8.3%)</p>
                     </c>
                     <c ca="center">
                        <p>1113</p>
                     </c>
                     <c ca="center">
                        <p>189</p>
                     </c>
                     <c ca="center">
                        <p>93 (49.21%)</p>
                     </c>
                     <c ca="center">
                        <p>96</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p><sup><it>a </it></sup>Candidate SNPs predicted by QualitySNP; <sup><it>b </it></sup>candidate SNPs (score &#8805; 50) predicted by autoSNP. Confirmed: candidate SNPs are considered confirmed if they are present in dbSNP. Time (m): the number of minutes needed to run a program on each UniGene cluster.</p>
               </tblfn>
            </tbl>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <sec>
            <st>
               <p>Haplotype-based strategy for SNP detection</p>
            </st>
            <p>We present here a SNP identification pipeline called QualitySNP that uses a haplotype-based strategy and reconstructs haplotypes at the start of the SNP identification process. This strategy makes full use of the redundancy in the sequences by clustering them, and in doing so not only reduces the influence of sequencing errors but also removes poor-quality sequences, which would otherwise each be identified as a haplotype represented by a single sequence. It thereby eliminates SNPs that may be due to random or systematic sequencing errors (the latter resulting from the sequencing strategy) or to reverse transcriptase errors.</p>
            <p>Once haplotypes have been defined and classified, it is possible to choose which SNPs will be used to diagnose the haplotype present in a genotype. Haplotype-based analysis of SNPs is more informative than analysis based on individual SNPs only, and is therefore more powerful in analyzing association with phenotypes <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>.</p>
         </sec>
         <sec>
            <st>
               <p>Haplotype reconstruction</p>
            </st>
            <p>About 96% of the clusters obtained from potato EST sequence data were predicted to contain four or fewer haplotypes (Table <tblr tid="T1">1</tblr>), which shows that our haplotype definition based on potential SNPs works well. This agrees with the suggestion of Rafalski <abbrgrp><abbr bid="B37">37</abbr></abbrgrp> and Russell <it>et al</it>. <abbrgrp><abbr bid="B38">38</abbr></abbrgrp> that closely spaced SNPs are sufficient to define haplotypes. After using the D-value to exclude clusters that probably contain paralogs, only 3% of the remaining clusters still contained more than four haplotypes. This most likely results from incorrect haplotype reconstruction, which could have several causes. Firstly, some sequence errors may occur frequently because of systematic problems in the experimental procedures, and such repeated errors would be considered valid alleles. Secondly, as EST sequences are usually much shorter than the corresponding mRNA, haplotypes at one end of the cluster sometimes cannot be unambiguously associated with haplotypes at the other end; they are then counted as separate haplotypes, raising the total number of haplotypes within the cluster. We checked ten clusters with more than four haplotypes and a D-value below 0.6, and found that this was indeed the case for five of them. Thirdly, some paralogs may be highly similar and therefore indistinguishable from alleles. Such paralogs may not be removed by filter 2, and would account for the extra (false) haplotypes in a cluster.</p>
         </sec>
         <sec>
            <st>
               <p>Paralog identification</p>
            </st>
            <p>The identification of paralogs is an important problem in SNP detection, especially in large contigs, which are more likely to contain paralogs and random errors <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B16">16</abbr></abbrgrp>. Most programs avoid the problem of large clusters by using a maximum cluster size of 20&#8211;50 for SNP discovery <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B16">16</abbr></abbrgrp>. Our program is not limited by cluster size, and can handle clusters with an arbitrary number of members. SNPs in potentially interesting (highly expressed) genes are therefore still detected.</p>
            <p>Paralogous sequences are generally less similar than allelic sequences. This property can be used to identify clusters that are likely to contain paralogs. POLYBAYES is a Bayesian method using the dissimilarity rate of paralogs and the polymorphism rate as input to calculate the probability that ESTs represent paralogs <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. However, this may not be accurate when the polymorphism rate varies substantially between different genes. Our method detects clusters with paralogs by calculating the standard deviation (D) of the number of potential polymorphisms among haplotypes rather than the deviation from the mean polymorphism rate, and is therefore still able to reliably detect SNPs in highly variable allelic sequences. Discrimination of paralogs based on D-value is applicable if at least three haplotypes are detected in a cluster.</p>
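            <p>The idea behind the D-value filter can be illustrated with a minimal sketch. It rests on assumptions that are not taken from QualitySNP itself: haplotypes are represented as equal-length strings of alleles at the potential SNP positions, the number of differences is counted for every pair of haplotypes, and D is taken as the population standard deviation of these raw counts. The function names and the threshold used in the example are hypothetical, and because QualitySNP's exact definition and any normalisation are not reproduced here, the values produced by this sketch are not comparable to the 0.6 and 0.9 thresholds discussed in this section.</p>
            <p><![CDATA[
```python
from itertools import combinations
from statistics import pstdev


def pairwise_differences(haplotypes):
    """Count differing positions for every pair of equal-length haplotype strings."""
    return [
        sum(a != b for a, b in zip(h1, h2))
        for h1, h2 in combinations(haplotypes, 2)
    ]


def d_value(haplotypes):
    """Standard deviation of the pairwise difference counts (assumed definition).

    Returns None when fewer than three haplotypes are present, mirroring the
    statement that the D-value test needs at least three haplotypes.
    """
    if len(haplotypes) < 3:
        return None
    return pstdev(pairwise_differences(haplotypes))


def likely_contains_paralogs(haplotypes, threshold):
    """Flag a cluster whose D-value exceeds a user-chosen (hypothetical) threshold."""
    d = d_value(haplotypes)
    return d is not None and d > threshold


# Three closely related haplotypes: evenly spread differences, small D.
allelic = ["AAAA", "AAAT", "AATA"]
# Two similar haplotypes plus one very divergent sequence: large D.
mixed = ["AAAAAA", "AAAAAT", "TTTTTA"]

print(pairwise_differences(allelic), likely_contains_paralogs(allelic, 1.0))
print(pairwise_differences(mixed), likely_contains_paralogs(mixed, 1.0))
```
]]></p>
            <p>In this toy example the divergent sequence inflates the spread of the pairwise counts, so only the second cluster is flagged. A cluster of uniformly variable alleles keeps a small spread even when the absolute number of polymorphisms is high, which is the property the paragraph above relies on.</p>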
            <p>For our potato dataset, with the D-value threshold set to 0.6, the 14% of clusters with a high probability of containing paralogs were excluded from SNP detection. For the chicken dataset the D-value threshold was set to 0.9, which flagged 6% of the clusters as containing potentially paralogous sequences. The higher D-value threshold for the chicken dataset and the lower numbers of false positive and false negative clusters are most likely the result of the better quality of the chicken sequence data compared to the potato data. In addition, the differences between chicken and potato may be partly explained by the potato genome probably containing more paralogous genes than the chicken genome, as gene duplication events have occurred more frequently in potato than in chicken. Indeed, the paralog content of the chicken genome is relatively low even compared to the human genome <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>.</p>
            <p>The study with the human UniGene datasets (Table <tblr tid="T2">2</tblr>) demonstrates the consequences of different D-value threshold settings. When the D-value threshold was increased from 0.6 to 0.9, 26 (53%) of the clusters that were flagged at a threshold of 0.6 and confirmed to contain paralogs on the human genome were now wrongly designated as allelic sequence clusters. Increasing the D-value threshold allows SNPs to be discovered in additional clusters, but the percentage of paralogous clusters among these additional clusters is also higher, which decreases the reliability of the discovered SNPs. The most reliable set of SNPs will therefore be produced at low D-value thresholds. QualitySNP enables the user to set the D-value threshold. A user who wants the most reliable SNP dataset can therefore use a low threshold. However, if the user is interested in a gene that is represented in a cluster with a higher D-value, the threshold can be increased so that QualitySNP also examines this cluster for SNPs.</p>
         </sec>
         <sec>
            <st>
               <p>Reliability of SNPs discovered by QualitySNP</p>
            </st>
            <p>Several steps in the QualitySNP pipeline are designed to improve the reliability of the SNP output while still allowing the program to work with datasets from as many crops as possible, which means that SNP identification must remain highly reliable even for datasets without quality files. These steps include 1) the <it>mihap/mahap </it>calculations and settings together with the High Confidence Scores, which effectively eliminate most of the SNPs identified in presumably low-quality sequence regions (illustrated in Figure <figr fid="F5">5</figr>), and 2) haplotype reconstruction together with D-value thresholds for filtering out paralog-containing clusters (as illustrated by the data in Table <tblr tid="T2">2</tblr>). In QualitySNP, most of the settings can be adjusted according to the user's preference.</p>
            <p>The reliability of the SNPs produced by QualitySNP is illustrated by the fact that 49 of the 52 potato SNPs, as well as an indel, that we were able to evaluate by sequencing were indeed confirmed. In addition, our validation of the SNP output of QualitySNP using human EST and SNP data demonstrates that QualitySNP outperforms autoSNP, producing both more SNPs and more reliable SNPs. The percentage of SNPs confirmed by comparison to dbSNP may appear relatively low (35%). However, dbSNP is a public-domain archive from NCBI for a broad collection of simple genetic polymorphisms, and although the number of SNPs in dbSNP increases every day, it does not cover most of the SNPs present in the human genome. It is therefore likely that a number of true SNPs have no match in dbSNP and cannot be confirmed in this way.</p>
         </sec>
         <sec>
            <st>
               <p>Retrieval system</p>
            </st>
            <p>QualitySNP includes a retrieval system that allows the user to extract additional useful information from the analysis. For instance, information about the nature of the SNPs (synonymous or non-synonymous) can be made part of the output. The SNP output can be modified by changing the reference genotype, and the D-value setting can be used to adjust the stringency with which paralogous clusters are detected and excluded. This may be very useful when focusing on a specific gene family in which alleles of different paralogous sequences need to be identified. Statistics on the numbers of the different types of SNPs and clusters can be included in the output. Search parameters include the contig reference number, the GenBank/EMBL/DDBJ accession number of ESTs, and the UniGene ID; output options include SNP information, alignment information, EST function annotation and ORF information for the contig. The SNP retrieval system based on the potato data is available at the website <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>In conclusion, QualitySNP works at least as well as currently available methods, and in some cases outperforms them, without the drawbacks of some of these methods, such as the need to provide a genomic sequence or sequence quality files. However, if quality files are available, QualitySNP can also use this information. By using a haplotype-based strategy, QualitySNP not only predicts reliable SNPs but also identifies haplotypes, and can thus be used in EST-based genotyping.</p>
         <p>Another advantage of QualitySNP over other programs for SNP detection in nucleotide databases is the availability of a retrieval system that can output various kinds of data. Although QualitySNP can be used as a SNP detection tool with default settings, it can also be used, for instance, to examine specific clusters of genes or to find nsSNPs in candidate genes.</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>BV, JL and JT identified the need to develop the program, initiated the project, and designed its basic functionality. JT designed the algorithm and wrote the source code. All authors contributed to the overall design and feature requirements, participated in drafting the manuscript, and approved the final version.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>Thanks are due to David Edwards (La Trobe University, Bundoora, Australia) for his data and program, and to Martijn van Kaauwen (Plant Research International, Wageningen UR) for expert technical assistance in the SNP validation in potato. This research was supported by the Dutch Ministry of Agriculture, Nature and Food Quality (kennisbasis funding).</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>The essence of SNPs</p>
            </title>
            <aug>
               <au>
                  <snm>Brookes</snm>
                  <fnm>AJ</fnm>
               </au>
            </aug>
            <source>Gene</source>
            <pubdate>1999</pubdate>
            <volume>234</volume>
            <fpage>177</fpage>
            <lpage>186</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0378-1119(99)00219-X</pubid>
                  <pubid idtype="pmpid" link="fulltext">10395891</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>High-throughput identification, database storage and analysis of SNPs in EST sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Useche</snm>
                  <fnm>FJ</fnm>
               </au>
               <au>
                  <snm>Gao</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Harafey</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Rafalski</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Genome Inform Ser Workshop Genome Inform</source>
            <pubdate>2001</pubdate>
            <volume>12</volume>
            <fpage>194</fpage>
            <lpage>203</lpage>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Mining SNPs from EST databases</p>
            </title>
            <aug>
               <au>
                  <snm>Picoult-Newberg</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Ideker</snm>
                  <fnm>TE</fnm>
               </au>
               <au>
                  <snm>Pohl</snm>
                  <fnm>MG</fnm>
               </au>
               <au>
                  <snm>Taylor</snm>
                  <fnm>SL</fnm>
               </au>
               <au>
                  <snm>Donaldson</snm>
                  <fnm>MA</fnm>
               </au>
               <au>
                  <snm>Nickerson</snm>
                  <fnm>DA</fnm>
               </au>
               <au>
                  <snm>Boyce-Jacino</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>1999</pubdate>
            <volume>9</volume>
            <fpage>167</fpage>
            <lpage>174</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">310719</pubid>
                  <pubid idtype="pmpid" link="fulltext">10022981</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Accessing genetic variation: genotyping single nucleotide polymorphisms</p>
            </title>
            <aug>
               <au>
                  <snm>Syvanen</snm>
                  <fnm>AC</fnm>
               </au>
            </aug>
            <source>Nat Rev Genet</source>
            <pubdate>2001</pubdate>
            <volume>2</volume>
            <fpage>930</fpage>
            <lpage>942</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/35103535</pubid>
                  <pubid idtype="pmpid" link="fulltext">11733746</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>First-generation SNP/InDel markers tagging loci for pathogen resistance in the potato genome</p>
            </title>
            <aug>
               <au>
                  <snm>Rickert</snm>
                  <fnm>AM</fnm>
               </au>
               <au>
                  <snm>Kim</snm>
                  <fnm>JH</fnm>
               </au>
               <au>
                  <snm>Meyer</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Nagel</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Ballvora</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Oefner</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Gebhardt</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Plant Biotech J</source>
            <pubdate>2003</pubdate>
            <volume>1</volume>
            <fpage>399</fpage>
            <lpage>410</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1046/j.1467-7652.2003.00036.x</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>ESTs as a source for sequence polymorphism discovery in sugarcane: example of the Adh genes</p>
            </title>
            <aug>
               <au>
                  <snm>Grivet</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Glaszmann</snm>
                  <fnm>JC</fnm>
               </au>
               <au>
                  <snm>Vincentz</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Silva</snm>
                  <fnm>Fd</fnm>
               </au>
               <au>
                  <snm>Arruda</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Theor Appl Genet</source>
            <pubdate>2003</pubdate>
            <volume>106</volume>
            <fpage>190</fpage>
            <lpage>197</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">12582843</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Identification of candidate coding region single nucleotide polymorphisms in 165 human genes using assembled expressed sequence tags</p>
            </title>
            <aug>
               <au>
                  <snm>Garg</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Green</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Nickerson</snm>
                  <fnm>DA</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>1999</pubdate>
            <volume>9</volume>
            <fpage>1087</fpage>
            <lpage>1092</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">310835</pubid>
                  <pubid idtype="pmpid" link="fulltext">10568748</pubid>
                  <pubid idtype="doi">10.1101/gr.9.11.1087</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>A double-screening method to identify reliable candidate non-synonymous SNPs from chicken EST data</p>
            </title>
            <aug>
               <au>
                  <snm>Kim</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Schmidt</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>Decker</snm>
                  <fnm>KS</fnm>
               </au>
               <au>
                  <snm>Emara</snm>
                  <fnm>MG</fnm>
               </au>
            </aug>
            <source>Animal Genet</source>
            <pubdate>2003</pubdate>
            <volume>34</volume>
            <fpage>249</fpage>
            <lpage>254</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1046/j.1365-2052.2003.01003.x</pubid>
                  <pubid idtype="pmpid" link="fulltext">12873212</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>A general approach to single-nucleotide polymorphism discovery</p>
            </title>
            <aug>
               <au>
                  <snm>Marth</snm>
                  <fnm>GT</fnm>
               </au>
               <au>
                  <snm>Korf</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Yandell</snm>
                  <fnm>MD</fnm>
               </au>
               <au>
                  <snm>Yeh</snm>
                  <fnm>RT</fnm>
               </au>
               <au>
                  <snm>Gu</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Zakeri</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Stitziel</snm>
                  <fnm>NO</fnm>
               </au>
               <au>
                  <snm>Hillier</snm>
                  <fnm>LD</fnm>
               </au>
               <au>
                  <snm>Kwok</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Gish</snm>
                  <fnm>WR</fnm>
               </au>
            </aug>
            <source>Nat Genet</source>
            <pubdate>1999</pubdate>
            <volume>23</volume>
            <fpage>452</fpage>
            <lpage>456</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/70570</pubid>
                  <pubid idtype="pmpid" link="fulltext">10581034</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Automated SNP detection in expressed sequence tags: statistical considerations and application to maritime pine sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Le Dantec</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Chagn&#233;</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Pot</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Cantin</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Garnier-G&#233;r&#233;</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Bedon</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Frigerio</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Chaumeil</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>L&#233;ger</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Garcia</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Legrait</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>de Daruvar</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Plomion</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Plant Mol Biol</source>
            <pubdate>2004</pubdate>
            <volume>54</volume>
            <fpage>461</fpage>
            <lpage>470</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1023/B:PLAN.0000036376.11710.6f</pubid>
                  <pubid idtype="pmpid" link="fulltext">15284499</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Reliable identification of large numbers of candidate SNPs from public EST data</p>
            </title>
            <aug>
               <au>
                  <snm>Buetow</snm>
                  <fnm>KH</fnm>
               </au>
               <au>
                  <snm>Edmonson</snm>
                  <fnm>MN</fnm>
               </au>
               <au>
                  <snm>Cassidy</snm>
                  <fnm>AB</fnm>
               </au>
            </aug>
            <source>Nat Genet</source>
            <pubdate>1999</pubdate>
            <volume>21</volume>
            <fpage>323</fpage>
            <lpage>325</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/6851</pubid>
                  <pubid idtype="pmpid" link="fulltext">10080189</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Mining single-nucleotide polymorphisms from hexaploid wheat ESTs</p>
            </title>
            <aug>
               <au>
                  <snm>Somers</snm>
                  <fnm>DJ</fnm>
               </au>
               <au>
                  <snm>Kirkpatrick</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Moniwa</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Walsh</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Genome</source>
            <pubdate>2003</pubdate>
            <volume>46</volume>
            <fpage>431</fpage>
            <lpage>437</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1139/g03-027</pubid>
                  <pubid idtype="pmpid" link="fulltext">12834059</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>novoSNP, a novel computational tool for sequence variation discovery</p>
            </title>
            <aug>
               <au>
                  <snm>Weckx</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Del Favero</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Rademakers</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Claes</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Cruts</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>De Jonghe</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Van Broeckhoven</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>De Rijk</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2005</pubdate>
            <volume>15</volume>
            <fpage>436</fpage>
            <lpage>442</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">551570</pubid>
                  <pubid idtype="pmpid" link="fulltext">15741513</pubid>
                  <pubid idtype="doi">10.1101/gr.2754005</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>A method for finding single-nucleotide polymorphisms with allele frequencies in sequences of deep coverage</p>
            </title>
            <aug>
               <au>
                  <snm>Wang</snm>
                  <fnm>JHX</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <fpage>220</fpage>
            <lpage>227</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1186/1471-2105-6-220</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Redundancy based detection of sequence polymorphisms in expressed sequence tag data using autoSNP</p>
            </title>
            <aug>
               <au>
                  <snm>Barker</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Batley</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>O'Sullivan</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Edwards</snm>
                  <fnm>KJ</fnm>
               </au>
               <au>
                  <snm>Edwards</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>19</volume>
            <fpage>421</fpage>
            <lpage>422</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btf881</pubid>
                  <pubid idtype="pmpid" link="fulltext">12584131</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Mining for single nucleotide polymorphisms and insertions/deletions in maize expressed sequence tag data</p>
            </title>
            <aug>
               <au>
                  <snm>Batley</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Barker</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>O'Sullivan</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Edwards</snm>
                  <fnm>KJ</fnm>
               </au>
               <au>
                  <snm>Edwards</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Plant Physiol</source>
            <pubdate>2003</pubdate>
            <volume>132</volume>
            <fpage>84</fpage>
            <lpage>91</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">166954</pubid>
                  <pubid idtype="pmpid" link="fulltext">12746514</pubid>
                  <pubid idtype="doi">10.1104/pp.102.019422</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Snipping polymorphisms from large EST collections in barley (Hordeum vulgare L.)</p>
            </title>
            <aug>
               <au>
                  <snm>Kota</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Rudd</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Facius</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Kolesov</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Thiel</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Stein</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Mayer</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Graner</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Mol Gen Genomics</source>
            <pubdate>2003</pubdate>
            <volume>270</volume>
            <fpage>24</fpage>
            <lpage>33</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1007/s00438-003-0891-6</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Application of machine learning in SNP discovery</p>
            </title>
            <aug>
               <au>
                  <snm>Matukumalli</snm>
                  <fnm>LK</fnm>
               </au>
               <au>
                  <snm>Grefenstette</snm>
                  <fnm>JJ</fnm>
               </au>
               <au>
                  <snm>Hyten</snm>
                  <fnm>DL</fnm>
               </au>
               <au>
                  <snm>Choi</snm>
                  <fnm>Ik-Young</fnm>
               </au>
               <au>
                  <snm>Cregan</snm>
                  <fnm>PB</fnm>
               </au>
               <au>
                  <snm>Van Tassell</snm>
                  <fnm>CP</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>4</fpage>
            <lpage>13</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1186/1471-2105-7-4</pubid>
                  <pubid idtype="pmpid" link="fulltext">16398931</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>TIGR Gene Index</p>
            </title>
            <url>http://www.tigr.org/</url>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Database resources of the National Center for Biotechnology</p>
            </title>
            <aug>
               <au>
                  <snm>Wheeler</snm>
                  <fnm>DL</fnm>
               </au>
               <au>
                  <snm>Church</snm>
                  <fnm>DM</fnm>
               </au>
               <au>
                  <snm>Federhen</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Lash</snm>
                  <fnm>AE</fnm>
               </au>
               <au>
                  <snm>Madden</snm>
                  <fnm>TL</fnm>
               </au>
               <au>
                  <snm>Pontius</snm>
                  <fnm>JU</fnm>
               </au>
               <au>
                  <snm>Schuler</snm>
                  <fnm>GD</fnm>
               </au>
               <au>
                  <snm>Schriml</snm>
                  <fnm>LM</fnm>
               </au>
               <au>
                  <snm>Sequeira</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Tatusova</snm>
                  <fnm>TA</fnm>
               </au>
               <au>
                  <snm>Wagner</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <fpage>28</fpage>
            <lpage>33</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">165480</pubid>
                  <pubid idtype="pmpid" link="fulltext">12519941</pubid>
                  <pubid idtype="doi">10.1093/nar/gkg033</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Chicken EST</p>
            </title>
            <url>ftp://rocky.bms.umist.ac.uk/pub/chickest/fastafiles/clipped/</url>
         </bibl>
         <bibl id="B22">
            <title>
               <p>UniProt: The Universal Protein Knowledgebase</p>
            </title>
            <aug>
               <au>
                  <snm>Apweiler</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Bairoch</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Wu</snm>
                  <fnm>CH</fnm>
               </au>
               <au>
                  <snm>Barker</snm>
                  <fnm>WC</fnm>
               </au>
               <au>
                  <snm>Boeckmann</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Ferro</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Gasteiger</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Huang</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Lopez</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Magrane</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Martin</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Natale</snm>
                  <fnm>DA</fnm>
               </au>
               <au>
                  <snm>O'Donovan</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Redaschi</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Yeh</snm>
                  <fnm>LS</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <fpage>D115</fpage>
            <lpage>D119</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">308865</pubid>
                  <pubid idtype="pmpid" link="fulltext">14681372</pubid>
                  <pubid idtype="doi">10.1093/nar/gkh131</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Improved tools for biological sequence comparison</p>
            </title>
            <aug>
               <au>
                  <snm>Pearson</snm>
                  <fnm>WR</fnm>
               </au>
               <au>
                  <snm>Lipman</snm>
                  <fnm>DJ</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>1988</pubdate>
            <volume>85</volume>
            <fpage>2444</fpage>
            <lpage>2448</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">280013</pubid>
                  <pubid idtype="pmpid" link="fulltext">3162770</pubid>
                  <pubid idtype="doi">10.1073/pnas.85.8.2444</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Gapped BLAST and PSI-BLAST: A new generation of protein database search programs</p>
            </title>
            <aug>
               <au>
                  <snm>Altschul</snm>
                  <fnm>SF</fnm>
               </au>
               <au>
                  <snm>Madden</snm>
                  <fnm>TL</fnm>
               </au>
               <au>
                  <snm>Schaeffer</snm>
                  <fnm>AA</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Lipman</snm>
                  <fnm>DJ</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1997</pubdate>
            <volume>25</volume>
            <fpage>3389</fpage>
            <lpage>3402</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">146917</pubid>
                  <pubid idtype="pmpid" link="fulltext">9254694</pubid>
                  <pubid idtype="doi">10.1093/nar/25.17.3389</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>CAP3: a DNA sequence assembly program</p>
            </title>
            <aug>
               <au>
                  <snm>Huang</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Madan</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>1999</pubdate>
            <volume>9</volume>
            <fpage>868</fpage>
            <lpage>877</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">310812</pubid>
                  <pubid idtype="pmpid" link="fulltext">10508846</pubid>
                  <pubid idtype="doi">10.1101/gr.9.9.868</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Base-calling of automated sequencer traces using phred. II. Error probabilities</p>
            </title>
            <aug>
               <au>
                  <snm>Ewing</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Green</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>1998</pubdate>
            <volume>8</volume>
            <fpage>186</fpage>
            <lpage>194</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">9521922</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>BLAT server of UCSC</p>
            </title>
            <url>http://genome.ucsc.edu/cgi-bin/hgBlat</url>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Genotype to phenotype: a technological challenge</p>
            </title>
            <aug>
               <au>
                  <snm>Wilson</snm>
                  <fnm>ID</fnm>
               </au>
               <au>
                  <snm>Barker</snm>
                  <fnm>GL</fnm>
               </au>
               <au>
                  <snm>Edwards</snm>
                  <fnm>KJ</fnm>
               </au>
            </aug>
            <source>Ann Appl Biol</source>
            <pubdate>2003</pubdate>
            <volume>142</volume>
            <fpage>33</fpage>
            <lpage>39</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1111/j.1744-7348.2003.tb00226.x</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Using cDNA and genomic sequences as tools to develop SNP strategies in cassava (Manihot esculenta Crantz)</p>
            </title>
            <aug>
               <au>
                  <snm>Lopez</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Piegu</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Cooke</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Delseny</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Tohme</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Verdier</snm>
                  <fnm>V</fnm>
               </au>
            </aug>
            <source>Theor Appl Genet</source>
            <pubdate>2005</pubdate>
            <volume>110</volume>
            <fpage>425</fpage>
            <lpage>431</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1007/s00122-004-1833-3</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Chicken single nucleotide polymorphism identification and selection for genetic mapping</p>
            </title>
            <aug>
               <au>
                  <snm>Jalving</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Van't Slot</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>van Oost</snm>
                  <fnm>BA</fnm>
               </au>
            </aug>
            <source>Poultry Sci</source>
            <pubdate>2004</pubdate>
            <volume>83</volume>
            <fpage>1925</fpage>
            <lpage>1931</lpage>
         </bibl>
         <bibl id="B31">
            <title>
               <p>Comparison of DNA sequences with protein sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Pearson</snm>
                  <fnm>WR</fnm>
               </au>
               <au>
                  <snm>Wood</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
            </aug>
            <source>Genomics</source>
            <pubdate>1997</pubdate>
            <volume>46</volume>
            <fpage>24</fpage>
            <lpage>36</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/geno.1997.4995</pubid>
                  <pubid idtype="pmpid" link="fulltext">9403055</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>QualitySNP</p>
            </title>
            <url>http://www.bioinformatics.nl/tools/snpweb/</url>
         </bibl>
         <bibl id="B33">
            <title>
               <p>The hidden duplication past of Arabidopsis thaliana</p>
            </title>
            <aug>
               <au>
                  <snm>Simillion</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Vandepoele</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Van Montagu</snm>
                  <fnm>MCE</fnm>
               </au>
               <au>
                  <snm>Zabeau</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Van de Peer</snm>
                  <fnm>Y</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2002</pubdate>
            <volume>99</volume>
            <fpage>13627</fpage>
            <lpage>13632</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">129725</pubid>
                  <pubid idtype="pmpid" link="fulltext">12374856</pubid>
                  <pubid idtype="doi">10.1073/pnas.212522399</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B34">
            <title>
               <p>Evidence that rice and other cereals are ancient aneuploids</p>
            </title>
            <aug>
               <au>
                  <snm>Vandepoele</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Simillion</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Van de Peer</snm>
                  <fnm>Y</fnm>
               </au>
            </aug>
            <source>Plant Cell</source>
            <pubdate>2003</pubdate>
            <volume>15</volume>
            <fpage>2192</fpage>
            <lpage>2202</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">181340</pubid>
                  <pubid idtype="pmpid" link="fulltext">12953120</pubid>
                  <pubid idtype="doi">10.1105/tpc.014019</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B35">
            <title>
               <p>Discovery of single nucleotide polymorphisms in Lycopersicon esculentum by computer aided analysis of expressed sequence tags</p>
            </title>
            <aug>
               <au>
                  <snm>Yang</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Bai</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Kabelka</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Eaton</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Kamoun</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>van der Knaap</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Francis</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Mol Breeding</source>
            <pubdate>2004</pubdate>
            <volume>14</volume>
            <fpage>21</fpage>
            <lpage>34</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1023/B:MOLB.0000037992.03731.a5</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B36">
            <title>
               <p>Large-scale identification and analysis of genome-wide single-nucleotide polymorphisms for mapping in Arabidopsis thaliana</p>
            </title>
            <aug>
               <au>
                  <snm>Schmid</snm>
                  <fnm>KJ</fnm>
               </au>
               <au>
                  <snm>Rosleff S&#246;rensen</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Stracke</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>T&#246;rj&#233;k</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Altmann</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Mitchell-Olds</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Weisshaar</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2003</pubdate>
            <volume>13</volume>
            <fpage>1250</fpage>
            <lpage>1257</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">403656</pubid>
                  <pubid idtype="pmpid" link="fulltext">12799357</pubid>
                  <pubid idtype="doi">10.1101/gr.728603</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B37">
            <title>
               <p>Applications of single nucleotide polymorphisms in crop genetics</p>
            </title>
            <aug>
               <au>
                  <snm>Rafalski</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Curr Op Plant Biol</source>
            <pubdate>2002</pubdate>
            <volume>5</volume>
            <fpage>94</fpage>
            <lpage>100</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/S1369-5266(02)00240-6</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B38">
            <title>
               <p>A comparison of sequence-based polymorphism and haplotype content in transcribed and anonymous regions of the barley genome</p>
            </title>
            <aug>
               <au>
                  <snm>Russell</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Booth</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Fuller</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Harrower</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Hedley</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Machray</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Powell</snm>
                  <fnm>W</fnm>
               </au>
            </aug>
            <source>Genome</source>
            <pubdate>2004</pubdate>
            <volume>47</volume>
            <fpage>389</fpage>
            <lpage>398</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15060592</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B39">
            <title>
               <p>Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution</p>
            </title>
            <aug>
               <au>
                  <cnm>International Chicken Genome Sequencing Consortium</cnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2004</pubdate>
            <volume>432</volume>
            <fpage>695</fpage>
            <lpage>716</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nature03154</pubid>
                  <pubid idtype="pmpid" link="fulltext">15592404</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>

