<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-9-29</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Methodology article</dochead>
      <bibl>
         <title>
            <p>A nonparametric model for quality control of database search results in shotgun proteomics</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Zhang</snm>
               <fnm>Jiyang</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>zhangjy@hupo.org.cn</email>
            </au>
            <au id="A2">
               <snm>Li</snm>
               <fnm>Jianqi</fnm>
               <insr iid="I2"/>
               <email>Lijq@hupo.org.cn</email>
            </au>
            <au id="A3">
               <snm>Liu</snm>
               <fnm>Xin</fnm>
               <insr iid="I2"/>
               <email>dkgha@126.com</email>
            </au>
            <au id="A4">
               <snm>Xie</snm>
               <fnm>Hongwei</fnm>
               <insr iid="I1"/>
               <email>xhwei65@hotmail.com</email>
            </au>
            <au id="A5" ca="yes">
               <snm>Zhu</snm>
               <fnm>Yunping</fnm>
               <insr iid="I2"/>
               <email>zhuyp@hupo.org.cn</email>
            </au>
            <au id="A6" ca="yes">
               <snm>He</snm>
               <fnm>Fuchu</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>hefc@nic.bmi.ac.cn</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>College of Mechanical &amp; Electronic Engineering and Automatization, National University of Defense Technology, Changsha, 410073, China</p>
            </ins>
            <ins id="I2">
               <p>State Key Laboratory of Proteomics, Beijing Proteome Research Center, Beijing Institute of Radiation Medicine, Beijing 102206, China</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2008</pubdate>
         <volume>9</volume>
         <issue>1</issue>
         <fpage>29</fpage>
         <url>http://www.biomedcentral.com/1471-2105/9/29</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">18205957</pubid>
               <pubid idtype="doi">10.1186/1471-2105-9-29</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>05</day>
               <month>6</month>
               <year>2007</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>21</day>
               <month>1</month>
               <year>2008</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>21</day>
               <month>1</month>
               <year>2008</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2008</year>
         <collab>Zhang et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Analysis of complex samples with tandem mass spectrometry (MS/MS) has become routine in proteomic research. However, validation of database search results creates a bottleneck in MS/MS data processing. Recently, methods based on a randomized database have become popular for quality control of database search results. However, it remains unclear how to combine different database search scores to improve the sensitivity of randomized database methods.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>In this paper, a multivariate nonlinear discriminant function (DF) based on the multivariate nonparametric density estimation technique was used to filter out false-positive database search results with a predictable false positive rate (FPR). Application of this method to control datasets of different instruments (LCQ, LTQ, and LTQ/FT) yielded an estimated FPR close to the actual FPR. As expected, the method was more sensitive when more features were used. Furthermore, the new method was shown to be more sensitive than two commonly used methods on 3 complex sample datasets and 3 control datasets.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>Using the nonparametric model, a more flexible DF can be obtained, resulting in improved sensitivity and good FPR estimation. This nonparametric statistical technique is a powerful tool for tackling the complexity and diversity of datasets in shotgun proteomics.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>The objective of proteomics is to investigate proteins on a global scale <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp>. The high-throughput and sensitive tandem mass spectrometry (MS/MS) platform is now a supporting technology for protein identification in proteomic research <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr></abbrgrp>. Using the shotgun strategy, a large number of MS/MS spectra can be gathered in a few hours <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. The MS/MS data are generally processed by the so-called database searching method <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. Automated software such as SEQUEST <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> and MASCOT <abbrgrp><abbr bid="B7">7</abbr></abbrgrp> can rapidly assign tryptic peptides to MS/MS spectra by searching a protein sequence database and then identify proteins by utilizing the identified peptides. A notable problem in MS/MS data processing is the high false positive rate (FPR) of the database search results <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. Thus, validation of database search results is necessary, particularly when processing large numbers of low-accuracy MS/MS spectra with SEQUEST <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>.</p>
         <p>There are many proposed parameters and algorithms for evaluating SEQUEST database search results <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr><abbr bid="B24">24</abbr><abbr bid="B25">25</abbr><abbr bid="B26">26</abbr><abbr bid="B27">27</abbr><abbr bid="B28">28</abbr><abbr bid="B29">29</abbr><abbr bid="B30">30</abbr><abbr bid="B31">31</abbr><abbr bid="B32">32</abbr></abbrgrp>. Such approaches must confront two main problems: First, the complex physical and chemical mechanisms of the shotgun experiment make it difficult to model the matches between MS/MS spectra and peptides with a one-size-fits-all algorithm <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. Thus, database search software provides multiple scores, and many empirical and intuitive parameters are used in the validation of database search results. These parameters describe different aspects of the quality of the match and provide complementary information to the validation of the database search results. Combining these parameters while considering their relationships is difficult. Second, many factors can affect the distributions of quality control parameters, including the sample, the database, the experimental conditions, and other random factors <abbrgrp><abbr bid="B24">24</abbr><abbr bid="B27">27</abbr></abbrgrp>. Avoiding the effects of such factors during the validation of database search results is difficult. In addition, large-scale proteomics always uses multiple, complementary MS/MS platforms and multiple database search software tools to acquire more results with a high confidence level. Thus, a universal framework for quality control of results is needed <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>.</p>
         <p>Recently, the randomized database method has become an attractive framework for quality control of MS/MS database search results. By constructing a negative control dataset for each experimental MS/MS dataset and the given database, the randomized database method can provide a universal foundation for quality control of results from different types of database search software and minimize the effects of differences in samples, experimental conditions, and databases <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>. In the randomized database method, the negative control dataset is generated by searching the constructed randomized database and is used to simulate random matches from the normal database. The false positive rate can be estimated from the numbers of matches from the normal and randomized databases that pass a given set of filter criteria.</p>
         <p>Moore et al. used the reverse database (a special kind of randomized database) for their Qscore model in 2002 <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. Subsequently, Qian et al. <abbrgrp><abbr bid="B25">25</abbr></abbrgrp> and Peng et al. <abbrgrp><abbr bid="B26">26</abbr></abbrgrp> used the reverse database method to investigate the problem of optimizing the cutoff value of <it>Xcorr </it>and &#916;<it>Cn </it>in yeast and human proteome research, respectively. Recently, Higdon et al. <abbrgrp><abbr bid="B28">28</abbr></abbrgrp> investigated some problems encountered in the application of the reshuffled database. As they noted, searching a combined database can yield more accurate FPR estimation than individually searching normal and reshuffled databases. Based on the binomial distribution, Huttlin et al. investigated the minimum error associated with the estimated FPR <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>. They pointed out that the estimated FPR for a large dataset could be quite accurate. Randomized database methods have been widely used in many research projects <abbrgrp><abbr bid="B34">34</abbr><abbr bid="B35">35</abbr><abbr bid="B36">36</abbr><abbr bid="B37">37</abbr><abbr bid="B38">38</abbr><abbr bid="B39">39</abbr><abbr bid="B40">40</abbr></abbrgrp>. However, different groups use different criteria; there is no standard statistical framework that can easily integrate commonly used parameters for the quality control of database search results.</p>
         <p>There are two primary problems with the randomized database method: how to determine the filter criteria and how to then estimate the FPR. Based on the hypothesis that random matches are randomly drawn from the normal and randomized databases, formula 1 can be used to estimate the actual FPR <abbrgrp><abbr bid="B25">25</abbr><abbr bid="B26">26</abbr></abbrgrp>; Elias et al. <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> recommended formula 2 for reliable data quality control:</p>
         <p>
            <display-formula id="M1">
               <m:math name="1471-2105-9-29-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
                  <m:semantics>
                     <m:mrow>
                        <m:mi>F</m:mi>
                        <m:mi>P</m:mi>
                        <m:mi>R</m:mi>
                        <m:mo>=</m:mo>
                        <m:mfrac>
                           <m:mrow>
                              <m:msub>
                                 <m:mi>N</m:mi>
                                 <m:mi>R</m:mi>
                              </m:msub>
                           </m:mrow>
                           <m:mrow>
                              <m:msub>
                                 <m:mi>N</m:mi>
                                 <m:mi>N</m:mi>
                              </m:msub>
                           </m:mrow>
                        </m:mfrac>
                     </m:mrow>
                     <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemOrayKaemiuaaLaemOuaiLaeyypa0tcfa4aaSaaaeaacqWGobGtdaWgaaqaaiabdkfasbqabaaabaGaemOta40aaSbaaeaacqWGobGtaeqaaaaaaaa@360E@</m:annotation>
                  </m:semantics>
               </m:math>
            </display-formula>
         </p>
         <p>
            <display-formula id="M2">
               <m:math name="1471-2105-9-29-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
                  <m:semantics>
                     <m:mrow>
                        <m:mi>F</m:mi>
                        <m:mi>P</m:mi>
                        <m:mi>R</m:mi>
                        <m:mo>=</m:mo>
                        <m:mfrac>
                           <m:mrow>
                              <m:mn>2</m:mn>
                              <m:msub>
                                 <m:mi>N</m:mi>
                                 <m:mi>R</m:mi>
                              </m:msub>
                           </m:mrow>
                           <m:mrow>
                              <m:msub>
                                 <m:mi>N</m:mi>
                                 <m:mi>N</m:mi>
                              </m:msub>
                              <m:mo>+</m:mo>
                              <m:msub>
                                 <m:mi>N</m:mi>
                                 <m:mi>R</m:mi>
                              </m:msub>
                           </m:mrow>
                        </m:mfrac>
                     </m:mrow>
                     <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemOrayKaemiuaaLaemOuaiLaeyypa0tcfa4aaSaaaeaacqaIYaGmcqWGobGtdaWgaaqaaiabdkfasbqabaaabaGaemOta40aaSbaaeaacqWGobGtaeqaaiabgUcaRiabd6eaonaaBaaabaGaemOuaifabeaaaaaaaa@3A55@</m:annotation>
                  </m:semantics>
               </m:math>
            </display-formula>
         </p>
         <p>where <it>N</it><sub><it>R </it></sub>and <it>N</it><sub><it>N </it></sub>are the numbers of peptide matches that pass the given filter criteria and derive from the randomized and normal databases, respectively. Huttlin et al. <abbrgrp><abbr bid="B33">33</abbr></abbrgrp> gave a statistical interpretation of formula 2 using the binomial distribution; we therefore use formula 2 to estimate the FPR in this paper. Generally, the filter criteria are discriminant functions (DFs) of database search scores. Determining the acceptance boundaries for database search scores (such as <it>Xcorr </it>and &#916;<it>Cn</it>) is a simple and commonly used method <abbrgrp><abbr bid="B25">25</abbr><abbr bid="B26">26</abbr></abbrgrp>. Lopez-Ferrer et al. sought to introduce a statistical model that would provide a more complex DF and thus improve the sensitivity of the filter criteria <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. In their model, <it>XCc </it>(= ln(<it>Xcorr</it>)) and <inline-formula><m:math name="1471-2105-9-29-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mi>D</m:mi><m:mi>C</m:mi><m:mi>c</m:mi><m:mo stretchy="false">(</m:mo><m:mo>=</m:mo><m:msqrt><m:mrow><m:mi>&#916;</m:mi><m:mi>C</m:mi><m:mi>n</m:mi></m:mrow></m:msqrt><m:mo stretchy="false">)</m:mo></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemiraqKaem4qamKaem4yamMaeiikaGIaeyypa0ZaaOaaaeaacqqHuoarcqWGdbWqcqWGUbGBaSqabaGccqGGPaqkaaa@35F9@</m:annotation></m:semantics></m:math></inline-formula> of random matches were considered to follow normal distributions, and the distributions of <it>XCc </it>and <it>DCc </it>were assumed to be independent. The contour line of the estimated joint distribution of <it>XCc </it>and <it>DCc </it>was used as the filter boundary. However, we found that normal distributions do not fit the distributions of <it>XCc </it>and <it>DCc </it>of the random matches in the LCQ control dataset used in this paper well (see the "Datasets and database search" section); the <it>&#967;</it><sup>2 </sup>goodness-of-fit test rejects the null hypothesis <it>H</it><sub>0 </sub>(the distribution is normal) at a significance level of 0.05. Furthermore, the correlation between <it>XCc </it>and <it>DCc </it>is significant (correlation coefficient = 0.1, p-value = 1.8 &#215; 10<sup>-24</sup>; random matches in the LCQ control dataset, see the "Datasets and database search" section), which is inconsistent with the independence assumption made by Lopez-Ferrer et al. Another problem with their model is that it cannot be generalized to situations involving more parameters.</p>
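         <p>As a minimal sketch of formulas 1 and 2, the following Python functions compute both FPR estimates from the two counts. The function names and the example counts are illustrative assumptions and are not taken from the datasets used in this paper.</p>
         <p>
```python
def fpr_formula1(n_random, n_normal):
    """Formula 1: FPR = N_R / N_N."""
    if n_normal == 0:
        raise ValueError("no matches from the normal database passed the filter")
    return n_random / n_normal


def fpr_formula2(n_random, n_normal):
    """Formula 2: FPR = 2 * N_R / (N_N + N_R)."""
    total = n_normal + n_random
    if total == 0:
        raise ValueError("no matches passed the filter")
    return 2 * n_random / total


if __name__ == "__main__":
    # Hypothetical counts of matches that pass some filter criteria.
    n_r, n_n = 25, 975
    print("formula 1:", fpr_formula1(n_r, n_n))
    print("formula 2:", fpr_formula2(n_r, n_n))  # 2 * 25 / (975 + 25)
```
         </p>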
         <p>Multivariate nonparametric models can describe data with complex and variable statistical structures. The term nonparametric is not meant to imply that such models do not use any parameters but rather denotes that the number and nature of the parameters are not fixed in advance but flexible. This advantage makes nonparametric models a powerful tool for addressing the problem of multiple parameters with variable distributions in the validation of database search results. Using a set of kernel functions (such as a Gaussian kernel function), the nonparametric model can fit the distribution of multiple parameters directly with considerable accuracy <abbrgrp><abbr bid="B41">41</abbr><abbr bid="B42">42</abbr></abbrgrp>. Generally, parameter estimation for a nonparametric model is an iterative optimization procedure. The fully nonparametric probability density function estimate (FnPDFe) procedure proposed by Archambeau et al. <abbrgrp><abbr bid="B42">42</abbr></abbrgrp> and David et al. <abbrgrp><abbr bid="B43">43</abbr></abbrgrp>, which is based on a maximum likelihood estimate (MLE) and the expectation-maximization (EM) algorithm, is easily implemented with computer programs. In this paper, based on randomized database searching, FnPDFe was used to estimate the multivariate PDF of the commonly used database search scores, and the contour lines of the estimated PDF were taken as the candidate DFs. We demonstrate that the FPR estimation errors of the newly introduced method are acceptable on control datasets from different instruments (LCQ, LTQ and LTQ/FT) and that its sensitivity is improved on both the control datasets and the real sample datasets.</p>
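         <p>The MLE/EM iteration mentioned above can be illustrated with a deliberately simplified sketch: EM for a one-dimensional Gaussian mixture with a fixed number of components. This is not the FnPDFe procedure itself, which is multivariate and is described in the cited references; the function names, the variance floor, and the toy data are assumptions made purely for illustration.</p>
         <p>
```python
import math


def normal_pdf(x, mu, var):
    """Density of a univariate normal distribution with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, mus, variances, weights, n_iter=50):
    """Run n_iter EM iterations for a 1-D Gaussian mixture; returns updated copies."""
    mus, variances, weights = list(mus), list(variances), list(weights)
    k = len(mus)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in data:
            p = [weights[j] * normal_pdf(x, mus[j], variances[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: weighted maximum likelihood updates of the parameters.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, 1e-6)  # assumed floor to avoid collapse
            weights[j] = nj / len(data)
    return mus, variances, weights


if __name__ == "__main__":
    # Toy data with two well-separated clusters (not from the paper).
    toy = [-0.2, -0.1, 0.0, 0.1, 0.2, 4.8, 4.9, 5.0, 5.1, 5.2]
    print(em_gmm_1d(toy, [1.0, 4.0], [1.0, 1.0], [0.5, 0.5]))
```
         </p>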
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <p>In this section, we first discuss the DFs of the nonparametric model and then show that the sensitivity of the model can be improved by incorporating more features. We then investigate the accuracy of the FPR estimation of the nonparametric model and show, by comparison, that it outperforms other methods commonly used in proteomics.</p>
         <sec>
            <st>
               <p>Nonparametric model and the DF</p>
            </st>
            <p>To illustrate the shape of the DFs derived from the nonparametric model, we first investigated a two-dimensional model using <it>Xcorr </it>and &#916;<it>Cn</it>. Because <it>Xcorr </it>correlates significantly with the charge state (+1, +2, and +3) <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>, matches with different charge states were processed separately. Since a large percentage of correct matches are doubly charged, the doubly charged matches in the control dataset are discussed here. Using a trial-and-error approach, we found that a model with 3 Gaussian functions (18 variables, Table <tblr tid="T1">1</tblr>) fit the distribution well (<it>&#967;</it><sup>2 </sup>goodness-of-fit test; significance level = 0.05). Figure <figr fid="F1">1A</figr> and Figure <figr fid="F1">1B</figr> show the histogram and density function, respectively. The estimation error for each bin is shown in Figure <figr fid="F1">1C</figr>. The small error (&#8804; 3.6 &#215; 10<sup>-3</sup>) also indicates that the fit is accurate.</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>The model with 3 Gaussian functions for +2 charge observations in the LCQ control dataset</p>
               </caption>
               <tblbdy cols="3">
                  <r>
                     <c ca="center">
                        <p>
                           <it>&#956;</it>
                           <sub>
                              <it>i</it>
                           </sub>
                        </p>
                     </c>
                     <c ca="center">
                        <p>&#931;<sub><it>i</it></sub></p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>P</it>
                           <sub>
                              <it>i</it>
                           </sub>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>(1.528008,0.156465)</p>
                     </c>
                     <c ca="center">
                        <p>
                           <inline-formula>
                              <m:math name="1471-2105-9-29-i4" xmlns:m="http://www.w3.org/1998/Math/MathML">
                                 <m:semantics>
                                    <m:mrow>
                                       <m:mrow>
                                          <m:mo>[</m:mo>
                                          <m:mrow>
                                             <m:mtable>
                                                <m:mtr>
                                                   <m:mtd>
                                                      <m:mrow>
                                                         <m:mtext>0</m:mtext>
                                                         <m:mtext>.147405</m:mtext>
                                                      </m:mrow>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mrow>
                                                         <m:mtext>0</m:mtext>
                                                         <m:mtext>.007248</m:mtext>
                                                      </m:mrow>
                                                   </m:mtd>
                                                </m:mtr>
                                                <m:mtr>
                                                   <m:mtd>
                                                      <m:mrow>
                                                         <m:mtext>0</m:mtext>
                                                         <m:mtext>.007248</m:mtext>
                                                      </m:mrow>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mrow>
                                                         <m:mtext>0</m:mtext>
                                                         <m:mtext>.004207</m:mtext>
                                                      </m:mrow>
                                                   </m:mtd>
                                                </m:mtr>
                                             </m:mtable>
                                          </m:mrow>
                                          <m:mo>]</m:mo>
                                       </m:mrow>
                                    </m:mrow>
                                    <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaWaamWaaeaafaqabeGacaaabaGaeeimaaJaeeOla4IaeeymaeJaeeinaqJaee4naCJaeeinaqJaeeimaaJaeeynaudabaGaeeimaaJaeeOla4IaeeimaaJaeeimaaJaee4naCJaeeOmaiJaeeinaqJaeeioaGdabaGaeeimaaJaeeOla4IaeeimaaJaeeimaaJaee4naCJaeeOmaiJaeeinaqJaeeioaGdabaGaeeimaaJaeeOla4IaeeimaaJaeeimaaJaeeinaqJaeeOmaiJaeeimaaJaee4naCdaaaGaay5waiaaw2faaaaa@4B3D@</m:annotation>
                                 </m:semantics>
                              </m:math>
                           </inline-formula>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.138577</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>(1.615925,0.079976)</p>
                     </c>
                     <c ca="center">
                        <p>
                           <inline-formula>
                              <m:math name="1471-2105-9-29-i5" xmlns:m="http://www.w3.org/1998/Math/MathML">
                                 <m:semantics>
                                    <m:mrow>
                                       <m:mrow>
                                          <m:mo>[</m:mo>
                                          <m:mrow>
                                             <m:mtable>
                                                <m:mtr>
                                                   <m:mtd>
                                                      <m:mrow>
                                                         <m:mtext>0</m:mtext>
                                                         <m:mtext>.236614</m:mtext>
                                                      </m:mrow>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mrow>
                                                         <m:mtext>-0</m:mtext>
                                                         <m:mtext>.001756</m:mtext>
                                                      </m:mrow>
                                                   </m:mtd>
                                                </m:mtr>
                                                <m:mtr>
                                                   <m:mtd>
                                                      <m:mrow>
                                                         <m:mtext>-0</m:mtext>
                                                         <m:mtext>.001756</m:mtext>
                                                      </m:mrow>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mrow>
                                                         <m:mtext>0</m:mtext>
                                                         <m:mtext>.001686</m:mtext>
                                                      </m:mrow>
                                                   </m:mtd>
                                                </m:mtr>
                                             </m:mtable>
                                          </m:mrow>
                                          <m:mo>]</m:mo>
                                       </m:mrow>
                                    </m:mrow>
                                    <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaWaamWaaeaafaqabeGacaaabaGaeeimaaJaeeOla4IaeeOmaiJaee4mamJaeeOnayJaeeOnayJaeeymaeJaeeinaqdabaGaeeyla0IaeeimaaJaeeOla4IaeeimaaJaeeimaaJaeeymaeJaee4naCJaeeynauJaeeOnaydabaGaeeyla0IaeeimaaJaeeOla4IaeeimaaJaeeimaaJaeeymaeJaee4naCJaeeynauJaeeOnaydabaGaeeimaaJaeeOla4IaeeimaaJaeeimaaJaeeymaeJaeeOnayJaeeioaGJaeeOnaydaaaGaay5waiaaw2faaaaa@4D09@</m:annotation>
                                 </m:semantics>
                              </m:math>
                           </inline-formula>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.476640</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>(1.369449,0.023879)</p>
                     </c>
                     <c ca="center">
                        <p>
                           <inline-formula>
                              <m:math name="1471-2105-9-29-i6" xmlns:m="http://www.w3.org/1998/Math/MathML">
                                 <m:semantics>
                                    <m:mrow>
                                       <m:mrow>
                                          <m:mo>[</m:mo>
                                          <m:mrow>
                                             <m:mtable>
                                                <m:mtr>
                                                   <m:mtd>
                                                      <m:mrow>
                                                         <m:mtext>0</m:mtext>
                                                         <m:mtext>.078369</m:mtext>
                                                      </m:mrow>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mrow>
                                                         <m:mtext>-0</m:mtext>
                                                         <m:mtext>.000077</m:mtext>
                                                      </m:mrow>
                                                   </m:mtd>
                                                </m:mtr>
                                                <m:mtr>
                                                   <m:mtd>
                                                      <m:mrow>
                                                         <m:mtext>-0</m:mtext>
                                                         <m:mtext>.000077</m:mtext>
                                                      </m:mrow>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mrow>
                                                         <m:mtext>0</m:mtext>
                                                         <m:mtext>.000250</m:mtext>
                                                      </m:mrow>
                                                   </m:mtd>
                                                </m:mtr>
                                             </m:mtable>
                                          </m:mrow>
                                          <m:mo>]</m:mo>
                                       </m:mrow>
                                    </m:mrow>
                                    <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaWaamWaaeaafaqabeGacaaabaGaeeimaaJaeeOla4IaeeimaaJaee4naCJaeeioaGJaee4mamJaeeOnayJaeeyoaKdabaGaeeyla0IaeeimaaJaeeOla4IaeeimaaJaeeimaaJaeeimaaJaeeimaaJaee4naCJaee4naCdabaGaeeyla0IaeeimaaJaeeOla4IaeeimaaJaeeimaaJaeeimaaJaeeimaaJaee4naCJaee4naCdabaGaeeimaaJaeeOla4IaeeimaaJaeeimaaJaeeimaaJaeeOmaiJaeeynauJaeeimaadaaaGaay5waiaaw2faaaaa@4CEF@</m:annotation>
                                 </m:semantics>
                              </m:math>
                           </inline-formula>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.384784</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Identified nonparametric model for observations in the control dataset with a +2 charge state</p>
               </caption>
               <text>
                  <p>Identified nonparametric model for observations in the control dataset with a +2 charge state. (A) The 2-dimensional histogram. (B) The density function curve of the mixed model with 3 Gaussian functions. (C) The error of the density function in each bin. (D) Contour lines of the density function serve as the filter boundaries.</p>
               </text>
               <graphic file="1471-2105-9-29-1"/>
            </fig>
            <p>DFs that simultaneously reject as many false positives and accept as many true positives as possible are preferred. Thus, regions of the feature space with fewer random matches are preferred, and the contour lines of the PDF of the random matches are good candidate DFs (Figure <figr fid="F1">1D</figr>). Generally, random matches have a small &#916;<it>Cn </it>and <it>Xcorr</it>, while correct matches have a large &#916;<it>Cn </it>and <it>Xcorr</it>. Correct matches with the peptide isoform <abbrgrp><abbr bid="B44">44</abbr></abbrgrp> have a small &#916;<it>Cn </it>and a large <it>Xcorr</it>. Matches with a small <it>Xcorr </it>and a large &#916;<it>Cn </it>may be due to the limited search space of the database search. These matches are rare and are more likely to be random matches; because they are rare random events, they may nevertheless fall inside the accepted region of the contour-line DFs. A further DF on <it>Xcorr </it>was therefore added to exclude such matches: <it>Xcorr </it>> <it>m</it><sub><it>Xcorr</it></sub>, where <it>m</it><sub><it>Xcorr </it></sub>is the mean <it>Xcorr </it>of the randomized database matches (bold red vertical line in Figure <figr fid="F1">1D</figr>). Given an expected FPR <it>&#945;</it>, a target value <it>f</it><sub><it>&#945; </it></sub>can be searched for such that the calculated FPR (<it>FPR</it><sub><it>cal</it></sub>) is less than or equal to <it>&#945;</it>. When searching for <it>f</it><sub><it>&#945;</it></sub>, <it>N</it><sub><it>N </it></sub>and <it>N</it><sub><it>R </it></sub>were counted according to the rules:</p>
            <p>
               <display-formula id="M3">
                  <m:math name="1471-2105-9-29-i7" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mstyle displaystyle="true">
                              <m:munderover>
                                 <m:mo>&#8721;</m:mo>
                                 <m:mrow>
                                    <m:mi>i</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mn>1</m:mn>
                                 </m:mrow>
                                 <m:mi>N</m:mi>
                              </m:munderover>
                              <m:mrow>
                                 <m:mi>P</m:mi>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mi>i</m:mi>
                                 <m:mo stretchy="false">)</m:mo>
                                 <m:msub>
                                    <m:mi>f</m:mi>
                                    <m:mi>G</m:mi>
                                 </m:msub>
                              </m:mrow>
                           </m:mstyle>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>X</m:mi>
                           <m:mo>|</m:mo>
                           <m:mi>i</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>&#8804;</m:mo>
                           <m:msub>
                              <m:mi>f</m:mi>
                              <m:mi>&#945;</m:mi>
                           </m:msub>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaWaaabCaeaacqWGqbaucqGGOaakcqWGPbqAcqGGPaqkcqWGMbGzdaWgaaWcbaGaem4raCeabeaaaeaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGobGta0GaeyyeIuoakiabcIcaOiabdIfayjabcYha8jabdMgaPjabcMcaPiabgsMiJkabdAgaMnaaBaaaleaaiiGacqWFXoqyaeqaaaaa@4448@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>and</p>
            <p>
               <display-formula id="M4"><it>Xcorr </it>> <it>m</it><sub><it>Xcorr </it></sub></display-formula>
            </p>
            <p>where <it>X </it>= (<it>Xcorr</it>, &#916;<it>Cn</it>) is the observation, and <it>N </it>= 3 is the number of Gaussian functions. Many values of <it>f</it><sub><it>&#945; </it></sub>satisfied formula 3 and formula 4; the one giving the largest <it>N</it><sub><it>N </it></sub>was used in the final DF. Figure <figr fid="F2">2</figr> shows the DFs for different expected FPRs and different charge states. The shapes of the boundaries differed markedly between charge states, which indicates that it is difficult to fit the distributions of all charge states with a single simple distribution. The nonparametric model can provide feasible solutions to this complex problem. Since the resulting DFs are smooth, this method is more robust than the K nearest neighbor method <abbrgrp><abbr bid="B41">41</abbr></abbrgrp>.</p>
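            <p>The search for <it>f</it><sub><it>&#945;</it></sub> described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (gaussian_pdf, mixture_pdf, search_f_alpha) are hypothetical, column 0 of each observation array is assumed to hold <it>Xcorr</it>, and the FPR estimator is passed in as a callable so that formula 2 can be substituted for the placeholder ratio used in the usage note.</p>

```python
import numpy as np


def gaussian_pdf(x, mean, cov):
    """Density of a multivariate Gaussian evaluated at each row of x."""
    x = np.atleast_2d(x)
    d = x.shape[1]
    diff = x - np.asarray(mean)
    inv = np.linalg.inv(cov)
    norm = np.sqrt(((2.0 * np.pi) ** d) * np.linalg.det(cov))
    # Quadratic form diff^T * inv * diff, computed row by row.
    expo = -0.5 * np.einsum("ij,jk,ik->i", diff, inv, diff)
    return np.exp(expo) / norm


def mixture_pdf(x, weights, means, covs):
    """Left-hand side of formula 3: sum over i of P(i) * f_G(X | i)."""
    return sum(w * gaussian_pdf(x, m, c) for w, m, c in zip(weights, means, covs))


def search_f_alpha(normal, random_, weights, means, covs, m_xcorr, alpha, fpr):
    """Return (f_alpha, N_N, N_R) with the largest N_N whose fpr(N_N, N_R)
    does not exceed alpha, or None if no threshold qualifies.

    normal / random_ : arrays of (Xcorr, DeltaCn) rows from the normal and
                       randomized database searches (assumed layout).
    fpr              : callable(N_N, N_R) returning the estimated FPR.
    """
    dens_n = mixture_pdf(normal, weights, means, covs)
    dens_r = mixture_pdf(random_, weights, means, covs)
    # Formula 4: keep only matches with Xcorr above the random-match mean.
    ok_n = normal[:, 0] > m_xcorr
    ok_r = random_[:, 0] > m_xcorr
    best = None
    # Every observed density value is a candidate threshold.
    for f in np.unique(np.concatenate([dens_n, dens_r])):
        # Formula 3: accept where the random-match density is at most f.
        n_n = int(np.sum(np.logical_and(ok_n, np.less_equal(dens_n, f))))
        n_r = int(np.sum(np.logical_and(ok_r, np.less_equal(dens_r, f))))
        if n_n == 0 or fpr(n_n, n_r) > alpha:
            continue
        if best is None or n_n > best[1]:
            best = (float(f), n_n, n_r)
    return best
```

            <p>A call such as search_f_alpha(normal, random_, weights, means, covs, m_xcorr, 0.05, lambda n_n, n_r: n_r / n_n) returns the threshold together with the two counts; the ratio n_r / n_n is only a stand-in for the estimator defined by formula 2.</p>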
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Inferred filter boundaries for different charge state observations in the control dataset</p>
               </caption>
               <text>
                  <p>Inferred filter boundaries for different charge state observations in the control dataset. The pink vertical lines in the +1, +2, and +3 panels are the smallest accepted <it>Xcorr</it>. The red curves are the filter boundaries for FPR = 0.01, and the green curves are the filter boundaries for FPR = 0.05. The blue points on the <it>Xcorr</it>-&#916;<it>Cn </it>plane represent the randomized database matches, and the red points represent the normal database matches. The shapes of the boundaries differ greatly between charge states.</p>
               </text>
               <graphic file="1471-2105-9-29-2"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Incorporating more features</p>
            </st>
            <p>One obvious advantage of the nonparametric model is that it can easily integrate more scores for validating peptide identifications. By taking more features into account and performing the classification in a higher-dimensional feature space, a more suitable DF can be found, and thus higher sensitivity can be achieved. Here, another powerful parameter, <it>Sim</it>, introduced by Zhang <abbrgrp><abbr bid="B45">45</abbr></abbrgrp> in 2004 and recently discussed by Sun et al. <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>, was added to the nonparametric model. <it>Sim </it>measures the similarity between the experimental MS/MS spectrum and the predicted MS/MS spectrum generated by the kinetic model introduced by Zhang <abbrgrp><abbr bid="B45">45</abbr></abbrgrp>; the mass error tolerance for aligning the ions was set to 0.5.</p>
            <p>For the LCQ control dataset, we found by trial and error that a nonparametric model with 5 component GDFs (65 parameters) works well. We also tried a model with 7 component GDFs, but its performance did not improve and two of the component GDFs had a coefficient <it>P</it><sub><it>i </it></sub>near 0 [see Additional file <supplr sid="S1">1</supplr>]. Thus, we selected 5 component GDFs to build the model. When the expected FPR was 0.05 and 0.01, the actual FPR was 0.044 and 0.012, respectively. The number of peptide matches after filtering was 765 and 699, which were 104 (approximately 15.6%) and 121 (approximately 20.9%) higher, respectively, than the results of the nonparametric model using only <it>Xcorr </it>and &#916;<it>Cn</it>. The sensitivity increased to 0.879 and 0.822, respectively, and the specificity did not change. Thus, by incorporating more features, the nonparametric model can provide greater discriminating power. In the remainder of this paper, we discuss only the nonparametric model with three features: <it>Xcorr</it>, &#916;<it>Cn </it>and <it>Sim</it>. All the model parameters used in this paper are provided in Additional file <supplr sid="S1">1</supplr>.</p>
            <suppl id="S1">
               <title>
                  <p>Additional file 1</p>
               </title>
               <text>
                  <p>The parameters of the nonparametric models for different datasets. This file collects the parameters of the nonparametric models and the filter criteria for different datasets. The file was compressed as a RAR archive to reduce its size.</p>
               </text>
               <file name="1471-2105-9-29-S1.RAR">
                  <p>Click here for file</p>
               </file>
            </suppl>
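            <p>The figure of 65 parameters quoted for the 5-component model is consistent with counting, for each component, one mixing coefficient, a 3-element mean vector and all 9 entries of a 3 x 3 covariance matrix. The short sketch below makes that arithmetic explicit. The function name gmm_parameter_count and the counting convention (every covariance entry rather than only the distinct ones) are assumptions inferred from the stated total, not taken from the paper.</p>

```python
def gmm_parameter_count(n_components, n_features, unique_covariance=False):
    """Count the parameters of a Gaussian mixture with full covariance matrices."""
    if unique_covariance:
        # A symmetric d x d matrix has d * (d + 1) / 2 distinct entries.
        cov_entries = n_features * (n_features + 1) // 2
    else:
        # Count every entry of the d x d matrix, as a parameter table lists them.
        cov_entries = n_features * n_features
    per_component = 1 + n_features + cov_entries  # weight + mean + covariance
    return n_components * per_component


print(gmm_parameter_count(5, 3))        # 5 * (1 + 3 + 9) = 65
print(gmm_parameter_count(5, 3, True))  # 5 * (1 + 3 + 6) = 50
```

            <p>Under the same convention, the 3-component model on (<it>Xcorr</it>, &#916;<it>Cn</it>) would have 3 * (1 + 2 + 4) = 21 parameters.</p>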
         </sec>
         <sec>
            <st>
               <p>The accuracy of the FPR estimation</p>
            </st>
            <p>Control datasets, generated by analyzing a set of known proteins and peptides on MS/MS platforms, are commonly used to validate the performance of mathematical models for peptide identification <abbrgrp><abbr bid="B46">46</abbr></abbrgrp>. Table <tblr tid="T2">2</tblr> reports the actual FPR and the number of validated matches at two commonly used expected FPRs, 0.05 and 0.01. From Table <tblr tid="T2">2</tblr>, the following observations can be made:</p>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Actual FPRs and the corresponding estimated FPRs</p>
               </caption>
               <tblbdy cols="8">
                  <r>
                     <c ca="center">
                        <p>Instrument type</p>
                     </c>
                     <c ca="center">
                        <p>Charge state</p>
                     </c>
                     <c cspan="3" ca="center">
                        <p>
                           <it>Expected FPR = 0.05</it>
                        </p>
                     </c>
                     <c cspan="3" ca="center">
                        <p>
                           <it>Expected FPR = 0.01</it>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>
                           <it>Total matches/false positive matches</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>Actual FPR</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>Estimated FPR</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>Total matches/false positive matches</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>Actual FPR</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>Estimated FPR</it>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="8">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>LCQ</p>
                     </c>
                     <c ca="center">
                        <p>+1</p>
                     </c>
                     <c ca="center">
                        <p>62/3</p>
                     </c>
                     <c ca="center">
                        <p>0.048</p>
                     </c>
                     <c ca="center">
                        <p>0.030</p>
                     </c>
                     <c ca="center">
                        <p>57/2</p>
                     </c>
                     <c ca="center">
                        <p>0.035</p>
                     </c>
                     <c ca="center">
                        <p>0.000</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>+2</p>
                     </c>
                     <c ca="center">
                        <p>521/23</p>
                     </c>
                     <c ca="center">
                        <p>0.044</p>
                     </c>
                     <c ca="center">
                        <p>0.049</p>
                     </c>
                     <c ca="center">
                        <p>464/6</p>
                     </c>
                     <c ca="center">
                        <p>0.012</p>
                     </c>
                     <c ca="center">
                        <p>0.009</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>+3</p>
                     </c>
                     <c ca="center">
                        <p>181/2</p>
                     </c>
                     <c ca="center">
                        <p>0.011</p>
                     </c>
                     <c ca="center">
                        <p>0.043</p>
                     </c>
                     <c ca="center">
                        <p>178/2</p>
                     </c>
                     <c ca="center">
                        <p>0.011</p>
                     </c>
                     <c ca="center">
                        <p>0.000</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="8">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>LTQ</p>
                     </c>
                     <c ca="center">
                        <p>+1</p>
                     </c>
                     <c ca="center">
                        <p>447/43</p>
                     </c>
                     <c ca="center">
                        <p>0.096</p>
                     </c>
                     <c ca="center">
                        <p>0.048</p>
                     </c>
                     <c ca="center">
                        <p>242/9</p>
                     </c>
                     <c ca="center">
                        <p>0.037</p>
                     </c>
                     <c ca="center">
                        <p>0.008</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>+2</p>
                     </c>
                     <c ca="center">
                        <p>4,623/169</p>
                     </c>
                     <c ca="center">
                        <p>0.037</p>
                     </c>
                     <c ca="center">
                        <p>0.050</p>
                     </c>
                     <c ca="center">
                        <p>3,961/26</p>
                     </c>
                     <c ca="center">
                        <p>0.007</p>
                     </c>
                     <c ca="center">
                        <p>0.010</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>+3</p>
                     </c>
                     <c ca="center">
                        <p>1,611/59</p>
                     </c>
                     <c ca="center">
                        <p>0.037</p>
                     </c>
                     <c ca="center">
                        <p>0.050</p>
                     </c>
                     <c ca="center">
                        <p>1,449/26</p>
                     </c>
                     <c ca="center">
                        <p>0.018</p>
                     </c>
                     <c ca="center">
                        <p>0.010</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="8">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>LTQ/FT</p>
                     </c>
                     <c ca="center">
                        <p>+1</p>
                     </c>
                     <c ca="center">
                        <p>168/18</p>
                     </c>
                     <c ca="center">
                        <p>0.107</p>
                     </c>
                     <c ca="center">
                        <p>0.047</p>
                     </c>
                     <c ca="center">
                        <p>124/12</p>
                     </c>
                     <c ca="center">
                        <p>0.097</p>
                     </c>
                     <c ca="center">
                        <p>0.000</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>+2</p>
                     </c>
                     <c ca="center">
                        <p>1,861/43</p>
                     </c>
                     <c ca="center">
                        <p>0.023</p>
                     </c>
                     <c ca="center">
                        <p>0.049</p>
                     </c>
                     <c ca="center">
                        <p>1,543/14</p>
                     </c>
                     <c ca="center">
                        <p>0.009</p>
                     </c>
                     <c ca="center">
                        <p>0.009</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>+3</p>
                     </c>
                     <c ca="center">
                        <p>565/6</p>
                     </c>
                     <c ca="center">
                        <p>0.011</p>
                     </c>
                     <c ca="center">
                        <p>0.048</p>
                     </c>
                     <c ca="center">
                        <p>543/7</p>
                     </c>
                     <c ca="center">
                        <p>0.007</p>
                     </c>
                     <c ca="center">
                        <p>0.007</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <p>(1) In most cases, the FPRs estimated by formula 2 were close to, but larger than, the actual FPRs. Thus, the quality of the resulting datasets was better than claimed. This conservative estimate supports strict quality control of the results, at the cost of some sensitivity.</p>
            <p>(2) For small datasets, such as the +1 charge state matches of the different instruments, the actual FPR was larger than the corresponding estimated FPR, and the error of the FPR estimation was also somewhat larger. This result agrees with the conclusions of Huttlin et al. <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>.</p>
            <p>(3) The estimated FPRs were close to, but not equal to, the expected FPR. The smaller the resulting dataset, the larger the difference between the estimated FPR and the expected FPR. This arises from the rounding error in formula 2. For example, with an expected FPR of 0.01, the allowable number of random matches was less than 1 for the +1 charge dataset of LCQ, because only 62 matches were left after filtering. Thus, it is impossible to obtain an estimated FPR exactly equal to 0.01. The preferred alternative is to round the estimated FPR to 0 (Table <tblr tid="T2">2</tblr>).</p>
            <p>(4) The error of the FPR estimation at the expected FPR of 0.01 is larger than that at 0.05. This result suggests that some unexpected contaminants may be present. For example, in the LCQ control dataset, the peptide "HVGDLGNVTADK" was identified with high database scores (<it>Xcorr </it>= 4.5837, &#916;<it>Cn </it>= 0.542204), and the matched percentage of predicted ions reached 91% (Figure <figr fid="F3">3</figr>). This peptide comes from protein sp|P00441|SODC_HUMAN, which is not a protein in the control sample. However, this peptide also belongs to protein sp|P00442|SODC_BOVIN, which may be a contaminant in the sample because 4 bovine proteins (ALBU_BOVIN, LACB_BOVIN, LCA_BOVIN and CYC_BOVIN) were added to the control sample.</p>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>The mass spectrum matched with peptide "HVGDLGNVTADK"</p>
               </caption>
               <text>
                  <p>The mass spectrum matched with peptide "HVGDLGNVTADK".</p>
               </text>
               <graphic file="1471-2105-9-29-3"/>
            </fig>
            <p>(5) On manually checking the matches confirmed by the nonparametric model, we found that some results with a large <it>Xcorr </it>but a very small &#916;<it>Cn </it>were confirmed. In some of these cases, the second-ranked peptide was the correct one. For example, in the LTQ dataset (D2), the peptide "LEAELEK" was identified with <it>Xcorr </it>= 2.4273 and &#916;<it>Cn </it>= 0.0533 (+1 charge state). The second-ranked peptide was "LEALEEK", a peptide from control protein P62937|PPIA_HUMAN. The theoretical mass spectra of these two peptides are very similar, and such cases introduce some error into the FPR estimation.</p>
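            <p>The rounding effect described in point (3) can be illustrated with a few lines of Python. The sketch assumes, purely for illustration, that the estimated FPR scales as the number of accepted random matches divided by the number of accepted matches; formula 2 may differ from this by a constant factor, and the function names are hypothetical.</p>

```python
import math


def allowed_random_matches(n_accepted, alpha):
    """Largest whole number of random matches that keeps
    n_random / n_accepted at or below alpha (assumed estimator)."""
    return math.floor(alpha * n_accepted)


def achievable_fpr(n_accepted, alpha):
    """Largest estimated FPR not exceeding alpha that a dataset of this size can reach."""
    return allowed_random_matches(n_accepted, alpha) / n_accepted


# 62 accepted matches, the figure quoted for the LCQ +1 dataset in point (3):
# 0.01 * 62 = 0.62 random matches would be allowed, which rounds down to 0.
print(allowed_random_matches(62, 0.01))    # 0
print(achievable_fpr(62, 0.01))            # 0.0

# A large dataset (4,623 matches, LTQ +2 in Table 2) can get much closer to alpha.
print(allowed_random_matches(4623, 0.05))  # 231
```

            <p>With few accepted matches the achievable estimated FPR values are coarsely spaced, whereas for the larger +2 and +3 datasets the nearest achievable value lies very close to the expected FPR.</p>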
         </sec>
         <sec>
            <st>
               <p>Comparing the performance of the nonparametric model with other methods</p>
            </st>
            <p>Two other methods are also widely used in proteomic research. The first one (named M1) searches for the optimized cut-off values of <it>Xcorr </it>and &#916;<it>Cn </it>simultaneously, so that the number of confirmed matches is maximized for a given expected FPR. The resulting accepted region on the <it>Xcorr</it>-&#916;<it>Cn </it>plane is a rectangle. The second one (named M2) is PeptideProphet (V1.9), an empirical statistical model introduced by Keller et al. <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. PeptideProphet provides the estimated error rates (EER) at different probability score cut-offs. EER has a meaning similar to that of FPR, so we used it as the measure of the quality of the resulting dataset, and only the probability score cut-offs, without any additional criterion, were used to filter the matches. For convenience, the nonparametric model is denoted M3 in the remainder of this paper. For the control datasets, the confirmed matches, the actual FPR and the sensitivity are listed in Table <tblr tid="T3">3</tblr> (the filter criteria can be found in Additional file <supplr sid="S1">1</supplr>). Some conclusions can be drawn:</p>
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Comparison of different methods on the control datasets</p>
               </caption>
               <tblbdy cols="8">
                  <r>
                     <c ca="center">
                        <p>Instrument type</p>
                     </c>
                     <c ca="center">
                        <p>Methods</p>
                     </c>
                     <c cspan="3" ca="center">
                        <p>
                           <it>Expected FPR = 0.05</it>
                        </p>
                     </c>
                     <c cspan="3" ca="center">
                        <p>
                           <it>Expected FPR = 0.01</it>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>
                           <it>Validated matches/false positives</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>Actual FPR</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p><it>Sensitivity </it>(%)</p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>Validated matches/false positives</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>Actual FPR</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p><it>Sensitivity </it>(%)</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="8">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>LCQ</p>
                     </c>
                     <c ca="center">
                        <p>M1</p>
                     </c>
                     <c ca="center">
                        <p>652/30</p>
                     </c>
                     <c ca="center">
                        <p>0.046</p>
                     </c>
                     <c ca="center">
                        <p>74.3</p>
                     </c>
                     <c ca="center">
                        <p>581/15</p>
                     </c>
                     <c ca="center">
                        <p>0.026</p>
                     </c>
                     <c ca="center">
                        <p>69.1</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>M2</p>
                     </c>
                     <c ca="center">
                        <p>735/34</p>
                     </c>
                     <c ca="center">
                        <p>0.046</p>
                     </c>
                     <c ca="center">
                        <p>84.1</p>
                     </c>
                     <c ca="center">
                        <p>587/9</p>
                     </c>
                     <c ca="center">
                        <p>0.015</p>
                     </c>
                     <c ca="center">
                        <p>69.3</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>M3</p>
                     </c>
                     <c ca="center">
                        <p>765/28</p>
                     </c>
                     <c ca="center">
                        <p>0.037</p>
                     </c>
                     <c ca="center">
                        <p>87.9</p>
                     </c>
                     <c ca="center">
                        <p>699/10</p>
                     </c>
                     <c ca="center">
                        <p>0.014</p>
                     </c>
                     <c ca="center">
                        <p>82.2</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="8">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>LTQ</p>
                     </c>
                     <c ca="center">
                        <p>M1</p>
                     </c>
                     <c ca="center">
                        <p>5507/156</p>
                     </c>
                     <c ca="center">
                        <p>0.028</p>
                     </c>
                     <c ca="center">
                        <p>71.0</p>
                     </c>
                     <c ca="center">
                        <p>4761/48</p>
                     </c>
                     <c ca="center">
                        <p>0.010</p>
                     </c>
                     <c ca="center">
                        <p>62.6</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>M2</p>
                     </c>
                     <c ca="center">
                        <p>5818/197</p>
                     </c>
                     <c ca="center">
                        <p>0.034</p>
                     </c>
                     <c ca="center">
                        <p>74.6</p>
                     </c>
                     <c ca="center">
                        <p>4640/20</p>
                     </c>
                     <c ca="center">
                        <p>0.004</p>
                     </c>
                     <c ca="center">
                        <p>61.6</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>M3</p>
                     </c>
                     <c ca="center">
                        <p>6681/271</p>
                     </c>
                     <c ca="center">
                        <p>0.041</p>
                     </c>
                     <c ca="center">
                        <p>85.1</p>
                     </c>
                     <c ca="center">
                        <p>5652/61</p>
                     </c>
                     <c ca="center">
                        <p>0.011</p>
                     </c>
                     <c ca="center">
                        <p>74.2</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="8">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>LTQ/FT</p>
                     </c>
                     <c ca="center">
                        <p>M1</p>
                     </c>
                     <c ca="center">
                        <p>2554/69</p>
                     </c>
                     <c ca="center">
                        <p>0.027</p>
                     </c>
                     <c ca="center">
                        <p>83.7</p>
                     </c>
                     <c ca="center">
                        <p>2135/30</p>
                     </c>
                     <c ca="center">
                        <p>0.014</p>
                     </c>
                     <c ca="center">
                        <p>70.9</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>M2</p>
                     </c>
                     <c ca="center">
                        <p>2111/46</p>
                     </c>
                     <c ca="center">
                        <p>0.022</p>
                     </c>
                     <c ca="center">
                        <p>69.6</p>
                     </c>
                     <c ca="center">
                        <p>1411/15</p>
                     </c>
                     <c ca="center">
                        <p>0.011</p>
                     </c>
                     <c ca="center">
                        <p>46.8</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>M3</p>
                     </c>
                     <c ca="center">
                        <p>2594/67</p>
                     </c>
                     <c ca="center">
                        <p>0.026</p>
                     </c>
                     <c ca="center">
                        <p>87.5</p>
                     </c>
                     <c ca="center">
                        <p>2210/33</p>
                     </c>
                     <c ca="center">
                        <p>0.015</p>
                     </c>
                     <c ca="center">
                        <p>74.5</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <p>(1) In each case, the sensitivity of M3 is the highest. The difference in sensitivity of different methods ranges from 3.8% to 27.7%.</p>
            <p>(2) For the LCQ and LTQ datasets, M1 and M2 perform similarly, whereas PeptideProphet (M2), which was trained on an LCQ control dataset <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>, does not seem to work well on the LTQ/FT dataset.</p>
            <p>(3) The performance of the nonparametric model differs little across datasets from different instruments. Its sensitivity is above 0.85 when the expected FPR is 0.05 and above 0.74 when the expected FPR is 0.01.</p>
            <p>(4) FPR estimation errors exist for all methods, and in some cases the error is large. This may be caused by calculation errors arising from unexpected contaminants and random errors.</p>
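            <p>As a rough consistency check on the table above, the observed FPR in each cell appears to equal the second number of the count pair divided by the first (for example, 34/735 for M2 on the LCQ dataset at an expected FPR of 0.05). The Python sketch below recomputes this ratio for a few rows. Reading the pairs as confirmed matches / false positives is an assumption inferred from the numbers, and the function and variable names are illustrative only.</p>

```python
def observed_fpr(confirmed, false_positives):
    """Fraction of confirmed matches that are false positives (assumed reading of the pairs)."""
    if confirmed == 0:
        raise ValueError("no confirmed matches")
    return false_positives / confirmed

# (instrument, method, expected FPR) -> (confirmed, false positives), copied from the table
rows = {
    ("LCQ", "M2", 0.05): (735, 34),
    ("LCQ", "M3", 0.05): (765, 28),
    ("LTQ", "M3", 0.01): (5652, 61),
    ("LTQ/FT", "M2", 0.01): (1411, 15),
}

# Round to three decimals, the precision used in the table
fpr_by_row = {key: round(observed_fpr(*counts), 3) for key, counts in rows.items()}
for key, fpr in fpr_by_row.items():
    print(key, fpr)
```

            <p>Under this reading, the recomputed ratios agree with the tabulated FPR values to three decimal places, and comparing them with the expected FPR gives the estimation error discussed in point (4).</p>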
         </sec>
         <sec>
            <st>
               <p>Application to large datasets</p>
            </st>
            <p>Shotgun experiments typically generate large datasets <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. Thus, the nonparametric model, shown to be effective on the control dataset, should also be validated on large datasets. First, we investigated the quality of the matches confirmed by the nonparametric model (the filter criteria can be found in Additional file <supplr sid="S7">7</supplr>). Six further parameters that are commonly used to validate peptide identifications from SEQUEST database searches were calculated for each match: the maximal continuous b or y ion series length (<it>CSL</it>) <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>, the percentage of predicted ions matched by SEQUEST (<it>Ions</it>) <abbrgrp><abbr bid="B44">44</abbr></abbrgrp>, the ranked preliminary score (<it>RSp</it>) <abbrgrp><abbr bid="B44">44</abbr></abbrgrp>, the continuity of the b or y ion series (<it>Cont</it>) <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>, the matched percentage of ion intensities in the experimental mass spectrum (<it>iIons</it>) <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> and the matched percentage of peaks in the experimental mass spectrum (<it>nIons</it>) <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>. The percentages of confirmed matches that passed the empirical rules (Table <tblr tid="T4">4</tblr>) indicate that most of these matches have a high confidence level. Note that <it>RSp </it>= 1 is a strict rule <abbrgrp><abbr bid="B44">44</abbr></abbrgrp>, and some correct matches are lost if it is required. For instance, only 76% of the correct matches in the LTQ control dataset have <it>RSp </it>= 1.</p>
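            <p>The rule-based check described above can be sketched in a few lines of Python. The thresholds are those listed in the header of Table 4. The parameter values in the toy matches are invented for illustration and are not taken from any dataset in this paper, and the dictionary layout and function name are assumptions.</p>

```python
# Thresholds as listed in the header of Table 4; each rule returns True when a match passes.
RULES = {
    "CSL": lambda v: v >= 4,
    "Ions": lambda v: v >= 0.2,
    "RSp": lambda v: v == 1,
    "Cont": lambda v: v >= 0.2,
    "iIons": lambda v: v >= 0.25,
    "nIons": lambda v: v >= 0.2,
}

def pass_rates(matches):
    """Percentage of matches passing each rule, with every rule evaluated independently."""
    if not matches:
        raise ValueError("no matches")
    return {
        name: 100.0 * sum(rule(m[name]) for m in matches) / len(matches)
        for name, rule in RULES.items()
    }

# Toy matches with invented parameter values (hypothetical, for illustration only).
toy = [
    {"CSL": 6, "Ions": 0.45, "RSp": 1, "Cont": 0.40, "iIons": 0.50, "nIons": 0.30},
    {"CSL": 3, "Ions": 0.25, "RSp": 2, "Cont": 0.15, "iIons": 0.30, "nIons": 0.25},
    {"CSL": 5, "Ions": 0.30, "RSp": 3, "Cont": 0.35, "iIons": 0.20, "nIons": 0.10},
    {"CSL": 4, "Ions": 0.10, "RSp": 1, "Cont": 0.25, "iIons": 0.40, "nIons": 0.22},
]
rates = pass_rates(toy)
for name, pct in rates.items():
    print(name, pct)
```

            <p>Evaluating the rules one at a time, rather than requiring all of them jointly, mirrors the per-column percentages reported in Table 4 and makes it easy to see which rule (here <it>RSp </it>= 1) rejects the most matches.</p>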
            <p>As a case study, we investigated the overlap of the three methods on the LTQ dataset. More than 90% of the matches confirmed by M1 or M2 were also confirmed by M3 (Figure <figr fid="F4">4</figr>), and 89.1% (FPR = 0.05) and 83.6% (FPR = 0.01) of the matches confirmed by the nonparametric model were covered by M1 &#8746; M2. Each of the three methods confirms some matches that the other two do not, because they use different filter boundaries and different parameters.</p>
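            <p>Coverage figures of this kind reduce to set operations once the matches confirmed by each method are available as identifiers. The following Python sketch uses made-up spectrum identifiers rather than the LTQ data, and the helper name is hypothetical.</p>

```python
def coverage(target, reference):
    """Percentage of matches in `target` that also appear in `reference`."""
    if not target:
        return 0.0
    return 100.0 * len(target.intersection(reference)) / len(target)

# Hypothetical confirmed-match identifiers for the three methods (toy data).
m1 = {"s1", "s2", "s3", "s4"}
m2 = {"s2", "s3", "s5"}
m3 = {"s1", "s2", "s3", "s4", "s6", "s7"}

m1_or_m2 = m1.union(m2)

m1_in_m3 = coverage(m1, m3)              # share of M1 matches also confirmed by M3
m3_in_union = coverage(m3, m1_or_m2)     # share of M3 matches covered by the union of M1 and M2
unique_to_m3 = m3.difference(m1_or_m2)   # matches confirmed by M3 only

print(m1_in_m3, round(m3_in_union, 1), sorted(unique_to_m3))
```

            <p>The set difference at the end corresponds to the matches uniquely confirmed by one method, which is the quantity examined for each method in the score-distribution comparison.</p>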
            <tbl id="T4">
               <title>
                  <p>Table 4</p>
               </title>
               <caption>
                  <p>Validation of the confirmed matches by empirical rules (%).</p>
               </caption>
               <tblbdy cols="8">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="7" ca="center">
                        <p>Empirical rules</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="8">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Instrument</p>
                     </c>
                     <c ca="center">
                        <p>FPR</p>
                     </c>
                     <c ca="center">
                        <p><it>CSL </it>&#8805; 4</p>
                     </c>
                     <c ca="center">
                        <p><it>Ions </it>&#8805; 0.2</p>
                     </c>
                     <c ca="center">
                        <p><it>RSp </it>= 1</p>
                     </c>
                     <c ca="center">
                        <p><it>Cont </it>&#8805; 0.2</p>
                     </c>
                     <c ca="center">
                        <p><it>iIons </it>&#8805; 0.25</p>
                     </c>
                     <c ca="center">
                        <p><it>nIons </it>&#8805; 0.2</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="8">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>LCQ</p>
                     </c>
                     <c ca="center">
                        <p>0.05</p>
                     </c>
                     <c ca="center">
                        <p>92.1</p>
                     </c>
                     <c ca="center">
                        <p>99.5</p>
                     </c>
                     <c ca="center">
                        <p>77.6</p>
                     </c>
                     <c ca="center">
                        <p>86.5</p>
                     </c>
                     <c ca="center">
                        <p>98.0</p>
                     </c>
                     <c ca="center">
                        <p>96.4</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.01</p>
                     </c>
                     <c ca="center">
                        <p>94.5</p>
                     </c>
                     <c ca="center">
                        <p>99.8</p>
                     </c>
                     <c ca="center">
                        <p>85.9</p>
                     </c>
                     <c ca="center">
                        <p>86.5</p>
                     </c>
                     <c ca="center">
                        <p>98.6</p>
                     </c>
                     <c ca="center">
                        <p>97.6</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="8">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>LTQ</p>
                     </c>
                     <c ca="center">
                        <p>0.05</p>
                     </c>
                     <c ca="center">
                        <p>91.5</p>
                     </c>
                     <c ca="center">
                        <p>90.5</p>
                     </c>
                     <c ca="center">
                        <p>68.6</p>
                     </c>
                     <c ca="center">
                        <p>93.4</p>
                     </c>
                     <c ca="center">
                        <p>89.9</p>
                     </c>
                     <c ca="center">
                        <p>92.7</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.01</p>
                     </c>
                     <c ca="center">
                        <p>96.9</p>
                     </c>
                     <c ca="center">
                        <p>99.8</p>
                     </c>
                     <c ca="center">
                        <p>75.6</p>
                     </c>
                     <c ca="center">
                        <p>95.6</p>
                     </c>
                     <c ca="center">
                        <p>96.7</p>
                     </c>
                     <c ca="center">
                        <p>97.1</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="8">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>LTQ/FT</p>
                     </c>
                     <c ca="center">
                        <p>0.05</p>
                     </c>
                     <c ca="center">
                        <p>99.1</p>
                     </c>
                     <c ca="center">
                        <p>100.0</p>
                     </c>
                     <c ca="center">
                        <p>67.7</p>
                     </c>
                     <c ca="center">
                        <p>98.6</p>
                     </c>
                     <c ca="center">
                        <p>97.0</p>
                     </c>
                     <c ca="center">
                        <p>99.9</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.01</p>
                     </c>
                     <c ca="center">
                        <p>99.5</p>
                     </c>
                     <c ca="center">
                        <p>100.0</p>
                     </c>
                     <c ca="center">
                        <p>75.9</p>
                     </c>
                     <c ca="center">
                        <p>99.0</p>
                     </c>
                     <c ca="center">
                        <p>98.0</p>
                     </c>
                     <c ca="center">
                        <p>100.0</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Comparison of the confirmed matches among M1, M2 and M3</p>
               </caption>
               <text>
                  <p>Comparison of the confirmed matches among M1, M2 and M3.</p>
               </text>
               <graphic file="1471-2105-9-29-4"/>
            </fig>
            <p>Figure <figr fid="F5">5A</figr> shows the mesh grids of a DF of M3 (+2 charge state matches in D5, FPR = 0.01). Matches with smaller <it>Xcorr</it>, &#916;<it>Cn </it>or <it>Sim </it>were discarded by M3, which agrees with the experience that matches with large scores (<it>Xcorr</it>, &#916;<it>Cn </it>or <it>Sim</it>) are more likely to be correct. Figures <figr fid="F5">5B</figr> to <figr fid="F5">5E</figr> illustrate the score distributions of the matches uniquely confirmed by M1 to M3. Some matches with small <it>Xcorr</it>, &#916;<it>Cn </it>and <it>Sim </it>were confirmed by PeptideProphet (M2, red points), which integrates other parameters such as the preliminary score (<it>Sp</it>). M1 confirmed some matches with intermediate <it>Xcorr </it>and &#916;<it>Cn </it>but small <it>Sim </it>(green points). M3 confirmed many matches (4714) with relatively small <it>Xcorr </it>and &#916;<it>Cn </it>but large <it>Sim</it>, which were discarded by M1 and M2. These results demonstrate that different filter boundaries with different parameters generate results of different sensitivity, and that integrating more complementary parameters by appropriate methods can improve the sensitivity of database search result validation.</p>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>The mesh grids of the DF of M3 and the score distribution of the matches uniquely validated by M1~M3</p>
               </caption>
               <text>
                  <p>The mesh grids of the DF of M3 and the score distribution of the matches uniquely validated by M1~M3. The blue points in B~E represent the matches uniquely validated by M3, the red points are those of M2 and the green points are those of M1.</p>
               </text>
               <graphic file="1471-2105-9-29-5"/>
            </fig>
            <p>In Table <tblr tid="T5">5</tblr>, we give the numbers of confirmed matches, non-redundant peptides and identified proteins (minimal protein list assembled by the DBParser algorithm <abbrgrp><abbr bid="B47">47</abbr></abbrgrp>) and the percentages of proteins with at least 2 or 3 peptide hits (the filter criteria can be found in Additional file <supplr sid="S7">7</supplr>). The nonparametric model confirmed up to 14.5% more proteins than the other two methods, which indicates that our model has a higher sensitivity. For the same kind of instrument, the three methods gave about the same percentage of proteins with at least 2 or 3 peptide hits at both confidence levels. The percentage of proteins with at least 2 peptide hits is above 50% for the LCQ and LTQ datasets, but only 42% to 46% for the LTQ/FT dataset. Interestingly, for a given method, raising the confidence level of the peptide identifications does not substantially improve the percentage of proteins with at least 2 or 3 peptide hits.</p>
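            <p>The relative gain in identified proteins can be recomputed directly from the protein counts in Table 5. The Python sketch below is one way to do so; the data layout and function name are illustrative, and the gain is defined here as the percentage increase of the M3 count over the count of each other method, which is an assumed reading of the comparison.</p>

```python
def relative_gain(new, baseline):
    """Percentage increase of `new` over `baseline`."""
    return 100.0 * (new - baseline) / baseline

# Protein counts copied from Table 5: {(instrument, expected FPR): {method: proteins}}
proteins = {
    ("LCQ", 0.05): {"M1": 1922, "M2": 1860, "M3": 2077},
    ("LCQ", 0.01): {"M1": 1630, "M2": 1586, "M3": 1729},
    ("LTQ", 0.05): {"M1": 3363, "M2": 3166, "M3": 3421},
    ("LTQ", 0.01): {"M1": 2733, "M2": 2488, "M3": 2801},
    ("LTQ/FT", 0.05): {"M1": 2723, "M2": 2462, "M3": 2820},
    ("LTQ/FT", 0.01): {"M1": 2193, "M2": 2083, "M3": 2291},
}

# Gain of M3 over each other method, for every instrument and expected FPR
gains = {
    (key, method): relative_gain(counts["M3"], counts[method])
    for key, counts in proteins.items()
    for method in ("M1", "M2")
}

best_case, best_gain = max(gains.items(), key=lambda item: item[1])
print(best_case, round(best_gain, 1))
```

            <p>With these counts, the largest gain is obtained against M2 on the LTQ/FT dataset at an expected FPR of 0.05, and it rounds to the 14.5% quoted in the text.</p>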
            <tbl id="T5">
               <title>
                  <p>Table 5</p>
               </title>
               <caption>
                  <p>Comparison of different methods on the complex datasets</p>
               </caption>
               <tblbdy cols="10">
                  <r>
                     <c ca="center">
                        <p>Instrument type</p>
                     </c>
                     <c ca="center">
                        <p>Methods</p>
                     </c>
                     <c cspan="4" ca="center">
                        <p>
                           <it>Expected FPR = 0.05</it>
                        </p>
                     </c>
                     <c cspan="4" ca="center">
                        <p>
                           <it>Expected FPR = 0.01</it>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="8">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>
                           <it>Confirmed matches</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>Non-redundant peptides</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>Proteins*</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p><it>Proteins with at least 2/3 peptide hits </it>(%)</p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>Confirmed matches</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>Non-redundant peptides</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>Proteins*</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p><it>Proteins with at least 2/3 peptide hits </it>(%)</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="10">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>LCQ</p>
                     </c>
                     <c ca="center">
                        <p>M1</p>
                     </c>
                     <c ca="center">
                        <p>13,636</p>
                     </c>
                     <c ca="center">
                        <p>5,268</p>
                     </c>
                     <c ca="center">
                        <p>1,922</p>
                     </c>
                     <c ca="center">
                        <p>51.1/35.2</p>
                     </c>
                     <c ca="center">
                        <p>11,512</p>
                     </c>
                     <c ca="center">
                        <p>4,496</p>
                     </c>
                     <c ca="center">
                        <p>1,630</p>
                     </c>
                     <c ca="center">
                        <p>54.0/36.0</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>M2</p>
                     </c>
                     <c ca="center">
                        <p>14,128</p>
                     </c>
                     <c ca="center">
                        <p>5,333</p>
                     </c>
                     <c ca="center">
                        <p>1,860</p>
                     </c>
                     <c ca="center">
                        <p>53.7/36.4</p>
                     </c>
                     <c ca="center">
                        <p>10,436</p>
                     </c>
                     <c ca="center">
                        <p>4,219</p>
                     </c>
                     <c ca="center">
                        <p>1,586</p>
                     </c>
                     <c ca="center">
                        <p>53.3/34.4</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>M3</p>
                     </c>
                     <c ca="center">
                        <p>15,923</p>
                     </c>
                     <c ca="center">
                        <p>5,872</p>
                     </c>
                     <c ca="center">
                        <p>2,077</p>
                     </c>
                     <c ca="center">
                        <p>52.6/36.2</p>
                     </c>
                     <c ca="center">
                        <p>13,549</p>
                     </c>
                     <c ca="center">
                        <p>5,084</p>
                     </c>
                     <c ca="center">
                        <p>1,729</p>
                     </c>
                     <c ca="center">
                        <p>55.9/38.3</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="10">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>LTQ</p>
                     </c>
                     <c ca="center">
                        <p>M1</p>
                     </c>
                     <c ca="center">
                        <p>45,153</p>
                     </c>
                     <c ca="center">
                        <p>10,359</p>
                     </c>
                     <c ca="center">
                        <p>3,363</p>
                     </c>
                     <c ca="center">
                        <p>54.6/37.6</p>
                     </c>
                     <c ca="center">
                        <p>36,857</p>
                     </c>
                     <c ca="center">
                        <p>8,601</p>
                     </c>
                     <c ca="center">
                        <p>2,733</p>
                     </c>
                     <c ca="center">
                        <p>58.3/39.3</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>M2</p>
                     </c>
                     <c ca="center">
                        <p>40,791</p>
                     </c>
                     <c ca="center">
                        <p>10,053</p>
                     </c>
                     <c ca="center">
                        <p>3,166</p>
                     </c>
                     <c ca="center">
                        <p>55.2/39.1</p>
                     </c>
                     <c ca="center">
                        <p>30,696</p>
                     </c>
                     <c ca="center">
                        <p>7,875</p>
                     </c>
                     <c ca="center">
                        <p>2,488</p>
                     </c>
                     <c ca="center">
                        <p>58.7/40.3</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>M3</p>
                     </c>
                     <c ca="center">
                        <p>52,569</p>
                     </c>
                     <c ca="center">
                        <p>11,451</p>
                     </c>
                     <c ca="center">
                        <p>3,421</p>
                     </c>
                     <c ca="center">
                        <p>57.9/40.9</p>
                     </c>
                     <c ca="center">
                        <p>44,576</p>
                     </c>
                     <c ca="center">
                        <p>9,756</p>
                     </c>
                     <c ca="center">
                        <p>2,801</p>
                     </c>
                     <c ca="center">
                        <p>61.6/43.1</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="10">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>LTQ/FT</p>
                     </c>
                     <c ca="center">
                        <p>M1</p>
                     </c>
                     <c ca="center">
                        <p>25,672</p>
                     </c>
                     <c ca="center">
                        <p>4,602</p>
                     </c>
                     <c ca="center">
                        <p>2,723</p>
                     </c>
                     <c ca="center">
                        <p>42.0/23.0</p>
                     </c>
                     <c ca="center">
                        <p>22,750</p>
                     </c>
                     <c ca="center">
                        <p>3,869</p>
                     </c>
                     <c ca="center">
                        <p>2,193</p>
                     </c>
                     <c ca="center">
                        <p>42.5/22.8</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>M2</p>
                     </c>
                     <c ca="center">
                        <p>23,571</p>
                     </c>
                     <c ca="center">
                        <p>3,947</p>
                     </c>
                     <c ca="center">
                        <p>2,462</p>
                     </c>
                     <c ca="center">
                        <p>45.2/25.4</p>
                     </c>
                     <c ca="center">
                        <p>19,930</p>
                     </c>
                     <c ca="center">
                        <p>3,366</p>
                     </c>
                     <c ca="center">
                        <p>2,083</p>
                     </c>
                     <c ca="center">
                        <p>45.4/24.9</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>M3</p>
                     </c>
                     <c ca="center">
                        <p>27,565</p>
                     </c>
                     <c ca="center">
                        <p>4,855</p>
                     </c>
                     <c ca="center">
                        <p>2,820</p>
                     </c>
                     <c ca="center">
                        <p>43.7/24.8</p>
                     </c>
                     <c ca="center">
                        <p>25,185</p>
                     </c>
                     <c ca="center">
                        <p>4,196</p>
                     </c>
                     <c ca="center">
                        <p>2,291</p>
                     </c>
                     <c ca="center">
                        <p>45.6/25.7</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Note: * The count of the minimal protein list assembled by the DBParser algorithm [47].</p>
               </tblfn>
            </tbl>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>Due to the complexity of the peptide identification problem, many parameters have been proposed for modeling the quality of matches between MS/MS spectra and peptides. For example, <it>Xcorr </it>and <it>Sim </it>assess the similarity between theoretical and experimental spectra, and &#916;<it>Cn </it>assesses the effect of database size. There are two main reasons for the simultaneous existence of multiple parameters. First, the complex physical and chemical processes of the MS/MS platform make it difficult to model the peptide identification problem universally <abbrgrp><abbr bid="B48">48</abbr></abbrgrp>. Second, the huge computational burden of the database search makes it difficult to implement complex models. Thus, most MS/MS data processing approaches currently in use include two steps: 1) find candidate peptides quickly and thus reduce the search space; 2) validate the results carefully by taking more information into account. As in this paper, a popular approach to quality control of data in shotgun proteomics is to generate a set of easily calculated scores that measure the quality of the matches in different ways and then to combine these parameters to validate the results <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>. The randomized database method provides a feasible framework for constructing a negative control dataset and controlling the FPR of the acquired dataset. The nonparametric model introduced in this paper provides a framework for feature integration and determination of nonlinear DFs. However, if too many parameters are used, the nonparametric model will encounter a computational problem. With too many variable parameters in the model, there may be many solutions to the MLE equations. The iterative process of the EM algorithm may then converge to a local optimum, and good performance of the model cannot be guaranteed. When many features are used, we therefore recommend that the features be partitioned into different groups by hierarchical clustering <abbrgrp><abbr bid="B49">49</abbr></abbrgrp> and that the nonparametric model be applied to each cluster. Other feature-space reduction methods such as principal component analysis (PCA) and partial least squares (PLS) can also be used <abbrgrp><abbr bid="B50">50</abbr></abbrgrp>.</p>
         <p>The EM algorithm is guaranteed to converge <abbrgrp><abbr bid="B51">51</abbr></abbrgrp>. However, if there are too many variables, it may converge to a local optimum. For doubly charged matches in the LCQ control dataset (here, we used only two variables: <it>Xcorr </it>and &#916;<it>Cn</it>), we also tried a Gaussian mixed model with 15 components (five times as many as in the model we used). The values of the ML function calculated in the iterative process of the EM algorithm increased monotonically for the Gaussian mixed model with 3 components, whereas for the Gaussian mixed model with 15 components they initially increased and then decreased over the iterations (Figure <figr fid="F6">6</figr>). The performance (<it>&#967;</it><sup>2 </sup>statistic; smaller = better) of the 15-component model showed the same pattern. This confirmed that too many variables (90 variables) do not lead to better performance. Fortunately, the Gaussian model with 3 mixed functions fit the data satisfactorily. For the large dataset and the model with more features, the number of component functions did not exceed 7. If a more complex mixed model is needed, we recommend the following strategies: 1) optimize the ML function directly using more robust nonlinear optimization techniques such as the conjugate gradient and quasi-Newton methods <abbrgrp><abbr bid="B52">52</abbr></abbrgrp>; 2) directly fit the histogram, binned with an optimized rule (such as Scott's rule <abbrgrp><abbr bid="B53">53</abbr></abbrgrp>), using an RBF neural network; or 3) use another nonparametric model such as the adaptive kernel density estimation proposed by Silverman <abbrgrp><abbr bid="B54">54</abbr></abbrgrp>.</p>
         <fig id="F6">
            <title>
               <p>Figure 6</p>
            </title>
            <caption>
               <p>ML function values and <it>&#967;</it><sup>2 </sup>statistic with iterative step and different numbers of mixed Gaussian functions</p>
            </caption>
            <text>
               <p>ML function values and <it>&#967;</it><sup>2 </sup>statistic with iterative step and different numbers of mixed Gaussian functions. <it>n</it>, the iterative step of the EM algorithm; L, the ML function value; chi-square, <it>&#967;</it><sup>2 </sup>statistic; N, number of mixed Gaussian functions. The EM algorithm clearly runs into the local optimum problem when there are too many variables.</p>
            </text>
            <graphic file="1471-2105-9-29-6"/>
         </fig>
         <p>The computational burden of the nonparametric model may be a concern, especially for the huge LTQ dataset. Fortunately, building the nonparametric model does not require that many observations. If the dataset is too large, we can resample the observations and use fewer of them to build the model. We tried this approach on the LTQ complex dataset. The results achieved by the model built with 30,000 randomly selected observations differed little from those of the model built with all 432,338 observations. Thus, in the model building procedure, if the number of observations exceeds 30,000, we resample the dataset and randomly select 30,000 observations to build the model; otherwise, all the observations are used. As a result, model building took less than 2 min on a PC with an Intel Pentium 4 2.8 GHz CPU and 512 MB of memory.</p>
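         <p>The resampling rule described above can be sketched in a few lines of Python. This is only an illustration, not the Matlab implementation used in this work: the function name subsample_observations, the fixed seed, and the choice of uniform sampling without replacement are assumptions made for the example.</p>
         <p><![CDATA[
```python
import random

MAX_OBSERVATIONS = 30000  # cap stated in the text


def subsample_observations(observations, cap=MAX_OBSERVATIONS, seed=0):
    """Return every observation if there are at most `cap` of them,
    otherwise a uniform random subset of size `cap` (without replacement).

    The seed is an assumption, used only to make the sketch reproducible.
    """
    observations = list(observations)
    if len(observations) <= cap:
        return observations
    rng = random.Random(seed)
    return rng.sample(observations, cap)
```
]]></p>
         <p>Sampling without replacement keeps each selected match unique, so the reduced set remains a plain subset of the randomized database matches.</p>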
         <p>The nonparametric model proposed in this paper is easy to use. First, a combined database containing the normal and randomized protein sequences is prepared. A database search is then performed on the combined database and the results are collected; the normal and randomized database matches are labeled with the assistance of the references provided by the database search software. The randomized database matches are then used to build the nonparametric model. In this step, a parameter set different from that described here can be used. To obtain the final results, a search for the DF described in the "Nonparametric model and filter boundary" section is performed for a given expected FPR. The workflow shown in Figure <figr fid="F7">7</figr> (Methods section) has been implemented with several Matlab (MathWorks, Natick, MA) scripts and in-house C++ programs. The database search results, which were stored in the *.out files produced by SEQUEST, were collected using an in-house program called OutSum.exe. The resulting data, stored in a plain-text file, were loaded into a Matlab workspace. A script called NoParQ.m was used to build the nonparametric model. The programs used in this paper are provided in a compressed archive [see Additional file <supplr sid="S2">2</supplr>].</p>
         <suppl id="S2">
            <title>
               <p>Additional file 2</p>
            </title>
            <text>
               <p>Program package. This file contains all the programs used in this work, including the Microsoft Windows executable (EXE) files and the Matlab script (M) files. A readme file in the package explains how to use these programs.</p>
            </text>
            <file name="1471-2105-9-29-S2.RAR">
               <p>Click here for file</p>
            </file>
         </suppl>
         <fig id="F7">
            <title>
               <p>Figure 7</p>
            </title>
            <caption>
               <p>Illustration of the workflow</p>
            </caption>
            <text>
               <p>Illustration of the workflow. The workflow is based on the nonparametric model and the randomized database method. First, the randomized database is constructed and merged with the normal database. Then a database search is performed using SEQUEST. Peptide matches from the randomized database are used to build the mixed Gaussian model. Filter boundaries are determined based on the mixed Gaussian model and the expected FPR, and the normal database matches are filtered. During construction of the nonparametric model, k-means clustering is used to initialize the parameters of the EM algorithm. The red points in the left rectangle are the cluster centers on the <it>Xcorr</it>-&#916;<it>Cn </it>plane. The red points in the right rectangle denote the matches from the normal database and the blue points are matches from the randomized database.</p>
            </text>
            <graphic file="1471-2105-9-29-7"/>
         </fig>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>In this paper, we provide a framework for validation of peptide identification in shotgun proteomics that is based on the randomized database method and a nonparametric model. The practical problems in implementing the nonparametric model were investigated, and its performance was found to be better than that of traditional methods. The nonparametric model can provide a more flexible and accurate solution for DF determination for quality control of large datasets in shotgun proteomics research. All the programs used in this work are available by request from the authors.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Datasets and database search</p>
            </st>
            <p>Six datasets generated by three kinds of mass spectrometry platforms (LCQ, LTQ and LTQ/FT) were used to demonstrate the performance of the nonparametric model. Three control datasets were used to validate the accuracy of the FPR estimation and the improvement of the sensitivity. Since the MS/MS datasets generated by the shotgun technique are always large, we also verified the generality of the nonparametric model on the large real sample datasets. The basic information about the six datasets is listed in Table <tblr tid="T6">6</tblr>.</p>
            <tbl id="T6">
               <title>
                  <p>Table 6</p>
               </title>
               <caption>
                  <p>The 6 datasets used in this paper.</p>
               </caption>
               <tblbdy cols="7">
                  <r>
                     <c ca="center">
                        <p>
                           <it>Dataset type</it>
                        </p>
                     </c>
                     <c cspan="3" ca="center">
                        <p>
                           <it>Control dataset</it>
                        </p>
                     </c>
                     <c cspan="3" ca="center">
                        <p>
                           <it>Real sample dataset</it>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Dataset Name</p>
                     </c>
                     <c ca="center">
                        <p>D1</p>
                     </c>
                     <c ca="center">
                        <p>D2</p>
                     </c>
                     <c ca="center">
                        <p>D3</p>
                     </c>
                     <c ca="center">
                        <p>D4</p>
                     </c>
                     <c ca="center">
                        <p>D5</p>
                     </c>
                     <c ca="center">
                        <p>D6</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Instrument</p>
                     </c>
                     <c ca="center">
                        <p>LCQ</p>
                     </c>
                     <c ca="center">
                        <p>LTQ</p>
                     </c>
                     <c ca="center">
                        <p>LTQ/FT</p>
                     </c>
                     <c ca="center">
                        <p>LCQ</p>
                     </c>
                     <c ca="center">
                        <p>LTQ</p>
                     </c>
                     <c ca="center">
                        <p>LTQ/FT</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Reference or notes</p>
                     </c>
                     <c ca="center">
                        <p>[46]</p>
                     </c>
                     <c ca="center">
                        <p>[55]</p>
                     </c>
                     <c ca="center">
                        <p>unpublished</p>
                     </c>
                     <c ca="center">
                        <p>[44]</p>
                     </c>
                     <c ca="center">
                        <p>[56]</p>
                     </c>
                     <c ca="center">
                        <p>unpublished</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Sample</p>
                     </c>
                     <c ca="center">
                        <p>12 purified proteins + 23 peptides</p>
                     </c>
                     <c ca="center">
                        <p>49 purified human proteins</p>
                     </c>
                     <c ca="center">
                        <p>8 purified proteins</p>
                     </c>
                     <c ca="center">
                        <p>Human K562 cell line</p>
                     </c>
                     <c ca="center">
                        <p>Human liver</p>
                     </c>
                     <c ca="center">
                        <p>Human liver</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Data source</p>
                     </c>
                     <c ca="center">
                        <p>the BIATECH Institute (Bothell, WA 98011, USA)</p>
                     </c>
                     <c ca="center">
                        <p>Proteomics Standards Research Group (sPRG) [55]</p>
                     </c>
                     <c ca="center">
                        <p>Beijing Proteome Research Center (Beijing 102206, China)</p>
                     </c>
                     <c ca="center">
                        <p>Open Proteomics Database (OPD)[57]</p>
                     </c>
                     <c ca="center">
                        <p>Beijing Proteome Research Center (Beijing 102206, China)</p>
                     </c>
                     <c ca="center">
                        <p>Beijing Proteome Research Center (Beijing 102206, China)</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <p>The two unpublished LTQ/FT datasets were provided by the Beijing Proteome Research Center (BPRC). The samples were digested with trypsin and then analyzed on a 7-Tesla LTQ/FT mass spectrometer (Thermo Electron, San Jose, CA) coupled with an Agilent 1100 nano-flow liquid chromatography system. Reverse-phase C18 trap columns (300 <it>&#956;</it>m internal diameter &#215; 5 mm long) were connected to the 6-port column-switching valve for on-line desalting. A PicoFrit&#8482; tip column (BioBasic C18, 5 <it>&#956;</it>m particle size, 75 <it>&#956;</it>m internal diameter &#215; 10 cm long, 15 <it>&#956;</it>m internal diameter at the spray tip, New Objective, Woburn, MA, USA) was used for the subsequent separation. The mobile phases were solvent A (Milli-Q water, 2% acetonitrile and 0.1% FA, v/v/v) and solvent B (Milli-Q water, 80% acetonitrile and 0.1% FA, v/v/v). The gradient was 15&#8211;40% B in 40 min, then 40&#8211;100% B in 10 min. One FT full MS scan was followed by 5 data-dependent LTQ MS/MS scans on the five most intense ions. The dynamic exclusion time was 45 seconds. Ions were accumulated in the linear ion trap under automatic gain control (AGC). The AGC values were 5 &#215; 10<sup>5 </sup>charges for the FT full MS scan and 1 &#215; 10<sup>4 </sup>charges for the LTQ MS/MS scan. The resolution was 10,000 for the FT full MS scan at m/z 400. The temperature of the ion transfer tube was set at 200&#176;C and the spray voltage was 1.8 kV. The isolation width was 4 Da and the normalized collision energy was 35% for the MS/MS scan. Mass spectra were acquired over the m/z range from 400 to 2000.</p>
            <p>All the MS/MS spectra were extracted from the *.raw files by Extract_MSn.exe, a console program in Bioworks 3.2 (Thermo Finnigan, San Jose, CA). For the LCQ datasets, the minimal total ion intensity was 10,000. For the LTQ and LTQ/FT datasets, the total ion intensity of each MS/MS spectrum was required to exceed 100. For all the datasets, each spectrum had to contain at least 20 ions. The database search was then performed on a local TurboSEQUEST (version 2.7) server. The fixed modification of oxidation (15.99 Da) on the Met residue and the variable modification of carboxyamidomethylation (57.02 Da) on the Cys residue were set. The enzyme was trypsin and the maximum number of allowed missed cleavage sites was 2. Only the b and y ions were taken into account. For the LCQ and LTQ datasets, the precursor mass error tolerance was 3.0 Da, and for the LTQ/FT datasets, it was 15 ppm.</p>
            <p>For all the datasets except D2, which was searched against the database published by sPRG <abbrgrp><abbr bid="B55">55</abbr></abbrgrp>, the searched databases were derived from IPI Human 3.19 <abbrgrp><abbr bid="B60">60</abbr></abbrgrp>. For the control datasets, the control sequences for datasets D1 and D3 [see Additional files <supplr sid="S3">3</supplr>, <supplr sid="S4">4</supplr> and <supplr sid="S5">5</supplr>], which include the sequences of the purified proteins or peptides plus typical sample contaminants such as keratin and trypsin, were added to IPI Human 3.19. The control sequences for D2 were determined according to the sPRG report [see Additional file <supplr sid="S4">4</supplr>] <abbrgrp><abbr bid="B55">55</abbr></abbrgrp>. The searched databases were constructed using the method proposed in one of our previous papers <abbrgrp><abbr bid="B58">58</abbr></abbrgrp>, which can be described as follows: the protein sequences in the normal database were digested <it>in silico </it>(trypsin), and the amino acid residues (AAR) of the resulting peptides, except the C-terminal residue, were reshuffled using a random number generator. The reshuffled peptides were then spliced to form new protein sequences in the randomized database. Finally, the normal database and the randomized database were merged to form the searched database.</p>
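            <p>The construction of a randomized protein sequence can be sketched as follows. This Python fragment is a minimal illustration under stated assumptions, not the program used in this work: the function names are hypothetical, cleavage is assumed to occur after every K or R (no proline exception and no missed cleavages), and the example protein string is invented.</p>
            <p><![CDATA[
```python
import random


def digest_trypsin(protein):
    """Split a protein into peptides, cleaving after every K or R.

    Assumption of this sketch: no proline exception, no missed cleavages.
    """
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start != len(protein):
        peptides.append(protein[start:])  # trailing peptide without K/R
    return peptides


def shuffle_peptide(peptide, rng):
    """Reshuffle all residues except the C-terminal one."""
    body = list(peptide[:-1])
    rng.shuffle(body)
    return "".join(body) + peptide[-1]


def randomize_protein(protein, rng):
    """Digest, reshuffle each peptide, and splice the peptides back together."""
    return "".join(shuffle_peptide(p, rng) for p in digest_trypsin(protein))


rng = random.Random(1)  # fixed seed, only for reproducibility of the sketch
normal = "MKTAYIAKQRQISFVK"  # invented sequence
randomized = randomize_protein(normal, rng)
print(normal, randomized)
```
]]></p>
            <p>Because the C-terminal residue of each peptide stays in place, the randomized protein keeps the same tryptic cleavage sites, peptide lengths and amino acid composition as the normal protein.</p>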
            <suppl id="S3">
               <title>
                  <p>Additional file 3</p>
               </title>
               <text>
                  <p>The control sequences of the LCQ control dataset. This file includes the control sequences for the LCQ control dataset, which include the sequences of control proteins and the common contaminants. The file was compressed as RAR archive to reduce the size.</p>
               </text>
               <file name="1471-2105-9-29-S3.RAR">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S4">
               <title>
                  <p>Additional file 4</p>
               </title>
               <text>
                  <p>The control sequences of the LTQ control dataset. This file includes the control sequences for the LTQ control dataset, which include the sequences of control proteins and the common contaminants. The file was compressed as RAR archive to reduce the size.</p>
               </text>
               <file name="1471-2105-9-29-S4.RAR">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S5">
               <title>
                  <p>Additional file 5</p>
               </title>
               <text>
                  <p>The control sequences of the LTQ/FT control dataset. This file includes the control sequences for the LTQ/FT control dataset, which include the sequences of control proteins and the common contaminants.</p>
               </text>
               <file name="1471-2105-9-29-S5.RAR">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <p>After database searching, the matches with +1, +2 and +3 charge states were extracted (Table <tblr tid="T7">7</tblr>). For each spectrum, only the first-rank match with an assigned peptide of more than 5 AAR was taken into account for further analysis. For the control datasets, the matches that were assigned peptides of control sequences were validated by the following criteria: 1) the b-ion or y-ion series should confirm at least 3 consecutive amino acids of the assigned peptide sequence <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>; 2) ranked preliminary score (<it>RSp</it>) &#8804; 50. The confirmed matches of the control datasets are provided in the supplementary materials [see Additional files <supplr sid="S6">6</supplr>, <supplr sid="S7">7</supplr> and <supplr sid="S8">8</supplr>].</p>
            <suppl id="S6">
               <title>
                  <p>Additional file 6</p>
               </title>
               <text>
                  <p>Validated matches in the LCQ control dataset. This file contains the validated correct matches for the LCQ control dataset. The file was compressed as RAR archive to reduce the size.</p>
               </text>
               <file name="1471-2105-9-29-S6.RAR">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S7">
               <title>
                  <p>Additional file 7</p>
               </title>
               <text>
                  <p>Validated matches in the LTQ control dataset. This file contains the validated correct matches for the LTQ control dataset. The file was compressed as RAR archive to reduce the size.</p>
               </text>
               <file name="1471-2105-9-29-S7.RAR">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S8">
               <title>
                  <p>Additional file 8</p>
               </title>
               <text>
                  <p>Validated matches in the LTQ/FT control dataset. This file contains the validated correct matches for the LTQ/FT control dataset. The file was compressed as RAR archive to reduce the size.</p>
               </text>
               <file name="1471-2105-9-29-S8.RAR">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <tbl id="T7">
               <title>
                  <p>Table 7</p>
               </title>
               <caption>
                  <p>Database search results of the 6 datasets</p>
               </caption>
               <tblbdy cols="8">
                  <r>
                     <c cspan="2" ca="center">
                        <p>
                           <it>Datasets</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>D1</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>D2</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>D3</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>D4</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>D5</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>D6</it>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="8">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Database search results</p>
                     </c>
                     <c ca="center">
                        <p>+1</p>
                     </c>
                     <c ca="center">
                        <p>467</p>
                     </c>
                     <c ca="center">
                        <p>3,039</p>
                     </c>
                     <c ca="center">
                        <p>1,544</p>
                     </c>
                     <c ca="center">
                        <p>24,875</p>
                     </c>
                     <c ca="center">
                        <p>61,574</p>
                     </c>
                     <c ca="center">
                        <p>36,610</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>+2</p>
                     </c>
                     <c ca="center">
                        <p>3,687</p>
                     </c>
                     <c ca="center">
                        <p>28,130</p>
                     </c>
                     <c ca="center">
                        <p>6,028</p>
                     </c>
                     <c ca="center">
                        <p>63,272</p>
                     </c>
                     <c ca="center">
                        <p>754,401</p>
                     </c>
                     <c ca="center">
                        <p>557,994</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>+3</p>
                     </c>
                     <c ca="center">
                        <p>3,654</p>
                     </c>
                     <c ca="center">
                        <p>28,943</p>
                     </c>
                     <c ca="center">
                        <p>2,579</p>
                     </c>
                     <c ca="center">
                        <p>63,027</p>
                     </c>
                     <c ca="center">
                        <p>776,794</p>
                     </c>
                     <c ca="center">
                        <p>492,950</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>The workflow of the nonparametric model based method</p>
            </st>
            <p>The workflow of the nonparametric model based method is shown in Figure <figr fid="F7">7</figr>. First, a randomized database was constructed by randomizing the tryptic peptide sequences. The MS/MS spectra were then searched against the combined database using SEQUEST. Next, matches with an assigned peptide from the randomized database (we call them randomized database matches, RDM) were used to build the nonparametric model. The joint distribution of selected parameters (such as <it>Xcorr</it>, &#916;<it>Cn </it>and <it>Sim </it><abbrgrp><abbr bid="B31">31</abbr><abbr bid="B45">45</abbr></abbrgrp>) of random matches was fit with the nonparametric model using the FnPDFe method, and the contour lines of the estimated PDF, which are complex nonlinear functions, were used as candidate DFs. The DFs actually used were determined according to the expected FPR and formula 2 for the different charge states. Finally, the resulting DFs were used to filter the matches from the normal database. In the model-building step, k-means clustering was used to initialize the EM algorithm procedure.</p>
         </sec>
         <sec>
            <st>
               <p>Initializing the nonparametric model with k-means clustering</p>
            </st>
            <p>K-means clustering <abbrgrp><abbr bid="B59">59</abbr></abbrgrp> is commonly used to partition observations into different groups according to a defined distance measure (such as Euclidean distance). The optimization goal of k-means clustering is to find a partition in which objects within each cluster are as close as possible to each other and as far as possible from objects in other clusters. In practice, however, the scale of each feature significantly affects the clustering results when Euclidean distance is used. In our application, <it>Xcorr </it>and &#916;<it>Cn </it>were the two main features. <it>Xcorr </it>is a floating-point value whose typical value is 2.5 but may be larger than 10; &#916;<it>Cn </it>is in the range [0, 1]. When the observed values are used directly in the k-means clustering, <it>Xcorr </it>dominates the partition results (Figure <figr fid="F8">8</figr>) because the distance (formula 5) between two observations (<it>Xcorr</it><sub><it>i</it></sub>, &#916;<it>Cn</it><sub><it>i</it></sub>), <it>i </it>= 1, 2, is mainly determined by <it>Xcorr</it>, which has the larger scale.</p>
            <fig id="F8">
               <title>
                  <p>Figure 8</p>
               </title>
               <caption>
                  <p>The partitions of k-means clustering before (A) and after (B) normalization (z-score) of the features</p>
               </caption>
               <text>
                  <p>The partitions of k-means clustering before (A) and after (B) normalization (z-score) of the features. Blue and red points represent different clusters. The observations derive from the control dataset. Records with larger <it>Xcorr </it>and &#916;<it>Cn </it>are more likely to be positive results. The partition given by k-means clustering using the observed values is based on <it>Xcorr</it>; &#916;<it>Cn </it>has no effect. After normalization, the partition is more consistent with the empirical knowledge.</p>
               </text>
               <graphic file="1471-2105-9-29-8"/>
            </fig>
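            <p>The scale effect shown in Figure 8 can be reproduced numerically with a small Python sketch. The four (Xcorr, DeltaCn) pairs below are invented for illustration, the function names are hypothetical, and the z-score is assumed to use the sample standard deviation; none of this is taken from the programs used in this work.</p>
            <p><![CDATA[
```python
import math
import statistics


def euclidean(a, b):
    """Euclidean distance between two observations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def zscore_columns(rows):
    """Z-score each feature: subtract its mean, divide by its sample standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.stdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stdevs)) for row in rows]


def squared_shares(a, b):
    """Fraction of the squared distance between a and b contributed by each feature."""
    squares = [(x - y) ** 2 for x, y in zip(a, b)]
    total = sum(squares)
    return [s / total for s in squares]


# Invented (Xcorr, DeltaCn) observations, for illustration only.
matches = [(1.2, 0.05), (2.5, 0.40), (4.8, 0.10), (3.1, 0.35)]
scaled = zscore_columns(matches)

raw_shares = squared_shares(matches[0], matches[1])
scaled_shares = squared_shares(scaled[0], scaled[1])
print("Xcorr share of squared distance, raw values: %.2f" % raw_shares[0])
print("Xcorr share of squared distance, z-scores:   %.2f" % scaled_shares[0])
```
]]></p>
            <p>For these invented values, Xcorr accounts for most of the squared distance before normalization and for a minority of it afterwards, which is the behaviour the normalization is meant to achieve.</p>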
            <p>
               <display-formula id="M5">
                  <m:math name="1471-2105-9-29-i8" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>d</m:mi>
                           <m:mo>=</m:mo>
                           <m:msqrt>
                              <m:mrow>
                                 <m:msup>
                                    <m:mrow>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mi>X</m:mi>
                                       <m:mi>c</m:mi>
                                       <m:mi>o</m:mi>
                                       <m:mi>r</m:mi>
                                       <m:msub>
                                          <m:mi>r</m:mi>
                                          <m:mn>1</m:mn>
                                       </m:msub>
                                       <m:mo>&#8722;</m:mo>
                                       <m:mi>X</m:mi>
                                       <m:mi>c</m:mi>
                                       <m:mi>o</m:mi>
                                       <m:mi>r</m:mi>
                                       <m:msub>
                                          <m:mi>r</m:mi>
                                          <m:mn>2</m:mn>
                                       </m:msub>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                    <m:mn>2</m:mn>
                                 </m:msup>
                                 <m:mo>+</m:mo>
                                 <m:msup>
                                    <m:mrow>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mi>&#916;</m:mi>
                                       <m:mi>C</m:mi>
                                       <m:msub>
                                          <m:mi>n</m:mi>
                                          <m:mn>1</m:mn>
                                       </m:msub>
                                       <m:mo>&#8722;</m:mo>
                                       <m:mi>&#916;</m:mi>
                                       <m:mi>C</m:mi>
                                       <m:msub>
                                          <m:mi>n</m:mi>
                                          <m:mn>2</m:mn>
                                       </m:msub>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                    <m:mn>2</m:mn>
                                 </m:msup>
                              </m:mrow>
                           </m:msqrt>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemizaqMaeyypa0ZaaOaaaeaacqGGOaakcqWGybawcqWGJbWycqWGVbWBcqWGYbGCcqWGYbGCdaWgaaWcbaGaeGymaedabeaakiabgkHiTiabdIfayjabdogaJjabd+gaVjabdkhaYjabdkhaYnaaBaaaleaacqaIYaGmaeqaaOGaeiykaKYaaWbaaSqabeaacqaIYaGmaaGccqGHRaWkcqGGOaakcqqHuoarcqWGdbWqcqWGUbGBdaWgaaWcbaGaeGymaedabeaakiabgkHiTiabfs5aejabdoeadjabd6gaUnaaBaaaleaacqaIYaGmaeqaaOGaeiykaKYaaWbaaSqabeaacqaIYaGmaaaabeaaaaa@50D2@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>We therefore added a normalization step that converts the observed values of each feature to z-scores. This eliminates the scale difference and yields a more reasonable partition (Figure <figr fid="F8">8</figr>).</p>
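            <p>The effect of the z-score step on the distance in formula 5 can be illustrated with a minimal sketch. The feature values below are made up for illustration and are not taken from the control dataset; the helper names (zscore, euclidean, xcorr_share) are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator was used.</p>

```python
import math
from statistics import mean, pstdev


def zscore(values):
    """Return z-scores of a list of numbers (population standard deviation assumed)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors (formula 5)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def xcorr_share(a, b):
    """Fraction of the squared distance between a and b contributed by the first feature."""
    dx2 = (a[0] - b[0]) ** 2
    dc2 = (a[1] - b[1]) ** 2
    return dx2 / (dx2 + dc2)


# Made-up (Xcorr, DeltaCn) values, for illustration only.
xcorr = [1.2, 2.5, 3.1, 4.8, 10.5]
delta_cn = [0.05, 0.10, 0.35, 0.40, 0.60]

raw = list(zip(xcorr, delta_cn))
scaled = list(zip(zscore(xcorr), zscore(delta_cn)))

# Compare how much Xcorr contributes to the distance between the same two
# observations before and after normalization.
share_raw = xcorr_share(raw[1], raw[2])
share_scaled = xcorr_share(scaled[1], scaled[2])
print("distance on raw values:", euclidean(raw[1], raw[2]))
print("distance on z-scores:", euclidean(scaled[1], scaled[2]))
print("Xcorr share of squared distance (raw, scaled):", share_raw, share_scaled)
```

            <p>For this pair of made-up observations, most of the raw squared distance comes from the Xcorr term, whereas after normalization the DeltaCn term is no longer negligible. This is the behaviour the normalization step is meant to produce.</p>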
         </sec>
         <sec>
            <st>
               <p>Nonparametric model and the EM algorithm</p>
            </st>
            <p>The basic objective of nonparametric density estimation is to approximate the distribution of observations using a weighted sum of simple functions, emphasizing the accuracy of the approximation rather than the physical meaning of the parameters. This idea can be implemented by fitting the histogram directly with smoothing splines or a radial basis function (RBF) neural network <abbrgrp><abbr bid="B41">41</abbr></abbrgrp>. Another way to implement a nonparametric model is to fit the distribution with kernel density functions. The optimization goal of the nonparametric model is to minimize the mean integrated squared error of the fit or to maximize the likelihood function of the observations. Many nonparametric models have been proposed <abbrgrp><abbr bid="B41">41</abbr></abbrgrp>. The FnPDFe procedure <abbrgrp><abbr bid="B42">42</abbr></abbrgrp> is attractive because it is easy to implement and has a clear statistical interpretation. Let <it>X </it>be a <it>d</it>-dimensional random vector, <it>X </it>&#8712; <it>R</it><sup><it>d</it></sup>. Its PDF can be approximated by a Gaussian mixture model, defined as a linear combination of <it>N </it>multivariate Gaussian density functions (MGDFs):</p>
            <p>
               <display-formula id="M6">
                  <m:math name="1471-2105-9-29-i9" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>f</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>X</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mstyle displaystyle="true">
                              <m:munderover>
                                 <m:mo>&#8721;</m:mo>
                                 <m:mrow>
                                    <m:mi>i</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mn>1</m:mn>
                                 </m:mrow>
                                 <m:mi>N</m:mi>
                              </m:munderover>
                              <m:mrow>
                                 <m:mi>P</m:mi>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mi>i</m:mi>
                                 <m:mo stretchy="false">)</m:mo>
                                 <m:msub>
                                    <m:mi>f</m:mi>
                                    <m:mi>G</m:mi>
                                 </m:msub>
                              </m:mrow>
                           </m:mstyle>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>X</m:mi>
                           <m:mo>|</m:mo>
                           <m:mi>i</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemOzayMaeiikaGIaemiwaGLaeiykaKIaeyypa0ZaaabCaeaacqWGqbaucqGGOaakcqWGPbqAcqGGPaqkcqWGMbGzdaWgaaWcbaGaem4raCeabeaaaeaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGobGta0GaeyyeIuoakiabcIcaOiabdIfayjabcYha8jabdMgaPjabcMcaPaaa@44B2@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>where:</p>
            <p>
               <display-formula id="M7">
                  <m:math name="1471-2105-9-29-i10" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:msub>
                              <m:mi>f</m:mi>
                              <m:mi>G</m:mi>
                           </m:msub>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>X</m:mi>
                           <m:mo>|</m:mo>
                           <m:mi>i</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mn>1</m:mn>
                              <m:mrow>
                                 <m:msup>
                                    <m:mrow>
                                       <m:mrow>
                                          <m:mo>(</m:mo>
                                          <m:mrow>
                                             <m:mn>2</m:mn>
                                             <m:mi>&#960;</m:mi>
                                          </m:mrow>
                                          <m:mo>)</m:mo>
                                       </m:mrow>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mi>d</m:mi>
                                       <m:mo>/</m:mo>
                                       <m:mn>2</m:mn>
                                    </m:mrow>
                                 </m:msup>
                                 <m:msup>
                                    <m:mrow>
                                       <m:mrow>
                                          <m:mo>|</m:mo>
                                          <m:mrow>
                                             <m:msub>
                                                <m:mi>&#931;</m:mi>
                                                <m:mi>i</m:mi>
                                             </m:msub>
                                          </m:mrow>
                                          <m:mo>|</m:mo>
                                       </m:mrow>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mn>1</m:mn>
                                       <m:mo>/</m:mo>
                                       <m:mn>2</m:mn>
                                    </m:mrow>
                                 </m:msup>
                              </m:mrow>
                           </m:mfrac>
                           <m:msup>
                              <m:mi>e</m:mi>
                              <m:mrow>
                                 <m:mo>&#8722;</m:mo>
                                 <m:mfrac>
                                    <m:mn>1</m:mn>
                                    <m:mn>2</m:mn>
                                 </m:mfrac>
                                 <m:msup>
                                    <m:mrow>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mi>X</m:mi>
                                       <m:mo>&#8722;</m:mo>
                                       <m:msub>
                                          <m:mi>&#956;</m:mi>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                    <m:mi>T</m:mi>
                                 </m:msup>
                                 <m:msubsup>
                                    <m:mi>&#931;</m:mi>
                                    <m:mi>i</m:mi>
                                    <m:mrow>
                                       <m:mo>&#8722;</m:mo>
                                       <m:mn>1</m:mn>
                                    </m:mrow>
                                 </m:msubsup>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mi>X</m:mi>
                                 <m:mo>&#8722;</m:mo>
                                 <m:msub>
                                    <m:mi>&#956;</m:mi>
                                    <m:mi>i</m:mi>
                                 </m:msub>
                                 <m:mo stretchy="false">)</m:mo>
                              </m:mrow>
                           </m:msup>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemOzay2aaSbaaSqaaiabdEeahbqabaGccqGGOaakcqWGybawcqGG8baFcqWGPbqAcqGGPaqkcqGH9aqpjuaGdaWcaaqaaiabigdaXaqaamaabmaabaGaeGOmaidcciGae8hWdahacaGLOaGaayzkaaWaaWbaaeqabaGaemizaqMaei4la8IaeGOmaidaamaaemaabaGaeu4Odm1aaSbaaeaacqWGPbqAaeqaaaGaay5bSlaawIa7amaaCaaabeqaaiabigdaXiabc+caViabikdaYaaaaaGccqWGLbqzdaahaaWcbeqaaiabgkHiTKqbaoaalaaabaGaeGymaedabaGaeGOmaidaaSGaeiikaGIaemiwaGLaeyOeI0Iae8hVd02aaSbaaWqaaiabdMgaPbqabaWccqGGPaqkdaahaaadbeqaaiabdsfaubaaliabfo6atnaaDaaameaacqWGPbqAaeaacqGHsislcqaIXaqmaaWccqGGOaakcqWGybawcqGHsislcqWF8oqBdaWgaaadbaGaemyAaKgabeaaliabcMcaPaaaaaa@614C@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>and <it>P</it>(<it>i</it>),<it>i </it>= 1,...<it>N </it>satisfies: (1) 0 &lt;<it>P</it>(<it>i</it>) &#8804; 1; (2) <inline-formula><m:math name="1471-2105-9-29-i11" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mstyle displaystyle="true"><m:munderover><m:mo>&#8721;</m:mo><m:mrow><m:mi>i</m:mi><m:mo>=</m:mo><m:mn>1</m:mn></m:mrow><m:mi>N</m:mi></m:munderover><m:mrow><m:mi>P</m:mi><m:mo stretchy="false">(</m:mo><m:mi>i</m:mi><m:mo stretchy="false">)</m:mo><m:mo>=</m:mo><m:mn>1</m:mn></m:mrow></m:mstyle></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaWaaabCaeaacqWGqbaucqGGOaakcqWGPbqAcqGGPaqkcqGH9aqpcqaIXaqmaSqaaiabdMgaPjabg2da9iabigdaXaqaaiabd6eaobqdcqGHris5aaaa@38B7@</m:annotation></m:semantics></m:math></inline-formula>. <it>&#956;</it><sub><it>i</it></sub> and &#931;<sub><it>i </it></sub>are the mean vector and covariance matrix of the <it>i</it>-th MGDF, respectively.</p>
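            <p>As a minimal sketch of formulas 6 and 7, the mixture density can be evaluated as shown here. The sketch is restricted to d = 2 so that the determinant and inverse of the covariance matrix can be written out by hand. The function names and the component parameters are illustrative assumptions, not the FnPDFe implementation or fitted values.</p>

```python
import math


def gaussian_pdf_2d(x, mu, cov):
    """Bivariate Gaussian density f_G(x | i) with mean mu and 2x2 covariance cov."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mu[0], x[1] - mu[1])
    # Quadratic form (x - mu)^T Sigma^-1 (x - mu).
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    dim = 2
    norm = (2.0 * math.pi) ** (dim / 2) * math.sqrt(det)
    return math.exp(-0.5 * q) / norm


def mixture_pdf(x, weights, means, covs):
    """Weighted sum of component densities: f(x) = sum_i P(i) * f_G(x | i)."""
    return sum(p * gaussian_pdf_2d(x, m, s)
               for p, m, s in zip(weights, means, covs))


# Illustrative parameters only (N = 2 components); the weights sum to 1.
weights = [0.7, 0.3]
means = [(0.0, 0.0), (3.0, 3.0)]
covs = [((1.0, 0.0), (0.0, 1.0)), ((1.0, 0.5), (0.5, 1.0))]

density_at_origin = mixture_pdf((0.0, 0.0), weights, means, covs)
print(density_at_origin)
```

            <p>Each covariance matrix must be symmetric and positive definite for the determinant and inverse used above to be meaningful.</p>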
            <p>Consider a set of independent and identically distributed observations {<it>x</it><sub>1</sub>, <it>x</it><sub>2</sub>, ..., <it>x</it><sub><it>n</it></sub>}; the log-likelihood function of the mixture model is:</p>
            <p>
               <display-formula id="M8">
                  <m:math name="1471-2105-9-29-i12" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>L</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>&#952;</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mstyle displaystyle="true">
                              <m:munderover>
                                 <m:mo>&#8721;</m:mo>
                                 <m:mrow>
                                    <m:mi>k</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mn>1</m:mn>
                                 </m:mrow>
                                 <m:mi>n</m:mi>
                              </m:munderover>
                              <m:mrow>
                                 <m:mi>ln</m:mi>
                                 <m:mo>&#8289;</m:mo>
                                 <m:mi>f</m:mi>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:msub>
                                    <m:mi>x</m:mi>
                                    <m:mi>k</m:mi>
                                 </m:msub>
                                 <m:mo stretchy="false">)</m:mo>
                              </m:mrow>
                           </m:mstyle>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemitaWKaeiikaGccciGae8hUdeNaeiykaKIaeyypa0ZaaabCaeaacyGGSbaBcqGGUbGBcqWGMbGzcqGGOaakcqWG4baEdaWgaaWcbaGaem4AaSgabeaakiabcMcaPaWcbaGaem4AaSMaeyypa0JaeGymaedabaGaemOBa4ganiabggHiLdaaaa@418E@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
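            <p>The log-likelihood in formula 8 can be sketched as follows. To keep the example short it assumes one-dimensional observations (d = 1), so each component is described by a mean and a variance. The data and both parameter sets are made up for illustration.</p>

```python
import math


def gaussian_pdf_1d(x, mu, var):
    """Univariate Gaussian density with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    """L(theta) = sum over observations x_k of ln f(x_k), with f the mixture density."""
    total = 0.0
    for x in data:
        fx = sum(p * gaussian_pdf_1d(x, m, v)
                 for p, m, v in zip(weights, means, variances))
        total += math.log(fx)
    return total


# Made-up observations forming two groups, near -2 and near +2.
data = [-2.1, -1.9, -2.0, 1.8, 2.2, 2.0]

# Parameters that place one component on each group ...
ll_matched = log_likelihood(data, [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0])
# ... versus parameters that place both components at zero.
ll_mismatched = log_likelihood(data, [0.5, 0.5], [0.0, 0.0], [1.0, 1.0])
print(ll_matched, ll_mismatched)
```

            <p>Parameters that describe the data better give a larger log-likelihood, which is the quantity that maximum likelihood estimation seeks to maximize. For observations far from every component the mixture density can underflow to zero, so a practical implementation would typically sum in log space.</p>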
            <p>In general, MLE can be used to infer the parameters <it>&#952; </it>of the mixture model. However, the resulting MLE equations cannot be solved analytically. The FnPDFe method therefore uses the EM algorithm to solve for these parameters iteratively <abbrgrp><abbr bid="B43">43</abbr></abbrgrp>, as follows:</p>
            <p>(1) Initialization: initialize the parameters <it>&#956;</it><sub><it>i</it></sub>, &#931;<sub><it>i</it></sub>, and <it>P</it>(<it>i</it>) using heuristic knowledge or random values.</p>
            <p>(2) E-step: update the posterior distributions:</p>
            <p>
               <display-formula id="M9">
                  <m:math name="1471-2105-9-29-i13" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:msup>
                              <m:mi>g</m:mi>
                              <m:mrow>
                                 <m:mi>t</m:mi>
                                 <m:mo>+</m:mo>
                                 <m:mn>1</m:mn>
                              </m:mrow>
                           </m:msup>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>i</m:mi>
                           <m:mo>|</m:mo>
                           <m:msub>
                              <m:mi>x</m:mi>
                              <m:mi>k</m:mi>
                           </m:msub>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:msubsup>
                                    <m:mi>f</m:mi>
                                    <m:mi>G</m:mi>
                                    <m:mi>t</m:mi>
                                 </m:msubsup>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:msub>
                                    <m:mi>x</m:mi>
                                    <m:mi>k</m:mi>
                                 </m:msub>
                                 <m:mo>|</m:mo>
                                 <m:mi>i</m:mi>
                                 <m:mo stretchy="false">)</m:mo>
                                 <m:msup>
                                    <m:mi>P</m:mi>
                                    <m:mi>t</m:mi>
                                 </m:msup>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mi>i</m:mi>
                                 <m:mo stretchy="false">)</m:mo>
                              </m:mrow>
                              <m:mrow>
                                 <m:mstyle displaystyle="true">
                                    <m:munderover>
                                       <m:mo>&#8721;</m:mo>
                                       <m:mrow>
                                          <m:mi>j</m:mi>
                                          <m:mo>=</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                       <m:mi>N</m:mi>
                                    </m:munderover>
                                    <m:mrow>
                                       <m:msubsup>
                                          <m:mi>f</m:mi>
                                          <m:mi>G</m:mi>
                                          <m:mi>t</m:mi>
                                       </m:msubsup>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:msub>
                                          <m:mi>x</m:mi>
                                          <m:mi>k</m:mi>
                                       </m:msub>
                                       <m:mo>|</m:mo>
                                       <m:mi>j</m:mi>
                                       <m:mo stretchy="false">)</m:mo>
                                       <m:msup>
                                          <m:mi>P</m:mi>
                                          <m:mi>t</m:mi>
                                       </m:msup>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mi>j</m:mi>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:mstyle>
                              </m:mrow>
                           </m:mfrac>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaem4zaC2aaWbaaSqabeaacqWG0baDcqGHRaWkcqaIXaqmaaGccqGGOaakcqWGPbqAcqGG8baFcqWG4baEdaWgaaWcbaGaem4AaSgabeaakiabcMcaPiabg2da9KqbaoaalaaabaGaemOzay2aa0baaeaacqWGhbWraeaacqWG0baDaaGaeiikaGIaemiEaG3aaSbaaeaacqWGRbWAaeqaaiabcYha8jabdMgaPjabcMcaPiabdcfaqnaaCaaabeqaaiabdsha0baacqGGOaakcqWGPbqAcqGGPaqkaeaadaaeWbqaaiabdAgaMnaaDaaabaGaem4raCeabaGaemiDaqhaaiabcIcaOiabdIha4naaBaaabaGaem4AaSgabeaacqGG8baFcqWGQbGAcqGGPaqkcqWGqbaudaahaaqabeaacqWG0baDaaGaeiikaGIaemOAaOMaeiykaKcabaGaemOAaOMaeyypa0JaeGymaedabaGaemOta4eacqGHris5aaaaaaa@6373@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>(3) M-step: update the parameter estimates:</p>
            <p>
               <display-formula id="M10">
                  <m:math name="1471-2105-9-29-i14" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:msup>
                              <m:mi>P</m:mi>
                              <m:mrow>
                                 <m:mi>t</m:mi>
                                 <m:mo>+</m:mo>
                                 <m:mn>1</m:mn>
                              </m:mrow>
                           </m:msup>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>i</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mn>1</m:mn>
                              <m:mi>n</m:mi>
                           </m:mfrac>
                           <m:mstyle displaystyle="true">
                              <m:munderover>
                                 <m:mo>&#8721;</m:mo>
                                 <m:mrow>
                                    <m:mi>k</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mn>1</m:mn>
                                 </m:mrow>
                                 <m:mi>n</m:mi>
                              </m:munderover>
                              <m:mrow>
                                 <m:msup>
                                    <m:mi>g</m:mi>
                                    <m:mi>t</m:mi>
                                 </m:msup>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mi>i</m:mi>
                                 <m:mo>|</m:mo>
                                 <m:msub>
                                    <m:mi>x</m:mi>
                                    <m:mi>k</m:mi>
                                 </m:msub>
                                 <m:mo stretchy="false">)</m:mo>
                              </m:mrow>
                           </m:mstyle>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemiuaa1aaWbaaSqabeaacqWG0baDcqGHRaWkcqaIXaqmaaGccqGGOaakcqWGPbqAcqGGPaqkcqGH9aqpjuaGdaWcaaqaaiabigdaXaqaaiabd6gaUbaakmaaqahabaGaem4zaC2aaWbaaSqabeaacqWG0baDaaGccqGGOaakcqWGPbqAcqGG8baFcqWG4baEdaWgaaWcbaGaem4AaSgabeaakiabcMcaPaWcbaGaem4AaSMaeyypa0JaeGymaedabaGaemOBa4ganiabggHiLdaaaa@496A@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>
               <display-formula id="M11">
                  <m:math name="1471-2105-9-29-i15" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:msubsup>
                              <m:mi>&#956;</m:mi>
                              <m:mi>i</m:mi>
                              <m:mrow>
                                 <m:mi>t</m:mi>
                                 <m:mo>+</m:mo>
                                 <m:mn>1</m:mn>
                              </m:mrow>
                           </m:msubsup>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mstyle displaystyle="true">
                                    <m:munderover>
                                       <m:mo>&#8721;</m:mo>
                                       <m:mrow>
                                          <m:mi>k</m:mi>
                                          <m:mo>=</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                       <m:mi>n</m:mi>
                                    </m:munderover>
                                    <m:mrow>
                                       <m:msup>
                                          <m:mi>g</m:mi>
                                          <m:mi>t</m:mi>
                                       </m:msup>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mi>i</m:mi>
                                       <m:mo>|</m:mo>
                                       <m:msub>
                                          <m:mi>x</m:mi>
                                          <m:mi>k</m:mi>
                                       </m:msub>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:mstyle>
                                 <m:msub>
                                    <m:mi>x</m:mi>
                                    <m:mi>k</m:mi>
                                 </m:msub>
                              </m:mrow>
                              <m:mrow>
                                 <m:mstyle displaystyle="true">
                                    <m:munderover>
                                       <m:mo>&#8721;</m:mo>
                                       <m:mrow>
                                          <m:mi>k</m:mi>
                                          <m:mo>=</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                       <m:mi>n</m:mi>
                                    </m:munderover>
                                    <m:mrow>
                                       <m:msup>
                                          <m:mi>g</m:mi>
                                          <m:mi>t</m:mi>
                                       </m:msup>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mi>i</m:mi>
                                       <m:mo>|</m:mo>
                                       <m:msub>
                                          <m:mi>x</m:mi>
                                          <m:mi>k</m:mi>
                                       </m:msub>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:mstyle>
                              </m:mrow>
                           </m:mfrac>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaacciGae8hVd02aa0baaSqaaiabdMgaPbqaaiabdsha0jabgUcaRiabigdaXaaakiabg2da9KqbaoaalaaabaWaaabCaeaacqWGNbWzdaahaaqabeaacqWG0baDaaGaeiikaGIaemyAaKMaeiiFaWNaemiEaG3aaSbaaeaacqWGRbWAaeqaaiabcMcaPaqaaiabdUgaRjabg2da9iabigdaXaqaaiabd6gaUbGaeyyeIuoacqWG4baEdaWgaaqaaiabdUgaRbqabaaabaWaaabCaeaacqWGNbWzdaahaaqabeaacqWG0baDaaGaeiikaGIaemyAaKMaeiiFaWNaemiEaG3aaSbaaeaacqWGRbWAaeqaaiabcMcaPaqaaiabdUgaRjabg2da9iabigdaXaqaaiabd6gaUbGaeyyeIuoaaaaaaa@59F8@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>
               <display-formula id="M12">
                  <m:math name="1471-2105-9-29-i16" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:msubsup>
                              <m:mi>&#931;</m:mi>
                              <m:mi>i</m:mi>
                              <m:mrow>
                                 <m:mi>t</m:mi>
                                 <m:mo>+</m:mo>
                                 <m:mn>1</m:mn>
                              </m:mrow>
                           </m:msubsup>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mstyle displaystyle="true">
                                    <m:munderover>
                                       <m:mo>&#8721;</m:mo>
                                       <m:mrow>
                                          <m:mi>k</m:mi>
                                          <m:mo>=</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                       <m:mi>n</m:mi>
                                    </m:munderover>
                                    <m:mrow>
                                       <m:msup>
                                          <m:mi>g</m:mi>
                                          <m:mi>t</m:mi>
                                       </m:msup>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mi>i</m:mi>
                                       <m:mo>|</m:mo>
                                       <m:msub>
                                          <m:mi>x</m:mi>
                                          <m:mi>k</m:mi>
                                       </m:msub>
                                       <m:mo stretchy="false">)</m:mo>
                                       <m:msup>
                                          <m:mrow>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:msub>
                                                <m:mi>x</m:mi>
                                                <m:mi>k</m:mi>
                                             </m:msub>
                                             <m:mo>&#8722;</m:mo>
                                             <m:msubsup>
                                                <m:mi>&#956;</m:mi>
                                                <m:mi>i</m:mi>
                                                <m:mi>t</m:mi>
                                             </m:msubsup>
                                             <m:mo stretchy="false">)</m:mo>
                                          </m:mrow>
                                          <m:mi>T</m:mi>
                                       </m:msup>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:msub>
                                          <m:mi>x</m:mi>
                                          <m:mi>k</m:mi>
                                       </m:msub>
                                       <m:mo>&#8722;</m:mo>
                                       <m:msubsup>
                                          <m:mi>&#956;</m:mi>
                                          <m:mi>i</m:mi>
                                          <m:mi>t</m:mi>
                                       </m:msubsup>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:mstyle>
                              </m:mrow>
                              <m:mrow>
                                 <m:mstyle displaystyle="true">
                                    <m:munderover>
                                       <m:mo>&#8721;</m:mo>
                                       <m:mrow>
                                          <m:mi>k</m:mi>
                                          <m:mo>=</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                       <m:mi>n</m:mi>
                                    </m:munderover>
                                    <m:mrow>
                                       <m:msup>
                                          <m:mi>g</m:mi>
                                          <m:mi>t</m:mi>
                                       </m:msup>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mi>i</m:mi>
                                       <m:mo>|</m:mo>
                                       <m:msub>
                                          <m:mi>x</m:mi>
                                          <m:mi>k</m:mi>
                                       </m:msub>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:mstyle>
                              </m:mrow>
                           </m:mfrac>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaeu4Odm1aa0baaSqaaiabdMgaPbqaaiabdsha0jabgUcaRiabigdaXaaakiabg2da9KqbaoaalaaabaWaaabCaeaacqWGNbWzdaahaaqabeaacqWG0baDaaGaeiikaGIaemyAaKMaeiiFaWNaemiEaG3aaSbaaeaacqWGRbWAaeqaaiabcMcaPiabcIcaOiabdIha4naaBaaabaGaem4AaSgabeaacqGHsisliiGacqWF8oqBdaqhaaqaaiabdMgaPbqaaiabdsha0baacqGGPaqkdaahaaqabeaacqWGubavaaGaeiikaGIaemiEaG3aaSbaaeaacqWGRbWAaeqaaiabgkHiTiab=X7aTnaaDaaabaGaemyAaKgabaGaemiDaqhaaiabcMcaPaqaaiabdUgaRjabg2da9iabigdaXaqaaiabd6gaUbGaeyyeIuoaaeaadaaeWbqaaiabdEgaNnaaCaaabeqaaiabdsha0baacqGGOaakcqWGPbqAcqGG8baFcqWG4baEdaWgaaqaaiabdUgaRbqabaGaeiykaKcabaGaem4AaSMaeyypa0JaeGymaedabaGaemOBa4gacqGHris5aaaaaaa@6C93@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p><b>(4) </b>Repeat steps 2&#8211;3 until the change in the parameter estimates between successive iterations is negligible.</p>
            <p>One problem in implementing the EM algorithm is how to initialize the parameters. An improper starting point may prolong convergence or cause the algorithm to converge to a local rather than the global maximum of the likelihood. In this work, k-means clustering was used to partition the observations into subclasses, and the means and covariance matrices of the component Gaussian distributions were initialized with the means and covariance matrices of those subclasses.</p>
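            <p>The iteration and the k-means initialization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the program used in this work: it handles only the univariate case (so each covariance matrix reduces to a scalar variance), the function names <b>kmeans_init</b> and <b>em_fit</b>, the tolerance, the iteration limits and the variance floor are all hypothetical choices, and the variance is computed about the freshly updated mean, as in the standard M-step.</p>
            <p><![CDATA[
```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def kmeans_init(xs, m, n_iter=25, seed=0):
    """Partition xs into m subclasses by 1-D k-means and return the
    (weights, means, variances) of those subclasses as EM starting values."""
    centers = random.Random(seed).sample(xs, m)
    groups = [[] for _ in range(m)]
    for _ in range(n_iter):
        groups = [[] for _ in range(m)]
        for x in xs:
            nearest = min(range(m), key=lambda i: abs(x - centers[i]))
            groups[nearest].append(x)
        # An empty subclass keeps its previous center.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    weights = [max(len(g), 1) / len(xs) for g in groups]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Variance floor (an assumption) avoids a degenerate zero-width component.
    variances = [
        max(sum((x - c) ** 2 for x in g) / len(g), 1e-6) if g else 1e-6
        for g, c in zip(groups, centers)
    ]
    return weights, centers, variances


def em_fit(xs, m, tol=1e-6, max_iter=500):
    """Fit an m-component univariate Gaussian mixture by EM."""
    weights, means, variances = kmeans_init(xs, m)
    n = len(xs)
    for _ in range(max_iter):
        # E-step: posterior probability g(i | x_k) of component i for each x_k.
        post = []
        for x in xs:
            joint = [w * gaussian_pdf(x, mu, v)
                     for w, mu, v in zip(weights, means, variances)]
            norm = sum(joint) or 1e-300  # guard against underflow to zero
            post.append([j / norm for j in joint])
        # M-step: re-estimate weight, mean and variance of each component.
        change = 0.0
        for i in range(m):
            g_sum = max(sum(p[i] for p in post), 1e-12)
            new_w = g_sum / n
            new_mu = sum(p[i] * x for p, x in zip(post, xs)) / g_sum
            new_var = max(
                sum(p[i] * (x - new_mu) ** 2 for p, x in zip(post, xs)) / g_sum,
                1e-6,
            )
            change = max(change, abs(new_w - weights[i]),
                         abs(new_mu - means[i]), abs(new_var - variances[i]))
            weights[i], means[i], variances[i] = new_w, new_mu, new_var
        if change < tol:  # step 4: parameters have effectively stopped moving
            break
    return weights, means, variances
```
]]></p>
            <p>Calling <b>em_fit(data, 2)</b> on a list of floats returns the estimated weights, means and variances of a two-component mixture. A multivariate version would replace the scalar variance with a covariance matrix and the scalar density with the multivariate Gaussian density.</p>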
            <p>Another difficulty in implementing the EM algorithm is selecting the number of component density functions. In general, including more component functions approximates the distribution of the observations more accurately but increases the number of parameters that must be estimated. Overly complex models may cause the EM algorithm to converge to a local rather than the global maximum of the likelihood and degrade the performance of the resulting model. In this work, a trial-and-error procedure was used to select the minimum number of component density functions: the number was increased from 2 until the change in the likelihood function value became negligible (for example, less than 1%).</p>
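            <p>The trial-and-error selection of the number of components can be sketched as a simple stopping rule. This is an illustrative sketch only: the callable <b>fit_log_likelihood</b> is a hypothetical stand-in for fitting a mixture with a given number of components and returning its log-likelihood, the upper limit <b>max_components</b> is an assumption, and returning the smaller of the two models being compared is one possible reading of "minimum number".</p>
            <p><![CDATA[
```python
from typing import Callable


def select_num_components(
    fit_log_likelihood: Callable[[int], float],
    start: int = 2,
    max_components: int = 10,
    rel_tol: float = 0.01,
) -> int:
    """Return the smallest number of components after which adding one more
    changes the log-likelihood by less than rel_tol (relative change)."""
    prev = fit_log_likelihood(start)
    for m in range(start + 1, max_components + 1):
        cur = fit_log_likelihood(m)
        # Stop when the improvement from m - 1 to m components is negligible.
        if abs(cur - prev) < rel_tol * abs(prev):
            return m - 1
        prev = cur
    # No plateau found within the allowed range (hypothetical fallback).
    return max_components
```
]]></p>
            <p>With the default <b>rel_tol</b> of 0.01 the rule corresponds to the 1% threshold mentioned above. Any fitting routine that returns a log-likelihood for a given number of components can be passed in.</p>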
         </sec>
      </sec>
      <sec>
         <st>
            <p>Abbreviations</p>
         </st>
         <p>MS/MS: tandem mass spectrometry; DF: discriminant function; FPR: false positive rate; LCQ: 3D quadrupole ion trap; LTQ: linear trap quadrupole; FT: Fourier transform; PDF: probability density function; FnPDFe: fully nonparametric probability density function estimate; MLE: maximum likelihood estimate; EM: expectation-maximization; MGDF: multivariate Gaussian density function; RDM: randomized database matches; IPI: International Protein Index.</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>JZ developed the data processing program and wrote the main text of the paper. XL performed the experiments analyzing the samples on the LTQ/FT platform. HX reviewed the algorithmic issues and provided many suggestions for improving the implementation of the EM algorithm. YZ and FH reviewed the paper and revised its framework.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>We thank Dr. Songfeng Wu of the Beijing Proteome Research Center for thoughtful discussions and Master's candidate Jie Ma of the Beijing Proteome Research Center for assistance with the database search. We also thank the BIATECH institute for providing the LCQ control dataset and Dr. Zhongqi Zhang for kindly providing the program MassAnalyzer. This work was funded by the Chinese Ministry of Science and Technology (2006AA02A312, 2006AA02Z334, 2006CB910803, 2006CB910700), the National Natural Science Foundation of China (30621063, 342123), the Beijing Municipal Science and Technology Project (H030230280590), and the Chinese National Key Program of Basic Research (2006CB910700).</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Proteomics to study genes and genomes</p>
            </title>
            <aug>
               <au>
                  <snm>Pandey</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Mann</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2000</pubdate>
            <volume>405</volume>
            <issue>6788</issue>
            <fpage>837</fpage>
            <lpage>46</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/35015709</pubid>
                  <pubid idtype="pmpid" link="fulltext">10866210</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Proteomics: the first decade and beyond</p>
            </title>
            <aug>
               <au>
                  <snm>Patterson</snm>
                  <fnm>SD</fnm>
               </au>
               <au>
                  <snm>Aebersold</snm>
                  <fnm>RH</fnm>
               </au>
            </aug>
            <source>Nat Genet</source>
            <pubdate>2003</pubdate>
            <volume>33</volume>
            <issue>Suppl</issue>
            <fpage>311</fpage>
            <lpage>23</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/ng1106</pubid>
                  <pubid idtype="pmpid" link="fulltext">12610541</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Mass spectrometry-based proteomics</p>
            </title>
            <aug>
               <au>
                  <snm>Aebersold</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Mann</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2003</pubdate>
            <volume>422</volume>
            <issue>6928</issue>
            <fpage>198</fpage>
            <lpage>207</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nature01511</pubid>
                  <pubid idtype="pmpid" link="fulltext">12634793</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Mass spectrometry and protein analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Domon</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Aebersold</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>2006</pubdate>
            <volume>312</volume>
            <issue>5771</issue>
            <fpage>212</fpage>
            <lpage>7</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.1124619</pubid>
                  <pubid idtype="pmpid" link="fulltext">16614208</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Analysis, statistical validation and dissemination of large-scale proteomics datasets generated by tandem MS</p>
            </title>
            <aug>
               <au>
                  <snm>Nesvizhskii</snm>
                  <fnm>AI</fnm>
               </au>
               <au>
                  <snm>Aebersold</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Drug Discov Today</source>
            <pubdate>2004</pubdate>
            <volume>9</volume>
            <issue>4</issue>
            <fpage>173</fpage>
            <lpage>81</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S1359-6446(03)02978-7</pubid>
                  <pubid idtype="pmpid" link="fulltext">14960397</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database</p>
            </title>
            <aug>
               <au>
                  <snm>Eng</snm>
                  <fnm>JK</fnm>
               </au>
               <au>
                  <snm>McCormack</snm>
                  <fnm>AL</fnm>
               </au>
               <au>
                  <snm>Yates</snm>
                  <fnm>JR</fnm>
                  <suf>3rd</suf>
               </au>
            </aug>
            <source>J Am Soc Mass Spectrom</source>
            <pubdate>1994</pubdate>
            <volume>5</volume>
            <issue>11</issue>
            <fpage>976</fpage>
            <lpage>89</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/1044-0305(94)80016-2</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Probability-based protein identification by searching sequence databases using mass spectrometry data</p>
            </title>
            <aug>
               <au>
                  <snm>Perkins</snm>
                  <fnm>DN</fnm>
               </au>
               <au>
                  <snm>Pappin</snm>
                  <fnm>DJ</fnm>
               </au>
               <au>
                  <snm>Creasy</snm>
                  <fnm>DM</fnm>
               </au>
               <au>
                  <snm>Cottrell</snm>
                  <fnm>JS</fnm>
               </au>
            </aug>
            <source>Electrophoresis</source>
            <pubdate>1999</pubdate>
            <volume>20</volume>
            <issue>18</issue>
            <fpage>3551</fpage>
            <lpage>67</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/(SICI)1522-2683(19991201)20:18&lt;3551::AID-ELPS3551>3.0.CO;2-2</pubid>
                  <pubid idtype="pmpid" link="fulltext">10612281</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Challenges and opportunities in proteomics data analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Domon</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Aebersold</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Mol Cell Proteomics</source>
            <pubdate>2006</pubdate>
            <volume>5</volume>
            <issue>10</issue>
            <fpage>1921</fpage>
            <lpage>6</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1074/mcp.R600012-MCP200</pubid>
                  <pubid idtype="pmpid" link="fulltext">16896060</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Large-scale database searching using tandem mass spectra: looking up the answer in the back of the book</p>
            </title>
            <aug>
               <au>
                  <snm>Sadygov</snm>
                  <fnm>RG</fnm>
               </au>
               <au>
                  <snm>Cociorva</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Yates</snm>
                  <fnm>JR</fnm>
                  <suf>3rd</suf>
               </au>
            </aug>
            <source>Nat Methods</source>
            <pubdate>2004</pubdate>
            <volume>1</volume>
            <issue>3</issue>
            <fpage>195</fpage>
            <lpage>202</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nmeth725</pubid>
                  <pubid idtype="pmpid" link="fulltext">15789030</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Valid data from large-scale proteomics studies</p>
            </title>
            <aug>
               <au>
                  <snm>Chamrad</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Meyer</snm>
                  <fnm>HE</fnm>
               </au>
            </aug>
            <source>Nat Methods</source>
            <pubdate>2005</pubdate>
            <volume>2</volume>
            <issue>9</issue>
            <fpage>647</fpage>
            <lpage>8</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nmeth0905-647</pubid>
                  <pubid idtype="pmpid" link="fulltext">16118632</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Integrated Approach for Manual Evaluation of Peptides Identified by Searching Protein Sequence Databases with Tandem Mass Spectra</p>
            </title>
            <aug>
               <au>
                  <snm>Chen</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Kwon</snm>
                  <fnm>SW</fnm>
               </au>
               <au>
                  <snm>Kim</snm>
                  <fnm>SC</fnm>
               </au>
               <au>
                  <snm>Zhao</snm>
                  <fnm>Y</fnm>
               </au>
            </aug>
            <source>J Proteome Res</source>
            <pubdate>2005</pubdate>
            <volume>4</volume>
            <issue>3</issue>
            <fpage>998</fpage>
            <lpage>1005</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1021/pr049754t</pubid>
                  <pubid idtype="pmpid" link="fulltext">15952748</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>DTASelect and Contrast: Tools for Assembling and Comparing Protein Identifications from Shotgun Proteomics</p>
            </title>
            <aug>
               <au>
                  <snm>Tabb</snm>
                  <fnm>DL</fnm>
               </au>
               <au>
                  <snm>McDonald</snm>
                  <fnm>WH</fnm>
               </au>
               <au>
                  <snm>Yates</snm>
                  <fnm>JR</fnm>
                  <suf>3rd</suf>
               </au>
            </aug>
            <source>J Proteome Res</source>
            <pubdate>2002</pubdate>
            <volume>1</volume>
            <issue>1</issue>
            <fpage>21</fpage>
            <lpage>6</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1021/pr015504q</pubid>
                  <pubid idtype="pmpid">12643522</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>AMASS: Software for Automatically Validating the Quality of MS/MS Spectrum from SEQUEST Results</p>
            </title>
            <aug>
               <au>
                  <snm>Sun</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Zheng</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Gao</snm>
                  <fnm>Y</fnm>
               </au>
            </aug>
            <source>Mol Cell Proteomics</source>
            <pubdate>2004</pubdate>
            <volume>3</volume>
            <issue>12</issue>
            <fpage>1194</fpage>
            <lpage>1199</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1074/mcp.M400120-MCP200</pubid>
                  <pubid idtype="pmpid" link="fulltext">15489460</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Direct analysis of protein complexes using mass spectrometry</p>
            </title>
            <aug>
               <au>
                  <snm>Link</snm>
                  <fnm>AJ</fnm>
               </au>
               <au>
                  <snm>Eng</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Schieltz</snm>
                  <fnm>DM</fnm>
               </au>
               <au>
                  <snm>Carmack</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Mize</snm>
                  <fnm>GJ</fnm>
               </au>
               <au>
                  <snm>Morris</snm>
                  <fnm>DR</fnm>
               </au>
               <au>
                  <snm>Garvik</snm>
                  <fnm>BM</fnm>
               </au>
               <au>
                  <snm>Yates</snm>
                  <fnm>JR</fnm>
                  <suf>3rd</suf>
               </au>
            </aug>
            <source>Nat Biotechnol</source>
            <pubdate>1999</pubdate>
            <volume>17</volume>
            <issue>7</issue>
            <fpage>676</fpage>
            <lpage>82</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/10890</pubid>
                  <pubid idtype="pmpid" link="fulltext">10404161</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Empirical Statistical Model To Estimate the Accuracy of Peptide Identifications Made by MS/MS and Database Search</p>
            </title>
            <aug>
               <au>
                  <snm>Keller</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Nesvizhskii</snm>
                  <fnm>AI</fnm>
               </au>
               <au>
                  <snm>Kolker</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Aebersold</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Anal Chem</source>
            <pubdate>2002</pubdate>
            <volume>74</volume>
            <issue>20</issue>
            <fpage>5383</fpage>
            <lpage>5392</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1021/ac025747h</pubid>
                  <pubid idtype="pmpid">12403597</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Statistical model for large-scale peptide identification in databases from tandem mass spectra using SEQUEST</p>
            </title>
            <aug>
               <au>
                  <snm>Lopez-Ferrer</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Martinez-Bartolome</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Villar</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Campillos</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Martin-Maroto</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Vazquez</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Anal Chem</source>
            <pubdate>2004</pubdate>
            <volume>76</volume>
            <issue>23</issue>
            <fpage>6853</fpage>
            <lpage>6860</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1021/ac049305c</pubid>
                  <pubid idtype="pmpid" link="fulltext">15571333</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>A model of random mass-matching and its use for automated significance testing in mass spectrometric proteome analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Eriksson</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Fenyo</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Proteomics</source>
            <pubdate>2002</pubdate>
            <volume>2</volume>
            <issue>3</issue>
            <fpage>262</fpage>
            <lpage>270</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/1615-9861(200203)2:3&lt;262::AID-PROT262>3.0.CO;2-W</pubid>
                  <pubid idtype="pmpid" link="fulltext">11921442</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>A hypergeometric probability model for protein identification and validation using tandem mass spectral data and protein sequence databases</p>
            </title>
            <aug>
               <au>
                  <snm>Sadygov</snm>
                  <fnm>RG</fnm>
               </au>
               <au>
                  <snm>Yates</snm>
                  <fnm>JR</fnm>
                  <suf>3rd</suf>
               </au>
            </aug>
            <source>Anal Chem</source>
            <pubdate>2003</pubdate>
            <volume>75</volume>
            <issue>15</issue>
            <fpage>3792</fpage>
            <lpage>3798</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1021/ac034157w</pubid>
                  <pubid idtype="pmpid">14572045</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Statistical Models for Protein Validation Using Tandem Mass Spectral Data and Protein Amino Acid Sequence Databases</p>
            </title>
            <aug>
               <au>
                  <snm>Sadygov</snm>
                  <fnm>RG</fnm>
               </au>
               <au>
                  <snm>Liu</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Yates</snm>
                  <fnm>JR</fnm>
                  <suf>3rd</suf>
               </au>
            </aug>
            <source>Anal Chem</source>
            <pubdate>2004</pubdate>
            <volume>76</volume>
            <issue>6</issue>
            <fpage>1664</fpage>
            <lpage>1671</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1021/ac035112y</pubid>
                  <pubid idtype="pmpid" link="fulltext">15018565</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Qscore: An Algorithm for Evaluating SEQUEST Database Search Results</p>
            </title>
            <aug>
               <au>
                  <snm>Moore</snm>
                  <fnm>RE</fnm>
               </au>
               <au>
                  <snm>Young</snm>
                  <fnm>MK</fnm>
               </au>
               <au>
                  <snm>Lee</snm>
                  <fnm>TD</fnm>
               </au>
            </aug>
            <source>J Am Soc Mass Spectrom</source>
            <pubdate>2002</pubdate>
            <volume>13</volume>
            <issue>4</issue>
            <fpage>378</fpage>
            <lpage>386</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S1044-0305(02)00352-5</pubid>
                  <pubid idtype="pmpid">11951976</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Artificial Neural Network Analysis for Evaluation of Peptide MS/MS Spectra in Proteomics</p>
            </title>
            <aug>
               <au>
                  <snm>B&#261;czek</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Bucinski</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Ivanov</snm>
                  <fnm>AR</fnm>
               </au>
               <au>
                  <snm>Kaliszan</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Anal Chem</source>
            <pubdate>2004</pubdate>
            <volume>76</volume>
            <issue>6</issue>
            <fpage>1726</fpage>
            <lpage>1732</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1021/ac030297u</pubid>
                  <pubid idtype="pmpid" link="fulltext">15018575</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>A computational method for assessing peptide identification Reliability in tandem mass spectrometry analysis with SEQUEST</p>
            </title>
            <aug>
               <au>
                  <snm>Razumovskaya</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Olman</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Xu</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Uberbacher</snm>
                  <fnm>EC</fnm>
               </au>
               <au>
                  <snm>VerBerkmoes</snm>
                  <fnm>NC</fnm>
               </au>
               <au>
                  <snm>Hettich</snm>
                  <fnm>RL</fnm>
               </au>
               <au>
                  <snm>Xu</snm>
                  <fnm>Y</fnm>
               </au>
            </aug>
            <source>Proteomics</source>
            <pubdate>2004</pubdate>
            <volume>4</volume>
            <issue>4</issue>
            <fpage>961</fpage>
            <lpage>969</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/pmic.200300656</pubid>
                  <pubid idtype="pmpid" link="fulltext">15048978</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>A New Algorithm for the Evaluation of Shotgun Peptide Sequencing in Proteomics: Support Vector Machine Classification of Peptide MS/MS Spectra and SEQUEST Scores</p>
            </title>
            <aug>
               <au>
                  <snm>Anderson</snm>
                  <fnm>DC</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Payan</snm>
                  <fnm>DG</fnm>
               </au>
               <au>
                  <snm>Noble</snm>
                  <fnm>WS</fnm>
               </au>
            </aug>
            <source>J Proteome Res</source>
            <pubdate>2003</pubdate>
            <volume>2</volume>
            <issue>2</issue>
            <fpage>137</fpage>
            <lpage>146</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1021/pr0255654</pubid>
                  <pubid idtype="pmpid">12716127</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Improved classification of mass spectrometry database search results using newer machine learning approaches</p>
            </title>
            <aug>
               <au>
                  <snm>Ulintz</snm>
                  <fnm>PJ</fnm>
               </au>
               <au>
                  <snm>Zhu</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Qin</snm>
                  <fnm>ZS</fnm>
               </au>
               <au>
                  <snm>Andrews</snm>
                  <fnm>PC</fnm>
               </au>
            </aug>
            <source>Mol Cell Proteomics</source>
            <pubdate>2006</pubdate>
            <volume>5</volume>
            <issue>3</issue>
            <fpage>497</fpage>
            <lpage>509</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">16321970</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Probability-Based Evaluation of Peptide and Protein Identifications from Tandem Mass Spectrometry and SEQUEST Analysis: The Human Proteome</p>
            </title>
            <aug>
               <au>
                  <snm>Qian</snm>
                  <fnm>WJ</fnm>
               </au>
               <au>
                  <snm>Liu</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Monroe</snm>
                  <fnm>ME</fnm>
               </au>
               <au>
                  <snm>Strittmatter</snm>
                  <fnm>EF</fnm>
               </au>
               <au>
                  <snm>Jacobs</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Kangas</snm>
                  <fnm>LJ</fnm>
               </au>
               <au>
                  <snm>Petritis</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Camp</snm>
                  <fnm>DG</fnm>
                  <suf>2nd</suf>
               </au>
               <au>
                  <snm>Smith</snm>
                  <fnm>RD</fnm>
               </au>
            </aug>
            <source>J Proteome Res</source>
            <pubdate>2005</pubdate>
            <volume>4</volume>
            <issue>1</issue>
            <fpage>53</fpage>
            <lpage>62</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1021/pr0498638</pubid>
                  <pubid idtype="pmpid" link="fulltext">15707357</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome</p>
            </title>
            <aug>
               <au>
                  <snm>Peng</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Elias</snm>
                  <fnm>JE</fnm>
               </au>
               <au>
                  <snm>Thoreen</snm>
                  <fnm>CC</fnm>
               </au>
               <au>
                  <snm>Licklider</snm>
                  <fnm>LJ</fnm>
               </au>
               <au>
                  <snm>Gygi</snm>
                  <fnm>SP</fnm>
               </au>
            </aug>
            <source>J Proteome Res</source>
            <pubdate>2003</pubdate>
            <volume>2</volume>
            <issue>1</issue>
            <fpage>43</fpage>
            <lpage>50</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1021/pr025556v</pubid>
                  <pubid idtype="pmpid">12643542</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Comparative evaluation of mass spectrometry platforms used in large-scale proteomics investigations</p>
            </title>
            <aug>
               <au>
                  <snm>Elias</snm>
                  <fnm>JE</fnm>
               </au>
               <au>
                  <snm>Haas</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Faherty</snm>
                  <fnm>BK</fnm>
               </au>
               <au>
                  <snm>Gygi</snm>
                  <fnm>SP</fnm>
               </au>
            </aug>
            <source>Nat Methods</source>
            <pubdate>2005</pubdate>
            <volume>2</volume>
            <issue>9</issue>
            <fpage>667</fpage>
            <lpage>75</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nmeth785</pubid>
                  <pubid idtype="pmpid" link="fulltext">16118637</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Randomized sequence databases for tandem mass spectrometry peptide and protein identification</p>
            </title>
            <aug>
               <au>
                  <snm>Higdon</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Hogan</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Van Belle</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Kolker</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>OMICS</source>
            <pubdate>2005</pubdate>
            <volume>9</volume>
            <issue>4</issue>
            <fpage>364</fpage>
            <lpage>79</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1089/omi.2005.9.364</pubid>
                  <pubid idtype="pmpid" link="fulltext">16402894</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>RScore: a peptide randomicity score for evaluating tandem mass spectra</p>
            </title>
            <aug>
               <au>
                  <snm>Li</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Sun</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Gao</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Rapid Commun Mass Spectrom</source>
            <pubdate>2004</pubdate>
            <volume>18</volume>
            <issue>14</issue>
            <fpage>1655</fpage>
            <lpage>9</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/rcm.1535</pubid>
                  <pubid idtype="pmpid" link="fulltext">15282793</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <title>
               <p>A Method for Assessing the Statistical Significance of Mass Spectrometry-Based Protein Identifications Using General Scoring Schemes</p>
            </title>
            <aug>
               <au>
                  <snm>Fenyo</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Beavis</snm>
                  <fnm>RC</fnm>
               </au>
            </aug>
            <source>Anal Chem</source>
            <pubdate>2003</pubdate>
            <volume>75</volume>
            <issue>4</issue>
            <fpage>768</fpage>
            <lpage>74</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1021/ac0258709</pubid>
                  <pubid idtype="pmpid">12622365</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B31">
            <title>
               <p>Improved validation of peptide MS/MS assignments using spectral intensity prediction</p>
            </title>
            <aug>
               <au>
                  <snm>Sun</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Meyer-Arendt</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Eichelberger</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Yen</snm>
                  <fnm>CY</fnm>
               </au>
               <au>
                  <snm>Old</snm>
                  <fnm>WM</fnm>
               </au>
               <au>
                  <snm>Pierce</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Cios</snm>
                  <fnm>KJ</fnm>
               </au>
               <au>
                  <snm>Ahn</snm>
                  <fnm>NG</fnm>
               </au>
               <au>
                  <snm>Resing</snm>
                  <fnm>KA</fnm>
               </au>
            </aug>
            <source>Mol Cell Proteomics</source>
            <pubdate>2007</pubdate>
            <volume>6</volume>
            <issue>1</issue>
            <fpage>1</fpage>
            <lpage>17</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">17018520</pubid>
                  <pubid idtype="doi">10.1074/mcp.M600449-MCP200</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>Application of peptide LC retention time information in a discriminant function for peptide identification by tandem mass spectrometry</p>
            </title>
            <aug>
               <au>
                  <snm>Strittmatter</snm>
                  <fnm>EF</fnm>
               </au>
               <au>
                  <snm>Kangas</snm>
                  <fnm>LJ</fnm>
               </au>
               <au>
                  <snm>Petritis</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Mottaz</snm>
                  <fnm>HM</fnm>
               </au>
               <au>
                  <snm>Anderson</snm>
                  <fnm>GA</fnm>
               </au>
               <au>
                  <snm>Shen</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Jacobs</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Camp</snm>
                  <fnm>DG</fnm>
                  <suf>2nd</suf>
               </au>
               <au>
                  <snm>Smith</snm>
                  <fnm>RD</fnm>
               </au>
            </aug>
            <source>J Proteome Res</source>
            <pubdate>2004</pubdate>
            <volume>3</volume>
            <issue>4</issue>
            <fpage>760</fpage>
            <lpage>9</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1021/pr049965y</pubid>
                  <pubid idtype="pmpid">15359729</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B33">
            <title>
               <p>Prediction of Error Associated with False-Positive Rate Determination for Peptide Identification in Large-Scale Proteomics Experiments Using a Combined Reverse and Forward Peptide Sequence Database Strategy</p>
            </title>
            <aug>
               <au>
                  <snm>Huttlin</snm>
                  <fnm>EL</fnm>
               </au>
               <au>
                  <snm>Hegeman</snm>
                  <fnm>AD</fnm>
               </au>
               <au>
                  <snm>Harms</snm>
                  <fnm>AC</fnm>
               </au>
               <au>
                  <snm>Sussman</snm>
                  <fnm>MR</fnm>
               </au>
            </aug>
            <source>J Proteome Res</source>
            <pubdate>2007</pubdate>
            <volume>6</volume>
            <issue>1</issue>
            <fpage>392</fpage>
            <lpage>398</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1021/pr0603194</pubid>
                  <pubid idtype="pmpid" link="fulltext">17203984</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B34">
            <title>
               <p>Nucleolar proteome dynamics</p>
            </title>
            <aug>
               <au>
                  <snm>Andersen</snm>
                  <fnm>JS</fnm>
               </au>
               <au>
                  <snm>Lam</snm>
                  <fnm>YW</fnm>
               </au>
               <au>
                  <snm>Leung</snm>
                  <fnm>AK</fnm>
               </au>
               <au>
                  <snm>Ong</snm>
                  <fnm>SE</fnm>
               </au>
               <au>
                  <snm>Lyon</snm>
                  <fnm>CE</fnm>
               </au>
               <au>
                  <snm>Lamond</snm>
                  <fnm>AI</fnm>
               </au>
               <au>
                  <snm>Mann</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2005</pubdate>
            <volume>433</volume>
            <issue>7021</issue>
            <fpage>77</fpage>
            <lpage>83</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nature03207</pubid>
                  <pubid idtype="pmpid" link="fulltext">15635413</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B35">
            <title>
               <p>Large-scale and high-confidence proteomic analysis of human seminal plasma</p>
            </title>
            <aug>
               <au>
                  <snm>Pilch</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Mann</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <issue>5</issue>
            <fpage>R40</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1779515</pubid>
                  <pubid idtype="pmpid" link="fulltext">16709260</pubid>
                  <pubid idtype="doi">10.1186/gb-2006-7-5-r40</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B36">
            <title>
               <p>Status of complete proteome analysis by mass spectrometry: SILAC labeled yeast as a model system</p>
            </title>
            <aug>
               <au>
                  <snm>de Godoy</snm>
                  <fnm>LM</fnm>
               </au>
               <au>
                  <snm>Olsen</snm>
                  <fnm>JV</fnm>
               </au>
               <au>
                  <snm>de Souza</snm>
                  <fnm>GA</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Mortensen</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Mann</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <issue>6</issue>
            <fpage>R50</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1779535</pubid>
                  <pubid idtype="pmpid" link="fulltext">16784548</pubid>
                  <pubid idtype="doi">10.1186/gb-2006-7-6-r50</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B37">
            <title>
               <p>The human urinary proteome contains more than 1500 proteins including a large proportion of membrane proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Adachi</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Kumar</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Olsen</snm>
                  <fnm>JV</fnm>
               </au>
               <au>
                  <snm>Mann</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <issue>9</issue>
            <fpage>R80</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1794545</pubid>
                  <pubid idtype="pmpid" link="fulltext">16948836</pubid>
                  <pubid idtype="doi">10.1186/gb-2006-7-9-r80</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B38">
            <title>
               <p>A probability-based approach for high-throughput protein phosphorylation analysis and site localization</p>
            </title>
            <aug>
               <au>
                  <snm>Beausoleil</snm>
                  <fnm>SA</fnm>
               </au>
               <au>
                  <snm>Villen</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Gerber</snm>
                  <fnm>SA</fnm>
               </au>
               <au>
                  <snm>Rush</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Gygi</snm>
                  <fnm>SP</fnm>
               </au>
            </aug>
            <source>Nat Biotechnol</source>
            <pubdate>2006</pubdate>
            <volume>24</volume>
            <issue>10</issue>
            <fpage>1285</fpage>
            <lpage>92</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nbt1240</pubid>
                  <pubid idtype="pmpid" link="fulltext">16964243</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B39">
            <title>
               <p>Enhanced analysis of metastatic prostate cancer using stable isotopes and high mass accuracy instrumentation</p>
            </title>
            <aug>
               <au>
                  <snm>Everley</snm>
                  <fnm>PA</fnm>
               </au>
               <au>
                  <snm>Bakalarski</snm>
                  <fnm>CE</fnm>
               </au>
               <au>
                  <snm>Elias</snm>
                  <fnm>JE</fnm>
               </au>
               <au>
                  <snm>Waghorne</snm>
                  <fnm>CG</fnm>
               </au>
               <au>
                  <snm>Beausoleil</snm>
                  <fnm>SA</fnm>
               </au>
               <au>
                  <snm>Gerber</snm>
                  <fnm>SA</fnm>
               </au>
               <au>
                  <snm>Faherty</snm>
                  <fnm>BK</fnm>
               </au>
               <au>
                  <snm>Zetter</snm>
                  <fnm>BR</fnm>
               </au>
               <au>
                  <snm>Gygi</snm>
                  <fnm>SP</fnm>
               </au>
            </aug>
            <source>J Proteome Res</source>
            <pubdate>2006</pubdate>
            <volume>5</volume>
            <issue>5</issue>
            <fpage>1224</fpage>
            <lpage>31</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1021/pr0504891</pubid>
                  <pubid idtype="pmpid" link="fulltext">16674112</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B40">
            <title>
               <p>Optimization and use of peptide mass measurement accuracy in shotgun proteomics</p>
            </title>
            <aug>
               <au>
                  <snm>Haas</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Faherty</snm>
                  <fnm>BK</fnm>
               </au>
               <au>
                  <snm>Gerber</snm>
                  <fnm>SA</fnm>
               </au>
               <au>
                  <snm>Elias</snm>
                  <fnm>JE</fnm>
               </au>
               <au>
                  <snm>Beausoleil</snm>
                  <fnm>SA</fnm>
               </au>
               <au>
                  <snm>Bakalarski</snm>
                  <fnm>CE</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Villen</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Gygi</snm>
                  <fnm>SP</fnm>
               </au>
            </aug>
            <source>Mol Cell Proteomics</source>
            <pubdate>2006</pubdate>
            <volume>5</volume>
            <issue>7</issue>
            <fpage>1326</fpage>
            <lpage>37</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1074/mcp.M500339-MCP200</pubid>
                  <pubid idtype="pmpid" link="fulltext">16635985</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B41">
            <title>
               <p>Nonparametric Multivariate Density Estimation: A Comparative Study</p>
            </title>
            <aug>
               <au>
                  <snm>Hwang</snm>
                  <fnm>JN</fnm>
               </au>
               <au>
                  <snm>Lay</snm>
                  <fnm>SR</fnm>
               </au>
               <au>
                  <snm>Lippman</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>IEEE Transactions on Signal Processing</source>
            <pubdate>1994</pubdate>
            <volume>42</volume>
            <issue>10</issue>
            <fpage>2795</fpage>
            <lpage>2810</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1109/78.324744</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B42">
            <title>
               <p>Fully nonparametric probability density function estimation with finite Gaussian mixture models</p>
            </title>
            <aug>
               <au>
                  <snm>Archambeau</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Verleysen</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>7th ICPAR Conf</source>
            <pubdate>2003</pubdate>
            <fpage>81</fpage>
            <lpage>84</lpage>
         </bibl>
         <bibl id="B43">
            <title>
               <p>Pattern Classification, Second Edition</p>
            </title>
            <aug>
               <au>
                  <snm>Duda</snm>
                  <mi>O</mi>
                  <fnm>Richard</fnm>
               </au>
               <au>
                  <snm>Hart</snm>
                  <mi>E</mi>
                  <fnm>Peter</fnm>
               </au>
               <au>
                  <snm>Stork</snm>
                  <mi>G</mi>
                  <fnm>David</fnm>
               </au>
            </aug>
            <source>John Wiley</source>
            <pubdate>2001</pubdate>
            <volume>10</volume>
            <fpage>3</fpage>
            <lpage>13</lpage>
         </bibl>
         <bibl id="B44">
            <title>
               <p>Improving reproducibility and sensitivity in identifying human proteins by shotgun proteomics</p>
            </title>
            <aug>
               <au>
                  <snm>Resing</snm>
                  <fnm>KA</fnm>
               </au>
               <au>
                  <snm>Meyer-Arendt</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Mendoza</snm>
                  <fnm>AM</fnm>
               </au>
               <au>
                  <snm>Aveline-Wolf</snm>
                  <fnm>LD</fnm>
               </au>
               <au>
                  <snm>Jonscher</snm>
                  <fnm>KR</fnm>
               </au>
               <au>
                  <snm>Pierce</snm>
                  <fnm>KG</fnm>
               </au>
               <au>
                  <snm>Old</snm>
                  <fnm>WM</fnm>
               </au>
               <au>
                  <snm>Cheung</snm>
                  <fnm>HT</fnm>
               </au>
               <au>
                  <snm>Russell</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Wattawa</snm>
                  <fnm>JL</fnm>
               </au>
               <au>
                  <snm>Goehle</snm>
                  <fnm>GR</fnm>
               </au>
               <au>
                  <snm>Knight</snm>
                  <fnm>RD</fnm>
               </au>
               <au>
                  <snm>Ahn</snm>
                  <fnm>NG</fnm>
               </au>
            </aug>
            <source>Anal Chem</source>
            <pubdate>2004</pubdate>
            <volume>76</volume>
            <issue>13</issue>
            <fpage>3556</fpage>
            <lpage>68</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1021/ac035229m</pubid>
                  <pubid idtype="pmpid" link="fulltext">15228325</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B45">
            <title>
               <p>Prediction of Low-Energy Collision-Induced Dissociation Spectra of Peptides</p>
            </title>
            <aug>
               <au>
                  <snm>Zhang</snm>
                  <fnm>Z</fnm>
               </au>
            </aug>
            <source>Anal Chem</source>
            <pubdate>2004</pubdate>
            <volume>76</volume>
            <issue>14</issue>
            <fpage>3908</fpage>
            <lpage>3922</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1021/ac049951b</pubid>
                  <pubid idtype="pmpid" link="fulltext">15253624</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B46">
            <title>
               <p>Standard mixtures for proteome studies</p>
            </title>
            <aug>
               <au>
                  <snm>Purvine</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Picone</snm>
                  <fnm>AF</fnm>
               </au>
               <au>
                  <snm>Kolker</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>OMICS</source>
            <pubdate>2004</pubdate>
            <volume>8</volume>
            <issue>1</issue>
            <fpage>79</fpage>
            <lpage>92</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1089/153623104773547507</pubid>
                  <pubid idtype="pmpid" link="fulltext">15107238</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B47">
            <title>
               <p>DBParser: Web-Based Software for Shotgun Proteomic Data Analyses</p>
            </title>
            <aug>
               <au>
                  <snm>Yang</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Dondeti</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Dezube</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Maynard</snm>
                  <fnm>DM</fnm>
               </au>
               <au>
                  <snm>Geer</snm>
                  <fnm>LY</fnm>
               </au>
               <au>
                  <snm>Epstein</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Markey</snm>
                  <fnm>SP</fnm>
               </au>
               <au>
                  <snm>Kowalak</snm>
                  <fnm>JA</fnm>
               </au>
            </aug>
            <source>J Proteome Res</source>
            <pubdate>2004</pubdate>
            <volume>3</volume>
            <issue>5</issue>
            <fpage>1002</fpage>
            <lpage>08</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1021/pr049920x</pubid>
                  <pubid idtype="pmpid" link="fulltext">15473689</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B48">
            <title>
               <p>Dynamic spectrum quality assessment and iterative computational analysis of shotgun proteomic data: toward more efficient identification of post-translational modifications, sequence polymorphisms, and novel peptides</p>
            </title>
            <aug>
               <au>
                  <snm>Nesvizhskii</snm>
                  <fnm>AI</fnm>
               </au>
               <au>
                  <snm>Roos</snm>
                  <fnm>FF</fnm>
               </au>
               <au>
                  <snm>Grossmann</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Vogelzang</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Eddes</snm>
                  <fnm>JS</fnm>
               </au>
               <au>
                  <snm>Gruissem</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Baginsky</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Aebersold</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Mol Cell Proteomics</source>
            <pubdate>2006</pubdate>
            <volume>5</volume>
            <issue>4</issue>
            <fpage>652</fpage>
            <lpage>70</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">16352522</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B49">
            <title>
               <p>Computational cluster validation in post-genomic data analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Handl</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Knowles</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Kell</snm>
                  <fnm>DB</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <issue>15</issue>
            <fpage>3201</fpage>
            <lpage>12</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bti517</pubid>
                  <pubid idtype="pmpid" link="fulltext">15914541</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B50">
            <title>
               <p>Partial least-squares regression: a tutorial</p>
            </title>
            <aug>
               <au>
                  <snm>Geladi</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Kowalski</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Analytica Chimica Acta</source>
            <pubdate>1986</pubdate>
            <volume>185</volume>
            <fpage>1</fpage>
            <lpage>17</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/0003-2670(86)80028-9</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B51">
            <title>
               <p>A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models</p>
            </title>
            <aug>
               <au>
                  <snm>Bilmes</snm>
                  <fnm>JA</fnm>
               </au>
            </aug>
            <publisher>International Computer Science Institute, Berkeley, California</publisher>
            <pubdate>1998</pubdate>
            <note>Technical Report TR-97-021</note>
         </bibl>
         <bibl id="B52">
            <title>
               <p>Linear and Nonlinear Programming</p>
            </title>
            <aug>
               <au>
                  <snm>Nash</snm>
                  <fnm>SG</fnm>
               </au>
               <au>
                  <snm>Sofer</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <publisher>McGraw-Hill</publisher>
            <pubdate>1996</pubdate>
         </bibl>
         <bibl id="B53">
            <title>
               <p>On optimal and data-based histograms</p>
            </title>
            <aug>
               <au>
                  <snm>Scott</snm>
                  <fnm>DW</fnm>
               </au>
            </aug>
            <source>Biometrika</source>
            <pubdate>1979</pubdate>
            <volume>66</volume>
            <fpage>605</fpage>
            <lpage>610</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1093/biomet/66.3.605</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B54">
            <title>
               <p>Density estimation for statistics and data analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Silverman</snm>
                  <fnm>BW</fnm>
               </au>
            </aug>
            <publisher>Chapman and Hall: London</publisher>
            <pubdate>1986</pubdate>
         </bibl>
         <bibl id="B55">
            <url>http://www.abrf.org/index.cfm/group.show/ProteomicsStandardsResearchGroup.47.htm</url>
         </bibl>
         <bibl id="B56">
            <title>
               <p>Analysis of human liver proteome using replicate shotgun strategy</p>
            </title>
            <aug>
               <au>
                  <snm>Chen</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Ying</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Song</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Liu</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Yang</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Wu</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Jiang</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Cai</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>He</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Qian</snm>
                  <fnm>X</fnm>
               </au>
            </aug>
            <source>Proteomics</source>
            <pubdate>2007</pubdate>
            <volume>7</volume>
            <issue>14</issue>
            <fpage>2479</fpage>
            <lpage>88</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/pmic.200600338</pubid>
                  <pubid idtype="pmpid" link="fulltext">17623305</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B57">
            <title>
               <p>The need for a public proteomics repository</p>
            </title>
            <aug>
               <au>
                  <snm>Prince</snm>
                  <fnm>JT</fnm>
               </au>
               <au>
                  <snm>Carlson</snm>
                  <fnm>MW</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Lu</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Marcotte</snm>
                  <fnm>EM</fnm>
               </au>
            </aug>
            <source>Nat Biotechnol</source>
            <pubdate>2004</pubdate>
            <volume>22</volume>
            <issue>4</issue>
            <fpage>471</fpage>
            <lpage>2</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nbt0404-471</pubid>
                  <pubid idtype="pmpid">15085804</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B58">
            <title>
               <p>A new strategy to filter out false positive identifications of peptides in SEQUEST database search results</p>
            </title>
            <aug>
               <au>
                  <snm>Zhang</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Xie</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Zhu</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>He</snm>
                  <fnm>F</fnm>
               </au>
            </aug>
            <source>Proteomics</source>
            <pubdate>2007</pubdate>
            <volume>7</volume>
            <issue>22</issue>
            <fpage>4036</fpage>
            <lpage>44</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/pmic.200600929</pubid>
                  <pubid idtype="pmpid" link="fulltext">17952874</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B59">
            <title>
               <p>Feature weighting in k-means clustering</p>
            </title>
            <aug>
               <au>
                  <snm>Modha</snm>
                  <fnm>DS</fnm>
               </au>
               <au>
                  <snm>Spangler</snm>
                  <fnm>WS</fnm>
               </au>
            </aug>
            <source>Machine Learning</source>
            <pubdate>2003</pubdate>
            <volume>52</volume>
            <issue>3</issue>
            <fpage>217</fpage>
            <lpage>237</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1023/A:1024016609528</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B60">
            <url>ftp://ftp.ebi.ac.uk/pub/databases/IPI/old/HUMAN/ipi.HUMAN.v3.19.fasta.gz</url>
         </bibl>
      </refgrp>
   </bm>
</art>
