<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art><ui>1471-2105-13-S16-S4</ui><ji>1471-2105</ji><fm>
<dochead>Review</dochead>
<bibl>
<title>
<p>Computational approaches to protein inference in shotgun proteomics</p>
</title>
<aug>
<au id="A1"><snm>Li</snm><mnm>Fuga</mnm><fnm>Yong</fnm><insr iid="I1"/></au>
<au ca="yes" id="A2"><snm>Radivojac</snm><fnm>Predrag</fnm><insr iid="I1"/><email>predrag@indiana.edu</email></au>
</aug>
<insg>
<ins id="I1"><p>School of Informatics and Computing, Indiana University, Bloomington 150 S. Woodlawn Avenue, Bloomington, Indiana, 47405, USA</p></ins>
</insg>
<source>BMC Bioinformatics</source>

<supplement><title><p>Statistical mass spectrometry-based proteomics</p></title><editor>Predrag Radivojac and Olga Vitek</editor><note>Research and reviews</note></supplement><issn>1471-2105</issn>
<pubdate>2012</pubdate>
<volume>13</volume>
<issue>Suppl 16</issue>
<fpage>S4</fpage>
<url>http://www.biomedcentral.com/1471-2105/13/S16/S4</url>
<xrefbib><pubidlist><pubid idtype="pmpid">23176300</pubid><pubid idtype="doi">10.1186/1471-2105-13-S16-S4</pubid></pubidlist></xrefbib>
</bibl>
<history><pub><date><day>5</day><month>11</month><year>2012</year></date></pub></history>
<cpyrt><year>2012</year><collab>Li and Radivojac; licensee BioMed Central Ltd.</collab><note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note></cpyrt>
<abs>
<sec>
<st>
<p>Abstract</p>
</st>
<p>Shotgun proteomics has recently emerged as a powerful approach to characterizing proteomes in biological samples. Its overall objective is to identify the form and quantity of each protein in a high-throughput manner by coupling liquid chromatography with tandem mass spectrometry. As a consequence of its high-throughput nature, shotgun proteomics faces challenges with respect to the analysis and interpretation of experimental data. Among such challenges, the identification of proteins present in a sample has been recognized as an important computational task. This task generally consists of (1) assigning experimental tandem mass spectra to peptides derived from a protein database, and (2) mapping assigned peptides to proteins and quantifying the confidence of identified proteins. Protein identification is fundamentally a statistical inference problem, and a number of methods have been proposed to address its challenges. In this review, we categorize current approaches into rule-based, combinatorial optimization and probabilistic inference techniques, and present them using integer programming and Bayesian inference frameworks. We also discuss the main challenges of protein identification and propose potential solutions with the goal of spurring innovative research in this area.</p>
</sec>
</abs>
</fm><bdy>
<sec>
<st>
<p>Introduction</p>
</st>
<p>The main objective of mass spectrometry-based proteomics is to provide a molecular snapshot of the form (e.g. splice isoforms, post-translational modifications), abundance level, and functional aspects (e.g. protein-protein interactions, protein localization) of each protein in a biological sample <abbrgrp>
<abbr bid="B1">1</abbr>
<abbr bid="B2">2</abbr>
<abbr bid="B3">3</abbr>
</abbrgrp>. Among proteomics strategies, bottom-up or shotgun proteomics has emerged as a high-throughput technology capable of characterizing hundreds of proteins at the same time. In this scenario, proteins in a sample are first digested into peptides, typically using site-specific proteolytic enzymes (e.g. trypsin). Peptides are then separated by liquid chromatography (LC) and analyzed by tandem mass spectrometry (MS/MS), resulting in a set of MS/MS spectra <abbrgrp>
<abbr bid="B4">4</abbr>
</abbrgrp>. In contrast to the top-down proteomics strategy, where intact proteins are directly analyzed through mass spectrometers, shotgun proteomics is characterized by high separation efficiency and mass spectral sensitivity. At the same time, it places higher demands on the computational and statistical techniques necessary for peptide identification, protein identification, and label-free quantification.</p>
<p>In a standard computational pipeline, MS/MS spectra from a mass spectrometer are searched against spectral libraries <abbrgrp>
<abbr bid="B5">5</abbr>
<abbr bid="B6">6</abbr>
<abbr bid="B7">7</abbr>
<abbr bid="B8">8</abbr>
</abbrgrp> and/or <it>in silico </it>spectra <abbrgrp>
<abbr bid="B9">9</abbr>
<abbr bid="B10">10</abbr>
<abbr bid="B11">11</abbr>
<abbr bid="B12">12</abbr>
<abbr bid="B13">13</abbr>
<abbr bid="B14">14</abbr>
</abbrgrp> corresponding to peptides from a protein database in order to provide <it>peptide-spectrum matches </it>(PSMs). Such a database search, depending on the parameters of the search and the MS/MS platform, can result in a large number of PSMs that are assigned scores indicating the confidence level of correct identification of the respective peptide. The next step is to assemble a list of <it>identified proteins </it>from all, or a subset of, PSMs and provide statistical confidence levels for each protein.</p>
<p>Protein identification is a special case of label-free protein quantification because, in an ideal scenario, each protein with a correctly inferred non-zero quantity (abundance) would be considered identified. However, label-free quantification has not yet reached the accuracy needed for the wide dynamic range of quantities observed in cellular or extracellular proteomes <abbrgrp>
<abbr bid="B15">15</abbr>
</abbrgrp>. In addition, in many practical situations it suffices to only consider the existence of proteins in the sample and not their exact quantity. Thus, solving the more general and significantly more difficult problem of quantification to provide a solution to its subproblem may result in less accurate solutions to protein identification.</p>
<p>Obtaining a list of identified proteins from a set of peptide sequences with identification scores may seem straightforward. However, there are several factors that combine to challenge such intuition: (1) Usually only a small number of peptide identifications, mostly unreliable, are available for each protein <abbrgrp>
<abbr bid="B16">16</abbr>
</abbrgrp>. This is because only the top-scoring PSMs for each peptide are typically included in the candidate set for peptide identifications, and among those candidates only a small subset is considered to be confidently identified. This makes it difficult to provide confident protein identifications, e.g. if only a single peptide is identified from a protein. (2) Peptides, even those from the same protein, are not equally likely to be identified in a proteomics experiment <abbrgrp>
<abbr bid="B17">17</abbr>
<abbr bid="B18">18</abbr>
<abbr bid="B19">19</abbr>
</abbrgrp>. The probability that a peptide is identified in a standard proteomics experiment has been referred to as <it>peptide detectability </it>
<abbrgrp>
<abbr bid="B19">19</abbr>
</abbrgrp>, see appendix. (3) Many peptide sequences encountered in a typical proteomics workflow can be mapped to more than one protein in a database. These are referred to as <it>degenerate </it>or <it>shared peptides </it>
<abbrgrp>
<abbr bid="B20">20</abbr>
<abbr bid="B21">21</abbr>
</abbrgrp>. It is common for a eukaryotic sample to contain more degenerate than <it>unique peptides</it>, i.e. peptides that can be mapped to only one protein. (4) It is non-trivial to estimate the false discovery rates (FDRs) of identified peptides and proteins. Some approaches to estimating peptide-level FDRs involve construction of decoy databases or use unsupervised estimation of class-conditional distributions (distributions of PSM scores given correct and false identifications, respectively). However, a large number of low-scoring PSMs may create difficulties in determining the certainty of both peptide and protein identification. While methods for the estimation of peptide-level FDRs have been characterized relatively well, computing protein-level FDRs remains an open problem <abbrgrp>
<abbr bid="B22">22</abbr>
<abbr bid="B23">23</abbr>
</abbrgrp>.</p>
<p>The process of identifying proteins that are present in a biological sample is now widely framed as a statistical inference problem, and has been referred to as the <it>protein inference problem </it>
<abbrgrp>
<abbr bid="B20">20</abbr>
<abbr bid="B21">21</abbr>
</abbrgrp>. To date, a number of approaches have been proposed to address this problem <abbrgrp>
<abbr bid="B20">20</abbr>
<abbr bid="B35">35</abbr>
<abbr bid="B36">36</abbr>
<abbr bid="B37">37</abbr>
</abbrgrp>. We categorize those approaches into three broad groups, noting that a particular method may exploit more than one strategy:</p>
<p indent="1">1. Rule-based strategies - methods that rely on a relatively small set of confidently identified (unique) peptides that are subsequently assigned to proteins.</p>
<p indent="1">2. Combinatorial optimization algorithms - methods that rely on constrained optimization formulations of the protein inference problem resulting, for example, in minimal protein lists that cover some or all confidently identified peptides.</p>
<p indent="1">3. Probabilistic inference algorithms - methods that formulate the problem probabilistically and assign identification probabilities for each protein in a database.</p>
<p>In the following sections, we provide justification for the development of advanced protein inference algorithms and then review the major computational strategies. All combinatorial optimization techniques are presented using a framework of integer programming; on the other hand, probabilistic algorithms are summarized using Bayesian inference principles. Our focus is also on the intuition behind the algorithms, the types of solutions generated, and the strengths and limitations of each method. We believe this information is essential in order to understand commonalities among the algorithms as well as their principal differences. It is also important for the proper interpretation of outputs from the various protein inference tools already applied in bottom-up proteomics.</p>
<sec>
<st>
<p>Notation</p>
</st>
<p>Before discussing algorithmic details, it is important to introduce notation that will be used throughout this paper. Let us consider a set of tandem mass spectra <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i1"><m:mrow>
   <m:mi mathvariant="script">S</m:mi>
</m:mrow>
</m:math>
</inline-formula> from a proteomics experiment and let <inline-formula>
<graphic file="1471-2105-13-S16-S4-i35.gif"/>
</inline-formula> be a database of proteins that the spectra are searched against. Let also <inline-formula>
<graphic file="1471-2105-13-S16-S4-i36.gif"/>
</inline-formula> be the set of all peptides in the database and, similarly, <inline-formula>
<graphic file="1471-2105-13-S16-S4-i37.gif"/>
</inline-formula> be the set of peptides that belong to protein <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i44"><m:mrow>
   <m:msub>
      <m:mi>P</m:mi>
      <m:mi>i</m:mi>
   </m:msub>
</m:mrow>
</m:math>
</inline-formula>. We now define two sets of indicator variables as follows:</p>
<p>
<display-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i2"><m:mrow>
   <m:msub>
      <m:mrow>
         <m:mi>t</m:mi>
      </m:mrow>
      <m:mrow>
         <m:mi>j</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:mfenced close="" open="{" separators="">
      <m:mrow>
         <m:mtable class="array" columnlines="none none none none none none none none none none none none none none none none none none none" equalcolumns="false" equalrows="false">
            <m:mtr>
               <m:mtd class="array" columnalign="center">
                  <m:mn>1</m:mn>
               </m:mtd>
            </m:mtr>
            <m:mtr>
               <m:mtd class="array" columnalign="center">
                  <m:mn>0</m:mn>
               </m:mtd>
            </m:mtr>
            <m:mtr>
               <m:mtd class="array" columnalign="center"/>
            </m:mtr>
         </m:mtable>
         <m:mtable class="array" columnlines="none none none none none none none none none none none none none none none none none none none" equalcolumns="false" equalrows="false">
            <m:mtr>
               <m:mtd class="array" columnalign="center">
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">if</m:mtext>
                  </m:mstyle>
                  <m:mspace class="thinspace" width="0.3em"/>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">peptide</m:mtext>
                  </m:mstyle>
                  <m:mspace class="thinspace" width="0.3em"/>
                  <m:msub>
                     <m:mrow>
                        <m:mi>p</m:mi>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>j</m:mi>
                     </m:mrow>
                  </m:msub>
                  <m:mspace class="thinspace" width="0.3em"/>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">is</m:mtext>
                  </m:mstyle>
                  <m:mspace class="thinspace" width="0.3em"/>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">confidently</m:mtext>
                  </m:mstyle>
                  <m:mspace class="thinspace" width="0.3em"/>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">identified</m:mtext>
                  </m:mstyle>
               </m:mtd>
            </m:mtr>
            <m:mtr>
               <m:mtd class="array" columnalign="center">
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">otherwise</m:mtext>
                  </m:mstyle>
               </m:mtd>
            </m:mtr>
            <m:mtr>
               <m:mtd class="array" columnalign="center"/>
            </m:mtr>
         </m:mtable>
      </m:mrow>
   </m:mfenced>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>and</p>
<p>
<display-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i3"><m:mrow>
   <m:msub>
      <m:mrow>
         <m:mi>x</m:mi>
      </m:mrow>
      <m:mrow>
         <m:mi>j</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:mfenced close="" open="{" separators="">
      <m:mrow>
         <m:mtable class="array" columnlines="none none none none none none none none none none none none none none none none none none none" equalcolumns="false" equalrows="false">
            <m:mtr>
               <m:mtd class="array" columnalign="center">
                  <m:mn>1</m:mn>
               </m:mtd>
            </m:mtr>
            <m:mtr>
               <m:mtd class="array" columnalign="center">
                  <m:mn>0</m:mn>
               </m:mtd>
            </m:mtr>
            <m:mtr>
               <m:mtd class="array" columnalign="center"/>
            </m:mtr>
         </m:mtable>
      </m:mrow>
   </m:mfenced>
   <m:mtable class="array" columnlines="none none none none none none none none none none none none none none none none none none none" equalcolumns="false" equalrows="false">
      <m:mtr>
         <m:mtd class="array" columnalign="center">
            <m:mstyle class="text">
               <m:mtext class="textsf" mathvariant="sans-serif">if</m:mtext>
            </m:mstyle>
            <m:mspace class="thinspace" width="0.3em"/>
            <m:mstyle class="text">
               <m:mtext class="textsf" mathvariant="sans-serif">peptide</m:mtext>
            </m:mstyle>
            <m:mspace class="thinspace" width="0.3em"/>
            <m:msub>
               <m:mrow>
                  <m:mi>p</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mi>j</m:mi>
               </m:mrow>
            </m:msub>
            <m:mspace class="thinspace" width="0.3em"/>
            <m:mstyle class="text">
               <m:mtext class="textsf" mathvariant="sans-serif">is</m:mtext>
            </m:mstyle>
            <m:mspace class="thinspace" width="0.3em"/>
            <m:mstyle class="text">
               <m:mtext class="textsf" mathvariant="sans-serif">present</m:mtext>
            </m:mstyle>
            <m:mspace class="thinspace" width="0.3em"/>
            <m:mstyle class="text">
               <m:mtext class="textsf" mathvariant="sans-serif">in</m:mtext>
            </m:mstyle>
            <m:mspace class="thinspace" width="0.3em"/>
            <m:mstyle class="text">
               <m:mtext class="textsf" mathvariant="sans-serif">the</m:mtext>
            </m:mstyle>
            <m:mspace class="thinspace" width="0.3em"/>
            <m:mstyle class="text">
               <m:mtext class="textsf" mathvariant="sans-serif">sample</m:mtext>
            </m:mstyle>
         </m:mtd>
      </m:mtr>
      <m:mtr>
         <m:mtd class="array" columnalign="center">
            <m:mstyle class="text">
               <m:mtext class="textsf" mathvariant="sans-serif">otherwise</m:mtext>
            </m:mstyle>
         </m:mtd>
      </m:mtr>
      <m:mtr>
         <m:mtd class="array" columnalign="center"/>
      </m:mtr>
   </m:mtable>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>Confident peptide identifications can be determined in several ways, typically by applying strict FDR thresholds to the top-scoring PSMs (per peptide), with the FDRs estimated using a decoy database <abbrgrp>
<abbr bid="B22">22</abbr>
</abbrgrp> or tools such as PeptideProphet <abbrgrp>
<abbr bid="B38">38</abbr>
</abbrgrp>, which calculate the posterior probability of a correct peptide identification. When posterior probabilities are available, stringent thresholds (e.g. 0.90) can be applied directly to those probabilities. Alternatively, sufficiently high scores from various search engines <abbrgrp>
<abbr bid="B9">9</abbr>
<abbr bid="B39">39</abbr>
<abbr bid="B40">40</abbr>
<abbr bid="B41">41</abbr>
<abbr bid="B42">42</abbr>
</abbrgrp> are sometimes used to select confident identifications.</p>
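<p>As a minimal sketch of the thresholding described above, the following Python fragment keeps the top-scoring PSM for each peptide and sets the indicator for that peptide to 1 when its posterior probability reaches a chosen threshold. The function name, the tuple layout of a PSM, the example peptide sequences and the probabilities are all illustrative assumptions and do not correspond to the output format of any particular search engine or of PeptideProphet.</p>
<p>
```python
from collections import defaultdict


def confident_peptides(psms, threshold=0.90):
    """Return a dict {peptide: t_j} based on the top-scoring PSM per peptide.

    psms: iterable of (spectrum_id, peptide, posterior_probability) tuples.
    The tuple layout is an assumption made for this sketch.
    """
    best = defaultdict(float)  # highest posterior probability seen per peptide
    for _spectrum_id, peptide, prob in psms:
        best[peptide] = max(best[peptide], prob)
    # t_j = 1 if the best PSM of peptide j passes the threshold, else 0
    return {pep: int(p >= threshold) for pep, p in best.items()}


# Hypothetical PSMs: two spectra matched to the same peptide, two others once.
psms = [
    ("s1", "LVNELTEFAK", 0.97),
    ("s2", "LVNELTEFAK", 0.55),
    ("s3", "AEFVEVTK", 0.42),
    ("s4", "YLYEIAR", 0.91),
]
t = confident_peptides(psms)
confident_set = {pep for pep, flag in t.items() if flag == 1}
```
</p>
<p>Here the first peptide is retained because its best PSM exceeds the threshold even though a second spectrum matched it with low confidence. A score-based or FDR-based cutoff could be substituted for the posterior probability without changing the structure of the sketch.</p>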
<p>It is important to mention that variables <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i45"><m:mrow>
   <m:msub>
      <m:mi>t</m:mi>
      <m:mi>j</m:mi>
   </m:msub>
</m:mrow>
</m:math>
</inline-formula> and <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i46"><m:mrow>
   <m:msub>
      <m:mi>x</m:mi>
      <m:mi>j</m:mi>
   </m:msub>
</m:mrow>
</m:math>
</inline-formula> are different. For example, a peptide <it>p<sub>j </sub>
</it>that is confidently identified, e.g. using an FDR threshold of 0.01, will result in setting <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i47"><m:mrow>
   <m:msub>
      <m:mi>t</m:mi>
      <m:mi>j</m:mi>
   </m:msub>
   <m:mo>=</m:mo>
   <m:mn>1</m:mn>
</m:mrow>
</m:math>
</inline-formula>. On the other hand, <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i46">
<m:mrow>
<m:msub>
<m:mi>x</m:mi>
<m:mi>j</m:mi>
</m:msub>
</m:mrow>
</m:math>
</inline-formula> can be seen as a hidden variable that is to be inferred. Accordingly, <inline-formula>
<graphic file="1471-2105-13-S16-S4-i38.gif"/>
</inline-formula> refers to the probability that peptide <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i48"><m:mrow>
   <m:mi>&#160;j</m:mi>
</m:mrow>
</m:math>
</inline-formula> is present in the sample given all the data from the mass spectrometer. The set of confidently identified peptides, obtained using any of the above-mentioned approaches, will be denoted as <inline-formula>
<graphic file="1471-2105-13-S16-S4-i39.gif"/>
</inline-formula>.</p>
<p>In some situations it will be necessary to consider peptides with explicit designations of their parent proteins. In those cases, the <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i48">
<m:mrow>
<m:mi>&#160;j</m:mi>
</m:mrow>
</m:math>
</inline-formula>-th peptide derived from protein <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i44">
<m:mrow>
<m:msub>
<m:mi>P</m:mi>
<m:mi>i</m:mi>
</m:msub>
</m:mrow>
</m:math>
</inline-formula> will be denoted as <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i49"><m:mrow>
   <m:msub>
      <m:mi>p</m:mi>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>j</m:mi>
      </m:mrow>
   </m:msub>
</m:mrow>
</m:math>
</inline-formula>. Two or more such peptides will be allowed to have identical amino acid sequences. For example, peptides <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i49">
<m:mrow>
<m:msub>
<m:mi>p</m:mi>
<m:mrow>
<m:mi>i</m:mi>
<m:mi>j</m:mi>
</m:mrow>
</m:msub>
</m:mrow>
</m:math>
</inline-formula> and <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i50"><m:mrow>
   <m:msub>
      <m:mi>p</m:mi>
      <m:mrow>
         <m:mi>k</m:mi>
         <m:mi>l</m:mi>
      </m:mrow>
   </m:msub>
</m:mrow>
</m:math>
</inline-formula> (<inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i51"><m:mrow>
   <m:mi>&#160;i</m:mi>
</m:mrow>
</m:math>
</inline-formula> &#8800; <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i52"><m:mrow>
   <m:mi>&#160;k</m:mi>
</m:mrow>
</m:math>
</inline-formula>) with identical amino acid sequences will be referred to as degenerate peptides. In the context of protein inference, peptides that occur multiple times only within a single protein will not be considered degenerate. Finally, we define</p>
<p>
<display-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i4"><m:mrow>
   <m:msub>
      <m:mrow>
         <m:mi>y</m:mi>
      </m:mrow>
      <m:mrow>
         <m:mi>i</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:mfenced close="" open="{" separators="">
      <m:mrow>
         <m:mtable class="array" columnlines="none none none none none none none none none none none none none none none none none none none" equalcolumns="false" equalrows="false">
            <m:mtr>
               <m:mtd class="array" columnalign="center">
                  <m:mn>1</m:mn>
               </m:mtd>
            </m:mtr>
            <m:mtr>
               <m:mtd class="array" columnalign="center">
                  <m:mn>0</m:mn>
               </m:mtd>
            </m:mtr>
            <m:mtr>
               <m:mtd class="array" columnalign="center"/>
            </m:mtr>
         </m:mtable>
         <m:mtable class="array" columnlines="none none none none none none none none none none none none none none none none none none none" equalcolumns="false" equalrows="false">
            <m:mtr>
               <m:mtd class="array" columnalign="center">
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">if</m:mtext>
                  </m:mstyle>
                  <m:mspace class="thinspace" width="0.3em"/>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">protein</m:mtext>
                  </m:mstyle>
                  <m:mspace class="thinspace" width="0.3em"/>
                  <m:msub>
                     <m:mrow>
                        <m:mi>P</m:mi>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>i</m:mi>
                     </m:mrow>
                  </m:msub>
                  <m:mspace class="thinspace" width="0.3em"/>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">is</m:mtext>
                  </m:mstyle>
                  <m:mspace class="thinspace" width="0.3em"/>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">present</m:mtext>
                  </m:mstyle>
                  <m:mspace class="thinspace" width="0.3em"/>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">in</m:mtext>
                  </m:mstyle>
                  <m:mspace class="thinspace" width="0.3em"/>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">the</m:mtext>
                  </m:mstyle>
                  <m:mspace class="thinspace" width="0.3em"/>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">sample</m:mtext>
                  </m:mstyle>
               </m:mtd>
            </m:mtr>
            <m:mtr>
               <m:mtd class="array" columnalign="center">
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">otherwise</m:mtext>
                  </m:mstyle>
               </m:mtd>
            </m:mtr>
            <m:mtr>
               <m:mtd class="array" columnalign="center"/>
            </m:mtr>
         </m:mtable>
      </m:mrow>
   </m:mfenced>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>Variable <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i53"><m:mrow>
   <m:msub>
      <m:mi>y</m:mi>
      <m:mi>i</m:mi>
   </m:msub>
</m:mrow>
</m:math>
</inline-formula> can be seen as an equivalent of <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i46">
<m:mrow>
<m:msub>
<m:mi>x</m:mi>
<m:mi>j</m:mi>
</m:msub>
</m:mrow>
</m:math>
</inline-formula> at the protein level. Thus, <inline-formula>
<graphic file="1471-2105-13-S16-S4-i40.gif"/>
</inline-formula> is the posterior probability that protein <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i44">
<m:mrow>
<m:msub>
<m:mi>P</m:mi>
<m:mi>i</m:mi>
</m:msub>
</m:mrow>
</m:math>
</inline-formula> is present in the sample. The summary of notation and abbreviations is shown in Table <tblr tid="T1">1</tblr>.</p>
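<p>The distinction between unique and degenerate peptides introduced in this section can be illustrated with a short Python sketch. It assumes a toy database given as a mapping from protein identifiers to lists of peptide sequences; the identifiers, the sequences and the function name are hypothetical and serve only to make the definition concrete.</p>
<p>
```python
from collections import defaultdict


def split_unique_degenerate(protein_peptides):
    """Partition peptide sequences into unique and degenerate sets.

    protein_peptides: dict mapping a protein id to a list of peptide sequences
    (an assumed input format for this sketch).
    """
    parents = defaultdict(set)  # peptide sequence -> set of parent proteins
    for protein, peptides in protein_peptides.items():
        for seq in peptides:
            # Repeats within a single protein collapse into one parent,
            # so they are not treated as degenerate.
            parents[seq].add(protein)
    unique = {seq for seq, ps in parents.items() if len(ps) == 1}
    degenerate = {seq for seq, ps in parents.items() if len(ps) >= 2}
    return unique, degenerate


# Toy database: "CDEK" occurs twice in P1 only; "GHK" is shared by P2 and P3.
database = {
    "P1": ["AAK", "CDEK", "CDEK"],
    "P2": ["CDER", "GHK"],
    "P3": ["GHK", "LMNR"],
}
unique, degenerate = split_unique_degenerate(database)
```
</p>
<p>In this example only the peptide shared between two different proteins is classified as degenerate, whereas the peptide repeated within a single protein remains unique, in line with the convention adopted above.</p>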
<tbl id="T1"><title><p>Table 1</p></title><caption><p>Summary of notation and abbreviations used throughout this paper.</p></caption><tblbdy cols="2">
      <r>
         <c ca="left">
            <p>Notation</p>
         </c>
         <c ca="left">
            <p>Description</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <inline-formula>
                  <graphic file="1471-2105-13-S16-S4-i56.gif"/>
               </inline-formula>
            </p>
         </c>
         <c ca="left">
            <p>Set of all fragmentation spectra output by the mass spectrometer</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <inline-formula>
                  <graphic file="1471-2105-13-S16-S4-i57.gif"/>
               </inline-formula>
            </p>
         </c>
         <c ca="left">
            <p>Set of spectra identified for peptide <it>j</it></p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <it>s</it>
            </p>
         </c>
         <c ca="left">
            <p>A single fragmentation spectrum, <inline-formula><graphic file="1471-2105-13-S16-S4-i58.gif"/></inline-formula></p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p><inline-formula><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i44"><m:mrow><m:msub><m:mi>P</m:mi><m:mi>i</m:mi></m:msub></m:mrow></m:math></inline-formula> or <it>i</it></p>
         </c>
         <c ca="left">
            <p>Protein <it>i</it></p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p><it>p<sub>j </sub></it>or <it>j</it></p>
         </c>
         <c ca="left">
            <p>Peptide <it>j</it></p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <it>p<sub>ij</sub></it>
            </p>
         </c>
         <c ca="left">
            <p>Peptide <it>j </it>derived from protein <it>i</it>; used to explicitly indicate the parent protein for peptide <it>j</it></p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <inline-formula>
                  <graphic file="1471-2105-13-S16-S4-i59.gif"/>
               </inline-formula>
            </p>
         </c>
         <c ca="left">
            <p>Protein database, a set of proteins used for peptide and protein identification</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <inline-formula>
                  <graphic file="1471-2105-13-S16-S4-i61.gif"/>
               </inline-formula>
            </p>
         </c>
         <c ca="left">
            <p>Peptide database, the set of all (tryptic) peptides derived from <inline-formula><graphic file="1471-2105-13-S16-S4-i60.gif"/></inline-formula></p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <inline-formula>
                  <graphic file="1471-2105-13-S16-S4-i62.gif"/>
               </inline-formula>
            </p>
         </c>
         <c ca="left">
            <p>Set of peptides derived from protein <inline-formula><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i44"><m:mrow><m:msub><m:mi>P</m:mi><m:mi>i</m:mi></m:msub></m:mrow></m:math></inline-formula></p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <inline-formula>
                  <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i45">
                     <m:mrow>
                        <m:msub>
                           <m:mi>t</m:mi>
                           <m:mi>j</m:mi>
                        </m:msub>
                     </m:mrow>
                  </m:math>
               </inline-formula>
            </p>
         </c>
         <c ca="left">
            <p>Indicator variable, set to 1 if peptide <it>p<sub>j</sub></it> is confidently identified</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <inline-formula>
                  <graphic file="1471-2105-13-S16-S4-i63.gif"/>
               </inline-formula>
            </p>
         </c>
         <c ca="left">
            <p>Set of peptides that are confidently identified</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <it>x<sub>j</sub></it>
            </p>
         </c>
         <c ca="left">
            <p>Indicator variable, set to 1 if <inline-formula><graphic file="1471-2105-13-S16-S4-i64.gif"/></inline-formula> is present in the sample</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <it>y<sub>i</sub></it>
            </p>
         </c>
         <c ca="left">
            <p>Indicator variable, set to 1 if <inline-formula><graphic file="1471-2105-13-S16-S4-i65.gif"/></inline-formula> is present in the sample</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p><it>x </it>= (<it>x</it><sub>1</sub>, ... , <it>x<sub>j </sub></it>, ...)</p>
         </c>
         <c ca="left">
            <p>Indicator vector representing all peptides in <inline-formula><graphic file="1471-2105-13-S16-S4-i66.gif"/></inline-formula></p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p><it>y </it>= (<it>y</it><sub>1</sub>, ... , <it>y<sub>i </sub></it>, ...)</p>
         </c>
         <c ca="left">
            <p>Indicator vector representing all proteins in <inline-formula><graphic file="1471-2105-13-S16-S4-i60.gif"/></inline-formula></p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p><it>N</it>(<it>i</it>)</p>
         </c>
         <c ca="left">
            <p>Set of peptides mapped to protein <it>P<sub>i</sub></it></p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p><it>N</it>(<it>j</it>)</p>
         </c>
         <c ca="left">
            <p>Set of proteins that contain peptide <it>p<sub>j</sub></it></p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <it>x</it>
               <sub><it>N</it>(<it>i</it>)</sub>
            </p>
         </c>
         <c ca="left">
            <p>Indicator vector representing peptides in <inline-formula><graphic file="1471-2105-13-S16-S4-i62.gif"/></inline-formula></p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <inline-formula>
                  <graphic file="1471-2105-13-S16-S4-i67.gif"/>
               </inline-formula>
            </p>
         </c>
         <c ca="left">
            <p>Peptide identification probability, the probability that peptide <it>j </it>is present in the sample given the spectra identified for peptide <it>j</it></p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p><it>P </it>(<it>x<sub>j </sub></it>= 1|<it>s</it>)</p>
         </c>
         <c ca="left">
            <p>The probability that the PSM is correct when peptide <it>j </it>is the top-scoring match of spectrum <it>s</it></p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <inline-formula>
                  <graphic file="1471-2105-13-S16-S4-i68.gif"/>
               </inline-formula>
            </p>
         </c>
         <c ca="left">
            <p>Protein posterior probability, the probability that protein <it>i </it>is present in the sample given all spectra</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p><it>d<sub>ij </sub></it>(<it>q</it>)</p>
         </c>
         <c ca="left">
            <p>Detectability of peptide <it>p<sub>ij </sub></it>at some specified quantity <it>q</it>; effective detectability</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <inline-formula>
                  <graphic file="1471-2105-13-S16-S4-i69.gif"/>
               </inline-formula>
            </p>
         </c>
         <c ca="left">
            <p>Detectability of peptide <it>p<sub>ij </sub></it>at standard quantity <it>q</it><sup>0 </sup>; standard detectability</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <it>d<sub>ij</sub></it>
            </p>
         </c>
         <c ca="left">
            <p>Detectability of peptide <it>p<sub>ij</sub></it>; effective detectability</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <it>NSP<sub>ij</sub></it>
            </p>
         </c>
         <c ca="left">
            <p>The estimated number of (identified) sibling peptides of peptide <it>p<sub>ij</sub></it>, used by ProteinProphet to adjust the peptide identification probability</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>PSM</p>
         </c>
         <c ca="left">
            <p>Peptide-spectrum match; when it is clear from the context, we use PSM to also refer to the top-scoring PSM per spectrum</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>FDR</p>
         </c>
         <c ca="left">
            <p>False discovery rate; the fraction of incorrect peptide identifications in <inline-formula><graphic file="1471-2105-13-S16-S4-i70.gif"/></inline-formula> or the fraction of incorrect protein identifications in a given list outputted by a protein inference algorithm. FDR should be distinguished from the false positive rate (FPR), the fraction of all peptides (proteins) from the database that are not present in the sample but are predicted to be present (at a particular threshold).</p>
         </c>
      </r>
   </tblbdy></tbl>
</sec>
</sec>
<sec>
<st>
<p>Protein inference: significance and algorithms</p>
</st>
<p>Our first goal is to investigate the influence of degenerate peptides and to show that their presence is often a major factor contributing to the challenges in protein inference. We analyze several cellular and serum samples and characterize the peptide identification process. The data include cell line and plasma samples from <it>Homo sapiens </it>
<abbrgrp>
<abbr bid="B16">16</abbr>
</abbrgrp>, a tissue sample from <it>Mus musculus </it>
<abbrgrp>
<abbr bid="B43">43</abbr>
</abbrgrp>, as well as samples from <it>Saccharomyces cerevisiae </it>
<abbrgrp>
<abbr bid="B44">44</abbr>
</abbrgrp> and <it>Deinococcus radiodurans </it>
<abbrgrp>
<abbr bid="B24">24</abbr>
</abbrgrp>. The sets of spectra were searched using MASCOT <abbrgrp>
<abbr bid="B39">39</abbr>
</abbrgrp> against the human IPI database (v3.35), mouse IPI database (v3.35), Saccharomyces Genome Database (R63, 05-Jan-2010), and <it>D. radiodurans </it>proteins extracted from GenBank (27-Aug-2009), respectively.</p>
<p>Figure <figr fid="F1">1A</figr> shows the percentage of identified peptides per protein for an FDR of 0.01 (on the unique peptide level) when using a reversed database as decoy. We observe that 32-63% of proteins are covered by only one confidently identified peptide, while 5-36% of proteins are covered by five peptides or more. Figure <figr fid="F1">1B</figr> shows the percentage of degenerate peptides in each sample. The results indicate that 57-68% of peptides in human and mouse samples are degenerate, regardless of the type of biological sample (e.g. cell line vs. tissue vs. plasma). On the other hand, the yeast and <it>D. radiodurans </it>data sets contain only 18% and 1% of degenerate peptides, respectively. Figure <figr fid="F1">1C</figr> provides the percentage of candidate proteins hit by unique peptides. In mouse and human samples more than 80% of candidate proteins are identified only with degenerate peptides. This percentage decreases to 23% for yeast and 3% for <it>D</it>. <it>radiodurans</it>. Finally, in Figure <figr fid="F1">1D</figr> we provide the percentage of protein groups of a particular size, where a group is formed from the set of proteins that are hit by exactly the same peptides. In accordance with previous results, most of the yeast and <it>D. radiodurans </it>candidate proteins are distinguishable; however, for human and mouse samples, between 30% and 50% of protein groups contain multiple proteins.</p>
<fig id="F1"><title><p>Figure 1</p></title><caption><p>Summary of peptide identification results over five data sets using a false discovery rate of 0.01 and a reversed database.</p></caption><text>
   <p><b>Summary of peptide identification results over five data sets using a false discovery rate of 0.01 and a reversed database</b>. (A) Percentage of identified peptides per protein in each sample; (B) percentage of degenerate peptides in each sample; (C) percentage of all proteins hit by at least one unique peptide, calculated as the number of proteins with at least one unique peptide divided by the number of all proteins hit by at least one peptide; (D) percentage of protein groups of a particular size, where groups consist of proteins identified by the same set of peptides. The numbers of identified peptides and proteins in each sample were as follows: (3006, 1898) in human cell line; (700, 390) in human plasma; (5062, 4331) in mouse liver; (4012, 1154) in yeast; and (969, 368) in <it>D. radiodurans</it>. Peptide identifications correspond to charges +1, +2, and +3, combined.</p>
</text><graphic file="1471-2105-13-S16-S4-1"/></fig>
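<p>The quantities summarized in Figure 1 can be derived from a peptide-to-protein mapping alone. The following is a minimal Python sketch of that bookkeeping, not the code used to produce the figure. The function name summarize_mapping and the toy input are hypothetical, and the input is assumed to contain only peptides that passed the FDR threshold.</p>

```python
from collections import Counter, defaultdict


def summarize_mapping(peptide_to_proteins):
    """Summarize degeneracy for a dict: peptide -> set of parent proteins."""
    # Invert the mapping: protein -> set of identified peptides.
    protein_to_peptides = defaultdict(set)
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides[protein].add(peptide)

    # A peptide is degenerate if it maps to more than one protein.
    degenerate = {p for p, prots in peptide_to_proteins.items() if len(prots) > 1}
    unique = set(peptide_to_proteins) - degenerate

    # Proteins hit by at least one unique peptide.
    with_unique = {
        prot
        for prot, peps in protein_to_peptides.items()
        if peps.intersection(unique)
    }

    # Proteins hit by exactly the same peptides form one group.
    groups = defaultdict(list)
    for prot, peps in protein_to_peptides.items():
        groups[frozenset(peps)].append(prot)
    group_sizes = Counter(len(members) for members in groups.values())

    return {
        "fraction_degenerate_peptides": len(degenerate) / len(peptide_to_proteins),
        "fraction_proteins_with_unique_peptide": len(with_unique) / len(protein_to_peptides),
        "group_size_counts": dict(group_sizes),
    }


# Toy example (hypothetical data): p1 and p2 are shared by P1 and P2.
toy = {"p1": {"P1", "P2"}, "p2": {"P1", "P2"}, "p3": {"P3"}, "p4": {"P3", "P4"}}
stats = summarize_mapping(toy)
```

<p>In this toy input three of the four peptides are degenerate, only one of the four proteins has a unique peptide, and P1 and P2 fall into a single group of size two.</p>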
<p>This analysis provides evidence that protein inference is a non-trivial problem, especially for multicellular eukaryotes that are known to contain large numbers of paralogous proteins. It also emphasizes the importance of developing sophisticated protein inference algorithms.</p>
<sec>
<st>
<p>Rule-based approaches</p>
</st>
<p>With a typical LC-MS/MS experiment resulting in a potentially large number of protein identifications, concerns were raised regarding the impact of misidentified proteins on biomedical science <abbrgrp>
<abbr bid="B45">45</abbr>
</abbrgrp>. In response to this, several guidelines were proposed regarding the standards for publishing proteomics results <abbrgrp>
<abbr bid="B46">46</abbr>
<abbr bid="B47">47</abbr>
<abbr bid="B48">48</abbr>
<abbr bid="B49">49</abbr>
</abbrgrp>. The so-called "two-peptide rule" or two-hit rule, requiring two or more confidently identified peptides to define a confident protein identification, was advocated <abbrgrp>
<abbr bid="B46">46</abbr>
<abbr bid="B48">48</abbr>
</abbrgrp>. The same guidelines also recommended the parsimony principle (see the next section) as an explanation for the confident peptide identifications, and suggested that a "protein family" (proteins with similar sequences due to single amino acid variants, homologs, splicing variants, or annotation mistakes) should be reported as one group if the proteins share the same identified peptides.</p>
<p>There is a good rationale for using the two-peptide rule. In principle, one correct unique peptide should be sufficient to correctly identify a protein. However, even when the FDR of a peptide set is low, a large data set still contains many incorrectly identified peptides. Furthermore, proteins identified by single peptide hits are more likely to be incorrectly identified than proteins with higher peptide coverage <abbrgrp>
<abbr bid="B45">45</abbr>
</abbrgrp>. It was reported that FDRs for single-hit proteins can be over 10 times higher than FDRs at the PSM level <abbrgrp>
<abbr bid="B50">50</abbr>
</abbrgrp>, likely due to the clustering of correct peptide identifications to the correct proteins and the lack of clustering behavior for the incorrect peptides <abbrgrp>
<abbr bid="B50">50</abbr>
<abbr bid="B51">51</abbr>
</abbrgrp>.</p>
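<p>As an illustration, the two-peptide rule reduces to a simple count of supporting peptides per protein. The sketch below is a minimal Python version under stated assumptions: the function name two_peptide_rule and the toy data are hypothetical, and degenerate peptides are counted toward every parent protein, a choice that the rule as described above does not prescribe.</p>

```python
from collections import defaultdict


def two_peptide_rule(confident_peptides, peptide_to_proteins, min_peptides=2):
    """Return proteins supported by at least `min_peptides` confident peptides."""
    support = defaultdict(set)
    for peptide in confident_peptides:
        # Assumption: a shared peptide supports each of its parent proteins.
        for protein in peptide_to_proteins.get(peptide, ()):
            support[protein].add(peptide)
    return {prot for prot, peps in support.items() if len(peps) >= min_peptides}


# Toy example (hypothetical data).
confident = {"p1", "p2", "p3"}
mapping = {"p1": {"A"}, "p2": {"A", "B"}, "p3": {"C"}}
accepted = two_peptide_rule(confident, mapping)
```

<p>Here only protein A is retained; B and C are single-hit proteins and are discarded regardless of how well their peptides scored.</p>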
<p>However, the two-peptide rule has been challenged <abbrgrp>
<abbr bid="B51">51</abbr>
<abbr bid="B52">52</abbr>
</abbrgrp>. First, while including single-hit proteins without stringent quality control can compromise specificity, ignoring such proteins will certainly compromise sensitivity <abbrgrp>
<abbr bid="B52">52</abbr>
</abbrgrp>. Second, controlling the confidence (FDR) at the peptide level and then deducing the proteins using heuristic rules leads to undefined FDRs at the protein level <abbrgrp>
<abbr bid="B27">27</abbr>
<abbr bid="B50">50</abbr>
<abbr bid="B51">51</abbr>
<abbr bid="B52">52</abbr>
</abbrgrp>. On the other hand, controlling FDR directly at the protein level may rescue some of the confident single-hit proteins. Indeed, Gupta and Pevzner demonstrated that using the "single-peptide rule" results in 10-40% more protein identifications compared with the two-peptide rule at a fixed FDR level <abbrgrp>
<abbr bid="B52">52</abbr>
</abbrgrp>. The single-peptide rule simply uses the highest-scoring peptide from a protein as a score for that protein, and then directly estimates FDR at the protein level (rather than at the peptide level) using decoy databases. Thus, any protein that has one or more peptides with a score above a certain threshold is deemed confident. This rule may seem problematic because proteins hit by single peptides are generally considered less reliable. However, two mediocre peptides are not necessarily better than one good peptide; thus, many proteins hit by a single peptide can be rescued with more stringent score thresholds. Since a significant portion of such proteins is correct <abbrgrp>
<abbr bid="B53">53</abbr>
</abbrgrp>, it is not surprising that the single-peptide rule leads to more protein identifications.</p>
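<p>The single-peptide rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the published implementation: the function names, the "REV_" decoy prefix, and the toy scores are hypothetical, and the protein-level FDR is estimated as the number of decoy proteins divided by the number of target proteins above the threshold, which is one common target-decoy estimator. Ties in score are not treated specially.</p>

```python
def protein_scores(psm_scores, peptide_to_proteins):
    """Score each protein by its best-scoring peptide (single-peptide rule)."""
    best = {}
    for peptide, score in psm_scores.items():
        for protein in peptide_to_proteins[peptide]:
            if score > best.get(protein, float("-inf")):
                best[protein] = score
    return best


def accept_at_fdr(best, is_decoy, max_fdr=0.01):
    """Return target proteins above the loosest threshold that meets max_fdr.

    Assumption: FDR is estimated as (#decoys) / (#targets) above the threshold.
    """
    ranked = sorted(best, key=best.get, reverse=True)
    targets = decoys = accepted_count = 0
    for protein in ranked:
        if is_decoy(protein):
            decoys += 1
        else:
            targets += 1
        # Remember the longest prefix of the ranking that satisfies the bound.
        if targets > 0 and max_fdr >= decoys / targets:
            accepted_count = targets + decoys
    return [p for p in ranked[:accepted_count] if not is_decoy(p)]


def is_decoy(name):
    # Hypothetical naming convention for reversed-database entries.
    return name.startswith("REV_")


# Toy example (hypothetical scores).
psm_scores = {"p1": 60, "p2": 45, "p3": 30, "p4": 20, "p5": 25}
mapping = {"p1": {"A"}, "p2": {"B"}, "p3": {"REV_X"}, "p4": {"C"}, "p5": {"A"}}
best = protein_scores(psm_scores, mapping)
confident = accept_at_fdr(best, is_decoy, max_fdr=0.01)
```

<p>With the stringent bound only A and B are accepted, each on the strength of a single high-scoring peptide; relaxing max_fdr to 0.4 in this toy example also admits C.</p>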
<p>With the help of protein-level FDR estimation (using a decoy database), better and more complex rules may be devised to achieve even higher sensitivity. For example, Weatherly et al. proposed setting separate score thresholds for proteins with different numbers of confident peptide identifications <abbrgrp>
<abbr bid="B51">51</abbr>
</abbrgrp>. They reported that progressively lower score thresholds sufficed for proteins with higher coverage. For a coverage of 1 (i.e. proteins hit by single peptides), a MASCOT score of 44 was required, while for a coverage of 6, a MASCOT score as low as 11 was sufficient for the same FDR <abbrgrp>
<abbr bid="B51">51</abbr>
</abbrgrp>.</p>
<p>Despite the relative simplicity of rule-based approaches, the performance of heuristic rules is fundamentally limited by the lack of rigorous treatment and proper combination of the peptide identification scores and prior knowledge.</p>
</sec>
<sec>
<st>
<p>Combinatorial optimization algorithms</p>
</st>
<p>The input to this class of algorithms typically consists of a set of confidently identified peptides <inline-formula>
<graphic file="1471-2105-13-S16-S4-i41.gif"/>
</inline-formula> = <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i54"><m:mrow>
   <m:mo>{</m:mo>
   <m:mrow>
      <m:msub>
         <m:mi>p</m:mi>
         <m:mi>j</m:mi>
      </m:msub>
      <m:mo>|</m:mo>
      <m:msub>
         <m:mi>t</m:mi>
         <m:mi>j</m:mi>
      </m:msub>
      <m:mo>=</m:mo>
      <m:mn>1</m:mn>
   </m:mrow>
   <m:mo>}</m:mo>
</m:mrow>
</m:math>
</inline-formula> and a protein database <inline-formula>
<graphic file="1471-2105-13-S16-S4-i42.gif"/>
</inline-formula>. The objective is to provide a list of proteins that optimizes certain criteria. In one way or another, all such formulations result in NP-hard problems and are usually solved using approximation algorithms.</p>
<sec>
<st>
<p>The minimum set cover formulation</p>
</st>
<p>
<b>
<it>Minimum set cover (MSC) problem: </it>
</b>Given a set of confident peptide identifications <inline-formula>
<graphic file="1471-2105-13-S16-S4-i41.gif"/>
</inline-formula> and protein database <inline-formula>
<graphic file="1471-2105-13-S16-S4-i42.gif"/>
</inline-formula>, find a smallest protein list <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i30"><m:mrow>
   <m:mi mathvariant="script">L</m:mi>
</m:mrow>
</m:math>
</inline-formula> &#8838; <inline-formula>
<graphic file="1471-2105-13-S16-S4-i42.gif"/>
</inline-formula> such that each peptide from <inline-formula>
<graphic file="1471-2105-13-S16-S4-i41.gif"/>
</inline-formula> is assigned to at least one protein from <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i30">
<m:mrow>
<m:mi mathvariant="script">L</m:mi>
</m:mrow>
</m:math>
</inline-formula>. More formally,</p>
<p>
<display-formula>
<graphic file="1471-2105-13-S16-S4-i5.gif"/>
</display-formula>
</p>
<p>This protein inference formulation is identical to the classical computer science problem of minimum set cover, where given a set of elements (peptides) <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i31"><m:mrow>
   <m:mi mathvariant="script">U</m:mi>
</m:mrow>
</m:math>
</inline-formula> and a set of subsets (proteins) over <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i31">
<m:mrow>
<m:mi mathvariant="script">U</m:mi>
</m:mrow>
</m:math>
</inline-formula>, the goal is to find a smallest (not necessarily unique) set of subsets that contain all elements in <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i31">
<m:mrow>
<m:mi mathvariant="script">U</m:mi>
</m:mrow>
</m:math>
</inline-formula>. It is convenient to visualize the MSC formulation using bipartite graphs (Figure <figr fid="F2">2A</figr>). Using the graph representation, it is relatively easy to see that an optimal solution to the MSC problem can also be obtained by dividing the original graph into connected components and solving the MSC problem separately for each component.</p>
<fig id="F2"><title><p>Figure 2</p></title><caption><p>(A) A bipartite graph showing two connected components with five identified peptides and four proteins that contain these peptides. (B) An expanded bipartite graph showing the situation corresponding to the first connected component in panel A, with unidentified peptides added. The unidentified peptides are connected to their parent proteins using dashed lines.</p></caption><text>
   <p>(A) A bipartite graph showing two connected components with five identified peptides and four proteins that contain these peptides. (B) An expanded bipartite graph showing the situation corresponding to the first connected component in panel A, with unidentified peptides added. The unidentified peptides are connected to their parent proteins using dashed lines.</p>
</text><graphic file="1471-2105-13-S16-S4-2"/></fig>
<p>The MSC approach has been implemented in the IDPicker software <abbrgrp>
<abbr bid="B54">54</abbr>
<abbr bid="B55">55</abbr>
</abbrgrp>. IDPicker, however, also contains several heuristics that further simplify the solution and its interpretation. The algorithm starts by collapsing the peptide-protein bipartite graph such that all peptides/proteins connected to the same proteins/peptides form group nodes containing multiple peptides or proteins. It then finds a set of disconnected subgraphs within the bipartite graph using a depth-first search. Finally, it performs an MSC optimization in each of those subgraphs. IDPicker also goes beyond the inference algorithm itself: it contains modules for determining confidently identified peptides (using an FDR-based approach), for combining scores from multiple search engines, and for visualizing results.</p>
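<p>The three steps described above (collapsing, splitting into connected components, and covering each component) can be sketched as follows. This is a hedged illustration rather than IDPicker's actual code: all function names and the toy graph are hypothetical, only proteins are collapsed into group nodes, and the cover within each component is found with the standard greedy approximation, which does not guarantee a minimum cover.</p>

```python
from collections import defaultdict


def connected_components(protein_to_peptides):
    """Split the bipartite graph into components with an iterative depth-first search."""
    peptide_to_proteins = defaultdict(set)
    for prot, peps in protein_to_peptides.items():
        for pep in peps:
            peptide_to_proteins[pep].add(prot)

    seen, components = set(), []
    for start in sorted(protein_to_peptides):
        if start in seen:
            continue
        stack, component = [start], set()
        seen.add(start)
        while stack:
            prot = stack.pop()
            component.add(prot)
            # Two proteins are neighbors if they share a peptide.
            for pep in protein_to_peptides[prot]:
                for neighbor in peptide_to_proteins[pep]:
                    if neighbor not in seen:
                        seen.add(neighbor)
                        stack.append(neighbor)
        components.append(component)
    return components


def greedy_cover(protein_to_peptides, proteins):
    """Greedy approximation of a minimum set cover within one component."""
    uncovered = set().union(*(protein_to_peptides[p] for p in proteins))
    chosen = []
    while uncovered:
        # Pick the protein covering the most uncovered peptides; ties go by name.
        best = max(
            sorted(proteins),
            key=lambda p: len(protein_to_peptides[p].intersection(uncovered)),
        )
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen


def infer_proteins(protein_to_peptides):
    """Collapse indistinguishable proteins, then cover each component."""
    groups = defaultdict(list)
    for prot, peps in protein_to_peptides.items():
        groups[frozenset(peps)].append(prot)
    collapsed = {
        "/".join(sorted(members)): set(peps) for peps, members in groups.items()
    }
    result = []
    for component in connected_components(collapsed):
        result.extend(greedy_cover(collapsed, component))
    return sorted(result)


# Toy graph (hypothetical edges): four proteins, five peptides, two components.
toy_graph = {
    "P1": {"p1", "p2", "p3"},
    "P2": {"p1", "p2", "p3"},
    "P3": {"p4", "p5"},
    "P4": {"p5"},
}
protein_list = infer_proteins(toy_graph)
```

<p>In the toy graph, P1 and P2 are merged into one group node because they are connected to exactly the same peptides, and P4 is dropped because P3 already explains its only peptide.</p>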
<p>The minimum set cover formulation is one of the most commonly encountered strategies in protein inference, and is recommended by the guidelines for publishing proteomics results <abbrgrp>
<abbr bid="B46">46</abbr>
<abbr bid="B48">48</abbr>
</abbrgrp>. Its intuition is to select the smallest among many possible solutions (Occam's razor, parsimony principle), which can be justified by considering the number of possible solutions when the protein list consists of exactly <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i55"><m:mrow>
   <m:mi>n</m:mi>
</m:mrow>
</m:math>
</inline-formula> proteins. Assuming <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i55">
<m:mrow>
<m:mi>n</m:mi>
</m:mrow>
</m:math>
</inline-formula> &#8810; |<inline-formula>
<graphic file="1471-2105-13-S16-S4-i42.gif"/>
</inline-formula>|, the solutions of smaller sizes are selected from a smaller solution space and are therefore less likely to be spurious findings. In many practical situations, including protein inference, the MSC formulation leads to natural and acceptable solutions. However, it is not obvious that a minimalist formulation should apply to biological samples in which multiple paralogous proteins or protein isoforms may be present at the same time. This approach also ignores other available information, e.g. peptides that are not identified (all dashed edges in Figure <figr fid="F2">2B</figr>), gene functions <abbrgrp>
<abbr bid="B56">56</abbr>
</abbrgrp> or mRNA expression levels <abbrgrp>
<abbr bid="B57">57</abbr>
</abbrgrp>.</p>
</sec>
<sec>
<st>
<p>The partial set cover formulation</p>
</st>
<p>Although the MSC formulation relies on a set of confidently identified peptides, a subset of such peptides is expected to consist of incorrect identifications. This motivates the partial set cover approaches, where the goal is to find a smallest protein list that covers at least 100&#183;<it>c</it>% of the identified peptides, where 0 &lt;<it>c </it>&#8804; 1 is a user-specified parameter.</p>
<p>
<b>
<it>Minimum partial set cover (MPSC) problem: </it>
</b>Given a set of confident peptide identifications <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i32"><m:mrow>
   <m:mi mathvariant="script">U</m:mi>
</m:mrow>
</m:math>
</inline-formula>, protein database <inline-formula>
<graphic file="1471-2105-13-S16-S4-i42.gif"/>
</inline-formula>, and parameter <it>c </it>(0 &lt;<it>c </it>&#8804; 1), find a protein list <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i30">
<m:mrow>
<m:mi mathvariant="script">L</m:mi>
</m:mrow>
</m:math>
</inline-formula> of minimal size such that at least 100&#183;<it>c</it>% of identified peptides are assigned to the proteins from <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i30">
<m:mrow>
<m:mi mathvariant="script">L</m:mi>
</m:mrow>
</m:math>
</inline-formula>. More formally,</p>
<p>
<display-formula>
<graphic file="1471-2105-13-S16-S4-i6.gif"/>
</display-formula>
</p>
<p>where <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i72"><m:mrow>
   <m:msub>
      <m:mi>z</m:mi>
      <m:mi>j</m:mi>
   </m:msub>
   <m:mo>&#8712;</m:mo>
   <m:mrow>
      <m:mo>{</m:mo>
      <m:mrow>
         <m:mn>0</m:mn>
         <m:mo>,</m:mo>
         <m:mn>1</m:mn>
      </m:mrow>
      <m:mo>}</m:mo>
   </m:mrow>
</m:mrow>
</m:math>
</inline-formula> indicates whether peptide <inline-formula>
<graphic file="1471-2105-13-S16-S4-i71.gif"/>
</inline-formula> is excluded <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i73"><m:mrow>
   <m:mo>(</m:mo>
   <m:msub>
      <m:mi>z</m:mi>
      <m:mi>j</m:mi>
   </m:msub>
   <m:mo>=</m:mo>
   <m:mn>1</m:mn>
   <m:mo>)</m:mo>
</m:mrow>
</m:math>
</inline-formula> from the list of assigned peptides. Both MSC and MPSC problems are NP-hard in general. Thus, finding optimal solutions may be computationally infeasible when the number of identified peptides is large (note that each peptide from <inline-formula>
<graphic file="1471-2105-13-S16-S4-i74.gif"/>
</inline-formula> adds a constraint in the problem formulation). A number of approximation algorithms have been proposed ranging from greedy algorithms to integer programming, and several such algorithms have been tested in protein inference <abbrgrp>
<abbr bid="B58">58</abbr>
</abbrgrp>.</p>
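<p>A greedy heuristic for the partial set cover formulation differs from the full cover only in its stopping condition. The sketch below is a minimal Python illustration under stated assumptions: the function name greedy_partial_cover and the toy data are hypothetical, the required number of covered peptides is taken as the ceiling of <it>c</it> times the number of identified peptides, and the greedy choice is an approximation that is not guaranteed to return a list of minimal size.</p>

```python
import math


def greedy_partial_cover(protein_to_peptides, identified, c=1.0):
    """Greedily pick proteins until at least a fraction c of peptides is covered."""
    if c > 1 or not c > 0:
        raise ValueError("c must lie in the interval (0, 1]")
    identified = set(identified)
    required = math.ceil(c * len(identified))
    covered, chosen = set(), []

    def gain(protein):
        # Number of not-yet-covered identified peptides this protein explains.
        return len(protein_to_peptides[protein].intersection(identified) - covered)

    while required > len(covered):
        best = max(sorted(protein_to_peptides), key=gain)
        if gain(best) == 0:
            raise ValueError("some identified peptides map to no protein")
        chosen.append(best)
        covered |= protein_to_peptides[best].intersection(identified)
    return chosen


# Toy example (hypothetical data): p4 is explained only by protein B.
database = {"A": {"p1", "p2", "p3"}, "B": {"p4"}}
peptides = {"p1", "p2", "p3", "p4"}
full_cover = greedy_partial_cover(database, peptides, c=1.0)
partial_cover = greedy_partial_cover(database, peptides, c=0.75)
```

<p>With <it>c</it> = 1 both proteins are needed, whereas with <it>c</it> = 0.75 protein B, supported by a single and possibly incorrect peptide, is left out.</p>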
<p>Both the MSC and MPSC problem formulations result in situations where it is not possible to distinguish among proteins identified exclusively by degenerate peptides (e.g. proteins <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i75"><m:mrow>
   <m:msub>
      <m:mi>P</m:mi>
      <m:mn>1</m:mn>
   </m:msub>
</m:mrow>
</m:math>
</inline-formula> and <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i76"><m:mrow>
   <m:msub>
      <m:mi>P</m:mi>
      <m:mn>2</m:mn>
   </m:msub>
</m:mrow>
</m:math>
</inline-formula> in Figure <figr fid="F2">2</figr>). Nesvizhskii and Aebersold have identified several such classes of proteins, naming them indistinguishable proteins, subset proteins, subsumable proteins, etc. <abbrgrp>
<abbr bid="B20">20</abbr>
</abbrgrp>. Because such situations are common for eukaryotes or samples containing multiple closely related organisms, different problem formulations are necessary to provide appropriate tie resolutions.</p>
</sec>
<sec>
<st>
<p>The minimum missed peptide formulation</p>
</st>
<p>The MSC-based formulations of the protein inference problem rely only on peptides that were confidently identified (<inline-formula>
<graphic file="1471-2105-13-S16-S4-i41.gif"/>
</inline-formula>) and thus ignore all unidentified peptides from the proteins containing at least one peptide from <inline-formula>
<graphic file="1471-2105-13-S16-S4-i41.gif"/>
</inline-formula>, see dashed edges in Figure <figr fid="F2">2B</figr>. In addition, these methods implicitly assume that each peptide is equally likely to be observed in an MS/MS experiment. The first combinatorial approach addressing these aspects was the minimum missed peptide (MMP) formulation <abbrgrp>
<abbr bid="B59">59</abbr>
</abbrgrp>. This approach relies on the concept of peptide detectability (Box 1).</p>
<p>To provide intuition for the MMP approach, let us consider the example in Figure <figr fid="F3">3</figr>, which itself corresponds to the bipartite graph from Figure <figr fid="F2">2B</figr>. When considering only peptides in <inline-formula>
<graphic file="1471-2105-13-S16-S4-i41.gif"/>
</inline-formula> (solid lines in Figure <figr fid="F2">2B</figr>), proteins <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i75">
<m:mrow>
<m:msub>
<m:mi>P</m:mi>
<m:mn>1</m:mn>
</m:msub>
</m:mrow>
</m:math>
</inline-formula> and <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i76">
<m:mrow>
<m:msub>
<m:mi>P</m:mi>
<m:mn>2</m:mn>
</m:msub>
</m:mrow>
</m:math>
</inline-formula> would be classified as indistinguishable <abbrgrp>
<abbr bid="B20">20</abbr>
</abbrgrp>; however, given detectabilities of all peptides, it can be inferred that protein <it>P</it>
<sub>1 </sub>is more likely to be present in the sample than protein <it>P</it>
<sub>2</sub>. Specifically, the three identified peptides (shaded) are the most detectable peptides in protein <it>P</it>
<sub>1</sub>. On the other hand, these peptides are among those least expected to be observed if protein <it>P</it>
<sub>2 </sub>were in the sample. Thus, protein <it>P</it>
<sub>1 </sub>is more likely to be a correct identification than protein <it>P</it>
<sub>2</sub>. Note that the tie resolution was provided by considering unidentified peptides.</p>
<fig id="F3"><title><p>Figure 3</p></title><caption><p>A detectability plot corresponding to the situation from Figure 2B. Peptides in each protein are ranked according to their detectability. The identified peptides <it>p</it>1, <it>p</it>2, and <it>p</it>3 are shaded, while the remaining peptides are white. The situation provides intuition for a decision that protein <it>P</it>1 is more likely to be present in the sample than protein <it>P</it>2. Note that detectabilities of peptides <it>p</it>1, <it>p</it>2, and <it>p</it>3 are not necessarily identical in the two proteins. This is because they depend on peptide sequence but also on the context of the parent protein (neighboring peptides).</p></caption><text>
   <p><b>A detectability plot corresponding to the situation from Figure 2B</b>. Peptides in each protein are ranked according to their detectability. The identified peptides <it>p</it>1, <it>p</it>2, and <it>p</it>3 are shaded, while the remaining peptides are white. The situation provides intuition for a decision that protein <it>P</it>1 is more likely to be present in the sample than protein <it>P</it>2. Note that detectabilities of peptides <it>p</it>1, <it>p</it>2, and <it>p</it>3 are not necessarily identical in the two proteins. This is because they depend on peptide sequence but also on the context of the parent protein (neighboring peptides).</p>
</text><graphic file="1471-2105-13-S16-S4-3" hint_layout="single"/></fig>
<p>Before formalizing the MMP approach, let us consider a particular <it>solution </it>to the protein inference problem in which different peptides from <inline-formula>
<graphic file="1471-2105-13-S16-S4-i74.gif"/>
</inline-formula> are <it>assigned </it>to protein <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i44">
<m:mrow>
<m:msub>
<m:mi>P</m:mi>
<m:mi>i</m:mi>
</m:msub>
</m:mrow>
</m:math>
</inline-formula>. Note that some peptides <inline-formula>
<graphic file="1471-2105-13-S16-S4-i77.gif"/>
</inline-formula> may not be assigned to <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i78"><m:mrow>
   <m:msub>
      <m:mi>P</m:mi>
      <m:mi>i</m:mi>
   </m:msub>
   <m:mo>(</m:mo>
   <m:msub>
      <m:mrow>
         <m:mi>x</m:mi>
      </m:mrow>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>j</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo>=</m:mo>
   <m:mn>0</m:mn>
   <m:mo>)</m:mo>
</m:mrow>
</m:math>
</inline-formula> although their sequences can be mapped to the protein and the peptides are confidently identified <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i79"><m:mrow>
   <m:mo>(</m:mo>
   <m:msub>
      <m:mrow>
         <m:mi>t</m:mi>
      </m:mrow>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>j</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo>=</m:mo>
   <m:mn>1</m:mn>
   <m:mo>)</m:mo>
</m:mrow>
</m:math>
</inline-formula>. Peptide <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i49">
<m:mrow>
<m:msub>
<m:mi>p</m:mi>
<m:mrow>
<m:mi>i</m:mi>
<m:mi>j</m:mi>
</m:mrow>
</m:msub>
</m:mrow>
</m:math>
</inline-formula> is defined as <it>missed </it>if <inline-formula>
<graphic file="1471-2105-13-S16-S4-i80.gif"/>
</inline-formula> and</p>
<p>
<display-formula>
<graphic file="1471-2105-13-S16-S4-i9.gif"/>
</display-formula>
</p>
<p>where <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i81"><m:mrow>
   <m:msub>
      <m:mrow>
         <m:mi>d</m:mi>
      </m:mrow>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>j</m:mi>
      </m:mrow>
   </m:msub>
</m:mrow>
</m:math>
</inline-formula> is the detectability of peptide <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i49">
<m:mrow>
<m:msub>
<m:mi>p</m:mi>
<m:mrow>
<m:mi>i</m:mi>
<m:mi>j</m:mi>
</m:mrow>
</m:msub>
</m:mrow>
</m:math>
</inline-formula>. In other words, a peptide is missed if in a particular inference solution (1) it is not confidently identified and (2) a peptide with lower detectability from the same protein is identified and assigned to that protein. We emphasize that the peptides with detectabilities lower than the minimum detectability of assigned peptides for protein <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i44">
<m:mrow>
<m:msub>
<m:mi>P</m:mi>
<m:mi>i</m:mi>
</m:msub>
</m:mrow>
</m:math>
</inline-formula> are not considered missed because protein quantity influences the effective detectability of all peptides in <it>P<sub>i</sub>
</it>. Thus, for effective detectability below a certain threshold, no peptides are expected to be observed. The MMP approach can now be formalized as follows.</p>
<p>
<b>
<it>Minimum missed peptide (MMP) problem: </it>
</b>Given a set of confident peptide identifications <inline-formula>
<graphic file="1471-2105-13-S16-S4-i41.gif"/>
</inline-formula>, protein database <inline-formula>
<graphic file="1471-2105-13-S16-S4-i42.gif"/>
</inline-formula>, and peptide detectability for each peptide <inline-formula>
<graphic file="1471-2105-13-S16-S4-i64.gif"/>
</inline-formula>, find a set of proteins <inline-formula>
<graphic file="1471-2105-13-S16-S4-i11.gif"/>
</inline-formula> that covers all peptides in <inline-formula>
<graphic file="1471-2105-13-S16-S4-i70.gif"/>
</inline-formula> and minimizes the number of missed peptides. More formally,</p>
<p>
<display-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i12"><m:mrow>
   <m:mtable class="gathered">
      <m:mtr>
         <m:mtd>
            <m:mstyle class="text">
               <m:mtext class="textsf" mathvariant="sans-serif">minimize</m:mtext>
            </m:mstyle>
            <m:mspace class="thinspace" width="0.3em"/>
            <m:munder class="msub">
               <m:mrow>
                  <m:mo mathsize="big"> &#8721;</m:mo>
               </m:mrow>
               <m:mrow>
                  <m:mi>i</m:mi>
                  <m:mo class="MathClass-punc">,</m:mo>
                  <m:mi>j</m:mi>
               </m:mrow>
            </m:munder>
            <m:msub>
               <m:mrow>
                  <m:mi>z</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mi>i</m:mi>
                  <m:mi>j</m:mi>
               </m:mrow>
            </m:msub>
            <m:mo class="MathClass-bin">&#8901;</m:mo>
            <m:mrow>
               <m:mo class="MathClass-open">(</m:mo>
               <m:mrow>
                  <m:mn>1</m:mn>
                  <m:mo class="MathClass-bin">-</m:mo>
                  <m:msub>
                     <m:mrow>
                        <m:mi>t</m:mi>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>j</m:mi>
                     </m:mrow>
                  </m:msub>
               </m:mrow>
               <m:mo class="MathClass-close">)</m:mo>
            </m:mrow>
         </m:mtd>
      </m:mtr>
      <m:mtr>
         <m:mtd>
            <m:mstyle class="text">
               <m:mtext class="textsf" mathvariant="sans-serif">subject</m:mtext>
            </m:mstyle>
            <m:mspace class="thinspace" width="0.3em"/>
            <m:mstyle class="text">
               <m:mtext class="textsf" mathvariant="sans-serif">to</m:mtext>
            </m:mstyle>
            <m:mspace class="thinspace" width="0.3em"/>
            <m:mstyle class="text">
               <m:mtext class="textsf" mathvariant="sans-serif">(</m:mtext>
            </m:mstyle>
            <m:msub>
               <m:mrow>
                  <m:mi>z</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mi>i</m:mi>
                  <m:mi>j</m:mi>
               </m:mrow>
            </m:msub>
            <m:mo class="MathClass-bin">-</m:mo>
            <m:msub>
               <m:mrow>
                  <m:mi>z</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mi>i</m:mi>
                  <m:mi>k</m:mi>
               </m:mrow>
            </m:msub>
            <m:mo class="MathClass-close">)</m:mo>
            <m:mo class="MathClass-bin">&#8901;</m:mo>
            <m:mrow>
               <m:mo class="MathClass-open">(</m:mo>
               <m:mrow>
                  <m:msub>
                     <m:mrow>
                        <m:mi>d</m:mi>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>i</m:mi>
                        <m:mi>j</m:mi>
                     </m:mrow>
                  </m:msub>
                  <m:mo class="MathClass-bin">-</m:mo>
                  <m:msub>
                     <m:mrow>
                        <m:mi>d</m:mi>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>i</m:mi>
                        <m:mi>k</m:mi>
                     </m:mrow>
                  </m:msub>
               </m:mrow>
               <m:mo class="MathClass-close">)</m:mo>
            </m:mrow>
            <m:mo class="MathClass-rel">&#8805;</m:mo>
            <m:mn>0</m:mn>
            <m:mspace class="quad" width="1em"/>
            <m:mrow>
               <m:mo class="MathClass-open">(</m:mo>
               <m:mrow>
                  <m:mo class="MathClass-op">&#8704;</m:mo>
                  <m:mi>i</m:mi>
                  <m:mo class="MathClass-punc">,</m:mo>
                  <m:mi>j</m:mi>
                  <m:mo class="MathClass-rel">&#8712;</m:mo>
                  <m:mi>N</m:mi>
                  <m:mrow>
                     <m:mo class="MathClass-open">(</m:mo>
                     <m:mrow>
                        <m:mi>i</m:mi>
                     </m:mrow>
                     <m:mo class="MathClass-close">)</m:mo>
                  </m:mrow>
                  <m:mo class="MathClass-punc">,</m:mo>
                  <m:mi>k</m:mi>
                  <m:mo class="MathClass-rel">&#8712;</m:mo>
                  <m:mi>N</m:mi>
                  <m:mrow>
                     <m:mo class="MathClass-open">(</m:mo>
                     <m:mrow>
                        <m:mi>i</m:mi>
                     </m:mrow>
                     <m:mo class="MathClass-close">)</m:mo>
                  </m:mrow>
               </m:mrow>
               <m:mo class="MathClass-close">)</m:mo>
            </m:mrow>
         </m:mtd>
      </m:mtr>
      <m:mtr>
         <m:mtd>
            <m:munder class="msub">
               <m:mrow>
                  <m:mo mathsize="big"> &#8721;</m:mo>
               </m:mrow>
               <m:mrow>
                  <m:mi>i</m:mi>
                  <m:mo class="MathClass-rel">&#8712;</m:mo>
                  <m:mi>N</m:mi>
                  <m:mrow>
                     <m:mo class="MathClass-open">(</m:mo>
                     <m:mrow>
                        <m:mi>j</m:mi>
                     </m:mrow>
                     <m:mo class="MathClass-close">)</m:mo>
                  </m:mrow>
               </m:mrow>
            </m:munder>
            <m:msub>
               <m:mrow>
                  <m:mi>z</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mi>i</m:mi>
                  <m:mi>j</m:mi>
               </m:mrow>
            </m:msub>
            <m:mo class="MathClass-rel">&#8805;</m:mo>
            <m:msub>
               <m:mrow>
                  <m:mi>t</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mi>j</m:mi>
               </m:mrow>
            </m:msub>
            <m:mspace class="quad" width="1em"/>
            <m:mrow>
               <m:mo class="MathClass-open">(</m:mo>
               <m:mrow>
                  <m:mo class="MathClass-op">&#8704;</m:mo>
                  <m:msub>
                     <m:mrow>
                        <m:mi>p</m:mi>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>j</m:mi>
                     </m:mrow>
                  </m:msub>
                  <m:mo class="MathClass-rel">&#8712;</m:mo>
                  <m:mi mathvariant="script">C</m:mi>
               </m:mrow>
               <m:mo class="MathClass-close">)</m:mo>
            </m:mrow>
            <m:mo class="MathClass-punc">,</m:mo>
         </m:mtd>
      </m:mtr>
      <m:mtr>
         <m:mtd/>
      </m:mtr>
   </m:mtable>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>where <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i82"><m:mrow>
   <m:msub>
      <m:mrow>
         <m:mi>z</m:mi>
      </m:mrow>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>j</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo>&#8712;</m:mo>
   <m:mrow>
      <m:mo>{</m:mo>
      <m:mrow>
         <m:mn>0</m:mn>
         <m:mo>,</m:mo>
         <m:mn>1</m:mn>
      </m:mrow>
      <m:mo>}</m:mo>
   </m:mrow>
</m:mrow>
</m:math>
</inline-formula> indicates whether detectability <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i81">
<m:mrow>
<m:msub>
<m:mrow>
<m:mi>d</m:mi>
</m:mrow>
<m:mrow>
<m:mi>i</m:mi>
<m:mi>j</m:mi>
</m:mrow>
</m:msub>
</m:mrow>
</m:math>
</inline-formula> for peptide <inline-formula>
<graphic file="1471-2105-13-S16-S4-i77.gif"/>
</inline-formula> is above or equal to <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i83"><m:mrow>
   <m:mo>(</m:mo>
   <m:msub>
      <m:mrow>
         <m:mi>z</m:mi>
      </m:mrow>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>j</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo>=</m:mo>
   <m:mn>1</m:mn>
   <m:mo>)</m:mo>
</m:mrow>
</m:math>
</inline-formula> or below <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i84"><m:mrow>
   <m:mo>(</m:mo>
   <m:msub>
      <m:mrow>
         <m:mi>z</m:mi>
      </m:mrow>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>j</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo>=</m:mo>
   <m:mn>0</m:mn>
   <m:mo>)</m:mo>
</m:mrow>
</m:math>
</inline-formula> the minimum detectability of peptides assigned to protein <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i44">
<m:mrow>
<m:msub>
<m:mi>P</m:mi>
<m:mi>i</m:mi>
</m:msub>
</m:mrow>
</m:math>
</inline-formula>, and <it>N</it>(<it>i</it>) is the set of peptides connected to <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i44">
<m:mrow>
<m:msub>
<m:mi>P</m:mi>
<m:mi>i</m:mi>
</m:msub>
</m:mrow>
</m:math>
</inline-formula> in the expanded bipartite graph (see Figure <figr fid="F2">2B</figr>). A set of identified proteins can now be determined as</p>
<p>
<display-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i13"><m:mrow>
   <m:msub>
      <m:mrow>
         <m:mi>y</m:mi>
      </m:mrow>
      <m:mrow>
         <m:mi>i</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:mfenced close="" open="{" separators="">
      <m:mrow>
         <m:mtable class="array" columnlines="none none none none none none none none none none none none none none none none none none none" equalcolumns="false" equalrows="false">
            <m:mtr>
               <m:mtd class="array" columnalign="center">
                  <m:mn>0</m:mn>
                  <m:mspace class="quad" width="1em"/>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">if</m:mtext>
                  </m:mstyle>
                  <m:msub>
                     <m:mrow>
                        <m:mo mathsize="big"> &#8721;</m:mo>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>j</m:mi>
                        <m:mo class="MathClass-rel">&#8712;</m:mo>
                        <m:mi>N</m:mi>
                        <m:mrow>
                           <m:mo class="MathClass-open">(</m:mo>
                           <m:mrow>
                              <m:mi>i</m:mi>
                           </m:mrow>
                           <m:mo class="MathClass-close">)</m:mo>
                        </m:mrow>
                     </m:mrow>
                  </m:msub>
                  <m:msub>
                     <m:mrow>
                        <m:mi>z</m:mi>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>i</m:mi>
                        <m:mi>j</m:mi>
                     </m:mrow>
                  </m:msub>
                  <m:mo class="MathClass-bin">&#8901;</m:mo>
                  <m:msub>
                     <m:mrow>
                        <m:mi>t</m:mi>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>j</m:mi>
                     </m:mrow>
                  </m:msub>
                  <m:mo class="MathClass-rel">=</m:mo>
                  <m:mn>0</m:mn>
               </m:mtd>
            </m:mtr>
            <m:mtr>
               <m:mtd class="array" columnalign="center">
                  <m:mn>1</m:mn>
                  <m:mspace class="quad" width="1em"/>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">if</m:mtext>
                  </m:mstyle>
                  <m:msub>
                     <m:mrow>
                        <m:mo mathsize="big"> &#8721;</m:mo>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>j</m:mi>
                        <m:mo class="MathClass-rel">&#8712;</m:mo>
                        <m:mi>N</m:mi>
                        <m:mrow>
                           <m:mo class="MathClass-open">(</m:mo>
                           <m:mrow>
                              <m:mi>i</m:mi>
                           </m:mrow>
                           <m:mo class="MathClass-close">)</m:mo>
                        </m:mrow>
                     </m:mrow>
                  </m:msub>
                  <m:msub>
                     <m:mrow>
                        <m:mi>z</m:mi>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>i</m:mi>
                        <m:mi>j</m:mi>
                     </m:mrow>
                  </m:msub>
                  <m:mo class="MathClass-bin">&#8901;</m:mo>
                  <m:msub>
                     <m:mrow>
                        <m:mi>t</m:mi>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>j</m:mi>
                     </m:mrow>
                  </m:msub>
                  <m:mo class="MathClass-rel">&gt;</m:mo>
                  <m:mn>0</m:mn>
               </m:mtd>
            </m:mtr>
            <m:mtr>
               <m:mtd class="array" columnalign="center"/>
            </m:mtr>
         </m:mtable>
      </m:mrow>
   </m:mfenced>
</m:mrow>
</m:math>
</display-formula>
</p>
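<p>The formulation above can be illustrated with a small checker. The following Python sketch is not the algorithm of Alves et al.; it only evaluates a given candidate assignment of the decision variables against the objective and the two constraints, and derives the protein indicators from it. The function name evaluate_mmp, the dictionary-based data layout and the toy numbers are assumptions made for illustration.</p>

```python
def evaluate_mmp(neighbors, d, t, z):
    """Evaluate a candidate assignment z for the MMP formulation.

    neighbors: dict mapping protein i to the list of peptides j in N(i)
    d: dict mapping (i, j) to the detectability d_ij
    t: dict mapping peptide j to 1 if confidently identified, else 0
    z: dict mapping (i, j) to the binary decision variable z_ij
    Returns (feasible, missed, y).
    """
    feasible = True

    # Detectability constraint: walking through N(i) from the most to the
    # least detectable peptide, z_ij may never switch from 0 back to 1.
    # Ties in detectability are ordered with z = 1 first, so they never
    # count as violations (the product in the constraint is zero for ties).
    for i, peps in neighbors.items():
        order = sorted(peps, key=lambda j: (-d[i, j], -z[i, j]))
        zs = [z[i, j] for j in order]
        if zs != sorted(zs, reverse=True):
            feasible = False

    # Coverage constraint: every confident peptide is selected for at
    # least one of its parent proteins.
    for j, identified in t.items():
        parents = [i for i, peps in neighbors.items() if j in peps]
        if identified and not any(z[i, j] for i in parents):
            feasible = False

    # Objective: selected (protein, peptide) pairs whose peptide was not identified.
    missed = sum(z[i, j] * (1 - t[j]) for i, peps in neighbors.items() for j in peps)

    # A protein is reported when at least one identified peptide is selected for it.
    y = {i: int(any(z[i, j] * t[j] for j in peps)) for i, peps in neighbors.items()}
    return feasible, missed, y


# Toy instance (made-up numbers): peptide 2 is shared by proteins A and B.
neighbors = {"A": [1, 2], "B": [2, 3]}
d = {("A", 1): 0.9, ("A", 2): 0.5, ("B", 2): 0.4, ("B", 3): 0.8}
t = {1: 1, 2: 1, 3: 0}

only_a = {("A", 1): 1, ("A", 2): 1, ("B", 2): 0, ("B", 3): 0}
both = {("A", 1): 1, ("A", 2): 0, ("B", 2): 1, ("B", 3): 1}

print(evaluate_mmp(neighbors, d, t, only_a))
print(evaluate_mmp(neighbors, d, t, both))
```

<p>In this toy instance, assigning both identified peptides to protein A is feasible with no missed peptides, whereas explaining the shared peptide through protein B forces the more detectable but unidentified peptide 3 to be counted as missed. The formulation therefore prefers the first solution.</p>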
<p>Alves et al. have shown that the minimum set cover problem can be reduced to the minimum missed peptide formulation <abbrgrp>
<abbr bid="B59">59</abbr>
</abbrgrp>. Thus, the MMP problem is NP-hard and approximation algorithms are needed for large-scale problems. Alves et al. proposed an efficient greedy approximation algorithm that provides a good solution <abbrgrp>
<abbr bid="B59">59</abbr>
<abbr bid="B60">60</abbr>
<abbr bid="B61">61</abbr>
</abbrgrp>. Alternative formulations and algorithmic approaches are also possible. For example, this algorithm can be generalized in a relatively straightforward manner to a partial set formulation or to a version that minimizes the overall probability of unidentified peptides.</p>
<p>Although the MMP formulation was the first protein inference technique capable of resolving indistinguishable proteins, it generally shares the limitations of other approaches based on combinatorial optimization techniques. That is, these algorithms do not provide probabilities for identified proteins, unless post-processing statistical models are used <abbrgrp>
<abbr bid="B62">62</abbr>
</abbrgrp>.</p>
</sec>
</sec>
<sec>
<st>
<p>Probabilistic inference algorithms</p>
</st>
<p>As with the previous classes of algorithms, probabilistic approaches to protein inference generally consist of two steps. First, PSM scores are converted to PSM probabilities using algorithms such as PeptideProphet <abbrgrp>
<abbr bid="B38">38</abbr>
</abbrgrp>. After this pre-processing step, protein inference is performed based on an assumed probabilistic model. In probabilistic terms, protein inference involves computing protein posterior probabilities <inline-formula>
<graphic file="1471-2105-13-S16-S4-i68.gif"/>
</inline-formula> for every protein in <inline-formula>
<graphic file="1471-2105-13-S16-S4-i42.gif"/>
</inline-formula>.</p>
<p>Several classes of probabilistic algorithms have been proposed so far <abbrgrp>
<abbr bid="B21">21</abbr>
<abbr bid="B24">24</abbr>
<abbr bid="B60">60</abbr>
<abbr bid="B61">61</abbr>
<abbr bid="B63">63</abbr>
<abbr bid="B64">64</abbr>
<abbr bid="B65">65</abbr>
<abbr bid="B66">66</abbr>
<abbr bid="B67">67</abbr>
<abbr bid="B68">68</abbr>
<abbr bid="B69">69</abbr>
<abbr bid="B70">70</abbr>
<abbr bid="B71">71</abbr>
</abbrgrp>, with different strategies and levels of rigor in addressing protein groups and different run-time performance. Some probabilistic algorithms do not address degenerate peptides <abbrgrp>
<abbr bid="B63">63</abbr>
<abbr bid="B65">65</abbr>
<abbr bid="B68">68</abbr>
<abbr bid="B70">70</abbr>
</abbrgrp>, while some such as ProteinProphet <abbrgrp>
<abbr bid="B21">21</abbr>
</abbrgrp> combine probabilistic inference with the parsimony principle (for degenerate peptides) and protein grouping (for indistinguishable proteins). In the following subsections, we provide an in-depth discussion of the three major probabilistic methods: ProteinProphet <abbrgrp>
<abbr bid="B21">21</abbr>
</abbrgrp>, MSBayesPro <abbrgrp>
<abbr bid="B61">61</abbr>
</abbrgrp>, and Fido <abbrgrp>
<abbr bid="B71">71</abbr>
</abbrgrp>, and briefly mention several other methods. We use the same notation for all models and, when possible, provide new interpretations of the algorithms. We aim to reveal inherent connections and principal differences among the methods. For original derivations and interpretations, readers are referred to the original publications.</p>
<sec>
<st>
<p>ProteinProphet</p>
</st>
<p>ProteinProphet is the first and most widely used probabilistic protein inference approach <abbrgrp>
<abbr bid="B21">21</abbr>
</abbrgrp>, with importance comparable to the first automated peptide identification tool, SEQUEST <abbrgrp>
<abbr bid="B9">9</abbr>
</abbrgrp>. ProteinProphet consists of four major steps; together, they convert the original PSM probabilities from PeptideProphet to peptide identification probabilities and then combine the peptide identification probabilities to infer proteins.</p>
<sec>
<st>
<p>Pre-processing</p>
</st>
<p>In order to obtain protein identification probabilities, peptide identification probabilities are needed as input. Here, the difficulty is to obtain one peptide identification probability from typically multiple spectra matched to a peptide. The solution used in ProteinProphet is to simply take the maximum value among the peptide-spectrum matching probabilities for peptide <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i48">
<m:mrow>
<m:mi>&#160;j</m:mi>
</m:mrow>
</m:math>
</inline-formula> (step 1, Figure <figr fid="F4">4A</figr>), i.e.</p>
<fig id="F4"><title><p>Figure 4</p></title><caption><p>(A) A diagram of the ProteinProphet algorithm. The numbers in the circles correspond to the steps mentioned in the text. The presence of loops in the diagram represents iterative inference algorithms used by ProteinProphet. (B) Toy examples illustrating the impact of fluctuations in peptide identification probabilities on the inference outcome from ProteinProphet (version 4.2 RAPTURE rev 2). Top part shows four identified peptides corresponding to two proteins. Peptides p<sub>1</sub> and p<sub>2</sub> are shared peptides while peptides p<sub>3</sub> and p<sub>4</sub> are unique peptides for proteins P<sub>1</sub> and P<sub>2</sub>, respectively. The numerical values are the peptide identification probabilities used in the toy examples. Bottom: ProteinProphet results on seven data sets with minor changes in peptide identification probabilities. Noisy Input and Output: the peptide and protein identification probabilities respectively; group: protein group probability; N/A: not reported in ProteinProphet output.</p></caption><text>
   <p>(A) A diagram of the ProteinProphet algorithm. The numbers in the circles correspond to the steps mentioned in the text. The presence of loops in the diagram represents iterative inference algorithms used by ProteinProphet. (B) Toy examples illustrating the impact of fluctuations in peptide identification probabilities on the inference outcome from ProteinProphet (version 4.2 RAPTURE rev 2). Top part shows four identified peptides corresponding to two proteins. Peptides p<sub>1</sub> and p<sub>2</sub> are shared peptides while peptides p<sub>3</sub> and p<sub>4</sub> are unique peptides for proteins P<sub>1</sub> and P<sub>2</sub>, respectively. The numerical values are the peptide identification probabilities used in the toy examples. Bottom: ProteinProphet results on seven data sets with minor changes in peptide identification probabilities. Noisy Input and Output: the peptide and protein identification probabilities respectively; group: protein group probability; N/A: not reported in ProteinProphet output.</p>
</text><graphic file="1471-2105-13-S16-S4-4"/></fig>
<p>
<display-formula>
<graphic file="1471-2105-13-S16-S4-i14.gif"/>
</display-formula>
</p>
<p>where <inline-formula>
<graphic file="1471-2105-13-S16-S4-i85.gif"/>
</inline-formula> is the set of spectra identified for peptide <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i48">
<m:mrow>
<m:mi>&#160;j</m:mi>
</m:mrow>
</m:math>
</inline-formula>. If no spectrum is matched to the peptide, i.e. if <inline-formula>
<graphic file="1471-2105-13-S16-S4-i86.gif"/>
</inline-formula> then <inline-formula>
<graphic file="1471-2105-13-S16-S4-i87.gif"/>
</inline-formula>. Recently, the iProphet algorithm was proposed to improve this approach <abbrgrp>
<abbr bid="B72">72</abbr>
</abbrgrp>.</p>
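<p>The pre-processing step can be sketched in a few lines of Python. This is an illustration of the maximum rule described above, not ProteinProphet code; the function name, the input layout and the convention that a peptide without any matched spectrum receives probability zero are assumptions.</p>

```python
def peptide_probabilities(psms, peptides):
    """Collapse PSM probabilities into one probability per peptide.

    psms: iterable of (peptide, psm_probability) pairs
    peptides: all candidate peptides, including those with no matched spectrum
    """
    # Assumption: a peptide with no matched spectrum gets probability 0.
    probs = {pep: 0.0 for pep in peptides}
    for pep, prob in psms:
        # Keep the best-scoring spectrum for each peptide.
        probs[pep] = max(probs.get(pep, 0.0), prob)
    return probs


# Made-up PSM probabilities: two spectra for p1, one for p2, none for p3.
psms = [("p1", 0.20), ("p1", 0.95), ("p2", 0.60)]
print(peptide_probabilities(psms, ["p1", "p2", "p3"]))
```

<p>Taking the maximum discards the evidence carried by additional spectra matched to the same peptide, so the two spectra of p1 count no more than its single best match.</p>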
</sec>
<sec>
<st>
<p>Combining peptide probabilities</p>
</st>
<p>A key feature of ProteinProphet is that protein probabilities are computed by assuming peptide identifications to be independent pieces of evidence for the presence of protein <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i51">
<m:mrow>
<m:mi>&#160;i</m:mi>
</m:mrow>
</m:math>
</inline-formula> in the sample, i.e.</p>
<p>
<display-formula>
<graphic file="1471-2105-13-S16-S4-i15.gif"/>
</display-formula>
</p>
<p>where <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i88"><m:mrow>
   <m:mi>N</m:mi>
   <m:mo>(</m:mo>
   <m:mi>i</m:mi>
   <m:mo>)</m:mo>
</m:mrow>
</m:math>
</inline-formula> is the set of peptides mapped to protein <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i51">
<m:mrow>
<m:mi>&#160;i</m:mi>
</m:mrow>
</m:math>
</inline-formula>. This assumption, however, is not easy to justify because peptide identifications are not statistically independent. That is, if one peptide from the protein is confidently identified, the chance is higher that another peptide from the same protein will also be identified. Another problem with this assumption is that each degenerate peptide is counted toward all proteins it maps to. These issues are addressed via the following two adjustment steps.</p>
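<p>Under the independence assumption, a protein is considered present if at least one of its peptide identifications is correct. The Python sketch below assumes that the combination rule has the form of one minus the product of the complements of the peptide probabilities, which corresponds to the weighted expression used for degenerate peptides with all weights set to one. The function name and the numerical values are hypothetical.</p>

```python
from math import prod


def protein_probability(peptide_probs):
    """Probability that at least one independent peptide identification is correct."""
    return 1.0 - prod(1.0 - p for p in peptide_probs)


# Made-up values: p1 and p2 are shared by proteins P1 and P2.
pep_prob = {"p1": 0.9, "p2": 0.8, "p3": 0.5, "p4": 0.1}
neighbors = {"P1": ["p1", "p2", "p3"], "P2": ["p1", "p2", "p4"]}

for protein, peps in neighbors.items():
    print(protein, round(protein_probability(pep_prob[j] for j in peps), 3))
```

<p>With these values the sketch gives 0.99 for P1 and 0.982 for P2. Protein P2 scores almost as high as P1 even though its only unique peptide is poorly supported, because the shared peptides are counted in full toward both proteins.</p>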
</sec>
<sec>
<st>
<p>Adjustment for peptide identification probability</p>
</st>
<p>To address the limitation due to the independence assumption, ProteinProphet replaces <inline-formula>
<graphic file="1471-2105-13-S16-S4-i67.gif"/>
</inline-formula> in the equation above by <inline-formula>
<graphic file="1471-2105-13-S16-S4-i89.gif"/>
</inline-formula> (step 2, Figure <figr fid="F4">4A</figr>). The difference between the adjusted peptide identification probability <inline-formula>
<graphic file="1471-2105-13-S16-S4-i89.gif"/>
</inline-formula> and the original peptide identification probability <inline-formula>
<graphic file="1471-2105-13-S16-S4-i67.gif"/>
</inline-formula> comes from the presence of other spectra (peptides) mapped to the same protein as peptide <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i48">
<m:mrow>
<m:mi>&#160;j</m:mi>
</m:mrow>
</m:math>
</inline-formula>. They are expected to change the confidence of peptide identification. However, it is not straightforward to estimate <inline-formula>
<graphic file="1471-2105-13-S16-S4-i89.gif"/>
</inline-formula>. Nesvizhskii et al. defined the expected number of sibling peptides (NSP), i.e. the number of identified peptides (other than peptide <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i90"><m:mrow>
   <m:msub>
      <m:mi>p</m:mi>
      <m:mi>j</m:mi>
   </m:msub>
</m:mrow>
</m:math>
</inline-formula>) weighted by the adjusted peptide identification probability <inline-formula>
<graphic file="1471-2105-13-S16-S4-i89.gif"/>
</inline-formula>, from the same protein. Specifically, <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i16"><m:mrow>
   <m:mi>N</m:mi>
   <m:mi>S</m:mi>
   <m:msub>
      <m:mrow>
         <m:mi>P</m:mi>
      </m:mrow>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>j</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:msub>
      <m:mrow>
         <m:mo mathsize="big"> &#8721;</m:mo>
      </m:mrow>
      <m:mrow>
         <m:msup>
            <m:mrow>
               <m:mi>j</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>&#8242;</m:mi>
            </m:mrow>
         </m:msup>
         <m:mo class="MathClass-rel">&#8712;</m:mo>
         <m:mi>N</m:mi>
         <m:mrow>
            <m:mo class="MathClass-open">(</m:mo>
            <m:mrow>
               <m:mi>i</m:mi>
            </m:mrow>
            <m:mo class="MathClass-close">)</m:mo>
         </m:mrow>
         <m:mo class="MathClass-punc">,</m:mo>
         <m:msup>
            <m:mrow>
               <m:mi>j</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>&#8242;</m:mi>
            </m:mrow>
         </m:msup>
         <m:mo class="MathClass-rel">&#8800;</m:mo>
         <m:mi>j</m:mi>
      </m:mrow>
   </m:msub>
   <m:mi>P</m:mi>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:msub>
            <m:mrow>
               <m:mi>x</m:mi>
            </m:mrow>
            <m:mrow>
               <m:msup>
                  <m:mrow>
                     <m:mi>j</m:mi>
                  </m:mrow>
                  <m:mrow>
                     <m:mi>&#8242;</m:mi>
                  </m:mrow>
               </m:msup>
            </m:mrow>
         </m:msub>
         <m:mo class="MathClass-rel">=</m:mo>
         <m:mn>1</m:mn>
         <m:mo class="MathClass-rel">|</m:mo>
         <m:mi mathvariant="script">S</m:mi>
      </m:mrow>
      <m:mo class="MathClass-close">)</m:mo>
   </m:mrow>
</m:mrow>
</m:math>
</inline-formula>, where <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i51">
<m:mrow>
<m:mi>&#160;i</m:mi>
</m:mrow>
</m:math>
</inline-formula> indexes a parent protein of peptide <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i48">
<m:mrow>
<m:mi>&#160;j</m:mi>
</m:mrow>
</m:math>
</inline-formula> (step 4, Figure <figr fid="F4">4A</figr>). ProteinProphet then approximates <inline-formula>
<graphic file="1471-2105-13-S16-S4-i91.gif"/>
</inline-formula>, which is computed from <inline-formula>
<graphic file="1471-2105-13-S16-S4-i92.gif"/>
</inline-formula> and <inline-formula>
<graphic file="1471-2105-13-S16-S4-i93.gif"/>
</inline-formula> using Bayes' rule. Since computing <it>NSP<sub>ij </sub>
</it>requires <inline-formula>
<graphic file="1471-2105-13-S16-S4-i89.gif"/>
</inline-formula>, and computing <inline-formula>
<graphic file="1471-2105-13-S16-S4-i89.gif"/>
</inline-formula> requires <it>NSP<sub>ij</sub>
</it>, iterative updating is used until convergence (steps 2, 4; Figure <figr fid="F4">4A</figr>).</p>
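<p>The NSP quantity itself is simple to compute once peptide probabilities are available. The Python sketch below covers only this summation for a single protein and peptide; the Bayesian adjustment and the iteration to convergence depend on NSP distributions that are not reproduced here. The function name and the toy values are assumptions.</p>

```python
def nsp(protein, peptide, neighbors, pep_prob):
    """Expected number of sibling peptides of `peptide` within `protein`.

    neighbors: dict mapping each protein to the peptides in N(i)
    pep_prob: dict mapping each peptide to its current identification probability
    """
    # Sum the probabilities of all other peptides mapped to the same protein.
    return sum(pep_prob[k] for k in neighbors[protein] if k != peptide)


# Made-up values: p3 has two well-supported siblings in P1.
neighbors = {"P1": ["p1", "p2", "p3"]}
pep_prob = {"p1": 0.9, "p2": 0.8, "p3": 0.5}

print(round(nsp("P1", "p3", neighbors, pep_prob), 3))
```

<p>In an iterative scheme, pep_prob would hold the adjusted probabilities from the previous round, so that NSP values and adjusted probabilities are recomputed in turn.</p>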
</sec>
<sec>
<st>
<p>Adjustment for peptide degeneracy</p>
</st>
<p>In order to address degenerate peptides, a weighting scheme is used to modify protein probabilities to</p>
<p>
<display-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i17"><m:mrow>
   <m:mi>P</m:mi>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:msub>
            <m:mrow>
               <m:mi>y</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>i</m:mi>
            </m:mrow>
         </m:msub>
         <m:mo class="MathClass-rel">=</m:mo>
         <m:mn>1</m:mn>
         <m:mo class="MathClass-rel">|</m:mo>
         <m:mi mathvariant="script">S</m:mi>
      </m:mrow>
      <m:mo class="MathClass-close">)</m:mo>
   </m:mrow>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:mn>1</m:mn>
   <m:mo class="MathClass-bin">-</m:mo>
   <m:munder class="msub">
      <m:mrow>
         <m:mo mathsize="big">&#8719;</m:mo>
      </m:mrow>
      <m:mrow>
         <m:mi>j</m:mi>
         <m:mo class="MathClass-rel">&#8712;</m:mo>
         <m:mi>N</m:mi>
         <m:mrow>
            <m:mo class="MathClass-open">(</m:mo>
            <m:mrow>
               <m:mi>i</m:mi>
            </m:mrow>
            <m:mo class="MathClass-close">)</m:mo>
         </m:mrow>
      </m:mrow>
   </m:munder>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:mn>1</m:mn>
         <m:mo class="MathClass-bin">-</m:mo>
         <m:msub>
            <m:mrow>
               <m:mi>w</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>i</m:mi>
               <m:mi>j</m:mi>
            </m:mrow>
         </m:msub>
         <m:mo class="MathClass-bin">&#8901;</m:mo>
         <m:mi>P</m:mi>
         <m:mrow>
            <m:mo class="MathClass-open">(</m:mo>
            <m:mrow>
               <m:msub>
                  <m:mrow>
                     <m:mi>x</m:mi>
                  </m:mrow>
                  <m:mrow>
                     <m:mi>j</m:mi>
                  </m:mrow>
               </m:msub>
               <m:mo class="MathClass-rel">=</m:mo>
               <m:mn>1</m:mn>
               <m:mo class="MathClass-rel">|</m:mo>
               <m:mi mathvariant="script">S</m:mi>
            </m:mrow>
            <m:mo class="MathClass-close">)</m:mo>
         </m:mrow>
      </m:mrow>
      <m:mo class="MathClass-close">)</m:mo>
   </m:mrow>
   <m:mo class="MathClass-punc">,</m:mo>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>where <it>w<sub>ij </sub>
</it>is the "proportion" of peptide <it>j </it>assigned to protein <it>i </it>(step 3, Figure <figr fid="F4">4A</figr>). Nesvizhskii et al. defined that <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i18"><m:mrow>
   <m:msub>
      <m:mrow>
         <m:mi>w</m:mi>
      </m:mrow>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>j</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:mi>P</m:mi>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:msub>
            <m:mrow>
               <m:mi>y</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>i</m:mi>
            </m:mrow>
         </m:msub>
         <m:mo class="MathClass-rel">=</m:mo>
         <m:mn>1</m:mn>
         <m:mo class="MathClass-rel">|</m:mo>
         <m:mi mathvariant="script">S</m:mi>
      </m:mrow>
      <m:mo class="MathClass-close">)</m:mo>
   </m:mrow>
   <m:mo class="MathClass-bin">/</m:mo>
   <m:msub>
      <m:mrow>
         <m:mo mathsize="big"> &#8721;</m:mo>
      </m:mrow>
      <m:mrow>
         <m:msup>
            <m:mrow>
               <m:mi>i</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>&#8242;</m:mi>
            </m:mrow>
         </m:msup>
         <m:mo class="MathClass-rel">&#8712;</m:mo>
         <m:mi>N</m:mi>
         <m:mrow>
            <m:mo class="MathClass-open">(</m:mo>
            <m:mrow>
               <m:mi>j</m:mi>
            </m:mrow>
            <m:mo class="MathClass-close">)</m:mo>
         </m:mrow>
      </m:mrow>
   </m:msub>
   <m:mi>P</m:mi>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:msub>
            <m:mrow>
               <m:mi>y</m:mi>
            </m:mrow>
            <m:mrow>
               <m:msup>
                  <m:mrow>
                     <m:mi>i</m:mi>
                  </m:mrow>
                  <m:mrow>
                     <m:mi>&#8242;</m:mi>
                  </m:mrow>
               </m:msup>
            </m:mrow>
         </m:msub>
         <m:mo class="MathClass-rel">=</m:mo>
         <m:mspace class="thinspace" width="0.3em"/>
         <m:mn>1</m:mn>
         <m:mo class="MathClass-rel">|</m:mo>
         <m:mi mathvariant="script">S</m:mi>
      </m:mrow>
      <m:mo class="MathClass-close">)</m:mo>
   </m:mrow>
</m:mrow>
</m:math>
</inline-formula>, where <it>N</it>(<it>j</it>) is the set of proteins that contain peptide <inline-formula>
<graphic file="1471-2105-13-S16-S4-i48.gif"/>
</inline-formula> (step 5, Figure <figr fid="F4">4A</figr>). This adjustment step is in accordance with the parsimony principle because <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i19"><m:msub>
   <m:mrow>
      <m:mo class="MathClass-op">&#8721;</m:mo>
   </m:mrow>
   <m:mrow>
      <m:mi>i</m:mi>
      <m:mo class="MathClass-rel">&#8712;</m:mo>
      <m:mi>N</m:mi>
      <m:mrow>
         <m:mo class="MathClass-open">(</m:mo>
         <m:mrow>
            <m:mi>j</m:mi>
         </m:mrow>
         <m:mo class="MathClass-close">)</m:mo>
      </m:mrow>
   </m:mrow>
</m:msub>
<m:msub>
   <m:mrow>
      <m:mi>w</m:mi>
   </m:mrow>
   <m:mrow>
      <m:mi>i</m:mi>
      <m:mi>j</m:mi>
   </m:mrow>
</m:msub>
<m:mo class="MathClass-rel">=</m:mo>
<m:mn>1</m:mn>
</m:math>
</inline-formula>, i.e. the weights of each peptide sum to one, so that in total a peptide is attributed to a single protein. Note that <it>w<sub>ij </sub>
</it>= 1 for all unique peptides and that <it>w<sub>ij </sub>
</it>= 0 if peptide <it>j </it>cannot be mapped to protein <it>i</it>, i.e. when <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i20"><m:mi>i</m:mi>
<m:mo class="MathClass-rel">&#8713;</m:mo>
<m:mi>N</m:mi>
<m:mrow>
   <m:mo class="MathClass-open">(</m:mo>
   <m:mrow>
      <m:mi>j</m:mi>
   </m:mrow>
   <m:mo class="MathClass-close">)</m:mo>
</m:mrow>
</m:math>
</inline-formula>. Since the calculations of <it>w<sub>ij </sub>
</it>and <inline-formula>
<graphic file="1471-2105-13-S16-S4-i68.gif"/>
</inline-formula> are mutually dependent, another iterative updating procedure is used until convergence.</p>
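<p>The mutual dependence between the weights and the protein probabilities can be sketched as a simple fixed-point iteration. The following is a minimal illustration of the two formulas above, not the ProteinProphet implementation: the function name, the equal-split initialization, and the stopping rule are assumptions made for this example, and the NSP adjustment is omitted.</p>

```python
from math import prod


def iterate_weights(peptide_prob, protein_peptides, max_iter=1000, tol=1e-9):
    """Alternate between protein probabilities and peptide weights.

    peptide_prob:     dict peptide -> P(x_j = 1 | S)
    protein_peptides: dict protein -> list of its (distinct) peptides, i.e. N(i)
    Returns (protein_prob, weights) with weights keyed by (protein, peptide).
    """
    # Invert the mapping to obtain N(j), the parent proteins of each peptide.
    parents = {}
    for prot, peps in protein_peptides.items():
        for pep in peps:
            parents.setdefault(pep, []).append(prot)

    # Assumed initialization: split every peptide equally among its parents.
    weights = {(i, j): 1.0 / len(ps) for j, ps in parents.items() for i in ps}
    protein_prob = {i: 0.0 for i in protein_peptides}

    for _ in range(max_iter):
        # P(y_i = 1 | S) = 1 - prod_j (1 - w_ij * P(x_j = 1 | S))
        updated = {
            i: 1.0 - prod(1.0 - weights[(i, j)] * peptide_prob[j] for j in peps)
            for i, peps in protein_peptides.items()
        }
        # w_ij = P(y_i = 1 | S) / sum over parents i' of P(y_i' = 1 | S)
        for j, ps in parents.items():
            total = sum(updated[i] for i in ps)
            for i in ps:
                weights[(i, j)] = updated[i] / total if total > 0 else 1.0 / len(ps)
        delta = max(
            (abs(updated[i] - protein_prob[i]) for i in protein_peptides),
            default=0.0,
        )
        protein_prob = updated
        if tol > delta:  # stop once the probabilities no longer change
            break
    return protein_prob, weights


# Toy input: peptide "s" is shared by P1 and P2, peptide "u" is unique to P1.
probs, w = iterate_weights({"s": 0.9, "u": 0.9}, {"P1": ["s", "u"], "P2": ["s"]})
print(probs, w)
```

<p>In this toy input the weight of the shared peptide shifts toward the protein that also has a unique peptide, which is the parsimony-driven behaviour described above.</p>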
<p>By combining these four steps, with a minor modification to include weights <it>w<sub>ij </sub>
</it>for peptides in the NSP adjustment step, i.e. <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i21"><m:mrow>
   <m:mi>N</m:mi>
   <m:mi>S</m:mi>
   <m:msub>
      <m:mrow>
         <m:mi>P</m:mi>
      </m:mrow>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>j</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:msub>
      <m:mrow>
         <m:mo mathsize="big"> &#8721;</m:mo>
      </m:mrow>
      <m:mrow>
         <m:msup>
            <m:mrow>
               <m:mi>j</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>&#8242;</m:mi>
            </m:mrow>
         </m:msup>
         <m:mo class="MathClass-rel">&#8712;</m:mo>
         <m:mi>N</m:mi>
         <m:mrow>
            <m:mo class="MathClass-open">(</m:mo>
            <m:mrow>
               <m:mi>i</m:mi>
            </m:mrow>
            <m:mo class="MathClass-close">)</m:mo>
         </m:mrow>
         <m:mo class="MathClass-punc">,</m:mo>
         <m:msup>
            <m:mrow>
               <m:mi>j</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>&#8242;</m:mi>
            </m:mrow>
         </m:msup>
         <m:mo class="MathClass-rel">&#8800;</m:mo>
         <m:mi>j</m:mi>
      </m:mrow>
   </m:msub>
   <m:msub>
      <m:mrow>
         <m:mi>w</m:mi>
      </m:mrow>
      <m:mrow>
         <m:mi>i</m:mi>
            <m:msup>
               <m:mrow>
                  <m:mi>j</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mi>&#8242;</m:mi>
               </m:mrow>
            </m:msup>
      </m:mrow>
   </m:msub>
   <m:mo class="MathClass-bin">&#8901;</m:mo>
   <m:mi>P</m:mi>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:msub>
            <m:mrow>
               <m:mi>x</m:mi>
            </m:mrow>
            <m:mrow>
               <m:msup>
                  <m:mrow>
                     <m:mi>j</m:mi>
                  </m:mrow>
                  <m:mrow>
                     <m:mi>&#8242;</m:mi>
                  </m:mrow>
               </m:msup>
            </m:mrow>
         </m:msub>
         <m:mo class="MathClass-rel">=</m:mo>
         <m:mn>1</m:mn>
         <m:mo class="MathClass-rel">|</m:mo>
         <m:mi mathvariant="script">S</m:mi>
      </m:mrow>
      <m:mo class="MathClass-close">)</m:mo>
   </m:mrow>
</m:mrow>
</m:math>
</inline-formula>, protein identification probability <inline-formula>
<graphic file="1471-2105-13-S16-S4-i68.gif"/>
</inline-formula> can be approximated through a variant of the expectation-maximization (EM) iterative process (steps 2-5; Figure <figr fid="F4">4A</figr>). Because indistinguishable proteins remain indistinguishable in ProteinProphet, the grouping strategy is adopted and such proteins are treated as one protein. Therefore, a "group probability", i.e. the probability that any one of the proteins in the group is identified, is reported.</p>
<p>As the first probabilistic inference method for protein identification, ProteinProphet has been very successful and, as part of the Trans-Proteomic Pipeline <abbrgrp>
<abbr bid="B73">73</abbr>
</abbrgrp>, remains the most widely used protein inference tool. Although degenerate peptides are handled by a parsimony-driven weighting procedure, ProteinProphet obtains the weights iteratively, which ultimately yields reasonable probabilities for proteins. Recently, the tool has been improved, mainly at the pre-processing step, by iProphet <abbrgrp>
<abbr bid="B72">72</abbr>
</abbrgrp>. By using the same computational strategy as in the NSP adjustment step of ProteinProphet, iProphet obtains one identification probability for each peptide by aggregating the PSM probabilities of the peptide from multiple search engines, spectra, experiments, charge states, and PTM states.</p>
</sec>
<sec>
<st>
<p>Limitations</p>
</st>
<p>Because ProteinProphet relies on certain strong assumptions, e.g. the parsimony-driven weighting (step 5, Figure <figr fid="F4">4A</figr>), its outputs are not always sensible from a statistical perspective. One such scenario was noticed by the authors <abbrgrp>
<abbr bid="B21">21</abbr>
</abbrgrp>: for a set of proteins with shared peptides, a protein with a unique peptide always dominates the protein(s) without unique peptides, no matter how small the identification probability of that peptide is. In other words, the algorithm assigns score 1 to the protein with a random but unique peptide identification and score 0 to the other proteins. This is undesirable, since real proteomics data sets always contain a large number of random peptide identifications with probabilities close to 0. To address the issue, only peptides with probabilities &#8805;0.2 are used to compute protein probabilities. Similarly, we observed that the inference outcome of ProteinProphet is sensitive to minor changes in peptide probabilities, as illustrated by the simple example in Figure <figr fid="F4">4B</figr>. Consider two homologous proteins <it>P</it>
<sub>1 </sub>and <it>P</it>
<sub>2 </sub>with identified peptides {<it>p</it>
<sub>1</sub>, <it>p</it>
<sub>2</sub>, <it>p</it>
<sub>3</sub>} and {<it>p</it>
<sub>1</sub>, <it>p</it>
<sub>2</sub>, <it>p</it>
<sub>4</sub>}, respectively. Suppose <it>p</it>
<sub>1 </sub>and <it>p</it>
<sub>2 </sub>are reliable identifications, but that <it>p</it>
<sub>3 </sub>and <it>p</it>
<sub>4 </sub>are not, with small identification probabilities. In the seven toy datasets (A-G) in Figure <figr fid="F4">4B</figr>, we varied the identification probability of peptides <it>p</it>
<sub>3 </sub>and <it>p</it>
<sub>4</sub>, and computed the protein probability using ProteinProphet. In data sets A and E, when the probabilities of unique peptides are not larger than 0.5, ProteinProphet considers proteins <it>P</it>
<sub>1 </sub>and <it>P</it>
<sub>2 </sub>indistinguishable, and only reports a group probability; in data set B, when the probability of peptide <it>p</it>
<sub>3 </sub>is slightly larger than that of <it>p</it>
<sub>4 </sub>(which has probability 0.5 or less), ProteinProphet considers protein <it>P</it>
<sub>1 </sub>as much more reliable than <it>P</it>
<sub>2</sub>; in data sets C and G, when the probability of peptide <it>p</it><sub>4 </sub>is (slightly) larger than that of <it>p</it>
<sub>3 </sub>(which has probability 0.5 or less), ProteinProphet considers protein <it>P</it>
<sub>2 </sub>as much more reliable than <it>P</it>
<sub>1</sub>; in data set D, when the probabilities are both larger than 0.5, ProteinProphet considers both proteins to be reliable; while in data set F, when the probability of peptide <it>p</it>
<sub>3 </sub>is 0.2 or less, ProteinProphet suggests that only protein <it>P</it>
<sub>2 </sub>can be the true protein, despite the significant probability that peptide <it>p</it>
<sub>4 </sub>is a random identification. This discontinuity of the inference results is counterintuitive. Naturally, one would expect the probability of protein <it>P</it>
<sub>1 </sub>(<it>P</it>
<sub>2</sub>) to decrease (increase) gradually as the probability of peptide <it>p</it>
<sub>3 </sub>decreases.</p>
<p>Although ProteinProphet applies the parsimony principle to the issue of shared peptides, it uses a probabilistic model and an EM-like algorithm, which distinguishes it from other parsimony-driven methods such as the combinatorial approaches discussed earlier. However, it is not clear how often ProteinProphet arrives at the same solutions as the various combinatorial approaches for proteins with shared peptides. In addition, the inference problem is difficult in the presence of degenerate peptides; it would therefore be interesting to compare the EM-like iterative algorithm used by ProteinProphet with the heuristics used by the combinatorial approaches and examine how efficiently each handles large data sets.</p>
</sec>
</sec>
<sec>
<st>
<p>MSBayesPro</p>
</st>
<p>MSBayesPro <abbrgrp>
<abbr bid="B61">61</abbr>
</abbrgrp> is a fully probabilistic protein inference method that provides "perhaps the most rigorous existing treatment of the peptide degeneracy problem" <abbrgrp>
<abbr bid="B71">71</abbr>
</abbrgrp>. The MSBayesPro model includes peptide detectability in the probabilistic model; thus it can, to some degree, distinguish among "indistinguishable" proteins.</p>
<sec>
<st>
<p>Model structure</p>
</st>
<p>MSBayesPro is a Bayesian network (Figure <figr fid="F5">5</figr>) serving as a generative model for the data. The high-level structure of the network is simple: Proteins &#8594; Peptides &#8594; Spectra, which mimics the experimental protocol in proteomics, in which proteins are first digested into peptides, from which spectra are generated. Hence,</p>
<fig id="F5"><title><p>Figure 5</p></title><caption><p>An example of a Bayesian network used as a generative model in MSBayesPro. Three layers are provided reflecting the LC-MS/MS experiment in which proteins are first digested into peptides which are then matched to the MS/MS spectra. The numbers associated with the directed edges indicate peptide detectabilities (between the first two layers) and PSM identification scores (between the last two layers). Note that some peptides may not be matched with any spectra while some others may be matched with more than one spectrum. The peptides not matched to any spectra are shown using dashed lines.</p></caption><text>
   <p>An example of a Bayesian network used as a generative model in MSBayesPro. Three layers are provided reflecting the LC-MS/MS experiment in which proteins are first digested into peptides which are then matched to the MS/MS spectra. The numbers associated with the directed edges indicate peptide detectabilities (between the first two layers) and PSM identification scores (between the last two layers). Note that some peptides may not be matched with any spectra while some others may be matched with more than one spectrum. The peptides not matched to any spectra are shown using dashed lines.</p>
</text><graphic file="1471-2105-13-S16-S4-5"/></fig>
<p>
<display-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i34"><m:mrow>
   <m:mi>P</m:mi>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:mi>y</m:mi>
         <m:mo class="MathClass-punc">,</m:mo>
         <m:mi>x</m:mi>
         <m:mo class="MathClass-punc">,</m:mo>
         <m:mi mathvariant="script">S</m:mi>
      </m:mrow>
      <m:mo class="MathClass-close">)</m:mo>
   </m:mrow>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:mi>P</m:mi>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:mi>y</m:mi>
      </m:mrow>
      <m:mo class="MathClass-close">)</m:mo>
   </m:mrow>
   <m:mi>P</m:mi>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:mi>x</m:mi>
         <m:mo class="MathClass-rel">|</m:mo>
         <m:mi>y</m:mi>
      </m:mrow>
      <m:mo class="MathClass-close">)</m:mo>
   </m:mrow>
   <m:mi>P</m:mi>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:mi mathvariant="script">S</m:mi>
         <m:mo class="MathClass-rel">|</m:mo>
         <m:mi>x</m:mi>
      </m:mrow>
      <m:mo class="MathClass-close">)</m:mo>
   </m:mrow>
   <m:mo class="MathClass-rel">&#8733;</m:mo>
   <m:mi>P</m:mi>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:mi>y</m:mi>
      </m:mrow>
      <m:mo class="MathClass-close">)</m:mo>
   </m:mrow>
   <m:mi>P</m:mi>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:mi>x</m:mi>
         <m:mo class="MathClass-rel">|</m:mo>
         <m:mi>y</m:mi>
      </m:mrow>
      <m:mo class="MathClass-close">)</m:mo>
   </m:mrow>
   <m:mi>P</m:mi>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:mi>x</m:mi>
         <m:mo class="MathClass-rel">|</m:mo>
         <m:mi mathvariant="script">S</m:mi>
      </m:mrow>
      <m:mo class="MathClass-close">)</m:mo>
   </m:mrow>
   <m:mo class="MathClass-punc">,</m:mo>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>where <it>y </it>is a vector of random indicator variables for all candidate proteins, <it>x </it>is a vector of random indicator variables representing <it>all </it>peptides from those proteins, and <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i33"><m:mrow>
   <m:mi mathvariant="script">S</m:mi>
</m:mrow>
</m:math>
</inline-formula> represents the data, i.e. all the spectra generated in the experiment. The Peptides &#8594; Spectra associations are defined by the available PSM scores (or probabilities). The Proteins &#8594; Peptides connections, however, are determined by the sequences of the peptides and candidate proteins. If the sequence of peptide <it>p<sub>j </sub>
</it>can be exactly mapped to protein <it>P<sub>i</sub>
</it>, there is an edge pointing from the protein node <it>i </it>to peptide node <it>j </it>in the network. This is similar to the structure of the model used in ProteinProphet, although the latter is not a Bayesian network. There is, however, an important difference between the two: in MSBayesPro all peptides, identified and unidentified, are included in the network structure, whereas unidentified peptides are ignored in ProteinProphet and in other Bayesian network models <abbrgrp>
<abbr bid="B69">69</abbr>
<abbr bid="B71">71</abbr>
</abbrgrp> proposed subsequently. Other than the simplification of the model structure, we believe there is no legitimate justification for excluding unidentified peptides from a probabilistic model. Such peptides will have the identification probability <inline-formula>
<graphic file="1471-2105-13-S16-S4-i87.gif"/>
</inline-formula>; thus <it>x<sub>j </sub>
</it>= 0 is guaranteed in the inference step. We note that it is these unidentified peptides that, together with the peptide detectability information, will lead to tie resolution between grouped proteins and improve the scoring of proteins hit by single peptides.</p>
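<p>The construction of the Proteins &#8594; Peptides edges can be sketched as follows. This is an illustrative example only: it assumes that "exactly mapped" can be approximated by plain substring matching, the sequences are made up, and the function name is hypothetical. Note that both identified and unidentified peptides receive edges.</p>

```python
def build_edges(proteins, peptides):
    """Connect each protein to every candidate peptide found in its sequence.

    proteins: dict protein name -> amino acid sequence
    peptides: list of candidate peptide sequences (identified or not)
    Returns (children, parents): N(i) per protein and N(j) per peptide.
    """
    # Assumption: an exact substring match stands in for "exactly mapped".
    children = {
        name: [pep for pep in peptides if pep in seq]
        for name, seq in proteins.items()
    }
    parents = {
        pep: [name for name, seq in proteins.items() if pep in seq]
        for pep in peptides
    }
    return children, parents


# Made-up sequences: two homologous proteins that differ in their last peptide.
proteins = {"P1": "AAAKCCCKDDDK", "P2": "AAAKCCCKEEEK"}
peptides = ["AAAK", "CCCK", "DDDK", "EEEK"]
children, parents = build_edges(proteins, peptides)
print(children)
print(parents)
```

<p>Peptides with more than one parent are the degenerate peptides; peptides that were never matched to a spectrum simply stay in the graph with identification probability zero.</p>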
<p>The MSBayesPro model has an important property: peptide identifications are conditionally independent given the presence of their parent proteins (Figure <figr fid="F5">5</figr>). This is not to be confused with the independence assumption of peptide identification used in ProteinProphet. In fact, the conditional independence assumption in MSBayesPro leads to marginally dependent peptide identifications whenever two peptides share parent proteins, either directly or indirectly through other peptide/protein nodes (that is, whenever the two peptides are in the same connected component of the graph). Furthermore, the conditional independence assumption is consistent with the LC-MS/MS experiment. Consider a protein <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i44">
<m:mrow>
<m:msub>
<m:mi>P</m:mi>
<m:mi>i</m:mi>
</m:msub>
</m:mrow>
</m:math>
</inline-formula> that is in the sample at some known abundance <it>q<sub>i</sub>
</it>. Then, knowing that one peptide from this protein has already been identified provides no further information on whether another peptide from the same protein will be identified by MS/MS. With conditional independence, we can factorize the joint probability of the set of peptides <it>N</it>(<it>i</it>) (both the identified ones and those that are not) from protein <it>i </it>as</p>
<p>
<display-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i22"><m:mrow>
   <m:mi>P</m:mi>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:msub>
            <m:mrow>
               <m:mi>x</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>N</m:mi>
               <m:mrow>
                  <m:mo class="MathClass-open">(</m:mo>
                  <m:mrow>
                     <m:mi>i</m:mi>
                  </m:mrow>
                  <m:mo class="MathClass-close">)</m:mo>
               </m:mrow>
            </m:mrow>
         </m:msub>
         <m:mo class="MathClass-rel">|</m:mo>
         <m:msub>
            <m:mrow>
               <m:mi>y</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>i</m:mi>
            </m:mrow>
         </m:msub>
         <m:mo class="MathClass-rel">=</m:mo>
         <m:mn>1</m:mn>
         <m:mo class="MathClass-punc">,</m:mo>
         <m:msub>
            <m:mrow>
               <m:mi>y</m:mi>
            </m:mrow>
            <m:mrow>
               <m:msup>
                  <m:mrow>
                     <m:mi>i</m:mi>
                  </m:mrow>
                  <m:mrow>
                     <m:mi>&#8242;</m:mi>
                  </m:mrow>
               </m:msup>
               <m:mo class="MathClass-rel">&#8800;</m:mo>
               <m:mi>i</m:mi>
            </m:mrow>
         </m:msub>
         <m:mo class="MathClass-rel">=</m:mo>
         <m:mn>0</m:mn>
         <m:mo class="MathClass-punc">,</m:mo>
         <m:msub>
            <m:mrow>
               <m:mi>q</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>i</m:mi>
            </m:mrow>
         </m:msub>
         <m:mo class="MathClass-rel">=</m:mo>
         <m:mi>q</m:mi>
      </m:mrow>
      <m:mo class="MathClass-close">)</m:mo>
   </m:mrow>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:msub>
      <m:mrow>
         <m:mo mathsize="big"> &#8719;</m:mo>
      </m:mrow>
      <m:mrow>
         <m:mi>j</m:mi>
         <m:mo class="MathClass-rel">&#8712;</m:mo>
         <m:mi>N</m:mi>
         <m:mrow>
            <m:mo class="MathClass-open">(</m:mo>
            <m:mrow>
               <m:mi>i</m:mi>
            </m:mrow>
            <m:mo class="MathClass-close">)</m:mo>
         </m:mrow>
      </m:mrow>
   </m:msub>
   <m:mi>P</m:mi>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:msub>
            <m:mrow>
               <m:mi>x</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>j</m:mi>
            </m:mrow>
         </m:msub>
         <m:mo class="MathClass-rel">|</m:mo>
         <m:msub>
            <m:mrow>
               <m:mi>y</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>i</m:mi>
            </m:mrow>
         </m:msub>
         <m:mo class="MathClass-rel">=</m:mo>
         <m:mn>1</m:mn>
         <m:mo class="MathClass-punc">,</m:mo>
         <m:msub>
            <m:mrow>
               <m:mi>y</m:mi>
            </m:mrow>
            <m:mrow>
               <m:msup>
                  <m:mrow>
                     <m:mi>i</m:mi>
                  </m:mrow>
                  <m:mrow>
                     <m:mi>&#8242;</m:mi>
                  </m:mrow>
               </m:msup>
               <m:mo class="MathClass-rel">&#8800;</m:mo>
               <m:mi>i</m:mi>
            </m:mrow>
         </m:msub>
         <m:mo class="MathClass-rel">=</m:mo>
         <m:mn>0</m:mn>
         <m:mo class="MathClass-punc">,</m:mo>
         <m:msub>
            <m:mrow>
               <m:mi>q</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>i</m:mi>
            </m:mrow>
         </m:msub>
         <m:mo class="MathClass-rel">=</m:mo>
         <m:mi>q</m:mi>
      </m:mrow>
      <m:mo class="MathClass-close">)</m:mo>
   </m:mrow>
   <m:mo class="MathClass-punc">,</m:mo>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>where <it>q<sub>i </sub>
</it>is the abundance of protein <it>P<sub>i</sub>
</it>.</p>
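<p>As a small numerical illustration of this factorization, the sketch below evaluates the product for one protein. It assumes that each factor is a Bernoulli probability with a known parameter <it>P</it>(<it>x<sub>j</sub></it> = 1 | <it>y<sub>i</sub></it> = 1, <it>q<sub>i</sub></it> = <it>q</it>); the parameter values and the function name are hypothetical.</p>

```python
from itertools import product
from math import prod


def peptide_config_prob(x, d):
    """P(x_N(i) | y_i = 1, all other proteins absent, q_i = q).

    x: dict peptide -> 0/1 indicator for every peptide in N(i)
    d: dict peptide -> assumed parameter P(x_j = 1 | y_i = 1, q_i = q)
    """
    # Conditional independence turns the joint into a product of
    # one Bernoulli term per peptide.
    return prod(d[j] if x[j] else 1.0 - d[j] for j in d)


# Hypothetical parameters for three peptides of one protein.
d = {"p1": 0.8, "p2": 0.5, "p3": 0.1}
observed = {"p1": 1, "p2": 1, "p3": 0}  # p3 is an unidentified peptide
print(peptide_config_prob(observed, d))

# Sanity check: the probabilities of all 2^3 configurations sum to one.
total = sum(
    peptide_config_prob(dict(zip(d, bits)), d)
    for bits in product((0, 1), repeat=len(d))
)
print(total)
```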
</sec>
<sec>
<st>
<p>Model inputs and parameters</p>
</st>
<p>MSBayesPro requires peptide identification likelihood ratios and a set of peptide detectabilities. The former are a required input to the method. The latter are parameters of MSBayesPro that can be provided as an input or, ideally, estimated via a machine learning model from the same data set on which protein inference is carried out <abbrgrp>
<abbr bid="B24">24</abbr>
<abbr bid="B61">61</abbr>
</abbrgrp>.</p>
<p>For peptide identifications, the input to MSBayesPro is the likelihood ratios <inline-formula>
<graphic file="1471-2105-13-S16-S4-i94.gif"/>
</inline-formula> rather than the peptide identification probabilities <inline-formula>
<graphic file="1471-2105-13-S16-S4-i92.gif"/>
</inline-formula> that implicitly include a uniform prior <abbrgrp>
<abbr bid="B60">60</abbr>
<abbr bid="B61">61</abbr>
</abbrgrp>. Here, the original peptide-invariant class priors used to compute peptide identification probabilities are replaced in MSBayesPro by detectabilities, which depend on the peptide sequence and protein abundance and are therefore more informative priors. We note that this treatment is somewhat related to the NSP adjustment in ProteinProphet, which essentially changes the prior to incorporate information from the NSP values (interestingly, NSP values may roughly reflect protein abundances, in a similar way to effective detectability). Unlike detectability, however, NSP is not specific to the sequence of a peptide.</p>
<p>The use of peptide detectability is an important distinguishing feature of MSBayesPro. Detectability is required to build the conditional distribution tables between the Protein and Peptide layers and subsequently to compute the posterior probabilities for the proteins. However, to use detectability properly it is important to consider the effect of protein quantity (Box 1). Li et al. <abbrgrp>
<abbr bid="B60">60</abbr>
</abbrgrp> proposed a quantity adjustment formula to convert <it>standard peptide detectability </it>
<inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i23"><m:msubsup>
   <m:mrow>
      <m:mi>d</m:mi>
   </m:mrow>
   <m:mrow>
      <m:mi>i</m:mi>
      <m:mi>j</m:mi>
   </m:mrow>
   <m:mrow>
      <m:mn>0</m:mn>
   </m:mrow>
</m:msubsup>
<m:mo class="MathClass-rel">=</m:mo>
<m:mi>P</m:mi>
<m:mrow>
   <m:mo class="MathClass-open">(</m:mo>
   <m:mrow>
      <m:msub>
         <m:mrow>
            <m:mi>x</m:mi>
         </m:mrow>
         <m:mrow>
            <m:mi>j</m:mi>
         </m:mrow>
      </m:msub>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
      <m:mo class="MathClass-rel">|</m:mo>
      <m:msub>
         <m:mrow>
            <m:mi>y</m:mi>
         </m:mrow>
         <m:mrow>
            <m:mi>i</m:mi>
         </m:mrow>
      </m:msub>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
      <m:mo class="MathClass-punc">,</m:mo>
      <m:mspace class="thinspace" width="0.3em"/>
      <m:mspace class="thinspace" width="0.3em"/>
      <m:msub>
         <m:mrow>
            <m:mi>q</m:mi>
         </m:mrow>
         <m:mrow>
            <m:mi>i</m:mi>
         </m:mrow>
      </m:msub>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:msup>
         <m:mrow>
            <m:mi>q</m:mi>
         </m:mrow>
         <m:mrow>
            <m:mn>0</m:mn>
         </m:mrow>
      </m:msup>
   </m:mrow>
   <m:mo class="MathClass-close">)</m:mo>
</m:mrow>
</m:math>
</inline-formula> to <it>effective detectability </it>
<inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i24"><m:msub>
   <m:mrow>
      <m:mi>d</m:mi>
   </m:mrow>
   <m:mrow>
      <m:mi>i</m:mi>
      <m:mi>j</m:mi>
   </m:mrow>
</m:msub>
<m:mrow>
   <m:mo class="MathClass-open">(</m:mo>
   <m:mrow>
      <m:mi>q</m:mi>
   </m:mrow>
   <m:mo class="MathClass-close">)</m:mo>
</m:mrow>
<m:mo class="MathClass-rel">=</m:mo>
<m:mi>P</m:mi>
<m:mrow>
   <m:mo class="MathClass-open">(</m:mo>
   <m:mrow>
      <m:msub>
         <m:mrow>
            <m:mi>x</m:mi>
         </m:mrow>
         <m:mrow>
            <m:mi>j</m:mi>
         </m:mrow>
      </m:msub>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
      <m:mo class="MathClass-rel">|</m:mo>
      <m:msub>
         <m:mrow>
            <m:mi>y</m:mi>
         </m:mrow>
         <m:mrow>
            <m:mi>i</m:mi>
         </m:mrow>
      </m:msub>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
      <m:mo class="MathClass-punc">,</m:mo>
      <m:msub>
         <m:mrow>
            <m:mi>q</m:mi>
         </m:mrow>
         <m:mrow>
            <m:mi>i</m:mi>
         </m:mrow>
      </m:msub>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mi>q</m:mi>
   </m:mrow>
   <m:mo class="MathClass-close">)</m:mo>
</m:mrow>
</m:math>
</inline-formula>, where <it>q<sub>i</sub>
</it>, the quantity of protein <it>P<sub>i</sub>
</it>, is estimated by the maximum likelihood or moment matching approaches. If a (degenerate) peptide <it>p<sub>j </sub>
</it>is shared by multiple proteins, the network structure requires combining detectabilities <it>d<sub>ij </sub>
</it>over all parent proteins of <it>p<sub>j</sub>
</it>. Here, MSBayesPro assumes that <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i25"><m:msub>
   <m:mrow>
      <m:mi>d</m:mi>
   </m:mrow>
   <m:mrow>
      <m:mi>j</m:mi>
   </m:mrow>
</m:msub>
<m:mo class="MathClass-rel">=</m:mo>
<m:mn>1</m:mn>
<m:mo class="MathClass-bin">-</m:mo>
<m:msub>
   <m:mrow>
      <m:mo class="MathClass-op">&#8719;</m:mo>
   </m:mrow>
   <m:mrow>
      <m:mi>i</m:mi>
      <m:mo class="MathClass-rel">&#8712;</m:mo>
      <m:mi>N</m:mi>
      <m:mrow>
         <m:mo class="MathClass-open">(</m:mo>
         <m:mrow>
            <m:mi>j</m:mi>
         </m:mrow>
         <m:mo class="MathClass-close">)</m:mo>
      </m:mrow>
   </m:mrow>
</m:msub>
<m:mi>P</m:mi>
<m:mrow>
   <m:mo class="MathClass-open">(</m:mo>
   <m:mrow>
      <m:msub>
         <m:mrow>
            <m:mi>x</m:mi>
         </m:mrow>
         <m:mrow>
            <m:mi>j</m:mi>
         </m:mrow>
      </m:msub>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>0</m:mn>
      <m:mo class="MathClass-rel">|</m:mo>
      <m:msub>
         <m:mrow>
            <m:mi>y</m:mi>
         </m:mrow>
         <m:mrow>
            <m:mi>i</m:mi>
         </m:mrow>
      </m:msub>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
      <m:mo class="MathClass-punc">,</m:mo>
      <m:mspace class="thinspace" width="0.3em"/>
      <m:mspace class="thinspace" width="0.3em"/>
      <m:msub>
         <m:mrow>
            <m:mi>q</m:mi>
         </m:mrow>
         <m:mrow>
            <m:mi>i</m:mi>
         </m:mrow>
      </m:msub>
   </m:mrow>
   <m:mo class="MathClass-close">)</m:mo>
</m:mrow>
<m:mo class="MathClass-rel">=</m:mo>
<m:mn>1</m:mn>
<m:mo class="MathClass-bin">-</m:mo>
<m:msub>
   <m:mrow>
      <m:mo class="MathClass-op">&#8719;</m:mo>
   </m:mrow>
   <m:mrow>
      <m:mi>i</m:mi>
      <m:mo class="MathClass-rel">&#8712;</m:mo>
      <m:mi>N</m:mi>
      <m:mrow>
         <m:mo class="MathClass-open">(</m:mo>
         <m:mrow>
            <m:mi>j</m:mi>
         </m:mrow>
         <m:mo class="MathClass-close">)</m:mo>
      </m:mrow>
   </m:mrow>
</m:msub>
<m:mrow>
   <m:mo class="MathClass-open">(</m:mo>
   <m:mrow>
      <m:mn>1</m:mn>
      <m:mo class="MathClass-bin">-</m:mo>
      <m:msub>
         <m:mrow>
            <m:mi>d</m:mi>
         </m:mrow>
         <m:mrow>
            <m:mi>i</m:mi>
            <m:mi>j</m:mi>
         </m:mrow>
      </m:msub>
   </m:mrow>
   <m:mo class="MathClass-close">)</m:mo>
</m:mrow>
</m:math>
</inline-formula>. Alternative approaches to combining multiple detectabilities may also work, but the key intuition is the following: if multiple parent proteins of a given peptide are all present in the sample, the detectability of the peptide should be larger than its detectability from any of the individual proteins alone. This treatment permits a non-parsimonious solution, because a degenerate peptide is allowed to come from more than one parent protein.</p>
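<p>The combination rule above has the form of a noisy-OR and can be checked with a few lines of code. The sketch below is only an illustration with made-up detectability values; the function name is hypothetical, and the input is assumed to list the effective detectabilities <it>d<sub>ij</sub></it> for the parent proteins that are present.</p>

```python
from math import prod


def combined_detectability(d_parents):
    """Noisy-OR combination: d_j = 1 - prod_i (1 - d_ij).

    d_parents: effective detectabilities d_ij of peptide j, one value per
               parent protein that is present in the sample.
    """
    return 1.0 - prod(1.0 - d for d in d_parents)


# Made-up values: one peptide shared by two present proteins.
single = combined_detectability([0.5])
shared = combined_detectability([0.5, 0.5])
print(single, shared)
```

<p>With a single parent the rule returns that parent's detectability, and every additional present parent can only increase the combined value, which matches the intuition stated above.</p>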
</sec>
<sec>
<st>
<p>Inference algorithms</p>
</st>
<p>With the Bayesian network model structure and parameters specified, it is in principle easy to exactly compute the joint posterior probability for the proteins, i.e. <inline-formula>
<graphic file="1471-2105-13-S16-S4-i95.gif"/>
</inline-formula>. An optimal solution for the presence of all proteins (the maximum <it>a posteriori </it>configuration) is computed as <inline-formula>
<graphic file="1471-2105-13-S16-S4-i96.gif"/>
</inline-formula>. The joint posterior probability can be further marginalized to compute <inline-formula>
<graphic file="1471-2105-13-S16-S4-i97.gif"/>
</inline-formula> for the presence of each individual protein in the sample. In practice, this is not always possible due to the prohibitive time complexity, i.e. the inference on Bayesian networks is NP-hard in general <abbrgrp>
<abbr bid="B74">74</abbr>
</abbrgrp>. MSBayesPro uses Gibbs sampling instead of exact computation when a connected component in the Bayesian network is large (it is easy to show that connected components should be considered separately).</p>
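<p>To make the exact computation concrete, the following Python sketch enumerates all protein configurations of one small connected component and returns the maximum <it>a posteriori </it>configuration together with the marginal posteriors. It is a deliberately simplified toy model and not the MSBayesPro model itself: we assume binary peptide observations (identified or not) instead of peptide scores, independent protein priors, and the product rule for combining detectabilities; all names and numbers are hypothetical. The enumeration is exponential in the number of proteins, which is why it is only feasible for small components.</p>
<p><![CDATA[
```python
from itertools import product
from math import prod


def exact_posterior(prior, detect, observed):
    """Exact inference by enumeration on one small connected component.

    prior    -- dict: protein -> prior probability of presence
    detect   -- dict: peptide -> dict(parent protein -> detectability d_ij)
    observed -- dict: peptide -> True if the peptide was identified
    Returns (MAP configuration, dict of marginal posterior probabilities).
    """
    proteins = sorted(prior)
    weights = {}
    for config in product([0, 1], repeat=len(proteins)):
        y = dict(zip(proteins, config))
        # Prior term: independent Bernoulli prior for each protein (assumption).
        w = prod(prior[p] if y[p] else 1.0 - prior[p] for p in proteins)
        for pep, parents in detect.items():
            # Probability that the peptide is seen given the present parents.
            p_seen = 1.0 - prod(1.0 - d for prot, d in parents.items() if y[prot])
            w *= p_seen if observed[pep] else 1.0 - p_seen
        weights[config] = w
    z = sum(weights.values())
    if z == 0.0:
        raise ValueError("observations have zero probability under the model")
    map_config = max(weights, key=weights.get)
    marginals = {
        p: sum(w for c, w in weights.items() if c[i]) / z
        for i, p in enumerate(proteins)
    }
    return dict(zip(proteins, map_config)), marginals


# Hypothetical component: pep2 is shared, pep3 (unique to B) was not identified.
prior = {"A": 0.5, "B": 0.5}
detect = {"pep1": {"A": 0.8}, "pep2": {"A": 0.6, "B": 0.5}, "pep3": {"B": 0.9}}
observed = {"pep1": True, "pep2": True, "pep3": False}
map_solution, marginals = exact_posterior(prior, detect, observed)
```
]]></p>
<p>In this example the unidentified but highly detectable peptide of protein B lowers the posterior probability of B, illustrating how unidentified peptides can help resolve a shared peptide.</p>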
<p>It is important to note that MSBayesPro also reports estimated protein quantities and the marginal posterior probabilities for peptides, which provide better scores for measuring peptide confidence <abbrgrp>
<abbr bid="B61">61</abbr>
</abbrgrp>. Thus, in its core, MSBayesPro is also a label-free quantification algorithm. Further generalization of the MSBayesPro model has been suggested to unify the peptide and protein identification problems and perform higher-level inference on genes and pathways based on proteomics data <abbrgrp>
<abbr bid="B75">75</abbr>
</abbrgrp>.</p>
</sec>
<sec>
<st>
<p>Limitations</p>
</st>
<p>The use of peptide detectability is both a strength and a limitation of MSBayesPro. The method requires accurate detectability predictions in order to achieve good performance <abbrgrp>
<abbr bid="B24">24</abbr>
</abbrgrp>. However, prediction of detectability for non-tryptic peptides and post-translationally modified peptides is not yet a fully solved problem, which limits the applicability of MSBayesPro. In addition, detectabilities cannot be expected to resolve ties between proteins with nearly identical sequences. These cases, however, reveal the limits of shotgun proteomics experiments and should be addressed by follow-up experiments such as well-designed targeted proteomics experiments. Another limitation is computational complexity: efficient approximation algorithms are necessary for MSBayesPro to work on very large data sets.</p>
</sec>
</sec>
<sec>
<st>
<p>The Fido model</p>
</st>
<p>The Fido model <abbrgrp>
<abbr bid="B71">71</abbr>
<abbr bid="B76">76</abbr>
</abbrgrp> uses a Bayesian network, but was primarily designed for fast inference. The major contribution of this method consists of two graph transformations applied to each connected component: collapsing protein nodes that are connected to identical sets of peptides, and pruning spectral nodes (with user-specified parameters), which splits the connected components. Both transformations facilitate tradeoffs between the accuracy and speed of the inference step. Fido also allows the application of advanced probabilistic inference algorithms, e.g. the junction tree algorithm, which significantly improve protein inference on large graphs.</p>
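<p>The first transformation, collapsing protein nodes that share exactly the same peptides, can be sketched as follows. This is a minimal illustration under our own assumptions (a plain dictionary from protein to peptide set, and the hypothetical function name <it>collapse_proteins</it>); it is not the Fido source code and does not cover the pruning step.</p>
<p><![CDATA[
```python
from collections import defaultdict


def collapse_proteins(protein_to_peptides):
    """Merge proteins that are connected to exactly the same set of peptides.

    Returns a dict mapping each merged node (a sorted tuple of protein names)
    to the peptide set shared by all of its members.
    """
    groups = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        groups[frozenset(peptides)].append(protein)
    # One merged node per distinct peptide set; members sorted for stable output.
    return {tuple(sorted(members)): set(peps) for peps, members in groups.items()}


# Hypothetical input: P1 and P2 are indistinguishable by their peptides.
graph = {"P1": {"a", "b"}, "P2": {"b", "a"}, "P3": {"b", "c"}}
collapsed = collapse_proteins(graph)
```
]]></p>
<p>After collapsing, the inference only needs one variable per merged node, which reduces the size of each connected component without changing which peptides support it.</p>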
<p>There are two major differences in the Bayesian network models used by Fido and MSBayesPro. First, unidentified peptides are ignored in Fido and a sequence-independent parameter is used as a replacement for peptide detectability (Figure <figr fid="F6">6</figr>). Hence, the resulting Bayesian network is simpler and inference is faster. Second, another parameter <it>&#946; </it>is introduced to the model, which is the prior probability for a peptide to be identified from an artificial "noise" node. This addresses the situation where input peptide probabilities are not accurate (e.g. many incorrect peptides are assigned high probability). We believe this is a legitimate remedy for disasters that can happen during the peptide probability estimation. However, parameter <it>&#946; </it>seems to be redundant given that <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i26"><m:msup>
   <m:mrow>
      <m:mrow>
         <m:mo class="MathClass-open">(</m:mo>
         <m:mrow>
            <m:mn>1</m:mn>
            <m:mo class="MathClass-bin">-</m:mo>
            <m:mi>&#945;</m:mi>
         </m:mrow>
         <m:mo class="MathClass-close">)</m:mo>
      </m:mrow>
   </m:mrow>
   <m:mrow>
      <m:mo class="MathClass-rel">|</m:mo>
      <m:mi>N</m:mi>
      <m:mrow>
         <m:mo class="MathClass-open">(</m:mo>
         <m:mrow>
            <m:mi>j</m:mi>
         </m:mrow>
         <m:mo class="MathClass-close">)</m:mo>
      </m:mrow>
      <m:mo class="MathClass-rel">|</m:mo>
   </m:mrow>
</m:msup>
</m:math>
</inline-formula> is the probability for a peptide <it>p<sub>j </sub>
</it>to be identified from "noise". The authors indeed observed strong inverse correlation between the optimal values of <it>&#945; </it>and <it>&#946;</it>.</p>
<fig id="F6"><title><p>Figure 6</p></title><caption><p>An example of a Bayesian network used in the Fido model. Three layers mimic the LC-MS/MS experiment in which proteins are first digested into peptides which are then matched to the MS/MS spectra. The numbers associated with the directed edges correspond to PSM identification scores between the peptide and spectrum layers, while &#945;, &#946;, and &#947; are the three parameters used in Fido. Note that unidentified peptides are not modeled.</p></caption><text>
   <p>An example of a Bayesian network used in the Fido model. Three layers mimic the LC-MS/MS experiment in which proteins are first digested into peptides which are then matched to the MS/MS spectra. The numbers associated with the directed edges correspond to PSM identification scores between the peptide and spectrum layers, while &#945;, &#946;, and &#947; are the three parameters used in Fido. Note that unidentified peptides are not modeled.</p>
</text><graphic file="1471-2105-13-S16-S4-6"/></fig>
<p>One limitation of the Fido model is that it requires a decoy (randomized) database to find the best values of the parameters (<inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i98"><m:mi>&#160;&#945;</m:mi>
</m:math>
</inline-formula>, <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i99"><m:mi>&#160;&#946;</m:mi>
</m:math>
</inline-formula>, and <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i100"><m:mi>&#160;&#947;</m:mi>
</m:math>
</inline-formula>, the latter being the prior for the presence of proteins) by combining an ROC optimization (in a supervised manner) with FDR estimation. Some versions of this approach may lead to overly optimistic performance estimates. A decoy database-independent maximum likelihood approach may be an alternative for fitting the parameters. Finally, the parameter optimization step dramatically increases the run time of the algorithm (up to 2000 times), which compromises the overall speed of the method <abbrgrp>
<abbr bid="B71">71</abbr>
</abbrgrp>.</p>
</sec>
<sec>
<st>
<p>Other probabilistic approaches</p>
</st>
<p>Yang et al. recently investigated protein inference from an information retrieval (IR) point of view <abbrgrp>
<abbr bid="B68">68</abbr>
</abbrgrp>. This work is interesting because it applies methods from the IR field to the protein identification problem in proteomics. The authors found that the Prob-OR score, which is similar to ProteinProphet without the two adjustment steps, performs dramatically worse than the Prob-AND score, which is related to the protein posterior probabilities computed by MSBayesPro if degenerate peptides were treated as unique to each parent protein. We emphasize that the IR method proposed by Yang et al. is inherently a ranking approach rather than an inference approach; hence, it does not directly address the shared peptide issue as do the other probabilistic approaches discussed above.</p>
<p>Gerster et al. <abbrgrp>
<abbr bid="B69">69</abbr>
</abbrgrp> recently reported a new probabilistic approach, Markovian Inference of Proteins and Gene Models (MIPGEM), that is similar to MSBayesPro and Fido. MIPGEM models peptide probabilities as random variables as in some previous approaches <abbrgrp>
<abbr bid="B66">66</abbr>
</abbrgrp> and assumes conditional independence between peptide scores given their parent proteins (Markovian assumption). Similar to the Fido model, MIPGEM does not consider peptide detectability or unidentified peptides although the authors suggested that including detectability would be a future consideration. Table <tblr tid="T2">2</tblr> provides a summary of the major probabilistic inference methods. Several other methods are reviewed in <abbrgrp>
<abbr bid="B35">35</abbr>
<abbr bid="B36">36</abbr>
<abbr bid="B37">37</abbr>
</abbrgrp>.</p>
<tbl id="T2"><title><p>Table 2</p></title><caption><p>A comparison between different probabilistic protein inference algorithms.</p></caption><tblbdy cols="5">
      <r>
         <c ca="left">
            <p>
               <b>Methods</b>
            </p>
         </c>
         <c ca="left">
            <p>
               <b>ProteinProphet</b>
            </p>
         </c>
         <c ca="left">
            <p>
               <b>MSBayesPro</b>
            </p>
         </c>
         <c ca="left">
            <p>
               <b>Fido</b>
            </p>
         </c>
         <c ca="left">
            <p>
               <b>MIPGEM</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Underlying graph structure</p>
         </c>
         <c ca="left">
            <p>Bipartite graph with identified peptides and matching proteins<sup>1</sup></p>
         </c>
         <c ca="left">
            <p>Bayesian network with all peptides from proteins with at least one identified peptide</p>
         </c>
         <c ca="left">
            <p>Bayesian network with identified peptides and matching proteins</p>
         </c>
         <c ca="left">
            <p>k-partite graph with identified peptides, matching proteins and (optionally) matching gene models<sup>2</sup></p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Inference algorithm</p>
         </c>
         <c ca="left">
            <p>EM (Expectation Maximization) like</p>
         </c>
         <c ca="left">
            <p>1) Exact<sup>3</sup>;</p>
            <p>2) Memorizing-Gibbs sampling</p>
         </c>
         <c ca="left">
            <p>1) Exact<sup>3</sup>;</p>
            <p>2) Pruning approximation</p>
         </c>
         <c ca="left">
            <p>1) Exact<sup>3</sup>;</p>
            <p>2) Direct sampling</p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Input</p>
         </c>
         <c ca="left">
            <p>Probabilities for peptides with user-defined cutoff for <it>p </it>(often <it>p </it>&gt; 0.05 is used)</p>
         </c>
         <c ca="left">
            <p>Likelihood ratios for peptides with <it>p </it>&gt; 0.05 and peptide detectabilities</p>
         </c>
         <c ca="left">
            <p>Likelihood ratios for peptides with <it>p </it>&gt; 0.05</p>
         </c>
         <c ca="left">
            <p>Probabilities for peptides with user-defined cutoff for <it>p </it>(often <it>p </it>&gt; 0.05 is used; 0.9 for best performance)</p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Output</p>
         </c>
         <c ca="left">
            <p>1) Protein probabilities;</p>
            <p>2) Protein group probabilities;</p>
            <p>3) NSP adjusted peptide probabilities</p>
         </c>
         <c ca="left">
            <p>1) MAP solution, protein abundances and probabilities;</p>
            <p>2) Protein group probabilities;</p>
            <p>3) Posterior peptide probabilities</p>
         </c>
         <c ca="left">
            <p>1) Protein probabilities;</p>
            <p>2) Protein group probabilities</p>
         </c>
         <c ca="left">
            <p>1) Protein probabilities;</p>
            <p>2) Gene model probabilities</p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Protein prior estimation</p>
         </c>
         <c ca="left">
            <p>No protein priors</p>
         </c>
         <c ca="left">
            <p>Direct frequency estimation based on protein posterior probabilities in one run of MSBayesPro</p>
         </c>
         <c ca="left">
            <p>Grid search optimizing cross-validation performance through multi-runs of Fido with different priors</p>
         </c>
         <c ca="left">
            <p>Grid search optimizing model likelihood through multi-runs of the MIPGEM with different priors</p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Peptide probability adjustment by</p>
         </c>
         <c ca="left">
            <p>NSP from a parent protein</p>
         </c>
         <c ca="left">
            <p>Protein quantity adjusted peptide detectability</p>
         </c>
         <c ca="left">
            <p>Two detectability-like parameters <it>&#945;</it>, <it>&#946;</it></p>
         </c>
         <c ca="left">
            <p>Treating peptide identifications as random variables</p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Protein grouping</p>
         </c>
         <c ca="left">
            <p>Yes</p>
         </c>
         <c ca="left">
            <p>No (indistinguishable proteins are resolved)</p>
         </c>
         <c ca="left">
            <p>Yes</p>
         </c>
         <c ca="left">
            <p>No (indistinguishable proteins are not resolved)</p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Peptide charge</p>
         </c>
         <c ca="left">
            <p>Considered</p>
         </c>
         <c ca="left">
            <p>Ignored</p>
         </c>
         <c ca="left">
            <p>Considered</p>
         </c>
         <c ca="left">
            <p>Considered</p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Novel aspects</p>
         </c>
         <c ca="left">
            <p>1) First probabilistic protein inference algorithm;</p>
            <p>2) Efficient EM algorithm</p>
         </c>
         <c ca="left">
            <p>1) A Bayesian network;</p>
            <p>2) Resolves indistinguishable proteins using unidentified peptides and peptide detectability;</p>
            <p>3) Modified Gibbs sampling</p>
         </c>
         <c ca="left">
            <p>1) Using a noise model to remedy inaccurate peptide probabilities;</p>
            <p>2) Pruning algorithm, efficient inference</p>
         </c>
         <c ca="left">
            <p>Gene model probabilities<sup>4</sup></p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Availability</p>
         </c>
         <c ca="left">
            <p>http://tools.proteomecenter.org</p>
         </c>
         <c ca="left">
            <p>http://darwin.informatics.indiana.edu/yonli/</p>
         </c>
         <c ca="left">
            <p>http://noble.gs.washington.edu/proj/fido</p>
         </c>
         <c ca="left">
            <p>-</p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p>1. For ProteinProphet, the underlying bipartite graph does not correspond to a Bayesian Network although it guides the EM-like algorithm through inference.</p>
      <p>2. MIPGEM uses a rule-based protein removal scheme to simplify the network structure.</p>
      <p>3. Exact computation is used only for small connected components.</p>
      <p>4. Gene centric proteomics was proposed in <abbrgrp><abbr bid="B77">77</abbr></abbrgrp>, and implemented earlier in a deterministic way in <abbrgrp><abbr bid="B67">67</abbr></abbrgrp>.</p>
   </tblfn></tbl>
</sec>
</sec>
</sec>
<sec>
<st>
<p>Discussion</p>
</st>
<p>Our main goal in this review was to present the challenges, intuition and proposed solutions to the protein inference problem. With the increased throughput of proteomics experiments, the tools and approaches presented here will have increasingly important applications to many problems in biology and the biomedical sciences. These applications include inference and verification of gene models and identification of splice forms and post-translationally modified sites. Some of these problems can only be addressed using proteomics techniques and, as such, proteomics holds great promise in systems biology, biomarker discovery, diagnostics, prognostics and treatment monitoring.</p>
<p>Undoubtedly, there is a need for more sophisticated methodology for protein inference, unbiased performance evaluation of these techniques, as well as stand-alone tools with graphical user interfaces that will facilitate the transition from research environments to practice in the biomedical sciences. We conclude this paper by discussing the current issues in evaluating protein inference algorithms and then speculating on the ideal protein inference approaches.</p>
<sec>
<st>
<p>Evaluation of protein identification methods</p>
</st>
<p>Despite the development of computational protein identification methods, objectively evaluating the performance of the methods remains a problem. Two strategies are currently available: the use of standard samples (mixtures of known proteins) and the use of decoy protein sequences to estimate FDR at the protein level. Both approaches have limitations.</p>
<p>To date, only a limited number of standard samples <abbrgrp>
<abbr bid="B78">78</abbr>
<abbr bid="B79">79</abbr>
<abbr bid="B80">80</abbr>
</abbrgrp> containing 10-50 proteins have been used to facilitate evaluation of peptide/protein identification. The advantage of using standard samples is that the truth is known; thus, the accuracy measures, e.g. precision and recall, of protein identification can be directly computed. However, standard samples are frequently plagued by contaminant proteins and the boundary between true and false protein identification is blurred. Another limitation of standard samples is their small number of proteins, which leads to difficulties in assessing statistical significance in method comparisons.</p>
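<p>For a standard sample, the accuracy measures mentioned above follow directly from set operations on protein identifiers. The sketch below is a minimal illustration with hypothetical protein names; how contaminant proteins are counted is left to the user, since, as noted above, they blur the boundary between true and false identifications.</p>
<p><![CDATA[
```python
def precision_recall(identified, truth):
    """Precision and recall of a set of identified proteins against a known mixture."""
    identified, truth = set(identified), set(truth)
    true_positives = len(identified & truth)
    # Conventions for empty sets are a choice of this sketch, not of the review.
    precision = true_positives / len(identified) if identified else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical standard mixture of four proteins; "X1" is a false identification.
precision, recall = precision_recall({"P1", "P2", "X1"}, {"P1", "P2", "P3", "P4"})
```
]]></p>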
<p>The second approach estimates protein-level false discovery rates with the help of decoy databases. Although the approach has been used in several studies <abbrgrp>
<abbr bid="B51">51</abbr>
<abbr bid="B52">52</abbr>
</abbrgrp>, two serious problems of the approach are generally ignored. We suggest that using decoy databases for evaluation of protein identification algorithms should be approached with these limitations in mind. First, unlike the decoy (e.g. reversed, randomized) database approach for peptides, the decoy database for proteins does not produce a correct estimate of the number of incorrect protein identifications when the correct proteins comprise a significant portion of the database. In an extreme scenario, when all proteins in the database are present in the sample, all the identified proteins from the forward database are correct despite many peptides being incorrect identifications. On the other hand, all identified proteins from a decoy database are incorrect. Thus, using a decoy directly will produce a non-zero FDR, while FDR = 0 is the correct answer.</p>
<p>This problem can be addressed by correcting for the bias due to the number of true proteins in the forward database. Let the number of identified forward and decoy proteins be <it>n<sub>F </sub>
</it>and <it>n<sub>D</sub>
</it>, and the total number of forward and decoy proteins in the databases be <it>N<sub>F </sub>
</it>and <it>N<sub>D</sub>
</it>, respectively. Let the protein level FDR in forward database be <it>FDR<sub>P </sub>
</it>and the rate of incorrect protein identifications from the forward and decoy database be</p>
<p>
<display-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i27"><m:mrow>
   <m:msub>
      <m:mrow>
         <m:mi>&#947;</m:mi>
      </m:mrow>
      <m:mrow>
         <m:mi>F</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:mfrac>
      <m:mrow>
         <m:mi>F</m:mi>
         <m:mi>D</m:mi>
         <m:msub>
            <m:mrow>
               <m:mi>R</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>P</m:mi>
            </m:mrow>
         </m:msub>
         <m:mo class="MathClass-bin">&#8901;</m:mo>
         <m:msub>
            <m:mrow>
               <m:mi>n</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>F</m:mi>
            </m:mrow>
         </m:msub>
      </m:mrow>
      <m:mrow>
         <m:msub>
            <m:mrow>
               <m:mi>N</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>F</m:mi>
            </m:mrow>
         </m:msub>
         <m:mo class="MathClass-bin">-</m:mo>
         <m:mrow>
            <m:mo class="MathClass-open">(</m:mo>
            <m:mrow>
               <m:mn>1</m:mn>
               <m:mo class="MathClass-bin">-</m:mo>
               <m:mi>F</m:mi>
               <m:mi>D</m:mi>
               <m:msub>
                  <m:mrow>
                     <m:mi>R</m:mi>
                  </m:mrow>
                  <m:mrow>
                     <m:mi>P</m:mi>
                  </m:mrow>
               </m:msub>
            </m:mrow>
            <m:mo class="MathClass-close">)</m:mo>
         </m:mrow>
         <m:mo class="MathClass-bin">&#8901;</m:mo>
         <m:msub>
            <m:mrow>
               <m:mi>n</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>F</m:mi>
            </m:mrow>
         </m:msub>
      </m:mrow>
   </m:mfrac>
   <m:mo class="MathClass-punc">,</m:mo>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>and</p>
<p>
<display-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i28"><m:mrow>
   <m:msub>
      <m:mrow>
         <m:mi>&#947;</m:mi>
      </m:mrow>
      <m:mrow>
         <m:mi>D</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:mfrac>
      <m:mrow>
         <m:msub>
            <m:mrow>
               <m:mi>n</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>D</m:mi>
            </m:mrow>
         </m:msub>
      </m:mrow>
      <m:mrow>
         <m:msub>
            <m:mrow>
               <m:mi>N</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>D</m:mi>
            </m:mrow>
         </m:msub>
      </m:mrow>
   </m:mfrac>
   <m:mo class="MathClass-punc">,</m:mo>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>respectively. An assumption regarding a decoy database is that the rates of the false protein identifications are identical; hence, <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i101"><m:mrow>
   <m:msub>
      <m:mi>&#947;</m:mi>
      <m:mi>F</m:mi>
   </m:msub>
   <m:mo>=</m:mo>
   <m:msub>
      <m:mi>&#947;</m:mi>
      <m:mi>D</m:mi>
   </m:msub>
</m:mrow>
</m:math>
</inline-formula>. By solving this equation we find</p>
<p>
<display-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i29"><m:mrow>
   <m:mi>F</m:mi>
   <m:mi>D</m:mi>
   <m:msub>
      <m:mrow>
         <m:mi>R</m:mi>
      </m:mrow>
      <m:mrow>
         <m:mi>P</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:mfrac>
      <m:mrow>
         <m:msub>
            <m:mrow>
               <m:mi>n</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>D</m:mi>
            </m:mrow>
         </m:msub>
         <m:mo class="MathClass-bin">&#8901;</m:mo>
         <m:mrow>
            <m:mo class="MathClass-open">(</m:mo>
            <m:mrow>
               <m:msub>
                  <m:mrow>
                     <m:mi>N</m:mi>
                  </m:mrow>
                  <m:mrow>
                     <m:mi>F</m:mi>
                  </m:mrow>
               </m:msub>
               <m:mo class="MathClass-bin">-</m:mo>
               <m:msub>
                  <m:mrow>
                     <m:mi>n</m:mi>
                  </m:mrow>
                  <m:mrow>
                     <m:mi>F</m:mi>
                  </m:mrow>
               </m:msub>
            </m:mrow>
            <m:mo class="MathClass-close">)</m:mo>
         </m:mrow>
      </m:mrow>
      <m:mrow>
         <m:msub>
            <m:mrow>
               <m:mi>n</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>F</m:mi>
            </m:mrow>
         </m:msub>
         <m:mo class="MathClass-bin">&#8901;</m:mo>
         <m:mrow>
            <m:mo class="MathClass-open">(</m:mo>
            <m:mrow>
               <m:msub>
                  <m:mrow>
                     <m:mi>N</m:mi>
                  </m:mrow>
                  <m:mrow>
                     <m:mi>D</m:mi>
                  </m:mrow>
               </m:msub>
               <m:mo class="MathClass-bin">-</m:mo>
               <m:msub>
                  <m:mrow>
                     <m:mi>n</m:mi>
                  </m:mrow>
                  <m:mrow>
                     <m:mi>D</m:mi>
                  </m:mrow>
               </m:msub>
            </m:mrow>
            <m:mo class="MathClass-close">)</m:mo>
         </m:mrow>
      </m:mrow>
   </m:mfrac>
   <m:mi>.</m:mi>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>Note that there is a correction factor <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i102"><m:mrow>
   <m:mo stretchy="false">(</m:mo>
   <m:msub>
      <m:mi>N</m:mi>
      <m:mi>F</m:mi>
   </m:msub>
   <m:mo>&#8722;</m:mo>
   <m:msub>
      <m:mi>n</m:mi>
      <m:mi>F</m:mi>
   </m:msub>
   <m:mo stretchy="false">)</m:mo>
   <m:mo>/</m:mo>
   <m:mo stretchy="false">(</m:mo>
   <m:msub>
      <m:mi>N</m:mi>
      <m:mi>D</m:mi>
   </m:msub>
   <m:mo>&#8722;</m:mo>
   <m:msub>
      <m:mi>n</m:mi>
      <m:mi>D</m:mi>
   </m:msub>
   <m:mo stretchy="false">)</m:mo>
</m:mrow>
</m:math>
</inline-formula> in this equation compared to the FDR formula used for peptides. Also, when <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i103"><m:mrow>
   <m:msub>
      <m:mi>N</m:mi>
      <m:mi>F</m:mi>
   </m:msub>
   <m:mo>=</m:mo>
   <m:msub>
      <m:mi>n</m:mi>
      <m:mi>F</m:mi>
   </m:msub>
   <m:mo>,</m:mo>
   <m:mtext>&#8201;</m:mtext>
   <m:mi>F</m:mi>
   <m:mi>D</m:mi>
   <m:msub>
      <m:mi>R</m:mi>
      <m:mi>P</m:mi>
   </m:msub>
   <m:mo>=</m:mo>
   <m:mn>0</m:mn>
</m:mrow>
</m:math>
</inline-formula> as expected. A related correction is implemented in the MAYU approach <abbrgrp>
<abbr bid="B50">50</abbr>
</abbrgrp> developed for FDR estimation from large proteomics data sets, i.e. the case when <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-13-S16-S4-i104"><m:mrow>
   <m:msub>
      <m:mi>n</m:mi>
      <m:mi>F</m:mi>
   </m:msub>
   <m:mo>/</m:mo>
   <m:msub>
      <m:mi>N</m:mi>
      <m:mi>F</m:mi>
   </m:msub>
   <m:mo>&#8811;</m:mo>
   <m:mn>0</m:mn>
</m:mrow>
</m:math>
</inline-formula>. Further corrections may be needed if the average lengths of the identified vs. non-identified proteins are different.</p>
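<p>The corrected estimate can be computed directly from the four counts defined above. The following Python sketch implements the closed-form expression for <it>FDR<sub>P</sub></it>; the function name and the example counts are hypothetical, and no correction for protein length is attempted.</p>
<p><![CDATA[
```python
def corrected_protein_fdr(n_f, n_d, N_f, N_d):
    """Protein-level FDR with the bias correction for true forward proteins.

    n_f, n_d -- numbers of identified forward and decoy proteins
    N_f, N_d -- total numbers of forward and decoy proteins in the databases
    Implements FDR_P = n_d * (N_f - n_f) / (n_f * (N_d - n_d)).
    """
    if n_f <= 0:
        raise ValueError("at least one forward protein must be identified")
    if n_d >= N_d:
        raise ValueError("the formula is undefined when all decoys are identified")
    naive = n_d / n_f                        # decoy-based estimate used for peptides
    correction = (N_f - n_f) / (N_d - n_d)   # accounts for true forward proteins
    return naive * correction


# Hypothetical counts: 100 forward and 5 decoy hits in databases of 1000 proteins each.
fdr = corrected_protein_fdr(100, 5, 1000, 1000)
```
]]></p>
<p>With these hypothetical counts the uncorrected ratio of decoy to forward identifications is 0.05, and the corrected value is slightly lower; when every forward protein is identified the function returns zero, in agreement with the extreme scenario discussed above.</p>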
<p>We would like to point out that, for probabilistic protein inference algorithms, theoretical protein FDR values can be computed based on the protein posterior probabilities. However, such theoretical FDR values are only accurate when the reported protein posterior probabilities are accurate. Hence, they need to be evaluated themselves, e.g. against the target/decoy-based empirical FDRs.</p>
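<p>One common way to obtain such a theoretical FDR, which we assume here for illustration, is to average the posterior error probabilities (one minus the posterior) over the proteins accepted at a given threshold. The function name and the probabilities below are hypothetical.</p>
<p><![CDATA[
```python
def posterior_based_fdr(posteriors, threshold):
    """Expected FDR among proteins whose posterior probability is >= threshold."""
    accepted = [p for p in posteriors if p >= threshold]
    if not accepted:
        return 0.0  # convention of this sketch: no accepted proteins, no false discoveries
    # Each accepted protein is incorrect with probability 1 - p.
    return sum(1.0 - p for p in accepted) / len(accepted)


# Hypothetical protein posteriors reported by a probabilistic inference method.
theoretical_fdr = posterior_based_fdr([0.99, 0.95, 0.90, 0.40], threshold=0.90)
```
]]></p>
<p>The estimate is only as good as the calibration of the reported posteriors, which is why it should be compared against an empirical FDR.</p>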
<p>The second and more serious issue in applying the decoy approach is related to the existence of protein families. In fact, to our knowledge, no solution has yet been proposed. Simply speaking, a randomized database cannot serve as a good decoy for evaluating methods on data sets that contain many degenerate peptide identifications. The reason is that such peptides are typically shared among forward proteins, which could be similar to each other due to biological/annotation reasons, but not with decoy proteins. As a result, a randomized protein database cannot indicate whether the identifications made among homologous proteins are correct or not. For this reason, a randomized decoy database is expected to underestimate FDRs for eukaryotic samples, which have a large number of shared peptides (Figure <figr fid="F1">1</figr>). The problem might be addressed using a well-constructed non-random sequence database or using a closely related proteome database as a decoy. Evaluating protein inference algorithms using such non-random decoys, however, remains a research problem.</p>
<p>We emphasize that both standard mixtures and the target/decoy approach for complex samples have their pros and cons in evaluating protein inference algorithms, and they are not mutually exclusive approaches. In fact, standard mixtures can be used to validate the target/decoy approach for protein FDR estimation. It is generally a good idea to use both strategies for a more complete and objective evaluation.</p>
<sec>
<st>
<p>A need for guidelines for comparisons between methods</p>
</st>
<p>Due to the complexity of protein inference, fair evaluation of the proposed methods has been challenging, for two major reasons. First, reliable and objective validation of the protein identification results is itself a challenging problem, as the FDR estimation is still unreliable. In addition, it is not even obvious how to compare models whose outputs are considerably different, e.g. those that provide protein groups and those that resolve ties between all proteins. Second, due to the lack of agreed-upon guidelines, avoidable unfair comparisons are sometimes seen in the literature <abbrgrp>
<abbr bid="B69">69</abbr>
</abbrgrp>. In other works, different peptide identification algorithms or scoring schemes are sometimes used as inputs to different protein inference methods, making the protein inference comparisons uninterpretable.</p>
<p>In order to address this situation, we tentatively propose the following principles for comparisons of protein inference algorithms. First, whenever possible, the same or equivalent peptide identification scores should be used as input to different programs. Second, effort should be made to provide the inputs most appropriate to each algorithm considered. For example, algorithms that take all peptide identifications should be provided all scores, while programs that take only confident identifications should be provided such a subset. Third, at least one standard protein mixture data set should be used and all known proteins (whether they belong to "indistinguishable" protein groups or not) in such data sets should be included in the evaluation of the protein inference methods. This will allow the evaluation of protein inference algorithms on proteins identified without any unique peptides. Finally, and in an ideal scenario, large data sets from complex samples of unknown proteins should also be used to compare different programs; however, we caution that the current decoy database strategy may not provide reliable FDR estimates at the protein level (evaluation for protein data sets with a significant fraction of degenerate peptides is a particular problem).</p>
</sec>
</sec>
<sec>
<st>
<p>The ultimate protein inference approach</p>
</st>
<p>Despite the amount of published work, the protein inference problem is far from solved. We believe two aspects are crucial to future approaches. First, the model should be probabilistic, with degenerate peptides treated in a principled way. Second, unidentified peptides should be exploited by incorporating peptide detectability into the model, perhaps adjusted to model peptide competition at the elution stage in a given sample. Despite the current limitations of peptide detectability predictions, especially for non-tryptic and modified peptides, it is believed that including detectability <abbrgrp>
<abbr bid="B24">24</abbr>
<abbr bid="B35">35</abbr>
<abbr bid="B69">69</abbr>
<abbr bid="B71">71</abbr>
</abbrgrp> or peptide-specific information for peptide probability adjustment <abbrgrp>
<abbr bid="B21">21</abbr>
</abbrgrp> would improve the current methods for protein inference.</p>
<p>Furthermore, we believe that better estimation of peptide/protein quantity might also help protein inference by, for example, improving the quantity adjustment of peptide detectability <abbrgrp>
<abbr bid="B60">60</abbr>
<abbr bid="B61">61</abbr>
</abbrgrp>, and by providing additional input information for protein inference. As mentioned in the Introduction, protein inference can be viewed as a special case of label-free protein quantification. In fact, an ideal inference algorithm should automatically be a quantification algorithm, and vice versa. We believe much better performance can be achieved by combining the protein inference and quantification tasks into one statistical framework.</p>
<p>Algorithmic development is equally important for rigorous and yet practical probabilistic inference. Serang et al. <abbrgrp>
<abbr bid="B76">76</abbr>
</abbrgrp> proposed an approximate solution in which low peptide probabilities are set to zero and a graph pruning procedure is then applied. In this way the complexity of the problem can be reduced to arbitrarily low levels, at the price of potentially high error (i.e. the computed probabilities may deviate greatly from the exact values). The Gibbs sampling approach implemented in MSBayesPro can achieve arbitrarily high accuracy in probability estimation; however, the time required for inference can be prohibitively long. A fast algorithm with a controllable error bound is therefore desirable. Applying well-established exact or approximate graph inference algorithms, e.g. the junction tree algorithm <abbrgrp>
<abbr bid="B76">76</abbr>
</abbrgrp>, is an important direction for further investigation.</p>
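<p>To illustrate the trade-off described above, the following minimal Python sketch treats peptides whose probability falls below a cutoff as having probability zero and then splits the peptide-protein graph into connected components, so that exact inference would only need to be run within each component. It illustrates the general idea only and is not the procedure of Serang et al.; the function name prune_and_partition, the cutoff value and the toy data are assumptions made here for the example.</p>
<p>
```python
from collections import defaultdict


def prune_and_partition(peptide_probs, peptide_to_proteins, threshold=0.05):
    """Drop low-probability peptides, then group proteins into connected components.

    Hypothetical helper: proteins are connected when they share a retained peptide.
    Proteins supported only by discarded peptides do not appear in the output.
    """
    parent = {}  # union-find forest over protein identifiers

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for peptide, prob in peptide_probs.items():
        if threshold > prob:
            continue  # treated as probability zero, so its edges vanish
        proteins = sorted(peptide_to_proteins[peptide])
        for protein in proteins:
            find(protein)  # register the protein
        for other in proteins[1:]:
            parent[find(other)] = find(proteins[0])  # merge components

    components = defaultdict(set)
    for protein in list(parent):
        components[find(protein)].add(protein)
    return sorted(components.values(), key=sorted)


# Toy data (illustrative only): pepB is a weak, shared (degenerate) peptide.
peptide_probs = {"pepA": 0.99, "pepB": 0.02, "pepC": 0.90, "pepD": 0.95}
peptide_to_proteins = {
    "pepA": {"P1"},
    "pepB": {"P1", "P2"},
    "pepC": {"P2", "P3"},
    "pepD": {"P4"},
}

pruned = prune_and_partition(peptide_probs, peptide_to_proteins, threshold=0.05)
unpruned = prune_and_partition(peptide_probs, peptide_to_proteins, threshold=0.0)
print(pruned)
print(unpruned)
```
</p>
<p>In this toy example, discarding the weak shared peptide separates P1 from P2 and P3, which reduces the largest component from three proteins to two; the price is that any evidence carried by the discarded peptide is lost.</p>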
</sec>
</sec>
<sec>
<st>
<p>Appendix</p>
</st>
<sec>
<st>
<p>Peptide detectability</p>
</st>
<p>Peptide detectability has been defined as the probability that a peptide will be identified in a proteomics experiment given the presence of its parent protein in a sample <abbrgrp>
<abbr bid="B19">19</abbr>
<abbr bid="B24">24</abbr>
</abbrgrp>. There are multiple factors, spanning all phases of a proteomics experiment, that influence peptide identification. For example, during sample storage and preparation, some peptides may be truncated at their termini resulting in semi-tryptic or non-tryptic peptides (in the case of trypsin digestion) which usually remain unidentified in a database search <abbrgrp>
<abbr bid="B25">25</abbr>
</abbrgrp>. Peptides with different hydrophobicity patterns may not be retained in the LC stationary phase (hydrophilic peptides) or may be insoluble in the LC mobile phase (hydrophobic peptides). Peptides that do elute are ionized with different efficiencies, depending on the presence and distribution of charged residues in their sequence. Furthermore, in complex biological samples, peptides are likely to co-elute with many other peptides and thus compete for ionizing protons during electrospray ionization. Many peptides may elute and ionize well but fragment poorly, producing MS/MS spectra with few peaks. Such spectra are difficult to interpret by computational methods. In addition, peptides whose <it>m/z</it> values are outside the range of the mass spectrometer (200-2000 Da) cannot be identified. Apart from physicochemical aspects, there are several biological factors influencing peptide identification. For example, the three-dimensional structure of a protein could lead to the existence of sites with different sensitivities to proteolytic digestion. Other sites may carry one of the more than 200 different post-translational modifications (PTMs) observed in eukaryotes <abbrgrp>
<abbr bid="B26">26</abbr>
</abbrgrp>. Many such peptides typically remain unidentified unless a database search explicitly specifies the PTM type or the sites of interest (which, in turn, leads to a decrease in the number of identifications for regular peptides). Finally, different peptide identification software packages are based on different assumptions and are known to result in differences among identified peptides <abbrgrp>
<abbr bid="B27">27</abbr>
</abbrgrp>.</p>
<p>It has been shown that the detectability of a peptide at standard quantity, i.e. <it>standard detectability</it>, is a property of the peptide sequence and thus can be predicted from peptide/protein sequence for a given experimental platform <abbrgrp>
<abbr bid="B17">17</abbr>
<abbr bid="B19">19</abbr>
<abbr bid="B24">24</abbr>
<abbr bid="B28">28</abbr>
<abbr bid="B29">29</abbr>
<abbr bid="B30">30</abbr>
<abbr bid="B31">31</abbr>
<abbr bid="B32">32</abbr>
<abbr bid="B33">33</abbr>
</abbrgrp>. On the other hand, the quantity of a protein also determines the fate of a peptide with respect to its identification. For example, peptides with high standard detectability that are present in a sample in low quantity may not be identified, while peptides with relatively low detectability present at high quantity may in fact be observed. Therefore, protein quantity and standard detectability collectively determine the <it>effective detectability </it>of each peptide in a protein. Effective peptide detectability cannot be predicted from amino acid sequence alone (unless protein quantity can be shown to depend on protein sequence) and has to be estimated from a set of peptide identifications and their standard detectabilities <abbrgrp>
<abbr bid="B24">24</abbr>
</abbrgrp>.</p>
<p>A detectable peptide is related to a <it>proteotypic peptide</it>, which is "an experimentally observable peptide that uniquely identifies a specific protein or protein isoform" <abbrgrp>
<abbr bid="B18">18</abbr>
</abbrgrp>. In practice, proteotypic peptides were required to be observed in more than 50% of experiments in which their parent protein was identified <abbrgrp>
<abbr bid="B18">18</abbr>
<abbr bid="B30">30</abbr>
</abbrgrp>. The relationship between the two definitions can be best understood from an interesting property of peptides in MS/MS experiments: they tend to group at either the high or the low end of the detectability scale in a standard sample <abbrgrp>
<abbr bid="B24">24</abbr>
<abbr bid="B34">34</abbr>
</abbrgrp>.
</p>
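<p>The operational criterion for proteotypic peptides mentioned above can be sketched in a few lines of Python. The sketch assumes that each experiment is summarized as a pair (set of identified proteins, set of identified peptides) and that each peptide maps to a single parent protein; the function names, the data layout and the toy values are hypothetical and serve only to illustrate the more-than-50% rule.</p>
<p>
```python
from collections import defaultdict


def observation_frequencies(experiments, parent_of):
    """For each peptide, the fraction of experiments identifying its parent
    protein in which the peptide itself was also identified."""
    protein_runs = defaultdict(int)  # experiments in which each protein was identified
    peptide_hits = defaultdict(int)  # of those, experiments that also saw the peptide
    for proteins, peptides in experiments:
        for protein in proteins:
            protein_runs[protein] += 1
        for peptide in peptides:
            if parent_of.get(peptide) in proteins:
                peptide_hits[peptide] += 1
    return {
        peptide: peptide_hits[peptide] / protein_runs[protein]
        for peptide, protein in parent_of.items()
        if protein_runs[protein] > 0
    }


def proteotypic_peptides(experiments, parent_of, min_fraction=0.5):
    """Peptides observed in more than min_fraction of the relevant experiments."""
    freqs = observation_frequencies(experiments, parent_of)
    return {peptide for peptide, f in freqs.items() if f > min_fraction}


# Toy data (illustrative only).
experiments = [
    ({"P1"}, {"AAK", "CCR"}),
    ({"P1"}, {"AAK"}),
    ({"P1", "P2"}, {"AAK", "DDK"}),
    ({"P2"}, set()),
]
parent_of = {"AAK": "P1", "CCR": "P1", "DDK": "P2"}

print(proteotypic_peptides(experiments, parent_of))  # {'AAK'}
```
</p>
<p>Here AAK is seen in all three experiments that identify P1, CCR in one of three, and DDK in exactly half of the experiments that identify P2, so only AAK passes the strict more-than-50% cutoff.</p>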
<p>In summary, the identification of a peptide in a proteomics experiment is a stochastic event that depends on multiple factors. It is therefore convenient to summarize all these factors using a probabilistic framework. The standard detectability of peptide <it>p<sub>j</sub></it> from protein <it>P<sub>i</sub></it> (at quantity <it>q<sup>0</sup></it>) will be denoted as <it>d<sub>ij</sub><sup>0</sup></it>, while the effective detectability at an arbitrary quantity <it>q</it> will be denoted as <it>d<sub>ij</sub></it>(<it>q</it>).</p>
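<p>As a concrete illustration of this notation, one simple way to relate the two quantities is to assume that a peptide at quantity <it>q</it> behaves like <it>q</it>/<it>q<sup>0</sup></it> independent chances of identification at the standard quantity, which gives <it>d<sub>ij</sub></it>(<it>q</it>) = 1 - (1 - <it>d<sub>ij</sub><sup>0</sup></it>)<sup><it>q</it>/<it>q</it>0</sup>. This functional form is an assumption adopted here for the sketch and is not derived in this appendix; the function name is likewise hypothetical.</p>
<p>
```python
def effective_detectability(d0: float, q: float, q0: float = 1.0) -> float:
    """Effective detectability at quantity q, given standard detectability d0
    defined at the standard quantity q0.

    Assumed model (not taken from this appendix): 1 - (1 - d0) ** (q / q0).
    """
    if d0 > 1.0 or 0.0 > d0:
        raise ValueError("d0 must lie in [0, 1]")
    if 0.0 > q or 0.0 >= q0:
        raise ValueError("q must be non-negative and q0 must be positive")
    return 1.0 - (1.0 - d0) ** (q / q0)


# A highly detectable peptide at one tenth of the standard quantity ...
high_d_low_q = effective_detectability(0.9, q=0.1)
# ... versus a poorly detectable peptide at ten times the standard quantity.
low_d_high_q = effective_detectability(0.2, q=10.0)

print(round(high_d_low_q, 2), round(low_d_high_q, 2))
```
</p>
<p>Under this assumed model the second peptide ends up with the higher effective detectability (about 0.89 versus about 0.21), consistent with the observation that protein quantity and standard detectability jointly determine whether a peptide is identified.</p>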
</sec>
</sec>
<sec>
<st>
<p>Competing interests</p>
</st>
<p>The authors declare that they have no competing interests.</p>
</sec>
</bdy><bm>
<ack>
<sec>
<st>
<p>Acknowledgements</p>
</st>
<p>We thank Prof. Matthew Hahn, Prof. Haixu Tang and Dr. Sujun Li for their comments and help in writing this paper. We also thank the anonymous reviewers for their suggestions and criticisms, which further improved the paper. This work was supported by the National Institutes of Health grants RR024236-01A1 and CA126480-01.</p>
<p>This article has been published as part of <it>BMC Bioinformatics </it>Volume 13 Supplement 16, 2012: Statistical mass spectrometry-based proteomics. The full contents of the supplement are available online at <url>http://www.biomedcentral.com/1471-2105/13/S16</url>.</p>
</sec>
</ack>
<refgrp><bibl id="B1"><title><p>Mass spectrometry-based proteomics</p></title><aug><au><snm>Aebersold</snm><fnm>R</fnm></au><au><snm>Mann</snm><fnm>M</fnm></au></aug><source>Nature</source><pubdate>2003</pubdate><volume>422</volume><issue>6928</issue><fpage>198</fpage><lpage>207</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nature01511</pubid><pubid idtype="pmpid" link="fulltext">12634793</pubid></pubidlist></xrefbib></bibl><bibl id="B2"><title><p>The biological impact of mass-spectrometry-based proteomics</p></title><aug><au><snm>Cravatt</snm><fnm>BF</fnm></au><au><snm>Simon</snm><fnm>GM</fnm></au><au><snm>Yates</snm><fnm>JR</fnm></au></aug><source>Nature</source><pubdate>2007</pubdate><volume>450</volume><issue>7172</issue><fpage>991</fpage><lpage>1000</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nature06525</pubid><pubid idtype="pmpid" link="fulltext">18075578</pubid></pubidlist></xrefbib></bibl><bibl id="B3"><title><p>Decoding signalling networks by mass spectrometry-based proteomics</p></title><aug><au><snm>Choudhary</snm><fnm>C</fnm></au><au><snm>Mann</snm><fnm>M</fnm></au></aug><source>Nat Rev Mol Cell Biol</source><pubdate>2010</pubdate><volume>11</volume><issue>6</issue><fpage>427</fpage><lpage>439</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nrm2900</pubid><pubid idtype="pmpid" link="fulltext">20461098</pubid></pubidlist></xrefbib></bibl><bibl id="B4"><title><p>The ABC's (and XYZ's) of peptide sequencing</p></title><aug><au><snm>Steen</snm><fnm>H</fnm></au><au><snm>Mann</snm><fnm>M</fnm></au></aug><source>Nat Rev Mol Cell Biol</source><pubdate>2004</pubdate><volume>5</volume><issue>9</issue><fpage>699</fpage><lpage>711</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nrm1468</pubid><pubid idtype="pmpid" link="fulltext">15340378</pubid></pubidlist></xrefbib></bibl><bibl id="B5"><title><p>Using annotated peptide mass spectrum libraries for protein 
identification</p></title><aug><au><snm>Craig</snm><fnm>R</fnm></au><au><snm>Cortens</snm><fnm>JC</fnm></au><au><snm>Fenyo</snm><fnm>D</fnm></au><au><snm>Beavis</snm><fnm>RC</fnm></au></aug><source>J Proteome Res</source><pubdate>2006</pubdate><volume>5</volume><issue>8</issue><fpage>1843</fpage><lpage>1849</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1021/pr0602085</pubid><pubid idtype="pmpid" link="fulltext">16889405</pubid></pubidlist></xrefbib></bibl><bibl id="B6"><title><p>Analysis of peptide MS/MS spectra from large-scale proteomics experiments using spectrum libraries</p></title><aug><au><snm>Frewen</snm><fnm>BE</fnm></au><au><snm>Merrihew</snm><fnm>GE</fnm></au><au><snm>Wu</snm><fnm>CC</fnm></au><au><snm>Noble</snm><fnm>WS</fnm></au><au><snm>MacCoss</snm><fnm>MJ</fnm></au></aug><source>Anal Chem</source><pubdate>2006</pubdate><volume>78</volume><issue>16</issue><fpage>5678</fpage><lpage>5684</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1021/ac060279n</pubid><pubid idtype="pmpid" link="fulltext">16906711</pubid></pubidlist></xrefbib></bibl><bibl id="B7"><title><p>Development and validation of a spectral library searching method for peptide identification from MS/MS</p></title><aug><au><snm>Lam</snm><fnm>H</fnm></au><au><snm>Deutsch</snm><fnm>EW</fnm></au><au><snm>Eddes</snm><fnm>JS</fnm></au><au><snm>Eng</snm><fnm>JK</fnm></au><au><snm>King</snm><fnm>N</fnm></au><au><snm>Stein</snm><fnm>SE</fnm></au><au><snm>Aebersold</snm><fnm>R</fnm></au></aug><source>Proteomics</source><pubdate>2007</pubdate><volume>7</volume><issue>5</issue><fpage>655</fpage><lpage>667</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1002/pmic.200600625</pubid><pubid idtype="pmpid" link="fulltext">17295354</pubid></pubidlist></xrefbib></bibl><bibl id="B8"><title><p>Building consensus spectral libraries for peptide identification in 
proteomics</p></title><aug><au><snm>Lam</snm><fnm>H</fnm></au><au><snm>Deutsch</snm><fnm>EW</fnm></au><au><snm>Eddes</snm><fnm>JS</fnm></au><au><snm>Eng</snm><fnm>JK</fnm></au><au><snm>Stein</snm><fnm>SE</fnm></au><au><snm>Aebersold</snm><fnm>R</fnm></au></aug><source>Nat Methods</source><pubdate>2008</pubdate><volume>5</volume><issue>10</issue><fpage>873</fpage><lpage>875</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nmeth.1254</pubid><pubid idtype="pmcid">2637392</pubid><pubid idtype="pmpid" link="fulltext">18806791</pubid></pubidlist></xrefbib></bibl><bibl id="B9"><title><p>An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database</p></title><aug><au><snm>Eng</snm><fnm>JK</fnm></au><au><snm>McCormack</snm><fnm>AL</fnm></au><au><snm>Yates</snm><fnm>JR</fnm></au></aug><source>J Am Soc Mass Spectrom</source><pubdate>1994</pubdate><volume>5</volume><fpage>976</fpage><lpage>989</lpage><xrefbib><pubid idtype="doi">10.1016/1044-0305(94)80016-2</pubid></xrefbib></bibl><bibl id="B10"><title><p>Prediction of low-energy collision-induced dissociation spectra of peptides</p></title><aug><au><snm>Zhang</snm><fnm>Z</fnm></au></aug><source>Anal Chem</source><pubdate>2004</pubdate><volume>76</volume><issue>14</issue><fpage>3908</fpage><lpage>3922</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1021/ac049951b</pubid><pubid idtype="pmpid" link="fulltext">15253624</pubid></pubidlist></xrefbib></bibl><bibl id="B11"><title><p>Prediction of low-energy collision-induced dissociation spectra of peptides with three or more charges</p></title><aug><au><snm>Zhang</snm><fnm>Z</fnm></au></aug><source>Anal Chem</source><pubdate>2005</pubdate><volume>77</volume><issue>19</issue><fpage>6364</fpage><lpage>6373</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1021/ac050857k</pubid><pubid idtype="pmpid" link="fulltext">16194101</pubid></pubidlist></xrefbib></bibl><bibl id="B12"><title><p>Intensity-based protein identification by 
machine learning from a library of tandem mass spectra</p></title><aug><au><snm>Elias</snm><fnm>JE</fnm></au><au><snm>Gibbons</snm><fnm>FD</fnm></au><au><snm>King</snm><fnm>OD</fnm></au><au><snm>Roth</snm><fnm>FP</fnm></au><au><snm>Gygi</snm><fnm>SP</fnm></au></aug><source>Nat Biotechnol</source><pubdate>2004</pubdate><volume>22</volume><issue>2</issue><fpage>214</fpage><lpage>219</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nbt930</pubid><pubid idtype="pmpid" link="fulltext">14730315</pubid></pubidlist></xrefbib></bibl><bibl id="B13"><title><p>A machine learning approach to predicting peptide fragmentation spectra</p></title><aug><au><snm>Arnold</snm><fnm>RJ</fnm></au><au><snm>Jayasankar</snm><fnm>N</fnm></au><au><snm>Aggarwal</snm><fnm>D</fnm></au><au><snm>Tang</snm><fnm>H</fnm></au><au><snm>Radivojac</snm><fnm>P</fnm></au></aug><source>Pac Symp Biocomput</source><pubdate>2006</pubdate><fpage>219</fpage><lpage>230</lpage></bibl><bibl id="B14"><title><p>Modeling peptide fragmentation with dynamic Bayesian networks for peptide identification</p></title><aug><au><snm>Klammer</snm><fnm>AA</fnm></au><au><snm>Reynolds</snm><fnm>SM</fnm></au><au><snm>Bilmes</snm><fnm>JA</fnm></au><au><snm>MacCoss</snm><fnm>MJ</fnm></au><au><snm>Noble</snm><fnm>WS</fnm></au></aug><source>Bioinformatics</source><pubdate>2008</pubdate><volume>24</volume><issue>13</issue><fpage>i348</fpage><lpage>356</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btn189</pubid><pubid idtype="pmcid">2665034,2665034</pubid><pubid idtype="pmpid" link="fulltext">18586734</pubid></pubidlist></xrefbib></bibl><bibl id="B15"><title><p>The human plasma proteome: history, character, and diagnostic prospects</p></title><aug><au><snm>Anderson</snm><fnm>NL</fnm></au><au><snm>Anderson</snm><fnm>NG</fnm></au></aug><source>Mol Cell Proteomics</source><pubdate>2002</pubdate><volume>1</volume><issue>11</issue><fpage>845</fpage><lpage>867</lpage><xrefbib><pubidlist><pubid 
idtype="doi">10.1074/mcp.R200007-MCP200</pubid><pubid idtype="pmpid" link="fulltext">12488461</pubid></pubidlist></xrefbib></bibl><bibl id="B16"><title><p>Improving reproducibility and sensitivity in identifying human proteins by shotgun proteomics</p></title><aug><au><snm>Resing</snm><fnm>KA</fnm></au><au><snm>Meyer-Arendt</snm><fnm>K</fnm></au><au><snm>Mendoza</snm><fnm>AM</fnm></au><au><snm>Aveline-Wolf</snm><fnm>LD</fnm></au><au><snm>Jonscher</snm><fnm>KR</fnm></au><au><snm>Pierce</snm><fnm>KG</fnm></au><au><snm>Old</snm><fnm>WM</fnm></au><au><snm>Cheung</snm><fnm>HT</fnm></au><au><snm>Russell</snm><fnm>S</fnm></au><au><snm>Wattawa</snm><fnm>JL</fnm></au><etal/></aug><source>Anal Chem</source><pubdate>2004</pubdate><volume>76</volume><issue>13</issue><fpage>3556</fpage><lpage>3568</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1021/ac035229m</pubid><pubid idtype="pmpid" link="fulltext">15228325</pubid></pubidlist></xrefbib></bibl><bibl id="B17"><title><p>Definition and characterization of a "trypsinosome" from specific peptide characteristics by nano-HPLC-MS/MS and in silico analysis of complex protein mixtures</p></title><aug><au><snm>Le Bihan</snm><fnm>T</fnm></au><au><snm>Robinson</snm><fnm>MD</fnm></au><au><snm>Figeys</snm><fnm>D</fnm></au></aug><source>J Proteome Res</source><pubdate>2004</pubdate><volume>3</volume><issue>6</issue><fpage>1138</fpage><lpage>1148</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1021/pr049909x</pubid><pubid idtype="pmpid" link="fulltext">15595722</pubid></pubidlist></xrefbib></bibl><bibl id="B18"><title><p>Scoring proteomes with proteotypic peptide probes</p></title><aug><au><snm>Kuster</snm><fnm>B</fnm></au><au><snm>Schirle</snm><fnm>M</fnm></au><au><snm>Mallick</snm><fnm>P</fnm></au><au><snm>Aebersold</snm><fnm>R</fnm></au></aug><source>Nat Rev Mol Cell Biol</source><pubdate>2005</pubdate><volume>6</volume><issue>7</issue><fpage>577</fpage><lpage>583</lpage><xrefbib><pubidlist><pubid 
idtype="doi">10.1038/nrm1683</pubid><pubid idtype="pmpid" link="fulltext">15957003</pubid></pubidlist></xrefbib></bibl><bibl id="B19"><title><p>A computational approach toward label-free protein quantification using predicted peptide detectability</p></title><aug><au><snm>Tang</snm><fnm>H</fnm></au><au><snm>Arnold</snm><fnm>RJ</fnm></au><au><snm>Alves</snm><fnm>P</fnm></au><au><snm>Xun</snm><fnm>Z</fnm></au><au><snm>Clemmer</snm><fnm>DE</fnm></au><au><snm>Novotny</snm><fnm>MV</fnm></au><au><snm>Reilly</snm><fnm>JP</fnm></au><au><snm>Radivojac</snm><fnm>P</fnm></au></aug><source>Bioinformatics</source><pubdate>2006</pubdate><volume>22</volume><issue>14</issue><fpage>e481</fpage><lpage>e488</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btl237</pubid><pubid idtype="pmpid" link="fulltext">16873510</pubid></pubidlist></xrefbib></bibl><bibl id="B20"><title><p>Interpretation of shotgun proteomic data: the protein inference problem</p></title><aug><au><snm>Nesvizhskii</snm><fnm>AI</fnm></au><au><snm>Aebersold</snm><fnm>R</fnm></au></aug><source>Mol Cell Proteomics</source><pubdate>2005</pubdate><volume>4</volume><issue>10</issue><fpage>1419</fpage><lpage>1440</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1074/mcp.R500012-MCP200</pubid><pubid idtype="pmpid" link="fulltext">16009968</pubid></pubidlist></xrefbib></bibl><bibl id="B21"><title><p>A statistical model for identifying proteins by tandem mass spectrometry</p></title><aug><au><snm>Nesvizhskii</snm><fnm>AI</fnm></au><au><snm>Keller</snm><fnm>A</fnm></au><au><snm>Kolker</snm><fnm>E</fnm></au><au><snm>Aebersold</snm><fnm>R</fnm></au></aug><source>Anal Chem</source><pubdate>2003</pubdate><volume>75</volume><issue>17</issue><fpage>4646</fpage><lpage>4658</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1021/ac0341261</pubid><pubid idtype="pmpid">14632076</pubid></pubidlist></xrefbib></bibl><bibl id="B22"><title><p>Target-decoy search strategy for increased confidence in large-scale protein 
identifications by mass spectrometry</p></title><aug><au><snm>Elias</snm><fnm>JE</fnm></au><au><snm>Gygi</snm><fnm>SP</fnm></au></aug><source>Nat Methods</source><pubdate>2007</pubdate><volume>4</volume><issue>3</issue><fpage>207</fpage><lpage>214</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nmeth1019</pubid><pubid idtype="pmpid" link="fulltext">17327847</pubid></pubidlist></xrefbib></bibl><bibl id="B23"><title><p>A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics</p></title><aug><au><snm>Nesvizhskii</snm><fnm>AI</fnm></au></aug><source>J Proteomics</source><pubdate>2010</pubdate><volume>73</volume><issue>11</issue><fpage>2092</fpage><lpage>2123</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.jprot.2010.08.009</pubid><pubid idtype="pmcid">2956504</pubid><pubid idtype="pmpid" link="fulltext">20816881</pubid></pubidlist></xrefbib></bibl><bibl id="B24"><title><p>The importance of peptide detectability for protein identification, quantification, and experiment design in MS/MS proteomics</p></title><aug><au><snm>Li</snm><fnm>YF</fnm></au><au><snm>Arnold</snm><fnm>RJ</fnm></au><au><snm>Tang</snm><fnm>H</fnm></au><au><snm>Radivojac</snm><fnm>P</fnm></au></aug><source>J Proteome Res</source><pubdate>2010</pubdate><volume>9</volume><issue>12</issue><fpage>6288</fpage><lpage>6297</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1021/pr1005586</pubid><pubid idtype="pmcid">3006185</pubid><pubid idtype="pmpid" link="fulltext">21067214</pubid></pubidlist></xrefbib></bibl><bibl id="B25"><title><p>Fast and accurate identification of semi-tryptic peptides in shotgun 
proteomics</p></title><aug><au><snm>Alves</snm><fnm>P</fnm></au><au><snm>Arnold</snm><fnm>RJ</fnm></au><au><snm>Clemmer</snm><fnm>DE</fnm></au><au><snm>Li</snm><fnm>Y</fnm></au><au><snm>Reilly</snm><fnm>JP</fnm></au><au><snm>Sheng</snm><fnm>Q</fnm></au><au><snm>Tang</snm><fnm>H</fnm></au><au><snm>Xun</snm><fnm>Z</fnm></au><au><snm>Zeng</snm><fnm>R</fnm></au><au><snm>Radivojac</snm><fnm>P</fnm></au></aug><source>Bioinformatics</source><pubdate>2008</pubdate><volume>24</volume><issue>1</issue><fpage>102</fpage><lpage>109</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btm545</pubid><pubid idtype="pmpid" link="fulltext">18033797</pubid></pubidlist></xrefbib></bibl><bibl id="B26"><aug><au><snm>Walsh</snm><fnm>CT</fnm></au></aug><source>Posttranslational modification of proteins: expanding nature's inventory</source><publisher>Englewood, CO: Roberts and Company Publishers</publisher><pubdate>2006</pubdate><xrefbib><pubid idtype="pmpid" link="fulltext">21429787</pubid></xrefbib></bibl><bibl id="B27"><title><p>Comparative evaluation of tandem MS search algorithms using a target-decoy search strategy</p></title><aug><au><snm>Balgley</snm><fnm>BM</fnm></au><au><snm>Laudeman</snm><fnm>T</fnm></au><au><snm>Yang</snm><fnm>L</fnm></au><au><snm>Song</snm><fnm>T</fnm></au><au><snm>Lee</snm><fnm>CS</fnm></au></aug><source>Mol Cell Proteomics</source><pubdate>2007</pubdate><volume>6</volume><issue>9</issue><fpage>1599</fpage><lpage>1608</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1074/mcp.M600469-MCP200</pubid><pubid idtype="pmpid" link="fulltext">17533222</pubid></pubidlist></xrefbib></bibl><bibl id="B28"><title><p>Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation</p></title><aug><au><snm>Lu</snm><fnm>P</fnm></au><au><snm>Vogel</snm><fnm>C</fnm></au><au><snm>Wang</snm><fnm>R</fnm></au><au><snm>Yao</snm><fnm>X</fnm></au><au><snm>Marcotte</snm><fnm>EM</fnm></au></aug><source>Nat 
Biotechnol</source><pubdate>2007</pubdate><volume>25</volume><issue>1</issue><fpage>117</fpage><lpage>124</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nbt1270</pubid><pubid idtype="pmpid" link="fulltext">17187058</pubid></pubidlist></xrefbib></bibl><bibl id="B29"><title><p>Peptide detectability following ESI mass spectrometry: prediction using genetic programming</p></title><aug><au><snm>Wedge</snm><fnm>DC</fnm></au><au><snm>Gaskell</snm><fnm>SJ</fnm></au><au><snm>Hubbard</snm><fnm>SJ</fnm></au><au><snm>Kell</snm><fnm>DB</fnm></au><au><snm>Lau</snm><fnm>KW</fnm></au><au><snm>Eyers</snm><fnm>C</fnm></au></aug><source>Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation (GECCO): 2007; New York, NY</source><pubdate>2007</pubdate><fpage>2219</fpage><lpage>2225</lpage></bibl><bibl id="B30"><title><p>Computational prediction of proteotypic peptides for quantitative proteomics</p></title><aug><au><snm>Mallick</snm><fnm>P</fnm></au><au><snm>Schirle</snm><fnm>M</fnm></au><au><snm>Chen</snm><fnm>SS</fnm></au><au><snm>Flory</snm><fnm>MR</fnm></au><au><snm>Lee</snm><fnm>H</fnm></au><au><snm>Martin</snm><fnm>D</fnm></au><au><snm>Ranish</snm><fnm>J</fnm></au><au><snm>Raught</snm><fnm>B</fnm></au><au><snm>Schmitt</snm><fnm>R</fnm></au><au><snm>Werner</snm><fnm>T</fnm></au><etal/></aug><source>Nat Biotechnol</source><pubdate>2007</pubdate><volume>25</volume><issue>1</issue><fpage>125</fpage><lpage>131</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nbt1275</pubid><pubid idtype="pmpid" link="fulltext">17195840</pubid></pubidlist></xrefbib></bibl><bibl id="B31"><title><p>Prediction of peptides observable by mass spectrometry applied at the experimental set level</p></title><aug><au><snm>Sanders</snm><fnm>WS</fnm></au><au><snm>Bridges</snm><fnm>SM</fnm></au><au><snm>McCarthy</snm><fnm>FM</fnm></au><au><snm>Nanduri</snm><fnm>B</fnm></au><au><snm>Burgess</snm><fnm>SC</fnm></au></aug><source>BMC 
Bioinformatics</source><pubdate>2007</pubdate><volume>8</volume><issue>Suppl 7</issue><fpage>S23</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-8-S7-S23</pubid><pubid idtype="pmcid">2099492</pubid><pubid idtype="pmpid" link="fulltext">18047723</pubid></pubidlist></xrefbib></bibl><bibl id="B32"><title><p>Calculating absolute and relative protein abundance from mass spectrometry-based protein expression data</p></title><aug><au><snm>Vogel</snm><fnm>C</fnm></au><au><snm>Marcotte</snm><fnm>EM</fnm></au></aug><source>Nat Protoc</source><pubdate>2008</pubdate><volume>3</volume><issue>9</issue><fpage>1444</fpage><lpage>1451</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nprot.2008.132</pubid><pubid idtype="pmpid" link="fulltext">18772871</pubid></pubidlist></xrefbib></bibl><bibl id="B33"><title><p>A support vector machine model for the prediction of proteotypic peptides for accurate mass and time proteomics</p></title><aug><au><snm>Webb-Robertson</snm><fnm>BJ</fnm></au><au><snm>Cannon</snm><fnm>WR</fnm></au><au><snm>Oehmen</snm><fnm>CS</fnm></au><au><snm>Shah</snm><fnm>AR</fnm></au><au><snm>Gurumoorthi</snm><fnm>V</fnm></au><au><snm>Lipton</snm><fnm>MS</fnm></au><au><snm>Waters</snm><fnm>KM</fnm></au></aug><source>Bioinformatics</source><pubdate>2008</pubdate><volume>24</volume><issue>13</issue><fpage>1503</fpage><lpage>1509</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btn218</pubid><pubid idtype="pmpid" link="fulltext">18453551</pubid></pubidlist></xrefbib></bibl><bibl id="B34"><title><p>Combinatorial libraries of synthetic peptides as a model for shotgun proteomics</p></title><aug><au><snm>Bohrer</snm><fnm>BC</fnm></au><au><snm>Li</snm><fnm>YF</fnm></au><au><snm>Reilly</snm><fnm>JP</fnm></au><au><snm>Clemmer</snm><fnm>DE</fnm></au><au><snm>DiMarchi</snm><fnm>RD</fnm></au><au><snm>Radivojac</snm><fnm>P</fnm></au><au><snm>Tang</snm><fnm>H</fnm></au><au><snm>Arnold</snm><fnm>RJ</fnm></au></aug><source>Anal 
Chem</source><pubdate>2010</pubdate><volume>82</volume><issue>15</issue><fpage>6559</fpage><lpage>6568</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1021/ac100910a</pubid><pubid idtype="pmcid">2927099</pubid><pubid idtype="pmpid" link="fulltext">20669997</pubid></pubidlist></xrefbib></bibl><bibl id="B35"><title><p>Protein inference by assembling peptides identified from tandem mass spectra</p></title><aug><au><snm>Shi</snm><fnm>J</fnm></au><au><snm>Wu</snm><fnm>F</fnm></au></aug><source>Curr Bioinformatics</source><pubdate>2009</pubdate><volume>4</volume><issue>3</issue><fpage>226</fpage><lpage>233</lpage><xrefbib><pubid idtype="doi">10.2174/157489309789071048</pubid></xrefbib></bibl><bibl id="B36"><title><p>Protein inference: a review</p></title><aug><au><snm>Huang</snm><fnm>T</fnm></au><au><snm>Wang</snm><fnm>J</fnm></au><au><snm>Yu</snm><fnm>W</fnm></au><au><snm>He</snm><fnm>Z</fnm></au></aug><source>Brief Bioinform</source><pubdate>2012</pubdate></bibl><bibl id="B37"><title><p>A review of statistical methods for protein identification using tandem mass spectrometry</p></title><aug><au><snm>Serang</snm><fnm>O</fnm></au><au><snm>Noble</snm><fnm>WS</fnm></au></aug><source>Stat Interface</source><pubdate>2012</pubdate><volume>5</volume><issue>1</issue><fpage>3</fpage><lpage>20</lpage><xrefbib><pubidlist><pubid idtype="pmcid">3402235</pubid><pubid idtype="pmpid">22833779</pubid></pubidlist></xrefbib></bibl><bibl id="B38"><title><p>Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search</p></title><aug><au><snm>Keller</snm><fnm>A</fnm></au><au><snm>Nesvizhskii</snm><fnm>AI</fnm></au><au><snm>Kolker</snm><fnm>E</fnm></au><au><snm>Aebersold</snm><fnm>R</fnm></au></aug><source>Anal Chem</source><pubdate>2002</pubdate><volume>74</volume><issue>20</issue><fpage>5383</fpage><lpage>5392</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1021/ac025747h</pubid><pubid 
idtype="pmpid">12403597</pubid></pubidlist></xrefbib></bibl><bibl id="B39"><title><p>Probability-based protein identification by searching sequence databases using mass spectrometry data</p></title><aug><au><snm>Perkins</snm><fnm>DN</fnm></au><au><snm>Pappin</snm><fnm>DJ</fnm></au><au><snm>Creasy</snm><fnm>DM</fnm></au><au><snm>Cottrell</snm><fnm>JS</fnm></au></aug><source>Electrophoresis</source><pubdate>1999</pubdate><volume>20</volume><issue>18</issue><fpage>3551</fpage><lpage>3567</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1002/(SICI)1522-2683(19991201)20:18&lt;3551::AID-ELPS3551&gt;3.0.CO;2-2</pubid><pubid idtype="pmpid" link="fulltext">10612281</pubid></pubidlist></xrefbib></bibl><bibl id="B40"><title><p>InsPecT: identification of posttranslationally modified peptides from tandem mass spectra</p></title><aug><au><snm>Tanner</snm><fnm>S</fnm></au><au><snm>Shu</snm><fnm>H</fnm></au><au><snm>Frank</snm><fnm>A</fnm></au><au><snm>Wang</snm><fnm>LC</fnm></au><au><snm>Zandi</snm><fnm>E</fnm></au><au><snm>Mumby</snm><fnm>M</fnm></au><au><snm>Pevzner</snm><fnm>PA</fnm></au><au><snm>Bafna</snm><fnm>V</fnm></au></aug><source>Anal Chem</source><pubdate>2005</pubdate><volume>77</volume><issue>14</issue><fpage>4626</fpage><lpage>4639</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1021/ac050102d</pubid><pubid idtype="pmpid" link="fulltext">16013882</pubid></pubidlist></xrefbib></bibl><bibl id="B41"><title><p>MyriMatch: highly accurate tandem mass spectral peptide identification by multivariate hypergeometric analysis</p></title><aug><au><snm>Tabb</snm><fnm>DL</fnm></au><au><snm>Fernando</snm><fnm>CG</fnm></au><au><snm>Chambers</snm><fnm>MC</fnm></au></aug><source>J Proteome Res</source><pubdate>2007</pubdate><volume>6</volume><issue>2</issue><fpage>654</fpage><lpage>661</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1021/pr0604054</pubid><pubid idtype="pmcid">2525619</pubid><pubid idtype="pmpid" link="fulltext">17269722</pubid></pubidlist></xrefbib></bibl><bibl 
id="B42"><title><p>SQID: an intensity-incorporated protein identification algorithm for tandem mass spectrometry</p></title><aug><au><snm>Li</snm><fnm>W</fnm></au><au><snm>Ji</snm><fnm>L</fnm></au><au><snm>Goya</snm><fnm>J</fnm></au><au><snm>Tan</snm><fnm>G</fnm></au><au><snm>Wysocki</snm><fnm>VH</fnm></au></aug><source>J Proteome Res</source><pubdate>2011</pubdate><volume>10</volume><issue>4</issue><fpage>1593</fpage><lpage>1602</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1021/pr100959y</pubid><pubid idtype="pmcid">3477243</pubid><pubid idtype="pmpid" link="fulltext">21204564</pubid></pubidlist></xrefbib></bibl><bibl id="B43"><title><p>On the accuracy and limits of peptide fragmentation spectrum prediction</p></title><aug><au><snm>Li</snm><fnm>S</fnm></au><au><snm>Arnold</snm><fnm>RJ</fnm></au><au><snm>Tang</snm><fnm>H</fnm></au><au><snm>Radivojac</snm><fnm>P</fnm></au></aug><source>Anal Chem</source><pubdate>2011</pubdate><volume>83</volume><issue>3</issue><fpage>790</fpage><lpage>796</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1021/ac102272r</pubid><pubid idtype="pmcid">3036742</pubid><pubid idtype="pmpid" link="fulltext">21175207</pubid></pubidlist></xrefbib></bibl><bibl id="B44"><title><p>Repeatability and reproducibility in proteomic identifications by liquid chromatography-tandem mass spectrometry</p></title><aug><au><snm>Tabb</snm><fnm>DL</fnm></au><au><snm>Vega-Montoto</snm><fnm>L</fnm></au><au><snm>Rudnick</snm><fnm>PA</fnm></au><au><snm>Variyath</snm><fnm>AM</fnm></au><au><snm>Ham</snm><fnm>AJ</fnm></au><au><snm>Bunk</snm><fnm>DM</fnm></au><au><snm>Kilpatrick</snm><fnm>LE</fnm></au><au><snm>Billheimer</snm><fnm>DD</fnm></au><au><snm>Blackman</snm><fnm>RK</fnm></au><au><snm>Cardasis</snm><fnm>HL</fnm></au><etal/></aug><source>J Proteome Res</source><pubdate>2010</pubdate><volume>9</volume><issue>2</issue><fpage>761</fpage><lpage>776</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1021/pr9006365</pubid><pubid 
idtype="pmcid">2818771</pubid><pubid idtype="pmpid" link="fulltext">19921851</pubid></pubidlist></xrefbib></bibl><bibl id="B45"><title><p>Protein identification by mass spectrometry: issues to be considered</p></title><aug><au><snm>Baldwin</snm><fnm>MA</fnm></au></aug><source>Mol Cell Proteomics</source><pubdate>2004</pubdate><volume>3</volume><issue>1</issue><fpage>1</fpage><lpage>9</lpage><xrefbib><pubid idtype="pmpid" link="fulltext">14608001</pubid></xrefbib></bibl><bibl id="B46"><title><p>The need for guidelines in publication of peptide and protein identification data: Working Group on Publication Guidelines for Peptide and Protein Identification Data</p></title><aug><au><snm>Carr</snm><fnm>S</fnm></au><au><snm>Aebersold</snm><fnm>R</fnm></au><au><snm>Baldwin</snm><fnm>M</fnm></au><au><snm>Burlingame</snm><fnm>A</fnm></au><au><snm>Clauser</snm><fnm>K</fnm></au><au><snm>Nesvizhskii</snm><fnm>A</fnm></au></aug><source>Mol Cell Proteomics</source><pubdate>2004</pubdate><volume>3</volume><issue>6</issue><fpage>531</fpage><lpage>533</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1074/mcp.T400006-MCP200</pubid><pubid idtype="pmpid" link="fulltext">15075378</pubid></pubidlist></xrefbib></bibl><bibl id="B47"><title><p>Reporting protein identification data: the next generation of guidelines</p></title><aug><au><snm>Bradshaw</snm><fnm>RA</fnm></au><au><snm>Burlingame</snm><fnm>AL</fnm></au><au><snm>Carr</snm><fnm>S</fnm></au><au><snm>Aebersold</snm><fnm>R</fnm></au></aug><source>Mol Cell Proteomics</source><pubdate>2006</pubdate><volume>5</volume><issue>5</issue><fpage>787</fpage><lpage>788</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1074/mcp.E600005-MCP200</pubid><pubid idtype="pmpid" link="fulltext">16670253</pubid></pubidlist></xrefbib></bibl><bibl id="B48"><title><p>Guidelines for the next 10 years of proteomics</p></title><aug><au><snm>Wilkins</snm><fnm>MR</fnm></au><au><snm>Appel</snm><fnm>RD</fnm></au><au><snm>Van 
Eyk</snm><fnm>JE</fnm></au><au><snm>Chung</snm><fnm>MC</fnm></au><au><snm>Gorg</snm><fnm>A</fnm></au><au><snm>Hecker</snm><fnm>M</fnm></au><au><snm>Huber</snm><fnm>LA</fnm></au><au><snm>Langen</snm><fnm>H</fnm></au><au><snm>Link</snm><fnm>AJ</fnm></au><au><snm>Paik</snm><fnm>YK</fnm></au><etal/></aug><source>Proteomics</source><pubdate>2006</pubdate><volume>6</volume><issue>1</issue><fpage>4</fpage><lpage>8</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1002/pmic.200500856</pubid><pubid idtype="pmpid" link="fulltext">16400714</pubid></pubidlist></xrefbib></bibl><bibl id="B49"><title><p>Minimum reporting guidelines for proteomics released by the Proteomics Standards Initiative</p></title><aug><au><snm>Jones</snm><fnm>AR</fnm></au><au><snm>Orchard</snm><fnm>S</fnm></au></aug><source>Mol Cell Proteomics</source><pubdate>2008</pubdate><volume>7</volume><issue>10</issue><fpage>2067</fpage><lpage>2068</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1074/mcp.H800010-MCP200</pubid><pubid idtype="pmpid" link="fulltext">18843148</pubid></pubidlist></xrefbib></bibl><bibl id="B50"><title><p>Protein identification false discovery rates for very large proteomics data sets generated by tandem mass spectrometry</p></title><aug><au><snm>Reiter</snm><fnm>L</fnm></au><au><snm>Claassen</snm><fnm>M</fnm></au><au><snm>Schrimpf</snm><fnm>SP</fnm></au><au><snm>Jovanovic</snm><fnm>M</fnm></au><au><snm>Schmidt</snm><fnm>A</fnm></au><au><snm>Buhmann</snm><fnm>JM</fnm></au><au><snm>Hengartner</snm><fnm>MO</fnm></au><au><snm>Aebersold</snm><fnm>R</fnm></au></aug><source>Mol Cell Proteomics</source><pubdate>2009</pubdate><volume>8</volume><issue>11</issue><fpage>2405</fpage><lpage>2417</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1074/mcp.M900317-MCP200</pubid><pubid idtype="pmcid">2773710</pubid><pubid idtype="pmpid" link="fulltext">19608599</pubid></pubidlist></xrefbib></bibl><bibl id="B51"><title><p>A heuristic method for assigning a false-discovery rate for protein identifications 
from Mascot database search results</p></title><aug><au><snm>Weatherly</snm><fnm>DB</fnm></au><au><snm>Atwood</snm><fnm>JA</fnm></au><au><snm>Minning</snm><fnm>TA</fnm></au><au><snm>Cavola</snm><fnm>C</fnm></au><au><snm>Tarleton</snm><fnm>RL</fnm></au><au><snm>Orlando</snm><fnm>R</fnm></au></aug><source>Mol Cell Proteomics</source><pubdate>2005</pubdate><volume>4</volume><issue>6</issue><fpage>762</fpage><lpage>772</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1074/mcp.M400215-MCP200</pubid><pubid idtype="pmpid" link="fulltext">15703444</pubid></pubidlist></xrefbib></bibl><bibl id="B52"><title><p>False discovery rates of protein identifications: a strike against the two-peptide rule</p></title><aug><au><snm>Gupta</snm><fnm>N</fnm></au><au><snm>Pevzner</snm><fnm>PA</fnm></au></aug><source>J Proteome Res</source><pubdate>2009</pubdate><volume>8</volume><issue>9</issue><fpage>4173</fpage><lpage>4181</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1021/pr9004794</pubid><pubid idtype="pmcid">3398614</pubid><pubid idtype="pmpid" link="fulltext">19627159</pubid></pubidlist></xrefbib></bibl><bibl id="B53"><title><p>Comparative proteogenomics: combining mass spectrometry and comparative genomics to analyze multiple genomes</p></title><aug><au><snm>Gupta</snm><fnm>N</fnm></au><au><snm>Benhamida</snm><fnm>J</fnm></au><au><snm>Bhargava</snm><fnm>V</fnm></au><au><snm>Goodman</snm><fnm>D</fnm></au><au><snm>Kain</snm><fnm>E</fnm></au><au><snm>Kerman</snm><fnm>I</fnm></au><au><snm>Nguyen</snm><fnm>N</fnm></au><au><snm>Ollikainen</snm><fnm>N</fnm></au><au><snm>Rodriguez</snm><fnm>J</fnm></au><au><snm>Wang</snm><fnm>J</fnm></au><etal/></aug><source>Genome Res</source><pubdate>2008</pubdate><volume>18</volume><issue>7</issue><fpage>1133</fpage><lpage>1142</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.074344.107</pubid><pubid idtype="pmcid">2493402</pubid><pubid idtype="pmpid" link="fulltext">18426904</pubid></pubidlist></xrefbib></bibl><bibl 
id="B54"><title><p>Proteomic parsimony through bipartite graph analysis improves accuracy and transparency</p></title><aug><au><snm>Zhang</snm><fnm>B</fnm></au><au><snm>Chambers</snm><fnm>MC</fnm></au><au><snm>Tabb</snm><fnm>DL</fnm></au></aug><source>J Proteome Res</source><pubdate>2007</pubdate><volume>6</volume><issue>9</issue><fpage>3549</fpage><lpage>3557</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1021/pr070230d</pubid><pubid idtype="pmcid">2810678</pubid><pubid idtype="pmpid" link="fulltext">17676885</pubid></pubidlist></xrefbib></bibl><bibl id="B55"><title><p>IDPicker 2.0: Improved protein assembly with high discrimination peptide identification filtering</p></title><aug><au><snm>Ma</snm><fnm>ZQ</fnm></au><au><snm>Dasari</snm><fnm>S</fnm></au><au><snm>Chambers</snm><fnm>MC</fnm></au><au><snm>Litton</snm><fnm>MD</fnm></au><au><snm>Sobecki</snm><fnm>SM</fnm></au><au><snm>Zimmerman</snm><fnm>LJ</fnm></au><au><snm>Halvey</snm><fnm>PJ</fnm></au><au><snm>Schilling</snm><fnm>B</fnm></au><au><snm>Drake</snm><fnm>PM</fnm></au><au><snm>Gibson</snm><fnm>BW</fnm></au><etal/></aug><source>J Proteome Res</source><pubdate>2009</pubdate><volume>8</volume><issue>8</issue><fpage>3872</fpage><lpage>3881</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1021/pr900360j</pubid><pubid idtype="pmcid">2810655</pubid><pubid idtype="pmpid" link="fulltext">19522537</pubid></pubidlist></xrefbib></bibl><bibl id="B56"><title><p>Mining gene functional networks to improve mass-spectrometry-based protein identification</p></title><aug><au><snm>Ramakrishnan</snm><fnm>SR</fnm></au><au><snm>Vogel</snm><fnm>C</fnm></au><au><snm>Kwon</snm><fnm>T</fnm></au><au><snm>Penalva</snm><fnm>LO</fnm></au><au><snm>Marcotte</snm><fnm>EM</fnm></au><au><snm>Miranker</snm><fnm>DP</fnm></au></aug><source>Bioinformatics</source><pubdate>2009</pubdate><volume>25</volume><issue>22</issue><fpage>2955</fpage><lpage>2961</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btp461</pubid><pubid 
idtype="pmcid">2773251</pubid><pubid idtype="pmpid" link="fulltext">19633097</pubid></pubidlist></xrefbib></bibl><bibl id="B57"><title><p>Integrating shotgun proteomics and mRNA expression data to improve protein identification</p></title><aug><au><snm>Ramakrishnan</snm><fnm>SR</fnm></au><au><snm>Vogel</snm><fnm>C</fnm></au><au><snm>Prince</snm><fnm>JT</fnm></au><au><snm>Li</snm><fnm>Z</fnm></au><au><snm>Penalva</snm><fnm>LO</fnm></au><au><snm>Myers</snm><fnm>M</fnm></au><au><snm>Marcotte</snm><fnm>EM</fnm></au><au><snm>Miranker</snm><fnm>DP</fnm></au><au><snm>Wang</snm><fnm>R</fnm></au></aug><source>Bioinformatics</source><pubdate>2009</pubdate><volume>25</volume><issue>11</issue><fpage>1397</fpage><lpage>1403</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btp168</pubid><pubid idtype="pmcid">2682515</pubid><pubid idtype="pmpid" link="fulltext">19318424</pubid></pubidlist></xrefbib></bibl><bibl id="B58"><title><p>A partial set covering model for protein mixture identification using mass spectrometry data</p></title><aug><au><snm>He</snm><fnm>Z</fnm></au><au><snm>Yang</snm><fnm>C</fnm></au><au><snm>Yu</snm><fnm>W</fnm></au></aug><source>IEEE/ACM Trans Comput Biol Bioinform</source><pubdate>2011</pubdate><volume>8</volume><issue>2</issue><fpage>368</fpage><lpage>380</lpage><xrefbib><pubid idtype="pmpid" link="fulltext">21233521</pubid></xrefbib></bibl><bibl id="B59"><title><p>Advancements in protein identification from shotgun proteomics using predicted peptide detectability</p></title><aug><au><snm>Alves</snm><fnm>P</fnm></au><au><snm>Arnold</snm><fnm>RJ</fnm></au><au><snm>Novotny</snm><fnm>MV</fnm></au><au><snm>Radivojac</snm><fnm>P</fnm></au><au><snm>Reilly</snm><fnm>JP</fnm></au><au><snm>Tang</snm><fnm>H</fnm></au></aug><source>Pac Symp Biocomput</source><pubdate>2007</pubdate><volume>12</volume><fpage>409</fpage><lpage>420</lpage></bibl><bibl id="B60"><title><p>A Bayesian approach to protein inference problem in shotgun 
proteomics</p></title><aug><au><snm>Li</snm><fnm>YF</fnm></au><au><snm>Arnold</snm><fnm>RJ</fnm></au><au><snm>Li</snm><fnm>Y</fnm></au><au><snm>Radivojac</snm><fnm>P</fnm></au><au><snm>Sheng</snm><fnm>Q</fnm></au><au><snm>Tang</snm><fnm>H</fnm></au></aug><source>The 12th Annual International Conference on Research in Computational Molecular Biology, RECOMB 2008: 2008; Singapore</source><pubdate>2008</pubdate><fpage>167</fpage><lpage>180</lpage></bibl><bibl id="B61"><title><p>A Bayesian approach to protein inference problem in shotgun proteomics</p></title><aug><au><snm>Li</snm><fnm>YF</fnm></au><au><snm>Arnold</snm><fnm>RJ</fnm></au><au><snm>Li</snm><fnm>Y</fnm></au><au><snm>Radivojac</snm><fnm>P</fnm></au><au><snm>Sheng</snm><fnm>Q</fnm></au><au><snm>Tang</snm><fnm>H</fnm></au></aug><source>J Comput Biol</source><pubdate>2009</pubdate><volume>16</volume><issue>8</issue><fpage>1183</fpage><lpage>1193</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1089/cmb.2009.0018</pubid><pubid idtype="pmcid">2799497</pubid><pubid idtype="pmpid" link="fulltext">19645593</pubid></pubidlist></xrefbib></bibl><bibl id="B62"><title><p>Statistical models for protein validation using tandem mass spectral data and protein amino acid sequence databases</p></title><aug><au><snm>Sadygov</snm><fnm>RG</fnm></au><au><snm>Liu</snm><fnm>H</fnm></au><au><snm>Yates</snm><fnm>JR</fnm></au></aug><source>Anal Chem</source><pubdate>2004</pubdate><volume>76</volume><issue>6</issue><fpage>1664</fpage><lpage>1671</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1021/ac035112y</pubid><pubid idtype="pmpid" link="fulltext">15018565</pubid></pubidlist></xrefbib></bibl><bibl id="B63"><title><p>Probability-based evaluation of peptide and protein identifications from tandem mass spectrometry and SEQUEST analysis: the human 
proteome</p></title><aug><au><snm>Qian</snm><fnm>WJ</fnm></au><au><snm>Liu</snm><fnm>T</fnm></au><au><snm>Monroe</snm><fnm>ME</fnm></au><au><snm>Strittmatter</snm><fnm>EF</fnm></au><au><snm>Jacobs</snm><fnm>JM</fnm></au><au><snm>Kangas</snm><fnm>LJ</fnm></au><au><snm>Petritis</snm><fnm>K</fnm></au><au><snm>Camp</snm><fnm>DG</fnm></au><au><snm>Smith</snm><fnm>RD</fnm></au></aug><source>J Proteome Res</source><pubdate>2005</pubdate><volume>4</volume><issue>1</issue><fpage>53</fpage><lpage>62</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1021/pr0498638</pubid><pubid idtype="pmpid" link="fulltext">15707357</pubid></pubidlist></xrefbib></bibl><bibl id="B64"><title><p>Probability model for assessing proteins assembled from peptide sequences inferred from tandem mass spectrometry data</p></title><aug><au><snm>Feng</snm><fnm>J</fnm></au><au><snm>Naiman</snm><fnm>DQ</fnm></au><au><snm>Cooper</snm><fnm>B</fnm></au></aug><source>Anal Chem</source><pubdate>2007</pubdate><volume>79</volume><issue>10</issue><fpage>3901</fpage><lpage>3911</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1021/ac070202e</pubid><pubid idtype="pmpid" link="fulltext">17441689</pubid></pubidlist></xrefbib></bibl><bibl id="B65"><title><p>EBP, a program for protein identification using multiple tandem mass spectrometry datasets</p></title><aug><au><snm>Price</snm><fnm>TS</fnm></au><au><snm>Lucitt</snm><fnm>MB</fnm></au><au><snm>Wu</snm><fnm>W</fnm></au><au><snm>Austin</snm><fnm>DJ</fnm></au><au><snm>Pizarro</snm><fnm>A</fnm></au><au><snm>Yocum</snm><fnm>AK</fnm></au><au><snm>Blair</snm><fnm>IA</fnm></au><au><snm>FitzGerald</snm><fnm>GA</fnm></au><au><snm>Grosser</snm><fnm>T</fnm></au></aug><source>Mol Cell Proteomics</source><pubdate>2007</pubdate><volume>6</volume><issue>3</issue><fpage>527</fpage><lpage>536</lpage><xrefbib><pubid idtype="pmpid" link="fulltext">17164401</pubid></xrefbib></bibl><bibl id="B66"><title><p>A hierarchical statistical model to assess the confidence of peptides and 
proteins inferred from tandem mass spectrometry</p></title><aug><au><snm>Shen</snm><fnm>C</fnm></au><au><snm>Wang</snm><fnm>Z</fnm></au><au><snm>Shankar</snm><fnm>G</fnm></au><au><snm>Zhang</snm><fnm>X</fnm></au><au><snm>Li</snm><fnm>L</fnm></au></aug><source>Bioinformatics</source><pubdate>2008</pubdate><volume>24</volume><issue>2</issue><fpage>202</fpage><lpage>208</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btm555</pubid><pubid idtype="pmpid" link="fulltext">18024968</pubid></pubidlist></xrefbib></bibl><bibl id="B67"><title><p>Deterministic protein inference for shotgun proteomics data provides new insights into Arabidopsis pollen development and function</p></title><aug><au><snm>Grobei</snm><fnm>MA</fnm></au><au><snm>Qeli</snm><fnm>E</fnm></au><au><snm>Brunner</snm><fnm>E</fnm></au><au><snm>Rehrauer</snm><fnm>H</fnm></au><au><snm>Zhang</snm><fnm>R</fnm></au><au><snm>Roschitzki</snm><fnm>B</fnm></au><au><snm>Basler</snm><fnm>K</fnm></au><au><snm>Ahrens</snm><fnm>CH</fnm></au><au><snm>Grossniklaus</snm><fnm>U</fnm></au></aug><source>Genome Res</source><pubdate>2009</pubdate><volume>19</volume><issue>10</issue><fpage>1786</fpage><lpage>1800</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.089060.108</pubid><pubid idtype="pmcid">2765272</pubid><pubid idtype="pmpid" link="fulltext">19546170</pubid></pubidlist></xrefbib></bibl><bibl id="B68"><title><p>Protein identification from tandem mass spectra with probabilistic language modeling</p></title><aug><au><snm>Yang</snm><fnm>Y</fnm></au><au><snm>Harpale</snm><fnm>A</fnm></au><au><snm>Ganapathy</snm><fnm>S</fnm></au></aug><source>Machine Learning and Knowledge Discovery in Databases</source><pubdate>2009</pubdate><fpage>554</fpage><lpage>569</lpage></bibl><bibl id="B69"><title><p>Protein and gene model inference based on statistical modeling in k-partite 
graphs</p></title><aug><au><snm>Gerster</snm><fnm>S</fnm></au><au><snm>Qeli</snm><fnm>E</fnm></au><au><snm>Ahrens</snm><fnm>CH</fnm></au><au><snm>Buhlmann</snm><fnm>P</fnm></au></aug><source>Proc Natl Acad Sci USA</source><pubdate>2010</pubdate><volume>107</volume><issue>27</issue><fpage>12101</fpage><lpage>12106</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1073/pnas.0907654107</pubid><pubid idtype="pmcid">2901486</pubid><pubid idtype="pmpid" link="fulltext">20562346</pubid></pubidlist></xrefbib></bibl><bibl id="B70"><title><p>A nested mixture model for protein identification using mass spectrometry</p></title><aug><au><snm>Li</snm><fnm>Q</fnm></au><au><snm>MacCoss</snm><fnm>M</fnm></au><au><snm>Stephens</snm><fnm>M</fnm></au></aug><source>Ann Appl Stat</source><pubdate>2010</pubdate><volume>4</volume><issue>2</issue><fpage>962</fpage><lpage>987</lpage></bibl><bibl id="B71"><title><p>Efficient marginalization to compute protein posterior probabilities from shotgun mass spectrometry data</p></title><aug><au><snm>Serang</snm><fnm>O</fnm></au><au><snm>MacCoss</snm><fnm>MJ</fnm></au><au><snm>Noble</snm><fnm>WS</fnm></au></aug><source>J Proteome Res</source><pubdate>2010</pubdate><volume>9</volume><issue>10</issue><fpage>5346</fpage><lpage>5357</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1021/pr100594k</pubid><pubid idtype="pmcid">2948606</pubid><pubid idtype="pmpid" link="fulltext">20712337</pubid></pubidlist></xrefbib></bibl><bibl id="B72"><title><p>iProphet: multi-level integrative analysis of shotgun proteomic data improves peptide and protein identification rates and error 
estimates</p></title><aug><au><snm>Shteynberg</snm><fnm>D</fnm></au><au><snm>Deutsch</snm><fnm>EW</fnm></au><au><snm>Lam</snm><fnm>H</fnm></au><au><snm>Eng</snm><fnm>JK</fnm></au><au><snm>Sun</snm><fnm>Z</fnm></au><au><snm>Tasman</snm><fnm>N</fnm></au><au><snm>Mendoza</snm><fnm>L</fnm></au><au><snm>Moritz</snm><fnm>RL</fnm></au><au><snm>Aebersold</snm><fnm>R</fnm></au><au><snm>Nesvizhskii</snm><fnm>AI</fnm></au></aug><source>Mol Cell Proteomics</source><pubdate>2011</pubdate><volume>10</volume><issue>12</issue><fpage>M111 007690</fpage><xrefbib><pubid idtype="pmpid" link="fulltext">21876204</pubid></xrefbib></bibl><bibl id="B73"><title><p>A guided tour of the Trans-Proteomic Pipeline</p></title><aug><au><snm>Deutsch</snm><fnm>EW</fnm></au><au><snm>Mendoza</snm><fnm>L</fnm></au><au><snm>Shteynberg</snm><fnm>D</fnm></au><au><snm>Farrah</snm><fnm>T</fnm></au><au><snm>Lam</snm><fnm>H</fnm></au><au><snm>Tasman</snm><fnm>N</fnm></au><au><snm>Sun</snm><fnm>Z</fnm></au><au><snm>Nilsson</snm><fnm>E</fnm></au><au><snm>Pratt</snm><fnm>B</fnm></au><au><snm>Prazen</snm><fnm>B</fnm></au><etal/></aug><source>Proteomics</source><pubdate>2010</pubdate><volume>10</volume><issue>6</issue><fpage>1150</fpage><lpage>1159</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1002/pmic.200900375</pubid><pubid idtype="pmcid">3017125</pubid><pubid idtype="pmpid" link="fulltext">20101611</pubid></pubidlist></xrefbib></bibl><bibl id="B74"><title><p>Probabilistic inference using belief networks is NP-hard</p></title><aug><au><snm>Cooper</snm><fnm>G</fnm></au></aug><source>Artificial Intelligence</source><pubdate>1990</pubdate><volume>42</volume><issue>2-3</issue><fpage>393</fpage><lpage>405</lpage><xrefbib><pubid idtype="doi">10.1016/0004-3702(90)90060-D</pubid></xrefbib></bibl><bibl id="B75"><title><p>Protein identification problem from a Bayesian point of 
view</p></title><aug><au><snm>Li</snm><fnm>YF</fnm></au><au><snm>Arnold</snm><fnm>RJ</fnm></au><au><snm>Radivojac</snm><fnm>P</fnm></au><au><snm>Tang</snm><fnm>H</fnm></au></aug><source>Stat Interface</source><pubdate>2012</pubdate><volume>5</volume><issue>1</issue><fpage>21</fpage><lpage>38</lpage></bibl><bibl id="B76"><title><p>Faster mass spectrometry-based protein inference: junction trees are more efficient than sampling and marginalization by enumeration</p></title><aug><au><snm>Serang</snm><fnm>O</fnm></au><au><snm>Noble</snm><fnm>WS</fnm></au></aug><source>IEEE/ACM Transactions on Computational Biology and Bioinformatics</source><pubdate>2012</pubdate></bibl><bibl id="B77"><title><p>What does it mean to identify a protein in proteomics?</p></title><aug><au><snm>Rappsilber</snm><fnm>J</fnm></au><au><snm>Mann</snm><fnm>M</fnm></au></aug><source>Trends Biochem Sci</source><pubdate>2002</pubdate><volume>27</volume><issue>2</issue><fpage>74</fpage><lpage>78</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/S0968-0004(01)02021-7</pubid><pubid idtype="pmpid" link="fulltext">11852244</pubid></pubidlist></xrefbib></bibl><bibl id="B78"><title><p>Experimental protein mixture for validating tandem mass spectral analysis</p></title><aug><au><snm>Keller</snm><fnm>A</fnm></au><au><snm>Purvine</snm><fnm>S</fnm></au><au><snm>Nesvizhskii</snm><fnm>AI</fnm></au><au><snm>Stolyar</snm><fnm>S</fnm></au><au><snm>Goodlett</snm><fnm>DR</fnm></au><au><snm>Kolker</snm><fnm>E</fnm></au></aug><source>OMICS</source><pubdate>2002</pubdate><volume>6</volume><issue>2</issue><fpage>207</fpage><lpage>212</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1089/153623102760092805</pubid><pubid idtype="pmpid" link="fulltext">12143966</pubid></pubidlist></xrefbib></bibl><bibl id="B79"><title><p>Standard mixtures for proteome 
studies</p></title><aug><au><snm>Purvine</snm><fnm>S</fnm></au><au><snm>Picone</snm><fnm>AF</fnm></au><au><snm>Kolker</snm><fnm>E</fnm></au></aug><source>OMICS</source><pubdate>2004</pubdate><volume>8</volume><issue>1</issue><fpage>79</fpage><lpage>92</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1089/153623104773547507</pubid><pubid idtype="pmpid" link="fulltext">15107238</pubid></pubidlist></xrefbib></bibl><bibl id="B80"><title><p>The standard protein mix database: a diverse data set to assist in the production of improved Peptide and protein identification software tools</p></title><aug><au><snm>Klimek</snm><fnm>J</fnm></au><au><snm>Eddes</snm><fnm>JS</fnm></au><au><snm>Hohmann</snm><fnm>L</fnm></au><au><snm>Jackson</snm><fnm>J</fnm></au><au><snm>Peterson</snm><fnm>A</fnm></au><au><snm>Letarte</snm><fnm>S</fnm></au><au><snm>Gafken</snm><fnm>PR</fnm></au><au><snm>Katz</snm><fnm>JE</fnm></au><au><snm>Mallick</snm><fnm>P</fnm></au><au><snm>Lee</snm><fnm>H</fnm></au><etal/></aug><source>J Proteome Res</source><pubdate>2008</pubdate><volume>7</volume><issue>1</issue><fpage>96</fpage><lpage>103</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1021/pr070244j</pubid><pubid idtype="pmcid">2577160</pubid><pubid idtype="pmpid" link="fulltext">17711323</pubid></pubidlist></xrefbib></bibl></refgrp>
</bm></art>