<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>gb-2003-4-8-r51</ui>
   <ji>GBJ</ji>
   <fm>
      <dochead>Software</dochead>
      <bibl>
         <title>
            <p>A comparative proteomics resource: proteins of <it>Arabidopsis thaliana</it></p>
         </title>
         <aug>
            <au id="A1">
               <snm>Li</snm>
               <mi>W</mi>
               <fnm>Wilfred</fnm>
               <insr iid="I1"/>
            </au>
            <au id="A2">
               <snm>Quinn</snm>
               <mi>B</mi>
               <fnm>Greg</fnm>
               <insr iid="I1"/>
            </au>
            <au id="A3">
               <snm>Alexandrov</snm>
               <mi>N</mi>
               <fnm>Nickolai</fnm>
               <insr iid="I2"/>
            </au>
            <au id="A4" ca="yes">
               <snm>Bourne</snm>
               <mi>E</mi>
               <fnm>Philip</fnm>
               <insr iid="I1"/>
               <insr iid="I3"/>
               <email>bourne@sdsc.edu</email>
            </au>
            <au id="A5">
               <snm>Shindyalov</snm>
               <mi>N</mi>
               <fnm>Ilya</fnm>
               <insr iid="I1"/>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>San Diego Supercomputer Center, 9500 Gilman Drive, University of California San Diego, La Jolla, CA 92093-0505, USA</p>
            </ins>
            <ins id="I2">
               <p>Ceres Inc., 3007 Malibu Canyon Road, Malibu, CA 90265, USA</p>
            </ins>
            <ins id="I3">
               <p>Department of Pharmacology, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA</p>
            </ins>
         </insg>
         <source>Genome Biology</source>
         <issn>1465-6906</issn>
         <pubdate>2003</pubdate>
         <volume>4</volume>
         <issue>8</issue>
         <fpage>R51</fpage>
         <url>http://genomebiology.com/2003/4/8/R51</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="doi">10.1186/gb-2003-4-8-r51</pubid>
               <pubid idtype="pmpid">12914659</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>3</day>
               <month>2</month>
               <year>2003</year>
            </date>
         </rec>
         <revrec>
            <date>
               <day>6</day>
               <month>5</month>
               <year>2003</year>
            </date>
         </revrec>
         <acc>
            <date>
               <day>2</day>
               <month>7</month>
               <year>2003</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>28</day>
               <month>7</month>
               <year>2003</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2003</year>
         <collab>Li et al; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.</collab>
      </cpyrt>
      <shorttitle>
         <p>A comparative proteomics resource: proteins of <it>Arabidopsis thaliana</it></p>
      </shorttitle>
      <shortabs>
         <p>Using an integrative genome annotation pipeline (iGAP) for proteome-wide protein structure and functional domain assignment, all the proteins of <it>Arabidopsis thaliana</it> have been analyzed.</p>
      </shortabs>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <p>Using an integrative genome annotation pipeline (iGAP) for proteome-wide protein structure and functional domain assignment, we analyzed all the proteins of <it>Arabidopsis thaliana</it>. Three-dimensional structures at the level of the domain are assigned by fold recognition and threading based on a novel fold library that extends common domain classifications. iGAP is being applied to proteins from all available proteomes as part of a comparative proteomics resource. The database is accessible from the web.</p>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="BMC" subtype="man_spc_id" id="30010015">Model organisms</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010002">Bioinformatics</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010010">Genome studies</classification>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Rationale</p>
         </st>
         <p>Protein-sequence-based comparative analysis to infer biological function is important and familiar to most biologists. Sequence-profile methods such as PSI-BLAST <abbrgrp><abbr bid="B1">1</abbr></abbrgrp> or HMMER <abbrgrp><abbr bid="B2">2</abbr></abbrgrp> are often used to detect distant homologs, and Prosite <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>, BLOCKS <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> and PFAM <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> are representative resources that classify proteins by sequence pattern. Protein structure also plays a crucial role in a full understanding of protein function: it is more conserved than sequence and hence exposes relationships that cannot be detected from sequence alone. Many protein domains have less than 10% sequence identity, and yet possess a similar fold and possibly a related function.</p>
         <p>One of the early insights gained from comparative genomics was domain accretion <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. From prokaryotes to eukaryotes, the number of domains increases. But in higher eukaryotes, different combinations of domains are often observed in the same and different protein families. From a structural point of view, domains are discrete, compact folding units. PIR <abbrgrp><abbr bid="B7">7</abbr></abbrgrp> classifies proteins into either a homeomorphic superfamily (proteins containing similar domains in the same order) or a homology domain superfamily (proteins from different homeomorphic superfamilies sharing a common ancestral domain). This modular nature of proteins necessitates a new approach to proteome annotation - a structural-domain-based approach.</p>
         <p>A number of automated or semi-automated complete genome annotation systems already exist. For example, GeneQuiz <abbrgrp><abbr bid="B8">8</abbr></abbrgrp> and PEDANT <abbrgrp><abbr bid="B9">9</abbr></abbrgrp> are two pipelines that are comprehensive and highly automated (Table <tblr tid="T1">1</tblr>). Similarly, there are several sites that provide protein structure annotations for various genomes. Superfamily <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> uses a set of hidden Markov model (HMM) profiles based on SCOP superfamily members. MatDB, based on PEDANT analysis of <it>Arabidopsis thaliana</it>, provides structural annotations using SCOP domain position-specific scoring matrix (PSSM) profiles. The National Center for Biotechnology Information (NCBI) maintains a Conserved Domain Database (CDD) that uses PFAM and SMART <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> domain PSSMs to detect possible structural homologs. The 3D-Genomics database <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> uses SCOP domain PSSMs from 3D-PSSM <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. Gene3D uses the CATH domain classification to annotate genes and genomes <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>.</p>
         <tbl id="T1" hint_layout="double">
            <title>
               <p>Table 1</p>
            </title>
            <caption>
               <p>Comparison of different annotation pipelines</p>
            </caption>
            <tblbdy cols="4">
               <r>
                  <c ca="left">
                     <p>Pipeline</p>
                  </c>
                  <c ca="left">
                     <p>Focus area</p>
                  </c>
                  <c ca="left">
                     <p>Applications</p>
                  </c>
                  <c ca="left">
                     <p>Coverage</p>
                  </c>
               </r>
               <r>
                  <c cspan="4">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>GeneQuiz</p>
                  </c>
                  <c ca="left">
                     <p>Sequence homology</p>
                     <p>Function assignment</p>
                  </c>
                  <c ca="left">
                     <p>BLAST, FASTA, COILS,</p>
                     <p>MaxHom, Prosite, Blocks,</p>
                     <p>Predict Protein,</p>
                     <p>Transmembrane helix, CAST.</p>
                  </c>
                  <c ca="left">
                     <p>65 genomes</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>PEDANT</p>
                  </c>
                  <c ca="left">
                     <p>Gene prediction</p>
                     <p>Sequence homology</p>
                     <p>Function assignment</p>
                     <p>Fold assignment</p>
                  </c>
                  <c ca="left">
                     <p>BLAST, PSI-BLAST,</p>
                     <p>HMMER, PREDATOR,</p>
                     <p>Orpheus, BLIMPS, STRIDE.</p>
                  </c>
                  <c ca="left">
                     <p>133 complete genomes, 91 partial genomes</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>PAT</p>
                  </c>
                  <c ca="left">
                     <p>Sequence homology</p>
                     <p>Function assignment</p>
                     <p>Fold recognition</p>
                     <p>Structure prediction</p>
                  </c>
                  <c ca="left">
                     <p>WU-BLAST, PSI-BLAST,</p>
                     <p>123D, HMMER</p>
                  </c>
                  <c ca="left">
                     <p>103+ genomes, continuous expansion</p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
         <p>We have developed an automated integrative genome annotation pipeline (iGAP), initially to annotate the proteins of <it>A. thaliana</it> and later all proteomes, based on a comprehensive fold library (Figure <figr fid="F1">1</figr>). In addition to the domains from SCOP, we have included domains parsed using the protein domain parser (PDP) <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>, full-length Protein Data Bank (PDB) chains and chains not classified by SCOP, but associated with SCOP using combinatorial extension (CE), a structural-similarity search algorithm <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. The result is a comprehensive fold library (FOLDLIB) from which comparative and fold recognition models of three-dimensional structure are derived. As a step beyond PSI-BLAST or PFAM profiles, we have used 123D+ <abbrgrp><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr></abbrgrp>, which not only performs target-template profile-profile alignment, but also uses secondary structure and contact capacity potential information for protein fold recognition. Further, the annotation pipeline assigns each functional prediction a graded reliability index, ranging from A to E, based on extensive benchmarking of selectivity versus sensitivity (N.N.A., I.N.S. and P.E.B., unpublished work). Here we describe iGAP and the initial results on the analysis of <it>A. thaliana</it>, the first proteome processed, using a combination of web interface and SQL queries (Figure <figr fid="F2">2</figr>). Comparisons are made to other annotation schemes used to process <it>Arabidopsis</it> and to other proteomes processed with iGAP. The iGAP is systematically being applied to more than 1,000 proteomes, completely or partially sequenced and publicly available at NCBI <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>, to develop a comparative proteomic resource.</p>
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>The integrative genome annotation pipeline (iGAP)</p>
            </caption>
            <text>
               <p>The integrative genome annotation pipeline (iGAP). Processing of initial structural information is shown on the left and processing of initial sequence information on the right. Green shading indicates a processing step involving structure information and blue shading a processing step involving a sequence. Steps boxed with dotted lines indicate partial integration into the benchmarking scheme. See text for further details.</p>
            </text>
            <graphic file="gb-2003-4-8-r51-1"/>
         </fig>
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>Overview of the user interface</p>
            </caption>
            <text>
               <p>Overview of the user interface. The information stored in the database may be accessed by known identifiers, keywords, browsing classifications (SCOP and FOLDLIB) and by sequence. Identifiers supported include <it>Arabidopsis </it>locus id, NCBI gi number, SCOP id, PDB id, FOLDLIB id and PFAM id. Keywords are limited to those available in each original data source.</p>
            </text>
            <graphic file="gb-2003-4-8-r51-2"/>
         </fig>
      </sec>
      <sec>
         <st>
            <p>Results and discussion</p>
         </st>
         <p>Automated annotation pipelines are crucial for organizing the deluge of genomic information. Table <tblr tid="T1">1</tblr> compares features of iGAP with those of GeneQuiz and PEDANT, two established genome annotation methodologies. GeneQuiz focuses on homolog and function assignment through sequence similarity search; PEDANT is a comprehensive analysis pipeline with emphasis on gene prediction and on secondary and tertiary structure assignment; iGAP puts much more emphasis on fold recognition, threading and, to be released in the near future, homology modeling. Table <tblr tid="T2">2</tblr> compares the proteins of <it>A. thaliana</it> (PAT) database to established databases of protein annotations. They differ in both coverage and focus. Each of the resources has clear strengths in a number of areas, but PAT stands out in the amount of structural information it provides. Whereas other resources are limited to what is present in PDB or SCOP, PAT provides additional domains from PDP, and genetic domains from Astral. Moreover, an important feature of iGAP is the benchmarking used to establish the reliability measures. Such quality assurance is critical to the future development of these resources if they are to be used in a meaningful way by experimentalists.</p>
         <tbl id="T2" hint_layout="double">
            <title>
               <p>Table 2</p>
            </title>
            <caption>
               <p>Database feature comparison</p>
            </caption>
            <tblbdy cols="6">
               <r>
                  <c ca="left">
                     <p>Databases</p>
                  </c>
                  <c ca="left">
                     <p>Features</p>
                  </c>
                  <c ca="left">
                     <p>Scope</p>
                  </c>
                  <c ca="left">
                     <p>Level of integration</p>
                  </c>
                  <c ca="left">
                     <p>Learning curve</p>
                  </c>
                  <c ca="left">
                     <p>Drawbacks</p>
                  </c>
               </r>
               <r>
                  <c cspan="6">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Entrez Genome <abbrgrp><abbr bid="B20">20</abbr></abbrgrp></p>
                  </c>
                  <c ca="left">
                     <p>Domains from CDD (SMART, PFAM)</p>
                     <p>Proteins by NCBI GI number, accession number, Swiss Prot ID, and so on</p>
                     <p>Structure by PDB ID</p>
                     <p>3D domains from MMDB</p>
                     <p>Domain relatives by CDART</p>
                     <p>Related sequences using BLINK </p>
                     <p>Visualization using Cn3D </p>
                     <p>Public data</p>
                  </c>
                  <c ca="left">
                     <p>All sequences published or voluntarily deposited</p>
                     <p>1,000+ genomes</p>
                  </c>
                  <c ca="left">
                     <p>High</p>
                  </c>
                  <c ca="left">
                     <p>Easy to high</p>
                  </c>
                  <c ca="left">
                     <p>Complex system</p>
                     <p>Only experimental structural information is available</p>
                     <p>Software interface is not readily available</p>
                     <p>Linkout progress is slow</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>EBI Proteome Analysis Database <abbrgrp><abbr bid="B43">43</abbr></abbrgrp></p>
                  </c>
                  <c ca="left">
                     <p>InterPro member databases (SwissProt, PFAM, SMART, TIGRFAM, PRINTS, PROSITE, ProDom, PIR SuperFamily)</p>
                     <p>Families, domains and sites by member databases</p>
                     <p>GO annotation</p>
                     <p>Manual curation and integration</p>
                     <p>Precomputed matches against InterPro entries</p>
                  </c>
                  <c ca="left">
                     <p>Complete proteomes in SwissProt and TrEMBL</p>
                     <p>110+ proteomes</p>
                  </c>
                  <c ca="left">
                     <p>Medium</p>
                  </c>
                  <c ca="left">
                     <p>Easy to moderate</p>
                  </c>
                  <c ca="left">
                     <p>SRS based query interface free to academia</p>
                     <p>Basic keyword search possible</p>
                     <p>Sequence based classification</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>MatDB</p>
                  </c>
                  <c ca="left">
                     <p><it>Arabidopsis </it>annotation from PEDANT</p>
                     <p>Free text search</p>
                     <p>Protein categories by structure, function based on SCOP, PIR, InterPro</p>
                  </c>
                  <c ca="left">
                     <p><it>Arabidopsis </it>with limited intergenome comparison</p>
                  </c>
                  <c ca="left">
                     <p>Medium</p>
                  </c>
                  <c ca="left">
                     <p>Easy to moderate</p>
                  </c>
                  <c ca="left">
                     <p>Query response time varies</p>
                     <p>SCOP classification mildly difficult to use</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Proteins of <it>Arabidopsis thaliana </it>(PAT) database</p>
                  </c>
                  <c ca="left">
                     <p>Domains from SCOP, predicted domains from PDP, and full length PDB chains with less than 90% sequence identity (FOLDLIB)</p>
                     <p>GO annotation</p>
                     <p>Precomputed matches against FOLDLIB</p>
                     <p>Template-based structure models</p>
                     <p>Visualization using QuickPDB, Chime</p>
                     <p>Advanced keyword search</p>
                     <p>Hierarchical browsing based on SCOP</p>
                     <p>Related sequences using WU-BLAST</p>
                  </c>
                  <c ca="left">
                     <p>Currently 87</p>
                     <p>Expanding to provide coverage for all known proteomes</p>
                  </c>
                  <c ca="left">
                     <p>Medium</p>
                  </c>
                  <c ca="left">
                     <p>Easy to moderate</p>
                  </c>
                  <c ca="left">
                     <p>Presentation style</p>
                     <p>Query flexibility implies a higher learning curve</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>TAIR</p>
                  </c>
                  <c ca="left">
                     <p>GO and other ontology development</p>
                     <p>Sequence and map viewer</p>
                     <p>Domains from InterPro</p>
                     <p>Regulatory motif analysis</p>
                     <p>User annotation</p>
                  </c>
                  <c ca="left">
                     <p>Comprehensive resource devoted to <it>Arabidopsis</it></p>
                  </c>
                  <c ca="left">
                     <p>Medium</p>
                  </c>
                  <c ca="left">
                     <p>Easy to moderate</p>
                  </c>
                  <c ca="left">
                     <p>No structural information</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>SUPERFAMILY</p>
                  </c>
                  <c ca="left">
                     <p>HMM (SAM) models for SCOP domains</p>
                     <p>Fold recognition</p>
                     <p>Domain architecture visualization</p>
                  </c>
                  <c ca="left">
                     <p>107 genomes</p>
                  </c>
                  <c ca="left">
                     <p>Low to medium</p>
                  </c>
                  <c ca="left">
                     <p>Easy to moderate</p>
                  </c>
                  <c ca="left">
                     <p>Presentation style</p>
                     <p>No update information</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Gene 3D</p>
                  </c>
                  <c ca="left">
                     <p>Structural assignment based on CATH domain classification using PSI-BLAST</p>
                  </c>
                  <c ca="left">
                     <p>66 genomes</p>
                  </c>
                  <c ca="left">
                     <p>Low</p>
                  </c>
                  <c ca="left">
                     <p>Easy</p>
                  </c>
                  <c ca="left">
                     <p>Annotation not dynamically linked to CATH</p>
                     <p>No update information</p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
         <p>Table <tblr tid="T3">3a</tblr> indicates the coverage of the <it>Arabidopsis</it> proteome provided by each methodology and associated resource. It is clear that InterPro and iGAP represent two approaches that provide very high coverage of the <it>Arabidopsis</it> proteome, based on sequence and structural information respectively. A combination of InterProScan and iGAP is under active development to integrate sequence- and structure-based annotation. Interestingly, only 14% of the GO annotation from The <it>Arabidopsis</it> Information Resource (TAIR) is based on nonelectronic annotation. This makes an even stronger argument for the integration of sequence- and structure-based annotation, to reduce the possibility of error propagation in electronic annotation. Table <tblr tid="T3">3b</tblr> highlights some specific examples in which PAT yields results beyond those obtained by other means. Whether these results are meaningful depends on the user's perspective. For one user, a few additional predictions with 90% certainty could be a distraction. To another, they might, in connection with additional experimental evidence, prove valuable. A future challenge to those of us providing such resources is to minimize the pain and maximize the gain for the different types of user. Again, quality assurance and user interface design will prove important. While we have made efforts to classify the reliability of our predictions, they are still predictions and should be used, where possible, with associated experimental proof.</p>
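         <p>The PAT coverage figures in Table <tblr tid="T3">3a</tblr> are given for grade ranges (A, A-B, A-C and so on), so the share of proteins contributed by each individual grade follows by subtraction. The following is a minimal illustrative sketch in Python and is not part of iGAP. It assumes that the ranges are cumulative and that each protein is counted once, at its best grade. The certainty values are those listed in Table <tblr tid="T3">3b</tblr>, which gives none for grade E. The names cumulative, certainty and per_grade_share are hypothetical.</p>
         <p>
```python
# Cumulative PAT coverage of the Arabidopsis proteome (percent), from Table 3a.
# Assumption: each value counts proteins whose best hit is at that grade or better.
cumulative = {"A": 38, "B": 46, "C": 65, "D": 84, "E": 94}

# Certainty attached to each grade in Table 3b; no value is given there for E.
certainty = {"A": 99.9, "B": 99.0, "C": 90.0, "D": 50.0}


def per_grade_share(cumulative_coverage):
    """Return the percentage of proteins whose best grade is exactly each grade."""
    shares = {}
    previous = 0
    for grade in sorted(cumulative_coverage):  # "A" to "E" sort alphabetically
        shares[grade] = cumulative_coverage[grade] - previous
        previous = cumulative_coverage[grade]
    return shares


shares = per_grade_share(cumulative)
unannotated = 100 - cumulative["E"]  # proteins with no assignment at any grade

for grade, share in shares.items():
    if grade in certainty:
        label = f"{certainty[grade]}% certainty"
    else:
        label = "certainty not stated"
    print(f"Grade {grade}: {share}% of proteins ({label})")
print(f"No assignment: {unannotated}%")
```
         </p>
         <p>Under these assumptions, most of the gain from relaxing the threshold comes at grades C and D, which is where the trade-off between additional predictions and lower certainty discussed above is most relevant.</p>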
         <tbl id="T3" hint_layout="double">
            <title>
               <p>Table 3</p>
            </title>
            <caption>
               <p>Comparison of PAT with other resources</p>
            </caption>
            <tblbdy cols="5">
               <r>
                  <c ca="left">
                     <p>
                        <b>(a) Coverage</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>PAT</p>
                  </c>
                  <c ca="left">
                     <p>PEDANT/MatDB</p>
                  </c>
                  <c ca="left">
                     <p>TAIR/GO</p>
                  </c>
                  <c ca="left">
                     <p>EBI Proteomes/InterPro</p>
                  </c>
               </r>
               <r>
                  <c cspan="5">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>94% A-E</p>
                  </c>
                  <c ca="left">
                     <p>30.9% PDB</p>
                  </c>
                  <c ca="left">
                     <p>38% ALL</p>
                  </c>
                  <c ca="left">
                     <p>77.3% InterPro</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>84% A-D</p>
                  </c>
                  <c ca="left">
                     <p>26.7% SCOP</p>
                  </c>
                  <c ca="left">
                     <p>14% Non-IEA</p>
                  </c>
                  <c ca="left">
                     <p>0.07% PDB</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>65% A-C</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>46% A-B</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>38% A</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <b>(b) Specific examples</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>Target</p>
                  </c>
                  <c ca="left">
                     <p>Other sources</p>
                  </c>
                  <c cspan="2" ca="center">
                     <p>PAT</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c cspan="2">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Results</p>
                  </c>
                  <c ca="left">
                     <p>Reliability</p>
                  </c>
               </r>
               <r>
                  <c cspan="5">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>AP2 domain (1gcc)</p>
                  </c>
                  <c ca="left">
                     <p>140 hits by BLAST against NR</p>
                  </c>
                  <c ca="left">
                     <p>155 hits</p>
                  </c>
                  <c ca="left">
                     <p>C (90% certainty) or above</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>15239082 (At5g11550.1)</p>
                  </c>
                  <c ca="left">
                     <p>No hits by PSI-BLAST</p>
                     <p>None from TAIR, PEDANT</p>
                  </c>
                  <c ca="left">
                     <p>1EE4</p>
                  </c>
                  <c ca="left">
                     <p>C</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>15228210 (At3g47660)</p>
                  </c>
                  <c ca="left">
                     <p>FYVE/PHD zinc finger</p>
                     <p>RCC1 like domain</p>
                     <p>Sugar transporter signature (PROSITE)</p>
                  </c>
                  <c ca="left">
                     <p>FYVE/PHD zinc finger;</p>
                     <p>RCC1 like domain;</p>
                     <p>PH domain</p>
                  </c>
                  <c ca="left">
                     <p>A (99.9% certainty);</p>
                     <p>B (99% certainty); </p>
                     <p>C</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Cytochrome P450</p>
                  </c>
                  <c ca="left">
                     <p>238 (TAIR GO)</p>
                  </c>
                  <c ca="left">
                     <p>249 hits</p>
                     <p>256 hits</p>
                  </c>
                  <c ca="left">
                     <p>C or above</p>
                     <p>D (50% certainty) or above</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Protein-kinase-like domain</p>
                  </c>
                  <c ca="left">
                     <p>1037 hits (PEDANT/MatDB) 951 hits (TAIR GO)</p>
                  </c>
                  <c ca="left">
                     <p>1,179 hits</p>
                  </c>
                  <c ca="left">
                     <p>C or above</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Alpha/beta hydrolase fold</p>
                  </c>
                  <c ca="left">
                     <p>
                        <it>Arabidopsis</it>
                     </p>
                  </c>
                  <c ca="left">
                     <p>194 hits (PEDANT/MatDB, SCOP 3.65)</p>
                  </c>
                  <c ca="left">
                     <p>340 hits</p>
                     <p>200 hits</p>
                  </c>
                  <c ca="left">
                     <p>C or above</p>
                     <p>A</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Human</p>
                  </c>
                  <c ca="left">
                     <p>69 hits (PEDANT/MatDB, <it>SCOP </it>c.69)</p>
                  </c>
                  <c ca="left">
                     <p>1,086 hits</p>
                     <p>1,18 hits</p>
                  </c>
                  <c ca="left">
                     <p>C or above</p>
                     <p>A</p>
                  </c>
               </r>
            </tblbdy>
            <tblfn>
               <p><b>(a)</b> Percent coverage against specific data sources. <b>(b)</b> The PDB sequence of 1gcc <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> was used to perform a standard BLAST search. The putative protein with GI number 15239082 (At5g11550.1) returns no hits using PSI-BLAST. The putative protein with GI number 15228210 (locus id At3g47660) contains a FYVE/PHD zinc finger domain and an RCC1-like domain (a regulator of chromosome condensation). TAIR also reported a sugar transporter signature for this protein from a PROSITE search. The term 'cytochrome P450' was used to search the TAIR GO annotation (release). This count was obtained using the search-by-keyword query feature after the TAIR GO data had been loaded into our database. The cytochrome P450 fold in the SCOP hierarchy was used to retrieve the hits from PAT. Actual hits may vary between releases.</p>
            </tblfn>
         </tbl>
         <p>With regard to iGAP specifically, we first looked at the overall coverage of the <it>Arabidopsis </it>proteome using iGAP (Figure <figr fid="F3">3</figr>). We were able to assign nearly 70% of the <it>Arabidopsis </it>proteome to folds which had a reliability index C (90% confidence) or better. This compares to 56% of <it>Arabidopsis </it>proteins in the NCBI nonredundant (NR) protein database having an assigned function. While fold assignment does not necessarily translate into functional assignment, it provides a useful indicator.</p>
         <fig id="F3">
            <title>
               <p>Figure 3</p>
            </title>
            <caption>
               <p>Classes of <it>Arabidopsis </it>proteome annotation</p>
            </caption>
            <text>
               <p>Classes of <it>Arabidopsis </it>proteome annotation. <b>(a) </b>The functional annotation on <it>Arabidopsis </it>proteins provided by the NCBI NR database. In this database, 36.4% of <it>Arabidopsis </it>proteins are reliably assigned on the basis of experimental evidence; 55.6% are annotated when automated annotation is included. This data is based on the 17 October 2001 release of NR. <b>(b) </b>Structural annotation provided by PAT. PAT has 69.3% coverage with a C reliability or better.</p>
            </text>
            <graphic file="gb-2003-4-8-r51-3"/>
         </fig>
         <p>Second, PAT provides annotations not reported by other databases. Some examples are listed in Table <tblr tid="T4">4</tblr>. For example, the AP2 domain is a DNA-binding domain of transcription factors that control flower and seed development <abbrgrp><abbr bid="B20">20</abbr></abbrgrp> in <it>Arabidopsis</it>. The structure of the AP2 domain is found in the PDB (1gcc) <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. A standard BLAST search using the 1gcc sequence returns 140 hits at <it>p </it>&lt; 0.1 (a very weak threshold). In PAT, there are 143 hits of A or B reliability (99% certainty or better) plus 12 of reliability C (between 90% and 99% certainty), for a total of 155 hits. Another putative protein (GI number 15228210, locus id At3g47660) has a previously undetected domain at the amino terminus that resembles the structure of the pleckstrin homology (PH) domain from phospholipase C delta (PDB 1mai) (C prediction). PH domains are commonly found in signaling proteins <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>. Additional domains found in this protein (also documented by TAIR as InterPro domains) include a FYVE/PHD zinc finger and an RCC1-like domain (a regulator of chromosome condensation), with A and B reliabilities, respectively. TAIR also reported a sugar transporter signature for this protein from PROSITE. While the exact function of the protein remains to be determined experimentally, the new finding of a putative PH domain could offer clues to its potential mechanism for signaling and intracellular targeting.</p>
         <tbl id="T4" hint_layout="double">
            <title>
               <p>Table 4</p>
            </title>
            <caption>
               <p>Sampling of known <it>Arabidopsis </it>protein structures in PAT</p>
            </caption>
            <tblbdy cols="9">
               <r>
                  <c ca="left">
                     <p>
                        <b>(a) PDB structures from <it>Arabidopsis </it>mapped to FOLDLIB entries</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>PDB ID</p>
                  </c>
                  <c ca="left">
                     <p>SCOP family</p>
                  </c>
                  <c ca="left">
                     <p>SCOP superfamily</p>
                  </c>
                  <c ca="left">
                     <p>GI number</p>
                  </c>
                  <c ca="left">
                     <p>Name</p>
                  </c>
                  <c ca="left">
                     <p>Domain found</p>
                  </c>
                  <c ca="left">
                     <p>Reliability</p>
                  </c>
                  <c ca="left">
                     <p>Number of unknown or putative proteins with similar domain : total number*</p>
                  </c>
               </r>
               <r>
                  <c cspan="9">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>1dj2</p>
                  </c>
                  <c ca="left">
                     <p>Nitrogenase iron protein-like</p>
                  </c>
                  <c ca="left">
                     <p>P-loop containing nucleotide triphosphate hydrolases</p>
                  </c>
                  <c ca="left">
                     <p>15230358</p>
                  </c>
                  <c ca="left">
                     <p>Adenylosuccinate synthetase</p>
                  </c>
                  <c ca="left">
                     <p>1dj2 (48-490)</p>
                  </c>
                  <c ca="left">
                     <p>A</p>
                  </c>
                  <c ca="left">
                     <p>1:2</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>1dcf</p>
                  </c>
                  <c ca="left">
                     <p>The receiver domain of the ethylene receptor</p>
                  </c>
                  <c ca="left">
                     <p>CheY-like</p>
                  </c>
                  <c ca="left">
                     <p>15219629</p>
                  </c>
                  <c ca="left">
                     <p>The receiver domain of the ethylene receptor</p>
                  </c>
                  <c ca="left">
                     <p>1dcf (605-736)</p>
                  </c>
                  <c ca="left">
                     <p>A</p>
                  </c>
                  <c ca="left">
                     <p>19:33</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>1jh7</p>
                  </c>
                  <c ca="left">
                     <p>Cyclic nucleotide phospho-diesterase</p>
                  </c>
                  <c ca="left">
                     <p>Cyclic nucleotide phospho-diesterase</p>
                  </c>
                  <c ca="left">
                     <p>15234068</p>
                  </c>
                  <c ca="left">
                     <p>Putative protein</p>
                  </c>
                  <c ca="left">
                     <p>1fsi (1-181)</p>
                  </c>
                  <c ca="left">
                     <p>A</p>
                  </c>
                  <c ca="left">
                     <p>2:2</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>2aak</p>
                  </c>
                  <c ca="left">
                     <p>Ubiquitin conjugating enzyme</p>
                  </c>
                  <c ca="left">
                     <p>Ubiquitin conjugating enzyme</p>
                  </c>
                  <c ca="left">
                     <p>15223746</p>
                  </c>
                  <c ca="left">
                     <p>Ubiquitin conjugating enzyme</p>
                  </c>
                  <c ca="left">
                     <p>1a3s (1-151)</p>
                  </c>
                  <c ca="left">
                     <p>A</p>
                  </c>
                  <c ca="left">
                     <p>6:12</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>1vok</p>
                  </c>
                  <c ca="left">
                     <p>TATA-box binding protein (TBP), carboxy-terminal domain</p>
                  </c>
                  <c ca="left">
                     <p>TATA-box binding protein-like</p>
                  </c>
                  <c ca="left">
                     <p>15231241</p>
                  </c>
                  <c ca="left">
                     <p>TATA sequence-binding protein 1</p>
                  </c>
                  <c ca="left">
                     <p>1ais (12-198)</p>
                  </c>
                  <c ca="left">
                     <p>A</p>
                  </c>
                  <c ca="left">
                     <p>0:2</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>3nul</p>
                  </c>
                  <c ca="left">
                     <p>Profilin (actin-binding protein)</p>
                  </c>
                  <c ca="left">
                     <p>Profilin (actin-binding protein)</p>
                  </c>
                  <c ca="left">
                     <p>15224838</p>
                  </c>
                  <c ca="left">
                     <p>Profilin 1</p>
                  </c>
                  <c ca="left">
                     <p>3nul (2-131)</p>
                  </c>
                  <c ca="left">
                     <p>A</p>
                  </c>
                  <c ca="left">
                     <p>0:4</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>1ibj</p>
                  </c>
                  <c ca="left">
                     <p>Cystathionine synthase-like</p>
                  </c>
                  <c ca="left">
                     <p>PLP-dependent transferases</p>
                  </c>
                  <c ca="left">
                     <p>15230203</p>
                  </c>
                  <c ca="left">
                     <p>Cystathionine beta-lyase precursor</p>
                  </c>
                  <c ca="left">
                     <p>1ibj (1-464)</p>
                  </c>
                  <c ca="left">
                     <p>A</p>
                  </c>
                  <c ca="left">
                     <p>41:54</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <b>(b) PDB structures not found in FOLDLIB</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>PDB ID</p>
                  </c>
                  <c ca="left">
                     <p>SCOP family</p>
                  </c>
                  <c ca="left">
                     <p>SCOP superfamily</p>
                  </c>
                  <c ca="left">
                     <p>GI number</p>
                  </c>
                  <c ca="left">
                     <p>Name</p>
                  </c>
                  <c ca="left">
                     <p>Domain found</p>
                  </c>
                  <c ca="left">
                     <p>Reliability</p>
                  </c>
                  <c ca="left">
                     <p>Method</p>
                  </c>
               </r>
               <r>
                  <c cspan="9">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>1gp4,6</p>
                  </c>
                  <c ca="left">
                     <p>Penicillin synthase-like</p>
                  </c>
                  <c ca="left">
                     <p>Clavaminate synthase-like</p>
                  </c>
                  <c ca="left">
                     <p>15235853</p>
                  </c>
                  <c ca="left">
                     <p>Putative leucoantho-cyanidin dioxygenase</p>
                  </c>
                  <c ca="left">
                     <p>1hjg (43-350)</p>
                  </c>
                  <c ca="left">
                     <p>A</p>
                  </c>
                  <c ca="left">
                     <p>123D</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>1e6b (88-220)</p>
                  </c>
                  <c ca="left">
                     <p>Glutathione <it>S</it>-transferases, carboxy-terminal domain</p>
                  </c>
                  <c ca="left">
                     <p>Pseudo SCOP entry by PAT (glutathione <it>S</it>-transferases, carboxy-terminal domain)</p>
                  </c>
                  <c ca="left">
                     <p>15226952</p>
                  </c>
                  <c ca="left">
                     <p>Putative glutathione <it>S</it>-transferase</p>
                  </c>
                  <c ca="left">
                     <p>1fw1 (89-193)</p>
                  </c>
                  <c ca="left">
                     <p>A</p>
                  </c>
                  <c ca="left">
                     <p>WU-BLAST</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Thioredoxin-like (glutathione <it>S</it>-transferases, carboxy-terminal domain)</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>1fw1 [1-218]</p>
                  </c>
                  <c ca="left">
                     <p>A</p>
                  </c>
                  <c ca="left">
                     <p>123D</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>1fw1 [11-215]</p>
                  </c>
                  <c ca="left">
                     <p>A</p>
                  </c>
                  <c ca="left">
                     <p>WU-BLAST</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>1e6b (8-87)</p>
                  </c>
                  <c ca="left">
                     <p>Thioredoxin-like</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>1fw1 (11-89)</p>
                  </c>
                  <c ca="left">
                     <p>A</p>
                  </c>
                  <c ca="left">
                     <p>WU-BLAST</p>
                  </c>
               </r>
            </tblbdy>
            <tblfn>
               <p><b>(a)</b> The known <it>Arabidopsis </it>PDB ids were obtained from the NCBI pdbaa FASTA file (9/1/02 release). Each PDB id was used as a query in the PAT id search field. The 'Domain found' column lists some of the domains found in the protein. Use the GI number to search the PAT web site to see all possible domain assignments. If multiple domain boundaries are specified, only the longest possible domain boundary is listed. *Non-NR entries were also excluded from the statistics collected in the last column of the table. Only predictions with C reliability (90% certainty) or higher are included. The non-NR entries (contributed by Ceres, Inc) were absent from NR of NCBI at the time of analysis. 1gp4, 1gp6, and 1e6b were not in SCOP release 1.55 or the FOLDLIB in this study (see Table <tblr tid="T1">1b</tblr>). 1j6y was an NMR structure and was excluded. <b>(b)</b> The sequences of the three structures not in the FOLDLIB were analyzed as unknown proteins. The assignment by SCOP release 1.59 is enclosed in parentheses. In the case of 1e6b, two distinct domains are classified by SCOP 1.59. The two regions are listed after the PDB id. In the case of 1gp4 or 1gp6, only 123D correctly produced an A prediction. In the case of 1e6b, the template is predicted correctly by both 123D and WU-BLAST, but WU-BLAST produced multiple domains, two of which coincide with the SCOP release 1.59 assignment.</p>
            </tblfn>
         </tbl>
         <p>Third, we surveyed a set of <it>Arabidopsis </it>proteins that have known protein structures (confidence level A, Table <tblr tid="T4">4a</tblr>). For most of these structures, PAT identifies a number of additional <it>Arabidopsis </it>proteins predicted to contain the same domain. For example, for the ubiquitin-conjugating enzyme, which is important in protein degradation, 12 proteins contain similar domains with 'C' or above confidence, and 6 of these are unknown or putative proteins. In contrast, no unknown or putative proteins were found to have TBP-like (TATA-binding protein-like) domains.</p>
         <p>Recent structures not found in FOLDLIB or SCOP (release 1.55) were examined to see how well they were predicted by iGAP (Table <tblr tid="T4">4b</tblr>). For PDB structures 1gp4 and 1gp6 (putative leucoanthocyanidin dioxygenase, NCBI NR database 17 October 2001 release), 123D correctly predicted the fold to be similar to 1hjg (clavaminate synthase-like SCOP superfamily). WU-BLAST gave only a number of low-probability (E reliability) predictions.</p>
         <p>Similarly, PDB entry 1e6b (putative glutathione-<it>S</it>-transferase, NCBI NR database 17 October 2001) is a protein with an amino-terminal thioredoxin-like domain and a contiguous glutathione-<it>S</it>-transferase carboxy-terminal domain. Both WU-BLAST and 123D correctly recognized the template structure 1fw1 (glutathione transferase z/maleylacetoacetate isomerase), and both predicted the whole protein to be thioredoxin-like with a reliability index of A. However, WU-BLAST made two additional predictions, both correct. The 'pseudo SCOP entry by PAT' is a novel domain parsed by PDP, which at the time was not in SCOP release 1.55. (It is classified as a separate domain in SCOP 1.59.) This domain was recognized by WU-BLAST. WU-BLAST also recognized the amino-terminal thioredoxin-like domain with correct boundaries.</p>
         <p>Finally, the SCOP classification of protein structures by fold (Figure <figr fid="F4">4a</figr>) and by family (Figure <figr fid="F4">4b</figr>) provides a convenient way to catalog the relative occurrences of structures in <it>A. thaliana</it>. With respect to folds, the membrane all-alpha fold, the alpha-alpha superhelix and the protein kinase-like (PK-like) fold ranked highest. The TIM barrel, Rossmann and seven-bladed beta-propeller folds are also among the top folds. PK-like proteins have the second highest occurrence at the superfamily level (data not shown). Not surprisingly, serine/threonine kinases and tyrosine kinases are among the most abundant families.</p>
         <fig id="F4">
            <title>
               <p>Figure 4</p>
            </title>
            <caption>
               <p>SCOP classifications for the <it>Arabidopsis thaliana </it>proteome</p>
            </caption>
            <text>
               <p>SCOP classifications for the <it>Arabidopsis thaliana </it>proteome. <b>(a) </b>Occurrences of SCOP folds. Folds belonging to the same SCOP class are shaded the same color. <b>(b) </b>Occurrences of SCOP families. Families belonging to the same fold are shaded the same color. Families belonging to the same fold but to different superfamilies are indicated by striped bars. The top 15 folds and families are shown. Data are based on SCOP release 1.59.</p>
            </text>
            <graphic file="gb-2003-4-8-r51-4"/>
         </fig>
      </sec>
      <sec>
         <st>
            <p>Conclusions</p>
         </st>
         <p>The PAT database was initially developed as a joint effort of academia and industry to serve the <it>Arabidopsis </it>and plant proteomics community by providing structural and functional assignments for all identified proteins in the <it>Arabidopsis </it>genome. The underlying technology, specifically iGAP and the associated reliability criteria, is well suited for application to other proteomes, and this processing is ongoing to provide a comparative proteomics resource. With more of a focus on comparative proteomics, the resource is being expanded in an effort we refer to as the Encyclopedia of Life (EOL). Details on EOL can be found at <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>.</p>
      </sec>
      <sec>
         <st>
            <p>Materials and methods</p>
         </st>
         <p>The iGAP components are shown in Figure <figr fid="F1">1</figr>, which illustrates how primary protein sequence and structure data are processed by the system. Details are given below.</p>
         <sec>
            <st>
               <p>Software and availability</p>
            </st>
            <p>The software components of iGAP have been tested on Red Hat Linux 7.2, Sun Solaris 5.8 and the IBM AIX operating systems. The software is currently being ported to the Teragrid platform <abbrgrp><abbr bid="B24">24</abbr></abbrgrp> for high-performance distributed computing. Access is via an Apache web server (1.3.25) and an Oracle 9.2.0 database at the San Diego Supercomputer Center, where high uptime is maintained. A new interface based on Java 2 Enterprise Edition (J2EE) and the Struts framework is under development.</p>
            <p>The iGAP software components developed at the University of California San Diego (UCSD) are available free for academic use by contacting the authors as part of the University of California Copyright Agreement. For-profit organizations need to contact the UCSD Technology Transfer Office. Separate licenses may be required for non-UCSD components. The key components and steps are described below, with additional details available from the Web <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>.</p>
            <sec>
               <st>
                  <p>FOLDLIB</p>
               </st>
               <p>SCOP domain sequences filtered at 90% identity <abbrgrp><abbr bid="B26">26</abbr></abbrgrp> are downloaded from the Astral database <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>. PDB chains are clustered at 90% identity and parsed with PDP <abbrgrp><abbr bid="B15">15</abbr></abbrgrp> to provide additional domains, including those not yet assigned by SCOP. SCOP lags behind the PDB in terms of structures processed. The sequences from SCOP, PDB, and PDP are then clustered at 90% identity to define the final structure-template library. Profile libraries for these templates are generated for use by 123D using PSI-BLAST with a default E-value of 1e-6 and three iterations.</p>
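               <p>The idea of clustering templates at 90% identity can be illustrated with a minimal sketch. It is not the procedure used to build FOLDLIB: the greedy strategy, the function names and the ungapped position-by-position identity measure are all assumptions made for illustration, and a real pipeline would use alignment-based identity from a dedicated clustering tool.</p>

```python
def naive_identity(a, b):
    """Fraction of identical positions over the shorter sequence.

    Stand-in only: no alignment is performed (an assumption for this sketch).
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / n


def greedy_cluster(seqs, threshold=0.90, identity=naive_identity):
    """Greedy clustering: longest sequences first, each joins the first
    representative it matches at or above `threshold`, else founds a cluster.

    seqs: dict mapping sequence id to sequence string.
    Returns a dict mapping representative id to the list of member ids.
    """
    clusters = {}
    order = sorted(seqs, key=lambda s: len(seqs[s]), reverse=True)
    for sid in order:
        for rep in clusters:
            if identity(seqs[sid], seqs[rep]) >= threshold:
                clusters[rep].append(sid)
                break
        else:
            clusters[sid] = [sid]
    return clusters


# Hypothetical toy sequences, not taken from SCOP, PDB or PDP.
toy = {"t1": "ACDEFGHIKL", "t2": "ACDEFGHIKV", "t3": "WWWWWWWWWW"}
print(greedy_cluster(toy))
```

               <p>In this toy input, t2 differs from t1 at one of ten positions and therefore joins the cluster represented by t1, while t3 founds its own cluster.</p>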
            </sec>
            <sec>
               <st>
                  <p>The pipeline</p>
               </st>
               <p>The first step of the pipeline uses a set of filter programs to determine the low-complexity regions as well as transmembrane regions, signal-peptide sequences, and coiled coils in a particular proteome. The programs used include SEG <abbrgrp><abbr bid="B28">28</abbr></abbrgrp> for low-complexity regions, COILS <abbrgrp><abbr bid="B29">29</abbr></abbrgrp> for coiled coils, TMHMM <abbrgrp><abbr bid="B30">30</abbr></abbrgrp> for transmembrane regions, PSORT <abbrgrp><abbr bid="B31">31</abbr></abbrgrp> for subcellular location and SignalP <abbrgrp><abbr bid="B32">32</abbr></abbrgrp> for signal peptides.</p>
               <p>The second step determines sequence similarity hits by pairwise sequence comparison using WU-BLAST (W. Gish, personal communication). WU-BLAST is used because it is fast and performed best in our benchmark studies. The default E-value used is 1e-5. The third step generates PSI-BLAST profiles for each input protein sequence against the FOLDLIB sequences. The default E-value used is 1e-6, with three iterations for profile generation. In the fourth step, the program 123D is used to provide additional mapping to FOLDLIB using fold recognition <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>. 123D has been used successfully in CASP <abbrgrp><abbr bid="B33">33</abbr></abbrgrp> competitions.</p>
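               <p>As a rough illustration of how an E-value cutoff such as the 1e-5 default of the second step acts on a list of hits, the sketch below filters already-parsed hits. The record layout, the names and the example values are hypothetical; the sketch does not call WU-BLAST or parse its output.</p>

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One query-to-template hit (illustrative layout, not a BLAST format)."""
    query: str
    template: str
    evalue: float


WU_BLAST_EVALUE = 1e-5  # default cutoff stated in the text


def significant_hits(hits, max_evalue=WU_BLAST_EVALUE):
    """Keep hits whose E-value does not exceed the cutoff, best hit first."""
    kept = [h for h in hits if max_evalue >= h.evalue]
    return sorted(kept, key=lambda h: h.evalue)


# Hypothetical hits for a single query sequence.
example_hits = [
    Hit("q1", "t_c", 0.01),
    Hit("q1", "t_a", 1e-30),
    Hit("q1", "t_b", 1e-5),
]
for hit in significant_hits(example_hits):
    print(hit.template, hit.evalue)
```

               <p>Passing a different <it>max_evalue</it>, for example 1e-6, would apply the same filter at the cutoff quoted for profile generation.</p>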
            </sec>
            <sec>
               <st>
                  <p>Reliability index</p>
               </st>
               <p>The reliability of a prediction is calculated on the basis of a novel benchmarking procedure against SCOP and will be described elsewhere. The index is expressed as percent certainty that a particular prediction is correct: A = 99.9% certainty, B = 99% certainty, C = 90% certainty, D = 50% certainty, and E = 10% certainty.</p>
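               <p>The mapping from reliability index to certainty, and a cutoff such as 'C or above', can be expressed compactly. The sketch below is illustrative only: the certainty values are those listed above, while the function names and the example records are assumptions and do not reflect how PAT stores or filters predictions.</p>

```python
# Certainty (percent) for each reliability index, as listed in the text.
RELIABILITY_CERTAINTY = {"A": 99.9, "B": 99.0, "C": 90.0, "D": 50.0, "E": 10.0}


def at_or_above(index, cutoff="C"):
    """Return True if `index` is at least as reliable as `cutoff`."""
    return RELIABILITY_CERTAINTY[index] >= RELIABILITY_CERTAINTY[cutoff]


def filter_predictions(predictions, cutoff="C"):
    """Keep (domain, index) pairs whose index meets the cutoff."""
    return [(dom, idx) for dom, idx in predictions if at_or_above(idx, cutoff)]


# Example records; the first three follow the domains reported for
# At3g47660, the last one is a made-up weak hit.
example = [
    ("FYVE/PHD zinc finger", "A"),
    ("RCC1-like domain", "B"),
    ("PH domain", "C"),
    ("hypothetical weak hit", "E"),
]
print(filter_predictions(example))
```

               <p>With the default cutoff of C, the weak E prediction is dropped; passing a cutoff of E would keep all four records.</p>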
            </sec>
         </sec>
         <sec>
            <st>
               <p>Database and user interface</p>
            </st>
            <p>Data provided by iGAP are stored in an Oracle 9i (release 2) relational database system. The database is connected to the web using Apache mod_perl and the Perl DBI. External data sources include SCOP, NR, PFAM, NCBI taxonomy, LocusLink <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>, SwissProt <abbrgrp><abbr bid="B35">35</abbr></abbrgrp> and InterPro <abbrgrp><abbr bid="B36">36</abbr></abbrgrp>.</p>
            <p>Chromosomal position information for the <it>Arabidopsis </it>data was obtained from the TIGR <it>Arabidopsis thaliana </it>database <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>. The physical and chemical properties are calculated using the EMBOSS pepstats program <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>. The Gene Ontology assignment for <it>Arabidopsis </it>was obtained from The <it>Arabidopsis </it>Information Resource (TAIR) <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>. We have also developed our own methodology for assigning additional GO terms with a measure of likelihood (W Krebs and P.E.B., unpublished work) beyond those assigned by SwissProt.</p>
            <p>By default, only predictions with a reliability index of C or above are shown. The reliability index threshold for all queries may be changed using a pull-down menu. The key characteristics of the Web interface that we have developed include the following (Figure <figr fid="F2">2</figr>).</p>
            <sec>
               <st>
                  <p>SCOP browser</p>
               </st>
               <p>The use of SCOP classifications provides a hierarchical view of the data from a structure perspective. For example, the user may start with the all-alpha class and drill down through the fold, superfamily, family, and domain levels. Alternatively, the structure classification can be searched for terms such as "Rossmann fold" present in SCOP annotation.</p>
            </sec>
            <sec>
               <st>
                  <p>FOLDLIB browser</p>
               </st>
               <p>The classification of protein folds in the fold library can be browsed. Alternatively, it can be searched by PDB id or sequence.</p>
            </sec>
            <sec>
               <st>
                  <p>Search by identifier</p>
               </st>
               <p>The database may be searched using identifiers from a number of existing databases such as SCOP, PFAM (ID or Accession Number), NCBI (GI number), PDB identifier, Locus identifier, Gene Ontology (GO) term <abbrgrp><abbr bid="B40">40</abbr></abbrgrp>, or FOLDLIB identifier.</p>
            </sec>
            <sec>
               <st>
                  <p>Search by keywords</p>
               </st>
               <p>Descriptions from NR, PFAM, PDB, FOLDLIB, SCOP and GO are parsed and indexed. The text index supports complex searches and wildcard searches. No attempt is made to reconcile nomenclature differences introduced by each individual data source.</p>
            </sec>
            <sec>
               <st>
                  <p>Domain summary</p>
               </st>
               <p>This provides preliminary information on a particular domain, identified by its FOLDLIB id. The protein domain sequence is displayed and its structure may be viewed using a Chime (MDL, San Leandro, CA) plug-in <abbrgrp><abbr bid="B41">41</abbr></abbrgrp>. All sequences that contain the same domain are displayed. For each sequence, a link provides the specific target-template alignment and a graphic representation of the domain architecture. It also links to the template-based models described below.</p>
            </sec>
            <sec>
               <st>
                  <p>Gene summary</p>
               </st>
               <p>This provides preliminary information on all the domains located within a particular gene, including domain boundary information. Each domain may subsequently be interrogated with the SCOP browser to provide superfamily, family and fold level information. The protein summary page provides comprehensive information about the protein in addition to domain assignment.</p>
            </sec>
            <sec>
               <st>
                  <p>Template-based models</p>
               </st>
               <p>From the target-template alignment, 3D coordinates from the FOLDLIB template are used to construct a C-alpha-only PDB-format file using the sequence of the target protein. The resulting PDB file may then be visualized using QuickPDB, a Java applet developed by I.N.S. and P.E.B. (unpublished), or with other popular 3D viewers such as the Chime viewer plug-in.</p>
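               <p>The coordinate transfer described here can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code used to build the models: the function name ca_model, the use of '-' as the gap character, chain identifier A, numbering by target residue position, and the fixed occupancy and B-factor values are all hypothetical choices. Target residues aligned to a gap in the template receive no coordinates.</p>

```python
# Three-letter residue names for the 20 standard amino acids.
THREE_LETTER = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}

# Fixed-column ATOM record: serial, residue name, residue number, x, y, z,
# occupancy, B-factor, element. Chain "A" is an assumption.
ATOM_FORMAT = ("ATOM  %5d  CA  %3s A%4d    %8.3f%8.3f%8.3f%6.2f%6.2f"
               + " " * 11 + "C")


def ca_model(target_aln, template_aln, template_ca):
    """Build C-alpha-only PDB ATOM records for the target sequence.

    target_aln, template_aln: aligned sequences of equal length, '-' for gaps.
    template_ca: (x, y, z) tuples, one per template residue, in order.
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    lines = []
    target_pos = 0    # number of target residues seen so far
    template_pos = 0  # number of template residues seen so far
    for t_res, m_res in zip(target_aln, template_aln):
        if t_res != "-":
            target_pos += 1
        if m_res != "-":
            template_pos += 1
        if t_res == "-" or m_res == "-":
            continue  # no coordinates can be transferred at a gap
        x, y, z = template_ca[template_pos - 1]
        lines.append(ATOM_FORMAT % (
            len(lines) + 1, THREE_LETTER.get(t_res, "UNK"), target_pos,
            x, y, z, 1.0, 0.0))
    lines.append("END")
    return "\n".join(lines)
```

               <p>For example, calling ca_model("AC-G", "A-TG", coords) with three template coordinates yields two ATOM records, for target residues 1 and 3, followed by END; the unaligned target residue 2 is omitted.</p>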
            </sec>
         </sec>
         <sec>
            <st>
               <p>Availability and update</p>
            </st>
            <p>The data are available from the Web <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>. Information may be downloaded in text or XML format and imported into an Excel spreadsheet, MySQL database or other applications. For advanced users, the data may be retrieved using SQL from the Web interface. A database schema is available on the SQL search page as an aid in SQL query formulation.</p>
            <p>A workflow management system is under development to automate the processing and update of proteomes. All external data are updated when a major release of NR becomes available. The NR database is downloaded from NCBI. Sequences from other sequencing centers are clustered at 100% identity using cd-hit <abbrgrp><abbr bid="B42">42</abbr></abbrgrp>. Subsequent updates are performed monthly using the NCBI NR Month database. The unique sequences are sorted according to taxonomy using the NCBI gi_taxonomy mapping table. Only sequences that are new or changed (as determined by crc64 checksum) are submitted to a continuous update process. The release date for each source database used is given on the home page. The <it>Arabidopsis </it>proteome (27,242 total and 27,089 unique sequences, 7 September 2002 release) may be computed in approximately 50,000 computer hours.</p>
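            <p>The checksum-based selection of new or changed sequences might look like the following Python sketch. It is an assumption-laden illustration, not the update pipeline itself: the text does not state which CRC-64 variant is used, so a common reflected CRC-64 with the ISO 3309 polynomial is implemented here, and the function names and the dictionaries of identifiers are hypothetical.</p>

```python
def _crc64_table():
    """Precompute the lookup table for a reflected CRC-64 (ISO 3309 polynomial)."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            if crc % 2:
                crc = (crc >> 1) ^ 0xD800000000000000
            else:
                crc >>= 1
        table.append(crc)
    return table


_TABLE = _crc64_table()


def crc64(sequence):
    """Return the CRC-64 of a protein sequence as 16 hexadecimal digits."""
    crc = 0
    for byte in sequence.encode("ascii"):
        # Index the table with the low byte of (crc XOR byte).
        crc = _TABLE[(crc ^ byte) % 256] ^ (crc >> 8)
    return "%016X" % crc


def sequences_to_update(current, stored_checksums):
    """Select identifiers that are new or whose checksum has changed.

    current: dict mapping identifier to sequence in the new release.
    stored_checksums: dict mapping identifier to checksum from the last run.
    """
    return sorted(
        ident for ident, seq in current.items()
        if stored_checksums.get(ident) != crc64(seq)
    )
```

            <p>Storing only the checksum from the previous run lets each sequence in a new release be compared without keeping the earlier sequence itself; an identifier with no stored checksum is treated as new.</p>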
         </sec>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>This work is supported by the National Partnership for Advanced Computational Infrastructure (NPACI), funded by the National Science Foundation (NSF) grant ASC 9619020, and by the National Institutes of Health (NIH) grant GM63208-01A1S1. The authors wish to thank the many biologists who provided feedback on the development of the database and interface, the authors of the external software components, and Robert Byrnes for reviewing the manuscript.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.</p>
            </title>
            <aug>
               <au>
                  <snm>Altschul</snm>
                  <fnm>SF</fnm>
               </au>
               <au>
                  <snm>Madden</snm>
                  <fnm>TL</fnm>
               </au>
               <au>
                  <snm>Schaffer</snm>
                  <fnm>AA</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Lipman</snm>
                  <fnm>DJ</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1997</pubdate>
            <volume>25</volume>
            <fpage>3389</fpage>
            <lpage>3402</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/nar/25.17.3389</pubid>
                  <pubid idtype="pmpid" link="fulltext">9254694</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Profile hidden Markov models.</p>
            </title>
            <aug>
               <au>
                  <snm>Eddy</snm>
                  <fnm>SR</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>1998</pubdate>
            <volume>14</volume>
            <fpage>755</fpage>
            <lpage>763</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/14.9.755</pubid>
                  <pubid idtype="pmpid" link="fulltext">9918945</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>The PROSITE database, its status in 2002.</p>
            </title>
            <aug>
               <au>
                  <snm>Falquet</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Pagni</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Bucher</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Hulo</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Sigrist</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>Hofmann</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Bairoch</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <fpage>235</fpage>
            <lpage>238</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/nar/30.1.235</pubid>
                  <pubid idtype="pmpid" link="fulltext">11752303</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>The Blocks database - a system for protein classification.</p>
            </title>
            <aug>
               <au>
                  <snm>Pietrokovski</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Henikoff</snm>
                  <fnm>JG</fnm>
               </au>
               <au>
                  <snm>Henikoff</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1996</pubdate>
            <volume>24</volume>
            <fpage>197</fpage>
            <lpage>200</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/nar/24.1.197</pubid>
                  <pubid idtype="pmpid" link="fulltext">8594578</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>The Pfam protein families database.</p>
            </title>
            <aug>
               <au>
                  <snm>Bateman</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Birney</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Cerruti</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Durbin</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Etwiller</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Eddy</snm>
                  <fnm>SR</fnm>
               </au>
               <au>
                  <snm>Griffiths-Jones</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Howe</snm>
                  <fnm>KL</fnm>
               </au>
               <au>
                  <snm>Marshall</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Sonnhammer</snm>
                  <fnm>EL</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <fpage>276</fpage>
            <lpage>280</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/nar/30.1.276</pubid>
                  <pubid idtype="pmpid" link="fulltext">11752314</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Apoptotic molecular machinery: vastly increased complexity in vertebrates revealed by genome comparisons.</p>
            </title>
            <aug>
               <au>
                  <snm>Aravind</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Dixit</snm>
                  <fnm>VM</fnm>
               </au>
               <au>
                  <snm>Koonin</snm>
                  <fnm>EV</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>2001</pubdate>
            <volume>291</volume>
            <fpage>1279</fpage>
            <lpage>1284</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.291.5507.1279</pubid>
                  <pubid idtype="pmpid" link="fulltext">11181990</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>The Protein Information Resource: an integrated public resource of functional annotation of proteins.</p>
            </title>
            <aug>
               <au>
                  <snm>Wu</snm>
                  <fnm>CH</fnm>
               </au>
               <au>
                  <snm>Huang</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Arminski</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Castro-Alvear</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Hu</snm>
                  <fnm>ZZ</fnm>
               </au>
               <au>
                  <snm>Ledley</snm>
                  <fnm>RS</fnm>
               </au>
               <au>
                  <snm>Lewis</snm>
                  <fnm>KC</fnm>
               </au>
               <au>
                  <snm>Mewes</snm>
                  <fnm>HW</fnm>
               </au>
               <au>
                  <snm>Orcutt</snm>
                  <fnm>BC</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <fpage>35</fpage>
            <lpage>37</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/nar/30.1.35</pubid>
                  <pubid idtype="pmpid" link="fulltext">11752247</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>The GeneQuiz web server: protein functional analysis through the Web.</p>
            </title>
            <aug>
               <au>
                  <snm>Hoersch</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Leroy</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>NP</fnm>
               </au>
               <au>
                  <snm>Andrade</snm>
                  <fnm>MA</fnm>
               </au>
               <au>
                  <snm>Sander</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Trends Biochem Sci</source>
            <pubdate>2000</pubdate>
            <volume>25</volume>
            <fpage>33</fpage>
            <lpage>35</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0968-0004(99)01510-8</pubid>
                  <pubid idtype="pmpid" link="fulltext">10637611</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Functional and structural genomics using PEDANT.</p>
            </title>
            <aug>
               <au>
                  <snm>Frishman</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Albermann</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Hani</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Heumann</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Metanomski</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Zollner</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Mewes</snm>
                  <fnm>HW</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2001</pubdate>
            <volume>17</volume>
            <fpage>44</fpage>
            <lpage>57</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/17.1.44</pubid>
                  <pubid idtype="pmpid" link="fulltext">11222261</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments.</p>
            </title>
            <aug>
               <au>
                  <snm>Gough</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Chothia</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <fpage>268</fpage>
            <lpage>272</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/nar/30.1.268</pubid>
                  <pubid idtype="pmpid" link="fulltext">11752312</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Recent improvements to the SMART domain-based sequence annotation resource.</p>
            </title>
            <aug>
               <au>
                  <snm>Letunic</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Goodstadt</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Dickens</snm>
                  <fnm>NJ</fnm>
               </au>
               <au>
                  <snm>Doerks</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Schultz</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Mott</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Ciccarelli</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Copley</snm>
                  <fnm>RR</fnm>
               </au>
               <au>
                  <snm>Ponting</snm>
                  <fnm>CP</fnm>
               </au>
               <au>
                  <snm>Bork</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <fpage>242</fpage>
            <lpage>244</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/nar/30.1.242</pubid>
                  <pubid idtype="pmpid" link="fulltext">11752305</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>3D-Genomics</p>
            </title>
            <url>http://www.sbg.bio.ic.ac.uk/3dgenomics</url>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Enhanced genome annotation using structural profiles in the program 3D-PSSM.</p>
            </title>
            <aug>
               <au>
                  <snm>Kelley</snm>
                  <fnm>LA</fnm>
               </au>
               <au>
                  <snm>MacCallum</snm>
                  <fnm>RM</fnm>
               </au>
               <au>
                  <snm>Sternberg</snm>
                  <fnm>MJ</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>2000</pubdate>
            <volume>299</volume>
            <fpage>499</fpage>
            <lpage>520</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.2000.3741</pubid>
                  <pubid idtype="pmpid" link="fulltext">10860755</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Gene3D: structural assignment for whole genes and genomes using the CATH domain structure database.</p>
            </title>
            <aug>
               <au>
                  <snm>Buchan</snm>
                  <fnm>DW</fnm>
               </au>
               <au>
                  <snm>Shepherd</snm>
                  <fnm>AJ</fnm>
               </au>
               <au>
                  <snm>Lee</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Pearl</snm>
                  <fnm>FM</fnm>
               </au>
               <au>
                  <snm>Rison</snm>
                  <fnm>SC</fnm>
               </au>
               <au>
                  <snm>Thornton</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Orengo</snm>
                  <fnm>CA</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2002</pubdate>
            <volume>12</volume>
            <fpage>503</fpage>
            <lpage>514</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1101/gr.213802</pubid>
                  <pubid idtype="pmpid" link="fulltext">11875040</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>PDP: protein domain parser.</p>
            </title>
            <aug>
               <au>
                  <snm>Alexandrov</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Shindyalov</snm>
                  <fnm>I</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>19</volume>
            <fpage>429</fpage>
            <lpage>430</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btg006</pubid>
                  <pubid idtype="pmpid" link="fulltext">12584135</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>A database and tools for 3-D protein structure comparison and alignment using the Combinatorial Extension (CE) algorithm.</p>
            </title>
            <aug>
               <au>
                  <snm>Shindyalov</snm>
                  <fnm>IN</fnm>
               </au>
               <au>
                  <snm>Bourne</snm>
                  <fnm>PE</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2001</pubdate>
            <volume>29</volume>
            <fpage>228</fpage>
            <lpage>229</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/nar/29.1.228</pubid>
                  <pubid idtype="pmpid" link="fulltext">11125099</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Analysis of topological and nontopological structural similarities in the PDB: new examples with old structures.</p>
            </title>
            <aug>
               <au>
                  <snm>Alexandrov</snm>
                  <fnm>NN</fnm>
               </au>
               <au>
                  <snm>Fischer</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>1996</pubdate>
            <volume>25</volume>
            <fpage>354</fpage>
            <lpage>365</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/(SICI)1097-0134(199607)25:3&lt;354::AID-PROT7>3.3.CO;2-W</pubid>
                  <pubid idtype="pmpid">8844870</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Alignment algorithm for homology modeling and threading.</p>
            </title>
            <aug>
               <au>
                  <snm>Alexandrov</snm>
                  <fnm>NN</fnm>
               </au>
               <au>
                  <snm>Luethy</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Protein Sci</source>
            <pubdate>1998</pubdate>
            <volume>7</volume>
            <fpage>254</fpage>
            <lpage>258</lpage>
            <xrefbib>
               <pubid idtype="pmpid">9521100</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>NCBI Genomic Biology</p>
            </title>
            <url>http://www.ncbi.nih.gov/Genomes</url>
         </bibl>
         <bibl id="B20">
            <title>
               <p>The AP2 domain of APETALA2 defines a large new family of DNA binding proteins in <it>Arabidopsis</it>.</p>
            </title>
            <aug>
               <au>
                  <snm>Okamuro</snm>
                  <fnm>JK</fnm>
               </au>
               <au>
                  <snm>Caster</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Villarroel</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Van Montagu</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Jofuku</snm>
                  <fnm>KD</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>1997</pubdate>
            <volume>94</volume>
            <fpage>7076</fpage>
            <lpage>7081</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1073/pnas.94.13.7076</pubid>
                  <pubid idtype="pmpid" link="fulltext">9192694</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>A novel mode of DNA recognition by a beta-sheet revealed by the solution structure of the GCC-box binding domain in complex with DNA.</p>
            </title>
            <aug>
               <au>
                  <snm>Allen</snm>
                  <fnm>MD</fnm>
               </au>
               <au>
                  <snm>Yamasaki</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Ohme-Takagi</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Tateno</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Suzuki</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>EMBO J</source>
            <pubdate>1998</pubdate>
            <volume>17</volume>
            <fpage>5484</fpage>
            <lpage>5496</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/emboj/17.18.5484</pubid>
                  <pubid idtype="pmpid" link="fulltext">9736626</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>A putative modular domain present in diverse signaling proteins.</p>
            </title>
            <aug>
               <au>
                  <snm>Mayer</snm>
                  <fnm>BJ</fnm>
               </au>
               <au>
                  <snm>Ren</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Clark</snm>
                  <fnm>KL</fnm>
               </au>
               <au>
                  <snm>Baltimore</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Cell</source>
            <pubdate>1993</pubdate>
            <volume>73</volume>
            <fpage>629</fpage>
            <lpage>630</lpage>
            <xrefbib>
               <pubid idtype="pmpid">8500161</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>The Encyclopedia of Life Project</p>
            </title>
            <url>http://eol.sdsc.edu</url>
         </bibl>
         <bibl id="B24">
            <title>
               <p>TeraGrid</p>
            </title>
            <url>http://www.teragrid.org</url>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Proteins of <it>Arabidopsis thaliana </it>(PAT) Database</p>
            </title>
            <url>http://pat.sdsc.edu</url>
         </bibl>
         <bibl id="B26">
            <title>
               <p>SCOP database in 2002: refinements accommodate structural genomics.</p>
            </title>
            <aug>
               <au>
                  <snm>Lo Conte</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Brenner</snm>
                  <fnm>SE</fnm>
               </au>
               <au>
                  <snm>Hubbard</snm>
                  <fnm>TJ</fnm>
               </au>
               <au>
                  <snm>Chothia</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Murzin</snm>
                  <fnm>AG</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <fpage>264</fpage>
            <lpage>267</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/nar/30.1.264</pubid>
                  <pubid idtype="pmpid" link="fulltext">11752311</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>ASTRAL compendium enhancements.</p>
            </title>
            <aug>
               <au>
                  <snm>Chandonia</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Walker</snm>
                  <fnm>NS</fnm>
               </au>
               <au>
                  <snm>Lo Conte</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Koehl</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Levitt</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Brenner</snm>
                  <fnm>SE</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <fpage>260</fpage>
            <lpage>263</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/nar/30.1.260</pubid>
                  <pubid idtype="pmpid" link="fulltext">11752310</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Analysis of compositionally biased regions in sequence databases.</p>
            </title>
            <aug>
               <au>
                  <snm>Wootton</snm>
                  <fnm>JC</fnm>
               </au>
               <au>
                  <snm>Federhen</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Methods Enzymol</source>
            <pubdate>1996</pubdate>
            <volume>266</volume>
            <fpage>554</fpage>
            <lpage>571</lpage>
            <xrefbib>
               <pubid idtype="pmpid">8743706</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Predicting coiled coils from protein sequences.</p>
            </title>
            <aug>
               <au>
                  <snm>Lupas</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Van Dyke</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Stock</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>1991</pubdate>
            <volume>252</volume>
            <fpage>1162</fpage>
            <lpage>1164</lpage>
            <xrefbib>
               <pubid idtype="pmpid">2031185</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <title>
               <p>A hidden Markov model for predicting transmembrane helices in protein sequences.</p>
            </title>
            <aug>
               <au>
                  <snm>Sonnhammer</snm>
                  <fnm>EL</fnm>
               </au>
               <au>
                  <snm>von Heijne</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Krogh</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Proc Int Conf Intell Syst Mol Biol</source>
            <pubdate>1998</pubdate>
            <volume>6</volume>
            <fpage>175</fpage>
            <lpage>182</lpage>
            <xrefbib>
               <pubid idtype="pmpid">9783223</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B31">
            <title>
               <p>PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization.</p>
            </title>
            <aug>
               <au>
                  <snm>Nakai</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Horton</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Trends Biochem Sci</source>
            <pubdate>1999</pubdate>
            <volume>24</volume>
            <fpage>34</fpage>
            <lpage>36</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0968-0004(98)01336-X</pubid>
                  <pubid idtype="pmpid" link="fulltext">10087920</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>A neural network method for identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites.</p>
            </title>
            <aug>
               <au>
                  <snm>Nielsen</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Engelbrecht</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Brunak</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>von Heijne</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Int J Neural Syst</source>
            <pubdate>1997</pubdate>
            <volume>8</volume>
            <fpage>581</fpage>
            <lpage>599</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1142/S0129065797000537</pubid>
                  <pubid idtype="pmpid">10065837</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B33">
            <title>
               <p>Critical assessment of methods of protein structure prediction (CASP): round IV.</p>
            </title>
            <aug>
               <au>
                  <snm>Moult</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Fidelis</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Zemla</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Hubbard</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>2001</pubdate>
            <volume>Suppl 5</volume>
            <fpage>2</fpage>
            <lpage>7</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/prot.10054</pubid>
                  <pubid idtype="pmpid" link="fulltext">11835476</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B34">
            <title>
               <p>RefSeq and LocusLink: NCBI gene-centered resources.</p>
            </title>
            <aug>
               <au>
                  <snm>Pruitt</snm>
                  <fnm>KD</fnm>
               </au>
               <au>
                  <snm>Maglott</snm>
                  <fnm>DR</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2001</pubdate>
            <volume>29</volume>
            <fpage>137</fpage>
            <lpage>140</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/nar/29.1.137</pubid>
                  <pubid idtype="pmpid" link="fulltext">11125071</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B35">
            <title>
               <p>The SWISS-PROT protein sequence data bank and its supplement TrEMBL.</p>
            </title>
            <aug>
               <au>
                  <snm>Bairoch</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Apweiler</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1997</pubdate>
            <volume>25</volume>
            <fpage>31</fpage>
            <lpage>36</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/nar/25.1.31</pubid>
                  <pubid idtype="pmpid" link="fulltext">9016499</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B36">
            <title>
               <p>The InterPro database, an integrated documentation resource for protein families, domains and functional sites.</p>
            </title>
            <aug>
               <au>
                  <snm>Apweiler</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Attwood</snm>
                  <fnm>TK</fnm>
               </au>
               <au>
                  <snm>Bairoch</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Bateman</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Birney</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Biswas</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Bucher</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Cerutti</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Corpet</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Croning</snm>
                  <fnm>MD</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2001</pubdate>
            <volume>29</volume>
            <fpage>37</fpage>
            <lpage>40</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/nar/29.1.37</pubid>
                  <pubid idtype="pmpid" link="fulltext">11125043</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B37">
            <title>
               <p>The Institute for Genomic Research</p>
            </title>
            <url>http://www.tigr.org</url>
         </bibl>
         <bibl id="B38">
            <title>
               <p>EMBOSS: The European Molecular Biology Open Software Suite</p>
            </title>
            <url>http://www.hgmp.mrc.ac.uk/Software/EMBOSS/</url>
         </bibl>
         <bibl id="B39">
            <title>
               <p>TAIR: The Arabidopsis Information Resource</p>
            </title>
            <url>http://www.arabidopsis.org</url>
         </bibl>
         <bibl id="B40">
            <title>
               <p>Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.</p>
            </title>
            <aug>
               <au>
                  <snm>Ashburner</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Ball</snm>
                  <fnm>CA</fnm>
               </au>
               <au>
                  <snm>Blake</snm>
                  <fnm>JA</fnm>
               </au>
               <au>
                  <snm>Botstein</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Butler</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Cherry</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Davis</snm>
                  <fnm>AP</fnm>
               </au>
               <au>
                  <snm>Dolinski</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Dwight</snm>
                  <fnm>SS</fnm>
               </au>
               <au>
                  <snm>Eppig</snm>
                  <fnm>JT</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nat Genet</source>
            <pubdate>2000</pubdate>
            <volume>25</volume>
            <fpage>25</fpage>
            <lpage>29</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/75556</pubid>
                  <pubid idtype="pmpid" link="fulltext">10802651</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B41">
            <title>
               <p>The MDL Chime Site</p>
            </title>
            <url>http://www.mdl.com/chime</url>
         </bibl>
         <bibl id="B42">
            <title>
               <p>Clustering of highly homologous sequences to reduce the size of large protein databases.</p>
            </title>
            <aug>
               <au>
                  <snm>Li</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Jaroszewski</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Godzik</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2001</pubdate>
            <volume>17</volume>
            <fpage>282</fpage>
            <lpage>283</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/17.3.282</pubid>
                  <pubid idtype="pmpid" link="fulltext">11294794</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B43">
            <title>
               <p>EBI Proteome Analysis Database</p>
            </title>
            <url>http://www.ebi.ac.uk/proteome</url>
         </bibl>
      </refgrp>
   </bm>
</art>
