<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-7-262</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Research article</dochead>
      <bibl>
         <title>
            <p>Predicting DNA-binding sites of proteins from amino acid sequence</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Yan</snm>
               <fnm>Changhui</fnm>
               <insr iid="I1"/>
               <email>cyan@cc.usu.edu</email>
            </au>
            <au id="A2">
               <snm>Terribilini</snm>
               <fnm>Michael</fnm>
               <insr iid="I2"/>
               <insr iid="I3"/>
               <email>terrible@iastate.edu</email>
            </au>
            <au id="A3">
               <snm>Wu</snm>
               <fnm>Feihong</fnm>
               <insr iid="I4"/>
               <insr iid="I5"/>
               <insr iid="I6"/>
               <email>wuflyh@iastate.edu</email>
            </au>
            <au id="A4">
               <snm>Jernigan</snm>
               <mi>L</mi>
               <fnm>Robert</fnm>
               <insr iid="I3"/>
               <insr iid="I6"/>
               <insr iid="I7"/>
               <insr iid="I8"/>
               <email>jernigan@iastate.edu</email>
            </au>
            <au id="A5">
               <snm>Dobbs</snm>
               <fnm>Drena</fnm>
               <insr iid="I2"/>
               <insr iid="I3"/>
               <insr iid="I4"/>
               <insr iid="I6"/>
               <insr iid="I7"/>
               <email>ddobbs@iastate.edu</email>
            </au>
            <au id="A6">
               <snm>Honavar</snm>
               <fnm>Vasant</fnm>
               <insr iid="I3"/>
               <insr iid="I4"/>
               <insr iid="I5"/>
               <insr iid="I6"/>
               <insr iid="I7"/>
               <email>honavar@cs.iastate.edu</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Department of Computer Science, Utah State University, Logan, Utah, 84341, USA</p>
            </ins>
            <ins id="I2">
               <p>Department of Genetics, Development and Cell Biology, Iowa State University, Ames, Iowa, 50010, USA</p>
            </ins>
            <ins id="I3">
               <p>Bioinformatics and Computational Biology Graduate Program, Iowa State University, Ames, Iowa, 50010, USA</p>
            </ins>
            <ins id="I4">
               <p>Artificial Intelligence Research Laboratory, Iowa State University, Ames, Iowa, 50010, USA</p>
            </ins>
            <ins id="I5">
               <p>Department of Computer Science, Iowa State University, Ames, Iowa, 50010, USA</p>
            </ins>
            <ins id="I6">
               <p>Center for Computational Intelligence, Learning, and Discovery, Iowa State University, Ames, Iowa, 50010, USA</p>
            </ins>
            <ins id="I7">
               <p>Laurence H Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, Iowa, 50010, USA</p>
            </ins>
            <ins id="I8">
               <p>Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, Iowa, 50010, USA</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2006</pubdate>
         <volume>7</volume>
         <issue>1</issue>
         <fpage>262</fpage>
         <url>http://www.biomedcentral.com/1471-2105/7/262</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">16712732</pubid>
               <pubid idtype="doi">10.1186/1471-2105-7-262</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>28</day>
               <month>11</month>
               <year>2005</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>19</day>
               <month>5</month>
               <year>2006</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>19</day>
               <month>5</month>
               <year>2006</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2006</year>
         <collab>Yan et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Understanding the molecular details of protein-DNA interactions is critical for deciphering the mechanisms of gene regulation. We present a machine learning approach for the identification of amino acid residues involved in protein-DNA interactions.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We start with a Na&#239;ve Bayes classifier trained to predict whether a given amino acid residue is a DNA-binding residue based on its identity and the identities of its sequence neighbors. The input to the classifier consists of the identities of the target residue and 4 sequence neighbors on each side of the target residue. The classifier is trained and evaluated (using leave-one-out cross-validation) on a non-redundant set of 171 proteins. Our results indicate the feasibility of identifying interface residues based on local sequence information. The classifier achieves 71% overall accuracy with a correlation coefficient of 0.24, 35% specificity and 53% sensitivity in identifying interface residues as evaluated by leave-one-out cross-validation. We show that the performance of the classifier is improved by using sequence entropy of the target residue (the entropy of the corresponding column in multiple alignment obtained by aligning the target sequence with its sequence homologs) as additional input. The classifier achieves 78% overall accuracy with a correlation coefficient of 0.28, 44% specificity and 41% sensitivity in identifying interface residues. Examination of the predictions in the context of 3-dimensional structures of proteins demonstrates the effectiveness of this method in identifying DNA-binding sites from sequence information. In 33% (56 out of 171) of the proteins, the classifier identifies the interaction sites by correctly recognizing at least half of the interface residues. In 87% (149 out of 171) of the proteins, the classifier correctly identifies at least 20% of the interface residues. This suggests the possibility of using such classifiers to identify potential DNA-binding motifs and to gain potentially useful insights into sequence correlates of protein-DNA interactions.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>Na&#239;ve Bayes classifiers trained to identify DNA-binding residues using sequence information offer a computationally efficient approach to identifying putative DNA-binding sites in DNA-binding proteins and recognizing potential DNA-binding motifs.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="bmc" subtype="user_supplied_xml" id="endnote"/>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Protein-DNA interactions play a pivotal role in gene regulation. The ability to identify amino acid residues that are responsible for the specificity and affinity of the interactions can significantly improve our understanding of macromolecular functions and contribute to advances in drug discovery <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp>. Hence, the discovery of the principles of protein-DNA interactions has been a topic of significant interest for many years <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. Current approaches to uncovering such principles rely on experimental analysis of the structures of protein-DNA complexes in order to understand the molecular details of specific residue-residue contacts that mediate protein-DNA recognition <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>. In addition to biophysical methods for structure determination, biochemical and molecular genetic approaches have been widely used to identify DNA-binding sites on proteins and to investigate the interaction modes between proteins and DNA. For example, alanine-scanning mutagenesis has been used to identify the amino acids important for target recognition by the m<sup>5</sup>C methyltransferase <abbrgrp><abbr bid="B7">7</abbr></abbrgrp> and to distinguish specific amino acids important for DNA binding and transcription activation by SoxS <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. More recently, methods for precisely identifying protein-DNA contacts by coupling photochemical crosslinking with mass spectrometry have also been developed <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>.</p>
         <p>With increasing availability of protein sequence data, there is an urgent need for computational tools that can rapidly and reliably identify DNA-binding sites. Hence, there has been significant recent interest in developing computational methods for identification of amino acid residues that participate in protein-DNA interactions based on combinations of sequence, structure, evolutionary information, and chemical or physical properties. For example, Jones <it>et al. </it><abbrgrp><abbr bid="B10">10</abbr></abbrgrp> analyzed residue patches on the surface of DNA-binding proteins and used electrostatic potentials of residues to predict DNA-binding sites. They recently applied this method to the identification of three specific classes of DNA-binding proteins, based on the presence of solvent accessible DNA-binding structural motifs <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. In related work, Tsuchiya <it>et al. </it><abbrgrp><abbr bid="B12">12</abbr></abbrgrp> used a structure-based method to identify protein-DNA binding sites based on electrostatic potentials and surface shape, and Keil <it>et al</it>. <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> trained a neural network classifier to identify patches likely to be DNA-binding sites based on physical and chemical properties of the patches. Neural network classifiers have also been used to identify protein-DNA interface residues based on a combination of sequence neighbor and structure information <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>. More recently, Ahmad and Sarai have proposed a sequence-based method for predicting DNA-binding residues that incorporates sequence alignment profiles into the input <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>.</p>
         <p>Against this background, this paper describes a machine-learning approach to developing a classifier for identifying amino acid residues that are likely to be involved in protein-DNA interactions.</p>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <sec>
            <st>
               <p>Identification of interface residues based on local sequence information</p>
            </st>
            <p>A Na&#239;ve Bayes classifier was trained to predict whether or not a target residue in a protein sequence is an interface residue based on local protein sequence information. Several input encodings based on local sequence information were tried, with input consisting of: (a) the identities of 9 amino acid residues, corresponding to a window containing the target residue and 4 neighboring residues on each side of the target residue; and (b) the identities of 9 amino acid residues and the sequence entropy of the target residue (the entropy of the corresponding column in the multiple sequence alignment obtained by aligning the target sequence with its sequence homologs). In each case, Na&#239;ve Bayes classifiers were trained and evaluated using leave-one-out cross-validation on a set of 171 DNA-binding proteins.</p>
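            <p>As an illustration of input encoding (a), the following minimal Python sketch extracts a 9-residue window around each target residue and scores it with a categorical Na&#239;ve Bayes model. The function names, the padding symbol used beyond the sequence ends, the add-one smoothing, and the toy training windows are assumptions made for illustration; they are not taken from the implementation used in this study.</p>
            <p>
```python
import math
from collections import defaultdict

PAD = "-"  # hypothetical symbol for window positions beyond the sequence ends
ALPHABET = "ACDEFGHIKLMNPQRSTVWY" + PAD


def windows(seq, half=4):
    """Return one window of 2*half+1 residues centred on each residue of seq."""
    padded = PAD * half + seq + PAD * half
    return [padded[i:i + 2 * half + 1] for i in range(len(seq))]


def train(examples):
    """Count class and per-position residue frequencies.

    examples: list of (window, label) pairs; label 1 = interface, 0 = non-interface.
    """
    class_counts = {0: 0, 1: 0}
    residue_counts = {0: defaultdict(int), 1: defaultdict(int)}
    for window, label in examples:
        class_counts[label] += 1
        for pos, aa in enumerate(window):
            residue_counts[label][(pos, aa)] += 1
    return class_counts, residue_counts


def log_odds(window, model):
    """log P(c=1 | window) - log P(c=0 | window), assuming independent positions."""
    class_counts, residue_counts = model
    total = class_counts[0] + class_counts[1]
    score = 0.0
    for label, sign in ((1, 1.0), (0, -1.0)):
        # Add-one smoothing (an assumption) keeps unseen residues from giving zero probability.
        logp = math.log((class_counts[label] + 1) / (total + 2))
        for pos, aa in enumerate(window):
            num = residue_counts[label][(pos, aa)] + 1
            den = class_counts[label] + len(ALPHABET)
            logp += math.log(num / den)
        score += sign * logp
    return score


# Toy training set: lysine-rich windows labelled interface, alanine-rich ones not.
model = train([("KKKKKKKKK", 1)] * 3 + [("AAAAAAAAA", 0)] * 3)
score_k = log_odds("KKKKKKKKK", model)  # positive: favours the interface class
score_a = log_odds("AAAAAAAAA", model)  # negative: favours the non-interface class
```
            </p>
            <p>A residue would be predicted as an interface residue when its log-odds score exceeds the logarithm of the chosen threshold.</p>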
            <p>Table <tblr tid="T1">1</tblr> shows that the classifier using amino acid identities as input achieved an overall accuracy of 71% with a correlation coefficient of 0.24, 35% specificity (35% of the residues predicted to be interface residues are actual interface residues), and 53% sensitivity (53% of the interface residues are correctly identified). Adding the sequence entropy of the target residue to the input improved the overall accuracy, correlation coefficient, and specificity of the classifier, at the cost of lower sensitivity (Table <tblr tid="T1">1</tblr>). The resulting classifier achieved an overall accuracy of 78% with a correlation coefficient of 0.28, 44% specificity, and 41% sensitivity. In 33% (56 of 171) of the proteins, the classifier recognizes the interaction site by correctly identifying at least half of the interface residues, and in 87% (149 of 171) of the proteins, by correctly identifying at least 20% of the interface residues.</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>The performance of the Na&#239;ve Bayes classifiers</p>
               </caption>
               <tblbdy cols="3">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Identities (ID)<sup>a</sup></p>
                     </c>
                     <c ca="left">
                        <p>ID + entropy <sup>b</sup></p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Accuracy (%)</p>
                     </c>
                     <c ca="left">
                        <p>71</p>
                     </c>
                     <c ca="left">
                        <p>78</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Correlation coefficient</p>
                     </c>
                     <c ca="left">
                        <p>0.24</p>
                     </c>
                     <c ca="left">
                        <p>0.28</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Specificity (%)</p>
                     </c>
                     <c ca="left">
                        <p>35</p>
                     </c>
                     <c ca="left">
                        <p>44</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Sensitivity (%)</p>
                     </c>
                     <c ca="left">
                        <p>53</p>
                     </c>
                     <c ca="left">
                        <p>41</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p><sup>a </sup>Input contains only the identities of 9 amino acid residues (the target residue and its 4 sequence neighbors on each side). <sup>b </sup>Sequence entropy of the target residue position is added as an additional input.</p>
               </tblfn>
            </tbl>
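            <p>The four measures in Table 1 can be computed from the counts of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). The sketch below follows the usage in this section, where specificity is the fraction of predicted interface residues that are actual interface residues. The function name and the example counts are hypothetical, and the correlation coefficient is assumed here to be the Matthews correlation coefficient.</p>
            <p>
```python
import math


def performance(tp, fp, tn, fn):
    """Return (accuracy, correlation coefficient, specificity, sensitivity)."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Specificity as used in this section: correct calls among predicted interface residues.
    specificity = tp / (tp + fp) if tp + fp > 0 else 0.0
    # Sensitivity: actual interface residues that are recovered.
    sensitivity = tp / (tp + fn) if tp + fn > 0 else 0.0
    # Assumed definition: Matthews correlation coefficient, set to 0 when undefined.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return accuracy, cc, specificity, sensitivity


# Made-up counts for 1000 residues, not taken from the study.
accuracy, cc, specificity, sensitivity = performance(tp=40, fp=60, tn=860, fn=40)
```
            </p>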
            <p>Inclusion of other features of the target residue, including relative solvent accessibility, secondary structure, electrostatic potential, and hydrophobicity, as additional inputs to the classifier did not yield performance improvements (data not shown) relative to the classifier trained using only the amino acid identities of the target residue and its sequence neighbors. Classifiers trained using features other than the amino acid identities of the target residue and its neighbors as input achieved lower performance than the classifier using amino acid identities of the corresponding residues as input (data not shown).</p>
         </sec>
         <sec>
            <st>
               <p>Evaluation of the predictions in the context of 3-dimensional structures of proteins</p>
            </st>
            <p>We examined, in the context of the 3-dimensional structures of the protein-DNA complexes, the DNA-binding residue predictions generated by a Na&#239;ve Bayes classifier trained to identify such residues based on the amino acid identities of the target residue and its sequence neighbors. Two representative examples are shown in Figure <figr fid="F1">1</figr>. Figure <figr fid="F1">1A</figr> shows the predictions on the transcription factor C/Ebp&#946; from PDB complex 1gu4. The predictions of the classifier rank 3<sup>rd </sup>best in terms of correlation coefficient among the 171 proteins. We note that the classifier is able to recognize the DNA-binding site on the protein on the basis of sequence information alone. Figure <figr fid="F1">1B</figr> shows the predictions on the intron-associated endonuclease I-<it>Tev</it>I from PDB complex 1i3j. The predictions of the classifier in this case rank 114<sup>th </sup>best among the 171 proteins in terms of correlation coefficient. I-<it>Tev</it>I wraps around the DNA and has an unusually extended binding site. We note that the predicted DNA-binding residues cover the long segment of the protein that binds to the DNA.</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Visualization of predicted DNA-binding residues on 3-D Structure</p>
               </caption>
               <text>
                  <p><b>Visualization of predicted DNA-binding residues on 3-D Structure</b>. The predicted interface residues are shown in red on the protein surface. DNA molecules bound to the proteins are shown in blue. <b>A</b>: The predictions on C/Ebp&#946; from PDB complex 1gu4, the 3<sup>rd </sup>best out of the 171 proteins in terms of correlation coefficient. <b>B</b>: The predictions on I-<it>Tev</it>I from PDB complex 1i3j, the 114<sup>th </sup>best out of the 171 proteins. Figures are generated using Protein Explorer [38].</p>
               </text>
               <graphic file="1471-2105-7-262-1"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Receiver operating characteristic (ROC) curve</p>
            </st>
            <p>In some situations (e.g., identification of critical interface residues for site-specific mutagenesis), it is desirable to predict interface residues with high precision at the cost of reduced coverage. In other situations, discovering more potential interface residues might be more useful. These different requirements can be met by modifying the threshold &#952; used by the Na&#239;ve Bayes classifier in this study. The Na&#239;ve Bayes classifier predicts a residue to be an interface residue if <m:math name="1471-2105-7-262-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mfrac><m:mrow><m:mi>P</m:mi><m:mo stretchy="false">(</m:mo><m:mi>c</m:mi><m:mo>=</m:mo><m:mn>1</m:mn><m:mo>|</m:mo><m:mi>X</m:mi><m:mo>=</m:mo><m:msub><m:mi>x</m:mi><m:mn>1</m:mn></m:msub><m:msub><m:mi>x</m:mi><m:mn>2</m:mn></m:msub><m:mn>...</m:mn><m:msub><m:mi>x</m:mi><m:mi>n</m:mi></m:msub><m:mo stretchy="false">)</m:mo></m:mrow><m:mrow><m:mi>P</m:mi><m:mo stretchy="false">(</m:mo><m:mi>c</m:mi><m:mo>=</m:mo><m:mn>0</m:mn><m:mo>|</m:mo><m:mi>X</m:mi><m:mo>=</m:mo><m:msub><m:mi>x</m:mi><m:mn>1</m:mn></m:msub><m:msub><m:mi>x</m:mi><m:mn>2</m:mn></m:msub><m:mn>...</m:mn><m:msub><m:mi>x</m:mi><m:mi>n</m:mi></m:msub><m:mo stretchy="false">)</m:mo></m:mrow></m:mfrac><m:mo>></m:mo><m:mi>&#952;</m:mi></m:mrow><m:annotation encoding="application/x-tex">\frac{P(c=1 \mid X=x_1 x_2 \ldots x_n)}{P(c=0 \mid X=x_1 x_2 \ldots x_n)} &gt; \theta</m:annotation></m:semantics></m:math>. Figure <figr fid="F2">2</figr> shows the receiver operating characteristic (ROC) curve of the DNA-binding site predictor.</p>
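            <p>The decision rule above can be sketched in a few lines of Python. The function name, the default value of the threshold, and the example probabilities are hypothetical; the sketch assumes only that the classifier supplies the posterior probability P(c = 1 | X) for each residue, so that P(c = 0 | X) is its complement.</p>
            <p>
```python
def predict_interface(p_interface, theta=1.0):
    """Apply the rule P(c=1|X) / P(c=0|X) > theta to a list of posteriors."""
    labels = []
    for p in p_interface:
        if p >= 1.0:
            ratio = float("inf")  # P(c=0|X) is zero, so the ratio is unbounded
        else:
            ratio = p / (1.0 - p)
        labels.append(1 if ratio > theta else 0)
    return labels


posteriors = [0.10, 0.45, 0.60, 0.90]         # made-up values for four residues
default = predict_interface(posteriors)        # theta = 1: call when P(c=1|X) > 0.5
strict = predict_interface(posteriors, 4.0)    # larger theta: fewer, more confident calls
lenient = predict_interface(posteriors, 0.5)   # smaller theta: more residues are called
```
            </p>
            <p>Raising the threshold trades coverage for precision, and lowering it does the opposite, which is the trade-off traced by the ROC curve.</p>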
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Receiver Operating Characteristic curve (ROC curve) for interface residue identification</p>
               </caption>
               <text>
                  <p>Receiver Operating Characteristic curve (ROC curve) for interface residue identification.</p>
               </text>
               <graphic file="1471-2105-7-262-2"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Na&#239;ve Bayes classifier using only local sequence identities as input can discover DNA-binding motifs</p>
            </st>
            <p>The results summarized above show that a Na&#239;ve Bayes classifier trained on a set of DNA-binding proteins can successfully identify protein-DNA interface residues from amino acid sequence. This raises the question of how the sequence features that are identified as predictive of DNA-binding residues by the Na&#239;ve Bayes classifier relate to known DNA-binding motifs. To explore this question, we used the ps_scan program to search for PROSITE motifs in our data set of 171 DNA-binding proteins. PROSITE motifs were found in 53 of the 171 proteins (a total of 73 hits). Of these 73 hits, 61 overlap with actual protein-DNA binding sites. The DNA-binding site predictions produced by the Na&#239;ve Bayes classifier (in the leave-one-out cross-validation setting) using the identities of a window of 9 residues and the sequence entropy of the target residue as input substantially overlap with 56 of the 61 PROSITE DNA-binding motifs (Figure <figr fid="F3">3</figr>). It is worth noting that 118 of the 171 DNA-binding proteins in our data set contain <it>no </it>PROSITE motif whose annotation suggests a role in protein-DNA interactions. PROSITE motifs cover more than 50% of interface residues in only 11% (18 out of 171) of the proteins and cover at least 20% of interface residues in only 20% (34 out of 171) of the proteins. In contrast, the Na&#239;ve Bayes classifier identifies at least 50% of the interface residues in 33% (56 out of 171) of the proteins and at least 20% of the interface residues in 87% (149 out of 171) of the DNA-binding proteins used in this study. These results suggest the possibility of using a Na&#239;ve Bayes classifier trained to predict DNA-binding residues to identify putative DNA-binding motifs.</p>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Comparison of actual and predicted DNA-binding site residues for transcription factor CREB (PDB 1dh3A)</p>
               </caption>
               <text>
                  <p><b>Comparison of actual and predicted DNA-binding site residues for transcription factor CREB (PDB 1dh3A)</b>. PROSITE motif BZIP_BASIC (bottom row) covers many of the actual interface residues (the first row below sequence). Note that the predictions of the Na&#239;ve Bayes classifier (the second row below sequence) overlap with the PROSITE motif, but correspond more closely to the actual interface residues.</p>
               </text>
               <graphic file="1471-2105-7-262-3"/>
            </fig>
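            <p>The per-protein coverage figures quoted above (the share of proteins in which at least 50%, or at least 20%, of the interface residues are recovered) can be computed as in the following sketch. The data layout, with one set of actual and one set of predicted interface positions per protein, and all names and values are assumptions made for illustration.</p>
            <p>
```python
def coverage(actual, predicted):
    """Fraction of actual interface positions that are also predicted."""
    if not actual:
        return 0.0
    return len(actual.intersection(predicted)) / len(actual)


def fraction_of_proteins(proteins, min_coverage):
    """Share of proteins whose coverage is at least min_coverage."""
    hits = sum(1 for actual, predicted in proteins
               if coverage(actual, predicted) >= min_coverage)
    return hits / len(proteins)


# Toy data: (actual interface positions, predicted positions) for three proteins.
proteins = [
    ({1, 2, 3, 4}, {2, 3, 9}),     # 2 of 4 interface residues recovered
    ({5, 6, 7, 8, 9}, {5, 20}),    # 1 of 5 recovered
    ({10, 11, 12}, {30, 31}),      # none recovered
]
at_least_half = fraction_of_proteins(proteins, 0.5)
at_least_fifth = fraction_of_proteins(proteins, 0.2)
```
            </p>
            <p>The same function applies to any source of predicted positions, so the residues covered by PROSITE motif hits and the residues predicted by the classifier can be compared on an equal footing.</p>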
         </sec>
         <sec>
            <st>
               <p>Comparison with previously published methods</p>
            </st>
            <p>Ahmad and Sarai have developed a Position Specific Scoring Matrix (PSSM) based neural network classifier for predicting DNA-binding sites <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. To the best of our knowledge, this is the only previously published study that reports the performance of DNA-binding site prediction using only sequence information on a "per residue" basis. Ahmad and Sarai have made available an online server that predicts DNA-binding residues using a PSSM-based neural network classifier <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. The server makes predictions for protein sequences that are 40 to 200 amino acid residues in length. In our data set of 171 DNA-binding proteins, 86 have lengths in this range. The predictions of the PSSM-based classifier on these 86 proteins were obtained by submitting the sequences to the online server. The server returns, for each residue in the submitted sequence, the estimated probability that the residue is a DNA-binding residue. These probabilities can be compared with a threshold to obtain a prediction as to whether a residue is a DNA-binding residue. Different choices of threshold yield different predictions. We varied the threshold from 0.01 to 0.99 in increments of 0.02 to generate an ROC curve for the PSSM-based neural network classifier. For comparison, we trained a Na&#239;ve Bayes classifier, using as input the identities of 9 amino acid residues, and evaluated it by leave-one-out cross-validation on the subset of 86 proteins (ranging from 40 to 200 amino acids in length). Figure <figr fid="F4">4</figr> shows the comparison of the ROC curves of the PSSM-based neural network classifier with that of the Na&#239;ve Bayes classifier on the data set of 86 proteins. The results show that the Na&#239;ve Bayes classifier achieves a higher hit rate (true positive rate), for any given choice of the false alarm rate (false positive rate), than the current implementation of the PSSM-based neural network classifier in the online server.</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>The ROC curves for the Na&#239;ve Bayes classifier and the PSSM-based classifier</p>
               </caption>
               <text>
                  <p><b>The ROC curves for the Na&#239;ve Bayes classifier and the PSSM-based classifier</b>. The Na&#239;ve Bayes classifier uses the identities of 9 amino acid residues as input. The ROC curve for the Na&#239;ve Bayes classifier is obtained using Weka on 86 DNA-binding proteins with lengths ranging from 40 to 200 residues and with pairwise sequence similarity less than 30%. The ROC curve for the PSSM-based classifier is generated using the true positive, false positive, true negative, and false negative predictions obtained by submitting the 86 sequences to the online server [16] that implements the PSSM-based classifier developed by Ahmad and Sarai [15].</p>
               </text>
               <graphic file="1471-2105-7-262-4"/>
            </fig>
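            <p>The threshold sweep described above can be sketched as follows. The function name and the toy probabilities and labels are hypothetical. The sketch assumes that each residue has a predicted probability and a known interface label, that both classes are present, and that a residue is called DNA-binding when its probability is at least the threshold. It reports the false alarm rate and hit rate at each threshold from 0.01 to 0.99 in steps of 0.02.</p>
            <p>
```python
def roc_points(probabilities, labels, start=0.01, stop=0.99, step=0.02):
    """Return (false alarm rate, hit rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    n_steps = int(round((stop - start) / step)) + 1
    points = []
    for i in range(n_steps):
        threshold = start + i * step  # derived from i to avoid accumulated rounding error
        tp = sum(1 for p, y in zip(probabilities, labels) if y == 1 and p >= threshold)
        fp = sum(1 for p, y in zip(probabilities, labels) if y == 0 and p >= threshold)
        points.append((fp / negatives, tp / positives))
    return points


probs = [0.05, 0.20, 0.35, 0.60, 0.80, 0.95]  # made-up per-residue probabilities
truth = [0, 0, 1, 0, 1, 1]                    # 1 = DNA-binding residue
curve = roc_points(probs, truth)
```
            </p>
            <p>Plotting the hit rate against the false alarm rate for each pair gives one ROC curve per classifier, which allows two classifiers to be compared at matched false alarm rates.</p>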
         </sec>
         <sec>
            <st>
               <p>Identification of DNA-binding residues in type I restriction-modification system</p>
            </st>
            <p>Restriction-modification (R-M) systems play an important role in the recognition and elimination of foreign DNA. In type I R-M systems, the S subunit determines the specificity of DNA recognition. The interaction mode between the S subunit and DNA is still unknown. Recently, Kim <it>et al. </it><abbrgrp><abbr bid="B17">17</abbr></abbrgrp> solved the crystal structure of the S subunit from <it>M. jannaschii</it>, the only crystal structure reported so far for the S subunit of a type I R-M system. To further evaluate the Na&#239;ve Bayes classifier, we used the classifier trained on our data set of 171 DNA-binding proteins (using identities of the target residue and 4 sequence neighbors on either side, along with the sequence entropy of the target residue, as input) to identify DNA-binding residues on the S subunit of the type I R-M system from <it>M. jannaschii</it>. Figure <figr fid="F5">5</figr> shows the predicted DNA-binding residues in red and spacefill. Note that Kim <it>et al. </it><abbrgrp><abbr bid="B17">17</abbr></abbrgrp> reported, based on the solved crystal structure of the S subunit of <it>M. jannaschii</it>, that the structures of the two target recognition domains (TRD1, residues 1&#8211;168 and TRD2, residues 209&#8211;378) of the S subunit are similar to the DNA binding domain of <it>Taq</it>I-MTase. By aligning the structures of TRD1 and TRD2 with the structure of the <it>Taq</it>I-MTase/DNA complex, Kim <it>et al. </it><abbrgrp><abbr bid="B17">17</abbr></abbrgrp> proposed a model for the interaction between the S subunit and DNA. In Figure <figr fid="F5">5</figr>, the DNA molecules in Kim's model are shown in blue. Comparison of Kim's model with the DNA-binding site predictions produced by our Na&#239;ve Bayes classifier shows that the predictions agree with the locations of the two potential DNA-binding sites on the S subunit in Kim's interaction model.</p>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>The predictions on the S subunit of the type I (R-M) system from <it>M. jannaschii</it></p>
               </caption>
               <text>
                  <p><b>The predictions on the S subunit of the type I (R-M) system from <it>M. jannaschii</it></b>. The predicted interface residues are shown in red. The DNA molecules from the interaction model proposed by Kim <it>et al. </it>[17] are shown in blue. The locations of R units in Kim's model are indicated by circles. Figures are generated using Protein Explorer [38].</p>
               </text>
               <graphic file="1471-2105-7-262-5"/>
            </fig>
            <p>Figure <figr fid="F5">5</figr> also shows that two additional DNA-binding sites predicted by the Na&#239;ve Bayes classifier overlap with the potential interaction sites between the S subunit and R subunits of the protein (shown as circles in figure <figr fid="F5">5</figr>) as proposed in Kim's model. This observation raises the intriguing possibility that protein-DNA interfaces and protein-protein interfaces might have some common features.</p>
         </sec>
         <sec>
            <st>
               <p>Predictions of the Na&#239;ve Bayes classifier on proteins for which there is no experimental evidence suggesting a DNA-binding role</p>
            </st>
            <p>Given that the Na&#239;ve Bayes classifier was trained to identify DNA-binding residues in proteins that are known to bind to DNA, it is interesting to examine its predictions on a set of proteins for which, at present, there is no evidence suggesting a DNA-binding role. We assembled a non-redundant data set of 2,323 proteins which, based on our analysis of Gene Ontology (GO) annotations, appear to have no evidence suggesting a DNA-binding role. A Na&#239;ve Bayes classifier trained on our data set of 171 DNA-binding proteins to identify the DNA-binding residues (using amino acid identities of the target residue and its sequence neighbors together with the sequence entropy of the target residue as input) was applied to the 2,323 proteins with no known DNA-binding role. The Na&#239;ve Bayes classifier predicted 11% of the 613,754 residues from these 2,323 proteins as potentially DNA-binding residues. It would be inappropriate to conclude that 11% is the per-residue false positive rate of our classifier, because the absence of DNA-binding evidence in GO annotation does not necessarily imply that the protein in question does not have a DNA-binding role. It is quite possible that at least some of these 2,323 proteins indeed bind to DNA. It should be emphasized that our classifier was <it>not </it>trained to distinguish the class of DNA-binding proteins from those that are not DNA-binding (training such a classifier would involve using representatives of both DNA-binding and non-DNA-binding proteins in the training set). It is interesting to note that in 156 of the 2,323 proteins, <it>no </it>residues were predicted to be DNA-binding by our classifier; 264 had fewer than 5 predicted DNA-binding residues; 502 had fewer than 10 predicted DNA-binding residues; and 999 had fewer than 20 predicted DNA-binding residues. Exploring the implications of these observations would require experimentally testing, for DNA-binding activity, some of the proteins on which our Na&#239;ve Bayes classifier predicts putative DNA-binding sites. Another potentially interesting direction would be to train classifiers to distinguish proteins that are DNA-binding (without necessarily identifying the DNA-binding residues) from those that are not.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <sec>
            <st>
               <p>Effectiveness of local amino acid sequence based approach to prediction of putative DNA-binding sites</p>
            </st>
            <p>In this paper, we have described a computationally efficient approach to identifying putative DNA-binding residues of DNA-binding proteins using Na&#239;ve Bayes classifiers trained to predict DNA-binding residues using amino acid identities of the target residue and its sequence neighbors. The resulting classifier achieves 71% overall accuracy with a correlation coefficient of 0.24, 35% specificity and 53% sensitivity in identifying interface residues as evaluated by leave-one-out cross-validation. Our results indicate the feasibility of identifying interface residues based on local sequence information alone.</p>
            <p>We found that the performance of the classifier is improved by using sequence entropy of the target residue (the entropy of the corresponding column in multiple alignment obtained by aligning the target sequence with its sequence homologs) as additional input. This observation is consistent with the suggestion that DNA-binding residues are likely to be conserved (because of their function). The resulting classifier achieves 78% overall accuracy with a correlation coefficient of 0.28, 44% specificity and 41% sensitivity in identifying interface residues.</p>
            <p>Incorporating additional structure-derived information such as solvent accessibility, electrostatic potential, hydrophobicity or secondary structure of the target residue as additional input, however, did not improve the performance in this study. This should not be taken to mean that these features are not useful predictors of a residue's functionality. In particular, electrostatic potential has been shown to be useful in identification of protein-DNA interface residues <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr></abbrgrp>. The fact that this information does not improve performance of our Na&#239;ve Bayes classifiers might have to do with the properties of input encoding or the classification method. Specifically, the additional features were simply added as additional input. The underlying assumption of the Na&#239;ve Bayes classifier that the inputs are independent given the class almost certainly does not hold in the case of protein sequences. Hence, more systematic analysis is needed to identify features that are useful for identification of interface residues and develop methods of representing them in input to a broad range of classifiers. Jones and Thornton <abbrgrp><abbr bid="B18">18</abbr></abbrgrp> analyzed six features of surface patches in protein-protein interaction sites and developed an approach to identify protein-protein interfaces based on the scores combining the six features. Sen <it>et al. </it><abbrgrp><abbr bid="B19">19</abbr></abbrgrp> developed an ensemble method to identify protease-inhibitor binding sites based on sequence, structure and evolution information. It would be interesting to explore such methods for computational prediction of protein-DNA interfaces.</p>
         </sec>
         <sec>
            <st>
               <p>Comparison of Na&#239;ve Bayes classifier with a PSSM-based neural network classifier</p>
            </st>
            <p>Ahmad and Sarai <abbrgrp><abbr bid="B15">15</abbr></abbrgrp> used a PSSM-based neural network classifier to identify interface residues in protein-DNA interactions. Our comparison of the PSSM-based classifier with the Na&#239;ve Bayes classifier shows that the Na&#239;ve Bayes classifier achieves higher hit rate than the PSSM-based classifier for any given choice of the false alarm rate.</p>
            <p>We note that the PSSM-based classifier's ROC originally reported by Ahmad and Sarai <abbrgrp><abbr bid="B15">15</abbr></abbrgrp> is better than the PSSM-based classifier's ROC achieved by their online server <abbrgrp><abbr bid="B16">16</abbr></abbrgrp> on the data set used in our comparison. A few factors may have contributed to this difference: (1) the data set used by Ahmad and Sarai in their original study is different from the data set of 86 proteins used here. It is possible that the current implementation of the PSSM-based method is well optimized for their original data set, but not for the 86 proteins used here; (2) the ROC reported by Ahmad and Sarai includes predictions on proteins of all lengths, whereas the online server only makes predictions for proteins with a length in the range of 40&#8211;200. We chose to compare the Na&#239;ve Bayes classifier with the online server because the server is publicly available and it provides the raw probabilities of the predictions making it possible to compare the ROC curves of the two classifiers on the same data set. However, it should be noted that in the case of Na&#239;ve Bayes classifier, our use of leave-one-out cross-validation ensures that the training and test data do not overlap. We have no control over the training data used by the PSSM-based classifier. Nevertheless, a comparison of the two ROC curves suggests that the Na&#239;ve Bayes classifier achieves higher hit rate than the current implementation of the PSSM-based neural network classifier for any given choice of the false alarm rate.</p>
            <p>A thorough assessment of the performance of the Na&#239;ve Bayes classifier relative to the PSSM-based classifier requires systematic comparisons using leave-one-out cross-validation on identical data sets, which is not feasible at present without access to an implementation of the algorithm and the precise parameter settings used to train the PSSM-based classifier. Plans are underway to perform such a comparison using identical data sets and evaluation procedures, in collaboration with Ahmad and Sarai.</p>
            <p>It should be noted that the Na&#239;ve Bayes classifier described in this paper offers several advantages over the PSSM-based neural network classifier: (a) The Na&#239;ve Bayes classifier can be trained in a single pass through the training data, whereas training a neural network classifier requires many, often hundreds of, passes through the training data. (b) Training the Na&#239;ve Bayes classifier, unlike training the neural network classifier, requires no time-consuming and computationally expensive exploration of the many possible choices of network architecture (e.g., number of hidden neurons) and parameter settings (e.g., learning rate). (c) The Na&#239;ve Bayes classifier, as well as the predictions generated by it, is amenable to a straightforward probabilistic interpretation, whereas the neural network classifier is more of a "black box".</p>
            <p>These advantages, together with the superior performance of the Na&#239;ve Bayes classifier relative to the current implementation of the PSSM-based neural network classifier, make it an attractive alternative to the latter for identifying DNA-binding residues from a protein sequence. However, the neural network classifier is not limited by the strong independence assumption of the Na&#239;ve Bayes classifier. Hence, it would be interesting to explore whether a neural network classifier, or a variant of it, could be optimized to yield results that are better than those of the simple Na&#239;ve Bayes classifier.</p>
         </sec>
         <sec>
            <st>
               <p>Use of Na&#239;ve Bayes classifiers to identify putative novel DNA-binding motifs</p>
            </st>
            <p>Protein sequence motifs (defined here as sequence segments associated with specific protein functions or structural families) are often used to identify putative DNA-binding domains. Discovery of such motifs requires alignment of protein sequences that are known to have the same or similar functions. Generating multiple sequence alignments that reveal useful sequence motifs requires significant human expertise to identify a suitable set of sequences to be aligned and to manually refine, through an iterative process of trial and error, the multiple sequence alignment. Against this background, it is interesting to note that in 118 out of 171 DNA-binding proteins used in this study, we found <it>no </it>PROSITE motifs whose annotations suggest a possible DNA-binding role. In the remaining proteins, 61 PROSITE motifs were found to overlap with protein-DNA binding sites. The DNA-binding sites predicted by the Na&#239;ve Bayes classifier significantly overlapped with 56 of the 61 PROSITE motifs that overlapped with DNA-binding sites. PROSITE motifs cover at least 20% of the DNA-binding residues in only 20% (34 out of 171) of the proteins. In contrast, the Na&#239;ve Bayes classifier identifies at least 20% of the interface residues in 87% (149 out of 171) of the DNA-binding proteins used in this study. This raises the possibility of identifying novel sequence motifs that correspond to protein-DNA interfaces by using a Na&#239;ve Bayes classifier trained to identify protein-DNA binding sites. More systematic comparison of this approach with alternative approaches to identification of putative DNA-binding motifs using other motif libraries and different motif finding methods is needed to evaluate its efficacy relative to other approaches.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>In previous work, we have used similar approaches to identify interface residues involved in protein-protein interactions <abbrgrp><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr></abbrgrp> and protein-RNA interactions <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>. Here we show that it is also feasible to identify interface residues involved in protein-DNA interaction using sequence information. With the level of success achieved in this study, putative DNA-binding sites predicted by the classifiers trained using a machine-learning approach should be useful for guiding experimental investigations into the role of specific residues of a protein in its interaction with DNA, e.g., by localizing candidate residues for alanine-scanning mutagenesis <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr></abbrgrp>. Moreover, analysis of the binding site "rules" generated by classifiers may provide valuable insight into the protein-DNA recognition code responsible for the specificity and affinity of protein-DNA interactions in living cells.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Data sets</p>
            </st>
            <p>DNA-binding proteins: A data set of DNA-binding proteins was extracted from structures of known protein-DNA complexes in the Protein Data Bank <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>. The dataset was culled using PISCES <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. The resulting dataset consists of 171 proteins with mutual sequence identity &lt;= 30% and each protein has at least 40 amino acid residues. All the structures have resolution better than 3.0 &#197; and R factor less than 0.3.</p>
            <p>Proteins that do not have evidence of a DNA-binding role: A non-redundant set of proteins with mutual identity less than 30% was extracted from the PDB using the cluster file from the Protein Data Bank <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>. Structures with resolution worse than 2.5 &#197; were removed. The annotations for each protein were retrieved from the Gene Ontology Annotation (GOA) <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. Proteins with annotations indicative of a DNA-binding role were eliminated, leaving a data set of 2,313 proteins with no evidence of a DNA-binding role.</p>
         </sec>
         <sec>
            <st>
               <p>Definition of interface residues</p>
            </st>
            <p>Interface residues are defined as described in Jones <it>et al. </it><abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. Accessible surface area (ASA) was computed for each residue in the unbound protein (in absence of DNA) and in the protein-DNA complex using NACCESS <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>. A residue is defined to be an interface residue if its ASA in the protein-DNA complex is less than its ASA in the unbound protein by at least 1&#197;<sup>2</sup>. The 171 proteins have 38,649 residues in total and 5,050 of them are interface residues.</p>
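            <p>As a minimal illustrative sketch of the definition above (not the authors' code), the following Python function marks a residue as an interface residue when its ASA decreases by at least 1 &#197;<sup>2</sup> upon complex formation. The function name, the residue identifiers and the ASA values are hypothetical; in practice the per-residue ASA values would be taken from the NACCESS output for the unbound protein and for the complex.</p>
            <p>
```python
from typing import Dict, List

ASA_LOSS_CUTOFF = 1.0  # square angstroms, the cutoff stated in the text


def interface_residues(asa_unbound: Dict[str, float],
                       asa_complex: Dict[str, float],
                       cutoff: float = ASA_LOSS_CUTOFF) -> List[str]:
    """Return ids of residues whose ASA drops by at least `cutoff` on binding."""
    return [res for res, free in asa_unbound.items()
            if free - asa_complex[res] >= cutoff]


# Made-up ASA values for three hypothetical residues (chain:position).
unbound = {"A:10": 85.2, "A:11": 40.0, "A:12": 12.5}
bound = {"A:10": 30.1, "A:11": 39.6, "A:12": 12.5}
print(interface_residues(unbound, bound))
```
            </p>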
         </sec>
         <sec>
            <st>
               <p>Na&#239;ve Bayes classifier</p>
            </st>
            <p>We used the Na&#239;ve Bayes implementation in the Weka package from the University of Waikato, New Zealand <abbrgrp><abbr bid="B28">28</abbr><abbr bid="B29">29</abbr></abbrgrp>. For each input target residue, the classifier produces a Boolean output (with 1 denoting an interface residue and 0 denoting a non-interface residue). The Na&#239;ve Bayes classifier assumes independence of the attributes given the class. The Na&#239;ve Bayes classifier performs as well as more sophisticated methods on many classification tasks <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. For an input <it>X </it>= <it>x</it><sub>1 </sub><it>x</it><sub>2 </sub>,...,<it>x</it><sub><it>n </it></sub>, a Na&#239;ve Bayes classifier assigns it a class label <it>c </it>by optimizing the posterior: <m:math name="1471-2105-7-262-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mi>c</m:mi><m:mo>=</m:mo><m:mi>arg</m:mi><m:mo>&#8289;</m:mo><m:munder><m:mrow><m:mi>max</m:mi><m:mo>&#8289;</m:mo></m:mrow><m:mi>c</m:mi></m:munder><m:mi>P</m:mi><m:mo stretchy="false">(</m:mo><m:mi>c</m:mi><m:mo>|</m:mo><m:mi>X</m:mi><m:mo>=</m:mo><m:msub><m:mi>x</m:mi><m:mn>1</m:mn></m:msub><m:msub><m:mi>x</m:mi><m:mn>2</m:mn></m:msub><m:mn>...</m:mn><m:msub><m:mi>x</m:mi><m:mi>n</m:mi></m:msub><m:mo stretchy="false">)</m:mo><m:mo>=</m:mo><m:mi>arg</m:mi><m:mo>&#8289;</m:mo><m:munder><m:mrow><m:mi>max</m:mi><m:mo>&#8289;</m:mo></m:mrow><m:mi>c</m:mi></m:munder><m:mi>P</m:mi><m:mtext>(</m:mtext><m:mi>c</m:mi><m:mtext>)</m:mtext><m:mstyle displaystyle="true"><m:munderover><m:mo>&#8719;</m:mo><m:mrow><m:mi>i</m:mi><m:mo>=</m:mo><m:mn>1</m:mn></m:mrow><m:mi>n</m:mi></m:munderover><m:mrow><m:mi>P</m:mi><m:mo stretchy="false">(</m:mo><m:msub><m:mi>x</m:mi><m:mi>i</m:mi></m:msub><m:mo>|</m:mo><m:mi>c</m:mi><m:mo stretchy="false">)</m:mo></m:mrow></m:mstyle></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGJbWycqGH9aqpcyGGHbqycqGGYbGCcqGGNbWzdaWfqaqaaiGbc2gaTjabcggaHjabcIha4bWcbaGaem4yamgabeaakiabdcfaqjabcIcaOiabdogaJjabcYha8jabdIfayjabg2da9iabdIha4naaBaaaleaacqaIXaqmaeqaaOGaemiEaG3aaSbaaSqaaiabikdaYaqabaGccqGGUaGlcqGGUaGlcqGGUaGlcqWG4baEdaWgaaWcbaGaemOBa4gabeaakiabcMcaPiabg2da9iGbcggaHjabckhaYjabcEgaNnaaxababaGagiyBa0MaeiyyaeMaeiiEaGhaleaacqWGJbWyaeqaamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegyvzYrwyUfgaiqGakiaa=bfacqqGOaakcaWFJbGaeeykaKYaaebCaeaacaWFqbGaeiikaGIaemiEaG3aaSbaaSqaaiabdMgaPbqabaGccqGG8baFcqWGJbWycqGGPaqkaSqaaiaa=LgacqGH9aqpiqaacaGFXaaabaGaa8NBaaqdcqGHpis1aaaa@7297@</m:annotation></m:semantics></m:math>. In the case of two class classification (<it>c </it>&#8712; {0, 1}), this is equivalent to determining <it>c </it>by comparing the ratio likelihood with a parameter &#952; as in equation (1).</p>
            <p>
               <m:math name="1471-2105-7-262-i3" xmlns:m="http://www.w3.org/1998/Math/MathML">
                  <m:semantics>
                     <m:mrow>
                        <m:mfrac>
                           <m:mrow>
                              <m:mi>P</m:mi>
                              <m:mo stretchy="false">(</m:mo>
                              <m:mi>c</m:mi>
                              <m:mo>=</m:mo>
                              <m:mn>1</m:mn>
                              <m:mo>|</m:mo>
                              <m:mi>X</m:mi>
                              <m:mo>=</m:mo>
                              <m:msub>
                                 <m:mi>x</m:mi>
                                 <m:mn>1</m:mn>
                              </m:msub>
                              <m:msub>
                                 <m:mi>x</m:mi>
                                 <m:mn>2</m:mn>
                              </m:msub>
                              <m:mn>...</m:mn>
                              <m:msub>
                                 <m:mi>x</m:mi>
                                 <m:mi>n</m:mi>
                              </m:msub>
                              <m:mo stretchy="false">)</m:mo>
                           </m:mrow>
                           <m:mrow>
                              <m:mi>P</m:mi>
                              <m:mo stretchy="false">(</m:mo>
                              <m:mi>c</m:mi>
                              <m:mo>=</m:mo>
                              <m:mn>0</m:mn>
                              <m:mo>|</m:mo>
                              <m:mi>X</m:mi>
                              <m:mo>=</m:mo>
                              <m:msub>
                                 <m:mi>x</m:mi>
                                 <m:mn>1</m:mn>
                              </m:msub>
                              <m:msub>
                                 <m:mi>x</m:mi>
                                 <m:mn>2</m:mn>
                              </m:msub>
                              <m:mn>...</m:mn>
                              <m:msub>
                                 <m:mi>x</m:mi>
                                 <m:mi>n</m:mi>
                              </m:msub>
                              <m:mo stretchy="false">)</m:mo>
                           </m:mrow>
                        </m:mfrac>
                        <m:mo>=</m:mo>
                        <m:mfrac>
                           <m:mrow>
                              <m:mi>P</m:mi>
                              <m:mo stretchy="false">(</m:mo>
                              <m:mi>c</m:mi>
                              <m:mo>=</m:mo>
                              <m:mn>1</m:mn>
                              <m:mo stretchy="false">)</m:mo>
                              <m:mstyle displaystyle="true">
                                 <m:munderover>
                                    <m:mo>&#8719;</m:mo>
                                    <m:mrow>
                                       <m:mi>i</m:mi>
                                       <m:mo>=</m:mo>
                                       <m:mn>1</m:mn>
                                    </m:mrow>
                                    <m:mi>n</m:mi>
                                 </m:munderover>
                                 <m:mrow>
                                    <m:mi>P</m:mi>
                                    <m:mo stretchy="false">(</m:mo>
                                    <m:msub>
                                       <m:mi>x</m:mi>
                                       <m:mi>i</m:mi>
                                    </m:msub>
                                    <m:mo>|</m:mo>
                                    <m:mi>c</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mn>1</m:mn>
                                    <m:mo stretchy="false">)</m:mo>
                                 </m:mrow>
                              </m:mstyle>
                           </m:mrow>
                           <m:mrow>
                              <m:mi>P</m:mi>
                              <m:mo stretchy="false">(</m:mo>
                              <m:mi>c</m:mi>
                              <m:mo>=</m:mo>
                              <m:mn>0</m:mn>
                              <m:mo stretchy="false">)</m:mo>
                              <m:mstyle displaystyle="true">
                                 <m:munderover>
                                    <m:mo>&#8719;</m:mo>
                                    <m:mrow>
                                       <m:mi>i</m:mi>
                                       <m:mo>=</m:mo>
                                       <m:mn>1</m:mn>
                                    </m:mrow>
                                    <m:mi>n</m:mi>
                                 </m:munderover>
                                 <m:mrow>
                                    <m:mi>P</m:mi>
                                    <m:mo stretchy="false">(</m:mo>
                                    <m:msub>
                                       <m:mi>x</m:mi>
                                       <m:mi>i</m:mi>
                                    </m:msub>
                                    <m:mo>|</m:mo>
                                    <m:mi>c</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mn>0</m:mn>
                                    <m:mo stretchy="false">)</m:mo>
                                 </m:mrow>
                              </m:mstyle>
                           </m:mrow>
                        </m:mfrac>
                        <m:mo>></m:mo>
                        <m:mi>&#952;</m:mi>
                        <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                        <m:mrow>
                           <m:mo>(</m:mo>
                           <m:mn>1</m:mn>
                           <m:mo>)</m:mo>
                        </m:mrow>
                     </m:mrow>
                     <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabdcfaqjabcIcaOiabdogaJjabg2da9iabigdaXiabcYha8jabdIfayjabg2da9iabdIha4naaBaaaleaacqaIXaqmaeqaaOGaemiEaG3aaSbaaSqaaiabikdaYaqabaGccqGGUaGlcqGGUaGlcqGGUaGlcqWG4baEdaWgaaWcbaGaemOBa4gabeaakiabcMcaPaqaaiabdcfaqjabcIcaOiabdogaJjabg2da9iabicdaWiabcYha8jabdIfayjabg2da9iabdIha4naaBaaaleaacqaIXaqmaeqaaOGaemiEaG3aaSbaaSqaaiabikdaYaqabaGccqGGUaGlcqGGUaGlcqGGUaGlcqWG4baEdaWgaaWcbaGaemOBa4gabeaakiabcMcaPaaacqGH9aqpdaWcaaqaaiabdcfaqjabcIcaOiabdogaJjabg2da9iabigdaXiabcMcaPmaarahabaGaemiuaaLaeiikaGIaemiEaG3aaSbaaSqaaiabdMgaPbqabaGccqGG8baFcqWGJbWycqGH9aqpcqaIXaqmcqGGPaqkaSqaaiabdMgaPjabg2da9iabigdaXaqaaiabd6gaUbqdcqGHpis1aaGcbaGaemiuaaLaeiikaGIaem4yamMaeyypa0JaeGimaaJaeiykaKYaaebCaeaacqWGqbaucqGGOaakcqWG4baEdaWgaaWcbaGaemyAaKgabeaakiabcYha8jabdogaJjabg2da9iabicdaWiabcMcaPaWcbaGaemyAaKMaeyypa0JaeGymaedabaGaemOBa4ganiabg+GivdaaaOGaeyOpa4JaeqiUdeNaaCzcaiaaxMaadaqadaqaaiabigdaXaGaayjkaiaawMcaaaaa@8D7B@</m:annotation>
                  </m:semantics>
               </m:math>
            </p>
            <p><it>c </it>is predicted to be 1 if the ratio in equation (1) is greater than &#952;, and 0 otherwise. When the local sequence around the target residue was encoded using numeric features such as hydrophobicity, the numeric values were discretized using the discretization filter of Weka.</p>
            <p>In a standard Na&#239;ve Bayes classifier, &#952; takes the value of 1. The predictions of a Na&#239;ve Bayes classifier are biased in favor of the majority class when the data set contains unequal numbers of examples of the two classes. Hence, we chose &#952; to optimize classification performance on the training data. We used leave-one-out cross-validation to train and test the classifier. In each round of the experiment, all proteins except one were used as the training set and the remaining protein was used to test the classifier. In the training stage, the conditional probability table <it>P</it>(<it>x</it><sub><it>i </it></sub>| <it>c</it>) and the prior probability <it>P</it>(<it>c</it>) were estimated using the training set. To determine &#952;, the classifier was applied to the training set and values of &#952; ranging from 0.01 to 1 were tested, in increments of 0.01. The value of &#952; for which the classifier yielded the highest correlation coefficient on the training set was used to make predictions on the test set.</p>
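            <p>The procedure described above can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the Weka implementation used in this study: all function names are hypothetical, probabilities are estimated from counts with add-one smoothing (an assumption of this sketch), and the ratio in equation (1) is compared with &#952; in log space to avoid numerical underflow. Every training set is assumed to contain examples of both classes.</p>
            <p>
```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[str], int]  # (window of residue letters, class label)


def train(examples: List[Example], alphabet_size: int = 20):
    """Single pass over the data: count classes and per-position symbols."""
    class_counts = Counter()
    symbol_counts = Counter()  # key: (label, position, symbol)
    for window, label in examples:
        class_counts[label] += 1
        for pos, sym in enumerate(window):
            symbol_counts[(label, pos, sym)] += 1
    return class_counts, symbol_counts, alphabet_size


def log_ratio(model, window: Sequence[str]) -> float:
    """Log of the ratio in equation (1) under the independence assumption."""
    class_counts, symbol_counts, k = model
    score = math.log(class_counts[1]) - math.log(class_counts[0])
    for pos, sym in enumerate(window):
        # Add-one smoothing is an assumption of this sketch.
        p1 = (symbol_counts[(1, pos, sym)] + 1) / (class_counts[1] + k)
        p0 = (symbol_counts[(0, pos, sym)] + 1) / (class_counts[0] + k)
        score += math.log(p1) - math.log(p0)
    return score


def predict(model, window: Sequence[str], theta: float) -> int:
    """Return 1 if the ratio exceeds theta, and 0 otherwise."""
    return 1 if log_ratio(model, window) > math.log(theta) else 0


def correlation(pred: List[int], truth: List[int]) -> float:
    """Correlation coefficient between predicted and actual labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def tune_theta(model, examples: List[Example]) -> float:
    """Scan theta from 0.01 to 1.00 and keep the value with the highest CC."""
    scores = [log_ratio(model, window) for window, _ in examples]
    truth = [label for _, label in examples]
    best_theta, best_cc = 1.0, -2.0
    for step in range(1, 101):
        theta = step / 100
        pred = [1 if s > math.log(theta) else 0 for s in scores]
        cc = correlation(pred, truth)
        if cc > best_cc:
            best_theta, best_cc = theta, cc
    return best_theta


def leave_one_out(proteins: List[List[Example]]) -> List[List[int]]:
    """Hold out each protein in turn; train and tune theta on the rest."""
    all_preds = []
    for i, held_out in enumerate(proteins):
        train_set = [ex for j, p in enumerate(proteins) if j != i for ex in p]
        model = train(train_set)
        theta = tune_theta(model, train_set)
        all_preds.append([predict(model, w, theta) for w, _ in held_out])
    return all_preds
```
            </p>
            <p>In this sketch, &#952; is selected on the training proteins only, so the held-out protein plays no part in either parameter estimation or threshold selection.</p>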
         </sec>
         <sec>
            <st>
               <p>Na&#239;ve Bayes classifier using only local sequence identity as input</p>
            </st>
            <p>The input to the Na&#239;ve Bayes classifier contains the identities of 2<it>n</it>+1 residues in the form of <it>X </it>= (<it>x</it><sub><it>t</it>-<it>n </it></sub>, <it>x</it><sub><it>t</it>-<it>n</it>+1 </sub>,...,<it>x</it><sub><it>t</it>-1 </sub>,<it>x</it><sub><it>t </it></sub>,<it>x</it><sub><it>t</it>+1 </sub>,...,<it>x</it><sub><it>t</it>+<it>n</it>-1 </sub>, <it>x</it><sub><it>t</it>+<it>n </it></sub>), where <it>x</it><sub><it>t </it></sub>is the identity of the target residue, and <it>x</it><sub><it>t</it>-<it>n </it></sub>, <it>x</it><sub><it>t</it>-<it>n</it>+1 </sub>,...,<it>x</it><sub><it>t</it>-1 </sub>and <it>x</it><sub><it>t</it>+1 </sub>,...,<it>x</it><sub><it>t</it>+<it>n</it>-1 </sub>, <it>x</it><sub><it>t</it>+<it>n </it></sub> are the identities of the <it>n </it>residues on each side of the target residue. Different values of <it>n </it>from 1 to 10 were tried and the best performance was obtained when <it>n </it>= 4 (corresponding to a window size of 9). A training example is an ordered pair (<it>X</it>, <it>c</it>), where <it>c </it>&#8712; {0, 1}. A label of 1 indicates that the target residue (the residue in the center of the input window) is an interface residue and a label of 0 indicates that the target residue is not an interface residue. For a test example <it>X</it>, the classifier outputs 1 (i.e., the target residue of <it>X </it>is predicted to be an interface residue) or 0 (i.e., the target residue of <it>X </it>is predicted to be a non-interface residue) as the class label of <it>X</it>.</p>
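            <p>The window encoding can be illustrated with the short Python sketch below. It is not the authors' code: the function name is hypothetical, and the treatment of residues near the chain termini is an assumption of this sketch (positions beyond either end are filled with a placeholder symbol), because the text does not specify how incomplete windows were handled.</p>
            <p>
```python
from typing import List, Tuple

PAD = "-"  # hypothetical placeholder for positions beyond the chain ends


def windows(sequence: str, labels: List[int],
            n: int = 4) -> List[Tuple[Tuple[str, ...], int]]:
    """Build one (X, c) training example per residue.

    X holds the identities of the target residue and its n neighbours
    on each side (2n + 1 symbols); c is the label of the target residue.
    """
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    padded = PAD * n + sequence + PAD * n
    examples = []
    for t, label in enumerate(labels):
        # In the padded string the target residue sits at index t + n,
        # so the slice below is centred on it.
        window = tuple(padded[t:t + 2 * n + 1])
        examples.append((window, label))
    return examples


# Toy example with a made-up five-residue sequence and labels.
for window, label in windows("MKRAL", [0, 1, 1, 0, 0], n=1):
    print("".join(window), label)
```
            </p>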
         </sec>
         <sec>
            <st>
               <p>Na&#239;ve Bayes classifier using additional inputs</p>
            </st>
            <p>Relative solvent accessibility (rASA), sequence entropy, secondary structure, electrostatic potential and hydrophobicity were considered. When a feature of the target residue is added into the input of amino acid identities of residues in a 9-residue window, the input to the classifier is encoded as <it>X </it>= (<it>x</it><sub><it>t</it>-<it>n </it></sub>, <it>x</it><sub><it>t</it>-<it>n</it>+1 </sub>,...,<it>x</it><sub><it>t</it>-1 </sub>,<it>x</it><sub><it>t </it></sub>,<it>x</it><sub><it>t</it>+1 </sub>,...,<it>x</it><sub><it>t</it>+<it>n</it>-1 </sub>, <it>x</it><sub><it>t</it>+<it>n </it></sub><it>, f</it><sub><it>t </it></sub>), with <it>f</it><sub><it>t </it></sub>standing for the corresponding feature of the target residue (e.g., sequence entropy, hydrophobicity, etc.), and <it>x</it><sub><it>i </it></sub>denotes the amino acid identity of the corresponding position within the sequence window. When a feature other than residue identity of the input window (i.e., the target residue and its sequence neighbors) is used to encode the local sequence around the target residue, the input to the classifier has the form of <it>X </it>= (<it>f</it><sub><it>t</it>-<it>n </it></sub>, <it>f</it><sub><it>t</it>-<it>n</it>+1 </sub>,...,<it>f</it><sub><it>t</it>-1 </sub>, <it>f</it><sub><it>t </it></sub>, <it>f</it><sub><it>t</it>+1 </sub>,...,<it>f</it><sub><it>t</it>+<it>n</it>-1 </sub>, <it>f</it><sub><it>t</it>+<it>n </it></sub>), where <it>f</it><sub><it>i </it></sub>is the corresponding feature (e.g., hydrophobicity) of the residue <it>i</it>.</p>
            <p>The relative solvent accessible surface area (rASA) of each residue (in the absence of DNA) was computed using NACCESS <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>. The entropy of each sequence position (the sequence entropy of the corresponding column in the multiple sequence alignment) was extracted from the HSSP database <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>. The sequence entropy is normalized to the range of 0&#8211;100, with lower entropy values corresponding to more conserved sequence positions. Secondary structure for each residue was extracted from the PDB database <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>. Electrostatic potential for each atom was calculated using Delphi <abbrgrp><abbr bid="B32">32</abbr><abbr bid="B33">33</abbr></abbrgrp>, using parameters based on the study of Jones <it>et al. </it><abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. The electrostatic potential for each residue was calculated in a manner similar to that of Jones <it>et al. </it><abbrgrp><abbr bid="B10">10</abbr></abbrgrp>: the electrostatic potential of an atom is set to 0 if its solvent accessibility is less than 1 &#197;<sup>2</sup>, and the electrostatic potential of a residue is the average over all its atoms. The hydrophobicity of each residue was obtained from the consensus normalized hydrophobicity scale derived by Eisenberg <it>et al. </it><abbrgrp><abbr bid="B34">34</abbr></abbrgrp>.</p>
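            <p>Two of the per-residue features described above can be sketched in Python as follows. Both functions are hypothetical illustrations rather than the code used in this study. The first averages atomic potentials over a residue after setting buried atoms to 0, as described in the text. The second computes a Shannon entropy for an alignment column and rescales it to 0&#8211;100 by dividing by ln 20; this normalization is an assumption of the sketch, since the entropy values in this study were taken directly from the HSSP database.</p>
            <p>
```python
import math
from collections import Counter
from typing import List, Tuple


def residue_potential(atoms: List[Tuple[float, float]],
                      min_asa: float = 1.0) -> float:
    """Average atomic potential of a residue.

    Each atom is a (potential, solvent accessibility) pair; atoms whose
    accessibility is below min_asa contribute a potential of 0.
    """
    values = [pot if asa >= min_asa else 0.0 for pot, asa in atoms]
    return sum(values) / len(values)


def normalized_entropy(column: str) -> float:
    """Shannon entropy of an ungapped alignment column, rescaled to 0-100."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return 100.0 * entropy / math.log(20)


# Made-up inputs: three atoms of one residue, and two alignment columns.
print(residue_potential([(2.0, 5.0), (-1.0, 0.2), (4.0, 1.0)]))
print(normalized_entropy("KKKKKKKR"), normalized_entropy("KKKKKKKK"))
```
            </p>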
         </sec>
         <sec>
            <st>
               <p>Performance measures</p>
            </st>
            <p>Because no single performance measure provides a complete picture of performance of the classifier <abbrgrp><abbr bid="B35">35</abbr></abbrgrp>, we used a combination of <it>accuracy, correlation coefficient </it>(<it>CC</it>)<it>, specificity </it>and <it>sensitivity</it>. These measures are defined as described in Baldi <it>et al. </it><abbrgrp><abbr bid="B35">35</abbr></abbrgrp>. <m:math name="1471-2105-7-262-i4" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mi>A</m:mi><m:mi>c</m:mi><m:mi>c</m:mi><m:mi>u</m:mi><m:mi>r</m:mi><m:mi>a</m:mi><m:mi>c</m:mi><m:mi>y</m:mi><m:mo>=</m:mo><m:mfrac><m:mrow><m:mi>T</m:mi><m:mi>P</m:mi><m:mo>+</m:mo><m:mi>T</m:mi><m:mi>N</m:mi></m:mrow><m:mi>N</m:mi></m:mfrac><m:mo>;</m:mo><m:mtext>&#160;&#160;</m:mtext><m:mi>C</m:mi><m:mi>C</m:mi><m:mo>=</m:mo><m:mfrac><m:mrow><m:mi>T</m:mi><m:mi>P</m:mi><m:mo>&#215;</m:mo><m:mi>T</m:mi><m:mi>N</m:mi><m:mo>&#8722;</m:mo><m:mi>F</m:mi><m:mi>P</m:mi><m:mo>&#215;</m:mo><m:mi>F</m:mi><m:mi>N</m:mi></m:mrow><m:mrow><m:msqrt><m:mrow><m:mo stretchy="false">(</m:mo><m:mi>T</m:mi><m:mi>P</m:mi><m:mo>+</m:mo><m:mi>F</m:mi><m:mi>N</m:mi><m:mo stretchy="false">)</m:mo><m:mo stretchy="false">(</m:mo><m:mi>T</m:mi><m:mi>P</m:mi><m:mo>+</m:mo><m:mi>F</m:mi><m:mi>P</m:mi><m:mo stretchy="false">)</m:mo><m:mo stretchy="false">(</m:mo><m:mi>T</m:mi><m:mi>N</m:mi><m:mo>+</m:mo><m:mi>F</m:mi><m:mi>P</m:mi><m:mo stretchy="false">)</m:mo><m:mo stretchy="false">(</m:mo><m:mi>T</m:mi><m:mi>N</m:mi><m:mo>+</m:mo><m:mi>F</m:mi><m:mi>N</m:mi><m:mo 
stretchy="false">)</m:mo></m:mrow></m:msqrt></m:mrow></m:mfrac><m:mo>;</m:mo><m:mi>S</m:mi><m:mi>e</m:mi><m:mi>n</m:mi><m:mi>s</m:mi><m:mi>i</m:mi><m:mi>t</m:mi><m:mi>i</m:mi><m:mi>v</m:mi><m:mi>i</m:mi><m:mi>t</m:mi><m:mi>y</m:mi><m:mo>=</m:mo><m:mfrac><m:mrow><m:mi>T</m:mi><m:mi>P</m:mi></m:mrow><m:mrow><m:mi>T</m:mi><m:mi>P</m:mi><m:mo>+</m:mo><m:mi>F</m:mi><m:mi>N</m:mi></m:mrow></m:mfrac><m:mo>;</m:mo><m:mi>S</m:mi><m:mi>p</m:mi><m:mi>e</m:mi><m:mi>c</m:mi><m:mi>i</m:mi><m:mi>f</m:mi><m:mi>i</m:mi><m:mi>c</m:mi><m:mi>i</m:mi><m:mi>t</m:mi><m:mi>y</m:mi><m:mo>=</m:mo><m:mfrac><m:mrow><m:mi>T</m:mi><m:mi>P</m:mi></m:mrow><m:mrow><m:mi>T</m:mi><m:mi>P</m:mi><m:mo>+</m:mo><m:mi>F</m:mi><m:mi>P</m:mi></m:mrow></m:mfrac></m:mrow></m:semantics></m:math>, where <it>TP</it> = the number of <it>true positives</it> (residues predicted to be DNA-binding residues that are in fact interface residues); <it>FP</it> = the number of <it>false positives</it> (residues predicted to be DNA-binding residues that are in fact not interface residues); <it>TN</it> = the number of <it>true negatives</it> (residues predicted to be non-DNA-binding residues that are in fact not DNA-binding residues); <it>FN</it> = the number of <it>false negatives</it> (residues predicted to be non-DNA-binding residues that are in fact DNA-binding residues); <it>N</it> = <it>TP+TN+FP+FN</it> (the total number of examples).</p>
            <p><it>Sensitivity </it>is the fraction of positive examples (DNA-binding residues) that are predicted as such by the classifier. <it>Specificity </it>is the fraction of positive predictions (residues predicted to be DNA-binding residues) that are actually interface residues; this quantity is also widely known as precision or positive predictive value. <it>Accuracy </it>is the fraction of all predictions that are correct. <it>Correlation coefficient </it>measures the correlation between predictions and actual class labels and ranges from -1 (complete disagreement) to 1 (perfect prediction).</p>
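            <p>As an illustration only, the four measures defined above can be computed from the confusion counts as in the following minimal Python sketch. The function name <it>classifier_metrics</it> and the choice to return 0.0 when a denominator is zero are assumptions made for this example and are not part of the original analysis.</p>

```python
import math


def classifier_metrics(tp, tn, fp, fn):
    """Compute the four measures from confusion counts (hypothetical helper).

    Specificity follows the definition used in this article,
    TP / (TP + FP), which is often called precision elsewhere.
    Returning 0.0 for a zero denominator is an assumption of this sketch.
    """
    n = tp + tn + fp + fn
    cc_denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / n,
        "cc": (tp * tn - fp * fn) / cc_denominator if cc_denominator else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tp / (tp + fp) if (tp + fp) else 0.0,
    }


if __name__ == "__main__":
    # Toy counts chosen for illustration; they are not results from the paper.
    print(classifier_metrics(tp=40, tn=40, fp=10, fn=10))
```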
            <p>The Receiver Operating Characteristic curve (ROC curve) is a plot of the "hit rate" (<it>TP</it>/(<it>TP+FN</it>)) versus the "false alarm rate" (<it>FP/</it>(<it>TN+FP</it>)) <abbrgrp><abbr bid="B35">35</abbr></abbrgrp>. It shows the tradeoff between hit rate and false alarm rate when different threshold values are used for the classifier.</p>
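            <p>The construction of ROC points can be sketched as follows, assuming that the classifier assigns each residue a numeric score and that a residue is predicted to be DNA-binding when its score is at or above the threshold. The function name <it>roc_points</it> and the toy scores are hypothetical, and the sketch assumes that both classes are present in the labels.</p>

```python
def roc_points(scores, labels):
    """Return (false_alarm_rate, hit_rate) pairs, one per distinct threshold.

    scores: classifier outputs, higher meaning more likely DNA-binding.
    labels: 1 for DNA-binding residues, 0 otherwise.
    Assumes at least one positive and one negative example (sketch only).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # a threshold above every score predicts no positives
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        # hit rate = TP / (TP + FN); false alarm rate = FP / (TN + FP)
        points.append((fp / negatives, tp / positives))
    return points


if __name__ == "__main__":
    # Toy scores and labels for illustration only.
    for false_alarm_rate, hit_rate in roc_points([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]):
        print(false_alarm_rate, hit_rate)
```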
         </sec>
         <sec>
            <st>
               <p>Identifying PROSITE motifs in protein sequences</p>
            </st>
            <p>The PROSITE motif database was downloaded from PROSITE <abbrgrp><abbr bid="B36">36</abbr></abbrgrp>. Protein sequences were scanned with the ps-scan program <abbrgrp><abbr bid="B37">37</abbr></abbrgrp> to identify motifs. Frequently matching (unspecific) patterns and profiles were omitted by setting the "-s" and "-r" options of ps-scan.</p>
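            <p>To illustrate what matching a PROSITE pattern against a sequence involves, the following Python sketch translates the basic PROSITE pattern syntax into a regular expression. It is not the ps-scan implementation: the function names are hypothetical, profiles and terminal anchors are not handled, and only non-overlapping matches are reported. The pattern and the sequence in the example are used purely for illustration.</p>

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into a Python regular expression.

    Handles element separators (-), wildcards (x), repeats such as x(2,4),
    allowed sets [..] and excluded sets {..}. Terminal anchors and
    profiles are deliberately left out of this sketch.
    """
    body = pattern.rstrip(".")          # PROSITE patterns end with a period
    body = body.replace("-", "")        # drop element separators
    body = re.sub(r"\{([A-Z]+)\}", r"[^\1]", body)          # {AB} -> [^AB]
    body = re.sub(r"\((\d+(?:,\d+)?)\)", r"{\1}", body)     # (2,4) -> {2,4}
    return body.replace("x", ".")       # x matches any residue


def find_motif(sequence, pattern):
    """Return 0-based (start, end) spans of non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.end()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # Illustrative C2H2-type zinc finger pattern and an artificial sequence.
    zinc_finger = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H."
    toy_sequence = "MK" + "CAAC" + "AAAL" + "A" * 8 + "HAAAH" + "GG"
    print(prosite_to_regex(zinc_finger))
    print(find_motif(toy_sequence, zinc_finger))
```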
         </sec>
      </sec>
      <sec>
         <st>
            <p>Competing interests</p>
         </st>
         <p>The author(s) declare that they have no competing interests.</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>CY carried out the computations, prepared an initial draft of the manuscript, and participated in discussions and manuscript revisions. MT, FW, and RLJ participated in discussions and manuscript reviews. DD and VH participated in experimental design, discussions, and manuscript preparation and revisions. All authors read and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>This research was supported in part by a grant from the National Institutes of Health (GM 066387) to VH, DD, and RLJ. We thank O. Yakhnenko and D. Caragea for providing comments on the manuscript. We thank Dr. S. Ahmad and Dr. A. Sarai for sharing the details of their PSSM-based neural network classifier.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Transcription factor therapeutics: long-shot or lodestone</p>
            </title>
            <aug>
               <au>
                  <snm>Ghosh</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Papavassiliou</snm>
                  <fnm>AG</fnm>
               </au>
            </aug>
            <source>Curr Med Chem</source>
            <pubdate>2005</pubdate>
            <volume>12</volume>
            <fpage>691</fpage>
            <lpage>701</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15790306</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Designing transcription factor architectures for drug discovery</p>
            </title>
            <aug>
               <au>
                  <snm>Blancafort</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Segal</snm>
                  <fnm>DJ</fnm>
               </au>
               <au>
                  <snm>Barbas</snm>
                  <fnm>CFIII</fnm>
               </au>
            </aug>
            <source>Mol Pharmacol</source>
            <pubdate>2004</pubdate>
            <volume>66</volume>
            <fpage>1361</fpage>
            <lpage>1371</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1124/mol.104.002758</pubid>
                  <pubid idtype="pmpid" link="fulltext">15340042</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Transcription factors: structural families and principles of DNA recognition.</p>
            </title>
            <aug>
               <au>
                  <snm>Pabo</snm>
                  <fnm>CO</fnm>
               </au>
               <au>
                  <snm>Sauer</snm>
                  <fnm>RT</fnm>
               </au>
            </aug>
            <source>Annu Rev Biochem</source>
            <pubdate>1992</pubdate>
            <volume>61</volume>
            <fpage>1053</fpage>
            <lpage>1095</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1146/annurev.bi.61.070192.005201</pubid>
                  <pubid idtype="pmpid" link="fulltext">1497306</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Zinc finger proteins: new insights into structural and functional diversity</p>
            </title>
            <aug>
               <au>
                  <snm>Laity</snm>
                  <fnm>JH</fnm>
               </au>
               <au>
                  <snm>Lee</snm>
                  <fnm>BM</fnm>
               </au>
               <au>
                  <snm>Wright</snm>
                  <fnm>PE</fnm>
               </au>
            </aug>
            <source>Current Opinion in Structural Biology</source>
            <pubdate>2001</pubdate>
            <volume>11</volume>
            <fpage>39</fpage>
            <lpage>46</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0959-440X(00)00167-6</pubid>
                  <pubid idtype="pmpid" link="fulltext">11179890</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Catabolite activator protein: DNA binding and transcription activation</p>
            </title>
            <aug>
               <au>
                  <snm>Lawson</snm>
                  <fnm>CL</fnm>
               </au>
               <au>
                  <snm>Swigon</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Murakami</snm>
                  <fnm>KS</fnm>
               </au>
               <au>
                  <snm>Darst</snm>
                  <fnm>SA</fnm>
               </au>
               <au>
                  <snm>Berman</snm>
                  <fnm>HM</fnm>
               </au>
               <au>
                  <snm>Ebright</snm>
                  <fnm>RH</fnm>
               </au>
            </aug>
            <source>Current Opinion in Structural Biology</source>
            <pubdate>2004</pubdate>
            <volume>14</volume>
            <fpage>10</fpage>
            <lpage>20</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.sbi.2004.01.012</pubid>
                  <pubid idtype="pmpid" link="fulltext">15102444</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Transcription factors: global and detailed views</p>
            </title>
            <aug>
               <au>
                  <snm>Muller</snm>
                  <fnm>CW</fnm>
               </au>
            </aug>
            <source>Current Opinion in Structural Biology</source>
            <pubdate>2001</pubdate>
            <volume>11</volume>
            <fpage>26</fpage>
            <lpage>32</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0959-440X(00)00163-9</pubid>
                  <pubid idtype="pmpid" link="fulltext">11179888</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Identification of amino acids important for target recognition by the DNA:m5C methyltransferase M.NgoPII by alanine-scanning mutagenesis of residues at the protein-DNA interface</p>
            </title>
            <aug>
               <au>
                  <snm>Radlinska</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Kondrzycka-Dada</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Piekarowicz</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Bujnicki</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>2005</pubdate>
            <volume>58</volume>
            <fpage>263</fpage>
            <lpage>270</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/prot.20297</pubid>
                  <pubid idtype="pmpid" link="fulltext">15558546</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>A comprehensive alanine scanning mutagenesis of the Escherichia coli transcriptional activator SoxS: identifying amino acids important for DNA binding and transcription activation</p>
            </title>
            <aug>
               <au>
                  <snm>Griffith</snm>
                  <fnm>KL</fnm>
               </au>
               <au>
                  <snm>Wolf</snm>
                  <fnm>JRE</fnm>
               </au>
            </aug>
            <source>Journal of Molecular Biology</source>
            <pubdate>2002</pubdate>
            <volume>322</volume>
            <fpage>237</fpage>
            <lpage>257</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0022-2836(02)00782-9</pubid>
                  <pubid idtype="pmpid" link="fulltext">12217688</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>A novel strategy for the identification of protein-DNA contacts by photocrosslinking and mass spectrometry</p>
            </title>
            <aug>
               <au>
                  <snm>Geyer</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Geyer</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Pingoud</snm>
                  <fnm>V</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <fpage>e132</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">519130</pubid>
                  <pubid idtype="pmpid" link="fulltext">15383647</pubid>
                  <pubid idtype="doi">10.1093/nar/gnh131</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Using electrostatic potentials to predict DNA-binding sites on DNA-binding proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Jones</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Shanahan</snm>
                  <fnm>HP</fnm>
               </au>
               <au>
                  <snm>Berman</snm>
                  <fnm>HM</fnm>
               </au>
               <au>
                  <snm>Thornton</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <fpage>7189</fpage>
            <lpage>7198</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">291864</pubid>
                  <pubid idtype="pmpid" link="fulltext">14654694</pubid>
                  <pubid idtype="doi">10.1093/nar/gkg922</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Identifying DNA-binding proteins using structural motifs and the electrostatic potential</p>
            </title>
            <aug>
               <au>
                  <snm>Shanahan</snm>
                  <fnm>HP</fnm>
               </au>
               <au>
                  <snm>Garcia</snm>
                  <fnm>MA</fnm>
               </au>
               <au>
                  <snm>Jones</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Thornton</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <fpage>4732</fpage>
            <lpage>4741</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">519102</pubid>
                  <pubid idtype="pmpid" link="fulltext">15356290</pubid>
                  <pubid idtype="doi">10.1093/nar/gkh803</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Structure-based prediction of DNA-binding sites on proteins using the empirical preference of electrostatic potential and the shape of molecular surfaces</p>
            </title>
            <aug>
               <au>
                  <snm>Tsuchiya</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Kinoshita</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Nakamura</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>2004</pubdate>
            <volume>55</volume>
            <fpage>885</fpage>
            <lpage>894</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/prot.20111</pubid>
                  <pubid idtype="pmpid" link="fulltext">15146487</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Pattern recognition strategies for molecular surfaces: III. Binding site prediction with a neural network</p>
            </title>
            <aug>
               <au>
                  <snm>Keil</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Exner</snm>
                  <fnm>TE</fnm>
               </au>
               <au>
                  <snm>Brickmann</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>J Comput Chem</source>
            <pubdate>2004</pubdate>
            <volume>25</volume>
            <fpage>779</fpage>
            <lpage>789</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/jcc.10361</pubid>
                  <pubid idtype="pmpid" link="fulltext">15011250</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information</p>
            </title>
            <aug>
               <au>
                  <snm>Ahmad</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Gromiha</snm>
                  <fnm>MM</fnm>
               </au>
               <au>
                  <snm>Sarai</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>20</volume>
            <fpage>477</fpage>
            <lpage>486</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btg432</pubid>
                  <pubid idtype="pmpid" link="fulltext">14990443</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>PSSM-based prediction of DNA binding sites in proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Ahmad</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Sarai</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <fpage>33</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">550660</pubid>
                  <pubid idtype="pmpid" link="fulltext">15720719</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-6-33</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Prediction of DNA-binding residues by PSSM and sequence homology</p>
            </title>
            <source>http://www.netasa.org/dbs-pssm/</source>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Crystal structure of DNA sequence specificity subunit of a type I restriction-modification enzyme and its functional implications</p>
            </title>
            <aug>
               <au>
                  <snm>Kim</snm>
                  <fnm>JS</fnm>
               </au>
               <au>
                  <snm>DeGiovanni</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Jancarik</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Adams</snm>
                  <fnm>PD</fnm>
               </au>
               <au>
                  <snm>Yokota</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Kim</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Kim</snm>
                  <fnm>SH</fnm>
               </au>
            </aug>
            <source>PNAS</source>
            <pubdate>2005</pubdate>
            <volume>102</volume>
            <fpage>3248</fpage>
            <lpage>3253</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">549290</pubid>
                  <pubid idtype="pmpid" link="fulltext">15728358</pubid>
                  <pubid idtype="doi">10.1073/pnas.0409851102</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Prediction of protein-protein interaction sites using patch analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Jones</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Thornton</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1997</pubdate>
            <volume>272</volume>
            <fpage>133</fpage>
            <lpage>143</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.1997.1233</pubid>
                  <pubid idtype="pmpid" link="fulltext">9299343</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Predicting binding sites of hydrolase-inhibitor complexes by combining several methods</p>
            </title>
            <aug>
               <au>
                  <snm>Sen</snm>
                  <fnm>TZ</fnm>
               </au>
               <au>
                  <snm>Kloczkowski</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Jernigan</snm>
                  <fnm>RL</fnm>
               </au>
               <au>
                  <snm>Yan</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Honavar</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Ho</snm>
                  <fnm>KM</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>CZ</fnm>
               </au>
               <au>
                  <snm>Ihm</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Cao</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Gu</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Dobbs</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>5</volume>
            <fpage>205</fpage>
            <xrefbib>
               <pubid idtype="doi">10.1186/1471-2105-5-205</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>A two-stage classifier for identification of protein-protein interface residues</p>
            </title>
            <aug>
               <au>
                  <snm>Yan</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Dobbs</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Honavar</snm>
                  <fnm>V</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>20</volume>
            <fpage>i371</fpage>
            <lpage>i378</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bth920</pubid>
                  <pubid idtype="pmpid" link="fulltext">15262822</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Identification of interface residues in protease-inhibitor and antigen-antibody complexes: a support vector machine approach</p>
            </title>
            <aug>
               <au>
                  <snm>Yan</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Honavar</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Dobbs</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Neural Computing &amp; Applications</source>
            <pubdate>2004</pubdate>
            <volume>13</volume>
            <fpage>123</fpage>
            <lpage>129</lpage>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Prediction of RNA-binding sites in proteins based on amino acid sequence</p>
            </title>
            <aug>
               <au>
                  <snm>Terribilini</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Lee</snm>
                  <fnm>JH</fnm>
               </au>
               <au>
                  <snm>Yan</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Jernigan</snm>
                  <fnm>RL</fnm>
               </au>
               <au>
                  <snm>Honavar</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Dobbs</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <note>Submitted</note>
         </bibl>
         <bibl id="B23">
            <title>
               <p>The Protein Data Bank</p>
            </title>
            <aug>
               <au>
                  <snm>Berman</snm>
                  <fnm>HM</fnm>
               </au>
               <au>
                  <snm>Westbrook</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Feng</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Gilliland</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Bhat</snm>
                  <fnm>TN</fnm>
               </au>
               <au>
                  <snm>Weissig</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Shindyalov</snm>
                  <fnm>IN</fnm>
               </au>
               <au>
                  <snm>Bourne</snm>
                  <fnm>PE</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>2000</pubdate>
            <volume>28</volume>
            <fpage>235</fpage>
            <lpage>242</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">102472</pubid>
                  <pubid idtype="pmpid" link="fulltext">10592235</pubid>
                  <pubid idtype="doi">10.1093/nar/28.1.235</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>PISCES: a protein sequence culling server</p>
            </title>
            <aug>
               <au>
                  <snm>Wang</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Dunbrack</snm>
                  <fnm>RLJ</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>19</volume>
            <fpage>1589</fpage>
            <lpage>1591</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btg224</pubid>
                  <pubid idtype="pmpid" link="fulltext">12912846</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>PDB derived data</p>
            </title>
            <source>ftp://ftp.rcsb.org/pub/pdb/derived_data/</source>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Gene ontology annotation</p>
            </title>
            <source>http://www.ebi.ac.uk/GOA/</source>
         </bibl>
         <bibl id="B27">
            <title>
               <p>NACCESS</p>
            </title>
            <aug>
               <au>
                  <snm>Hubbard</snm>
                  <fnm>SJ</fnm>
               </au>
            </aug>
            <publisher>Department of Biochemistry and Molecular Biology, University College, London.</publisher>
            <pubdate>1993</pubdate>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Data mining: practical machine learning tools and techniques with Java implementations</p>
            </title>
            <aug>
               <au>
                  <snm>Witten</snm>
                  <fnm>IH</fnm>
               </au>
               <au>
                  <snm>Frank</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <publisher>San Mateo, CA, Morgan Kaufmann</publisher>
            <pubdate>1999</pubdate>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Weka 3: Data mining software in Java</p>
            </title>
            <source>http://www.cs.waikato.ac.nz/~ml/weka/</source>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Theory refinement on Bayesian networks. Los Angeles, CA.</p>
            </title>
            <aug>
               <au>
                  <snm>Buntine</snm>
                  <fnm>W</fnm>
               </au>
            </aug>
            <publisher/>
            <pubdate>1991</pubdate>
            <fpage>52</fpage>
            <lpage>60</lpage>
         </bibl>
         <bibl id="B31">
            <title>
               <p>Database of homology derived protein structures and the structural meaning of sequence alignment</p>
            </title>
            <aug>
               <au>
                  <snm>Sander</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Schneider</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>1991</pubdate>
            <volume>9</volume>
            <fpage>56</fpage>
            <lpage>68</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/prot.340090107</pubid>
                  <pubid idtype="pmpid">2017436</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>Extending the applicability of the nonlinear Poisson-Boltzmann equation: multiple dielectric constants and multivalent ions</p>
            </title>
            <aug>
               <au>
                  <snm>Rocchia</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Alexov</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Honig</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Journal of Physical Chemistry B</source>
            <pubdate>2001</pubdate>
            <volume>105</volume>
            <fpage>6507</fpage>
            <lpage>6514</lpage>
         </bibl>
         <bibl id="B33">
            <title>
               <p>Rapid grid-based construction of the molecular surface for both molecules and geometric objects: applications to the finite difference Poisson-Boltzmann method</p>
            </title>
            <aug>
               <au>
                  <snm>Rocchia</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Sridharan</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Nicholls</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Alexov</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Chiabrera</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Honig</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Journal of Computational Chemistry</source>
            <pubdate>2002</pubdate>
            <volume>23</volume>
            <fpage>128</fpage>
            <lpage>137</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/jcc.1161</pubid>
                  <pubid idtype="pmpid" link="fulltext">11913378</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B34">
            <title>
               <p>The hydrophobic moment detects periodicity in protein hydrophobicity</p>
            </title>
            <aug>
               <au>
                  <snm>Eisenberg</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Weiss</snm>
                  <fnm>RM</fnm>
               </au>
               <au>
                  <snm>Terwilliger</snm>
                  <fnm>TC</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>1984</pubdate>
            <volume>81</volume>
         </bibl>
         <bibl id="B35">
            <title>
               <p>Assessing the accuracy of prediction algorithms for classification: an overview</p>
            </title>
            <aug>
               <au>
                  <snm>Baldi</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Brunak</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Chauvin</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Andersen</snm>
                  <fnm>CAF</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2000</pubdate>
            <volume>16</volume>
            <fpage>412</fpage>
            <lpage>424</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/16.5.412</pubid>
                  <pubid idtype="pmpid" link="fulltext">10871264</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B36">
            <title>
               <p>The PROSITE database</p>
            </title>
            <aug>
               <au>
                  <snm>Hulo</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Bairoch</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Bulliard</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Cerutti</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>De Castro</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Langendijk-Genevaux</snm>
                  <fnm>PS</fnm>
               </au>
               <au>
                  <snm>Pagni</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Sigrist</snm>
                  <fnm>CJA</fnm>
               </au>
            </aug>
            <source>Nucl Acids Res</source>
            <pubdate>2006</pubdate>
            <volume>34</volume>
            <fpage>D227</fpage>
            <lpage>D230</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1347426</pubid>
                  <pubid idtype="pmpid" link="fulltext">16381852</pubid>
                  <pubid idtype="doi">10.1093/nar/gkj063</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B37">
            <title>
               <p>ps_scan program</p>
            </title>
            <source>ftp://ca.expasy.org/databases/prosite/tools/ps_scan/</source>
         </bibl>
         <bibl id="B38">
            <title>
               <p>Protein Explorer: easy yet powerful macromolecular visualization</p>
            </title>
            <aug>
               <au>
                  <snm>Martz</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Trends Biochem Sci</source>
            <pubdate>2002</pubdate>
            <volume>27</volume>
            <fpage>107</fpage>
            <lpage>109</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0968-0004(01)02008-4</pubid>
                  <pubid idtype="pmpid" link="fulltext">11852249</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
