<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
	<ui>1471-2105-7-S1-S2</ui>
	<ji>1471-2105</ji>
	<fm>
		<dochead>Proceedings</dochead>
		<bibl>
			<title>
				<p>Choosing negative examples for the prediction of protein-protein interactions</p>
			</title>
			<aug>
				<au id="A1" ca="yes">
					<snm>Ben-Hur</snm>
					<fnm>Asa</fnm>
					<insr iid="I1"/>
					<insr iid="I2"/>
					<email>asa@cs.colostate.edu</email>
				</au>
				<au id="A2">
					<snm>Noble</snm>
					<mnm>Stafford</mnm>
					<fnm>William</fnm>
					<insr iid="I3"/>
					<insr iid="I4"/>
					<email>noble@gs.washington.edu</email>
				</au>
			</aug>
			<insg>
				<ins id="I1">
					<p>Department of Computer Science, Colorado State University, Fort Collins CO, USA</p>
				</ins>
				<ins id="I2">
					<p>Department of Computer Science, University of Colorado, Boulder CO, USA</p>
				</ins>
				<ins id="I3">
					<p>Department of Genome Sciences, University of Washington, Seattle WA, USA</p>
				</ins>
				<ins id="I4">
					<p>Department of Computer Science and Engineering, University of Washington, Seattle WA, USA</p>
				</ins>
			</insg>
			<source>BMC Bioinformatics</source>
			<supplement>
				<title>
					<p>NIPS workshop on New Problems and Methods in Computational Biology</p>
				</title>
				<editor>Gal Chechik, Christina Leslie, Gunnar R&#228;tsch, Koji Tsuda</editor>
				<note>Proceedings</note>
				<url>http://www.biomedcentral.com/content/pdf/1471-2105-7-S1-info.pdf</url>
			</supplement>
			<conference>
				<title>
					<p>NIPS workshop on New Problems and Methods in Computational Biology</p>
				</title>
				<location>Whistler, Canada</location>
				<date-range>18 December 2004</date-range>
			</conference>
			<issn>1471-2105</issn>
			<pubdate>2006</pubdate>
			<volume>7</volume>
			<issue>Suppl 1</issue>
			<fpage>S2</fpage>
			<xrefbib>
				<pubidlist><pubid idtype="pmpid">16723005</pubid><pubid idtype="doi">10.1186/1471-2105-7-S1-S2</pubid>
				</pubidlist></xrefbib>
		</bibl>
		<history>
			<pub>
				<date>
					<day>20</day>
					<month>3</month>
					<year>2006</year>
				</date>
			</pub>
		</history>
		<abs>
			<sec>
				<st>
					<p>Abstract</p>
				</st>
				<p>The protein-protein interaction networks of even well-studied model organisms are sketchy at best, highlighting the continued need for computational methods to help direct experimentalists in the search for novel interactions. This need has prompted the development of a number of methods for predicting protein-protein interactions based on various sources of data and methodologies. The common method for choosing negative examples for training a predictor of protein-protein interactions is based on annotations of cellular localization, and the observation that pairs of proteins that have different localization patterns are unlikely to interact. While this method leads to high-quality sets of non-interacting proteins, we find that this choice can lead to biased estimates of prediction accuracy, because the constraints placed on the distribution of the negative examples make the task easier. The effects of this bias are demonstrated in the context of both sequence-based and non-sequence-based features used for predicting protein-protein interactions.</p>
			</sec>
		</abs>
	</fm>
	<bdy>
		<sec>
			<st>
				<p>Background</p>
			</st>
			<p>Despite advances in high-throughput experimental methods for detecting protein-protein interactions, the interaction networks for even well-studied model organisms are far from complete. In addition, high-throughput assays typically have a high rate of false positives <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. Therefore, there is a continuing need for computational methods to complement existing experimental approaches.</p>
			<p>Methods for predicting protein-protein interaction use a variety of data sources. Sequence-based methods are usually based on the domain, motif, or k-mer composition of the sequences. Sprinzak and Margalit <abbrgrp><abbr bid="B2">2</abbr></abbrgrp> have noted that many pairs of structural domains tend to appear in interacting proteins, and have used this intuition to predict interactions according to the over-representation of pairs of domains. Domain and motif composition is also the basis of several Bayesian network models that aim to explain an observed interaction network in terms of interactions between pairs of motifs or domains <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr></abbrgrp>. In the context of kernel methods, similar kernels designed for predicting interactions from sequence were proposed in <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp>. Other sequence-based methods use co-evolution of interacting proteins by comparing phylogenetic trees <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>, correlated mutations <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>, or gene fusion <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. An alternative approach is to combine multiple sources of genomic information &#8212; gene expression, Gene Ontology annotations, transcriptional regulation, etc. &#8212; to predict co-membership in a complex <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp>.</p>
			<p>All the above-mentioned methods require an informed choice of positive examples (interacting pairs of proteins) and negative examples (non-interacting pairs of proteins) for training and assessing the performance of a classifier. In view of the large fraction of false positive interactions generated by high-throughput methods, positive examples need to be chosen with care. These are often chosen as interactions generated by reliable methods (small-scale experiments), interactions confirmed by several methods, or interactions confirmed by interacting paralogs <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B11">11</abbr><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp>.</p>
			<p>Negative examples also need to be chosen with care, and two such selection methods have been described in the literature. Because there are no "gold standard" non-interactions, some authors suggest that high-quality non-interactions can be generated by considering pairs of proteins whose cellular localization is different, most likely preventing the proteins from participating in a biologically relevant interaction <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B16">16</abbr></abbrgrp>. Other authors use a simpler scheme, selecting non-interacting pairs uniformly at random from the set of all protein pairs that are not known to interact <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B7">7</abbr><abbr bid="B12">12</abbr><abbr bid="B17">17</abbr></abbrgrp>.</p>
			<p>In this paper, we argue that the first method is not appropriate for assessing classifier accuracy. In particular, we show that restricting negative examples to non-co-localized protein pairs leads to a biased estimate of the accuracy of a predictor of protein-protein interactions. The basic assumption underlying the assessment of the accuracy of a classifier is that the distribution of testing examples reflects the intended use of the method. In the case of predicting protein-protein interactions, a simple uniform random choice of non-interacting protein pairs yields an unbiased estimate of the true distribution. In contrast, imposing the constraint of non-co-localization may induce a different distribution on the features that are used for classification. The resulting biased distribution of negative examples leads to over-optimistic estimates of classifier accuracy. This bias is likely to affect results reported in several papers <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B11">11</abbr></abbrgrp>.</p>
			<p>The simpler selection scheme &#8212; choosing negative examples uniformly at random &#8212; also has potential pitfalls: because the interaction network is not complete, the set of negative examples can be contaminated with interacting proteins. This contamination, however, is likely to be very small: it has been estimated that the number of interactions in yeast is well below 100,000 <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B18">18</abbr></abbrgrp>, a number which is 0.25 percent of the total number of protein pairs in yeast. This effect is likely to be much smaller than the contamination of even high-quality positive examples; moreover, our results show that a support vector machine classifier is resistant to even higher levels of label contamination.</p>
		</sec>
		<sec>
			<st>
				<p>Results</p>
			</st>
			<p>In this paper we postulate that testing a classifier of protein-protein interactions on negative examples composed of pairs of proteins that are not co-localized results in a biased assessment of classifier accuracy. In order to test this hypothesis we need to define "co-localization." We do this using the subcellular localization component of the Gene Ontology (GO). GO keywords are becoming the standard in annotating gene products <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. These keywords are arranged in a hierarchical manner in a rooted, directed acyclic graph, where keywords lower in the hierarchy represent more specific terms. Therefore, one cannot say that two proteins are not co-localized simply because they don't share the exact same GO terms. As a similarity measure between two GO terms we use the negative log of the fraction of genes annotated with the lowest common ancestor of the two terms. This similarity was introduced as a similarity measure on a hierarchy in <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>, used in the context of GO annotations in <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>, and used in a kernel in <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. Using this measure of similarity allows us to generate parameterized sets of negative examples characterized by a maximum degree of similarity allowed between their GO cellular compartment annotations.</p>
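			<p>The similarity measure described above can be illustrated with a minimal sketch. The hierarchy, the term names, and the gene fractions below are hypothetical toy values, not GO data, and the function names are our own. Because a term in a directed acyclic graph can share several ancestors with another term, the sketch assumes that the relevant common ancestor is the one annotating the smallest fraction of genes; the scaling of the resulting values is likewise an assumption.</p>

```python
import math

# Hypothetical toy hierarchy: each term maps to its list of parent terms.
PARENTS = {
    "cell": [],
    "nucleus": ["cell"],
    "cytoplasm": ["cell"],
    "nucleolus": ["nucleus"],
    "mitochondrion": ["cytoplasm"],
}

# Hypothetical fraction of genes annotated with each term or a descendant.
FRACTION = {
    "cell": 1.0,
    "nucleus": 0.3,
    "cytoplasm": 0.5,
    "nucleolus": 0.05,
    "mitochondrion": 0.1,
}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def go_similarity(term_a, term_b):
    """Negative log of the gene fraction at the most specific shared ancestor."""
    shared = ancestors(term_a).intersection(ancestors(term_b))
    # Several shared ancestors may exist in a DAG; use the rarest one.
    rarest = min(FRACTION[term] for term in shared)
    return -math.log(rarest)
```

			<p>In this toy hierarchy, two terms that share only the root receive a similarity of zero, and the similarity grows as the shared ancestor becomes more specific.</p>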
			<p>Perhaps the simplest way to predict protein-protein interactions is to represent pairs of proteins by a set of genomic features that reflect how likely they are to interact. Examples of features that were used for this task are similarity of GO process and GO function annotations, correlation of gene expression, presence of similar transcription factor binding sites in the upstream region of the genes, participation in common regulatory modules and so on <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp>. Table <tblr tid="T1">1</tblr> illustrates that as we lower the upper bound on the allowed similarity between the cellular compartment annotations of pairs of proteins in the negative examples (called the <it>co-localization threshold </it>in what follows), GO function and process annotations and microarray data become more predictive of protein-protein interactions, as measured using the ROC score (the area under the receiver operating characteristic curve). This observation is not surprising. Consider, for example, biological process annotations. Interacting proteins often participate in similar processes. Conversely, negative examples that are not co-localized will be less likely to participate in similar biological processes, making this variable more predictive of interaction. A similar argument holds for the GO function annotations and gene expression correlations. Note that GO function annotations are less predictive than the process annotations because interactions are often required for carrying out a particular process, whereas proteins that carry out the same function can do so in different contexts, not requiring interaction.</p>
			<tbl id="T1">
				<title>
					<p>Table 1</p>
				</title>
				<caption>
					<p>The dependence of ROC scores of several variables on the co-localization threshold for the MIPS/DIP interaction data. The variables are: GO process similarity, GO function similarity, and correlations between microarray data under various environmental conditions [19]. For each threshold we computed the average ROC scores for 10 drawings of the negative examples. The standard deviation is shown in parentheses.</p>
				</caption>
				<tblbdy cols="4">
					<r>
						<c ca="center">
							<p>threshold</p>
						</c>
						<c ca="center">
							<p>GO process</p>
						</c>
						<c ca="center">
							<p>GO function</p>
						</c>
						<c ca="center">
							<p>microarray</p>
						</c>
					</r>
					<r>
						<c cspan="4">
							<hr/>
						</c>
					</r>
					<r>
						<c ca="center">
							<p>1.00</p>
						</c>
						<c ca="center">
							<p>0.81 (0.001)</p>
						</c>
						<c ca="center">
							<p>0.64 (0.002)</p>
						</c>
						<c ca="center">
							<p>0.64 (0.005)</p>
						</c>
					</r>
					<r>
						<c ca="center">
							<p>0.50</p>
						</c>
						<c ca="center">
							<p>0.82 (0.001)</p>
						</c>
						<c ca="center">
							<p>0.65 (0.004)</p>
						</c>
						<c ca="center">
							<p>0.64 (0.003)</p>
						</c>
					</r>
					<r>
						<c ca="center">
							<p>0.20</p>
						</c>
						<c ca="center">
							<p>0.82 (0.002)</p>
						</c>
						<c ca="center">
							<p>0.66 (0.005)</p>
						</c>
						<c ca="center">
							<p>0.65 (0.005)</p>
						</c>
					</r>
					<r>
						<c ca="center">
							<p>0.10</p>
						</c>
						<c ca="center">
							<p>0.83 (0.002)</p>
						</c>
						<c ca="center">
							<p>0.66 (0.005)</p>
						</c>
						<c ca="center">
							<p>0.66 (0.003)</p>
						</c>
					</r>
					<r>
						<c ca="center">
							<p>0.04</p>
						</c>
						<c ca="center">
							<p>0.83 (0.001)</p>
						</c>
						<c ca="center">
							<p>0.67 (0.004)</p>
						</c>
						<c ca="center">
							<p>0.66 (0.004)</p>
						</c>
					</r>
				</tblbdy>
			</tbl>
			<p>Using non-co-localized negative examples can lead to a bias when using sequence-based features as well. In this case the features are pairs of sequence features, e.g., motifs or k-mers that belong to a pair of protein sequences. Such a kernel was used in <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp> with a support vector machine (SVM) classifier. The dimensionality of the feature space of these kernels is very high, and in fact, the method doesn't use an explicit representation of the features. For the sequence-based features we demonstrate the bias incurred by using non-co-localized negative examples by showing that the accuracy of a classifier depends on the co-localization threshold of the negative examples on which the method was tested. Figure <figr fid="F1">1</figr> illustrates the increase in classifier accuracy as the co-localization threshold is decreased. This effect is much larger than the variability that results from the randomness in the choice of negative examples and the cross-validation (CV) estimate: the standard deviation of the ROC score on 10 drawings of the negative examples was 0.003, and the variability between different runs of CV is even lower. We can explain the higher accuracy at low co-localization thresholds by the fact that the constraint on localization restricts the negative examples to a sub-space of sequence space, making the learning problem easier than when there is no constraint.</p>
			<fig id="F1">
				<title>
					<p>Figure 1</p>
				</title>
				<caption>
					<p>The dependence of prediction accuracy, quantified by the area under the ROC/ROC<sub>50 </sub>curves, on the co-localization threshold used to choose negative examples</p>
				</caption>
				<text>
					<p>The dependence of prediction accuracy, quantified by the area under the ROC/ROC<sub>50 </sub>curves, on the co-localization threshold used to choose negative examples. Enforcing the condition that no two proteins in the set of negative examples have a GO component similarity that is greater than a given threshold (the co-localization threshold) imposes a constraint on the distribution of negative examples. This constraint makes it easier for the classifier to distinguish between positive and negative examples, and the effect gets stronger as the co-localization threshold becomes smaller. All methods are SVM-based classifiers trained using different kernels on two interaction datasets. Results are computed using five-fold cross-validation, averaged over five drawings of negative examples. The spectrum kernel method uses pairs of k-mers as features; the motif method uses the composition of discrete sequence motifs, and the non-sequence method uses features such as co-expression as measured in microarray experiments, similarity in GO process and function annotations etc. We performed our experiment on two yeast physical interaction datasets: the BIND data is derived from the BIND database; the experiments using the non-sequence data were performed on a subset of reliable interactions that are found by multiple assays in BIND; DIP/MIPS is a dataset of reliable interactions derived from the DIP and MIPS databases.</p>
				</text>
				<graphic file="1471-2105-7-S1-S2-1"/>
			</fig>
			<p>In our experiments we used sets of negative examples characterized by the similarity of the localization annotations of two proteins. To see the relevance of our results to other published work, we need to establish a relationship between our co-localization threshold and the criteria used elsewhere. The data of <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> is used in several studies of protein-protein interactions. They considered five very broad cellular compartments (cytoplasm, mitochondrion, nucleus, plasma membrane, and secretory pathway organelles). Four of these have corresponding nodes in the cellular compartment part of GO. The average GO similarity between these compartments ranges from 0.002 to 0.36, with an overall mean of 0.13. At this level of the co-localization threshold our results show a strong effect.</p>
		</sec>
		<sec>
			<st>
				<p>Discussion</p>
			</st>
			<p>There are many pitfalls in designing machine learning experiments (see <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> for an example in the context of feature selection). Design of experiments in the field of bioinformatics, where various sources of data are often correlated, requires special care to make sure no information on the testing example labels leaks to the representation of the training examples. In this paper, we illustrated a phenomenon where, by constraining the distribution of negative examples, the classification problem becomes easier. Although choosing negative examples as pairs of proteins that are localized to different cellular compartments creates high-quality negative examples, it also makes them easier to distinguish from interacting proteins. In the case where the data is characterized by features such as similarity of GO process or function annotations, constraining the distribution of the component similarity has a direct effect on the distribution of the GO process annotation.</p>
			<p>In the case of the sequence-based classifiers, the improvement in classifier performance is the result of constraining the negative examples to a smaller region of sequence-space. We see a difference between the behavior of the motif/pfam kernels and the spectrum kernel: the results with the spectrum kernel are more strongly affected by the distribution of negative examples. We believe that this difference is the result of the greater flexibility of the spectrum kernel, which allows it to fit arbitrary training sets. The motif/pfam kernels, by contrast, use features that are more biologically relevant, and so cannot be biased as much as the spectrum kernel. The gold standard negative examples of <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> were not only constrained by lack of co-localization; their construction also required that both proteins in a pair have GO annotations in both the function and process components. This constraint would likely increase classifier accuracy even further.</p>
			<p>The reader may suspect that the improvement in classifier accuracy when constraining the negative examples to be non-co-localized may be the result of higher quality negative examples. To address this concern we performed the following experiment to test the effect of changing the labels of a small fraction of the negative examples. We considered the MIPS/DIP dataset with the spectrum kernel, and negative examples chosen with a co-localization threshold of 0.1. We divided the dataset into two parts, training data (80%) and test data (20%), and flipped the labels of 2% of the negative examples, a fraction likely to be much higher than the level of contamination under a choice of unconstrained selection of negative examples. SVMs were trained on both flipped and unflipped versions of the data. The average ROC (ROC<sub>50</sub>) scores for 10 draws of the data were 0.874 (0.361) for the unflipped data, and 0.871 (0.356) for the flipped data. This experiment illustrates that SVMs can easily handle a larger amount of noise in the negative examples than is expected in the actual data. Thus, the effect shown above is not a result of better quality negative examples.</p>
			<p>Without being aware of the bias in using gold standard non-interactions, one may think, looking at a couple of papers that describe methods for predicting protein-protein interactions from sequence <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>, that the problem is well addressed by these methods. However, this is not the case: the good performance is in fact a result of the biased selection of negative examples, and prediction of protein-protein interactions from sequence is a difficult problem that can still be considered unsolved.</p>
		</sec>
		<sec>
			<st>
				<p>Methods</p>
			</st>
			<sec>
				<st>
					<p>Positive Examples</p>
				</st>
				<p>We focus on prediction of physical interactions in yeast and use interaction data derived from several sources. These interactions are used as positive examples when training our classifiers.</p>
				<p>&#8226; Data from BIND <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. BIND includes published interaction data from high-throughput experiments as well as curated entries derived from published papers. Using physical interactions yields a dataset comprised of 10,517 interactions among 4233 yeast proteins (downloaded July 9th, 2004). Selecting interactions that were verified by multiple experimental assays yields a dataset of 750 trusted interactions. We used all the interactions for training, but assessed the performance only on trusted interactions.</p>
				<p>&#8226; A curated set of high-quality interactions from MIPS and DIP <abbrgrp><abbr bid="B25">25</abbr><abbr bid="B26">26</abbr></abbrgrp>, also used in <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. This set contains MIPS interactions that were annotated as physical interactions derived from small-scale experiments, DIP interactions from small-scale experiments, and DIP interactions verified by multiple experiments, for a total of 4838 interactions.</p>
				<p>In both cases we avoided using interactions that were validated by interacting paralogs in yeast to define trusted interactions, since those are likely to be easier to predict using the sequence-based methods. We eliminated self-interactions from each dataset, since many of the features we use are based on measures of similarity between the two proteins, e.g., gene expression correlation, and similarity of GO annotations.</p>
			</sec>
			<sec>
				<st>
					<p>Negative Examples</p>
				</st>
				<p>We compared two methods for choosing negative examples in this paper:</p>
				<p>&#8226; Random pairs of proteins that are not known to physically interact.</p>
				<p>&#8226; Parameterized sets of random pairs of proteins that are not known to physically interact, chosen such that the similarity of their GO cellular compartment annotations is below some threshold.</p>
				<p>In each case the number of negative examples was chosen to be equal to the number of positive examples in the dataset.</p>
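				<p>The two selection schemes can be sketched in a few lines. This is an illustration under stated assumptions, not the code used in our experiments: the function name, its parameters, and the <it>similarity</it> callback are hypothetical, and the callback stands in for whatever measure of localization similarity is applied to a pair of proteins.</p>

```python
import random


def sample_negatives(proteins, known_pairs, n, similarity=None, threshold=None, seed=0):
    """Draw n distinct random protein pairs that are not known to interact.

    If similarity and threshold are given, keep only pairs whose
    localization similarity does not exceed the threshold.
    """
    rng = random.Random(seed)
    known = {frozenset(pair) for pair in known_pairs}
    negatives = set()
    # Caution: this loop never ends if fewer than n admissible pairs exist.
    while len(negatives) != n:
        a, b = rng.sample(proteins, 2)  # two distinct proteins, so no self-pairs
        pair = frozenset((a, b))
        if pair in known:
            continue  # skip pairs that are known to interact
        if similarity is not None and similarity(a, b) > threshold:
            continue  # skip pairs that are too similar in localization
        negatives.add(pair)
    return negatives
```

				<p>Calling the function without a similarity callback corresponds to the first scheme; passing a callback and a threshold corresponds to the second. Setting <it>n</it> to the number of positive examples gives a balanced dataset.</p>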
			</sec>
			<sec>
				<st>
					<p>Support Vector Machines</p>
				</st>
				<p>The support vector machine (SVM) <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> is a classification method that provides state-of-the-art performance in many domains including bioinformatics <abbrgrp><abbr bid="B28">28</abbr><abbr bid="B29">29</abbr></abbrgrp>. SVMs access the data only through the <it>kernel function</it>, which defines the similarity between data objects. This allows the use of SVMs even when an explicit vector-space representation of the data is not available, but a kernel function is provided. This is the case for one of the kernels used in this work, where a kernel between two pairs of sequences is defined (see below and <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp>).</p>
			</sec>
			<sec>
				<st>
					<p>Figures of merit</p>
				</st>
				<p>In this paper we evaluate the accuracy of a trained classifier using two metrics. Both metrics &#8212; the area under the receiver operating characteristic curve (ROC score), and the normalized area under that curve up to the first 50 false positives, the ROC<sub>50 </sub>score &#8212; aim to measure both sensitivity and specificity by integrating over a curve that plots the true positive rate as a function of the false positive rate. The motivation for using both metrics is provided for example in <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>.</p>
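				<p>Both scores can be computed from a ranked list of predictions, as in the following minimal sketch. The function name and its arguments are hypothetical, ties in the scores are broken arbitrarily, and the normalization (dividing by the number of false positives considered times the number of positives) is one common convention that we assume here.</p>

```python
def roc_n(scores, labels, max_fp=None):
    """Area under the ROC curve up to max_fp false positives, scaled to [0, 1].

    labels are 1 for positives and 0 for negatives. With max_fp=None the
    full ROC score is returned; max_fp=50 gives a ROC50-style score.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    limit = n_neg if max_fp is None else min(max_fp, n_neg)
    true_pos = 0
    false_pos = 0
    area = 0
    for _, label in ranked:
        if false_pos == limit:
            break
        if label == 1:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos  # positives ranked above this negative
    return area / (limit * n_pos)
```

				<p>A classifier that ranks every positive example above every negative example obtains a score of 1 under both settings, whereas the truncated score rewards only the positives found before the first few false positives.</p>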
			</sec>
			<sec>
				<st>
					<p>Pairwise kernels</p>
				</st>
				<p>The kernels proposed in the literature for handling genomic information, e.g., sequence kernels such as the motif and spectrum kernels presented below, provide a similarity between two sequences, or more generally, a similarity between a representation of two proteins. Therefore, such kernels are not directly applicable to the task of predicting protein-protein interactions, which requires a similarity between two pairs of proteins. Thus, we want a function <it>K</it>((<it>X</it><sub>1</sub>, <it>X</it><sub>2</sub>), (<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-7-S1-S2-i1"><m:semantics><m:mrow><m:msub><m:msup><m:mi>X</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mn>1</m:mn></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGybawgaqbamaaBaaaleaacqaIXaqmaeqaaaaa@2F0F@</m:annotation></m:semantics></m:math>, <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-7-S1-S2-i2"><m:semantics><m:mrow><m:msub><m:msup><m:mi>X</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mn>2</m:mn></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGybawgaqbamaaBaaaleaacqaIYaGmaeqaaaaa@2F11@</m:annotation></m:semantics></m:math>)) that returns the similarity between the proteins <it>X</it><sub>1 </sub>and <it>X</it><sub>2 </sub>compared to the proteins <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-7-S1-S2-i1"><m:semantics><m:mrow><m:msub><m:msup><m:mi>X</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mn>1</m:mn></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGybawgaqbamaaBaaaleaacqaIXaqmaeqaaaaa@2F0F@</m:annotation></m:semantics></m:math> and <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-7-S1-S2-i2"><m:semantics><m:mrow><m:msub><m:msup><m:mi>X</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mn>2</m:mn></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGybawgaqbamaaBaaaleaacqaIYaGmaeqaaaaa@2F11@</m:annotation></m:semantics></m:math>. We call a kernel that operates on individual genes or proteins a <it>genomic kernel</it>, and a kernel that compares pairs of genes or proteins a <it>pairwise kernel</it>. Two recent papers proposed an approach for converting a genomic kernel into a pairwise kernel <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp>. They define the kernel</p>
				<p><it>K</it>((<it>X</it><sub>1</sub>, <it>X</it><sub>2</sub>), (<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-7-S1-S2-i1"><m:semantics><m:mrow><m:msub><m:msup><m:mi>X</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mn>1</m:mn></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGybawgaqbamaaBaaaleaacqaIXaqmaeqaaaaa@2F0F@</m:annotation></m:semantics></m:math>, <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-7-S1-S2-i2"><m:semantics><m:mrow><m:msub><m:msup><m:mi>X</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mn>2</m:mn></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGybawgaqbamaaBaaaleaacqaIYaGmaeqaaaaa@2F11@</m:annotation></m:semantics></m:math>)) = <it>K'</it>(<it>X</it><sub>1</sub>, <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-7-S1-S2-i1"><m:semantics><m:mrow><m:msub><m:msup><m:mi>X</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mn>1</m:mn></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGybawgaqbamaaBaaaleaacqaIXaqmaeqaaaaa@2F0F@</m:annotation></m:semantics></m:math>) <it>K'</it>(<it>X</it><sub>2</sub>, <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-7-S1-S2-i2"><m:semantics><m:mrow><m:msub><m:msup><m:mi>X</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mn>2</m:mn></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGybawgaqbamaaBaaaleaacqaIYaGmaeqaaaaa@2F11@</m:annotation></m:semantics></m:math>) + <it>K'</it>(<it>X</it><sub>1</sub>, <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-7-S1-S2-i2"><m:semantics><m:mrow><m:msub><m:msup><m:mi>X</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mn>2</m:mn></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGybawgaqbamaaBaaaleaacqaIYaGmaeqaaaaa@2F11@</m:annotation></m:semantics></m:math>) <it>K'</it>(<it>X</it><sub>2</sub>, <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-7-S1-S2-i1"><m:semantics><m:mrow><m:msub><m:msup><m:mi>X</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mn>1</m:mn></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGybawgaqbamaaBaaaleaacqaIXaqmaeqaaaaa@2F0F@</m:annotation></m:semantics></m:math>), &#160;&#160;&#160; (1)</p>
				<p>where <it>K'</it>(&#183;, &#183;) is any genomic kernel. The intuition behind the kernel is that for the pair (<it>X</it><sub>1</sub>, <it>X</it><sub>2</sub>) to be considered similar to (<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-7-S1-S2-i1"><m:semantics><m:mrow><m:msub><m:msup><m:mi>X</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mn>1</m:mn></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGybawgaqbamaaBaaaleaacqaIXaqmaeqaaaaa@2F0F@</m:annotation></m:semantics></m:math>, <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-7-S1-S2-i2"><m:semantics><m:mrow><m:msub><m:msup><m:mi>X</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mn>2</m:mn></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGybawgaqbamaaBaaaleaacqaIYaGmaeqaaaaa@2F11@</m:annotation></m:semantics></m:math>), <it>X</it><sub>1 </sub>needs to be similar to <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-7-S1-S2-i1"><m:semantics><m:mrow><m:msub><m:msup><m:mi>X</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mn>1</m:mn></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGybawgaqbamaaBaaaleaacqaIXaqmaeqaaaaa@2F0F@</m:annotation></m:semantics></m:math> and <it>X</it><sub>2 </sub>needs to be similar to <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-7-S1-S2-i2"><m:semantics><m:mrow><m:msub><m:msup><m:mi>X</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mn>2</m:mn></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGybawgaqbamaaBaaaleaacqaIYaGmaeqaaaaa@2F11@</m:annotation></m:semantics></m:math> (the first term) or <it>X</it><sub>1 </sub>is similar to <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-7-S1-S2-i2"><m:semantics><m:mrow><m:msub><m:msup><m:mi>X</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mn>2</m:mn></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGybawgaqbamaaBaaaleaacqaIYaGmaeqaaaaa@2F11@</m:annotation></m:semantics></m:math> and <it>X</it><sub>2 </sub>is similar to <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-7-S1-S2-i1"><m:semantics><m:mrow><m:msub><m:msup><m:mi>X</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mn>1</m:mn></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGybawgaqbamaaBaaaleaacqaIXaqmaeqaaaaa@2F0F@</m:annotation></m:semantics></m:math> (the second term). The feature space for this kernel is a vector space of (symmetrized) pairs of features from the underlying genomic kernel.</p>
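				<p>As a rough illustration of Equation (1), the following Python sketch computes the pairwise kernel from a precomputed Gram matrix of the base kernel <it>K'</it> and compares it with an explicit symmetrized feature map. The function names, the toy feature vectors, the linear base kernel, and the 1/sqrt(2) scaling of the feature map are assumptions made for this example; it is not the PyML code used in the experiments.</p>

```python
import numpy as np


def pairwise_kernel(K, pair_a, pair_b):
    """Equation (1): similarity of two protein pairs given a base Gram matrix K.

    K[i, j] holds the base kernel K'(X_i, X_j); each pair is a tuple of row indices.
    """
    a1, a2 = pair_a
    b1, b2 = pair_b
    return K[a1, b1] * K[a2, b2] + K[a1, b2] * K[a2, b1]


def symmetrized_features(x1, x2):
    """Explicit feature map for a linear base kernel (an assumption of this sketch).

    The 1/sqrt(2) factor makes the dot product of two such vectors equal Equation (1).
    """
    return (np.outer(x1, x2) + np.outer(x2, x1)).ravel() / np.sqrt(2.0)


# Hypothetical feature vectors for three proteins; the base kernel is linear.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])
K = X @ X.T

k_direct = pairwise_kernel(K, (0, 1), (1, 2))
k_swapped = pairwise_kernel(K, (1, 0), (1, 2))  # same pair, proteins reordered
k_explicit = symmetrized_features(X[0], X[1]) @ symmetrized_features(X[1], X[2])
```

				<p>Reordering the proteins within a pair only exchanges the two terms of the sum, so the kernel value does not depend on the order in which a pair is listed.</p>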
			</sec>
			<sec>
				<st>
					<p>Sequence kernels</p>
				</st>
				<p>We use two sequence kernels in this work: the spectrum kernel <abbrgrp><abbr bid="B30">30</abbr></abbrgrp> and the motif kernel <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>. The spectrum kernel represents a sequence in the space of all <it>k</it>-mers: its feature vector counts the number of times each <it>k</it>-mer appears in the sequence. The motif kernel represents a sequence by a motif composition vector whose entries count how many times each discrete sequence motif matches the sequence. We used discrete sequence motifs constructed from the eBLOCKS database <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>. Yeast ORFs contain occurrences of 17,768 motifs out of a set of 42,718 motifs. For both kernels we used a normalized linear kernel in the space of <it>k</it>-mer or motif counts: <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-7-S1-S2-i3"><m:semantics><m:mrow><m:mi>K</m:mi><m:mo stretchy="false">(</m:mo><m:mi>x</m:mi><m:mo>,</m:mo><m:mi>y</m:mi><m:mo stretchy="false">)</m:mo><m:mo>/</m:mo><m:msqrt><m:mrow><m:mi>K</m:mi><m:mo stretchy="false">(</m:mo><m:mi>x</m:mi><m:mo>,</m:mo><m:mi>x</m:mi><m:mo stretchy="false">)</m:mo><m:mi>K</m:mi><m:mo stretchy="false">(</m:mo><m:mi>y</m:mi><m:mo>,</m:mo><m:mi>y</m:mi><m:mo stretchy="false">)</m:mo></m:mrow></m:msqrt></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGlbWscqGGOaakieGacqWF4baEcqWFSaalcqWF5bqEcqGGPaqkcqGGVaWldaGcaaqaaiabdUealjabcIcaOiabdIha4jabcYcaSiabdIha4jabcMcaPiabdUealjabcIcaOiabdMha5jabcYcaSiabdMha5jabcMcaPaWcbeaaaaa@419E@</m:annotation></m:semantics></m:math>.</p>
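				<p>The normalized spectrum kernel can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the toy sequences, and the default value of <it>k</it> are assumptions made here, not values or code taken from the experiments.</p>

```python
import math
from collections import Counter


def spectrum_counts(sequence, k):
    """Count every length-k substring (k-mer) of the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def linear_kernel(counts_x, counts_y):
    """Dot product of two sparse count vectors; absent k-mers count as zero."""
    return sum(n * counts_y[kmer] for kmer, n in counts_x.items())


def normalized_spectrum_kernel(x, y, k=3):
    """K(x, y) / sqrt(K(x, x) K(y, y)); assumes both sequences have at least k letters."""
    cx, cy = spectrum_counts(x, k), spectrum_counts(y, k)
    return linear_kernel(cx, cy) / math.sqrt(linear_kernel(cx, cx) * linear_kernel(cy, cy))


# Toy sequences (hypothetical, not taken from the yeast data set).
similarity = normalized_spectrum_kernel("ABAB", "ABBA", k=2)
self_similarity = normalized_spectrum_kernel("ABAB", "ABAB", k=2)
```

				<p>The same normalization applies to the motif kernel, with motif match counts taking the place of the <it>k</it>-mer counts.</p>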
			</sec>
		</sec>
		<sec>
			<st>
				<p>Availability</p>
			</st>
			<p>Data and code related to this work are available at <url>http://noble.gs.washington.edu/proj/sppi</url>. All classification experiments were performed using the <b>PyML</b> machine learning library, available at <url>http://pyml.sourceforge.net</url>.</p>
		</sec>
	</bdy>
	<bm>
		<ack>
			<sec>
				<st>
					<p>Acknowledgements</p>
				</st>
				<p>This work is funded by NCRR NIH award P41 RR11823, by NHGRI NIH award R33 HG003070, and by NSF award BDI-0243257. WSN is an Alfred P. Sloan Research Fellow.</p>
			</sec>
		</ack>
		<refgrp>
			<bibl id="B1">
				<title>
					<p>Comparative assessment of large-scale data sets of protein-protein interactions</p>
				</title>
				<aug>
					<au>
						<snm>von Mering</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Krause</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Snel</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Cornell</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Olivier</snm>
						<fnm>SG</fnm>
					</au>
					<au>
						<snm>Fields</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Bork</snm>
						<fnm>P</fnm>
					</au>
				</aug>
				<source>Nature</source>
				<pubdate>2002</pubdate>
				<volume>417</volume>
				<fpage>399</fpage>
				<lpage>403</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">12000970</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B2">
				<title>
					<p>Correlated sequence-signatures as markers of protein-protein interaction</p>
				</title>
				<aug>
					<au>
						<snm>Sprinzak</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Margalit</snm>
						<fnm>H</fnm>
					</au>
				</aug>
				<source>Journal of Molecular Biology</source>
				<pubdate>2001</pubdate>
				<volume>311</volume>
				<fpage>681</fpage>
				<lpage>692</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">11518523</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B3">
				<title>
					<p>Inferring domain-domain interactions from protein-protein interactions</p>
				</title>
				<aug>
					<au>
						<snm>Deng</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Mehta</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Sun</snm>
						<fnm>F</fnm>
					</au>
					<au>
						<snm>Chen</snm>
						<fnm>T</fnm>
					</au>
				</aug>
				<source>Genome Research</source>
				<pubdate>2002</pubdate>
				<volume>12</volume>
				<issue>10</issue>
				<fpage>1540</fpage>
				<lpage>1548</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">187530</pubid>
						<pubid idtype="pmpid" link="fulltext">12368246</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B4">
				<title>
					<p>Learning to predict protein-protein interactions</p>
				</title>
				<aug>
					<au>
						<snm>Gomez</snm>
						<fnm>SM</fnm>
					</au>
					<au>
						<snm>Noble</snm>
						<fnm>WS</fnm>
					</au>
					<au>
						<snm>Rzhetsky</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2003</pubdate>
				<volume>19</volume>
				<fpage>1875</fpage>
				<lpage>1881</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">14555619</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B5">
				<title>
					<p>Identifying Protein-Protein Interaction Sites on a Genome-Wide Scale</p>
				</title>
				<aug>
					<au>
						<snm>Wang</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Segal</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Ben-Hur</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Koller</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Brutlag</snm>
						<fnm>DL</fnm>
					</au>
				</aug>
				<source>Advances in Neural Information Processing Systems 17</source>
				<publisher>Cambridge, MA: MIT Press</publisher>
				<editor>Saul LK, Weiss Y, Bottou L</editor>
				<pubdate>2005</pubdate>
				<fpage>1465</fpage>
				<lpage>1472</lpage>
			</bibl>
			<bibl id="B6">
				<title>
					<p>Predicting protein-protein interactions using signature products</p>
				</title>
				<aug>
					<au>
						<snm>Martin</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Roe</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Faulon</snm>
						<fnm>JL</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2005</pubdate>
				<volume>21</volume>
				<issue>2</issue>
				<fpage>218</fpage>
				<lpage>226</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15319262</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B7">
				<title>
					<p>Kernel methods for predicting protein-protein interactions</p>
				</title>
				<aug>
					<au>
						<snm>Ben-Hur</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Noble</snm>
						<fnm>WS</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2005</pubdate>
				<volume>21</volume>
				<issue>suppl 1</issue>
				<fpage>i38</fpage>
				<lpage>i46</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15961482</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B8">
				<title>
					<p>Exploiting the co-evolution of interacting proteins to discover interaction specificity</p>
				</title>
				<aug>
					<au>
						<snm>Ramani</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Marcotte</snm>
						<fnm>E</fnm>
					</au>
				</aug>
				<source>Journal of Molecular Biology</source>
				<pubdate>2003</pubdate>
				<volume>327</volume>
				<fpage>273</fpage>
				<lpage>284</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">12614624</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B9">
				<title>
					<p>In silico two-hybrid system for the selection of physically interacting protein pairs</p>
				</title>
				<aug>
					<au>
						<snm>Pazos</snm>
						<fnm>F</fnm>
					</au>
					<au>
						<snm>Valencia</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>Proteins: Structure, Function and Genetics</source>
				<pubdate>2002</pubdate>
				<volume>47</volume>
				<issue>2</issue>
				<fpage>219</fpage>
				<lpage>227</lpage>
			</bibl>
			<bibl id="B10">
				<title>
					<p>Detecting protein function and protein-protein interactions from genome sequences</p>
				</title>
				<aug>
					<au>
						<snm>Marcotte</snm>
						<fnm>EM</fnm>
					</au>
					<au>
						<snm>Pellegrini</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Ng</snm>
						<fnm>HL</fnm>
					</au>
					<au>
						<snm>Rice</snm>
						<fnm>DW</fnm>
					</au>
					<au>
						<snm>Yeates</snm>
						<fnm>TO</fnm>
					</au>
					<au>
						<snm>Eisenberg</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Science</source>
				<pubdate>1999</pubdate>
				<volume>285</volume>
				<fpage>751</fpage>
				<lpage>753</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">10427000</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B11">
				<title>
					<p>A Bayesian networks approach for predicting protein-protein interactions from genomic data</p>
				</title>
				<aug>
					<au>
						<snm>Jansen</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Yu</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Greenbaum</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Kluger</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Krogan</snm>
						<fnm>NJ</fnm>
					</au>
					<au>
						<snm>Chung</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Emili</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Snyder</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Greenblatt</snm>
						<fnm>JF</fnm>
					</au>
					<au>
						<snm>Gerstein</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Science</source>
				<pubdate>2003</pubdate>
				<volume>302</volume>
				<fpage>449</fpage>
				<lpage>453</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">14564010</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B12">
				<title>
					<p>Predicting co-complexed protein pairs using genomic and proteomic data integration</p>
				</title>
				<aug>
					<au>
						<snm>Zhang</snm>
						<fnm>LV</fnm>
					</au>
					<au>
						<snm>Wong</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>King</snm>
						<fnm>O</fnm>
					</au>
					<au>
						<snm>Roth</snm>
						<fnm>F</fnm>
					</au>
				</aug>
				<source>BMC Bioinformatics</source>
				<pubdate>2004</pubdate>
				<volume>5</volume>
				<fpage>38</fpage>
				<lpage>53</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">419405</pubid>
						<pubid idtype="pmpid" link="fulltext">15090078</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B13">
				<title>
					<p>Information assessment on predicting protein-protein interactions</p>
				</title>
				<aug>
					<au>
						<snm>Lin</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Wu</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Jansen</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Gerstein</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Zhao</snm>
						<fnm>H</fnm>
					</au>
				</aug>
				<source>BMC Bioinformatics</source>
				<pubdate>2004</pubdate>
				<volume>5</volume>
				<fpage>154</fpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">529436</pubid>
						<pubid idtype="pmpid" link="fulltext">15491499</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B14">
				<title>
					<p>How Reliable are Experimental Protein-Protein Interaction Data?</p>
				</title>
				<aug>
					<au>
						<snm>Sprinzak</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Sattath</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Margalit</snm>
						<fnm>H</fnm>
					</au>
				</aug>
				<source>Journal of Molecular Biology</source>
				<pubdate>2003</pubdate>
				<volume>327</volume>
				<issue>5</issue>
				<fpage>919</fpage>
				<lpage>923</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">12662919</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B15">
				<title>
					<p>Two Methods for Assessment of the Reliability of High Throughput Observations</p>
				</title>
				<aug>
					<au>
						<snm>Deane</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Salwinski</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Xenarios</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Eisenberg</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Molecular &amp; Cellular Proteomics</source>
				<pubdate>2002</pubdate>
				<volume>1</volume>
				<fpage>349</fpage>
				<lpage>356</lpage>
			</bibl>
			<bibl id="B16">
				<title>
					<p>Analyzing protein function on a genomic scale: the importance of gold-standard positives and negatives for network prediction</p>
				</title>
				<aug>
					<au>
						<snm>Jansen</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Gerstein</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Current Opinion in Microbiology</source>
				<pubdate>2004</pubdate>
				<volume>7</volume>
				<fpage>535</fpage>
				<lpage>545</lpage>
			</bibl>
			<bibl id="B17">
				<title>
					<p>Random Forest Similarity for Protein-Protein Interaction Prediction from Multiple Sources</p>
				</title>
				<aug>
					<au>
						<snm>Qi</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Klein-Seetharaman</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Bar-Joseph</snm>
						<fnm>Z</fnm>
					</au>
				</aug>
				<source>Proceedings of the Pacific Symposium on Biocomputing</source>
				<pubdate>2005</pubdate>
			</bibl>
			<bibl id="B18">
				<title>
					<p>On the number of protein-protein interactions in the yeast proteome</p>
				</title>
				<aug>
					<au>
						<snm>Grigoriev</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Research</source>
				<pubdate>2003</pubdate>
				<volume>31</volume>
				<issue>14</issue>
				<fpage>4157</fpage>
				<lpage>4161</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">165980</pubid>
						<pubid idtype="pmpid" link="fulltext">12853633</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B19">
				<title>
					<p>Genomic Expression Programs in the Response of Yeast Cells to Environmental Changes</p>
				</title>
				<aug>
					<au>
						<snm>Gasch</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Spellman</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Kao</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Carmel-Harel</snm>
						<fnm>O</fnm>
					</au>
					<au>
						<snm>Eisen</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Storz</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Botstein</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Brown</snm>
						<fnm>P</fnm>
					</au>
				</aug>
				<source>Molecular Biology of the Cell</source>
				<pubdate>2000</pubdate>
				<volume>11</volume>
				<fpage>4241</fpage>
				<lpage>4257</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">15070</pubid>
						<pubid idtype="pmpid" link="fulltext">11102521</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B20">
				<title>
					<p>Gene ontology: tool for the unification of biology</p>
				</title>
				<aug>
					<au>
						<snm>Gene Ontology</snm>
						<fnm>Consortium</fnm>
					</au>
				</aug>
				<source>Nature Genetics</source>
				<pubdate>2000</pubdate>
				<volume>25</volume>
				<fpage>25</fpage>
				<lpage>29</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">10802651</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B21">
				<title>
					<p>Using Information Content to Evaluate Semantic Similarity in a Taxonomy</p>
				</title>
				<aug>
					<au>
						<snm>Resnik</snm>
						<fnm>P</fnm>
					</au>
				</aug>
				<source>IJCAI</source>
				<pubdate>1995</pubdate>
				<fpage>448</fpage>
				<lpage>453</lpage>
				<note>[citeseer.ist.psu.edu/resnik95using.html]</note>
			</bibl>
			<bibl id="B22">
				<title>
					<p>Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation</p>
				</title>
				<aug>
					<au>
						<snm>Lord</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Stevens</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Brass</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Goble</snm>
						<fnm>C</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2003</pubdate>
				<volume>19</volume>
				<issue>10</issue>
				<fpage>1275</fpage>
				<lpage>1283</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">12835272</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B23">
				<title>
					<p>Selection bias in gene extraction on the basis of microarray gene-expression data</p>
				</title>
				<aug>
					<au>
						<snm>Ambroise</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>McLachlan</snm>
						<fnm>GJ</fnm>
					</au>
				</aug>
				<source>Proceedings of the National Academy of Sciences of the United States of America</source>
				<pubdate>2002</pubdate>
				<volume>99</volume>
				<issue>10</issue>
				<fpage>6562</fpage>
				<lpage>6566</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">124442</pubid>
						<pubid idtype="pmpid" link="fulltext">11983868</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B24">
				<title>
					<p>BIND-The Biomolecular Interaction Network Database</p>
				</title>
				<aug>
					<au>
						<snm>Bader</snm>
						<fnm>GD</fnm>
					</au>
					<au>
						<snm>Donaldson</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Wolting</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Ouellette</snm>
						<fnm>BF</fnm>
					</au>
					<au>
						<snm>Pawson</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Hogue</snm>
						<fnm>CW</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Research</source>
				<pubdate>2001</pubdate>
				<volume>29</volume>
				<fpage>242</fpage>
				<lpage>245</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">29820</pubid>
						<pubid idtype="pmpid" link="fulltext">11125103</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B25">
				<title>
					<p>MIPS: a database for genomes and protein sequences</p>
				</title>
				<aug>
					<au>
						<snm>Mewes</snm>
						<fnm>HW</fnm>
					</au>
					<au>
						<snm>Frishman</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Gruber</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Geier</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Haase</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Kaps</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Lemcke</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Mannhaupt</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Pfeiffer</snm>
						<fnm>F</fnm>
					</au>
					<au>
						<snm>Sch&#252;ller</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Stocker</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Weil</snm>
						<fnm>B</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Research</source>
				<pubdate>2000</pubdate>
				<volume>28</volume>
				<fpage>37</fpage>
				<lpage>40</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">102494</pubid>
						<pubid idtype="pmpid" link="fulltext">10592176</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B26">
				<title>
					<p>DIP: the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions</p>
				</title>
				<aug>
					<au>
						<snm>Xenarios</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Salwinski</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Duan</snm>
						<fnm>XQJ</fnm>
					</au>
					<au>
						<snm>Higney</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Kim</snm>
						<fnm>SM</fnm>
					</au>
					<au>
						<snm>Eisenberg</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Research</source>
				<pubdate>2002</pubdate>
				<volume>30</volume>
				<fpage>303</fpage>
				<lpage>305</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">99070</pubid>
						<pubid idtype="pmpid" link="fulltext">11752321</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B27">
				<title>
					<p>A Training Algorithm for Optimal Margin Classifiers</p>
				</title>
				<aug>
					<au>
						<snm>Boser</snm>
						<fnm>BE</fnm>
					</au>
					<au>
						<snm>Guyon</snm>
						<fnm>IM</fnm>
					</au>
					<au>
						<snm>Vapnik</snm>
						<fnm>VN</fnm>
					</au>
				</aug>
				<source>5th Annual ACM Workshop on COLT</source>
				<publisher>Pittsburgh, PA: ACM Press</publisher>
				<editor>Haussler D</editor>
				<pubdate>1992</pubdate>
				<fpage>144</fpage>
				<lpage>152</lpage>
				<url>http://www.clopinet.com/isabelle/Papers/</url>
			</bibl>
			<bibl id="B28">
				<aug>
					<au>
						<snm>Sch&#246;lkopf</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Smola</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>Learning with Kernels</source>
				<publisher>Cambridge, MA: MIT Press</publisher>
				<pubdate>2002</pubdate>
			</bibl>
			<bibl id="B29">
				<title>
					<p>Support vector machine applications in computational biology</p>
				</title>
				<aug>
					<au>
						<snm>Noble</snm>
						<fnm>WS</fnm>
					</au>
				</aug>
				<source>Kernel methods in computational biology</source>
				<publisher>Cambridge, MA: MIT Press</publisher>
				<pubdate>2004</pubdate>
				<fpage>71</fpage>
				<lpage>92</lpage>
			</bibl>
			<bibl id="B30">
				<title>
					<p>The spectrum kernel: A string kernel for SVM protein classification</p>
				</title>
				<aug>
					<au>
						<snm>Leslie</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Eskin</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Noble</snm>
						<fnm>WS</fnm>
					</au>
				</aug>
				<source>Proceedings of the Pacific Symposium on Biocomputing</source>
				<publisher>New Jersey: World Scientific</publisher>
				<editor>Altman RB, Dunker AK, Hunter L, Lauderdale K, Klein TE</editor>
				<pubdate>2002</pubdate>
				<fpage>564</fpage>
				<lpage>575</lpage>
			</bibl>
			<bibl id="B31">
				<title>
					<p>Remote homology detection: a motif based approach</p>
				</title>
				<aug>
					<au>
						<snm>Ben-Hur</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Brutlag</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Proceedings of the Eleventh International Conference on Intelligent Systems for Molecular Biology</source>
				<pubdate>2003</pubdate>
				<volume>19</volume>
				<issue>suppl 1</issue>
				<fpage>i26</fpage>
				<lpage>i33</lpage>
			</bibl>
			<bibl id="B32">
				<title>
					<p>eBLOCKS: enumerating conserved protein blocks to achieve maximal sensitivity and specificity</p>
				</title>
				<aug>
					<au>
						<snm>Su</snm>
						<fnm>Q</fnm>
					</au>
					<au>
						<snm>Liu</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Saxonov</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Brutlag</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Research</source>
				<pubdate>2005</pubdate>
				<volume>33</volume>
				<fpage>D178</fpage>
				<lpage>D182</lpage>
			</bibl>
		</refgrp>
	</bm>
</art>
