Abstract
Background
Short binding motifs are an important class of interaction switches for biological circuits and disease pathways. However, the biological experiments needed to find these binding motifs are often laborious and expensive. With the availability of protein interaction data, novel binding motifs can be discovered computationally by applying standard motif extraction algorithms to sets of protein sequences that each interact with either a common protein or a protein group with similar properties. The underlying assumption is that proteins with common interacting partners share some common binding motifs. Although novel binding motifs have been discovered with this approach, it is not applicable if a protein interacts with very few other proteins, or when prior knowledge of the protein group is unavailable or erroneous. Experimental noise in the input interaction data can further degrade the already poor performance of such approaches.
Results
We propose a novel approach of finding correlated short sequence motifs from protein-protein interaction data to circumvent the above-mentioned limitations. Correlated motifs are motifs that consistently co-occur only in pairs of interacting protein sequences, and that could interact with each other directly or indirectly to mediate interactions. We adopt the (l, d)-motif model and formulate the search for correlated motifs as an (l, d)-motif pair finding problem. We present both an exact algorithm, DMOTIF, and an approximation algorithm, DSTAR, to solve this problem. Evaluation on extensive simulated data showed that our approach not only eliminates the need for any prior protein grouping, but is also more robust in extracting motifs from noisy interaction data. Application to two biological datasets (an SH3 interaction network and a TGFβ signaling network) demonstrates that the approach can extract correlated motifs that correspond to actual interacting subsequences.
Conclusion
The correlated motif approach outlined in this paper is able to find correlated linear motifs from sparse and noisy interaction data. This, in turn, will expedite the discovery of novel linear binding motifs, and facilitate the studies of biological pathways mediated by them.
Background
Binding motifs are an important class of interaction switches for biological circuits and disease pathways [1,2]. These are very short, functional regions on proteins that conform to particular sequence patterns; a well-known example is the set of peptides expressing a PxxP consensus (where x represents any arbitrary amino acid) that bind SH3 protein domains [3,4]. Finding such motifs is important for drug discovery, as many have been implicated in disease pathways. For instance, proline-rich motifs and glutamine-rich motifs have been linked to Alzheimer's disease, muscular dystrophy [5] and Huntington's disease [6]. Recently, Marti et al. reported that the short linear sequence motif RxLx[QE] plays a key role in the pathogenesis of malaria [7,8].
Binding motifs can be discovered by biological experiments, such as site-directed mutagenesis and phage display, which are laborious and expensive. However, given a set of protein-protein interaction data, binding motifs can be discovered computationally as follows: (i) group protein sequences that interact with the same protein, and (ii) for each group of protein sequences, extract the motifs using motif discovery algorithms such as MEME [9], Gibbs Sampler [10], PRATT [11] and TEIRESIAS [12]. For example, to computationally detect any motif bound by the protein Crk, we could input the protein sequences interacting with Crk to motif discovery programs. The underlying assumption is that Crk binds through similar sequence segments in many of its interaction partners, which can be detected by string pattern algorithms. For discussion, we denote this approach One-To-Many (OTM), since we start with one protein to derive a group of multiple proteins associated with it for motif extraction.
The OTM approach is effective only when the protein we start with has enough interacting partners for motif extraction. In reality, many proteins have few interacting partners [13]. This means that for many proteins, the signals from the few, short motif instances would be too weak for detection by existing motif discovery algorithms. The situation is worse still when we consider the high noise levels in interaction data [14] and the inherent heterogeneity of protein interactions: not all the real interacting partners of a protein necessarily carry the same binding motif. In the extreme case of a protein with only one known interacting partner, it is impossible to extract binding motifs using the OTM approach.
Sometimes it is possible to apply prior knowledge of protein groups to increase the number of sequences for motif extraction. For example, if individual copies of the SH3 domain bind few protein partners, we could pool all sequences that bind any SH3 domain protein to increase the number of PxxP motif instances available for its "discovery". We denote this approach Many-to-Many (MTM), since we derive a set of sequences for motif extraction from another set of sequences (the protein group). Reiss and Schwikowski adopted an MTM-based method with a modified Gibbs sampling algorithm to enhance motif finding on proteins with limited binding partners, and successfully extracted more motifs than the OTM-based approaches [15]. In another work, Neduva et al. complemented the OTM approach with the MTM approach to find novel linear motifs from protein interaction data [16]. However, the MTM approach is not applicable if prior knowledge of the protein group is unavailable. Even when such knowledge is available, it might be incomplete, erroneous or just too generic. As a result, finding motifs from the interacting partners of such a group might often yield less satisfactory results.
In this paper, we are interested in the case where the linear motif in question binds directly to, or interacts indirectly with, another linear motif. This is plausible because linear motifs are in general short enough that, most of the time, they interact with a similarly short region on the other protein. For modular interaction domains, for example, it is often the subregions, rather than the entire domains, that are involved in mediating protein-protein interactions. In essence, we model interactions as mediated by a pair of motifs, each occurring in one of the two interacting proteins, and this work revolves around discovering such motif pairs from protein interaction data.
Formally, given a set of protein-protein interactions occurring between sequences containing the linear motif x and sequences containing the linear motif y, we present a novel approach to simultaneously find both motifs x and y directly from protein interaction data. It is based on the intuition that if a set of interactions is indeed mediated by x and y, the two motifs will appear as over-represented, co-occurring pairs of similar substrings in pairs of interacting proteins in the data set (see Figure 1). Our approach mines such substring pairs, which we term correlated motifs, from the input interaction data; they correspond to x and y. The term "correlated" indicates that the output motif pair may not necessarily bind each other directly, but that their co-occurrences in interacting sequences are significant. Our new approach offers the following advantages:
Figure 1. Correlated motif pair approach. A depiction of our approach for finding correlated motifs. The dotted lines indicate the interactions between the proteins.
1. In contrast to both OTM- and MTM-based approaches, it simultaneously finds two motifs that are interaction-correlated instead of one motif.
2. Like the MTM approach, it increases the number of motif instances for detection (see Figure 1).
3. However, it does not require any prior knowledge for protein grouping (although, when available, such information would still be useful), resting on the assumption that members of a protein group should share similar substrings that can be extracted by our approach as one of the motifs (see Figure 1).
4. By finding pairs of correlated motifs in the interaction data instead of single motifs in protein sequence data, our approach is more stringent and hence more resilient against noise, since it is less likely for two spurious noise-induced motifs to co-occur in the interaction data more frequently than the true ones.
We adopted the (l, d)-motif model, which has been used frequently to model motifs in biological sequences thanks to its simplicity [17-22]. In the (l, d)-motif model, the actual motif and its instances are strings of length l, and each instance differs from the actual motif by no more than d mismatches. Thus any two motif instances have at most 2d mismatches between them. Consequently, a set of very similar substrings can be modelled as an (l, d)-motif with a small d, while a more diverse substring set needs to be modelled with a larger d. We then formulated our approach as an (l, d)-motif pair finding problem, and present an exact algorithm, DMOTIF, as well as an approximation algorithm, DSTAR, to solve the problem.
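The (l, d)-motif model described above can be illustrated with a minimal Python sketch. The function names and the example strings are hypothetical and chosen only for illustration; this is not the DMOTIF or DSTAR implementation.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def is_instance(segment: str, motif: str, d: int) -> bool:
    """True if `segment` is an instance of the (l, d)-motif `motif`."""
    return len(segment) == len(motif) and hamming(segment, motif) <= d


def find_instances(sequence: str, motif: str, d: int):
    """All (position, substring) windows of `sequence` within d mismatches of `motif`."""
    l = len(motif)
    return [
        (i, sequence[i:i + l])
        for i in range(len(sequence) - l + 1)
        if hamming(sequence[i:i + l], motif) <= d
    ]


# Illustrative (6, 1)-motif: both instances are 1 mismatch away from the
# motif, so they are at most 2d = 2 mismatches away from each other.
motif = "PPLPPR"
inst_a, inst_b = "PALPPR", "PPLPAR"
print(is_instance(inst_a, motif, 1), is_instance(inst_b, motif, 1))
print(hamming(inst_a, inst_b))
```

The 2d bound follows from the triangle inequality for Hamming distance: each instance is within d of the motif, so two instances are within 2d of each other.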
Our benchmarking analysis shows that DSTAR's performance is comparable to DMOTIF's with a substantially shorter running time. Thus, in the evaluation experiments, we compare only DSTAR with other existing algorithms, so that we can run extensive tests on both simulated and real biological datasets. Results from the former validate that the correlated motif approach is more robust than OTM and MTM in extracting motifs from sparse but noisy interaction data. Evaluation on real biological datasets, on the other hand, demonstrates that our DSTAR algorithm is able to extract correlated motifs that are biologically relevant. On an SH3 domain interaction dataset [3], DSTAR extracted "PxxPx[KR]" and "GxxPxNY" as correlated motifs; the two motifs were subsequently validated as actual interacting interfaces in the structural data of an SH3 domain and its ligand (see Figure 2). DSTAR also extracted "[KR]xxPxxP", a known SH3 binding motif that was not detected by any of the existing algorithms tested in this study (see Figure 3 and Table 1). Application of DSTAR to the TGFβ signaling pathway [23] extracted correlated motifs that mapped to putative phosphorylation sites and kinase subregions in proteins, respectively (more details in Additional file 1).
Figure 2. Evidence from PDB structural data: SH3 domain vs. PxxPxR. 3D structure (PDB ID: 1AVZ) of an SH3 domain of FYN tyrosine kinase bound to another protein. The sequence segments that express the "PxxPxR" motif and the "GxxPxNY" motif (detected by DSTAR in this work) are highlighted in dark blue and orange respectively. The two segments correspond to actual interacting subsequences.
Figure 3. The "PxxP", "PxxPx[KR]" and "[KR]xxPxxP" motifs and their associated motifs extracted by DSTAR. Lines between the sequence segments denote interactions between their parent proteins. The results were obtained from multiple runs of DSTAR with different combinations of motif width l = 6, 7, 8, distance d = 1 and k_i = k_n = 5. The outputs from the different runs were then ranked by their χ-score.
Table 1. Comparison of the performance of various algorithms on the SH3 dataset
Additional file 1. SupplementaryDataDetailsOnMotifExtractedOnTGFBeta. The file contains a detailed description of the motif instance set pair extracted by DSTAR and the corresponding known phosphorylation motifs that are found enriched in one of the sets.
Related works
There are existing works [24-27] that also find over-represented pairs of co-occurring sequence patterns from protein-protein interaction data, but most focus on discovering interaction correlations between sequence patterns predefined in existing databases such as Pfam, InterPro and Prosite. Using predefined patterns drastically reduces the motif search space, which enables motif mining in large interaction networks. However, the coverage of these methods is consequently limited by the completeness of existing pattern databases. To date, only about 200 binding motifs, out of the few thousand that possibly exist [2], have been found. The correlated motif approach outlined in this work can therefore complement existing works by discovering more novel motifs, as well as their correlations, from the increasingly abundant protein interaction data. Our algorithms can also be applied to biological pathways or protein networks directly to detect the most significant co-occurring motif pairs in these pathways. Such functionality is important for studying pathways known to be mediated by recurring domains and motifs, such as various signaling pathways [28,29].
Results and discussion
In the following discussion, we compare our algorithms (DSTAR and DMOTIF) against existing algorithms run in either OTM or MTM mode because, to our knowledge, there is no existing algorithm based on our approach. Recall that in the (l, d)-motif model, the motif (a consensus string) and its instances are strings of length l, and each instance differs from the actual motif by no more than d mismatches. The values l and d are two parameters of the algorithms. Users can either input specific values of l and d or input a range of values for each. In the latter case, the algorithms extract the different (l, d)-motif pairs and output them ranked by significance. The user must also provide two additional parameters, k_i and k_n, for a more directed search: k_i specifies the minimum number of interactions in which an (l, d)-motif pair must co-occur, while k_n specifies the minimum number of interacting proteins that must express each of the (l, d)-motifs.
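To make the roles of k_i and k_n concrete, the following sketch checks whether a given candidate motif pair meets both thresholds on a toy interaction list. It only verifies one candidate pair and does not search for pairs, so it is not the DMOTIF or DSTAR algorithm; all names, sequences and motifs here are hypothetical.

```python
def contains(seq: str, motif: str, d: int) -> bool:
    """True if some window of `seq` is within d mismatches of `motif`."""
    l = len(motif)
    return any(
        sum(a != b for a, b in zip(seq[i:i + l], motif)) <= d
        for i in range(len(seq) - l + 1)
    )


def pair_support(interactions, seqs, x, y, d):
    """Interactions in which x occurs in one partner and y in the other,
    plus the sets of proteins expressing x and y in those interactions."""
    supporting, proteins_x, proteins_y = set(), set(), set()
    for p, q in interactions:
        for a, b in ((p, q), (q, p)):  # interactions are undirected
            if contains(seqs[a], x, d) and contains(seqs[b], y, d):
                supporting.add(frozenset((p, q)))
                proteins_x.add(a)
                proteins_y.add(b)
    return supporting, proteins_x, proteins_y


def meets_thresholds(interactions, seqs, x, y, d, k_i, k_n):
    """Apply the k_i (interactions) and k_n (proteins per motif) minimums."""
    supporting, px, py = pair_support(interactions, seqs, x, y, d)
    return len(supporting) >= k_i and len(px) >= k_n and len(py) >= k_n


# Toy data (made up for illustration only).
seqs = {
    "A1": "MKPALPPRSS", "A2": "TTPPLPPRAA",
    "B1": "QQGAAPWNYQQ", "B2": "LLGSAPWNYLL",
    "C": "MMMMMMMMMM",
}
interactions = [("A1", "B1"), ("A1", "B2"), ("A2", "B1"), ("A2", "C")]
print(meets_thresholds(interactions, seqs, "PPLPPR", "GAAPWNY", 1, k_i=3, k_n=2))
```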
In short, our algorithms try to cluster the interaction data into groups of interactions that express some statistically significant (l, d)-motif pair: they look for pairs of similar substring sets (as defined by the (l, d)-motif model) occurring across pairs of interacting proteins, and rank them by the statistical significance of their co-occurrence. The exact algorithm DMOTIF finds all motif pairs that satisfy the given thresholds, while DSTAR trades some accuracy for speed. We performed a preliminary experiment to compare the accuracy and efficiency of DMOTIF and DSTAR, and found that DMOTIF is only modestly more accurate than DSTAR while running several orders of magnitude slower. The details of the comparison can be found in the Methods section. For efficiency, we therefore ran only DSTAR in the following evaluation experiments.
Artificial data with planted (l, d)-motifs
We evaluate the robustness of DSTAR against noise in the input data using simulated data with planted (l, d)-motifs. Another goal of the study is to investigate the performance of DSTAR on problems involving weak motifs. This will give users insight into how motif strength influences DSTAR's accuracy.
Simulation setup
We follow the simulation setup devised in [17], where the authors planted well-defined artificial (l, d)-motifs into random sequences to create artificial datasets for evaluation. Here, we create sequences with planted (l, d)-motifs and then pair them up to generate artificial interaction datasets. For each pair of (l, d)-motifs (x, y), five instances of motif x and five instances of motif y are inserted into ten randomly selected protein sequences. To simulate real scenarios as closely as possible, the motifs were planted in randomly selected yeast (Saccharomyces cerevisiae) protein sequences instead of random sequences. Let us denote the five sequences with planted motif x as sequence set P_x, and the five sequences with planted motif y as sequence set P_y. We set |P_x| = |P_y| = 5 in our current simulations.
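The planting step can be sketched as follows. This is an illustrative reconstruction under stated assumptions: instances are generated by substituting at most d positions of the motif, and an instance overwrites a window of the background sequence (the text does not say whether instances overwrite or lengthen the sequence). The background string below is a placeholder rather than a real yeast protein, and all function names are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mismatches(a: str, b: str) -> int:
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def mutate(motif: str, d: int, rng: random.Random) -> str:
    """Return an instance of `motif` with at most d substituted positions."""
    positions = rng.sample(range(len(motif)), rng.randint(0, d))
    chars = list(motif)
    for i in positions:
        # The drawn residue may equal the original one, so the instance
        # has at most (not exactly) len(positions) mismatches.
        chars[i] = rng.choice(AMINO_ACIDS)
    return "".join(chars)


def plant(sequence: str, instance: str, rng: random.Random):
    """Overwrite a random window of `sequence` with `instance`."""
    pos = rng.randrange(len(sequence) - len(instance) + 1)
    planted = sequence[:pos] + instance + sequence[pos + len(instance):]
    return planted, pos


rng = random.Random(0)                       # fixed seed for repeatability
background = "".join(rng.choice(AMINO_ACIDS) for _ in range(60))
instance = mutate("PPLPPRAK", d=1, rng=rng)  # hypothetical (8, 1)-motif
planted, pos = plant(background, instance, rng)
print(instance, pos)
```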
We simulate real protein interactions by pairing sequences in P_x with sequences in P_y, and vice versa. A spurious interaction is modeled by pairing a protein in P_x (P_y, resp.) with a random yeast protein not in P_y (P_x, resp.). Given that a protein interacts with an average of 5.8 other proteins (interaction statistics in DIP [30]), and that the high-throughput yeast two-hybrid technique is known to have an error rate of at least 50% [14], we would expect at most 2.9 true interactions per protein. Being conservative, we set the number of true interactions per protein to 2 here. Let ε be the noise level, defined as the fraction of spurious interactions among all interactions that involve one particular protein. We investigate the performance of the algorithms with ε = 0.50 as well as ε = 0.60. For instance, with 2 true interactions per protein and ε = 0.50, each protein in P_x and P_y is involved in (on average) 4 interactions, two of which are spurious.
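The arithmetic relating the number of true interactions per protein to the noise level ε can be sketched as below. The function name is hypothetical; the only assumption is the definition given above, namely that ε is the fraction of spurious interactions among all interactions of a protein.

```python
import math


def interactions_per_protein(true_per_protein: float, epsilon: float):
    """Return (total, spurious) interactions per protein.

    Since epsilon = spurious / total and total = true + spurious,
    total = true / (1 - epsilon).
    """
    if not 0 <= epsilon < 1:
        raise ValueError("epsilon must be in [0, 1)")
    total = true_per_protein / (1.0 - epsilon)
    return total, total - true_per_protein


for eps in (0.50, 0.60):
    total, spurious = interactions_per_protein(2, eps)
    print(f"epsilon={eps:.2f}: total={total:.1f}, spurious={spurious:.1f}")
```

Under this definition, ε = 0.50 with 2 true interactions gives 4 interactions in total, matching the example in the text, and ε = 0.60 would give 5 interactions, of which 3 are spurious.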
The algorithms and parameter settings
We applied DSTAR, as well as other known motif extraction algorithms such as MEME and Gibbs Sampler, to see whether they could recover instances of both planted motifs among their highest-scoring outputs (motif pairs, in the case of DSTAR) from the noisy input datasets. We also implemented an algorithm, SSTAR, based on the well-established SPSTAR algorithm [17], to find single (l, d)-motifs in subsets of protein sequences. We ran MEME, Gibbs Sampler and SSTAR using the MTM approach, since 2 true interactions per protein is too few for an OTM-based approach to detect the motifs. All the algorithms using the MTM approach are run only on the proteins that interact with those in P_y when trying to find motif x (and vice versa for y). The average of the two cases is the reported performance. Note that this effectively provides the existing algorithms with prior knowledge of the underlying groupings of the protein sequences, namely the sequence groups P_x and P_y.
To search for the set of planted (l, d)-motifs, we set the parameters for the various algorithms as follows. For MEME, the parameters are: Mode = ZOOPS (the option in MEME for when not every input sequence is guaranteed to contain a motif of interest) and Motif Width = l. For Gibbs Sampler, the parameters are: Mode = Motif Sampler (the option in Gibbs Sampler for when not all input sequences are guaranteed to contain a motif of interest), Motif Width = l and Expected Motif Occurrence = 5. For DSTAR and SSTAR, being (l, d)-motif searching algorithms, the first two parameters are l and d. We set the minimum number of motif occurrences in the sequences to k_n = 5. For DSTAR, the minimum number of interactions between the instances of the correlated motifs, k_i, is also set to 5.
Evaluation metrics
We evaluate the relative performance of the algorithms using the following metrics:
where TP_x (TP_y, resp.) is the number of correctly recovered instances of the planted motif x (y, resp.), and FN_x (FN_y, resp.) is the number of instances of the planted motif x (y, resp.) that the algorithm fails to recover. Lastly, FP_x (FP_y, resp.) is the number of spurious segments included by the algorithm as candidate instances of x (y, resp.).
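The metric formulas themselves are not reproduced in this text. A minimal sketch, assuming the standard precision, recall and F-measure definitions pooled over both motifs (an assumption that may differ from the authors' exact formulation), is:

```python
def pooled_metrics(tp_x, tp_y, fn_x, fn_y, fp_x, fp_y):
    """Precision, recall and F-measure pooled over motifs x and y.

    Assumed standard definitions:
      precision = TP / (TP + FP), recall = TP / (TP + FN),
      F = 2 * precision * recall / (precision + recall),
    with TP = TP_x + TP_y, and likewise for FN and FP.
    """
    tp, fn, fp = tp_x + tp_y, fn_x + fn_y, fp_x + fp_y
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Example: 5 planted instances per motif; one instance of y missed and
# one spurious segment reported for x.
print(pooled_metrics(tp_x=5, tp_y=4, fn_x=0, fn_y=1, fp_x=1, fp_y=0))
```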
Results
We applied DSTAR and all the other algorithms to numerous sets of simulated interaction data with different planted (l, d)-motifs, namely the (8, 1)-, (7, 1)-, (9, 2)-, (6, 1)- and (8, 2)-motifs (listed in decreasing order of motif strength). For each combination of motif and ε value, we generated 10 random datasets and computed the average performance of the algorithms in discovering the correct motif. Our results showed that MEME and Gibbs Sampler performed quite poorly. Even for a relatively strong (8, 1)-motif, MEME achieved F-measures of only 0.49 and 0.35 for ε = 0.50 and 0.60, respectively (for Gibbs Sampler, the F-measures were 0.58 and 0.29, respectively). However, since both of these algorithms use different motif models, they may not be optimized to search for (l, d)-motifs. Instead, we compare their relative performance on real biological data later. A noteworthy observation, however, is that increased noise in the input data can drastically decrease the performance of the algorithms.
Not surprisingly, both DSTAR and SSTAR attained a very high average F-measure of 0.99 for the relatively stronger (8, 1)- and (7, 1)-motifs at all values of ε (data not shown). Figure 4 compares the F-measures of DSTAR and SSTAR on the weaker (9, 2)-, (6, 1)- and (8, 2)-motifs. Observe that DSTAR performed consistently better than SSTAR in all cases and, furthermore, that the performance margins were larger when there was more noise in the data. This study validates that, even without prior knowledge of the motifs contained in the interaction data, DSTAR is able to handle noise much better than the other algorithms. This is of practical importance, since real interaction data are often highly noisy and contain many interactions between unknown domains and/or motifs.
Figure 4. Comparison between DSTAR and SSTAR. Comparison between DSTAR and SSTAR (a variant of SPSTAR) in extracting planted (l, d)-motifs. The motifs are arranged on the x-axis in decreasing order of motif strength. The number of planted motif instances in each dataset is 5, and each data point is the average over 10 runs.
Biological data
In this section, we apply our algorithm on two biologically significant datasets: SH3 domain interaction data and TGFβ signaling pathway data. We show that our approach can better discover real biological motifs than the other methods.
SH3 domain interaction data
SH3 domains are conserved amino acid segments (of length ≈ 60 amino acids) found across multiple proteins. Through various biological experiments, SH3 domains have been determined to bind short sequence segments expressing the general motif "PxxP" [3]. The interactions between SH3 proteins and the "PxxP" motif mirror our motif pair (x, y) (in this case, one of the motifs should correspond to parts of the SH3 domain). For evaluation, we use the dataset derived by Tong et al. to find the interacting partners of SH3 domain proteins [3]. This dataset, which we call SH3PxxPTong, was downloaded from the BIND online database. It consists of 233 protein-protein interactions among 146 yeast proteins, of which 23 are SH3 domain proteins (as determined using the HMMER program from Pfam). We first assess whether the known SH3 binding motifs can be extracted among the top motifs by each algorithm. Next, we investigate the biological relevance of the correlated motifs extracted by DSTAR.
The algorithms and parameters
We ran DSTAR on the SH3PxxPTong dataset multiple times with different combinations of l = 6, 7, 8, d = 1 and k_n = k_i = 5. The outputs from the different runs were then ranked by their χ-scores. Note again that DSTAR mined the motifs without having to separate the SH3 domain proteins from the non-SH3 domain proteins, unlike the MTM motif extraction methods, which require such prior knowledge. For comparison, we also attempted to extract the "PxxP"-like motifs with MEME (ZOOPS mode, Motif Width = 4 to 9), Gibbs Sampler (Motif Sampler mode, Motif Width = 4 to 8, Expected Motif Number ≥ 5) and SPSTAR (l = 6, 7, 8, d = 1 and Minimum Motif Number = 5) from the 130 sequences in the dataset that bind to any SH3 protein (the MTM approach).
Validation
Without the luxury of experimentally validating the extracted motifs, it is hard to determine the accuracy of the various algorithms precisely. However, we reasoned that a good algorithm should at least extract most of the known motifs. In other words, when applying DSTAR to the interaction data of SH3 proteins, we should expect it to extract some "PxxP"-like motifs on one side and another motif that occurs consistently in SH3 domains on the other side. We consider here the well-known SH3-binding motifs "PxxP", "PxxPx[RK]" and "[RK]xxPxxP". For each of these three motifs, we check whether it was "expressed" within the top 50 motifs reported (a user would usually not want to check beyond this number). We define a set of protein sequence segments reported by an algorithm to be expressing a motif if at least 50% of the sequence segments match the pattern.
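The "expressing a motif" criterion can be sketched with regular expressions. This is a minimal illustration, assuming that lowercase "x" is a wildcard for any residue and that a segment matches when the pattern occurs anywhere within it (the text does not specify anchoring); the function names and example segments are hypothetical.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Convert a consensus such as 'PxxPx[RK]' into a compiled regex.

    Lowercase 'x' becomes '.', and bracketed residue sets such as '[RK]'
    are already valid regex character classes.
    """
    return re.compile("".join("." if c == "x" else c for c in pattern))


def expresses(segments, pattern: str, min_fraction: float = 0.5) -> bool:
    """True if at least `min_fraction` of the segments match the pattern."""
    if not segments:
        return False
    regex = pattern_to_regex(pattern)
    matching = sum(1 for s in segments if regex.search(s))
    return matching / len(segments) >= min_fraction


# Made-up segment set: two of four segments match PxxPx[RK].
segments = ["PPLPPR", "PALPAK", "AAAAAA", "GPTMPR"]
print(expresses(segments, "PxxPx[RK]"))
```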
Results
Table 1 shows the results for DSTAR, SSTAR, MEME and Gibbs Sampler. The generic "PxxP" motif was extracted among the top outputs by all algorithms. However, only our DSTAR algorithm managed to extract both the "PxxPx[KR]" and "[KR]xxPxxP" motifs (within the top 50 motifs output by each algorithm). In fact, only two instances of the "PxxPx[KR]" motif were found within the top 50 sets of segments extracted by MEME, and no "[KR]xxPxxP" motif instance was extracted. To be sure, we reran MEME on the same 130 sequences with more specific motif lengths of 6 to 7 (instead of 4 to 9), but to no avail. This confirmed that MEME with the MTM approach indeed missed the more specific variants. As for SSTAR, the limited instances of the "PxxPx[KR]" and "[KR]xxPxxP" motifs extracted were overwhelmed by the more general "PxxP" motif. DSTAR, despite having no access to the prior grouping knowledge given to the other algorithms, was the only algorithm able to extract the specific SH3-binding motifs.
One might argue that since the MTM algorithms were applied to the set of all SH3-binding sequences, which contained either of the motifs "PxxPx[KR]" and "[KR]xxPxxP", it is unsurprising that only the general "PxxP" motif was extracted instead of the more specific motifs. The OTM approach may be more suitable for extracting the specific motifs, since it does not consider the SH3-binding sequences in a "wholesale" manner as the MTM approach does. We therefore applied MEME, Gibbs Sampler and SSTAR to the interacting protein partners of each individual SH3 protein in the SH3PxxP dataset. In total, the OTM approach can be applied to the 22 SH3 proteins that bind more than one protein sequence. For each algorithm, we used the same parameters as in the MTM approach, except that Minimum Motif Occurrence = 2. We deemed a motif to be extracted successfully if more than 50% of a segment set within the top 50 sets extracted expressed the motif, and that 50% comprised at least 2 instances. For MEME, the "PxxP" motif was extracted for 3 SH3 proteins (Abp1, Rvs167, Bzz1) and the "PxxPx[KR]" motif was extracted for 2 other SH3 proteins (Ysc84, Myo3). Gibbs Sampler extracted the "PxxP" and "PxxPx[KR]" motifs for 1 (Sho1) and 2 (Yfr024c, Ysc84) SH3 proteins, respectively. Finally, for SSTAR, the "PxxP" motif was extracted for 8 SH3 proteins (Fus1, Bbc1, Rvs167, Hse1, Bzz1, Myo3, Hof1, Myo5) and the "PxxPx[KR]" motif was extracted for 2 other SH3 proteins (Yfr024c, Ysc84). Again, all the algorithms failed to extract the "[KR]xxPxxP" motif within the top 50 outputs for any of the SH3 proteins. In comparison, DSTAR extracted the specific "PxxPx[KR]" and "[KR]xxPxxP" motifs for more SH3 proteins (Figure 3).
Since DSTAR extracts correlated motifs, it is interesting to further analyze the sequence segments extracted in association with the three proline-rich motifs, as shown in Figure 3. We were intrigued to discover that all associated sequence segments extracted together with "PxxP", "PxxPx[RK]" and "[RK]xxPxxP" by DSTAR were found within SH3 domains. In addition, we discovered that all associated sequence segments of the three proline-rich motifs expressed a general "PxxY" consensus. Specifically, DSTAR extracted "GxxPxNY" as the associated motif of the "PxxPx[KR]" motif. A further check of the structural data (PDB ID: 1AVZ) of an experimentally determined interaction between an SH3 protein and a protein expressing a "PxxPx[KR]" motif reveals that the sequence segment in the SH3 domain expressing the "GxxPxNY" motif indeed forms a binding interface with the segment expressing the "PxxPx[RK]" motif (Figure 2). Hence, in this particular case, DSTAR has extracted correlated motifs that are actual binding motifs.
TGFβ signaling pathway
Next, we applied DSTAR to the interaction network of the TGFβ signaling pathway that was derived using LUMIER [23], an automated high-throughput protein interaction detection technology that can detect phosphorylation-dependent interactions. Note that the original experiment was not specifically geared toward detecting interactions of any particular protein domain or motif. Hence, unlike the SH3PxxP dataset, it is not immediately apparent whether any relevant motif pairs can be found in the interaction network. We applied DSTAR to this interaction dataset to see whether we could extract any interesting motif pairs. The dataset was retrieved from the BIND database and consists of 446 interactions among 214 proteins. DSTAR was applied to the dataset with the same parameters used for the SH3PxxP dataset. As we do not know what to expect as the correct answer, we focused on validating the top motif pair extracted. Interestingly, DSTAR extracted a motif pair, with general consensus patterns "[TA]E[LI]Y[NQ]T" and "GKT[CIS][ILT][IL]", from 87 unique interactions as our top output (Additional file 1). For ease of discussion, let us denote the motif pair as (X, Y). First, we verified that (X, Y) is not likely to occur by chance, as the estimated probability (p-value) of getting a motif pair with the same interaction set size is less than 0.001 (by testing the motif pair on 1000 randomly generated interaction datasets with the same network topology and sequences). Hence, we conjectured that the motif pair is a possible key interaction mechanism in the TGFβ signaling pathway.
We also found that the sequence segment set of motif Y is enriched in known kinase phosphorylation motifs (27 sites in 50 segments, based on results from PhosphoMotif Finder [31]). To determine the significance of finding 27 sites in the segment set, we generated 1000 segment sets, each containing 50 segments randomly selected from the same protein set. None of them contained at least 27 segments with the phosphorylation motifs, implying an estimated p-value < 0.001.
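The resampling test described above can be sketched as an empirical p-value computation. This is an illustration under assumptions: `has_site` is a hypothetical predicate standing in for the PhosphoMotif Finder annotation, the pool of candidate segments is made up, and the function names are not from the paper.

```python
import random


def null_counts(pool, has_site, set_size, n_sets, rng):
    """Count annotated segments in each of n_sets random segment sets."""
    return [
        sum(1 for s in rng.sample(pool, set_size) if has_site(s))
        for _ in range(n_sets)
    ]


def empirical_p_value(observed, counts):
    """Fraction of random sets with a count at least as large as observed.

    A result of 0.0 should be reported as p < 1 / len(counts).
    """
    return sum(1 for c in counts if c >= observed) / len(counts)


# Toy pool: 200 candidate segments, 20 of which carry an annotated site.
pool = [f"seg{i}" for i in range(200)]
annotated = set(pool[:20])
rng = random.Random(0)
counts = null_counts(pool, annotated.__contains__, set_size=50, n_sets=1000, rng=rng)
p = empirical_p_value(27, counts)
print("p < 0.001" if p == 0 else f"p = {p:.3f}")
```

With 1000 random sets, the smallest non-zero estimate is 0.001, which is why observing no random set at or above the observed count is reported as p < 0.001.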
We list the over-represented phosphorylation motifs in Table 2 (for a detailed listing of all the phosphorylation sites, see Additional file 1). Further analysis also showed that 5 out of 6 associated sequence segments of motif X were found within kinase protein domains (determined using HMMER from Pfam). This biological characterization of our extracted motif pair (X, Y), with X as kinase motifs and Y as phosphorylation motifs, concurs with the fact that signalling pathways are typically regulated by kinases through protein phosphorylation. This further indicates that our method has extracted a biologically feasible motif pair from the TGFβ interaction dataset.
Table 2. The over-represented phosphorylation site motifs found by DSTAR
We also investigated whether such kinase phosphorylation motifs may be extracted using the OTM approach. For each kinase protein found in Y by DSTAR, we submitted its binding partners to MEME (ZOOPS mode, Motif Width = 4 to 8), Gibbs Sampler (Motif Sampler mode, Motif Width = 4 to 8, Expected Motif Number ≥ 2) and SSTAR (l = 6, 7, 8, d = 1 and k_n = 5). We found that over-represented phosphorylation motifs could be found within the top 10 output segment sets for only 2 out of the 5 kinase proteins by each of MEME, Gibbs Sampler and SSTAR (based on results from PhosphoMotif Finder).
Note that the above OTM approach relied on the pre-grouping of kinase proteins to guide the motif discovery (and yet its results were still not as good as DSTAR's motifs). In practice, such specific prior biological knowledge may not be available. In that case, in order to discover that (X, Y) is a significant interaction mechanism in the TGFβ signaling pathway, one would first need to repeatedly mine motifs in all possible groupings of the protein sequences before finding significant correlations between the motifs extracted from the protein groups. This can be a laborious process: even if we were to use the proteins' domain information to pre-group the proteins, there could be a large number of domains involved, and the performance may be limited by the coverage of the domain information. DSTAR, on the other hand, depends on no such information and found the correlated motif pairs directly from the input interaction data in a single process.
Conclusion
Discovery of novel binding motifs acting as interaction switches for biological circuits can lead to invaluable insights for important applications such as drug discovery, as various short binding motifs have been found to be associated with disease pathways. However, such motifs have also been known to be hard to find both experimentally and computationally [2].
The recently available protein-protein interaction data present a rich data source to aid in such important discoveries through motif discovery algorithms. These efforts can be hindered by the sparse and noisy nature of existing protein interaction data, as well as by the inadequacy of current biological knowledge. In this paper, we have proposed a novel approach of mining correlated de novo motifs from interaction data. We formulated our approach as an (l, d)-motif pair finding problem, for which we gave an exact algorithm, DMOTIF, as well as an approximation algorithm, DSTAR. Our evaluation results have shown that our proposed approach can eliminate the need for prior knowledge of protein groups during the discovery process. Such functionality allows the discovery of motifs not to be constrained by inadequate biological knowledge. The approach is also more robust in extracting motifs from noisy interaction data. Of course, since DSTAR is devised for finding linear sequence motifs, it would fail if one of the correlated motifs is a structural one. However, it may still be used to identify short conserved sequence regions that form parts of such structural motifs. Given that existing protein structural data are still very limited when compared to available protein-protein interaction data, short conserved sequence regions identified by DSTAR could facilitate further biological experiments such as mutagenesis studies.
While we have presented an approximation algorithm, DSTAR, to speed up the extraction of motif pairs from interaction data, more work needs to be done to scale up the approach to handle genome-wide interaction data or the larger DNA-protein interaction data. Also, as real biological motifs can be of varying lengths, we will need to extend our current approach to discover binding motifs that are not of any predefined length. We leave these as future work.
Methods
Preliminaries
Let s = a_{1}a_{2}a_{3}...a_{n} be a length-n protein sequence defined over the alphabet Σ of 20 amino acids, and let s[u, v] denote the substring of s starting at position u and ending at position v. When the substring's length l is fixed, we simply write s[u] for s[u, u + l − 1]. We call such a substring the l-substring at position u.
The (l, d)-motif finding problem
The definition of the (l, d)-motif was originally proposed in [17] to model motifs in biological sequences. Consider a set S = {s_{1}, s_{2}, s_{3}, ..., s_{t}} of t protein sequences of length n. A length-l pattern p is an (l, d)-motif in S' ⊆ S if every sequence s_{i} ∈ S' has at least one l-substring s_{i}[u] that differs from p by at most d mismatches. Such s_{i}[u]'s are termed the instances of p. In their work, Pevzner et al. [17] searched for the (l, d)-motif p that has at least one instance in each sequence in S. In our work, it is important to find motifs from a significantly large subset S' of S since, in some cases, there is no guarantee that every input sequence contains an instance of the motif. In other words, for a given (l, d)-motif p, let S_{d}(p) be {s ∈ S | s contains an l-substring of distance at most d from p}. Given the minimum number of instances threshold k_{n}, we then define the general (l, d)-motif finding problem as finding all (l, d)-motifs p in S such that |S_{d}(p)| ≥ k_{n}.
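The definition above can be illustrated with a minimal sketch. The function names (`hamming`, `has_instance`, `support`) and the toy sequences are hypothetical and chosen only for illustration; they are not taken from the DMOTIF or DSTAR implementation.

```python
from typing import Dict, Set


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def has_instance(seq: str, p: str, d: int) -> bool:
    """True if seq contains an l-substring within d mismatches of p."""
    l = len(p)
    return any(hamming(seq[u:u + l], p) <= d for u in range(len(seq) - l + 1))


def support(seqs: Dict[str, str], p: str, d: int) -> Set[str]:
    """Names of the sequences that contain at least one instance of p."""
    return {name for name, s in seqs.items() if has_instance(s, p, d)}


# Toy input (made up): two sequences carry a variant of the pattern, one does not.
seqs = {"s1": "MKPPLPARL", "s2": "AAPALPSKQ", "s3": "GGGGGGGGG"}
k_n = 2
found = support(seqs, "PPLP", d=1)
print(sorted(found), len(found) >= k_n)
```

With these toy sequences, the pattern is supported by two of the three sequences when one mismatch is allowed, so it would be reported for k_{n} = 2 even though it does not occur in every sequence.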
The (l, d)-motif pair finding problem
We extend the problem of finding (l, d)-motifs in a set of sequences to that of finding motif pairs in a set of sequence pairs, in order to mine interacting motifs from a set of protein-protein interactions. Given a protein interaction dataset I ⊆ S × S of size m over the set of proteins S, where for any (s_{i}, s_{j}) ∈ I we have i ≤ j, we want to find a pair of (l, d)-motifs that is overrepresented in I. That is, we want to find an (l, d)-motif pair (x, y) that has the following characteristics:
(1) Let I_{(x, y)} be the set of interactions between S_{d}(x) and S_{d}(y), namely, I_{(x, y)} = I ∩ (S_{d}(x) × S_{d}(y)). We require that |I_{(x, y)}| ≥ k_{i} for a minimum number of interactions threshold k_{i}.
(2) Let S'_{d}(x) be the subset of S_{d}(x) containing the sequences that interact with those in S_{d}(y). Similarly, let S'_{d}(y) be the subset of S_{d}(y) containing the sequences that interact with those in S_{d}(x). We also require that |S'_{d}(x)|, |S'_{d}(y)| ≥ k_{n}.
We call this problem the (l, d)-motif pair finding problem. For every (s_{i}, s_{j}) ∈ I_{(x, y)}, we want to find (s_{i}[u], s_{j}[v]) that are instances of x and y. Biologically, (s_{i}[u], s_{j}[v]) may correspond to the functional regions in the proteins s_{i} and s_{j} that mediate their interaction.
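The two conditions can be checked for a given candidate pair (x, y) as in the following sketch. All names and the toy data are hypothetical. One assumption is made explicit: each interaction is tested in both orientations, since interactions are treated as unordered, and an interaction is counted once even if both orientations match.

```python
from typing import Dict, List, Set, Tuple


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def support(seqs: Dict[str, str], p: str, d: int) -> Set[str]:
    """Names of the sequences containing an l-substring within d mismatches of p."""
    l = len(p)
    return {
        name for name, s in seqs.items()
        if any(hamming(s[u:u + l], p) <= d for u in range(len(s) - l + 1))
    }


def pair_support(seqs, interactions: List[Tuple[str, str]], x: str, y: str, d: int):
    """Interactions linking a carrier of x to a carrier of y, plus the carriers involved."""
    sx, sy = support(seqs, x, d), support(seqs, y, d)
    hits, sx_int, sy_int = set(), set(), set()
    for a, b in interactions:
        # Assumption: interactions are unordered, so try both orientations.
        for left, right in ((a, b), (b, a)):
            if left in sx and right in sy:
                hits.add((a, b))
                sx_int.add(left)
                sy_int.add(right)
    return hits, sx_int, sy_int


def is_motif_pair(seqs, interactions, x, y, d, k_i, k_n) -> bool:
    """Check both thresholds: enough interactions and enough carriers on each side."""
    hits, sx_int, sy_int = pair_support(seqs, interactions, x, y, d)
    return len(hits) >= k_i and min(len(sx_int), len(sy_int)) >= k_n


# Toy data (made up): a1/a2 carry variants of PPLP, b1/b2 carry variants of WYDE.
seqs = {
    "a1": "MKPPLPARL", "a2": "AAPALPSKQ",
    "b1": "TTWYDEGHH", "b2": "QQWFDEKKK",
    "c1": "GGGGGGGGG",
}
interactions = [("a1", "b1"), ("a2", "b2"), ("a1", "c1")]
print(is_motif_pair(seqs, interactions, "PPLP", "WYDE", d=1, k_i=2, k_n=2))
```

The interaction (a1, c1) does not count towards the pair because c1 carries no instance of either motif.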
Scoring function
It is likely that many (l, d)-motif pairs (x, y) exist within a given interaction dataset I over the set of proteins S. We define here a scoring function to rank them systematically.
Let O(S_{x}, S_{y}) be the observed number of interactions between two protein sets S_{x }and S_{y }containing the motifs x and y respectively. Let E(S_{x}, S_{y}) be the expected number of interactions between S_{x }and S_{y}. We estimate E(S_{x}, S_{y}) based on the assumption that interactions occur at random. Since the probability of any interaction occurring between two random proteins in S is , we have
where the term in the brackets computes the total number of interactions possible between the proteins in S_{x }and S_{y}. Based on the idea of χ^{2}statistic, we formulate the following function χ to score a given pair of (x, y)motif containing protein sets S_{x }and S_{y }as
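The quantities described above can be sketched in code. The exact expressions used here are assumptions, not a transcription of the original formulas: the interaction probability is taken as |I| divided by the number of unordered pairs of distinct proteins in S, the possible interactions between S_{x} and S_{y} are counted as unordered pairs of distinct proteins (so pairs inside S_{x} ∩ S_{y} are not counted twice), and the score uses the χ^{2}-style form (O − E)^{2} / E. The function names are hypothetical.

```python
from math import comb


def expected_interactions(n_x: int, n_y: int, n_xy: int,
                          n_proteins: int, n_interactions: int) -> float:
    """Expected number of interactions between two protein sets under random interaction.

    n_x, n_y : sizes of the two protein sets
    n_xy     : size of their intersection
    Assumption: only unordered pairs of distinct proteins are considered.
    """
    p = n_interactions / comb(n_proteins, 2)
    # Ordered count n_x * n_y counts each pair inside the intersection twice
    # and also includes n_xy self-pairs; remove both.
    possible_pairs = n_x * n_y - comb(n_xy, 2) - n_xy
    return possible_pairs * p


def chi_score(observed: int, expected: float) -> float:
    """Chi-square style score: large when observed deviates strongly from expected."""
    return (observed - expected) ** 2 / expected


# Toy numbers (made up): 20 proteins, 19 interactions, disjoint sets of size 5 and 4.
e = expected_interactions(n_x=5, n_y=4, n_xy=0, n_proteins=20, n_interactions=19)
print(e, chi_score(6, e))
```

In practice one would rank only pairs whose observed count exceeds the expected count, since this form of the score does not by itself distinguish enrichment from depletion.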
Methods
For illustration, we will first give an exact algorithm, DMOTIF, to find co-occurring motifs in I. Then, we will present our approximation algorithm, DSTAR, which can offer significant speedup at the cost of slight accuracy degradation. The use of DSTAR for scaling up is necessary for dealing with the large input datasets encountered in practice.
DMOTIF algorithm
The basic idea of the exact algorithm is to enumerate all possible (l, d)-motif pairs and then check whether they have enough instances to satisfy the minimum thresholds k_{n} and k_{i}. Note that any (l, d)-motif pair must be within Hamming distance d of some (l, d)-motif pair instance. Given a string p of length l, we define X_{p} to be the set of all strings p' of length l with Hamming distance at most d from p. The algorithm named DMOTIFBASIC in Figure 5 describes the most straightforward brute-force approach to the problem. Observe that the instances of any (l, d)-motif x are at distance at most 2d from one another. Pevzner et al. [17] described a method to compute all instances of an (l, d)-motif by transforming the problem into finding cliques in a t-partite graph G. In this graph, all l-substrings in all s_{i} ∈ S are the nodes, and any two of them are connected by an edge if (a) they originate from distinct proteins and (b) they are at most 2d apart. Thus, finding the (l, d)-motifs having at least k_{n} instances is equivalent to finding cliques of size at least k_{n} in G, which is an NP-hard problem.
Figure 5. The DMOTIFBASIC algorithm.
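To give a feel for the size of the candidate sets X_{p} that a brute-force enumeration has to visit, the following sketch generates all strings within Hamming distance d of p over the 20-letter amino acid alphabet. The function name `neighborhood` is hypothetical, and this is not the listing of Figure 5.

```python
from itertools import combinations, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def neighborhood(p: str, d: int, alphabet: str = AMINO_ACIDS) -> set:
    """All strings of length len(p) within Hamming distance d of p."""
    result = {p}
    for k in range(1, d + 1):
        # Choose k positions to overwrite, then every letter combination for them.
        for positions in combinations(range(len(p)), k):
            for letters in product(alphabet, repeat=k):
                chars = list(p)
                for pos, ch in zip(positions, letters):
                    chars[pos] = ch
                # Overwriting with the original letter yields a closer string;
                # the set removes such duplicates.
                result.add("".join(chars))
    return result


for d in (1, 2):
    print(d, len(neighborhood("PPLP", d)))
```

The number of candidates grows as the sum over k ≤ d of C(l, k)·19^k, which is why the enumeration becomes expensive when d is large relative to l.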
We attempt to reduce the complexity of the problem by assuming that k_{n} ≥ 3 and finding all cliques of size 3 first. In other words, we first find three l-substrings, (s_{i}[u], s_{j}[v], s_{k}[w]), from distinct sequences s_{i}, s_{j}, and s_{k}, and then only try those candidate (l, d)-motifs p ∈ X_{s_{i}[u]} ∩ X_{s_{j}[v]} ∩ X_{s_{k}[w]}. For convenience, we call the string triplet (s_{i}[u], s_{j}[v], s_{k}[w]) a triangle within s_{i}, s_{j}, and s_{k}, and we denote the set intersection by X_{(s_{i}[u], s_{j}[v], s_{k}[w])}.
In the case of interaction data, we have to find all interaction triplets (s_{i}, s_{i'}), (s_{j}, s_{j'}), (s_{k}, s_{k'}) and compute the triangles from (s_{i}, s_{j}, s_{k}) and (s_{i'}, s_{j'}, s_{k'}). But as interaction is commutative (at least in our current consideration), i.e. (s_{i}, s_{j}) is equivalent to (s_{j}, s_{i}), we also have to consider the latter configuration when we choose the interaction triplets. As such, we let I_{d} be the set of ordered pairs that contains both ⟨s_{i}, s_{j}⟩ and ⟨s_{j}, s_{i}⟩ for each (s_{i}, s_{j}) ∈ I. The algorithm can then start by choosing the ordered pair triplets from I_{d} (|I_{d}| ≈ 2m). The complete listing of the algorithm, DMOTIF, is presented in Figure 6.
Figure 6. The DMOTIF algorithm.
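The two ingredients described above, the candidate set of a triangle and the ordered-pair set I_{d}, can be sketched as follows. This is only an illustration with hypothetical function names, not the listing of Figure 6; in particular, it omits the bookkeeping that avoids recomputing candidate sets.

```python
from itertools import combinations, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def neighborhood(p: str, d: int) -> set:
    """All strings of length len(p) within Hamming distance d of p."""
    result = {p}
    for k in range(1, d + 1):
        for positions in combinations(range(len(p)), k):
            for letters in product(AMINO_ACIDS, repeat=k):
                chars = list(p)
                for pos, ch in zip(positions, letters):
                    chars[pos] = ch
                result.add("".join(chars))
    return result


def triangle_candidates(a: str, b: str, c: str, d: int) -> set:
    """Candidate motifs within distance d of all three substrings of a triangle."""
    # Three instances of one motif must be pairwise at most 2d apart.
    if any(hamming(u, v) > 2 * d for u, v in ((a, b), (a, c), (b, c))):
        return set()
    return neighborhood(a, d) & neighborhood(b, d) & neighborhood(c, d)


def ordered_pairs(interactions) -> set:
    """Both orientations of every interaction (a self-interaction yields one pair)."""
    out = set()
    for a, b in interactions:
        out.add((a, b))
        out.add((b, a))
    return out


print(triangle_candidates("PPLP", "PALP", "PPLA", d=1))
print(sorted(ordered_pairs([("a", "b"), ("c", "c")])))
```

Restricting the search to triangle intersections typically leaves far fewer candidates than the neighborhood of a single substring, which is the source of the speedup over the brute-force enumeration.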
In practice, DMOTIF runs much faster than the straightforward brute-force algorithm (which we have also implemented as a benchmark). However, the memory requirement of DMOTIF could be much larger than that of the latter, as we have to store the sets X for the different triangles in the sets T_{l} and T_{r} to avoid redundant computations. When d is large relative to l, there are many candidate (l, d)-motifs to check for a given triangle. When the number of triangles is also large, even DMOTIF slows to a crawl. In view of that, we propose the following approximation algorithm, DSTAR. Before we start, let us define the (l, d)-star pair finding problem and show how it approximates the (l, d)-motif pair finding problem.
The (l, d)-star pair finding problem
For any given pair of l-substrings (s_{i}[u], s_{j}[v]) from some interaction (s_{i}, s_{j}), there may be an exponential (with respect to d) number of possible (l, d)-motif pairs (x, y) within distance d. Hence, even after speeding up the algorithm with filtering, DMOTIF can only handle relatively small-sized problems. In our proposed algorithm DSTAR, we aim to find only the instances of a motif pair (x, y) instead of finding the motifs (x, y) themselves, since the latter may not even occur in S.
DSTAR algorithm
Recall that given an (l, d)-motif x, any two instances of x, x_{i} and x_{j}, are at most 2d apart. Hence, if we manage to get one instance x_{i} of x, all the other instances of x are surely in S_{2d}(x_{i}), the set of sequences containing an l-substring within distance 2d of x_{i}. In the context of interaction data, we first get all l-substring pairs (s_{i}[u], s_{j}[v]) from each pair of interacting proteins (s_{i}, s_{j}) ∈ I. Next, we find those (s_{i}[u], s_{j}[v]) that satisfy two conditions: (1) There are more than k_{i} interactions between S_{2d}(s_{i}[u]) and S_{2d}(s_{j}[v]). (2) Let the set of these interactions be denoted similarly by I_{(s_{i}[u], s_{j}[v])}, and let S'_{2d}(s_{i}[u]) and S'_{2d}(s_{j}[v]) be the subsets of S_{2d}(s_{i}[u]) and S_{2d}(s_{j}[v]) involved in these interactions; we require that both |S'_{2d}(s_{i}[u])|, |S'_{2d}(s_{j}[v])| ≥ k_{n}. The pair of protein sets (S'_{2d}(s_{i}[u]), S'_{2d}(s_{j}[v])) is denoted as an (l, d)-star pair. Our simulation experiments indicate that DSTAR yields a good approximation of the exact solution while being much more efficient when the dataset is large. The complete listing of the algorithm is in Figure 7.
Figure 7. The DSTAR algorithm.
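The star pair idea can be illustrated with a naive sketch that uses every l-substring pair of every interaction as a pair of centres and collects the sequences within distance 2d of each centre. This is not the listing of Figure 7: it ignores the position-pair matrices and the scoring step, it uses a "≥ k_{i}" test as in the motif pair definition, and all names and toy sequences are hypothetical.

```python
from typing import Dict, List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def within(seq: str, center: str, dist: int) -> bool:
    """True if seq has a substring of len(center) within dist mismatches of center."""
    l = len(center)
    return any(hamming(seq[u:u + l], center) <= dist for u in range(len(seq) - l + 1))


def star_pairs(seqs: Dict[str, str], interactions: List[Tuple[str, str]],
               l: int, d: int, k_i: int, k_n: int) -> dict:
    """Map each qualifying centre pair to (interaction count, left set, right set)."""
    cache = {}

    def star(center: str) -> set:
        # Sequences with a substring within 2d of the centre (computed once per centre).
        if center not in cache:
            cache[center] = {n for n, s in seqs.items() if within(s, center, 2 * d)}
        return cache[center]

    found, seen = {}, set()
    for a, b in interactions:
        for left, right in ((a, b), (b, a)):  # interactions are unordered
            for u in range(len(seqs[left]) - l + 1):
                cx = seqs[left][u:u + l]
                for v in range(len(seqs[right]) - l + 1):
                    cy = seqs[right][v:v + l]
                    if (cx, cy) in seen:
                        continue
                    seen.add((cx, cy))
                    sx, sy = star(cx), star(cy)
                    hits, px, py = set(), set(), set()
                    for p, q in interactions:
                        for s, t in ((p, q), (q, p)):
                            if s in sx and t in sy:
                                hits.add((p, q))
                                px.add(s)
                                py.add(t)
                    if len(hits) >= k_i and len(px) >= k_n and len(py) >= k_n:
                        found[(cx, cy)] = (len(hits), px, py)
    return found


# Toy data (made up): variants of PPLP on the a-side, variants of WYDE on the b-side.
seqs = {
    "a1": "MKPPLPARL", "a2": "AAPALPSKQ", "a3": "HHPPLAGGT",
    "b1": "TTWYDEGHH", "b2": "QQWFDEKKK", "b3": "NNWYDQRRR",
}
interactions = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]
result = star_pairs(seqs, interactions, l=4, d=1, k_i=3, k_n=3)
print(result[("PPLP", "WYDE")][0])
```

Because the centres are taken from the data rather than from the full neighborhood X_{p}, no enumeration over the 20-letter alphabet is needed; the price is that overlapping centres near a true instance also qualify, so a ranking step is still required to pick the best star pair.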
Time complexity
The loop in line 1 takes O(m) time. The next loop in line 2 takes another O(m) time. Both pairwise sequence comparisons in steps 3 and 4 require O(n^{2}) time. Each time, the number of position pairs (u, v) in P_{1} × P_{2} could also reach O(n^{2}). Updating the corresponding matrix and array entries can be done in constant time with a lookup table (one could save space using hash sets, but the updating would then take amortized constant time instead). The loop in line 11 requires at most O(n^{2}) time for all entries [u, v], each requiring at most O(t) time to build the protein sets needed for computing the χ-score. Therefore, in the worst case, DSTAR runs in O(m^{2}n^{2} + mtn^{2}) time. We also need to be mindful that the memory requirement to store the matrix and arrays is max{O(mn^{2}), O(tn)}.
Comparison between DMOTIF and DSTAR
First, we investigate the effect of data size on the performance of our two approaches. We ran our evaluation on 5 different datasets containing artificial interaction sets I of sizes ranging from 10 to 150 (note that for some weaker motifs, we did not evaluate up to size 150 as the running time of DMOTIF became too long to be measured). In each interaction set, the protein sequences in all interactions are distinct; in other words, |S| = 2|I|. We also planted the (l, d)-motif pair in only half of the interactions in I to effect a fixed ε = 0.50 on all datasets.
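A dataset of this shape can be generated along the following lines. This is a sketch under assumptions: sequences are drawn uniformly at random over the 20 amino acids, each planted instance has at most d substituted positions, and the parameter `planted_fraction` is assumed to play the role of the fixed ratio 0.50 mentioned above. The function names, motifs and sizes are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(motif: str, d: int, rng: random.Random) -> str:
    """Copy of motif with d randomly chosen positions redrawn (at most d mismatches)."""
    chars = list(motif)
    for pos in rng.sample(range(len(motif)), d):
        chars[pos] = rng.choice(AMINO_ACIDS)
    return "".join(chars)


def plant(seq: str, instance: str, rng: random.Random) -> str:
    """Overwrite a random window of seq with instance, keeping the length unchanged."""
    u = rng.randrange(len(seq) - len(instance) + 1)
    return seq[:u] + instance + seq[u + len(instance):]


def contains_instance(seq: str, motif: str, d: int) -> bool:
    """True if seq has a substring within d mismatches of motif."""
    l = len(motif)
    return any(
        sum(a != b for a, b in zip(seq[u:u + l], motif)) <= d
        for u in range(len(seq) - l + 1)
    )


def simulate(m: int, n: int, x: str, y: str, d: int,
             planted_fraction: float, seed: int = 0):
    """m interactions over 2m distinct random sequences of length n."""
    rng = random.Random(seed)
    seqs, interactions = {}, []
    n_planted = int(planted_fraction * m)
    for i in range(m):
        sa = "".join(rng.choice(AMINO_ACIDS) for _ in range(n))
        sb = "".join(rng.choice(AMINO_ACIDS) for _ in range(n))
        if i < n_planted:
            # Plant one instance of x on one side and one instance of y on the other.
            sa = plant(sa, mutate(x, d, rng), rng)
            sb = plant(sb, mutate(y, d, rng), rng)
        seqs[f"a{i}"], seqs[f"b{i}"] = sa, sb
        interactions.append((f"a{i}", f"b{i}"))
    return seqs, interactions


seqs, interactions = simulate(m=10, n=30, x="PPLPAR", y="WYDEGH", d=1,
                              planted_fraction=0.5, seed=1)
print(len(seqs), len(interactions))
```

Fixing the seed makes a run reproducible, which helps when the same dataset has to be given to two algorithms for comparison.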
Evaluation was performed here by checking whether the planted motifs were reported as the best motif by the motif finding algorithm. Table 3 shows the average result over 10 datapoints (|I| = 10, 20, ..., 100) in each of the 5 evaluation datasets. Figure 8 displays the running time of both algorithms on different data sizes, averaged over the 5 datasets. We used an x86 Pentium 4 M 1.6 GHz machine with 512 MB of memory for running the comparison. We observed that when the (l, d)-motifs get less specific and k_{n} is small, the planted motifs could be masked by other signals present in the protein sequences. This happened in one of the datapoints of (6, 1)-motifs with |I| = 10, in which DSTAR failed to achieve 100% sensitivity. Overall, it is clear that DSTAR performs only slightly worse than DMOTIF, while the running time of DSTAR is much better than that of DMOTIF for larger datasets. The running time of DMOTIF is also highly influenced by the strength/specificity of the (l, d)-motif. Compared with DSTAR, the running time of DMOTIF increases much more rapidly when the motif gets less specific. For example, for |I| = 100, the running times of DMOTIF on (8, 1), (7, 1), (6, 1) motifs are 797.4 s, 1930.7 s and 17385.2 s, respectively. For the same datapoints, DSTAR only required 253 s, 266.5 s, and 306.1 s, respectively. Indeed, this observation was further confirmed when we tried DMOTIF on our real biological dataset later: it was still running after 10 hours, while DSTAR terminated in less than 20 minutes.
Table 3. Comparison of specificity and sensitivity between DMOTIF and DSTAR
Figure 8. Comparison of running time between DMOTIF and DSTAR. Observe that the running time of DMOTIF increases rapidly as the input data grows and also as the (l, d)-motif gets weaker. All experiments were run on an x86 Pentium 4 1.6 GHz machine with 512 MB of memory.
On choosing the parameters k_{n }and k_{i}
As with many other algorithms, setting appropriate parameters can be a challenge for the user. Most of the time, these cannot be derived directly from the data. For DSTAR, one must set the minimum threshold parameters k_{n} and k_{i} for the minimum number of instances of each motif and the minimum number of interactions that must be involved to derive the motifs. We performed a set of tests in which we varied the values of k_{n} and k_{i}. The trend shows that the accuracy is highest when k_{n} and k_{i} are near their real values Z and N, respectively, where Z denotes the actual number of motif instances in the data and N the number of true interactions between the motif pair instances in the data. For strong motifs, accuracy is not affected even when k_{n} or k_{i} are set to relatively low values. For weaker motifs, it is easier to find spurious motifs, and hence when k_{n} and k_{i} are too low, the performance will be poor. Hence we suggest using large enough k_{n} and k_{i} and reducing them if no result can be found. In general, as with other existing motif algorithms, when the user has a good estimate of the length of the motif, the quality of the motifs found will be better. The details can be found in Figure 9.
Figure 9. Effect of varying k_{n} and k_{i} on the performance of DSTAR. The experiments suggest that the nearer k_{n} and k_{i} are to their actual values (the actual numbers of motifs and motif pairs in the dataset), the better the performance of the algorithm.
Availability and requirements
Project name: Correlated motif discovery project.
Project homepage: http://www.comp.nus.edu.sg/~bioinfo/hugowill/DSTAR.html
Operating Systems: Windows XP, RedHat Linux, Solaris.
Programming Language: C.
License: The binaries used in the experiments are freely available from the website and in Additional file 2.
Additional file 2. Program. The file contains the binaries of the C implementation of DSTAR for Windows XP, Linux and Solaris OS. Also included in the zip file are the C2BIND dataset as sample input and its sample output file.
Format: ZIP. Size: 136 KB.
Authors' contributions
The project was initiated by SHT. The problem formulation was further polished and finalized by all authors. The algorithm design was done by WH, SHT and WKS. DMOTIF and DSTAR were implemented by WH, and all experiments were designed by WKS, SHT and SKN, and were run by SHT. All authors contributed equally to the writeup and the analysis of the results obtained.
Acknowledgements
We thank Dr. Limsoon Wong for his feedback over several discussions, which significantly improved the quality of this work. SHT and SKN are funded by the Agency for Science, Technology and Research (A*STAR) of Singapore. WH and WKS are funded by National University of Singapore.
References

1. Puntervoll P, Linding R, Gemund C, Chabanis-Davidson S, Mattingsdal M, Cameron S, Martin DM, Ausiello G, Brannetti B, Costantini A, Ferre F, Maselli V, Via A, Cesareni G, Diella F, Superti-Furga G, Wyrwicz L, Ramu C, McGuigan C, Gudavalli R, Letunic I, Bork P, Rychlewski L, Kuster B, Helmer-Citterich M, Hunter WN, Aasland R, Gibson TJ: ELM server: A new resource for investigating short functional sites in modular eukaryotic proteins. Nucleic Acids Res 2003, 31(13):3625-3630.
2. Neduva V, Russell RB: Linear motifs: evolutionary interaction switches. FEBS Lett 2005, 579(15):3342-3345.
3. Tong AHY, Drees B, Nardelli G, Bader GD, Brannetti B, Castagnoli L, Evangelista M, Ferracuti S, Nelson B, Paoluzi S, Quondam M, Zucconi A, Hogue CW, Fields S, Boone C, Cesareni G: A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules. Science 2002, 295(5553):321-324.
4. Cesareni G, Panni S, Nardelli G, Castagnoli L: Can we infer peptide recognition specificity mediated by SH3 domains? FEBS Lett 2002, 513:38-44.
5. Hu H, Columbus J, Zhang Y, Wu D, Lian L, Yang S, Goodwin J, Luczak C, Carter M, Chen L, James M, Davis R, Sudol M, Rodwell J, Herrero JJ: A map of WW domain family interactions. Proteomics 2004, 4(3):643-655.
6. Goehler H, Lalowski M, Stelzl U, Waelter S, Stroedicke M, Worm U, Droege A, Lindenberg KS, Knoblich M, Haenig C, Herbst M, Suopanki J, Scherzinger E, Abraham C, Bauer B, Hasenbank R, Fritzsche A, Ludewig AH, Bussow K, Coleman SH, Gutekunst CA, Landwehrmeyer BG, Lehrach H, Wanker EE: A protein interaction network links GIT1, an enhancer of huntingtin aggregation, to Huntington's disease. Mol Cell 2004, 15(6):853-865.
7. Marti M, Good RT, Rug M, Knuepfer E, Cowman AF: Targeting malaria virulence and remodeling proteins to the host erythrocyte. Science 2004, 306(5703):1930-1933.
8. Hiller NL, Bhattacharjee S, van Ooij C, Liolios K, Harrison T, Lopez-Estrano C, Haldar K: A host-targeting signal in virulence proteins reveals a secretome in malarial infection. Science 2004, 306(5703):1934-1937.
9. Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. ISMB 1994, 2:28-36.
10. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 1993, 262(5131):208-214.
11. Jonassen I: Efficient discovery of conserved patterns using a pattern graph. Comput Appl Biosci 1997, 13(5):509-522.
12. Rigoutsos I, Floratos A: Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics 1998, 14:55-67.
13. Goh KI, Oh E, Jeong H, Kahng B, Kim D: Classification of scale-free networks. Proc Natl Acad Sci USA 2002, 99(20):12583-12588.
14. Sprinzak E, Sattath S, Margalit H: How reliable are experimental protein-protein interaction data? J Mol Biol 2003, 327(5):919-923.
15. Reiss DJ, Schwikowski B: Predicting protein-peptide interactions via a network-based motif sampler. Bioinformatics 2004, 20(Suppl 1):I274-I282.
16. Neduva V, Linding R, Su-Angrand I, Stark A, de Masi F, Gibson TJ, Lewis J, Serrano L, Russell RB: Systematic discovery of new recognition peptides mediating protein interaction networks. PLoS Biol 2005, 3(12):e405.
17. Pevzner PA, Sze SH: Combinatorial Approaches to Finding Subtle Signals in DNA Sequences. ISMB 2000, 269-278.
18. Pavesi G, Mauri G, Pesole G: An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics 2001, 17(Suppl 1):S207-S214.
19. Eskin E, Pevzner PA: Finding composite regulatory patterns in DNA Sequences.
20. Keich U, Pevzner PA: Finding motifs in the twilight zone. Bioinformatics 2002, 18(10):1374-1381.
21. Price A, Ramabhadran S, Pevzner PA: Finding Subtle Motifs by Branching from Sample Strings. Bioinformatics 2003, 19(Suppl 2):II149-II155.
22. Barrios-Rodiles M, Brown KR, Ozdamar B, Bose R, Liu Z, Donovan RS, Shinjo F, Liu Y, Dembowy J, Taylor IW, Luga V, Przulj N, Robinson M, Suzuki H, Hayashizaki Y, Jurisica I, Wrana JL: High-throughput mapping of a dynamic signaling network in mammalian cells. Science 2005, 307(5715):1621-1625.
23. Deng M, Mehta S, Sun F, Chen T: Inferring domain-domain interactions from protein-protein interactions. Genome Res 2002, 12(10):1540-1548.
24. Sprinzak E, Margalit H: Correlated sequence-signatures as markers of protein-protein interaction. J Mol Biol 2001, 311(4):681-692.
25. Ng SK, Zhang Z, Tan SH: Integrative Approach for Computationally Inferring Protein Domain Interactions. Bioinformatics 2003, 19(8):923-929.
26. Wang HD, Segal E, Ben-Hur A, Koller D, Brutlag DL: Identifying protein-protein interaction sites on a genome-wide scale.
27. Kay BK, Williamson MP, Sudol M: The importance of being proline: the interaction of proline-rich motifs in signaling proteins with their cognate domains. FASEB J 2000, 14(2):231-241.
28. Pawson T, Nash P: Assembly of Cell Regulatory Systems Through Protein Interaction Domains. Science 2003, 300(5618):445-452.
29. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. NAR 2004, (32 Database):D449-451.
30. Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, Surendranath V, Niranjan V, Muthusamy B, Gandhi TK, Gronborg M, Ibarrola N, Deshpande N, Shanker K, Shivashankar HN, Rashmi BP, Ramya MA, Zhao Z, Chandrika KN, Padma N, Harsha HC, Yatish AJ, Kavitha MP, Menezes M, Choudhury DR, Suresh S, Ghosh N, Saravana R, Chandran S, Krishna S, Joy M, Anand SK, Madavan V, Joseph A, Wong GW, Schiemann WP, Constantinescu SN, Huang L, Khosravi-Far R, Steen H, Tewari M, Ghaffari S, Blobe GC, Dang CV, Garcia JG, Pevsner J, Jensen ON, Roepstorff P, Deshpande KS, Chinnaiyan AM, Hamosh A, Chakravarti A, Pandey A: Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res 2003, 13(10):2363-2371.