Department for Molecular Biosciences, University of Oslo, P.O. Box 1041, Oslo, Blindern 0316, Norway
The Biotechnology Centre of Oslo, University of Oslo, P.O. Box 1125, Oslo, Blindern 0317, Norway
Abstract
Background
Previous studies have noted that drug targets appear to be associated with higher-degree or higher-centrality proteins in interaction networks. These studies explicitly or tacitly make choices of different source databases, data integration strategies, representation of proteins and complexes, and data reliability assumptions. Here we examined how the use of different data integration and representation techniques, or different notions of reliability, may affect the efficacy of degree and centrality as features in drug target prediction.
Results
Fifty percent of drug targets have a degree of less than nine, and ninety-five percent have a degree of less than ninety. We found that drug targets are overrepresented in higher-degree bins; this relationship is seen only for the consolidated interactome and does not depend on n-ary interaction data or its representation. Degree acts as a weak predictive feature for drug-target status, and using more reliable subsets of the data does not increase this performance. However, performance does increase if only cancer-related drug targets are considered. We also note that a protein’s membership in pathway records can act as a predictive feature that is better than degree, and that high centrality may be an indicator of a drug that is more likely to be withdrawn.
Conclusions
These results show that protein interaction data integration and cleaning are important considerations when incorporating network properties as predictive features for drug-target status. The provided scripts and data sets offer a starting point for further studies and cross-comparison of methods.
Background
Drug targets (DTs) are defined here as proteins targeted by drugs. These proteins are not necessarily the products of disease-linked genes (which we will call Disease Proteins, DPs) but can be any protein whose binding might lead to a positive effect in the treatment of a disease. Yildirim
Several studies have attempted to characterize drug targets from a theoretical point of view as such knowledge could be a tool to speed up the drug discovery process. Bioinformatics methods to characterize and predict drug targets have included: pathway and tissue enrichment, domain enrichment, number of exons and protein degree in an interaction network
Drug targets can also be characterized in terms of protein network attributes such as degree and centrality. The degree of a protein in a protein interaction network is equivalent to the number of interactions a protein is involved in, while centrality measures quantify the relative importance of a protein. Types of centrality measures include Betweenness Centrality (based on the number of shortest paths that pass through a protein) and Closeness Centrality (based on the shortest distances between that protein and all others). A number of studies have investigated drug targets in terms of such network-based metrics including degree, betweenness centrality
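As an illustration of the three metrics just described, the following is a minimal sketch in plain Python on a hypothetical five-protein toy network. It is not the code used in this study (the analyses were run in R with igraph). The closeness definition used here (number of reachable proteins divided by the sum of shortest-path distances to them) and the unnormalised betweenness are assumptions made for the example; other normalisations are in common use.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def degree(adj, node):
    """Number of interaction partners of a protein."""
    return len(adj[node])

def closeness(adj, node):
    """Reachable nodes divided by the sum of distances to them (0 if isolated)."""
    dist = bfs_distances(adj, node)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0

def betweenness(adj):
    """Unnormalised betweenness for an undirected, unweighted graph (Brandes)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                  # accumulate dependencies, farthest first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # every unordered pair was visited from both of its endpoints
    return {v: b / 2 for v, b in bc.items()}

# Hypothetical toy network (adjacency lists); protein names are placeholders.
toy = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B"], "D": ["B", "E"], "E": ["D"]}
```

On this toy network, protein B has the highest degree and also lies on the most shortest paths, which is the kind of hub behaviour the cited studies associate with drug targets.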
The initial goal of this paper was to evaluate the predictive value of two simple graph-theoretical metrics, degree and centrality, that previously have been observed to correlate with drug targets
These studies suggested that network-based metrics might be useful for drug target prediction; however, the disparate conclusions (drug targets are high-degree, middling-degree or low-degree) were confusing. In trying to reproduce some studies, we commonly had difficulties determining exactly what data sets were used and found that studies often reported average drug target degree instead of entire degree
The studies above that have investigated network based metrics of drug targets rely upon PI data, and explicitly or tacitly make choices of different source databases, data integration strategies, representation of proteins and complexes, and data reliability assumptions. Previous work from our group
Here, we examined the effect of data integration on the distribution of drug targets across degree and centrality measures (and the ability of these measures to predict drug targets). The above mentioned studies work with limited data sets: Yildirim
Next, we examined the effect of subsetting interaction data upon the drug target distribution over proteins of varying degree and centrality. We hypothesized that using subsets of interaction data deemed to be more reliable might alter this distribution and be useful for the purpose of predicting drug targets. There are several methods used to rank protein interactions according to some specific notion of reliability. Early attempts include the Expression Profile Reliability (EPR index), which compares protein interaction and RNA expression profiles, and the Paralogous Verification Method (PVM), which searches for paralogs of interactors that also interact (Deane
Further, we addressed the effect of representing all n-ary data using a spoke-model representation (where only interactions between each member of the group and one chosen protein are included) versus a matrix representation (where all possible pairwise interactions between the group of proteins are included
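The difference between the two representations can be sketched as follows. This is an illustrative example only: the member names are hypothetical, and taking the first listed member as the hub of the spoke model is an assumption of this sketch, not a statement about how iRefIndex selects it.

```python
from itertools import combinations

def spoke_edges(members, hub=None):
    """Spoke model: one edge between the chosen hub and every other member."""
    members = list(members)
    hub = members[0] if hub is None else hub   # assumption: first member is the hub
    return {frozenset((hub, m)) for m in members if m != hub}

def matrix_edges(members):
    """Matrix model: one edge for every unordered pair of members."""
    return {frozenset(pair) for pair in combinations(set(members), 2)}

# Hypothetical four-protein complex.
complex_members = ["P1", "P2", "P3", "P4"]
spoke = spoke_edges(complex_members)    # n - 1 edges
matrix = matrix_edges(complex_members)  # n * (n - 1) / 2 edges
```

For a complex of n proteins the spoke model yields n - 1 edges and the matrix model n(n - 1)/2, so the matrix representation raises the degree of every member of a large complex.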
Finally, we consider the drug target predictive ability of pathway data, a data source that overlaps with, but is complementary to, interaction data. This partial overlap drew our attention to the usefulness of pathway data for drug target prediction, and motivated us to consider a pathway-degree metric for proteins.
In summary, we have chosen degree and centrality as simple drug target predictor features, in order to study the validity of the conclusions about them found in the literature when we work with consolidated protein interaction data from iRefIndex and various decisions regarding data integration, representation and reliability. We have previously shown that network properties can be altered by these choices and we will show the potential effect of these factors on drug target prediction.
Results
Our results section is divided into five parts which examine: 3.1) integration, 3.2) selection, 3.3) representation, 3.4) pathway data and 3.5) relationship to diseases. In order to compare the effect of the source of data on the results, a series of human PINs were generated from the iRefIndex database
This is a plain text file that contains R code to reproduce all R analyses in the paper. See
Data integration
Here we test two hypotheses: First, that the high degree observed in drug targets might be related to the fact that specific databases or papers were chosen instead of a consolidated database and, therefore, this correlation might disappear after data integration, i.e., when using the iRefIndex. Secondly, that the high degree of some drug targets could be related to the inclusion of n-ary data.
Drug targets are correlated to high degree only in the full data set
In order to evaluate if drug targets are on average high degree proteins in a consolidated PIN, we compared the average degree of all nodes to drug targets in the full PIN. Table
Statistical descriptors of the degree distribution of 11 different PINs whose protein complexes have all been represented as spoke models (i.e. any n-ary data is included by a spoke-model representation). Drug targets have a higher degree on average, even though the standard deviations are equally higher. The degree distributions of drug targets are also more skewed and peaked than those of non-drug targets. This is different in distributions like the BioGRID database or the Rual and Stelzl papers, where the numerical values are not only significantly smaller but the conclusions might even be the contrary, such as drug targets having a lower degree for Rual and Stelzl. The values of the drug target subnetwork show that interactions between drug targets are scarce and, therefore, the higher average degree of drug targets represents interactions between drug targets and non-drug targets. BioGRID was used by

| Protein interaction network | Nodes examined | Mean degree | Median degree | Degree standard deviation | Degree skewness | Degree kurtosis | Max degree |
|---|---|---|---|---|---|---|---|
| full PIN, spoke | all | 14.2 | 4 | 28.9 | 6.6 | 86.1 | 789 |
| | drug targets in full PIN | 22.5 | 8 | 44.8 | 6.8 | 84.2 | 789 |
| | non-drug targets in full PIN | 13.5 | 4 | 27.1 | 6.0 | 66.5 | 615 |
| drug target subnetwork, spoke | | 1.7 | 1 | 44.7 | 3.2 | 15.8 | 23 |
| non-drug target subnetwork, spoke | | 12.7 | 4 | 26.3 | 6.8 | 90.3 | 709 |
| BioGRID only, spoke | all | 7.5 | 3 | 13.5 | 7.9 | 133.0 | 395 |
| | drug targets in BioGRID only | 9.0 | 3 | 18.6 | 5.5 | 41.3 | 203 |
| Rual + Stelzl papers only, spoke | all | 4.3 | 2 | 8.8 | 7.5 | 81.6 | 158 |
| | drug targets in Rual + Stelzl only | 3.7 | 2 | 5.5 | 6.0 | 54.7 | 60 |
| Rual paper only, spoke | all | 3.8 | 2 | 8.4 | 9.4 | 127.5 | 158 |
| | drug targets in Rual only | 2.2 | 1 | 2.5 | 3.5 | 19.4 | 15 |
Given this observation, we examined the subgraph consisting only of interactions between drug targets versus the non-drug target subgraph. The average degree of the drug target subnetwork is only 1.7 (versus 12.7 for the non-drug target subnetwork), indicating that drug targets are, on average, high-degree proteins that are more connected to other parts of the full network than to each other.
For comparison purposes, the last six rows of Table
These initial results were consistent with drug targets having a higher degree on average in the consolidated dataset; however, the large standard deviation in these values led us to examine the relationship in greater detail. The majority of DTs have degrees between 1 and 8 (50th percentile) and 95% of all DTs have a degree less than 89. The number of DTs decreases linearly with degree between 1 and 20, followed by a long tail out to degree 789 (Additional file
Supplementary Figures 1–8.
In order to examine this overrepresentation in more detail, we constructed a rank of protein degrees in the full network and grouped them into bins of not less than 200 proteins each. Rank position 1 has the maximum degree of 789 and position 16078 has a degree of 1. We counted the number of drug targets per bin in the resulting 30 bins and applied a hypergeometric test to each with a significance cutoff of p-value < 0.05. Figure
Overrepresentation of drug targets over a degree ranking of proteins. Proteins were grouped into bins according to their degree. The width of each bin represents the number of proteins in that bin while the height (−log of the p-value of the hypergeometric test) represents how overrepresented drug targets are in that bin. Each bin contains at least 200 proteins. Overrepresented bins (p-value < 0.05) are highlighted in red. The number of drug targets in each bin is indicated at the top of each bar. Drug targets are overrepresented in high-degree bins and some middle-degree bins for the full PIN (a), while overrepresentation is observed only in the highest degree bin of BioGRID (b) and not at all in the Rual and Stelzl (c) or Rual-only (d) data sets.
However, this trend is not seen in either the BioGRID or the Rual and Stelzl subsets. In fact, drug targets were not overrepresented at all in these two subsets, with the exception of the highest degree bin in BioGRID. These observations argue that the usefulness of degree as a feature for drug target prediction is significantly affected by the choice of data set.
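The binning and per-bin hypergeometric test described above can be sketched as follows. This is a minimal illustration, not the study's R code. Two details are assumptions of the sketch: proteins with equal degree are kept in the same bin, and a final group smaller than the minimum size is merged into the preceding bin. The function and parameter names are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k): drawing n proteins without replacement from N, K of which are drug targets."""
    upper = min(K, n)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total

def degree_bins(degrees, min_size=200):
    """Rank proteins by decreasing degree and cut the ranking into bins of >= min_size.

    Assumption: a bin is only closed where the degree changes, so ties stay together.
    """
    ranked = sorted(degrees, key=degrees.get, reverse=True)
    bins, current = [], []
    for i, protein in enumerate(ranked):
        current.append(protein)
        next_degree = degrees[ranked[i + 1]] if i + 1 < len(ranked) else None
        if len(current) >= min_size and next_degree != degrees[protein]:
            bins.append(current)
            current = []
    if current:  # leftover group too small for its own bin
        if bins:
            bins[-1].extend(current)
        else:
            bins.append(current)
    return bins

def bin_p_values(degrees, drug_targets, min_size=200):
    """Upper-tail hypergeometric p-value of drug-target enrichment for each bin."""
    N = len(degrees)
    K = sum(1 for p in degrees if p in drug_targets)
    results = []
    for members in degree_bins(degrees, min_size):
        k = sum(1 for p in members if p in drug_targets)
        results.append((len(members), k, hypergeom_upper_tail(k, K, len(members), N)))
    return results
```

Each returned tuple holds the bin size, the number of drug targets in the bin and the p-value; the bar heights in the figure correspond to the negative logarithm of that p-value.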
The process of subsetting the network will fragment it into smaller components containing drug targets that are disconnected from the main giant component. The full spoke human PIN contains 140 connected components, distributed as shown in (Additional file
Supplementary Tables 1–6.
The connected component analysis in the different networks under study will show below how disconnected the network becomes when selecting reliable interactions. For example, the number of drug targets present in these smaller, disconnected components can go from 7 in the full PIN to 41 in the PSICQUIC MIscore subset (Table
The full PIN contains 7 drug targets that are disconnected from the main component. This number increases as interactions are removed to generate subsets of the data that are potentially more reliable; i.e., more drug targets become disconnected from the largest component of the network.

| Network | # Connected components | # Proteins in disconnected components | # Drug targets in disconnected components |
|---|---|---|---|
| Full PIN | 140 | 324 | 7 |
| B subset | 139 | 354 | 12 |
| Non-predicted subset | 164 | 376 | 12 |
| MI score - IntAct | 75 | 207 | 18 |
| LTP subset | 188 | 428 | 27 |
| MI score - PSICQUIC | 138 | 372 | 41 |
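The quantities reported for each network (number of connected components, and proteins and drug targets lying outside the largest component) can be computed as in the following sketch. It is an illustration on a hypothetical toy network rather than the study's code, and it assumes that the component count includes the giant component.

```python
from collections import deque

def connected_components(adj):
    """Return the connected components of an undirected graph, largest first."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    component.add(nb)
                    queue.append(nb)
        components.append(component)
    return sorted(components, key=len, reverse=True)

def fragmentation_summary(adj, drug_targets):
    """Count components, and proteins / drug targets outside the largest one."""
    components = connected_components(adj)
    outside = set()
    for component in components[1:]:
        outside |= component
    return {
        "components": len(components),  # assumption: giant component is counted
        "proteins_outside_giant": len(outside),
        "drug_targets_outside_giant": len(outside & set(drug_targets)),
    }

# Hypothetical toy network: one three-protein component, one pair, one isolated protein.
toy = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": ["E"], "E": ["D"], "F": []}
summary = fragmentation_summary(toy, {"A", "D", "F"})
```

Applying the same summary to the full network and to each filtered subset shows how many drug targets are cut off from the main component by the filtering step.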
Drug target degree is not overly influenced by n-ary data
We considered the possibility that the higher degree of drug targets might be influenced by the presence of n-ary data in the full dataset. In a previous work, we distinguished between true binary data (B), n-ary data, also known as complex data (N), and spoke-represented n-ary data (S)
Figure
Number of drug targets in each interaction-type subset. Venn diagram with the number of drug targets per interaction type in the full spoke PIN. B corresponds to binary interactions, N to n-ary interactions and S to spoke-represented n-ary interactions. 431 drug targets are found only in the binary subset while 375 are found in all three subsets.
Most drug targets are present in the binary (B) subset, while the n-ary (N) and spoke-represented n-ary data (S) subsets have around half of them. This might simply be due to the size of each subset, given that the ratio of drug targets to proteins per subset is similar. The average degree of the B subset is higher in comparison to the values for the N and S subsets, suggesting that the B subset may be a candidate to display a correlation between drug targets and high protein degree.

| Network | % All drug targets (#drug targets in data set / Total #drug targets) | % Drug targets in data set (#drug targets in data set / #proteins in data set) | Average degree of data set | Maximum degree |
|---|---|---|---|---|
| B subset | 95.19 | 8.11 | 10.44 | 534 |
| N subset | 45.39 | 8.98 | 7.37 | 282 |
| S subset | 50.94 | 8.78 | 7.57 | 169 |
| Full PIN | 100 | 7.63 | 14.16 | 789 |
The overrepresentation plot for drug targets in a degree rank for the B, N and S data sets confirms this. Figure
Overrepresentation of drug targets along a degree rank for Binary (B), N-ary (N) and Spoke-represented (S) interaction types. Proteins were grouped into bins according to their degree. The width of each bin represents the number of proteins in that bin while the height (−log of the p-value of the hypergeometric test) represents how overrepresented drug targets are in that bin. Overrepresented bins (p-value < 0.05) are highlighted in red. The number of drug targets in each bin is indicated at the top of each bar. Each bin contains at least 200 proteins. Drug targets are overrepresented in high-degree bins and some middle-degree bins for the B subset (a), while this trend is largely lost in the N (b) and S (c) data sets.
Drug target association with higher centrality is dependent on n-ary data
We repeated the above analyses using betweenness centrality instead of degree (Table
BC behaves similarly to degree in the sense that drug targets have higher centralities than non-drug targets, and in the sense that BioGRID and Rual-Stelzl display smaller values in comparison with the consolidated data set. However, a difference appears regarding interaction type, where nodes belonging to n-ary interactions (N and S nodes) are more central than nodes belonging to binary interactions.

| Protein interaction network | Nodes examined | Average BC (per protein) | Maximum BC |
|---|---|---|---|
| full PIN, spoke | all | 21663.7 | 6930614.5 |
| | Drug targets only | 47319.8 | 6930614.5 |
| | Non-drug targets only | 19545.3 | 5195198.1 |
| | B nodes only | 23985.2 | 6930614.5 |
| | N nodes only | 46165.3 | 6930614.5 |
| | S nodes only | 43327.2 | 6930614.5 |
| BioGRID subnetwork | all | 13704.3 | 4436940.8 |
| Rual+Stelzl subnetwork | all | 5960.5 | 506957.4 |
We examined the distribution of drug target centralities (Additional file
In summary, DTs appear to be overrepresented in higher-degree and higher-centrality bins. However, this is most apparent using a consolidated data set and is somewhat dependent on the presence of n-ary data in the case of centrality. Most drug targets seem to be located in true binary interaction data and their degree distributions are therefore not likely to be affected by complex representation artefacts.
Data selection analysis
We wished to quantify the predictive power of high degree and centrality for drug targets and assessed this using the Receiver Operating Characteristic (ROC) on the full network. We then compared this performance over five different subsets of the data that could reasonably have an effect on reliability and on network properties with respect to the full network. Our rationale here was that removing unreliable data might decrease the degree for some non-drug targets that had been artificially inflated and thereby increase performance by removing false positives.
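To make the evaluation concrete, the area under the ROC curve (AUC) for a single ranking feature such as degree can be computed directly as the probability that a randomly chosen drug target has a higher score than a randomly chosen non-drug target, with ties counted as one half. The sketch below uses this rank-based formulation; it is an illustration with hypothetical names and toy values, and is not necessarily the procedure used to draw the ROC curves in this paper.

```python
def auc_from_scores(scores, positives):
    """AUC of a score used as a ranking classifier.

    scores    -- dict mapping protein -> score (e.g. degree)
    positives -- set of proteins labelled as drug targets
    """
    pos = [s for p, s in scores.items() if p in positives]
    neg = [s for p, s in scores.items() if p not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative protein")
    # Count positive/negative pairs ranked correctly; a tie counts as half.
    wins = sum((sp > sn) + 0.5 * (sp == sn) for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy example: degrees of four proteins, two of them drug targets.
toy_degree = {"a": 10, "b": 5, "c": 5, "d": 1}
toy_auc = auc_from_scores(toy_degree, {"a", "b"})
```

An AUC of 0.5 corresponds to a random ranking and 1.0 to a perfect one, which is the scale on which the values reported below should be read.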
The “more-reliable” data sets included binary data only (B), data excluding predicted interactions (NP), low-throughput data only with an lpr < 22 (LTP), only edges with an IntAct MI score > 0.6 (I) and only edges with a PSICQUIC MI score > 0.7 (P). The construction of each subset is described in the Methods section. Figure
Venn diagram of interactions found in three of the reliable subsets. The Venn diagram shows that all MI-PSICQUIC interactions (MI-PSICQUIC score > 0.8) are contained in the LTP data set (lpr < 22), which in turn is contained in the non-predicted data set (data set excluding the OPHID database).
Degree as a drug target predictor
An overrepresentation plot of the five different data sets (Figure
Overrepresentation of drug targets along a degree rank for the Binary (B), non-predicted interactions, lpr < 22, MI-IntAct > 0.6 and MI-PSICQUIC > 0.8 data sets. Proteins were grouped into bins according to their degree. The width of each bin represents the number of proteins in that bin while the height (−log of the p-value of the hypergeometric test) represents how overrepresented drug targets are in that bin. Each bin contains at least 200 proteins. Overrepresented bins (p-value < 0.05) are highlighted in red. The number of drug targets in each bin is indicated at the top of each bar. Drug targets are overrepresented in high-degree bins and some middle-degree bins for the binary-only (B) subset (a), the non-predicted subset (b) and the LTP (lpr < 22) data set (c), while there is no overrepresentation for the MI-IntAct (MI-IntAct score > 0.6) (d) and the MI-PSICQUIC (MI-PSICQUIC score > 0.8) (e) data sets.
In order to quantify the predictive power of the degree for these data sets, we plotted the ROC curve (Methods) shown in Figure
ROC curve for protein degree as a drug target predictor. Plot of False Positive Rate versus True Positive Rate for a degree rank of the full PIN and five subsets considered as containing higher-confidence interactions: non-predicted interactions include all interactions except those coming from orthologous transfer; LTP includes interactions with an lpr score < 22; MI-IntAct includes interactions with MI-IntAct scores > 0.6; MI-PSICQUIC includes interactions with MI-PSICQUIC scores > 0.7; and B includes the true binary interactions (i.e., potential spoke-represented n-ary data is removed). Theoretically perfect and random classifiers are shown in grey for reference (AUC = 1 and 0.5 respectively).
The AUC was evaluated for degree and centrality ranks of the full PIN, five reliable subsets and two small subsets used in the literature. The best degree performance is achieved by the subset with an MI-IntAct score greater than 0.6; however, this subset contains only 219 proteins, making it of limited applicability. The second best performance is achieved by the full PIN and the B subset. Other reliable subsets (non-predicted, PSICQUIC, LTP) have a slightly inferior performance, while BioGRID and Rual+Stelzl perform close to randomness.
The best centrality performance is achieved by the full PIN, followed by three reliable subsets (B, non-predicted and LTP). Both MI-score subsets and both limited data sets perform close to randomness.

| Network | Number of proteins in network | AUC (degree) | AUC (BC) | AUC (CC) |
|---|---|---|---|---|
| Full PIN, spoke | 16078 | 0.6139 | 0.6294 | 0.5795 |
| B subset | 14408 | 0.6114 | 0.6171 | 0.5764 |
| Non-predicted interactions | 14928 | 0.5916 | 0.6128 | 0.5647 |
| LTP subset | 10591 | 0.5794 | 0.6066 | 0.5482 |
| BioGRID only | 8642 | 0.5082 | 0.5467 | 0.4874 |
| MI score, IntAct > 0.6 | 219 | 0.6353 | 0.5347 | 0.4382 |
| MI score, PSICQUIC > 0.7 | 747 | 0.5719 | 0.5725 | 0.5414 |
| Rual+Stelzl only | 3575 | 0.5004 | 0.5045 | 0.5011 |
Centrality Analysis
Overrepresentation of drug targets along a centrality rank for the full PIN and each of the subsets behaves similarly to degree. We assessed Betweenness Centrality performance using AUC as described above and found results similar to the degree performance (Table
Analysis of reliable subsets of the full PIN
The fact that the full network has proven to be the best data source for drug target prediction over all other subsets (except the small MI-IntAct > 0.6 subset for degree) seemed counterintuitive since we expected that some of these would contain more reliable data. We had reasoned that removing “unreliable” interactions might decrease the degree (connectivity) for some non-drug targets that had been artificially inflated and therefore reduce noise in the predictor due to false positives.
To test this reasoning, we evaluated the average change in the degree of a protein when losing edges from the full PIN to a “reliable” subset. Table
After generating the true-binary (B) network, drug targets lose 6.2 edges on average compared to their degree in the full PIN. At the same time, non-drug targets lose 4.9 edges. The difference between drug targets and non-drug targets is statistically significant for the first two cases (p-values of 3.7e-34 and 9.3e-13); we therefore conclude that removal of lower-confidence data preferentially decreases the degree of drug targets rather than non-drug targets.

| Data sets | Avg degree change (drug targets) | Avg degree change (non-drug targets) | Wilcoxon p-value |
|---|---|---|---|
| Full to non-predicted | −5.2 | −2.7 | 3.7e-34 |
| Full to B | −6.2 | −4.9 | 9.3e-13 |
| Full to LTP | −10.3 | −9.7 | 0.5 |
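The average degree change reported above can be illustrated with the short sketch below. It covers only the averaging step, not the Wilcoxon test, and it assumes that a protein absent from the filtered subset is treated as having degree 0 there; whether the original analysis handled such proteins in this way is not stated, so this is an assumption of the example. All names and values are hypothetical.

```python
from statistics import mean

def degree_changes(deg_full, deg_subset, proteins):
    """Per-protein change in degree when moving from the full PIN to a subset.

    Assumption: proteins missing from the subset have degree 0 in it.
    """
    return [deg_subset.get(p, 0) - deg_full[p] for p in proteins if p in deg_full]

def mean_degree_change(deg_full, deg_subset, proteins):
    """Average degree change over a group of proteins (negative means edges lost)."""
    return mean(degree_changes(deg_full, deg_subset, proteins))

# Hypothetical toy degrees before and after filtering.
full = {"a": 10, "b": 4, "c": 6}
subset = {"a": 4, "b": 3}
dt_change = mean_degree_change(full, subset, {"a"})
non_dt_change = mean_degree_change(full, subset, {"b", "c"})
```

The two lists returned by `degree_changes` for drug targets and non-drug targets are the samples that a two-sample rank test would then compare.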
Data Representation Analysis
Up to this point, a spoke model has been used to represent n-ary data in the full network. We considered the effect of using a matrix model instead to represent the same data. In this case, the average degree of the full human PIN is higher (42.86 for matrix versus 14.16 for spoke) (see Additional file
A matrix model of the full PIN has a slightly inferior performance to its spoke counterpart for all predictors under consideration.

| Network | AUC (degree) | AUC (BC) | AUC (CC) |
|---|---|---|---|
| Full PIN, spoke | 0.6139 | 0.6294 | 0.5795 |
| Full PIN, matrix | 0.5965 | 0.6264 | 0.5740 |
Observations on the integration of interaction and pathway data
Pathways have been traditionally used in drug discovery in the context of studying proteins upstream and downstream of a target in a pathway. Several studies
One could imagine employing a simple network analysis using pathways; the number of pathways that a protein is involved in could be counted as a “pathway-centrality” and assessed for its relationship with drug target status. However, pathways from multiple databases are not easily consolidated, making it difficult to determine how many distinct pathways a protein is involved in. Pathway databases are highly inconsistent both in terms of the biological entities and reactions
As a consequence we are unable to perform our analysis on a consolidated data set (analogous to the above analysis on a consolidated interaction data set). Instead, we had to resort to three separate pathway-centrality analyses on each of three different databases, keeping in mind that results might not be comparable between databases. Pathway records between databases may be redundant and overlapping, making results difficult to interpret.
We first compared the distribution of drug targets and nondrug targets in three different pathway databases: PID, Reactome and KEGG. Table
84% of all drug targets under study have pathway information. KEGG includes the highest number of drug targets (72%), followed by Reactome (65%). The number of non-drug targets in each database is small compared to all non-drug targets in UniProt, suggesting that pathways might be enriched for drug targets. Only 1.69% of UniProt proteins are drug targets while they constitute 15.9%-23.8% of pathway databases. This observation is confirmed by comparing the percentages of drug targets and non-drug targets in UniProt to those per database: KEGG, for example, contains 72.4% of all drug targets but only 6.6% of all non-drug targets found in UniProt.

| Database | #drug targets in database | #Non-drug targets in database | % of proteins in database that are drug targets | % of all drug targets with pathway info | % of all non-drug targets with pathway info |
|---|---|---|---|---|---|
| UniProt (all prots) | 1953 | 113741 | 1.69 | 83.97 | 8.33 |
| PID | 394 | 1261 | 23.81 | 20.17 | 1.11 |
| Reactome | 1262 | 4215 | 23.04 | 64.62 | 3.71 |
| KEGG | 1414 | 7473 | 15.91 | 72.40 | 6.57 |
If we hypothesize that proteins present in many pathways might be important for the cell and, therefore, disease and treatment processes, then counting the number of pathways per protein (“pathway centrality”) might be a useful feature for drug-target status prediction. This method can be understood as a kind of knowledge-based betweenness centrality, where shortest paths are replaced by actual information on known pathways. The distribution of the number of pathways per protein is, however, different for the different databases. Table
Drug targets are, on average, crossed by more pathways than non-drug targets. However, these values are relative to each pathway database.
KEGG allows the best performance for pathway centrality when using only the data in its database, while Reactome performs poorly. However, including the UniProt proteins not present in each database as part of the analysis leads to an increase in performance, and using KEGG or Reactome as the data source, with pathway centrality as the predictor, can be considered the best prediction platform investigated here.

| Database | Avg # pathways per drug target | Avg # pathways per non-drug target | Max # pathways per drug target | Max # pathways per non-drug target | AUC (number of pathways, proteins in one pathway or more) | AUC (number of pathways, proteins in zero pathways or more) |
|---|---|---|---|---|---|---|
| PID | 4.13 | 2.32 | 44 | 30 | 0.59 | 0.60 |
| Reactome | 1.85 | 1.71 | 17 | 23 | 0.53 | 0.81 |
| KEGG | 3.99 | 2.74 | 51 | 51 | 0.62 | 0.83 |
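The “pathway centrality” feature amounts to counting, for each protein, the number of pathway records in one database that list it. The sketch below illustrates this with hypothetical pathway and protein names. The optional zero-filling step corresponds to the second AUC column above, where proteins without any pathway annotation are included with a count of zero; that reading of the column is an interpretation made for this example.

```python
from collections import Counter

def pathway_counts(pathways, all_proteins=()):
    """Number of pathways each protein belongs to within one pathway database.

    pathways     -- dict mapping pathway name -> iterable of member proteins
    all_proteins -- optional background set; proteins in no pathway get count 0
    """
    counts = Counter()
    for members in pathways.values():
        counts.update(set(members))  # count a protein once per pathway
    for protein in all_proteins:
        counts.setdefault(protein, 0)
    return dict(counts)

# Hypothetical pathway records from a single database.
toy_pathways = {
    "pathway_1": ["P1", "P2", "P2"],
    "pathway_2": ["P1", "P3"],
    "pathway_3": ["P1"],
}
annotated_only = pathway_counts(toy_pathways)
with_background = pathway_counts(toy_pathways, all_proteins=["P1", "P2", "P3", "P4"])
```

Because the counts are computed per database, values from different databases are not directly comparable, in line with the caveat above.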
Drug targets are overrepresented in all pathway centrality bins for all three databases under analysis (see Additional file
Disease Analysis
The previous results motivated us to perform three additional analyses examining the relationship between drug targets and disease.
First, we surveyed the distance between drug targets and known disease proteins (Methods). Table
The full PIN contains only 436 drug targets which, at the same time, are DPs. 619 different drug targets have a shortest path of “1” to the nearest DP (they interact), while 154 have a shortest path of “2” and 5 drug targets are disconnected from any disease protein, probably due to missing interaction information. Smaller subsets show that, in general, drug targets do not get farther from disease proteins after data subsetting and, as a rule of thumb, a connected drug target will almost always have a disease protein within 4 steps. However, the proportion of drug targets in disconnected components is higher for subsets than it is for the full PIN.

| Distance | Description | Full PIN | BioGRID | Rual-Stelzl | Rual-only |
|---|---|---|---|---|---|
| 0 | Drug targets = DPs | 436 | 319 | 71 | 25 |
| 1 | Drug targets interact with DPs | 619 | 246 | 47 | 10 |
| 2 | Drug targets and DPs have a common interactor | 154 | 163 | 77 | 25 |
| 3 | 3-step paths | 12 | 16 | 20 | 11 |
| 4 | 4-step paths | 1 | 1 | 1 | 2 |
| 5 | 5-step paths | - | 1 | - | - |
| Inf | Drug targets disconnected from DPs | 5 | 5 | 5 | 1 |
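The distance from each drug target to its nearest disease protein can be obtained with a single breadth-first search started simultaneously from all disease proteins. The following sketch illustrates this on a hypothetical toy network; it is not the study's code, and drug targets absent from the network are reported here as disconnected, which is an assumption of the example.

```python
from collections import Counter, deque

def distance_to_nearest(adj, sources):
    """Multi-source BFS: shortest distance from every node to its nearest source."""
    dist = {s: 0 for s in sources if s in adj}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def distance_histogram(adj, drug_targets, disease_proteins):
    """Number of drug targets at each distance from the nearest disease protein.

    Unreachable drug targets (and, by assumption, those not in the network)
    are counted under float('inf').
    """
    dist = distance_to_nearest(adj, disease_proteins)
    return Counter(dist.get(t, float("inf")) for t in drug_targets)

# Hypothetical toy network: chain A-B-C-D plus a separate pair E-F.
toy = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"], "E": ["F"], "F": ["E"]}
hist = distance_histogram(toy, drug_targets=["A", "B", "D", "E"], disease_proteins=["A"])
```

A distance of 0 means the drug target is itself a disease protein, 1 means a direct interaction, and 2 means the two share a common interactor, matching the rows of the table above.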
Smaller subsets in Table
Second, we hypothesized that degree might actually constitute a better predictor when applied to a subset of diseases. For example, high degree has already been noted as a feature of drug targets related to cancer
| Drug targets | # Proteins | AUC (degree) | AUC (BC) | AUC (CC) |
|---|---|---|---|---|
| Cancer drug targets | 303 | 0.6482 | 0.6617 | 0.6193 |
| Non-cancer drug targets | 924 | 0.5976 | 0.6133 | 0.5627 |
Third and finally, we hypothesized that drugs targeting highly central proteins could lead to more side-effects and would, therefore, be more likely to be withdrawn from the market. Indeed, we found that the average BC of the subset of drug targets for withdrawn drugs is 54084.4 with a maximum of 1501217, which indicates that withdrawn drug targets have, on average, higher centralities than all drug targets and, of course, than the average of centralities in the full PIN (Wilcoxon p-value = 9.5e-6). In contrast, non-withdrawn drug targets have an average BC of 21411.7 and a maximum of 6930614, which is similar to the average and maximum values of the full PIN (Wilcoxon p-value = 0.8). These observations argue that high centrality should not be used as a predictor and may, in fact, be indicative of drugs that are more likely to be withdrawn.
Discussion
Using the full PIN (iRefIndex consolidated data set) gives better prediction results than using presumably more reliable subsets such as the true binary interactions, low lpr score, non-predicted interactions, high IntAct MI score and high PSICQUIC MI score, and significantly better than using arbitrary subsets such as one given database or study. This could be taken as an argument in favour of the importance of interaction data integration in drug target prediction studies.
The poor performance of more reliable data sets compared to the full PIN might be due to one of two reasons. Either the subsets we are calling “reliable” are not as reliable as we think they are (and better definitions of reliability are needed) or, if we assume that our data is truly reliable, it is possible that the correlation of drug targets with degree and centralities is partially due to the inclusion of unreliable interactions. Both hypotheses demand further study. We would argue that our results also point out the need for more reliable interaction data and/or methods to filter for such data.
Representation issues seem to be less important for drug target prediction. Spoke models perform slightly better than matrix models, although the difference is small. This might be due to the fact that most drug targets are present in binary interactions and not affected by complex representation.
Pathways are enriched in drug targets and only partially overlap with interaction data, and the number of pathways that cross a given protein seems to be a good drug target predictor. This could be interpreted as a need to integrate pathway data into drug target prediction analyses, but it could also reflect the fact that the drug discovery process has been mainly pathway-oriented. However, as a consequence of the high inconsistency between pathway databases, an integration effort is required for pathways, similar to the iRefIndex for interaction data. There are integration efforts such as ConsensusPathDB
Our analysis can be improved in several ways. First, we are aware that degree and centralities might not be the best drug target prediction metrics and the analysis could be enriched by using better metrics and using an ensemble of features
Even though our purpose was not to examine the predictive power of degrees and centralities compared to other metrics, but only their variation due to a different data source, our analysis has given us an important insight on how these metrics work and their limitations. Data type distinction, overrepresentation analysis and ROC curves have given us a deeper understanding of the reasons for and against using degree and centralities as drug target features and can be a methodology to use in the assessment of new prediction metrics.
Conclusions
These initial results suggest that data integration is an important consideration when examining potential features for drug target prediction. Using more reliable data sets as defined here has little effect, although other measures of confidence may give different results. The representation issues under analysis (n-ary data, matrix representation) do not have a significant effect on the predictive power of degree and centralities. This work will be of use to future studies that incorporate network data as a feature of drug target predictors.
Methods
All analyses were performed using R and some of its packages: “iRefR” for manipulation of the protein interaction database iRefIndex; “igraph” for network analysis; “moments” for computation of statistical moments; “limma” for generating Venn diagrams; “plotrix” for multiple histograms; and “org.Hs.eg.db” for conversion between gene IDs and GO and pathway information. R code to generate all networks, tables and plots is provided as Additional file
Construction of Networks
Networks were constructed and analyzed using the iRefR package
Construction of the full PIN
The iRefIndex human MITAB file v.8.0 contains 355104 unique records, of which 309726 correspond to human-human interactions. Using a canonical representation of the proteins and including data with all levels of confidence, two protein interaction networks can be obtained. Using a spoke model to represent complexes, the PIN (full PIN, spoke) contains 16078 nodes and 113834 edges. Using a matrix model to represent complexes, the PIN (full PIN, matrix) contains 16078 nodes and 344576 edges. Even though drug targets may be dependent on post-translational modifications and cellular microenvironments
Construction of the Drug target List
There are several drug target databases, such as DrugBank
A MITAB representation of the DrugBank database was retrieved, where the drug is described in the first field of the interaction and the drug target in the second field. The DrugBank MITAB table from September 2011 contained 40274 records, 19500 of which correspond to proteins. 14851 of those protein records were found in iRefIndex and only 12632 of these are human proteins.
DrugBank includes an “experimental” category of drugs, defined as “Drug has been shown experimentally to bind specific proteins in mammals, bacteria, viruses, fungi, or parasites. An experimental drug is not necessarily being formally investigated”
These 7621 DrugBank records contain 1266 distinct protein drug targets. Of these 1266 drug targets, 1227 belong to human-human protein interactions; this is therefore the final number of drug targets that was studied.
It is important to highlight that the subset of non-iRefIndex drug targets contains 1592 proteins, which means that interaction data is missing (these drug targets do not have a single known protein interaction in iRefIndex's databases) for more than half of the DrugBank human drug targets.
Construction of drug target and non-drug target subnetworks
Drug target and non-drug target subnetworks were constructed using the “igraph” R package
Generating interaction-type subnetworks
The iRefIndex classifies interaction data according to three interaction types: binary interaction records, n-ary interaction records (N) and polymers (not studied here). The S subset (spoke-represented n-ary data) corresponds to data that is represented as binary but is possibly just a representation of n-ary data. The S subset was detected using a simple algorithm: binary interaction records annotated by the same database from the same paper which were generated according to an experimental method that is known to generate n-ary data were grouped together into one S-type record
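The grouping step described above can be sketched as follows in Python. This is a minimal illustration under stated assumptions, not the code used in this study: the record fields (`source_db`, `pmid`, `method`, `a`, `b`), the contents of `NARY_METHODS`, and the decision to include the method in the grouping key are all hypothetical.

```python
from collections import defaultdict

# Hypothetical list of experimental methods assumed to generate n-ary data.
NARY_METHODS = {"tandem affinity purification", "coimmunoprecipitation"}


def group_s_records(binary_records, nary_methods=NARY_METHODS):
    """Group binary records into S-type records.

    Records sharing source database, paper and (assumed) method are merged
    when the method is in nary_methods; all other records are returned as-is.
    """
    groups = defaultdict(set)
    untouched = []
    for rec in binary_records:
        if rec["method"] in nary_methods:
            key = (rec["source_db"], rec["pmid"], rec["method"])
            groups[key].update((rec["a"], rec["b"]))
        else:
            untouched.append(rec)
    s_records = [
        {"source_db": db, "pmid": pmid, "method": method,
         "members": sorted(members)}
        for (db, pmid, method), members in groups.items()
    ]
    return s_records, untouched


# Toy records with hypothetical identifiers.
records = [
    {"source_db": "dbX", "pmid": "1", "method": "coimmunoprecipitation",
     "a": "P1", "b": "P2"},
    {"source_db": "dbX", "pmid": "1", "method": "coimmunoprecipitation",
     "a": "P1", "b": "P3"},
    {"source_db": "dbX", "pmid": "2", "method": "two hybrid",
     "a": "P4", "b": "P5"},
]
s_records, binary_only = group_s_records(records)
```

In this toy input the two coimmunoprecipitation records from the same paper collapse into one S-type record with members P1, P2 and P3, while the two hybrid record stays binary.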
Generating high-confidence subnetworks
Using the iRefR package
The MI score tables were generated using a Python script that submits iRefIndex interaction records, one at a time, to the scoring servers
In order to select the cutoff values for each score type, 9 networks were generated for each score and the ROC test was applied to each of them. Values of 0.6 (for the MI score - Intact) and 0.7 (for the MI score - PSICQUIC) had the highest AUC values and were chosen as cutoffs in this study. Additional file
Prediction methods
Degree: Number of edges for a node or number of interactions for a protein. For computations, the igraph R package was used
Centrality: Node centrality is a measure of the relative importance of a node within a graph; in our case, the relative importance of a protein inside a PIN. There are various ways to calculate centrality; in this study we used two of the most common measures, “betweenness” and “closeness” centrality. Betweenness centrality is a measure of the number of shortest paths that cross a given node. A node that is found in many shortest paths will have a higher betweenness centrality than a node that is not. Closeness centrality is a measure of the mean shortest distance between one node (protein) and all the others that it can reach, which reflects how long it will take information to spread from that node to the rest of the network. For computations, the “igraph” R package
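The two centrality definitions can be illustrated with a small, self-contained Python sketch for unweighted, undirected graphs. It is not the igraph code used in this study: betweenness is computed with a Brandes-style accumulation, and closeness is taken here, as an assumption, to be the reciprocal of the mean shortest distance to the reachable nodes, which may differ from igraph's normalisation.

```python
from collections import deque


def closeness(adj, node):
    """Reciprocal of the mean shortest distance to all reachable nodes."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    reachable = len(dist) - 1
    return reachable / sum(dist.values()) if reachable else 0.0


def betweenness(adj):
    """Number of shortest paths through each node (fractional when tied)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        order, dist = [], {s: 0}
        sigma = {v: 0 for v in adj}   # shortest-path counts from s
        sigma[s] = 1
        preds = {v: [] for v in adj}  # predecessors on shortest paths
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):     # farthest nodes first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was visited from both endpoints.
    return {v: b / 2 for v, b in bc.items()}


# Toy path graph A - B - C - D (adjacency lists, hypothetical node names).
path = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
bc = betweenness(path)
cl = {v: closeness(path, v) for v in path}
```

On the toy path, the two inner nodes lie on the shortest paths between other pairs and so receive non-zero betweenness, while the end nodes receive zero; the inner nodes also have higher closeness because their mean distance to the other nodes is smaller.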
GO enrichment: When examining disconnected components, we considered “enriched” as the most common GO terms associated with a given subset of proteins. The “org.Hs.eg.db” R package
Pathway Centrality: We defined pathway centrality of a protein as the number of known biological pathways that cross that protein. For computations, the “org.Hs.eg.db” R package
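As a minimal illustration of this definition, the following Python sketch counts the distinct pathways in which each protein appears. The input mapping and the identifiers are hypothetical, and the sketch does not reproduce the org.Hs.eg.db lookups used in this study.

```python
from collections import defaultdict


def pathway_centrality(pathway_members):
    """Map each protein to the number of distinct pathways that contain it."""
    pathways_by_protein = defaultdict(set)
    for pathway_id, proteins in pathway_members.items():
        for protein in proteins:
            pathways_by_protein[protein].add(pathway_id)
    return {p: len(ids) for p, ids in pathways_by_protein.items()}


# Toy pathway records with hypothetical identifiers.
toy_pathways = {
    "pathway_1": ["P1", "P2"],
    "pathway_2": ["P1", "P3"],
    "pathway_3": ["P1", "P2", "P2"],  # repeated members count once
}
centrality = pathway_centrality(toy_pathways)
```

Proteins that appear in no pathway record are absent from the result; treating them as having a pathway centrality of zero is a choice left to the caller.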
Estimation of predictive power
The Receiver Operating Characteristic (ROC) curve is a plot of the True Positive Rate (TPR) versus the False Positive Rate (FPR), calculated as follows:
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}$$
where FP = False Positives, TN = True Negatives, TP = True Positives, and FN = False Negatives.
The area under this curve (AUC) is interpreted as the probability that the classifier can rank a positive example better than a negative one, and here is calculated using a simple trapezoidal rule. We note that alternatives to the ROC method could be considered
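A trapezoidal AUC computation of this kind can be sketched in Python as follows. This is an illustrative sketch rather than the R code used in this study: the function names are hypothetical, the score is assumed to be a feature such as degree where higher values predict drug-target status, and tied scores are assumed to move the curve in a single step.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points from sweeping a threshold over the scores.

    labels are 1 for positives (drug targets) and 0 for negatives.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative")
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Consume all examples that share the current score.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example: degree as the score, 1 marks a (hypothetical) drug target.
degrees = [90, 40, 12, 9, 5, 2]
is_target = [1, 0, 1, 0, 0, 0]
auc = auc_trapezoid(roc_points(degrees, is_target))
```

In the toy example, seven of the eight target/non-target pairs are ranked correctly by degree, so the AUC is 0.875. A feature with no ranking ability gives an AUC of 0.5.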
DAVID disease overrepresentation analysis
Proteins were grouped in bins of 700 proteins, from higher to lower degree, where bin 1 contained proteins with the highest degree. Each bin was submitted to DAVID
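The binning step can be sketched in Python as follows. The function name and input mapping are hypothetical; breaking ties in degree by protein identifier and allowing a smaller final bin are assumptions that the text does not specify.

```python
def degree_bins(degree_by_protein, bin_size=700):
    """Split proteins into consecutive bins, highest degree first.

    Returns a list of lists; the first list corresponds to bin 1.
    """
    ranked = sorted(degree_by_protein,
                    key=lambda p: (-degree_by_protein[p], p))
    return [ranked[i:i + bin_size] for i in range(0, len(ranked), bin_size)]


# Toy example with hypothetical identifiers and a bin size of 2.
toy_degrees = {"P1": 3, "P2": 50, "P3": 7, "P4": 50, "P5": 1}
bins = degree_bins(toy_degrees, bin_size=2)
```

Each resulting list of identifiers would then be the input for one overrepresentation query.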
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
AM performed all analyses in this paper and wrote all code for those analyses. IMD supervised the project. AM and IMD wrote this paper. All authors read and approved the final manuscript.
Acknowledgements
The authors would like to thank Paul Boddie for producing the MITAB files for DrugBank and for the MI scores, and Katerina Michalickova for providing a version of OMIM’s Morbid Map that included gene IDs.