MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing 100084, China

Molecular and Computational Biology Program, University of Southern California, Los Angeles, CA 90089, USA

Abstract

Background

Domains are basic units of proteins, and thus exploring associations between protein domains and human inherited diseases will greatly improve our understanding of the pathogenesis of human complex diseases and further benefit the medical prevention, diagnosis and treatment of these diseases. Within a given domain-domain interaction network, we make the assumption that similarities of disease phenotypes can be explained using proximities of domains associated with such diseases. Based on this assumption, we propose a Bayesian regression approach named "

Results

Using a compiled dataset containing 1,614 associations between 671 domains and 1,145 disease phenotypes, we demonstrate the effectiveness of the proposed approach through three large-scale leave-one-out cross-validation experiments (random control, simulated linkage interval, and genome-wide scan), evaluated with three criteria (precision, mean rank ratio, and AUC score). Through a series of permutation tests, we further show that the proposed approach is robust to the parameters involved and to the underlying domain-domain interaction network. Having assessed the validity of this approach, we show the possibility of

Conclusions

The proposed approach effectively ranks susceptible domains among the top of the candidates, and it is robust to the parameters involved. The

Background

Over the past few decades, remarkable success has been achieved for such traditional gene-mapping approaches as family-based linkage analysis

It has been shown that deleterious nonsynonymous single nucleotide polymorphisms (nsSNPs) that are responsible for a specific disease of interest may change structures of some protein domains, affect functions of corresponding proteins, and further result in the disease under investigation. Therefore, existing associations between domains and diseases can be constructed by bridging protein domains that contain known deleterious nsSNPs and human diseases with which the nsSNPs are associated

Recent studies on the modular nature of human genetic diseases have shown that diseases that share common clinical characteristics are often caused by functionally related genes

We compile a set of known domain-disease associations using the Pfam database

Methods

Overview of the DomainRBF approach

We ground the inference of domains that are associated with human inherited diseases on a set of known domain-disease associations that are compiled from the Pfam database

Based on the assumption that phenotypically similar diseases are caused by functionally related domains, we propose a linear regression framework to model the relationship between a domain proximity profile and a phenotype similarity profile, and we resort to a Bayesian approach to solve the linear regression model. As shown in Figure


**Scheme of the proposed domainRBF approach**. Texts in addition to pink arrows denote the pipeline of the domainRBF approach.

Data sources

Domain-disease associations

A domain is defined as associated with a disease if the domain contains at least one nonsynonymous single nucleotide polymorphism (nsSNP) associated with the disease

Known associations between nsSNPs and diseases are obtained from annotations of nsSNPs in the UniProt database

Relationships between human proteins and domains are obtained from the Pfam database

Using the above data sources and having defined a domain as associated with a disease if the domain contains at least one nsSNP associated with the disease, we are able to establish 1,614 associations between 671 domains and 1,145 diseases.
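The association rule above can be illustrated with a small sketch. This is not the authors' pipeline: the record layouts, the function name `domain_disease_associations`, and the toy coordinates are hypothetical, and residue positions are assumed to be 1-based and inclusive.

```python
from typing import NamedTuple

class DomainHit(NamedTuple):
    protein: str   # protein accession (hypothetical identifiers below)
    domain: str    # domain accession
    start: int     # first residue covered by the domain
    end: int       # last residue covered by the domain

class Variant(NamedTuple):
    protein: str
    position: int  # residue position of the nsSNP
    disease: str   # disease annotated for the nsSNP

def domain_disease_associations(hits, variants):
    """Pair a domain with a disease if the domain contains at least one
    disease-associated nsSNP."""
    by_protein = {}
    for hit in hits:
        by_protein.setdefault(hit.protein, []).append(hit)
    pairs = set()
    for var in variants:
        for hit in by_protein.get(var.protein, []):
            if hit.start <= var.position <= hit.end:
                pairs.add((hit.domain, var.disease))
    return pairs

hits = [
    DomainHit("P1", "PF00001", 10, 120),
    DomainHit("P1", "PF00002", 200, 300),
    DomainHit("P2", "PF00001", 5, 90),
]
variants = [
    Variant("P1", 50, "disease A"),   # falls inside PF00001
    Variant("P1", 150, "disease B"),  # falls between domains, so no association
    Variant("P2", 60, "disease C"),   # falls inside PF00001 on another protein
]
associations = domain_disease_associations(hits, variants)
```

Using a set keeps each domain-disease pair once even when several nsSNPs support it, which matches counting associations rather than variants.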

Domain-domain interaction networks

Our inference of domain-disease associations is based on domain-domain interaction networks extracted from the DOMINE

The latest version of DOMINE (released in February 2008) contains a total of 20,513 domain-domain interactions, out of which 4,349 (gold-standard positives) are inferred from PDB entries (the union of the sets of interactions from iPfam

The latest version of InterDom (released on July 31, 2007) contains a total of 148,938 domain-domain interactions, out of which 7,718 are inferred from PDB entries

In our work, we use two domain-domain interaction networks extracted from these data. First, we discard singletons in the PDB part of the DOMINE database

The phenotype similarity network

The phenotype similarity network of human diseases is a fully connected network obtained from an earlier work of van Driel

The DomainRBF model

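Given the phenotype similarity network, we use $s_{pp'}$ to denote the similarity score between two disease phenotypes $p$ and $p'$. For a query disease phenotype $p$, we define the phenotype similarity profile $\mathbf{y}_p = (s_{pp_1}, s_{pp_2}, \ldots, s_{pp_m})^T$ as the similarity scores between $p$ and the phenotypes $p_1, p_2, \ldots, p_m$ that have at least one associated domain.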

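On the other hand, given a domain-domain interaction network of $n$ domains, we measure the proximity between two domains either by the shortest path with Gaussian kernel or by the diffusion kernel. The diffusion kernel is defined as $\mathbf{K} = (k_{uv})_{n \times n} = e^{-\gamma \mathbf{L}}$, where $0 < \gamma < 1$ is a free parameter that controls the magnitude of diffusion. The matrix $\mathbf{L} = \mathbf{D} - \mathbf{A}$ is the Laplacian of the network, where $\mathbf{D}$ is a diagonal matrix containing node degrees, and $\mathbf{A}$ is the adjacency matrix of the domain-domain interaction network. With the diffusion kernel $\mathbf{K} = (k_{uv})_{n \times n}$, we define the diffusion proximity of two domains $u$ and $v$ as $k_{uv}$, i.e., the corresponding element in the diffusion kernel. Then, let $a_{dd'}$ denote the proximity between domains $d$ and $d'$; we define the proximity between a domain $d$ and a phenotype $p$ as $x_{dp} = \sum_{d' \in D(p)} a_{dd'}$, where $D(p)$ is the set of domains associated with $p$. We further define the domain proximity profile for domain $d$ as $\mathbf{x}_d = (x_{dp_1}, x_{dp_2}, \ldots, x_{dp_m})^T$.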
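As a minimal sketch of the diffusion kernel and of the domain-phenotype proximity, the following code computes $e^{-\gamma \mathbf{L}}$ for a toy three-domain path network and sums kernel entries over the domains associated with a phenotype. The function names, the toy network, and the truncated Taylor series are assumptions made for illustration; a production implementation would more likely rely on a dedicated matrix-exponential routine.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def diffusion_kernel(adjacency, gamma, terms=40):
    """Return K = exp(-gamma * L) with L = D - A via a truncated Taylor series.

    The series is adequate for the small, sparse toy network used here.
    """
    n = len(adjacency)
    # Scaled negative Laplacian: M = -gamma * (D - A).
    m = [[-gamma * ((sum(adjacency[i]) if i == j else 0.0) - adjacency[i][j])
          for j in range(n)] for i in range(n)]
    kernel = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in kernel]
    for k in range(1, terms):
        # After this update, term holds M^k / k!.
        term = [[v / k for v in row] for row in mat_mul(term, m)]
        kernel = [[kernel[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return kernel

def domain_phenotype_proximity(kernel, d, associated):
    """Sum the proximities between domain d and the domains of a phenotype."""
    return sum(kernel[d][other] for other in associated)

# Toy path network: domain 0 - domain 1 - domain 2.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
K = diffusion_kernel(adjacency, gamma=0.5)
x_dp = domain_phenotype_proximity(K, d=0, associated=[1, 2])
```

Because every row of the Laplacian sums to zero, every row of the kernel sums to one, so the proximity of a domain to the rest of the network is bounded and directly adjacent domains receive larger values than distant ones.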

Then, given a query disease phenotype $p$ and a candidate domain $d$, we explain the phenotype similarity profile $\mathbf{y}_p$ using the domain proximity profile $\mathbf{x}_d$ via a linear regression model

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},$$

where $\mathbf{y} = \mathbf{y}_p$ is the response vector, $\mathbf{X} = (\mathbf{1}, \mathbf{x}_d)$ the design matrix, $\boldsymbol{\beta} = (\beta_0, \beta_1)^T$ the coefficient vector, and $\boldsymbol{\varepsilon} = (\varepsilon_1, \ldots, \varepsilon_m)^T$ the residual vector. Note that the first column of the design matrix consists of 1s in order to incorporate the intercept. We propose to solve this linear regression model using a Bayesian approach. We choose a Bayesian approach because it provides a natural way to consider the uncertainty in estimated parameters, and because it provides the Bayes factor, a measure of the strength of evidence for an association. The Bayes factor is defined as the ratio of the marginal likelihoods of $\mathbf{y}$ conditional on $\mathbf{X}$ under the alternative and the null hypotheses, respectively, as described below.

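For the alternative model, we assume that $\mathbf{y}$ conditional on $\mathbf{X}$ is subject to a normal distribution, as

$$\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2 \sim N(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}),$$

with residuals independent and identically distributed, following a normal density with mean 0 and variance $\sigma^2$. We set conjugate prior distributions for $\boldsymbol{\beta}$ and $\sigma^2$, as

$$\boldsymbol{\beta} \mid \sigma^2 \sim N(\boldsymbol{\mu}_0, \sigma^2 \boldsymbol{\Sigma}_0)$$

and

$$\sigma^2 \sim \text{Inv-}\chi^2(\nu_0, \sigma_0^2),$$

where $\boldsymbol{\mu}_0 = (\mu_0, \mu_1)^T$ is composed of prior means, and $\sigma^2 \boldsymbol{\Sigma}_0$ prior variances with $\boldsymbol{\Sigma}_0 = \mathrm{diag}(\sigma_\mu^2, \sigma_1^2)$ being a diagonal matrix. The joint distribution of all random quantities $\mathbf{y}$, $\boldsymbol{\beta}$, and $\sigma^2$ is then given as

$$p(\mathbf{y}, \boldsymbol{\beta}, \sigma^2 \mid \mathbf{X}) = p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2)\, p(\boldsymbol{\beta} \mid \sigma^2)\, p(\sigma^2).$$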

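Integrating out $\boldsymbol{\beta}$ and $\sigma^2$, we obtain the marginal likelihood of $\mathbf{y}$ given $\mathbf{X}$ as

$$p_1(\mathbf{y} \mid \mathbf{X}) = \pi^{-m/2}\, \frac{\Gamma(\nu_n/2)}{\Gamma(\nu_0/2)}\, \frac{|\boldsymbol{\Sigma}_n|^{1/2}}{|\boldsymbol{\Sigma}_0|^{1/2}}\, \frac{(\nu_0 \sigma_0^2)^{\nu_0/2}}{(\nu_n \sigma_n^2)^{\nu_n/2}},$$

where $\nu_n = \nu_0 + m$ and $\nu_n \sigma_n^2 = \nu_0 \sigma_0^2 + \mathbf{y}^T\mathbf{y} + \boldsymbol{\mu}_0^T \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu}_0 - \boldsymbol{\mu}_n^T \boldsymbol{\Sigma}_n^{-1} \boldsymbol{\mu}_n$ with $\boldsymbol{\Sigma}_n = (\mathbf{X}^T\mathbf{X} + \boldsymbol{\Sigma}_0^{-1})^{-1}$ and $\boldsymbol{\mu}_n = \boldsymbol{\Sigma}_n (\mathbf{X}^T\mathbf{y} + \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu}_0)$.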

On the other hand, for the null model, where **y **is independent of **X**, the marginal likelihood of **y **can be derived in a similar way, as

where

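Then, the Bayes factor BF is the ratio of $p_1(\mathbf{y} \mid \mathbf{X})$ and $p_0(\mathbf{y})$, as

$$\mathrm{BF} = \frac{p_1(\mathbf{y} \mid \mathbf{X})}{p_0(\mathbf{y})}.$$

Following the literature, we take the limit of infinity for $\sigma_\mu^2$ and 0 for both $\nu_0$ and $\sigma_0^2$, and we obtain the limit value of the Bayes factor as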

For simplicity, we further set $\boldsymbol{\mu}_0 = \mathbf{0}$ as in the literature and $\sigma_1^2 = 1$ as the default setting in this paper, although the effects of these parameters are also studied.
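To make the default setting concrete, the sketch below evaluates the logarithm of the Bayes factor for a single candidate domain. It assumes the closed form that the standard conjugate normal result yields under the stated limits with a zero prior mean, namely $\mathrm{BF} = \frac{m^{1/2} |\boldsymbol{\Sigma}_n|^{1/2}}{\sigma_1} \left( \frac{\mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol{\Sigma}_n\mathbf{X}^T\mathbf{y}}{\mathbf{y}^T\mathbf{y} - m\bar{y}^2} \right)^{-m/2}$ with $\boldsymbol{\Sigma}_n = (\mathbf{X}^T\mathbf{X} + \mathrm{diag}(0, \sigma_1^{-2}))^{-1}$. Whether this expression matches the display equation of the original article term by term is an assumption, and the function name and toy data are hypothetical.

```python
import math

def log_bayes_factor(x, y, sigma1_sq=1.0):
    """Log Bayes factor for y = b0 + b1 * x against the intercept-only model.

    Assumes a flat prior on the intercept, a N(0, sigma^2 * sigma1_sq) prior
    on the slope, and the limiting prior on the residual variance.
    """
    m = len(y)
    tau = 1.0 / sigma1_sq  # prior precision of the slope
    s_x, s_y = sum(x), sum(y)
    s_xx = sum(v * v for v in x)
    s_yy = sum(v * v for v in y)
    s_xy = sum(a * b for a, b in zip(x, y))
    # Determinant of X^T X + diag(0, tau), i.e. 1 / |Sigma_n|.
    det = m * (s_xx + tau) - s_x ** 2
    # Quadratic form y^T X Sigma_n X^T y, expanded for the 2 x 2 case.
    quad = ((s_xx + tau) * s_y ** 2 - 2.0 * s_x * s_y * s_xy
            + m * s_xy ** 2) / det
    rss_alt = s_yy - quad            # residual term under the alternative
    rss_null = s_yy - s_y ** 2 / m   # residual term under the null
    return (0.5 * math.log(m) - 0.5 * math.log(det)
            - 0.5 * math.log(sigma1_sq)
            - 0.5 * m * (math.log(rss_alt) - math.log(rss_null)))

y = [-1.2, -0.4, 0.1, 0.5, 1.0]           # toy phenotype similarity profile
x_related = [0.1, 0.3, 0.5, 0.6, 0.9]     # proximity profile that tracks y
x_unrelated = [0.5, 0.1, 0.9, 0.3, 0.6]   # same values in a different order
log_bf_related = log_bayes_factor(x_related, y)
log_bf_unrelated = log_bayes_factor(x_unrelated, y)
```

Working on the log scale avoids overflow when $m$ is large, and candidate domains can be ranked by the log Bayes factor directly because the logarithm is monotone.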

Note that before the construction of the Bayesian regression relationship between $\mathbf{y}_p$ and $\mathbf{x}_d$, we apply an inverse-normal transform to $\mathbf{y}_p$ to guarantee that the response variable is normally distributed. As illustrated in

where $r_i$ is the rank of the $i$-th element of $\mathbf{y}_p$, $m$ is the length of $\mathbf{y}_p$, and $\Phi$ is the cumulative distribution function of the standard normal distribution.
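A rank-based inverse-normal transform of this kind can be sketched as follows. The exact offset used in the original transform is not recoverable from the text, so the quantile $(r_i - c)/(m - 2c + 1)$ with $c = 0.5$ used here is an assumption, as are the function name and the handling of ties (broken by input order).

```python
from statistics import NormalDist

def inverse_normal_transform(values, offset=0.5):
    """Replace each value by the standard normal quantile of its rank.

    The offset (0.5 here) is an assumed convention; other common choices
    plug in the same way.
    """
    m = len(values)
    order = sorted(range(m), key=lambda i: values[i])
    ranks = [0] * m
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank  # rank 1 is the smallest value
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - offset) / (m - 2 * offset + 1))
            for r in ranks]

z = inverse_normal_transform([0.9, 0.1, 0.5])
```

The transform depends only on ranks, so it preserves the ordering of the similarity scores while discarding their original scale.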

Validation methods and evaluation criteria

On the basis of the domain-domain interaction network and known associations between protein domains and disease phenotypes, we proceed to validate how well the proposed approach performs in recovering these known associations. We adopt three large scale leave-one-out cross-validation experiments for this purpose.

First, in the validation of random controls, we prioritize domains that are known to be associated with disease phenotypes (i.e., disease domains) against randomly selected control domains. Specifically, in each run of the validation, we select an association between a domain and a disease phenotype, assume that the association is unknown, and prioritize the domain against a set of 99 randomly selected control domains.

Second, in the validation of simulated linkage intervals, we prioritize domains that are known to be associated with disease phenotypes (i.e., seed domains) against domains that are located around the seed domains. Specifically, in each run of the validation, we select an association between a domain and a disease phenotype, assume that the association is unknown, and prioritize the domain against a set of control domains that are located within 10 Mbp upstream and downstream of this domain.

Third, in the validation of genome-wide scan, we prioritize seed domains against all known domains. Specifically, in each run of the validation, we select an association between a domain and a disease phenotype, assume that the association is unknown, and prioritize the domain against all other domains in the domain-domain interaction network.

In each of the above leave-one-out cross-validation experiments, we repeat the validation run for every known association between a domain and a disease phenotype, thereby obtaining a number of ranking lists. We further normalize the ranks by dividing them by the total number of candidate domains in the ranking list to obtain rank ratios, and we calculate the values of three criteria to measure the performance of a prioritization method.
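One validation run against random controls, and the resulting rank ratio, can be sketched as below. The function names, the domain identifiers, and the random placeholder scores are hypothetical; in the actual experiments the scores would be the Bayes factors of the candidate domains.

```python
import random

def random_control_candidates(target, all_domains, n_controls=99, rng=None):
    """Return the held-out domain together with randomly selected controls."""
    rng = rng or random.Random()
    pool = [d for d in all_domains if d != target]
    return [target] + rng.sample(pool, n_controls)

def rank_ratio(scores, target):
    """Rank of the target (1 = highest score) divided by the list length."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return (ranked.index(target) + 1) / len(ranked)

rng = random.Random(0)
domains = [f"PF{i:05d}" for i in range(500)]  # hypothetical accessions
candidates = random_control_candidates("PF00007", domains, rng=rng)
scores = {d: rng.random() for d in candidates}  # placeholder scores in [0, 1)
scores["PF00007"] = 2.0  # pretend the held-out domain scores highest
ratio = rank_ratio(scores, "PF00007")
```

Dividing by the list length makes runs with different numbers of candidates comparable, which matters for the linkage-interval experiment where the candidate set varies from run to run.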

The first criterion is termed precision. We consider a prediction as successful if the known disease domain is ranked at the top (with rank 1). Then, the proportion of successful predictions among all predictions is defined as the precision. Obviously, a high precision suggests that a method has high prediction power. The second criterion is termed mean rank ratio, which is simply the average of rank ratios for all known disease domains in a cross-validation experiment. This criterion provides a summary of the ranks of all domains that are known to be associated with disease phenotypes, and the smaller the mean rank ratio, the better a method. The third criterion is termed AUC, which is the area under the receiver operating characteristic curve (ROC). Given a list of rank ratios and a predefined threshold, we define the sensitivity as the percentage of disease domains that are ranked above the threshold and the specificity as the percentage of control domains that are ranked below the threshold. By varying the threshold values, we are able to plot a receiver operating characteristic curve, which shows the relationship between sensitivity and 1-specificity. Calculating the area under the ROC curve (AUC), we are able to obtain the AUC score, which provides an overall measure for the performance of the prioritization approach.
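The three criteria can be computed from the ranking lists as in the following sketch. It represents each validation run by the rank of the held-out disease domain and the number of candidates, and it computes the AUC as the probability that a disease domain has a smaller rank ratio than a control domain (ties counted as one half), which equals the area under the ROC curve obtained by sweeping the threshold. Pooling all runs into one ROC curve is an assumption about the procedure, and the function name is hypothetical.

```python
def evaluate(runs):
    """runs: list of (rank of the held-out disease domain, number of candidates)."""
    precision = sum(rank == 1 for rank, _ in runs) / len(runs)
    mean_rank_ratio = sum(rank / n for rank, n in runs) / len(runs)
    # Rank ratios of disease domains and of all control domains, pooled.
    positives = [rank / n for rank, n in runs]
    negatives = [r / n for rank, n in runs
                 for r in range(1, n + 1) if r != rank]
    wins = sum((p < q) + 0.5 * (p == q) for p in positives for q in negatives)
    auc = wins / (len(positives) * len(negatives))
    return precision, mean_rank_ratio, auc

# Two toy runs with four candidates each: ranked first, then ranked second.
precision, mean_rank_ratio, auc = evaluate([(1, 4), (2, 4)])
```

In this toy case one of the two runs places the disease domain first, so the precision is one half, while the mean rank ratio and the AUC summarize how close to the top the disease domains sit overall.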

Results

Domain proximity implying phenotype similarity

The DomainRBF approach is based on the assumption that similarities of disease phenotypes can be explained by proximities of domains associated with the phenotypes within a domain-domain interaction network via a regression model. In order to validate this assumption, we discard singletons in the PDB part of the DOMINE database ^{-16}, indicating that the similarities of disease phenotypes have a strong relationship with the proximities of associated domains.

To further substantiate this point, we perform a series of permutations towards disease-disease, domain-disease, and domain-domain relationships. First, we break the disease-disease relationship by permuting the phenotype similarity profile. Second, we break the domain-disease relationship by two methods: (1) permuting domain-disease associations and (2) replacing domains in known disease-domain associations with randomly selected domains. Third, we break the domain-domain relationship by permuting connections in the underlying domain-domain interaction network, while keeping node degrees and recalculating the diffusion kernel. For each of the above permutations, we calculate Bayes factors of disease domains and present the results in Figure


**Bayes factors of the original and permuted data**. "original", "permuted PPS", "permuted seed", "random seed", and "permuted DDI" denote the results obtained using the original data, permuted phenotype similarity profile, permuted domain-disease associations, randomly selected seed domains, and permuted domain-domain interaction network, respectively. The small domain-domain interaction network composed of the PDB part of the DOMINE database and the diffusion kernel are used to obtain the results.

We also perform similar studies using the large domain-domain interaction network (48,778 interactions between 5,490 domains) that includes the entire DOMINE database ^{-16}. We further perform a series of permutation tests and present the results [Additional file

**Supplemental Figures**. Supplemental Figure S1 shows the results of a series of permutation tests using the large domain-domain interaction network. Supplemental Figure S2 shows the mean ranks for domains with different frequencies of occurrence in human proteins.


Performance of the DomainRBF approach

Since interactions inferred from PDB entries have the highest confidence among domain-domain interactions, we first test the validity of our approach on the PDB part of the DOMINE database

For each of the three validation experiments, using either the diffusion kernel or the shortest path with Gaussian kernel, we draw a histogram of rank ratios for the entire 1,066 known associations, as shown in Figure


**Histograms of rank ratios for domains known to be associated with diseases**. (A) Results for shortest path with Gaussian kernel, against random controls. (B) Results for shortest path with Gaussian kernel, against linkage intervals. (C) Results for shortest path with Gaussian kernel, against genome-wide scan. (D) Results for diffusion kernel, against random controls. (E) Results for diffusion kernel, against linkage intervals. (F) Results for diffusion kernels, against genome-wide scan. The small domain-domain interaction network composed of the PDB part of the DOMINE database and the diffusion kernel are used to obtain the results.

We then assess the performance of the proposed approach using the three criteria (mean rank ratio, precision, and AUC score) and summarize the results in Table

Results of leave-one-out cross-validation experiments on the small network.

| Criterion | Measure | Random Control (%), $R^2$ | Random Control (%), BF | Linkage Interval (%), $R^2$ | Linkage Interval (%), BF | Genome-wide Scan (%), $R^2$ | Genome-wide Scan (%), BF |
|---|---|---|---|---|---|---|---|
| Precision | SG | 26.11 (0.66) | 26.56 (0.85) | 21.69 | 23.79 | 6.29 | 5.25 |
| Precision | DK | 26.25 (0.63) | 28.67 (0.72) | 19.60 | 31.20 | 7.13 | 5.53 |
| Mean Rank Ratio | SG | 17.31 (0.11) | 11.99 (0.05) | 18.24 | 11.19 | 16.49 | 11.17 |
| Mean Rank Ratio | DK | 17.82 (0.13) | 10.75 (0.09) | 19.09 | 9.73 | 17.03 | 9.91 |
| AUC | SG | 83.51 (0.10) | 88.80 (0.05) | 82.81 | 89.94 | 83.60 | 88.85 |
| AUC | DK | 83.01 (0.12) | 90.04 (0.11) | 81.95 | 90.18 | 83.06 | 90.11 |

The small network indicates the domain-domain interaction network that is composed of the PDB part of the DOMINE database. $R^2$ denotes the ordinary non-Bayesian linear regression approach (using R-square as scores for candidate domains). BF denotes the domainRBF approach (using Bayes factors as scores for candidate domains). SG denotes shortest path with Gaussian kernel. DK denotes diffusion kernel. Results for random controls are mean (standard deviation) of 10 validation runs.
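For reference, the R-square score of the non-Bayesian comparator can be sketched as the coefficient of determination of a simple linear regression, which for one predictor equals the squared Pearson correlation. The function name and the toy profiles are hypothetical.

```python
def r_squared(x, y):
    """Coefficient of determination of the least-squares fit y = b0 + b1 * x."""
    m = len(x)
    mean_x, mean_y = sum(x) / m, sum(y) / m
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return s_xy ** 2 / (s_xx * s_yy)

score_linear = r_squared([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
score_noisy = r_squared([0.1, 0.3, 0.5, 0.6, 0.9], [-1.2, -0.4, 0.1, 0.5, 1.0])
```

Unlike the Bayes factor, this score involves no prior on the regression coefficients.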

Second, we conjecture from these results that the diffusion kernel measure is slightly better than the shortest path measure with Gaussian kernel, because the mean rank ratios obtained using the diffusion kernel are in general smaller, and the precisions and AUC scores are in general larger, than those obtained using the shortest path with Gaussian kernel. This phenomenon might be explained by the fact that diffusion kernel is a global network-distance measure. As such, the distance between two domains not only depends on the relative location of the candidate domain to all other domains (as the shortest path with Gaussian kernel does), but also relies on the graph structure of the entire network. Thus, for interaction networks with different graph structure, two nodes with the same shortest path distance usually have different diffusion kernel distance, and it is possible that this difference makes the diffusion kernel distance more reasonable and precise in the description of similarities between two domains in the interaction network. This point has also been explicitly illustrated in literature

Third, we conjecture from these results that the domainRBF approach with properly defined priors can achieve higher performance than the non-Bayesian linear regression method. We compare the performance of the (Bayesian) domainRBF approach with the (non-Bayesian) ordinary linear regression method through the three large-scale leave-one-out cross-validation experiments, and we also list the results in Table


**ROC curves of the leave-one-out cross-validation experiments**. (A) Results for random controls. (B) Results for linkage intervals. (C) Results for genome-wide scan. BF: the domainRBF approach (using Bayes factors as scores for candidate domains). $R^2$: the ordinary non-Bayesian linear regression approach (using R-square as scores for candidate domains). SG: shortest path with Gaussian kernel. DK: diffusion kernel. Numbers in the parentheses are AUC scores of the corresponding ROC curves. The small domain-domain interaction network composed of the PDB part of the DOMINE database is used to obtain the results.

Robustness of the DomainRBF approach

Effects of network interactions

The above validation results suggest that the domainRBF approach can successfully prioritize candidate domains and put the domain that is truly associated with the query disease phenotype at the top of the candidates. However, it is still necessary to determine whether the correct prioritization of disease domains is due to the connectivity information that is included in the domain-domain interactions, domain-phenotype associations, and phenotype-phenotype similarities. To accomplish this, we artificially destroy informative interactions in the above three networks and observe the resulting performance. If the prioritization relies on this information, both the mean rank ratios and the AUC scores are expected to be around 50%, together with very low precisions. With this understanding, we perform three permutation experiments: 1) shuffling interactions among domains while fixing the node degree (number of direct neighbours) distribution of the entire interaction network, 2) shuffling domain-phenotype associations while fixing the number of associated domains for each of the phenotypes, and 3) shuffling the phenotype-phenotype similarities while fixing the distribution of phenotype similarities. Then we repeat the leave-one-out cross-validation experiments using the shuffled networks, which contain no informative interactions among domains, between domains and phenotypes, or among phenotypes, respectively. As shown in Figure


**ROC curves of the leave-one-out cross-validation experiments on shuffled data**. (A) Results for random controls with domain-domain interactions shuffled. (B) Results for linkage intervals with domain-domain interactions shuffled. (C) Results for genome-wide scan with domain-domain interactions shuffled. (D) Results for random controls with known domain-phenotype associations shuffled. (E) Results for linkage intervals with known domain-phenotype associations shuffled. (F) Results for genome-wide scan with known domain-phenotype associations shuffled. (G) Results for random controls with phenotype similarity profiles shuffled. (H) Results for linkage intervals with phenotype similarity profiles shuffled. (I) Results for genome-wide scan with phenotype similarity profiles shuffled. The small domain-domain interaction network composed of the PDB part of the DOMINE database and the diffusion kernel are used to obtain the results.
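The shuffling of domain-domain interactions with fixed node degrees can be sketched with repeated double edge swaps, a common way to randomize a network while preserving its degree sequence. Whether the original experiments used exactly this procedure is an assumption, and the function names and the toy network are hypothetical.

```python
import random
from collections import Counter

def shuffle_preserving_degrees(edges, n_swaps, rng):
    """Randomize an undirected simple graph by double edge swaps.

    Each accepted swap turns edges (a, b) and (c, d) into (a, d) and (c, b),
    so every node keeps its degree.
    """
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # consider both orientations of the second edge
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        # Reject swaps that would create self-loops or duplicate edges.
        if a == d or c == b or new1 in edge_set or new2 in edge_set:
            continue
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
        done += 1
    return edge_list

def degrees(edges):
    """Count how many edges touch each node."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts

edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5), (0, 3)]
shuffled = shuffle_preserving_degrees(edges, n_swaps=20, rng=random.Random(1))
```

After shuffling, the proximity measure has to be recomputed on the new network, since both the shortest paths and the diffusion kernel depend on the edges.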

Effects of different domain-domain interaction networks

We notice that the two compiled domain-domain interaction networks have different properties. For example, the average degree of the smaller network that includes only PDB entries is 2.32, while that of the larger network that includes predicted interactions from both DOMINE and InterDom is 17.77. It is possible that many predicted interactions may actually be noise and thus negatively affect the prioritization of disease domains. Hence, it is necessary to validate the robustness of the proposed approaches to the underlying domain-domain interactions. For this purpose, we implement the same validation process based on the large compiled domain-domain interaction network that is composed of all interactions in the DOMINE database and high-confidence interactions in the InterDom database. Results are presented in Table

Results of leave-one-out cross-validation experiments on the large network.

| Criterion | Measure | Random Control (%), $R^2$ | Random Control (%), BF | Linkage Interval (%), $R^2$ | Linkage Interval (%), BF | Genome-wide Scan (%), $R^2$ | Genome-wide Scan (%), BF |
|---|---|---|---|---|---|---|---|
| Precision | SG | 14.24 (0.37) | 18.64 (0.66) | 18.54 | 21.23 | 2.17 | 2.35 |
| Precision | DK | 16.80 (0.36) | 22.32 (0.60) | 17.36 | 27.56 | 2.42 | 3.47 |
| Mean Rank Ratio | SG | 27.27 (0.07) | 19.59 (0.09) | 21.89 | 18.07 | 26.53 | 18.79 |
| Mean Rank Ratio | DK | 26.07 (0.11) | 14.96 (0.07) | 20.21 | 14.12 | 25.34 | 14.11 |
| AUC | SG | 73.41 (0.07) | 81.15 (0.09) | 78.57 | 81.76 | 73.38 | 81.09 |
| AUC | DK | 74.61 (0.11) | 85.82 (0.07) | 82.36 | 86.68 | 74.55 | 85.72 |

The large network indicates the domain-domain interaction network that is composed of the entire DOMINE database and high-confidence interactions in the InterDom database. $R^2$ denotes the ordinary non-Bayesian linear regression approach (using R-square as scores for candidate domains). BF denotes the domainRBF approach (using Bayes factors as scores for candidate domains). SG denotes shortest path with Gaussian kernel. DK denotes diffusion kernel. Results for random controls are mean (standard deviation) of 10 validation runs.

Effects of parameters in the distance measures

We further notice that the parameter


**Effects of parameters on distance measures**. (A) Influence of

Effects of parameters in the domainRBF approach

Besides the two parameters in the distance measures, there are also four parameters in the domainRBF approach that need to be pre-determined, namely $\boldsymbol{\mu}_0$, $\sigma_1^2$, $\nu_0$, and $\sigma_0^2$, all of which are included in the priors of the domainRBF approach (see Materials and Methods for details). In the real implementation we set $\boldsymbol{\mu}_0 = \mathbf{0}$, $\nu_0 = 0$, and $\sigma_0^2 = 0$, for the reason explained in the literature, and $\sigma_1^2 = 1$, for simplicity. Therefore, we only need to test the robustness of the approach when different values of $\sigma_1^2$ are used. To achieve this objective, we set $\sigma_1^2$ as 0.001, 0.01, 0.1, 1, 10, and 100, respectively, and we apply the approach to the same cross-validation process. We list the results in the table below. When $\sigma_1^2$ is smaller than 1, the domainRBF approach is quite robust to the change of $\sigma_1^2$, with the change of precision within 1.12%, change of mean rank ratios within 0.53%, and change of AUC scores within 0.54%. On the other hand, when $\sigma_1^2$ is larger than 1, the decrease in performance becomes slightly more conspicuous, but remains within 3.04% for precisions, 2.22% for mean rank ratios, and 1.77% for AUC scores. Hence we can see that our domainRBF approach is generally robust to the change of parameters.

Effects of parameters based on leave-one-out cross-validation experiments.

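| Criteria | $\sigma_1^2$ | Random Control (%) | Linkage Interval (%) | Genome-wide Scan (%) |
|---|---|---|---|---|
| Precision | 0.001 | 28.87 (0.70) | 31.49 | 4.41 |
| Precision | 0.01 | 29.02 (1.17) | 31.77 | 4.78 |
| Precision | 0.1 | 29.39 (0.60) | 31.36 | 5.07 |
| Precision | 1 | 28.67 (0.72) | 31.20 | 5.53 |
| Precision | 10 | 27.76 (0.62) | 29.73 | 5.35 |
| Precision | 100 | 28.59 (1.18) | 28.16 | 5.35 |
| Mean Rank Ratio | 0.001 | 10.23 (0.07) | 9.83 | 9.38 |
| Mean Rank Ratio | 0.01 | 10.25 (0.11) | 9.98 | 9.41 |
| Mean Rank Ratio | 0.1 | 10.41 (0.07) | 9.49 | 9.54 |
| Mean Rank Ratio | 1 | 10.75 (0.09) | 9.73 | 9.91 |
| Mean Rank Ratio | 10 | 11.14 (0.10) | 10.92 | 10.27 |
| Mean Rank Ratio | 100 | 12.08 (0.10) | 11.95 | 11.24 |
| AUC | 0.001 | 90.58 (0.10) | 90.14 | 90.55 |
| AUC | 0.01 | 90.56 (0.14) | 90.40 | 90.61 |
| AUC | 0.1 | 90.40 (0.10) | 90.22 | 90.40 |
| AUC | 1 | 90.04 (0.11) | 90.18 | 90.11 |
| AUC | 10 | 89.69 (0.10) | 89.23 | 89.86 |
| AUC | 100 | 88.74 (0.14) | 88.41 | 88.98 |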

The small domain-domain interaction network composed of the PDB part of the DOMINE database and the diffusion kernel are used to obtain the results. Results for random controls are mean (standard deviation) of 10 validation runs.

Effects of seed domain-disease associations

In order to test the influence of the number of seed (i.e., known) associations on the prioritization results, we select at random 100%, 90%, 80%, 70%, 60%, and 50% of the original seed associations, respectively, and we repeat the leave-one-out validation processes. We only calculate the performance using the domainRBF approach based on the diffusion kernel measure, and we choose the PDB part of the DOMINE database as the domain-domain interaction network. Results show that as the percentage of seed associations decreases from 100% to 50%, performance also slightly decreases in terms of precision, mean rank ratio and AUC score, despite some exceptions (see Table

Effects of seed domain-disease associations based on leave-one-out cross-validation experiments.

| Criteria | Cutoff | Random Control (%) | Linkage Interval (%) | Genome-wide Scan (%) |
|---|---|---|---|---|
| Precision | 90% | 28.28 (0.42) | 33.45 (1.57) | 5.34 (0.33) |
| Precision | 80% | 27.42 (0.76) | 32.06 (1.44) | 4.79 (0.47) |
| Precision | 70% | 27.43 (1.06) | 29.59 (0.93) | 3.97 (0.74) |
| Precision | 60% | 25.68 (1.04) | 30.11 (1.10) | 3.58 (0.58) |
| Precision | 50% | 25.85 (0.73) | 28.47 (0.89) | 5.35 (1.22) |
| Mean Rank Ratio | 90% | 10.90 (0.78) | 8.98 (0.64) | 10.16 (0.81) |
| Mean Rank Ratio | 80% | 11.23 (0.84) | 9.37 (0.92) | 10.38 (1.02) |
| Mean Rank Ratio | 70% | 11.68 (1.21) | 10.09 (1.05) | 10.61 (0.69) |
| Mean Rank Ratio | 60% | 12.37 (0.97) | 9.21 (1.01) | 10.76 (0.80) |
| Mean Rank Ratio | 50% | 13.47 (0.55) | 9.72 (0.98) | 10.95 (0.78) |
| AUC | 90% | 89.92 (0.30) | 90.83 (0.58) | 89.87 (0.61) |
| AUC | 80% | 89.60 (0.22) | 90.26 (0.43) | 89.66 (0.29) |
| AUC | 70% | 89.15 (0.17) | 89.40 (0.64) | 89.46 (0.48) |
| AUC | 60% | 88.48 (0.59) | 90.01 (0.51) | 89.32 (0.34) |
| AUC | 50% | 87.31 (0.28) | 89.89 (0.46) | 89.16 (0.75) |

The small domain-domain interaction network composed of the PDB part of the DOMINE database and the diffusion kernel are used to obtain the results. Results for random controls are mean (standard deviation) of 10 validation runs.

In order to study how known domain-disease associations for other diseases contribute to the inference of domains that are associated with the query disease, we keep 10%, 20%, 30%, 40%, and 50% disease phenotypes that have the highest similarity scores to the query disease, respectively, and we repeat the leave-one-out validation processes, using the diffusion kernel measure and the small domain-domain interaction network that is composed of the PDB part of the DOMINE database. Results (Table

Contributions of seed domain-disease associations based on leave-one-out cross-validation experiments.

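| Criteria | Cutoff | Random Control (%) | Linkage Interval (%) | Genome-wide Scan (%) |
|---|---|---|---|---|
| Precision | 10% | 37.82 (0.72) | 42.29 | 8.16 |
| Precision | 20% | 34.55 (1.03) | 38.39 | 8.63 |
| Precision | 30% | 32.85 (0.77) | 37.52 | 7.79 |
| Precision | 40% | 32.78 (0.97) | 36.98 | 7.79 |
| Precision | 50% | 31.18 (0.62) | 34.50 | 6.16 |
| Mean Rank Ratio | 10% | 8.83 (0.06) | 7.87 | 7.96 |
| Mean Rank Ratio | 20% | 8.80 (0.07) | 7.92 | 7.95 |
| Mean Rank Ratio | 30% | 9.23 (0.03) | 8.47 | 8.33 |
| Mean Rank Ratio | 40% | 9.64 (0.07) | 8.61 | 8.74 |
| Mean Rank Ratio | 50% | 9.98 (0.06) | 8.86 | 9.09 |
| AUC | 10% | 92.14 (0.12) | 94.08 | 91.75 |
| AUC | 20% | 92.16 (0.10) | 93.91 | 91.83 |
| AUC | 30% | 91.71 (0.10) | 93.46 | 91.42 |
| AUC | 40% | 91.33 (0.08) | 92.63 | 91.08 |
| AUC | 50% | 90.94 (0.06) | 92.30 | 90.66 |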

The small domain-domain interaction network composed of the PDB part of the DOMINE database and the diffusion kernel are used to obtain the results. Results for random controls are mean (standard deviation) of 10 validation runs.

**Supplemental Tables**. Supplemental Table S1 lists contributions of seed domain-disease associations (leave-one-out cross-validation experiments using the large domain-domain interaction network). Supplemental Table S2 lists contributions of seed domain-disease associations (


Above we have used several large-scale leave-one-out cross-validation experiments to evaluate the performance and robustness of the proposed domainRBF approach. However, it might be argued that a disease may be associated with more than one domain and that, consequently, the inclusion of domains already known to be associated with the query disease in the calculation of the domain proximity profile may ease the identification of novel associations. Following this line of reasoning, we demonstrate the capability of the proposed domainRBF approach in the prediction of novel associations for query diseases by performing the following

| Criterion | Random Control (%) | Linkage Interval (%) | Genome-wide Scan (%) |
|---|---|---|---|
| Precision | 22.11 (0.45) | 24.39 | 3.90 |
| Mean Rank Ratio | 17.34 (0.07) | 17.07 | 16.45 |
| AUC | 83.43 (0.07) | 84.62 | 83.27 |

In comparison with the results in Table

We also study the contribution of seed domain-disease associations by keeping a fraction of disease phenotypes that have the highest similarity scores to the query disease and repeating

Inspired by the success of


**Scheme of inferring gene-disease associations from domain-disease associations**.

We validate our gene prioritization approach using known gene-disease associations extracted from the BioMart database

Genome-wide evidence of associations between domains and common human diseases

The identification of susceptible single nucleotide polymorphisms (SNPs) conferring risk for common human diseases is one of the main tasks of genome-wide association studies (GWAS). Since the study that identified the association of complement factor H (CFH) with age-related macular degeneration (AMD) in 2005, over 450 GWAS have been performed and more than 2,000 susceptible SNPs or genetic loci have been reported

Given a disease of interest, we collect from SNPedia

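1. Count the number of reported SNPs that appear within 5 Mbp of domains that are ranked among the top 10 in the genome-wide scan. Record this number as $n_0$.

2. For the $i$-th permutation, randomly select 10 domains and count the number of reported SNPs that appear within 5 Mbp of these domains. Record this number as $n_i$.

3. Repeat the above random selection $N$ times and count how many of the $n_i$ ($i = 1, \ldots, N$) are no less than $n_0$. Record this number as $M$.

4. Calculate a $p$-value as $M/N$.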
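The permutation test in the steps above can be sketched as follows. Several details are assumptions made for illustration: each domain is represented by a single genomic locus (in practice a domain may map to several genes), a random selection counts as a hit when its SNP count is at least the observed count, the $p$-value is the fraction of hits, and the function names and toy coordinates are hypothetical.

```python
import random

def count_nearby_snps(domains, snps, window=5_000_000):
    """Count SNPs lying within `window` bp of at least one domain locus."""
    return sum(
        any(chrom == d_chrom and abs(pos - d_pos) <= window
            for d_chrom, d_pos in domains)
        for chrom, pos in snps
    )

def enrichment_p_value(top_domains, all_domains, snps, n_perm=200, seed=0):
    """Empirical p-value for SNP enrichment around the top-ranked domains."""
    rng = random.Random(seed)
    observed = count_nearby_snps(top_domains, snps)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(all_domains, len(top_domains))
        if count_nearby_snps(sample, snps) >= observed:
            hits += 1
    return observed, hits / n_perm

# Toy genome: 100 domain loci spaced 20 Mbp apart on one chromosome.
all_domains = [("chr1", i * 20_000_000) for i in range(100)]
top_domains = all_domains[:2]
snps = [("chr1", 1_000_000), ("chr1", 19_000_000), ("chr1", 21_000_000)]
observed, p_value = enrichment_p_value(top_domains, all_domains, snps)
```

In this toy setting all three SNPs lie near the two top-ranked loci, and a random pair of loci reaches the same count only if it coincides with the top pair, so the empirical $p$-value is small.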

The null hypothesis in the above permutation test is that the number of the reported susceptible SNPs within 5 Mbp regions of the high ranking domains (top 10) is not different from that of randomly selected domains. Therefore, a small

We then select four disease examples (Type 1 diabetes, Type 2 diabetes, Crohn's disease, and Breast cancer), apply the above permutation test method to these diseases, and analyze the results in detail. We choose these four diseases because they are common and have GWAS results available. It has been shown that diabetes had affected 2.8% of the population worldwide by 2000

Type 1 Diabetes

Type 1 diabetes, formerly called juvenile diabetes or insulin-dependent diabetes, is a condition in which pancreatic

In addition, we also examine the 4 domains that are not close to susceptible SNPs reported by GWAS. For domain HNF-1B_C (PF04812), we notice that Urhammer

Type 2 Diabetes

Type 2 diabetes, formerly called adult-onset diabetes or noninsulin-dependent diabetes, is the most common form of diabetes. It usually begins with insulin resistance, a condition in which fat, muscle, and liver cells do not use insulin properly

In addition, we also examine the 3 domains that are not close to susceptible SNPs reported by GWAS. We notice that both domains HNF-1A_C (PF04813) and HNF-1_N (PF04814) contain mutations that may cause the type 3 form of maturity-onset diabetes of the young (MODY3), as pointed out by Urhammer

Crohn's Disease

Crohn's disease, a chronic inflammatory disorder of the gastrointestinal tract, is thought to result from the combined effect of environmental factors and genetic predisposition

Breast Cancer

Breast cancer, the most common malignancy in women in the Western world

Contributions of seed domain-disease associations in the analysis of the four diseases

For each of the four disease examples, we further evaluate the contribution of seed domain-disease associations by keeping the 10%, 20%, 30%, 40%, and 50% of disease phenotypes that have the highest similarity scores to the query disease, obtaining a rank list of all domains, and then using the permutation test to check whether the number of known susceptible SNPs is still significantly enriched around the top-ranking domains. As shown in [Additional file

A predicted landscape of domain-disease associations

With the above validation results demonstrating the possibility of recovering the associations between protein domains and disease phenotypes, we further apply the domainRBF approach to all available protein domains and human disease phenotypes and predict a genome-wide landscape of the associations between protein domains and human disease phenotypes. There are a total of 5,080 phenotypes in the phenotype similarity network and 5,490 protein domains in the domain-domain interaction network (the union of the entire DOMINE and InterDom networks). For each phenotype, we prioritize all domains using the domainRBF approach (with the diffusion kernel measure). The prioritization results, together with a freely accessible web interface, are provided at

On the basis of the above prioritization results, we aggregate the Bayes factors between all the 5,490 domains and 1,145 phenotypes, and obtain a matrix of altogether 6,286,050 elements. Here we first apply a log (base 10) transform to the original matrix, and then perform clustering, removing the rows in which all values are smaller than 0.1. Since phenotypes clustered together generally have a similar molecular basis, or share significant genetic overlaps
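The preprocessing applied before clustering can be illustrated with a short sketch. It assumes that the Bayes factors are strictly positive and that the 0.1 threshold is applied to the log-transformed values; the text does not state which of the two scales the threshold refers to, so this is one possible reading. The function name `log_transform_and_filter` and the toy matrix are hypothetical.

```python
import math


def log_transform_and_filter(bayes_factors, threshold=0.1):
    """Log10-transform a matrix and drop rows whose values are all below `threshold`.

    Assumption: the threshold is compared against the log10 values.
    Returns the indices of the kept rows and the transformed rows themselves.
    """
    kept_indices, kept_rows = [], []
    for i, row in enumerate(bayes_factors):
        if any(value <= 0 for value in row):
            raise ValueError("Bayes factors must be strictly positive")
        log_row = [math.log10(value) for value in row]
        # Keep the row if at least one transformed value reaches the threshold.
        if any(value >= threshold for value in log_row):
            kept_indices.append(i)
            kept_rows.append(log_row)
    return kept_indices, kept_rows


# Toy matrix of Bayes factors (hypothetical values).
toy_bayes_factors = [
    [1.0, 1.1, 0.9],     # all log10 values below 0.1, so the row is dropped
    [10.0, 1.0, 100.0],  # kept
    [0.5, 2.0, 1.0],     # kept, because log10(2.0) is above 0.1
]
kept_indices, kept_rows = log_transform_and_filter(toy_bayes_factors)
print(kept_indices)
```

The filtered matrix could then be passed to a two-way hierarchical clustering routine, for example `scipy.cluster.hierarchy.linkage` applied to the rows and to the columns; the linkage method and distance measure are not specified in this section and would have to be chosen.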

**Modular organization of the predicted landscape of human disease phenotypes**. (A) Two-way hierarchical clustering heat map for the landscape of domain-phenotype associations. (B) Zoomed-in plot of the pink circle region in the heat map, involving 17 muscular diseases and 20 related protein domains.

We further apply the above prioritization method to all human disease phenotypes and obtain a landscape of gene-phenotype associations that include 5,080 disease phenotypes and 14,944 human genes. The prioritization results, together with a freely accessible web interface, are provided at

Discussion and Conclusions

In this paper, we studied the problem of identifying domains that are associated with human inherited diseases under a prioritization framework. We proposed an approach called domainRBF from the perspective of Bayesian regression, verified its superior performance through three large-scale cross-validation experiments, and demonstrated the robustness of this approach via a series of permutation tests. We further proposed to perform

In comparison with previous studies that rely on phenotype similarity and protein-protein interaction data to infer gene-disease associations

However, our method has the following limitations. First, our method can only be applied to diseases that are included in phenotype similarity data and domains that are included in domain interaction data. In the case of phenotype similarity, a possible solution would involve the development of a visualization and annotation system such as the one in

Second, our method currently considers only conjugate priors in the Bayesian regression model. Although such a formulation results in analytic solutions and thus alleviates the computational burden in the calculation of Bayes factors, it is known that the specification of priors is intrinsically complicated and subjective

Our approach can be further studied from the following aspects. First, in addition to the domain-domain interaction network, information such as annotations of Pfam domains in the Gene Ontology (GO) can also provide a means for calculating similarities between domains. Recently, methods for calculating semantic similarities between GO terms have been packaged into user-friendly software

Second, it is conceptually straightforward to extend the domainRBF model to infer interactive effects of multiple domains on a query disease. For example, given a query disease and a set of candidate domains, we can enumerate all two-way combinations of the domains and then use the domainRBF model to infer possible associations between the disease and interactions of two domains. Nevertheless, such a brute-force method is computationally intensive and hardly feasible for studying three-way or higher-order interactive effects of candidate domains.
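The computational cost of this brute-force enumeration can be made concrete with a short sketch. It only enumerates the combinations and counts them; scoring each combination with the model is not shown. The candidate list reuses three Pfam accessions mentioned in this paper purely as placeholders, the function name `domain_pairs` is hypothetical, and 5,490 is the number of domains in the interaction network reported above.

```python
from itertools import combinations
from math import comb


def domain_pairs(candidate_domains):
    """Enumerate all unordered two-way combinations of candidate domains."""
    return list(combinations(candidate_domains, 2))


# Placeholder candidate set (accessions taken from the text for illustration only).
candidates = ["PF04812", "PF04813", "PF04814"]
pairs = domain_pairs(candidates)
print(pairs)

# Number of combinations when every domain in the network is a candidate.
n_domains = 5490
n_pairs = comb(n_domains, 2)
n_triples = comb(n_domains, 3)
print(f"two-way combinations: {n_pairs:,}")
print(f"three-way combinations: {n_triples:,}")
```

The number of combinations grows roughly as the k-th power of the number of candidates for k-way interactions, which is why exhaustive enumeration quickly becomes impractical beyond pairs.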

Third, with the accumulation of publicly available data in genome-wide association (GWA) studies, we can consider the integration of our method and GWA studies. For example, given a disease of interest and a set of candidate domains, we can prioritize the candidate domains using our method and obtain the ranks of the domains. On the other hand, given

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

WZ derived the model, implemented the method, collected the results and drafted the manuscript. YC participated in the design of the study. FS designed the research and revised the manuscript. RJ designed the research, drafted, and revised the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This work was partly supported by the National Science Foundation of China (60805010, 60928007, and 60934004), Tsinghua University Initiative Scientific Research Program, Tsinghua National Laboratory for Information Science and Technology (TNLIST) Cross-discipline Foundation, Research Fund for the Doctoral Program of Higher Education of China (200800031009), and the Scientific Research Foundation for Returned Overseas Chinese Scholars. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.