Genetic interaction motif finding by expectation maximization – a novel statistical model for inferring gene modules from synthetic lethality
1 Biomedical Engineering Department, Johns Hopkins University, North Charles Street, Baltimore, MD 21218, USA
2 High-Throughput Biology Center, Johns Hopkins School of Medicine, 733 North Broadway, Baltimore, MD 21205, USA
BMC Bioinformatics 2005, 6:288. doi:10.1186/1471-2105-6-288
The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/6/288
© 2005 Qi et al; licensee BioMed Central Ltd.
Abstract
Background
Synthetic lethality experiments identify pairs of genes with complementary function. More direct functional associations (for example, a greater probability of membership in a single protein complex) may be inferred between genes that share synthetic lethal interaction partners than between genes that are directly synthetic lethal. Probabilistic algorithms that identify gene modules based on motif discovery are highly appropriate for the analysis of synthetic lethal genetic interaction data and have great potential in the integrative analysis of heterogeneous datasets.
Results
We have developed Genetic Interaction Motif Finding (GIMF), an algorithm for unsupervised motif discovery from synthetic lethal interaction data. Interaction motifs are characterized by position weight matrices and optimized through expectation maximization. Given a seed gene, GIMF performs a nonlinear transform on the input genetic interaction data and automatically assigns genes to the motif or non-motif category. We demonstrate the capacity to extract known and novel pathways for Saccharomyces cerevisiae (budding yeast). Annotations suggested for several uncharacterized genes are supported by recent experimental evidence. GIMF is computationally efficient, requires no training and automatically down-weights promiscuous genes with high degrees.
Conclusion
GIMF effectively identifies pathways from synthetic lethality data and has several unique features. It is best suited to building gene modules around seed genes. The choice of a single model parameter allows gene networks to be constructed at different levels of confidence. The impact of hub genes is automatically down-weighted. The generic probabilistic framework of GIMF may be used to group other types of biological entities, such as proteins, based on stochastic motifs.
Analysis of the strongest motifs discovered by the algorithm indicates that synthetic lethal interactions are depleted between genes within a motif, suggesting that synthetic lethality occurs between-pathway rather than within-pathway.
Background
Much recent research effort has been devoted to studying gene functions in the context of highly dynamic and modular cellular networks [1-4]. Valuable information about a gene's function can be obtained from its interactions with other genes [5]. In contrast to the traditional hierarchical approach to gene function annotation, functional genomics takes a bottom-up approach and assembles gene interaction networks from all detected pair-wise gene interactions. From such genetic interaction maps, functional modules representing various biological pathways and processes can then be extracted by computational approaches. Those modules naturally suggest novel gene functions in the relevant biological processes [6].
The interactions between genes are of course highly dynamic, both spatially and temporally. However, one of the most intuitive yet fundamental questions about genetic interactions is whether the normal functioning of two genes depends on each other. Synthetic lethality identifies genes that complement each other's function: two genes are synthetic lethal if either single mutant is viable but the double mutant combination is lethal. High-throughput experiments such as synthetic genetic array (SGA) [7] and synthetic lethality analyzed by microarray (SLAM) [8,9] have been used for genome-wide synthetic lethality analysis in Saccharomyces cerevisiae, where a single mutant (query gene) is introduced into the complete pool of viable yeast single-deletion (library gene) strains. Synthetic lethality data obtained through SGA, SLAM or RNA interference have shed much new light on essential biological pathways and on the function assignment of many previously uncharacterized genes in the model organisms yeast and C. elegans [10,11].
Hierarchical clustering of the SGA dataset suggests that two synthetic lethal genes are likely to (i) reside in two redundant parallel pathways or (ii) complement each other's function in two branches of one essential pathway [12]. Computational methods integrating physical protein interactions and other genomic features suggest that significantly more synthetic lethal interactions occur between parallel pathways [13,14]. Given that the synthetic lethal interaction map is incomplete and error-prone, it is highly desirable to investigate methods that extract biologically relevant information probabilistically and thereby accommodate network properties such as the degree distribution and the confidence of the links. Along this line, we have developed in this study a probabilistic model for characterizing synthetic lethal interaction motifs and an algorithm that automatically groups genes sharing similar motifs into pathways. When applied to the SGA dataset, our method automatically uncovers known and novel gene modules that correlate favourably with Gene Ontology (GO) annotations.
Results
Data sources
Genetic interaction data were obtained from SGA analysis in yeast [12]. The original query gene set includes 126 non-essential genes and 6 essential genes, tested against a library of all non-essential gene deletions. Interpretation of synthetic lethality involving essential genes is problematic, since the intermediate (viable) phenotypes exhibited by conditional alleles of essential genes may include loss of function, unregulated function, and gain of function aspects. We therefore focus on synthetic lethal interactions between null alleles of non-essential genes, which by definition result solely from loss-of-function mutations. Ignoring library genes that have no interaction with any of the 126 query genes, our dataset consists of 126 query genes linked to 982 library genes by 4287 interactions. Both the query and the library sets contain hubs with high interaction counts (Supp. Figs. S3, S4, and S5). Yeast protein complex data were obtained from two high-throughput studies, TAP and HMS-PCI [15,16]. Protein complexes that contained two or more non-essential proteins were used (353 complexes from TAP and 427 complexes from HMS-PCI).
Computational method
The expectation maximization (EM) algorithm has been widely used to detect motifs in biopolymer sequences, where a position weight matrix representing a recurring pattern (such as DNA binding sites or promoter regions) in multiple unaligned sequences is built iteratively by maximum likelihood scoring [17-20]. Such a probabilistic approach is well suited to detecting stochastic patterns about which we have little prior knowledge. In this study, we have developed an algorithm for finding genes in the same pathway, which we refer to as Genetic Interaction Motif Finding by expectation maximization (GIMF). Note that a motif here is defined by a genetic interaction pattern and differs from a network topological motif [21]. The model is developed under the hypothesis that genes within the same pathway exhibit a similar pattern of synthetic lethal interactions, in which a subset of common interaction partners are genes in complementary pathways [12-14]. For example, RVS161 and RVS167 are two query genes that belong to the RVS161 complex. Enhanced synthetic lethal interactions with members of the RPD3 complex have been observed (Fig. 1). The RVS161 complex proteins are BAR adaptor proteins involved in actin regulation, endocytosis and viability following starvation or osmotic stress. The RPD3 histone deacetylase complex is involved in silencing at telomeres. In particular, DEP1, a member of the RPD3 complex, is a transcriptional modulator of phospholipid biosynthesis and also maintains mating efficiency and sporulation.
Thus it is reasonable to infer that these two protein complexes are functionally complementary during endocytosis and mating or sporulation after starvation when the biological processes of the two complexes are tightly coupled.
In our analysis, we focus on finding motifs from the synthetic lethal interaction patterns of query genes. Let Xi = [Xi1 ... XiN] denote the interaction partner list for query gene i, where Xij = 1 if i interacts with library gene j and Xij = 0 otherwise. Thus the entire data set is Xi, i = 1,2,...,Q. The total number of query genes is Q = 126 and the total number of library genes that interact with at least one query gene is N = 982. We initiate a search with a seed query gene s and aim to find all other genes in the same pathway as s. We do this by iteratively constructing a motif for the group and hence identifying motif members. Mathematically, we divide the query gene set into two sets: a motif set A = {Ai}, i = 1,2,...,aM, initialized to contain just the seed gene, and a non-motif set B = {Bi}, i = aM+1, aM+2,...,aM+bM, containing the remaining genes. The numbers of query genes in the motif and non-motif sets are aM and bM, respectively, with aM+bM = Q. We assume that genes in the motif set and those in the non-motif set have different probabilities of interacting with a library gene j, denoted by paj and pbj, respectively. As explained in the Discussion, this explicitly allows for the existence of hub library genes. The probability that query gene i belongs to the motif set is denoted by zi. The parameters paj, pbj and zi, where j = 1,2,...,N and i = 1,2,...,Q, are estimated iteratively.
The expectation maximization (EM) algorithm is used for maximum likelihood estimation with missing information. In our scenario, given a seed gene, the missing information is the correct partition of the entire gene pool into a motif set A and a non-motif set B, starting from an initial motif estimate provided by the seed. The likelihood function, i.e. the conditional probability of observing the measured data given the partition, is [equation not recovered]. Thus the log likelihood function is [equation not recovered], where [definition not recovered]. Let us assume that q iterations have been completed.
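As a rough illustration of the model just described, and of the E and M steps that the text goes on to outline, the sketch below fits a two-component Bernoulli mixture by EM around a seed row. It is not the authors' Matlab implementation. The function name `gimf_em_sketch`, the fixed membership prior `prior`, the clipping constant `eps`, and in particular the initialization (seed profile with confidence `p` for the motif, column means for the background) are assumptions made here for illustration only.

```python
import numpy as np

def gimf_em_sketch(X, seed, p=0.95, prior=0.5, tol=1e-4, max_iter=200, eps=1e-6):
    """Two-component Bernoulli mixture fitted by EM, seeded by one query gene.

    X    : (Q, N) 0/1 array, X[i, j] = 1 if query i interacts with library gene j.
    seed : row index of the seed query gene.
    p, prior, eps : ASSUMED initialization/regularization choices, not the paper's.
    """
    X = np.asarray(X, dtype=float)
    # Assumed initialization: the motif profile follows the seed with confidence p,
    # the background profile is the per-library-gene interaction frequency.
    pa = np.where(X[seed] == 1, p, 1.0 - p)
    pb = np.clip(X.mean(axis=0), eps, 1 - eps)
    z = np.zeros(X.shape[0])
    for _ in range(max_iter):
        # E-step: Bernoulli log-likelihood of each row under motif / non-motif,
        # then Bayes' rule for the membership probability z_i.
        log_a = X @ np.log(pa) + (1 - X) @ np.log(1 - pa) + np.log(prior)
        log_b = X @ np.log(pb) + (1 - X) @ np.log(1 - pb) + np.log(1 - prior)
        z_new = 1.0 / (1.0 + np.exp(np.clip(log_b - log_a, -700, 700)))
        # M-step: expected interaction counts divided by expected set sizes.
        pa_new = np.clip((z_new @ X) / max(z_new.sum(), eps), eps, 1 - eps)
        pb_new = np.clip(((1 - z_new) @ X) / max((1 - z_new).sum(), eps), eps, 1 - eps)
        change = max(np.abs(z_new - z).max(),
                     np.abs(pa_new - pa).max(),
                     np.abs(pb_new - pb).max())
        z, pa, pb = z_new, pa_new, pb_new
        # Stop when every parameter estimate moves by less than tol.
        if change < tol:
            break
    return z, pa, pb
```

Calling `gimf_em_sketch(X, seed=s)` on a 0/1 query-by-library matrix returns the membership probabilities together with the two interaction profiles; rows with membership probability close to 1 would be read as motif members of the seed. Because `pb` is estimated per library gene, a library gene that interacts with most queries gets a high background probability and contributes little to separating motif from non-motif rows.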
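At the start of the E-step of iteration q+1, the estimates of the model parameters from the M-step of the previous iteration [symbols not recovered] give the conditional probability of observing Xi when gene i belongs to the motif set [equation not recovered]. Similarly, the conditional probability of observing Xi when gene i belongs to the non-motif set is [equation not recovered]. By Bayes' formula, the probability that a gene i belongs to the motif set given the observed data and the current model estimates is [equation not recovered], where [definition not recovered]. The expected number of interactions with a library gene j is the weighted sum of all the query genes' interactions with gene j, where Xij is weighted by [symbol not recovered]. In the M-step, the model parameters are updated with the expected numbers [equation not recovered]. Convergence of the algorithm is assessed by |t(q+1) - t(q)| < 10^-4 for all of the model parameter estimates, where t(q) denotes an estimate after iteration q. Given a seed gene s, the model parameters are initialized as follows: [equation not recovered]. While the sum over query genes to estimate the initial background probability [remainder of sentence not recovered].
Results on SGA dataset
Outputs of our model are [remainder of sentence not recovered].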
Table 1. Motif members of four seed genes.
Table 2. Seven representative motifs identified by GIMF.
One important property of GIMF is that it is non-commutative: if gene A identifies gene B as a motif member, it is not necessarily true that gene B identifies gene A as its motif member. Interestingly, we have observed that a seed gene tends first to pull in motif members that share a globally similar interaction pattern. If such genes are lacking, it then finds genes with a locally similar interaction pattern. This enables us to probe the case in which two genes' interaction partners are similar only on a local scale, which is not possible with pair-wise comparison metrics, since these are commutative.
For a more systematic analysis, we use GIMF to build gene networks. First, query genes with very few interactions (5 or fewer) are removed from the list of seeds. Then each of the remaining query genes is used as a seed and its motif members are generated by GIMF. For every query gene pair (i, j), if i and j are each other's motif members, we connect i and j with a Type 1 edge. We call the network thus constructed a Type 1 GIMF network (Fig. 3). This network contains 31 nodes and 42 edges, which form two clusters and eight individual pairs. The smaller cluster is a fully connected sub-graph corresponding to the PAC10 complex. The larger cluster contains 10 genes (ARP1, NUM1, DYN1, PAC11, PAC1, DYN2, JNM1, YMR299C, NIP100, KIP2) and represents the Dynein-Dynactin spindle orientation pathway. KIP2 was not detected by hierarchical clustering [12].
Apparently, the bi-directional rule retains only genes with globally similar interaction patterns. This can be quite stringent, since genes have multiple functions: two genes operating in one pathway may have distinct roles in other pathways in which they participate, and thus share only a fraction of their synthetic lethal interaction partners. We therefore extend the Type 1 network by the following simple rule: for each gene pair (i, j) in the Type 1 network, add common motif members k of genes i and j that are not already in the network (hence neither i nor j is a motif member of k), and connect k to i and j with a Type 2 edge. We call the extended network a Type 2 network (Fig. 4). This analysis reveals more information about the Dynein-Dynactin pathway. The majority of Type 2 edges occur between the members of this cluster, which raises the confidence that the genes within this cluster are closely related. Evidence that genes in this cluster are biologically related includes the presence of a dynactin protein complex (ARP1, JNM1, NIP100), reported protein-protein interactions between NIP100-PAC11, PAC11-DYN2 and PAC11-NUM1 [22,23], and the suggestion that YMR299C functions as a dynein light intermediate chain [12]. In the Type 2 network, several new members are incorporated into the cluster, including NBP2, BIK1 and CTF18. The molecular functions of NBP2 and CTF18 are unknown, while BIK1 is involved in microtubule binding. NBP2 shows hyperosmotic and heat response and is a negative regulator of protein kinase activity. CTF18 is a subunit of a complex with CTF8P that shares some subunits with Replication Factor C and is required for sister chromatid cohesion. Mutants of six genes in this cluster (NUM1, DYN1, DYN2, ARP1, JNM1, NIP100) are known to show nuclear migration defects during cell division. A recent experiment has confirmed that deletion mutants of KIP2, BIK1 and CTF18 also exhibit moderate to severe nuclear migration defects [24].
These three genes have not been detected by two way clustering [12].
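The two edge rules described above can be sketched in a few lines, given the motif member set produced for each seed. The function name `build_gimf_networks` and the dictionary input format are hypothetical, and the Type 2 rule is implemented under one reading of the text: a gene k qualifies only if it is not a node of the Type 1 network and is a motif member of both ends of a Type 1 edge.

```python
from itertools import combinations

def build_gimf_networks(motifs):
    """motifs: dict mapping each seed gene to the set of its motif members.

    Returns (type1_edges, type2_edges) as sets of frozenset gene pairs.
    """
    # Type 1: i and j are each other's motif members (bi-directional rule).
    type1 = {frozenset((i, j))
             for i, j in combinations(sorted(motifs), 2)
             if j in motifs[i] and i in motifs[j]}
    nodes = {g for edge in type1 for g in edge}
    # Type 2 (assumed reading): a gene k outside the Type 1 network that is a
    # motif member of both ends of a Type 1 edge is linked to each end.
    type2 = set()
    for edge in type1:
        i, j = tuple(edge)
        for k in (motifs[i] & motifs[j]) - nodes:
            type2.add(frozenset((i, k)))
            type2.add(frozenset((j, k)))
    return type1, type2
```

With toy input `{"A": {"A", "B", "C"}, "B": {"A", "B", "C"}, "C": {"C"}}`, genes A and B form a Type 1 edge, and C, which both of them list but which lists neither, is attached to each by a Type 2 edge.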
Under our hypothesis, genes with a similar synthetic interaction pattern (especially when the similarity is global) are likely to reside in the same pathway or map to proteins in the same complex. Thus the motif members are expected to have functional similarities at various levels. We evaluate the biological relevance of the Type 1 and Type 2 networks by computing three parameters for each edge (gene pair): the correlation with the Gene Ontology (GO) annotations (described in the Methods); the fraction of gene products that are within the same protein complex as determined by high-throughput mass spectrometry; and the fraction that are synthetic lethal. These parameters have also been computed for all directly synthetic lethal gene pairs. The Type 1 gene pairs' correlations for biological process, molecular function and cellular component GO annotations are (0.47, 0.20, 0.43), while those of the Type 2 network are (0.47, 0.15, 0.40), compared with (0.25, 0.05, 0.31) for directly synthetic lethal gene pairs (Table 3). Clearly, much tighter functional associations are obtained between gene pairs with either globally or locally similar synthetic lethal interactions than between gene pairs that are directly synthetic lethal, confirming the observation of between-pathway enrichment by Wong et al. and Kelley et al. [13,14]. Significantly more Type 1 gene pairs map to proteins within the same complex than either Type 2 gene pairs or directly synthetic lethal gene pairs. Same-complex membership may explain the higher molecular function correlation for Type 1 gene pairs.
Table 3. GO annotation correlations for GIMF Type 1, Type 2 gene pairs, and gene pairs that are directly synthetic lethal (SL).
Discussion
In this section, we explore a few important issues concerning the robustness and tuning of GIMF. Without loss of generality, the discussion is primarily based on learning pathway associations from the SGA dataset.
It is widely known that the EM algorithm often converges to local maxima of the likelihood or log-likelihood function [18,19]. In its application to motif discovery (e.g. of transcription factor binding sites) in DNA sequences, early versions of EM assumed the existence of a single motif and aimed to find the motif that globally optimized the likelihood function. However, when multiple consensus sequences are present in the dataset, numerous local maxima of the likelihood function may well correspond to biologically meaningful motifs. One approach to finding multiple motifs is to initialize EM from different starting points, typically selected from patterns occurring in the data, which may then relax to local maxima. This approach may be enhanced, as in the MEME algorithm, by erasing motifs previously found so that multiple motifs are found in decreasing order of likelihood. Using these two strategies, MEME successfully detects multiple promoter consensus sequences in the combined CRP/LexA datasets [18]. In GIMF, we achieve a similar effect by initializing the model with the seed gene's interactions, thus narrowing the search space to the module that includes the seed. Without any prior knowledge of the quality of seeds and their consensus interactions, two problems are noteworthy: i) motifs generated by different seeds may be redundant; ii) certain motifs may deviate from their seeds during the iterative process. These two issues are addressed below.
i) Motifs generated by different seeds may be redundant
To better understand the dissimilarity between distinct motifs, we have calculated the Euclidean distance between each pair of motifs [remainder of this passage not recovered].
ii) Certain motifs may deviate from their seeds during the iterative process
In some cases, the EM algorithm may eject a seed gene from a motif. This occurs for eight seed genes when p = 0.95 using the threshold zi > 0.9 (Table S2). Those seeds have few interactions, or have interactions that largely overlap with the interaction partners of some hub genes such as the PAC10 complex genes, or both. Indeed, most of their motif members are hub genes, whose interaction profiles override those of the seed genes during the iteration. To ensure that each seed stays in its motif, we can slightly modify the algorithm by fixing zseed = 1 during all iterations. In other words, the motif search is conditioned on the seed being part of the motif. Indeed, for the eight seeds mentioned above, this modification keeps the seed gene itself in the motif until convergence while all other motif members stay unchanged. Clearly, this procedure has no effect on the 106 seeds that already remain in their motifs without such conditioning. The symmetry imposed by Type 1 edges serves as a conservative filtering procedure that eliminates redundancy and the impact of hub genes dominating interaction profiles. It reveals gene networks with tight functional correlations, which supports our finding that the local optima in GIMF correspond to biologically relevant modules.
We have investigated how the choice of p, the initialization parameter that represents our confidence in the seed gene's interactions, affects the motifs. The sensitivity to p differs between motifs. We quantify the quality of a seed and its motif by the stability of its motif members across different choices of p. Genes with fewer than five interaction partners (12 out of 126) are not used as seeds. For every remaining query gene, we extract its motif members with p ranging from 0.6 to 0.95. The set of motif members extracted at p = 0.95 is used as the reference to compute a Jaccard coefficient [14].
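The stability measure just described compares the motif obtained at each p with the reference motif obtained at p = 0.95. A minimal sketch follows, assuming the usual set definition of the Jaccard coefficient (size of the intersection over size of the union). The helper names `jaccard` and `motif_stability`, and the convention that two empty sets score 1.0, are assumptions of this sketch.

```python
def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def motif_stability(members_by_p, reference_p=0.95):
    """Jaccard coefficient of each motif against the one obtained at reference_p.

    members_by_p: dict mapping an initialization parameter p to the set of
    motif members found for one seed at that p.
    """
    ref = members_by_p[reference_p]
    return {p: jaccard(members, ref) for p, members in members_by_p.items()}
```

A seed whose coefficients stay near 1 across the range of p yields the same members regardless of initialization; values that fall as p decreases indicate a motif that absorbs extra genes when the seed's interactions are trusted less.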
Denote the set of motif members for seed gene i obtained with initialization parameter p by [symbol and definition of the Jaccard coefficient not recovered].
To better evaluate the statistical significance of the motifs detected by GIMF, we have computed false positive rates on randomized datasets with the same degree distribution as the original synthetic lethal dataset. Randomization is done by a rewiring procedure as detailed in [21]. The fraction of links shared between a randomized network and the original network is around 15%. Since a random network should not contain any biologically relevant motif, any motif detected is a false positive. Thus, for the GIMF algorithm, we use every query gene as a seed, and any motif member returned other than the seed itself is considered a false positive. The numbers of false positives on 100 randomized networks are shown in Fig. S1. Without imposing the bi-directionality constraint, the average total number of false positives for 126 seed genes is 15. The average number of seeds that generate any false positives is 9.7 out of 126. On the real dataset, the number of seeds leading to motif detection is T = 82. This corresponds to a p-value of 10^-15, calculated as the tail probability at T = 82 of a Poisson distribution. A detailed look at the false positive pairs shows that most seeds that lead to false positives have very few interactions with the library genes. The top 10 seeds producing the most false positives have 6.3 interactions on average, and their false positive motif genes are mostly promiscuous hub genes. However, no false positives are observed when the promiscuous genes are used as seeds. When bi-directionality is imposed on motif detection, the number of false positives drops to 0 in all 100 trials. Thus, for an asymmetric metric like GIMF, we can impose a symmetry constraint to mask the effects of promiscuous genes. Additional information can be obtained by relaxing this constraint once the reliable gene pairs are identified.
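The text above relies on randomized networks that keep the degree distribution of the real data, generated by the rewiring procedure of [21]. The sketch below shows one common way of doing this, repeated edge swaps on the query-library edge list; whether it matches the cited procedure in every detail is an assumption, and the function name `rewire_bipartite` and its parameters are hypothetical.

```python
import random

def rewire_bipartite(edges, n_swaps, seed=0, max_tries_factor=100):
    """Degree-preserving randomization of query-library edges by edge swaps.

    edges: iterable of (query, library) pairs, at least two distinct edges.
    Each accepted swap replaces (q1, l1), (q2, l2) with (q1, l2), (q2, l1),
    so every query and every library gene keeps its degree.
    """
    rng = random.Random(seed)
    edge_list = list(set(edges))
    edge_set = set(edge_list)
    swaps, tries = 0, 0
    while swaps < n_swaps and tries < max_tries_factor * n_swaps:
        tries += 1
        a, b = rng.sample(range(len(edge_list)), 2)
        (q1, l1), (q2, l2) = edge_list[a], edge_list[b]
        new1, new2 = (q1, l2), (q2, l1)
        # Reject swaps that change nothing or would duplicate an existing edge.
        if q1 == q2 or l1 == l2 or new1 in edge_set or new2 in edge_set:
            continue
        # Reject swaps that would pair a gene with itself.
        if q1 == l2 or q2 == l1:
            continue
        edge_set -= {edge_list[a], edge_list[b]}
        edge_set |= {new1, new2}
        edge_list[a], edge_list[b] = new1, new2
        swaps += 1
    return edge_list
```

Running a motif finder with every query as seed on many such rewired edge lists, and counting the members returned beyond the seed itself, gives an empirical false positive count of the kind reported in the text.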
The treatment of hub genes is a problematic issue in the analysis of power-law networks. Hubs arise from many different sources, including intrinsic error in the experimental technique (such as sticky proteins in the yeast two-hybrid system) and experimental bias (such as the choice of query genes for SGA). Because of this heterogeneity, the treatment of promiscuous genes should be context-based. In the case of hubs induced by experimental error, a straightforward approach is to ignore all hub-associated links. This filtering method has been used to reduce dramatically the number of candidate pathways in the analysis of signal transduction networks [25]. However, the role of hub genes in the SGA data set is subtle. Genes in the PAC10 complex are hubs that have enriched synthetic lethal interactions with genes in many other complexes, such as CTF18 and PAC11. Many of the PAC10-associated links are biologically relevant, since PAC10 is indeed functionally coupled to a broad spectrum of biological pathways which are themselves functionally associated. Removing hub links entirely therefore leads to the loss of useful information and to failure to detect some relevant pathways. GIMF treats this problem by permitting an increase in the parameter pbj for hub library genes that are not part of the motif. We are thus able to extract biologically meaningful pathways while keeping the hub library genes, whose impact is automatically down-weighted. This idea is tested on the Dynein-Dynactin gene pairs. Using GIMF, we identified 24 Dynein-Dynactin pairs in the original SGA dataset. We then tested GIMF on five filtered datasets generated by removing interactions with the top 5, 10, 15, 20 and 25 hub library genes, respectively. The corresponding fractions of interactions eliminated are 4.4%, 7.8%, 10.9%, 13.6% and 16%. With model parameters unchanged, GIMF recovers (18, 15, 10, 7, 1) Dynein-Dynactin pairs on the five datasets, respectively.
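The filtered datasets in this test are obtained by deleting all interactions with the highest-degree library genes. A minimal sketch of that filtering step is given below, assuming the data are held as a 0/1 query-by-library matrix. The function name `remove_top_hubs` and the tie-breaking rule (lower column index first) are choices made for this illustration.

```python
import numpy as np

def remove_top_hubs(X, k):
    """Zero out the k library genes (columns) with the most interactions.

    X: (Q, N) 0/1 array of query-by-library interactions.
    Returns the filtered matrix and the fraction of interactions removed.
    """
    X = np.asarray(X)
    degree = X.sum(axis=0)
    # Stable sort on negated degree: ties keep the lower column index first.
    hubs = np.argsort(-degree, kind="stable")[:k]
    filtered = X.copy()
    filtered[:, hubs] = 0
    removed = 1.0 - filtered.sum() / X.sum()
    return filtered, removed
```

Looping k over 5, 10, 15, 20 and 25 and re-running the motif search on each filtered matrix would mirror the comparison reported above, with `removed` corresponding to the fraction of interactions eliminated.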
The reduced coverage is expected from the removal of some biologically relevant hub links. However, a substantial number of those pairs are retained when interactions with the top 5 and 10 hub library genes are absent. These results suggest that, in the presence of hubs, a statistical method that explicitly models the skewed degree distribution is a better strategy for pattern discovery than simple filtering combined with methods that do not take the hub effect into account.
The assumption in GIMF that the probability of an edge between a query gene and a library gene is proportional to the degree of the library gene works sufficiently well for the synthetic lethal interaction dataset. However, when extending the present model to other types of networks, especially those with non-directional links, it would be beneficial to characterize the link probability in a subgraph based on local connectivities [26]. In that model, the link probability between a pair of genes depends on the degrees of both genes. This allows each interaction to be considered in the context of its subgraph, and thus holds promise for extracting motifs in power-law networks by their local deviations from randomness [27]. It would be interesting to integrate such local models into our algorithm for motif extraction in other interaction networks, such as protein interaction networks.
Recently, Kelley et al. have integrated physical protein-protein interactions to dissect synthetic lethal gene pairs into between-pathway and within-pathway paradigms [14]. While the focus of our study is different from their work, GIMF has an interesting correspondence with their algorithm. The algorithm proposed by Kelley et al. to construct between-pathway or within-pathway models is essentially a local search procedure described by Sharan et al. [28]. Starting from a seed node, nodes whose contributions to the current seed are maximal are added one at a time.
The operation is repeated in a breadth-first search fashion so long as it increases the overall score of the subgraph. This is equivalent to maintaining sets of motif and non-motif nodes, each assigned with probability 1, and considering only the interactions between directly linked nodes during the iteration. In contrast, GIMF maintains for each node a probability of being in the motif set, thus allowing all nodes to contribute in each iteration of motif building. The assignment of a node to the motif or non-motif category is determined only when the probabilities converge. A similar breadth-first search procedure can also be applied to GIMF for automatically extracting gene pathways.
The between-pathway and within-pathway discovery by Kelley et al. [14] aligns with the conclusion by Tong et al. [12] that synthetic lethal interactions are more abundant between genes that have the same mutant phenotype and between genes encoding proteins within the same protein complex. This idea is illustrated in Fig. 6, where "interaction", "no-interaction" and "prohibited self-interaction" are represented by red, black and grey, respectively. The matrix shows partial interaction profiles for five query genes. Query genes A, B, C and query genes D, E belong to two different motifs. The within-pathway pattern shows the situation in which synthetic lethal interactions are more abundant between motif members than between genes belonging to different motifs. The between-pathway pattern shows two motifs that represent two complementary pathways, with synthetic lethal interactions enriched between the pathways and depleted within a pathway. To permit a quantitative discussion, we define a within-motif score (WMS) that characterizes whether synthetic lethal interactions of motif genes with each other are enriched (corresponding to the within-pathway pattern) or depleted (corresponding to the between-pathway pattern). Let WMSi represent the score for motif i, given by [equation not recovered]
where [definitions not recovered]. The total number of motif members is the denominator of Eq. 10, and the add-one pseudocounts in [remainder of sentence not recovered]. The WMS was computed for three sets of motifs: i) for seeds in the Type 1 network; ii) for all seeds from the query set; iii) for seeds in the 100 randomized datasets described earlier (Fig. 7). The distribution of WMS values for motifs in the actual network appears bimodal, with greater probability for motifs with between-pathway character (WMS < 0). The WMS distribution for the Type 1 network has significantly more between-pathway character than that of motifs discovered in random networks (one-sided, unequal variance t-test on WMS values, p-value = 1.4 × 10^-5). Motifs in the entire network also have significantly more between-pathway character, as judged by smaller WMS values, than motifs in the random networks (p-value = 6.6 × 10^-5). Motifs from the Type 1 network show marginal significance for negative WMS values (one-sided z-test, p-value = 0.055), whereas motifs from the random networks have significantly positive WMS values (p-value = 4.5 × 10^-6). In summary, these results demonstrate that synthetic lethal interactions leading to motifs have significant between-pathway character, particularly when compared with motifs detected in randomized networks.
Although the purpose of this study is to develop a probabilistic model for characterizing synthetic lethal interaction motifs and a pathway identification algorithm based on synthetic lethal interaction datasets, the model holds good potential as an integrative method that combines multiple sources of evidence. If the sources of evidence are independent, the new likelihood function is the product of the likelihood functions of the individual sources. When the sources of evidence are not independent, a Bayesian learning approach such as the framework developed by Jansen et al. [29] should be considered. A detailed discussion of the extension of GIMF into an integrative approach is, however, beyond the scope of this study.
Conclusion
A probabilistic model and an automated algorithm (GIMF) have been shown to be effective in unsupervised motif learning from genetic interaction data. Starting from a seed pattern of genetic interaction partners, the method iteratively identifies genes that share the pattern and characterizes the pattern with a probabilistic motif. Functional associations are inferred from motif membership, rather than from the existence of a direct genetic interaction linking two genes. Genes that belong to the same connected components in the Type 1 and Type 2 networks have well correlated GO annotations, and are more likely to share annotations than genes connected by direct synthetic lethal interactions. Synthetic lethal interactions tend to be depleted between genes within a motif, suggesting that synthetic lethal interactions occur primarily between-pathway rather than within-pathway. Desirable features of the proposed algorithm for analyzing genetic interaction data include strong 0/1 predictions for genes sharing a motif, its asymmetric property, and the ability to automatically down-weight the impact of promiscuous genes with large degrees.
We have shown that the asymmetry can be exploited to identify even tighter associations between genes and to mask the impact of promiscuous genes. Furthermore, we conjecture that this asymmetric property may be useful in discriminating genes that are exclusive to a single pathway from genes that are shared between multiple pathways. The probabilistic motifs naturally down-weight the importance of promiscuous genes with many interaction partners. When the roles of hubs are not purely due to experimental bias, biologically relevant information is more likely to be retained by modelling hubs probabilistically than by simple filtering. GIMF has an interesting correspondence with a log-odds score based approach. An important difference, however, is that GIMF performs a global search for the subgraph with the best cohesiveness around a seed. GIMF is computationally highly efficient. It is well suited to building motifs around a subset of genes of interest, with several choices of stringency.
Methods
Correlations for Gene Ontology (GO) annotation are computed for three categories: biological process, molecular function and cellular component (unpublished data, Ye et al). Within each category, the correlation coefficient is computed as follows. Find the deepest level in the GO hierarchy at which the pair of genes shares an annotation, which we denote by d. Find the maximum and minimum values of d among all query gene pairs (i, j), where i = 1,2,...,Q and j = 1,2,...,Q, and Q is the total number of query genes. The GO annotation correlation (biological process, molecular function and cellular component) for a pair of genes is defined by [equation not recovered].
Authors' contributions
YQ developed the GIMF model and carried out the data analysis. PY provided the code for the calculation of GO correlations and suggested the test of robustness against hub removal. JSB supervised the study.
Additional File 1. Supplementary methods, figures, and tables. Format: PDF. Size: 408KB.
Additional File 2. Complete list of motif members for the 126 query genes using initialization p = 0.95. Format: TXT. Size: 6KB.
Additional File 3. Tar file containing Matlab code, input, and output. Format: TAR. Size: 50KB.
Acknowledgements
JSB acknowledges funding from the Whitaker Foundation and the NIH. YQ acknowledges support from the Institute for Pure and Applied Mathematics for a relevant workshop, IBM for a Ph.D. fellowship, and Dr. Jianbo Gao for stimulating discussions.
Figure 1.
Figure 2.
Figure 3.
Figure 4.
Figure 5.
Figure 6.
Figure 7.