Biologists may need to know the set of genes that are semantically related to a given set of genes. For instance, a biologist may need to know the set of genes related to another set of genes known to be involved in a specific disease. Some works use the concept of gene clustering in order to identify semantically related genes. Others propose tools that return the set of genes that are semantically related to a given set of genes. Most of these gene similarity measures determine the semantic similarities among the genes based solely on the proximity to each other of the GO terms annotating the genes, while overlook the structural dependencies among these GO terms, which may lead to low recall and precision of results.
We propose in this paper a search engine called GRank, which overcomes the limitations of the current gene similarity measures outlined above as follows. It employs the concept of existence dependency to determine the structural dependencies among the GO terms annotating a given set of gene. After determining the set of genes that are semantically related to input genes, GRank would use microarray experiment to rank these genes based on their degree of relativity to the input genes. We evaluated GRank experimentally and compared it with a comparable gene prediction tool called DynGO, which retrieves the genes and gene products that are relatives of input genes. Results showed marked improvement.
The experimental results demonstrated that GRank overcomes the limitations of current gene similarity measures. We attribute this performance to GRank’s use of existence dependency concept for determining the semantic relationships among gene annotations. The recall and precision values for two benchmarking datasets showed that GRank outperforms DynGO tool, which does not employ the concept of existence dependency. The demo of GRank using 11000 KEGG yeast genes and a Gene Expression Omnibus (GEO) microarray file named “GSM34635.pad” is available at: http://ecesrvr.kustar.ac.ae:8080/ webcite (click on the link labelled Gene Ontology 2).
Biologists often need to know the set S′ of genes that are semantically related to a given set S of genes. Determining the set S′ helps in understanding gene-disease interactions and advanced disease diagnosis. For instance, biologists in the UAE are trying to determine the set of genes that are related to the genes involved in Type 2 Diabetes (T2D) (one out of five people in the UAE between the ages of 20 to 79 lives with T2D). Some works (e.g., DynGO ) propose tools that return the set of genes that are semantically related to a given set of genes. For instance, DynGO “retrieves genes and gene products that are relatives of input genes based on similar GO annotations, and displays the related genes and gene products in an association tree” [1,2]. Other works use the concept of gene clustering in order to identify semantically related genes [3-5]. However, most of these methods determine the semantic similarities among the genes based solely on the proximity to each other of the GO terms annotating the genes, while overlook the structural dependencies among these GO terms, which may lead to low recall and precision of results.
Most similarity measure approaches can be classified into three: edge-based, node-based, and Hybrid methods. Edge-based measures [6,7] rely on counting edges in the graph. In most of these measures, the shortest path length is used as a distance measure between two terms in a graph. Node-based measures [8-12] exploit the information content (IC) of two terms being compared and of their LCA. If this LCA has high information content, the two terms are considered to be semantically similar. Hybrid methods  combine edge-based and node-based methods. Edge-based measures assume that: (1) nodes and edges are uniformly distributed, and (2) edges at the same level in a hierarchy correspond to the same semantic distance between terms. These assumptions are not always true in biological ontologies. As for node-based measures, their limitations are: (1) they do not take into account the distance separating GO terms from their LCA , (2) they use IC as the major factor for determining the semantic similarity of GO terms, which is inappropriate, (3) some of them rely only on the number of common ancestor terms, while overlooking their semantic contributions to the two terms under consideration, and (4) many of these methods overlook the information contained in the structure of the ontology and concentrate only on the information content of a term derived from the corpus statistics. Most of these algorithms determine the semantic similarities on term by term basis and therefore ignore the context of a gene which consists of multiple terms. Thus, such method cannot be easily implemented to infer the functional similarities among different groups of genes.
Nagar et al. define the path length (PL) function between two GO terms as the minimum path length in the GO graph between the two terms. Chagoyen and Pazos  relate the functional coherence of GO terms GOi and GOj to the number of proteins in GOj that are functionally associated with proteins in GOi and use the cumulative hypergeometric distribution.
In the framework of GRank, the structure of GO is described in terms of a graph, which we call GO Graph. In this graph, GO terms are nodes and the relationships between the terms are edges. For example, Figure 1 presents a fragment of a GO Graph showing the ontological relationships of 29 GO terms. GRank accepts Keyword-based queries with the form Q (“g1”, “g2”, .., “gn”), where gi denotes a gene (or a gene product) keyword.
Figure 1. A fragment of GO Graph showing the ontological relationships of 29 GO terms.Blue edges denote “is-a” relations and red edges denote “part-of” relations.
User selects an input (a query, which is composed of genes annotated to GO). GRank would then map these genes to a set of GO terms. Let ST denote these GO terms. GRank finds the meaningful Lowest Common Ancestors (LCA) and the meaningful Top Common Descendants (TCD) of the set ST. A meaningful LCA is a LCA in GO graph, on which the existence of ST depends. A meaningful TCD is a TCD in GO graph, whose existence depends on the set ST. GRank would then rank the meaningful LCAs and TCDs, which are then converted back to genes based on annotations and retrieved back to the user. The genes annotated to the meaningful LCAs and the meaningful TCDs are the most semantically related to the user’s input genes. Figure 2 is an overview of our approach. It shows the sequential processing steps for answering a query. Because there are many abbreviations of concepts in the paper, we summarize them in Table 1.
Figure 2. A graphical representation of our approach showing the sequential processing steps for answering a query.
Table 1. Abbreviations of Concept (abb denotes abbreviation)
Constructing a graph based on part-of relations
Notation 1, Keyword Context (KC): A KC is a GO term annotated to a query gene product. For example, consider Figures1and3and the query Q(“JAG1”). The term “organ morphogenesis” (GO:0009887) is a KC because the gene “JAG1” is annotated to it.
Figure 3. POG constructed from the GO Graph in Figure1. Dotted lines denote inferred part-of relationships and solid lines denote explicit part-of relationships. Some relationships are omitted from the figure for the sake of figure clarity. Each term node is accompanied with the genes annotated to it.
Let SKC be a set of KCs annotating the user’s input genes. To construct the answer for this query, GRank needs to identify the meaningful LCAs and the meaningful TCDs of the set SKC based on the concept of existence dependency. Towards this, GRank will need to check all “part-of” relations in GO graph, because: “part of has a specific meaning in GO and a part of relation would only be added between A and B if B is necessarily part of A: wherever B exists, it is as part of A, and the presence of the B implies the presence of A” . “part-of relation embodies some aspects of existence dependency. A part-of relation with existence dependent parts can simply be replaced by existence dependency: in case of existence dependent components, the existence dependency relation is identical to the part of relation” .
Since not all “part-of” relations are explicitly expressed in GO Graph (some can be inferred from the graph), GRank converts the GO Graph into a graph called Part-Of Graph (POG), which contains only the explicit and inferred “part-of” relations. The LCAs/TCDs of KCs will be determined from the POG and not from the GO Graph. A POG is a GO Graph after: (1) removing all its relations except for the “part-of” ones, and (2) adding the inferred “part-of” relations. The terms A and B are connected by a “part-of” relation in the POG, if the GO Graph either states this relation expressly or it can be inferred from the graph using the following two inference rules described in : (1) if A “is-a” B and B is “part-of” C, A is “part-of” C, and (2) if A is “part-of” B and B “is-a” C, A is “part-of” C.
Figure 3 shows a fragment of POG derived from the GO Graph in Figure 1. For example, since in Figure 1: (1) the term multicellular organismal process (GO:0032501) “is-a” the term biological process (GO:0008150), (2) the term multicellular organismal development (GO:0007275) “is-a” the term multicellular organismal process (GO:0032501), and (3) the term system development (GO:0048731) is “part-of” the term multicellular organismal development (GO:0007275), then in Figure 3 the term system development (GO:0048731) is “part-of” the term biological process (GO:0008150). In Figure 3, each term node is accompanied with the genes annotated to it.
Determining the depths of terms
We observe that in order for a LCA/TCD of KCs to be meaningful, the terms located in each path from the LCA/TCD to a KC in POG should have unique depths based on their “is-a” relations in GO graph. “is-a” is a simple type-subtype relation between two GO terms . Consider that: (1) A′ “is-a” A, (2) A “is-a” C, (3) B′ “is-a” B, and (4) B “is-a” C. Both of the terms A and B inherit the characteristics and properties of their supertype C. Therefore, A and B share common characteristics. Since A′ and B′ inherit from the characteristics and properties of terms that have the same depth (the terms A and B), A′ and B′ share also common characteristics. That is, if two terms have the same “is-a” depth, they share common characteristics. On the other hand, consider that term A “part-of” term C and term B “part-of” term C. We may not be able to infer common characteristics between A and B. Therefore, we use “is-a” relation and not “part-of” (or other relations) for computing the depth of terms, because the depth (specificity) of a GO term t based on its “is-a” relations influences the semantic relationships of t with the other terms that have no hierarchical relationships with t. Thus, the depth of a term node is the number of “is-a” relations that connect it with the root term node (its “is-a” distance to the root).
For example, recall Figure 1. The root term biological process (GO:0008150) has its own depth. Since both of the terms multicellular organismal process (GO:0032501) and developmental process (GO:0032502) inherit the same characteristics from their supertype GO:0008150, they both have the same depth. Alternatively, we can determine that these terms have the same depth, because they have the same distance to the root based on their is-a relations. As another example, the terms kidney development (GO:0001822), system development (GO:0048731), multicellular organismal development (GO:0007275), and anatomical structure morphogenesis (GO:0009653) have the same “is-a” depth and also common characteristics.
In the POG in Figure 4, each set of terms that have the same depth are colored with the same color for easy reference. For example, the terms kidney development (GO:0001822), system development (GO:0048731), and anatomical structure morphogenesis (GO:0009653) are colored with the same color as an indicative that they have the same depth. We note that the depths computed may not reveal equal conceptual depths for terms located in different subtrees in GO graph. For instance, in the example presented above, the terms kidney development and system development have the same depth even though the term kidney development is more specific than the term system development. This is because the two terms are located in two different subtrees in GO graph. If a term has multiple “is-a” inheritances, only its longest “is-a” distance to the root is considered.
We constructed an algorithm called AssignDepth (see Figure 5) that determines the depths of GO terms. It employs breadth-first search techniques. The input to the algorithm is GO Graph G = (V, E), where V is the set of term and E is the set of edges representing the relations between the terms. G is represented by its adjacency list. The algorithm assigns an alphabetical letter to each hierarchical level in the graph based on “is-a” relations to denote the depth of the terms in this level. It starts at the root s, which is at level 0. In the first stage, it visits all the terms that are at the distance of one “is-a” edge away. Then, it visits all the new terms that can be reached at the distance of two “is-a” edges away from root term s. This process continues until every term has been visited. To keep track of progress, the algorithm colors each node either white or gray. A node is discovered the first time it is encountered during the search, at which time it becomes GRAY. The color of each node u ∈V is stored in variable color[u]. The algorithm uses a queue Q to manage the set of gray nodes. In line 11, function getRelation(u, v) returns the relation between two input terms u and v. Each term v is assigned an alphabetical letter that depicts its depth, which is stored in variable d[v].
Figure 5. Algorithm AssignDepth.
Determining the meaningful lowest common ancestors and the meaningful top common descendants of KCs
We observe that in order for a Lowest Common Ancestor (LCA) to be meaningful and in order for a Top Common Descendant (TCD) to be meaningful: (1) the LCA/TCD in POG should have a different depth than the KC, and (2) the path from the LCA/TCD to the KC in the POG should not include two or more terms with the same depth.
For example, consider Figure 4 and the query Q(“Gga.4082”, “LHX1”). The KC annotating the gene “Gga.4082” is nephron epithelium morphogenesis (GO:0072088) and the KCs annotating the gene “LHX1” are nephron development (GO:0072006), and renal vesicle morphogenesis (GO:0072077). One of the LCAs of the two KCs GO:0072088 and GO:0072077 is the term nephron epithelium development (GO:0072009). However, GO:0072009 is not a meaningful LCA for GO:0072077 (i.e., GO:0072009∉ SRGO:0072077), because its depth is the same as that of GO:0072077 (recall Figure 4). Based on these observations, we now introduce proposition 1.
Proposition 1: Meaningful LCA/TCD; A LCA/TCD is considered meaningful, if: (1) the depth of the LCA/TCD is different than the depths of the KCs, and (2) the path in the POG from the LCA/TCD to each of the KCs does not include two or more terms with the same depth.
Notation 2, DPTx; DPTxdenotes the depth of GO term x.
We prove observation/proposition 1 heuristically as follows. First, we prove: if a LCA/TCDSRKC, then DPTLCA/TCD≠ DPTKC. That is, in order for a LCA/TCD to be meaningful, its depth should be different than the depth of the KC. We are going to validate this observation by checking whether it conforms to the structural characteristics of existence dependency. The concept of existence dependency was first proposed for Entity-Relationship modelling . An object x is existence-dependent on an object y if the existence of x is dependent on the existence of y. The existence dependency concept and the SRKC concept have correspondences: both denote that an object(s) has a strong association with another object. SRKC is a set of GO terms, whose existence in POG is dependent on the existence of the KC (or conversely, the existence of the KC in the graph is dependent on the existence of the set of terms). Snoeck et al.  argue that the existence dependency relation is a partial ordering of object types (i.e., depths). The authors transform an OO schema into a graph consisting of the object types found in the schema and their relations. The object types in the graph are related only through associations that express existence dependency. The authors demonstrated through the graph that an object type is never existence-dependent on itself. That is, if the two objects Oi and Oj belong to the same type, Oi cannot be dependent on Oj and vice versa. This finding is in agreement with our proposed rule, when we view: (1) a GO term in GO Graph as an object, and (2) the GO term’s depth as the type of the object. Thus, if a LCA/TCD has the same depth as the KC, the LCA/TCD can never be existence-dependent on the KC (and vice versa); therefore, this LCA/TCD is meaningless and the genes annotated to it may not be semantically related to the genes annotated to the KCs.
Second, we prove: If a LCA/TCD is semantically related to the KC, then DPTTxDPTTywhere Txand Tyare term nodes located between the LCA/TCD and the KC in POG. We can verify this rule as follows. In order for LCA/TCD ∈SRKC, all term nodes located between the LCA/TCD and the KC in the POG have to be related to the KC. Let: (1) term Ty∈SRKC, (2) Ty be a descendant of the KC, and (3) term Tx be a descendant of Ty. In order for Tx to be semantically related to the KC, intuitively Tx has to be semantically related Ty, because Ty relates (connects) Tx with the KC. If Tx and Ty have the same depth, then Tx (according to the first rule). Therefore, in order for Tx to be semantically related to the KC, DPTTxDPTTy.
Example 1: Consider Figure 4 and the query Q(“LHX1”, “Gga,4082”). As shown in Figure 4: (1) the KCs annotating the gene “LHX1” are nephron development (GO:0072006) and renal vesicle morphogenesis (GO:0072077), and (2) the KC annotating the gene “Gga.4082” is nephron epithelium morphogenesis (GO:0072088). As shown in Figure 6, one of the LCA of the two KCs is the term organ development (GO:0048513). However, by applying proposition 1, this LCA is meaningless, because the path in the POG from the LCA to the KC (GO:0072088) includes two terms with the same depth (i.e., the terms GO:0072073 and GO:0072088). Also, as shown in Figure 7, the term nephron epithelium development (GO:0072009) is a meaningless LCA of the two KCs, because this LCA and the KC (GO:0072077) have the same depth. Therefore, the genes annotated to these two meaningless LCAs (i.e., the genes “Ci-FoxI-c”, “FKH-4”, “fkh-5”, “TFAP2B”, “JAG1”, and “PANDA_003456”) will not be returned as the answer for the query Q(“Gga.4082”, “LHX1”). By applying proposition 1, the term metanephric renal vesicle morphogenesis (GO:0072283) is a meaningful TCD of the KCs (see Figures 6 and 7). Therefore, the gene annotated to GO:0072283 (i.e., the gene Ssc.22980) will be returned to the user as the answer for the query.
Figure 6. GO:0048513 is a meaningless LCA of GO:0072006 and GO:0072088. GO:0072283 is a meaningful TCD of GO:0072006 and GO:0072088.
Figure 7. GO:0072009 is a meaningless LCA of GO:0072088 and GO:0072077. GO:0072283 is a meaningful TCD of GO:0072088 and GO:0072077.
Example 2: Consider Figure 4 and the query Q(“JAG1”, “LHX1”). By applying proposition 1 and as demonstrated by Figure 8, the term organ development (GO:0048513) is a meaningful LCA for the KCs nephron xdevelopment (GO:0072006) and nephron epithelium development (GO:0072009). Therefore, the genes annotated to GO:0048513 (i.e., the genes “Ci-FoxI-c”, “FKH-4”, and “fkh-5”) are semantically related to both of the input genes “LHX1” and “JAG1”. Therefore, these genes will be returned as the answer for the query. By applying proposition 1, the term metanephric renal vesicle morphogenesis (GO:0072283) is a meaningless TCD for the KCs, because the path from GO:0072283 to the KC (GO:0072009) includes two terms with the same depth (i.e., the terms GO:0072077 and GO:0072009). Therefore, the gene annotated to GO:0072283 (i.e., the gene Ssc.22980) will not be returned to the user as an answer for the query.
Figure 8. GO:0048513 is a meaningful LCA of GO:0072006 and GO:GO:0072009. GO:0072283 is a meaningless TCD of GO:0072006 and GO:GO:0072009.
We constructed an algorithm called DetermineMLCA (see Figure 9) that checks whether an input LCA of KCs is meaningful. The input to the algorithm is a set of KCs and their LCA. For each KC in the set, lines 4–9 examine recursively each term T′ located between the KC and the LCA to check if its depth is different than the depth of the LCA and also different than the depths of all terms located between T′ and the KC. If the depths of terms between each KC and the LCA are distinct, line 14 will report that the LCA is a meaningful LCA.
Figure 9. Algorithm DetermineMLCA.
We also constructed an algorithm called DetermineMTCD that checks whether an input TCD of KCs is meaningful. The algorithm is similar to algorithm DetermineMLCA in Figure 9, except that it examines the descendants of KCs rather than their ancestors.
Ranking meaningful LCAs and meaningful TCDs
Each gene keyword of a query can be annotated to more than one GO term. Therefore, there may be more than one meaningful LCA/TCD for the query’s KCs. The degree of relativity of these LCAs/TCDs to the KCs may vary. Accordingly, the relativity of the genes described by these LCAs/TCDs to the input genes may vary. Therefore, we need a mechanism that ranks these LCAs/TCDs. We propose below techniques for ranking the LCAs/TCDs.
Ranking with respect to one input gene keyword
Let LCAi denote that GO term i is a LCA for a query’s KCs; let R(LCAi)denote the rank ofLCAi. Let KC1denote the KC annotating input gene g1. Intuitively, the rank of LCAi (i.e., R(LCAi)) with respect to KC1 is R(KC1) scaled down appropriately to account for the specificity of LCAi. We scale down the rank R(KC1) by a factor decayj−1 for each hierarchical level located between KC1 and LCAi in GO Graph. decay is a parameter that can be set to a value in the range 0 to 1 (in our experiments we set decay to 0.7). “j” is the number of hierarchical levels between KC1 and LCAi. Let R(LCAi, KC1) denote the rank of LCAi with respect to KC1. R(LCAi, KC1) is computed as follows:
In the discussion above and in equation 1, we implicitly assumed that there is only one GO term annotating the gene g1. In case g1 is annotated to n GO terms, we compute the rank of each term using equation 1. Let the computed ranks be r1, r2, …, rn. The combined rank is:
Where f is some aggregation function. We set f = max by default, but other choices (such as f = sum) are also possible.
Equations 1 and 2 show the rank of LCAi with respect to only one gene keyword. In case there are m gene keywords in the query (i.e., Q(“g1”, “g2”, …, “gm” ), the overall ranking is the rank of LCAi with respect to the m KCs annotating these genes. That is, the overall ranking of LCAi for query Q(“g1”, “g2”, …, “gm” ) is computed as shown in equation 3.
The overall ranking is the sum of the ranks with respect to each gene keyword in the query, multiplied by a measure of keyword proximity: prox(LCAi, (g1, g2. …, gm)). The keyword proximity function prox(LCAi, (g1, g2, …, gm)) can be any function that ranges from 0 (gene keywords are annotated to terms that are very far apart in GO Graph) to 1 (gene keywords are annotated to terms that occur right next to each other in GO Graph).
By default, we set our proximity function to be inversely proportional to the size of the smallest text window in LCAi containing the occurrences of all the input gene keywords g1, g2, …, gm. The smallest text window is the rectangle in GO graph that contains the relevant occurrences of all input gene keywords. The height of the window/rectangle is the number of hierarchical levels between the highest and lowest GO terms annotating the input gene keywords. The width of the window/rectangle is the number of GO terms between the right-most and left-most terms annotating the input gene keywords.
The algorithm for computing PageRank of HTML documents employs the following formula:
•p(v): the PageRank of a document v
•Nd: the total number of documents.
•Nh(v): the number of out-going hyperlinks from document v.
•d is a parameter usually set to 0.85.
•HE: hyperlink edge.
As the formula above shows, PageRank is a sum of two probabilities. The first is (1-d)/Nd, which is the probability of visiting document v at random. The second is the probability of visiting document v by navigating through other documents, which is calculated by the sum of the normalized PageRanks of all documents that connect to v by hyperlinks, multiplied the probability of navigation d.
We are going to adapt the PageRank formula to GO Graphs by mapping GO terms to documents and hyperlink edges to edges connecting the GO terms. As there is a tight coupling between each two Web pages connected by a hyperlink, there is a tight coupling between each two GO term nodes connected by a containment edge. If term node u is important (i.e., has high score), it is likely that its children and parents are important too. If the children and parents of u are important (i.e., have high scores), it is likely that u is important too. By performing mapping, adaptation, and adjustments to the PageRank formula, we constructed the following formula for computing R(KCj):
•v: The parent(s) of KCj
•p1: A discretionary parameter that denotes the probability of navigating from KCjthrough an edge to term v, because the gene the user is looking for is better described by v. In our experiments we set p1 to 0.4.
•p2: A discretionary parameter that denotes the probability of navigating from term v through an edge to KCj, because the gene the user is looking for is better described by KCj. In our experiments we set p2 to 0.6.
•Ni(v): The number of edges entering node v.
•S(v) and S(KCj): The scores of terms v and KCj respectively, and are computed as follows. Let: (1) “a” be the set of genes in microarray, (2) “b” be the set of genes annotated to term v (or to KCj), and (3) “c” be the number of significant genes in microarray. The score S(v) is the probability that the number of significant genes annotated to term v is exactly “k” out of the “c” significant genesa, and it is given by the following Fisher’s exact test :
To compute S(v), GRank provides reference sets of microarrays. Example 3, provides an example illustrating the calculation of S(v).
Example 3: Consider the microarray titled “GeneChip Human Genome U133A 2.0 Array”, which represents 14500 unique human genes and 417 significant unique genes. Let us compute the probability that the number of significant genes annotated to the term anatomical structure development (GO:0048856) is exactly 300 out of the 417 significant genes of the microarray. That is, we want to compute the score S(GO:0048856). There are 10153 unique genes annotated to GO:0048856. Therefore, the probability is computed as shown in equation 7:
Results and discussion
We experimentally evaluated the quality of GRank and compared it with DynGO . DynGO “retrieves genes and gene products that are relatives of input genes based on similar GO annotations, and displays the related genes and gene products in an association tree” [1,2]. DynGO “can support heavier computations and supports semantic retrieval of both similar terms and gene products” . We implemented GRank in Java, run on Intel(R) Core(TM)2 Duo CPU processor, with a CPU of 2.6 GHz and 4 GB of RAM, under Windows 7.
Pathways are sets of genes shown to have high functional similarity and can be used to validate similarity measures [7,14,21]. A fully described pathway represents the dynamics and dependencies among a set of gene/gene products. Therefore, we used in our experiments pathways as a reference for evaluating and comparing the similarity measures/relationships of GRank and of DynGO . Given a set S of genes, the methods should return another set S′ of genes that are semantically related to S. In order for sets S and S′ to be related, S and S′ should be part of a same pathway.
We used for the evaluation two different benchmarks: KEGG and Pfam benchmarks. We selected a set of 15 human and 15 yeast diverse KEGG pathways (see Tables 2 and 3); the genes were retrieved using the DBGET database . We selected 15 groups of highly related Pfam entries (see Table 4) from the Sanger Pfam database . The percentage of non-IEA annotations is different in the yeast and human. It is about 70% for the yeast annotations compared to about 60% for the human annotations. That is, there is a higher contribution of non-IEA annotation in yeast than in human.
Table 2. The 15 KEGG Human Pathways used in the experiments
Table 3. The 15 KEGG Yeast Pathways used in the experiments
Table 4. The15 Pfam Human Clans and the 15 Pfam Yeast Clans used in the experiments
For each group, we retrieved the corresponding human and yeast gene identifiers from the Uniprot database . Assuming that genes belonging to the same KEGG pathway are often related to a similar biological process (BP), the similarity values calculated for this dataset should be related to the BP aspect. Also, genes sharing common domains in a Pfam clan often have a similar molecular function (MF), the similarity values calculated for this second dataset should be related to the MF aspect.
Evaluating recall and precision
We measured the recall (or true positive rate) and precision of GRank and of DynGO . Let: (1) GP be all genes in a pathway and n be the number of these genes, and (2) GM be the m genes retrieved by one of the methods as semantically related to the input gene keywords:
Figures 10 and 11 show the overall average recall and precision respectively for DynGO and GRank using the 857 genes of the two benchmarks. Figure 12 shows the recall and precision results obtained with the KEGG pathways. Figure 13 shows the recall and precision results obtained with the pfam clans. For each KEGG and pfam pathway/clan (x-axis), the recall and precision values are represented as histograms (y-axis).
Figure 10. Overall average recall for DynGO and GRank.
Figure 11. Overall average precision for DynGO and GRank.
Figure 12. Recall and precision of DynGO and GRank obtained with the 15 human and 15 yeast diverse KEGG pathways.
Figure 13. Recall and precision of DynGO and GRank obtained with the 15 human and 15 yeast diverse Pfam clans.
As the figures show, recall and precision values vary based on: (1) pathways, and (2) the accuracy of each of the two methods to capture the semantic similarities and relationships among gene annotations within pathways. We can conclude that the recall and precision values for the two benchmarking datasets showed that GRank significantly outperforms DynGO. The results reveal the robustness of the GRank’s method and its ability to reflect the semantic relationships among gene annotations.
Table 5 shows the proportion of the meaningful LCAs versus the meaningful TCDs in the different sets of the benchmarks, where the significant answer genes are annotated to.
Table 5. The proportion of the meaningful LCAs versus the meaningful TCDs in the different sets of the benchmarks
To further analyze the behavior of the two methods, we classified GO Graphs into six criteria based on the relative position of a KC in the graph and on the type of relations that connect the KC with other term nodes (the classification is shown in Table 6). We analyzed the behavior of GRank and of DynGO in terms of their recall and precision under each of the six criteria. We computed the average recall and precision of the two methods under each of the six criteria. The results are shown in Table 7.
Table 6. Classification of GO Graphs (C# denotes criterion number)
Table 7. Average Recall and Precision of GRank and DynGO under the Six Criteria
As Table 7 shows:
(1) GRank outperforms DynGO under criteria 2–6.
(2) GRank does not perform well under criterion 1. More research work needs to be conducted to overcome the shortcoming of this criterion.
(3) GRank achieved the same recall values and the same precision values under criteria 4, 5, and 6, which is an indicative that the locations of KCs in GO Graph are irrelevant to the performance of GRank and that its performance does not vary with the height of a GO Graph.
GRank performance under criteria 2–6 is due to: (1) its consideration to the structural dependencies among annotation terms and to its term-depth consideration, and (2) the fact that each of these criteria requires a method to account for the structural dependencies among annotation terms. The extent of the importance of the structural dependencies among annotation terms to a criterion differs from criterion to another: it is more important to criterion 6 than to the other five criteria, which explains the substantial performance of GRank over DynGO methods under criterion 6.
Evaluating the impact of disregarding the ranking of meaningful LCA and TCD
We modified a copy of GRank by removing its capability to rank meaningful LCAs and TCDs. Our objective is to study the impact of overlooking the ranking of meaningful LCAs and TCDs on the search quality of GRank. We aim at: (1) measuring the decline in GRank’s recall and precision as a result of disregarding ranking, and (2) comparing the modified copy’s recall and precision with DynGO. If the modified copy outperforms the DynGO method, this performance would be attributed to only GRank’s computation of meaningful LCAs and TCDs. Figures 14 and 15 show the overall average recall and precision respectively of the modified copy using the 857 genes of the two benchmarks (for ease of comparison, we show also in the figure the average recall and precision of the original version of GRank and of DynGO).
Figure 14. Average recall of the modified copy of GRank and of the original GRank and DynGO.
Figure 15. Average precision of the modified copy of GRank and of the original GRank and DynGO.
As Figure 14 shows, the overall average recall has not changed, which is expected, since ranking does not have an impact on recall; rather, it has an impact on precision. As the Figure 15 shows, the overall average precision of the modified copy is:
(1) Less than that of the original copy by 26%.
(2) Higher than that of DynGO.
We can conclude that the concept of meaningful LCA and TCD has significant impact on GRank’s search quality, while ranking has important but not significant impact on GRank’s search quality.
Evaluating an alternative approach to overcoming the shortcoming of criterion 1
As discussed previously that GRank does not perform well under query criterion 1 (recall Tables 6 and 7) and that more research work needs to be done to overcome this limitation. However, our objective in this test is to evaluate a counter-technique, where a complete GO Graph is used instead of POG for computing meaningful LCAs and TCDs. Therefore, terms that connect with other terms via only is-a relations will also be considered in the computation of meaningful LCAs and TCDs. Towards this, we modified a copy of GRank so that it processes GO Graph instead of POG for computing meaningful LCAs and TCDs. Figures 16 and 17 show the average recall and precision of the modified copy using the 857 genes. As Figure 16 shows, the average recall of the modified copy is less than that of the original version by 15%, but is still higher than that of DynGO. However, as Figure 17 shows, the average precision of the modified copy is less than that of the original copy by 33%, and is slightly less than DynGO. We conclude that the concept of POG has a significant impact on GRank’s search quality and it compensates for the shortcoming of criterion 1.
In this paper, we proposed a search engine called GRank that determines the semantic relationships among genes and gene products. GRank overcomes the limitations of current gene similarity measures by using the concept of existence dependency to determine the semantic relationships among gene annotations. It determines the structural dependencies among the GO terms annotating a given set S of genes in order to identify the set of other genes that are semantically related to the set S. Towards this, the framework of GRank refines the concept of LCA and TCD by introducing the concept of meaningful LCA and meaningful TCD. Given a set of genes g1, g2, …gn, GRank identifies the meaningful LCA and the meaningful TCD of the terms annotating g1, g2, …gn. The genes annotated to the meaningful LCA and the meaningful TCD have the closest semantic relationships with g1, g2, …gn. GRank ranks the meaningful LCAs and the meaningful TCDs based on their semantic relationships with KCs. We experimentally evaluated the quality of GRank and compared it with DynGO  using KEGG and Pfam benchmarks. In summary, the recall and precision values for the two benchmarking datasets showed that GRank outperforms DynGO. The experiments showed that GRank does not perform well if all KCs in GO Graph are connected to other term nodes with only “is-a” relations. We will investigate techniques in a future work that overcome this limitation.
aThe same thing applies to the score S(KCj).
The authors declare that they have no competing interests.
KT conceived, designed, and supervised the research. KT drafted and revised this manuscript. KT, HAM, DM, and ZAM performed the implementation of the research project. KT, HAM, DM, and ZAM carried out the experiments and the analysis of the results. All authors read and approved the final manuscript.
The authors would like to thank Khalifa University Internal Research Fund (KUIRF) program for supporting this research.
Xiao X, Dow ER, Eberhart R, Miled ZB, Oppelt RJ: Gene Clustering Using Self-Organizing Maps and Particle Swarm Optimization. In Proc. of the 17th International Symposium on Parallel and Distributed Processing (PDPS’03). Washington DC: IEEE Computer Society; 2003.
Wang H, et al.: Gene Expression Correlation and Gene Ontology-Based Similarity: An Assessment of Quantitative Relationships. CA, USA: IEEE Symposium on Computational Intelligence in Bioinformatics; La Jolla; 2004.
Lin D: An information-theoretic definition of similarity, Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy; In Proc. 15th Int’l Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann; 1998:pp. 296-304.
Nagar A, Al-Mubaid H: A New Path Length Measure Based on GO for Gene Similarity with Evaluation using SGD Pathways; In Proc. 21th IEEE International Symposium on Computer-Based Medical Systems (CBMS 08). Finland: University of Jyväskylä; 2008.
Gene Ontology. 2011.
Innovative Internet Community Systems, LNCS 2003, 2877:251-262. Publisher Full Text
Comput Networks Isdn 1998, 30:107-117. Publisher Full Text
Sanger Pfam database.