Bioinformatics Research Centre, University of Glasgow, Glasgow, UK

Department of Mathematics, University of Strathclyde, Glasgow, UK

Institute of Biomedical and Life Sciences, Glasgow, UK

Abstract

Background

Interpretation of simple microarray experiments is usually based on the fold-change of gene expression between a reference and a "treated" sample where the treatment can be of many types from drug exposure to genetic variation. Interpretation of the results usually combines lists of differentially expressed genes with previous knowledge about their biological function. Here we evaluate a method – based on the PageRank algorithm employed by the popular search engine Google – that tries to automate some of this procedure to generate prioritized gene lists by exploiting biological background information.

Results

GeneRank is an intuitive modification of PageRank that maintains many of its mathematical properties. It combines gene expression information with a network structure derived from gene annotations (gene ontologies) or expression profile correlations. Using both simulated and real data we find that the algorithm offers an improved ranking of genes compared to pure expression change rankings.

Conclusion

Our modification of the PageRank algorithm provides an alternative method of evaluating microarray experimental results which combines prior knowledge about the underlying network. GeneRank offers an improvement compared to assessing the importance of a gene based on its experimentally observed fold-change alone and may be used as a basis for further analytical developments.

Background

Since its launch in 1998, the Google search engine has all but monopolized page searches on the world-wide web

Just as the original PageRank is stable against the artificial inflation of a web page's rank by web designers, we hope that GeneRank may obtain a more robust ranking of genes in (typically very noisy

Data sharing techniques have been successfully implemented previously using, for example, GO annotations

Our aim in this work is to use connectivity data to produce a prioritization of the genes in a microarray experiment that is less susceptible to variation caused by experimental noise than one based on expression levels alone. This is achieved using GeneRank, a customised version of the PageRank algorithm.

Results and discussion

The algorithm

The algorithm on which we base our method of microarray experiment analysis was originally devised for assessing the importance of web pages in search engine results. We show here that its formulation allows for a simple and intuitive extension for our application. The PageRank algorithm, used by the successful search engine Google

**The Vote of Confidence Principle.** Just as the PageRank of a web page will be high if it is linked to by other highly ranked pages, we hope that the relative ranking of a gene will be increased if it is linked to other highly differentially expressed genes.

The original PageRank algorithm also has a random walk interpretation where the ranks correspond to the invariant measure of a teleporting random walk on the web. This is equivalent to saying that the rank of a web page is proportional to the time spent at the web page whilst surfing the web. This idea can also be intuitively extended to ranking genes, where the rank of a gene is proportional to the amount of time a biologist should spend looking at a gene whilst analysing the experimental results.

As with the original algorithm, we require a network or graph to allow us to calculate a rank for each entity in the network. With the original algorithm, nodes represent web pages and a link exists between two nodes if one page contains a hyperlink to the other. This results in a directed graph. In our case, we define an undirected graph where a node represents a gene and the edges can be defined by some other "previous knowledge". For our purpose, we use either Gene Ontology annotations

For each gene, the expression level vector contains the value for its expression change in the experiment under consideration. The algorithm, GeneRank, also uses a free parameter,

Data

In addition to the gene expression data, GeneRank uses a network or graph as input. We use the absolute value (the algorithm requires positive expression change values) of the gene expression data as a weight for each node in the graph and define the network connectivity by some other criterion. We use either Gene Ontology annotations or correlation coefficients, but there are many other possibilities, e.g., metabolic networks, transcription factor networks, or protein-protein interactions. We also used synthetic networks with controlled topological features for evaluation purposes. The three types of network were constructed as follows.

GO networks

Genes are connected if they share an annotation defined by the Gene Ontology. This defines three networks, one for each of the GO sections: Biological Process, Cellular Component and Molecular Function. We do not use the directed acyclic graph associated with the Gene Ontology, but assign leaf nodes as the annotations for each gene. A yeast diauxic shift experiment

Network Parameters

| Network | | C |
| --- | --- | --- |
| Synthetic Network 1 | 40 | 0.0918 |
| Synthetic Network 2 | 28 | 0.1034 |
| Synthetic Network 3 | 40 | 0.0804 |
| Biological Process | 39 | 0.8636 |
| Cellular Component | 44 | 0.9461 |
| Molecular Function | 47 | 0.9444 |
| Correlation Coefficient | 155 | 0.5326 |

Correlation coefficient networks

A yeast stress data set consisting of 156 microarray experiments under a wide range of stress conditions was used to construct these networks. This data set is discussed in

Synthetic networks

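To allow control over the network structure, synthetic networks were defined with 1000 genes. The genes were split into two sets, $A$ and $B$, and edges were assigned at random with probabilities $p_A$, $p_{AB}$ and $p_B$, where these are the probabilities of two set $A$ genes, a set $A$ gene and a set $B$ gene, and two set $B$ genes being connected, respectively. The network structure is controlled through $p_A$, $p_{AB}$ and $p_B$. Representative examples are listed in Table

Synthetic network 1 is the standard case with equal expected degree across both sets. The expected degrees of a gene in set $A$ and of a gene in set $B$ are $E_{deg}(A) = (|A| - 1)p_A + |B|p_{AB}$ and $E_{deg}(B) = |A|p_{AB} + (|B| - 1)p_B$.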
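The construction of a two-set random network can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names `synthetic_network` and `expected_degrees`, the parameter names `p_a`, `p_ab` and `p_b`, and the convention that the first `n_a` genes form set A are all assumptions made for the example.

```python
import random


def synthetic_network(n_a, n_b, p_a, p_ab, p_b, seed=0):
    """Return a symmetric 0/1 adjacency matrix for n_a + n_b genes.

    Genes 0..n_a-1 are taken to be set A and the rest set B (an
    assumption of this sketch). Each unordered pair is connected
    independently with probability p_a (both in A), p_ab (one in
    each set) or p_b (both in B).
    """
    rng = random.Random(seed)
    n = n_a + n_b
    w = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            i_in_a, j_in_a = i < n_a, j < n_a
            if i_in_a and j_in_a:
                p = p_a
            elif i_in_a or j_in_a:
                p = p_ab
            else:
                p = p_b
            if rng.random() < p:
                w[i][j] = w[j][i] = 1  # undirected edge, no self-loops
    return w


def expected_degrees(n_a, n_b, p_a, p_ab, p_b):
    """Expected degree of a gene in set A and of a gene in set B."""
    e_a = (n_a - 1) * p_a + n_b * p_ab
    e_b = n_a * p_ab + (n_b - 1) * p_b
    return e_a, e_b
```

Choosing the three probabilities so that `expected_degrees` returns equal values for both sets gives a network with equal expected degree across sets; unequal values give networks in which one set is better connected than the other.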

To justify drawing the expression levels for sets

**Estimating μ**. To justify drawing the expression of the set

Testing the algorithm

Synthetic networks

Since we are trying to improve the ranking of genes produced in a microarray experiment, we need to quantify the quality of the ranking produced by the algorithm. In the case of synthetic data, we know that all genes in set

The construction of synthetic networks allows us to obtain full control over the network structure. We experimented with various network parameters.
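When the truly changed genes are known, as is the case for synthetic data, a ranking can be scored by the area under the ROC curve (AUC). The following is a minimal sketch, assuming that the AUC is computed from the ranking scores with the known changed genes treated as positives; the function name `auc` and the half credit given to tied scores are illustrative choices and are not taken from the original implementation.

```python
def auc(scores, is_positive):
    """Area under the ROC curve for a ranking by decreasing score.

    Computed as the fraction of (positive, negative) pairs in which the
    positive gene has the higher score; ties count as half a pair.
    """
    pos = [s for s, flag in zip(scores, is_positive) if flag]
    neg = [s for s, flag in zip(scores, is_positive) if not flag]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative gene")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

An AUC of 1 means that every changed gene is ranked above every unchanged gene, and an AUC of 0.5 corresponds to a ranking that is no better than random.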

Relative connectivity and expression-connectivity weighting

We measure the relative connectivity as

Relative expected degree

We carried out a number of tests where _{deg}(_{deg}(_{deg}(_{deg}(

**Varying the relative expected degree between sets A and B**. Here we are varying the expected degrees of sets

A number of observations can be made from the experimental results where _{deg}(_{deg}(

• The maximum AUC achieved by the algorithm increases as the difference between $E_{deg}(A)$ and $E_{deg}(B)$

• As the difference between $E_{deg}(A)$ and $E_{deg}(B)$

• The maximum AUC achieved in each case occurs at larger values of

• The improvement by the algorithm over expression change ranking is greater when the difference between the expected degree of both sets is greater.

To summarise these findings, a higher expected degree of set

Relative set size

Four cases were investigated, where |

**Varying the relative sizes of sets A and B**. Our previous tests used |

These results on synthetic networks suggest that for certain network structures GeneRank can achieve a significant improvement over ranking based on pure differential expression. The relative expected degree of sets _{deg}(_{deg}(

However, in our tests the quality of the results generally decreases for values of

GO networks

As described earlier, we construct the GO networks by defining an edge between two genes if they share an annotation allocated by the Gene Ontology Consortium. This allows us to construct three networks, one for each section defined by the Gene Ontology: Biological Process, Cellular Component and Molecular Function.
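The edge rule described above can be sketched in a few lines. This is an illustrative sketch only: the function name `go_network` and the input format (one set of annotation labels per gene) are assumptions, and the labels in the test data are placeholders rather than real GO terms.

```python
def go_network(annotations):
    """Build a symmetric 0/1 connectivity matrix from gene annotations.

    annotations[i] is the set of annotation labels assigned to gene i.
    Two distinct genes are connected if they share at least one label.
    """
    n = len(annotations)
    w = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if annotations[i] & annotations[j]:  # non-empty intersection
                w[i][j] = w[j][i] = 1
    return w
```

Applying the function separately to the annotations from each GO section would give one network per section. Note that all genes sharing a given annotation become pairwise connected under this rule.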

An initial test combined the real network connectivity with synthetic expression changes. We ordered the genes based on expression change in the yeast diauxic shift experiment and allocated the top 300 down-regulated genes to set

**Combining real connectivity with synthetic expression.** The network connectivity for each of the three GO networks (Biological Process, Cellular Component and Molecular Function) was combined with synthetic expression data. In each case, the 300 most down-regulated genes which were defined using the real yeast diauxic shift data were allocated expression from a

To check if GeneRank produces a gene ranking which is robust to noise we conducted a further experiment using the GO networks. Real experimental data were used throughout. The Cellular Component network was used in this experiment. For each of the top 200 genes sorted by differential expression, we set its expression change to 0 in turn (i.e., defined it as "unchanged") and determined if the GeneRank algorithm was able to pick up this anomaly and consequently move the gene towards its original place in the ranked list. The premise here is that its connections to other highly changed genes will boost the artificially altered gene in the ranking. The same experiment was done for 200 randomly selected genes. The results are given in Figure

**GO networks: testing the 'boosting' ability of the algorithm**. An experiment was carried out to assess how well the algorithm is able to increase the relative ranking of a gene based on its connections to other highly changed genes. The top 200 most changed genes are set in turn to have a differential expression of 0. If the ranking were based on pure differential expression only, each gene would appear at the bottom of the list. By PageRanking, we raise the position of the gene closer to its original ranking. The same effect is not observed when a random 200 genes are chosen. The majority of these genes will not have connections to other highly changed genes. The blue line represents the original expression ranked position, the red circles show the original GeneRanked position and the blue asterisks show the modified GeneRanked position.
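The 'boosting' experiment can be sketched as follows. This is a simplified illustration and not the code used to produce the figure: the function names, the fixed number of sweeps, and the requirement that every gene has at least one connection are assumptions of the sketch.

```python
def generank(w, ex, d, sweeps=200):
    """Iterate r <- (1 - d) * ex + d * W^T D^{-1} r a fixed number of times.

    Assumes every gene has at least one connection (deg > 0).
    """
    n = len(w)
    deg = [sum(row) for row in w]
    r = list(ex)
    for _ in range(sweeps):
        r = [(1 - d) * ex[j]
             + d * sum(w[i][j] * r[i] / deg[i] for i in range(n))
             for j in range(n)]
    return r


def position(scores, gene):
    """1-based position of `gene` when genes are sorted by decreasing score."""
    order = sorted(range(len(scores)), key=lambda g: -scores[g])
    return order.index(gene) + 1


def knockout_positions(w, ex, d, top_k):
    """For each of the top_k genes by expression change, set its change to 0.

    Returns tuples (gene, original expression position,
    altered expression position, altered GeneRank position).
    """
    top = sorted(range(len(ex)), key=lambda g: -ex[g])[:top_k]
    results = []
    for g in top:
        altered = list(ex)
        altered[g] = 0.0  # pretend the gene was measured as unchanged
        results.append((g,
                        position(ex, g),
                        position(altered, g),
                        position(generank(w, altered, d), g)))
    return results
```

A gene whose neighbours are strongly changed should recover part of its original position in the last entry of each tuple, whereas a gene with weakly changed neighbours should stay near the bottom of the list.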

To quantify the results we calculated the quality index

where alt_PR is the GeneRanked position after the expression of the gene has been artificially altered, orig_exp_rank is the original expression-based position in the list, and alt_exp_rank is the expression-based position after the differential expression has been set to 0.

In the case where we are altering a gene in the top 200, a 'boosting' effect is observed and the ranked position after the fold-change has been moved towards the original ranked position. We can observe groups of genes which are boosted to the same level (shown by 'lines' of blue asterisks). It is likely that these genes are a completely connected subgraph, which results in all genes being given the same ranking. Altering genes which were originally ranked in the top 200, we achieve

Correlation coefficient networks

Using the correlation coefficient network defined by the stress data set

**Correlation Coefficient networks: testing the 'boosting' ability of the algorithm.** The same experiment as in Figure 6 was carried out on the correlation coefficient networks. In each case the network connectivity is identical but the expression change vector, used as input to the algorithm, is randomly chosen to be one experiment from the stress data set. The x-axis represents the top 200 genes when ranked using expression change information. We calculate a 'boosting' measure to quantify how much we increase the relative rank of each gene after it has been altered. In this case, each gene was changed to have expression change 1. The large values for

Again we calculate a value of our quality index

Conclusion

The purpose of this work was to explore the possibility of adapting the PageRank algorithm, used by Google in assessing the importance of web pages, for the task of prioritizing the 'importance' of genes in a microarray experiment. Our new algorithm, GeneRank, allows connectivity and expression data to be combined to produce a more robust and informed summary of an experiment, compared to the standard procedure of basing the importance of a gene on its measured expression change. Although we use expression change values as expression data, this is not restricted, and some other means of capturing the expression information may also be used. GeneRank can be justified theoretically and has been tested on synthetic data, experimental data and a combination of both. The algorithm has a single parameter,

While the improvement of gene rankings upon application of GeneRank is already significant in the examples presented, it may become even more so once comprehensive high-quality biological network information becomes available. Of particular interest in that respect will be transcriptional regulatory networks, such as are now being generated by technologies like ChIP-chip (see

Methods

The original algorithm

We summarize the basic PageRank algorithm which was developed by Larry Page and Sergey Brin at Stanford University

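PageRank assigns a measure of relevance or importance to each web page, allowing Google to return high-quality pages in response to a user query. The algorithm is designed to be robust to methods of deception, where web page designers attempt to artificially boost the PageRank of their page by altering the local link structure. Robustness follows from the recursive nature of the algorithm, where a page is highly ranked if it is linked to by other highly ranked pages. A link from page $i$ to page $j$ is viewed as a vote of confidence by page $i$ for page $j$. The link structure of $N$ pages is described by the matrix $W \in \mathbb{R}^{N \times N}$, where $w_{ij} = 1$ if there is a link from page $i$ to page $j$ and $w_{ij} = 0$ otherwise. We define $\mathrm{deg}_i := \sum_{j} w_{ij}$ and start from an initial ranking $\mathbf{r}^{[0]} \in \mathbb{R}^{N}$. The PageRank algorithm proceeds iteratively, updating the ranking for the $j$th page according to

$$r_j^{[n]} = (1 - d) + d \sum_{i=1}^{N} w_{ij} \frac{r_i^{[n-1]}}{\mathrm{deg}_i}, \quad j = 1, \ldots, N. \qquad (1)$$

Here $d \in (0, 1)$ is a fixed parameter, and the division by $\mathrm{deg}_i$ in the summation ensures that each page has equal influence in the voting procedure. Each page gets a rank of $1 - d$ automatically. If the iteration converges, the limit $\mathbf{r} \in \mathbb{R}^{N}$ is the solution in the linear system

$(I - dW^{T}D^{-1})\mathbf{r} = (1 - d)\mathbf{e}$, (2)

where $W^{T}$ is the transpose of $W$, $D = \mathrm{diag}(\mathrm{deg}_i)$ and $\mathbf{e} \in \mathbb{R}^{N}$ has all $e_i = 1$. Applying PageRank is equivalent to applying the Jacobi iteration to (2), and convergence to a unique $\mathbf{r}$ is guaranteed under the condition

$\rho(dW^{T}D^{-1}) < 1$, (3)

where $\rho(\cdot)$ denotes the spectral radius.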

A random walk interpretation

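The PageRanking process has an alternative interpretation in terms of a random walk. At each step, a surfer currently at page $i$ either

**teleports:** with probability $1 - d$, jumps to a page chosen uniformly at random over the whole set, or

**surfs:** with probability $d$, follows a hyperlink out of page $i$, where each page $j$ with $w_{ij} = 1$ is equally likely to be chosen as the destination.

The PageRank vector $\mathbf{r}$, when normalised so that its components sum to one, corresponds to the invariant measure for this process. In other words, $r_j$ is the long-time proportion of visits made to page $j$.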

The modified algorithm: GeneRank

The PageRank idea translates intuitively to the analogous situation of gene expression analysis. Instead of producing a ranked list of web pages, we produce a ranked list of genes. PageRank views hyperlinks as votes of confidence, so we similarly allow functional connections to boost rank. Just as PageRank counts votes from a highly-ranked page as more influential than votes from a lowly-ranked page, we will allow connections to genes with high differential expression to carry greater significance than connections to genes with low differential expression. Figure

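PageRank gives each web page a rank of $(1 - d)$ automatically. In GeneRank we replace this with $(1 - d)\,\mathrm{ex}_i$, where $\mathrm{ex}_i$ is the absolute value of the expression change for gene $i$. We initialise with $\mathbf{r}^{[0]} = \mathbf{ex}/\|\mathbf{ex}\|_{1}$, where $\|\cdot\|_{1}$ denotes the vector 1-norm. Then we let

$$r_j^{[n]} = (1 - d)\,\mathrm{ex}_j + d \sum_{i=1}^{N} w_{ij} \frac{r_i^{[n-1]}}{\mathrm{deg}_i}, \quad j = 1, \ldots, N. \qquad (4)$$

Here $W \in \mathbb{R}^{N \times N}$ is the connectivity matrix for the gene network, so $w_{ij} = w_{ji} = 1$ if genes $i$ and $j$ are connected and $w_{ij} = w_{ji} = 0$ otherwise, and $\mathrm{deg}_i := \sum_{j} w_{ij}$ is the degree of gene $i$.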
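The GeneRank update can be sketched in plain Python as below. This is a minimal sketch and not a translation of the Matlab file accompanying the paper: the function name `generank`, the stopping tolerance `tol`, the iteration cap `max_iter` and the treatment of isolated genes (they cast no votes) are assumptions made for illustration.

```python
def generank(w, ex, d, tol=1e-12, max_iter=10000):
    """Iterate r_j <- (1 - d) * ex_j + d * sum_i w_ij * r_i / deg_i.

    w  : symmetric 0/1 connectivity matrix as a list of lists
    ex : expression changes (absolute values are taken here)
    d  : free parameter, 0 <= d < 1
    Genes with no connections cast no votes (an assumption of this sketch).
    """
    n = len(w)
    deg = [sum(row) for row in w]
    e = [abs(x) for x in ex]
    total = sum(e)
    if total == 0:
        raise ValueError("expression vector must not be all zeros")
    r = [x / total for x in e]  # initial ranking ex / ||ex||_1
    for _ in range(max_iter):
        new = []
        for j in range(n):
            votes = sum(w[i][j] * r[i] / deg[i]
                        for i in range(n) if deg[i] > 0)
            new.append((1 - d) * e[j] + d * votes)
        change = sum(abs(a - b) for a, b in zip(new, r))
        r = new
        if change < tol:
            break
    return r
```

Sorting the genes by decreasing value of the returned vector gives the prioritized list. For large networks a sparse matrix representation or a direct linear solve would be preferable to the dense double loop used here.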

We remark that this iteration may also be motivated from the viewpoint of

The iteration (4) corresponds to Jacobi on the system

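$(I - dW^{T}D^{-1})\mathbf{r} = (1 - d)\mathbf{ex}$, (5)

and, because the iteration matrix has not been altered, the convergence condition (3) is unchanged, so that convergence is guaranteed for all $0 < d < 1$. Since $W$ is symmetric, we may replace $W^{T}$ by $W$.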

In summary, the GeneRank algorithm is finding the customised ranking vector **r **defined by the linear system (5). A Matlab implementation of the algorithm is available in the additional file geneRank.m. The random walk interpretation carries through to this more general setting. If the teleporting step is re-defined so that the destination gene is not chosen uniformly over the whole set, but rather is chosen with probability proportional to absolute expression level, then **r **in (5), suitably scaled, is the invariant measure. Overall, we have a true generalization of PageRank in the sense that (a) the algorithm has both "vote of confidence" and "random walk" interpretations and (b) for the case where all ex_{i }= 1 we recover the original PageRank algorithm.
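The random walk interpretation can be checked numerically by simulating the walk and comparing visit frequencies with the scaled ranking vector. The sketch below is illustrative only: the function names, the fixed number of sweeps, the random seed and the assumption that every gene has at least one connection are not taken from the paper.

```python
import random


def generank(w, ex, d, sweeps=500):
    """Iterate r <- (1 - d) * ex + d * W^T D^{-1} r (assumes all deg > 0)."""
    n = len(w)
    deg = [sum(row) for row in w]
    r = list(ex)
    for _ in range(sweeps):
        r = [(1 - d) * ex[j]
             + d * sum(w[i][j] * r[i] / deg[i] for i in range(n))
             for j in range(n)]
    return r


def simulate_walk(w, ex, d, steps, seed=0):
    """Fraction of steps a teleporting random walker spends at each gene.

    With probability d the walker moves to a uniformly chosen neighbour;
    otherwise it teleports to a gene chosen with probability proportional
    to its (nonnegative) expression change.
    """
    rng = random.Random(seed)
    n = len(w)
    neighbours = [[j for j in range(n) if w[i][j]] for i in range(n)]
    visits = [0] * n
    current = 0
    for _ in range(steps):
        if rng.random() < d:
            current = rng.choice(neighbours[current])       # surf
        else:
            current = rng.choices(range(n), weights=ex)[0]  # teleport
        visits[current] += 1
    return [v / steps for v in visits]
```

For a small connected network, the frequencies returned by `simulate_walk` should approach `generank(w, ex, d)` divided by the sum of its entries as the number of steps grows.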

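It is trivial to check that with the choice $d = 0$ the solution of (5) is $\mathbf{r} = \mathbf{ex}$. In this case the genes are ranked purely on expression level. We will now study the other extreme, where $d = 1$.

For $d = 1$ the iteration (4) reduces to

$$r_j^{[n]} = \sum_{i=1}^{N} w_{ij} \frac{r_i^{[n-1]}}{\mathrm{deg}_i}, \qquad (6)$$

and the system (5) for the corresponding fixed point becomes

$(I - W^{T}D^{-1})\mathbf{r} = \mathbf{0}$. (7)

First, we show that the sum of the rankings is preserved by the iteration. From (6),

$$\sum_{j=1}^{N} r_j^{[n]} = \sum_{j=1}^{N} \sum_{i=1}^{N} w_{ij} \frac{r_i^{[n-1]}}{\mathrm{deg}_i} = \sum_{i=1}^{N} \frac{r_i^{[n-1]}}{\mathrm{deg}_i} \sum_{j=1}^{N} w_{ij} = \sum_{i=1}^{N} r_i^{[n-1]}.$$

Also, it is clear from (6) that the iteration preserves the nonnegativity of the initial ranks; that is, $r_j^{[n]} \geq 0$ for all $n$. We therefore seek a solution of (7) with $\|\mathbf{r}\|_{1} = 1$ and $r_i \geq 0$. The vector $\mathbf{deg}/\|\mathbf{deg}\|_{1}$ is a fixed point of (6). To see this, put $\mathbf{r}^{[n-1]} = \mathbf{deg}/\|\mathbf{deg}\|_{1}$ in the right-hand side of (6) to obtain

$$r_j^{[n]} = \sum_{i=1}^{N} w_{ij} \frac{\mathrm{deg}_i}{\mathrm{deg}_i \, \|\mathbf{deg}\|_{1}} = \frac{1}{\|\mathbf{deg}\|_{1}} \sum_{i=1}^{N} w_{ji} = \frac{\mathrm{deg}_j}{\|\mathbf{deg}\|_{1}},$$

where the second equality uses the symmetry of $W$.

Now, we observe that $\rho(W^{T}D^{-1}) \leq \|W^{T}D^{-1}\|_{1} = 1$, and hence all eigenvalues of $W^{T}D^{-1}$ are less than or equal to 1 in modulus. Because $W$ is symmetric, $W^{T}D^{-1}\mathbf{deg} = W^{T}\mathbf{e} = \mathbf{deg}$, showing that there is at least one eigenvector, $\mathbf{deg}$, corresponding to eigenvalue 1. Suppose now that $\lambda = 1$ is a simple eigenvalue of $W^{T}D^{-1}$ and that $\mathbf{r}^{*}$ with $\|\mathbf{r}^{*}\|_{1} = 1$ is another solution of (7). Then

$$W^{T}D^{-1}\left(\mathbf{r}^{*} - \frac{\mathbf{deg}}{\|\mathbf{deg}\|_{1}}\right) = \mathbf{r}^{*} - \frac{\mathbf{deg}}{\|\mathbf{deg}\|_{1}}.$$

So $\mathbf{r}^{*} - \mathbf{deg}/\|\mathbf{deg}\|_{1}$ is an eigenvector of $W^{T}D^{-1}$ corresponding to eigenvalue 1. It follows that $\mathbf{r}^{*} - \mathbf{deg}/\|\mathbf{deg}\|_{1}$ must be a multiple of $\mathbf{deg}$ and hence $\mathbf{r}^{*} = \pm\,\mathbf{deg}/\|\mathbf{deg}\|_{1}$. We may summarize our findings in the following result.

**Result** If the eigenvalue $\lambda = 1$ of $W^{T}D^{-1}$ is simple, then $\mathbf{r} = \mathbf{deg}/\|\mathbf{deg}\|_{1}$ is the unique solution of (7) that satisfies the required constraints $\|\mathbf{r}\|_{1} = 1$ and $r_i \geq 0$.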
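The result can be illustrated on a small example. The sketch below applies one step of the iteration (6) and checks that the normalised degree vector is reproduced; the helper name `step_d1` and the four-gene network used to exercise it are hypothetical and chosen only for illustration.

```python
def step_d1(w, r):
    """One step of iteration (6): r_j <- sum_i w_ij * r_i / deg_i.

    Assumes w is symmetric and every gene has at least one connection.
    """
    n = len(w)
    deg = [sum(row) for row in w]
    return [sum(w[i][j] * r[i] / deg[i] for i in range(n)) for j in range(n)]


def degree_ranking(w):
    """The vector deg / ||deg||_1 for connectivity matrix w."""
    deg = [sum(row) for row in w]
    total = sum(deg)
    return [x / total for x in deg]
```

On a connected network that contains a triangle, repeated application of `step_d1` to any nonnegative starting vector with unit sum should approach `degree_ranking(w)`, so that for this extreme parameter value the ranking depends only on connectivity and not on expression.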

Overall, we conclude that the extremal parameter values

Abbreviations

ROC – Receiver Operating Characteristic

AUC – Area Under ROC Curve

GO – Gene Ontology

Authors' contributions

JLM implemented the algorithm, performed the experiments, and drafted the manuscript. RB provided biological datasets and interpretation, and participated in the experimental design. DRG contributed to study coordination and continuous discussions. DJH first conceived of the application of PageRanking to biological data, and participated in study design and supervision. All authors read and approved the final manuscript.

**The Matlab GeneRank implementation**. The file contains a Matlab implementation of the GeneRank algorithm. The file requires a matrix describing the network connectivity and a vector of expression changes for each gene. The output is the vector of rankings for each gene.

**A Matlab .mat file containing sample GO networks and expression change vectors**. This file can be loaded into Matlab using the command load G0_matrix. This will load three matrices (w_All, w_Up and w_Down) and three expression change vectors (expr_data, expr_dataUp and expr_dataDown) into the current workspace. These matrices were constructed using all three sections of the Gene Ontology, where a link is present between two genes if they share a GO annotation. Only genes which are up-regulated are included in w_Up and only down-regulated genes in w_Down. The GeneRank algorithm should be used with the corresponding matrix and expression change vector, e.g. the command ranking = geneRank(w_Up, expr_dataUp,d) would be used to calculate the ranking of the up-regulated genes in the experiment.

Acknowledgements

JLM was supported by a Synergy scholarship, a jointly supervised studentship between the universities of Strathclyde and Glasgow.

RB was supported by a BBSRC grant (17/G17989) to Anna Amtmann and a personal research fellowship from the Caledonian Research Foundation.

DJH was supported by a Fellowship from the Royal Society of Edinburgh/Scottish Executive Education and Lifelong Learning Department and by the EPSRC grant GR/562383/01.