Dept. of Comp. Sc. and Engg, Tezpur University, Napaam, Tezpur, India

Dept. of Computer Science, University of Colorado, Colorado Springs, USA

Abstract

Background

The development of high-throughput microarray technologies has provided various opportunities to systematically characterize diverse types of computational biological networks. Co-expression networks have become popular in the analysis of microarray data, for example for detecting functional gene modules.

Results

This paper presents a method to build a co-expression network (CEN) and to detect network modules from the built network. We use an effective gene expression similarity measure called NMRS (Normalized mean residue similarity) to construct the CEN. We have tested our method on five publicly available benchmark microarray datasets. The network modules extracted by our algorithm have been biologically validated in terms of Q value and p value.

Conclusions

Our results show that the technique is capable of detecting biologically significant network modules from the co-expression network. Biologists can use this technique to find groups of genes with similar functionality based on their expression information.

Introduction

The development of high-throughput microarray technologies has provided a range of opportunities to systematically characterize diverse types of biological networks. Biological networks can be broadly classified as protein interaction networks

Problem formulation

Due to the non-transitive nature of connections among genes, genes form a very complicated connectivity network with respect to a particular similarity measure in a gene expression data set. Such a connectivity network is often referred to as a co-expression network. A major use of a co-expression network is the extraction of network modules that represent its strongly connected regions. These modules may contain highly co-expressed genes, which are functionally similar.

In this paper, we propose an effective similarity measure for gene co-expression, develop an approach to build a co-expression network from a gene expression data set, and mine the potential network modules from the built network. We aim to produce a graph,

1. Each vertex

2. Each edge _{1},_{2} where _{1},_{2} ∈

3. There is an edge between two vertices _{1},_{2} ∈

Our contribution

We claim the following contributions in this paper.

• We introduce an effective gene similarity measure NMRS.

• We propose an approach to construct a co-expression network using NMRS.

• We develop a spanning tree based method to extract the potential network modules.

Background

In the literature, a number of techniques have been proposed for gene co-expression network construction. When inferring co-expression networks from gene expression data, the algorithms take a gene expression dataset as primary input and then, by using a correlation-based proximity measure, construct the corresponding co-expression networks. Frequently used correlation-based measures are the Pearson correlation coefficient, the Spearman correlation coefficient and mutual information. Approaches such as

Generally, in a co-expression network, the connections between genes are obtained from the absolute values of a co-expression measure. Several researchers have suggested thresholding the value of the co-expression measure to construct gene co-expression networks. There are two ways to pick a threshold. The first is hard thresholding, which picks a single number based on the notion of statistical significance so that gene co-expression is encoded as binary information (connected = 1, unconnected = 0). The second is soft thresholding, which weighs each connection by a number between 0 and 1. The drawbacks of hard thresholding include loss of information regarding the magnitude of gene connections and sensitivity to the choice of the threshold. Generally, hard thresholding results in unweighted networks while soft thresholding results in weighted networks.
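The difference between the two schemes can be illustrated with a minimal sketch. The similarity matrix `S`, the threshold `tau` and the exponent `beta` are hypothetical values chosen for illustration, and the power function is only one common choice of soft-thresholding function, assumed here rather than taken from this paper.

```python
def hard_threshold(similarity, tau):
    """Binary adjacency: 1 if similarity >= tau, else 0 (no self-loops)."""
    n = len(similarity)
    return [[1 if i != j and similarity[i][j] >= tau else 0 for j in range(n)]
            for i in range(n)]


def soft_threshold(similarity, beta):
    """Weighted adjacency in [0, 1] using |s| ** beta (an assumed power form)."""
    n = len(similarity)
    return [[abs(similarity[i][j]) ** beta if i != j else 0.0 for j in range(n)]
            for i in range(n)]


# Hypothetical similarity matrix for three genes.
S = [[1.0, 0.9, 0.4],
     [0.9, 1.0, 0.6],
     [0.4, 0.6, 1.0]]

A_hard = hard_threshold(S, tau=0.5)   # unweighted network
A_soft = soft_threshold(S, beta=2)    # weighted network
```

In the hard-thresholded matrix the pairs with similarity 0.9 and 0.6 both become 1, so the difference in their magnitudes is lost, whereas the soft-thresholded matrix keeps it as different edge weights.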

Methodology

To construct the gene co-expression network, we use the general framework proposed by

Define a gene expression measurement

To determine whether two genes have similar expression patterns, an appropriate similarity measure must be chosen _{1}=(_{1}, _{2},…, _{n}_{2}=(_{1}, _{2},…, _{n}

where

NMRS as a metric

NMRS satisfies all the properties of a metric. We establish the non-negativity, symmetry and triangle inequality properties of our measure in Additional file 1.

**NMRS as a metric** Additional file 1 presents the proofs of the metric properties of the NMRS measure.


Significance of NMRS

The most widely used proximity measures in gene expression data analysis are Euclidean distance, Pearson correlation coefficient, Spearman correlation coefficient and mean squared residue. In a co-expression network, the proximity measure is expected to detect linear shifting patterns in the gene expression data effectively, but none of the widely used proximity measures serves this purpose satisfactorily. The Euclidean distance measures the distance between two data objects, but in this domain the overall shapes of gene expression patterns (or profiles) are of greater interest than the individual magnitudes of each feature

Comparison of proximity measures

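| Proximity measure | | | | |
|---|---|---|---|---|
| Euclidean | Mutual | Yes | Yes | No |
| Pearson | Mutual | No | Yes | Yes |
| Spearman | Mutual | No | No | No |
| MSR | Aggregate | No | Yes | Yes |
| NMRS | Mutual | No | Yes | Yes |

Table 1 presents a comparison of different proximity measures.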

Let us consider a random gene pattern

**Example patterns used for evaluation of proximity measures** Figure 1 presents the values of some example patterns that are used to demonstrate the superiority of NMRS over other proximity measures, viz. Euclidean distance, Pearson correlation coefficient and Spearman correlation coefficient.

Gene pattern

| | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| 4 | 7 | 6 | 3 | 6 | 5 | 8 | 7 | 3 |
| 10 | 13 | 12 | 9 | 12 | 11 | 14 | 13 | 9 |
| 10.4286 | 12.5714 | 11.8571 | 9.7143 | 11.8571 | 11.1429 | 13.2857 | 12.5714 | 9.7143 |
| 10.8571 | 12.1429 | 11.7143 | 10.4286 | 11.7143 | 11.2857 | 12.5714 | 12.1429 | 10.4286 |
| 11.2857 | 11.7143 | 11.5714 | 11.1429 | 11.5714 | 11.4286 | 11.8571 | 11.7143 | 11.1429 |
| 11.7143 | 11.2857 | 11.4286 | 11.8571 | 11.4286 | 11.5714 | 11.1429 | 11.2857 | 11.8571 |
| 12.1429 | 10.8571 | 11.2857 | 12.5714 | 11.2857 | 11.7143 | 10.4286 | 10.8571 | 12.5714 |
| 12.5714 | 10.4286 | 11.1429 | 13.2857 | 11.1429 | 11.8571 | 9.7143 | 10.4286 | 13.2857 |
| 13 | 10 | 11 | 14 | 11 | 12 | 9 | 10 | 14 |

Table 2 presents the random gene patterns used to analyze the different proximity measures.
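To make the role of shifting patterns concrete, the following sketch evaluates three measures on patterns taken from Table 2. The first two patterns differ only by a constant shift of 6, and the last pattern is a mirrored version of the second. The `nmrs` function is an assumption: it implements one plausible reading of a normalized mean residue similarity (one minus the summed absolute residue between the mean-centred profiles, normalized by twice the larger total absolute deviation), and it should not be read as the exact definition used in this paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(a, b):
    """Pearson correlation coefficient of two profiles."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def nmrs(a, b):
    """ASSUMED form of a normalized mean residue similarity (illustrative)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    residue = sum(abs((x - ma) - (y - mb)) for x, y in zip(a, b))
    scale = 2 * max(sum(abs(x - ma) for x in a), sum(abs(y - mb) for y in b))
    return 1 - residue / scale


# Rows 1, 2 and 9 of Table 2.
p1 = [4, 7, 6, 3, 6, 5, 8, 7, 3]
p2 = [10, 13, 12, 9, 12, 11, 14, 13, 9]   # p1 shifted by +6
p9 = [13, 10, 11, 14, 11, 12, 9, 10, 14]  # p2 mirrored (23 - p2)

shift_scores = (euclidean(p1, p2), pearson(p1, p2), nmrs(p1, p2))
mirror_scores = (euclidean(p2, p9), pearson(p2, p9), nmrs(p2, p9))
```

For the shifted pair the Euclidean distance is 18 even though the two profiles have identical shapes, which illustrates why a magnitude-based distance is a poor fit for detecting shifting patterns, while the shape-based scores treat the pair as maximally similar.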

**NMRS and Pearson correlation coefficient among considered example patterns** Figure 2 presents NMRS and Pearson correlation coefficient of patterns

Compute an adjacency matrix

An adjacency matrix is obtained using a signum function based hard thresholding approach which encodes edge information for each pair of nodes in the co-expression network. Two genes d_{i}_{j}_{i}_{j}

Detect network modules

Detecting subsets of nodes (modules) that are tightly connected to each other is an important aim of co-expression network analysis. In this paper, we use spanning trees and a topological overlap similarity measure $\omega_{ij}$, defined as

$$\omega_{ij} = \frac{l_{ij} + a_{ij}}{\min\{k_i, k_j\} + 1 - a_{ij}}$$

where $l_{ij} = \sum_{u} a_{iu} a_{uj}$ and $k_i = \sum_{u} a_{iu}$
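As an illustration, the sketch below computes a topological overlap matrix from a binary adjacency matrix. It assumes the commonly used definition in which the overlap of nodes i and j is (number of shared neighbours + a_ij) / (min(k_i, k_j) + 1 - a_ij), with k_i the degree of node i. The function name and the small example graph are hypothetical.

```python
def topological_overlap(adj):
    """Topological overlap matrix for a symmetric 0/1 adjacency matrix
    with a zero diagonal (assumed standard definition)."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    tom = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                tom[i][j] = 1.0  # a node overlaps completely with itself
                continue
            # Number of neighbours shared by i and j.
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u != i and u != j)
            denom = min(degree[i], degree[j]) + 1 - adj[i][j]
            tom[i][j] = (shared + adj[i][j]) / denom
    return tom


# Hypothetical network: a triangle 0-1-2 plus a pendant node 3 attached to 2.
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]

T = topological_overlap(A)
```

Nodes 0 and 3 are not directly connected, yet they receive a non-zero overlap because they share neighbour 2. This is the property that makes topological overlap more informative than the raw adjacency when looking for tightly connected groups.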

Extract useful information

Extraction of useful biological information is one of the main uses of gene co-expression networks. From the constructed network, one can explore important information such as the functionality and pathways of genes and the essential genes susceptible to diseases.

Proposed algorithm: Module Miner

Module Miner takes NMRS threshold,

The symbols provided in Table

Symbolic representation

The gene expression matrix

_{i}

i^{th}

Signum threshold

Co-expression network

Set of vertices in G

Set of edges in G

Distance matrix

_{i}_{j}

NMRS distance between genes d_{i}_{j}

Adjacency matrix

_{i}_{j}

1 if v_{i}_{j}

^{con}

Set of connected region

i^{th}

Set of vertices in i^{th}

Set of edges in i^{th}

Adjacency matrix of the i^{th}

i^{th}

^{net}

Set of network modules obtained from

_{i}_{j}

Topological Matrix value between vertices v_{i}_{j}

_{1})

Average TOM of the set of vertices V_{1}

TOM for i^{th}

Maximum spanning tree obtained from i^{th}

Set of edges in

Table 3 describes the various symbols that are used in

**Definition 1 **_{i}_{j}_{i}, _{j}

**Definition 2 Connected regions **

**Definition 3 Maximum spanning tree**

**Definition 4 Network modules **

•

• _{3})>_{4}) _{4}.

Algorithm:

The pseudocode of Module Miner is presented in Algorithm 1. In the pseudocode, lines 1-4 extract the connected regions from the gene expression data. Lines 5-25 process each of the connected regions to extract the network modules. A maximum spanning tree is constructed using Prim’s algorithm
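The maximum spanning tree step can be sketched as follows. This is a generic version of Prim’s algorithm that selects the heaviest available edge at each step, not a transcription of Algorithm 1. The weight matrix `W` holds hypothetical values standing in for pairwise weights within one connected region, and the sketch assumes a dense symmetric matrix in which every pair of vertices has a weight.

```python
def maximum_spanning_tree(weights):
    """Prim's algorithm on a dense symmetric weight matrix.

    Returns the tree as a list of (parent, child, weight) edges,
    growing the tree from vertex 0.
    """
    n = len(weights)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [float("-inf")] * n   # heaviest edge linking each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the vertex outside the tree with the heaviest connecting edge.
        u = max((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((parent[u], u, weights[parent[u]][u]))
        # Update the best known edge for the remaining vertices.
        for v in range(n):
            if not in_tree[v] and weights[u][v] > best[v]:
                best[v] = weights[u][v]
                parent[v] = u
    return edges


# Hypothetical pairwise weights for four genes in one connected region.
W = [[0.0, 0.9, 0.2, 0.1],
     [0.9, 0.0, 0.8, 0.3],
     [0.2, 0.8, 0.0, 0.7],
     [0.1, 0.3, 0.7, 0.0]]

mst_edges = maximum_spanning_tree(W)
```

The tree keeps only the strongest link needed to connect each gene to the rest of its region, so removing its weakest edges is a natural way to split a region into candidate modules.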

Algorithm complexity

The complexity of different steps of our method is presented in this section.

• The preparation of the distance matrix involves a complexity of O(

• Finding connected regions from the co-expression network requires a complexity of O(

• Computation of the TOM matrix involves a complexity of O(_{c}_{c}_{c}_{c}_{c}

• Finding a maximum spanning tree consumes a complexity of
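The step of finding connected regions can be illustrated with a breadth-first search over the adjacency matrix. This is a minimal sketch with a hypothetical function name and example network. Because it scans a full row of the matrix for every visited vertex, this particular version takes time quadratic in the number of genes, which is not necessarily the bound stated for Module Miner.

```python
from collections import deque


def connected_regions(adj):
    """Return the connected components of a 0/1 adjacency matrix
    as sorted lists of vertex indices."""
    n = len(adj)
    seen = [False] * n
    regions = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        region = []
        while queue:
            u = queue.popleft()
            region.append(u)
            # Scan row u of the matrix for unvisited neighbours.
            for v in range(n):
                if adj[u][v] and not seen[v]:
                    seen[v] = True
                    queue.append(v)
        regions.append(sorted(region))
    return regions


# Hypothetical network with edges 0-1, 1-2 and 3-4.
A = [[0, 1, 0, 0, 0],
     [1, 0, 1, 0, 0],
     [0, 1, 0, 0, 0],
     [0, 0, 0, 0, 1],
     [0, 0, 0, 1, 0]]

regions = connected_regions(A)
```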

Experimental results

We implemented the Module Miner algorithm in MATLAB and tested it on the five benchmark microarray datasets listed in Table 4.

Datasets used for evaluating

| Serial No. | Dataset | No. of Genes / No. of Conditions | Source |
|---|---|---|---|
| 1 | Yeast Sporulation | 474/17 | |
| 2 | Yeast Diauxic Shift | 689/72 | Sample gene in expander |
| 3 | Subset of Yeast Cell Cycle | 384/17 | |
| 4 | Arabidopsis thaliana | 138/8 | |
| 5 | Rat CNS | 112/9 | |

Table 4 gives the description of various datasets used in

Validation

The performance of Module Miner on the five publicly available benchmark microarray datasets is measured in terms of p value and Q value.

p value

The biological significance of the sets of genes included in the extracted network modules is evaluated based on p values

where f and g denote the total number of genes within a category and within the genome respectively.
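A minimal sketch of how such an enrichment p value can be computed is given below. It assumes the usual cumulative hypergeometric formulation: with g genes in the genome, f of them in the category, and a module of n genes of which k belong to the category, the p value is the probability of observing at least k category genes by chance. The symbols n and k and the function name are assumptions introduced for illustration.

```python
from math import comb


def enrichment_p_value(k, n, f, g):
    """Upper-tail hypergeometric probability P(X >= k).

    g: genes in the genome, f: genes in the functional category,
    n: genes in the module, k: module genes that are in the category.
    """
    total = comb(g, n)
    # Sum the tail directly; this equals 1 - sum_{i<k} but avoids
    # the loss of precision of subtracting from 1 for tiny p values.
    tail = sum(comb(f, i) * comb(g - f, n - i)
               for i in range(k, min(n, f) + 1))
    return tail / total


# Hypothetical toy numbers: 10 genes in the genome, 4 in the category,
# a module of 3 genes of which 2 are in the category.
p = enrichment_p_value(k=2, n=3, f=4, g=10)
```

The smaller the p value, the less likely it is that the module contains that many genes of the category by chance, so low values indicate functional enrichment.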

To compute the p values, we used a tool called FuncAssociate **7.69 × 10^{–27}, 3.93 × 10^{–25}, 1.03 × 10^{–26}, 1.23 × 10^{–23}, 2.32 × 10^{–28}, 5.12 × 10^{–27}, 7.27 × 10^{–23}, 2.06 × 10^{–20}, 3.84 × 10^{–32}, 1.41 × 10^{–31}, 1.19 × 10^{–38}, 9.65 × 10^{–36}, 1.34 × 10^{–20}, 2.52 × 10^{–34}, 1.93 × 10^{–28} and 6.91 × 10^{–27}** being the highly enriched ones. From the given p values, we can conclude that Module Miner shows good enrichment of functional categories and therefore good biological significance.

p-value of one of the network modules of Dataset 2

| p-value | GO ID | GO term |
|---|---|---|
| 2.32E-28 | GO:0000788 | nuclear nucleosome |
| 5.12E-27 | GO:0000786 | nucleosome |
| 7.27E-23 | GO:0006334 | nucleosome assembly |
| 2.06E-20 | GO:0032993 | protein-DNA complex |
| 8.61E-19 | GO:0034728 | nucleosome organization |
| 1.14E-18 | GO:0065004 | protein-DNA complex assembly |
| 1.12E-17 | GO:0006333 | chromatin assembly or disassembly |
| 4.12E-16 | GO:0005694 | chromosome |
| 2.49E-14 | GO:0044454 | nuclear chromosome part |
| 1.70E-13 | GO:0031298 | replication fork protection complex |
| 9.47E-14 | GO:0006325 | chromatin organization |
| 6.78E-13 | GO:0044427 | chromosomal part |
| 2.32E-12 | GO:0034622 | cellular macromolecular complex assembly |

Table 5 gives the p values of one of the network modules of Dataset 2.

p-value of one of the network modules of Dataset 3

| p-value | GO ID | GO term |
|---|---|---|
| 3.93E-25 | GO:0006281 | DNA repair |
| 1.03E-26 | GO:0006259 | DNA metabolic process |
| 1.23E-23 | GO:0006974 | response to DNA damage stimulus |
| 7.69E-27 | GO:0006260 | DNA replication |
| 6.94E-19 | GO:0007049 | cell cycle |
| 5.55E-16 | GO:0005634 | nucleus |
| 8.53E-18 | GO:0044454 | nuclear chromosome part |
| 1.51E-17 | GO:0022402 | cell cycle process |
| 3.53E-17 | GO:0000079 | regulation of cyclin-dependent protein kinase activity |
| 5.72E-15 | GO:0045859 | regulation of protein kinase activity |
| 5.16E-16 | GO:0005657 | replication fork |

Table 6 gives the p values of one of the network modules of Dataset 3.

Q value

The Q-value **1.53 × 10^{–34}, 3.43 × 10^{–33}, 2.59 × 10^{–32}, 6.93 × 10^{–30}, 1.40 × 10^{–29}, 1.86 × 10^{–25}, 9.90 × 10^{–25}, 1.25 × 10^{–24}, 4.83 × 10^{–24}, 5.45 × 10^{–24}, 2.10 × 10^{–23}, 1.62 × 10^{–21}, 2.74 × 10^{–21}** being the highly enriched ones. From the Q-value results, we conclude that the genes in a network module obtained by Module Miner appear to be involved in similar functions.

Q-value of one of the network modules of Dataset 3

| GO term | Q-value |
|---|---|
| DNA replication | 1.93E-21 |
| DNA repair | 1.93E-21 |
| response to DNA damage stimulus | 2.17E-20 |
| DNA-dependent DNA replication | 3.07E-19 |
| replication fork | 6.27E-19 |
| nuclear chromosome | 1.23E-17 |
| mitotic sister chromatid cohesion | 5.51E-17 |
| nuclear replication fork | 9.37E-17 |
| nuclear chromosome part | 2.00E-16 |
| sister chromatid cohesion | 5.13E-15 |

Table 7 gives the Q values of one of the network modules of Dataset 3.

Q-value of one of the network modules of Dataset 1

| GO term | Q-value |
|---|---|
| cytosolic ribosome | 1.43E-52 |
| cytosolic part | 3.26E-48 |
| structural constituent of ribosome | 2.11E-44 |
| ribosomal subunit | 1.16E-42 |
| cytosolic large ribosomal subunit | 2.65E-36 |
| large ribosomal subunit | 1.47E-27 |
| preribosome | 2.96E-23 |
| cytosolic small ribosomal subunit | 3.71E-17 |
| 90S preribosome | 8.48E-16 |

Table 8 gives the Q values of one of the network modules of Dataset 1.

Q-value of one of the network modules of Dataset 1

| GO term | Q-value |
|---|---|
| sporulation resulting in formation of a cellular spore | 1.53E-34 |
| sporulation | 1.53E-34 |
| anatomical structure formation involved in morphogenesis | 1.53E-34 |
| spore wall assembly | 3.43E-33 |
| ascospore wall assembly | 3.43E-33 |
| ascospore formation | 3.43E-33 |
| sexual sporulation | 3.43E-33 |
| spore wall biogenesis | 3.43E-33 |
| ascospore wall biogenesis | 3.43E-33 |
| sexual sporulation resulting in formation of a cellular spore | 3.43E-33 |
| cell development | 3.43E-33 |
| cell wall assembly | 8.88E-33 |
| reproductive process in single-celled organism | 2.59E-32 |
| cell differentiation | 8.40E-32 |
| fungal-type cell wall biogenesis | 6.93E-30 |
| reproductive developmental process | 1.40E-29 |
| reproductive process | 1.86E-25 |
| reproductive cellular process | 1.86E-25 |
| reproduction of a single-celled organism | 9.90E-25 |
| cell wall biogenesis | 1.25E-24 |
| sexual reproduction | 4.83E-24 |
| anatomical structure development | 5.45E-24 |
| anatomical structure morphogenesis | 5.45E-24 |
| M phase | 2.10E-23 |
| meiotic cell cycle | 1.62E-21 |
| meiosis | 2.74E-21 |
| M phase of meiotic cell cycle | 2.74E-21 |

Table 9 gives the Q values of one of the network modules of Dataset 1.

Q-value of one of the network modules of Dataset 4

| GO term | Q-value |
|---|---|
| synaptic transmission | 1.29E-13 |
| glutamate receptor activity | 3.77E-11 |
| synapse | 6.68E-08 |
| regulation of synaptic transmission | 3.06E-07 |
| regulation of transmission of nerve impulse | 4.00E-07 |
| regulation of neurological system process | 7.07E-07 |
| regulation of system process | 5.38E-05 |
| synapse part | 8.11E-04 |
| cell projection part | 9.46E-04 |

Table 10 gives the Q values of one of the network modules of Dataset 4.

Q-value of one of the network modules of Dataset 5

| GO term | Q-value |
|---|---|
| regulation of synaptic transmission | 6.438756E-7 |
| regulation of transmission of nerve impulse | 9.297736E-7 |
| regulation of neurological system process | 1.533111E-6 |
| intermediate filament cytoskeleton organization | 2.056912E-6 |
| intermediate filament-based process | 5.218967E-6 |
| neurofilament cytoskeleton | 1.109702E-5 |
| intermediate filament organization | 1.454524E-5 |
| synapse part | 2.543099E-5 |
| growth factor binding | 2.571707E-5 |
| intermediate filament | 2.938762E-5 |
| positive regulation of neurogenesis | 9.6019E-5 |

Table 11 gives the Q values of one of the network modules of Dataset 5.

We have used GeneMANIA **automatically selected weighting method**. Visualizations of some of the co-expression networks generated by GeneMANIA for the datasets are presented in Figures 3, 4 and 5.

The weightage of co-expression by Module Miner

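| Dataset | Network module | Co-expression (%) |
|---|---|---|
| Dataset 1 | C1 | 99.57% |
| Dataset 1 | C2 | 88.89% |
| Dataset 2 | C1 | 59.23% |
| Dataset 2 | C2 | 77.27% |
| Dataset 3 | C1 | 92.13% |
| Dataset 3 | C2 | 88.89% |
| Dataset 3 | C3 | 92.33% |
| Dataset 3 | C4 | 67.65% |
| Dataset 4 | C1 | 81.85% |
| Dataset 5 | C1 | 76.62% |

Table 12 gives the percentage of co-expression in the network modules produced by Module Miner.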

**Visualization of co-expressed network** Figure 3 presents the co-expressed network produced by GeneMANIA for Dataset 1.

**Visualization of co-expressed network** Figure 4 presents the co-expressed networks produced by GeneMANIA for Dataset 2 and Dataset 3.

**Visualization of co-expressed network** Figure 5 presents the co-expressed networks produced by GeneMANIA for Dataset 4 and Dataset 5.

Conclusion and future work

In this paper, an effective gene expression similarity measure, NMRS, is introduced and used to construct the co-expression network through a signum-function-based hard thresholding scheme. Network modules are then extracted from the network using a maximum spanning tree and the topological overlap matrix. As future work, a soft thresholding method could be used to construct the adjacency matrix in order to reduce information loss. Generalized Topological Overlap Measure

Competing interests

The author(s) declare that they have no competing interests.

Acknowledgment

This paper is an outcome of a research project supported by (1) DST, Govt. of India in collaboration with ISI, Kolkata and (2) National Science Foundation, USA under grants CNS-095876 and CNS-085173.

This article has been published as part of