Lombardi Comprehensive Cancer Center, Georgetown University, 4000 Reservoir Rd, Washington, DC, USA

Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, 4300 Wilson Blvd., Arlington, VA, USA

School of Biology and Ecology, University of Maine, Orono, ME 04469, USA

Abstract

Background

Inferring a gene regulatory network (GRN) from high throughput biological data is often an under-determined and challenging problem for the following reasons: (1) thousands of genes are involved in one living cell; (2) complex dynamic and nonlinear relationships exist among genes; (3) a substantial amount of noise is involved in the data; and (4) the typical sample size is very small compared to the number of genes. We hypothesize that we can enhance our understanding of gene interactions in important biological processes (differentiation, cell cycle, development, etc.) and improve the inference accuracy of a GRN by (1) incorporating prior biological knowledge into the inference scheme, (2) integrating multiple biological data sources, and (3) decomposing the inference problem into smaller network modules.

Results

This study presents a novel GRN inference method that integrates gene expression data and gene functional category information. The inference is based on a module network model consisting of two parts: module selection and network inference. The former determines the optimal modules through fuzzy c-means (FCM) clustering and the incorporation of gene functional category information, while the latter uses a hybrid of particle swarm optimization and recurrent neural network (PSO-RNN) methods to infer the underlying network among modules. Our method is tested on real data from two studies: the development of the rat central nervous system (CNS) and the yeast cell cycle process. The results are evaluated by comparison with previously published results and gene ontology annotation information.

Conclusion

The reverse engineering of GRNs from time course gene expression data is a major obstacle in systems biology due to the limited number of time points. Our experiments demonstrate that the proposed method can address this challenge by: (1) preprocessing gene expression data (e.g. normalization and missing value imputation) to reduce data noise; (2) clustering genes based on gene expression data and gene functional category information to identify biologically meaningful modules, thereby reducing the dimensionality of the data; and (3) modeling the GRN among the modules with the PSO-RNN method to capture their nonlinear and dynamic relationships. The method is shown to lead to biologically meaningful modules and networks among the modules.

Background

In recent years, high throughput biotechnologies have made large-scale gene expression surveys a reality. Gene expression data provide an opportunity to directly observe the activities of thousands of genes simultaneously. However, computational methods that can handle the complexity (noise, large numbers of variables, high dimensionality, etc.) of these biological data are often unavailable

Cluster analysis has been used to separate genes into groups based on their expression profiles

A variety of continuous or discrete, static or dynamic, quantitative or qualitative models have been proposed for inference of biological networks. These include biochemically driven methods

As various sources of biological data become available, it is both necessary and helpful to infer gene regulatory networks (GRNs) not from a single data source alone, but from the fusion of multiple complementary data sources. A few previous studies combined time course gene expression data with other data sources, such as genomic location data

Our previous studies

Results and discussion

In this section, we demonstrate the inference ability of the proposed method via two experimental studies: the development of the rat central nervous system (CNS) and the yeast cell cycle process. Both data sets were preprocessed in the original studies

Rat CNS data

This case study is based on the data published in

The module selection result and corresponding modules are shown in Figure

Module selection of rat CNS data

**Module selection of rat CNS data**. The module selection of rat CNS data is shown in these figures: A. Estimate of the optimal number of modules: the optimal number of FCM clusters is five, which agrees with the result presented in

The reverse engineering algorithm is applied to the four modules for network inference. The final reconstructed network was built by choosing significant parameters as described in the Methods section. Our results were compared to those obtained by Deng

Comparison of results from three studies

**Comparison of results from three studies**. A. Our method; B. Deng

The time-course of observed expression and prediction for modules of CNS data

**The time-course of observed expression and prediction for modules of CNS data**.

Yeast cell cycle data

The yeast cell cycle data presented in

Spellman et al.

Mapping of expression clusters to functional gene classes.

|       | G1      | S      | S/G2   | G2/M    | M/G1   |
|-------|---------|--------|--------|---------|--------|
| wave1 | **210** | 2      | 2      | 2       | **25** |
| wave2 | 46      | **63** | **67** | 2      | 1      |
| wave3 | 0       | 1      | **38** | **125** | 3      |
| wave4 | 9       | 1      | 2      | **24**  | **58** |
| wave5 | **35**  | 4      | 12     | 26      | **42** |

This table shows the number of genes with different peak times for each cluster in the yeast cell cycle data. From the highlighted numbers in the table, we can characterize the modules: it is clear that Module 1 is responsible for genes with peaks in M/G1 or G1, followed by Module 2, and so on.

Module selection of yeast cell cycle data

**Module selection of yeast cell cycle data**. A. Estimate of the optimal number of modules; B. Five modules (waves) based on the optimal number in A.

The PSO-RNN algorithm is applied to the network inference of the five modules. The final reconstructed network is inferred by choosing significant parameters as described in the Methods section. Unlike the CNS data, we could not compare our results to other publications due to the lack of similar studies. Instead, we illustrate the results according to their peak attributes. As shown in Figure

Inferred yeast module network

**Inferred yeast module network**. All the regulations identified in the yeast module network are positive. Considering the characteristics of the modules and the directions of the arcs between them, the obtained network is believed to encode a regulatory relationship among modules that is largely consistent with the time sequence of the phases in the cell cycle. The relationships among modules indicate that each module has some regulatory impact on its follow-up modules, according to the peaks each module represents. There is one exception: Module 5 up-regulates Module 4, which suggests that some feedback may exist in the yeast cell cycle process.

The time-course of observed expression and prediction for modules of

**The time-course of observed expression and prediction for modules of cdc15 data**.

Conclusion

Reverse engineering of GRNs from time course gene expression data is a major obstacle in systems biology due to the limited number of time points. We demonstrate that our method can address this challenge by decomposing the reverse engineering problem into modules in two steps: the gene expression data are clustered into biologically significant modules to reduce the problem dimensionality, and the network is built based on the expression profiles of the modules. We evaluate the performance of the algorithm using two real data sets: the rat CNS data and the yeast cell cycle data. The results indicate that biologically meaningful modules are selected and biologically plausible networks between modules are estimated. For example, in the CNS data, the inferred network at the module level is a combination of the networks verified in the other two studies

Methods

The proposed method includes two parts: module selection and network inference. In the module selection part, we cluster the genes by FCM clustering; the optimal number of clusters is determined by the relative entropy estimate method, which incorporates gene functional category information, and each cluster is considered a module representing certain co-regulated genes. After the modules are determined, the PSO-RNN inference algorithm is applied. In this algorithm, each module is considered a neuron in the RNN structure, and any regulation between two modules is a weight in the RNN. To find the best-fit network among the modules, a generalized PSO method, combining basic PSO with a neural network pruning technique, is used to determine the RNN structure and its parameters.

Module selection

Clustering has been a major method to partition the genes into groups of co-expressed genes

FCM clustering

FCM is a method of clustering which allows a data point to belong to two or more clusters. The detailed description of FCM method can be found in
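As an illustration of the clustering step, here is a minimal NumPy sketch of the standard fuzzy c-means update rules (a generic implementation, not the authors' code; the fuzzifier m = 2 and the convergence tolerance are assumed defaults):

```python
import numpy as np

def fcm(data, n_clusters, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Fuzzy c-means sketch: returns (centers, membership matrix U).

    data: (n_samples, n_features); U[i, k] is the degree to which
    sample i belongs to cluster k (each row of U sums to 1).
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    # random initial memberships, normalized per sample
    U = rng.random((n, n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        Um = U ** m
        # centers are membership-weighted means of the samples
        centers = (Um.T @ data) / Um.sum(axis=0)[:, None]
        # distances from every sample to every center
        d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)  # guard against division by zero
        # standard FCM membership update: U ~ d^(-2/(m-1)), normalized
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return centers, U
```

Unlike hard k-means, the membership matrix U lets one gene contribute to several modules, which matches the biological reality of genes participating in multiple processes.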

Estimating the number of modules

We propose a new computational method to determine the number of biologically meaningful modules. This is accomplished by incorporating gene functional category information into the FCM cluster analysis and applying the relative entropy to measure the biological significance of a cluster as a candidate network module. The relative entropy

where Λ is the sample space of

In a gene expression data set, all genes can be characterized into categories according to their functions or other properties (e.g. gene peak phase in the cell cycle process). For example, from the gene functional category information we can obtain the probability distribution of categories for the data set. The relative entropy of each cluster in one FCM clustering, defined in (2), is considered the estimate of that cluster's biological significance, and the number of clusters with maximum relative entropy, defined in (3), is considered the optimal module number
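A small sketch of this scoring idea (the exact weighting in equations (2) and (3) is not reproduced here; this is a hedged illustration using the plain Kullback-Leibler divergence between a cluster's category distribution and the background distribution of the whole data set):

```python
import numpy as np

def relative_entropy(p, q, eps=1e-12):
    """Relative entropy (KL divergence) D(p || q) in bits between two
    discrete distributions over the same category space."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log2(p / q)))

def cluster_significance(cluster_counts, background_counts):
    """Score a cluster by how far its functional-category counts
    diverge from the background category counts of all genes: a
    cluster enriched in one category scores high, a cluster that
    mirrors the background scores near zero."""
    return relative_entropy(cluster_counts, background_counts)
```

A cluster whose members all share one functional category thus scores higher than a mixed cluster, which is the sense in which relative entropy measures biological significance.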

Network inference

In building an RNN to infer a network of interactions, the identification of the correct structure and determination of the free parameters (weights and biases) to mimic measured data is a challenging task given the limited available quantity of data and complex search space. In this paper, we apply PSO and neural network pruning methods to select the optimal architecture of an RNN and update its free parameters.

Network model

The genetic regulation model can be represented by a recurrent neural network formulation

where x_i is the gene expression level of the i-th gene (1 ≤ i ≤ N), w_ij represents the effect of the j-th gene on the i-th gene (1 ≤ i, j ≤ N), and b_i denotes the bias for the i-th gene.

When information about the complexity of the underlying system is available, a suitable activation function can be chosen (e.g. linear, logistic, sigmoid, threshold, hyperbolic tangent sigmoid, or Gaussian). If no prior information is available, our algorithm uses the sigmoid function by default. A negative value of w_ij represents inhibition of the i-th gene by the j-th gene, whereas a positive value of w_ij represents activation control of the j-th gene on the i-th gene. If w_ij is zero, the j-th gene has no influence on the i-th gene. The discrete form of (1) can be written as
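The discrete-time dynamics can be sketched as follows. This is a hedged illustration, not the authors' code: a forward-Euler discretization with a sigmoid activation is assumed, and the time constant tau and step size dt are illustrative parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rnn_step(x, W, b, dt=0.5, tau=1.0):
    """One discrete-time update of the RNN gene model (assumed Euler form):
    x(t + dt) = x(t) + (dt / tau) * (sigmoid(W @ x(t) + b) - x(t)).
    W[i, j] is the effect of gene j on gene i; b[i] is the bias of gene i."""
    return x + (dt / tau) * (sigmoid(W @ x + b) - x)

def simulate(x0, W, b, steps, dt=0.5):
    """Roll the model forward, returning a (steps + 1, N) trajectory."""
    traj = [np.asarray(x0, float)]
    for _ in range(steps):
        traj.append(rnn_step(traj[-1], W, b, dt))
    return np.array(traj)
```

With expression levels initialized in [0, 1], the sigmoid keeps every trajectory bounded in [0, 1], mirroring normalized expression data.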

Figure

The description of a GRN by a RNN model

**The description of a GRN by a RNN model**. A: A fully connected RNN model, where the output of each neuron is fed back to its input after a unit delay and is connected to other neurons. It can be used as a simple form mimicking a NM, where a gene cluster or a TF is represented by a neuron. B: Details of a single recurrent neuron.

Training the RNNs involves determining the optimal weights w_ij and biases b_i. As a cost function, we use the mean-squared error between the expected output and the network output across time (from the initial time point 0 to the final time point T)

where x_i(t) denotes the expression level of the i-th neuron (entity) at time t
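This cost is straightforward to compute; a minimal sketch (array shapes are assumed to be time points by entities):

```python
import numpy as np

def mse_cost(predicted, observed):
    """Mean-squared error between predicted and observed trajectories,
    averaged over all entities and all time points.
    Both arrays have shape (T + 1, N)."""
    predicted = np.asarray(predicted, float)
    observed = np.asarray(observed, float)
    return float(np.mean((predicted - observed) ** 2))
```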

Training algorithm

There exist many algorithms for RNN training in the literature, e.g., back-propagation through time (BPTT)

Here, we use PSO

In PSO, each particle is represented as a vector x_i, the position of the i-th particle in the search space. The core of the PSO algorithm is the position update rule (7), which governs the movement of each particle.

At any instant, each particle is aware of its individual best position and the global best position found by the swarm; c_1 and c_2 are constants that weight particle movement in the direction of the individual best and global best positions, respectively; and r_1,j and r_2,j are uniformly distributed random numbers.

where

The constriction factor is computed from the acceleration constants c_1 and c_2 as in (8).

The key strength of the PSO algorithm is the interaction among particles. The second term in (7),

The algorithm consists of repeated application of the velocity and position update rules presented above. Termination can occur by specification of a minimum error criterion, a maximum number of iterations, or alternatively when the position change of each particle is small enough to assume that each particle has converged.
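The loop above can be sketched in NumPy as follows. This is a generic constriction-factor PSO (Clerc's formula with c_1 = c_2 = 2.05 from the parameter table), not the authors' exact implementation; the swarm size, iteration count, and iteration-limit termination used here are illustrative choices.

```python
import numpy as np

def pso(cost, dim, n_particles=30, iters=200, c1=2.05, c2=2.05,
        x_range=(-5.0, 5.0), seed=0):
    """Minimize cost(x) over a dim-dimensional box with constriction PSO."""
    rng = np.random.default_rng(seed)
    phi = c1 + c2  # must exceed 4 for the constriction formula
    chi = 2.0 / abs(2.0 - phi - np.sqrt(phi * phi - 4.0 * phi))
    lo, hi = x_range
    x = rng.uniform(lo, hi, (n_particles, dim))
    v = np.zeros((n_particles, dim))
    pbest = x.copy()                                 # individual bests
    pbest_cost = np.array([cost(p) for p in x])
    g = pbest[pbest_cost.argmin()].copy()            # global best
    for _ in range(iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # velocity update: inertia plus pulls toward pbest and gbest
        v = chi * (v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x))
        x = np.clip(x + v, lo, hi)                   # position update
        c = np.array([cost(p) for p in x])
        improved = c < pbest_cost
        pbest[improved] = x[improved]
        pbest_cost[improved] = c[improved]
        g = pbest[pbest_cost.argmin()].copy()
    return g, float(pbest_cost.min())
```

In the PSO-RNN setting, each particle's position vector is the flattened set of RNN weights and biases, and cost is the MSE between the network's trajectory and the measured module expression profiles.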

Selection of appropriate values for the free parameters of PSO plays an important role in the algorithm's performance. The parameter settings used in this study can be found in Table

PSO Parameter setting

| Parameter | Value |
|---|---|
| Maximum search space range, \|x_max\| | [-5, 5] |
| Acceleration constants, c_1 and c_2 | 2.05, 2.05 |
| Size of swarm | 50–150 |

PSO-RNN hybrid algorithm

In this section, we illustrate how PSO optimizes the parameters of an RNN and how the structure of the RNN is pruned to mimic the response of an unknown network of interactions. Since PSO is a stochastic algorithm, a single solution may not reflect the underlying network. We therefore collect a number of solutions from the PSO-RNN algorithm and use them to determine a single output network that receives the majority vote. Specifically, we performed 100 runs for each network inference. If the absolute value of the average of a parameter over the 100 runs is larger than its standard deviation, the parameter is deemed significant and selected for the final network; otherwise it is set to zero. The following reverse engineering procedure is utilized:

1. Run the reverse engineering algorithm without imposing any particular constraints (except the maximum-allowed values) on the network parameters. Perform one hundred runs, and select the networks with mean squared error (MSE) below a certain threshold for further network parameter evaluation.

2. Determine the average and standard deviations of the network parameters using the results from Step 1.

3. Set non-significant parameters (if any) to zero. If there are no non-significant parameters, stop the procedure.

4. Return to the reverse engineering algorithm with the non-significant weights set to zero. If the results (measured by the fitness) are as good, or nearly as good, as those from the previous set of runs, form the network averages and return to Step 3. If instead the results are worse than in the previous run, discontinue the procedure.

Summarizing the above process, the overall algorithm is illustrated in Figure

The flowchart of the proposed approach

**The flowchart of the proposed approach**. The flowchart of the proposed approach is illustrated here, involving two main components: (1) module selection is performed after data preprocessing (including missing value imputation and normalization) to produce the module expression patterns; (2) the reverse engineering procedure PSO-RNN determines both the structure and the corresponding parameters of an RNN representing the underlying structure of a module network.

List of abbreviations used

(CNS): central nervous system; (FCM): fuzzy c-means; (GRN): gene regulatory network; (MSE): mean squared error; (NM): network motif; (PSO): particle swarm optimization; (RNN): recurrent neural network.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

Y. Zhang and H.W. Ressom designed the computational approach, wrote the code, analyzed the experimental results, and drafted the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This article has been published as part of