School of Computing, University of Southern Mississippi, Hattiesburg, MS 39406, USA

Laboratory of Molecular Immunology, National Heart Lung and Blood Institute, National Institutes of Health, Bethesda, MD 20892, USA

Environmental Services, SpecPro Inc., San Antonio, TX 78216, USA

Environmental Laboratory, U.S. Army Engineer Research and Development Center, Vicksburg, MS 39180, USA

Department of Internal Medicine, Rush University Medical Center, Chicago, IL 60612, USA

Abstract

Background

The State Space Model (SSM) is a relatively new approach to inferring gene regulatory networks (GRNs). It requires less computational time than Dynamic Bayesian Networks (DBN). The linear SSM has two types of variables: observed variables and hidden variables. SSM uses an iterative method, Expectation-Maximization, to infer regulatory relationships from microarray datasets. The hidden variables cannot be directly observed from experiments, and the choice of their number has a significant impact on the accuracy of network inference. In this study, we used SSM to infer GRNs from synthetic time series datasets, investigated Bayesian Information Criterion (BIC) and Principal Component Analysis (PCA) approaches to determining the number of hidden variables in SSM, and evaluated the performance of SSM in comparison with DBN.

Method

True GRNs and synthetic gene expression datasets were generated using GeneNetWeaver. Both DBN and linear SSM were used to infer GRNs from the synthetic datasets. The inferred networks were compared with the true networks.

Results

Our results show that inference precision varied with the number of hidden variables. For some regulatory networks the inference precision of DBN was higher, whereas SSM performed better in other cases. Although the overall performance of the two approaches is comparable, SSM is much faster and capable of inferring much larger networks than DBN.

Conclusion

This study provides useful information for handling the hidden variables and improving the inference precision.

Introduction

Microarrays can simultaneously measure the expression of thousands of genes. In the past decade or so, many time series experiments have employed microarrays to profile the temporal change of gene expression. For instance, one can retrieve many time-course gene expression datasets from the Gene Expression Omnibus database (

Any effective GRN inference method has to cope well with the large number of genes and the small number of time points that characterize microarray datasets. During the past few decades, many methods have been developed, such as Dynamic Bayesian Network (DBN)

A State Space Model (SSM)

In this study, we investigated the performance of SSM and addressed the effect of the number of hidden variables on inference accuracy. An intuitive choice is to set the number of hidden variables equal to the number of observed variables, but the SSM may then fail to converge. To make it feasible to infer a large network from a limited number of time points, we need to determine the number of hidden variables in SSM.

Methods

In this section, we briefly present the SSM method and two approaches (BIC and PCA) for determining the number of hidden variables in GRN inference.

State Space Model

There are two kinds of variables in SSM: observed variables and hidden variables.


We used expectation-maximization (EM)
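
To make the distinction between hidden and observed variables concrete, the following sketch simulates a generic linear-Gaussian state space model in its standard textbook form, x_{t+1} = A x_t + w_t and y_t = C x_t + v_t. This form, the matrix names `A` and `C`, the noise level and the dimensions are assumptions chosen for illustration and may differ from the exact formulation used in this study.

```python
import numpy as np

def simulate_linear_ssm(A, C, n_steps, noise_std=0.1, seed=0):
    """Simulate a generic linear-Gaussian SSM (assumed textbook form):
        x_{t+1} = A x_t + w_t   (hidden state transition)
        y_t     = C x_t + v_t   (observation, e.g. gene expression)
    """
    rng = np.random.default_rng(seed)
    m = A.shape[0]  # number of hidden variables
    p = C.shape[0]  # number of observed variables (genes)
    x = rng.normal(size=m)  # random initial hidden state
    hidden = np.empty((n_steps, m))
    observed = np.empty((n_steps, p))
    for t in range(n_steps):
        hidden[t] = x
        observed[t] = C @ x + noise_std * rng.normal(size=p)
        x = A @ x + noise_std * rng.normal(size=m)
    return hidden, observed

# Hypothetical example: 2 hidden variables driving 5 observed genes.
A = np.array([[0.9, 0.1],
              [-0.1, 0.8]])
C = np.random.default_rng(1).normal(size=(5, 2))
hidden, observed = simulate_linear_ssm(A, C, n_steps=20)
```

Only `observed` would be available in an experiment. An inference procedure such as EM has to estimate the model parameters and the hidden trajectory from it.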

Bayesian Information Criterion

As mentioned above, how to determine the number of hidden variables is an important factor affecting the accuracy of inferred GRNs.

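
As a rough illustration of how BIC can be used to choose the number of hidden variables, the sketch below scores several candidate model sizes with the common convention BIC = -2 ln L + k ln n, where lower is better. The parameter count in `ssm_param_count`, the sample size and the log-likelihood values are hypothetical placeholders, not results from this study, and the sign convention may differ from the one used by the authors.

```python
import math

def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion; lower is better in this convention."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)

def ssm_param_count(m, p):
    """Hypothetical free-parameter count for m hidden and p observed variables:
    transition matrix (m*m), observation matrix (p*m) and diagonal noise
    variances (m + p). This is an assumption for illustration only."""
    return m * m + p * m + m + p

def select_hidden_dim(log_likelihoods, p, n_samples):
    """Return the candidate m with the lowest BIC, plus all scores."""
    scores = {m: bic(ll, ssm_param_count(m, p), n_samples)
              for m, ll in log_likelihoods.items()}
    best = min(scores, key=scores.get)
    return best, scores

# Made-up log-likelihoods for m = 1, 2, 3 (placeholders, not measured values).
toy_loglik = {1: -520.0, 2: -450.0, 3: -440.0}
best_m, scores = select_hidden_dim(toy_loglik, p=10, n_samples=200)
```

In this toy case the gain in likelihood from m = 2 to m = 3 does not outweigh the penalty for the extra parameters, so the smaller model is preferred.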

Principal Component Analysis

Because the number of time points is usually much smaller than the number of genes, a microarray dataset

SSM uses the same idea as PCA does
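
One plausible way to turn PCA into a rule for the number of hidden variables is to count how many principal components are needed to explain most of the variance in the expression matrix. The sketch below does this with a singular value decomposition. The 99% threshold, the function name and the toy data are assumptions for illustration, not the procedure or data of this study.

```python
import numpy as np

def n_components_for_variance(Y, target=0.9):
    """Y has shape (time points, genes). Return the smallest number of
    principal components whose cumulative explained variance reaches
    `target`, together with the per-component explained-variance ratios."""
    Yc = Y - Y.mean(axis=0)                  # center each gene
    s = np.linalg.svd(Yc, compute_uv=False)  # singular values, descending
    explained = s**2 / np.sum(s**2)
    cumulative = np.cumsum(explained)
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(s)), explained

# Toy data: 20 time points, 8 genes driven by 2 latent factors (no noise).
rng = np.random.default_rng(0)
latent = rng.normal(size=(20, 2))
loadings = rng.normal(size=(2, 8))
Y = latent @ loadings
k, explained = n_components_for_variance(Y, target=0.99)
```

Because the toy matrix has rank 2 by construction, at most two components are needed here. Real expression data is noisy, so the chosen threshold has a direct effect on the resulting number.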

Results and discussion

Two types of synthetic datasets generated by using GeneNetWeaver

We compared the precision of GRNs inferred by SSM only with that of GRNs inferred by the time-delayed DBN, because the time-delayed DBN accounts for transcriptional time lag and therefore achieves higher precision than the traditional DBN
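
For reference, precision in this kind of comparison is usually the fraction of predicted edges that also appear in the true network, TP / (TP + FP). The following is a minimal sketch under that standard definition. Representing edges as (regulator, target) pairs and the gene names used are assumptions for illustration.

```python
def edge_precision(predicted_edges, true_edges):
    """Fraction of predicted directed edges that are present in the true network.
    Edges are (regulator, target) tuples; direction matters."""
    predicted = set(predicted_edges)
    if not predicted:
        return 0.0  # no predictions: define precision as 0 to avoid dividing by zero
    true_positives = len(predicted & set(true_edges))
    return true_positives / len(predicted)

# Hypothetical example: one of the two predicted edges has the wrong direction.
true_net = [("geneA", "geneB"), ("geneB", "geneC")]
inferred = [("geneA", "geneB"), ("geneC", "geneB")]
precision = edge_precision(inferred, true_net)
```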

**The relationship between precision and the number of hidden variables by using SSM with E. coli and yeast datasets**.

**Precisions of GRNs inferred by SSM and DBN from synthetic E. coli and yeast datasets, respectively**. 'Random' denotes random guessing. 'm = 1' means that the number of hidden variables is set to 1 in SSM; 'm = 2,..., 5' have similar meanings. The first and second halves of Figure 2 are for the E. coli and yeast datasets, respectively.

The precision of a GRN inferred by SSM or DBN may depend on network size and the number of time points. To systematically compare the performance of SSM and DBN, we generated synthetic datasets of 10 networks, each with 50 genes and 101 time points, for E. coli and yeast, respectively. One true E. coli network and networks inferred using SSM and DBN are shown in Figure

**A true E. coli network with 50 genes and 169 edges generated from GeneNetWeaver**.

**Inferred GRN with 50 edges by using SSM with 2 hidden variables**.

**Inferred GRN with 50 edges by using DBN**.

It is worthwhile to note that when the number of hidden variables is small, some regulations are bidirectional in GRNs obtained by SSM, which means gene

Another advantage of SSM over DBN is that SSM can adjust the number of edges in the inferred GRN. DBN always chooses the network that gives the highest score, so its number of edges is fixed. From equation (2) one can see that the network given by SSM is a matrix
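
Because the SSM output is a matrix of interaction strengths rather than a single fixed graph, the number of reported edges can be tuned by keeping only the strongest entries. The sketch below keeps the `k` entries with the largest absolute value. The function name, the convention that `W[i, j]` describes the influence of gene j on gene i, and the exclusion of self-loops are assumptions for illustration.

```python
import numpy as np

def top_k_edges(W, k, allow_self=False):
    """Return up to k (target, regulator, weight) triples with the largest |weight|.
    Assumes W[i, j] is the influence of gene j on gene i (illustrative convention)."""
    W = np.asarray(W, dtype=float)
    scores = np.abs(W).copy()
    if not allow_self:
        np.fill_diagonal(scores, -np.inf)  # exclude self-regulation
    order = np.argsort(scores, axis=None)[::-1]  # flat indices, strongest first
    edges = []
    for flat in order:
        if len(edges) >= k:
            break
        i, j = np.unravel_index(flat, scores.shape)
        if not np.isfinite(scores[i, j]):
            continue  # skip excluded diagonal entries
        edges.append((int(i), int(j), float(W[i, j])))
    return edges

# Hypothetical 3-gene interaction matrix.
W = np.array([[0.0, 0.5, -2.0],
              [0.1, 0.0, 0.3],
              [1.5, -0.2, 0.0]])
edges = top_k_edges(W, k=2)
```

Varying `k` gives networks of different density from the same fitted model, which is also what makes threshold-based evaluation such as a ROC curve possible.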

**ROC curve for E. coli and yeast datasets with 50 genes by using SSM with 2 hidden variables**. The false and true positive rates are averaged rates over 10 corresponding GRNs.
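
A ROC curve of this kind can be obtained by sweeping a threshold over the absolute interaction strengths and recording the true and false positive rates at each step. The following sketch does this for a single network. The function name, the exclusion of self-loops and the toy matrices are assumptions for illustration. Averaging over several networks, as in the figure, would require evaluating each network separately and averaging the rates.

```python
import numpy as np

def roc_points(W, true_adj):
    """Compute (FPR, TPR) points by adding edges in order of decreasing |W|.
    Self-loops are ignored; tied scores are not grouped in this simple sketch."""
    scores_full = np.abs(np.asarray(W, dtype=float))
    truth_full = np.asarray(true_adj, dtype=bool)
    mask = ~np.eye(scores_full.shape[0], dtype=bool)  # off-diagonal entries only
    scores = scores_full[mask]
    labels = truth_full[mask]
    order = np.argsort(-scores, kind="stable")
    labels = labels[order]
    tp = np.cumsum(labels)
    fp = np.cumsum(~labels)
    tpr = np.concatenate([[0.0], tp / max(int(labels.sum()), 1)])
    fpr = np.concatenate([[0.0], fp / max(int((~labels).sum()), 1)])
    return fpr, tpr

# Hypothetical 3-gene example in which the two true edges get the highest scores.
W = np.array([[0.0, 0.9, 0.1],
              [0.2, 0.0, 0.8],
              [0.7, 0.05, 0.0]])
truth = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [0, 0, 0]])
fpr, tpr = roc_points(W, truth)
# Area under the curve by the trapezoidal rule.
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
```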

Conclusions

Determining the number of hidden variables in SSM is important in GRN inference. Our results using synthetic time series gene expression datasets of

List of abbreviations used

SSM: State Space Model; DBN: Dynamic Bayesian Networks; GRNs: Gene regulatory networks; BIC: Bayesian Information Criterion; PCA: Principal Component Analysis; PBN: Probabilistic Boolean Network.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JZ, PG and YD initiated the project. WX and PL developed and implemented the algorithms. WX and JZ performed in-depth analysis of results and drafted the paper. PG, NW and EJP participated in network inference and analysis. PG, EJP and YD revised the paper. All authors read and approved the final manuscript.

Acknowledgements

This work was supported by the US Army Corps of Engineers Environmental Quality Program under contract #W912HZ-08-2-0011 and the NSF EPSCoR project "Modeling and Simulation of Complex Systems" (NSF #EPS-0903787). Permission was granted by the Chief of Engineers to publish this information.

This article has been published as part of