Abstract
Background
Causal networks based on the vector autoregressive (VAR) process are a promising statistical tool for modeling regulatory interactions in a cell. However, learning these networks is challenging due to the low sample size and high dimensionality of genomic data.
Results
We present a novel and highly efficient approach to estimate a VAR network. This proceeds in two steps: (i) improved estimation of VAR regression coefficients using an analytic shrinkage approach, and (ii) subsequent model selection by testing the associated partial correlations. In simulations this approach outperformed for small sample size all other considered approaches in terms of true discovery rate (number of correctly identified edges relative to the significant edges). Moreover, the analysis of expression time series data from Arabidopsis thaliana resulted in a biologically sensible network.
Conclusion
Statistical learning of largescale VAR causal models can be done efficiently by the proposed procedure, even in the difficult data situations prevalent in genomics and proteomics.
Availability
The method is implemented in R code that is available from the authors on request.
Background
The vector autoregressive regression (VAR) model is an approach to describe the interaction of variables through time in a complex multivariate system. It is very popular in economics [1] but with few exceptions [2] it has not been widely used in systems biology, where it could be employed to model genetic networks or metabolic interactions. One possible reason for this is that while the statistical properties of the VAR model are well explored [3], its estimation from sparse data and subsequent model selection is very challenging due to the large number of parameters involved [4].
In this paper we develop a procedure for effectively learning the VAR model from small sample genomic data. In particular, we describe a novel model selection procedure for learning causal VAR networks from time course data with only a few time points, and no or little replication. This procedure is based on regularized estimation of VAR coefficients, followed by subsequent simultaneous significance testing of the corresponding partial correlation coefficients.
Once the VAR model has been learned from the data, it allows to elucidate possible underlying causal mechanisms by inspecting the Granger causality graph implied by the nonzero VAR coefficients.
The remainder of the paper is organized as follows. In the next section we first give the definition of a vector autoregressive process and recall the standard estimation. Subsequently, we describe our approach to regularized inference and to network model selection. This is followed by computer simulations comparing a variety of alternative approaches. Finally, we analyze data from an Arabidopsis thaliana expression time course experiment.
Methods
Vector autoregressive model
We consider vectorvalued time series data x(t) = (x_{1}(t),...,x_{p}(t)). Each component of this row vector corresponds to a variable of interest, e.g., the expression level of a specific gene or the concentration of some metabolite in dependence of time. The vector autoregressive model specifies that the value of x(t) is a linear combination of those of earlier time points, plus noise,
In this formula m is the order of the VAR process, L the time lag, and c a 1 × p vector of means. The errors ∈_{i }are assumed to have zero mean and a p × p positive definite covariance matrix Σ. The matrices B_{i }with dimension p × p represent the dynamical structure and thus contain the information relevant for reading off the causal relationships.
The autoregressive model has the form of a standard regression problem. Therefore, estimation of the matrices B_{i }is straightforward. A special case considered in this paper is when both m and L are set to 1. Then the above equation reduces to the VAR(1) process
x(t + 1) = c + x(t)B + ε. (2)
We now denote the centered matrices of observations corresponding to x(t + 1) and x(t) by X_{f }("future") and X_{p }("past"), respectively, i.e.
This is also the maximum likelihood (ML) estimate assuming the normal distribution. The coefficients of higherorder VAR models may be obtained in a corresponding fashion [3].
Small sample estimation using JamesSteintype shrinkage
Genomic time course data contain only few time points (typically around n = 10) and often little or no replication – hence the above restriction on VAR(1) models with unit lag. Furthermore, it is known that for small sample size the least squares and maximum likelihood methods lead to statistically inefficient estimators. Therefore, application of the VAR model to genomics data requires some form of regularization. For instance, a full Bayesian approach could be used. However, for the VAR model the choice of a suitable prior is difficult [4].
Here, as a both computationally and statistically efficient alternative, we propose to apply JamesSteintype shrinkage, a method related to empirical Bayes [5,6]. This procedure has the advantage that it is computationally as simple as OLS, yet still produces efficient estimates for small samples.
Following [6,7] we now review how an unconstrained covariance matrix may be estimated using shrinkage. In the next section we then show how this result may be used to obtain shrinkage estimates of VAR coefficients.
Assuming centered data X for p variables (columns) the unbiased empirical estimator of the covariance matrix is
with
and
The particular choice of the shrinkage intensities
Shrinkage estimation of VAR coefficients
Small sample shrinkage estimates of VAR regression coefficients may be obtained by appropriately substituting the empirical by the shrinkage covariance. More specifically, we need to proceed as follows:
1. We combine the centered observations X_{p }and X_{f }into a joint matrix Φ = [X_{p}X_{f}]. Note that F contains twice as many columns as either X_{p }or X_{f}.
2. Next, we consider the (n  1) multiple of the empirical covariance matrix, S = Φ^{T}Φ, noting that S contains the two submatrices S_{1 }= X_{p }^{T }X_{p }and S_{2 }= X_{p }^{T}X_{f}. This allows to write the OLS estimate of VAR coefficients as
3. We replace the empirical covariance matrix S by a shrinkage estimate.
4. From S* we determine the submatrices
By decomposing S* using the SVD or Cholesky algorithm it is possible to reconstruct pseudodata matrices
VAR network model selection
The network representing potential directed causal influences is given by the nonzero entries in the matrix of VAR coefficient. For an extensive discussion of the meaning and interpretation of the implied Granger (non)causality we refer to [8].
As
Specifically, consider in the VAR(1) model the multiple regression that connects the first variable x_{1}(t + 1) at time point t + 1 with all variables x_{1}(t),...,x_{p}(t) at the previous time point t,
If in this equation the roles of x_{k}(t) and x_{1}(t + 1) are reversed,
the partial correlation between the two variables is the geometric mean of the corresponding
regression coefficients, times their sign, i.e.
Once the partial correlations in the VAR model are computed, we use the "local fdr" approach of [11] to identify significant partial correlations, similar to the network model selection for graphical Gaussian models (GGMs) of [9]. Note that unlike in a GGM the edges in a VAR network are directed.
We point out that recently two papers have appeared describing related strategies for VAR model selection. As in our algorithm the strategies pursued in both [12] and [13] also consist in choosing the VAR network by selecting the appropriate underlying partial correlations. However, the key advantage of our variant of VAR network search is that it is specifically designed to meet small sample requirements, by using shrinkage estimators of regression coefficients and partial correlation, and due to the adaptive nature (i.e. the automatic estimation of the empirical null) of the "local fdr" algorithm [11].
Results and discussion
Simulation study
In a comparative simulation study we investigated the power of diverse approaches to recovering the true VAR network. We simulated VAR(1) data of different sample size, with n varying between 5 and 200, for 100 randomly generated true networks with 200 edges and p = 100 nodes. The 200 nonzero regression coefficients were drawn uniformly from the intervals [1; 0.2] and [0.2; 1].
In addition to the shrinkage procedure we estimated regression coefficients by ordinary least squares (OLS) and by ridge regression (RR). All these three regression strategies were applied in conjunction with the above VAR model selection based on partial correlations, with a cutoff value for the "local fdr" statistic set at 0.2 – the recommendation of [11]. As a fourth method we employed L1 regression [14] (LASSO) to estimate VAR regression coefficients. Note that in the latter instance there is no need for additional model selection, as the LASSO method combines shrinkage and model selection and automatically sets many regression coefficients identically to zero.
In the simulations we ran OLS only for n > 100, as for small sample size the corresponding empirical covariance matrix is singular and consequently the OLS regression is illposed. The penalty for the LASSO regression was chosen as in [15]. The regularization parameter in RR was determined by generalized cross validation [16]. Unfortunately, even GCV turned out to be computationally expensive, so that for RR we conducted only 10 repetitions, rather than the 100 considered for the other methods.
The results of the simulations are summarized in Figure 1. The left box shows the positive predictive value, or true discovery rate of the four methods. This is the proportion of correctly identified edges in relation to all significant edges. Our proposed shrinkage algorithm is the only method achieving around 80% positive predictive value regardless of the sample size. Note that this is exactly the theoretically expected value, given the specified "local fdr" cutoff of 0.2. In contrast, the RR and LASSO methods perform remarkably poor at small sample size, with much lower true discovery rates. For medium to large sample size the OLS estimation dominates RR, LASSO and the shrinkage approach. This is easily explained by the fact that OLS has no parameters to optimize and that it is asymptotically optimal. However, it is bothering that for both the RR and the OLS approach the false discovery rate appears not to be properly controlled. Finally, for large sample size the Steintype estimator appears to be prone to overshrinking, which leads to an increase of false positives.
Figure 1. Relative performance of the four investigated methods for learning VAR networks in terms of positive predictive value (true discovery rate) and the number of true and false edges. The thin dotted line in the middle box at 200 corresponds to the true number of edges in the simulated networks.
The relative performance of the four approaches to VAR estimation can be further explained by considering the relative amount of true and false positive edges (Figure 1, middle and right box). The shrinkage method generally produces very few false positives. In contrast, the RR and LASSO methods lead to a large number of false edges, especially for small sample size. This is particularly pronounced for the LASSO regression, as can be seen in the differently scaled inlay plot contained in the right box of Figure 1, indicating that the penalty applied in the L1 regression may not be sufficient in this situation. In terms of the number of correctly identified edges the RR and shrinkage approach are the two top performing methods. However, even though RR finds a considerable number of true edges even at very small sample size, this has little impact on its true discovery rate because of the high number of false positives.
In summary, the simulation results suggest to apply for small sample size the JamesSteintype shrinkage procedure, and for n > p the traditional OLS approach.
Analysis of a microarray time course data set
For further illustration we applied the VAR shrinkage approach to a real world data example. Specifically, we reanalyzed expression time series resulting from an experiment investigating the impact of the diurnal cycle on the starch metabolism of Arabidopsis thaliana [17].
We downloaded the calibrated signal intensities for 22,814 probes and 11 time points for each of the two biological replicates from experiment no. 60 of the NASCArrays repository [18]. After logtransforming the data we filtered out all genes containing missing values and whose maximum signal intensity value was lower than 5 on a logbase 2 scale. Subsequently, we applied the periodicity test of [19] to identify the probes associated with the daynight cycle. As a result, we obtained a subset of 800 genes that we further analyzed with the VAR approach.
We note that a tacit assumption of the VAR model is that time points are equidistant – see Eq. 1. This is not the case for the Arabidopsis thaliana data which were measured at 0, 1, 2, 4, 8, 12, 13, 14, 16, 20, and 24 hours after the start of the experiment. However, as the intensity of the biological reactions is likely to be higher at the change points from light to dark periods (time points 0 and 12), one may argue that assuming equidistant measurements is justifiable at least in terms of equal relative reaction rate.
A further implication of the VAR model (and indeed of many other graphical models) is that dependencies among genes are essentially linear. This can easily be checked by inspecting the pairwise scatter plots of the calibrated expression levels. For the 800 considered Arabidopsis thaliana genes we verified that the linearity assumption of the VAR model is indeed satisfied.
Subsequently, we estimated from the replicate time series of the 800 preselected genes the regularized regression coefficients and the corresponding partial correlations, and identified the significant edges of the VAR causal graph as described above. We found a total number of 7, 381 significant edges connecting 707 nodes. In Figure 2 we show for reasons of clarity only the subnetwork containing the 150 most significant edges, which connect 92 nodes. Note that this graph exhibits a clear "hub" connectivity structure (nodes filled with red color), which is particularly striking for the node 570 but also for nodes 81, 558, 783 and a few other genes (for annotation of the nodes see the Additional File 1).
Additional File 1. This PDF file contains a multipage table containing the annotation data (probe IDs, gene names, description) for the 92 genes displayed in the VAR network of Figure 2.
Format: PDF Size: 38KB Download file
This file can be viewed with: Adobe Acrobat Reader
Figure 2. Directed VAR network inferred from the Arabidopsis thaliana data. The solid and dotted lines indicate positive and negative regression coefficients, respectively, and the line intensity denotes their strength. For annotation of the nodes see the Supplementary Information, Table 1. The color code of the nodes is explained in the main text.
As the VAR network contains directed edges it is possible to distinguish genes that have mostly outgoing arcs, which could be indicative for key regulatory genes, from those with mostly ingoing arcs. In the graph of Figure 2 node 570, an AP2 transcription factor, and node 81, a gene involved in DNAdirected RNA polymerase, belong to the former category, whereas for instance node 558, a structural constituent of ribosome, seems to be part of the latter. Node 627 is another hub in the VAR network, which according to the annotation of [17] encodes a protein of unknown function. Another interesting aspect of the VAR network is the web of highly connected genes (encircled and colored yellow in the lower right corner of Figure 2) which we hypothesize to constitute some form of a functional module.
Finally, we note that the VAR network visualizes influences of the genes over time, hence a VAR graph can also include directed loops and even genes that act upon themselves. In contrast, the GGM graphs discussed in [7,9] visualize the partial correlation with no time lag involved. For comparison, we display the GGM graph for the Arabidopsis thaliana data in Figure 3. As expected, both graphs share the same structure (main hubs and the module of highly connected genes): if one node influences another in the next timepoint with a constant regression coefficient (VARmodel), they also tend to be significantly partially correlated in the same time point (GGMmodel). However, using a GGM it is not possible to infer the causal structure of the network.
Figure 3. Undirected GGM network inferred from the Arabidopsis thaliana data using the algorithm of [7; 9]. The solid and dotted lines indicate positive and negative partial correlation coefficients, respectively, and the line intensity denotes their strength.
Conclusion
We have presented a novel algorithm for learning VAR causal networks. This is based on JamesSteintype shrinkage estimation of covariances between different time points of the conducted experiment, that in turn leads to improved estimates of the VAR regression coefficients. Subsequent VAR model selection is conducted by using "local fdr" multiple testing on the corresponding partial correlations.
We have shown that this approach is well suited for the small sample sizes encountered in genomics. In addition, the approach is computationally very efficient, as no computer intensive sampling or optimization is needed: the inference of the directed network for the Arabidopsis thaliana data with 640, 000 potentially directed edges takes about one minute on a standard desktop computer. While we have illustrated the approach by analyzing a microarray expression data set, it is by no means restricted to this kind of data – we expect that our VAR network approach performs equally well for similar high dimensional time series data from metabolomic or proteomic experiments.
The current algorithm employs a fixed "one step ahead" time lag. One strategy to generalization to arbitrary time lags may be to consider functional data – see, e.g., [20,21]. This would have the additional benefit to suitable deal with nonequally spaced measurements, a common characteristic of many biological experiments.
Authors' contributions
Both authors participated in the development of the methodology and wrote the manuscript. R.O. carried out all analyzes and simulations. All authors approved of the final version of the manuscript.
Acknowledgements
We thank Papapit Ingkasuwan for pointing us to the Arabidopsis thaliana data set, and the referees for their valuable comments. This work was supported by the Deutsche Forschungsgemeinschaft (DFG) via an EmmyNoether research grant to K.S.
This article has been published as part of BMC Bioinformatics Volume 8, Supplement 2, 2007: Probabilistic Modeling and Machine Learning in Structural and Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/8?issue=S2.
References

Sims CA: Macroeconomics and Reality.
Econometrica 1980, 48:148. Publisher Full Text

Bay SD, Chrisman L, Pohorille A, Shrager J: Temporal Aggregation Bias and Inference of Causal Regulatory Networks.
J Comput Biol 2004, 11:971985. PubMed Abstract  Publisher Full Text

Lütkepohl H: Introduction to Multiple Time Series Analysis. Berlin: Springer; 1993.

Ni S, Sun D: Bayesian estimates for vector autoregressive models.
J Business Economic Statist 2005, 23:105117. Publisher Full Text

Efron B, Morris CN: Stein's estimation rule and its competitorsan empirical Bayes approach.
J Amer Statist Assoc 1973, 68:117130. Publisher Full Text

OpgenRhein R, Strimmer K: Accurate ranking of differentially expressed genes by a distributionfree shrinkage approach.

Schäfer J, Strimmer K: A shrinkage approach to largescale covariance matrix estimation and implications for functional genomics.

Granger CWJ: Testing for causality, a personal viewpoint.
J Econom Dyn Control 1980, 2:329352. Publisher Full Text

Schäfer J, Strimmer K: An empirical Bayes approach to inferring largescale gene association networks.
Bioinformatics 2005, 21:754764. PubMed Abstract  Publisher Full Text

Whittaker J: Graphical Models in Applied Multivariate Statistics. New York: Wiley; 1990.

Efron B: Local false discovery rates. In Tech rep. Department of Statistics, Stanford University; 2005.

Demiralp S, Hoover KD: Searching for the causal structure of a vector autoregression.
Oxford Bull Econom Statist 2003, 65:745767. Publisher Full Text

Moneta A: Graphical models for structural vector autoregressions. In Technical Report. Laboratory of Economics and Management, Sant' Anna School of Advanced Studies, Pisa; 2004.

Tibshirani R: Regression shrinkage and selection via the lasso.

Meinshausen N, Bühlmann P: Highdimensional graphs and variable selection with the Lasso.
Ann Statist 2006, 34:14361462. Publisher Full Text

Golub G, Heath M, Wahba G: Generalized crossvalidation as a method for choosing a good ridge parameter.
Technometrics 1979, 21:215223. Publisher Full Text

Smith SM, Fulton DC, Chia T, Thorneycroft D, Chapple A, Dunstan H, Hylton C, Smith SCZAM: Diurnal changes in the transcriptom encoding enzymes of starch metabolism provide evidence for both transcriptionaland posttranscriptional regulation of starch metabolism inArabidopsis leaves.
Plant Physiol 2004, 136:26872699. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

NASCArrays: the Nottingham Arabidopsis Stock Centre's microarray database [http://affymetrix.arabidopsis.info/narrays/experimentbrowse.pl] webcite

Wichert S, Fokianos K, Strimmer K: Identifying periodically expressed transcripts in microarray time series data.
Bioinformatics 2004, 20:520. PubMed Abstract  Publisher Full Text

OpgenRhein R, Strimmer K: Inferring gene dependency networks from genomic longitudinal data: a functional data approach.

OpgenRhein R, Strimmer K: Using regularized dynamic correlation to infer gene dependency networks from timeseries microarray data.
Proceedings of the 4th International Workshop on Computational Systems Biology (WCSB 2006), Tampere 2006, 4:7376.