Abstract
Background
With the abundant information produced by microarray technology, various approaches have been proposed to infer transcriptional regulatory networks. However, few approaches have studied subtle and indirect interaction such as genetic compensation, the existence of which is widely recognized although its mechanism has yet to be clarified. Furthermore, when inferring gene networks most models include only observed variables whereas latent factors, such as proteins and mRNA degradation that are not measured by microarrays, do participate in networks in reality.
Results
Motivated by inferring transcriptional compensation (TC) interactions in yeast, a stepwise structural equation modeling algorithm (SSEM) is developed. In addition to observed variables, SSEM also incorporates hidden variables to capture interactions (or regulations) from latent factors. Simulated gene networks are used to determine with which of six possible model selection criteria (MSC) SSEM works best. SSEM with Bayesian information criterion (BIC) results in the highest true positive rates, the largest percentage of correctly predicted interactions from all existing interactions, and the highest true negative (nonexisting interactions) rates. Next, we apply SSEM using real microarray data to infer TC interactions among (1) small groups of genes that are synthetic sick or lethal (SSL) to SGS1, and (2) a group of SSL pairs of 51 yeast genes involved in DNA synthesis and repair that are of interest. For (1), SSEM with BIC is shown to outperform three Bayesian network algorithms and a multivariate autoregressive model, checked against the results of qRTPCR experiments. The predictions for (2) are shown to coincide with several known pathways of Sgs1 and its partners that are involved in DNA replication, recombination and repair. In addition, experimentally testable interactions of Rad27 are predicted.
Conclusion
SSEM is a useful tool for inferring genetic networks, and the results reinforce the possibility of predicting pathways of protein complexes via genetic interactions.
Background
While the existence of genetic compensation is widely accepted, the mechanism is largely unknown but important [1,2]. The proposed algorithm (SSEM) was motivated by inferring transcriptional compensation (TC) networks of SGS1 (or RAD27) and its synthetic sick or lethal (SSL) partners [3,4]. However, SSEM can also be applied to infer other types of networks, such as transcriptional regulatory networks. Following a gene's loss, the expression level of its compensatory gene increases (decreases), this phenomenon is called TC (transcriptional diminishment, abbreviated as TD). Paralogs or redundant genes are called digenic SSL gene pairs if the combination of two mutants, neither by itself lethal, causes the organism to die or malfunction [3,5,6]. SSL effects underlie many complex human diseases, such as type II diabetes, schizophrenia, Alzheimer's disease, and others [4]. Since genetic networks derived from model organisms, such as yeast, are likely to be conserved in humans the prediction of TC and TD may shed light on pathways that cause complex human diseases. With the abundant information produced by microarray technology, various approaches have been proposed to infer genetic networks or transcriptional regulatory networks. Most of them may be classified into three classes, namely, graph models, discrete variable models and continuous variable models. Due to space limits, we refer to [7] (in Additional file 1) for a thorough review of the models.
Additional file 1. SSEMTR. Technical Report of SSEM – Shieh et al. (2005).
Format: PDF Size: 363KB Download file
This file can be viewed with: Adobe Acrobat Reader
Graph models (for instance, [8]) depict genetic interactions through directed graphs or digraphs instead of characterizing the interactions quantitatively. Some graph models simply reveal structural information, others annotate the directions and signs of the regulations among genes. Because of their simplicity, graph models usually require much less data than models in the other two categories. But they are inherently static and may not capture the dynamics of genetic regulations and the simultaneous regulation of a given gene by multiple genes. Discrete variable models discretize gene expressions into a few states. The dynamics of gene expressions may be perceived as transitions of finite states. Typical discrete variable models proposed are Boolean networks, probabilistic Boolean networks and discrete Bayesian networks (for details, see a classic paper [9]). Continuous variable models characterize the expression of a gene or its change by a linear or nonlinear continuous function of the expression of other genes. The genetic interactions are frequently modeled by a firstorder or a secondorder differential (or difference) equation. Continuous variable models consist of two major types: continuous Bayesian networks [1012] and deterministic differential systems [13].
Although each class of models has been shown to be informative for understanding genetic interactions, most of the models, except some Bayesian networks, have the estimation bias problem due to model misspecification. The model misspecification arises from the fact that microarrays measure the mRNA expressions only, while genetic interactions may be influenced by enzymes or proteins, for instance transcriptional factors. Furthermore, most genetic networks reconstructed in previous studies considered only a subset of the whole genome. Consequently, those genes that were left out may be regarded as latent factors influencing the genes of interest. Thus, ignoring latent factors in the models may cause bias when inferring the genetic interactions. Although a Bayesian network can also incorporate latent factors [11,12], the amount of data required may prevent it from being used.
To account for the latent factors effect using a reasonable amount of microarray data, a stepwise structural equation modeling algorithm (SSEM) is proposed in this article. SSEM is based on structural equation modeling (SEM) [14], which unifies factor analysis and path analysis. Assuming linear relations among the observed and latent variables, the basic idea of SEM is to minimize the discrepancy between the fitted covariance matrix and the sample covariance matrix. Zhou et al. [15] used shortest path analysis to identify transitive genes between two given genes in the same biological pathway. Xie and Bentler [16] showed that the latent factors can be identified and their relations may be estimated reasonably by SEM. Note that without identifying the latent factors reasonably, the causal relations among genes can not be estimated correctly.
In this article, we extend the model in [16] to simultaneously infer both latent factorgene and genegene interactions. Both [16] and SSEM extend the methodology of SEM in the sense that the latent factors are formed from data and not chosen a priori from domain knowledge as commonly practiced in the social sciences. SSEM learns genetic interactions by both exploratory factor analysis (EFA) and SEM with various model selection criteria (MSC) in a stepwise fashion. The incorporation of MSC helps SSEM circumvent the overfitting problem. The performance of SSEM with six different MSC is evaluated using two sets of simulated networks to determine which MSC works best. The software SSEM automatically runs through all of the steps of SSEM, and outputs predicted gene interactions. Finally, SSEM is applied to infer TC and TD interactions for (1) small groups of genes that are synthetic sick or lethal (SSL) to SGS1, and (2) SGS1 or RAD27 with their SSL partners from 51 genes involved in yeast DNA synthesis and repair that are of interest. Both predictions are verified by an extensive quantitative RTpolymerase chain reaction experiment (qRTPCR); see Additional file 2 for details.
Additional file 2. qRTPCR. Description of the design of qRTPCR experiments and how the experiments were conducted to confirm the predicted TC and TD interactions.
Format: PDF Size: 38KB Download file
This file can be viewed with: Adobe Acrobat Reader
Results and discussion
The MSC suitable for predicting genetic networks remains unknown, while an adequate MSC can prevent the algorithm from overfitting. Thus we have carried out an extensive simulation to evaluate eight criteria used in commercial SEM software, such as Mplus version 3 [17]. The results of the top six MSC χ^{2}/df, χ^{2}df, Mean square error (MSE), Akaike information criterion (AIC), Bayesian information criterion (BIC), and adjBIC are reported in Additional file 3. SSEM with BIC outperforms all of the others. Since the network topology, latent factors (x(t)), genegene interactions (w), and latent factorgene regulations (Λ) are well defined for the simulated data, exact quantitative performance can be computed.
Additional file 3. Simulation. The description of the 6 and 10gene networks, and the results of EBVM applied to the two networks.
Format: PDF Size: 246KB Download file
This file can be viewed with: Adobe Acrobat Reader
Results of SSEM with various MSC using simulated data
Time course data from 6gene and 10gene regulatory networks with two latent factors are generated. The simulation consists of various sample sizes and noise levels. Let x_{i}(t), y_{i}(t) and ε_{i}(t) denote the expression level of latent factor i, gene i and noise variable i, respectively. The linear dynamic factor model (LDFM, see Section 4.1 for general model setting) to generate the 6gene network is as follows:
where x_{1}(t) ~ N(0,0.1), x_{2}(t) ~ N(0,0.1), y_{i}(0) ~ U(0,1), and ε_{i}(t) ~ N (0, ), i = 1,...,6. Note that is determined by the variance of y_{i}(t) and a prespecified noise level. The noise level is quantified by a contrasttonoise ratio (CNR), defined as the ratio of the signal standard deviation to the noise standard deviation. CNR = 1.3 or 2.0 corresponds to high or medium noise levels, respectively. For the 10gene network, we refer to Equation (5) of Simulation.pdf of the Supplementary data.
Note that both the 6 and 10gene networks are sparse, which roughly follow the sparse property of cisregulatory networks [18]. For each network, time course data are simulated under various conditions; sample sizes (T =T_{min},50 or 100) and noise levels, where T is the number of time points and T_{min }= 2n + 1. Without incorporating any biological knowledge, for a fully connected ngene network (namely all interactions are nonvanished), T = 2n+1 is the minimum number of time points required (denoted as T_{min}) for proper estimation of in (4), and hence for all parameters in the model. However, the latest version of SSEM can be iterated from a nonfully connected network, and hence the restriction T = 2n+1 no longer exists. Table 1 summarizes the performance of SSEM with AIC and BIC for the 6gene network under various settings of (CNR, T). The averages of the true positive rate (TPR), true negative rate (TNR), and false positive rate (FPR) for the top 1 (top 5) networks, in terms of MSC value, in 100 experiments are reported. TPR (also known as sensitivity) is the percentage of correctly predicted links from the total existing links (positives) in the simulated network. Likewise, TNR (specificity) is the percentage of correctly predicted nonexisting links (negatives) out of the total nonexisting links in the simulated network. Clearly, SSEM with BIC outperforms SSEM with AIC, and the results from the 10gene network also confirm this.
Table 1. Performance of SSEM with BIC and AIC applied to the 6gene network with various combinations of (CNR, T).
We further compared SSEM with BIC to VBEM [12] using the 6 and 10gene networks in Simulation.pdf. The results are in Table 2, and SSEM with BIC outperforms VBEM in terms of TPRs for both networks; details are provided in Simulation.pdf.
Table 2. Performance of SSEM with BIC and VBEM applied to the 6gene and 10gene networks with various combinations of (n, CNR, T)
Results on real time course microarray data
In this section, SSEM is first applied to infer TC/TD interactions for small groups of genes SSL to SGS1, for example CSM3, MUS81, SIS2, SWE1and TOP1 in [3]. Next, SSEM infers TC/TD interactions from SGS1 or RAD27 SSL gene pairs, formed from 51 genes involved in yeast DNA synthesis and repair. SGS1 encodes a RecQ DNA helicase, of which the homologues in human cells include the WRN, BLM and RECQ4 genes. Mutation of the SGS1 gene results in premature aging in yeast mother cells as well as genome instability. Further, these genes and their processes are highly conserved in eukaryotic cells, and mutations in these genes may lead to cancerpredisposition syndromes and symptoms resembling premature aging [4]. On the other hand, Rad27 encodes a structurespecific (5'flap) endonuclease which has a human homolog, FEN1; Rad27 has a distinct role in processing Okazaki fragments during DNA synthesis in the S phase. Deletion of RAD27 in cells also causes hypersensitivities to various DNA damaging agents [19]. Rad27 was shown to be necessary for maintaining genome stability by restricting DNA recombination between short repeated sequences and processing longpatch base excision repair [2023].
cDNA microarray data from the alpha, cdc15 and cdc28 experiments in [24] were applied to the four algorithms to infer the gene network of interest. The elu data set was not included because it was synchronized differently from alpha, cdc15 and cdc28. The experiment and control groups were mRNAs extracted from synchronized and nonsynchronized yeast cultures, respectively. The synchronization was conducted by treating yeast cultures with alpha factor arrest and arrests of a temperaturesensitive mutant cdc15 and mutant cdc28. A full description and complete data sets are available at [25]. The red (R) and green (G) fluorescence intensities were measured from the mRNA abundance in the experiment group and control group, respectively. There were 18, 24 and 17 time points in the alpha, cdc15 and cdc28 data sets with no replicates; we first aggregated these three datasets to increase the number of time points to 59. This aggregation was applied in [16], and it resulted in some meaningful gene networks.
Log ratios of the six genes' expression levels were fitted to SSEM with BIC, VBEM [12], MAPEM [26] and LDS [11] algorithms. The results were checked against qRTPCR results (see Figure 1 in Additional file 4). Excluding latent factorgene interactions, the modified true positive rate (mTPR) of the top model selected by SSEM with BIC equals 7/12. While the mTPRs of VBEM, MAPEM and LDS equal 2/12 (at 99% significance level), 6/12 and 0, respectively. Fitting five genes' expression to a multivariate AR(1) model resulted in 0/12 mTPR; see Additional file 5 for details. This shows how latent factors improve the estimation of gene interactions and thus mediate proper extraction of biological knowledge. We also ran SSEM when the sample size was small (T = 11) for the 6gene network, and the mTPR of the top model predicted by SSEM equaled 7/12. For this application, SSEM took about 19 minutes on PCs with Pentium IV 3.4 GHz and 2.5 GB RAM.
Additional file 4. BayesianNW. The 6gene network predicted by the three Bayesian network algorithms in Beal et al. (2005), Rangel et al. (2004) and Perrin et al. (2003).
Format: PDF Size: 89KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 5. MultiAR(1). The result of fitting multivariate AR(1) straightforwardly to the real data for the 6gene network.
Format: PDF Size: 211KB Download file
This file can be viewed with: Adobe Acrobat Reader
Next, SSEM was applied to infer TC/TD interactions among the 51 genes that are SSL to SGS1 or RAD27. Our collaborator has conducted extensive qRTPCR experiments (in Additional file 6) to verify that among these predictions, SSEM successfully uncovered several TC/TD interactions of SGS1 with genes involved in DNA replication (e.g., SRS2, PLO32, RNR1, SLX1, MUS81 and TOP1), DNA repair (e.g., RAD51 and RAD52), checkpoint arrest (RAD9) and chromosome segregation (CSM3). These genetic interactions are consistent with the following experimental results from published literature. Sgs1 and Srs2 are known redundant pathways in replication [27,28]; for instance, Srs2 and Sgs1Top3 suppress crossovers during double strand break repair in yeast. Further, defects in RAD51 and other homologous recombination genes suppressed synthetic lethality/sickness of the double mutant sgs1Δ srs2Δ. Slx1Slx4 was found to be a second structure specific endonuclease functionally redundant with Sgs1Top3 in [29]. The Sgs1/Top3/Rmi1 and Mus81/Mms4 complexes are involved in both doublestrand break repair and homologous recombination [30]. This indicates that Sgs1/Top3/Rmi1 and Mus81/Mms4 are alternative pathways to resolve recombination intermediates. [31] identified that Sgs1 participated in a RAD52dependent recombination pathway. [32] found that Rad9 and Sgs1 interacted genetically and possibly physically. Cells lacking Sgs1 frequently arrest as largebudded cells with a single nucleus in the mother cell, or "stuck" between mother and daughter cells, which resulted in missegregation during mitosis [33,34], whereas Csm3 is required for DNA replication checkpoint and accurate chromosome segregation. Similarly, SSEM was applied to predict the interactions of the fifteen SSL pairs of RAD27, and among them HPR5, SGS1, MUS81, ESC2, HST1, HST3 and CSM3 had TC interactions with RAD27, whereas RAD52, HPR5, SIS2, SOD2, HPC2, LYS7, RAD9, RAD51 and RAD54 had TD interactions with RAD27. For the second application, SSEM took about 3 to 4 hours on PCs with Pentium IV 3.0 GHz and 1 GB RAM.
The results involving SGS1 reinforce the possibility of applying genetic interactions to predict pathways of protein complexes [35]. The predictions of RAD27 are intriguing to biologists since biological experiments to screen all possible interactions have been prohibitive thus far. Note that SSEM can also be applied to infer TC interactions of 872 SSL gene pairs in [3,4] or other large networks with a similar structure, for instance the other six groups of SSL pairs involving ARC40, ARP2, BBC1, BIM1, BNI1 and NBP2. The large network of 887 SSL pairs can be broken down to subgroups that center on SGS1, RAD27, the above six genes, and other hub genes. Then each subgroup can be inferred individually, similarly to the group involving SGS1.
Additional file 6. SSLTCNW. TC networks of SSL gene pairs. Description of how TC and TD interactions of SGS1 and RAD27 SSL gene pairs were predicted by SSEM.
Format: PDF Size: 389KB Download file
This file can be viewed with: Adobe Acrobat Reader
Conclusion
The novelties and merits of SSEM are as follows. First, SSEM expands the scope of application of most algorithms in the area of gene networks. Specifically, SSEM is shown to predict several TC/TD interactions of SGS1 accurately, verified by qRTPCR experiments, and these interactions coincide with existing pathways. Further, SSEM predicts a few novel TC/TD interactions involving RAD27, and these predictions can be verified by biological experiments. Importantly, SSEM can be further applied to predict genetic interactions of other large networks with a similar structure, while biological experiments to screen all possible interactions may be prohibitive. Second, SSEM extends the approach in [16] such that it can infer both latent factorgene and genegene interactions simultaneously. Third, SSEM incorporates an MSC in a stepwise fashion to circumvent the overfitting problem. Although SSEM was shown to infer genetic networks using time course data with no replicates, it can also be applied to short time course data with replicates by modifying the terms involved in replicates and the sample size. As technology advances, we anticipate more data sets with replicates to become available and a greater demand for algorithms like SSEM to infer gene networks.
Using SSEM with the model in Equation (2) has been shown to outperform fitting a multivariate autoregressive model straightforwardly. This demonstrates the important role of latent factors and the efficiency of SSEM. Further, SSEM outperforms three Bayesian network algorithms that impose linear models on latent factors, while SSEM does not assume any structure on latent factors. However, SSEM shares one drawback with continuous Bayesian networks. Both approaches assume that the vector of log ratios of gene expression y(t) follows a multivariate normal distribution. This assumption may limit its application, although log ratios of gene expression do follow a normal distribution in most cases.
Although SSEM may serve as an exploratory tool for genetic interactions, the model in (2) is an approximation to the true model, and BIC is a largesample result. Further improvements for future research include finding a novel MSC for SSEM when the sample size is small, and developing a nonlinear model with latent factors or a lagk and k > 1 in time to model genetic interactions. The goal of SSEM is to model small to medium networks with precise prediction instead of modeling large or genomewide networks with inaccurate prediction. Some results on incorporating various types of data, e.g. motif information, and ChIPchip data besides microarray data, to predict transcriptional modules have been explored in the literature [3638]. However, integrating various types of data for reliable prediction of complex genetic networks remains a challenging topic, and we leave this for future research.
Methods
The linear dynamic factor model
We assume that time course microarray data follow an LDFM, which includes both factorgene and genegene regulation in the model. Let denote the expression of gene i at time t for 1 ≤ i ≤ n, where n is the number of genes in the network. Further, let y_{i}(t) be the centered , namely , where is the mean of over time. We incorporate centered variables to avoid an intercept term in Equation (2) to reduce n parameters that are not of interest. Specifically, LDFM assumes that y_{i}(t) is regulated by a linear combination of latent factors at time t and centered observed variables (genes) at time (t  1), and the regulation is invariant across time as follows.
where y(t) is the vector of the expression levels of the n genes at time t, x(t) is the (k × 1) vector of the latent factors' expression at time t, and ε(t) is the (n × 1) noise vector that assumes N_{n }(, Q), where Q is a diagonal covariance matrix. Further, Λ is the (n × k) latent interaction matrix, in which λ_{ij }denotes the influence of latent factor j on gene i at the same time, and w is the genegene interaction matrix, in which W_{ij }denotes the influence of gene j at time (t  1) on gene i's expression at time t. Latent factors x(t) are assumed to follow N_{k }(, Σ_{k}), and x(t) and ε(t) are uncorrelated such that the model is identifiable. Applying biological knowledge, SSEM can infer sufficiently large networks. For example, when inferring TC interactions from SSL pairs, interactions (W_{ij}'s) are nonvanished only for SSL pairs. For instance, when predicting TC interactions of SSL gene pairs involving SGS1, fitting one equation is sufficient, where y_{i}'s are the twentythree genes that are SSL to SGS1 [3,4], and the other W_{j}'s are vanished for gene j that is not SSL to SGS1. The aforementioned equation can be inputed into the latest version of SSEM as an initial network, and when no links are specified to be deleted in the iteration, SSEM will predict genegene interactions for the nonvanished W_{j}'s, and infer the factorgene interactions from data. Note that when inferring transcriptional regulatory networks, Equation (2) is also able to model the combination of multiple genes to activate (or repress) a target gene simultaneously. The major difference between LDFM and the state space model (SSM), for example the model in [12], is that the former does not model interactions among latent factors across time.
SEM is adopted since it considers latent factorgene and genegene interactions simultaneously to reveal gene networks using microarray data. An SSEM algorithm is introduced to learn the parameters Λ and W in LDFM. The main idea is to learn the regulation network iteratively. In each iteration, for a generated network, we estimate the parameters by SEM and evaluate its goodnessoffit. The top few networks of each iteration are retained for the next iteration, until the optimal network, in terms of any MSC, emerges. SSEM is available to users upon request from the corresponding author.
SSEM consists of three parts. Specifically, in Part 1, EFA is applied to learn some initial latent structures, which specify latent factorgene interaction. In Part 2, networks consist of any given initial latent structure and (randomly generated) partially connected genegene interactions are considered. SEM is applied to estimate x(t), Λ and w of any network considered, and a specified MSC evaluates the goodnessoffit of the network. In Part 3, plausible networks are generated by systematically and iteratively eliminating insignificant links (interactions) based on the associated tstatistics resulting from SEM. These three parts are described in detail in the learning networks section.
Learning the initial latent structures
Incorporating a correct latent structure is crucial for reconstructing genetic networks. First, EFA is employed to learn potential latent structures to start the iterative process. EFA is a common practice to ascertain the latent factors that influence the observed variables. Fundamentally, factor analysis assumes that there are some latent factors, fewer in number than genes, that are responsible for the covariation among the observed gene expressions. EFA may be expressed as
where , and are all estimated without taking genegene interaction into account. Specifically, is (m × 1) the common factors at time t, is the n × m latent interaction matrix, in which denotes the influence of latent factor j on the expression of gene i at time t estimated without explicitly taking account of genegene interaction, and is (m × 1) the unique factors at time t that can not be explained by the common factors . Comparing Equations (2) and (3), the latent structure embedded in would deviate from the true one except when the factor accounts for the effect of genegene interaction, that is, equal to w_{y}(t  1). This shows that fitting a structural equation model with the latent factors estimated solely by EFA to the gene expressions [16] may not result in the correct latent structure. Therefore, given k latent factors suggested by EFA, we consider three possible numbers of latent factors (k1, k or k+1), along with the associated latent structure in Part 1 of SSEM. The common factors are extracted by a principal component analysis with promax oblique rotation, which rotates factors in order to fit a hypothesized structure of latent factors.
Determining the number of common factors that best explain the observed variables is one of the practical issues in EFA. Various guidelines have been proposed, for instance, eigenvalue ≥ 1 [39] and the scree test [40]. Different guidelines may lead to different choices. Based on the "weaker lower bound" suggested by [40], SSEM searches through k  1, k, and k + 1 common factors and the associated latent structures, where k is the number of common factors with eigenvalues ≥ 1 resulting from EFA. Then, for each given k, the latent structure is obtained by eliminating the links with factor loading less than a constant, which can be specified by users and the default value is 0.2.
Network (model) selection criterion
In the iterations of SSEM, latent factor x(t), and the parameters Λ and w of a given network are estimated, and the goodnessoffit of the network is computed by SEM. SEM is a statistical method to test the hypothesis for the existence of both latent factorgene and genegene interactions. The principal idea of SEM is to minimize the difference between the covariance matrices of the predicted variables and the observed variables. Let Cov(a, b) be the covariance matrix of two random vectors a and b. The LDFM is lag1 in time, so we consider the joined vector of y(t)^{T }and y(t  1)^{T}.
Let S denote the sample covariance matrix, which is defined as
where S_{t,t }= Cov(y(t), y(t)), S_{t1,t }= Cov(y(t  1), y(t)), S_{t,t1 }= Cov(y(t), y(t1)), and S_{t1,t1 }= Cov(y(t  1), y(t  1)). Let (t) be the column vector of the predicted expressions for the n genes at time t. Plugging in (t  1) and (t) for y(t  1) and y(t), respectively into the elements of S, we obtain the estimated covariance matrix .
In SSEM, the parameters are estimated by the maximum likelihood method with the fitting function
where denotes the estimated covariance matrix, S the sample covariance matrix, A and tr(A) the determinant and the trace of matrix A, respectively, and n the number of genes. When the sample size (T) is small, ridge estimation is applied to avoid the singularity of S and . In the application section, small ridge constants are applied such that the condition number of S and are not larger than 10^{2}.
Among the six MSC's studied, the χ^{2 }statistic is based on the idea of minimizing the discrepancy between the estimated and the sample covariance matrices, and it is defined as (T  1) times the minimized value of F_{ML }in Equation (4), where T is the sample size. When the fitting function is F_{ML}, the χ^{2 }statistic is equivalent to the generalized likelihood ratio [41]. Assuming multivariate normality, the χ^{2 }statistic has an asymptotic (large sample) χ^{2 }distribution with (p*  q) degrees of freedom, where p* = (3n^{2 }+ n)/2 since only and the lower triangle matrix of form equations to estimate parameters. Further, q is the number of parameters that equal to n^{2 }+ kn in the LDFM in (2). The condition n > k is required to have degrees of freedom p*  q> 0. However, this condition is satisfied in general since k = [n/c], where c ≥ 3. A large sample size can inflate a small difference between S and , and thus can inflate the χ^{2 }statistic. Numerous indices were proposed to remedy the bias, among them four have been assessed in our pilot studies, namely, χ^{2}/df, χ^{2 } df [42], TLI [43] and CFI [44], where df denotes degree of freedom. The former two were found to be more effective than the latter two for network (model) selection.
MSE between the observed and the predicted gene expressions is defined as , where T is the number of time points in the gene expression data. AIC [45] and BIC [46] are two widely used information criteria for model selection, which take model complexity into account. AIC is a measure based on the KullbackLeibler distance between the fitted and the true model, and AIC = 2log L() + 2q_{j}, where log L() is the loglikelihood with estimates , and q_{j }is the number of parameters in model j. To solve the inconsistency problem of AIC, Schwarz [44] proposed BIC based on maximization of the posterior choice probability. BIC = 2log L() + q_{j }log T, where T is the number of time points. To reduce the penalty imposed in BIC, Sclove [47] suggested samplesize adjusted BIC (adjBIC) by replacing T with T*, where T* = (T + 2)/24.
Learning networks through iterated SEM
A genetic network inferred from LDFM can be built by latent factorgene and genegene interactions. A correct network is essential for estimation of genegene interactions using SEM. However, learning the optimal network from data subject to a goodnessoffit index is NPhard. Although global optimization techniques, such as simulated annealing and genetic algorithms, may be applied, the required computation time is not feasible. To make the learning process practical, we propose a stepwise approach. The key idea is to generate a set of candidate networks and retain plausible links by both using SEM and iteratively filtering with a moving window as follows. For any network generated in the iteration, we apply SEM to estimate x(t), Λ and w. The significance of each link (λ_{ij }and W_{kl}) is tested by its associated tstatistic. Let t^{i}window (denoted by ) be a window of some given lower and upper bounds in the ith iteration to screen for significance of generated links. A link with a tstatistic value greater than , within or less than is regarded as a candidate link, a possible link or a futile link (denoted by clink, plink and flink), respectively.
Let S_{c}, S_{p }and S_{f }denote the sets of clinks, plinks and flinks, respectively. Suppose that EFA suggests k factors for a given data set. Given fixed k  1, k or k + 1 factors, EFA is applied again to learn the associated latent (factorgene) structures. SSEM begins with the aforementioned latent structure and a fully connected gene structure, namely, each gene is regulated by all genes and k  1, k or k + 1 latent factors. To start the stepwise search, SEM is applied to the initial networks to estimate x(t), Λ and w. For a given initial network, first let the initial t^{0}window be . Then, a set of networks Φ^{0 }can be generated as follows. Checking the tstatistics of all links against the t^{0}window , we discard all flinks, and retain all clinks, while considering all 0–1 combinations of plinks. Suppose there are l plinks in an initial model, then there are 2^{l }combinations of each plink being included in a model or not. Models including all clinks and each aforementioned combination are considered, and these 2^{l }models can be viewed as generated by the twindow filtering. That is, the twindow serves as a filter to eliminate insignificant (the lesslikelytoexist) links. Specifically, Φ^{0 }= {φφ ∈ S_{c }∪ L_{p}, ∀ L_{p }∈ P(S_{p})}, where P(S_{p}) = {L_{p}L_{p }⊆ S_{p}} is the power set of S_{p }and L_{p }is a subset of S_{p}. Furthermore, we apply SEM to each candidate network in Φ^{0 }to obtain the prespecified goodnessoffit index. To save computation time and to ensure that the superior networks are kept for the next iteration, only the top m networks (denoted by Ω^{0}) are retained for the next iteration.
Similar to the initial iteration, for each iteration i ≥ 1, SSEM generates a set of candidate networks by twindow filtering all networks generated by the top m networks from iteration(i1), i.e., Ω^{i1}, with k  1, k or k + 1 factors, respectively, to form Φ^{i}. So in total, there are 3 m seed models to generate networks. Among the networks in Φ^{i }∪ Ω^{i1}, only the top m networks (Ω^{i}) are retained by the specified goodnessoffit index for iteration (i + 1). First, we let the t^{i}window equal to the t^{i1}window +c. We use c = 0.1, but c can be other small constants. Again, given the t^{i}window, each link in the jth network in Ω^{i1 }can be discarded, retained or considered according to its tstatistic value. We denote the collection of these flinks, clinks, and plinks by S_{jf}, S_{jc}, and S_{jp}, respectively. A set of candidate networks is generated by retaining all clinks and considering all 0–1 combinations of plinks with k  1, k or k + 1 factors in the model, and this set is denoted by = {φφ ∈ S_{jc }∪ L_{p}, ∀ L_{p }∈ P(S_{jp})}. We combine all the generated sets to result in the ith set of networks .
Evaluating the specified goodnessoffit index for every network in Φ^{i}, we obtain the top m scored networks from Φ^{i }∪ Ω^{i1}, which form Φ^{i}, to go to iteration (i + 1). The iteration terminates if the specified goodnessoffit index can not be further improved.
The proposed SSEM algorithm
Initialization
Fit EFA to a given data set to determine the number of factors, say k.
• Apply EFA to generate three initial networks by estimating the latent structures with k  1, k or k + 1 latent factors, respectively.
• For given k factors, obtain the latent structure by eliminating the links with factor loading less than a constant (the default value used is 0.2).
• Specify an MSC.
Stepwise search
• For each initial network:
Step 1. Set iteration i = 0, run SEM on the data set. Specify the t^{0}window = . Generate a set of networks that consist of all clinks and one of all the 0–1 combinations of plinks. Compute the MSC of all networks and select the top m models to form the candidate set Ω^{0}.
Step 2. Set i = i + 1. Specify the t^{i}window .
Step 3. Similarly to Step 1, for each network in Ω^{i1}, generate a set of networks, and form Φ^{i}
Step 4. Evaluate the MSC for all networks in Φ^{i}, and choose the best m networks from Φ^{i }∪ Ω^{i1 }to form the ith candidate set Ω^{i}.
Step 5. If the ith top 1 MSC = the (i1)th top 1 MSC, stop ; Otherwise, go to Step 2.
• Select the best m networks from the union of all networks generated by different initial guesses.
Availability and requirements
Project home page is in [48]. SSEM algorithm is written in Visual C++ 6.0, and it calls SAS 8.2 and Mplus 3.0. Program runs under Windows 2000 or higher version operating system. The zipped code of SSEM is attached in Additional file 7. Visual C++, SAS and Mplus are readily available for purchase through Microsoft, SAS and Mplus, respectively.
Additional file 7. SSEMalgorithm. The zipped file consists of the standalone executable (.exe) file of SSEM.
Format: ZIP Size: 212KB Download file
Authors' contributions
GS and CC conceived the study, devised the method, and supervised methodology and implementation. CY carried out the method and part of simulation, and wrote an early draft of Shieh et al. in [7]. JH wrote and automated the code. WW participated in implementation. CC and YL wrote part of the paper. GS wrote the paper and coordinated the entire work. All of the authors have read and approved the final manuscript.
Acknowledgements
The authors wish to thank ChiaChang Wang and JyeJung Chang for computational assistance, Drs. TingFang Wang, ChihHung Jen, John Aston and Ivan Chang for constructive discussions in biology (the former two) and in statistics, especially Dr. Wang for kindly providing us with the qRTPCR results. This work was supported in part by NSC grant 922118M001023 and thematic grant ASTP 2333 to G.S.S.; C.Y.Y. was supported by NSC postdoctoral fellowship 922811M001037 and 932811M001071.
References

Lesage G, Sdicu AM, Menard P, Shapiro J, Hussein S, Bussey H: Analysis of β1, 3glucan assembly in S. cerevisiae using a synthetic interaction network and altered sensitivity to caspofungin.
Genetics 2004, 167:3549. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kafri R, BarEven A, Pilpel Y: Transcriptional control reprogramming in genetic backup circuits.
Nature Genetics 2005, 37:295299. PubMed Abstract  Publisher Full Text

Tong AH, Evangelista M, Parsons AB, Xu H, Bader GD, Page N, Robinson M, Raghibizadeh S, Hogue CW, Bussey H, Andrews B, Tyers M, Boone C: Systematic genetic analysis with ordered arrays of Yeast deletion mutants.
Science 2001, 294:23642366. PubMed Abstract  Publisher Full Text

Tong AH, Lesage G, Bader GD, Ding H, Xu H, Xin X, Young J, Berriz GF, Brost RL, Chang M, Chen Y, Cheng X, Chua G, Friesen H, Goldberg DS, Haynes J, Humphries C, He G, Hussein S, Ke L, Krogan N, Li Z, Levinson JN, Lu H, Menard P, Munyana C, Parsons AB, Ryan O, Tonikian R, Roberts T, Sdicu AM, Shapiro J, Sheikh B, Suter B, Wong SL, Zhang LV, Zhu H, Burd CG, Munro S, Sander C, Rine J, Greenblatt J, Peter M, Bretscher A, Bell G, Roth FP, Brown GW, Andrews B, Bussey H, Boone C: Global mapping of the Yeast genetic interaction network.
Science 2004, 303:808813. PubMed Abstract  Publisher Full Text

Hartman JL, Garvik B, Hartwell L: Principles for the buffering of genetic variation.
Science 2001, 291:10011004. PubMed Abstract  Publisher Full Text

Pan X, Ye P, Yuan DS, Wang X, Bader JS, Boeke JD: A DNA integrity network in the yeast Saccharomyces cerevisiae.
Cell 2006, 124:10691081. PubMed Abstract  Publisher Full Text

Shieh GS, Chen CM, Yu CY, Huang J, Wang WF: A stepwise structural equation modeling algorithm to reconstruct genetic networks. In Technical Report C200504. Institute of Statistical Science, Academia Sinica, Taiwan; 2005.

Schäfer J, Strimmer K: An empirical Bayes approach to inferring largescale gene association networks.
Bioinformatics 2005, 21:754764. PubMed Abstract  Publisher Full Text

Friedman N, Linial M, Nachman I, Pe'er D: Using Bayesian networks to analyze expression data.
Journal of Computational Biology 2000, 7:601620. PubMed Abstract  Publisher Full Text

Kim SY, Imoto S, Miyano S: Inferring gene networks from time series microarray data using dynamic Bayesian networks.
Briefings in Bioinformatics 2003, 4:228235. PubMed Abstract  Publisher Full Text

Perrin BE, Ralaivola L, Mazurie A, Bottani S, Mallet J, d'AlchéBuc F: Gene networks inference using dynamic Bayesian networks.
Bioinformatics 2003, 19:ii138ii148. PubMed Abstract  Publisher Full Text

Beal MJ, Falciani F, Ghahramani Z, Rangel C, Wild DL: A Bayesian approach to reconstructing genetic regulatory networks with hidden factors.
Bioinformatics 2005, 21:349356. PubMed Abstract  Publisher Full Text

Kimura S, Ide1 K, Kashihara A, Kano M, Hatakeyama M, Masui R, Nakagawa N, Yokoyama S, Kuramitsu S, Konagaya A: Inference of Ssystem models of genetic networks using a cooperative coevolutionary algorithm.
Bioinformatics 2005, 21:11541163. PubMed Abstract  Publisher Full Text

Kaplan D: Structural equation modeling: Foundation and extensions. Sage Publication: Thousand Oaks, California, USA; 2000.

Zhou X, Kao MC, Wong WH: Transitive functional annotation by shortestpath analysis of gene expression data.
Proc Natl Acad Sci USA 2002, 99:127831278. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Xie J, Bentler PM: Covariance structure models for gene expression microarray data.
Structural Equation Modeling 2003, 10:566582. Publisher Full Text

Muthén LK, Muthén BO: Mplus User's Guide. Los Angeles, CA: Muthén & Muthén; 2004.

Van Someren EP, Wessels LFA, Backer E, Reinders MJT: Genetic network modelling.
Pharmacogenomics 2002, 3:507525. PubMed Abstract  Publisher Full Text

Hoops LL, Budd M, Choe W, Weitao T, Campbell JL: Mutations in DNA replication genes reduce yeast life span.
Mol Cell Biol 2002, 22:41364146. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Klungland A, Lindahl T: Second pathway for completion of human DNA base excisionrepair: reconstitution with purified proteins and requirement for DNaseIV (FEN1).
EMBO J 1997, 16:33413348. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Tishkoff DX, Filosi N, Gaida GM, Kolodner RD: A novel mutation avoidance mechanism dependent on S. cerevisiae RAD27 is distinct from DNA mismatch repair.
Cell 1997, 88:253263. PubMed Abstract  Publisher Full Text

Negritto MC, Qiu J, Ratay DO, Shen B, Bailis AM: Novel function of Rad27 (FEN1) in restricting shortsequence recombination.
Mol Cell Biol 2001, 21:23492358. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Xie J, Qian M, Gong G: Reversible algorithm of simulating multivariate densities with multihump.
Science in China Series A 2001, 44:357364. Publisher Full Text

Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycleregulated genes of the yeast Sarcharomyces cerevisiae by microarray hybridization.
Mol Biol Cell 1998, 9:32733297. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Yeast Cell Cycle Analysis project [http://cellcyclewww.stanford.edu] webcite

Rangel C, Angus J, Ghahramani Z, Lioumi M, Sotheran E, Gaiba A, Wild DL, Falciani F: Modelling Tcell activation using gene expression profiling and state space models.
Bioinformatics 2004, 20:13611372. PubMed Abstract  Publisher Full Text

Ira G, Malkova A, Liberi G, Foiani M, Haber JE: Srs2 and Sgs1Top3 suppress crossovers during doublestrand break repair in yeast.
Cell 1999, 115:401411. Publisher Full Text

Lee SK, Johnson RE, Yu SL, Prakash L, Prakash S: Requirement of yeast SGS1 and SRS2 genes for replication and transcription.
Science 1999, 286:23392342. PubMed Abstract  Publisher Full Text

Fricke WM, Brill SJ: Slx1Slx4 is a second structurespecific endonuclease functionally redundant with Sgs1Top3.
Genes Dev 2003, 17:17681778. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Fabre F, Chan A, Heyer WD, Gangloff S: Alternate pathways involving Sgs1/Top3, Mus81/Mms4, and Srs2 prevent formation of toxic recombination intermediates from singlestranded gaps created by DNA replication.
Proc Natl Acad Sci USA 2002, 99:1688716892. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Onoda F, Seki M, Miyajima A, Enomoto T: Involvement of SGS1 in DNA damageinduced heteroallelic recombination that requires RAD52 in Saccharomyces cerevisiae.
Mol Gen Genet 2001, 264:702708. PubMed Abstract  Publisher Full Text

Ooi SL, Shoemaker DD, Boeke JD: DNA helicase gene interaction network defined using synthetic lethality analyzed by microarray.
Nat Genet 2003, 35:277286. PubMed Abstract  Publisher Full Text

McVey M, Kaeberlein M, Tissenbaum HA, Guarente L: The short life span of Saccharomyces cerevisiae sgs1 and srs2 mutants is a composite of normal aging processes and mitotic arrest due to defective recombination.
Genetics 2001, 157:15311542. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lo YC, Paffett KS, Amit O, Clikeman JA, Sterk R, Brenneman MA, Nickoloff JA: Sgs1 regulates gene conversion tract lengths and crossovers independently of its helicase activity.
Mol Cell Biol 2006, 26:40864094. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Collins SR, Miller KM, Maas NL, Roguev A, Fillingham J, Chu CS, Schuldiner M, Gebbia M, Recht J, Shales M, Ding H, Xu H, Han J, Ingvarsdottir K, Cheng B, Andrews B, Boone C, Berger SL, Hieter P, Zhang Z, Brown GW, Ingles CJ, Emili A, Allis CD, Toczyski DP, Weissman JS, Greenblatt JF, Krogan NJ: Functional dissection of protein complexes involved in yeast chromosome biology using a genetic interaction map.

Lemmens K, Dhollander T, Bie TD, Monsieurs P, Engelen K, Smets B, Winderickx J, Moor BD, Marchal K: Inferring transcriptional modules from ChIPchip, motif and mircoarray data.
Genome Biology 2006, 7:R37. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Tsai HK, Lu HHS, Li WH: Statistical methods for identifying yeast cell cycle transcription factors.
Proc Natl Acad Sci 2005, 12:1353213537. Publisher Full Text

HK Tsai, GT Huang, MY Chou, HH Lu, WH Li: Method for identifying transcription factor binding sites in yeast.
Bioinformatics 2006, 22:16751681. PubMed Abstract  Publisher Full Text

Guttman L: Some necessary conditions for commonfactor analysis.
Psychometrika 1954, 19:149161. Publisher Full Text

Cattell RB: The scree test for the number of factors.
Multivariate behavioural research 1966, 1:245276. Publisher Full Text

Kline RB: Principles and practice of structural equation modeling. The Guilford Press: New York NY, U.S.A; 1998.

Jöreskog KG: A general approach to confirmatory maximum likelihood factor analysis.
Psychometrika 1969, 34:183202. Publisher Full Text

Tucker LR, Lewic C: A reliability coefficient for maximum likelihood factor analysis.
Psychometrika 1973, 38:110. Publisher Full Text

Bentler PM: Comparative fit indices in structural equation models.
Psychological Bulletin 1990, 107:238246. PubMed Abstract  Publisher Full Text

Akaike H: Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory. Edited by Petrov BN, Csaki F. Akademiai Kiado, Budapest; 1973:267281.

Schwarz G: Estimating the dimension of a model.
Annals of Statistics 1978, 6:461464. Publisher Full Text

Sclove SL: Application of modelselection criteria to some problems in multivariate analysis.
Psychometrika 1987, 52:333343. Publisher Full Text

SSEM [http://www.stat.sinica.edu.tw/~gshieh/ssem.htm] webcite