Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan
Institute of Mathematics and Statistics, University of São Paulo, Rua do Matão, 1010, São Paulo 05508-090, Brazil
Abstract
Background
In analyzing the effects of cell treatments such as drug dosing, a key task is to identify changes in gene network structure between normal and treated cells. One possible way to identify such changes is to compare the structures of networks estimated separately from data on normal and treated cells. However, this approach usually fails to estimate accurate gene networks because the time series are short and the measurements are noisy. Approaches that identify changes in regulations by using time series data from both conditions efficiently are therefore in demand.
Methods
We propose a new statistical approach that is based on the state space representation of the vector autoregressive model and estimates gene networks under two different conditions in order to identify changes in regulations between the conditions. In the mathematical model of our approach, hidden binary variables are introduced to indicate the presence of each regulation under each condition. These hidden binary variables enable efficient use of the data: data from both conditions are used for regulations that exist in both, while only the corresponding data are used for condition-specific regulations. In addition, the similarity of the networks under the two conditions is taken into account automatically through the design of the potential function for the hidden binary variables. To estimate the hidden binary variables, we derive a new variational annealing method that searches for the configuration of the binary variables that maximizes the marginal likelihood.
Results
For performance evaluation, we use time series data from two topologically similar synthetic networks and confirm that the proposed approach estimates both commonly existing regulations and changes in regulations with higher coverage and precision than existing approaches in almost all experimental settings. As a real data application, the proposed approach is applied to time series data from normal human lung cells and from human lung cells treated by stimulating EGF-receptors and dosing an anticancer drug, Gefitinib. In the treated lung cells, a cancer cell condition is simulated by the stimulation of EGF-receptors, but this effect should be counteracted by the selective inhibition of EGF-receptors by Gefitinib. Nevertheless, gene expression profiles actually differ between the conditions, and the genes related to the identified changes are considered possible off-targets of Gefitinib.
Conclusions
On synthetically generated time series data, the proposed approach identifies changes in regulations more accurately than existing methods. By applying the proposed approach to time series data on normal and treated human lung cells, we find candidate off-target genes of Gefitinib. According to published clinical information, one of these genes may be related to a factor of interstitial pneumonia, which is a known side effect of Gefitinib.
Background
Gene network estimation from time series gene expression data is a key task for elucidating cellular systems. Thus far, a wide variety of approaches have been proposed based on the vector autoregressive (VAR) model
One possible way to find changes in regulations is to estimate networks from the two data sets separately and then compare their structures. However, because the time series are short (usually fewer than 10 time points) and measurement noise is not negligible, networks are estimated with high error rates, and these estimation errors lead to serious failures in identifying changes in regulations. Approaches that use the two time series efficiently are therefore strongly demanded. In addition, widely used statistical methods such as the VAR model and the dynamic Bayesian network assume equally spaced time points. However, the observed time points in commonly available time series data are not equally spaced
We propose a new statistical model that estimates gene networks under two different conditions in order to identify changes in regulations between the conditions. As the basis of the proposed model, we employ the state space representation of the VAR model (VARSSM), in which the observation model accounts for observation noise between the measured gene expressions and the true gene expressions, and the system model describes gene regulations between the true gene expressions
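The split between the system model and the observation model can be illustrated with a short simulation. This is a minimal sketch and not the authors' implementation: it assumes a first-order VAR with Gaussian noise and an identity observation matrix, and the function name `simulate_var_ssm`, the coefficient matrix `A`, and the default noise levels are chosen here for illustration only.

```python
import random


def simulate_var_ssm(A, n_steps, system_sd=1.0, obs_sd=0.1, seed=0):
    """Simulate a first-order VAR in state space form (illustrative assumptions).

    System model (assumed):      x_t = A x_{t-1} + system noise
    Observation model (assumed): y_t = x_t + observation noise
    """
    rng = random.Random(seed)
    p = len(A)
    # Initial hidden state drawn from the system noise distribution (assumption).
    x = [rng.gauss(0.0, system_sd) for _ in range(p)]
    hidden, observed = [], []
    for _ in range(n_steps):
        # System model: each gene depends linearly on all genes at the previous time point.
        x = [sum(A[i][j] * x[j] for j in range(p)) + rng.gauss(0.0, system_sd)
             for i in range(p)]
        hidden.append(x)
        # Observation model: the measurement is the true expression plus noise.
        observed.append([xi + rng.gauss(0.0, obs_sd) for xi in x])
    return hidden, observed


# Hypothetical two-gene network: A[i][j] is the effect of gene j on gene i.
A = [[0.5, 0.0],
     [0.3, 0.2]]
hidden, observed = simulate_var_ssm(A, n_steps=10)
```

Only `observed` would be available to an estimation method; `hidden` plays the role of the true gene expressions.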
The hidden binary variables are estimated by searching the configuration of binary variables that maximizes the marginal likelihood of the model. However, searching the optimal configuration is computationally intractable. Thus, as an alternative approach, we derive a new variational annealing method based on
For performance evaluation, we generate two regulatory networks in such a way that most regulations exist in both networks and some exist in only one of them. We then apply the proposed approach and existing VAR model-based and dynamic Bayesian network-based approaches to two equally spaced time series drawn separately from the generated networks. By comparing the true positive rates and false positive rates of these approaches, we confirm the effectiveness of our approach. We also generate unequally spaced time series data from these networks and show that our approach works correctly on such data, whereas the performance of the existing approaches, which assume equally spaced time points, deteriorates drastically.
The proposed approach is used to analyze changes in regulations in gene networks between normal human lung cells and human lung cells treated by stimulating EGF-receptors and dosing an anticancer drug, Gefitinib. A lung cancer condition is simulated in the treated cells by the stimulation of EGF-receptors. Since Gefitinib is known as a selective inhibitor of EGF-receptors, the stimulation of EGF-receptors should be counteracted by Gefitinib, and hence the treated cells are expected to be in the same condition as normal cells. However, gene expression profiles from normal and treated cells actually differ, which implies the existence of off-targets of Gefitinib that cause unexpected positive or negative effects. We focus on genes with changes in regulations between the networks estimated by our approach and find possible off-target genes of Gefitinib. According to published clinical information, one of the possible off-target genes is suggested to be a factor of interstitial pneumonia, which is a known side effect of Gefitinib.
Methods
Vector autoregressive model and its state space representation
Vector autoregressive model
Given gene expression profile vectors of
where
State space representation of VAR model (VARSSM)
Let
where
where
Joint model of VARSSM for two time series data
Let
where ∘ denotes the Hadamard product,
The complete likelihood of our model,
where the prior distribution
Here,
where
where
In this setting, if
where
Figure
A graphical representation of the proposed model. Hyperparameters are omitted from this representation. The nodes in gray denote observed data.
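The role of the hidden binary variables and the Hadamard product can be sketched as follows. This is an illustration under stated assumptions, not the model's exact parameterization: it assumes that each condition's coefficient matrix is the element-wise product of a shared coefficient matrix `A` and a condition-specific binary matrix (`E1`, `E2`), and all names here are hypothetical.

```python
def hadamard(A, E):
    """Element-wise (Hadamard) product of two matrices of equal shape."""
    return [[a * e for a, e in zip(row_a, row_e)] for row_a, row_e in zip(A, E)]


def changed_regulations(E1, E2):
    """Index pairs (i, j) whose binary indicator differs between the two conditions."""
    return [(i, j)
            for i, (r1, r2) in enumerate(zip(E1, E2))
            for j, (e1, e2) in enumerate(zip(r1, r2))
            if e1 != e2]


# Shared coefficients (assumed): entry (i, j) is the effect of gene j on gene i.
A = [[0.5, -0.3],
     [0.0, 0.8]]
# Binary indicators: 1 if the regulation is present under that condition.
E1 = [[1, 1],
      [0, 1]]
E2 = [[1, 0],
      [0, 1]]

A1 = hadamard(A, E1)  # coefficients active under condition 1
A2 = hadamard(A, E2)  # coefficients active under condition 2
changes = changed_regulations(E1, E2)
```

Under this reading, an entry that is 1 in both indicator matrices is informed by data from both conditions, while an entry that is 1 in only one of them is informed by that condition's data alone.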
For the parameter estimation, we search the configuration of
Finding the optimal configuration of
Parameter estimation by variational annealing
In deterministic annealing, an optimization problem is solved while the temperature is gradually changed according to some schedule, and a maximum likelihood estimator is obtained as in the EM algorithm
Let
The maximum of the marginal likelihood on
where
From the Gibbs inequality, the right side of Equation (3) is also bounded:
Here,
Thus, as an approximation of
In the hill climbing,
Gradually converging
Effectiveness of variational annealing
As alternatives to the variational annealing, we may consider the variational method and the EM algorithm where
In the following, we prove a proposition in order to show that the variational annealing can possibly give the optimal solution of Equation (2) even if the factorization of
Proposition 1.
For the proof of the proposition, see Section 1 in Additional file
Proof of Proposition 1 and more details on the procedures of variational annealing. A proof of Proposition 1 and more details on the procedures of variational annealing on the proposed model are described.
Procedures of variational annealing on proposed model
In the variational annealing on the proposed model, we calculate
Variational E-step
Parameters of
Variational M-step
Parameters for the above functions are calculated by using
Variational A-step
For the calculation of
Update and selection of hyperparameters
The proposed model contains
We first consider update of hyperparameters
For the selection of
Summary of procedures
The procedures for estimating parameters in the proposed model are summarized as follows:
1. Set
2. Initialize other hyperparameters and hidden variables.
3. Perform the following procedures:
(a) Calculate the variational M-step.
(b) Update the hyperparameters.
(c) Calculate the variational E-step.
(d) Calculate the variational A-step.
(e) Go back to step (a) until some convergence criterion is satisfied.
4. Divide
5. Go back to step 3 if
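The control flow of the procedure summarized above can be sketched as a nested loop. This is only a skeleton under stated assumptions: the temperature schedule is passed in as a sequence rather than fixed here, the step functions (`m_step`, `update_hyper`, `e_step`, `a_step`, `converged`) are hypothetical callbacks standing in for the model-specific updates, and `max_inner` is a safeguard added for illustration.

```python
def variational_annealing(state, temperatures, m_step, update_hyper,
                          e_step, a_step, converged, max_inner=100):
    """Skeleton of the annealing loop; all step functions are caller-supplied."""
    for tau in temperatures:            # steps 4 and 5: move through the schedule
        for _ in range(max_inner):      # step 3: iterate at a fixed temperature
            state = m_step(state, tau)        # (a) variational M-step
            state = update_hyper(state, tau)  # (b) hyperparameter update
            state = e_step(state, tau)        # (c) variational E-step
            state = a_step(state, tau)        # (d) variational A-step
            if converged(state, tau):         # (e) convergence check
                break
    return state
```

Passing the schedule as an argument keeps the sketch independent of any particular initial temperature, cooling rate, or stopping temperature.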
In our setting,
Results and discussion
Performance evaluation by Monte Carlo experiments
For the evaluation of the proposed approach, we generate two linear regulatory network models with similar topological structures
Figure
The graph structures of
From each of
Comparison between variational annealing and EM algorithm
We first compare the proposed approach with a variant that uses the same model but replaces the variational annealing with the EM algorithm. The comparison uses equally spaced time series data of 50 and 25 time points, with system noise of standard deviation 1 and observation noise of standard deviation 0.1. From the comparison, we verify the effectiveness of the variational annealing relative to the EM algorithm. Table
Comparison of the variational annealing (Proposed) and the EM algorithm (EM) based on the proposed model

(a)

| Method | 50 time points: # TP | # FP | PRE | 25 time points: # TP | # FP | PRE |
|---|---|---|---|---|---|---|
| Proposed | 295.9 | 41.7 | 0.88 | 238.4 | 71.6 | 0.77 |
| EM | 294.9 | 119.2 | 0.71 | 196.8 | 66.9 | 0.75 |

(b)

| Method | 50 time points: # TP | # FP | PRE | 25 time points: # TP | # FP | PRE |
|---|---|---|---|---|---|---|
| Proposed | 39.8 | 13.2 | 0.75 | 23.4 | 20.8 | 0.53 |
| EM | 39.9 | 39.4 | 0.5 | 11.5 | 10.0 | 0.53 |

(a) The number of true positives (# TP) and false positives (# FP) of estimated regulations in the two network models by the proposed approach and EM for equally spaced time series data. PRE denotes the precision of the results. There are 305 regulations in the two networks in total. (b) The number of true positives (# TP) and false positives (# FP) of changes in regulations between the two network models estimated by the proposed approach and EM for equally spaced time series data. There are 47 changed regulations between the two networks in total.
The results of the proposed approach contain more true positives than those of the EM algorithm-based approach, except in identifying changes in regulations with 50 time points. In that case the EM algorithm-based approach finds slightly more true positives (39.9 versus 39.8), but the difference is negligible. On the other hand, with 50 time points the results of the EM algorithm-based approach contain far more false positives than those of the proposed approach, and hence its precision is lower (0.71 versus 0.88 for regulations and 0.5 versus 0.75 for changes). With 25 time points the precisions of the two approaches are comparable, but the proposed approach recovers more true positives. The effectiveness of the variational annealing is therefore confirmed in the computational experiment as well.
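The precision values in the table can be related to the true positive and false positive counts with a short calculation. This is a sketch under assumptions: precision is taken as TP / (TP + FP) and coverage as TP divided by the number of true regulations, and because the reported numbers are averages over data sets, ratios of the averaged counts only approximate the reported precisions.

```python
def precision(tp, fp):
    """Fraction of estimated regulations that are correct (assumed definition)."""
    total = tp + fp
    return tp / total if total > 0 else float("nan")


def coverage(tp, n_true):
    """Fraction of true regulations that are recovered (assumed definition)."""
    return tp / n_true


# Averaged counts for estimated regulations with 50 time points.
pre_proposed = precision(295.9, 41.7)   # proposed approach
pre_em = precision(294.9, 119.2)        # EM algorithm-based approach
cov_proposed = coverage(295.9, 305)     # 305 true regulations in total
```

Rounded to two decimals, the two precision values agree with the PRE entries reported for 50 time points.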
Comparison between proposed approach and existing approaches
We employ the elastic net based VAR model approach
For the comparison of these approaches, we focus on the following two points: the number of correctly estimated regulations and the number of correctly estimated changes in regulations. The former is usually considered for evaluating the performance of gene network estimation methods. The numbers of true positives and false positives of the estimated regulations are summarized in Table
A summary of results for system noise with standard deviation 1 and observation noise with standard deviation 0.1
Eq. and Uneq. denote equally and unequally spaced time series data; 50 and 25 are the numbers of time points.

(a)

| Method | Eq. 50: # TP | # FP | PRE | Eq. 25: # TP | # FP | PRE | Uneq. 50: # TP | # FP | PRE | Uneq. 25: # TP | # FP | PRE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proposed | 295.9 | 41.7 | 0.88 | 238.4 | 71.6 | 0.77 | 262.4 | 42.1 | 0.86 | 110.7 | 37.2 | 0.75 |
| ENet1 | 246.3 | 119.7 | 0.67 | 109.2 | 67 | 0.62 | 84.7 | 140.6 | 0.38 | 20.3 | 70.4 | 0.22 |
| ENet2 | 277.9 | 130.9 | 0.68 | 212.8 | 130 | 0.62 | 169.7 | 241.5 | 0.41 | 65.5 | 132.5 | 0.33 |
| G1DBN1 | 223.7 | 48 | 0.82 | 99.9 | 46.2 | 0.68 | 65.1 | 83.1 | 0.44 | 19.3 | 72.7 | 0.21 |
| G1DBN2 | 268.8 | 83.4 | 0.76 | 188.1 | 64.5 | 0.74 | 134.8 | 104.4 | 0.56 | 46.7 | 85.7 | 0.35 |

(b)

| Method | Eq. 50: # TP | # FP | PRE | Eq. 25: # TP | # FP | PRE | Uneq. 50: # TP | # FP | PRE | Uneq. 25: # TP | # FP | PRE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proposed | 39.8 | 13.2 | 0.75 | 23.4 | 20.8 | 0.53 | 31.2 | 16.5 | 0.65 | 5.5 | 15.9 | 0.26 |
| ENet1 | 38.5 | 153.1 | 0.2 | 18.6 | 113 | 0.14 | 12.3 | 186.6 | 0.06 | 3.7 | 85 | 0.04 |
| ENet2 | - | - | - | - | - | - | - | - | - | - | - | - |
| G1DBN1 | 35.6 | 88.9 | 0.29 | 16.8 | 91.9 | 0.15 | 10.3 | 121.9 | 0.08 | 2.4 | 87.8 | 0.03 |
| G1DBN2 | - | - | - | - | - | - | - | - | - | - | - | - |

(a) The number of true positives (# TP) and false positives (# FP) of estimated regulations in the two network models by the proposed approach, ENet1, ENet2, G1DBN1, and G1DBN2 for equally and unequally spaced time series data. PRE denotes the precision of the results. There are 305 regulations in the two networks in total. (b) The number of true positives (# TP) and false positives (# FP) of changes in regulations between the two network models estimated by the proposed approach, ENet1, ENet2, G1DBN1, and G1DBN2 for equally and unequally spaced time series data. Since no changes are estimated by ENet2 and G1DBN2, their results are indicated by '-'. There are 47 changed regulations between the two networks in total.
are also provided. The results are averaged on ten data sets. The number of regulations in the true network models of
For the estimation of the regulations in Table
One may think it is strange that false positives in ENet1 and G1DBN1 in Table
To show the performance on unequally spaced time series data, we generate unequally spaced time series data of 25 and 50 observed time points. For time series data of 25 observed time points, we first generate equally spaced time series data of 40 time points and divide it into three blocks of 15, 10, and 15 time points. We then remove time points in the following manner: no time point is removed in the first block; one of every two time points is removed in the second block; and two of every three time points are removed in the third block. Figure
A time point schedule on unequally spaced time series data in the Monte Carlo experiment. Observed points in the time schedule are indicated by arrows. 15 time points are equally spaced in first block, every second point is observed in second block comprised of 5 observed time points, and every third point is observed in third block comprised of 5 observed time points.
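The thinning scheme for 25 observed time points can be written out as a small index computation. This is a minimal sketch: the block sizes and thinning rates follow the description above, while the function name `unequal_schedule` and the choice to keep the first point of each group (rather than another member of the group) are assumptions made for illustration.

```python
def unequal_schedule(block_sizes=(15, 10, 15), strides=(1, 2, 3)):
    """Return the indices of observed time points on the underlying equally spaced grid.

    A stride of 1 keeps every point, 2 keeps every second point,
    and 3 keeps every third point within the corresponding block.
    """
    observed = []
    start = 0
    for size, stride in zip(block_sizes, strides):
        # Keep the first point of each group of `stride` points (assumption).
        observed.extend(range(start, start + size, stride))
        start += size
    return observed


schedule = unequal_schedule()
n_observed = len(schedule)  # 15 + 5 + 5 observed points out of 40
```

The intervals between consecutive observed points are one, two, and three grid steps in the first, second, and third blocks, respectively.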
We also consider time series data with a high noise level: system noise with standard deviation 1 and observation noise with standard deviation 1. The results for this case are summarized in Tables
A summary of results for system noise with standard deviation 1 and observation noise with standard deviation 1
Eq. and Uneq. denote equally and unequally spaced time series data; 50 and 25 are the numbers of time points.

(a)

| Method | Eq. 50: # TP | # FP | PRE | Eq. 25: # TP | # FP | PRE | Uneq. 50: # TP | # FP | PRE | Uneq. 25: # TP | # FP | PRE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proposed | 190.2 | 122.8 | 0.61 | 88.1 | 121.0 | 0.42 | 132.1 | 675.5 | 0.66 | 52.1 | 108.9 | 0.33 |
| ENet1 | 110.8 | 136.9 | 0.45 | 30.3 | 75.9 | 0.29 | 32.5 | 133.7 | 0.2 | 7.4 | 75.2 | 0.09 |
| ENet2 | 189.8 | 218 | 0.47 | 85.8 | 136.2 | 0.39 | 75.9 | 180.7 | 0.3 | 23.5 | 123.3 | 0.16 |
| G1DBN1 | 86.6 | 82.6 | 0.51 | 22.6 | 90.8 | 0.2 | 26.3 | 74 | 0.26 | 7.1 | 71.6 | 0.09 |
| G1DBN2 | 163.9 | 105.7 | 0.61 | 54.4 | 99 | 0.35 | 66.2 | 91.2 | 0.42 | 17.4 | 92.8 | 0.16 |

(b)

| Method | Eq. 50: # TP | # FP | PRE | Eq. 25: # TP | # FP | PRE | Uneq. 50: # TP | # FP | PRE | Uneq. 25: # TP | # FP | PRE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proposed | 15.4 | 43.6 | 0.26 | 4.7 | 50.0 | 0.09 | 8.1 | 16.7 | 0.33 | 3.1 | 42.5 | 0.07 |
| ENet1 | 16.9 | 184 | 0.08 | 3.8 | 95.4 | 0.04 | 5.2 | 155 | 0.03 | 1.2 | 81.4 | 0.01 |
| ENet2 | - | - | - | - | - | - | - | - | - | - | - | - |
| G1DBN1 | 14.5 | 125.1 | 0.1 | 3.9 | 105.7 | 0.04 | 4 | 91.9 | 0.04 | 1.7 | 76.8 | 0.02 |
| G1DBN2 | - | - | - | - | - | - | - | - | - | - | - | - |

(a) The number of true positives (# TP) and false positives (# FP) of estimated regulations in the two network models by the proposed approach, ENet1, ENet2, G1DBN1, and G1DBN2 for equally and unequally spaced time series data. PRE denotes the precision of the results. There are 305 regulations in the two networks in total. (b) The number of true positives (# TP) and false positives (# FP) of changes in regulations between the two network models estimated by the proposed approach, ENet1, ENet2, G1DBN1, and G1DBN2 for equally and unequally spaced time series data. Since no changes are estimated by ENet2 and G1DBN2, their results are indicated by '-'. There are 47 changed regulations between the two networks in total.
Analysis of time series microarray data from Human small airway epithelial cells
We apply the proposed approach to two time series microarray gene expression data sets, one from normal human small airway epithelial cells (SAECs) and one from SAECs treated by stimulating EGF-receptors and dosing an anticancer drug, Gefitinib. EGF-receptors are often overexpressed in lung cancer cells such as tumoral SAECs, and a lung cancer condition is simulated in the treated SAECs by stimulating EGF-receptors. Since Gefitinib is known as a selective inhibitor of EGF-receptors, the stimulation of EGF-receptors should be counteracted by Gefitinib, and the condition of treated SAECs should in theory be the same as that of normal SAECs. In practice, however, some gene expression patterns differ between the two conditions, so some unknown effects of Gefitinib may be involved. We therefore focus on changed regulations between the gene networks estimated from gene expression data under these two conditions in order to gain insight into the unknown effects of Gefitinib.
For gene set selection, we first screen 500 genes from the ranking of the gene list sorted by coefficient of variation
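A ranking by coefficient of variation can be sketched as follows. This is an illustration only: it assumes the coefficient of variation is the sample standard deviation divided by the absolute mean and that genes with the largest values are kept; the function name `top_genes_by_cv` and the toy input are hypothetical, and the sketch covers only this first screening step.

```python
from statistics import mean, stdev


def top_genes_by_cv(expression, k):
    """Rank genes by coefficient of variation and return the k highest.

    `expression` maps a gene name to its list of expression values.
    """
    cv = {}
    for gene, values in expression.items():
        m = mean(values)
        if m != 0:  # the coefficient of variation is undefined for a zero mean
            cv[gene] = stdev(values) / abs(m)
    return sorted(cv, key=cv.get, reverse=True)[:k]


# Toy profiles for three hypothetical genes.
profiles = {
    "geneA": [1.0, 1.0, 1.0],
    "geneB": [1.0, 2.0, 3.0],
    "geneC": [10.0, 11.0, 12.0],
}
selected = top_genes_by_cv(profiles, k=2)
```

Dividing by the mean makes the ranking insensitive to the overall expression level, so a gene with a high baseline does not rank highly merely because its absolute fluctuations are large.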
The estimated networks from time series gene expression data in normal SAECs and treated SAECs are summarized and given in Figure
An estimated gene network by the proposed approach from time series gene expression data on normal SAECs and treated SAECs. In the estimated network, regulations in both conditions are in black, and regulations only in normal SAECs and treated SAECs are in red and green, respectively.
Changes on regulations between normal and treated SAECs
Normal SAECs
Treated SAECs
ZC3HAV1L → FOXA2
Prss22 → foxn2
LIF → foxn2
Prss22 → cdk14
Cdc42ep2 → Spink6
Prss22 → Camk2n1
Siglec15 → NTN1
Prss22 → cttn
HAS3 → HAS3
Prss22 → Sfrs6
HAS3 → Enc1
Prss22 → ITGA2
HAS3 → LEPREL1
Prss22 → pkn2
Prss22 → Hs2st1
Prss22 → FILIP1L
Prss22 → Hcn2
Prss22 → KLF16
Ktelc1 → NTN1
Tm6sf1 → Siglec15
Estimated regulations only in normal or treated SAECs are listed in the left side or right side, respectively.
From Table
We also focus on several other genes related to changes in regulations between normal and treated SAECs. LIF, leukemia inhibitory factor, is known to affect cell growth and development. Gefitinib is also known to be effective for acute myelogenous leukemia via Sky, which is an off-target gene of Gefitinib
Heikema
Although the stimulation of EGF-receptors in the treated SAECs is considered to be counteracted by Gefitinib, the expression of some genes may be affected by the stimulation in practice. HAS3 is involved in the synthesis of the unbranched glycosaminoglycan hyaluronic acid and is reported to be upregulated by EGF
Conclusions
We proposed a new computational model that is based on VARSSM, estimates gene networks from time series data on normal and treated conditions, and identifies changes in regulations caused by the treatment. Unlike many existing gene network estimation approaches, which assume equally spaced time points, our approach can handle unequally spaced time series data. Efficient use of the time series data is achieved by representing the presence of regulations under each condition with hidden binary variables. Since finding the optimal configuration of the hidden binary variables in the proposed model is computationally intractable, we derived an extended variational annealing method as an alternative.
In the Monte Carlo experiments, we used equally and unequally spaced time series data from two synthetically generated regulatory networks whose structures differ in several regulations, and verified the effectiveness of the proposed model, compared to existing methods, in estimating both regulations and changes in regulations between the two conditions.
As a real data application, we used the proposed approach to analyze two time series data sets from normal SAECs and SAECs treated by stimulating EGF-receptors and dosing Gefitinib. From the genes related to changes in regulations caused by the treatment, we found possible off-target genes of Gefitinib, and one of these genes is suggested to be related to a factor of interstitial pneumonia, which is a known side effect of Gefitinib. In this study we considered changes in regulations between two conditions, but the proposed approach can be extended to identify changes among more than two conditions.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
KK, SI, RY, and SM designed the approach to identify the changes in regulations caused by the cell treatment of SAECs. KK, SI, and AF contributed to the statistical modeling for the approach and devised the details of the methodology for estimating the proposed model. MY and NG carried out the microarray experiments to measure the time series gene expression data on normal and treated SAECs.
Acknowledgements
We thank the anonymous reviewers for their constructive comments and suggestions, which improved the quality of this publication.
This article has been published as part of