Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202, USA

Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA

Laboratory of Breast Cancer Epigenomics, The Ohio State University, Columbus, OH 43210, USA

Laboratory of Ovarian Cancer Epigenomics, Indiana University, Bloomington, IN 47405, USA

Abstract

Background

Typical analysis of time-series gene expression data such as clustering or graphical models cannot distinguish between early and later drug responsive gene targets in cancer cells. However, these genes would represent good candidate biomarkers.

Results

We propose a new model - the dynamic time order network - to distinguish and connect early and later drug responsive gene targets. This network is constructed based on an integrated differential equation. Spline regression is applied for an accurate modeling of the time variation of gene expressions. Then a likelihood ratio test is implemented to infer the time order of any gene expression pair. One application of the model is the discovery of estrogen response biomarkers. For this purpose, we focused on genes whose responses are late when the breast cancer cells are treated with estradiol (E2).

Conclusions

Our approach has been validated by successfully finding time order relations between genes of the cell cycle system. More notably, we found late response genes potentially interesting as biomarkers of E2 treatment.

Background

Breast cancer is a major public health issue: it accounts for 22.9% of all cancers in women and is an important cause of death.

Biomarkers often refer to proteins measured in the blood whose concentrations reflect the presence or the severity of a disease. In the case of estrogen treatment, biomarkers can be seen as parameters reflecting the effects of the drug on the patient. Biomarkers of hormone therapy for breast cancer are not well developed. For instance, although the pharmacological mechanism of tamoxifen is well known, its clinical biomarkers are not yet well established. Understanding the cascade of the estrogen signaling pathway is key to studying potential biomarkers.

Gene expression-based biomarker discovery has proven effective for breast cancer.

Unfortunately, standard methods might fail to reveal key biomarkers, since they do not take into account the temporal aspect of gene expression and the complex network of gene regulation. To tackle this issue, the analysis of time-series data through dynamic networks represents an efficient alternative.

Late response genes might represent relevant biomarkers because they are more stable over time. Our approach relies on this biological aspect of biomarker discovery. To identify late response genes, we propose a new model based on a dynamic time order network (DTON). The model interpretation is simple and intuitive: it reflects which genes are expressed early and which ones late after the hormone treatment. The DTON is constructed based on an integrated differential equation. Spline regression is applied for an accurate modeling of the time variation of gene expression. A likelihood ratio test is implemented to infer the time order of any gene expression pair. The advantages of this modeling approach are numerous: (i) closed-form expressions of the ODEs, (ii) accurate modeling of the time-series data by using spline regression and by integrating differential equations, and (iii) model learning that involves simple regressions, which are quick to compute, with only a few parameters to estimate. The method has been validated by successfully finding time order relations between genes of the cell cycle system. Most importantly, we found late response genes as candidate biomarkers of E2 treatment.

This paper is organized as follows. Section Materials and methods first describes experiments and data preprocessing. Late response genes are defined and discussed. Then the dynamic time order network and its model learning are presented. It is described how dynamic time order relations between genes are inferred through a likelihood ratio test. The next section illustrates our method on real data analysis. Our model is validated with the well-known cell cycle system. Late response genes of E2 treatment are discovered. Finally, the last section concludes and points out promising perspectives.

Materials and methods

Experiment and data preprocessing

The gene expression data come from estrogen stimulated ZR_75_1 cells. _{0 }- _{1 }synchronization cells were treated with 10^{-8 }_{t }

Late response gene

In breast cancer cells, Cicatiello

Biologically, we favor late response genes because of their clinical implications. To check whether a drug works in human,

The dynamic time order relationship

Let G_{1}(t) and G_{2}(t) denote the logarithmic concentration ratios (LCRs) of two genes G_{1} and G_{2} over the time t. Suppose G_{1} and G_{2} have a dynamic time order relation such that the expression of G_{2} is later than that of G_{1}. This relation is denoted as G_{1} → G_{2}. Then the changing rate of G_{2} should be related to the LCR of G_{1} and itself (Equation 1).

**Logarithmic concentration ratios of two genes G_{1} and G_{2}**. a) A scenario showing a trivial order between the two genes. b) A more complex scenario showing a non-trivial order.

In Equation (1), _{2 }expression. Alternatively, Equation (1) can be expressed by integration:

In Equation (2), _{1}(_{2}(_{1 }and _{2}. The integration of the ODE can help to better distinguish which gene is firstly expressed in a non-trivial scenario, such as the one presented in Figure _{1 }and _{2 }only during the early time (because only in the early time we observe a significant difference between the two rates). By integrating the ODE (see Equation (1)), the model can take into account all the variation of the gene LCR (in early and late times). Note that this dynamic time order relation does not imply any causal relation between two genes but only indicates which one is expressed after the other.
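As a hedged illustration of why integration helps, the block below writes one linear ODE and its integrated form. It is not a reproduction of Equations (1) and (2): the coefficient names a_0, a_1 and a_2 are assumptions, chosen only so that the integrated form matches the regression design **X** = (**1**, **F**_{1}, **F**_{2}) used in the time order determination.

```latex
\frac{dG_2(t)}{dt} = a_1\, G_1(t) + a_2\, G_2(t)
\quad\Longrightarrow\quad
G_2(t) = a_0 + a_1 \int_0^t G_1(s)\,ds + a_2 \int_0^t G_2(s)\,ds,
\qquad a_0 = G_2(0).
```

In this form the right-hand side accumulates the LCR of both genes over the whole interval [0, t], so late differences between the two curves contribute to the fit as much as early ones.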

Natural cubic spline regression

In order to apply the integrated ODE model (Equation 2), a smooth curve is required to fit gene expression over the time. For this purpose, natural cubic spline regression (NCSR)

with _{i}_{i}_{i }_{i }

with _{i}

The time interval of our gene expression data is _{i }

**Decomposition of the cubic function using knots**.

Let _{ij }= (_{ij0}, _{ij1}, _{ij2}, _{ij3})^{T }**t **= (1, ^{2}, ^{3}), then _{i }

with

In our study, we have 12 different time points _{i }**y**_{i }_{i0}, ..., _{i32}). Based on Equation 5, the likelihood for the NCSR model of gene _{i }

The parameters _{ij }are learned by maximizing the likelihood in Equation 6 with constrains (see Additional file _{i1 }and _{i3 }(see Additional File

**Solving of parameters β_{i1} and β_{i3}**.

We can simplify the joint likelihood in Equation 6 as follows:

where **t*** can be solved by the following way:

The maximum likelihood estimator of _{i2 }for gene _{i }

with **T* **a 12-by-4 matrix (presented in Additional file **T***, each row **t*** at the time point _{i}_{i }

**Matrix T***.

**Gene expression of APLP2 fitted by natural cubic spline regression with 2 knots K_{1} and K_{2}**.
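A natural cubic spline fit of this kind can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the authors' R implementation: it uses the standard truncated-power basis with two boundary knots and two interior knots (four coefficients, matching the four columns of **T***), and plain least squares, which coincides with the Gaussian maximum likelihood estimate. The sampling times, knot positions and LCR values are hypothetical.

```python
def natural_spline_basis(t, knots):
    """Natural cubic spline basis at time t (truncated-power form).

    `knots` lists all knots in increasing order, boundary knots included,
    so four knots give four basis functions.
    """
    def pos3(x):
        return x ** 3 if x > 0 else 0.0

    last = knots[-1]

    def d(k):
        return (pos3(t - knots[k]) - pos3(t - last)) / (last - knots[k])

    n_knots = len(knots)
    return [1.0, float(t)] + [d(k) - d(n_knots - 2) for k in range(n_knots - 2)]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for j in range(col, n + 1):
                m[r][j] -= factor * m[col][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_spline(times, values, knots):
    """Least-squares spline coefficients via the normal equations."""
    design = [natural_spline_basis(t, knots) for t in times]
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * y for row, y in zip(design, values)) for i in range(p)]
    return solve(xtx, xty)


def predict(t, coef, knots):
    """Evaluate the fitted spline at time t."""
    return sum(c * b for c, b in zip(coef, natural_spline_basis(t, knots)))


# Hypothetical inputs: 12 sampling times (hours) and synthetic LCR values.
times = [0, 1, 2, 4, 6, 8, 12, 16, 20, 24, 28, 32]
lcr = [0.0, 0.1, 0.3, 0.7, 1.0, 1.2, 1.4, 1.5, 1.5, 1.6, 1.6, 1.6]
knots = [0, 4, 16, 32]  # boundary knots plus two assumed interior knots
coef = fit_spline(times, lcr, knots)
smoothed = [predict(t, coef, knots) for t in times]
```

Because the fitted curve is a piecewise cubic, its integral over any interval is available in closed form, which is what makes the integrated model convenient to evaluate.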

Time order determination

Based on Equation 2, the dynamic time order relationship between two genes can be learned using the following multiple linear regression:

with _{i}**y**_{it }_{i }_{i }_{i1}, _{i2 }and _{i3 }is calculated as follows:

where

We apply the model in Equation 10 to every pair of genes to determine whether there is a dynamic time order relation between them. The pairwise regression models for two genes _{1 }and _{2 }are:

with _{1 }and _{2}, and _{i}_{i}_{i}**X **= (**1, F**_{1}, **F**_{2}).

Thus in Equations 12 and 13, values of **y**_{i} **X** (right-hand side) result from the integration of the NCSR functions. For the pair of genes G_{1} and G_{2}, the model in Equation 12 represents the dynamic time order relation G_{2} → G_{1} and the model in Equation 13 represents the dynamic time order relation G_{1} → G_{2}.
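The two pairwise models can be sketched as follows. This is an illustration under several assumptions rather than the authors' code: the integrals **F**_{1} and **F**_{2} are approximated with the trapezoid rule instead of the closed-form integral of the fitted splines, the sampled LCR values are used directly as responses, and the direction is chosen by simply comparing the two Gaussian log-likelihoods. All function names and data are hypothetical.

```python
import math


def cumulative_integral(times, values):
    """Trapezoid-rule estimate of the integral from times[0] to each time."""
    total, out = 0.0, [0.0]
    for i in range(1, len(times)):
        total += 0.5 * (values[i] + values[i - 1]) * (times[i] - times[i - 1])
        out.append(total)
    return out


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for j in range(col, n + 1):
                m[r][j] -= factor * m[col][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_loglik(y, design):
    """Least-squares fit and Gaussian log-likelihood at the ML variance."""
    p, n = len(design[0]), len(y)
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    coef = solve(xtx, xty)
    rss = sum((yi - sum(c * x for c, x in zip(coef, row))) ** 2
              for yi, row in zip(y, design))
    sigma2 = max(rss / n, 1e-12)  # guard against log(0) on a perfect fit
    return coef, -0.5 * n * (math.log(2 * math.pi * sigma2) + 1.0)


def time_order(times, g1, g2):
    """Compare the two pairwise models that share the design (1, F1, F2)."""
    f1 = cumulative_integral(times, g1)
    f2 = cumulative_integral(times, g2)
    design = [[1.0, a, b] for a, b in zip(f1, f2)]
    _, ll_y1 = fit_loglik(g1, design)  # response y_1: relation G2 -> G1
    _, ll_y2 = fit_loglik(g2, design)  # response y_2: relation G1 -> G2
    diff = ll_y2 - ll_y1
    # Assumed decision rule: the better-fitting model gives the direction.
    return ("G1->G2" if diff > 0 else "G2->G1"), diff


# Hypothetical LCR profiles sampled at 12 time points (hours).
times = [0, 1, 2, 4, 6, 8, 12, 16, 20, 24, 28, 32]
g1 = [0.0, 0.8, 1.2, 1.5, 1.4, 1.3, 1.2, 1.1, 1.0, 1.0, 0.9, 0.9]
g2 = [0.0, 0.0, 0.1, 0.1, 0.2, 0.4, 0.8, 1.2, 1.5, 1.7, 1.8, 1.9]
direction, loglik_difference = time_order(times, g1, g2)
```

The log-likelihood difference returned here plays the role of an edge weight when the pairwise results are assembled into a network.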

Pairwise regressions are then computed for all pairs of genes and the log-likelihoods are calculated (see Additional file

**Likelihood computation of regression for the time order determination**.

Network construction

After determining the time order relationships, an

Small network

When the network is small (fewer than one hundred nodes), it is worthwhile to keep as much information as possible about time order relations. The best strategy in this case is to fine-tune a threshold used to remove non-significant edges. For this purpose, a simple and efficient approach is to use the median or another quantile of the distribution of log-likelihood difference values. A simplification step is then used to remove redundant edges. For instance, when one observes
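The thresholding and simplification steps can be sketched as below. This is a minimal illustration with hypothetical gene labels and weights. It assumes that a "redundant" edge is one already implied by a longer directed path (a transitive reduction), and that the thresholded graph is acyclic; neither assumption is stated in the text above.

```python
from statistics import median


def threshold_edges(weights, cut=None):
    """Keep directed edges whose weight reaches the cut (median by default)."""
    if cut is None:
        cut = median(weights.values())
    return {edge for edge, weight in weights.items() if weight >= cut}


def reachable(edges, src, dst, skip):
    """Depth-first search from src to dst that ignores the edge `skip`."""
    adjacency = {}
    for a, b in edges:
        if (a, b) != skip:
            adjacency.setdefault(a, []).append(b)
    stack, seen = [src], {src}
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def simplify(edges):
    """Drop every edge that is implied by another directed path."""
    return {e for e in edges if not reachable(edges, e[0], e[1], skip=e)}


# Hypothetical log-likelihood differences for directed gene pairs.
weights = {("A", "B"): 5.0, ("B", "C"): 4.0, ("A", "C"): 6.0,
           ("C", "D"): 1.0, ("A", "D"): 0.5}
kept = threshold_edges(weights)   # median cut at 4.0
network = simplify(kept)          # A -> C is implied by A -> B -> C
```

Passing another quantile as `cut` gives a stricter or looser network without changing the simplification step.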

Genome-wide network

When the network is huge, such as the genome-wide network from the microarray data, the previous approach cannot be used. A low threshold value creates a highly connected network that is too complex to manipulate and visualize, whereas a high threshold value leads to a graph with many connected components, in which time orders can only be inferred between genes of the same component. To tackle this issue, we compute the maximum weight spanning tree (MWST). This graph has several advantages: (i) its tree shape is a very simple structure that is easy to manipulate and visualize, and (ii) every pair of nodes is connected by a path, so the time order relation between any two genes can be accessed. Besides, the MWST can be quickly computed in ^{2}
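A maximum weight spanning tree can be obtained with any standard spanning tree algorithm run on decreasing weights. The sketch below uses Kruskal's algorithm with a union-find structure; this is a swapped-in textbook choice, since the text does not say which algorithm the authors used. Node labels and weights are hypothetical, and the direction of each kept edge is carried by the order of its endpoints.

```python
def maximum_weight_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm on edges sorted by decreasing weight."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the component root, halving the path.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    ordered = sorted(weighted_edges, key=lambda e: e[0], reverse=True)
    for weight, src, dst in ordered:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:  # the edge joins two components
            parent[root_src] = root_dst
            tree.append((src, dst, weight))
    return tree


# Hypothetical weighted edges: (log-likelihood difference, earlier, later).
nodes = ["A", "B", "C", "D"]
edges = [(5.0, "A", "B"), (4.0, "B", "C"), (3.0, "A", "C"),
         (2.0, "C", "D"), (1.0, "B", "D")]
tree = maximum_weight_spanning_tree(nodes, edges)
```

The result always has one edge fewer than the number of nodes when the input graph is connected, which is what keeps the genome-wide display tractable.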

Biological interpretation of the model

The dynamic time order network (DTON) has a biological interpretation. It is illustrated in Figure

**Dynamic time order network and its biological interpretation**. The blue nodes indicate the late response genes whereas the red nodes point out the remaining genes.

Implementation

Our learning method is implemented in R. The R source code is available on request. For graph drawing and display, the software Tulip (

Results and discussion

Reproducing the cell cycle temporal system

The cell cycle temporal system is a good benchmark for evaluating our method. In this subsection, in order to see whether we can reproduce the time order relations, we focused on key cell cycle genes. mRNA expression data were selected for twelve genes: cyclin A1 (CCNA1), cyclin A2 (CCNA2), cyclin B1 (CCNB1), cyclin B2 (CCNB2), cyclin D1 (CCND1), cyclin D3 (CCND3), cyclin E1 (CCNE1), cyclin E2 (CCNE2), cyclin-dependent kinase 1 (CDK1), cyclin-dependent kinase 2 (CDK2), cyclin-dependent kinase 4 (CDK4) and cyclin-dependent kinase 6 (CDK6). Regressions have been computed for all pairs of genes. Then, the network of cell cycle genes has been computed by thresholding using the median of the log-likelihood differences. After simplification, the inferred network is composed of 27 time order relations. It is depicted in Figure

**Cell cycle temporal system modeling**. a) The inferred dynamic time order network. b) A schematic representation of the cell cycle temporal system

Genome-wide network

For genome-wide network modeling, an MWST has been constructed from all pairwise regressions on the 5003 genes. The network is depicted in Figure

**Genome-wide dynamic time order network**. It has been computed for all the 5003 genes. Big circles represent incoming-edge hubs at the center (blue) connected to a very large number of nodes (red). The color code is the same as in Figure 3.

List of the 10 most important incoming-edge hubs.

| Gene | Number of incoming edges | Expression |
|---|---|---|
| CEACAM6 | 2783 | Underexpressed |
| EPAS1 | 417 | Underexpressed |
| CALB2 | 250 | Overexpressed |
| UPK1A | 171 | Underexpressed |
| KRT81 | 150 | Underexpressed |
| PDZK1 | 130 | Overexpressed |
| MT2A | 102 | Overexpressed |
| FANCD2 | 78 | Overexpressed |
| C20orf160 | 62 | Overexpressed |
| WDR51A | 49 | Overexpressed |
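A hub ranking of this kind amounts to counting incoming edges in the directed tree. The short sketch below shows the idea on a hypothetical edge list; the function name and gene labels are illustrative and the counts are unrelated to the values reported above.

```python
from collections import Counter


def incoming_hubs(edges, top=10):
    """Rank nodes by their number of incoming edges (earlier -> later)."""
    counts = Counter(later for _, later in edges)
    return counts.most_common(top)


# Hypothetical directed edges of a small time order tree.
edges = [("g1", "hub"), ("g2", "hub"), ("g3", "hub"), ("g4", "g1"), ("g5", "g1")]
ranking = incoming_hubs(edges, top=2)
```

Nodes with many incoming edges are those expressed after many other genes, which is why the hubs are read as late response genes.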

**Gene expression profiles for the 10 most important incoming-edge hubs**. LCR: logarithmic concentration ratio.

The identification of late response genes is not a well-studied issue. Most notably, no dedicated method has been developed for this purpose. Nevertheless, we compared our method with standard approaches in gene expression analysis: agglomerative hierarchical clustering (AHC) and t-tests. On the one hand, AHC is a widely used tool to cluster gene expression profiles. After computing AHC, we used the silhouette criterion to determine the optimum number

**Cluster profiles obtained with agglomerative hierarchical clustering, for all the 5003 genes**.

**Profiles of late response genes identified using t-tests**.

**Venn diagram to compare the late response genes identified from the dynamic time order network (DTON), the agglomerative hierarchical clustering (AHC) and the t-tests**.

We also searched the literature to check whether the late response genes identified with our method can be good candidate biomarkers. Since biomarkers are molecules that are observed in cancer patients but not in healthy people, they are likely to be genes overexpressed after E2 treatment. Among the overexpressed hubs of the network, CALB2, PDZK1, MT2A and FANCD2 are well known in the literature as diagnostic markers of breast cancer and E2 response

Conclusion

Based on experiments carried out on time-series gene expression data, our dynamic time order network has been shown to efficiently distinguish and connect early and late response genes. First, our model faithfully reproduced the cell cycle temporal system. Of the 27 time order relations inferred, 89% correspond to the state-of-the-art network, 11% cannot be checked, and none is false. Second, our approach was successfully applied at a genome-wide level. The learning method was able to process five thousand genes, and the network simplification through the maximum weight spanning tree provided a graphical display of the huge network. Most notably, several incoming-edge hubs showing very high connectivity were discovered. All these hubs showed late gene response profiles. Those that are overexpressed over time have been reported as biomarkers of breast cancer and E2 response in the literature and databases.

The comparison of results with other approaches is not straightforward, since our method is the only one dedicated to identifying late response genes. When compared with standard methods in gene expression analysis, our approach yielded specific results, contrary to agglomerative hierarchical clustering. Moreover, it does not need any complex thresholding, unlike a t-test strategy. It is worth noting that all genes identified with DTON showed late responses, while this is not the case with the t-test strategy. Besides, our approach is based on the comparison of gene expression integrals combined with cubic spline regression, thus offering an accurate assessment of time order relations.

The discovery of biomarkers is one of the applications of our model. The distinction between early and late response genes is also an important application in developmental biology, where understanding the temporal aspect of gene expression is a key issue, for example in cell differentiation. For the moment, we have mainly focused on the identification of late response genes. Another graph model would be more efficient for pointing out early response genes than the MWST, which tends to display incoming-edge hubs.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

PZ and RM both wrote the paper. PZ, RM, YX, KH and LL conceived the dynamic time order network. PZ and RM carried out the implementation and the experiments. LL, TH, KN and YL designed the study and participated in its coordination. All authors read and approved the final version of the manuscript.

Acknowledgements

The authors are grateful to the three anonymous referees for constructive comments and help in improving their manuscript. This work is supported by National Cancer Institute awards CA113001 (to T.H-M.H. and K.P.N.). Yang Xiang was supported by the National Science Foundation under Grant #1019343 to the Computing Research Association for the CIFellows Project.

This article has been published as part of