Key Laboratory for Applied Statistics of MOE and School of Mathematics and Statistics, Northeast Normal University, Changchun, PR China

Department of Mathematics, Washington University in St Louis, St Louis, USA

Abstract

Background

Time-course microarray experiments produce vector gene expression profiles across a series of time points. Clustering genes based on these profiles is important for discovering functionally related and co-regulated genes. Early clustering algorithms do not take advantage of the ordering in a time-course study, although explicit use of this ordering should allow more sensitive detection of genes that display a consistent pattern over time. Peddada

Results

We propose a computationally efficient information criterion-based clustering algorithm, called ORICC, that accounts for the ordering in time-course microarray experiments by embedding order-restricted inference into a model selection framework. Each gene is assigned to the profile it best matches, as determined by a newly proposed information criterion for order-restricted inference. We also develop a bootstrap procedure to assess the reliability of ORICC's clustering for every gene. Simulation studies show that the ORICC method is robust, gives better clustering accuracy than Peddada's method in the settings considered, and is hundreds of times faster. Under some scenarios, its accuracy is also better than some other existing clustering methods for short time-course microarray data, such as STEM

Conclusion

Our ORICC algorithm, which takes advantage of the temporal ordering in time-course microarray experiments, provides good clustering accuracy and is much faster than Peddada's method. Moreover, the clustering reliability of each gene can be assessed, which is not available in Peddada's method. In a real data example, the ORICC algorithm identifies new and interesting genes that previous analyses failed to reveal.

Background

The development of microarray technology provides a powerful analytical tool for large-scale genomic research. Because it can study thousands of genes simultaneously under a multitude of conditions, comprehending and interpreting the resulting mass of data is a huge challenge. An important application of microarray technology is to study the dynamic patterns of gene expression across a series of time points and to find gene clusters within which genes share similar patterns. The premise is that genes sharing similar expression profiles might be functionally related or co-regulated. Therefore, microarray data may provide insights into gene-gene interactions, gene function and pathway identification. Examples of such studies include response to temperature changes and other stress conditions

In this article, taking a different perspective on order-restricted inference, we propose a new order-restricted information criterion-based clustering (ORICC) algorithm, which is computationally much more efficient than Peddada's method. Our method selects and clusters genes using the ideas of model selection for order-restricted inference, where estimation makes use of the inequalities that define the candidate profiles. The first step is to define candidate profiles and express them in terms of inequalities between the expected gene expression levels at various time points. For a given candidate profile, we estimate the mean expression level of each gene at the different time points using the order-restricted maximum likelihood

Results and discussion

Inequality profiles

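Suppose that a time-course microarray experiment includes $G$ genes, $T$ time points and $M$ replicates at each time point. Denote by $Y_{gti}$ the expression measurement of gene $g$ at time point $t$ in replicate $i$, and by $\mu_{gt}$ its mean, i.e. $\mu_{gt} = E(Y_{gti})$ for all $i$. Let $\mu_g = (\mu_{g1}, \mu_{g2}, \cdots, \mu_{gT})^{T}$. In the following, we define some typical inequality profiles, and we drop the subscript $g$ for simplicity.

Flat profile:

$$C_0 = \{\mu \in R^T : \mu_1 = \mu_2 = \cdots = \mu_T\}. \qquad (1)$$

Profile with no order constraints:

$$C_\perp = \{\mu \in R^T : \mu_i \perp \mu_j \text{ for all } i \neq j\}, \qquad (2)$$

where $\mu_i \perp \mu_j$ means that there is no defined inequality constraint between $\mu_i$ and $\mu_j$.

Monotone increasing profile (simple order):

$$C_\uparrow = \{\mu \in R^T : \mu_1 \le \mu_2 \le \cdots \le \mu_T\} \qquad (3)$$

(with at least one strict inequality). Similarly, a monotone decreasing profile $C_\downarrow$ is given by replacing $\le$ by $\ge$ in (3).

Up-down profile with maximum at time point $i$ (umbrella order):

$$\{\mu \in R^T : \mu_1 \le \mu_2 \le \cdots \le \mu_i \ge \mu_{i+1} \ge \cdots \ge \mu_T\} \qquad (4)$$

(with at least one strict inequality among $\mu_1 \le \mu_2 \le \cdots \le \mu_i$ and one among $\mu_i \ge \mu_{i+1} \ge \cdots \ge \mu_T$). Genes satisfying this profile have mean expression values non-decreasing in time up to time point $i$ and non-increasing thereafter. A down-up profile with minimum at time point $i$ is defined analogously by reversing the inequalities in (4).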

(with at least one strict inequality among each monotone sub-profile). Cyclical profiles may be important in relatively long time-course experiments where the mean expression value could oscillate.

(with at least one strict inequality among each monotone sub-profile). Profiles (6) are useful when the investigator is unable to specify inequalities between certain means.

Information-criterion based clustering using order-restricted maximum likelihood

Our procedure seeks to match a gene's true profile, estimated from the observed data, to one of a specified set of candidate profiles. Provided the relationship of a gene's mean expression levels at different time points is defined by a given candidate profile, we first obtain the order-restricted maximum likelihood estimates (MLE) of the gene's mean expression levels at all time points. Details for simple order and umbrella order constraints are given in the Methods section. A general discussion of order-restricted MLE can be found in _{1}(_{2}(

where

The ORIC function is similar to AIC and BIC in essence with

One-stage ORICC

Step 1. Pre-specify a collection of candidate profiles, $\{C_1, \ldots, C_d\}$. To prevent genes with very little change over time from being matched to these profiles, we also include $C_0$, defined in (1), in the collection.

Step 2. Compute

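Step 3. For gene $g$, compute the order-restricted MLE of $(\mu_{g1}, \cdots, \mu_{gT})$ and the maximum log-likelihood under each candidate profile $C_\lambda$, and evaluate the ORIC function for each profile.

Step 4. For gene $g$, assign it to the profile, among $C_\lambda$ with $0 \le \lambda \le d$, that it best matches according to the ORIC.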

Step 5. Repeat Steps 3 and 4 for every gene.

Although our one-stage ORICC algorithm is hundreds of times faster than Peddada's method, performing Step 3 for all genes can still be time-consuming, even when only a moderate number of candidate profiles is considered, because the number of genes is generally huge. This issue is more pressing for longer time-course microarray studies, where more candidate profiles usually need to be considered. Next, we propose a computationally more efficient two-stage algorithm that adds a pre-screening stage.

Two-stage ORICC

Step 1. Pre-specify a collection of candidate profiles, $\{C_1, \ldots, C_d\}$. Here, we also add $C_0$ and $C_\perp$, defined in (1) and (2), to the collection for screening purposes.

Step 2. Compute

Step 3. For gene $g$, compute the order-restricted MLE of $(\mu_{g1}, \cdots, \mu_{gT})$ and the maximum log-likelihood under $C_0$ and $C_\perp$.

Step 4. For gene

Step 5. Repeat Steps 3 and 4 for every gene. Denote the set of remaining genes by

Step 6. Run Steps 3–5 of the one-stage ORICC algorithm for the remaining genes, using $\{C_1, \ldots, C_d\}$ as candidate profiles.
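The pre-screening stage can be illustrated with a short sketch. This is a minimal Python illustration under stated assumptions, not the authors' R implementation: the function names `gaussian_loglik` and `keep_gene` are hypothetical, a normal model with constant variance is assumed, and the ordinary AIC is used as a stand-in criterion because the ORIC formula is not reproduced here. Under the flat profile the fitted mean is the grand mean; under the profile with no order constraints it is the sample mean at each time point.

```python
import math

def gaussian_loglik(rss, n):
    """Maximized normal log-likelihood with a common variance estimated as rss / n."""
    sigma2 = max(rss / n, 1e-12)  # guard against log(0) for a perfect fit
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

def keep_gene(y):
    """y[t] is the list of replicate measurements at time point t.

    Returns True if the unconstrained fit is preferred over the flat fit,
    i.e. the gene passes the screening stage.
    """
    values = [v for row in y for v in row]
    n = len(values)
    grand_mean = sum(values) / n
    rss_flat = sum((v - grand_mean) ** 2 for v in values)
    rss_free = sum((v - sum(row) / len(row)) ** 2 for row in y for v in row)
    # Stand-in criterion (assumption): AIC with 1 mean parameter for the flat
    # profile and one mean parameter per time point for the unconstrained one.
    aic_flat = -2 * gaussian_loglik(rss_flat, n) + 2 * 1
    aic_free = -2 * gaussian_loglik(rss_free, n) + 2 * len(y)
    return aic_free < aic_flat

flat_gene = [[0.1, -0.1], [0.0, 0.05], [-0.05, 0.1]]
changing_gene = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]
print(keep_gene(flat_gene), keep_gene(changing_gene))
```

Only genes for which `keep_gene` returns True would be passed on to the second stage in this sketch.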

In the one-stage algorithm, the ORIC function is evaluated for every gene under every candidate profile, whereas the two-stage algorithm first screens out genes that show no significant changes over time by comparing the two profiles $C_0$ and $C_\perp$, and then applies the one-stage algorithm to the much smaller set of remaining genes. As a result, the two-stage algorithm is usually much faster and reports tighter clusters with fewer genes in them.

Filtering genes with small expression levels

Some genes selected by the ORICC algorithm may have small mean expression levels at every time point. Such genes may not be of interest to some investigators. Peddada

Let

where _{g }indicate that the mean expression of gene _{g }and retain the top

Assessing the reliability of the ORICC results

Microarray data are often noisy and hence it is important to assess the reliability of the clustering results. Among the recently developed methods for assessing clustering reliability

In time-course microarray studies, we can use the following analysis of variance (ANOVA) model to account for sources of variation in microarray data.

where _{gti }is the relative expression measurement from array _{t}, _{i }and _{ti }account for all effects that are not gene-specific. We assume that the error terms _{gti }are independent with mean 0 and variance

Step 1. Estimate model (10), which can be done straightforwardly in any statistics software, such as SAS

Step 2. Generate

where a ^ over a term means the estimate from the original model fit in Step 1, and

Step 3. Repeat the ORICC algorithm for each bootstrap sample.

Now, the original clustering is accompanied by a collection of bootstrap clusterings, which can be regarded as a sample of clusterings that are close to the original clustering in the space of all possible clusterings. When the level of noise in the original data is low, the bootstrap clusterings tend to be more like the original clustering. We can then calculate a reliability measure for each gene as the proportion of bootstrap clusterings in which the gene is assigned to the same profile as in the original clustering. The larger the measure, the more reliable the gene's cluster membership.
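The reliability measure described above is a simple proportion and can be sketched as follows. This is a minimal Python illustration; the function name `reliability` and the dictionary-based data layout are assumptions made for the example, not part of the authors' software.

```python
def reliability(original, bootstrap_runs):
    """Proportion of bootstrap clusterings that agree with the original one.

    original: dict mapping gene -> profile label from the original clustering.
    bootstrap_runs: list of dicts with the same structure, one per bootstrap sample.
    A gene missing from a bootstrap run counts as a disagreement.
    """
    n_runs = len(bootstrap_runs)
    return {
        gene: sum(run.get(gene) == profile for run in bootstrap_runs) / n_runs
        for gene, profile in original.items()
    }

original = {"g1": "C9", "g2": "C5"}
runs = [
    {"g1": "C9", "g2": "C5"},
    {"g1": "C9", "g2": "C4"},
    {"g1": "C9", "g2": "C5"},
    {"g1": "C9", "g2": "C0"},
]
print(reliability(original, runs))
```

In this toy input, gene `g1` keeps its profile in every bootstrap run, while `g2` keeps it in half of them.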

Simulation studies

In this section, we use Monte Carlo simulation to examine the performance of the ORICC method and compare it with other clustering methods for short time-course microarray data, including Peddada's method, STEM

The STEM algorithm works by assigning genes to a pre-defined set of model profiles that capture the potential distinct patterns that can be expected from the microarray experiment. Each gene is assigned to the closest model profile under a chosen distance measure, e.g. correlation, and genes assigned to the same model profile form a cluster. Significant profiles/clusters are then determined by hypothesis tests. As a result, genes in insignificant clusters are usually not reported. Wang's method represents each gene's temporal profile by a polynomial model and estimates the model using a Bayesian approach. A heuristic search strategy

All simulations were carried out on a workstation with a 2.30 GHz AMD Athlon(tm) 64 X2 Dual Core 4400+ processor and 2.00 GB of memory. Peddada's method, Wang's method and the one-stage ORICC algorithm are implemented in R

Simulation 1

In the first simulation study, we consider ten inequality profiles ($C_1$–$C_{10}$) plus a flat pattern ($C_0$) to represent a total of eleven clusters. We set the number of time points as $T = 6$. For simplicity, we omit the phrase 'with a strict inequality' when defining the profiles. True values of $(\mu_1, \mu_2, \mu_3, \mu_4, \mu_5, \mu_6)$ in each profile are also given.

$C_0$: $\mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5 = \mu_6$

$C_1$: $\mu_1 \ge \mu_2 \ge \mu_3 \ge \mu_4 \ge \mu_5 \ge \mu_6$

$C_2$: $\mu_1 \le \mu_2 \le \mu_3 \le \mu_4 \le \mu_5 \le \mu_6$

$C_3$: $\mu_1 \le \mu_2 \ge \mu_3 \ge \mu_4 \ge \mu_5 \ge \mu_6$

$C_4$: $\mu_1 \le \mu_2 \le \mu_3 \ge \mu_4 \ge \mu_5 \ge \mu_6$

$C_5$: $\mu_1 \le \mu_2 \le \mu_3 \le \mu_4 \ge \mu_5 \ge \mu_6$

$C_6$: $\mu_1 \le \mu_2 \le \mu_3 \le \mu_4 \le \mu_5 \ge \mu_6$

$C_7$: $\mu_1 \ge \mu_2 \le \mu_3 \le \mu_4 \le \mu_5 \le \mu_6$

$C_8$: $\mu_1 \ge \mu_2 \ge \mu_3 \le \mu_4 \le \mu_5 \le \mu_6$

$C_9$: $\mu_1 \ge \mu_2 \ge \mu_3 \ge \mu_4 \le \mu_5 \le \mu_6$

$C_{10}$: $\mu_1 \ge \mu_2 \ge \mu_3 \ge \mu_4 \ge \mu_5 \le \mu_6$
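The eleven profiles listed above can be encoded compactly by their shape and turning point. The following Python sketch is only an illustration: the names `PROFILES`, `satisfies` and `matching_profiles` are hypothetical, and the check applies to a vector of true means, not to noisy data. It enforces the strict-inequality requirement (at least one strict inequality in each monotone part), so each non-constant vector of means matches at most one of the eleven profiles.

```python
# Shape and 1-based turning point (peak or trough) for each profile, T = 6.
PROFILES = {
    "C0": ("flat", None),
    "C1": ("down", None),
    "C2": ("up", None),
    "C3": ("up-down", 2), "C4": ("up-down", 3),
    "C5": ("up-down", 4), "C6": ("up-down", 5),
    "C7": ("down-up", 2), "C8": ("down-up", 3),
    "C9": ("down-up", 4), "C10": ("down-up", 5),
}

def _monotone(diffs, sign):
    """True if sign * d >= 0 for all differences, with at least one strict."""
    return all(sign * d >= 0 for d in diffs) and any(sign * d > 0 for d in diffs)

def satisfies(mu, kind, turn=None):
    """Check whether the mean vector mu satisfies the given inequality profile."""
    diffs = [b - a for a, b in zip(mu, mu[1:])]
    if kind == "flat":
        return all(d == 0 for d in diffs)
    if kind == "up":
        return _monotone(diffs, +1)
    if kind == "down":
        return _monotone(diffs, -1)
    # Differences before the turning point, and from the turning point onwards.
    head, tail = diffs[:turn - 1], diffs[turn - 1:]
    if kind == "up-down":
        return _monotone(head, +1) and _monotone(tail, -1)
    if kind == "down-up":
        return _monotone(head, -1) and _monotone(tail, +1)
    raise ValueError(f"unknown profile kind: {kind}")

def matching_profiles(mu):
    """Names of all profiles in PROFILES that mu satisfies."""
    return [name for name, (kind, turn) in PROFILES.items()
            if satisfies(mu, kind, turn)]

print(matching_profiles([0, 1, 2, 1, 0, -1]))
```

The example vector rises up to the third time point and falls afterwards, so it matches the up-down profile with maximum at time point 3.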

The figure below illustrates profiles $C_1$–$C_{10}$. We generated a data set with 200 genes from each profile.

**Ten inequality profiles in Simulations 1 and 2**.

At each time point $t$, the simulated expression values have mean $\mu_t$ and variance $\sigma^2$. To assess the effect of data variability and the number of replicates on the clustering results, we varied the variance $\sigma^2$ from 0.2 to 1.2 (in increments of 0.2) and the number of replicates

Next, we clustered the simulated data using Peddada's method and the one-stage ORICC algorithm, considering ten candidate inequality profiles $C_1$–$C_{10}$. For Peddada's method, we set the number of bootstrap replications to 200 and the significance level of the bootstrap-based test to 0.025. Peddada _{0 }using this choice. Meanwhile, using the common significance level of 0.05 tends to cluster many genes from $C_0$ to other "non-flat" profiles. Using a significance level of 0.025 offers a good compromise between the two kinds of false clustering.

Let _{i }denote the number of genes with true profile _{i }and correctly clustered to profile _{i}, _{0 }= 200, respectively. Let _{i}, _{0}. The

We then use the overall error rate, the false positive rate and the false negative rate to evaluate the accuracy of the two algorithms. Simulation results are summarized in Figures ^{2 }= 1 and

**Simulation 1: The overall error rate of Peddada's method and the one-stage ORICC algorithm**. The horizontal axis represents the number of replicates, and the vertical axis represents the overall error rate. Dashed lines are for the one-stage ORICC algorithm, and solid lines are for Peddada's method.

**Simulation 1: The false positive rate of Peddada's method and the one-stage ORICC algorithm**. The horizontal axis represents the number of replicates, and the vertical axis represents the false positive rate. Dashed lines are for the one-stage ORICC algorithm, and solid lines are for Peddada's method.

**Simulation 1: The false negative rate for Peddada's method and the one-stage ORICC algorithm**. The horizontal axis represents the number of replicates, and the vertical axis represents the false negative rate. Dashed lines are for the one-stage ORICC algorithm, and solid lines are for Peddada's method.

Simulation 2

In the second simulation study, we consider the same set of inequality profiles and simulate the data in the same way as in Simulation 1, but we fix the number of replicates at $M = 5$ and vary the variance $\sigma^2$ from 0.2 to 3.0 in increments of 0.4. Then we cluster the simulated data set using Peddada's method, Wang's method, STEM and the one-stage ORICC algorithm.

For Peddada's method and the one-stage ORICC algorithm, we consider eleven candidate inequality profiles _{0}–_{10}. For Wang's method, we set the prior hyper-parameters (_{1}, _{2}) in the gamma prior distribution _{1}, _{2}) as (2, 2). For STEM, we assume 50 possible profiles and use the recommended default settings in the package. To be consistent, we did not filter out any genes in any of these analyses. Then we use Rand's _{ij }be the number of objects simultaneously in the

which denotes the proportion of pairs of objects that are assigned consistently in the two clusterings. The figure below reports the clustering precision for different values of $\sigma^2$ and cluster sizes. It shows that the precision of all methods decreases as the variance increases. The cluster size has no obvious effect on the clustering precision of Peddada's method, STEM and the one-stage ORICC algorithm, whereas the precision of Wang's method increases with the cluster size. This comparison shows an interesting pattern. For larger $\sigma^2$, STEM performs the best, Wang's method the worst, and Peddada's method and the one-stage ORICC algorithm are in between. For smaller $\sigma^2$, the result is reversed, with STEM being the worst, Wang's method the best, and Peddada's method and the one-stage ORICC algorithm still in between. When the cluster size is relatively small and $\sigma^2$ is large, Wang's method can have quite low precision, under 70%. Overall, the one-stage ORICC algorithm is consistently more accurate than Peddada's method by a slight margin, and provides good precision under all scenarios. The performance of STEM is also very stable, but it tends to underperform when the data are less noisy, i.e., when $\sigma^2$ is small.
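As a reference for the precision measure used in this comparison, the proportion of consistently assigned pairs can be computed directly from two label vectors. The sketch below is a minimal Python illustration that counts pairs explicitly; the function name `rand_statistic` is hypothetical, and the quadratic pair loop is chosen for clarity, not speed (a contingency-table formula gives the same value).

```python
from itertools import combinations

def rand_statistic(labels_a, labels_b):
    """Proportion of object pairs treated consistently by two clusterings.

    A pair is consistent if both clusterings put the two objects in the same
    cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same objects")
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Four objects: the two clusterings disagree only on the pair (2, 3).
print(rand_statistic([0, 0, 1, 1], [0, 0, 1, 2]))
```

Identical clusterings give a value of 1, and the value does not depend on how the clusters are named.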

**Simulation 2: Clustering precision of four different methods**. The numbers in the cells are Rand's ^{2}.

Figure ^{2 }= 1 and the cluster size is 100. Figures _{0 }into one cluster but assign them into different clusters.

**Simulation 2: The simulated eleven clusters when M = 5, σ = 1 and cluster size = 100**.

**Simulation 2: Temporal profiles for clusters from the ORICC analysis**.

**Simulation 2: Temporal profiles for clusters from Peddada's method**.

**Simulation 2: Temporal profiles for clusters from Wang's method**.

**Simulation 2: Temporal profiles for clusters from the STEM analysis**. The black curves are pre-specified model profiles.

In this simulation, Peddada's method, Wang's method and the one-stage ORICC method are implemented in R, whereas the STEM software is written in Java by its author. We can therefore only compare the computational efficiency of the first three methods; among them, the one-stage ORICC method is much faster than the other two. For example, when $\sigma^2 = 3$ and the cluster size is 200, the run times of Peddada's method, Wang's method and the one-stage ORICC algorithm are 3073.37 seconds, 10303.9 seconds and 24.72 seconds, respectively, i.e. roughly 124 and 417 times the ORICC run time for the other two methods.

Simulation 3

In the third simulation, we examine the robustness of the ORICC algorithm. We consider eleven inequality profiles ($C_0$–$C_{10}$) plus a cyclical profile ($C_{\wedge\wedge}$) to represent a total of twelve clusters. We set the number of time points as $T = 6$. Profiles $C_0$–$C_{10}$ and the true values of $(\mu_1, \mu_2, \mu_3, \mu_4, \mu_5, \mu_6)$ in each profile are the same as in Simulation 1. The cyclical profile $C_{\wedge\wedge}$ and the true value of $(\mu_1, \mu_2, \mu_3, \mu_4, \mu_5, \mu_6)$ in $C_{\wedge\wedge}$ are given as follows:

We generate a data set with 200 genes from each profile of _{0}–_{10 }and 200 × _{∧∧}. At each time point _{t }and variance ^{2}. To study the robustness of the one-stage ORICC algorithm, we consider different cluster sizes, 200 × _{∧∧}. Meanwhile, we also vary the variance ^{2 }from 0.2 to 3.0 with an incremental of 0.4. Then we cluster the simulated data set using the one-stage ORICC algorithm. For the one-stage ORICC algorithm, we consider eleven candidate inequality profiles _{0 }– _{10}. Note that the cyclical profile _{∧∧ }is not included in the candidate profiles.

Then we use Rand's ^{2 }and cluster sizes of the cyclical profile _{∧∧}. It shows that the precision of the one-stage ORICC algorithm as measured by Rand's _{∧∧}. The cluster precision is however always greater than 80%, thus suggesting that the one-stage ORICC algorithm is very stable.

**Simulation 3: Clustering precision of the one-stage ORICC algorithm**. Numbers in the cells are Rand's _{∧∧}, and the vertical axis represents the data variance ^{2}.

Simulation 4

In the fourth simulation, we further examine the robustness of the ORICC algorithm. We consider the scenario where a true profile is not explicitly included in the candidate profiles but may be viewed as a special case of a more flexible candidate profile. Let the true inequality profile be

We generate a data set containing 2000 genes from this profile. At each time point _{t }and variance 0.5. Then, we consider candidate profiles being _{1}, _{2}, _{4 }and _{9 }plus the profile _{⊥ }and cluster the simulated data using the one-stage ORICC algorithm. Note that the set of candidate profiles does not contain the true one _{∧∧}, but _{∧∧ }may be viewed as a special case of _{⊥}. Let _{⊥ }denote the proportion of genes clustered to the profile _{⊥}, and define the detection error as 1 - _{⊥}. The simulation results are summarized in Figure

**Simulation 4: Detection error rate**. This figure plots the detection error rate of the one-stage ORICC algorithm for a true profile that is not explicitly specified in the candidate profiles.

Application to breast cancer cell line data

Next, we apply the ORICC algorithm to log-transformed relative expression data from a previously published breast cancer cell line microarray study. The candidate profiles are: monotone decreasing, $C_1$; monotone increasing, $C_2$; four up-down profiles with maxima at 4, 12, 24 and 36 hours, $C_3$–$C_6$, respectively; and four down-up profiles with minima at 4, 12, 24 and 36 hours, $C_7$–$C_{10}$, respectively. Genes matched to these profiles will be regarded as estrogen responsive.

The original analysis in _{1}, 24 in _{2}, 76 in _{3}, 44 in _{4}, 97 in _{5}, 72 in _{6}, 35 in _{7}, 98 in _{8}, 409 in _{9}, and 58 in _{10}. Due to limitation of space, we only present the top 50 genes ranked by the filtering criterion in (9) (Additional file _{2}. Figure _{9}) has reliability 0.9967, and the deoxythymidylate kinase (Clone ID 489092 in _{5}) has reliability 0.7800. Both genes are known specific to the metabolic process, and hence are very likely responsive to the metabolism of estrogen when overdosed estrogen are supplied to the cell. For example, the methylmalonyl Coenzyme A mutase could be involved in the breakdown of estradiol into smaller metabolic fragments. However, this gene was not reported in the top 50 list by Peddada _{9}. An interesting phenomenon about the deoxythymidylate kinase is that this gene actually corresponds to two spots on the microarray chips (Clone IDs 489092 and 248008). The original analysis in

**Gene clusters from the ORICC analysis of the breast cancer cell line data**. This table presents gene clusters given by the ORICC algorithm using ten candidate inequality profiles for the breast cancer cell line microarray data in

**Breast cancer cell line data: Estimated profiles of the top 50 genes**. Curves are given by the order-restricted MLE of mean log expression ratios. Dashed lines indicate newly identified genes.

In

We further applied STEM and Wang's method on the breast cancer cell line data. Table

Rand's C statistics among four clustering methods in the breast cancer cell line example.

| | ORICC | Peddada | Wang | STEM |
|---|---|---|---|---|
| ORICC | 1.0000 | 0.7767 | 0.6313 | 0.7142 |
| Peddada | 0.7767 | 1.0000 | 0.5948 | 0.7694 |
| Wang | 0.6313 | 0.5948 | 1.0000 | 0.6025 |
| STEM | 0.7142 | 0.7694 | 0.6025 | 1.0000 |

A larger value indicates more overlap between two clustering results.

**Breast cancer cell line data: Temporal profiles of clusters from the ORICC analysis**. Curves are given by connecting the observed log expression ratios at different time points.

**Breast cancer cell line data: Temporal profiles of clusters from Peddada's method**. Curves are given by connecting the observed log expression ratios at different time points.

**Breast cancer cell line data: Temporal profiles of clusters from the STEM analysis**. Curves are given by connecting the observed log expression ratios at different time points. The black curves are pre-specified model profiles.

**Breast cancer cell line data: Temporal profiles of clusters from Wang's method**. Curves are given by connecting the observed log expression ratios at different time points.

Discussion

In time-course microarray experiments, the ability to exploit the temporal ordering information may be especially valuable because genes whose expression levels change over time may be involved in the same cellular process or belong to the same regulatory pathway. Making use of this ordering information can improve inference. Our proposed ORICC algorithm uses the temporal ordering information in clustering time-course microarray data through order-restricted maximum likelihood, while most existing clustering methods either cannot incorporate the temporal information or require long time series to perform reliable nonparametric smoothing,

In many situations, field researchers have good ideas about how to define the inequality profiles. For example, when studying gene expression patterns around disease onset, it is easy to postulate that gene expression tends to go up before the onset and then go down after a certain treatment is given. The inequality constraints therefore allow prior knowledge to be incorporated easily into the analysis, whereas existing methods usually cannot take such information into consideration. In addition, when inequality constraints are given, the order-restricted MLE has some optimal properties and universally dominates the unrestricted MLE

In this paper, we present our algorithm in the context of clustering time-course microarray data. In fact, it can be applied to data from any experiment with ordered treatments or conditions, such as dose-response microarray experiments where the dose levels provide the ordering.

Our current ORICC algorithm is based on order-restricted MLE for gene expressions with a constant variance through time. It can be generalized to handle situations where the variances change or are subject to order restrictions themselves. In such situations, the estimation of mean expression levels outlined in this paper can be modified according to the approach in

Conclusion

We developed a new clustering algorithm, ORICC, for short time-course microarray data, by taking a model selection approach to order-restricted statistical inference. Our method assigns genes to clusters represented by candidate profiles defined through inequalities among mean expression levels at different time points. A newly proposed information criterion function is used to determine the cluster assignment. Compared with a previous clustering method by Peddada

Methods

Order-restricted maximum likelihood estimation

Here, we briefly present the order-restricted MLE under simple order (3) and umbrella order constraints (4), which are needed for ORICC analysis in our simulation and real data example. For more general results, we refer to _{ti}'s independent observations from normal distributions with unknown means _{t }and variances _{t }for _{t}. Then the data log-likelihood is

Where ** μ **= (

(1) _{1 }≤ _{2 }≤ ⋯ ≤ _{T}:

(2) _{1 }≥ _{2 }≥ ⋯ ≥ _{T}:

(3) _{1 }≤ ⋯ ≤ _{h }≥ ⋯ ≥ _{T}:

(4) _{1 }≥ ⋯ ≥ _{h }≤ ⋯ ≤ _{T}:

The maximum log-likelihood under the

If the variances **v **are unknown, we need to impose the assumption _{1 }= _{2 }= ⋯ = _{n }=

Under this situation, the order-restricted MLE of ** μ **can be obtained similarly as in the known variance case by letting

Again, the maximum log-likelihood for the
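For the simple order constraints in cases (1) and (2), a standard way to compute order-restricted estimates is the pool-adjacent-violators algorithm applied to the sample means at each time point. The sketch below is a minimal Python illustration, not the authors' R implementation: the function names are hypothetical, and the choice of weights (for example, the number of replicates divided by the variance at each time point) is an assumption of this example; with equal replicates and a common variance, equal weights suffice. The umbrella cases (3) and (4) are not covered here.

```python
def pava_increasing(means, weights=None):
    """Weighted least-squares fit subject to mu_1 <= mu_2 <= ... <= mu_T."""
    if weights is None:
        weights = [1.0] * len(means)
    blocks = []  # each block: [pooled mean, total weight, number of time points]
    for m, w in zip(means, weights):
        blocks.append([float(m), float(w), 1])
        # Pool adjacent blocks while they violate the increasing order.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            blocks.append([(w1 * m1 + w2 * m2) / (w1 + w2), w1 + w2, c1 + c2])
    fit = []
    for m, _, count in blocks:
        fit.extend([m] * count)
    return fit

def pava_decreasing(means, weights=None):
    """Fit subject to mu_1 >= ... >= mu_T by negating, fitting, and negating back."""
    return [-v for v in pava_increasing([-m for m in means], weights)]

sample_means = [1.0, 3.0, 2.0, 4.0]
print(pava_increasing(sample_means))
print(pava_decreasing(sample_means))
```

In the increasing fit, the second and third time points violate the order and are pooled to their average; in the decreasing fit, all four time points are pooled to the overall mean.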

Availability and requirements

We have implemented ORICC in an R program, which can be downloaded from

Authors' contributions

NL, NS and BZ had the initial idea and initiated the study. NL and TL conducted the data analyses, created all tables and figures, and prepared the manuscript under the supervision of BZ. All authors read and approved the final manuscript.

Acknowledgements

The authors thank the editor and the referees for their valuable comments and suggestions. The authors are partly supported by the National Science Foundation of China (No.10871037) and the Science Foundation for Young Teachers of Northeast Normal University (No.20050107).