Abstract
Background
Timecourse microarray experiments produce vector gene expression profiles across a series of time points. Clustering genes based on these profiles is important in discovering functional related and coregulated genes. Early developed clustering algorithms do not take advantage of the ordering in a timecourse study, explicit use of which should allow more sensitive detection of genes that display a consistent pattern over time. Peddada et al. [1] proposed a clustering algorithm that can incorporate the temporal ordering using orderrestricted statistical inference. This algorithm is, however, very timeconsuming and hence inapplicable to most microarray experiments that contain a large number of genes. Its computational burden also imposes difficulty to assess the clustering reliability, which is a very important measure when clustering noisy microarray data.
Results
We propose a computationally efficient information criterionbased clustering algorithm, called ORICC, that also takes account of the ordering in timecourse microarray experiments by embedding the orderrestricted inference into a model selection framework. Genes are assigned to the profile which they best match determined by a newly proposed information criterion for orderrestricted inference. In addition, we also developed a bootstrap procedure to assess ORICC's clustering reliability for every gene. Simulation studies show that the ORICC method is robust, always gives better clustering accuracy than Peddada's method and saves hundreds of times computational time. Under some scenarios, its accuracy is also better than some other existing clustering methods for short timecourse microarray data, such as STEM [2] and Wang et al. [3]. It is also computationally much faster than Wang et al. [3].
Conclusion
Our ORICC algorithm, which takes advantage of the temporal ordering in timecourse microarray experiments, provides good clustering accuracy and is meanwhile much faster than Peddada's method. Moreover, the clustering reliability for each gene can also be assessed, which is unavailable in Peddada's method. In a real data example, the ORICC algorithm identifies new and interesting genes that previous analyses failed to reveal.
Background
The development of microarray technology provides a powerful analytical tool for large scale genomic research. Its ability to simultaneously study thousands of genes under a multitude of conditions presents a huge challenge to comprehend and interpret the resulting mass of data. An important application of microarray technology is to study the dynamic patterns of gene expression across a series of time points and find gene clusters within which genes share similar patterns. The premise is that genes sharing similar expression profiles might be functionally related or coregulated. Therefore, microarray data may provide insights into genegene interactions, gene function and pathway identification. Examples of such studies include response to temperature changes and other stress conditions [4], immune response [5], developmental studies [6], and various systems in the cell [7]. Early clustering analysis of microarray data were mostly on static microarray experiments, such as hierarchial clustering [8], the knearest neighbors method [9,10], and other correlationbased methods [11,12]. These methods are not designed for timecourse microarray data and can not effectively utilize the temporal information. Many clustering algorithms for timecourse microarray data have been developed afterwards. Most of them view observed temporal gene expression profiles coming from underlying smooth curves and cluster genes based on estimated expression profiles obtained from nonparametric smoothing [2,1325]. While these algorithms work well for relatively long time series data, they are not appropriate for short timecourse microarray data often taken on a small number of sparse time points. Generally, these algorithms tend to overfit the data when the number of time points is small [2]. A few clustering methods were also proposed specifically for short timecourse microarray data, such as [13,26,27]. Among them, Peddada et al. [1] proposed an interesting idea of using orderrestricted inference in clustering short timecourse microarray data. A set of candidate expression profiles is first defined by inequality constraints among expression levels at different time points, i.e. up and downs. This strategy is less restrictive than those that define profiles via prespecified expression levels because only the general shape of the profile is needed. Each candidate profile then represents a potential gene cluster. For short timecourse data, the total number of candidate profiles is generally not large. Peddada's method then assigns genes to the cluster represented by its best matched candidate profile determined by some orderrestricted statistical inference procedure. Two genes fall into the same cluster if their best matched candidate profiles are the same. This profile matching clustering strategy is different from most unsupervised clustering where a representation of a cluster is often calculated only after the cluster is formed. This method requires no smoothing of the expression profiles and was shown to discover more functionally related genes when applied to a breast cancer cellline data in [28]. However, Peddada's method is computationally very costly when the number of genes is large. According to our experience on a workstation with a 2.30 GHz AMD Athlon(tm) 64 × 2 Dual Core 4400+ processor and a 2.00 GB memory, it took at least 72 hours to run Peddada's method (implemented in Matlab) for the breast cancer cell line data containing about 1900 genes.
In this article, from a different perspective to the orderrestricted inference, we propose a new orderrestricted information criterionbased clustering (ORICC) algorithm, which is computationally much more efficient than Peddada's method. Our method selects and clusters genes using the ideas of model selection for orderrestricted inference, where estimation makes use of inequalities that define the candidate profiles. The first step is to define candidate profiles and express them in terms of inequalities between the expected gene expression levels at various time points. For a given candidate profile, we estimate the mean expression level at different time points of each gene using the orderrestricted maximum likelihood [29]. The best fitting profile for a given gene is then selected using an information criterion for orderrestricted inference. Due to the simplicity of our algorithm, the analysis of the breast cancer cellline data in [28] can now be done in a few minutes using our method.
Results and discussion
Inequality profiles
Suppose that a timecourse microarray experiment includes T time points, and at each time point there are M arrays each with G genes. Denote y_{gti }the expression measurement of gene g at time point t on the ith array. Suppose that the unknown true mean expression level of gene g at time t is μ_{gt}, i.e. μ_{gt }= E(y_{gti}) for all i. A candidate gene expression profile of gene g is then given by inequalities between the components of μ_{g }= (μ_{g1}, μ_{g2}, ⋯, μ_{gT})^{T}. In the following, we define some typical inequality profiles, and we drop the subscript g for simplicity.
C_{0 }profile :
C_{⊥ }profile :
where μ_{i}⊥ μ_{j }means that there is no defined inequality constraint between μ_{i }and μ_{j}.
Monotone increasing profile (simple order) :
(with at least one strict inequality). Similarly, a monotone decreasing profile C_{↓ }is given by replacing ≤ by ≥ in (3).
Up down profile with maximum at i (umbrella order) :
(with at least one strict inequality among μ_{1 }≤ μ_{2 }≤ ⋯ ≤ μ_{i }and one among μ_{i }≥ μ_{i+1 }≥ ⋯ ≥ μ_{T}). Genes satisfying this profile have mean expression values nondecreasing in time up to time point i and nonincreasing thereafter. One may similarly define a downup profile C_{∨}.
Cyclical profile with minima at 1, j, and T and maxima at i and k :
(with at least one strict inequality among each monotone subprofile). Cyclical profiles may be important in relatively long timecourse experiments where the mean expression value could oscillate.
Incomplete inequality profiles :
(with at least one strict inequality among each monotone subprofile). Profiles (6) are useful when the investigator is unable to specify inequalities between certain means.
Informationcriterion based clustering using orderrestricted maximum likelihood
Our procedure seeks to match a gene's true profile, estimated from the observed data, to one of a specified set of candidate profiles. Provided the relationship of a gene's mean expression levels at different time points is defined by a given candidate profile, we first obtain the orderrestricted maximum likelihood estimates (MLE) of the gene's mean expression levels at all time points. Details for simple order and umbrella order constraints are given in the Methods section. A general discussion of orderrestricted MLE can be found in [29]. Some specific examples can also be found in [1]. Peddada's method then carried out a bootstrapbased likelihood ratio test to decide a gene's best matched profile, which needs to repeat computing the orderrestricted MLEs at least hundreds of times for each gene. We replace this very timeconsuming bootstrap procedure by computing an information criterion function instead. The best matched profile is given by the one with the smallest information criterion function value. Information criterion functions, such as Akaike information criterion (AIC) [30] and Bayesian information criterion (BIC) [31], are widely used in selecting a model from a set of potential models. In our context, each candidate profile can be viewed as a potential model for a gene's expression pattern, and hence a properly defined information criterion function can be used to decide the gene's best matched profile. However, the widely used AIC and BIC are designed for models with a fixed number of parameters, whereas the inequality constraints do not really specify any parameters explicitly. Some information criteria for orderrestricted inference have been proposed recently [32,33], but they only apply to simple order constraints (3). We have recently proposed a new orderrestricted information criterion (ORIC) function for general inequality profiles, which we show to always select the correct profile when the sample size is large enough (unpublished manuscript). Suppose that there are d candidate profiles of interest. Let l(λ) denote the maximum loglikelihood under the λth candidate profile, and ν_{1}(λ) and ν_{2}(λ) are the number of ⊥ and {≥, ≤} specified in the profile, λ = 1, ..., d. Then, the ORIC function is
where M is the number of replicated arrays and
The ORIC function is similar to AIC and BIC in essence with p(λ) representing the model complexity. That is, the more the inequality constraints in a profile, the more complex it is as a model. And a profile with a smaller ORIC value is regarded as a better match to the gene's expression pattern. In the following, we describe our ORICbased clustering (ORICC) algorithm (onestage ORICC) and its computationally faster variant (twostage ORICC).
Onestage ORICC
Step 1. Prespecify a collection of candidate profiles, {C_{1}, ..., C_{d}}. To prevent genes with very little changes over time matched to these profiles, we also include C_{0 }defined in (1) into the collection.
Step 2. Compute p(λ) in (8) for all candidate profiles.
Step 3. For gene g, compute the orderrestricted MLE () of (μ_{g1}, ⋯, μ_{gT}) and the maximum loglikelihood l(λ) under each candidate profile C_{λ}, λ = 0, 1, ⋯, d.
Step 4. For gene g, compute the information criterion function ORIC(λ) in (7) for all λ = 0, 1, ⋯, d. The best matched profile is then selected as that corresponds to = arg min_{0 ≤ λ ≤ d}ORIC(λ), and gene g is assigned to the th cluster if ≠ 0.
Step 5. Repeat Steps 3 and 4 for every gene.
Although our onestage ORICC algorithm is hundreds of times faster than Peddada's method, performing Step 3 for all genes can still cost a lot of computational time even when only a moderate number of candidate profiles are considered because the number of genes is generally huge. This issue is more imminent for relatively longer time course microarray studies as more candidate profiles usually need to be considered. Next, we propose a computationally more efficient twostage algorithm by adding a prescreening stage.
Twostage ORICC
Step 1. Prespecify a collection of candidate profiles, {C_{1}, ..., C_{d}}. Here, we also add C_{0 }and C_{⊥ }defined in (1) and (2) into the collection for screening purpose.
Step 2. Compute p(λ) in (8) for all candidate profiles.
Step 3. For gene g, compute the orderrestricted MLE () of (μ_{g1}, ⋯, μ_{gT}) and the maximum loglikelihood l(λ) under profiles C_{0 }and C_{⊥}.
Step 4. For gene g, compute ORIC(λ) in (7) for λ = 0, ⊥. Exclude gene g for further consideration if ORIC(0) <ORIC(⊥).
Step 5. Repeat Steps 3 and 4 for every gene. Denote the set of remained genes by S.
Step 6. Run Steps 3–5 in the onestage ORICC algorithm for genes in S considering only {C_{1}, ..., C_{d}} as candidate profiles.
In the onestage algorithm, the ORIC function is evaluated for every gene under every candidate profile, whereas the twostage algorithm first screens out genes that show no significant changes over time by comparing between two profiles C_{0 }and C_{⊥}, and then applies the onestage algorithm to a much smaller set of remained genes. As a result, the two stage algorithm is usually much faster and report tighter clusters with less genes in them.
Filtering genes with small expression levels
Some genes selected by the ORICC algorithm may have small mean expression levels at every time point. Such genes may not be of interest to some investigators. Peddada et al. [1] suggested a simple step to remove them. We include it here for completeness.
Let
where . Large values of v_{g }indicate that the mean expression of gene g is high for at least one time point. Arrange the genes that selected by the ORICC algorithm in descending order of v_{g }and retain the top R genes.
Assessing the reliability of the oricc results
Microarray data are often noisy and hence it is important to assess the reliability of the clustering results. Among the recently developed methods for assessing clustering reliability [3438], we adopt a general bootstrap framework proposed by Kerr and Churchill [36], in which the clustering procedure is first applied to the original data and then to a large number of bootstrap samples obtained from perturbing the original data. Since Peddada's method is computationally so costly, it is impossible to put it into this framework. By contrast, our ORICC method is computationally very efficient, and we can easily embed it into this general framework.
In timecourse microarray studies, we can use the following analysis of variance (ANOVA) model to account for sources of variation in microarray data.
where y_{gti }is the relative expression measurement from array i, time point t, and gene g on appropriate scale (typically the log scale). The terms β_{t}, γ_{i }and ψ_{ti }account for all effects that are not genespecific. We assume that the error terms ε_{gti }are independent with mean 0 and variance but do not make any other distributional assumption. The bootstrap assessment is then done as follows.
Step 1. Estimate model (10), which can be done straightforwardly in any statistics software, such as SAS [39] and R [40].
Step 2. Generate B bootstrap samples by
where a ^ over a term means the estimate from the original model fit in Step 1, and are drawn with replacement from the studentized residuals of the original model fit.
Step 3. Repeat the ORICC algorithm for each bootstrap sample.
Now, the original clustering is accompanied by a collection of bootstrap clusterings, which can be regarded as a sample of clusterings that are close to the original clustering in space of all possible clusterings. When the level of noise in the original data is low, the bootstrap clusterings tend to be more like the original clustering. Then we can calculate a reliability measure for each gene by counting the proportion in the bootstrap clusterings it is attached to the same profile as in the original clustering. The larger the measure, the more reliable the gene's clustering membership.
Simulation studies
In this section, we use Monte Carlo simulation to examine the performance of the ORICC method and compare it with other clustering methods for short timecourse microarray data, including Peddada's method, STEM [2], and Wang's method [3].
The STEM algorithm works by assigning genes to a predefined set of model profiles that capture the potential distinct patterns that can be expected from the microarray experiment. Each gene is then assigned to the closest model profile in certain distance measure, e.g. correlation, and genes assigned to the same model profile consist a cluster. Significant profiles/clusters are next determined by hypotheses tests. As a result, genes in insignificant clusters are usually not reported. Wang's method represents each gene's temporal profile by a polynomial model and estimates the model using a Bayesian approach. A heuristic search strategy [23] is then applied to obtain clusters by repeatedly merging models to improve marginal likelihood.
All simulations were carried out on a workstation with a 2.30 GHz AMD Athlon(tm) 64 × 2 Dual Core 4400+ processor and a 2.00 GB memory. Peddada's method, Wang's method and the onestage ORICC algorithm are implemented in R [40], whereas the STEM software is in JAVA written by its author.
Simulation 1
In the first simulation study, we consider ten inequality profiles (C_{1}–C_{10}) plus a flat pattern (C_{0}) to represent a total number of eleven clusters. We set the number of time points as T = 6. These numbers are set to be similar to those in the real data set analyzed in the next section. The eleven profiles are specified as follows. For compactness, we drop μ ∈ R^{T }and the phrase 'with a strict inequality' when defining the profiles. True values of μ = (μ_{1}, μ_{2}, μ_{3}, μ_{4}, μ_{5}, μ_{6}) in each profile are also given.
C_{0}: μ_{1 }= μ_{2 }= μ_{3 }= μ_{4 }= μ_{5 }= μ_{6}, μ = (0,0,0,0,0,0);
C_{1}: μ_{1 }≥ μ_{2 }≥ μ_{3 }≥ μ_{4 }≥ μ_{5 }≥ μ_{6}, μ = (0,0.5,1,1.5,2,2.5);
C_{2}: μ_{1 }≤ μ_{2 }≤ μ_{3 }≤ μ_{4 }≤ μ_{5 }≤ μ_{6}, μ = (0, 0.5, 1, 1.5, 2, 2.5)
C_{3}: μ_{1 }≤ μ_{2 }≥ μ_{3 }≥ μ_{4 }≥ μ_{5 }≥ μ_{6}, μ = (0, 0.5, 0, 0.5, 1, 1.5)
C_{4}: μ_{1 }≤ μ_{2 }≤ μ_{3 }≥ μ_{4 }≥ μ_{5 }≥ μ_{6}, μ = (0, 0.5, 1, 0.5, 0, 0.5)
C_{5}: μ_{1 }≤ μ_{2 }≤ μ_{3 }≤ μ_{4 }≥ μ_{5 }≥ μ_{6}, μ = (0, 0.5, 1, 1.5, 1, 0.5)
C_{6}: μ_{1 }≤ μ_{2 }≤ μ_{3 }≤ μ_{4 }≤ μ_{5 }≥ μ_{6}, μ = (0, 0.5, 1, 1.5, 2, 1.5)
C_{7}: μ_{1 }≥ μ_{2 }≤ μ_{3 }≤ μ_{4 }≤ μ_{5 }≤ μ_{6}, μ = (0, 0.5, 0, 0.5, 1, 1.5)
C_{8}: μ_{1 }≥ μ_{2 }≥ μ_{3 }≤ μ_{4 }≤ μ_{5 }≤ μ_{6}, μ = (0, 0.5, 1, 0.5, 0, 0.5)
C_{9}: μ_{1 }≥ μ_{2 }≥ μ_{3 }≥ μ_{4 }≤ μ_{5 }≤ μ_{6}, μ = (0, 0.5, 1, 1.5, 1, 0.5)
C_{10}: μ_{1 }≥ μ_{2 }≥ μ_{3 }≥ μ_{4 }≥ μ_{5 }≤ μ_{6}, μ = (0, 0.5, 1, 1.5, 2, 1.5)
Figure 1 shows the inequality profiles C_{1}–C_{10}. We generated a data set with 200 genes from each profile.
Figure 1. Ten inequality profiles in Simulations 1 and 2.
At each time point t, we generated M replicates for each gene's expression level from normal distributions with means μ_{t }and variance σ^{2}. To assess the effect of the data variability and replicates on the clustering results, we varied the variance σ^{2 }from 0.2 to 1.2 (by an incremental of 0.2) and the number of replicates M from 2 to 10.
Next, we clustered the simulated data using Peddada's method and the onestage ORICC algorithm, considering ten candidate inequality profiles C_{1}–C_{10}. For Peddada's method, we set the number of bootstrap replications as 200 and the significance level of the bootstrap based test as 0.025. Peddada et al. originally proposed to use significance level 0.0025, and we have observed that a large number of "nonflat" genes will be clustered to C_{0 }using this choice. Meanwhile, using the common significance level of 0.05 tends to cluster many genes from C_{0 }to other "nonflat" profiles. Using significance level of 0.025 offers a good compromise between the two kinds of false clustering.
Let γ_{i }denote the number of genes with true profile C_{i }and correctly clustered to profile C_{i}, i = 0, 1, ..., 10. The overall error rate and the false positive rate are then given by /(11 × 200) and 1 γ_{0 }= 200, respectively. Let denote the number of genes with true profile C_{i}, i = 1, ..., 10 and clustered to profile C_{0}. The false negative rate is then given by /(10 × 200).
We then use the overall error rate, the false positive rate and the false negative rate to evaluate the accuracy of the two algorithms. Simulation results are summarized in Figures 2, 3 and 4. Figure 2 and Figure 4 show that the overall error rate and the false negative rate of the onestage ORICC algorithm are always better than those of Peddada's method. Figure 3 shows that the false positive rate of the onestage ORICC algorithm is better than that of Peddada's method in most cases. The onestage ORICC algorithm not only provides good clustering accuracy but also is much faster than Peddada's method. For example, when σ^{2 }= 1 and M = 5, the run time for Peddada's method and onestage ORICC algorithm is 2979.29 seconds versus 25.55 seconds.
Figure 2. Simulation 1: The overall error rate of Peddada's method and the onestage ORICC algorithm. The horizontal axis represents the number of replicates, and the vertical axis represents the overall error rate. Dashed lines are for the onestage ORICC algorithm, and solid lines are for Peddada's method.
Figure 3. Simulation 1: The false positive rate of Peddada's method and the onestage ORICC algorithm. The horizontal axis represents the number of replicates, and the vertical axis represents false positive rate. Dashed lines are for the onestage ORICC algorithm, and solid lines are for Peddada's method.
Figure 4. Simulation 1: The false negative rate for Peddada's method and the onestage ORICC algorithm. The horizontal axis represents the number of replicates, and the vertical axis represents the false negative rate. Dashed lines are for the onestage ORICC algorithm, and solid lines are for Peddada's method.
Simulation 2
In the second simulation study, we consider the same set of inequality profiles and simulate the data in the same way as in Simulation 1, but we fix the number of replicates M to be 5. To study the effect of the true cluster size to the clustering accuracy, we consider different cluster sizes, 50, 100, 150 and 200. Meanwhile, we also vary the variance σ^{2 }from 0.2 to 3.0 with an incremental of 0.4. Then we cluster the simulated data set using methods including Peddada's method, Wang's method, STEM and the onestage ORICC algorithm.
For Peddada's method and the onestage ORICC algorithm, we consider eleven candidate inequality profiles C_{0}–C_{10}. For Wang's method, we set the prior hyperparameters (α_{1}, α_{2}) in the gamma prior distribution Gamma(α_{1}, α_{2}) as (2, 2). For STEM, we assume 50 possible profiles and use the recommended default settings in the package. To be consistent, we did not filter out any genes in any of these analyses. Then we use Rand's C statistic [41] to evaluate the similarity between the true cluster assignment and the clustering results of different methods. Rand's C statistic is defined as follows. Given a pair of clusterings C and C' of the same N objects, arbitrarily number the clusters in each clustering and let n_{ij }be the number of objects simultaneously in the ith cluster of C and the jth cluster of C'. Then, Rand's C statistic is given by
which denotes the proportion of pairs of objects that are assigned consistently in the two clusterings. Figure 5 gives Rand's C statistic from different clustering methods for different σ^{2 }and cluster sizes. It shows that the precision of all methods is decreasing for increasing variance, and the cluster size has no obvious effect on the clustering precision for Peddada's method, STEM and the onestage ORICC algorithm, but has an increasing effect for Wang's method. This comparison shows an interesting pattern. For larger σ^{2}, STEM performs the best, Wang's method the worst, and Peddada's method and the onestage ORICC are in between. For smaller σ^{2}, the result is reversed with STEM being the worst, Wang's method the best, and Peddada's method and the onestage ORICC still in between. When the cluster size is relatively small and σ^{2 }is large, Wang's method can have quite low precision under 70%. Overall, the onestage ORICC algorithm is consistently more accurate than Peddada's method by a slight margin, and provides good precision under all scenarios. The performance of STEM is also very stable, but tends to underperform when the data are less noisy, i.e., σ^{2 }is small.
Figure 5. Simulation 2: Clustering precision of four different methods. The numbers in the cells are Rand's C statistics. The intensity of the color corresponds to the magnitude of the number as shown by the legend. The horizontal axis represents the cluster size, and the vertical axis represents the data variance σ^{2}.
Figure 6 shows the simulated eleven clusters when M = 5, σ^{2 }= 1 and the cluster size is 100. Figures 7, 8, 9 and 10 show the resulted clusters from the four different methods, respectively. ORICC and Peddada's method give similar clustering results that well match the true pattern. While the true number of clusters is eleven, STEM identifies six significant clusters and Wang's method keeps eight clusters. In particular, Wang's method did not cluster genes from the flat profile C_{0 }into one cluster but assign them into different clusters.
Figure 6. Simulation 2: The simulated eleven clusters when M = 5, σ = 1 and cluster size = 100.
Figure 7. Simulation 2: Temporal profiles for clusters from the ORICC analysis.
Figure 8. Simulation 2: Temporal profiles for clusters from Peddada's method.
Figure 9. Simulation 2: Temporal profiles for clusters from Wang's method.
Figure 10. Simulation 2: Temporal profiles for clusters from the STEM analysis. The black curves are prespecified model profiles.
In this simulation, Peddada's method, Wang's method and the onestage ORICC method are implemented in R, whereas the STEM software is in JAVA written by its author. So, we can only compare the computational efficiency of the first three methods and the onestage ORICC method is much faster than the other two. For example, when σ^{2 }= 3 and the cluster size is 200, the run time for Peddada's method, Wang's method and the onestage ORICC algorithm is 3073.37 seconds, 10303.9 seconds and 24.72 seconds, respectively.
Simulation 3
In the third simulation, we examine the robustness of the ORICC algorithm. We consider eleven inequality profiles (C_{0}–C_{10}) plus a cyclical profile (C_{∧∧}) to represent a total number of twelve clusters. We set the number of time points as T = 6. The eleven inequality profiles C_{0}–C_{10 }and the true values of μ = (μ_{1}, μ_{2}, μ_{3}, μ_{4}, μ_{5}, μ_{6}) in each profile are the same as in Simulation 1. The cyclical profile C_{∧∧ }and the true value of μ = (μ_{1}, μ_{2}, μ_{3}, μ_{4}, μ_{5}, μ_{6}) in C_{∧∧ }are given as follows:
We generate a data set with 200 genes from each profile of C_{0}–C_{10 }and 200 × r genes from cyclical profile C_{∧∧}. At each time point t, we generated 5 replicates for each gene's expression level from normal distributions with means μ_{t }and variance σ^{2}. To study the robustness of the onestage ORICC algorithm, we consider different cluster sizes, 200 × r, r = 1, 2, ⋯, 10, for the cyclical profile C_{∧∧}. Meanwhile, we also vary the variance σ^{2 }from 0.2 to 3.0 with an incremental of 0.4. Then we cluster the simulated data set using the onestage ORICC algorithm. For the onestage ORICC algorithm, we consider eleven candidate inequality profiles C_{0 }– C_{10}. Note that the cyclical profile C_{∧∧ }is not included in the candidate profiles.
Then we use Rand's C statistic to evaluate the similarity between the true cluster assignment and the clustering results from ORICC. Figure 11 gives Rand's C statistic from the onestage ORICC algorithm for different σ^{2 }and cluster sizes of the cyclical profile C_{∧∧}. It shows that the precision of the onestage ORICC algorithm as measured by Rand's C statistic increases for decreasing variance, and decreases when increasing the cluster size of the cyclical profile C_{∧∧}. The cluster precision is however always greater than 80%, thus suggesting that the onestage ORICC algorithm is very stable.
Figure 11. Simulation 3: Clustering precision of the onestage ORICC algorithm. Numbers in the cells are Rand's C statistics. The intensity of the color corresponds to the magnitude of the number as shown by the legend. The horizontal axis represents the cluster size of the cyclical profile C_{∧∧}, and the vertical axis represents the data variance σ^{2}.
Simulation 4
In the fourth simulation, we further examine the robustness of the ORICC algorithm. We consider the scenario where a true profile is not explicitly included in the candidate profiles but maybe viewed as a special case of a more flexible candidate profile. Let the true inequality profile be
We generate a data set containing 2000 genes from this profile. At each time point t, we generate M replicates for each gene's expression level from normal distributions with means μ_{t }and variance 0.5. Then, we consider candidate profiles being C_{1}, C_{2}, C_{4 }and C_{9 }plus the profile C_{⊥ }and cluster the simulated data using the onestage ORICC algorithm. Note that the set of candidate profiles does not contain the true one C_{∧∧}, but C_{∧∧ }may be viewed as a special case of C_{⊥}. Let γ_{⊥ }denote the proportion of genes clustered to the profile C_{⊥}, and define the detection error as 1  γ_{⊥}. The simulation results are summarized in Figure 12. It shows that the detection error decreases rapidly as the number of replicates M increases. With 5 replicates, the detection error is below 30%, and with 10 replicates, it goes down to below 5%. This indicates that it is quite safe to apply the ORICC algorithm even if some true profile is missing from the candidate profiles but a more comprehensive profile is considered.
Figure 12. Simulation 4: Detection error rate. This figure plots the detection error rate of the onestage ORICC algorithm for a true profile that is not explicitly specified in the candidate profiles.
Application to breast cancer cell line data
Next, we apply the ORICC algorithm to logtransformed relative expression data from a breast cancer cell line microarray study in [28]. The same data set was also analyzed by Peddada et al. [1]. The experiment was done as follows. First, the MCF7 breast cancer cell line was treated with 17βestradiol or ethanol (vehicle control). Then, samples were harvested at 1, 4, 12, 24, 36 and 48 hours after treatment. At each time point, M = 8 replicate arrays were prepared with each array consisting of G = 1901 genes. Similar as in [1], we assumed for each gene that the variance of the log relative expression was homoscedastic over time, and consider the following 10 candidate profiles for clustering: monotone decreasing, C_{1}; monotone increasing, C_{2}; four updown profiles with maxima at 4, 12, 24, 36 hours, C_{3 }– C_{6}, respectively; and 4 downup profiles with minima at 4, 12, 24, 36 hours, C_{7 }– C_{10}, respectively. Genes matched to these profiles will be regarded estrogen responsive.
The original analysis in [28] used a simple confidence interval approach [42], and identified 105 genes that demonstrated estrogen responsive expression. From the onestage ORICC analysis, we finally had 981 genes in the 10 clusters, with 68 in C_{1}, 24 in C_{2}, 76 in C_{3}, 44 in C_{4}, 97 in C_{5}, 72 in C_{6}, 35 in C_{7}, 98 in C_{8}, 409 in C_{9}, and 58 in C_{10}. Due to limitation of space, we only present the top 50 genes ranked by the filtering criterion in (9) (Additional file 1). The last column in Additional file 1 indicates whether the gene was previously identified in [28]. Clustering reliability for each gene is also attached. Note that these 50 genes come from only nine of the 10 clusters, with none from C_{2}. Figure 13 presents the estimated profiles of these 50 genes from the orderrestricted MLE. Of the 105 genes identified in [28], 44 are among our top 50, 82 are among our top 100, 94 were among our top 150, and 101 were among our top 200. Most of the 44 genes in the top 50 selected in common are involved in cell cycle progression and DNA replication reflecting the known sensitivity of MCF7 cells to estrogen. Among the six genes identified by our ORICC algorithm but not in [28] (denoted by dashed lines in Figure 13), two have quite high clustering reliability. The methylmalonyl Coenzyme A mutase (Clone ID 35468 in C_{9}) has reliability 0.9967, and the deoxythymidylate kinase (Clone ID 489092 in C_{5}) has reliability 0.7800. Both genes are known specific to the metabolic process, and hence are very likely responsive to the metabolism of estrogen when overdosed estrogen are supplied to the cell. For example, the methylmalonyl Coenzyme A mutase could be involved in the breakdown of estradiol into smaller metabolic fragments. However, this gene was not reported in the top 50 list by Peddada et al. [1]. Estimated profiles in Figure 13 suggests this gene matches very well with the candidate profile C_{9}. An interesting phenomenon about the deoxythymidylate kinase is that this gene actually corresponds to two spots on the microarray chips (Clone IDs 489092 and 248008). The original analysis in [28] was only able to identify one of them (Clone ID 248008) but not the other, whereas our method identifies both in the same cluster with high clustering reliability.
Additional File 1. Gene clusters from the ORICC analysis of the breast cancer cell line data. This table presents gene clusters given by the ORICC algorithm using ten candidate inequality profiles for the breast cancer cell line microarray data in [28]. The fourth column gives the rank of the gene according to the filtering criterion (9), and numbers in the parentheses are clustering reliability computed based on 300 bootstrap samples. The last column shows whether the gene was identified by the original analysis in [28].
Format: PDF Size: 25KB Download file
This file can be viewed with: Adobe Acrobat Reader
Figure 13. Breast cancer cell line data: Estimated profiles of the top 50 genes. Curves are given by the orderrestricted MLE of mean log expression ratios. Dashed lines indicate newly identified genes.
In [1], Peddada et al. reported that 39 genes in the top 50 were in common with the 105 genes selected in [28]. Our ORICC result has a good overlap with Peddada's result, with 32 genes in our top 50 also in the top 50 list reported in [1]. Peddada et al. discussed in details several newly identified genes in their top 50, such as replication factors C4 and C5. Our ORICC analysis also cluster these genes into the same clusters, but they are not in our top 50 list due to relatively low clustering reliability, with replication factor C4 ranked 75 and replication factor C5 ranked 148.
We further applied STEM and Wang's method on the breast cancer cell line data. Table 1 reports Rand's C statistics among results from four clustering methods. It shows that the three profile matching algorithms have results more alike each other, while the unsupervised Wang's method is less similar to the rest. This observation can also be seen in the temporal profiles of the clusters given by four different methods (Figures 14, 15, 16 and 17). While ORICC and Peddada's method give ten clusters plus the 'flat' cluster, Wang's method identifies seven clusters, and STEM reports twelve significant clusters.
Table 1. Rand's C statistics among four clustering methods in the breast cancer cell line example.
Figure 14. Breast cancer cell line data: Temporal profiles of clusters from the ORICC analysis. Curves are given by connecting the observed log expression ratios at different time points.
Figure 15. Breast cancer cell line data: Temporal profiles of clusters from Peddada's method. Curves are given by connecting the observed log expression ratios at different time points.
Figure 16. Breast cancer cell line data: Temporal profiles of clusters from the STEM analysis. Curves are given by connecting the observed log expression ratios at different time points. And the black curves are prespecified model profiles.
Figure 17. Breast cancer cell line data: Temporal profiles of clusters from Wang's method. Curves are given by connecting the observed log expression ratios at different time points.
Discussion
In timecourse microarray experiments, the ability to exploit the temporal ordering information may be especially valuable because genes whose expression levels change over time may be involved in the same cellular process or belong to the same regulatory pathway. Making use of the valuable ordering information can improve inference. Our proposed ORICC algorithm utilizes the temporal ordering information in clustering timecourse microarray data using orderrestricted maximum likelihood, while most existing clustering methods either can not incorporate the temporal information or require long time series to perform reliable nonparametric smoothing, e.g. spline smoothing, though most timecourse microarray data are short time series. In our method, the temporal ordering information is exploited through a set of predefined candidate expression profiles given by inequality constraints among the mean expression levels at different time points. By viewing each candidate profile as a potential model for the data and let each profile represent a cluster, we transform the clustering problem into a model selection problem. Using an ORIC function, we decide the best matched profile for each gene and hence determine the gene's clustering membership. Peddada et al. [1] instead performed a likelihood based test to decide the best matched profile. However, a bootstrap procedure is needed to decide the threshold value on the test statistic, which makes Peddada's method computationally very intensive.
In many situations, field researchers can have good ideas on defining the inequality profiles. For example, when studying the gene expression patterns for disease onset. It is easy to postulate that gene expressions tend to go up before the disease onset and then go down after certain treatment is given. So, the inequality constraints allow an easy adoption of a prior knowledge into the analysis, whereas existing methods usually can not take such information into consideration. In addition, when inequality constraints are given, the orderrestricted MLE has some optimal properties and universally dominates the unrestricted MLE [43]. Moreover, the candidate profiles are defined only based on ranks instead of the numerical value of the mean expression levels, hence our model specification is robust to small perturbation in the data. This feature is especially valuable in microarray studies since it is well known that microarray data are quite noisy. Furthermore, the rankbased specification is often closer to the real meaning of 'coexpression' that refers to two genes' expression levels changing in the same direction instead of with the same magnitude.
In this paper, we present our algorithm under the context of clustering timecourse microarray data. Actually, it can be applied to data from any experiment with ordered treatment or conditions, such as doseresponse microarray experiments where the dose levels provide the ordering.
Our current ORICC algorithm is based on orderrestricted MLE for gene expressions with a constant variance through time. It can be generalized to handle situations where the variances change or are subject to order restrictions themselves. In such situations, the estimation of mean expression levels outlined in this paper can be modified according to the approach in [44]. However, it remains a subject for future investigation to modify the definition of the model complexity p(λ) (8) accordingly.
Conclusion
We developed a new clustering algorithm, ORICC, for short timecourse microarray data, by taking a model selection approach in orderrestricted statistical inference. Our method clusters genes into clusters represented by candidate profiles defined through inequalities among mean expression levels at different time points. A newly proposed information criterion function is used to determined the cluster assignment. Compared with a previous clustering method by Peddada et al. [1] that also uses orderrestricted inference, our method is computationally much more efficient and provides an assessment of clustering reliability. Simulation studies indicate that the ORICC algorithm possesses good clustering accuracy when a moderate number of replicate arrays are available, and competes well with other existing clustering methods, such as [2] and [3]. Real data applications also indicate that our method can identify interesting genes that some correlationbased methods have failed to identify.
Methods
Orderrestricted maximum likelihood estimation
Here, we briefly present the orderrestricted MLE under simple order (3) and umbrella order constraints (4), which are needed for ORICC analysis in our simulation and real data example. For more general results, we refer to [29,45]. Suppose that y_{ti}'s independent observations from normal distributions with unknown means μ_{t }and variances v_{t }for t = 1, ⋯, T and i = 1, ⋯, n_{t}. Then the data loglikelihood is
Where , μ = (μ_{1},⋯, μ_{T})^{T }and v = (v_{1}, ⋯, v_{T})^{T}. Let w_{t} = n_{t}/v_{t}, and . When assuming the variances v are known, we have the orderrestricted MLE of μ as
(1) μ_{1 }≤ μ_{2 }≤ ⋯ ≤ μ_{T}:
(2) μ_{1 }≥ μ_{2 }≥ ⋯ ≥ μ_{T}:
(3) μ_{1 }≤ ⋯ ≤ μ_{h }≥ ⋯ ≥ μ_{T}:
(4) μ_{1 }≥ ⋯ ≥ μ_{h }≤ ⋯ ≤ μ_{T}:
The maximum loglikelihood under the λth candidate profile is then obtained by plugging in the corresponding orderrestricted MLE into (11).
If the variances v are unknown, we need to impose the assumption v_{1 }= v_{2 }= ⋯ = v_{n }= v. In microarray data, this assumption is reasonable after the data are properly normalized. Now, the data loglikelihood is
Under this situation, the orderrestricted MLE of μ can be obtained similarly as in the known variance case by letting w_{t }= n_{t }instead of w_{t }= n_{t}/v_{t}. And the MLE of v is then
Again, the maximum loglikelihood for the λth candidate profile is then obtained by plugging in the corresponding orderrestricted MLE into (12).
Availability and requirements
We have implemented ORICC in an R program, which can be downloaded from http://www.math.wustl.edu/~nlin/oricc/ webcite.
Authors' contributions
NL, NS and BZ had the initial idea and initiated the study. NL and TL conducted the data analyses, created all tables and figures, and prepared the manuscript under the supervision of BZ. All authors read and approved the final manuscript.
Acknowledgements
The authors gratefully appreciate the editor and the referees for their valuable comments and suggestions. The authors are partly supported by the National Science Foundation of China (No.10871037) and the Science Foundation for Young Teachers of Northeast Normal University (No.20050107).
References

Peddada S, Lobenhofer E, Li L, Afshari C, Weinberg C, Umbach D: Gene Selection and clustering for timecourse and doseresponse microarray experiments using orderrestricted inference.
Bioinformatics 2003, 19:834841. PubMed Abstract  Publisher Full Text

Ernst J, Nau GJ, BarJoseph Z: Clustering short time series gene expression data.
Bioinformatics 2005, 21(Suppl 1):i159i168. PubMed Abstract  Publisher Full Text

Wang L, Ramoni M, Sebastiani P: Clustering short gene expression profiles.

Gasch A, Spellman P, Kao C, CarmelHarel O, Eisen M, Storz G, Botstein D, Brown P: Genomic expression programs in the response of yeast cells to environmental changes.
Mol Biol Cell 2000, 11:42414257. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Guillemin K, Salama N, Tompkins L, Falkow S: Cag pathogenicity islandspecific responses of gastric epithelial cells to Helicobacter pylori infection.
Proc Natl Acad Sci USA 2002, 99:1513615141. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Arbeitman M, Furlong E, Imam F, Johnson E, Null B, Baker B, Krasnow W, Scott M, Davis R, White K: Gene expression during the life cycle of Drosophila melanogaster.

Storch K, Lipan O, Leykin I, Viswanathan N, Davis F, Wong W, Weitz C: Extensive and divergent circadian gene expression in liver and heart.

Eisen M, Spellman P, Brown P, Botstein D: Cluster analysis and display of genomewide expression patterns.
Proc Natl Acad Sci USA 1998, 95:1486314868. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Li L, Weinberg C, Darden T, Pedersen L: Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method.
Bioinformatics 2001, 17:11311142. PubMed Abstract  Publisher Full Text

Li L, Darden T, Weinberg C, Pedersen L: Gene assessment and sample classification for gene expression data using a genetic algorothm/knearest neighbor method.

Chu S, DeRisi J, Eisen M, Mulholland J, Botstein D, Brown P, Herskowitz I: The transcriptional program of sporulation in budding yeast.
Science 1998, 282:699705. PubMed Abstract  Publisher Full Text

Heyer L, Kruglyak S, Yooseph S: Exploring expression data: identification and analysis of coexpressed genes.
Genome Res 1999, 9:11061115. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

BarJoseph Z, Gerber G, Jaakkola T, Gifford D, Simon I: Continuous representations of time series gene expression data.

De Hoon M, Imoto S, Miyano S: Statistical analysis of a small set of timeordered gene expression data using linear splines.
Bioinformatics 2002, 18:14771485. PubMed Abstract  Publisher Full Text

Ji X, LiLing J, Sun Z: Mining gene expression data using a novel approach based on hidden Markov models.
FEBS Lett 2003, 542:125131. PubMed Abstract  Publisher Full Text

Kim J, Kim JH: Differencebased clustering of short timecourse microarray data with replicates.
BMC Bioinformatics 2007, 8:253. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Luan Y, Li H: Clustering of timecourse gene expression data using a mixedeffects model with Bsplines.
Bioinformatics 2003, 19:474482. PubMed Abstract  Publisher Full Text

Luan Y, Li H: Modelbased methods for identifying periodically regulated genes based on the time course microarray gene expression data.
Bioinformatics 2004, 20:332339. PubMed Abstract  Publisher Full Text

Ma P, CastilloDavis CI, Zhong W, Liu J: A datadriven clustering method for time course gene expression data.
Nucleic Acids Res 2006, 34(4):12611269. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ma P, Zhong W, Feng Y, Liu JS: Bayesian functional data clustering for temporal microarray data.
Int J Plant Genomics 2008, 2008:231897. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

MollerLevet CS, Cho KH, Wolkenhauer O: Microarray data clustering based on temporal variation: FCV with TSD preclustering.
Appl Bioinformatics 2003, 2:3545. PubMed Abstract

Phang T, Neville M, Rudolph M, Hunter L: Trajectory clustering: a nonparametric method for grouping gene expression time courses, with applications to mammary development.

Ramoni M, Sebastiani P, Kohane I: Cluster analysis of gene expression dynamics.
Proc Natl Acad Sci USA 2002, 99:91219126. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Song JJ, Lee HJ, Morris JS, Kang S: Clustering of timecourse gene expression data using functional data analysis.
Comput Biol Chem 2007, 31(4):265274. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wu F, Zhang W, Kusalik A: Dynamic modelbased clustering for timecourse gene expression data.
J Bioinform Comput Biol 2005, 3(4):821836. PubMed Abstract  Publisher Full Text

Peddada S, Harris S, Zajd J, Harvey E: ORIOGEN: order restricted inference for ordered gene expression data.
Bioinformatics 2005, 21:39333934. PubMed Abstract  Publisher Full Text

Wang L, Montano M, Rarick M, Sebastiani P: Conditional clustering of temporal expression profiles.
BMC Bioinformatics 2008, 9:147. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lobenhofer E, Bennett L, Cable P, Li L, Bushel P, Afshari C: Regulation of DNA replication fork genes by 17βestradiol.

Robertson T, Wright FT, Dykstra RL: Order Restricted Statistical Inference. John Wiley & Sons; 1988.

Akaike H: A new look at the statistical model identification.

Anraku K: An information criterion for parameters under a simple order restriction.

Laurent B, Olivier G, Olivier M: Using repeated measurements to validate hierarchical gene clusters.
Bioinformatics 2008, 24:682688. PubMed Abstract  Publisher Full Text

Gibbons F, Roth F: Judging the quality of gene expressionbased clustering methods using gene annotation.
Genome Res 2002, 12:15741581. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kerr M, Churchill G: Bootstraping cluster analysis: assessing the reliability of conclusions from microarray experiments.
Proc Natl Acad Sci USA 2001, 98:89618965. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Smolkin M, Ghosh D: Cluster stability scores for microarray data in cancer studies.
BMC Bioinformatics 2003, 4:36. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Yeung K, Haynor D, Ruzzo W: Validating clustering for gene expression data.
Bioinformatics 2001, 17:309318. PubMed Abstract  Publisher Full Text

SAS Institute Inc: SAS 9.1.3 Help and Documentation. SAS Institute Inc., Cary, NC; 2000.

R Development Core Team: [http://www.Rproject.org] webcite
R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria; 2005.

Rand WM: Objective criteria for the evaluation of clustering methods.

Chen Y, Dougherty E, Bittner M: Ratiobased decisions and the quantitative analysis of cDNA microarray images.

Hwang J, Peddada S: Confidence interval estimation subject to order restrictions.

Shi N: Maximum likelihood estimation of means and variances from normal populations under simultaneous order restrictions.

Shi N, Gao W, Zhang B: One sided estimation and testing problems for location models from grouped samples.