Abstract
Background
There are some limitations associated with conventional clustering methods for short timecourse gene expression data. The current algorithms require prior domain knowledge and do not incorporate information from replicates. Moreover, the results are not always easy to interpret biologically.
Results
We propose a novel algorithm for identifying a subset of genes sharing a significant temporal expression pattern when replicates are used. Our algorithm requires no prior knowledge, instead relying on an observed statistic which is based on the first and second order differences between adjacent timepoints. Here, a pattern is predefined as the sequence of symbols indicating direction and the rate of change between timepoints, and each gene is assigned to a cluster whose members share a similar pattern. We evaluated the performance of our algorithm to those of Kmeans, SelfOrganizing Map and the Short Timeseries Expression Miner methods.
Conclusions
Assessments using simulated and real data show that our method outperformed aforementioned algorithms. Our approach is an appropriate solution for clustering short timecourse microarray data with replicates.
Background
Time is an important factor in developmental biology, especially in dynamic genetics. For example, when a number of genes are differentially expressed under two or more conditions, it is often of great interest to know which changes are causal and which are not. When different conditions are represented by different timepoints, it helps us to understand not only how a gene gets turned on or off, but also which genegene relationships are based on the lags in the changes. Novel genes have been identified by monitoring the transcription profiles during development [1] or by looking at the differential responses of genes under different conditions [2,3].
Conventional timeseries methods are not well suited to the analysis of microarray data. Since the number of observed timepoints in a microarray is usually very small, common methods such as autoregression (AR), movingaverage (MA) or Fourier analysis modeling may not be applicable. Furthermore, these classic autocorrelation approaches generate bias when applied to short timecourse data [4]. In addition, observed timepoints are sometimes distributed unevenly and the length of intertimepoints increases exponentially due to biological phenomena and resource limitation. For example, some commonly used timepoints are 0, 4, 12, 24, and 48 hours. With conventional approaches, it is not clear how to calculate the magnitude of the slope between adjacent timepoints or how to determine the window size for smoothing, when needed. In order to address these problems, clustering analysis has been widely used. Clustering algorithms, which explore the problem space whose size is the Stirling's number where
in order to group similar objects together, can identify potentially meaningful relationships between objects and often their results can be visualized [5,6]. Phang et al. [7] devised a nonparametric clustering algorithm using only the direction of change from one timepoint to the next in order to group genes in a timecourse study. Ji et al. [8] proposed a modelbased clustering method based on a hidden Markov model (HMM). These models assume that each gene expression profile has been generated by a Markov chain with a certain probability. The original dataset of N timepoints is standardized and then transformed into a threedigitsequence (0 = no change, 1 = up, 2 = down) with the aid of a tolerance factor. Luan et al. [9] used cubic splines in building a mixedeffects model, where observed timepoints are treated as samples taken from underlying smooth processes. Ramoni et al. [10] adopted a Bayesian method for modelbased clustering of gene expression dynamics. The method represents geneexpression dynamics as autoregressive equations and uses an agglomerative procedure to search for the most probable set of clusters given the available data. Wu et al. [11] considered a timecourse gene expression dataset as a set of time series, generated by a number of stochastic processes. Each stochastic process defines a cluster and is described by an autoregressive model. A relocationiteration algorithm is proposed to identify the model parameters and each gene is assigned to an appropriate cluster based on posterior probabilities. Ernst et al. [12] assigned genes probabilistically to preselected subpatterns which were generated independent of the data in a short timecourse experiment.
As this wealth of approaches shows, a lot of effort has been put into developing clustering algorithms for gene clustering; however, they have some limitations. Geneticists still need a more intuitive and statistically sound methodology. To address this issue, we propose a differencebased clustering algorithm (DIBC) for a short timecourse gene expression data. DIBC discretizes a gene into a symbolic pattern of the first and secondorder differences representing direction and rate of change, respectively. Replicate and temporal order information from the input data are used in defining the clusters. DIBC outputs a crosssectional view of cluster hierarchies with varied cutoffs, shown in a 2dimensional map for biological interpretation. The clustering procedure used by DIBC is detailed in the Methods section.
We now examine the limitations of standard clustering algorithms and explain how we addressed each of them in developing our algorithm. First, misleading or uninterpretable clusters can occur when one only considers the similarity of expression profiles, thereby disregarding discretization information. An example from real data [1] is shown in Figure 1. The three yeast genes in Figure 1 are well studied; we know that each gene plays a different role in yeast sporulation which is characterized by sequential transcription of sets of genes'early', 'earlymid', 'middle', 'midlate' and 'late'. Every gene necessary for sporulation has been found to play a role in one of these five sets confirmed through genetic screens of visual assays. THI3 is known to have a specific temporal pattern in 'earlymid', PBP2 in 'mid' and CDC27 in 'midlate' [1]. Thus, all three genes have different profiles and roles. But these three genes would have put into the same cluster if a conventional clustering method was blindly performed considering profile similarity only.
Second, the rate of change is ignored when delineating underlying patterns in traditional clustering algorithms. For example, CDC27 in Figure 1 increases from 2 hr through 11 hr, but the rate of change decreases over time. This type of saturation is often observed in biological phenomena; examples include mRNA accumulation, developmental acceleration, or gradual changes in the drugresponse rate. Although some discretizationbased methods which use the direction of change have been presented [7,8,13], none of these deal with differences in rates of change. We used the secondorder difference – the difference between the firstorder statistics – in DIBC in order to incorporate rate of change information into the clustering procedure.
Figure 1. An example of falsely clustered genes. An example of falsely clustered genes using a conventional clustering method is shown. Three genes have similar profiles (i.e., correlation coefficients above 0.9) but the rates of change differ. The raw data were downloaded from [30].
Third, replicates are not fully utilized in the existing algorithms. Kerr et al. [14] pointed out that replication in microarray experiments is a fundamental principle of good experimental design because it increases the precision of estimated quantities and provides information about the uncertainty of estimates. However, replicates were infrequently used in microarray experiments due to their high cost. But now, more replicates are being (and will be) used in order to achieve enough statistical power thanks to the dropping costs of microarrays with the advance of the microarray technology as in many other electronic products. Even though appropriate methods are needed to analyze this emerging type of data, most of the current methods simply compute the average over replicates, disregarding variability. In contrast, DIBC makes use of moderated tstatistics [15], which consider an empirical Bayes variance estimate computed using the array replicates.
Fourth, conventional clustering techniques such as hierarchical clustering [16], tend to ignore temporal information by treating timecourse data as an unordered collection of events under different conditions [17]. However, timecourse experiments have a fixed order of conditions, i.e., the columns are not interchangeable [6]. The problem becomes more complicated in the presence of replicates, as any two members withingroup are interchangeable unlike betweengroups. DIBC incorporates this orderrestriction in the algorithm.
Fifth, templatebased methods require prior knowledge to choose representative genes. In the yeast example, each gene was assigned to the nearest prechosen representative gene based on previous studies. Peddada et al. [18] also predefine a set of potential candidate profiles then assign each gene using the orderrestricted inference method. However, these approaches are applicable only when there is enough information which is rare in practice.
Finally, visualization of clustering results is not always informative. Kmeans (KM) merely enumerates the list of genes, where each number signifies which cluster a gene belongs to. However, these are simply distinguishing symbols, adjacent numbers do not imply that two clusters are biologically related. SelfOrganizing Map (SOM) does a little better, as it displays clustering results in a 2dimensional grid.
Results
We evaluated the performance of our algorithm DIBC in comparison to Kmeans, SOM and Short Timeseries Expression Miner (STEM) methods using both simulated and real data [12,19,20]. The simulation data had 19 clusters with 10 members each, at four timepoints. There are eight replicates at each time point (See the Methods section for details). Real data on pancreas gene expression in mice [21] was obtained from Computational Biology and Informatics Laboratory, University of Pennsylvania [22]. We extracted 2,179 gene expression measures from a unique set of probes and used six timepoints with four or six replicates at each time point. The preprocessing methods used on the pancreas data are detailed in the Methods section.
Simulated data
For the simulated data, true clustering membership was used as knowledge external to the gene expression data. Also, the agreement between the true and the resulting cluster memberships was measured. The Adjusted Rand Index (ARI), an updated form of the Rand Index, is the number of agreements divided by the number of total objects [23] defined as:
where i and j index the clusters and classes, respectively. Higher ARI values indicate more accurate clusters. The ARI is a more sensitive, generalized version of the original Rand Index and is used as our measure of comparison.
With ARI measure, DIBC showed better accuracy across the cluster numbers than did the other three methods. Under lower noise simulations of 1, 2, and 5%, the maximum ARI values were obtained by DIBC at the true number of clusters (19), indicating that DIBC has the highest accuracy of the three methods (Figure 2). Under high noise (10%), Kmeans achieved the maximal ARI at 24 clusters, which was not the true cluster number. However, it is notable that DIBC peaks at the actual cluster number. DIBC outputs only in the neighborhood of the true cluster number unlike the other three methods because our algorithm refuses to separate insignificant changes.
Figure 2. ARI of the simulated data. The adjusted rand index (ARI) of the simulated data is plotted according to cluster number. Higher ARI, values indicate more accurate clustering results. Three algorithms were compared under four different noise (1, 2, 5, and 10%.)
As a datadriven evaluation measure without any external knowledge, the average proportion of the first eigenvalue (APF) was used to delineate the best overall clustering results. APF is the normalized proportion of eigenvalues for each cluster defined by:
In this paper, eigenvalues are calculated from the withincluster covariance matrix and assumed to be sorted in decreasing order so that the first eigenvalue corresponds to the largest eigenvalue, the second eigenvalue corresponds to the second largest eigenvalue and so on. From a dimension reduction perspective, the principal components of each resulting cluster lie in the directions of the axes of a constant density multidimensional ellipsoid [24]. If the relative magnitude of the first eigenvalue is large then the corresponding cluster is closer to linear in shape. Recently, MollerLevet et al. used the square root of the second eigenvalue as an overall clustering quality index for microarray data [13]. In the spirit of MollerLevet, the ratio of the normalized eigenvalue to the total number of clusters is used as an evaluation measure.
With the APF measure, DIBC had the largest value in the neighborhood of the true cluster number 19 under 1, 2, and 10% noise (Figure 3). The dots of DIBC appeared close to the true cluster number 19 and stayed only in the neighborhood of 19 with its APF values being the highest. Based on this result, we argue that DIBC produces mostly linearshaped clusters because it has the largest proportion of the first eigenvalue of each cluster covariance.
Figure 3. APF of the simulated data. The average proportion of the first eigenvalue (APF) is plotted as a function of cluster number. Higher APF values indicate that clusters are closer to a linear shape. Three algorithms were compared at four different noise levels.
A twodimensional pattern map (as described in the Methods section) is shown to explain how the simulation data was generated (Figure 4). Each gene is assigned to the true cluster where three firstorder difference pattern on the columns are further partitioned into nine secondorder patterns on the rows. In each gene, error bars are drawn around the mean for each timepoints. Ten member genes constitute a cluster with a total of 19 true clusters. Then the clustering result of DIBC is shown in Figure 5 to compare with the truth (Figure 4). The result is similar to the true answer since there was only one misclustered gene in the cluster (DDD, AV) whose cluster size is 11. Actual membership of this gene is (DDD, AN) whose cluster size is 9.
Figure 4. pattern map of simulation scheme. Twodimensional pattern map showing simulated data for 190 genes at four timepoints. Nineteen clusters are predefined. Each cluster has ten genes. Every gene has eight replicates. The symbols are I: increase, D: decrease, N: nochange, A: concave and V: convex.
Figure 5. pattern map of the simulated data. Twodimensional pattern map of the clustering test data., in which 190 genes are partitioned into 19 clusters. There was only one misclassified gene in the pattern (DDD, AV), so ARI was 1.
The hierarchical layers of the simulated data are shown in Figure 6; the four smallest thresholds are shown in this figure because the complete figure is so complex. Every threshold level uniformly, and correctly, exhibited 19 clusters. After level 1, there is no further repartitioning of a cluster and the three first differences are observed as three corresponding colors in each level.
Figure 6. hierarchical layer of the simulated data. Clustering results of simulated data are drawn as a hierarchical graph. Each level is a clustering result from each threshold value. Since every clustering result is the same for all levels, except level 0, DAG is drawn only up to level 5. Each node represents a symbolic pattern. The number of members in each cluster is written in parentheses. An interactive figure in SVG format is available on the supplement page [27].
Real data
For real data, gene set enrichment using Gene Ontology (GO) annotation was used instead of ARI (in the simulated data) since true cluster membership is not available. GO is a structured, controlled vocabulary for describing the roles of genes and gene products [25]. Following the work of Gibbons et al., the molecular function aspect was used out of the three aspects of GO. After mapping the thirdlevel GO ID to our pancreas genes, a contingency table of 2,179 genes by 13,505 GO IDs was created. Then the total mutual information MI_{real }between the cluster result and all the GO IDs were computed. Next, MI_{random }for a clustering result was obtained after random swapping of genes in the original clusters. This procedure was repeated 3,000 times to get corresponding MI_{random}s. Then, we subtracted the mean of MI_{random}s from MI_{real }and divided it by the standard deviation of MI_{random}s. This is a Zscore interpreted as a standardized distance between the MI value obtained from clustering after centering and scaling based on those MI values obtained by random assignments of genes to clusters. The higher the Zscore, the better one's clustering result because it indicates the observed clustering result is further away from the distribution of the random clustering results [26].
Zscore for the pancreas data is shown in Figure 7. Overall trend of the Zscores for the mutual information between clustering results and significant GO annotation for the real data decreases with an increase in cluster size, as noted by Gibbons et al.. When significant Zscores were considered (i.e. Zscores higher than 97.5% normalquantile, 1.96), DIBC gave higher and more stable Zscore values (Figure 7), achieving more significant and insightful clusters. DIBC had the largest Zscore (Z = 3.247) at 28 clusters. This was obtained from the first and the secondorder thresholds pair of pvalue cutoffs (9 × 10^{4}, 3 × 10^{4}) for significant differences.
Figure 7. Zscores of pancreas data. Mutual information between a clustering result and GO annotation is plotted using the cluster number. Higher Zscores indicate better clustering results based on external knowledge, GO. The optimal cluster number is 28 where the maximum Zscore, 3.247, is achieved.
With the APF measure, both DIBC and STEM outperformed SOM and Kmeans (Figure 8). DIBC gave the largest APF values across all cluster numbers larger than 19. STEM had the largest APF values when the cluster number was smaller than 19, but the differences in APF values between STEM and DIBC were small. While DIBC showed stable APF values, STEM's values decreased as the cluster number increased. Overall, DIBC had the largest (or nearly the largest) average magnitude of the first eigenvalue in each cluster.
Figure 8. APF of pancreas data. The average proportion of the first eigenvalue (APF) is plotted as a function of cluster number.
For the pancreas dataset, three representative threshold levels, including the 'optimal' result with 28 clusters (Zscore = 3.247), are applied to construct the corresponding three hierarchical layers (Figure 9). Levels 1 and 2 are included as ancestor layers of the optimal layer. Level 1 had a Zscore of 1.258 at cluster number 10 with threshold pair (1 × 10^{5}, 2 × 10^{5}); Level 2 had a Zscore of 2.557 at cluster number 28 with threshold pair (7 × 10^{4}, 1 × 10^{5}). An interactive version of this hierarchical layer can be found at the supplementary webpage [27].
Figure 9. hierarchical layer of pancreas data. Hierarchal structure of clustering results are drawn as a from the pancreas data. Three layers (or clustering results) are attached to the root produced from the corresponding cutoff pairs of (1 × 10^{5}, 2 × 10^{5}), (6 × 10^{4}, 1 × 10^{5}), and (9 × 10^{4}, 3 × 10^{3}). An interactive figure in SVG format is available on the supplement page [27].
The optimal clustering result from the last hierarchical layer is reconstructed as a twodimensional pattern map for the pancreas data (Figure 10). DIBC partitioned 2,179 probes of pancreas data into 28 clusters. The pattern map had six firstdifferences and 25 seconddifferences. As expected, a huge cluster (1,905 genes, 87.4%) of the null pattern ((N, N, N, N, N), (N, N, N, N)) was found.
Figure 10. pattern map of pancreas data. Twodimensional pattern map of the pancreas gene expression data. There were 2,179 genes and six timepoints. T1 had four replicates and the other timepoints had six. A cutoff value 0.003 was used to identify significant differences.
Discussion and conclusions
DIBC is a novel clustering algorithm based on the first and secondorder differences of a gene expression matrix. Our algorithm has several advantages over previous clustering algorithms for short timecourse data with replicates. First, DIBC generates interpretable clusters through discretization. Instead of producing many unlabeled partitions, DIBC offers selfexplanatory clusters. The resulting pattern map visualizes using both horizontal (the firstorder difference) and vertical (the secondorder difference) structures. Each cluster has a label composed of symbols indicating increases or decreases, which have intuitive biological interpretations. Second, our algorithm deals with the rate of change: convex and concave categories are incorporated into the definition of the symbolic pattern. Hence, we can discriminate genes into further subgroups. Third, the identification power is increased by using both the mean and variance of replicates. Conventional algorithms blindly use averaged summary data from replicates. In this way, two average values with different variances are treated equally, thereby decreasing the sensitivity to nonrandom patterns. Fourth, temporal order is incorporated into the algorithm. Columnwise shuffling (i.e., reordering timepoints) of input data would give a different output, which is not the case for K means or SOM because they do not consider the order of input data points. Fifth, DIBC requires no prior knowledge of representative genes. Even after the appropriate clustering algorithm is chosen, deciding the optimal number of clusters is very important. DIBC overcomes this problem by exhaustive space searching in an efficient way. Also, DIBC offers informative visualization. Clusters are arranged so that closely related patterns are gathered together. Such a metastructure approach is often needed in developmental and cancer biology.
When a cluster has few members, the APF value tends to get large since most eigenvalues of its withincluster covariance matrix could be zeros. For this potential bias of APF measure, we have assigned equally 10 members to each 19 cluster in the simulated data. APF values had a tendency to increase as the cluster number decreased (Figure 3) in the clustering algorithms other than DIBC. But our algorithm produced the highest APF value at the true cluster number 19 and only around this number. This result tells us that there might be a small bias in favor of the smaller number of the cluster, but not big enough to mask the performance of DIBC over the other algorithms. For real data, DIBC produced a huge null cluster having the majority of genes and a small number of clusters having a few members. However, note that in practice, a filtering step often precedes clustering methods based on the assumption that most genes are not expressed significantly under the conditions of microarray experiments. Hence the final clustering results are affected by the choice of filtering criteria, making the optimal partitioning (of genes) problem more complex and potentially biased in other direction. In contrast, DIBC performs filtering and clustering simultaneously because the allnullpattern ((N, N, ...), (N, N, ...)) is just another symbolic pattern in our algorithm. By excluding this null cluster, simultaneous filtering and clustering can be performed, unlike in other clustering algorithms. Tseng et al. [28] also criticized that current clustering algorithms are forced to assign every gene to a cluster. Many genes are irrelevant to biological pathways or conditions under screening and so the main interest of investigators lies in identifying the most informative, clusters of small sizes. With this in mind, we consider that the best clustering algorithm should produce a huge cluster of "irrelevant genes" (which corresponds to our null pattern) and a small number of clusters having a few members and this is exactly what DIBC does.
Our algorithm is based on conceptual discretizations, such as increasing, decreasing, remaining flat, and convexity or concavity, that are used as basic building blocks to define a pattern or cluster that shares a pattern with itself. This makes each cluster meaningful and interpretable. Although discretiztion may cause some loss of information, what investigators expect in gene expression data with time course may be such simple statements such as: Which genes express more rapidly with an increase of time? Which genes express early and late but remain flat in the middle? Simply finding patterns and clusters may not be sufficient for the biologists, but it is our belief that we need more effort to relate, even at the cost of loosing some information, computational analysis results with biological phenomena.
For the exhaustive search, DIBC iterates T^{2 }× (2p  3) times where T is the number of threshold values in the Methods section and p is the number of timepoints. In practice, the investigator would not need to use as many threshold values as in this study since there are multiple testing issues. Common choice of T would be T = {1 × 10^{5}, 2 × 10^{5}, ..., 9 × 10^{5}, ..., 1 × 10^{4}, ..., 9 × 10^{4}} which has a length 18. The runtime of our algorithm increases exponentially with the number of timepoints. However, most publicly available gene expression datasets have only a small number of timepoints. According to the survey by Ernst et. al. [12], most datasets involved only 2~8 timepoints. If the number of timepoints is very high, conventional timeseries techniques can instead be used; DIBC is designed for situations where this is not the case.
DIBC can be generalized for use on other types of ordinal data, including stressresponse or drugtreatment data, although the example datasets presented here focus on timecourse data. In the future, we plan to extend DIBC to twofactor designs whose orders are in two directions. For example, the analysis of druginduced gene expression has both timeorder and treatment doseorder. We could apply DIBC after redefining the firstand secondorder differences in order to take this account.
Methods
Data
We synthesized a test set of expression data for 190 genes at four timepoints and with eight replicates. The range of logratio values was (4, 4). Based on 19 template genes representing 19 clusters, we generated ten member genes for each cluster with uniform noise Unif (0.01, 0.01), obtaining each such gene using a 190 by 4 condition matrix. For replicates, we added normal noise using N (0, σ^{2}) where σ is taken from {0.04, 0.08, 0.2, 0.4}, so the matrix was extended to 190 by 32.
We also validated our algorithm with a real dataset involving pancreas gene expression in developing mice [21]. There were 3,840 genes and six timepoints (embryonic day E14.5, E16.5, E18.5, birth, postnatal day P7, and adulthood). Six replicates were performed for each timepoint from E16.5 through adulthood. E14.5 had only four replicates due to the low amount of mRNA [21]. We extracted 2,179 unique probes from the original 3,840: First, we filtered out genes if they were either unidentified or could not be associated with a GO ID. Then, we averaged redundant genes. For preprocessing, scaled printtip group lowess (locally weighted scatter plot smoothing) was performed to remove these spatial effects to enable fair comparison across timepoints [29].
Algorithm
An outline of the algorithm is as following. For each gene, a (moderated) tstatistic is obtained for two adjacent timepoints. If the initial number of timepoints was p, we should have a vector of (p  1) tstatistics for each gene. Then each tstatistic is categorized into one of three symbols I (Increase), D (Decrease) or N (No change) depending on the tdistribution and the predefined cutoff. This constitutes the firstorder symbolic pattern vector of length (p  1). Next, the difference of two adjacent tstatistics is calculated. Each difference is discretized into one of three symbols V (conVex), A (concAve) or N (No change) depending on the (empirical) distribution and the cutoff. These symbols constitute the secondorder symbolic pattern vector of length (p  2). Last a symbolic pattern of vector of length (p  1) + (p  2) = (2p  3) is defined. Once the symbolic pattern is obtained, the gene is automatically assigned to this pattern and the above procedures are repeated for each gene independently.
There are two inputs for the proposed algorithm: the gene expression data and the experimental design matrix.
Gene expression data Y = {y_{gjk}} where,
Experimental design matrix Q = {q_{jl}}_{n × 2 }where,
Step 1: The firstorder difference
The firstorder difference matrix is derived from Y where
is a twosample (moderated) tstatistic between adjacent j^{th }and (j + 1)^{th }timepoint groups with g respective sample sizes of and . This empirical Bayes method was proposed by Smyth et al. [15] and it reduces the observed variances towards a pooled estimate, thereby providing a more stable inference when the number of replicates is small.
Step 2: Symbolic pattern matrix F
Y^{(1) }is categorized into three symbols I (Increase), D (Decrease) or N (No change) to get the pattern matrix F = {f_{gj}}_{n × (p  1) }based on the critical value from the tdistribution. This step is a usual twosample ttest.
where g = 1,2,..., n; j = 1,2,..., p1
df_{gj }is the empirically estimated degree of freedom by [15]
Step 3: The secondorder difference
The secondorder difference matrix is obtained by subtracting the firstorder differences.
Step 4: Symbolic pattern matrix S
Y^{(2) }is discretized into three symbols V (conVex), A (concAve), or N (No change) to get the symbolic pattern matrix S = {f_{gj}}_{n × (p2)}. Here, the critical values were set using the difference of tdistributions, say T', empirically. To get the quantiles of T', two random samples of size 10,000 were generated from the tdistribution with degrees of freedom m_{j } 1 and m_{(j+1) } 1, respectively. Then, the α^{th }quantile values from the difference of the two samples were saved. This procedure was repeated 1,000 times. Finally, the median value of the 1,000 quantile values was chosen as our final critical value.
Step 5: Combined symbolic pattern
Two matrices F and S are combined columnwise to constitute the final pattern matrix H.
For each gene, a sequence of 2p  3 letters represents its cluster membership.
Step 6: Reassigning minor clusters
For better interpretability, we reassigned genes of minor clusters – clusters with fewer members than a predefined threshold – to the nearest cluster with the most correlated genes.
Step 7: Output
As output, we get a membership list for each gene, and a 2dimensional symbolic 'pattern map' with the firstorder difference pattern on the horizontal axis and the secondorder on the vertical axis. In each cell of the pattern map, every member profile is drawn along the timepoint axis with error bars of one standard deviation.
Identifying the number of clusters
We performed quasiexhaustive searching to determine the optimal number of clusters. We ran DIBC 1, 296 times varying threshold values from T = {1 × 10^{5}, 2 × 10^{5}, ..., 9 × 10^{5}, ..., 1 × 10^{2}, ..., 9 × 10^{2}}, where T = 1, 296. The clustering number which maximized the Zscore for the real data was chosen as the optimal clustering number. Since GO IDs were not available for the simulated data, an APFmaximizing threshold was used instead.
Visualization
DIBC provides a Directed Acyclic Graph (DAG) representation of multiple clustering results obtained at different threshold levels. Multiple clustering results are used to construct a DAG. Each node represents a symbolic pattern of difference and each edge depicts the parentchild relationship between two nodes. The rootnode is always an allnull pattern, irrespective of the threshold. Next, we obtain the first clustering result in level 1 from the smallest threshold, the third clustering result in level 2 from the second smallest threshold, and so on until all thresholds were used up. Clusters in the previous layer defined by a smaller threshold are repartitioned in the next layer defined by a larger threshold. Hence, the number of clusters increases or stays the same as the threshold number increases. In each level, nodes with a common firstorder pattern have the same color. Clicking on the node leads to a detailed profile of all member genes in a single cluster with error bars.
Once a final level (or the corresponding final result of clustering) is chosen, the clusters in that level are reorganized into a 2dimensional pattern map. The columns represent the firstorder difference, and the rows represent the secondorder difference. Each cell represents one symbolic pattern and contains a detailed profile of all members, with error bars.
Authors' contributions
JK conceived of the algorithm and carried out the data analysis. JHK oversaw the research and contributed to the evaluation procedure and the visualization. All authors contributed to preparation of the manuscript.
Acknowledgements
This study was supported by a grant from the Ministry of Health & Welfare, Korea (A040163) and JK's educational training was supported by a a grant from the Korean Pharmacogenomic Research Network (A030001), Ministry of Health & Welfare, Korea as well as the IT Scholarship Program supervised by IITA (Institute for Information Technology Advancement) & MIC (Ministry of Information and Communication), Republic of Korea.
We appreciate Elisabetta Manduchi's help in getting the pancreas data. We also thank Younjeong Choi, John Dawson, Sunduz Keles and William Whipple Neely for their critical readings of the manuscript and helpful suggestions.
References

Chu S, DeRisi J, Eisen M, Mulholland J, Botstein D, Brown PO, Herskowitz I: The transcriptional program of sporulation in budding yeast.
Science 1998, 282(5389):699705. PubMed Abstract  Publisher Full Text

Ingram JL, AntaoMenezes A, Turpin EA, Wallace DG, Mangum JB, Pluta LJ, Thomas RS, Bonner JC: Genomic analysis of human lung fibroblasts exposed to vanadium pentoxide to identify candidate genes for occupational bronchitis.
Respir Res 8:34.
2007, Apr 25
PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text 
Monroy AF, Dryanova A, Malette B, Oren DH, Ridha Farajalla M, Liu W, Danyluk J, Ubayasena LW, Kane K, Scoles GJ, Sarhan F, Gulick PJ: Regulatory gene candidates and gene expression analysis of cold acclimation in winter and spring wheat.
Plant Mol Biol
2007, Apr 17
PubMed Abstract  Publisher Full Text 
Arnau J, Bono R: Autocorrelation and bias in short time series: An alternative estimator.
Quality & Quantity 2001, 35(4):365387. Publisher Full Text

Duran BS, Odell P: Cluster AnalysisA Survey. SpringerVerlag New York; 1974:3242.

De Hoon MJ, Imoto S, Miyano S: Statistical analysis of a small set of timeordered gene expression data using linear splines.
Bioinformatics (Oxford, England) 2002, 18(11):14771485. PubMed Abstract  Publisher Full Text

Phang TL, Neville MC, Rudolph M, Hunter L: Trajectory clustering: a nonparametric method for grouping gene expression time courses, with applications to mammary development.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing 2003, 351362. PubMed Abstract

Ji X, LiLing J, Sun Z: Mining gene expression data using a novel approach based on hidden Markov models.
FEBS letters 2003, 542(1–3):125131. PubMed Abstract  Publisher Full Text

Luan Y, Li H: Clustering of timecourse gene expression data using a mixedeffects model with Bsplines.
Bioinformatics (Oxford, England) 2003., 19(4) PubMed Abstract  Publisher Full Text

Ramoni MF, Sebastiani P, Kohane IS: Cluster analysis of gene expression dynamics.
Proc Natl Acad Sci U S A 2002, 99(14):91219126. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wu FX, Zhang WJ, Kusalik AJ: Dynamic modelbased clustering for timecourse gene expression data.
J Bioinform Comput Biol 2005, 3(4):821836. PubMed Abstract  Publisher Full Text

Ernst J, Nau GJ, BarJoseph Z: Clustering short time series gene expression data.
Bioinformatics (Oxford, England) 2005, 21(Suppl 1):i159i168. PubMed Abstract  Publisher Full Text

MollerLevet CS, Cho KH, Wolkenhauer O: Microarray data clustering based on temporal variation: FCV with TSD preclustering.
Appl Bioinformatics 2003, 2(1):3545. PubMed Abstract

Kerr MK, Churchill GA: Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments.
Proc Natl Acad Sci U S A 2001, 98(16):89618965. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Smyth GK: Linear models and empirical bayes methods for assessing differential expression in microarray experiments.
Statistical Applications in Genetics and Molecular Biology 2004, 3.

Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genomewide expression patterns.
Proc Natl Acad Sci U S A 1998, 95(25):1486314868. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Peddada SD, Lobenhofer EK, Li L, Afshari CA, Weinberg CR, Umbach DM: Gene selection and clustering for timecourse and doseresponse microarray experiments using orderrestricted inference.
Bioinformatics (Oxford, England) 2003, 19(7):834841. PubMed Abstract  Publisher Full Text

MacQueen J: Some methods for classification and analysis of multivariate observations.
Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability 1967, 1:281297.

Kohonen T: Selforganizing maps. Volume 30. Berlin ; New York: Springer; 1997.

Scearce LM, Brestelli JE, McWeeney SK, Lee CS, Mazzarelli J, Pinney DF, Pizarro A, CJS Jr, Clifton SW, Permutt MA, Brown J, Melton DA, Kaestner KH: Functional genomics of the endocrine pancreas: the pancreas clone set and PancChip, new resources for diabetes research.
Diabetes 2002, 51(7):19972004. PubMed Abstract  Publisher Full Text

Computational Biology and Informatics Laboratory, University of Pennsylvania [http://www.cbil.upenn.edu] webcite

Yeung KY, Haynor DR, Ruzzo WL: Validating clustering for gene expression data.
Bioinformatics (Oxford, England) 2001, 17(4):309318. PubMed Abstract  Publisher Full Text

Johnson RA, Wichern DW: Applied multivariate statistical analysis. Upper Saddle River, NJ: Prentice Hall; 2002.

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, IsselTarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium.
Nat Genet 2000, 25:2529. PubMed Abstract  Publisher Full Text

Gibbons FD, Roth FP: Judging the quality of gene expressionbased clustering methods using gene annotation.
Genome Research 2002, 12(10):15741581. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Seoul National University Biomedical Informatics [http://www.snubi.org/software/DIBC] webcite

Tseng GC, Wong WH: Tight clustering: a resamplingbased approach for identifying stable and tight patterns in data.
Biometrics 2005, 61(1):1016. PubMed Abstract  Publisher Full Text

Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP: Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation.
Nucleic Acids Res 2002, 30(4):e15. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

The Brown Lab, Stanford University [http://cmgm.stanford.edu/pbrown/sporulation] webcite