Abstract
Background
In the study of cancer genomics, gene expression microarrays, which measure thousands of genes in a single assay, provide abundant information for the investigation of interesting genes or biological pathways. However, in order to analyze the large number of noisy measurements in microarrays, effective and efficient bioinformatics techniques are needed to identify the associations between genes and relevant phenotypes. Moreover, systematic tests are needed to validate the statistical and biological significance of those discoveries.
Results
In this paper, we develop a robust and efficient method for exploratory analysis of microarray data, which produces a number of different orderings (rankings) of both genes and samples (reflecting correlation among those genes and samples). The core algorithm is closely related to biclustering, and so we first compare its performance with several existing biclustering algorithms on two real datasets  gastric cancer and lymphoma datasets. We then show on the gastric cancer data that the sample orderings generated by our method are highly statistically significant with respect to the histological classification of samples by using the Jonckheere trend test, while the gene modules are biologically significant with respect to biological processes (from the Gene Ontology). In particular, some of the gene modules associated with biclusters are closely linked to gastric cancer tumorigenesis reported in previous literature, while others are potentially novel discoveries.
Conclusion
In conclusion, we have developed an effective and efficient method, BiOrdering Analysis, to detect informative patterns in gene expression microarrays by ranking genes and samples. In addition, a number of evaluation metrics were applied to assess both the statistical and biological significance of the resulting biorderings. The methodology was validated on gastric cancer and lymphoma datasets.
1 Background
A typical aim of exploratory analysis of genomics data is to identify potentially interesting genes and pathways that warrant further investigation. There is a critical need to streamline the analysis in order to support continuing advances in high throughput genomics methods such as gene expression microarrays, which measure thousands of genes in a single assay and are the focus of this paper. However, such assays provide noisy and incomplete measurements, which require sophisticated bioinformatics techniques to identify statistically and biologically significant associations between genes and relevant phenotypes of interest.
Unsupervised analysis techniques cluster data without using prior information on the labels of samples. This enables the discovery of novel histological subtypes. However, a major limitation of traditional clustering algorithms for this task is that they cluster either genes or samples into nonoverlapping groups, based on the similarity of gene expression across all samples for gene clustering, or all genes per sample in sample clustering. This limits the ability to find groups of genes that are "cocorrelated" across only a subset of samples, or participate in multiple cellular pathways. A related open issue is how to evaluate the statistical significance of the clusters.
In spite of such limitations, there are examples of remarkable biologically significant discoveries. One such case revisited in this paper is the analysis of gastric cancer data [1]. The original paper used hierarchical clustering of both 7383 genes and 124 gastric cancer samples (malignant and premalignant). By inspecting the "heat map" they observed a number of prominent biclusters, which were linked to various aspects of cancer etiology. However, the approach [1] was heavily dependent on manual inspection to identify the groupings. In particular, several sets of coexpressed genes were not grouped together by hierarchical clustering, and needed to be grouped manually by expert analysis. Moreover, it is difficult to assess whether such clusters are robust to any changes, and whether different clustering attempts converge to a stable result. Consequently, there is a need for techniques that can guide such a process of discovering significant and worthwhile hypotheses for followup analysis.
Biclustering, also called coclustering, is a promising technique proposed for the automated discovery of highly correlated subsets of genes across a subset of samples. The concept of "biclustering" was first introduced by [2] and has been the subject of several surveys [3,4]. Many techniques have been used for finding biclusters with different objective functions, such as "SAMBA" using graphic models [5], biclustering by Gibbs sampling [6], the OrderPreserving Submatrix algorithm (OPSM [7]), biclustering using maximumsimilarity between genes [8], the Iterative Signature Algorithm (ISA [9]), and biclustering using linear geometry [10]. Recently, several studies have applied biclustering to more specific bioinformatics areas, such as local multiple sequence alignment of RNA [11] and eCCCBiclustering for gene expression timeseries data [12]. Several of these representative biclustering methods will be used as a basis for comparison in this paper.
This paper proposes a technique for exploratory biclustering analysis, which combines biclustering with an evaluation of the statistical significance and biological relevance of such biclusters. There are four main contributions that we make in this paper. First, we introduce a novel algorithm, called biordering, which is in some respects a member of the family of biclustering techniques. This algorithm is benchmarked against several relevant biclustering algorithms in the literature [2,5,6,9]. Second, we extend an existing statistic based on the hypergeometric distribution to a generalized statistic for evaluating the saturation of phenotypes in biclusters, called the MultipleClassSaturation (MCS) metric. In addition, we apply the Jonckheere trend test [13] to evaluate the significance of the correlation between ordered samples and clinical annotations. Third, we assess the stability of the observed results by assessing the size of their "basin of attraction" as follows. In our experiments, random initializations of the algorithm yield many unique biclusters, which are then grouped into a manageable number of families of very similar outcomes (called a "superbicluster") by a secondary clustering of the biclusters. The size of these superbiclusters provides a measure of bicluster "stability". We find that our technique is able to find a small set of highly stable superbiclusters, which correspond to distinct histopathological types in an existing gastric cancer dataset [1]. We have also applied our approach to analyze a lymphoma dataset [14]. Fourth, we demonstrate that the discovered superbiclusters have associated Gene Ontology (GO) terms with very significant pvalues, which can provide a basis for the biological interpretation of the gene modules.
In Section 2, we introduce our core algorithm BOA, as well as the statistics for its evaluation. In Section 3, we validate the BOA algorithm on the gastric cancer and lymphoma datasets and compare the results to other algorithms. The biological interpretations of the results from the gastric cancer datasets are discussed in Section 4, 5 and the Additional file 1. Finally, we conclude the paper in Section 6.
Additional file 1. Supplement. The Supplement contains the proof of convergence in a variant of BOA algorithm (See 2 for details), and the biological analysis of potential novel observations in the gastric cancer dataset discovered by our method.
Format: PDF Size: 828KB Download file
This file can be viewed with: Adobe Acrobat Reader
2 Methods
In this section, we introduce a protocol for identifying and characterizing modules of genes and orderings of samples that exhibit high statistical, biological and clinical significance. Our protocol, named Biordering Exploratory Analysis (BEA), comprises six main stages as described below.
Biordering Exploratory Analysis
Input: A gene expression data matrix [x_{gs}]n_{G }× n_{S }for n_{G }genes for n_{S }samples.
1. Generate biclusters with ordered genes and samples using BOA algorithm outlined below.
2. Merge similar biclusters into "superbiclusters" to identify robust modules of coexpressed genes.
3. Annotate biclusters with histological and biological attributes to support their interpretation.
4. Generate figures of merit (i.e., pvalues) for each bicluster:
(a) Overrepresentation of histological categories;
(b) Concordance of sample orderings with various phenotype gradients;
(c) Gene Ontology annotations.
5. Evaluate biological significance of results.
Now we briefly elaborate on selected key stages of the above protocol.
2.1 Biordering Analysis
In this section, we describe Biordering Analysis and its details.
2.1.1 Algorithm
The term "biclustering", introduced by Cheng and Church [2] refers to the identification of a submatrix with "significantly homogeneous entries". In this section we introduce a novel algorithm called Biordering Analysis (BOA), which is pivotal in the generation of our results. We now introduce our BOA algorithm. For the sake of clarity, we define an averaging operation as follows. For a vector s_{i }and a set of indices I with finite cardinality (0 < I < + ∞) we introduce the following notation for the average:
Algorithm 1 BiOrdering Analysis (BOA)
Input: n_{G }× n_{S }gene expression data matrix [x_{gs}]; two cutoff thresholds θ_{G }and θ_{S }for selecting genes and samples, respectively.
1. Normalize data: first, for each gene across all samples, normalize to median = 0 and std = 1, then repeat this for each sample across all genes.
2. Initialization: Select a subset of samples ∅ ≠ S ⊂ {1, ⋯, n_{S}}
3. Repeat the Steps ad below until convergence (i.e., G and S stabilise):
(a) Update gene scores f(g) ← ⟨x_{gs}⟩_{s∈S }for g = 1, ⋯, n_{G},
(c) Update sample scores h(s) ← ⟨x_{gs}⟩_{g∈G }for S = 1, ⋯, n_{S},
Output: The gene scores f(g), sample scores h(s), and a bicluster B = (G, S), which is determined by applying the cutoff thresholds θ_{G }and θ_{S }to the scores f(g) and h(s), respectively.
According to the iterative process of step 3 in the BOA algorithm, we aim to select a submatrix with coherently high gene expression levels across selected samples. In particular, we aim to detect differentially expressed genes between phenotypes. Consequently, it is important to normalize the data in step 1, otherwise, the genes with high expression levels across most samples will dominate the selection procedure. The above algorithm is similar to ISA [9], which is discussed in more detail in Section 3.2.
2.1.2 Variants of BOA
Note that two further variants of BOA are possible:
1. Underexpressed biclusters  This variant finds significantly downregulated genes by selecting genes and samples with f(g) and h(s) less than a threshold in steps 3.b and 3.d, respectively.
2. Fixedsize biclusters  We could also select G and S of fixed size rather than using the cutoff thresholds in each iteration. The attraction is that the algorithm is guaranteed to converge, which in practice may not always hold for the previous two options. The formal proof of the convergence for this variant is included in the Additional file 1.
2.1.3 Ordering scores
An important ability of the BOA algorithm is to assign an ordering score h(s) to samples and f(g) to genes in a bicluster. The sample score h(s) orders all samples according to the average expression level across a subset of genes in the current bicluster, while the gene score f(g) orders all genes according to a subset of samples. The sample score h(s) relates a bicluster to the clinical annotations of samples, especially for multiclass attributes, such as the cancer progression stage, or continuous attributes, such as the survival time. These are discussed in detail in Section 2.3.
2.2 Superbiclustering (SBC)
Superbiclustering is an important feature of our technique, in terms of improving the robustness and stability of the results. For every individual bicluster, the ultimate output of gene score f(g) and sample score h(s) are uniquely determined by the selection of the initial subset of samples S. This is because every selection in Step 3 of Algorithm 1 is deterministic for a given pair of thresholds (θ_{G}, θ_{S}). Thus, different initializations may result in different biclusters, which are local optima after a few iterations. In order to cover more potential biclusters, we run BOA with 1000 different initial subsets of samples. In every independent initialization, each sample is drawn into the initial subset randomly with probability of 0.2. Some of the 1000 generated biclusters are identical, while others may be very similar to each other but not exactly the same due to local optima. In order to identify robust and distinct biclusters, we then apply a hierarchical clustering algorithm to group similar biclusters into superbiclusters (SBC). We use the Jaccard coefficient on gene sets as a similarity measure between two biclusters. For example, given two biclusters B_{1 }= (G_{1}, S_{1}) and B_{2 }= (G_{2}, S_{2}), the similarity between them is:
The similarity measure could also be defined on samples, though here we focus on genes, which are the dominant and far more complex dimension to handle in gene expression microarray datasets. Based on the similarity measure in Equation 1, a hierarchy of unique biclusters is constructed. We then use a similarity threshold of 0.5 to extract a few top groups of biclusters, which are superbiclusters, from this hierarchy. For convenience of evaluation, a typical bicluster is selected as a prototype in each resulting SBC. Without loss of generality, we choose the bicluster that covers the most genes in its SBC as the prototype, and so any further numerical evaluation is based on these prototypes when we refer to a SBC.
2.3 Figures of merit
In order to evaluate the statistical and biological significance of the resulting biclusters, we employ three statistical methods and their corresponding pvalues.
2.3.1 Saturation metric of samples
The homogeneity of phenotypes of samples in a bicluster can be evaluated if we are given a prior classification of each sample (e.g., its cancer subtype) as the label. Ideally, each bicluster should be enriched with samples in one or a few more similar classes, e.g., normal or tumor samples. For the purpose of quantification, we use the pvalue of the hypergeometric distribution to evaluate the purity of biclusters according to the phenotypes of samples. Previously [5], a measure of homogeneity using the hypergeometric distribution was applied to the single most abundant class within a bicluster. However, if some genes are coexpressed across multiple classes, calculating pvalues on a single class is not an adequate representation of accuracy. To address this limitation, we extend this measure to a more generalized form where the significance is calculated for a group of classes to determine the dominant class(es). We refer to the original statistic used in [5] and our generalized statistic as SingleClass Saturation (SCS) and MultipleClass Saturation (MCS) metrics, respectively. The calculation of MCS pvalues based on the hypergeometric distribution is given in Equation 2 below.
Given a classification of samples with q classes {C_{1}, ⋯, C_{q}} and a bicluster B = (G, S), the pvalue with respect to a group of r classes is computed by:
where
are the numbers of samples in the dataset and in the bicluster B annotated with any class in , respectively. This quantity calculates the probability of observing k or more samples classified in in a bicluster B.
In our evaluation, we generate the full set of combinations of all sample classes from {C_{1}, ⋯, C_{q}} and compute p_{MCS }for each bicluster and each combination, so that we could discover any potential associations between gene sets and a group of phenotypes. Finally, we select the subset of classes that corresponds to the most significant pvalue for each bicluster in the evaluation in Section 3. Note that the SCS is a special case of the MCS. We compute a pvalue with respect to each individual class, and then select the single class that corresponds to the best pvalue for each bicluster.
2.3.2 Jonckheere's trend test
Another method to evaluate the significance of a bicluster is to compare the ordering of all samples h(s) generated by BOA with any relevant ordering y(s) of all samples based on their biological annotations, e.g., the progression stage of the cancer in the sample. We can test the agreement of samples ordered according to h(s) with this progression score y(s). We use Jonckheere's test statistic [13]:
for this purpose.
For a random scoring h(s) (the null hypothesis), this random variable U has an approximately normal distribution. For example, consider that we have an annotation scoring y(s) of samples with respect to q sample classes {C_{1}, ⋯, C_{q}}, which can be numerically ranked, e.g.,
Let N_{i }(1 ≤ i ≤ q) denote the number of samples in class C_{i}, and N denote the total number of samples. The approximate normal distribution of U determined by the random scoring h(s) and the annotation scoring y(s) has the mean:
and the variance
from which the pvalues can be estimated.
2.3.3 Gene Ontology Annotations
Given that each gene's expression in a bicluster is highly similar with respect to other genes in the bicluster, it is expected that the collection of genes as a whole are likely to be involved in some related biological processes. In order to determine this, the structured vocabulary of the Gene Ontology (GO) [15] is used to help uncover the biological processes represented by each of the biclusters. As each gene can be annotated with one or more terms within the GO, we can determine which GO terms are statistically overrepresented within a group of genes. We use an existing tool GOstat [16] to determine the statistically overrepresented terms within each bicluster for the biological process branch of the GO.
2.4 Efficiency
One of the advantages of the BOA algorithm is its efficiency. The time complexity in each iteration is (n_{G }+ n_{S}), since only averaging operations for computing the gene score f(g) and sample score h(s) are required. Practically, the number of iterations for generating a single bicluster is usually no more than 10, and the number of initializations is 1000 in our experiments.
3 Results
In this section, we analyze the performance of our algorithm on a real gene expression dataset, namely the gastric cancer dataset in [1]. The main reason for this choice is the availability of local expertise in the biology of this disease. We compare the performance of our algorithm in terms of SCS and MCS in Section 2.3 to the results obtained from the algorithms in [2,5,6,9] by using the parameter settings recommended in those papers, including the normalization method specified in each algorithm, or by observing the best results obtained under different parameter settings. The evaluation using Jonckheere's test, the Gene Ontology and the biological relevance of the results for gastric cancer are discussed in detail in Section 4. In addition, we also apply BOA to another lymphoma dataset for validation [14].
3.1 Results of BOA on Gastric Cancer dataset
After applying gene filtering as described in [1], we have n_{G }= 7383 gene expressions evaluated for n_{S }= 124 human tissue samples. Excluding two singletons, there are six different phenotypes in the data, of which three are subtypes of gastric cancer: 35 diffuse (DGC), 22 intestinal (IGC), 7 mixed (MGC); and the other three phenotypes are premalignant conditions: 26 chronic gastritis (CG), 22 intestinal metaplasia (IM) and 10 normal, e.g., noninflamed mucosa tissue removed during surgery for the gastric cancer. Now we briefly discuss the algorithmic aspects and setup of the experiment.
First, we generated a set of 1000 initializations, which were 1000 subsets of samples selected by the method described in Section 2.2. The actual number of initializing samples for gastric cancer data ranged from 14 to 41 across 1000 subsets. As described in Section 2.2, each sample is randomly selected with a probability of 0.2 for inclusion in the initial subset of samples. Note that other selection probabilities of 0.4, 0.6 and 0.8 have been tested, but the results were largely insensitive to changes in this parameter.
Note that in the BOA algorithm, there are other alternative normalization methods that can be used, i.e., using mean = 0 instead of median = 0 for centering the genes and samples. Here, we followed the normalization method used in [1] for the sake of a fair comparison with their manual analysis. In addition, we have found that there is very little numerical difference between normalizing by median = 0 and normalizing by mean = 0 on the dataset we have studied.
Second, we applied BOA to the gastric cancer data using 11 different pairs of thresholds: (θ_{G}, θ_{S}) = (5.0, 3.0), (4.0, 4.0), (5.0, 4.0), (5.5, 4.0), (6.0, 4.0), (5.0, 4.5), (5.5, 4.5), (5.0, 5.0), (5.5, 5.0), (6.0, 5.0), (5.5, 5.5), which used the same set of initializations.
These threshold settings were limited to this range since they produced biclusters of moderate size. For all biclusters across the 11 pairs, the minimum and maximum number of genes were 21 and 816, respectively. We have also tried several other groups of thresholds on the dataset, but the generated biclusters are not very informative if the thresholds are too large or too small.
For the resulting biclusters with each setting, we found that the minimal pvalues ranged between 4.3 × 10^{10 }and 1.6 × 10^{9 }for the SCS metric (no major difference was observed for SCS with 9 of the 11 threshold settings achieving the minimum pvalue of 4.3 × 10^{10}), and between 3.4 × 10^{27 }and 4.0 × 10^{14 }for the MCS metric. For further analysis we chose a midrange pair θ_{G }= 5 and θ_{S }= 4.5 for which, additionally, all 1000 initializations of BOA converged. Under this pair of thresholds, the algorithm converged to 49 biclusters, which were further grouped into 8 superbiclusters (see Table 1), and a prototype bicluster was chosen for each superbicluster as described in Section 2.2.
Table 1. Superbiclusters in gastric cancer dataset
To show the significance of the resulting biclusters we focus on the most stable superbicluster generated for the gastric data, labeled SBC7 in Table 1. Its prototype is shown in Figure 1. The BOA algorithm converged to this superbicluster 359 times out of 1000 initializations and its prototype 229 times out of 1000. This bicluster contains 515 genes that are prominently overexpressed in most premalignant samples and are underexpressed in most malignant samples. Numerically, it results in p_{SCS }= 4.32 × 10^{10 }with respect to the SCS metric (dominant class is CG) and a p_{MCS }= 4.03 × 10^{14 }with respect to the MCS metric (dominant classes are Normal, CG and IM).
Figure 1. Heat map of superbicluster 7. Heat map for the prototype of the most prominent superbicluster, SBC7, generated by the BOA algorithm for the gastric cancer data. The vertical axis shows the 515 most significant genes ordered by f(g) in Algorithm 1, and cut by θ_{G }= 5.0, while the horizontal axis shows all samples ordered by h(s) and cut by θ_{S }= 4.5. The yellow vertical line in the middle of figure indicates the boundary between the samples in the bicluster (leftside) and others (rightside). The bicluster samples are enriched with the CG subtype with a pvalue of 4.32 × 10^{10 }in terms of the SCS metric or enriched with a combination of {normal, CG, IM} subtypes with a pvalue of 4.03 × 10^{14 }in terms of the MCS metric. Moreover, we observe a strong gradation from least malignant samples (normal and CG), through an intermediate phenotype IM, to the malignant samples (combined intestinal, diffuse and mixed gastric cancers). Two phenotypes, squamous and adenosquamous, with only one sample are annotated with black and white, respectively, but are not shown on the legend. The probability of obtaining such or better ordering by random chance was estimated to have a pvalue of 5.35 × 10^{22 }in terms of Jonckheere's test.
However, there are two limitations of calculating SCS or MCS. First, these measures cannot deal with the case of continuous annotations of samples. Second, the significance of SCS and MCS are affected by the choice of cutoff threshold on samples, especially when the sample orderings h(s) change smoothly. Thus, we also used Jonckheere's test to overcome these limitations. We first allocated a "Malignancy Score" y(s) to each sample s following the expert advice: y(s) = 1 for normal, 2 for CG, 3 for IM and finally 4 for any gastric cancer (DGC, IGC or MGC sample). We then tested the significance of the agreement of the samples ordered according to the h(s) score generated by the BOA algorithm with this progression y(s). For the prototype of SBC7, the malignancy scores show an increasing trend from normal (y(s) = 1) to malignant samples (y(s) = 4) along the ascending ordered gene expression levels, which results in a directional pvalue of 5.35 × 10^{22}.
For every bicluster, we used the GOstat program [16] to obtain significantly overrepresented GO terms to investigate the associations between the terms and phenotypes. The GOstat program assesses the enrichment of GO terms within a group of genes by computing pvalues from the χ^{2 }distribution. The pvalues were corrected by the process of controlling the False Discovery Rate [17] in our experiment. As an example, several of the most significant GO terms of SBC7 are shown in Table 2.
Table 2. Overrepresented GO terms in gastric cancer dataset
More biological details of the gene modules and evaluation statistics for different SBCs are discussed in the next section.
3.2 Comparison with other algorithms
As a basis for comparison with our BOA algorithm, we have also tested several existing biclustering algorithms, namely, Cheng and Church's algorithm [2], SAMBA [5], biclustering by Gibbs sampling [6] and the ISA algorithm [9], which is closest to our algorithm. We have used publicly available implementations of these algorithms in our evaluation, i.e., SAMBA is tested using Expander [18], Gibbs sampling has been implemented by ourselves, and the biclustering toolbox BicAT [19] is used for the other two algorithms. We compare the results of BOA with four other algorithms in terms of the SCS and MCS metrics as shown in Figure 2. In these figures, we plot the number of biclusters (vertical axis) whose SCS or MCS pvalues are less than a given value on the horizontal axis, indicating the ability of each algorithm to find significant biclusters in terms of the SCS and MCS metrics. However, most algorithms usually generate redundant biclusters to different extents, i.e., slightly different biclusters due to local optima, and so comparing the redundant biclusters is not a fair evaluation. To eliminate the impact of redundancy, we applied the same superbiclustering process described in Section 2.2 to the resulting biclusters in each algorithm, and the results are shown as the lines with circle markers in the same figure. Note that the Gibbs algorithm [6] has the same biclusters and superbiclusters due to their strategy. SAMBA and Cheng & Church's algorithm still yield a large number of insignificant superbiclusters so they are not included. Generally, the BOA algorithm produced the most significant biclusters in terms of the pvalues of both the SCS and MCS metrics.
Figure 2. Saturation metrics for gastric cancer dataset. Gastric cancer benchmark results for five biclustering algorithms. We plot the number of unique biclusters (solid lines) and superbiclusters (broken lines) with the pvalue below the threshold indicated by the xaxis. Each algorithm is represented with a unique color as shown in the legend. The results for the superbiclusters are represented with the same color as the biclusters for BOA, ISA and Gibbs (broken lines). Note that Gibbs produces exactly the same lines for biclusters and superbiclusters due to their algorithm. We have used the SCS (left sub figure) and MCS (right sub figure) metrics to calculate the pvalues. We have applied 1000 random initializations for BOA and ISA and the parameter settings follow the suggestions in these studies.
Cheng & Church's method was chosen as a comparison because it is similar to our BOA method in two aspects. First, they both aim to identify "constant biclusters" [3]. Second, they all use an iterative process to refine a bicluster. However, the BOA algorithm has the following advantages that differ from Cheng & Church's method. First, in every step of the iterative process, BOA ranks all the genes and samples based on the current subset of samples and genes, so that the converged biclusters are local optima solutions that imply the relevance between genes and samples. The relation between genes and samples is a desirable property in biology, especially when studying the continuous annotations of samples. Second, the superbiclustering is also an important part of our framework, which identifies similar biclusters and then creates representative prototypes of biclusters, so that more meaningful biclusters could be found. In addition, to make a fair comparison, we also applied Cheng & Church's method to the gastric cancer dataset that has been normalized by the procedure described in the BOA method. The resulting biclusters of the two methods were evaluated by the saturation metrics and reported in the Additional file 1.
The BOA algorithm is very similar to ISA. However, the main objective of ISA is discerning "coregulated" gene modules, while the association with phenotype classes (conditions) is not important, whereas it is of prime interest for our medical application. The main formal differences resulting in the different performance are: (i) ISA starts with an initialisation of a subset of genes; (ii) the two sided test is used for the selection of samples; (iii) samples are weighted, with possibly negative weights, so different conditions, say with upregulated and downregulated genes, can be joined in the same bicluster. Consequently, ISA aims at generating "constant column" biclusters while BOA's objective is a "constant" bicluster [3].
Figure 2 shows that BOA generates more significant biclusters in terms of SCS and MCS.
Our evaluation of GO annotations for both ISA [9] and biclustering by Gibbs sampling [6] show that they are capable of generating biclusters of significance comparable to BOA (details of values are not shown). These algorithms generated 6 and 5 SBCs, respectively, with similar gene sets to the SBCs of BOA. For example, the GO annotations "generation of precursor metabolites and energy" and "oxidative phosphorylation" significantly associated with SBC7 of BOA whose pvalues are 3 × 10^{22 }and 4 × 10^{18 }(in Table 1) are also found by the ISA algorithm with pvalues of 3 × 10^{8 }and 4 × 10^{6 }and Gibbs algorithm with pvalues 1 × 10^{30 }and 5 × 10^{13}. Similarly, the "multicellular organismal process" and "multicellular organismal development" annotations (significant for diffusetype gastric cancer) in SBC6 of BOA, were also found by the ISA and Gibbs algorithms. However, we have observed that the BOA algorithm usually has better performance than either ISA or Gibbs in terms of Jonckheere's test, in particular, in the case of the evaluation of the "Malignancy Score".
3.3 Validation on lymphoma dataset
To further validate the performance in terms of SCS and MCS, we applied BOA to a lymphoma dataset [14], and compared the result to the benchmark results of the other four algorithms. Similar figures of the SCS and MCS pvalues are drawn and show in the Additional file 1. Again, the BOA algorithm generated very significant results in terms of identifying pathological categories (See Figure 3 for details).
Figure 3. Saturation metrics for lymphoma dataset. Lymphoma dataset benchmark results for five biclustering algorithms. The experimental settings and elements of these figures are the same as the gastric cancer experiments.
4 Biological Analysis of Gastric Cancer
In this section, we focus on validating the biological significance of our findings for the gastric cancer dataset.
4.1 Gene modules compared with previous study
We first compare the gene modules of the prototypes of the superbiclusters with those reported in a previous study [1]. In that study, hierarchical clustering was applied to the gastric cancer dataset (cDNA platform) and several regions of genes related to different cancer types or premalignant states were annotated (labeled A  K in Figures 1 & 2[1]. To validate the biological functions of our biclusters, we determined the intersection between the genes in these identified regions and the genes appearing in the prototypes of the eight superbiclusters (SBC1  SBC8) discussed in Section 3.1. The results are shown in Table 3. Note that the two largest superbiclusters (SBC6 and SBC7) were a close match for the two most prominent gene clusters annotated as regions B & K [1]. Moreover, the superbicluster SBC2 linked two separated but related biclusters in regions E & F [1], while the regions D1 to D3 that needed to be manually grouped in the hierarchical clustering [1] were automatically grouped by our method in SBC5. These unique biclusters confirm the homogeneous functions of the disjoint gene sets generated by hierarchical clustering.
Table 3. Comparison with previous literature
4.2 Biological relevance for gastric cancer
In Table 3 we then considered the significance of these superbiclusters in terms of the three types of figures of merit discussed in Section 2.3, namely, the SCS and MSC pvalues, the pvalue of the overrepresented GO annotations, and the pvalue of the Jonckheere test on the order of the progression of the cancer in the samples. We have discussed the assignment of malignancy scores y(s) and tested the significance of the agreement between y(s) and sample orderings h(s) in Section 3.1. Table 1 shows the numerical results of these statistics. The heat map of SBC7 (Figure 1) shows that the ordering induced by the bicluster has a clear negative correlation with the malignancy score of the samples. The h(s) for SBC5 and SBC7 and to a lesser extent SBC3 are very significantly correlated with y(s). More biological relevance is discussed in the Discussion section.
5 Discussion
Based on the results of our experiments, we now consider the biological significance of our findings. The generated results including the GO and clinical correlation were analysed by expert biologists and clinicians. We quote them to some extent as a proof that the formal data processing protocols as discussed here can lead to the generation of significant biological hypotheses warranting followup wet lab experiments. The BOA algorithm has shed new light on preexisting themes in gastric cancer etiology. The resulting biorderings represent successive steps in cancer progression and distinct histopathological types of the disease. Specifically, SBC1 represents epithelial morphology, typical to squamous samples; SBC2 and SBC8 are typical intestinal lipid metabolism signatures, observed in intestinal metaplasia premalignant samples; SBC3 and SBC4 represent a novel split of the inflammatory signature that in [1] were merged as one signature; SBC5 represents the proliferation signature described in [1] for intestinal type gastric cancer; SBC6 reflects the extracellular matrix deposition typical to diffuse type cancer, and elevated in all cancer samples compared to premalignant samples; SBC7 represents the metabolic stress observed in chronic gastritis samples, possibly due to elevated H. Pylori infection. There are also other observations which are potentially novel discoveries. They are available in the Additional file 1.
6 Conclusion
In this paper we have presented a novel method of biordering genes and samples from microarray data, together with two statistical techniques for evaluating the significance of the generated groupings and orderings of multiple histological samples. In comparison to several existing algorithms in the literature, our method is able to generate highly robust and statistically significant gene modules with respect to sample histological annotations on a gastric cancer dataset. The results of our analysis closely match reported theories of gastric cancer tumorgenesis, and have helped to identify promising hypotheses for further investigation in cancer research. We also show that other biclustering algorithms can also be utilized as a basis of exploratory biordering analysis of genomic data.
Authors' contributions
Fan Shi, under the supervision of Christopher Leckie and Adam Kowalczyk, developed the major part of the algorithms. Geo MacIntyre contributed to the Gene Ontology evaluation of the results. Alex Boussioutas and Izhak Haviv analysed the biological relevance of the results. All authors contributed to the writing of the manuscript, and they have read and approved the final version of the manuscript.
Additional file 2. algorithm implementation. The file of "algorithms.zip" contains the Matlab source code files (.m) implementing the BOA algorithm.
Format: ZIP Size: 8KB Download file
Acknowledgements
National ICT Australia (NICTA) is funded by the Australian Government's Department of Communications, Information Technology and the Arts and the Australian Council through Backing Australia's Ability and the ICT Center of Excellence program. This paper is an extended version of a previous paper in the 2nd International Workshop on Machine Learning in Systems Biology.
References

Boussioutas A: Distinctive Patterns of Gene Expression in Premalignant Gastric Mucosa and Gastric Cancer.
Cancer Research 2003, (63):25692577. PubMed Abstract  Publisher Full Text

Cheng Y, Church GM: Biclustering of expression data.
Proceedings of International Conference on Intelligent Systems for Molecular Biology 2000, 8:93103.

Madeira SC, Oliveira AL: Biclustering algorithms for biological data analysis: a survey.
IEEE/ACM transactions on computational biology and bioinformatics/IEEE, ACM 2004, 1:2445. PubMed Abstract  Publisher Full Text

Prelić A, Bleuler S, Zimmermann P, Wille A, Buhlmann P, Gruissem W, Hennig L, Thiele L, Zitzler E: A systematic comparison and evaluation of biclustering methods for gene expression data.
Bioinformatics 2006, 22(9):11221129. PubMed Abstract  Publisher Full Text

Tanay A, Sharan R, Shamir R: Discovering statistically significant biclusters in gene expression data.

Sheng Q, Moreau Y, De Moor B: Biclustering microarray data by Gibbs sampling.
Bioinformatics 2003., 19(Suppl 2) PubMed Abstract  Publisher Full Text

BenDor A, Chor B, Karp R, Yakhini Z: Discovering local structure in gene expression data: the orderpreserving submatrix problem.
Journal of computational biology a journal of computational molecular cell biology 2003, 10(34):373384. PubMed Abstract  Publisher Full Text

Liu , Xiaowen , Wang , Lusheng : Computing the maximum similarity biclusters of gene expression data.
Bioinformatics 2007, 23:5056. PubMed Abstract  Publisher Full Text

Ihmels J, Bergmann S, Barkai N: Defining transcription modules using largescale gene expression data.
Bioinformatics 2004, 20(13):19932003. PubMed Abstract  Publisher Full Text

Gan X, Liew AW, Yan H: Discovering biclusters in gene expression data based on highdimensional linear geometries.
BMC Bioinformatics 2008, 9:209+. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Wang S, Gutell RR, Miranker DP: Biclustering as a method for RNA local multiple sequence alignment.
Bioinformatics 2007, 23(24):32893296. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Madeira S, Oliveira A: A polynomial time biclustering algorithm for finding approximate expression patterns in gene expression time series.
Algorithms for Molecular Biology 2009, 4:8+. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Jonckheere A: A test of significance for the relation between m rankings and k ranked categories.
The British Journal of Statistical Psychology 1954, 7:93100.

Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, Staudt LM: Distinct types of diffuse large Bcell lymphoma identified by gene expression profiling.
Nature 2000, 403(6769):503511. PubMed Abstract  Publisher Full Text

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, IsselTarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene Ontology: tool for the unification of biology.
Nat Genet 2000, 25:2529. PubMed Abstract  Publisher Full Text

Beissbarth T, Speed TP: GOstat: find statistically overrepresented Gene Ontologies within a group of genes.
Bioinformatics 2004, 20(9):14641465. PubMed Abstract  Publisher Full Text

Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.
Journal of the Royal Statistical Society. Series B (Methodological) 1995, 57:289300.

Sharan R, MaronKatz A, Shamir R: CLICK and EXPANDER: a system for clustering and visualizing gene expression data.
Bioinformatics 2003, 19(14):17871799. PubMed Abstract  Publisher Full Text

Barkow S, Bleuler S, Prelic A, Zimmermann P, Zitzler E: BicAT: a biclustering analysis toolbox.
Bioinformatics 2006, 22(10):12821283. PubMed Abstract  Publisher Full Text