Rapid development of DNA microarray technology has resulted in different laboratories adopting numerous different protocols and technological platforms, which has severely impacted on the comparability of array data. Current cross-platform comparisons of microarray gene expression data are usually based on cross-referencing the annotation of each gene transcript represented on the arrays, extracting a list of genes common to all arrays and comparing expression data of this gene subset. Unfortunately, filtering of genes to a subset represented across all arrays often excludes many thousands of genes, because different subsets of genes from the genome are represented on different arrays. We wish to describe the application of a powerful yet simple method for cross-platform comparison of gene expression data. Co-inertia analysis (CIA) is a multivariate method that identifies trends or co-relationships in multiple datasets which contain the same samples. CIA simultaneously finds ordinations (dimension reduction diagrams) from the datasets that are most similar. It does this by finding successive axes from the two datasets with maximum covariance. CIA can be applied to datasets where the number of variables (genes) far exceeds the number of samples (arrays), as is the case with microarray analyses.
We illustrate the power of CIA for cross-platform analysis of gene expression data by using it to identify the main common relationships in expression profiles on a panel of 60 tumour cell lines from the National Cancer Institute (NCI) which have been subjected to microarray studies using both Affymetrix and spotted cDNA array technology. The co-ordinates of the CIA projections of the cell lines from each dataset are graphed in a bi-plot and are connected by a line, the length of which indicates the divergence between the two datasets. Thus, CIA provides graphical representation of consensus and divergence between the gene expression profiles from different microarray platforms. Secondly, the genes that define the main trends in the analysis can be easily identified.
CIA is a robust, efficient approach to coupling of gene expression datasets. CIA provides simple graphical representations of the results making it a particularly attractive method for the identification of relationships between large datasets.
Microarray quantification of global gene expression is becoming a very widely used technique. Microarray technology has developed very rapidly and, as a result, different laboratories have adopted numerous different protocols and technological platforms. This severely impacts on the comparability of microarray results. The value of results from microarray gene expression studies would be much greater if they could be cross-validated and compared with data from similar studies.
Currently, meta-analyses of microarray gene expression data are usually based on cross-referencing the annotation of each probe (that is, each oligonucleotide or cDNA sequence attached to each array), extracting a list of gene probes common to all arrays and comparing the expression data of these. Cross-referencing of expression data is usually achieved using UniGene, where probes are considered matched if the GenBank accession number or IMAGE clone identifier of each probe maps to a common UniGene cluster. Meta-analysis of microarray data obtained using similar commercial platforms, or meta-analysis of small subsets of genes, is often very successful. While recent attempts to correlate complete Affymetrix oligonucleotide and spotted cDNA array gene expression datasets have reported some success, others have reported remarkably poor correlation.
Efforts to standardize and improve array annotation  should improve inter-laboratory and inter-technology analysis of gene expression data. Nonetheless, the dependence of meta-analysis of microarray data on annotation is limiting for several reasons. Firstly, the identity of gene transcripts spotted on microarrays may be ambiguous. In this case, cross-referencing genes on arrays based on a gene accession number, clone identifier, or even the sequence of a complete gene, is prone to error. In the case of older microarrays, in particular, only a proportion of clones are fully sequence-verified. Furthermore, probes on different microarray platforms may hybridise to different gene regions with different GC content, which will alter the binding properties. Probes may bind to different splice variants of the gene or to homologous genes. This is particularly true when oligonucleotide and cDNA arrays are compared.
Secondly, many protocols cross-reference genes on arrays to the UniGene database [3,6]. UniGene clusters are generated using automated sequence clustering and contain hundreds of thousands of novel expressed sequence tag (EST) sequences in addition to well-characterized genes. As procedures for automated sequence clustering are still under development, and the data, particularly EST data, are continually changing, gene clusters in UniGene are frequently updated, retired or joined. Thus, temporary inaccuracies in UniGene, in addition to any poor quality or inaccurate annotation of genes in several public or private databases, are propagated onto microarray probe annotations. Even though two probe sequences on an array may target the same region of a gene, the annotation of these probes may not concur.
Finally, in the case of many genomes including the human genome, it is not yet technically possible to represent the entire genome together with all possible splice variants on a single microarray chip. Thus, different subsets of genes from the genome are represented on different microarrays. Ideally, given a biological sample that has been subjected to several array analyses, one would like to concatenate and combine results from these in order to get as complete a picture as possible of the gene expression profile of that sample. However, cross-referencing of arrays based on annotation, and filtering expression data to that of genes represented across all arrays, excludes thousands of biologically interesting genes.
In this paper, we wish to describe the application of a powerful yet simple method which allows us to perform cross-platform comparison of gene expression data independent of data annotation. Co-inertia analysis (CIA) is a multivariate method that identifies trends or co-relationships in multiple datasets. CIA is commonly applied to the analysis of relationships between species lists and physico-chemical properties of sites in ecological studies, and has already been applied in bioinformatics to the analysis of amino acid properties. It is used in a similar manner to Canonical Correlation Analysis or Canonical Correspondence Analysis. However, these latter methods have a stringent requirement for more cases than variables and are therefore difficult to apply to microarray datasets. By contrast, CIA can be applied to datasets where the number of variables exceeds the number of observations. This is particularly attractive for the analysis of microarray data, where the number of variables (genes) far exceeds the number of samples (arrays) in most analyses. An important feature of this approach is that it is not limited to the analysis of datasets containing the same number of variables (genes). Thus, CIA does not require annotation or statistically based filtering of data prior to cross-platform analysis.
CIA is accomplished by finding successive orthogonal axes from the two datasets with maximum squared covariance. These axes can be derived by principal components analysis, in which case CIA is closely related to the method of partial least squares (PLS). PLS is, in fact, a particular case of CIA. Although the analyses and diagrams described in this paper could have been produced in a similar manner using PLS, we prefer to derive the axes by correspondence analysis (COA), as this is particularly effective at analysing and visualising relationships in microarray data [11,12]. CIA has further flexibility in that it can be used to analyse multiple sets of qualitative as well as quantitative data .
We illustrate the power of CIA for cross-platform analysis of microarray data by using it to identify the main common relationships in expression data on a panel of 60 cell lines from the National Cancer Institute (NCI) which have been subjected to different microarray studies using Affymetrix [14,15] and spotted cDNA array  technology.
Mathematical basis of CIA
Ordination is a term used in ecology, where it refers to the representation of objects (sites, stations, variables, etc) as points along one or several axes. These axes are often chosen so as to maximise the variance of the plotted points and so as to be orthogonal to all preceding axes. The axes are usually found as eigenvectors from an eigenvalue decomposition of the original data, after some transformation.
We will briefly describe the underlying mathematical basis of the ordination methods COA and CIA, following the notation of Dolédec and Chessel and of the ADE-4 package. These utilise the model of the duality diagram, which is based on the concept of a statistical triplet. A statistical triplet is composed of three matrices (X, Dc, Dr): a data matrix X (having n rows/cases and p columns/variables), possibly after an appropriate transformation, and two diagonal matrices of column and row weights, Dc and Dr, which will be defined below. When n < p, the principle of the method is the diagonalisation of an n × n matrix B defined as:
$$B = D_c^{1/2} X D_r X^t D_c^{1/2} \tag{1}$$
where $X^t$ is X transposed and $D_c^{1/2}$ is Dc with the square root of each diagonal element along the diagonal. The diagonalisation of B gives n eigenvalues corresponding to the n principal axes.
In the case of COA, the original n × p table of genes and arrays is transformed into a table of chi-square values giving the association or correspondence between each gene and each array. Let M be our matrix containing the raw data, this matrix having n rows and p columns. We can write M = [mij] with 1 ≤ i ≤ n and 1 ≤ j ≤ p. We denote the row and column sums of M as mi• and m•j respectively, m•• corresponding to the grand total. The relative contribution or weight of row i to the total variation in the data set is then denoted ri and is calculated as:
$$r_i = m_{i\bullet} / m_{\bullet\bullet} \tag{2}$$
while the relative contribution of column j is denoted as cj and is calculated as:
$$c_j = m_{\bullet j} / m_{\bullet\bullet} \tag{3}$$
Similarly, the contribution of each individual element of M to the total variation in the data set is denoted as fij and is calculated as:
$$f_{ij} = m_{ij} / m_{\bullet\bullet} \tag{4}$$
The above calculations produce two vectors R = [ri] and C = [cj] of length n and p respectively, and one matrix F = [fij] of dimension n × p. We use these vectors and this matrix to determine the values of xij, which are calculated as:
These values define the matrix X = [xij], which along with the diagonal matrices Dr (an n × n matrix of zeros with the elements of R along the diagonal) and Dc (p × p matrix with the elements of C along the diagonal) are used for COA computation as described in equation 1. This analysis results in a series of axes (the eigenvectors of the decomposition) ranked by eigenvalue, on which the arrays can be plotted. COA is of particular interest because one can also add the positions of the variables (the genes) on the plot and examine the relationships between these and the arrays. An array and a gene that have a strong association have a high chi-square value in table X and will be plotted in a similar direction from the origin of the plot.
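The weights in equations 2 to 4 can be sketched in a few lines of Python. This is a minimal illustration rather than the ADE-4 implementation. The toy matrix `M` and the function names are invented for the example, and the transformation used for the entries of X is an assumption: the form f_ij/(r_i c_j) − 1 that is commonly used for COA in the duality-diagram framework. The table must contain no all-zero row or column.

```python
def coa_triplet(m):
    """Return row weights r, column weights c and a transformed table x."""
    total = sum(sum(row) for row in m)                 # grand total m..
    r = [sum(row) / total for row in m]                # equation 2
    c = [sum(col) / total for col in zip(*m)]          # equation 3
    f = [[v / total for v in row] for row in m]        # equation 4
    # ASSUMPTION: relative departure from independence, f_ij/(r_i c_j) - 1
    x = [[f[i][j] / (r[i] * c[j]) - 1.0 for j in range(len(c))]
         for i in range(len(r))]
    return r, c, x


def total_inertia(r, c, x):
    """Weighted sum of squares of x (chi-square of m divided by m..)."""
    return sum(r[i] * c[j] * x[i][j] ** 2
               for i in range(len(r)) for j in range(len(c)))


M = [[10, 20, 5], [30, 40, 15], [5, 5, 20]]   # toy table, 3 rows x 3 columns
r, c, x = coa_triplet(M)
print(round(total_inertia(r, c, x), 4))
```

Under this assumed transformation the columns of X have zero mean once weighted by the row weights, which is a convenient sanity check on the computation.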
With CIA, we have two statistical triplets from two datasets, which we wish to analyse:
$(X, D_{cx}, D_r)$ and $(Y, D_{cy}, D_r)$
These are from two datasets, x and y, which contain the same number of rows (arrays in this case) with the same row weights (Dr), but may have different numbers of columns (two different sets of genes) with different column weights (Dcx, Dcy). Tables X and Y are the chi-squared tables derived from the two raw datasets as described in equation (5). CIA then proceeds by an eigenvalue decomposition of the triplet ($Y^t D_r X$, Dcx, Dcy), using equation (1). The details for deriving the co-inertia axes corresponding to the two datasets and the proof that these are maximally co-variant are given in Dolédec and Chessel. The derivation of these axes is also described by Dray et al. [13,18] and in 6 [see 6].
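The central step, finding one axis in each table so that the projected row scores have maximal covariance, can be illustrated with a power iteration on the weighted cross-product matrix between the two sets of columns. This is a simplified sketch under stated assumptions: the tables are already centred, column weights are uniform, only the first pair of axes is computed, and the starting vector is not orthogonal to the leading axis. It is not the ADE-4 routine, and all names and numbers are illustrative.

```python
from math import sqrt


def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def unit(v):
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def first_coinertia_axes(x, y, row_w, n_iter=200):
    """First pair of unit axes (u for x, v for y) maximising cov(x u, y v)."""
    # z = x^T D_r y: cross-products between columns of x and columns of y
    xt_w = [[xv * w for xv, w in zip(col, row_w)] for col in zip(*x)]
    z = [matvec(transpose(y), col) for col in xt_w]    # p x q
    zt = transpose(z)
    v = unit([1.0] * len(zt))
    for _ in range(n_iter):                            # power iteration
        u = unit(matvec(z, v))
        v = unit(matvec(zt, u))
    return u, v, z


def weighted_cov(a, b, row_w):
    return sum(w * p * q for w, p, q in zip(row_w, a, b))


# Toy centred tables sharing 4 rows (arrays); 3 and 2 columns (genes)
x = [[1.0, 0.5, -0.2], [-1.0, -0.4, 0.1], [0.5, 0.2, 0.3], [-0.5, -0.3, -0.2]]
y = [[2.0, -1.0], [-2.1, 0.9], [1.1, 0.2], [-1.0, -0.1]]
row_w = [0.25] * 4
u, v, z = first_coinertia_axes(x, y, row_w)
scores_x, scores_y = matvec(x, u), matvec(y, v)
print(weighted_cov(scores_x, scores_y, row_w))
```

Subsequent pairs of axes would be obtained in the same way after removing the contribution of the first pair; a full implementation would normally use a singular value decomposition instead.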
Additional File 1. Ross_5643.zip is a Microsoft Excel file that is compressed using winzip 8.0. It contains the pre-processed Ross (spotted microarray) data subsets described in this manuscript. The excel file contains 5 worksheets; the first is a readme which gives further details of the data. In addition details of the data contained in this file are given in additional file 2 'Readme.txt'.
This produces two sets of axes, one from each dataset, where the first pair of axes is chosen so as to be maximally co-variant and represents the most important joint trend in the two datasets. The second pair of axes is chosen so as to be maximally co-variant but orthogonal to the first pair, and so on for the rest of the axes. We can measure the similarity between the ordinations in two ways. The simplest is to measure the correlation between the data points on any two corresponding axes, one from each ordination. Additionally, we measure the overall similarity using a multivariate extension of the Pearson correlation coefficient called the RV-coefficient. The RV-coefficient is calculated as the total co-inertia (sum of eigenvalues of a co-inertia analysis) divided by the square root of the product of the squared total inertias (sum of the eigenvalues) from the individual COAs. It has a range of 0 to 1, where a high RV-coefficient indicates a high degree of co-structure.
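As an illustration of the overall similarity measure, the following sketch computes an RV-coefficient for two tables that share rows. It assumes the unweighted form on column-centred tables, RV = tr(XXᵗYYᵗ) / sqrt(tr((XXᵗ)²) tr((YYᵗ)²)), and not the weighted chi-square tables used in the analyses above; the function names and numbers are illustrative.

```python
from math import sqrt


def centre_columns(table):
    n = len(table)
    means = [sum(col) / n for col in zip(*table)]
    return [[v - m for v, m in zip(row, means)] for row in table]


def cross_sq_norm(a, b):
    """Squared Frobenius norm of a^T b, equal to trace(a a^T b b^T)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(p * q for p, q in zip(col_a, col_b))
            total += dot * dot
    return total


def rv_coefficient(x, y):
    """RV-coefficient between two tables with the same rows."""
    x, y = centre_columns(x), centre_columns(y)
    return cross_sq_norm(x, y) / sqrt(cross_sq_norm(x, x) * cross_sq_norm(y, y))


# Toy tables: 4 shared rows (arrays), different numbers of columns (genes)
x = [[1.0, 2.0, 0.5], [3.0, 5.0, 1.0], [4.0, 1.0, 2.5], [0.0, 2.0, 1.5]]
y = [[2.1, 0.3], [6.2, 1.1], [7.9, 2.4], [0.2, 1.6]]
print(rv_coefficient(x, y))
```

The two tables need the same number of rows but may have any number of columns, which mirrors the situation of two platforms measuring different gene sets on the same cell lines.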
The main result of the analysis is then a pair of plots, one from each dataset, with the arrays plotted out on the first 2 or 3 axes. These plots should show similar arrangements of the arrays if the datasets have strong joint trends. A simple graphical device is to superimpose the plots for the first two axes of the analysis from the two datasets. If the sample (array) scores are normalised to unit variance along each axis, the standardised scores can be superimposed. Then the location of each data point (each array) can be indicated using an arrow. The tip of the arrow is used to show the location in one plot and the start of the arrow shows the location in the other. If the datasets agree very strongly, the arrows will be short. Equally, a long arrow demonstrates a locally weak relationship between the two sets of variables for that case (array). This is the rationale behind the plots in Figures 1, 2 and 4. In Figure 4, we also plot the locations of the variables (genes) in the two plots.
Figure 1. Analysis of very similar and unrelated gene expression datasets using CIA. The first two axes of control CIA studies of very similar (A) and unrelated (B) profiles of Ross spotted cDNA gene expression data of the NCI 60 panel of cell lines are shown. The figure shows results from CIA of A) two random gene subsets of the 1375 gene dataset and B) two unrelated datasets composed of 1375 genes, where the 60 cell dataset was duplicated and the arrays in one dataset were randomly permuted. Circles and arrows represent the projected co-ordinates of each dataset, and these are joined by a line, where the length of the line is proportional to the divergence between the datasets. The colours represent the eight NCI60 cell line classes as defined by Blower et al.
Figure 2. Cross-platform comparison of Affymetrix and spotted cDNA expression profiles using CIA. The first two axes of a CIA of gene expression profiles of the complete gene set from the Ross spotted cDNA array dataset (closed circles) and 1517 genes from the Staunton Affymetrix dataset (arrows) are shown. Circles and arrows represent the projected co-ordinates of each dataset, and these are joined by a line, where the length of the line is proportional to the divergence between the different gene expression profiles. The cell lines are coloured as in Figure 1. The cell lines are derived from breast (BR), melanoma (ME), colon (CO), ovarian (OV), renal (RE), lung (LC), central nervous system (CNS, glioblastoma), prostate (PR) cancers and leukaemia (LE). Colon and leukaemia cells were separated from those with mesenchymal or stromal features (glioblastoma and renal tumour cell lines) on the first axis (F1, horizontal), and melanoma cell lines were distinguished from the other cell lines on the second axis (F2, vertical). A histogram of the main factors which explain the total variability of this CIA is superimposed on the top right corner. The first three axes represented 42%, 21% and 8% of the inertia.
Figure 3. Hierarchical clustering of Affymetrix and spotted cDNA expression profiles of 60 cell lines. Dendrograms showing average linkage hierarchical clustering of NCI60 human cancer cell lines using Spearman rank correlations. Cluster analyses of the 60 cell lines based on A) gene expression profiles of 1415 genes from the Ross spotted cDNA array dataset and B) 1517 genes from the Staunton Affymetrix dataset are shown. The cell lines are coloured as in Figure 1. The colon tumour cell line HT29 and the cluster of colon tumour cell lines are highlighted by a green arrow and bar respectively.
CIA of randomised datasets
A number of control studies of CIA of gene expression data were performed. We wished to establish what happens when datasets that are artificially similar or artificially distinct are compared. Firstly, we took the 1375 gene subset of the Ross dataset (described in the Methods section) and split it in two (by randomly assigning genes to one split or the other). This provided two datasets which have different collections of genes but which are expected to show similar patterns and trends. A graphical representation of the results from CIA of these datasets is shown in Figure 1a. Each sample (array of a cell line) is defined by an arrow, where the head of the arrow marks the position of the sample according to one ordination, and the end of the arrow indicates the sample position in the second ordination. The arrows are short and randomly oriented. The two pairs of projection coordinates are highly correlated (R = 0.99 between the two sets of co-ordinates on the first axes F1). The overall similarity in the structure of the datasets was very high, resulting in an RV coefficient of 0.97. Clearly, CIA is able to detect and highlight the similarity between these subsets, despite the fact that they have practically no variables in common.
Secondly, the effect of comparing two unrelated datasets using CIA was assessed. The same Ross dataset of 60 arrays and 1375 genes was duplicated and the arrays (cell lines) of one of these datasets were randomly permuted. Thus more or less all of the rows in these two datasets should be unrelated. The results of CIA of these datasets are shown in Figure 1b. Long, randomly orientated arrows connected samples and the RV coefficient was only 0.30, reflecting the lack of joint structure in these datasets.
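The construction of the two control datasets can be sketched as follows. The gene identifiers and expression values are synthetic placeholders, and the equal-sized split is an assumption (the text states only that genes were randomly assigned to one split or the other).

```python
import random


def split_genes(gene_ids, rng):
    """Randomly assign genes to two disjoint subsets of (near) equal size."""
    shuffled = gene_ids[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def permute_arrays(table, rng):
    """Return a copy of the table with its rows (arrays) in random order."""
    order = list(range(len(table)))
    rng.shuffle(order)
    return [table[i][:] for i in order]


rng = random.Random(0)                          # fixed seed for reproducibility
genes = [f"gene_{i}" for i in range(1375)]      # placeholder identifiers
half_a, half_b = split_genes(genes, rng)        # "artificially similar" control

# Synthetic stand-in for the expression table: 60 arrays x 5 genes
table = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(60)]
shuffled_table = permute_arrays(table, rng)     # "artificially distinct" control
```

Each control pair (`half_a` versus `half_b`, or `table` versus `shuffled_table`) would then be passed to the co-inertia analysis in the same way as two real datasets.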
Cross-platform comparison of gene expression data using CIA
Matching genes common across arrays using annotation
Currently, meta-analyses of microarray gene expression data are usually based on cross-referencing each spot represented on the arrays, extracting genes common to all arrays and examining the correlation between the expression profiles of only these genes. Several subsets of the Ross spotted cDNA expression dataset have been selected in different studies [16,20,21]. The genes common to these and to the subsets of the Staunton Affymetrix dataset (described in more detail in the Methods section) were identified using MatchMiner. MatchMiner matched the IMAGE clone identifiers of genes represented on the cDNA arrays with GenBank accession numbers of oligonucleotide sequences attached to the Affymetrix array. The number of "matched" or common genes across each of the data subsets is given in Table 1. Only 1416 genes were matched between the largest Ross (5643 genes) and Staunton (3144 genes) datasets.
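The annotation-based matching described above amounts to joining two identifier tables on a shared cluster. The sketch below shows the idea with plain dictionaries; it does not reproduce MatchMiner or its interface, and every identifier in it is invented for illustration.

```python
def match_probes(cdna_to_cluster, affy_to_cluster):
    """Pair cDNA clones with Affymetrix probes that share a UniGene cluster."""
    by_cluster = {}
    for probe, cluster in affy_to_cluster.items():
        by_cluster.setdefault(cluster, []).append(probe)
    matches = {}
    for clone, cluster in cdna_to_cluster.items():
        if cluster in by_cluster:
            matches[clone] = sorted(by_cluster[cluster])
    return matches


# Hypothetical identifier-to-cluster annotations
cdna = {"IMAGE:0001": "Hs.1", "IMAGE:0002": "Hs.2", "IMAGE:0003": "Hs.9"}
affy = {"probe_a_at": "Hs.1", "probe_b_at": "Hs.1", "probe_c_at": "Hs.3"}
matched = match_probes(cdna, affy)
print(matched)   # {'IMAGE:0001': ['probe_a_at', 'probe_b_at']}
```

In this toy case two of the three clones have no partner and would be discarded by an annotation-based comparison, which is the loss of data that motivates the approach taken here.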
Table 1. Results of CIA of different subsets of gene expression datasets
Identifying the most covariant gene expression data subsets using CIA
The disadvantage of only examining genes present across all arrays is that data from biologically significant genes may be lost if a gene is not represented on all DNA microarray platforms examined. CIA does not require pre-filtering of genes to those present in all datasets. We applied CIA to compare gene expression profiles from the Ross and Staunton datasets. Each of the Ross datasets (the complete dataset of 5643 genes, the Blower subset of 3748 genes, and the two Scherf subsets of 1375 and 1415 genes) was compared to different sub-selections of genes from the Staunton dataset using CIA. The preprocessed data used to perform these analyses are available [see Additional files 1, 2, 3, 4 and 5].
Additional File 2. Staunton_7129.zip is a Microsoft Excel file that is compressed using winzip 8.0. It contains the pre-processed Staunton (Affymetrix) data subsets described in this manuscript. The excel file contains 7 worksheets; the first is a readme which gives further details of the data. In addition details of the data contained in this file are given in additional file 2 'Readme.txt'.
Additional File 4. Ross_5643_KNN.txt is a tab delimited plain text file. Ross_5643_KNN.txt is worksheet 4 from Ross_5643.zip. This 5643 gene subset of the Ross data is described in detail in the manuscript. The IMAGE clone identifiers are in the first column, and sample (array) names in the first row.
Additional File 5. Staunton_1517_CS.txt is a tab delimited plain text file. Staunton_1517_CS.txt is worksheet 3 from Staunton_7129.zip. This 1517 gene subset of the Staunton data is described in detail in the manuscript. The Affymetrix probe identifiers are in the first column, and sample (array) names in the first row.
The relationships between these datasets, as described by the RV coefficient after CIA, are shown in Table 1. The correlations between the pairs of ordinations along the first (F1, horizontal axis) and second pair of axes (F2, vertical axis) are also shown. The results in Table 1 show that between 49% and 64% of the total variance (sum of the eigenvalues) is represented by F1 and F2 in each analysis, and there is a high correlation between pairs of ordinations on each axis. CIA of the complete Ross dataset and the smaller Staunton subset of 1517 genes resulted in the highest RV coefficient (0.88) among the data subsets examined. CIA results from this analysis are examined in detail below.
Visualising cross-platform consistencies and divergences using CIA
In Figure 2, the results of CIA co-structure analysis between the gene expression profiles of the two datasets are shown. According to the eigenvalue histogram, the first three axes accounted for 42%, 21% and 8% of the explained variance respectively. Thus 63% of the variance of the co-inertia analysis was accounted for by the first and second co-inertia axes, which therefore presented a good initial summary of the co-structure between the two datasets. The correlation (R value) between the first axes (F1) of the two ordinations was 0.96, and it was 0.98 between the second axes (F2) of the two ordinations. These high values partly result from the maximisation of the covariance, i.e. the product of the correlation and the square roots of the variances projected onto the co-inertia axes. Thus a Monte Carlo permutation test, in which the rows of one matrix are randomly permuted and the total inertia is then re-computed, was used to check the significance of the co-structure of this CIA. A total of 1000 co-inertia analyses using random matching of the two tables were processed. Permutation analysis of these 1000 datasets showed that the observed inertia was much greater than that of the simulated datasets. The probability of obtaining a total inertia at least as large as that observed, under the hypothesis of independence between the gene expression datasets, was less than 0.001. This underlines that the two tables are significantly related and that a co-structure exists.
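The logic of the permutation test can be sketched as below. The statistic used here is an assumption made for brevity: the sum of squared cross-covariances between the columns of two centred tables, standing in for the total inertia of a full co-inertia analysis. The data are synthetic, and the p-value uses the common (hits + 1)/(permutations + 1) convention.

```python
import random


def centre(table):
    n = len(table)
    means = [sum(col) / n for col in zip(*table)]
    return [[v - m for v, m in zip(row, means)] for row in table]


def total_coinertia(x, y):
    """Sum of squared cross-covariances between columns of x and y."""
    n = len(x)
    return sum((sum(a * b for a, b in zip(cx, cy)) / n) ** 2
               for cx in zip(*x) for cy in zip(*y))


def permutation_p_value(x, y, n_perm=999, seed=0):
    """Compare the observed statistic with row-permuted versions of y."""
    rng = random.Random(seed)
    x, y = centre(x), centre(y)
    observed = total_coinertia(x, y)
    hits = 0
    for _ in range(n_perm):
        shuffled = y[:]
        rng.shuffle(shuffled)                 # break the pairing of rows
        if total_coinertia(x, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Synthetic example: y is built from x plus a little noise, so the tables
# share structure and the observed statistic should be unusually large.
rng = random.Random(1)
x = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(30)]
y = [[2 * row[0] + rng.gauss(0, 0.1), row[1] - row[2] + rng.gauss(0, 0.1)]
     for row in x]
observed, p_value = permutation_p_value(x, y)
print(observed, p_value)
```

Permuting the rows of one table preserves the structure within each table while destroying the correspondence between them, which is exactly the null hypothesis of independence being tested.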
In the CIA plot of Figure 2, the co-ordinates of the 60 cell lines from both the Ross (circles) and Staunton (arrows) datasets are connected by a line, the length of which indicates divergence between the two datasets. The first axes (the horizontal F1 axes from the two data sets) separated leukaemia cells and colon cells with epithelial characteristics, from cells with mesenchymal or stromal features such as the glioblastoma and renal tumour cell lines. We inferred that the second axis (F2, vertical) is the melanoma axis, separating the melanoma cell lines from the other cell lines.
Cell lines from non-small cell lung carcinomas and breast cancers were distributed in multiple clusters, indicating that their gene expression patterns were more heterogeneous. For example, we observed that the breast cancer cell line Hs578T clustered with (was geometrically close to) the stromal/mesenchymal cluster of glioblastoma and renal tumour cell lines at the positive end of the F1 axes. By contrast, the breast cancer cells MCF-7 and T47D were projected at the opposite end of the F1 axes, closer to the colon cancer cells, which have an epithelial phenotype. These observations agree with previous findings.
For most cell lines, the divergence between the Ross and Staunton gene expression profiles was little above background noise. However the colon tumour cell line HT29 was represented by a long arrow, indicating that there were significant cross-platform differences between the expression profiles of this cell line. In the Ross ordination, the cell line HT29 clustered with the other colon tumour cell lines, but in the Staunton ordination it shifted significantly. Hence, we performed an independent evaluation using hierarchical cluster analysis (Figure 3). This analysis verified that the HT29 cell line clustered within the colon cell lines cluster when the Ross data but not the Staunton data were analysed. No single gene was responsible for the shift between ordinations of HT29.
Figure 4. Detecting genes defining major trends identified using CIA. The central panel (B) is the CIA from Figure 2. The co-ordinates of the genes in each ordination are shown in the side panels A) Ross cDNA and C) Staunton Affymetrix. The top ten genes at the end of axes F1 and F2 are labelled, where red gene labels indicate genes that were present in both datasets. Genes labelled in bold are genes that were replicated on the microarray. Genes labelled in blue are genes that were not among the top ten genes, but were in the top thirty genes at the end of each axis and are of biological interest.
Each projection of cell lines was defined by the expression of specific genes. A summary of a number of genes that were identified using CIA on each of the axes is given in Table 2 and plots showing the coordinates of genes that defined the first two axes of the CIA are shown in Figure 4. The genes most responsible for defining the axes are located at the ends of the axes. Genes and cell lines which project in the same direction from the origin have a strong association and represent genes whose expression is increased or upregulated in these cell lines. Equally genes projected in the opposite direction from the origin to cell lines are frequently genes that are lost or down regulated in those cell lines.
Table 2. Selection of genes identified using CIA
In Figure 4 the most extreme genes from the ends of each axis are labelled. Genes labelled in red are those that were present in the top 30 genes at the ends of F1 and F2 and were "matched" across platforms, that is where an IMAGE clone identifier of a spotted cDNA clone and a GenBank accession number of an Affymetrix oligonucleotide probe set mapped to the same UniGene cluster. Of the 1416 genes "matched" between these two datasets (Table 1), only 11 "matched" genes were projected within the top 30 genes at the ends of the F1 and F2 axes in both ordinations. Although only 11 of 120 genes at the ends of the F1 and F2 axes were matched, many top genes of one ordination were present in the second dataset, but were not projected at the ends of these axes. Among the top 120 genes in the Staunton Affymetrix ordination, 53 were present in the Ross spotted cDNA dataset. Equally 40 of the top 120 genes detected in the Ross ordination were present in the Affymetrix dataset. This observation that several genes present on both arrays were only associated with trends in one ordination, could highlight annotation problems, differences in binding properties between the oligonucleotide and cDNA probes representing these genes or measurement error in one or more datasets.
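Selecting the genes at the ends of the axes, and counting how many of them are also present on the other platform, can be sketched as follows. The coordinates and gene names are invented, and the helper names are illustrative; with k = 30 and two axes, up to 120 genes are selected per ordination (fewer if a gene is extreme on both axes).

```python
def genes_at_axis_ends(coords, k):
    """Return the k lowest- and k highest-scoring genes on one axis."""
    ranked = sorted(coords, key=coords.get)
    return set(ranked[:k]) | set(ranked[-k:])


def extreme_genes(f1, f2, k=30):
    """Genes among the k most extreme at either end of F1 or F2."""
    return genes_at_axis_ends(f1, k) | genes_at_axis_ends(f2, k)


# Toy gene coordinates on the first two axes of one ordination
f1 = {"g1": -2.0, "g2": 0.1, "g3": 1.5, "g4": 0.0, "g5": 0.05}
f2 = {"g1": 0.2, "g2": -1.0, "g3": 0.1, "g4": 0.9, "g5": 0.0}
top = extreme_genes(f1, f2, k=1)

# Genes assumed to be present on the other platform (hypothetical)
other_platform = {"g1", "g7"}
shared = top & other_platform
print(sorted(top), sorted(shared))   # ['g1', 'g2', 'g3', 'g4'] ['g1']
```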
The observation that the majority of genes associated with trends were represented on only one array type is significant, as these would have been excluded from analysis if standard "annotation based" methods were used. Thus gene expression data from each platform are co-visualised using CIA. We examined the genes defining each axis in the Ross or Staunton ordinations in more detail.
Epithelial versus mesenchymal clusters of cell lines on the first axis
The first axis clearly distinguished cells with epithelial versus mesenchymal characteristics. The epithelial to mesenchymal transition (EMT) is an ancient pathway integral to normal embryonic development and is implicated in the progression of malignancy of epithelial cancers such as breast and colon carcinomas . During EMT, cells acquire a morphology that is appropriate for migration and thus understanding the processes that trigger EMT may help in refining our knowledge of the biological basis of tumour progression to metastasis.
Epithelial genes were projected in the same direction as the less invasive carcinoma cell lines. The breast carcinoma cell lines MCF-7 and T47D, which have a pure luminal phenotype, were projected onto the epithelial side of the F1 axis, whereas the more invasive breast cancer MDA MB231 was projected onto the mesenchymal end of the F1 axis. This ordination agrees with recent immunohistochemical studies on these tumour cell lines .
The genes at the mesenchymal end of the first pair of CIA axes included TGFβ, N-cadherin, along with several muscle, collagen and mesenchymal markers, such as vimentin and fibronectin (Table 2). At the opposite end of this axis, several markers of epithelially-derived genes, including E-cadherin, the cytokeratins 8, 18 and 19, as well as desmoplakin I were observed.
Although a number of these genes were present in both the Staunton and Ross ordinations, the majority were in one of the two datasets only (Table 2). In the Ross ordination, E-cadherin and N-cadherin were projected at opposite ends of the F1 axis. E-cadherin maintains the integrity of epithelial tissue and is considered the primary "caretaker" gene of the epithelial phenotype. Loss of E-cadherin is heavily implicated in EMT. Loss of E-cadherin is accompanied by loss of epithelial keratins and gain of mesenchymal vimentin and fibronectin, as well as progression of malignant carcinoma. N-cadherin is gained in some carcinomas that have lost E-cadherin and this has been associated with reduced five year survival in patients with non-small cell lung cancer. We also observed that metallothionein A2 was strongly associated with the mesenchymal side of the F1 axis in the Affymetrix dataset ordination, and this gene has been shown to be implicated in invasive ductal breast carcinoma. Both hepatocyte growth factor (HGF) and TGFβ have been shown to induce EMT, and colon cancers that lack receptors to TGFβ have a better prognosis. TGFβ and vimentin were identified in the Staunton Affymetrix data. SPINT2, an inhibitor of the HGF activator, was detected at the epithelial end of the F1 axis in both ordinations. These genes are integral to EMT and thus the merging of such information from both of these datasets using CIA is noteworthy.
Genes associated with the colon cell and leukaemia cell line clusters
The first axis distinguished CNS/renal tumour tissue derived cell lines from those having their origin in either leukaemia or colon cancer. Although the leukaemia and colon tumour cell lines appear close together on the first axis, they were separated to either end of the third axis; thus, genes defining each of these cell types could be identified.
Two genes, tumor-associated calcium signal transducer 1 (TACSTD1) and cyclin-dependent kinase inhibitor 2A (CDKN2A, p16), a tumour suppressor gene, were strongly associated with the colon tumour cell lines in the Ross spotted cDNA array data ordination. TACSTD1 also featured on the Staunton Affymetrix ordination. TACSTD1 is a cell adhesion molecule expressed on the majority of tumour cells in most patients with colorectal carcinoma and, interestingly, was the target of one of the first mouse monoclonal antibodies produced for therapeutic use. Several clinical trials are ongoing using TACSTD1/CO17-1A/EpCam as a target antigen in colorectal carcinoma. We observed that increased gene expression of CDKN2A was associated with the colon tumour cell lines, although hypermethylation of CDKN2A has been correlated with poor prognosis of patients with colorectal cancer.
Genes that are expressed preferentially in haematopoietic tissues defined the leukaemia cluster. ARHGDIB, a lymphoid-specific guanosine diphosphate dissociation inhibitor, was strongly associated with the leukaemia cell line cluster and was present on both microarray platforms. In addition, a number of genes that distinguished the leukaemia cluster were only present on one of the two DNA microarray platforms. Lymphocyte cytosolic protein 1 (L-plastin, LCP1) was represented in the spotted cDNA array dataset, but not in the Affymetrix array subset. LCP1 encodes an actin-binding protein and is situated at 3q27, a locus associated with a translocation event t(3;13)(q27;q14) found in various types of non-Hodgkin's lymphoma. In the Affymetrix ordination, T-cell receptor TRCB, and an interferon-induced transmembrane protein (IFITM1) which has been implicated in the control of cell growth and deregulation, were among the genes associated with the leukaemia cluster.
Melanoma cell lines clustered with two metastases BR_MDAN and BR_MDAMB435
We observed an interesting trend within the melanoma cell line cluster, which contained seven melanoma cell lines, as well as BR_MDAN and BR_MDAMB435, two melanoma metastases which were derived from a patient diagnosed with breast cancer. In the ordination of the Ross dataset, these two "breast cancer" cell lines were furthest along the second axis. However, the melanoma cell lines were projected further along this axis in the ordination of the Staunton gene expression data. This indicated that the Affymetrix gene expression profiles contained more information on the melanoma cell lines than on the two metastases, which were less well discriminated on this axis. Thus, we examined the melanoma-specific genes represented in each dataset.
Diagnosis of melanoma is normally associated with a neoplasm that is keratin negative and positive for vimentin, S-100 and HMB-45, though MITF and Melan-A were recently reported to be superior markers to S-100 and HMB-45.
These melanoma-specific genes were very well represented on the Staunton Affymetrix ordination. We observed expression of vimentin and MITF, as well as other genes associated with pigmentation/differentiation (TYR, DCT, TYRP1, RAB7), several serum markers of melanoma progression (MIA, MAGE 3 and MAGE 12) and glycoprotein (transmembrane) nmb (GPNMB) in this ordination. Expression of GPNMB has been shown to be inversely correlated with the metastatic potential of melanoma cell lines. In addition, on the negative end of this axis, keratins 8, 18 and 19, along with S100A2, were observed. Absence of these keratins is used in clinical diagnosis of melanoma, and loss of S100A2 gene expression has been implicated as an early event in melanoma development. Thus, the melanoma phenotype was well represented on the Affymetrix ordination.
By contrast, there were considerably fewer melanoma-specific genes in the Ross dataset. Expression of melanoma cell adhesion molecule MCAM (also called MUC18), which reportedly correlates directly with the metastatic potential of human melanoma cells, was detected in the Ross cDNA ordination. In addition, keratin 8 was projected onto the negative end of the F1 axis in the Ross ordination. Although Ross et al. identified TYR, S100β and DCT as melanoma-associated genes, these were subsequently excluded in the revised release of their dataset (see Methods section) and were thus not identified in this analysis.
CIA is a particularly attractive method for visually relating multiple microarray gene expression datasets. CIA is a data coupling approach that identifies trends or patterns in tables of data that contain the same samples. In this paper CIA is applied to the cross-platform analysis of relationships in gene expression profiles of 60 cell lines, rather than to the analysis of specific genes. This is an attractive feature of CIA. Since CIA maps two gene expression datasets at the data level rather than the annotation level, it is not limited by the immaturity of gene annotation. Secondly, as CIA can accept data where the number of variables exceeds the number of individuals, filtering of data to those genes represented on all arrays is not required, and thus more genes are available for analysis. An earlier report which attempted to correlate these datasets found disappointingly poor correlations between the gene datasets. Kuo and colleagues used the BLAST algorithm to sequence-match genes represented on both array platforms. Of the 9,703 cDNA probes on the spotted cDNA array in question and the 7,245 probe sets of the Hu6800 Affymetrix arrays, 2,895 spots/probe sets were found to be sequence-matched. However, analysis of this filtered set of data showed poor cross-platform concordance.
In our analyses, the divergence between the Ross and Staunton gene expression profiles of most cell lines was little above background noise; however, we detected a large variation between the two expression patterns of the colon tumour cell line HT29. The melanoma cell lines were more defined in the Affymetrix ordination than in the ordination from the Ross dataset. This may be due to the greater number of melanoma-associated genes in the Affymetrix dataset. Thus, CIA can be used to highlight lack or presence of co-structure between datasets. Moreover, CIA can assist in the selection of the strongest features from each dataset for subsequent analysis.
Several clinically significant genes were detected in the CIA of the Ross and Staunton data. The first axes were associated with the characteristics of epithelial and mesenchymal phenotypes. Mesenchymal cells possess migratory and invasive properties typical of malignant metastasising cancer, and thus the transition between epithelial and mesenchymal phenotypes is a key field in cancer biology. Carcinoma cell lines with more invasive phenotypes were associated with the mesenchymal end of the axis. We were easily able to identify several of the most important genes associated with both the epithelial (keratins 8, 18 and 19, E-cadherin, SPINT2) and mesenchymal (TGFβ, vimentin and fibronectin) cell types. Although a number of defining genes were present on both arrays (keratin 8, fibronectin), the majority of genes were present only on one array (Table 2). Thus, given a strong association, CIA provides an opportunity to assimilate data from different gene expression sources. Equally, on the second axes of the ordinations, which defined the melanoma phenotype, and the third axes, which distinguished the leukaemia cells, nearly all of the genetic markers detected were only present in one rather than both datasets, and thus these would have been lost if we had filtered our data to those genes present across all arrays.
CIA is very flexible and extensible. It is suitable for analysis of quantitative, qualitative or even fuzzy variables. It allows coupling of two tables, each of which can be subjected to various transformations and/or centering (COA, PCA, etc.), with the only constraint being that the samples (arrays) are weighted in the same way in the two analyses.
We believe CIA is a very useful method for cross-platform comparison of gene expression profiles where the same tissue or cell lines have been arrayed multiple times. Consensus and divergence between gene expression profiles from different DNA microarray platforms are graphically visualised. Importantly, this method is not dependent on probe or sequence annotation, and thus it can extract important genes even when they are not present across all datasets.
The NCI 60 series consists of a panel of 60 human tumour cell lines derived from patients with leukaemia and melanoma, along with lung, colon, central nervous system, ovarian, renal, breast and prostate cancers. This panel has been subjected to three different DNA microarray studies using Affymetrix [14,15] and spotted cDNA array technology. We compared two of these studies, one spotted cDNA study and one Affymetrix study, and refer to them as the Ross and Staunton datasets respectively. These pre-processed data are available in additional data files [see Additional files 1, 2, 3 and 4].
The Ross Dataset
The Ross dataset contained gene expression profiles of each cell line in the NCI-60 panel, which were determined using spotted cDNA arrays containing 9,703 human cDNAs. The data were downloaded from the NCI Genomics and Bioinformatics Group Datasets resource (http://discover.nci.nih.gov/datasetsNature2000.jsp). The updated version of this dataset (updated 12/19/01) was retrieved. Data were provided as log ratio values. In this study, rows (genes) with greater than 15% of values missing were deemed unreliable and were removed from analysis, reducing the dataset to 5643 spot values per cell line. Remaining missing values were imputed using a K nearest neighbour method, with 16 neighbours and a Euclidean distance metric. This set of 5643 genes was used, along with subsets of 1375, 1415 or 3748 genes that were used in previous reports.
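The filtering and imputation steps described above can be sketched as follows. This is a minimal illustration on toy data, not the code used in the study: the function names are hypothetical, missing values are represented as None, and the imputed value is assumed to be the unweighted mean over the nearest neighbours, a detail the text does not specify.

```python
import math

def filter_rows(rows, max_missing=0.15):
    """Keep rows (genes) whose fraction of missing values (None) is at most max_missing."""
    return [r for r in rows if sum(v is None for v in r) / len(r) <= max_missing]

def row_distance(a, b):
    """Euclidean distance over columns observed in both rows, rescaled to full length."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))

def knn_impute(rows, k=16):
    """Replace each None with the mean of that column over the k nearest rows.

    Assumption: an unweighted mean of the neighbours is used.
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        missing = [j for j, v in enumerate(row) if v is None]
        if not missing:
            continue
        # Rank every other row by its distance to the incomplete row.
        ranked = sorted((row_distance(row, other), idx)
                        for idx, other in enumerate(rows) if idx != i)
        for j in missing:
            # Donors are the nearest rows that actually observed column j.
            donors = [rows[idx][j] for dist, idx in ranked
                      if dist != math.inf and rows[idx][j] is not None][:k]
            if donors:
                filled[i][j] = sum(donors) / len(donors)
    return filled

# Toy log-ratio table: 4 genes x 8 arrays, None marks a missing spot.
toy = [
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, None],   # 12.5% missing: kept
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    [None, None, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],  # 25% missing: removed
]
kept = filter_rows(toy)
imputed = knn_impute(kept, k=1)
```

With the 15% threshold, the last toy gene is dropped and the single gap in the first gene is filled from its closest neighbour. For the real data, k would be 16 as stated above.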
The Staunton Dataset
These data were derived using high density Hu6800 Affymetrix microarrays containing 7129 probe sets. The dataset was downloaded from the Whitehead Institute Cancer Genomics supplemental data to the paper from Staunton et al. (http://www-genome.wi.mit.edu/mpr/NCI60/), where the data were provided as average difference (perfect match minus mismatch) values. As described by Staunton et al., an expression value of 100 units was assigned to all average difference values less than 100. Genes whose expression was invariant across all 60 cell lines were not considered, reducing the dataset to 4515 probe sets. Gene subsets where the minimum change in gene expression across all 60 cell lines was greater than 100, 200 and 500 average difference units were selected, resulting in subsets of 3144, 2455 and 1517 probe sets. Data were logged (base 2) and median centred.
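The pre-processing of the Affymetrix values can be illustrated with a short sketch. It rests on two assumptions that the text leaves open: the "minimum change" filter is read as the difference between the largest and smallest floored value of a probe set, and median centring is applied to each probe set (row). The function name and the toy values are hypothetical.

```python
import math
from statistics import median

def preprocess(rows, floor=100.0, min_range=100.0):
    """Floor, filter by range, log2-transform and median-centre each probe set."""
    processed = []
    for values in rows:
        # Average difference values below the floor are set to the floor.
        floored = [max(v, floor) for v in values]
        # Assumption: "minimum change" means max - min across the cell lines.
        if max(floored) - min(floored) <= min_range:
            continue
        logged = [math.log2(v) for v in floored]
        # Assumption: centring is done per probe set (row).
        centre = median(logged)
        processed.append([v - centre for v in logged])
    return processed

# Toy table: 3 probe sets x 3 cell lines, in average difference units.
toy = [
    [50.0, 80.0, 90.0],      # invariant after flooring: removed
    [100.0, 400.0, 1600.0],  # range 1500: kept
    [120.0, 150.0, 180.0],   # range 60: removed at a threshold of 100
]
result = preprocess(toy, floor=100.0, min_range=100.0)
```

Raising min_range to 200 or 500 would give progressively smaller subsets, in the manner of the three subsets described above.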
Computation of CIA
Computation of CIA was performed using the ADE-4 package, a general-purpose package for multivariate statistical analysis, which has been widely used in the analysis of environmental and ecological data. It runs under MacOS 7 or Windows operating systems and can be downloaded from the ADE-4 homepage (http://pbil.univ-lyon1.fr/ADE-4/). In addition, ADE-4 is available as routines written in the R statistical computing language. These can be downloaded from the R homepage (http://cran.r-project.org/src/contrib/PACKAGES.html#ade4) or the ADE-4 for R homepage (http://pbil.univ-lyon1.fr/ADE-4). R scripts to run CIA are available on request.
The ADE-4 modules required to perform CIA are ADEtrans, FilesUtil (using the Transpose option), PCA (Correlation Matrix PCA, Covariance matrix PCA options), COA (Correspondence Analysis, Row weighted COA), CoInertia (Match two statistical triplets, coinertia test, coinertia analysis). ADE-4 can be run interactively or in batch mode. Graphical displays were obtained using the ADE-4 modules Scatters and Scatterclass.
Cross-platform comparison of two microarray datasets using CIA
The labelling of the NCI-60 cell lines varied between the Ross and Staunton studies. The cell line labels were verified, matched and sorted so that the order of the arrays was the same in each analysis. The ADE-4 implementation of CIA assumes that the row weights of both datasets are the same; thus, for analysis of microarray data, the data were transposed so that the arrays formed the rows. All data points in each dataset were made positive by the addition of a constant, as done by Fellenberg et al. and Culhane et al.
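A small sketch of these preparation steps is given below. The helper names, the toy labels and the size of the added constant (chosen here so that the smallest value becomes 1) are illustrative assumptions; the text does not state which constant was used.

```python
def align_columns(labels, reference_order, table):
    """Reorder the columns (arrays) of a genes x arrays table to match reference_order."""
    position = {name: i for i, name in enumerate(labels)}
    order = [position[name] for name in reference_order]
    return [[row[i] for i in order] for row in table]

def transpose(table):
    """Turn a genes x arrays table into an arrays x genes table."""
    return [list(column) for column in zip(*table)]

def make_positive(table, margin=1.0):
    """Add one constant to every entry so that no value is below margin.

    Assumption: the constant is just large enough to lift the minimum to margin.
    """
    lowest = min(min(row) for row in table)
    shift = margin - lowest if lowest < margin else 0.0
    return [[v + shift for v in row] for row in table]

# Toy example: 2 genes x 3 arrays, with the arrays listed in a different
# order from the reference study.
labels = ["MCF7", "HT29", "K562"]
reference_order = ["HT29", "K562", "MCF7"]
genes_by_arrays = [[-1.0, 0.5, 2.0],
                   [0.0, -3.0, 1.0]]

aligned = align_columns(labels, reference_order, genes_by_arrays)
arrays_by_genes = make_positive(transpose(aligned))
```

After these steps the rows of both tables refer to the same cell lines in the same order, which is what allows a common set of row weights to be used.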
CIA was used to determine the main relationships between the gene expression profiles from the same 60 cell lines, derived using two different microarray technologies. Each of the four subsets of the spotted Ross data and the three subsets of the Staunton data were subjected to analysis. COA was performed on each Ross dataset, and row weighted COA was performed on the gene expression data from the Staunton data, where row weights from the Ross analysis were used. The covariance of the rows (arrays) of the two chi-squared tables was then analysed using CIA.
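The core idea of CIA, finding a pair of axes (one per dataset) on which the projected samples have maximum covariance, can be illustrated with the following sketch. It is deliberately simplified and is not the ADE-4 procedure: the tables are column-centred with uniform sample weights (a PCA-style coupling) rather than transformed by COA and row weighted COA, only the first pair of axes is extracted, and a plain power iteration is used in place of a full decomposition. All function names and the toy data are hypothetical.

```python
import math

def centre_columns(table):
    """Subtract each column (gene) mean; rows are samples (arrays)."""
    n = len(table)
    means = [sum(col) / n for col in zip(*table)]
    return [[v - m for v, m in zip(row, means)] for row in table]

def cross_covariance(x, y):
    """C[j][k] = (1/n) * sum_i x[i][j] * y[i][k], i.e. uniform sample weights."""
    n = len(x)
    return [[sum(xi[j] * yi[k] for xi, yi in zip(x, y)) / n
             for k in range(len(y[0]))] for j in range(len(x[0]))]

def normalise(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def first_coinertia_axes(x, y, iterations=200):
    """Return (u, v, scores_x, scores_y, covariance) for the first pair of axes."""
    x, y = centre_columns(x), centre_columns(y)
    c = cross_covariance(x, y)
    # Start from the row of C with the largest norm, then alternate
    # u = C v and v = C' u (power iteration for the leading singular pair).
    v = normalise(max(c, key=lambda row: sum(a * a for a in row)))
    for _ in range(iterations):
        u = normalise([sum(a * b for a, b in zip(row, v)) for row in c])
        v = normalise([sum(c[j][k] * u[j] for j in range(len(u)))
                       for k in range(len(v))])
    # Project the samples of each dataset onto its own axis.
    scores_x = [sum(a * b for a, b in zip(row, u)) for row in x]
    scores_y = [sum(a * b for a, b in zip(row, v)) for row in y]
    covariance = sum(a * b for a, b in zip(scores_x, scores_y)) / len(x)
    return u, v, scores_x, scores_y, covariance

# Toy data: the same 5 samples measured on 3 genes (x) and on 2 other genes (y).
x = [[1.0, 5.0, 2.0], [2.0, 4.0, 2.5], [3.0, 3.5, 1.0], [4.0, 2.0, 3.0], [5.0, 1.0, 2.0]]
y = [[2.1, 0.5], [3.9, 0.7], [6.2, 0.4], [8.0, 0.9], [9.8, 0.6]]
u, v, scores_x, scores_y, covariance = first_coinertia_axes(x, y)
```

Note that the two tables share no genes at all; only the samples are common. Plotting scores_x against scores_y for each sample, and joining the two points belonging to the same sample, gives the kind of divergence line described for the CIA plots. In practice a singular value decomposition routine would return all pairs of axes at once.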
Cross-platform comparison of two microarray datasets using annotation methods
A list of gene transcripts represented on both array platforms was determined by using BLAST to compare sequences represented on each array. In addition, IMAGE clone identifiers of spotted cDNA elements and GenBank accession numbers of genes detected by Affymetrix oligonucleotide probe sets were "annotation matched" via UniGene ID using MatchMiner. SOURCE was used to retrieve and update gene annotation.
Before applying clustering, rows and columns (genes and cell lines) of datasets were median centred and normalised to unity. We used average linkage cluster analysis to cluster cell lines and genes using the Spearman Rank correlation measure of similarity. Analyses were accomplished using the Cluster and Treeview programs.
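The similarity measure used for clustering can be sketched as below. This illustrates only the centring, normalisation and Spearman rank correlation steps, not the Cluster program itself or the average linkage procedure; "normalised to unity" is assumed here to mean scaling each vector to a sum of squares of one, and the function names are hypothetical.

```python
import math
from statistics import median

def median_centre_and_normalise(values):
    """Subtract the median, then scale so that the sum of squares is 1."""
    centre = median(values)
    centred = [v - centre for v in values]
    norm = math.sqrt(sum(c * c for c in centred))
    # A constant vector has zero norm and is returned as all zeros.
    return [c / norm for c in centred] if norm else centred

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            result[i] = shared
        pos = end + 1
    return result

def spearman(a, b):
    """Spearman rank correlation: the Pearson correlation of the ranks.

    Both inputs must contain at least two distinct values.
    """
    ra, rb = ranks(a), ranks(b)
    mean_a, mean_b = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(ra, rb))
    var_a = sum((p - mean_a) ** 2 for p in ra)
    var_b = sum((q - mean_b) ** 2 for q in rb)
    return cov / math.sqrt(var_a * var_b)

# Toy expression profiles of two genes across four cell lines.
gene_a = median_centre_and_normalise([1.0, 4.0, 9.0, 16.0])
gene_b = median_centre_and_normalise([1.0, 2.0, 3.0, 4.0])
similarity = spearman(gene_a, gene_b)
```

Because the measure depends only on ranks, any monotone transformation of a profile (including the centring and scaling) leaves the similarity unchanged.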
AC conceived this study and carried out the analysis as a postdoctoral researcher in the group of DH. DH supervised the study and provided input both in the design of the study and drafting of the final manuscript. GP provided input regarding the interpretation of the methodology and results. All authors read and approved the manuscript.
We wish to thank the three anonymous referees for their very helpful comments. The authors would also like to thank Dr. William Gallagher for useful discussions. We are grateful to the Health Research Board, Ireland for funding.
Ball CA, Sherlock G, Parkinson H, Rocca-Sera P, Brooksbank C, Causton HC, Cavalieri D, Gaasterland T, Hingamp P, Holstege F, Ringwald M, Spellman P, Stoeckert C. J., Jr., Stewart JE, Taylor R, Brazma A, Quackenbush J: Standards for microarray data.
Comput Appl Biosci 1995, 11:321-329.
Ross DT, Scherf U, Eisen MB, Perou CM, Rees C, Spellman P, Iyer V, Jeffrey SS, Van de Rijn M, Waltham M, Pergamenschikov A, Lee JC, Lashkari D, Shalon D, Myers TG, Weinstein JN, Botstein D, Brown PO: Systematic variation in gene expression patterns in human cancer cell lines.
Statistics and Computing 1997, 7:75-83.
Scherf U, Ross DT, Waltham M, Smith LH, Lee JK, Tanabe L, Kohn KW, Reinhold WC, Myers TG, Andrews DT, Scudiero DA, Eisen MB, Sausville EA, Pommier Y, Botstein D, Brown PO, Weinstein JN: A gene expression database for the molecular pharmacology of cancer.
Anticancer Res 2002, 22:3415-3419.
Oncol Rep 2003, 10:935-938.
Galiegue-Zouitina S, Quief S, Hildebrand MP, Denis C, Detourmignies L, Lai JL, Kerckaert JP: Nonrandom fusion of L-plastin(LCP1) and LAZ3(BCL6) genes by t(3;13)(q27;q14) chromosome translocation in two cases of B-cell non-Hodgkin lymphoma.
Am J Clin Pathol 2002, 118:930-936.
Degen WG, Weterman MA, van Groningen JJ, Cornelissen IM, Lemmers JP, Agterbos MA, Geurts van Kessel A, Swart GW, Bloemers HP: Expression of nma, a novel gene, inversely correlates with the metastatic potential of human melanoma cell lines and xenografts.
Diehn M, Sherlock G, Binkley G, Jin H, Matese JC, Hernandez-Boussard T, Rees CA, Cherry JM, Botstein D, Brown PO, Alizadeh AA: SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data.