Department of Computer Science, P.O. Box 68, FI-00014, University of Helsinki, Finland

Helsinki Institute for Information Technology, Finland

Department of Information and Computer Science, Helsinki University of Technology, P.O. Box 5400, FI-02015 HUT, Finland

Abstract

Background

The bioinformatics data analysis toolbox needs general-purpose, fast, and easily interpretable preprocessing tools that perform data integration during exploratory data analysis. Our focus is on vector-valued data sources, each consisting of measurements of the same entity but on different variables, and on tasks where source-specific variation is considered noisy or uninteresting. Principal component analysis of all sources combined together is an obvious choice if it is not important to distinguish between source-specific and shared variation. Canonical Correlation Analysis (CCA) focuses on mutual dependencies and discards source-specific "noise", but it produces a separate set of components for each source.

Results

It turns out that components given by CCA can be combined easily to produce a linear, and hence fast and easily interpretable, feature extraction method. The method fuses together several sources such that the properties they share are preserved. Source-specific variation is discarded as uninteresting. We give the details and implement them in a software tool. The method is demonstrated on gene expression measurements in three case studies: classification of cell cycle regulated genes in yeast, identification of differentially expressed genes in leukemia, and defining stress response in yeast. The software package (drCCA, for R) is freely available.

Conclusion

We introduced a method for the task of data fusion for exploratory data analysis, when statistical dependencies between the sources and not within a source are interesting. The method uses canonical correlation analysis in a new way for dimensionality reduction, and inherits its good properties of being simple, fast, and easily interpretable as a linear projection.

Background

Combining evidence from several heterogeneous data sources is a central operation in computational systems biology. We assume several vector-valued data sources, such that each source consists of measurements from the same object or entity, but on different variables.

In modeling in general, when it is possible to make sufficiently detailed modeling assumptions, data integration is in principle straightforward. Given a statistical model of how transcriptional regulation works, for instance, the Bayesian framework tells how to integrate gene expression data, prior knowledge, and transcription factor binding data. Many practical problems of course remain to be solved. Alternatively, in a task of classifying proteins as ribosomal or membrane proteins, for instance, integration is likewise straightforward: do the integration such that the classification accuracy is maximized. This has been done effectively with semidefinite programming for kernel methods.

In exploratory analysis, that is, when "looking at the data" to start data analysis while the hypotheses are still vague, it is not as straightforward to decide how data sources should be integrated. The task of exploring data is particularly important for current high-throughput data sources, in order to spot measurement errors and obvious deviations from what was expected of the data, and to construct hypotheses about the nature of the data. Nowadays in bioinformatics applications this stage is typically done using dimensionality reduction, information visualization methods, and clusterings. A good exploratory analysis method is (i) fast to apply interactively, (ii) easily interpretable by the analyst, and (iii) widely applicable. Linear projection methods, as such or as preprocessing for clusterings and other methods, fulfill all these criteria.

Fusing the sources is not trivial, since we need to choose from three very different options. If all sources are equally important and there is no special reason to do otherwise, it makes sense to simply concatenate the variables from all sources together, and then continue with the resulting single source. The classical linear preprocessing method for this case is Principal Component Analysis (PCA). The second option is suitable when one of the sources, such as the class indicator in functional classification tasks, is known to be of primary interest. Then it is best to include only those variables or features within each source that are informative of the class variable. A classical linear method applicable in this case is linear discriminant analysis. This second option is supervised, and only applicable when the class information is available.

The third option is to include only those aspects of each source that are shared with the other sources. This is the option we focus on in this paper.

Commonalities in data sources have been studied by methods that search for statistical dependencies between them. The earliest method was classical linear Canonical Correlation Analysis (CCA)

CCA addresses the right problem, searching for commonalities in the data sources. Moreover, being based on eigenvalue analysis, it is fast and its results are interpretable as linear, correlated components. It is not directly usable as a data fusion tool, however, since it produces separate components and hence separate preprocessing for each source. If the separate outputs could be combined in a way that is both intuitively interpretable and rigorous, the resulting method could become a widely applicable dimensionality reduction tool, analogous to PCA for a single source. Performing dimensionality reduction helps avoid overfitting, focuses the analysis on the most important effects, and reduces the computational cost of subsequent analysis tasks.

In this paper we turn CCA into a data fusion tool by showing that the justified way of combining the sources is simply to sum together the corresponding CCA components from each source. An alternative view to this procedure is that it is equivalent to whitening each data source separately, and then running standard PCA on their combination. This is one of the standard ways of computing CCA, but for CCA the eigenvectors are finally split into parts corresponding to the sources. So the connection to CCA is almost trivial and it is amazing that, as far as we know, it has not been utilized earlier in this way.

Our contribution in this paper is to point out that CCA can be used to build a general-purpose preprocessing or feature extraction method, which is fast, and easily interpretable. There are two alternative interpretations. The first is the connection to CCA discussed above. The second is that it extends the standard practice of standardizing the mean and variance of each variable separately before dimensionality reduction. Now each data source is standardized instead of each variable.

We have developed a practical software tool for R that incorporates the subtle but crucial choices that need to be made when selecting the dimensionality of the solution. The method is demonstrated on three collections of gene expression measurements.

A kernelized version of CCA (KCCA) has also been used in specific data fusion tasks.

Results and Discussion

Algorithm

In this section we first explain a simple two-step procedure, based on whitening and PCA, for finding the aspects shared by the sources, and then show how the same fusion solution can equivalently be derived from the result of applying a generalized CCA to the collection. The two-step procedure provides the intuition for the approach: first each data source is whitened to remove all within-source structure, and then PCA applied to the concatenation of the whitened sources can only pick up variation that is shared between the sources.

Denote a collection of data sets by {**X**_{1},...,**X**_{p}}, where each **X**_{i} is an n × d_{i} matrix, that is, the i-th set contains d_{i} features measured for the same n objects. The rows of the matrices correspond to the same object in each set, while the columns correspond to features that need not be the same in the different data sets. For example, in traditional expression analyses the rows would be genes and the columns would be conditions, treatments, time points, etc. For notational simplicity, we assume zero-mean data.

In the first step, each data set is whitened to remove all within-data correlations, and the data are scaled so that all dimensions have equal variance. The whitened version of **X**_{i} is given by **Z**_{i} = **X**_{i}**W**_{i}, where **W**_{i} = **C**_{ii}^{-1/2} is the whitening matrix and **C**_{ii} is the covariance matrix of **X**_{i}.

After each data set has been whitened, the next step is to find the shared variation in them. This is done by principal component analysis (PCA) on the columnwise concatenated whitened data sets. Since all the within-data structure PCA could extract has been removed, it can only find variation shared by at least two of the data sets, and the maximum variance directions it searches for correspond to the highest between-data correlations.

Formally, applying PCA to the columnwise concatenation **Z** = [**Z**_{1},...,**Z**_{p}] of the whitened data sets amounts to the eigenvalue decomposition of its covariance matrix,

**C**_{Z} = **V Λ V**^{T},     (1)

where the orthonormal matrix **V **contains the eigenvectors, **Λ **is a diagonal matrix of projection variances, and **C**_{Z }is the covariance matrix of **Z**.

Projecting **Z** onto the first d eigenvectors **V**_{d}, corresponding to the d largest eigenvalues, gives the fused representation

**P**_{d} = **ZV**_{d},     (2)

where **P**_{d} is of size n × d.
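As a concrete illustration, the two-step procedure can be sketched in a few lines of R. This is only a minimal sketch with illustrative names; it is not the interface of the drCCA package described later, and it assumes well-conditioned covariance matrices (in practice small eigenvalues would need regularization).

```r
# Minimal sketch of the two-step fusion: whiten each data set, then run PCA on
# the column-wise concatenation. Illustrative only; not the drCCA interface.
whiten <- function(X) {
  X <- scale(X, center = TRUE, scale = FALSE)        # zero-mean data, as assumed above
  e <- eigen(cov(X), symmetric = TRUE)
  W <- e$vectors %*% diag(1 / sqrt(e$values), length(e$values)) %*% t(e$vectors)
  X %*% W                                            # whitened data: unit covariance
}

fuse <- function(datasets, d) {
  Z <- do.call(cbind, lapply(datasets, whiten))      # column-wise concatenation Z
  V <- eigen(cov(Z), symmetric = TRUE)$vectors       # eigenvectors of C_Z, Eq. (1)
  Z %*% V[, 1:d, drop = FALSE]                       # fused representation P_d, Eq. (2)
}
```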

As mentioned in the Background, the same solution can alternatively be derived through canonical correlation analysis. We briefly recapitulate CCA for two data sets and then show the equivalence.

CCA is a method for finding linear projections of two sets of variables so that the correlation between the projections is maximal. CCA is often formulated as a generalized eigenvalue problem

[**C**_{11}, **C**_{12}; **C**_{21}, **C**_{22}] **u** = (1 + ρ) diag[**C**_{11}, **C**_{22}] **u**,     (3)

where **C**_{ij} denotes the (cross-)covariance of **X**_{i} and **X**_{j}, and **u** = [**u**^{1}; **u**^{2}] is the row-wise concatenation of the canonical weight vectors of the two sets. The eigenvalues of (3) come in pairs 1 + ρ_{1}, 1 - ρ_{1},..., 1 + ρ_{m}, 1 - ρ_{m}, 1,...,1, where m = min(d_{1}, d_{2}) and (ρ_{1},...,ρ_{m}) are the canonical correlations. The canonical weights corresponding to the canonical correlations are the associated eigenvectors.

In conventional use of CCA we are usually interested in the correlations, the canonical weights **u**^{i}, and the canonical scores, defined as projections of **X**_{1 }and **X**_{2 }on the corresponding canonical weights. Next we show how the combined data set (2) can be obtained from the canonical scores, thus providing a way of using CCA to find a single representation that captures the dependencies.

For a single component, (1) can be equivalently written as

**C**_{Z}**v** = λ**v**,

where **v** is the corresponding principal component, and

(**C**_{Z} - **I**)**v** = (λ - 1)**v**,

where **Iv** has been subtracted from both sides. Equivalently,

[**0**, **W**_{1}^{T}**C**_{12}**W**_{2}; **W**_{2}^{T}**C**_{21}**W**_{1}, **0**] **v** = (λ - 1)**v**.     (4)

Let us denote **W** = diag[**W**_{1}, **W**_{2}], multiply (4) from the left by diag[**W**_{1}^{T}, **W**_{2}^{T}]^{-1}, and insert in front of **v** on both sides the identity matrix **I** = diag[**W**_{1}, **W**_{2}]^{-1}diag[**W**_{1}, **W**_{2}]. On the right side of the equation we then have the term diag[**W**_{1}^{T}, **W**_{2}^{T}]^{-1}diag[**W**_{1}, **W**_{2}]^{-1} = diag[**C**_{11}, **C**_{22}], based on the definition of the whitening matrices, and thus (4) can be written as

[**0**, **C**_{12}; **C**_{21}, **0**] **u** = (λ - 1) diag[**C**_{11}, **C**_{22}] **u**,     (5)

where **u** = diag[**W**_{1}, **W**_{2}]**v**. Adding diag[**C**_{11}, **C**_{22}]**u** to both sides of (5) gives exactly the CCA problem (3) with λ = 1 + ρ, so the principal components **v** of the whitened collection and the canonical weights **u** are related simply by **u** = **Wv**.

The combined representation (2) can therefore be written in terms of the original data as **P**_{d} = **ZV**_{d} = [**X**_{1}, **X**_{2}]diag[**W**_{1}, **W**_{2}]**V**_{d} = **X**_{1}**U**_{1,d} + **X**_{2}**U**_{2,d}, where **U**_{1,d} and **U**_{2,d} contain the first d canonical weight vectors of the two data sets. In other words, the fused representation is obtained simply by summing the canonical scores of the sources.

CCA can be generalized to more than two data sets in several ways. Here we use the generalization in which the canonical weights are obtained as the eigenvectors of **Cu** = λ**Du**, where **C** is the covariance matrix of the column-wise concatenation of the **X**_{i} and **D** is a block-diagonal matrix having the data-set-specific covariance matrices **C**_{ii} on its diagonal. Here **u** is a row-wise concatenation of the canonical weights corresponding to the different data sets. The proof follows along the same lines as for two data sets, and again the combined data set for any dimensionality d is obtained as

**P**_{d} = **X**_{1}**U**_{1,d} + ... + **X**_{p}**U**_{p,d},

where each **U**_{i,d} contains the first d canonical weight vectors of the i-th data set.

In summary, the simple linear preprocessing method of whitening followed by PCA equals computing a generalized CCA on the collection of data sets and summing the canonical scores of the data sets. In practice it does not matter in which way the result is obtained, but the two-step procedure illustrates more clearly why this kind of approach is useful for data integration. Furthermore, the idea is not limited to linear projections, and the same motivation could be extended to different kinds of models. In practice, implementing the first step might, however, be difficult in more complex models.
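The equivalence is easy to check numerically on synthetic data: solving the generalized eigenproblem **Cu** = λ**Du** and scaling the eigenvectors so that **u**^{T}**Du** = 1 reproduces, up to column signs, the output of the whitening-plus-PCA route. The following R check uses made-up data and the fuse() helper sketched above; it is only an illustration, not part of the drCCA package.

```r
# Numerical check of the equivalence on synthetic data; relies on fuse() above.
set.seed(1)
X1 <- matrix(rnorm(200 * 4), 200)
X2 <- X1[, 1:3] %*% diag(c(1, 0.7, 0.4)) + 0.5 * matrix(rnorm(200 * 3), 200)
X1 <- scale(X1, scale = FALSE); X2 <- scale(X2, scale = FALSE)   # zero-mean data
P_pca <- fuse(list(X1, X2), d = 2)                   # route 1: whitening + PCA

C <- cov(cbind(X1, X2))                              # route 2: generalized CCA
D <- matrix(0, 7, 7)                                 # 7 = 4 + 3 columns in total
D[1:4, 1:4] <- cov(X1); D[5:7, 5:7] <- cov(X2)       # block-diagonal C_ii
e <- eigen(solve(D, C))                              # eigenvectors of D^{-1} C
ord <- order(Re(e$values), decreasing = TRUE)
U <- Re(e$vectors[, ord[1:2]])
U <- sweep(U, 2, sqrt(colSums(U * (D %*% U))), "/")  # scale so that u' D u = 1
P_cca <- cbind(X1, X2) %*% U                         # sum of canonical scores

max(abs(abs(P_pca) - abs(P_cca)))                    # ~0: columns agree up to sign
```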

Choice of dimensionality

The dimensionality of the projection can be chosen to be fixed, such as two or three for visualization, or alternatively an "optimal" dimensionality can be sought. In this section we introduce our suggestion for optimizing the dimensionality. Intuitively, the dimensionality should be high enough to preserve most of the shared variation and yet low enough to avoid overfitting. The first few components contain most of the reliable shared variation among the data sets, while the last components may actually represent just noise, and thus dropping some of the dimensions makes the method more robust.

The maximum dimensionality is the sum of the dimensionalities of the data sets, but in practice already a considerably smaller dimensionality is often sufficient, and in fact leads to a better representation due to suppression of noise. Note also that in the case of two data sets the number of unique projections is only the minimum of the data dimensionalities.

In a nutshell, we increase the dimensionality one at a time, testing with a randomization test whether the new dimension captures shared variation. To protect against overfitting, all estimates of captured variation are computed on a validation set, i.e., on data that has not been used for computing the components (dimensions). The randomization test essentially compares the shared variance along the new dimension to the shared variance we would get under the null hypothesis of mutually independent sources. When the shared variance no longer differs significantly from the null hypothesis, the final dimensionality has been reached.

To compute the shared variance of the original data, we divide the data into training and validation sets. From the training set we compute the projection matrix **V**^{t} and the whitening matrix **W**^{t}, where **W**^{t} is a block-diagonal matrix containing the whitening matrices of each matrix in the training data. The fused representation for the validation data is then computed as **P**^{v} = **X**^{v}**W**^{t}**V**^{t}, where **X**^{v} is the columnwise concatenation of the validation data matrices. The variance of this fused representation is our estimate of the shared variance. We average the estimate over three different splits into training and validation sets.

To compute the shared variance under the null hypothesis, random data sets are created from the multivariate normal distribution with a diagonal covariance matrix whose diagonal values equal the columnwise variances of the original data sets. The same whitening, projection, and variance estimation procedure is applied to these random data sets, and repeating this several times gives the distribution of shared variances under the null hypothesis.

The shared variance in the original data is then compared to the distribution of shared variances under the null hypothesis, starting from the first dimension. When the dimensions no longer differ significantly (we used a 2% significance level), we have arrived at the "optimal" dimensionality and the rest of the dimensions are discarded.
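The selection procedure can be sketched as follows. This is a simplified illustration with our own function names: it uses a single training/validation split instead of three, assumes zero-mean data with more samples than features in each set, and compares each component against the empirical null quantile at the 2% level; the drCCA package implements the actual procedure.

```r
# Simplified sketch of the dimensionality selection by randomization test.
# `datasets` is a list of zero-mean matrices with matched rows.
select_dim <- function(datasets, n_null = 100, alpha = 0.02, train_frac = 0.5) {
  n  <- nrow(datasets[[1]])
  tr <- sample(n, round(train_frac * n))               # single training/validation split
  shared_var <- function(sets) {
    Ws <- lapply(sets, function(X) {                   # whitening from training rows only
      e <- eigen(cov(X[tr, , drop = FALSE]), symmetric = TRUE)
      e$vectors %*% diag(1 / sqrt(e$values), length(e$values)) %*% t(e$vectors)
    })
    Ztr <- do.call(cbind, Map(function(X, W) X[tr, , drop = FALSE] %*% W, sets, Ws))
    V   <- eigen(cov(Ztr), symmetric = TRUE)$vectors   # projection from training data
    Zva <- do.call(cbind, Map(function(X, W) X[-tr, , drop = FALSE] %*% W, sets, Ws))
    apply(Zva %*% V, 2, var)                           # validation variance per component
  }
  obs  <- shared_var(datasets)
  null <- replicate(n_null, shared_var(lapply(datasets, function(X)
    matrix(rnorm(length(X), sd = rep(apply(X, 2, sd), each = nrow(X))), nrow(X)))))
  keep <- obs > apply(null, 1, quantile, probs = 1 - alpha)   # exceeds the null quantile?
  if (any(!keep)) which(!keep)[1] - 1 else length(obs)        # first non-significant dim
}
```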

Note that assuming normally distributed data in the null hypothesis is consistent with the assumptions implicitly made by CCA. The underlying task is to capture all statistical dependencies in the new representation, and finding correlations (as CCA does) is equivalent to that only for normally distributed data. For considerably non-normal data the choice of dimensionality may not be optimal, but neither is the method itself. Therefore, transforming the data so that it roughly follows a normal distribution (for example, taking the logarithm of gene expression values) is advisable.

Implementation

We have implemented the method, including the choice of dimensionality and the validation measures presented in the Methods section, as an open-source package (drCCA) for the R environment.

A software package in R. An R implementation of the method including the source codes and documentation of the software.


Experiments

Validation on gene expression data

We first validate the method on three collections of gene expression data sets (described in the Methods section), using technical criteria and comparing against PCA applied to the concatenation of the sources.

In the case of two data sets, an estimate of the mutual information can be computed directly from the canonical correlations as

I(**X**_{1}; **X**_{2}) = -(1/2) Σ_{k} log(1 - ρ_{k}^{2}),

based on the assumption of normally distributed data. Consequently we started by confining the analysis to pairs of data sources. The figure below shows the estimated mutual information as a function of the reduced dimensionality for a representative pair of data sets from each collection.

Mutual Information

**Mutual Information**. Mutual information for two data sets as a function of the reduced dimensionality. Each subgraph shows the mutual information curve for one pair of data sets from one of the data collections. The curves for other pairs in each data collection show a similar pattern.
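For two sources, such an estimate is straightforward to compute with standard tools. A minimal sketch using base R's cancor (not the drCCA package), under the Gaussian assumption:

```r
# Gaussian mutual-information estimate from canonical correlations; X1 and X2
# are matrices with matched rows. Uses base R's cancor, not the drCCA package.
mi_gaussian <- function(X1, X2, d = NULL) {
  rho <- cancor(X1, X2)$cor                 # canonical correlations
  if (!is.null(d)) rho <- rho[seq_len(min(d, length(rho)))]
  -0.5 * sum(log(1 - rho^2))                # -1/2 * sum_k log(1 - rho_k^2)
}
```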

For more than two data sources, the measures explained in the section Validation measures (see Methods) were used.

Shared variance and data-specific variance captured by the fused data (both defined in the section Validation measures) were computed for each of the three data collections. The presented results are averages over five-fold cross-validation, and the variances have always been computed for the left-out data. In addition to the PCA comparison, we provide baseline results obtained with random orthonormal projections distributed uniformly on the unit sphere.

The results are presented for each of the data collections in the figures below; in each figure the top subfigure shows the retained shared variation and the bottom subfigure the retained data-specific variation.

Shared and Data-specific variation for leukemia data

**Shared and Data-specific variation for leukemia data**. Shared (top) and data-specific (bottom) variation retained with CCA (solid line) and PCA (dashed line) as a function of the reduced dimensionality for the leukemia data. The values obtained by random projections (dash-dotted line and dotted confidence intervals) have been included for reference. The suggested dimensionality for the CCA-projection is marked with a tick.

Shared and Data-specific variation for cell-cycle data

**Shared and Data-specific variation for cell-cycle data**. Shared (top) and data-specific (bottom) variation retained with CCA (solid line) and PCA (dashed line) as a function of reduced dimensionality for the cell-cycle data. The values obtained by random projections (dash-dotted line and dotted confidence intervals) have been included for reference. The suggested dimensionality for the CCA-projection is marked with a tick.

Shared and Data-specific variation for stress data

**Shared and Data-specific variation for stress data**. Shared (top) and data-specific (bottom) variation retained with CCA (solid line) and PCA (dashed line) as a function of the reduced dimensionality for the stress data. The values obtained by random projections (dash-dotted line and dotted confidence intervals) have been included for reference. The suggested dimensionality for the CCA-projection is marked with a tick.

The proposed method retains more between-data variation (top subfigures) over a wide range of dimensionalities in all cases, and the difference is particularly clear for the leukemia data.

It is striking that in all three cases PCA, which simply aims to keep maximal variation, is the best also in terms of the shared variation at a dimensionality of one. A one-dimensional projection, however, loses a lot of the variation and is not very interesting as a summary of several data sets, so this finding has little practical significance.

One notable observation is that especially for the leukemia data (Fig.

The curves of extracted variance can be contrasted with the dimensionalities suggested by the procedure described in the section Choice of dimensionality; the suggested dimensionalities are marked with ticks in the figures.

Prototypical Applications

In this section we will discuss a few prototypical ways in which the method could be applied. The method is a general-purpose tool for integrating a collection of data sets in such a way that the effects common to several sets are enhanced. After the integration step any analysis method operating on vectorial data can be used. Here some simple methods are used for demonstrational purposes. The applications are demonstrated on the same data sets that were used in the technical validation.

Shared effects in leukemia subtypes

Pediatric acute lymphoblastic leukemia (ALL) is a heterogeneous disease with subtypes that differ markedly in their cellular and molecular characteristics as well as their response to therapy and subsequent risk of relapse

The fusion method was applied to combine the five ALL data sets, resulting in an 11-dimensional representation. After this we can proceed as if we had only one data source. It has an 11-dimensional feature vector for each gene, and we separate the 1% of genes that have the highest distance from the origin, implying the highest total contribution to the shared variation. This set of genes is compared to the corresponding set obtained from an 11-dimensional PCA projection of the whole collection. In addition, a baseline result computed from the full concatenation of the original data sets is included.
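Selecting such a gene list from a fused matrix is a one-liner; the snippet below is illustrative (the matrix `fused`, with genes as rows, stands for the output of the fusion step and is not a drCCA object).

```r
# Pick the 1% of genes farthest from the origin in the fused representation
# (rows of `fused` are genes). Illustrative only; not the exact analysis pipeline.
top_frac <- 0.01
dist2    <- rowSums(fused^2)                 # squared distance from the origin
selected <- order(dist2, decreasing = TRUE)[seq_len(ceiling(top_frac * nrow(fused)))]
```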

A functional annotation tool, DAVID (Database for Annotation, Visualization and Integrated Discovery), was used to check the selected gene lists for enriched Gene Ontology terms; the results are summarized in the table below.

GO enrichment by CCA and PCA

| GO term | PCA | CCA | Baseline |
|---|---|---|---|
| Response to biotic stimulus | 53, 2.2E-15 | 61, 2.7E-19 | 55, 6.2E-17 |
| Defense response | 51, 1.1E-14 | 58, 7.3E-18 | 53, 3.4E-16 |
| Immune response | 47, 3.8E-13 | 54, 9.5E-17 | 48, 3.9E-14 |
| Response to pest, pathogen, parasite | 29, 7.0E-09 | 30, 1.4E-08 | 26, 1.4E-06 |
| Response to other organism | 29, 3.0E-08 | 30, 6.1E-08 | 26, 4.8E-06 |
| Response to stimulus | 61, 6.4E-07 | 75, 4.1E-12 | 62, 2.0E-07 |
| Response to stress | 35, 9.5E-06 | 36, 3.6E-05 | 31, 1.5E-03 |
| Organismal physiological process | 54, 5.7E-05 | 68, 4.8E-10 | 55, 2.0E-05 |
| Response to external stimulus | 22, 1.8E-04 | 22, 9.4E-04 | 19, 1.5E-02 |
| Response to wounding | 19, 3.3E-04 | 20, 2.9E-04 | 16, 3.8E-02 |

The enriched Gene Ontology terms from the biological process category with Bonferroni-corrected p-values lower than 0.01. Each cell gives the number of selected genes annotated with the term, followed by the corrected p-value. Both CCA and PCA result in the same 10 terms, and here they are sorted according to the p-values of the PCA result.

Classification of cell cycle regulated genes in yeast

The second prototype application concerns the regulation of the cell cycle in yeast, using the gene expression data collection described in the Methods section.

As the new representation is simply a real-valued vector for each gene, several alternative classifiers are applicable; here a K-nearest neighbor (KNN) classifier is used for demonstrational purposes. We use cell-cycle regulated genes reported in an earlier study as the class labels.

The leave-one-out classification accuracy of the CCA and PCA projections is shown in the figure below.

KNN classification for cell-cycle data

**KNN classification for cell-cycle data**. The classification accuracy obtained using the combined representation as a function of dimensionality. The CCA-based combination (solid line) is clearly superior to the PCA-based approach (dashed line) for a wide range of dimensionalities and obtains higher maximal accuracy. As a baseline, the classification accuracy obtained by the concatenation of all original data sets (dotted line) is also included.
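As an illustration of this kind of evaluation (not the exact experimental setup used here), leave-one-out KNN accuracy on a fused representation can be computed with the class package; `fused`, `labels`, and the choice of k are placeholders.

```r
# Illustrative leave-one-out KNN evaluation on a fused representation.
# `fused` is an n x d matrix from the fusion step, `labels` a factor of classes;
# k = 5 is an arbitrary placeholder, not the value used in the experiments.
library(class)
loo_accuracy <- function(fused, labels, k = 5) {
  pred <- knn.cv(train = fused, cl = labels, k = k)   # leave-one-out predictions
  mean(pred == labels)
}
```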

Defining the environmental stress in yeast

We also study a collection of yeast gene expression data measured under various stress conditions (described in the Methods section).

In the original study, a set of genes responding in a common way to the different stress treatments was identified and termed the environmental stress response (ESR).

We suggest that it might be a better idea to focus on the variation shared by the different data sources, instead of trying to characterize the similarity based on all variation. Treatment-specific effects would be specific stress responses, and if the task is to find a general response, its fingerprint is in the shared variation. Thus the analysis of environmental stress response should start with a preprocessing step like the one suggested here. We demonstrate how the results of such an approach differ from those obtained in the original study.

We applied a KNN classifier in the combined data space to classify the genes into the three categories labeled in the original study; the resulting accuracies are shown in the figure below.

KNN classification for stress data

**KNN classification for stress data**. The classification accuracy obtained using the combined representation as a function of dimensionality. The CCA-based combination (solid line) is consistently worse than the PCA-based approach (dashed line), implying that the class labels might not correlate that well with the true shared response. As a baseline, the classification accuracy obtained by the concatenation of all original data sets (dotted line) is also included.

This result hints that the definitions created after CCA-based preprocessing would be mostly the same as the ones given in the original study.

Conclusion

We studied the problem of data fusion for exploratory data analysis in a setting where the sensible fusion criterion is to look for statistical dependencies between data sets of co-occurring measurements. We showed how a simple summation of the results of a classical method of canonical correlation analysis gives a representation that captures the dependencies, leading to an efficient and robust linear method for the fusion task. It does not solve the data integration task in general, but it shows that the criterion in the data fusion task should not necessarily be to keep all the possible information present in the data collection. Instead, we may want to focus on the aspects shared by different views. We showed how that can be achieved with simple and easily applicable methods.

We demonstrated the validity of the method on three different real gene expression data sets using technical criteria. We further presented three examples on how the method could be used as the preprocessing step in different kinds of analysis tasks.

Methods

Data

Leukemia data

We used gene expression data from a previously published study of pediatric ALL.

We used RMA (Robust Multi-array Analysis) to preprocess the data, and subtracted the mean of the hyperdiploid samples. In total we analyzed 22,283 genes for 31 patients, divided into 5 data sets.

Cell-cycle data

We used a previously published yeast cell-cycle gene expression data collection.

We preprocessed the data by imputing missing values with the K-nearest neighbor method.

Yeast stress data

We used previously published yeast gene expression data measured under various stress conditions.

We normalized all time series with their respective zero time points, and imputed missing values by gene-wise averages within each data set. After combining the genes from both sources we obtained 5,998 genes, out of which 868 were identified as ESR genes in the original stress-response study.

Validation measures

The method aims to keep all the variance that is shared among the data sets, while ignoring the variation that is specific to only one of them. In this section we introduce measures of how well this is achieved in real applications. Since there is no straightforward way of quantifying the degree of dependency between several high-dimensional data sources (correlation is only defined for two variables, and estimating multivariate generalizations of mutual information is difficult), we used two partly heuristic variance-based criteria as comparison measures.

Both measures are based on examining reconstructions of the original data sets. If an integrated representation **P**_{d }of full dimensionality is used then it is naturally possible to create a perfect reconstruction, but lower dimensionality introduces errors. We want to measure to what degree the preserved information was shared and to what degree specific to individual data sets.

The reconstruction of the i-th data set is obtained by projecting the fused representation back to the original data space, inverting both the projection and the whitening **W**_{i}. Here **V**_{i,d} denotes the rows of **V**_{d} that correspond to the i-th data set, and the reconstruction is defined as **X̂**_{i} = **P**_{d}**V**_{i,d}^{T}**W**_{i}^{-1}.

The first criterion measures the data-specific variation retained after the dimensionality reduction to d dimensions: it is the sum, over the data sets, of the total variances of the reconstructions. Each term in the sum is simply the variance of a single reconstruction, and the sum matches the total variation in the collection of data sets. The measure is further normalized so that the value for the full dimensionality is one.

For the shared variation we measure the pairwise variation between all pairs of data sets. The measure uses the same reconstructed data sets and sums, over all pairs, the variation that the two reconstructions share; it is again normalized so that the full dimensionality gives the value one. It is worth noticing that the sum of pairwise variations is not a perfect measure of the shared variation for collections with more than two data sets, but it is computationally simple and intuitive.
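As an illustration of the reconstruction-based view, the data-specific (retained-variation) criterion described above could be instantiated as follows. This is only one plausible reading of the measure, not the exact formula used in the experiments; the drCCA package provides its own implementation.

```r
# One plausible instantiation of the reconstruction-based retained-variation
# measure (not necessarily the exact formula used in the experiments).
retained_variation <- function(datasets, d) {
  Xs <- lapply(datasets, scale, center = TRUE, scale = FALSE)
  Ws <- lapply(Xs, function(X) {                       # whitening matrices C_ii^(-1/2)
    e <- eigen(cov(X), symmetric = TRUE)
    e$vectors %*% diag(1 / sqrt(e$values), length(e$values)) %*% t(e$vectors)
  })
  Z <- do.call(cbind, Map(`%*%`, Xs, Ws))
  V <- eigen(cov(Z), symmetric = TRUE)$vectors[, 1:d, drop = FALSE]
  P <- Z %*% V                                         # fused representation P_d
  idx   <- split(seq_len(ncol(Z)), rep(seq_along(Xs), sapply(Xs, ncol)))
  recon <- lapply(seq_along(Xs), function(i)           # X_hat_i = P_d V_{i,d}' W_i^{-1}
    P %*% t(V[idx[[i]], , drop = FALSE]) %*% solve(Ws[[i]]))
  sum(sapply(recon, function(R) sum(apply(R, 2, var)))) /
    sum(sapply(Xs, function(X) sum(apply(X, 2, var)))) # equals 1 at full dimensionality
}
```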

Availability and requirements

Project name: drCCA;

Project home page:

Operating system(s): Platform independent;

Programming language: R

License: GNU LGPL;

Any restrictions to use by non-academics: Read GNU LGPL conditions

Authors' contributions

AT implemented the software and carried out the experiments. All authors participated in developing the algorithm, designing of the experiments, and writing of the manuscript. All authors read and approved the final manuscript.

Acknowledgements

The authors are with the Adaptive Informatics Research Centre. This work was supported in part by the Academy of Finland, decision number 207467, in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778, and in part by a grant from the University of Helsinki's Research Funds. This publication only reflects the authors' views.