School of Computer Science, University of Manchester, Oxford Road, Manchester, M13 9PL, UK

Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, OX3 7BN, UK

College of Information Science and Technology, Nanjing University of Aeronautics and Astronautics, 29 Yudao Street, Nanjing 210016, PR China

Department of Computer Science, University of Sheffield, Regent Court 211 Portobello Street, Sheffield, S1 4DP, UK

ChELSI Institute, Department of Chemical and Process Engineering, University of Sheffield, Mappin Street, Sheffield, S1 3JD, UK

NIHR Cardiovascular Biomedical Research Unit, Sheffield Teaching Hospitals NHS Trust, Beech Hill Road, Sheffield, S10 2RX, UK

Abstract

Background

Most analyses of microarray data are based on point estimates of expression levels and ignore the uncertainty of such estimates. By determining uncertainties from Affymetrix GeneChip data and propagating these uncertainties to downstream analyses, it has been shown that the results of differential expression detection, principal component analysis and clustering can be improved. Previously, implementations of these uncertainty propagation methods have only been available as separate packages, written in different languages. Previous implementations have also been computationally expensive and, in the case of differential expression detection, limited in the experimental designs to which they can be applied.

Results

Conclusion

For the first time, the

Background

The analysis of microarray experiments typically involves a number of stages. The first stage for analysis of Affymetrix GeneChip arrays is usually the application of a summarisation method such as MAS5.0 or RMA in order to obtain an expression level for each probeset on each array. Subsequent analyses then use these expression levels, for example to determine differentially expressed (DE) genes, or to find clusters of genes and/or conditions. Although there are a number of summarisation methods which can give accurate point estimates of expression levels, few can also provide any information about uncertainty in expression levels (such as standard errors). Even for methods that can provide uncertainty information, this is rarely used in subsequent analyses due to the lack of available methods for dealing with such information. A large amount of potentially valuable information is therefore lost. Recently, there has been a growing trend for disregarding the probe-to-probeset annotation provided by the array manufacturer in favour of so-called "remapped" data (e.g.

The multi-mgMOS method

While many microarray studies are concerned with identifying genes that are differentially expressed between two levels of a single factor, for example between cancer and non-cancer patients, microarrays are also increasingly being used in more complex experimental designs where more than one factor is varied. This is often achieved with a factorial-designed experiment, where each combination of the levels of each factor is tested. As well as enabling a researcher to identify the effects of multiple factors in a single experiment, a factorial design also enables the study of the effect of interactions between different factors. The PPLR method is not directly applicable to such experiments.

Perhaps the most popular Bioconductor package for analysis of differential expression is

Introduction to puma algorithms

multi-mgMOS and probe-level measurement error

Affymetrix GeneChips use multiple probe-pairs, collectively called a probe-set, to interrogate gene expression profiles. Each probe-pair contains a perfect match (PM) probe and a mismatch (MM) probe. The PM probe is designed to measure the specific hybridisation of the target and the MM probe measures the non-specific hybridisation associated with its corresponding PM probe. However, microarray experimental data show that in practice the MM probe also measures the specific hybridisation signal, and the intensities of both PM and MM probes vary in probe-specific ways. This makes the identification of the true signal difficult. The probabilistic model multi-mgMOS assumes that the PM and MM intensities follow gamma distributions. Let $y_{ijc}$ and $m_{ijc}$ represent the $j$th PM and MM intensities respectively for the $i$th probe-set on the $c$th chip. The model is defined by

$$y_{ijc} \sim \mathrm{Ga}(a_{ic} + \alpha_{ic},\, b_{ij}), \qquad m_{ijc} \sim \mathrm{Ga}(a_{ic} + \phi\,\alpha_{ic},\, b_{ij}), \qquad b_{ij} \sim \mathrm{Ga}(c_i,\, d_i),$$

where Ga represents the gamma distribution. The parameter $a_{ic}$ accounts for the background and non-specific hybridisation associated with the probe-set and $\alpha_{ic}$ accounts for the specific hybridisation measured by the probe-set. The parameter $\phi$ is the fraction of the specific hybridisation signal picked up by the MM probe, and $b_{ij}$ is a latent variable which models probe-specific effects.
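To make the roles of these parameters concrete, the following minimal sketch draws PM and MM intensities from a gamma model of this kind. It assumes a parameterisation in which the PM shape is a + alpha, the MM shape is a + phi * alpha, and both share a probe-specific rate b drawn from Ga(c, d). The function name `simulate_probeset` and all numeric values are illustrative assumptions; none of this is taken from the puma package.

```python
import random

def simulate_probeset(a, alpha, phi, c, d, n_probes, seed=0):
    """Draw (PM, MM) intensity pairs for one probe-set on one chip.

    Assumed model (illustrative only):
      b_j  ~ Ga(shape=c, rate=d)                  probe-specific effect
      PM_j ~ Ga(shape=a + alpha, rate=b_j)
      MM_j ~ Ga(shape=a + phi * alpha, rate=b_j)
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_probes):
        # random.gammavariate takes (shape, scale), and scale = 1 / rate.
        b = rng.gammavariate(c, 1.0 / d)
        pm = rng.gammavariate(a + alpha, 1.0 / b)
        mm = rng.gammavariate(a + phi * alpha, 1.0 / b)
        pairs.append((pm, mm))
    return pairs

# Hypothetical parameter values chosen only to give plausible magnitudes.
pairs = simulate_probeset(a=2.0, alpha=6.0, phi=0.2, c=3.0, d=0.01, n_probes=11)
for pm, mm in pairs[:3]:
    print(f"PM={pm:.1f}  MM={mm:.1f}")
```

Because each probe-pair shares one draw of b, a probe with a small b yields large PM and MM intensities together, which is the probe-specific variation the latent variable is meant to capture.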

The Maximum a Posteriori (MAP) solution of this model can be found by efficient numerical optimisation. The posterior distribution of the logged gene expression level can then be estimated from the model and approximated by a Gaussian distribution with a mean, $\hat{\mu}_{ic}$, and a variance, $\sigma^2_{ic}$. The mean of this distribution is taken as the estimated gene expression for gene $i$ on chip $c$, and the variance as the measurement error associated with this estimate.

Including measurement uncertainty in finding DE genes

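The PPLR method combines the probe-level measurement error with the observed variability between replicates. For a given gene, the estimated log expression level $\hat{y}_{ij}$ for the $i$th replicate under the $j$th condition is modelled as

$$\hat{y}_{ij} \sim \mathcal{N}(\mu_j,\, \lambda_j^{-1} + v_{ij}),$$

where $\mu_j$ is the mean logged expression level under condition $j$, $\lambda_j$ is the inverse of the between-replicates variance and $v_{ij}$ is the probe-level measurement error, which can be calculated from probabilistic probe-level analysis methods such as multi-mgMOS.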
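The effect of adding measurement error to the between-replicate variance can be illustrated with a deliberately simplified sketch. It assumes each replicate has total variance 1/lambda + v, treats the between-replicate precision lambda as known, and places a flat prior on the condition mean, so that the posterior of the mean is Gaussian. This is not the inference scheme used by PPLR itself; the function names and data values are hypothetical.

```python
import math

def posterior_mean(y, v, lam):
    """Gaussian posterior of a condition mean under a flat prior.

    y   : replicate log-expression estimates
    v   : measurement variances of those estimates
    lam : between-replicate precision, treated as known (a simplification)
    Returns (posterior mean, posterior variance).
    """
    # Each replicate is weighted by the inverse of its total variance.
    weights = [1.0 / (1.0 / lam + v_i) for v_i in v]
    total = sum(weights)
    mean = sum(w * y_i for w, y_i in zip(weights, y)) / total
    return mean, 1.0 / total

def pplr(cond_a, cond_b, lam=4.0):
    """Probability that mean(cond_b) - mean(cond_a) > 0.

    Each condition is a (y, v) tuple as accepted by posterior_mean.
    """
    m_a, s2_a = posterior_mean(*cond_a, lam)
    m_b, s2_b = posterior_mean(*cond_b, lam)
    z = (m_b - m_a) / math.sqrt(s2_a + s2_b)
    # Standard normal CDF evaluated at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Same point estimates, different measurement errors (illustrative values).
control = ([7.9, 8.1, 8.0], [0.05, 0.04, 0.06])
treated_precise = ([9.0, 9.2, 9.1], [0.05, 0.04, 0.06])
treated_noisy = ([9.0, 9.2, 9.1], [2.0, 2.5, 1.8])

print(pplr(control, treated_precise))
print(pplr(control, treated_noisy))
```

The two treated groups have identical point estimates, yet the one with larger measurement error yields a probability closer to 0.5. A method based on point estimates alone would score both comparisons identically.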

PPLR assumes that the parameters $\theta = \{\{\mu_j\}, \{\lambda_j\}\}$ are independent and

where _{0}, _{0},

Including measurement uncertainty in principal components analysis

We write the measurement error, _{i}, as a vector capturing the main technical sources of variance of the measured expression level on each chip _{i}, as an additional term in the observation noise of this model,

Unlike standard PCA, there is no longer a closed-form maximum likelihood solution, and an iterative EM algorithm is used for parameter estimation.

Including measurement uncertainty in mixture clustering

Similarly to NPPCA, PUMA-CLUST _{i }is the true expression level for data point _{i}|_{k}) = _{i}|_{k}, Σ_{k}). For the measured expression level

where diag(_{i}) represents the diagonal matrix whose diagonal entries starting in the upper left corner are the elements of _{i}.
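The consequence of adding per-point measurement variance to each component's covariance can be seen in a one-dimensional sketch of the cluster-membership step. It assumes fixed mixture parameters and a scalar measurement variance `v` per data point, and it computes only membership probabilities. It is not the PUMA-CLUST algorithm, and all names and values are illustrative.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def responsibilities(x, v, weights, means, variances):
    """Membership probabilities for one measured value.

    x         : measured expression level
    v         : measurement variance of x, added to each component variance
    weights   : mixing proportions of the components
    means     : component means
    variances : component variances
    """
    dens = [w * normal_pdf(x, m, s2 + v)
            for w, m, s2 in zip(weights, means, variances)]
    total = sum(dens)
    return [d / total for d in dens]

# Two hypothetical clusters; the same measured value with low and high uncertainty.
mix = dict(weights=[0.5, 0.5], means=[0.0, 4.0], variances=[1.0, 1.0])
r_precise = responsibilities(1.5, 0.01, **mix)
r_noisy = responsibilities(1.5, 10.0, **mix)
print(r_precise)
print(r_noisy)
```

A precisely measured point is assigned firmly to its nearest cluster, whereas the same value with large measurement variance is shared more evenly between clusters and so has less influence on any one of them.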

This version of PUMA-CLUST treats each chip as an individual condition. For replicated data we have developed an improved method which propagates measurement error to a robust Student's t mixture model. Once published, this method will be incorporated into the puma package.

Contributions

The

• pumaDE – an extension of the PPLR method to the multi-factorial case.

• The automated creation of design and contrast matrices for typical experimental designs.

• pumaComb – an implementation of the method of combining information from replicates.

• pumaPCA – an R implementation of NPPCA, with much improved execution speed over the previous MATLAB version.

• Bringing together for the first time in a single package a suite of algorithms for propagating uncertainty in microarray analysis, together with tools for plotting, data manipulation, and comparison to other methods.

• Demonstration of uncertainty propagation methods on "remapped" Affymetrix GeneChip data.

Implementation

Multi-factorial extension of PPLR

The calculation of PPLR between two conditions is given in equation (15) of

where _{ij }corresponds to the mean expression when the two factors take values

Under the variational approximation developed in

Automated creation of design and contrast matrices

The

• All pairwise comparisons within each factor.

• Comparisons of one level vs all other levels for factors with three or more levels.

• All main effects of factors.

• All interaction terms (up to three way) between factors.
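For a design with two factors of two levels each, the contrast types listed above can be enumerated mechanically. The sketch below is a minimal illustration, assuming conditions are ordered as the Cartesian product of the factor levels and using one common sign and scaling convention. The function name, the contrast names and the factor labels are hypothetical and do not reproduce the package's own output. It does not cover the one-versus-all comparisons, which apply only to factors with three or more levels.

```python
from itertools import product

def factorial_contrasts(factor_a, factor_b):
    """Enumerate contrast vectors for a two-factor, two-level design.

    Returns (conditions, contrasts), where conditions is the ordered list
    of level combinations and contrasts maps a name to its coefficients.
    """
    conditions = list(product(factor_a, factor_b))
    index = {cond: i for i, cond in enumerate(conditions)}

    def vector(terms):
        coeffs = [0.0] * len(conditions)
        for cond, coef in terms:
            coeffs[index[cond]] += coef
        return coeffs

    (a0, a1), (b0, b1) = factor_a, factor_b
    contrasts = {}
    # Simple comparisons: vary one factor while the other is held fixed.
    for b in factor_b:
        contrasts[f"{a1}_vs_{a0}_at_{b}"] = vector([((a1, b), 1), ((a0, b), -1)])
    for a in factor_a:
        contrasts[f"{b1}_vs_{b0}_at_{a}"] = vector([((a, b1), 1), ((a, b0), -1)])
    # Main effects: average the simple comparisons over the other factor.
    contrasts[f"{a1}_vs_{a0}"] = vector(
        [((a1, b), 0.5) for b in factor_b] + [((a0, b), -0.5) for b in factor_b])
    contrasts[f"{b1}_vs_{b0}"] = vector(
        [((a, b1), 0.5) for a in factor_a] + [((a, b0), -0.5) for a in factor_a])
    # Interaction: the difference of differences.
    contrasts["interaction"] = vector(
        [((a1, b1), 1), ((a0, b1), -1), ((a1, b0), -1), ((a0, b0), 1)])
    return conditions, contrasts

conditions, contrasts = factorial_contrasts(("absent", "present"), ("10h", "48h"))
print(conditions)
for name, coeffs in contrasts.items():
    print(f"{name:28s} {coeffs}")
```

For a 2 × 2 design this yields four simple comparisons, two main effects and one interaction term, and every contrast vector sums to zero.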

Parallelisation

We have parallelised the most time-consuming step of a typical

Using puma

multi-mgMOS

We have implemented a separate Bioconductor experimental data package

Results and discussion

Accounting for Uncertainty in Probeset Summarisation

The first step in a typical analysis is to load in data from Affymetrix CEL files, using the

The recommended summarisation method to use within

Propagating Uncertainty in Principal Component Analysis

A useful first step in any microarray analysis is to look for gross differences between arrays. This can give an early indication of whether arrays are grouping according to the different factors being tested. This can also help to identify outlying arrays, which might indicate problems, and might lead an analyst to remove some arrays from further analysis. Principal components analysis (PCA) is often used for determining such gross differences.

**Comparison of pumaPCA and standard PCA**. First two components after applying

Identifying differentially expressed genes

There are many different methods available for identifying differentially expressed (DE) genes.

Note that running the

**Parallelisation speed-up**. Execution times for a typical run of the

Because this is a 2 × 2 factorial experiment, there are a number of contrasts that could potentially be of interest.

Here we can see that there are seven contrasts of potential interest. The first four are simple comparisons of two conditions. The next two are comparisons between the two levels of one of the factors, averaged over the other factor. These are often referred to as "main effects". The final contrast is known as an "interaction effect". In simpler cases, where there are just two conditions,

Suppose we are particularly interested in the interaction term. We saw above that this was the seventh contrast identified by

The gene shown in Figure

**Example of an apparently DE gene identified using RMA/limma**. RMA expression levels for the gene determined by RMA/limma to be most likely to be differentially expressed due to the interaction term in the estrogen data set.

**Casting doubt on the example gene identified as DE using RMA/limma**. multi-mgMOS expression levels for the gene determined by RMA/limma to be most likely to be differentially expressed due to the interaction term in the estrogen data set. Note that multi-mgMOS provides error bars as well as point estimates for the expression levels.

The following code determines and plots the gene most likely to be differentially expressed due to the interaction term using multi-mgMOS and pumaDE. This analysis was not possible using previous implementations of multi-mgMOS and PPLR, as the PPLR method was only able to determine differential expression between two levels of a single condition.

Figure

**Example showing benefits of using multi-mgMOS/PPLR for differential expression detection**. Expression levels and error bars (as calculated by multi-mgMOS) for the gene determined most likely to be differentially expressed due to the interaction term in the estrogen data set by mmgmos/pumaDE.

Clustering with pumaClust

The following code will identify seven clusters from the output of

The result is a list whose components include the cluster to which each probeset is assigned and the cluster centers. The following code will identify the number of probesets in each cluster and the cluster centers, and will write out a CSV file with probeset-to-cluster mappings:

Examples of improved performance on real and simulated data sets of PUMA-CLUST when compared with a standard Gaussian mixture model (MCLUST) are given in

Analysis using remapped CDFs

There is increasing awareness that the original probe-to-probeset mappings provided by Affymetrix are unreliable for various reasons. Several groups have developed alternative probe-to-probeset mappings, or "remapped CDFs", and many of these are available either as Bioconductor annotation packages or as easily downloadable CDF packages. One issue with remapped CDFs is that many probesets in the remapped data have very few probes. This makes reliable estimation of the expression level of such probesets even more problematic than with the original mappings. We therefore believe that even greater attention should be given to the uncertainty in expression level measurements when using remapped CDFs than when using the original mappings. In the

Application beyond Affymetrix microarray data

Although the methods within

Conclusion

The

Availability and requirements

• Project name: puma

• Project homepage:

• Operating systems: Platform independent

• Programming language: R, C

• Other requirements: R

• License: LGPL except puma uses donlp

1. donlp2 is under the exclusive copyright of P. Spellucci (e-mail:

2. donlp2 and its constituent parts come with no warranty, whether expressed or implied, that it is free of errors or suitable for any specific purpose. It must not be used to solve any problem, whose incorrect solution could result in injury to a person, institution or property. It is at the users own risk to use donlp2 or parts of it and the author disclaims all liability for such use.

3. donlp2 is distributed "as is". In particular, no maintenance, support or trouble-shooting or subsequent upgrade is implied.

4. The use of donlp2 must be acknowledged, in any publication which contains results obtained with it or parts of it. Citation of the authors name and netlib-source is suitable.

5. The free use of donlp2 and parts of it is restricted for research purposes. Commercial uses require permission and licensing from P. Spellucci.

List of abbreviations

mgMOS: modified gamma Model Of Signal; multi-mgMOS: multi-chip modified gamma Model Of Signal; NPPCA: noise-propagation in principal component analysis; PPLR: probability of positive log ratio; DE: differentially expressed.

Authors' contributions

RDP extended the PPLR method to factorial experiments, developed the puma package from earlier code, maintains puma, and devised and wrote the manuscript. XL originally developed the mmgMOS and PPLR methods. GS partly developed the original MATLAB code for NPPCA. MM developed the code for mgMOS. NDL partly developed the original MATLAB code for NPPCA and partly initiated the puma project. MR partly initiated the puma project and supervised the development of the puma package.

Acknowledgements

We thank Leo Zeef and Nick Gresham for work on the parallelisation of pumaComb. RDP was supported by a NERC "Environmental Genomics/EPSRC" studentship. XL acknowledges support from NSFC (60703016) and JiangsuSF (BK2007589). MR and NDL acknowledge support from a BBSRC award "Improved processing of microarray data with probabilistic models".