Department for Proteomics and Signal Transduction, Max-Planck Institute of Biochemistry, Am Klopferspitz 18, D-82152 Martinsried, Germany

Abstract

Quantitative proteomics now provides abundance ratios for thousands of proteins upon perturbations. These need to be functionally interpreted and correlated to other types of quantitative genome-wide data such as the corresponding transcriptome changes. We describe a new method, 2D annotation enrichment, which compares quantitative data from any two 'omics' types in the context of categorical annotation of the proteins or genes. Suitable genome-wide categories are membership of proteins in biochemical pathways, their annotation with gene ontology terms, sub-cellular localization, presence of protein domains or membership in protein complexes. 2D annotation enrichment detects annotation terms whose members show consistent behavior in one or both of the data dimensions. This consistent behavior can be a correlation between the two data types, such as simultaneous up- or down-regulation in both data dimensions, or a lack thereof, such as regulation in one dimension but no change in the other. For the statistical formulation of the test we introduce a two-dimensional generalization of the nonparametric two-sample test. The false discovery rate is stringently controlled by correcting for multiple hypothesis testing. We also describe one-dimensional annotation enrichment, which can be applied to single omics data. The 1D and 2D annotation enrichment algorithms are freely available as part of the Perseus software.

Introduction

Mass spectrometry-based proteomics can now deliver highly accurate data on hundreds of thousands of peptide features in a single biological project

The ability to perform side-by-side large-scale quantitative profiling of the proteome, transcriptome or genome raises the question of which classes of gene products show concordant and which show discordant behavior between the different levels of gene expression. For instance, one question in proteomics is how far absolute levels of expression or expression changes correlate between the transcriptome and the proteome. In the hypothetical case of pure transcriptional regulation the correlation between these two levels would be near one, and would be limited only by the technical imperfections of the respective quantitative profiling technologies. Indeed, while early investigations found low or no correlation between proteome and transcriptome

While transcriptional regulation is generally a dominant aspect of the entire expression cascade, there are many known examples of posttranscriptional regulation like micro-RNA controlled inhibition of transcripts

Materials and methods

The protein intensity data used in the explanation of the 1D annotation distribution in the Results section is taken from a label-free proteome study of mouse dendritic cells to a depth of 5,780 proteins

The yeast data used in the sub-section on 2D annotation enrichment is obtained from de Godoy et al.

In all cases peptides were analyzed on a nanoflow HPLC system connected to a hybrid LTQ-Orbitrap or Orbitrap-Velos mass spectrometer (Thermo Fisher Scientific). Human and mouse data were searched against International Protein Index

Results

We start the description of the data analysis workflow at the point where protein abundances or protein expression ratios have already been calculated. While all examples show proteomics data obtained with the MaxQuant software

When reporting quantitative data for proteins from shotgun proteomics data, care has to be taken in the counting of independent protein identifications. The measured peptides may in some cases not be unique and map to several proteins, which is called the protein inference problem

Matching proteins to other high-throughput data

When using MaxQuant, if several quantitative experiments are combined or replicates were made these will all be projected onto the same protein grouping over all 'quantitative columns'. Therefore, the proteome data will be in a convenient matrix form already, even for very complex experimental designs. This will not be the case when one wants to compare the proteome data with transcriptome data, for instance. Several probe sets of an Affymetrix chip measure the same gene and there may be several genes belonging to the same protein group. For the matching we take a protein centric view. For each protein in the protein group we determine all probe sets that are annotated in the chip annotation file with a UniProt identifier. It is not trivial to decide which UniProt identifiers to use for a group of proteins that are indistinguishable by the measured peptides. A protein group consists of proteins from the list of protein sequences submitted to the search engine that cannot be quantified independently based on the set of identified peptides. In particular, if two proteins have identical sets of identified peptides they will be grouped together. Also if the set of identified peptides of one protein is completely contained in the set of identified peptides of another protein, these two proteins will be combined in a protein group as well. Proteins within a protein group are sorted by the number of identified peptides in descending order. For the remaining ambiguities we use the razor peptide or parsimony concept, which means that the peptide is assigned according to Occam's razor principle to the protein group that most plausibly explains its existence, which is the one which already has the most peptide identifications assigned to it.
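The grouping rules described above can be illustrated with a minimal Python sketch. It is a simplified, greedy illustration and not the MaxQuant implementation; the function names `group_proteins` and `assign_razor` and the dictionary layout are hypothetical, and ties between equally large groups are broken arbitrarily in favor of the group created first.

```python
def group_proteins(peptides_by_protein):
    """Merge proteins whose identified peptide set equals, or is contained in,
    the peptide set of an already formed group (greedy illustration)."""
    # Visit proteins with the most identified peptides first, so that the
    # members of every group end up sorted by peptide count, descending.
    order = sorted(peptides_by_protein,
                   key=lambda p: (-len(peptides_by_protein[p]), p))
    groups = []  # each group: {"proteins": [...], "peptides": set(...)}
    for protein in order:
        peptides = set(peptides_by_protein[protein])
        for group in groups:
            if peptides <= group["peptides"]:  # identical set or subset
                group["proteins"].append(protein)
                break
        else:
            groups.append({"proteins": [protein], "peptides": peptides})
    return groups


def assign_razor(groups):
    """Assign every peptide to the group with the most identified peptides."""
    owner = {}
    for index, group in enumerate(groups):
        for peptide in group["peptides"]:
            best = owner.get(peptide)
            if best is None or len(group["peptides"]) > len(groups[best]["peptides"]):
                owner[peptide] = index
    return owner


example = {
    "P1": {"a", "b", "c"},
    "P2": {"a", "b"},
    "P3": {"c", "d"},
    "P4": {"a", "b", "c"},
}
groups = group_proteins(example)
razor = assign_razor(groups)
```

In this toy input, P2 and P4 are merged into the group of P1, while P3 forms its own group; the shared peptide `c` is assigned to the larger group.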

The number of probe sets matched in this way to a protein group can vary from zero to several. If none matches, no comparison can be made for this protein. If exactly one matches, the quantitative information for this probe set is used. If several probe sets match, the point-wise median of their quantitative profiles is taken. Expression data from other microarray types can be matched in a similar way as long as the vendor provides UniProt or other protein identifiers for the hybridization probes. Deep RNA sequencing data is also easily matched, for instance in the form of RPKM values
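The matching rule for probe sets can be sketched as follows. This is a minimal illustration under assumed data structures: the function name `match_probe_sets` and the keys `uniprot_ids` and `profile` are hypothetical, and the profiles are assumed to be log-transformed already.

```python
from statistics import median


def match_probe_sets(protein_group_ids, probe_sets):
    """Return the point-wise median profile of all probe sets annotated with
    any UniProt identifier of the protein group, or None if none matches."""
    ids = set(protein_group_ids)
    profiles = [ps["profile"] for ps in probe_sets
                if ids & set(ps["uniprot_ids"])]
    if not profiles:
        return None  # no comparison possible for this protein group
    # zip(*profiles) walks through the samples, one column at a time
    return [median(column) for column in zip(*profiles)]


probe_sets = [
    {"uniprot_ids": ["P1"], "profile": [1.0, 2.0, 3.0]},
    {"uniprot_ids": ["P2", "P9"], "profile": [3.0, 2.0, 1.0]},
    {"uniprot_ids": ["P7"], "profile": [9.0, 9.0, 9.0]},
]
matched = match_probe_sets(["P1", "P2"], probe_sets)
```

With a single matching probe set the median reduces to the profile of that probe set, so the three cases described in the text are covered by one function.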

Note that for the quantitative analysis of expression data (irrespective of which kind) it is usually advisable to take the logarithm before proceeding with further steps. This is true for ratios as well as for abundance data. Before averaging expression profiles this is also advisable, even if the median is taken. This is because for the median an averaging can take place between the two central numbers in case there is an even number of values. The need for taking logarithms becomes immediately apparent in the case of ratios. One would expect that the average of a two-fold up-regulation and a two-fold down-regulation should be no regulation. This is however not the case if the ratios themselves are averaged: (2 + 1/2)/2 = 1.25 ≠ 1. If logarithms (e.g. to the base two) are averaged, the desired result is obtained: (log_{2}(2) + log_{2}(1/2))/2 = (1 - 1)/2 = 0. The base of the logarithm does not matter in principle since it can be absorbed in an overall factor multiplied to all the data. However, it is customary to use base two for ratio data and base two or ten for abundance data.
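The arithmetic of this example can be checked with a few lines of Python. The function names are chosen for illustration only.

```python
import math


def mean_of_ratios(ratios):
    """Arithmetic mean taken directly on the ratios."""
    return sum(ratios) / len(ratios)


def mean_of_log2_ratios(ratios):
    """Arithmetic mean taken on the base-two logarithms of the ratios."""
    return sum(math.log2(r) for r in ratios) / len(ratios)


ratios = [2.0, 0.5]  # two-fold up- and two-fold down-regulation
linear_average = mean_of_ratios(ratios)    # 1.25, suggests a net up-regulation
log_average = mean_of_log2_ratios(ratios)  # 0.0, i.e. no regulation
back_transformed = 2 ** log_average        # 1.0 on the ratio scale
```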

Protein annotations

We base the annotation of proteins on UniProt identifiers

One major source of annotation is the gene ontology

1D annotation enrichment

The 2D annotation enrichment algorithm works equally well for 1D data, such as any quantitative proteomics experiment. We first describe the principle of the 1D distribution analysis, which also serves as a preparation for the 2D algorithm. The input is a single column of numerical values assigning one numerical value to every protein. These values are typically protein ratios or absolute protein abundances. They could also be derived quantities, like average fold-changes between replicate groups, or p-values or test statistics resulting from a test for significant changes in protein expression. If the column has missing values, the respective proteins are ignored in the further analysis.

We wish to test for every annotation term (such as every protein complex or pathway) whether the corresponding numerical values have a preference to be systematically larger or smaller than the global distribution of the values for all proteins. In the schematic example displayed in Figure

**Histogram of log protein intensities for all mouse proteins quantified in dendritic cells in Luber et al (blue)**. The green histogram indicates the ribosomal proteins within this distribution. They are significantly enriched at large values. Heights of the green bars were multiplied by five for better visibility.

To be independent of the shape of the distribution we apply a non-parametric test which in particular does not assume a normal distribution of the numerical values. These properties single out the two-sided (two-sample) Wilcoxon-Mann-Whitney test as the method of choice, which we apply to all protein categorizations in a given set of terms.

The Wilcoxon-Mann-Whitney test assesses whether one of two observation groups tends to have larger values than the other. The method works entirely with the ranking of the values and checks in our case if the proteins of interest tend to be ranked higher (or lower) as a group relative to the ranking of all proteins. The test statistic is

$$U = R_1 - \frac{n_1 (n_1 + 1)}{2}$$

where $n_1$ is the size of group 1 and $R_1$ is the sum of ranks in group 1.
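To make the rank-based test concrete, the following Python sketch computes the standard Mann-Whitney statistic U = R_1 - n_1(n_1 + 1)/2 for the members of one annotation term against all remaining proteins. It is an illustration under stated assumptions and not the Perseus code: the function names are hypothetical, the two-sided p-value is taken from the normal approximation, and no tie correction is applied to the variance.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def mann_whitney_u(values, in_term):
    """Return U for the term members and an approximate two-sided p-value."""
    ranks = average_ranks(values)
    n1 = sum(1 for flag in in_term if flag)
    n2 = len(values) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must contain at least one protein")
    r1 = sum(rank for rank, flag in zip(ranks, in_term) if flag)
    u = r1 - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return u, p_value


values = [float(i) for i in range(1, 21)]
in_term = [i >= 15 for i in range(20)]  # the five largest values
u, p_value = mann_whitney_u(values, in_term)
```

For small groups an exact null distribution would be preferable to the normal approximation used here.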

Technically, the Mann-Whitney test assumes independence of the values, which is a good approximation in our case, in particular since every peptide is used in only one protein group for quantification. If non-unique peptides were used in several protein groups, the independence assumption would not hold.

The number of terms and therefore also the number of hypotheses tested simultaneously can be quite large. For instance, there are 9,732 different terms among the GO molecular functions. This makes it important to adjust for multiple hypothesis testing. We apply the Benjamini-Hochberg method

The score s is defined as

$$s = \frac{2 (\bar{R}_1 - \bar{R}_2)}{n}$$

where $\bar{R}_1$ and $\bar{R}_2$ are the average ranks within the group under consideration and its complement (all remaining proteins in the experiment), respectively, and n is the total number of data points. It is a number between -1 and 1. A value near 1 indicates that the protein category is strongly concentrated at the high end of the numerical distribution, while a value near -1 means that the values are all at the low end of the distribution. For significant terms it is not possible that s reaches zero exactly, but especially for larger categories that show a slight but consistent trend it is possible to have small absolute values of s. A moderately positive value of s for a category with many members, for instance, indicates that there is a significant collective shift towards larger values for this category, which however is small in absolute terms and possibly not noticeable when looking at individual proteins. Note that the method's calculations are entirely based on information within the measured proteome. Often, enrichment calculations in proteomics against the whole genome are problematic. By construction these problems are completely circumvented here.
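The score can be sketched in Python as below, assuming it takes the form s = 2(R_1 - R_2)/n with R_1 and R_2 the average ranks of the term members and of the complement. This normalization is an assumption that is consistent with the stated range of -1 to 1; the function names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def s_score(values, in_term):
    """Difference of average ranks (term minus complement), scaled to [-1, 1]."""
    ranks = average_ranks(values)
    inside = [r for r, flag in zip(ranks, in_term) if flag]
    outside = [r for r, flag in zip(ranks, in_term) if not flag]
    if not inside or not outside:
        raise ValueError("term and complement must both be non-empty")
    mean_inside = sum(inside) / len(inside)
    mean_outside = sum(outside) / len(outside)
    return 2 * (mean_inside - mean_outside) / len(values)


values = [float(i) for i in range(1, 11)]
top_three = [i >= 7 for i in range(10)]
score = s_score(values, top_three)  # term occupies the three highest ranks
```

A term that occupies exactly the highest ranks reaches the upper bound, and a term whose members are spread evenly over the ranking scores close to zero.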

In the ribosome example of Figure ^{-37} and the s-value is 0.85, indicating that ribosomal proteins are strongly enriched among the most abundant proteins.

When applied to ratios of protein abundances the method described here is similar to the quantile-based enrichment calculations introduced by Pan et al.

2D annotation enrichment

For the analysis of quantitative protein expression values together with other high throughput data we would like to generalize the method described above to the joint distribution of two numerical quantities. To be specific in the further discussion we will assume that the other high throughput data to be analyzed together with proteomic data is constituted by mRNA expression levels. One may for instance be interested in the enrichments of annotations in the plane spanned by protein abundances and mRNA abundances. Similarly one may wish to plot protein abundance ratios (e.g. from isotopic labeling experiments) against mRNA abundance ratios between the same samples. Figure

**Yeast protein ratios vs. mRNA ratios between the haploid and diploid populations from de Godoy et al**

Also for the two-dimensional case, we want to avoid the normality assumption and therefore wish to use a non-parametric testing strategy. What is needed for the generalization to two numerical dimensions is a replacement of the Wilcoxon-Mann-Whitney test that works with two-dimensional input data. The rest of the strategy can then be taken over from the one-dimensional case. The concept of rank sums that is used in the definition of the test statistic for the Wilcoxon-Mann-Whitney test at first appears to be tied to the one-dimensional case, since only in the one-dimensional case is it possible to define an order of the data points in a meaningful way. For points in a two-dimensional plane, in contrast, a natural order relationship does not exist. The situation is different for parametric tests, like Student's t-test or analysis of variance (ANOVA), where the generalization to the multivariate case is straightforward and known as multivariate analysis of variance (MANOVA) (see e.g. reference

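The test statistic of the MANOVA test for two groups in two dimensions is given here for reference. It is proportional to

$$\frac{\Delta_x^2 S_{yy} - 2 \Delta_x \Delta_y S_{xy} + \Delta_y^2 S_{xx}}{S_{xx} S_{yy} - S_{xy}^2}$$

where

$$\Delta_x = \bar{x}_1 - \bar{x}_2, \qquad \Delta_y = \bar{y}_1 - \bar{y}_2$$

are the differences of the group means between group 1 and 2 in the x and y coordinates, respectively,

$$S_{xx} = \sum_{g=1}^{2} \sum_{i=1}^{n_g} (x_{g,i} - \bar{x}_g)^2, \qquad S_{yy} = \sum_{g=1}^{2} \sum_{i=1}^{n_g} (y_{g,i} - \bar{y}_g)^2, \qquad S_{xy} = \sum_{g=1}^{2} \sum_{i=1}^{n_g} (x_{g,i} - \bar{x}_g)(y_{g,i} - \bar{y}_g)$$

are the summed squares of the deviations from the group means for x, y and mixed coordinates,

$$\bar{x}_g = \frac{1}{n_g} \sum_{i=1}^{n_g} x_{g,i}, \qquad \bar{y}_g = \frac{1}{n_g} \sum_{i=1}^{n_g} y_{g,i}, \qquad g = 1, 2$$

are the means of groups 1 and 2 in x and y coordinates,

$$n_1, \; n_2$$

are the sizes of groups 1 and 2 and

$$x_{g,i}, \; y_{g,i}$$

are the ranked values for x and y dimensions, separated into group 1 and 2.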
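A two-group, two-dimensional MANOVA on ranks can be sketched in Python as follows. The sketch computes Hotelling's T² from the quantities listed above and converts it to a p-value with the F distribution, whose tail has a closed form for two dimensions. That the F distribution is used on ranked data is an assumption of this illustration and only an approximation; the function names are hypothetical and this is not the Perseus implementation.

```python
def average_ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_manova_2d(x, y, in_term):
    """Return (T^2, p-value) for term members vs. complement on ranked x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    g1 = [(a, b) for a, b, flag in zip(rx, ry, in_term) if flag]
    g2 = [(a, b) for a, b, flag in zip(rx, ry, in_term) if not flag]
    n1, n2 = len(g1), len(g2)
    n = n1 + n2
    if n1 == 0 or n2 == 0 or n < 4:
        raise ValueError("need two non-empty groups and at least four points")

    def mean(points, k):
        return sum(p[k] for p in points) / len(points)

    m1 = (mean(g1, 0), mean(g1, 1))
    m2 = (mean(g2, 0), mean(g2, 1))
    dx, dy = m1[0] - m2[0], m1[1] - m2[1]  # differences of the group means

    sxx = syy = sxy = 0.0  # summed squared deviations from the group means
    for points, m in ((g1, m1), (g2, m2)):
        for a, b in points:
            sxx += (a - m[0]) ** 2
            syy += (b - m[1]) ** 2
            sxy += (a - m[0]) * (b - m[1])

    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("within-group scatter matrix is singular")
    quad = (dx * dx * syy - 2 * dx * dy * sxy + dy * dy * sxx) / det
    t2 = (n1 * n2 / n) * (n - 2) * quad
    # F = T^2 (n - 3) / (2 (n - 2)) with (2, n - 3) degrees of freedom;
    # its upper tail probability simplifies to the expression below.
    p_value = (1 + t2 / (n - 2)) ** (-(n - 3) / 2)
    return t2, p_value


x = [float(i) for i in range(1, 21)]
y = [float(v) for pair in zip(range(2, 21, 2), range(1, 20, 2)) for v in pair]
in_term = [i >= 14 for i in range(20)]  # six proteins that are high in x and y
t2, p_value = rank_manova_2d(x, y, in_term)
```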

We define the resulting MANOVA test result as the 2D annotation enrichment p-value. The FDR of this approach can be controlled with the Benjamini-Hochberg method in the same way as was done for the one-dimensional case.
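The Benjamini-Hochberg correction applied to the p-values of all tested terms can be sketched as below. The sketch returns adjusted p-values, so that terms with an adjusted value below the chosen FDR threshold are reported; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest and enforce monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


term_p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(term_p_values)
significant = [q <= 0.03 for q in adjusted]  # illustrative FDR threshold of 3%
```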

After determining which annotation terms show a significantly deviating protein/mRNA level distribution, we calculate an s-score in analogy to the one-dimensional case. Now the score is a number pair (s_{x}, s_{y}), the coordinate-wise difference of average ranks used in the one-dimensional case. It is confined to the square -1 ≤ s_{x} ≤ 1 and -1 ≤ s_{y} ≤ 1. The point (s_{x}, s_{y}) = (0,0) corresponds to annotation terms that are not distributed differently from the overall distribution of value pairs. The significance cutoff creates an empty region around the origin. The remaining parts of the square can be subdivided into eight regions corresponding to correlating, non-correlating and anti-correlating regions (see Figure

**Schematic representation of the 2D annotation enrichment score**. The score is a number pair inside the displayed rectangle. Significant terms will avoid a circular region around the origin. The green regions correspond to concordant up or down regulation. The blue regions correspond to terms that are up or down in one direction, but not in the other, while the terms in the red regions show anti-correlating behavior.
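As an illustration of how a score pair could be assigned to one of the eight regions, the following Python sketch labels a pair by its angle around the origin. The exact region boundaries are not specified in the text, so the equal 45-degree sectors used here are purely an assumption, as are the function name and the labels.

```python
import math

# Sector 0 is centred on the positive x axis; sectors proceed counter-clockwise.
REGION_LABELS = [
    "x up only",
    "concordant up",
    "y up only",
    "anti-correlated (x down, y up)",
    "x down only",
    "concordant down",
    "y down only",
    "anti-correlated (x up, y down)",
]


def classify_region(sx, sy):
    """Label a significant score pair by an assumed 45-degree sector."""
    angle = math.degrees(math.atan2(sy, sx)) % 360
    sector = int(((angle + 22.5) % 360) // 45)
    return REGION_LABELS[sector]


label = classify_region(0.5, 0.5)
```

The function is meant for terms that passed the significance cutoff; pairs close to the origin carry no meaningful direction.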

Figure

**2D annotation enrichment based on the yeast protein and mRNA ratios displayed in Figure 2**. 'Pheromone-dependent signal transduction' is located near the diagonal with positive values for both scores. 'Cell wall' has only a small mRNA score but a large protein score.

Another example is shown in Figure

**2D annotation enrichment for Comparative Genomic Hybridization (CGH) ratios (vertical) vs. protein ratios (horizontal) from Geiger et al**

Software implementation

The 2D enrichment analysis is integrated into the Perseus software package which will be described elsewhere. Perseus is freely available and can be downloaded from www.biochem.mpg.de/mann/tools/. All necessary preprocessing and normalization steps can be found in the 'Processing' menu in Perseus. The 2D enrichment analysis is located in the main menu under 'Processing → Annotation → 2D annotation enrichment'. Figure

**Parameter window of the 2D annotation enrichment in the Perseus software**.

Discussion

While replicates within one technology are usually best done as 'biological' as possible to ensure that the findings are robust and reproducible, for cross technology comparisons it is more desirable to have the equivalent of a 'technical' replicate. For instance, the cell populations from which the transcriptome and the proteome are measured should be as similar as possible, ideally aliquots from the same sample so that one is sure that one samples the same cellular state on different levels of expression. If desired, the whole measurement including the proteome and the transcriptome can be repeated as 'biological' replicates. In the majority of cases, however, the available data has not been recorded in this optimal way. Of course the data analysis described here can still be applied to this situation as well.

Often enrichment analysis in proteomics is performed by calculating a p-value corresponding to a test if a certain annotation term is enriched in a certain set of proteins relative to all genes in the genome. The results of this kind of calculations have to be taken with caution, especially in cases where the proteome coverage is far away from saturation or completeness since apart from the effect under investigation they are biased by which proteins are measurable at all by the employed mass spectrometric technology

Another issue of interest is the potential application of corrections when multiple related terms are used for statistical comparisons. For example, terms in GO are mutually dependent. In principle, correction methods that account for this dependence can be applied to 1D and 2D annotation enrichment as well, and we might do so in the future. Note, however, that by not taking the hierarchy and relatedness of terms into account, the significant findings reported after multiple hypothesis correction are on the conservative side, since the number of effectively independent tests is lower than the total number of terms used in the multiple testing correction. Therefore there is no danger of over-reporting. On the contrary, at fixed FDR one might miss a few significant terms which one would have obtained with a method taking the relatedness into account.

Many other tools for enrichment analysis already exist. In Huang et al.

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

We thank all the other members of the Proteomics and Signal Transduction group for help and discussions. This work was partially supported by PROSPECTS, a 7th Framework grant by the European Directorate (grant agreement HEALTH-F4-2008-201648/PROSPECTS) and by the Max Planck Society for the advancement of Science.

This article has been published as part of