School of Mathematics and Physics, University of Tasmania, Hobart, Tasmania, Australia

Department of Statistics, Texas A&M University, College Station, TX, USA

Fundamental and Computational Sciences Directorate, Pacific Northwest National Laboratory, Richland, WA, USA

Abstract

Shotgun proteomic data are affected by a variety of known and unknown systematic biases as well as high proportions of missing values. Typically, normalization is performed in an attempt to remove systematic biases from the data before statistical inference, sometimes followed by missing value imputation to obtain a complete matrix of intensities. Here we discuss several approaches to normalization and dealing with missing values, some initially developed for microarray data and some developed specifically for mass spectrometry-based data.

Background

High-throughput mass spectrometry (MS) has become an important technology for protein identification and quantitation because it can rapidly identify and quantify thousands of peptides.

Systematic bias is inherent in MS-based data due to complex biological, experimental and technical processing. Bias, which may be loosely defined as any non-biological signal, may arise from many factors, including variations in sample processing conditions, instrument calibrations, LC columns, and changes in temperature over the course of an experiment. One may observe systematic biases in mass measurement accuracy, LC retention times, and/or peak intensities. To better enable comparisons between samples, it is desirable to remove any excess technical variability by using normalization techniques. For example, LC-MS samples may be aligned in terms of their LC retention time and mass profiles, or nonlinear modeling may be employed to capture systematic errors in mass measurements.

Another challenge in quantitative proteomics is widespread missing data (i.e., missing identifications or abundance values). A peptide intensity value may be missing due to several mechanisms, including: (i) the peptide truly is present at an abundance the instrument should be able to detect, but is not detected or is incorrectly identified, (ii) the peptide truly is present but at an abundance below the instrument's detection limits, and (iii) the peptide is not present. Different methods for dealing with missing values should be used depending on the mechanism that gave rise to a missing value. In case (i), using observed values to impute missing values or simply ignoring a few missing values is appropriate. However, in cases (ii) and (iii), when a peptide abundance is below our ability to detect it, simple imputations based on observed values are not appropriate.
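
The difference between abundance-independent loss, as in case (i), and loss below a detection limit, as in case (ii), can be mimicked in a small simulation. The following is a minimal sketch only: the function name `apply_missingness`, the detection limit, the random loss rate, and the Normal(20, 1) intensities are all hypothetical choices made for illustration.

```python
import random

random.seed(11)

def apply_missingness(values, detection_limit, random_rate):
    """Mark values as missing; return (observed_or_None, reason) pairs.

    reason is "below_limit" for abundance-dependent loss, "random" for
    loss unrelated to abundance, and None when the value is observed.
    """
    out = []
    for v in values:
        if v < detection_limit:
            out.append((None, "below_limit"))
        elif random.random() < random_rate:
            out.append((None, "random"))
        else:
            out.append((v, None))
    return out

# Hypothetical log2 intensities and hypothetical missingness settings.
intensities = [random.gauss(20.0, 1.0) for _ in range(10000)]
data = apply_missingness(intensities, detection_limit=18.6, random_rate=0.05)

n = len(data)
frac_below = sum(r == "below_limit" for _, r in data) / n
frac_random = sum(r == "random" for _, r in data) / n
print(f"below detection limit: {frac_below:.3f}, random: {frac_random:.3f}")
```

In real data the reason for a missing value is not recorded; it is kept here only so that the two mechanisms can be counted separately.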

Analysis of MS data almost always involves dealing with both bias and missing values. Deciding which normalization to use and when can be challenging. For example, one needs to decide whether to impute the missing values first and normalize next, or the other way around. All work is generally performed on the logarithm (log) scale of the abundances/intensities. This simplifies the statistics that follow, as log abundances are often approximately normally distributed. Logarithm base two is preferred for ease of fold change interpretation, but any logarithmic transformation will produce approximately normally distributed intensities. Here we use the terms abundance and intensity interchangeably.
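
The convenience of base two for fold change interpretation can be illustrated with a short sketch. The intensities below are hypothetical values for a single peptide, and the helper `log2_values` is introduced here only for illustration.

```python
import math
from statistics import mean

# Hypothetical raw peak intensities for one peptide in two groups.
control = [1.0e6, 1.2e6, 0.9e6, 1.1e6]
treated = [2.1e6, 1.9e6, 2.4e6, 2.0e6]

def log2_values(intensities):
    """Return base-2 logarithms of strictly positive intensities."""
    return [math.log2(x) for x in intensities]

log_control = log2_values(control)
log_treated = log2_values(treated)

# A difference of means on the log2 scale is a log2 fold change:
# +1 means a doubling, -1 means a halving.
log2_fc = mean(log_treated) - mean(log_control)
fold_change = 2 ** log2_fc  # ratio of geometric means on the raw scale
print(f"log2 fold change = {log2_fc:.2f}, fold change = {fold_change:.2f}")
```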

Methods

Normalization

In the context of -Omics applications, bias can generally be defined as non-biological signal; that is, systematic features of the data that are entirely attributable to experimental or technical aspects. There are many sources of bias in LC-MS data, all of which have the potential to affect the measured peptide/protein expression levels (e.g., non-optimal ionization efficiencies in complex samples, differences in LC columns, differences in sample preparation and data acquisition between technicians). The term normalization refers to the process of removing such biases. There are many different approaches, but we focus our discussion on those that are most widely used or have properties that work especially well for proteomic data.

Robust scatter plot smoothing or

ANOVA and regression models can effectively estimate and remove systematic biases when sources of bias are known exactly

Where _{ijkbl }_{i }_{ij }_{ik }_{ib }
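
As an illustration of removing a bias whose source is known exactly, the following sketch subtracts estimated batch offsets under a simple additive treatment-plus-batch model. It is not the authors' model: the data, the labels, and the function names are hypothetical, and the shortcut of estimating a batch offset as the batch mean minus the grand mean is only equivalent to the least-squares estimate because the design is balanced.

```python
from statistics import mean

# Hypothetical log2 abundances for one peptide: (treatment, batch, value).
# The design is balanced: every batch holds two samples of each treatment.
samples = [
    ("ctrl", "b1", 20.1), ("ctrl", "b1", 19.9), ("trt", "b1", 21.0), ("trt", "b1", 21.2),
    ("ctrl", "b2", 20.9), ("ctrl", "b2", 20.7), ("trt", "b2", 21.9), ("trt", "b2", 22.1),
]

def remove_batch_effect(rows):
    """Subtract estimated batch offsets under an additive treatment + batch model.

    For a balanced design, the least-squares batch offset is the batch mean
    minus the grand mean, so treatment differences are left untouched.
    """
    grand = mean(v for _, _, v in rows)
    batches = {b for _, b, _ in rows}
    offset = {b: mean(v for _, bb, v in rows if bb == b) - grand for b in batches}
    return [(t, b, v - offset[b]) for t, b, v in rows]

def group_mean(rows, index, label):
    """Mean value of the rows whose field at `index` equals `label`."""
    return mean(r[2] for r in rows if r[index] == label)

normalized = remove_batch_effect(samples)
batch_gap = group_mean(normalized, 1, "b2") - group_mean(normalized, 1, "b1")
treat_gap = group_mean(normalized, 0, "trt") - group_mean(normalized, 0, "ctrl")
print(f"batch difference after normalization: {batch_gap:.3f}")
print(f"treatment difference after normalization: {treat_gap:.3f}")
```

With unbalanced designs or missing values, the offsets would instead need to be estimated by fitting the full model, for example by least squares.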

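When performing differential expression analysis, it is advisable to examine the distribution of the p-values as a diagnostic. Under the null hypothesis of no differential expression, p-values are uniformly distributed on [0, 1], so a histogram that departs markedly from the expected shape points to residual bias or a poorly specified model. Figure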
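
A minimal sketch of this diagnostic follows. It simulates peptides with no true group difference and tabulates the resulting p-values in ten bins, which should each hold roughly 10% of the values. To stay within the Python standard library it assumes a known standard deviation and uses a z-test; this is a simplification for illustration, and the function names and simulation settings are hypothetical.

```python
import math
import random

random.seed(1)

def two_sided_z_pvalue(group_a, group_b, sigma):
    """Two-sided p-value for a difference in means with known sigma."""
    n_a, n_b = len(group_a), len(group_b)
    diff = sum(group_a) / n_a - sum(group_b) / n_b
    z = diff / (sigma * math.sqrt(1 / n_a + 1 / n_b))
    return math.erfc(abs(z) / math.sqrt(2))

def null_pvalues(n_peptides=5000, n_per_group=10, sigma=1.0):
    """Simulate peptides with no true group difference and test each one."""
    pvals = []
    for _ in range(n_peptides):
        a = [random.gauss(20, sigma) for _ in range(n_per_group)]
        b = [random.gauss(20, sigma) for _ in range(n_per_group)]
        pvals.append(two_sided_z_pvalue(a, b, sigma))
    return pvals

def histogram_fractions(pvals, n_bins=10):
    """Fraction of p-values falling in each equal-width bin on [0, 1]."""
    counts = [0] * n_bins
    for p in pvals:
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return [c / len(pvals) for c in counts]

fractions = histogram_fractions(null_pvalues())
for i, frac in enumerate(fractions):
    print(f"[{i / 10:.1f}, {(i + 1) / 10:.1f}): {frac:.3f}")
```

In an analysis of real data, the same tabulation would be applied to the p-values produced by whichever test the analysis actually uses.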

Missing values

Missing values are common in MS data and are a key challenge in quantitative proteomics. Missing values arise when, for example, a peptide is identified in some samples but not in others; for the samples in which the peptide was not identified, abundances are not assigned or are assigned NA (not available). A peptide may be missing because it is not present in the sample, because it is present but at a concentration below the instrument's detection limit, or because it is present but was not detected or correctly identified due to some unknown effect. Generally one cannot easily determine why a peptide abundance is missing. How we handle the missing values, on the other hand, should ideally depend on the mechanism that caused them. In practice, we can usually separate missing values into two categories: missing completely at random (MCAR) and abundance-dependent missing values. MCAR values occur due to some "glitches" in the instrumentation, such as poor ionization, other peptides competing for charge,

Values missing completely at random (MCAR) can be imputed by simple methods, although some methods are better than others

Censored data present a more complicated problem, as observed values are not a good basis for imputation. In this scenario, censored values are said to be informative, in that the fact that a peptide was not observed tells us that its abundance was simply below our ability to detect it. The observed values for a peptide are not representative of the unobserved values, and analyzing only the observed values or performing imputation based on their average, or even random values generated from an estimated probability distribution as described above, will result in upward-biased estimates and downward-biased standard errors.
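
The direction of these biases can be checked with a small simulation. This is a sketch under stated assumptions: a hypothetical peptide with Normal(20, 1) log2 abundances and a hypothetical detection limit of 19.5. It compares the true values with the observed-only values and with values in which each censored entry is replaced by the minimum observed value.

```python
import random
from statistics import mean, stdev

random.seed(7)

# Hypothetical peptide: true log2 abundances ~ Normal(20, 1) in 2000 samples.
true_values = [random.gauss(20.0, 1.0) for _ in range(2000)]
detection_limit = 19.5  # assumed limit; values below it go unobserved

observed = [v for v in true_values if v >= detection_limit]
n_missing = len(true_values) - len(observed)

# Substitute the minimum observed value for each censored value.
min_imputed = observed + [min(observed)] * n_missing

print(f"true:        mean={mean(true_values):.2f} sd={stdev(true_values):.2f}")
print(f"observed:    mean={mean(observed):.2f} sd={stdev(observed):.2f}")
print(f"min-imputed: mean={mean(min_imputed):.2f} sd={stdev(min_imputed):.2f}")
```

Because every censored value lies below the detection limit, both the observed-only mean and the minimum-imputed mean overestimate the true mean in this setting.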

**Examples of missing data**. Intensities for a peptide with two treatment groups with (A) no missing values, (B) MCAR missing values, (C) censored missing values, and (D) censored missing values imputed as a minimum observed value.

**Percent coverage for nominal 95% confidence intervals of protein-level differences**.

**Histograms of the null p-values for normalized (left) and raw (right) peptide abundances**.

In prior work, we proposed a statistical framework for protein quantitation and inference in the presence of both values missing completely at random and censored values.

A maximum likelihood model is employed that expresses protein-level abundances in terms of peptide-level abundances and accounts for the two types of missingness. Statistical inference proceeds by numerically optimizing the maximum likelihood model to obtain p-values for differential protein expression. It is also possible to use rough parameter estimates to perform model-based filtering and imputation, without numerical optimization, to obtain imputed values for future inference. Note that model (2) represents one approach for 'rolling up' peptide-level information to the protein level. In general, peptide-to-protein rollup is a complex exercise, and others have taken a variety of approaches.

In the automated filtering routine, information content from maximum likelihood theory guides the selection and exclusion of peptides and proteins. Model (2) is constructed under the assumption that primary interest is in protein-level group differences; this would make the Treat_{ik}

Luo et al. (2009) proposed a Bayesian approach to dealing with censored observations (which they call non-randomly missing) in iTRAQ (isobaric Tags for Relative and Absolute Quantitation) data. The authors use logistic regression to determine whether a peptide is censored or MCAR. This is based on two assumptions: that the probability of a missing value is negatively correlated with peptide abundance, and that the relationship between the missingness probability and the observed intensity is approximately linear on the logit scale. Logistic regression is a natural fit for a problem in which only two classes need to be distinguished, here censored and MCAR values. Although the authors apply their model to labeled data, it can also be applied to unlabeled MS data. The authors perform inference while taking censored missing values into account, but no model for actual imputation of missing values is proposed. In this sense, using the proposed approach is similar to using the maximum likelihood model (Karpievitch et al. 2009b) to obtain p-values for differential expression: we can obtain the p-values but not imputed values. No implementation of the method described in the manuscript has been made publicly available.
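
The assumed relationship between intensity and missingness can be sketched as a logistic curve. This is an illustration of the general form only, not the model or the estimates of Luo et al.; the function name `missing_probability` and the `intercept` and `slope` values are hypothetical.

```python
import math

def missing_probability(log2_intensity, intercept=30.0, slope=-1.6):
    """Logistic model: logit(P(missing)) = intercept + slope * intensity.

    intercept and slope are hypothetical values chosen for illustration;
    a negative slope encodes 'lower abundance, more likely missing'.
    """
    logit = intercept + slope * log2_intensity
    return 1.0 / (1.0 + math.exp(-logit))

for x in (16, 18, 20, 22, 24):
    print(f"log2 intensity {x}: P(missing) = {missing_probability(x):.3f}")
```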

The choice of thresholds to use when identifying peptides is related to the problem of missing values, although the specific nature of this relationship is not known. Lowering the amount of evidence required for identifying peptides (lowering the threshold) will result in more peptides and peak intensities but will not necessarily result in a decrease in the number of missing values. In fact, it might be expected that lower identification thresholds will lead to a greater number of missing values, since a greater proportion of the additional peptides are liable to be false identifications. Having said that, a careful examination of these issues would make for interesting future work.

Missing values may also occur due to the limited depth of coverage of the instruments. In MS/MS identification, generally only a small portion of the most abundant peptides at a given time (MS^{1}) are selected for further fragmentation and identification in the MS^{2} phase. Thus, if a peptide is of lower abundance in one treatment group than in the other, it may not be identified when enough more-abundant peptides are present in the same MS scan. This issue can be addressed by selecting specific masses for further fragmentation instead of the most abundant peptides.
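
The effect of selecting only the most abundant precursors can be sketched as follows. The peptide names, intensities, and the choice of three precursors per scan are hypothetical, and `select_precursors` is a toy stand-in for instrument acquisition logic rather than a description of any particular instrument.

```python
def select_precursors(scan, top_n=3):
    """Return peptide names of the top_n most intense precursors in one MS1 scan."""
    ranked = sorted(scan.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:top_n]]

# Hypothetical MS1 intensities for co-eluting peptides in two treatment groups.
scan_group_a = {"pepA": 9.0e6, "pepB": 7.5e6, "pepC": 6.0e6, "pepX": 6.5e6}
scan_group_b = {"pepA": 9.2e6, "pepB": 7.4e6, "pepC": 6.1e6, "pepX": 2.0e6}

print(select_precursors(scan_group_a))
print(select_precursors(scan_group_b))
```

Here pepX is present in both groups, but it is selected for fragmentation only in the group where it ranks among the three most intense precursors, so its abundance would be recorded as missing in the other group.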

Impact of complex preprocessing on downstream statistical inference

Normalization and missing value imputation generally occur as preprocessing steps followed by statistical inference to answer questions of primary scientific interest. However, all data processing, including both preprocessing and downstream inference, "uses up" some of the information in the data to fit and employ statistical or mathematical models. Ideally, a single statistical model would be used to simultaneously carry out preprocessing and inference

where Technical Signal represents any systematic biases as well as missing-data patterns, and Biological Signal represents systematic biological differences between comparison groups of interest.

The most natural approach to data analysis based on the above model would be to fit it in a single step. This would correspond to carrying out preprocessing and inference simultaneously. In practice, the typical analysis pipeline, composed of preprocessing steps followed by downstream inference, is analogous to first fitting the model:

then carrying out inference on the basis of the model:

where Residuals are the processed (normalized) data. The problem with this approach is that the variability introduced into the pipeline by the first step is not communicated to the second step. In other words, any degrees of freedom used up in the preprocessing model are not discounted when the second model is used for inference. This can lead to overfitting of the data, whereby the resulting statistical inferences are not directly interpretable.
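
The contrast between the single-step model and the two-stage pipeline can be written schematically as below. This is a sketch that assumes a simple additive decomposition; the labels Raw Data and Random Error are assumptions introduced here to sit alongside the Technical Signal, Biological Signal, and Residuals named in the text.

```latex
\begin{align*}
\text{Raw Data} &= \text{Technical Signal} + \text{Biological Signal} + \text{Random Error} && \text{(single-step model)}\\
\text{Raw Data} &= \text{Technical Signal} + \text{Residuals} && \text{(preprocessing step)}\\
\text{Residuals} &= \text{Biological Signal} + \text{Random Error} && \text{(inference step)}
\end{align*}
```

Written this way, the parameters estimated for Technical Signal in the preprocessing step do not appear in the inference step, which is why the degrees of freedom they consume go unaccounted for.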

As mentioned above, to minimize the issues of overfitting, inference and normalization can be carried out simultaneously

Combining normalization and missing value imputation

At this point we have shown that systematic bias and missing observations are both prevalent in MS-based data. The fact that many normalization routines require a 'complete' matrix with no missing values raises a question: should imputation be performed first, followed by normalization? That seems like a reasonable solution at first. For example, one can impute missing values using one of the methods described above to produce a complete matrix, and then use one of the normalization routines to remove bias. One should then ask whether the imputed values would differ if the imputation were repeated, and whether that would affect normalization. The answer is generally yes: imputed values will differ every time because they are drawn at random from an appropriate distribution. Thus imputed values, especially when they make up a high proportion of the data, can obscure the bias trends and prevent normalization routines from removing them effectively. Moreover, it does not make sense to impute missing values based on biased observations.
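
The run-to-run variability of random imputation can be seen in a minimal sketch. The peptide values are hypothetical, and `impute_random` is an illustrative helper that draws each missing value from a normal distribution fitted to the observed values; it is not one of the specific imputation methods discussed in this paper.

```python
import random
from statistics import mean, stdev

def impute_random(values, rng):
    """Replace None entries with draws from Normal(mean, sd) of observed values."""
    observed = [v for v in values if v is not None]
    mu, sd = mean(observed), stdev(observed)
    return [v if v is not None else rng.gauss(mu, sd) for v in values]

# Hypothetical log2 intensities for one peptide; None marks a missing value.
peptide = [20.3, None, 19.8, 20.9, None, 20.1, 19.6, None]

run_1 = impute_random(peptide, random.Random(1))
run_2 = impute_random(peptide, random.Random(2))

for a, b, original in zip(run_1, run_2, peptide):
    flag = "imputed" if original is None else "observed"
    print(f"{a:6.2f} {b:6.2f}  {flag}")
```

Observed entries are identical across the two runs, while imputed entries differ, so any normalization fitted to the completed matrix inherits that randomness.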

We use simulated data to show the impact that normalization, imputation, and the order of the two procedures may have on the processed data and on significance analysis. Simulated data were created with 10 samples in each of two treatment groups (20 samples total). The size and structure of the simulated data were selected to mimic those of a real dataset composed of human subjects. Specifically, 1400 proteins were simulated with varying numbers of peptides per protein and 13% missing values overall: 5% MCAR and 8% censored. Simulated data were generated from model (6), which is an adaptation of model (1) that (through the Samp_{im }

The index

**Top three eigentrends identified in raw (left), imputed (middle), and normalized-after-imputation (right) data**. The x-axis is the sample index; the y-axis shows the eigentrend values.

Normalization followed by imputation, on the other hand, performs better (Figure

**Top three eigentrends identified in raw (left), normalized (middle), and imputed-after-normalization (right) data**. The x-axis is the sample index; the y-axis shows the eigentrend values.

Discussion

Quantitative proteomic data present complex challenges to the data analyst. We have discussed two common issues in the context of spectral peak intensity analysis, involving biases due to systematic technical variation and informative missingness patterns. Normalization is the solution to biases, but the normalization techniques employed must be simultaneously flexible enough to capture arbitrary patterns and delicate enough to not overfit the data. Importantly, some of the biases such as sample/subject selection bias may not be corrected by a normalization technique as in some cases subjects are selected not entirely at random, such as subjects visiting a doctor about a specific condition. Missing values, meanwhile, greatly complicate the statistical analysis of quantitative proteomic data, particularly as missingness in this context is largely synonymous with censoring. However, standard statistical techniques can be used to facilitate valid statistical conclusions.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

YVK, ARD, and RDS contributed equally to the general formulation and layout of the paper. YVK led the writing and revision.

Acknowledgements

This article has been published as part of