Abstract
Background
Two-dimensional polyacrylamide gel electrophoresis (2D gel, 2D PAGE, 2DE) is a powerful tool for analyzing the proteome of an organism. Differential analysis of 2D gel images aims at finding proteins that change under different conditions, which leads to large-scale hypothesis testing as in microarray data analysis. Two-component empirical Bayes (EB) models have been widely discussed for large-scale hypothesis testing and applied in the context of genomic data. They have not been implemented for the differential analysis of 2D gel data. In the literature, the mixture and null densities of the test statistics are estimated separately. The estimation of the mixture density does not take into account assumptions about the null density. Thus, there is no guarantee that the estimated null component will be no greater than the mixture density, as it should be.
Results
We present an implementation of a two-component EB model for the analysis of 2D gel images. In contrast to the published estimation method, we propose to estimate the mixture and null densities simultaneously using a constrained estimation approach, which relies on an iteratively reweighted least-squares algorithm. The assumption about the null density is naturally taken into account in the estimation of the mixture density. This strategy is illustrated using a set of 2D gel images from a factorial experiment. The proposed approach is validated using a set of simulated gels.
Conclusions
The two-component EB model is very useful for large-scale hypothesis testing. In proteomic analysis, the theoretical null density is often not appropriate. We demonstrate how to implement a two-component EB model for analyzing a set of 2D gel images. We show that it is necessary to estimate the mixture density and empirical null component simultaneously. The proposed constrained estimation method always yields valid estimates and more stable results. It can be applied to other contexts where large-scale hypothesis testing occurs.
Background
Complementing functional genomics, proteomics deals with the large-scale analysis of proteins expressed by a tissue under specific physiological conditions. A broad range of technologies is used in proteomics, but the central paradigm has been the use of a method for separating mixtures of proteins followed by identification of proteins by mass spectrometry (MS). Two-dimensional polyacrylamide gel electrophoresis (2D PAGE, 2D gel, 2DE) remains very popular, despite the availability of other powerful separation techniques. With 2D PAGE [1], proteins are separated in one dimension according to their molecular mass and in the orthogonal dimension according to their isoelectric point. In theory, each protein is uniquely determined by its location along the two dimensions of separation. The separated proteins are then stained with fluorescent dyes so that they are amenable to imaging. Proteomic differences across multiple samples can be studied by comparing the expression profiles across sets of gels.
Figure 1 shows typical images of 2D gels. Each dark spot with a smooth contour represents a different protein. The darkness of a spot is proportional to the quantity of the corresponding protein on the gel. By comparing spot intensities across images, we are able to compare the volumes of the same protein under different treatments or exposures or stages of tissue development and identify protein spots that change in volume under conditions of interest. It would be unwieldy to do this manually since there are thousands of spots to compare and gels undergo distortions during the experimental process.
Figure 1. Images of proteomes from rat spleens.
The main steps in the differential analysis of two-dimensional gels involve image denoising, spot detection, spot quantification, spot matching and statistical analysis, which were discussed in detail in [2]. Unlike the analysis of microarray data, the statistical differential analysis of 2D gel images is still in its infancy. The main difficulties are the discrimination between actual protein spots and noise, the quantification of protein expression levels thereafter, and spot matching for individual comparison. Although there are commercial software packages for 2D gel image analysis (e.g. PDQuest, Dymension), considerable human intervention is required for spot matching. Spot matching is the process by which one maps a spot on a particular gel to the corresponding spots on the other gels, so that spots corresponding to the same protein are identified. With a larger number of images, this step becomes increasingly problematic, as fewer spots are matched and the analysis is performed on sparser data [3]. Moreover, in available software packages, the comparison of the quantitative features is based on classical tests, such as the t-test or the F-test. Attempts have been made to avoid image segmentation and spot quantification. Models based on image pixels [4] are not practical given the huge number of pixels, the high variation in the background intensity and the sensitivity to misalignment.
Recently, academic software was developed to cope with difficulties in the analysis pipeline, including protein spot detection, quantification and spot matching [3,5,6]. To improve the spot-detection results and avoid spot matching, the methods in [3,6] utilize the mean gel image as the template for locating spots. The Pinnacle method [3] uses a fixed window for spot detection, quantification and background separation. The approaches in [5,6] rely on the watershed transform [7] for spot segmentation and quantification. The RegStatGel software [6] provides advanced statistical tools. Comparison of different software for protein spot quantification is beyond the scope of the current paper. We shall focus on the statistical analysis, assuming that spot quantification has been performed appropriately. For convenience, we employ RegStatGel [6] to obtain spot quantifications for the statistical analysis of the set of gel images considered in this paper.
Since hundreds or thousands of proteins are usually featured on a gel, once proteins are quantified, we are faced with a large-scale hypothesis-testing problem. The RegStatGel software [6] applies the Benjamini-Hochberg (BH-FDR) procedure [8] in combination with multivariate analysis for identifying significantly changed proteins. The BH-FDR procedure is widely used for selecting the p-value threshold to control the false discovery rate (FDR). Under the assumptions that tests are independent or weakly dependent and that the null distribution of the p-values is uniform, the BH-FDR procedure controls the false discovery rate at a given level. In practice, however, these two assumptions are often invalid. Strong dependence usually exists, especially in genomics and proteomics [9], where the dependencies themselves are actually also of interest. Considerable effort has been dedicated to the estimation of the proportion of true null hypotheses and of the false discovery rate at a given p-value threshold [10-19]. The empirical Bayes methodology and closely related methods exploiting a two-component mixture model [10,15,20,21] represent typical examples of such effort. The two-component EB model assumes that a test statistic follows either the null or the non-null distribution.
It has commonly been assumed that the null distribution of the test statistics is known theoretically. However, Efron [12-15] pointed out that in large-scale hypothesis testing the theoretical null distribution often does not hold, for reasons including incorrect model assumptions, unobserved covariates and correlations among test statistics. It is more appropriate to estimate the null density of the test statistics directly from the data instead of using the theoretical null density. Using the two-component empirical Bayes (EB) model, Efron [12-15] proposed to estimate the mixture density from the entire histogram and the null component from data around the central peak of the mixture density. The two-component EB model aims at separating a small subset of interesting cases from a large group of uninteresting cases. Efron's innovative concept and estimation approach have been thoroughly discussed [22-26]. The locfdr R package [27] was developed to estimate the two-component model using Poisson regression and to compute the local false discovery rate (FDR).
Two methods [12,15] were proposed to estimate the null component. One is based on finding an optimal normal approximation to the mixture density around the central peak of the histogram, the other on maximum-likelihood estimation. In both methods, the mixture density and the null component are estimated separately. The estimation of the mixture density does not take assumptions about the null density into account. Thus, there is no guarantee that the estimated null component is no greater than the mixture density over the entire domain. The two approaches may result in the estimated local FDR having multiple peaks or being greater than 1 [25]; neither is desirable. We present a modified estimation method for the two-component EB model: the null and the mixture densities are estimated simultaneously with a necessary constraint, which can be achieved with a constrained iteratively reweighted least-squares (IRLS) algorithm. The proposed methodology is applied to the analysis of a set of 2D gel images from a factorial experiment. Simulation studies are conducted to further validate and investigate the performance of the proposed approach.
Methods
Data
To investigate the effect of nicotine exposure on the proteome of spleen cells of female and male rats, a 2 × 2 factorial design with gender and treatment (nicotine exposure) factors was used, with 3 rats in each experimental group. Spleen cells from the control and treated rats were harvested on postnatal day 65 and then cultured in the presence of concanavalin A. After 4 days in culture, cell pellets were lysed and solubilized directly in rehydration buffer. Lysates were aliquoted and stored frozen at −80°C. Samples were thawed and 20 μg of protein from each sample applied to a pH 4-7 immobilized pH gradient strip (IPG; Amersham Biosciences/GE Healthcare) by overnight rehydration. Isoelectric focusing was performed using an IPGphor IEF system (Amersham Biosciences/GE Healthcare) with voltage increased gradually from 500 to 8000 V and then kept constant at 8000 V for 4 hours. Separation in the second dimension was performed on 12.5% Excel prepared gels specifically made for the Multiphor II apparatus (Amersham Biosciences/GE Healthcare) and run at 40 mA for 35 minutes followed by 100 mA for 1.25 hours. Gels were silver stained (Amersham Plus One silver stain kit) and imaged using a UMACS Power Look 3 scanner (Amersham).
Figure 1 shows four images, each from a different experimental group. The top row has examples of control rats and the bottom row of rats exposed to nicotine. The left column has examples for female rats and the right column of male rats. First, the images were aligned using the algorithm described in [28]. After alignment, boundaries for the interesting portion of the images were set and the region outside these boundaries was cropped.
The objective is to find proteins that changed in quantity under exposure to nicotine or that show a gender effect. The next steps would be to determine the sequences of the differentially expressed proteins by mass spectrometry and to refer these sequences to a database of protein sequences in order to identify the proteins and investigate their functions.
The proteins were detected and quantified using the default settings of the RegStatGel software [6,29]. Specifically, the watershed algorithm was applied to the mean image to generate a master watershed map, which was then imposed onto each individual gel image. Each watershed region contains a single object, either a single spot or an aggregate of two spots (a seldom occurrence). The pixels in each region are then classified as belonging either to the object or to the background using Otsu's method [30]. The mean intensity difference between the object and background serves as a summary statistic for each region and therefore for each protein (or aggregate), and is used for comparison across images. The RegStatGel software is fast, easy to use and has performance comparable to commercial software packages [29]. Note that other free programs, such as Pinnacle [3], can also be used for protein quantification.
For the dataset under consideration, there are a total of 439 watershed regions containing proteins (including overlapping spots). Now, we set up a statistical model for comparing protein quantities across experimental groups. Denote the log of the statistic of interest (e.g. average pixel intensity, total pixel intensity, mean intensity difference) for protein i, image l, experimental group g by y_{gli}, where g = 1, ..., n_{c}, l = 1, ..., n, i = 1, ..., K. For the dataset described above, we have n_{c }= 4, n = 3, K = 439, and the experimental conditions (g = 1, ..., 4) correspond to the factorial combinations of treatment and gender. We have the following linear model:
y_{gli} = μ_{i} + τ_{i}T_{g} + γ_{i}G_{g} + (τγ)_{i}T_{g}G_{g} + ε_{gli},

where T_{g} and G_{g} denote the treatment and gender indicators for group g, and τ_{i}, γ_{i}, (τγ)_{i} are, respectively, the treatment, gender, and interaction effects for protein i. With the assumption that the errors ε_{gli} are independent N(0, σ_{i}^{2}), the test statistic for the treatment effect on protein i is

t_{i} = τ̂_{i}/SE(τ̂_{i}),
where S_{i} is the pooled sample variance entering the standard error and t_{i} follows the t-distribution with df = 4(n − 1) degrees of freedom under the null hypothesis that τ_{i} = 0. The test statistics for the gender and interaction effects follow the same t-distribution under the null hypothesis. Let z_{i} = Φ^{−1}(F_{df}(t_{i})), where F_{df} is the cumulative distribution function of the t_{df} distribution. Theoretically, under the null hypothesis, z_{i} follows the standard normal distribution.
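As a concrete illustration, the probability-integral transform above is easy to compute with SciPy; the t-values below are made-up inputs, and df = 4(n − 1) = 8 follows the design with n = 3 rats per group:

```python
# Map t-statistics to z-values: z_i = Phi^{-1}(F_df(t_i)).
from scipy.stats import norm, t

def t_to_z(t_stats, df):
    """Probability-integral transform from the t_df scale to the N(0,1) scale."""
    return norm.ppf(t.cdf(t_stats, df))

df = 4 * (3 - 1)  # n = 3 replicates per group -> df = 8
z_vals = t_to_z([-2.0, 0.0, 2.0], df)  # hypothetical t-statistics
```

Under the null, the resulting z-values are standard normal, so t = 0 maps to z = 0 and the transform is antisymmetric.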
Two-component Empirical Bayes Model
The two-component EB model assumes a mixture model for the density of z_{i},

f(z_{i}) = p_{0}f_{0}(z_{i}) + (1 − p_{0})f_{1}(z_{i}),
where p_{0} is the prior probability that z_{i} complies with the true null hypothesis, f_{0}(z_{i}) is the null density, and f_{1}(z_{i}) is the density under the alternative hypothesis. This model is very popular in the literature on differential analysis of microarray data, where most authors assume the null density is the theoretical null density.
Efron [10,15] defined the posterior probability that z_{i} is from the null hypothesis as the local FDR, which is given by

fdr(z) = Pr(null | z) = p_{0}f_{0}(z)/f(z).
It can be shown [12,15] that the relationship of the local FDR to the usual (tail-area) FDR is

Fdr(z) = p_{0}F_{0}(z)/F(z) = E_{f}[fdr(Z) | Z ≤ z],

where F_{0} and F are the cumulative distribution functions corresponding to f_{0} and f.
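The identity between the tail-area formula and the conditional average of the local FDR can be checked numerically; the mixture parameters below (p_0 = 0.9, null N(0, 1), non-null N(3, 1)) are illustrative choices, not values from the data:

```python
# Numerical check that Fdr(z) = p0 F0(z)/F(z) equals E[fdr(Z) | Z <= z].
import numpy as np
from scipy.stats import norm

p0, f0, f1 = 0.9, norm(0, 1), norm(3, 1)
zcut = 2.0
grid = np.linspace(-10.0, zcut, 8001)              # fine grid for the tail average

f = p0 * f0.pdf(grid) + (1 - p0) * f1.pdf(grid)    # mixture density
fdr = p0 * f0.pdf(grid) / f                        # local FDR on the grid

F = p0 * f0.cdf(zcut) + (1 - p0) * f1.cdf(zcut)
Fdr_closed = p0 * f0.cdf(zcut) / F                 # tail-area formula
Fdr_avg = np.sum(fdr * f) / np.sum(f)              # Riemann average of fdr over Z <= zcut
```

The two quantities agree up to discretization error, because fdr(z)·f(z) = p_0 f_0(z) term by term.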
To estimate the local FDR, we must estimate the unknown p_{0}, f_{0}, f. Theoretically, f_{0 }should be the N(0, 1) density. However, for many reasons, this theoretical null density may not be valid in practice. For example, strong correlations among tests or covariates unaccounted for in the model will invalidate the usual assumptions [1215]. Moreover, when the majority of tests show small effects, it is sounder to select the relatively more interesting effects by comparing larger effects to smaller effects rather than to the theoretical zero effects. Therefore, it is more appropriate to estimate the null density of the test statistics directly from the data instead of using the theoretical null distribution.
Efron [12,15] assumed the null distribution to be N(δ, σ^{2}) and estimated the null distribution from the data. The log of the mixture density log(f(z)) was estimated by fitting a natural cubic spline or high-order polynomial to the log of counts in the histogram bins via Poisson regression. Indeed, suppose the z-values have been binned and the bin counts are

m_{j} = #{z_{i} in bin j}, j = 1, ..., J.
Assume the m_{j}'s to be Poisson counts, i.e.

m_{j} ~ Poisson(ν_{j}) independently, j = 1, ..., J,

with the unknown ν_{j} proportional to the density f(x_{j}) at the midpoint x_{j} of bin j, i.e. approximately

ν_{j} ≈ NΔf(x_{j}),

where Δ is the width of the bin and N is the total number of tests. log(ν_{j}) can be modeled using a polynomial function of x_{j} or a natural cubic spline and estimated using standard generalized linear model (GLM) methods for Poisson observations.
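The binned Poisson-regression idea can be sketched with plain NumPy IRLS; here a degree-5 polynomial on standardized midpoints stands in for the natural cubic spline, and the simulated z-values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
z = np.concatenate([rng.normal(0, 1, 5000), rng.normal(3, 1, 500)])

m, edges = np.histogram(z, bins=50)             # bin counts m_j
x = 0.5 * (edges[:-1] + edges[1:])              # bin midpoints x_j
xs = (x - x.mean()) / x.std()                   # standardize for numerical conditioning
X = np.vander(xs, 6, increasing=True)           # degree-5 polynomial basis

# IRLS for the Poisson GLM: log(nu_j) = X theta
theta = np.linalg.lstsq(X, np.log(m + 0.5), rcond=None)[0]   # warm start from log counts
for _ in range(25):
    nu = np.exp(X @ theta)
    w = nu                                      # Poisson working weights
    u = X @ theta + (m - nu) / nu               # working response
    theta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * u))

nu = np.exp(X @ theta)
f_hat = nu / (len(z) * np.diff(edges))          # nu_j ~ N * Delta * f(x_j)
```

Because the basis contains a constant column, the fitted Poisson means sum to the total count at convergence, so the implied density estimate integrates to roughly one.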
Efron's estimation methods for the empirical null distribution
Both the central matching (CME) and the maximum likelihood (MLE) methods of estimation are implemented in the locfdr R package [15,27]. MLE is somewhat more stable but can be more biased than CME. Efron [12] shows that CME yields nearly unbiased estimates.
Central matching
When z_{i} is generated from a t-test, the central peak of the histogram is assumed to coincide with the null density. To estimate the empirical null density from the estimated mixture density, a quadratic curve

q̂(z) = β̂_{0} + β̂_{1}z + β̂_{2}z^{2}

is fitted to the central peak of the estimated log f̂(z).
Assuming f_{0}(z) ~ N(δ, σ^{2}), the log of the null component is

log(p_{0}f_{0}(z)) = log p_{0} − log(√(2π)σ) − (z − δ)^{2}/(2σ^{2}),

which is quadratic in z.
p_{0}, δ, and σ can be estimated from β̂_{0}, β̂_{1}, and β̂_{2} by matching coefficients: σ^{2} = −1/(2β̂_{2}), δ = β̂_{1}σ^{2}, and log p_{0} = β̂_{0} + log(√(2π)σ) + δ^{2}/(2σ^{2}). The local FDR at z is then estimated by exp{q̂(z)}/f̂(z). The quadratic curve is obtained by finding a least-squares approximation to the estimated log f̂(z) using bins in a selected interval [a, b] containing null z_{i}'s.
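The coefficient-matching step can be written out explicitly; a minimal sketch, round-tripping hypothetical null parameters (δ = −1, σ = 1, p_0 = 0.9) through the quadratic coefficients:

```python
import numpy as np

def null_from_quadratic(b0, b1, b2):
    """Recover (delta, sigma, p0) from q(z) = b0 + b1 z + b2 z^2 = log(p0 f0(z))."""
    sigma = np.sqrt(-1.0 / (2.0 * b2))      # matches the z^2 coefficient
    delta = b1 * sigma**2                   # matches the z coefficient
    log_p0 = b0 + np.log(np.sqrt(2.0 * np.pi) * sigma) + delta**2 / (2.0 * sigma**2)
    return delta, sigma, np.exp(log_p0)

# Forward direction for a hypothetical null N(-1, 1) with p0 = 0.9
delta, sigma, p0 = -1.0, 1.0, 0.9
b2 = -1.0 / (2.0 * sigma**2)
b1 = delta / sigma**2
b0 = np.log(p0) - np.log(np.sqrt(2.0 * np.pi) * sigma) - delta**2 / (2.0 * sigma**2)
d_hat, s_hat, p_hat = null_from_quadratic(b0, b1, b2)
```

The inversion is exact, so the recovered parameters equal the ones we started from.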
Maximum likelihood estimation
An alternative estimation method is based on the maximum-likelihood estimator of the parameters p_{0}, δ, σ. Assume that the non-null density f_{1}(z) is supported outside some given interval [a, b]. Let N_{0} be the number of z_{i} in [a, b], and define

H(δ, σ) = Φ((b − δ)/σ) − Φ((a − δ)/σ) and θ = p_{0}H(δ, σ).
Then the likelihood function for all the z-values in [a, b] is

L(p_{0}, δ, σ) = θ^{N_{0}}(1 − θ)^{N−N_{0}} ∏_{z_{i}∈[a,b]} [ϕ((z_{i} − δ)/σ)/σ]/H(δ, σ),
where ϕ denotes the normal density. The estimates of p_{0}, δ, and σ can be obtained by maximizing this likelihood.
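A numerical version of this maximum-likelihood step can be sketched with SciPy; the simulated data, the interval [a, b] = [−2, 0], and the Nelder-Mead optimizer are all illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
z = np.concatenate([rng.normal(-1, 1, 5000), rng.normal(3, 1, 500)])
a, b = -2.0, 0.0
z0 = z[(z >= a) & (z <= b)]
N, N0 = len(z), len(z0)

def negloglik(par):
    """-log L(p0, delta, sigma) for the truncated-interval likelihood."""
    delta, log_sigma, logit_p0 = par
    sigma = np.exp(log_sigma)                    # keep sigma > 0
    p0 = 1.0 / (1.0 + np.exp(-logit_p0))         # keep p0 in (0, 1)
    H = norm.cdf((b - delta) / sigma) - norm.cdf((a - delta) / sigma)
    theta = p0 * H                               # P(z_i in [a, b]) under the model
    ll = N0 * np.log(theta) + (N - N0) * np.log1p(-theta)
    ll += np.sum(norm.logpdf(z0, delta, sigma)) - N0 * np.log(H)
    return -ll

res = minimize(negloglik, x0=[0.0, 0.0, 1.0], method="Nelder-Mead")
delta_hat, sigma_hat = res.x[0], np.exp(res.x[1])
```

Reparameterizing σ and p_0 through exp and a logistic map keeps the search unconstrained; with several thousand null observations the fit lands close to the generating values.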
Constrained Estimation Approach
In the procedures described above, the mixture density and its null component are estimated separately. The estimated null component p̂_{0}f̂_{0}(z) may be greater than the mixture density f̂(z). Thus, there is no guarantee that we will have p̂_{0}f̂_{0}(z) ≤ f̂(z) for all z. Indeed, we may end up awkwardly with a local FDR estimate at z_{1} larger than that at z_{2} for some z_{1} < z_{2} < 0, as shown in Figure 2, where both approaches were implemented on the set of gels of interest.
Figure 2. Estimation results using unconstrained approaches. Estimates of mixture densities and their null components from the CME and MLE methods, and the local FDR. Upper panel: solid green curves are the splinefitted mixture densities; the blue dashed and red dotted curves are the empirical null densities from the CME and MLE methods, respectively. Lower panel: the local FDR estimates from the CME (blue solid curve) and MLE (red dotted curve) methods.
Therefore, we propose to modify the CME approach by estimating the mixture density and its null component simultaneously. The log of the null component is estimated via a quadratic approximation to the central peak of the estimated log mixture density using bins contained in the interval [a, b]. We add the constraint that the fitted mixture density be no smaller than its fitted null component (for all histogram bins x_{j}) while maximizing the Poisson likelihood. To solve this problem, we utilize a constrained iteratively reweighted least-squares algorithm, as shown below. We approximate the bin counts of the mixture histogram via Poisson regression using a natural cubic spline with D knots. Assume the knots are x_{1} = h_{1} < ⋯ < h_{D} ≤ x_{J}, where x_{1} and x_{J} are the two bins at the left and right ends of the histogram. Denote the value of the natural cubic-spline function at point x by s(x; θ), where θ is the unknown parameter vector for the cubic splines. Then
s(x; θ) = θ'B(x) = Σ_{d=1}^{D} θ_{d}B_{d}(x),

where θ = [θ_{1}, ..., θ_{D}]', B(x) = [B_{1}(x), ..., B_{D}(x)]'. The B_{d}(x) are the natural cubic spline basis functions [31]:

B_{1}(x) = 1, B_{2}(x) = x, B_{d+2}(x) = d_{d}(x) − d_{D−1}(x), d = 1, ..., D − 2,

where d_{d}(x) = [(x − h_{d})_{+}^{3} − (x − h_{D})_{+}^{3}]/(h_{D} − h_{d}) and (x − h_{d})_{+} = 0 if x < h_{d}. We fit the log of the histogram counts using the natural cubic spline, assuming

log(ν_{j}) = s(x_{j}; θ).
Supposing the non-null density is close to zero in [a, b], we have approximately, for x_{j} ∈ [a, b],

log(f(x_{j})) ≈ log(p_{0}f_{0}(x_{j})) = q(x_{j}; β),

where q(x; β) is a quadratic function with parameter β.
The constraint that log(f(x_{j})) ≥ log(p_{0}f_{0}(x_{j})) leads to s(x_{j}; θ) ≥ q(x_{j}; β) for all x_{j}'s. Then, we only need to estimate the parameters θ by maximizing the Poisson likelihood with the constraint that s(x_{j}; θ) ≥ q(x_{j}; β), which results in solving

max_{θ} Σ_{j=1}^{J} L(m_{j}, x_{j}; θ) subject to s(x_{j}; θ) ≥ q(x_{j}; β), j = 1, ..., J,   (2)

where L(m_{j}, x_{j}; θ) = m_{j}s(x_{j}; θ) − exp{s(x_{j}; θ)} is the Poisson log-likelihood for bin j, omitting the constant term unrelated to the parameter θ. q(x; β) is the best quadratic approximation to s(x; θ) based on bins in [a, b]. To solve this, the parameter β must be expressed as a function of θ. Below, we show how to rewrite the constraint in terms of the spline parameter θ.
Denote the values of the natural cubic spline at all the bins x_{1}, ..., x_{J} as a vector S(θ). We have

S(θ) = Γθ,

where Γ, a J × D matrix, has entry Γ(j, d) = B_{d}(x_{j}) in row j and column d. Similarly, we denote the values of the spline at bins in [a, b] in vector form as S_{0}(θ) = Γ_{0}θ, where Γ_{0} is the corresponding submatrix of Γ. S_{0}(θ) approximates the null component of the mixture density. Let q(x; β) = ω(x)'β be a quadratic function, where ω(x) = [1, x, x^{2}]' and β = [β_{1}, β_{2}, β_{3}]'. The values of the quadratic function at all bin midpoints can be written in vector form as Q(β) = Ωβ, where Ω is the J × 3 matrix with jth row ω(x_{j})' for j = 1, ..., J. Similarly, we denote the values of the quadratic function at bin midpoints in [a, b] as Q_{0}(β) = Ω_{0}β, where Ω_{0} is the submatrix of Ω corresponding to bins in [a, b].
We want to obtain the best quadratic approximation to the natural cubic spline s(x; θ) based on the bin midpoints x_{j} ∈ [a, b]. The least-squares solution minimizing (Γ_{0}θ − Ω_{0}β)'(Γ_{0}θ − Ω_{0}β) is given by

β(θ) = (Ω_{0}'Ω_{0})^{−1}Ω_{0}'Γ_{0}θ.
Thus, maximizing the likelihood (2) is equivalent to solving

max_{θ} Σ_{j=1}^{J} L(m_{j}, x_{j}; θ) subject to Γθ ≥ Ω(Ω_{0}'Ω_{0})^{−1}Ω_{0}'Γ_{0}θ,

where the inequality holds elementwise.
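The closed-form least-squares map from θ to β can be checked on a toy function; the spline values are replaced here by an arbitrary smooth stand-in, and the grid and interval are illustrative:

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 61)                   # bin midpoints x_j
in_ab = (x >= -1.5) & (x <= 0.5)                 # bins inside a hypothetical [a, b]

Omega = np.column_stack([np.ones_like(x), x, x**2])   # rows omega(x_j)'
Omega0 = Omega[in_ab]

s_vals = np.sin(x) + 0.1 * x**2                  # stand-in for s(x_j; theta)
s0 = s_vals[in_ab]

# beta = (Omega0' Omega0)^{-1} Omega0' s0: LS quadratic fit on [a, b]
beta = np.linalg.solve(Omega0.T @ Omega0, Omega0.T @ s0)
q_all = Omega @ beta                             # quadratic evaluated at every bin

check = np.polyfit(x[in_ab], s0, 2)[::-1]        # same fit (polyfit orders high->low)
```

The normal-equations solution and `np.polyfit` solve the same least-squares problem, so the coefficient vectors coincide.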
The above problem is solved by means of nonlinear programming. A simple computational algorithm for estimating the null and mixture densities is to modify the iteratively reweighted leastsquares (IRLS) procedure [32] for Poisson regression by adding the constraint to the weighted leastsquares regression. The IRLS algorithm converges very fast, based on our experience.
The pseudo code for the modified IRLS algorithm is as follows:
/* Initialization of deviance Dev and oldDev */
Dev = 100000, oldDev = 0
/* Initialization of the estimates ν_{j} (offset guards against empty bins) */
ν_{j} = m_{j} + 0.5
While (|Dev − oldDev| > tolerance)
{
/* Update weights */
w_{j} = ν_{j}
/* Constrained weighted least-squares update of θ, with working response
   u_{j} = log(ν_{j}) + (m_{j} − ν_{j})/ν_{j} */
θ = argmin_{θ} Σ_{j} w_{j}(u_{j} − s(x_{j}; θ))^{2} subject to Γθ ≥ Ω(Ω_{0}'Ω_{0})^{−1}Ω_{0}'Γ_{0}θ
ν_{j} = exp{s(x_{j}; θ)}
/* Update Poisson deviance */
oldDev = Dev
Dev = 2Σ_{j}{m_{j} log(m_{j}) − m_{j} − (m_{j} log(ν_{j}) − ν_{j})}
}
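As an alternative to hand-rolling the constrained IRLS, the constrained problem (2) can be handed to a general nonlinear-programming routine; this sketch uses SciPy's SLSQP solver with a polynomial basis in place of the natural cubic spline, on simulated data:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
z = np.concatenate([rng.normal(-1, 1, 5000), rng.normal(3, 1, 500)])
m, edges = np.histogram(z, bins=50)
x = 0.5 * (edges[:-1] + edges[1:])
in_ab = (x >= -2.0) & (x <= 0.0)                    # hypothetical null interval [a, b]

Gamma = np.vander((x - x.mean()) / x.std(), 6, increasing=True)  # basis matrix
Gamma0 = Gamma[in_ab]
Omega = np.column_stack([np.ones_like(x), x, x**2])
Omega0 = Omega[in_ab]
M = np.linalg.solve(Omega0.T @ Omega0, Omega0.T)    # LS map: beta(theta) = M Gamma0 theta
C = Gamma - Omega @ (M @ Gamma0)                    # constraint: C theta >= 0

def negloglik(theta):
    s = Gamma @ theta
    return float(np.sum(np.exp(s) - m * s))         # minus Poisson log-likelihood

def grad(theta):
    return Gamma.T @ (np.exp(Gamma @ theta) - m)

theta0 = np.linalg.lstsq(Gamma, np.log(m + 0.5), rcond=None)[0]
res = minimize(negloglik, theta0, jac=grad, method="SLSQP",
               constraints=[{"type": "ineq", "fun": lambda t: C @ t,
                             "jac": lambda t: C}])

s_hat = Gamma @ res.x                               # fitted log mixture
q_hat = Omega @ (M @ (Gamma0 @ res.x))              # fitted log null component
```

By construction the solution satisfies s(x_j; θ) ≥ q(x_j; β(θ)) at every bin, so the fitted null component never exceeds the fitted mixture density.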
Results and Discussion
In this section, we implement the twocomponent EB model on the set of 2D gel images described previously. Both Efron's estimation approach and the proposed one will be applied for comparison. These approaches will be further compared using simulations.
Analyzing 2D Gel Images
First, we analyze the z_{i} values for the treatment, gender, and interaction effects using Efron's locfdr R package. The upper panel of Figure 2 shows the histograms (50 bins) of the corresponding z-values, the mixture density from the Poisson regression, and the null component estimated using CME and MLE. For estimation of the null component, we chose the intervals [−1.25, 0.25], [−2.5, −1.2] and [−0.5, 1.2] for the treatment, gender and interaction effects, respectively. The degrees of freedom of the splines were chosen to minimize the AIC [33]; they were 5, 10 and 10, respectively. The green solid curves in the upper panel of Figure 2 are estimates of the mixture densities from the Poisson regression. The blue dashed and red dotted curves in the upper panel represent the empirical null component estimated using the CME and MLE methods, respectively. The lower panel of Figure 2 shows the local FDR at different z-values based on the empirical null component from the CME (blue solid line) and MLE (red dotted line) methods. Figure 2 clearly conveys the message that the theoretical null, the standard normal density N(0, 1), is not appropriate for the proteomic data at hand. Taking the treatment effect as an example, the empirical null distribution is N(−0.595, 0.915^{2}) by CME and N(−0.48, 0.891^{2}) by MLE, with proportions of true null hypotheses close to 1 for both, which indicates that nicotine exposure affects nearly all proteins expressed by spleen cells similarly. Clearly, the empirical null density is even further from its theoretical form for the gender effect: the central peak of the z-values is to the left of −1.
Figure 2 also demonstrates that neither CME nor MLE yields a desirable empirical null estimate. The estimated null components are not below the estimated mixture density throughout the range of z-values. Consequently, the estimated local FDR has multiple peaks and values greater than 1 at many z's. The estimate of the proportion of true null hypotheses can also be greater than 1, which is not a desirable outcome. There is significant discrepancy between the results from CME and MLE, as demonstrated by the plots for the gender effect. We tried alternative specifications for the intervals used for estimating the empirical null density and different degrees of freedom for the splines: all yielded very similar results. Moreover, we found that MLE is more sensitive to the choice of the interval [a, b], as also observed in [24]. Next, we applied the proposed constrained estimation approach with the same choices of null intervals. The degrees of freedom for the splines that minimize the AIC were 5, 9 and 5 for the treatment, gender and interaction effects, respectively. Figure 3 displays the results. The green solid and blue dashed curves in the upper panel represent the mixture and empirical null densities, respectively. The lower panel shows the estimated local FDR at different z-values.
Figure 3. Estimation results using the constrained approach. Estimates of mixture densities and their null components from the constrained estimation approach and the corresponding local FDR estimation. Upper panel: the solid green curves are the splinefitted mixture densities; the blue dashed curves are the empirical null densities from constrained estimation approach. Lower panel: the blue solid curves represent the local FDR estimates.
Comparing with Figure 2, we see that the proposed constrained estimation approach yielded results similar to those obtained with CME. However, now the empirical null component is below the mixture density, and the local FDR estimate is no greater than 1, smooth, and non-increasing at both tails. For the treatment and interaction effects, the null proportion is nearly one, indicating that there is no apparent differential effect of nicotine exposure. The treatment and interaction effects follow approximately N(−0.459, 0.89^{2}) and N(−0.284, 0.915^{2}), respectively. The empirical null distribution for the gender effect is N(−1.511, 1.07^{2}), with a null proportion of about 0.84. The results for the gender effect show that we need to interpret results from large-scale hypothesis testing with caution. The bulk of the histogram is centered around −1.5, indicating that the majority of proteins have higher expression in female rats. The local FDR plot for the gender effect reveals that there is a small group of proteins with higher expression in males. This group of proteins is clearly separate from the rest, as evidenced by the small local FDR. The local FDR is therefore more indicative of how different the gender effect is on a protein compared to the majority of the proteome, and less indicative of how significant the gender effect is. Should the theoretical null distribution be used, there would be a large number of effects at the left tail. Overall, we note that the estimated means of the null components are far from zero, especially for the gender effect, which may indicate the need to further normalize the data to remove some systematic bias.
Simulation Validations
Numerical simulation
In this section, we compare the proposed constrained estimation procedure with the unconstrained CME approach using numerical simulations. The simulation model consists of z_{i} ~ N(−1, 1), i = 1, ..., 5000, and z_{i} ~ N(3, 1), i = 5001, ..., 5500. Thus, the first 5000 belong to the null distribution and the last 500 to the non-null distribution, and the null proportion is p_{0} = 0.909. The interval [−2, 0] was used for estimating the null component. The estimated mixture density and its null component are displayed in Figure 4, with the left column showing the results from the unconstrained CME approach and the right column showing the results from the proposed constrained estimation approach. The upper panel shows the histogram of the simulated z-values from one run, the estimated mixture density (solid green curve) and the empirical null component (blue dashed curve). The lower panel shows the estimated local FDR from each approach.
Figure 4. Estimation results for simulated zvalues. Histogram of simulated zvalues, estimated mixture densities (green solid curves) and null components (blue dashed curves) using the CME approach (left column) and the constrained estimation approach (right column). The lower panel displays the estimated local FDR for each approach.
Even when the true null distribution is normal and there is a large number of observations, the unconstrained estimation approach generated undesirable results. The null component is greater than the mixture density at some points around the peak of the histogram. Moreover, the left tail of the local FDR is close to 0, indicating that some true null values will be declared non-null depending on the threshold of the local FDR. The estimated null density follows N(−1.013, 0.876^{2}), with an estimated null proportion quite different from the value in the simulation model. In contrast, the empirical null density estimated using the constrained estimation approach is more accurate: it follows N(−1.011, 0.979^{2}), with an estimated null proportion close to the true p_{0}. The right tails of the estimated local FDR are similar under the two approaches, which indicates that both have similar sensitivity. The left tail of the local FDR has much larger values in the constrained method, indicating a lower chance of a true null value being declared non-null.
We performed 100 simulations to compare the bias and standard deviation of estimates of the null parameters from both approaches. We chose different numbers of bins (50 bins or 100 bins) as well as different numbers of observations (N = 550 or N = 5500). Table 1 shows the mean and standard deviations (SD) of the estimates of the null parameters from both approaches.
Table 1. Comparison of Estimates for Null Parameters (δ = −1, σ = 1, p_{0} = 0.909; 100 simulations).
From Table 1, we see that both approaches yielded estimates that are nearly unbiased. The estimates from the proposed approach have much smaller standard deviations, especially for σ and p_{0}. The superior performance of the constrained procedure persists as the total number of observations increases. The constrained approach is not sensitive to the number of bins used for estimation when this number is large enough (50 or 100) for the histogram counts to be roughly proportional to the density in the bins. The unconstrained approach is more affected by the number of bins, with a smaller number leading to increased variability in the estimates of σ and p_{0}. The simulation results clearly demonstrate that the constrained approach is better at estimating the null component.
Next, we compare the performance of both approaches for estimation of the local FDR at points close to the nonnull component. To do that, we choose several z's on the right tail to compare the local FDR estimates with the true values. The results are shown in Table 2. The comparison is based on the ratio of the average of the local FDR estimates at a given z to the true value and on the relative SD of the estimates from the two approaches for the 100 simulations. The relative SD was computed as the SD from the constrained approach divided by the SD from the unconstrained approach.
Table 2. Comparison of Estimates for Local FDR (100 simulations).
Table 2 clearly shows that the estimates of the local FDR from the proposed procedure have smaller bias and much less variability, and converge to the true value faster as N increases. The bias (relative to the magnitude of the true values) in the unconstrained approach increases with larger values of z (smaller local FDR) and with a larger number of bins when N is fixed. The bias of both approaches decreases as N increases. When N is not large and the number of observations per bin is small, the unconstrained approach leads to much larger variability and bias for smaller true local FDR values. Overall, the performance of the constrained estimation is much more stable and is not sensitive to the number of bins or to the magnitude of the true local FDR values.
Validation using Simulated Gels
To further validate the proposed approach, we analyzed a set of simulated 2D gel images generated by randomly perturbing an actual gel image, as described in [29]. The 20 simulated gels were divided into two groups of 10. To simulate the group (treatment or intervention) effect, we artificially altered 11 manually selected spots so that these 11 spots were significantly differentially expressed across groups. Figures 5 and 6 show two simulated gel images, one from each group, with the 11 altered spots circled. The test statistics for the 147 spots were obtained with the RegStatGel software. We applied both estimation approaches; the results are shown in Figure 7. The interval [-2.5, 2.5] was used for estimating the null component. The left column shows the results from the CME approach without constraint, and the right column shows the results from the proposed constrained approach. The upper panel displays the histogram of z-values, the estimated mixture density (solid green curve) and the empirical null (blue dashed curve). The lower panel displays the estimated local FDR from each approach, with '+' signs marking the observed points. Both approaches identified all 11 altered spots and no others, with local FDR estimates for these 11 spots much lower than for the other proteins. Again, the unconstrained approach produces a bizarre local FDR curve.
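For reference, the local FDR shown in the lower panels of Figure 7 is the two-component ratio fdr(z) = p0 f0(z)/f(z). A minimal sketch follows, with an assumed toy mixture standing in for the densities fitted from the gels:

```python
import numpy as np

def normal_pdf(z, mean, sd):
    return np.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def local_fdr(z, p0, delta, sigma, f_mix):
    """Two-component local FDR: fdr(z) = p0 * f0(z) / f(z), where f0 is the
    empirical null N(delta, sigma^2) and f_mix returns the estimated mixture
    density. Under the constraint p0 * f0 <= f the ratio cannot exceed 1;
    the clip only guards against numerical overshoot in unconstrained fits."""
    fdr = p0 * normal_pdf(z, delta, sigma) / f_mix(z)
    return np.clip(fdr, 0.0, 1.0)

# Toy mixture standing in for the fitted density (an assumption, not the
# estimate from the gels): 90% null N(0, 1) plus 10% non-null N(3, 1).
f_mix = lambda z: 0.9 * normal_pdf(z, 0.0, 1.0) + 0.1 * normal_pdf(z, 3.0, 1.0)

fdr_center = local_fdr(0.0, 0.9, 0.0, 1.0, f_mix)  # near 1: likely null
fdr_tail = local_fdr(4.0, 0.9, 0.0, 1.0, f_mix)    # small: likely altered
```

Spots with small fdr(z), like the 11 altered spots here, are the ones flagged as differentially expressed.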
Figure 5. A simulated gel image from group 1. The 11 altered spots are circled.
Figure 6. A simulated gel image from group 2. The 11 altered spots are circled.
Figure 7. Estimation results for simulated 2D gel images. Histograms of z-values from simulated gels, estimated mixture densities (green solid curves) and null components (blue dashed curves) using the CME approach (left column) and the constrained estimation approach (right column). The lower panel displays the estimated local FDR for each approach. The '+' signs in the lower panel mark the observed points.
Conclusions
Similar to microarray data analysis, proteomic analysis leads to large-scale simultaneous hypothesis testing and thus poses similar challenges. The two-component model plays an important role in the microarray literature. We applied a two-component EB model to the analysis of a set of 2D gel images. As demonstrated by the 2D gel data, the true null density can be very different from its theoretical form, which supports Efron's innovative idea of choosing an empirical null distribution for hypothesis testing. Estimating the null density is a fundamental problem in the two-component EB model. Efron generalized the theoretical null N(0, 1) to N(δ, σ^{2}) and proposed two convenient estimation methods, CME and MLE.
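To make the central matching idea concrete: on a central interval [a, b] where the non-null component is assumed negligible, the log of the expected histogram counts is quadratic in z when the null is N(δ, σ²), so δ, σ and p0 can be read off a quadratic fit. The sketch below is the plain unconstrained version under a simplifying assumption (equal-weight least squares on log counts); the constrained approach proposed in this paper instead fits the null and mixture jointly via an iteratively reweighted least-squares algorithm.

```python
import numpy as np

def central_matching(z, a=-2.5, b=2.5, bins=50):
    """Unconstrained central-matching sketch. On [a, b], where the non-null
    component is assumed negligible, expected bin counts are proportional to
    p0 * N(delta, sigma^2), so their logs are quadratic in z. Fit that
    quadratic by ordinary least squares and read off delta, sigma and p0."""
    counts, edges = np.histogram(z, bins=bins, range=(a, b))
    centres = 0.5 * (edges[:-1] + edges[1:])
    width = edges[1] - edges[0]
    keep = counts > 0                 # log of a zero count is undefined
    X = np.vander(centres[keep], 3)   # columns: z^2, z, 1
    beta2, beta1, beta0 = np.linalg.lstsq(X, np.log(counts[keep]),
                                          rcond=None)[0]
    # log count = const - (z - delta)^2 / (2 sigma^2), so the quadratic
    # coefficient gives sigma and the linear coefficient gives delta.
    sigma = np.sqrt(-1.0 / (2.0 * beta2))
    delta = beta1 * sigma ** 2
    # Bin counts approximate len(z) * width * p0 * f0(z), so p0 follows
    # from the constant term beta0.
    p0 = (np.exp(beta0 + delta ** 2 / (2.0 * sigma ** 2))
          * sigma * np.sqrt(2.0 * np.pi) / (len(z) * width))
    return delta, sigma, p0
```

Because each bin gets equal weight regardless of its count, sparse tail bins can destabilize this fit, which is one motivation for the reweighted, constrained formulation.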
However, as shown here, neither method is free of problematic results, which are hard to interpret in practice. To improve the estimation of the null density, we proposed a constrained estimation approach based on the central matching method. This procedure naturally takes into account the shape of the null density and its relationship to the mixture density, and explicitly constrains the estimated mixture density to be no less than the null component. Both the unconstrained and constrained approaches are nearly unbiased, but the constrained method yields more stable and reliable estimates, as demonstrated by our simulation results. It can be generalized to situations where the null density comes from a family broader than the normal, and it can certainly be applied to any context where large-scale hypothesis testing occurs. Here, we have constrained the null component to be no greater than the mixture density at the histogram bins. This is a simplified version of the constraint that the null component be no greater than the mixture density over the entire real line, which is much harder to enforce. We note that, given the smoothness of the mixture density, the simplified constraint suffices in practice. It is reasonable to assume that the local FDR is a non-increasing function in the tail areas, where the z-values are farther away from the null component. To impose this non-increasing property on the estimation of the local FDR, the monotone spline regression technique [34] could be utilized. We will tackle this in future work.
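As a simpler stand-in for monotone regression splines, the non-increasing property could be imposed by isotonic regression via the pool-adjacent-violators algorithm. A minimal sketch follows; it is an illustration of the constraint only, not the method proposed in this paper.

```python
import numpy as np

def fit_nonincreasing(y):
    """Pool-adjacent-violators sketch that projects a sequence of local FDR
    estimates (ordered by increasing z in a right-tail region) onto the set
    of non-increasing sequences. A simpler stand-in for the monotone
    regression splines of [34]."""
    # A non-increasing fit to y is a non-decreasing fit to reversed y.
    blocks = []  # each block holds [mean, width]
    for v in reversed(list(y)):
        blocks.append([float(v), 1])
        # Merge adjacent blocks while they violate non-decreasing order,
        # replacing them with their weighted average.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2 = blocks.pop()
            m1, w1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    fit = np.concatenate([np.full(w, m) for m, w in blocks])
    return fit[::-1]

smoothed = fit_nonincreasing([1.0, 0.5, 0.6, 0.2])  # -> [1.0, 0.55, 0.55, 0.2]
```

The local bump at 0.6 is averaged with its neighbor, removing the kind of non-monotone artifact seen in the unconstrained local FDR curves.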
The choice of the interval [a, b] may influence the estimation, especially if it is misspecified. When it is appropriately specified, i.e., when the non-null component is nearly zero on the interval, our limited experience shows that the proposed approach is not sensitive to the choice of [a, b]. However, how the interval [a, b] affects the estimation in general requires further research.
A quite different method for empirical null estimation is based on Fourier analysis [35]. An attractive method that models the local FDR directly, rather than the mixture density, has also been proposed [25]. The former is nonparametric while the latter relies on parametric model assumptions; both yield good estimates.
We have focused on estimating the local FDR based on test statistics. The two-component EB model is robust to correlation effects among the test statistics. It may be more informative to model the structure inherent in the data, which is a challenging problem that relies on model assumptions and certainly calls for further research.
We used the protein quantifications produced by the RegStatGel software with default settings. It should be noted that different software may generate different quantifications [36]; comparing them is beyond the scope of the current paper.
Authors' contributions
FL developed the constrained estimation approach and generated all the numerical results, as part of his doctoral work. FSM and FL wrote the paper together. All authors read and approved the final manuscript.
Acknowledgements
Work for this paper was supported in part by NIH Grant 5R01GM075298. The authors thank Carol Whisnant for the use of her data. This article reflects the views of the authors and should not be construed to represent FDA's views or policies.
References

O'Farrell P: High resolution two-dimensional electrophoresis of proteins.

Roy A, Seillier-Moiseiwitsch F, Lee K, Hang Y, Marten M, Raman B: Analyzing Two-Dimensional Gel Images.

Morris J, Clark BN, Gutstein HB: Pinnacle: A fast, automatic and accurate method for detecting and quantifying protein spots in 2-dimensional gel electrophoresis data.
Bioinformatics 2008, 24:529-536.

Conradsen K, Pedersen J: Analysis of Two-Dimensional Electrophoretic Gels.
Biometrics 1992, 48:1273-1287.

Anjos Ad, Moller ALB, Ersbol BK, Finnie C, Shahbazkia HR: New approach for segmentation and quantification of two-dimensional gel electrophoresis images.
Bioinformatics 2011, 27:368-375.

Li F, Seillier-Moiseiwitsch F: Differential Analysis of 2D Gel Images. In Methods in Enzymology. Volume 487. Edited by Johnson M, Brand L. San Diego: Academic Press; 2011:596-609.

Vincent L, Soille P: Watersheds in digital spaces: An efficient algorithm based on immersion simulations.
IEEE Transactions on Pattern Analysis and Machine Intelligence 1991, 13:583-598.

Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing.
Journal of the Royal Statistical Society, Series B 1995, 57:289-300.

Qiu X, Klebanov L, Yakovlev A: Correlation Between Gene Expression Levels and Limitations of the Empirical Bayes Methodology for Finding Differentially Expressed Genes.
Statistical Applications in Genetics and Molecular Biology 2005, 4:113.

Efron B, Tibshirani R, Storey, Tusher V: Empirical Bayes analysis of a microarray experiment.
Journal of the American Statistical Association 2001, 96:1151-1160.

Efron B: Large-scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis.
Journal of the American Statistical Association 2004, 99:96-104.

Efron B: Correlation and Large-Scale Simultaneous Significance Testing.
Journal of the American Statistical Association 2007, 102:93-103.

Efron B: Size, Power, and False Discovery Rates.
Annals of Statistics 2007, 35:1351-1377.

Efron B: Microarrays, Empirical Bayes and the Two-Groups Model.
Statistical Science 2008, 23:1-22.

Storey J, Tibshirani R: Statistical significance for genomewide studies.
Proceedings of the National Academy of Sciences 2003, 100:9440-9445.

Pounds S, Morris SW: Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values.

Aubert J, Bar-Hen A, Daudin J, Robin S: Determination of the differentially expressed genes in microarray experiments using local FDR.

Broberg P: A new estimate of the proportion unchanged genes in a microarray experiment.

Lee MLT, Kuo F, Whitmore G, Sklar J: Importance of replication in microarray gene expression studies: Statistical methods and evidence from repetitive cDNA hybridizations.
Proceedings of the National Academy of Sciences 2000, 97:9834-9838.

Newton M, Kendziorski C, Richmond C, Blattner F, Tsui K: On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data.

Benjamini Y: Comment: Microarrays, Empirical Bayes and the Two-Groups Model.
Statistical Science 2008, 23:23-28.

Morris C: Comment: Microarrays, Empirical Bayes and the Two-Groups Model.
Statistical Science 2008, 23:34-40.

Cai T: Comment: Microarrays, Empirical Bayes and the Two-Groups Model.
Statistical Science 2008, 23:29-33.

Rice K, Spiegelhalter D: Comment: Microarrays, Empirical Bayes and the Two-Groups Model.
Statistical Science 2008, 23:41-44.

Efron B: Rejoinder: Microarrays, Empirical Bayes and the Two-Groups Model.
Statistical Science 2008, 23:45-47.

Locfdr: R package for computing local false discovery rate [http://cran.r-project.org/web/packages/locfdr/index.html]

Potra F, Liu X, Seillier-Moiseiwitsch F, Roy A, Hang Y, Marten M, Raman B: Protein Image Alignment via Piecewise Affine Transformations.
Journal of Computational Biology 2006, 13:614-630.

Li F, Seillier-Moiseiwitsch F: Region-based Statistical Analysis of 2D PAGE Images.
Computational Statistics and Data Analysis 2011, 55:3059-3072.

Otsu N: A threshold selection method from gray level histograms.
IEEE Transactions on Systems, Man and Cybernetics 1979, 9:62-66.

Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning. Springer-Verlag; 2008.

Hardin JW, Hilbe JM: Generalized Linear Models and Extensions. StataCorp LP; 2001.

Akaike H: A new look at the statistical model identification.
IEEE Transactions on Automatic Control 1974, 19:716-723.

Ramsay J: Monotone Regression Splines in Action.
Statistical Science 1988, 3:425-441.

Jin J, Cai T: Estimating the null and the proportion of nonnull effects in large-scale multiple comparisons.

Stressl M, Noe CR, Lachmann B: Influence of image-analysis software on quantitation of two-dimensional gel electrophoresis data.
Electrophoresis 2009, 30:325-328.