Skip to main content

NITPICK: peak identification for mass spectrometry data

Abstract

Background

The reliable extraction of features from mass spectra is a fundamental step in the automated analysis of proteomic mass spectrometry (MS) experiments.

Results

This contribution proposes a sparse template regression approach to peak picking called NITPICK. NITPICK is a Non-greedy, Iterative Template-based peak PICKer that deconvolves complex overlapping isotope distributions in multicomponent mass spectra. NITPICK is based on fractional averagine, a novel extension to Senko's well-known averagine model, and on a modified version of sparse, non-negative least angle regression, for which a suitable, statistically motivated early stopping criterion has been derived. The strength of NITPICK is the deconvolution of overlapping mixture mass spectra.

Conclusion

Extensive comparative evaluation has been carried out and results are provided for simulated and real-world data sets. NITPICK outperforms pepex, to date the only alternate, publicly available, non-greedy feature extraction routine. NITPICK is available as software package for the R programming language and can be downloaded from http://hci.iwr.uni-heidelberg.de/mip/proteomics/.

Background

The reliable extraction of proteomic features from complex biological mixtures is of utmost interest for unraveling the intricate biomolecular interplay at the heart of many systems biology research questions. In this context, mass spectrometry (MS) has become a key technology which provides peptide and protein identification, modification characterization and quantification capabilities. In contrast to gene expression microarray technologies, MS analysis yields a direct view on the whole set of proteins (the proteome) present in the system under investigation and can thus contribute to a richer picture of protein interaction, real-time dynamics and their regulation [1]. MS contributes to clinical research and the diagnosis process [2], it is used to detect, grade and characterize cancer diseases [3], it serves as a general purpose tool for microorganism characterization [4, 5] and provides sensitive and specific means for pharmaceutical quality control.

MS experiments typically contain tens to thousands of spectra, each of which holds intensity information for tens to hundreds of thousands of mass channels. These data stem from a set of different mass analysis technologies, combining chemical separation procedures (chromatography), ionization methods (electrospray ionization, matrix-assisted laser desorption/ionization) and mass analyzers (time-of-flight, quadrupole, ion cyclotron motion). Despite physicochemical preprocessing and the availability of high mass resolution instruments, spectra which stem from complex biochemical mixtures (e.g. cell lysate, blood or serum) frequently exhibit overlapping isotope distributions of independent molecular species. Moreover, in many quantitative MS approaches, these mixtures are present by design and their manual unmixing, quantification and interpretation is tedious or unfeasible.

As a consequence, the automated analysis and interpretation of multicomponent mass spectra is highly desirable. An (incomplete) set of challenges for MS feature extraction includes the sheer data set sizes, mixtures of isotope patterns, the presence of multiple charge states, chemical and detector noise, species-dependent ionization efficiencies, chemical reproducibility and deviations from detector linearity. Among all requirements that derive from these challenges, it is important to emphasize the crucial nature of the feature extraction step: as all subsequent analysis steps rely on the set of extracted features, meaningful biological conclusions are highly dependent on the adequacy and reliability of the feature extraction method.

Apart from few special alternate approaches [6, 7], all automated methods for feature extraction from isotope-resolved mass spectra compare the observed (experimental) spectral pattern to a set of precalculated theoretical isotope patterns. The calculation of isotope patterns is based on the estimation of average stoichiometries for a particular molecular mass (averagine[8] and related methods [9]) or on relative isotope abundance estimation [10] or on protein database-driven mean isotope distribution calculation [11]. The computation of isotope patterns is based on efficient implementations [1214] of Yergey's original polynomial method [15, 16].

Comparison of theoretical and experimental isotope distributions is typically accomplished based on subtractive fitting and peak selection algorithms, attempting to sequentially detect the dominant components in a mixture spectrum. These subset selection methods attempt to determine a small set of basis functions capable of approximating the observed signal well. Facing the infeasibility of an exhaustive search over all possible subsets of explanatory basis functions, they apply greedy search strategies. Here, "greediness" refers to the fact that these approaches consistently overestimate individual feature contributions and are incapable of excluding a basis function once it has been included in the active set. Hence, although providing sparseness, they are not globally optimal. In the context of mixture modeling of mass spectra, these approaches amount to sequential isotope distribution template matching procedures [6, 811, 1722]. Fitting is carried out via χ2 distances [8, 20], least squares [911, 17, 2123], weighted least squares [19], or cross-correlation [18, 24]. The automatic determination of the charge state associated with an isotope pattern present in an experimental spectrum is based on cross-correlation [19, 25] or on dot products in Fourier space [25, 26], exploiting the shift theorem of the Fourier transform. There are only few [27] non-greedy feature selection algorithms and mixture model approaches for MS data [2831]. Among these, Matching[28] and Roussis' method [29] rely on manual preselection of contribution candidates. Sparse non-greedy procedures include pepex[30] and Du's method [31]. The pepex approach is suitable for single charge data and is based on a non-negative sparse regression scheme, with an approximate L0-norm constraint. Du and Angeletti [31] perform data reduction prior to feature extraction and apply a sparseness-promoting variable selection scheme [32]. With the exception of Du's [31] and Kaur's [19] methods, none of the mentioned mixture model approaches provide support for the detection of a sparse set of a priori unknown peptide peaks under an arbitrary set of charge states. Du's method [31] and NITPICK overcome Kaur's greedy iterative weighted least squares fitting approach. In contrast to [31], NITPICK does not rely on a heuristic parameterization and is instead based on statistical model selection, making use of an algorithmically more efficient non-greedy sequential feature selection procedure with a statistically motivated termination criterion. NITPICK was designed to support the calculation of accurate monoisotopic peak lists from raw mass spectra and was specifically tailored to cases where the raw spectra stem from unknown, possibly overlapping experimental isotope patterns of multiple charge states.

The methods section details the mixture modeling approach, fractional averagine for improved stoichiometry estimation and data fitting, and our main contribution, a computationally efficient method for improved non-negative feature selection and the corresponding statistical complexity estimation approach in conjunction with the derivation of a lower bound for early termination. Comparative results on simulated and real-world data sets are given in the results and consequently discussed. Eventually, we conclude and offer perspectives. Derivations of the formulas used in the main article are available in the appendix.

Methods

The NITPICK algorithm (cf. figure 1) models an observed mixture spectrum as a linear combination of theoretical isotope distribution patterns. Statistically, finding a sensible parameterization of this mixture model amounts to a constrained regression problem in which we seek to minimize the raw signal reconstruction error in a least-squares sense while adhering to a set of additional constraints. Such an approach requires reliable underlying isotope patterns, and we propose an improvement for the well-known averagine model to achieve this goal. We subsequently introduce NITPICK's iterative feature selection procedure, which employs a novel, non-greedy isotope distribution selection method and is based on a statistically motivated termination criterion, attempting to eliminate premature or late iteration termination.

Mixture model

We assume that observed spectra are available in a discrete (not necessarily equispaced) mass binning scheme defined by a mass vector m= (m1, m2, ..., m N )T and represent a raw multicomponent mass spectrum by a vector s of size N × 1, where s i corresponds to the abundance observed in the i th mass bin m i . In practical applications, the vector s may also result from preprocessing steps such as relevant region detection [19] and may thus represent only a part of a complete raw spectrum. The basic assumption behind the mixture model approach is that s be a linear combination of mass spectrum abundances of K pure components ϕ i ,

s = k = 1 K c i ϕ i = Φ c . MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacbmGae83CamNaeyypa0ZaaabCaeaacqWGJbWydaWgaaWcbaGaemyAaKgabeaakiabew9aMnaaBaaaleaacqWGPbqAaeqaaOGaeyypa0dcceGae4NPdyKae83yamgaleaacqWGRbWAcqGH9aqpcqaIXaqmaeaacqWGlbWsa0GaeyyeIuoakiabc6caUaaa@404C@
(1)

Each of the concentration coefficients c i , i = 1, ..., K is associated with a column ϕ i of the N × K model matrix Φ. We regard these columns as basis functions and their elements ϕ ji correspond to the mass spectrum abundance expected in the j th mass bin m j of the i th pure component ϕ i .

For the estimation of the concentration vector c, the model matrix Φ has to be available, and in general this is not the case. One hence resorts to approximating the basis functions by a large set of theoretical isotope distributions (i.e. isotope abundance patterns) densely spread over the prespecified mass/charge binning scheme. Effectively, this recasts the original peak picking task into the framework of a feature (i.e. basis function) selection problem.

Model matrix calculation

Given an elemental stoichiometry, the corresponding theoretical isotope distribution is well-defined and can easily be calculated [1215]. Hence, if a prespecified set of stoichiometries of potential pure components is available, the calculation of the respective set of theoretical isotope distributions (including chemical modifications and multiple charge states) is straightforward. These isotope distributions are subsequently convolved with instrument-specific, possibly mass-dependent peak shape functions, yielding the basis functions ϕ i .

Fractional averagine

In many practical applications prior knowledge about potential components is not at hand. Thus, one needs to resort to expected average stoichiometry estimates as a best-effort approximation. In this case, the quality of the feature selection procedure is highly dependent on the quality of the stoichiometry model. We therefore extended the widely used averagine approach [8] to amend its discrete and discontinuous nature, gaining models without mass errors and improved true isotope distribution reconstruction properties. Fractional averagine provides a real-valued element stoichiometry ρ = (ρ 1 ,...,ρ5)T according to the mapping f: 5 between a mass value and the number of element atoms in an averagine (H7.75833C4.9384N1.35777O1.4773S0.0417) molecule. The calculation of the theoretical isotope distribution of ρ is based on the observation that isotope abundances follow a multinomial distribution [33], and that fractional numbers of trials in a multinomial can be modeled as linear interpolation between the probability functions of the multinomials parameterized with the surrounding integers (see appendix A). For computational ease, calculations are carried out in the realm of the corresponding moment generating function (MGF) [34] of the multinomial probability mass function. For the i th stoichiometry element, the MGF given ρ i can be factorized according to

M x ( t 1 , , t k 1 | ρ i ) = [ p 1 e t 1 + + p k 1 e t k 1 + p k ] [ ρ i ] + ( ρ i ρ i ) = M x 1 ( t 1 , , t k 1 | ρ i ) M x 2 ( t 1 , , t k 1 | ( ρ i ρ i ) ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeaabmqaaaqaaiabd2eannaaBaaaleaacqWG4baEaeqaaOWaaeWaaeaacqWG0baDdaWgaaWcbaGaeGymaedabeaakiabcYcaSiablAciljabcYcaSiabdsha0naaBaaaleaacqWGRbWAcqGHsislcqaIXaqmaeqaaOGaeiiFaWNaeqyWdi3aaSbaaSqaaiabdMgaPbqabaaakiaawIcacaGLPaaaaeaacqGH9aqpdaWadaqaaiabdchaWnaaBaaaleaacqaIXaqmaeqaaOGaemyzau2aaWbaaSqabeaacqWG0baDdaWgaaadbaGaeGymaedabeaaaaGccqGHRaWkcqWIVlctcqGHRaWkcqWGWbaCdaWgaaWcbaGaem4AaSMaeyOeI0IaeGymaedabeaakiabdwgaLnaaCaaaleqabaGaemiDaq3aaSbaaWqaaiabdUgaRjabgkHiTiabigdaXaqabaaaaOGaey4kaSIaemiCaa3aaSbaaSqaaiabdUgaRbqabaaakiaawUfacaGLDbaadaahaaWcbeqaamaadmaabaGaeqyWdi3aaSbaaWqaaiabdMgaPbqabaaaliaawUfacaGLDbaacqGHRaWkdaqadaqaaiabeg8aYnaaBaaameaacqWGPbqAaeqaaSGaeyOeI0YaayWaaeaacqaHbpGCdaWgaaadbaGaemyAaKgabeaaaSGaayj84laawUp+aaGaayjkaiaawMcaaaaaaOqaaiabg2da9iabd2eannaaBaaaleaacqWG4baEdaahaaadbeqaaiabigdaXaaaaSqabaGcdaqadaqaaiabdsha0naaBaaaleaacqaIXaqmaeqaaOGaeiilaWIaeSOjGSKaeiilaWIaemiDaq3aaSbaaSqaaiabdUgaRjabgkHiTiabigdaXaqabaGccqGG8baFdaGbdaqaaiabeg8aYnaaBaaaleaacqWGPbqAaeqaaaGccaGLWJVaay5+4daacaGLOaGaayzkaaGaemyta00aaSbaaSqaaiabdIha4naaCaaameqabaGaeGOmaidaaaWcbeaakmaabmaabaGaemiDaq3aaSbaaSqaaiabigdaXaqabaGccqGGSaalcqWIMaYscqGGSaalcqWG0baDdaWgaaWcbaGaem4AaSMaeyOeI0IaeGymaedabeaakiabcYha8naabmaabaGaeqyWdi3aaSbaaSqaaiabdMgaPbqabaGccqGHsisldaGbdaqaaiabeg8aYnaaBaaaleaacqWGPbqAaeqaaaGccaGLWJVaay5+4daacaGLOaGaayzkaaaacaGLOaGaayzkaaaaaaaa@AB1C@
(2)

where p l is the probability of occurrence of the l th isotope l = 1 k p l = 1 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWaaabmaeaacqWGWbaCdaWgaaWcbaGaemiBaWgabeaakiabg2da9iabigdaXaWcbaGaemiBaWMaeyypa0JaeGymaedabaGaem4AaSganiabggHiLdaaaa@3783@ , x = (x1, ..., x k ) T denotes the number of times a particular isotope is chosen l = 1 k x l = ρ i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWaaabmaeaacqWG4baEdaWgaaWcbaGaemiBaWgabeaakiabg2da9iabeg8aYnaaBaaaleaacqWGPbqAaeqaaaqaaiabdYgaSjabg2da9iabigdaXaqaaiabdUgaRbqdcqGHris5aaaa@39DF@ and t = (t1, ..., t k )T is the corresponding variable of the MGF. By rearrangement of the MGFs of all elements, it is possible to separate integer and real-valued contributions, yielding the common averagine model ρ ^ MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaaccmGaf8xWdiNbaKaaaaa@2DAD@ = (ρ1, ρ2, ..., ρ5)T for the integers and the fractional averagine correction ρ ˜ MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaaccmGaf8xWdiNbaGaaaaa@2DAC@ = (ρ1 - ρ1, ρ2 - ρ2, ..., ρ5 - ρ5)T for the remaining fractional masses. The theoretical isotope distribution for ρ ˜ i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafqyWdiNbaGaadaWgaaWcbaGaemyAaKgabeaaaaa@2F2B@ is given by the linear combination of a peak of intensity one at mass zero and the theoretical isotope distribution of the i th averagine element, weighted by 1 - ρ ˜ i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafqyWdiNbaGaadaWgaaWcbaGaemyAaKgabeaaaaa@2F2B@ and ρ ˜ i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafqyWdiNbaGaadaWgaaWcbaGaemyAaKgabeaaaaa@2F2B@ , respectively. Thus, efficient calculation of the theoretical isotope distribution of the stoichiometry ρ ^ MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaaccmGaf8xWdiNbaKaaaaa@2DAD@ is carried out based on the Mercury7 algorithm [14], and the theoretical isotope distribution for the fractional stoichiometry ρ is subsequently obtained with five additional convolution steps.

Basis function selection

Given the set of basis functions Φ = [ϕ1ϕ2 ... ϕ k ], basis function selection and subsequent determination of the contribution coefficients c i provides a solution to eq. (1). Thus, as the modeling parameters and, in particular, the monoisotopic masses for all basis function are known, one can determine which isotope distributions are present and in what abundance (assuming ∑ k ϕ ki = 1).

In practice, basis functions are calculated for each possible monoisotopic mass and each expected charge state, yielding model matrices Φ with K N (in the case of one basis function per mass/charge bin and charge, we have K = n Z N, where n Z corresponds to the number of charge states observable in the experiment; hence, for n Z > 1, there exists an infinite number of solutions for eq. (1)). This is a problem intrinsic to the proposed mixture modeling approach and has been observed previously [23, 28, 30]. The least absolute shrinkage and selection operator (LASSO) [32] enjoys favorable properties of regularization and subset selection. Because the LASSO is capable of shrinking coefficients to exactly zero, it offers a non-greedy way to gain sparse models. The LASSO solution c ^ MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacbmGaf83yamMbaKaaaaa@2D3C@ for equation (1) is given by

c ^ = arg min c { | | s Φ c | | 2 } s . t . i = 1 K | c i | t , MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeqabiqaaaqaaGqadiqb=ngaJzaajaGaeyypa0JagiyyaeMaeiOCaiNaei4zaC2aaCbeaeaacyGGTbqBcqGGPbqAcqGGUbGBaSqaaiab=ngaJbqabaGccqGG7bWEcqGG8baFcqGG8baFcqWFZbWCcqGHsisliiqacqGFMoGrcqWFJbWycqGG8baFcqGG8baFdaahaaWcbeqaaiabikdaYaaakiabc2ha9bqaauaabeqabiaaaeaacqqGZbWCcqGGUaGlcqqG0baDcqGGUaGlaeaadaaeWbqaaiabcYha8jabdogaJnaaBaaaleaacqWGPbqAaeqaaOGaeiiFaWNaeyizImQaemiDaqNaeiilaWcaleaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGlbWsa0GaeyyeIuoaaaaaaaaa@5CEE@
(3)

where t ≥ 0 is a user-defined tuning parameter [31, 32]. Mass spectra intensities s i , basis function values ϕ ji , and basis function contributions c k are strictly non-negative, thus adding a non-negativity constraint to the solution space of c ^ MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacbmGaf83yamMbaKaaaaa@2D3C@ , yielding

c ^ = arg min c { | | s Φ c | | 2 } s . t . i = 1 K | c i | t , c i 0. MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeqabiqaaaqaaGqadiqb=ngaJzaajaGaeyypa0JagiyyaeMaeiOCaiNaei4zaC2aaCbeaeaacyGGTbqBcqGGPbqAcqGGUbGBaSqaaiab=ngaJbqabaGccqGG7bWEcqGG8baFcqGG8baFcqWFZbWCcqGHsisliiqacqGFMoGrcqWFJbWycqGG8baFcqGG8baFdaahaaWcbeqaaiabikdaYaaakiabc2ha9bqaauaabeqabiaaaeaacqqGZbWCcqGGUaGlcqqG0baDcqGGUaGlaeaadaaeWbqaaiabcYha8jabdogaJnaaBaaaleaacqWGPbqAaeqaaOGaeiiFaWNaeyizImQaemiDaqNaeiilaWIaem4yam2aaSbaaSqaaiabdMgaPbqabaGccqGHLjYScqaIWaamcqGGUaGlaSqaaiabdMgaPjabg2da9iabigdaXaqaaiabdUealbqdcqGHris5aaaaaaaaaa@6366@
(4)

For fixed t, this is a quadratic programming problem with linear inequality constraints which can be solved by an active set algorithm, sequentially introducing the inequality constraints and seeking a feasible solution satisfying the Kuhn-Tucker conditions [32, 35, 36]. Equation (4) corresponds to c ^ ( λ ) = arg min c { | | s Φ c | | 2 + λ i = 1 K | c i | } MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacbmGaf83yamMbaKaacqGGOaakcqaH7oaBcqGGPaqkcqGH9aqpcyGGHbqycqGGYbGCcqGGNbWzcyGGTbqBcqGGPbqAcqGGUbGBdaWgaaWcbaGae83yamgabeaakiabcUha7jabcYha8jabcYha8jab=nhaZjabgkHiTGGabiab+z6agjab=ngaJjabcYha8jabcYha8naaCaaaleqabaGaeGOmaidaaOGaey4kaSIaeq4UdW2aaabmaeaacqGG8baFcqWGJbWydaWgaaWcbaGaemyAaKgabeaakiabcYha8bWcbaGaemyAaKMaeyypa0JaeGymaedabaGaem4saSeaniabggHiLdGccqGG9bqFaaa@5995@ with c i ≥ 0 where the parameter t is related to the Lagrangian multiplier λ which determines the number of free parameters df(λ) in the linear model [32, 3638].

Common procedures for the optimal selection of λ or df(λ) are based on the minimization of the prediction error. This involves estimation of training optimism via C p -statistics, the Akaike Information Criterion (AIC), or the Bayesian Information Criterion (BIC) [37]. Alternatively, direct estimation of prediction error can be carried out via cross-validation or generalized cross-validation (GCV) [37]. All these methods require the LASSO trace c ^ MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacbmGaf83yamMbaKaaaaa@2D3C@ (λ l ), where λ l and = { λ 1 , ... , λ | | } MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8NeHWKaeyypa0Jaei4EaSNaeq4UdW2aaSbaaSqaaiabigdaXaqabaGccqGGSaalcqGGUaGlcqGGUaGlcqGGUaGlcqGGSaalcqaH7oaBdaWgaaWcbaGaeiiFaWNae8NeHWKaeiiFaWhabeaakiabc2ha9baa@47D0@ defines the set of LASSO regularization parameters for which the prediction error is calculated. In general, the calculation of the LASSO trace is computationally intensive and it is not clear how the elements of should be selected [36]. Least angle regression (LARS) [39] is an algorithmically different approach to variable selection which can be modified such that the LARS algorithm implements the non-negative LASSO from equation (4). The LASSO-modified LARS is a constructive active set procedure which constructs the LASSO regularization path in a stepwise manner. Denote by A MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXheaaa@3749@ (λ) the set of indices i {1. ..., K} of those ϕ i which are in the active set for a particular choice of λ. Starting from λ = ∞ and letting λ → 0, the algorithm computes non-negative LASSO solutions for all λ for which the active set changes, thus implicitly defining . The LASSO-modified LARS guarantees A MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXheaaa@3749@ (λ j ) ≠ A MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXheaaa@3749@ (λj+1), but it allows for the deletion of previously selected basis functions, and hence | A MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXheaaa@3749@ (λ j ) | need not increase monotonically for increasing j. Basis functions can be required to enter the active set in their predefined directions [39] which allows the implementation of a non-negativity constraint. Necessary matrix inversions are constrained to | A MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXheaaa@3749@ (λ) | × | A MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXheaaa@3749@ (λ)|-sized scatter matrices Φ A ( λ ) T Φ A ( λ ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacceGae8NPdy0aa0baaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqaaiabdsfaubaakiab=z6agnaaBaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeqaaaaa@4451@ and can be implemented as iterative updates, thus the procedure is computationally efficient.

Complexity estimation

It is desirable to terminate active set updates as soon as the basis functions in the active set are able to explain the observed data sufficiently well, i.e. until the increase in explanatory power does not justify the increase in model complexity anymore. We now describe a modification to the non-negative LASSO-modified LARS, which enables us to sequentially build a BIC trace along the LASSO regularization path and to identify minima along this trace. Upon termination, the proposed procedure returns the estimate c ^ A MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacbmGaf83yamMbaKaadaWgaaWcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae4haXheabeaaaaa@38DB@ and the set A = { i | c ^ A i > 0 } MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXhKaeyypa0Jaei4EaSNaemyAaKMaeiiFaWhcbmGaf43yamMbaKaadaWgaaWcbaGae8haXh0aaSbaaWqaaiabdMgaPbqabaaaleqaaOGaeyOpa4JaeGimaaJaeiyFa0haaa@4506@ of active basis functions.

BIC measure

The LARS C p -type risk reestimation formula [39] for optimal selection of λ does not hold under the non-negative LASSO modification. Instead, we recalculate a BIC measure

BIC ( λ ) = 1 σ 2 | | s Φ A ( λ ) c ^ A ( λ ) q | | 2 N . MSE ( λ ) + d f ( λ ) log N , MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeeOqaiKaeeysaKKaee4qamKaeiikaGIaeq4UdWMaeiykaKIaeyypa0tcfa4aaSaaaeaacqaIXaqmaeaacqaHdpWCdaahaaqabeaacqaIYaGmaaaaaOWaaGbaaeaacqGG8baFcqGG8baFieWacqWFZbWCcqGHsisliiqacqGFMoGrdaWgaaWcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae0haXhKaeiikaGIaeq4UdWMaeiykaKcabeaakiqb=ngaJzaajaWaa0baaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPaqaaiabdghaXbaakiabcYha8jabcYha8naaCaaaleqabaGaeGOmaidaaaqaaiabd6eaojabc6caUiabb2eanjabbofatjabbweafjabcIcaOiabeU7aSjabcMcaPaGccaGL44pacqGHRaWkcqWGKbazcqWGMbGzcqGGOaakcqaH7oaBcqGGPaqkcyGGSbaBcqGGVbWBcqGGNbWzcqWGobGtcqGGSaalaaa@722A@
(5)

in each LARS iteration [40]. For the calculation of the unbiased training error MSE(λ) in eq. (5) we require an additional non-negative least squares fit

c ^ A ( λ ) q = arg min c A ( λ ) | | s Φ A ( λ ) c A ( λ ) | | 2 s .  t . ( c ^ A ( λ ) q ) i 0. MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeqabiqaaaqaaGqadiqb=ngaJzaajaWaa0baaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqaaiabdghaXbaakiabg2da9iGbcggaHjabckhaYjabcEgaNnaaxababaGagiyBa0MaeiyAaKMaeiOBa4galeaacqWFJbWydaWgaaadbaGae4haXhKaeiikaGIaeq4UdWMaeiykaKcabeaaaSqabaGccqGG8baFcqGG8baFcqWFZbWCcqGHsisliiqacqqFMoGrdaWgaaWcbaGae4haXhKaeiikaGIaeq4UdWMaeiykaKcabeaakiab=ngaJnaaBaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeqaaOGaeiiFaWNaeiiFaW3aaWbaaSqabeaacqaIYaGmaaaakeaafaqabeqacaaabaGaee4CamNaeiOla4IaeeiiaaIaeeiDaqNaeiOla4cabaWaaeWaaeaacuWGJbWygaqcamaaDaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeaacqWGXbqCaaaakiaawIcacaGLPaaadaWgaaWcbaGaemyAaKgabeaakiabgwMiZkabicdaWiabc6caUaaaaaaaaa@795E@
(6)

The noise variance σ2 in eq. (5) is estimated as the mean residual sum of squares of a low-bias non-negative least squares estimate [37].

Estimation of df(λ)

The calculation of BIC(λ) in eq.(5) requires an estimate for the degrees of freedom df(λ), which can be obtained via the generalized degrees of freedom (GDF) [38]. The GDF of an NN-LASSO-modified LARS model based on an active set A MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXheaaa@3749@ (λ) are given by

GDF ( λ ) = 1 σ 2 s T Φ A ( λ ) c ^ A ( λ ) q . MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaee4raCKaeeiraqKaeeOrayKaeiikaGIaeq4UdWMaeiykaKIaeyypa0tcfa4aaSaaaeaacqaIXaqmaeaacqaHdpWCdaahaaqabeaacqaIYaGmaaaaaGqadOGae83Cam3aaWbaaSqabeaacqWGubavaaacceGccqGFMoGrdaWgaaWcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae0haXhKaeiikaGIaeq4UdWMaeiykaKcabeaakiqb=ngaJzaajaWaa0baaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPaqaaiabdghaXbaakiabc6caUaaa@549C@
(7)

Because the coefficients ( c ^ A ( λ ) q ) i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWaaeWaaeaaieWacuWFJbWygaqcamaaDaaaleaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeaacqWGXbqCaaaakiaawIcacaGLPaaadaWgaaWcbaGaemyAaKgabeaaaaa@40C7@ > 0 are non-negative, the estimate c ^ A ( λ ) q MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacbmGaf83yamMbaKaadaqhaaWcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae4haXhKaeiikaGIaeq4UdWMaeiykaKcabaGaemyCaehaaaaa@3DAD@ solves

c ^ A ( λ ) q = arg min c ^ A ( λ ) q { | | s Φ c A ( λ ) q | | 2 + λ i = 1 K ( c A ( λ ) q ) i } MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacbmGaf83yamMbaKaadaqhaaWcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae4haXhKaeiikaGIaeq4UdWMaeiykaKcabaGaemyCaehaaOGaeyypa0JagiyyaeMaeiOCaiNaei4zaC2aaCbeaeaacyGGTbqBcqGGPbqAcqGGUbGBaSqaaiqb=ngaJzaajaWaa0baaWqaaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqaaiabdghaXbaaaSqabaGcdaGadaqaaiabcYha8jabcYha8jab=nhaZjabgkHiTGGabiab9z6agjab=ngaJnaaDaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeaacqWGXbqCaaGccqGG8baFcqGG8baFdaahaaWcbeqaaiabikdaYaaakiabgUcaRiabeU7aSnaaqahabaWaaeWaaeaacqWGJbWydaqhaaWcbaGae4haXhKaeiikaGIaeq4UdWMaeiykaKcabaGaemyCaehaaaGccaGLOaGaayzkaaWaaSbaaSqaaiabdMgaPbqabaaabaGaemyAaKMaeyypa0JaeGymaedabaGaem4saSeaniabggHiLdaakiaawUhacaGL9baaaaa@7935@
(8)

which is differentiable with respect to ( c ^ A ( λ ) q ) i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWaaeWaaeaaieWacuWFJbWygaqcamaaDaaaleaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeaacqWGXbqCaaaakiaawIcacaGLPaaadaWgaaWcbaGaemyAaKgabeaaaaa@40C7@ . Setting the derivative to zero, we obtain

c ^ A ( λ ) q = ( Φ A ( λ ) T Φ A ( λ ) ) 1 ( Φ A ( λ ) T s 1 2 λ 1 A ( λ ) ) . MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacbmGaf83yamMbaKaadaqhaaWcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae4haXhKaeiikaGIaeq4UdWMaeiykaKcabaGaemyCaehaaOGaeyypa0JaeiikaGccceGae0NPdy0aa0baaSqaaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqaaiabdsfaubaakiab9z6agnaaBaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeqaaOGaeiykaKYaaWbaaSqabeaacqGHsislcqaIXaqmaaGccqGGOaakcqqFMoGrdaqhaaWcbaGae4haXhKaeiikaGIaeq4UdWMaeiykaKcabaGaemivaqfaaOGae83CamNaeyOeI0scfa4aaSaaaeaacqaIXaqmaeaacqaIYaGmaaGccqaH7oaBieqacqaFXaqmdaWgaaWcbaGae4haXhKaeiikaGIaeq4UdWMaeiykaKcabeaakiabcMcaPiabc6caUaaa@68F8@
(9)

Hence, given an active set A MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXheaaa@3749@ (λ), the generalized degrees of freedom from eq. (7) can be written as

GDF ( λ ) = s T 1 σ 2 ( Φ A ( λ ) c ^ A ( λ ) q ) = s T 1 σ 2 Φ A ( λ ) ( Φ A ( λ ) T Φ A ( λ ) ) 1 ( Φ A ( λ ) T s 1 2 λ 1 A ( λ ) ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeaabmqaaaqaaiabbEeahjabbseaejabbAeagjabcIcaOiabeU7aSjabcMcaPaqaaiabg2da9Gqadiab=nhaZnaaCaaaleqabaGaemivaqfaaKqbaoaalaaabaGaeGymaedabaGaeq4Wdm3aaWbaaeqabaGaeGOmaidaaaaakiabcIcaOGGabiab+z6agnaaBaaaleaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeqaaOGaf83yamMbaKaadaqhaaWcbaGae0haXhKaeiikaGIaeq4UdWMaeiykaKcabaGaemyCaehaaOGaeiykaKcabaGaeyypa0Jae83Cam3aaWbaaSqabeaacqWGubavaaqcfa4aaSaaaeaacqaIXaqmaeaacqaHdpWCdaahaaqabeaacqaIYaGmaaaaaOGae4NPdy0aaSbaaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPaqabaGccqGGOaakcqGFMoGrdaqhaaWcbaGae0haXhKaeiikaGIaeq4UdWMaeiykaKcabaGaemivaqfaaOGae4NPdy0aaSbaaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPaqabaGccqGGPaqkdaahaaWcbeqaaiabgkHiTiabigdaXaaakiabcIcaOiab+z6agnaaDaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeaacqWGubavaaGccqWFZbWCcqGHsisljuaGdaWcaaqaaiabigdaXaqaaiabikdaYaaakiabeU7aSHqabiab8fdaXmaaBaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeqaaOGaeiykaKcaaaaa@8D6D@
(10)

which is monotonously increasing for decreasing λ (see appendix B for a proof).

Optimal termination

The minimal possible training error of the model is attained when all variables are in the active set, in which case the respective coefficients c ^ q MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacbmGaf83yamMbaKaadaahaaWcbeqaaiabdghaXbaaaaa@2ED4@ are given by c ^ q MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacbmGaf83yamMbaKaadaahaaWcbeqaaiabdghaXbaaaaa@2ED4@ = arg min c ||s - Φ c||2 subject to c i ≥ 0, and the corresponding error is MSE = 1 N | | s Φ c ^ q | | 2 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeeyta0Kaee4uamLaeeyrauKaeyypa0tcfa4aaSaaaeaacqaIXaqmaeaacqWGobGtaaGccqGG8baFcqGG8baFieWacqWFZbWCcqGHsisliiqacqGFMoGrcuWFJbWygaqcamaaCaaaleqabaGaemyCaehaaOGaeiiFaWNaeiiFaW3aaWbaaSqabeaacqaIYaGmaaaaaa@40F5@ . Thus, a lower bound for BIC(λ) is given by

BIC m i n ( λ ) = N σ 2 MSE + GDF ( λ ) log N MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeeOqaiKaeeysaKKaee4qam0aaSbaaSqaaiabd2gaTjabdMgaPjabd6gaUbqabaGccqGGOaakcqaH7oaBcqGGPaqkcqGH9aqpjuaGdaWcaaqaaiabd6eaobqaaiabeo8aZnaaCaaabeqaaiabikdaYaaaaaGccqqGnbqtcqqGtbWucqqGfbqrcqGHRaWkcqqGhbWrcqqGebarcqqGgbGrcqGGOaakcqaH7oaBcqGGPaqkcyGGSbaBcqGGVbWBcqGGNbWzcqWGobGtaaa@4CDE@
(11)

(see appendix C for a proof). In general, BIC(λ) will have several minima for increasing values of GDF(λ), hence we track the minimum BIC(λ min ) through the NN-LASSO-modified LARS cycles and accept λ min as a minimizer as soon as the lower bound BIC min (λ) of a subsequent LARS step exceeds the current best estimate BIC(λ min ), i.e. BIC(λ min ) < BIC min (λ) (see figure 2).

Figure 1
figure 1

NITPICK workflow overview. Raw spectrum preprocessing, relevant region detection, region-wise peak picking, merging of detected peaks and peak list postprocessing. At the heart of the method lies an iterative feature selection procedurecontrolled by a statistical termination criterion, as illustrated by the large box in the center. As a tightlyinterconnected prerequisite to the main workflow, the column on the left depicts the steps required for thecalculation of the regression model matrix. Numbers in parentheses give the manuscript sections in whichthe specific steps are detailed.

Figure 2
figure 2

Efficient automated determination of the number of components in an area withoverlapping peaks using the BIC min (λ) termination criterion. The mean squared error MSE (scaled, dotted) decreases monotonically over the LARS steps and the generalized degrees of freedom GDF(λ) (dashed) increase monotonically. The resulting BIC(λ) measure(solid) exhibits a minimum BIC(λ 9 ) in the 9th LARS step and λ 9 is accepted as a minimizer because thelower bound BIC min 10 ) exceeds BIC(λ 9 ) in the 10th LARS step.

Regression on selected models

The sum constraint in equation (4) is ultimately responsible for the sparseness property of the LASSO. Its regularizing effect is similar to the one of the regularization term found in ridge regression, especially with respect to the fact that all LASSO estimates c ^ i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafm4yamMbaKaadaWgaaWcbaGaemyAaKgabeaaaaa@2EBB@ , i = 1, ..., K are subject to shrinkage [32, 37] and represent biased versions of the least squares estimates. Given an active set A MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXheaaa@3749@ , the shrinkage bias on the c ^ i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafm4yamMbaKaadaWgaaWcbaGaemyAaKgabeaaaaa@2EBB@ can effectively be removed by introducing a subsequent non-negative least squares regression step after the basis functions have been selected by the LASSO procedure [32]. This also holds true for the NN-LASSO-modified LARS procedure, and the corresponding unbiased quantification estimate c ^ A q MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacbmGaf83yamMbaKaadaqhaaWcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae4haXheabaGaemyCaehaaaaa@3A47@ is given by equation (6) with A MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXheaaa@3749@ (λ) = A MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXheaaa@3749@ .

Postprocessing

The estimate c ^ MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacbmGaf83yamMbaKaaaaa@2D3C@ is subject to modeling errors and these shortcomings lead to suboptimal NN-LASSO-modified LARS estimates and active sets. In particular, the estimation depends on the match between the observed and theoretical peak shape function. Especially in high mass resolution experiments, one can frequently observe spurious peak detections in bins directly adjacent to monoisotopic mass bins of true peaks [30]. A possible remedy is a local maximum detection implemented as a postprocessing filter φ(·) applied to the active basis function index set A MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXheaaa@3749@ :

A = φ ( A | G ) = { j A | ( c ^ A q ) j = max { ( c ^ A q ) l | l ν G ( j ) } } MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeWabiqaaaqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiqb=bq8bzaafaGaeyypa0JaeqOXdOMaeiikaGIae8haXhKaeiiFaWNaem4raCKaeiykaKcabaGaeyypa0Jaei4EaSNaemOAaOMaeyicI4Sae8haXhKaeiiFaWNaeiikaGIafm4yamMbaKaadaqhaaWcbaGae8haXheabaGaemyCaehaaOGaeiykaKYaaSbaaSqaaiabdQgaQbqabaGccqGH9aqpcyGGTbqBcqGGHbqycqGG4baEcqGG7bWEcqGGOaakcuWGJbWygaqcamaaDaaaleaacqWFaeFqaeaacqWGXbqCaaGccqGGPaqkdaWgaaWcbaGaemiBaWgabeaakiabcYha8jabdYgaSjabgIGiolabe27aUnaaBaaaleaacqWGhbWraeqaaOGaeiikaGIaemOAaOMaeiykaKIaeiyFa0NaeiyFa0haaaaa@6D36@
(12)

where ν G ( j ) = { k A | | b k b j | G 1 2 } MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeqyVd42aaSbaaSqaaiabdEeahbqabaGccqGGOaakcqWGQbGAcqGGPaqkcqGH9aqpcqGG7bWEcqWGRbWAcqGHiiIZt0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacqWFaeFqcqGG8baFcqGG8baFcqWGIbGydaWgaaWcbaGaem4AaSgabeaakiabgkHiTiabdkgaInaaBaaaleaacqWGQbGAaeqaaOGaeiiFaWNaeyizImAcfa4aaSaaaeaacqWGhbWrcqGHsislcqaIXaqmaeaacqaIYaGmaaGccqGG9bqFaaa@55B8@ defines an m/z-neighborhood of size G around each peak and b j is the mass/charge bin index of the monoisotopic mass m0 of the j th theoretical isotope distribution ϕ j . If A A MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXhKaeyiyIKRaf8haXhKbauaaaaa@3AD3@ , c ^ A ( λ ) q MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacbmGaf83yamMbaKaadaqhaaWcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae4haXhKaeiikaGIaeq4UdWMaeiykaKcabaGaemyCaehaaaaa@3DAD@ is reestimated using eq. (6) with A ( λ ) = A MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXhKaeiikaGIaeq4UdWMaeiykaKIaeyypa0Jaf8haXhKbauaaaaa@3D78@.

Results

Stoichiometry models

The fractional averagine stoichiometry model was compared against the classical averagine model based on the analysis of their respective approximation errors using simulated theoretical peptide isotope distributions.

Data Set

All UniProt (version 51.4.) [41] human proteins were subjected to in silico tryptic digestion. For each of the R digestion product peptides P r , r { 1 , ... , R } MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacbiGae8huaa1aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabciab+Ts8YbqabaGccqGGSaalcqGFReVCcqGHiiIZcqGG7bWEcqaIXaqmcqGGSaalcqGGUaGlcqGGUaGlcqGGUaGlcqGGSaalcqWGsbGucqqG9bqFaaa@470A@ , exact element stoichiometries ρ r x MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaaccmGae8xWdi3aa0baaSqaaiabdkhaYbqaaiabdIha4baaaaa@30B0@ and exact theoretical isotope distributions d r x MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacbmGae8hzaq2aa0baaSqaaiabdkhaYbqaaiabdIha4baaaaa@3041@ were calculated. Peptides with monoisotopic masses above m/z 5000 were discarded.

Comparison of deviations

Classical and fractional averagine were used to estimate approximate element stoichiometries ρ ^ r MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaaccmGaf8xWdiNbaKaadaWgaaWcbaGaemOCaihabeaaaaa@2F46@ and ρ r MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaaccmGaf8xWdiNbaKaadaWgaaWcbaGaemOCaihabeaaaaa@2F46@ , respectively, for all peptides P r MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemiuaa1aaSbaaSqaaGqaciab=jhaYbqabaaaaa@2E9E@ in the data set. Based on ρ ^ r MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaaccmGaf8xWdiNbaKaadaWgaaWcbaGaemOCaihabeaaaaa@2F46@ and ρ r MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaaccmGaf8xWdiNbaKaadaWgaaWcbaGaemOCaihabeaaaaa@2F46@ , the corresponding theoretical isotope distribution intensity vectors d ^ r MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacbmGaf8hzaqMbaKaadaWgaaWcbaacbiGae4NCaihabeaaaaa@2EDD@ and d r MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacbmGaf8hzaqMbaKaadaWgaaWcbaacbiGae4NCaihabeaaaaa@2EDD@ were calculated. Figure 3 shows the cumulative distribution of the squared differences between the classical averagine and the true theoretical isotope distribution intensity vectors ( | | d ^ r d r x | | 2 2 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeiiFaWNaeiiFaWhcbmGaf8hzaqMbaKaadaWgaaWcbaGaemOCaihabeaakiabgkHiTiab=rgaKnaaDaaaleaacqWGYbGCaeaacqWG4baEaaGccqGG8baFcqGG8baFdaqhaaWcbaGaeGOmaidabaGaeGOmaidaaaaa@3C49@ , dashed black), and fractional averagine and the true theoretical isotope distribution intensity vectors ( | | d r d r x | | 2 2 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeiiFaWNaeiiFaWhcbmGae8hzaq2aaSbaaSqaaiabdkhaYbqabaGccqGHsislcqWFKbazdaqhaaWcbaGaemOCaihabaGaemiEaGhaaOGaeiiFaWNaeiiFaW3aa0baaSqaaiabikdaYaqaaiabikdaYaaaaaa@3C39@ , solid red).

Figure 3
figure 3

Comparison of the impact of averagine and fractional averagine stoichiometry estimationerrors on the estimation of theoretical isotope distributions. The cumulative histograms of least squares deviations from the true theoretical isotope distribution illustrate the superior overall performance of fractional averagine (solid line) compared to Senko's classical averagine (dashed line): fractional averagine causes a 17% decrease in mean squared error magnitude.

Peak picking

For peak picking/feature extraction performance evaluation, we determine representative peak picking statistics: we calculate accuracy, sensitivity, specificity, and positive and negative predictive values on simulation data. Further, and in contrast to previous contributions, we explicitly perform manual validation on a real-world data set.

Data sets

Simulation data sets

For the simulation, all UniProt (version 51.4.) [41] human protein sequences were subjected to in silico tryptic digestion. Simulation sets were generated by random drawing of digestion product peptides and intensities. To ensure a fair comparison with the pepex procedure (which was selected for benchmarking as the only publicly available procedure implementing non-greedy feature extraction) which is limited to singly charged data sets, all simulated peptide were endowed with a single charge. Mercury7 [14] was used for the calculation of the respective theoretical isotopic distributions. After convolution with an m/z-dependent Gaussian aperture function [42], intensity-weighted linear combinations of peptide spectra were calculated and a Poisson noise model (see appendix D) was applied to obtain spectra of different signal to noise (SNR) ratios. Simulations were performed in the densely populated m/z 500–700 range (see Additional file 1 for the data sets).

Real-world data set

Experiments on real-world data were performed using Bovine Serum Albumin (BSA) LC/(ESI-)MS calibration data. The data set was acquired on a QSTAR XL mass spectrometer (Applied Biosystems/MDS Sciex) equipped with microsale capillary HPLC system (Famos Autosampler, LC packings, Agilent 1100 HPLC pump). A mixture spectrum with many overlapping peaks was obtained by integration of the LC/MS data set over the retention time domain (see Additional files 2 and 3). Peak identification was carried out in the m/z 500–700 range and peak shape functions were modeled according to mass-dependent Gaussian distributions with standard deviations σ(m/z) = 0.005 m/z [42].

Performance estimation

We characterize peak picking performance based on a set of measures from statistical test theory, all of which depend on the availability of the numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN).

Ground truth is based on knowledge of the complete set of peptide signals present in a mass spectrum. For simulated data sets, this information is available. In real-world experiments, the definition of ground truth is complicated by sample complexity, stochastic sample modification, non-peptidic components and limited dynamic range. As a consequence, TNs, FNs and the overall number of true peaks are not available for real-world data, limiting the available statistical measures to positive predictive values and the ratio of true positives (sensitivity ratios).

Nevertheless, we can determine the number of TPs and FPs in both cases: we check whether a detected peak really exists and if it has been assigned its correct monoisotopic mass m0 and charge z. If so, it is counted as true positive (TP) or, otherwise, as false positive (FP).

Simulation data

As the complete set of simulated peaks is known, the remaining set of undetected peaks can be determined and its members are counted as false negatives (FN). With the true number of positives and negatives available the calculation of the number of true negatives (TN) is straightforward, thus enabling the use of related statistical test error measures for performance characterization:

  • accuracy (ACC) measures the rate of correct peak vs. no peak decisions, i.e. ACC = TP + TN TN + FP + TP + FN MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeeyqaeKaee4qamKaee4qamKaeyypa0tcfa4aaSaaaeaacqqGubavcqqGqbaucqGHRaWkcqqGubavcqqGobGtaeaacqqGubavcqqGobGtcqGHRaWkcqqGgbGrcqqGqbaucqGHRaWkcqqGubavcqqGqbaucqGHRaWkcqqGgbGrcqqGobGtaaaaaa@41E4@

  • the negative predictive value (NPV) gives the rate at which there is no peak at positions where the procedure was unable to find a peak, NPV = TN TN + FN MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeeOta4KaeeiuaaLaeeOvayLaeyypa0tcfa4aaSaaaeaacqqGubavcqqGobGtaeaacqqGubavcqqGobGtcqGHRaWkcqqGgbGrcqqGobGtaaaaaa@38B2@

  • the positive predictive value (PPV) measures the rate of correct peak detections among all peaks detected by the procedure, PPV = TP TP + FP MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeeiuaaLaeeiuaaLaeeOvayLaeyypa0tcfa4aaSaaaeaacqqGubavcqqGqbauaeaacqqGubavcqqGqbaucqGHRaWkcqqGgbGrcqqGqbauaaaaaa@38C2@

  • sensitivity (SE) measures the method's ability to detect a peak if it exists, SE = TP TP + FN MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaee4uamLaeeyrauKaeyypa0tcfa4aaSaaaeaacqqGubavcqqGqbauaeaacqqGubavcqqGqbaucqGHRaWkcqqGgbGrcqqGobGtaaaaaa@377B@

  • specificity (SP) measures the method's ability to correctly identify the absence of peaks in the spectrum, SP = TN TN + FP MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaee4uamLaeeiuaaLaeyypa0tcfa4aaSaaaeaacqqGubavcqqGobGtaeaacqqGubavcqqGobGtcqGHRaWkcqqGgbGrcqqGqbauaaaaaa@378D@

All measures have been computed with and without the application of postprocessing.

Real-world data

Resorting to LC/MS data and creating a semi-artificial data set by integration over the retention time domain was motivated by the fact that this approach yields a data set accessible to human manual validation. With LC resolution power available to the human expert (and resorting to comparatively simple mixtures), all peaks detected in the integrated mixture can still be manually verified. Exemplary peak picking results are illustrated below.

Comparative results

Pepex

We chose to compare NITPICK to a conceptually similar, model-based approach called pepex [30]. In contrast to model-free approaches and in accordance with NITPICK, pepex models observed spectra based on a linear mixture model, which is augmented by a complexity constraint. It uses the averagine model to describe unknown features and is capable of terminating its feature selection routine after a sufficient number of basis functions has been selected. However, as the publicly available implementation of the pepex approach is limited to charge state z = 1 data sets, NITPICK comparison against pepex was limited to the simulated data set.

For the analysis, the pepex algorithm was tailored to the problem at hand: its parameters were heavily optimized to maximize peak picking performance on the simulation data set. As a consequence, the reported results underestimate the pepex generalization error and overestimate its performance (see Additional file 4). For NITPICK, no specific parameter optimization was carried out, postprocessing was kept to a minimum (G = 3), and the reported results are representative (see Additional file 5).

MarkerView

We also compared NITPICK's ability to extract peak information from a retention time integrated mixture spectrum against the proprietary MarkerView application (Applied Biosystems/MDS Sciex, Concord, Canada) version 1.2, which includes an LC/MS peak picking algorithm. In contrast to NITPICK, MarkerView was provided with the original LC/MS data set and thus had retention time information available. Peak picking was carried out in the m/z 400–1400 range and detected peaks were manually validated (see Additional files 6 and 7).

Discussion

Stoichiometry models

In comparison (see figure 3, classical averagine in dashed black, fractional averagine in solid red), Senko's classical averagine [25] features a larger number of very small deviances from the truth than fractional averagine. This is caused by the rounding to integers property of the classical approach, yielding exact models more often. At the same time, the deviance distribution of the fractional averagine model has a significantly lighter tail, i.e. the model generates significantly less stoichiometries whose theoretical isotope distributions have large deviations. The cumulative distribution based on the fractional averagine model approaches 1 more quickly, and its use yields an overall decrease in theoretical isotope distribution deviations. This finding is supported by the corresponding one-sided non-parametric Mann-Whitney test (p < 2.6 × 10-11). Because the overall impact on the peak picking performance depends on the squared mean error magnitude (7.6 × 10-4 for classical averagine, 6.3 × 10-4 for fractional averagine, corresponding to a 17% decrease for fractional averagine), fractional averagine clearly is the preferable model.

Peak picking

Simulation data set

Figure 4 shows the results for the peak detection performance analysis. As expected, ACC, NPV and PPV improve with increasing SNR. Postprocessing causes a decrease in NPV and an increase in PPV for all SNR levels as the removal of spurious peaks decreases FP but also, erroneously, increases FN. The ACC plot (top left) illustrates the fact that NITPICK is successful at simultaneously maximizing PPV and NPV. Postprocessing can then be used to trade specificity for sensitivity as supported by the sensitivity-specificity trace in figure 4 (bottom right). Here, each dot marks sensitivity and specificity of a given NITPICK postprocessing parameterization. Lines connect points of different SNRs. As expected, the introduction of a postprocessing step increases specificity and decreases sensitivity. Further analysis of FNs in the simulated data reveals that false negatives are predominantly due to low-intensity components in complex mixtures (data not shown).

Figure 4
figure 4

Evaluation and comparison with the pepex algorithm on simulated data. Accuracy (top left), negative predictive values (top right), positive predictive values (bottom left) and sensitivity-specificity traces (bottom right). Plots show NITPICK results in solid red, NITPICK resultswithout postprocessing in dashed blue and pepex results (optimized, see text) in dashed-dotted black.NITPICK is clearly superior in terms of accuracy, specificity and sensitivity.

Comparative results

In comparison with pepex, NITPICK exhibits better results with respect to all statistical measures in figure 4. It is especially obvious that pepex suffers from a severe increase in false positives (FPs) for very low SNR situations, yielding significant decreases in accuracy (ACC) and specificity (SP). For PPV, although the pepex approach outperforms NITPICK when no postprocessing is applied, it is inferior to the full NITPICK algorithm with simple spurious peak removal corresponding to eq. (12). With respect to sensitivity (SE) and specificity (SP), figure 4 reveals constant high (above 0.99) and superior specificity values for NITPICK at greatly increased sensitivity. Thus one can conclude that the NITPICK algorithm is more sensitive than pepex and, at the same time, provides picked peaks with higher confidence.

Real-world data set

We give peak picking illustrations for the mass ranges m/z 507–525 (with a zoom on m/z 518–525), m/z 636–646, m/z 695–725 and m/z 775–782, detailing positive and negative peak picking performance aspects. In the m/z 507–525 mass range (figure 5), all picked peaks could be verified, including the monoisotopic masses of the mixture distribution with components located at m/z 523.23 (z=3) and m/z 523.82 (z=5). Upon re-examination of the raw data, we detected a missed low-intensity peak at m/z 515.76.

Figure 5
figure 5

Peak picking in the m/z 507-525 mass range. Illustration of observed (top) and reconstructed (bottom) spectra. All detected peaks could be confirmed,including the monoisotopic masses of the mixture distribution with components located at m/z 523.23(z=3) and m/z 523.82 (z=5).

Figure 6 zooms onto two cases of overlapping isotope distributions in the m/z 518–525 mass range. At m/z 518.22 and m/z 519.11 NITPICK resolves two distinct monoisotopic masses, in spite of their unfavorable mass distance. Although the second isotope peak of the doubly charged ion with monoisotopic mass m/z 518.22 exhibits a heavy overlap with the monoisotopic peak of the ion at m/z 519.11, NITPICK is still able to correctly detect the monoisotopic peaks of the two isotope distributions. NITPICK also separates two isotope distributions located at m/z 523.23 (z=3) and m/z 523.82 (z=4). The detection of the monoisotopic mass at m/z 523.82 is particularly non-trivial because of its heavy overlap with an isotope peak of the isotope distribution located at m/z 523.23 and also because the detected monoisotopic mass peak at m/z 523.82 is not the most abundant peak within its isotope distribution.

Figure 6
figure 6

Zoom on the m/z 518–525 mass range. NITPICK proves capable of resolving overlapping isotope distributions and assigning correct monoisotopic masses for the distributions located at m/z 518.22 and 519.11 and at m/z 523.23 and 523.82. See text fordetails.

In the m/z 636–646 mass range (figure 7) we observe an example of incomplete unmixing: the isotope distribution (z=3) with monoisotopic mass located at m/z 636.29 heavily overlaps the distribution (z=3) located at m/z 636.64 (left triangle marker). The overlap proves inseparable and the monoisotopic mass of the second distribution is wrongly detected at m/z 636.96. Further, due to conservative noise level/complexity estimation, the isotope distribution located at m/z 642.33 (right triangle marker) is not detected. Note that in both of the correctly detected distributions located at m/z 636.29 and m/z 639.65, the monoisotopic mass peak does not correspond to the most prominent peak.

Figure 7
figure 7

Peak picking results in the m/z 636–646 mass range. Illustration of observed (top) and reconstructed (bottom) spectra. At m/z 636.64 and m/z 636.96 we observe incomplete unmixing: The isotope distribution (z=3) with monoisotopic mass m0 located at m/z 636.29 heavily overlaps the distribution (z=3) with m0 = 636.64 m/z (left triangle marker). The overlapproves inseparable and the monoisotopic mass of the second distribution is wrongly detected at m/z 636.96. Further, due to conservative noise level/complexity estimation, the isotope distribution located at m/z642.33 (right triangle marker) is not detected. Note that in both of the distributions located at m/z 636.29and m/z 639.65, the monoisotopic mass peak does not correspond to the most intensive peak.

In the m/z 695–725 mass range (figure 8), with one exception, all detected peaks could be verified. The wrongly detected peak at m/z 714.29 corresponds to the first isotope peak of the isotope distribution located at m/z 713.78 (z = 2). Especially in the m/z 718 to m/z 724 region the algorithm proves capable of resolving nontrivial low-intensity mixtures.

Figure 8
figure 8

Peak picking in the m/z 695–725 mass range. Illustration of observed (top) and reconstructed (bottom) spectra. With a single exception, all detectedpeaks could be manually confirmed. The peak detected at m/z 714.29 corresponds to the first isotope peakof the isotope distribution located at m/z 713.78 (z=2). In the m/z 718–724 region the algorithm provescapable of resolving nontrivial low-intensity mixtures.

In the m/z 775–782 range (figure 9), the separation of two heavily overlapping isotope distribution clearly illustrates the benefits of NITPICK's intensity model-based approach to the peak picking/feature extraction problem: the second isotope peak of the isotope distribution located at m/z 779.32 (z=2) and the monoisotopic peak of the distribution located at m/z 780.35 (z=2) overlap completely and can only be distinguished by taking intensity information into account.

Overall, the results obtained on real-world data are in agreement with simulation results: after manual validation of 192 peaks detected in the real-world dataset, we observe 127 true positives, yielding a positive predictive value of PPV = 66.15%.

Figure 9
figure 9

Observed (top) and reconstructed (bottom) mass spectrum in the m/z 775–782 range. The separation of two heavily overlapping isotope distribution clearly illustrates the benefits of NITPICK's intensity model-based approach to the peak picking/feature extraction problem: the second isotope peak ofthe isotope distribution located at m/z 779.32 (charge 2) and the monoisotopic peak of the distributionlocated at m/z 780.35 (charge 2) are exactly superimposed and can only be distinguished by takingintensity information into account.

Comparison with MarkerView

On the BSA data set, MarkerView detected 388 peaks, for 96 (24.7%) of which charge state information was available. Peaks without charge state assignment were counted as true peaks if their detected mass/charge ratio was correct. This resulted in 205 true positives for 82 (40.0%) of which charge state information was available. In comparison to NITPICK, this yields a sensitivity ratio of SER = SE M a r ker V i e w SE N I T P I C K = 205 127 = 1.61 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaee4uamLaeeyrauKaeeOuaiLaeyypa0tcfa4aaSaaaeaacqqGtbWucqqGfbqrdaWgaaqaaiabd2eanjabdggaHjabdkhaYjGbcUgaRjabcwgaLjabckhaYjabdAfawjabdMgaPjabdwgaLjabdEha3bqabaaabaGaee4uamLaeeyrau0aaSbaaeaacqWGobGtcqWGjbqscqWGubavcqWGqbaucqWGjbqscqWGdbWqcqWGlbWsaeqaaaaakiabg2da9KqbaoaalaaabaGaeGOmaiJaeGimaaJaeGynaudabaGaeGymaeJaeGOmaiJaeG4naCdaaOGaeyypa0JaeGymaeJaeiOla4IaeGOnayJaeGymaedaaa@5708@ and a positive predictive value of PPV = 0.53.

As expected, with retention time information available, MarkerView manages to detect a significantly larger number of peaks. Surprisingly, though, retention time information did not contribute to an increased PPV. The partial lack of charge state information also caused the performance interpretation to favor MarkerView: for peaks with correct mass/charge ratio, we assumed completely error-free charge state assignments, which is unlikely to hold true in reality. In contrast, in absence of retention time information, NITPICK delivered charge state information for each and every peak and peaks were counted as true positives if and only if their assigned charge state was correct. MarkerView's PPV and SER are subject to overestimation, whereas NITPICK's PPV is not. Even under this pro-MarkerView bias, if joint maximization of PPV and sensitivity is desired, NITPICK arguably proved competitive with MarkerView: despite the 1.6-fold increase in sensitivity, only slightly more than half of the peaks reported by MarkerView are true positives.

Analysis CPU time on the real-world spectrum was 114s on a 2 GHz AMD Opteron machine. Measurements are based on native, interpreted R code. Preliminary tests with an in-house C++ implementation (to be published elsewhere) yielded a speed increase by a factor of ≈ 20.

Conclusion and perspectives

Conclusions

We present NITPICK, an iterative, non-greedy, globally optimal mixture modeling approach for feature extraction from multicomponent mass spectra. The calculation of the set of explanatory theoretical isotope distributions is based on fractional averagine, a mass error-free extension to the well-known averagine [8] model. Subsequent feature selection is driven by a modified least angle regression [8] algorithm for which we derived a suitable, statistically motivated early stopping criterion. Experiments show that NITPICK is able to unmix and deconvolve complex mixture mass spectra. The algorithm was thoroughly evaluated on simulated and real-world data sets and was found to perform better than a conceptually similar algorithm. NITPICK was even found to deliver competitive results when compared against a vendor-supplied algorithm which, in contrast to NITPICK, had retention time resolution available.

We would like to note that although the analysis at hand was confined to a proteomics data set, the application of the proposed methodology is in no way limited to this type of data and can easily be adapted to similar problems outside the field of proteomics.

NITPICK is available as software package for the R programming language and can be downloaded from http://hci.iwr.uni-heidelberg.de/mip/proteomics/.

Perspectives

The constrained least squares regression model in equation (3) implicitly assumes Gaussian noise on the observed spectra. Especially with low-intensity time-of-flight spectra the Gaussian approximation is crude, yielding suboptimal estimates. The incorporation of a data type- and intensity-dependent procedure pursuing a suitable Poisson regression approach [36] in appropriate cases could improve on this shortcoming.

The non-negative least squares step in equation (6) assumes error-free basis functions ϕ i . Although fractional averagine improves over the classical averagine model, this assumption is still violated. Possible remedies include direct intensity estimation techniques [43, 44] and enhanced sparse feature selection methodology which allows for errors in explanatory variables. Alternatively, extended stoichiometry models could provide problem-tailored basis functions if model bias is not an issue.

For charge states z < 3 and mass ranges m/z 1400, there exist so-called forbidden regions [45] within the mass spectrum, i.e. mass ranges which are inaccessible to peptides (including modifications). Such information has been reported to be suitable as a preprocessing filter [31].

Further computational efficiency could be achieved by a complexity-driven hierarchical estimation approach, resorting to subtractive feature extraction for simple signals and to the full mixture modeling for complex samples only.

Appendix

A Computation of fractional averagine

For the computation of the isotopic distribution of fractional averagine, we build on the fact that the distribution of the isotopes of an element follows a multinomial [33]. The multinomial is discrete, hence for fractional counts of events we can interpolate between the two adjacent integer multinomials for each element such that

P n = c ( X 1 = x 1 , ... , X k 1 = x k 1 ) = ( c c ) P n = c ( X 1 = x 1 , ... , X k 1 = x k 1 ) + ( c c ) P n = c ( X 1 = x 1 , ... , X k 1 = x k 1 ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeaabmqaaaqaaiabdcfaqnaaBaaaleaacqWGUbGBcqGH9aqpcqWGJbWyaeqaaOGaeiikaGIaemiwaG1aaSbaaSqaaiabigdaXaqabaGccqGH9aqpcqWG4baEdaWgaaWcbaGaeGymaedabeaakiabcYcaSiabc6caUiabc6caUiabc6caUiabcYcaSiabdIfaynaaBaaaleaacqWGRbWAcqGHsislcqaIXaqmaeqaaOGaeyypa0JaemiEaG3aaSbaaSqaaiabdUgaRjabgkHiTiabigdaXaqabaGccqGGPaqkaeaacqGH9aqpdaqadaqaamaahmaabaGaem4yamgacaGLUJVaayz+4dGaeyOeI0Iaem4yamgacaGLOaGaayzkaaGaemiuaa1aaSbaaSqaaiabd6gaUjabg2da9maagmaabaGaem4yamgacaGLWJVaay5+4daabeaakmaabmaabaGaemiwaG1aaSbaaSqaaiabigdaXaqabaGccqGH9aqpcqWG4baEdaWgaaWcbaGaeGymaedabeaakiabcYcaSiabc6caUiabc6caUiabc6caUiabcYcaSiabdIfaynaaBaaaleaacqWGRbWAcqGHsislcqaIXaqmaeqaaOGaeyypa0JaemiEaG3aaSbaaSqaaiabdUgaRjabgkHiTiabigdaXaqabaaakiaawIcacaGLPaaaaeaacqGHRaWkdaqadaqaaiabdogaJjabgkHiTmaagmaabaGaem4yamgacaGLWJVaay5+4daacaGLOaGaayzkaaGaemiuaa1aaSbaaSqaaiabd6gaUjabg2da9maahmaabaGaem4yamgacaGLUJVaayz+4daabeaakmaabmaabaGaemiwaG1aaSbaaSqaaiabigdaXaqabaGccqGH9aqpcqWG4baEdaWgaaWcbaGaeGymaedabeaakiabcYcaSiabc6caUiabc6caUiabc6caUiabcYcaSiabdIfaynaaBaaaleaacqWGRbWAcqGHsislcqaIXaqmaeqaaOGaeyypa0JaemiEaG3aaSbaaSqaaiabdUgaRjabgkHiTiabigdaXaqabaaakiaawIcacaGLPaaaaaaaaa@9FCE@
(13)

with c = min j ( j c ) MathType@MTEF@1@1@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGagiyBa0MaeiyAaKMaeiOBa42aaSbaaSqaaiabdQgaQjabgIGioprr1ngBPrwtHrhAYaqeguuDJXwAKbstHrhAGq1DVbaceeGae8xfH4eabeaakiabcIcaOiabdQgaQjabgwMiZkabdogaJjabcMcaPaaa@442C@ , c = max j ( j c ) MathType@MTEF@1@1@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGagiyBa0MaeiyyaeMaeiiEaG3aaSbaaSqaaiabdQgaQjabgIGioprr1ngBPrwtHrhAYaqeguuDJXwAKbstHrhAGq1DVbaceeGae8xfH4eabeaakiabcIcaOiabdQgaQjabgsMiJkabdogaJjabcMcaPaaa@441F@ and X i representing the number of times the i th isotope of an element occurs. Under the (reasonable) assumption of independence of the atomic distributions of the elements, the resulting joint distribution for a molecule follows from the multiplication of the distributions of its elements.

By changing the order of multiplication and separating the highest possible integer number from the remaining fractional numbers, the calculation of fractional averagine can be related to the Mercury7 algorithm [14], yielding a highly efficient calculation scheme (see eq. (2)). For the convolution of the Mercury integer results and the fractionals we follow [46]: Let g p (i) represent the i th element of the probability vector of the first and f p (j) the j th element of the second distribution, then

h p ( k ) = i g p ( i ) f p ( k i ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemiAaG2aaSbaaSqaaiabdchaWbqabaGccqGGOaakcqWGRbWAcqGGPaqkcqGH9aqpdaaeqbqaaiabdEgaNnaaBaaaleaacqWGWbaCaeqaaOGaeiikaGIaemyAaKMaeiykaKIaemOzay2aaSbaaSqaaiabdchaWbqabaGccqGGOaakcqWGRbWAcqGHsislcqWGPbqAcqGGPaqkaSqaaiabdMgaPbqab0GaeyyeIuoaaaa@4500@
(14)

can be used to compute h p (k), the k th element of the new vector of probabilities for the joint distribution. Similarly, the corresponding mass vector h m can be computed using the probability vectors g p and f p and the corresponding mass vectors g m and f m using

h m ( k ) = ( i g p ( i ) f p ( k i ) ) 1 i g p ( i ) f p ( k i ) ( g m ( i ) + f m ( k i ) ) . MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeWabiqaaaqaaiabdIgaOnaaBaaaleaacqWGTbqBaeqaaOGaeiikaGIaem4AaSMaeiykaKIaeyypa0ZaaeWaaeaadaaeqbqaaiabdEgaNnaaBaaaleaacqWGWbaCaeqaaOGaeiikaGIaemyAaKMaeiykaKIaemOzay2aaSbaaSqaaiabdchaWbqabaGccqGGOaakcqWGRbWAcqGHsislcqWGPbqAcqGGPaqkaSqaaiabdMgaPbqab0GaeyyeIuoaaOGaayjkaiaawMcaamaaCaaaleqabaGaeyOeI0IaeGymaedaaaGcbaWaaabuaeaacqWGNbWzdaWgaaWcbaGaemiCaahabeaakiabcIcaOiabdMgaPjabcMcaPiabdAgaMnaaBaaaleaacqWGWbaCaeqaaOGaeiikaGIaem4AaSMaeyOeI0IaemyAaKMaeiykaKIaeiikaGIaem4zaC2aaSbaaSqaaiabd2gaTbqabaGccqGGOaakcqWGPbqAcqGGPaqkcqGHRaWkcqWGMbGzdaWgaaWcbaGaemyBa0gabeaakiabcIcaOiabdUgaRjabgkHiTiabdMgaPjabcMcaPiabcMcaPaWcbaGaemyAaKgabeqdcqGHris5aOGaeiOla4caaaaa@6C44@
(15)

B Proof of the monotony of the GDF for the non-negative lasso

As long as a given set Φ A ( λ ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacceGae8NPdy0aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqabaaaaa@3C59@ is valid, it can be easily shown that the GDF are monotonous in λ. Starting with the GDF(λ) of equation (10),

GDF ( λ ) = s T 1 σ 2 Φ A ( λ ) ( Φ A ( λ ) T Φ A ( λ ) ) 1 ( Φ A ( λ ) T 1 2 λ 1 A ( λ ) ) = 1 σ 2 i = 1 N s i ( Φ A ( λ ) ( Φ A ( λ ) T Φ A ( λ ) ) 1 ( Φ A ( λ ) T s 1 2 λ 1 A ( λ ) ) ) i = 1 σ 2 i = 1 N s i ( Φ A ( λ ) ( Φ A ( λ ) T Φ A ( λ ) ) 1 Φ A ( λ ) T s ) i λ 1 2 σ 2 i = 1 N s i ( Φ A ( λ ) ( Φ A ( λ ) T Φ A ( λ ) ) 1 1 A ( λ ) ) i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeaabyqaaaaabaGaee4raCKaeeiraqKaeeOrayKaeiikaGIaeq4UdWMaeiykaKcabaGaeyypa0dcbmGae83Cam3aaWbaaSqabeaacqWGubavaaqcfa4aaSaaaeaacqaIXaqmaeaacqaHdpWCdaahaaqabeaacqaIYaGmaaaaaGGabOGae4NPdy0aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab9bq8bjabcIcaOiabeU7aSjabcMcaPaqabaGccqGGOaakcqGFMoGrdaqhaaWcbaGae0haXhKaeiikaGIaeq4UdWMaeiykaKcabaGaemivaqfaaOGae4NPdy0aaSbaaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPaqabaGccqGGPaqkdaahaaWcbeqaaiabgkHiTiabigdaXaaakmaabmaabaGae4NPdy0aa0baaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPaqaaiabdsfaubaakiabgkHiTKqbaoaalaaabaGaeGymaedabaGaeGOmaidaaOGaeq4UdWgcbeGaeWxmaeZaaSbaaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPaqabaaakiaawIcacaGLPaaaaeaacqGH9aqpjuaGdaWcaaqaaiabigdaXaqaaiabeo8aZnaaCaaabeqaaiabikdaYaaaaaGcdaaeWbqaaiab=nhaZnaaBaaaleaacqWGPbqAaeqaaaqaaiabdMgaPjabg2da9iabigdaXaqaaiabd6eaobqdcqGHris5aaGcbiqaaamacaWLjaWaaeWaaeaacqGFMoGrdaWgaaWcbaGae0haXhKaeiikaGIaeq4UdWMaeiykaKcabeaakmaabmaabaGae4NPdy0aa0baaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPaqaaiabdsfaubaakiab+z6agnaaBaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeqaaaGccaGLOaGaayzkaaWaaWbaaSqabeaacqGHsislcqaIXaqmaaGcdaqadaqaaiab+z6agnaaDaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeaacqWGubavaaGccqWFZbWCcqGHsisljuaGdaWcaaqaaiabigdaXaqaaiabikdaYaaakiabeU7aSjabigdaXmaaBaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeqaaaGccaGLOaGaayzkaaaacaGLOaGaayzkaaWaaSbaaSqaaiabdMgaPbqabaaakeaacqGH9aqpjuaGdaWcaaqaaiabigdaXaqaaiabeo8aZnaaCaaabeqaaiabikdaYaaaaaGcdaaeWbqaaiab=nhaZnaaBaaaleaacqWGPbqAaeqaaaqaaiabdMgaPjabg2da9iabigdaXaqaaiabd6eaobqdcqGHris5aOWaaeWaaeaacqGFMoGrdaWgaaWcbaGae0haXhKaeiikaGIaeq4UdWMaeiykaKcabeaakmaabmaabaGae4NPdy0aa0baaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPaqaaiabdsfaubaakiab+z6agnaaBaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeqaaaGccaGLOaGaayzkaaWaaWbaaSqabeaacqGHsislcqaIXaqmaaGccqGFMoGrdaqhaaWcbaGae0haXhKaeiikaGIaeq4UdWMaeiykaKcabaGaemivaqfaaOGae83CamhacaGLOaGaayzkaaWaaSbaaSqaaiabdMgaPbqabaaakeGabaaKaiaaxMaacqGHsislcqaH7oaBjuaGdaWcaaqaaiabigdaXaqaaiabikdaYiabeo8aZnaaCaaabeqaaiabikdaYaaaaaGcdaaeWbqaaiab=nhaZnaaBaaaleaacqWGPbqAaeqaaOWaaeWaaeaacqGFMoGrdaWgaaWcbaGae0haXhKaeiikaGIaeq4UdWMaeiykaKcabeaakmaabmaabaGae4NPdy0aa0baaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPaqaaiabdsfaubaakiab+z6agnaaBaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeqaaaGccaGLOaGaayzkaaWaaWbaaSqabeaacqGHsislcqaIXaqmaaGccqaIXaqmdaWgaaWcbaGae0haXhKaeiikaGIaeq4UdWMaeiykaKcabeaaaOGaayjkaiaawMcaamaaBaaaleaacqWGPbqAaeqaaaqaaiabdMgaPjabg2da9iabigdaXaqaaiabd6eaobqdcqGHris5aaaaaaa@2021@
(16)

To show that the GDF are monotonously increasing for decreasing values of λ, it suffices to analyze the following part of the formula,

i = 1 N ( s i ( Φ A ( λ ) ( Φ A ( λ ) T Φ A ( λ ) ) 1 1 A ( λ ) ) i ) = i = 1 N ( e i T ( Φ A ( λ ) ( Φ A ( λ ) T Φ A ( λ ) ) 1 1 A ( λ ) ) ) T ( e i T s ) = i = 1 N ( 1 A ( λ ) T ( Φ A ( λ ) T Φ A ( λ ) ) 1 Φ A ( λ ) T ) e i e i T s = ( 1 A ( λ ) T ( Φ A ( λ ) T Φ A ( λ ) ) 1 Φ A ( λ ) T ) I N s = ( 1 A ( λ ) T ( Φ A ( λ ) T Φ A ( λ ) ) 1 Φ A ( λ ) T ) s = j = 1 K c ^ j L S A ( λ ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeaabyqaaaaabaWaaabCaeaadaqadaqaaGqadiab=nhaZnaaBaaaleaacqWGPbqAaeqaaOWaaeWaaeaaiiqacqGFMoGrdaWgaaWcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae0haXhKaeiikaGIaeq4UdWMaeiykaKcabeaakmaabmaabaGae4NPdy0aa0baaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPaqaaiabdsfaubaakiab+z6agnaaBaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeqaaaGccaGLOaGaayzkaaWaaWbaaSqabeaacqGHsislcqaIXaqmaaGccqaIXaqmdaWgaaWcbaGae0haXhKaeiikaGIaeq4UdWMaeiykaKcabeaaaOGaayjkaiaawMcaamaaBaaaleaacqWGPbqAaeqaaaGccaGLOaGaayzkaaaaleaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGobGta0GaeyyeIuoaaOqaaiabg2da9maaqahabaWaaeWaaeaacqWGLbqzdaqhaaWcbaGaemyAaKgabaGaemivaqfaaOWaaeWaaeaacqGFMoGrdaWgaaWcbaGae0haXhKaeiikaGIaeq4UdWMaeiykaKcabeaakmaabmaabaGae4NPdy0aa0baaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPaqaaiabdsfaubaakiab+z6agnaaBaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeqaaaGccaGLOaGaayzkaaWaaWbaaSqabeaacqGHsislcqaIXaqmaaGccqaIXaqmdaWgaaWcbaGae0haXhKaeiikaGIaeq4UdWMaeiykaKcabeaaaOGaayjkaiaawMcaaaGaayjkaiaawMcaamaaCaaaleqabaGaemivaqfaaOGaeiikaGIaemyzau2aa0baaSqaaiabdMgaPbqaaiabdsfaubaakiab=nhaZjabcMcaPaWcbaGaemyAaKMaeyypa0JaeGymaedabaGaemOta4eaniabggHiLdaakeaacqGH9aqpdaaeWbqaamaabmaabaGaeGymaeZaa0baaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPaqaaiabdsfaubaakmaabmaabaGae4NPdy0aa0baaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPaqaaiabdsfaubaakiab+z6agnaaBaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeqaaaGccaGLOaGaayzkaaWaaWbaaSqabeaacqGHsislcqaIXaqmaaGccqGFMoGrdaqhaaWcbaGae0haXhKaeiikaGIaeq4UdWMaeiykaKcabaGaemivaqfaaaGccaGLOaGaayzkaaGaemyzau2aaSbaaSqaaiabdMgaPbqabaGccqWGLbqzdaqhaaWcbaGaemyAaKgabaGaemivaqfaaOGae83CamhaleaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGobGta0GaeyyeIuoaaOqaaiabg2da9maabmaabaGaeGymaeZaa0baaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPaqaaiabdsfaubaakmaabmqabaGae4NPdy0aa0baaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPaqaaiabdsfaubaakiab+z6agnaaBaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeqaaaGccaGLOaGaayzkaaWaaWbaaSqabeaacqGHsislcqaIXaqmaaGccqGFMoGrdaqhaaWcbaGae0haXhKaeiikaGIaeq4UdWMaeiykaKcabaGaemivaqfaaaGccaGLOaGaayzkaaGaemysaK0aaSbaaSqaaiabd6eaobqabaGccqWFZbWCaeaacqGH9aqpdaqadaqaaiabigdaXmaaDaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeaacqWGubavaaGcdaqadeqaaiab+z6agnaaDaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeaacqWGubavaaGccqGFMoGrdaWgaaWcbaGae0haXhKaeiikaGIaeq4UdWMaeiykaKcabeaaaOGaayjkaiaawMcaamaaCaaaleqabaGaeyOeI0IaeGymaedaaOGae4NPdy0aa0baaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPaqaaiabdsfaubaaaOGaayjkaiaawMcaaiab=nhaZbqaaiabg2da9maaqahabaGaf83yamMbaKaadaqhaaWcbaGaemOAaOgabaGaemitaWKaem4uam1aaSbaaWqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPaqabaaaaaWcbaGaemOAaOMaeyypa0JaeGymaedabaGaem4saSeaniabggHiLdaaaaaa@2E48@
(17)

where e i denotes the i th canonical unit vector of length N and I N = i = 1 N e i e i T MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemysaK0aaSbaaSqaaiabd6eaobqabaGccqGH9aqpdaaeWaqaaiabdwgaLnaaBaaaleaacqWGPbqAaeqaaOGaemyzau2aa0baaSqaaiabdMgaPbqaaiabdsfaubaaaeaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGobGta0GaeyyeIuoaaaa@3CAE@ is the identity matrix of size N.

c ^ j L S A ( λ ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacbmGaf83yamMbaKaadaqhaaWcbaGaemOAaOgabaGaemitaWKaem4uam1aaSbaaWqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqabaaaaaaa@401C@ is the least squares regression coefficient for the corresponding least squares problem of the active set. It is known that all non-negative lasso coefficients c ^ A ( λ ) j q MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacbmGaf83yamMbaKaadaqhaaWcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae4haXhKaeiikaGIaeq4UdWMaeiykaKYaaSbaaWqaaiabdQgaQbqabaaaleaacqWGXbqCaaaaaa@3F42@ are greater or equal zero, so

j = 1 K c ^ A ( λ ) j q 0 ( e q . 9 ) ( 1 A ( λ ) T ( Φ A ( λ ) T Φ A ( λ ) ) 1 Φ A ( λ ) T ) s 1 A ( λ ) T ( Φ A ( λ ) T Φ A ( λ ) ) 1 1 2 λ 1 A ( λ ) 0 ( 1 A ( λ ) T ( Φ A ( λ ) T Φ A ( λ ) ) 1 Φ A ( λ ) T ) s 1 2 λ 1 A ( λ ) T ( Φ A ( λ ) T Φ A ( λ ) ) 1 1 A ( λ ) 0 j = 1 K c ^ j L S A ( λ ) 1 2 λ 1 A ( λ ) T ( Φ A ( λ ) T Φ A ( λ ) ) 1 1 A ( λ ) 0 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeaabCqaaaaabaWaaabCaeaaieWacuWFJbWygaqcamaaDaaaleaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkdaWgaaadbaGaemOAaOgabeaaaSqaaiabdghaXbaakiabgwMiZkabicdaWaWcbaGaemOAaOMaeyypa0JaeGymaedabaGaem4saSeaniabggHiLdaakeaadaWfWaqaaiabgsDiBdWcbaaabaGaeiikaGIaemyzauMaemyCaeNaeiOla4IaeeiiaaIaeGyoaKJaeiykaKcaaOWaaeWaaeaacqaIXaqmdaqhaaWcbaGae4haXhKaeiikaGIaeq4UdWMaeiykaKcabaGaemivaqfaaOWaaeWaaeaaiiqacqqFMoGrdaqhaaWcbaGae4haXhKaeiikaGIaeq4UdWMaeiykaKcabaGaemivaqfaaOGae0NPdy0aaSbaaSqaaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqabaaakiaawIcacaGLPaaadaahaaWcbeqaaiabgkHiTiabigdaXaaakiab9z6agnaaDaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeaacqWGubavaaaakiaawIcacaGLPaaacqWFZbWCaeGabaaobiaaxMaacqGHsislcqaIXaqmdaqhaaWcbaGae4haXhKaeiikaGIaeq4UdWMaeiykaKcabaGaemivaqfaaOWaaeWaaeaacqqFMoGrdaqhaaWcbaGae4haXhKaeiikaGIaeq4UdWMaeiykaKcabaGaemivaqfaaOGae0NPdy0aaSbaaSqaaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqabaaakiaawIcacaGLPaaadaahaaWcbeqaaiabgkHiTiabigdaXaaajuaGdaWcaaqaaiabigdaXaqaaiabikdaYaaakiabeU7aSjabigdaXmaaBaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeqaaOGaeyyzImRaeGimaadabaGaeyi1HS9aaeWaaeaacqaIXaqmdaqhaaWcbaGae4haXhKaeiikaGIaeq4UdWMaeiykaKcabaGaemivaqfaaOWaaeWaaeaacqqFMoGrdaqhaaWcbaGae4haXhKaeiikaGIaeq4UdWMaeiykaKcabaGaemivaqfaaOGae0NPdy0aaSbaaSqaaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqabaaakiaawIcacaGLPaaadaahaaWcbeqaaiabgkHiTiabigdaXaaakiab9z6agnaaDaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeaacqWGubavaaaakiaawIcacaGLPaaacqWFZbWCaeGacaaWaaaMeiaaxMaacqGHLjYSjuaGdaWcaaqaaiabigdaXaqaaiabikdaYaaakiabeU7aSjabigdaXmaaDaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeaacqWGubavaaGcdaqadaqaaiab9z6agnaaDaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeaacqWGubavaaGccqqFMoGrdaWgaaWcbaGae4haXhKaeiikaGIaeq4UdWMaeiykaKcabeaaaOGaayjkaiaawMcaamaaCaaaleqabaGaeyOeI0IaeGymaedaaOGaeGymaeZaaSbaaSqaaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqabaGccqGHLjYScqaIWaamaeaacqGHuhY2daaeWbqaaiqb=ngaJzaajaWaa0baaSqaaiabdQgaQbqaaiabdYeamjabdofatnaaBaaameaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeqaaaaaaSqaaiabdQgaQjabg2da9iabigdaXaqaaiabdUealbqdcqGHris5aaGcbiqaaamacaWLjaGaeyyzImBcfa4aaSaaaeaacqaIXaqmaeaacqaIYaGmaaGccqaH7oaBcqaIXaqmdaqhaaWcbaGae4haXhKaeiikaGIaeq4UdWMaeiykaKcabaGaemivaqfaaOWaaeWaaeaacqqFMoGrdaqhaaWcbaGae4haXhKaeiikaGIaeq4UdWMaeiykaKcabaGaemivaqfaaOGae0NPdy0aaSbaaSqaaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqabaaakiaawIcacaGLPaaadaahaaWcbeqaaiabgkHiTiabigdaXaaakiabigdaXmaaBaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeqaaOGaeyyzImRaeGimaadaaaaa@3015@
(18)

as Φ A ( λ ) T Φ A ( λ ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacceGae8NPdy0aa0baaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqaaiabdsfaubaakiab=z6agnaaBaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeqaaaaa@4451@ is the inverse of a covariance matrix and, thus, positive-semidefinite, and λ is by definition always greater or equal 0. Thus, the second part of equation (16) is monotone with regard to λ and therefore the GDFs are monotone as long as a given active set is valid.

It remains to be shown that changes of Φ A ( λ ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacceGae8NPdy0aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqabaaaaa@3C59@ do not influence the monotony, so it needs to be shown that neither the addition of ϕ j to the set Φ A ( λ ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacceGae8NPdy0aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqabaaaaa@3C59@ nor the removal of ϕ k from Φ A ( λ ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacceGae8NPdy0aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqabaaaaa@3C59@ lead to a decrease of cov(s, Φ A ( λ ) c ^ A ( λ ) q MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacceGae8NPdy0aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqabaacbmGccuqFJbWygaqcamaaDaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeaacqWGXbqCaaaaaa@447C@ ) as given in (10). A formal proof is given further below, nevertheless, this can also be argued intuitively.

In the non-negative LARS implementation as described above and in [39], a variable ϕ j will be added to the active set ϕ A ( λ ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeqy1dy2aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab=bq8bjabcIcaOiabeU7aSjabcMcaPaqabaaaaa@3CA3@ only if it is positively correlated with the remaining residuals, i. e. if

cov ( ϕ j , s Φ A ( λ ) c ˆ A ( λ ) q ) > 0 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGagi4yamMaei4Ba8MaeiODay3aaeWaaeaacqaHvpGzdaWgaaWcbaGaemOAaOgabeaakiabcYcaSGqadiab=nhaZjabgkHiTGGabiab+z6agnaaBaaaleaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeqaaOGaf83yamMbaKaadaqhaaWcbaGae0haXhKaeiikaGIaeq4UdWMaeiykaKcabaGaemyCaehaaaGccaGLOaGaayzkaaGaeyOpa4JaeGimaadaaa@530F@
(19)

This obviously leads to an increase of cov (s, Φ A ( λ ) c ^ A ( λ ) q MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacceGae8NPdy0aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqabaacbmGccuqFJbWygaqcamaaDaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeaacqWGXbqCaaaaaa@447C@ ) as less unexplained variation remains. A variable ϕ k is removed from the active set Φ A ( λ ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacceGae8NPdy0aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqabaaaaa@3C59@ only if cov (ϕ k , s- Φ A ( λ ) c ^ A ( λ ) q MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacceGae8NPdy0aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqabaacbmGccuqFJbWygaqcamaaDaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeaacqWGXbqCaaaaaa@447C@ ) < 0, so if the residuals are negatively correlated with the variable its removal leads to an increase of cov (s, Φ A ( λ ) c ^ A ( λ ) q MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacceGae8NPdy0aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqabaacbmGccuqFJbWygaqcamaaDaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeaacqWGXbqCaaaaaa@447C@ ) as well.

Thus, as long as changes of the set Φ A ( λ ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacceGae8NPdy0aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqabaaaaa@3C59@ appear one at a time (which is ensured by the active set implementation), they do not influence the monotonous character of the estimate of the degrees of freedom.

More formally, when a variable ϕ j is added to the current set of variables Φ A ( λ ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacceGae8NPdy0aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqabaaaaa@3C59@ , the solution for Φ A ( λ ) + = Φ A ( λ ) ϕ j MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacceGae8NPdy0aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab+bq8bjabcIcaOiabeU7aSjabcMcaPmaaBaaameaacqGHRaWkaeqaaaWcbeaakiabg2da9iab=z6agnaaBaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeqaaOGaeyOkIGSaeqy1dy2aaSbaaSqaaiabdQgaQbqabaaaaa@4A3A@ can be constructed from the solution of Φ A ( λ ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacceGae8NPdy0aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqabaaaaa@3C59@ in the following manner [39]:

Φ A ( λ ) + c ^ A ( λ ) + q = Φ A ( λ ) c ^ A ( λ ) q + γ ^ u A ( λ ) + MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacceGae8NPdy0aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab+bq8bjabcIcaOiabeU7aSjabcMcaPmaaBaaameaacqGHRaWkaeqaaaWcbeaaieWakiqb9ngaJzaajaWaa0baaSqaaiab+bq8bjabcIcaOiabeU7aSjabcMcaPmaaBaaameaacqGHRaWkaeqaaaWcbaGaemyCaehaaOGaeyypa0Jae8NPdy0aaSbaaSqaaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqabaGccuqFJbWygaqcamaaDaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeaacqWGXbqCaaGccqGHRaWkcuaHZoWzgaqcaiabdwha1naaBaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkdaWgaaadbaGaey4kaScabeaaaSqabaaaaa@6159@
(20)

where

γ ^ = min j A ( λ ) C + { D ^ d ^ j B A ( λ ) b j } > 0 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafq4SdCMbaKaacqGH9aqpcyGGTbqBcqGGPbqAcqGGUbGBdaqhaaWcbaGaemOAaOMaeyicI48enfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXhKaeiikaGIaeq4UdWMaeiykaKYaaWbaaWqabeaacqWGdbWqaaaaleaacqGHRaWkaaGcdaGadaqcfayaamaalaaabaGafmiraqKbaKaacqGHsislcuWGKbazgaqcamaaBaaabaGaemOAaOgabeaaaeaacqWGcbGqdaWgaaqaaiab=bq8bjabcIcaOiabeU7aSjabcMcaPaqabaGaeyOeI0IaemOyai2aaSbaaeaacqWGQbGAaeqaaaaaaOGaay5Eaiaaw2haaiabg6da+iabicdaWaaa@5ADD@
(21)

is strictly positive by definition and gives the magnitude of the change.

d ^ = Φ T ( s Φ A ( λ ) c ^ A ( λ ) q ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmizaqMbaKaacqGH9aqpiiqacqWFMoGrdaahaaWcbeqaaiabdsfaubaakiabcIcaOGqadiab+nhaZjabgkHiTiab=z6agnaaBaaaleaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeqaaOGaf43yamMbaKaadaqhaaWcbaGae0haXhKaeiikaGIaeq4UdWMaeiykaKcabaGaemyCaehaaOGaeiykaKcaaa@4E1F@
(22)

is the vector of the current correlation and

D ^ = max j { d ^ j | d ^ j > 0 } . MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmiraqKbaKaacqGH9aqpcyGGTbqBcqGGHbqycqGG4baEdaWgaaWcbaGaemOAaOgabeaakiabcUha7jqbdsgaKzaajaWaaSbaaSqaaiabdQgaQbqabaGccqGG8baFcuWGKbazgaqcamaaBaaaleaacqWGQbGAaeqaaOGaeyOpa4JaeGimaaJaeiyFa0NaeiOla4caaa@4145@
(23)

In addition,

B A ( λ ) = ( 1 A ( λ ) T ( Φ A ( λ ) T Φ A ( λ ) ) 1 1 A ( λ ) ) 1 2 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemOqai0aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab=bq8bjabcIcaOiabeU7aSjabcMcaPaqabaGccqGH9aqpdaqadaqaaiabigdaXmaaDaaaleaacqWFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeaacqWGubavaaGcdaqadaqaaGGabiab+z6agnaaDaaaleaacqWFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeaacqWGubavaaGccqGFMoGrdaWgaaWcbaGae8haXhKaeiikaGIaeq4UdWMaeiykaKcabeaaaOGaayjkaiaawMcaamaaCaaaleqabaGaeyOeI0IaeGymaedaaOGaeGymaeZaaSbaaSqaaiab=bq8bjabcIcaOiabeU7aSjabcMcaPaqabaaakiaawIcacaGLPaaadaahaaWcbeqaaiabgkHiTKqbaoaalaaabaGaeGymaedabaGaeGOmaidaaaaaaaa@6287@
(24)

and

u A ( λ ) = Φ A ( λ ) B A ( λ ) ( Φ A ( λ ) T Φ A ( λ ) ) 1 1 A ( λ ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyDau3aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab=bq8bjabcIcaOiabeU7aSjabcMcaPaqabaGccqGH9aqpiiqacqGFMoGrdaWgaaWcbaGae8haXhKaeiikaGIaeq4UdWMaeiykaKcabeaakiabdkeacnaaBaaaleaacqWFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeqaaOWaaeWaaeaacqGFMoGrdaqhaaWcbaGae8haXhKaeiikaGIaeq4UdWMaeiykaKcabaGaemivaqfaaOGae4NPdy0aaSbaaSqaaiab=bq8bjabcIcaOiabeU7aSjabcMcaPaqabaaakiaawIcacaGLPaaadaahaaWcbeqaaiabgkHiTiabigdaXaaakiabigdaXmaaBaaaleaacqWFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeqaaaaa@6371@
(25)

leading to

b = Φ T u A ( λ ) . MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemOyaiMaeyypa0dcceGae8NPdy0aaWbaaSqabeaacqWGubavaaGccqWG1bqDdaWgaaWcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae4haXhKaeiikaGIaeq4UdWMaeiykaKcabeaakiabc6caUaaa@42C3@
(26)

We need to show that

cov ( s , Φ A ( λ ) + c ^ A ( λ ) + q ) cov ( s , Φ A ( λ ) c ^ A ( λ ) q ) cov ( s , Φ A ( λ ) + c ^ A ( λ ) + q Φ A ( λ ) c ^ A ( λ ) q ) 0 s T ( Φ A ( λ ) + c ^ A ( λ ) + q Φ A ( λ ) c ^ A ( λ ) q ) 0. MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeaabmqaaaqaaiGbcogaJjabc+gaVjabcAha2naabmaabaacbmGae83CamNaeiilaWccceGae4NPdy0aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab9bq8bjabcIcaOiabeU7aSjabcMcaPmaaBaaameaacqGHRaWkaeqaaaWcbeaakiqb=ngaJzaajaWaa0baaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPmaaBaaameaacqGHRaWkaeqaaaWcbaGaemyCaehaaaGccaGLOaGaayzkaaGaeyyzImRagi4yamMaei4Ba8MaeiODay3aaeWaaeaacqWFZbWCcqGGSaalcqGFMoGrdaWgaaWcbaGae0haXhKaeiikaGIaeq4UdWMaeiykaKcabeaakiqb=ngaJzaajaWaa0baaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPaqaaiabdghaXbaaaOGaayjkaiaawMcaaaqaaiabgsDiBlGbcogaJjabc+gaVjabcAha2naabmaabaGae83CamNaeiilaWIae4NPdy0aaSbaaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPmaaBaaameaacqGHRaWkaeqaaaWcbeaakiqb=ngaJzaajaWaa0baaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPmaaBaaameaacqGHRaWkaeqaaaWcbaGaemyCaehaaOGaeyOeI0Iae4NPdy0aaSbaaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPaqabaGccuWFJbWygaqcamaaDaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeaacqWGXbqCaaaakiaawIcacaGLPaaacqGHLjYScqaIWaamaeaacqGHuhY2cqWFZbWCdaahaaWcbeqaaiabdsfaubaakmaabmaabaGae4NPdy0aaSbaaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPmaaBaaameaacqGHRaWkaeqaaaWcbeaakiqb=ngaJzaajaWaa0baaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPmaaBaaameaacqGHRaWkaeqaaaWcbaGaemyCaehaaOGaeyOeI0Iae4NPdy0aaSbaaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPaqabaGccuWFJbWygaqcamaaDaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeaacqWGXbqCaaaakiaawIcacaGLPaaacqGHLjYScqaIWaamcqGGUaGlaaaaaa@C0C3@
(27)

Using the construction of Φ A ( λ ) + c ^ A ( λ ) + q MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacceGae8NPdy0aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab+bq8bjabcIcaOiabeU7aSjabcMcaPmaaBaaameaacqGHRaWkaeqaaaWcbeaaieWakiqb9ngaJzaajaWaa0baaSqaaiab+bq8bjabcIcaOiabeU7aSjabcMcaPmaaBaaameaacqGHRaWkaeqaaaWcbaGaemyCaehaaaaa@46B0@ from above, this leads to

s T ( Φ A ( λ ) c ^ A ( λ ) q + γ ^ u A ( λ ) + Φ A ( λ ) c ^ A ( λ ) q ) 0 s T ( γ ^ u A ( λ ) + ) 0. MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeGabiqaaaqaaiabgsDiBJqadiab=nhaZnaaCaaaleqabaGaemivaqfaaOWaaeWaaeaaiiqacqGFMoGrdaWgaaWcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae0haXhKaeiikaGIaeq4UdWMaeiykaKcabeaakiqb=ngaJzaajaWaa0baaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPaqaaiabdghaXbaakiabgUcaRiqbeo7aNzaajaGaemyDau3aaSbaaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPmaaBaaameaacqGHRaWkaeqaaaWcbeaakiabgkHiTiab+z6agnaaBaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeqaaOGaf83yamMbaKaadaqhaaWcbaGae0haXhKaeiikaGIaeq4UdWMaeiykaKcabaGaemyCaehaaaGccaGLOaGaayzkaaGaeyyzImRaeGimaadabaGaeyi1HSTae83Cam3aaWbaaSqabeaacqWGubavaaGccqGGOaakcuaHZoWzgaqcaiabdwha1naaBaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkdaWgaaadbaGaey4kaScabeaaaSqabaGccqGGPaqkcqGHLjYScqaIWaamcqGGUaGlaaaaaa@7C9B@
(28)

It is known from its definition that γ ^ MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafq4SdCMbaKaaaaa@2D8C@ is strictly positive, thus it can be dropped from the inequality and

s T u + 0 s T Φ A ( λ ) + B A ( λ ) + ( Φ A ( λ ) + T Φ A ( λ ) + ) 1 1 A ( λ ) + 0. MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeGabiqaaaqaaGqadiab=nhaZnaaCaaaleqabaGaemivaqfaaOGaemyDau3aaSbaaSqaaiabgUcaRaqabaGccqGHLjYScqaIWaamaeaacqGHuhY2cqWFZbWCdaahaaWcbeqaaiabdsfaubaaiiqakiab+z6agnaaBaaaleaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkdaWgaaadbaGaey4kaScabeaaaSqabaGccqWGcbGqdaWgaaWcbaGae0haXhKaeiikaGIaeq4UdWMaeiykaKYaaSbaaWqaaiabgUcaRaqabaaaleqaaOWaaeWaaeaacqGFMoGrdaqhaaWcbaGae0haXhKaeiikaGIaeq4UdWMaeiykaKYaaSbaaWqaaiabgUcaRaqabaaaleaacqWGubavaaGccqGFMoGrdaWgaaWcbaGae0haXhKaeiikaGIaeq4UdWMaeiykaKYaaSbaaWqaaiabgUcaRaqabaaaleqaaaGccaGLOaGaayzkaaWaaWbaaSqabeaacqGHsislcqaIXaqmaaGccqaIXaqmdaWgaaWcbaGae0haXhKaeiikaGIaeq4UdWMaeiykaKYaaSbaaWqaaiabgUcaRaqabaaaleqaaOGaeyyzImRaeGimaaJaeiOla4caaaaa@721A@
(29)

It is also known from the idea of the non-negative lasso that all variables in X A are positively correlated with the remaining residuals, so

cov ( Φ A ( λ ) , s Φ A ( λ ) c ^ A ( λ ) q ) 0 ( s Φ A ( λ ) c ^ A ( λ ) q ) T Φ A ( λ ) 0 s T Φ A ( λ ) ( Φ A ( λ ) c ^ A ( λ ) q ) T Φ A ( λ ) . MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeaabmqaaaqaaiGbcogaJjabc+gaVjabcAha2naabmaabaacceGae8NPdy0aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqabaGccqGGSaalieWacqqFZbWCcqGHsislcqWFMoGrdaWgaaWcbaGae4haXhKaeiikaGIaeq4UdWMaeiykaKcabeaakiqb9ngaJzaajaWaa0baaSqaaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqaaiabdghaXbaaaOGaayjkaiaawMcaaiabgwMiZkabicdaWaqaaiabgsDiBpaabmaabaGae03CamNaeyOeI0Iae8NPdy0aaSbaaSqaaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqabaGccuqFJbWygaqcamaaDaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeaacqWGXbqCaaaakiaawIcacaGLPaaadaahaaWcbeqaaiabdsfaubaakiab=z6agnaaBaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeqaaOGaeyyzImRaeGimaadabaGaeyi1HSTae03Cam3aaWbaaSqabeaacqWGubavaaGccqWFMoGrdaWgaaWcbaGae4haXhKaeiikaGIaeq4UdWMaeiykaKcabeaakiabgwMiZoaabmaabaGae8NPdy0aaSbaaSqaaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqabaGccuqFJbWygaqcamaaDaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeaacqWGXbqCaaaakiaawIcacaGLPaaadaahaaWcbeqaaiabdsfaubaakiab=z6agnaaBaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeqaaOGaeiOla4caaaaa@9E70@
(30)

Using this result,

s T Φ A ( λ ) + B A ( λ ) + ( Φ A ( λ ) + T Φ A ( λ ) + ) 1 1 A ( λ ) + ( Φ A ( λ ) + c ^ A ( λ ) + q ) T Φ A ( λ ) + B A ( λ ) + ( Φ A ( λ ) + T Φ A ( λ ) + ) 1 1 A ( λ ) + MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeaabmqaaaqaaGqadiab=nhaZnaaCaaaleqabaGaemivaqfaaGGabOGae4NPdy0aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab9bq8bjabcIcaOiabeU7aSjabcMcaPmaaBaaameaacqGHRaWkaeqaaaWcbeaakiabdkeacnaaBaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkdaWgaaadbaGaey4kaScabeaaaSqabaGcdaqadaqaaiab+z6agnaaDaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkdaWgaaadbaGaey4kaScabeaaaSqaaiabdsfaubaakiab+z6agnaaBaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkdaWgaaadbaGaey4kaScabeaaaSqabaaakiaawIcacaGLPaaadaahaaWcbeqaaiabgkHiTiabigdaXaaakiabigdaXmaaBaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkdaWgaaadbaGaey4kaScabeaaaSqabaaakeaacqGHLjYSdaqadaqaaiab+z6agnaaBaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkdaWgaaadbaGaey4kaScabeaaaSqabaGccuWFJbWygaqcamaaDaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkdaWgaaadbaGaey4kaScabeaaaSqaaiabdghaXbaaaOGaayjkaiaawMcaamaaCaaaleqabaGaemivaqfaaOGae4NPdy0aaSbaaSqaaiab9bq8bjabcIcaOiabeU7aSjabcMcaPmaaBaaameaacqGHRaWkaeqaaaWcbeaakiabdkeacnaaBaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkdaWgaaadbaGaey4kaScabeaaaSqabaaakeGabaa2aiaaxMaadaqadaqaaiab+z6agnaaDaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkdaWgaaadbaGaey4kaScabeaaaSqaaiabdsfaubaakiab+z6agnaaBaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkdaWgaaadbaGaey4kaScabeaaaSqabaaakiaawIcacaGLPaaadaahaaWcbeqaaiabgkHiTiabigdaXaaakiabigdaXmaaBaaaleaacqqFaeFqcqGGOaakcqaH7oaBcqGGPaqkdaWgaaadbaGaey4kaScabeaaaSqabaaaaOGaeeiiaacaaa@A75B@
(31)

holds true and it suffices to show that

( Φ A ( λ ) + c ^ A ( λ ) + q ) T Φ A ( λ ) + B A ( λ ) + ( Φ A ( λ ) + T Φ A ( λ ) + ) 1 1 A ( λ ) + 0 ( Φ A ( λ ) + c ^ A ( λ ) + q ) T u A + 0 c ^ A ( λ ) + q 1 T Φ A ( λ ) + T u A ( λ ) + 0. MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeaabqqaaaaabaWaaeWaaeaaiiqacqWFMoGrdaWgaaWcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae4haXhKaeiikaGIaeq4UdWMaeiykaKYaaSbaaWqaaiabgUcaRaqabaaaleqaaGqadOGaf03yamMbaKaadaqhaaWcbaGae4haXhKaeiikaGIaeq4UdWMaeiykaKYaaSbaaWqaaiabgUcaRaqabaaaleaacqWGXbqCaaaakiaawIcacaGLPaaadaahaaWcbeqaaiabdsfaubaakiab=z6agnaaBaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkdaWgaaadbaGaey4kaScabeaaaSqabaGccqWGcbGqdaWgaaWcbaGae4haXhKaeiikaGIaeq4UdWMaeiykaKYaaSbaaWqaaiabgUcaRaqabaaaleqaaaGcbiqaaGwacaWLjaWaaeWaaeaacqWFMoGrdaqhaaWcbaGae4haXhKaeiikaGIaeq4UdWMaeiykaKYaaSbaaWqaaiabgUcaRaqabaaaleaacqWGubavaaGccqWFMoGrdaWgaaWcbaGae4haXhKaeiikaGIaeq4UdWMaeiykaKYaaSbaaWqaaiabgUcaRaqabaaaleqaaaGccaGLOaGaayzkaaWaaWbaaSqabeaacqGHsislcqaIXaqmaaGccqaIXaqmdaWgaaWcbaGae4haXhKaeiikaGIaeq4UdWMaeiykaKYaaSbaaWqaaiabgUcaRaqabaaaleqaaOGaeyyzImRaeGimaadabaGaeyi1HS9aaeWaaeaacqWFMoGrdaWgaaWcbaGae4haXhKaeiikaGIaeq4UdWMaeiykaKYaaSbaaWqaaiabgUcaRaqabaaaleqaaOGaf03yamMbaKaadaqhaaWcbaGae4haXhKaeiikaGIaeq4UdWMaeiykaKYaaSbaaWqaaiabgUcaRaqabaaaleaacqWGXbqCaaaakiaawIcacaGLPaaadaahaaWcbeqaaiabdsfaubaakiabdwha1naaBaaaleaacqGFaeFqdaWgaaadbaGaey4kaScabeaaaSqabaGccqGHLjYScqaIWaamaeaacqGHuhY2cuqFJbWygaqcamaaDaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkdaWgaaadbaGaey4kaScabeaaaSqaaiabdghaXbaakiabigdaXmaaCaaaleqabaGaemivaqfaaOGae8NPdy0aa0baaSqaaiab+bq8bjabcIcaOiabeU7aSjabcMcaPmaaBaaameaacqGHRaWkaeqaaaWcbaGaemivaqfaaOGaemyDau3aaSbaaSqaaiab+bq8bjabcIcaOiabeU7aSjabcMcaPmaaBaaameaacqGHRaWkaeqaaaWcbeaakiabgwMiZkabicdaWiabc6caUaaaaaa@B908@
(32)

When further recalling the fact from [39] that Φ A ( λ ) T u A ( λ ) = B A ( λ ) 1 A ( λ ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacceGae8NPdy0aa0baaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqaaiabdsfaubaakiabdwha1naaBaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeqaaOGaeyypa0JaemOqai0aaSbaaSqaaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqabaGccqaIXaqmdaWgaaWcbaGae4haXhKaeiikaGIaeq4UdWMaeiykaKcabeaaaaa@51F7@ , this can be reduced to

( c ^ A ( λ ) + q ) T B A ( λ ) + 1 A ( λ ) + 0 , MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWaaeWaaeaaieWacuWFJbWygaqcamaaDaaaleaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkdaWgaaadbaGaey4kaScabeaaaSqaaiabdghaXbaaaOGaayjkaiaawMcaamaaCaaaleqabaGaemivaqfaaOGaemOqai0aaSbaaSqaaiab+bq8bjabcIcaOiabeU7aSjabcMcaPmaaBaaameaacqGHRaWkaeqaaaWcbeaakiabigdaXmaaBaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkdaWgaaadbaGaey4kaScabeaaaSqabaGccqGHLjYScqaIWaamcqGGSaalaaa@5479@
(33)

but as B A ( λ ) + MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemOqai0aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab=bq8bjabcIcaOiabeU7aSjabcMcaPmaaBaaameaacqGHRaWkaeqaaaWcbeaaaaa@3D02@ is strictly positive by definition, it follows that

( c ^ A ( λ ) + q ) T 1 A ( λ ) + 0 i ( c ^ A ( λ ) + q ) i 0. MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeGabiqaaaqaamaabmaabaacbmGaf83yamMbaKaadaqhaaWcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae4haXhKaeiikaGIaeq4UdWMaeiykaKYaaSbaaWqaaiabgUcaRaqabaaaleaacqWGXbqCaaaakiaawIcacaGLPaaadaahaaWcbeqaaiabdsfaubaakiabigdaXmaaBaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkdaWgaaadbaGaey4kaScabeaaaSqabaGccqGHLjYScqaIWaamaeaacqGHuhY2daaeqbqaamaabmaabaGaf83yamMbaKaadaqhaaWcbaGae4haXhKaeiikaGIaeq4UdWMaeiykaKYaaSbaaWqaaiabgUcaRaqabaaaleaacqWGXbqCaaaakiaawIcacaGLPaaadaWgaaWcbaGaemyAaKgabeaakiabgwMiZkabicdaWiabc6caUaWcbaGaemyAaKgabeqdcqGHris5aaaaaaa@61ED@
(34)

This is always fulfilled for the non-negative lasso as it is the constraint on its initial optimization problem. The case of the removal of ϕ k from Φ A ( λ ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacceGae8NPdy0aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqabaaaaa@3C59@ can be argued almost identically with the only difference being that now

Φ A ( λ ) + c ^ A ( λ ) + q = Φ A ( λ ) c ^ A ( λ ) q + γ ˜ u A ( λ ) + MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacceGae8NPdy0aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab+bq8bjabcIcaOiabeU7aSjabcMcaPmaaBaaameaacqGHRaWkaeqaaaWcbeaaieWakiqb9ngaJzaajaWaa0baaSqaaiab+bq8bjabcIcaOiabeU7aSjabcMcaPmaaBaaameaacqGHRaWkaeqaaaWcbaGaemyCaehaaOGaeyypa0Jae8NPdy0aaSbaaSqaaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqabaGccuqFJbWygaqcamaaDaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkaeaacqWGXbqCaaGccqGHRaWkcuaHZoWzgaacaiabdwha1naaBaaaleaacqGFaeFqcqGGOaakcqaH7oaBcqGGPaqkdaWgaaadbaGaey4kaScabeaaaSqabaaaaa@6158@
(35)

where

γ ˜ = min γ j > 0 { γ j } MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafq4SdCMbaGaacqGH9aqpdaWfqaqaaiGbc2gaTjabcMgaPjabc6gaUbWcbaGaeq4SdC2aaSbaaWqaaiabdQgaQbqabaWccqGH+aGpcqaIWaamaeqaaOGaei4EaSNaeq4SdC2aaSbaaSqaaiabdQgaQbqabaGccqGG9bqFaaa@3EB0@
(36)

which is also always positive and thus can be dropped from the resulting inequality in exactly the same fashion as γ ^ MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafq4SdCMbaKaaaaa@2D8C@ could be dropped for the case of the addition of a variable. Consequently, changes in Φ A ( λ ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaacceGae8NPdy0aaSbaaSqaamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab+bq8bjabcIcaOiabeU7aSjabcMcaPaqabaaaaa@3C59@ do not change the monotony of the GDF estimate.

C Lower bound properties of BIC min

BIC min is a lower bound for BIC, if kiBIC min (i) ≤ BIC(k),

which equals

N σ ε 2 MSE + d f ( λ i ) log N N σ ε 2 MSE ( λ i ) + d f ( λ k ) log N MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqcfa4aaSaaaeaacqWGobGtaeaacqaHdpWCdaqhaaqaaiabew7aLbqaaiabikdaYaaaaaGccqqGnbqtcqqGtbWucqqGfbqrcqGHRaWkcqWGKbazcqWGMbGzcqGGOaakcqaH7oaBdaWgaaWcbaGaemyAaKgabeaakiabcMcaPiGbcYgaSjabc+gaVjabcEgaNjabd6eaojabgsMiJMqbaoaalaaabaGaemOta4eabaGaeq4Wdm3aa0baaeaacqaH1oqzaeaacqaIYaGmaaaaaOGaeeyta0Kaee4uamLaeeyrauKaeiikaGIaeq4UdW2aaSbaaSqaaiabdMgaPbqabaGccqGGPaqkcqGHRaWkcqWGKbazcqWGMbGzcqGGOaakcqaH7oaBdaWgaaWcbaGaem4AaSgabeaakiabcMcaPiGbcYgaSjabc+gaVjabcEgaNjabd6eaobaa@61AB@
(38)

which is always fulfilled because MSE ≤ MSE(λ i ) and df (λ i ) ≤ df (λ k ) for ik and N ≥ 1, σ ε 2 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeq4Wdm3aa0baaSqaaiabew7aLbqaaiabikdaYaaaaaa@305E@ > 0.

D SNR definition for simulated spectra

Given the undistorted simulated signal s, the effect of Poisson noise is simulated with s i v i , where v i is drawn from a Poisson distribution with mean ks i + 1. The signal-to-noise ratio (SNR) thus depends on the parameter k. In order to determine k for a selected set of SNR values, we consider the definition

SNR σ s 2 σ n 2 . MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaee4uamLaeeOta4KaeeOuaiLaeSiuIiucfa4aaSaaaeaacqaHdpWCdaqhaaqaaiabdohaZbqaaiabikdaYaaaaeaacqaHdpWCdaqhaaqaaiabd6gaUbqaaiabikdaYaaaaaGccqGGUaGlaaa@3AF1@
(39)

The empirical variance of the original signal s multiplied by a scalar k is defined as

σ s 2 ( k ) k 2 i = 1 N ( s i s ¯ ) 2 , MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeq4Wdm3aa0baaSqaaiabdohaZbqaaiabikdaYaaakiabcIcaOiabdUgaRjabcMcaPiablcLicjabdUgaRnaaCaaaleqabaGaeGOmaidaaOWaaabCaeaacqGGOaakcqWGZbWCdaWgaaWcbaGaemyAaKgabeaakiabgkHiTiqbdohaZzaaraGaeiykaKYaaWbaaSqabeaacqaIYaGmaaaabaGaemyAaKMaeyypa0JaeGymaedabaGaemOta4eaniabggHiLdGccqGGSaalaaa@4738@
(40)

where s ¯ MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafm4CamNbaebaaaa@2D5C@ denotes the mean over all s i . For Poisson noise, location and dispersion parameters coincide, i.e. with X ~ P MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemiuaafaaa@2CFE@ (λ) we have Var(X) = E MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWefv3ySLgznfgDOjdaryqr1ngBPrginfgDObcv39gaiqaacqWFecFraaa@37B5@ (X) = λ, and we approximate the variance of a set of Poisson variables n i ~ P MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemiuaafaaa@2CFE@ (ks i ), i = 1, ..., N by their average

σ n 2 ( k ) 1 N i = 1 N k s i . MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeq4Wdm3aa0baaSqaaiabd6gaUbqaaiabikdaYaaakiabcIcaOiabdUgaRjabcMcaPiablcLicLqbaoaalaaabaGaeGymaedabaGaemOta4eaaOWaaabCaeaacqWGRbWAcqWGZbWCdaWgaaWcbaGaemyAaKgabeaaaeaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGobGta0GaeyyeIuoakiabc6caUaaa@4377@
(41)

For a given SNR, this allows the estimation of k because

SNR = σ s 2 ( k ) σ n 2 ( k ) = k σ s 2 σ n 2 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaee4uamLaeeOta4KaeeOuaiLaeyypa0tcfa4aaSaaaeaacqaHdpWCdaqhaaqaaiabdohaZbqaaiabikdaYaaacqGGOaakcqWGRbWAcqGGPaqkaeaacqaHdpWCdaqhaaqaaiabd6gaUbqaaiabikdaYaaacqGGOaakcqWGRbWAcqGGPaqkaaGccqGH9aqpcqWGRbWAjuaGdaWcaaqaaiabeo8aZnaaDaaabaGaem4CamhabaGaeGOmaidaaaqaaiabeo8aZnaaDaaabaGaemOBa4gabaGaeGOmaidaaaaaaaa@4B75@
(42)

and thus

k = σ n 2 σ s 2 SNR . MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaem4AaSMaeyypa0tcfa4aaSaaaeaacqaHdpWCdaqhaaqaaiabd6gaUbqaaiabikdaYaaaaeaacqaHdpWCdaqhaaqaaiabdohaZbqaaiabikdaYaaaaaGccqqGtbWucqqGobGtcqqGsbGucqGGUaGlaaa@3C11@
(43)

References

  1. Jensen ON: Interpreting the protein language using proteomics. Nature Reviews Molecular Cell Biology 2006, 7(6):391–403. 10.1038/nrm1939

    Article  CAS  PubMed  Google Scholar 

  2. Beretta L: Proteomics from the Clinical Perspective: Many Hopes and Much Debate. Nature Methods 2007, 4(10):785–786. 10.1038/nmeth1007-785

    Article  CAS  PubMed  Google Scholar 

  3. Schwartz SA, Weil RJ, Johnson MD, Toms SA, Caprioli RM: Protein Profiling in Brain Tumors Using Mass Spectrometry: Feasibility of a New Technique for the Analysis of Protein Expression. Clinical Cancer Research 2004, 10: 981–987. 10.1158/1078-0432.CCR-0927-3

    Article  CAS  PubMed  Google Scholar 

  4. Claydon MA, Davey SN, Edwards-Jones V, Gordon DB: The Rapid Identification of Intact Microorganisms Using Mass Spectrometry. Nature Biotechnology 1996, 14: 1584–1586. 10.1038/nbt1196-1584

    Article  CAS  PubMed  Google Scholar 

  5. Pineda FJ, Antoine MD, Demirev PA, Feldman AB, Jackman J, Longenecker M, Lin JS: Microorganism Identification by Matrix-Assisted Laser/Desorption Ionization Mass Spectrometry and Model-Derived Ribosomal Protein Biomarkers. Analytical Chemistry 2003, 75(15):3817–3822. 10.1021/ac034069b

    Article  CAS  PubMed  Google Scholar 

  6. Zhang Z, Marshall AG: A Universal Algorithm for Fast and Automated Charge State Deconvolution of Electrospray Mass-to-Charge Ratio Spectra. Journal of the American Society for Mass Spectrometry 1998, 9(3):225–33. 10.1016/S1044-0305(97)00284-5

    Article  CAS  PubMed  Google Scholar 

  7. Yu W, Wu B, Lin N, Stone K, Williams K, Zhao H: Detecting and Aligning Peaks in Mass Spectrometry Data with Applications to MALDI. Computational Biology and Chemistry 2006, 30: 27–38. 10.1016/j.compbiolchem.2005.10.006

    Article  CAS  PubMed  Google Scholar 

  8. Senko M, Beu S, McLafferty F: Determination of Monoisotopic Masses and Ion Populations for Large Biomolecules from Resolved Isotopic Distributions. Journal of the American Society for Mass Spectrometry 1995, 6: 229–233. 10.1016/1044-0305(95)00017-8

    Article  CAS  PubMed  Google Scholar 

  9. Horn DM, Zubarev RA, McLafferty FW: Automated Reduction and Interpretation of High Resolution Electrospray Mass Spectra of Large Molecules. Journal of the American Society for Mass Spectrometry 2000, 11(4):320–332. 10.1016/S1044-0305(99)00157-9

    Article  CAS  PubMed  Google Scholar 

  10. Wehofsky M, Hoffmann R, Hubert M, Spengler B: Isotopic Deconvolution of Matrix-Assisted Laser Desorption/Ionization Mass Spectra for Substance-Class Specific Analysis of Complex Samples. European Journal of Mass Spectrometry 2001, 7: 39–46. 10.1255/ejms.387

    Article  CAS  Google Scholar 

  11. Gras R, Muller M, Gasteiger E, Gay S, Binz PA, Bienvenut W, Hoogland C, Sanches JC, Bairoch A, Hochstrasser DF, Appel RD: Improving Protein Identification from Peptide Mass Fingerprinting through a Parameterized Multi-Level Scoring Algorithm and an Optimized Peak Detection. Electrophoresis 1999, 20: 3535–3550. 10.1002/(SICI)1522-2683(19991201)20:18<3535::AID-ELPS3535>3.0.CO;2-J

    Article  CAS  PubMed  Google Scholar 

  12. Rockwood A, Van Orden S, Smith R: Rapid Calculation of Isotope Distributions. Analytical Chemistry 1995, 67: 2699–2704. 10.1021/ac00111a031

    Article  CAS  Google Scholar 

  13. Rockwood A, Van Orden SL, Smith RD: Ultrahigh-Speed Calculation of Isotope Distributions. Analytical Chemistry 1996, 68: 2027–2030. 10.1021/ac951158i

    Article  CAS  PubMed  Google Scholar 

  14. Rockwood A, Haimi P: Efficient Calculation of Accurate Masses of Isotopic Peaks. Journal of the American Society for Mass Spectrometry 2006, 17: 415–419. 10.1016/j.jasms.2005.12.001

    Article  CAS  PubMed  Google Scholar 

  15. Yergey JA: A General Approach to Calculating Isotopic Distributions for Mass Spectrometry. International Journal of Mass Spectrometry and Ion Physics 1983, 52: 337–349. 10.1016/0020-7381(83)85053-0

    Article  CAS  Google Scholar 

  16. Senko M: Isopro 3.0.1997. [http://members.aol.com/msmssoft/]

    Google Scholar 

  17. Breen EJ, Hopwood FG, Williams KL, Wilkins MR: Automatic Poisson Peak Harvesting for High Throughput Protein Identification. Electrophoresis 2000, 21: 2243–2251. 10.1002/1522-2683(20000601)21:11<2243::AID-ELPS2243>3.0.CO;2-K

    Article  CAS  PubMed  Google Scholar 

  18. Chen L, Sze SK, Yang H: Automated Intensity Descent Algorithm for Interpretation of Complex High-Resolution Mass Spectra. Analytical Chemistry 2006, 78: 5006–5018. 10.1021/ac060099d

    Article  CAS  PubMed  Google Scholar 

  19. Kaur P, O'Connor PB: Algorithms for automatic interpretation of high resolution mass spectra. Journal of the American Society for Mass Spectrometry 2006, 17(3):459–468. 10.1016/j.jasms.2005.11.024

    Article  CAS  PubMed  Google Scholar 

  20. Szymura JA, Lamkiewicz J: Band Composition Analysis: a new Procedure for Deconvolution of the Mass Spectra of Organometallic Compounds. Journal of Mass Spectrometry 2003, 38: 817–822. 10.1002/jms.499

    Article  CAS  PubMed  Google Scholar 

  21. Wehofsky M, Hoffmann R: Automated Deconvolution and Deisotoping of Electrospray Mass Spectra. Journal of Mass Spectrometry 2002, 37: 223–229. 10.1002/jms.278

    Article  CAS  PubMed  Google Scholar 

  22. Zhang X, Hines W, Adamec J, Asara JM, Naylor S, Regnier FE: An Automated Method for the Analysis of Stable Isotope Labeling Data in Proteomics. Journal of the American Society for Mass Spectrometry 2005, 16: 1181–1191. 10.1016/j.jasms.2005.03.016

    Article  CAS  PubMed  Google Scholar 

  23. Mason CJ, Therneau TM, Eckel-Passow JE, Johnson KL, Oberg AL, Olson JE, Nair KS, Muddiman DC, Bergen HRI: A Method for Automatically Interpreting Mass Spectra of 18O Labeled Isotopic Clusters. Molecular & Cellular Proteomics 2006, 6: 305–318. 10.1074/mcp.M600148-MCP200

    Article  Google Scholar 

  24. Wang W, Zhou H, Lin H, Roy S, Shaler TA, Hill LR, Norton S, Kumar P, Anderle M, Becker CH: Quantification of Proteins and Metabolites by Mass Spectrometry without Isotopic Labeling or Spiked Standards. Analytical Chemistry 2003, 75: 4818–4826. 10.1021/ac026468x

    Article  CAS  PubMed  Google Scholar 

  25. Senko MW, Beu SC, McLafferty FW: Automated Assignment of Charge States from Resolved Isotopic Peaks for Multiply Charged Ions. Journal of the American Society for Mass Spectrometry 1995, 6: 52–56. 10.1016/1044-0305(94)00091-D

    Article  CAS  PubMed  Google Scholar 

  26. Tabb DL, Shah MB, Strader MB, Conelly HM, Hettich RL, Hurst GB: Determination of Peptide and Protein ion Charge States by Fourier Transformation of Isotope-Resolved Mass Spectra. Journal of the American Society for Mass Spectrometry 2006, 17: 903–915. 10.1016/j.jasms.2006.02.003

    Article  CAS  PubMed  Google Scholar 

  27. Listgarten J, Emili A: Statistical and Computational Methods for Comparative Proteomic Profiling Using Liquid Chromatography-Tandem Mass Spectrometry. Molecular and Cellular Proteomics 2005, 4(4):419–434. 10.1074/mcp.R500005-MCP200

    Article  CAS  PubMed  Google Scholar 

  28. Fernández-de-Cossio J, Gonzalez LJ, Satomi Y, Betancourt L, Ramos Y, Huerta V, Besada V, Padron G, Minamino N, Takao T: Automated Interpretation of Mass Spectra of Complex Mixtures by Matching of Isotope Peak Distributions. Rapid Communications in Mass Spectrometry 2004, 18: 2465–2472. 10.1002/rcm.1647

    Article  PubMed  Google Scholar 

  29. Roussis SG, Proulx R: Reduction of Chemical Formulas from the Isotopic Peak Distributions of High-Resolution Mass Spectra. Analytical Chemistry 2003, 75: 1470–1482. 10.1021/ac020516w

    Article  CAS  PubMed  Google Scholar 

  30. Samuelsson J, Dalevi D, Levander F, Rögnvaldsson T: Modular, Scriptable and Automated Analysis Tools for High-Throughput Peptide Mass Fingerprinting. Bioinformatics 2004, 20: 3628–3635. 10.1093/bioinformatics/bth460

    Article  CAS  PubMed  Google Scholar 

  31. Du P, Angeletti RH: Automatic Deconvolution of Isotope-Resolved Mass Spectra Using Variable Selection and Quantized Peptide Mass Distribution. Analytical Chemistry 2006, 78: 3385–3392. 10.1021/ac052212q

    Article  CAS  PubMed  Google Scholar 

  32. Tibshirani R: Regression Shrinkage and Selection via the LASSO. Journal of the Royal Statistical Society 1996, Series B 58: 267–288.

    Google Scholar 

  33. Kaur P, O'Connor PB: Use of Statistical Methods for Estimation of Total Number of Charges in a Mass Spectrometry Experiment. Analytical Chemistry 2004, 76: 2756–2762. 10.1021/ac035334w

    Article  CAS  PubMed  Google Scholar 

  34. Casella G, Berger RL: Statistical Inference. Duxbury Press; 2001.

    Google Scholar 

  35. Lawson CL, Hanson RJ: Solving Least Squares Problems. Prentice-Hall, Englewood Cliffs, N J; 1974.

    Google Scholar 

  36. Park MY, Hastie T: An L1Regularization-path Algorithm for Generalized Linear Models. Journal of the Royal Statistical Society, Series B 2007, 69: 659–677. 10.1111/j.1467-9868.2007.00607.x

    Article  Google Scholar 

  37. Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning; Data Mining, Inference, and Prediction. Springer Verlag New York; 2001.

    Google Scholar 

  38. Ye J: On Measuring and Correcting the Effects of Data Mining and Model Selection. Journal of the American Statistical Association 1998, 93: 120–131. 10.2307/2669609

    Article  Google Scholar 

  39. Efron B, Hastie T, Johnstone I, Tibshirani R: Least Angle Regression. Annals of Statistics 2004, 32(2):407–499. 10.1214/009053604000000067

    Article  Google Scholar 

  40. Zou H, Hastie T, Tibshirani R: On the "Degrees of Freedom" of the Lasso. Annals of Statistics 2007, 35(5):2173–2192. 10.1214/009053607000000127

    Article  Google Scholar 

  41. Bairoch A, Apweiler R: The SWISS-PROT Protein Sequence Database and its Supplement TrEMBL in 2000. Nucleic Acids Research 2000, 28: 45–48. 10.1093/nar/28.1.45

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  42. Tibshirani R, Hastie T, Narasimhan B, Soltys S, Shi G, Koong A, Le QT: Sample Classification from Protein Mass Spectrometry, by Peak Probability Contrasts. Bioinformatics 2004, 20(17):3034–3044. 10.1093/bioinformatics/bth357

    Article  CAS  PubMed  Google Scholar 

  43. Wallace WE, Kearsley AJ, Guttman CM: An Operator-Independent Approach to Mass Spectral Peak Identification and Integration. Analytical Chemistry 2004, 76: 2446–2452. 10.1021/ac0354701

    Article  CAS  PubMed  Google Scholar 

  44. Kearsley AJ, Wallace WE, Bernal J, Guttman CM: A Numerical Method for Mass Spectral Data Analysis. Applied Mathematics Letters 2005, 18: 1412–1417. 10.1016/j.aml.2005.02.033

    Article  Google Scholar 

  45. Mann M: Useful Tables of Possible and Probable Peptide Masses. 43rd Conference on Mass Spectrometry and Allied Topics 1995.

    Google Scholar 

  46. Rockwood AL, Kushnir MM, Nelson GJ: Dissociation of individual isotopic peaks: predicting isotopic distributions of product ions in MSn. Journal of the American Society for Mass Spectrometry 2003, 14(4):311–22. 10.1016/S1044-0305(03)00062-X

    Article  CAS  PubMed  Google Scholar 

Download references

Acknowledgements

The authors would like to thank Yin Yin Lin (Dept. of Pathology, Children's Hospital, Boston, MA, USA) for LC/MS data acquisition, Lyle Burton (AB/MDS Sciex, Concord, Canada) for MarkerView 1.2 evaluation versions, Ullrich Köthe, Linus Görlitz, Björn Menze, Michael Kelm (Interdisciplinary Center for Scientific Computing (IWR), University of Heidelberg, Germany), and Flavio Monigatti (Dept. of Pathology, Children's Hospital, Boston, MA, USA) for comments, suggestions, and fruitful discussions. We gratefully acknowledge financial support by the Hans L. Merkle foundation (M.K.), the Karl Steinbuch Scholarship (B.Y.R.), dm/Filiadata GmbH (B.Y.R.), Robert Bosch GmbH (F.A.H.), the Children's Hospital Trust (J.A.J.S. and H.S.), and the DFG under grant no. HA4364/2-1 (B.Y.R., F.A.H).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Fred A Hamprecht.

Additional information

Authors' contributions

BYR and MK have developed the methodology, implemented the software, carried out the data analysis and drafted the manuscript. HS and JAJS have contributed to the basic methodology and the manuscript, carried out critical review and provided application feedback and evaluation for the proposed methods. FAH has suggested the fractional averagine approach, and has contributed to the manuscript and the overall project design. All authors have read and approved the final manuscript.

Bernhard Y Renard, Marc Kirchner contributed equally to this work.

Electronic supplementary material

Additional file 1: A zip folder containing all simulation files (R data files). (ZIP 12 MB)

12859_2008_2340_MOESM2_ESM.zip

Additional file 2: The zipped original LC/MS .wiff-file on which MarkerView was run (as acquired by the AB/Sciex QStar instrument) (ZIP 2 MB)

12859_2008_2340_MOESM3_ESM.txt

Additional file 3: The original spectrum of BSA-sample.wiff integrated over retention time (23.817-29.278 minutes) on which NITPICK was run. (TXT 570 KB)

Additional file 4: A zip-folder containing all pepex results on the simulated data. (ZIP 16 MB)

12859_2008_2340_MOESM5_ESM.zip

Additional file 5: A zip-folder containing all NITPICK results on the simulated data sets and for each SNR (SNR in 5, 10, 25, 50, 100) a R data file called • resultList_0_’SNR’_0.1.RDA gives the peaks found by NITPICK for all spectra of a certain SNR • lengthResultList_0_’SNR’_0.1.RDA gives the number of peaks found by NITPICK for each spectrum • correct_0_’SNR’_0.1.RDA gives the number of correctly identified peaks found by NITPICK for each spectrum • tooMany_0_’SNR’_0.1.RDA gives the number of incorrectly identified peaks found by NITPICK for each spectrum • pp_resultList_0_3_0_’SNR’_0.1.RDA gives the peaks found by NITPICK for all spectra of a certain SNR after postprocessing with g=3 • lengthResultList_0_3_0_’SNR’_0.1.RDA gives the number of peaks found by NITPICK for each spectrum after postprocessing with g=3 • correct_0_3_0_’SNR’_0.1.RDA gives the number of correctly identified peaks found by NITPICK for each spectrum after postprocessing with g=3 • tooMany_0 3_0_’SNR’_0.1.RDA gives the number of incorrectly identified peaks found by NITPICK for each spectrum after postprocessing with g=3 (ZIP 1 MB)

12859_2008_2340_MOESM6_ESM.xls

Additional file 6: Excel sheet containing the peaks detected by NITPICK (mz-position, charge, intensity) as well as their manual validation. (XLS 50 KB)

12859_2008_2340_MOESM7_ESM.xls

Additional file 7: Excel sheet containing the peaks detected by MarkerView (mz-position, charge, if available) as well as their manual validation. (XLS 109 KB)

Authors’ original submitted files for images

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Renard, B.Y., Kirchner, M., Steen, H. et al. NITPICK: peak identification for mass spectrometry data. BMC Bioinformatics 9, 355 (2008). https://doi.org/10.1186/1471-2105-9-355

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/1471-2105-9-355

Keywords