Institute of Informatics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland

Mossakowski Medical Research Centre PAS, Pawinskiego 5, 02-106 Warsaw, Poland

Department of Oncological Genetics, Maria Skłodowska-Curie Memorial Cancer Center and Institute of Oncology, 02-781 Warsaw, Poland

Department of Gastroenterology, Medical Center for Postgraduate Education, 01-813 Warsaw, Poland

College of Inter-Faculty Individual Studies in Mathematics and Natural Sciences, University of Warsaw, Zwirki i Wigury 93, 02-089 Warsaw, Poland

Abstract

Background

In this paper we model the serum proteolysis process from tandem mass spectrometry data. The parameters of the peptide degradation process inferred from LC-MS/MS data correspond directly to the activity of specific enzymes present in the serum samples of patients and healthy donors. Our approach integrates the existing knowledge about peptidase activity stored in the MEROPS database with an efficient procedure for estimating the model parameters.

Results

Taking into account the inherent stochasticity of the process, we model the proteolytic activity with the Chemical Master Equation (CME). Assuming stationarity of the Markov process, we calculate the expected values of digested peptides in the model. The parameters are fitted to minimize the discrepancy between those expected values and the peptide activities observed in the MS data. The constrained optimization problem is solved by the Levenberg-Marquardt algorithm.

Conclusions

Our results demonstrate the feasibility and potential of high-level analysis for LC-MS proteomic data. The estimated enzyme activities give insights into the molecular pathology of colorectal cancer. Moreover, the developed framework is general and can be applied to study proteolytic activity in different systems.

Background

Motivation and related research

Recent advances in high-throughput technologies, which evaluate tens of thousands of genes or proteins in a single experiment, are providing new methods for identifying biochemical determinants of the disease process. One of the experimental technologies that allows us to study the molecular basis underlying a specific disease phenotype is mass spectrometry (MS).

Paradoxically, one can take advantage of these findings in cancer diagnostics

As hardware and software develop, we obtain increasingly accurate estimates of peptide concentrations in body fluids, which give insight into the peptide degradation process. Proteolysis, as modeled in this paper, is the process in which a protein is broken down partially, into peptides, or completely, into amino acids, by proteolytic enzymes present in blood serum. Proteolytic enzymes fall into two main groups. One group includes

Our results

In this paper we present a formal mathematical model describing serum proteolysis dynamics. We focus here on the activity of peptide-cutting enzymes (peptidases). The model parameters are inferred from liquid chromatography tandem mass spectrometry (LC-MS/MS) data.

The dynamic changes in peptide composition caused by proteolytic degradation are described by a network of biochemical reactions. The network corresponds to a Markov process whose evolution is governed by a system of differential equations for the state probabilities (i.e. the Chemical Master Equation).
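For reference, the general form of the Chemical Master Equation for a reaction network can be written as below. The notation is standard textbook notation and an assumption here, not taken from this paper: $\mathbf{x}$ is the vector of peptide counts, $P(\mathbf{x},t)$ the probability of state $\mathbf{x}$ at time $t$, $a_j$ the propensity of reaction $j$, $\boldsymbol{\nu}_j$ its state-change vector, and $M$ the number of reactions.

```latex
\frac{\partial P(\mathbf{x},t)}{\partial t}
  = \sum_{j=1}^{M} \Bigl[ a_j(\mathbf{x}-\boldsymbol{\nu}_j)\,P(\mathbf{x}-\boldsymbol{\nu}_j,t)
  - a_j(\mathbf{x})\,P(\mathbf{x},t) \Bigr]
```

Under these assumptions, a cleavage event that splits substrate $s$ into prefix $p$ and suffix $q$ would have $\boldsymbol{\nu}_j = -\mathbf{e}_s + \mathbf{e}_p + \mathbf{e}_q$. If its propensity is first order, $a_j(\mathbf{x}) = c_j x_s$, the expected counts obey a closed linear system of ordinary differential equations, which is what makes stationary expectations tractable.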

The current approach significantly extends the exopeptidase activity model presented in

Organization of the paper

We start with a description of our model, presented with the use of the so-called

Model of proteolysis process

To illustrate the process of peptide degradation we introduce the

By a proteolytic event we mean the cleavage of a specific substrate at a specific site by a specific peptidase. Hence each event node is labelled by a peptidase and has one incoming edge and two outgoing edges (leading to the peptide prefix and suffix obtained by cutting the substrate at a single site).

We now visualize the peptide subsequences as particles placed at the peptide nodes of the cleavage graph. The particles flow through the edges of the graph according to the Petri net operational semantics, i.e. a transition (event node) consumes one substrate particle and produces two particles. To ensure the stationarity of the system we allow for the creation and degradation of particles at any node. We also add a source and a sink to the graph, modeling the creation of precursor peptides (e.g. caused by the activity of some endopeptidases, which is not captured by our model) and the complete degradation of short peptides. The cleavage graph is constructed for every processed MS sample. The peptide nodes are filled with mass spectrometry readouts, and specific enzymes are assigned to event nodes according to data about real cleavage events (see the next section for details).
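The firing rule described above can be illustrated with a minimal sketch. The class and function names, the peptidase label and the toy sequence are all hypothetical and serve only to show one transition consuming a substrate particle and producing a prefix and a suffix particle.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CleavageEvent:
    """Event node: one peptidase cutting one substrate at one site."""
    peptidase: str   # label of the event node (hypothetical name)
    substrate: str   # peptide sequence on the incoming edge
    site: int        # cut position: substrate[:site] | substrate[site:]

    def __post_init__(self):
        if not 0 < self.site < len(self.substrate):
            raise ValueError("cut site must lie strictly inside the substrate")

    @property
    def products(self):
        """Prefix and suffix reached by the two outgoing edges."""
        return self.substrate[:self.site], self.substrate[self.site:]


def fire(marking, event):
    """Fire one transition: consume one substrate particle, produce two."""
    if marking[event.substrate] < 1:
        raise ValueError("transition not enabled: no substrate particle")
    new_marking = Counter(marking)      # copy, the input marking is untouched
    new_marking[event.substrate] -= 1
    for fragment in event.products:
        new_marking[fragment] += 1
    return new_marking


# Toy marking: three particles of a made-up precursor peptide.
marking = Counter({"ACDEFGHK": 3})
event = CleavageEvent("hypothetical-carboxypeptidase", "ACDEFGHK", 7)
after = fire(marking, event)
```

Each firing raises the total particle count by one, which is why creation and degradation at the nodes are needed for a stationary regime to exist.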

A small exemplary fragment of the cleavage graph is depicted in Figure

**Exemplary cleavage graph**. The cleavage graph for precursor peptide (

The operation

where

Methods

In this section we describe the process of cleavage graph construction. It has several phases. First, the set of nodes is determined: peptide nodes correspond to the sequences identified in the tandem MS experiment, while event nodes are selected according to the knowledge from the MEROPS database (version 9.4). In the second phase the graph is filled with the appropriate readouts from LC-MS spectra. To this end we have to determine which signal in the two-dimensional spectral map corresponds to a given peptide sequence (i.e. a node in the graph) and assign to this node the number of particles reflecting the signal strength.

Given the cleavage graph, we solve a constrained optimization problem to infer the unknown enzyme activity coefficients that minimize the discrepancy between the expected numbers of peptides (calculated according to the model) and the signals observed in the MS samples.
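The fitting step can be sketched on a toy graph as follows. Everything here is an assumption made for illustration: the four-node graph, the creation and degradation rates, and the first-order mean dynamics are not the paper's actual model. SciPy's bounded trust-region solver (`method="trf"`) stands in for the Levenberg-Marquardt algorithm, because SciPy's `lm` method does not accept bounds. The 7 restarts and the cap of 200 mirror the settings reported in the Results section, although `max_nfev` limits function evaluations rather than iterations.

```python
import numpy as np
from scipy.optimize import least_squares

# Toy cleavage graph (all values are illustrative assumptions).
# Peptide nodes, longest first: 0 = "ABCD", 1 = "ABC", 2 = "BCD", 3 = "BC".
# Each event: (substrate node, product nodes present in the graph, enzyme index).
EVENTS = [
    (0, [1], 0),  # enzyme 0 removes the C-terminal residue: ABCD -> ABC (+ D)
    (2, [3], 0),  # BCD -> BC (+ D)
    (0, [2], 1),  # enzyme 1 removes the N-terminal residue: ABCD -> (A +) BCD
    (1, [3], 1),  # ABC -> (A +) BC
]
CREATION = np.array([10.0, 0.0, 0.0, 0.0])  # source feeds only the precursor
DEGRADATION = np.full(4, 0.5)               # sink rate at every node


def stationary_means(k):
    """Solve A E + b = 0 for the expected particle counts E."""
    A = -np.diag(DEGRADATION)
    for substrate, products, enzyme in EVENTS:
        A[substrate, substrate] -= k[enzyme]   # substrate is consumed
        for p in products:
            A[p, substrate] += k[enzyme]       # product is created
    return np.linalg.solve(A, -CREATION)


def fit(observed, n_starts=7, max_nfev=200, seed=0):
    """Multi-start bounded least squares; keep the run with the lowest cost."""
    rng = np.random.default_rng(seed)
    runs = [
        least_squares(
            lambda k: stationary_means(k) - observed,
            x0=rng.uniform(0.1, 5.0, size=2),
            bounds=(0.0, np.inf),   # activities must be non-negative
            method="trf",
            max_nfev=max_nfev,
        )
        for _ in range(n_starts)
    ]
    return min(runs, key=lambda r: r.cost)


true_k = np.array([2.0, 1.0])              # made-up enzyme activities
best = fit(stationary_means(true_k))       # noiseless synthetic observations
```

Because peptides only get shorter, the toy rate matrix is triangular with a strictly negative diagonal, so the linear solve is well defined for any non-negative activities.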

Cleavage graph construction

Let us define the set of amino acids together with the space letter

For each peptidase

**Frequency matrix for elastase-2 and trypsin-1**. Frequency matrices for elastase-2 (left) and trypsin-1 (right) based on MEROPS database
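As a rough illustration of how such a frequency matrix could be tabulated, the sketch below counts residues position by position over aligned cleavage-site windows. The window width (eight positions, P4 to P4' in the MEROPS site notation), the `-` padding symbol standing in for the space letter, and the four example windows are assumptions made for this sketch; the windows are invented and are not MEROPS records.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"  # stand-in for the "space letter" padding positions past a terminus


def frequency_matrix(windows):
    """Per-position relative frequencies over equally long site windows."""
    width = len(windows[0])
    if any(len(w) != width for w in windows):
        raise ValueError("all windows must have the same length")
    alphabet = AMINO_ACIDS + PAD
    matrix = []
    for pos in range(width):
        counts = Counter(w[pos] for w in windows)
        matrix.append({a: counts[a] / len(windows) for a in alphabet})
    return matrix


# Invented windows; index 3 is P1, the residue just before the scissile bond.
windows = ["AGSKLVDE", "TPVRSGAA", "LLEKAAGT", "-GFRIVEE"]
freq = frequency_matrix(windows)
```

In this toy input the P1 column contains only K and R, which is the pattern one would expect for a trypsin-like peptidase.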

Using the frequency matrix we construct the so-called sequence logo

Affinity coefficients

Let us consider cleavage

We define

Filling the graph with LC-MS readouts

MS samples were acquired from the blood serum of 20 colorectal cancer patients and 19 healthy donors. Each sample was digested with trypsin before LC-MS processing. Having the so-called

Then we look for corresponding signals in MS spectra as follows: using

Constrained optimization

Let us denote by

1. if

2. if

3. if

The graph pruning ensures that

Denote by _{i }

Compositional data

To make the outcome of the estimation procedure comparable across different MS samples, we normalized the vector of parameters corresponding to peptidases' activities. Note that normalization does not change the value of function _{v }

Results

The optimization procedure was applied to infer the enzymatic activity for 39 LC-MS samples, i.e. for each sample we obtained optimal parameters

We ran the Levenberg-Marquardt algorithm (LMA) on each data set 7 times (each time from a different starting point), using a maximum of 200 iterations as the stopping criterion. To measure the quality of estimation we use the relative squared error (rse)

where

Adequacy of the model

To justify the adequacy of the proposed model, we performed the following experiment. The estimation procedure was run to obtain the expected number of peptide sequences

**Comparison of median value of relative squared error**. Comparison of median value of relative squared errors for real data and synthetic data generated according to the model (plot for the sample no. 5 on the left and for the sample no. 19 on the right).

Statistical significance of estimation quality

The optimization procedure yielded rather small relative squared errors for most samples. However, we wanted to know how the final relative squared error depends on the input data, and whether the results we obtained are statistically significant. To answer this question, for each MS sample

Table

Final relative squared errors (rse) and p-values (calculated from rse distribution).

| sample | final rse | p-value |
| --- | --- | --- |
| 19 | 0.008 | |
| 20 | 0.011 | |
| 5 | 0.012 | |
| 1 | 0.016 | 0.001 |
| 9 | 0.034 | 0.001 |
| 2 | 0.026 | 0.005 |
| 14 | 0.091 | 0.005 |
| 13 | 0.061 | 0.007 |
| 4 | 0.063 | 0.008 |
| 10 | 0.065 | 0.008 |
| 11 | 0.031 | 0.01 |
| 30 | 0.136 | 0.021 |
| 6 | 0.125 | 0.026 |
| 29 | 0.163 | 0.029 |
| 15 | 0.058 | 0.032 |
| 28 | 0.185 | 0.057 |
| 7 | 0.131 | 0.076 |
| 24 | 0.11 | 0.078 |
| 16 | 0.134 | 0.093 |
| 32 | 0.392 | 0.11 |
| 8 | 0.156 | 0.144 |
| 23 | 0.379 | 0.215 |
| 33 | 0.45 | 0.23 |
| 22 | 0.358 | 0.234 |
| 27 | 0.483 | 0.257 |
| 26 | 0.471 | 0.29 |
| 21 | 0.367 | 0.301 |
| 18 | 0.262 | 0.317 |
| 25 | 0.521 | 0.324 |
| 12 | 0.332 | 0.434 |
| 34 | 0.589 | 0.436 |
| 38 | 0.589 | 0.44 |
| 35 | 0.628 | 0.452 |
| 31 | 0.436 | 0.623 |
| 37 | 0.478 | 0.529 |
| 36 | 0.567 | 0.64 |
| 3 | 0.637 | 0.729 |

Biological significance of inferred enzymes

Figure

**Peptidases' activities for sample no. 5**. (A) Inferred peptidases' activities for sample no. 5 (healthy donor). (B-D) Same parameters for synthetic data generated from the model with standard deviation set to 0.1, 0.01, 0.001, respectively. Red lines correspond to model peptidases' activities, which we aim to recover.

**Peptidases' activities for sample no. 19**. (A) Inferred peptidases' activities for sample no. 19 (colorectal cancer patient). (B-D) Same parameters for synthetic data generated from the model with standard deviation set to 0.1, 0.01, 0.001, respectively. Red lines correspond to model peptidases' activities, which we aim to recover.

The set of identified enzymes does not vary significantly across the investigated samples: there are 6 peptidases identified in all samples and 19 peptidases found in at least one sample (listed in Figure

**Peptidases' activities**. Peptidases' activities (after clr transformation) for all analyzed samples. The red-white scale represents peptidase activities in descending order.
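The clr transformation named in the caption is presumably the centred log-ratio transform used in compositional data analysis. A minimal sketch follows, assuming strictly positive activities (zeros would need separate treatment); the function names and input values are invented for illustration.

```python
import math


def closure(x):
    """Rescale a positive vector so that its parts sum to one."""
    total = sum(x)
    return [v / total for v in x]


def clr(x):
    """Centred log-ratio: log of each part over the geometric mean of all parts."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


activities = [4.0, 1.0, 0.5, 2.5]   # made-up positive activity estimates
z = clr(closure(activities))
```

The transformed values sum to zero and do not depend on the overall scale of the input, so applying clr before or after normalization gives the same result.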

Heatmap in Figure

We conducted principal component analysis on the enzyme activities inferred for the 19 samples with the smallest p-values (cf. Table

**Principal component analysis**. Principal component analysis scatterplot for the 19 samples with the best p-values. The corresponding loadings are shown in the left panel.
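For readers who want to reproduce this kind of plot, scores and loadings can be obtained from a singular value decomposition of the column-centred data. This is a generic sketch rather than the procedure used for the figure; the random matrix is only a placeholder for a samples-by-peptidases table of clr-transformed activities, and its 19 x 6 shape is an assumption.

```python
import numpy as np


def pca(X):
    """PCA of a samples-by-variables matrix via SVD of the column-centred data."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                      # sample coordinates on the principal axes
    loadings = Vt.T                     # variable weights defining each axis
    explained = s**2 / np.sum(s**2)     # fraction of variance per component
    return scores, loadings, explained


rng = np.random.default_rng(0)
X = rng.normal(size=(19, 6))            # placeholder data, not real activities
scores, loadings, explained = pca(X)
```

The first two columns of `scores` give the scatterplot coordinates, and the first two columns of `loadings` show how strongly each variable contributes to those axes.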

Conclusions

In this paper we significantly extend the formal model of protein degradation proposed in

Being aware of the problems with the quality and reproducibility of LC-MS experiments, we selected for detailed analysis only a part of the accessible data, namely those samples for which the parameter estimation procedure converges and yields a small error. The expected retention time for the investigated substances is obtained by a rather unsophisticated approach (i.e. a linear regression model), which may affect the analysis. Preliminary outcomes for these samples are very promising: the identified enzymes are known to play a crucial role in colorectal cancer. However, our results are far from any medical diagnosis. The proposed method constitutes a proof of concept and requires more thorough investigation meeting all clinical standards. We should also discuss the limitations of our method when applied to MS data obtained with present technologies. Many tryptic peptides are not identified by tandem mass spectrometry. LC-MS signals corresponding to these peptides and their degraded forms are missed during the cleavage graph filling phase. Therefore the inference of proteolytic enzymes' activities is based on only partial information and could be incomplete as well. However, it is worth noting that our method would demonstrate its full potential when applied to the high-quality data that future MS technologies will hopefully provide. One direction for further development is to focus on cleavage detection and to apply recently proposed

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

AG developed the strategy for the study and prepared the final version of the manuscript. PD implemented the algorithms for estimating enzymatic activity and participated in drafting the manuscript. JO and JK provided the LC-MS/MS samples and participated in the design of the study. All authors read and approved the final manuscript.

Acknowledgements

The authors would like to thank Neil Rawlings for helpful information about MEROPS and Bogusław Kluge for the source code from

This article has been published as part of