European Brain Research Institute, Via Fosso del Fiorano 64, Roma, Italy

Lay Line Genomics SpA, S.Raffaele Science Park, Castel Romano, Italy

International School of Advanced Studies (SISSA/ISAS), Biophysics Dept., Via Beirut 2-4, Trieste, Italy

ENEA, Casaccia Research Center, Computing and Modelling Unit, Via Anguillarese 301, S.Maria di Galeria, Italy

Ylichron Srl, c/o ENEA, Casaccia Research Center, Via Anguillarese 301, S.Maria di Galeria, Italy

Abstract

Background

The "inverse" problem concerns the determination of unknown causes on the basis of the observation of their effects. This is the opposite of the corresponding "direct" problem, which concerns the prediction of the effects generated by a complete description of some agencies. The solution of an inverse problem entails the construction of a mathematical model and starts from a number of experimental data. In this respect, inverse problems are often ill-conditioned, as the amount of experimental data available is often insufficient to unambiguously solve the mathematical model. Several approaches to solving inverse problems are possible, both computational and experimental, some of which are mentioned in this article. In this work, we describe in detail an attempt to solve an inverse problem which arose in the study of an intracellular signaling pathway.

Results

Using a genetic algorithm to find a sub-optimal solution to the optimization problem, we have estimated a set of unknown parameters describing a kinetic model of a signaling pathway in the neuronal cell. The model is composed of mass-action ordinary differential equations, in which the kinetic parameters describe protein-protein interactions and protein synthesis and degradation. The algorithm was implemented on a parallel platform. Several potential solutions of the problem were computed, each solution being a set of model parameters. A subset of parameters was then selected on the basis of their small coefficient of variation across the ensemble of solutions.

Conclusion

Despite the lack of sufficiently reliable and homogeneous experimental data, the genetic algorithm approach has allowed us to estimate the approximate value of a number of model parameters in a kinetic model of a signaling pathway: these parameters were assessed to be relevant for the reproduction of the available experimental data.

Background

The "inverse" problem concerns the determination of unknown causes on the basis of the observation of their effects. This is the opposite of the corresponding "direct" problem, which concerns the prediction of the effects generated by a complete description of some agencies. Typical inverse problems in electrocardiology are related to the modelling of the functional structure of the human heart from surface electrocardiogram (ECG) signals.

The solution of an inverse problem entails the construction of a mathematical model and starts from a number of experimental data. In this respect, inverse problems are often ill-conditioned, as the amount of experimental data available is often insufficient to unambiguously solve the mathematical model. Moreover, as model construction usually depends upon the minimization of specific functions, such as the system energy or the difference between the model prediction and some given experimental results, its solution does not necessarily lead to a single global optimum but to a set of optimal solutions, defining what is called the "Pareto optimal frontier" in the space of solutions.

In this work, we attempt to solve an inverse problem which arose in the study of a signalling pathway. Compared to pathways of metabolic reactions, which are of limited size, comprising up to a few hundred proteins, signalling processes involve about 20% of the genome, i.e. thousands of expressed proteins. Building a model of a signalling network requires the knowledge of:

• The species involved in molecular interactions, including chemical reactions

• How the interactions connect the chemical actors and form a signalling network

• How these interactions can be modelled

• The model parameters necessary to computationally simulate the time behaviour of the system.

The mathematical form of the chemical interactions, the model parameters and even the network topology are often only partially known. This implies that model approximations, numerical estimates and, whenever possible, additional specific experimental measurements are necessary to make a numerical simulation feasible and reliable. This is true whatever modelling technique is used, such as differential equations.

Only at the end of this phase do further experimental activity and parameter estimation techniques come into play: wherever possible, purposely designed experiments should be carried out in order to directly measure unknown kinetic parameters, to use these measures as constraints for the estimation algorithm, or to decide between alternative models. If new experiments cannot be done, the parameter estimate must rely on literature data alone.

Databases of protein interactions

Protein interaction maps, partially stored in public databases, contain mainly qualitative information on the connectivity of intracellular p-p interactions, while quantitative data on the kinetics of interactions and reactions are still largely unavailable, except for enzyme kinetics. To date, a number of public databases contain qualitative data on protein interaction maps:

• **iHOP**: genetic and protein interactions are extracted by text mining of literature abstracts

• **Amaze**: built upon a complex object-oriented data model that allows it to represent and analyze molecular interactions and cellular processes; kinetic data can potentially be inserted into the data structure

• **IntAct**: this offers a database and analysis tools for protein interactions

• **Kegg**: a large database that also contains several signalling pathways

• **DIP**: it contains interactions from over 100 organisms

• **IMEx**: a consortium of major public providers of molecular interaction data; current members are DIP, IntAct, MINT, MPact, BioGRID and BIND

• **Reactome**: this is a curated database of biological pathways in human beings

It should be remarked that great care must be taken when dealing with qualitative data: they are often dependent on specific experimental conditions, and most of them were obtained in unicellular organisms. A straightforward extrapolation of these data to higher organisms is often quite unreliable.

The situation is even worse when one analyzes quantitative p-p interaction data in public repositories: the total amount of experimentally derived kinetic data is only a small percentage of what would be needed to characterize the topology data (i.e. the p-p interaction map). Furthermore, the available kinetic constants are often extracted from a single publication where they were measured in vitro, while the kinetics of interactions is highly dependent on the experimental set-up and environmental conditions, such as pH, temperature and the concentration of other proteins in the cellular environment. It is always advisable to assume that the measured quantities indicate ranges rather than precise values, and care must be used when inserting these values into large-scale network models.

This point, however, is already a major concern of Systems Biology: several programs are under way aimed at producing sets of validated data, homogeneously referred to specific organisms under well-defined and standardized thermo-chemical conditions. The standardization of experimental data sets and of experimental models is the object of an intense debate in the Systems Biology community. There is a wide consensus on the need for standards, but also on some drawbacks of a general use of standards as the best research framework in every case. In any case, the trend towards a progressively deeper, though slow, integration of existing datasets, modelling languages and methodologies appears to be set, as witnessed, for example, by the ever wider use of SBML as a language to describe biochemical models, or by the integration of previously separated datasets into a single larger database compliant with new criteria established by international consortia. One example of the latter case is the HUPO-PSI initiative.

p-p interactions in signalling pathways can be divided into two main categories: (a) binding interactions that involve no chemical modifications and (b) biochemical processing, such as phosphorylation and dephosphorylation. The few public sources of kinetic data on binding protein interactions often provide only dissociation constants, i.e. values describing an equilibrium state that offer only partial information about the dynamics of the reaction. To our knowledge, only the KDBI database provides this kind of kinetic information.

A further source of signalling pathway and p-p interaction data, including the kinetic part, are the repositories of biochemical models, though in these models not all the kinetic parameters were measured experimentally and some of them had to be numerically estimated. Among them:

• **Biomodels.Net**: published very recently, it is currently the most curated database of biochemical models, offering tested and verified models in several standard formats, including SBML, CellML and XML

• **JWS Online**: another curated repository of models in SBML and PySces formats

• **CellML**: a repository of biochemical models in CellML format

• **DOQCS**: a large repository of signalling pathways, where all the reactions and kinetic parameters are directly shown; furthermore, the models can be downloaded in the Genesis language

• **ModelDB**: a repository of detailed biochemical and electrophysiological processes in the neuronal cell; the models are written in the Genesis and Neuron languages

Experimental measures of kinetic parameters

The measurement of protein activation levels is of paramount importance to monitor signalling processes. Several methods exist to quantitate the concentration of protein species, such as immunoblotting, ELISA, radioimmunoassay and protein arrays. If a cellular system is sampled several times over the duration of a given signalling process, a time series can be composed describing the time course of a concentration, for example that of a phosphorylated protein. Radioimmunoassays are very sensitive methods but are also complex, expensive and dangerous to set up; protein arrays offer the advantage of a high-throughput approach, while ELISA and immunoblotting are easier to implement and, thus, widely used, though they have a higher threshold of detection.

Enzymatic reactions can nowadays be monitored on a high-throughput scale both in vivo and in vitro: this allows us to measure kinetic parameters characterizing fundamental steps in signalling pathways, such as the binding and removal of phosphate groups by kinases and phosphatases. Bioreactors are widely used to perform enzymatic reactions and other biochemical processes, but their use for real-time monitoring of products is limited by the sampling process. More recent modified reactors allow real-time sampling of multiple reactions in vivo over a short reaction time: the reaction broth flows at constant velocity along a thin pipe, where spilling at uniform space intervals corresponds to uniform time sampling. In this system the samples can be rapidly quenched and analyzed by mass-spectrometry techniques.

"In silico" parameter estimation

When only a few kinetic parameters are available to implement a model of a signaling network, one might resort to attempting a "theoretical" estimate of these values. The attempt could be performed, in principle, by using an "inverse problem" approach, i.e. by optimizing the unknown parameters of a reaction model in order to obtain the best possible agreement between simulated and experimental data.

This is the aim of the present work. We devise a methodological workflow (and the corresponding numerical and computational tools) to estimate the unknown reaction constants of a model signalling pathway, starting from (a) a given set of known reaction constants and (b) experimental results on the time course of some biochemical species involved in the reaction.

An intracellular signal transduction pathway in the neuronal cell was used as a model system to implement the proposed parameter estimation procedure.

The chosen pathway is a protein network downstream of the neurotrophic receptors Trk and P75.

Scheme of a model signaling network

**Scheme of a model signaling network**. Scheme of the signaling network used to demonstrate the validity of the parameter estimation method. The network consists of a series of proteins (the nodes) linked by different types of unary, binary or multiple molecular interactions (shown as the edges of the network). The role of the mitochondrion (in purple) is taken into account. Binding protein-protein interactions are shown by green edges between the nodes, activation and deactivation interactions are in blue and red, respectively, chemical transformations are shown by purple dotted lines, while the release of proteins from the mitochondria is shown in solid purple lines. The signaling process can be activated by the binding of ligands (in grey) to receptors. Every compound is identified by a name and a numerical code.

The p-p interactions, such as molecular binding, phosphorylation/dephosphorylation or chemical transformations, are described using first-order non-linear ordinary differential equations, which also take into account synthesis and degradation processes. The space variable is neglected in this model, since proteins are considered to be close enough to justify the approximation of a geometrical point. The release from the mitochondria was considered to be mathematically equivalent to an additional protein synthesis.

The activation rate of protein B by a protein A follows mass-action kinetics, r_act = k_act [A][B]. The binding of n protein species P_i into a complex C follows n-th-order kinetics, with forward and reverse rate constants k_1 and k_-1 respectively: the association rate is r_ass = k_1 [P_1][P_2]...[P_n] and the dissociation rate is r_diss = k_-1 [C].
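To make the rate laws concrete, here is a minimal Python sketch (the rate constants and concentrations below are purely illustrative, not values from the model):

```python
# Mass-action rates for the binding of n species P_1..P_n into a complex C,
# with forward constant k1 and reverse constant k_minus1 (hypothetical values).
from math import prod

def association_rate(k1, concentrations):
    # r_ass = k1 * [P1][P2]...[Pn]
    return k1 * prod(concentrations)

def dissociation_rate(k_minus1, complex_conc):
    # r_diss = k_-1 * [C]
    return k_minus1 * complex_conc

r_ass = association_rate(0.5, [1.0, 2.0, 0.1])   # 0.5 * 0.2 = 0.1
r_diss = dissociation_rate(0.01, 3.0)            # 0.03
```

The net rate of complex formation is then simply r_ass - r_diss.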

Each of the N = 98 nodes of the network is described by two independent variables, T_i and A_i (i = 1...N): the first refers to the total concentration of the protein species, the second to the concentration of the active fraction of that species. Each protein species obeys a balance equation of the form:

where R_prod,a (R_cons,a), with a = T_i, A_i, represent production (consumption) reactions having the a-species as their object. The complete system of equations describing the system assumes the following explicit mathematical structure:

where M_i,j is the number of different interactions involving nodes i and j, and n_i,j,r is the number of components of the r-th interaction involving node i.
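As an illustration of how such balance equations are integrated in practice, here is a toy one-species version (hypothetical constants; a simple explicit Euler step is used for clarity, whereas a model of the real size and stiffness would call for a dedicated ODE solver):

```python
def simulate(k_act, k_deg, a0=0.0, s=0.1, dt=0.01, steps=1000):
    """Toy single-species balance: dA/dt = s + k_act*(1 - A) - k_deg*A,
    i.e. synthesis s, activation of the inactive fraction, and degradation.
    All constants are hypothetical, chosen only to show the integration loop."""
    a = a0
    trajectory = [a]
    for _ in range(steps):
        a += dt * (s + k_act * (1.0 - a) - k_deg * a)
        trajectory.append(a)
    return trajectory

traj = simulate(k_act=1.0, k_deg=0.5)
# the trajectory relaxes towards the steady state of dA/dt = 0,
# here A = (0.1 + 1.0) / (1.0 + 0.5) ≈ 0.733
```

The full model couples 2N such equations, one pair (T_i, A_i) per node, through the interaction terms.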

In our approximation we considered both the topology of the protein interaction map and the kinetic parameters as constant in time, i.e. each protein keeps the same neighbours during the time evolution of the system and interacts with them with constant strength. We decided to completely assign the connectivity matrix of the network on the basis of the existing experimental data. On the other hand, the kinetic parameters were largely unknown on the basis of the same information sources: as a consequence, in this application, the object of the "inverse problem" is the set of unknown model constants. The "inverse problem" has been implemented with the following scheme:

1. eqs.(5–6) are solved and the time courses of the variables T_i and A_i (i = 1...N) are calculated for a given set of model parameters

2. the predicted time course of certain quantities is compared with the corresponding experimental data and a specific "distance" between the time courses is evaluated

3. the procedure is iterated until that distance is minimized

Although, at least in principle, the strategy is simple, in practice the space of parameters to be estimated is very large; thus the strategy of points (1–3) above must rely on the availability of an efficient optimization algorithm. We chose Genetic Algorithms (GA) for a number of reasons which are highlighted in the following section.

GA: generality, numerical and computational implementation

The genetic algorithm (GA) is a programming technique that mimics biological evolution as a problem-solving strategy. Given a specific problem, the input to the GA is a set (called a "population") of potential solutions (called "individuals") to that problem. Each individual contains a "genome" able to provide a sub-optimal solution to the problem. This ability can be quantified if a specific fitness function is defined, measuring how fit an individual, by means of its genome, is for the solution of the optimization problem (i.e. measuring the "distance" between the sub-optimal and the optimal solution). The purpose of the GA is to produce successive populations of individuals, generated with the aim of increasing, as much as possible, the fitness of their individuals, i.e. their ability to solve the optimization problem by decreasing that "distance". This is done by producing successive populations of individuals using the same procedures as natural selection: mating and mutation. In the GA workflow, given an initial population of individuals, these are evaluated and classified according to their fitness. A selection rule is then defined to allow the mating of couples of individuals, which mix their genomes to form new ones (a further population), and an appropriate frequency of mutation of the genomes is defined, to introduce "new traits" into individuals (which, otherwise, would be composed only of traits coming from previous populations). If the selection rules for mating and the frequency of mutation are appropriately chosen, the GA produces successive sets of individuals ("generations") which are progressively more fit with respect to the optimization problem. In other words, the individuals are better and better approximations of the optimal solution of the problem.
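The loop described above can be sketched as follows (a toy GA minimizing a simple squared distance; the population size, mutation rate and target vector are illustrative assumptions, not the values used in this work):

```python
import random

def evolve(fitness, genome_len, pop_size=40, generations=200, p_mut=0.1, seed=1):
    """Minimal GA: tournament selection, one-point crossover, Gaussian mutation.
    `fitness` is a distance to be minimized (smaller is better)."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(genome_len)] for _ in range(pop_size)]
    best = min(pop, key=fitness)
    for _ in range(generations):
        new_pop = []
        while len(new_pop) < pop_size:
            # tournament selection of two parents
            a = min(rng.sample(pop, 3), key=fitness)
            b = min(rng.sample(pop, 3), key=fitness)
            # one-point crossover mixes the two genomes
            cut = rng.randrange(1, genome_len) if genome_len > 1 else 0
            child = a[:cut] + b[cut:]
            # occasional mutation introduces new traits
            if rng.random() < p_mut:
                i = rng.randrange(genome_len)
                child[i] += rng.gauss(0, 0.1)
            new_pop.append(child)
        pop = new_pop
        best = min(pop + [best], key=fitness)   # keep the best genome seen so far
    return best

# toy "distance" from a hypothetical optimum (0.5, 0.5, 0.5)
target = [0.5, 0.5, 0.5]
dist = lambda g: sum((x - t) ** 2 for x, t in zip(g, target))
best = evolve(dist, genome_len=3)
```

In the actual application, the genome is the set of unknown kinetic constants and evaluating the fitness requires a full ODE simulation of the network.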

The "inverse problem" we have attempted to solve starts from the description of a signalling network in terms of biochemically interacting species and reaction constants. After a mining procedure to discover the values of the known reaction constants, the system of eqs.(5–6) can be solved by setting, for the unknown reaction constants, an initial range of values. The solution of eqs.(5–6), in terms of functions describing the predicted time course of each of the system's variables (i.e. the concentrations of all the biochemical species of the network), is thus strictly related to the initial set of reaction constants. If one defines as an individual of the GA the complete set of reaction constants (the genome), one can compare the predicted time course (C_pred) of some variables with that effectively measured in an experimental test of the same variables on that network (C_exp). Formally, a distance between the two functions representing the two time courses can be defined:

where

Eq.(8) can thus be taken as the "fitness" function of the considered individual; one can thus measure its "distance" from the "optimal" solution. Indeed, a more general formulation of the fitness function could be given by attributing "empirical" weight factors to the different terms of the sum.
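A sketch of such a fitness evaluation, including the optional weight factors (the time courses below are invented purely for illustration):

```python
def fitness(simulated, measured, weights=None):
    """Inverse of the (weighted) squared Euclidean distance between the
    simulated and experimental time courses of the constrained proteins.
    `simulated` and `measured` map a protein name to its list of samples."""
    if weights is None:
        weights = {p: 1.0 for p in measured}   # equal "empirical" weights
    d2 = sum(
        weights[p] * (s - m) ** 2
        for p in measured
        for s, m in zip(simulated[p], measured[p])
    )
    return 1.0 / d2 if d2 > 0 else float("inf")

# hypothetical 3-point time courses for two of the constrained proteins
sim = {"ERK-1": [0.0, 0.5, 0.9], "MEK": [0.1, 0.4, 0.6]}
exp = {"ERK-1": [0.0, 0.6, 1.0], "MEK": [0.1, 0.5, 0.5]}
f = fitness(sim, exp)   # d2 = 4 * 0.01 = 0.04, so f = 25.0
```

A larger fitness therefore corresponds to a smaller distance from the experimental constraints.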

The aim of the GA is to produce solutions which progressively reduce the value of the distance for its individuals. The scheme for producing successive "generations" of individuals can be summarized as follows:

1. start with a set of initial genomes {K_i}, where each K_i is a real number in the interval [10^-5, 10^0]. The interval was chosen on the basis of a reasonable number of kinetic values of protein-protein interactions published in the literature

Genetic algorithm scheme

**Genetic algorithm scheme**. Flow chart of the estimation procedure using the genetic algorithm (GA). Every unknown model parameter is called a "gene", while the whole set of parameters to be estimated is defined as the "genome". Every genome is contained within an "individual", the computational entity able to "evolve". An ensemble of genomes corresponds to a "population". The GA procedure begins with an initial random guess of the parameter values, used to run a simulation of the model network. This first step is iterated for all the individuals belonging to the different populations. For each individual, the simulated time courses of the concentrations of specific proteins are compared with the experimental measures and the distances between the functions are calculated. Every individual is thus associated with a fitness index, measuring the degree of compatibility of its genome with the experimental constraints. A small number of individuals are selected based on their fitness but also on probabilistic rules: their genomes are randomly mutated by genetic operators, giving birth to new offspring that enter the next generation. At each round, the plot describing the evolution of the best fitness computed so far is updated: when it clearly saturates, the algorithm stops and the genome corresponding to that fitness is the solution of the algorithm.

2. for each individual, evaluate the distance **d** of eq.(8)

3. select, according to some defined rule, the individuals to be mated to form the new generation of individuals.

4. perform the mating procedure as follows: given two different individuals {K^A} and {K^B}, a cross-over point is randomly chosen and the corresponding tracts of the two genomes are exchanged, forming two new individuals

The parameter estimation does not include the topology of the network; that is, the connectivity matrix is considered a constant of the system and no interaction parameter is allowed to go to zero during the optimization procedure. The experimental data used as model constraints to optimize the system are the experimental time courses of the concentrations of the active fractions of the ERK-1, c-Raf, MEK and PKC-iota proteins.

The algorithm starts by assigning every individual a random genome. The initial genes K_i were randomly generated as K_i = 10^α, where α is a random number uniformly distributed in the interval [-5, 0].
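The log-uniform initialization can be sketched as follows (the exponent range [-5, 0] is the one stated above):

```python
import random

def random_genome(n, lo=-5.0, hi=0.0, rng=random):
    # Each K_i = 10**alpha with alpha uniform in [lo, hi], so the initial
    # parameters are spread uniformly across orders of magnitude in [1e-5, 1].
    return [10.0 ** rng.uniform(lo, hi) for _ in range(n)]

genome = random_genome(98)   # one candidate parameter set for the 98-node model
```

Sampling the exponent, rather than the value itself, avoids crowding the initial population near the upper end of the range.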

The fitness function F is defined here, for each individual, as the inverse of the squared Euclidean distance between the experimental time courses of the concentrations of the activated fractions of the ERK-1, c-Raf, MEK and PKC-iota proteins (see above) and the simulated time courses for the same species, obtained using the genome {K_1,...,K_n} of the individual (Fig. )

Here p = p_1...p_np indexes the protein species used for the fitness evaluation and t = t_1...t_nt indexes the sampling times. The mutation probability P_i of the i^th individual of the population is calculated as:

P_i = P_mut^{1/t},

where 10^-4 < P_mut < 0.04. The individuals are distributed among NSp sub-populations, each containing NI individuals; in our case NI = 16 and 7 < NSp < 33. The evolution process takes place independently within each sub-population at each generation. Every NM generations, with NM of the order of the sub-population size, MI of the best individuals in each sub-population, again selected according to a probabilistic rule, move into a different sub-population, there replacing others that in turn entered another sub-population; MI is of the order of 10%-30% of NI. This "migration" operator allows a sub-population to partially renew its genetic pool and tends to speed up the evolution process. The algorithm keeps in memory the "optimal" genome and the corresponding fitness, that is, the best individual out of all the sub-populations obtained up to that stage of the evolution process: these are compared with the best genome and corresponding fitness in the current generation, and if the new fitness is better the optimal genome is replaced by the new one. The plot of the optimal fitness versus the generation number describes a monotone non-increasing function: when the curve saturates, the procedure comes to an end and the individual corresponding to the optimal fitness provides the solution genome.
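The migration operator can be sketched as follows (a ring topology between sub-populations is assumed here for concreteness; the text does not specify the actual migration pattern, and `fitness` is again a distance to be minimized):

```python
import random

def migrate(subpops, n_migrants, fitness, rng=random):
    """Move copies of the best n_migrants of each sub-population into the
    next one (ring topology), replacing its worst individuals."""
    # snapshot the migrants before any sub-population is modified
    migrants = [sorted(sp, key=fitness)[:n_migrants] for sp in subpops]
    for i, sp in enumerate(subpops):
        incoming = migrants[(i - 1) % len(subpops)]
        sp.sort(key=fitness, reverse=True)          # worst individuals first
        sp[:n_migrants] = [list(m) for m in incoming]
    return subpops

# two sub-populations of one-gene individuals; fitness = distance from 0
pops = [[[5.0], [1.0], [3.0]], [[4.0], [0.5], [2.0]]]
migrate(pops, 1, fitness=lambda ind: abs(ind[0]))
```

Each sub-population thus loses its worst individuals and gains the best genomes of a neighbour, partially renewing its genetic pool.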

The GA is intrinsically parallel; thus the necessary computation can be very efficiently distributed over several CPUs. The GA was implemented on a cluster of Alpha CPUs, using the Fortran 90 language and the MPI protocol, under the Linux operating system. In this implementation each computational node stores the genomes of a single sub-population, which evolves independently, except when there is a migration of individuals. In that case genome vectors are exchanged between the nodes (Fig. )

Results and Discussion

Results

The system under investigation does not guarantee that the inverse problem has a unique solution, given the chosen experimental constraints. Therefore we must assume that the GA will find not one single solution but an ensemble of solutions, formed by many sets of model parameters {K_1,...,K_n}. The ensemble describes a small sub-space within the entire space of parameters. We decided to sample this sub-space to study the properties of the solutions. The first step in this work was to obtain several numerical estimates of the set of unknown kinetic parameters. The second step was the analysis of the properties of a single solution, followed by the analysis of the collective properties of the ensemble. Eventually, one solution was used as the best estimate of the kinetic parameters, to compare the simulated behaviour of the network with independent experimental data and assess the reliability of the method. The genetic algorithm was started each time using different random genomes.

The time evolution of the fitness of the optimal individual is a non-increasing function, with an envelope following a decreasing exponential-like shape (Fig. )

Fitness index

**Fitness index**. Time evolution of the fitness index during the calculation of the optimal sets of kinetic parameters. The diagrams describe the fitness evolution of the optimal individuals as a result of parallel calculations on 8, 16 and 30 CPUs, and are averages over different sessions; this explains the small discontinuities in the decreasing trend. The time required to reach saturation decreases as the number of CPUs increases.

When the time derivative approaches zero, the algorithm ends and the current optimal individual is considered to be the estimated solution of the problem. The computation time necessary to reach a good level of approximation decreases with the number of CPUs used, as shown in Fig.

Simulated and experimental data

**Simulated and experimental data**. Comparison of experimental and simulated data. The experimental time courses of the concentrations of the proteins used as constraints in the calculation of the fitness function are compared with the corresponding simulated behaviours.

Though the experimental and simulated data may appear different, the essential dynamical features, some transients and the following relaxation of the system, are approximately described by the simulation. Since no further significant improvements of the best parameter sets could be obtained using the genetic algorithm, we can attribute the differences to the incomplete connectivity of the model network, which leaves some protein concentrations unable to be sufficiently modulated by the activity of the rest of the network. This does not imply that the algorithm is unfit for estimating important properties of the unknown parameters of the model. We obtained a total of 36 solutions of the inverse problem, each of them requiring a few days of computation.

The initial random parameter sets were completely altered by the genetic operators, both by cross-over and by random mutation, which affected every element of the genomes at least once; therefore the final outcome of the algorithm, the optimal genome, has lost every numerical similarity with the initial parameter sets. These two points together could have two kinds of consequences: either all the reactions are necessary for the correct dynamics of the network, or only a few reactions dominate the dynamics and guarantee that the chosen experimental constraints are satisfied, while the other rate constants may just fluctuate almost randomly. Further analysis, later in this article, will show that the second hypothesis is probably the correct one. Some more hints come from the calculation of the proximity matrix of the logarithms of the solution vectors, whose elements are the non-squared Euclidean distances between all pairs of solution genomes. We have plotted the frequency distribution of its elements (Fig. )
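The proximity matrix of log-scaled solution vectors can be computed as in this sketch (the two-parameter solution sets are toy values):

```python
import math

def log_distance(k_a, k_b):
    """Non-squared Euclidean distance between two parameter sets,
    taken in log10 scale as described in the text."""
    return math.sqrt(
        sum((math.log10(a) - math.log10(b)) ** 2 for a, b in zip(k_a, k_b))
    )

def proximity_matrix(solutions):
    return [[log_distance(a, b) for b in solutions] for a in solutions]

# three toy solutions with two kinetic parameters each
sols = [[1e-3, 1e-1], [1e-4, 1e-2], [1e-2, 1e-1]]
M = proximity_matrix(sols)
# M[0][1] = sqrt((-3 - (-4))**2 + (-1 - (-2))**2) = sqrt(2) ≈ 1.414
```

The histogram of the off-diagonal elements of M is what is compared, in the figure, with the same histogram for randomly generated K vectors.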

Proximity matrix of solutions

**Proximity matrix of solutions**. Normalized frequency histogram of the elements of the proximity matrix built by computing the non-squared Euclidean distance ||log_10 {K}_i - log_10 {K}_j||, i, j = 1...n, where {K}_i and {K}_j represent single-solution parameter sets and n is the number of unknown model parameters. The abscissa shows the distance values. For comparison we also show the distribution of the proximity matrix for a large set of randomly generated K vectors.

The asymmetrical bell shape is typical of the distributions of the distances between all the geometrical points contained in a generic hypercube, here defined by the parameter ranges in the n-dimensional space, where n is the number of unknown parameters: the same distribution pattern holds, for instance, even in two dimensions. The two distributions have very similar shapes, though the solutions are slightly shifted towards shorter distances, a feature that is not surprising since the solutions belong to a smaller sub-space of the cited n-dimensional hypercube, so that the corresponding points in the parameter space are closer to one another. The fact that the distribution of solutions is shifted by a small value, about 20% of the bell width, suggests that probably only a few parameters contribute to this shift, while the others are essentially randomly distributed. Having analyzed the solution parameter sets as static entities, separated from the network dynamics they describe, they must eventually be characterized on the basis of that dynamics. To make again a genetic comparison, it is not sufficient to analyze the "genotypes", the solutions; one must rather analyze the corresponding "phenotypes", the time courses of the protein concentrations. Each of the solution parameter sets can be used to simulate the signal transduction process in the network, since each is considered to be a "realistic" set of kinetic parameters. The dynamics described by each of the solutions is slightly different, though, in any case, the time course of the protein concentrations meets the experimental constraints used for the genetic algorithm. A closer investigation of the detailed structure of the ensemble is needed to understand what explains the similarities and, at the same time, the differences among the simulated dynamics obtained with the different estimated solutions.
We computed the ratio, on the logarithmic scale, between the standard deviation and the mean of each parameter K_i of the genome, with i = 1...N, across the whole ensemble of computed solutions, that is, the vector of coefficients of variation:

where N is the number of parameters. The 17 parameters showing a ratio smaller than 0.3 were considered as conserved elements across the ensemble of solutions. This threshold was chosen on the basis of the distribution of the coefficient of variation of a variable X, where X is sampled from a uniform distribution in the interval [-5,0]. The distribution of the coefficient of variation can be approximated by a Gaussian density function N(
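The coefficient-of-variation filter can be sketched as follows (a toy ensemble of three solutions with two parameters; the 0.3 threshold is the one used above):

```python
import math

def coefficient_of_variation(values):
    """sigma/|mu| of the log10-transformed parameter values across solutions."""
    logs = [math.log10(v) for v in values]
    mu = sum(logs) / len(logs)
    var = sum((x - mu) ** 2 for x in logs) / len(logs)
    return math.sqrt(var) / abs(mu)

def conserved_parameters(solutions, threshold=0.3):
    """Indices of the parameters whose coefficient of variation across
    the solution ensemble falls below the threshold."""
    n = len(solutions[0])
    return [
        i for i in range(n)
        if coefficient_of_variation([s[i] for s in solutions]) < threshold
    ]

# parameter 0 is tightly conserved; parameter 1 fluctuates over decades
sols = [[1.0e-3, 1e-1], [1.1e-3, 1e-4], [0.9e-3, 1e-2]]
idx = conserved_parameters(sols)   # → [0]
```

Parameters surviving this filter are the ones interpreted as genuinely constrained by the experimental data, rather than freely fluctuating.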

Distribution of the coefficients of variation of solution parameters

**Distribution of the coefficients of variation of solution parameters.** The coefficient of variation σ(log_10 K_i)/μ(log_10 K_i), where {K_i}_i = 1...n are the kinetic parameters, was computed for every parameter across the entire ensemble of solution sets. Their distribution is shown (red line). For comparison, the distribution of the coefficient of variation of a variable X is shown (green line), where X is sampled from a uniform distribution in the interval [-5,0]. The distribution of the coefficient of variation can be approximated by a Gaussian density function N(

Variability of solutions

**Variability of solutions**. Most conserved kinetic parameters. The coefficient of variation σ(log_10 K_i)/μ(log_10 K_i), where {K_i}_(i = 1...n) is any kinetic parameter, was computed for every parameter across the entire ensemble of solution sets. The kinetic parameters with a ratio ≤ 0.33 are highlighted in the graphical representation of the network: thick arrows refer to kinetic rates of protein-protein interactions, red circles refer to degradation rates and green circles to synthesis rates.

The parameters highlighted in fig.

We have also investigated the level of complexity of the network dynamics through the evaluation of the eigenvalue spectrum and the eigenvectors of the Jacobian matrix of the system of eqs. (5–6). The Jacobian was evaluated at a fixed time point (corresponding to t = 60 mins) of a time simulation performed by using one parameter set obtained by the GA procedure. The eigenvalue spectrum spans 24 orders of magnitude, from 10^{-22} to 10^{2}, with about 75% of the eigenvalues being real negative and 25% real positive: this implies that the majority of kinetic modes (eigenvectors) in the diagonalized system lead to an exponential decay, though with a large spectrum of decay rates. The components of the 2N orthonormal eigenvectors along the original set of 2N coordinates describe how the nodes of the network are involved in the corresponding kinetic modes. In this respect, 20 eigenvectors have significant components (larger than 0.1) along just one coordinate, so the corresponding dynamics involves essentially only one node of the network, while another 57 eigenvectors have significant components only along two coordinates corresponding to two distinct nodes. On the other hand, more than 50% of the eigenvectors have significant components along 3 or more coordinates, up to 12: they thus correspond to more complex modes that involve a large number of network proteins. Moreover, many eigenvectors project onto the same coordinates, which means that many proteins are involved in different kinetic modes. In conclusion, a group of small subnetworks exists, composed of one or two nodes, that shows a very simple increasing or decreasing dynamics, but this group cannot describe the system dynamics exhaustively: only a complex relation between several kinetic modes can account for the simulated behaviour.
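The classification of kinetic modes described above can be sketched as follows. This is a toy illustration: the real Jacobian comes from linearizing eqs. (5–6) at t = 60 min, which is not reproduced here, so a small hand-written matrix stands in for it.

```python
import numpy as np

def classify_kinetic_modes(jacobian, significance=0.1):
    """Diagonalize the Jacobian of the linearized ODE system and count,
    for each eigenvector, how many coordinates (network nodes) it
    significantly involves (|component| > significance)."""
    eigvals, eigvecs = np.linalg.eig(jacobian)
    # eigenvectors are the columns of eigvecs; count significant rows per column
    nodes_per_mode = (np.abs(eigvecs) > significance).sum(axis=0)
    n_decaying = int(np.sum(eigvals.real < 0))   # modes with exponential decay
    return eigvals, nodes_per_mode, n_decaying

# toy 3-node network: one isolated decaying node plus a weakly coupled pair
J = np.array([[-1.0,  0.0,  0.0],
              [ 0.0, -2.0,  0.5],
              [ 0.0,  0.5, -2.0]])
vals, nodes, dec = classify_kinetic_modes(J)
print(sorted(nodes.tolist()), dec)   # one single-node mode, two two-node modes
```

In the toy case all three eigenvalues are real and negative, so every mode decays; the coupled pair yields two modes that each involve both of its nodes, mirroring the one-node and two-node modes counted in the full network.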

Discussion

Different methods for parameter estimation and fitting

GA has proven to be a powerful and successful problem-solving strategy. It has been used to solve NP-complete optimization problems in a wide variety of fields, such as chemistry, biology, engineering, astrophysics, aerospace, electronics, mechanical and electrical design, military planning, mathematics and robotics, among many others. Notable examples of GA applications in molecular biology are in the modelling of genetic and regulatory networks

Comparison of simulations with experimental data and multiple solutions of the inverse problem

We believe, however, that the major limitation of this model is not the degree of approximation used to describe protein-protein interactions, but rather that some other biologically relevant features are missing, such as the connections with the gene transcription network and with other signalling pathways, and the role of spatial diffusion; these may be the subject of future improvements of the model. For these reasons, the network is more a test case for the application of the GA to the inverse-problem domain than an accurate description of the neurotrophic and apoptotic signal transduction processes. It is likely that other, independent experimental data would allow an unambiguous selection among the different solutions of the Pareto set, in two different manners: either the data could be added as additional constraints from the beginning of the GA procedure, so as to reduce the Pareto set from the start, or they could play the role of independent criteria to select one single solution, or at least a subset of proper solutions, obtained by the GA procedure as presented in this work. The modelled signalling network must also be able to respond to a variety of external stimuli coming from the rest of the cellular environment; the diversity displayed by the behaviours of the different solutions is compatible with the existence of this ability. The lack of functional connections to other signalling pathways, however, does not allow the network to directly display these potential modalities of response. A related point is the robustness of the system. The optimal solutions belonging to the Pareto set correspond to different dynamical evolutions, yet all meet the experimental conditions: this suggests that the network shows some robustness, since it is able to guarantee the same signal transduction in many different conditions, with very different combinations of protein-protein interaction strengths.
Robustness is a fundamental property of biological systems, essential for survival when dangerous situations and sudden changes in the cellular environment must be faced.
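The second selection strategy mentioned above, using independent data as a post-hoc criterion, amounts to a simple filter over the Pareto set. A minimal sketch follows, in which a hypothetical `predict` function stands in for the full ODE simulation of a candidate parameter set:

```python
def select_by_independent_data(solutions, predict, observed, rel_tol=0.2):
    """Keep only the candidate parameter sets whose model prediction for an
    independent experiment falls within rel_tol of the observed value."""
    return [s for s in solutions
            if abs(predict(s) - observed) <= rel_tol * abs(observed)]

# toy usage: three candidate parameter sets and a linear stand-in readout
candidates = [{"k": 0.5}, {"k": 1.0}, {"k": 3.0}]
predict = lambda s: 10.0 * s["k"]        # hypothetical readout model
kept = select_by_independent_data(candidates, predict, observed=10.0)
print(kept)   # only {"k": 1.0} survives the filter
```

The first strategy, adding the new data as a GA constraint, would instead fold the same discrepancy term into the fitness function before the search begins.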

Conserved kinetic parameters

At the end of this work we found that a sub-vector of the kinetic parameters is characterized by a small coefficient of variation

CV_j = σ({log_10 K_j}) / μ({log_10 K_j})

across the Pareto set of optimal solutions, where K_j is a model parameter value describing the j-th interaction/reaction. This is an important and informative result, since those parameters correspond to protein-protein interactions and synthesis/degradation processes essential for the model to correctly describe the experimental data used as constraints in the parameter estimation procedure. This sub-vector includes protein-protein interactions and single-protein reactions that could explain the robustness of the network dynamics across the whole Pareto set. The sub-vector can be considered as composed of values estimated almost unambiguously, within a reasonable error, compared to the rest of the parameters. The existence of this sub-vector supports the idea that a sufficient amount of experimental determinants could condition the inverse problem enough to allow a reliable estimate of the whole parameter set. What we have done, in fact, is to sample the space of solutions of the inverse problem using a genetic algorithm: a larger number of experimental constraints would reduce the dimension of that space.

Conclusion

In this work we have discussed the problem of mining, measuring and estimating the value of parameters needed in mathematical models describing the signalling processes mediated by protein-protein interactions. The lack of kinetic interaction rates measured in reliable

Authors' contributions

The author(s) contributed equally to this work.

Acknowledgements

This work was supported by the Italian Ministry of Education University and Research, grant FISR D.M. 1.506 Ric. 28.10.2003.

This article has been published as part of