Institute for Collaborative Biotechnologies, University of California, Santa Barbara, CA 93106-5080, USA
Max Planck Institute for Dynamics of Complex Technical Systems, Sandtorstr. 1, 39106 Magdeburg, Germany
Faculty of Mechanical Engineering, Specialty Division for Systems Biotechnology, Technische Universität München, Boltzmannstr. 15, 85748 Garching, Germany
(Bio)Process Engineering Group, IIM-CSIC, C/Eduardo Cabello 6, 36208 Vigo, Spain
Abstract
Background
Model development is a key task in systems biology. It typically starts from an initial model candidate and proceeds through an iterative cycle of hypothesis-driven model modifications, new experimentation and subsequent model identification steps. The final product of this cycle is a satisfactory refined model of the biological phenomena under study. During such iterative model development, researchers frequently propose a set of model candidates from which the best alternative must be selected. Here we consider this problem of model selection and formulate it as a simultaneous model selection and parameter identification problem. More precisely, we consider a general mixed-integer nonlinear programming (MINLP) formulation for model selection and identification, with emphasis on dynamic models consisting of sets of either ODEs (ordinary differential equations) or DAEs (differential algebraic equations).
Results
We solved the MINLP formulation for model selection and identification using an algorithm based on
Conclusions
The presented MINLP-based optimization approach for nested-model selection and identification is a powerful methodology for model development in systems biology. This strategy can be used to perform model selection and parameter estimation in one single step, thus greatly reducing the number of experiments and computations of traditional modeling approaches.
Background
Model development is a key task in systems biology and involves different steps, such as model calibration, experimental design and model refinement, which usually take place iteratively (see reviews in
A number of researchers have proposed different iterative schemes for model development involving the steps of parameter estimation, identifiability analysis, and optimal experimental design
Verheijen
Here, we present a method to simultaneously select a model and calibrate it in a single step. This contribution is based on the following four key ideas: (i) frequently, iterative model development cycles can be considered in a more compact way if sets of hypotheses can be grouped together and formulated as a parameterized set of models, from which the best alternative must be selected; (ii) we consider the problem of model selection formulating it as a simultaneous model selection and parameter identification problem; (iii) further, in order to make the selection decision in a systematic way, we formulate it as an optimization problem
The paper is organized as follows: First, we describe the framework used for model selection and identification, based on the nested models paradigm. Then we state the corresponding optimization problem using a formulation based on mixed-integer nonlinear programming subject to differential and algebraic constraints. In the following sections, we describe the application of this methodology to a case study considering a dynamic model of the KdpD/KdpE system of
Methods
To the best of our knowledge, this is the first time that an MINLP framework for simultaneous model selection and identification is presented. The key issues for the successful design of this combined approach are: (i) selection of the integer and binary parameters that accurately describe all the possible nested models; (ii) reliable and accurate parameter estimation; (iii) use of efficient algorithms with reduced computational cost; (iv) assessment of model identifiability.
Nested models: selection and identification
In this contribution we consider dynamic models which are nested, i.e. there is a hierarchy such that each model is a particular subcase of an extended parameterized model, which can be considered as a superstructure. These nested models arise from existing models plus new hypotheses, such as the existence of new positive or negative feedback loops. In a loose sense, we can say that Model B is nested within Model A if Model B is a special case of Model A. Figure
Example of nested models. Model C and Model D are nested within Model B, which is in turn nested within Model A; Model A can be considered as a superstructure.
Several functions have been suggested as metrics to assess the goodness of fit of models. The maximum-likelihood estimation (MLE), introduced by Fisher in 1912
where p: set of parameters to be estimated
The Akaike information criterion (AIC)
Many functions have been suggested to compare two or more models. Despite the fact that several authors have questioned whether AIC is biased towards complex model structures
In order to reduce the computational burden, in this work we used the AIC as the cost function for finding the optimal set of parameters, formed by a subset of binary parameters defining the model structure (e.g. presence or absence of a certain feedback loop) and another subset of integer and real parameters characterizing the model dynamics.
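As a hedged illustration of how the AIC trades goodness of fit against model size, the sketch below uses the common least-squares form, AIC = n ln(RSS/n) + 2k, which assumes independent Gaussian errors of equal variance. The exact expression used in this work is not reproduced here; the function name `aic_least_squares` and all numbers are illustrative assumptions.

```python
import math

def aic_least_squares(residuals, n_params):
    """AIC for a least-squares fit, assuming i.i.d. Gaussian measurement errors.

    AIC = n * ln(RSS / n) + 2 * k, where n is the number of data points,
    RSS the residual sum of squares and k the number of estimated parameters.
    """
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return n * math.log(rss / n) + 2 * n_params

# Hypothetical residuals of two nested candidates fitted to the same data.
res_small = [0.30, -0.25, 0.28, -0.31, 0.27, -0.29]  # 3 parameters
res_large = [0.29, -0.25, 0.27, -0.30, 0.27, -0.28]  # 5 parameters, marginally better fit

aic_small = aic_least_squares(res_small, 3)
aic_large = aic_least_squares(res_large, 5)

# The lower AIC wins: the small gain in fit does not pay for two extra parameters.
best = "small" if aic_small < aic_large else "large"
```

In this made-up case the larger model fits slightly better, but the penalty term 2k outweighs the improvement, so the smaller model is preferred.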
Formulation of the MINLP
The formulation of the simultaneous model selection and identification problem is stated as an MINLP optimization problem. In mathematical terms, the general MINLP is defined as finding the vector of
subject to:
• System dynamics in the form of DAEs, with state variables y
• Additional requirements in the form of equality and/or inequality constraints
• Upper and lower bounds (superscripts U and L respectively) on decision variables
This set of constraints defines the feasible space
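For orientation, a generic MINLP with differential-algebraic constraints that matches the verbal description above can be written as follows. The symbols (J for the cost function; x, z and b for the real, integer and binary decision variables; f, h and g for the constraint functions) are notational assumptions made for this sketch and need not coincide with the notation of the original formulation. Only y for the state variables and the superscripts U and L for the bounds are taken from the text.

```latex
\begin{aligned}
\min_{x,\,z,\,b}\quad & J(y, x, z, b) \\
\text{subject to}\quad
& f\!\left(\dot{y}, y, x, z, b, t\right) = 0, \qquad y(t_0) = y_0, \\
& h(y, x, z, b) = 0, \qquad g(y, x, z, b) \le 0, \\
& x^{L} \le x \le x^{U}, \qquad z^{L} \le z \le z^{U}, \\
& x \in \mathbb{R}^{n_x}, \quad z \in \mathbb{Z}^{n_z}, \quad b \in \{0,1\}^{n_b}.
\end{aligned}
```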
Solution of the MINLP problem
The problem of parameter estimation is a crucial step in the development of models of biological systems
In the case of the MINLP problem at hand, the need to use GO methods is increased by the additional nonlinearities coming from the binary and integer variables and the augmented size of the problem. ACOmi (Ant Colony Optimization for mixed-integer problems)
ACOmi (Ant Colony Optimization for mixed-integer problems) is an extension of the ant colony optimization metaheuristic that can handle mixed-integer search domains. In this method, a new penalization strategy was introduced to extend the ACO framework to constrained optimization problems. A detailed explanation of the hybrid implementation ACOmi, incorporating the extended ACO framework and a robust oracle penalty method, is given by
fSSm is a new evolutionary method for complex-process optimization. It is partially based on the principles of the scatter search methodology, but makes use of innovative strategies to be more effective in the context of complex-process optimization using a small number of tuning parameters. In particular, this method uses a new combination method based on path relinking, which considers a broader area around the population members than previous combination methods. It also uses a population-update method which improves the balance between intensification and diversification, as described in
MISQP is a modified sequential quadratic programming method for solving MINLP problems. MISQP assumes that the model functions are smooth in the sense that an increment of an integer variable by one leads to a small change of function values, but it does not require that the mixed-integer program is convex or relaxable (i.e. function values are evaluated only at integer points). Thus, this algorithm is expected to be more efficient than any other method that starts from a solution of the relaxed problem
Moreover, in contrast to other local optimization solvers, the evaluation of the exact gradient is not always required for a proper convergence of SQP methods. The evaluation of the performance of the method used in this study, MISQP, on a test set of 186 academic test examples published in
Model identifiability, sensitivity and correlation analysis
Several powerful approaches have been recently developed to assess the identifiability of model parameters in systems biology models, namely, those exploiting the profile likelihood
If the FIM is full rank the parameters are considered identifiable
Sensitivity analysis measures the importance of the parameters with respect to the influence of their variations on model predictions. The most widely used method is local sensitivity analysis, which consists of calculating the partial derivatives of the model state variables with respect to the model parameters, evaluated at the normal operating point where all the parameters have their nominal value. This method gives a linear approximation of how much a variable changes due to a given change in a parameter. The use of relative measures, where the sensitivity function is normalized by the value of the parameter and the state, is recommended to make these measures comparable for parameters and states of different orders of magnitude:
To lump the sensitivity of a parameter with respect to different states at different time points and different experiments, Brun et al.
A high value of the sensitivity index means that a change in parameter
The main drawback of local sensitivity indices is that they are computed at the nominal values used for the parameters and the behavior of the response function is described only locally in the parameter space. Moreover, preliminary experiments and parameter estimation tests should be carried out in order to obtain a first guess for the parameter values and an iterative scheme involving both steps is required to study the model sensitivity. In addition, these methods are linear; thus, they are not sufficient for dealing with complex models, especially those in which there are nonlinear interactions between parameters.
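The local relative sensitivities discussed above, and a root-mean-square aggregate over time points in the spirit of the lumped index attributed to Brun et al., can be sketched as follows. This is a minimal illustration under stated assumptions: the first-order decay model, the finite-difference step and the names `simulate`, `relative_sensitivities` and `rms_index` are hypothetical and are not taken from the model or toolbox used in this work.

```python
import math

def simulate(params, times):
    """Toy model (assumed for illustration): first-order decay y = y0 * exp(-k * t)."""
    y0, k = params
    return [y0 * math.exp(-k * t) for t in times]

def relative_sensitivities(params, times, rel_step=1e-6):
    """Central finite-difference estimate of (p_j / y_i) * dy_i/dp_j at the nominal point."""
    y_nom = simulate(params, times)
    sens = []  # sens[j][i]: parameter j, time point i
    for j, pj in enumerate(params):
        h = rel_step * abs(pj)
        up = list(params)
        dn = list(params)
        up[j] = pj + h
        dn[j] = pj - h
        y_up, y_dn = simulate(up, times), simulate(dn, times)
        sens.append([pj / y * (a - b) / (2 * h)
                     for y, a, b in zip(y_nom, y_up, y_dn)])
    return sens

def rms_index(sens_row):
    """Root-mean-square of one parameter's relative sensitivities over all time points."""
    return math.sqrt(sum(s * s for s in sens_row) / len(sens_row))

nominal = [2.0, 0.5]          # y0 and k, arbitrary nominal values
times = [0.5, 1.0, 2.0, 4.0]  # arbitrary sampling times
S = relative_sensitivities(nominal, times)
importance = [rms_index(row) for row in S]
```

For this toy model the relative sensitivity to y0 is 1 at every time point, while the sensitivity to k equals -k t and grows in magnitude with time, so k ranks higher once late time points are included. The ranking is only valid near the chosen nominal point, which is the limitation noted above.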
In contrast, global sensitivity analysis (GSA) methods evaluate the effect of a parameter while all other parameters are varied simultaneously, accounting for interactions between parameters without depending on the stipulation of a nominal point. In this work, a pseudo-global sensitivity analysis as described in
For models with several parameters, high parameter sensitivity, although necessary, does not ensure the identifiability of the model. In addition, the sensitivity functions of the parameters have to be linearly independent, so that a change in one parameter cannot be compensated by changes in the other parameters. When the parameters are identifiable, we can study the degree of linear dependence among the sensitivity functions by means of a correlation analysis based on the Fisher Information Matrix (FIM) as described in
In order to eliminate the dependence on a nominal point, a pseudo-global identifiability analysis as described in
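A minimal sketch of the FIM-based checks mentioned above is given below, assuming uncorrelated Gaussian measurement noise of constant standard deviation, so that the FIM reduces to SᵀS/σ². The sensitivity matrix `S` and the value of `sigma` are made up for illustration; the correlation matrix is obtained from the inverse of the FIM, which is the usual Cramér-Rao approximation of the parameter covariance.

```python
import numpy as np

# Hypothetical sensitivity matrix: rows = measurements, columns = parameters.
# Columns 0 and 1 are nearly proportional, mimicking two parameters whose
# effects on the outputs can largely compensate each other.
S = np.array([
    [1.0, 2.1, 0.5],
    [2.0, 3.9, -1.0],
    [3.0, 6.2, 0.3],
    [4.0, 7.8, -0.7],
])
sigma = 0.1  # assumed standard deviation of the measurement noise

fim = S.T @ S / sigma**2                      # Fisher information matrix
full_rank = np.linalg.matrix_rank(fim) == fim.shape[0]

cov = np.linalg.inv(fim)                      # approximate parameter covariance
std = np.sqrt(np.diag(cov))
corr = cov / np.outer(std, std)               # correlation matrix, entries in [-1, 1]
```

Here the FIM is full rank, so the three parameters are locally identifiable in the structural sense, yet the entry `corr[0, 1]` is very close to -1. This is the typical signature of a pair of parameters that is identifiable in theory but hard to estimate independently in practice.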
Dynamic model of the KdpD/KdpE system of
Bacteria constantly monitor their environment and adapt immediately to changing conditions to survive. There are several adaptation mechanisms, notably special signal transduction systems. A sensor kinase (
The dynamic model presented by Kremling and coworkers
where
Results and discussion
Computations were carried out using Matlab™ (Version 7.9.0, R2009b; The Mathworks, MA, USA) running on a dual Intel® Xeon® 2.13 GHz CPU desktop under Windows 7. All the scripts needed to reproduce the results presented in the following are provided in the Additional file
K_homeostasis_MINLP. K_homeostasis_MINLP.zip contains all the scripts needed to reproduce the results presented in this manuscript using the toolbox SensSB
Identifiability analysis of the original model
Simulation studies showed that the concentration of
A local identifiability analysis of the original model with the best set of parameters was performed. As already suggested by Kremling
The importance ranking of the parameters estimated from the
These modifications led to a second formulation of the model (Model II) with 7 DAEs and 17 parameters that fits the experimental data equally well.
New hypotheses for the
Based on unpublished data of a mutant strain with impaired
Scheme of the reaction mechanism for the
where
• Regulation of translation (
• Regulation of proteolysis (
• Stimulus counteraction (
In order to account for the different
While two different expressions were hypothesized for the wild strain:
Note that the dynamics of the mutant strain do not depend on parameters
These possible new loops were integrated with Model II by considering a superstructure, which has a total of 25 degrees of freedom: 17 reals, 5 integers and 3 binaries, resulting in 1700 nested models. In a traditional setting, each of these models would have to be identified (calibrated) from experimental data prior to model selection by solving the corresponding minimization problem, that is, a nonlinear programming problem subject to differential-algebraic constraints (NLP-DAEs). Since the solution of each problem is quite computationally expensive, this is obviously not tractable. As an alternative, we applied the strategy outlined above and performed a simultaneous selection and identification via MINLP optimization.
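The idea of treating structure and parameters as one mixed decision vector can be illustrated with a deliberately small toy problem. Everything in this sketch is an assumption made for illustration: the static Hill-type superstructure, the synthetic noise-free data, the coarse grids and the least-squares form of the AIC. The exhaustive joint search stands in for the global MINLP solvers used in this work and is feasible only because the toy search space is tiny.

```python
import itertools
import math

def hill_model(x, v, K, n, b, y0):
    """Superstructure: Hill term with integer exponent n plus an optional basal term (binary b)."""
    return v * x**n / (K**n + x**n) + b * y0

def aic(rss, n_data, n_params):
    # Least-squares AIC; the floor avoids log(0) for a perfect fit to noise-free data.
    return n_data * math.log(max(rss, 1e-12) / n_data) + 2 * n_params

x_data = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0]
# Synthetic, noise-free data generated with n = 2 and the basal term switched on.
y_data = [hill_model(x, 2.0, 1.0, 2, 1, 0.5) for x in x_data]

def cost(candidate):
    """AIC of one candidate; the parameter count depends on the binary switch."""
    v, K, y0, n, b = candidate
    rss = sum((hill_model(x, v, K, n, b, y0) - y) ** 2
              for x, y in zip(x_data, y_data))
    n_params = 3 + b  # v, K, n always estimated; y0 only when b = 1
    return aic(rss, len(x_data), n_params)

search_space = itertools.product(
    [1.0, 1.5, 2.0, 2.5],  # v  (real, coarse grid)
    [0.5, 1.0, 2.0],       # K  (real, coarse grid)
    [0.25, 0.5, 1.0],      # y0 (real, coarse grid)
    [0, 1, 2, 3],          # n  (integer, Hill-type exponent)
    [0, 1],                # b  (binary, term present or absent)
)
best = min(search_space, key=cost)
v_best, K_best, y0_best, n_best, b_best = best
```

A single minimization returns both the selected structure (`n_best`, `b_best`) and the calibrated real parameters, rather than requiring one calibration run per candidate structure followed by a separate comparison.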
MINLP solutions
In order to illustrate the capabilities of the methodology presented in this work, we generated
Therefore, we generated
The bounds for the real parameters were taken at 50% and 200% around the initial values and for the integers they are based on the typical values of Hill coefficients, from 0 to 3. The value of parameter

Parameter    Nominal value    MINLP solution
             5.18E+07         4.74E+07
             9.76E+01         1.20E+02
             5.79E-02         5.23E-02
             1.00E+03         8.66E+02
             4.96E+03         4.28E+03
             1.03E+03         1.27E+03
             2.05E+03         1.64E+03
             4.99E+01         6.06E+01
             1.00E+01         1.25E+01
             6.16E-04         7.08E-04
             1.82E-07         2.11E-07
             1.00E+03         8.48E+02
             1.18E+00         9.55E-01
             2.00E+06         1.76E+06
             9.74E-01         8.07E-01
             1.36E-01         1.84E-01

             3                3
             1                1
             2                3
             3                3
             0                0
             1                1
             1                1
             0                0
Subsequently, we solved the MINLP problem using fSSm and ACOmi as optimization methods and the AIC as the cost function. Both fSSm and ACOmi solved the problem of simultaneous model selection and parameter identification in an acceptable computation time, although fSSm showed a better overall performance (data not shown). The convergence curves for ten runs of fSSm (AIC value versus computational time) are depicted in Figure
Convergence curve of fSSm for the MINLP problem. AIC value versus computational time, in seconds, using a PC/Intel Xeon CPU, 2.13 GHz.
As can be seen in Table
[Figure panels: KdpFABC data; mRNA data]
Checking the multimodality of the MINLP
In order to assess the multimodality of the MINLP problem, a traditional multistart approach (i.e. choosing a large set of random initial points from inside the parameter bounds and performing a local search from each one) using MISQP was performed. The histogram in Figure
Multistart of the local solver MISQP on the MINLP problem. Histogram of a multistart of 50 runs using the local solver
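The multistart procedure described above can be sketched on a one-dimensional toy problem. The cost function, the bounds and the projected gradient descent used as local solver are illustrative assumptions; they stand in for the AIC surface and for MISQP, neither of which is reproduced here.

```python
import math
import random

def f(x):
    """Toy multimodal cost (an illustrative assumption, not the AIC surface of the case study)."""
    return math.sin(3 * x) + 0.1 * x * x

def df(x):
    """Analytic derivative of the toy cost."""
    return 3 * math.cos(3 * x) + 0.2 * x

def local_search(x, lower=-4.0, upper=4.0, step=0.05, iters=500):
    """Projected gradient descent: a simple stand-in for a local solver."""
    for _ in range(iters):
        x = min(upper, max(lower, x - step * df(x)))
    return x

rng = random.Random(0)  # fixed seed so that the run is reproducible
starts = [rng.uniform(-4.0, 4.0) for _ in range(50)]
ends = [local_search(x0) for x0 in starts]

# Histogram of end points: how many starts converge to each local solution.
histogram = {}
for x in ends:
    key = round(x, 2)
    histogram[key] = histogram.get(key, 0) + 1

best_x = min(ends, key=f)
```

If the histogram contains more than one end point, the problem is multimodal, and the fraction of starts that reach the best one gives a rough idea of how often a purely local method would succeed.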
Identifiability analysis of the resulting model
The FIM computed for the best set of parameters obtained by the global solver is full rank; therefore, we can assert that the parameters are locally identifiable.
Figure
Pseudo-global sensitivity for Model III. Pseudo-global sensitivity of Model III with respect to the two measured states (protein
Correlation matrix for Model III. Correlation matrix for Model III with the best parameter set.
The correlation matrix shows several highly correlated pairs of parameters, which explains the difficulties encountered by the local method in finding the global solution. Despite the identifiability difficulties of this problem, which make most of the solvers fail when trying to solve it, the residuals for the solution obtained by fSSm are small, indicating a precise parameter estimation,
Methodology strengths and limitations
The goal of this study is not to solve the general problem of model inference but a dense subcase of it, i.e., the discrimination among a subset of nested competing models and the simultaneous estimation of the model parameters. In other words, we consider the very frequent situation in systems biology where a first model is available based on previous knowledge, but new experimental information allows different hypotheses to be formulated to refine it. Thus, instead of solving a general inference problem (i.e. finding the model structure plus the parameters from a set of data), we consider a subproblem which is smaller (although still very challenging) and dense (so sparsity is not an issue), and which, therefore, does not suffer from many of the ill-posedness and ill-conditioning maladies of the general inference problem
• Scaling up to large-scale models: the corresponding MINLPs might become rather large and therefore the computational effort needed for their solution might become prohibitive.
• Non-uniqueness of biochemical reaction mechanisms: it is known that biochemical reaction networks with different structure and/or parametrization may produce the same dynamic response describing the time evolution of species concentrations (see the recent discussion and results in
• Model identification/selection metric: more advanced metrics for model selection, such as the likelihood ratio or the F-test, cannot be used in this approach since they rely on pairwise comparisons. However, in the presented methodology the AIC could be replaced by any other metric for model selection as long as it can establish a ranking for the set of competing models encompassing model performance and model complexity.
Conclusions
Here we have considered the modelbuilding cycle where an initial model, based on existing data and
Model selection scheme. Local sensitivity analysis and identifiability analysis allowed Model I to be reduced, leading to Model II. Subsequently, new hypotheses were formulated, and model selection and identification via MINLP led to Model III. The identifiability of Model III was assessed by means of a pseudo-global sensitivity approach and correlation analysis, indicating that no further modifications were required.
We consider this cycle in a more compact way, grouping sets of hypotheses together and formulating a parameterized set of nested models, from which the best alternative must be selected. We then formulate the decision problem as an MINLP-based optimization for simultaneous model selection and parameter identification.
This procedure has been applied to a case study considering potassium homeostasis in bacteria, arriving at the following conclusions: (i) the presented MINLP-based approach for nested-model selection is a powerful methodology for model selection and identification in systems biology; and (ii) for the case study considered, it has resulted in a model that presents a better fit to the
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
MRF and MR implemented the model options and performed the analysis of the novel methodology, carrying out the necessary computations. MRF performed the analysis of the optimization results, the identifiability computations and assisted in the coordination of the study. JRB and AK conceived of the study and participated in its design and coordination. MR, JRB and MRF drafted the manuscript. All authors read and approved the final manuscript.
Acknowledgements
Authors MRF and JRB acknowledge financial support from the EU ERASysBio programme and the Spanish MICINN and MINECO (SYSMO grant KOSMOBAC, ref. GEN2006-27747-E/SYS and project MultiScales ref. DPI2011-28112-C04-03, both with partial support from the European Regional Development Fund, ERDF). MR was supported by the Max Planck Society and the European Erasmus project. AK was funded in part by the BMBF through the EraNet initiative SysMO. We acknowledge support of the publication fee by the CSIC Open Access Publication Support Initiative through its Unit of Information Resources for Research (URICI).