Abstract
Background
The post-genome era has produced diverse categories of omics data. Inference and analysis of genetic regulatory networks play a prominent role in extracting inherent mechanisms, in discovering and interpreting the biological principles beneath complex phenomena, and eventually in promoting human well-being.
Results
A supervised combinatorial-optimization pattern based on information and signal-processing theories is introduced into the inference and analysis of genetic regulatory networks. An associativity measure is proposed to define the regulatory strength/connectivity, and a phase-shift metric determines regulatory directions among components of the reconstructed networks. This resolves the undirected-regulation problem that arises in most current linear/nonlinear relevance methods. To limit computational and topological redundancy, we constrain the classified group size of pair candidates within a multi-objective combinatorial optimization (MOCO) pattern.
Conclusions
We test the proposed approach on two real-world microarray datasets with different statistical characteristics. We thereby reveal the inherent design mechanisms of genetic networks by quantitative means, facilitating further theoretical analysis and experimental design for diverse research purposes. Qualitative comparisons with other methods, and related issues needing further work, are presented in the discussion section.
Background
Various cell phenotypes and functions within multicellular organisms relate directly to genetic contents decoded from DNA and RNA during transcriptional and translational processes. Inference of gene regulatory networks or maps for those intercellular processes plays a significant role in understanding the underlying regulatory mechanisms. Reconstructing such biological regulatory networks directly from gene profile datasets measured at different cell phases, in different cell types and even across species has therefore recently become one of the foremost research topics.
Advances in high-throughput microarray and ChIP assay techniques allow multiple expression profiles to be measured simultaneously, with gradually increasing accuracy and decreasing experimental costs, and thus facilitate the learning and inference of regulatory maps and even the functionality of these genetic networks. During the past decades, manifold inference and learning methods have been proposed to integrate raw data into computational frameworks for network models, such as (probabilistic) Boolean networks and (dynamic) Bayesian networks, systematic differential/difference equations [1-6], information theory-based modelling [7-10], and graph and control theoretic approaches [11-13].
Furthermore, most current biochemical network models are static descriptions of the inherent regulatory mechanisms, in the sense that once the system models and parameters for those genetic networks are set, the regulatory processes are determined. During genetic transcriptional and translational processes, however, real-world regulatory maps may undergo various perturbations from intercellular and intracellular signals and from undiscovered factors. From this perspective, a single modelling mode may not be sufficient to characterize all possible structures of these networks, or even the crucial ones for specific analysis purposes. These problems call for flexible mechanism designs that improve on the present rigid methods for network inference.
In the following parts, we propose an integrative supervised learning method for inferring the time-delayed cell cycle regulatory mechanism, based on information and signal processing theories. We first introduce definitions of the crucial concepts of correlation measure and mutual information; we then propose a novel associative quantity that combines the two kinds of dependency measures. With the proposed integrative metric and the P-values from the Pearson correlation of all gene pairs in the raw data pool, we may determine the dependency and connectivity among those pairwise candidates. Such an integrative dependency metric improves on a single linear or nonlinear criterion, since multiple criteria cross-validate one another when measuring dependency within the test results.
Moreover, from signal processing theory [5,14-16], a phase-shift metric is introduced for measuring the time delay of gene expression within pairwise candidates. The advantage of such a phase-shift metric lies in its flexibility in determining the variation of regulatory delay via dynamic thresholds on the transfer gains between pairwise candidates. Actual regulatory mechanisms admit multiple possibilities during biological processes, and the underlying regulatory delay effects may vary with context; the phase-shift metric elucidates such possibilities quantitatively via dynamic thresholding of transfer gains.
Another advantage of the method is its inherent capability of integrating existing biological knowledge as a priori information. This kind of knowledge-based inference avoids redundant false-positive connectivity among pairwise gene candidates. Moreover, the dynamic threshold for transfer gain makes the method potentially applicable to the majority of problems facing theoretical and experimental biologists. Since the regulatory connectivity of pairwise gene candidates may differ across tissues and sampling times, quantitative determination of these regulations with existing empirical and theoretical knowledge should be much more effective than most current single-mode computational approaches.
Results
The supervised learning framework covers two main aspects: it should characterize pairwise regulatory strengths and constrain subsequent computational redundancy. We apply the proposed method to two real-world datasets selected from the Stanford Microarray Database. The two datasets have different statistical characteristics, and both have been normalized and benchmarked in the recent literature [17-19].
Analysis on the Saccharomyces cerevisiae cell cycle microarray dataset
The first dataset, a Saccharomyces cerevisiae cell cycle microarray dataset, measures the regulatory responses under elutriation treatment and is available at the Stanford Microarray Database. The dataset has been benchmarked in the literature [10,20,21]. The log2-normalized expression profile of 24 genes from the regulatory network is plotted in Figure 1.
Figure 1. The log2-normalized gene expression profile for 24 genes from the cell cycle regulatory network (experimental condition: response to elutriation). The horizontal coordinate represents the sample time (14 points from 0 to 6.5 hours, equally sampled every 30 minutes); the vertical coordinate lists the 24 genes from the cell cycle genetic network.
Based on the definitions and concepts illustrated in the methodology part, we calculated the mutual information, correlation and P-values among pairwise genes for constructing regulatory activities. The mutual information matrix, correlation and corresponding P-values are given in the additional Figure 1A in Additional file 1 and additional Figure 1B in Additional file 2.
Additional file 1. The calculated mutual information matrix for 276 gene pairs from the 24 cell-cycle genes.
Additional file 2. The mutual information, correlation coefficient and corresponding P-value statistics, sorted in descending order.
As depicted in the lower subgraph of the additional Figure 1B in Additional file 2, there are more than 101 pairs with P-values not greater than 0.05 (indicated by the vertical line), the significance level commonly adopted in most research fields. Therefore around 60%, or 165, of the hypothetical reaction edges are redundant and may be removed before further reconstruction of the regulatory network; in this map, on average, every gene has direct or indirect relations with 4 to 5 other genes. This conforms to the generally recognized view that most biochemical regulatory networks are sparsely connected.
Thus through dynamic thresholding of mutual information and correlation coefficient, we obtain the global distributions for three pair groups under dynamic metrics. The distributions for the classified pair groups are illustrated in Figure 2.
Figure 2. The global statistics for pairwise gene numbers under different mutual information values and correlation coefficients. In total, there are 276 pair candidates for the network of 24 genes. The horizontal axis represents different mutual information thresholds, and the vertical axis illustrates correlation coefficient thresholds. The corresponding three-dimensional graph is given in the additional Figure 2A in Additional file 3 for comparative purposes within the three groups.
Additional file 3. The three-dimensional distribution for authentic (APGs), questionable (QPGs), and unauthentic pairwise genes (UPGs).
The supervised inference procedure starts from the respective centroids, i.e. 0.5709 and 0.4358 for mutual information and correlation coefficient. From the heat maps in Figure 2, we may observe approximately diagonal symmetries in the variations between mutual information and correlation coefficient, especially for the APGs group. Such phenomena facilitate detecting suitable initial thresholds and optimal iteration tracks.
Also, with acquired knowledge, e.g. that genetic networks are sparsely connected and that their topologies normally follow ‘small-world’ properties, the interactive computations halt at 0.4950 for the mutual information threshold and 0.4602 for the correlation threshold. At these terminal thresholds, the APGs, QPGs and UPGs groups have 83, 157 and 36 candidates respectively.
We can then calculate the global phase-shift statistics for the APGs group, based on the signal processing theory defined in the methodology section. Figure 3 illustrates the calculated global phase-shift statistics. The details of the statistics for the gene pairs in the APGs group are given in the additional Figure 3A in Additional file 4.
Figure 3. The global phase-shift statistics distribution for the APGs of the cell cycle regulatory network (83 pairwise candidates in total in the APGs). The phase-shift statistics vary as functions of the gain thresholds. The bold blue curve represents the overall tendency of gene pairs with leading phase shifts (positive), the red curve the pairs with lagging phase shifts (negative), and the green curve those without a detected phase shift (undirected), i.e. there might be no regulatory activity between the corresponding gene pairs (the same applies in the following figures). Through dynamic gain thresholding, we may determine concrete regulatory time lags, regulatory directions and signal intensities from the quantitative signal processing perspective.
Additional file 4. The phase-shift statistics for the group APGs.
For this case, the gain threshold is set at 0.3; see the additional Figure 3A in Additional file 4. The centroids for the mutual information and correlation coefficients within total available pairs are 0.6193 and 0.6900 respectively. The search for optimal solutions stops at a mutual information threshold of 0.4950, a correlation coefficient threshold of 0.4602 and a P-value threshold of 0.05. Thus we obtain valid links and concrete regulatory directions under the current conditions. Figure 4 illustrates the reconstructed regulatory network.
Figure 4. The interwoven cell cycle regulatory network rebuilt based on the MICORPS framework. Each gene/protein is denoted as a black-edged circle. The calculated associativity metric and phase-shift information between pairwise genes are marked in blue along each bilateral link; see the additional Figure 4A in Additional file 5 for details of the associativity measure.
Additional file 5. Associativity measure statistics for the group APGs from the Saccharomyces cerevisiae cell cycle microarray dataset.
As depicted, only gene #4 (YDL056W) is isolated from the network structure, suggesting that YDL056W might belong to other regulatory processes under the current conditions. Gene #2 (YER111C) has only a single regulatory link, as do genes #9 (YLR079W) and #10 (YAL040C). In contrast, genes such as #1 (YDR146C), #3 (YLR182W) and #16 (YDR507C) have multiple regulatory links, indicating that they take on a larger share of the underlying interaction and regulation processes.
Since the above analysis concerns a dataset with normal statistical characteristics, the proposed methods can be applied directly. In the following part, we discuss a microarray dataset with different statistical properties.
Analysis on the dataset from a p53 pathway with multiple feedback loops
The profile dataset of the p53 pathway with multiple feedback loops is selected from recent work [10] concerning human leukaemia cell lines (MOTL4) with the functional protein p53. The triplicate MOTL4 microarray experiments were performed under irradiation from 0 to 12 hours at intervals of 2 hours, as depicted in Figure 5. The additional Figure 5A in Additional file 6 and additional Figure 5B in Additional file 7 illustrate the mutual information matrix and correlation statistics of all gene pairs for the p53 pathway.
Additional file 6. Mutual information matrix for the triplicate MOTL4 microarray experiments.
Additional file 7. The mutual information, correlation coefficient and corresponding P-value statistics, sorted in descending order.
Figure 5. The triplicate MOTL4 microarray experiments are implemented under irradiation from 0 to 12 hours at intervals of 2 hours. The expression profile is plotted with the mean values of the triplicate datasets. The horizontal axis denotes the time range from 0 to 12 hours, and the vertical axis for the corresponding 16 gene/protein names.
However, this kind of dataset does not suit the above network-constructing algorithm, since only 10 pair candidates have P-values below 0.05 (91.7% of the total pairs have P-values above 0.05); see the additional Figure 5B in Additional file 7. It is impossible to construct a genetic network of 16 genes with just 10 suitable candidate links. Thus, before applying the PGHC algorithm, it is necessary to modify the P-value threshold.
As in the former case, 40%~45% of the total pairs are needed as suitable candidates for constructing genetic networks, so we raise the threshold high enough to derive the necessary pair candidates for composing the APGs group via the proposed PGHC algorithm. For this case, we raise the P-value threshold to about 0.8, and obtain the global statistical distribution for the three groups through dynamic thresholding of mutual information and correlation coefficient. The distribution plots for the classified pair groups are illustrated in Figure 6.
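One way to pick a relaxed P-value threshold that admits a target share of the pair candidates can be sketched as follows. This is only an illustration under stated assumptions: the function name `p_threshold_for_fraction` is hypothetical, pairs are admitted when their P-value is at most the cutoff, and the text does not say that the value of about 0.8 was obtained this way.

```python
import math

def p_threshold_for_fraction(p_values, fraction):
    """Smallest cutoff c such that at least `fraction` of the pairs have P-value <= c."""
    if not p_values:
        raise ValueError("p_values must not be empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must lie in (0, 1]")
    ranked = sorted(p_values)
    # Number of pairs that must pass the cutoff to reach the target share.
    needed = math.ceil(fraction * len(ranked))
    return ranked[needed - 1]

# Toy P-values (made up for illustration, not taken from the datasets).
toy_p_values = [0.9, 0.01, 0.5, 0.2]
cutoff = p_threshold_for_fraction(toy_p_values, 0.5)
admitted = [p for p in toy_p_values if p <= cutoff]
```

For the 120 pair candidates of the 16-gene network, a target share of 40% to 45% corresponds to 48 to 54 admitted pairs.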
Figure 6. The global statistics for pairwise gene numbers under different mutual information values and correlation coefficients. In total, there are 120 pair candidates for the network of 16 genes. The horizontal axis represents different mutual information thresholds, and the vertical axis illustrates correlation coefficient thresholds. See the additional Figure 6A in Additional file 8 for a three-dimensional comparative graph of the three groups.
Additional file 8. The three-dimensional distribution for authentic (APGs), questionable (QPGs), and unauthentic pairwise genes (UPGs).
We can then calculate the global phase-shift statistics for the APGs group, based on the signal processing concepts defined in the methodology section. The calculated global phase-shift details are given in Figure 7. The additional Figure 7A in Additional file 9 illustrates the details of the statistics for the gene pairs in the APGs group.
Figure 7. The calculated phase-shift statistics distribution (55 pairwise candidates in total for the APGs group in the multi-feedback p53 pathway). The bold blue curve represents the overall tendency of gene pairs with leading phase shifts (positive), the red curve the pairs with lagging phase shifts (negative), and the green curve those without a detected phase shift (undirected), i.e. there might be no regulatory activity between the corresponding gene pairs.
In the following network-building procedure, we again choose the corresponding centroids of both metrics as the initial points for the iterative computation. The centroids for the mutual information and correlation coefficients over all available pairs are 0.7992 and 0.5203 respectively.
The search for optimal solutions stops when the mutual information threshold backtracks to 0.7, with a correlation coefficient threshold of 0.3 and a P-value threshold of 0.8 for the whole iterative procedure. To test the significance of the gain for network topological structures, the gain threshold is set to 0.3 and 1 respectively. We may thus derive valid links and concrete regulatory directions at the two gain thresholds from the additional Figure 7A in Additional file 9. The reconstructed regulatory networks are plotted in Figure 8 and the additional Figure 8A in Additional file 10. Detailed information on the related links within the APGs group is given in additional Figure 8B in Additional file 11.
Additional file 9. The phase-shift statistics for the group APGs.
Additional file 10. The constructed genetic map with gain threshold at 1.
Additional file 11. Associativity measure statistics for the group APGs in the human cancer MOTL4 cell cycle microarray dataset.
Figure 8. The constructed genetic graphs under different gain thresholds. The structure shown is constructed with the gain threshold at 0.3, and the additional Figure 8A in Additional file 10 adopts 1 as the gain threshold. As depicted in the figure, #5 (cdk2) is a weakly connected node, while #3 (MDM2), #10 (β-catenin) and #12 (PIP3), among others, are strongly connected under the current gain threshold.
Discussion
The comparison with the currently available inference methods
Several inference approaches currently exist for biochemical networks, e.g. probabilistic approaches and equation-based methods. As described above, the proposed method addresses a key inference issue, namely integrating previously acquired biological knowledge as a priori information via dynamic thresholding of multi-source information. Compared with most computation-oriented methods, the proposed inference framework thus aims to improve inference accuracy and experimental outcomes within a problem-oriented scheme.
Secondly, the proposed method tackles one of the most important problems from the perspective of signal processing theory, namely the determination of regulatory directions between candidate gene pairs. The introduced metrics quantify the underlying regulatory strengths and directions between pair candidates, globally and comparatively, which facilitates the follow-up network-rebuilding procedure.
Moreover, the proposed inference framework can present multiple optimal or suboptimal candidate regulatory maps in parallel, instead of a single computational solution for one problem scheme, since in most cases a single solution cannot explain as much of the inherent mechanism as expected. The proposed method can use the diverse knowledge available, either from concrete biochemical experiments or from the current literature.
The current focuses of the proposed method and its future directions
Although the proposed inference framework is validated with real-world profile datasets, several directions still need further refinement, as described below.
In practice, most available profile datasets are high-dimensional, particularly profiles with few time points and many samples, and they carry unavoidable measurement noise. Suitable preprocessing is therefore required before further analysis. The indispensable preprocessing covers denoising, functional and hierarchical clustering and so forth, before the next-stage network reconstruction.
The second concern relates to the biological-function analysis of network modules and motifs by quantitative means. The proposed framework deciphers genetic regulatory activities in an information-rich mode. Thus, the inference results and related information between pairwise candidates have potential for applications such as the subsequent identification of biological modules and motifs of particular interest.
The third focus is the topological properties of inferred regulatory networks. Quantitative analysis and comparison of diverse constructed topologies might reveal inherent coordination and organization mechanisms, with potential applications in, to name a few, identifying target genes and novel drug discovery, particularly in computational systems biology.
Conclusions
In this work, we propose a combinatorial theory-based learning pattern for the inference and analysis of genetic networks from microarray time-series datasets.
For the different kinds of microarray datasets gathered from multiple organisms and species, there is still no efficient solution applicable to most of the current problems facing biological theoreticians and experimentalists. Taking into account previously acquired knowledge, decision-makers’ preferences and practical constraints, network inference can be transformed into a multi-objective combinatorial optimization (MOCO) problem.
Compared with currently available methods for inferring biochemical networks [20,21], the proposed approach makes it possible for biologists to incorporate concrete theoretical and empirical knowledge, and thus to construct regulatory networks with greater reliability and accuracy. Secondly, different regulatory models focus on the specific perspectives and utilities adopted by their builders; the inherent complexity of the inference procedures and the need to optimize their results therefore call for this kind of associative relevance metric and multi-objective combinatorial optimization method.
To include specific nodes in, or exclude them from, reconstructed networks with sufficient confidence and previously acquired knowledge, several design approaches exist within the proposed framework. In this work, we decipher the underlying design mechanisms of pairwise connectivity via dynamic thresholding of linear/nonlinear relevance metrics, i.e. mutual information, correlation coefficient and P-value, and determine regulatory orientations within genetic networks with signal processing metrics, i.e. phase shift and transfer gain.
With the inference procedure transposed into a MOCO problem, we can constrain the multi-objective iterative search with reasonable terms drawn from acquired knowledge, experimental conditions, and other computational considerations or decision-makers’ preferences.
We apply the proposed method to two microarray datasets with different statistical characteristics. Thus, by quantitative means, we reveal the inherent design mechanisms of genetic networks, facilitating further theoretical analysis and experimental design with diverse biochemical aims.
For the sake of simplicity, we test the proposed approach on a few small-scale datasets; for large-scale datasets, say with hundreds or thousands of genes/proteins, clustering and classification methods are beneficial and necessary as preprocessing steps.
Methods
Based on probability and signal processing theories, the following section introduces a dimensionless metric for regulatory strengths and a phase-shift metric for determining regulatory orientations. For network inference, we propose a combinatorial-optimization framework for constraining the inference complexities. The framework allows the possibility of incorporating acquired knowledge and specific aims for integrative mining and analysis.
Probability theory-based inference of biological network structures
Correlation analysis aims to reveal the strength of a linear relationship between random variables (R.V.); statistical correlation (coefficient) represents the departure of two R.V. from independence. Among the various metrics often used to measure the correlation or association, the Pearson product-moment correlation coefficient is applicable to some data of diverse characteristics. Normally, the correlation ρ_{X,Y} is defined as the covariance of two R.V. divided by the product of their standard deviations, which can be represented as [7,10,12,13]
ρ_{X,Y} = cov(X, Y) / (σ_{X} σ_{Y}) = E[(X - μ_{X})(Y - μ_{Y})] / (σ_{X} σ_{Y})    (1)
where cov indicates covariance, E is the expected value operator, μ_{X} = E(X), and σ_{X}^{2} = E[(X - E(X))^{2}] = E(X^{2}) - E^{2}(X).
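The correlation coefficient defined above can be computed directly from two expression profiles. The following is a minimal sketch in plain Python; the function name `pearson_correlation` and the toy profiles are illustrative assumptions, and the P-value computation used in the Results is not shown.

```python
import math

def pearson_correlation(x, y):
    """Pearson product-moment correlation of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares; the common 1/n factor cancels in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)

# Toy profiles (made up): the second is a scaled copy of the first, the third is reversed.
r_same = pearson_correlation([1, 2, 3, 4], [2, 4, 6, 8])
r_opposite = pearson_correlation([1, 2, 3, 4], [8, 6, 4, 2])
```

A perfectly linear increasing relationship gives a coefficient of 1, and a perfectly linear decreasing one gives -1.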
When interpreting the Pearson product-moment correlation coefficient, Cohen noted that the proposed interpretative criteria were arbitrary in general and that specific treatments should be adopted for specific cases in fields ranging from physics to the social sciences [22]. Apart from this parametric statistic, non-parametric correlation metrics such as the χ^{2} test, Spearman’s ρ, and Kendall’s τ have been proposed, and those metrics can be applied to problems with diverse non-normal distributions [23].
Information-theoretic inference of biological network structures
To quantify the mutual dependence of two R.V., mutual information is frequently adopted as an alternative in information-theoretic applications, in addition to the above metric. The mutual information of two discrete R.V. can be defined as [24],
I(X; Y) = Σ_{y∈Y} Σ_{x∈X} p(x, y) log[ p(x, y) / (p_{1}(x) p_{2}(y)) ]    (2)
where p(x, y) denotes the joint probability distribution of X and Y, and p_{1}(x) and p_{2}(y) represent the marginal probability distributions of X and Y respectively. The measure normally adopts the well-defined form I(X, Y, b), where b denotes the base of the logarithm. A base of 2 is commonly specified, which gives mutual information in units of bits. Thus, for analysis within this context, we consistently use the base of 2.
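A plug-in estimate of the mutual information of two discrete sequences can be sketched as below. The function name `mutual_information` is hypothetical, and the sketch assumes the expression values have already been discretized (for example, binned into levels); the binning scheme is not specified in the text, so it is left out here.

```python
import math
from collections import Counter

def mutual_information(x, y, base=2):
    """Plug-in estimate of I(X; Y) from two equal-length sequences of discrete symbols."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    joint = Counter(zip(x, y))   # counts of co-occurring (x, y) symbols
    marg_x = Counter(x)          # counts of each x symbol
    marg_y = Counter(y)          # counts of each y symbol
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # Each observed symbol pair contributes p(x, y) * log[p(x, y) / (p1(x) p2(y))].
        mi += p_ab * math.log(p_ab / ((marg_x[a] / n) * (marg_y[b] / n)), base)
    return mi

# Toy discretized profiles (made up): identical sequences versus independent ones.
mi_identical = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

With base 2, two identical balanced binary sequences share one bit of information, while the independent pair shares none.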
Associativity measure for describing regulatory connectivity
The above-described measures illustrate the correlation and dependence relationships of R.V. Normally, these R.V. characterize different entities within a system. The interconnections in the biological network can be weighted by the probability of association between the pairs being investigated [25]. Since the above metrics, i.e. the Pearson product-moment correlation and mutual information, are dimensionless vector quantities, we introduce an associativity measure (AM) for illuminating the connectivity between candidate pairs. Within this uniform measure, the quantities of mutual information and correlation metrics can be projected onto the orthogonal coordinates of a 2D plane. The metric is represented in a formal term as,
(3)
where MI_{i} and Cor_{i} denote the mutual information and correlation quantities respectively; ω_{i1} and ω_{i2} represent the weights of both quantities; α_{i} is the phase difference for the i-th pair candidate; and N is the set of natural numbers. Note that the weights here aim to leverage any possible asymmetric distribution within the datasets of the above subterms MI_{i} and Cor_{i}. The weights can be derived from previously acquired knowledge or from a specific theoretical hypothesis, e.g. the respective centroids of datasets.
Phase-shift metric for determining regulatory directions
Currently, most gene expression profiles are discrete time-series data. The data samples are diverse expression densities measured at multiple time points, and the data intervals represent the sampling periods. When n samples are compared, a total of n(n-1)/2 pairwise comparisons are obtained. Butte et al. utilized a type of signal processing method to cluster and compare the similarity of expression profiles [26]. For every potential pairwise regulation, the activities of the investigated genes can be modularized as a subsystem. Their expression patterns might be viewed as input and output signals, as shown in Figure 9.
Figure 9. Each pairwise association might be modularized as a subsystem with the expression patterns serving as input and output signals.
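The n(n-1)/2 count of pairwise comparisons mentioned above can be checked by enumerating the unordered pairs. The short sketch below uses a hypothetical helper name, `candidate_pairs`, and reproduces the pair totals quoted for the 24-gene and 16-gene networks in the Results.

```python
from itertools import combinations

def candidate_pairs(genes):
    """All unordered pairs of distinct genes, i.e. n(n-1)/2 pair candidates."""
    return list(combinations(genes, 2))

# Gene indices stand in for gene names here.
pairs_24 = candidate_pairs(range(24))   # yeast cell cycle network
pairs_16 = candidate_pairs(range(16))   # p53 pathway network
```

Each pair appears once, so direction is not encoded at this stage; it is assigned later from the phase-shift metric.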
For each pair, the coherence, gain, and phase shift might be calculated by discrete Fourier transform (DFT) of the inputs and outputs. The coherence of signals a and b is a function of the power spectral density (PSD) and the cross power spectral density (CPSD), defined as below,
Coherence_{ab}(f) = |CPSD_{ab}(f)|^{2} / (PSD_{aa}(f) PSD_{bb}(f))    (4)
where PSD_{aa}(f), PSD_{bb}(f), and CPSD_{ab}(f) denote the PSDs and the CPSD of the associated pairwise signals, and f denotes frequency. Normally, signals a and b are of the same length. A coherence of 1 indicates that the two investigated signals are scalar multiples of each other at that frequency, while a coherence of 0 indicates that they are not linearly related. The transfer function (TF) between two associated input/output signals measures the signal amplification and the related time lag/latency properties, and is defined as,
TF_{ab}(f) = CPSD_{ab}(f) / PSD_{aa}(f)    (5)
The transfer functions are in general complex-valued; their arguments (arctangents) are the corresponding transfer phases (TP), and their absolute values are the related transfer gains (TG). Both metrics are represented as,
TP_{ab}(f) = arctan( Im[TF_{ab}(f)] / Re[TF_{ab}(f)] )    (6)
TG_{ab}(f) = |TF_{ab}(f)|    (7)
Theoretically, the TP gives the phase shift between the investigated pairwise signals, i.e. the input and output. The phase shift lies within -π to π, where π represents a phase lead of half a wavelength and -π denotes a phase lag of half a wavelength. The transfer gain indicates whether the input signals are amplified or attenuated at the output, and to what degree, at different frequencies. The larger the gain, the less energy is lost at the output. Note that the transfer phase and transfer gain may differ from one frequency to another. An effective evaluation criterion for these metrics is the related coherence: at frequencies where the coherence values are high, the corresponding transfer phases and gains are much more reliable than at others.
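The coherence, transfer gain and transfer phase described above can be estimated at one frequency bin with a plain DFT, as in the sketch below. Several points are assumptions rather than details from the text: the names `dft_bin` and `spectral_metrics` are hypothetical, the spectra are summed over consecutive non-overlapping segments (with a single segment the coherence estimate is trivially 1), the cross spectrum is taken as conj(A)·B with a as input and b as output, and the phase uses the four-quadrant arctangent so that it covers the full range from -π to π.

```python
import cmath
import math

def dft_bin(segment, k):
    """k-th DFT coefficient of a real-valued segment."""
    n = len(segment)
    return sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(segment))

def spectral_metrics(a, b, segment_length, k):
    """Return (coherence, transfer gain, transfer phase) at DFT bin k for input a and output b."""
    if len(a) != len(b):
        raise ValueError("signals a and b must have the same length")
    n_segments = len(a) // segment_length
    if n_segments < 1:
        raise ValueError("segment_length exceeds the signal length")
    psd_aa = psd_bb = 0.0
    cpsd_ab = 0j
    for s in range(n_segments):
        start = s * segment_length
        fa = dft_bin(a[start:start + segment_length], k)
        fb = dft_bin(b[start:start + segment_length], k)
        # Accumulate over segments; normalization constants cancel in the ratios below.
        psd_aa += abs(fa) ** 2
        psd_bb += abs(fb) ** 2
        cpsd_ab += fa.conjugate() * fb
    if psd_aa == 0 or psd_bb == 0:
        raise ValueError("no signal power at this frequency bin")
    coherence = abs(cpsd_ab) ** 2 / (psd_aa * psd_bb)
    transfer = cpsd_ab / psd_aa
    return coherence, abs(transfer), cmath.phase(transfer)

# Toy signals (made up): b is a half-amplitude copy of a delayed by a quarter period.
period = 8
sig_a = [math.cos(2 * math.pi * t / period) for t in range(16)]
sig_b = [0.5 * math.sin(2 * math.pi * t / period) for t in range(16)]
coh, gain, phase = spectral_metrics(sig_a, sig_b, period, 1)
```

In this toy case the output is attenuated to half the input amplitude and lags it by a quarter period, so the phase is negative under the sign convention assumed here.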
The advantages of such metrics lie in the flexible and quantitative determination of the regulatory delay via dynamic thresholding. Actual regulatory mechanisms admit multiple possibilities, and inherent regulatory delay effects might vary during the whole biological process. The phase-shift metric determines such possibilities underlying regulatory mechanisms in a quantitative manner. A further advantage is the inherent capability of integrating a priori biological knowledge. This kind of knowledge-based inference avoids redundant false-positive connectivities among pairwise candidates.
Such dynamic thresholding is applicable to the majority of problems facing theoretical and experimental biologists. Since regulatory connectivity underlying pairwise candidates may differ in diverse processes or at different sampling times, systematic and quantitative determination of these regulations with empirical and theoretical knowledge will be much more effective than that offered by most currently available computational approaches [17]. Such flexible network connectivities and regulations characterize major regulatory processes from the perspectives of information and signal processing theories.
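How a gain threshold and the sign of the phase shift might be turned into a direction label can be sketched as follows. This is an assumed decision rule, not one spelled out in the text: the function name `regulatory_direction`, the tolerance on the phase, and the choice to treat pairs whose gain falls below the threshold as undirected are all illustrative.

```python
def regulatory_direction(phase, gain, gain_threshold=0.3, phase_tolerance=1e-9):
    """Label one candidate pair as 'leading', 'lagging' or 'undirected'."""
    # Assumption: a pair whose transfer gain is below the threshold is not given a direction.
    if gain < gain_threshold:
        return "undirected"
    # Assumption: a phase shift indistinguishable from zero means no detected direction.
    if abs(phase) <= phase_tolerance:
        return "undirected"
    return "leading" if phase > 0 else "lagging"

# Toy (phase, gain) values, made up for illustration.
labels = [
    regulatory_direction(0.7, 0.5),    # positive phase, gain above threshold
    regulatory_direction(-0.7, 0.5),   # negative phase, gain above threshold
    regulatory_direction(0.7, 0.1),    # gain below threshold
]
```

Raising `gain_threshold` moves more pairs into the undirected group, which is one way to read the dependence of the phase-shift statistics on the gain threshold.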
A MOCO pattern for constraining computational complexities
In the following sections, we extract inherent regulations and decipher network structures by introducing a pairwise gene hierarchy criterion (PGHC) for classifying possible gene pairs into three major groups as follows.
(1) Authentic Pairwise Genes (APGs): These include pairs with mutual information values and correlation coefficients larger than specific thresholds. Moreover, the corresponding P value must be statistically significant, namely, smaller than 0.05.
(2) Questionable Pairwise Genes (QPGs): These include pairs that satisfy only part of the criteria mentioned above. The group contains pairs of two classes. One class has pairs with mutual information larger than the specific threshold but satisfying neither the correlation-coefficient criterion nor the P-value criterion. The other class includes pairs with correlation coefficients larger than the specific threshold and with significant P values, but whose mutual information does not reach its threshold.
(3) Unauthentic Pairwise Genes (UPGs): These include those pair candidates that do not satisfy any criteria of the APGs or QPGs defined above.
The QPGs act as a subsidiary candidate pool for the APGs in case the empirical thresholds are set too high to extract structures from the APGs alone. Under such conditions, the QPGs will be ranked according to mutual information values, correlation coefficients, and P values. Optimal pairs will then be allocated to the APGs to refine the former network connectivity. The algorithm for the supervised PGHC is shown in Table 1.
Table 1. Algorithm: Pairwise Gene Hierarchy Criterion
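The grouping rules (1) to (3) and the promotion of QPGs described in the text can be sketched as follows. This is a minimal illustration and not the authors' Table 1 algorithm. The names `PairStats`, `classify_pairs` and `promote`, the threshold values and the example scores are hypothetical. Two simplifications are assumed: a pair that passes the mutual-information threshold but fails only one of the correlation and P-value conditions is treated as a QPG, and the ranking is lexicographic (mutual information first, then correlation, then P value).

```python
from typing import NamedTuple


class PairStats(NamedTuple):
    pair: tuple      # (gene_a, gene_b)
    mi: float        # mutual information
    corr: float      # correlation coefficient
    p_value: float   # P value of the correlation


def classify_pairs(stats, mi_min, corr_min, alpha=0.05):
    """Split candidate pairs into APGs, QPGs and UPGs."""
    groups = {"APG": [], "QPG": [], "UPG": []}
    for s in stats:
        mi_ok = s.mi > mi_min
        # Correlation criterion: large coefficient AND significant P value.
        corr_ok = s.corr > corr_min and s.p_value < alpha
        if mi_ok and corr_ok:
            groups["APG"].append(s)
        elif mi_ok or corr_ok:
            groups["QPG"].append(s)   # exactly one criterion satisfied
        else:
            groups["UPG"].append(s)
    return groups


def promote(groups, k):
    """Move the k best-ranked QPGs into the APGs (assumed lexicographic ranking)."""
    ranked = sorted(groups["QPG"], key=lambda s: (-s.mi, -s.corr, s.p_value))
    groups["APG"].extend(ranked[:k])
    groups["QPG"] = ranked[k:]
    return groups


# Hypothetical scores for four candidate pairs.
stats = [
    PairStats(("g1", "g2"), 0.9, 0.80, 0.01),  # passes both criteria
    PairStats(("g1", "g3"), 0.7, 0.20, 0.30),  # mutual information only
    PairStats(("g2", "g3"), 0.1, 0.75, 0.02),  # correlation and P value only
    PairStats(("g1", "g4"), 0.1, 0.10, 0.60),  # passes neither
]

groups = classify_pairs(stats, mi_min=0.5, corr_min=0.6)
sizes_before = {name: len(members) for name, members in groups.items()}

groups = promote(groups, k=1)
apg_pairs_after = [s.pair for s in groups["APG"]]
print(sizes_before, apg_pairs_after)
```

Keeping classification and promotion as separate steps mirrors the description above: promotion is only needed when the thresholds leave too few APGs to recover a structure.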
Thus, network reconstruction can be cast as a class of MOCO problems [10,12,13]. The first optimization objective is to reach suitable thresholds for mutual information and the correlation coefficient so as to maximize the feasible components in the APGs, so that the inference can be carried out with more confidence and reliability. The second objective is to maximize the UPGs: the larger the UPGs, the smaller the feasible solution space for subsequent computations and the fewer the problems faced during further solution searching. In addition, the following constraints exist. The group sizes are non-negative, and the total number of pair candidates is fixed, i.e. the valid combinatorial space is limited. The gain thresholds for guaranteeing valid network connectivity, previously acquired biochemical knowledge, and different experimental conditions constitute further prominent constraints on the reconstruction process. The MOCO paradigm is described as follows,
(8)
where F_{i} is the multi-objective function set; S_{1} is the set of feasible group combinations for APGs, QPGs, and UPGs; S_{2} is the number set of all gene pairs (S_{2} = {n(n-1) / 2}, n is the total number of genes); S_{3} is the set of necessary gain constraints (GC); and S_{4} is the set of possible constraints from acquired biological knowledge (ABK).
Recently quite a few authors have argued the necessity of incorporating the preferences of decision-makers (DM) into MOCO solution selection [27-29]. For the problem under investigation, the DM’s preferences mainly stem from the GC (S_{3}) and ABK (S_{4}) illustrated above.
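Read together with the objectives and constraints listed before Eq. (8), these definitions suggest a formulation of roughly the following shape. It is an illustrative reading assembled from the surrounding text, not the authors' Eq. (8). The decision variable x (a choice of thresholds, and hence a grouping of the pairs), the split of F_{i} into two components F_{1} and F_{2}, and the cardinality notation are assumptions.

```latex
\begin{aligned}
\max_{x \in S_{1}} \quad & F(x) = \bigl(F_{1}(x),\, F_{2}(x)\bigr),
  \qquad F_{1}(x) = |\mathrm{APG}(x)|, \quad F_{2}(x) = |\mathrm{UPG}(x)| \\
\text{s.t.} \quad & |\mathrm{APG}(x)|,\ |\mathrm{QPG}(x)|,\ |\mathrm{UPG}(x)| \ge 0, \\
& |\mathrm{APG}(x)| + |\mathrm{QPG}(x)| + |\mathrm{UPG}(x)| = \frac{n(n-1)}{2} \in S_{2}, \\
& x \ \text{satisfies the gain constraints (GC) in } S_{3}, \\
& x \ \text{satisfies the acquired biological knowledge (ABK) in } S_{4}.
\end{aligned}
```

Since the three group sizes sum to a fixed total, the two objectives compete with each other, which is why the problem is treated as multi-objective rather than as a single maximization.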
In cases governed by lower thresholds of the mutual information and correlation metrics, the APGs will form the largest group among the total pair candidates. On the other hand, with heightened thresholds, many more pairs might be grouped into the UPGs. This reduces the computational complexity of network reconstruction, since the APGs have fewer components in such situations. If the APGs are classified with above-normal sizes, the reconstructed network will be densely connected and will contain many more redundancies. Conversely, a sparsely connected structure will be inferred from an undersized candidate group of APGs.
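The trade-off described above can be made concrete with a small threshold sweep. The sketch below is illustrative only: the function `group_sizes`, the four genes, their scores and the three threshold settings are hypothetical, and the grouping rule is the simplified one in which a pair satisfying exactly one of the two criteria counts as a QPG.

```python
from itertools import combinations


def group_sizes(scores, mi_min, corr_min, alpha=0.05):
    """Count APGs, QPGs and UPGs for one threshold setting."""
    sizes = {"APG": 0, "QPG": 0, "UPG": 0}
    for mi, corr, p_value in scores.values():
        mi_ok = mi > mi_min
        corr_ok = corr > corr_min and p_value < alpha
        if mi_ok and corr_ok:
            sizes["APG"] += 1
        elif mi_ok or corr_ok:
            sizes["QPG"] += 1
        else:
            sizes["UPG"] += 1
    return sizes


genes = ["g1", "g2", "g3", "g4"]
pairs = list(combinations(genes, 2))   # n(n-1)/2 = 6 candidate pairs

# Hypothetical (mutual information, correlation, P value) per pair.
values = [
    (0.9, 0.9, 0.001),
    (0.7, 0.6, 0.01),
    (0.5, 0.7, 0.02),
    (0.4, 0.3, 0.04),
    (0.3, 0.5, 0.03),
    (0.1, 0.1, 0.5),
]
scores = dict(zip(pairs, values))

# Sweep from permissive to strict thresholds (same value for both metrics).
sweep = {t: group_sizes(scores, mi_min=t, corr_min=t) for t in (0.2, 0.45, 0.8)}
for t, sizes in sweep.items():
    print(t, sizes)
```

In this toy example the APGs shrink and the UPGs grow as the thresholds rise, while the three sizes always sum to the fixed total of n(n-1)/2 pairs, which corresponds to the constraint on the total number of pair candidates.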
Since biological theoreticians and experimentalists may vary specific mutual information and correlation thresholds to incorporate empirical or concrete knowledge into the reconstruction procedures, the underlying coordination approaches via the MOCO framework might be feasible and significant, especially for those containing pivotal structural connectivity or for specific analysis purposes.
The composition of the APGs, QPGs, and UPGs evolves as the thresholds on the above metrics are varied dynamically and as related biochemical knowledge is incorporated, as shown in Figure 10.
Figure 10. Schematic representation of the MOCO problem by dynamic thresholding of mutual information and correlation metrics. Total pairs are classified into APGs, QPGs and UPGs. The upper rightward horizontal arrow represents dynamic thresholding by mutual information, and the left descending arrow is for thresholding of the correlation measure.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
BHT proposed the methods, performed the analysis and composed the work; XCW and GT gave advice and proof-checked the work; SSC commented on the methods and the writing; QJ and BRS led the project and coordinated the research progress.
Acknowledgements
This research work has been supported in part by the National 973 Program of China (No. 2007CB947002) and the Postgraduate Innovation Fund of Tongji University.
This article has been published as part of BMC Systems Biology Volume 4 Supplement 2, 2010: Selected articles from the Third International Symposium on Optimization and Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1752-0509/4?issue=S2
References

1. Shmulevich I, Gluhovsky I, Hashimoto RF, Dougherty ER, Zhang W: Steady-state analysis of genetic regulatory networks modelled by probabilistic Boolean networks. Comparative and Functional Genomics 2003, 4(6):601-608.
2. Faure A, Naldi A, Chaouiya C, Thieffry D: Dynamical analysis of a generic Boolean model for the control of the mammalian cell cycle. Bioinformatics 2006, 22(14):e124-131.
3. Garg A, Di Cara A, Xenarios I, Mendoza L, De Micheli G: Synchronous versus asynchronous modeling of gene regulatory networks. Bioinformatics 2008, 24(17):1917-1925.
4. Faryabi B, Vahedi G, Chamberland JF, Datta A, Dougherty ER: Optimal constrained stationary intervention in gene regulatory networks.
5. Ching WK, Zhang SQ, Jiao Y, Akutsu T, Tsing NK, Wong AS: Optimal control policy for probabilistic Boolean networks with hard constraints. Systems Biology, IET 2009, 3(2):90-99.
6. Zou M, Conzen SD: A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics 2005, 21(1):71-79.
7. de Hoon M, Imoto S, Kobayashi K, Ogasawara N, Miyano S: Inferring gene regulatory networks from time-ordered gene expression data of Bacillus subtilis using differential equations. Pac Symp Biocomput 2003, 17-28.
8. Perkins TJ, Hallett M, Glass L: Inferring models of gene expression dynamics. Journal of Theoretical Biology 2004, 230(3):289-299.
9. Tiana G, Krishna S, Pigolotti S, Jensen MH, Sneppen K: Oscillations and temporal signalling in cells. Physical Biology 2007, 4(2):R1-R17.
10. Wang Y, Joshi T, Zhang XS, Xu D, Chen L: Inferring gene regulatory networks from multiple microarray datasets. Bioinformatics 2006, 22(19):2413-2420.
11. Schneidman E, Still S, Berry MJ II, Bialek W: Network information and connected correlations. Phys Rev Lett 2003, 91(23):238701-238704.
12. Meyer PE, Kontos K, Lafitte F, Bontempi G: Information-theoretic inference of large transcriptional regulatory networks.
13. Zhao W, Serpedin E, Dougherty ER: Inferring connectivity of genetic regulatory networks using information-theoretic criteria. IEEE/ACM Trans Comput Biol Bioinformatics 2008, 5(2):262-274.
14. Huber W, Carey V, Long L, Falcon S, Gentleman R: Graphs in molecular biology. BMC Bioinformatics 2007, 8(Suppl 6):S8.
15. Christensen C, Thakar J, Albert R: Systems-level insights into cellular regulation: inferring, analysing, and modelling intracellular networks. Systems Biology, IET 2007, 1(2):61-77.
16. Tang B, He L, Jing Q, Shen B: Model-based identification & adaptive control of the core module in a typical cell cycle pathway via network and system control theories. Advances in Complex Systems 2009, 12(1):21-43.
17. Butte AJ, Bao L, Reis BY, Watkins TW, Kohane IS: Comparing the similarity of time-series gene expression using signal processing metrics. Journal of Biomedical Informatics 2001, 34(6):396-405.
18. Dougherty ER, Shmulevich I, Bittner ML: Genomic signal processing: the salient issues.
19. Candy JV: Model-based signal processing. Hoboken, New Jersey: John Wiley & Sons, Inc; 2006.
20. Barenco M, Tomescu D, Brewer D, Callard R, Stark J, Hubank M: Ranked prediction of p53 targets using hidden variable dynamic modeling. Genome Biology 2006, 7(3):R25.
21. Chu LH, Chen BS: Comparisons of robustness and sensitivity between cancer and normal cells by microarray data.
22. Papoulis A: Probability, random variables, and stochastic processes. 2nd edition. New York: McGraw-Hill; 1984.
23. Cohen J: Statistical power analysis for the behavioral sciences. 2nd edition. Hillsdale, New Jersey: Lawrence Erlbaum Associates; 1988.
24. Simon MK: Probability distributions involving Gaussian random variables. New York: Springer; 2002.
25. Yao YY: Information-theoretic measures for knowledge discovery and data mining. In Entropy Measures, Maximum Entropy Principle and Emerging Applications. Edited by Karmeshu. Springer; 2003:115-136.
26. Forst CV, Schulten K: Phylogenetic analysis of metabolic pathways. Journal of Molecular Evolution 2001, 52(6):471-489.
27. Jaszkiewicz A: Genetic local search for multi-objective combinatorial optimization. European Journal of Operational Research 2002, 137(1):50-71.
28. Liefooghe A, Basseur M, Jourdan L, Talbi EG: Combinatorial optimization of stochastic multi-objective problems: an application to the flow-shop scheduling problem.
29. Köksalan M: Multiobjective combinatorial optimization: some approaches. Journal of Multi-Criteria Decision Analysis 2009, 15(3-4):69-78.