Department of Bioinformatics, Tongji University, Shanghai, 212003, China

Institute of Protein Research, Tongji University, Shanghai, 212003, China

Department of Computer Science, ETH, Zurich, 8092, Switzerland

CISE and Systems Biology Lab, University of Florida, Gainesville, FLA, 32611, USA

Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200031, China

Center for Systems Biology, Soochow University, Suzhou, 215006, China

Abstract

Background

The post-genome era has brought diverse categories of omics data. Inference and analysis of genetic regulatory networks play a prominent role in extracting inherent mechanisms, in discovering and interpreting the biological principles beneath complex phenomena, and eventually in promoting human well-being.

Results

A supervised combinatorial-optimization pattern based on information and signal-processing theories is introduced for the inference and analysis of genetic regulatory networks. An associativity measure is proposed to define regulatory strength/connectivity, and a phase-shift metric determines the regulatory directions among components of the reconstructed networks. The approach thus resolves the undirected-regulation problem that arises in most current linear/nonlinear relevance methods. To limit computational and topological redundancy, we constrain the group size of classified pair candidates within a multiobjective combinatorial optimization (MOCO) pattern.

Conclusions

We test the proposed approach on two real-world microarray datasets with different statistical characteristics. The results reveal the inherent design mechanisms of genetic networks by quantitative means, facilitating further theoretical analysis and experimental design for diverse research purposes. Qualitative comparisons with other methods and open issues needing further work are presented in the discussion section.

Background

Various cell phenotypes and functions within multicellular organisms relate directly to the genetic content decoded from DNA and RNA during transcription and translation. Inference of gene regulatory networks or maps for those intercellular processes plays a significant role in understanding the underlying regulatory mechanisms. Reconstructing such biological regulatory networks directly from gene expression profiles measured at different cell phases, in different cell types, and even across species has therefore become one of the foremost research topics in recent years.

Because multiple expression profiles can now be measured simultaneously, with gradually increasing accuracy and decreasing experimental cost, advances in high-throughput microarray and ChIP assay techniques facilitate the learning and inference of regulatory maps and even the functionality of these genetic networks. During the past decades, numerous inference and learning methods have been proposed to integrate raw data into computational frameworks for network models, such as (probabilistic) Boolean networks, (dynamic) Bayesian networks, and systems of differential/difference equations

Furthermore, most current biochemical network models are static descriptions of the inherent regulatory mechanisms, in the sense that once the system models and parameters of those genetic networks are set, the regulatory processes are determined. During genetic transcription and translation, however, real-world regulatory maps may undergo various perturbations from intercellular and intracellular signals and from undiscovered factors. From this perspective, a single modelling mode may not be sufficient to characterize all possible structures of these networks, or even the crucial ones for specific analysis purposes. These problems call for flexible mechanism designs that improve on the present rigid methods for network inference.

Within the following parts, we propose an integrative supervised learning method for the inference of time-delayed cell cycle regulatory mechanism based on information and signal processing theories. We firstly introduce definitions for those crucial concepts as correlation measure and mutual information; then we propose a novel associative quantity for the two kinds of dependency measures. With the proposed integrative metric and the

Moreover, from signal processing theory

The other advantage of the method includes its inherent capabilities of integrating existing biological knowledge as

Results

The supervised learning framework covers two main aspects: it should characterize pairwise regulatory strengths and constrain the subsequent computational redundancy. We apply the proposed method to two real-world datasets selected from the Stanford Microarray Database. The two datasets have different statistical characteristics and have been normalized and benchmarked in the recent literature

Analysis on the

The first

**The log2-normalized gene expression profile for 24 genes from the cell cycle regulatory network (Experiment condition: response to elutriation).** The horizontal axis represents the sample time (14 points from 0 to 6.5 hours, sampled every 30 minutes); the vertical axis lists the 24 genes from the cell cycle genetic network.

Based on the definitions and concepts illustrated in the methodology part, we calculated the mutual information, correlation and

Additional Figure 1-A.

Additional Figure 1-B.

As depicted in the lower sub-graph of the additional Figure 1-B in Additional file

Thus through dynamic thresholding of mutual information and correlation coefficient, we obtain the global distributions for three pair groups under dynamic metrics. The distributions for the classified pair groups are illustrated in Figure

**The global statistics for pairwise gene numbers under different mutual information values and correlation coefficients.** In total, there are 276 pair candidates for the network of 24 genes. The horizontal axis represents the mutual information thresholds, and the vertical axis represents the correlation coefficient thresholds. The corresponding three-dimensional graph is given in the additional Figure 2-A in Additional file

Additional Figure 2-A.

The supervised inference procedure starts from the respective centroids,

Also with the acquired knowledge,

Thus, we might calculate the global phase-shift statistics for the APGs group, based on the signal processing theory defined in the methodology section. Figure

**The global phase-shift statistics distribution for the APGs of the cell cycle regulatory network (totally 83 pairwise candidates in APGs).** The phase-shift statistics vary as functions of the gain thresholds. The blue bold curve represents the integral tendency of gene pairs with leading phase shifts (positive), the red for the pairs with lagging phase shifts (negative), and the green for those without detected phase shift (undirected),

Additional Figure 3-A.

For this case, the gain threshold is set at 0.3, see the additional Figure 3-A in Additional file

**The interweaved cell cycle regulatory network rebuilt based on the MICORPS framework.** Each gene/protein is denoted as a black-edged circle. The calculated associativity metric and phase-shift information between pairwise genes are marked as blue along each bilateral links, see the additional Figure 4-A in Additional file

Additional Figure 4-A.

As depicted, only gene #4 (YDL056W) is isolated from the network structure, meaning that YDL056W might belong to other regulatory processes under the current conditions. Besides, gene #2 (YER111C) has only a single regulatory link, similar to genes #9 (YLR079W) and #10 (YAL040C). While for such genes as #1 (YDR146C), #3 (YLR182W), #16 (YDR507C),

Since the above analysis concerns a dataset with normal statistical characteristics, the proposed method can be applied directly. In the following part, we discuss a microarray dataset with different statistical properties.

Analysis on the dataset from a p53 pathway with multiple feedback loops

The profile dataset of the p53 pathway with multiple feedback loops is selected from the recent work

Additional Figure 5-A.

Additional Figure 5-B.

**The triplicate MOTL4 microarray experiments are implemented under irradiation from 0 to 12 hours at intervals of 2 hours.** The expression profile is plotted with the mean values of the triplicate datasets. The horizontal axis denotes the time range from 0 to 12 hours, and the vertical axis lists the corresponding 16 gene/protein names.

However, this kind of dataset does not satisfy the above network-constructing algorithm since there are only 10 pair candidates with their

As in the former case, 40%~45% of the total pairs are needed as suitable candidates for constructing genetic networks; we therefore lift the threshold sufficiently and derive the necessary pair candidates for composing the APGs group via the proposed PGHC algorithm. For this case, we lift the

The global statistics for pairwise gene numbers under different mutual information values and correlation coefficients.

Additional Figure 6-A.

Thus, we might calculate the global phase-shift statistics for the APGs group, based on the signal processing concepts defined in the methodology section. The calculated global phase-shift details are given in Figure

**The calculated phase-shift statistics distribution (totally 55 pairwise candidates for the APGs group in the multi-feedback p53 pathway).** The blue bold curve represents the integral tendency of gene pairs with leading phase shifts (positive), the red for the pairs with lagging phase shifts (negative), and the green for those without detected phase shift (undirected),

In the following network-building procedure, we again choose the centroids of both metrics as the initial points for the iterative computation. The centroids of the mutual information and the correlation coefficient over all available pairs are 0.7992 and 0.5203, respectively.

The searching for optimal solutions stops when the mutual information threshold backtracks to 0.7 and the correlation coefficient takes 0.3 and the

Additional Figure 7-A.

Additional Figure 8-A.

Additional Figure 8-B.

**The constructed genetic graphs under different gain thresholds.** The structure is constructed with gain threshold at 0.3, and the additional Figure 8-A in Additional file

Discussion

The comparison with the currently-available inference methods

Currently, there exist several inference approaches for the biochemical networks,

Secondly, the proposed method tackles one of the most important problems from the perspective of signal processing theory, namely, the determination of regulatory directions between candidate gene pairs. The introduced metrics quantify the underlying regulatory strengths and directions between pair candidates globally and comparatively, and thus facilitate the follow-up network-rebuilding procedure.

Moreover, the proposed inference framework can present multiple optimal or suboptimal candidate regulatory maps in parallel, instead of a single computational solution per problem, since in most cases a single solution cannot convincingly explain as much of the inherent mechanism as expected. The proposed method can also utilize the diverse knowledge available, either from concrete biochemical experiments or from the current literature.

The current focuses of the proposed method and its future directions

Although the proposed inference framework is validated with real-world profile datasets, several directions still need further refinement, as described below.

In practice, most available profile datasets are high-dimensional, particularly those with few time points and many samples, together with unavoidable measurement noise,

The second concern relates to the biologically functional analysis of network modules and motifs by quantitative means. The proposed framework deciphers genetic regulatory activities in an information-rich mode. Thus, the inference results and the related information between pairwise candidates have the potential for applications such as the subsequent identification of biological modules and motifs of particular interest.

The third focus is the topological properties of inferred regulatory networks. Quantitative analysis and comparison of the diverse constructed topologies might reveal inherent coordination and organization mechanisms, with potential applications in, to name a few, identifying target genes and novel drug discovery, particularly in computational systems biology.

Conclusions

In this work, we propose a combinatorial theory-based learning pattern for the inference and analysis of genetic networks from microarray time-series datasets.

For the different kinds of microarray datasets gathered from multiple organisms and species, there is still no efficient solution applicable to most of the current problems facing biological theoreticians and experimentalists. Taking into account previously acquired knowledge, decision-makers’ preferences, and practical constraints, network inference can be transformed into a multi-objective combinatorial optimization (MOCO) problem.

Compared with currently available methods for inferring biochemical networks

To include specific nodes in, or exclude them from, reconstructed networks with sufficient confidence and previously acquired knowledge, several design approaches exist within the proposed framework. In this work, we decipher the underlying design mechanisms of pairwise connectivity via dynamic thresholding of linear/nonlinear relevance metrics,

With the inference procedure being transposed into a kind of MOCO problem, we might constrain the multiobjective iterative searching problems with reasonable terms from acquired knowledge, experimental conditions, and other computational considerations or decision-makers’ preferences.

We utilize the proposed method in analyzing two microarray datasets with different statistical characteristics. Thus by quantitative means, we reveal the inherent design mechanisms for genetic networks, facilitating the further theoretic analysis and experimental design with diverse biochemical aims.

For the sake of simplicity, we test the proposed approach on a few small-scale datasets; for large-scale datasets, say those with hundreds or thousands of genes/proteins, clustering and classification methods are beneficial and necessary as pre-processing steps.

Methods

Based on probability and signal processing theories, the following section introduces a dimensionless metric for regulatory strengths and a phase-shift metric for determining regulatory orientations. For network inference, we propose a combinatorial-optimization framework for constraining the inference complexities. The framework allows the possibility of incorporating acquired knowledge and specific aims for integrative mining and analysis.

Probability theory-based inference of biological network structures

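Correlation analysis aims to reveal the strength of a linear relationship between random variables (R.V.); the statistical correlation (coefficient) represents the departure of two R.V. from independence. Among the various metrics often used to measure correlation or association, the correlation coefficient $\rho_{X,Y}$ between two R.V. $X$ and $Y$ is defined as

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y}$$

where cov indicates covariance, $\sigma_X$ is the standard deviation of $X$, and $\sigma_X^2 = E[(X - E(X))^2] = E(X^2) - E^2(X)$.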

When interpreting the ^{2} test, Spearman’s
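As an illustration of the correlation coefficient defined above, the following minimal Python sketch computes it directly from the covariance and the standard deviations. The function name `pearson_correlation` and the two short example profiles are hypothetical and are not taken from the datasets analysed in this work; the sketch assumes that neither profile is constant.

```python
import numpy as np

def pearson_correlation(x, y):
    """Correlation coefficient cov(X, Y) / (sigma_X * sigma_Y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()                      # deviations from E(X)
    dy = y - y.mean()                      # deviations from E(Y)
    cov_xy = np.mean(dx * dy)              # covariance
    sigma_x = np.sqrt(np.mean(dx ** 2))    # sigma_X from E[(X - E(X))^2]
    sigma_y = np.sqrt(np.mean(dy ** 2))
    return cov_xy / (sigma_x * sigma_y)    # undefined for constant profiles

# Hypothetical expression profiles of two genes over six time points.
profile_a = [0.1, 0.4, 0.9, 1.3, 0.8, 0.2]
profile_b = [0.3, 0.9, 1.8, 2.7, 1.7, 0.5]
rho = pearson_correlation(profile_a, profile_b)
```

The value agrees with `np.corrcoef(profile_a, profile_b)[0, 1]`, which could be used instead in practice.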

Information-theoretic inference of biological network structures

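To quantify the mutual dependence of two R.V., mutual information is frequently adopted as an alternative in information-theoretic applications, in addition to the above metric. The mutual information of two discrete R.V. $X$ and $Y$ can be defined as

$$I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log \frac{p(x,y)}{p_1(x)\, p_2(y)}$$

where $p(x,y)$ is the joint probability distribution of $X$ and $Y$, and $p_1(x)$ and $p_2(y)$ are the marginal probability distributions of $X$ and $Y$, respectively.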
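The definition of mutual information for discrete R.V. can be sketched as follows. The histogram-based discretization, the number of bins, and the base-2 logarithm are assumptions made for illustration only; the text does not specify the estimator used for the microarray profiles.

```python
import numpy as np

def mutual_information(x, y, bins=4):
    """Histogram estimate of I(X;Y) in bits (assumed estimator)."""
    joint_counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint_counts / joint_counts.sum()    # joint distribution p(x, y)
    p_x = p_xy.sum(axis=1, keepdims=True)       # marginal of X, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)       # marginal of Y, shape (1, bins)
    nonzero = p_xy > 0                          # skip empty cells (0 * log 0 = 0)
    ratio = p_xy[nonzero] / (p_x @ p_y)[nonzero]
    return float(np.sum(p_xy[nonzero] * np.log2(ratio)))

# A signal compared with itself attains the entropy of its discretization.
signal = np.arange(8, dtype=float)
mi_self = mutual_information(signal, signal, bins=4)

# Two independent binary variables share no information.
mi_indep = mutual_information([0, 0, 1, 1], [0, 1, 0, 1], bins=2)
```

With only a handful of time points per profile, such histogram estimates are sensitive to the choice of `bins`, so the value should be read as a relative score rather than an absolute quantity.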

Associativity measure for describing regulatory connectivity

The above-described measures illustrate the correlation and dependence relationships of R.V. Normally, these R.V. characterize different entities within a system. The interconnections in the biological network can be weighted by the probability of association between the pairs being investigated

where the subscript $i$ indexes the gene pair under investigation, and the two coefficients with subscripts $i1$ and $i2$ represent the weights of both quantities.
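Because the defining formula of the associativity measure is not reproduced in this text, the following Python sketch only illustrates the general idea of weighting two dependency measures for one gene pair. The convex combination, the normalization constant `mi_max`, the use of the absolute correlation, and the default weights are all assumptions and should not be read as the measure actually used in this work.

```python
def associativity(mi, cc, w_mi=0.5, w_cc=0.5, mi_max=1.0):
    """Hypothetical weighted combination of MI and correlation for one pair.

    mi     : mutual information of the pair
    cc     : correlation coefficient of the pair
    w_mi   : weight of the mutual information term (assumed)
    w_cc   : weight of the correlation term (assumed)
    mi_max : value used to scale MI into [0, 1] (assumed)
    """
    if mi_max <= 0:
        raise ValueError("mi_max must be positive")
    mi_norm = min(mi / mi_max, 1.0)          # dimensionless MI in [0, 1]
    return w_mi * mi_norm + w_cc * abs(cc)   # dimensionless score in [0, 1]

# Example: a pair with MI = 0.8 and correlation -0.5.
score = associativity(0.8, -0.5)
```

Changing `w_mi` and `w_cc` shifts the emphasis between nonlinear and linear dependence, which is the role the weights play in the description above.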

Phase-shift metric for determining regulatory directions

Currently, most gene expression profiles are discrete time-series data. The data samples are diverse expression densities measured at multiple time points, and the data intervals represent the sampling periods. When

Each pairwise association might be modularized as a subsystem with the expression patterns serving as input and output signals.

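For each pair, the coherence, gain, and phase shift might be calculated by discrete Fourier transform (DFT) of the inputs and outputs. The coherence of signals $a$ and $b$ is

$$C_{ab} = \frac{|P_{ab}|^2}{P_{aa} P_{bb}}$$

where $P_{aa}$ and $P_{bb}$ are the power spectral densities of $a$ and $b$, and $P_{ab}$ is their cross-spectral density.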

The regular transfer functions will be of the complex-valued form, the arctangents of which are the corresponding transfer phases (TP). The absolute values denote the related transfer gains (TG), and both metrics are represented as,

Theoretically, the TP illustrates the phase shift between the investigated pairwise signals,

The advantages of such metrics lie in the flexible and quantitative characteristics of determining the regulatory delay via dynamic threshold. Factual regulatory mechanisms have multiple possibilities, and inherent regulatory delay effects might vary during the whole biological processes. The phase-shift metric determines such possibilities underlying regulatory mechanisms in a quantitative manner. The advantages include the inherent capabilities of integrating

Such dynamic threshold is applicable to the majority of problems facing theoretical and experimental biologists. Since regulatory connectivity underlying pairwise candidates may differ in diverse processes or at different sampling times, systematic and quantitative determination of these regulations with empirical and theoretical knowledge will be much more effective than those generated by most currently-available computational approaches
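The transfer gain and transfer phase described in this subsection can be sketched with a DFT-based estimate. The sketch below is a minimal illustration under several assumptions: the transfer function is estimated as the cross-spectrum divided by the input power spectrum, it is evaluated only at the dominant non-zero frequency of the input, a negative phase is read as the output lagging the input, and pairs whose gain falls below the threshold are labelled undirected. The function names and the synthetic signals are hypothetical.

```python
import numpy as np

def transfer_gain_phase(a, b):
    """Transfer gain (TG) and transfer phase (TP) from input a to output b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    spec_a = np.fft.rfft(a - a.mean())           # DFT of the input
    spec_b = np.fft.rfft(b - b.mean())           # DFT of the output
    p_aa = (spec_a * np.conj(spec_a)).real       # input power spectrum
    p_ab = np.conj(spec_a) * spec_b              # cross-spectrum
    k = int(np.argmax(p_aa[1:])) + 1             # dominant non-DC bin (assumed choice)
    h = p_ab[k] / p_aa[k]                        # complex transfer function value
    return float(abs(h)), float(np.angle(h))     # TG = |H|, TP = arg(H)

def classify_direction(gain, phase, gain_threshold=0.3, phase_tol=1e-6):
    """Label a pair as leading, lagging, or undirected (assumed rule)."""
    if gain < gain_threshold or abs(phase) <= phase_tol:
        return "undirected"
    return "leading" if phase > 0 else "lagging"

# Synthetic pair: the output is the input scaled by 2 and delayed by pi/4.
n = np.arange(16)
sig_in = np.sin(2 * np.pi * 2 * n / 16)
sig_out = 2 * np.sin(2 * np.pi * 2 * n / 16 - np.pi / 4)
gain, phase = transfer_gain_phase(sig_in, sig_out)
label = classify_direction(gain, phase)
```

Note that a coherence computed from a single undivided record is identically one, so a meaningful coherence estimate requires averaging over segments or replicates, which this sketch does not attempt.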

A MOCO pattern for constraining computational complexities

In the following sections, we extract inherent regulations and decipher network structures by introducing a pairwise gene hierarchy criterion (PGHC) for classifying possible gene pairs into three major groups as follows.

(1) Authentic Pairwise Genes (APGs): These include pairs with mutual information values and correlation coefficients larger than specific thresholds. Moreover, the corresponding

(2) Questionable Pairwise Genes (QPGs): These include pairs that do not satisfy both of the thresholds mentioned above. The group contains pairs of two classes. One class has pairs with mutual information larger than specific thresholds but satisfies neither the criteria of correlation coefficients nor

(3) Unauthentic Pairwise Genes (UPGs): These include those pair candidates that do not satisfy any criteria of the APGs or QPGs defined above.

The QPGs actually act as a subsidiary candidate pool for the APGs in case the empirical thresholds are set too high to extract structures merely from the APGs. Under such conditions, the QPGs will be ranked according to mutual information values, correlation coefficients, and

**Algorithm: Pairwise Gene Hierarchy Criterion**

**Input:**

all pairwise gene candidates GPs;

initial MI threshold MIth = MI's centroid;

initial CC threshold CCth = CC's centroid;

increments δ_{mi}, δ_{cc} for MI and CC.

**Output:**

classified APGs, UPGs and QPGs.

**while** count(GPs)>0 **do**

1. construct APGs, QPGs using initial MIth, CCth and

2. group the others into UPGs;

**if** (APGs undersized) && count(QPGs)>0 **then do**

MIth=MIth-δ_{mi} & CCth=CCth-δ_{cc};

continue Step 1 for QPGs & obtain Δ_{APGs} and Δ_{UPGs};

APGs=APGs+Δ_{APGs} & UPGs=UPGs-Δ_{UPGs}.

**elseif** (APGs oversized) **then do**

MIth=MIth-δ_{mi} & CCth=CCth+δ_{cc};

continue Step 1 for APGs & obtain Δ_{APGs} and Δ_{UPGs};

APGs=APGs-Δ_{APGs} & UPGs=UPGs+Δ_{APGs}.

**endif**

**end**
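The iterative grouping above can be sketched in Python as follows. This is a simplified illustration, not the implementation used in this work: it classifies on mutual information and absolute correlation only (the third criterion named in the group definitions is omitted), takes the arithmetic mean as the centroid, lowers both thresholds when the APGs are undersized and raises both when they are oversized (which differs from the sign printed for MIth in the oversized branch of the pseudocode), and stops after a fixed number of iterations. The target size range, step sizes, and example scores are hypothetical.

```python
import numpy as np

def classify_pairs(scores, mi_th, cc_th):
    """Split pairs into APGs (both criteria met), QPGs (one), UPGs (none)."""
    apgs, qpgs, upgs = [], [], []
    for pair, (mi, cc) in scores.items():
        passed = int(mi >= mi_th) + int(abs(cc) >= cc_th)
        if passed == 2:
            apgs.append(pair)
        elif passed == 1:
            qpgs.append(pair)
        else:
            upgs.append(pair)
    return apgs, qpgs, upgs

def pghc(scores, min_frac=0.40, max_frac=0.45,
         step_mi=0.05, step_cc=0.05, max_iter=100):
    """Adjust thresholds until the APG share lies in [min_frac, max_frac]."""
    mi_th = float(np.mean([mi for mi, _ in scores.values()]))       # MI centroid
    cc_th = float(np.mean([abs(cc) for _, cc in scores.values()]))  # CC centroid
    n_pairs = len(scores)
    for _ in range(max_iter):
        apgs, qpgs, upgs = classify_pairs(scores, mi_th, cc_th)
        if len(apgs) < min_frac * n_pairs and qpgs:
            mi_th -= step_mi          # undersized: relax both thresholds
            cc_th -= step_cc
        elif len(apgs) > max_frac * n_pairs:
            mi_th += step_mi          # oversized: tighten both thresholds
            cc_th += step_cc
        else:
            break
    return apgs, qpgs, upgs, mi_th, cc_th

# Hypothetical (MI, CC) scores for ten candidate pairs.
example_scores = {
    "pair0": (0.95, 0.90), "pair1": (0.90, 0.85), "pair2": (0.85, 0.80),
    "pair3": (0.80, 0.75), "pair4": (0.60, 0.30), "pair5": (0.50, 0.25),
    "pair6": (0.30, 0.50), "pair7": (0.25, 0.20), "pair8": (0.20, 0.15),
    "pair9": (0.10, 0.10),
}
apgs, qpgs, upgs, mi_final, cc_final = pghc(example_scores)
```

The iteration cap guards against the case where no integer group size falls inside the requested range, in which the thresholds would otherwise oscillate indefinitely.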

Thus, network reconstruction might be transformed into a class of MOCO problems

where the first constraint set (subscript 1) is the set of feasible group combinations for APGs, QPGs, and UPGs; the second (subscript 2) is the number set of all gene pairs; the third (subscript 3) is the set of necessary gain constraints (GC); and the fourth (subscript 4) is the set of possible constraints from acquired biological knowledge (ABK).

Recently, quite a few authors have argued for the necessity of incorporating the preferences of decision-makers (DM) into MOCO solution selection; here such preferences correspond to the GC (constraint set 3) and the ABK (constraint set 4) illustrated above.

In cases governed by lower thresholds of mutual information and correlation metrics, APGs will form the group with the maximum components within the total pair candidates. On the other hand, with the heightened thresholds, many more pairs might be grouped into UPGs. This reduces the computational complexity for network reconstruction since APGs have fewer components in such situations. If APGs are classified with above-normal sizes, the reconstructed network will be densely connected and will have much more redundancies. On the contrary, a sparsely connected structure will be inferred with an undersized candidate group of APGs.
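The relation between the size of the APGs group and the density of the reconstructed network can be made concrete with a short calculation. Defining density as the fraction of all possible gene pairs that are selected is an illustrative choice rather than a definition given in this work; the gene and pair counts are those reported in the figure captions of the Results section.

```python
from math import comb

def network_density(n_genes, n_selected_pairs):
    """Fraction of all possible gene pairs that are kept as links."""
    return n_selected_pairs / comb(n_genes, 2)

# Cell cycle network: 24 genes give 276 pair candidates, 83 of them in APGs.
total_cell_cycle = comb(24, 2)
density_cell_cycle = network_density(24, 83)

# p53 pathway: 16 genes/proteins, 55 pair candidates in APGs.
total_p53 = comb(16, 2)
density_p53 = network_density(16, 55)
```

Raising the thresholds lowers `n_selected_pairs` and hence the density, which corresponds to the sparser structures described above.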

Since biological theoreticians and experimentalists may vary specific mutual information and correlation thresholds to incorporate empirical or concrete knowledge into the reconstruction procedures, the underlying coordination approaches via the MOCO framework might be feasible and significant, especially for those containing pivotal structural connectivity or for specific analysis purposes.

The APGs, QPGs, and UPGs engender the underlying evolutionary mechanisms with respect to dynamic threshold by the above metrics and related biochemical knowledge, as shown in Figure

**Schematic representation of the MOCO problem by dynamic thresholding of mutual information and correlation metrics.** Total pairs are classified into APGs, QPGs and UPGs. The upper rightward horizontal arrow represents dynamic thresholding by mutual information, and the left descending arrow is for thresholding of the correlation measure.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

BHT proposed the methods, performed the analysis and composed the work; XCW and GT gave advice and proof-checked the work; SSC commented on the methods and the writing; QJ and BRS led the project and coordinated the research progress.

Acknowledgements

This research work has been supported in part by the National 973 Program of China (No. 2007CB947002) and the Postgraduate Innovation Fund of Tongji University.

This article has been published as part of