Department of Electrical and Computer Engineering, Texas Tech University, Lubbock, 79409, USA

Abstract

Background

Numerous approaches exist for modeling genetic regulatory networks (GRNs), but the low sampling rates often employed in biological studies prevent the inference of detailed models from experimental data. In this paper, we analyze the issues involved in estimating a model of a GRN from single cell line time series data with limited time points.

Results

We present an inference approach for a Boolean Network (BN) model of a GRN from limited transcriptomic or proteomic time series data based on prior biological knowledge of connectivity, constraints on attractor structure and robust design. We applied our inference approach to 6-time-point transcriptomic data from a Human Mammary Epithelial Cell (HMEC) line after application of Epidermal Growth Factor (EGF) and generated a BN with a plausible biological structure that satisfies the data. We further defined a similarity measure and applied it to compare synthetic BNs with BNs that the proposed approach constructed from transitions along various paths of the synthetic BNs. We also compared the performance of our algorithm with two existing BN inference algorithms.

Conclusions

Through theoretical analysis and simulations, we showed that a BN with plausible biological structure rarely arises from limited time series data when the connectivity is random and the data lack structure. The framework, when applied to experimental data and to data generated from synthetic BNs, was able to estimate BNs with high similarity scores. Comparison with existing BN inference algorithms showed that our proposed algorithm performs better for limited time series data. The proposed framework can also be applied to optimize the connectivity of a GRN from experimental data when the prior biological knowledge on regulators is limited or not unique.

Introduction

Technological advances in the last two decades have provided numerous approaches to measure various aspects of the regulome in a cell. However, the data generated for specific conditions are still limited both in the number of time points and in the number of samples. Models of genetic regulatory networks (GRNs) are regularly inferred from limited time series data on average tissue expression as measured by technologies such as microarrays. Selecting a mathematical model to represent a GRN and inferring it from limited noisy time series data remains an important problem in systems biology.

The foremost aspect of inferring a mathematical model to explain a regulatory process is the selection of the model. A comprehensive model can provide an accurate picture of the regulation, assuming that the parameters of such a model can be correctly inferred. However, we are often faced with limitations on the experimental data that motivate us to design simpler models able to capture the coarse-scale dynamics of the GRN. In this paper, we consider cases where there is only one set of time series transcriptomic or proteomic data generated from a cell line after a specific perturbation. Here, we consider cell-population-averaged data as measured by techniques such as microarrays, and thus we start with a deterministic model explaining the average behavior of the system. For a deterministic model, common choices are Differential Equation (DE) or Boolean Network (BN) models. Inference of the parameters of a DE model from minimal data can produce unreliable models, as was observed when we tried to infer commonly used linear and non-linear DE models.

Our goal in this paper is to provide a BN inference approach from limited time series data and prior biological knowledge on connectivity. The proposed framework can also be applied to optimize the connectivity of a GRN from experimental data when the prior biological knowledge on regulators is limited or not unique. Our analysis will reveal that a BN with short attractor cycles that satisfies the observed transitions under constraints on connectivity is extremely unlikely to arise if the regulators of a gene are selected randomly and the data itself lacks structure. We applied our inference approach to time series transcriptomic data of 6 genes and 6 time points from an HMEC cell line following application of epidermal growth factor (EGF) and were able to generate a BN with a biologically plausible singleton attractor structure that satisfies the experimentally observed transitions. The theoretical analysis shows that the probability of generating such a network from 6 random state transitions and random selection of 3 regulators for every gene is extremely low, which in turn suggests that there is structure in the biological data that is exploited by our inference algorithm to arrive at a biologically plausible BN. We next set up an experimental design to compare synthetic BNs with BNs generated through our framework based on state transitions from the synthetic BNs. The results illustrate the capability of the proposed inference technique to generate BNs that are similar to the original BNs by using few state transitions when the connectivity is known.

The paper is organized as follows. The 'methods' section contains (a) a review of BNs and the biologically motivated assumptions and constraints that will be imposed during inference, (b) a theoretical analysis of the search space for the inverse problem and (c) the inference algorithm. The 'results' section contains the results of applying the framework to experimental HMEC data and synthetic BNs; results of the comparison with 2 other approaches are also discussed in this section. Further analysis of the results is included in the 'conclusions' section.

Methods

GRN model and modeling assumptions

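A Boolean network (BN) B = (V, F) on n genes is defined by a set of nodes V = {x_1, ..., x_n}, x_i ∈ {0, 1}, and a list of Boolean functions F = (f_1, ..., f_n), f_i : {0, 1}^n → {0, 1}. Node x_i represents the binarized expression state of gene i, and f_i is the predictor function that determines the next value of x_i from the current state of the network.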
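As a concrete illustration of how a BN evolves, the following minimal sketch applies a synchronous update to a small network. The 3-gene network, its regulator sets and its truth tables are hypothetical and chosen only for illustration; synchronous updating is the standard convention for deterministic BNs and is assumed here.

```python
from typing import Dict, Sequence, Tuple

State = Tuple[int, ...]

def step(state: State,
         regulators: Sequence[Sequence[int]],
         tables: Sequence[Dict[Tuple[int, ...], int]]) -> State:
    """Synchronously update every gene from the current network state."""
    return tuple(
        tables[i][tuple(state[j] for j in regulators[i])]
        for i in range(len(state))
    )

# Hypothetical network (0-based gene indices):
#   gene 0 <- gene 1 AND gene 2, gene 1 <- NOT gene 0, gene 2 <- gene 1
regulators = [(1, 2), (0,), (1,)]
tables = [
    {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1},
    {(0,): 1, (1,): 0},
    {(0,): 0, (1,): 1},
]

# Follow the trajectory that starts at state (1, 0, 1) for four steps.
trajectory = [(1, 0, 1)]
for _ in range(4):
    trajectory.append(step(trajectory[-1], regulators, tables))
print(trajectory)
```

Storing each predictor function as a truth table keyed by the regulator pattern mirrors the truth-table view of a BN used in the search space analysis.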

The biologically motivated assumptions and constraints that we will impose are:

(i) Biological networks usually have sparsity in their connectivity structure. Thus we will restrict our connectivity to

(ii) Biological networks are usually robust to perturbations and can produce a reproducible trait under changing conditions. The robustness of an inferred model will be measured in terms of coherency of the BN _{s }_{b}_{n}_{n }_{b }_{n }

(iii) GRNs usually have small attractor cycles and thus any oscillation observed in our data should be reflected in the Boolean model as a limited state attractor cycle.

(iv) Among two feasible functions, the one with lower inconsistency will be selected. Here, inconsistency refers to same state of the predictor state producing different target output. Let us consider that we have _{i }_{i+1 }for _{i }_{1}(_{2}(_{n}^{k }
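Constraint (iv) can be made concrete with a short sketch. It counts, for one target gene and one candidate regulator set, the predictor patterns that are followed by both a 0 and a 1 in the time series. The function name `inconsistency`, the example series and the choice to count conflicting patterns (rather than conflicting samples) are assumptions made for illustration, not necessarily the exact measure used in this paper.

```python
from collections import defaultdict
from typing import Sequence, Tuple

def inconsistency(series: Sequence[Sequence[int]],
                  target: int,
                  regulators: Tuple[int, ...]) -> int:
    """Count regulator patterns that precede both a 0 and a 1 of the target."""
    outputs = defaultdict(set)
    for current, nxt in zip(series, series[1:]):
        pattern = tuple(current[j] for j in regulators)
        outputs[pattern].add(nxt[target])
    return sum(1 for seen in outputs.values() if len(seen) > 1)

# Hypothetical binarized time series of 3 genes over 4 time points.
series = [(0, 1, 0), (1, 1, 0), (0, 1, 1), (1, 0, 1)]

# Gene 1 alone cannot explain gene 0: the pattern (1,) precedes both 1 and 0.
print(inconsistency(series, target=0, regulators=(1,)))
# Gene 0 as its own regulator is consistent with every observed transition.
print(inconsistency(series, target=0, regulators=(0,)))
```

Under this reading, the second regulator set would be preferred for gene 0 because it yields the lower inconsistency.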

Search space analysis

In this section, we will analyze the size of the search space for the inverse problem of inferring a Boolean model of a GRN from time series data based on connectivity and structural constraints.

Let us consider the case of experimental data of ^{n }^{n }_{1,1}) of 1st cell in table _{1 }is _{1,1 }at ^{3 }× 3 possible places in the truth table that can be filled with either 1 or 0, the total number of distinct truth tables is

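Illustration of the number of possible BNs with no constraints on connectivity

| Row | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| 1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 |
| 2 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 |
| 3 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 |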

Here

When we restrict the connectivity to _{n }_{n }_{n}^{k }^{2 }cells to fill with 0 or 1 to find a BN. So there are

Illustration of the number of possible BNs with constraints on connectivity

| Row | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|
| 1 | 0/1 | 0/1 | 0/1 | 0/1 | ← n = 1 |
| 2 | 0/1 | 0/1 | 0/1 | 0/1 | ← n = 2 |
| 3 | 0/1 | 0/1 | 0/1 | 0/1 | ← n = 3 |

Here _{n }
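The two counting arguments illustrated by the tables above can be checked numerically. The sketch below assumes that the regulator set of each gene is fixed in advance, so that only the truth-table cells vary; the function name `num_boolean_networks` is introduced here for illustration.

```python
from typing import Optional

def num_boolean_networks(n: int, k: Optional[int] = None) -> int:
    """Number of distinct truth tables for n genes.

    With k=None every gene may depend on all n genes (2**n cells per gene);
    otherwise every gene has exactly k fixed regulators (2**k cells per gene).
    """
    inputs = n if k is None else k
    cells = n * 2 ** inputs      # each cell is filled with 0 or 1
    return 2 ** cells

print(num_boolean_networks(3))        # 3 genes, no connectivity constraint
print(num_boolean_networks(3, k=2))   # 3 genes, 2 fixed regulators per gene
```

For 3 genes, the unconstrained case has 3 × 8 = 24 cells and the constrained case with 2 regulators per gene has 3 × 4 = 12 cells, which matches the two illustrations.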

Without restriction on connectivity, knowledge of ^{n(N-L)}. Next, we will consider the case when our connectivity is restricted to ^{k}^{k }^{k}_{1 }denote the event that no new entry of a row was filled in the 2nd transition and _{2 }denote the event that a new entry of a row was filled in the 2nd transition. Then,

Similarly it can be shown that the probability of hitting a unique place at the 4th transition is

In general, we can say that the probability of hitting a unique place at the ^{k }^{L-1 }ways with no constraint on the number of balls in each of (^{L}

From ^{k }

We next consider the expected number of distinct transitions required to fill up (_{ex }^{k }
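To give a feel for how few truth-table entries a short time series can fill, the sketch below estimates the expected number of distinct entries filled for one gene with k regulators. It assumes that every observed transition presents the regulators with an independent, uniformly random pattern; this simplification is made here for illustration, and the analysis in the text may count distinct transitions differently. The function names are hypothetical.

```python
import random

def expected_filled(k: int, transitions: int) -> float:
    """Closed-form expectation under independent uniform regulator patterns."""
    rows = 2 ** k
    return rows * (1.0 - (1.0 - 1.0 / rows) ** transitions)

def simulate_filled(k: int, transitions: int,
                    trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    rows = 2 ** k
    total = 0
    for _ in range(trials):
        # The set keeps only the distinct truth-table rows that were hit.
        total += len({rng.randrange(rows) for _ in range(transitions)})
    return total / trials

# 3 regulators per gene (8 truth-table rows) and 6 observed transitions.
print(expected_filled(3, 6))
print(simulate_filled(3, 6))
```

Under these assumptions, fewer than 5 of the 8 rows are filled on average, so a substantial part of each truth table has to be completed using the structural constraints rather than the data.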

Another characteristic of a BN that is desirable from a biological perspective is lack of large length attractor cycles ^{N-1}/^{N }^{6 }= 64 states, there is a 1 - 1/64 = 0.98 probability that the attractor structure of the BN will not consist of a single attractor. Thus, if our inference approach can produce a BN of 2^{6 }= 64 states with only a singleton attractor, there is a high probability that it is not due to a random event but it might reflect on the use of prior biological connectivity and structure present in the experimental data.
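Whether a given BN has only a singleton attractor can be checked by exhaustive enumeration when the number of genes is small. The following sketch finds all attractors of a deterministic BN given its state transition map; the two 3-gene update rules are hypothetical examples, and the helper names are introduced here for illustration.

```python
from itertools import product
from typing import Callable, Dict, Set, Tuple

State = Tuple[int, ...]

def transition_map(n: int, update: Callable[[State], State]) -> Dict[State, State]:
    """Tabulate the successor of each of the 2**n states."""
    return {s: update(s) for s in product((0, 1), repeat=n)}

def find_attractors(next_state: Dict[State, State]) -> Set[Tuple[State, ...]]:
    """Return every attractor cycle, rotated to start at its smallest state."""
    attractors = set()
    for start in next_state:
        order: Dict[State, int] = {}
        state = start
        while state not in order:          # walk until a state repeats
            order[state] = len(order)
            state = next_state[state]
        first = order[state]               # index where the cycle begins
        cycle = [s for s, idx in order.items() if idx >= first]
        pivot = cycle.index(min(cycle))
        attractors.add(tuple(cycle[pivot:] + cycle[:pivot]))
    return attractors

# Hypothetical rule 1: expression decays along a chain, so all states reach 000.
decay = transition_map(3, lambda s: (0, s[0], s[1]))
# Hypothetical rule 2: the state is rotated, which creates several cycles.
rotate = transition_map(3, lambda s: (s[2], s[0], s[1]))

print(find_attractors(decay))
print(len(find_attractors(rotate)))
```

The first rule yields a single singleton attractor, while the second yields two singleton attractors and two cycles of length 3, which illustrates how the attractor structure depends on the predictor functions.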

Inference algorithm

Our proposed BN inference algorithm is as follows:

**Algorithm 1** Algorithm for Calculating the Score of a BN

_{1 }→ _{2 }... → _{L+1}

**for ****do**

Calculate the transitions in the generated BN from state _{i }_{l }_{+1}, ... , _{L+1}] for _{i}

Score = Score + _{i}

**end for**
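One possible reading of the scoring loop above is sketched below: starting from each observed state, the candidate BN is simulated forward and the score is incremented for every simulated state that agrees with the observed state at the same time point. This interpretation, the function name `trajectory_score` and the example transition maps are assumptions made for illustration and may differ from the exact scoring used in Algorithm 1.

```python
from typing import Dict, Hashable, Sequence

def trajectory_score(next_state: Dict[Hashable, Hashable],
                     observed: Sequence[Hashable]) -> int:
    """Sum, over all observed start states, the matches between the simulated
    trajectory and the remaining observed states (assumed scoring rule)."""
    score = 0
    for i, start in enumerate(observed[:-1]):
        state = start
        for expected in observed[i + 1:]:
            state = next_state[state]      # one synchronous BN step
            if state == expected:
                score += 1
    return score

# Hypothetical observed path 0 -> 1 -> 2 -> 3 over states encoded as integers.
observed = [0, 1, 2, 3]
bn_exact = {0: 1, 1: 2, 2: 3, 3: 3}   # reproduces every observed transition
bn_other = {0: 1, 1: 3, 2: 3, 3: 3}   # skips state 2 on the way to state 3

print(trajectory_score(bn_exact, observed))
print(trajectory_score(bn_other, observed))
```

A candidate that reproduces the whole observed path obtains the maximum score, and deviations early in the path are penalized more because they affect several start states.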

Results

We used transcriptomic and proteomic time series data generated by Rogers

**Pathway of the 6 genes generated from literature search**.

For the 6 genes

**State transition diagram of the inferred BN**.

**State transition diagram of the BN inferred using random connectivity**.

Validation with synthetic network models

In the previous section, we showed the result of our inference approach when applied to experimental data. Since the true structure of the Boolean Network for the experimental system is not known, we generated a synthetic BN (BN_1), used a path of this BN as our synthetic data (it plays the role of the experimental data in our inference algorithm), and applied steps 1 to 3 of the inference algorithm to create a new BN (BN_2) whose similarity to BN_1 can be assessed. For step 1, we used the regulator set of BN_1 (which is known) as the regulator set for the synthetic data. We defined a new similarity measure to compare two BNs, shown in Algorithm 2. For the comparison, we locate all individual paths in BN_1 that start with a distinct state and end in an attractor. The similarity ratio R, a ratio of similarity scores, equals 1 when BN_2 perfectly matches BN_1 and is less than 1 for a mismatch.

**Algorithm 2** Algorithm for Calculating Similarity Measure of Two Different BNs

_{1}

**for **i = 1:NumPath **do**

_{1 }→ _{2}..→ _{L+1}

**for ****do**

Calculate the transitions in the generated _{2 }from state _{j }_{2 }will look like _{l }_{j+1}, ..., _{L+1}] for _{j}

Score(i) = Score(i)+ _{j}

**end for**

**if ****then**

**else**

**end if**

**end for**

For example, if we take Figure _{1 }and one of its path that starts from the bottom most level as synthetic data, then we get _{2 }with _{2 }which is an exact match of _{1}. If we reduce the number of transitions, then the similarity ratio _{2 }that has a similarity ratio

**BN_2: an exact match of BN_1 in Fig 2**.

**BN_2 has R = 0.4214 for synthetic data = 6 → 56 → 57 taken from a path in BN_1 in Fig 2**.

If we analyze the structure of Figure _{1 }as synthetic data and combine them to find the truth table of the Boolean functions. The modifications of our inference algorithm for use of

_{1 }are selected and set as synthetic data (_{1},_{2}..._{η}_{1 }will be the path which has greater transition length. If _{1 }≤ _{1}. If all of them have doubleton/singleton attractors, then _{1 }is chosen randomly among those _{1 }paths. The other paths are set as _{2}, _{3}...._{η }_{1},_{2}..._{η }_{1}.

_{1},_{2}..._{η }_{1},_{2 }..._{η }_{1},_{2}..._{η }_{1 }is selected. The remaining unfilled entries are filled with the steady state value in _{1}.

BN_2 is generated based on the truth table and the regulator set. Then BN_2 and BN_1 are compared according to Algorithm 2 and the similarity score is measured.

For example, if we use Figure _{1 }and use one single path of _{1 }as synthetic data, we get _{2 }(Figure _{max}_{max }_{1},_{2}..._{n }_{2 }(Figure _{max }_{2 }in Figure _{1 }and has a high similarity ratio. As we would expect, combining 3 paths resulted in Figure _{max }

**BN_1 with multiple attractors including a doubleton attractor**.

**BN_2 where single path is used**.

**BN_2 where 2 paths are used**.

**BN_2 where 3 paths are used**.

Since _{max }_{1}).

Similarity ratios for BNs inferred from data generated from Fig 6

| | R_max | R_mean-4 | R_mean-5 |
|---|---|---|---|
| 1 path | 0.1588 | 0.0904 | 0.1010 |
| 2 paths | 0.5731 | 0.2486 | 0.3040 |
| 3 paths | 1.0 | 0.4243 | 0.5390 |

R_max denotes the maximum achieved similarity ratio. R_mean-n denotes the expected value of the similarity ratio

We also considered numerous other BNs with at least 4 attractors as the _{1 }to generate the synthetic data. The details of the experiment is available in the website _{2/1path}), 2 paths (_{2/2path}) and 3 paths (_{2/3path}) corresponding to each _{1}.

For the results reported in Figures _{1}). The gradual increase of values of _{max }_{mean }_{1}) in Figure _{max }_{mean-n }

Similarity ratios for BNs inferred from data generated from Fig 6 with random predictor set

| | R_max | R_mean-4 | R_mean-5 |
|---|---|---|---|
| 1 path | 0.1132 | 0.0635 | 0.0678 |
| 2 paths | 0.1643 | 0.0662 | 0.0681 |
| 3 paths | 0.2570 | 0.1088 | 0.1129 |

Comparison with existing BN inference approaches

We compared the performance of our proposed algorithm with (a) Liang

Comparison with REVEAL

REVEAL is a well-known reverse engineering algorithm for inference of genetic regulatory architectures proposed by Liang

We've implemented REVEAL algorithm in MATLAB. For convenience of comparison between our approach and REVEAL, we've used synthetic BN where the original regulators and functions are known. We've used our algorithm to infer the BN with maximum similarity ratio (_{max}_{reveal}_{1 }in Figure _{2 }in Figure _{max }_{2 }in Figure _{reveal }_{max }

**BN_2 inferred using REVEAL**.

Comparison with MDL approach

Zhao

Since we are trying to find a deterministic Boolean network, we binarized the gene expression based on a conditional probability threshold of 0.5. For example, assume that there are 3 genes (x_1, x_2 and x_3) in a network and that the regulators of x_1 are x_2 and x_3. If the conditional probability P(x_1 = 1 | x_2 x_3 = 00) passes the 0.5 threshold, then x_1 = 1 if x_2 x_3 = 00. Similarly, the value of x_1 for the other combinations of x_2 x_3 is found using the conditional probability table derived by the MDL approach. For the parameter Γ in equation 5 of
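The binarization step described above can be sketched as follows. The conditional probability values are hypothetical, the function name `binarize_cpt` is introduced for illustration, and treating a probability of exactly 0.5 as 1 is an assumption, since the handling of ties is not specified here.

```python
from typing import Dict, Tuple

def binarize_cpt(cpt: Dict[Tuple[int, ...], float],
                 threshold: float = 0.5) -> Dict[Tuple[int, ...], int]:
    """Turn P(target = 1 | regulator pattern) into a Boolean truth table.

    Assumption: a probability equal to the threshold is mapped to 1.
    """
    return {pattern: int(p >= threshold) for pattern, p in cpt.items()}

# Hypothetical conditional probabilities of the target gene being 1
# for each joint state of its two regulators.
cpt = {(0, 0): 0.80, (0, 1): 0.30, (1, 0): 0.55, (1, 1): 0.10}

print(binarize_cpt(cpt))
```

The resulting dictionary is a deterministic predictor function for the target gene and can be compared directly with the truth tables of a synthetic BN.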

Similar to the comparison technique with REVEAL, we've used the same path from _{1 }in Figure _{2 }with maximum similarity ratio (_{max}_{2 }in Figure _{mdl}_{max }

**BN_2 inferred using MDL approach**.

Beyond its performance with respect to the similarity ratio, our approach performs better than both the REVEAL and MDL approaches in elucidating the attractors. Our results also support the claim in Zhao

Conclusions

In systems biology, we are often faced with the issue of reverse engineering a GRN model from limited time series data. This article proposes an inference approach that utilizes prior biological knowledge of connectivity to generate a BN with a biologically plausible state transition structure that explains the observed transitions in the data. The proposed framework can also be applied to optimize the connectivity of a GRN from experimental data when the prior biological knowledge on regulators is limited or not unique. We validated our algorithm on experimental data from an HMEC cell line and on data generated from synthetic BNs with known state transition structure. Through theoretical analysis and simulations, we were able to illustrate that inferring a BN from limited time series data, with constraints on connectivity, that explains the observed state transitions is extremely unlikely if we consider random connectivity. The high performance of our proposed algorithm compared with existing BN inference algorithms that depend on inference of connectivity from the data further supports the advantage of using prior biological knowledge on connectivity. Thus, for cases of limited experimental data, the prior biological knowledge of connectivity should be utilized to arrive at robust BNs with biologically plausible state transition structures. For future research, we will consider combining transcriptomic and proteomic data to reduce the inconsistencies in the data. One of the significant challenges in combined analysis will be the different degradation times for mRNA and proteins.

List of abbreviations used

BN: Boolean Network; DE: Differential Equation; GRN: Genetic Regulatory Network; EGF: Epidermal growth factor; HMEC: Human Mammary Epithelial Cell; MDL: Minimum Description Length.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

Conceived and Designed the Experiments: SH RP. Performed the Experiments: SH. Analyzed the Results: SH RP. Wrote the article: SH RP. All authors read and approved the final manuscript.

Acknowledgements

Based on “Inference of a genetic regulatory network model from limited time series data”, by Saad Haider and Ranadip Pal which appeared in

This work was supported by NSF grant CCF 0953366.

This article has been published as part of