Biomolecular Science and Engineering Program, UC Santa Barbara, Santa Barbara, CA, USA

Department of Computer Science, UC Santa Barbara, Santa Barbara, CA, USA

Abstract

Background

The combination of genotypic and genome-wide expression data arising from segregating populations offers an unprecedented opportunity to model and dissect complex phenotypes. The immense potential offered by these data derives from the fact that genotypic variation is the sole source of perturbation and can therefore be used to reconcile changes in gene expression programs with the parental genotypes. To date, several methodologies have been developed for modeling eQTL data. These methods generally leverage genotypic data to resolve causal relationships among gene pairs implicated as associates in the expression data. In particular, leading studies have augmented Bayesian networks with genotypic data, providing a powerful framework for learning and modeling causal relationships. While these initial efforts have provided promising results, one major drawback associated with these methods is that they are generally limited to resolving causal orderings for transcripts most proximal to the genomic loci. In this manuscript, we present a probabilistic method capable of learning the causal relationships between transcripts at all levels in the network. We use the information provided by our method as a prior for Bayesian network structure learning, resulting in enhanced performance for gene network reconstruction.

Results

Using established protocols to synthesize eQTL networks and corresponding data, we show that our method achieves improved performance over existing leading methods. For the goal of gene network reconstruction, our method achieves improvements in recall ranging from 20% to 90% across a broad range of precision levels and for datasets of varying sample sizes. Additionally, we show that the learned networks can be utilized for expression quantitative trait loci mapping, resulting in upwards of 10-fold increases in recall over traditional univariate mapping.

Conclusions

Using the information from our method as a prior for Bayesian network structure learning yields large improvements in accuracy for the tasks of gene network reconstruction and expression quantitative trait loci mapping. In particular, our method is effective for establishing causal relationships between transcripts located both proximally and distally from genomic loci.

Background

In order to model and dissect the complexity underlying physiological processes, including diseases, developmental programs, and responses to pharmacological treatments, systematic approaches based on genome-wide data are imperative. Expression profiling technologies, such as microarray

Already several studies have provided methodologies aimed at exploiting the genotypic component of eQTL data to improve causal modeling in gene networks

A distinct but related problem to gene network reconstruction (GNR) is expression quantitative trait loci (eQTL) mapping. While these two tasks have to date been addressed independently, they are likely to become more intertwined as eQTL-related computational methodologies advance. This corollary follows from the fact that accurately-modeled networks should inform on transcript-locus associations by virtue of the implied causal pathways. Traditional univariate methods, which involve an exhaustive search between all transcripts and loci, typically entail the use of linear regression, ANOVA, or the t-test

Our goal is to reconstruct causal networks with high fidelity at all levels of the network. Consequently, by improving the accuracy of the reconstructed network, we show that our method can provide biological hypotheses as well as enable greater accuracy in eQTL mapping.

Results

We assess the performance of our method for the tasks of gene network reconstruction (GNR) and expression quantitative trait loci (eQTL) mapping. For GNR, we compare our methodology to standard unaugmented Bayesian network structure learning, herein referred to as "Basic," and the leading LCMS methodology of Schadt and colleagues

Network Parameters

| | **Stronger Cor. Structure** | **Weaker Cor. Structure** |
|---|---|---|
| Mean CC | 0.68 | 0.55 |
| 90% Upper Interval | 0.88 | 0.76 |
| 90% Lower Interval | 0.45 | 0.31 |
| Mean coefficient | 0.75 | 0.6 |
| S.D. coefficient | 0.2 | 0.2 |
| Mean interaction term | 0.5 | 0.5 |
| S.D. interaction term | 0.1 | 0.1 |

Precision and recall of network edges

We first assess the performance of gene network reconstruction. Figures

Causal Edge Precision-Recall; 100 Samples

**Causal Edge Precision-Recall; 100 Samples**. Precision and recall of directed edges for 100 experiments. At the precision level of 0.8, the SCT method yields a recall of 0.744, versus 0.391 and 0.393 for the LCMS and Basic methods, respectively.

Causal Edge Precision-Recall; 200 Samples

**Causal Edge Precision-Recall; 200 Samples**. Precision and recall of directed edges for 200 experiments. At the precision level of 0.8, the SCT method produces a recall of 0.910, versus 0.789 and 0.753 for the LCMS and Basic methods, respectively.

Causal Edge Precision-Recall; 300 Samples

**Causal Edge Precision-Recall; 300 Samples**. Precision and recall of directed edges for 300 experiments. At the precision level of 0.8, the SCT method produces a recall of 0.940, versus 0.876 and 0.856 for the LCMS and Basic methods, respectively.

Figures

Causal Edge Precision-Recall w/Weaker Correlations; 100 Samples

**Causal Edge Precision-Recall w/Weaker Correlations; 100 Samples**. Precision and recall of directed edges for 100 experiments from datasets with weaker correlation structure. At the precision level of 0.8, the SCT method yields a recall of 0.570, versus 0.310 and 0.300 for the LCMS and Basic methods, respectively.

Causal Edge Precision-Recall w/Weaker Correlations; 200 Samples

**Causal Edge Precision-Recall w/Weaker Correlations; 200 Samples**. Precision and recall of directed edges for 200 experiments from datasets with weaker correlation structure. At the precision level of 0.8, the SCT method produces a recall of 0.824, versus 0.640 and 0.637 for the LCMS and Basic methods, respectively.

Causal Edge Precision-Recall w/Weaker Correlations; 300 Samples

**Causal Edge Precision-Recall w/Weaker Correlations; 300 Samples**. Precision and recall of directed edges for 300 experiments from datasets with weaker correlation structure. At the precision level of 0.8, the SCT method produces a recall of 0.894, versus 0.810 and 0.793 for the LCMS and Basic methods, respectively.

We note that the performance gains of LCMS-augmented structure learning over unaugmented structure learning reported by Zhu

Precision and recall of transcript-locus associations

Next, we sought to assess the potential of using the learned networks for the purpose of expression quantitative trait loci (eQTL) mapping. In order to establish a transcript-locus linkage, for each head node (locus) in the Bayesian network, we run a depth-first search down the respective branches. All reachable transcripts from the source locus are associated with that locus. Starting with a set of sampled networks, we apply this procedure on each of the individual networks, each yielding a set of transcript-loci linkages. Precision-recall curves were generated from the totality of the individual networks (Methods section). We compare this approach to traditional univariate mapping utilizing a t-test. We note that while there are several recent clustering-based methodologies related to eQTL mapping
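The reachability step described above admits a compact sketch (this is our illustration, not the authors' implementation; the edge-list representation and the function name are ours):

```python
from collections import defaultdict

def eqtl_linkages(edges, loci):
    """For each locus (a root node), return the set of transcripts
    reachable via directed edges, i.e. its putative eQTL linkages."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    linkages = {}
    for locus in loci:
        seen, stack = set(), [locus]
        while stack:  # iterative depth-first search from the locus
            node = stack.pop()
            for c in children[node]:
                if c not in seen:
                    seen.add(c)
                    stack.append(c)
        linkages[locus] = seen
    return linkages

# Example: L1 -> t1 -> t2, and L2 -> t3
links = eqtl_linkages([("L1", "t1"), ("t1", "t2"), ("L2", "t3")], ["L1", "L2"])
```

Applying this procedure to each sampled network yields a set of transcript-locus linkages per network, which are then aggregated into precision-recall curves.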

Figures

eQTL Linkage Precision-Recall; 100 Samples

**eQTL Linkage Precision-Recall; 100 Samples**. Precision and recall of transcript-locus linkages for 100 experiments. At the precision level of 0.8, the SCT method achieves a recall of 0.290, versus 0.130, 0.118 and 0.03 for the LCMS, Basic and univariate mapping methods, respectively.

eQTL Linkage Precision-Recall; 200 Samples

**eQTL Linkage Precision-Recall; 200 Samples**. Precision and recall of transcript-locus linkages for 200 experiments. At the precision level of 0.8, the SCT method achieves a recall of 0.695, versus 0.301, 0.345 and 0.065 for the LCMS, Basic and univariate mapping methods, respectively.

eQTL Linkage Precision-Recall; 300 Samples

**eQTL Linkage Precision-Recall; 300 Samples**. Precision and recall of transcript-locus linkages for 300 experiments. At the precision level of 0.8, the SCT method achieves a recall of 0.735, versus 0.590, 0.25 and 0.08 for the LCMS, Basic and univariate mapping methods, respectively.

eQTL Linkage Precision-Recall w/Weaker Correlations; 100 Samples

**eQTL Linkage Precision-Recall w/Weaker Correlations; 100 Samples**. Precision and recall of transcript-locus linkages for 100 experiments from datasets with weaker correlation structure. At the precision level of 0.8, the SCT method achieves a recall of 0.021 versus 0.020 and 0.019 for the LCMS and Basic methods, respectively. The univariate mapping method recall is 0.008.

eQTL Linkage Precision-Recall w/Weaker Correlations; 200 Samples

**eQTL Linkage Precision-Recall w/Weaker Correlations; 200 Samples**. Precision and recall of transcript-locus linkages for 200 experiments from datasets with weaker correlation structure. At the precision level of 0.8, the SCT method achieves a recall of 0.282, versus 0.115, 0.110 and 0.025 for the LCMS, Basic and univariate mapping methods, respectively.

eQTL Linkage Precision-Recall w/Weaker Correlations; 300 Samples

**eQTL Linkage Precision-Recall w/Weaker Correlations; 300 Samples**. Precision and recall of transcript-locus linkages for 300 experiments from datasets with weaker correlation structure. At the precision level of 0.8, the SCT method achieves a recall of 0.400, versus 0.164, 0.142 and 0.028 for the LCMS, Basic and univariate mapping methods, respectively.

Robustness of network reconstruction

As outlined in the Methods section, we assess convergence and reconstruct consensus networks from two independent MCMC runs, each consisting of 150 million iterations. However, an additional test for robustness involves assessing the stability of edge frequencies across multiple MCMC runs. We follow the protocol outlined by Zhu

**Supplementary Figures**. Supplementary Figures 1a-1d, 2, 3a-c; Supplementary Tables 1a-b, 2a-b.


Possible Explanations for Increased Performance

There are two likely reasons for the enhanced performance associated with our SCT method. The first and most obvious reason is the increased coverage associated with the SCT method. The second reason is attributable to better resolution of ordered triplets. For example, consider the hypothetical sequential triplet influenced by a single locus: L_1 → t_i → t_j → t_k. A locus-anchored method must score each transcript pair (t_i, t_j), (t_i, t_k), and (t_j, t_k) relative to the fixed anchor L_1. In contrast, our method does not rely on a fixed anchor (genomic locus), and is capable of discriminating between potentially confounding motifs of this nature, because the candidate transitions t_i → t_j and t_i → t_k are evaluated with potentials conditioned on the local parent rather than on the distant locus L_1.

Discussion

We presented methodology aimed at utilizing genotypic data for the task of gene network reconstruction on eQTL datasets. Our method is motivated by previous studies focused on the same goal; however, we are able to provide improvements in coverage and resolution. Furthermore, with enhanced network reconstruction accuracy, we show that sampling a set of networks is efficacious for eQTL mapping. Although we followed established protocols for simulating eQTL data, it is inevitable that the simulated data do not perfectly model natural eQTL data. For example, our model omits feedback loops, though such motifs are common in real gene networks. Furthermore, we are clearly unable to model cases where genetic variations are associated with amino acid substitutions without corresponding expression changes. Other situations that we are unable to model include post-translational modifications, such as protein phosphorylation and other mechanisms affecting protein concentrations. It is worth noting that, since our model is generally more complicated than univariate mapping techniques, it stands to reason that univariate mapping might be less sensitive to discrepancies between the model used in our study and real eQTL networks.

Future work involves applying our methodology to datasets that incorporate macroscopic phenotypes, including medical conditions and responses to pharmacological treatments

There are several possible ways in which our approach can be optimized. For example, we plan to investigate the use of iterative procedures, where information from prior runs is incorporated into subsequent runs to improve accuracy. With respect to the general area of Bayesian network structure learning, it would be interesting to consider integrating the causal ordering information from our method with other sources of prior biological information, such as protein-protein interactions or gene ontology (GO) annotation

Conclusions

We developed a probabilistic method based on stochastic causal trees to learn the causal relationships between gene transcripts in genetical genomics studies. Incorporating the information from our method as a prior into Bayesian network structure learning increases the performance of network reconstruction and eQTL mapping.

Methods

Estimating Network Properties

The synthetic network consists of 2,200 transcripts and 50 loci, connected by 2,598 edges. These values were chosen based on analysis run on the yeast eQTL data published by Kruglyak and colleagues, in which transcript-locus linkages were retained at significances of 10^-5 or lower. Next, correlated and adjacent loci were aggregated, yielding 50 genomic "epicenters." This number is roughly consistent with the two recent mapping studies of Zhang

Network Simulation

Given the established number of loci and transcripts, we next implemented a network-generating procedure that yields a level of complexity on par with real eQTL data in terms of the distribution of transcript-loci linkages. Step 1 involves randomly assigning the leaves to one of the 50 loci, where the assignment of the transcript can be to any of the nodes on the growing tree. At this point, every transcript is part of a tree rooted by a single locus, and the loci generally do not contain an equal number of transcripts due to the random allocation of leaves. Step 2 involves randomly adding feed-forward edges and inter-loci edges. A feed-forward edge connects a transcript belonging to a particular locus to another transcript already belonging to that locus, whereas inter-loci edges connect transcripts that belong to different loci. The target ratio of inter-loci edges to feed-forward edges is 9:1, achieved by randomly selecting a number from a uniform distribution.
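The two-step generator can be sketched as follows. This is a toy illustration with small sizes, not the paper's 2,200-transcript/50-locus network; all names and parameters are ours, and acyclicity checks on the extra edges are omitted for brevity:

```python
import random

def simulate_network(n_transcripts=200, n_loci=5, n_extra_edges=60,
                     inter_frac=0.9, seed=0):
    """Step 1: grow locus-rooted trees by attaching each transcript to a
    random existing node of a random tree. Step 2: add extra edges, with
    roughly inter_frac of them connecting transcripts of different loci
    (the 9:1 inter-loci to feed-forward target)."""
    rng = random.Random(seed)
    loci = [f"L{i}" for i in range(n_loci)]
    members = {l: [l] for l in loci}         # nodes per tree, rooted at a locus
    locus_of, edges = {}, set()
    for t in range(n_transcripts):
        name = f"t{t}"
        locus = rng.choice(loci)
        parent = rng.choice(members[locus])  # any node on the growing tree
        edges.add((parent, name))
        members[locus].append(name)
        locus_of[name] = locus
    transcripts = list(locus_of)
    while len(edges) < n_transcripts + n_extra_edges:
        a, b = rng.sample(transcripts, 2)    # direction chosen arbitrarily here
        want_inter = rng.random() < inter_frac
        if (locus_of[a] != locus_of[b]) == want_inter and (b, a) not in edges:
            edges.add((a, b))
    return edges, locus_of
```

The paper's generator additionally respects tree order so that the result is a DAG; this sketch only reproduces the edge-count bookkeeping.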

Transcript-Loci Distribution

**Transcript-Loci Distribution**. The distribution of transcript-loci linkages. 60% of the transcripts link to 2 or more loci.

Simulating eQTL Data

From the simulated network of 2,200 transcripts and 50 loci, eQTL data are subsequently generated according to the protocol presented by Zhu

Expression traits were simulated according to a linear model:

The coefficient for each edge is drawn from a Gaussian distribution; under the stronger correlation structure, its mean is 0.75 with a standard deviation of 0.2 (see the Network Parameters table).

For these traits, the interaction term is drawn from a Gaussian distribution with a mean of 0.5 and a standard deviation of 0.1. Ultimately, the mean correlation between parent and child is 0.68. To generate eQTL data composed of weaker correlation structure, we instead drew the coefficient from a Gaussian with a mean of 0.6.

In summary, we generated both strongly and weakly correlated datasets, each composed of 100, 200, and 300 samples, resulting in six total datasets.
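The linear simulation can be sketched as below. This is a simplified stand-in (our names and noise scale): each transcript is a noisy linear function of its parent, with the coefficient statistics of the stronger-correlation setting; the paper's full model, including genotype effects and interaction terms, is richer than this.

```python
import numpy as np

def simulate_expression(parents, n_samples=200, coef_mean=0.75,
                        coef_sd=0.2, seed=0):
    """Toy linear simulation. `parents` maps node -> parent (roots map to
    None). Roots are binary genotypes; every other node is
    coef * parent + Gaussian noise, with coef ~ N(coef_mean, coef_sd)."""
    rng = np.random.default_rng(seed)
    values = {}

    def value(node):
        if node not in values:
            p = parents[node]
            if p is None:  # root locus: binary genotype across segregants
                values[node] = rng.integers(0, 2, n_samples).astype(float)
            else:
                a = rng.normal(coef_mean, coef_sd)
                values[node] = a * value(p) + rng.normal(0, 1, n_samples)
        return values[node]

    for node in parents:
        value(node)
    return values
```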

Traditional Univariate eQTL Mapping

We used the t-test statistic to implement traditional univariate eQTL mapping, which involves an exhaustive search between all transcripts and loci. For each transcript-locus test, the expression levels for the transcript across all segregants are partitioned by the genotypes at the locus. Subsequently, the t-test is performed to assess the extent to which a locus influences the expression level of a transcript. This is repeated for all transcripts against all loci. In order to account for multiple hypothesis testing, we applied the false discovery rate (FDR) test of Benjamini
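The exhaustive scan with Benjamini-Hochberg correction can be sketched as follows (our vectorization and names; the paper specifies a t-test with FDR control, and this is one minimal way to realize it):

```python
import numpy as np
from scipy.stats import ttest_ind

def univariate_eqtl(expr, geno, alpha=0.05):
    """Exhaustive transcript x locus scan. expr: (n_transcripts, n_samples);
    geno: (n_loci, n_samples) with 0/1 genotypes. Returns a boolean
    (n_transcripts, n_loci) matrix of linkages after Benjamini-Hochberg FDR."""
    T, L = expr.shape[0], geno.shape[0]
    pvals = np.ones((T, L))
    for j in range(L):
        a, b = geno[j] == 0, geno[j] == 1   # partition samples by genotype
        for i in range(T):
            pvals[i, j] = ttest_ind(expr[i, a], expr[i, b]).pvalue
    # Benjamini-Hochberg step-up procedure over all T*L tests
    flat = pvals.ravel()
    order = np.argsort(flat)
    m = flat.size
    thresh = alpha * (np.arange(1, m + 1) / m)
    passed = flat[order] <= thresh
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    reject = np.zeros(m, bool)
    reject[order[:k]] = True
    return reject.reshape(T, L)
```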

LCMS Method

We implemented the LCMS method from Schadt and colleagues as outlined in their previous publications. For each pair of transcripts, t_x and t_y, and each locus, l_j, the method fits a set of competing models.

Models

Each model applies to a locus, l_j, and a pair of transcripts, t_x and t_y: a causal model (l_j → t_x → t_y), a reactive model (l_j → t_y → t_x), and an independent model (t_x ← l_j → t_y).

Bootstrapping is applied 1,000 times for each triplet, from which the probability of each of the respective models is obtained. Given these probabilities, the actual transcript-transcript priors are obtained by the following rules:

**If: **P(t_x → t_y | t_x, t_y, l_j) exceeds the probabilities of the competing models, then the prior for the edge t_x → t_y is set from that probability.

**Else if: **P(t_x → t_y | t_x, t_y, l_j) > 0.5, then the prior for the edge is assigned at a reduced weight.

To summarize this logic, the authors prefer to downweight the prior score over two transcripts in cases where the independent model has a probability greater than 0.5.

Stochastic Causal Tree Method

The stochastic causal tree method is a probabilistic procedure for learning causal hierarchies representing the propagation of influence that emanates from genomic loci and is transmitted through gene transcripts. The trees consist of genomic loci, which serve as roots for their respective trees, and an arbitrary number of transcripts that are stochastically added to the growing tree. The integrity of the branches is maintained with a combination of second- and third-order potentials that act in concert to maintain causal alignments. Once the tree is initiated with a particular locus serving as a root, the crux of the method involves choosing optimal transitions, assessed by the likelihoods associated with transcripts being added as leaves to the growing tree. We express the likelihood of a transition as the sum of the likelihoods of two potentials involving the grandparent (n_g), parent (n_p), and child (n_c) nodes.

There are several functions that could reasonably be used to implement the potentials, including Pearson's correlation, mutual information, or regression functions. We considered both the PCC and regression functions for our study. Ultimately, due to the fact that eQTL datasets consist of both binary (loci) and continuous (expression) data, we opted to employ linear regression functions to model the potentials. In addition to being suitable for modeling datasets composed of binary and continuous variables, regression functions lend adaptability to studies involving diploids where heterozygosity can be represented by a separate category from either homozygous state. Thus, all figures corresponding to our SCT method in this manuscript are derived from an implementation with regression functions. However, for reference we include a performance comparison between the PCC and regression functions in the Additional file.

**1**. Let e be the residuals from the linear regression model: n_c = β_0 + β_1 · n_p + e.

**2**. Set **e** = β_0 + β_1 · **n_g** + e′, i.e., regress the residuals from step 1 on the grandparent n_g.
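The two-step residual regression can be sketched with ordinary least squares (a toy illustration under our naming; the paper's likelihood modeling on top of these fits is not reproduced here):

```python
import numpy as np

def potentials(n_g, n_p, n_c):
    """Step 1: regress the child on the parent and keep the residuals e.
    Step 2: regress e on the grandparent. The step-1 slope reflects the
    parent-child potential; the step-2 slope measures what the grandparent
    explains about the child once the parent is accounted for."""
    def fit(x, y):                    # simple linear regression y ~ x
        X = np.column_stack([np.ones_like(x), x])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return beta, y - X @ beta     # coefficients and residuals
    beta1, e = fit(n_p, n_c)          # step 1: n_c = b0 + b1*n_p + e
    beta2, _ = fit(n_g, e)            # step 2: e   = b0 + b1*n_g + e'
    return beta1[1], beta2[1]
```

On a true causal chain g → p → c, the step-2 slope should be near zero, since the grandparent's influence on the child is mediated by the parent.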

Given concrete functions to implement the second- and third-order potentials represented by equations 1 and 2, we wish to model the likelihood of obtaining any value for the potentials:

For the pairwise potential, the likelihood of a given value is modeled from the joint behavior of the parent n_p and child n_c; the covariance between n_p and n_c determines this distribution. For the conditional potential, the analogous likelihood involves the grandparent n_g and the child conditioned on the parent, n_c | n_p.

Equations 3 and 4 are used in tandem to model the likelihood that a transcript should be included as a leaf on a growing tree, and the log likelihood score is expressed as follows:

where c_1 is the conditional weight parameter (see Parameter Optimization), which scales the contribution of the conditional potential involving n_g relative to the pairwise potential between n_p and n_c.

To describe the SCT method conceptually, the starting points of the algorithm are at the genomic loci, each of which serves as a root for its respective tree. As an example, we refer to a hypothetical tree depicted in the figure below, with one locus, L_1, and four transcripts, t_a, t_b, t_c, and t_d.

Stochastic Causal Tree Schematic

**Stochastic Causal Tree Schematic**. Schematic of the stochastic causal tree method. Blue nodes represent the locus (square) and transcripts (circles) that are currently part of the tree, with causal edges denoted by solid arrows. Candidate optimal transitions for the locus and the four transcripts, t_a, t_b, t_c, and t_d, are shown; the potentials for the selected move are depicted in red.

The SCT algorithm will stochastically choose between the five optimal transitions. The choice is weighted by a factor of the log likelihood score raised to a power represented by a tunable parameter (see Parameter Optimization).
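The weighted stochastic choice can be sketched as follows (a minimal illustration of the score-to-a-power weighting; the function name is ours, and positive scores are assumed):

```python
import random

def choose_transition(transitions, scores, power=3.0, rng=random):
    """Pick a transition with probability proportional to score ** power.
    Larger values of the power parameter concentrate the choice on the
    best-scoring transition; power near zero approaches uniform sampling."""
    weights = [s ** power for s in scores]
    return rng.choices(transitions, weights=weights, k=1)[0]
```

With scores 1.0 and 10.0 and power 3.0, the second transition is selected roughly 1000 times out of 1001.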

The SCT method produces a set of trees which can be represented as adjacency matrices. The SCT output is converted into a prior matrix over transcript-transcript edges (Algorithm 1).

Formally, the SCT method is described as follows:

The dataset consists of expression vectors x[1], ..., x[N] for the N transcripts and genotype vectors g[1], ..., g[M] for the M loci.

**Input **: the dataset

**Output**: the prior matrix

**for **each locus **do**

**for **each of the trees rooted at the locus **do**

S←∅; // S is the set of nodes included in the tree

include the locus node in S;

**for **each leaf to be added **do**

S←S ∪ {StochasticLeaf(S)};

// Convert the trees into the prior matrix

**Algorithm 1**: SCT Main Procedure

**Input **: S, the set of nodes currently in the tree

**Output**: the next leaf to be included in the tree

**for **s **∈ **S **do**

compute GetBestTransition for s;

return a leaf chosen stochastically among the optimal transitions, weighted by their scores;

**Algorithm 2**: StochasticLeaf (Node[] S)

**Input **: p, the parent node

**Input **: g, the grandparent node

**Output**: the optimal transition corresponding to (g, p)

**for **each candidate child c **do**

l_{g,p,c} = the log likelihood score of the transition (weighted by c_1);

return the best transition, argmax_c (l_{g,p,*});

**Algorithm 3**: GetBestTransition (int p, int g)

Bayesian Network Structure Learning

Bayesian networks provide a graphical representation of the joint probability distribution for a set of random variables, allowing for efficient computation of the probability of graphical structures

To improve the computational feasibility of the structure learning algorithm, we restrict the maximum number of parents to be three, a constraint that is employed by several other studies

MCMC simulations are initialized with a graph consisting of 2,000 randomly selected edges, after which the evolution of the structure is governed by the acceptance function. Formally, the acceptance probability is expressed as follows:

New structures are drawn from the proposal distribution,
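A standard Metropolis-Hastings acceptance for structure sampling with proposal distribution q takes the following form (shown here in generic textbook notation, which may differ from the paper's exact expression):

```latex
A(G' \mid G) = \min\!\left(1,\;
  \frac{P(D \mid G')\,P(G')\,q(G \mid G')}
       {P(D \mid G)\,P(G)\,q(G' \mid G)}\right)
```

Here G is the current structure, G' the proposed structure, P(D | G) the marginal likelihood of the data, and P(G) the structure prior.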

Following recent research on the subject of incorporating prior biological knowledge into the Bayesian network structure learning procedure, we incorporate the prior matrix into the structure prior P(G), where a parameter β (the Beta parameter optimized below) controls the strength of the prior relative to the likelihood.

The edge frequencies over a set of sampled graphs {G_1, ..., G_N} are computed as the fraction of sampled graphs G_n in which each edge appears.
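The frequency computation is straightforward; a minimal sketch over adjacency matrices (our representation choice):

```python
import numpy as np

def edge_frequencies(sampled_graphs, n_nodes):
    """Frequency of each directed edge across sampled adjacency matrices
    {G_1, ..., G_N}: the fraction of samples containing that edge."""
    freq = np.zeros((n_nodes, n_nodes))
    for G in sampled_graphs:
        freq += G
    return freq / len(sampled_graphs)
```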

For each of the three structure learning methods, and for each dataset, we initiate two parallel MCMC runs. Each run consists of 150 million iterations with a burn-in period of 10 million iterations. We assess convergence in the subsequent 140 million iterations. As has been noted by others, consecutive samples from an MCMC chain are highly correlated, so graphs are sampled at spaced intervals. Let **S_A** be the set of sampling intervals for the first MCMC simulation, and let **S_B** be the corresponding set for the second.

Optimal values of the Beta parameter for each method and dataset are reported in the following tables.

Beta Parameter; Stronger Correlation

| **Method** | **Beta** | **100 Samples** | **200 Samples** | **300 Samples** |
|---|---|---|---|---|
| LCMS | 4 | 0.628 | **0.867** | **0.926** |
| LCMS | 8 | **0.646** | 0.856 | 0.925 |
| LCMS | 12 | 0.645 | 0.854 | 0.920 |
| LCMS | 16 | 0.628 | 0.845 | 0.906 |
| LCMS | 20 | 0.616 | 0.831 | 0.894 |
| LCMS | 24 | 0.601 | 0.817 | 0.881 |
| SCT | 4 | 0.783 | 0.926 | 0.953 |
| SCT | 8 | 0.821 | 0.943 | 0.964 |
| SCT | 12 | 0.837 | 0.953 | 0.969 |
| SCT | 16 | **0.843** | **0.959** | 0.967 |
| SCT | 20 | 0.841 | 0.958 | 0.968 |
| SCT | 24 | 0.842 | 0.957 | **0.969** |

Beta Parameter; Weaker Correlation

| **Method** | **Beta** | **100 Samples** | **200 Samples** | **300 Samples** |
|---|---|---|---|---|
| LCMS | 4 | 0.577 | 0.787 | 0.884 |
| LCMS | 8 | 0.578 | **0.788** | 0.884 |
| LCMS | 12 | **0.587** | 0.781 | **0.886** |
| LCMS | 16 | 0.574 | 0.786 | 0.885 |
| LCMS | 20 | 0.574 | 0.773 | 0.880 |
| LCMS | 24 | 0.562 | 0.773 | 0.881 |
| SCT | 4 | 0.689 | 0.847 | 0.925 |
| SCT | 8 | 0.715 | 0.869 | 0.935 |
| SCT | 12 | 0.726 | 0.878 | 0.939 |
| SCT | 16 | **0.726** | 0.883 | 0.940 |
| SCT | 20 | 0.724 | **0.885** | **0.941** |
| SCT | 24 | 0.725 | 0.884 | 0.939 |

The continuous data are discretized via k-means clustering into three levels representing down-regulated expression, steady expression, and up-regulated expression. The data are parameterized with a multinomial distribution, and we utilized the structure-equivalent Dirichlet priors introduced by Heckerman
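The three-level discretization can be sketched with a small one-dimensional k-means (our implementation; a library routine such as scikit-learn's KMeans would serve equally well, and the quantile initialization is our choice):

```python
import numpy as np

def discretize_kmeans(x, k=3, iters=50):
    """Discretize one expression vector into k ordered levels
    (0 = down-regulated, ..., k-1 = up-regulated) via 1-D k-means."""
    x = np.asarray(x, float)
    centers = np.quantile(x, (np.arange(k) + 0.5) / k)  # spread initial centers
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        new = np.array([x[labels == j].mean() if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
```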

Therefore, while both the LCMS and SCT methods utilize continuous data to produce a prior matrix, the Bayesian network structure learning procedure uses discrete data. We preferred the computational efficiency associated with discrete data given the large number of simulations conducted in our study. Finally, we note that genomic loci are modeled as head variables in our networks, which has two consequences: 1) we do not need to test for cis- versus trans-relationships; 2) we are able to use the resulting networks for expression quantitative trait loci (eQTL) mapping.

Parameter Optimization

The conditional weight parameter c_1 and the power parameter p were jointly optimized via a grid search over the 200-sample datasets (see the tables below); we used c_1 = 5.0, with p chosen from the same grid.

c1 and p Parameter Optimization; Stronger Correlation; 200 Samples

| **c_1** \ **p** | **1.0** | **2.0** | **3.0** | **4.0** |
|---|---|---|---|---|
| 0.0 | 0.606 | 0.668 | 0.672 | 0.660 |
| 1.0 | 0.661 | 0.740 | 0.751 | 0.751 |
| 2.0 | 0.686 | 0.777 | 0.801 | 0.790 |
| 3.0 | 0.718 | 0.804 | 0.824 | 0.812 |
| 4.0 | 0.736 | 0.823 | 0.839 | 0.822 |
| 5.0 | 0.753 | 0.836 | 0.850 | 0.857 |
| 6.0 | 0.771 | 0.841 | 0.854 | 0.836 |

c1 and p Parameter Optimization; Weaker Correlation; 200 Samples

| **c_1** \ **p** | **1.0** | **2.0** | **3.0** | **4.0** |
|---|---|---|---|---|
| 0.0 | 0.600 | 0.652 | 0.643 | 0.635 |
| 1.0 | 0.630 | 0.692 | 0.698 | 0.668 |
| 2.0 | 0.662 | 0.722 | 0.726 | 0.700 |
| 3.0 | 0.677 | 0.739 | 0.746 | 0.722 |
| 4.0 | 0.698 | 0.772 | 0.762 | 0.738 |
| 5.0 | 0.714 | 0.768 | 0.778 | 0.750 |
| 6.0 | 0.727 | 0.798 | 0.786 | 0.782 |

The parameter M was optimized separately (see the M Parameter Optimization figure).

M Parameter Optimization

**M Parameter Optimization**. The performance of the SCT method alone (y1 axis) and the performance of the SCT-augmented Bayesian networks (y2 axis) versus the M-parameter (x axis). In both cases, AUC refers to the area under the precision-recall curves.

Precision-recall curves

To assess the quality of the reconstructed networks, we generated precision-recall plots, which provide a graphical depiction of precision versus recall. Precision is defined as TP/(TP + FP) and recall as TP/(TP + FN), where TP, FP, and FN denote true positives, false positives, and false negatives, respectively.

For causal edge detection, we construct precision-recall curves by sampling 1.4 M networks, one every 200 iterations over an interval consisting of 140 M iterations for each of two individual MCMC runs. Therefore, the total number of networks sampled is 2 × (140,000,000 / 200) = 1,400,000.

Though far fewer networks would likely be sufficient to assess performance, we note that at least 10,000 samples are desirable, given that any sampled network from the MCMC chain will contain many suboptimal edges. To generate precision-recall curves for the consensus network, we lower the threshold frequency at or above which an edge must occur, starting at 1.4 M. If an edge occurs at or above the threshold, it is assigned as a true or false positive based on whether the edge is present in the true network. The threshold is repeatedly decremented by 1 until reaching 0.
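The threshold sweep just described can be sketched as follows (our representation: a map from edges to occurrence counts, and the set of true edges):

```python
def precision_recall_curve(edge_counts, true_edges, max_count):
    """Sweep the consensus threshold downward. At each threshold, edges
    occurring at least that many times across sampled networks are
    predicted, and precision/recall are computed against the true network."""
    n_true = len(true_edges)
    curve = []
    for thresh in range(max_count, 0, -1):
        predicted = {e for e, c in edge_counts.items() if c >= thresh}
        if not predicted:
            continue  # no edges pass at this threshold
        tp = len(predicted & true_edges)
        curve.append((thresh, tp / len(predicted), tp / n_true))
    return curve  # list of (threshold, precision, recall)
```

As the threshold drops, recall is non-decreasing while precision typically falls, tracing out the curve.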

For eQTL mapping, we take individual networks and establish transcript-loci (eQTL) linkages by conducting a depth-first search from each locus. That is, an eQTL linkage is established if there is a directed path from a locus to a transcript. Since any sampled network from the MCMC chain will contain a considerable number of extraneous edges, we extract 1,000 networks that are spaced 5,000 iterations apart, then use the extracted networks as initializations for greedy optimization procedures

Availability

Software for the SCT method and MCMC Bayesian structure learning procedure can be accessed online:

Authors' contributions

KC and AS designed the study. KC implemented the software and carried out the experiments. Both authors read and approved the final manuscript.

Acknowledgements

This work is supported in part by National Science Foundation grants IIS-0917149 and IIS-0612327. We thank Jarad Niemi for providing insight on the subject of stepwise regression. We also thank two anonymous reviewers for advice in several areas, including QTL mapping and MCMC methods.