MRC Epidemiology Unit, Institute of Metabolic Science, Box 285, Addenbrooke's Hospital, Hills Road, Cambridge CB2 0QQ, UK

Section of Applied Statistics, School of Biological Sciences, The University of Reading, Earley Gate, Reading RG6 6FN, UK

Department of Biochemistry, Pharmacology and Genetics, Odense University Hospital, Sdr Boulevard 29, Odense C., DK-5000, Denmark

Abstract

The Genetic Analysis Workshop 15 (GAW15) Problem 1 contained baseline expression levels of 8793 genes in immortalized B cells from 194 individuals in 14 Centre d'Etude du Polymorphisme Humain (CEPH) Utah pedigrees. Previous analyses of the data showed linkage and association, as well as evidence of substantial individual variation. In particular, correlations were examined among the expression levels of 31 genes and 25 target genes corresponding to two master regulatory regions. In this analysis, we apply Bayesian network analysis to gain further insight into these findings. We identify strong dependences and thereby provide additional insight into the underlying relationships between the genes involved. More generally, the approach is expected to be applicable to the integrated analysis of genes on biological pathways.

Background

Recent genetic dissection of common diseases has largely proceeded through linkage and association studies involving discrete or continuous traits, including intermediate phenotypes such as gene expression data from microarray experiments. The latter can involve thousands of genes, and annotation of their roles in biological pathways and in relation to DNA polymorphisms poses immense challenges and has sparked huge interest.

A key challenge in the analysis of gene expression data is the reconstruction of regulatory networks. Several approaches directly extend classical techniques such as cluster analysis to infer the relationships between multiple variables. A novel but apparently little-used approach to cluster analysis is to extract the patterned information formally and use it in typical linkage and association analyses. More importantly, cluster analysis can be followed by Gaussian graphical modelling

The Problem 1 data from Genetic Analysis Workshop 15 (GAW15) offers an excellent opportunity for investigating the utility of Bayesian networks. An earlier report

Methods

Gene expression levels, treated as continuous variables, can be assumed to follow a multivariate normal distribution and to be consistent with a Bayesian network with linear Gaussian conditional densities. The prior of this network is characterized by a prior network, reflecting our belief in the joint distribution of the variables in question, and an equivalent sample size (ESS), which makes the prior behave as if it were calculated from a "prior" data set of that size. For instance, without
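To make the two ingredients above concrete, the following is a minimal sketch rather than the model fitted in this study. It assumes a hypothetical three-gene chain A -> B -> C with made-up intercepts, weights and variances, and shows how a linear Gaussian network factorizes the joint density into one normal density per gene given its parents. The helper `posterior_mean` illustrates the usual conjugate reading of the ESS for a Gaussian mean, in which the prior counts as `ess` pseudo-observations; the function names and all numbers are illustrative assumptions.

```python
import math


def normal_logpdf(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


# Hypothetical network A -> B -> C; gene names and parameters are illustrative only.
NETWORK = {
    "A": {"parents": [], "intercept": 8.0, "weights": [], "var": 1.0},
    "B": {"parents": ["A"], "intercept": 1.0, "weights": [0.5], "var": 0.25},
    "C": {"parents": ["B"], "intercept": 0.0, "weights": [2.0], "var": 0.5},
}


def joint_logpdf(values, network=NETWORK):
    """Sum the linear Gaussian conditional log densities over all nodes."""
    total = 0.0
    for gene, spec in network.items():
        # Conditional mean is a linear function of the parents' values.
        mean = spec["intercept"] + sum(
            w * values[p] for w, p in zip(spec["weights"], spec["parents"])
        )
        total += normal_logpdf(values[gene], mean, spec["var"])
    return total


def posterior_mean(prior_mean, ess, sample_mean, n):
    """Conjugate update of a Gaussian mean: the prior weighs as `ess` observations."""
    return (ess * prior_mean + n * sample_mean) / (ess + n)
```

For example, `posterior_mean(0.0, 10, 1.0, 190)` pulls the sample mean of 190 observations towards the prior mean by the weight of 10 pseudo-observations, so a small ESS relative to the sample size lets the data dominate.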

The GAW15 Problem 1 data set consists of 194 individuals from 14 three-generation CEPH (Centre d'Etude du Polymorphisme Humain) pedigrees, with baseline expression levels of genes in immortalized B cells. The data provided contain expression levels of 8793 genes. Following an earlier investigation

Affymetrix CEL-files were preprocessed with BioConductor package

Results

Cluster analysis shows that the dendrogram (not shown) differs somewhat from the earlier report ^{th }checkpoint) showed that the following genes are independent of any other genes in the model:

**Importance of the dependencies**. A solid line indicates a direct causal influence ("direct" means that the causal influence is not mediated by any other variable included in the study).

**Importance of the causal structure**. A solid line indicates a direct causal influence ("direct" means that the causal influence is not mediated by any other variable included in the study). A dashed line with an arrow head indicates that there are two possibilities, but we do not know which holds. A dashed line without any arrow heads indicates that there is a dependency but its direction is unknown.

Strength of the dependency. Removing any of the edges (Vertex 1 to Vertex 2) in edge set one from the chosen model would decrease the probability of the model to less than one thousandth the probability of the original model, while removing any of the edges in edge set two decreases the probability of the model (exact ratio listed).

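Edge Set 1

| Vertex 1 | Vertex 2 | Ratio |
|---|---|---|
| TOP1 | IGBP1 | 436736 |
| TIMM17A | C20orf24 | 201449 |
| CYCS | CCT6A | 89880 |
| IGBP1 | TUBG1 | 16221 |
| CYCS | DDX39 | 9248 |
| IGBP1 | CYCS | 4388 |

Edge Set 2

| Vertex 1 | Vertex 2 | Ratio |
|---|---|---|
| CYCS | TIMM17A | 102 |
| RFC5 | CYCS | 92 |
| FHIT | TOP1 | 61 |
| PIM1 | NDUFB2 | 41 |
| RFC5 | IGBP1 | 31 |
| FHIT | DTYMK | 17 |
| XPC | MIR16 | 15 |
| XPC | RFC5 | 9.96 |
| FHIT | G0S2 | 5.85 |
| MIR16 | DDX39 | 3.58 |
| TIMM17A | PLAA | 3.57 |
| CCT6A | PIM1 | 3.31 |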
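The ratios in the table above compare the probability of the chosen model with that of the same model after one edge is removed. A hedged sketch of that comparison for a single edge follows. It is not the scoring method used in this study: as an assumption, it replaces the Bayesian score with the BIC approximation to the log marginal likelihood of a linear Gaussian node, and the function names and data are hypothetical.

```python
import math


def gaussian_loglik(residuals):
    """Maximized Gaussian log-likelihood, using the MLE of the residual variance."""
    n = len(residuals)
    var = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)


def bic_score(child, parent=None):
    """BIC approximation to the log marginal likelihood of one node (an assumption)."""
    n = len(child)
    mean_c = sum(child) / n
    if parent is None:
        residuals = [c - mean_c for c in child]
        k = 2  # intercept and variance
    else:
        mean_p = sum(parent) / n
        sxx = sum((p - mean_p) ** 2 for p in parent)
        sxy = sum((p - mean_p) * (c - mean_c) for p, c in zip(parent, child))
        slope = sxy / sxx  # least-squares slope of child on parent
        residuals = [
            (c - mean_c) - slope * (p - mean_p) for p, c in zip(parent, child)
        ]
        k = 3  # intercept, slope and variance
    return gaussian_loglik(residuals) - 0.5 * k * math.log(n)


def edge_removal_ratio(child, parent):
    """Approximate P(model with edge) / P(model without edge) for parent -> child."""
    return math.exp(bic_score(child, parent) - bic_score(child))


# Synthetic, deterministic data: one child depends on the parent, the other does not.
parent = [0.1 * i for i in range(20)]
noise = [0.3 * (-1) ** i for i in range(20)]
dependent_child = [1.0 + 2.0 * p + e for p, e in zip(parent, noise)]
independent_child = noise

strong_ratio = edge_removal_ratio(dependent_child, parent)
weak_ratio = edge_removal_ratio(independent_child, parent)
```

Under these assumptions, `strong_ratio` is far above 1000, in the spirit of edge set one, whereas `weak_ratio` falls below 1, meaning the model without the edge would be preferred.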

Discussion

Our analysis provides new insights into the complex interactions of gene expression levels in the GAW15 Problem 1 data. This work demonstrates the potential usefulness of statistical inference on causal structure. Without an

An apparent limitation of this work, though not uncommon in gene-expression studies, is the relatively small sample size used. Fully elucidating the biological pathways involved may therefore be difficult. For example,

Our inference of gene networks also exploits the covariance structure of the data, like structural equation modelling

Conclusion

Bayesian network modelling is applied to GAW15 gene expression data and shown to be more informative than classic cluster analysis. While the findings are the subject of further investigation, the approach merits further attention.

Competing interests

The author(s) declare that they have no competing interests.

Acknowledgements

We thank Dr. Cathy Falk and other members of Group 5 at GAW15 for useful comments.

This article has been published as part of