Department of Molecular Biology and Genetics, Aarhus University, Blichers Allé 20, 8830 Tjele, Denmark

Abstract

Background

Although genome-scale expression experiments are performed routinely in biomedical research, methods of analysis remain simplistic and their interpretation challenging. The conventional approach is to compare the expression of each gene, one at a time, between treatment groups. This implicitly treats the gene expression levels as independent, but they are in fact highly interdependent, and exploiting this enables substantial power gains to be realized.

Results

We assume that information on the dependence structure between the expression levels of a set of genes is available in the form of a Bayesian network (directed acyclic graph), derived from external resources. We show how to analyze gene expression data conditional on this network. Genes whose expression is directly affected by treatment may be identified using tests for the independence of each gene and treatment, conditional on the parents of the gene in the network. We apply this approach to two datasets: one from a hepatotoxicity study in rats using a PPAR pathway, and the other from a study of the effects of smoking on the epithelial transcriptome, using a global transcription factor network.

Conclusions

The proposed method is straightforward, simple to implement, gives rise to substantial power gains, and may assist in relating the experimental results to the underlying biology.

Background

Although genome-scale expression experiments are performed routinely in biomedical research, understanding the data they generate remains a major challenge. A widely used approach to relate such data to biology is

Recently there has been intense interest in methods that build on the information in biological networks, that is to say, methods that exploit the topology rather than just the set of genes in the network. We briefly summarize some of the methods proposed.

One approach extends gene set enrichment analysis by defining scores that build on network topology. For example, gene set scores that can be expressed as sums of pairwise weights between genes in the set may be modified by weighting gene pairs by the inverse of their path distance in the network

Another approach makes explicit use of network models for the expression data. In

An alternative network-based approach

Methods exploiting pathway structure have also been proposed for other, related purposes. For classification, knowledge of an undirected gene network has been used to develop classifiers of gene profiles by performing a spectral decomposition of the expression profiles with respect to the eigenfunctions of the graph

In the following section we describe a simple way to incorporate a known network or pathway into the analysis of gene expression data. This entails augmenting the network with a discrete node, representing the treatment or class variable. We show that this leads to a simple modification of conventional differential expression analysis. The augmented network contains a discrete as well as multiple continuous (Gaussian) nodes: networks containing both types of node are usually called hybrid networks. With a few recent exceptions

Methods

The model framework

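We suppose that data from a gene expression study are available, in the form of an expression level $X_g$ for each gene $g$ in a set $\Gamma$, and that a Bayesian network for these variables is given, that is, a directed acyclic graph in which each gene $g$ has a (possibly empty) set of parents $\mathrm{pa}(g)$.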

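Under such a model, the joint distribution of the data factorizes into one term for each gene $g$:

$$f(x) = \prod_{g \in \Gamma} f\left(x_g \mid x_{\mathrm{pa}(g)}\right) \qquad (1)$$

where $\Gamma$ is the set of genes in the network, $\mathrm{pa}(g)$ denotes the parents of $g$, and each factor is a linear regression model for $X_g$ with covariates given by the variables $X_{\mathrm{pa}(g)}$.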
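Because the joint distribution is a product of regressions, the network can be fitted one gene at a time. The following is a minimal sketch of this idea in plain Python, not the authors' implementation: the function names (`solve`, `fit_node`, `fit_network`), the two-gene network and the simulated data are all hypothetical and chosen only for illustration.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_node(y, covariates):
    """Least squares fit of y on an intercept plus the covariates.

    Returns (coefficients, maximum likelihood estimate of the residual variance).
    """
    rows = [[1.0] + [cov[i] for cov in covariates] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    beta = solve(xtx, xty)
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y))
    return beta, rss / len(y)

def fit_network(data, parents):
    """Fit each factor f(x_g | x_pa(g)) separately, one regression per gene."""
    return {g: fit_node(data[g], [data[p] for p in parents[g]]) for g in parents}

# Hypothetical two-gene network g1 -> g2 with simulated data.
random.seed(1)
n = 500
g1 = [random.gauss(0.0, 1.0) for _ in range(n)]
g2 = [1.0 + 2.0 * a + random.gauss(0.0, 1.0) for a in g1]
data = {"g1": g1, "g2": g2}
parents = {"g1": [], "g2": ["g1"]}
fits = fit_network(data, parents)
```

Each entry of `fits` holds the regression coefficients (intercept first, then one per parent) and the residual variance for one gene, so the fitted model is simply the collection of these per-gene regressions.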

Another strength of the methodology is the ability to read from the graph which conditional independences hold under the model, using the property of d-separation. We write

$$X_U \perp\!\!\!\perp X_V \mid X_W$$

to mean that $X_U$ and $X_V$ are conditionally independent given $X_W$.
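Reading conditional independences from a directed acyclic graph can be sketched in a few lines. The example below uses the standard equivalent criterion based on the moralized ancestral graph (two sets are d-separated by a third exactly when they are separated in the moral graph of the smallest ancestral set containing all three). The function names and the small example graphs are hypothetical and are not taken from the paper.

```python
def ancestors(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents.get(v, ()))
    return seen

def d_separated(parents, u, v, w):
    """True if node sets u and v are d-separated by w in the DAG given as a parent map."""
    keep = ancestors(parents, set(u) | set(v) | set(w))
    # Build the moral graph on the ancestral set: link each child to its
    # parents, and link parents that share a child.
    nbrs = {n: set() for n in keep}
    for child in keep:
        pa = list(parents.get(child, ()))
        for p in pa:
            nbrs[child].add(p)
            nbrs[p].add(child)
        for i, p in enumerate(pa):
            for q in pa[i + 1:]:
                nbrs[p].add(q)
                nbrs[q].add(p)
    # Search from u for any node of v, never passing through w.
    blocked = set(w)
    stack = [n for n in u if n not in blocked]
    seen = set(stack)
    while stack:
        n = stack.pop()
        if n in v:
            return False
        for m in nbrs[n] - blocked - seen:
            seen.add(m)
            stack.append(m)
    return True

# Hypothetical chain T -> g1 -> g2 and collider a -> c <- b.
chain = {"T": [], "g1": ["T"], "g2": ["g1"]}
collider = {"a": [], "b": [], "c": ["a", "b"]}
chain_marginal = d_separated(chain, {"T"}, {"g2"}, set())
chain_given_g1 = d_separated(chain, {"T"}, {"g2"}, {"g1"})
collider_marginal = d_separated(collider, {"a"}, {"b"}, set())
collider_given_c = d_separated(collider, {"a"}, {"b"}, {"c"})
```

In the chain, `T` and `g2` are separated only once `g1` is conditioned on, whereas in the collider conditioning on the common child `c` removes the separation between `a` and `b`.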

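To model the effect of the treatment or class variable $X_T$, we add a node $T$ to the network and suppose that it directly affects a subset $\Gamma_T \subseteq \Gamma$ of the set $\Gamma$ of genes. The augmented graph has edge set

$$E \cup E_T$$

where $E$ is the edge set of the original network and $E_T$ is a set of additional edges of the form $(T, g)$ with $g \in \Gamma_T$. We suppose that the object of the analysis is to find (i.e., estimate) $E_T$.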

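We assume that the joint distribution factorizes according to the augmented graph:

$$f(x_T, x) = f(x_T) \prod_{g \in \Gamma_T} f\left(x_g \mid x_{\mathrm{pa}(g)}, x_T\right) \prod_{g \in \Gamma \setminus \Gamma_T} f\left(x_g \mid x_{\mathrm{pa}(g)}\right) \qquad (2)$$

where $\Gamma$ is the set of all genes in the network and $\Gamma_T$ is the set of genes $g$ with an edge $(T, g)$ in $E_T$.

Comparing (1) with (2) we see two changes: firstly, a term $f(x_T) = \Pr(x_T)$ is introduced. Since we usually condition on $X_T$, this term plays no role in the analysis. Secondly, for the genes in $\Gamma_T$ we need to let the conditional distribution of $X_g$ depend on $X_T$ as well as on $X_{\mathrm{pa}(g)}$.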

Maximum likelihood estimates under the model (2) can be obtained by maximizing the likelihood for each factor separately: since these are all standard models, this is easily done. The likelihood ratio test (or

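An important special case occurs when the null and alternative models, $M_0$ and $M_1$, differ by one edge only, say $(T, g)$. The null hypothesis is then

$$X_g \perp\!\!\!\perp X_T \mid X_S, \qquad S = \mathrm{pa}(g),$$

where $X_g$ is conditionally independent of $X_T$ given the parents of $g$. This hypothesis can be tested in the linear regression of $X_g$ with covariates $X_S$ and a discrete term for $X_T$, and so to test $M_0$ against $M_1$, we can associate each edge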

Testing the conditional independence of each gene and treatment, given the parents of the gene in the network, can be regarded as a simple modification of conventional methods for differential expression analysis that are based on tests of marginal independence between treatment and genes. To compare and contrast the conditional and marginal approaches, consider the two models relating a treatment $X_T$ to the expression levels $X_1$ and $X_2$ of two genes, shown in Figure


**Comparison of marginal and conditional tests.** Comparison of conditional and marginal tests for two models. Under (**a**), where treatment affects $X_2$ directly, $X_T \not\perp\!\!\!\perp X_2$ and $X_T \not\perp\!\!\!\perp X_2 \mid X_1$, but the conditional test will generally have greater power than the marginal test, since using $X_1$ as a regressor will explain some proportion of $X_2$’s variation. Under (**b**), where treatment does not affect $X_2$ directly, $X_T \not\perp\!\!\!\perp X_2$ but $X_T \perp\!\!\!\perp X_2 \mid X_1$. Hence the conditional null hypothesis holds, and the Type II error of the conditional test is less than

Note also that the marginal approach can be regarded as the special case of the conditional approach that occurs when the network has no edges, so that every gene has an empty parent set.
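The power gain under model (a) can be made explicit in a simplified setting. The following worked example is only a sketch, and every symbol in it is an assumption introduced for illustration: a binary treatment with group sizes $n_0$ and $n_1$, a direct treatment effect $\alpha$ on $X_2$, a regression coefficient $\beta$ of $X_2$ on $X_1$, a Gaussian $X_1$ with variance $\tau^2$ that is independent of treatment, and a residual variance $\sigma^2$:

$$X_2 = \mu + \alpha X_T + \beta X_1 + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2).$$

In the marginal test $X_1$ is left out of the regression, so its contribution is absorbed into the error term, whose variance becomes $\sigma^2 + \beta^2 \tau^2$. In the conditional test the error variance is $\sigma^2$. The noncentrality parameters of the two tests of $\alpha = 0$ are therefore

$$\lambda_{\text{marg}} = \frac{\alpha^2}{(\sigma^2 + \beta^2 \tau^2)\left(\frac{1}{n_0} + \frac{1}{n_1}\right)}, \qquad \lambda_{\text{cond}} \approx \frac{\alpha^2}{\sigma^2 \left(\frac{1}{n_0} + \frac{1}{n_1}\right)},$$

so that

$$\frac{\lambda_{\text{cond}}}{\lambda_{\text{marg}}} \approx 1 + \frac{\beta^2 \tau^2}{\sigma^2}.$$

The approximation in the conditional case ignores the small sample correlation between $X_1$ and the treatment indicator. Under these assumptions, the more of the variation in $X_2$ that is explained by its parent, the larger the gain from conditioning.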

Since multiple hypotheses are tested, use of conventional significance level thresholds would inflate the false positive rate. Many approaches to correct for multiplicity are available
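The conditional test described in this section amounts to comparing two nested regressions for a gene: one on its parents alone, and one on its parents plus a discrete term for treatment. The following is a minimal sketch in plain Python that computes the corresponding F statistic; passing an empty list of parents gives the marginal test. The function names and the simulated data are hypothetical, and the sketch stops at the test statistic rather than the p-value.

```python
import random

def ols_rss(y, columns):
    """Least squares of y on an intercept plus the columns.

    Returns (residual sum of squares, number of fitted coefficients).
    """
    rows = [[1.0] + [c[i] for c in columns] for i in range(len(y))]
    p = len(rows[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    m = [[sum(r[j] * r[k] for r in rows) for k in range(p)]
         + [sum(r[j] * yi for r, yi in zip(rows, y))] for j in range(p)]
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(p):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    beta = [m[j][p] / m[j][j] for j in range(p)]
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y))
    return rss, p

def treatment_f_statistic(x_g, treatment, parent_columns):
    """F statistic for adding a discrete treatment term to the regression of x_g on its parents.

    Returns (F, numerator degrees of freedom, denominator degrees of freedom).
    """
    levels = sorted(set(treatment))
    # Dummy coding: one indicator column per treatment level except the first.
    dummies = [[1.0 if t == lev else 0.0 for t in treatment] for lev in levels[1:]]
    rss0, p0 = ols_rss(x_g, list(parent_columns))
    rss1, p1 = ols_rss(x_g, list(parent_columns) + dummies)
    df1, df2 = p1 - p0, len(x_g) - p1
    return ((rss0 - rss1) / df1) / (rss1 / df2), df1, df2

# Hypothetical data in which treatment and a parent gene x1 both affect x2.
random.seed(7)
n = 100
treatment = [i % 2 for i in range(n)]
x1 = [random.gauss(0.0, 1.0) for _ in range(n)]
x2 = [1.0 * t + 2.0 * a + random.gauss(0.0, 0.5) for t, a in zip(treatment, x1)]

f_cond, df1_cond, df2_cond = treatment_f_statistic(x2, treatment, [x1])  # conditional test
f_marg, df1_marg, df2_marg = treatment_f_statistic(x2, treatment, [])    # marginal test
```

A p-value would be obtained from the upper tail of the F distribution with the returned degrees of freedom, for example with `scipy.stats.f.sf(f_cond, df1_cond, df2_cond)` if SciPy is available, which this sketch does not assume.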

Results

In this section we describe two applications of the method.

Hepatotoxicity and the PPAR pathway

Here we describe the analysis of data taken from a hepatotoxicity study in rats (

The stated objective of the study was to use microarray gene expression data acquired from the liver of rats exposed to hepatotoxicants to build classifiers for prediction of liver necrosis. In the study 418 rats were exposed to one of eight compounds (1,2-dichlorobenzene, 1,4-dichlorobenzene, bromobenzene, monocrotaline, N-nitrosomorpholine, thioacetamide, galactosamine, and diquat dibromide). All eight compounds were studied using standardized procedures, i.e. a common array platform (Affymetrix Rat 230 2.0 microarray) and common experimental, data retrieval and analysis procedures. For each compound, four to six male, 12-week-old F344 rats were exposed to a zero dose, low dose, mid dose(s) or high dose of the toxicant and sacrificed 6, 24 or 48 hours later. At necropsy, liver was harvested for RNA extraction, histopathology, and clinical chemistry assessments.

For simplicity we use the subset of data from the study pertaining to 1,2-dichlorobenzene, and compare active with control treatments (ignoring the effects of dose and exposure time). The preprocessing steps are described on the GEO website. In all there were 46 arrays in the subset: 12 animals were in the control group, and 34 animals were exposed to active drug.

Peroxisome proliferator-activated receptors (PPARs) are a group of nuclear receptor proteins that function as transcription factors, playing essential roles in the regulation of cellular differentiation, development, and metabolism of higher organisms. Several types have been identified, denoted PPAR-

We obtained a copy of the KEGG

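To examine the effect of treatment on the network, the network-based tests of $X_T \perp\!\!\!\perp X_g \mid X_{\mathrm{pa}(g)}$ were compared with the marginal tests of $X_T \perp\!\!\!\perp X_g$. In both cases, to correct for multiplicity we use Holm’s step-down procedure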
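Holm’s step-down procedure can be expressed as an adjustment of the raw p-values. The following is a minimal sketch in plain Python; the function name `holm_adjust` and the example p-values are hypothetical and are not results from the study.

```python
def holm_adjust(pvalues):
    """Return Holm step-down adjusted p-values, in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by m - k + 1,
        # capped at 1, and forced to be no smaller than the previous adjusted value.
        candidate = min(1.0, (m - rank) * pvalues[i])
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted

# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = holm_adjust(raw)
significant = [a < 0.05 for a in adjusted]
```

A gene is declared affected by treatment when its adjusted p-value falls below the chosen threshold, which controls the familywise error rate at that level.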

Using the network-based tests,


**Augmented PPAR pathway.** An inferred PPAR pathway showing the effects of treatment (multiplicity-adjusted p-values less than 0.05). Transcription factors are shown in red.

The effects of smoking

Here we describe the analysis of data taken from a study of the effects of cigarette smoke on the human oral mucosal transcriptome

Here we use the data to characterize the effect of smoking on gene expression, making use of a global transcription network constructed using information on human transcription factors (TFs) and their putative target genes (TGs) obtained from the TRANSFAC database

We applied the method to


**Comparison of marginal and conditional adjusted p-values.** A scatterplot of multiplicity-adjusted

It is instructive to relate the differentially expressed genes to the network topology. For purposes of illustration we consider the 254 genes that are significant at a false discovery rate of 2.5%, of which 10 are transcription factors. The subnetwork of


**A subnetwork of the global transcription network.** The expression of all genes in the subnetwork is affected by smoking (false discovery rate of 2.5%). Transcription factors are shown in red.

Discussion

We have described a simple way to exploit network information in the analysis of gene expression data, using tests for the conditional independence of each gene and treatment given the parents of the gene in the network. This method can be regarded as an extension of conventional methods of gene expression analysis that takes the network structure into account. Using two examples, we demonstrated that the method can result in a substantial increase in power.

In a related approach to the analysis of genetics of gene expression data

As described above, some authors

The approach builds on some assumptions that may or may not be warranted. The key assumption is that the steady state distribution of the gene expression levels follows a given Bayesian network. Gene regulation is extremely complex and as yet imperfectly understood, so such an assumption can at best be tentative. We have illustrated the approach using two networks, one based on a signalling pathway and the other constructed using transcription factor/target gene data. Biochemical pathways represent phenomena occurring at the protein level, which are not necessarily reflected at the transcript level: see Figure


**A hypothetical signalling pathway.** The figure shows a hypothetical signalling pathway adapted from

As we have described the method, it assumes that the expression data are Gaussian distributed, but this is not critical. Expression data from microarrays are typically taken to be Gaussian after log transformation

An assumption, implicit in both the marginal and the network-based analyses, is that the treatment variable is causal rather than reactive with respect to the gene expression data. Ideally the treatment should represent a randomized intervention, allowing secure causal interpretations, but gene expression studies are rarely randomized. In poorly designed studies, treatment allocation may be confounded with other factors

Similar remarks apply to the use of the terms

Finally, it is assumed that the treatment affects the parameters of the network but not its topology. In some applications this may not be appropriate. For example, interventions affecting chromatin structure may alter the accessibility of DNA binding sites and hence patterns of regulatory control due to transcription factors.

Conclusions

A straightforward way to exploit network information in the analysis of gene expression data is to assume that the network models the steady state distribution of the gene expression levels, and that the treatment affects the parameters but not the topology of the network. In this framework, genes whose expression is directly affected by the treatment may be identified using tests for the conditional independence of each gene and treatment given the parents of the gene in the network. This method can be regarded as an extension of conventional methods of gene expression analysis that takes network structure into account. It is simple to implement, gives rise to substantial power gains, and may give insight into the biological processes involved.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

DE conceived the methods, performed the analyses and drafted the manuscript. LW participated in the analyses. All authors read and approved the final manuscript.

Acknowledgements

The financial support of Quantomics, a collaborative project under the 7th Framework Programme (FP7), is gratefully acknowledged (PS).