Abstract
Background
Microarray experiments often measure expressions of genes taken from sample tissues in the presence of external perturbations such as medication, radiation, or disease. The external perturbation can change the expressions of some genes directly or indirectly through gene interaction network. In this paper, we focus on an important class of such microarray experiments that inherently have two groups of tissue samples. When such different groups exist, the changes in expressions for some of the genes after the perturbation can be different between the two groups. It is not only important to identify the genes that respond differently across the two groups, but also to mine the reason behind this differential response. In this paper, we aim to identify the cause of this differential behavior of genes, whether because of the perturbation or due to interactions with other genes.
Results
We propose a new probabilistic Bayesian method CMRF based on Markov Random Field to identify such genes. CMRF leverages the information about gene interactions as the prior of the model. We compare the accuracy of CMRF with SSEM and Student's t test and our old method SMRF on semisynthetic dataset generated from microarray data. CMRF obtains high accuracy and outperforms all the other three methods. We also conduct a statistical significance test using a parametric noise based experiment to evaluate the accuracy of our method. In this experiment, CMRF generates significant regions of confidence for various parameter settings.
Conclusions
In this paper, we solved the problem of finding primarily differentially regulated genes in the presence of external perturbations when the data is sampled from two groups. The probabilistic Bayesian method CMRF based on Markov Random Field incorporates dependency structure of the gene networks as the prior to the model. Experimental results on synthetic and real datasets demonstrated the superiority of CMRF compared to other simple techniques.
Background
Microarray experiments often measure expressions of genes taken from sample tissues in the presence of external perturbations such as medication, radiation, or disease [1,2]. Typically in such experiments, gene expressions are measured before and after the application of external perturbation, and are called control data and noncontrol data, respectively. In this paper, we focus on an important class of such microarray experiments that inherently have two groups of tissue samples. Different groups in a microarray measurement can exist in many different ways. For instance, samples can be taken from members of multiple closely related species (e.g. rat versus mouse). Within the same species there can be subgroups with different phenotypes (e.g. African American versus Caucasian American). Another example is when the samples have already been through several alternative external perturbations (e.g. fasting and not fasting). When such different groups exist, it is not only important to observe overall changes in gene expression, but also to observe how different groups respond to the external perturbation. For example, Taylor et al. applied medications on 36 Caucasian American and 33 African American patients infected with Hepatitis C [3]. Gene expressions were collected before and after the medication.
In a perturbation experiment, some of the genes respond by noticeably changing their expression values between the control and noncontrol data. Genes that change their expressions in a statistically significant way are referred to as differentially expressed (DE), while those that do not, are referred to as equally expressed (EE) genes. In the context of two groups, we refer to a gene that has the same state in both the groups, i.e. either DE or EE for both the groups, as equally regulated (ER) gene. On the contrary, if a gene is DE in one group and EE in the other, we denote it as differentially regulated (DR).
Genes for any organism typically interact with each other via regulatory and signaling networks. For simplicity, we will refer to them as gene networks for the rest of this paper. A small portion of an example gene network can be seen in Figure 1.
Figure 1. A sample gene regulatory network. Illustration of the impact of a hypothetical external perturbation on a small portion of the Pancreatic Cancer pathway. The pathway is taken from the KEGG database. The solid rectangles denote the genes affected directly by perturbation, the dashed rectangles indicate genes secondarily affected through the networks. The dotted rectangles denote the genes without any change in expression. → implies activation and ⊣ implies inhibition. In this figure, genes KRas, Raf and Cob42Roc are primarily affected and MEK, Ral and RalGDS are secondarily affected through interactions.
Once an external perturbation is applied, a gene can be DE in one of the two ways  as a direct effect of the perturbation or via interaction with other DE genes through gene networks. We denote a gene as primarily affected DE, if it is DE due to the external perturbation. Similarly, a gene is secondarily affected DE, if it is DE due to another gene in the gene network. Figure 1 shows the state of the genes in the Pancreatic Cancer pathway after a hypothetical external perturbation is applied. In this figure, genes KRas, Raf and Cob42Roc are primarily affected and MEK, Ral and RalGDS are secondarily affected through interactions.
Recall that for a gene to be DR, it has to be EE in one group and DE in another group. For such a gene, if it happens to be DE in one group because of the external perturbation, we call it as primarily differentially regulated (PDR) gene. When it is DE in one group because of the interaction with other DE genes in the gene networks, we will refer to it by secondarily differentially regulated (SDR) gene. In this paper, we consider the problem of identifying the PDR genes in a given set of control and noncontrol gene expressions from two groups of samples.
Existing methods to identify the primarily affected DE genes using association analysis techniques [4,5], haploinsufficiency profiling [68] and chemicalgenetic interaction mapping [9] are limited to applications where additional information such as fitness based assays of drug response or a library of genetic mutants is available. Bernardo et al. suggested a regression based approach named MNI that assumes that the internal genetic interactions are offset by the external perturbation [10]. It estimates genegene interaction coefficients from the control data and uses them to predict the target genes in the noncontrol data. Cosgrove et. al. proposed a method named SSEM that is similar to MNI [11]. SSEM models the effect of perturbation by an explicit change of gene expression from that of the unperturbed state.
We have also developed a method to detect the primarily and secondarily affected genes in perturbation experiments with a single data group [12]. We will call this method SMRF (single MRF) in the rest of this paper for it applies MRF on single group datasets. In that paper we developed a Bayesian probabilistic method based on Markov Random Field that leverages the information from gene networks as the prior belief of the model.
Though these methods analyze primary and secondary effects of perturbation on gene expressions, they are not directly applicable for multigroup perturbation experiments.
Several recent studies aim to identify DE genes in multiple groups of data points. maSigPro is a two stage regression based method that identifies genes that demonstrate differential gene expression profiles across multiple experimental groups [13]. Hong et al. proposed a functional hierarchical model for detecting temporally differentially expressed genes between two experimental conditions [14]. They modeled gene expressions by basis function expansion and estimate the parameters using a Monte Carlo EM algorithm. Tai et al. ranked DE genes using data from replicated microarray time course experiments, where there are multiple biological conditions [15]. They derived a multisample multivariate empirical Bayes statistic for ranking genes. Angelini et al. proposed a Bayesian method for detecting temporally DE genes between two experimental conditions [16]. Deun et al. developed a Bayesian method to find the genes that are differentially expressed in a single tissue or condition over multiple tissues or conditions [17]. All these methods identify differentially expressed genes in multiple groups. However, none of these methods analyzed the primary and secondary effects in a two group perturbation experiment. In this paper, we develop a method to solve this problem.
Our approach
In this paper, we propose a new probabilistic Bayesian method CMRF to find the PDR genes in two group perturbation experiment dataset as defined above. We call this method CMRF (Comparative MRF) for it applies MRF on two groups of data for comparison purpose. Our Bayesian method incorporates information about relationship from gene networks as prior beliefs. We consider the gene network as a directed graph where each node represents a gene, and a directed edge from gene g_{i }to gene g_{j }represents a genetic interaction (e.g g_{i }activates or inhibits g_{j}). We define two genes as neighbors of each other if they share a directed edge. For example, in Figure 1, genes KRas and Raf are neighbors as KRaf activates Ras. We also classify a neighbor as incoming or outgoing, if it is at the start or at the end of the directed edge respectively. In Figure 1, Raf is an incoming neighbor of MEK and MEK is an outgoing neighbor of Raf. When the expression level of a gene is altered, it can affect some of its outgoing neighbors. Thus, the gene expression can change due to external perturbation or because of one or more of the affected incoming neighbors.
We represent the external perturbation by a hypothetical gene (i.e. metagene) g_{0 }in the gene network. We add an edge from the metagene to all the other genes because
the external perturbation has the potential to affect all the other genes. So, g_{0 }is an incoming neighbor to all the other genes. We call the resulting network the
extended gene network. CMRF estimates the probability that a gene g_{j }is DR due to an alteration in the activity of gene
Figure 2 illustrates different components of CMRF and the connectivity between them. Note that, (C) corresponds to the Bayesian prior based on MRF.
Figure 2. Illustration of different components of CMRF and connectivity between them. (A) obtains an initial estimates of state variables using Student's t test. (B) estimates parameters in a way that maximizes data likelihood. (C) estimates parameters in order to maximize prior density. Both (B) and (C) use a global optimization technique called Differential Evolution. (D) employs Iterated Conditional Modes to maximize the pseudolikelihood. (B), (C) and (D) consist of an alternating optimization technique. These three steps (B), (C) and (D) are repeated till the algorithm meets a criteria for completion. Finally, once the optimization is complete, the DR genes are sorted in decreasing order of their likelihood with respect to the metagene g_{0}. The genes at the top of the list are declared PDR.
We compare the accuracy of CMRF with that of SSEM and Students t test on semisynthetic dataset generated from microarray data in Cosgrove et al [11]. We also compare CMRF with our old method SMRF that we developed to identify the primarily affected DE genes in a single group perturbation data [12]. CMRF obtains high accuracy and outperforms all the other three methods. Also, we conduct a statistical significance test using a parametric noise based experiment to evaluate the accuracy of CMRF. In this experiment our model demonstrates reasonable confidence regions for various values of the parameters.
The rest of the paper is organized as follows. Section Results and discussion presents the results of our experiments. Section Methods describes our methods in detail. Section Conclusions concludes our discussion.
Results and discussion
In this section we discuss the experiments we conducted to evaluate the quality of CMRF. We implemented CMRF in MATLAB and Java. We obtained the code for Differential Evolution from http://www.icsi.berkeley.edu/~storn/code.html webcite. We compared CMRF with SSEM as SSEM is one of the most recent methods that considers identifying primarily affected genes in a perturbation experiments [11].
We obtained SSEM from http://gardnerlab.bu.edu/SSEMLasso webcite. We executed our code on a QuadCore AMD Opteron 2 Ghz workstation with 32 GB of memory.
Dataset
We used four different sets of data to conduct the experiments in this paper.
• Dataset 1. The first dataset was collected by Smirnov et al. [20]. This dataset was generated using 10 Gy ionizing radiation over immortalized B cells obtained from 155 members of 15 Centre d'tude du Polymorphisme Humain (CEPH) Utah pedigrees [21]. Microarray snapshots were obtained before (at zeroth hour) and after (at second and sixth hours) the application of radiation.
• Dataset 2. The second dataset corresponds to a drug response experiment conducted by Taylor et al [3]. Medications were applied on 36 Caucasian American and 33 African American patients infected with Hepatitis C. Gene expressions were collected before the medication was started and at 1, 2, 7, 14, 28 days after the medication was administered.
Both dataset 1 and 2 are microarray time series data with more than two time points. We adapted these two time series data two create control and noncontrol data suitable for our experiments. We used the data before perturbation as control data. For the noncontrol data we calculated the expected expression of a gene at each points after the perturbation. We selected the one with highest absolute difference from the expected expression of control data for that gene.
• Dataset 3. We created dataset 3 using dataset 1. We used the control group of dataset
1 as the control group of dataset 3. Then, we changed the expression values of some
of the randomly selected genes to model the primary effect of external perturbation.
From that perturbed dataset, we simulated the secondary effects using the sigmoid
method described in Garg et al. [22]. We denote the parameter for primary perturbation effect by deviation. Deviation is the ratio of the change of expression value Δx of a gene to its original expression value x (i.e.
• Dataset 4. We create this dataset from dataset 1 in two steps as follows.
 Selection of genes. In order to carry out experiments on larger scale data with known PDR genes, we generated data in the presence of a hypothetical perturbation from the real datasets as follows. We first select three sets of genes. Each set consists of some primarily affected genes and a higher number of secondarily affected genes. Here, we describe how we construct each of the three sets of affected genes. We first select a random gene from the network and label it as a primarily affected DE gene. We then traverse its outgoing neighbors in a breadth first search manner. As we visit a gene during traversal, we label it as a secondarily affected DE gene with a probability of 1 − (1 − q)^{η}, where η is the number of incoming DE neighbors. Here q is the probability that a gene is DE due to a DE predecessor (0.4 in our experiments). We repeat these steps to create the desired number of primarily affected genes.
After we obtain the three set of genes, we assign one set to both D_{A }and D_{B }groups. We assign the other two sets of genes to different groups. These two set of genes are differentially regulated as they are affected in only one group and not in the other. The three groups can contain different number of primarily and secondarily affected genes. We call these three sets of genes as primarily differentially regulated, secondarily differentially regulated and equally regulated genes.
 Generation of gene expression. Once we identify these three sets of genes in the two groups, we create control and noncontrol data for D_{A }and D_{B }over N samples. We use the control part of the real dataset in Smirnov et al. as the control part of our synthetic dataset in both D_{A }and D_{B }[20]. To generate the noncontrol dataset, we traverse each of the genes that participate in the gene networks. Consider a gene g_{i }with mean and standard deviation of expression in the control dataset given by µ_{i }and σ_{i }respectively.
If the gene is EE we generate its noncontrol data points from the a normal distribution
given by the parameters (µ_{i},
To summarize, we used the same variance in the noncontrol group as that in the control group. However, for an affected gene we changed the value of the mean in the noncontrol group from that in the control group. For a primarily affected gene we applied a higher deviation of mean than that of the secondarily affected genes.
Regulatory networks
We collected 24,663 genetic interactions from the 105 regulatory and signaling pathways of KEGG database [23]. Overall 2,335 genes belong to at least one pathway in KEGG. In our model, we considered only the genes that take part in the gene networks.
Comparison to other methods
Our method provides us a list of differentially regulated genes. We sort the list of those genes as follows. Consider a DR gene g_{i}, which is DE in D_{A }and EE in D_{B}. We calculate the likelihood of being EE in D_{A }and DE in D_{B }for that gene. We can interpret this step as the probability of being DR, but in a reverse way. We could instead use the probability that the gene is DE in D_{A }and EE in D_{B}. However, according to our observation, the earlier metric provides a much better accuracy. We sort all the DR genes with increasing order of that likelihood.
As per our knowledge, no other method exists that differentiates between the primary and secondary effects in a twogroup perturbation experiment. There exist some studies in identifying primarily affected genes in single group datasets. We compared the accuracy of CMRF to three such methods namely, SMRF, Student's t test and SSEM.
Experimental setup
Given an input dataset, using each of the four methods, we ranked all the genes. Highly ranked genes have higher chance of being a PDR according to each method. However, as other three methods are not tailored to solve this problem, we created separate ranking on D_{A }and D_{B}. Then, out of those two ranks, we created a unified rank of differentially regulated genes. We shall elaborate on this unified rank creation later. We, first, explain how we create ranks on individual groups D_{A }and D_{B }for other three methods.
• SMRF. We apply the SMRF to each group separately and obtain a set of differentially expressed genes. We sort the genes in decreasing order of joint likelihood with the metagene. A higher joint likelihood implies a higher chance of being primarily affected.
• SSEM. We train SSEM on the control dataset, where it learns the correlation between the genes. We test SSEM on the noncontrol dataset of each group, where it produces a rank for each single data point.
• Student's t test. We use the function called ttest2 from MATLAB. We apply it on every individual gene, where it takes control and noncontrol dataset as input and produces a pvalue as output. We assume that the null hypothesis corresponds that the gene is EE. So a substantially lower pvalue implies a higher chance of being primarily affected. We perform the test on all the genes and rank them according the increasing order of pvalues.
Now we describe how we create an unified ranking of differentially regulated genes for these three methods. We denote the ranks from data group D_{A }and D_{B }by R_{A }and R_{B }respectively. The unified rank is defined by R_{U }. We denote the number of genes in each rank to be ω_{A }and ω_{B }respectively. We scan both the ranks simultaneously from first position to ω = min(ω_{A}, ω_{B}). While scanning at the kth position, we denote the equally regulated set obtained till that position by Λ_{k }= R_{A }(1: k) ∩ R_{B }(1: k). We include R_{T }(k) to the unified rank R_{U }if R_{T }(k) ∉ Λ_{k}, T ∈ {A, B}. For SSEM we obtain a separate R_{U }for each data point. We average the accuracies over all these ranks.
Results
In this experiment we used dataset 3, that we have just described. To observe the accuracy of CMRF at varying degree of difficulties, we conducted experiments with four different values of deviation, namely, {0.5, 0.6, 0.7, 0.8}. However, we discuss only two of them in this paper (see Figure 3) since for other two parameters the results are similar. The results we discuss correspond to the cases when deviation = {0.6, 0.8}.
Figure 3. Comparison of CMRF with three other methods. Comparison of our method CMRF to SMRF, SSEM and ttest. The number of primarily differentially affected genes is 40. The values for deviation (maximum perturbation to the PDR genes) are 0.6 and 0.8. The figures indicate that CMRF outperforms SMRF, SSEM and ttest.
Figures 3(a) and 3(b) show the sensitivity of the four methods with the two deviation settings. The former one corresponds to the computationally harder case as the difference between the noncontrol groups of primarily and secondarily affected genes is small. As the deviation increases identifying primarily affected genes becomes easier. Form the figure, we observe that CMRF is significantly more accurate than the other three methods for all datasets consistently. It reaches almost 50% sensitivity (i.e., it can find around 1518 primarily affected genes out of 30) in the top 50 ranked genes, when the deviation is 0.6. On the other hand, its achieves a sensitivity of 0.6 when the deviation is 0.8. We obtained similar results for other deviations, which we do not discuss here. The method in SMRF reaches to 30% and 40% accuracy, however at a slower pace. The ttest reaches around 25% and 30% sensitivity at ranking position 50 for these two cases respectively. SSEM's sensitivity is below 0.1 for all experiments even within the top 50 positions.
We believe that there are three major factors for the success of our method over the other competing methods. First, the other methods do not simultaneously handle two groups of datasets and are not able to generate an unified ranking of differentially regulated genes. CMRF encompasses both groups in a single model and probabilistically determines the PDR genes. Hence, it is more shielded against the false positives introduced during the unification of ranking. Second, CMRF can successfully incorporate the gene interactions using MRFs while others ignore this information. Finally, in real perturbation experiments, multiple genes are often primarily affected. CMRF is capable of dealing with both large and small number of primarily affected genes, while performances of other methods deteriorate as the number of primarily affected genes grows. Thus, we conclude that our method is more suitable for real perturbation experiments.
Statistical significance experiment
The experiments in the last section enable us to compare the accuracy of CMRF with that of the other methods on synthetic datasets. We also wanted to evaluate the accuracy of CMRF on real dataset. However, we do not have any gold standard available that enlists true set of PDR genes. Hence, we conducted a set of statistical significance experiments to estimate the confidence of our accuracy. Specifically, we obtained the control data from a real dataset, perturbed it in a controlled way for a number of genes. We calculated the likelihood probabilities of those genes and created a distribution. We repeated this process with varying amount of perturbation. Finally, we executed CMRF on a real dataset and analyzed the result.
Results
We obtained the real dataset from drug response experiment conducted by Taylor et al [3], which is actually dataset 2. Apart from this real dataset, we create different versions of dataset 4 by varying d_{p }as {0.1, 0.2, 0.3,..., 3.0}. If d_{p }> 1.1, we set d_{s }to 1, otherwise d_{s }= 0.5 × d_{p}. Thus, we have 30 synthetic datasets in total. In every dataset, we fix the number of primarily and secondarily differentially regulated genes to 50 and 172 respectively. To decide whether a gene g_{i }is DR, when g_{i }is DE in D_{A }and EE in D_{B}, we define a nullhypothesis H_{0i}: g_{i }is DR, but in the reverse way, i.e. g_{i }is EE in D_{A }and DE in D_{B}. We calculate the likelihood of being EE in D_{A }and DE in D_{B }for that gene, as described. For gene g_{i}, we denote the log likelihood of accepting H_{0i }by LL_{i}. In every dataset, we create a box plot of the 50 LL_{i }values, as the number of DR genes in each dataset is 50. A lower value LL_{i }indicates that g_{i }has a higher probability of being differentially regulated.
Figure 4 illustrates the statistical significance of the experiments over the datasets with d_{p }= 1.2 to 2.0. The box plot demonstrates a relationship between the Pvalue and d_{p}. A higher value of d_{p }indicates a lower Pvalue and hence, a high chance of being PDR. We also observe that the variance of Pvalue increases with the increase of d_{p}.
Figure 4. Illustration of statistical significance test. Illustration of the statistical significance test. Box plot demonstrates the Pvalue of null hypothesis of the DR genes over synthetic dataset. From the plot we clearly conclude that a higher gap between the control and noncontrol group of a DR gene leads to a lower Pvalue. The genes with a lower Pvalue have a higher chance of being primarily differentially regulated.
We also executed CMRF on the real datasets without any modification. Interestingly, on the real dataset from Taylor et al. [3] (dataset 2), we did not obtain any genes as differentially regulated. A careful observation concludes that when both the number of data points and the gap d_{p }(i.e. the signal to noise ratio) is low, the coefficients γ_{6 }and γ_{7 }in the prior density become strong and all genes are identified as equally regulated. However, when either the number of data points or d_{p }is significantly high, the data can overcome the prior. In the current real dataset, the number of data points is only 33 and the gaps between the control and noncontrol group were less than 1.2 × σ. As a result, CMRF identifies no differentially regulated genes in the dataset. Thus, we can conclude that either there is not much difference between the two groups in the real data, or the data does not contain enough data points, so that our model can highlight difference between the two groups.
In Figure 4 we present the results for d_{p }= 1.2 to 2.0. Note, that for d_{p }< 1.2 × σ our model did not identify any DR genes. Here also, we attribute a similar reason for not finding any DR genes as both d_{p }and the number of data points are small. On the other hand, in Section Comparison to other methods, when we execute CMRF on synthetic datasets with 155 data points we were able to identify a substantial number of true PDR genes even with d_{p }= 0.02 × σ. To substantiate our conclusion that there exists little difference between the two groups in the real dataset, we conducted a set of permutation tests. We shuffled the two original groups to create new sets of data. We repeated this process for a number of times (40 in the present experiment) and executed CMRF on each of them. For every derived dataset, CMRF did not find any DR genes. Hence, this experiment bolsters the claim that there are no DR genes in the original real data.
An interesting question can be raised is that "is there indeed no DR genes in the real dataset from Taylor et al. [3]?" Another similar question can be "will our method be able to detect DR and PDR genes from similar other real datasets?" We believe that CMRF requires a bigger dataset for DR and PDR genes to be discovered. For example, CMRF is able to identify the DR and PDR genes from the synthetic dataset that contains substantially higher number of data points than that of the real dataset. Since the difference between control and noncontrol groups of a DE gene is small compared to the variance of the data points, it is difficult to detect that subtle effect of perturbation with a small dataset. For a small dataset, the prior due to third hypothesis becomes strong and the two corresponding parameters γ_{6 }and γ_{7 }assumes extreme values. Thus the support from data is not sufficient to overcome the prior and hence, the method is not able to identify the DR and PDR genes. There are two solutions to overcome this problem. First of them is to employ a bigger dataset. With the advancement of comparatively inexpensive and high throughput technologies bigger dataset are increasingly common nowadays. From that perspective, CMRF is supposed to perform more accurately in the near future. A second option to circumvent the problem is to restrict the growth of the two parameters γ_{6 }and γ_{7}. If we have knowledge about the values of these two parameters, we can assign then as input to the program and refrain from estimating their values. This will enable us to employ a comparatively noninformative prior which will be easier for the data to overcome. Also, we can use specific bound over those variables while estimating them to avoid them becoming stronger.
Conclusions
Microarray experiments often measure expressions of genes taken from sample tissues in the presence of external perturbations such as medication, radiation, or disease. Typically in such experiments, gene expressions are measured before and after the application of external perturbation.
In this paper, we solved the problem of finding primarily differentially regulated genes in the presence of external perturbations when the data is sampled from two groups. The probabilistic Bayesian method based on Markov Random Field incorporates dependency structure of the gene networks as the prior to the model. Experimental results on synthetic and real datasets demonstrated the superiority of CMRF compared to other simple techniques.
Methods
In this section we describe different components of CMRF. Section Notation and problem formulation describes the notation and formulates the problem. Section Overview of the solution provides a high level overview of the solution. Section Computation of the prior density function describes the calculation of the prior density function of MRF. Section Approximation of the objective function discusses the definition of a tractable objective function. Section Computation of likelihood density function discusses the calculation of the likelihood function. Finally, Section Objective function optimization describes the algorithm to optimize the objective function.
Notation and problem formulation
In this section, we describe our notation and formally define the problem. We define a Bayesian model for gene expression in a twogroup perturbation experiment. We classify the random variables of the model into two different groups, namely observed variables and hidden variables. We have the values for the observed variables, while we estimate the values of the hidden variables.
Observed variables
We define two sets of observed variables, one for microarray gene expression data and another for the neighborhood in the extended gene network.
• Microarray data. We denote the number of genes by M and the number of data points in the two groups D_{A }and D_{B }by N_{A }and N_{B }respectively. We represent the set of genes with
• Neighborhood variables. We use the term
Hidden variables
We define three sets of hidden variables, These variables govern the state of genes, regulations of genes and interactions among genes respectively.
• State variables. We use
• Regulation variables. We denote the regulation condition of gene g_{i }with Z_{i}. Table 1 enumerates different values of Z_{i }for the values of S_{Ai }and S_{Bi}. In this formulation, the cases Z_{i }= {2, 3} indicate that g_{i }is DR, whereas Z_{i }= {1, 4} indicate that g_{i }is ER. The metagene is guaranteed to be ER, since S_{A}_{0 }= S_{B}_{0 }= 1.
• Interaction variables. In order to govern the joint regulation states of genes g_{i }and g_{j }we define interaction variables
It is easy to see that the hidden variables follow a hierarchical structure. For instance, the value of Z_{i }depends on the values of S_{Ai }and S_{Bi}. Similarly, the value of X_{ij }depends on the values of Z_{i }and Z_{j}. Thus, the value of the dependent variable X_{ij }is based on the values of four independent variables S_{Ai}, S_{Bi}, S_{Aj }and S_{Bj }. Table 1 enumerates the values of Z_{i}, Z_{j }and X_{ij }for different values of S_{Ai}, S_{Bi}, S_{Aj }and S_{Bj }.
Table 1. Enumeration of the values of Z_{i}, Z_{j }and X_{ij}
It is worth noting that the different values that we assign to the hidden variables are categorical in nature.
Problem formulation
Let
A higher value of p (X_{0j }= {2, 3}·) indicates a higher probability of a gene g_{j }being PDR. Using the estimated values of p (X_{0j }·), ∀_{j }∈ {1, 2,... M}, we can create an ordered list of candidate PDR genes.
Overview of the solution
This section describes a high level overview of our approach to estimate p(X_{0j}·), ∀_{j }∈ {1, 2,... M}. One simple approach can be using a hypothesis test to find out the PDR genes in the given dataset [15]. However, the available hypothesis tests do not consider the interactions among genes in the gene network. Also, deciding on the significance of test can be a complex step. Another approach can be to use SSEM to create a rank of the potential primarily affected genes in each group separately [11]. Then we can select the top k genes in each group and perform a set difference to obtain the PDR genes. Though SSEM considers the correlation between the genes, it does not utilize any known information from the gene networks.
We build a Bayesian probabilistic method based on Markov Random Field where we leverage the information from gene networks as the prior belief of the model. Using Bayes theorem [24] we can write the joint probability density of interaction variables as,
The first term in the numerator,
Given the joint probability density function outlined in Equation 1, our original problem reduces to obtaining assignments for the interaction variables and the parameters θ_{X }and θ_{Y }that maximize it.
A Maximum Likelihood Estimation (MLE) of Equation 1 is practically infeasible even for a small number of genes since the number of terms in the denominator grows exponentially. Instead we use a pseudolikelihood version of the objective function as shown in Section Approximation of the objective function. We use Iterative Conditional Modes (ICM) [19] and Differential Evolution [25] in an alternating optimization technique to maximize the pseudolikelihood with respect to , θ_{X }and θ_{Y}.
After the optimization, we obtain an assignment for , θ_{X }and θ_{Y}. Using these assignments and the observed data, we estimate the posterior probability of all X_{ij }variables. Using the estimated values of p(X_{0j}·), ∀_{j }∈ {1, 2,... M}, we create an ordered list of candidate PDR genes. We elaborate on each of these steps next.
Figure 2 illustrates different portions of CMRF and the connectivity between them.
Computation of the prior density function
In this section, we describe how we incorporate gene network as the the prior belief into our Bayesian model. From the structure and properties of gene network, we build three hypotheses and embed them into our model. We present the entire concept in three numbered subsections.
1. Statement of hypotheses
Here we state the three hypotheses on the biological networks in brief.
• Hypothesis 1. In each group D_{T}(T ∈ {A, B}), the metagene g_{0 }can change the state of all the other genes. Thus, all the genes can be directly affected by the external perturbation.
• Hypothesis 2. In each group D_{T}(T ∈ {A, B}), a gene g_{i }can change the states of its outgoing neighbors g_{j }in the same data group, i.e. a gene can be indirectly affected by the perturbation through genetic interactions.
• Hypothesis 3. Each gene has a high probability of being equally regulated. This follows from the observation that, often the difference between the expressions of most of the genes in two groups is small. We expect that the response of genes in these groups is very similar.
Clearly, when the data does not follow one or more of the hypotheses, the optimization function can overcome the prior belief with a strong support from the data.
2. Markov Random Field construction
In order to compute the prior density function, we define a Markov Random Field (MRF)
over the variables [18]. MRF is a probabilistic model, where the state of a variable depends only on the
states of its neighbors. MRF is useful to model our problem as the states of genes
depend on their neighbors. Here, the MRF is an undirected graph
Figure 5. A hypothetical gene network and corresponding Markov random graph. (a) A small hypothetical gene network with perturbation in two datasets D_{A }and D_{B}. The genes in the two datasets interact through identical network, although they assume different states. The circle g_{0 }represents the abstraction of the external perturbation. Rectangles denote genes. → implies activation and ⊣ implies inhibition. The potential effect of metagene to all other genes is indicated by dotted arrows from the metagene to all the other genes. For example, g_{1 }is primarily affected in D_{A}, but not affected in D_{B}. g_{2 }is primarily affected in both the datasets. g_{3 }is secondarily affected in both D_{A }and D_{B}. (b) The Markov Random Field graph constructed based on the small hypothetical gene network in (a). The numbers in the parenthesis are the expected assignments to the variables based on the states of the genes in (a). Nodes with dotted boundaries indicate that those nodes are required for completeness of the model, however the corresponding interactions do not exist.
One important point to note is that, this graph does not use the state variables or the regulation variables to model the dependencies between the genes. Rather, it establishes those dependencies over the variables. For example, in Figure 5(b) we draw the MRF graph corresponding to the hypothetical gene network in Figure 5(a). In the gene network, there is an edge from g_{2 }to g_{3}. So, g_{2 }can potentially change the state of g_{3}. We create an edge from X_{12 }to X_{13 }that corresponds to the edge from g_{2 }to g_{3}. As g1 is common for X_{12 }and X_{13}, if they assume the same value (i.e. X_{12 }= X_{13}), it implies that the genes g_{2 }and g_{3 }are in same state (i.e. S_{T}_{2 }= S_{T}_{3},T ∈ {A, B}). We formulate these dependency constraints using a set of unary and binary functions called feature functions. We discuss these feature functions next.
3. Development of feature functions
We denote the neighbors of X_{ij }in the MRF graph as
We define seven feature functions to capture the dependencies among the variables in 1471216413S2S5 according to the three hypotheses.
Unary feature functions
F_{1}, F2, F3. A primary component of the prior density function is modeling the frequency of X_{ij }itself. Here, we focus on two values of X_{ij }namely X_{ij }= {2, 3}, since they correspond to the events that a gene g_{j }is DR due to the metagene g_{0}. When X_{ij }= 2, g_{j }is DE in D_{A }and EE in D_{B}. To capture this, we define a feature function F_{1}(X_{ij}) which returns one when X_{ij }= 2. It returns zero otherwise. Similarly, X_{ij }= 3 when g_{j }is EE in D_{A }and DE in D_{B}. We define another feature function F_{2}(X_{ij}), which returns one when X_{ij }= 3. We capture all the other values of X_{ij }by a feature function called F_{3}(X_{ij}). It returns zero when X_{ij }∈ {2, 3} and equals to one otherwise. Table 2 enumerates the the domains and ranges of F_{1}, F_{2 }and F_{3}.
Table 2. Feature functions
Binary feature functions
F_{4}, F_{5}. Let ϒ represent the hypothesis that in a group D_{T },T ∈ {A, B} a gene g_{j }including the metagene can change the state of one of its outgoing neighbors g_{k}. We make a stronger hypothesis ϒ°that, ϒ holds simultaneously in D_{A }and D_{B }with high probability. Note that, this stronger hypothesis is based on the assumption that the genes in both D_{A }and D_{B }express in a similar fashion. This assumption is meaningful as in these twogroup perturbation experiments the different groups belong to similar biological conditions [3].
ϒ°is encoded in 1471216413S2S5 domain as follows. Consider four genes g_{p}, g_{i}, g_{j }and g_{k}, such that g_{p }→ g_{i}, g_{i }→ g_{j }and g_{j }→ g_{k}. Here → indicates that the gene on the left activates or inhibits the gene on the right. By definition, (X_{pj}, X_{ij}) and (X_{ij }, X_{ik}) are edges in the MRF. Note that the first edge corresponds to an incoming neighbor g_{p }of g_{i}, while the second edge corresponds to an outgoing neighbor g_{k }of g_{j }. We discriminate between these two sets of neighbors of X_{ij }, as they are related to the incoming neighbors of g_{i }and outgoing neighbors of g_{j }respectively. It can be shown that, for the first set of edges, X_{pj }equals to X_{ij }if and only if (iff) Z_{p }= Z_{i}, i.e. ϒ°holds true. Similarly, for the second set of edges X_{ij }equals to X_{ik }iff Z_{j }= Z_{k}, which in tern implies that ϒ°is satisfied.
We define two sets of feature functions to formalize these equalities based on the incoming neighbors of g_{i }and the outgoing neighbors of g_{j}.
• Left external equality. We denote the incoming neighbors of g_{i }with In (g_{i}). We write a feature function f_{4}(X_{pj }, X_{ij}), ∀_{p}, g_{p }∈ In (g_{i}). f_{4}(X_{pj }, X_{ij}) = 1 if Z_{i }= Z_{p }and W_{pi }= W_{ij }= 1. Otherwise, f_{4}(X_{pj}, X_{ij}) = 0. We denote the summation of this function over all the incoming neighbors of g_{i }as,
• Right external equality. We denote the outgoing neighbors of g_{j }as Out (g_{j}). We define a feature function f_{5}(X_{ik}, X_{ij }), ∀_{k}, g_{k }∈ Out (g_{j}). f_{5 }(X_{ik}, X_{ij}) = 1 if S_{k }= S_{j }and W_{jk }= W_{ij }= 1. Otherwise, f_{5 }(X_{ik}, X_{ij }) = 0. We denote the summation of this function over all the outgoing neighbors of g_{j }as,
Tables 3 and 4 enumerate the values of f_{4 }and f_{5 }for different values of X_{ij}. The missing entries in these tables correspond to the cases which can not occur during the optimization. For instance, in Table 3, a missing entry corresponds to different values of Z_{j }in X_{ij }and X_{pj }which is not possible.
For feature functions f_{4 }and f_{5}, X_{pj }or X_{ik }may not represent any interactions from the extended gene network when W_{pj }= 0 or W_{ik }= 0 respectively. We represent them by dotted rectangles in Figure 5(b).
Unary feature functions
F_{6}, F_{7}. We introduce two unary feature functions to incorporate our last hypothesis, that all genes are ER with a high probability. We consider two genes g_{i }and g_{j }such that g_{i }→ g_{j}. This hypothesis holds true, if g_{i }is equally regulated or g_{j }is equally regulated.
• Left internal equality. We define this feature function to capture the events when g_{i }is equally regulated. As, g_{j }can assume any state, this feature function holds true for eight different values of X_{ij }. We denote the feature function by f_{6}(X_{ij}, t) that returns one if its two arguments are equal and zero otherwise. We denote the summation of this functions over all these eight values of X_{ij }as,
• Right internal equality. We define this feature function to capture the events when g_{j }is equally regulated. As, g_{i }can assume any state, this feature function holds true for eight different values of X_{ij}. We denote the feature function by f_{7}(X_{ij}, t) that returns one if its two arguments are equal and zero otherwise. We denote the summation of this functions over all these eight values of X_{ij }as,
The last two columns of Table 2 enumerate these two internal equalities.
Based on these feature functions, we define the joint density function of 1471216413S2S5 as,
In the above equation γ_{k}, k ∈ {1, 2,... 7} are the coefficients of the seven feature functions in MRF.
In the next section, we discuss how we approximate the objective function of the MRF and the data. We also describe how we formulate the posterior probability density function for X_{ij}.
Approximation of the objective function
A direct maximization of the objective function given by Equation 1 is intractable, as it requires evaluation of exponential number of terms in the denominator. We employ pseudolikelihood as an established substitute to Equation 1 [27]. Pseudolikelihood is the simple product of the conditional probability density function of the X_{ij }variables. Geman et al. proved the consistency of the maximum pseudolikelihood estimate [28]. The approximated objective function can be written as,
The posterior density function F_{ij }of X_{ij }as,
Derivation of F_{ij}.
In step 2 of the derivation, we substitute by Y_{Ai}, Y_{Bi}, Y_{Aj }and Y_{Bj }as X_{ij }is independent of all Y_{Ck }such that k ≠ {i, j} and C ≠ {A, B}. Also, in the 15th step we assume that X_{ij }is independent of θ_{Y }given
Derivation of
A (X_{ij}) is
□
There are two different terms in objective function of Equation 4.
Computation of likelihood density function
In this section, we describe how we derive the likelihood function in three numbered subsections. Here, we assume that gene expressions in a group follow a normal distribution, We can rewrite the derivations if gene expressions follow some other distribution.
1. Likelihood for a single gene
Consider a set of measurements for a gene g_{i }that follows a single Gaussian distribution by z_{i }= {z_{i}_{1}, z_{i}_{2},..., z_{iN}}. We denote the latent mean of z_{i }as µ and the standard deviation as σ. As different genes can have different average expressions, we assume that µ follows a genome wise distribution with mean µ_{0 }and standard deviation τ [29]. Thus, for z_{i}, the likelihood for the data points in that group is given by,
The derivation of Equation 5 can be obtained from Demichelis et al [30]. If a gene is DE, its expression measurements in control and noncontrol groups follow different distributions [29]. On the other hand, for equally expressed genes, all the measurements in both groups share the same mean. The likelihood function for a DE gene g_{i }in group D_{T },T ∈ {A, B} is given by,
Similarly, for EE genes it is given by,
For instance, the likelihood of a gene to be DE in group D_{A }is given by
2. Likelihood for a regulation variable
As for a gene g_{i}, the regulation variable Z_{i }can assume four different values from 1 to 4, the equations of the likelihood that a gene is DR or ER also take four different forms given by,
2. Likelihood for an interaction variable
We have 16 different forms for the likelihood of the X_{ij }due to its 16 different values. However, here, we shall derive only for X_{ij }= 1, as for the other values of X_{ij }we have a similar derivation.
From the definition of
In a similar way, we can derive the likelihood functions for all the 16 different
values of X_{ij }variables. A special case arises when g_{i }is the metagene, i.e. g_{0}. We assume that
Objective function optimization
So far, we have described how we compute the posterior density function. The final challenge is to find the values of the hidden variables that maximize the objective function (Equation 3). We develop an iterative algorithm to address this challenge.
In our model we have three different sets of parameters. The nodes of the MRF given by 1471216413S2S5 consist of one set. Other two sets are the parameters of conditional probability density function of X_{ij }and likelihood function of observed data given by θ_{X }= {γ_{1},... γ_{7}} and θ_{Y }= {µ_{0}, σ, τ), respectively. In each iteration, we first estimate θ_{X }and θ_{Y }based on the estimated value of 1471216413S2S5 in the previous iteration. Next, based on the estimated parameters, we estimate 1471216413S2S5 that maximize the objective function in Equation 3.
The likelihood function is nonconvex in terms of the parameters θ_{Y }= {µ_{0}, σ, τ). Also, the conditional density is nonconvex in terms of θ_{X }= {γ_{1},... γ_{7}}. We use a global optimization method called differential evolution to optimize both of them [25]. To optimize the objective function in Equation 3, we employ the ICM algorithm described by Besag [19]. Briefly, our iterative algorithm works as follows.
1. Obtain an initial estimate of variables. In our implementation we use student's ttest assuming the data follows normal distribution. We use 5% confidence interval for this purpose.
2. Estimate parameters θ_{Y }that maximizes the data likelihood function given by,
We implement this step using Differential Evolution, which is similar to the genetic algorithm.
3. Calculate an estimate of the parameters θ_{X }that maximizes the conditional prior density function by,
We also implement this step using Differential Evolution.
4. Carry out a single cycle of ICM using the current estimate of , θ_{X }and θ_{Y}. For all S_{i }, maximize
5. Go to step 2 for a fixed number of cycles or until 1471216413S2S5 converges to a certain predefined value.
We optimize the objective function in terms of the S_{i }(1 ≤ i ≤ M) variables instead of X_{ij }variables. Specifically, in step 4, we go over all the S_{i }variables, and optimize F_{ij }function (given by Equation 4) for only those X_{ij }variables that are impacted by the change of S_{i}. Figure 2 illustrates different components of CMRF and the connectivity between them.
The optimization procedure is guaranteed to converge since in every iteration the value of the objective function increases. We continue the iterative process, until the changes in estimates of the parameters between two consecutive iterations reach below a certain cutoff level.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
NB conceived the study, analyzed the data, implemented the methods, supplied the analysis tools, designed the experiments, performed the experiments and wrote the paper. MS conceived the study and participated in writing the paper. SR conceived the study, designed the experiments and participated in writing the paper. TK conceived the study, designed the experiments and participated in writing the paper.
Acknowledgements
This work was supported partially by NSF under grants CCF0829867 and IIS0845439.
This article has been published as part of BMC Genomics Volume 13 Supplement 2, 2012: Selected articles from the First IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 2011): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S2.
References

Cheng R, Zhao A, Alvord W, Powell D, Bare R, Masuda A, Takahashi T, Anderson L, Kasprzak K: Gene expression doseresponse changes in microarrays after exposure of human peripheral lung epithelial cells to nickel(II).
Toxicol Appl Pharmacol 2003, 191:2239. PubMed Abstract  Publisher Full Text

Ideker T, Thorsson V, Ranish J, Christmas R, Buhler J, Eng J, Bumgarner R, Goodlett D, Aebersold R, Hood L: Integrated genomic and proteomic analyses of a systematically perturbed metabolic network.
Science 2001, 292(5518):92934. PubMed Abstract  Publisher Full Text

Taylor K, PenaHernandez K, Davis J, Arthur G, Duff D, Shi H, Rahmatpanah F, Sjahputera O, Caldwell C: Largescale CpG methylation analysis identifies novel candidate genes and reveals methylation hotspots in acute lymphoblastic leukemia.
Cancer Res 2007, 67(6):261725. PubMed Abstract  Publisher Full Text

Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey E, Dai H, He YD, Kidd MJ, King AM, Meyer MR, Slade D, Lum PY, Stepaniants SB, Shoemaker DD, Gachotte D, Chakraburtty K, Simon J, Bard M, Friend SH: Functional discovery via a compendium of expression profiles.
Cell 2000, 102:109126. PubMed Abstract  Publisher Full Text

Marton MJ, Derisi JL, Bennett HA, Iyer VR, Meyer MR, Roberts CJ, Stoughton R, Burchard J, Slade D, Dai H, Bassett DE, Hartwell LH, Brown PO, Friend SH: Drug target validation and identification of secondary drug target effects using DNA microarrays.
Nat Med 1998, 4(11):12931301. PubMed Abstract  Publisher Full Text

Giaever G, Shoemaker DD, Jones TW, Liang H, Winzeler EA, Astromoff A, Davis RW: Genomic profiling of drug sensitivities via induced haploinsufficiency.
Nature Genetics 1999, 21(3):278283. PubMed Abstract  Publisher Full Text

Giaever G, Flaherty P, Kumm J, Proctor M, Nislow C, Jaramillo DF, Chu AM, Jordan MI, Arkin AP, Davis RW: Chemogenomic profiling: Identifying the functional interactions of small molecules in yeast.
Proceedings of the National Academy of Sciences of the United States of America 2004, 101(3):793798. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lum P, Armour C, Stepaniants S, Cavet G, Wolf M, Butler J, Hinshaw J, Garnier P, Prestwich G, Leonardson A, GarrettEngele P, Rush C, Bard M, Schimmack G, Phillips J, Roberts C, Shoemaker D: Discovering modes of action for therapeutic compounds using a genomewide screen of yeast heterozygotes.
Cell 2004, 116:12137. PubMed Abstract  Publisher Full Text

Parsons AB, Brost RL, Ding H, Li Z, Zhang C, Sheikh B, Brown GW, Kane PM, Hughes TR, Boone C: Integration of chemicalgenetic and genetic interaction data links bioactive compounds to cellular target pathways.
Nature Biotechnology 2003, 22:6269. PubMed Abstract  Publisher Full Text

di Bernardo D, Thompson MJ, Gardner TS, Chobot SE, Eastwood EL, Wojtovich AP, Elliott SJ, Schaus SE, Collins JJ: Chemogenomic profiling on a genomewide scale using reverseengineered gene networks.
Nature Biotechnology 2005, 23(3):377383. PubMed Abstract  Publisher Full Text

Cosgrove EJ, Zhou Y, Gardner TS, Kolaczyk ED: Predicting gene targets of perturbations via networkbased filtering of mRNA expression compendia.
Bioinformatics 2008, 24(21):24822490. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bandyopadhyay N, Somaiya M, Kahveci T, Ranka S: Modeling perturbations using gene networks.

Conesa A, Nueda MJ, Ferrer A, Talón M: maSigPro: a method to identify significantly differential expression profiles in timecourse microarray experiments.
Bioinformatics 2006, 22(9):10961102. PubMed Abstract  Publisher Full Text

Hong F, Li H: Functional Hierarchical Models for Identifying Genes with Different TimeCourse Expression Profiles.
Biometrics 2006, 62(2):534544. PubMed Abstract  Publisher Full Text

Chuan Tai Y, Speed TP: On Gene Ranking Using Replicated Microarray Time Course Data.
Biometrics 2009, 65:4051. PubMed Abstract  Publisher Full Text

Angelini C, De Canditiis D, Pensky M: Bayesian models for twosample timecourse microarray experiments.
Computational Statistics & Data Analysis 2009, 53(5):15471565. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Van Deun K, Hoijtink H, Thorrez L, Van Lommel L, Schuit F, Van Mechelen I: Testing the hypothesis of tissue selectivity: the intersectionunion test and a Bayesian approach.
Bioinformatics 2009, 25(19):25882594. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Li SZ: Markov Random Field Modeling in Image Analysis. 3rd edition. Springer Publishing Company, Incorporated; 2009.

Besag J: On the statistical analysis of dirty pictures.
Journal of the Royal Statistical Society 1986, 48(3):259302.

Smirnov D, Morley M, Shin E, Spielman R, Cheung V: Genetic analysis of radiationinduced changes in human gene expression.
Nature 2009, 459(7246):58791. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Dausset J, Cann H, Cohen D, Lathrop M, Lalouel J, White R: Centre d'etude du polymorphisme humain (CEPH): collaborative genetic mapping of the human genome.
Genomics 1990, 6(3):5757. PubMed Abstract  Publisher Full Text

Garg A, Mendoza L, Xenarios I, DeMicheli G: Modeling of multiple valued gene regulatory networks.
Conf Proc IEEE Eng Med Biol Soc 2007, 2007:1398404. PubMed Abstract  Publisher Full Text

Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes.
Nucleic Acids Res 2000, 28:2730. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bishop CM: Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: SpringerVerlag New York, Inc; 2006.

Storn R, Price K: Differential evolution — a simple and efficient heuristic for global optimization over continuous spaces.
Journal of Global Optimization 1997, 11(4):341359. Publisher Full Text

Hammersley JM, Clifford P: Markov fields on finite graphs and lattices.

Besag J: Efficiency of pseudolikelihood estimation for simple Gaussian fields.
Biometrika 1977, 64(3):616618. Publisher Full Text

Geman S, Graffigne C: Markov random field image models and their applications to computer vision.
Proceedings of the International Congress of Mathematics: Berkley 1986, 14961517.

Kendziorski C, Newton M, Lan H, Gould M: On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles.
Stat Med 2003, 22(24):3899914. PubMed Abstract  Publisher Full Text

Demichelis F, Magni P, Piergiorgi P, Rubin M, Bellazzi R: A hierarchical Naive Bayes Model for handling sample heterogeneity in classification problems: an application to tissue microarrays.
BMC Bioinformatics 2006, 7:514. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text