Background
Microarray technology has made it possible to investigate the expression levels of thousands of genes simultaneously. At the same time, it presents a challenging statistical problem because a large number of transcripts is surveyed with small sample sizes. A fundamental statistical problem in microarray gene expression data analysis is the need to reduce the dimensionality of the transcripts. A common approach for dimensionality reduction is the identification of differentially expressed (DE) genes under different conditions or groups. By associating differential expressions with the genotypes of molecular markers, useful information on the regulatory network can be obtained [1-4]. By assigning DE genes to the list of gene sets, it is possible to obtain a useful biological interpretation [5,6]. Further, because the number of DE genes that influence a certain phenotype may be large while their relative proportion is usually small, it is challenging to identify these DE genes from among the large number of recorded genes [7-14]. Two main types of statistical inference have been used for the identification of DE genes: (1) classical parametric (for example, t-test, F-test, likelihood ratio test) and nonparametric [13,15-18] procedures; and (2) empirical Bayes (EB) parametric [8-12,14,19-22] and nonparametric [23,24] procedures. In general, classical procedures detect the DE genes using p-values (significance levels) either estimated by permutation or based on the distribution of a test statistic, while EB procedures use the posterior probability of differential expression for the identification of DE genes.
Classical parametric testing procedures (like the t, F or χ² test) may produce misleading results when they are used directly to determine DE genes, because these methods strongly depend on the sample size and normality of the expression data [2,17,25-28]. EB hierarchical models have gradually become more popular than classical methods for identification of DE genes because these models explicitly specify the distribution of the gene-specific mean expression levels and the distribution of the expression profiles around the means. EB approaches detect a DE gene by sharing information across the whole genome; such approaches also work well for small sample sizes. A popular EB approach using a hierarchical gamma-gamma (GG) model [11] was developed for the identification of DE genes. The model was extended [8] to replicate chips with multiple conditions and a new option of using a hierarchical lognormal-normal (LNN) model was introduced. The GG and LNN models were both developed under the assumption of a constant coefficient of variation across genes. However, this assumption is not very realistic and it can negatively affect the resulting inference. To overcome this problem, both models were extended assuming gene-specific variances [29]. It has been shown that the extended versions of both the GG and LNN models outperform previous versions of GG and LNN [8,11] as well as the nonparametric SAM (significance analysis of microarrays) model [17]. A different version of the extended EB-LNN model that assumes gene-specific variances [30] is also available. The performance of the EB-LNN model has been investigated using several normalization techniques [1]. Most of the algorithms described above are not robust against outliers. Some recent studies have reported that the assumption of normality does not hold for most of the existing microarray data [31,32]. One of the causes for the breakdown of the normality assumption for gene expression data may be data contamination by outliers. The cDNA microarray data are often contaminated by outliers that arise because of the many steps that are involved in the experimental process from hybridization to image analysis. A few Bayesian parametric approaches [32-35] for the robust identification of DE genes are available; however, the identification of contaminating genes or irregular patterns of expression has never been discussed. When one of these Bayesian parametric approaches is used, it is difficult to scrutinize or diagnose contaminating DE genes in reduced gene expression datasets. As a result, any further statistical investigation, such as the clustering/classification of the genes in the reduced gene expression dataset, may produce misleading results.
To overcome this problem, we developed a β-empirical Bayes (β-EB) approach as an extension of the EB-LNN model [8,30] assuming gene-specific variances for the identification of DE genes. The β-EB model is a unique parametric approach because not only is it robust against outliers, but it also detects contaminating genes and statistically diagnoses gene expression profiles. These features may significantly improve any further statistical analysis of gene expression data, such as clustering/classification. The β-EB method was developed based on the β-divergence estimation that was proposed by Basu et al. [36] and fully described later by Minami and Eguchi [37]. It was shown that the minimization of β-divergence is equivalent to maximizing the weighted (quasi) likelihood, which we have called β-likelihood. The proposed β-EB method introduces a β-weight function that produces smaller weights for contaminating genes and larger weights for uncontaminated genes to obtain weighted estimates for the model parameters. Because contaminating genes are down-weighted in this way, the inference becomes robust. The value of β, which controls the balance between robustness and efficiency, is selected by maximizing the predictive $\beta_0$-likelihood. When the dataset satisfies the model assumptions and does not include outliers, β may be selected to be 0. On the other hand, when the model is misspecified or when the data include outliers, the selected β may be positive.
Here, we introduce the β-weight distribution as a sensor that detects outliers or the misspecification of the model. When β-weights outside the range of the predicted distribution are observed, a detailed inspection of the data is conducted. Microarray data offer a unique opportunity to investigate the distribution of the β-weights because the data represent the expression of a large number of genes. By contrasting the observed distribution of the weights with the predicted distribution, it is possible to detect outliers and to diagnose the hierarchical model statistically. Although we have introduced a Gaussian model in this paper, the β-likelihood-based approach could still be applied for robustification of any likelihood-based estimation of statistical models, and this feature may serve as a useful tool for genome data analysis.
Methods
Here, we describe the extension of the EB-LNN model assuming gene-specific variances [8,30] by β-divergence, which we have called the β-EB approach, for the identification of DE genes. The simulated and real microarray gene expression datasets that we analyzed to investigate the performance of the proposed method are also described.
Empirical Bayes hierarchical model
If the transcript-specific parameter is $\theta_t = (\mu_t, \theta_t^{*})$, where $\mu_t$ and $\theta_t^{*}$ are the location and scale parameters respectively, then the conditional likelihood of the t-th transcript's expression measurement $y_t = (y_{t1}, y_{t2}, \ldots, y_{tn})$ can be expressed as $\prod_{i=1}^{n} f_{\mathrm{obs}}(y_{ti} \mid \theta_t)$ (t = 1, 2, …, T). The location parameter $\mu_t$ follows the prior distribution $\Pi(\mu_t \mid \theta)$, where $\theta$ is the hyperparameter specifying the prior distribution. The predictive likelihood of $y_t$ (unconditional on the location parameter $\mu_t$) is obtained by integrating over the location parameter $\mu_t$ as follows:

$$f_0(y_t \mid \theta, \theta_t^{*}) = \int \prod_{i=1}^{n} f_{\mathrm{obs}}(y_{ti} \mid \mu_t, \theta_t^{*}) \, \Pi(\mu_t \mid \theta) \, d\mu_t. \tag{1}$$
When expression measurements between two groups (for example, different cell types) are compared for transcript t, the measurements are partitioned into two user-defined groups $G_1$ and $G_2$ of sizes $n_1$ and $n_2$ respectively, where $n_1 + n_2 = n$. If there is no significant difference between the means of the two groups, the gene is assumed to be equivalently expressed (EE); otherwise, it is assumed to be a DE gene. If the t-th transcript is DE, the two groups will have different mean expression levels, $\mu_t^{(j)}$, j = 1, 2. Given the values of $\mu_t^{(j)}$, j = 1, 2, and $\theta_t^{*}$, the conditional likelihood of $y_t = (y_t^{(1)} : y_t^{(2)})$ is written as follows:

$$f_1(y_t \mid \mu_t^{(1)}, \mu_t^{(2)}, \theta_t^{*}) = \prod_{i=1}^{n_1} f_{\mathrm{obs}}(y_{ti} \mid \mu_t^{(1)}, \theta_t^{*}) \times \prod_{i'=1}^{n_2} f_{\mathrm{obs}}(y_{ti'} \mid \mu_t^{(2)}, \theta_t^{*}), \tag{2}$$
because the components of $y_t$ are independent of each other. Assuming that the group means $\mu_t^{(j)}$, j = 1, 2 (such that $\mu_t^{(1)} \neq \mu_t^{(2)}$) independently originate from $\Pi(\mu_t \mid \theta)$, the predictive likelihood of $y_t$ (unconditional on the location parameters $\mu_t^{(j)}$, j = 1, 2) is obtained as the mean of the conditional likelihood (2) of $y_t$ over the prior distribution of $\mu_t^{(1)}$ and $\mu_t^{(2)}$ as follows:

$$\begin{aligned}
f_1(y_t \mid \theta, \theta_t^{*}) &= \iint f_1(y_t \mid \mu_t^{(1)}, \mu_t^{(2)}, \theta_t^{*}) \, \Pi(\mu_t^{(1)} \mid \theta) \, \Pi(\mu_t^{(2)} \mid \theta) \, d\mu_t^{(1)} \, d\mu_t^{(2)} \\
&= \int \prod_{i=1}^{n_1} f_{\mathrm{obs}}(y_{ti} \mid \mu_t^{(1)}, \theta_t^{*}) \, \Pi(\mu_t^{(1)} \mid \theta) \, d\mu_t^{(1)} \times \int \prod_{i'=1}^{n_2} f_{\mathrm{obs}}(y_{ti'} \mid \mu_t^{(2)}, \theta_t^{*}) \, \Pi(\mu_t^{(2)} \mid \theta) \, d\mu_t^{(2)} \\
&= f_0(y_t^{(1)} \mid \theta, \theta_t^{*}) \, f_0(y_t^{(2)} \mid \theta, \theta_t^{*}).
\end{aligned} \tag{3}$$
Because it is unknown whether the t-th gene is EE or DE between the two groups, the final likelihood of $y_t$ (unconditional on the location parameters) becomes a mixture of the two distributions (1) and (3) as follows:

$$f(y_t \mid \theta, \theta_t^{*}, p_0) = p_0 \, f_0(y_t \mid \theta, \theta_t^{*}) + p_1 \, f_1(y_t \mid \theta, \theta_t^{*}). \tag{4}$$

Here, $p_0$ and $p_1$ are the mixing proportions of the EE and DE transcripts in the two user-defined groups respectively, such that $p_0 + p_1 = 1$. The posterior probability of differential expression (PPDE) is calculated by Bayes' rule using the estimates of $p_0$, $f_0$ and $f_1$ as follows:

$$\frac{p_1 \, f_1(y_t \mid \theta, \theta_t^{*})}{p_0 \, f_0(y_t \mid \theta, \theta_t^{*}) + p_1 \, f_1(y_t \mid \theta, \theta_t^{*})}. \tag{5}$$

It should be noted here that $\theta$ and $\theta_t^{*}$ are assumed to be exactly the same in equations (1)-(5).
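As an illustration of equation (5), the following minimal sketch evaluates the PPDE of a single transcript from the two predictive densities and the mixing proportion. The function name `ppde` and the choice to work with log densities (to avoid numerical underflow when n is large) are assumptions made for this example and are not part of the original method description.

```python
import math


def ppde(log_f0, log_f1, p1):
    """Posterior probability of differential expression, as in equation (5).

    log_f0, log_f1 : log predictive densities of one transcript under the
                     EE and DE patterns (hypothetical inputs).
    p1             : mixing proportion of DE transcripts, 0 < p1 < 1.
    """
    log_ee = math.log(1.0 - p1) + log_f0   # log(p0 * f0)
    log_de = math.log(p1) + log_f1         # log(p1 * f1)
    # Subtract the larger term before exponentiating to avoid underflow.
    m = max(log_ee, log_de)
    ee = math.exp(log_ee - m)
    de = math.exp(log_de - m)
    return de / (ee + de)


# When both patterns fit equally well, the PPDE equals the prior proportion p1.
print(ppde(0.0, 0.0, 0.05))
```

A transcript whose DE density is much larger than its EE density receives a PPDE close to 1 even when p1 is small.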
Maximum β-likelihood estimation of mixture distribution using an EM-like algorithm to calculate β-posterior probabilities of differential expressions
Box and Cox [38] proposed a family of power transformations of the dependent variable in regression analysis to robustify the normality assumption. By choosing an appropriate value of λ in the transformation,

$$g_\lambda(y) = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda} & (\lambda > 0) \\ \log y & (\lambda = 0), \end{cases}$$

the standard linear regression model with the normality assumption fits well to a wide range of data. Inspired by this idea, Basu et al. [36] and Minami and Eguchi [37] proposed a robust and efficient method for estimating the model parameter $\theta$ by minimizing a density power divergence in a general framework of statistical modeling and inference. They [36,37] have also shown that the minimizer of the density power divergence is equivalent to the maximizer of the β-likelihood function. For the problem considered in this paper, the β-likelihood function for $\theta$, given the values of the mixing parameter $p_0 = 1 - p_1$ and the gene-specific scale parameter $\theta_t^{*}$ for all t, can be written as

$$L_\beta(\theta \mid y) = \frac{1}{T\beta} \sum_{t=1}^{T} f^{\beta}(y_t \mid \theta, \theta_t^{*}, p_0) - l_\beta(\theta), \tag{6}$$

where f(·) is the mixture of distributions as defined in (4) and

$$l_\beta(\theta) = \frac{1}{1+\beta} \int f^{\beta+1}(y \mid \theta, \theta_t^{*}, p_0) \, dy - \frac{\beta - 1}{\beta},$$

which is independent of the observations. Because the gradient of (6) can be written as follows,
$$\frac{\partial}{\partial\theta} L_\beta(\theta \mid y) = \frac{1}{T} \sum_{t=1}^{T} f^{\beta}(y_t \mid \theta, \theta_t^{*}, p_0) \, \frac{\partial}{\partial\theta} \log f(y_t \mid \theta, \theta_t^{*}, p_0) - \frac{\partial}{\partial\theta} l_\beta(\theta), \tag{7}$$

the maximum β-likelihood estimator (β-MLE) of $\theta$ can be regarded as a weighted (quasi) likelihood estimator. The weight of gene t is a power function of its likelihood, $f^{\beta}(y_t \mid \theta, \theta_t^{*}, p_0)$, where f(·) is defined by equation (4). Thus, genes with low likelihoods, that is, genes with unexpected expression patterns, receive low weights, because the normal density decreases as an observation moves away from the mean. By assigning low weights to outliers, the inference becomes robust. It is obvious from (7) that the β-MLE reduces to the classical MLE for β = 0. Because the expression pattern (EE or DE) of each gene is unknown, it is difficult to optimize either the classical log-likelihood function or the proposed β-likelihood function directly to estimate $\theta$. To overcome this problem, we use an EM-like algorithm to obtain the β-MLE of $\theta$, treating the mixture distribution (4) as an incomplete-data density.
The hyperparameters $\theta$ and the mixing proportion $p_0$ are estimated by the EM algorithm in two steps. E-step: Compute the Q-function, which is defined as the conditional expectation of the complete-data β-likelihood with respect to the conditional distribution of the missing data (Z) given the observed data (Y) and the current estimated parameter value $\theta_\beta^{(j)}$, as follows:

$$Q_\beta(\theta \mid \theta_\beta^{(j)}) = \frac{1}{T\beta} \sum_{t=1}^{T} \sum_{k=0}^{1} \left\{ p_k \, f_k(y_t \mid \theta, \hat{\theta}_t^{*}) \right\}^{\beta} \times \pi_{tk}^{(j)} - \lambda_\beta(\theta), \tag{8}$$
where k = 0 if $y_t$ belongs to the EE pattern and k = 1 if $y_t$ belongs to the DE pattern. Here

$$\lambda_\beta(\theta) = \frac{1}{1+\beta} \int \sum_{k=0}^{1} \left\{ p_k \, f_k(y \mid \theta, \hat{\theta}^{*}) \right\}^{1+\beta} dy - \frac{\beta - 1}{\beta},$$

which does not depend on the observations,

$$\pi_{tk}^{(j)} = \frac{p_k^{(j)} \, f_k(y_t \mid \theta_\beta^{(j)}, \hat{\theta}_t^{*})}{\sum_{k'=0}^{1} p_{k'}^{(j)} \, f_{k'}(y_t \mid \theta_\beta^{(j)}, \hat{\theta}_t^{*})}, \quad (k = 0, 1)$$

is the posterior probability of the k-th pattern for gene t, and the value of $p_1 = 1 - p_0$ is updated by a separate EM formulation as follows:
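$$p_1^{(j+1)} = \begin{cases} \left[ \left( \dfrac{\sum_{t=1}^{T} f_1^{\beta}(y_t \mid \theta_\beta^{(j)}, \hat{\theta}_t^{*}) \, \pi_{t1}^{(j)}}{\sum_{t=1}^{T} f_0^{\beta}(y_t \mid \theta_\beta^{(j)}, \hat{\theta}_t^{*}) \, \pi_{t0}^{(j)}} \right)^{\frac{1}{\beta - 1}} + 1 \right]^{-1}, & \text{for } \beta > 0 \\[2ex] \dfrac{1}{T} \sum_{t=1}^{T} \pi_{t1}^{(j)}, & \text{for } \beta = 0. \end{cases} \tag{11}$$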
For $\beta \to 0$, the proposed Q-function $Q_\beta(\theta \mid \theta^{(j)})$ reduces to the standard Q-function $Q(\theta \mid \theta^{(j)})$ of the standard empirical Bayes approaches [8,30].
M-step: Find $\theta^{(j+1)}$ by maximizing the proposed Q-function defined in (8). The EM iterations are continued until successive estimates of $\theta$ converge. The estimate of $\theta$ after convergence is taken to be the β-MLE of $\theta$ according to the EM properties.
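The update of the mixing proportion in equation (11) can be sketched as follows. The sketch assumes that equation (11) has the form $[(A_1/A_0)^{1/(\beta-1)} + 1]^{-1}$ with $A_k = \sum_t f_k^{\beta}(y_t)\,\pi_{tk}^{(j)}$, and that $0 \le \beta < 1$; the function name `update_p1` and its list-based inputs are hypothetical and chosen only for illustration.

```python
def update_p1(f0, f1, post1, beta):
    """One update of the DE mixing proportion p1, following equation (11).

    f0, f1 : predictive densities f_0(y_t) and f_1(y_t), one per transcript.
    post1  : current posterior probabilities pi_t1 (pi_t0 = 1 - pi_t1).
    beta   : tuning parameter, assumed to satisfy 0 <= beta < 1.
    """
    n_transcripts = len(post1)
    if beta == 0:
        # Classical EM update: average posterior probability of the DE pattern.
        return sum(post1) / n_transcripts
    # Weighted sums A_1 and A_0; each density enters through its beta-th power.
    a1 = sum((f ** beta) * p for f, p in zip(f1, post1))
    a0 = sum((f ** beta) * (1.0 - p) for f, p in zip(f0, post1))
    ratio = (a1 / a0) ** (1.0 / (beta - 1.0))
    return 1.0 / (ratio + 1.0)


# Toy values for four transcripts (illustrative only).
f0 = [0.2, 0.1, 0.3, 0.05]
f1 = [0.01, 0.3, 0.02, 0.4]
post1 = [0.1, 0.9, 0.05, 0.95]
print(update_p1(f0, f1, post1, beta=0.0))
print(update_p1(f0, f1, post1, beta=0.2))
```

For β close to 0, the first branch of the update approaches the classical average of the posterior probabilities, which is consistent with the β = 0 case of equation (11).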
The tuning parameter β controls the balance between the robustness and efficiency of the estimators. By setting a tentative value for $\beta_0$, the optimal value is estimated by maximizing the predictive $\beta_0$-likelihood via five-fold cross validation. The dataset is divided into five subsets by transcripts. For each value of β, the predictive $\beta_0$-likelihood of each subset is calculated based on the maximum β-likelihood estimates of the parameters obtained from the rest of the data. Finally, the β value that maximizes the average predictive $\beta_0$-likelihood is selected as the optimal value of β. For more information about β-selection, please see [39,40].
Then, based on the estimated values of the model parameters, we can compute the PPDE between the two groups of $y_t$ using equation (5) for all t. However, the PPDE of a contaminated gene computed using equation (5) might be misleading, since the PPDE of $y_t$ depends on both the parameter estimates and the measurements in $y_t$. To overcome this problem, we detect contaminated genes using the β-weight function and replace the contaminated measurements in $y_t$ by their group means. Then we compute the PPDE of the contaminated $y_t$ using equation (5) as well. We call the PPDE based on the β-MLE the β-PPDE in this paper. The computation of the β-PPDE under the LNN model is discussed in detail below.
The LNN model
In this paper, we use the LNN (lognormal-normal) hierarchical model for computing the posterior probability of differential expression. In the LNN model, log-transformed gene expression measurements are assumed to follow a normal distribution for each gene with the transcript-specific parameter $\theta_t = (\mu_t, \theta_t^{*})$, where $\mu_t$ is the transcript-specific mean and $\theta_t^{*} = \sigma_t^2$ is the transcript-specific variance for gene t [8,30]. A conjugate normal prior with some underlying mean $\mu_0$ and variance $\tau_0^2$ is assumed for $\mu_t$; that is, $\Pi(\mu_t \mid \theta) \sim N(\mu_0, \tau_0^2)$, where $\theta = (\mu_0, \tau_0^2)$. By integrating as in (1), the density $f_0(\cdot)$ for an n-dimensional input becomes Gaussian with the mean vector $\boldsymbol{\mu}_0 = (\mu_0, \mu_0, \ldots, \mu_0)^{\top}$ and an exchangeable covariance matrix as follows:

$$\Sigma_{tn} = (\sigma_t^2) I_n + (\tau_0^2) M_n,$$

where $I_n$ is an n × n identity matrix and $M_n$ is an n × n matrix of ones.
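Because the covariance matrix is exchangeable, the Gaussian density $f_0$ can be evaluated without forming or inverting an n × n matrix. The following minimal sketch uses the Sherman-Morrison identity, $\Sigma_{tn}^{-1} = \sigma_t^{-2}\,[I_n - \tau_0^2 M_n / (\sigma_t^2 + n\tau_0^2)]$, and $\det \Sigma_{tn} = \sigma_t^{2(n-1)}(\sigma_t^2 + n\tau_0^2)$. The function name `lnn_log_f0` and its argument names are assumptions made for this example.

```python
import math


def lnn_log_f0(y, mu0, tau0_sq, sigma_sq):
    """Log density of N(mu0 * 1, sigma_sq * I_n + tau0_sq * M_n) at the vector y.

    y        : log-expression measurements of one transcript (list of floats).
    mu0      : prior mean of the transcript-specific mean.
    tau0_sq  : prior variance of the transcript-specific mean.
    sigma_sq : transcript-specific variance.
    """
    n = len(y)
    d = [v - mu0 for v in y]
    s1 = sum(d)                  # sum of deviations from mu0
    s2 = sum(v * v for v in d)   # sum of squared deviations from mu0
    total = sigma_sq + n * tau0_sq
    # Quadratic form d' Sigma^{-1} d via the Sherman-Morrison identity.
    quad = (s2 - tau0_sq * s1 * s1 / total) / sigma_sq
    # log det(Sigma) = (n - 1) * log(sigma_sq) + log(sigma_sq + n * tau0_sq)
    log_det = (n - 1) * math.log(sigma_sq) + math.log(total)
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)


# Illustrative call with hyperparameter values used for the simulations.
print(lnn_log_f0([2.1, 1.9, 2.0, 2.2], mu0=2.0, tau0_sq=3.0, sigma_sq=0.1))
```

Under equation (3), the log predictive density of a DE transcript is obtained by applying the same function to the two groups separately and adding the results.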
The gene-specific variance $\sigma_t^2$ is computed separately, assuming a scaled inverse-$\chi^2(\nu_*, \sigma_*^2)$ prior distribution for $\sigma_t^2$, where $\nu_*$ is the degrees of freedom and $\sigma_*^2$ is the scale parameter. Yang et al. [30] proposed that $\sigma_t^2$ could be estimated by a Bayes estimator defined as

$$\hat{\sigma}_t^2 = \frac{\hat{\nu}_* \hat{\sigma}_*^2 + (n_1 + n_2 - 2)\, \tilde{\sigma}_t^2}{n_1 + n_2 + \hat{\nu}_* - 2}, \tag{12}$$

where

$$\tilde{\sigma}_t^2 = \frac{(n_1 - 1)\, \tilde{\sigma}_{t1}^2 + (n_2 - 1)\, \tilde{\sigma}_{t2}^2}{n_1 + n_2 - 2}$$

is the pooled sample variance, with

$$\tilde{\sigma}_{tg}^2 = \sum_{i=1}^{n_g} \left( y_{ti}^{(g)} - \bar{y}_t^{(g)} \right)^2 / (n_g - 1)$$

as the sample variance in group g = 1, 2. By viewing the pooled sample variances $\tilde{\sigma}_t^2$ as a random sample from the prior distribution of $\sigma_t^2$, the estimates $(\hat{\nu}_*, \hat{\sigma}_*^2)$ of $(\nu_*, \sigma_*^2)$ are obtained using the method of moments. However, it is obvious that (12) will be very sensitive to outliers. Therefore, we have used a maximum β-likelihood estimation of $\sigma_{tg}^2$, which is highly robust against outliers [39] and can be obtained iteratively as follows:
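$$\mu_{tg}^{(j+1)} = \frac{\sum_{i=1}^{n_g} \psi_\beta\!\left( y_{ti}^{(g)} \mid \mu_{tg}^{(j)}, \sigma_{tg}^{2(j)} \right) y_{ti}^{(g)}}{\sum_{i=1}^{n_g} \psi_\beta\!\left( y_{ti}^{(g)} \mid \mu_{tg}^{(j)}, \sigma_{tg}^{2(j)} \right)}, \tag{13}$$

$$\sigma_{tg}^{2(j+1)} = \frac{\sum_{i=1}^{n_g} \psi_\beta\!\left( y_{ti}^{(g)} \mid \mu_{tg}^{(j)}, \sigma_{tg}^{2(j)} \right) \left( y_{ti}^{(g)} - \mu_{tg}^{(j)} \right)^2}{\sum_{i=1}^{n_g} \psi_\beta\!\left( y_{ti}^{(g)} \mid \mu_{tg}^{(j)}, \sigma_{tg}^{2(j)} \right)}, \tag{14}$$

where

$$\psi_\beta\!\left( y_{ti}^{(g)} \mid \mu_{tg}, \sigma_{tg}^2 \right) = \exp\left\{ -\frac{\beta}{2} \left( \frac{y_{ti}^{(g)} - \mu_{tg}}{\sigma_{tg}} \right)^2 \right\}$$

is the β-weight function for estimating the robust mean and variance, which produces an almost zero or very small weight for $y_{ti}$ if it is an outlying/extreme observation.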
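The iteration in equations (13) and (14) can be sketched for a single group of one transcript as follows. This is a minimal illustration that follows the updates as written above; the starting values (ordinary sample mean and variance), the stopping rule, the default β = 0.2, and the function names are assumptions made for this example.

```python
import math


def psi_beta(y, mu, var, beta):
    """Beta-weight of one measurement: exp{-(beta / 2) * (y - mu)^2 / var}."""
    return math.exp(-0.5 * beta * (y - mu) ** 2 / var)


def robust_mean_var(ys, beta=0.2, tol=1e-10, max_iter=200):
    """Iterate equations (13) and (14) until the estimates stabilise."""
    n = len(ys)
    # Assumed starting point: the ordinary sample mean and sample variance.
    mu = sum(ys) / n
    var = sum((y - mu) ** 2 for y in ys) / (n - 1)
    for _ in range(max_iter):
        w = [psi_beta(y, mu, var, beta) for y in ys]
        sw = sum(w)
        new_mu = sum(wi * y for wi, y in zip(w, ys)) / sw
        # As in (14), deviations are taken from the current mean mu^(j).
        new_var = sum(wi * (y - mu) ** 2 for wi, y in zip(w, ys)) / sw
        converged = abs(new_mu - mu) < tol and abs(new_var - var) < tol
        mu, var = new_mu, new_var
        if converged:
            break
    return mu, var


# Nine measurements near 2.0 and one extreme value (illustrative data).
ys = [2.1, 1.9, 2.0, 2.2, 1.8, 2.05, 1.95, 2.1, 1.9, 20.0]
print(sum(ys) / len(ys))      # ordinary mean, pulled towards the outlier
print(robust_mean_var(ys))    # robust mean stays close to the bulk of the data
```

In this toy example the weight of the extreme measurement shrinks towards zero over the iterations, so it has almost no influence on the final estimates.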
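To estimate the hyperparameters $\theta = (\mu_0, \tau_0^2)$ by maximizing the proposed Q-function (8) in the M-step, we compute the gradient of $Q_\beta(\theta \mid \theta^{(j)})$ with respect to $\theta$, which is given by

$$\frac{\partial}{\partial\theta} Q_\beta(\theta \mid \theta^{(j)}) = \frac{1}{T} \sum_{t=1}^{T} \sum_{k=0}^{1} \left\{ p_k \, f_k(y_t \mid \theta, \hat{\sigma}_t^2) \right\}^{\beta} \times \frac{\partial}{\partial\theta} \log \left\{ p_k \, f_k(y_t \mid \theta, \hat{\sigma}_t^2) \right\} \times \pi_{tk}^{(j)} - \frac{\partial}{\partial\theta} \lambda_\beta(\theta). \tag{15}$$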
This reduces to the gradient of the standard Q-function, $\frac{\partial}{\partial\theta} Q(\theta \mid \theta^{(j)})$, based on the log-likelihood function for β = 0. The second term on the right-hand side of equation (15) is independent of the observations; the first term is the weighted gradient of $Q(\theta \mid \theta^{(j)})$ with the weight function $\{ p_k \, f_k(y_t \mid \theta, \hat{\sigma}_t^2) \}^{\beta}$. This weight function produces a smaller weight if the t-th gene is contaminated by outliers; otherwise, it produces a comparatively larger weight for the t-th gene, regardless of whether it is EE (k = 0) or DE (k = 1). Therefore, contaminated genes have little influence on the estimates, and robust estimates of the parameters can be obtained. To make it convenient to choose the threshold weight for identifying contaminated genes statistically, we define the β-weight function for gene t as follows:

$$\phi_\beta(y_t \mid \hat{\theta}, \hat{\sigma}_t^2, k) \propto \left[ p_k \, f_k(y_t \mid \hat{\theta}, \hat{\sigma}_t^2) \right]^{\beta},$$
where the circumflex above a parameter indicates the proposed estimate of the parameter. Excluding the normalization constant, the β-weight function corresponding to an EE gene becomes

$$\phi_\beta(y_t \mid \hat{\theta}, \hat{\sigma}_t^2, k = 0) = \exp\left\{ -\frac{\beta}{2} (y_t - \hat{\boldsymbol{\mu}}_0)^{\top} \hat{\Sigma}_{tn}^{-1} (y_t - \hat{\boldsymbol{\mu}}_0) \right\}, \tag{17}$$
which measures the deviation of each gene expression data vector from the grand mean vector for the expression of all the genes in the dataset. The β-weight function corresponding to a DE gene becomes

$$\phi_\beta(y_t \mid \hat{\theta}, \hat{\sigma}_t^2, k = 1) = \exp\left\{ -\frac{\beta}{2} \left[ \left( y_t^{(1)} - \hat{\boldsymbol{\mu}}_0^{(1)} \right)^{\top} \hat{\Sigma}_{tn_1}^{-1} \left( y_t^{(1)} - \hat{\boldsymbol{\mu}}_0^{(1)} \right) + \left( y_t^{(2)} - \hat{\boldsymbol{\mu}}_0^{(2)} \right)^{\top} \hat{\Sigma}_{tn_2}^{-1} \left( y_t^{(2)} - \hat{\boldsymbol{\mu}}_0^{(2)} \right) \right] \right\}, \tag{18}$$

where $\hat{\boldsymbol{\mu}}_0^{(1)} = (\hat{\mu}_0, \hat{\mu}_0, \ldots, \hat{\mu}_0)^{\top}$ and $\hat{\boldsymbol{\mu}}_0^{(2)} = (\hat{\mu}_0, \hat{\mu}_0, \ldots, \hat{\mu}_0)^{\top}$ are the grand mean vectors, and $\hat{\Sigma}_{tn_1} = (\hat{\sigma}_t^2) I_{n_1} + (\tau_0^2) M_{n_1}$ and $\hat{\Sigma}_{tn_2} = (\hat{\sigma}_t^2) I_{n_2} + (\tau_0^2) M_{n_2}$ are the exchangeable covariance matrices in the two user-defined groups. Both β-weight functions defined by equations (17) and (18) for genes t = 1, 2, …, T produce weights that are between 0 and 1 for any data vector $y_t$.
This is because both weight functions are negative exponential functions of the squared Mahalanobis distance (MD), defined by $\mathrm{MD}_t = (y_t - \hat{\boldsymbol{\mu}}_0)^{\top} \hat{\Sigma}^{-1} (y_t - \hat{\boldsymbol{\mu}}_0) \geq 0$, between the data vector $y_t$ and the mean vector $\hat{\boldsymbol{\mu}}_0$. From equations (17) and (18), the β-weight for gene t decreases when $\mathrm{MD}_t$ increases and increases when $\mathrm{MD}_t$ decreases. That is, the β-weight for gene t becomes smaller (approaching 0) when $y_t$ is contaminated by outliers, and larger (up to 1) when it is not contaminated.
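The two β-weight functions in equations (17) and (18) can be sketched as follows. The squared Mahalanobis distance is evaluated in closed form for the exchangeable covariance matrix, using the identity $\Sigma^{-1} = \sigma^{-2}[I_n - \tau_0^2 M_n/(\sigma^2 + n\tau_0^2)]$. All function and argument names are assumptions made for this example, and the parameter values in the usage lines are illustrative rather than estimates from real data.

```python
import math


def mahalanobis_sq(y, mu0, tau0_sq, sigma_sq):
    """Squared Mahalanobis distance of y from mu0 * 1 under
    Sigma = sigma_sq * I_n + tau0_sq * M_n (closed form, no matrix inverse)."""
    n = len(y)
    d = [v - mu0 for v in y]
    s1 = sum(d)
    s2 = sum(v * v for v in d)
    return (s2 - tau0_sq * s1 * s1 / (sigma_sq + n * tau0_sq)) / sigma_sq


def beta_weight_ee(y, mu0, tau0_sq, sigma_sq, beta):
    """Beta-weight of an EE gene, equation (17)."""
    return math.exp(-0.5 * beta * mahalanobis_sq(y, mu0, tau0_sq, sigma_sq))


def beta_weight_de(y1, y2, mu0, tau0_sq, sigma_sq, beta):
    """Beta-weight of a DE gene, equation (18): the two groups contribute
    separate Mahalanobis distances that are added in the exponent."""
    md = (mahalanobis_sq(y1, mu0, tau0_sq, sigma_sq)
          + mahalanobis_sq(y2, mu0, tau0_sq, sigma_sq))
    return math.exp(-0.5 * beta * md)


clean = [2.1, 1.9, 2.0, 2.2]
contaminated = [2.1, 1.9, 2.0, 20.0]
print(beta_weight_ee(clean, 2.0, 3.0, 0.1, beta=0.2))
print(beta_weight_ee(contaminated, 2.0, 3.0, 0.1, beta=0.2))
```

The contaminated vector receives a weight that is many orders of magnitude smaller than the clean vector, which is the behaviour used to flag contaminated genes.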
The large number of transcripts in microarray data enables a statistical investigation of the observed distribution of the β-weights compared to the predicted distribution under the assumption that the model is correct and the data are free from outliers. To investigate this further, we start with the case where the predicted distribution can be obtained theoretically. When the normality assumptions hold, there are no outliers, and the gene-specific variance $\sigma_t^2$ is known, the cumulative distribution of the β-weight $w_t = \phi_\beta(y_t \mid \theta, \sigma_t^2, k = 0)$ for an EE gene t becomes

$$G_t(w_0) = \Pr\{ w_t \leq w_0 \} = \Pr\left\{ \exp\left[ -\frac{\beta}{2} (y_t - \boldsymbol{\mu}_0)^{\top} \Sigma_{tn}^{-1} (y_t - \boldsymbol{\mu}_0) \right] \leq w_0 \right\} = 1 - P_{\chi^2_n}\!\left( -\frac{2}{\beta} \log w_0 \right),$$

which implies that $w_t$ has the density $\frac{2}{\beta \times w_0} \, p_{\chi^2_{(n)}}\!\left( -\frac{2}{\beta} \log w_0 \right)$ for $0 < w_0 \leq 1$, where $P_{\chi^2_n}$ and $p_{\chi^2_{(n)}}$ denote the cumulative distribution function and the density of the chi-square distribution with n degrees of freedom, both evaluated at $-\frac{2}{\beta} \log w_0$. Similarly, for DE genes (18), the β-weight $w_t = \phi_\beta(y_t \mid \theta, \sigma_t^2, k = 1)$ also has the density $\frac{2}{\beta \times w_0} \, p_{\chi^2_{(n = n_1 + n_2)}}\!\left( -\frac{2}{\beta} \log w_0 \right)$ for $0 < w_0 \leq 1$, by the additive property of $\chi^2$ distributions.
In many cases, however, the variance is unknown. For such cases, the distribution of the β-weights is obtained by parametric bootstrapping. Thus, we can statistically examine whether or not a gene is contaminated by outliers using either one of the two β-weight functions, because both weight functions follow the same distribution and show similar trends for the observed weights of both gene expression patterns (DE and EE). Here, the t-th gene is defined as contaminated by outliers if

$$w_t = \phi_\beta(y_t \mid \hat{\theta}, \sigma_t^2, k = 1) < w_0 = \xi_p,$$

where $\xi_p$ is the p-quantile of the β-weights defined by

$$\Pr\left\{ \phi_\beta(y_t \mid \hat{\theta}, \sigma_t^2, k = 1) < \xi_p \right\} \leq p.$$
Heuristically, we choose $p = 10^{-5}$ for the detection of contaminating genes. Then we compute the β-PPDE using equation (5) after updating the measurements in the contaminated genes. To compute the β-PPDE for a contaminated gene expression vector, say $y_t = (y_t^{(1)} : y_t^{(2)})$, by equation (5), we replace the contaminated measurements in $y_t^{(g)}$ with the robust mean $\hat{\mu}_{tg}$ obtained iteratively using equation (13). Here $y_{ti}^{(g)}$ is taken to be a contaminated measurement of $y_t^{(g)}$ in group g = 1, 2 if

$$\psi_\beta\!\left( y_{ti}^{(g)} \mid \hat{\mu}_{tg}, \hat{\sigma}_{tg}^2 \right) < \alpha_p,$$

where $\alpha_p$ is the p-quantile of the β-weights defined by

$$\Pr\left\{ \psi_\beta\!\left( y_{ti}^{(g)} \mid \hat{\mu}_{tg}, \hat{\sigma}_{tg}^2 \right) < \alpha_p \right\} \leq p.$$

Here $\psi_\beta(y_{ti}^{(g)} \mid \mu_{tg}, \sigma_{tg}^2)$ is the β-weight function that is used to compute the robust mean and variance (14), which has the density $\frac{2}{\beta \times w_0} \, p_{\chi^2_{(1)}}\!\left( -\frac{2}{\beta} \log w_0 \right)$ for $0 < w_0 \leq 1$, where $p_{\chi^2_{(1)}}$ denotes the density of the chi-square distribution with 1 degree of freedom. However, an arbitrary threshold ($\alpha_0 = 0.2$) can also be set to detect contaminated measurements with weights that are below the threshold, because weights are close to zero for outlying/extreme observations.
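For the idealised case with known variance, the gene-level threshold $\xi_p$ can be computed directly from the chi-square form of the β-weight distribution given above. The sketch below assumes that n = n1 + n2 is even, so that the chi-square survival function has the closed form $e^{-x/2}\sum_{k=0}^{n/2-1}(x/2)^k/k!$, and finds the quantile by bisection on $\log w_0$. The function names and the bisection bounds are assumptions made for this example; when the variance is unknown, the text above uses parametric bootstrapping instead.

```python
import math


def weight_cdf(w0, beta, n):
    """Pr{w_t <= w0}: chi-square survival function with n degrees of
    freedom evaluated at -(2 / beta) * log(w0). Closed form for even n only."""
    if n % 2 != 0:
        raise ValueError("the closed form used here requires an even n")
    if w0 <= 0.0:
        return 0.0
    if w0 >= 1.0:
        return 1.0
    half_x = -math.log(w0) / beta   # x / 2 with x = -(2 / beta) * log(w0)
    term, total = 1.0, 1.0
    for k in range(1, n // 2):
        term *= half_x / k          # (x / 2)^k / k!
        total += term
    return math.exp(-half_x) * total


def weight_quantile(p, beta, n, iters=200):
    """p-quantile xi_p of the beta-weights, by bisection on log(w0)."""
    lo, hi = -2000.0, 0.0           # assumed search bounds for log(w0)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if weight_cdf(math.exp(mid), beta, n) < p:
            lo = mid
        else:
            hi = mid
    return math.exp(0.5 * (lo + hi))


# Threshold for n1 = n2 = 10 and beta = 0.2 at p = 1e-5 (illustrative values).
xi = weight_quantile(1e-5, beta=0.2, n=20)
print(xi, weight_cdf(xi, beta=0.2, n=20))
```

Genes whose observed β-weight falls below the returned value would be flagged as contaminated under these assumptions.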
Simulated data that were used to examine the performance of the β-EB approach
The β-EB approach that we developed detected a large proportion of outliers with p-values less than $10^{-5}$. In the microarray data of head and neck cancer, 1.75% of the genes were outliers; in the lung cancer data, 13.75% were outliers; and in Arabidopsis thaliana, 16.59% were outliers in the empirical data analysis. A detailed inspection of the outliers detected in the lung cancer data reflected misspecification of the model. To investigate the effect of outliers and model misspecification, we conducted a numerical simulation in which we compared the performance of the proposed β-EB approach with the t-test, linear models for microarray data (Limma) [22], SAM [17], and other EB approaches (EB-LNN, eGG [29], eLNN [29], GaGa [21]). The t-test, Limma, and SAM detect DE genes based on p-values, while the EB procedures and the β-EB approach detect DE genes based on posterior probabilities. Therefore, to compare the methods on a common scale, we calculated the AUC (area under the curve) and pAUC (partial area under the curve) of the ROC curves. We also compared the estimated proportion of DE genes obtained using the β-EB and EB approaches. This characteristic plays an important role, especially when the aim of the study is to identify the major regulatory elements that influence the expression of a large number of genes. The EB approaches estimate the proportion of DE genes by the mean posterior probability. The β-EB approach estimates it by using equation (11). No reasonable procedure to calculate the proportion of DE genes for the t-test, Limma and SAM methods could be found, because, in these methods, the estimate depends on the threshold chosen for the p-values.
Simulated gene expression profiles with and without outliers
We generated 50 datasets that roughly reflect the head and neck cancer data described in the empirical data analysis below. Each dataset contained measurements of 1,000 genes, and 50 out of the 1,000 genes were DE ($p_1 = 0.05$). The log-transformed expression was assumed to follow a normal distribution. The mean log-expression level of a gene followed a normal distribution with the mean $\mu_0 = 2.0$ and the variance $\tau_0^2 = 3.0$. The gene-specific variance $\sigma_t^2$ of the log-expression level was drawn from an exponential distribution with a mean of $\sigma^2 = 0.10$.
We considered two scenarios with different proportions of contaminating genes (10%, 20%), and two scenarios with two patterns of outliers (mild outliers: $\mu'_{ti} = 5\mu_{ti}$; extreme outliers: $\mu'_{ti} = 10\mu_{ti}$). To estimate the dependence of the performance on the sizes of the groups, we considered two more scenarios with different group sizes (moderate/large ($n_1 = n_2 = 30$) and small ($n_1 = n_2 = 10$)).
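The outlier-free part of this simulation design can be sketched as follows. The sketch draws the gene-specific means, variances and measurements as described above; it does not inject outliers, because the contamination rule is only summarised in the text. The function name, the seed argument, and the choice to make the first `n_de` genes the DE genes are assumptions made for this example.

```python
import random


def simulate_dataset(n_genes=1000, n_de=50, n1=10, n2=10,
                     mu0=2.0, tau0_sq=3.0, mean_sigma_sq=0.10, seed=0):
    """Simulate log-expression profiles under the LNN model without outliers.

    Returns a list of (group1, group2) measurement lists, one per gene, and
    a parallel list of booleans marking the DE genes.
    """
    rng = random.Random(seed)
    tau0 = tau0_sq ** 0.5
    data, is_de = [], []
    for t in range(n_genes):
        de = t < n_de   # assumed convention: the first n_de genes are DE
        # Gene-specific variance from an exponential distribution with the
        # given mean (expovariate takes the rate, i.e. 1 / mean).
        sigma = rng.expovariate(1.0 / mean_sigma_sq) ** 0.5
        mu1 = rng.gauss(mu0, tau0)
        # DE genes get an independent second group mean; EE genes share one.
        mu2 = rng.gauss(mu0, tau0) if de else mu1
        group1 = [rng.gauss(mu1, sigma) for _ in range(n1)]
        group2 = [rng.gauss(mu2, sigma) for _ in range(n2)]
        data.append((group1, group2))
        is_de.append(de)
    return data, is_de


data, is_de = simulate_dataset()
print(len(data), sum(is_de))
```

Changing `n1` and `n2` to 30 gives the moderate/large group-size scenario.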
Simulated gene expression profiles from misspecified model
To show how the β-weight can be used for model diagnosis, we generated the expression of each of the 1,000 genes in the dataset from a gene-specific gamma distribution. The shape parameter followed a log-normal distribution with location parameter 1 and scale parameter 1. The scale parameter of the gamma distribution was set to 0.067. The LNN model was applied to these data. When the shape parameter is large, a gamma distribution can be approximated by a log-normal distribution; however, when the shape parameter is small, especially when it is smaller than 1, the gamma distribution has a heavy mass near 0 and it cannot be approximated by a log-normal distribution. In our simulation scenario, the proportion of transcripts with a shape parameter < 1 was 0.159. We used the dataset that contained the measurements of 1,000 genes with 30 samples in each of the two groups. The measurements for 50 out of 1,000 genes were DE ($p_1 = 0.05$). The gene-specific variance (scale) of the log-expression level among genes varied according to the gamma distribution.
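The stated proportion of transcripts with a shape parameter below 1 can be checked from the log-normal distribution function. The sketch assumes that the location and scale parameters are the mean and standard deviation of the logarithm of the shape parameter; the function name is hypothetical.

```python
import math


def lognormal_cdf(x, location, scale):
    """Pr{X <= x} for X log-normal, where log(X) ~ N(location, scale^2)."""
    z = (math.log(x) - location) / scale
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Proportion of shape parameters below 1 when location = 1 and scale = 1.
print(round(lognormal_cdf(1.0, 1.0, 1.0), 3))  # 0.159
```

Under this reading of the parameters, the value equals the standard normal probability of falling below −1, which agrees with the proportion 0.159 reported above.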
The empirical data
Head and neck cancer data
The publicly available microarray data from the study of head and neck cancer [41] were used in this study. Most head and neck cancers are squamous cell carcinomas (HNSCC), originating from the mucosal lining (epithelium) of these regions. The data consist of the expression levels of 12,625 cellular RNA transcripts in the tumor and normal tissues from 22 patients with histologically confirmed HNSCC.
Lung cancer data
The publicly available microarray data from the study of two types of lung cancer [42] were used in this study. Non-small cell lung cancer (NSCLC) is the most common bronchial tumor. It has been classified into two major histological subtypes, adenocarcinoma (AC) and squamous cell carcinoma (SCC). After quality assessment of 60 microarray hybridizations, the data represent the gene expression profiles of 54,675 cellular RNA transcripts in 40 AC and 18 SCC samples [42].
Arabidopsis thaliana expression data
The published preprocessed expression data for 22,810 probe sets on the Affymetrix Arabidopsis ATH1 (25K) array across 1,436 hybridization experiments [43] were analyzed in the present study. The data included a high-density haplotype map of the Arabidopsis Bay-0 × Sha RIL population (211 RILs), using 578 single feature polymorphism (SFP) markers. Data obtained from TAIR (The Arabidopsis Information Resource: http://www.arabidopsis.org/) included the complete genome sequence, the gene structure, and gene product information.