Abstract
Background
Growing interest in biological pathways has called for new statistical methods for modeling and testing the effect of a genetic pathway on a health outcome. Because genes within a pathway tend to interact with each other and to relate to the outcome in a complicated way, nonparametric methods are desirable. The kernel machine method provides a convenient, powerful and unified approach for multi-dimensional parametric and nonparametric modeling of the pathway effect.
Results
In this paper we propose a logistic kernel machine regression model for binary outcomes. This model relates the disease risk to covariates parametrically, and to genes within a genetic pathway parametrically or nonparametrically using kernel machines. The nonparametric genetic pathway effect allows for possible interactions among the genes within the same pathway and a complicated relationship of the genetic pathway and the outcome. We show that kernel machine estimation of the model components can be formulated using a logistic mixed model. Estimation hence can proceed within a mixed model framework using standard statistical software. A score test based on a Gaussian process approximation is developed to test for the genetic pathway effect. The methods are illustrated using a prostate cancer data set and evaluated using simulations. An extension to continuous and discrete outcomes using generalized kernel machine models and its connection with generalized linear mixed models is discussed.
Conclusion
Logistic kernel machine regression and its extension, generalized kernel machine regression, provide a novel and flexible statistical tool for modeling pathway effects on discrete and continuous outcomes. Their close connection to mixed models and attractive performance promise wide application in bioinformatics and other biomedical areas.
Background
The rapid progress in gene expression array technology in the past decade has greatly facilitated our understanding of the genetic basis of various diseases. Knowledge-based approaches, such as gene set or pathway analysis, have become increasingly popular. In such gene sets/pathways, groups of genes act in concert to accomplish tasks related to a cellular process, and the resulting genetic pathway effects may manifest themselves through phenotypic changes, such as the occurrence of disease. It is thus potentially more meaningful to study the overall effect of a group of genes rather than a single gene, as single-gene analyses may miss important pathway effects and can be difficult to reproduce across studies [1]. Researchers have made significant progress in identifying metabolic or signaling pathways based on expression array data [2,3]. Meanwhile, new tools for identification of pathways, such as GenMAPP [4], Pathway Processor [5], and MAPPFinder [6], have made pathway data more widely available. However, it remains a challenging task to model pathway data and test for a potentially complex pathway effect on a disease outcome.
One way to model pathway data is through the linear model approach, where the pathway effect is represented by a linear combination of individual gene effects. This approach has several limitations. Activities of genes within a pathway are often complicated, thus a linear model is often insufficient to capture the relationship between these genes. Furthermore, genes within a pathway tend to interact with each other. Such interactions are not taken into account by the linear model approach.
In this paper we propose a nonparametric approach, kernel machine regression, to model a pathway effect. The kernel machine method, with the support vector machine (SVM) as its most popular example, has emerged in the last decade as a powerful machine learning technique in high-dimensional settings [7,8]. This method provides a flexible way to model linear and nonlinear effects of variables and gene-gene interactions, unifies the model-building procedure in both one- and multi-dimensional settings, and shows attractive performance compared to other nonparametric methods such as splines.
Liu et al. [9] proposed a kernel machine-based regression model for continuous outcomes. In this paper, we propose a logistic kernel machine regression model for binary outcomes, where covariate effects are modeled parametrically and the genetic pathway effect is modeled parametrically or nonparametrically using the kernel machine method. A main contribution of this paper is to establish a connection between logistic kernel machine regression and the logistic mixed model. We show that the kernel machine estimator of the genetic pathway effect can be obtained from the estimator of the random effects in the corresponding logistic mixed model. This connection provides a convenient vehicle for joining the powerful kernel machine method with the popular mixed model methods in the statistical literature. It also provides a unified framework for statistical inference on the model parameters, including the regression coefficients, the nonparametric genetic pathway function, and the regularization and kernel parameters. Based on the proposed logistic kernel machine regression model, we develop a new test for a nonlinear pathway effect on disease risk. An appealing feature of the proposed test is that it performs well without the need to correctly specify the functional form of the effect of each gene or their interactions. This has significant practical implications when analyzing genetic pathway data, as the true relationship between the pathway and the disease outcome is often unknown. We extend the results to generalized kernel machine regression for a class of continuous and discrete outcomes and discuss its connection with generalized linear mixed models [10].
Recently, Wei and Li [11] proposed nonparametric pathway-based regression (NPR) to model pathway data. NPR is a pathway-based gradient boosting procedure whose base learner is usually a regression or classification tree. It provides a flexible approach to modeling pathways and interactions among genes within a pathway. Michalowski et al. [12] proposed a Bayesian belief network approach for pathway data. Neither method is likelihood-based, so parameter estimation and inference cannot be cast within a unified likelihood framework; it is hence difficult to estimate and quantify the overall pathway effect on disease risk and assess its statistical uncertainty. Secondly, a primary interest in this paper is to test for the statistical significance of the overall pathway effect on the risk of a disease. Neither NPR nor the Bayesian belief network provides such a statistical test for the pathway effect. For example, NPR uses an importance score to rank the relative importance of each pathway but lacks a formal inferential procedure for assessing the statistical significance of a pathway; further, when only a single pathway is considered, the importance score loses its meaning in assessing the importance of that pathway. Our method, on the other hand, is based on the penalized likelihood, so estimation and inference can be conducted in a systematic manner within the likelihood framework. We also propose a formal statistical test for the significance of a pathway effect on the risk of a disease.
Goeman et al. [13] proposed a linear mixed model to relate the pathway effect to a continuous outcome. They modeled the pathway effect using a linear function, with each gene entering the model as a regressor, and assumed the regression coefficients of the genes to be random draws from a common distribution with mean 0 and an unknown variance. The pathway effect can then be tested through a variance component test for random effects. Our approach differs from theirs in the following respects. First, we model the pathway effect by allowing for a nonparametric model rather than a parametric one. As noted earlier, the highly complicated nature of gene activities within a pathway makes the linear model assumption untenable. Secondly, the kernel function used in kernel machine regression usually contains an unknown tuning parameter, which is present under the alternative hypothesis but disappears under the null hypothesis. This makes tests such as those proposed in [13,14] inapplicable, whereas our proposed test works quite well in this scenario. Third, Goeman et al. [14] extended their linear model results to discrete outcomes using basis functions. A key advantage of the kernel machine approach over this basis approach for modeling multi-gene effects is that one does not need to specify the bases explicitly, which is often difficult for high-dimensional data, especially when interactions are modeled.
Results
Analysis of prostate cancer data
In this section, we apply the proposed logistic kernel machine regression model (3), described in the Methods section, to the analysis of a prostate cancer data set. The data came from the Michigan prostate cancer study [15]. This study involved 81 patients, 22 diagnosed as noncancerous and 59 diagnosed with local or advanced prostate cancer. Besides clinical and demographic covariates such as age, cDNA microarray gene expressions were also available for each patient. The early results of Dhanasekaran et al. [15] indicated that certain functional genetic pathways appeared dysregulated in prostate cancer relative to noncancerous tissues. We are interested in studying how a genetic pathway relates to prostate cancer risk, controlling for the covariates. In this analysis we focus on the cell growth pathway, which contains 5 genes. The pathway was annotated by the investigator (A. Chinnaiyan) and is used simply to illustrate the methodology; one could equally take pathways stored in commercial databases such as Ingenuity Pathway Analysis (IPA) and apply the proposed methodology to those gene sets.
The outcome was binary prostate cancer status, and the covariates include age. Since the functional relationship between the cell growth pathway and prostate cancer risk is unknown, the kernel machine method provides a convenient and flexible framework for evaluating the pathway effect on prostate cancer risk. Specifically, we consider the following semiparametric logistic model
where h(·) is a nonparametric function of the 5 genes within the cell growth pathway. Details of the estimation procedure are provided in the Methods section. In summary, we fit this model using the kernel machine method via the logistic mixed model representation, with the Gaussian kernel used in estimating h(·). Under the mixed model representation, we estimated (β_{0}, β_{1}) and h(·) using penalized quasi-likelihood (PQL), and estimated the smoothing parameter τ and the Gaussian kernel scale parameter ρ simultaneously by treating them as variance components. The results are presented in Table 1.
Table 1. Analysis of prostate cancer data.
The test for the cell growth pathway effect on prostate cancer status, H_{0}: h(z) = 0 vs H_{1}: h(z) ≠ 0, was conducted using the proposed score test described in the Methods section. For comparison, we also conducted the global test proposed by Goeman et al. [13], which assumes a linear pathway effect; note that our test allows a nonlinear pathway effect and gene-gene interactions. Table 1 gives the p-values of both tests. The p-value of our test suggests that the cell growth pathway has a highly significant effect on disease status, while the test of Goeman et al. [13] indicates only marginal significance of the cell growth pathway effect.
Simulation Study for the Parameter Estimates
We conducted a simulation study to evaluate the performance of the parameter estimates of the proposed logistic kernel machine regression by using the logistic mixed model formulation. We considered the following model
where the true regression coefficient β = 1. We consider p = 5 and set h(z_{1}, ..., z_{5}) = 2{sin(z_{1}) + z_{1}exp(z_{3})sin(z_{2})cos(z_{3}) + sin(z_{4})cos(z_{1}) + z_{3}z_{5}}. To allow x_{i} and (z_{i1}, ⋯, z_{ip}) to be correlated, x_{i} was generated as x_{i} = sin(z_{i1}) + 2u_{i}, where u_{i} and z_{ij} (j = 1, ⋯, p) follow independent Uniform(−0.5, 0.5) distributions. The Gaussian kernel was used throughout the simulation. All simulations were run 300 times. Settings 1, 2, and 3 correspond to sample sizes n = 100, 200, and 300, respectively.
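As an illustration, this simulation design can be sketched in Python (a hypothetical sketch, not the code used in the paper; `h_true` uses the terms of h as they appear in the displayed expression, and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def h_true(z):
    # Nonparametric pathway effect from the simulation setting above
    z1, z2, z3, z4, z5 = z.T
    return 2 * (np.sin(z1) + z1 * np.exp(z3) * np.sin(z2) * np.cos(z3)
                + np.sin(z4) * np.cos(z1) + z3 * z5)

def simulate(n, beta=1.0, p=5):
    # z_ij and u_i are independent Uniform(-0.5, 0.5)
    z = rng.uniform(-0.5, 0.5, size=(n, p))
    u = rng.uniform(-0.5, 0.5, size=n)
    x = np.sin(z[:, 0]) + 2 * u          # x correlated with z_1
    eta = x * beta + h_true(z)           # linear predictor
    mu = 1 / (1 + np.exp(-eta))          # logistic link
    y = rng.binomial(1, mu)
    return y, x, z

y, x, z = simulate(100)                  # setting 1: n = 100
```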
The simulation results are shown in Table 2. Due to the multi-dimensional nature of the variables z, it is difficult to visualize the fitted curve ĥ(z). We hence summarized the goodness-of-fit of ĥ(·) in the following way. For each simulated data set, we regressed the true h on the fitted values ĥ, both evaluated at the design points. We then summarized the goodness-of-fit of ĥ(·) empirically by calculating the average intercepts, slopes and R^{2}'s obtained from these regressions over the 300 simulations. If the kernel machine method fits the nonparametric function well, we would expect the intercept to be close to 0, the slope close to 1, and R^{2} also close to 1.
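The goodness-of-fit summary just described can be sketched as follows (an illustrative helper, not the paper's code; the function name is ours):

```python
import numpy as np

def fit_summary(h_true_vals, h_fitted):
    # Regress the true h on the fitted values at the design points and
    # report (intercept, slope, R^2); values near (0, 1, 1) indicate a
    # good fit of the nonparametric function.
    X = np.column_stack([np.ones_like(h_fitted), h_fitted])
    coef, *_ = np.linalg.lstsq(X, h_true_vals, rcond=None)
    resid = h_true_vals - X @ coef
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((h_true_vals - h_true_vals.mean()) ** 2)
    return coef[0], coef[1], 1 - ss_res / ss_tot
```

Averaging these three quantities over the 300 simulated data sets gives the entries reported in Table 2.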
Table 2. Simulation results on estimation.
Our results show that even when the sample size is as low as 100, the estimates of the regression coefficient and the nonparametric function have only small bias. When the kernel parameter ρ is estimated, these biases tend to be smaller than when ρ is held fixed. As the sample size increases, the estimates of β and h move closer to the true values, especially when ρ is estimated, while some bias remains when ρ is fixed at values farther away from the estimated one. Table 3 compares the estimated standard errors of β̂ with the empirical standard errors. Our results show that the two agree well when ρ is estimated.
Table 3. Simulation results on standard errors.
Simulation Study of the Score Test for the Pathway Effect
We next conducted a simulation study to evaluate the performance of the proposed variance component score test for the pathway effect H_{0}: h(·) = 0 vs H_{1}: h(·) ≠ 0. To compare the performance of our test with the linearity-based global test proposed by Goeman et al. [13], both tests were applied to each simulated data set, and both nonlinear and linear functions h(z) were considered. For the nonlinear pathway effect, the true model is logit(y) = x + ah(z), where h(z) = 2(z_{1} − z_{2})^{2} + z_{2}z_{3} + 3sin(2z_{3})z_{4} + 2cos(z_{4})z_{5}. For the linear pathway effect, the true model is logit(y) = x + ah(z), where h(z) = 2z_{1} + 3z_{2} + z_{3} + 2z_{4} + z_{5}. All z's were generated from the standard normal distribution, and a = 0, 0.2, 0.4, 0.6, 0.8. To allow x and (z_{i1}, ⋯, z_{ip}) to be correlated, x was generated as x = z_{1} + e/2, with e independent of z_{1} and following N(0, 1). We studied the size of the test by generating data under a = 0, and studied the power by increasing a. The sample size was 100. For the size calculations the number of simulations was 2000, whereas for the power calculations the number of runs was 1000. Based on the discussion in the section "Test for the Genetic Pathway Effect", the bounds of ρ were set as described there, and the interval was divided into 500 equally spaced grid points. All simulations were conducted using R 2.5.0, and the package "globaltest" v4.6.0 was used for the test proposed by Goeman et al. [13].
Table 4 reports the empirical size (a = 0) and power (a > 0) of the variance component score test for the pathway effect. When the true function h(z) is nonlinear in z, the size of our test was very close to the nominal value 0.05, while the size of the global test of Goeman et al. [13] was inflated. Our test also had much higher power. This is not surprising, since the test of Goeman et al. [13] is based on a linearity assumption for the pathway effect; when the true underlying model is far from linear, the linearity assumption breaks down and the test quickly loses power. The results also show that the proposed test works well for moderate sample sizes. When the pathway effect is linear, the sizes of both tests were very close to the nominal value 0.05 and their powers were also very close. This demonstrates that our test is as powerful as the global test when the true underlying h(z) is linear. Our test could therefore be used as a universal test for the overall effect of a set of variables without the need to specify the true functional form of each variable's effect. This feature is especially desirable for genetic pathway data, because the relationship between genes and a clinical outcome is often unknown.
Table 4. Simulation results on score test.
Conclusions and Discussion
In this paper, we developed a logistic kernel machine regression model for binary outcomes, where the covariate effects are modeled parametrically and the genetic pathway effect is modeled nonparametrically using the kernel machine method. This method provides an attractive way to model the pathway effect without the need to make strong parametric assumptions about individual gene effects or their interactions. Our model also allows for parametric pathway effects if a parametric kernel, such as the first-degree polynomial kernel, is used.
A key result of this paper is that we have established a close connection between generalized kernel machine regression and generalized linear mixed models, and shown that the kernel machine estimators of the regression coefficients and the nonparametric multi-dimensional pathway effect can be easily obtained from the corresponding generalized linear mixed model using PQL. The mixed model connection provides a unified framework for estimation and inference and can be easily implemented in existing software, such as SAS PROC GLIMMIX or R glmmPQL. It also makes it possible to test for the overall pathway effect through the proposed variance component test. A key advantage of the proposed score test for the pathway effect is that it does not require an explicit functional specification of individual gene effects and gene-gene interactions. This is of practical significance, as the pathway effect is often complex. Our simulation study shows that the proposed test performs well for moderate sample sizes. It has power similar to the linearity-based pathway test of Goeman et al. [13] when the true effect is linear, but much higher power when the true effect is nonlinear.
We have considered in this paper a single pathway. One could generalize the proposed semiparametric model to incorporate multiple pathways by fitting an additive model:
where z_{j} (j = 1, ⋯, m) denotes the p_{j} × 1 vector of genes in the jth pathway and h_{j}(·) denotes the nonparametric function associated with the jth genetic pathway.
Machine learning is a powerful tool for advancing bioinformatics research. Our effort helps to build a bridge between kernel machine methods and traditional statistical models. This connection provides a new and convenient tool for the bioinformatics community and opens a door for future research.
Methods
The Logistic Kernel Machine Model
Throughout the paper we assume that gene expression data have been properly normalized. Suppose the data consist of n samples. For subject i (i = 1, ⋯, n), y_{i }is a binary disease outcome taking values either 0 (nondisease) or 1 (disease), x_{i }is a q × 1 vector of covariates, z_{i }is a p × 1 vector of gene expression measurements in a pathway/gene set. We assume that an intercept is included in x_{i}. The binary outcome y_{i }depends on x_{i }and z_{i }through the following semiparametric logistic regression model:
logit(μ_{i}) = x_{i}^{T}β + h(z_{i}),   (3)

where μ_{i} = P(y_{i} = 1 | x_{i}, z_{i}), β is a q × 1 vector of regression coefficients, and h(z_{i}) is an unknown centered smooth function.
In model (3), covariate effects are modeled parametrically, while the multidimensional genetic pathway effect is modeled parametrically or nonparametrically. A nonparametric specification for h(·) reflects our limited knowledge of genetic functional forms. Note that h(·) = 0 means that genes in the pathway have no association with the disease risk. If h(z) = γ_{1}z_{1 }+ ⋯ + γ_{p}z_{p}, model (3) becomes the linear model considered by Goeman et al. [13].
In nonparametric modeling, such as smoothing splines, the unknown function is usually assumed to lie in a certain function space. For the kernel machine method, this function space, denoted by ℋ_{K}, is generated by a given positive definite kernel function K(·, ·). The mathematical properties of ℋ_{K} imply that any unknown function h(z) in ℋ_{K} can be written as a linear combination of the given kernel function K(·, ·) evaluated at each sample point. Two popular kernel functions are the dth polynomial kernel K(z_{1}, z_{2}) = (z_{1}^{T}z_{2} + 1)^{d} and the Gaussian kernel K(z_{1}, z_{2}) = exp{−||z_{1} − z_{2}||^{2}/ρ^{2}}, where ρ is an unknown parameter. The first and second degree polynomial kernels (d = 1, 2) correspond to assuming h(·) to be linear and quadratic in the z's, respectively. The choice of a kernel function determines which function space one uses to approximate h(z). The unknown parameter of a kernel function plays a critical role in function approximation, and it is a challenging problem to estimate it optimally from data. In the machine learning literature, this parameter is usually prefixed at some value chosen by ad hoc methods. In this paper, we show that it can be estimated optimally from the data within a mixed model framework.
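For concreteness, the two kernels can be computed as follows (an illustrative sketch; the function names are ours, and the polynomial kernel is written in its standard (z_{1}^{T}z_{2} + 1)^{d} form):

```python
import numpy as np

def gaussian_kernel(Z1, Z2, rho):
    # K(z1, z2) = exp(-||z1 - z2||^2 / rho^2); rho governs smoothness
    d2 = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / rho ** 2)

def poly_kernel(Z1, Z2, d):
    # d-th degree polynomial kernel; d = 1 corresponds to a linear h,
    # d = 2 to a quadratic h with pairwise interactions
    return (Z1 @ Z2.T + 1.0) ** d
```

Evaluating either function with `Z1 = Z2` set to the n × p matrix of gene expressions yields the n × n kernel matrix K used throughout the paper.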
The Estimation Procedure
Assuming h(·) ∈ ℋ_{K}, the function space generated by a kernel function K(·, ·), we can estimate β and h(·) by maximizing the penalized log-likelihood function
where λ is a regularization parameter that controls the trade-off between goodness of fit and model complexity. When λ = 0 the fit is a saturated model, and when λ = ∞ the model reduces to the simple logistic model logit(μ_{i}) = x_{i}^{T}β. Note that there are two tuning parameters in the above likelihood function: the regularization parameter λ and the kernel parameter ρ. Intuitively, λ controls the magnitude of the unknown function, while ρ mainly governs its smoothness.
By the representer theorem [16], the general solution for the nonparametric function h(·) in (4) can be expressed as
h(z) = Σ_{i=1}^{n} α_{i}K(z, z_{i}),   (5)

so that h(z_{i}) = k_{i}^{T}α, where k_{i} = {K(z_{i}, z_{1}), ..., K(z_{i}, z_{n})}^{T} and α = (α_{1}, ⋯, α_{n})^{T} is an n × 1 vector of unknown parameters.
Substituting (5) into (4) we have
where K = K(ρ) is an n × n matrix whose (i, i′)th element is K(z_{i}, z_{i′}) and which often depends on an unknown parameter ρ.
Since J(β, α) in (6) is a nonlinear function of (β, α), one can use the Fisher scoring or Newton-Raphson iterative algorithm to maximize (6) with respect to β and α. Let (k) denote the kth iteration; then it can be shown (for details see Appendix A.3) that the (k + 1)th update of β and α solves the following normal equations:
where τ = 1/λ and h^{(k)} = Kα^{(k)} (the working vector and weight matrix D appearing in these equations are given in Appendix A.3). The estimators β̂ and α̂ at convergence are the kernel machine estimators that maximize (6).
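The iteration can be sketched as a Newton-Raphson/IRLS loop (an illustrative implementation of the penalized likelihood (6) under the logit link, not the paper's code; the block system below is one standard way to write the update, and all names are ours):

```python
import numpy as np

def fit_logistic_km(y, X, K, lam, n_iter=50, tol=1e-8):
    # Maximize sum_i [y_i*eta_i - log(1 + exp(eta_i))] - (lam/2) a'Ka
    # with eta = X b + K a, by iteratively reweighted least squares.
    n, q = X.shape
    b, a = np.zeros(q), np.zeros(n)
    for _ in range(n_iter):
        eta = X @ b + K @ a
        mu = 1 / (1 + np.exp(-eta))
        w = np.clip(mu * (1 - mu), 1e-6, None)   # IRLS weights (matrix D)
        ytil = eta + (y - mu) / w                # working response
        WX, WK = X * w[:, None], K * w[:, None]
        # Block normal equations for the (b, a) update
        A = np.block([[X.T @ WX, X.T @ WK],
                      [K.T @ WX, K.T @ WK + lam * K]])
        rhs = np.concatenate([X.T @ (w * ytil), K.T @ (w * ytil)])
        sol = np.linalg.solve(A, rhs)
        b_new, a_new = sol[:q], sol[q:]
        delta = max(np.max(np.abs(b_new - b)), np.max(np.abs(a_new - a)))
        b, a = b_new, a_new
        if delta < tol:
            break
    return b, a
```

At convergence the score equations X^{T}(y − μ) = 0 and K(y − μ) − λKα = 0 hold, matching the stationarity conditions of (6).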
The Connection of Logistic Kernel Machine Regression to Logistic Mixed Models
Generalized linear mixed models (GLMMs) have been used to analyze correlated categorical data and have gained much popularity in the statistical literature [10]. Logistic mixed models are a special case of GLMMs. We show in this section that the kernel machine estimator in the semiparametric logistic regression model (3) corresponds to the Penalized QuasiLikelihood (PQL) [10] estimator from a logistic mixed model, and the regularization parameter τ = 1/λ and kernel parameter ρ can be treated as variance components and estimated simultaneously from the corresponding logistic mixed model. Specifically, consider the following logistic mixed model:
where β is a q × 1 vector of fixed effects, h = (h_{1}, ..., h_{n})^{T} is an n × 1 vector of subject-specific random effects following h ~ N{0, τK(ρ)}, and the covariance matrix K(ρ) is the n × n kernel matrix defined in the previous section.
Since K is neither diagonal nor block-diagonal, the random effects h_{i} are correlated across subjects, and the ith mean response μ_{i} depends on the other random effects h_{i′} (i′ ≠ i) through the correlation of h_{i} with them. To estimate the unknown parameters in the logistic mixed model (8), we estimate β and h by maximizing the PQL [10], which can be viewed as a joint log-likelihood of (β, h),
Setting τ = 1/λ and h = Kα, one can easily see that equations (6) and (9) are identical. It follows that the logistic kernel machine estimators β̂ and ĥ can be obtained by fitting the logistic mixed model (8) using PQL. In fact, examination of the kernel machine normal equations (7) shows that they are identical to the normal equations obtained from the PQL (9) [10]: the working vector in (7) is the PQL working vector, and D is the PQL working weight matrix.
Note that the estimators of β and h depend on the unknown regularization parameter τ and the kernel parameter ρ. Within the PQL framework, we can estimate these parameters δ = (τ, ρ) by maximizing the approximate REML likelihood
where V = D^{−1} + τK and the working vector is as defined above. The estimator of δ is obtained by setting the first derivative of (10) with respect to δ equal to zero. The estimation procedure for β, h, and δ = (τ, ρ) can be summarized as follows: we fit the logistic kernel machine model by iteratively fitting the following working linear mixed model, estimating (β, h) using BLUPs and δ using REML, until convergence:
where the working vector is as defined below equation (7), h is a random effect vector following N{0, τK(ρ)}, and ε ~ N(0, D^{−1}). The covariance of β̂ is estimated by (X^{T}V^{−1}X)^{−1}, and the covariance of ĥ is estimated using P = V^{−1} − V^{−1}X(X^{T}V^{−1}X)^{−1}X^{T}V^{−1} with V = V(δ̂). The covariance of δ̂ can be obtained as the inverse of the expected information matrix calculated from the second derivative of (10) with respect to δ. The square roots of the diagonal elements of the estimated covariance matrices give the standard errors of β̂, ĥ, and δ̂. The above discussion shows that we can easily fit the logistic kernel machine regression using existing PQL-based mixed model software, such as SAS PROC GLIMMIX and R glmmPQL.
Test for the Genetic Pathway Effect
It is of significant practical interest to test for the overall genetic pathway effect H_{0}: h(z) = 0. Assuming h(z) ∈ ℋ_{K}, one can easily see from the logistic mixed model representation (8) that testing H_{0}: h(z) = 0 vs H_{1}: h(z) ≠ 0 is equivalent to testing the variance component τ: H_{0}: τ = 0 vs H_{1}: τ > 0. Note that the null hypothesis places τ on the boundary of the parameter space. Since the kernel matrix K is not block-diagonal, unlike the standard case considered by Self and Liang [17], the likelihood ratio statistic for H_{0}: τ = 0 does not follow a mixture of χ_{0}^{2} and χ_{1}^{2} distributions. We consider instead a score test in this paper.
When conducting statistical tests for pathways, two types of tests can be formulated: the competitive test and the self-contained test [18]. The competitive test compares the gene set of interest to all the other genes on a gene chip. An example is gene set enrichment analysis (GSEA) [1], where an enrichment score of a gene set is defined and a permutation test based on that score is used to test for the significance of the gene set. The self-contained test compares the gene set to an internal standard that does not involve any genes outside the gene set considered; in other words, it examines the null hypothesis that a pathway has no effect on the outcome against the alternative that it does. The variance component test of [13] for the linear pathway effect is a self-contained test. Goeman and Bühlmann [18] pointed out that the self-contained test has higher power than a competitive test, that its statistical formulation is consistent for both single-gene tests and gene set tests, and that the statistical sampling properties of the competitive test can be difficult to interpret.
Our pathway effect hypothesis H_{0}: h(z) = 0 vs H_{1}: h(z) ≠ 0 is a self-contained hypothesis. We propose a self-contained test for the pathway effect by developing a kernel machine variance component score test for H_{0}: τ = 0 vs H_{1}: τ > 0. The proposed test allows for both linear and nonlinear pathway effects and includes the tests of Goeman et al. [13,14] as special cases. A key advantage of our kernel-based test is that we do not need to specify basis functions for h(·) explicitly, which is often difficult when modeling the joint effects of multiple genes; instead, we let the data estimate the best curvature of h(·).
Zhang and Lin [19] proposed a score test for H_{0}: τ = 0 to compare a polynomial model with a smoothing spline. Goeman et al. [14] also proposed a global test against a high-dimensional alternative within an empirical Bayesian framework. The variance-covariance matrices used in these tests do not involve any unknown parameters. However, the kernel function K(·, ·) in a kernel machine model usually depends on an unknown parameter ρ. One can easily see from the mixed model representation (8) that under H_{0}: τ = 0 the kernel matrix K disappears from the model. This makes the parameter ρ inestimable under the null hypothesis and renders the above tests inapplicable.
Davies [20,21] studied the problem of a parameter that disappears under H_{0} and proposed a score test that treats the score statistic as a Gaussian process indexed by the nuisance parameter, obtaining an upper bound to approximate the p-value of the test. We adopt this approach for our proposed score test.
Using the derivative of (10) with respect to τ, we propose the following score test statistic for H_{0}: τ = 0:

Q_{τ}(β̂_{0}, ρ) = (y − μ̂_{0})^{T}K(ρ)(y − μ̂_{0}),

with standardized version

S(ρ) = {Q_{τ}(β̂_{0}, ρ) − μ_{Q}}/σ_{Q},

where β̂_{0} is the MLE of β under H_{0}: τ = 0, μ̂_{0} is the corresponding vector of fitted means, μ_{Q} = tr{P_{0}K(ρ)}, σ_{Q}^{2} = 2tr{P_{0}K(ρ)P_{0}K(ρ)}, P_{0} = D_{0} − D_{0}X(X^{T}D_{0}X)^{−1}X^{T}D_{0}, and D_{0} = diag{μ̂_{0i}(1 − μ̂_{0i})}. Note that under H_{0}: τ = 0, model (3) reduces to the simple logistic model logit(μ_{i}) = x_{i}^{T}β. Hence β̂_{0} is the MLE of β under this null logistic model.
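A sketch of the standardized statistic S(ρ) for a fixed kernel matrix K = K(ρ), using the quantities just defined (illustrative only, not the paper's code; the null logistic model is fitted by ordinary IRLS):

```python
import numpy as np

def score_stat(y, X, K):
    # Fit the null logistic model logit(mu) = X b by IRLS
    b = np.zeros(X.shape[1])
    for _ in range(100):
        mu = 1 / (1 + np.exp(-(X @ b)))
        w = np.clip(mu * (1 - mu), 1e-10, None)
        z = X @ b + (y - mu) / w
        b_new = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * z))
        if np.max(np.abs(b_new - b)) < 1e-10:
            b = b_new
            break
        b = b_new
    mu = 1 / (1 + np.exp(-(X @ b)))
    # Quantities from the text: D0, P0, Q, mu_Q, sigma_Q
    D0 = np.diag(mu * (1 - mu))
    P0 = D0 - D0 @ X @ np.linalg.solve(X.T @ D0 @ X, X.T @ D0)
    Q = (y - mu) @ K @ (y - mu)
    mu_Q = np.trace(P0 @ K)
    sigma_Q = np.sqrt(2 * np.trace(P0 @ K @ P0 @ K))
    return (Q - mu_Q) / sigma_Q
```

In practice S(ρ) is evaluated over a grid of ρ values, as described next.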
If the Gaussian kernel is used, an arbitrary nonlinear pathway effect is implicitly allowed. Our proposed test, which is derived to detect any nonlinear effect, is therefore more powerful than tests based on a parametric assumption when that assumption fails. We show in Appendix A.1 that when ρ becomes large in the Gaussian kernel, our test statistic reduces asymptotically to the one based on a linearity assumption on the genetic effects; hence our test includes the linear-model-based test as a special case. From (11) it is also clear that our test is invariant to the relative scaling of the kernel function K(·, ·).
Under appropriate regularity conditions, similar to those specified in [22], S(ρ) under the null hypothesis can be regarded as an approximate Gaussian process indexed by ρ. We can then apply Davies' results [20,21] to obtain an upper bound for the p-value of the test. Since a large value of Q_{τ}(β̂_{0}, ρ) leads to rejection of H_{0}, the p-value of the test corresponds to an up-crossing probability. Following Davies [21], the p-value is bounded above by

Φ(−M) + W exp(−M^{2}/2)/√(8π),

where Φ(·) is the standard normal cumulative distribution function, M is the maximum of S(ρ) over the range of ρ, W = |S(ρ_{1}) − S(L)| + |S(ρ_{2}) − S(ρ_{1})| + ⋯ + |S(U) − S(ρ_{m})| is the total variation of S over the grid, L and U are the lower and upper bounds of ρ, and ρ_{l}, l = 1, ..., m, are the m grid points between L and U. Davies [20] points out that this bound is sharp. For the Gaussian kernel, we suggest setting the bounds of ρ as L = 0.1 and U = 100; for justification, see Appendix A.2.
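Given S(ρ) evaluated on the grid, the Davies upper bound can be computed directly (a sketch assuming the bound Φ(−M) + W exp(−M²/2)/√(8π); the function name is ours):

```python
import math

def davies_pvalue_bound(S):
    # S: list of the standardized score statistic S(rho) evaluated on an
    # ordered grid over [L, U].  M is the maximum over the grid and W is
    # the total variation of S along the grid.
    M = max(S)
    W = sum(abs(S[i + 1] - S[i]) for i in range(len(S) - 1))
    phi_neg_M = 0.5 * math.erfc(M / math.sqrt(2))   # Phi(-M)
    return phi_neg_M + W * math.exp(-M * M / 2) / math.sqrt(8 * math.pi)
```

When S is flat (W = 0) the bound reduces to the usual one-sided normal p-value Φ(−M).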
Extension to generalized kernel machine model
For simplicity, we have focused in this paper on logistic regression for binary outcomes. The proposed semiparametric model (3) can be easily extended to other types of continuous and discrete outcomes, such as normal, count, or skewed data, whose distributions lie in the exponential family [23]. In this section, we briefly discuss how to generalize our estimation and testing procedures for binary outcomes to other data types within the generalized kernel machine framework, and discuss model fitting using generalized linear mixed models.
Suppose the data consist of n independent subjects. For subject i (i = 1, ..., n), y_{i }is a response variable, x_{i }is a q × 1 vector of covariates, z_{i }is a p × 1 vector of gene expressions within a pathway. Suppose y_{i }follows a distribution in the exponential family with density [23]
where θ_{i }is the canonical parameter, a(·) and c(·) are known functions, φ is a dispersion parameter, and m_{i }is a known weight. The mean of y_{i }satisfies μ_{i }= E(y_{i}) = a'(θ_{i}) and Var(y_{i}) = φa″(θ_{i})/m_{i}. The generalized kernel machine model is an extension of the generalized linear model [23] obtained by allowing the pathway effect to be modeled nonparametrically using a kernel machine as
where g(·) is a known monotone link function, and h(·) is an unknown centered smooth function lying in the function space generated by a positive definite kernel function K(·, ·). For binary data, setting g(μ) = logit(μ) = log{μ/(1 – μ)} gives the logistic kernel machine model (3); for count data, g(μ) = log(μ) gives the Poisson kernel machine model; for Gaussian data, g(μ) = μ gives the linear kernel machine model [9]. The regression coefficients β and the nonparametric function h(·) in (14) can be obtained by maximizing the penalized loglikelihood function
where ℓ(·) = ln(p) is the loglikelihood, p is the density function given in (13), and λ is a tuning parameter. Using the kernel expression of h(·) in (5), the generalized kernel machine model (14) can be written as
and the penalized likelihood can be written
where K is an n × n matrix whose (i, j)th element is K(z_{i}, z_{j}).
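For instance, with the Gaussian kernel K(z_{i}, z_{j}) = exp(–||z_{i} – z_{j}||²/ρ), the matrix K can be assembled as follows. This is a minimal numpy sketch, not the authors' implementation, with ρ supplied by the user:

```python
import numpy as np

def gaussian_kernel_matrix(Z, rho):
    """n x n matrix with (i, j)th element exp(-||z_i - z_j||^2 / rho),
    where the rows of Z are the gene-expression vectors z_1, ..., z_n."""
    sq = np.sum(Z**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T  # squared Euclidean distances
    return np.exp(-np.maximum(d2, 0.0) / rho)       # clip tiny negatives from rounding
```

The same matrix serves both the penalized likelihood above and the mixed-model representation discussed next.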
One can use the Fisher scoring iteration to solve for β and α. The procedure is virtually the same as that described in the Section "The Estimation Procedure". The normal equation takes the same form as (7), except that now μ_{i }is specified under (14) and D = diag{Var(y_{i})} under (13). Calculations similar to those in the Section "The Connection of Logistic Kernel Machine Regression to Logistic Mixed Models" show that model (14) can be fit using the generalized linear mixed model [10] via PQL
where τ = 1/λ, and h = (h_{1}, ..., h_{n})^{T }is an n × 1 random vector with distribution N{0, τK(ρ)}. Standard PQL software, such as SAS PROC GLIMMIX and the R function glmmPQL, can be used to fit this model and obtain the kernel machine estimators of β and h(·).
The score test (11) also extends straightforwardly. The only change is that the elements of the matrix D in (11) are replaced by the appropriate variance function Var(y_{i}) under the assumed parametric distribution of y_{i}.
Appendix
A.1 Proof of the relationship between the proposed score test and that of Goeman et al. [13] under the linearity assumption
We show in this section that when the scale parameter ρ is large, the proposed nonparametric variance component test for the pathway effect using the Gaussian kernel reduces to the linearity-based global test of Goeman et al. [13].
Suppose K(·, ·) is the Gaussian kernel, K(z_{i}, z_{j}) = exp{–||z_{i} – z_{j}||^{2}/ρ}. It can be shown that the score statistic for testing H_{0}: τ = 0 satisfies

Q_{τ}(β̂, ρ) = (y – μ̂)^{T}K(ρ)(y – μ̂),

where μ̂ is the MLE of μ under H_{0}. The test statistic of Goeman et al. [13] takes the form

T = (y – μ̂)^{T}R(y – μ̂),

where R = ZZ^{T}. We now show that when ρ is large relative to max_{i,j}||z_{i} – z_{j}||^{2},

0.5ρQ_{τ}(β̂, ρ) ≈ (y – μ̂)^{T}ZZ^{T}(y – μ̂).     (19)

A simple Taylor expansion shows that

exp{–||z_{i} – z_{j}||^{2}/ρ} ≈ 1 – ||z_{i} – z_{j}||^{2}/ρ.

When ||z_{i} – z_{j}||^{2}/ρ is small, i.e., when ρ is large relative to ||z_{i} – z_{j}||^{2}, we have for any i ≠ j

K(z_{i}, z_{j}) ≈ 1 – (||z_{i}||^{2} + ||z_{j}||^{2} – 2z_{i}^{T}z_{j})/ρ.

Hence, writing s = (||z_{1}||^{2}, ..., ||z_{n}||^{2})^{T} and 1 for the n × 1 vector of ones,

K(ρ) ≈ 11^{T} – ρ^{-1}(s1^{T} + 1s^{T} – 2ZZ^{T}).

Since X contains an intercept and X^{T}(y – μ̂) = 0 under the PQL, we have 1^{T}(y – μ̂) = 0. Hence

(y – μ̂)^{T}K(ρ)(y – μ̂) ≈ 2ρ^{-1}(y – μ̂)^{T}ZZ^{T}(y – μ̂).

This proves the approximate relation (19).
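The large-ρ behavior described above is easy to check numerically. The following sketch uses made-up data and residuals that sum to zero (as implied by an intercept in X) to compare 0.5ρ(y – μ̂)^{T}K(ρ)(y – μ̂) with the Goeman-type quantity (y – μ̂)^{T}ZZ^{T}(y – μ̂); the sample sizes and ρ value are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 4
Z = rng.normal(size=(n, p))     # illustrative gene-expression matrix
r = rng.normal(size=n)          # stand-in for the residuals y - mu_hat
r -= r.mean()                   # enforce 1^T (y - mu_hat) = 0

rho = 1e6                       # large relative to max ||z_i - z_j||^2
sq = np.sum(Z**2, axis=1)
d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
K = np.exp(-d2 / rho)           # Gaussian kernel matrix K(rho)

lhs = 0.5 * rho * (r @ K @ r)   # 0.5 * rho * Q_tau
rhs = r @ (Z @ Z.T) @ r         # Goeman-type statistic with R = Z Z^T
```

With ρ this large the two quantities agree closely, in line with the approximate relation (19).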
A.2 Calculations of the lower and upper bounds of ρ
Although in theory ρ can take any positive value, for computational purposes we require ρ to be bounded. In fact, the value of the proposed test statistic (11) effectively depends on ρ only over a finite range. We describe why this is the case and how to find this range. For a given data set, the proof in Appendix A.1 shows that when ρ is sufficiently large relative to the squared distances ||z_{i} – z_{j}||^{2}, the quantity 0.5ρQ_{τ}(β̂, ρ) converges to (y – μ̂)^{T}ZZ^{T}(y – μ̂), which is free of ρ.
These arguments suggest that for numerical evaluation it is not necessary to consider all ρ values up to infinity; a moderately large value suffices. The question then becomes how to choose appropriate upper and lower bounds for ρ. The proof in Appendix A.1 requires ||z_{i} – z_{j}||^{2}/ρ to be close to 0. Let C_{1 }be a large positive number such that 1/C_{1 }≈ 0. If we take the upper bound of ρ to be C_{1}, then ||z_{i} – z_{j}||^{2}/ρ will be close to 0 for standardized gene expressions. In practice we suggest taking C_{1 }= 100, which gives a good approximation. A similar idea yields a lower bound for ρ. When ||z_{i} – z_{j}||^{2}/ρ → ∞, every off-diagonal element of K(ρ) goes to 0 and the kernel matrix reduces to the identity matrix, so the test statistic again becomes free of ρ. Hence, if we pick a small enough number C_{2 }such that 1/C_{2 }is very large, we can effectively set the lower bound of ρ to C_{2}. In practice we suggest taking C_{2 }= 0.1, which yields a good approximation.
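The small-ρ limit (K(ρ) approaching the identity matrix) can likewise be verified directly. The toy configuration below is our own illustration: the z_{i} are chosen so that all pairwise squared distances equal 2, and at ρ = 0.1 the off-diagonal kernel entries are already numerically zero:

```python
import numpy as np

Z = np.eye(20)                          # toy data: all pairwise ||z_i - z_j||^2 = 2
sq = np.sum(Z**2, axis=1)
d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)

K_small = np.exp(-d2 / 0.1)             # rho at the suggested lower bound C_2 = 0.1
off = K_small - np.diag(np.diag(K_small))
# every off-diagonal entry equals exp(-20) ~ 2e-9: K(rho) is essentially the identity
```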
A.3 Derivation of the normal equation (7)
Taking the partial derivative of (6) with respect to β and writing it in matrix notation, we obtain X^{T}(y – μ). Similarly, differentiating with respect to α gives K(y – μ) – λKα. The gradient vector is thus
Differentiating (20) with respect to β and α, we obtain the following Hessian matrix
where D = Diag{μ_{i}(1 – μ_{i})}. The Newton-Raphson iteration updates the parameter value at the (k + 1)^{th }iteration by the following relationship
where δ = (β^{T}, α^{T})^{T}. Substituting (20) and (21) into (22), we arrive at the normal equation (7).
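The iteration can be sketched in a few lines. The following is a minimal numpy illustration of the Newton-Raphson update for (β, α) using the gradient (20) and Hessian (21); it is not the authors' implementation, and the toy data, ρ = 5, and λ = 1 are assumptions for demonstration only:

```python
import numpy as np

def fit_logistic_km(X, K, y, lam, n_iter=25):
    """Newton-Raphson for the penalized logistic log-likelihood in (beta, alpha)."""
    n, q = X.shape
    beta, alpha = np.zeros(q), np.zeros(n)
    for _ in range(n_iter):
        eta = X @ beta + K @ alpha
        mu = 1.0 / (1.0 + np.exp(-eta))
        D = np.diag(mu * (1.0 - mu))
        # gradient (20): [X^T(y - mu); K(y - mu) - lam*K*alpha]
        g = np.concatenate([X.T @ (y - mu), K @ (y - mu) - lam * (K @ alpha)])
        # negative Hessian (21) in block form
        H = np.block([[X.T @ D @ X, X.T @ D @ K],
                      [K @ D @ X,   K @ D @ K + lam * K]])
        step = np.linalg.solve(H + 1e-8 * np.eye(q + n), g)  # small ridge for stability
        beta, alpha = beta + step[:q], alpha + step[q:]
    return beta, alpha

# toy data (illustrative only)
rng = np.random.default_rng(2)
n, p = 30, 3
Z = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.integers(0, 2, size=n).astype(float)
sq = np.sum(Z**2, axis=1)
K = np.exp(-np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0) / 5.0)

beta, alpha = fit_logistic_km(X, K, y, lam=1.0)
mu = 1.0 / (1.0 + np.exp(-(X @ beta + K @ alpha)))
grad_norm = np.linalg.norm(np.concatenate([X.T @ (y - mu),
                                           K @ (y - mu) - 1.0 * (K @ alpha)]))
```

At convergence the gradient (20) vanishes, which is equivalent to (β, α) solving the normal equation (7).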
Availability
Our algorithm is available at http://www.biostat.harvard.edu/~xlin.
Authors' contributions
DL performed statistical analysis. All authors participated in development of the methods and preparation of the manuscript. All authors read and approved the final manuscript.
Acknowledgements
Liu and Lin's research was supported by a grant from the National Cancer Institute (R37 CA76404). Ghosh's research was supported by a grant from the National Institutes of Health (R01 GM72007). We thank A. Chinnaiyan for providing the prostate cancer data. We also thank the two anonymous referees for their comments, which helped improve the manuscript.
References
1. Subramanian A, Tamayo P, Mootha V, Mukherjee S, Ebert B, Gillette M, Paulovich A, Pomeroy S, Golub T, Lander E, Mesirov J: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 2005, 102:15545-15550.
2. Eisenberg D, Graeber TG: Bioinformatic identification of potential autocrine signaling loops in cancers from gene expression profiles. Nature Genetics 2001, 29:295-300.
3. Raponi M, Belly R, Karp J, Lancet J, Atkins D, Wang Y: Microarray analysis reveals genetic pathways modulated by tipifarnib in acute myeloid leukemia. BMC Cancer 2004, 4:56.
4. Dahlquist KD, Salomonis N, Vranizan K, Lawlor SC, Conklin BR: GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nature Genetics 2002, 31:19-20.
5. Grosu P, Townsend JP, Hartl DL, Cavalieri D: Pathway Processor: a tool for integrating whole-genome expression results into metabolic networks. Genome Research 2002, 12:1121-1126.
6. Doniger SW, Salomonis N, Dahlquist KD, Vranizan K, Lawlor SC, Conklin BR: MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biology 2003, 4:R7.
7. Vapnik V: Statistical Learning Theory. New York: Wiley; 1998.
8. Schölkopf B, Smola A: Learning with Kernels. Cambridge, Massachusetts: MIT Press; 2002.
9. Liu D, Lin X, Ghosh D: Semiparametric regression of multidimensional genetic pathway data: least squares kernel machines and linear mixed models. Biometrics 2007, 63(4):1079-1088.
10. Breslow N, Clayton D: Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 1993, 88:9-25.
11. Wei Z, Li H: Nonparametric pathway-based regression models for analysis of genomic data. Biostatistics 2007, 8(2):265-284.
12. Sprague R, Ed: Proceedings of the 39th Annual Hawaii International Conference on System Sciences. Los Alamitos: IEEE; 2006. [CD-ROM version]
13. Goeman JJ, van de Geer SA, de Kort F, van Houwelingen HC: A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 2004, 20:93-99.
14. Goeman JJ, van de Geer SA, van Houwelingen HC: Testing against a high dimensional alternative. Journal of the Royal Statistical Society: Series B 2006, 68:477-493.
15. Dhanasekaran S, Barrette T, Ghosh D, Shah R, Varambally S, Kurachi K, Pienta K, Rubin M, Chinnaiyan A: Delineation of prognostic biomarkers in prostate cancer. Nature 2001, 412(6849):822-826.
16. Kimeldorf G, Wahba G: Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications 1970, 33:82-95.
17. Self SG, Liang KY: Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. Journal of the American Statistical Association 1987, 82:605-610.
18. Goeman JJ, Bühlmann P: Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 2007, 23:980-987.
19. Zhang D, Lin X: Hypothesis testing in semiparametric additive mixed models. Biostatistics 2002, 4:57-74.
20. Davies R: Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika 1977, 64:247-254.
21. Davies R: Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika 1987, 74:33-43.
22. le Cessie S, van Houwelingen J: Goodness of fit tests for generalized linear models based on random effect models. Biometrics 1995, 51(2):600-614.
23. McCullagh P, Nelder J: Generalized Linear Models. New York: Chapman & Hall; 1989.