Sanofi–Aventis, Cambridge, Massachusetts, USA

Department of Statistics, University of Wisconsin-Madison, Madison, Wisconsin, USA

Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland, USA

Center for Bioinformatics and Computational Biology, Computer Science Department, University of Maryland-College Park, College Park, Maryland, USA

Department of Computer Sciences, University of Wisconsin-Madison, Madison, Wisconsin, USA

Abstract

Background

In systems biology, the task of reverse engineering gene pathways from data has been limited not just by the curse of dimensionality (the interaction space is huge) but also by systematic error in the data. The gene expression barcode reduces spurious association driven by batch effects and probe effects. The binary nature of the resulting expression calls lends itself perfectly to modern regularization approaches that thrive in high-dimensional settings.

Results

The Partitioned LASSO-Patternsearch algorithm is proposed to identify patterns of multiple dichotomous risk factors for outcomes of interest in genomic studies. A partitioning scheme identifies promising patterns by solving many LASSO-Patternsearch subproblems in parallel. All variables that survive this stage proceed to an aggregation stage, where the most significant patterns are identified by solving a reduced LASSO-Patternsearch problem in just these variables. This approach was applied to genetic data sets with expression levels dichotomized by the gene expression barcode. Most of the genes and second-order interactions thus selected are known to be related to the outcomes.

Conclusions

We demonstrate with simulations and data analyses that the proposed method not only selects variables and patterns more accurately, but also provides smaller models with better prediction accuracy, in comparison to several alternative methodologies.

Background

The LASSO-Patternsearch (LPS) algorithm is based on an ℓ_1-regularized logistic regression formulation, targeting the case in which only a small fraction of the large number of possible candidate patterns are significant. The approach can be used to consider simultaneously all possible patterns up to a specified order. It can identify complicated correlation structures among the predictor variables, on a scale that can cause serious difficulties for algorithms that target problems of more modest size.

When applied to very large models with higher-order interactions between the predictor variables, however, LPS quickly runs into computational limitations. For example, a problem with two thousand predictor variables yields a logistic-regression formulation with about two million variables if both first- and second-order patterns are included in the model. Problems of this size are at the limit of LPS capabilities, yet current problems of interest in genetic epidemiology consider ten thousand markers or more.

In this article, we propose a Partitioned LASSO-Patternsearch Algorithm (pLPS) scheme to tackle gigantic data sets in which we wish to consider second- and possibly third-order interactions among the predictors, in addition to the first-order effects. As in LPS, we assume that all predictor variables are binary (or that they have been dichotomized before the analysis). The model thus contains a huge number of possible patterns, but the solution is believed to be sparse, with only a few effects being significant risk factors for the given outcome. In the first (screening) stage of pLPS, the predictors are divided into partitions of approximately equal size, and LPS is used to solve smaller subproblems in which just the predictors and higher-order effects within a single partition, or the interactions between variables in small groups of partitions, are considered as variables in the optimization model. These reduced problems can be solved independently and simultaneously on a parallel computer. By the end of the screening stage, each predictor and each higher-order effect (up to the specified order) will have been considered in at least one of the subproblems. The second stage of pLPS is an aggregation process, in which all predictors identified in the first stage are considered, together with all their interactions up to the specified order. An LPS procedure is used to identify the final set of significant predictors and interactions.
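The screening-stage bookkeeping described above can be sketched in a few lines. This is an illustrative fragment only; the function name and the zero-based partition indexing are ours, not taken from the pLPS code.

```python
from itertools import combinations

def screening_subproblems(n_partitions):
    """Enumerate first-stage pLPS subproblems: one per single partition
    (main effects plus interactions within it) and one per pair of
    partitions (interactions crossing the pair).  Every first- and
    second-order pattern is covered by exactly one subproblem, and the
    subproblems can be solved independently in parallel."""
    singles = [(s,) for s in range(n_partitions)]
    pairs = list(combinations(range(n_partitions), 2))
    return singles + pairs

# With 5 partitions: 5 single-partition subproblems plus C(5,2) = 10
# pair subproblems, 15 in total.
print(len(screening_subproblems(5)))
```

Each tuple returned here identifies one independent LPS subproblem, which is how the first stage parallelizes.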

Tuning parameters in the first stage of pLPS are chosen by the BGACV criterion. BGACV is a more stringent criterion than GACV; the difference between these criteria is similar to the difference between BIC and AIC.

Methods

We now give further details of the pLPS scheme and its implementation. For simplicity, most of our discussion focuses on the case in which first-order effects and second-order interactions between all predictors are considered. Extension of the approach to include third-order effects as well is described briefly at the end of the section.

Consider n subjects, where y = (y_1, y_2, ⋯, y_n) ∈ {0,1}^n is the response and x_j = (x_j(1), x_j(2), ⋯, x_j(n)), j = 1, 2, ⋯, p, are the binary predictor variables. A first-order pattern is a single predictor x_s, while a second-order pattern is the product B_{st} = x_s x_t with s < t, which takes the value 1 exactly when both predictors do.

In the first stage of pLPS (the “screening stage”), we solve two types of reduced LPS subproblems. The first type is based on a pair of partitions, denoted by P_1 and P_2, and defines the LPS variables in the subproblem to be the first-order effects within each group (of which there are p in each partition, where p is the partition size) together with the second-order interactions between a predictor in group P_1 and a predictor in group P_2. There are p^2 basis functions for the latter effects, giving p^2 + 2p + 1 basis functions in total, including the constant function.

The second type of reduced LPS problem is obtained from the first- and second-order effects within a single partition. Here, the basis functions for a partition of size p are the p main effects, the p(p − 1)/2 products B_{st} = x_s x_t with both predictors in the partition, and the constant function, for a total of p(p − 1)/2 + p + 1.


**Diagram of the subproblems in the first stage of pLPS, assuming 5 partitions.** Side length of a square is the partition size, while the horizontal axis contains the labels of the first effect and the vertical axis the label of the second effect. Squares filled with red dots are “type-one” subproblems while the triangles filled with green dots are “type-two” subproblems.
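The basis-function counts for the two subproblem types can be verified with a short sketch, assuming a common partition size p (the function names are ours, for illustration):

```python
def n_basis_pair(p):
    # "Type-one" subproblem on a pair of partitions of size p:
    # p main effects from each partition, p*p cross-partition
    # interactions, and the constant function.
    return p * p + 2 * p + 1

def n_basis_single(p):
    # "Type-two" subproblem on one partition of size p: p main effects,
    # p*(p-1)/2 within-partition interactions, and the constant.
    return p * (p - 1) // 2 + p + 1

# A single partition of 2,000 predictors gives 2,001,001 basis functions.
print(n_basis_single(2000))
```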

We now briefly describe the LPS methodology, which is applied to each of these subproblems. By relabeling, we define the basis functions to be B_ℓ(x), ℓ = 1, ⋯, N_B. Defining the link function

f(x) = μ + Σ_{ℓ=1}^{N_B−1} c_ℓ B_ℓ(x),

LPS estimates the coefficients (c, μ) by minimizing

I_λ(c, μ) = L(y, f) + λ J(c),

where L(y, f) is the scaled negative log likelihood of the Bernoulli model,

L(y, f) = (1/n) Σ_{i=1}^{n} [ log(1 + e^{f(x(i))}) − y_i f(x(i)) ],

with x(i) denoting the vector of predictor values for subject i, and the penalty function being defined by

J(c) = Σ_{ℓ=1}^{N_B−1} |c_ℓ|.

(We assume that the last basis function is the constant function 1, whose coefficient μ is not penalized.)
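This ℓ_1-penalized likelihood can be minimized by many standard methods. The pLPS code uses its own solver; the fragment below is only our minimal proximal-gradient (soft-thresholding) sketch of the same objective, with names of our choosing.

```python
import numpy as np

def lps_fit(B, y, lam, step=0.1, iters=3000):
    """Minimize (1/n) sum_i [log(1 + exp(f_i)) - y_i f_i] + lam * ||c||_1
    over pattern coefficients c, with an unpenalized intercept mu,
    by proximal gradient descent."""
    n, nb = B.shape
    c, mu = np.zeros(nb), 0.0
    for _ in range(iters):
        prob = 1.0 / (1.0 + np.exp(-(B @ c + mu)))  # fitted probabilities
        c = c - step * (B.T @ (prob - y)) / n       # gradient step on c
        c = np.sign(c) * np.maximum(np.abs(c) - step * lam, 0.0)  # prox of lam*||.||_1
        mu = mu - step * np.mean(prob - y)          # intercept: no penalty
    return c, mu

# Toy check: the response depends only on the first of five patterns,
# so that coefficient should dominate and the rest stay near zero.
rng = np.random.default_rng(0)
B = rng.integers(0, 2, size=(200, 5)).astype(float)
y = B[:, 0].copy()
c, mu = lps_fit(B, y, lam=0.05)
```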

If the outcomes can be predicted well using a small number of patterns, the number of patterns surviving the first stage of pLPS should be small. Suppose there are a total of p^∗ unique predictor variables among all these patterns. The second stage of pLPS, the “aggregation stage”, is an LPS problem in which just these predictors and all their second-order effects are the patterns. There will be p^∗ basis functions (B_{1ℓ}) for the main effects and p^∗(p^∗ − 1)/2 basis functions (B_{2ℓ}) for the second-order interactions, plus one constant basis function. In the aggregation stage, we use different penalty parameters for the first- and second-order patterns, so the objective function is

I_λ(c, μ) = L(y, f) + λ_1 J_1(c) + λ_2 J_2(c),   (5)
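Concretely, the aggregation-stage design matrix can be formed from the surviving predictors and all their pairwise products, plus a constant column. A small sketch (the function name is ours):

```python
import numpy as np
from itertools import combinations

def aggregation_basis(X_sel):
    """Build the aggregation-stage basis matrix from the p* predictors
    that survived screening: p* main-effect columns, p*(p*-1)/2
    pairwise-product columns, and a constant column."""
    n, p = X_sel.shape
    pairs = [X_sel[:, s] * X_sel[:, t] for s, t in combinations(range(p), 2)]
    return np.column_stack([X_sel] + pairs + [np.ones(n)])

# Two subjects, three surviving predictors: 3 + 3 + 1 = 7 basis functions.
X_sel = np.array([[1, 0, 1], [0, 1, 1]])
B = aggregation_basis(X_sel)
print(B.shape)
```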

where the link function is

f(x) = μ + Σ_ℓ c_{1ℓ} B_{1ℓ}(x) + Σ_ℓ c_{2ℓ} B_{2ℓ}(x),

and the penalties are

J_1(c) = Σ_ℓ |c_{1ℓ}|,  J_2(c) = Σ_ℓ |c_{2ℓ}|.

The choice of penalty parameters (λ_1, λ_2) in (5) is critical to the performance of this formulation. BGACV does not work well in this setting. Often, it tends to select only second-order patterns, combining main effects with spurious partners. Occasionally, it selects only main effects, breaking true size-two patterns into separate main effects. The large difference between the numbers of basis functions of the two types contributes to this behavior, and a full two-dimensional search over λ_1 and λ_2 is expensive and often does not give satisfactory results. As an alternative approach, we introduce a modified criterion, known as BGACV2, which augments BGACV with a penalty on the difference between q_1 and q_2, the numbers of main effects and second-order patterns in the fitted model.

Minor extensions to the pLPS approach are needed when size-three patterns are included. The screening subproblems now come in four types, according to the indices s_1, s_2, s_3 of the three partitions chosen to define each subproblem (with s_1 ≤ s_2 ≤ s_3). The four types correspond to the cases s_1 < s_2 < s_3, s_1 = s_2 < s_3, s_1 < s_2 = s_3, and s_1 = s_2 = s_3, respectively. In the aggregation phase, we still use two penalty parameters, one for main effects and one for interactions; size-two and size-three patterns share the same penalty parameter. The criterion for choosing appropriate values of λ_1 and λ_2 augments BGACV with a penalty on the deviations of q_1, q_2, and q_3 (the numbers of main effects, size-two patterns, and size-three patterns in the fitted model) from their average q_a.
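The exact BGACV2/BGACV3 formulas are not reproduced in this excerpt, but the balance penalty on the pattern counts can be illustrated with a placeholder score. Everything below (the base criterion value, the weight gamma, and the function name) is hypothetical and for illustration only.

```python
def balanced_score(base, counts, gamma=1.0):
    """Add to a base selection criterion a penalty proportional to the
    total deviation of the pattern counts (main effects, size-two
    patterns, ...) from their average.  For two counts this reduces to
    gamma * |q1 - q2|, discouraging fits containing only one type of
    pattern."""
    avg = sum(counts) / len(counts)
    return base + gamma * sum(abs(q - avg) for q in counts)

# A balanced model beats an all-interactions model with the same base score.
print(balanced_score(10.0, (3, 3)), balanced_score(10.0, (0, 6)))
```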

In the remainder of the paper, we use pLPS to denote the variant that includes patterns up to second order, and pLPS3 the variant that also includes third-order patterns.

The choice of partition size is dictated by the capacity of the LPS solver. A partition of 2,000 predictors yields a type-two subproblem with N_B = 2,001,001 basis functions, which can be handled comfortably by the LPS code. On a more standard computer (Intel^{®} Pentium^{®} 4 2.80GHz with 2 GB memory), we usually set the partition size smaller.

The pLPS code is available at

Results and discussion

Simulation studies

In this section we study the empirical performance of pLPS through four simulated examples. The first example is a relatively small data set with independent predictor variables: one main effect and two second-order interactions are included in the link function. The second example is a very large data set with strong correlations among neighboring variables, in which two main effects and two second-order interactions are assumed to be important. The third example studies the performance of pLPS3, which includes third-order interactions in the model; two main effects, one second-order interaction, and one third-order interaction are included. The last example studies the performance of pLPS when there are negative correlations among the predictor variables and the true model involves many patterns.

We compare pLPS with three other methods:

• Logic Regression

• Stepwise Penalized Logistic Regression (SPLR)

• Random Forest (RF)

The number of trees and number of leaves in Logic Regression are selected by five-fold cross validation. The smoothing parameter in SPLR is also selected by five-fold cross validation, while the model size is selected by BIC.

Simulation Example 1

In our first example, 400 iid Bernoulli (0.5) random variables were simulated. The sample size is 700 and the logit function is

One hundred data sets were generated according to this model and analyzed by the four methods described above.
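A data set of this shape can be generated as follows. The coefficients in the logit below are hypothetical, since the exact values are not reproduced in this excerpt, but the pattern structure (main effect X₅₀ and interactions X₁₅₀X₂₅₀, X₂₅₁X₂₅₂) matches the description.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 700, 400
X = rng.integers(0, 2, size=(n, p))  # 400 iid Bernoulli(0.5) predictors

# Hypothetical coefficients; 0-based columns 49, 149, 249, 250, 251
# correspond to X_50, X_150, X_250, X_251, X_252.
f = (-2.0 + 2.0 * X[:, 49]
     + 3.0 * X[:, 149] * X[:, 249]
     + 3.0 * X[:, 250] * X[:, 251])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-f)))  # Bernoulli(logit = f)
print(X.shape, y.shape)
```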

The table below summarizes the results of the 100 trials. For each method, the entries count how often each true pattern was selected, with the selection counts for the individual variables in parentheses; any pattern other than X₅₀, X₁₅₀X₂₅₀, or X₂₅₁X₂₅₂ is taken to be false. For Random Forest, the last column counts the total number of false variables selected in the 100 trials. Any variable other than the five in the logit function is false.

| **Methods** | **X₅₀** | **X₁₅₀X₂₅₀** | **X₂₅₁X₂₅₂** | **False patterns (Variables)** |
| --- | --- | --- | --- | --- |
| pLPS | 94 (100) | 99 (99,99) | 96 (97,97) | 153 |
| Logic | 100 (100) | 70 (88,91) | 65 (84,90) | 190 |
| RF | NA (100) | NA (96,97) | NA (94,96) | (517) |
| SPLR | 100 (100) | 97 (100,97) | 91 (100,98) | 712 |

On this example, pLPS selects all three patterns almost perfectly and generates the fewest false patterns. Logic Regression does not do well on the size-two patterns and selects slightly more false patterns. Random Forest does well in selecting the important variables but also selects many false variables. (If we change the criterion for declaring that Random Forest has selected a variable to the “top eight” or “top five,” we reduce the number of false variables but also reduce the variable counts.) SPLR matches pLPS in selecting the true patterns, but selects many more false patterns.

Simulation Example 2

Example 2 studies the behavior of pLPS on a large data set with strong correlations among neighboring predictors. The binary predictors are obtained by thresholding correlated latent variables Z_i, setting X_i = 1 if Z_i exceeds a threshold and X_i = 0 otherwise, for each i.
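Correlated binary predictors of this kind can be generated by thresholding a Gaussian latent process. The paper's exact latent model is not reproduced in this excerpt; the AR(1) structure with parameter rho below is an assumption for illustration, and the function name is ours.

```python
import numpy as np

def correlated_binary(n, p, rho, seed=0):
    """Binary predictors with correlation among neighbours, obtained by
    thresholding an AR(1) Gaussian latent process at zero.  Correlation
    between columns i and j decays as rho**|i-j|."""
    rng = np.random.default_rng(seed)
    Z = np.empty((n, p))
    Z[:, 0] = rng.standard_normal(n)
    for i in range(1, p):
        Z[:, i] = rho * Z[:, i - 1] + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    return (Z > 0).astype(int)

X = correlated_binary(300, 50, 0.7)
# Neighbouring columns agree more often than chance; distant ones do not.
print((X[:, 0] == X[:, 1]).mean(), (X[:, 0] == X[:, 40]).mean())
```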

The simulation was performed 50 times; each run is quite time-consuming. We could not run Logic Regression on this example, as the dimensions exceed the limit of that code.

The table below summarizes the results. pLPS misses the pattern X₁₀₀₀X₃₀₀₀ twice but selects the remaining patterns perfectly, and generates a smaller number of false patterns than the other methods. In Random Forest, we declared a variable to be selected if it was ranked in the top 12; it misses the variables in the pattern X₁₀₀₀X₃₀₀₀ with some frequency. SPLR selects all four patterns perfectly, but at the cost of a large number of spurious patterns. SPLR requires the user to set the maximum number of parameters allowed in the model, and selects the actual number by BIC. We set this maximum to 20, and it was reached on all 50 runs. (The maximum is still reached on every run when we set this parameter to 50.)

| **Methods** | **Main 1** | **Main 2** | **X₁₀₀₀X₃₀₀₀** | **Second interaction** | **False patterns (Variables)** |
| --- | --- | --- | --- | --- | --- |
| pLPS | 50 (50) | 50 (50) | 48 (48,50) | 50 (50,50) | 278 |
| RF | NA (50) | NA (50) | NA (28,37) | NA (50,50) | (335) |
| SPLR | 50 (50) | 50 (50) | 50 (50,50) | 50 (50,50) | 800 |

Simulation Example 3

Example 3 studies the behavior of pLPS3 on a large data set, with sample size

The logit function is

This simulation was performed 50 times. As we can see from the table below, pLPS3 selects the true patterns with high frequency, while Logic Regression and Random Forest struggle with the second- and third-order patterns.

| **Methods** | **Main 1** | **Main 2** | **Size-two pattern** | **Size-three pattern** | **False patterns (Variables)** |
| --- | --- | --- | --- | --- | --- |
| pLPS3 | 47 (50) | 50 (50) | 47 (50,50) | 47 (50,49,48) | 204 |
| Logic | 50 (50) | 50 (50) | 34 (43,44) | 30 (50,44,41) | 151 |
| RF | NA (50) | NA (50) | NA (36,40) | NA (49,47,49) | (279) |
| SPLR | 50 (50) | 50 (50) | 45 (49,50) | 50 (50,50,50) | 554 |

As in the previous examples, SPLR does well at selecting the important patterns but also selects many false patterns.

Simulation Example 4

Simulation 4 studies the performance of pLPS when there are some negative correlations among the covariates and the number of true patterns is large. As in Example 2, the binary predictors are obtained by thresholding latent variables Z_i, setting X_i = 1 if Z_i exceeds a threshold and X_i = 0 otherwise, for each i.

Among the three interaction terms, the first had variables with positive correlation, the second had variables with negative correlation, and the third had independent variables. The simulation was performed 100 times. (Logic Regression could again not be run. Although the dimensions did not exceed the limit of that code, the number of true patterns did.) The table below summarizes the results.

| **Methods** | **Main effects average**^∗ | **Interaction 1** | **Interaction 2** | **Interaction 3** | **False patterns (Variables)** |
| --- | --- | --- | --- | --- | --- |
| pLPS | 96 (100) | 98 (100,100) | 98 (98,100) | 99 (100,100) | 320 |
| RF | NA (99) | NA (96,100) | NA (87,72) | NA (94,89) | (1268) |
| SPLR | 100 (100) | 97 (100,100) | 82 (100,100) | 97 (100,100) | 1017 |

^∗The average of X₁, X₃, X₁₀, X₂₀₁, X₂₁₀, X₂₂₀ and X₂₃₀.

Summary

Logic Regression cannot handle very large data sets and does not reliably identify the interaction terms. Random Forest does not provide an explicit model of the interactions. It frequently scores well, but can perform poorly if the signal is not strong enough. SPLR scores well at selecting the right patterns, but selects too many false patterns. By contrast, pLPS usually selects the right patterns without adding too many false patterns, regardless of the size of the problem, the number of true patterns or the signs of correlations.

Our partitions are selected according to the natural order of variables in these simulation examples. If the number of variables in each partition is 200, the first 200 variables will be in the first partition, the next 200 variables will be in the second partition, and so on. If the variables are permuted, resulting in a different partitioning, we do not expect the results to be greatly affected. All possible higher-order patterns are considered in the first (screening) stage of the method, regardless of partitioning. A significant effect should survive the first stage regardless of how the partitioning is performed. To verify this claim, we performed a random permutation on the predictor variables in simulation Example 4 in all 100 runs. Among all the patterns selected in the original partitioning, 82% were still selected after the permutation. Although this figure is on the low side, it can be accounted for by the presence of noise patterns. If we focus on the ten most important patterns in each run, then from the 1000 considerations (10 patterns × 100 runs), the original partition and its permuted counterpart yield results that agree 95% of the time. To summarize: Although Simulation Example 4 is a complicated case with negative correlations and many important variables, the final results are not affected greatly by a shuffling of the first-stage partitions. We would expect similar results for the other examples discussed in this article.

The gene expression barcode data

With current microarray technology, we can measure thousands of RNA transcripts at one time, a capability that allows for richer characterization of cells and tissues. However, feature characteristics such as probe sequence can cause the observed intensity to be far from the actual expression. Although this “probe effect” is significant, it is consistent across different hybridizations: the effect is quite similar when comparing intensities of different hybridizations for the same gene. Therefore, the majority of microarray data analyses use relative expression rather than absolute expression. To overcome this limitation in measurement, the gene expression barcode (GEBC) was developed; it converts each expression measurement into a binary call indicating whether or not the gene is expressed.

In our first analysis, we took all normal tissues as “controls” and all non-breast tumor tissues as “cases”. In the second analysis, we study the survival time of breast cancer patients after dichotomization, defining subjects with survival time less than 5 years as “cases” and those with survival time longer than 10 years as “controls”.

We apply pLPS to both data sets with 7,654 genes, evaluating its variable selection performance against the knowledge base in the literature. To compare the performance of pLPS with the alternative methods discussed in the Simulation section, the number of predictor genes must be reduced further, because Logic Regression cannot handle more than 1,000 variables. A screening step, described below, was therefore applied before running the alternative methods.

Non-Breast Cancer data

In this analysis, all normal and non-breast cancer tissues are used. Breast tumors were excluded because no normal breast tissue was available. The data set contains 503 normal tissues and 70 cancer tissues, giving a malignancy rate of 12.2%.

The model fitted by pLPS on this data with 7,654 genes is shown in (10). Five size-two interactions are selected.

Most of these genes are known to be related to one or more types of cancer. For example, ERBB3 is very important in the development of breast cancer.

To reduce the number of predictor genes to a size solvable by the alternative methods, we fitted a simple logistic regression on each gene and kept the most significant genes (p-value <10^{−8}). This step yields 636 genes. Although this screening step loses many genes that could potentially be helpful in prediction, it must be performed in order to apply the alternative methods. For a fair comparison, we run all methods on this screened data set.
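For a binary gene call and a binary outcome, the single-gene logistic regression reduces to a likelihood-ratio test on a 2×2 table, so the screening step can be sketched without any fitting machinery. The function names below are ours, not from the analysis code.

```python
import numpy as np
from math import erfc, sqrt, log

def _bern_ll(p, s, n):
    # Bernoulli log likelihood with success probability p,
    # s successes out of n trials (0 by convention for degenerate p).
    if p in (0.0, 1.0):
        return 0.0
    return s * log(p) + (n - s) * log(1 - p)

def lr_pvalue(g, y):
    """Likelihood-ratio test p-value for a single binary gene g against
    a binary outcome y, equivalent to simple logistic regression with
    one 0/1 covariate (chi-square with 1 degree of freedom).  Uses
    chi2_1 survival function sf(x) = erfc(sqrt(x/2))."""
    n = len(y)
    ll0 = _bern_ll(y.mean(), y.sum(), n)
    ll1 = sum(_bern_ll(y[g == v].mean(), y[g == v].sum(), (g == v).sum())
              for v in (0, 1) if np.any(g == v))
    stat = 2.0 * (ll1 - ll0)
    return erfc(sqrt(max(stat, 0.0) / 2.0))

def screen_genes(X, y, cutoff=1e-8):
    # Keep genes whose single-gene p-value is below the cutoff.
    return [j for j in range(X.shape[1]) if lr_pvalue(X[:, j], y) < cutoff]

# Toy check: the outcome equals gene 0, so only gene 0 should survive.
rng = np.random.default_rng(7)
X = rng.integers(0, 2, size=(200, 5))
y = X[:, 0].copy()
print(screen_genes(X, y))
```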

The table below summarizes the results.

| **Methods** | **# Gene** | **# Para** | **q** | **Total** | **AUC** |
| --- | --- | --- | --- | --- | --- |
| pLPS | 9.2 | 6.6 | **2.0** | **17.8** | **0.982** |
| pLPS3 | **8.4** | 6.4 | 3.0 | **17.8** | 0.945 |
| Logic | 14.0 | **5.2** | 5.0 | 24.2 | 0.956 |
| SPLR | 17.2 | 20.6 | 5.6 | 43.4 | 0.962 |

“Total” sums the number of selected genes, the number of non-zero coefficients in the model, and the highest order of interactions. AUC indicates the area under the ROC curve.

Breast cancer survival time

The survival of breast cancer patients depends on many factors, such as grade, stage and oestrogen-receptor status. In this section we study the possible genetic effects using the gene expression barcode data. We denote patients who lived less than 5 years after diagnosis as “cases” and patients who lived more than 10 years after diagnosis as “controls.” Patients with a censored death time less than 10 years and patients who died between 5 and 10 years are excluded. The purpose of this step is not to provide a more homogeneous subset. Rather, we are converting the survival data into a binary outcome, because our method is developed with binary outcomes in mind. After this step, the remaining pool contains 243 patients, among which 80 are cases. The five-year death rate is 80/243 = 32.9%.
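The dichotomization rule just described can be written down directly. The sketch below is our illustration of the stated rule (cases: observed death before 5 years; controls: survival beyond 10 years; everyone else excluded), with names of our choosing.

```python
import numpy as np

def dichotomize_survival(time_years, event):
    """Convert survival data to a binary outcome: cases died within
    5 years of diagnosis, controls survived beyond 10 years, and
    patients censored before 10 years or dying between 5 and 10 years
    are excluded.  Returns (keep_mask, y) with y = 1 for cases."""
    t = np.asarray(time_years, dtype=float)
    d = np.asarray(event, dtype=bool)  # True = death observed
    case = d & (t < 5)                 # death observed within 5 years
    control = t > 10                   # known to have survived > 10 years
    keep = case | control              # everyone else is excluded
    return keep, case.astype(int)[keep]

# Five patients: died at 2y, died at 7y, censored at 12y,
# censored at 3y, died at 11y.
keep, y = dichotomize_survival([2.0, 7.0, 12.0, 3.0, 11.0],
                               [True, True, False, False, True])
print(list(keep), list(y))
```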

Formula (11) shows the model fitted by pLPS on this data set with 7,654 genes. The model contains one main effect and four size-two interactions.

Among the selected genes, CDC20, CREB1, STAT5A and MAPT are known to be related to breast cancer.

As in the previous subsection, we use a screening step to select the most important genes (p-value <10^{−3}); this step yielded 592 genes. The cutoff used here is much larger than in the non-breast cancer analysis, but is still small enough to rule out most genes.

The table below summarizes the results.

| **Methods** | **# Gene** | **# Para** | **q** | **Total** | **AUC** |
| --- | --- | --- | --- | --- | --- |
| pLPS | 10.0 | 6.8 | **2.0** | 18.8 | **0.824** |
| pLPS3 | 10.2 | 6.6 | 3.0 | 19.8 | 0.780 |
| Logic | **4.4** | **2.6** | 3.8 | **10.8** | 0.721 |
| SPLR | 19.4 | 20.6 | 5.0 | 45.0 | 0.793 |

“Total” sums the number of selected genes, the number of non-zero coefficients, and the highest order of interactions. AUC indicates the area under the ROC curve.

It is interesting to study the overlap between the sets of genes selected by these different methods, shown in the tables below. Under the looser screening cutoff used for the breast cancer survival data (p-value <10^{−3}, compared to <10^{−8} for the non-breast cancer data), many genes pass the screen with similar significance. It is easier for these methods to replace one gene with another, because they are all of similar importance. Therefore, it is not surprising that the sets of genes selected by different methods do not overlap strongly with each other.

**Non-Breast Cancer data**

|  | **pLPS** | **pLPS3** | **Logic** | **SPLR** |
| --- | --- | --- | --- | --- |
| pLPS | 9.2 | 2.6 | 1.6 | 2.0 |
| pLPS3 |  | 8.4 | 1.6 | 2.2 |
| Logic |  |  | 14.0 | 1.6 |
| SPLR |  |  |  | 17.2 |

**Breast cancer survival data**

|  | **pLPS** | **pLPS3** | **Logic** | **SPLR** |
| --- | --- | --- | --- | --- |
| pLPS | 10.0 | 4.0 | 1.0 | 5.2 |
| pLPS3 |  | 10.2 | 1.0 | 4.8 |
| Logic |  |  | 4.4 | 1.6 |
| SPLR |  |  |  | 19.4 |

Off-diagonal elements show the number of common genes selected by the methods in the corresponding row and column. Diagonal elements show the number of genes selected by the method in the corresponding row (or column). Numbers are averages over the five-fold cross validation.

Conclusions

We have described a partitioned version of the LASSO-Patternsearch algorithm (named pLPS) that extends the range of this method to data sets with a higher number of predictors, and allows parallel execution of much of the computation. We show through simulations that pLPS is better than competing methods in selecting the correct variables and patterns while controlling for the number of false patterns in the selected model. By testing on two gene expression data sets, we also show that pLPS gives smaller models with much better prediction accuracy than competing approaches.

Two smoothing parameters with a modified tuning criterion are used in pLPS and pLPS3 (in contrast to the single parameter used in LPS). For pLPS, we impose a penalty on the difference between the number of main effects and the number of interactions; for pLPS3, a penalty on the differences among the numbers of main effects, size-two interactions, and size-three interactions. These penalties eliminate the extreme cases, in which only main effects or only interactions arise in the LASSO step, that the original, unmodified criterion too often produces. On the other hand, if an extreme case is the truth, the LASSO step will generate some false patterns, but the parametric step tends to eliminate them and thus select the correct model.

Competing interests

The authors declare that they have no competing interests.

Author’s contributions

WS conceived the method, designed and implemented the pLPS algorithm, analyzed the data and drafted the manuscript. GW supervised the project. RAI provided the data and supervised the project. HCB participated in the initial discussion of the method. SJW designed and implemented the algorithm for the minimization problem and participated in the writing of the manuscript. All authors reviewed and approved the final manuscript.

Acknowledgements

WS - Research supported in part by NIH Grant EY09946 and NSF Grant DMS-0604572. GW - Research supported in part by NIH Grant EY09946, NSF Grant DMS-0906818 and ONR Grant N0014-09-1-0655. RAI - Research supported in part by NIH Grant GM083084. HCB - Research supported in part by NIH Grant GM083084, NIH Grant EY09946 and NSF Grant DMS-0604572. SJW - Research supported in part by NSF Grants DMS-0914524 and DMS-0906818.