Abstract
Background
Selecting genes and pathways indicative of disease is a central problem in computational biology. This problem is especially challenging when parsing multidimensional genomic data. A number of tools, such as L_{1}norm based regularization and its extensions elastic net and fused lasso, have been introduced to deal with this challenge. However, these approaches tend to ignore the vast amount of a priori biological network information curated in the literature.
Results
We propose the use of graph Laplacian regularized logistic regression to integrate biological networks into disease classification and pathway association problems. Simulation studies demonstrate that the performance of the proposed algorithm is superior to elastic net and lasso analyses. Utility of this algorithm is also validated by its ability to reliably differentiate breast cancer subtypes using a large breast cancer dataset recently generated by the Cancer Genome Atlas (TCGA) consortium. Many of the proteinprotein interaction modules identified by our approach are further supported by evidence published in the literature. Source code of the proposed algorithm is freely available at http://www.github.com/zhandong/LogitLapnet webcite.
Conclusion
Logistic regression with graph Laplacian regularization is an effective algorithm for identifying key pathways and modules associated with disease subtypes. With the rapid expansion of our knowledge of biological regulatory networks, this approach will become more accurate and increasingly useful for mining transcriptomic, epigenomic, and other types of genome wide association studies.
Introduction
Technologies for high throughput genetic profiling have revolutionized the study of human development and disease. Expression profiles spanning the entire human genome not only allow investigators to better understand disease subtypes [1], but also define new categories associated with sensitivity to pharmacologic treatment [2,3] and other clinical outcomes [4,5]. A central problem in these genomic studies is to construct an accurate predictive model and delineate specific genes or pathways driving a phenotype. Logistic regression is widely used for classification [69]. The number of genes, however, in highthroughput studies are often much larger than the number of specimens in a given study. This limitation causes instability in the algorithms used to select driver genes and poor performance of predictive models. The lasso algorithm for logistic regression was introduced to address these problems and perform feature selection through L_{1}norm regularization [10]. However, the number of variables selected by the lasso is bounded by the number of observations in an experiment. Furthermore, correlated variables are rarely selected as part of the predictive model at the same time. Various extensions of the Lasso algorithm, such as elastic net, pelora, grouped lasso and fused lasso introduce grouping or smoothness regularization terms to address these limitations [1114]. Of these, the fused lasso and elastic net have been successfully applied to a largest number of gene expression and genome wide association studies [1518]. Both fused Lasso and elastic net allow correlated genes or neighboring genes on a chromosome to be selected into a predictive model together. However, these algorithms tend to ignore functional interactions between individual gene products documented in the scientific literature. Integration of network information in the gene marker identification has shown to outperform methods without network information by Chuang et al [19]. Specifically, their networkedbased scoring and greedy search algorithm identified more robust gene markers with better prediction accuracy on metastasis status of breast cancer patients in two cohorts. In recent years, vast amounts of data detailing biologic networks has been organized into searchable databases. For example, BioGRID documents protein and genetic interactions from more than 39,991 publications [20]. The KEGG pathway database similarly curates molecular interactions and relational networks representing systemic functions at the level of both the cell and organism [21]. To incorporate biological network information into regression models, a networkconstrained regularization algorithm has been previously proposed for use with linear regression [22]. Use of this networkconstrained algorithm has been shown to outperform both lasso and elastic net analyses executed independently of biologic input.
Classification algorithms integrating network structure information have been proposed in other settings. For example, a network constrained support vector machine has been proposed to analyze functional magnetic resonance imaging data [23] and cancer microarray data [2426]. Network based prior has also been demonstrated to improve variable selection accuracy under the Bayesian inference framework [27,28].
Here, we propose a graphical Laplacian network regularized logistic regression method following the framework established by Li et al. [22]. We hypothesize that the integration of biological networks, such as ProteinProtein interactions, will improve prediction accuracy and variable selection in logistic models. To validate use of the proposed algorithm, we studied its theoretical properties and compared its performance to L_{1}norm regularized logistic regression and elastic net logistic regression on both simulated and real biologic data. Lastly, we demonstrate the utility of the proposed algorithm by using it to differentiate breast cancer subtypes and delineate biologic network modules associated with the triple negative breast cancer (TNBC) subtypes.
Materials and methods
Graph laplacian regularized logistic regression model
Suppose that the data set has n observations with p genes. Let y = (y_{1}, ..., y_{n})^{T }be the response with y_{i }∈ {0, 1} and X = [x_{1}, ..., x_{p}] be the matrix of biomarkers measured on n samples with x_{j }= (x_{1j}, ..., x_{nj})^{T }for j = 1, ..., p genes. Without loss of generality, we can assume that each gene is standardized. The binary clinical variable, y, can be predicted using the logistic model:
The parameter β can be estimated by maximizing the log likelihood function of the logistic model. However, it is well known that this estimation procedure performs poorly for both prediction purposes and variable selection when p ≫ n. Although various sparse parameter estimation procedures have been introduced to address these problems, these approaches tend to generate disconnected biomarkers that are rarely interpretable.
To incorporate biological network information into the model estimation procedure, we adopted the networkconstrained regularization framework proposed by Li et al [22]. Given a biological network G = (V, E), where V is the set of genes that correspond to p predictors, E is the adjacency matrix, and e_{uv }= 1 if there exists an edge between u and v, otherwise e_{uv }= 0. The normalized graph Laplacian matrix L for G can be defined by
where D = diag(E·1_{p}) is a degree matrix with the diagonal elements equal to the degrees for each node in G. For any fixed nonnegative regularization parameters λ and α, we define the logistic graph Laplacian net (LogitLapnet) criteria as:
Where
d_{u}, d_{v }are the degrees of nodes u and v respectively. In (3), X_{i }is the ith row of the matrix X. The first term in equation (3) is the negative log likelihood function of the logistic model. The second term is an L_{1}norm penalization on β, which encourages sparsity on the coefficients. The last term is a generalized L_{2}norm penalty using the graph Laplacian matrix, which encourages smoothness on coefficients of genes that are connected in the biological network. If we set α = 1, the LogitLapnet criteria is equal to the simple lasso logistic regression. When L = I which is identity matrix, the LogitLapnet criteria becomes the elastic net logistic regression.
The model estimation can be formulated into a convex optimization problem:
To solve problem (4) we used CVX, a package for specifying and solving convex programs [29,30]. The convexity property of LogitLapnet guarantees an optimal solution using any convex optimization solver.
Theoretical properties of logitlapnet
To study the behavior of the proposed method, we analyzed the theoretical properties of LogitLapnet.
Lemma 1 If , then for any α < 1 and λ > 0.
The proof of this lemma is given in the additional file 1. This lemma states that if two predictors are identical, the coefficients of these two predictors will be the same. In the L_{1}norm penalized logistic regression, only one predictor is selected.
Additional file 1. Proofs of lemma and theorem. This file includes the mathematics proofs on lemma 1, 2 and theorem 1.
Format: PDF Size: 241KB Download file
This file can be viewed with: Adobe Acrobat Reader
Theorem 1 Given data (y, X) and parameters (λ, α), let (λ, α) be the LogitLapnet estimator for problem (4). If and e_{ij }= 1, define , then
where is the sample correlation of x_{i }and x_{j }.
The upper bound in (5) provides a quantitative description for the grouping effect of LogitLapnet on the network structure. For two highly correlated genes that are connected in a biological network, the difference on the estimated coefficient is almost zero. As α goes to 1, the Lapnet becomes lasso logistic regression and the difference becomes unbounded.
Lemma 2 The maximum value of λ in problem (4) with satisfies
The maximum value of λ is reciprocal to α. It is clear that when α = 1, the LogitLapnet criteria becomes Lasso and a small penalization can produce an empty model. When α = 0, the penalization becomes much larger to generate an empty model. In practice, Lemma 2 provides a search guidance on the regularization path. The proof of Lemma 2 is also given in the additional file 1.
Gene expression profiles and differential gene analysis
Gene expression data from breast cancer specimens profiled by TCGA consortium was used to test the proposed LogitLapnet method. Level III RNASeq data (Illumina HigSeq RNASeqV2 from UNC) from patients with invasive breast carcinoma was obtained from the TCGA data portal https://tcgadata.nci.nih.gov/tcga/ webcite in September, 2012. These data profiled 20501 genes in 806 distinct breast cancer specimens. Normalized read counts were used for all analyses and were logtransformed prior to their use. Genes with normalized read counts less than 10 in more than 90% of patients were excluded from further analyses. The normalized read counts were logtransformed and standardized prior to applying the method.
Of the 806 breast cancers characterized by the TCGA, 261 patients had incomplete information on the three markers used to define TNBC and were excluded. Of the remaining cancers, 85 TNBC and 460 nonTNBC were further separated into training (n = 327; 51 TNBC, 276 nonTNBC) and test (n = 218; 34 TNBC, 184 nonTNBC) sets. A total of 4871 differentially expressed gene products (P ≤ 0.05, ttest; ≥ 1.5 foldchange between TNBC vs. nonTNBC on training) were merged with proteinprotein interaction (PPI) networks. This provided us with 797 genes and PPI networks of 937 interactions available for analysis using our proposed algorithm.
Proteinprotein interaction (PPI) network
To construct the graph Laplacian matrix for testing the LogitLapnet method on breast cancer data, networks of proteinprotein interactions (PPI) identified by twohybrid screening in Homo sapiens were obtained from The Biological General Repository for Interaction Datasets (BioGRID, version 3.2.98) http://www.thebiogrid.org webcite. At the time it was accessed, BioGRID documented 21483 interactions between 7700 genes.
Results and discussion
Simulation studies
We initially used a benchmark simulation proposed by Li to explore the performance of the proposed LogitLapnet algorithm [22]. In brief, a network with 200 distinct transcription factors (TFs) was simulated. Each TF in this simulation regulated 10 genes with a total of 2,200 genes in the simulated network. The clinical variable y was assigned a binary value and was associated with the first four TFs and their target genes. In model I, we assumed that two of the TFs and their targets were positively associated with the clinical variable and the other two TFs and their targets were negatively associated with the clinical variable.
The clinical variable was defined as y = [Prob (y = 1X; β) > ϱ] where ϱ ~ U(0,1). Expression levels for the 200 TFs were then simulated using a standard normal distribution. Each TF and its target genes were jointly distributed as a bivariate normal with correlation of 0.7.
In model II, gene expression levels were simulated similarly to model I except that a TF could be both a transcriptional activator and repressor at the same time. The coefcient vector was defined as:
Model III was similar to model I except that we decreased the association of the target genes on the clinical variable and made the model even sparser.
Model IV was similar to model II in allowing transcription factors to function as both activators and repressors. However, the clinical association of the target genes was decreased.
For each model, we simulated both a training data set as well as an independent test data set of 100 samples. A 10fold crossvalidation procedure was applied to the training data set to identify the optimal tuning parameter. Genes with nonzero coefficient in the estimated model were found to be associated with the clinical variable. The sensitivity and specificity of our variable selection performance was defined as the following:
where the ·_{0 }counts the number of nonzero elements in a vector, is the logical not operator on a vector and.* is the elementwise product.
We repeated the experiment 50 times. Results are summarized for each model in Table 1. We also computed the Bernoulli error loss on the test data set. For all four models, we compared the performance of the proposed algorithm to both the L_{1 }penalized logistic regression and the elastic net algorithm. Our method resulted in much higher sensitivity in identifying associated genes with the same specificity compared to the other two algorithms (Table 1). Our method also gave much smaller MSE compared to the Lasso and elastic net logistic regression. We computed the Receiver Operator Curve on the whole regularization path for each of the algorithm. In all four models, the proposed algorithm demonstrated much higher precision compared to Lasso and elastic net logistic regression (Figure 1).
Table 1. Results of simulations.
Figure 1. Receiver operator curves (ROCs) on the regularization path. ROCs were computed on the regularization path for LogitLapnet, enet, and lasso for all four models (A)(D). LogitLapnet has higher true positive rate and lower false positive rate compared to the other two approaches.
Biomarker identification using logitlapnet in human breast cancers
Breast cancers are clinically categorized according to the expression of several gene products, including the estrogen receptor (ESR1), progesterone receptor (PGR) and human epidermal growth factor receptor 2 (ERBB2). These biomarkers are routinely used not only to define prognosis but also determine treatment. [31,32]. Of the breast cancer subtypes defined by these biomarkers, triple negative breast cancers (TNBC) lacking expression of ESR1, PAR and ERBB2(her2) are the most clinically aggressive. TNBC demonstrate high rates of disease progression and recurrence [33]. Outcomes for this breast cancer subtype are generally poor, largely due to the fact that treatment options for women with TNBC are limited. However, a subgroup of TNBC are highly sensitive to conventional chemotherapy [34].
We chose to next validate use of our method by testing its ability to identify the key biomarkers used to define breast cancer subtypes. To accomplish this goal, we used patterns of gene expression in a large set of breast cancer specimens recently profiled by the Cancer Genome Atlas consortium. The tuning parameter for each method was selected through 10fold crossvalidation using the training data. We observed the same classification accuracy in predicting TNBC tumors (95%) when lasso, elastic net and LogitLapnet were applied to the testing data. The similar efficacy of these 3 algorithms is not surprising since TNBC tumors are dramatically different from the nonTNBCs in terms of their gene expression profiles.
However, there were a number of significant differences observed between the results obtained with the 3 algorithms tested. For example, use of LogitLapnet selected 262 genes, of which > 63% (166 genes) interacted with one another (Figure 2A). In comparison, use of the Lasso algorithm selected only 24 genes, most of which (20 genes) were isolated and were not predicted to interact. Elastic net selected nearly half of the input genes (393 genes) of which 59% (230 genes) were interconnected (Additional File 2). Furthermore, neither Lasso nor elastic net identified the progesterone receptor as a key discriminator for TNBC, despite the fact that this gene product is routinely used to clinically categorize breast cancers. Only LogitLapnet successfully identified each of the three markers used to define TNBC subtype: ESR1, PGR and ERBB2. These results suggest that LogitLapnet is more accurate than either Lasso or elastic net for identifying biomarkers from large multidimensional genomic datasets such as those generated by the TCGA.
Figure 2. Application of the algorithm to identify TNBCassociated genes. Genes and subnetworks of PPI associated with TNBC using TCGA breast cancer data and BioGRID PPI. Comparison of the selected genes from our proposed algorithm with those from lasso and elastic net (A). Genes and their respective subnetworks of PPI recovered by lasso (B) and our proposed Laplacian net algorithm (C). In the networks, genes having association to breast cancer reported in the literature are labeled with larger font.
Additional file 2. Genes identified by elastic net. This file includes a figure of genes and their respective subnetworks of PPI identified by elastic net. Genes with larger font indicates its association to breast cancer reported in the literature.
Format: PDF Size: 4.1MB Download file
This file can be viewed with: Adobe Acrobat Reader
Use of logitlapnet for inferring novel biologic relationships
In addition to identifying each of the 3 key biomarkers used to categorize breast cancer subtypes, we found that our method identified multiple subnetworks of gene expression in the breast cancer specimens profiled by TCGA. These subnetworks potentially reflect relationships with clinical or biologic significance. For example, one of subnetworks we identified includes multiple genes (AR, ESR1, MED1, MED24, RARA, PRAME, and HMGA1) involved in steroid hormone signaling. As demonstrated in Figure 2, LogitLapnet found that this cluster is closely connected to a known breast cancer gene, GATA3. This connection suggests a functional relationship. Evidence to support this relationship could not be found by directly searching the database of proteinprotein interactions used to construct our algorithm. However, at least one report published since our initial analyses has now shown that GATA3 mediates genomic ESR1binding upstream of FOXA1 [35]. This confirms that the integration of genomic and PPI data by our method has the capacity to identify new and otherwise unanticipated relationships with biologic significance.
Enhanced network specificity provided by logitlapnet
Another advantageous feature of LogitLapnet is the potential functional specificity of the subnetworks delineated by our algorithm. Subnetworks identified by LogitLapnet allow investigators to more readily focus on key genes for subsequent downstream functional analyses. For example, the subnetwork connecting AR, ESR1 and RARA identified by our algorithm in TNBC represents a cluster of genes involved in hormone receptor signaling (Figure 2). The subnetwork 2 identified by our method contains the genes PLK1, BCL2, BNIPL, and tumor suppressor p53 binding protein TP53BP2. PLK1, BNIPL and BCL2 are wellknown oncogenes and part of TP53 pathway [36]. The expression of BCL2 has been proposed as a prognostic marker for breast cancer patients. This subnetwork is also interesting, as it predicts a relationships between gene products that have been previously shown to impact the G1S (TP53BP) and G2M cell cycle checkpoints (PLK1). Dysregulation of both G1S and G2M are key hallmarks of human cancer as defined by Weinberg and others. Furthermore, TP53 dysfunction has been previously shown to lead to the overexpression of PLK1 and other gene products important for driving cells with genomic instability through the cell cycle. Thus, the ability of LogitLapnet to detect this relationship underscores its capacity to detect key events in breast and other human cancers. These results convey a clear message that interactions between genes in clusters might be biologically relevant.
In contrast, alternative algorithms such as elastic net identify fewer subnetworks containing larger number of genes. This means that subnetworks identified by LogitLapnet allow investigators to more readily focus on key targets for subsequent downstream functional validation. Although network modularization methods can be used to further dissemble networks defined by elastic net into smaller modules, these smaller modules may not reflect direct interactions between their individual components. This is because elastic net defines groups of coexpressed genes rather than networks of functionally interacting gene products. Alternatively, use of LogitLapnet can be considered to place gene correlations in the context of biologic function. Thus, its use can be reasonably anticipated to define relationships that are more likely to be biologically and clinically relevant.
In summary, the proposed algorithm identified set of genes associated with breast tumors. In addition, integration of PPI in the algorithm enables us to recover the genes not only in association to the breast cancer subtypes but also their interacting partners which are also breast cancer related. Most importantly, from the comparison with elastic net and lasso, our method selected a reasonable size of genes and is the only algorithm capable of identified all three marker genes in defining TNBC.
Conclusion
We have developed a graph Laplacian regularized logistic regression model for selecting the genes or network modules associated with clinical variables from gene expression profiles. Through simulation studies, we have demonstrated that our approach is much more sensitive for identifying clinically relevant genes and network modules. We have presented a case study on network module identification for triple negative breast cancer subtype using mRNAseq data. Our results indicate that LogitLapnet is a superior algorithm compared to Lasso and elastic net in disease gene and network module selection. Further work to biologically validate our predicted modules is needed to gain a more complete picture of the regulatory process in the TNBC subtype. Beyond genomics, there are many potential applications of LogitLapnet to utilize the rich information provided by network structure, such as metabolomics and proeteomics studies. In conclusion, our work developing the network structure regularized logistic regression model has many implications and has provided a new tool for research in genomic studies.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
Genevera Allen and Zhandong Liu developed the model and wrote the paper. Wen Zhang, YingWooi Wan and Kaifang Pang collected the real application data and conducted the application. Matthew Anderson provided biologic interpretation of networks and assisted with writing the paper. Zhandong Liu designed the simulation and real applications. Wen Zhang and Zhandong Liu realize and analyze the simulations.
Acknowledgements
This work is supported in part through the Collaborative Advances in Biomedical Computing Seed Funding Program at the Ken Kennedy Institute for Information Technology at Rice University supported by the John and Ann Doerr Fund for Computational Biomedicine and through the Center for Computational and Integrative Biomedical Research Seed Funding Program at Baylor College of Medicine. GA is also partially supported by NSF DMS1209017. ZD is supported by the Houston Bioinformatics Endowment and NSF DMS1263932.
Declarations
The publication fee is supported by the Houston Endowment.
This article has been published as part of BMC Genomics Volume 14 Supplement 8, 2013: Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM 2013): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S8.
References

Golub TR, Slonim DK, Tamayo P, Huard C: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.
science 1999, 286(5439):531537. PubMed Abstract  Publisher Full Text

Natsoulis G, El Ghaoui L, Lanckriet GRG, Tolley AM, Leroy F, Dunlea S, Eynon BP, Pearson CI, Tugendreich S, Jarnagin K: Classification of a large microarray data set: algorithm comparison and analysis of drug signatures.
Genome research 2005, 15(5):724736. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Holleman A, Cheok MH, den Boer ML: GeneExpression Patterns in DrugResistant Acute Lymphoblastic Leukemia Cells and Response to Treatment.
New England Journal of Medicine 2004, 351(6):533542.
[PMID: 15295046]
PubMed Abstract  Publisher Full Text 
Veer Lv, Dai H, Van De Vijver MJ, He YD: Gene expression profiling predicts clinical outcome of breast cancer.
nature 2002, 415(6871):530536. PubMed Abstract  Publisher Full Text

Wang Y, Klijn J, Zhang Y, Sieuwerts AM, Look MP: Geneexpression profiles to predict distant metastasis of lymphnodenegative primary breast cancer.
Lancet 2005, 365:671679. PubMed Abstract  Publisher Full Text

Zhu J, Hastie T: Classification of gene microarrays by penalized logistic regression.
Biostatistics 2004, 5(3):427443. PubMed Abstract  Publisher Full Text

Shen L, Tan EC: Dimension reductionbased penalized logistic regression for cancer classification using microarray data.
IEEE/ACM Trans Comput Biol Bioinform 2005, 2:166175. Publisher Full Text

Liao JG, Chin KV: Logistic regression for disease classification using microarray data: model selection in a large p and small n case.
Bioinformatics 2007, 23(15):19451951. PubMed Abstract  Publisher Full Text

Bootkrajang J, Kabán A: Classification of mislabelled microarrays using robust sparse logistic regression.
Bioinformatics 2013, 29(7):870877. PubMed Abstract  Publisher Full Text

Tibshirani R: Regression Shrinkage and Selection Via the Lasso.
Journal of the Royal Statistical Society, Series B 1994, 58:267288.

Zou H, Hastie T: Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2005, 67(2):301320. Publisher Full Text

Dettling M, Bühlmann P: Finding predictive gene groups from microarray data.
Special Issue on Multivariate Methods in Genomic Data Analysis 2004, 90:106131.

Yuan M, Lin Y: Model selection and estimation in regression with grouped variables.
Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2006, 68:4967. Publisher Full Text

Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K: Sparsity and smoothness via the fused lasso.
Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2005, 67:91108. Publisher Full Text

Wu TT, Chen YF, Hastie T, Sobel E, Lange K: Genomewide association analysis by lasso penalized logistic regression.
Bioinformatics 2009, 25(6):714721. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kim S, Xing EP: Statistical estimation of correlated genome associations to a quantitative trait network.
PLoS genetics 2009, 5(8):e1000587. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Tian Z, Zhang H, Kuang R: Sparse Group Selection on Fused Lasso Components for Identifying GroupSpecific DNA Copy Number Variations.

Liu J, Huang J, Ma S, Wang K: Incorporating group correlations in genomewide association studies using smoothed group Lasso.
Biostatistics 2013, 14(2):205219. PubMed Abstract  Publisher Full Text

Chuang HY, Lee E, Liu YT, Lee D, Ideker T: Networkbased classification of breast cancer metastasis.
Mol Syst Biol 2007, 3:140. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets.
Nucleic Acids Research 2006, 34(suppl 1):D535D539. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M: KEGG for integration and interpretation of largescale molecular data sets.
Nucleic Acids Research 2012, 40(D1):D109D114. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Li C, Li H: Networkconstrained regularization and variable selection for analysis of genomic data.
Bioinformatics 2008, 24(9):11751182. PubMed Abstract  Publisher Full Text

Grosenick L, Klingenberg B, Katovich K, Knutson B, Taylor JE: Interpretable wholebrain prediction analysis with GraphNet.
NeuroImage 2013, 72:304321. PubMed Abstract  Publisher Full Text

Chen L, Xuan J, Riggins R, Clarke R, Wang Y: Identifying cancer biomarkers by networkconstrained support vector machines.
BMC systems biology 2011, 5:161. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Zhu Y, Shen X, Pan W: Networkbased support vector machine for classification of microarray samples.
BMC Bioinformatics 2009, 10(Suppl 1):S21. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Pan W, Xie B, Shen X: Incorporating predictor network in penalized regression with application to microarray data.
Biometrics 2010, 66(2):474484. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Stingo FC, Chen YA, Tadesse MG, Vannucci M: Incorporating biological information into linear models: a Bayesian approach to the selection of pathways and genes.
The Annals of Applied Statistics 2011, 5(3):19782002. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hill S, Neve R, Bayani N, Kuo WL, Ziyad S, Spellman P, Gray J, Mukherjee S: Integrating biological knowledge into variable selection: an empirical Bayes approach with an application in cancer biology.
BMC bioinformatics 2012, 13:94. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Grant M, Boyd S: Graph implementations for nonsmooth convex programs. [http://www.stanford.edu/~boyd/papers/pdf/graph_dcp.pdf] webcite
In Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences Edited by Blondel V, Boyd S, Kimura H. SpringerVerlag Limited; 2008, 95110.

CVX Research I: CVX: Matlab Software for Disciplined Convex Programming, version 2.0. [http://cvxr.com/cvx] webcite
2012.

Siegel R, Naishadham D, Jemal A: Cancer statistics, 2013.
CA: A Cancer Journal for Clinicians 2013, 63:1130. PubMed Abstract  Publisher Full Text

The TCGA: Comprehensive molecular portraits of human breast tumours.
Nature 2012, 490(7418):6170. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Stevens KN, Vachon CM, Couch FJ: Genetic susceptibility to triplenegative breast cancer.
Cancer Res 2013, 73(7):20252030. PubMed Abstract  Publisher Full Text

Foulkes WD, Smith IE, ReisFilho JS: Triplenegative breast cancer.
N Engl J Med 2010, 363(20):19381948. PubMed Abstract  Publisher Full Text

Theodorou V, Stark R, Menon S, Carroll JS: GATA3 acts upstream of FOXA1 in mediating ESR1 binding by shaping enhancer accessibility.
Genome Research 2013, 23:1222. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N, Stratton MR: A census of human cancer genes.
Nat Rev Cancer 2004, 4(3):177183. PubMed Abstract  Publisher Full Text  PubMed Central Full Text