Abstract
Background
The recent advancement in array CGH (aCGH) research has significantly improved tumor identification using DNA copy number data. A number of unsupervised learning methods have been proposed for clustering aCGH samples. Two of the major challenges for developing aCGH sample clustering are the high spatial correlation between aCGH markers and the low computing efficiency. A mixture hidden Markov model based algorithm was developed to address these two challenges.
Results
The hidden Markov model (HMM) was used to model the spatial correlation between aCGH markers. A fast clustering algorithm was implemented and real data analysis on glioma aCGH data has shown that it converges to the optimal cluster rapidly and the computation time is proportional to the sample size. Simulation results showed that this HMM based clustering (HMMC) method has a substantially lower error rate than NMF clustering. The HMMC results for glioma data were significantly associated with clinical outcomes.
Conclusions
We have developed a fast clustering algorithm to identify tumor subtypes based on DNA copy number aberrations. The performance of the proposed HMMC method has been evaluated using both simulated and real aCGH data. The software for HMMC in both R and C++ is available in ND INBRE website http://ndinbre.org/programs/bioinformatics.php. webcite
Background
Tumor progression is a complicated biological process that comes with enormous genetic and molecular changes, such as chromosome aberration, gene mutations, and activation or inhibition of transcriptional pathways. The abnormal genetic changes often show high variability even among tumors within the same histopathological subtype and anatomical origin, which may lead to variation in clinical outcomes. For example, a subtype of colorectal cancer, hereditary nonpolyposis (HNPCC), is characterized by dominant genetic defects in DNA mismatch repair pathway and HNPCC patients have higher 5year survival than other subtypes of colorectal cancer patients[1]. When genetic aberration is specific to a subset of tumors, it provides potent targets for chemotherapy. Examples include trastuzumab and lapatinib for treating HER2positive breast cancers [2], tamoxifen for treating ERpositive breast cancers[3,4], and gefitinib and erlotinib for nonsmall cell lung cancer with EGFR mutations [59].
DNA copy number aberration is a striking feature of tumor cell. During tumor progression, chromosome is subjected to dramatic change in that DNA segments are amplified, deleted, or translocated. Comparative genomic hybridization (CGH) technology has been a widely used tool for detecting changes in chromosome fragments. The advancement of array technology has enabled researchers to conduct array CGH (aCGH) study for profiling genomewide chromosome variations using high density array, such as single nucleotide polymorphism (SNP) array that contains from 100K to 3M SNP markers [10]. Such high dimensional DNA copy number data reveals genomic heterogeneity in many cancer types, ensuring biomarker discovery for each genomic subtype at SNP copy number level [11].
Multiple clustering methods, including hierarchical clustering (HC), Naive Bayes, Knearest neighbours, support vector machine, probability modelbased clustering, and nonnegative matrix factorization (NMF), have been developed and applied for aCGH data to identify tumor subtypes based on the DNA copy number aberrations [1215]. We have previously developed a revised version of NMF that showed improved performance for aCGH clustering when testing on three tumor types, nonsmall cell lung carcinoma, colorectal cancer and malignant melanoma [16]. However, all these aforementioned methods fail to account for the spatial correlation between SNPs, and the correlation between adjunct SNPs could be as high as 0.99 for high density SNP array such as Affymetrix 500K. We therefore developed a mixture model based clustering method for tumor subtype classification that uses hidden Markov model (HMM) to account for the spatial correlation in aCGH data.
Shah et al. [17] have proposed a similar HMM based clustering method to account for spatial correlation, but is fundamentally different from our clustering method in a number of ways First, the models are different. They proposed a Bayesian hierarchical model, nonetheless we fit a hidden Markov model to the data directly, therefore it has less unknown variables and decreases the risk of model overfitting. Secondly, they use expectation maximization (EM) like algorithm to estimate variables, and we use maximum likelihood based method that is computationally more efficient. Thirdly, our algorithm automatically finds the optimal number of groups and converges to the optimal grouping. Fourthly, we developed a machine learning clustering algorithm and have implemented it using C++ parallel programming for fast computation. Finally, we preprocess the raw aCGH data by segmenting the chromosome prior to clustering. The segmentation step normalizes aCGH data that usually contains high frequency of intensity noise. We show the performance of the proposed HMMbased clustering (HMMC) method by applying it to simulated data and aCGH glioma data.
Methods
We first describe the data preprocessing procedure, our HMM model and the model fitting for a cluster of aCGH samples. Then we describe the machine learning algorithm that uses HMM models to cluster tumors. Finally, we introduce our fast implementation for the clustering algorithm and the approach to find the optimal number of groups.
aCGH data segmentation
Raw data from Affymetrix SNP array was preprocessed using the open source R package aroma.affymetrix with the default parameters [18]. The resulting SNPlevel copy numbers were segmented using circular binary segmentation (CBS) method within aroma.affymetrix [19]. The break points predicted by CBS at each sample were then summarized across all samples and the segments between pairs of adjunct break points were identified and the mean copy number of each segment was calculated for clustering analysis. Because tumor specimen is often contaminated with normal tissues that are adjacent to them, which would complicate the analysis and introduce false negatives, it is necessary to conduct quality control to remove the normal contamination first. We developed a machine learning algorithm to identify and filter out contaminated samples based on segmented copy number data. To conduct the quality control, we first chose two groups of samples; one with the highest number of copy number alteration regions and the other being normal samples. These two groups of samples served as a training dataset for a Random Forest classifier [20] that was used to distinguish tumors and normal samples and identify signatures for tumor samples. The out of bag error rate of this classifier is usually very low (<5%). The trained classifier was then applied to all aCGH samples and a score of probability to be contaminated was calculated for each sample. Samples with a probability of more than 50% of contamination were excluded from further clustering analysis.
HMM model fitting
We use a mixture HMM model to represent the clusters of tumor samples in that each cluster or subtype of tumors is modelled by a single HMM. Let t = 1,.., T, denote the DNA segments of aCGH data. Suppose there are G clusters in total and there are n_{g }samples in cluster g, where g = 1,..., G. The HMM model for cluster g is demonstrated in Figure 1. The hidden state at segment t is denoted as x_{g}(t), which is the true copy number of segment t and is unknown. The observed mean copy number for segment t and sample j is denoted as y_{gj}(t) where j = 1, ..., n_{g}. The transitional probability from state t1 to t is denoted as P_{g}(t) and the emission probability from state t to observation y_{gj}(t) is denoted as P_{gj}(t).
Figure 1. A hidden Markov model for a cluster of tumor aCGH samples. There are n_g samples in cluster g, and the graph shows three DNA segment, t1, t, and t+1. X is the unobserved copy number for each segment and y is the observed mean intensity value for each sample. P_{g }is transitional probability and P_{gj }is emission probability.
Based on the distribution of segmentation copy number data, we choose 6 states, xX{0.1, 1.5, 2, 3, 4, 5}. The transitional probabilities between states are given in a multinomial distribution that was empirically estimated from the raw data. For example, for the 79 glioma samples we discuss in the results section, the transitional probability between the same states is 0.56, and is 0.088 otherwise. Lognormal distribution is widely used to model DNA copy number [21]. We let the emission probability follow a lognormal distribution with the hidden state value being the mean of the normal distribution. The standard deviation of the normal distribution was empirically estimated from raw data using the mean standard deviation of segment copy number across all samples. Viterbi algorithm is used to fit this HMM model to a cluster of tumor samples. Viterbi algorithm is an efficient algorithm that uses a dynamic programming approach to estimate the hidden states from segment 1 to T and to calculate delta value, Δ_{g}, which is the loglikelihood for the model fitting of cluster g. We have implemented Viterbi algorithm in both R and C++ programming environment.
Sample clustering
The optimal clustering is determined by the maximum sum of delta values from all groups. It is computationally infeasible to perform HMM fitting and calculate delta sum values for every possible clustering because the number of clustering for n samples G groups is G^{n}. We have developed a computationally efficient algorithm for finding the clustering optima.
The algorithm searches the optimal clustering as follows. First, for a given number of groups G, we run a NMF algorithm that was developed earlier to preliminarily cluster the samples. The NMF algorithm has been described previously [16]. Though NMF classifies aCGH data without considering the correlation between markers, it has been shown to discover genomic signatures in various types of tumors [11,16]. Thus, the clustering result of NMF serves as a starting point for HMM clustering. Secondly, we fit a HMM for each cluster from the previous step and calculate the sum of delta values, Δ_{1}, ..., Δ_{G}. This is the loglikelihood for the current optimal clustering. Thirdly, based on the current optimal clustering, we randomly select 2 samples, and reassign the group labels to them. We then perform HMM model fitting and calculate the loglikelihood for the new clustering. If the new loglikelihood is greater than the optimal loglikelihood, we update the optimal clustering with this new clustering. Fourthly and finally, we repeat the last step a number of times, and stop until the optimal loglikelihood remains unchanged for m times of consecutive random clustering. We find that m = n^{2 }is sufficient for HMMC to converge to the optimal clustering. This workflow of clustering procedure is illustrated in Figure 2.
Figure 2. Workflow of HMMC clustering procedure. The procedure stops when the optimal clustering is not updated for m = n^{2 }comparisons where n is the number of samples.
We have use both R and C++ to implement this clustering algorithm. As step 3 can be performed simultaneously for multiple random clustering, we implemented a parallel computing version of HMMC to decrease the computation time. This was accomplished using the OpenMP package [22] within C++ to allow the random clustering and HMM fitting to be carried out across multiple threads in a multicore machine. Each thread can update the optimal clustering and all threads share the same optima. We have tested this program in Dell PowerEdge R910 workstation that has 32cores and the parallel computing increased efficiency by an order of magnitude.
Model selection
One of the most difficult questions about clustering problem is to determine the number of groups, G. Our HMMC algorithm relies on a given G to find the optimal clustering. Thus model selection is needed to determine the best number of groups.
We use the Bayesian Information Criterion (BIC), a widespread statistical tool to conduct model selection. BIC is defined as follows:
where L is the likelihood which measures how good the HMM model approximate the data, k is the number of parameters used in the model, and n is the number of samples. The second term, kln(n), serves as a penalty on the number of parameters used in the model to avoid overfitting. Thus, the magnitude of penalty increases as the number G increases. A preferred model is selected upon a lower BIC value that indicates relatively higher likelihood as well as relatively lower risk of overfitting.
The sum of delta values is the loglikelihood for HMM model fitting,
where Δg is the delta value of the gth group. Because the parameters for transitional and emission probability are empirically estimated, the number of unknown parameters k in BIC is equal to the number of hidden states. For an aCGH dataset with G groups and T segments, the number of parameters is equal to T × G. Thus, the BIC for HMM is defined as
We run HMMC for a range of values of G, and select the optimal G with the smallest BIC.
Crossvalidation for testing stability of HMMC
To assess the stability of HMMC results, we conduct a crossvalidation as described earlier [16]. After completing HMMC for the entire dataset, we perform crossvalidation as follows. We randomly leave 10% of samples out and apply HMMC to the remaining 90% of samples and compare the clustering results with the original one. The number of samples that are assigned to a different subgroup other than the original result is counted as errors. We repeat this procedure 100 times and calculate the error rate, which represents the stability of the clustering algorithm with respect to the permutation of samples.
Results
Simulation study
In order to test the performance of HMMC, we conducted simulation study to compare HMMC with NMF clustering. We created two groups of aCGH data with 10 samples per group and 100 segments per each sample. For the first group, all data were generated from a normal distribution with mean equal to 2 and standard deviation equal to s, which was variable between different experiments. For the second group, the copy number of all segments followed a normal distribution as well. This normal distribution was the same as the first group, except that 3 segments in the middle of genome had a mean equal to 4 such that these 3 segments were amplified. In order to introduce random noise, we randomly drew 40 segments in all samples and increased their copy number values by 2 (their mean was equal to 4). Thus these 40 segments were randomly amplified segments. Furthermore, AR(1) correlation was introduced into the data. Considering the high correlation between SNPs in the high density SNP array, we set the pho value of AR(1) to be 0.9.
We used HMMC and NMF to cluster the two groups with s value varying from 0.5 to 2. For each s, 200 aCGH datasets were generated and the number of samples misplaced for either method was counted. The error rate curves are shown in Figure 3.
Figure 3. The error rates for simulated data. HMMC and NMF were tested on the same simulated data. The x axis s is the standard deviation that was used to generate the data. The blue curve is the error rates of HMMC, and the red curve is for NMF.
The error rates of both HMMC and NMF increase as the standard deviation s increases. The error rates of HMMC are in the range of 3% to 8.5%, whereas those for NMF are between 7.9% and 17.6%. Therefore, the chance for HMMC to incorrectly classifying samples is less than half of that for NMF in our simulation study.
Glioma data
Glioma tissue aCGH data were acquired from Repository for Molecular BRAin Neoplasma DaTa (REMBRANDT) [23]. We obtained Affymetrix 500K SNP array data and the corresponding clinical information for 79 glioma patients. The aCGH data was first processed using aroma.affymetrix software and was then segmented using circular binary segmentation method. This preprocessing step reduced the number of markers from 500K to 5321 segments. The segmented data was used for further analysis.
The first question to be addressed is the computational performance of HMMC when it is applied to highdimensional data. We first test the convergence rate of HMMC for glioma aCGH data assuming 2 groups. As shown in Figure 4A, the loglikelihood of HMMC increases rapidly in the first 4,000 cycles of clustering, and it remains stable after 5,000 cycles. The loglikelihood reaches its maximum at approximately 15,000 cycles, and the computation stops after an additional 79^{2 }= 6241 cycles.
Figure 4. Computational performance for glioma data. A. The loglikelihood of HMMC converges rapidly to its maximum. B. The computational time of HMMC is linear to the sample size.
Secondly, we tested the computation time for different number of samples. Random samples were drawn from the 79 glioma data. The numbers of random samples was from 10 to 95 and the number of groups was assumed to be 2. We used a Dell PowerEdge R910 to run 20 threads simultaneously for this test. Figure 4B shows the curve of computation time. For 10 samples, it took less than 20 minutes. For 95 samples, it took about 2 hours to find the optimal clustering. The computation time is approximately linear to the number of samples (O(n)). Therefore, our clustering algorithm is computationally efficient and it achieves O(n) for the computational complexity.
To find the optimal number of groups, we conducted model selection using BIC. We ran HMMC for the number of groups G from 2 to 6, and calculated the corresponding maximum loglikelihood and BIC. The results are shown in table 1. The maximum loglikelihood increases as the number of groups increases because the model has a better fit for the data as the number of parameters increases. The BIC value was 304,617 for 2 groups, then it decreased to 305,030 for 3 groups, and then increased as the number of groups increased. Thus, the optimal number of groups is 3 because it gave the minimum value of BIC.
Table 1. Model selection for glioma data
Figure 5 shows the heatmap of clustering the glioma samples into 3 groups. Each row of the heatmap is a glioma sample, and each column is a DNA segment. There are 79 rows and 5,321 columns totally. The segments are sorted in the order of genomic positions from chromosome 1 to 22. Sex chromosome was not included to avoid bias on gender. Yellow indicates amplification and blue indicates deletion. The three groups are shown with the black horizontal lines in the heatmap. Group 1 is characterized by large area of deletion in chromosome 8. Group 2 has amplification in various regions of chromosome 2 to 6 and deletion in chromosome 17 and 18. Group 3 appears to have sparse aberrations in random regions.
Figure 5. Heatmap of glioma clustering. 79 glioma samples with 5,321 segments were clustered into 3 groups using HMMC. Sample names are labelled on the right side, and chromosome number is marked at the bottom. Yellow denotes amplification and blue denotes deletion. The three groups are labelled with number 1 to 3 on the left side of the graph.
To verify the clustering results, we performed survival analysis to the 3 groups. The KaplanMeier curves are shown in Figure 6. The blue line is group 1, the green is group 2 and the red is group 3. The median survival time for group 1 is 395 days, whereas those for group 2 and 3 are 1,698 and 1,337 days, respectively. Group 1 has a significant lower overall survival than group 2 and 3. The overall P value by logrank test is less than 0.0001. Therefore, HMMC successfully clustered glioma samples based on aCGH data and the resulting tumor subtypes were associated with clinical outcomes. As discussed previously, group 2 manifests with large segments of deletion in chromosome 8. Thus genomic signature is identified for poor clinical outcome and they can serve as prognostic biomarkers. Upon identification of genes located in these abnormal genomics regions, we can discover the potential target for cancer therapy.
Figure 6. KaplanMeier curves of glioma clusters. The 79 glioma patients were clustered into 3 groups using HMMC. The KaplanMeier curves for overall survival time in days were shown. The blue line is group 1, green is group 2, and red is group 3. The P value by logrank test is less than 0.0001.
Discussion
Genomic information has been increasingly used for molecular classifications of tumors because it provides a more objective view than histopathological approach and it sheds light on the molecular mechanisms of tumor heterogeneity. The global gene expression of major tumor types has been extensively studied for subtype identification [2427]. The fast development of SNP array has made it possible to characterize tumor subtypes based on the genomic DNA copy number, nonetheless, the computational and statistical methods in this area is still underdeveloped. In many aspects, DNA copy number is better than RNA expression data in terms of genomic signature and diagnostic or prognostic biomarkers. DNA is much easier to store than RNA because DNA is a stable molecule whereas RNA is transient and easy to degrade. In addition, RNA has to be collected from fresh tissues whereas DNA can be isolated from frozen or paraform tissues even after years of storage. Furthermore, DNA aberration is preserved in the cell and can be passed to daughter cells via mitosis, whereas RNA expression is unstable and RNA levels are affected by many factors such as cell cycle, environmental and physiological factors. Therefore, tumor classification and genomic signature identification based on DNA copy number have extensive applications and improvements in the computational and statistical methods are needed to extend research in aCGH data analysis.
Most current statistical and data mining methods for aCGH data were developed from expression microarray analysis. An important difference between aCGH data and mRNA expression data is the high spatial correlation between neighboring SNPs in the aCGH data. The correlation between genes in expression microarray is relatively less and ignored by most methods that have been developed for such data. Thus most methods developed for expression microarray may yield significant false positives and false negatives when applied to the aCGH data. Here we proposed a clustering algorithm that relies on HMM to take into account of the correlations between aCGH markers (SNPs). We found our HMMC algorithm to perform better than the NMF clustering method, which is one of the most widely used methods used for aCGH sample clustering.
Though HMM has been widely used in modelling correlated data, such as DNA copy number data, it is widely known to be computationally very slow, especially for the analysis of high dimensional data such as the aCGH data. In order to overcome this problem, we have used various approaches to increase the computational efficiency and speed. First, in the clustering procedure, we conduct a preliminary NMF clustering and the resulting groups are used as the starting point for HMMC.
Secondly, instead of performing exhausting search, we only disturb the current optimal clustering with a few random labelling. Thirdly, we implement the algorithm in C++ that is much faster than R and MatLab. Furthermore, we have developed a parallel computing version of the program, and it increases the speed by over 10fold our server computer. Our testing on 79 glioma samples showed that HMMC rapidly converges to its optimal clustering and the computation time is linear to the sample size.
Conclusions
In this manuscript we proposed a HMMbased clustering algorithm for identifying tumor subtypes using aCGH DNA copy number data. This approach properly models the high spatial correlation between aCGH markers. Clusters of tumor samples are modelled with a mixture of HMM models where each HMM fits a cluster of samples. We have developed a computationally efficient and fast clustering algorithm that takes only a computational time of O(n). We have shown that this HMMC algorithm has less than half the error rate of NMF clustering and it can locate the optimal number of groups automatically while applying to glioma aCGH data. The resulting clustering of glioma samples has strong association with overall survival time. This HMMC algorithm would potentially have wide applications in tumor subtype identification, genomic signature discovery, and diagnostic and prognostic biomarker search. We will conduct future research in extending HMMbased modelling to other high dimensional biological data analysis, such as copy number variation, gene set enrichment analysis, and sequencing data analysis such as ChIPseq data processing.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
KZ conceived the study, developed the HMMC algorithm and implemented it in R environment. YY wrote the C++ code for the methodology and performed test for glioma data. VD and YD provided constructive suggestions. KZ, VD, LX and DS wrote the manuscript with input from all authors. All authors read and approved the final manuscript.
Acknowledgements
Zhang's research was supported in part by NIH grant 2P20RR01647109, North Dakota INBRE.
References

Myrhoj T, Bisgaard ML, Bernstein I, Svendsen LB, Sondergaard JO, et al.: Hereditary nonpolyposis colorectal cancer: clinical features and survival. Results from the Danish HNPCC register.
Scand J Gastroenterol 1997, 32:572576. PubMed Abstract  Publisher Full Text

Vogel CL, Cobleigh MA, Tripathy D, Gutheil JC, Harris LN, et al.: Efficacy and safety of trastuzumab as a single agent in firstline treatment of HER2overexpressing metastatic breast cancer.
J Clin Oncol 2002, 20:719726. PubMed Abstract  Publisher Full Text

Fisher B, Dignam J, Wolmark N, DeCillis A, Emir B, et al.: Tamoxifen and chemotherapy for lymph nodenegative, estrogen receptorpositive breast cancer.
J Natl Cancer Inst 1997, 89:16731682. PubMed Abstract  Publisher Full Text

Cleator SJ, Ahamed E, Coombes RC, Palmieri C: A 2009 update on the treatment of patients with hormone receptorpositive breast cancer.
Clin Breast Cancer 2009, 9(Suppl 1):S6S17. PubMed Abstract  Publisher Full Text

Gazdar AF: Personalized medicine and inhibition of EGFR signaling in lung cancer.
N Engl J Med 2009, 361:10181020. PubMed Abstract  Publisher Full Text

Cappuzzo F, Hirsch FR, Rossi E, Bartolini S, Ceresoli GL, et al.: Epidermal Growth Factor Receptor Gene and Protein and Gefitinib Sensitivity in NonSmallCell Lung Cancer.
J Natl Cancer Inst 2005, 97:643655. PubMed Abstract  Publisher Full Text

LaurentPuig P, Lievre A, Blons H: Mutations and response to epidermal growth factor receptor inhibitors.
Clin Cancer Res 2009, 15:11331139. PubMed Abstract  Publisher Full Text

Harris M: Monoclonal antibodies as therapeutic agents for cancer.
Lancet Oncol 2004, 5:292302. PubMed Abstract  Publisher Full Text

Takano T, Ohe Y, Sakamoto H, Tsuta K, Matsuno Y, et al.: Epidermal Growth Factor Receptor Gene Mutations and Increased Copy Numbers Predict Gefitinib Sensitivity in Patients With Recurrent NonSmallCell Lung Cancer.
J Clin Oncol 2005, 23:68296837. PubMed Abstract  Publisher Full Text

Tan DS, ReisFilho JS: Comparative genomic hybridisation arrays: highthroughput tools to determine targeted therapy in breast cancer.
Pathobiology 2008, 75:6374. PubMed Abstract  Publisher Full Text

Maher EA, Brennan C, Wen PY, Durso L, Ligon KL, et al.: Marked genomic differences characterize primary and secondary glioblastoma subtypes and identify two distinct molecular and clinical secondary glioblastoma entities.
Cancer Res 2006, 66:1150211513. PubMed Abstract  Publisher Full Text

Myllykangas S, Himberg J, Bohling T, Nagy B, Hollmen J, et al.: DNA copy number amplification profiling of human neoplasms.
Oncogene 2006, 25:73247332. PubMed Abstract  Publisher Full Text

Myllykangas S, Tikka J, Bohling T, Knuutila S, Hollmen J: Classification of human cancers based on DNA copy number amplification modeling.
BMC Med Genomics 2008, 1:15. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Carrasco DR, Tonon G, Huang Y, Zhang Y, Sinha R, et al.: Highresolution genomic profiles define distinct clinicopathogenetic subgroups of multiple myeloma patients.
Cancer Cell 2006, 9:313325. PubMed Abstract  Publisher Full Text

Wang Y, Makedon F, Pearlman J: Tumor classification based on DNA copy number aberrations determined using SNP arrays.
Oncol Rep 2006, 15:10571059. PubMed Abstract  Publisher Full Text

Lu X, Zhang K, Van Sant C, Coon J, Semizarov D: An algorithm for classifying tumors based on genomic aberrations and selecting representative tumor models.
BMC Med Genomics 2010, 3:23. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Shah SP, Cheung KJ Jr, Johnson NA, Alain G, Gascoyne RD, et al.: Modelbased clustering of array CGH data.
Bioinformatics 2009, 25:i3038. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bengtsson H, Wirapati P, Speed TP: A singlearray preprocessing method for estimating fullresolution raw copy numbers from all Affymetrix genotyping arrays including GenomeWideSNP 5 & 6.
Bioinformatics 2009, 25:21492156. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of arraybased DNA copy number data.
Biostatistics 2004, 5:557572. PubMed Abstract  Publisher Full Text

Machine Learning 2001, 45:532. Publisher Full Text

Hodgson G, Hager JH, Volik S, Hariono S, Wernick M, et al.: Genome scanning with array CGH delineates regional alterations in mouse islet carcinomas.
Nat Genet 2001, 29:459464. PubMed Abstract  Publisher Full Text

Chandra R: Parallel programming in OpenMP. Volume xvi. San Francisco, CA: Morgan Kaufmann Publishers; 2001::230.

(National Cancer Institute 2005) REMBRANDT home page [http://http//rembrandt.nci.nih.gov] webcite
2011.

Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, et al.: Diffuse large Bcell lymphoma outcome prediction by geneexpression profiling and supervised machine learning.
Nat Med 2002, 8:6874. PubMed Abstract  Publisher Full Text

Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, et al.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.
Science 1999, 286:531537. PubMed Abstract  Publisher Full Text

van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, et al.: Gene expression profiling predicts clinical outcome of breast cancer.
Nature 2002, 415:530536. PubMed Abstract  Publisher Full Text

Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, et al.: Distinct types of diffuse large Bcell lymphoma identified by gene expression profiling.
Nature 2000, 403:503511. PubMed Abstract  Publisher Full Text