Abstract
Background
Strains of Mycobacterium tuberculosis complex (MTBC) can be classified into major lineages based on their genotype. Further subdivision of major lineages into sublineages requires multiple biomarkers along with methods to combine and analyze multiple sources of information in one unsupervised learning model. Typically, spacer oligonucleotide type (spoligotype) and mycobacterial interspersed repetitive units (MIRU) are used for TB genotyping and surveillance. Here, we examine the sublineage structure of MTBC strains with multiple biomarkers simultaneously, by employing a tensor clustering framework (TCF) on multiplebiomarker tensors.
Results
Simultaneous analysis of the spoligotype and MIRU type of strains using TCF on multiplebiomarker tensors leads to coherent sublineages of major lineages with clear and distinctive spoligotype and MIRU signatures. Comparison of tensor sublineages with SpolDB4 families either supports tensor sublineages, or suggests subdivision or merging of SpolDB4 families. High prediction accuracy of major lineage classification with supervised tensor learning on multiplebiomarker tensors validates our unsupervised analysis of sublineages on multiplebiomarker tensors.
Conclusions
TCF on multiplebiomarker tensors achieves simultaneous analysis of multiple biomarkers and suggest a new putative sublineage structure for each major lineage. Analysis of multiplebiomarker tensors gives insight into the sublineage structure of MTBC at the genomic level.
Background
Tuberculosis (TB), a bacterial disease caused by Mycobacterium tuberculosis complex (MTBC), is a leading cause of death worldwide. In the United States, isolates from all TB patients are routinely genotyped by multiple biomarkers. The biomarkers include Spacer Oligonucleotide Types (spoligotypes), Mycobacterial Interspersed Repetitive Units  Variable Number Tandem Repeats (MIRUVNTR), IS6110 Restriction Fragment Length Polymorphisms (RFLP), Long Sequence Polymorphisms (LSPs), and Single Nucleotide Polymorphisms (SNPs).
Genotyping of MTBC is used to identify and distinguish MTBC into distinct lineages and/or sublineages that are quite useful for TB tracking, TB control, and examining hostpathogen relationships [1]. The six main major lineages of MTBC are M. africanum, M. bovis, M. tuberculosis subgroup IndoOceanic, M. tuberculosis subgroup EuroAmerican, M. tuberculosis subgroup East Asian (Beijing) and M. tuberculosis subgroup EastAfrican Indian (CAS). Other major lineages exist such as M. canettii and M. microti, but they do not commonly occur in the US, so we do not consider them here. These major lineages can be definitively characterized using LSPs [2], but typically only spoligotypes and MIRU are collected for the purpose of TB surveillance. Classification, similarity search, and expertrule based methods have been developed to correctly map isolates genotyped using MIRU and/or spoligotypes to the major lineages [35].
While sublineages of MTBC are routinely used in the TB literature, their exact definitions, names, and numbers have not been clearly established. The SpolDB4 database contains 39,295 strains and their spoligotypes with the vast majority of them labeled and classified into 62 sublineages [6], but many of these are considered to be “potentially phylogeographicallyspecific MTBC genotype families”, rather than distinct phylogenetic sublineages with known biomarkers. Therefore, further analysis is needed to confirm these sublineages. The highlycurated MIRUVNTRplus website, which focuses primarily on MIRU, defines 22 sublineages. New definitions of sublineages based on LSPs and SNPs are being discovered; e.g. the RD724 polymorphism corresponds to the previously defined SpolDB4 T2 sublineage, also known as the Uganda strain in MIRUVNTRplus[7]. Now large databases using spoligotype, MIRU patterns, and RFLP exist. The United States Centers for Disease Control and Prevention (CDC) has gathered spoligotypes and MIRU isolates for over 37,000 patients. Welldefined TB sublineages based on spoligotype and MIRU are critical for both TB control and TB research.
The goal of this paper is to examine the sublineage structure of MTBC on the basis of multiple biomarkers. The proposed method reveals structure not captured in SpolDB4 spoligotype families because SpolDB4 sublineage only take into account a single biomarker, spoligotypes. A spoligotypeonly tool, SPOTCLUST, was used to find MTBC sublineages using an unsupervised probabilistic model, reflecting spoligotype evolution [8]. A key issue is to combine spoligotype and MIRU into a single unsupervised learning model. When MIRU patterns are considered, SpolDB4 families that are wellsupported by spoligotype signatures may become ambiguous, or allow subdivision/merging of the families. Existing phylogenetic methods can be readily applied to MIRU patterns, but specialized methods are needed to accurately capture how spoligotypes evolve. It is not known how to best combine spoligotype and MIRU patterns to infer a phylogeny. The online tool www.MIRUVNTRplus.org determines lineages by using similarity search to a labeled database. The user must select the distance measure which is defined using spoligotypes and/or MIRU patterns, possibly yielding different results.
In this study, we develop a tensor clustering framework to find the sublineage structure of MTBC strains labeled by major lineages based on multiple biomarkers. This is an unsupervised learning problem. We generate multiplebiomarker tensors of MTBC strains for each major lineage and apply multiway models for dimensionality reduction. The model accurately captures spoligotype evolutionary dynamics using contiguous deletions of spacers. The tensor transforms spoligotypes and MIRU into a new representation, where traditional clustering methods apply without users having to decide a priori how to combine spoligotype and MIRU patterns. Strains are clustered based on the transformed data without using any information from SpolDB4 families. Clustering results lead to the subdivision of major lineages of MTBC into groups with clear and distinguishable spoligotype and MIRU signatures. Comparison of the tensor sublineages with SpolDB4 families suggests dividing or merging some SpolDB4 families. As a way of validating multiplebiomarker tensors, we use them in a supervised learning model to predict major lineages using spoligotype deletions and MIRU. We compare the prediction accuracy of the multiplebiomarker tensor model created with NPLS (Nway partial least squares) with the 2way PLS applied to matrix data and an existing conformal Bayesian Network approach.
In the next section, we give a brief background on clustering and multiway analysis of postgenomic data, spoligotyping, and MIRU typing.
Clustering postgenomic data
Data clustering is a class of techniques for unsupervised classification of data samples into groups of similar behavior, function, or trait [9]. Clustering can be used in postgenomic data analysis to group strains with similar traits. It is common practice to use different clustering methods and use a priori biological knowledge to interpret the clusters, but computational cluster validation is needed to validate results without prior knowledge for unsupervised classification. A great survey by Handl et al. outlines the steps of computational cluster analysis on postgenomic data [10]. An application of computational cluster validation on microarray data by Giancarlo et al. compares the results of clusterings using various cluster validation indices [11]. Eisen et al. clusters gene expression data which groups genes of similar functions [12]. Improved clustering techniques have been developed, but how to combine multiple sources of information in one clustering is an open question.
Application of multiway models to postgenomic data clustering
Clustering on postgenomic data can be accomplished based on multiple sources of ground truth. The ground truth can be based on multiple biomarkers, host and pathogen, or antigen and antibody. A survey by Kriegel et al. outlines the methods for finding clusters in highdimensional data [13]. Analysis of multiway arrays for data mining is frequently used today in various fields, including bioinformatics, to use multiple sources of prior information simultaneously [14]. Alter and Golub use higherorder eigenvalue decomposition on a networks × genes × genes tensor and find significant subnetworks associated with independent pathways in a genomescale network of relations among all genes of cellular systems [15]. Omberg et al. use higherorder singular value decomposition on DNA microarray data, obtaining the core tensor of eigenarrays × xeigengenes × yeigengenes and finding correlation between genomes in the subtensors of the core tensor [16]. Multiway analysis of EEG data identifies epileptic seizures [17]. Use of common partitive and hierarchical clustering algorithms accompanied with multiway modeling of highdimensional data finds functionally related genes in stem cells [18]. Similarly, multiple biomarkers of the MTBC genome can be used to cluster MTBC strains.
Spoligotyping
Spoligotyping is a DNA fingerprinting method that exploits the polymorphisms in the direct repeat (DR) region of the MTBC genome. The DR region is a polymorphic locus in the genome of MTBC which consists of direct repeats (36 bp), separated by unique spacer sequences of 36 to 41 bp [19]. The method uses 43 spacers, thus a spoligotype is typically represented by a 43bit binary sequence. Zeros and ones in the sequence correspond to the absence and presence of spacers respectively. Mutations in the DR region involve deletion of one or more contiguous spacers. To capture this mechanism of mutation in our model, we find informative contiguous spacer deletions and represent spoligotype deletions as a binary vector, where one indicates that a specific contiguous deletion occurs (i.e. a specified contiguous set of spacers are all absent) and zero means at least one spacer is present in that contiguous set of spacers.
Large datasets of MTBC strains genotyped by spoligotype have been amassed such as SpolDB4 [6] and a more extended online version SITVIT (http://www.pasteurguadeloupe.fr:8081/SITVITDemo/index.jsp webcite). Spoligotypes can be readily used to identify commonly accepted major lineages of MTBC with high accuracy [4]. SpolDB4 defined a set of phylogeographic sublineages or families based on expert derived rules that are in common use in the TB community. In contrast to the major lineages that have been validated by more definitive markers such as single nucleotide polymorphisms and long sequence polymorphism, the exact definition of MTBC sublineages and the accuracy of the SpolDB4 families created only using spoligotypes remain open questions.
MIRUVNTR typing
MIRU is a homologous 46100 bp DNA sequence dispersed within intergenic regions of MTBC, often as tandem repeats. MIRUVNTR typing is based on the number of tandem repeats of MIRUs at certain identified loci. Among these 41 identified minisatellite regions on the MTBC genome, different subsets of sizes 12, 15, and 24 are proposed for the standardization of MIRUVNTR typing [3]. In this study, we use 12 MIRU loci for genotyping MTBC. Thus, the MIRU pattern is represented as a vector of length 12, each entry representing the number of repeats in each MIRU locus.
Results
We used the tensor clustering framework to cluster MTBC strains using multiple biomarkers, and compared the clustering to SpolDB4 sublineages. Next, we used supervised tensor learning and classified MTBC strains into major lineages using spoligoype deletions and MIRU patterns. We compared multiway and twoway supervised learning methods based on their prediction accuracy for major lineage classification. In the following section, we introduce multiplebiomarker tensors and present unsupervised and supervised learning experiments on multiplebiomarker tensors.
Multiplebiomarker tensor analysis of strain data
Multiple biomarkers of the MTBC genome in a relational database can be represented as a highdimensional dataset for multiway analysis. The multiplebiomarker tensor is constructed this way, with one of the modes representing strains and other modes representing biomarkers. In our experiments, we use this multidimensional array or tensor with three modes representing strains, spoligotype deletions, and MIRU patterns. This multiplebiomarker tensor captures three key properties of MTBC strains: spoligotype deletions, number of repeats in MIRU loci, and coexistence of spoligotype deletions with MIRU loci.
The strain dataset is arranged as a threeway array with strains in the first mode, spoligotype deletions in the second mode, and MIRU patterns in the third mode. Each entry
 X
 X
Figure 1. Multiplebiomarker tensor Strains x Spoligotype deletions x MIRU patterns tensor. Each entry
 X
Figure 2. Generation of multiplebiomarker tensor Biomarker kernel matrix for each strain forms multiplebiomarker tensor. Vector represents spoligotype deletions and represents MIRU patterns.
where
and r_{ik} is the number of repeats in MIRU locus k of strain i. Multiplebiomarker tensors can be used for both unsupervised and supervised learning. Next, we use the unsupervised tensor clustering framework on multiplebiomarker tensors to subdivide major lineages of MTBC into sublineages.
Subdivision of major lineages into sublineages
We subdivide each major lineage of MTBC into sublineages using multiplebiomarker tensors. For each major lineage, we generated the multiplebiomarker tensor using spoligotypes and MIRU types and applied multiway models to identify putative sublineages of each major lineage. Two multiway analysis methods were used: PARAFAC and Tucker3. Details of the methods and how the model parameters or components were selected can be found in the methods section. The validated multiway models with numbers of components for each major lineage are shown in Table 1. To evaluate the resulting clusters, we compared them to the published SpolDB4 families for each major lineage. The results are summarized in Table 2. We used the Fmeasure to measure how well the tensor sublineages match the SpolDB4 families with 1 indicating an exact match and 0 indicating no match. The average bestmatch stability is used to assess certainty of tensor sublineages respectively with 1 indicating highly stable clusters. For each major lineage, results show that the tensor analysis finds highly stable sublineages (the bestmatch stability is ≥84%) and that the number of sublineages found using tensors is close but not always identical to the number of SpolDB4 families.
Table 1. Number of components used in PARAFAC and Tucker3 models.
Table 2. Number of SpolDB4 families and number of tensor sublineages for each major lineage
The Fmeasure values range from 53% to 88% indicating that the sublineages found by the tensors only partially overlap with those of SpolDB4. Recall that the SpolDB4 families were created by expert analysis using only spoligotypes and that analysis by alternative biomarkers such as SNP and LSP has led to alternative definitions of MTBC sublineages. The tensor sublineages are based on spoligotype and MIRU patterns, thus in some cases the tensor divides SpolDB4 families due to difference in MIRU patterns even if the spoligotypes match. In other cases, the tensor analysis merges the SpolDB4 families because the collective spoligotypes and MIRU patterns are very close. In some cases, the tensor analysis almost exactly reproduces a SpolDB4 family providing strong support for the existence of these families with no expert guidance. In addition, the MIRU patterns provide additional evidence for the existence of these distinct sublineages. Thus, multiway analysis of MTBC strains of each major lineage with multiple biomarkers leads to new sublineages and reaffirms existing ones. Further insight can be obtained by examining the putative sublineages for each major lineage, which is detailed next.
Sublineage structure of M. africanum The most stable clusters were produced using PARAFAC and it constructed four putative sublineages of M. africanum, denoted MA1 to MA4. Table 3 gives the stability of each sublineage and the correspondence between the tensor sublineages and the SpolDB4 families. These four putative sublineages are quite distinct as shown by the stability of 1 for each sublineage and the clear separation of the four sublineages in the PCA plot in Figure 3. Figure 4 shows heat maps representing the spoligotype and MIRU signatures for each tensor sublineage, with white indicating 0 probability and black indicating probability of 1.
Table 3. Confusion matrix of M. africanum strains
Figure 3. The clustering plot of M. africanum strains Clustering plot of M. africanum strains using Principal Component Analysis on the score matrix obtained from the PARAFAC model. Four putative tensor sublineages, MA1 to MA4, are clearly distinct along the principal component axes.
Figure 4. Biomarker signatures of M. africanum tensor sublineages Spoligotype and MIRU signatures of tensor sublineages of M. africanum strains. White indicates probability of 0 and black indicates probability of 1. Intermediate colors represent probabilities in the range (0,1). MA1 and MA4 are similar in their MIRU signatures, and MA4 strains lack spacers 22 through 24, in addition to the deletions of MA1 strains. MIRU signatures of MA2 and MA3 strains are also similar, and MA2 has an extra deletion, 21 through 24, in addition to the deletions of MA3 strains.
The tensor sublineages strongly support the existence of the SpolDB4 AFRI_1, AFRI_2 and AFRI_3 families and show that the AFRI family is composed of these three families. With an Fmeasure of 66%, the tensor sublineages differ markedly from the SpolDB4 families for the M. africanum lineage. The AFRI family results largely explain this difference – AFRI is spread across three tensor sublineages. Disregarding AFRI, sublineages MA2 and MA3 match families AFRI_2 and AFRI_3 respectively. Interestingly, AFRI_1 is further subdivided into sublineages MA1 and MA4. The spoligotypes in MA1 and MA4 differ by only one contiguous deletion of spacers 22 through 24, but their MIRU signatures clearly distinguish them especially in MIRU loci 10, 16 and 40. The tensor indicates that the AFRI sublineage classification defines somewhat generic M. africanum strains that can be distinctly placed in the groups MA1 (part of AFRI_1), MA4 (other part of AFRI_1), MA2 (AFRI_2) and MA3 (AFRI_3).
The MIRUVNTRplus labels, determined on the basis of LSPs, indicate that there are two sublineages, West African 1 and West African 2, within M. africanum. Table 4 indicates the correspondence between the tensor sublineages and MIRUVNTRplus labels. MA1 and MA4 correspond to West African 2 and MA2 corresponds to West African 1. There is no data labeled by MIRUVNTRplus in MA3, but we speculate that it is West African 1 since MA2 and MA3 have more closely related MIRU and spoligotype signatures.
Table 4. Confusion matrix of distinct M. africanum strains based on MIRUVNTRplus sublineages
Sublineage structure of M. bovis PARAFAC generated the most stable clusters and constructed 3 sublineages for M. bovis, MB1, MB2, and MB3, while the dataset contains 5 SpolDB4 families, BOV, BOVIS1, BOVIS1_BCG, BOVIS2, and BOVIS3. Table 5 gives the correspondence between the tensor sublineages and the SpolDB4 families. All clusters have perfect stability and are well distinguished in the PCA plot in Figure 5. Figure 6 shows heat maps representing the spoligotype and MIRU type signatures of tensor sublineages. Much like the M. africanum SpolDB4 AFRI family, the BOV family defines a generic M. bovis sublineage that spreads across all three tensor sublineages. Disregarding BOV, MB3 consists of all of BOVIS1 and BOVIS1_BCG strains. Since BOVIS1_BCG is the attenuated bacillus CalmetteGuérin (BCG) vaccine strain, it is difficult to distinguish it from BOVIS1 using only MIRU patterns and spoligotypes. Therefore, the merger of BOVIS1 and BOVIS1 BCG is expected given the genetic similarity between the two groups of strains. Disregarding BOV, the MB1 and MB2 sublineages exactly match the SpolDB4 families BOVIS2 and BOVIS3 respectively.
Table 5. Confusion matrix of M. bovis strains
Figure 5. The clustering plot of M. bovis strains Clustering plot of M. bovis strains using Principal Component Analysis. Three putative tensor sublineages, MB1 to MB3, are clearly separated.
Figure 6. Biomarker signatures of M. bovis tensor sublineages Spoligotype and MIRU signatures of tensor sublineages of M. bovis strains. Although MIRU signatures of MB1 and MB2 strains are similar, spoligotype signatures of MB1 and MB2 strains are clearly distinguishable by extra deletions of 13 through 14 in all MB2 strains, and deletions of 5 through 7 in some MB2 strains.
Sublineage structure of East Asian (Beijing) The most stable clusters are produced by Tucker3 and it constructs six distinct sublineages of East Asian (Beijing), denoted B1 through B6. The variability in the spoligotypes of East Asian is limited to spacers 35 through 43 since all East Asian strains have spacers 1 to 34 absent. Since the SpolDB4 classification is based only on spoligotypes, the limited variability allows only two families, BEIJING and BEIJINGLIKE. Table 6 shows the correspondence between tensor sublineages and the SpolDB4 families. The clustering plot of tensor sublineages is shown in Figure 7. Heat maps representing the spoligotype and MIRU type signatures of tensor sublineages are shown in Figure 8. The tensor cleanly subdivides BEIJING into three sublineages B1, B4 and B6, all with stability 1. Spoligotype signatures of these sublineages differ. B1 strains have spacers 35 through 43 present, whereas B4 strains lack spacer 37, and B6 strains lack spacer 40. MIRU signature of sublineage B4 is clearly distinct in MIRU locus 40, having 3 repeats for most strains. The tensor subdivides the BEIJINGLIKE into sublineages B2, B3 and B5, each with distinct spoligotype signature. They all lack spacers 35 through 36. In addition, B2 strains lack spacer 37, and B3 strains lack spacer 40. Thus, the tensor strongly supports the existence of BEIJING and BEIJINGLIKE families, but also suggests that they can be further subdivided.
Table 6. Confusion matrix of East Asian (Beijing) strains
Figure 7. The clustering plot of East Asian (Beijing) strains Clustering plot of East Asian (Beijing) strains using Principal Component Analysis. Six putative tensor sublineages, B1 to B6, are clearly distinct.
Figure 8. Biomarker signatures of East Asian (Beijing) tensor sublineages Spoligotype and MIRU signatures of tensor sublineages of East Asian (Beijing) strains. Tensor sublineages B1, B4, B6 include BEIJING strains and sublineages B2, B3, B5 include BEIJINGLIKE strains.
Sublineage structure of EastAfrican Indian (CAS) Tucker3 generated the most stable clusters and it constructed four distinct sublineages for EastAfrican Indian (also known as CAS) denoted C1, C2, C3, and C4. The strains are also labeled with four SpolDB4 lineages: CAS, CAS1_DELHI, CAS1_KILI and CAS2. Table 7 shows the correspondence of tensor sublineages and SpolDB4 families. Figure 9 shows the clustering plot of tensor sublineages and Figure 10 shows spoligotype and MIRU type signatures of tensor sublineages. All sublineages are highly stable with stability 1. Much like with AFRI and BOV, the generic CAS family is divided across all tensor sublineages. C3 only contains CAS strains. Disregarding CAS, C1 contains most CAS1 DELHI strains and all CAS2 strains. C4 contains all CAS1_KILI strains. C2 contains 2 CAS1_DELHI strains, but the vast majority (331 strains) of CAS1_DELHI strains fall in C1. In addition to the common deletions of EastAfrican Indian (CAS) strains, C2 strains lack spacer 22, C3 strains lack spacers 20 through 22, and C4 strains lack spacers 20 through 22 and spacer 35. Variabilities in MIRU loci 10, 26, 31 and 40 are also key to defining differences in the sublineages. C2 and C3 strains differ by variations in MIRU locus 10. C4 strains which include all CAS1_KILI strains exhibit a very distinct MIRU signature compared to other tensor sublineages, especially in MIRU locus 26.
Table 7. Confusion matrix of EastAfrican Indian (CAS) strains
Figure 9. The clustering plot of EastAfrican Indian (CAS) strains Clustering plot of EastAfrican Indian (CAS) strains using Principal Component Analysis. Four putative tensor sublineages, C1 to C4, are clearly distinct.
Figure 10. Biomarker signatures of EastAfrican Indian (CAS) tensor sublineages Spoligotype and MIRU signatures of tensor sublineages of EastAfrican Indian (CAS) strains. In addition to deletions in C1 strains, C2 strains lack spacer 22. In addition to deletions in C3 strains, C4 strains lack spacer 35 and have only 1 repeat in MIRU 26. C2 and C3 strains are very close in their MIRU signature, but they differ by variations in MIRU locus 10.
Sublineage structure of IndoOceanic PARAFAC found the most stable clusters and it constructs nine distinct putative sublineages for IndoOceanic, denoted IO1 to IO9, while the dataset has thirteen SpolDB4 lineages. Table 8 shows the correspondence of tensor sublineages and SpolDB4 families. Figure 11 shows the clustering plot of tensor sublineages and Figure 12 shows spoligotype and MIRU signatures of tensor sublineages. The EAI5 family acts much like the CAS, BOV, and AFRI families, spreading across all the IndoOceanic sublineages except IO4. The small MANU1 family also spreads across four sublineages. The existence of the MANU1 family has not been well established by other biomarkers. Disregarding these two troubling families, the tensor sublineages correspond closely to the SpolDB4 families. Table 8 shows that there is almost a onetoone mapping between most SpolDB4 lineages and IndoOceanic tensor sublineages. Specifically, the mapping between the most stable clusters (with sublineage stability) and the families are: IO1 (.94) equals EAI6_BDG1, IO2 (1) equals EAI3_IND, IO4 (1) equals ZERO, and IO6 (.91) equals most of EAI2_MANILLA. All EAI strains are in IO9 (.77), all EAI1 strains are in IO8 (.86), all MICROTI strains are in IO5 (0.56), and all ZERO strains are in IO4. All EAI2_NTB strains are in IO5, all EAI3_IND strains are in IO2, and all EAI8_MDG strains are in IO7 (.84). EAI2_MANILLA is divided into two sublineages: 11 strains in IO5, 265 strains in IO6. While the spoligotype and MIRU signatures show that there are distinct EAI5 subgroups, the definition of the EAI5 and MANU1 groups are not well supported by the tensor analysis. They may represent a more generic sublineage that is further subdivided. Distinct patterns are observable in the spoligotype and MIRU signatures for most of the tensor sublineages.
Table 8. Confusion matrix of IndoOceanic strains
Figure 11. The clustering plot of IndoOceanic strains Clustering plot of IndoOceanic strains labeled by putative tensor sublineages using Principal Component Analysis. The tensor sublineages are not as distinct as they were for the previously analyzed major lineages, implying that the tensor sublineages are well distinguished in the PCA plot if they are stable.
Figure 12. Biomarker signatures of IndoOceanic tensor sublineages Spoligotype and MIRU signatures of tensor sublineages of IndoOceanic strains.
Sublineage structure of EuroAmerican Tucker3 found the most stable clusters and it generates 35 sublineages for EuroAmerican, denoted E1 to E35, while there are 33 SpolDB4 lineages labeled EuroAmerican. See additional file 1 for the confusion matrix of EuroAmerican strains that shows the correspondence of tensor sublineages and SpolDB4 families. Figure 13 shows the clustering plot of tensor sublineages. Figure 14 and Figure 15 show the spoligotype and MIRU signatures of tensor sublineages respectively.
Additional file 1. Confusion matrix of EuroAmerican strains The confusion matrix of EuroAmerican strains that shows the correspondence of tensor sublineages and SpolDB4 families. Each row represents SpolDB4 families and each column represents tensor sublineages.
Format: XLS Size: 15KB Download file
This file can be viewed with: Microsoft Excel Viewer
Figure 13. The clustering plot of EuroAmerican strains Clustering plot of EuroAmerican strains labeled by 35 tensor sublineages using Principal Component Analysis. The tensor sublineages are not as distinct as they were for the previously analyzed major lineages, reflecting the variability in the tensor cluster stability. It may also be due to the anticipated hierarchical structure in EuroAmerican strains.
Figure 14. Spoligotype signatures of EuroAmerican tensor sublineages Spoligotype signatures of tensor sublineages of EuroAmerican strains.
Figure 15. MIRU signatures of EuroAmerican tensor sublineages MIRU signatures of tensor sublineages of EuroAmerican strains.
Strains belonging to families H2, H37Rv, LAM12_MAD1, T1 (Tuscany variant), T1_RUS2, T4, T5_MAD2, and T5_RUS1 are clustered in tensor sublineages E9, E7, E8, E24, E11, E34, E34, and E17 respectively. In contrast, the T1 family, an ancestor strain family, is distributed across 25 tensor sublineages, with most T1 strains in E34. Sublineage stability is above .90 for 18 tensor sublineages. Spoligotype and MIRU signatures of sublineages suggest either subdivision or merging of SpolDB4 families. For instance, tensor sublineages E2, E6, and E32 include T1 strains only. In addition to common spacer deletions of EuroAmerican strains, E2 strains lack spacers 15 through 26, E6 strains lack spacers 9 through 23, and E32 strains lack spacers 1 through 19, which are all variations in spoligotype signatures of T1 strains. This sublineage classification further subdivides the poorlydefined ancestor T1 family. Strains of LAM families on the other hand are grouped in 17 tensor sublineages. Prior studies have found that LAM Rio strains identified by SNPs are found in multiple SpolDB4 lineages [20]. Therefore, it is expected that the use of multiple biomarkers leads to subdivision or merging of some SpolDB4 families.
Although most stable clusters of the EuroAmerican strain dataset are found using bestmatch stability, the DDweighted gap statistic plot has multiple peaks. DDweighted gap statistic, detailed in the methods section, is a cluster validity measure which is also used for detecting hierarchical structure in the datasets. Multiple peaks in DDweighted gap statistic plot suggest that the EuroAmerican dataset may have a multilevel hierarchical structure. Model order selection with randomized maps by Bertoni and Valentini can be used to detect the hierarchical structure in the EuroAmerican dataset [21].
We used the unsupervised tensor clustering framework to cluster MTBC strains of major lineages into sublineages. Next, we turn our attention to supervised tensor learning methods on multiplebiomarker tensors to classify strains into major lineages.
Classification of MTBC strains into major lineages using twoway and multiway supervised learning
Multiplebiomarker tensors can be used in supervised classification models as well as in unsupervised models. We use multiway partial least squares (NPLS) on multiplebiomarker tensors to predict major MTBC lineages [22]. In our experiments, we used spoligotype and MIRU as biomarkers and predicted the six major lineages using the same data as for the above unsupervised learning experiments combined into a single dataset. More specifically, we used 12 spoligotype deletions found informative in major lineage classification combined with 12loci MIRU [23]. We predicted major lineages with the NPLS multiway method and compared it with standard twoway PLS and prior results for conformal Bayesian Networks [4]. Table 9 shows the average testing Fmeasure as estimated by 5fold crossvalidation. We generate the multiplebiomarker tensor using 12 spoligotype deletions and 12loci MIRU with one additional bit indicating whether the at least one MIRU pattern includes letter rather than number of repeats, and create a predictive model using the NPLS multiway method. The model for standard 2way PLS is created by representing the data as a matrix with columns corresponding to 12 spoligotype deletions and 12loci MIRU with the additional indicator bit, and rows corresponding to MTBC strains. The number of latent variables for both NPLS and PLS are selected by inner 4fold crossvalidation of the training set data only.
Table 9. Multiway NPLS and standard twoway PLS classification accuracy results
We compare NPLS, standard PLS and Conformal Bayes Network (CBN) methods by Fmeasure of major lineage classification and see that they are accurate predictive models with no significant difference between the approaches. Table 9 shows the Fmeasure values for NPLS, standard PLS and CBN. The average Fmeasure of major lineage prediction on the same data using the CBN is 0.9897 [4]. This shows that NPLS and standard PLS methods predict major lineages as accurately as CBN, with a slightly better average Fmeasure value. All three methods achieve outstanding results for major lineage classification with no significant difference between approaches.
Conclusions
This study investigates multiplebiomarker tensors and illustrates how they can be used for both unsupervised and supervised learning models. First, a novel clustering framework is used to analyze the sublineage structure of MTBC strains based on multiple biomarkers. We generated multiplebiomarker tensors to represent multiple biomarkers of the MTBC genome and used multiway models for dimensionality reduction. The multiway representation determines a transformation of the data that captures similarities and differences between strains based on two distinct biomarkers. We clustered MTBC strains based on the transformed data using improved kmeans clustering and validated clustering results. We evaluated the sublineage structure of major lineages of MTBC and found similarities and clear distinctions in our subdivision of major lineages compared to the SpolDB4 classification. Simultaneous analysis of spoligotype and MIRU through multiplebiomarker tensors and clustering of MTBC strains leads to coherent sublineages within major lineages with clear and distinctive spoligotype and MIRU signatures. Second, we demonstrated how the multiplebiomarker tensor can be used to predict major lineages with extremely high accuracy competitive with other approaches. We show that 3way PLS, 2way PLS and CBN models are accurate major lineage predictors for MTBC strains.
The tensor clustering framework is flexible and can be applied to any multidimensional strain data. The design of the resulting tensor depends on the question to be answered. In this study, multiplebiomarker tensors are designed to find groups of MTBC strains. Thus, the application of the tensor clustering framework on multiplebiomarker tensors leads to sublineages of MTBC within major lineages. The multiplebiomarker tensor is further validated by the fact that it can used to predict known major lineages with high accuracy using NPLS. NPLS with multiplebiomarker tensors can be used for semisupervised learning as well. This can be useful for learning predictive models for sublineages in which only part of the data is labeled with sublineages and the other part of the data has no labels. This may result in more reliable and accurate classifiers of MTBC sublineages, and the resulting sublineage classifiers would be a significant enhancement to TB control, epidemiology and research. We leave this to future work.
The tensor clustering framework used in this study can be further extended to find subgroups of MTBC strains based on other biomarkers such as RFLP and SNPs. 15loci MIRU and 24loci MIRU patterns can also be used to represent MTBC genomes with multiplebiomarker tensors. Moreover, more than two biomarkers can be used in the MTBC genome representation. But, ambiguity in the tensor entries is an open question that needs to be solved in the tensor representation when more than two biomarkers are used. Addition of new biomarkers will increase the number of modes of the multiplebiomarker tensor, but the multiway analysis methods will remain the same.
Other questions of interest can be addressed by designing and analyzing hostpathogen tensors to examine the relationship of the pathogen genotype with host (or equivalent) attributes to examine questions of interest. For example, since the MTBC sublineages are known to be highly geographically dependent, a tensor which combines the pathogen genotype with the country of birth of the host may reveal additional sublineage structure and transmission patterns. A tensor combining MTBC genotype and host disease phenotype such as site of infection and drug resistance could be used to analyze MTBC genotype/phenotype relations.
Methods
Tensor Clustering Framework (TCF)
Clustering MTBC strains based on multiplebiomarker tensors consists of a sequence of steps. First, we find informative feature set of spoligotype deletions and generate a tensor. Second, we apply multiway models on the tensor and get a score matrix for the strain mode. Third, we use this score matrix to determine the similarity between strains, and cluster them using a stable version of kmeans. In the final step, we evaluate the clustering results using cluster validity indices. This stepwise clustering framework is outlined in Figure 16. We describe the steps of the tensor clustering framework in this section.
Figure 16. Tensor clustering framework Clustering framework of MTBC strains. Highdimensional genotype data are decomposed into twodimensional arrays using multiway models, which are then used as input to the kmeans_mtimes_seeded algorithm. Clusterings are validated using bestmatch stability. In case of a tie, the DDweighted gap statistic is used to pick the number of clusters.
Datasets
The dataset comprises 6848 distinct MTBC strains as determined by spoligotype and 12loci MIRU, labeled with major lineages and SpolDB4 families. The strains are mainly from the CDC dataset  a database collected by the CDC from 20042008 labeled with the major lineages collected by the TBInsight project (http://tbinsight.cs.rpi.edu/ webcite) that was previously studied in [4]. We also used the MIRUVNTRplus dataset from www.MIRUVNTRplus.org which is labeled with SpolDB4 lineages and sublineages. The original SpolDB4 labeled dataset provided in an online supplement [6] contains only spoligotypes. We found all occurrences of these spoligotypes in the CDC and MIRUVNTRplus dataset and constructed a database with spoligotype and MIRU patterns, with major lineages as determined by CDC, and sublineages as given in the SpolDB4 database [6]. The numbers of strains for each major lineage in the resulting dataset are shown in Table 10. We created 6 datasets from the CDC+MIRUVNTRplus dataset, one for each major lineage. These same 6 major lineage datasets were merged into one for the supervised learning experiment.
Table 10. Data statistics by major lineage
Feature Selection and Tensor Generation
Feature Selection The spoligotype pattern captures the variability in the DR locus of the MTBC genome. A spoligotype consists of 43 spacers represented as a 43bit binary sequence, and according to the hidden parent assumption, one or more contiguous spacers can be lost in a deletion event, but rarely gained [8,24]. Therefore, there are possible deletions of lengths varying from 1 to 43 in a spoligotype. Only subsets of spoligotypedeletions are required for effective discrimination of MTBC strains. A set of 12 deletion sequences of spoligotypes reported by Shabbeer et al. have proven to be good discriminator spacer deletions for major lineage classification [23]. These 12 deletion sequences are used in the supervised learning study. Another set of 81 deletion sequences of spoligotypes reported by Brudey et al. have proven to be good discriminator spacer deletions for SpolDB4 sublineage classification [6].
Within the TCF, we built a feature selection algorithm to find spacer deletions that are informative. This insures that the results are not biased by a priori selection of spoligotype deletions. Given a set of spoligotypes, we first calculate the frequency f_{i}, i = 1,‥, 946, of each possible deletion among the spoligotypes of strains. If f_{i} = 1, the deletion is a common deletion. If 0 ≤ f_{i} <threshold, the deletion is a nonexistent deletion, where threshold is data dependent and threshold = 0.05 is used by default. The deletions with frequency f_{i} such that threshold ≤ f_{i} < 1 are uncommon deletions. In the second step, we iterate through the set of uncommon deletions U, and remove an uncommon deletion u ∈ U, if there exists a common deletion c ∈ C which is a substring of u. We assign the final set of uncommon deletions as the feature set. Using the final feature set, we determine spoligotype deletions that are effective in discriminating the strains of the dataset. Algorithm 1 summarizes the feature selection procedure. Numbers of spoligotype deletions for each major lineage, found informative by the feature selection algorithm, are given in Table 10.
Tensor Generation We generated multiplebiomarker tensors using two biomarkers, spoligotype deletions and MIRU patterns, as explained earlier. The spoligotype deletions found informative by the feature selection algorithm are used in the generation of multiplebiomarker tensors. The multiplebiomarker tensor is of the form Strains × Spoligotype deletions × MIRU patterns. We used the tensor clustering framework on multiplebiomarker tensors to cluster strains.
Multiway modeling
Multiway models are needed to fit a model to multiway arrays. We used PARAFAC and Tucker3 techniques to model the tensors. We determined the number of components for each model to ensure a bound on the explained variance of data.
Multiway models We used PARAFAC and Tucker3 models to explain the tensor with high accuracy. Multiway modeling of tensors was carried out using the nwayToolbox of MATLAB by Bro et al. and the PLS toolbox[25,26].
PARAFAC
PARAFAC is a generalization of singular value decomposition to multiway data [27,28]. A 3way array
 X
where A ∈ ℝ^{I}^{×}^{R}, B ∈ ℝ^{J}^{×}^{R}, C ∈ ℝ^{K}^{×}^{R} are component matrices of first, second, and third mode.
 G
 E
Figure 17. PARAFAC model PARAFAC model of a threeway array
 X
The PARAFAC model is symmetric in all modes and the number of components in each mode is the same [29]. The PARAFAC model is a simple model, which comes with a restriction of the equality on the number of components in each mode which makes it difficult to fit a data array with the PARAFAC model. One advantage of the PARAFAC model is its uniqueness: fitting the PARAFAC model with the same number of components to a given multiway dataset returns the same result.
Tucker3
Tucker3 is an extension of bilinear factor analysis to multiway datasets [30]. A 3way array
 X
where A ∈ ℝ^{I}^{×}^{P}, B ∈ ℝ^{J}^{×}^{Q}, C ∈ ℝ^{K}^{×}^{R} are the component matrices of first, second and third modes respectively.
 G
 E
Figure 18. Tucker3 model Tucker3 model of a threeway array
 X
 E
Tucker3 is a more flexible model compared to PARAFAC. This flexibility is due to the core array
 G
Model validation A multiway model is appropriate if adding more components to any mode does not improve the fit considerably. There is a tradeoff between the complexity of the model and the variance of the data explained by the model. Therefore, validation of a model also determines a suitable complexity for the model. We used the core consistency diagnostic (CORCONDIA) to determine the number of components of the PARAFAC model [32]. The core consistency diagnostic measures the similarity of the core array
 G
Clustering algorithm
We developed the kmeans_mtimes_seeded algorithm, a modified version of the kmeans algorithm, to group MTBC strains based on the score matrices of the multiway models. Kmeans is a commonly used clustering algorithm with two weaknesses: 1) Initial centroids are chosen randomly, 2) The objective value of kmeans, measured as withincluster sum of squares, may converge to local minima, rather than finding the global minimum. We solve these problems with two improvements: 1) Initial centroids are chosen by careful seeding, using a heuristic called kmeans++, suggested by Arthur et al. [33]. Let D(x) represent the shortest Euclidean distance from data point x to the closest center already chosen. kmeans++ chooses a new centroid at each step such that the new centroid is furthest from all chosen centroids. Algorithm 2 summarizes the kmeans++ procedure. 2) The local minima problem is partially solved by repeating the kmeans algorithm multiple times and retrieving the run with the minimum objective value. We repeated the algorithm m = 20 times. The kmeans_mtimes_seeded algorithm combines these two improvements, as summarized in algorithm 3. The kmeans_mtimes seeded algorithm is more stable compared to the kmeans algorithm, and produces more accurate clusters.
Cluster Validation
Clustering results for the MTBC strains are evaluated to determine the best choice for the number of clusters and compare the chosen clustering with existing sublineages using cluster validity indices. We used the bestmatch stability to pick the most stable clusterings. In case of a tie in average bestmatch stability, we used the DDweighted gap statistic for cluster validation [34]. We compare our clusters to an existing classification using the Fmeasure.
BestMatch Stability The stability of a clustering is measured by the distribution of pairwise similarities between clusterings of subsamples of the data. The idea behind stability is that if we repeatedly sample data points and apply the same clustering algorithm to the subsample, then an effective clustering algorithm applied to well separated data should produce clusterings that do not vary much for different subsamples [35]. In such cases, the algorithm is stable independent of input randomization. We use bestmatch stability as suggested by Hopcroft et al. [36] to assess stability. The algorithm clusters the same data multiple times, and compares the reference cluster to model clusterings. We used 25 model clusterings to compare with the reference cluster. The stability of each cluster is calculated by finding the average best match between this cluster and the clusters identified using other model clusterings. High average bestmatch values denote that the two clusters have many strains in common and are of roughly the same size [8]. We also calculate the average bestmatch of a clustering by finding the average of bestmatch values for all clusters in the reference clustering. Bestmatch stability of a cluster C, compared to a model clustering , is calculated as:
where
and refC_{i} is the set of items in reference cluster i.
DDWeighted Gap Statistic (PC) Tibshirani et al. proposed a cluster validity index called the gap statistic, which is based on the withincluster sum of squares (WCSS) of a clustering [37]. Let the dataset be X ∈ ℝ^{n}^{×}^{p} consisting of n data points with p dimensions. Let d_{ij} be the Euclidean distance between data points i and j. After clustering this dataset, suppose that we have k clusters C_{1}, ‥, C_{k}, where C_{i} denotes the indices of data points in cluster i, of size n_{i} = C_{i} . The sum of withincluster pairwise distances for cluster r is defined as:
and the withincluster sum of squares for a clustering is defined as:
The idea of the gap statistic method is to compare W_{k} and its expected value under a reference distribution of the dataset. Therefore, the gap value is defined as:
Where represents the expected value under a sample of size n based on a reference distribution. The optimal number of clusters is the value for which Gap_{n}(k) is maximized. The selection of number of clusters via gap statistic is summarized in [37].
The reference distribution can be one of two choices: uniform distribution (Gap/Unif), or a uniform distribution over a box aligned with the principal components of the dataset (Gap/PC). Experiments by Tibshirani et al. show that Gap/PC finds the number of clusters more accurately, therefore we used Gap/PC in this study [37].
The gap statistic is a powerful method for estimating the number of clusters in a dataset. However, a study by Dudoit et al. showed that the gap statistic does not estimate the correct number of clusters for every case [38]. This may be because W_{k} increases as the number of data points increases. Hierarchical structure of the data may also cause problems. The data may be composed of nested clusters and the gap statistic will be capturing only the minimum of these candidate numbers of clusters. Yan et al. suggested a 2step improvement to the gap statistic, called the DDweighted gap statistic [39]. They defined average withincluster pairwise distances for cluster r as follows:
and the weighted withincluster sum of squares as:
Based on ,the weighted gap statistic is defined as
Let denote the difference in when the number of clusters is raised from k1 to k. is defined as
for , and otherwise it will be close to zero. Therefore, to find a “knee” point in the plot, they introduce a second difference equation and define as
is maximized when k is equal to the true number of clusters. The advantage of over the gap statistic is that there may be multiple peaks in the plot of and this may indicate a hierarchical structure in the data. In such cases, multilayer analysis should be used instead of a single step procedure.
Fmeasure The Fmeasure is a weighted combination of precision and recall of a clustering. Since the Fmeasure combines precision and recall of clustering results, it has proven to be a successful metric. We use the Fmeasure to evaluate how similar the tensor sublineages are to the SpolDB4 families. According to the contingency table in Table 11, precision, recall, and Fmeasure are defined as:
Table 11. Contingency table
Multiway Partial Least Squares Regression (NPLS)
NPLS is a multiway regression method where at least one of the independent and dependent blocks has at least three modes created by Bro et al. by generalizing PLS to multiway data [22]. Consider independent variables in the Xblock,
 X
 X
X=t(w^{K} ⊗ w^{J})' + E (1)
and the twoway array Y is decomposed as:
Y = uq' + F (2)
where t ∈ ℝ^{I}^{×1} and u ∈ ℝ^{I}^{×1} are score vectors of X and Y. w^{J} ∈ ℝ^{J}^{×1} and w^{K} ∈ ℝ^{K}^{×1} are the loading vectors (weights) of the second and third modes of
 X
Notice that the twoway array Y is decomposed into one score and one loading vector, whereas the matricized threeway array X is decomposed into one score and two loading vectors, w^{J} and w^{K}. This is the main difference between NPLS and PLS. At each iteration of NPLS, a new PLS component is added. If n PLS components are used, X is decomposed into component matrices T ∈ ℝ^{I}^{×}^{n}, W^{J} ∈ ℝ^{J}^{x}^{n}, W^{K} ∈ ℝ^{Kxn}, and Y is decomposed into component matrices U ∈ ℝ^{I}^{×n}, Q ∈ ℝ^{M}^{×}^{n}.
The aim of NPLS is to maximize the covariance of
 X
 X
U = TB + E_{u} (3)
This requires finding loading vectors w^{J} and w^{K} such that the covariance of t and y are maximized:
where Z∈ℝ^{J}^{×}^{K} is a matrix with and . To maximize this expression, we write it in matrix notation:
The problem of finding w^{J} and w^{K} is simply solved by SVD on Z[22,40]. w^{J} and w^{K} are first left and right singular vectors of Z. To reconstruct Y, we substitute (3) in equation 2:
Given
 X
The NPLS model of a multiway array is a multilinear model, like PARAFAC, which means that it has no rotational freedom. Therefore, the NPLS model of a multiway array is unique. In this study, we used a 3way array as the Xblock and a 2way array as the Yblock, therefore we are particularly working on the TriPLS2 version of NPLS, which is summarized in Algorithm 4. The term
 X
 X
Authors’ contributions
CO, KB and BY conceived the study. CO carried out the experiments. CO, KP and BY analyzed the results. AS and SV provided and analyzed some of the data. CO, AS, SV and KB drafted the manuscript.
Competing interests
The authors declare that they have no competing interests.
Acknowledgements
This work was made possible by Dr. Lauren Cowan and Dr. Jeff Driscoll of the Centers for Disease Control and Prevention. This work was supported by NIH R01LM009731.
This article has been published as part of BMC Genomics Volume 12 Supplement 2, 2011: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2010. The full contents of the supplement are available online at http://www.biomedcentral.com/14712164/12?issue=S2.
References

Gagneux S, DeRiemer K, Van T, KatoMaeda M, de Jong BC, Narayanan S, Nicol M, Niemann S, Kremer K, Gutierrez MC, Hilty M, Hopewell PC, Small PM: Variable hostpathogen compatibility in Mycobacterium tuberculosis.
PNAS 2006, 103(8):28692873. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Gagneux S, Small PM: Global phylogeography of Mycobacterium tuberculosis and implications for tuberculosis product development.
The Lancet Infectious Diseases 2007, 7(5):328337. PubMed Abstract  Publisher Full Text

Supply P, Allix C, Lesjean S, CardosoOelemann M, RuschGerdes S, Willery E, Savine E, de Haas P, van Deutekom H, Roring S, Bifani P, Kurepina N, Kreiswirth B, Sola C, Rastogi N, Vatin V, Gutierrez MC, Fauville M, Niemann S, Skuce R, Kremer K, Locht C, van Soolingen D: Proposal for Standardization of Optimized Mycobacterial Interspersed Repetitive UnitVariableNumber Tandem Repeat Typing of Mycobacterium tuberculosis.
J. Clin. Microbiol. 2006, 44(12):44984510. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Aminian M, Shabbeer A, Bennett KP: A conformal Bayesian network for classification of Mycobacterium tuberculosis complex lineages.
BMC Bioinformatics 2010, 11(Suppl 3):S4. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Ferdinand S, Valétudie G, Sola C, Rastogi N: Data mining of Mycobacterium tuberculosis complex genotyping results using mycobacterial interspersed repetitive units validates the clonal structure of spoligotypingdefined families.
Research in Microbiology 2004, 155(8):647654. PubMed Abstract  Publisher Full Text

Brudey K, Driscoll J, Rigouts L, Prodinger W, Gori A, AlHajoj S, Allix C, Aristimuno L, Arora J, Baumanis V, et al.: Mycobacterium tuberculos is complex genetic diversity: mining the fourth international spoligotyping database (SpolDB4) for classification, population genetics and epidemiology.
BMC Microbiology 2006, 6:23. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Asiimwe B: Molecular characterization of Mycobacterium tuberculosis complex in Kampala, Uganda. PhD thesis. Makerere University; 2008.

Vitol I, Driscoll J, Kreiswirth B, Kurepina N, Bennett KP: Identifying Mycobacterium tuberculosis complex strain families using spoligotypes.
Infection, Genetics and Evolution 2006, 6(6):491504. PubMed Abstract  Publisher Full Text

Rubel O, Weber GH, Huang MY, Bethel EW, Biggin MD, Fowlkes CC, Luengo CL, Keranen SVE, Eisen MB, Knowles DW, Malik J, Hagen H, Hamann B: Integrating Data Clustering and Visualization for the Analysis of 3D Gene Expression Data.
IEEE/ACM Transactions on Computational Biology and Bioinformatics 2010, 7:6479. PubMed Abstract  Publisher Full Text

Handl , Julia , Knowles , Joshua , Kell , Douglas B: Computational cluster validation in postgenomic data analysis.
Bioinformatics 2005, 21(15):32013212. PubMed Abstract  Publisher Full Text

Giancarlo , Raffaele , Scaturro , Davide , Utro , Filippo : Computational cluster validation for microarray data analysis: experimental assessment of Clest, Consensus Clustering, Figure of Merit, Gap Statistics and Model Explorer.
BMC Bioinformatics 2008, 9:462. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genomewide expression patterns.
Proceedings of the National Academy of Sciences of the United States of America (PNAS) 1998, 95(25):1486314868. Publisher Full Text

Kriegel HP, Kröger P, Zimek A: Clustering HighDimensional Data: A Survey on Subspace Clustering, PatternBased Clustering, and Correlation Clustering.

Morup M: Applications of tensor (multiway array) factorizations and decompositions in data mining.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2011, 1:2440. Publisher Full Text

Alter O, Golub GH: Reconstructing the pathways of a cellular system from genomescale signals by using matrix and tensor computations.
Proceedings of the National Academy of Sciences of the United States of America 2005, 102(49):1755917564. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Omberg L, Golub GH, Alter O: A tensor higherorder singular value decomposition for integrative analysis of DNA microarray data from different studies.
Proceedings of the National Academy of Sciences 2007, 104(47):1837118376. Publisher Full Text

Acar E, AykutBingol C, Bingol H, Bro R, Yener B: Multiway analysis of epilepsy tensors.
Bioinformatics 2007, 23(13):i10i18. PubMed Abstract  Publisher Full Text

Yener B, Acar E, Aguis P, Bennett K, Vandenberg S, Plopper G: Multiway modeling and analysis in stem cell systems biology.
BMC Systems Biology 2008, 2:63. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Kamerbeek J, Schouls L, Kolk A, van Agterveld M, van Soolingen D, Kuijper S, Bunschoten A, Molhuizen H, Shaw R, Goyal M, van Embden J: Simultaneous detection and strain differentiation of Mycobacterium tuberculosis for diagnosis and epidemiology.
Journal of Clinical Microbiology 1997, 35(4):907914. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Gibson AL, Huard RC, Gey van Pittius NC, Lazzarini LCO, Driscoll J, Kurepina N, Zozio T, Sola C, Spindola SM, Kritski AL, Fitzgerald D, Kremer K, Mardassi H, Chitale P, Brinkworth J, Garcia de Viedma D, Gicquel B, Pape JW, van Soolingen D, Kreiswirth BN, Warren RM, van Helden PD, Rastogi N, Suffys PN, Lapa e Silva J, Ho JL: Application of sensitive and specific molecular methods to uncover global dissemination of the major RDRio sublineage of the Latin AmericanMediterranean Mycobacterium tuberculosis spoligotype family.
J. Clin. Microbiol. 2008, 46(4):12591267. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bertoni A, Valentini G: Model order selection for biomolecular data clustering.
BMC Bioinformatics 2007, 8(Suppl 2):S7. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Bro R: Multiway calibration. Multilinear PLS.
Journal of Chemometrics 1996, 10:4761. Publisher Full Text

Shabbeer A, Cowan L, Driscoll JR, Ozcaglar C, Vandenberg SL, Yener B, Bennett KP: TBLineage: an online tool for classification and analysis of strains of Mycobacterium tuberculosis complex.
2011.
Unpublished manuscript

Warren RM, Streicher EM, Sampson SL, van der Spuy GD, Richardson M, Nguyen D, Behr MA, Victor TC, van Helden PD: Microevolution of the Direct Repeat Region of Mycobacterium Tuberculosis: Implications for Interpretation of Spoligotyping Data.
J. Clin. Microbiol. 2002, 40(12):44574465. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Andersson CA, Bro R: The Nway toolbox for MATLAB.
Chemometrics and Intelligent Laboratory Systems 2000, 52:14. Publisher Full Text

Eigenvector Research Inc.: PLS Toolbox. [http://www.eigenvector.com/] webcite
2011.
Last Accessed: March

Harshman RA: Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multimodal factor analysis.

Kroonenberg PM: Three mode component models: A survey of the literature.

Kroonenberg PM: Applied Multiway Data Analysis. Wiley; 2008.

Tucker LR: Some mathematical notes on threemode factor analysis.
Psychometrika 1966, 31(3):279311. PubMed Abstract  Publisher Full Text

Acar E, Yener B: Unsupervised multiway data analysis: A literature survey. [http://www.computer.org/portal/web/csdl/doi/10.1109/TKDE.2008.112] webcite
IEEE Transactions on Knowledge and Data Engineering 2009, 21:620.

Bro R, Kiers H: A new efficient method for determining the number of components in PARAFAC models.
Journal of Chemometrics 2003, 17(5):274286. Publisher Full Text

Arthur D, Vassilvitskii S: Kmeans++: The advantages of careful seeding. [http:/ / www.bibsonomy.org/ bibtex/ 2553bbfa74b13c47b4e9c7c0034a8406e/ cdevries?layout=simplehtml] webcite
SODA ’07: Proceedings of the eighteenth annual ACMSIAM symposium on discrete algorithms Society for Industrial and Applied Mathematics; 2007. PubMed Abstract  Publisher Full Text

Halkidi M, Batistakis Y, Vazirgiannis M: Cluster validity methods: part I.
SIGMOD Rec 2002, 31(2):4045. Publisher Full Text

BenDavid S, Luxburg UV, Pál D: A Sober Look at Clustering Stability. [http://www.springerlink.com/content/wx3m141rt96h/#section=489786&page=1 ] webcite

Hopcroft J, Khan O, Kulis B, Selman B: Natural communities in large linked networks. In KDD ’03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. InACM; 2003:541546. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Tibshirani R, Walther G, Hastie T: Estimating the Number of Clusters in a Dataset via the Gap Statistic.

Dudoit S, Fridlyand J: A predictionbased resampling method for estimating the number of clusters in a dataset. [http://genomebiology.com/2002/3/7/research/0036/ ] webcite

Yan M, Ye K: Determining the number of clusters using the weighted gap statistic.
Biometrics 2007, 63(4):10317. PubMed Abstract  Publisher Full Text

Smilde AK, Bro R, Geladi P: Multiway Analysis: Applications in the Chemical Sciences. Wiley; 2004.

Smilde AK: Comments on multilinear PLS.
Journal of Chemometrics 1997, 11(5):367377. Publisher Full Text

de Jong , Sijmen : Regression coefficients in multilinear PLS.
Journal of Chemometrics 1998, 12:7781. Publisher Full Text

Bro R, Smilde AK, de Jong S: On the difference between lowrank and subspace approximation: improved model for multilinear PLS regression.
Chemometrics and Intelligent Laboratory Systems 2001, 58:313. Publisher Full Text

Bro R, Smilde AK: Centering and scaling in component analysis. [http://onlinelibrary.wiley.com/doi/10.1002/cem.773/pdf ] webcite