Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY, USA

Mathematical Sciences Department, Rensselaer Polytechnic Institute, Troy, NY, USA

Computer Science Department, Siena College, Loudonville, NY, USA

Abstract

Background

Strains of

Results

Simultaneous analysis of the spoligotype and MIRU type of strains using TCF on multiple-biomarker tensors leads to coherent sublineages of major lineages with clear and distinctive spoligotype and MIRU signatures. Comparison of tensor sublineages with SpolDB4 families either supports tensor sublineages, or suggests subdivision or merging of SpolDB4 families. High prediction accuracy of major lineage classification with supervised tensor learning on multiple-biomarker tensors validates our unsupervised analysis of sublineages on multiple-biomarker tensors.

Conclusions

TCF on multiple-biomarker tensors achieves simultaneous analysis of multiple biomarkers and suggests a new putative sublineage structure for each major lineage. Analysis of multiple-biomarker tensors gives insight into the sublineage structure of MTBC at the genomic level.

Background

Tuberculosis (TB), a bacterial disease caused by

Genotyping of MTBC is used to classify MTBC strains into distinct lineages and/or sublineages, which are useful for TB tracking, TB control, and examining host-pathogen relationships

While sublineages of MTBC are routinely used in the TB literature, their exact definitions, names, and numbers have not been clearly established. The SpolDB4 database contains 39,295 strains and their spoligotypes with the vast majority of them labeled and classified into 62 sublineages

The goal of this paper is to examine the sublineage structure of MTBC on the basis of multiple biomarkers. The proposed method reveals structure not captured in SpolDB4 spoligotype families because SpolDB4 sublineages take into account only a single biomarker, the spoligotype. A spoligotype-only tool, SPOTCLUST, was used to find MTBC sublineages using an unsupervised probabilistic model, reflecting spoligotype evolution

In this study, we develop a tensor clustering framework to find the sublineage structure of MTBC strains labeled by major lineages based on multiple biomarkers. This is an unsupervised learning problem. We generate multiple-biomarker tensors of MTBC strains for each major lineage and apply multiway models for dimensionality reduction. The model accurately captures spoligotype evolutionary dynamics using contiguous deletions of spacers. The tensor transforms spoligotypes and MIRU into a new representation, where traditional clustering methods apply without users having to decide a

In the next section, we give a brief background on clustering and multiway analysis of post-genomic data, spoligotyping, and MIRU typing.

Clustering post-genomic data

Data clustering is a class of techniques for unsupervised classification of data samples into groups of similar behavior, function, or trait

Application of multiway models to post-genomic data clustering

Clustering on post-genomic data can be accomplished based on multiple sources of ground truth. The ground truth can be based on multiple biomarkers, host and pathogen, or antigen and antibody. A survey by Kriegel et al. outlines the methods for finding clusters in high-dimensional data

Spoligotyping

Spoligotyping is a DNA fingerprinting method that exploits the polymorphisms in the direct repeat (DR) region of the MTBC genome. The DR region is a polymorphic locus in the genome of MTBC which consists of direct repeats (36 bp), separated by unique spacer sequences of 36 to 41 bp

Large datasets of MTBC strains genotyped by spoligotype have been amassed such as SpolDB4

MIRU-VNTR typing

MIRU is a homologous 46-100 bp DNA sequence dispersed within intergenic regions of MTBC, often as tandem repeats. MIRU-VNTR typing is based on the number of tandem repeats of MIRUs at certain identified loci. Among these 41 identified mini-satellite regions on the MTBC genome, different subsets of sizes 12, 15, and 24 are proposed for the standardization of MIRU-VNTR typing

Results

We used the tensor clustering framework to cluster MTBC strains using multiple biomarkers, and compared the clustering to SpolDB4 sublineages. Next, we used supervised tensor learning and classified MTBC strains into major lineages using spoligotype deletions and MIRU patterns. We compared multiway and two-way supervised learning methods based on their prediction accuracy for major lineage classification. In the following section, we introduce multiple-biomarker tensors and present unsupervised and supervised learning experiments on multiple-biomarker tensors.

Multiple-biomarker tensor analysis of strain data

Multiple biomarkers of the MTBC genome in a relational database can be represented as a high-dimensional dataset for multiway analysis. The multiple-biomarker tensor is constructed this way, with one of the modes representing strains and other modes representing biomarkers. In our experiments, we use this multidimensional array or tensor with three modes representing strains, spoligotype deletions, and MIRU patterns. This multiple-biomarker tensor captures three key properties of MTBC strains: spoligotype deletions, number of repeats in MIRU loci, and coexistence of spoligotype deletions with MIRU loci.

The strain dataset is arranged as a three-way array with strains in the first mode, spoligotype deletions in the second mode, and MIRU patterns in the third mode. Each entry

- X

**Multiple-biomarker tensor**

- X

**Generation of multiple-biomarker tensor** Biomarker kernel matrix

where

and r_{ik} is the number of repeats in MIRU locus k of strain i. Multiple-biomarker tensors can be used for both unsupervised and supervised learning. Next, we use the unsupervised tensor clustering framework on multiple-biomarker tensors to subdivide major lineages of MTBC into sublineages.
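The construction of the tensor can be illustrated with a minimal sketch. It assumes that the biomarker kernel matrix of a strain is the outer product of its binary spoligotype deletion vector and its MIRU repeat vector, so that entry (i, j, k) equals r_{ik} when strain i carries deletion j and 0 otherwise. This rule is an assumption made for illustration, since the defining equation is not reproduced in the text above; the function names and the toy data are hypothetical.

```python
def biomarker_kernel(deletions, miru_repeats):
    """Outer product of one strain's deletion indicators (0/1) and MIRU repeat counts."""
    return [[d * r for r in miru_repeats] for d in deletions]


def build_tensor(deletion_matrix, miru_matrix):
    """Stack per-strain kernel matrices into a strains x deletions x loci array."""
    if len(deletion_matrix) != len(miru_matrix):
        raise ValueError("both biomarkers must describe the same strains")
    return [biomarker_kernel(d, r) for d, r in zip(deletion_matrix, miru_matrix)]


# Toy data (hypothetical): 2 strains, 3 candidate deletions, 4 MIRU loci.
deletion_matrix = [[1, 0, 1], [0, 1, 0]]
miru_matrix = [[2, 3, 1, 5], [4, 2, 2, 3]]
tensor = build_tensor(deletion_matrix, miru_matrix)
```

Under this assumed rule, a slice of the tensor for a deletion that a strain does not carry is all zeros, which is how the coexistence of a deletion with the MIRU pattern is encoded.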

Subdivision of major lineages into sublineages

We subdivide each major lineage of MTBC into sublineages using multiple-biomarker tensors. For each major lineage, we generated the multiple-biomarker tensor using spoligotypes and MIRU types and applied multiway models to identify putative sublineages of each major lineage. Two multiway analysis methods were used: PARAFAC and Tucker3. Details of the methods and how the model parameters or components were selected can be found in the methods section. The validated multiway models with numbers of components for each major lineage are shown in Table

Number of components used in PARAFAC and Tucker3 models.

| Major Lineage | Tensor size | PARAFAC: # Components | PARAFAC: Core Consistency / Variance | Tucker3: # Components | Tucker3: Variance |
|---|---|---|---|---|---|
| M. africanum | 64 × 22 × 12 | 3 | 95.08 / 93.33 | [4 3 1] | 91.94 |
| M. bovis | 102 × 34 × 12 | 2 | 100.00 / 86.02 | [7 5 1] | 91.05 |
| East Asian (Beijing) | 571 × 5 × 12 | 2 | 100.00 / 81.58 | [3 4 2] | 93.09 |
| East-African Indian (CAS) | 508 × 18 × 12 | 3 | 90.75 / 80.48 | [6 6 4] | 94.27 |
| Indo-Oceanic | 1023 × 28 × 12 | 5 | 92.99 / 80.35 | [15 13 5] | 95.55 |
| Euro-American | 4580 × 109 × 12 | 14 | 99.06 / 89.83 | [14 13 5] | 89.77 |

Number of components used in PARAFAC and Tucker3 models to fit the tensors for the datasets to be clustered. We used the core consistency diagnostic to validate PARAFAC models and percentage of explained variance to validate Tucker3 models.

Number of SpolDB4 families and number of tensor sublineages for each major lineage

| Major Lineage | # SpolDB4 families | # Tensor sublineages | F-measure | Best-match stability |
|---|---|---|---|---|
| M. africanum | 4 | 4 | 0.66 | 1 |
| M. bovis | 5 | 3 | 0.71 | 1 |
| East Asian (Beijing) | 2 | 6 | 0.88 | 1 |
| East-African Indian (CAS) | 4 | 4 | 0.75 | 1 |
| Indo-Oceanic | 13 | 9 | 0.67 | 0.86 |
| Euro-American | 33 | 35 | 0.53 | 0.84 |

F-measure and average best-match stability are used to assess the agreement of the tensor sublineages with the SpolDB4 lineages and the certainty of the tensor sublineages, respectively.

The F-measure values range from 53% to 88% indicating that the sublineages found by the tensors only partially overlap with those of SpolDB4. Recall that the SpolDB4 families were created by expert analysis using only spoligotypes and that analysis by alternative biomarkers such as SNP and LSP has led to alternative definitions of MTBC sublineages. The tensor sublineages are based on spoligotype and MIRU patterns, thus in some cases the tensor divides SpolDB4 families due to difference in MIRU patterns even if the spoligotypes match. In other cases, the tensor analysis merges the SpolDB4 families because the collective spoligotypes and MIRU patterns are very close. In some cases, the tensor analysis almost exactly reproduces a SpolDB4 family providing strong support for the existence of these families with no expert guidance. In addition, the MIRU patterns provide additional evidence for the existence of these distinct sublineages. Thus, multiway analysis of MTBC strains of each major lineage with multiple biomarkers leads to new sublineages and reaffirms existing ones. Further insight can be obtained by examining the putative sublineages for each major lineage, which is detailed next.

**Sublineage structure of M. africanum** The most stable clusters were produced using PARAFAC and it constructed four putative sublineages of

Confusion matrix of M. africanum strains

| | MA1 | MA2 | MA3 | MA4 |
|---|---|---|---|---|
| Stability | 1 | 1 | 1 | 1 |
| AFRI | 2 | 5 | 1 | 0 |
| AFRI_1 | 21 | 0 | 0 | 16 |
| AFRI_2 | 0 | 12 | 0 | 0 |
| AFRI_3 | 0 | 1 | 6 | 0 |

Confusion matrix for 64 distinct

**The clustering plot of M. africanum strains** Clustering plot of

**Biomarker signatures of M. africanum tensor sublineages** Spoligotype and MIRU signatures of tensor sublineages of

The tensor sublineages strongly support the existence of the SpolDB4 AFRI_1, AFRI_2 and AFRI_3 families and show that the AFRI family is composed of these three families. With an F-measure of 66%, the tensor sublineages differ markedly from the SpolDB4 families for the

The MIRU-VNTR

Confusion matrix of distinct

| | MA1 | MA2 | MA3 | MA4 |
|---|---|---|---|---|
| West African 1 | 0 | 5 | 0 | 0 |
| West African 2 | 21 | 0 | 0 | 16 |
| Unspecified | 2 | 13 | 7 | 0 |

Confusion matrix for 64 distinct

**Sublineage structure of M. bovis** PARAFAC generated the most stable clusters and constructed 3 sublineages for

Confusion matrix of M. bovis strains

| | MB1 | MB2 | MB3 |
|---|---|---|---|
| Stability | 1 | 1 | 1 |
| BOV | 7 | 5 | 5 |
| BOVIS1 | 0 | 0 | 29 |
| BOVIS1_BCG | 0 | 0 | 11 |
| BOVIS2 | 24 | 0 | 0 |
| BOVIS3 | 0 | 21 | 0 |

Confusion matrix of

**The clustering plot of M. bovis strains** Clustering plot of

**Biomarker signatures of M. bovis tensor sublineages** Spoligotype and MIRU signatures of tensor sublineages of

**Sublineage structure of East Asian (Beijing)** The most stable clusters are produced by Tucker3 and it constructs six distinct sublineages of East Asian (Beijing), denoted B1 through B6. The variability in the spoligotypes of East Asian is limited to spacers 35 through 43 since all East Asian strains have spacers 1 to 34 absent. Since the SpolDB4 classification is based only on spoligotypes, the limited variability allows only two families, BEIJING and BEIJING-LIKE. Table

Confusion matrix of East Asian (Beijing) strains

| | B1 | B2 | B3 | B4 | B5 | B6 |
|---|---|---|---|---|---|---|
| Stability | 1 | 1 | 1 | 1 | 1 | 1 |
| BEIJING | 468 | 0 | 0 | 18 | 0 | 41 |
| BEIJING-LIKE | 0 | 16 | 8 | 0 | 20 | 0 |

Confusion matrix of East Asian (Beijing) strains clustered into 6 groups using Tucker3. Correct labels are SpolDB4 labels on the rows, and tensor sublineages are represented by each column. The six highly stable tensor sublineages are indicative of additional genetic diversity within the BEIJING and BEIJING-LIKE sublineages.

**The clustering plot of East Asian (Beijing) strains** Clustering plot of East Asian (Beijing) strains using Principal Component Analysis. Six putative tensor sublineages, B1 to B6, are clearly distinct.

**Biomarker signatures of East Asian (Beijing) tensor sublineages** Spoligotype and MIRU signatures of tensor sublineages of East Asian (Beijing) strains. Tensor sublineages B1, B4, B6 include BEIJING strains and sublineages B2, B3, B5 include BEIJING-LIKE strains.

**Sublineage structure of East-African Indian (CAS)** Tucker3 generated the most stable clusters and it constructed four distinct sublineages for East-African Indian (also known as CAS) denoted C1, C2, C3, and C4. The strains are also labeled with four SpolDB4 lineages: CAS, CAS1_DELHI, CAS1_KILI and CAS2. Table

Confusion matrix of East-African Indian (CAS) strains

| | C1 | C2 | C3 | C4 |
|---|---|---|---|---|
| Stability | 1 | 1 | 1 | 1 |
| CAS | 50 | 21 | 35 | 1 |
| CAS1_DELHI | 331 | 2 | 0 | 0 |
| CAS1_KILI | 0 | 0 | 0 | 23 |
| CAS2 | 45 | 0 | 0 | 0 |

Confusion matrix of East-African Indian (CAS) strains clustered into 4 groups using Tucker3. Correct labels are SpolDB4 labels on the rows, and tensor sublineages are represented by each column.

**The clustering plot of East-African Indian (CAS) strains** Clustering plot of East-African Indian (CAS) strains using Principal Component Analysis. Four putative tensor sublineages, C1 to C4, are clearly distinct.

**Biomarker signatures of East-African Indian (CAS) tensor sublineages** Spoligotype and MIRU signatures of tensor sublineages of East-African Indian (CAS) strains. In addition to deletions in C1 strains, C2 strains lack spacer 22. In addition to deletions in C3 strains, C4 strains lack spacer 35 and have only 1 repeat in MIRU 26. C2 and C3 strains are very close in their MIRU signature, but they differ by variations in MIRU locus 10.

**Sublineage structure of Indo-Oceanic** PARAFAC found the most stable clusters and it constructs nine distinct putative sublineages for Indo-Oceanic, denoted IO1 to IO9, while the dataset has thirteen SpolDB4 lineages. Table

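Confusion matrix of Indo-Oceanic strains

| | IO1 | IO2 | IO3 | IO4 | IO5 | IO6 | IO7 | IO8 | IO9 |
|---|---|---|---|---|---|---|---|---|---|
| Stability | 0.94 | 1 | 0.90 | 1 | 0.56 | 0.91 | 0.84 | 0.86 | 0.77 |
| EAI | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 |
| EAI1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 |
| EAI1_SOM | 0 | 0 | 2 | 0 | 0 | 0 | 8 | 107 | 0 |
| EAI2_MANILLA | 0 | 0 | 0 | 0 | 11 | 265 | 0 | 0 | 0 |
| EAI2_NTB | 0 | 0 | 0 | 0 | 15 | 0 | 0 | 0 | 0 |
| EAI3_IND | 0 | 105 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| EAI4_VNM | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 42 |
| EAI5 | 231 | 24 | 26 | 0 | 3 | 10 | 35 | 32 | 31 |
| EAI6_BGD1 | 33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 |
| EAI8_MDG | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 |
| MANU1 | 1 | 0 | 0 | 0 | 0 | 5 | 0 | 2 | 1 |
| MICROTI | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 |
| ZERO | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 |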

Confusion matrix of Indo-Oceanic strains clustered into 9 groups using PARAFAC. Correct labels are SpolDB4 labels on the rows, and tensor sublineages are represented by each column. SpolDB4 lineages except EAI5 and MANU1 map to distinct tensor sublineages.

**The clustering plot of Indo-Oceanic strains** Clustering plot of Indo-Oceanic strains labeled by putative tensor sublineages using Principal Component Analysis. The tensor sublineages are not as distinct as they were for the previously analyzed major lineages, implying that the tensor sublineages are well distinguished in the PCA plot if they are stable.

**Biomarker signatures of Indo-Oceanic tensor sublineages** Spoligotype and MIRU signatures of tensor sublineages of Indo-Oceanic strains.

**Sublineage structure of Euro-American** Tucker3 found the most stable clusters and it generates 35 sublineages for Euro-American, denoted E1 to E35, while there are 33 SpolDB4 lineages labeled Euro-American. See additional file

**Confusion matrix of Euro-American strains** The confusion matrix of Euro-American strains that shows the correspondence of tensor sublineages and SpolDB4 families. Each row represents SpolDB4 families and each column represents tensor sublineages.

**The clustering plot of Euro-American strains** Clustering plot of Euro-American strains labeled by 35 tensor sublineages using Principal Component Analysis. The tensor sublineages are not as distinct as they were for the previously analyzed major lineages, reflecting the variability in the tensor cluster stability. It may also be due to the anticipated hierarchical structure in Euro-American strains.

**Spoligotype signatures of Euro-American tensor sublineages** Spoligotype signatures of tensor sublineages of Euro-American strains.

**MIRU signatures of Euro-American tensor sublineages** MIRU signatures of tensor sublineages of Euro-American strains.

Strains belonging to families H2, H37Rv, LAM12_MAD1, T1 (Tuscany variant), T1_RUS2, T4, T5_MAD2, and T5_RUS1 are clustered in tensor sublineages E9, E7, E8, E24, E11, E34, E34, and E17, respectively. In contrast, the T1 family, an ancestor strain family, is distributed across 25 tensor sublineages, with most T1 strains in E34. Sublineage stability is above 0.90 for 18 tensor sublineages. Spoligotype and MIRU signatures of sublineages suggest either subdivision or merging of SpolDB4 families. For instance, tensor sublineages E2, E6, and E32 include T1 strains only. In addition to the common spacer deletions of Euro-American strains, E2 strains lack spacers 15 through 26, E6 strains lack spacers 9 through 23, and E32 strains lack spacers 1 through 19, which are all variations in the spoligotype signatures of T1 strains. This sublineage classification further subdivides the poorly defined ancestor T1 family. Strains of LAM families, on the other hand, are grouped in 17 tensor sublineages. Prior studies have found that LAM Rio strains identified by SNPs are found in multiple SpolDB4 lineages

Although most stable clusters of the Euro-American strain dataset are found using best-match stability, the DD-weighted gap statistic plot has multiple peaks. DD-weighted gap statistic, detailed in the methods section, is a cluster validity measure which is also used for detecting hierarchical structure in the datasets. Multiple peaks in DD-weighted gap statistic plot suggest that the Euro-American dataset may have a multilevel hierarchical structure. Model order selection with randomized maps by Bertoni and Valentini can be used to detect the hierarchical structure in the Euro-American dataset

We used the unsupervised tensor clustering framework to cluster MTBC strains of major lineages into sublineages. Next, we turn our attention to supervised tensor learning methods on multiple-biomarker tensors to classify strains into major lineages.

Classification of MTBC strains into major lineages using two-way and multiway supervised learning

Multiple-biomarker tensors can be used in supervised classification models as well as in unsupervised models. We use multiway partial least squares (N-PLS) on multiple-biomarker tensors to predict major MTBC lineages

Multiway N-PLS and standard two-way PLS classification accuracy results

| Method | Average F-measure |
|---|---|
| N-PLS | 0.9961 ± 0.0009 |
| Standard PLS | 0.9955 ± 0.0017 |
| Conformal Bayes Net | 0.9897 |

Multiway N-PLS and standard two-way PLS classification accuracy results when 12 spoligotype deletions and MIRU patterns are used to classify MTBC strains into major lineages. The excellent results compare favorably to prior results based on a conformal Bayesian Network in

We compare N-PLS, standard PLS and Conformal Bayes Network (CBN) methods by F-measure of major lineage classification and see that they are accurate predictive models with no significant difference between the approaches. Table

Conclusions

This study investigates multiple-biomarker tensors and illustrates how they can be used for both unsupervised and supervised learning models. First, a novel clustering framework is used to analyze the sublineage structure of MTBC strains based on multiple biomarkers. We generated multiple-biomarker tensors to represent multiple biomarkers of the MTBC genome and used multiway models for dimensionality reduction. The multiway representation determines a transformation of the data that captures similarities and differences between strains based on two distinct biomarkers. We clustered MTBC strains based on the transformed data using improved k-means clustering and validated clustering results. We evaluated the sublineage structure of major lineages of MTBC and found similarities and clear distinctions in our subdivision of major lineages compared to the SpolDB4 classification. Simultaneous analysis of spoligotype and MIRU through multiple-biomarker tensors and clustering of MTBC strains leads to coherent sublineages within major lineages with clear and distinctive spoligotype and MIRU signatures. Second, we demonstrated how the multiple-biomarker tensor can be used to predict major lineages with extremely high accuracy competitive with other approaches. We show that 3-way PLS, 2-way PLS and CBN models are accurate major lineage predictors for MTBC strains.

The tensor clustering framework is flexible and can be applied to any multidimensional strain data. The design of the resulting tensor depends on the question to be answered. In this study, multiple-biomarker tensors are designed to find groups of MTBC strains. Thus, the application of the tensor clustering framework on multiple-biomarker tensors leads to sublineages of MTBC within major lineages. The multiple-biomarker tensor is further validated by the fact that it can be used to predict known major lineages with high accuracy using N-PLS. N-PLS with multiple-biomarker tensors can be used for semi-supervised learning as well. This can be useful for learning predictive models for sublineages when only part of the data is labeled with sublineages and the rest has no labels. This may result in more reliable and accurate classifiers of MTBC sublineages, and the resulting sublineage classifiers would be a significant enhancement to TB control, epidemiology and research. We leave this to future work.

The tensor clustering framework used in this study can be further extended to find subgroups of MTBC strains based on other biomarkers such as RFLP and SNPs. 15-loci MIRU and 24-loci MIRU patterns can also be used to represent MTBC genomes with multiple-biomarker tensors. Moreover, more than two biomarkers can be used in the MTBC genome representation. However, ambiguity in the tensor entries remains an open question that must be resolved in the tensor representation when more than two biomarkers are used. Addition of new biomarkers will increase the number of modes of the multiple-biomarker tensor, but the multiway analysis methods will remain the same.

Other questions of interest can be addressed by designing and analyzing host-pathogen tensors that relate the pathogen genotype to host (or equivalent) attributes. For example, since the MTBC sublineages are known to be highly geographically dependent, a tensor which combines the pathogen genotype with the country of birth of the host may reveal additional sublineage structure and transmission patterns. A tensor combining MTBC genotype and host disease phenotype such as site of infection and drug resistance could be used to analyze MTBC genotype/phenotype relations.

Methods

Tensor Clustering Framework (TCF)

Clustering MTBC strains based on multiple-biomarker tensors consists of a sequence of steps. First, we find an informative feature set of spoligotype deletions and generate a tensor. Second, we apply multiway models to the tensor and obtain a score matrix for the strain mode. Third, we use this score matrix to determine the similarity between strains, and cluster them using a stable version of k-means. In the final step, we evaluate the clustering results using cluster validity indices. This stepwise clustering framework is outlined in Figure

**Tensor clustering framework** Clustering framework of MTBC strains. High-dimensional genotype data are decomposed into two-dimensional arrays using multiway models, which are then used as input to the kmeans_mtimes_seeded algorithm. Clusterings are validated using best-match stability. In case of a tie, the DD-weighted gap statistic is used to pick the number of clusters.

Datasets

The dataset comprises 6848 distinct MTBC strains as determined by spoligotype and 12-loci MIRU, labeled with major lineages and SpolDB4 families. The strains are mainly from the CDC dataset - a database collected by the CDC from 2004-2008 labeled with the major lineages collected by the TB-Insight project (

Data statistics by major lineage

| Major lineage | # Strains | # Spoligotype deletions |
|---|---|---|
| M. africanum | 64 | 22 |
| M. bovis | 102 | 34 |
| East Asian (Beijing) | 571 | 5 |
| East-African Indian (CAS) | 508 | 18 |
| Indo-Oceanic | 1023 | 28 |
| Euro-American | 4580 | 109 |

Numbers of strains in each major lineage of CDC+MIRU-VNTR

Feature Selection and Tensor Generation

**Feature Selection** The spoligotype pattern captures the variability in the DR locus of the MTBC genome. A spoligotype consists of 43 spacers represented as a 43-bit binary sequence, and according to the hidden parent assumption, one or more contiguous spacers can be lost in a deletion event, but rarely gained
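To make the notion of a contiguous deletion concrete, the following minimal sketch lists the maximal runs of absent spacers in a 43-bit spoligotype. Treating each maximal run of zeros as one candidate deletion is an assumption made for illustration, and the function name is hypothetical; it is not the feature selection algorithm of the TCF itself.

```python
def contiguous_deletions(spoligotype):
    """Return maximal runs of absent spacers as 1-based (first, last) pairs."""
    if len(spoligotype) != 43 or set(spoligotype) - {"0", "1"}:
        raise ValueError("expected a 43-character string of 0s and 1s")
    runs, start = [], None
    for pos, bit in enumerate(spoligotype, start=1):
        if bit == "0" and start is None:
            start = pos  # a run of absent spacers begins here
        elif bit == "1" and start is not None:
            runs.append((start, pos - 1))  # the run ended at the previous spacer
            start = None
    if start is not None:
        runs.append((start, 43))  # the run extends to the last spacer
    return runs


# A pattern with spacers 1 to 34 absent, as described for East Asian strains.
east_asian_like = "0" * 34 + "1" * 9
deletions = contiguous_deletions(east_asian_like)
```

Each (first, last) pair can then serve as one candidate feature whose presence or absence is recorded per strain.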

Within the TCF, we built a feature selection algorithm to find spacer deletions that are informative. This ensures that the results are not biased by a priori selection of spoligotype deletions. Given a set of spoligotypes, we first calculate the frequency

**Tensor Generation** We generated multiple-biomarker tensors using two biomarkers, spoligotype deletions and MIRU patterns, as explained earlier. The spoligotype deletions found informative by the feature selection algorithm are used in the generation of multiple-biomarker tensors. The multiple-biomarker tensor is of the form

Multiway modeling

Multiway models are needed to fit a model to multiway arrays. We used PARAFAC and Tucker3 techniques to model the tensors. We determined the number of components for each model to ensure a bound on the explained variance of data.

**Multiway models** We used PARAFAC and Tucker3 models to explain the tensor with high accuracy. Multiway modeling of tensors was carried out using the

PARAFAC

PARAFAC is a generalization of singular value decomposition to multiway data

where $\mathbf{A} \in \mathbb{R}^{I \times R}$, $\mathbf{B} \in \mathbb{R}^{J \times R}$, $\mathbf{C} \in \mathbb{R}^{K \times R}$

- E

**PARAFAC model** PARAFAC model of a three-way array

The PARAFAC model is symmetric in all modes and the number of components in each mode is the same
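As an illustration of the trilinear structure, the sketch below rebuilds the fitted part of a three-way array from loading matrices, using the standard elementwise PARAFAC form in which entry (i, j, k) is the sum over components r of a_{ir} b_{jr} c_{kr}. It only shows reconstruction from given loadings, not how the loadings are estimated, and the function name and toy matrices are hypothetical.

```python
def parafac_reconstruct(A, B, C):
    """x_hat[i][j][k] = sum_r A[i][r] * B[j][r] * C[k][r] for loadings with R columns."""
    R = len(A[0])
    if len(B[0]) != R or len(C[0]) != R:
        raise ValueError("all loading matrices need the same number of components")
    return [[[sum(a[r] * b[r] * c[r] for r in range(R)) for c in C]
             for b in B]
            for a in A]


# Toy loadings (hypothetical): I = 2, J = 2, K = 1, R = 2 components.
A = [[1, 0], [0, 2]]
B = [[1, 1], [2, 0]]
C = [[1, 3]]
X_hat = parafac_reconstruct(A, B, C)
```

Because every mode shares the same index r, the three loading matrices must have the same number of columns, which is the symmetry noted above.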

Tucker3

Tucker3 is an extension of bilinear factor analysis to multiway datasets

where $\mathbf{A} \in \mathbb{R}^{I \times P}$, $\mathbf{B} \in \mathbb{R}^{J \times Q}$, $\mathbf{C} \in \mathbb{R}^{K \times R}$

- E

**Tucker3 model** Tucker3 model of a three-way array

Tucker3 is a more flexible model compared to PARAFAC. This flexibility is due to the core array
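The role of the core array can be sketched as follows, using the standard elementwise Tucker3 form in which entry (i, j, k) is the sum over (p, q, r) of a_{ip} b_{jq} c_{kr} g_{pqr}. The example only reconstructs an array from given loadings and a given core; the function names and toy values are hypothetical. It also shows the well-known special case in which a superdiagonal core of ones reduces Tucker3 to a PARAFAC-type model.

```python
def tucker3_reconstruct(A, B, C, G):
    """x_hat[i][j][k] = sum over p, q, r of A[i][p] * B[j][q] * C[k][r] * G[p][q][r]."""
    P, Q, R = len(G), len(G[0]), len(G[0][0])
    return [[[sum(a[p] * b[q] * c[r] * G[p][q][r]
                  for p in range(P) for q in range(Q) for r in range(R))
              for c in C]
             for b in B]
            for a in A]


def superdiagonal_core(R):
    """Core with ones where p == q == r and zeros elsewhere."""
    return [[[1.0 if p == q == r else 0.0 for r in range(R)]
             for q in range(R)]
            for p in range(R)]


# Different numbers of components per mode: P = 1, Q = 2, R = 1.
A = [[1], [2]]
B = [[1, 0], [0, 1]]
C = [[3]]
G = [[[1], [2]]]
X_hat = tucker3_reconstruct(A, B, C, G)

# With a superdiagonal core, only matching components interact (PARAFAC-like).
X_diag = tucker3_reconstruct([[1, 0], [0, 2]], [[1, 1], [2, 0]], [[1, 3]],
                             superdiagonal_core(2))
```

The off-diagonal entries of the core are what allow components of different modes to interact, and they are also what permits a different number of components in each mode.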

**Model validation** A multiway model is appropriate if adding more components to any mode does not improve the fit considerably. There is a tradeoff between the complexity of the model and the variance of the data explained by the model. Therefore, validation of a model also determines a suitable complexity for the model. We used the core consistency diagnostic (CORCONDIA) to determine the number of components of the PARAFAC model
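The explained-variance criterion can be sketched as the percentage of the sum of squares of the array that the fitted model reproduces. Whether the data are centered or scaled before this computation is not stated here, so the uncentered form below is an assumption, and the function name is hypothetical.

```python
def explained_variance(X, X_hat):
    """Percentage of the sum of squares of X captured by the model array X_hat."""
    total = sum(x * x for slab in X for row in slab for x in row)
    if total == 0:
        raise ValueError("X has zero sum of squares")
    residual = sum((x - h) ** 2
                   for slab, hslab in zip(X, X_hat)
                   for row, hrow in zip(slab, hslab)
                   for x, h in zip(row, hrow))
    return 100.0 * (1.0 - residual / total)


# Toy 1 x 1 x 2 array (hypothetical values): total SS = 25, residual SS = 1.
fit = explained_variance([[[3.0, 4.0]]], [[[3.0, 3.0]]])
```

In a component-selection loop, this value would be recomputed for each candidate number of components and the smallest model whose gain levels off would be retained.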

Clustering algorithm

We developed the kmeans_mtimes_seeded algorithm, a modified version of the k-means algorithm, to group MTBC strains based on the score matrices of the multiway models. K-means is a commonly used clustering algorithm with two weaknesses: 1) Initial centroids are chosen randomly, 2) The objective value of k-means, measured as within-cluster sum of squares, may converge to local minima, rather than finding the global minimum. We solve these problems with two improvements: 1) Initial centroids are chosen by careful seeding, using a heuristic called kmeans++, suggested by Arthur et al.
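A minimal sketch of the two ideas follows: k-means++ seeding, and keeping the best of several runs to reduce the risk of a poor local minimum. The restart strategy is an assumption inferred from the algorithm's name, since the description of the second improvement is cut off in the text; this is not the authors' implementation, and all function and parameter names below are hypothetical.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def seed_kmeanspp(points, k, rng):
    """Draw each new centroid with probability proportional to its squared
    distance from the nearest centroid chosen so far."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        if sum(weights) == 0:  # fewer distinct points than k
            centroids.append(rng.choice(points))
        else:
            centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def lloyd(points, centroids, max_iter=100):
    """Standard k-means iterations; returns (labels, within-cluster sum of squares)."""
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    wcss = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return labels, wcss


def kmeans_restarts(points, k, m=10, seed=0):
    """Run seeded k-means m times and keep the run with the lowest objective."""
    rng = random.Random(seed)
    best = None
    for _ in range(m):
        labels, wcss = lloyd(points, seed_kmeanspp(points, k, rng))
        if best is None or wcss < best[1]:
            best = (labels, wcss)
    return best


# Toy score matrix (hypothetical): two well-separated groups of strains.
scores = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, wcss = kmeans_restarts(scores, k=2)
```

In the framework described above, the rows of the strain-mode score matrix from PARAFAC or Tucker3 would take the place of the toy points.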

Cluster Validation

Clustering results for the MTBC strains are evaluated to determine the best choice for the number of clusters and compare the chosen clustering with existing sublineages using cluster validity indices. We used the best-match stability to pick the most stable clusterings. In case of a tie in average best-match stability, we used the DD-weighted gap statistic for cluster validation

**Best-Match Stability** The stability of a clustering is measured by the distribution of pairwise similarities between clusterings of subsamples of the data. The idea behind stability is that if we repeatedly sample data points and apply the same clustering algorithm to the subsample, then an effective clustering algorithm applied to well separated data should produce clusterings that do not vary much for different subsamples

where

and _{i}

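**DD-Weighted Gap Statistic (PC)** Tibshirani et al. proposed a cluster validity index called the gap statistic, which is based on the within-cluster sum of squares (WCSS) of a clustering. Let $X \in \mathbb{R}^{n \times p}$ be the dataset with entries $x_{ij}$, partitioned into clusters $C_1, \ldots, C_k$, where $C_i$ denotes the indices of the samples in cluster $i$ and $n_i = |C_i|$. The sum of pairwise distances within cluster $i$ is

$$D_i = \sum_{j, j' \in C_i} d_{jj'}$$

where $d_{jj'}$ is the squared Euclidean distance between samples $j$ and $j'$,

and the within-cluster sum of squares for a clustering is defined as:

$$W_k = \sum_{i=1}^{k} \frac{1}{2 n_i} D_i$$

The idea of the gap statistic method is to compare $\log(W_k)$ with its expectation under a null reference distribution:

$$\mathrm{Gap}_n(k) = E_n^{*}\{\log(W_k)\} - \log(W_k)$$

Where $E_n^{*}$ denotes the expectation under a sample of size $n$ from the reference distribution.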

The reference distribution can be one of two choices: uniform distribution (Gap/Unif), or a uniform distribution over a box aligned with the principal components of the dataset (Gap/PC). Experiments by Tibshirani et al. show that Gap/PC finds the number of clusters more accurately, therefore we used Gap/PC in this study
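The quantities behind the gap statistic can be sketched as follows, using the formulation of Tibshirani et al.: the pooled within-cluster sum of squares is the sum over clusters of D_r / (2 n_r), where D_r sums squared Euclidean distances over all ordered pairs in cluster r, and the gap is the mean of log W_k over reference datasets minus the observed log W_k. Generating the reference datasets (Gap/Unif or Gap/PC) is omitted, the DD-weighted variant is not shown, and the function names are hypothetical.

```python
import math


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def pooled_wcss(points, labels):
    """W_k = sum over clusters of D_r / (2 * n_r)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for members in clusters.values():
        d_r = sum(sq_dist(p, q) for p in members for q in members)  # ordered pairs
        total += d_r / (2 * len(members))
    return total


def gap(reference_wk, observed_wk):
    """Mean of log W_k over reference datasets minus the observed log W_k."""
    expected = sum(math.log(w) for w in reference_wk) / len(reference_wk)
    return expected - math.log(observed_wk)


# Toy data (hypothetical): two tight groups, clustered correctly with k = 2.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
w2 = pooled_wcss(points, [0, 0, 0, 1, 1, 1])
```

Dividing D_r by 2 n_r makes W_k equal to the sum of squared distances from each point to its cluster mean, which is the k-means objective.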

The gap statistic is a powerful method for estimating the number of clusters in a dataset. However, a study by Dudoit et al. showed that the gap statistic does not estimate the correct number of clusters for every case _{k}

and the weighted within-cluster sum of squares

Based on

Let

**F-measure** The F-measure is a weighted combination of precision and recall of a clustering. Since the F-measure combines precision and recall of clustering results, it has proven to be a successful metric. We use the F-measure to evaluate how similar the tensor sublineages are to the SpolDB4 families. According to the contingency table in Table

Contingency table

| | Same cluster | Different clusters |
|---|---|---|
| Same class | a | b |
| Different classes | c | d |

Contingency table of a clustering, where rows represent true classes and columns represent found clusters. Given
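A pair-counting F-measure based on such a contingency table can be sketched as follows. The sketch assumes the common definitions precision = a / (a + c) and recall = a / (a + b), combined with a weight beta that defaults to 1; the exact weighting used in this study is not recoverable from the text, so these choices are assumptions and the function names are hypothetical.

```python
from itertools import combinations


def pair_counts(classes, clusters):
    """Count sample pairs by agreement of class labels and cluster labels."""
    a = b = c = d = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        if same_class and same_cluster:
            a += 1
        elif same_class:
            b += 1
        elif same_cluster:
            c += 1
        else:
            d += 1
    return a, b, c, d


def f_measure(a, b, c, beta=1.0):
    """Weighted harmonic mean of pairwise precision and recall."""
    if a == 0:
        return 0.0
    precision = a / (a + c)
    recall = a / (a + b)
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


# Toy labels (hypothetical): five strains, two classes, two clusters.
classes = ["x", "x", "x", "y", "y"]
clusters = [1, 1, 2, 2, 2]
a, b, c, d = pair_counts(classes, clusters)
score = f_measure(a, b, c)
```

Here the classes would be SpolDB4 families and the clusters would be tensor sublineages; the count d does not enter the F-measure.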

Multiway Partial Least Squares Regression (N-PLS)

N-PLS, created by Bro et al. by generalizing PLS to multiway data, is a multiway regression method in which at least one of the independent and dependent blocks has at least three modes

- X

**X** = **t**(**w ^{K}** ⊗ **w ^{J}**)' + **E** (1)

and the two-way array **Y** is decomposed as:

**Y** = **uq'** + **F** (2)

where **t** ∈ ℝ^{I}^{×1} and **u** ∈ ℝ^{I}^{×1} are score vectors of **X** and **Y**. **w ^{J}** ∈ ℝ

Notice that the two-way array **Y** is decomposed into one score and one loading vector, whereas the matricized three-way array **X** is decomposed into one score and two loading vectors, **w ^{J}** and **w ^{K}**

The aim of N-PLS is to maximize the covariance of

- X

**U** = **TB** + **E _{u}** (3)

This requires finding loading vectors **w ^{J}** and

where **Z**∈ℝ^{J}^{×}^{K}

The problem of finding **w ^{J}** and

Given

The N-PLS model of a multiway array is a multilinear model, like PARAFAC, which means that it has no rotational freedom. Therefore, the N-PLS model of a multiway array is unique. In this study, we used a 3-way array as the X-block and a 2-way array as the Y-block, therefore we are particularly working on the Tri-PLS2 version of N-PLS, which is summarized in Algorithm 4. The term

- X
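The extraction of one N-PLS component can be sketched for the simpler case of a single response vector (tri-PLS1), following the standard description by Bro: the matrix Z with entries z_{jk} = sum over i of y_i x_{ijk} is formed, the loading vectors are taken as its leading left and right singular vectors, and the scores are obtained by projecting each strain's slice onto them. The study itself uses Tri-PLS2, where a score vector of **Y** plays the role of y, and deflation and regression steps follow; those are omitted. The power iteration and all names below are illustrative assumptions.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def tri_pls1_component(X, y, n_iter=200):
    """One component for a three-way X (I x J x K) and a response vector y (length I)."""
    I, J, K = len(X), len(X[0]), len(X[0][0])
    # Z[j][k] = sum_i y[i] * X[i][j][k]
    Z = [[sum(y[i] * X[i][j][k] for i in range(I)) for k in range(K)]
         for j in range(J)]
    # Power iteration for the leading singular vector pair of Z.
    # It fails if the all-ones start is orthogonal to that pair.
    w_k = normalize([1.0] * K)
    for _ in range(n_iter):
        w_j = normalize([sum(Z[j][k] * w_k[k] for k in range(K)) for j in range(J)])
        w_k = normalize([sum(Z[j][k] * w_j[j] for j in range(J)) for k in range(K)])
    # Scores: t[i] = sum over j, k of X[i][j][k] * w_j[j] * w_k[k]
    t = [sum(X[i][j][k] * w_j[j] * w_k[k] for j in range(J) for k in range(K))
         for i in range(I)]
    return w_j, w_k, t


# Toy rank-one array (hypothetical): X[i][j][k] = t_true[i] * a[j] * b[k].
t_true, a, b = [1.0, 2.0, -1.0], [3.0, 4.0], [1.0, 2.0, 2.0]
X = [[[ti * aj * bk for bk in b] for aj in a] for ti in t_true]
w_j, w_k, t = tri_pls1_component(X, t_true)
```

For this rank-one toy array the recovered loadings are the normalized vectors a and b, and the scores are proportional to the response, which is the covariance-maximizing direction.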

Authors’ contributions

CO, KB and BY conceived the study. CO carried out the experiments. CO, KP and BY analyzed the results. AS and SV provided and analyzed some of the data. CO, AS, SV and KB drafted the manuscript.

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

This work was made possible by Dr. Lauren Cowan and Dr. Jeff Driscoll of the Centers for Disease Control and Prevention. This work was supported by NIH R01LM009731.

This article has been published as part of