Center for Data Analytics and Biomedical Informatics, Temple University, Philadelphia, USA

Mathematics Department, Faculty of Science, Ain Shams University, Cairo, Egypt

Abstract

Background

Early classification of time series is beneficial for biomedical informatics problems including, but not limited to, disease change detection. Early classification can be of tremendous help by identifying the onset of a disease before it fully takes hold. In addition, extracting patterns from the original time series helps domain experts gain insight into the classification results. This problem has recently been studied using time series segments called shapelets.

Results

The proposed early classification method for multivariate time series has been evaluated on eight gene expression datasets from viral infection and drug response studies in humans. In our experiments, the MSD method outperformed the baseline methods, achieving highly accurate classification by using as little as 40%-64% of the time series. The obtained results provide evidence that using conventional classification methods on short time series is not as accurate as using the proposed methods specialized for early classification.

Conclusion

For the early classification task, we proposed a method called Multivariate Shapelets Detection (MSD), which extracts patterns from all dimensions of the time series. We showed that the MSD method can classify the time series early by using as little as 40%-64% of the time series’ length.

Background

In medical informatics, the patient’s clinical data records, such as heart rate, are collected over time and therefore represent a time series. If the data is collected from two groups of patients (for example, symptomatic and asymptomatic with respect to heart failure), the task of multivariate time series (MTS) classification is to learn temporal patterns to determine whether the patient belongs to the group of symptomatic patients.

Time series have been extensively analyzed in various fields, such as statistics, signal processing, and control theory. The focus of research in these fields is on gaining a better understanding of the data-generating mechanism, the prediction of future values, or the optimal control of a system. From a statistical viewpoint, time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics. As part of time series analysis, time series forecasting aims to use a model, e.g., an AutoRegressive Moving Average (ARMA) model, to predict future values based on previously observed values.

Although all of the aforementioned methods could be helpful in our study, and the experience of researchers and practitioners from other fields is extremely valuable, the focus of our research is to classify a new time series as early as possible by extracting patterns from past observations, rather than predicting future values or analyzing the pattern of a single time series.

In the data mining community, the time series classification problem has been studied in some detail as well. The predictive patterns framework has been introduced to directly mine a compact set of highly predictive patterns.

A method that extracts all meta-features from a multivariate time series was proposed by Kadous et al.

In the context of classification of unknown time series (time series with an unknown label), models utilize the whole time series with the unknown label to predict it based on the information learned from training data. In an early classification context, the objective is to provide patient-specific classification of unknown time series as early as possible. Therefore, instead of utilizing the whole time series, our MSD method looks into a portion (current stream) of the unknown time series and determines whether it is able to predict the label of the whole time series without looking at the rest of the time series. If MSD is able to predict at the time point which is at the end of the current stream, the label is predicted. Otherwise, MSD requires more data for the unknown time series and looks at a larger segment, and does so until it is able to predict the label of the time series.
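The stream-and-retry procedure described above can be sketched as follows. This is a minimal sketch, not the authors' implementation; `shapelets` and `dist` are placeholder components supplied by the caller.

```python
def early_classify(stream, shapelets, dist):
    """Attempt to classify a growing time series prefix.

    `stream` yields one new observation per time step; `shapelets` is a
    list of (pattern, threshold, label) tuples.  Returns (label, t) at the
    first time point t where some shapelet matches the current prefix, or
    (None, t) if the series ends without a match.
    """
    prefix = []
    for t, value in enumerate(stream, start=1):
        prefix.append(value)
        for pattern, threshold, label in shapelets:
            # A shapelet can only match once the prefix is long enough.
            if len(prefix) >= len(pattern) and dist(pattern, prefix) <= threshold:
                return label, t
    return None, len(prefix)
```

The loop only ever looks at the observed prefix, so the classification time point varies per example, which is exactly the patient-specific behavior described above.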

For early classification, a method that extracts local shapelets from univariate time series was recently proposed; we build on that line of work here.

In this study, we generalize the definition of local shapelets to a multivariate context and accordingly propose a method for early classification of multivariate time series. The proposed method is called Multivariate Shapelets Detection (MSD).

In particular, we propose the following extensions to the existing univariate shapelet method:

• Extending the concept of univariate shapelets to multivariate shapelets, which are multidimensional subsequences with a distance threshold along each dimension.

• Proposing the use of an information gain-based distance threshold.

• Proposing the use of a weighted information gain-based utility score for a shapelet. A theorem is provided to show that the weighted information gain incorporates earliness and assigns a higher utility score to the shapelet that appears earlier, given the same accuracy performance.

The mathematical definition of the problem is presented in the Definitions section. The method for multivariate time series classification is described in the Methods section. Datasets are described in the Dataset and data processing section. In the Results and discussion section, the experimental results are presented. Finally, future work and concluding remarks are discussed in the Conclusion section.

Definitions

A time series T = {t_{1}, t_{2}, …, t_{L}} is an ordered sequence of L values. A dataset D = {(T_{i}, c_{i}) : i = 1, …, M} is a collection of M labeled time series, where T_{i} is time series number i and c_{i} = class(T_{i}) is its class. Given a time series T = {t_{1}, t_{2}, …, t_{L}}, a subsequence s = {t_{i}, t_{i + 1}, …, t_{i + l − 1}}, where 1 ≤ i ≤ L − l + 1, is a run of l contiguous values of T.

For a given time series T, the set of all subsequences of length l is S_{l} = {s_{1}, s_{2}, …, s_{L − l + 1}}. The distance dist(s, T) between a subsequence s of length l and a time series T (Equation 1) is the minimum Euclidean distance between s and any subsequence of T of length l.

Illustration of computing the distance between a subsequence and a time series

**Illustration of computing the distance between a subsequence and a time series.** To compute the distance between a subsequence s and a time series T, the subsequence is slid along the time series; the Euclidean distance between s and every subsequence of T of the same length is computed, and the minimum of these distances is taken as the distance (Equation 1).
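The sliding-window distance just described can be written directly. Equation 1 may normalize by the subsequence length, a detail not recoverable here, so this sketch uses the plain Euclidean form.

```python
import math

def subsequence_distance(s, T):
    """Distance between subsequence s and time series T (cf. Equation 1):
    slide s along T and return the minimum Euclidean distance to any
    window of T of the same length."""
    l = len(s)
    if l > len(T):
        raise ValueError("subsequence longer than time series")
    return min(
        math.sqrt(sum((s[i] - T[start + i]) ** 2 for i in range(l)))
        for start in range(len(T) - l + 1)
    )
```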

A shapelet is defined as a tuple f = (s, l, δ, c_{f}), where s is a subsequence of length l and δ is a distance threshold. The class c_{f} of the shapelet is called the target class; the other classes are called the non-target classes. We call a time series T_{i} a target time series if the class of the time series is c_{f}. The distance threshold δ is computed as follows:

• The distance d_{i} between f and each time series T_{i} in the dataset is computed using Equation 1. The distance d_{i} is represented as a point on the order line, as shown in the figure below. If class(T_{i}) = c_{f}, then d_{i} is represented as a blue point. If class(T_{i}) ≠ c_{f}, then d_{i} is represented as a red square.

• The distance threshold δ is chosen so that it best separates the target (blue) points from the non-target (red) points on the order line.

Illustration of the distance threshold

**Illustration of the distance threshold.** The distance threshold is chosen such that it divides the dataset into two separate groups (red and blue groups). It is clear that there is no unique best threshold. Any threshold between 10 and 14 or between 16 and 21 has only either one false negative or one false positive. However, there is no perfect threshold that separates the datasets into two pure groups.

Put another way, the distance threshold δ defines a neighborhood of the shapelet: if the distance between the shapelet and a time series T_{i} is at most δ, the time series is expected to belong to the target class.

The distance between a shapelet f and a time series T is defined as dist(f, T) := dist(s, T).

An N-dimensional multivariate time series is denoted **T** = [t^{1}, t^{2}, …, t^{N}], where t^{j} is the j^{th} dimension of **T** and t^{j}[k] is the value of the j^{th} dimension of **T** at time stamp k.

An N-dimensional multivariate shapelet is a tuple **f** = (**s**, l, Δ, c_{f}). The vector **s** = [s^{1}, s^{2}, …, s^{N}], where s^{j} is the subsequence forming the j^{th} dimension of the shapelet. The figure below shows an example of a 3-dimensional shapelet.

Illustration of a 3-dimensional shapelet

**Illustration of a 3-dimensional shapelet.** This shows an example of a 3-dimensional time series (red, green and blue lines) of length 15. An example of an extracted 3-dimensional shapelet of length 4 is illustrated in the right part of the figure. The shapelet is extracted from the time series from position 6 to position 9.

The distance between an N-dimensional shapelet **f** and an N-dimensional time series **T** is a vector of N distances,

dist(**s**, **T**) = [dist(s^{1}, t^{1}), dist(s^{2}, t^{2}), …, dist(s^{N}, t^{N})],

where each dist(s^{j}, t^{j}) is defined as in Equation 1. Simply, the distance between two multivariate time series is a vector of distances in which each component is the distance between the corresponding dimensions of the two multivariate time series. The distance between a shapelet **f** and a time series **T** is defined as dist(**f**, **T**) := dist(**s**, **T**).

The distance threshold is likewise a vector Δ = [δ^{1}, δ^{2}, …, δ^{N}], where each δ^{j} is computed per dimension, as explained in the Methods section.

Methods

In this section we first describe a recently proposed method for early classification of univariate time series

Modifications of univariate shapelet for early time series classification

An

Algorithm 1: UnivariateShapeletsDetection

**Input**: A training dataset D of M univariate time series; minimum and maximum shapelet lengths minL and maxL

**Output**: A list of univariate shapelets F

1. **for** each time series T **in** D **do** {

2. **for** l = minL **to** maxL **do** {for each shapelet length}

3. **for** k = 1 **to** L − l + 1 **do** {for each starting position}

4. s_{lk} ← subsequence of T of length l starting at position k

5. δ_{lk} ← ComputeDistanceThreshold(s_{lk}, D)

6. u_{lk} ← ComputeUtilityScore(s_{lk}, D)

7. add the shapelet (s_{lk}, l, δ_{lk}, class(T)) with score u_{lk} to F

8. rank the shapelets in F by utility score and prune to the top-ranked subset

9. **return** F

The method iterates over the time series in the dataset D (line 1). For each candidate shapelet s_{lk} (lines 2 and 3), the method computes the distance threshold (line 5) between s_{lk} and all time series in D using Chebyshev’s inequality. Then, it assigns s_{lk} a utility score (line 6) using a weighted F_{1} score measure. In line 8, the method ranks all extracted shapelets by their utility scores and selects a subset of the highest-ranked shapelets as the pruned set of shapelets, which can exhaustively classify the training time series.
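Algorithm 1 can be sketched in Python as follows; `compute_threshold` and `utility_score` are caller-supplied stand-ins for the Chebyshev-based threshold and weighted-score routines described next.

```python
def detect_shapelets(dataset, labels, min_len, max_len,
                     compute_threshold, utility_score):
    """Enumerate every subsequence of every training series as a candidate
    shapelet, score it, and return the candidates ranked by utility
    (mirrors Algorithm 1; pruning keeps a top subset of this ranking)."""
    candidates = []
    for series, label in zip(dataset, labels):              # line 1
        for l in range(min_len, max_len + 1):               # line 2: each length
            for k in range(len(series) - l + 1):            # line 3: each start
                s = series[k:k + l]                         # line 4: candidate
                delta = compute_threshold(s, dataset, labels, label)      # line 5
                score = utility_score(s, delta, dataset, labels, label)   # line 6
                candidates.append((score, s, delta, label)) # line 7
    candidates.sort(key=lambda c: c[0], reverse=True)       # line 8: rank
    return candidates                                       # line 9
```

The nested loops make the cubic growth in series length explicit, which is consistent with the run-time discussion later in the paper.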

The functions that compute the distance threshold and utility score are explained in the following sections. We describe how to prune the shapelets and use them for early classification in the Shapelet Pruning and Classification sections, respectively.

Distance threshold method

The Chebyshev’s inequality method is proposed for computing the distance threshold. Chebyshev’s inequality states that, for a distribution with mean μ and standard deviation σ, no more than 1/k^{2} of the distribution’s values are more than k standard deviations away from the mean. This bound is used to set the shapelet’s distance threshold δ.
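One plausible realization (an assumption on our part; the exact rule is given in the cited method) takes the distances from the candidate shapelet to the non-target time series and sets δ = μ − kσ, so that by Chebyshev’s inequality at most 1/k² of the non-target series can fall within the threshold.

```python
import statistics

def chebyshev_threshold(nontarget_distances, k=3.0):
    """Distance threshold via Chebyshev's inequality (illustrative sketch).

    For any distribution with mean mu and standard deviation sigma, at most
    1/k**2 of its mass lies more than k*sigma from mu.  Setting
    delta = mu - k*sigma therefore bounds the fraction of non-target series
    that can lie within delta of the shapelet by 1/k**2.
    """
    mu = statistics.mean(nontarget_distances)
    sigma = statistics.pstdev(nontarget_distances)
    return max(0.0, mu - k * sigma)
```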

**Supplementary document.** The supplementary document (ECMTS-Supp.pdf) contains additional analysis of the obtained results. These details are omitted for lack of space but are consistent with the findings reported here.


The basic idea is to find the shapelet’s distance threshold that maximizes the information gain, dividing the dataset into two groups: target and non-target time series.

First, the entropy of the dataset is computed as

E(D) = − Σ_{c} (M_{c} / M) ln(M_{c} / M),

where M_{c} is the number of time series of class c and M is the total number of time series. Each candidate distance threshold splits the dataset into two datasets, D_{L} and D_{R}, as illustrated in the figure below. The dataset D_{L} contains all time series such that the distance between the shapelet and the time series is less than or equal to the candidate threshold; the dataset D_{R} contains the rest of the time series. Then the entropies E_{L} and E_{R} of the datasets D_{L} and D_{R} are computed, respectively. By comparing the entropy before and after the split, we obtain a measure of information gain, computed as

IG = E(D) − (M_{L} / M) E(D_{L}) − (M_{R} / M) E(D_{R}),

where M_{L} and M_{R} are the numbers of time series in D_{L} and D_{R}. We choose the distance threshold that maximizes the information gain for the shapelet. The algorithm is described in detail in the Additional file.

Candidate distance threshold

**Candidate distance threshold.** The distance threshold δ_{1} splits the dataset into two datasets so that it has 4 true positives, 0 false positives, 4 true negatives, and 1 false negative; its information gain is 0.4090. The distance threshold δ_{2} divides the dataset into two datasets so that it has 4 true positives, 1 false positive, 3 true negatives, and 1 false negative; its information gain is 0.1591. Hence, δ_{1} has better information gain than δ_{2}.

The figure above shows two candidate distance thresholds, δ_{1} and δ_{2}. The threshold δ_{1} splits the dataset into two datasets so that it has 4 true positives, 0 false positives, 4 true negatives, and 1 false negative; its information gain is 0.4090. The threshold δ_{2} divides the dataset into two datasets so that it has 4 true positives, 1 false positive, 3 true negatives, and 1 false negative; its information gain is 0.1591. Therefore, the threshold δ_{1} is chosen because it has the maximum information gain.
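The worked example can be verified numerically; the reported values are reproduced when the entropies use the natural logarithm, which this sketch therefore assumes.

```python
import math

def entropy(counts):
    """Entropy of a list of class counts, using the natural logarithm."""
    m = sum(counts)
    return -sum((c / m) * math.log(c / m) for c in counts if c > 0)

def info_gain(total, left, right):
    """IG = E(D) - (M_L/M) E(D_L) - (M_R/M) E(D_R); each argument is a
    [target_count, nontarget_count] pair."""
    m, ml, mr = sum(total), sum(left), sum(right)
    return entropy(total) - (ml / m) * entropy(left) - (mr / m) * entropy(right)

# Threshold 1: D_L has 4 target / 0 non-target, D_R has 1 target / 4 non-target.
ig1 = info_gain(total=[5, 4], left=[4, 0], right=[1, 4])  # ig1 ≈ 0.4090
# Threshold 2: D_L has 4 target / 1 non-target, D_R has 1 target / 3 non-target.
ig2 = info_gain(total=[5, 4], left=[4, 1], right=[1, 3])  # ig2 ≈ 0.1591
```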

Utility score method

The set of shapelets extracted from the dataset might be exceedingly large. Therefore, it is important to rank the shapelets in order to select a small subset of the shapelets for classification. For this reason, each shapelet has to be assigned a score that takes into consideration earliness as well as discrimination among classes.

The weighted F_{1} score method is proposed to rank shapelets; earliness is incorporated into the standard F_{1} measure through the weighting described below.

The utility score of a shapelet should incorporate both the earliness and the distinctiveness properties. First, we define the earliness E(f, T) between a shapelet f = (s, l, δ, c_{f}) and a time series T as the earliest time point at which the shapelet matches T. The weighted information gain is then computed as follows:

1. Compute the distance between the shapelet f = (s, l, δ, c_{f}) and every time series T_{i} in the dataset.

2. Split the dataset into D_{L} and D_{R} such that D_{L} contains all time series where dist(f, T_{i}) ≤ δ and D_{R} contains all time series where dist(f, T_{i}) > δ.

3. For each time series T_{i} in D_{L}, if class(T_{i}) = c_{f}, then assign T_{i} a weight derived from the earliness E(f, T_{i}), with earlier matches receiving larger weights.

4. Compute M_{L} as the weighted count of the time series in the dataset D_{L}; M_{R} is the size of the dataset D_{R}.

5. Compute the weighted information gain using Equation 4.
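Steps 1-5 can be sketched as follows. The exact weighting function is given by Equation 4 in the original method; here the earliness-derived weights are supplied by the caller, with larger weights assumed for earlier matches (our stand-in assumption).

```python
import math

def weighted_info_gain(distances, labels, weights, delta, target):
    """Weighted information gain of a shapelet split (steps 1-5).

    `distances[i]` is dist(f, T_i); target series falling in D_L contribute
    their earliness-derived weight `weights[i]` instead of a unit count.
    """
    def entropy(t, nt):
        total = t + nt
        e = 0.0
        for part in (t, nt):
            if part > 0:
                e -= (part / total) * math.log(part / total)
        return e

    # Weighted count of targets in D_L; plain counts everywhere else.
    lt = sum(w for d, y, w in zip(distances, labels, weights)
             if d <= delta and y == target)
    lnt = sum(1 for d, y in zip(distances, labels)
              if d <= delta and y != target)
    rt = sum(1 for d, y in zip(distances, labels) if d > delta and y == target)
    rnt = sum(1 for d, y in zip(distances, labels) if d > delta and y != target)

    m, ml, mr = lt + lnt + rt + rnt, lt + lnt, rt + rnt
    return (entropy(lt + rt, lnt + rnt)
            - (ml / m) * entropy(lt, lnt)
            - (mr / m) * entropy(rt, rnt))
```

With unit weights this reduces to the ordinary information gain; increasing the weights of early-matching targets increases the score.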

The following theorem proves that the weighted information gain incorporates the earliness and assigns high utility score to the shapelet that has better earliness given the same accuracy performance.

Theorem: If f_{1} and f_{2} are two shapelets that have the same distance threshold (same splitting point) and the same class but different earliness, with f_{1} matching earlier than f_{2}, then the weighted information gain of f_{1} is greater than that of f_{2}. Proof sketch: Suppose the number of target time series in D_{L} is m_{T} and the number of non-target time series in D_{L} is m_{NT}. Since the two shapelets induce the same split, they differ only in the earliness-derived weights w_{1} and w_{2} assigned to the target time series in D_{L}; because f_{1} matches earlier, w_{1} > w_{2}. The weighted counts of the time series in D_{L} for f_{1} and f_{2} are w_{1}m_{T} + m_{NT} and w_{2}m_{T} + m_{NT}, respectively, so the weighted class distribution of D_{L} is purer with respect to the target class for f_{1}, and its weighted entropy satisfies E_{L1} ≤ E_{L2}. Hence the weighted information gain of f_{1} is greater than the weighted information gain of f_{2}.

Therefore, the weighted information gain gives high scores to the shapelets that come early in the time series.

Shapelet pruning

To select a subset of the shapelets for classification, the shapelets are sorted in descending order using their utility scores. In this manuscript, two methods have been used to select a subset of the shapelets.

The first method iterates over the shapelets starting from the highest ranked shapelet. We select the shapelet and remove all training examples that are covered by that shapelet. A time series is covered by a shapelet if the distance between them is within the shapelet’s threshold and the time series belongs to the shapelet’s target class c_{f}. We then use the next highest ranked shapelet to see if it covers any of the remaining training time series. If it covers some of them, we select the shapelet and remove all time series that are covered; otherwise, we discard it and proceed to the next one. This process continues until all training time series are covered.

The second method simply involves keeping a prespecified number of the top-ranked shapelets.

Classification

If the length of the shortest shapelet extracted by Algorithm 1 is l_{min}, then no class label can be assigned before the first l_{min} time points of a query time series have been observed. From that point on, the current stream of the query is compared against the pruned set of shapelets, and as soon as the distance to some shapelet falls within its threshold, the target class of that shapelet is assigned to the query time series.

Multivariate shapelets detection for ECMTS

In a dataset of N-dimensional time series, a multivariate shapelet is a tuple **f** = (**s**, l, Δ, c_{f}). The method assumes that all subsequences s^{j} are extracted from the same starting position. Hence, we slide a window of length l simultaneously across all dimensions, extracting a subsequence s^{j} of length l from the j^{th} dimension to construct **s** = [s^{1}, s^{2}, …, s^{N}]. An example of a 3-dimensional shapelet is shown in the figure above.

We follow the same procedure as in the univariate case. Namely, for each multivariate shapelet **f**, we compute the minimum distance between **f** and every time series **T** in the dataset. The distance between **f** and **T** is a vector of N distances, as defined above.

Equation 5 requires every component of the vector **d**_{1} to be less than the corresponding component of **d**_{2}. Under this definition, all N distances between a shapelet and a time series would have to fall below the corresponding thresholds, which is overly strict. We therefore relax the condition (Equation 6): the shapelet matches the time series when the distance is below the threshold in at least a fraction p of the N dimensions, where 0 < p ≤ 1.
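Under our reading of the relaxed rule, a multivariate shapelet matches a time series when the per-dimension distance is below the threshold in at least a fraction p of the N dimensions; the helper below makes that concrete (the exact form of Equations 5-6 may differ).

```python
import math

def dimension_distance(s_j, t_j):
    """Minimum Euclidean distance between one shapelet dimension and the
    corresponding time series dimension (cf. Equation 1)."""
    l = len(s_j)
    return min(
        math.sqrt(sum((s_j[i] - t_j[start + i]) ** 2 for i in range(l)))
        for start in range(len(t_j) - l + 1)
    )

def multivariate_match(shapelet, thresholds, series, p=1.0):
    """True if at least ceil(p * N) of the N per-dimension distances fall
    below their thresholds (relaxed multivariate matching rule)."""
    n = len(shapelet)
    hits = sum(
        1 for s_j, delta_j, t_j in zip(shapelet, thresholds, series)
        if dimension_distance(s_j, t_j) <= delta_j
    )
    return hits >= math.ceil(p * n)
```

Setting p = 1 recovers the strict all-dimensions rule; smaller values of p tolerate noisy dimensions.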

The algorithm for extracting the multivariate shapelets from a dataset is similar to Algorithm 1. The algorithm iterates over each time series and extracts all multivariate shapelets. For each candidate multivariate shapelet, it computes the distances to every time series. Note that each distance is a vector of length N.

Distance threshold method

The multivariate information gain method (detailed in the Additional file) computes, for each multivariate shapelet **f**, a threshold vector Δ = [δ^{1}, δ^{2}, …, δ^{N}] that has maximal information gain.

Utility score method

The steps to adapt the utility scores defined on univariate time series are similar to the steps we have followed to adapt the distance threshold method.

After computing the score for each shapelet, the method sorts them in descending order according to their utility scores and then selects a subset of shapelets as explained in the Shapelet Pruning section. The classification process is similar to the process described in the Classification section, taking Equation 6 into consideration when computing the distance between the shapelet and the current stream of the query time series.

Dataset and data processing

Viral challenge datasets

We used two datasets for blood gene expression from human viral studies with influenza A (H3N2) and live rhinovirus (HRV) to distinguish individuals with symptomatic acute respiratory infections from uninfected individuals.

H3N2 dataset: A healthy-volunteer intranasal challenge with H3N2 was performed in 17 subjects. Of those subjects, 9 became symptomatic and 8 remained asymptomatic. Blood samples were taken from each subject at 16 time points. Some subjects missed certain measurements at time points 1, 5, 6, and/or 7; hence, the gene expression values were measured on average 14-16 times for each subject. 30 genes were identified, in ranked order, as contributing to respiratory infection.

HRV dataset: A healthy-volunteer intranasal challenge with HRV was performed in 20 subjects. Of those subjects, 10 became symptomatic and 10 remained asymptomatic. Blood samples were taken from each subject at 14 time points. We ignored time stamps 8-11 because the majority of the subjects missed the measurements at those time points. Thus, the gene expression values were measured on average 6-10 times for each subject. 30 genes were identified, in ranked order, as contributing to respiratory infection.

Drug response dataset

Another clinical dataset was generated for studying the changes in cellular functions of multiple sclerosis (MS) patients in response to drug therapy with IFN-β. Some subjects missed measurements at certain time points; thus, the gene expression values were measured on average 5-7 times for each subject. The list of the genes used in our experiments is provided in the Additional file.

Identification of triplets of genes for a Bayes classifier of time series expression data of multiple sclerosis patients’ response to the drug has been performed previously.

A discriminative hidden Markov model has been developed and applied to the MS dataset to reveal the genes that are associated with good or bad responders to the therapy.

A mixture of hidden Markov models has been developed to identify the genes that are associated with the patient response to the treatment.

Environment setup and evaluation measure

In all experiments we set

In the results, we report the median of the accuracy, the coverage (the percentage of the time series that are covered by the method), and the earliness (the fraction of the time series length used for classification). Note that the earliness varied from one test example to another: each test example could be classified at a different time point, so our method is patient-specific and there is no fixed length of the time series used for classification.

Because there is an imbalance in the drug response dataset, the accuracy (A) is computed in a way that accounts for the class distribution; we additionally report a relative accuracy in the tables.

Since the objective of the paper is to provide a method for early classification, we propose an evaluation measure that incorporates both the earliness (E) and the accuracy (A). We define the F_{β}-measure as the weighted harmonic mean of A and the earliness complement (1 − E):

F_{β} = ((1 + β^{2}) · A · (1 − E)) / (β^{2} · A + (1 − E)),

where β controls the relative weight given to accuracy versus earliness. In this study, we use the F_{1}-score, which gives both the accuracy and the earliness the same weight. The F_{1}-score reaches its best value at 1 and worst score at 0.
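Assuming the F_β-measure is the weighted harmonic mean F_β = (1 + β²)·A·(1 − E) / (β²·A + (1 − E)) of the accuracy A and the earliness complement (1 − E), the F_1 values reported in the results tables are reproduced exactly:

```python
def f_beta(accuracy, earliness, beta=1.0):
    """F_beta combining accuracy A and earliness E (both as fractions);
    higher is better, with (1 - E) rewarding earlier classification."""
    a, e = accuracy, 1.0 - earliness
    return (1 + beta ** 2) * a * e / (beta ** 2 * a + e)

# Reproduces the reported scores: H3N2 (A = 77.78%, E = 62.50%) and
# HRV (A = 70.00%, E = 40.00%).
print(round(f_beta(0.7778, 0.6250), 4))  # 0.506
print(round(f_beta(0.7000, 0.4000), 4))  # 0.6462
```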

Results and discussion

Evaluation of MSD method

First, we show the effectiveness of the MSD method on a single patient from the H3N2 dataset. In the top panel of the figure below, MSD correctly classified an asymptomatic test subject at the 5^{th} time point, using a shapelet of length 5. In the bottom panel, MSD used a shapelet of length 6 that was extracted from the time series of a symptomatic subject, so it correctly classified the symptomatic test subject at the 8^{th} time point (it used only 50% of the time series’ length to classify the test subject).

Illustration of the effectiveness of the MSD method on a case from H3N2 dataset

**Illustration of the effectiveness of the MSD method on a case from H3N2 dataset.** The effectiveness of the MSD method is illustrated on a single patient from H3N2. In the top panel, a 2-dimensional H3N2 asymptomatic test subject (genes RSAD2 and IFI44L observed at 15 time steps) has been correctly classified by the MSD method at the 5^{th} time point. In the bottom panel, a 2-dimensional H3N2 symptomatic test subject (genes RSAD2 and IFI44L observed at 16 time steps) has been correctly classified by the MSD method at the earliest possible time stamp, number 8. Red lines represent time series of the symptomatic subject. Blue lines represent time series of the asymptomatic subject. Shapelets are represented by solid markers.

Next, the MSD method was evaluated on the viral and drug response datasets using all genes defined by each dataset. The results are shown in the table below.

| **Dataset** | **Number of genes** | **Accuracy** | **Relative accuracy** | **Coverage** | **Earliness** | **F_1** |
| --- | --- | --- | --- | --- | --- | --- |
| H3N2 | 23 | 77.78 | 85.71 | 100 | 62.50 | 0.5060 |
| HRV | 26 | 70.00 | 71.43 | 100 | 40.00 | 0.6462 |
| Baranzini3A | 3 | 70.00 | 73.91 | 95.83 | 46.26 | 0.6080 |
| Baranzini3B | 3 | 66.67 | 68.00 | 100 | 44.81 | 0.6039 |
| Baranzini6 | 6 | 70.83 | 70.83 | 100 | 42.86 | 0.6325 |
| Baranzini12 | 12 | 66.67 | 66.67 | 100 | 42.86 | 0.6154 |
| Lin9 | 9 | 67.86 | 69.57 | 100 | 44.00 | 0.6136 |
| Costa17 | 17 | 68.00 | 69.23 | 100 | 45.24 | 0.6067 |

The performance of the MSD method on 8 datasets is shown in the table. The MSD method achieved good accuracy on most of the datasets using a small fraction of the time series. The distributions of the statistics were skewed, so we report the median of each statistic.

From the table, it can be seen that a conventional classifier that must observe the entire time series has earliness of 100% and hence F_{1} ≈ 0.01, while our MSD method achieved approximately 68% accuracy using less than half of the time series’ length on average (F_{1} ≈ 0.51).

For the viral infection dataset, a list of 23 genes associated with the viral infection, sorted by their relevance to the infection diagnosis, is provided in a recently published study. We therefore evaluated the MSD method on the H3N2 dataset using different numbers of the top-ranked genes.

Performance of MSD method on the H3N2 dataset using different numbers of top genes

**Performance of MSD method on the H3N2 dataset using different numbers of top genes.** This figure illustrates the performance of the MSD method on the H3N2 dataset using different numbers of top genes from the provided ranked list.

For the drug response dataset, no ranked list of genes is provided in previous publications. In 4 out of the 6 drug response datasets the number of genes is small; therefore, on these datasets, we ran our MSD method on all combinations of genes. The combination of genes that achieved the highest accuracy for each dataset is provided in the table below. For example, on the Lin9 dataset the performance improved significantly when using only 3 genes instead of 9 (the F_{1}-score increased from 0.61 to 0.67).

| **Dataset** | **Genes** | **Accuracy** | **Relative accuracy** | **Coverage** | **Earliness** | **F_1** |
| --- | --- | --- | --- | --- | --- | --- |
| H3N2 | Top 11 genes | 80.00 | 87.50 | 88.89 | 64.29 | 0.4938 |
| HRV | RSAD2 | 71.43 | 75.00 | 100 | 38.89 | 0.6587 |
| Baranzini3A | Caspase 10 | 75.00 | 76.00 | 100 | 45.45 | 0.6316 |
| Baranzini3B | Caspase 2, Caspase 3 | 75.00 | 76.19 | 100 | 44.05 | 0.6409 |
| Baranzini6 | Caspase 10, IL-4Ra | 75.00 | 76.00 | 100 | 43.45 | 0.6448 |
| Lin9 | Caspase 2, Caspase 3, Jak2 | 81.82 | 82.61 | 100 | 43.43 | 0.6689 |

The MSD method was evaluated on all combinations of genes on 4 datasets. The accuracy improves relative to using all genes; for example, the accuracy of the MSD method on the Lin9 dataset improves significantly, from 68% to 82%, when using only 3 genes instead of 9.

Since our method achieved high accuracy using a small number of genes (in some cases only one gene), we also ran the univariate method on each gene individually; the best result for each dataset is reported in the table below.

| **Dataset** | **Gene** | **Accuracy** | **Relative accuracy** | **Coverage** | **Earliness** | **F_1** |
| --- | --- | --- | --- | --- | --- | --- |
| H3N2 | LOC26010 | 77.78 | 85.71 | 100 | 38.34 | 0.6879 |
| HRV | RSAD2 | 42.86 | 80.00 | 55.56 | 52.50 | 0.4506 |
| Baranzini3A | Caspase 10 | 12.00 | 100.00 | 12.25 | 42.86 | 0.1983 |
| Baranzini3B | Caspase 3 | 26.09 | 80.00 | 31.38 | 40.26 | 0.3632 |
| Baranzini6 | Caspase 10 | 12.00 | 100.00 | 12.25 | 42.86 | 0.1983 |
| Baranzini12 | Caspase 3 | 26.09 | 80.00 | 31.38 | 40.26 | 0.3632 |
| Lin9 | Caspase 3 | 26.09 | 80.00 | 31.38 | 40.26 | 0.3632 |
| Costa17 | Caspase 3 | 26.09 | 80.00 | 31.38 | 40.26 | 0.3632 |

The univariate method (using Chebyshev’s inequality as the distance threshold method and the weighted recall as the utility score method) was evaluated on each gene on all datasets. The best accuracy is reported.

Baseline classifier for early classification

As a baseline sanity check, we compared the MSD method with a random classifier. The results of the random classifier are shown in the table below.

| **Dataset** | **Accuracy** |
| --- | --- |
| H3N2 | 55.2833 |
| HRV | 52.1869 |
| Baranzini3A | 49.7893 |
| Baranzini3B | 49.6808 |
| Baranzini6 | 50.8227 |
| Baranzini12 | 53.9255 |
| Lin9 | 50.7689 |
| Costa17 | 51.5093 |

In addition, we compared MSD to a baseline classical classifier that uses shorter time series. Recent research strongly suggests that the 1-nearest neighbor (1NN) method with Dynamic Time Warping (DTW) is exceptionally difficult to beat.

We constructed 2 datasets out of H3N2, which we call 1NN(70) and 1NN(60), and 2 datasets out of the HRV dataset, which we call 1NN(50) and 1NN(40). The 1NN(p) dataset is obtained by truncating every time series to its first p% of time points; the 1NN classifier with DTW is then trained and evaluated on these truncated series.
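The baseline can be sketched as a standard 1NN classifier over DTW distance; this is a generic textbook implementation, not the code used in the experiments.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance with an
    absolute-difference local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping steps.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def one_nn_dtw(train, labels, query):
    """Label of the training series closest to `query` under DTW."""
    best = min(range(len(train)), key=lambda i: dtw_distance(train[i], query))
    return labels[best]
```

For the 1NN(p) experiments, both `train` and `query` would simply be the truncated prefixes of the original series.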

Comparison of the MSD method to the baseline classifier

**Comparison of the MSD method to the baseline classifier.** The performance of 1NN with DTW using different time series length and MSD on the viral infection datasets. The left (right) group shows accuracy of the classifiers on H3N2 (HRV) dataset, respectively. The x-axis within a group is ordered by the fraction of the time series, shown in parenthesis. The results provide evidence that the MSD method is more accurate than 1NN.

On the HRV dataset (right group), the accuracy of 1NN using 50% of the time series’ length (gray bar) is worse than our early classification method MSD (yellow bar), and MSD used a smaller fraction of the time series on average. For instance, 1NN achieved 55% accuracy on the 1NN(50) dataset (F_{1} ≈ 0.46), while MSD was more accurate using on average 40% of the time series’ length (F_{1} ≈ 0.64). The results on the H3N2 dataset were consistent.

Therefore, for the early classification task, using conventional classification methods on shorter time series is not as accurate as using methods specialized for early classification, such as our proposed method.

Run-time analysis

The run time of the MSD method on each dataset is reported in the table below.

| **Dataset** | **Number of genes** | **Number of examples** | **TS length** | **Time in seconds** |
| --- | --- | --- | --- | --- |
| H3N2 | 23 | 17 | 16 | 295.1 |
| HRV | 26 | 20 | 10 | 77.7 |
| Baranzini3A | 3 | 52 | 7 | 49.3 |
| Baranzini3B | 3 | 52 | 7 | 36.1 |
| Baranzini6 | 6 | 52 | 7 | 41.1 |
| Baranzini12 | 12 | 52 | 7 | 64.3 |
| Lin9 | 9 | 52 | 7 | 48.8 |
| Costa17 | 17 | 52 | 7 | 131.9 |

The run time of the MSD method is reported for all datasets, along with the number of genes, the number of examples, and the time series length.

Conclusion

For the early classification task, we proposed a method called Multivariate Shapelets Detection (MSD), which extracts patterns from all dimensions of the time series. In addition, we proposed using an information gain-based distance threshold and a weighted information gain-based utility score for shapelets. The weighted information gain incorporates earliness and assigns a higher utility score to shapelets that appear earlier. In order to adhere to the limitations of clinical settings (in which only a small pre-specified number of genes is provided over shorter time series), datasets comprised of fairly short time series were used in the reported experiments; however, our method is applicable to any domain. We showed that MSD can classify the time series early by using as little as 40%-64% of the time series’ length. We compared MSD to baseline classifiers and showed that a method designed for early classification is more accurate than conventional methods applied to truncated time series.

The run time of the MSD method grows exponentially with the number of examples and the length of the time series, which limits the applicability of the proposed approach to datasets with a smaller number of data instances and/or temporal observations. In practice, this is not a limitation for early classification in many health informatics applications (e.g., sepsis), since decisions typically have to be made very early by learning from a small number of patients. However, in future work, we will speed up the method by incorporating parallelism into the algorithm.

We are working to improve MSD by allowing the components of the multivariate time series shapelet to have different starting positions. Since the number of candidate shapelets grows exponentially, the concepts of closed shapelets and maximal closed shapelets can be introduced to prune redundant shapelets that are supersets of smaller shapelets. Another extension to our work is to allow the interval between time stamps to vary across subjects.

Competing interests

Both authors declare that they have no competing interests.

Authors’ contributions

MG designed the algorithms, implemented software, carried out the analysis, and drafted the manuscript. ZO inspired the overall work, provided advice, and revised the final manuscript. Both authors read and approved the final manuscript.

Acknowledgements

We thank everyone in Prof. Obradovic’s laboratory for valuable discussions. Special thanks to the reviewers for their valuable suggestions, which helped improve the presentation and characterization of the proposed method, and to Dušan Ramljak for reviewing the initial draft of the paper.

This work was funded, in part, by DARPA grant [DARPA-N66001-11-1-4183] negotiated by SSC Pacific; the US National Science Foundation [NSF-CNS-0958854]; and the Egyptian Ministry of Higher Education.