Department of Microbiology & Immunology, University of British Columbia, Vancouver, BC, Canada
Graduate Program in Bioinformatics, University of British Columbia, Vancouver, BC, Canada
Abstract
Background
Pairwise comparison of time series data for both local and time-lagged relationships is a computationally challenging problem relevant to many fields of inquiry. The Local Similarity Analysis (LSA) statistic identifies the existence of local and lagged relationships, but determining significance through a permutation test is computationally prohibitive for large datasets.
Results
To improve the performance of LSA on big datasets, an asymptotic upper bound on the p-value of the LSA statistic was derived, replacing the permutation test, and implemented in a multi-threaded program, fastLSA.
Conclusions
The asymptotic upper bound removes the need for a computationally intensive permutation test, extending the reach of LSA to datasets containing up to a million time series.
Background
The exponential increase and ubiquitous use of computational technology have given rise to an era of "Big Data" that pushes the limits of conventional data analysis. A standard tool for detecting linear relationships between pairs of variables is Pearson's correlation coefficient (PCC).
Though PCC is a classic and powerful technique for finding linear relationships between two variables, it is not designed to capture the lead-lag relationships seen in time series data. Local similarity analysis (LSA) addresses this gap by detecting correlations that are local in time or offset by a lag, but its significance has traditionally been assessed with a permutation test.
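To illustrate why plain PCC can miss lagged structure, consider the following sketch (illustrative only, not from the paper): PCC on a lagged pair directly is weak, while PCC after realigning the lag recovers the relationship.

```python
import numpy as np

t = np.arange(100)
x = np.sin(t)
y = np.sin(t - 5)  # the same signal, lagged by 5 steps

# PCC on the raw pair: the lag masks the relationship
direct = np.corrcoef(x, y)[0, 1]

# PCC after shifting y back into alignment with x
aligned = np.corrcoef(x[:-5], y[5:])[0, 1]  # ≈ 1.0
```

A lag-aware statistic such as LSA searches over such shifts automatically instead of requiring the analyst to realign the series by hand.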
Here we describe a novel asymptotic upper bound on the calculation of the LSA statistic's p-value and its implementation in a multi-threaded C/C++ program, fastLSA.
Interpreting the LSA statistic
LSA concerns itself with pairs of time series data. The LSA statistic can be interpreted in a manner similar to PCC when no lag window exists between two time series. However, LSA is also capable of capturing localized correlation that is staggered or lagged. A large positive or negative LSA value indicates a correspondingly strong PCC correlation or a correlation at a time displacement within the lag window (Figure 1).
A lagged correlation between two time series
A lagged correlation between two time series. An example of two time series that contain a lead-lag correlation.
LSA is advantageous on large datasets containing many time series. Results can be visualized as a graphical network in which nodes represent individual time series and edges represent their LSA correlation statistic. When displayed using a force-directed layout in Cytoscape, groups of strongly correlated time series form visually distinct clusters.
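Such a network can be handed to Cytoscape as a simple tab-separated edge list. A minimal sketch (the function name, file layout, and default threshold are our own illustrations, not part of fastLSA):

```python
import csv

def write_lsa_edges(path, pairs, threshold=0.8):
    """Write significant LSA pairs as a tab-separated edge list.

    `pairs` is an iterable of (series_a, series_b, lsa_value) tuples;
    only edges whose |LSA| exceeds `threshold` are kept.
    """
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerow(["source", "target", "lsa"])
        for a, b, score in pairs:
            if abs(score) > threshold:
                writer.writerow([a, b, score])
```

Cytoscape's table import can read such a file directly, with the `lsa` column mapped to edge weight for a force-directed layout.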
Implementation
Description of the LSA algorithm
In this section we reproduce the algorithm from the original LSA publication, establishing the notation used in the derivation that follows.
In Figure 2 we present the LSA algorithm for computing the statistic for a pair of time series X and Y.
The LSA algorithm
The LSA algorithm. Algorithm for computing the LSA for a pair of time series X and Y.
The algorithm first initializes the arrays holding the running positively and negatively associated partial sums, then updates them at each time step and lag, keeping the largest value encountered; the LSA statistic is this maximum scaled by the series length.
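The dynamic program can be sketched as follows (an illustrative reimplementation, not the fastLSA source; we use a plain z-score here, whereas LSA implementations typically apply a rank-based normal-score transform):

```python
import numpy as np

def lsa_statistic(x, y, max_lag=3):
    """Signed LSA statistic for two equal-length series over a lag window."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # standardize each series to mean 0, variance 1
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()

    best_pos = best_neg = 0.0
    for d in range(-max_lag, max_lag + 1):
        # align x[i + d] with y[i] over the overlapping range
        if d >= 0:
            prod = x[d:] * y[: n - d]
        else:
            prod = x[: n + d] * y[-d:]
        # local-alignment recurrences: running sums floored at zero
        p = q = 0.0
        for v in prod:
            p = max(0.0, p + v)   # best positively associated run so far
            q = max(0.0, q - v)   # best negatively associated run so far
            best_pos = max(best_pos, p)
            best_neg = max(best_neg, q)
    # report the stronger association, negatively signed if anti-correlated
    s = max(best_pos, best_neg) / n
    return s if best_pos >= best_neg else -s
```

For two identical series offset by two steps, the statistic approaches 1 once the lag window covers the shift; for a series and its negation it approaches -1.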
Calculating the upper bound
In this section we derive the asymptotic upper bound on the tail probability (p-value) of the LSA statistic.
We begin by making certain assumptions about the probability model used to derive the bounds. First, each time series is standardized to a mean of zero and a variance of one, and under the null hypothesis its values are treated as independent and identically distributed.
Consider lines 5 and 7 of the LSA algorithm (Figure 2), which update the positively and negatively associated running sums.
We also define the set of random variables corresponding to the partial sums accumulated by these updates.
We now consider a few useful lemmas that we will use to construct our upper bound.
Lemma 1
Proof. The result is clear from the following:
In the LSA algorithm, we have maximums taken over many overlapping partial sums, so we need bounds on the tail probabilities of such maxima.
Lemma 2
In order to get a simple formula for the bound on the cumulative tail probabilities for these maxima, we first bound each maximum individually.
Now we build the theorems upon which we will derive a formulaic approximation of the upper bound.
Theorem 3
In order to apply the above theorem to get a simple formulaic approximation, we assume some random variables in the construction are approximately normally distributed; this assumption is asymptotically justified by the central limit theorem.
Now we use the above results to obtain the probability estimates for our simple event terms.
Theorem 4
Proof. By applying Lemma 2 we have
and by Theorem 3, replacing
Notice that
It follows from Boole's inequality and Lemma 1 that
Finally, standardizing with a mean of zero and a variance of one, we have the following tail probability bound:
Note that this last result is asymptotic. Thus, for finite series lengths the bound is conservative; the table below compares empirical cumulative tail probabilities with the asymptotic bound for n = 30, 50, and 100.
Empirical (Emp) versus asymptotic fastLSA (Fas) cumulative tail probabilities at increasing LSA values x, for series of length n = 30, 50, and 100.

x      n30Emp    n30Fas    n50Emp    n50Fas    n100Emp   n100Fas
0.05   1         1.000     1         1.000     1         1.000
0.07   1         1.000     1         1.000     0.997     1.000
0.09   1         1.000     0.999     1.000     0.953     1.000
0.11   0.999     1.000     0.984     1.000     0.819     1.000
0.13   0.989     1.000     0.928     1.000     0.627     1.000
0.15   0.958     1.000     0.823     1.000     0.441     1.000
0.17   0.896     1.000     0.687     1.000     0.292     1.000
0.19   0.803     1.000     0.545     1.000     0.184     1.000
0.21   0.694     1.000     0.417     1.000     0.111     1.000
0.23   0.58      1.000     0.309     1.000     0.064     1.000
0.25   0.472     1.000     0.224     1.000     0.036     1.000
0.27   0.376     1.000     0.158     1.000     0.019     0.693
0.29   0.294     1.000     0.109     1.000     0.009     0.373
0.31   0.227     1.000     0.073     1.000     0.005     0.194
0.33   0.172     1.000     0.048     0.981     0.002     0.097
0.35   0.128     1.000     0.031     0.666     0.001     0.047
0.37   0.094     1.000     0.019     0.444     < 0.001   0.022
0.39   0.067     0.98      0.012     0.291     < 0.001   0.01
0.41   0.048     0.742     0.007     0.187     < 0.001   0.004
0.43   0.033     0.555     0.004     0.118     < 0.001   0.002
0.45   0.023     0.411     0.002     0.073     < 0.001   0.001
0.47   0.015     0.301     0.001     0.044     < 0.001   < 0.001
0.49   0.01      0.218     0.001     0.027     < 0.001   < 0.001
0.51   0.006     0.156     < 0.001   0.016     < 0.001   < 0.001
0.53   0.004     0.111     < 0.001   0.009     < 0.001   < 0.001
0.55   0.002     0.078     < 0.001   0.005     < 0.001   < 0.001
0.57   0.001     0.054     < 0.001   0.003     < 0.001   < 0.001
0.59   0.001     0.037     < 0.001   0.002     < 0.001   < 0.001
0.61   < 0.001   0.025     < 0.001   0.001     < 0.001   < 0.001
0.63   < 0.001   0.017     < 0.001   < 0.001   < 0.001   < 0.001
0.65   < 0.001   0.011     < 0.001   < 0.001   < 0.001   < 0.001
0.67   < 0.001   0.007     < 0.001   < 0.001   < 0.001   < 0.001
0.69   < 0.001   0.005     < 0.001   < 0.001   < 0.001   < 0.001
0.71   < 0.001   0.003     < 0.001   < 0.001   < 0.001   < 0.001
0.73   < 0.001   0.002     < 0.001   < 0.001   < 0.001   < 0.001
0.75   < 0.001   0.001     < 0.001   < 0.001   < 0.001   < 0.001
0.77   < 0.001   0.001     < 0.001   < 0.001   < 0.001   < 0.001
0.79   < 0.001   < 0.001   < 0.001   < 0.001   < 0.001   < 0.001
0.81   < 0.001   < 0.001   < 0.001   < 0.001   < 0.001   < 0.001
0.83   < 0.001   < 0.001   < 0.001   < 0.001   < 0.001   < 0.001
0.85   < 0.001   < 0.001   < 0.001   < 0.001   < 0.001   < 0.001
0.87   < 0.001   < 0.001   < 0.001   < 0.001   < 0.001   < 0.001
0.89   < 0.001   < 0.001   < 0.001   < 0.001   < 0.001   < 0.001
0.91   < 0.001   < 0.001   < 0.001   < 0.001   < 0.001   < 0.001
0.93   < 0.001   < 0.001   < 0.001   < 0.001   < 0.001   < 0.001
0.95   < 0.001   < 0.001   < 0.001   < 0.001   < 0.001   < 0.001
0.97   < 0.001   < 0.001   < 0.001   < 0.001   < 0.001   < 0.001
0.99   < 0.001   < 0.001   < 0.001   < 0.001   < 0.001   0.001
Results
To validate the versatility and effectiveness of the derived upper bound (Theorem 4), we applied the algorithm to four datasets: two sourced from biology, one from social networking, and a randomly generated control. These include the Moving Pictures of the Human Microbiome (MPH) dataset, the yeast cell cycle microarray (CDC) dataset, the top 1000 Twitter and Memetracker phrases, and uniformly random null datasets.
Computational complexity
The algorithm calculates significance directly from the asymptotic bound, so running time scales with the number of time series pairs rather than with a permutation count (Figure 3).
LSA calculation time as a function of the number of time series
LSA calculation time as a function of the number of time series. On this log-log plot, notice that because it does not require a permutation test, fastLSA scales to far larger numbers of time series than the original implementation.
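The multi-threaded timings in the running-time table below come from distributing independent pairwise comparisons across workers. A schematic of that parallelization (illustrative Python only; fastLSA itself is written in C/C++, and the function names here are our own):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def all_pairs_lsa(series, pair_fn, workers=16):
    """Apply pair_fn to every unordered pair of named series in parallel.

    Each of the m*(m-1)/2 comparisons is independent of the others,
    so they can be split freely across worker threads.
    """
    names = sorted(series)
    pairs = list(combinations(names, 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = pool.map(lambda p: pair_fn(series[p[0]], series[p[1]]), pairs)
    return dict(zip(pairs, scores))
```

Because the pairwise work is embarrassingly parallel, speedup is limited mainly by the worker count and memory bandwidth, consistent with the single-thread versus 16-thread timings reported below.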
Empirical running time for LSA calculation for data sets of different size.

Dataset       Time series   Time points   fastLSA (single thread)   fastLSA (16 threads)
Twitter       1,000         130           6 sec                     1 sec
CDC           6,178         24            3.24 min                  2.2 sec
MPH           14,105        390           58 min                    7.5 min
First Null    100,000       100           —                         54 min
Second Null   1,000,000     30            —                         2 days 3 hrs
Third Null    1,000,000     100           —                         7 days 23 hrs
Moving pictures of the human microbiome (MPH)
The MPH time series dataset comprises 14,105 time series sampled over 390 time points.
For a given time series, if more than 25% of its time steps were zero, it was removed from the analysis. Analysis took 58 minutes (7.5 minutes on 16 threads), not including output writing time, which is variable. Significant LSA values were visualized as a network (Figure 4).
MPH local similarity graph
MPH local similarity graph. A local similarity graph of the MPH dataset showing significant LSA values as defined by the asymptotic upper bound.
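The sparsity filter described above (dropping series with more than 25% zero time steps) can be sketched as follows; the function name and data layout (one series per row) are our own illustration:

```python
import numpy as np

def drop_sparse_series(data, max_zero_frac=0.25):
    """Keep only rows (time series) whose fraction of zero time steps
    is at most max_zero_frac, mirroring the MPH preprocessing step."""
    data = np.asarray(data, dtype=float)
    zero_frac = (data == 0).mean(axis=1)
    return data[zero_frac <= max_zero_frac]
```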
Microarray hybridization detection of cell cycle-regulated genes in yeast
In the CDC data set we compared the LSA statistics calculated by fastLSA against those of the original LSA implementation (Figure 5).
Comparison of LSA values: fastLSA and Original LSA
Comparison of LSA values: fastLSA and Original LSA. A comparison of calculated LSA statistics between fastLSA and the original LSA implementation.
However, LSA was capable of detecting lead-lag correlation despite the periodicity of the data, demonstrating its capacity to find lagged correlate pairs among a large number of covariate time series. Only 800 of the 6,178 gene nodes could be classified from available annotations.
CDC Local Similarity Graph
CDC local similarity graph. A local similarity graph of the CDC dataset showing significant LSA values as defined by the upper bound cutoff and the additional constraint of absolute LSA values greater than 0.85.
Social media: top 1000 Twitter and Memetracker phrases (Twitter)
The data from the top 1000 Twitter and Memetracker phrases comprises 1,000 time series with 130 time points.
Twitter Local Similarity Graph
Twitter local similarity graph. A local similarity graph of the Twitter dataset showing significant LSA values with an additional threshold of absolute LSA values greater than 0.98.
Null hypothesis simulated data
Finally, to identify throughput limits of fastLSA, we generated null datasets of uniformly distributed random time series, the largest containing 1,000,000 time series with 100 time points.
Uniform Random Local Similarity Graph
Uniform random local similarity graph. A local similarity graph representing purposeful false positives, 1000 time series with 100 time steps randomly generated from a uniform distribution. Notice how no cliques form in the random data generated from a uniform distribution.
Discussion
LSA statistics have been demonstrated to capture relevant local similarity structure for a number of biological datasets.
Conclusions
LSA is a local similarity statistic that has recently been used to capture relevant local structure in time series datasets, particularly within the biological community. However, its use has been limited to smaller datasets due to an intensive permutation test used to calculate significance. Our derivation and direct calculation of an asymptotic upper bound removes this barrier, extending the application of LSA to much larger datasets.
Project name: fastLSA
Project home page:
Operating system(s): OS X, Linux, or Windows
Programming Languages: C/C++
Other requirements: 1 GB RAM
License: GPLv3
Nonacademic restrictions: None
List of abbreviations
LSA: Local Similarity Analysis; PCC: Pearson's Correlation Coefficient; PCA: Principal Component Analysis; MDS: Multidimensional Scaling; DFA: Discriminant Function Analysis; MPH: Moving Pictures of the Human Microbiome; CDC: Cell Division Cycle.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
WED derived the asymptotic upper bound.
Declarations
The publication costs for this article were funded by Genome British Columbia and Genome Canada.
This article has been published as part of
Acknowledgements
We would like to acknowledge Dr. Fengzhu Sun and Dr. Jed Fuhrman at the University of Southern California for their support.