Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089-2910, USA

Division of Geological and Planetary Sciences, California Institute of Technology, Pasadena, CA 91125, USA

Marine and Environmental Biology, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089-0371, USA

The Ecosystems Center, Marine Biological Laboratory, Woods Hole, MA 02543, USA

Bay Paul Center, Marine Biological Laboratory, Woods Hole, MA 02543, USA

Abstract

Background

The increasing availability of time series microbial community data from metagenomics and other molecular biological studies has enabled the analysis of large-scale microbial co-occurrence and association networks. Among the many analytical techniques available, the Local Similarity Analysis (LSA) method is unique in that it captures local and potentially time-delayed co-occurrence and association patterns in time series data that cannot otherwise be identified by ordinary correlation analysis. However, LSA, as originally developed, does not consider time series data with replicates, which hinders the full exploitation of available information. With replicates, it is possible to understand the variability of the local similarity (LS) score and to obtain its confidence interval.

Results

We extended our LSA technique to time series data with replicates and termed it extended LSA, or eLSA. Simulations showed the capability of eLSA to capture subinterval and time-delayed associations. We implemented the eLSA technique into an easy-to-use analytic software package. The software pipeline integrates data normalization, statistical correlation calculation, statistical significance evaluation, and association network construction steps. We applied the eLSA technique to microbial community and gene expression datasets, where unique time-dependent associations were identified.

Conclusions

The extended LSA analysis technique was demonstrated to reveal statistically significant local and potentially time-delayed association patterns in replicated time series data beyond those found by ordinary correlation analysis. These statistically significant associations can provide insights into the real dynamics of biological systems. The newly designed eLSA software efficiently streamlines the analysis and is freely available from the eLSA homepage, which can be accessed at

Background

In recent years, advances in microbial molecular technologies, such as next generation sequencing and molecular profiling, have enabled researchers to spatially and temporally characterize natural microbial communities without laboratory cultivation

To analyze microbial community and other data under various conditions, researchers typically use techniques such as Pearson’s Correlation Coefficient (PCC), principal component analysis (PCA), multi-dimensional scaling (MDS), discriminant function analysis (DFA) and canonical correlation analysis (CCA)

To understand local and time-delayed associations, we originally designed a Local Similarity Analysis (LSA) for time series data measured typically at successive and equal time intervals without replicates

Since biological experiments are often associated with many potential sources of noise, repeated measurements (replicates) are usually carried out in order to better assess inherent uncertainties of the quantities of interest

Briefly, given time series data of two factors and a user-constrained delay limit, eLSA finds the configuration of the data that yields the highest local similarity (LS) score, which is a type of similarity metric. For example, within a delay limit of two units, the first time spot of one series might be aligned to the third time spot of the other series, thus maximizing their LS score. For a dataset of many factors, eLSA is applied to each pairwise combination of factors in the dataset. Candidate associations are then evaluated statistically in two ways. A permutation test yields the p-value: the first series is shuffled and the LS score re-evaluated many times, and the p-value is the proportion of permuted scores exceeding the original LS score. The false discovery rate (FDR q-value) is then used to correct for multiple comparisons. Researchers can use eLSA to detect undirected associations, i.e., association patterns without time delays, and directed associations, where the change of one factor may temporally lead or follow another factor.

The paper is organized as follows. In the “Methods” section, we describe the LSA algorithm for calculating the LS score with replicates, data normalization, estimation of the confidence interval for the LS score, and testing the statistical significance of an LS score. In the “Results” section, we first show the efficacy of eLSA by simulations, then briefly describe the eLSA pipeline, and finally apply the pipeline to analyze a microbiological dataset and a gene expression dataset. The paper ends with discussion and conclusions.

Methods

Pearson’s correlation coefficient-based analysis

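Suppose that the time series data for factors $X$ and $Y$ are $X_{[1:n][1:m]}$ and $Y_{[1:n][1:m]}$, where $X_{i[1:m]}$ and $Y_{j[1:m]}$, or, in more abbreviated form, $X_i$ and $Y_j$, denote the $m$ replicates measured at the $i$-th and $j$-th of the $n$ time spots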

where

Local similarity analysis with replicates

The original LSA method considers only data without replicates. In this paper, we extend the Local Similarity Analysis (LSA) method

(1) For ^{2}:

_{0,j} = 0, _{i,0} = 0, and _{0,j} = 0, _{i,0} = 0.

(2) For ^{2} with |

_{i+1,j+1} = max{0,_{i,j} + _{XY}_{i}_{j}

_{i+1,j+1} = max{0,_{i,j} + _{XY}_{i}_{j}

(3) _{max}_{1≤i,j≤n}_{i,j} and

_{max}_{1≤i,j≤n}_{i,j}.

(4)

_{sgn}_{max}_{max}

The _{max}_{i}_{i}_{i}_{j}_{i}_{j}
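As an illustration, the dynamic programming recursion can be sketched in Python as follows. This is a minimal sketch rather than the eLSA implementation: it assumes the standard LSA recursion (a running positive score and a running negative score, each reset to zero when it would drop below zero) applied to two series that have already been summarized over replicates and normalized, and it assumes that the final score is divided by the series length. The function name `local_similarity` and its return layout are hypothetical.

```python
from typing import Sequence, Tuple


def local_similarity(
    x: Sequence[float], y: Sequence[float], delay: int
) -> Tuple[float, int, int, int]:
    """Return (LS score, start in x, start in y, alignment length).

    Starts are 1-based. x and y are assumed to be summarized, normalized
    series of equal length n; `delay` is the maximum allowed shift.
    """
    n = len(x)
    if n == 0 or len(y) != n:
        raise ValueError("x and y must be non-empty and of equal length")

    # pos[i][j] / neg[i][j]: best running positive / negative score of an
    # alignment ending at x[i-1], y[j-1]. Row 0 and column 0 stay at zero.
    pos = [[0.0] * (n + 1) for _ in range(n + 1)]
    neg = [[0.0] * (n + 1) for _ in range(n + 1)]
    pos_len = [[0] * (n + 1) for _ in range(n + 1)]
    neg_len = [[0] * (n + 1) for _ in range(n + 1)]

    best_score, best_i, best_j, best_len, best_sign = 0.0, 0, 0, 0, 1
    for i in range(1, n + 1):
        # Only cells with |i - j| <= delay are visited.
        for j in range(max(1, i - delay), min(n, i + delay) + 1):
            product = x[i - 1] * y[j - 1]
            p = pos[i - 1][j - 1] + product
            if p > 0:
                pos[i][j] = p
                pos_len[i][j] = pos_len[i - 1][j - 1] + 1
            q = neg[i - 1][j - 1] - product
            if q > 0:
                neg[i][j] = q
                neg_len[i][j] = neg_len[i - 1][j - 1] + 1
            if pos[i][j] > best_score:
                best_score, best_i, best_j = pos[i][j], i, j
                best_len, best_sign = pos_len[i][j], 1
            if neg[i][j] > best_score:
                best_score, best_i, best_j = neg[i][j], i, j
                best_len, best_sign = neg_len[i][j], -1

    # Assumption: the signed maximum is scaled by the series length n.
    return (
        best_sign * best_score / n,
        best_i - best_len + 1,
        best_j - best_len + 1,
        best_len,
    )
```

For two short series in which the second lags the first by one time spot, `local_similarity([1, -1, 2, 0.5], [0, 1, -1, 2], delay=1)` aligns positions 1 to 3 of the first series with positions 2 to 4 of the second. With `delay=0` the same data yield a negative score, because only the unshifted alignment is considered.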

Different ways of summarizing the replicate data

Notice that the only additional component we introduced in the eLSA algorithm is the function

The ‘simple’ method is, in spirit, to take the mean profiles to represent the replicated series. In practice, we take _{i}_{i}_{i}_{i}_{i}

Bootstrap confidence interval for the LS score

With replicate data, researchers can study the variation of quantities of interest and give their confidence intervals. Due to the complexity of calculating the LS score, its probability distribution is hard to study theoretically. Thus, we resort to the bootstrap to obtain a confidence interval (CI) for the LS score. The bootstrap is a re-sampling method for studying the variation of an estimated quantity based on available sample data _{i}_{j}_{max}
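A percentile bootstrap of this kind can be sketched as follows. The sketch assumes that the replicates at each time spot are resampled with replacement and that the interval is read off the empirical percentiles of the recomputed statistic; both choices, the function names, and the stand-in statistic `mean_product` (used only to keep the example self-contained, in place of the LS score) are assumptions for illustration.

```python
import math
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# series[i] holds the m replicates measured at time spot i
Replicated = Sequence[Sequence[float]]


def resample(series: Replicated, rng: random.Random) -> List[List[float]]:
    """Draw m replicates with replacement at every time spot."""
    return [rng.choices(list(reps), k=len(reps)) for reps in series]


def bootstrap_ci(
    x: Replicated,
    y: Replicated,
    statistic: Callable[[Replicated, Replicated], float],
    n_boot: int = 100,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for statistic(x, y)."""
    rng = random.Random(seed)
    scores = sorted(
        statistic(resample(x, rng), resample(y, rng)) for _ in range(n_boot)
    )
    lower = scores[math.floor((alpha / 2) * (n_boot - 1))]
    upper = scores[math.ceil((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper


def mean_product(x: Replicated, y: Replicated) -> float:
    """Stand-in statistic: average product of the replicate means."""
    return mean(mean(a) * mean(b) for a, b in zip(x, y))
```

In practice `statistic` would summarize the resampled replicates and compute the LS score. When the replicates at every time spot are identical, the interval collapses to a single point, which is a quick sanity check on the resampling.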

Data normalization

eLSA analyses require the series of factors _{i}_{i}

Then, we take

where Φ is the cumulative distribution function of the standard normal distribution. We will take _{[1:n]} obtained through the above procedure as the normalization of
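A rank-based normal score transformation of this kind can be sketched as follows. The displayed formulas did not survive in this text, so the sketch assumes the common form in which each value is replaced by $\Phi^{-1}(R_k/(n+1))$, with $R_k$ the rank of the $k$-th value; the averaging of tied ranks and the function name are likewise assumptions.

```python
from statistics import NormalDist
from typing import List, Sequence


def normal_score_transform(values: Sequence[float]) -> List[float]:
    """Map values to normal scores via their ranks (ties get average ranks)."""
    n = len(values)
    order = sorted(range(n), key=lambda k: values[k])
    ranks = [0.0] * n
    start = 0
    while start < n:
        # Extend the block [start, stop] over all values tied with order[start].
        stop = start
        while stop + 1 < n and values[order[stop + 1]] == values[order[start]]:
            stop += 1
        average_rank = (start + stop) / 2 + 1  # ranks are 1-based
        for pos in range(start, stop + 1):
            ranks[order[pos]] = average_rank
        start = stop + 1
    inverse_cdf = NormalDist().inv_cdf  # inverse of Phi
    # Dividing by n + 1 keeps the argument strictly inside (0, 1).
    return [inverse_cdf(r / (n + 1)) for r in ranks]
```

For example, `normal_score_transform([10, 30, 20])` maps the middle value to a score of zero and the smallest and largest values to scores of equal magnitude and opposite sign.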

Permutation test to evaluate the statistical significance of LSA association

It is important to evaluate the statistical significance of the LS score, measured by the p-value: the probability of observing an LS score no smaller than the observed score when the two factors are not associated locally or globally. To achieve this objective, a permutation test is used. To perform the test, we fix ^{(1)},^{(2)},…,^{(L)}} is the permuted set of _{L}

where

False discovery rate (FDR) estimation

In most biological studies, a large number of factors need to be considered. If there are

Computation complexity and implementation

For a single pair of time series, the time complexity for calculating the LS score using the dynamic programming algorithm is ^{2}

In summary, the internal support for replicates and the use of CI estimates are the two major methodological enhancements to LSA. The eLSA software, however, also incorporates other new features, such as faster permutation and false discovery rate evaluations and more options to handle missing values. Other implementation details are available from the software documentation.
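To illustrate the multiple testing correction mentioned above, the following sketch converts a list of p-values into adjusted values using the Benjamini-Hochberg step-up procedure. This procedure is swapped in here for illustration only; it is not claimed to be the q-value estimator used by the eLSA software, and the function name is hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that the adjusted values
    # are non-decreasing in the rank of the raw p-value.
    for rank in range(m, 0, -1):
        k = order[rank - 1]
        running_min = min(running_min, p_values[k] * m / rank)
        adjusted[k] = running_min
    return adjusted
```

With one adjusted value per pairwise comparison, associations can then be filtered jointly on the raw p-value and the adjusted value.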

Results

Simulations and benchmarks

We generated simulated data to show the efficacy of eLSA in capturing time-dependent association patterns, such as time-delayed associations and associations within a subinterval. We also studied the differences among eLSA inferences using the simple average method (referred to as ‘simple’), the SD-weighted average method (referred to as ‘SD’), the median method (referred to as ‘Med’), and the MAD method (referred to as ‘MAD’).
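The four summarizing functions can be sketched as follows. The exact definitions are assumptions: ‘simple’ is taken as the replicate mean, ‘SD’ as the mean divided by the sample standard deviation, ‘Med’ as the median, and ‘MAD’ as the median divided by the median absolute deviation. The function names are hypothetical.

```python
from statistics import mean, median, stdev
from typing import Sequence


def f_simple(reps: Sequence[float]) -> float:
    """'simple': mean of the replicates at one time spot."""
    return mean(reps)


def f_sd(reps: Sequence[float]) -> float:
    """'SD': mean weighted by the inverse sample standard deviation.

    stdev() raises StatisticsError when fewer than two replicates are given.
    """
    sd = stdev(reps)
    if sd == 0:
        raise ValueError("replicates have zero standard deviation")
    return mean(reps) / sd


def f_med(reps: Sequence[float]) -> float:
    """'Med': median of the replicates at one time spot."""
    return median(reps)


def f_mad(reps: Sequence[float]) -> float:
    """'MAD': median weighted by the inverse median absolute deviation."""
    med = median(reps)
    mad = median(abs(r - med) for r in reps)
    if mad == 0:
        raise ValueError("replicates have zero median absolute deviation")
    return med / mad
```

The two weighted variants need a dispersion estimate and are therefore undefined for a single replicate.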

Time-delayed association

In this case, _{j+3},_{j}**0** and covariance matrix _{j}_{j}_{j}_{ij}

**Examples of simulated associations.** a. An example of simulated time-delayed association series with five replicates is shown, where X (red square) leads Y (blue circle) by three time units. The pattern is not significant by ordinary correlation analysis (PCC=-0.258, P=0.272); however, it is captured by local similarity analysis (LS=0.507, P=0.006). b. An example of simulated subinterval association series with five replicates is shown, where X (red square) and Y (blue circle) are associated in the time interval from 6 to 15. The pattern is not significant by ordinary correlation analysis (PCC=0.258, P=0.273); however, it is captured by local similarity analysis (LS=0.428, P=0.028).

We see that the two series closely follow each other if we shift the

Association within a subinterval

In this case, we assume _{j}_{j}**0** and covariance matrix _{j}_{j}_{ij}

We can see that the two series follow each other closely, for the most part, within the intended subinterval 6 ≤

Different summarizing function

To see the effect of replicates, we also let

Mean and standard error of the estimated LS score

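| F-function | m=1 mean | m=1 se. | m=5 mean | m=5 se. | m=10 mean | m=10 se. | m=15 mean | m=15 se. | m=20 mean | m=20 se. |
|---|---|---|---|---|---|---|---|---|---|---|
| ‘simple’ | .495 | .078 | .495 | .085 | .491 | .088 | .493 | .076 | .496 | .091 |
| ‘SD’ | na. | na. | .332 | .127 | .391 | .124 | .412 | .119 | .435 | .109 |
| ‘Med’ | .495 | .078 | .490 | .090 | .490 | .090 | .490 | .083 | .498 | .083 |
| ‘MAD’ | na. | na. | .494 | .115 | .302 | .128 | .325 | .129 | .371 | .119 |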

The values are calculated based on 1000 simulations. ‘se.’ indicates standard error and ‘na.’ indicates not applicable.

Running time comparison

We benchmarked the running time of the new eLSA implementation against the old R script. For a dataset of 72 time series, each with 35 time points, we ran an eLSA analysis with 100 bootstraps, 1000 permutations and a delay limit of 3. The old script took 20462 seconds to finish the computation, while the new C++ program took 2054 seconds, which is about 10 times as fast (20462/2054 ≈ 9.96). The new implementation also reduces memory consumption and increases input/output efficiency. The benchmark was carried out on a “Dell, PE1950, Xeon E5420, 2.5GHz, 12010MB RAM” computing node.

The eLSA analysis pipeline

In this subsection, we briefly describe the eLSA analysis pipeline implemented into the eLSA software package, as shown in Figure

**eLSA pipeline**. Users start with raw data (matrices of time series) as input and specify their requirements as parameters. The LSA tools subsequently

F-transformation and data normalization

The eLSA tool accepts a matrix file in which each row is a time series for one factor. It fills in missing data by a user-specified method. Zero to third order spline-based methods and the nearest neighbour method as implemented in the

Local similarity scoring

Local similarity analysis calculates the highest similarity score between any pair of factors. Users can specify parameters such as the maximum shift allowed. The LS score is calculated using the eLSA dynamic programming algorithm (see Methods).

Permutation test

The statistical significance of an LS score, measured by its p-value, is evaluated using a permutation test. Briefly, eLSA randomly shuffles the components of the original time series and recalculates the LS score for the pair. The p-value is approximated by the fraction of permutation scores that are larger (in absolute value) than the original score. A confidence interval for a given LS score is also estimated by bootstrapping from the replicated data. Finally, users can obtain significant eLSA association results by combining p-value and FDR q-value thresholds as their filtering criteria.
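The permutation step can be sketched as follows. This is an illustrative sketch, not the eLSA code: the function name and defaults are hypothetical, the score function is passed in as `statistic` so that the example stays self-contained, and permuted scores are counted when they are at least as large in absolute value as the observed score (an assumption about how ties are handled).

```python
import random
from typing import Callable, Sequence


def permutation_p_value(
    x: Sequence[float],
    y: Sequence[float],
    statistic: Callable[[Sequence[float], Sequence[float]], float],
    n_perm: int = 1000,
    seed: int = 0,
) -> float:
    """Fraction of shuffles of x whose |score| reaches the observed |score|."""
    rng = random.Random(seed)
    observed = abs(statistic(x, y))
    shuffled = list(x)  # copy, so that the caller's series is left intact
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(statistic(shuffled, y)) >= observed:
            hits += 1
    return hits / n_perm


def mean_product(x: Sequence[float], y: Sequence[float]) -> float:
    """Stand-in score: average elementwise product of two centred series."""
    return sum(a * b for a, b in zip(x, y)) / len(x)
```

The resolution of the estimate is 1/`n_perm`, so a reported p-value of zero only means that no permuted score reached the observed one in that many shuffles.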

Association network construction

Using only the significant associations, users can construct a partially directed association network. Generally, for two factors _{1},_{1}] in _{2},_{2}] in _{1} <_{2}, we can infer that

Microbial community data analysis

As an immediate application, we applied the eLSA pipeline to a set of real microbial community time series data. This San Pedro Ocean Time Series (SPOTs) dataset, originally reported in

First, we compared the performance of Pearson’s correlation coefficient (PCC) and eLSA analysis in identifying potential local and time-delayed associations. Restricting the significance threshold for the q-value

Significant associations found in real datasets

| Dataset | # of factors | Found by eLSA | Found by PCC | Found by both | Found by eLSA | Found by PCC | Found by both |
|---|---|---|---|---|---|---|---|
| Microbial | 515 | 1643 | 3237 | 293 | 2804 | 4242 | 658 |
| C. elegans | 446 | 42532 | 56605 | 39114 | 57991 | 71799 | 54201 |

Numbers of significant associations found by the extended Local Similarity Analysis (eLSA) and Pearson’s Correlation Coefficient (PCC) by controlling both the p-value (

If we look at the top five positive and negative absolute highest LS scores from the unique associations (|_{2}, _{4}, _{3} and oxygen), and the existence of some highly connected clusters formed by certain bacteria or eukaryote groups.

Top LS scores from the microbial community data

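| X | Y | LS | Xs | Ys | Len | D | P | PCC | Ppcc | Q | Qpcc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Euk239 | Euk269 | 0.82 | 1 | 1 | 40 | 0 | 0 | 0.09 | 0.59 | 0.02 | 1.00 |
| Bac609 | Bac675 | 0.77 | 1 | 2 | 39 | -1 | 0 | 0.14 | 0.41 | 0.00 | 1.00 |
| Euk381 | Euk462 | 0.77 | 1 | 1 | 40 | 0 | 0 | 0.44 | 0.00 | 0.02 | 0.11 |
| Euk583 | Bac989 | 0.68 | 2 | 1 | 39 | 1 | 0 | 0.30 | 0.06 | 0.02 | 0.73 |
| Euk229 | Euk339 | 0.57 | 1 | 2 | 39 | -1 | 0 | 0.05 | 0.77 | 0.02 | 1.00 |
| Euk97 | boxy | -0.62 | 15 | 15 | 21 | 0 | 0 | -0.42 | 0.01 | 0.00 | 0.17 |
| Euk98 | boxy | -0.62 | 15 | 15 | 21 | 0 | 0 | -0.42 | 0.01 | 0.00 | 0.17 |
| Euk109 | boxy | -0.62 | 15 | 15 | 21 | 0 | 0 | -0.42 | 0.01 | 0.00 | 0.17 |
| Euk112 | boxy | -0.62 | 15 | 15 | 21 | 0 | 0 | -0.42 | 0.01 | 0.00 | 0.17 |
| Euk116 | boxy | -0.62 | 15 | 15 | 21 | 0 | 0 | -0.42 | 0.01 | 0.00 | 0.17 |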

The 5 positive and 5 negative highest absolute LS Scores from associations uniquely found by eLSA in the microbial community dataset. The columns in succession are X (first factor), Y (second factor), LS (Local Similarity score), Xs (start of the best alignment in the first sequence), Ys (start of the best alignment in the second sequence), Len (alignment length), D (shift of the second sequence compared to the first sequence, -: X is ahead of Y, + otherwise), P (p-value for the LS score, 0.00 stands for P<0.005), PCC (Pearson’s Correlation Coefficient), Ppcc (P-value for PCC), Q (q-value calculated for P, 0.00 stands for Q<0.005), Qpcc (q-value for Ppcc).
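The columns described above are sufficient to derive network edges. The following sketch turns result rows into a partially directed edge list, using the sign of D for direction (negative D means X is ahead of Y) and the sign of LS for the edge type. The record type, the function name, and the 0.05 cut-offs for P and Q are assumptions made for illustration.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Association(NamedTuple):
    x: str      # first factor
    y: str      # second factor
    ls: float   # local similarity score
    d: int      # shift of the second series relative to the first
    p: float    # p-value of the LS score
    q: float    # q-value of the LS score


Edge = Tuple[str, str, bool, str]  # (source, target, directed, sign)


def to_edges(
    rows: Iterable[Association], p_max: float = 0.05, q_max: float = 0.05
) -> List[Edge]:
    """Keep significant rows and orient each edge by the sign of D."""
    edges: List[Edge] = []
    for row in rows:
        if row.p > p_max or row.q > q_max:
            continue  # not significant under the assumed thresholds
        sign = "positive" if row.ls > 0 else "negative"
        if row.d < 0:    # X is ahead of Y
            edges.append((row.x, row.y, True, sign))
        elif row.d > 0:  # Y is ahead of X
            edges.append((row.y, row.x, True, sign))
        else:            # no delay: undirected association
            edges.append((row.x, row.y, False, sign))
    return edges


rows = [
    Association("Euk239", "Euk269", 0.82, 0, 0.0, 0.02),
    Association("Bac609", "Bac675", 0.77, -1, 0.0, 0.00),
    Association("Euk583", "Bac989", 0.68, 1, 0.0, 0.02),
    Association("Euk97", "boxy", -0.62, 0, 0.0, 0.00),
]
edges = to_edges(rows)
```

Applied to these four rows from the table, the sketch yields two undirected edges (one positive, one negative) and two directed positive edges, Bac609 to Bac675 and Bac989 to Euk583.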

**Typical association network from the microbial community data.** Round (brown), square (blue) and triangle (green) nodes are bacteria, eukaryotes and environmental factors, respectively. Solid (red) edges indicate positive associations, while dashed (blue) edges indicate negative associations. Arrows indicate the time-delay direction.

Taking a closer look at one of the topmost ranked associations:

**Examples of real data association.** **a**. Shown are microbe group **b**. Shown are gene

Gene expression data analysis

Although LSA had its roots grounded in microbial community analysis, the technique can be readily applied to other biological time series data, such as replicated gene expression time series data from microarray and RNA-Seq experiments

The results are summarized in Table

Because these genes do not change expression level in both dauer exit and L1 starvation conditions, they are considered as common feeding response genes

We next analyzed the unique eLSA associations. These associations form a dense association network themselves with a long-tailed degree distribution, as shown in Figure

**Node degree distribution of associations in C. elegans analysis**. Shown is the node degree distribution of the associations found uniquely by eLSA in the C. elegans analysis. The distribution is long-tailed, with a maximum degree of 189.

Top LS scores from the

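| X | Y | LS | lowCI | upCI | Xs | Ys | Len | D | P | PCC | Ppcc | Q | Qpcc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 48087 | 27993 | 0.53 | 0.41 | 0.61 | 1 | 2 | 11 | -1 | 0.00 | 0.56 | 0.06 | 0.00 | 0.01 |
| 32607 | 51986 | 0.52 | 0.41 | 0.61 | 2 | 1 | 10 | 1 | 0.01 | 0.51 | 0.09 | 0.00 | 0.01 |
| 29504 | 48087 | 0.52 | 0.40 | 0.61 | 2 | 1 | 11 | 1 | 0.00 | 0.41 | 0.18 | 0.00 | 0.03 |
| 23193 | 27993 | 0.51 | 0.41 | 0.59 | 1 | 2 | 11 | -1 | 0.00 | 0.48 | 0.11 | 0.00 | 0.02 |
| 29494 | 30208 | 0.51 | 0.39 | 0.61 | 2 | 1 | 11 | 1 | 0.00 | 0.58 | 0.05 | 0.00 | 0.01 |
| 27993 | 53694 | -0.55 | -0.62 | -0.44 | 2 | 1 | 11 | 1 | 0.00 | -0.53 | 0.08 | 0.00 | 0.01 |
| 436287 | 53694 | -0.54 | -0.62 | -0.44 | 2 | 1 | 11 | 1 | 0.01 | -0.55 | 0.06 | 0.00 | 0.01 |
| 48941 | 53694 | -0.52 | -0.61 | -0.42 | 2 | 1 | 11 | 1 | 0.00 | -0.38 | 0.22 | 0.00 | 0.03 |
| 29494 | 22857 | -0.52 | -0.61 | -0.41 | 2 | 1 | 11 | 1 | 0.00 | -0.49 | 0.10 | 0.00 | 0.02 |
| 29494 | 436727 | -0.52 | -0.61 | -0.40 | 2 | 1 | 11 | 1 | 0.01 | -0.55 | 0.06 | 0.00 | 0.01 |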

The 5 positive and 5 negative highest absolute LS Scores from the ^{th} and 5^{th} columns.

We also analyzed all the eLSA associations together, including both unique and non-unique eLSA findings. Though most of the genes are still hypothetical protein coding genes, we do find a group of eukaryotic initiation factors:

**Translation initiation factor associations in C. elegans analysis**

Discussion and conclusions

The eLSA technique extends LSA to time series data with replicates. This will help investigators better utilize the available information from their sample replicates and assist them in more effective and reliable hypothesis generation of time-dependent associations. In addition, a bootstrap framework is developed to estimate the confidence interval for the LS score. We also provided flexible missing value options and integrated efficient multiple testing control methods for the new eLSA technique. Using the microbial community and gene expression datasets, we demonstrated that eLSA uniquely captures additional time-dependent associations, including local and time-delayed association patterns, when compared to ordinary correlation methods, such as PCC. In this paper, we described applications of our method to time series data. However, eLSA can be applied to any type of data ordered along a gradient, including responses to different levels of treatment, temperature or humidity, or spatial distributions.

Currently, we use a permutation test to assess the statistical significance of LS scores and bootstrap re-sampling to estimate the confidence interval of the LS score. Both methods are time consuming if a highly precise determination of statistical significance or confidence interval is desired. Theoretical developments on the distribution of the LS score are needed to eliminate or mitigate this computational burden, and would be interesting topics for future studies. There is also a minimum sample number requirement for eLSA analysis. We suggest the sample number to be greater than 5+

Finally, we implemented the eLSA technique and analysis pipeline into an Open Source C++ extension to Python with many new features. Specifically, the pipeline streamlines data normalization, local similarity scoring, permutation testing and network construction. As shown in Figure

**Submission interface for the LSA web service.** Upon submission, the job will perform eLSA analysis on the ‘CommonGenesData’ dataset (12 time spots and 4 replicates) with 200 permutations and 100 bootstraps within a delay limit of 3 units. In addition, by specification, it will use ‘simple’ averaging to summarize replicates and, by designating ‘none’, it will disregard the missing values.

Authors' contributions

LCX, JAS, JAC, ZGC, SLS, JJV, JAF, FS designed the study. LCX, ZGC, JAF and FS developed the methods. LCX, JAS, JAC developed and tested the software. LCX, JAS and JAC collected and analyzed the data. LCX, JAS, JAC, ZGC, JAF and FS wrote the paper.

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

The authors would like to thank Cheryl Chow, Rohan Sachdeva, Barbara Campbell, Anders Andersson and Stefan Bertilsson for testing the eLSA software packages and web services and providing valuable suggestions. We thank Jun Zhao of PIBBS at University of Southern California for helpful discussion of

This article has been published as part of