Bioinformatics and Information Mining, University of Konstanz, 78457 Konstanz, Germany

Neurobiology, University of Konstanz, 78457 Konstanz, Germany

Abstract

Background

The calcium-imaging technique allows us to record movies of brain activity in the antennal lobe of the fruitfly

Method

We have developed an approximate Principal Component Analysis (PCA) for fast dimensionality reduction. The method samples relevant pixels from the movies, such that PCA can be performed on a smaller matrix. Utilising

Results

Our method allows for fast approximate computation of PCA with adaptive resolution and running time. Utilising

Conclusions

Fast dimensionality reduction with approximate PCA removes a computational bottleneck and leads to running time improvements for subsequent algorithms. Once in PCA space, we can efficiently perform source separation, e.g. to detect biological signals in the movies or to remove artifacts.

Introduction

The fruitfly

The datasets we consider are

**Odor coding**. An odor molecule is encoded as a pattern of glomerulus responses in the ALs of the fruitfly brain. The green and yellow glomeruli remain inactive (not shown), whereas the blue and magenta glomeruli respond to the odor presentations (black bars mark two pulses of 1 s each) with differential strength. The left and right ALs, which receive input from the left and right antennae, are mirror-symmetric and contain the same types of glomeruli.

A major objective of biological research in this field is to map the

In terms of data analysis, our goal is to extract glomerular signals and patterns from calcium-imaging movies. Ideally, we would like to do this in a fast and memory-efficient way, keeping in mind that the size of the movies is going to increase further in the future due to the advent of high-resolution and three-dimensional 2Photon microscopy

Here, we process imaging movies from the

ICA algorithms are typically performed after decorrelation and dimensionality reduction with a Principal Component Analysis (PCA)

We thus propose an approximate solution to PCA that, while being substantially faster than exact PCA, keeps biological detail intact. Apart from our specific ICA application, fast dimensionality reduction is also of general utility for computations on imaging movies.

How do we achieve a high-quality approximation to PCA? The observation is that, after processing, we usually deem only a small fraction of the pixels to be relevant, while many others do not report a biological signal. Following a feature selection paradigm

Instead, we propose to quickly select not few but many pixels (out of many more), and we do so by investing a small amount of time into computing pixel sampling probabilities that allow us to pick relevant pixels preferentially. Evaluation of a pixel's relevance relies on

We proceed as follows: In the methods section, we first introduce our notation and summarise prior work. We then consider a general framework for approximate SVD and modify it for our approximate PCA that is explicitly designed for the imaging movies. In the results section, we provide a technical evaluation with respect to speed and accuracy of the results, as well as practical examples for the fast analysis of

Methods

Preliminaries

Notation

PCA

For our purposes, _{k }_{Ij}_{i, j}. When we refer to column selection from matrix

Computing PCA and features for PCA

PCA can be computed by a singular value decomposition (SVD): _{k}_{Fr}_{k }

Regarding feature selection for PCA, Jolliffe

A paper on feature selection for PCA by Boutsidis et al.

Source separation with ICA

On imaging movies, source separation with ICA can be cast into the same notation as PCA (1). Where PCA relies on orthogonal, i.e. uncorrelated basis vectors, the goal of ICA

ICA can detect the glomerular sources in calcium-imaging movies

Monte Carlo approximate SVD

Here, we rely on a Monte Carlo-type approximate SVD proposed by Drineas et al. ^{m×c}, we can achieve an approximation to the sample covariance of ^{T }^{T}_{Fr}

In _{k}: _{k}:

The error of the approximate SVD of _{k }^{T }^{T}_{Fr }

The main result of

This result holds for column sampling probabilities _{j }_{Ij}|

In particular, the upper bound from (3) holds if we sample with replacement

Following the Monte Carlo framework, we can sample ^{T}

The upper bound is, however, not very tight. If we wish to achieve

The main contribution of the norm-based Monte Carlo approach is thus to show that the correctness of SVD/PCA does not collapse under pixel sampling. Rather, the error decreases asymptotically and can be reduced further and further by sampling more pixels.
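The effect of sample size can be illustrated with a small Python sketch. It assumes the norm-based column probabilities p_j = |A^(j)|^2 / ||A||_Fr^2 and the rescaling of every sampled column by 1/sqrt(c p_j), as is customary in Monte Carlo matrix approximation. The function names and the toy data are hypothetical and are not taken from our implementation. Plain Python lists are used to keep the sketch free of dependencies.

```python
import math
import random


def column_norm_probabilities(A):
    """Return p_j = |A_j|^2 / ||A||_Fr^2 for every column j of A (a list of rows)."""
    n = len(A[0])
    sq_norms = [sum(row[j] ** 2 for row in A) for j in range(n)]
    total = sum(sq_norms)
    return [s / total for s in sq_norms]


def sample_columns_with_replacement(A, c, probs, rng):
    """Draw c columns i.i.d. according to probs; rescale each by 1/sqrt(c * p_j)."""
    picks = rng.choices(range(len(probs)), weights=probs, k=c)
    scales = [1.0 / math.sqrt(c * probs[j]) for j in picks]
    return [[row[j] * s for j, s in zip(picks, scales)] for row in A]


def gram(M):
    """Return M M^T as a list of rows."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in M] for r1 in M]


def frobenius_distance(X, Y):
    """Frobenius norm of X - Y for two equally sized matrices."""
    return math.sqrt(sum((x - y) ** 2 for rx, ry in zip(X, Y) for x, y in zip(rx, ry)))


rng = random.Random(0)
# Toy "movie" (assumption): m = 4 time points, n = 200 pixels,
# of which the first 20 pixels carry a strong signal.
A = [[(5.0 if j < 20 else 0.1) * rng.gauss(0, 1) for j in range(200)] for _ in range(4)]
probs = column_norm_probabilities(A)
for c in (10, 50, 200):
    C = sample_columns_with_replacement(A, c, probs, rng)
    # Distance between the exact and the sampled covariance-like matrix.
    print(c, round(frobenius_distance(gram(A), gram(C)), 2))
```

The printed distances vary from run to run with the random seed; the trend over increasing c is what the bound describes, not any single value.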

Covariation sampling

Although this pixel sampling may work well in practice, the theoretical bound is not very tight. Can we then more explicitly select biologically relevant pixels so as to ensure our confidence in the fast approximation?

The intuition is that, if our pixel sample covers all glomeruli, the "biological error" will be small. We thus motivate a biological criterion, covariation between neighbouring pixel-timeseries, as an importance measure. The assumption we rely on concerns the spatial aspect of the data, namely that a glomerulus in an imaging movie covers several adjacent pixels that all report the same signal (plus noise). This

**Probability distributions**. **a)** Image from the Drosophila2D movie, distribution of norm probabilities and distribution of covariation probabilities. A 5% pixel sample (Algorithm 1 for norms, Algorithm 2 for covariance) is superimposed in black. **b)** Drosophila3D. For visualisation, we discretised the continuous z-axis into 9 layers.

Our approach is to compute a small part of the pixels ^{T }A_{i, j}) being defined as follows:

The column norms of ^{n×n }correspond to the amount of covariation with neighbouring pixels, i.e. if the column is from within one of the spatially local sources (glomeruli), the norm is high. Consequently, if we apply the column norm sampling according to (4) not to the movie matrix

Departing from the error bound scheme regarding the norm, we can now estimate in advance the biological signal content by computing for how much of ||_{Fr }_{Fr}

In practice, it is more convenient not to construct the entire matrix

Sampling from ^{cov}
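As an illustration of the covariation criterion, the following Python sketch scores every pixel by the squared covariation of its mean-centred timeseries with the timeseries of its direct neighbours and normalises the scores into sampling probabilities. Several details are assumptions made for the sake of a runnable example: a two-dimensional pixel grid, a 4-neighbourhood that excludes the pixel itself, and the hypothetical function name `covariation_probabilities`. The neighbourhood actually used may differ.

```python
def covariation_probabilities(movie, height, width):
    """movie: list of frames, each a flat list of height * width pixel values.

    Returns one probability per pixel, proportional to the summed squared
    inner products between its centred timeseries and those of its 4-neighbours.
    """
    n = height * width
    m = len(movie)
    # Mean-centred timeseries per pixel (the columns of the movie matrix).
    series = []
    for j in range(n):
        col = [frame[j] for frame in movie]
        mean = sum(col) / m
        series.append([v - mean for v in col])

    scores = []
    for j in range(n):
        y, x = divmod(j, width)
        total = 0.0
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                neighbour = series[ny * width + nx]
                dot = sum(a * b for a, b in zip(series[j], neighbour))
                total += dot * dot
        scores.append(total)

    norm = sum(scores)
    if norm == 0.0:
        return [1.0 / n] * n  # no covariation anywhere: fall back to uniform
    return [s / norm for s in scores]


# Toy movie (assumption): a 4 x 4 image over 4 frames in which a 2 x 2 block
# of pixels shares one signal and all other pixels are constant.
signal = [0.0, 1.0, 0.0, -1.0]
block = {5, 6, 9, 10}
toy_movie = [[signal[t] if j in block else 3.0 for j in range(16)] for t in range(4)]
toy_probs = covariation_probabilities(toy_movie, 4, 4)
print([round(p, 2) for p in toy_probs])
```

Only the neighbour entries are computed, so the full pixel-by-pixel covariance matrix is never constructed.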

Fast PCA for calcium-imaging movies

We first propose two alternative methods for pixel sampling (Algorithms 1 and 2), which we then utilise to perform PCA on a small matrix (Algorithm 3). Sampling allows for an adaptive resolution without a sharp cutoff by a threshold.

Pixel sampling

In Algorithm 1, we sample exactly

**Algorithm 1 Pixel sampling with replacement**, ^{m×n}, number of pixels ^{norm }_{0},..., _{(n - 1)}), ^{m×c}

**for all ****do**

pick column _{j}

**end for**

The above sampling strategy is necessary for the Monte Carlo scheme to work. For the covariation probabilities (7), however, the most parsimonious approach is simply to sample without replacement (Algorithm 2).

**Algorithm 2 Pixel sampling without replacement**, ^{m×n}, number of pixels ^{cov }_{0},..., _{(n - 1)}), ^{m×c}

R := {}

**for all ****do**

sample _{j}

**end for**
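Sampling without replacement can be sketched in Python as follows. This is a minimal illustration, assuming that pixels are drawn one at a time in proportion to their probabilities and that a pixel is removed from the pool once it has been chosen. The function name `sample_pixels_without_replacement` and the example probabilities are hypothetical.

```python
import random


def sample_pixels_without_replacement(probs, c, rng):
    """Return c distinct pixel indices, drawn one at a time in proportion to probs."""
    remaining = [j for j, p in enumerate(probs) if p > 0]
    if c > len(remaining):
        raise ValueError("not enough pixels with non-zero probability")
    chosen = []
    for _ in range(c):
        weights = [probs[j] for j in remaining]
        k = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(k))  # a chosen pixel cannot be drawn again
    return chosen


rng = random.Random(0)
example_probs = [0.4, 0.0, 0.3, 0.2, 0.1]
pixels = sample_pixels_without_replacement(example_probs, 3, rng)
print(sorted(pixels))
```

The sampled matrix is then obtained by keeping only the chosen columns, e.g. `[[row[j] for j in pixels] for row in A]` for a movie matrix `A` stored as a list of rows. The loop recomputes the weights in every step, which is simple but not the fastest option for large movies.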

Note that we can generally assume absence of movement, i.e. pixel identity remains the same throughout the measurement. The AL is a fixed anatomical structure, and small-scale movement that leads to shaky recordings can be eliminated by standard image stabilisation (as e.g. in

Computing PCA

We employ NIPALS-style PCA

Note that Drineas et al. _{k}:

We have summarised the approach in Algorithm 3. The first step consists of running Algorithm 1 or 2 in order to obtain the ^{+ }^{+ }is the generalised Moore-Penrose pseudoinverse of

The approximate PCA requires

**Algorithm 3 Approximate PCA**, ^{m×n}, number of samples ^{m×k }, ^{k×n}

select

//compute NIPALS-style PCA on matrix

**for all ****do**

**while **not converged **do**

**end while**

**end for**

//compute full-size images

^{+ }
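The three steps of Algorithm 3 can be sketched in Python. The sketch makes several assumptions: the movie matrix is already mean-centred and stored as a list of rows, the pixel sample is given as a list of column indices, and the score vectors returned by the NIPALS iteration are mutually orthogonal, so that the pseudoinverse reduces to a rescaled transpose. The names `nipals_scores` and `approximate_pca`, the tolerance and the iteration limit are hypothetical choices.

```python
import math


def nipals_scores(C, k, tol=1e-10, max_iter=500):
    """Return k score vectors (each of length m) of the m x c matrix C."""
    R = [row[:] for row in C]  # residual matrix, deflated after each component
    m, c = len(R), len(R[0])
    scores = []
    for _ in range(k):
        # Start from the residual column with the largest norm.
        j0 = max(range(c), key=lambda j: sum(R[i][j] ** 2 for i in range(m)))
        t = [R[i][j0] for i in range(m)]
        p = [0.0] * c
        for _ in range(max_iter):
            tt = sum(v * v for v in t)
            if tt == 0.0:
                break  # residual is exhausted
            p = [sum(R[i][j] * t[i] for i in range(m)) / tt for j in range(c)]
            p_norm = math.sqrt(sum(v * v for v in p))
            if p_norm == 0.0:
                break
            p = [v / p_norm for v in p]
            t_new = [sum(R[i][j] * p[j] for j in range(c)) for i in range(m)]
            change = math.sqrt(sum((x - y) ** 2 for x, y in zip(t_new, t)))
            t = t_new
            if change <= tol * math.sqrt(sum(v * v for v in t)):
                break
        # Deflate: remove the component just found from the residual.
        for i in range(m):
            for j in range(c):
                R[i][j] -= t[i] * p[j]
        scores.append(t)
    return scores


def approximate_pca(A, pixel_indices, k):
    """Return T (m x k) and S (k x n) with A ~ T S, fitted on the sampled pixels only."""
    C = [[row[j] for j in pixel_indices] for row in A]
    scores = nipals_scores(C, k)
    m, n = len(A), len(A[0])
    S = []
    for t in scores:
        tt = sum(v * v for v in t)
        # Row of the pseudoinverse of T applied to A (orthogonal scores assumed).
        S.append([sum(t[i] * A[i][j] for i in range(m)) / tt if tt else 0.0
                  for j in range(n)])
    T = [[t[i] for t in scores] for i in range(m)]
    return T, S


# Toy example (assumption): 3 time points, 4 pixels, 3 of them sampled.
A = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 2.0], [2.0, 0.0, 1.0, 1.0]]
T, S = approximate_pca(A, pixel_indices=[0, 2, 3], k=2)
print(len(T), len(T[0]), len(S), len(S[0]))
```

The iteration only ever touches the small matrix of sampled pixels; the full movie is used once, in the final step that computes the full-size images.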

Results

Datasets and pixel selection strategies

Our test datasets are "Drosophila2D" (Figure

Both datasets are concatenations of multiple measurements. In the middle of each measurement (except for controls), an odor was presented to the fly. A series of different odors was employed, which enables us to tell glomeruli apart based on their differential response properties.

In Figure

Empirical evaluation

As evaluation criteria we rely on the Frobenius norm error ||_{Fr }_{k}_{Fr }

Results are presented in Figure

**Performance**. Means and standard deviations for time and error measures (10 repetitions) for exact and approximate PCA. Number of pixels

Even small samples lead to low additional error with respect to the Frobenius norm. For example, on the Drosophila2D dataset, exact PCA achieves a Frobenius norm error of 73,754.64 for a rank-_{Fr }

Both norm error and covariation energy reach approximately the accuracy of exact PCA with sample sizes of 10% to 15% of the pixels, whereas time consumption grows only slowly (Figure

How many pixels do we need to sample? Our empirical measurements suggest that 10% to 15% of the pixels are sufficient, and even smaller samples of about 1% of the pixels give good results in practice, with an error already much lower than the expected upper bounds. As a "safe" strategy, we suggest sampling pixels with Algorithm 2 until the cumulated covariation energy exceeds a threshold, e.g. 0.95 (straight line in Figure
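The threshold-based stopping rule can be sketched in Python as follows. The sketch assumes that the cumulated covariation energy of a sample is the sum of the covariation probabilities of the pixels chosen so far, i.e. the fraction of the total covariation they account for. The function name `sample_until_energy` and the example probabilities are hypothetical.

```python
import random


def sample_until_energy(probs, threshold, rng):
    """Draw distinct pixels in proportion to probs until their summed
    probability (the cumulated energy) reaches the threshold."""
    remaining = [j for j, p in enumerate(probs) if p > 0]
    chosen, energy = [], 0.0
    while remaining and energy < threshold:
        weights = [probs[j] for j in remaining]
        k = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        j = remaining.pop(k)  # sampling without replacement
        chosen.append(j)
        energy += probs[j]
    return chosen, energy


rng = random.Random(0)
example_probs = [0.5, 0.25, 0.125, 0.125, 0.0]
pixels, energy = sample_until_energy(example_probs, 0.8, rng)
print(len(pixels), energy)
```

With this rule the sample size is not fixed in advance but adapts to how concentrated the covariation is: a movie with few strongly covarying regions reaches the threshold with fewer pixels than a movie in which the covariation is spread out.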

To give a visual impression of how the technical quality measures translate into image quality, we compare principal component images in

**Example for PCA**. Top principal components computed by exact PCA and approximate PCA with covariation probabilities (1% pixel sample).

Application example: ICA

Recall that both PCA and ICA result in a decomposition of the form _{k }^{PCA }S^{PCA}_{k }^{ICA }S^{ICA}^{PCA }^{PCA}

In Figure

**Example for temporal ICA**. Performing ICA on the principal component timeseries matrix ^{PCA}**a) above: **spatial component **below: **image from raw movie, indicating the shapes of the left and right ALs. **b) **Timeseries component **c) **For comparison, we show the mean timeseries for the glomerulus pair on the raw movie

Taking into account the corresponding timeseries in

For comparison, we extracted (by thresholding) positions of all black pixels in

As another example, we have applied spatial ICA, working on ^{PCA }^{ICA }

**Example for spatial ICA**. Performing ICA on the principal component images matrix ^{PCA}**Top: **ICA was run after exact PCA, **bottom: **ICA was run after approximate PCA with a 1% or 15%, respectively, pixel sample (covariation probabilities). Closest matches are placed in the same column.

Here, we have regarded the spatial and temporal aspects of the data separately, leading, e.g., to spatial components that are not entirely local (Figure

Conclusions

We have shown that source separation can, in principle, detect glomerulus positions and remove artifacts in

Here, we have concentrated on finding a fast approximate solution to PCA that reduces data size prior to source separation. Delegating the main computational load to the preprocessing with fast PCA allows any source separation algorithm to scale up easily with the growing data sizes in imaging. A further promising area of application is, with due modifications, online analysis, such that denoised movies are already available during the course of the experiment.

Our strategy for fast approximate PCA relies on simple precomputations that can be performed in a single pass over the data. Based on

Our empirical results show that small pixel samples reliably lead to approximations with low error. It remains an interesting question for further research whether it is possible to translate these results into theory, e.g. by proving tight error bounds that incorporate the

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

MS performed research and wrote the manuscript. CGG supervised research and edited the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We are grateful to Daniel Münch, Ana F. Silbering and Werner Göbel for recording imaging data, and to Henning Proske for technical assistance with data format and preprocessing. We thank Fritjof Helmchen and Werner Göbel for sharing their expertise on the 2Photon imaging technique and for providing equipment. Financial support by BMBF, DFG and the University of Konstanz is acknowledged. MS was supported by the DFG Research Training Group GK-1042 and a LGFG scholarship issued by the state of Baden-Württemberg.

This article has been published as part of