Genomics, Proteomics and Bioinformatics Unit, Center for Applied Medical Research, University of Navarra, Pamplona, Spain

Laboratory of Microbial Biofilms, Instituto de Agrobiotecnología, Universidad Pública de Navarra-Consejo Superior de Investigaciones Científicas-Gobierno de Navarra, Pamplona 31006, Spain

Cancer Imaging Laboratory, Center for Applied Medical Research, University of Navarra, Pamplona, Spain

Abstract

Background

High-density oligonucleotide microarrays are an appropriate technology for genomic analysis, and are particularly useful in the generation of transcriptional maps, ChIP-on-chip studies and re-sequencing of the genome. Transcriptome analysis of tiling microarray data facilitates the discovery of novel transcripts and the assessment of differential expression in diverse experimental conditions. Although new technologies such as next-generation sequencing have appeared, microarrays might still be useful for the study of small genomes or for the analysis of genomic regions with custom microarrays due to their lower price and good accuracy in expression quantification.

Results

Here, we propose a novel wavelet-based method, named ZCL (zero-crossing lines), for the combined denoising and segmentation of tiling signals. The denoising is performed with the classical SUREshrink method and the detection of transcriptionally active regions is based on the computation of the Continuous Wavelet Transform (CWT). In particular, the detection of the transitions is implemented as the thresholding of the zero-crossing lines. The algorithm described has been applied to the public

Conclusions

The proposed method achieves the best performance in terms of positive predictive value (PPV) while its sensitivity is similar to that of the other algorithms used for the comparison. The computation time needed to process the transcriptional signals is low compared with model-based methods and in the same range as that of filter-based methods. Automatic parameter selection has been incorporated and, moreover, the method can be easily adapted to a parallel implementation. We can conclude that the proposed method is well suited for the analysis of tiling signals, in which transcriptional activity is often hidden in the noise. Finally, the quantification and differential expression analysis of

Background

The complete deciphering of the information contained in the genome would improve our understanding of the biological processes occurring in living organisms. High-density oligonucleotide-based whole-genome microarrays are an extensively used technology to detect the expression of all RNA species, including protein-coding and non-coding RNAs. They are particularly suitable for the analysis of whole small-sized genomes such as those of bacteria. For these organisms, high resolution can be achieved with the microarrays currently provided by the manufacturers.

Applications of tiling array technology include the generation of transcriptional maps and annotations of genomes, the identification of transcription factor binding sites, the analysis of alternative splicing events, the analysis of methylation states, genotyping and polymorphism discovery, and the re-sequencing of genomes

The emerging high-throughput next generation DNA sequencing (NGS) technologies

The analysis of a tiling microarray experiment starts with a two-step process that generates a discrete signal. First, the DNA or RNA samples are hybridized to the custom-designed tiling array. Second, for each probe, the raw intensities are converted to a score

The workflow shown in Figure

**Wavelet-based processing of tiling signals.** Workflow for the analysis of the tiling signal based on the computation of the Continuous Wavelet Transform (CWT).

Transcriptome analysis refers to the detection of segments where the noisy tiling signal is constant. The start and end points of these segments correspond to transcript start and end sites. Several approaches have been deployed in the segmentation of tiling signals: pseudo-median or Hodges-Lehmann estimator

Wavelet analysis using the Discrete Wavelet Transform (DWT)

We also used this algorithm for the identification of the subset of transcripts whose expression decreases in a

We applied the segmentation methods to this high-quality dataset and demonstrated their usefulness for the analysis of the transcriptome map derived from the tiling array. The results demonstrate that ZCL allows not only a rapid identification of the transcripts based on the segmentation procedure but also a more accurate estimation of the expression level of each transcript.

Results and discussion

All the steps needed to obtain a transcriptional map from the raw data (reading the CEL files, then normalizing, denoising and segmenting the tiling signal) have been implemented using the statistical language

**R code: Segmentation and visualization functions.** Implemented functions in R language to perform PMSW and SCM segmentation and the proposed wavelet-based method for denoising and segmentation. In addition, functions are provided for proper visualization of data, integration of analysis results and evaluation of the obtained transcriptional maps.

**R code: Segmentation analysis of S. cerevisiae.** R script for segmentation of the

**R code: Segmentation analysis of ****).** R script for the segmentation of the

Experimental datasets

Saccharomyces cerevisiae dataset

The dataset is described in

Staphylococcus aureus dataset

The

Before cDNA synthesis, RNA integrity from each sample was confirmed on Agilent RNA Nano LabChips (Agilent Technologies). 10

Probe annotation and normalization

The annotation of the PM probe sequences was obtained with the alignment to the genome sequence of

The annotation files for

Denoising

The denoising was evaluated using the signal to noise ratio (SNR), a quantitative measure of its performance. In order to compare the results obtained with those from Huber et al.

**Signal to noise ratio of different filtering methods.** Portion of the tiling signal used to evaluate the Signal to Noise Ratio (SNR). We consider two signal regions (S) and two noise regions (N). **(a)** Normalized signal. **(b)** Denoised signal using Donoho’s threshold. **(c)** Denoised signal using the SUREShrink threshold.

The SNR was computed as

with the noise standard deviation

where the subscript $N$ refers to the standard normal distribution

**SNR results**

Estimated SNR values of the tiling signal. The normalized and the wavelet denoised signal using Donoho’s and SUREShrink on which the calculation was performed are shown in Figure

| Signal | SNR |
|---|---|
| Best SNR in | 4.58 |
| Normalized signal | 4.28 |
| Wavelet denoising (Donoho’s) | 5.28 |
| Wavelet denoising (SUREShrink) | **6.17** |

Segmentation

A descriptive example of the denoising and segmentation for

**Wavelet-based segmentation of tiling signal.** Visualization of

The TAR start and end positions were defined as the transition locations for which the difference between the mean intensity of neighboring segments is greater than 10% of the dynamic range of the tiling signal. Moreover, the inspection of the intensity histogram of chromosome 1 forward strand was used to set the minimum normalized transcription level value to −2. The same parameters were adopted to process the other strands of the organism. The R function

Another representative example of segmentation results is given in Figure

**Figure S1.** Visualization of

**Figure S2.** Visualization of

**Figure S3.** Visualization of

**Figure S4.** Visualization of

**Wavelet-based segmentation of tiling signal.** Visualization of

Segmentations comparison using

The results from the ZCL segmentation were compared to those obtained with PMSW

Huber’s method is based on the structural change model (SCM). The SCM model

Due to the lack of a biologically validated ground truth to evaluate the outputs, we compared the methods in terms of two metrics, sensitivity and positive predictive value (PPV) at probe-level. We define sensitivity as the number of probes in the detected TARs that overlap with annotated regions (true positives,
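The two probe-level metrics can be illustrated with a minimal Python sketch (the supplementary code of this work is written in R; Python is used here only for illustration). It assumes the usual definitions, sensitivity = TP / (TP + FN) and PPV = TP / (TP + FP), and represents detection and annotation as boolean masks with one entry per probe. The function name `probe_level_metrics` and the toy masks are hypothetical.

```python
def probe_level_metrics(detected, annotated):
    """Return (sensitivity, ppv) for two equal-length boolean probe masks.

    detected[i]  -> probe i lies inside a detected TAR
    annotated[i] -> probe i lies inside an annotated region
    """
    if len(detected) != len(annotated):
        raise ValueError("masks must have the same length")
    pairs = list(zip(detected, annotated))
    tp = sum(1 for d, a in pairs if d and a)        # detected and annotated
    fp = sum(1 for d, a in pairs if d and not a)    # detected, not annotated
    fn = sum(1 for d, a in pairs if not d and a)    # annotated, missed
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv


# Toy example with six probes (illustrative values only).
detected = [True, True, False, False, True, False]
annotated = [True, False, True, False, True, False]
sens, ppv = probe_level_metrics(detected, annotated)
```

Because annotation is incomplete, a probe counted here as a false positive may belong to a genuine but unannotated transcript, so PPV computed this way is a conservative estimate.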

The PMSW and SCM methods were applied to the

The graphical representation of the results obtained after processing the

**Results for the identification of TARs.** Number of detected TARs, probe-level PPV and sensitivity, and computation time for the proposed (solid line), PMSW (dashed line) and SCM (dotted line) methods. The analysis was performed for the forward (left) and the reverse (right) strands of all chromosomes of

**Evaluation metrics (S. cerevisiae)**

Mean number of detected TARs, probe-level PPV, probe-level sensitivity and computational time for PMSW, SCM and ZCL methods (all chromosomes and strands of

| Method | TARs | PPV | Sensitivity | Time (min) |
|---|---|---|---|---|
| PMSW | 22114 | 0.7416 | **0.4700** | 2.88 |
| SCM | 11246 | 0.7847 | 0.3904 | 79.09 |
| ZCL | 18209 | 0.8486 | 0.3760 | 13.02 |
| ZCLSure | 22513 | **0.8547** | 0.3686 | 10.70 |

In-depth analysis of chromosome 1 gives interesting insights into the relationship between the methods. In the forward strand, the number of probes annotated as genes is 12796, representing 19.35% of the total number of probes. 65.16% of the probes are correctly classified by the three algorithms (12.21% gene probes and 52.95% non-gene probes). Of the annotated probes, 63.12% are detected by all methods, while only 11.85% are not detected by any of them. This means that 88.15% of the annotated probes are detected by at least one of the methods. The reverse strand contains 11866 annotated probes (17.90% of the probes located in this strand), of which 19.74% are considered part of a TAR by all the methods, and 28.35% are true negative probes. In this strand, 51.41% of the annotated probes are included in a TAR by any method while only 14.35% are never detected. In other words, 85.65% of the annotated probes in the strand are detected by at least one algorithm. In light of this outcome, we considered it worthwhile to evaluate whether combining the results computed with the different methods would improve the performance of the segmentation.

Combination of TAR probes candidates

We evaluated the improvement in performance obtained by the combination of the different segmentations. We chose different strategies to define the sets (intersection of two or three methods and majority voting system). After a decision is taken on the candidates, TARs are constructed to create the transcriptional map. In Table

**Integrative transcriptional analysis**

PPV and sensitivity for both strands of chromosome 1 using individual TAR detection algorithms and the combination of their results.

| Method | PPV Forward | Sensitivity Forward | PPV Reverse | Sensitivity Reverse |
|---|---|---|---|---|
| PMSW | 0.6511 | 0.5873 | 0.5811 | 0.2073 |
| SCM | 0.7188 | 0.4390 | 0.6968 | 0.2146 |
| ZCL | **0.8675** | 0.3821 | **0.8220** | **0.2208** |
| PMSW ⋂ SCM ⋂ ZCL | 0.6312 | **0.5984** | 0.5441 | 0.2030 |
| PMSW ⋂ ZCL | 0.6448 | 0.5921 | 0.5744 | 0.2076 |
| PMSW ⋂ SCM | 0.6370 | 0.5957 | 0.5504 | 0.2043 |
| SCM ⋂ ZCL | 0.7053 | 0.4409 | 0.6626 | 0.2116 |
| Majority voting | 0.7247 | 0.4396 | 0.6993 | 0.2164 |
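The combination strategies described in this section (intersection of two or three methods, and majority voting) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the supplementary R code: each method is assumed to provide a per-probe 0/1 call, and the names `combine_masks` and `masks_to_tars` as well as the toy calls are hypothetical.

```python
def combine_masks(masks, min_votes):
    """Keep a probe when at least `min_votes` of the methods call it.

    min_votes == len(masks)           -> intersection of all methods
    min_votes == len(masks) // 2 + 1  -> majority voting
    """
    n = len(masks[0])
    if any(len(m) != n for m in masks):
        raise ValueError("all masks must have the same length")
    return [sum(bool(m[i]) for m in masks) >= min_votes for i in range(n)]


def masks_to_tars(mask):
    """Turn a per-probe mask into a list of (first, last) probe index runs."""
    tars, start = [], None
    for i, flag in enumerate(mask):
        if flag and start is None:
            start = i                      # a new run of candidate probes opens
        elif not flag and start is not None:
            tars.append((start, i - 1))    # the run closed at the previous probe
            start = None
    if start is not None:
        tars.append((start, len(mask) - 1))
    return tars


# Toy per-probe calls from three methods (illustrative values only).
pmsw = [1, 1, 1, 0, 0, 1, 1, 0]
scm = [0, 1, 1, 0, 0, 1, 0, 0]
zcl = [0, 1, 1, 1, 0, 0, 1, 0]

intersection = masks_to_tars(combine_masks([pmsw, scm, zcl], min_votes=3))
majority = masks_to_tars(combine_masks([pmsw, scm, zcl], min_votes=2))
```

Pairwise intersections follow from the same function by passing two masks with `min_votes=2`.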

Computational performance

Differential expression analysis of

Comparative segmentation analysis using the ZCL, PMSW and SCM algorithms was applied to the hybridization data obtained with a custom-designed Affymetrix tiling array of

**Evaluation metrics (S. aureus)**

Mean number of detected TARs, probe-level PPV, probe-level sensitivity and computational time for PMSW, SCM and ZCL methods.

| Tiling Signal | Metric | PMSW | SCM | ZCL | ZCLSure |
|---|---|---|---|---|---|
| WT Forward | PPV | 0.6298 | **0.6498** | 0.6248 | 0.6407 |
| WT Forward | Sens | 0.8657 | **0.8766** | 0.8715 | 0.8719 |
| WT Reverse | PPV | 0.6867 | 0.6993 | **0.7050** | 0.6989 |
| WT Reverse | Sens | 0.8506 | **0.8560** | 0.8388 | 0.8535 |
| | PPV | 0.6238 | **0.6388** | 0.6227 | 0.6308 |
| | Sens | **0.9054** | 0.9035 | 0.9027 | 0.9036 |
| | PPV | 0.6664 | **0.6815** | 0.6765 | 0.6748 |
| | Sens | **0.8697** | 0.8684 | 0.8667 | 0.8515 |

The most frequent transcriptional analysis is the detection of genes that have changed their expression in the conditions under study (differential expression analysis). As sigma B affects the expression of more than one hundred genes, we decided to test whether it is possible to use the intensity of all the probes included in each detected TAR with the ZCL segmentation procedure to calculate the expression level of the transcript in a particular environmental condition. In order to carry out this analysis using tiling microarrays we need to compress the intensity of all the probes included in each detected TAR into one value. Standard methods for microarray normalization can be applied, for example RMA (Robust Multichip Average) algorithm in the case of Affymetrix microarrays

We introduced a simple analytical tool, to be used independently of the microarray platform, to measure the gene expression level based on the median value of the TAR probe intensities. We calculated this value for each wild-type and sigmaB mutant sample. We applied a statistical analysis (t-test) to obtain the p-value associated with the expression change, taking into account the biological variability of the samples. Considering well-defined TARs in the $\sigma^{B}$-regulated genes, such as the alkaline shock protein 23 (

**Differential expression analysis of ****mutant.** Boxplots of median gene expression intensities. The expression of the selected genes has been previously reported to change in response to sigB repression.
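The median-based quantification described in this section can be sketched as follows, in Python and using only the standard library. Several points are assumptions: the text states only that a t-test was applied, so the unequal-variance (Welch) form shown here is one possible choice; the function names, the TAR coordinates and all intensity values are hypothetical. The sketch returns the t statistic and its degrees of freedom; converting them to a p-value requires the Student t distribution, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, median, variance


def tar_expression(sample_intensities, tar):
    """Median probe intensity of one sample inside a TAR given as (first, last)."""
    first, last = tar
    return median(sample_intensities[first:last + 1])


def welch_t(group_a, group_b):
    """Welch t statistic and approximate degrees of freedom for two groups."""
    va, vb = variance(group_a), variance(group_b)
    na, nb = len(group_a), len(group_b)
    se2 = va / na + vb / nb                       # squared standard error
    t = (mean(group_a) - mean(group_b)) / sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Hypothetical normalized intensities: three replicates per condition, five probes.
wild_type = [[1.0, 5.0, 6.0, 7.0, 1.0],
             [0.5, 5.5, 6.5, 6.0, 1.2],
             [0.8, 6.2, 5.8, 6.4, 0.9]]
mutant = [[1.0, 2.0, 2.5, 3.0, 1.0],
          [0.7, 2.2, 2.0, 2.6, 1.1],
          [0.9, 1.8, 2.4, 2.1, 1.0]]
tar = (1, 3)                                      # probes 1..3 form the TAR

wt_expr = [tar_expression(s, tar) for s in wild_type]
mut_expr = [tar_expression(s, tar) for s in mutant]
t_stat, dof = welch_t(wt_expr, mut_expr)
```

Using the median rather than the mean keeps the per-TAR value robust to a few outlying probes, for example probes with cross-hybridization.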

Conclusions

Transcriptomics is a powerful technology for the study of gene structures and RNA-based regulation in any organism. Genome-wide transcriptome analysis of prokaryotes can be carried out with either of two techniques: RNA-seq and genomic tiling arrays

In this paper, we propose a combined WT-based method for the denoising and segmentation of tiling signals. For illustrative and evaluative purposes, we applied the proposed analysis to the public

Our segmentation algorithm (ZCL) calculates all the possible break points based on the zero-crossing lines of the second derivative of the Gaussian wavelet. The results show that our method achieves the best compromise between accuracy (evaluated in terms of PPV and sensitivity) and computation time. The R code provided can be used to apply our algorithm as well as to combine the resulting segmentation with other methods such as PMSW and SCM.

We also designed a new tiling microarray for the analysis of

Once the TARs are properly detected, differentially expressed transcripts can be identified by well-known methods (such as Linear Models for Microarray Data (LIMMA)

In conclusion, we present a novel method for denoising and segmentation of tiling microarray signals based on wavelet multiresolution analysis that outperforms previous methods in terms of SNR, positive predictive value and computation time. The R code that implements the method is given as supplementary material and can be easily adapted to a parallel computing scheme. We have also introduced the possibility of combining the results of ZCL with those obtained with two other well-known approaches (PMSW and SCM) for the segmentation of tiling signals.

Methods

WT-based analysis

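The CWT of a continuous signal $f(t)$ is defined as

$$W_f(a,b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} f(t)\, \psi^{*}\!\left(\frac{t-b}{a}\right) dt$$

where $\psi^{*}((t-b)/a)$ is the complex conjugate of the mother wavelet $\psi$, dilated by the scale $a$ and shifted to the position $b$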

The CWT can be interpreted as the correlation of the input signal with a position-reversed version of the dilated wavelet $\psi_{a}$. The larger the scale $a$, the wider the dilated wavelet, and hence the smaller the corresponding analyzed frequency. The output value is maximized when the frequency of the signal matches that of the corresponding dilated wavelet. The CWT computation for arbitrary scales can be easily adapted to a parallel implementation with a linear computational complexity

Mallat’s fast wavelet algorithm computes the DWT on a dyadic grid of scales $2^{i}$ and time shifts $k\,2^{i}$

Normalization of tiling microarray data

The analysis starts with background correction and quantile normalization as described by the RMA algorithm

where
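As an illustration of the quantile normalization step only, the following Python sketch forces every array to share the same intensity distribution. It is not the RMA implementation used in this work: background correction is not shown, tied intensities receive no special treatment, and the name `quantile_normalize` and the toy values are hypothetical.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of arrays (one list of intensities per array).

    Each value is replaced by the mean, across arrays, of the values that
    share its rank, so that all arrays end up with identical distributions.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must contain the same number of probes")
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the r-th smallest value over all arrays.
    reference = [sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])   # probe indices by rank
        out = [0.0] * n
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized


# Two toy arrays with three probes each (illustrative values only).
normalized = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```

After normalization the two toy arrays contain the same set of values, and each probe keeps its original rank within its array.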

WT-based denoising

One of the most established methods of wavelet-based denoising was proposed by Donoho and Johnstone

where

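Another well-established method of wavelet shrinkage is SUREShrink, in which the threshold is chosen to minimize Stein's unbiased risk estimate (SURE), an unbiased estimate of the risk computed from the wavelet coefficients.

For a large dimension, the SURE threshold will be almost the optimal threshold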
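The two threshold selection rules can be compared with a short Python sketch. It assumes the textbook forms of both rules: the universal threshold $\sigma\sqrt{2\ln n}$ of Donoho and Johnstone, and, for soft thresholding with unit noise variance, $\mathrm{SURE}(t) = n - 2\,\#\{i : |x_i| \le t\} + \sum_i \min(|x_i|, t)^2$. The complete SUREShrink procedure also includes a fallback to the universal threshold for very sparse coefficient vectors, which is omitted here. The function names and the coefficient values are hypothetical, and the wavelet decomposition itself is not shown.

```python
from math import log, sqrt


def universal_threshold(n, sigma=1.0):
    """Universal threshold sigma * sqrt(2 ln n) for n coefficients."""
    return sigma * sqrt(2.0 * log(n))


def sure_threshold(coeffs, sigma=1.0):
    """Threshold minimizing Stein's unbiased risk estimate (soft thresholding)."""
    x = sorted(abs(c) / sigma for c in coeffs)   # candidate thresholds, ascending
    n = len(x)
    best_t, best_risk = 0.0, float("inf")
    cum_sq = 0.0
    for k, t in enumerate(x):
        cum_sq += t * t
        # k + 1 coefficients are <= t; the remaining n - k - 1 are clipped to t.
        risk = n - 2 * (k + 1) + cum_sq + (n - k - 1) * t * t
        if risk < best_risk:
            best_risk, best_t = risk, t
    return sigma * best_t


def soft_threshold(c, t):
    """Shrink a coefficient towards zero by t, setting small values to zero."""
    shrunk = max(abs(c) - t, 0.0)
    return shrunk if c >= 0 else -shrunk


# Toy detail coefficients: six small (noise-like) and two large (signal-like).
coeffs = [0.1, -0.2, 0.05, 3.0, -2.5, 0.15, -0.1, 0.2]
t_sure = sure_threshold(coeffs)
t_univ = universal_threshold(len(coeffs))
denoised = [soft_threshold(c, t_sure) for c in coeffs]
```

In this toy case the SURE threshold is much smaller than the universal one, so the two large coefficients are shrunk far less, which is the behaviour that makes SURE-based selection attractive when the signal is not very sparse.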

WT-based segmentation

An important issue in signal processing is to define an appropriate representation able to compress most of the signal information into few representative features. Sharp variations in amplitude (i.e., transitions and peaks) are among the most meaningful features of a signal. For that reason, many segmentation algorithms rely on their detection. Previous studies have detected the peaks in mass spectrometry data using either the ridge lines

It has been previously shown that the position of multiscale sharp transitions can be obtained from the zero-crossings of the signal convolved with the Laplacian of a Gaussian

where the Gaussian function is dilated by a scale factor $a$. Choosing the second derivative of the Gaussian as the mother wavelet, we derive that the wavelet transform of the signal at scale $a$ is proportional to the second derivative of the signal smoothed by the dilated Gaussian. The identification of transcript start and end sites is achieved by computation of the redundant CWT over a wide scale range followed by zero-crossing line detection and length thresholding. The chosen mother wavelet is the second derivative of a Gaussian. The redundancy of the CWT yields enhanced information on the position-scale localization of the features of interest (in this case, the transitions)

An illustrative example is given in Figure

**Zero-crossing lines of the second derivative Gaussian wavelet.** An illustration of zero crossing lines detection. **(a)** Box signal contaminated with additive Gaussian noise (standard deviation = 0.5). **(b)** Absolute values of the CWT coefficients. The second derivative of the Gaussian was used as the mother wavelet. **(c)** All zero-crossing lines are shown. Note how the two longest lines correspond to the two sharp transitions of the box signal.
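The idea illustrated by the box signal can be reproduced with a simplified Python sketch. It is not the ZCL implementation: the CWT is computed by direct correlation with a truncated kernel, a zero-crossing line is followed from the finest scale upwards by nearest-neighbour matching within a fixed tolerance, and the box is noise-free to keep the example deterministic. The function names, the scale set, the tolerance `tol` and the length threshold are all hypothetical choices.

```python
from math import exp, sqrt


def dog2(x):
    """Second derivative of a Gaussian, up to a constant factor."""
    return (x * x - 1.0) * exp(-0.5 * x * x)


def cwt_row(signal, scale, support=5.0):
    """CWT coefficients at one scale by direct correlation."""
    n, half = len(signal), int(support * scale)
    kernel = [dog2(k / scale) / sqrt(scale) for k in range(-half, half + 1)]
    row = []
    for b in range(n):
        acc = 0.0
        for k in range(-half, half + 1):
            if 0 <= b + k < n:
                acc += signal[b + k] * kernel[k + half]
        row.append(acc)
    return row


def zero_crossings(row):
    """(i, kind) for each strict sign change between row[i] and row[i + 1]."""
    return [(i, "+-" if row[i] > 0 else "-+")
            for i in range(len(row) - 1) if row[i] * row[i + 1] < 0.0]


def zero_crossing_lines(signal, scales, tol=2):
    """Follow each finest-scale zero crossing across increasing scales.

    Returns (position, kind, length), where length is the number of
    consecutive scales on which the crossing could be followed.
    """
    levels = [zero_crossings(cwt_row(signal, a)) for a in scales]
    lines = []
    for pos, kind in levels[0]:
        length, current = 1, pos
        for level in levels[1:]:
            near = [p for p, k in level if k == kind and abs(p - current) <= tol]
            if not near:
                break
            current = min(near, key=lambda p: abs(p - current))
            length += 1
        lines.append((pos, kind, length))
    return lines


# Noise-free box: probes 40..79 are "expressed".
box = [0.0] * 40 + [1.0] * 40 + [0.0] * 40
lines = zero_crossing_lines(box, scales=[1, 2, 4, 8])
transitions = [(pos, kind) for pos, kind, length in lines if length >= 3]
```

With this sign convention a "+-" crossing marks a rising edge (candidate start site) and a "-+" crossing a falling edge (candidate end site). On noisy data, spurious crossings appear at fine scales but tend to produce short lines, which is why the length threshold is applied.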

Identification of transcriptionally active regions

The candidate start and end sites, detected as described in the previous section, are filtered to remove incorrect assignments. The purpose of this procedure is to discard those transitions that do not correspond to variations in signal intensity. For the generation of TARs we considered the signal transitions in which the variation in intensity is at least 10% of the dynamic range of the analyzed signal. We also eliminate from the list of detected TARs all the start and end points that are not correctly paired. We use the sign of the zero-crossing lines to separate start and end points, and we match each start site with its corresponding end site. Finally, we define the minimum normalized intensity threshold required for the segments to be considered transcriptionally active regions. This value is calculated as the median of the signal intensity distribution, but the threshold can also be user-defined. In order to improve the definition of TARs, we cluster together consecutive segments for which the mean normalized intensity value is over the threshold.
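The filtering and clustering steps above can be sketched in Python as follows. The sketch is a simplification under stated assumptions: the pairing of start and end sites by the sign of the zero-crossing lines is not shown, a breakpoint is taken to be the index of the first probe of a new segment, TARs are returned as half-open `(start, end)` index ranges, and the function names and toy signal are hypothetical.

```python
from statistics import mean, median


def filter_breakpoints(signal, breakpoints, min_jump_fraction=0.10):
    """Drop breakpoints whose neighbouring segment means differ by less than
    `min_jump_fraction` of the dynamic range of the signal."""
    min_jump = min_jump_fraction * (max(signal) - min(signal))
    kept = sorted(breakpoints)
    changed = True
    while changed:                      # re-evaluate after every removal
        changed = False
        edges = [0] + kept + [len(signal)]
        for j in range(1, len(edges) - 1):
            left = mean(signal[edges[j - 1]:edges[j]])
            right = mean(signal[edges[j]:edges[j + 1]])
            if abs(right - left) < min_jump:
                kept.remove(edges[j])
                changed = True
                break
    return kept


def call_tars(signal, breakpoints, threshold=None):
    """Return half-open (start, end) ranges of segments above the threshold,
    merging consecutive active segments into a single TAR."""
    if threshold is None:
        threshold = median(signal)      # default: median of the signal
    edges = [0] + sorted(breakpoints) + [len(signal)]
    tars = []
    for start, end in zip(edges[:-1], edges[1:]):
        if mean(signal[start:end]) > threshold:
            if tars and tars[-1][1] == start:
                tars[-1] = (tars[-1][0], end)   # extend the previous TAR
            else:
                tars.append((start, end))
    return tars


# Toy signal with one weak internal transition (2.0 -> 2.1) at probe 10.
signal = [0.0] * 5 + [2.0] * 5 + [2.1] * 5 + [0.1] * 5
kept = filter_breakpoints(signal, [5, 10, 15])
tars = call_tars(signal, kept)
```

In the toy signal the transition at probe 10 changes the mean by less than 10% of the dynamic range, so it is discarded and the two high-intensity segments are reported as a single region.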

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

VS and AMB conceived the idea, developed the methods and implemented the software. ATA and IL designed the NA-Staph-b520729F microarray, and MU and ATA carried out the processing and hybridization of samples. ATA, IL and VS made the biological interpretation of the results. All authors participated in writing and revising the manuscript.

Acknowledgements

We thank Prof. Fernando J. Corrales and Lourdes Ortiz (Genomics Core Facility) for technical support and for their useful comments on the manuscript. This work was supported by the Spanish Torres-Quevedo fellowship [PTQ-08-03-07769] to VS. ATA and AMB were supported by Spanish Ministry of Science and Innovation ‘Ramón y Cajal’ contracts. This work was supported by the Spanish Ministry of Science and Innovation Grants BIO2008-05284-C02-01, BFU2011-23222, ERA-NET Pathogenomics PIM2010EPA-00606 and the agreement between ‘Fundación para la Investigación médica aplicada’ (FIMA) and the ‘UTE project CIMA’.