Abstract
Background
The identification of protein coding regions (exons) in DNA sequences using signal processing techniques is an important component of bioinformatics and biological signal processing. In this paper, a new method is presented for the identification of exonic regions in DNA sequences. This method is based on the crosscorrelation technique that can identify periodic regions in DNA sequences.
Results
The method reduces the dependency of window length on identification accuracy. The proposed algorithm is applied to different eukaryotic datasets and the output results are compared with those of other established methods. The proposed method increased the accuracy of exon detection by 4% to 41% relative to the most common digital signal processing methods for exon prediction.
Conclusions
We demonstrated that periodic signals can be estimated using crosscorrelation. In addition, discrete wavelet transform (DWT) can minimise noise while maintaining the signal. The proposed algorithm, which combines crosscorrelation and DWT, significantly increases the accuracy of exonic region identification.
Background
When the DNA sequence of a new eukaryotic organism is synthesized, the exonic (protein coding) regions must be distinguished from the introns (see Figure 1 for a schematic of genome arrangement). The protein coding regions of DNA have been observed to exhibit a period3 property due to the nonuniform codon usage in the translation of codons into amino acids [1]. The aim of this paper is to use this property to identify exonic regions.
Figure 1. The structure of DNA sequence. This figure shows the structure of DNA sequences. (a) Double helix model of DNA. (b) A DNA sequence consists of genes and intergenic regions. The genes of eukaryotes are composed of exons and introns.
Several reasons for the existence of period3 property have been presented in [2,3] and [4]. Some codons participate more in protein synthesis than others, giving rise to repetitions of a specific type of codon in the genome [4]. For example, the existence of a large number of GCA codons in the exonic regions gives greater repetition of G, C and A nucleotides in the first, second and third codon position, respectively. In other words, the G, C and A nucleotides exhibit period3 property in the exonic regions.
Gene finding methods based on genetic characteristics, such as promoter, CpG Island, start and stop codon etc., tend to be of insufficient accuracy [5]. The characterization of coding and noncoding regions based on nucleotide statistics inside codons is described by Bernaola et al., who employed a 12symbol alphabet to identify the borders between coding and noncoding regions [6]. Later, Nicorici and Astola segmented the DNA sequence into coding and noncoding regions using recursive entropic segmentation and stopcodon statistics [7].
The use of signal processing techniques to identify exonic regions based on the period3 property offers new opportunities for gene finding. Tiawari used Fourier transform spectrum to achieve this goal [8]. In Tiawari's method, the discrete Fourier transform (DFT) energy at a central frequency is calculated for a fixed length window, and the window is slid across the numerical sequence. Vaidyanathan [9] identified protein coding regions using an antinotch filter which magnified regions with period3 property. Datta and Asif [10] presented a new algorithm using DFT theory with a Bartlett window. In another signal processing method, Akhtar [11] applied time domain algorithms, average magnitude difference function and time domain periodogram algorithms to identify period3 property. Some gene finding methods based on digital signal processing (DSP) techniques have been developed but the accuracy of these methods is low and requires improvement.
In this paper, a new algorithm based on crosscorrelation theory is presented. We show that the algorithm enhances the accuracy of the identification while reducing noise. The noisy waveform is crosscorrelated with a periodic impulse train to provide the estimated signal. Discrete wavelet transform is applied to remove extra frequencies.
The remainder of the paper is organized as follows: in the Methods section, the application of the crosscorrelation to obtain the periodic signal plus noise is described, together with the period3 behaviour detection using crosscorrelation theory. The final part of this section details the use of wavelet transform to remove noise. The datasets used are introduced in the Dataset section. Thereafter, evaluation measures are introduced for the measurement and comparison of various methods. Finally, in the Results and Discussion section, the results of the proposed algorithm are compared with those of the most common digital signal processing algorithms for exon prediction, in both time and frequency domains.
Methods
Crosscorrelation
The discrete nature of DNA and the existence of period3 behaviour in the exonic regions render it suitable for analysis by signal processing algorithms. We present an algorithm for the identification of the period3 component based on crosscorrelation techniques. The theory of crosscorrelation theory is briefly explained below.
Correlation between two waveforms, x_{1}[n] and x_{2}[n], each of length N, is defined as [12]:
To estimate a periodic waveform that is contaminated with noise, this waveform is crosscorrelated with an adjustable template waveform; the template waveform is adjusted until the crosscorrelation is maximized. The resulting template is an estimate of the signal term of the periodic waveform.
In our approach, a noisy waveform is crosscorrelated with a periodic impulse train of period equal to that of the signal.
Let the signal of period N_{p }points (N_{p }< N) be s[n] and the noise be q[n]; therefore, the noisy waveform is S[n] = s[n]+q[n]. Periodic impulse train used for the crosscorrelation is denoted δ[nkN_{p}], k = 0, 1, 2,..., N_{δ}, where N_{δ }is the number of impulses. Then
Where j represents the lag, defined as the number of sampling points by which δ is shifted to the left. For j = 0, and remembering that δ[nkN_{p}] = 0 for all n ≠ kN_{p}
Since the signal is periodic, s[n+kN_{p}] = s[n], and equation (3) becomes
or
As N_{δ }→ ∞, , and therefore r_{sδ }→ s(0).
Now, the periodic impulse train is shifted on the signal by an amount depending on j. Thus equation (5) can be written for all j's:
From equation (6), it can be concluded that
from which the periodic signal without noise can be extracted [12].
Identification of exonic regions
In this section, a new algorithm using the crosscorrelation is proposed for the identification of exonic regions. The algorithm proceeds via the following steps:
1. DNA sequences are converted into numerical sequences.
2. FIR filter is applied to the numerical sequences representing DNA sequences.
3. Crosscorrelation is applied to the filtered numerical sequences.
4. The noise effect is removed using discrete wavelet transform.
Figure 2 represents these steps as a block diagram. Each step is explained in detail below:
Figure 2. Block diagram of the proposed algorithm. This figure shows the block diagram of the proposed algorithm designed to identify protein coding regions.
1. Numerical conversion of the DNA sequences
To apply DSP techniques to the DNA sequence to find nucleotide regions exhibiting period3 behaviour, the DNA sequence is first mapped onto the numerical sequence. The simplest conversion method maps four numerical sequences I_{A}[n], I_{T}[n], I_{C}[n] and I_{G}[n] from DNA sequences in binary format. In this mapping, the presence or absence of the respective nucleotides at the nth position is represented by '1' and '0', respectively. For example, given a section of DNA sequence ATCCGATATTC, the binary sequence of the nucleotide A, denoted I_{A}[n], is [10000101000]. The binary sequences for the other three nucleotides T, C and G are found similarly [13].
2. Applying FIR filter to the numerical sequences
After mapping the DNA sequence onto its binary numerical sequence, the binary sequence is passed through a Hamming window based FIR filter of order 8 with central frequency set to 2π/3, to emphasize period3 property in the exonic regions. Lack of distortions in FIR filters is one reason for their preferred use over IIR filters in medical applications [12].
3. Applying crosscorrelation theory to the numerical sequences
Most previous methods have used a window of fixed length to find the regions in DNA sequences exhibiting period3 property. In such methods, the window length directly affects the accuracy of the identification. Typically, an appropriate window length is considered to lie within the range 240351 (window lengths are multiples of three to reflect the codon structure). Short length windows increase noise, while long length windows tend to miss short exonic regions.
In our proposed method, the crosscorrelation between the numerical DNA sequence and an impulse train of periodicity 3 (N_{p }= 3) is calculated to identify regions in the DNA sequence with period3 behaviour. The length of the impulse train is set at 270. The impulse train signal length plays the same role as the window length in previous approaches. Following the crosscorrelation calculation, the impulse train slides across the numerical sequence by an amount j. The impulse train with periodicity of three considered as δ[n3k] and B_{A}[n], B_{T}[n], B_{C}[n] and B_{G}[n] are the FIR filter outputs for the input I_{A}[n], I_{T}[n], I_{C}[n] and I_{G}[n] sequences respectively. Then,
Different energy levels of the period3 components exist in binary sequences M_{A}, M_{T}, M_{C }and M_{G}. Thus, the output energy spectrum is the combination of the four separate outputs
In this energy spectrum, a peak corresponds to the presence of a period3 component on that region, implying that the region is exonic.
4. Decreasing the noise using discrete wavelet transform
Decreasing noise increases the accuracy of exonic region identification. As seen from equation (6), a small window size, required for the detection of small exons, will not diminish noise sufficiently. Hence we apply discrete wavelet transform (DWT) to decrease the noise in the output spectrum.
DWT has been used for denoising in various signal processing applications. In protein coding region detection, Haar wavelet has previously been employed for noise suppression [14]. Our proposed algorithm uses Dmey wavelet to remove noise and thereby increase the accuracy of the exonic region identification.
To this end, by downsampling the output of low pass and high pass filters, samples are divided into two signals; high frequency samples (detail signals) and low frequency samples (approximation signals), each embracing half the number of samples as the original signal. Figure 3 shows this procedure operating over three levels.
Figure 3. DWT decomposition. This figure shows the schematic of DWT decomposition at three levels. The low pass and high pass halfband filters are denoted g[n] and h[n] respectively.
The signal x[n] is passed first through the high pass filter, h[n], then through the low pass filter, g[n] [14,15].
Approximation and detail signals for the output power spectrum of the sequence F56F11.4 (GenBank access number AF099922) at positions 702115020 are shown in Figures 4b and 4c. By removing the detail signal and considering only the approximation signal, the extra frequencies are removed and the output power spectrum is smoothed. Therefore, the noise effect is decreased, while the accuracy of the identification is enhanced.
Figure 4. Applying DWT to the proposed algorithm. This figure shows the results of applying DWT to the proposed algorithm for the sequence F56F11.4. (a) The output power spectrum of the proposed algorithm before DWT is applied. (b) High frequency components of level 1 DWT decomposition (detail signal). (c) Low frequency components of level 1 DWT decomposition (approximation signal).
Datasets
Standard datasets are used to compare the efficacy of different algorithms at identifying exonic regions. Exon and intron positions in these databases are available and when DSP methods detect the position of exons, these positions are compared with real positions. The proposed algorithm is first applied to chromosome III of Caenorhabditis elegans [NCBI Reference Sequence: NC_003281.8], containing a total of 13783681 nucleotides with 8172 coding regions, and the results are compared with those of other popular methods. The results of the proposed algorithm for the sequence F56F11.4 of C. elegans (comprising 8,000 nucleotides) are separately presented. This sequence has five exonic regions at positions 9281039, 25282857, 41144377, 54655644, and 72557605. Also analysed in this paper are the BG570 [16] and HMR190 [17] datasets. BG570 is a genomic test dataset of 570 single gene vertebrate sequences prepared by Burset and Guigo [16]. HMR195 comprises 195 singlegene human, mouse, and rat sequences selected in 2001 by Rogic et al. [17] to test and evaluate the performance of gene structure prediction algorithms.
Evaluation Measures
To accurately compare different methods, the evaluation is performed at the nucleotide level. In the identification of exonic regions using DSP techniques, some parameters are defined by changing the threshold level in the output spectrum. Those parameters which make the comparison possible are defined in this section. In the identification step, the number of nucleotides correctly predicted as exons is denoted true positive (represented by TP), while the number of nucleotides correctly predicted as introns is denoted true negative (represented by TN). Similarly, the number of intron nucleotides predicted as exon nucleotide is the false positive (FP) value, while the number of exon nucleotides predicted as intron nucleotides is the false negative (FN) value. From these four defined quantities, the sensitivity and specificity parameters are determined as follows [16]:
The sensitivity S_{n }is the proportion of exon nucleotides that have been correctly predicted as exons, and the specificity S_{p }is the proportion of predicted exon nucleotides that actually exist in the exonic regions. These parameters alone are not suitable for evaluation because at high sensitivity, the specificity is low and vice versa. Therefore, another measure known as the approximate correlation (AC) has been defined. This parameter combines sensitivity and specificity as shown [16].
In applying DSP techniques to gene searching, other parameters have been described. A most popular evaluation measure is the Receiver Operating Characteristic (ROC) curve. By selecting different threshold levels, different values of TP for a given FP are calculated at each threshold and the ROC curve is constructed from the various TPs and their corresponding FPs. The area under the ROC curve (AUC) is used as an evaluation measure; the greater the AUC, the higher the accuracy of the gene finding algorithm [18]. Another means by which to compare identification accuracy between methods is the calculation of specificity for different sensibilities. Since the majority of genomes comprise intronic and intergenic regions, the calculation of FP can provide a useful comparison measure [19].
Threshold Selection Method
To discriminate between coding and noncoding regions, a threshold is imposed on the output power spectrum. The selection of a proper threshold can optimise the accuracy of the identification; however, the calculation of an optimum threshold value itself raises problems [20]. Therefore, in this paper, the sensitivity, specificity and approximate correlation measures are defined by changing the threshold level, to accurately compare different methods. In this section, we discuss implementation of the threshold selection.
To select an appropriate threshold, the method of Kwan et al. [21] is used. The mean and standard deviation of the period3 values determined from a training set of exon and intron sequences are used to calculate the threshold level T, defined as:
where meanP_{3e }and sdP_{3e }represent respectively the mean and standard deviation of the period3 values obtained from the exon sequences of a training set, and meanP_{3i }and sdP_{3i }represent respectively the mean and standard deviation of the period3 values obtained from the intron sequences of the same training set.
The 1000 multi exon genes from chromosome III of C. elegans provide data for training. The calculated threshold level is 61. This threshold level was applied to the F56F11.4 gene in chromosome III of C. elegans as shown in Figure 5.a. Clearly, at this threshold, all five regions are correctly identified as coding regions. However, there also exist small noncoding regions around position 2000 which are misidentified as coding regions. Since the characteristics of the DNA sequence can change significantly at different positions, even within the same dataset, a static threshold may yield incorrect identifications at some positions. Therefore, an adaptive threshold selection algorithm such as that described in [22] is required for exon prediction. In Tables (1), (2), (3), and (4) our proposed algorithm is compared with other algorithms over a range of thresholds.
Figure 5. The identification of exonic regions on the gene sequence F56F11.4. The results of exonic region identification on the sequence F56F11.4 (8,000 bp) are plotted for different methods. (a) Crosscorrelation (proposed), (b) AN filter, (c) TDP and (d) DFT methods. The shadowed regions are exonic regions that must be identified.
Results and Discussion
In this section, the results of the proposed algorithm are compared with those of established methods, namely, Average Magnitude Difference Function (AMDF), Time Domain Periodogram (TDP) [11], AntiNotch filter (AN filter) [9], Fourier Transform Spectrum (DFT) [8] and Asif [10]. As mentioned in the previous section, to evaluate and compare the results, measures such as the area under the ROC curve, the specificity and the number of false negatives in a particular sensitivity are computed. The approximate correlation measure for different threshold levels is also calculated. Our proposed algorithm is first applied to the gene sequence F56F11.4. In Figure 5a, the output power spectrum computed by equation (12) is displayed. The output power spectrum of the AN filter, TDP and DFT methods is shown in Figures 5.b, c and 5d respectively. Exonic regions that should be identified in this figure are marked as shaded regions. It should be noted that Figure 5.a is the output of equation (12) after denoising with DWT, as shown in Figure 6.b. The strongest feature of our proposed algorithm is the noise reduction. Not only is the noise reduced by increasing the window length, but the small length exonic regions can be identified. Unlike the established methods, the accuracy of identification does not decrease by changing the window length up to a specific value. Figure 7 shows the effect of changing window length on area under the ROC curve for the F56F11.4 sequence. According to this curve, the identification accuracy of our algorithm is fixed for window lengths ranging from 150 to 510 bp, whereas that of the other tested methods depends on window length. The window length varies according to gene length. The decreasing noise effect and magnification of the period3 component under FIR filtering causes the peaks to coincide with exon positions and enables detection of small exons, such as the first exon in F56F11.4 (shown in Figure 5.a).
Figure 6. Results of applying DWT to the proposed algorithm for the gene sequence F56F11.4. This figure shows the results of applying DWT to the proposed algorithm for the sequence F56F11.4. (a) The output power spectrum of the proposed algorithm before DWT is applied. (b) The output power spectrum of the proposed algorithm following DWT processing.
Figure 7. The effect of window length. Area under the ROC curves generated by different methods for the gene sequence F56F11.4 is plotted. In our algorithm, the accuracy of identification is fixed for window lengths between 150 and 510 bp, but changing the window length affects the identification accuracy of the other methods.
In Table 1, the approximate correlation and specificity for specified sensitivities are presented for our proposed method and for the other tested methods (using gene sequence F56F11.4). We observe that our algorithm yields the highest value of both parameters.
Table 1. Comparison of different methods using the sequence F56F11.4.
The proposed algorithm is then applied to chromosome III of Caenorhabditis elegans [NCBI Reference Sequence: NC_003281.8], comprising 13783681 nucleotides with 8172 coding regions, and the results are again compared with the outputs of other popular methods. Different evaluation measures for the proposed algorithm, AN filter and TDP methods are shown in Table 2. Clearly, the proposed method outperforms AN filter and TDP methods. It achieves a larger area under the ROC curve, fewer false positives and higher specificities and approximate correlation compared with AN filtering and TDP. By way of illustration, at a sensitivity of 20% the false positive output of our algorithm is 134 bp compared with 157 bp and 196 bp for AN filtering and TDP, respectively. In addition, our proposed method exhibits relative improvements of 3% and 5% respectively over AN filter and TDP methods in the approximate correlation measure.
Table 2. Evaluation of different methods using chromosome III of C. elegans
The proposed algorithm was finally applied to the HMR195 and the BG570 datasets. The output results are shown in Table 3. With regard to the HMR195 dataset, our algorithm outputs the least number of nucleotides incorrectly identified as exons. At a sensitivity of 30%, the number of false positives in the crosscorrelation method is improved by a factor of 1.24 relative to the nextbest performing method, Asif. Our proposed algorithm shows relative improvements of 21%, 8.3%, 24%, 18% and 4% over the DFT, AN filter, Asif, AMDF and TDP methods respectively, in terms of the area under the ROC curve. Similar superiority of our proposed algorithm is apparent for the BG570 dataset.
Table 3. Evaluation of different methods using HMR195 and BG570 genomic datasets.
Table 4 shows the AC measure of our proposed method in addition to the other tested methods. At a sensitivity of 80%, the AC measure for the proposed method is 40% in the BG570 database, while that of TDP (yielding the highest AC of the established methods) is 31%. Finally, from Figures 8 and 9, illustrating the ROC's of the proposed and other methods, it is obvious that the proposed method's area under curve in both datasets is the highest of all the tested methods. This implies that our proposed algorithm is superior to the other methods at identifying exonic gene regions.
Table 4. Approximate correlation measures for HMR195 and BG570 genomic datasets
Figure 8. ROC curves of different methods for the HMR195 genomic dataset. The ROC curves of different methods are plotted for the HMR195 dataset.
Figure 9. ROC curves of different methods for the BG570 genomic dataset. The ROC curves of different methods are plotted for the BG570 dataset.
Conclusions
This paper presents a new algorithm based on crosscorrelation theory, designed to increase the accuracy of exonic region identification. The FIR filter makes it easier to identify the exonic regions. The main advantage of the proposed method is its reduced dependency on the window length as a result of the decreasing noise effect. The ability to detect small exonic regions is another advantage of this algorithm. The final step of the algorithm utilizes the discrete wavelet transform to reduce noise. Compared with established time and frequency domain methods, the proposed algorithm yields improvements ranging from 4% to 41% in terms of the area under the ROC curve for the HMR195 and BG570 datasets. Our proposed method also minimises the number of nucleotides incorrectly predicted as exonic. This decrease in the number of false positives is responsible for the increase in specificity; for example, at a sensitivity of 30%, our proposed algorithm yielded 15% to 85% improvement in specificity over other tested methods. As can be seen from Tables 3 and 4, our algorithm confers significant improvement on the accuracy of exonic region identification.
Authors' contributions
This work is carried out under the close guidance of AR and GK, who conceived of the study, and who participated in its design and coordination. OA implemented the method. The manuscript was written by OA and edited by AR and GK. All authors have read and approved the final manuscript.
Acknowledgements
The first author, Mr. Omid Abbasi would like to thank Mr. Omid Omrani, an expert in genetic laboratory techniques in the Department of Natural Sciences, University of Tabriz, Tabriz, Iran.
References

Fickett JW: Recognition of protein coding regions in DNA sequences.
Nucl Acids Res 1982, 10:53035318. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Trifonov E: Elucidating sequence codes: three codes for evolution.
Ann NY Acad Sci 1999, 870:330338. PubMed Abstract  Publisher Full Text

Eskesen ST, Eskesen FN, Kinghom B, Ruvinsky A: Periodicity of DNA in exons.

Chang CQ, Fung PCW, Hung YS: Improved Gene Prediction by Resamplingbased Spectral Analysis of DNA Sequence. In Proceedings of the 5th International Conference on Information Technology and Application in Biomedicine: 3031 May 2008. Shenzhen, China; 2008.

Salamov AA, Solovyev VV: Ab initio gene finding in Drosophila genomic DNA.
Genome Res 2000, 10:516522. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

BernaolaGalvan P, Grosse I, Carpena P, Oliver JL, RomanRoldan R, Stanley HE: Finding borders between coding and noncoding DNA regions by an entropic segmentation method.
Phys Rev Letters 2000, 85(6):13421345. Publisher Full Text

Nicorici D, Astola J: Segmentation of DNA into coding and noncoding regions based on recursive entropic segmentation and stopcodon statistics.
Journal of Applied Signal Processing, Special issue in Genomic Signal Processing 2004, 1(1):8191.

Tiwari S, Ramachandran S, Bhattacharya A, Bhattacharya S, Ramaswamy R: Prediction of probable genes by Fourier analysis of genomic sequences.
Comput Appl Biosci 1997, 13:263270. PubMed Abstract

Vaidyanathan PP, Yoon BJ: The role of signalprocessing concepts in genomics andproteomics.
Journal of the Franklin Institute 2004, 341:111135. Publisher Full Text

Datta S, Asif A: A Fast DFTBased Gene Prediction Algorithm for Identification of Protein Coding Regions.
Proceedings of the 30th International Conference on Acoustics, Speech, and Signal Processing 2005.

Akhtar M, Epps J, Ambikairajah E: Signal Processing in Sequence Analysis: Advances in Eukaryotic Gene Prediction.
IEEE journal of selected topics in signal processing 2008, 2:310321.

Ifeachor E, Jervis B: Digital Signal Processing: A Practical Approach. Prentice Hall Press; 2002.

Voss RF: Evolution of longrange fractal correlations and 1/f noise in DNA base sequences.
Physical Review Letters 1992, 68:38053808. PubMed Abstract  Publisher Full Text

George TP, Thomas T: Discrete wavelet transform denoising in eukaryotic gene splicing.
BMC Bioinformatics 2010, 11:S50. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Weeks M: Digital Signal Processing Using MATLAB and Wavelets. Infinity Science Press LLC; 2007.

Burset M, Guigo R: Evaluation of gene structure prediction programs.
Genomics 1996, 34:353367. PubMed Abstract  Publisher Full Text

Rogic S, Mackworth AK, Ouellette BF: Evaluation of genefindin programs on mammalian sequences.
Genome 2001, 11:817832. Publisher Full Text

Akhtar M, Ambikairajah E, Epps J: Detection of period3 behavior in genomic sequences using singular value decomposition.
Proceedings of the International Conference on Emerging Technologies: 1718 September 2005; Islamabad, Pakistan 2005.

Burge C: Modeling dependencies in premRNA splicing signals in Computational Methods in Molecular Biology.
In Elsevier Sciences Edited by Salzberg SL, Searls DB, Kasif S. 1998, 129164.

Akhtar M, Ambikairajah E, Epps J: GMMbased classification of genomic sequences.
IEEE 15th International Conference on Digital Signal Processing (Cardiff, UK) 2007.

Kwan JYY, Kwan BYM, Kwan HK: Spectral analysis of numerical exon and intron sequences. Proceedings of IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2010) Workshops, Hong Kong; 2010.

Agrawal A, Mittal A, Jain R, Takkar R: An adaptive fuzzy thresholding algorithm for exon prediction. In IEEE International Conference on Electro/Information Technology, 2008. EIT; 2008.