Email updates

Keep up to date with the latest news and content from BMC Bioinformatics and BioMed Central.

Open Access Methodology article

Computational evaluation of TIS annotation for prokaryotic genomes

Gang-Qing Hu12, Xiaobin Zheng12, Li-Ning Ju1, Huaiqiu Zhu12 and Zhen-Su She123*

  • * Corresponding author: Zhen-Su She she@pku.edu.cn

  • † Equal contributors

Author Affiliations

1 State Key Lab for Turbulence and Complex System and Department of Biomedical Engineering, College of Engineering, Peking University, Beijing 100871, China

2 Center for Theoretical Biology, Peking University, Beijing 100871, China

3 Department of Mathematics, University of California, Los Angeles, Los Angeles, CA 90095, USA

For all author emails, please log on.

BMC Bioinformatics 2008, 9:160  doi:10.1186/1471-2105-9-160

The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/9/160


Received:12 November 2007
Accepted:25 March 2008
Published:25 March 2008

© 2008 Hu et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Accurate annotation of translation initiation sites (TISs) is essential for understanding the translation initiation mechanism. However, the reliability of TIS annotation in widely used databases such as RefSeq is uncertain due to the lack of experimental benchmarks.

Results

Based on a homogeneity assumption that gene translation-related signals are uniformly distributed across a genome, we have established a computational method for a large-scale quantitative assessment of the reliability of TIS annotations for any prokaryotic genome. The method consists of modeling a positional weight matrix (PWM) of aligned sequences around predicted TISs in terms of a linear combination of three elementary PWMs, one for true TIS and the two others for false TISs. The three elementary PWMs are obtained using a reference set with highly reliable TIS predictions. A generalized least square estimator determines the weighting of the true TIS in the observed PWM, from which the accuracy of the prediction is derived. The validity of the method and the extent of the limitation of the assumptions are explicitly addressed by testing on experimentally verified TISs with variable accuracy of the reference sets. The method is applied to estimate the accuracy of TIS annotations that are provided on public databases such as RefSeq and ProTISA and by programs such as EasyGene, GeneMarkS, Glimmer 3 and TiCo. It is shown that RefSeq's TIS prediction is significantly less accurate than two recent predictors, Tico and ProTISA. With convincing proofs, we show two general preferential biases in the RefSeq annotation, i.e. over-annotating the longest open reading frame (LORF) and under-annotating ATG start codon. Finally, we have established a new TIS database, SupTISA, based on the best prediction of all the predictors; SupTISA has achieved an average accuracy of 92% over all 532 complete genomes.

Conclusion

Large-scale computational evaluation of TIS annotation has been achieved. A new TIS database much better than RefSeq has been constructed, and it provides a valuable resource for further TIS studies.

Background

To initiate translation in prokaryote, a ribosome binds to a specific region of mRNA and then recognizes a nearby start codon. The position of the first nucleotide base pair (bp) in the start codon is denoted by translation initiation site (TIS). The sequence upstream to the TIS, the start codon itself and the sequence downstream to the TIS show specific patterns which differ from genome to genome. The sequence at about 20 bps upstream to the TIS in most prokaryotic genes contains primarily purine rich Shine-Dalgarno sequence [1]. However, increasing numbers of genes with missing Shine-Dalgarno sequences, known as leaderless genes if they also lack a 5'-untranslated region, have been reported in archaeal genomes [2]. Genome-wide computational analysis on leaderless genes revealed A/T rich sequences in a region at about 30 bps further upstream [3]. The start codon in most cases shows a strong preference to the ATG triplet than to others such as TTG and GTG [4]. Sequences downstream to the TIS exhibit a periodicity of three in the codon usage. Comparative genomic studies show that the sequence patterns around the true TIS might differ significantly between genomes. With the aid of a sequence logo tool, Torarinsson et al. [3] and Zhu et al. [5] reported the variation of sequence patterns among dozens of archaeal genomes, which shed light on the understanding of the divergence of translation initiation mechanisms in prokaryote.

Knowledge of exact TIS is essential for conducting experiments involving the identification of natively purified proteins by N-terminal amino acid sequencing as well as heterologous protein production [6]. However, there are increasing concerns on the TIS annotation quality in widely used databases such as GenBank and RefSeq [5-9]. Earlier completed microbial genome projects tend to annotate the 5'-most candidate start which is in frame to the stop codon [7]. On the other hand, Poole et al. [6] has observed a strong discrepancy of TIS annotation between databases CMR and RefSeq on several genomes. Despite manual corrections and periodic updates, the quality of the current TIS annotations is still largely uncertain, and it is intriguing to develop an independent method for assessing the TIS annotation reliability. Such method, if successful, may also be helpful to provide hints for further improvement. The need for developing such method is becoming more urgent for the database such as RefSeq is so widely used by experimental biologists that errors in the annotation might have big impact.

Several attempts have been made to assess the reliability of TIS annotation. Nielsen and Krogh [8] were the first to make a serious large-scale assessment of the reliability of the TIS annotation in RefSeq, but their approach that takes EasyGene 1.2 as the "gold standard" for comparison is questionable. As we will see later, EasyGene's own accuracy is not outstanding, hence the biased assessment is of limited interest. Frishman et al. [10], using the Orpheus program, show that the information content of aligned TIS upstream sequences correlates with the TIS prediction accuracy. Zhu, et al. [5] made a qualitative assessment of the relative TIS annotation quality for two TIS predictors, by comparing the sequence logo [11] of aligned TIS upstream sequences. In this assessment, the sequence logo around the aligned TISs of a consensus set predicted by both predictors (called consensus logo) is considered to be reliable, and hence the difference to the sequence logo of the aligned TISs of a 'specific' set predicted by only one program (called specific logo) would indicate qualitatively the TIS accuracy of that program. Taking S. solfataricus as an example, Zhu, et al. [5] showed that the specific sequence logo of MED 2.0 is very similar to the consensus logo obtained jointly with GenBank annotation, but the specific logo of the GenBank shows almost no sequence pattern. This result suggests that the GenBank TIS annotation in S. solfataricus is lower than MED 2.0. Generally speaking, there exists no systematic method to computationally evaluate the accuracy of TIS prediction.

We propose here a computational method to quantitatively estimate the TIS annotation accuracy of a prokaryotic genome; the annotation can be provided by either a program or a database. The method is based on a homogeneity assumption that the sequence patterns represented by a PWM around TISs are homogenous for a generic subset of genes of a genome. The whole set of TIS predictions are split into two sets; set <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M1">View MathML</a> is called reference set and is so constructed to be nearly 100% accurate (see section "Reference set") and set <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M2">View MathML</a> has only partially accurate prediction which are to be quantitatively evaluated. We assume that the set <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M1">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M2">View MathML</a> are generic subsets; this assumption is diffcult to prove, but is sound as a first approximation. It is then assumed that the PWM around predicted TISs in the set <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M2">View MathML</a> can be modelled as a linear combination of three elementary PWMs, one around true TIS and the others two around false TISs which are located upstream and downstream to the true TIS, respectively. All the three elementary PWMs are obtained from the sequence patterns of the reference set <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M1">View MathML</a>, which carries naturally genome-specific features. A generalized least square estimator then determines the weighting of each of the three PWMs, and the weighting of the true TIS naturally determines the accuracy of the TIS annotation in the set <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M2">View MathML</a>. Hence, the prediction accuracy over the entire genome, <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M1">View MathML</a><a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M2">View MathML</a>, is derived.

The validity of the method is established with tests on experimentally verified TISs set EcoGene [12]. Then, the method is applied to estimate the TIS annotation accuracy of 532 genomes on the public databases and publicly available programs such as RefSeq [13], ProTISA [14], EasyGene [8,15], GeneMarkS [7], Glimmer 3 [16] and TiCo [17]. Finally, this analysis has led to a construction of a new TIS database, SupTISA, which is much better than RefSeq on TIS annotations.

Methods

Basic definitions

Let us first introduce several definitions:

• a blackboard bold symbol <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M3">View MathML</a> denotes a set of genes with specified STOP and TIS;

• the sample size of <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M3">View MathML</a> is denoted by <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M4">View MathML</a>;

• the symbol <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M5">View MathML</a> denotes the accuracy of <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M3">View MathML</a>;

• the symbol <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M6">View MathML</a> denotes the set of annotation;

• the symbol <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M1">View MathML</a> denotes the reference set whose TISs are supposed to be 100% accurate, and the symbol <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M2">View MathML</a> denotes its compliment: <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M7">View MathML</a>;

• the symbol <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M8">View MathML</a> denotes a subset of <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M2">View MathML</a> which has correct TIS annotation, and the symbol F denotes its compliment: <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M9">View MathML</a>. Thus the annotation accuracy of <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M2">View MathML</a> can be expressed as <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M10">View MathML</a>. Furthermore, the overall annotation accuracy is given by

<a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M11">View MathML</a>

(1)

Elementary patterns expressed with PWMs

The main task of this work is to estimate <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M12">View MathML</a>. The tool for this evaluation is the PWM of aligned sequences around TIS. We choose l bps upstream and r bps downstream of start codons (in this paper l = 50 and r = 15) to form a window of width l + r. The PWM for the set <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M3">View MathML</a> is denoted by <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M13">View MathML</a>; concretely, the frequency of nucleotide b at an aligned position j is denoted by Wj (b), where b = 1 denotes adenine (A), b = 2 denotes cytosine (C), and so forth.

Three elementary PWMs will be relevant to our analysis, and correspond to three types of TISs in the annotation. The first is true TIS, and the corresponding PWM is denoted by <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M14">View MathML</a>. The second and third are two types of false TIS, whose PWMs will be denoted by <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M15">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M16">View MathML</a> indicating the false TIS located either upstream or downstream of true TIS, respectively. Note that the overall PWM is, by definition, a linear combination of PWMs of sub-patterns, and this linearity has a consequence that any number of sub-patterns around true TISs can always be combined to be a single <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M14">View MathML</a>, and this is also a valid statement for <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M15">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M16">View MathML</a> As long as the distribution of sub-patterns are uniform for the set <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M1">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M2">View MathML</a> (which is our homogeneity assumption, see below), it is justified to use the three elementary PWMs to represent an actual observed PWM such as <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M17">View MathML</a>.

The difference between the three types of PWMs are biologically clear. <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M14">View MathML</a> contains regulatory signals such as the SD sequence, which are required by the translation initiation machinery. Evolution must conserve such pattern. On the other hand, <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M15">View MathML</a> characterizes sequences exposed to neutral evolution and hence is generally feature-less. Finally, a false TIS located downstream to the true TIS is surrounded by coding sequences and <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M16">View MathML</a> exhibits period three oscillations. In Figure 1, we show the three patterns, obtained by our study, for three different organisms. The features discussed above are generally present.

thumbnailFigure 1. Three elementary sequence patterns around TISs. We selected three genomes with widely different genomic GC content to illustrate the content of three elementary PWMs associated with different TISs. Three genomes are from top to bottom: Sulfolobus solfataricus P2 (35.8%), Escherichia coli K12 (50.8%), and Pseudomonas aeruginosa PAO1 (66.6%). Logos from the left to right correspond to <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M16">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M14">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M15">View MathML</a> respectively. Data are obtained from the reference set and logos are generated by [11].

An annotation of finite accuracy will give rise to a PWM which is a linear combination of the above three PWMs. Specifically, for the set <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M2">View MathML</a>, we write:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M18">View MathML</a>

(2)

where <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M19">View MathML</a>. We will develop a least square estimator to determine the three coeffcients α's from the above four observed PWMs, the first coming from the set <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M2">View MathML</a> and the last three from the set <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M1">View MathML</a>.

Reference set

The three elementary PWMs are obtained from the reference set, which is very important in this evaluation. The reference set needs to be as reliable as possible, and should not be biased towards any database/predictor to be evaluated. We have chosen to use the six most recent TIS databases/predictors, namely, RefSeq [13], ProTISA [14], EasyGene [8,15], GeneMarkS [7], Glimmer 3 [16] and TiCo [17], to derive the reference set. For any genome, we obtained the reference TIS set by intersecting the annotations of all six databases/predictors; in order to reduce false positives, genes less than 600 bps are excluded [18]. Among the six annotations, EasyGene, GeneMarkS, Glimmer 3 and TiCo achieve significant improvements on TIS prediction [7,15-17], and ProTISA is compiled to contain more than 390, 000 confirmed TISs with collected evidence from experiments, literatures, conserved domain search, and sequence alignment between orthologous genes [14]. Today, we can get the intersecting of all 532 genomes found on GenBank from all the databases/predictors except EasyGene, the later only provides annotations of 157 genomes. We will use only five of the six annotations to get the reference set for the remaining 381 genomes. These reference sets represent the best TIS predictions so far achieved, which do not cover all genes (41.5 ± 9.5% in RefSeq) but presumably very accurate. Our present work provide an evaluation for the rest of the predictions, i.e. that of the set <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M2">View MathML</a>.

The procedure to obtain three PWMs from the reference set is as follows. Since the true TISs are known, the aligned sequences around the true TISs directly give rise to <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M20">View MathML</a> (an estimate of <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M14">View MathML</a>, see later). Similarly, one obtains <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M21">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M22">View MathML</a> by aligning sequences around a randomly chosen false TIS upstream or downstream to the true TIS. Note that <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M21">View MathML</a> contains the least feature among the three,. Note also that in this procedure, all three PWMs have the sample size: <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M23">View MathML</a>, so there will be finite size effects to be accounted for in the analysis below.

Homogeneity assumption

Finally, let us discuss the limitation of the homogeneity assumption. The sequence pattern encompasses regulatory signals which are important to the translation of genes. The homogeneity property is based on the idea that the translation mechanism is largely universal across a genome. There may be several translation mechanisms acting on a genome [2,3,5,14]; in this case, the homogeneity assumption requires that the proportions of the sub-patterns remain the same for different subsets of our interest, namely the set <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M1">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M2">View MathML</a>. To prove this is a diffcult problem, and we do not intend to accomplish it in this work. This is because that <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M2">View MathML</a> might contain TISs with different statistical properties from the reference set, which makes them harder be annotated correctly. However, the validity of our evaluation depends on how large is its effect. A deviation from the homogeneity is similar to the effect of finite accuracy for the set <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M1">View MathML</a>, which is easier to study. The testing results (see section "Testing") show that imperfection or bias in the set <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M1">View MathML</a> yields definite but small modification of the evaluated accuracy. Therefore, it is reasonable to conclude that the homogeneity assumption is sound to leading order and the results of our evaluation are believable.

Algorithm

Let <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M24">View MathML</a> be an estimate of <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M13">View MathML</a>. Because we are disposed with a finite set of samples, Eq. 2 becomes

<a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M25','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M25">View MathML</a>

(3)

where ε depends on both <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M26">View MathML</a> (for <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M17">View MathML</a>) and <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M23">View MathML</a> (for the three elementary W's). Furthermore, to eliminate redundancy from data, it is wise to make a Z-transformation [19] from the matrix W of (l + r) × 4 dimensions to a matrix V of (l + r) × 3 dimensions:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M27','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M27">View MathML</a>

(4)

where j = 1, 2,..., l+r. Consequently, we rewrite Eq. 3 as

<a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M28','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M28">View MathML</a>

(5)

The nucleotide frequencies at different positions in all the PWMs are assumed to be independent [20]. The assumption is widely applied in gene-finders [5,7,16], and deviations are expected to be small based on results presented in the "testing" section.

Together with the homogeneity assumption, we show that E(ε') = 0 and

<a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M29','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M29">View MathML</a>

(6)

where <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M3">View MathML</a> takes <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M8">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M30">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M31">View MathML</a>, respectively, and <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M32','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M32">View MathML</a> is a 3(l+r) × 3(l+r) covariance matrix calculated on the set <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M3">View MathML</a> whose components are inferred from <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M24">View MathML</a> (see Additional File 1).

Additional file 1. Supplementary details of the method. Details for deducing Eq. 6 and minimizing the sum of squared errors in Eq. 7.

Format: PDF Size: 362KB Download file

This file can be viewed with: Adobe Acrobat ReaderOpen Data

The estimation of <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M33','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M33">View MathML</a>s in Eq. 6 can be done using a generalized least square, namely by minimizing the following weighted sum of squared errors t (see Additional File 1):

<a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M34">View MathML</a>

(7)

where α denotes the vector (<a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M35','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M35">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M36','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M36">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M37','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M37">View MathML</a>)T and Σ ' denotes Var(ε') for simplification. Because of Eq. 6, Σ' has a complicate dependence on α, and we need to solve a nonlinear optimization problem. This is done by an iterative procedure, with an initial α to evaluate Σ' which is substituted into a group of linear equations of optimization (the first-order partial derivative of α equals to zero) to calculate new α. The new α is then used to update Σ', and the calculation repeats until α converges. As explained in details in Additional File 1, the calculations converge quickly to correct values. Throughout our tests, we did not encounter any instability for this calculation.

Throughout the calculation, we face a question of how reliable the estimates of <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M14">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M16">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M15">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M17">View MathML</a> are, given the finite sample of gene sequences used for the evaluation. This problem is addressed by adopting a bootstrapping strategy for finding a confidence interval (CI) of <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M35','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M35">View MathML</a>. The calculation is repeated 200 times; each time, we randomly select, with replacement, a sample of TISs from the reference set of size <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M23">View MathML</a> to calculate three elementary <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M43','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M43">View MathML</a>s and a sample of TISs from the set <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M2">View MathML</a> of size <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M26">View MathML</a> to calculate <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M44','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M44">View MathML</a>, and perform the optimization calculation described above. This calculation is carried out during the testing and every assessment. The consistency of the estimate is then judged by the uncertainty interval of the output accuracy.

Results

Testing

The experimentally confirmed TISs in EcoGene [12], denoted as EcoGene854, allows us to design a procedure to test the reliability of our method. The procedure goes as follows. First, randomly divide genes in EcoGene854 into two equal-size-set <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M2">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M1">View MathML</a>, and calculate the three elementary PWMs from the set <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M1">View MathML</a>, as explained above. Then, we create a series of partially accurate <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M2">View MathML</a> with accuracy α from 40% to 90% at a step of 10% by replacing 100(1 -α)% of the true TISs by randomly choosing false TISs. The aligned sequences with the newly assigned TISs of the set <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M2">View MathML</a> gives rise to <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M17">View MathML</a>, which is a simulated real annotation PWM of finite accuracy. The generalized least square calculation determines the estimated accuracy, <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M38">View MathML</a>. For each α, we repeat the generation of the set <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M1">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M2">View MathML</a> (200 times) and obtain a distribution of <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M38">View MathML</a> from which the average and standard deviation of <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M38">View MathML</a> can be derived.

In Figure 2, we plot the average estimate and the standard deviation of <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M38">View MathML</a> as a function of true accuracy α. When the reference set is 100% accurate, <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M38">View MathML</a> fluctuates around α with ± 2.6%. We have also found that the estimate <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M38">View MathML</a> is unbiased.

thumbnailFigure 2. Comparison of estimated versus exact accuracy at three reference sets of different accuracy.

An intriguing question is what happens if the reference set is not 100% accurate. This can be easily checked by carrying out a series of tests with varying accuracy of <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M1">View MathML</a> (by randomly replacing a portion of true TISs by false ones): <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M39">View MathML</a> = 90%, 95% and 100%. The results are also shown in Figure 2. Generally speaking, an over-estimation of the accuracy is obtained. This is readily understood because when <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M39">View MathML</a> < 100%, <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M14">View MathML</a> contains contribution from false TIS, and hence a bias is generated in favor of false annotation, and the estimated accuracy is higher. Specifically, we found that the estimated accuracy is about <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M40','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M40">View MathML</a> if the reference set has an accuracy of <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M39">View MathML</a>. This effect is consistent in both <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M1">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M2">View MathML</a>. As for the set <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M1">View MathML</a>, the real accuracy is <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M39">View MathML</a> while the estimated accuracy is <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M41','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M41">View MathML</a> = 100%. So, the estimation is inversely proportional to the accuracy of the reference set. This dependence on <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M39">View MathML</a> is very helpful to keep in mind when one interprets an actual assessment.

The above designed tests provide a unique opportunity to test if a bootstrapping strategy offers any knowledge about the uncertainty of the estimate. We carried out a bootstrapping calculation for the runs with <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M12">View MathML</a> = 60% and <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M39">View MathML</a> = 100%, the widths of the obtained 95% CIs are shown in Figure 3 as a function of <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M23">View MathML</a> (by taking only a subset of genes from the set <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M1">View MathML</a>). This dependence has an advantage to be compared to real assessment calculation. As shown, the width of the 95% CI follows approximately a power law dependence on <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M23">View MathML</a>, which is a result of the nonlinear optimization. At the largest set size of <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M1">View MathML</a>, the width is around 13%, which is about 30% wider than that derived from the actual distribution of <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M38">View MathML</a> (which is around 10.0% for an approximate normal distribution with standard deviation 2.6%, as above). In other words, the bootstrapping calculation over-estimates the scattering of estimated <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M35','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M35">View MathML</a>, and hence it provides a good and conservative measure of the reliability of <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M38">View MathML</a>. When we extended the 95% CI results for the testing to those in real assessment with the actual <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M23">View MathML</a> in E. coli, we find that they agree remarkably well. This confirms the validity of the bootstrapping calculation. Generally speaking, Figure 3 shows that, for typical genomes with <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M23">View MathML</a> ~1000 – 2000, the obtained assessment accuracy would have a width of 95% CI of 5% to 8% (equivalently ± 1% to ± 2% in standard deviation). This is rather a satisfactory outcome.

thumbnailFigure 3. Width of 95% CI of <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M42','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M42">View MathML</a> as a function of the size of the reference set. The 95% CIs of <a onClick="popup('http://www.biomedcentral.com/1471-2105/9/160/mathml/M42','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/9/160/mathml/M42">View MathML</a> are calculated for the testing set (triangles) and for the RefSeq annotation assessment (circles for 532 genomes and blue full circle for E. coli K12). The CIs are derived from 200 bootstrapping calculations in each case.

Applications

Assessing RefSeq annotation

RefSeq is the most widely used public database on TIS, and its accuracy is the most concerned matter of this study. We have conducted an overall assessment on the TIS annotation for RefSeq. A total of 532 genomes are assessed. The annotation accuracy varies widely from 3.3% in A. baumannii ATCC 17978 to 96.8% in P. pentosaceus ATCC 25745 with an average of 80.6 ± 9.9%. About 40% of the genomes have accuracies higher than 85.0%, including genomes from several well studied genera such as Bacillus, Escherichia, Salmonella and Pseudomonas. In contrast, 13.5% of the genomes, most of which are GC-rich, have very suspicious TIS annotations with accuracies lower than 70%. A complete list of estimated accuracies for the 532 genomes is available in Additional File 2.

Additional file 2. Estimated TIS annotation accuracies of six selected databases/predictors. Accuracies of TIS annotation on a total of 532 genomes for RefSeq, Glimmer 3, GeneMarkS, EasyGene, TiCo and ProTISA.

Format: XLS Size: 173KB Download file

This file can be viewed with: Microsoft Excel ViewerOpen Data

Below, we examined two annotation preferences that potentially contribute to the RefSeq annotation quality, namely tendencies to over-annotate LORF and to under-annotate ATG start codon.

As reported previously [7,8], RefSeq tends to over-annotate LORF. If the TIS annotation takes the rule of LORFs (i.e., always taking the 5'-most start codon), then its TIS accuracy would equal to the percentage of LORF in all true TISs (which will be referred below to as the percentage of true LORF). Our method can define a way to estimate this percentage of true LORF. For a genome for which we can generate a reliable reference set, then we can generate an artificial annotation by adopting the LORF rule. The final estimated accuracy of this artificial annotation is the percentage of true LORF. This method is applied to Y. pestis, and the estimated percentage of true LORF is 63.7%. The calculation of the actual percentage of LORF in the RefSeq annotation for Y. pestis is 92.6%. We then judge that there is about 30% over-annotation of LORF in this genome. This study is carried out for a total of 532 genomes, and the results are shown in Figure 4 where we found an average of 7.6 ± 9.1% over-annotated of LORFs in RefSeq.

Another preference is the under-annotation of ATG start codon, for which we have now developed some statistical measures to provide further quantitative evidence. We have conducted calculation within genus, a taxonomic category ranking below family but above species. It is reasonable to expect that the TISs of species from the same genus show little difference in statistic such as the start codon usage. A total of 29 genera containing at least five selected genomes are studied, and the Escherichia genus is chosen to present our results; reported observations hold on most of the other genera (see Additional File 3). As shown in Figure 5, the percentage of annotated ATG start in the CFT073 strain is about 70%, whereas this percentage in the well-studied K12 strain reaches a much higher value of 90%. Note that the percentage calculated from the confirmed EcoGen854 data set is about 91%. Our estimated accuracy of TIS annotation for CFT073 strain is below 70%, significantly lower than the K12 strain (about 94%). Figure 5 shows a clear linear correlation between the ATG start codon usage and the accuracy for all strains in the Escherichia genus.

Additional file 3. Correlation between annotation accuracy and ATG start codon usage. A total of 29 genera were selected. The linear fit was applied if the Pearson Correlation is significant at 95% confidence.

Format: PDF Size: 116KB Download file

This file can be viewed with: Adobe Acrobat ReaderOpen Data

thumbnailFigure 4. Estimate of the excess of LORF from RefSeq. The difference between annotated LORF and the true LORF estimated in our work shows the degree of over-annotation of LORF in RefSeq. The dot line shows the average.

thumbnailFigure 5. Correlation between RefSeq annotation accuracy and its ATG start codon usage for E. coli.

Assessing other TIS annotations

Since our reference set is constructed with the intersection of all relevant TIS databases/predictors, it is not biased towards any one, and hence we can carry out the analysis of accuracy for all of the predictors for the 532 genomes. This subsection is devoted to a discussion of their performances. We chose RefSeq as a standard of accuracy comparison for presenting the results. To reduce false positives, genes not annotated by RefSeq and genes with length short than 300 bps were excluded, as implied in [8,18]. Figure 6 shows the accuracy difference of the five other TIS predictors to RefSeq predictions.

thumbnailFigure 6. Annotation accuracy comparison between RefSeq and six other annotations. The dot line shows the average.

As two of the most popular gene-finders, Glimmer and GeneMark have been used to annotate hundreds of genomes. The most recent versions, Glimmer 3 and GeneMarkS, include a RBS model to predict TISs, which is in a form of PWM whose parameters are derived by a Gibbs sampler. High performances are reported on two well-studied genomes E. coli K12 and B. subtilis [7,16]. When assessed here on the 532 genomes by our method, Glimmer 3 and GeneMarkS report average accuracies of 83.6 ± 7.4% and 85.0 ± 7.3%, respectively, which are higher than RefSeq (80.6 ± 9.9%). However, for a considerable numbers of genomes, Glimmer 3 and GeneMarkS have made limited or no improvement over RefSeq (see Figure 6).

EasyGene has only published 157 genomes [21] and is believed to be a better TIS resource than RefSeq [8]. Indeed, our assessment confirmes that, for its reported set of genomes, EasyGene's annotation has a noticeably high average accuracy of 86.7 ± 6.3%, which is about 10% higher than the RefSeq prediction for the same set (see Figure 6). Note that EasyGene is reported to make conservative choices in gene prediction [8,15], and it excludes often 5–10% genes of RefSeq, some of which have function annotation.

Unlike gene finders, TiCo is a post-processor of an existent annotation. High performance was reported on E. coli K12 and B. subtilis, as well as on GC rich genomes such as P. aeruginosa PAO1 [17]. As shown in Figure 6, the improvement on RefSeq is indeed remarkable, with an average improvement on accuracy of 10% over all 532 genomes. Note, however, that the accuracy improvement is relatively lower in GC-rich and AT-rich genomes (in the wing part of Figure 6).

ProTISA is a recently published database dedicated to TIS annotation in prokaryotic genomes. It is generated by collecting various confirmed TISs and predictions from MED-Start (upgraded), which post-processes the RefSeq annotation [14,22,23]. The assessment carried out here indicates that, over 532 genomes, the ProTISA has a mean accuracy of 90.5%, which is 9.9% higher than RefSeq (see Figure 6). As a more rigorous comparison, we applied the paired-samples t-test to judge if ProTISA gives a significantly higher accuracy than RefSeq, and obtained a t-value of 31.2, much larger than 1.7 at 95% confidence and for degrees of 531. Thus, a definite positive answer is derived. Besides, there are 101 genomes for which ProTISA's prediction accuracy is higher by 15% than RefSeq; for these genomes, the RefSeq shows a clear preference of over-annotating LORF and under-annotating ATG start codon (data not shown). In addition to accurate TIS annotation, ProTISA annotates potential regulatory signals, which are helpful in investigating the diversity of translation initiation mechanism. For example, besides SD signal, Pribnow box is found at 10 bps upstream to TIS in many bacteria genomes, suggesting that leaderless gene may not be rare in bacteria [14].

A new TIS database: SupTISA

The method of evaluation proposed in this paper is based on a fundamentally different principle, the principle of homogeneity for the PWMs of any subset of genome as a linear combination of three elementary PWMs. This principle is based on the universal process of gene translation, and it is a macroscopic property for the ensemble of TISs. This information is supplementary to the properties that are used by TIS predictors, and hence can (and should) be used to provides a complementary way for achieving the global annotation performance. In other words, we propose to construct a new TIS annotation database by selecting the best TIS predictor's annotation for any given genome; the resulting annotations organize a new database (of 532 genomes at present) and is named SupTISA [24]. This is possible because the assessment is totally independent and unbiased.

Specifically, for each genome, SupTISA selects the one of RefSeq, ProTISA, EasyGene, GeneMarkS, Glimmer 3 and TiCo with the highest accuracy as the SupTISA annotation and provides its downloading at the web address [24] for TIS annotations. SupTISA takes advantage of the complementary quality of all the existing TIS predictors. For instance, ProTISA are generally better than TiCo on genomes with biased GC content, but TiCo outperforms ProTISA on others (data not shown). The result is that SupTISA achieves an average annotation accuracy of 92.1 ± 4.7%. Figure 6 shows that SupTISA generally over-performs RefSeq by 5% to 30%. Therefore, SupTISA provides the best resource for experimental use and for computational study related to TIS.

Discussions and Conclusion

Translation is a fundamental process for an organism, and the regulatory signals relevant to this process should have relatively uniform distribution across a genome. A PWM of aligned sequences around TIS summarizes the statistical information of the signal, and is then a tool to use for study how much, in a given set of annotation, the true signal has contributed. This is the principle we use for inventing, for the first time, an algorithm for large-scale evaluation of TIS's prediction accuracy. The work done on the testing with confirmed genes and on assessing six databases/predictors over 532 genomes give rise to a series of consistent results. Although the actual accuracy results may be subject to a few percents of uncertainty, due to statistical fluctuations of finite sample sizes and possible distortion of the reference sets, the assessments seem to be a valid leading order measure of the TIS annotations. Such assessment is meaningful, especially when the estimated accuracy is low: typically, some unjustified or simplified assumptions are used during the annotation. Our assessment then provides a tool for experimental or computational biologists to avoid to be mis-led by an over-simplified annotation. We have shown that the RefSeq annotations for some genomes are of this nature.

Correct annotation is important to both in vivo and in silico studies of translation initiation. In P. horikoshii OT3 and several other archaeal genomes, Cang and Wang [25] reported a high frequency of ATG triplets at 9 bps downstream of annotated TISs using GenBank's TIS annotation data. It was suggested that "a remedial initiation site for archaea ... reflect the decreased effciency of the translation initiation machinery in archaea". However, after taking a refined dataset of TIS annotations from the present database, such unexpected over-frequency disappeared (data not shown). It is then likely that the observed excess of ATG triplets is due to the excess of false TIS upstream to the true TISs present in the study. As an additional outcome, the present assessment yields a new database, called SupTISA. The interest of SupTISA lies in the fact that none of the TIS predictors is able to correctly take into account all properties of sequences around TIS over the entire family of prokaryotic genomes, and hence SupTISA can integrate them as a more macro-selector. This is achieved because SupTISA is based on a macroscopic principle (over all genomes) of homogeneity of translation machinery.

Authors' contributions

ZSS and GQH conceived the study, designed the applications and drafted the manuscript, ZSS and HQZ co-supervised the development of the work, XBZ and GQH designed and implemented the algorithm. LNJ performed part of the test. All authors read and approved the final manuscript.

Acknowledgements

We thank Yongchu Liu, Xiaojing Yang, Yifan Yang and Xinqiu Yao for beneficial discussions. The work received partial support by the National Natural Science Foundation (10225210 and 30770499) of China and was also supported by the 973 Project grant 2003CB715905.

References

  1. Shine J, Dalgarno L: The 3'-terminal sequence of E. coli 16S RNA: complementarity to nonsense triplets and ribosome binding sites.

    Proc Natl Acad Sci U S A 1974, 71:1342-1346. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  2. Londei P: Evolution of translational initiation: new insights from the archaea.

    FEMS Microbiol Rev 2005, 29:185-200. PubMed Abstract | Publisher Full Text OpenURL

  3. Torarinsson E, Klenk HP, Garrett RA: Divergent transcriptional and translational signals in Archaea.

    Environ Microbiol 2005, 7:47-54. PubMed Abstract | Publisher Full Text OpenURL

  4. Gold L: Posttranscriptional regulatory mechanisms in Escherichia Coli.

    Annu Rev Biochem 1988, 57:199-233. PubMed Abstract | Publisher Full Text OpenURL

  5. Zhu HQ, Hu GQ, Yang YF, Wang J, She ZS: MED: a new non-supervised gene prediction algorithm for bacterial and archaeal genomes.

    BMC Bioinformatics 2007, 8:97. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  6. Poole FL, Gerwe BA, Hopkins RC, Schut GJ, Weinberg MV, Jenney FEJ, Adams MW: Defining genes in the genome of the hyperthermophilic Archaeon Pyrococcus furiosus: implications for all microbial genomes.

    J Bacteriol 2005, 187:7325-7332. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  7. Besemer J, Lomsadze A, Borodovsky M: GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions.

    Nucleic Acids Res 2001, 29:2607-2618. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  8. Nielsen P, Krogh A: Large-scale prokaryotic gene prediction and comparison to genome annotation.

    Bioinformatics 2005, 21:4322-4329. PubMed Abstract | Publisher Full Text OpenURL

  9. Starmer J, Stomp A, Vouk M, Bitzer D: Predicting Shine-Dalgarno sequence locations exposes genome annotation errors.

    PLoS Comput Biol 2006, 2:57. Publisher Full Text OpenURL

  10. Frishman D, Mironov A, Gelfand M: Starts of bacterial genes: estimating the reliability of computer predictions.

    Gene 1999, 234:257-265. PubMed Abstract | Publisher Full Text OpenURL

  11. Gorodkin J, Heyer LJ, Brunak S, Stormo GD: Displaying the information contents of structural RNA alignments: the structure logos.

    Comput Appl Biosci 1997, 13:583-586. PubMed Abstract OpenURL

  12. Rudd KE: EcoGene: a genome sequence database for Escherichia coli K-12.

    Nucleic Acids Res 2000, 28:60-64. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  13. Pruitt K, Tatusova T, Maglott D: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.

    Nucleic Acids Res 2007, 35:D61-D65. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  14. Hu GQ, Zheng XB, Yang YF, Ortet P, She ZS, Zhu HQ: ProTISA: a comprehensive resource for translation initiation site annotation in prokaryotic genome.

    Nucleic Acids Res 2008, 36:D114-D119. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  15. Larsen TS, Krogh A: EasyGene – a prokaryotic gene finder that ranks ORFs by statistical significance.

    BMC Bioinformatics 2003, 4:21. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  16. Delcher AL, Bratke KA, Powers EC, Salzberg SL: Identifying bacterial genes and endosymbiont DNA with Glimmer.

    Bioinformatics 2007, 23:673-679. PubMed Abstract | Publisher Full Text OpenURL

  17. Tech M, Meinicke P: An unsupervised classification scheme for improving predictions of prokaryotic TIS.

    BMC Bioinformatics 2006, 7:121. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  18. Skovgaard M, Jensen LJ, Brunak S, Ussery D, Krogh A: On the total number of genes and their length distribution in complete microbial genomes.

    Trends Genet 2001, 17:425-428. PubMed Abstract | Publisher Full Text OpenURL

  19. Zhang CT, Zhang R: Analysis of distribution of bases in the coding sequences by a diagrammatic technique.

    Nucleic Acids Res 1991, 19:6313-6317. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  20. Staden R: Computer methods to locate signals in nucleic acid sequences.

    Nucleic Acids Res 1984, 12:505-519. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  21. EasyGene [http://servers.binf.ku.dk/cgi-bin/easygene/search] webcite

  22. Zhu HQ, Hu GQ, Ouyang ZQ, Wang J, She ZS: Accuracy improvement for identifying translation initiation sites in microbial genomes.

    Bioinformatics 2004, 20:3308-3317. PubMed Abstract | Publisher Full Text OpenURL

  23. ProTISA [http://mech.ctb.pku.edu.cn/protisa/] webcite

  24. SupTISA [http://mech.ctb.pku.edu.cn/protisa/SupTISA/] webcite

  25. Cang XH, Wang J: A unique ATG triplet downstream of gene start in archaea: implications for translation initiation and evolution.

    Gene 2004, 327:75-79. PubMed Abstract | Publisher Full Text OpenURL