Escola de Artes, Ciências e Humanidades, Universidade de São Paulo, Rua Arlindo Béttio, 1000, 03828-000, São Paulo, SP, Brazil

Instituto de Psiquiatria, Universidade de São Paulo, R. Dr. Ovídio Pires de Campos, 785, 01060-970, São Paulo, SP, Brazil

Instituto de Matemática e Estatística, Universidade de São Paulo, Rua do Matão, 1010, 05508-090, São Paulo, SP, Brazil

Abstract

Background

Many probabilistic models used in sequence analysis assign non-zero probabilities to most input sequences. The most common way to decide whether a given probability is sufficient is Bayesian binary classification, in which the probability under the model characterizing the sequence family of interest is compared with that under an alternative probability model. We can use as alternative model a

Results

In all tests, the target null model produced the lowest number of false positives when random sequences were used as the test set. The study was performed on DNA sequences, using GC content as the measure of content bias, but the results should also be valid for protein sequences. To broaden the applicability of the results, the study used randomly generated sequences. Previous studies were performed on amino acid sequences, used only one probabilistic model (HMM) and a specific benchmark, and did not reach more general conclusions about the performance of null models. Finally, a benchmark test with

Conclusions

Of the evaluated models the best suited for classification are the

Background

Probabilistic models are widely used in biological sequence analysis. They are essential for pre-processing the plethora of available data and for generating hypotheses for biological validation. Examples are Hidden Markov Models (HMM)

The choice of the alternative model is essential to reduce the number of false predictions and depends on the problem. An alternative model can be either a

More technically, we want to compute, given a nucleotide sequence

We want null models that help classifiers reject sequences that do not belong to family

Null models, due to their very generic nature, should not present any structure. Therefore a convenient model to describe random sequences in a null model _{N}

$$P(s \mid N) = p_{N}(A)^{c_A} \cdot p_{N}(C)^{c_C} \cdot p_{N}(G)^{c_G} \cdot p_{N}(T)^{c_T}$$

where c_{i}

There are many possible strategies to set up a null model discussed in the literature

The goal of this study is to evaluate the impact of each of these three classes of null models on the false positive rate of classifiers. We found only two studies in the literature that analyzed the performance of null models

To make this study more general than previous works, we use random sequences and two different probabilistic models. Using random sequences guarantees there is no bias in the study towards any particular benchmark, so we expect the results to be of broad application. Also, the simulations used random sequences across the whole GC spectrum, in an effort to make the results applicable to any real-life situation. The two probabilistic models chosen are very different, aimed at covering a wide range of models: one with very simple architecture and one able to represent more structured sequences. The studies were performed using Weight Array Matrices (WAMs)

WAMs record only fixed-distance content dependencies, which are useful for representing sequence motifs. CMs can characterize indels and register dependencies between non-adjacent bases at arbitrary distances, which can be used to characterize secondary structure. We evaluated WAMs in the context of splice site prediction and CMs in the context of predicting RNAs or other genomic elements with secondary structure. Splice sites were used for three reasons: first, splice site prediction is at the heart of gene prediction, a biologically important problem in bioinformatics; second, data are abundant in public databases; third, many successful predictors use position-dependent models, which are the basis of our probabilistic model range. The spectrum of GC content in the dataset made it possible to use a single sequence family (splice sites) for all experiments with WAMs. The same was not possible for CMs, whose training sets are generally small and concentrated in a narrow range of GC content, so we had to use three different sequence families (see Methods for details).

We will see below that the training set and the genomic background are not good choices for a null model. In fact, no fixed, non-uniform distribution is, as a quick mathematical analysis can demonstrate. As we will see below, two probabilistic i.i.d. models are best suited for classification: the

Results

Since we are interested in minimizing the number of false positive predictions, we used randomly generated sequences for evaluation. Random sequences should receive negative log-odds scores in probabilistic classifiers for any specific sequence family. In other words, a better performance in terms of specificity means fewer random sequences with positive scores. We evaluated six null models: 5%GC, 25%GC, uniform, 75%GC, 95%GC and the model obtained from the base frequencies in the target sequence (the target model)^{1}.

^{1}We used GC content as a simplified measure of nucleotide composition, which allows the visualization of 2D plots.

Initially, for illustrative purposes, we computed the log of the probability values of the test sequences given the null models alone (no log-odds score). This illustrates the values produced by these models for sequences at different GC compositions. We called these “raw scores”. Next, we used each of these models as null models in log-odds scoring classification for two different types of family models, WAMs and CMs. Since we are using only random sequences, the log-odds scores should be negative. Positive scores indicate false positives^{2}.

^{2}We assume that the chance of one of the random sequences being an actual family sequence is negligible.
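The raw scores and log-odds scores described above can be sketched in a few lines of Python. This is an illustrative sketch, not the code used in the study: the function names are hypothetical, natural logarithms are assumed (the text does not state the base), and sequences are assumed to contain only the symbols A, C, G and T.

```python
import math
from collections import Counter


def fixed_gc_model(gc):
    """Fixed i.i.d. model: G and C share `gc` equally, A and T share 1 - gc."""
    return {"A": (1 - gc) / 2, "C": gc / 2, "G": gc / 2, "T": (1 - gc) / 2}


def target_model(seq):
    """Target model: base frequencies measured on the scored sequence itself."""
    counts = Counter(seq)
    return {base: counts[base] / len(seq) for base in "ACGT"}


def raw_score(seq, model):
    """Natural log of P(seq | model) for an i.i.d. model (assumed log base)."""
    counts = Counter(seq)
    # Only bases present in the sequence are visited, so a target model
    # that gives an absent base probability zero never triggers log(0).
    return sum(n * math.log(model[base]) for base, n in counts.items())


def log_odds(family_log_prob, seq, null_model):
    """Log-odds score: log P(seq | family) minus log P(seq | null)."""
    return family_log_prob - raw_score(seq, null_model)
```

For example, `raw_score(seq, fixed_gc_model(0.5))` gives the uniform-model raw score, while `raw_score(seq, target_model(seq))` gives the target-model raw score of the same sequence.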

Raw score behavior on random sequences

We have plotted the raw scores (log of the probability value) of random sequences using the fixed distribution and target models alone. The results are shown in Figure

**Raw scores of random sequences** Raw scores (log of the probability given the null model) of random sequences as a function of their GC content. The raw scores were calculated using six models: the target model (•, with the nucleotide distribution of the analyzed sequence) and five fixed GC models (x: 5%, □: 25%, ◊: 50%, Δ: 75% and ∇: 95%).

As expected, the uniform model produces no bias along the GC content (x axis): it yields a constant score, consistent with the fact that all analyzed sequences have the same length. The raw scores from the biased fixed-distribution models (5%GC, 25%GC, 75%GC, 95%GC) depend linearly on the GC content of the analyzed sequences; the GC content of the model only determines the slope of the line. The target model gives a less intuitive result: a curve with the lowest scores at 50%GC and higher scores towards more extreme GC compositions.
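The shapes of these curves follow from the i.i.d. form of the null models. The derivation below is a sketch under assumptions not stated in the text: natural logarithms, a sequence s of length L with GC fraction g, and a fixed model N_q with GC fraction q that splits its probability equally between G and C and between A and T. For the target model N_s we additionally assume that G and C (and A and T) occur in roughly equal numbers in s. The symbols g, q, N_q, N_s and H are introduced here for illustration only.

```latex
% Fixed GC model with GC fraction q: linear in g
\log P(s \mid N_q) = L\left[ g \log\frac{q}{2} + (1-g)\log\frac{1-q}{2} \right]
                   = L \log\frac{1-q}{2} + g\,L\,\log\frac{q}{1-q}

% Uniform model (q = 1/2): the slope vanishes
\log P(s \mid N_{1/2}) = L \log\frac{1}{4}

% Target model (q = g): a binary entropy term
\log P(s \mid N_s) = L\left[ g \log\frac{g}{2} + (1-g)\log\frac{1-g}{2} \right]
                   = -L\left[ H(g) + \log 2 \right],
\qquad H(g) = -g\log g - (1-g)\log(1-g)
```

The slope L log(q/(1-q)) is negative for q < 1/2, zero for the uniform model and positive for q > 1/2. Because H(g) is largest at g = 1/2, the target-model score is lowest there, where it equals the uniform-model score L log(1/4), and it rises towards -L log 2 as g approaches 0 or 1.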

Effect of different null models in log-odds scoring

Probabilistic models such as WAMs and CMs also capture the base composition of the sequences of the training set. Therefore, when we use log-odds scoring, the GC bias recorded by the family models should also influence the final score and we have to analyze the combined influence of the family and null models. We embedded the null models used in the previous section in classifiers using two different probabilistic techniques, Weight Array Matrices (WAMs)

Weight array matrices

We used sequences of acceptor splice sites to create three distinct training sets with different average GC content: 38%GC, 50%GC and 65%GC (see Materials and Methods for a justification on GC percentages). For each training set, a weight array matrix was trained and used to score random sequences using the six null models: 5 fixed GC models and the target model. The results are shown in Figure

**Log-odds scores of weight array matrices** Null model influence on log-odds scores based on weight array matrices (WAMs). The plots show the log-odds scores of random sequences as a function of their GC content. In each plot, a horizontal line indicates the 0 log-odds score (used as classification threshold) and a vertical line indicates the average GC content of the training sequences. WAMs inferred from low, medium and high GC training sequences were used in the plots in rows 1, 2 and 3, respectively. The 5%GC, 50%GC, 95%GC and target null models were used in the plots in columns 1, 2, 3 and 4, respectively. Note: the three bands visible in the plots are generated by the WAM models, and not by the use of any particular null model.

As we can see, the log-odds scores of random sequences under fixed GC null models, including the uniform model, show a quasi-linear dependence on GC content. This means that, regardless of the composition of the sequences used to characterize the family (the training set), any random sequence at one end of the GC spectrum will score consistently higher (or lower) than any other sequence. This effect is strong enough that random sequences at one of the GC content extremes receive positive scores under every fixed GC model, which indicates a strong tendency to generate false positives when classifying sequences with extreme GC compositions. In contrast, the target null model gives higher scores to sequences whose GC content is similar to the average GC content of the training set and lower scores to sequences with extreme GC content. The target null model also yields the lowest number of positively scored sequences. In real-life classification, the consequence would be a lower number of false positives.

Covariance Models

Covariance Models (CMs) are usually used to characterize families of RNAs or other genomic elements with secondary structure. Training sets for CMs tend to be much smaller. Therefore, instead of dividing the training set of a single family in different training sets separated by GC content (as performed in the analysis using WAMs), we used three different CMs obtained from the Rfam database: CM_{low}, CM_{medium} and CM_{high}

We can observe in Figure

**Log-odds scores of covariance models** Null model influence on log-odds scores based on Covariance Models (CMs). The plots show the log-odds scores of random sequences as a function of their GC content. In each plot, a horizontal line indicates the 0 log-odds score (used as classification threshold) and a vertical line indicates the average GC content of the training sequences. CMs inferred from low, medium and high GC training sequences were used in the plots in rows 1, 2 and 3, respectively. The 5%GC, 50%GC, 95%GC and target null models were used in the plots in columns 1, 2, 3 and 4, respectively.

Specificity of the different null models

Table _{high}

Specificity of the different null models

| Null model | WAM_{low} | WAM_{med} | WAM_{high} | CM_{low} | CM_{med} | CM_{high} |
|---|---|---|---|---|---|---|
| 5%GC | 2431 (70%) | 2377 (68%) | 2173 (63%) | 2439 (77%) | 3775 (75%) | 3567 (71%) |
| 25%GC | 1118 (32%) | 1517 (44%) | 1500 (43%) | 1681 (53%) | 2509 (50%) | 2243 (44%) |
| 50%GC | 863 (25%) | 79 (2%) | 700 (20%) | 674 (21%) | 58 (1%) | 0 (0%) |
| 75%GC | 1443 (42%) | 1534 (44%) | 738 (21%) | 1644 (52%) | 2521 (50%) | 2197 (44%) |
| 95%GC | 2114 (61%) | 2332 (68%) | 2529 (73%) | 2297 (73%) | 3642 (72%) | 3625 (72%) |
| target | 45 (1%) | 18 (0%) | 25 (0%) | 3 (0%) | 0 (0%) | 0 (0%) |

Number (and percentage) of positively scored sequences for each null model. WAM_{low}, WAM_{med} and WAM_{high} designate the WAM models generated by the training set with low (36%), medium (48%) and high (65%) GC content, respectively. CM_{low}, CM_{med} and CM_{high} designate the CM models generated by the training set with low (5.6%), medium (49.2%) and high (71.4%) GC content, respectively.

Testing in

As we have seen above, the target null model performed much better than the other models when tested against random sequences. To validate these results in a realistic setting, we tested the performance of four null models in the context of acceptor site prediction for

**Precision-Recall Graph for P. falciparum data** This picture shows the precision-recall graph for the acceptor splice site prediction in

Performance of the null models on the

| Null model | Precision | Specificity | Sensitivity | F-score |
|---|---|---|---|---|
| target | **22.81%** | **99.12%** | 36.79% | **28.16%** |
| uniform | 3.51% | 84.39% | **81.28%** | 6.74% |
| genomic | 13.07% | 98.18% | 39.01% | 19.58% |
| training | 5.39% | 95.24% | 38.69% | 9.46% |

This table shows the precision, specificity, sensitivity, and F-score for the entire testing set of the

**ROC curve for P. falciparum data** This picture shows the ROC curve for the acceptor splice site prediction in

Discussion

The results of raw scores presented above show that all but the uniform null model produce a score biased by the GC content of the analyzed sequence. The problematic aspect is not the GC dependence

Indeed, when the models were used in log-odds scoring, the uniform model showed the lowest dependence on the GC content (Figures

This is an interesting feature, since the GC content can be a meaningful characteristic of a sequence family. In fact, this is a more appropriate classifier behavior than the effect associated with other null models, such as a training set null model, which assign high scores to sequences with a GC bias opposite to that of the sequences of the targeted family. If a family of sequences has low compositional variation, GC content can be considered relevant information during the classification process. What we want in these cases is a dependence that is "centered" on the characteristics of our training sets, that is, one that rewards GC contents similar to those of the known sequences of the targeted family and "punishes" GC contents that are not. That is exactly what the target null model does, without producing too many false positives.

The location of the “peaks” (near the training set GC content) is not a coincidence. In particular, if the family model

Also important is that the peak scores presented in the target null model do not necessarily correspond to positive scores. In fact, the target null model presented the best specificity results (lowest number of positive scores for random sequences) in all tests. Moreover, this effect is still in place even for models that register secondary structure such as the CMs. In this case, although the log-odds score peak is moved towards the average GC content of the family, they do not coincide exactly (which occurred in the WAM-based classifiers). The explanation is probably related to the structural component of the CM score, which is not so directly dependent on the sequence GC content.
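One way to see why the scores peak near the training-set composition is sketched below, under the simplifying assumption (made here for illustration only) that the family model M is itself an i.i.d. model with base probabilities m_i. We write f_i = c_i / L for the base frequencies of the scored sequence s of length L, and N_s for the target null model built from s; these symbols are introduced for this sketch.

```latex
S(s) = \log P(s \mid M) - \log P(s \mid N_s)
     = \sum_{i \in \{A,C,G,T\}} c_i \log m_i \; - \sum_{i \in \{A,C,G,T\}} c_i \log f_i
     = -L \sum_{i \in \{A,C,G,T\}} f_i \log\frac{f_i}{m_i}
     = -L \, D_{\mathrm{KL}}(f \,\|\, m) \;\le\; 0
```

Under this assumption the score reaches its maximum of zero exactly when the composition of s matches that of the family model, and it decreases as the two compositions diverge. WAMs and CMs add position-specific or structural terms to the score, so the peak can shift and the score can become positive, but the compositional part of the score behaves in this way.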

Although the target null model presents the best specificity, it may impair sensitivity in detecting true sequences whose base composition is very different from the average composition of the training set. When high GC variation is expected within the family of interest, the target model may generate a higher number of false negatives, in which case the uniform model should also be considered. This phenomenon was observed in covariance model tests performed with a benchmark of transfer RNA sequences (tRNAs) (data not shown). For a test sample of 100 tRNA sequences^{3} with GC content evenly distributed over the GC range of the tRNA family (from 8.8% to 74.3%GC), the specificity values achieved using the uniform and target null models were 96.7% and 100%, respectively, and the sensitivity values were 100% and 93%, respectively, corroborating the observation that the target model tends to have higher specificity and lower sensitivity than the uniform model. The same behavior was observed in the

^{3}Sequences downloaded from Rfam database release 8.1

The GC percentages of the fixed distribution null models shown in this article do not correspond to the specific GC contents that would constitute a "training set" null model in each experiment using simulated data. However, "training set" null models are themselves fixed-distribution models, in which the distribution is determined by the training set. Therefore, a training null model is unsuitable precisely because its distribution is fixed. The homogeneous behavior of the fixed-probability null models and the inferior performance of the training null model in the real data experiment support our conclusions. Also, for the covariance models, the training set percentages (5.6%, 49.2%, 72.4%) were very close to the percentages used in the tests (5%, 50%, 75%). The same is not true for the WAM tests, for which we did run tests using null models with the training set percentages, and the results were consistent (data not shown).

Our study was performed in the context of nucleotide sequences; however, we expect similar results for amino acid compositions. This is supported by the fact that our analytical reasoning is also valid for amino acid sequences. In other words, when any fixed distribution model is scored against the target null model in log-odds scoring, the highest scores are obtained for sequences with the same amino acid composition as that described by the fixed model. Because of the number of possible amino acids, a similar study would be harder to perform and to interpret, since 2D plots would not be helpful. As a matter of fact, two HMM-based tools used for protein domain identification, SAM

Conclusions

In this paper we evaluated the performance of 3 different types of null models in profile-based probabilistic models:

All our results indicate that, when the sequence family presents low variation in GC content, the target model is a more dependable model for generating hypotheses for biological verification, owing to its high specificity compared with any fixed-distribution model, in particular for organisms whose genomic sequences have a high GC bias. Detecting acceptor splice sites in the GC-poor

This study was performed using 2 probabilistic techniques, WAMs and CMs. However, we expect the results to hold for other techniques such as Weight Matrix Models

Methods

Generation of the random sequences

The random sequences were generated by a Perl script written for this study. We wanted to visualize how the models behave when analyzing heterogeneous sequences, i.e., sequences with different GC contents. Thus we needed the same number of sequences at each GC content mark. The script is parameterized by two values, L and K_{1}
K_{2} = L − K_{1} containing only the symbols A and T chosen from a uniform distribution. The final

Five sets of random sequences were generated using the approach described above: one for the raw score computations (raw score set, 5050 sequences), one for the WAM evaluations (WAM set, 3450 sequences) and one for each of the three CMs (_{low}
_{medium}
_{high}
_{low}
_{medium}
_{high}
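The generation procedure can be sketched in Python as follows. This is not the Perl script used in the study. The sketch assumes that a sequence of length L is built from K1 symbols drawn uniformly from {G, C} and L − K1 symbols drawn uniformly from {A, T}, and that the two parts are then shuffled together; the shuffling step, the function names and the `per_mark` parameter are assumptions made here for illustration.

```python
import random


def random_sequence(length, gc_count, rng=random):
    """Random sequence of `length` bases with exactly `gc_count` G/C symbols."""
    if not 0 <= gc_count <= length:
        raise ValueError("gc_count must be between 0 and length")
    gc_part = [rng.choice("GC") for _ in range(gc_count)]
    at_part = [rng.choice("AT") for _ in range(length - gc_count)]
    bases = gc_part + at_part
    rng.shuffle(bases)  # assumption: the two parts are mixed at random
    return "".join(bases)


def sequence_set(length, per_mark, seed=0):
    """Same number of sequences at every GC mark 0/length .. length/length."""
    rng = random.Random(seed)  # fixed seed so the set is reproducible
    return [
        random_sequence(length, k, rng)
        for k in range(length + 1)
        for _ in range(per_mark)
    ]
```

Fixing the number of G/C symbols, rather than sampling each base independently, gives every sequence an exact GC content and therefore an even spread of sequences over the whole GC range.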

Obtaining raw scores for the fixed distribution and target models

We calculated the probability of the random sequences using six different probabilistic models: (i) five fixed GC null models (5%GC, 25%GC, 50%GC, 75%GC, 95%GC), in which G and C have the same individual probability, as do A and T; (ii) the target null model. We plotted the raw scores (logarithm of the probability value) versus the sequence GC content to illustrate the independent behavior of each null model. We show in this paper the plot using sequences with length 100. Plots for the other lengths are similar: the curves have the same slopes, with different score limits. To calculate the probability value of the sequence

$$P(s \mid N) = p_{N}(A)^{c_A} \cdot p_{N}(C)^{c_C} \cdot p_{N}(G)^{c_G} \cdot p_{N}(T)^{c_T}$$

where c_{i}

Obtaining log-odds scores for WAMs

Acceptor splice sites from the HS3D database release 1.2 were divided into three training sets: i. **low GC content** training set containing 1013 sequences with GC content less than or equal to 50% (average GC content of 38%); ii. **medium GC content** training set containing 1381 sequences with GC content between 50% and 60% (average GC content of 50%); iii. **high GC content** training set containing 1006 sequences with GC content greater than 60% (average GC content of 65%).

For each training set, we estimated Weight Array Matrices (WAMs)
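To make the WAM log-odds computation concrete, here is a minimal Python sketch. It assumes a first-order WAM (each position conditioned on the previous base), add-one pseudocounts, natural logarithms and aligned training sequences of equal length; none of these details is stated in the text, and the function names are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"


def train_wam(seqs, pseudo=1.0):
    """Estimate a first-order WAM from aligned, equal-length sequences."""
    n = len(seqs[0])
    if any(len(s) != n for s in seqs):
        raise ValueError("training sequences must have the same length")
    first = Counter(s[0] for s in seqs)
    init = {b: (first[b] + pseudo) / (len(seqs) + 4 * pseudo) for b in BASES}
    trans = []  # trans[i - 1][(a, b)] = P(base b at position i | a at i - 1)
    for i in range(1, n):
        pair = Counter((s[i - 1], s[i]) for s in seqs)
        prev = Counter(s[i - 1] for s in seqs)
        trans.append({
            (a, b): (pair[(a, b)] + pseudo) / (prev[a] + 4 * pseudo)
            for a in BASES for b in BASES
        })
    return init, trans


def wam_log_prob(seq, wam):
    """Natural log of P(seq | WAM)."""
    init, trans = wam
    lp = math.log(init[seq[0]])
    for i in range(1, len(seq)):
        lp += math.log(trans[i - 1][(seq[i - 1], seq[i])])
    return lp


def target_null_log_prob(seq):
    """Log probability under the i.i.d. model built from seq's own base counts."""
    counts = Counter(seq)
    return sum(n * math.log(n / len(seq)) for n in counts.values())


def log_odds(seq, wam):
    """Log-odds of the WAM against the target null model."""
    return wam_log_prob(seq, wam) - target_null_log_prob(seq)
```

A sequence is classified as a candidate family member when `log_odds` is positive. Swapping `target_null_log_prob` for a fixed-distribution score gives the corresponding fixed GC null model classifier.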

Obtaining log-odds scores for CMs

Three RNA families with different GC content averages were chosen: two with extreme GC content and one with medium GC content. These families are: i. rbcL 5' UTR RNA stabilizing element (5.6% GC), ii. small nucleolar RNA SNORA67 (49.2% GC) and iii. bag-1 internal ribosome entry site - IRES (71.4% GC). The structural multiple alignments for these RNA families were downloaded from the full alignments of the Rfam database release 8.1 _{U}

_{N}
_{U}

Six null models were used: five fixed GC models (5%GC, 25%GC, 50%GC, 75%GC and 95%GC) and the target null model.

Default execution of cmsearch does not report hits with negative score. Since the score of most of the random sequences is negative, a small modification in cmsearch’s source code to also report negative scores was needed.

Acceptor splice site dataset for

We extracted 7582 acceptor splice sites from PlasmoDB release 6.4. This dataset was split into two parts. The first part, containing 1000 acceptor splice sites, was used as the training set to estimate the parameters of the WAM. The second part, containing 6582 acceptor splice sites, was used as the positive testing set. Each acceptor splice site sequence has 70 nucleotides, with the conserved AG dinucleotide at position 47.

Genomic sequences of length 70 that contain the dinucleotide AG at position 47 and were not annotated as acceptor splice sites were considered negative samples. We extracted a total of 939994 false acceptor splice sites as the negative testing set.

Precision Recall Graph

Since we are dealing with real data, in this experiment we used four types of null models: (i) the target null model; (ii) the genomic background null model; (iii) the uniform null model; (iv) and the training set null model.

Using these null models and the WAM estimated from the training set, we generated a precision-recall graph comparing the null models across the different GC contents of the sequences in the testing set. We partitioned the testing set so that each subset contains only sequences with the same GC content. For each subset with more than 5 positive samples, we calculated a point in the graph corresponding to its precision and recall values. In this analysis we used precision
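The metrics and the per-GC partition described above can be sketched as follows. The code uses the standard confusion-matrix definitions of precision, recall (sensitivity), specificity and F-score. The classification threshold of 0 on the log-odds score, the input layout and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def metrics(tp, fp, tn, fn):
    """Standard classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,  # also called sensitivity
        "specificity": specificity,
        "f_score": f_score,
    }


def pr_points_by_gc(samples, threshold=0.0, min_positives=5):
    """One (precision, recall) point per GC-content subset.

    `samples` is an iterable of (gc_content, is_positive, score) tuples.
    Subsets with `min_positives` or fewer positive samples are skipped.
    """
    groups = defaultdict(lambda: [0, 0, 0, 0])  # tp, fp, tn, fn
    for gc, is_positive, score in samples:
        predicted = score > threshold
        if is_positive:
            groups[gc][0 if predicted else 3] += 1
        else:
            groups[gc][1 if predicted else 2] += 1
    points = {}
    for gc, (tp, fp, tn, fn) in sorted(groups.items()):
        if tp + fn > min_positives:
            m = metrics(tp, fp, tn, fn)
            points[gc] = (m["precision"], m["recall"])
    return points
```

Because the negative set here is far larger than the positive set, specificity can stay high even when precision is low, which is one reason to report precision and recall alongside it.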

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

The data presented in the article were produced by AML and AYK. The analysis of the results and the writing of the article were performed by AML, AYK and AMD. All authors revised the final manuscript.

Acknowledgments

We would like to thank Hernando A. del Portillo who proposed the initial biological problem that motivated this study, Sean R. Eddy who helped AML suggesting possible null models for log-odds scoring analysis and for important advice in the final form of the paper, Eric Nawrocki for important insights about Infernal, Alex Coventry for helpful discussion and modification in cmsearch's source code to report negative values, and BIOINFO-Vision Laboratory (University of São Paulo) for computing facilities. During this work, AML was supported by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) and Fundação de Amparo a Pesquisa do Estado de São Paulo (FAPESP, 2007/01549-5), AYK was supported by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) and AMD was partially supported by CNPq.

This article has been published as part of