National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD, 20894, USA

Abstract

Background

Local alignment programs often calculate the probability that a match occurred by chance. The calculation of this probability may require a “finite-size” correction to the lengths of the sequences, as an alignment that starts near the end of either sequence may run out of sequence before achieving a significant score.

Findings

We present an improved finite-size correction that considers the distribution of sequence lengths rather than simply the corresponding means. This approach improves sensitivity and avoids substituting an

Conclusions

The new finite-size correction improves the calculation of probabilities for a local alignment. It is now used in the BLAST+ package and at the NCBI BLAST web site.

Background

Local alignments are an essential tool for biologists and often provide the first information about the function of an unknown nucleotide or protein sequence. An important question concerns the relationship of the score of a local alignment with the probability that the alignment occurred by chance.
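If two random sequences **I** and **J** of lengths $m$ and $n$ are compared, the expected number $E$ of distinct local alignments with score at least $y$ is approximately

$$E \approx K\,m\,n\,e^{-\lambda y} \qquad (1)$$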

The two statistical parameters in Equation (1) are the scale parameter $\lambda$ and the pre-factor $K$.
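As a minimal sketch, the uncorrected expectation can be evaluated directly from the standard form $E = K m n e^{-\lambda y}$. The function name and the numeric parameter values below are illustrative assumptions, not values taken from this note.

```python
import math


def expected_alignments(m: int, n: int, y: float, lam: float, k: float) -> float:
    """Uncorrected expectation E = K * m * n * exp(-lambda * y).

    m, n : lengths of the two sequences
    y    : local alignment score threshold
    lam  : scale parameter (lambda)
    k    : pre-factor (K)
    """
    return k * m * n * math.exp(-lam * y)


# Assumed, illustrative parameter values (hypothetical for this example).
LAMBDA, K = 0.267, 0.041

e_short = expected_alignments(100, 100, 40, LAMBDA, K)
e_long = expected_alignments(200, 100, 40, LAMBDA, K)
```

Without a finite-size correction, the expectation scales linearly with each sequence length, so doubling $m$ doubles $E$ regardless of how short the sequences are.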

Several authors

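The following presentation emphasizes intuition over mathematical formality, to explain how the finite-size correction can account for the finite sequence lengths $m$ and $n$. A local alignment might run off the end of sequence **I** or **J** before it achieves score $y$. Let $L_I(y)$ ($L_J(y)$) be the random length of sequence **I** (**J**) that an alignment consumes before it achieves score $y$, and let $\ell_I(y)$ ($\ell_J(y)$) be the mean of $L_I(y)$ ($L_J(y)$). An alignment cannot start too near the end of **I** (**J**), because there might be insufficient sequence to permit it to achieve the score $y$.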

**Sequence alignment graph of two random sequences I and J of lengths $m$ and $n$, respectively.** The black circles are the initiation vertices of local alignment paths just remaining within the large rectangle of the sequence alignment graph before achieving the score $y$. One such path consumes length $L_I(y)$ of sequence **I**; and the upper, length $L_J(y)$ of sequence **J**. The gray shaded area is therefore the (random) alignment rectangle that an optimal local alignment must start within to achieve the score $y$.

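The old finite-size correction reduces the sequence lengths $m$ and $n$ in Equation (1) by the mean lengths $\ell_I(y)$ and $\ell_J(y)$ that an alignment consumes in **I** and **J** before achieving score $y$:

$$E \approx K\,[m - \ell_I(y)]\,[n - \ell_J(y)]\,e^{-\lambda y} \qquad (2)$$

Equation (2) approximates the area within the alignment matrix where the optimal local alignment can start and on average still have enough space to exceed the score $y$.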

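The purpose of this note is to present a new finite-size correction formula for the BLAST statistics. It avoids the substitution that the old correction requires when the mean length $\ell_I(y)$ or $\ell_J(y)$ approaches or exceeds the corresponding sequence length.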

Findings

New finite-size correction

As with the old finite-size correction, the expectation

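Most practical scoring systems are symmetric. If, in addition, the random models for **I** and **J** are the same, then the corresponding quantities agree, e.g., $\ell_I(y) = \ell_J(y)$. The formulas below nevertheless keep separate subscripts for **I** and **J**.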

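The new finite-size correction replaces the product of mean-corrected lengths with an expectation over the random lengths themselves:

$$E \approx K\,\mathbb{E}\{[m - L_I(y)]^{+}\,[n - L_J(y)]^{+}\}\,e^{-\lambda y} \qquad (4)$$

where $x^{+} = \max\{x, 0\}$, and $L_I(y)$ ($L_J(y)$) is the random length of sequence **I** (**J**) that an alignment consumes before it achieves score $y$.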
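A short worked example may help show why the full distribution of a length, and not only its mean, affects a correction built on the positive part $x^{+}$. The two-point distribution below is hypothetical, chosen only to keep the arithmetic simple.

```latex
% Convexity of x -> x^+ = max{x, 0} gives Jensen's inequality:
\mathbb{E}\,[m - L]^{+} \;\ge\; [m - \mathbb{E}L]^{+}.

% Hypothetical example: m = 100, and L = 60 or L = 140 with probability 1/2 each.
\mathbb{E}L = 100, \qquad
[m - \mathbb{E}L]^{+} = 0, \qquad
\mathbb{E}\,[m - L]^{+} = \tfrac{1}{2}(100 - 60) + \tfrac{1}{2}\cdot 0 = 20.
```

In this example a correction based only on the mean length leaves no room for an alignment to start, although half of the time 40 positions remain available.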

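The practical computation of Equation (4) approximates the distribution of $(L_I(y), L_J(y))$ by a bivariate normal distribution, in which $L_I(y)$ has mean $\ell_I(y) \approx a_I y + b_I$ and variance $v_I(y) \approx \alpha_I y + \beta_I$, $L_J(y)$ has mean $\ell_J(y) \approx a_J y + b_J$ and variance $v_J(y) \approx \alpha_J y + \beta_J$, and the covariance of $L_I(y)$ and $L_J(y)$ is $c(y) \approx \sigma y + \tau$.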
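The following sketch shows one way to evaluate an expectation of the form $\mathbb{E}\{[m - L_I]^{+}[n - L_J]^{+}\}$ when $(L_I, L_J)$ is assumed bivariate normal with means, variances, and covariance that are linear in the score $y$. It uses the exact normal formula $\mathbb{E}[X^{+}] = \mu\,\Phi(\mu/s) + s\,\varphi(\mu/s)$ and a first-order expansion in the covariance. The class, the function names, and all numeric values are hypothetical, and the code is not presented as the BLAST implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class LengthModel:
    """Assumed linear-in-y moments of (L_I, L_J); field names are illustrative."""
    a_i: float      # slope of mean of L_I
    b_i: float      # intercept of mean of L_I
    a_j: float      # slope of mean of L_J
    b_j: float      # intercept of mean of L_J
    alpha_i: float  # slope of variance of L_I
    beta_i: float   # intercept of variance of L_I
    alpha_j: float  # slope of variance of L_J
    beta_j: float   # intercept of variance of L_J
    sigma: float    # slope of covariance
    tau: float      # intercept of covariance


def normal_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x: float) -> float:
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def mean_positive_part(mu: float, var: float) -> float:
    """Exact E[max(X, 0)] for X ~ Normal(mu, var) with var > 0."""
    if var <= 0.0:
        raise ValueError("variance must be positive")
    sd = math.sqrt(var)
    z = mu / sd
    return mu * normal_cdf(z) + sd * normal_pdf(z)


def corrected_area(m: float, n: float, y: float, p: LengthModel) -> float:
    """First-order approximation to E[(m - L_I)^+ (n - L_J)^+]."""
    mu_i = m - (p.a_i * y + p.b_i)
    mu_j = n - (p.a_j * y + p.b_j)
    var_i = p.alpha_i * y + p.beta_i
    var_j = p.alpha_j * y + p.beta_j
    cov = p.sigma * y + p.tau
    # Independent part: product of the two one-dimensional expectations.
    independent = mean_positive_part(mu_i, var_i) * mean_positive_part(mu_j, var_j)
    # Covariance part: cov times P(m - L_I > 0) * P(n - L_J > 0).
    prob_i = normal_cdf(mu_i / math.sqrt(var_i))
    prob_j = normal_cdf(mu_j / math.sqrt(var_j))
    return independent + cov * prob_i * prob_j


def corrected_expectation(m, n, y, lam, k, p: LengthModel) -> float:
    """K * area * exp(-lambda * y), with the area from corrected_area."""
    return k * corrected_area(m, n, y, p) * math.exp(-lam * y)


# Hypothetical parameter values, for illustration only.
model = LengthModel(a_i=1.0, b_i=0.0, a_j=1.0, b_j=0.0,
                    alpha_i=1.0, beta_i=0.0, alpha_j=1.0, beta_j=0.0,
                    sigma=0.5, tau=0.0)
area_long = corrected_area(10000, 10000, 50, model)   # long sequences
area_short = corrected_area(40, 40, 50, model)        # mean length exceeds 40
```

For long sequences the area reduces to the product of mean-corrected lengths plus the covariance. For short sequences it stays positive even when the mean length exceeds the sequence length, so no substitute length is needed.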

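The estimation concerns two groups of parameters in the linear approximations to the means, variances, and covariance of $L_I(y)$ and $L_J(y)$: the slopes $a_I$, $a_J$, $\alpha_I$, $\alpha_J$ and $\sigma$, and the intercepts $b_I$, $b_J$, $\beta_I$, $\beta_J$ and $\tau$.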

For BLAST, the parameters $b_I$, $b_J$, $\beta_I$, $\beta_J$, and $\tau$ are approximated from their ungapped counterparts. Let $a_u$ ($\alpha_u$) be the value of $a_I$ ($\alpha_I$) for ungapped alignment. The mathematical theories for random walks and for renewals yield analytic formulas for $a_u$ and $\alpha_u$. Ungapped alignment treats **I** and **J** symmetrically, because it lacks gaps. Thus, $a_u$ and $\alpha_u$ do not depend on the sequence (**I** or **J**) under consideration, so they contain no subscripts **I** or **J**. In a gapped alignment, let a gap of length 1 incur a penalty

Under the normal approximation, routine computation shows that Equation (4) is approximately

where

Comparison of $E$-values

We compared ^{−50}

Figure

**Comparison of $E$-values for the new and old finite-size corrections using the BLOSUM62 scoring matrix and $11 + k$ affine gap penalty for equal sequence lengths ($m = n$) 40, 100, 200, and 400.**

Evaluation of accuracy

We evaluated the performance of the new finite-size correction using the ASTRAL SCOP 40 subset

We report the performance in terms of the receiver operating characteristic (ROC). Specifically, we report the ROC_{n} score, which is obtained by pooling the results of all queries, ordering them by expect value, and keeping results only up to the $n$th false positive.
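The pooling and truncation described above can be sketched as follows, assuming the usual ROC_{n} normalization: the number of true positives ranked ahead of each of the first $n$ false positives, summed and divided by $n$ times the total number of true positives. The function name, the input format, and the handling of lists with fewer than $n$ false positives are assumptions made for this example.

```python
def roc_n(hits, n):
    """Compute a ROC_n score from pooled search results.

    hits : iterable of (expect_value, is_true_positive) pairs pooled over all queries
    n    : number of false positives at which the ranked list is truncated

    Assumption: if the list holds fewer than n false positives, each missing
    false positive is credited with all true positives seen.
    """
    ranked = sorted(hits, key=lambda hit: hit[0])  # best (smallest) expect value first
    total_true = sum(1 for _, is_true in ranked if is_true)
    if total_true == 0 or n <= 0:
        return 0.0

    true_seen = 0
    false_seen = 0
    area = 0
    for _, is_true in ranked:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen  # true positives ranked ahead of this false positive
            if false_seen == n:
                break
    area += (n - false_seen) * true_seen  # zero when the loop stopped at n
    return area / (n * total_true)


# Hypothetical pooled results: (expect value, is the hit a true positive?)
example_hits = [(1e-10, True), (1e-5, True), (0.01, False), (0.1, True), (1.0, False)]
score_n1 = roc_n(example_hits, 1)
score_n2 = roc_n(example_hits, 2)
```

A score of 1 means that every true positive ranks ahead of the first $n$ false positives.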

As discussed above, the new finite-size correction should show the greatest improvement for short sequences. Therefore, we also produced ROC_{n} scores for different subsets of the SCOP database. One database subset has sequences shorter than the 25^{th} percentile length (95 residues), and another has sequences shorter than the 50^{th} percentile length (137 residues).

The table below reports ROC_{n} scores for the full database as well as the two subsets described above. These scores use $n = 4852$, an average of one false positive per query, a threshold found useful in other studies (Altschul SF, private communication). The ROC-4852 scores for the full database demonstrate a small improvement of the new finite-size correction over the older one. The subsets show a larger improvement. For the 50^{th} percentile subset, the ROC-4852 score improves by 9%. For the 25^{th} percentile subset, the ROC-4852 score shows a 13% improvement. In the 25^{th} percentile subset, the new finite-size correction produces roughly 12% more true positives overall at 4852 false positives than the old finite-size correction (see the figure below).

| **Method** | **25^{th} percentile** | **50^{th} percentile** | **Full database** |
| --- | --- | --- | --- |
| New correction | 0.10373 ± 0.00022 | 0.10073 ± 0.00019 | 0.08535 ± 0.00013 |
| Old correction | 0.09201 ± 0.00020 | 0.09282 ± 0.00017 | 0.08358 ± 0.00014 |

The three databases contain proteins shorter than 91 residues (25^{th} percentile by length), proteins shorter than 137 residues (50^{th} percentile by length), and all proteins (full database). ROC-4852 scores are presented with an error (one standard deviation). The 25^{th} percentile database contains 2533 sequences, the 50^{th} percentile database contains 5008 sequences, and the full database contains 10,569 sequences. There are 4852 queries.
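The percentage improvements quoted in the text follow from the ROC-4852 scores by simple arithmetic. The short sketch below recomputes them; the variable and function names are illustrative.

```python
# ROC-4852 scores from the table: (new correction, old correction)
scores = {
    "25th percentile": (0.10373, 0.09201),
    "50th percentile": (0.10073, 0.09282),
    "Full database": (0.08535, 0.08358),
}


def relative_improvement(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


improvements = {
    name: relative_improvement(new, old) for name, (new, old) in scores.items()
}

for name, percent in improvements.items():
    print(f"{name}: {percent:.1f}% improvement")
```

Rounded to whole percentages, the results agree with the 13% and 9% improvements reported for the two subsets, and give about 2% for the full database.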

**Number of true positives vs. number of false positives for both new and old finite-size corrections using short SCOP sequences as a database.** The searched database was created from the shortest 25% of the ASTRAL 40 sequences for SCOP version 1.75 (see text).

To assess the significance of this improvement on BLAST searches, one may look to the length distribution of sequences in a heavily used protein BLAST database. The non-redundant (“nr”) database is the default protein database at the NCBI BLAST web site. Of the sequences in the nr database, 11% are 95 residues or shorter; and 21%, 137 residues or shorter. The new finite-size correction improves the retrieval accuracy for a noticeable fraction of the proteins in the nr database.

Conclusion

We have described a new finite-size correction. The new correction has a more rigorous derivation than the current finite-size correction and avoids the use of an

Availability and requirements

Project Name: BLAST Statistical Parameters

Project home page:

Operating systems: Windows, MacOSX, LINUX, UNIX

Programming language: C++

License: Public Domain (see

Any restrictions to use by non-academics: None

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

YP, TM and JS drafted the manuscript. YP designed the

Acknowledgements

We thank Greg Boratyn for help in running the accuracy evaluations with the SCOP set. This research was supported by the intramural research program of the NIH, National Library of Medicine.

Appendix

Let
_{
i,j
} at each vertex
_{
i
} the ^{
th
} SALE (strict ascending ladder epoch) and
^{
th
} SALE score. Let
^{
th
} and ^{
th
} SALE scores.

Let
_{
i
} := _{
i
} − _{
i−1}, the incremental sequence length between (^{
th
} and ^{
th
} SALEs in sequence **I**, and Δ_{
j
} := _{
j
} − _{
j−1}, the incremental sequence length between (^{
th
} and ^{
th
} SALEs in sequence **J**. Last, we define

The formulas for computing $a_I$, $a_J$, $\alpha_I$, $\alpha_J$ and $\sigma$

where var^{*} and cov^{*} represent the variance and covariance associated with the probability measure underlying the expectation
_{
I
}
_{
J
}
_{
I
}
_{
J
} and