Open Access Technical Note

New finite-size correction for local alignment score distributions

Yonil Park, Sergey Sheetlin, Ning Ma, Thomas L Madden and John L Spouge*

Author Affiliations

National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD, 20894, USA

BMC Research Notes 2012, 5:286  doi:10.1186/1756-0500-5-286

The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1756-0500/5/286


Received: 30 March 2012
Accepted: 16 May 2012
Published: 12 June 2012

© 2012 Park et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Local alignment programs often calculate the probability that a match occurred by chance. The calculation of this probability may require a “finite-size” correction to the lengths of the sequences, as an alignment that starts near the end of either sequence may run out of sequence before achieving a significant score.

Findings

We present an improved finite-size correction that considers the distribution of sequence lengths rather than simply the corresponding means. This approach improves sensitivity and avoids substituting an ad hoc length for short sequences that can underestimate the significance of a match. We use a test set derived from ASTRAL to show improved ROC scores, especially for shorter sequences.

Conclusions

The new finite-size correction improves the calculation of probabilities for a local alignment. It is now used in the BLAST+ package and at the NCBI BLAST web site (http://blast.ncbi.nlm.nih.gov).

Background

Local alignments are an essential tool for biologists and often provide the first information about the function of an unknown nucleotide or protein sequence. An important question concerns the relationship between the score of a local alignment and the probability that the alignment occurred by chance. Karlin and Altschul [1] developed an asymptotic theory for local alignments, assuming that no gaps are permitted. For two random sequences I and J of lengths m and n, respectively, the resulting distribution of the optimal alignment score $S$ approximates a Gumbel distribution [2]:

$$P(S \ge y) \approx 1 - \exp\left(-k\,m\,n\,e^{-\lambda y}\right)$$

(1)

The two statistical parameters in Equation (1) are λ, the scale parameter, and k, the pre-factor.
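As a minimal sketch of how Equation (1) is evaluated, assume it has the standard Gumbel form p ≈ 1 − exp(−k m n e^{−λy}). The function name and the parameter values in the usage note are illustrative assumptions, not values taken from this note.

```python
import math


def gumbel_pvalue(y, m, n, lam, k):
    """Approximate P(S >= y) for sequences of lengths m and n.

    Assumes the Gumbel form 1 - exp(-k*m*n*exp(-lam*y)); lam is the
    scale parameter and k the pre-factor.
    """
    # Expected number of chance alignments scoring at least y.
    expect = k * m * n * math.exp(-lam * y)
    # -expm1(-E) equals 1 - exp(-E) but stays accurate when E is tiny.
    return -math.expm1(-expect)


if __name__ == "__main__":
    # Placeholder parameters chosen only for illustration.
    print(gumbel_pvalue(y=60, m=100, n=100, lam=0.267, k=0.041))
```

For small p-values the result is nearly equal to the expected count k m n e^{−λy}, which is why a change in the effective area mn translates almost directly into a change in the p-value.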

Several authors [3-12] extended this framework to local alignments with gaps and showed that the Gumbel distribution from Equation (1) is still valid, though different values for λ and k are required. Altschul [13] discussed the need for a “finite-size correction” to the lengths m and n to improve the accuracy of Equation (1). The resulting statistics are an integral part of the Basic Local Alignment Search Tool (BLAST) [14].

The following presentation emphasizes intuition over mathematical formality, to explain how the finite-size correction can account for the finite sequence lengths m and n to improve the accuracy of Equation (1). Let us begin with an optimal local alignment, which starts from score 0 and requires a non-zero sequence length within both I and J before it achieves score y. Let $L_I(y)$ ($L_J(y)$) be the required random length within I (J), and let $l_I(y) = \mathbb{E}[L_I(y)]$ ($l_J(y) = \mathbb{E}[L_J(y)]$) be the corresponding means. The main idea is that the optimal local alignment cannot start just anywhere along the full length m (n) of sequence I (J), because there might be insufficient sequence remaining to permit it to achieve the score y (Figure 1). The finite-size correction described in [13] and used in BLAST therefore replaced the area mn of the alignment matrix in Equation (1) by

$$[m - l_I(y)]\,[n - l_J(y)]$$

(2)

Figure 1. Sequence alignment graph of two random sequences I and J of lengths m and n, respectively. The black circles are the initiation vertices of local alignment paths that just remain within the large rectangle of the sequence alignment graph before achieving the score y; i.e., the lower local alignment path has length $L_I(y)$ within I, and the upper, length $L_J(y)$ within J. The gray shaded area is therefore the (random) alignment rectangle within which an optimal local alignment must start to achieve the score y. Thus, if the black circles lie within the gray rectangle, the alignments achieve the score y within the m × n alignment rectangle. If the black circles lie outside the gray rectangle, the alignments do not achieve the score y within the m × n alignment rectangle.

Equation (2) approximates the area within the alignment matrix where the optimal local alignment can start and on average still have enough space to exceed the score y. If $m < l_I(y)$ or $n < l_J(y)$, however, the corresponding factor in Equation (2) becomes negative. The BLAST code for the old finite-size correction therefore set the corrected sequence length to an ad hoc value (typically 1). For very short query or database sequences, the ad hoc correction could underestimate the significance of an alignment by many orders of magnitude.
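The behaviour described above can be sketched as follows, assuming Equation (2) is the product of the two mean-corrected lengths and that a corrected length falling below the ad hoc value is replaced by it. The function name, the `floor` parameter, and the exact clamping rule are assumptions made for illustration; they are not taken from the BLAST source.

```python
def old_corrected_area(m, n, l_i, l_j, floor=1.0):
    """Old finite-size correction: (m - l_I(y)) * (n - l_J(y)).

    l_i and l_j are the mean lengths l_I(y) and l_J(y). A corrected
    length below `floor` is replaced by `floor` (hypothetical rule that
    mimics the ad hoc value described in the text).
    """
    m_eff = max(m - l_i, floor)
    n_eff = max(n - l_j, floor)
    return m_eff * n_eff


if __name__ == "__main__":
    # A 30-residue query needing 40 residues on average collapses to length 1.
    print(old_corrected_area(30, 200, 40, 40))
    # A 300-residue query is corrected smoothly.
    print(old_corrected_area(300, 200, 40, 40))
```

In the first call the effective query length is clamped to the ad hoc value, so the area no longer depends on how short the query actually is. That discontinuity is what the new correction removes.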

The purpose of this note is to present a new finite-size correction formula for the BLAST statistics. It avoids the ad hoc correction and improves on the old formula by considering the (approximately normal) distributions of the random lengths $L_I(y)$ and $L_J(y)$ explicitly, and not just the corresponding means $l_I(y)$ and $l_J(y)$. We demonstrate below that the new finite-size correction is better than the older one, both in theory and in practice. All BLAST+ protein-protein applications (e.g., BLASTP, BLASTX) use the new finite-size correction by default, starting with version 2.2.26.

Findings

New finite-size correction

As with the old finite-size correction, the expectations $l_I(y) = \mathbb{E}[L_I(y)]$ and $l_J(y) = \mathbb{E}[L_J(y)]$ are approximated linearly:

$$l_I(y) \approx a_I\,y + b_I, \qquad l_J(y) \approx a_J\,y + b_J$$

(3)

Most practical scoring systems are symmetric, with s(A, B) = s(B, A) for any two letters A and B. For a symmetric scoring matrix and symmetric sequence compositions, expectations corresponding to I and J are the same, e.g., $l_I(y) = l_J(y) = l(y)$. For asymmetric scoring systems or asymmetric sequence compositions, however, the variates $L_I(y)$ and $L_J(y)$ can have different distributions, so the following retains the subscripts I and J.

The new finite-size correction replaces mn in Equation (1) by

$$\mathbb{E}\left\{[m - L_I(y)]^{+}\,[n - L_J(y)]^{+}\right\}$$

(4)

where $x^{+} = \max\{x, 0\}$. Whereas Equation (2) substitutes the expectations of $L_I(y)$ and $L_J(y)$ for the random lengths, Equation (4) is the expected area within the alignment rectangle where an optimal local alignment can start and have enough random sequence length to reach the score y (Figure 1).

The practical computation of Equation (4) approximates the distribution of $(L_I(y), L_J(y))$ with a bivariate normal distribution, with means $\mathbb{E}[L_I(y)] = l_I(y)$ and $\mathbb{E}[L_J(y)] = l_J(y)$, variances $\operatorname{var} L_I(y) = v_I(y)$ and $\operatorname{var} L_J(y) = v_J(y)$, and covariance $\operatorname{cov}(L_I(y), L_J(y)) = c(y)$, all assumed to be linear in the score y, i.e.,

$$l_I(y) = a_I y + b_I, \quad l_J(y) = a_J y + b_J, \quad v_I(y) = \alpha_I y + \beta_I, \quad v_J(y) = \alpha_J y + \beta_J, \quad c(y) = \sigma y + \tau$$

(5)
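To make the expected-area interpretation concrete, the sketch below estimates E{[m − L_I(y)]⁺ [n − L_J(y)]⁺} by plain Monte Carlo sampling, assuming (L_I(y), L_J(y)) is bivariate normal with the given means, variances, and covariance. The function name, sample size, and numerical inputs are illustrative assumptions; this is a brute-force check, not the method used in BLAST.

```python
import math
import random


def expected_start_area_mc(m, n, mean_i, mean_j, var_i, var_j, cov,
                           samples=100_000, seed=0):
    """Monte Carlo estimate of E[(m - L_I)^+ * (n - L_J)^+].

    (L_I, L_J) is sampled from a bivariate normal distribution built
    from two independent standard normals.
    """
    rng = random.Random(seed)
    sd_i = math.sqrt(var_i)
    sd_j = math.sqrt(var_j)
    rho = cov / (sd_i * sd_j)  # correlation, must lie in [-1, 1]
    if not -1.0 <= rho <= 1.0:
        raise ValueError("covariance inconsistent with the variances")
    total = 0.0
    for _ in range(samples):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        l_i = mean_i + sd_i * z1
        l_j = mean_j + sd_j * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
        # Only start positions that leave enough sequence contribute.
        total += max(m - l_i, 0.0) * max(n - l_j, 0.0)
    return total / samples


if __name__ == "__main__":
    # Short sequences: the mean-based product would be (40-50)*(40-50),
    # a meaningless value, while the expected area is small and positive.
    print(expected_start_area_mc(40, 40, 50.0, 50.0, 100.0, 100.0, 20.0))
```

When m and n are much larger than the required lengths, the positive parts never bind and the estimate approaches (m − l_I)(n − l_J) + c, so the two corrections agree for long sequences.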

The estimation of the parameters $a_I$, $a_J$, $\alpha_I$, $\alpha_J$, and $\sigma$ has mathematical depth and rests on many unproved conjectures. It involves heuristic modeling of a random sequence alignment with Markov additive processes [15] and ultimately uses the renewal-reward theorem. The Appendix presents formulas for computing $a_I$, $a_J$, $\alpha_I$, $\alpha_J$, and $\sigma$.

BLAST p-values are relatively insensitive to the values of the intercepts $b_I$, $b_J$, $\beta_I$, $\beta_J$, and $\tau$, so the practical computation approximates them as follows. Let $a_u$ ($\alpha_u$) be the value of $a_I$ ($\alpha_I$) for ungapped alignment. The mathematical theories for random walks and for renewals yield analytic formulas for $a_u$ and $\alpha_u$ [16]. For an ungapped optimal alignment, the alignment length required to exceed the score y is the same within the sequences I and J, because the alignment lacks gaps. Thus, $a_u$ and $\alpha_u$ do not depend on the sequence (I or J) under consideration, so they carry no subscripts I or J. In a gapped alignment, let a gap of length 1 incur a penalty G. The following uncontrolled approximations hold [17]:

$$b_I \approx 2G(a_u - a_I), \quad b_J \approx 2G(a_u - a_J), \quad \beta_I \approx 2G(\alpha_u - \alpha_I), \quad \beta_J \approx 2G(\alpha_u - \alpha_J), \quad \tau \approx 2G(\alpha_u - \sigma)$$

(6)
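A small sketch of the intercept approximation follows. It assumes that Equation (6) has the form b ≈ 2G(a_u − a), β ≈ 2G(α_u − α), and τ ≈ 2G(α_u − σ), which is our reading of the approach in [17]. The function name and all numerical inputs are hypothetical placeholders.

```python
def approximate_intercepts(G, a_u, alpha_u, a_i, a_j, alpha_i, alpha_j, sigma):
    """Approximate the intercepts of Equation (5) from the slopes.

    Assumed form: intercept = 2 * G * (ungapped slope - gapped slope),
    where G is the penalty for a gap of length 1.
    """
    return {
        "b_I": 2.0 * G * (a_u - a_i),
        "b_J": 2.0 * G * (a_u - a_j),
        "beta_I": 2.0 * G * (alpha_u - alpha_i),
        "beta_J": 2.0 * G * (alpha_u - alpha_j),
        "tau": 2.0 * G * (alpha_u - sigma),
    }


if __name__ == "__main__":
    # Placeholder values: gap penalty 11 + g gives G = 12 for g = 1.
    params = approximate_intercepts(G=12, a_u=0.8, alpha_u=5.0,
                                    a_i=1.9, a_j=1.9,
                                    alpha_i=42.6, alpha_j=42.6, sigma=43.6)
    for name, value in params.items():
        print(name, value)
```

Because gapped alignments need more sequence per unit of score than ungapped ones, the gapped slopes exceed the ungapped ones and the approximated intercepts come out negative under this form.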

Under the normal approximation, routine computation shows that Equation (4) is approximately

$$\left\{[m - l_I(y)]\,\Phi(z_I) + \sqrt{v_I(y)}\,\varphi(z_I)\right\}\left\{[n - l_J(y)]\,\Phi(z_J) + \sqrt{v_J(y)}\,\varphi(z_J)\right\} + c(y)\,\Phi(z_I)\,\Phi(z_J)$$

(7)

where $z_I = [m - l_I(y)]/\sqrt{v_I(y)}$, $z_J = [n - l_J(y)]/\sqrt{v_J(y)}$, $\varphi$ is the standard normal density, $\Phi(x) = P(X \le x)$, and X is a standard normal variate. The final product $\Phi(z_I)\,\Phi(z_J)$ is an uncontrolled independence approximation for the bivariate normal distribution.
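The closed form can be sketched in a few lines. The sketch assumes that Equation (7) is the product of two one-dimensional terms of the form (m − l)Φ(z) + √v φ(z), with z = (m − l)/√v, plus the covariance term c Φ(z_I)Φ(z_J). The one-dimensional term is the exact value of E[(m − L)⁺] for a normal variate L with mean l and variance v. Function names and inputs are illustrative assumptions.

```python
import math


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    """Standard normal distribution function P(X <= x)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def expected_positive_part(length, mean, var):
    """E[(length - L)^+] for L ~ Normal(mean, var)."""
    sd = math.sqrt(var)
    z = (length - mean) / sd
    return (length - mean) * normal_cdf(z) + sd * normal_pdf(z)


def new_corrected_area(m, n, mean_i, mean_j, var_i, var_j, cov):
    """Assumed form of Equation (7): product of marginal terms plus a
    covariance term weighted by an independence approximation."""
    z_i = (m - mean_i) / math.sqrt(var_i)
    z_j = (n - mean_j) / math.sqrt(var_j)
    marginal = (expected_positive_part(m, mean_i, var_i)
                * expected_positive_part(n, mean_j, var_j))
    return marginal + cov * normal_cdf(z_i) * normal_cdf(z_j)


if __name__ == "__main__":
    # Short sequences still receive a small positive area.
    print(new_corrected_area(40, 40, 50.0, 50.0, 100.0, 100.0, 20.0))
    # Long sequences recover (m - l_I) * (n - l_J) + c.
    print(new_corrected_area(1000, 1000, 50.0, 50.0, 100.0, 100.0, 20.0))
```

Unlike a clamped product of means, this area decreases smoothly to zero as the sequences shorten, so no ad hoc replacement length is needed.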

Comparison of p-values for the new and old finite-size corrections

We compared p-values for the new finite-size correction with those for the old finite-size correction using the BLOSUM62 scoring matrix and an affine gap penalty of 11 + g for a gap of length g. Hartmann used a rare-event simulation method to compute the local alignment score distribution over ranges that included small p-values such as $p = 10^{-50}$ [18], thereby producing a theoretical standard for small p-values.

Figure 2 plots relative errors on a logarithmic scale against true p-values for equal sequence lengths m = n = 40, 100, 200, and 400. Measured against Hartmann’s theoretical standard, the new finite-size correction outperforms the old one as the p-value decreases, sometimes by orders of magnitude.

Figure 2. Comparison of p-values for the new and old finite-size corrections using the BLOSUM62 scoring matrix and 11 + g affine gap penalty for equal sequence lengths (m = n) 40, 100, 200, and 400. Figure 2 plots $\log(\hat{p}/p)$ against p, where $\hat{p}$ is the calculated p-value and p is the p-value from the simulation. Thus, a perfect p-value estimate corresponds to the horizontal line y = 0. The red line shows the results from the new finite-size correction; the black line, the old finite-size correction.

Evaluation of accuracy

We evaluated the performance of the new finite-size correction using the ASTRAL SCOP 40 subset [19] of release 1.75 of the Structural Classification of Proteins (SCOP) [20] database. We sorted the SCOP domains in lexicographic order and used the even-numbered sequences as our query set, but removed any query that was the sole member of its superfamily in ASTRAL 40. For a given query sequence, we considered any database sequence belonging to the same SCOP superfamily as a true positive, and any database sequence belonging to a different SCOP fold as a false positive. Following [21], in the retrieval list for each query, we censored all sequences belonging to the same fold but a different superfamily, so those sequences contributed neither true nor false positives to the retrieval.
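The labelling rule above can be sketched as a small function. It assumes each domain carries a SCOP concise classification string of the form class.fold.superfamily.family (for example "a.4.5.1"). The function name and the label strings are illustrative assumptions.

```python
def classify_hit(query_sccs, hit_sccs):
    """Label a database hit relative to a query by SCOP classification.

    Same superfamily -> true positive; same fold but different
    superfamily -> censored; different fold -> false positive.
    """
    query = query_sccs.split(".")
    hit = hit_sccs.split(".")
    if query[:3] == hit[:3]:   # class, fold, and superfamily all match
        return "true_positive"
    if query[:2] == hit[:2]:   # class and fold match, superfamily differs
        return "censored"
    return "false_positive"


if __name__ == "__main__":
    for hit in ("a.4.5.28", "a.4.1.1", "b.1.1.1"):
        print(hit, classify_hit("a.4.5.1", hit))
```

Censored hits are simply dropped from the ranked list before scoring, so they neither reward nor penalize a method.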

We report the performance in terms of the Receiver Operating Characteristic (ROC). Specifically, we report the ROCn score, which is obtained by pooling the results of all queries, ordering them by expect value, and keeping only the results up to the n-th false positive [21]. The expect value for the database search was obtained from the pairwise p-values using a length-proportional correction that takes the ratio of the database length to the target sequence length into account [13].
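A minimal sketch of a ROCn computation follows, assuming the commonly used definition ROCn = (1/(nT)) Σ t_i, where t_i is the number of true positives ranked ahead of the i-th false positive and T is the total number of true positives. The function name, the input layout, and the handling of lists with fewer than n false positives are assumptions for illustration.

```python
def roc_n(hits, n, total_true=None):
    """ROC_n score for pooled hits.

    hits: iterable of (expect_value, is_true_positive) pairs pooled over
    all queries, with censored hits already removed.
    total_true: total number of true positives T; defaults to the number
    of true positives present in `hits`.
    """
    ranked = sorted(hits, key=lambda hit: hit[0])  # best expect value first
    if total_true is None:
        total_true = sum(1 for _, is_tp in ranked if is_tp)
    if n <= 0 or total_true == 0:
        raise ValueError("need n > 0 and at least one true positive")
    true_seen = 0
    false_seen = 0
    area = 0
    for _, is_tp in ranked:
        if is_tp:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen  # t_i for the i-th false positive
            if false_seen == n:
                break
    # Assumed convention: if the list ends early, the missing false
    # positives are credited with every true positive found.
    area += (n - false_seen) * true_seen
    return area / (n * total_true)


if __name__ == "__main__":
    pooled = [(1e-10, True), (1e-5, True), (0.1, False), (0.5, True), (2.0, False)]
    print(roc_n(pooled, n=2))
```

A score of 1 means every true positive ranks ahead of the first false positive. Ties in expect value are broken here by input order, which a careful evaluation would want to control.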

As discussed above, the new finite-size correction should show the greatest improvement for short sequences. Therefore, we also produced ROCn scores for different subsets of the SCOP database. One database subset has sequences shorter than the 25th percentile length (95 residues), and another has sequences shorter than the 50th percentile length (137 residues).

Table 1 presents ROCn scores for the full database as well as the two subsets described above. We set n = 4852, which allows an average of one false positive per query (4852 queries), a threshold found useful in other studies (Altschul SF, private communication). The ROC-4852 scores for the full database demonstrate a small improvement of the new finite-size correction over the older one. The subsets show a larger improvement. For the 50th percentile subset, the ROC-4852 score improves by 9%. For the 25th percentile subset, the ROC-4852 score shows a 13% improvement. In the 25th percentile subset, the new finite-size correction produces roughly 12% more true positives overall at 4852 false positives than the old finite-size correction (Figure 3). These results confirm our expectation that the new finite-size correction displays the greatest improvement in retrieval for short sequences.

Table 1. Retrieval accuracy for different subsets of SCOP database with the new and old finite-size correction

Figure 3. Number of true positives vs. number of false positives for both new and old finite-size corrections using short SCOP sequences as a database. The searched database was created from the shortest 25% of the ASTRAL 40 sequences for SCOP version 1.75 (see text).

To assess the significance of this improvement on BLAST searches, one may look to the length distribution of sequences in a heavily used protein BLAST database. The non-redundant (“nr”) database is the default protein database at the NCBI BLAST web site. Of the sequences in the nr database, 11% are 95 residues or shorter; and 21%, 137 residues or shorter. The new finite-size correction improves the retrieval accuracy for a noticeable fraction of the proteins in the nr database.

Conclusion

We have described a new finite-size correction. The new correction has a more rigorous derivation than the old finite-size correction and avoids the use of an ad hoc value for short sequences. We have tested the retrieval accuracy of the new finite-size correction on the gold-standard SCOP set, and have shown that the improvement is most important for short sequences. This correction has been made part of the BLAST+ protein-protein applications (e.g., BLASTP, BLASTX) and is used at the NCBI BLAST web site. In the future, we plan to implement this correction for nucleotide-nucleotide comparisons.

Availability and requirements

Project Name: BLAST Statistical Parameters

Project home page: http://www.ncbi.nlm.nih.gov/CBBresearch/Spouge/html_ncbi/html/blast/

Operating systems: Windows, MacOSX, LINUX, UNIX

Programming language: C++

License: Public Domain (see http://www.ncbi.nlm.nih.gov/books/NBK22952/)

Any restrictions to use by non-academics: None

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

YP, TM and JS drafted the manuscript. YP designed the p-value evaluation method. SS implemented the new finite-size correction. NM integrated the correction into the BLAST+ code, ran tests, and calculated the ROC scores. JS devised the new finite-size correction. YP and SS are equal contribution first authors for this article. TLM and JLS are equal contribution last authors for this article. All authors read and approved the final manuscript.

Acknowledgements

We thank Greg Boratyn for help in running the accuracy evaluations with the SCOP set. This research was supported by the intramural research program of the NIH, National Library of Medicine.

Appendix

Let [equation M19: http://www.biomedcentral.com/1756-0500/5/286/mathml/M19]. Consider two semi-infinite random sequences, yielding alignment scores $S_{i,j}$ at each vertex [equation M20: http://www.biomedcentral.com/1756-0500/5/286/mathml/M20] within their alignment graph. Define the edge maximum score [equation M21: http://www.biomedcentral.com/1756-0500/5/286/mathml/M21]. Let [equation M22: http://www.biomedcentral.com/1756-0500/5/286/mathml/M22] and [equation M23: http://www.biomedcentral.com/1756-0500/5/286/mathml/M23] for i ≥ 1. We call $\kappa_i$ the ith SALE (strict ascending ladder epoch) and [equation M24: http://www.biomedcentral.com/1756-0500/5/286/mathml/M24] the ith SALE score. Let [equation M25: http://www.biomedcentral.com/1756-0500/5/286/mathml/M25], the increment between the (i − 1)th and ith SALE scores.

Let [equation M26: http://www.biomedcentral.com/1756-0500/5/286/mathml/M26] and [equation M27: http://www.biomedcentral.com/1756-0500/5/286/mathml/M27]. We also define [equation M28: http://www.biomedcentral.com/1756-0500/5/286/mathml/M28] and [equation M29: http://www.biomedcentral.com/1756-0500/5/286/mathml/M29]. Let $\Delta I_i := I_i - I_{i-1}$, the incremental sequence length between the (i − 1)th and ith SALEs in sequence I, and $\Delta J_j := J_j - J_{j-1}$, the incremental sequence length between the (j − 1)th and jth SALEs in sequence J. Last, we define [equation M30: http://www.biomedcentral.com/1756-0500/5/286/mathml/M30], [equation M31: http://www.biomedcentral.com/1756-0500/5/286/mathml/M31], and [equation M32: http://www.biomedcentral.com/1756-0500/5/286/mathml/M32].

The formulas for computing $a_I$, $a_J$, $\alpha_I$, $\alpha_J$, and $\sigma$ are:

[equation M33: http://www.biomedcentral.com/1756-0500/5/286/mathml/M33]

(8)

where var* and cov* represent the variance and covariance associated with the probability measure underlying the expectation [equation M34: http://www.biomedcentral.com/1756-0500/5/286/mathml/M34]. In practice, for computational efficiency, we use importance sampling to estimate the parameters above [15]. The parameters are estimated separately at each SALE. We then apply asymptotic regression to estimate the values of $a_I$, $a_J$, $\alpha_I$, $\alpha_J$, and $\sigma$ as i → ∞ in the equation above [22].

References

  1. Karlin S, Altschul SF: Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci U S A 1990, 87(6):2264-2268.

  2. Galambos J: The asymptotic theory of extreme order statistics. New York: Wiley; 1978.

  3. Mott R: Maximum-likelihood-estimation of the statistical distribution of Smith-Waterman local sequence similarity scores. Bull Math Biol 1992, 54(1):59-75.

  4. Waterman MS, Vingron M: Rapid and accurate estimates of statistical significance for sequence data base searches. Proc Natl Acad Sci U S A 1994, 91(11):4625-4628.

  5. Altschul SF, Gish W: Local alignment statistics. Methods Enzymol 1996, 266:460-480.

  6. Bundschuh R: Rapid significance estimation in local sequence alignment with gaps. J Comput Biol 2002, 9(2):243-260.

  7. Chia N, Bundschuh R: A practical approach to significance assessment in alignment with gaps. J Comput Biol 2006, 13(2):429-441.

  8. Newberg LA: Significance of gapped sequence alignments. J Comput Biol 2008, 15(9):1187-1194.

  9. Agrawal A, Brendel VP, Huang X: Pairwise statistical significance and empirical determination of effective gap opening penalties for protein local sequence alignment. Int J Computat Biol Drug Des 2008, 1(4):347-367.

  10. Poleksic A: Island method for estimating the statistical significance of profile-profile alignment scores. BMC Bioinformatics 2009, 10:112.

  11. Ortet P, Bastien O: Where does the alignment score distribution shape come from? Evol Bioinformatics 2010, 6:159-187.

  12. Agrawal A, Huang X: Pairwise statistical significance of local sequence alignment using sequence-specific and position-specific substitution matrices. IEEE/ACM Trans Comput Biol Bioinformatics 2011, 8(1):194-205.

  13. Altschul SF: Evaluating the statistical significance of multiple distinct local alignments. In Theoretical and computational methods in genome research. Edited by Suhai S. New York: Plenum Press; 1997:1-14.

  14. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389-3402.

  15. Park Y, Sheetlin S, Spouge JL: Estimating the gumbel scale parameter for local alignment of random sequences by importance sampling with stopping times. Ann Stat 2009, 37(6A):3697-3714.

  16. Asmussen S: Applied probability and queues. New York: Springer; 2003.

  17. Altschul SF, Bundschuh R, Olsen R, Hwa T: The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Res 2001, 29(2):351-361.

  18. Hartmann AK: Sampling rare events: statistics of local sequence alignments. Phys Rev E 2002, 65(5).

  19. Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE: The ASTRAL Compendium in 2004. Nucleic Acids Res 2004, 32:D189-D192.

  20. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP - a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247(4):536-540.

  21. Schaffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, Wolf YI, Koonin EV, Altschul SF: Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 2001, 29(14):2994-3005.

  22. Sheetlin S, Park Y, Spouge JL: Objective method for estimating asymptotic parameters, with an application to sequence alignment. Phys Rev E 2011, 84(3).