BMC Bioinformatics

Research article

Testing statistical significance scores of sequence comparison methods with structure similarity

Tim Hulsen1*, Jacob de Vlieg1,2, Jack AM Leunissen3 and Peter MA Groenen2

Author Affiliations

1 Centre for Molecular and Biomolecular Informatics (CMBI), Nijmegen Centre for Molecular Life Sciences (NCMLS), Radboud University Nijmegen Medical Centre, Nijmegen, The Netherlands

2 Molecular Design and Informatics, NV Organon, Oss, The Netherlands

3 Laboratory of Bioinformatics, Wageningen University and Research Centre, Wageningen, The Netherlands

BMC Bioinformatics 2006, 7:444 doi:10.1186/1471-2105-7-444

Published: 12 October 2006

Abstract

Background

In recent years the Smith-Waterman sequence comparison algorithm has gained popularity due to improved implementations and rapidly increasing computing power. However, the quality and sensitivity of a database search are determined not only by the algorithm but also by the statistical significance testing of each alignment. The e-value is the most commonly used statistical validation method for sequence database searching. The CluSTr database and the Protein World database have been created using an alternative statistical significance test: a Z-score based on Monte-Carlo statistics. Several papers have described the superiority of the Z-score over the e-value, using simulated data. We were interested in whether this could be validated when applied to existing, evolutionarily related protein sequences.
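As an illustration of the Monte-Carlo Z-score idea mentioned above, the following minimal sketch scores a pair of sequences, rescores the query against shuffled copies of the subject, and expresses the observed score in standard deviations above the shuffled mean. It is not the CluSTr or Protein World implementation: the function names, the toy match/mismatch scoring with a linear gap penalty, the choice to shuffle only the subject, and the number of shuffles are all assumptions made for illustration. Production tools typically use a substitution matrix and affine gap penalties.

```python
import random
import statistics


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (toy scoring, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def monte_carlo_z_score(query, subject, n_shuffles=100, seed=0):
    """Z = (observed score - mean shuffled score) / sd of shuffled scores.

    Hypothetical helper: shuffling preserves the subject's composition
    and length while destroying its residue order.
    """
    rng = random.Random(seed)
    observed = smith_waterman_score(query, subject)
    residues = list(subject)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        shuffled_scores.append(smith_waterman_score(query, "".join(residues)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.pstdev(shuffled_scores)
    if sd == 0:
        return 0.0  # degenerate case: all shuffles scored identically
    return (observed - mean) / sd
```

Each Z-score requires `n_shuffles` additional alignments per sequence pair, which is why the method is far more compute-intensive than an e-value derived from a single alignment score.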

Results

All experiments are performed on the ASTRAL SCOP database. The Smith-Waterman sequence comparison algorithm is evaluated with both e-value and Z-score statistics, using ROC, CVE and AP measures. The BLAST and FASTA algorithms are used as references. We find that two out of three Smith-Waterman implementations with e-value are better at predicting structural similarities between proteins than the Smith-Waterman implementation with Z-score. SSEARCH in particular achieves very high scores.
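To make two of the evaluation measures concrete, the sketch below computes an average precision (AP) and a truncated ROC score for a ranked hit list in which each hit is labelled as structurally related or not. It assumes the standard information-retrieval definition of AP and the common ROC-n definition (area under the ROC curve up to the n-th false positive); the exact definitions used in the paper may differ in detail, and the function names and label encoding are hypothetical.

```python
def average_precision(labels):
    """Mean of the precision values at each true hit.

    labels: ranked hits, best first; True means the hit shares the
    query's structural class (an assumed encoding).
    """
    hits = 0
    precisions = []
    for rank, is_true in enumerate(labels, start=1):
        if is_true:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def roc_n(labels, n):
    """Area under the ROC curve up to the n-th false positive, in [0, 1]."""
    total_true = sum(1 for is_true in labels if is_true)
    if total_true == 0 or n <= 0:
        return 0.0
    true_seen = 0
    false_seen = 0
    area = 0
    for is_true in labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen  # true hits ranked above this false positive
            if false_seen == n:
                break
    # If the list holds fewer than n false positives, treat the missing
    # ones as ranked below every true hit.
    area += (n - false_seen) * true_seen
    return area / (n * total_true)
```

Both measures reward rankings that place structurally related hits ahead of unrelated ones, so they can be applied unchanged to hit lists sorted by e-value or by Z-score.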

Conclusion

The compute-intensive Z-score does not have a clear advantage over the e-value. The Smith-Waterman implementations generally give better results than their heuristic counterparts. We recommend using the SSEARCH algorithm combined with e-values for pairwise sequence comparisons.