School of Life Sciences Research, University of Dundee, Dow Street, Dundee, DD1 5EH, Scotland, UK

Bioinformatics Centre, Institute of Microbial Technology, Sector 39A, Chandigarh, India

This work was initiated when both authors were at the University of Oxford, Laboratory of Molecular Biophysics, Rex Richards Building, Oxford, OX1 3QU, UK

Abstract

Background

Percentage Identity (PID) is frequently quoted in discussion of sequence alignments since it appears simple and easy to understand. However, although there are several different ways to calculate percentage identity and each may yield a different result for the same alignment, the method of calculation is rarely reported. Accordingly, quantification of the variation in PID caused by the different calculations would help in interpreting PID values in the literature. In this study, the variation in PID was quantified systematically on a reference set of 1028 alignments generated by comparison of the protein three-dimensional structures. Since the alignment algorithm may also affect the range of PID, this study also considered the effect of algorithm, and the combination of algorithm and PID method.

Results

The maximum variation in PID due to the calculation method was 11.5% while the effect of alignment algorithm on PID was up to 14.6% across three popular alignment methods. The combined effect of alignment algorithm and PID calculation gave a variation of up to 22% on the test data, with an average of 5.3% ± 2.8% for sequence pairs with < 30% identity. In order to see which PID method was most highly correlated with structural similarity, four different PID calculations were compared to similarity scores (Sc) from the comparison of the corresponding protein three-dimensional structures. The highest correlation coefficient for a PID calculation was 0.80. In contrast, the more sophisticated Z-score calculated by reference to randomized sequences gave a correlation coefficient of 0.84.

Conclusion

Although it is well known amongst expert sequence analysts that PID is a poor score for discriminating between protein sequences, the apparent simplicity of the percentage identity score encourages its widespread use in establishing cutoffs for structural similarity. This paper illustrates that not only is PID a poor measure of sequence similarity when compared to the Z-score, but that there is also a large uncertainty in reported PID values. Since better alternatives to PID exist to quantify sequence similarity, these should be quoted where possible in preference to PID. The findings presented here should prove helpful to those new to sequence analysis, and in warning those who seek to interpret the value of a PID reported in the literature.

Background

Statistical measures of sequence similarity are routinely applied to quantify the results of sequence database searches

Recently, May

Results

Range of percentage identity seen for different PID calculations

Of the 1028 aligned pairs, only 20 had the same value for all four percentage identity measures. Differences in PID of between 2% and 5% were seen for 711 pairs, and 87 pairs differed by more than 5%. The greatest difference seen was 11.5%.

The difference between the maximum and minimum PID decreases slightly with increasing minimum PID. Thus, the average difference in PID for alignments with a minimum PID ≤ 30% was 3.3 ± 1.5%, while the average difference for alignments with a minimum PID > 30% was 2.2 ± 1.5%.
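The per-alignment difference and its average in the two PID ranges can be sketched as follows. This is a minimal illustration rather than the analysis code used in the study: the function names, the use of the sample standard deviation for the "±" term, and the toy input values are all assumptions.

```python
from statistics import mean, stdev


def pid_spread(pids):
    """Difference between the largest and smallest PID for one alignment."""
    return max(pids) - min(pids)


def summarise_spread(alignments, threshold=30.0):
    """Average spread for alignments with minimum PID <= threshold and > threshold.

    `alignments` is a list of PID lists, one list (PID1..PID4) per alignment.
    Returns {"low": (mean, sd), "high": (mean, sd)}; a bin with no
    alignments is reported as None. The sd is the sample standard
    deviation (an assumption), or 0.0 when the bin holds a single value.
    """
    low, high = [], []
    for pids in alignments:
        target = low if min(pids) <= threshold else high
        target.append(pid_spread(pids))

    def mean_sd(values):
        if not values:
            return None
        return mean(values), (stdev(values) if len(values) > 1 else 0.0)

    return {"low": mean_sd(low), "high": mean_sd(high)}


if __name__ == "__main__":
    # Made-up PID values for three alignments, for illustration only.
    toy = [
        [20.0, 25.0, 22.0, 21.0],
        [28.0, 30.0, 29.0, 27.0],
        [50.0, 52.0, 51.0, 50.0],
    ]
    print(summarise_spread(toy))
```

Binning on the minimum PID, rather than on any single measure, follows the wording of the paragraph above.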

PID2 was always the largest, since it considers only the aligned positions. PID4 was ≤ PID1 for most of the pairs. Differences between PID4 and PID1 were observed in pairs where one sequence overhangs at the N-terminus and the other at the C-terminus. For most of the alignments, PID3 was higher than PID1 or PID4. PID4 gave slightly more consistent values that were less prone to artefactually high or low values as a result of overhangs. PID4 also gave a slightly better correlation with structural similarity, as shown in the table below.

Correlation between PIDs and structural similarity score. Z: Z-score (also known as SD score) from randomisation.

| Alignment | Weight matrix | Gap penalty | PID1 | PID2 | PID3 | PID4 | Z | NAS |
|---|---|---|---|---|---|---|---|---|
| STAMP | - | - | 0.85 | 0.82 | 0.84 | 0.86 | - | - |
| AMPS | BLOSUM62 | 10 | 0.79 | 0.76 | 0.77 | 0.80 | 0.84 | 0.82 |

NAS: Normalised Alignment Score (see text for details).
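A correlation coefficient of this kind can be computed as sketched below. The text here does not state which coefficient was used, so the Pearson product-moment coefficient is an assumption, as are the function name and the toy values.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length inputs with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Made-up PID4 values and structural similarity scores (Sc),
    # one entry per aligned pair, for illustration only.
    pid4 = [18.0, 25.0, 33.0, 47.0, 62.0]
    sc = [3.1, 4.0, 5.2, 6.8, 8.1]
    print(round(pearson_r(pid4, sc), 3))
```

Applying the same function to each PID variant in turn, against the same list of Sc values, would give one coefficient per column of the table.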

Range of percentage identity seen for different alignment methods

Ideally, one would calculate the PID between two sequences from the comparison of the protein three-dimensional structures. In the absence of structures for both proteins, sequence alignment techniques must be applied. Since alignment of sequences is an optimisation based on the parameters and algorithm, the resulting alignment depends on these factors. Accordingly, the range of PID4 was examined for the reference structural alignment and for sequence alignments obtained by the AMPS

In order to understand the effect of the alignment algorithm on the PID, the same sequence pairs were aligned by AMPS

In the real-world situation where one is comparing PID values calculated in different ways by different algorithms, the results presented here suggest that the difference in PID will range between 0 and 21.8%. The average difference was 5.3 ± 2.8% for PID ≤ 30% and 2.7 ± 1.9% for PID > 30%.
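For a single sequence pair, the separate and combined effects can be illustrated with a small sketch. The nested-dictionary layout, the function names and the numbers are assumptions made for illustration; only the method names STAMP and AMPS come from the text.

```python
def combined_pid_range(pid_by_method):
    """Max minus min PID over every alignment method and every PID calculation."""
    values = [v for pids in pid_by_method.values() for v in pids.values()]
    if not values:
        raise ValueError("no PID values supplied")
    return max(values) - min(values)


def algorithm_range(pid_by_method, pid_name):
    """Max minus min of one PID calculation across alignment methods."""
    values = [pids[pid_name] for pids in pid_by_method.values()]
    return max(values) - min(values)


def calculation_range(pid_by_method, method):
    """Max minus min across PID calculations for one alignment method."""
    values = list(pid_by_method[method].values())
    return max(values) - min(values)


if __name__ == "__main__":
    # Made-up PID values for one sequence pair, for illustration only.
    toy = {
        "STAMP": {"PID1": 24.0, "PID4": 22.0},
        "AMPS": {"PID1": 30.0, "PID4": 27.5},
    }
    print(calculation_range(toy, "STAMP"))  # effect of calculation only
    print(algorithm_range(toy, "PID4"))     # effect of algorithm only
    print(combined_pid_range(toy))          # both effects together
```

The combined range can never be smaller than either individual range, which is consistent with the combined variation reported here exceeding each separate effect.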

Discussion

In this article it has been shown that the PID value was affected both by the way in which it was calculated, and by how the alignment was generated. While neither of these facts is particularly surprising, to our knowledge, this is the first time the range of PID has been reported for these effects. The combined effect of algorithm and calculation gave rise to differences in PID of up to 22%. Given these limitations, which PID calculation gave the most reliable estimate of similarity?

The STAMP structural comparison algorithm that was used to generate the reference alignments in this study provides a measure of structural similarity (Sc) which takes account of distance and conformational similarity, for each pair of proteins

where, V is the alignment score for two sequences, σ and
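A Z-score obtained by comparison to shuffled sequences can be sketched as below. This is a simplified illustration under stated assumptions, not the AMPS procedure: the score V of the real pair is compared with the mean and standard deviation of scores obtained after shuffling one of the two sequences, and a toy global alignment score (match 1, mismatch 0, gap -1) stands in for a BLOSUM62-based score. The function names, the number of shuffles and the choice of which sequence to shuffle are all hypothetical.

```python
import random
import statistics


def nw_score(a, b, match=1, mismatch=0, gap=-1):
    """Toy Needleman-Wunsch global alignment score with a linear gap cost."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffle_z_score(a, b, n_shuffles=100, seed=0, score=nw_score):
    """Z = (V - mean of shuffled scores) / sd of shuffled scores.

    Only sequence b is shuffled (an assumption); shuffling preserves its
    length and composition.
    """
    rng = random.Random(seed)
    v = score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        shuffled_scores.append(score(a, "".join(letters)))
    mu = statistics.mean(shuffled_scores)
    sigma = statistics.pstdev(shuffled_scores)
    if sigma == 0:
        raise ValueError("shuffled scores have zero variance; Z is undefined")
    return (v - mu) / sigma


if __name__ == "__main__":
    # Two short made-up sequences, for illustration only.
    print(shuffle_z_score("ACDEFGHIKLMNPQRSTVWY", "ACDEFGHIKLMNPQRSTVWY"))
```

Because the shuffled sequences keep the length and composition of the original, a score of this form corrects for both, which a raw percentage identity does not.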

Conclusion

In this paper we have quantified the variation in reported percentage identity seen in 1028 structural alignments, due to different denominators in the PID calculation and due to alignment method. The overall conclusions are:

1. The four different PID denominators considered gave up to 11.5% difference in PID on a single alignment in the test set.

2. Sequence alignments by three different methods resulted in variation of up to 14.6% PID on a single alignment in the test set.

3. Combination of PID calculation and alignment method led to variation of up to 22% PID on a single alignment in the test set.

4. PID calculations that take account of gaps (PID1 and PID4) were more highly correlated with the STAMP Sc score for structural similarity between the proteins than those that do not consider gaps (PID2 and PID3).

5. All PID calculations were less well correlated with the STAMP Sc score than the Z-score obtained by comparison to shuffled sequence scores.

These overall conclusions are not surprising to those expert in sequence analysis, but to our knowledge this is the first time that the variation in PID has been quantified explicitly. Quantification of the variation in PID is valuable, since although PID is a poor substitute for more sophisticated scoring methods that take account of the physico-chemical properties of the amino acids and correct for sequence length, PID remains widely quoted. The findings presented here should prove helpful to those new to sequence analysis, and as a guide to those who seek to interpret the value of a PID reported in the literature.

Methods

Test data set

Protein domain families were taken from the OxBench database of reference alignments

Calculation of percentage identity

For each reference structural alignment, the percentage identity was calculated in four different ways.

PID1 was calculated as described by Doolittle (1981):

PID2 only considers matched residues

PID3 only considers the shortest sequence

PID4 considers the shortest length (sequence plus gap positions).

Where the terms with subscripts A and B are, for each of the two sequences, the sum of the number of residues and internal gap positions.
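The four calculations can be sketched in Python as follows. The equations themselves are not reproduced above, so the denominators used here are one reading of the prose descriptions and should be treated as assumptions: PID1 divides by the aligned positions plus internal gap positions, PID2 by the aligned residue pairs, PID3 by the number of residues in the shorter sequence, and PID4 by the shorter of the two lengths that count residues plus internal gaps. The function names and the example alignment are hypothetical.

```python
def _residue_span(seq, gap="-"):
    """Return (first, last) alignment columns that hold a residue."""
    cols = [i for i, c in enumerate(seq) if c != gap]
    if not cols:
        raise ValueError("sequence contains no residues")
    return cols[0], cols[-1]


def pid_variants(a, b, gap="-"):
    """Four percentage identities for one gapped pairwise alignment.

    `a` and `b` are the two rows of the alignment, of equal length,
    with `gap` marking gap positions.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    a0, a1 = _residue_span(a, gap)
    b0, b1 = _residue_span(b, gap)
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no aligned residue pairs")
    identical = sum(1 for x, y in pairs if x == y)

    # Columns between the later start and the earlier end: aligned pairs
    # plus internal gaps, with terminal overhangs excluded.
    overlap = min(a1, b1) - max(a0, b0) + 1
    len_a = len(a) - a.count(gap)  # residues only
    len_b = len(b) - b.count(gap)
    span_a = a1 - a0 + 1           # residues plus internal gaps
    span_b = b1 - b0 + 1

    return {
        "PID1": 100.0 * identical / overlap,
        "PID2": 100.0 * identical / len(pairs),
        "PID3": 100.0 * identical / min(len_a, len_b),
        "PID4": 100.0 * identical / min(span_a, span_b),
    }


if __name__ == "__main__":
    # Made-up alignment: the first sequence overhangs at the N-terminus
    # and the second at the C-terminus.
    for name, value in pid_variants("ACDEFGHIK-", "-CDE-GHLKM").items():
        print(f"{name}: {value:.1f}")
```

On this example PID2 is the largest of the four and PID4 falls below PID1, as would be expected under these denominators when the two sequences overhang at opposite ends.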

In this study, all PID values were calculated over the complete alignment rather than the structurally conserved core. This reflects the situation when aligning two protein sequences where neither protein has a known three-dimensional structure and so the structurally conserved core is unknown.

Authors' contributions

GPSR ran the programs, wrote analysis code and drafted the paper. GJB conceived and directed the project, and finalised the analysis and manuscript.

Acknowledgements

This work was supported in part by a grant from the UK Biotechnology and Biological Sciences Research Council (BBSRC). GJB thanks the Royal Society and European Molecular Biology Laboratory for support, and Prof. L. N. Johnson for encouragement. We thank Iver Cooper, a US patent attorney, for asking the question that prompted this study.