Characterization of oligopeptide patterns in large protein sets
1 IFM Bioinformatics, Linköping University, S-581 83 Linköping, Sweden
2 Department of Cell and Molecular Biology (CMB), Karolinska Institutet, S-171 77 Stockholm, Sweden
BMC Genomics 2007, 8:346 doi:10.1186/1471-2164-8-346Published: 1 October 2007
Recent sequencing projects and the growth of sequence data banks enable oligopeptide patterns to be characterized on a genome or kingdom level. Several studies have focused on kingdom or habitat classifications based on the abundance of short peptide patterns. There have also been efforts at local structural prediction based on short sequence motifs. Oligopeptide patterns undoubtedly carry valuable information content. Therefore, it is important to characterize these informational peptide patterns to shed light on possible new applications and the pitfalls implicit in neglecting bias in peptide patterns.
We have studied four classes of pentapeptide patterns (designated POP, NEP, ORP and URP) in the kingdoms archaea, bacteria and eukaryotes. POP are highly abundant patterns statistically not expected to exist; NEP are patterns that do not exist but are statistically expected to; ORP are patterns unique to a kingdom; and URP are patterns excluded from a kingdom. We used two data sources: the de facto standard of protein knowledge Swiss-Prot, and a set of 386 completely sequenced genomes. For each class of peptides we looked at the 100 most extreme and found both known and unknown sequence features. Most of the known sequence motifs can be explained on the basis of the protein families from which they originate.
We find an inherent bias of certain oligopeptide patterns in naturally occurring proteins that cannot be explained solely on the basis of residue distribution in single proteins, kingdoms or databases. We see three predominant categories of patterns: (i) patterns widespread in a kingdom such as those originating from respiratory chain-associated proteins and translation machinery; (ii) proteins with structurally and/or functionally favored patterns, which have not yet been ascribed this role; (iii) multicopy species-specific retrotransposons, only found in the genome set. These categories will affect the accuracy of sequence pattern algorithms that rely mainly on amino acid residue usage. Methods presented in this paper may be used to discover targets for antibiotics, as we identify numerous examples of kingdom-specific antigens among our peptide classes. The methods may also be useful for detecting coding regions of genes.