Open Access Research article

Dynamics of domain coverage of the protein sequence universe

Bhanu Rekapalli (1), Kristin Wuichet (2,4), Gregory D Peterson (3) and Igor B Zhulin (1,2)*

Author affiliations

1 Joint Institute for Computational Sciences, Oak Ridge National Laboratory – University of Tennessee, Oak Ridge, TN, 37831, USA

2 Department of Microbiology, University of Tennessee, Knoxville, TN, 37996, USA

3 Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, 37996, USA

4 Present address: Max-Planck-Institute for Terrestrial Microbiology, Marburg, D-35043, Germany

Citation and License

BMC Genomics 2012, 13:634 doi:10.1186/1471-2164-13-634

Published: 16 November 2012

Abstract

Background

The currently known protein sequence space consists of millions of sequences in public databases and is rapidly expanding. Assigning sequences to families leads to a better understanding of protein function and the nature of the protein universe. However, a large portion of the current protein space remains unassigned and is referred to as its “dark matter”.

Results

Here we suggest that the true size of “dark matter” is much larger than stated by current definitions. We propose an approach to reducing the size of “dark matter” by identifying and subtracting regions in protein sequences that are not likely to contain any domain.

Conclusions

Recent improvements in computational domain modeling are decreasing the relative size of “dark matter”, albeit slowly; however, its absolute size increases substantially with the growth of sequence data.
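
The subtraction idea described in the Results can be illustrated with a minimal sketch. It assumes residue-level bookkeeping, with half-open coordinate intervals for domain hits and for regions judged unlikely to contain a domain. The abstract does not say how such regions are identified, so the function names, the interval representation, and the coordinates below are all hypothetical and are not the authors' method.

```python
from typing import List, Tuple

# Half-open [start, end) residue positions; a hypothetical representation.
Interval = Tuple[int, int]


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so no residue is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(intervals: List[Interval]) -> int:
    """Number of distinct residues covered by the intervals."""
    return sum(end - start for start, end in merge(intervals))


def dark_residues(seq_len: int,
                  domains: List[Interval],
                  excluded: List[Interval]) -> int:
    """Residues that are neither in a domain hit nor in an excluded region.

    Assumes every interval lies within [0, seq_len).
    """
    accounted = total_length(list(domains) + list(excluded))
    return seq_len - accounted


# Made-up example: a 300-residue protein with one domain hit and two
# regions assumed (hypothetically) unlikely to contain a domain.
domains = [(10, 110)]
excluded = [(100, 150), (250, 300)]
print(dark_residues(300, domains, []))        # unassigned before subtraction
print(dark_residues(300, domains, excluded))  # unassigned after subtraction
```

Summing this count over a whole database gives an absolute size for the unassigned space, and dividing by the total residue count gives a relative size. The two can move in opposite directions as the database grows.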