Open Access Highly Accessed Research article

Sifting through genomes with iterative-sequence clustering produces a large, phylogenetically diverse protein-family resource

Thomas J Sharpton1*, Guillaume Jospin2, Dongying Wu2,7, Morgan GI Langille3, Katherine S Pollard1,4 and Jonathan A Eisen2,5,6,7

Author Affiliations

1 The J. David Gladstone Institutes, University of California San Francisco, San Francisco, CA, 94158, USA

2 UC Davis Genome Center, University of California, Davis, Davis, CA, 95616, USA

3 Department of Biochemistry & Molecular Biology, Dalhousie University, Halifax, Nova Scotia, Canada

4 Department of Epidemiology & Biostatistics, Institute for Human Genetics, University of California San Francisco, San Francisco, CA, 94158, USA

5 Department of Evolution and Ecology, University of California, Davis, Davis, CA, 95616, USA

6 Department of Medical Microbiology and Immunology, University of California, Davis, Davis, CA, 95616, USA

7 Department of Energy Joint Genome Institute, Walnut Creek, CA, 94598, USA

BMC Bioinformatics 2012, 13:264  doi:10.1186/1471-2105-13-264

Published: 13 October 2012

Additional files

Additional file 1:

Representative genomes used in this study.

Format: XLSX; Size: 11 KB

Open Data

Additional file 2:

Widely distributed protein family statistics.

Format: XLSX; Size: 67 KB

Additional file 3:

InterPro annotations enriched among the quartile of SFams with the largest size.

Format: XLSX; Size: 34 KB

Additional file 4:

InterPro annotations enriched among SFams with a precision less than 0.75.

Format: XLSX; Size: 20 KB

Additional file 5:

Distributions of various network topology statistics for the entire SFam similarity network. Each histogram illustrates the distribution of a network statistic for each node in the SFam similarity network, including degree centrality (upper left, log scale), betweenness centrality (upper right; x-axis scale constrained at a betweenness of 50), transitivity (lower left), and closeness centrality (lower right).

Format: PDF; Size: 138 KB

Additional file 6:

Distributions of various network topology statistics for the largest SFam similarity component. Each histogram illustrates the distribution of a network statistic for each node in the largest SFam similarity network component, including degree centrality (upper left, log scale), betweenness centrality (upper right; x-axis scale constrained at a betweenness of 50), transitivity (lower left), and closeness centrality (lower right).

Format: PDF; Size: 7 KB

Additional file 7:

InterPro annotations enriched among the quartile of SFams with the highest degree.

Format: XLSX; Size: 20 KB

Additional file 8:

The distribution of the number of InterPro annotations detected per SFam.

Format: PDF; Size: 108 KB
