Open Access Research article

Proteome sequence features carry signatures of the environmental niche of prokaryotes

Zlatko Smole1,2, Nela Nikolic2,3, Fran Supek4, Tomislav Šmuc4, Ivo F Sbalzarini5 and Anita Krisko2,6*

Author Affiliations

1 Institute for Cell Biology, ETH Zuerich, Schafmattstrasse 18, 8093 Zuerich, Switzerland

2 Mediterranean Institute for Life Sciences, Mestrovicevo setaliste bb, 21000 Split, Croatia

3 Institute of Biogeochemistry and Pollutant Dynamics, ETH Zuerich, Universitätstrasse 16, 8092 Zuerich, Switzerland

4 Division of Electronics, Rudjer Boskovic Institute, Bijenicka 54, 10000 Zagreb, Croatia

5 Institute of Theoretical Computer Science and Swiss Institute of Bioinformatics, ETH Zurich, Zurich, Switzerland

6 Institut National de la Santé et de la Recherche Médicale U1001, Université Paris Descartes, Faculté de Médecine, 156 rue de Vaugirard, 75730 Paris Cedex 15, France

BMC Evolutionary Biology 2011, 11:26  doi:10.1186/1471-2148-11-26

Published: 26 January 2011

Additional files

Additional file 1:

List of 1107 species used in this study, with values of each feature. Codes used for feature names are listed in Additional file 7.

Format: PDF Size: 388KB

Open Data

Additional file 2:

Classification results shown as receiver operating characteristic (ROC) graphs with associated area under the curve (AUC) values, by using SVM: (A) domain of life classification, (B) halophilicity classification; by using RF: (C) domain of life classification, (D) halophilicity classification.

Format: PDF Size: 38KB
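As a reading aid for the AUC values reported with the ROC graphs, the following is a minimal sketch of how an AUC can be computed from class labels and classifier scores. It uses the rank-sum (Mann-Whitney) identity rather than integrating the curve; the function name `roc_auc` and the example labels and scores are illustrative assumptions, not the study's code or data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: sequence of 0/1 values (1 = positive class).
    scores: classifier outputs, higher meaning "more likely positive".
    Tied scores receive the average of the ranks they span.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the run of equal scores.
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative example")

    pos_rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Hypothetical example: 1 = halophile, 0 = non-halophile.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 1.0 means every positive example is scored above every negative one, while 0.5 corresponds to random ranking.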

Additional file 3:

Classification results shown as receiver operating characteristic (ROC) graphs with associated area under the curve (AUC) values for temperature adaptation, by using SVM: (A) mesophiles vs. others, (B) mesothermophiles vs. others, (C) thermophiles vs. others, (D) mesophiles vs. mesothermophiles, (E) mesophiles vs. thermophiles, and (F) mesothermophiles vs. thermophiles; by using RF: (G) mesophiles vs. others, (H) mesothermophiles vs. others, (I) thermophiles vs. others, (J) mesophiles vs. mesothermophiles, (K) mesophiles vs. thermophiles, and (L) mesothermophiles vs. thermophiles.

Format: PDF Size: 82KB

Additional file 4:

Summary of feature selection results.

(A) Ten most important features for classifications regarding domain of life revealed by the feature selection algorithm of RF. Pairs of box-and-whisker plots are shown for each feature labeled with a number: 1-Gln content, 2-Leu content, 3-normalized frequency of extended structure, 4-negative charge, 5-average protein size in a proteome, 6-Glu content, 7-charge, 8-His content, 9-ratio of charged and non-charged amino acids, 10-Cys content. Box-and-whisker plots represent bacteria and archaea from top to bottom.

(B) Ten most important features for classifications regarding halophilicity revealed by the feature selection algorithm of RF. Pairs of box-and-whisker plots are shown for each feature labeled with a number: 1-negative charge, 2-charge, 3-hydrophilicity value, 4-positive charge, 5-Gln content, 6-Glu content, 7-ratio of charged and non-charged amino acids, 8-normalized frequency of beta turn, 9-Asp content, 10-Phe content. Box-and-whisker plots represent non-halophiles and halophiles from top to bottom.

(C) Ten most important features for classifications regarding thermophilicity revealed by the feature selection algorithm of RF. Triplets of box-and-whisker plots are shown for each feature labeled with a number: 1-Gln content, 2-information measure for loop, 3-Glu content, 4-Val content, 5-normalized frequency of extended structure, 6-hydrophilicity value, 7-Tyr content, 8-Asp content, 9-negative charge, 10-Chou-Fasman parameter of the coil conformation. Box-and-whisker plots represent mesophiles, mesothermophiles and thermophiles from top to bottom.

In all plots feature values are normalized from 0 to 1 from left to right. (+) signs represent outliers.

Format: PDF Size: 3.7MB

Additional file 5:

Histograms showing the shapes of the distributions of feature values within six representative proteomes: a mesophilic, a thermophilic and a halophilic bacterium, and a mesophilic, a thermophilic and a halophilic archaeon.

Format: PDF Size: 211KB

Additional file 6:

Pairwise correlation coefficients within the original set of 79 proteome features, and a visualization of the hierarchical clustering of these features. Applying a threshold (rank correlation < 0.9) to the clustering yielded 42 feature clusters, whose representatives form the final, reduced-redundancy set of 42 features.

Format: XLS Size: 808KB
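The redundancy reduction described for Additional file 6 can be illustrated with a minimal sketch. The listing does not give the linkage method, whether the absolute value of the correlation was used, or how cluster representatives were picked, so the choices below are assumptions: features are merged whenever their absolute Spearman rank correlation reaches the threshold (equivalent to cutting a single-linkage clustering), and the first feature of each cluster in input order is taken as its representative. All function names and the toy data are hypothetical.

```python
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (Pearson correlation of the ranks).

    Assumes neither feature is constant; a constant feature would give
    zero variance and a division by zero.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def reduce_redundancy(features, threshold=0.9):
    """Group features with |rho| >= threshold; keep one representative each.

    features: dict mapping feature name -> list of values (one per species).
    Assumption: single-linkage grouping via union-find, with the first
    feature in input order representing each cluster.
    """
    names = list(features)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if abs(spearman(features[a], features[b])) >= threshold:
            parent[find(b)] = find(a)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return [members[0] for members in clusters.values()]


# Toy data: "b" duplicates the ranking of "a"; "d" reverses the ranking of "c".
toy_features = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
    "d": [1, 3, 2, 5, 4],
}
kept = reduce_redundancy(toy_features)
```

In this toy case the four features collapse into two clusters, so only two representatives are kept; in the study the same kind of procedure reduced 79 features to 42.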

Additional file 7:

Final list of the 42 features used in the study, together with their codes.

Format: RTF Size: 2KB