
Addressing the Challenge of Defining Valid Proteomic Biomarkers and Classifiers

Mohammed Dakna (1), Keith Harris (2), Alexandros Kalousis (3), Sebastien Carpentier (4), Walter Kolch (5, 6), Joost P Schanstra (7, 8), Marion Haubitz (9), Antonia Vlahou (11), Harald Mischak (1, 10)* and Mark Girolami (12)

Author Affiliations

1 Mosaiques diagnostics and therapeutics, Hannover, Germany

2 Water and Environment Research Group, School of Engineering, University of Glasgow, Glasgow, UK

3 Computer Science Department, University of Geneva, Geneva, Switzerland

4 Laboratory of Tropical Crop Improvement, Katholieke Universiteit, Leuven, Belgium

5 The Beatson Institute for Cancer Research and Sir Henry Wellcome Functional Genomics Facility, University of Glasgow, Glasgow, UK

6 Systems Biology Ireland, Conway Institute, Belfield, Dublin 4, Ireland

7 Institut National de la Santé et de la Recherche Médicale (INSERM), U858, Toulouse, France

8 Université Toulouse III Paul-Sabatier, Institut de Médecine Moleculaire de Rangueil, Equipe n° 5, IFR150, Toulouse, France

9 Department of Nephrology, Hannover Medical School, Hannover, Germany

10 BHF Glasgow Cardiovascular Research Centre, University of Glasgow, Glasgow, UK

11 Research Foundation, Academy of Athens, Athens, Greece

12 Department of Statistical Science, University College London, London, UK


BMC Bioinformatics 2010, 11:594  doi:10.1186/1471-2105-11-594

Published: 10 December 2010



The purpose of this manuscript is to provide, based on an extensive analysis of a proteomic data set, suggestions for the proper statistical analysis needed to discover sets of clinically relevant biomarkers. As a tractable example, we define the measurable proteomic differences between apparently healthy adult males and females. We chose urine as the body fluid of interest and CE-MS, a thoroughly validated platform technology that allows routine analysis of a large number of samples. The second urine of the morning was collected from apparently healthy male and female volunteers (aged 21-40) during the routine medical check-up before recruitment at the Hannover Medical School.


We found that the Wilcoxon test is best suited for defining potential biomarkers. Adjustment for multiple testing is necessary. Sample size can be estimated from a small number of observations by resampling from pilot data. Machine learning algorithms appear ideally suited to generating classifiers. Assessing any results in an independent test set is essential.
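To make the first three recommendations concrete, the following is a minimal sketch in plain Python, not the procedure used in the study. It assumes the rank-sum (Mann-Whitney) form of the Wilcoxon test, since the two groups are independent, with a normal approximation and no continuity correction. It also assumes the Benjamini-Hochberg procedure as one possible multiple-testing adjustment, and a simple bootstrap from pilot data as one possible resampling scheme for sample size estimation. All function names, the peptide names, and the synthetic data are hypothetical.

```python
import math
import random


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        return 1.0
    n = n1 + n2
    ranks = average_ranks(list(x) + list(y))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    # Each tie group has a distinct shared rank, so counting ranks counts ties.
    counts = {}
    for r in ranks:
        counts[r] = counts.get(r, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return 1.0
    z = (u - n1 * n2 / 2.0) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2.0))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for pos, idx in enumerate(order):
        rank = m - pos  # rank of this p-value in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def resampled_power(pilot_a, pilot_b, n_per_group, alpha=0.05,
                    n_rounds=500, seed=0):
    """Fraction of bootstrap studies of size n_per_group with p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rounds):
        a = [rng.choice(pilot_a) for _ in range(n_per_group)]
        b = [rng.choice(pilot_b) for _ in range(n_per_group)]
        if rank_sum_p(a, b) < alpha:
            hits += 1
    return hits / n_rounds


if __name__ == "__main__":
    # Synthetic signal intensities for three hypothetical peptides.
    rng = random.Random(1)
    shifts = {"peptide_A": 0.0, "peptide_B": 1.5, "peptide_C": 0.0}
    names = list(shifts)
    pvals = []
    for name in names:
        group_1 = [rng.gauss(0.0, 1.0) for _ in range(20)]
        group_2 = [rng.gauss(shifts[name], 1.0) for _ in range(20)]
        pvals.append(rank_sum_p(group_1, group_2))
    for name, p, q in zip(names, pvals, benjamini_hochberg(pvals)):
        print(f"{name}: p={p:.4f}, BH-adjusted p={q:.4f}")
```

Calling `resampled_power` for increasing values of `n_per_group` gives a rough idea of how many samples per group a single feature would need to reach a target power. In a real study the significance threshold would also have to reflect the multiple-testing adjustment, which this sketch does not do.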


Valid proteomic biomarkers for diagnosis and prognosis can only be defined by applying proper statistical data mining procedures. In particular, a justification of the sample size should be part of the study design.