Performance of a genetic algorithm for mass spectrometry proteomics
Office of the Clinical Director, National Institute of Neurological Disorders and Stroke, Bethesda MD, USA
BMC Bioinformatics 2004, 5:180 doi:10.1186/1471-2105-5-180
Published: 19 November 2004
Recently, mass spectrometry data have been mined using a genetic algorithm to produce discriminatory models that distinguish healthy individuals from those with cancer. This algorithm is the basis for claims of 100% sensitivity and specificity in two related publicly available datasets. To date, no detailed attempts have been made to explore the properties of this genetic algorithm within proteomic applications. Here the algorithm's performance on these datasets is evaluated relative to other methods.
Reproducing the method requires some modifications to the algorithm as described in order to obtain good performance. After modification, a cross-validation approach to model selection is used. The overall classification accuracy is comparable to, though not better than, that of the other approaches considered. In addition, some aspects of the process rely on random sampling, so for a fixed dataset the algorithm can produce many different models. This raises the question of how to choose among competing models. How this choice is made matters for interpreting sensitivity and specificity results, because simply choosing the model with the lowest test set error rate leads to overestimates of model performance.
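The overestimation that comes from picking the model with the lowest test set error can be illustrated with a small simulation. The sketch below is not the genetic algorithm evaluated here. It assumes a set of hypothetical candidate models that all share the same true accuracy, so any differences in their test set accuracy are sampling noise alone. The function name, the number of models, the test set size, and the true accuracy are all illustrative assumptions.

```python
import random


def simulate_selection_bias(n_models=50, n_test=100, true_accuracy=0.8, seed=0):
    """Return (best_test_accuracy, mean_test_accuracy) over candidate models.

    Every hypothetical model has the same true accuracy, so any spread in
    measured test set accuracy comes from sampling noise, not model quality.
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_models):
        # Each test case is classified correctly with probability true_accuracy.
        correct = sum(rng.random() < true_accuracy for _ in range(n_test))
        accuracies.append(correct / n_test)
    return max(accuracies), sum(accuracies) / len(accuracies)


best, mean = simulate_selection_bias()
print(f"true accuracy 0.80, mean test accuracy {mean:.3f}, best test accuracy {best:.3f}")
```

Under these assumptions the mean test accuracy stays close to the true value, while the accuracy of the model selected for the lowest test set error sits above it. Reporting that selected figure as the expected performance would therefore be optimistic, which is why the selected model should be scored on data that played no part in the selection.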
The algorithm needs to be modified to reduce variability, and care must be taken in choosing among competing models. Results derived from this algorithm must be accompanied by a full description of the model selection procedure to give confidence that the reported accuracy is not overstated.