This article is part of the supplement: Proceedings of the Tenth Annual MCBIOS Conference

Open Access Proceedings

Data mining tools for Salmonella characterization: application to gel-based fingerprinting analysis

Wen Zou1*, Hailin Tang1, Weizhong Zhao1, Joe Meehan1, Steven L Foley2, Wei-Jiun Lin3, Hung-Chia Chen1,4, Hong Fang5, Rajesh Nayak2 and James J Chen1

Author Affiliations

1 Division of Bioinformatics and Biostatistics, U.S. Food and Drug Administration, Jefferson, Arkansas, USA

2 Division of Microbiology, U.S. Food and Drug Administration, Jefferson, Arkansas, USA

3 Department of Applied Mathematics, Feng Chia University, Taichung, Taiwan

4 Graduate Institute of Biostatistics and Biostatistics Center, China Medical University, Taichung, Taiwan

5 The Office of Scientific Coordination, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, Arkansas, USA


BMC Bioinformatics 2013, 14(Suppl 14):S15  doi:10.1186/1471-2105-14-S14-S15

Published: 9 October 2013



Pulsed-field gel electrophoresis (PFGE) is currently the method most widely and routinely used by the Centers for Disease Control and Prevention (CDC) and state health laboratories in the United States for Salmonella surveillance and outbreak tracking. Major drawbacks of commercially available PFGE analysis programs are their difficulty in handling large datasets and their limited range of analysis tools. New analytical tools for PFGE data mining are needed to make full use of the valuable data in large surveillance databases.


In this study, a software package was developed that implements five bioinformatics approaches for the analysis and visualization of PFGE fingerprinting data. The approaches are PFGE band standardization, Salmonella serotype prediction, hierarchical cluster analysis, distance matrix analysis and two-way hierarchical cluster analysis. PFGE band standardization makes cross-group analysis of large datasets possible. The Salmonella serotype prediction approach allows users to predict the serotypes of Salmonella isolates from their PFGE patterns. The hierarchical cluster analysis approach can be used to clarify subtypes and phylogenetic relationships among groups of PFGE patterns. The distance matrix and two-way hierarchical cluster analysis tools allow users to directly visualize the similarities and dissimilarities between any two individual patterns and the inter- and intra-serotype relationships of two or more serotypes. They also summarize the overall relationships between user-selected serotypes and identify the band markers that distinguish those serotypes. The functionality of these tools is illustrated on PFGE fingerprinting data from CDC PulseNet.
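As an illustration of the distance matrix and hierarchical cluster analysis ideas, the following minimal sketch computes pairwise distances between standardized band patterns and then clusters them. It rests on several assumptions that are not stated in this abstract: patterns are represented as binary band-presence vectors, the Dice coefficient is the similarity measure, and average linkage (UPGMA) is the clustering rule. The isolate names and band vectors are hypothetical, and this is not the package's actual implementation.

```python
from itertools import combinations


def dice_distance(a, b):
    """Return 1 - Dice coefficient for two binary band-presence vectors."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(a) + sum(b)
    return 0.0 if total == 0 else 1.0 - 2.0 * shared / total


def distance_matrix(patterns):
    """Map every ordered pair of isolate names to its Dice distance."""
    names = list(patterns)
    return {
        (p, q): dice_distance(patterns[p], patterns[q])
        for p in names
        for q in names
    }


def average_linkage(patterns):
    """Naive average-linkage (UPGMA) clustering; returns the merge order."""
    dist = distance_matrix(patterns)
    clusters = [(name,) for name in patterns]
    merges = []

    def cluster_distance(pair):
        # Mean of all pairwise distances between members of the two clusters.
        c1, c2 = pair
        return sum(dist[(p, q)] for p in c1 for q in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Merge the two clusters that are currently closest.
        c1, c2 = min(combinations(clusters, 2), key=cluster_distance)
        merges.append((c1, c2, cluster_distance((c1, c2))))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical standardized patterns: 1 = band present at that position.
patterns = {
    "iso_A": [1, 1, 0, 1, 0, 1],
    "iso_B": [1, 1, 0, 1, 0, 0],
    "iso_C": [0, 0, 1, 0, 1, 1],
}

merges = average_linkage(patterns)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

In this toy example, iso_A and iso_B share three bands and merge first, while iso_C joins last. A two-way analysis would additionally cluster the band positions themselves, so that bands separating groups of isolates stand out as candidate markers.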


The bioinformatics approaches in the software package developed in this study were integrated with the PFGE database to enhance the data mining of PFGE fingerprints. Fast and accurate prediction makes it possible to obtain Salmonella serotype information before conventional serological methods are pursued. The development of bioinformatics tools that distinguish PFGE markers and serotype-specific patterns will enhance PFGE data retrieval, interpretation and serotype identification, and will likely accelerate source tracking to identify the Salmonella isolates implicated in foodborne diseases.

Data mining; Salmonella; PFGE; bioinformatics tools; data analysis.