Open Access Research article

Detecting contaminated birthdates using generalized additive models

Wei Luo1*, Marcus Gallagher2, Bill Loveday3, Susan Ballantyne3, Jason P Connor4,5 and Janet Wiles2

Author Affiliations

1 Centre for Pattern Recognition and Data Analytics, Deakin University, Geelong, Australia

2 School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, Australia

3 Drugs of Dependence Unit, Queensland Health, Brisbane, Australia

4 Discipline of Psychiatry, The University of Queensland, Brisbane, Australia

5 Centre for Youth Substance Abuse Research, The University of Queensland, Brisbane, Australia


BMC Bioinformatics 2014, 15:185  doi:10.1186/1471-2105-15-185

Published: 12 June 2014



Erroneous patient birthdates are common in health databases. Detecting these errors usually involves manual verification, which can be resource-intensive and impractical. This paper identifies a frequent manifestation of birthdate errors and uses it to build a principled, statistically driven procedure for detecting erroneous patient birthdates.


Generalized additive models (GAMs) enabled explicit incorporation of known demographic trends and birth patterns. With false positive rates controlled, the method identified birthdate contamination with high accuracy. In the health data set used, a domain expert manually identified 58 incorrect birthdates; the GAM-based method detected 51 of them, with 8 false positives, giving a positive predictive value of 86.4% (51/59) and a false negative rate of 12.1% (7/58). These results outperformed those of linear time-series models.
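The two rates quoted above follow directly from the reported counts. The short Python sketch below recomputes them as a check. The function and variable names are illustrative assumptions and do not come from the paper; only the three counts (58, 51, 8) are taken from the text.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """Fraction of flagged birthdates that are genuinely erroneous."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no records were flagged")
    return true_positives / flagged


def false_negative_rate(true_positives: int, actual_errors: int) -> float:
    """Fraction of genuinely erroneous birthdates that the method missed."""
    if actual_errors == 0:
        raise ValueError("no actual errors to detect")
    missed = actual_errors - true_positives
    return missed / actual_errors


# Counts reported in the abstract (variable names are hypothetical).
actual_errors = 58    # incorrect birthdates confirmed by the domain expert
true_positives = 51   # of those, the number the GAM-based method flagged
false_positives = 8   # flagged birthdates that were in fact correct

ppv = positive_predictive_value(true_positives, false_positives)
fnr = false_negative_rate(true_positives, actual_errors)
print(f"PPV = {ppv:.1%} ({true_positives}/{true_positives + false_positives})")
print(f"FNR = {fnr:.1%} ({actual_errors - true_positives}/{actual_errors})")
```

Computing both rates from raw counts, and not from pre-rounded percentages, keeps the fractions and the percentages in agreement.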


The GAM-based method identifies systemic birthdate errors, a common data quality issue in both clinical and administrative databases, with high accuracy.