Open Access Research article

Optimizing data collection for public health decisions: a data mining approach

Susan N Partington1,3*, Vasil Papakroni2 and Tim Menzies2

Author Affiliations

1 Division of Animal and Nutritional Sciences, West Virginia University, Morgantown, WV, USA

2 Lane Department of Computer Sciences and Electrical Engineering, West Virginia University, Morgantown, WV, USA

3 Regional Research Institute, West Virginia University, 886 Chestnut Ridge Road, 5th Floor, P.O. Box 6825, Morgantown, WV 26506-6825, USA


BMC Public Health 2014, 14:593 doi:10.1186/1471-2458-14-593

Published: 12 June 2014



Collecting data can be cumbersome and expensive. A lack of relevant, accurate and timely data for research to inform policy may negatively impact public health. The aim of this study was to test whether the careful removal of items from two community nutrition surveys, guided by a data mining technique called feature selection, can (a) identify a reduced dataset while (b) not damaging the signal inside that data.


The Nutrition Environment Measures Surveys for stores (NEMS-S) and restaurants (NEMS-R) were completed for 885 retail food outlets in two counties in West Virginia between May and November of 2011. A reduced dataset was identified for each outlet type using feature selection. Coefficients from linear regression modeling were used to weight the items in the reduced datasets. Weighted item values were summed with the error term to compute reduced-item survey scores. Scores produced by the full survey were compared to the reduced-item scores using a Wilcoxon rank-sum test.
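The steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the feature selection algorithm, so a simple ranking by absolute correlation with the full score stands in for it here. The data are synthetic (not NEMS data), all function and variable names are hypothetical, the fitted intercept is assumed to play the role of the "error term" in the weighted sum, and the rank-sum test uses the normal approximation without a tie correction.

```python
import math
import numpy as np


def select_features(X, y, k):
    """Keep the k items most correlated (in absolute value) with the full score.

    Assumption: a stand-in for the unspecified feature selection method.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    denom = np.where(denom == 0, np.inf, denom)  # constant items get correlation 0
    corr = np.abs(Xc.T @ yc) / denom
    return np.sort(np.argsort(corr)[::-1][:k])


def fit_reduced_score(X, y, keep):
    """Regress the full score on the kept items; return weights and reduced scores."""
    A = np.column_stack([np.ones(len(y)), X[:, keep]])  # first column is the intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef, A @ coef


def r_squared(y, y_hat):
    """Coefficient of determination of the reduced scores against the full scores."""
    ss_res = ((y - y_hat) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def rank_sum_test(a, b):
    """Two-sided Wilcoxon rank-sum test using the normal approximation."""
    n1, n2 = len(a), len(b)
    ranks = average_ranks(np.concatenate([a, b]))
    w = ranks[:n1].sum()
    mean_w = n1 * (n1 + n2 + 1) / 2.0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p


# Synthetic survey: 500 outlets, 30 items scored 0-3; only 6 items drive the score.
rng = np.random.default_rng(0)
n_outlets, n_items = 500, 30
X = rng.integers(0, 4, size=(n_outlets, n_items)).astype(float)
true_weights = np.zeros(n_items)
true_weights[:6] = [5, 4, 3, 3, 2, 2]
full_score = X @ true_weights + rng.normal(0.0, 0.5, n_outlets)

keep = select_features(X, full_score, k=6)
coef, reduced_score = fit_reduced_score(X, full_score, keep)
r2 = r_squared(full_score, reduced_score)
z_stat, p_value = rank_sum_test(full_score, reduced_score)

print("kept items:", keep.tolist())
print("R^2 of reduced model:", round(float(r2), 3))
print("rank-sum p-value:", round(p_value, 3))
```

In this sketch, a large p-value means the test found no detectable shift between the full and reduced-item score distributions, which is the sense in which the reduced survey preserves the signal.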


Feature selection identified 9 store and 16 restaurant survey items as significant predictors of the score produced from the full survey. The linear regression models built from the reduced feature sets had R² values of 92% and 94% for restaurant and grocery store data, respectively.


While there are many potentially important variables in any domain, the most useful set may be only a small subset. Using feature selection in the initial phase of data collection to identify the most influential variables may greatly reduce the amount of data needed, thereby reducing cost.

Keywords: Community survey methods; Data mining; Data collection; Ecological and environmental concepts; Nutrition