Estimating the size of hidden populations from register data
Centre for Social Research on Alcohol and Drugs, SoRAD, Stockholm University, SE-10691 Stockholm, Sweden
BMC Medical Research Methodology 2014, 14:58. doi:10.1186/1471-2288-14-58. Published: 27 April 2014
Prevalence estimates of drug use, or of its consequences, are considered important in many contexts and may have substantial influence over public policy. However, it is rarely possible to simply count the relevant individuals, in particular when the defining characteristics might be illegal, as in the case of drug use. Consequently, methods are needed to estimate the size of such partly ‘hidden’ populations, and many such methods have been developed and used within epidemiology, including in studies of alcohol and drug use. Here we introduce a method appropriate for estimating the size of human populations given a single source of data, for example entries in a health-care registry.
The setup is the following: during a fixed time period, e.g. a year, individuals belonging to the target population have a non-zero probability of being “registered”. Each individual might be registered multiple times, and the time-points of the registrations are recorded. Assuming that the population is closed and that the probability of being registered at least once is constant, we derive a family of maximum likelihood (ML) estimators of total population size. We study the ML estimator using Monte Carlo simulations and delimit the range of cases where it is useful. In particular, we investigate the effect of making the population heterogeneous with respect to the probability of being registered.
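To make the data setup concrete, the following is a minimal sketch of how register data of this kind could be simulated. It is an illustration only: the function name `simulate_registrations`, its parameters, and the choice of a homogeneous Poisson process for each individual's registrations are assumptions made here, not the model specified by the authors.

```python
import random


def simulate_registrations(pop_size, rate, period=1.0, seed=0):
    """Simulate registration time-points for a closed population.

    Assumption (illustrative, not taken from the paper): each individual
    generates registrations as a homogeneous Poisson process with the
    given rate over the interval [0, period].

    Returns a dict mapping individual id -> sorted list of time-points.
    Only individuals registered at least once appear in the dict, which
    mimics the fact that unregistered individuals stay hidden.
    """
    rng = random.Random(seed)
    registry = {}
    for person in range(pop_size):
        times = []
        # Waiting times between Poisson events are exponentially distributed.
        t = rng.expovariate(rate)
        while t <= period:
            times.append(t)
            t += rng.expovariate(rate)
        if times:
            registry[person] = times
    return registry


# Hypothetical example: 2000 individuals, on average 0.3 registrations each.
registry = simulate_registrations(pop_size=2000, rate=0.3)
observed_fraction = len(registry) / 2000
```

Under these assumptions the expected observed fraction is 1 - exp(-rate * period), so a rate of 0.3 over one period corresponds to a register covering roughly a quarter of the population. Heterogeneity could be explored by drawing a separate rate for each individual instead of using a common one.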
The new estimator is asymptotically unbiased, and we show that high-precision estimates can be obtained for samples covering as little as 25% of the total population size. However, if the total population size is small (say, on the order of 500), a larger fraction needs to be sampled to achieve reliable estimates. Further, we show that the estimator gives reliable estimates even when individuals differ in the probability of being registered. We also compare the ML estimator to an estimator known as Chao’s estimator and show that the latter can have a substantial bias when applied to epidemiological data.
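For readers unfamiliar with the comparison method, the sketch below shows Chao’s estimator in its commonly cited form, n + f1²/(2·f2), where n is the number of distinct individuals observed, f1 the number observed exactly once, and f2 the number observed exactly twice. The function name `chao_estimate`, the input format, and the fallback to the bias-corrected form when f2 is zero are choices made for this illustration; the abstract does not state which variant the authors used.

```python
from collections import Counter


def chao_estimate(counts):
    """Chao's estimate of total population size.

    counts: iterable with the number of registrations for each observed
    individual (every value >= 1). This input format is an assumption
    made for the example.
    """
    counts = list(counts)
    n = len(counts)              # distinct individuals observed
    freq = Counter(counts)
    f1, f2 = freq[1], freq[2]    # seen exactly once / exactly twice
    if f2 == 0:
        # Bias-corrected form n + f1*(f1-1)/(2*(f2+1)) evaluated at f2 = 0,
        # which avoids division by zero.
        return n + f1 * (f1 - 1) / 2.0
    return n + f1 * f1 / (2.0 * f2)


# Hypothetical data: 6 people seen once, 3 seen twice, 1 seen three times.
print(chao_estimate([1] * 6 + [2] * 3 + [3]))  # 16.0
```

The estimate depends only on the frequencies f1 and f2 and ignores when the registrations occurred.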
The population size estimator suggested herein complements existing methods and is less sensitive to certain types of dependencies typical in epidemiological data.