State Key Lab of Resources and Environmental Information Systems, Institute of Geographical Sciences and Natural Resources Research, Chinese Academy of Sciences, 1305, No. A11, Rd. Datun, Anwai, Beijing, 100101, China
Program in Public Health, College of Health Sciences, University of California, Irvine, USA
Department of Epidemiology, School of Medicine, University of California, Irvine, USA
Abstract
Background
Environmental exposure may play an important role in the incidence of neural tube defects (NTD), one of the most common types of birth defects. Its influence on NTD is likely nonlinear, and few studies have considered spatial autocorrelation of residuals when estimating NTD risk. We aimed to develop a spatial model based on a generalized additive model (GAM) plus cokriging to model the expected incidence of NTD and to infer the incidence risk.
Methods
We developed a spatial model to predict the expected incidence of NTD at village level in Heshun County, Shanxi Province, China, a region with a high number of NTD cases. GAM was used to establish linear and nonlinear relationships between local covariates and the expected NTD incidence. We examined the following village-level covariates in the model: projected coordinates, soil types, lithological classes, distances to watersheds, rivers, faults and major roads, annual average fertilizer use, fruit and vegetable production, gross domestic product, and the number of doctors. The residuals from GAM were assumed to be spatially autocorrelated and were cokriged with regional residuals to improve the prediction. Our approach was compared with three other models: universal kriging, generalized linear regression and GAM alone. The models were evaluated by cross validation.
Results
Our model predicted the expected incidence of NTD well, with a cross-validation (CV) R^{2} of 0.80. Important predictive factors included fertilizer use, the location of each village centroid, the shortest distances to rivers and faults, and lithological classes; the residuals showed significant spatial autocorrelation. Our model outperformed the other three methods by 16% or more in terms of R^{2}.
Conclusions
The variance explained by our model was approximately 80%. This modeling approach is useful for NTD epidemiological studies and intervention planning.
Background
Birth defects are functional or structural anomalies present in infancy or later in life that can result in infant mortality and disability. Neural tube defects (NTD), one of the most common types of birth defects, have been estimated to occur in more than 320,000 infants worldwide annually
Environmental pollutants have also been linked to NTD
Although spatial analysis has been used to examine the effects of environmental factors on birth defects, it has not been used to directly predict the incidence of NTD in Heshun, an area with an extremely high rate of NTD. In addition, it is seldom used to quantify potentially important features (e.g. soil and lithological types, faults and rivers) for use in combination with other numeric influential factors (e.g. use of pesticides and disinfection products, population, income and number of doctors)
To address these problems, we developed a spatial model to predict NTD incidence based on environmental variables. A Poisson distribution was used to model NTD incidence as counts
Methods
Study domain
Heshun county (Figure
Study region (a) and histogram with fitted density plot (b) of expected NTD occurrences during 2002–2005.
Measured cases of NTD
Annual incidence of NTD from 1998 to 2005 was recorded for each of the 326 sampling villages within Heshun County. Figure
Spatial data and covariates
Literature has identified the following factors related to the NTD incidences: deficiency of folic acid and vitamin B_{12}
Physical factors
We collected two area-type variables (soil types and lithological classes) and the locations of four line-type geographical features (i.e. faults, watersheds, rivers, and major roads). We calculated the proportion of area covered by specific soil and lithological types within different buffer radii and estimated the shortest distance from the centroid of each village to each fault, watershed, river, and major road. We classified soils into nine types, including cinnamon soil, calcareous cinnamon soil, middle cinnamon soil, neutral lithosol, neutral regosol, cinnamon soil, calcareous regosol, calcareous lithosol, and fluvo-aquic soil. We classified lithology into six classes, including C: coal, fireclay, iron, carboniferous, potassium-bearing rock, silicon rock, sulfur, aluminum; P: Permian rocks, ferromanganese, violet sand earthenware clay; T: middle Triassic rocks; O: limestone, dolomite, middle Ordovician rocks; A: garnet; Q: clay cement grout with mixture of limestone and dolomite.
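The shortest-distance covariates described above can be computed from projected coordinates with basic geometry. The following is a minimal sketch, assuming each line-type feature is stored as a list of (x, y) vertices in the same projected coordinate system as the village centroids; the function names and the sample coordinates are hypothetical and do not come from the study.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b (projected coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: both end points coincide.
        return math.hypot(px - ax, py - ay)
    # Position of the orthogonal projection along the segment, clamped to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def shortest_distance_to_line(p, polyline):
    """Shortest distance from point p to a polyline given as a list of vertices."""
    return min(point_segment_distance(p, a, b)
               for a, b in zip(polyline, polyline[1:]))

# Hypothetical village centroid and river polyline (metres).
centroid = (0.0, 1.0)
river = [(-1.0, 0.0), (1.0, 0.0), (2.0, 3.0)]
distance_to_river = shortest_distance_to_line(centroid, river)
```

In practice a GIS package would perform this calculation; the sketch only shows the geometry behind the covariate.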
Other influential factors
For each village, we obtained its spatial location (the projected coordinates of the centroid) as well as other parameters including the total population, the number of doctors in service, gross domestic product (GDP), average annual fruit and vegetable production (kilogram), and the total amount of fertilizer used per year (kilogram). GDP reflects the socioeconomic status of the population; higher GDP likely indicates better living conditions and foods richer in folic acid, vitamins and other nutrients. The number of doctors reflects local health services and the accessibility of hospital resources, which may influence education on disease prevention and prenatal care (e.g. early identification and abortion of fetuses with birth defects).
Modeling approach
Optimal buffering analysis to extract qualitative covariates
The optimal decaying buffering analysis
Simulation of Poisson distribution for NTD incidences
Using the Poisson distribution, we simulated the counts of NTD incidences based on the following probability function:
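For reference, the standard Poisson probability mass function gives the probability of observing exactly k cases when the expected count is λ: P(Y = k) = λ^k e^{−λ} / k!. A minimal sketch follows; the function name and the example expected count are illustrative assumptions, not values from the study.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k cases when the expected count is lam."""
    if k < 0 or lam < 0:
        raise ValueError("k and lam must be non-negative")
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Illustrative expected count for one village (not a value from the study).
lam_example = 0.5
# Probabilities of observing 0, 1, 2 and 3 cases.
probabilities = [poisson_pmf(k, lam_example) for k in range(4)]
```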
GAM for an initial estimation of local expected incidences
The general equation of predictive value of NTD incidences:
where
We used the GAM package in R statistical software (R version 2.14) to estimate local means (
where
Three steps were used to select the covariates in GAM. First, to avoid multicollinearity, we used the variance inflation factor (VIF) to divide the covariates into two parts: weakly correlated covariates (VIF < 5) and independent groups of highly correlated ones (VIF ≥ 5) (correlation coefficients were used to divide the highly correlated covariates into different groups)
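The VIF screening in the first step can be illustrated as follows. This is a minimal sketch, not the authors' code: it uses the standard identity that the VIF of covariate j equals the j-th diagonal element of the inverse of the covariates' correlation matrix. The three toy covariates and all function names are hypothetical.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equally long covariate columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    return [[sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
             for cj, nj in zip(centered, norms)]
            for ci, ni in zip(centered, norms)]

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariates are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def variance_inflation_factors(columns):
    """VIF_j is the j-th diagonal element of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]

# Toy covariates: x2 is almost twice x1, x3 is unrelated to both.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]
x3 = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0]
vifs = variance_inflation_factors([x1, x2, x3])

# Split covariate indices using the threshold of 5 stated in the text.
weakly_correlated = [j for j, v in enumerate(vifs) if v < 5]
highly_correlated = [j for j, v in enumerate(vifs) if v >= 5]
```

Here the first two toy covariates fall into the highly correlated part and the third into the weakly correlated part.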
Cokriging spatial residuals to minimize error variance
Assuming a stable space domain after the modeling of local means using GAM
Where
The residuals are influenced both by local variation of the target variable (expected NTD incidence) and by regional variation or other unaccounted effects at nearby locations, which we refer to as the regional residual in our model. According to the cokriging principle of unbiased estimation with minimal error variance, if the variogram of the spatial and regional residuals is precisely captured, the error variance of the spatial residual will decrease substantially and in turn
We used theoretical variogram models to fit the experimental variograms of the spatial and regional residuals and the cross covariance between them. We derived the cross covariance, i.e. the change of the variance between the spatial residual and the regional residual, from the variogram. We examined the semivariogram cloud and tested different lag sizes, numbers of lags, and variogram models to find the best fit by cross validation, using the Geostatistical Analyst extension of ArcGIS (version 10.0)
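The idea of an experimental variogram and a theoretical model fitted to it can be sketched in a few lines. This is an illustration under stated assumptions, not the ArcGIS procedure: pairs are binned by distance with a fixed lag size, the semivariance of each bin is half the mean squared difference, and the spherical model is one of the candidate forms named in the uncertainty analysis. All function names, parameters and toy residuals are hypothetical.

```python
import math

def empirical_semivariogram(coords, values, lag_size, n_lags):
    """Return (lag centre, semivariance, pair count) for each distance bin."""
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            h = math.dist(coords[i], coords[j])
            bin_index = int(h // lag_size)
            if bin_index < n_lags:
                sums[bin_index] += (values[i] - values[j]) ** 2
                counts[bin_index] += 1
    return [((b + 0.5) * lag_size,
             sums[b] / (2 * counts[b]) if counts[b] else None,
             counts[b])
            for b in range(n_lags)]

def spherical_model(h, nugget, partial_sill, range_):
    """Spherical variogram: rises to nugget + partial_sill at distance range_."""
    if h <= 0:
        return 0.0
    if h >= range_:
        return nugget + partial_sill
    ratio = h / range_
    return nugget + partial_sill * (1.5 * ratio - 0.5 * ratio ** 3)

# Toy residuals at four locations along a line (hypothetical values).
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
residuals = [1.0, 2.0, 4.0, 7.0]
variogram = empirical_semivariogram(coords, residuals, lag_size=1.0, n_lags=3)
```

Testing different lag sizes and numbers of lags, as described above, amounts to calling the first function with different `lag_size` and `n_lags` values and comparing the cross-validation error of the fitted models.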
Model comparison
We compared our modeling approach (GAM plus cokriging of spatial residuals) to three other methods, i.e. universal kriging, generalized linear regression (GLM), and GAM only. Universal kriging estimates local means using coordinates and the residuals, but it has no structure to take into account local spatial covariates and regional variability. GLM assumes a linearly additive relationship between expected means and all spatial covariates. GAM incorporates both linear and nonlinear relationships, but GAM itself does not account for spatial autocorrelation of the residuals. The GAM plus cokriging model incorporates both the variability of local means estimated with GAM and the large-scale regional variability, which is useful for decreasing the bias of the prediction due to factors not accounted for in the model
Cross validation
For model evaluation we used leave-one-out cross-validation (LOOCV). LOOCV uses a single observation from the original sample as the validation data and the remaining observations as the training data; this is repeated such that each observation in the sample is used once as the validation data. We used three continuous error measures (
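The LOOCV procedure can be sketched as follows. This is a minimal illustration under stated assumptions: a one-covariate straight-line fit stands in for the actual models, RMSPE is the square root of the mean squared prediction error, and R^{2} is computed as 1 − SSE/SST, which is an assumption because the exact formula used in the study is not given here. All names and data are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for one covariate."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope

def loocv_predictions(xs, ys):
    """Predict each observation from a model fitted to all the others."""
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(train_x, train_y)
        predictions.append(intercept + slope * xs[i])
    return predictions

def rmspe(observed, predicted):
    """Square root of the mean of the squared prediction errors."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def cv_r2(observed, predicted):
    """Cross-validation R^2 computed as 1 - SSE/SST (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst

# Hypothetical covariate and response values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]
held_out = loocv_predictions(xs, ys)
scores = (cv_r2(ys, held_out), rmspe(ys, held_out))
```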
Inference and prediction of the NTD incidences
With the expected NTD incidence estimated, we used the Poisson probability distribution (Equation [1]) to infer and predict the risk of NTD incidence. With the following equation, we can predict the probability of at least one NTD case during the same 8-year period for each village:
Also, we can infer the odds for NTD incidences:
Given the number of total births during the study period, we can estimate the incidences of NTD and identify the spatial variability and potential hot spots.
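Under the Poisson assumption, both quantities follow directly from the expected count λ of a village: the probability of at least one case is 1 − e^{−λ}, and the corresponding odds are that probability divided by the probability of no case, which simplifies to e^{λ} − 1. The sketch below is illustrative; the function names are hypothetical and the odds definition (at least one case versus no case) is an assumption.

```python
import math

def prob_at_least_one(lam: float) -> float:
    """P(Y >= 1) = 1 - exp(-lam) for a Poisson count with mean lam."""
    return -math.expm1(-lam)

def odds_at_least_one(lam: float) -> float:
    """Odds of at least one case versus none: P(Y >= 1) / P(Y = 0) = exp(lam) - 1."""
    return math.expm1(lam)

# Illustrative expected count (not a value from the study).
lam_example = math.log(2.0)
probability = prob_at_least_one(lam_example)
odds = odds_at_least_one(lam_example)
```

With an expected count of ln 2 (about 0.693), the probability of at least one case is 0.5 and the odds are 1.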
Uncertainty analysis
Using the mgcv package for GAM in R, we obtained the 95% upper and lower pointwise confidence limits around the GAM estimate for each regressor. The uncertainty of each covariate was evaluated based on these confidence limits. Further, we tested the sensitivity of the smooth functions of the regressors by adjusting the degrees of freedom (5–10). In addition, we examined the uncertainty of the model using different variogram models (spherical, circular, exponential, Gaussian and stable).
Results
Determinants
We selected one of the nine soil types (calcareous lithosol) and two of the six lithological classes (brick clay and the Trias Liujiagou group) as predictive covariates, each at the optimal buffer distance that gave the highest correlation with the target variable. Additional file
Supplemental Materials. Table S1. Optimal fitted variogram models of local and regional residuals by cross validation. Figure S1. Correlation of the soil (a) and lithological types (b, c) along the decaying buffer distances. Figure S2. Nonlinear/linear relationship between local covariates and the expected NTD incidences modeled by GAM. Figure S3. Variograms of local and regional residuals and cross covariance. (DOCX 160 kb)
From eleven covariates (one soil type, two lithological classes, the projected x/y coordinates, distance to watershed, GDP, number of doctors, average yearly fertilizer use, and shortest distances to rivers, faults and roads), five were selected as the final predictive determinants: x/y coordinates, fertilizer use, shortest distance to faults, shortest distance to rivers, and the lithological type of the Trias Liujiagou group (T*). Table
| Spatial covariates selected | Buffer distance (m) | R | V.P (%) | F | p-value |
|---|---|---|---|---|---|
| Geographic location | | | 19.03 | 3.828 | 7.53e-7* |
| Fertilizer used | | 0.40 | 25.64 | 14.74 | <2e-16* |
| Shortest distance to faults | | −0.14 | 4.96 | 2.499 | 0.01* |
| Shortest distance to rivers | | −0.22 | 7.52 | 4.149 | 5.12e-5* |
| Lithological type of Trias Liujiagou Group (T*) | 1500 | −0.23 | 1.04 | 0.03 | 0.0328* |
| Calcareous lithosol soil | 2500 | 0.21 | | 19.14 | 1.64e-5* |
| Brick clay lithological type (Q*) | 1500 | 0.30 | | 16.19 | 3.78e-9* |
Spatial correlation
Variogram models showed spatial autocorrelation of local and regional residuals (Additional file
Comparison of models
Our spatial residual model had the highest cross validation
Plot of observed NTD counts vs. expected NTD counts predicted by our method (a,
| Models | CV R^{2} | | RMSPE |
|---|---|---|---|
| Generalized linear regression | 0.234 | −0.25 | 0.993 |
| Universal kriging | 0.164 | 0.414 | 1.17 |
| GAM | 0.582 | −0.135 | 0.734 |
| Our model of spatial residual | 0.804 | −0.045 | 0.502 |
Risk mapping
Figure
Predicted expected NTD occurrences (a) and predicted probability of at least one NTD case (b).
Map of the inferred odds of at least one birth defect.
Uncertainty analysis
Uncertainty analysis showed that our model had a stable prediction performance. Additional file
Discussion
We developed a GAM plus cokriging model to estimate the expected incidences of NTD (
We found that fertilizer use, residual spatial autocorrelation, and the projected coordinates were important predictors of NTD incidence, accounting for 25.64%, 22.20%, and 19.03% of the variance, respectively. Our results for fertilizer use agree with other environmental epidemiological studies of birth defects
We further observed that higher NTD incidence was associated with living closer to faults and living in areas with more Q* rock and calcareous lithosol soil, while lower NTD risk was linked to areas with more T* rock. Our findings on the influence of local covariates and spatial autocorrelation agree with the work of Wang et al.
Although statistically insignificant in our model, fruit and vegetable production (a proxy for folic acid deficiency) was found by the geographical detector method to contribute partly to the distribution of NTD incidence in Heshun County
We generated the risk map and identified the hotspot villages with high risk of NTD incidence, which will help prioritize the resources needed for government intervention to reduce the risk of NTD. For the hot spots like Qing Cheng, Bo Li, Xu Cun and Niu Chuan (Figure
This study has several major limitations. First, we did not have data on the number of births during the data collection period, so we could not calculate the rate of NTD incidence. However, the Poisson probability model is suitable for assessing the probability of count events, and the odds of birth defects versus no birth defects reasonably reflect the spatial variability of NTD risk (Figure
Conclusion
This study developed a residual spatial model that couples GAM and cokriging to assess the spatial variability of the expected incidence of NTD and its risk in Heshun County, Shanxi Province, China. Our method examined the influence of local environmental covariates, including shortest distances to faults and rivers, soil, rock, fertilizer use and spatial location, on the variability of NTD incidence. It used GAM to establish the linear and nonlinear relationships between the covariates and NTD risk, and cokriging to incorporate the spatial autocorrelation of the GAM residuals. Compared with the other three methods, our method performed best, with a LOOCV R^{2} of 0.80. Our study has significant implications for epidemiological studies of the influence of environmental factors on birth defects.
Abbreviations
NTD: Neural tube defects; GAM: Generalized additive model; GLM: Generalized linear regression; CV: Cross validation; LOOCV: Leave-one-out cross-validation; IQR: Interquartile range; RMSPE: Square root of the mean of the squared prediction errors.
Competing interests
The authors declare they have no competing interests.
Authors’ contributions
LL conceived the paper, developed the model, analyzed the data, and jointly drafted and revised the manuscript. JW1 (Jinfeng Wang) provided the background knowledge and critical data support, gave constructive modeling suggestions, and jointly revised the paper. JW2 (Jun Wu) helped improve the modeling, provided relevant background knowledge, refined the interpretation of the results, and made a considerable contribution to the revision of the paper. All authors read and approved the final manuscript.
Acknowledgements
This research was partially supported by grant 41171344/D010703 from the Natural Science Foundation of China, grant 2012CB955503 (Research of Identification of Susceptible Population and Risk Regionalization for Climate Changes and Health) from the National Basic Research Program of China’s Ministry of Science and Technology (973), and grant 2011AA120305–1 from the Hi-tech Research and Development Program of China’s Ministry of Science and Technology (863).