Abstract
Background
Missing data on tumour stage information is a common problem in populationbased cancer registries. Statistical analyses on the level of tumour stage may be biased, if no adequate method for handling of missing data is applied. In order to determine a useful way to treat missing data on tumour stage, we examined different imputation models for multiple imputation with chained equations for analysing the stagespecific numbers of cases of malignant melanoma and female breast cancer.
Methods
This analysis was based on the malignant melanoma data set and the female breast cancer data set of the cancer registry SchleswigHolstein, Germany. The cases with complete tumour stage information were extracted and their stage information partly removed according to a MAR missingnesspattern, resulting in five simulated data sets for each cancer entity. The missing tumour stage values were then treated with multiple imputation with chained equations, using polytomous regression, predictive mean matching, random forests and proportional sampling as imputation models. The estimated tumour stages, stagespecific numbers of cases and survival curves after multiple imputation were compared to the observed ones.
Results
The amount of missing values for malignant melanoma was too high to estimate a reasonable number of cases for each UICC stage. However, multiple imputation of missing stage values led to stagespecific numbers of cases of Tstage for malignant melanoma as well as T and UICCstage for breast cancer close to the observed numbers of cases. The observed tumour stages on the individual level, the stagespecific numbers of cases and the observed survival curves were best met with polytomous regression or predictive mean matching but not with random forest or proportional sampling as imputation models.
Conclusions
This limited simulation study indicates that multiple imputation with chained equations is an appropriate technique for dealing with missing information on tumour stage in populationbased cancer registries, if the amount of unstaged cases is on a reasonable level.
Background
An important task of populationbased cancer registration is to assess the effectiveness of early detection programmes such as mammography screening. Time trend analysis of cancer incidence is an important indicator in such an evaluation and is often conducted. Time trend analysis of tumour stagespecific incidence is more appropriate, however less frequently applied [13]. A reduction in incidence of tumours with a poor prognosis might indicate a future reduction in mortality. Complete stage information is crucial for such analyses. Missing values for stage information might bias such a stagespecific analysis, especially when the missingnesspattern changes over time. A cancer registry may have a very complete case registration, yet still have missing information in important parameters, such as tumour size or lymph node status, remains an almost common phenomenon in populationbased cancer registration. The percentage of unknown stages can vary considerably between different cancer entities or cancer registries. Concerning melanoma and breast cancer, the federal cancer registries in Germany report the following percentages of unknown Tstage (tumour size according to TNM staging system [4]) between 1020% [5,6] and of unknown UICCstage [7] between 2040% [8,9]. There are several reasons for this, one being that tumour stage is often not known at the time of diagnosis and therefore, if the case is reported to the registry without additional notification, e.g. from the physician or from the pathologists, stage information is lost. Further, some cancer cases are only reported by a pathologist. These notifications  in general  do not provide any information on lymph node status or metastasis.
Concerning statistical analyses on the level of tumour stages, three more or less common approaches for dealing with missing stage data can be found: 1. adhoc missing data methods, such as omitting all cases with missing information [1012] or analysing them as a separate group [13], 2. distributing all cases with unknown stage proportionally to the known stages [14] and 3. using multiple imputation [15,16].
The first approach is widely known to produce biased results [17,18].
The second approach yields valid populationbased analyses on tumour stagespecific incidence if the precondition of equal tumour stage distributions among the cases with unknown Tstage and the cases with known Tstage is met. If this is not the case, the results will be biased. Additionally, analysis on the individual level is not possible with this method. Therefore, the missing values in tumour stage should be handled with an appropriate statistical method (such as approach 3) before calculating the stagespecific incidence rates to reduce the expected bias.
The following limited simulation analysis is aimed at determining a feasible method for imputation of missing stage information in empirical cancer registry data sets. We used cancer registry data for female breast cancer (a tumour site with only few missing values) and data for malignant melanoma (a tumour site with a high proportion of missing data). The cases with complete stage information were used to derive data sets with simulated missing stage information. We then analysed the individual stage estimations, the stagespecific numbers of cases and the stagespecific survival curves after treatment with different variants of multiple imputation.
Methods
Databases
Malignant melanoma data (men and women, ICD10 C43 excl. sarcomas) and breast cancer data (women; ICD10 C50) gathered by the cancer registry SchleswigHolstein in Germany between 2000 and 2008 was used for the following analysis. All DCO (death certificate only) cases were excluded from the analysis.
The cancer registry records data on tumour size (Tstage), involvement of lymph nodes (Nstage) and metastases (Mstage) according to the TNMclassification (see additional file 1: TNMdefinition for breast cancer and malignant melanoma) [4,19]. TNMstages can be combined to one prognostic classifier, using the UICCclassification (see additional file 2: UICCdefinition) [7]. The Tclassification as well as the UICCclassification consists of four main categories, with stage I having a good survival prognosis and stage IV a poor prognosis.
Additional file 1. Definition of the TNMcategories. Definition of the TNMcategories for malignant melanoma (ICD10 C43) and breast cancer (ICD10 C50) according to the TNM5 and the TNM6classification.
Format: PDF Size: 28KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 2. Definition of the UICCstages. Definition of the UICCstages for malignant melanoma (ICD10 C43) and breast cancer (ICD10 C50) according to the TNM5 and the TNM6classification.
Format: PDF Size: 21KB Download file
This file can be viewed with: Adobe Acrobat Reader
Imputation of missing stage information
Our analysis consists of six main steps:
1. Selection of variables
2. Simulation of five breast cancer data sets and five malignant melanoma data sets
3. Specification of the imputation models
4. Creation of ten complete data sets out of each simulated data set using multiple imputation
5. Statistical analysis and model evaluation
6. Sensitivity analysis for malignant melanoma
1. Selection of variables
Available clinically relevant variables, potentially related to stage information, were selected: sex, age at diagnosis, morphology, topography, grading, operation (yes/no), radiation therapy (yes/no), chemotherapy (yes/no), hormone therapy (yes/no), survival time and censoring. Additionally, each of the Tstage, Nstage and Mstage is associated with the two others, accordingly.
In order to take minor changes in the classification system over time and other possible temporal changes into account, year of diagnosis was also included into the analysis. As the breast cancer analysis excludes men, the variable sex was removed. No malignant melanoma patient received hormone therapy, so this variable was omitted for this data set.
Most categories in the variable morphology of the tumour had small entries. Thus, only categories affecting at least 1% of all patients were used. The other categories were pooled  with the not otherwise specified (NOS)  into one category. Topography of the tumour was treated analogously.
Most predictor variables have missing values (Table 1). This will be addressed by the multiple imputation method.
Table 1. Description of the observed and the simulated data sets for breast cancer and malignant melanoma patients
2. Simulation of a breast cancer data set and a malignant melanoma data set
Conducting the multiple imputation methods on a simulated data set enables us to judge the quality of the results and to make a significant comparison between the different methods, because the true results are known. Although a proper simulation study [20] could give higher evidence, we decided to restrict the analysis to a very small simulation study with two scenarios (female breast cancer and malignant melanoma), one data generation process (described below) and five simulated data sets each, as this seemed to be a sufficient approach in recognising the possible methods and determining which of these produce acceptable results.
It was important to retain the dependencies between the variables as close as possible. Rather than simulating the data set by assuming a multivariate normal distribution of the  if necessary, transformed  variables, which would involve a certain degree of abstraction, we generated a data set using the original data set itself as the data basis: The observed female breast cancer data set D had approximately 21,500 cases of which 80% had no missing value in any of the variables T, N or Mstage. These cases were used as the observed values for the simulated data set S. We assumed that the missingnesspattern of stage depends mostly on age at diagnosis, survival time, censoring and the interaction between survival time and censoring. These variables were complete in D. A logistic regression model was fitted for each variable with missing values, with age at diagnosis, survival time, censoring and the interaction between survival time and censoring as independent variables. These models could now predict the probability of any value in S to be missing. Every value in S was deleted randomly, depending on its individual missingnessprobability; a value with a high probability of missingness was therefore more probable to be deleted, but did not necessarily have to be deleted. The resulting data set was the simulated data set S.
The generation of missing values depended on a random starting value. Changing the random starting value would have produced a different simulated data set, which might have resulted in different conclusions about the imputation methods. To avoid such biases, we simulated a total of five data sets and obtained 50 completed data sets for each cancer entity and each variant of multiple imputation. The missingnesspattern was independent among the five data sets, but all imputation methods were conducted on the same five data sets. We used the default random number generator of R, "MersenneTwister", with five different starting seeds.
The observed malignant melanoma data set consisted of about 5,500 cases, of which 30% had complete T, N and Mstage information. The simulated data set was generated in the same way as described above.
3. Specification of the imputation models
Multiple imputation with chained equations [21] was used. Four scenarios with different imputation models were compared:
(1) Polytomous logistic regression is applicable for categorial data and may be used for the T and Nstages. However, there were only two Mstages, hence the polytomous logistic regression reduced to dichotomous logistic regression. The imputations of the missing values in the following four predictor variables morphology, radiation therapy, chemotherapy and hormone therapy were randomly sampled from the observed values.
(2) Predictive mean matching is a linear regression, in which the predicted value is substituted for the closest observed value. In our case, this yielded a value of 1, 2, 3 or 4 for T and of 0, 1, 2 or 3 for N. This method is valid for data on an ordinal scale. As for Mstage, dichotomous logistic regression was used as in scenario 1. The missing values in the predictor variables morphology, radiation therapy, chemotherapy and hormone therapy were treated as in scenario 1.
(3) The third scenario consisted of random forests [22] for T, N and M. Modern machine learning techniques are often superior to classical regression models if the modelling is complex, for example if interactions and nonlinear relations are involved [23]. The imputation models based on logistic regression and predictive mean matching included the interaction between survival time and censoring in their set of predictor variables, because a short survival time must be interpreted differently for a deceased person than for someone still alive, having only a short followup time. No interaction term was needed in the random forest because this method can internally model flexible interactions. The missing values in morphology, radiation therapy, chemotherapy and hormone therapy were treated as in scenario 1.
(4) The customary approach was to sample the missing tumour stage values from the observed stages, yielding a proportional distribution. To make the results comparable to the results from approach (1) to (3), multiple imputation was used rather than single imputation.
One assumption of multiple imputation is that the missing values are missing at random (MAR), e.g. the absence of a particular item is only dependent on other observable variables and not on unobservable parameters, nor the value of the item itself [23]. We included 13 predictors in the imputation models, which made the MAR assumption more plausible.
4. Creation of ten complete data sets out of each simulated data sets using multiple imputation
Ten completed data sets were generated for each simulated data set for both cancer entities, using the four imputation scenarios introduced above, which is usually sufficient [24]. Gibbs sampling with ten iterations was used to ensure model convergence [25].
5. Statistical analysis and model evaluation
The basic quality of the imputations was measured by the concordance of the stage predictions with their observed values and the extent of dislocation.
We then calculated T and UICCstagespecific numbers of cases based on the ten completed simulated data sets and compared them to the observed stagespecific numbers of cases. The numbers of cases and their standard deviations were calculated according to the rules for combining completedata inferences [26]. The mean absolute deviation (MAD) aggregated the information on differences between the predicted and the observed stage distributions for comparison of the different methods.
Finally, we plotted KaplanMeier survival curves. As T and UICCstages are prognosis groups, the observed and the predicted stagespecific survival curves should be similar. The differences were examined with logrank tests for each stage. The logrank test statistics of all imputed data sets were summed up to provide a measure for the total difference and to indicate the best imputation model.
6. Sensitivity analyses for malignant melanoma
Although ten imputations should generally suffice for data with a modest amount of 1030% missingness, the malignant melanoma with 39% missing Tstages (38% in the simulated data set) and 70% missing UICCstages (91% in the simulated data set) may require more imputations. It must be kept in mind, that a UICCstage can already be missing if only one of the three stages (T, N, M) is missing. Although a percentage of 91% of the UICCstages was missing, 'only' 56% of the values needed for the calculation of the UICCstage were missing.
We repeated the analysis for malignant melanoma with 25 imputations and 50, rather than ten, iterations.
Software
All statistical analyses were done in R 2.11.1 [27] using the packages mice [21], survival [28] and randomForest [29].
Descriptive statistics
The percentages in the individual variable categories are given for the description of the data. The median and the first and third quartiles are shown for age and survival time.
Results
Missing information
There were 21,428 incident cases of female breast cancer in SchleswigHolstein between 2000 and 2008. Six percent of the cases had no information on the Tstage, 11% had missing values in the Nstage and 16% in the Mstage. Only 17,162 (80%) cases had valid information on all three parameters.
In the same time period 5,520 cases of malignant melanoma were registered in SchleswigHolstein. The percentage of missing values was higher than in breast cancer: 39% in T, 68% in N and 67% in M. The stage information was complete for 1,685 cases (30%). Table 1 shows the most important variables and their distributions including the number of missing values.
Simulated data sets
The simulated data sets based on the cases with complete T, N and Mstage were similar to the original data sets for the relevant variables (Table 1). The only exceptions were a higher percentage of missing values for UICCstage (27% versus 18% for breast cancer, 91% versus 70% for malignant melanoma) and a longer median survival time for malignant melanoma (1765 versus 1552 days).
The values of the cases with complete T, N and Mstage in the original data sets are referred to as "observed values".
Accuracy of the imputations on individual level
Table 2 shows the concordance rate of imputed and observed T and UICCstages. Polytomous regression and predictive mean matching always yielded the highest concordance rate: approximately 48% of all imputations matched the observed Tstage value and approximately 80% of all imputation matched the observed UICCstage for both cancer entities. These concordance rates were higher than can be achieved by chance, as the lower concordance rates rendered by proportional sampling indicate.
Table 2. Concordance rates of imputed with observed T and UICCstages for breast cancer and malignant melanoma
Dislocations by three stages, i.e. a T1 (UICC I) imputed as T4 (UICC IV) or vice versa, occurred in less than 5% of all imputations for polytomous regression and predictive mean matching.
Estimations of the stagespecific numbers of cases
Table 3 displays the observed and the predicted case numbers for T and UICCstage after multiple imputation with the different scenarios. For example, there were 8,909 breast cancer cases in T1stage in the observed data set. After multiple imputation of missing values in the simulated data set, a number of 8,903.2 was predicted by the polytomous regression approach.
Table 3. Observed and with different multiple imputation methods predicted T and UICCstagespecific numbers of cases for breast cancer and malignant melanoma
Polytomous regression and predictive mean matching for multiple imputation of missing Tstage had the smallest deviations from the observed case numbers for breast cancer. The best results for malignant melanoma were again achieved by these approaches and also by proportional sampling. Although the percentage of missing values was substantially higher for malignant melanoma, the results for Tstage were comparable for both breast cancer and melanoma.
Polytomous regression and predictive mean matching were also showing similar results for UICCstage, however the proportional approach was even closer to the observed stagespecific numbers of cases for malignant melanoma.
The random forest scenario was always less accurate than the estimations by the other scenarios and had the largest standard deviations.
Survival curve estimations
The sum of logrank test statistics over all stages and imputations in Figures 1, 2, 3 and 4 indicate that the survival curves after multiple imputation with polytomous regression or predictive mean matching were closer to the observed survival curves than those after multiple imputation with random forests or proportional imputation. The logrank statistics for UICCstage for malignant melanoma were considerably higher than for Tstage or breast cancer (332.3 versus 80.8, 11.7 or 47.9 for polytomous regression).
Figure 1. Tstagespecific survival curves for female breast cancer. The predicted survival curves are based on the 50 completed data sets.
Figure 2. Tstagespecific survival curves for malignant melanoma. The predicted survival curves are based on the 50 completed data sets.
Figure 3. UICCstagespecific survival curves for female breast cancer. The predicted survival curves are based on the 50 completed data sets.
Figure 4. UICCstagespecific survival curves for malignant melanoma. The predicted survival curves are based on the 50 completed data sets.
Sensitivity analyses
The results for malignant melanoma after multiple imputations with 25 imputations and 50 iterations altered only marginally and are not shown here. The greatest changes occurred for the random forest scenario, which can be explained by the large variance in all its estimations. Convergence of the multiple imputation algorithms was usually achieved very soon, i.e. even less than ten iterations would have sufficed. Only the random forest approach often did not converge at all.
Discussion
Populationbased cancer registry data is an important source for the evaluation of early detection programmes. Stagespecific analysis of incidence is especially crucial. A decrease in incidence of cancer with poor prognosis might be a strong indicator for future mortality reduction [1,2]. However, it is almost impossible to collect all data without missing information on the tumour stage, even if the cancer registry has complete registration. Missing data on tumour stage poses as a serious problem in the evaluation of early detection programmes. Completing the data set by an active followback, such as repeated record inspection, physician interview or other strategies is desirable, would, however, involve high costs and be very time consuming. Thus, appropriate alternatives in handling the unknown information in the cancer registry data set should be used. In this analysis, different variants of multiple imputation were studied, with respect to their feasibility and appropriateness for the imputation of missing values in tumour stages, limiting our analysis to one cancer entity with a high number of cases with missing tumour stage information and one cancer entity with only few missing tumour stage data.
Multiple Imputation
A flexible and common approach of dealing with missing values is multiple imputation [1517,30]. Multiple imputation with chained equations works as follows: for each variable with missing values an individual imputation model is fitted. The predictor variables are related to the missingness and/or to the value of the respective variable. The incomplete data set is completed by iterative imputation of the missing values with the corresponding imputation model. This is done m times, generating m completed data sets. Now the statistical analysis of interest is performed with each data set separately. Finally the m results are pooled to one result [31]. The application of multiple imputation is more complex than the use of other missingdataapproaches [32].
There are two main advantages of multiple imputation. First, in contrast to complete case analysis, all information in the data set is used in the analysis and the results are less likely to be biased. Second, missing values can only be imputed with some degree of uncertainty. In contrast to single imputation methods this uncertainty is reflected by the variability of the m results [33,34].
Difference between T and UICCstage predictions
Only about 20% of the imputed values for UICCstage are different from the observed values, while about 50% of the Tstage imputations are dislocated by at least one stage. It has to be taken into account that UICCstage is generated from the three stage variables T, N and M and in many cases only one or two of them are missing.
Although the UICCstage imputations for malignant melanoma correspond so well to the observed values on the individual level, the predictions of the stagespecific numbers of cases and survival curves were not accurate. This is due to the fact that the percentage of missing values is much higher in malignant melanoma cases and therefore a percentage of 50% dislocated imputed stage values has a greater impact in the total data set.
Choice of the most appropriate imputation model
Machine learning techniques have been reported to produce better results than other classification models in situations with complex relations such as interactions or nonlinear relations [23,35]. Thus, we used random forests as an imputation model in addition to the two methods, which were already implemented in the micepackage in R (polytomous regression and predictive mean matching) to compare them to proportional sampling.
Overall, the imputation scenario based on polytomous regression seems to yield the best results. Imputed stage values are closest to the observed values; the difference of stagespecific numbers of cases to the observed data is smallest and the stagespecific survival curves fit best to the observed ones.
The predictive mean matching scenario yields results nearly as accurate as those by polytomous regression. It has the advantage of a shorter processing time and was found to be an appropriate method for imputation of missing values in other studies [36]. An explanation for the slightly better results of the polytomous regression might be a nonlinearity of the stages.
Using regression trees as imputation models for multiple imputation was found to be promising elsewhere [23]. Random forests, which consist of many regression trees, produce more stable results than a single regression tree. However, in the context of our study the estimations by the random forest scenario tended to have very large variances and were the most biased of all four scenarios. This might be due to convergence problems in the data completion; random forests are able to model complex relations, but if there is a lot of noise in the data, a random forest fits the model to this noise and the model fits can alter to a great extent from one iteration to the next. A simpler model such as polytomous regression or predictive mean matching seems to fit our data better.
The amount of missing values for the UICCstage for malignant melanoma was too high to permit reasonable estimations by any of the applied methods. This agrees with other findings [36], that estimates are biased when the proportion of missing data exceeds 50%. In this case, the imputation can be strongly influenced by noise and produce biased results. In such a situation the proportional sampling approach, which does not depend on any covariates and cannot be influenced by their noise, yielded better estimations of the stagespecific numbers of cases. However, this approach makes the strong and probably inapplicable assumption that the stage distribution in the unknown stages is the same as in the observed stages. The conventional method of assigning all cases with unknown stage proportionally to the known stages  which is equivalent to proportional sampling with only one imputation  has the additional drawback of single imputation compared to multiple imputation.
Strengths and limitations
The cancer registry in SchleswigHolstein has a high completeness: it is estimated to be almost 100% [37]. Therefore, the only possible source for bias in the stagespecific numbers of case estimates is a biased imputation model in the multiple imputation procedure. The high completeness also reduces the risk of biased imputation models because no significantly different subgroups are missing in the model building.
A simulation study has the great advantage of making the results of the different methods comparable, because the true results are known.
One limitation of our small simulation study is the restriction to one scenario for the generation of the simulated data sets, that is the exclusion of the cases with missing T, N or Mstage from the simulated data sets. If these cases differ substantially from the other cases, the results of the analyses are not directly transferable. The same problem would occur if the real missingnesspattern differs substantially from the model we fitted from the original data set and used to generate the simulated data sets. The greatest difference between the simulated data set and the original data sets is the higher amount of missing values in UICCstage. The values in T, N and Mstage were removed with separate models, which lessens the correlation between the missingness in these three variables. Thus, there were more cases to fill in, but the same number of missing values had to be imputed. Another difference occurred because a disproportionally high number of cases with a short survival time was omitted from the malignant melanoma data set. This is not seen as a bias in the data sets, because the shorter survival time probably only means a shorter registration time, i.e. less time to get notifications on T, N and Mstage. For the other variables, the univariate distributions in the simulated data set did not differ very much from those in the original data set.
The simulation of each five data sets for both cancer entities was aimed at controlling the variation of results due to the random deletion of values, but it is still a small number of simulations.
A proper simulation study would require additional scenarios in the design of simulated data sets, a higher number of simulations and a more detailed reporting of the results of the simulation study [20]. However, this limited simulation study appeared to be sufficient at identifying feasible imputation methods that provide reasonable results in cancer epidemiology.
Another limitation is that the imputation models do not take into account that the followup period for the recently diagnosed patients is quite short, which leads to very short survival times for patients who are alive and who may have a very good prognosis. We attempted to address this problem by including an interaction term for censoring and survival time and employing a random forest as imputation model, which might be capable of modelling such complex relations. The inclusion of year of diagnosis as predictor variable also helped to model this effect.
Further, the results are restricted to the data on two cancer entities of one cancer registry.
Conclusions
For statistical analysis of tumour stage information in cancer registry data, both on the individual and the aggregated level, multiple imputation with chained equations using polytomous regression or predictive mean matching as an imputation model was in this limited simulation study found to be an appropriate method for dealing with missing data in tumour stage. Utilizing one of these methods should lead to less biased estimates than using a crude proportional method. Polytomous logistic regression and also predictive mean matching regression as imputation models for T, N and Mstage yield good estimations on the individual stage value, the stagespecific numbers of cases and the stagespecific survival curves, as long as the amount of missing values is not too high. In contrast, random forests are not recommended because convergence problems in the multiple imputation were observed, the results are less close to the observed parameters and have often large variances.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
NE made substantial contributions to the conception and design, the acquisition of the data, the statistical analysis and the interpretation of the data and was involved in drafting the manuscript. AW made substantial contributions to the analysis and the interpretation of the data and was involved in drafting the manuscript. AK made substantial contributions to the conception and design, the acquisition of the data, the statistical analysis and the interpretation of the data. All authors read and approved the final manuscript.
References

Duffy SW, Tabar L, Vitak B, Day NE, Smith RA, Chen HHT, Yen MFA: The relative contributions of screendetected in situ and invasive breast carcinomas in reducing mortality from the disease.
European Journal of Cancer 2003, 39:17551760. PubMed Abstract  Publisher Full Text

Fracheboud J, Otto SJ, van Dijck JAAM, Broeders MJM, Verbeek ALM, de Koning HJ: Decreased rates of advanced breast cancer due to mammography screening in The Netherlands.
British Journal of Cancer 2004, 91:861867. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Esserman L, Shieh Y, Thompson I: Rethinking screening for breast cancer and prostate cancer.
JAMA 2009, 302:16851692. PubMed Abstract  Publisher Full Text

Wittekind C, Meyer HJ, Bootz F: TNM Klassifikation maligner Tumoren. 6th edition. Berlin. Heidelberg. New York: Springer; 2002.
Auflage edn.

Urbschat I, Kieschke J, Rohde M, Langer C, Hecht W: Krebs in Niedersachsen 2006/07. Oldenburg; 2010.

Kraywinkel K, Batzler WU, Bertram H, Hense HW: Brustkrebs Ergebnisse aus dem Regierungsbezirk Münster 19922004. Münster: Epidemiologisches Krebsregister NRW; 2007.

Fritz A, Percy C, Jack A, Shanmugaratnam K, Sobin L, Parkin DM, Whelan SL: International classification of diseases for oncology: ICDO. 3rd edition. Geneva: World Health Organization; 2000.
Auflage edn.

Stabenow R, Streller B, WilsdorfKöhler H, Eisinger B: Krebsinzidenz und Krebsmortalität 20052006. In Schriftenreihe des GKR. Berlin; 2009.

Freie und Hansestadt Hamburg BfS, Gesundheit (Hrsg.): Hamburger Krebsdokumentation 20052006. Hamburg;

Anderson WF, Jatoi I, Tse J, Rosenberg PS: Male breast cancer: A populationbased comparison with female breast cancer.
Journal of Clinical Oncology 2010, 28:232239. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Jemal A, Ward E, Thun MJ: Recent trends in breast cancer incidence rates by age and tumor characteristics among U.S. women.
Breast Cancer Research 2007, 9:R28. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Lang K, Korn JR, Lee DW, Lines LM, Earle CC, Menzin J: Factors associated with improved survival among older colorectal cancer patients in the US: a populationbased analysis.
BMC Cancer 2009, 9:227. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Stockton D, Davies T, Day N, McCarthy BD: Retrospective study of reasons for improved survival in patients with breast cancer in East Anglia: earlier diagnosis or better treatment?
BMJ 1997, 314:472475. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Waldmann A, Eberle A, Hentschel S, Holleczek B, Katalinic A: Bevölkerungsbezogene Darmkrebsinzidenz im Zeitraum 2000 bis 2006  deuten sich erste Auswirkungen des KoloskopieScreenings an? Eine gemeinsame Auswertung der Krebsregisterdaten aus Bremen, Hamburg, dem Saarland und SchleswigHolstein.

Redaniel MT, Laudico A, MirasolLumague MR, Gondos A, Uy G, Brenner H: Intercountry and ethnic variation in colorectal cancer survival: comparisons between a Philippine population, FilipinoAmericans and Caucasians.
BMC Cancer 2010, 10:100. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Nur U, Shack LG, Rachet B, Carpenter JR, Coleman MP: Modelling relative survival in the presence of incomplete data: a tutorial.
Int J Epidemiol 2009, 39:118128. PubMed Abstract  Publisher Full Text

van Buuren S, Boshuizen HC, Knook DL: Multiple imputation of missing blood pressure covariates in survival analysis.
Stat Med 1999, 18:681694. PubMed Abstract  Publisher Full Text

Greenland S, Finkle WD: A critical look at methods for handling missing covariates in epidemiologic regression analyses.
Am J Epidemiol 1995, 142:12551264. PubMed Abstract

Wittekind C, Wagner G: TNMKlassifikation maligner Tumoren. 5th edition. Berlin Heidelberg New York: SpringerVerlag; 1997.
Auflage edn.

Burton A, Altman DG, Royston P, Holder RL: The design of simulation studies in medical statistics.
Stat Med 2006, 25:42794292. PubMed Abstract  Publisher Full Text

van Buuren S, GroothuisOudshoorn K: MICE: Multivariate Imputation by Chained Equations in R.

Machine Learning 2001, 45:532. Publisher Full Text

Burgette LF, Reiter JP: Multiple imputation for missing data via sequential regression trees.

Rubin DB: Multiple imputation for nonresponse in surveys. New York: Wiley; 1987.

van Buuren S, Brand JPL, GroothuisOudshoorn CGM, Rubin DB: Fully conditional specification in multivariate imputation.

Schafer JL: Analysis of incomplete multivariate data. London: Chapman & Hall; 1997.

R Development Core Team: R: A language and environment for statistical computing. 2.11.1 edition. Vienna, Austria: R Foundation for Statistical Computing; 2010.

Therneau T, original Splus > R port by Lumley T: survival: Survival analysis, including penalised likelihood.
2010.

Liaw A, Wiener M: Classification and Regression by randomForest.

van Dijk MR, Steyerberg EW, Stenning SP, Habbema JD: Survival estimates of a prognostic classification depended more on year of treatment than on imputation of missing values.
J Clin Epidemiol 2006, 59:246253. PubMed Abstract  Publisher Full Text

van Buuren S: Multiple imputation of discrete and continuous data by fully conditional specification.
Stat Methods Med Res 2007, 16:219242. PubMed Abstract  Publisher Full Text

Barzi F, Woodward M: Imputations of missing values in practice: results from imputations of serum cholesterol in 28 cohort studies.
Am J Epidemiol 2004, 160:3445. PubMed Abstract  Publisher Full Text

Brand J, van Buuren S, van Mulligen EM, Timmers T, Gelsema E: Multiple imputation as a missing data machine.

Donders AR, van der Heijden GJ, Stijnen T, Moons KG: Review: a gentle introduction to imputation of missing values.
J Clin Epidemiol 2006, 59:10871091. PubMed Abstract  Publisher Full Text

Weiss SM, Kapouleas I: An empirical comparison of pattern recognition, neural nets, and machine learning classification methods. In Readings in machine learning. Edited by Shavlik JW. Dietterich TG: Morgan Kaufmann; 1990:177183.

Marshall A, Altman DG, Royston P, Holder RL: Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study.

Pritzkuleit R, Holzmann M, Eisemann N, Gerdemann U, Katalinic A: Krebs in SchleswigHolstein  Inzidenz und Mortalität im Jahr 2008. Lübeck: SchmidtRömhild; 2011.
Prepublication history
The prepublication history for this paper can be accessed here: