Abstract
Background:
An important issue in prediction modeling of multivariate data is the measure of dependence structure. The use of Pearson's correlation as a dependence measure has several pitfalls and hence application of regression prediction models based on this correlation may not be an appropriate methodology. As an alternative, a copula based methodology for prediction modeling and an algorithm to simulate data are proposed.
Methods:
The method consists of introducing copulas as an alternative to the correlation coefficient commonly used as a measure of dependence. An algorithm based on the marginal distributions of random variables is applied to construct the Archimedean copulas. Monte Carlo simulations are carried out to replicate datasets, estimate prediction model parameters and validate them using Lin's concordance measure.
Results:
We have carried out a correlationbased regression analysis on data from 20 patients aged 17–82 years on preoperative and postoperative ejection fractions after surgery and estimated the prediction model: Postoperative ejection fraction =  0.0658 + 0.8403 (Preoperative ejection fraction); p = 0.0008; 95% confidence interval of the slope coefficient (0.3998, 1.2808). From the exploratory data analysis, it is noted that both the preoperative and postoperative ejection fractions measurements have slight departures from symmetry and are skewed to the left. It is also noted that the measurements tend to be widely spread and have shorter tails compared to normal distribution. Therefore predictions made from the correlationbased model corresponding to the preoperative ejection fraction measurements in the lower range may not be accurate. Further it is found that the best approximated marginal distributions of preoperative and postoperative ejection fractions (using qq plots) are gamma distributions. The copula based prediction model is estimated as: Post operative ejection fraction =  0.0933 + 0.8907 × (Preoperative ejection fraction); p = 0.00008 ; 95% confidence interval for slope coefficient (0.4810, 1.3003). For both models differences in the predicted postoperative ejection fractions in the lower range of preoperative ejection measurements are considerably different and prediction errors due to copula model are smaller. To validate the copula methodology we have resampled with replacement fifty independent bootstrap samples and have estimated concordance statistics 0.7722 (p = 0.0224) for the copula model and 0.7237 (p = 0.0604) for the correlation model. The predicted and observed measurements are concordant for both models. The estimates of accuracy components are 0.9233 and 0.8654 for copula and correlation models respectively.
Conclusion:
Copulabased prediction modeling is demonstrated to be an appropriate alternative to the conventional correlationbased prediction modeling since the correlationbased prediction models are not appropriate to model the dependence in populations with asymmetrical tails. Proposed copulabased prediction model has been validated using the independent bootstrap samples.
Background
Researchers, clinicians, and scientists are increasingly interested in the statistical models that have been designed to predict the occurrence of endpoint events given the diagnostic risk factors. The number and sophistication of cancer risk prediction models have grown rapidly over recent years and some researchers have expressed concerns as to whether they are always appropriately applied, correctly developed and rigorously evaluated. In 2004 the National Institutes of Health sponsored a workshop on Cancer Risk Prediction Models: a Workshop on Development, Evaluation, and Application in Washington D.C., USA. Experts associated with developing, evaluating, or using risk prediction models met to identify the strengths and limitations of cancer and genetic susceptibility prediction models currently in use and under development, in order to explore the methodological issues related to their development, evaluation and validation and also to identify the research priorities and resources needed to advance the field [1].
In this paper, a basic methodological issue of including the dependence parameter in the prediction model is considered. Pearson's linear correlation coefficient known as correlation is widely applied as a linear dependence measure. However, the correlation has several drawbacks and has a major impact on the accuracy of prediction models [2]. Correlation does not provide a complete description of the dependence structure even when there is a straightline relationship between two random variables. Rather correlation is the canonical measure of the stochastic dependence used with normal (elliptical) distributions and is strongly affected by extreme endpoints. Independence of two random variables implies that they are uncorrelated but zero correlation, in general, does not imply independence unless distributions are multivariate normal. Furthermore, correlation is not invariant under nonlinear strictly increasing transformations of random variables. Nonparametric measures of association like Kendall's rank correlation, Spearman's rank correlation and cstatistic are alternate measures of dependence which are more robust [3]. The kappa statistic is often used to measure the level of agreement when two categorical measurements of the same subjects are available. For an excellent review of dependence measures and their desirable properties, we refer to [24].
An alternative dependence measure is a copula which overcomes the limitations of correlation as a measure of dependence [58]. Use of copulas is a relatively new concept and has been applied in survival data analysis and actuaries [9,10]. Copulas are functions that join or couple multivariate distribution functions to their onedimensional marginal distribution functions. Advantages of using copulas in modeling are (i) allowance to model both linear and nonlinear dependence, (ii) arbitrary choice of a marginal distribution and (iii) capable of modeling extreme endpoints.
This paper describes the copulabased prediction modeling which can be employed as an alternative to the conventional correlationbased modeling in any multivariate clinical applications including riskprediction. Implementation of copula based prediction approach is illustrated by analyzing data from patients with aortic regurgitation and corrective surgery [11].
Methods
Study example
The study example is adapted from an investigation [11] which enrolled 20 patients for isolated aortic regurgitation both before and after surgery and 20 patients for isolated mitral regurgitation. To correct the malfunctioning of the aortic valve, open heart surgery was performed and an artificial valve was sewn into the heart. Data collected were on patient's age, gender, NYHA class (amount of impairment in daily activities), heart rate (beats/minute), systolic blood pressure (mmHG), ejection fraction (fraction of blood in the left ventricle pumped out during a beat), EDVIvolume (ml/m^{2}) of the left ventricle after the heart relaxes adjusted for body surface area (BSA), SVIvolume (ml/m^{2}) of the left ventricle after the blood is pumped out adjusted for BSA, ESVI volume (ml/m^{2}) of the left ventricle pumped out during one cycle adjusted for BSA; ESVI=EDVISVI. These measurements were taken before and after valve replacement surgery. The patients were selected to have left ventricular volume overload, i. e., expanded EDVI. For the purpose of illustration, we have used data on postoperative ejection fraction and preoperative ejection fraction from 20 patients with aortic regurgitation.
What are copulas?
We denote the cumulative probability distribution of preoperative ejection fraction (X) and postoperative ejection fraction (Y) by H(x, y) and marginal distributions of X and Y by F(x) and G(y) respectively. For uniform random variables U and V defined on [0,1] (by applying probability transforms U = F(X) and V = G(Y) to X and Y), there exists a bivariate copula function C(u, v) such that:
It is shown [24] that correlation r is only a limited description of the dependence between random variables except for the multivariate normal distribution where the correlation fully describes the dependence structure. If F(x) and G(y) are continuous then C(u, v) is unique otherwise C(u, v) is uniquely determined on range of F(x) × range of G(y). Since copulas link univariate marginals to their full multivariate distribution, an important feature of copulas is that any choice of marginal distributions can be used. Copulas are constructed on the assumption that marginal distributions are known or estimated from the data.
The two standard nonparametric dependence measures expressed in copula form are:
The dependence measures τ and ρ calculated from the application data are used to estimate the copula parameter. It may be noted that the Pearson's correlation r cannot be expressed in copula form.
A special class of copulas known as Archimedean copulas [12] is defined by C(u, v) = φ^{1 }[φ (u) + φ (v)] for all u, v ∈ [0,1], where φ (t) is a generator function such that for all t ∈ (0,1) φ (1) = 0 φ '(t) < 0, i.e., φ (t) is a decreasing function of t and φ "(t) ≥ 0, i.e., φ (t) is convex. Oneparameter families of the Archimedean copulas with their generator functions are tabulated by Nelson [[6], p. 94].
From the copulas perspective multinormal distribution has normal marginals and Gaussian (normal) copula dependence. NonGaussian copulas such as t and Archimedean can be used as an underlying dependence structure with any other nonnormal marginals. Thus copulas provide flexibility in modeling datasets. Some examples of bivariate Archimedean copulas are given in Table 1.
Table 1. Bivariate Archimedean copulas, generator functions and Kendall's τ.
Sample versions of measures of dependence can be expressed in terms of empirical copula and corresponding empirical copula frequency function [6].
Definition. Given (x_{i}, y_{i}), i = 1, ...n, a sample of size n from a bivariate distribution, the empirical copula is = [Number of pairs (x, y) in the sample such that x ≤ x_{(i)}and y ≤ y_{(j)}]/n, where x_{(i) }and y _{(j)}, 1 ≤ i, j ≤ n, denote order statistics from the sample. The empirical copula frequency function is given by , if (x_{(i)}, y_{(j)}) is an element of the sample; Otherwise zero.
Simulation of bivariate Archimedean copulas
The following algorithm generates random variables (U, V) whose joint distribution is an Archimedean copula C(u, v) with generator function φ (t).
1. Generate two independent uniform random variables p and q on the interval [0,1].
2. Set t = K_{C}^{1}(q) where K_{c }is a copula function C(u, v).
3. Set u = φ^{1 }[p·φ (t)] and v = φ^{1 }[(1p)·φ (t)].
4. Let x = F^{1}(u) and y = F^{1}(v).
5. Repeat n times steps 1 through 4 to generate n pairs of data (x_{i}, y_{i}), i = 1,..., n.
For implementing the algorithm, we perform the following steps [13]:
A. Kendall's rank correlation τ by the formula:
B. Copula parameter θ from τ .
C. Generator function φ (t).
D. First derivative of the generator function, φ '(t).
E. Inverse of the generator function, φ^{1}(t).
G. Inverse of copula function K_{C}^{1}(t) In case no close form exists, solution is obtained by numerical root finding through the equation .
H. u = φ^{1 }[p·φ (t)] and v = φ^{1 }[(1p)·φ (t)].
For ready reference algorithm implementation steps for some commonly applied Archimedean copulas are worked out and are given in Table 2.
Table 2. Algorithm steps for the Archimedean copulas.
Evaluating copulas
The first step in modeling and simulation is to identify the appropriate copula form. To identify the most appropriate copula for the given application data set (x_{i}, y_{i}), i = 1,...n, we follow the procedure [3,7]:
1. Calculate the nonparametric Kendall's rank correlation τ using the formula in equation (4).
2. Construct an empirical copula function K_{E}(t) as follows:
i. Determine the pseudo observations T_{i }= {Number of(x_{j }<x_{i}) such that x_{j }≤ x_{i }and y_{j }≤ y_{i}}/(n1).
ii. The empirical copula K_{E}(t) = proportion of T_{i}'s ≤ t, 0 ≤ t ≤ 1.
In nonmathematical terms, it means that for all pairs of subjects in which the Yvalue for a given subject is lower (or higher) than the Yvalue of a second subject, for what proportion of Xvalues does the first subject also have a lower (or higher) value?
3. Construct the Archimedean copula function .
In order to select the Archimedean copula that best fits the application data, we use a probability – plot or choose that copula which minimizes the nonparametric distance measure DM: ∫ [K_{c}(t)  K_{E}(t)]^{2}dK_{E}(t). For simulating bivariate Archimedean copulas, we refer to [14].
Further it may be worth exploring the connections of copulas to other nonparametric association statistics like cstatistic which are defined in terms of the concordant (C) and discordant (D) pairs. One such relationship is easily seen to exist between the Gumbel copula parameter θ and the concordant and discordant pairs. The Kendall's rank correlation τ in terms of (C, D) pairs is τ = 2(CD)/n(n1) and the Gumbel copula parameter θ and Kendall's rank correlation are related by τ = (θ 1)/θ . Thus it is easily seen that .
Validating the prediction model
To validate a prediction model, joint assessment of precision and accuracy is required. In [15], Lin proposed the concordance correlation coefficient to evaluate the agreement (reproducibility) between two sets of observed (y) and predicted (y) data. The concordance correlation is defined as:
where μ_{y }and are means of (y) and (), σ_{y}^{2 }and denote variances of (y) and () and is the population covariance between (y) and (). The concordance correlation is a product of precision (correlation between Y and ) and accuracy , where accuracy
The estimation of concordance correlation and its asymptotic sampling distribution are discussed in Lin [15].
Results
Application
The measurements on preoperative and postoperative ejection fraction from 20 patients with aortic regurgitation including their ages are given in Table 3. The exploratory data analysis indicates that both the preoperative and postoperative ejection fractions: (i) have slight departures from symmetry, (ii) are skewed in left tails (skewness coefficients being 0.3340 and 0.3730, respectively), (iii) have tendency of measurements to cluster less and (iv) have shorter tails (kurtosis coefficients being 0.9680 and 1.2540, respectively). Since both the preoperative and postoperative measurements show deviations from normal distributions, probability plots for normal, gamma and Weibull distributions were fitted. From the plots, gamma distributions are found to be the best fit since data points clustered mostly around a straight line for the gamma fit. Estimates of parameters of marginal distributions of preoperative and postoperative measurements are given in Table 3. Probability plots are graphed in Figure 1 and estimated marginal distributions are given in Figure 2.
Table 3. Data from patients with aortic regurgitation.
Figure 1. Quantile Plots of preoperative and postoperative ejection fractions.
Figure 2. Marginal distributions of preoperative and postoperative ejection fractions.
There is an evidence of significant association between the preoperative and postoperative ejection fractions since the Pearson's correlation coefficient r = 0.6870 (p < 0.0010), Kendall's rank correlation τ = 0.5050 (p < 0.0020) and Spearman's rank correlation ρ = 0.6970 (p < 0.0010).
For predicting the postoperative ejection fraction of a patient after surgery given preoperative ejection fraction measurement, we have estimated the conventional prediction regression model using correlation coefficient:
The pvalue indicates that the estimated model is useful in predicting the postoperative ejection fraction of a patient given the preoperative ejection fraction. However, predictions made in the lower range of preoperative ejection fractions may not be accurate because of the skewness exhibited by data in the left tail. As an alternative a copulabased prediction model is discussed below.
Simulation study
Three copulas of the Archimedean family namely Gumbel, Clayton, Frank and an empirical copula [1619] are estimated from the aortic regurgitation patients' data. These copulas are shown in Figure 3. Values of the nonparametric distance measure DM: ∫ [K_{c}(t)  K_{E}(t)]^{2}dK_{E}(t) for the Gumbel, Clayton and Frank copulas are 0.1440, 0.1580 and 0.1500 respectively. Thus, Gumbel copula is the best fit to model the given data. Monte Carlo simulations are performed to replicate datasets 50, 100, 150, 200, 250 and 300 times by implementing the algorithm to simulate bivariate data from the Gumbel copula.
Figure 3. Which copula fits the best?.
The estimated prediction model and 95% confidence intervals are given in Table 4. The prediction regression model for the postoperative ejection fraction using Gumbel copula and based on 300 simulations is:
Table 4. Estimated prediction models and 95% confidence intervals.
Since patient's age may be an important risk factor, we have included age as another predictor in the model. The estimated parameters of the ageadjusted copula based model and correlation based models are summarized in Table 5. Both prediction models indicate highly significant predictive power (Rvalues are 0.7010 and 0.7650 for correlation and copula based models respectively). The regression coefficient of age in both models is not significant (pvalues being 0.2120 for correlation model and 0.2610 for copula model).
Table 5. Estimated prediction model adjusted for age and 95% confidence intervals.
For comparison, predicted values of the postoperative ejection fractions from both copula and correlation based prediction models and actual data are shown in Figure 4. The percent absolute prediction errors for the lower preoperative ejection fractions from the copula and correlation methods are given in Table 6. It is clear from Table 6 that prediction errors due to the copula method are smaller than those based on the correlation method. It is therefore demonstrated that the copula is a more appropriate dependence measure capable of modeling asymmetrical tails whereas correlation is not appropriate to model skewed data.
Table 6. Percent absolute prediction errors in the lower tail from copula and correlation models.
Figure 4. Predicted postoperative ejection fractions based on the copula and correlation based prediction models.
Further, it may be noted that estimates, standard errors and width of confidence intervals from 50,100,150,200,250 and 300 copula simulations in Table 4 are very close. Thus, the proposed copula based prediction method does not require a large number of simulations to attain consistent estimates.
Validation using bootstrap independent data set
To validate the prediction model we were unable to obtain an independent dataset from the same population. Alternatively we have simulated fifty independent datasets by sampling with replacement from our dataset (bootstrap method). Such an approach is recommended for simulating independent datasets for methodological validation while analyzing small datasets. We found precision coefficient to be 0.8363 (p < 0.0001) indicating that the observed and predicted measurements have a strong association. The estimate of concordance statistic is 0.7722 (p = 0.0224) for the copula model and 0.7237 (p = 0.0604) for the correlation model. The predictions and observed measurements are therefore concordant for both models. The estimates of accuracy coefficient are 0.9233 and 0.8654 for copula and correlation models respectively.
Discussion
It is documented [24] that in prediction models the Pearson's linear correlation coefficient is not a complete and accurate description of dependence structure between dependent and predictor variables even when there exists a straightline relationship between them. An alternative method is to model the dependence structure using copulas which overcomes the limitations of correlation. Copulas are functions that join multivariate distribution functions to their onedimensional marginal distribution functions. Copulas allow modeling of both linear and nonlinear dependence. Through copulas any choice of marginal distribution functions can be used and extreme endpoint distributions can be modeled.
The copulabased approach to prediction modeling in clinical research methodology is described and is illustrated by estimating the prediction model for postoperative ejection fraction given the preoperative ejection measurements from an aortic regurgitation patients study. The approach provides flexibility in modeling and simulating datasets because many families of copulas are known to exist in the literature. It may be noted that copula based methodology is general, since it is applicable to model data with discrete, continuous and dichotomous outcomes. However a note of caution is about the evaluation of the method based on a small data set. A more rigorous validation should be based on an independent sample taken from the population.
There appears to be connections of copulas to other nonparametric association statistics like cstatistic which are defined in terms of concordant (C) and discordant (D) pairs. One such relationship between the Gumbel copula parameter and concordantdiscordant pairs is shown to exist.
Conclusion
We emphasize that the commonly used Pearson's linear correlation coefficient is not a complete description of dependence structure even when there is a straightline relationship between two random variables. An alternative copulabased methodology for prediction models in clinical research is described. The proposed copulabased model is capable of modeling the behavior of skewed data whereas correlation model is not appropriate for asymmetrical tails. The main statistical advantage of copulas is in replicating datasets through simulation with any type of marginal distributions.
Competing interests
The author(s) declare that they have no competing interests.
Authors' contributions
PK conceived of the study problem. MMS participated in formalizing study and providing data. Both authors carried out calculations. Both authors read and approved the final manuscript.
Acknowledgements
Authors wish to thank reviewers especially Jesse Berlin and Eric Lim for their insightful and critical review which has improved on earlier draft of the manuscript. Thanks are also due to MarilynLockyer, KingFaisalHeartInstitute for her careful reading of the manuscript. Authors thank Research Centre, KFSH&RC for sponsoring the research project ORA: 2060 022.
References

Freedman AN, Seminara D, Gail MH, Hartge P, Colditz GA, BallardBarbash R, Pfeiffer RM: Cancer Risk Prediction Models: A Workshop on Development, Evaluation, and Application.
Journal of the National Cancer Institute 2005, 97(10):715723. PubMed Abstract  Publisher Full Text

Embrechts P, Mcneil AJ, Straumann D: Correlation and dependence in risk management: properties and pitfalls. In Risk Management: Value at Risk and Beyond. Edited by Dempster M, Moffatt HK. Cambridge University Press; 1999.

Frees EW, Valdez E: Understanding relationships using copulas.

Schweizer B, Wolff EF: On nonparametric measures of dependence for random variables.

Sklar A: Functions de repartition a n dimensions et leurs merges.

Nelson RB: An Introduction to Copulas. SpringerVerlag New York, Inc; 1999.

Genest C, Rivest L: Statistical inference procedures for bivariate Archimedean copulas.
Journal of the American Statistical Association 1993, 88:10341043. Publisher Full Text

Joe H: Parametric families of multivariate distributions with given marginals.
Journal of Multivariate Analysis 2005, 46:262282. Publisher Full Text

Gross AJ, Lam CF: Paired observations from a survival distribution.
Biometrics 1981, 37:505511. Publisher Full Text

Marshall AW, Olkin I: Families of multivariate distributions.
Journal of the American Statistical Association 1988, 83:834841. Publisher Full Text

Fisher LD, van Belle G: Biostatistics A Methodology for the Health Sciences. John Wiley & Sons, Inc; 1993:410411.

Schweizer B: Thirty years of copulas. In Advances in Probability Distributions with Given Marginals. Edited by Dall'Aglio G, Kotz S, Salinetti G. Kluwer Academic Publishers; 1991:1350.

Financial Risk Management [http://www.riskglossary.com/papers/Copula.zip] webcite

Lin LI: A concordance correlation coefficient to evaluate reproducibility.
Biometrics 1989, 45:255268. PubMed Abstract  Publisher Full Text

Gumbel EJ: Bivariate exponential distributions.
Journal of the American Statistical Association 1960, 55:698707. Publisher Full Text

Gumbel EJ: Distributions des valeurs extremes en plusiers dimensions.

Clayton DG: A model for association in bivariate life tables and its applications in epidemiological studies of familial tendency in chronic disease incidence.
Biometrika 1978, 65:141151. Publisher Full Text

Frank MJ: On the simultaneous associativity of F(x, y) and x + y F(x, y).
Aequationes Math 1979, 19:194226. Publisher Full Text
Prepublication history
The prepublication history for this paper can be accessed here: