Abstract
Background
During the last 30 years, the median sample size of research studies published in high-impact medical journals has increased manyfold, while the use of nonparametric tests has increased at the expense of t-tests. This paper explores this paradoxical practice and illustrates its consequences.
Methods
A simulation study is used to compare the rejection rates of the Wilcoxon-Mann-Whitney (WMW) test and the two-sample t-test for increasing sample size. Samples are drawn from skewed distributions with equal means and medians but with a small difference in spread. A hypothetical case study is used for illustration and motivation.
Results
The WMW test produces, on average, smaller p-values than the t-test. This discrepancy increases with increasing sample size, skewness, and difference in spread. For heavily skewed data, the proportion of p<0.05 with the WMW test can be greater than 90% if the standard deviations differ by 10% and the number of observations is 1000 in each group. The high rejection rates of the WMW test should be interpreted as the power to detect that the probability that a random sample from one of the distributions is less than a random sample from the other distribution is greater than 50%.
Conclusions
Nonparametric tests are most useful for small studies. Using nonparametric tests in large studies may provide answers to the wrong question, thus confusing readers. For studies with a large sample size, t-tests and their corresponding confidence intervals can and should be used even for heavily skewed data.
Keywords:
t-test; Nonparametric test; Wilcoxon-Mann-Whitney test; Welch test; Sample size; Statistical practice

Background
In an article published in the New England Journal of Medicine (NEJM) in 2005, Horton and Switzer review the use of statistical methods in three volumes of the NEJM in 2004 and 2005 [1]. They divide the methods into 25 categories, sorted according to increasing complexity, and list the frequencies in each category. Also included are the results from previous surveys of articles published in the same journal in 1978–1979 and in 1989 [2]. Table 1 presents the proportions of articles that contained t-tests and nonparametric tests. At all three time points, t-tests or nonparametric tests or both were used in more than half of the articles. In 1978–1979, four t-tests were used for every nonparametric test. In 2004–2005, t-tests and nonparametric tests were used with equal frequency.
Table 1. Trends in the use of t-tests and nonparametric tests in the NEJM
Let us compare this trend in the use of simple statistical methods with another development. Martin Bland [3] considers the median sample size of research reports published in the Lancet and the BMJ that used individual subject data. In September 1972, the median sample sizes were 33 and 37; in September 2007, they were 3116 and 3104. Thus, during a time span similar to that of Table 1, the sample size increased almost 100-fold.
If we assume that the NEJM is similar to the Lancet and the BMJ as regards statistical methods and sample size, research authors who publish in these high-impact medical journals have increased their use of nonparametric tests at the expense of t-tests as their studies have grown in size.
This, to me, is counterintuitive.
t-tests are parametric tests, which assume that the underlying distribution of the variable of interest is normal. Consider the two-sample t-test. It is fairly robust to deviations from normality [4], and, by the central limit theorem, increasingly so as the sample size increases. When the sample size of a study is 200, the t-test is robust even to heavily skewed distributions [5].
Nonparametric tests, as defined in Table 1, have, broadly speaking, two applications. First, as simple methods to analyze ordinal data, such as degree of pain classified as none, mild, moderate, or severe. Second, as alternatives to parametric tests, most often used when there is evidence of non-normality. This latter practice is advocated in many basic textbooks [6-9].
In their capacity as alternatives to t-tests, nonparametric tests are therefore most useful when the sample size is small. One would then expect to observe an increase in the ratio of t-tests to nonparametric tests as studies grow in size. Instead, the opposite has occurred. The purpose of this paper is to illustrate the consequences of uncritical use of nonparametric tests in large studies and to discuss some possible explanations for this practice.
Methods
Suppose that we want to compare the means or medians of a continuous variable in two independent groups. Two tests are often used for this problem: the (two-sample) t-test and the Wilcoxon-Mann-Whitney (WMW) rank sum test. The t-test is a test for the hypothesis of equal means, whereas the WMW test is less specific. If the underlying distributions of the variable in the two groups differ only in location, i.e. in means and medians, the WMW test is a test for the hypothesis of equal medians. In all other situations, the null hypothesis of the WMW test is Prob(X<Y)=0.5, where X and Y are random samples from the two distributions. Interpretation of a small p-value in this case is not always straightforward.
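The null hypothesis Prob(X<Y)=0.5 can be made concrete with a small Monte Carlo sketch. The gamma parameters below are arbitrary illustrations, not the paper's simulation settings; the point is only that identical distributions give a probability close to 0.5.

```python
import random

# Monte Carlo estimate of Prob(X < Y), the quantity the WMW test
# actually addresses. Illustrative sketch with arbitrary gamma
# parameters (not taken from the paper).
random.seed(1)
n = 100_000
x = [random.gammavariate(2.0, 1.0) for _ in range(n)]
y = [random.gammavariate(2.0, 1.0) for _ in range(n)]  # same distribution as x
p_less = sum(xi < yi for xi, yi in zip(x, y)) / n
print(round(p_less, 3))  # close to 0.5 when the two distributions are identical
```

When the two distributions differ in shape or spread, this probability drifts away from 0.5 even if the means and medians stay equal, which is exactly what the WMW test then detects.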
A difference in means or medians is usually accompanied by a difference in spread [10,11]. The WMW test is sensitive to distribution differences besides location [11] and may give a small p-value based on differences in spread even when the means and medians are equal.
A simulation study was carried out to compare the rejection rates of the t and WMW tests for increasing sample size. Because of its superior properties [5], the t-test adjusted for unequal variances (often called the Welch U test) was used; it is hereafter referred to simply as the t-test. The Brunner-Munzel test, a nonparametric test that adjusts for unequal variances, could serve as an alternative to the WMW test; however, it is not widely available in software packages, performs similarly to the WMW test [11], and is therefore not included in the simulation study. The data were drawn at random from skewed gamma and lognormal distributions. The amount of skewness varied, in four steps, from small (coefficient of skewness = 1.0) to considerable (skewness = 4.0) and was always equal in both distributions, as were the means and medians. The only difference between the two distributions was in their standard deviations, which differed, in eight steps, from 5% (ratio of 1.05) to 50% (ratio of 1.50). The nominal significance level was 5%, and 10 000 replications were used.
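The design can be sketched in a few dozen lines of code. This is not the paper's code or parameterization: as a simplified stand-in for its matched distributions, the sketch below uses shifted lognormal distributions constructed to share a fixed mean and median while the shape parameter sigma (and hence the spread) varies, and both p-values rely on large-sample normal approximations.

```python
import math
import random

def norm_sf(z):
    """Upper-tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def welch_p(x, y):
    """Two-sided p-value of the t-test adjusted for unequal variances
    (Welch), with a normal approximation to the reference distribution
    (adequate for large samples)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    t = (mx - my) / math.sqrt(vx / nx + vy / ny)
    return 2.0 * norm_sf(abs(t))

def wmw_p(x, y):
    """Two-sided p-value of the WMW test via the large-sample normal
    approximation (no tie correction; ties are negligible for
    continuous data)."""
    nx, ny = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = sum(i + 1 for i, (_, g) in enumerate(pooled) if g == 0)
    u = rank_sum_x - nx * (nx + 1) / 2.0
    mean_u = nx * ny / 2.0
    sd_u = math.sqrt(nx * ny * (nx + ny + 1) / 12.0)
    return 2.0 * norm_sf(abs(u - mean_u) / sd_u)

def shifted_lognormal(sigma, median=1.0, mean=2.0):
    """Sampler for a shifted lognormal distribution with the given
    median and mean, so that only the shape and spread vary with
    sigma (a stand-in for the paper's parameterization)."""
    mu = math.log((mean - median) / (math.exp(sigma ** 2 / 2.0) - 1.0))
    shift = median - math.exp(mu)
    return lambda: random.lognormvariate(mu, sigma) + shift

random.seed(2)
draw_x = shifted_lognormal(0.8)  # illustrative sigma values; the two
draw_y = shifted_lognormal(1.2)  # distributions share mean and median
n, reps = 200, 300               # far fewer replications than the paper
reject_t = reject_w = 0
for _ in range(reps):
    x = [draw_x() for _ in range(n)]
    y = [draw_y() for _ in range(n)]
    reject_t += welch_p(x, y) < 0.05
    reject_w += wmw_p(x, y) < 0.05
print(f"t-test rejection rate:   {reject_t / reps:.2f}")
print(f"WMW test rejection rate: {reject_w / reps:.2f}")
```

Because the means are equal by construction, the t-test rejection rate should stay near the 5% nominal level, while the WMW rejection rate reflects whatever distributional difference remains.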
Table 2 gives the true Prob(X<Y) for each scenario in the simulation study. Since the null hypothesis of the WMW test is Prob(X<Y)=0.5, we expect the rejection rates of the WMW test to exceed the nominal significance level whenever Prob(X<Y)>0.5. That is, the rejection rates of the WMW test represent the power to detect Prob(X<Y)≠0.5.
Table 2. The true Prob(X<Y) for each scenario in the simulation study
Results
Case study
Consider Figure 1, which is a plot of the probability density functions of two gamma (left panel) and two lognormal (right panel) distributions. The coefficient of skewness is 3.0 for all distributions, and the means and medians of the two distributions in each panel are equal. The standard deviations of the distributions corresponding to the solid lines (X) are 10% greater than those of the distributions corresponding to the dotted lines (Y). That difference is almost imperceptible for the two gamma distributions.
Figure 1. Probability density functions (pdf) of two gamma (left panel) and two lognormal (right panel) distributions. The two distributions in each panel are equal, except that the standard deviation of X is 10% greater than that of Y.
Suppose that we draw, at random, 1000 values from each of these four distributions. The results might look like those in Figure 2. Since, in an actual study, we obviously do not know the exact distributions from which the observed data originate, it is histograms such as these that give us a clue about the underlying distributions of the data. The data in Figure 2 are markedly skewed to the right, and we may be tempted to use the WMW test instead of the t-test to compare the locations of X and Y.
If we repeatedly draw samples of size 1000 from the distributions in Figure 1, we can apply the t and WMW tests to the samples each time and record the results. After 10 000 replications, the 5% rejection rates (the proportion of times p<0.05) were 5.1% (gamma distributions) and 4.9% (lognormal distributions) for the t-test. The expected rejection rate for an unbiased test of means or medians is 5.0%, that is, a one in 20 chance of a significant result when the means (and medians) are known to be equal. The t-test thus performs quite well. The rejection rates for the WMW test are 99% (gamma distributions) and 28% (lognormal distributions). The WMW test indicates a significant difference between the groups much more often than the expected 5%. The explanation is that the two distributions are slightly different: their means and medians are equal, but their standard deviations differ by 10%. The WMW test is sensitive to this difference and produces a small p-value. But if we are interested in comparing the means or the medians, as is customary, the WMW test most likely gives us an answer to the wrong question. The correct question for the WMW test can be formulated as: is a random sample from one of the distributions likely to be less than a random sample from the other distribution? The skewness and standard deviation ratio of the two distributions in Figure 1 are 3.0 and 1.10, respectively. From Table 2, we obtain the actual Prob(X<Y), which is 56% for the gamma distributions and 52% for the lognormal distributions. The high rejection rates of the WMW test (99% and 28%) represent the power of the WMW test to detect that those probabilities are different from 50%.
If we repeat the above exercise for a range of sample size values, we can plot the rejection rate against the number of subjects in each group (Figure 3). The rejection rates of the WMW test increase as the sample size increases, whereas the rejection rates of the t-test are stable at about 5%.
Overall results from the simulation study
The patterns of rejection rates in Figure 3 persist for all combinations of skewness and standard deviation ratios considered in this study. The rejection rates of the t-test are always close to 5%, whereas the rejection rates of the WMW test increase with increasing sample size. As expected, the rejection rates of the WMW test increase when the difference in standard deviations increases, because it is this difference that the WMW test picks up. Interestingly, the rejection rates of the WMW test also increase when the amount of skewness increases. The problem is thus greater for situations in which one would more readily abandon the t-test (considerably skewed data) than for situations where the amount of skewness may be considered manageable (slightly skewed data). An example of the increasing rejection rates of the WMW test for increasing standard deviation ratios and increasing amounts of skewness can be seen in Table 3.
Table 3. Rejection rates (%) of the t and WMW tests for data drawn from gamma distributions using 1000 subjects in each group
Detailed results for each of the 448 situations considered in the simulation study are given in Additional file 1. Tables 4 and 5 present summaries of the results. In Table 4, the average per cent rejection rates of the t and WMW tests are given, stratified by study size. Each value in the table is the mean of the rejection rates over the 32 combinations of amount of skewness and standard deviation ratio.
Additional file 1. Supplementary materials. Detailed results from the simulation study.
Table 4. Mean rejection rates (%) of the t and WMW tests, averaged over 32 combinations of amount of skewness and standard deviation ratios
Table 5. Estimated probabilities (%) that the p-value of the WMW test is smaller than that of the t-test, averaged over 32 combinations of amount of skewness and standard deviation ratios
Table 5 presents the estimated probability that the p-value of the WMW test is smaller than that of the t-test. For large studies with data distributed as in this simulation study, the WMW test almost always produces smaller p-values than the t-test.
Discussion
The concurrent increases, since the 1970s, in sample size and in the use of nonparametric tests over t-tests have a paradoxical quality. The usefulness of nonparametric tests as alternatives to t-tests for non-normally distributed data is most pronounced for small studies. When the sample size increases, so does the robustness of the t-test to deviations from normality. The nonparametric WMW test, on the other hand, becomes increasingly sensitive to distribution differences other than differences in means and medians, and it may detect (i.e. produce a small p-value for) slight differences in spread. When the difference in spread increases, the probability that a random sample from one of the distributions is less than a random sample from the other distribution also increases. With a large sample size, the WMW test has great power to detect that that probability is not 50%. If the purpose of the study is to detect any distributional difference, using a nonparametric test is probably useful. Most studies, however, are carried out to investigate differences in means or medians, and as such, the ratio of nonparametric tests to t-tests ought to decrease when studies grow in size.
Why, then, has the use of nonparametric tests increased? Several explanations may be proposed. Perhaps nonparametric tests were underused earlier, and the present ratio of t-tests to nonparametric tests represents the "correct" one. If so, only the smallest of contemporary studies ought to use nonparametric tests. However, in the NEJM in 2004–2005, 27% of the studies used nonparametric tests [1], and the 25th percentiles of the sample sizes in September 2007 in the Lancet and the BMJ were 1236 and 236 [3]. The smallest quartile of studies thus contains many quite large studies, so the use of nonparametric tests is not confined to appropriately small studies. Another explanation might be that most studies do not use nonparametric tests as alternatives to t-tests but rather to analyze ordinal variables, which is a highly reasonable practice. We do not have any systematic evidence to support or reject that hypothesis, although a cursory review of articles published in the NEJM, Lancet, JAMA, and BMJ from September through November 2011 revealed several large studies that used nonparametric tests as alternatives to t-tests; for example, n=1721 [12], n=429 [13], n=107018 [14], n=44350 [15], n=1789 [16], and n=12745 [17]. The use of nonparametric tests as alternatives to t-tests may be more common in high-impact journals [18]. The NEJM, for instance, in its instructions for authors, recommends that "nonparametric methods should be used to compare groups when the distribution of the dependent variable is not normal" ( , accessed March 19, 2012). That recommendation does not take the sample size into account and may needlessly force authors of large studies to use nonparametric methods. Four more explanations can be hypothesized. First, medical research authors may use a test for normality to decide whether to use a t-test or a nonparametric test. We strongly advise against that practice.
In large studies, tests for normality are very sensitive to deviations from normality and are thereby unsuitable as tools for choosing the most appropriate test. Second, regardless of the size of their studies, authors may rely on recommendations and advice intended solely for the analysis of smaller studies. There might be a lack of critical thinking about such recommendations and a poor understanding of the practical implications of the central limit theorem. Third, authors may simply prefer small p-values and may go shopping for the statistical method that gives them the smallest p. In the simulation study in this paper, the WMW test produced smaller p-values than the t-test more than 70% of the time when the number of subjects in each group was 250. For 1000 subjects in each group, that proportion increased to more than 80%. Last, there is publication bias. A study with a significant p-value from the WMW test may be more readily accepted for publication than a study with a non-significant p-value from the t-test.
Is the WMW test a bad test? No, but it is not always an appropriate alternative to the t-test. The WMW test is most useful for the analysis of ordinal data and may also be used in smaller studies, under certain conditions, to compare means or medians [5,11]. Furthermore, if the results from the WMW test are interpreted strictly according to the test's null hypothesis, Prob(X<Y)=0.5, the WMW test is an efficient and useful test. For large studies, however, where the purpose is to compare the means of continuous variables, the choice of test is easy: the t-test is robust even to severely skewed data and should be used almost exclusively.
One further benefit of using the t-test is that it facilitates interval estimation. The t-test and its corresponding confidence interval are based on the same standard error estimate; when the t-test is robust, so is the confidence interval. Combined with linear regression analysis, the t-test and its confidence interval form a simple and unified approach for analyzing and presenting continuous outcome data, which, for large studies, is sufficient for most practical purposes.
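A minimal sketch of such an interval estimate, built from the same standard error as the Welch t statistic. The normal critical value 1.96 is used instead of a t quantile, which is adequate at these sample sizes, and the gamma parameters are arbitrary illustrations, not data from the paper.

```python
import math
import random
import statistics

def welch_ci(x, y, z=1.96):
    """Approximate 95% confidence interval for the difference in means,
    using the same standard error as the Welch t-test (normal critical
    value; adequate for large samples)."""
    diff = statistics.fmean(x) - statistics.fmean(y)
    se = math.sqrt(statistics.variance(x) / len(x)
                   + statistics.variance(y) / len(y))
    return diff - z * se, diff + z * se

# Illustrative skewed data: gamma draws with population means 2 and 3
# (arbitrary parameters, not taken from the paper).
random.seed(3)
x = [random.gammavariate(2.0, 1.0) for _ in range(1000)]
y = [random.gammavariate(2.0, 1.5) for _ in range(1000)]
lo, hi = welch_ci(x, y)
print(f"95% CI for the mean difference: ({lo:.2f}, {hi:.2f})")
```

The interval estimates the difference in means directly, in the units of the outcome, which is what makes the t-test approach easy to present alongside regression results.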
This study has considered only smooth, skewed distributions. Medical variables do not always have a smooth distribution and may include outliers. The problem with outliers is not that the t-test fails as a test of equality of means in their presence, but that the mean itself may be a poor representation of the typical value of the distribution. One solution is to use another measure of location, for instance the trimmed mean, which may be compared between two groups with the Yuen-Welch test [5]. The problem that the mean does not reflect the central tendency of a distribution is most pronounced in small studies, where the impact of outliers is usually greater than in large studies.
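As a small sketch of why the trimmed mean helps with outliers (the Yuen-Welch test itself, which compares trimmed means using winsorized variances, is not implemented here; the data values are made up for illustration):

```python
import statistics

def trimmed_mean(values, prop=0.2):
    """Mean after discarding the lowest and highest `prop` fraction
    of the observations (20% trimming is a common choice)."""
    s = sorted(values)
    k = int(len(s) * prop)
    core = s[k:len(s) - k]
    return sum(core) / len(core)

data = [3, 4, 4, 5, 5, 5, 6, 6, 7, 120]   # one extreme outlier
print(round(statistics.fmean(data), 1))   # 16.5: dragged up by the outlier
print(round(trimmed_mean(data), 1))       # 5.2: closer to the bulk of the data
```

The ordinary mean is pulled far from the typical value by a single observation, whereas the trimmed mean stays near the center of the bulk of the data.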
Conclusions
The use of nonparametric tests in high-impact medical journals has increased at the expense of t-tests, while the sample size of research studies has increased manyfold. Recent examples of large studies that use nonparametric tests as alternatives to t-tests are abundant.
Nonparametric tests are most useful for small studies. Research authors who use nonparametric tests in large studies may provide answers to the wrong question, thus confusing readers. For large studies, t-tests and their corresponding confidence intervals can and should be used even for heavily skewed data.
Competing interests
The author declares no competing interests.
Acknowledgements
The author thanks the editor and the reviewers for their thoughtful and constructive comments and suggestions.
References

1. Horton NJ, Switzer SS: Statistical methods in the journal. New Engl J Med 2005, 353(18):1977-1979.
2. Emerson JD, Colditz GA: Use of statistical analysis in the New England Journal of Medicine. New Engl J Med 1983, 309(12):709-713.
3. Bland MJ: The tyranny of power: is there a better way to calculate sample size? BMJ 2009, 339:b3985.
4. Skovlund E, Fenstad GU: Should we always choose a nonparametric test when comparing two apparently nonnormal distributions? J Clin Epidemiol 2001, 54:86-92.
5. Fagerland MW, Sandvik L: Performance of five two-sample location tests for skewed distributions with unequal variances. Contemp Clin Trials 2009, 30:490-496.
6. Altman DG: Practical Statistics for Medical Research. Boca Raton, FL: Chapman & Hall/CRC; 1991.
7. Altman DG, Machin D, Bryant TN, Gardner MJ (eds): Statistics with Confidence (2nd edn). London: BMJ Books; 2000.
8. Bland M: An Introduction to Medical Statistics (3rd edn). Oxford: Oxford University Press; 2000.
9. Kirkwood BR, Sterne JAC: Essential Medical Statistics (2nd edn). Malden, MA: Blackwell Science; 2003.
10. Hart A: Mann-Whitney test is not just a test of medians: differences in spread can be important. BMJ 2001, 323:391-393.
11. Fagerland MW, Sandvik L: The Wilcoxon-Mann-Whitney test under scrutiny. Stat Med 2009, 28:1487-1497.
12. Kastrati A, Neumann FJ, Schulz S, Massberg S, Byrne RA, Ferenc M, et al.: Abciximab and heparin versus bivalirudin for non-ST-elevation myocardial infarction. New Engl J Med 2011, 365:1980-1989.
13. Karim SSA, Naidoo K, Grobler A, Padayatchi N, Baxter C, Gray AL, et al.: Integration of antiretroviral therapy with tuberculosis treatment. New Engl J Med 2011, 365:1492-1501.
14. Rao SV, Kaltenbach LA, Weintraub WS, Row MT, Brindis RG, Rumsfield JS, et al.: Prevalence and outcomes of same-day discharge after elective percutaneous coronary intervention among older patients. JAMA 2011, 306(13):1461-1467.
15. Ferlitsch M, Reinhart K, Pramhas S, Wiener C, Gal O, Bannert C, et al.: Sex-specific prevalence of adenomas, advanced adenomas, and colorectal cancer in individuals undergoing screening colonoscopy. JAMA 2011, 306(12):1352-1358.
16. Parodi G, Marucci R, Valenti R, Gori AM, Migliorini A, Giusti B, et al.: High residual platelet reactivity after clopidogrel loading and long-term cardiovascular events among patients with acute coronary syndromes undergoing PCI. JAMA 2011, 306(11):1215-1223.
17. Christoffersen M, Frikke-Schmidt R, Schnohr P, Jensen GB, Nordestgaard BG, Tybjærg-Hansen A: Xanthelasmata, arcus corneae, and ischaemic vascular disease and death in general population: prospective cohort study. BMJ 2011, 343:d5497.
18. Kühnast C, Neuhäuser M: A note on the use of the nonparametric Wilcoxon-Mann-Whitney test in the analysis of medical studies.