Cambridge Computational Biology Institute, Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, CB3 0WA, UK

Oxford Centre for Collaborative Applied Mathematics, Mathematical Institute, University of Oxford, Oxford, OX1 3LB, UK

Current address: F. Hoffmann-La Roche AG, In Silico Sciences - Statistics, 4070 Basel, Switzerland

Abstract

Background

Pseudoreplication occurs when observations are not statistically independent but are treated as if they are. This can occur when there are multiple observations on the same subjects, when samples are nested or hierarchically organised, or when measurements are correlated in time or space. Analysis of such data without taking these dependencies into account can lead to meaningless results, and examples can easily be found in the neuroscience literature.

Results

A single issue of Nature Neuroscience provided a number of examples and is used as a case study to highlight how pseudoreplication arises in neuroscientific studies, to explain why the analyses in these papers are incorrect, and to suggest appropriate analytical methods. Twelve percent of papers had pseudoreplication and a further 36% were suspected of having it, but this could not be determined for certain because insufficient information was provided.

Conclusions

Pseudoreplication can undermine the conclusions of a statistical analysis, and it would be easier to detect if the sample size, degrees of freedom, the test statistic, and precise

Background

The majority of neuroscience experiments include some type of inferential statistical analysis, where conclusions are reached based on the distance of the observed results from some hypothetical expected value. Discovering how the brain and nervous system work requires the proper application of statistical methods, and inappropriate analyses can lead to incorrect inferences, which in turn lead to wasted resources, biases in the literature, fruitless explorations of non-existent phenomena, distraction from more important questions, and perhaps worst of all, ineffectual therapies that are advanced to clinical trials $t_{28} = 2.1$; $n_1 + n_2 - 2$, where $n_1$ and $n_2$ are the number of independent samples in each group) associated with this statistical test. The concept of degrees of freedom is perhaps not the most intuitive statistical idea, but it can be thought of as the number of independent data points that can be used to estimate population parameters (e.g. means, differences between means, variances, slopes, intercepts, etc.), and whenever something is estimated from the data, a degree of freedom is lost. Therefore the total $n_1 = n_2 = 15$, and so 15 + 15 - 2 = 28). Incidentally, the correct analysis with a

**An example of pseudoreplication**. Two rats are sampled from a population with a mean (

The assumption of independence means that observations within each group or treatment combination are independent of each other. An alternative way of expressing this concept is to say that the errors (residual values) are independent, once the effects of all the other explanatory variables have been taken into account. In addition, other variables that are not included in the analysis (e.g. the order in which the samples were obtained) must not influence the outcome or be correlated with the residuals. The remainder of the introduction will define some commonly used terms, illustrate why pseudoreplication is problematic, and finally, discuss the four situations in which it can arise.

The terms

Pseudoreplication leads to the wrong hypothesis being tested and false precision

Ignoring lack of independence leads to two major problems. The first is that the statistical analysis does not test the research hypothesis that the scientist intends; in other words, the wrong hypothesis is being tested. This is illustrated in Figure $H_0$: $H_0$: $t_{19} = -7.75$, a $p$-value on the order of $10^{-7}$, and a 95% confidence interval (CI) from 32.9 to 40.2. The correct analysis would give $t_1 = -2.07$,

The second problem that arises is that correlations between observations can lead to calculated

where $y_{ij}$ are the values of the response, $\alpha_i$ is the amount by which the mean of each rat is above or below the grand mean, and $\varepsilon_{ij}$ are the residuals, which are the distances of each of the 20 values from the mean of their respective rat. The intraclass correlation can then be calculated as

where

A detailed analysis by Scariano and Davenport showed that both the Type I (false positive) and Type II (false negative) error probabilities can be affected by within group correlations
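The effect of within-group correlation on the false positive rate can be illustrated with a small simulation. The sketch below is not taken from the analyses in this paper. It assumes two groups of five animals with ten observations each, no true difference between groups, equal between-animal and within-animal standard deviations (an intraclass correlation of 0.5), and a nominal significance level of 0.05; all names and settings are hypothetical.

```python
import random
import statistics

N_ANIMALS = 5   # animals per group (the independent units)
N_OBS = 10      # observations per animal
# Two-sided 5% critical values of the t distribution for the two analyses.
T_CRIT_ALL_OBS = 1.984   # df = 2 * 50 - 2 = 98
T_CRIT_MEANS = 2.306     # df = 2 * 5 - 2 = 8


def two_sample_t(a, b):
    """Pooled-variance t statistic for two independent samples."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    se = (pooled_var * (1 / na + 1 / nb)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se


def simulate_group(rng, sd_between=1.0, sd_within=1.0):
    """One list of observations per animal; the true group mean is zero."""
    group = []
    for _ in range(N_ANIMALS):
        animal_mean = rng.gauss(0.0, sd_between)
        group.append([rng.gauss(animal_mean, sd_within) for _ in range(N_OBS)])
    return group


def false_positive_rates(n_sims=2000, seed=1):
    """Share of null experiments declared significant by each analysis."""
    rng = random.Random(seed)
    hits_all_obs = hits_means = 0
    for _ in range(n_sims):
        g1, g2 = simulate_group(rng), simulate_group(rng)
        # Pseudoreplicated analysis: every observation treated as independent.
        t_all = two_sample_t([y for animal in g1 for y in animal],
                             [y for animal in g2 for y in animal])
        # Analysis on the experimental units: one mean per animal.
        t_means = two_sample_t([statistics.mean(animal) for animal in g1],
                               [statistics.mean(animal) for animal in g2])
        hits_all_obs += abs(t_all) > T_CRIT_ALL_OBS
        hits_means += abs(t_means) > T_CRIT_MEANS
    return hits_all_obs / n_sims, hits_means / n_sims


rate_all_obs, rate_means = false_positive_rates()
print(f"all observations: {rate_all_obs:.3f}, animal means: {rate_means:.3f}")
```

Under these assumptions the analysis of all observations should reject the null hypothesis far more often than the nominal 5%, whereas the analysis of animal means should stay close to it.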

Four situations in which pseudoreplication can arise.

| **Situation** | **Example** | **Solutions** |
|---|---|---|
| Repeated measures | Growth curve | 1. Include subject as a random effect; 2. Repeated measures ANOVA; 3. Summary-measure analysis |
| Hierarchical/nested | Multiple brain sections; multiple coverslips/wells; litter effects | 1. Include random effects; 2. Average over observations |
| Correlated in time | Time of day testing occurs; circadian effects | 1. Include time as covariate; 2. Include sample number as a covariate |
| Correlated in space | Multiple incubators; cage effects | 1. Include random effects; 2. Average over observations |

Repeated measurements on the same experimental unit

A common situation is when observations are taken at different times or under different experimental conditions on the same subjects, and this is usually a planned part of the experimental design. Data of this type are typically analysed with a paired-samples

Data with a hierarchical structure

A second common design where pseudoreplication can occur is when data are hierarchically organised. Biological data are often sampled at different spatial scales or levels of biological organisation. For example, several brains may be sliced into sections, and a number of regions on a section may be examined histologically (or maybe just the left and right side of the brain), and perhaps only a certain number of cells within each region would be examined. Thus there is a hierarchy, with the whole brain (animal) at the top, sections within a brain, regions within a section, and cells within a region (see reference

where

Another common case of hierarchically structured data is when multiple animals are born in a litter. Animals within a litter are not independent because they share the same parents and the same prenatal and early postnatal environment, and animals are therefore nested within litters

Other examples include applying treatments to cages of rats rather than individual rats (e.g. administering a substance in the drinking water), or applying treatments to pregnant females but examining the effect in the offspring. Here, cages and pregnant females are the experimental units, and not the individual animals, since the treatments can only be applied to whole cages and pregnant females and not to the individual animals. This type of experimental design is often referred to as a split-plot design and is characterised by the restrictions on randomisation; it needs to be distinguished from a design where individual rats can be randomised to different conditions. In addition, cells in the same flask or well of a cell-culture experiment are not independent; they will tend to be more similar than cells in different flasks or wells and will be subject to the same uncontrolled effects.

Observations correlated in space

Observations may be correlated in space because multiple measurements taken at one location will all be affected by the idiosyncratic aspects of that location. For example, 96-well plates often contain small amounts of fluid, and wells near the edges of the plate may evaporate faster than wells in the centre, and thus alter the concentration of substances such as metabolites, secreted hormones, etc. Placing the control samples in the first column of the plate and the treated samples in the second column would therefore not be a good idea. This is also the reason why microarrays have replicate probes for the same gene scattered throughout the array and not placed beside each other, as this accounts for any spatial effects in the quality of the array that may have arisen during manufacturing or handling. Spatial dependence may also arise in incubators for culturing cells. A large cell culture experiment may use two incubators, but differences particular to each incubator may affect the outcome variable. For example, the temperature and humidity levels may be different, or these variables may fluctuate more in one incubator than another, perhaps because one may be used more and thus the door is opened more often as people access their samples. Good experimental design would dictate that the treated samples are not placed in one incubator while the control samples are in the other, as it would be impossible to separate the effect of the treatment from the effect of the incubator.
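One way to avoid confounding treatment with plate position is to randomise the allocation of samples to wells. The following is a minimal sketch, assuming a standard 8 × 12 plate and treating each row as a block that receives equal numbers of control and treated wells; the function name and layout are hypothetical and other blocking schemes may suit a given experiment better.

```python
import random


def randomised_plate_layout(treatments, n_rows=8, n_cols=12, seed=None):
    """Assign treatments to wells, shuffling independently within each row.

    Each row holds the same number of wells per treatment, so no treatment
    is systematically tied to a particular column of the plate.
    """
    if n_cols % len(treatments) != 0:
        raise ValueError("n_cols must be a multiple of the number of treatments")
    rng = random.Random(seed)
    layout = []
    for _ in range(n_rows):
        row = list(treatments) * (n_cols // len(treatments))
        rng.shuffle(row)
        layout.append(row)
    return layout


layout = randomised_plate_layout(["control", "treated"], seed=42)
for row_label, row in zip("ABCDEFGH", layout):
    # Print one letter per well: C = control, T = treated.
    print(row_label, " ".join(name[0].upper() for name in row))
```

The same idea applies to incubators: samples from each condition would be split across both incubators rather than assigning one condition to each.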

Observations correlated in time

Unlike repeated measurements on the same samples, observations that are correlated in time are often not a planned feature of the experimental design, but arise from the sampling protocol, the phenomenon under investigation, or the way in which the experiment is conducted. In addition, observations need not be on the same subject. For example, rats have a circadian rhythm in the stress hormone corticosterone, which peaks at the beginning of the dark (active) phase, and gradually decreases throughout the night

Methods

The proportion of papers that had pseudoreplication in a large number of journals was not quantified because the majority of papers do not provide sufficient information for this to be assessed

A single recent issue of

The simulated data in Figure

Results

Of the nineteen papers published in the August 2008 issue of

**Classification of manuscripts**. Summary of the papers in the August 2008 issue of

Manuscripts with pseudoreplication

Fiorillo et al. performed electrophysiological recordings from the brains of two macaque monkeys

In another paper, Sato et al. classified rod terminals in the retina as either bipolar or not, and examined whether the proportion of these two terminal types differed between control and pikachurin knockout mice (Figure four E in their paper) using a Chi-square test

Both manuscripts used hierarchical sampling but did not distinguish between the number of data points and the number of independent observations. These types of errors are not limited to this issue (e.g. see

Manuscripts with suspected errors

The following manuscripts possibly had pseudoreplication, but insufficient information was provided to determine this for certain. They are nevertheless discussed because they contain other types of analyses where pseudoreplication can arise.

Toni et al. examined how new dentate gyrus neurons integrate and form functional synapses with cells in the hilus and CA3 region of the hippocampus _{77 }= 10.50, _{156 }= 0.54,

Groc et al. examined the effect of corticosterone on AMPA receptor trafficking and synaptic potentiation

Pocock and Hobert examined the effects of oxygen levels on axon guidance and neuronal migration in

Using electrophysiological recordings, Chen et al. examined how the difficulty of a task affected the activity of neurons in the primary visual cortex of two monkeys

Serguera et al. examined how dopamine in the olfactory bulb of female mice impairs the perception of social odours contained in male urine _{4 }= 3.4,

In ambiguous situations such as this, readers have to form some sort of judgement regarding the statistical competence of the authors. Based on other aspects of their data analysis, one may be reluctant to give them the benefit of the doubt. First, an $F_{(1,N-G)}$, where

Discussion

In a well-publicised study, Ioannidis concluded that most published research findings in the medical literature are false

The term pseudoreplication was coined by Hurlbert in 1984

Reporting guidelines

Medical studies involving human patients have detailed reporting guidelines such as the CONSORT statement

In addition to the above guidelines, further specific information should be provided in order to check whether analyses were carried out correctly. These include:

1. **Report the sample size and number of observations for each experiment.** The sample size (

2. **Report the value of the test statistic, degrees of freedom, and exact p-value.** These provide the necessary information to check whether the analyses were carried out correctly. They can also allow readers to understand the analysis better if the verbal description was ambiguous. If

3. **Error bars should correspond to the analysis.** Graphical measures of uncertainty such as confidence intervals and standard errors of the mean should be based on the number of independent samples, that is, the graphical representation of the data should correspond to the statistical analysis that was performed on them. If there are two groups of five animals, with multiple observations, the

Without this information it is not possible for peer-reviewers to adequately assess whether the statistical tests were carried out appropriately, and they must merely assume that the authors have performed the analyses correctly. This is not something that can be safely assumed, and a recent systematic survey identified a number of problems with the reporting, experimental design, and statistical analysis of studies using laboratory animals
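The point about error bars can be made concrete with a short calculation. The sketch below uses made-up values for one group of five animals with four observations each, and compares the standard error of the mean computed from all twenty observations with the one computed from the five animal means; the data and variable names are purely illustrative.

```python
import statistics


def sem(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return statistics.stdev(values) / len(values) ** 0.5


# Hypothetical data: five animals in one group, four observations per animal.
observations = {
    "animal_1": [10.1, 10.4, 9.8, 10.3],
    "animal_2": [12.2, 12.0, 12.5, 11.9],
    "animal_3": [8.9, 9.3, 9.0, 9.2],
    "animal_4": [11.0, 11.4, 10.8, 11.2],
    "animal_5": [13.1, 12.8, 13.3, 13.0],
}

all_values = [y for ys in observations.values() for y in ys]
animal_means = [statistics.mean(ys) for ys in observations.values()]

sem_all = sem(all_values)        # treats n as 20: too narrow
sem_animals = sem(animal_means)  # n = 5, matching the independent units

print(f"SEM from all observations: {sem_all:.2f}")
print(f"SEM from animal means:     {sem_animals:.2f}")
```

With data like these, where animals differ more from each other than observations do within an animal, error bars based on all observations are noticeably narrower than those based on the number of animals.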

Remedies for pseudoreplication

Pseudoreplication does not necessarily imply that the studies are flawed, and a reanalysis of the data may be all that is required. It may however become apparent that the sample size is too small to make any meaningful inferences about the parameters of interest. Pseudoreplication can be dealt with prior to analysis, for example by using only one mouse per litter for a particular experiment, thus eliminating any litter effects. Statistical methods for dealing with pseudoreplication are available and four such methods are discussed below and summarised in Table

Averaging dependent observations

In the opening example with ten rats undergoing rotarod testing on three consecutive days, the results from the three days can be averaged so that each rat contributes only one value to the analysis. This is particularly useful when there is no expected trend over the days of testing, or if there is, it is not relevant to the research question; for example, if three trials were simply used to get a better estimate of the rats' motor functioning. Similarly, in a hierarchical sampling design, one could average values from multiple neurons in a rat to obtain one value per rat that will then be carried forward for statistical analysis. Averaging has the advantage of simplicity, and common statistical tests can be applied (e.g.

Summary-measure analysis

Another alternative to using the mean of a number of dependent observations is to use some other relevant value which captures a feature of interest, such as the slope, intercept, or area under the curve
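As an illustration of averaging and of a summary-measure analysis, the sketch below reduces each rat's three daily values to a single mean and a single slope, and then compares groups with an independent-samples t statistic on one value per rat. The latencies are invented, and the group sizes (five rats per group) are an assumption made only for this example.

```python
import statistics

DAYS = [1, 2, 3]


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def two_sample_t(a, b):
    """Pooled-variance t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / df
    se = (pooled_var * (1 / na + 1 / nb)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se, df


# Hypothetical rotarod latencies (seconds): one row per rat, one value per day.
control = [[60, 65, 71], [55, 61, 64], [70, 72, 79], [62, 66, 70], [58, 65, 69]]
treated = [[50, 52, 55], [48, 49, 53], [57, 60, 60], [52, 55, 54], [45, 49, 50]]

# Each rat contributes exactly one value to each analysis.
control_means = [statistics.mean(rat) for rat in control]
treated_means = [statistics.mean(rat) for rat in treated]
control_slopes = [ols_slope(DAYS, rat) for rat in control]
treated_slopes = [ols_slope(DAYS, rat) for rat in treated]

t_mean, df_mean = two_sample_t(control_means, treated_means)
t_slope, df_slope = two_sample_t(control_slopes, treated_slopes)
print(f"means:  t({df_mean}) = {t_mean:.2f}")
print(f"slopes: t({df_slope}) = {t_slope:.2f}")
```

In both analyses the degrees of freedom reflect the number of rats, not the number of measurements.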

There is also an important point to be made when deciding whether to average over observations or to use slopes as a summary measure (or a mixed model), and it is based on whether the research question is (1) do subjects with high values of

Separate analyses

Another option is to conduct separate analyses on each of the three days, using an independent samples

Mixed models

As noted above, a repeated measures ANOVA is a common analysis for the rotarod example if there was interest in testing for differences across the three days; however, this method has been superseded by more recent methods with superior properties that are called random (or mixed) effects models, hierarchical models, multilevel models, or nested models (different disciplines use different names for the same method)

A key feature of these models is the distinction between fixed and random effects. Fixed effects are the familiar explanatory variables such as treatment, sex, condition, and dose, and are usually something that the experimenter is interested in testing directly. Fixed effects affect the mean of the outcome variable; for example, the effect of a treatment is to increase the value of the outcome variable compared to a control condition by a certain number of units. Random effects are less familiar and are usually something that the experimenter is not interested in directly (litter effects, cage effects, differences between incubators, differences between individual rats, etc.) but must be taken into account. A variable can be treated as being either fixed or random, but usually one is more appropriate, and the interpretation of the results is different. For example, if a researcher was interested in testing whether there are differences between cages on some outcome variable, twenty rats could be randomly assigned to four cages (5 rats per cage), labelled A-D. In this example there is no other experimental variable, only the cage that the rat is in. Treating cage as a fixed effect would lead to a one-way ANOVA with four levels. If significant differences are found between cages, then conclusions can only be made about these four cages, and not about other unobserved cages. If rats in cage C had particularly high values, there is no reason why in a subsequent experiment rats in a cage also labelled C would also have high values, rather than rats in cage B for example. There is nothing about the letter C on the front of the cage that affects the mean value of rats in that cage, or that can be used to predict the value of rats in other cages also labelled C; thus the cage labels are said to be uninformative. 
Contrast this with a true fixed effect such as dose of a drug; if the 50 mg/kg group had higher values than the 0 mg/kg control group, then one would also expect that the 50 mg/kg group would have the higher values in a subsequent experiment (rather than the 0 mg/kg group). If instead cage is treated as a random effect (the more appropriate analysis), then these four cages are treated as random samples from a population of cages, and inferences can be made about the effect of cages in general. A good discussion of the difference between fixed and random effects can be found in references
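To give a sense of what treating cage as a random effect involves, the sketch below estimates the between-cage and within-cage variance for the four-cage example. It uses the classical method-of-moments estimator for a balanced one-way layout, which is a simplification and not a full mixed model as fitted by dedicated software; the data values and function name are hypothetical.

```python
import statistics


def variance_components(groups):
    """Method-of-moments estimates for a balanced one-way random-effects layout.

    groups: one list of values per cage, all of equal length.
    Returns (between-cage variance, within-cage variance, intraclass correlation).
    """
    k = len(groups)
    m = len(groups[0])
    if any(len(g) != m for g in groups):
        raise ValueError("this sketch assumes the same number of rats per cage")
    cage_means = [statistics.mean(g) for g in groups]
    grand_mean = statistics.mean(cage_means)
    ms_between = m * sum((cm - grand_mean) ** 2 for cm in cage_means) / (k - 1)
    ms_within = sum((y - cm) ** 2
                    for g, cm in zip(groups, cage_means)
                    for y in g) / (k * (m - 1))
    # A negative moment estimate is truncated at zero.
    var_between = max(0.0, (ms_between - ms_within) / m)
    icc = var_between / (var_between + ms_within)
    return var_between, ms_within, icc


# Hypothetical outcome values for twenty rats housed in four cages (A to D).
cages = [
    [21.0, 23.5, 22.1, 24.0, 22.8],   # cage A
    [19.2, 20.1, 18.7, 21.0, 19.9],   # cage B
    [25.3, 26.0, 24.1, 27.2, 25.8],   # cage C
    [22.0, 21.4, 23.1, 22.6, 21.9],   # cage D
]
var_cage, var_rat, icc = variance_components(cages)
print(f"between-cage variance: {var_cage:.2f}")
print(f"within-cage variance:  {var_rat:.2f}")
print(f"intraclass correlation: {icc:.2f}")
```

The between-cage variance describes cages in general rather than the four cages that happened to be used, which is the sense in which cage is treated as a random sample from a population of cages.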

One important drawback of the repeated measures ANOVA is that the assumptions of compound symmetry and sphericity are rarely met. These terms refer to the correlation structure of the data. Returning to the rat rotarod example, the correlation of the outcome variable at each combination of time points can be calculated (day one vs. day two, day one vs. day three, and day two vs. day three). If these three correlations are all similar, and in addition the variances at each day are similar (homogeneity assumption), then the data are said to be compound symmetrical. If differences rather than correlations between each combination of time points are calculated, then the data are said to be spherical if these difference scores all have the same variance (see reference
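The quantities described in this paragraph are straightforward to compute. The sketch below, using invented values for six rats on three days, returns the variance at each day, the correlation for each pair of days, and the variance of the difference scores for each pair of days. It only computes descriptive quantities and does not perform a formal test of sphericity.

```python
import statistics
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def correlation_structure(data):
    """data: one list per subject, with one value per time point."""
    n_times = len(data[0])
    columns = [[subject[t] for subject in data] for t in range(n_times)]
    variances = [statistics.variance(col) for col in columns]
    correlations, diff_variances = {}, {}
    for i, j in combinations(range(n_times), 2):
        key = (i + 1, j + 1)  # label time points from 1, e.g. (1, 2)
        correlations[key] = pearson(columns[i], columns[j])
        diffs = [a - b for a, b in zip(columns[i], columns[j])]
        diff_variances[key] = statistics.variance(diffs)
    return variances, correlations, diff_variances


# Hypothetical rotarod latencies: one row per rat, columns are days 1 to 3.
rats = [
    [60, 65, 71], [55, 61, 64], [70, 72, 79],
    [62, 66, 70], [58, 65, 69], [66, 67, 75],
]
variances, correlations, diff_variances = correlation_structure(rats)
print("variance at each day:", [round(v, 1) for v in variances])
for pair in correlations:
    print(f"days {pair}: r = {correlations[pair]:.2f}, "
          f"variance of differences = {diff_variances[pair]:.1f}")
```

Similar correlations together with similar day-wise variances would be consistent with compound symmetry, and similar variances of the difference scores would be consistent with sphericity.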

Mixed models and their extensions (generalised and nonlinear mixed models) are the preferred methods for analysing the type of data discussed in this paper and are already being used to model litter effects

How to check reported values

When the necessary information is provided, it is easy to check whether pseudoreplication has been handled correctly. For this, one needs to know the degrees of freedom associated with common statistical tests, and these are provided in Table

Degrees of freedom associated with common statistical tests.

**Test**

**Degrees of Freedom**

**T-test**

Independent

$n_1 + n_2 - 2$

Paired

**One-way ANOVA**

**Two-way ANOVA**

Main effect of A

_{A -1}

Main effect of B

_{B -1}

A × B interaction

(_{A -1})(_{B -1})

Error

_{A}_{B}

**One-way RM-ANOVA**

Between subjects

Error

(

**Two-way Mixed ^{† }ANOVA**

Groups

Error

Obs

Group × Obs Interaction

(

Error

**Linear Regression**

1 and

**Chi-square**

(

_{1 }= sample size of group one; _{2 }= sample size of group two; _{A }= number of groups for Factor A; _{B }= number of groups for Factor B;

A similar procedure can be carried out for ANOVA
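The degrees of freedom for a few between-subject tests can be written as small functions and compared with the values reported in a paper. The sketch below uses the standard formulas for the independent and paired t-tests and the one-way ANOVA; the function names are hypothetical, and the final check is only a rough screen that applies to between-subject tests, not to the within-subject terms of a repeated measures design.

```python
def df_independent_t(n1, n2):
    """Independent-samples t-test: n1 + n2 - 2."""
    return n1 + n2 - 2


def df_paired_t(n_pairs):
    """Paired-samples t-test: number of pairs minus one."""
    return n_pairs - 1


def df_one_way_anova(n_total, n_groups):
    """One-way ANOVA: (numerator df, error df)."""
    return n_groups - 1, n_total - n_groups


def exceeds_independent_units(reported_error_df, n_independent_units):
    """Flag a between-subject test whose error df is too large.

    For the tests above the error df is always smaller than the number of
    independent units, so a reported df at or above that number suggests
    that data points rather than independent units were counted.
    """
    return reported_error_df >= n_independent_units


# Two groups of 15 independent samples, as in the opening example.
print(df_independent_t(15, 15))
# A t statistic reported with 19 df when only two animals were sampled.
print(exceeds_independent_units(reported_error_df=19, n_independent_units=2))
```

A flag from such a check does not prove that an analysis is wrong, but it indicates where the description of the sample size and the analysis should be read more closely.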

Conclusions

The problem of pseudoreplication has been recognised for many years in ecology and related areas

Statistical competence will not happen overnight, but stricter reporting requirements will make it easier to detect pseudoreplication, and this requires the sample size (

Continuing with the present situation suggests that statistical analysis is not really important, it's just something scientists go through to obtain

Author information

SEL's academic (BA, BSc, PhD) and research background are primarily in experimental neuroscience. SEL also holds a master's degree in Computational Biology and currently works as a Senior Research Statistician in the pharmaceutical industry.

Response to Lazic

by Christopher D. Fiorillo

Address: Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Korea

Lazic makes many useful points about the misuse of statistics. However, his article on pseudoreplication does not demonstrate a clear understanding of the goals and issues at stake in primate neurophysiology, nor does it clarify the importance of independence in statistical tests.

Lazic's criticisms of my paper

The statistical comparison that the author has questioned (in figures three c and four c) was designed to test whether there is a difference in neuronal firing rates (across the recorded population of neurons) between responses to juice reward depending on when juice was delivered following a conditioned stimulus. If one knows the subject matter and reads my paper for more than just statistical methods, then one has a strong prior expectation that the statistical comparisons made in figures three c and four c will be highly significant, and thus the statistical significance is not of much interest. The interesting point of the figures, as described in the main text, is that although significant, the differences are surprisingly small relative to the very large difference seen in comparing either of these responses to unpredicted reward. The statistical significance that was quantified in these figures was thus superfluous and could have been omitted entirely. There were no statistical tests performed on the important effect because it was so large as to be obvious (see mean ± sem firing rates in figures three b and four b). To paraphrase a colleague who is an excellent scientist but a reluctant statistician, it passed the "bloody ... obvious test."

Lazic also makes the point that "neurons of the same brain are not independent." If this statement does not signify confusion on the part of its author, it may nonetheless confuse readers. There is an important difference between physical or causal independence on the one hand, and logical or statistical independence on the other. Neurons in the same brain may be physically or causally related. However, that fact in itself does not necessarily require that the neurons are or are not statistically or logically independent. Statistical independence is all that matters with respect to statistical tests, and statistical dependencies could be present regardless of whether the neurons are in the same brain or different brains.

A second critical error that is often made is to confuse the functioning of a physical system with our knowledge of that system. Statistics and hypotheses are derived from the latter. "Statistical independence" means that we, the people performing the statistical test, do not have knowledge of how two pieces of data (such as the firing rates of two neurons) are related. We believe that neurons within a defined population, clustered together in the same region of the brain, are likely to have direct or indirect physical interactions with one another, and likewise, to display some sort of correlations in their firing rates. But at the outset of a typical study, we do not know what these correlations are, and thus it is rational for us to treat the data from each neuron as independent. By contrast, if we already knew neurons within clusters of known dimensions to be tightly coupled to one another through gap junctions, then we would probably try to avoid recording from neurons within the same cluster, and when we did obtain data from neurons in the same cluster, we might average it together before doing an analysis of responses across clusters. The data from my neurons was "statistically independent" because, at the time that I did the statistical test, I was ignorant of any relevant relationship between the firing rates of discrete neurons. Given a different state of knowledge, statistical independence may not apply and another type of statistical test may be more appropriate.

I strongly recommend

Acknowledgements

C.D.F. was supported by a "World Class University" Grant from the Korea Science and Engineering Foundation (R32-2008-000-10218-0).

Signed response

By Takahisa Furukawa, M.D. and Ph.D.

Address: Department of Developmental Biology, Osaka Bioscience Institute, 6-2-4 Furuedai, Suita, Osaka, 565-0874, Japan

Many life science studies employ histological analyses, frequently immunostaining of tissue sections, and incorporate these image data into papers. Although it used to be that only the most representative images were displayed in paper figures, quantitative analysis along with statistical analysis of data is now often required to obtain an analytical conclusion on immunostaining image data for publication. The author, Lazic, mentioned in his paper that the reason he picked Nature Neuroscience in particular for his case study is that this journal has detailed instructions for statistical analyses. As Lazic mentioned, authors basically examine their results to test whether or not their data are statistically significant, and present their data with statistical analysis results in their papers. This means that this journal requires high quality data and expects researchers to conduct their studies according to these requirements. In Lazic's paper, the author picked multiple examples, including our statistical analysis of our electron microscopy (EM), and indicated that the statistical analysis method used in our paper was inappropriate

In order to confirm this EM observation, we also performed 3D electron tomography analysis and described the result in the paper. Furthermore, the results from other experiments including electroretinogram and optokinetic responses also support our conclusion that photoreceptor synaptic terminal formation is impaired in the pikachurin null retina. Ultimately, the Lazic paper points out some very important issues in conducting appropriate statistical analysis for biological studies. We understand that we have to be very careful in choosing a suitable statistical method for analyzing our data in the future; however, the Lazic paper does not affect the conclusions in our paper.

Acknowledgements

SEL was supported by a Cancer Research UK bursary, the Cambridge Commonwealth Trust (University of Cambridge), and an OCCAM Visiting Studentship (University of Oxford). This publication was based on work supported in part by Award No KUK-C1-013-04, made by King Abdullah University of Science and Technology (KAUST). The helpful comments and suggestions from four anonymous reviewers are also gratefully acknowledged.