Department of Biostatistics and Bioinformatics, Roswell Park Cancer Institute, Buffalo, NY, 14263, USA

Department of Biostatistics, SUNY University at Buffalo, Buffalo, NY, 14214, USA

Cancer Genetics, Roswell Park Cancer Institute, Buffalo, NY, 14263, USA

Cancer Prevention and Control, Roswell Park Cancer Institute, Buffalo, NY, 14263, USA

Pharmacology and Therapeutics, Roswell Park Cancer Institute, Buffalo, NY, 14263, USA

Abstract

Background

Batch effects are a type of variability that is not of primary interest but is ubiquitous in sizable genomic experiments. To minimize their impact, an ideal experimental design should ensure an even distribution of biological groups and confounding factors across batches. However, due to practical complications, the final collection of samples available in a genomics study might be unbalanced and incomplete, which, without appropriate attention to sample-to-batch allocation, could lead to drastic batch effects. It is therefore necessary to develop an effective and handy tool that assigns collected samples across batches in a way that minimizes the impact of batch effects.

Results

We describe OSAT (Optimal Sample Assignment Tool), a Bioconductor package designed for automated sample-to-batch allocation in genomics experiments.

Conclusions

OSAT is developed to facilitate the allocation of collected samples to different batches in genomics studies. By optimizing the even distribution of samples in groups of biological interest across batches, it can reduce the confounding or correlation between batches and the biological variables of interest. It can also optimize the homogeneous distribution of confounding factors across batches. It handles challenging instances involving incomplete and unbalanced sample collections as well as ideally balanced designs.

Background

A sizable genomics study, such as a microarray study, often involves multiple batches (groups) of experiments due to practical complications. The systematic, non-biological differences between batches in a genomics experiment are referred to as batch effects. Batch effects are widespread in genomic studies, and it has been shown that the variation between batch runs can be a real concern, sometimes even larger than the biological differences.

To minimize the impact of batch effects, a careful experimental design should ensure an even distribution of biological groups and confounding factors across batches. It would be problematic if one batch run contained most samples of a particular biological group. In an ideal genomics design, the groups of main interest, as well as important confounding variables, should be balanced and replicated across the batches to form a Randomized Complete Block Design (RCBD).

However, despite best efforts, more often than not the collected samples do not comply with the original ideal RCBD. This is because these studies are mostly observational or quasi-experimental, since we usually do not have full control over sample availability.

We developed OSAT to facilitate the allocation of collected samples into different batches in genomics studies. OSAT is not intended as software for experimental design carried out before sample collection; rather, it addresses needs arising from practical limitations in genomics experiments. Specifically, OSAT addresses one practical question: once the experimental samples ready to be profiled on the genomics instruments have been collected, how should one allocate them to different batches to achieve a setup that minimizes the impact of batch effects at the genomic profiling stage? With a block randomization step followed by an optimization step, it produces a setup that optimizes the even distribution of samples in groups of biological interest across batches, reducing the confounding or correlation between batches and the biological variables of interest. It can also optimize the even distribution of confounding factors across batches. OSAT can handle challenging instances involving incomplete and unbalanced sample collections as well as ideal balanced RCBDs.

Results

Datasets

An example dataset is used for demonstration. It represents samples from a study where the primary interest is to investigate differential expression between case and control groups (variable SampleType). Two additional variables, Race and AgeGrp, are clinically important variables that may have an impact on the final outcome. We consider them confounding variables. A total of 576 samples are included in the study, with one sample per row in the example file. As shown in Additional file

**Table S1.** Example data. **Table S2.** Data distribution. **Figure S1.** Number of samples per plate. Paired specimens are placed on the same chip. Sample assignment uses the optimal.block method.

Comparison of different sample assignment algorithms

The default algorithm implemented in OSAT will first block three variables considered (

As shown in Figure χ² test examining the association between batches and each of the variables considered indicate that all three variables considered are highly uncorrelated with batches (p-value > 0.99, Table χ² test (Table

**Summary of final setup produced by the default algorithm. a)** the distribution of SampleType characteristic across the plates; **b)** the distribution of Race characteristic across the plates; **c)** the distribution of AgeGrp characteristic across the plates; **d)** the index of optimization steps versus value of the objective function. The blue diamond indicates the starting point, and the red diamond marks the final optimized setup.

| Variable | DF | Default algorithm (optimal.shuffle): Chi-square | Default algorithm (optimal.shuffle): P value | Alternative algorithm (optimal.block): Chi-square | Alternative algorithm (optimal.block): P value | An undesired setup through complete randomization: Chi-square | An undesired setup through complete randomization: P value |
|---|---|---|---|---|---|---|---|
| **SampleType** | 5 | 0.2034518 | 0.9990763 | 0.03507789 | 0.9999879 | 13.25243 | 0.021124664 |
| **Race** | 5 | 0.2380335 | 0.9986490 | 3.68541503 | 0.5955359 | 14.22455 | 0.014244218 |
| **Age_grp** | 20 | 0.8138166 | 1.0000000 | 5.08147313 | 0.9996856 | 39.75020 | 0.005371387 |
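The chi-square statistics and degrees of freedom reported in this table come from tests of association between batch and each sample variable. As a minimal sketch of how such a statistic is formed (this is not OSAT code; the function name `chi_square_stat` and the toy plate labels are hypothetical), the Pearson statistic can be computed from the batch-by-level contingency table as follows. The p-value is left out because it requires the chi-square distribution function, which is not part of the Python standard library.

```python
from collections import Counter

def chi_square_stat(batches, levels):
    """Pearson chi-square statistic and degrees of freedom for the
    contingency table of batch (rows) by variable level (columns)."""
    n = len(batches)
    observed = Counter(zip(batches, levels))
    batch_totals = Counter(batches)
    level_totals = Counter(levels)
    stat = 0.0
    for b, b_total in batch_totals.items():
        for v, v_total in level_totals.items():
            expected = b_total * v_total / n  # count expected under independence
            stat += (observed[(b, v)] - expected) ** 2 / expected
    dof = (len(batch_totals) - 1) * (len(level_totals) - 1)
    return stat, dof

# Toy data (hypothetical): two plates of four samples each.
plates = ["P1"] * 4 + ["P2"] * 4
balanced = ["Case", "Case", "Control", "Control"] * 2    # same mix on both plates
confounded = ["Case"] * 4 + ["Control"] * 4              # one group per plate

print(chi_square_stat(plates, balanced))    # (0.0, 1)
print(chi_square_stat(plates, confounded))  # (8.0, 1)
```

A statistic near zero, as in the balanced toy case, corresponds to a p-value near 1, which is the pattern the two OSAT algorithms show in the table; a large statistic, as in the confounded toy case, corresponds to a small p-value.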

**Summary of final setup produced by the alternative algorithm. a)** the distribution of SampleType characteristic across the plates; **b)** the distribution of Race characteristic across the plates; **c)** the distribution of AgeGrp characteristic across the plates; **d)** the index of generated setups versus value of the objective function. The blue diamond indicates the first setup generated, and the red diamond marks the final selected setup.

Simply performing complete randomization might lead to an undesired sample-to-batch assignment, especially for unbalanced and/or incomplete sample sets. In fact, there is a substantial chance that variables will be statistically dependent on batches if a complete randomization is carried out on such sample collections. As shown in Figure χ² tests indicate all three variables are statistically dependent on batches with p-values < 0.05 (Table

**Summary of an undesired setup produced by complete randomization. a)** the distribution of SampleType characteristic across the plates; **b)** the distribution of Race characteristic across the plates; **c)** the distribution of AgeGrp characteristic across the plates.

Conclusions

Genomics experiments are often driven by the availability of the final collection of samples, which might be unbalanced and incomplete. The unbalanced and incomplete nature of sample availability calls for effective tools that assign collected samples across batches in a way that minimizes the impact of batch effects at the genomics experiment stage. OSAT is developed to facilitate the allocation of collected samples to different batches in genomics studies. With a block randomization step followed by an optimization step, it produces a setup that optimizes the even distribution of samples in groups of biological interest across batches, reducing the confounding or correlation between batches and the biological variables of interest. It can also optimize the homogeneous distribution of confounding factors across batches. While motivated by challenging instances involving incomplete and unbalanced sample collections, OSAT can also handle ideal balanced RCBDs.

Partly due to its simplicity of implementation, complete randomization has frequently been used in the sample assignment step in experimental practice. When the sample size is large enough, a randomized design will be close to a balanced design. However, simple randomization can lead to an undesirable imbalanced design in which efficiency and confounding become an issue after data collection. As we demonstrated in the manuscript, simply performing randomization might lead to an undesired sample-to-batch setup showing batch dependence, especially for unbalanced and/or incomplete sample sets that do not comply with the original ideal design. The OSAT package is designed to avoid such scenarios by providing a simple pipeline that creates a sample assignment minimizing the association between sample characteristics and batches. The software was implemented in a flexible way so that it can be adopted by genomics practitioners who might not be specialized in experimental design.

It should be emphasized that although the impact of batch effects on a genomics study might be minimized through proper design and sample allocation, it may not be completely eliminated. Even with a perfect design and best effort at all stages of the experiment, including sample-to-batch assignment, it is impossible to define or control all potential batch effects. Many statistical methods have been developed to estimate and reduce the impact of batch effects at the data analysis stage (

Experimental design has been applied in many areas, with methods being tailored to the needs of various fields. A collection of R packages for experimental design is available at

Methods

Methodology

The current version of OSAT provides two algorithms for creating a sample assignment across batches based on the principle of block randomization, which is an effective approach to controlling variability from nuisance variables such as batches and their interaction with the variables of primary interest.

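By combining the variables of interest, we can create a unified variable whose levels are all possible combinations of the levels of the variables involved. Assume there are a total of $s$ levels (strata) in the unified variable, with $S_j$ samples in stratum $j$, $j = 1, \dots, s$. In a balanced design, all strata have equal sample size, $S_1 = \dots = S_s$. Assume further that there are $m$ batches, with $B_i$ available wells in batch $i$, $i = 1, \dots, m$. In an ideal balanced RCBD, all batches have an equal number of wells, $B_1 = \dots = B_m$, and each batch contains an equal number of samples from each stratum.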

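The expected number of samples from stratum $j$ in batch $i$ is denoted $E_{ij}$. It can be split into its integer part and fractional part as

$$E_{ij} = \frac{B_i S_j}{\sum_{k} B_k} = \lfloor E_{ij} \rfloor + \delta_{ij},$$

where $B_i$ is the number of wells in batch $i$, $S_j$ is the number of samples in stratum $j$, $\lfloor E_{ij} \rfloor$ is the integer part of the expected number and $\delta_{ij}$ is the fractional part. In the case of equal batch size, it reduces to $E_{ij} = S_j / m$, where $m$ is the number of batches. For an RCBD, all $\delta_{ij}$ are zero.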
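The split of the expected count into its integer and fractional parts can be illustrated with a short sketch. This is not OSAT code: it assumes the expected count is allocated in proportion to batch size, $B_i S_j / \sum_k B_k$, and the function name `expected_counts` and the toy sizes are hypothetical.

```python
import math

def expected_counts(batch_sizes, stratum_sizes):
    """For every batch i and stratum j return (E_ij, floor(E_ij), delta_ij),
    assuming E_ij = B_i * S_j / sum(B)."""
    total_wells = sum(batch_sizes)
    table = []
    for wells in batch_sizes:
        row = []
        for size in stratum_sizes:
            e = wells * size / total_wells
            row.append((e, math.floor(e), e - math.floor(e)))
        table.append(row)
    return table

# Toy example (hypothetical): two batches of 4 wells, strata of 5 and 3 samples.
for row in expected_counts([4, 4], [5, 3]):
    print(row)
# [(2.5, 2, 0.5), (1.5, 1, 0.5)]
# [(2.5, 2, 0.5), (1.5, 1, 0.5)]
```

In this toy case no stratum divides evenly between the two batches, so every fractional part is 0.5 and the setup cannot be an exact RCBD.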

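For an actual sample assignment, we record the matrix of counts

$$\left(n_{ij}\right), \quad i = 1, \dots, m, \quad j = 1, \dots, s,$$

where $n_{ij}$ is the number of samples from optimization stratum $j$ placed in batch $i$ by the actual sample assignment, with $m$ batches and $s$ strata. Our goal is, through a block randomization step and an optimization step, to minimize the difference between the expected sample size $E_{ij}$ and the actual sample size $n_{ij}$.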

The block randomization step creates the initial setup(s) of randomized sample assignment based on strata that combine the blocking variables considered. The blocking variables include all variables of interest in the default algorithm, but only a specified subset of variables in the alternative algorithm.

In this step, for each batch $i$ and stratum $j$, we sample a set of $\lfloor E_{ij} \rfloor$ samples from the $S_j$ samples of stratum $j$, as well as a set of $\lfloor E_{ij} \rfloor$ wells from the $B_i$ wells of batch $i$. The two selections are linked together by the $ij$ subgroup and randomized within each subgroup. The remaining samples in each stratum, $r_j = S_j - \sum_i \lfloor E_{ij} \rfloor$, can be assigned to the wells still available in each block, $w_i = B_i - \sum_j \lfloor E_{ij} \rfloor$. The probability of a sample in $r_j$ from stratum $j$ being assigned to a well from block $i$ is proportional to the fractional part $\delta_{ij}$ of the expected sample size $E_{ij}$. For an RCBD, each batch will have an equal number of samples with the same characteristics and there is no need for further optimization. However, for other instances where the collection of samples is unbalanced and/or incomplete, an optimization step is needed to create a more optimal setup of sample assignment.

The optimization step aims to identify an optimal setup of sample assignments from multiple candidates. To select the optimal sample assignment, we need to measure the variation of sample characteristics between batches. In this package, we define the optimal design as the sample assignment setup that minimizes our objective function, based on the principle of least squares:

$$V = \sum_{i,j} \left(n_{ij} - E_{ij}\right)^2$$

where the actual counts $n_{ij}$ and the expected counts $E_{ij}$ were defined previously.
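As a minimal sketch of such a least-squares objective, assuming it is the sum over batches and strata of the squared difference between the actual and the expected count (the function name `objective` and the toy matrices are hypothetical, not OSAT code):

```python
def objective(actual, expected):
    """Sum of squared differences between actual and expected counts.
    Both arguments are matrices with one row per batch and one column per stratum."""
    return sum(
        (n - e) ** 2
        for actual_row, expected_row in zip(actual, expected)
        for n, e in zip(actual_row, expected_row)
    )

# Toy example (hypothetical): two batches, strata of 5 and 3 samples,
# so each batch is expected to receive 2.5 and 1.5 samples respectively.
expected = [[2.5, 1.5], [2.5, 1.5]]
print(objective([[3, 1], [2, 2]], expected))  # 1.0
print(objective([[4, 0], [1, 3]], expected))  # 9.0
```

The first toy assignment is as even as integer counts allow and scores 1.0, while the second concentrates one stratum in one batch and scores 9.0, so the first would be preferred.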

In the default algorithm implemented in OSAT, optimization is conducted through shuffling the initial setup obtained in the block randomization step. Specifically, after initial setup is created, we randomly select

In the alternative algorithm, multiple (typically thousands or more) sample assignment setups are first generated by the procedure described in the block randomization step above, based only on the list of specified blocking variable(s). The optimal setup is then chosen from this pool as the one that minimizes the value of the objective function based on all variables considered. This algorithm guarantees the identification of a setup that conforms to the blocking requirement for the specified blocking variables, while attempting to minimize the between-batch variation of the other variables considered.
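The two optimization strategies can be contrasted in a small sketch. It is an illustration under explicit assumptions rather than the OSAT implementation: batches are of equal size so that the expected count is the stratum size divided by the number of batches; the shuffle variant is assumed to swap two samples from different batches and to keep a swap only when the objective does not increase; batch capacity is not modelled in the candidate generator, so the toy data are chosen such that every blocking level divides evenly. All function names and data are hypothetical.

```python
import random
from collections import Counter

def objective(batch_of, stratum_of, n_batches):
    """Sum of (n_ij - E_ij)^2, assuming equal batch sizes so E_ij = S_j / m."""
    sizes = Counter(stratum_of.values())
    actual = Counter((batch_of[s], stratum_of[s]) for s in batch_of)
    return sum((actual[(i, j)] - size / n_batches) ** 2
               for i in range(n_batches) for j, size in sizes.items())

def shuffle_optimize(batch_of, stratum_of, n_batches, n_steps, rng):
    """Shuffle-style search: swap two samples between batches and keep the
    swap only if the objective does not get worse (assumed acceptance rule)."""
    current = dict(batch_of)
    value = objective(current, stratum_of, n_batches)
    ids = list(current)
    for _ in range(n_steps):
        a, b = rng.sample(ids, 2)
        if current[a] == current[b]:
            continue  # same batch, nothing to swap
        current[a], current[b] = current[b], current[a]
        new_value = objective(current, stratum_of, n_batches)
        if new_value <= value:
            value = new_value
        else:
            current[a], current[b] = current[b], current[a]  # undo the swap
    return current, value

def blocked_candidate(block_of, n_batches, rng):
    """One random setup that deals each blocking level evenly across batches."""
    groups = {}
    for sample, level in block_of.items():
        groups.setdefault(level, []).append(sample)
    batch_of = {}
    for members in groups.values():
        rng.shuffle(members)
        offset = rng.randrange(n_batches)
        for k, sample in enumerate(members):
            batch_of[sample] = (k + offset) % n_batches
    return batch_of

def best_of_candidates(block_of, stratum_of, n_batches, n_candidates, rng):
    """Block-style search: generate many blocked setups, keep the best one."""
    best, best_value = None, float("inf")
    for _ in range(n_candidates):
        candidate = blocked_candidate(block_of, n_batches, rng)
        value = objective(candidate, stratum_of, n_batches)
        if value < best_value:
            best, best_value = candidate, value
    return best, best_value

# Toy data (hypothetical): 8 samples described by (SampleType, Race).
labels = [("Case", "A"), ("Case", "A"), ("Case", "B"), ("Case", "B"),
          ("Control", "A"), ("Control", "A"), ("Control", "B"), ("Control", "B")]
stratum_of = {f"s{k}": label for k, label in enumerate(labels)}
block_of = {sid: label[0] for sid, label in stratum_of.items()}  # block on SampleType
rng = random.Random(1)

start = {sid: k // 4 for k, sid in enumerate(stratum_of)}  # all cases in batch 0
start_value = objective(start, stratum_of, 2)
shuffled, shuffled_value = shuffle_optimize(start, stratum_of, 2, 500, rng)
best, best_value = best_of_candidates(block_of, stratum_of, 2, 200, rng)
```

In this sketch the shuffle search can never end with a worse objective than its starting setup, and every candidate in the block search keeps SampleType exactly balanced while the selection step balances Race as well as the pool allows.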

Implementation

We provide a brief overview of OSAT usage below. A more detailed description of the package functionality can be found in the package vignette and manual.

Data format

To begin, the sample variables to be considered in the sample-to-batch assignment are encapsulated in an object using the function setup.sample:

sample <- setup.sample(x, optimal, …)

where in data frame

Batch layout

Next, the number of plates to be used in the genomic experiment, the layout design of these plates, and the level of batch effect to be considered are captured in a container object using the constructor function

container <- setup.container(plate, n, batch, …)

where parameter

Block randomization and optimization

Third, the sample-to-batch assignment is created through the function create.optimized.setup:

create.optimized.setup(fun = "optimal.shuffle", sample, container, …)

The default algorithm is implemented in function optimal.shuffle, and the alternative algorithm in function optimal.block.

Output

Last, a bar plot of sample counts by batch for all variables considered is provided for visual inspection of the sample assignment. Chi-square tests are also performed to examine the dependence of sample variables on batches. The final sample-to-batch assignment can be output to a CSV file.

Availability and requirements

Project name: OSAT

Project home page:

Programming language: R >= 2.15

License: Artistic-2.0

Any restrictions to use by non-academics: None

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

LY, CM and SL conceived and designed the study. LY developed the software. LY, CM and SL drafted the manuscript. QH, DW, MQ, JMC, LES, CAB, CSJ and JW all contributed to the study design. All authors read and approved the final manuscript.

Acknowledgements

We wish to thank the anonymous reviewers for their valuable comments and suggestions, which were helpful in improving the paper. The work was supported in part by the National Institutes of Health grants R01HL102278 to LES, R01CA133264 to CBA, R01-CA095045 to CSJ, and R21CA162218 to SL.