High-dimensional biomolecular profiling of genetically different individuals in one or more environmental conditions is an increasingly popular strategy for exploring the functioning of complex biological systems. The optimal design of such genetical genomics experiments in a cost-efficient and effective way is not trivial.
This paper presents designGG, an R package for designing optimal genetical genomics experiments. A web implementation for designGG is available at http://gbic.biol.rug.nl/designGG webcite. All software, including source code and documentation, is freely available.
DesignGG allows users to intelligently select and allocate individuals to experimental units and conditions such as drug treatment. The user can maximize the power and resolution of detecting genetic, environmental and interaction effects in a genome-wide or local mode by giving more weight to genome regions of special interest, such as previously detected phenotypic quantitative trait loci. This will help to achieve high power and more accurate estimates of the effects of interesting factors, and thus yield a more reliable biological interpretation of data. DesignGG is applicable to linkage analysis of experimental crosses, e.g. recombinant inbred lines, as well as to association analysis of natural populations.
Genetical genomics  has become a popular strategy for studying complex biological systems using a combination of classical genetics, biomolecular profiling and bioinformatics [2-5]. By measuring molecular variation, using transcriptomics, proteomics, metabolomics and related emerging technologies, in genetically different individuals, genetical genomics has the potential to identify the functional consequences of natural and induced genetic variation. Recently, genetical genomics has been generalized to achieve a comprehensive understanding of the dynamics of molecular networks by combining environmental and genetic perturbation [6,7]. This type of large scale "omics" study leads to a better understanding of why individuals of the same species respond differently to drugs, pathogens, and other environmental factors.
However, most molecular profiling experiments are very costly, and as a consequence most genetical genomics studies are performed at the verge of statistical feasibility. Therefore, experimental design needs careful consideration to achieve maximum power from limited resources, such as microarrays and experimental animals [8,9]. But, even in standard scenarios this requires sophisticated application of statistical concepts to intelligently select genetically different individuals from a population and allocate them to different conditions and experimental units. This topic has motivated classical statistical research since a long time . More recently, the concepts developed there have been adapted to the high dimensional data sets of post-genomics research [8,11-13], and useful simplified design strategies have been suggested [11,14]. However, to transfer these statistical ideas to the even more complex context of genetical genomics [9,15,16] still requires considerable expertise in statistics.
Here we present an online web tool to make these selections and allocations easy for biologists with little/no statistical training. The program will find the best experimental design to produce the most accurate estimates of the most relevant biological parameters, given the number of experimental factors to be varied, the genotype information on the population, the profiling technology used, and the constraints on the number of individuals that can be profiled. Advanced users can download the underlying methods as an R package to adapt the program for a more tailored design. Without loss of generality, we will illustrate the method using microarrays, while they apply equally well to other profiling technologies, such as mass spectrometry. Also, we will only discuss molecular technologies that profile samples individually (e.g., single color microarrays) or in pairs (e.g., dual color microarrays), but an extension of the R scripts to more advanced multiplex technologies would be straightforward .
The objective of designGG is to find an optimal allocation of genetically different samples to different conditions and experimental units (arrays) favoring a precise estimate of interesting parameters, such as main genetic effects and interaction effects between genotype and drug treatment. A simple case with one environmental factor can be expressed as y = μ + G× E + ε, where y is the measurement vector, ε is the error term, and G×E denotes main effect and interaction effects of genotype and environment. In matrix notation, a model with one or more genotype factors (quantitative trait loci; QTL) and one or more environmental factors can be written as: Y = Xβ + E, where X is the design matrix of samples by parameters and β is the effect of genotype and environmental factors. The least squares estimate of β is b = (XTX)-1XTY with var(b) = σ2(XTX)-1. The optimal experiment design is defined as the one that minimizes the double sum of the variances of b firstly summed over all parameters and then summed over all genotypic markers. We use an optimization algorithm (simulated annealing ) to search the experimental design space of all possible allocations to produce an optimal design matrix X. During the optimization, the algorithm utilizes the available marker information from the individuals to optimize the allocation of individuals to microarrays and conditions.
In the optimization, the experimenter can, of course, give more weight to parameters of higher interest, which will then be estimated with higher accuracy. Particularly, prior knowledge about expected effect sizes of interesting factors can be incorporated as weight parameters for the algorithm and the weight is inversely proportional to the expected effect size of the corresponding factors. In addition, it is also possible to specify the genome regions that are of major interest in a particular experiment, by specifying a region parameter. For example, if the relevant phenotype is known to map to certain genome regions, parameters for the markers in these regions can be given full weight in the optimization algorithm, whereas parameters for other markers can be given lesser or even zero weight. Thus, mapping resolution can improve and the power for finding QTLs in focal regions can be increased.
DesignGG is a package entirely written in the R language . Every function of the designGG library is available as a stand-alone R tool and detailed help is available according to the standard format of R documentation.
Figure 1. Screenshot of the designGG web interface.
1. Choose the platform. Select the single- or dual-channel option for one-color or two-color gene expression microarrays (the dual-channel option is also used for any other technology profiling pairs of samples).
2. Upload a tab separated value (TXT) file containing the genotype data matrix (individuals × markers). Each cell contains a genotype label (e.g. A or B for the parental alleles, H for heterozygous loci; NA for missing data).
3. Set parameters. Specify the number of environmental factors, their number of levels, and the possible values of these levels. Specify either the total number of slides (assays) or the number of samples allocated within each condition.
4. Use advanced options if only one or a few genome regions or particular factors are of major interest. It is possible to optimize the experimental design by focusing on certain regions (e.g. the first 20 markers on chromosome I). Prior knowledge about expected effect sizes of interesting factors can also be incorporated as weight parameters for the algorithm.
5. Start the optimization algorithm by clicking on the button Optimize Experimental Design (Figure 1).
6. Get results. After the optimization is finished, the optimal experimental design will be displayed online (in table format), and will be available as text files for download.
Here we illustrate how to apply the designGG R package using an example: suppose we are studying the effect of genetic factors (Q), temperature (F1), drug treatment (F2) and their interaction on gene expression using two-colour microarrays. There are 100 microarray slides available for this experiment, and we plan to study two different levels for each environment, which are 16°C and 24°C for F1 (temperature), and 5 μM and 10 μM for F2 (drug treatment). Then the R package can also be used in command line form as follows:
1. Prepare the input file specifying the genotype of each individual at each marker position. The file should be formatted as tab separated values (TXT), as illustrated in Table 1.
Table 1. Example table of genotype data. Heterozygous loci are indicated by an H.
2. Load the designGG package by starting the R application and typing the command:
Specify the input arguments (Steps 3–5 correspond to steps 2–4 of using the web tool. The order of the following commands in steps 3–5 does not matter).
3. Choose the platform of the experiment. In this example, we use two-color microarray, thus:
> bTwoColorArray <- T #if paired; F otherwise
4. Load the marker data and specify the following required arguments (number of environmental factors, number of levels per factor, the values of each level, and the number of available slides):
> data(genotype) #an example data attached with the designGG package
# The command below can be used to read TXT data
# genotype <- read.table("genotype.txt")
> nEnvFactors <- 2
> nLevels <- c(2, 2)
> Level <- list(c(16, 24), c(5, 10))
> nSlides <- 100; nTuple <- NULL
An alternative to specifying nSlides is to specify nTuple, the number of strains to be allocated onto each condition. For example,
> nTuple <- 25 ; nSlides <- NULL;
5. In addition to the required arguments specified in step 4, there are some optional ones for a tailored experimental design: e.g., we might be especially interested in the genome region between 1st marker and 20th marker, where a known phenotypic QTL from previous study locates. They can then specify that the optimization algorithm should only take genotypes at markers 1 to 20 into account:
> region <- seq(1, 20, by = 1)
Additionally, if we want that the estimates of all interaction effects are twice as accurate as the estimates of the main effects (genotype, temperature and drug treatment), then we specify weights for the estimates:
> weight <- c(0.5,0.5,0.5,1,1,1,1)
Here the order of elements in the weight vector is such that first the main effects are listed, starting with the genotype, followed by the two environmental factors in the order used for nLevels and Level, then the one-way interactions, in the same order, and finally the two-way interaction between all three factors.
6. The following commands specify the directory where the resulting optimal design tables are to be stored and the name of the output files (design tables):
> directory <- "C:\myproject\design"
> fileName <- "myDesign"
A detailed explanation of the above arguments can also be found in Table 2.
Table 2. The description and possible values of designGG arguments
7. Run designGG to obtain your optimal design:
> myOutput <- designGG(genotype, nSlides, nTuple, nEnvFactors, nLevels, Level, region = region, weight = weight, nIterations = 10)
It should be noted that the number of iteration of the simulated annealing method (fnIterations)is set to 10 here for testing purposes. The default value (nIterations = 3000) is recommended, but it will result in a longer computing time.
8. Output can be found in the directory or retrieved with:
> optimalArrayDesign <- myOutput$arrayDesign
> optimalCondDesign <- myOutput$conditionDesign
Table 3. Example table of the allocation of strains to arrays.
Table 4. Example table of the allocation of strains to experimental conditions.
9. In addition, users can check the curve of optimization score recorded as the algorithm iterates using:
> plotAllScores (myOutput$plot.obj)
Details of default settings such as method (SA: simulated annealing) or nSearch (equals 2) can be found in the designGG manual or the online help. Example genotype data and output tables are also provided along with the package. The R package can be found in Additional file 1 and most up-to-date version of the software can be downloaded at http://gbic.biol.rug.nl/designGG webcite.
Additional file 1. designGG: an R-package for the optimal design of genetical genomics experiments. DesignGG aims at finding an optimal design of genetical genomics experiments which maximize the power and resolution of detecting genetic, environmental and interaction effects. This will help to achieve high power and more accurate estimates of the effects of interesting factors, and thus yield a more reliable biological interpretation of data.
Format: ZIP Size: 128KB Download file
Two tables summarize the optimal design: The table pair design is only used for two-channel experiments and describes how samples are paired together in one assay e.g., a two-color microarray chip (Table 3). The table environment design lists how samples are assigned to environments/experimental factors (Table 4).
DesignGG, a freely-available R package and web tool presented in this work, represents a novel tool for the researcher interested in system genetics. Based on the careful experimental design provided by designGG, limited resources, such as arrays and samples, are maximally exploited, and more accurate estimates of parameters of interest can be achieved.
Availability and requiredments
YL developed designGG. RCJ and RB directed the project. MAS, GV and JF helped to implement the web tool. All authors wrote the manuscript, and read and approved the final version.
This work was supported by the Netherlands Organization for Scientific Research, NWO-86504001. We thank Danny Arends for help in implementing the web tool.
Bystrykh L, Weersing E, Dontje B, Sutton S, Pletcher MT, Wiltshire T, Su AI, Vellenga E, Wang J, Manly KF, et al.: Uncovering regulatory pathways that affect hematopoietic stem cell function using 'genetical genomics'.
Schadt EE, Lamb J, Yang X, Zhu J, Edwards S, Guhathakurta D, Sieberts SK, Monks S, Reitman M, Zhang C, et al.: An integrative genomics approach to infer causal associations between gene expression and disease.
Li Y, Alvarez OA, Gutteling EW, Tijsterman M, Fu J, Riksen JA, Hazendonk E, Prins P, Plasterk RH, Jansen RC, et al.: Mapping determinants of gene expression plasticity by genetical genomics in C. elegans.
Swertz MA, De Brock EO, Van Hijum SA, De Jong A, Buist G, Baerends RJ, Kok J, Kuipers OP, Jansen RC: Molecular Genetics Information System (MOLGENIS): alternatives in developing local experimental genomics databases.