Department of Statistics, The University of Auckland, Auckland, New Zealand

Abstract

Background

Large-scale genetic association studies can test hundreds of thousands of genetic markers for association with a trait. Because the markers may be correlated, a Bonferroni correction for multiple testing is typically too stringent. Permutation testing is a standard statistical technique for determining statistical significance when performing multiple correlated tests for genetic association. However, permutation testing for large-scale genetic association studies is computationally demanding and calls for optimized algorithms and software. PRESTO is a new software package for genetic association studies that performs fast computation of multiple-testing adjusted P-values via permutation of the trait.

Results

PRESTO is an order of magnitude faster than other existing permutation testing software, and can analyze a large genome-wide association study (500 K markers, 5 K individuals, 1 K permutations) in approximately one hour of computing time. PRESTO has several unique features that are useful in a wide range of studies: it reports empirical null distributions for the top-ranked statistics (i.e. order statistics), it performs user-specified combinations of allelic and genotypic tests, it performs stratified analysis when sampled individuals are from multiple populations and each individual's population of origin is specified, and it determines significance levels for one- and two-stage genotyping designs. PRESTO is designed for case-control studies, but can also be applied to trio data (parents and affected offspring) if transmitted parental alleles are coded as case alleles and untransmitted parental alleles are coded as control alleles.

Conclusion

PRESTO is a platform-independent software package that performs fast and flexible permutation testing for genetic association studies. The PRESTO executable file, Java source code, example data, and documentation are freely available at

Background

Permutation testing is often described as the gold standard for determining statistical significance when performing multiple correlated tests for genetic association. Permutation testing can be applied to both case-control studies and trio studies (parents and affected offspring). In permutation testing, the case/control status of the individuals (for case-control studies) or the transmitted/untransmitted status of the parental chromosomes (for trio studies) is randomly permuted. The maximum test statistic, maximized over all tests for all markers, is calculated for the original affection/transmission status and for each permuted affection/transmission status. If

Permutation testing is computationally demanding for large-scale genetic association studies and requires an optimized software implementation. The PRESTO software package provides fast permutation testing for genome-wide association studies with thousands or millions of markers genotyped on thousands of samples. In addition to using permutation of the trait status to determine statistical significance of user-specified allelic and genotypic tests, PRESTO has three additional useful features: it can compute empirical distributions of order statistics so that the significance of sophisticated multi-marker statistics such as truncated products can be determined

Implementation

Features

PRESTO is designed to be flexible and user-friendly. Input files have a simple format with rows corresponding to markers and columns corresponding to individuals (two columns per diploid genotype). This format is well-suited to large-scale genetic studies where there are typically many more markers (rows) than individuals (columns). Genetic marker data can be split up over multiple input files (e.g. one file per chromosome). There are no restrictions on how alleles or missing data are coded: any sequence of non-whitespace characters can be used. Multi-allelic markers are permitted and are analyzed by creating a diallelic marker for each allele (grouping the other alleles) and testing each diallelic marker for association with the trait status. If the cases and controls are sampled from a stratified population and the strata are specified, PRESTO will automatically perform stratified allelic and genotypic tests
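The recoding of a multi-allelic marker into one diallelic marker per allele can be illustrated with a minimal Python sketch. This is not PRESTO's code: the function name `diallelic_recodings`, the `"1"`/`"2"` output labels, and the `missing` code `"?"` are hypothetical choices made for illustration.

```python
def diallelic_recodings(alleles, missing="?"):
    """Recode one marker into one diallelic marker per distinct allele.

    `alleles` is a flat list of allele codes for a single marker, two
    consecutive entries per individual (hypothetical layout mirroring the
    two-columns-per-genotype input format). For each target allele, the
    target becomes "1", every other allele becomes "2", and missing codes
    are passed through unchanged.
    """
    distinct = sorted({a for a in alleles if a != missing})
    recodings = {}
    for target in distinct:
        recodings[target] = [
            missing if a == missing else ("1" if a == target else "2")
            for a in alleles
        ]
    return recodings


# Three individuals, alleles A/C, C/T, and one half-missing genotype ?/A.
example = diallelic_recodings(["A", "C", "C", "T", "?", "A"])
```

Each recoded marker can then be tested on its own. For a marker that is already diallelic, the two recodings carry the same information, so a practical implementation would presumably keep only one of them.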

PRESTO can also compute significance levels of combined single locus and multi-locus analysis by representing clusters of haplotypes as diallelic markers as described in Browning and Browning

PRESTO performs a Cochran-Mantel-Haenszel (CMH) test with continuity correction and a Mantel trend test

For each permutation of the trait status, PRESTO can store and report the top-ranked order statistics. The

PRESTO can also calculate significance levels for two-stage genotyping designs from the first-stage genotype data using the technique described by Dudbridge

Optimization techniques

PRESTO employs several techniques to optimize permutation testing on large-scale data sets. The permutations of the trait status are computed once and are stored. Each permutation of the trait status is represented as an array of Boolean (1 bit) variables in which the
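As a rough illustration of computing the permutations once and storing them compactly, the sketch below packs each permuted trait vector into a single Python integer used as a bitmask. The convention that bit `i` set means "individual `i` is a case in this permutation" is an assumption, and the names `make_permutations` and `is_case_in` are hypothetical; PRESTO itself is written in Java and its internal layout is not shown in the text.

```python
import random


def make_permutations(is_case, n_perm, seed=0):
    """Shuffle the case/control labels n_perm times.

    Each permutation is returned as an int bitmask: bit i is 1 when
    individual i is labeled a case in that permutation (assumed convention).
    The number of cases is preserved by every shuffle.
    """
    rng = random.Random(seed)
    labels = list(is_case)
    perms = []
    for _ in range(n_perm):
        rng.shuffle(labels)
        mask = 0
        for i, flag in enumerate(labels):
            if flag:
                mask |= 1 << i
        perms.append(mask)
    return perms


def is_case_in(mask, i):
    """Look up the permuted trait status of individual i."""
    return (mask >> i) & 1 == 1


# Three cases and five controls, ten stored permutations.
stored = make_permutations([True] * 3 + [False] * 5, n_perm=10)
```

Because the stored permutations are reused for every marker, the cost of shuffling is paid once rather than once per marker.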

For each permutation of the trait status and each diallelic marker, a 2 × 3 contingency table is created where the rows are the cases and controls and the columns are the three possible genotypes. PRESTO obtains the 2 × 3 contingency table counts without having to check the permuted trait status and genotype for all individuals. Instead, PRESTO stores the indices of individuals with missing genotypes, heterozygote genotypes, and minor (least common) allele homozygote genotypes. The indices of individuals with major allele homozygote genotypes do not need to be stored because the case and control major allele homozygote genotype counts can be calculated from the case and control sample sizes, the case and control missing genotype counts, and the case and control heterozygote and minor allele homozygote genotype counts. For example, if there are
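The counting shortcut described above can be sketched as follows. Only the indices of individuals with heterozygote, minor allele homozygote, and missing genotypes are visited; the major allele homozygote counts are obtained by subtraction. The function name, argument names, and the column order (major homozygote, heterozygote, minor homozygote) are assumptions made for this illustration.

```python
def contingency_table(case_flags, n_cases, n_controls,
                      het_idx, minor_hom_idx, missing_idx):
    """Build a 2 x 3 table: rows are cases and controls, columns are
    major homozygote, heterozygote, minor homozygote (assumed order).

    case_flags[i] is the (permuted) trait status of individual i.
    Only the stored index lists are scanned, never all individuals.
    """
    def split(indices):
        cases = sum(1 for i in indices if case_flags[i])
        return cases, len(indices) - cases

    het_cases, het_controls = split(het_idx)
    minor_cases, minor_controls = split(minor_hom_idx)
    miss_cases, miss_controls = split(missing_idx)

    # Major homozygote counts follow from the sample sizes by subtraction.
    major_cases = n_cases - miss_cases - het_cases - minor_cases
    major_controls = n_controls - miss_controls - het_controls - minor_controls

    return [[major_cases, het_cases, minor_cases],
            [major_controls, het_controls, minor_controls]]


# Eight individuals: 0-2 are cases, 3-7 are controls.
flags = [True, True, True, False, False, False, False, False]
table = contingency_table(flags, 3, 5,
                          het_idx=[1, 4], minor_hom_idx=[2], missing_idx=[7])
```

The work per permutation is proportional to the number of stored indices, which is small for markers with a rare minor allele and few missing genotypes.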

Output files

PRESTO produces three output files: a log file, a P-value file, and a null distribution file. The log file summarizes the analysis and reports the command line parameters, the running time, and a list of all markers with a multiple-testing adjusted P-value less than 0.2.

The P-value file gives the chi-square test statistics for each allelic and genotypic test performed for each marker, and the permutation P-value for the maximum test statistic for each marker (maximized over all allelic and genotypic tests for the marker). If a marker has a maximum test statistic _{0} when tested for association with the unpermuted trait status, and if for _{0}, then the multiple-testing adjusted P-value for the marker is (
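A minimal sketch of turning permutation results into a multiple-testing adjusted P-value is given below. It assumes the common convention of adding one to both the count of exceeding permutations and the number of permutations; that convention is an assumption here and may not match PRESTO's exact formula. The function name is hypothetical.

```python
def adjusted_p_value(observed_max, permuted_maxima):
    """Adjusted P-value for one marker.

    observed_max: the marker's maximum statistic on the unpermuted trait.
    permuted_maxima: for each permutation, the maximum statistic over all
    markers and all tests.
    Uses the (k + 1) / (n + 1) convention (an assumption of this sketch).
    """
    exceed = sum(1 for t in permuted_maxima if t >= observed_max)
    return (exceed + 1) / (len(permuted_maxima) + 1)


# Two of four permutation maxima reach the observed value of 10.0.
p = adjusted_p_value(10.0, [1.0, 12.0, 9.5, 10.0])
```

Comparing each marker against the experiment-wide maximum, rather than against its own permuted statistics, is what makes the resulting P-value account for all tests at once.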

The null distribution file gives the largest test statistics for each permutation of the trait status. If there are
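To illustrate how a null distribution of order statistics might be used, the sketch below extracts the largest statistics from each permutation and estimates a P-value for the observed statistic of a given rank. The function names, the in-memory layout (one list of statistics per permutation), and the (k + 1) / (n + 1) estimator are assumptions for illustration, not a description of PRESTO's file format or method.

```python
import heapq


def top_order_statistics(stats, k):
    """Return the k largest statistics in decreasing order."""
    return heapq.nlargest(k, stats)


def order_statistic_p_value(observed_stats, permuted_stats, rank):
    """Empirical P-value for the rank-th largest observed statistic.

    permuted_stats holds one list of statistics per permutation; the
    rank-th largest value of each list forms the empirical null.
    """
    observed = top_order_statistics(observed_stats, rank)[-1]
    null = [top_order_statistics(s, rank)[-1] for s in permuted_stats]
    exceed = sum(1 for t in null if t >= observed)
    return (exceed + 1) / (len(null) + 1)


observed = [5.0, 9.0, 7.0]
permuted = [[1.0, 8.0, 2.0], [9.5, 7.5, 0.1], [3.0, 3.0, 3.0]]
p_second = order_statistic_p_value(observed, permuted, rank=2)
```

The same stored order statistics could feed other statistics defined on the top-ranked values, since any such statistic can be recomputed per permutation from the stored lists.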

Results

Computational time

Table

PRESTO running times for the Wellcome Trust Case Control Consortium Crohn's disease study.

| # order statistics | # strata | one-stage study | two-stage study |
|---|---|---|---|
| 1 | 1 | 52.3 min | 33.8 min |
| 1 | 12 | 84.9 min | 55.1 min |
| 1000 | 1 | 56.6 min | 34.3 min |
| 1000 | 12 | 85.6 min | 58.5 min |

PRESTO computational times for 449,446 autosomal markers genotyped in 1749 cases and 2938 controls. Allelic trend test and dominant/recessive genotypic tests were performed using 1000 permutations of the trait status for eight scenarios defined by the number of genotyping stages (1 or 2), the number of order statistic distributions calculated (1 or 1000), and the number of population strata (1 or 12). Running times were measured on a 2.4 GHz Intel Core 2 Duo E6600 processor with 4 GB of memory running Linux.

Running times for PRESTO 1.0.1 and PLINK 1.0

PRESTO's running time is linear in the number of samples, linear in the number of markers and linear in the number of permutations. Generally, 1000 permutations are sufficient to determine experiment-wide significance. PRESTO can also be run in parallel as described in the documentation.
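Linear scaling gives a quick way to extrapolate running times. The sketch below rescales the one-stage, single-stratum timing reported for the Crohn's disease data (52.3 minutes for 449,446 markers, 1749 + 2938 individuals, and 1000 permutations) to a study with 500,000 markers and 5000 individuals. It assumes strictly proportional scaling with no fixed overhead, which is a simplification; the function and dictionary names are hypothetical.

```python
def scaled_minutes(base_minutes, base, target):
    """Rescale a measured running time under the assumption that time is
    proportional to markers x individuals x permutations."""
    factor = 1.0
    for key in ("markers", "individuals", "permutations"):
        factor *= target[key] / base[key]
    return base_minutes * factor


wtccc = {"markers": 449_446, "individuals": 1749 + 2938, "permutations": 1000}
gwas = {"markers": 500_000, "individuals": 5000, "permutations": 1000}

# Roughly 62 minutes under the proportionality assumption.
estimate = scaled_minutes(52.3, wtccc, gwas)
```

Under this assumption the estimate is close to one hour, and ten times as many permutations would take roughly ten times as long on the same hardware.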

Memory requirements

Since only one marker is stored in memory at a time and since the trait status for each individual is stored using 1 bit, PRESTO's memory requirements are modest. If there are

Discussion

Permutation testing with 1000 permutations of a large case-control genome-wide association study with 5000 individuals genotyped for 500,000 markers can be performed using PRESTO in approximately one hour of computing time (Table

There has been some debate regarding the number of permutations required. When performing ^{-8}, 1.5 × 10^{-8}, and 6.2 × 10^{-9} respectively. If additional permutations are desired, 10^{4} or 10^{5} permutations are easily performed on a large genome-wide data set like the WTCCC data set in Table
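One standard way to reason about how many permutations are needed is the Monte Carlo standard error of an estimated P-value, treating each permutation as an independent Bernoulli trial. The sketch below is a generic binomial calculation offered for orientation; it is not taken from the text, and the function name is hypothetical.

```python
import math


def permutation_p_standard_error(p, n_perm):
    """Approximate standard error of a permutation P-value estimate,
    assuming each permutation independently exceeds the observed
    statistic with probability p (binomial approximation)."""
    return math.sqrt(p * (1.0 - p) / n_perm)


# Standard error of an adjusted P-value near 0.05.
se_1k = permutation_p_standard_error(0.05, 1_000)      # about 0.0069
se_100k = permutation_p_standard_error(0.05, 100_000)  # about 0.00069
```

Under this approximation, a hundredfold increase in the number of permutations shrinks the standard error tenfold, so additional permutations matter most when an adjusted P-value lies close to the chosen significance threshold.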

Permutation testing is particularly appealing because of its simplicity. Recently, several more complex alternatives to permutation testing have been proposed

Some methods for computing adjusted P-values exploit the fact that for many common statistical tests, the correlated tests have an asymptotic multivariate normal distribution under the null hypothesis of no trait-marker correlation. Seaman and Müller-Myhsok have proposed estimating the asymptotic distribution and sampling directly from it

Approaches that estimate the asymptotic multivariate normal distribution of the test statistics have some limitations. They do not estimate significance levels for two-stage genotyping designs. A more severe restriction is that they are typically limited to several hundred correlated tests. Seaman and Müller-Myhsok and Conneely and Boehnke suggest that the number of samples should be at least 10 times the number of tests performed in order to accurately estimate the asymptotic multivariate normal distribution

Other alternatives to permutation testing are based on importance sampling. Kimmel and Shamir

These importance sampling methods lack some of the features that are found in PRESTO. The methods do not calculate significance for two-stage genotyping designs, and they do not calculate adjusted P-values for general order statistics. In the extension to stratified data, the association test statistic used in Kimmel et al

Methods for computing multiple-testing adjusted P-values that are based on asymptotic multivariate normal distributions or importance sampling are more complex than permutation testing and require the asymptotic approximations to be accurate. In addition, when testing a single binary trait, these alternative methods provide little or no decrease in computational time relative to permutation testing with PRESTO, unless one is performing more than 1000 permutations.

Conclusion

PRESTO is a flexible, platform-independent software package that determines multiple-testing adjusted statistical significance for large-scale genetic association studies by using permutation of the trait status. PRESTO is faster than existing permutation testing software and can analyze a large genome-wide association study (500 K markers, 5 K individuals, 1 K permutations) in approximately one hour of computing time. PRESTO can be used with stratified data from multiple populations and with two-stage genotyping designs. PRESTO can also report empirical null distributions for the top-ranked statistics (i.e. order statistics) so that statistical significance can be determined for any test statistic calculated in terms of order statistics.

Availability and requirements

• **Project name:** PRESTO

• **Project home page:**

• **Operating system(s):** Platform independent

• **Programming language:** Java

• **Other requirements:** standard edition (SE) Java Runtime Environment (JRE) 5.0 (or higher)

• **License:** freely available for academic and commercial use.

Acknowledgements

This work was supported by the U.S. National Institutes of Health grant 3R01GM075091-02S1.

The analysis in Table