Lehrstuhl für Tierzucht, Technische Universität München, Hochfeldweg 1, 85376 Freising-Weihenstephan, Germany

Institut für Populationsgenetik, Veterinärmedizinische Universität Wien, Veterinärplatz 1, 1210 Vienna, Austria

ZuchtData EDV Dienstleistungen Ges.m.b.H., Dresdner Straße 89/19, 1200 Vienna, Austria

Abstract

Background

Hitchhiking mapping and association studies are two popular approaches to map genotypes to phenotypes. In this study we combine both approaches to complement their specific strengths and weaknesses, resulting in a method with higher statistical power and fewer false positive signals. We applied our approach to dairy cattle as they underwent extremely successful selection for milk production traits and since an excellent phenotypic record is available. We performed whole genome association tests with a new mixed model approach to account for stratification, which we validated via Monte Carlo simulations. Selection signatures were inferred with the integrated haplotype score and a locus specific permutation based integrated haplotype score that works with a folded frequency spectrum and provides a formal test of significance to identify selection signatures.

Results

About 1,600 out of 34,851 SNPs showed signatures of selection and the locus specific permutation based integrated haplotype score showed overall good accordance with the whole genome association study. Each approach provides distinct information about the genomic regions that influence complex traits. Combining whole genome association with hitchhiking mapping yielded two significant loci for the trait protein yield. These regions agree well with previous results from other selection signature scans and whole genome association studies in cattle.

Conclusion

We show that the combination of whole genome association and selection signature mapping based on the same SNPs increases the power to detect loci influencing complex traits. The locus specific permutation based integrated haplotype score provides a formal test of significance in selection signature mapping. Importantly it does not rely on knowledge of ancestral and derived allele states.

Background

Linking genotype to phenotype is one of the central questions in biological sciences. Current approaches to map intraspecific variation to causative sequence variation use either a quantitative genetics framework (association mapping) or rely on population genetic theory (hitchhiking mapping).

Population genetic theory predicts that a favorably selected allele is either lost or increases in frequency until fixation

Based on this principle, genome scans were performed in a large number of species such as human, maize, Drosophila, _{ST }statistics. However, disentangling selection from nuisance signals caused by the demographic history of a breed or species based on genome wide polymorphism data remains challenging.

Stringent artificial selection resulted in an enormous improvement of production traits over the last couple of decades, especially for traits with moderate to high heritability. In combination with the availability of high density SNP arrays and high quality phenotypes, this intense selection renders the genome of dairy cattle an optimal model to look for signatures of recent positive selection.

While for genetic model organisms very powerful genomic tools are available, these species frequently lack phenotypic records to link signatures of selection in the genome to actual variation in phenotype unless a huge additional phenotyping effort is undertaken. This is the great advantage of using livestock species, as numerous production- and fitness traits are routinely recorded and used in breeding value estimation.

The estimated breeding value (EBV) expresses the genetic merit of a breeding animal estimated based on their own performance and performances of all available relatives. In the case of dairy bulls this typically includes hundreds to thousands of daughters. Furthermore EBVs are corrected for systematic environmental effects. Therefore the breeding value of an animal is the sum of its genes' additive effects based on Fisher's infinitesimal model

Since Sax's experiments with beans in 1923

Rapid improvements in high throughput SNP genotyping technologies and commercially available high density SNP arrays for livestock species allowed livestock geneticists to turn towards whole genome association (WGA) mapping approaches in the recent past e.g.

Population genetics provides information that is independent of phenotypic information on putative loci under strong directional artificial or natural selection. We show in this paper that combining a population genetics signal with association tests based on quantitative genetics in a composite statistic, increases power and reduces the number of false positive signals for localizing the source of selection.

In a similar vein, _{ST }values the combination with association results is not straightforward. Akey et al.

Our composite statistic combines a long-range haplotype statistic, based on genomic signatures of (new) positive mutations that are not yet fixed in a single population, and the regression coefficient based on allele-count indicator variables of a WGA - as the quantitative genetic approach. Both estimators rely on the underlying linkage disequilibrium (LD) between the causal variant and the genotyped SNP. We further propose a new mixed model approach to account for stratification in population based association studies, and we introduce a modified extended integrated haplotype score test statistic to detect selection. Using computer simulations and real data we show that the combination of both tests increases the power for localizing the target of selection relative to a single test and reduces the number of false positive signals.

Methods

Experimental Design

The highest selection pressure in the overall breeding goal in Brown Swiss cattle over the last decades was put on protein yield, the main trait of interest in this study to ensure high power for both mapping approaches.

The 140 highest and 148 lowest bulls with respect to protein yield EBV and a minimal EBV-accuracy (r^{2}, degree of determination) of 0.9 were chosen out of 973 progeny tested Brown Swiss bulls for selective genotyping

Phenotypes

Sire EBVs were obtained from the genetic evaluation centre LfL Grub, Germany from the August 2008 genetic evaluation for PY. EBVs for protein yield are in kilogram units.

Genotypes

Genomic DNA was prepared from semen straws following standard protocols using proteinase K digestion and phenol-chloroform extraction. Across all samples the concentration was set to 50 ng/μl. Bulls were genotyped according to the manufacturer's instructions with the Illumina BovineSNP50 BeadChip^{®} comprising 54,001 SNPs at the Institute of Human Genetics of Helmholtz Zentrum München, Germany. Genotypes of one individual were omitted due to a call rate of < 90%. The average call rate of the remaining 287 bulls was 98.6% corresponding to approximately 53,230 genotypes obtained per individual. The software PLINK, version 1.03

Detection of Selection Signatures

We wrote R and C++ scripts to calculate extended haplotype homozygosity (EHH) test statistics from phased haplotype data as proposed by

$$\mathrm{EHH}_{i} = \frac{\sum_{j=1}^{s} \binom{e_{ij}}{2}}{\binom{c_{i}}{2}}$$

where c_{i} is the number of samples of a particular core SNP allele i, e_{ij} is the number of samples of a particular extended haplotype j carrying the allele i at the core position, and s is the number of unique extended haplotypes
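As an illustration of the quantities defined above, the following minimal Python sketch computes EHH for one core allele as the share of carrier pairs that are identical over a chosen window, i.e. the sum of "e_ij choose 2" over all s extended haplotypes divided by "c_i choose 2". The function name `ehh`, the window arguments and the toy haplotypes are assumptions made for this example; the authors' own R and C++ scripts are not reproduced here.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core_index, core_allele, left, right):
    """EHH of one core allele over the SNP window [left, right] (inclusive)."""
    # Keep only haplotypes that carry the requested allele at the core SNP.
    carriers = [h for h in haplotypes if h[core_index] == core_allele]
    c_i = len(carriers)
    if c_i < 2:
        return 0.0  # fewer than two carriers: no pairs to compare
    # e_ij: how often each distinct extended haplotype j occurs among carriers.
    counts = Counter(tuple(h[left:right + 1]) for h in carriers)
    return sum(comb(e_ij, 2) for e_ij in counts.values()) / comb(c_i, 2)


# Hypothetical phased haplotypes (rows) over five SNPs; the core SNP is column 2.
haps = ["00100", "00100", "00101", "10100", "01010", "11001"]
ehh_near = ehh(haps, core_index=2, core_allele="1", left=1, right=3)
ehh_far = ehh(haps, core_index=2, core_allele="1", left=0, right=4)
print(ehh_near, ehh_far)
```

In this toy data EHH of allele "1" decays as the window widens, which is the behaviour that the integrated statistics summarise.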

Voight et al. _{ancestral}/iHH_{derived}) in bins of derived allele frequencies from the empirical distribution at SNPs whose derived allele frequency ^{Voight}) follows approximately a standard normal distribution.

Since standardisation is based on the frequency of the derived allele this sets an upper limit to the age of the mutation. This test statistic answers the question of how unusual the length of a haplotype is, assuming the same age of allele across all observed selection coefficients acting on any core SNP with a similar derived allele frequency in the genome. It therefore does not provide a formal test of significance. Furthermore if different outgroups are used to define ancestral and derived states this sets different age boundaries to the mutations resulting in less precise standardisation.

A locus specific permutation-based iHS

When the rate of EHH decay is similar for the ancestral and derived allele, as expected for a neutral locus, uiHS is ~ 0

Voight et al.

In the following we introduce a locus specific permutation based approach that relies on minor and major allele frequencies rather than ancestral and derived states, respectively. Most importantly this test statistic provides significance of deviations of uiHS from its neutral expectation.

The core site in iHS test statistics is used to define two groups of haplotypes for comparison with regard to their block structure. We shuffled core SNP alleles 1,000 times at the core position while retaining the neighbouring haplotype configuration and calculated uiHS for each permuted sample (iHS_{P}) within each core SNP. Randomly shuffling SNP alleles at core sites randomizes the allocation of haplotypes to the two groups for comparison while maintaining the LD structure in the surrounding genomic region. This simulates the null hypothesis of neutrality: the site was not subject to selection. We thereby obtain an empirical distribution of iHS under H0 for each SNP, from which we obtain the probability of seeing such an extreme iHS just by chance. The locus specific standard deviation of the 1,000 iHS_{P} test statistics is then used to scale the observed deviation of uiHS from its expectation of zero. Scaled, permutation based iHS (siHS_{P}) is therefore calculated as

$$\mathrm{siHS}_{P} = \frac{\mathrm{uiHS}}{\mathrm{SD}(\mathrm{iHS}_{P})}$$
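The permutation scheme described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `toy_uihs` is a hypothetical stand-in for the real uiHS (it compares mean per-haplotype tract lengths instead of integrating EHH), and the names `scaled_permutation_stat`, `core` and `tracts` are invented for the example. Only the core alleles are shuffled, so the flanking haplotype information stays attached to its chromosome.

```python
import math
import random
import statistics


def toy_uihs(core_alleles, tract_lengths):
    """Stand-in for uiHS: log ratio of mean tract length, minor (1) over major (0)."""
    minor = [t for a, t in zip(core_alleles, tract_lengths) if a == 1]
    major = [t for a, t in zip(core_alleles, tract_lengths) if a == 0]
    return math.log(statistics.fmean(minor) / statistics.fmean(major))


def scaled_permutation_stat(core_alleles, haplotype_data, stat, n_perm=1000, seed=1):
    """Return (observed statistic, SD of permuted statistics, scaled statistic)."""
    rng = random.Random(seed)
    observed = stat(core_alleles, haplotype_data)
    shuffled = list(core_alleles)
    permuted = []
    for _ in range(n_perm):
        # Shuffle the core alleles only; allele counts and the surrounding
        # haplotype data are left unchanged.
        rng.shuffle(shuffled)
        permuted.append(stat(shuffled, haplotype_data))
    sd = statistics.stdev(permuted)
    # The expectation under the null is taken to be zero, so no centring is done.
    return observed, sd, observed / sd


# Hypothetical data: four minor-allele carriers with long tracts, eight others.
core = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
tracts = [9.0, 8.5, 9.5, 9.0, 2.0, 3.0, 2.5, 3.5, 2.0, 3.0, 2.5, 3.5]
observed, sd, scaled = scaled_permutation_stat(core, tracts, toy_uihs)
print(f"uiHS = {observed:.3f}, SD of permuted = {sd:.3f}, scaled = {scaled:.3f}")
```

Passing the statistic as a function keeps the permutation logic separate, so a full EHH-based uiHS could be plugged in without changing the scaling step.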

Since the empirical mean of permuted iHS statistics is approximately 0 (see Additional file

**Supplementary Figures S1-S5**. The PDF contains Figure S1: Histogram of means of 1000 permuted uIHS test statistics per locus; Figure S2: Histogram of derived allele frequencies for 34,851 SNPs in the study; Figure S3: Histogram of minor allele frequencies for 34,851 SNPs in the study; Figure S4: Histogram of ^{Voight }test statistics; Figure S5: Histogram of

Generally, SNP sites with low minor allele frequencies show a larger SD of iHS_{P}. However, to avoid any additional bias due to a possibly remaining dependence of siHS_{P} on allele frequency, siHS_{P} was fit in a linear model by regressing siHS_{P} on the SNP minor allele frequency (MAF_{i}) at the core site. For each site, the random residual ε_{ij} was obtained and subsequently standardized using the standard deviation SD(ε_{ij}) of residuals across all SNPs. In contrast to ^{Voight}), our frequency correction is not done based on the expectations of SNPs within allele frequency bins but carried out on a continuous scale.

The resulting frequency corrected and scaled test statistic is termed iHS.

This final test statistic is approximately standard normally distributed.
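The continuous frequency correction can be illustrated with a short sketch: the scaled statistic is regressed on minor allele frequency by ordinary least squares and the residuals are divided by their standard deviation. The function name `frequency_corrected` and the six data points are assumptions for illustration only, and the simple intercept-plus-slope model is a guess at the form of the linear model, which the text does not spell out.

```python
import statistics


def frequency_corrected(sihs, maf):
    """Regress sihs on maf (simple OLS) and return the standardised residuals."""
    mean_x = statistics.fmean(maf)
    mean_y = statistics.fmean(sihs)
    sxx = sum((x - mean_x) ** 2 for x in maf)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(maf, sihs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual = observed value minus the value predicted from allele frequency.
    residuals = [y - (intercept + slope * x) for x, y in zip(maf, sihs)]
    sd = statistics.stdev(residuals)
    return [r / sd for r in residuals]


# Hypothetical per-SNP values.
maf = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50]
sihs = [2.4, -1.9, 1.1, -0.7, 0.9, -0.8]
corrected = frequency_corrected(sihs, maf)
print([round(v, 3) for v in corrected])
```

By construction the corrected values have mean zero and unit standard deviation, which is what allows them to be compared against standard normal quantiles.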

Since no high resolution genetic map was available for the SNPs in this study, physical distances between SNPs were used for calculating all integrated haplotype scores.

Whole Genome Association Study

Standard statistical tests, e.g. regressing phenotype on allele count in a linear model, are inappropriate for population based WGA in structured populations because they either result in an inflated proportion of spurious marker - phenotype associations or mask true associations (e.g.


Recently, linear mixed models were proposed to effectively account for different levels of relatedness by incorporating pairwise genetic relatedness into the model

We therefore employed the following single locus mixed model, which we term "MIX", that explicitly models the polygenic relationships among individuals:

$$y = Xb + Za + e$$

where y is a vector of sire EBVs for protein yield, X is the design matrix in which SNP genotypes were coded 0, 1 and 2, counting the number of minor alleles, and b is the vector of regression coefficients on recoded SNP genotypes. Z denotes the design matrix for random effects with a ~ N(0, **G**σ_{a}^{2}) being the vector of polygenic effects, σ_{a}^{2} the additive genetic variance, **G** the genetic covariance matrix and e ~ N(0, **I**σ_{e}^{2}) a vector of residual effects. **G** was obtained from pairwise identical by descent (IBD) estimates using genome wide SNP data as implemented in PLINK **G **were calculated as 1+F, with

Mixed models were solved in R (^{6 }permutations if the SNP indicated association. The empirical

All 34,851 SNPs were tested one after the other for association with the protein yield (PY) phenotype.

As model MIX did not overcome the stratification present in our highly structured sample we applied a two stage approach. Besides accounting for the relationship via a mixed model, stratification was accounted for by pre-correcting SNP genotype codes for sire and maternal grandsire (MGS) differences using the following regression model

where gt_{ijk} is the recoded genotype code 0, 1 and 2, counting the number of minor alleles, sire is the fixed effect of sire _{ijk}~ N (0, Iσ_{e}^{2}), the vector of random residual effects. Sire- and maternal grandsire families smaller than five were merged into one group.

Residuals ε_{ijik }

Evaluation of WGA via Monte Carlo Simulation

The proposed method to account for stratification is specific to situations typically observed in intensively selected livestock species and populations. We evaluated the effectiveness of MIXStrat by Monte Carlo simulations. Phenotypes, sire- and maternal grandsire family structure were taken from the population under consideration. Genotypes for 287 bulls and 10,000 diallelic sites were sampled based on the following procedure:

First, the allele frequency

Rate of False Positives

The average rate of false positive detections across m = 100 random repetitions was calculated as

Power Analysis

For true associations the mean genotype values within bulls of sire and maternal grandsire are correlated with phenotypic family means. This information is not utilized when genotypes are recoded and will thus reduce power. We evaluated the power of MIXStrat relative to the power of MIX under the alternative model. This was achieved by simulating an additive QTL effect which explained 1, 5 and 10% of the EBV variance:

where α_{QTL} is the allele substitution effect, σ^{2}_{EBV} the variance of EBV, QTL_{SIZE} is the size of the effect as proportion of σ^{2}_{EBV} and

with α_{Bonf }being the 5% Bonferroni – corrected type I error threshold of 2.5 × 10^{-5 }and m being the number of random Monte Carlo repetitions.
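The two quantities used in the simulations can be sketched as follows. The formula for the allele substitution effect is an assumption: it uses the standard result that an additive biallelic locus with allele frequency p contributes a variance of 2p(1-p)α², and solves for α so that this equals QTL_SIZE times the EBV variance. The function names and input values are hypothetical. The same counting function serves for the empirical false positive rate (under the null model) and for power (under the alternative).

```python
import math


def allele_substitution_effect(qtl_size, var_ebv, p):
    """Effect alpha such that 2p(1-p) * alpha^2 equals qtl_size * var_ebv."""
    return math.sqrt(qtl_size * var_ebv / (2.0 * p * (1.0 - p)))


def detection_rate(p_values, alpha):
    """Share of Monte Carlo repetitions whose P-value falls below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


# Hypothetical inputs: a QTL explaining 10% of an EBV variance of 100 kg^2.
alpha_qtl = allele_substitution_effect(qtl_size=0.10, var_ebv=100.0, p=0.3)
explained = 2.0 * 0.3 * 0.7 * alpha_qtl ** 2
rate = detection_rate([1e-6, 0.2, 3e-5, 1e-7], alpha=2.5e-5)
print(round(alpha_qtl, 3), round(explained, 3), rate)
```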

Composite Test Combined Significance Test and False Discovery Rate

We used Stouffer's method _{COMB}

The test statistic was calculated as

where Z is the standard normal variable under H_{0}, z(P_{i}) is the _{COMB }were obtained using the quantile function of the standard normal distribution. The tail area based false discovery rate (FDR) was calculated from _{COMB }values using the R package fdrtool, v1.2.5
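For readers who want to see the combination step concretely, here is a minimal sketch of unweighted Stouffer's method using only the Python standard library. It assumes one-sided P-values strictly between 0 and 1 and equal weights for both tests; whether the original analysis used weights or a different tail convention is not stated in the text, and the function name `stouffer_combined_p` is invented for the example.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def stouffer_combined_p(p_values):
    """Combine one-sided P-values (each in (0, 1)) with unweighted Stouffer's Z."""
    # z(P_i): standard normal quantile matching each upper-tail P-value.
    z_scores = [_STD_NORMAL.inv_cdf(1.0 - p) for p in p_values]
    z_comb = sum(z_scores) / sqrt(len(z_scores))
    # Upper-tail probability of the combined Z under the null.
    return z_comb, 1.0 - _STD_NORMAL.cdf(z_comb)


# Hypothetical P-values of one SNP from the selection scan and the association test.
z_comb, p_comb = stouffer_combined_p([0.05, 0.05])
print(round(z_comb, 3), round(p_comb, 4))
```

Two moderately small P-values that agree yield a combined P-value smaller than either one, which is the mechanism behind the gain in power of the composite test.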

Results

Evaluating the locus specific permutation of the iHS test statistic to detect signatures of selection and comparison to iHS^{Voight}

We mapped selection signatures with iHS^{Voight }and our newly proposed iHS to detect sites under selection.

Table ^{Voight }test statistic. Differences in expectations among derived allele frequency bins (Table ^{Voight}.

Means and standard deviations (SD) in defined frequency bins for uncorrected integrated haplotype score (uiHS) test statistics to calculate iHS^{Voight}.

| Frequency of derived allele | Mean | SD |
| --- | --- | --- |
| <= 0.1 | -1.04 | 1.04 |
| 0.1 - 0.2 | -0.93 | 0.94 |
| 0.2 - 0.3 | -0.73 | 0.92 |
| 0.3 - 0.4 | -0.48 | 0.92 |
| 0.4 - 0.5 | -0.26 | 0.93 |
| 0.5 - 0.6 | -0.06 | 0.92 |
| 0.6 - 0.7 | 0.16 | 0.94 |
| 0.7 - 0.8 | 0.39 | 0.95 |
| 0.8 - 0.9 | 0.65 | 0.99 |
| > 0.9 | 0.75 | 1.06 |

Figure _{P}) is nearly constant at ~0.18 for SNPs with a MAF > 15% but increases more than twofold for SNPs with lower MAFs. A similar trend can be seen for iHS^{Voight}, where the SD in the <= 0.1 and > 0.9 bins is higher than in the remaining derived allele frequency bins.

**Plot of standard deviations of permuted integrated haplotype scores (iHS) (1,000 permutations) versus minor allele frequency (MAF)**.

For SNPs with low minor allele frequencies we found a relatively higher proportion of extreme unscaled iHS statistics. We postulate that this is due to increased rates of false positives, since power simulations by ^{Voight }is powerful for loci with intermediate allele frequencies and that the power of the test drops substantially when the selective sweep is close to fixation, in other words for SNPs with low MAF.

Figure ^{Voight }test statistic. For SNPs with MAF ≥ 0.20 we see that our iHS test yields a higher proportion of significant loci when compared to traditional iHS^{Voight}.

**Proportion of SNPs with significant selection signals, relative to all SNPs with a minor allele frequency (MAF) below the value given along the x-axis**. Solid line: permutation based iHS; dashed line: iHS^{Voight}. Circles, squares and diamonds mark SNPs with P-values for the corresponding test statistic below 0.001, 0.005 and 0.01, respectively.

Figures ^{Voight }and iHS, respectively. Figure ^{Voight}.

**Histogram of iHS ^{Voight }test statistics from selection signature analysis for 34,851 SNPs**.

**Histogram of iHS test statistics from selection signature analysis for 34,851 SNPs**.

**Quantile - quantile plot of P-values from selection signature analysis for 34,851 SNPs using our modified iHS (X) and the iHS^{Voight} (●) test statistics, respectively**.

Figures ^{Voight}.

Our permutation based standardization allows a formal test against the null hypothesis of neutrality at a core SNP (expectation zero). Our standardization is against 1000 permuted test statistics at the same locus in the same LD background. We therefore do not need to define the state of ancestral and derived allele.

Additional file ^{Voight }and iHS, respectively.

Detection of Selection Signatures in the Brown Swiss dairy cattle population

Manhattan plots for iHS^{Voight }and iHS for each autosome except BTA 6 are shown in Additional file

**Supplementary Figures S6-S33**. The PDF shows Manhattan plots of bovine autosomes 1-5 and 7-29. Capital letters denote QTLs reported from whole genome association studies (WGA) in cattle QTLdb at animalgenome.org, summarized as QTL trait ontology classes: B: meat traits, E: exterior traits, H: health traits, M: milk traits, P: production traits, R: reproduction traits; o annotates a top 5% iHS^{Voight} test statistic as reported by [Qanbari et al. (2011)] in windows of 500 kb in Brown Swiss, x in any of the other breeds investigated. Plot A: iHS^{Voight} test statistics, blue line: threshold identifying the top 5%; B: iHS test statistics, blue line: threshold identifying the top 5%; C: combined iHS^{Voight} and WGA results with model MIXstrat; D: combined iHS and WGA results with model MIXstrat; in C and D the blue line marks the 10% false discovery rate threshold.

Among the 34,851 SNPs tested genome wide, 1,710 and 1,621 SNPs had an absolute test statistic > 1.96 with iHS^{Voight} and iHS, respectively.

The distribution among chromosomes is remarkably uneven: with iHS, BTA 5, 6, 12 and 19 harbor 148, 124, 98 and 89 significant sites, respectively, which corresponds to 8 - 11% of all investigated SNPs on these chromosomes. On other chromosomes, namely BTA 28 and 17, only ~1% of investigated SNPs exhibit significant selection signatures.

The same is true for iHS^{Voight}: BTA 5, 6, 12, 16 and 19 have 171, 131, 148, 136 and 112 SNPs with an absolute iHS^{Voight} test statistic > 1.96, which corresponds to 8 - 14% of all SNPs on these chromosomes. BTA 7, 25 and 27 have only ~1% of sites with extreme iHS^{Voight} test statistics.

One particularly illustrative example is given by SNP Hapmap52798-ss46526455 located in the proximal region of BTA 14 at 0.565311 Mb (see Figures ^{®}. Interestingly, our iHS provided a strong and convincing signal of selection, while the iHS^{Voight }(0.81) provides considerably weaker support. Hence, this might illustrate the increased power of our modified iHS as compared to the iHS^{Voight}.

**iHS (X) and iHS ^{Voight }(●) on proximal end (0 to 2.5 Mb) of BTA 14**. The vertical line marks the position of DGAT1 K232A locus. The x-axis displays the physical position in megabases.

**Haplotype bifurcation plot of Hapmap52798-ss46526455**. The top figure shows the sweeping allele "G" while the bottom figure shows allele "A". This figure shows the breakdown of LD from the core SNP with increasing distance in both directions. The core SNP represents the root of the diagram. Each SNP represents a node and is an opportunity for further branching. If both alleles of a SNP are present on a haplotype the line branches. The thickness of the lines corresponds to the number of samples carrying the haplotype. The length of a branch corresponds to the distance between SNPs.

**Plot of EHH statistics of minor allele "G" (solid line) and major allele "A" (dotted line) of Hapmap52798-ss46526455 on proximal end of BTA 14**. The x-axis displays the physical position in megabases.

However, there is growing evidence for additional polymorphisms in the

Association Study on PY

We used 34,851 SNPs that met our stringent quality criteria and also had the ancestral allele reported in the literature for association testing. Population stratification was accounted for by including IBD estimates from the genotype data (method MIX). A quantile - quantile plot indicated that this procedure did not sufficiently account for population stratification in our dataset (inflation factor λ = 1.34) (Figure

**Quantile - quantile plot from association study on protein yield using model MIX (✘) and MIXStrat (Ο), respectively**.
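The inflation factor λ that accompanies such quantile - quantile plots is commonly computed as the median of the observed 1-df chi-square statistics divided by the median expected under the null. The sketch below follows that common genomic-control definition; it is an assumption that the same definition was used in this study, and the function name `inflation_factor` and the evenly spaced P-values are purely illustrative.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()


def inflation_factor(p_values):
    """Median observed 1-df chi-square divided by its expected median under H0."""
    # A two-sided P-value maps to a 1-df chi-square as the squared normal quantile.
    chi2 = [_STD_NORMAL.inv_cdf(1.0 - p / 2.0) ** 2 for p in p_values]
    # Median of a 1-df chi-square, obtained from the 75% normal quantile.
    expected_median = _STD_NORMAL.inv_cdf(0.75) ** 2
    return median(chi2) / expected_median


# Evenly spaced P-values mimic a perfectly calibrated test, so lambda is about 1.
n = 999
uniform_p = [(k - 0.5) / n for k in range(1, n + 1)]
lam = inflation_factor(uniform_p)
print(round(lam, 3))
```

Values clearly above 1, such as the 1.34 reported for method MIX, indicate that test statistics are systematically larger than expected under the null.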

We therefore developed a new strategy to reduce the number of erroneous association signals in our data (method MIXStrat). Both the quantile - quantile plot (Figure

Evaluation of WGA via Monte Carlo Simulation

Computer simulations showed that using MIXStrat the sample size in this study is sufficient to only detect strong effects explaining at least 10% of the phenotypic variation. The Monte Carlo simulation did not account for LD because conservative significance thresholds using Bonferroni correction were used. Nevertheless, it assesses the influence of population substructuring in single SNP regression whole genome association studies. Our simulations show clearly that the sire-, paternal grandsire- and maternal grandsire structure in dairy cattle populations alone can create significant results without any association between genotype and phenotype.

Additional file

**Supplementary Figures S34-S36**. The PDF shows Figure S34: Histogram of allele substitution effects from whole genome association study employing model MIXstrat in kilogram protein yield; Figure S35: Histogram of Stouffer's ^{Voight }test statistics; Figure S36: Histogram of Stouffer's

Rate of False Positives

Empirical type I error rate

**Quantile - quantile plot for protein yield under the null distribution from method MIX (✘) and MIXStrat (Ο)**.

Power Analysis

As expected, MIXStrat reduced power under the model of an existing QTL (Table

Results from power calculations of the Monte Carlo simulation; the underlying models of MIX and MIXStrat are described in the "Methods section" of the paper.

| QTL size (share of EBV variance) | Power of MIX | Power of MIXStrat |
| --- | --- | --- |
| 1% | 0.026 | 0.005 |
| 5% | 0.347 | 0.234 |
| 10% | 0.772 | 0.727 |

As further shown in Table

Consensus of Selection Signature Signals and Association Signals

A positive iHS value indicates that the minor SNP allele, relative to the major allele, is associated with the larger integrated EHH statistic and was possibly selected for. Likewise the estimated regression coefficient in the association analysis (β_{MIXStrat}) represents the estimated increase in trait value per additional copy of the minor allele. Thus alike signs of iHS test statistics and β_{MIXStrat }indicate that the SNP is causative by itself or is in LD with a causative site that is under positive selection. Opposite signs of iHS and β_{MIXStrat }may be observed when sites have pleiotropic effects and were selected on a different, possibly unobserved, trait. Generally one would expect to see a higher proportion of like signs as compared to opposite signs and a positive correlation coefficient for traits of major economic importance in the selection history of a breed.

Table _{MIXStrat }for PY based on all 34,851 sites, as a quantitative evaluation of accordance. As expected, the overall correlation among all sites was low. The correlation between allele substitution effects and iHS among sites identified to be under selection, however, was substantial at 0.466 among the top 1% of sites and even higher (0.559) among the top 0.1% of sites. iHS^{Voight}, in contrast, was uncorrelated among the top 1% of sites and showed a lower correlation of 0.228 among the top 0.1% of sites as compared to iHS. This further supports our notion that iHS is an improved haplotype based test statistic for identifying important loci.

Pearson correlation coefficients (95% confidence intervals) of different iHS statistics with regression coefficients from association study for protein yield.

| Method | MAF < 10%: all (N = 4,387) | MAF < 10%: top 1% \|iHS\| (N = 42) | All SNPs: all (N = 34,851) | All SNPs: top 1% \|iHS\| (N = 349) | All SNPs: top 0.1% \|iHS\| (N = 35) |
| --- | --- | --- | --- | --- | --- |
| iHS | 0.045 (0.016 to 0.074) | 0.197 (-0.105 to 0.467) | 0.091 (0.080 to 0.101) | 0.466 (0.380 to 0.544) | 0.559 (0.277 to 0.751) |
| iHS^{Voight} | 0.005 (-0.025 to 0.034) | 0.21 (-0.092 to 0.31) | -0.005 (-0.0158 to 0.005) | 0.002 (-0.107 to 0.102) | 0.228 (-0.114 to 0.521) |

Combining Signatures of Selection with Association Tests

Selection signature and association test statistics were only weakly correlated (Pearson correlation coefficient of 0.091 for iHS and -0.005 for iHS^{Voight}) across all 34,851 SNPs, as the majority of SNPs are not in LD with a causative locus and therefore not under selection. This justifies treating the two sets of results as independent and using Stouffer's method to obtain _{COMB}

**Quantile - quantile plot from association study on protein yield using model MIXstrat ( Ο) and combined test of selection signature iHS test statistics and whole genome associations with model MIXstrat (✘)**.

Additional file ^{Voight }(plot C) and MIXstrat with iHS (plot D). All Manhattan plots are annotated with selection signature signals among the top 5% found by ^{Voight }in windows of 500 kb in BS cattle (symbol o) and in any of the other breeds investigated (symbol x). All plots are further annotated with QTL results reported from whole genome association studies in the cattle QTL database "Cattle QTLdb"

Only QTL annotated from WGA studies were considered, because of the large confidence intervals of QTL positions from linkage studies.

Additional file ^{Voight }while Additional file

Plots A and B in Figure ^{Voight }and iHS, respectively. We see a nice agreement for both test statistics with the selection signatures reported by

**Manhattan plots of chromosome 6**. Legend Figure 12: Capital letters denote QTLs reported from whole genome association studies (WGA) at ^{Voight }test statistic as reported by ^{Voight }test statistics, blue line: threshold identifying the top 5%; B: iHS test statistics, blue line: threshold identifying the top 5%; C: combined iHS^{Voight} and WGA results with model MIXstrat; D: combined iHS and WGA results with model MIXstrat; in C and D the blue line marks the 10% false discovery rate threshold.

Hayes et al. ^{Voight }and WGA results does not give as good an agreement as the combined iHS and WGA test. This is supported by the lower correlation between the top 1% iHS^{Voight} test statistics and regression coefficients from WGA (Table

Discussion

Mixed model and method to control for stratification

The pairwise IBD matrix obtained by PLINK

The "Q+K" method, proposed by

Applying method MIX instead of a least squares allelic regression substantially reduced the inflation factor λ from 2.02 to 1.34 for PY. When we extended method MIX by Q, the matrix on population substructure based on clusters, estimated using the "pairwise population concordance" criterion

The Monte Carlo simulation confirms that the proposed MIXStrat approach deals correctly with all stratification in the data, as under the simulated H0 the observed -log

Detection of Selective Sweeps

Alleles under positive selection increase in frequency in a population and leave distinct signatures in the DNA sequence. One of these population-genetics based signatures is the increased length of the haplotype carrying the advantageous allele

The challenge is to determine whether a signature is due to selection or to confounding effects of population demographic history, such as bottlenecks, population expansions and population subdivision or simply due to drift in a finite population. Two striking bottlenecks were estimated by

We mapped selection signatures with iHS^{Voight}. Large negative values indicate regions in which newly derived alleles are increasing in frequency in the population. Large positive iHS^{Voight} test statistics point to so-called soft sweeps, i.e. sweeps from standing genetic variation in which the ancestral allele is increasing in frequency. As changes in the selection regime of dairy cattle are well documented and make sweeps from standing genetic variation likely, we believe that it is important to consider both extreme positive and extreme negative iHS test statistics as potentially interesting regions in the cattle genome. We developed a permutation-based extension to the iHS statistic proposed by ^{Voight}) _{MIXStrat }indicates that this is a consequence of a decreased rate of false positive detections rather than reduced power. Despite successful selection signature scans in cattle, we note that protein yield is a typical quantitative trait for which selection is essentially multigenic and therefore likely to undergo simultaneous selective sweeps. Chevin and Hospital

Method to combine Selection Signatures with Association signals

We propose a novel approach to increase the power to detect association signals. In this study the statistical power to detect an association signal was quite limited, but by combining two independent sources of information for QTL detection in genome wide studies, namely association and signatures of selection, we were able to increase power and to reduce the false positive rate. Loci that explain variation in economically important traits are likely under selection and will often show incomplete selective sweeps. Thus there is a good chance to observe extreme iHS values among loci that show association. This is supported by the positive correlation of 0.466 between β_{MIXStrat} and iHS for loci among the top 1% iHS test statistics. Although many of the associations identified by our method are not yet confirmed, the concordance with prior results from WGA studies indicates that we were successful in detecting interesting loci. Fine mapping of QTL involves genotyping of many more SNPs in the associated region possibly supported by resequencing a subset of extreme individuals

Our combined approach has the highest power at intermediate allele frequencies, as both independent sources of information (selection signature mapping and WGA) have their highest power at intermediate allele frequencies. Alleles that are prevented from going to fixation are likely either to be under balancing selection (heterozygote advantage) or to have pleiotropic effects, with both positive and negative effects on the traits under selection. Such loci are not expected to show a signature of recent positive selection. WGA, given the same size of effect, will have equal power to identify such loci and loci under positive selection.
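For the association part, the dependence on allele frequency can be made concrete with a standard quantitative genetics result: under an additive model and Hardy-Weinberg proportions, a biallelic locus with allele frequency p and allele substitution effect a contributes a variance of 2p(1-p)a². The sketch below is an illustration under these assumptions, not a power calculation for this study, and the function name is hypothetical.

```python
def additive_variance(p, a=1.0):
    """Variance contributed by a biallelic locus under an additive model.

    p : allele frequency, between 0 and 1
    a : allele substitution effect (in trait units)
    Assumes Hardy-Weinberg proportions; returns 2 * p * (1 - p) * a**2.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return 2.0 * p * (1.0 - p) * a ** 2

# Evaluate the variance on a grid of allele frequencies.
grid = [i / 100 for i in range(101)]
best_p = max(grid, key=additive_variance)
```

For a fixed effect size the contributed variance, and with it the power of an association test, is largest at p = 0.5 and falls to zero as the allele approaches loss or fixation.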

Conclusion

The combination of WGA with hitchhiking mapping to identify a bona fide set of SNPs for candidate gene studies is very promising. We argue that our method improves the power of QTL detection and reduces the type I error rate by combining two independent sources of information. Our approach can be extended to all routinely recorded phenotypes, but as a proof of principle we restricted our analyses to PY, because this trait has been under the most stringent selection over the last few decades and the bulls were selectively genotyped for PY to increase power for the whole genome association study.

Stratification is a substantial problem in WGA studies, particularly when they are carried out in livestock populations. Our MIXStrat approach controls the type I error rate, albeit at the cost of reduced power.

We carried out a whole genome hitchhiking mapping study and identified roughly 1,600 SNPs displaying selection signatures that show generally good accordance with effects estimated in the WGA study. Our extension to the iHS test statistic proposed by

Given the substantial increase in power and the reduction in false positive signals, we recommend using our combined strategy rather than stand-alone WGA. This is especially important in small populations where it is not possible to genotype additional animals.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

MD, HS, CS and RF wrote the manuscript; KF, CW and FS produced genotype data; HS and MD carried out the statistical analysis; HS, MD, CS and RF designed the study. All authors read and approved the final manuscript.

Acknowledgements

These studies were internally funded by the Technische Universität München. We thank the following artificial insemination stations for providing us with semen samples: Besamungstation J. Bauer GmbH & Co. KG, Besamungsverein Neustadt a. d. Aisch e. V., Meggle Besamungsstation Rottmoos GmbH, Niederbayerische Besamungsgenossenschaft Landshut-Pocking e. G., Prüf- und Besamungsstation München-Grub e. V., Rinderbesamungsgenossenschaft Memmingen e. G., Besamungsstation Birkenberg, Besamungsanstalt Gleisdorf, Oberösterreichische Besamungsstation, NÖ-Genetik Wieselburg. We thank T. Meitinger and P. Lichtner from the Institute of Human Genetics at the Helmholtz Zentrum München for generating and validating genotypes. MD was supported by the Austrian Science Fund (FWF): project number L403-B11 to CS. We thank three anonymous reviewers for helpful comments and criticisms on earlier versions of this manuscript. We thank R. Emmerling from the Bavarian State Research Center for Agriculture (LfL) in Poing-Grub, Germany, for the provision of EBVs.