Department of Ecology and Evolutionary Biology, University of Tennessee, Knoxville, TN 37996, USA

National Institute for Mathematical and Biological Synthesis, Knoxville, TN 37996, USA

Abstract

Background

Hybridization, genetic mixture of distinct populations, gives rise to myriad recombinant genotypes. Characterizing the genomic composition of hybrids is critical for studies of hybrid zone dynamics, inheritance of traits, and consequences of hybridization for evolution and conservation. Hybrid genomes are often summarized either by an estimate of the proportion of alleles coming from each ancestral population or classification into discrete categories like F1, F2, backcross, or merely “hybrid” vs. “pure”. In most cases, it is not realistic to classify individuals into the restricted set of classes produced in the first two generations of admixture. However, the continuous ancestry index misses an important dimension of the genotype. Joint consideration of ancestry together with interclass heterozygosity (proportion of loci with alleles from both ancestral populations) captures all of the information in the discrete classification without the unrealistic assumption that only two generations of admixture have transpired.

Methods

I describe a maximum likelihood method for joint estimation of ancestry and interclass heterozygosity. I present two worked examples illustrating the value of the approach for describing variation among hybrid populations and evaluating the validity of the assumption underlying discrete classification.

Results

Naively classifying natural hybrids into the standard six line cross categories can be misleading, and false classification can be a serious problem for datasets with few molecular markers. My analysis underscores previous work showing that many (50 or more) ancestry informative markers are needed to avoid erroneous classification.

Conclusion

Although classification of hybrids might often be misleading, valuable inferences can be obtained by focusing directly on distributions of ancestry and heterozygosity. Estimating and visualizing the joint distribution of ancestry and interclass heterozygosity is an effective way to compare the genetic structure of hybrid populations and these estimates can be used in classic quantitative genetic methods for assessing additive, dominant, and epistatic genetic effects on hybrid phenotypes and fitness. The methods are implemented in a freely available package “HIest” for the R statistical software (

Background

Research on hybrids and hybrid zones offers unique insights into several aspects of evolutionary and ecological genetics

When describing a possible hybrid population, investigators often wish to summarize each individual’s multilocus genotype in a simple and informative way. This usually takes the form of either a hybrid index indicating the proportion of an individual’s ancestors belonging to each “parental” lineage

Although no summary method is likely to satisfy all needs, the situation can be greatly improved by adding a single calculation so that hybrid genotypes are characterized by estimates of both ancestry (S) and interclass heterozygosity (H_{I}, the axis that distinguishes F1, F2, and recombinant inbred lines). In fact, joint estimates of ancestry and interclass heterozygosity include all of the information in the typical six-type classification because each class has a unique pair of expected values (Table
S and H_{I}, not because the classification itself contains any other information

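Each of the genotypic classes generated in the first two generations of admixture has a unique pair of ancestry (S) and interclass heterozygosity (H_{I}) values.

| **Class** | **S** | **H_{I}** | **p_{11}** | **p_{12}** | **p_{22}** |
|---|---|---|---|---|---|
| P1 | 0 | 0 | 1 | 0 | 0 |
| P2 | 1 | 0 | 0 | 0 | 1 |
| F1 | 1/2 | 1 | 0 | 1 | 0 |
| F2 | 1/2 | 1/2 | 1/4 | 1/2 | 1/4 |
| B1 | 1/4 | 1/2 | 1/2 | 1/2 | 0 |
| B2 | 3/4 | 1/2 | 0 | 1/2 | 1/2 |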

Below, I present simple maximum likelihood methods for estimating ancestry and heterozygosity from molecular marker data and explicitly testing the assumption that a discrete classification adequately describes an individual or dataset. I use empirical data and simulations to illustrate these two dimensions of hybridity and assess the reliability of inferences about discrete vs. continuous distributions of hybrid genotypes.

Methods

Ancestry and interclass heterozygosity for codominant markers

Buerkle
H_{I} (the interclass heterozygosity index) for individual hybrid genotypes given parental allele frequencies. It is useful to express genotypic probabilities using Turelli and Orr’s
p_{11} = proportion of loci with both alleles derived from parental species 1, p_{22} = proportion of loci with both alleles derived from parental species 2, and p_{12} = proportion with one allele from each species. The system is completely specified by two parameters (because p_{11} + p_{12} + p_{22} = 1), and perfectly represents ancestry and interclass heterozygosity because H_{I} = p_{12}, and
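
The mapping from genomic proportions to ancestry and interclass heterozygosity can be checked against the table of expected values. The following Python sketch is illustrative only and is not part of the HIest package. It assumes that S is oriented as the proportion of alleles from parental species 2, because that orientation reproduces the S column of the table (P1 has S = 0, P2 has S = 1); the function name is hypothetical.

```python
from fractions import Fraction as F

# Expected genomic proportions (p11, p12, p22) for the six standard classes,
# copied from the table of expected values.
CLASSES = {
    "P1": (F(1), F(0), F(0)),
    "P2": (F(0), F(0), F(1)),
    "F1": (F(0), F(1), F(0)),
    "F2": (F(1, 4), F(1, 2), F(1, 4)),
    "B1": (F(1, 2), F(1, 2), F(0)),
    "B2": (F(0), F(1, 2), F(1, 2)),
}


def ancestry_and_heterozygosity(p11, p12, p22):
    """Map genomic proportions to (S, H_I).

    Assumption: S is the proportion of alleles from parental species 2,
    which matches the S column of the table (P1 -> 0, P2 -> 1).
    """
    if p11 + p12 + p22 != 1:
        raise ValueError("genomic proportions must sum to 1")
    s = p22 + p12 / 2   # ancestry index
    h_i = p12           # interclass heterozygosity
    return s, h_i


for name, props in CLASSES.items():
    s, h = ancestry_and_heterozygosity(*props)
    print(f"{name}: S = {s}, H_I = {h}")
```

Exact fractions are used so that the sum-to-one check is not affected by floating point rounding. Each class maps to a distinct (S, H_I) pair, which is why the two continuous indices carry all of the information in the six-class scheme.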

The probability of a hybrid being homozygous for allele i at locus j depends on the frequency of that allele in population 1 (f_{ij1}) and population 2 (f_{ij2}), and Turelli and Orr’s

And the probability of being heterozygous for alleles

These probabilities can be generalized to consider any number

And

These expressions assume alleles were drawn at random from within each parental gene pool when the initial admixture was formed, but do not assume Hardy-Weinberg equilibrium within a hybrid population. Equivalent probability statements were used by Pritchard et al. in developing the Bayesian methods implemented in the program STRUCTURE

The log-likelihood of a set of genomic proportions for a given hybrid genotype with

Maximizing this function provides estimates of
f_{ij1} = 1 and f_{ij2} = 0), the joint MLE has closed form
p_{11} is the observed fraction of markers homozygous for species 1 alleles, and p_{12} is the observed fraction of markers heterozygous for species 1 and species 2 alleles.
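
As a rough illustration of the likelihood described in this section, the Python sketch below computes genotype probabilities for diallelic codominant markers, sums them into a log-likelihood, and compares a coarse grid search with the closed-form estimate for diagnostic markers. It is a sketch under stated assumptions, not the HIest implementation: the genotype coding (copies of a reference allele), the symbol names f1 and f2 for parental allele frequencies, and all function names are assumptions made for the example. The probabilities follow from drawing alleles at random within each parental gene pool.

```python
import math


def genotype_prob(g, f1, f2, p11, p12, p22):
    """Probability of genotype g (0, 1 or 2 copies of the reference allele)
    at a diallelic codominant locus. f1 and f2 are the reference allele
    frequencies in parental populations 1 and 2 (assumed known)."""
    q1, q2 = 1 - f1, 1 - f2
    if g == 2:
        return p11 * f1 * f1 + p12 * f1 * f2 + p22 * f2 * f2
    if g == 0:
        return p11 * q1 * q1 + p12 * q1 * q2 + p22 * q2 * q2
    return 2 * p11 * f1 * q1 + p12 * (f1 * q2 + f2 * q1) + 2 * p22 * f2 * q2


def log_likelihood(genotypes, freqs, p11, p12, p22):
    """Sum of log genotype probabilities over loci (loci treated as independent)."""
    total = 0.0
    for g, (f1, f2) in zip(genotypes, freqs):
        pr = genotype_prob(g, f1, f2, p11, p12, p22)
        if pr <= 0:
            return -math.inf
        total += math.log(pr)
    return total


def closed_form_diagnostic(genotypes):
    """Observed genotype fractions; valid only when every marker is
    diagnostic (f1 = 1, f2 = 0)."""
    n = len(genotypes)
    p11 = sum(g == 2 for g in genotypes) / n
    p12 = sum(g == 1 for g in genotypes) / n
    return p11, p12, 1 - p11 - p12


def grid_mle(genotypes, freqs, steps=20):
    """Coarse grid search over the simplex p11 + p12 + p22 = 1."""
    best = None
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            k = steps - i - j
            p = (i / steps, j / steps, k / steps)
            ll = log_likelihood(genotypes, freqs, *p)
            if best is None or ll > best[0]:
                best = (ll, p)
    return best[1]


# Twenty diagnostic markers: 5 homozygous species 1, 10 heterozygous, 5 homozygous species 2.
genotypes = [2] * 5 + [1] * 10 + [0] * 5
freqs = [(1.0, 0.0)] * len(genotypes)
print(closed_form_diagnostic(genotypes), grid_mle(genotypes, freqs))
```

A grid is used here only because it is easy to verify; a general purpose optimizer would be the natural choice for non-diagnostic markers.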

Dominant Markers

The method can be extended to dominant markers (e.g., AFLP). Assume allele

Implementation

For finding maximum likelihood estimates using equations 5 or 6, I used the general purpose optimization function
(p_{11}″, p_{12}″, p_{22}″) from a three-dimensional Dirichlet distribution centered on the old genomic proportions and with concentration parameter
(p_{11}, p_{12}, p_{22}). Larger
S and H_{I} on a grid over the sample space and starting the MCMC at the grid point with highest likelihood. For present purposes, I ran the MCMC for 1000 steps (with
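
The search strategy described above, proposing new genomic proportions from a Dirichlet distribution centered on the current values, can be sketched in Python as follows. This is a simplified illustration and not the HIest code. Several details are assumptions: the acceptance rule (only uphill moves are kept here), the small floor added to the Dirichlet parameters so that they stay positive, the default concentration value, and the use of a diagnostic-marker likelihood to keep the example short.

```python
import math
import random


def diagnostic_loglik(p, counts):
    """Log-likelihood of proportions p = (p11, p12, p22) given counts of
    diagnostic markers observed in each genotype category."""
    total = 0.0
    for prop, n in zip(p, counts):
        if n > 0:
            if prop <= 0:
                return -math.inf
            total += n * math.log(prop)
    return total


def dirichlet_proposal(p, k, rng, floor=1e-3):
    """Draw from a Dirichlet centered on p with concentration k, using
    normalized gamma variates. The floor is an assumption of this sketch."""
    draws = [rng.gammavariate(k * x + floor, 1.0) for x in p]
    total = sum(draws)
    return tuple(d / total for d in draws)


def stochastic_search(counts, start, k=100.0, steps=1000, seed=1):
    """Keep a proposal only when it improves the log-likelihood."""
    rng = random.Random(seed)
    current, current_ll = start, diagnostic_loglik(start, counts)
    for _ in range(steps):
        cand = dirichlet_proposal(current, k, rng)
        cand_ll = diagnostic_loglik(cand, counts)
        if cand_ll > current_ll:
            current, current_ll = cand, cand_ll
    return current, current_ll


# Example: 40 diagnostic markers with 13, 20 and 7 in the three categories.
estimate, ll = stochastic_search((13, 20, 7), (1 / 3, 1 / 3, 1 / 3))
print(estimate, ll)
```

A larger concentration value gives proposals closer to the current point, so it trades speed of exploration for finer local refinement.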

**Sample space of hybrid genomic proportions.** The range of possible hybrid genomic proportions in terms of (A) ancestry and interclass heterozygosity on a bivariate coordinate system, and (B) Turelli and Orr’s
H_{I} for three distinct hybrid types, all with

Simulations

Evolution of ancestry and heterozygosity in admixed populations

To illustrate how the joint distribution of S and H_{I} changes in the generations following admixture, I created a simple simulation model following Long’s “intermixture”
^{2}, 2 ^{2} P1, F1, and P2 genotypes in the first generation. Each succeeding generation is formed in the same way by random mating of pairs from the previous generation. I kept track of diploid genotypes to estimate S and H_{I} through time. R code for the simulations is available as Additional File
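
For readers who want to experiment without the additional file, a neutral admixture simulation of this kind can be sketched in a few lines of Python. This is not the author's R code. It assumes diagnostic, unlinked markers, non-overlapping generations, and parents drawn with replacement (so occasional selfing is possible); the function and argument names are hypothetical.

```python
import random


def summarize(ind):
    """Return (S, H_I) for one diploid individual coded as two haplotypes
    of 0/1 alleles, where 1 marks a species 2 allele at a diagnostic locus."""
    hap_a, hap_b = ind
    n = len(hap_a)
    s = (sum(hap_a) + sum(hap_b)) / (2 * n)
    h = sum(a != b for a, b in zip(hap_a, hap_b)) / n
    return s, h


def gamete(ind, rng):
    """Free recombination: each locus transmits one of the two alleles at random."""
    return [rng.choice(pair) for pair in zip(*ind)]


def simulate_admixture(n_ind=100, n_loci=100, generations=10, frac_p2=0.5, seed=1):
    """Track (S, H_I) of every individual in each generation after founding."""
    rng = random.Random(seed)
    n_p2 = round(n_ind * frac_p2)
    pop = [
        ([1] * n_loci, [1] * n_loci) if i < n_p2 else ([0] * n_loci, [0] * n_loci)
        for i in range(n_ind)
    ]
    history = []
    for _ in range(generations):
        # Each offspring receives one gamete from each of two randomly chosen parents.
        pop = [
            (gamete(rng.choice(pop), rng), gamete(rng.choice(pop), rng))
            for _ in range(n_ind)
        ]
        history.append([summarize(ind) for ind in pop])
    return history


history = simulate_admixture(generations=5)
for t, generation in enumerate(history, start=1):
    mean_h = sum(h for _, h in generation) / len(generation)
    print(f"generation {t}: mean H_I = {mean_h:.3f}")
```

In the first generation only P1, F1, and P2 genotypes appear, and in every later generation each individual falls inside the triangle bounded by H_I ≤ 2·min(S, 1 − S).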

**Simulations with gene flow.** R code for simulating neutral admixture.


To illustrate the effect of ongoing gene flow, I repeated the simulations above with stochastic immigration from unchanging parental populations (the continent-island admixture model

Linkage and sampling of the genome

Linkage among markers is expected to affect the sampling variance (hence reliability) of parameter estimates because linked markers will tend to provide redundant information. The assumption that two markers each provide independent information is violated if they are linked (i.e., if the probability of recombination is less than 0.5). In general this should not be a problem if loci represent a simple random sample with respect to recombinational distance

To evaluate the potential effects of linkage on bias and sampling variance, I created a simple linkage model. Each model genome included four diploid chromosomes with 100 loci each. The loci were evenly distributed across two chromosome arms, and one recombination event was modeled per chromosome arm per meiosis (a minimal rate based on mammalian disjunction

Using this model, I simulated F2, backcross, and later generation crosses (up to F10) from parental lines with diagnostic alleles at each marker. For comparison, I simulated the same series of cross types allowing free recombination between all markers (400 unlinked markers). For each simulated individual, I recorded the true values of S and H_{I} from all 400 loci, and then estimated S and H_{I} from samples of

Uncertainty of parental allele frequencies

My implementation of the estimators for S and H_{I} depends on prior estimates of parental allele frequencies taken as known constants. To briefly illustrate the consequences of inaccurate assumptions about parental allele frequencies, I simulated ten generations of admixture in small populations (
S and H_{I} for each individual under different assumed parental allele frequencies. To evaluate the effect of an overall bias, I used four scenarios: (i) parental populations with

To evaluate the effect of balanced inaccuracy, I simulated admixture from parental lineages with 25 diallelic markers with allele frequencies all equal to 0.9 in one lineage and 0.1 in the other, and 25 additional diallelic markers with allele frequencies all equal to 0.7 in one lineage and 0.3 in the other, and then performed estimation assuming all 50 markers had allele frequencies of 0.8 and 0.2. Finally, to assess the impact of having just a few known diagnostic markers, I repeated this analysis replacing one locus of each type with a diagnostic locus, and performed estimation assuming those two were diagnostic but still assuming the other 48 markers had allele frequencies of 0.8 and 0.2.

Hybrid Classification

Equations (5) and (6) can be used to calculate the likelihood of predefined genotype frequency classes, as in Anderson and Thompson’s program NewHybrids
p_{11} = 1, p_{12} = p_{22} = 0 | marker
p_{11} = 0.25, p_{12} = 0.5, p_{22} = 0.25 | marker
p_{11} = 0.5, p_{12} = 0.5, p_{22} = 0.0

**Evolution of genomic proportions under neutral admixture.** The evolution of genomic proportions under neutral admixture in a simulated population founded by equal numbers from each parental species at t = 0. Population size was held constant at 100 diploids. Genotypes for 100 diagnostic 2-allele codominant markers were tracked over 200 non-overlapping generations of random mating and genetic drift.

The most valuable inference from genealogical classification of wild samples is in identifying situations where F1 hybrids are infertile so later generations are never formed
S and H_{I}). This approach has the disadvantage of effectively treating the classification as a null model, which is not biologically justified. A better approach is to accept the classification only if its AIC is lower than the AIC of the MLE (in this case, equivalent to a criterion of within 1.0 log-likelihood units of the MLE). Note that the AIC of the best classification cannot be less than the AIC of the MLE by more than 2 (the case where the MLE is identical to the expectation for a class). This approach avoids the pitfall of assuming that individuals fall into a small set of discrete classes, and instead directly evaluates the validity of classification relative to the continuous model MLE.
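
The two-part decision rule can be sketched in Python as follows. This is an illustrative outline rather than the HIest implementation. The function names are hypothetical, the 1.0 log-likelihood threshold is the value stated in the text, the requirement that the best class beat the runner-up by two log-likelihood units is the stringency criterion applied in the worked examples, and the example uses a diagnostic-marker likelihood for brevity.

```python
import math

# Expected genomic proportions (p11, p12, p22) for the six standard classes.
CLASSES = {
    "P1": (1.0, 0.0, 0.0),
    "P2": (0.0, 0.0, 1.0),
    "F1": (0.0, 1.0, 0.0),
    "F2": (0.25, 0.5, 0.25),
    "B1": (0.5, 0.5, 0.0),
    "B2": (0.0, 0.5, 0.5),
}


def diagnostic_loglik(p, counts):
    """Log-likelihood of proportions p given counts of diagnostic markers
    in the three genotype categories (assumed marker type for this sketch)."""
    total = 0.0
    for prop, n in zip(p, counts):
        if n > 0:
            if prop <= 0:
                return -math.inf
            total += n * math.log(prop)
    return total


def classify(counts, gap=2.0, within=1.0):
    """Return a class name only if (i) the best class beats the runner-up by
    at least `gap` log-likelihood units and (ii) the best class is within
    `within` log-likelihood units of the continuous-model MLE."""
    n = sum(counts)
    mle = tuple(c / n for c in counts)  # closed form for diagnostic markers
    mle_ll = diagnostic_loglik(mle, counts)
    scores = sorted(
        ((diagnostic_loglik(p, counts), name) for name, p in CLASSES.items()),
        reverse=True,
    )
    (best_ll, best_name), (second_ll, _) = scores[0], scores[1]
    distinct = best_ll - second_ll >= gap
    adequate = mle_ll - best_ll < within
    return best_name if (distinct and adequate) else None


print(classify((13, 20, 7)))   # counts close to F2 expectations
print(classify((20, 15, 5)))   # counts that fit no class well
```

Returning `None` when either criterion fails keeps the continuous estimate as the default description of an individual, which is the point of testing the classification rather than assuming it.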

Examples

To illustrate inferences based on S and H_{I}, I analyzed two published data sets. The first is a sample of hybrid tiger salamanders from a 60-year-old hybrid swarm where we expect to find no true parental or F1 individuals
S and H_{I}, and to evaluate the likelihoods of the six genotype frequency classes typically of interest (corresponding to the expectations for pure parentals, F1’s, F2’s and first backcrosses in each direction). These functions and others used in this paper are available as a CRAN package called “HIest” (for “hybrid index estimation”) at

Introduced x native hybrid swarm in tiger salamanders

Barred Tiger Salamanders (

Fitzpatrick et al.

A natural hybrid zone in

S and H_{I}. This example is instructive because the small number of non-diagnostic markers should give considerably less precision than the tiger salamander example, and because the high frequency of F1 hybrids is biologically significant if the inference is credible.

The nuclear markers used by Devitt et al.
(f_{ij1}, f_{ij2}) for my likelihood calculations. I also saved the
S and H_{I}.

Sampling and false classification

To further explore how the number of markers assayed affects erroneous classification, I took the tiger salamander data from Bluestone Pond and Toro Pond (Figure
S and H_{I}. I randomly subsampled three markers (without replacement) and repeated the analysis 1000 times. Then I did the same for samples from 5 to 60 (out of the total of 65) in increments of 5. Given the history of the tiger salamander hybrid swarm and the low frequency of classification using the full dataset, I considered any “successful” classification a false positive.

**Distributions of ancestry and heterozygosity in hybrid tiger salamander populations.** Joint maximum likelihood estimates of ancestry and interclass heterozygosity show variation among populations within the California tiger salamander hybrid swarm (A-E). Here,

Because the primary value of classification is in the identification of true F1 or pure parental genotypes
0.5^{3} = 0.125. To avoid spurious inference, investigators should avoid classifying individuals based on small numbers of markers
0.5^{L}. So, in order to maintain an experiment-wise error rate of

markers. Although this applies precisely only in the case of F2 hybrids and diagnostic markers, it might be taken as a rule of thumb in the absence of other criteria. In the case of the Ensatina data with 46 putative hybrids and three markers, we might expect 5.75 false F1’s and would have wanted 10 markers to keep the error rate near 5%.
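
The arithmetic behind this rule of thumb is easy to reproduce. The Python sketch below is a hedged reconstruction, not the formula printed in the original article: it assumes that an F2 individual is heterozygous at each of L unlinked diagnostic markers with probability 0.5, so that it looks like an F1 with probability 0.5^L, and it asks how many markers keep the expected number of false F1's among n individuals at or below a chosen error rate. The function names and the symbol alpha are hypothetical.

```python
import math


def expected_false_f1(n_individuals, n_markers):
    """Expected number of F2 individuals heterozygous at every diagnostic
    marker, and therefore indistinguishable from F1 hybrids."""
    return n_individuals * 0.5 ** n_markers


def markers_needed(n_individuals, alpha=0.05):
    """Smallest number of markers L such that n * 0.5**L <= alpha."""
    return math.ceil(math.log(n_individuals / alpha) / math.log(2))


# The Ensatina case discussed in the text: 46 putative hybrids.
print(expected_false_f1(46, 3))   # three markers
print(markers_needed(46, 0.05))   # markers for roughly a 5% experiment-wise rate
print(expected_false_f1(46, 10))  # expected false F1's with that many markers
```

With these assumptions the numbers agree with those quoted in the text for 46 individuals: 5.75 expected false F1's with three markers, and 10 markers to bring the expectation below 0.05.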

Results and Discussion

Evolution of ancestry and heterozygosity in admixed populations

Figure
S and H_{I} from a single random simulation for
H_{I} near 0.5). By
H_{I} = 0.5, and the population slowly becomes more homozygous as alleles are lost by drift (H_{I} declines toward zero).

Figure
H_{I} remained moderate instead of dropping toward zero. With

**Evolution of genomic proportions under neutral admixture and immigration.** The evolution of genomic proportions under neutral admixture with ongoing gene flow. The simulated population was founded by equal numbers from each parental species at t = 0. Population size was held constant at 100 diploids. Each generation, resident adults were replaced by pure parental genotypes with probability 0.10 (average gene flow was

The same basic patterns can be seen when the loci are not entirely diagnostic (e.g., parental allele frequencies of 0.9 vs 0.1). However, when estimates were based on fewer markers, or less informative markers, it was often impossible to discern discrete genotype clusters by generation 2 (e.g., see Figures

**Likelihood surfaces for codominant markers.** Likelihood surfaces of ancestry (S) and interclass heterozygosity (H_{I}) for 10 (A-C) and 40 (D-F) codominant biallelic loci with parental allele frequencies of 0.9 and 0.1. (A) and (D) are F1 hybrids with H_{I} = 1.0; (B) and (E) are F2 hybrids (H_{I} = 0.5); (C) and (F) are homozygous recombinants (H_{I} = 0.0). Each level of shading covers two units of log-likelihood, so black is within 2 log-likelihood units of the maximum.

**Likelihood surfaces for dominant markers.** Likelihood surfaces of ancestry (S) and interclass heterozygosity (H_{I}) for 100 (A-C) and 300 (D-F) dominant markers for the same three hybrid genotypes as in Figure 5. Dominant allele frequencies in the parental species were set to 0.9 for half of the markers and 0.1 for the other half in species 1 and vice versa for species 2. The F1 (A and D) had the dominant phenotype for all markers, the F2 (B and E) was homozygous recessive at

Codominant markers

Maximum likelihood estimates of S and H_{I} appear consistent and unbiased for known codominant genotypes (Figure

Dominant markers

Maximizing the log-likelihood for dominant markers also gives unbiased estimates of S and H_{I} (Figure
p_{11} and p_{22}. If, for example, the absence of a PCR product or of a particular band on a gel cannot be interpreted as a homozygous recessive genotype, the marker system should not be used for this or any other method relying on typical population genetic assumptions.

Linkage and sampling of the genome

Markers sampled at random from a structured genome were indistinguishable from truly unlinked markers in terms of bias and sampling variance of
S and H_{I} than a simple random sample of 60 markers. Thus, for systematically sampled genomes with good coverage, support intervals based on my likelihood calculations will be somewhat conservative.

**Supplementary figures and tables.** Figures and tables illustrating effects of linkage and inaccuracy of parental allele frequencies on bias and sampling variance of estimates of S and H_{I}.


Uncertainty of parental allele frequencies

Systematically over- or under-estimating differentiation between parental lineages predictably biased hybrid index estimates toward intermediate or extreme values, respectively (Additional file
H_{I}) and/or parental-like genotypes (high or low

When an equal number of parental allele frequencies were over- and under-estimated, estimates of S and H_{I} had increased variance and were slightly biased toward extreme values (Additional file

Examples

Introduced x native hybrid swarm in tiger salamanders

The distributions of individual estimates of ancestry and interclass heterozygosity from the tiger salamander data are illustrated in Figure
S and H_{I}. The patterns for Bluestone, Pond H, and Sycamore are consistent with gene flow between populations differing in allele frequencies (Figure
H_{I} given

Only a small fraction of the sampled tiger salamanders would be classified into one of the six standard genotype frequency classes using the stringent criteria of (i) the best fit of the six had to differ from the others by at least two log-likelihood units, and (ii) the best fit of the six had to have lower AIC than the continuous model MLE. By these criteria, 21 of the 255 larvae would be classified as F2-like (p_{11} = 0.25, p_{12} = 0.5, p_{22} = 0.25) and one as like a backcross to California Salamander (p_{11} = 0.5, p_{12} = 0.5, p_{22} = 0.0). As expected, no larvae would be classified as F1 hybrids or “pure” parental genotypes. In this case, the low level of classification is entirely due to criterion (ii); in 233 of 255 cases, the MLE was a significantly better fit than the best fit of the six classes. Three examples of the very sharply peaked likelihood surfaces typical for this dataset are illustrated in Figure

**Example likelihood surfaces for individual tiger salamander hybrids.** Joint maximum likelihood surfaces for three hybrid tiger salamanders from Bluestone Pond (Figure
H_{I} = 0.5. Each level of shading represents two log-likelihood units.

Thus, with sufficiently high-resolution data, this kind of analysis can show that admixture has been ongoing for more than two generations and the simple hybrid classification scheme of F1, F2, and backcross is clearly inadequate to describe the distribution of genotypes in the wild. Even for Toro Pond, where 14/52 would be classified as F2, the joint distribution of S and H_{I} is inconsistent with two generations of admixture because random mating is expected to produce the full array of parental, F1, F2, and backcross genotypes in a population (Figure

A natural hybrid zone in

My analysis corroborates the inference that the distribution of genotypes in the

**Estimates of ancestry and heterozygosity in an Ensatina hybrid zone.** Joint maximum likelihood estimates of S and H_{I} largely corroborate the inferences of Devitt et al.
H_{I} = 0.5), but it cannot be statistically distinguished from a

The likelihood surfaces fitted to the

Sampling and false classification

When the continuous MLE was compared against classification (using the 2x log-likelihood or AIC criteria), false classification was most common when about 10 markers were subsampled from the tiger salamander data. False classification dropped off for smaller numbers of markers because there was low power to discriminate alternative classes, and dropped off at larger numbers because the increased resolution allowed all six of the classes to be rejected in favor of the MLE (Figure

**False classification rates.** False classification rate for subsamples of markers for the Bluestone Pond (a) and Toro Pond (b) tiger salamander data peaked at 10 markers when the typical six category system (limited to parental, F1, F2, and backcross genotypes) could be rejected by the MLE of S and H_{I}. Shaded symbols show results for classification based on 2 log-likelihood units; black symbols show results for the AIC criterion. However, with

False classification in subsamples of the tiger salamander data was largely attributed to the difficulty of distinguishing F2 and backcross categories from later generation hybrids. Misclassification of later generation hybrids from these populations as parental or F1 was a problem only for small numbers of markers (Figure
S and H_{I} (Figure

Conclusions

Hybrids are generally conceived as the genetically mixed descendants of two or more distinct ancestral populations
H_{I}, the fraction of loci heterozygous for alleles from each ancestral group). Heretofore, interclass heterozygosity has been used only rarely in analyses of hybridization in the wild, but to great effect
S and H_{I}. The joint likelihood is efficiently expressed in terms of Turelli and Orr’s

Joint consideration of S and H_{I} provides considerably more biological insight than a single ancestry index or classification of hybrids into the limited categories generated in the first two generations of admixture
S and H_{I} than in the likelihood that an individual is truly an F2 hybrid

Competing interests

The author declares no competing interests, financial or otherwise, regarding this manuscript.

Acknowledgements

I thank A. Buerkle, J. Fordyce, Z. Gompert, C. Nice, T. Devitt, Marius Roesti, and two anonymous reviewers for helpful discussions and comments on a draft of the manuscript. NSF grant DEB-0516475 helped support the work on tiger salamanders.