Abstract
Background
Most genetic studies of population differentiation are based on genepool frequencies. Population differences for gene associations that show up as deviations from Hardy-Weinberg proportions (homologous association) or gametic disequilibria (nonhomologous association) are disregarded. Thus little is known about patterns of population differentiation at higher levels of genetic integration or about their causal forces.
Results
To fill this gap, a conceptual approach to the description and analysis of patterns of genetic differentiation at arbitrary levels of genetic integration (single or multiple loci, varying degrees of ploidy) is introduced. Measurement of differentiation is based on the measure Δ of genetic distance between populations, which is in turn based on an elementary genic difference between individuals at any given level of genetic integration. It is proven that Δ does not decrease when the level of genetic integration is increased, with equality if the gene associations at the higher level follow the same function in both populations (e.g. equal inbreeding coefficients, no association between loci). The pattern of differentiation is described using the matrix of pairwise genetic distances Δ and the differentiation snail based on the symmetric population differentiation Δ_{SD}. A measure of covariation compares patterns between levels. To show the significance of the observed differentiation among possible gene associations, a special permutation analysis is proposed. Applying this approach to published genetic data on oak, the differentiation is found to increase considerably from lower to higher levels of integration, revealing variation in the forms of gene association among populations.
Conclusion
This new approach to the analysis of genetic differentiation among populations demonstrates that the consideration of gene associations within populations adds a new quality to studies on population differentiation that is overlooked when viewing only genepools.
Background
Most biological species are subdivided into populations that are more or less strongly connected by gene flow. This facilitates a species' persistence via adaptive differentiation to local conditions, which in turn serves to maintain genetic variation for future adaptational processes. This concept of species is reflected, for example, in metapopulation analysis with its special emphasis on extinction-recolonization dynamics (see [1] for a still relevant review). Genetic control of the phenotypic traits on which processes of adaptation operate is usually complex due to the involvement of several interacting genetic traits that may be expressed even in different developmental phases, including the haplophase. The detection of selectively neutral impacts on population differentiation (e.g. founder effects, genetic drift) may also require the analysis of multiple genetic traits, the interactions among which are determined by chance and in combination with particular mating systems (such as partial selfing). Thus the amount and pattern of genetic differentiation among a set of populations basically depends on:
(1) the developmental stage (chiefly haplophase vs. diplophase),
(2) the genetic traits under consideration at this stage, and
(3) the ways in which the different states of these genetic traits in the populations are associated to form the genetic types (haplotypes, genotypes) at this stage, broadly termed gene association in this paper.
In general, traits are genetic only if they are inheritable, and the goal of inheritance analysis is to identify genes as the basic units of inheritance. The term genetic integration is used here to designate the combination or arrangement of these elementary objects "gene" into the haplotypes of gametes, into the genotypes at diploid (or polyploid) nuclei of diplophase individuals, or into the cytotypes of mitochondria or plastids, for example. Accordingly, each level of genetic integration usually corresponds to a developmental stage or an organelle that is characterized by special combinations of genes. (To emphasize this aspect, genic integration might be the more appropriate term.)
The main motivation for this paper was the realization that impacts of particular forces, selective or not, on population differentiation may not be observable at every level of genetic integration. Measurements of differentiation among populations based on gene frequencies, for example, provide no specific insights into the effects of mating systems or of epistatic interaction on population differentiation. This is due to the fact that gene frequencies refer to the lowest level of genetic integration, namely its absence. This level, which is commonly addressed as a population's genepool, is conceived to consist of the set of all individual genes present in the population members for a specified set of genetic traits. Genetic studies of population differentiation are almost always based on this "beanbag" (critically reflected by Mayr [2] and defended by Haldane [3]; for concise reasoning of the persistence of the genepool concept see e.g. [4] or [5]). Studies of differentiation at multiple loci are no exception, since they commonly report averages over single-locus differentiation indices. Also disregarded in studies of genepool differentiation are gene associations that deviate from Hardy-Weinberg proportions (homologous, or intralocus, association) or gametic equilibria (nonhomologous, or interlocus, association). Considering that forms and degrees of gene association may differ at different levels of genetic integration, it thus appears that previous studies on patterns of population differentiation have provided very little information on levels of genetic integration above the genepool.
One important reason for the usual focus on genepool differentiation is probably the lack of a method for measuring population differentiation consistently at all levels of genetic integration. Consistency means that comparison of the amount of differentiation among a set of populations between levels of integration provides information about the complexity of the gene associations that distinguish them. Since gene associations do not decrease as level of integration increases, neither should differentiation. Moreover, the extent of an increase in differentiation between subsequent levels should in some way reflect the degree of complexity of the additional gene associations, with equality as an indication of lack of additional complexity by some standard. Such a differentiation measure must thus be based on a conceptual characterization of the complexity of gene associations.
The existence of such a measure would not only facilitate experimental studies but also simplify the development and testing of models. Insights can be gained from models only when the characteristics described by the models derive from concepts that are conceived independently of the models. Thus models do not serve to analyze characteristics: characteristics serve to analyze models. Moreover, model-based analysis that is limited to falsification of a particular model or its parameterization provides no information on the validity of related models. A conceptually argued measure, in contrast, can be applied to whole classes of models. This permits summarization of characteristics they have in common, the statistical significance of which can be tested by permutation analysis.
In the present paper, a new approach to differentiation analysis is presented that applies a conceptually argued measure of differentiation Δ_{SD} to analyze and compare differentiation patterns among populations at different levels of integration. Presentation includes the development of Δ_{SD}, representation of patterns of differentiation, and tests of significance of the patterns. Comparison of differentiation between levels of integration is analyzed mathematically. The method's usefulness is demonstrated by applying it to six-locus microsatellite data from four stands of pedunculate oak (Quercus robur). The purpose of using real data is to show how insights can be gained directly from observations without limitation to particular models, the testability of which may be difficult. It turned out that the large increases in differentiation between levels that were observed in the real data were not producible in numerous simulations of simple selection models, indicating that these models cannot explain the complexity of the real data. Studies of the behavior of this measure using simulated data from increasingly complex models will be the subject of a future paper.
To prevent possible misunderstanding, it should be mentioned that this approach differs in content from any type of (hierarchical) partitioning, apportionment, or allocation of genetic variation (such as within and between populations). Methods of attributing overall variation to partitions draw upon the principle of the analysis of variance and were extended to include more general measures of difference between individuals by Rao (equation 2.3.1 in [6]). An application of this generalization to a special measure of genetic difference for multiple loci between haplotypes led Excoffier et al. [7] to the formulation of their "analysis of molecular variance". In contrast, the levels of genetic integration dealt with here cannot serve as classes (partitions) over which genetic variation is distributed. Instead, at each integration level (e.g. gene pool, single-locus genotypes, multilocus genotypes) the genetic characteristics can be analyzed for their differentiation within population subdivisions. Subsequent comparison between levels reveals which level of integration, and thus which type of gene association (especially homologous vs. nonhomologous), has the greatest influence on the differentiation within the partition.
Methods
Levels of genetic integration and gene association
At the lowest level of genetic integration, the genepool, the genetype of each individual gene is characterized by the gene locus at which it is located and by its allelic state. Assuming that the degree of ploidy is the same at all loci, the relative frequencies of the genetypes in the genepool of a population equal (1/L)·p_{i;l}, where L is the number of loci and p_{i;l} is the relative frequency of the ith allele at the lth gene locus in the population (∑_{i}p_{i;l} = 1, l = 1, ..., L). If loci of differing degree of ploidy (e.g. nuclear and organelle) are included in the analysis, replace 1/L with the locus-specific quantities r_{l} obtained by division of the degree of ploidy at the lth locus by the sum of the degrees over all loci. The genepool frequency of the genetype specified by the ith allele at the lth locus then equals r_{l}·p_{i;l}. At higher levels of genetic integration, where the objects of interest represent compositions of several individual genes together with their genetypes, association among genetypes becomes relevant for differentiation studies. If the objects are diplophase individuals and if the genetypes are specified at a single gene locus, then all associations among the genes that make up the genotypes are homologous (i.e., allelic) by definition. When multiple loci are considered, both homologous and nonhomologous (interlocus) associations exist among genes. If the objects are haplophases, each object having just one gene per locus, then all gene associations are nonhomologous. Since at any given locus all objects carry the number of (allelic) individual genes specified by the degree of ploidy of the locus, the objects representing a given level of genetic integration are characterized by the same number of individual genes.
The elementary genic difference
From this perspective, genetic differences between two objects of the same level of integration are basically determined by the number of their individual genes that differ in type. If the numbers of copies of the ith allele at the lth gene locus are denoted by n_{i;l} and m_{i;l}, respectively, then the two objects differ by ∑_{i,l}|n_{i;l} − m_{i;l}| genetype copies. This sum is maximal, equaling two times the total number K of individual genes represented in each object, if the objects share no genetypes (and thus differ completely). Since ∑_{i,l}n_{i;l} = ∑_{i,l}m_{i;l} = K holds, division of ∑_{i,l}|n_{i;l} − m_{i;l}| by 2·K yields a measure of genic difference that is bounded between zero and one. This measure of elementary genic difference is applicable to all levels of integration. It differs from a closely related index suggested by Smouse and Peakall [8] in a different context, in which the absolute difference is replaced by the squared difference, a disadvantage of which is that objects sharing no genetypes need not realize the maximum difference.
The elementary genic difference does not distinguish homologous from nonhomologous genes. Hence, the homologous and nonhomologous gene arrangements within the objects affect the elementary genic differences between them only through their sum. For example, in the case of diploid individuals scored at two gene loci A and B, say, the genotypes A_{1}A_{1}/B_{1}B_{2 }and A_{1}A_{2}/B_{1}B_{3 }represent three (A_{1}, B_{1}, B_{2}) and four (A_{1}, A_{2}, B_{1}, B_{3}), respectively, of the total of five genetypes. A_{1 }is represented by two copies in the first genotype and by one copy in the second, and the remaining four genetypes are represented by at most one copy in each of the two genotypes. The sum of copy number differences between the two genotypes thus equals four. After division by twice the number of individual genes in a genotype (i.e. 2·4), this yields 0.5 as the elementary genic difference. The same result is obtained for the two genotypes A_{1}A_{2}/B_{1}B_{2 }and A_{1}A_{2}/B_{3}B_{3}, even though all genic differences are now due to the alleles at a single locus (B).
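The worked example above can be reproduced with a few lines of code. The following is a minimal sketch, assuming that each object is supplied as a list of (locus, allele) pairs; the function name `elementary_genic_difference` and this input layout are illustrative choices, not part of the original method description.

```python
from collections import Counter


def elementary_genic_difference(obj_a, obj_b):
    """Sum of absolute copy-number differences over all genetypes, divided by 2*K.

    obj_a, obj_b: iterables of (locus, allele) pairs, one pair per individual gene.
    Both objects must contain the same number K of individual genes.
    """
    counts_a = Counter(obj_a)
    counts_b = Counter(obj_b)
    k = sum(counts_a.values())
    if k != sum(counts_b.values()):
        raise ValueError("objects must contain the same number of individual genes")
    genetypes = set(counts_a) | set(counts_b)
    # Counter returns 0 for genetypes that are absent from an object.
    total = sum(abs(counts_a[t] - counts_b[t]) for t in genetypes)
    return total / (2 * k)


# First pair from the text: A1A1/B1B2 versus A1A2/B1B3
g1 = [("A", 1), ("A", 1), ("B", 1), ("B", 2)]
g2 = [("A", 1), ("A", 2), ("B", 1), ("B", 3)]
# Second pair from the text: A1A2/B1B2 versus A1A2/B3B3
g3 = [("A", 1), ("A", 2), ("B", 1), ("B", 2)]
g4 = [("A", 1), ("A", 2), ("B", 3), ("B", 3)]

print(elementary_genic_difference(g1, g2))
print(elementary_genic_difference(g3, g4))
```

Both calls return the same value because the measure only counts differing gene copies and ignores whether they are spread over two loci or concentrated at one.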
These considerations show that objects representing higher levels of genetic integration are not simply of the same or different genetic type, as is the case at the level of the genepool. Specification of the genetypes of which the genetic types are composed yields a measure of the differences between them that ensures the comparability of genetic differences even across levels of genetic integration. Thus, analysis of population differentiation at higher levels of integration should take into account not only differences in the frequencies of the genetic types among populations but also the variation in the pairwise differences between types.
The measure Δ of genetic distance between two populations
The measure Δ of genetic distance between two populations developed by Gregorius et al. [9] considers both the frequencies of genetic types and their pairwise differences, while avoiding the conceptual problems of dispersion indices (e.g. average differences within and between populations, see [6]). For a specified trait, Δ equals the minimum extent to which the genetic types of individuals in one of the two populations must be altered in order to obtain the composition of genetic types in the other. Denote:
Δ(s) = ∑_{a,b}s(a, b)·d(a, b)
where d(a, b) specifies the difference between genetic types a and b, and s(a, b) is a frequency shift. Frequency shifts are performed from types that are more frequent in the one population 𝒫 than in the other 𝒬 to types that are less frequent in 𝒫 than in 𝒬. If the frequency p_{a} of type a in 𝒫 exceeds the frequency q_{a} of this type in 𝒬, then the excess p_{a} − q_{a} must be shifted to types deficient in 𝒫, such that ∑_{b}s(a, b) = p_{a} − q_{a} = p_{a} − min{p_{a}, q_{a}}. The shift process is continued for all types with a frequency excess in 𝒫 until the frequencies of all types in 𝒫 match those in 𝒬. Since there may be many different ways of shifting, Δ is taken to be the minimum of the above sum over all admissible frequency shifts s, i.e.,
Δ = min_{s}Δ(s) = min_{s}∑_{a,b}s(a, b)·d(a, b)
In [9] and [10] it is shown that finding a shift transformation s that minimizes Δ(s) is equivalent to solving the "Transportation Problem" [11] by linear programming methods. These methods are implemented in the computer program DeltaS [12].
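For reference, the minimization can be written out as a linear program of transportation type. The constraint on the sinks is not stated explicitly in the text; it is added here as the condition that the type frequencies match after shifting, so the block is a sketch of the formulation rather than a quotation from [9] or [10].

```latex
\begin{aligned}
\Delta \;=\; \min_{s}\ & \sum_{a,b} s(a,b)\, d(a,b) \\
\text{subject to}\quad
& \sum_{b} s(a,b) = p_a - \min\{p_a, q_a\} && \text{for every type } a \text{ (excess to be removed)},\\
& \sum_{a} s(a,b) = q_b - \min\{p_b, q_b\} && \text{for every type } b \text{ (deficit to be filled)},\\
& s(a,b) \ge 0 && \text{for all } a, b.
\end{aligned}
```

Total supply and total demand agree because both frequency vectors sum to one, which is what makes the problem a balanced transportation problem.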
In combination with the measure of elementary genic difference, the measure Δ provides the desired conceptual method for studying population differentiation at different levels of genetic integration. At the lowest integration level, the genepool, where genetypes are specified by indices i; l and their frequencies in populations 𝒫 and 𝒬 as r_{l}·p_{i;l} and r_{l}·q_{i;l} (see above), Δ assumes a familiar form. Since individual genes are distinguished only by their identity or nonidentity in type, one obtains elementary genic differences d(a, b) = 1 for a ≠ b and d(a, b) = 0 for a = b. For any frequency shift s, it holds that Δ(s) = ∑_{a, b}s(a, b) = ∑_{a}(p_{a} − min{p_{a}, q_{a}}) = ½·∑_{a}|p_{a} − q_{a}|. Insertion of the genetype notation in place of the a's then yields:
Δ = ∑_{l}r_{l}·d_{0}(p^{(l)}, q^{(l)})
where:
d_{0}(p^{(l)}, q^{(l)}) = ½·∑_{i}|p_{i;l} − q_{i;l}|
In this expression, d_{0}(p^{(l)}, q^{(l)}) is a familiar measure of genetic distance between two populations with allele frequencies p^{(l)} and q^{(l)} at locus l (see e.g. [13]). It turns out that, for loci of equal degree of ploidy (r_{l} = 1/L), the genepool distance between two populations equals the average distance over the single loci.
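The genepool distance can be computed directly from allele frequencies, without any optimization. The sketch below assumes that d_{0} is half the sum of absolute allele frequency differences at a locus and that the genepool distance is the ploidy-weighted mean of d_{0} over loci; the function names, the dictionary layout, and the example frequencies are illustrative assumptions.

```python
def d0(p, q):
    """Half the sum of absolute allele frequency differences at one locus.

    p, q: dicts mapping allele -> relative frequency (each summing to 1).
    """
    alleles = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in alleles)


def genepool_distance(pop_p, pop_q, ploidy=None):
    """Weighted mean of d0 over loci, with weights r_l = ploidy_l / sum of ploidies.

    pop_p, pop_q: dicts mapping locus -> allele frequency dict (same loci in both).
    ploidy: optional dict mapping locus -> degree of ploidy; equal ploidy if omitted.
    """
    loci = list(pop_p)
    if ploidy is None:
        ploidy = {locus: 1 for locus in loci}
    total = sum(ploidy[locus] for locus in loci)
    return sum(ploidy[locus] / total * d0(pop_p[locus], pop_q[locus]) for locus in loci)


# Hypothetical two-locus example (frequencies invented for illustration)
pop_p = {"A": {"A1": 0.5, "A2": 0.5}, "B": {"B1": 1.0}}
pop_q = {"A": {"A1": 0.25, "A2": 0.75}, "B": {"B1": 0.5, "B2": 0.5}}

print(d0(pop_p["A"], pop_q["A"]), d0(pop_p["B"], pop_q["B"]))
print(genepool_distance(pop_p, pop_q))
```

With equal ploidy at all loci the weights reduce to 1/L, so the result is the plain average of the single-locus distances.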
At the diplophase level of integration, for example, consider two populations 𝒫 and 𝒬 with Hardy-Weinberg proportions (HWP) for the two alleles A_{1} and A_{2} at a locus. Let p_{1} > q_{1}, and let 𝒫 have more heterozygotes than 𝒬. Then there is only one way s of shifting, namely s(A_{1}A_{1}, A_{2}A_{2}) = p_{1}^{2} − q_{1}^{2} > 0 and s(A_{1}A_{2}, A_{2}A_{2}) = 2p_{1}p_{2} − 2q_{1}q_{2} > 0. Since for the elementary genic distance, d(A_{1}A_{1}, A_{2}A_{2}) = 1.0 and d(A_{1}A_{2}, A_{2}A_{2}) = 0.5, the genetic distance equals Δ = 1.0·(p_{1}^{2} − q_{1}^{2}) + 0.5·(2p_{1}p_{2} − 2q_{1}q_{2}) = p_{1} − q_{1}. In this example, the distance at the diplophase level equals the genepool distance. Under Results it is shown (Proposition 1) that the diplophase distance is never less than the genepool distance and that equality at the two levels is of particular interest.
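The Hardy-Weinberg example can be checked numerically. The sketch below assumes the situation described in the text, in which A_{1}A_{1} and A_{1}A_{2} are both in excess in the first population and A_{2}A_{2} is the only deficient type, so that the shift is unique and no optimization is needed; the function names and the allele frequencies 0.6 and 0.2 are chosen for illustration only.

```python
def hwp(p1):
    """Genotype frequencies under Hardy-Weinberg proportions for two alleles."""
    p2 = 1.0 - p1
    return {"A1A1": p1 * p1, "A1A2": 2.0 * p1 * p2, "A2A2": p2 * p2}


def delta_hwp_example(p1, q1):
    """Distance between two HWP populations when A2A2 is the only deficient type."""
    P, Q = hwp(p1), hwp(q1)
    shift_hom = P["A1A1"] - Q["A1A1"]  # excess of A1A1, shifted to A2A2
    shift_het = P["A1A2"] - Q["A1A2"]  # excess of A1A2, shifted to A2A2
    if shift_hom <= 0 or shift_het <= 0:
        raise ValueError("example requires an excess of both A1A1 and A1A2")
    # Elementary genic differences: d(A1A1, A2A2) = 1.0, d(A1A2, A2A2) = 0.5
    return 1.0 * shift_hom + 0.5 * shift_het


p1, q1 = 0.6, 0.2
print(delta_hwp_example(p1, q1), p1 - q1)
```

The two printed numbers agree up to rounding, which illustrates that the diplophase distance collapses to the genepool distance when both populations are in HWP.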
Patterns of differentiation among populations
At this point, each level of integration for a set of populations is characterized by a matrix of pairwise distances Δ between the populations. These matrices and the relationships among them can be called the pattern of differentiation among the populations. Three approaches to the description of differentiation patterns are discussed.
Clustering methods
Matrices of pairwise genetic distances between populations are commonly represented using clustering methods as dendrograms, the topologies (cluster structures) of which are of primary interest. In particular, the emergence of new cluster structures at higher levels of integration emphasizes the necessity to consider evolutionary forces of population differentiation that go beyond those conventionally held responsible for genepool differentiation. Detection of such structures of course depends on comparison of the dendrograms from different levels of integration, where the genepool constitutes the basic reference for comparison. There are many ways of comparing dendrograms obtained with the same clustering method (for an overview see e.g. [14], p. 94ff). We will concentrate instead on direct comparison of the quantities underlying all methods of clustering, i.e., the matrix of pairwise distances. Changes in topology are most likely to occur when the distance matrices show poor correspondence across levels of integration, that is, low covariation (see below).
Variance decomposition
Another common approach is less detailed and essentially rests on the computation of a single statistic of the degree of differentiation among populations. Among these measures, most of which are indexed by _{ST}, the classical versions F_{ST} [15] and G_{ST} [16] consider population differentiation solely for allele frequencies. More recent versions such as Φ_{ST} [7] or R_{ST} [17] include variable differences between genetic types. Inferences on patterns of differentiation are more or less restricted to ways in which an observed amount of differentiation could have evolved under certain model assumptions. Moreover, the whole family of _{ST}-measures is based on the principle of variance decomposition, where the difference between the total variation and the average variation within populations is divided by the total variation. Such measures do not assume their maximum values only for completely differentiated populations. This follows directly from their conceptual underpinning, which refers to partitioning rather than differentiation of genetic variation among populations. The _{ST}-measures therefore have limited relevance as indicators of patterns of differentiation among populations.
Symmetric population differentiation Δ_{SD}
For this reason, preference is given here to a related but more detailed approach that refers to the concept of symmetric set difference [18,19]. In this concept, each population is characterized by its genetic distance from its complement, i.e., the totality (union) of the remaining populations. By this means, populations can be ranked according to their contributions to the overall amount of differentiation. Application of the distance measure Δ to the concept of symmetric set difference yields quantities Δ_{j} as the distance Δ(p(j), p̄(j)) between the jth population and its complement. Denoting p(j) as the vector of type frequencies characterizing the jth population, the vector p̄(j) of type frequencies that represent the remaining populations equals ∑_{k:k≠j}p(k)·c(k)/c̄(j), where c(k) is the relative size of the kth population and c̄(j) = ∑_{k:k≠j}c(k). With these quantities, the measure of symmetric population differentiation Δ_{SD} results as the average of the single-population differentiations Δ_{j} weighted by relative population size, i.e.,
Δ_{SD} = ∑_{j}c(j)·Δ_{j}
Whereas Δ_{SD }quantifies the average degree to which individual populations differ from their complements, its components Δ_{j }identify individual populations as being more or less representative of the whole collection of populations. Thus, Δ_{j }= 0 summarizes the situation where the jth population perfectly represents the totality of the populations. On the other hand, the more distinctly Δ_{j }exceeds Δ_{SD}, the more a population is distinguished from all the others. The extreme of complete differentiation of course requires a definite notion of complete difference between types (as is the case with binary difference measures as well as with the measure d of elementary genic difference).
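The construction of the complement and of the components Δ_{j} can be sketched in code. Two simplifying assumptions are made here and should be read as such: the difference between types is binary (as at the genepool level), so that Δ reduces to half the sum of absolute frequency differences and no linear programming is needed, and Δ_{SD} is taken as the average of the Δ_{j} weighted by relative population size. All names and the example populations are hypothetical.

```python
def half_l1(p, q):
    """Distance for binary type differences: half the sum of absolute differences."""
    types = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in types)


def symmetric_differentiation(freqs, sizes):
    """Return ([Delta_j], Delta_SD) for a list of populations.

    freqs: list of dicts mapping type -> relative frequency within a population.
    sizes: list of population sizes (any positive numbers).
    """
    total = sum(sizes)
    c = [n / total for n in sizes]  # relative population sizes c(j)
    deltas = []
    for j, p_j in enumerate(freqs):
        c_bar = sum(c[k] for k in range(len(c)) if k != j)
        complement = {}
        # Pool the remaining populations, weighting each by c(k) / c_bar(j).
        for k, p_k in enumerate(freqs):
            if k == j:
                continue
            for t, f in p_k.items():
                complement[t] = complement.get(t, 0.0) + f * c[k] / c_bar
        deltas.append(half_l1(p_j, complement))
    delta_sd = sum(cj * dj for cj, dj in zip(c, deltas))
    return deltas, delta_sd


# Three equally sized populations, each fixed for a different type
fixed = [{"a": 1.0}, {"b": 1.0}, {"c": 1.0}]
print(symmetric_differentiation(fixed, [50, 50, 50]))

# Three identical populations
same = [{"a": 0.5, "b": 0.5}] * 3
print(symmetric_differentiation(same, [10, 20, 30]))
```

The first call illustrates complete differentiation (no type is shared with the complement), the second the case in which every population perfectly represents the whole collection.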
The differentiation pattern inherent in Δ_{SD }and its components Δ_{j }for variable population sizes can be illustrated as a "differentiation snail" [18] (see Fig. 2 below). The snail complements the pattern characteristics obtainable from clustering methods or directly from the distance matrix in that it reveals tendencies of population assemblages to be genetically dispersed or to concentrate genetic variation in a few populations. In order to assess changes in the snail between levels of genetic integration, the following measure of covariation of the respective components Δ_{j }can be applied.
Covariation of differentiation between integration levels
The degree of correspondence between differentiation indices from two levels of integration can be determined by a measure of covariation. Commonly chosen measures of covariation are any of the versions of the product-moment correlation which are designed to quantify the closeness to a linear type of covariation between two variables. However, since our genetic distances are bounded, linear relationships can be realized only under very exceptional conditions. Moreover, it is difficult to see how relationships between levels of integration could be brought about by forces acting linearly on genetic distances. From this perspective it is preferable to use a measure of covariation that relies on general monotonic relationships between two variables. Such measures would more reliably detect any consistency of patterns of differentiation over levels of integration. As was pointed out in [20], a suitable measure of covariation is:
C = ∑_{i<j}(X_{i} − X_{j})·(Y_{i} − Y_{j}) / ∑_{i<j}|(X_{i} − X_{j})·(Y_{i} − Y_{j})|
where the variables X_{i} and Y_{i} refer to genetic distances at two different levels of integration. In the case of the distances between a population and its complement, X_{i} and Y_{i} refer to Δ_{i} at the two levels of integration. In the case of pairwise distances between populations, X_{i} and Y_{i} refer to the ith element of the distance matrix for each of the two levels of integration. C varies between −1 and +1 such that C = 1 for strictly positive and C = −1 for strictly negative covariation. It is undefined in the practically irrelevant case where a nonzero difference for one variable implies equality for the other.
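A measure with the stated properties can be sketched as follows. The code assumes that C is the sum of the products (X_{i} − X_{j})·(Y_{i} − Y_{j}) over all pairs i < j, divided by the sum of the absolute values of these products; this form is consistent with the bounds and the undefined case described above, but the exact definition should be checked against [20]. The function name and the example values are illustrative.

```python
from itertools import combinations


def covariation(x, y):
    """Monotonic covariation of two equally long sequences; None if undefined."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    numerator = 0.0
    denominator = 0.0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        numerator += product
        denominator += abs(product)
    if denominator == 0.0:
        # Every nonzero difference in one variable meets equality in the other.
        return None
    return numerator / denominator


# Hypothetical distances at two levels of integration
lower = [0.10, 0.20, 0.40]
higher = [0.20, 0.30, 0.90]

print(covariation(lower, higher))        # monotonically increasing relationship
print(covariation(lower, higher[::-1]))  # monotonically decreasing relationship
```

Because only the signs and magnitudes of pairwise differences enter, any strictly increasing relationship between the two sets of distances gives the maximum value, whether or not it is linear.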
Permutation test of the significance of genetic differentiation patterns
Any increase of genetic differentiation among populations at higher levels of genetic integration is due to forces of association of genes that differ among populations. It is thus of basic interest to know whether the differentiation observed at a level of integration can be explained by random combination of genes (e.g. into diploid genotypes or haplotypes) or whether directed forces of combination must be assumed. This requires an analysis that is conditional on the genepool of each population, the number of populations, and the population sizes. The effects of chance can be assessed by permuting the genes within each population, such that all homologous and nonhomologous combinations of genes (alleles) into (haploid, diploid or polyploid) genotypes have equal probability. For each such permutation, the values of all relevant descriptors (e.g. covariation C for distance matrices and differentiation snails, the mean pairwise distance Δ in the distance matrix, the symmetric population differentiation Δ_{SD}) are determined. By performing a large number of permutations, the significance of each observed descriptor value can be measured in terms of the P-value, which is the proportion of permutations yielding descriptor values greater than or equal to the observed value. For interpretation of the results, both very small P-values (≤ 0.05) and very large P-values (≥ 0.95) are of interest.
This permutation analysis differs from common permutation analyses of differentiation among populations, in which the individuals (together with their fixed genotypes) are permuted over the populations. Such analyses aim to explain genepool differences among populations. In contrast, the present paper is targeted at forces of genetic differentiation that originate from the association of genes in diplo- or haplostates and that thus go beyond those responsible for genepool differentiation.
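The permutation scheme can be sketched for the simplest case. The code below is restricted to diploid genotypes at a single locus, so only homologous associations are randomized; the multilocus case additionally requires independent shuffling at each locus. The function names, the toy descriptor (difference in heterozygote frequency between two populations), and the example data are hypothetical and serve only to show how the P-value is obtained.

```python
import random


def permute_genes_single_locus(genotypes, rng):
    """Randomly recombine the genes of one population into diploid genotypes.

    genotypes: list of (allele, allele) tuples; the genepool is left unchanged.
    """
    genes = [allele for genotype in genotypes for allele in genotype]
    rng.shuffle(genes)
    return [tuple(sorted(genes[i:i + 2])) for i in range(0, len(genes), 2)]


def permutation_p_value(populations, descriptor, n_perm=1000, seed=0):
    """Proportion of permutations with descriptor value >= the observed value."""
    rng = random.Random(seed)
    observed = descriptor(populations)
    hits = 0
    for _ in range(n_perm):
        # Genes are permuted within each population, never between populations.
        permuted = [permute_genes_single_locus(pop, rng) for pop in populations]
        if descriptor(permuted) >= observed:
            hits += 1
    return hits / n_perm


def heterozygosity_gap(populations):
    """Toy descriptor: absolute difference in heterozygote frequency (two populations)."""
    het = [sum(a != b for a, b in pop) / len(pop) for pop in populations]
    return abs(het[0] - het[1])


# Same genepool in both populations, opposite forms of homologous association
pop1 = [("A1", "A2")] * 10
pop2 = [("A1", "A1")] * 5 + [("A2", "A2")] * 5

p_value = permutation_p_value([pop1, pop2], heterozygosity_gap)
print(p_value)
```

In this constructed example the genepools are identical, so any differentiation at the genotypic level stems entirely from the way genes are associated, which is the situation the permutation analysis is meant to detect.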
Results and discussion
Effects of level of genetic integration on the pattern of differentiation among populations
Proceeding from lower to higher levels of integration, one expects an increase in differentiation among populations simply because of the larger varietal potential inherent in more complex structures. Since differentiation is based on distances, the distance between two populations should therefore also increase, or at least not decrease, with integration level. Consider two populations 𝒫 and 𝒬, and denote the relative frequencies of their (multilocus) genotypes at L (≥ 1) loci of equal degree of ploidy (≥ 1) by frequency vectors P and Q and the relative frequencies of the genetypes in their genepools by frequency vectors p and q. Proof of the following Theorem requires the special properties of the elementary genic difference between genotypes, including the fact that it is a metric distance:
Theorem: For any two populations 𝒫 and 𝒬, the distance Δ between the (multilocus) genetic structures P and Q at any L gene loci (L ≥ 1) of equal degree of ploidy is not less than the mean distance between the single-locus structures P^{(l)} and Q^{(l)}, which in turn is not less than the distance between the corresponding gene pools p and q, that is,
Δ(p, q) = (1/L)·∑_{l}Δ(p^{(l)}, q^{(l)}) ≤ (1/L)·∑_{l}Δ(P^{(l)}, Q^{(l)}) ≤ Δ(P, Q)
where the difference between genetic types (haplotypes, diplotypes) is measured by the elementary genic difference d.
Proof: The equality results from definition of Δ and genepool. The first inequality follows from Proposition 1 (see Appendix A), which states that the distance Δ between L-locus genotypic structures P and Q (L ≥ 1) is never less than between the genepools p and q. From this it follows that Δ(p^{(l)}, q^{(l)}) ≤ Δ(P^{(l)}, Q^{(l)}) for each locus l. The second inequality stems from Proposition 2 (see Appendix B), which states that the distance Δ between multilocus genotypic structures P and Q is never less than the average of the distances between the corresponding single-locus genotypic structures P^{(l)} and Q^{(l)}. ■
We investigated this Theorem by simulating numerous simple models. When we analyzed two populations with differing genepools at a locus but both showing HWP among the genotypes, we were surprised to see that the inequalities became equalities. Furthermore, the extension of HWP to inbreeding structures for the same inbreeding coefficient F (i.e., P_{ii} = p_{i}^{2} + Fp_{i}(1 − p_{i}) and P_{ij} = 2p_{i}p_{j}(1 − F)) also yielded equality (F = 0 gives HWP). Equality also held when each of the genotypic structures was the product of two allelic structures (e.g., maternal and paternal), one of which was the same in both populations. When we simulated the frequencies of two-locus genotypes in two populations, both showing HWP at both loci, as the product of the single-locus genotype frequencies, equality again held. In contrast, differentiation between the genepool and the genotypes at a single locus did increase for inbreeding structures when the two inbreeding coefficients differed and for product structures when no two of the four allelic structures matched. No increase was obtainable between the average single-locus genotypic distance and the multilocus distance in the case of two loci, each with two alleles, not even when the selection regimes differed between the populations. It is therefore interesting that examples using real data, one of which is presented below, all showed large increases between levels, indicating that the real data do not follow simple models.
As an explanation for the examples in which the genetic distance does not increase with level of genetic integration, consider that the first inequality becomes an equality if Δ(p^{(l)}, q^{(l)}) = Δ(P^{(l)}, Q^{(l)}) holds for each single locus l. The calculated examples suggest that equality holds at a single locus if the genotypic structures in both populations result from the same function of their allelic structures, i.e., uniformity of homologous association. The second inequality became an equality in our calculated examples whenever multilocus genotype frequencies were the product of single-locus genotype frequencies, i.e., in the absence of nonhomologous association.
These observations suggest that uniformity of homologous association and absence of nonhomologous association result in equal distances at different integration levels. Intuitively, this coincides with the conception that absence or uniformity of association do not really introduce any new structure to the higher levels of integration. Since this phenomenon only shows up when the difference between genotypes is measured by the elementary genic distance, this measure is closely tied to the concept that the absence of association does not lead to higher differentiation at higher levels of genetic integration.
Nevertheless, absence of nonhomologous association may not be a necessary condition for equality, since equality also occurred in some examples where association between loci was present. This means that the basic prerequisite for validity of Δ(p, q) = Δ(P, Q) (stated at the end of Appendix A), namely that every genetype that is not of equal frequency in the two populations be either a source gene or a sink gene, may be fulfilled even in the presence of nonhomologous association.
Carrying these results for Δ over to the differentiation measures Δ_{j }and Δ_{SD}, the differentiation among populations for multilocus genetic types (haplotypes, genotypes) equals the genepool differentiation if all populations show uniformity of homologous gene association (e.g. HWP, inbreeding for the same inbreeding coefficient) and absence of nonhomologous association. Otherwise, differentiation may increase with level of integration, as expected.
All of these results are based on the special measure of elementary genic difference between genotypes (for any degree of ploidy). Thus any other measure is likely to yield different results, the interpretation of which would of course depend on a clear conceptual understanding of the difference measure. In particular, this concerns genetic associations that are not specifically genic. A discussion of these measures (see [21] for an overview of measures) would, however, be clearly beyond the scope of this paper.
Application of the approach to an assemblage of oak stands
The effects of the level of genetic integration on patterns of differentiation will be illustrated with the help of an example based on published data [20,22]. The reason for not applying it to particular models here is to show how insights can be gained directly from observations, without model constraints. In these data, the multilocus genotypes at the same six nuclear microsatellite loci were scored in all adult trees of four stands of pedunculate oak (Quercus robur) located in north-central Germany. Of the 159 trees in the stand near Rantzau, 154 trees could be scored at all six loci, yielding 153 different multilocus genotypes (abbreviated 159/154/153). The other three stands are near Behlendorf (228/178/177), Steinhorst (85/74/74), and Escherode (210/200/200). The number of alleles per locus lies between 15 and 35 with a mean of 23.7, of which an average of five occur in only one stand. Each multilocus genotype appeared in only one stand, yielding a total of 604 different genotypes among the 606 trees scored at all loci. Failure to score the complete multilocus genotypes of the other 76 trees in the stands is assumed to be independent of their genotypes.
Table 1 lists the matrix of pairwise distances Δ between stands and their mean as well as the symmetric population differentiation Δ_{SD} and its components Δ_{j}, both based on the elementary genic difference between genetic types, for each of three levels of integration: the genepool distance, which is the average of the six single-locus allelic distances; the single-locus diplophase distance, which is likewise the average over the loci; and the multilocus diplophase distance. For each pair of stands, the pairwise distance Δ increases considerably with the level of integration. This indicates that neither the gene association within single loci (homologous association) nor the gene association among loci (nonhomologous association) is of the same form in any two stands, and in particular that association is present. Both the distances and the snail components show a much larger increase between the single-locus diplophase and the multilocus diplophase than between the genepool and the single-locus diplophase. Hence the nonhomologous gene associations make a distinctly greater contribution to the differentiation than the homologous gene associations. It is interesting to consider the large increase between the single-locus and the multilocus level in the light of our failure to produce any increase at all when simulating simple selection models, as mentioned above. This indication that the data are not explainable by simple models requires further investigation.
Table 1. Genetic differentiation among four oak stands at three levels of genetic integration.
In order to be sure that this apparent discrepancy between stands in the form of association is not simply due to the small number of multilocus genotypes in the stands compared to the number that could be formed from the genes present in the stands, a permutation analysis was performed as described above. Ten thousand new data sets were generated by random permutation of the genes at each locus within each stand to form new single-locus genotypes, randomly combined to multilocus genotypes. Each observed distance was then compared to the 10 000 distances from permutation. Surprisingly, for both the single-locus diplophase and the multilocus diplophase, the observed mean pairwise distance and the symmetric population differentiation Δ_{SD} were significantly high (i.e., higher than for 99% of all permutations). This indicates that both homologous and nonhomologous association of genes follow very different rules among the stands.
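One permutation of a single stand, as described in the preceding paragraph, might be sketched as follows. This is an illustration under assumptions rather than the program used for the analysis: each tree is represented as a tuple of diploid single-locus genotypes, the function name `permute_stand` and the toy data are hypothetical, and the random combination of single-locus genotypes into multilocus genotypes is obtained simply by shuffling each locus independently.

```python
import random
from collections import Counter


def permute_stand(genotypes, rng):
    """Return one permuted data set for a single stand.

    genotypes: list of trees; each tree is a tuple of loci; each locus is a
    tuple of two alleles. The genes at each locus are shuffled within the
    stand and paired anew into single-locus genotypes. Because every locus is
    shuffled independently, the resulting single-locus genotypes are combined
    at random into multilocus genotypes.
    """
    n_trees = len(genotypes)
    n_loci = len(genotypes[0])
    new_loci = []
    for l in range(n_loci):
        genes = [allele for tree in genotypes for allele in tree[l]]
        rng.shuffle(genes)
        # Consecutive pairs of shuffled genes form the new single-locus genotypes.
        new_loci.append([tuple(sorted(genes[2 * i:2 * i + 2])) for i in range(n_trees)])
    return [tuple(new_loci[l][i] for l in range(n_loci)) for i in range(n_trees)]


# Toy stand with three trees scored at two loci (invented data, for illustration only).
stand = [
    (("a", "b"), ("x", "x")),
    (("a", "a"), ("x", "y")),
    (("b", "c"), ("y", "z")),
]
permuted = permute_stand(stand, random.Random(1))
print(permuted)
```

The genepool of the stand is left unchanged by construction; only the arrangement of the genes into single-locus and multilocus genotypes varies between permutations.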
The significant size of the mean of the pairwise distances for the single-locus diplophase and the multilocus diplophase may seem counterintuitive to the striking similarity of these distances within each of the three levels of integration. The same holds for the snail components. To explain this similarity, note that the range of values that appeared in the permutations is also quite narrow. Thus the collections of genes in the stands must place tight limits on the achievable distances and snail components.
Not only the sizes but also the covariation C of the pairwise distances Δ and the snail components Δ_{j} at the different integration levels depend on the differences in gene association between levels. The positive covariation of distance matrices and of snail components for all pairs of integration levels shows that no form of association completely overturns the ranking prescribed by the genepool. Whereas the gene arrangements that distinguish the single-locus diplophase from the genepool do produce rank changes among the stands (C = 0.893 for the distance matrix and C = 0.809 for the snail components), the gene arrangements that distinguish the single-locus diplophase from the multilocus diplophase have little effect on ranking (C = 1 for the distance matrix and C = 0.995 for the snail components). Not surprisingly, the gene arrangements that distinguish the genepool from the multilocus diplophase yield the weakest covariation (C = 0.720 for the distance matrix and C = 0.657 for the snail components).
This pattern of strong covariation is evident in the UPGMA dendrograms (Fig. 1) based on the three distance matrices, which are easier to visualize than the distance matrices themselves, and the differentiation snails (Fig. 2) constructed from the three sets of snail components. The dendrograms show weakly defined clusters that vary in topology between the genepool and the topologically identical clusters of the single-locus diplophase and the multilocus diplophase. The snails show rank changes that are based on only slight differences between the snail components.
Figure 1. UPGMA dendrograms at three levels of genetic integration in four oak stands. For six microsatellite loci scored in four stands of oak (R, B, S, E), UPGMA dendrograms were constructed from the matrices of genetic distances Δ between stands in Tab. 1. Within each dendrogram, the quantitative differences between clusters are weak. The genepool dendrogram differs qualitatively, i.e., topologically, from the topologically identical dendrograms of the higher levels. The significantly large increase in the mean pairwise distance, and thus in the length of the dendrograms, with level of integration implies that the stands show differentiation for their forms of homologous gene association and, even more so, nonhomologous association.
Figure 2. Differentiation snails at three levels of genetic integration in four oak stands. For six microsatellite loci scored in four stands of oak (R, B, S, E) the differentiation snails were constructed from the snail components Δ_{j }in Tab. 1. Dotted circles mark the symmetric population differentiation Δ_{SD}. Within each snail, the quantitative differences among the components are slight. Each snail differs qualitatively in the ranking of the stands from the other two (i.e., covariation C < 1 for each comparison). The significantly large increase in the radius Δ_{SD }of the snails with each higher integration level confirms the differentiation among the stands for form of gene association.
It is interesting to compare the observed covariations with the ranges of covariation that occurred for the gene arrangements generated by the 10 000 random permutations. The distance matrices show weaker covariation between the single-locus diplophase and the multilocus diplophase in almost 92% of the permutations (P-value 0.084 for C = 1) but between the genepool and the single-locus diplophase for only 73% (P-value 0.270 for C = 0.893). From the high improbability of the observed perfect covariation (C = 1) between the single-locus diplophase and the multilocus diplophase, it can be inferred that the nonhomologous association has a special relationship to the homologous association in the single-locus diplophase. In contrast, the intermediate P-value for the covariation between the genepool and the single-locus diplophase implies that the homologous association is not predetermined by the collection of genes.
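The relation between the percentages and the P-values quoted here can be made explicit with a small sketch. It assumes, as the reported numbers suggest (about 92% weaker versus a P-value of 0.084), that the P-value is the share of permutations whose value is not smaller than the observed one; the function name and the toy values are hypothetical and do not reproduce the oak data.

```python
def empirical_p_value(observed, permuted):
    """Share of permutation values that are at least as large as the observed value."""
    return sum(1 for value in permuted if value >= observed) / len(permuted)


# Toy covariations from five imagined permutations.
toy_permutations = [0.2, 0.5, 0.9, 1.0, 1.0]
p_value = empirical_p_value(1.0, toy_permutations)
share_weaker = 1.0 - p_value
print(p_value, share_weaker)
```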
The snail components showed a weaker covariation between the single-locus diplophase and the multilocus diplophase for ca. 47% of the permutations (P-value 0.532 for C = 0.995) but between the genepool and the single-locus diplophase only for ca. 9% (P-value 0.912 for C = 0.809). This confirms the stronger effect of homologous association than nonhomologous association on the ranking within the distance matrices. Compared to these, however, the snail components show stronger covariation than observed for a much higher proportion of the permutations, both for homologous and nonhomologous association. Hence, the covariation of the snail components seems to be less sensitive to the effects of gene association than is the covariation of the pairwise distances. This must be due to the equalizing influence of combining three stands for comparison to the fourth that is the basis of the snail components.
Discussion of the application to the oak stands
The differentiation observed among the oak stands increases distinctly from the genepool level to the single-locus diplophase. An even larger jump in differentiation occurs when the nonhomologous association for the multiple loci is included. These are clear indications that all (except for perhaps one) of the stands show deviation from both HWP and gametic equilibrium, and that the degrees of deviation vary considerably among the stands. Such indications could not be confirmed by conventional statistical testing due to the large numbers of degrees of freedom and the implied weakness of the respective test statistics. It might come as a surprise that the application of the special permutation analysis presented above to genetic differences between populations detects association characteristics within populations. Confirmation and exploitation of this statistical potential deserves further investigation.
Consequently, if the four oak stands had been less clearly separated spatially, and if we had wanted to assign the trees to their proper subpopulations, we would have run into problems when making use of methods based on the absence of gene associations within populations. Methods for finding subdivisions of populations that are based on Hardy-Weinberg proportions and gametic equilibrium within populations (e.g. [23-27]) may therefore not have assigned the individuals to their original stands.
When comparing the observed differentiation to that producible by gene association in the stands, all 10 000 permutations agreed with the observation by showing much higher differentiation among the single-locus diplophases than among the genepools, both for the mean pairwise genetic distance and the symmetric population differentiation Δ_{SD}. This tells us that the random generation of gene association never yielded Hardy-Weinberg structures for all loci in all four stands simultaneously, nor was any other form of homologous association that leaves differentiation unchanged (e.g. inbreeding with equal coefficients) realized simultaneously. Furthermore, all nonhomologous associations showed a considerable additional increase in differentiation over the homologous associations, as is seen in the wide separation of the range of differentiation for the single-locus diplophase from the range for the multilocus diplophase. Remarkably, both ranges of differentiation are quite narrow. These results indicate that the increases in differentiation that are realizable by homologous and nonhomologous gene association can be tightly restricted by the genic composition of the populations. In such cases, equal differentiation at consecutive integration levels may not be achievable. Thus it appears that differentiation among populations with respect to their forms of gene association may be a normal occurrence. This insight questions the common practice of restricting the measurement of population differentiation to the allelic level (e.g. F_{ST}), thereby ignoring the considerable effects of gene association on population differentiation. This analysis is the first of its kind. Therefore, we cannot venture a prediction about whether the above findings on covariation between levels of integration constitute a general trend.
It is conceivable, for example, that these findings are mainly determined by the conspicuously large polymorphism characteristic of the microsatellite markers used in this study. Other genetic markers may tell different stories.
Conclusion
This new approach to the analysis of genetic differentiation among populations demonstrates that the consideration of gene associations within populations adds a new quality to studies on population differentiation that is overlooked when viewing only genepools.
Appendix A
Proposition 1: For any two populations, the distance between their (multilocus) genetic structures P and Q at any L gene loci (L ≥ 1) of equal degree of ploidy is not less than the distance between the corresponding gene pools p and q, respectively, that is,
Δ(p, q) = (1/L) ∑_{l} Δ(p^{(l)}, q^{(l)}) ≤ Δ(P, Q),
where the difference between genetic types (haplotypes, diplotypes) is measured by the elementary genic difference d.
Whereas the equality in Proposition 1 follows from the text, proof of the inequality in Proposition 1 depends on a lemma that applies the following notation: For two populations, let G_{x} or G_{y} denote the genetic types of the individuals at L gene loci of degree of ploidy N ≥ 1, yielding K = LN genes per individual. For the relative frequencies P(G_{x}) and Q(G_{x}) of type G_{x} in the first and the second population, respectively (by some ordering), denote the frequency structures of the L-locus types as P and Q. Call the ith allele at the lth locus A_{i;l}. Term the frequency structures of the genetypes A_{i;l} in the L-locus genepools as p and q. A shift transformation s(P, Q) decomposes the set of all genetic types on the basis of their relative frequencies into three sets: the source types G_{x} for which P(G_{x}) > Q(G_{x}) holds, i.e., that show an excess in the first population with respect to the second; the sink types G_{x} for which P(G_{x}) < Q(G_{x}) holds, i.e., that show a deficit in the first population; and those for which P(G_{x}) = Q(G_{x}) holds. In general terms, the excess of type G_{x} is quantifiable as P(G_{x}) − min{P(G_{x}), Q(G_{x})} ≥ 0, with equality to 0 if P(G_{x}) ≤ Q(G_{x}). Likewise, the deficit of type G_{x} is quantifiable as Q(G_{x}) − min{P(G_{x}), Q(G_{x})} ≥ 0, with equality to 0 if P(G_{x}) ≥ Q(G_{x}). For all types G_{x}, s(P, Q) fulfills:
∑_{y} s(G_{x}, G_{y}) = P(G_{x}) − min{P(G_{x}), Q(G_{x})} and ∑_{y} s(G_{y}, G_{x}) = Q(G_{x}) − min{P(G_{x}), Q(G_{x})},
where s(G_{x}, G_{y}) is the relative frequency, among all individuals in the first population, of individuals that are shifted from type G_{x} to type G_{y}.
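The notions of excess, deficit and shift transformation can be illustrated with a short sketch. The construction below is only one of many possible shift transformations (a greedy pairing of sources with sinks) and is in general not a minimum one; the function names and the toy frequencies are hypothetical. Exact fractions are used so that the marginal conditions can be checked without rounding error.

```python
from fractions import Fraction


def excess_and_deficit(P, Q):
    """Excess P(G) - min{P(G), Q(G)} and deficit Q(G) - min{P(G), Q(G)} for every type G."""
    excess, deficit = {}, {}
    for g in sorted(set(P) | set(Q)):
        common = min(P.get(g, 0), Q.get(g, 0))
        excess[g] = P.get(g, 0) - common
        deficit[g] = Q.get(g, 0) - common
    return excess, deficit


def greedy_shift(P, Q):
    """One (not necessarily minimum) shift transformation as a dict {(source, sink): amount}."""
    excess, deficit = excess_and_deficit(P, Q)
    sources = [[g, e] for g, e in excess.items() if e > 0]
    sinks = [[g, d] for g, d in deficit.items() if d > 0]
    shift = {}
    i = j = 0
    while i < len(sources) and j < len(sinks):
        amount = min(sources[i][1], sinks[j][1])
        shift[(sources[i][0], sinks[j][0])] = amount
        sources[i][1] -= amount
        sinks[j][1] -= amount
        if sources[i][1] == 0:
            i += 1  # this source has sent away its whole excess
        if sinks[j][1] == 0:
            j += 1  # this sink has received its whole deficit
    return shift


# Toy single-locus genotype frequencies (invented for illustration).
P = {"AA": Fraction(1, 2), "AB": Fraction(1, 2)}
Q = {"AA": Fraction(1, 4), "AB": Fraction(1, 4), "BB": Fraction(1, 2)}
s = greedy_shift(P, Q)
print(s)
```

In this example AA and AB are source types, BB is the only sink type, and no type sends and receives at the same time.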
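Lemma 1: Consider any shift transformation s(P, Q) between the L-locus genetic structures. The genetic distance between the corresponding allelic structures p^{(l)} and q^{(l)} at locus l is expressible as:
Δ(p^{(l)}, q^{(l)}) = ½ ∑_{i} |p_{i}^{(l)} − q_{i}^{(l)}| = ½ ∑_{i} |α(A_{i;l}, •) − α(•, A_{i;l})|,
where p_{i}^{(l)} and q_{i}^{(l)} are the relative frequencies of allele A_{i;l} at locus l in the first and the second population, respectively, where the α are defined as:
α(A_{i;l}, •) = (1/N) ∑_{x,y} n_{i;l}(G_{x})·s(G_{x}, G_{y}) and α(•, A_{i;l}) = (1/N) ∑_{x,y} n_{i;l}(G_{y})·s(G_{x}, G_{y}),
and where n_{i;l}(G_{x}) is the number of genes of allelic type A_{i;l} in type G_{x}.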
Proof: Note that since an allele A_{i;l }can be present in both source and sink types, α(A_{i;l}, •) > 0 and α(•, A_{i;l}) > 0 can hold simultaneously. It follows that
■
Note that s(G_{x}, G_{y}) > 0 is true only if G_{x} is a source type and G_{y} a sink type. Thus α(A_{i;l}, •) quantifies the total number of A_{i;l}-genes in the original (source) types of all shifted individuals, divided by the total number of genes at locus l in the first population (= N · population size). Analogously, α(•, A_{i;l}) quantifies the number of A_{i;l}-genes in the new (sink) types of all shifted individuals, divided by the same total number of genes. Their difference is the net frequency with which this allele was shifted.
Proof of Proposition 1: For any shift transformation s(P, Q), it follows from Lemma 1 and the definition of the α that:
The final equality follows from the definition of d(G_{x}, G_{y}) in the text. Since this holds for any shift transformation, it also holds if s(P, Q) is a minimum shift transformation, in which case ∑_{x,y}d(G_{x}, G_{y})·s(G_{x}, G_{y}) = Δ(P, Q). Therefore, it follows that Δ(p, q) ≤ Δ(P, Q), as claimed. ■
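The argument can be checked numerically on a small case. The sketch below is restricted to a single diploid locus (L = 1, N = 2) and rests on two assumptions that are labeled as such: the elementary genic difference d between two genotypes is taken to be the proportion of genes by which they differ, and the α are computed from their verbal description as the number of genes of an allele in the source (or sink) types of all shifted individuals, divided by N. The toy structures and all function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def genic_difference(gx, gy):
    """Assumed d: share of the genes of gx without an identical partner in gy."""
    shared = sum((Counter(gx) & Counter(gy)).values())
    return Fraction(len(gx) - shared, len(gx))


def allele_frequencies(structure):
    """Allele frequencies in the genepool of a genotypic structure."""
    freq = Counter()
    for genotype, f in structure.items():
        for allele in genotype:
            freq[allele] += f / len(genotype)
    return freq


def genepool_distance(P, Q):
    """Half the sum of absolute allele frequency differences."""
    p, q = allele_frequencies(P), allele_frequencies(Q)
    return sum(abs(p[a] - q[a]) for a in set(p) | set(q)) / 2


def alpha_source(shift, allele):
    """Genes of the given allele in the source types of all shifted individuals, divided by N."""
    return sum(Fraction(gx.count(allele), len(gx)) * amt for (gx, gy), amt in shift.items())


def alpha_sink(shift, allele):
    """Genes of the given allele in the sink types of all shifted individuals, divided by N."""
    return sum(Fraction(gy.count(allele), len(gy)) * amt for (gx, gy), amt in shift.items())


def shift_cost(shift):
    """Sum of d(G_x, G_y) * s(G_x, G_y) over all shifted pairs."""
    return sum(genic_difference(gx, gy) * amt for (gx, gy), amt in shift.items())


# All individuals heterozygous in the first population, all homozygous in the second.
AB, AA, BB = ("A", "B"), ("A", "A"), ("B", "B")
P = {AB: Fraction(1)}
Q = {AA: Fraction(1, 2), BB: Fraction(1, 2)}
s = {(AB, AA): Fraction(1, 2), (AB, BB): Fraction(1, 2)}  # the only possible shift here

p, q = allele_frequencies(P), allele_frequencies(Q)
for allele in "AB":
    # Net shifted frequency of the allele equals its frequency difference between the genepools.
    print(allele, alpha_source(s, allele) - alpha_sink(s, allele), p[allele] - q[allele])
print(genepool_distance(P, Q), shift_cost(s))
```

Here the genepools are identical, so the genepool distance is zero, while every shift between the genotypic structures has positive cost; both alleles occur in source as well as sink types, which is the situation in which the inequality is strict.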
In Proposition 1, equality holds if and only if for each genetype A_{i;l}, the expression
n_{i;l}(G_{x}) − n_{i;l}(G_{y})
has the same sign for all pairs of types G_{x}, G_{y}. This distinguishes three special groups of genes: Genes A_{i;l }for which the expression equals zero for all pairs of types G_{x}, G_{y}, implying that A_{i;l }is equally frequent in the two populations and therefore shows no net shift; genes A_{i;l }for which the expression is ≥ 0 but not ≡ 0 for all x, y, that is, that are never less frequent in source types G_{x }than in the corresponding sink types G_{y}, making them source genes; genes A_{i;l }for which the expression is ≤ 0 but not ≡ 0 for all x, y, making them sink genes. (Note that a gene need not belong to any of the three groups, as is demonstrated by s(A_{i;l}A_{j;l}, A_{j;l}A_{j;l}) > 0 and s(A_{i;l}A_{j;l}, A_{i;l}A_{i;l}) > 0.)
Appendix B
Proposition 2: For any two populations, the distance between their (multilocus) genetic structures P and Q at any L gene loci (L ≥ 1) of equal degree of ploidy N ≥ 1 is not less than the mean distance between the corresponding single-locus structures P^{(l)} and Q^{(l)}, respectively, that is,
Δ(P, Q) ≥ (1/L) ∑_{l} Δ(P^{(l)}, Q^{(l)}),
where the difference between genetic types is measured by the elementary genic difference d.
The validity of Proposition 2 for L = 1 is obvious. For L ≥ 2, proof depends on four lemmata that apply the following notation: Let s(P, Q) be a shift transformation between the Llocus genotypic structures. Denote the various Llocus types as G_{x }or G_{y}, and write each type G_{x }as the "product" of its projection to the singlelocus type at loci l = 1 and its projection to the complementary (L  1)locus type. Denote the singlelocus types at locus l as or and the complementary types as or . Define
as the marginal sum of all shifts that involve the type at locus l in the source type G_{x }and in the sink type G_{y}.
Lemma 2 The difference between the marginal sums for any u equals the net shift for any shift transformation s_{l }at the locus.
Proof: For the lth locus it holds that:
Their difference equals:
The same difference results for any shift transformation s_{l }at a locus l, since:
■
Even though marginal sums share this property with any shift transformation at the locus, the following lemma shows that marginal sums may not specify a shift transformation.
Lemma 3: The marginal sums of all types , at locus l may shift an amount that is in excess of the amount required of any shift transformation at the locus.
Proof: The total amount shifted away from any type at locus l equals
By the same reasoning, the amount received by equals
These inequalities contradict the equality required of a shift transformation. ■
Lemma 3 shows that the marginal sums may shift too much, and it is easy to construct examples for which this is the case. Excess amounts must be due to the appearance of one or more single-locus types both in two-locus source types and in two-locus sink types. This makes them both sources and sinks in the marginal sums, in violation of the properties of a shift transformation. The three ways in which a type can act as both a source and a sink are:
The following lemma shows how to eliminate all ambivalent source/sink relationships from the marginal sums without changing the net amount shifted, i.e., amount sent away as a source minus the amount received as a sink.
Lemma 4: The marginal sums of all types , at locus l can be used to construct a quasishift κ_{l}(P^{(l)}, Q^{(l)}) with the following three properties:
Proof by construction: Consider the following algorithm:
Step 1: If holds for a type , set . Since , this has no effect on the sum . Repeat for an additional type fulfilling the condition. If none exist, go to Step 2.
Step 2: If and hold for a ≠b, set
it follows that
Set
Repeat for an additional pair of types that fulfill the condition. If none exist, go to Step 3.
Step 3: If and hold for three different indices a, b, c, subtract from both and add M to the "direct route" from to , i.e., set
Because d is a metric distance, implying
it holds that
from which it follows that
Set
If , go to Step 2. Otherwise, repeat Step 3 for another triplet of types fulfilling the condition. If none exists, STOP.
At each step, decreases or remains constant, yielding
After completion, either or or both hold for all u, meaning that no type is both a source and a sink. The net quasishift for each u remains constant throughout the algorithm, equaling by Lemma 2. Thus the quasishifts κ_{l}(, _{}) fulfill the properties, as claimed. ■
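The idea behind the construction can be sketched in code. The following is a simplified variant under stated assumptions, not a transcription of the algorithm above: amounts shifted from a type to itself are dropped, and then each type in turn is relieved of its double role by rerouting an amount M that it both receives (from a) and sends (to c) along the direct route from a to c; if a and c coincide, the two opposite amounts simply cancel. The dictionary representation, the function names and the toy values are hypothetical.

```python
from fractions import Fraction


def remove_ambivalence(kappa):
    """Reroute amounts until no type both sends and receives.

    kappa: dict {(u, v): amount shifted from type u to type v}.
    Returns a new dict with the same net amount shifted for every type.
    """
    # Self-shifts cost nothing (d(u, u) = 0) and do not affect any net shift.
    k = {(u, v): amt for (u, v), amt in kappa.items() if amt > 0 and u != v}
    for b in sorted({t for edge in kappa for t in edge}):
        while True:
            incoming = [edge for edge in k if edge[1] == b]
            outgoing = [edge for edge in k if edge[0] == b]
            if not incoming or not outgoing:
                break  # b is now only a source, only a sink, or not involved
            a, c = incoming[0][0], outgoing[0][1]
            m = min(k[(a, b)], k[(b, c)])
            for edge in ((a, b), (b, c)):
                k[edge] -= m
                if k[edge] == 0:
                    del k[edge]
            if a != c:
                # Direct route from a to c replaces the detour via b.
                k[(a, c)] = k.get((a, c), 0) + m
    return k


def net_shift(k, u):
    """Amount sent away by u minus amount received by u."""
    return sum(amt for (x, y), amt in k.items() if x == u) - sum(
        amt for (x, y), amt in k.items() if y == u)


def cost(k, d):
    """Total cost of the shifts under the difference measure d."""
    return sum(d(x, y) * amt for (x, y), amt in k.items())


# Types 0, 1, 2 stand for the numbers of copies of one allele (e.g. BB, AB, AA),
# so that d(u, v) = |u - v| / 2 is a metric on them.
def d(u, v):
    return Fraction(abs(u - v), 2)


kappa = {(0, 1): Fraction(1, 4), (1, 2): Fraction(1, 4),
         (2, 0): Fraction(1, 8), (1, 1): Fraction(1, 8)}
cleaned = remove_ambivalence(kappa)
print(cleaned, cost(kappa, d), cost(cleaned, d))
```

Because d fulfills the triangle inequality, replacing a detour by the direct route never increases the cost, and every rerouting leaves the net amount shifted for each type unchanged.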
Lemma 5: The quasishifts κ_{l}(, ) constructed in Lemma 4 specify a shift transformation s_{l}(P^{(l)}, Q^{(l)}) for locus l for which it holds that
Proof: As proven in Lemma 4, for the quasishifts it holds that
and either or or both. There are three cases:
These three cases can be combined to the expression
Therefore, the quasishifts κ_{l}(, ) fulfill the definition of alpha shift transformation at locus l. Defining the shift and denoting , it follows from Lemma 4 that ■
With the help of the lemmata, Proposition 2 can now be proven:
Proof of Proposition 2: Let s(P, Q) be a shift transformation between the two Llocus genotypic structures. Denoting the Llocus types as G_{x }or G_{y}, their projections to locus l as or , and the various singlelocus types at locus l as or , it holds that
where s_{l}(P^{(l)}, Q^{(l)}) is the shift transformation constructed in Lemma 4. Since the inequality holds in particular if s(P, Q) is a minimal shift transformation, it follows, as claimed, that ■
Equality holds in Proposition 2 whenever the marginal sums for each locus l = 1,...,L specify a minimal shift transformation, i.e., when .
Authors' contributions
HRG conceived of the approach and drafted the Background and Methods. EG formulated and proved the Theorem, programmed the software, analyzed the data, and drafted the Results and Appendices. Both authors read and approved the final manuscript.
Acknowledgements
The authors gratefully acknowledge the comments of two anonymous reviewers which helped considerably in improving the presentation of our concepts. This work was partially funded by grant Zi 662/51 from the Deutsche Forschungsgemeinschaft.
References

Hanski I: Metapopulation dynamics.
Nature 1998, 396:41-49.

Cold Spring Harbor Symposia on Quantitative Biology 1959, 24:1-14.

de Winter W: The Beanbag Genetics Controversy: Towards a synthesis of opposing views of natural selection.
Biology and Philosophy 1997, 12:149-184.

Crow JF: The beanbag lives on.
Nature 2001, 409:771.

Rao CR: Diversity and dissimilarity coefficients: a unified approach.
Theoretical Population Biology 1982, 21:24-43.

Excoffier L, Smouse PE, Quattro JM: Analysis of molecular variance inferred from metric distances among DNA haplotypes: Application to human mitochondrial DNA restriction data.
Genetics 1992, 131:479-491.

Smouse PE, Peakall R: Spatial autocorrelation analysis of individual multiallele and multilocus genetic structure.
Heredity 1999, 82:561-573.

Gregorius HR, Gillet EM, Ziehe M: Measuring differences of trait distributions between populations.
Biometrical Journal 2003, 45:959-973.

Gillet EM, Gregorius HR, Ziehe M: May inclusion of trait differences in genetic cluster analysis alter our views?
Forest Ecology and Management 2004, 197:149-158.

Hitchcock FL: Distribution of a product from several sources to numerous localities.

Gillet EM: DeltaS, a program to calculate the measure of pairwise distance Δ between populations. [http://www.unigoettingen.de/de/95605.html]

Gregorius HR: Genetischer Abstand zwischen Populationen. I. Zur Konzeption der genetischen Abstandsmessung. [http://www.bfafh.de/inst2/sgpdf/23_13_22.pdf]

Gordon AD: Hierarchical classification. In Clustering and Classification. Edited by Arabie P, Hubert LJ, Soete GD. Singapore etc.: World Scientific; 1996:65-121.

Wright S: Evolution and the Genetics of Populations. Volume 2. Chicago: University of Chicago Press; 1969.

Nei M: Analysis of gene diversity in subdivided populations.
Proceedings of the National Academy of Sciences USA 1973, 70:3321-3323.

Slatkin M: A measure of population subdivision based on microsatellite allele frequencies.
Genetics 1995, 139:457-462.

Gregorius HR, Roberds JH: Measurement of genetical differentiation among subpopulations.
Theoretical and Applied Genetics 1986, 71:826-834.

Gregorius HR: Differentiation between populations and its measurement.

Gregorius HR, Degen B, König A: Problems in the analysis of genetic differentiation among populations – a case study in Quercus robur. [http://www.bfafh.de/inst2/sgpdf/56_34_190.pdf]

Hubálek Z: Coefficients of association and similarity, based on binary (presence-absence) data: an evaluation.
Biological Reviews 1982, 57:669-689.

Degen B, Streiff R, Ziegenhagen B: Comparative study of genetic variation and differentiation of two pedunculate oak (Quercus robur) stands using microsatellite and allozyme loci.
Heredity 1999, 83:597-603.

Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data.
Genetics 2000, 155:945-959.

Falush D, Stephens M, Pritchard JK: Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies.
Genetics 2003, 164:1567-1587.

Corander J, Waldmann P, Sillanpää MJ: Bayesian analysis of genetic differentiation between populations.
Genetics 2003, 163:367-374.

Holsinger KE, Wallace LE: Bayesian approaches for the analysis of population genetic structure: an example from Platanthera leucophaea (Orchidaceae).
Molecular Ecology 2004, 13:887-894.

Guillot G, Estoup A, Mortier F, Cosson JF: A spatial statistical model for landscape genetics.
Genetics 2005, 170:1261-1280.