Cincinnati Children's Hospital Medical Center, Department of Pediatrics, University of Cincinnati, Cincinnati, OH, USA

Department of Epidemiology, University of Alabama at Birmingham, Birmingham, AL, USA

Department of Biology, University of Northern Iowa, Cedar Falls, IA, USA

Center for Computational Genomics, Institute of Applied Genetics, Department of Forensic and Investigative Genetics, University of North Texas Health Science Center, Fort Worth, TX, USA

Abstract

Background

Admixture mapping is a powerful gene mapping approach for an admixed population formed from ancestral populations with different allele frequencies. The power of this method relies on the ability of ancestry informative markers (AIMs) to infer ancestry along the chromosomes of admixed individuals. In this study, more than one million SNPs from HapMap databases and simulated data have been interrogated in admixed populations using various measures of ancestry informativeness: Fisher Information Content (FIC), Shannon Information Content (SIC), F statistics (F_{ST}), Informativeness for Assignment Measure (I_{n}), and the Absolute Allele Frequency Differences (delta, δ). The objectives are to compare these measures of informativeness to select SNP markers for ancestry inference, and to determine the accuracy of AIM panels selected by each measure in estimating the contributions of the ancestors to the admixed population.

Results

F_{ST} and I_{n} had the highest Spearman correlation and the best agreement as measured by Kappa statistics based on deciles. Although the different measures of marker informativeness performed comparably well, analyses based on the top 1 to 10% ranked informative markers of simulated data showed that I_{n} was better in estimating ancestry for an admixed population.

Conclusions

Although millions of SNPs have been identified, only a small subset needs to be genotyped in order to accurately predict ancestry with a minimal error rate in a cost-effective manner. In this article, we compared various methods for selecting ancestry informative SNPs using simulations as well as SNP genotype data from samples of admixed populations and showed that the I_{n} measure estimates ancestry proportion (in an admixed population) with lower bias and mean square error.

Background

Admixture is a common form of gene flow between populations. It refers to the process in which two or more genetically and phenotypically diverse populations with different allele frequencies mate and form a new, mixed or 'hybrid' population.

Several measures of marker informativeness for ancestry have been developed to select the most ancestry informative markers (reviewed in Rosenberg et al., 2003). These include the absolute allele frequency difference (δ), Fisher Information Content (FIC), Shannon Information Content (SIC), F statistics (F_{ST}), and the Informativeness for Assignment Measure (I_{n}). The cutoff value for δ is highly subjective and has steadily decreased over time from ≥ 0.5 in earlier studies; cutoffs of F_{ST} ≥ 0.4 and I_{n} ≥ 0.3 have also been used. F_{ST}, FIC, SIC, and I_{n} can be applied to select informative markers for admixed populations formed from two or more ancestral populations. For FIC and SIC indices, ancestral proportions in the admixed population need to be specified.

In spite of numerous studies with these measures of marker informativeness for ancestry, several questions have not been systematically addressed. How often are the same sets of SNPs selected by the different methods? To what degree do they overlap and share common sets of SNPs? How do AIM panels selected by these different methods perform in estimating ancestral population contributions under different proportions of ancestral populations in an admixed population? With so many measures to choose from, it is important to understand their common features as well as where they differ in terms of SNP selection. Answering these questions with a systematic study would help users choose appropriate measures in a cost-effective manner. In the absence of a comprehensive comparative study on the performance of the different marker informativeness measures in marker selection, researchers have selected markers using only the measure of their personal choice. For example, the three major U.S. admixture mapping research groups led by David Reich, Michael Seldin and Mark Shriver used SIC, F_{ST}, and δ, respectively, in their recent independent admixture mapping panels for Latino populations.

Results

SNP allele frequencies and comparisons of informative marker selection measures

There are 1,362,723 and 1,450,896 autosomal SNPs in the HapMap phase III release #3 dataset for the CEU and YRI populations, respectively. For δ, F_{ST}, FIC, SIC, and I_{n}, the summary values (Table S1) were 0.19, 0.07, 0.35, 0.03, and 0.06, respectively. The majority of the markers contained a small amount of ancestry information, suggesting a very high similarity in allele frequencies among common variants (frequency > 5%) in the CEU and YRI populations.

**Chromosome length and number of SNP markers across the genome in CEU and YRI population of the HapMap Phase III dataset**. The last five columns give the number of SNPs genotyped in both populations ^{a} and fulfilling the filtering criteria ^{b}, by measure of informativeness.

| Chr # | Length (Mb) | SNPs in CEU | SNPs in YRI | SNPs genotyped in both populations ^{a} | Delta | F_{ST} | FIC | SIC | I_{n} |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 246.6 | 111887 | 120349 | 103330 | 1663 | 1657 | 1653 | 1651 | 1654 |
| 2 | 242.7 | 113613 | 122377 | 106053 | 1742 | 1751 | 1742 | 1740 | 1749 |
| 3 | 199.3 | 94608 | 101070 | 88060 | 1464 | 1447 | 1448 | 1450 | 1450 |
| 4 | 191.2 | 85403 | 92052 | 79856 | 1390 | 1394 | 1387 | 1398 | 1392 |
| 5 | 180.6 | 87071 | 92350 | 81083 | 1310 | 1298 | 1314 | 1302 | 1304 |
| 6 | 170.7 | 91415 | 95108 | 84536 | 1241 | 1235 | 1239 | 1231 | 1236 |
| 7 | 158.7 | 75234 | 79231 | 69766 | 1139 | 1144 | 1152 | 1157 | 1148 |
| 8 | 146.2 | 74443 | 79368 | 69177 | 1053 | 1055 | 1053 | 1048 | 1054 |
| 9 | 140.2 | 63507 | 66437 | 58713 | 842 | 841 | 834 | 844 | 845 |
| 10 | 135.3 | 72846 | 76787 | 67388 | 979 | 988 | 980 | 980 | 988 |
| 11 | 134.3 | 69175 | 73942 | 64140 | 986 | 987 | 988 | 990 | 981 |
| 12 | 132.3 | 67486 | 70909 | 62018 | 988 | 981 | 973 | 972 | 977 |
| 13 | 96.2 | 51879 | 55392 | 48496 | 719 | 726 | 701 | 713 | 723 |
| 14 | 88.2 | 44570 | 47354 | 41434 | 646 | 646 | 650 | 648 | 650 |
| 15 | 82.0 | 40705 | 43837 | 37774 | 590 | 596 | 596 | 590 | 595 |
| 16 | 88.7 | 42738 | 46366 | 39616 | 568 | 559 | 554 | 555 | 557 |
| 17 | 78.6 | 36534 | 39075 | 33576 | 572 | 569 | 580 | 574 | 570 |
| 18 | 76.1 | 40153 | 43467 | 37669 | 562 | 568 | 564 | 561 | 572 |
| 19 | 63.6 | 25251 | 26798 | 23264 | 405 | 404 | 404 | 407 | 405 |
| 20 | 62.4 | 35252 | 37476 | 32870 | 447 | 447 | 446 | 448 | 443 |
| 21 | 37.1 | 19336 | 20556 | 18105 | 252 | 250 | 254 | 249 | 249 |
| 22 | 35.1 | 19617 | 20595 | 17817 | 249 | 253 | 253 | 252 | 249 |
| Total | | 1362723 | 1450896 | 1264741 | 19807 | 19796 | 19765 | 19760 | 19791 |

^{a} Two inclusion criteria are in effect: 1) the SNP is shared by both the YRI and CEU populations, and 2) the SNP has a missing frequency of less than 10% of the samples.

^{b }Every SNP is at least 100 kb from its nearest neighbor.
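
The 100 kb spacing criterion in footnote b can be illustrated with a greedy thinning pass. This is only a sketch under stated assumptions: the function name `thin_by_distance`, the (position, score) input layout, and the greedy highest-score-first order are illustrative choices, not necessarily the procedure used to build the table.

```python
def thin_by_distance(markers, min_gap_bp=100_000):
    """Greedy thinning on one chromosome (assumed procedure).

    markers: list of (position_bp, score) tuples.
    Markers are visited from the highest score down, and a marker is kept
    only if it lies at least min_gap_bp from every marker already kept.
    """
    kept = []
    for pos, score in sorted(markers, key=lambda m: -m[1]):
        if all(abs(pos - kept_pos) >= min_gap_bp for kept_pos, _ in kept):
            kept.append((pos, score))
    return sorted(kept)  # report in genomic order


# Hypothetical positions and informativeness scores.
markers = [(10_000, 0.9), (60_000, 0.8), (150_000, 0.7), (400_000, 0.2)]
selected = thin_by_distance(markers)
print(selected)
```

In this toy input the marker at 60,000 bp is dropped because it lies within 100 kb of a higher-scoring neighbor.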

**Table S1: Summary statistics of five measures of marker informativeness for CEU and YRI population in the HapMap phase III data**. A table of mean, standard deviation, minimum, median, maximum, and lower and upper quartile of the five measures of marker informativeness for CEU and YRI population.

**Distribution of the five measures of marker informativeness for CEU and YRI population from HapMap phase III data**. The majority of the SNP markers display low to moderate estimates of genetic informativeness with few markers displaying high levels of population differentiation.

For the CHB and JPT populations, the distribution of the five measures of marker informativeness is shown in Figure S1, with summary statistics in Table S2.

**Figure S1: Distribution of the five measures of marker informativeness for CHB and JPT population from HapMap phase III data**. Histograms of the five measures of marker informativeness. Almost all the SNP markers displayed low estimates of genetic informativeness.

**Table S2: Summary statistics of five measures of marker informativeness for CHB and JPT population in the HapMap phase III data**. A table of mean, standard deviation, minimum, median, maximum, and lower and upper quartile of the five measures of marker informativeness for CHB and JPT population.

Correlation, concordance, and overlapping analysis

Spearman correlation

To assess the level of similarity of the estimates of genetic information contained in each SNP marker across the different selection methods, the Spearman correlation coefficient was calculated for the estimates of informativeness from the different selection methods for the CEU and YRI populations. A 3D scatter plot of the CEU and YRI allele frequencies against each measure showed similar symmetric patterns for F_{ST} and I_{n}, whereas FIC and SIC exhibited somewhat asymmetric patterns. Pairwise scatterplots of the five measures of informativeness showed that the measures had high levels of correlation, with Spearman coefficients ranging from 0.9512 between δ and FIC to 0.9994 between F_{ST} and I_{n}. F_{ST} and I_{n} had an almost perfect monotonically increasing relationship.
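
As a rough illustration of the rank-based comparison described above, the following sketch computes a Spearman coefficient from two lists of per-SNP scores using only the standard library. The helper names and the toy values are hypothetical; ties are given their average rank, which is one common convention and an assumption here.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Toy scores for four SNPs; the second list is a monotone transform of the first.
delta_scores = [0.1, 0.4, 0.8, 0.3]
fst_scores = [0.01, 0.16, 0.64, 0.09]
rho = spearman(delta_scores, fst_scores)
print(rho)
```

Because the coefficient depends only on ranks, any two measures that order the SNPs identically give a value of 1 even when their numeric scales differ.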

**3D scatter plot of CEU and YRI allele frequencies and the five measures of informativeness**. The two horizontal axes are frequencies of alleles shared by the two populations and the vertical axis is the calculated values of marker informativeness for ancestry by the five different measures. Similar symmetric patterns were observed between F_{ST} and I_{n}, and FIC and SIC exhibited somewhat asymmetric patterns.

**Scatter plots of the five measures of marker informativeness with nonparametric quantile density**. In each panel, r is the Spearman correlation coefficient, which ranged from 0.9512 between δ and FIC to 0.9994 between F_{ST} and I_{n}. F_{ST} and I_{n} had an almost perfect monotonically increasing relationship.

Concordance by deciles

In the mosaic plots of the five measures grouped by deciles, most SNPs in the first group of F_{ST} also fell into the first group of δ, with a few falling into the 2^{nd} group of δ. However, even though some of the SNPs in the 2^{nd} group of FIC fell into the first group of δ, some in that same group fell into as high as the 6^{th} group of δ. The high concordance at the edges of the mosaic plots may be due to an edge effect. Again, F_{ST} and I_{n} showed very high concordance, which indicates that the two measures identify AIM SNPs in a similar manner. This similarity in picking informative markers is also reflected in the high correlation coefficient between these two measures. By the Kappa statistic (Table S3), F_{ST} and I_{n} had the best agreement (kappa = 0.93). F_{ST} and SIC, and SIC and I_{n}, also showed good agreement, with Kappa statistics of 0.85 and 0.86, respectively. Delta, with Kappa statistics between 0.42 and 0.47, had relatively poor agreement with the other four measures.
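
The decile-based agreement can be sketched as follows. The code assigns each SNP to one of ten groups per measure and computes an unweighted Cohen's kappa between two groupings; whether the reported Kappa statistics were weighted or unweighted is not stated here, so the unweighted form is an assumption, as are the function names.

```python
from collections import Counter


def decile_groups(scores):
    """Group 1 holds the highest-scoring tenth of SNPs, group 10 the lowest."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: -scores[i])
    groups = [0] * n
    for rank, idx in enumerate(order):
        groups[idx] = rank * 10 // n + 1
    return groups


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa between two categorical assignments."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / n ** 2
    return (observed - expected) / (1 - expected)


# Toy scores for 20 SNPs under three hypothetical measures.
scores = [i / 20 for i in range(20)]
same_order = [s ** 2 for s in scores]   # identical ranking
reversed_order = [-s for s in scores]   # opposite ranking

kappa_same = cohen_kappa(decile_groups(scores), decile_groups(same_order))
kappa_reversed = cohen_kappa(decile_groups(scores), decile_groups(reversed_order))
print(kappa_same, kappa_reversed)
```

Two measures that rank SNPs identically reach kappa = 1, while a reversed ranking falls below zero, that is, below chance agreement.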

**Mosaic plot of the five measures of marker informativeness grouped by their deciles**. Each mosaic plot was first divided horizontally into ten bars with equal width representing the ten groups of a measure of informativeness. The first group (left most bar) contained the SNPs with highest information and the last group (right most bar) contained the SNPs with lowest information. Each bar was then split vertically into different colored segments whose heights were proportional to the probabilities associated with the second measure of informativeness, conditional on the first measure. Higher concordance was observed at the two ends of the informative scale.

**Table S3: Kappa statistics of the five measures of informativeness as defined by deciles**. A table of pair-wise Kappa statistics of the five measures of informativeness.

In the scatter plot of CEU and YRI allele frequencies partitioned by the ten decile groups of each measure (Figure S2), F_{ST} and I_{n} exhibited very similar partition patterns, which is consistent with what we observed using the Spearman correlation coefficient, mosaic plots, and Kappa statistics. The scatter plot also shows that FIC favors the selection of markers that are closer to fixation in one of the populations.

**Figure S2: Scatter plot of allele frequencies of CEU and YRI population partitioned by the ten groups defined by deciles of each measure of informativeness**. The top-left and bottom-right corner represent the most informative SNPs whereas the least informative SNPs reside at the center of the plot.

Overlapping

The overlap of the top n AIMs selected by the five measures was examined for several values of n. Across the different values of n, δ, F_{ST}, and I_{n} were more likely to pick the same set of SNPs. As the number of top AIMs increased, FIC was more likely to choose SNPs that were not chosen by any other measure.
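
One way to tabulate this kind of overlap is to give every selected SNP a five-digit binary pattern, one digit per measure. The sketch below assumes the digit order Delta, F_{ST}, FIC, SIC, I_{n} (the usual listing order in this article); the SNP identifiers and scores are hypothetical.

```python
from collections import Counter

# Assumed digit order of the binary pattern.
MEASURES = ["Delta", "FST", "FIC", "SIC", "In"]


def top_n(scores, n):
    """Return the n highest-scoring SNP identifiers for one measure."""
    return set(sorted(scores, key=scores.get, reverse=True)[:n])


def overlap_patterns(score_tables, n):
    """Count binary selection patterns over all SNPs picked by any measure."""
    tops = [top_n(score_tables[m], n) for m in MEASURES]
    selected = set().union(*tops)
    return Counter(
        "".join("1" if snp in top else "0" for top in tops) for snp in selected
    )


# Toy scores: measure -> {SNP id -> informativeness}.
score_tables = {
    "Delta": {"snp_a": 0.9, "snp_b": 0.5, "snp_c": 0.1},
    "FST": {"snp_a": 0.8, "snp_b": 0.3, "snp_c": 0.1},
    "FIC": {"snp_a": 0.2, "snp_b": 0.1, "snp_c": 0.7},
    "SIC": {"snp_a": 0.6, "snp_b": 0.2, "snp_c": 0.5},
    "In": {"snp_a": 0.7, "snp_b": 0.2, "snp_c": 0.1},
}
patterns = overlap_patterns(score_tables, n=1)
print(patterns)
```

In this toy case the pattern `11011` marks a SNP chosen by every measure except FIC, and `00100` marks a SNP chosen by FIC alone.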

**Overlap of top n AIMs selected by different measures of informativeness**. For (a) n = 1, (b) n = 5, (c) n = 10, (d) n = 20, (e) n = 50, and (f) n = 100, a 5-digit binary vector was assigned to each SNP, where each digit represents a measure, and they are, from the first to the last, Delta, F

Discrimination analysis and estimation of ancestral contribution

Discrimination analysis

Figure

**Classification accuracy for ancestral population vs. number of top AIMs used by the five measures of informativeness**. (a) CEU vs. YRI and (b) CHB vs. JPT population.

**Figure S3: Number of AIMs needed to achieve specific accuracies for founder populations**. The two founder populations are (a) CEU and YRI and (b) CHB and JPT.

Estimation of ancestral contribution in admixed populations with top AIMs

For the CEU, YRI and ASW populations (Figures S4 and S5), AIMs selected by I_{n} performed slightly better than those selected by other measures of informativeness.

**Figure S4: Inferred population structure for CEU, YRI and ASW population with two clusters and 200 AIMs selected by FIC**. A plot of the inferred population structure of CEU, YRI and ASW population. The analysis was done in STRUCTURE and

**Figure S5: Estimate of ancestry contribution vs. number of top AIMs for CEU, YRI and ASW population from HapMap phase III data**. Top panel: estimate of CEU contribution for CEU population. Middle panel: estimate of YRI contribution for YRI population. Bottom panel: estimate of YRI contribution for ASW population.

For the simulated admixed population from CEU and YRI, a random sample of 100 individuals was extracted. The true average ancestry contribution was 70:30. Additional file _{n }with 20 AIMs, 0.02 by SIC and FIC with 20 AIMs, and 0.01 by F_{ST }and Delta with 50 AIMs. Using the top 20 AIMs, RMSE's were 0.095, 0.095, 0.100, 0.093, and 0.089 for δ, F_{ST}, FIC, SIC, and I_{n}, respectively. Figure

**Figure S6: Absolute error in the estimation of mean ancestry contribution for the simulated admixed populations**. A plot of absolute error in the admixed population simulated from (a) CEU and YRI and (b) CHB and JPT.
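
The two error summaries used in this section, the absolute error of the mean ancestry contribution and the RMSE, can be sketched as below. The definitions are the standard ones and are assumed to match those used here; the individual-level values are invented for illustration.

```python
def abs_error_of_mean(true_values, estimates):
    """Absolute difference between the estimated and true mean contribution."""
    true_mean = sum(true_values) / len(true_values)
    est_mean = sum(estimates) / len(estimates)
    return abs(est_mean - true_mean)


def rmse(true_values, estimates):
    """Root mean square error across individuals."""
    squared = [(t - e) ** 2 for t, e in zip(true_values, estimates)]
    return (sum(squared) / len(squared)) ** 0.5


# Hypothetical true and estimated individual ancestry contributions.
true_values = [0.6, 0.7, 0.8]
estimates = [0.5, 0.7, 0.9]

mean_error = abs_error_of_mean(true_values, estimates)
individual_rmse = rmse(true_values, estimates)
print(mean_error, individual_rmse)
```

The toy data show why both summaries are useful: individual errors in opposite directions cancel in the mean, so the error of the mean is essentially zero while the RMSE is not.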

**Individual true ancestry contributions and estimated contributions using top 20 AIMs of the simulated admixed population from CEU and YRI**. Top-left panel: histogram of individual true ancestry contributions. Top-middle panel to bottom-right panel: scatter plot of individual true ancestry contributions vs. individual estimated contributions using top 20 AIMs selected by Delta, F_{ST}, FIC, SIC, and I_{n}, respectively.

For the simulated admixed population from CHB and JPT, a random sample of 100 individuals was extracted. The true average ancestry contribution was 72:28 for the simulated admixed population. Absolute errors in the estimation of the ancestry contribution for the simulated population with up to top 100 AIMs selected by different measures of informativeness are given in Additional file _{ST }with 50 AIMs, 0.15 by FIC with 20 AIMs, 0.07 by SIC with 50 AIMs, and 0.01 by I_{n }with 50 AIMs. Using top 50 AIMs, RMSE's were 0.218, 0.182, 0.306, 0.186, and 0.170 for δ, F_{ST}, FIC, SIC, and I_{n}, respectively. Figure _{n }gave the lowest bias and RMSE using only relatively small AIM panels.

**Individual true ancestry contributions and estimated contributions using top 50 AIMs of the simulated admixed population from CHB and JPT**. Top-left panel: histogram of individual true ancestry contributions. Top-middle panel to bottom-right panel: scatter plot of individual true ancestry contributions vs. individual estimated contributions using top 50 AIMs selected by Delta, F_{ST}, FIC, SIC, and I_{n}, respectively.

**Individual true ancestry contributions and estimated contributions using top 1000 AIMs of the simulated admixed population from CHB and JPT**. Top-left panel: histogram of individual true ancestry contributions. Top-middle panel to bottom-right panel: scatter plot of individual true ancestry contributions vs. individual estimated contributions using top 1000 AIMs selected by Delta, F_{ST}, FIC, SIC, and I_{n}, respectively.

Estimation of ancestral contribution in admixed populations with random subsets of top AIM panels

For the ASW population (Additional File 10, Table S4), random subsets of AIMs chosen by I_{n} gave the smallest mean error whereas those chosen by FIC had the highest mean error. AIMs chosen by FIC and SIC were more likely to overestimate ancestry contribution, and those by Delta, F_{ST}, and I_{n} were more likely to underestimate ancestry contribution.

**Table S4: Summary statistics of estimation errors of mean ancestry contribution for ASW population**. The estimates were based on 100 random subsets of 20 SNPs from panels consisting of top 1%, 2%, 5%, and 10% of the AIMs for CEU and YRI population. The gold-standard or 'true' ancestry contribution was taken as 78%, estimated by a collection of 3299 AIMs for the CEU and YRI population, all of which were selected as top 10% AIMs by at least one of the five measures.

**Box-and-Whisker plot of estimates of mean ancestry contribution for ASW population with 100 random subsets of 20 SNPs from panels consisting of top 1%, 2%, 5%, and 10% of the AIMs**. From left to right different colors indicate results for Delta, F_{ST}, FIC, SIC, and I_{n}. The last two sets of the plot (yellow and gray) are the results where markers were ranked by the average rank (AVE) or minimum rank (MIN) of all five measures. The dashed line indicates 78%, which was estimated by a collection of 3299 AIMs for CEU and YRI population from HapMap phase III data. All of the 3299 markers were selected as top 10% AIMs by at least one of the five measures. For each method (color), the four Box-and-Whisker plots from left to right represent analysis results based on AIM panels consisting of top 1%, 2%, 5%, and 10% of the AIMs. See Additional File 10, Table S4 for summary statistics of estimation errors.
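
The combined rankings (AVE and MIN) and the random-subset design described in this caption can be sketched as follows. All function names, the tie handling, and the fixed random seed are illustrative assumptions rather than the procedure actually used.

```python
import random


def rank_positions(scores):
    """Map marker -> rank within one measure (1 = most informative)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {marker: rank for rank, marker in enumerate(ordered, start=1)}


def combined_rank(score_tables, how="AVE"):
    """Order markers by the average (AVE) or minimum (MIN) rank across measures."""
    per_measure = [rank_positions(s) for s in score_tables.values()]
    markers = list(per_measure[0])
    if how == "AVE":
        combo = {m: sum(r[m] for r in per_measure) / len(per_measure) for m in markers}
    else:
        combo = {m: min(r[m] for r in per_measure) for m in markers}
    return sorted(markers, key=combo.get)


def random_subsets(ranked, top_fraction, subset_size, n_subsets, seed=0):
    """Draw random marker subsets from the top fraction of a ranked list."""
    rng = random.Random(seed)
    panel_size = max(subset_size, int(len(ranked) * top_fraction))
    panel = ranked[:panel_size]
    return [rng.sample(panel, subset_size) for _ in range(n_subsets)]


# Toy scores for two hypothetical measures.
tables = {
    "Delta": {"m1": 0.9, "m2": 0.5, "m3": 0.1},
    "FIC": {"m1": 0.2, "m2": 0.8, "m3": 0.1},
}
ave_order = combined_rank(tables, "AVE")
min_order = combined_rank(tables, "MIN")

# 100 random subsets of 20 markers from the top 10% of 1000 ranked markers.
ranked = [f"snp{i}" for i in range(1000)]
subsets = random_subsets(ranked, 0.10, 20, 100)
print(ave_order, min_order, len(subsets))
```

Each subset would then be passed to an ancestry estimation program, and the spread of the resulting estimates summarized as in the box-and-whisker plots.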

In the simulation studies, AIMs chosen by I_{n} gave the smallest mean error, whereas those chosen by FIC gave the largest mean error. The superiority of I_{n} was evident in the simulation study using CHB and JPT (Additional File 12, Table S6).

**Table S5: Summary statistics of estimation errors of mean ancestry contribution for the simulated admixed population from CEU and YRI**. The estimates were based on 100 random subsets of 20 SNPs from panels consisting of top 1%, 2%, 5%, and 10% of the AIMs for CEU and YRI population. The true ancestry contribution was 70%.

**Box-and-Whisker plot of estimates of mean ancestry contribution for the simulated admixed populations with 100 random subsets of 20 or 50 SNPs from panels consisting of top 1%, 2%, 5%, and 10% of the AIMs**. (a) Results for the simulated admixed population from CEU and YRI using random subsets of 20 AIMs. The dashed line indicates the true mean ancestry contribution (70%). (b) Results for the simulated admixed population from CHB and JPT using random subsets of 50 AIMs. The dashed line indicates the true mean ancestry contribution (72%). From left to right different colors indicate results for Delta, F_{ST}, FIC, SIC, and I_{n}. The last two sets of the plot (yellow and gray) are the results where markers were ranked by the average rank (AVE) or minimum rank (MIN) of all five measures. For each method (color), the four Box-and-Whisker plots from left to right represent analysis results based on AIM panels consisting of top 1%, 2%, 5%, and 10% of the AIMs. See Additional File 11, Table S5 and Additional File 12, Table S6 for summary statistics of estimation errors.

**Table S6: Summary statistics of estimation errors of mean ancestry contribution for the simulated admixed population from CHB and JPT**. The estimates were based on 100 random subsets of 50 SNPs from panels consisting of top 1%, 2%, 5%, and 10% of the AIMs for CHB and JPT population. The true ancestry contribution was 72%.

Overall, the AIM panels chosen by I_{n} performed the best, giving the lowest bias and RMSE, whereas those chosen by FIC performed the worst across the real dataset and the simulated datasets. For the real dataset (ASW), the combined method using the average ranking of the five measures outperformed all five individual measures (Additional File 10, Table S4). For the simulated datasets, the combined methods did not outperform I_{n} and F_{ST}, but their performance was either better than or very similar to that of the most commonly used methods, Delta, FIC and SIC (Additional File 11, Table S5 and Additional File 12, Table S6).

Discussion

Admixture mapping is a powerful gene mapping approach

Several methods have been proposed to measure ancestry informativeness of markers and to choose a panel of markers to be genotyped while maintaining the power of detecting ancestral chromosome segments in each genomic location. The choice of which of these measures to use should depend on the efficiency of each measure in selecting most ancestry informative markers. However, there is no consensus as to which criteria to use to select markers for ancestry inference or admixture mapping, and the performance of these methods has not been carefully evaluated and compared. The rule of thumb is to select markers with large allele frequency differences between ancestry populations. However, the number of markers required for population assignment will depend on the populations under consideration, their respective level of genetic differentiation and the desired stringency of assignment

In this study, we applied five different analytic tools to evaluate the concordance of selected informative SNPs using the same dataset. Our investigation using 500 top ranked markers for each measure and accounting for the physical distance between consecutive AIMs to be at least 100 kb, showed the following overlap between the different measures: δ vs F_{ST }(n = 479), δ vs FIC (n = 220), δ vs SIC (n = 319), δ vs I_{n }(n = 424), F_{ST }vs FIC (n = 230), F_{ST }vs SIC (n = 329), F_{ST }vs I_{n }(n = 445), FIC vs SIC (n = 395), FIC vs I_{n }(n = 258), and SIC vs I_{n }(n = 354) (Additional file _{ST}, FIC, SIC, and I_{n}, respectively. FIC had the least overlap with other measures. However, based on current cutoff values used for each measure, the δ measure included a number of loci that were not selected by the remaining four methods. F_{ST}, FIC, and I_{n }gave relatively smaller and similar AIM panels, whereas SIC gave a very small panel of AIMs (Additional file _{ST}, and I_{n }were more likely to pick the same set of SNPs, and FIC was more likely to choose SNPs that were not chosen by the other measure.

**Table S7: Overlap of SNP markers between measures**. Diagonal (bolded): Number of SNPs genotyped in both populations and satisfying the filtering criteria^{a}. Upper-triangle: Overlap for the SNP markers. Lower-triangle: Overlap for the top 500 ranked SNP markers.

**Figure S7: Scatter plot of allele frequency difference between CEU and YRI population using current cutoff values for each measure**. Markers in red exceeded the cutoff for the measure of informativeness. Similar patterns were observed between F_{ST} and I_{n}. Delta yielded the largest AIMs panel and included a large number of loci not included by any of the remaining four methods. SIC gave the smallest AIMs panel.

Analytically, the FIC and SIC measures require pre-defined ancestral proportions in an admixed population, whereas F_{ST}, δ, and I_{n} do not. Using the CEU and YRI populations, we ran a sensitivity analysis with arbitrary values of the ancestral proportion to study its impact on the choice of informative markers. Compared with FIC, SIC was less sensitive to the proportion of ancestry contribution in the selection of AIMs (Tables S8 and S9).

**Table S8: FIC - Sensitivity analysis of proportion of ancestry contribution on the selection of AIMs**. For a pair of proportions of ancestry contribution (m and m'), we examined overlap patterns between the two top n% AIM panels selected using m and m' in the computation of FIC. Overlap patterns were presented by 11: AIMs selected by both panels; 10: AIMs selected by panel one (m) but not panel two (m'); and 01: AIMs selected by panel two (m') but not panel one (m). Frequency and percentage of each overlap pattern were reported for top 1%, 5%, 10%, and 20% AIMs. Proportion of ancestry contribution considered included 0.1, 0.2, 0.3, 0.4, and 0.5.

**Table S9: SIC - Sensitivity analysis of proportion of ancestry contribution on the selection of AIMs**. For a pair of proportions of ancestry contribution (m and m'), we examined overlap patterns between the two top n% AIM panels selected using m and m' in the computation of SIC. Overlap patterns were presented by 11: AIMs selected by both panels; 10: AIMs selected by panel one (m) but not panel two (m'); and 01: AIMs selected by panel two (m') but not panel one (m). Frequency and percentage of each overlap pattern were reported for top 1%, 5%, 10%, and 20% AIMs. Proportion of ancestry contribution considered included 0.1, 0.2, 0.3, 0.4, and 0.5.

There are some limitations for some of these measures, for example, FIC favors selection of markers that are closer to fixation in one parental population and may not be appropriate to assess the level of informative markers when ancestral populations are more than two _{ST}, δ is easy to calculate and independent of mutation and model assumptions, however, δ has a major limitation of being only useful for admixed populations from two parental populations and it doesn't account for multiallelic situations at a locus. F_{ST }may not be appropriate to assess the level of genetic information in SNP markers when the number of populations is > 2, as the method could result in the selection of SNP markers which are specific for a single most genetically distinct population. The selected SNP markers that were specific for only the most distinct population are expected to have low heterozygosity. Genetic markers with high expected heterozygosity are informative and therefore useful in individual assignment analysis

Many factors affect the accuracy of the estimation of ancestry contributions, including but not limited to sample size, the panel of AIMs used, the number of AIMs used, and the underlying distribution of ancestry contribution of the individuals in the sample. The use of a phased HapMap dataset allowed us to simulate individuals that share common founding populations. Moreover, the ancestry proportion for each simulated individual is known, allowing for the comparison of true and estimated individual admixture values, and thus the comparison of different methods by estimation accuracy. Our findings indicate that the different measures of marker informativeness (δ, F_{ST}, FIC, SIC, and I_{n}) performed well and that as few as the top 20 ranked informative markers were adequate for accurate classification of ancestral populations. This is in agreement with the commonly made claim in the literature on marker selection for population assignment that 'classification accuracy can be substantially improved if only a subset of loci is used in the assignment test'.

Although the marker selection methods explored in this study agreed to a large extent in identifying the most informative SNPs, there were differences in their performance in ancestry estimation. The simulation study revealed that I_{n }was the best in selecting the set of AIMs giving the smallest bias and mean square error in ancestry estimation. Analysis based on random subsets of top 1% to 10% ranked AIMs indicated that, compared to other methods, AIM panels selected by I_{n }behaved consistently and reasonably well for both the ASW population and simulated admixed populations. These results illustrate that effective exploration of all these methods can help to not only identify the most informative markers but also produce an optimal minimum set of markers that can accurately and efficiently differentiate among populations.

We suggest that the different measures may provide unique insights into a marker's informativeness under different scenarios, including varying ancestral proportion and when more than two ancestral populations are present. To identify all potentially informative SNPs, results from all measures could be considered. For example, the union of the top 500 SNPs for all five measures could be considered as the best AIMs panel. Researchers need to be aware of the differences between the various methods for evaluating ancestry informativeness of SNP markers. Furthermore, as we attempted in our simulation studies using either average rank or minimal rank of all five measures, combined information from more than one method may provide a reliable means, although may not be the best, in selecting markers for ancestry inference. Further research on this topic may shed light on how to best integrate different measures to obtain a set of AIMs most effective for the populations under consideration. We believe that the information that a set of markers provides for assigning or discriminating individuals to their source populations or different relationships must be critically evaluated before investing millions of dollars on an admixture or ancestry related project. We anticipate identification of more complex patterns of ancestry will require explorations of these and newer methods yet to be defined, to identify an optimal set of markers to use, however this should become increasingly feasible as genotyping costs decrease and available data grow on different populations. This in turn will allow the development of higher resolution of genogeographic and ethnic maps and help investigators designing genetic association studies in stratified homogeneous groups.

Conclusions

Although millions of SNP markers with varying levels of information content for ancestry inference have been identified, only small subsets of highly informative markers need to be genotyped in order to accurately predict ancestry with a minimal error rate in a cost-effective manner. In this article, we compared various methods for selecting informative SNPs and showed that the I_{n }measure estimated ancestry proportion (in an admixed population) with lower bias and mean square error. In summary, we showed the utility of several measures of informativeness using simulations and real SNP genotype data from samples of admixed populations. The use of several available methods to prioritize informative markers for ancestry inference can reduce genotyping costs and avoid false positive genotype-phenotype associations while retaining most of the power found in much larger sets of published AIM panels.

Methods

Five measures of ancestry informativeness

Notation

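Consider populations i = 1, 2, ..., K and a locus with alleles j = 1, 2, ..., N. Let p_{ij} denote the frequency of allele j in population i, and let p_{j} denote the average frequency of allele j across the K populations. In an admixed population formed from two ancestral populations, the allele frequency p_{Aj} is a linear combination of allele frequencies in the ancestral populations, and can be written as p_{Aj} = m_{1}p_{1j} + m_{2}p_{2j}, where m_{i} is the proportion of contribution of the i^{th} ancestral population, and m_{1} + m_{2} = 1.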

Absolute allele frequency difference (delta, δ)

Delta is the most commonly used measure of SNP marker informativeness for ancestry between two parental populations. Delta is defined as the absolute difference in the frequencies of a particular allele observed in two ancestral populations. For a biallelic locus, suppose allele one is the reference allele, and let p_{11} and p_{21} be its frequencies in the two ancestral populations; then,

$$\delta = \left| p_{11} - p_{21} \right|$$

A marker with δ = 1 provides perfect information regarding ancestry whereas a marker with δ = 0 carries no information. It has been shown that δ by itself only provides limited information regarding a marker's informativeness for ancestry

F statistics (F_{ST})

F_{ST }is the proportion of the total genetic variance (the T subscript)

Here, F_{ST }can range from 0 to 1. A high F_{ST }value implies a considerable degree of differentiation between populations. F_{ST }is a pair-wise population measure of differentiation or relatedness (genetic distance measure between the two populations) based on genetic polymorphism data such as SNPs and was recently utilized as a criterion for selecting markers for ancestry estimation
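As a rough illustration of δ and F_{ST }for a single biallelic SNP, the sketch below computes both from the reference-allele frequencies in two ancestral populations. The F_{ST }formula in the code is an assumption: it is the heterozygosity-based form (H_T - H_S)/H_T with equal population weights, which may differ from the estimator used in this study. The function names are hypothetical.

```python
def delta(p1, p2):
    """Absolute difference in reference-allele frequency between two populations."""
    return abs(p1 - p2)


def fst_biallelic(p1, p2):
    """Heterozygosity-based F_ST for one biallelic SNP and two populations.

    Assumption: both populations are weighted equally. Returns 0.0 for a
    marker that is monomorphic in the pooled population.
    """
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)                        # H_T
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0  # H_S
    if h_total == 0.0:
        return 0.0
    return (h_total - h_within) / h_total


if __name__ == "__main__":
    for freqs in [(1.0, 0.0), (0.9, 0.1), (0.3, 0.3)]:
        print(freqs, delta(*freqs), fst_biallelic(*freqs))
```

Under these assumptions a fixed difference (frequencies 1 and 0) gives δ = 1 and F_{ST }= 1, while identical frequencies give 0 for both measures.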

Fisher Information Content (FIC)

Pfaff

Here, p_{Aj} is the frequency of the j^{th} allele in the admixed population or individual, δ_{j} = p_{1j} - p_{2j} is the allele frequency difference,

Shannon Information Content (SIC)

Rosenberg et al. (2003) m_{1} and m_{2}), the SIC for a biallelic locus can be written as:

Informativeness for assignment (I_{n}) measure

I_{n }is a mutual information-based statistic that takes into account self-reported ancestry information from the sampled individuals

This formula is a generalization to more than two populations. From a likelihood perspective, it gives the expected logarithm of the likelihood ratio that an allele is assigned to one of the populations compared with a hypothetical 'average' population whose allele frequencies equal the mean allele frequency across the
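The I_{n }formula itself did not survive in this text, so the following is only a sketch based on the form published by Rosenberg et al. (2003): for each allele, the entropy term of the average allele frequency minus the average of the per-population entropy terms, with natural logarithms and equal population weights. The function names and the input layout (`freqs[i][j]` for population i and allele j) are assumptions made for illustration.

```python
import math


def _xlogx(x):
    """Return x * ln(x), using the convention 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0.0 else 0.0


def informativeness_for_assignment(freqs):
    """I_n for one locus.

    freqs[i][j] is the frequency of allele j in population i. Each row
    should sum to 1. Populations are weighted equally (an assumption).
    """
    n_pops = len(freqs)
    n_alleles = len(freqs[0])
    total = 0.0
    for j in range(n_alleles):
        p_bar = sum(freqs[i][j] for i in range(n_pops)) / n_pops
        within = sum(_xlogx(freqs[i][j]) for i in range(n_pops)) / n_pops
        total += -_xlogx(p_bar) + within
    return total


if __name__ == "__main__":
    # Fixed difference between two populations versus identical frequencies.
    print(informativeness_for_assignment([[1.0, 0.0], [0.0, 1.0]]))
    print(informativeness_for_assignment([[0.3, 0.7], [0.3, 0.7]]))
```

With this form, a marker fixed for different alleles in K populations reaches the maximum of ln K, and a marker with identical frequencies in all populations scores 0.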

Data

HapMap phase III dataset

We downloaded the complete HapMap phase III genotype data (

Simulated dataset

To compare marker informativeness measures in the estimation of ancestry population contribution, we simulated two artificially admixed datasets from the phased HapMap III dataset (with known allele frequencies). The first one is an admixed population from two parental populations with relatively high divergence: 113 unrelated individuals in CEU population and 113 unrelated individuals in YRI population. The second one is an admixed population from less differentiated ancestral populations: 84 unrelated individuals of Han Chinese in Beijing, China [CHB] and 86 unrelated individuals of Japanese in Tokyo, Japan [JPT]. The simulations were run using simuPOP

Statistical analysis

Python and JMP^{® }Genomics (v.5, SAS Institute Inc.) programs were used to analyze the various measures of informativeness for ancestry.

Correlation, concordance, and overlapping analysis

To assess the level of similarity of the estimates of genetic information contained in each SNP marker across the five measures of marker informativeness, we used three statistical procedures: Spearman correlation coefficient, Cohen's Kappa statistics, and overlapping frequency analysis of top

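The Spearman correlation coefficient is the Pearson correlation coefficient computed on ranks,

$$\rho = \frac{\sum_{i}\left(R_{i} - \bar{R}\right)\left(S_{i} - \bar{S}\right)}{\sqrt{\sum_{i}\left(R_{i} - \bar{R}\right)^{2}\sum_{i}\left(S_{i} - \bar{S}\right)^{2}}}$$

where R_{i} and S_{i} are the ranks of the observed data values, x_{i}'s and y_{i}'s, and \bar{R} and \bar{S} are the mean ranks. In case of ties, the averaged ranks are used. The Spearman correlation coefficient takes values between -1 and +1. A +1 or -1 indicates that the two measures are in a perfectly monotonically increasing or decreasing relationship, respectively, and a 0 means no monotonic relationship.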
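A minimal sketch of this computation, written with the Python standard library only: values are converted to ranks (tied values receive their average rank) and the Pearson correlation of the two rank vectors is returned. The function names are hypothetical, and the inputs are assumed to be two equal-length lists of informativeness scores for the same SNPs.

```python
import math


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    print(spearman([0.1, 0.4, 0.4, 0.9], [0.05, 0.30, 0.20, 0.80]))
```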

To show the distribution of markers according to one measure of informativeness relative to another, we further analyzed the data by grouping and rating SNP markers using deciles, producing mosaic plots and calculating Cohen's Kappa coefficients. Deciles are the nine values of a variable dividing its distribution into ten groups with equal frequencies. For each measure, based on its deciles we created a new categorical variable with values 1, 2..., and 10, indicating to which group a SNP belongs. We then used the new categorical variables to build mosaic plots and to examine the relationship between measures of marker informativeness. The mosaic plots show, for example, how the top 10% SNPs from one measure of informativeness distribute relative to another measure of informativeness. To assess the concordance of decile-based ratings of the informativeness of AIMs between measures, we computed the Cohen's kappa coefficient, a commonly used index to quantify agreement between two measurements
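The decile grouping and agreement calculation can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code used in the study: groups are assigned from the sorted position of each SNP (so ties are split by input order), and the kappa shown is the unweighted Cohen's kappa, whereas a weighted variant may have been used. Function names are hypothetical.

```python
from collections import Counter


def decile_groups(values):
    """Assign each value to a group 1..10 with (near-)equal group sizes."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    groups = [0] * n
    for pos, idx in enumerate(order):
        groups[idx] = pos * 10 // n + 1
    return groups


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length lists of category labels.

    Undefined (division by zero) when chance agreement equals 1, for
    example when both raters use a single category.
    """
    n = len(a)
    p_obs = sum(1 for x, y in zip(a, b) if x == y) / n
    count_a, count_b = Counter(a), Counter(b)
    p_exp = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (p_obs - p_exp) / (1.0 - p_exp)


if __name__ == "__main__":
    measure_1 = [0.01 * i for i in range(100)]
    measure_2 = [v ** 2 for v in measure_1]  # monotone transform, same deciles
    print(cohen_kappa(decile_groups(measure_1), decile_groups(measure_2)))
```

A kappa of 1 indicates that both measures place every SNP in the same decile, and a kappa near 0 indicates agreement no better than chance.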

To answer the question of how often the same set of SNPs is selected by the different methods or which methods tend to select the same set of SNPs, we studied the overlap pattern of the top _{ST}, FIC, SIC, and I_{n}, respectively. A 1 in the digit indicates that the SNP is selected by the corresponding measure as one of the top _{ST}, and I_{n }as one of the top
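One way to tabulate such overlap patterns is sketched below. Each SNP selected by at least one measure receives a five-character code with one digit per measure, 1 if the SNP is among that measure's top-ranked markers and 0 otherwise. The digit order (δ, F_{ST}, FIC, SIC, I_{n}), the cut-off `k`, the measure keys, and the function names are assumptions made for illustration.

```python
from collections import Counter

# Assumed digit order of the overlap code.
MEASURES = ["delta", "FST", "FIC", "SIC", "In"]


def top_k(scores, k):
    """Set of SNP ids with the k largest scores (ties broken by input order)."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def overlap_patterns(scores_by_measure, k):
    """Return ({snp: code}, Counter of codes) for SNPs in any top-k list."""
    tops = [top_k(scores_by_measure[m], k) for m in MEASURES]
    selected = set().union(*tops)
    codes = {
        snp: "".join("1" if snp in top else "0" for top in tops)
        for snp in selected
    }
    return codes, Counter(codes.values())


if __name__ == "__main__":
    toy = {
        "delta": {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.2},
        "FST": {"a": 0.8, "b": 0.1, "c": 0.7, "d": 0.2},
        "FIC": {"a": 0.9, "b": 0.7, "c": 0.1, "d": 0.2},
        "SIC": {"a": 0.9, "b": 0.7, "c": 0.1, "d": 0.2},
        "In": {"a": 0.9, "b": 0.1, "c": 0.8, "d": 0.2},
    }
    codes, counts = overlap_patterns(toy, k=2)
    print(codes)
    print(counts)
```

In this toy example, a code of 11111 marks a SNP chosen by all five measures, and 01001 marks a SNP chosen only by F_{ST }and I_{n}.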

Discriminant analysis

To compare the discrimination power of the five measures of informativeness and assess how many markers are needed for accurate ancestral CEU vs. YRI population and CHB vs. JPT population membership assignment, discriminant analysis was performed using the top 1, 2, ..., and up to 150 ranked AIMs. In discriminant analysis, a linear discriminant of the marker vector x = (x_{1}, x_{2}, ..., x_{p}) is a projection of x onto a weight vector w = (w_{1}, w_{2}, ..., w_{p}), or a linear combination of x_{i} with weights w_{i}. The weights are chosen such that the projections of the data points (individuals) in the same class (CEU or YRI population) are close to each other while those of the data points from different classes are far from each other. The linear discriminant can be derived using a measure of generalized squared distance. An optimal linear classifier can then be found by minimizing the classification error (probability of misclassification). The classifier can take into account the prior probabilities of the classes, which, in our analysis, were specified as proportional to the sample sizes in each class. A data point is classified into the class for which the posterior probability of the observation belonging to this class is the largest among all classes. Cross-validation was used to obtain prediction accuracy. The analysis was carried out using PROC DISCRIM in SAS (SAS 9.1.3, SAS Institute Inc.). We also examined the number of AIMs needed to achieve 90% or 95% classification accuracy.

Estimation of ancestry contribution in admixed ASW population

We estimated ancestry contribution for the admixed ASW population using up to 200 top ranked AIMs by different measures. We also estimated ancestry contribution using 100 sets of randomly selected 20 SNPs from the top 1%, 2%, 5%, and 10% ranked AIM panels. The analysis was performed using the software PSMIX
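The random-panel step can be sketched as follows: rank the SNPs by one measure, keep the top fraction, and draw repeated random panels of 20 SNPs from that pool. This covers only the sampling; the ancestry estimation itself (performed with PSMIX in this study) is not reproduced here. The function names, the fixed seed, and the rounding rule for the pool size are assumptions.

```python
import random


def top_fraction(scores, fraction):
    """SNP ids in the top `fraction` of the ranking (highest scores first)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    n_top = max(1, round(len(ordered) * fraction))
    return ordered[:n_top]


def random_panels(scores, fraction, n_sets=100, panel_size=20, seed=0):
    """Draw n_sets random panels of panel_size SNPs from the top-ranked pool."""
    pool = top_fraction(scores, fraction)
    if panel_size > len(pool):
        raise ValueError("panel_size exceeds the number of SNPs in the pool")
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    return [rng.sample(pool, panel_size) for _ in range(n_sets)]


if __name__ == "__main__":
    # Hypothetical scores for 1,000 SNPs; larger means more informative.
    scores = {"snp%d" % i: float(i) for i in range(1000)}
    for fraction in (0.01, 0.02, 0.05, 0.10):
        panels = random_panels(scores, fraction, panel_size=10)
        print(fraction, len(panels), panels[0][:3])
```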

We constructed two new methods of ranking marker informativeness for ancestry by combining the information from all five measures. For each marker, we assigned a ranking or score based on either the average ranking (AVE) or the minimum ranking (MIN) of the five measures. We did not use the raw values from the five measures because they have different scales; thus, any score computed as a weighted average of the raw values would need to be preceded by standardization of the raw values, which is beyond the scope of this paper.
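A minimal sketch of the AVE and MIN scores follows. Within each measure, SNPs are ranked from most to least informative (rank 1 is best), and the ranks are then averaged or minimized across measures, so a smaller combined score indicates a more informative marker. The handling of ties (ordinal ranks in input order) and the function names are assumptions.

```python
def ranks_descending(scores):
    """Map each SNP id to its rank, where rank 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {snp: rank for rank, snp in enumerate(ordered, start=1)}


def combine_ranks(scores_by_measure):
    """Return (AVE, MIN) dictionaries of combined ranks per SNP.

    scores_by_measure maps a measure name to {snp_id: score}; every
    measure is assumed to score the same set of SNPs.
    """
    per_measure = [ranks_descending(s) for s in scores_by_measure.values()]
    snps = list(per_measure[0])
    ave = {s: sum(r[s] for r in per_measure) / len(per_measure) for s in snps}
    minimum = {s: min(r[s] for r in per_measure) for s in snps}
    return ave, minimum


if __name__ == "__main__":
    toy = {
        "delta": {"a": 0.9, "b": 0.5, "c": 0.1},
        "In": {"a": 0.8, "b": 0.2, "c": 0.3},
    }
    ave, minimum = combine_ranks(toy)
    print(ave)
    print(minimum)
```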

Estimation of ancestry contribution in simulated admixed population

To validate the ancestral estimates of the five measures, the same set of analyses described in the previous section was conducted for the two simulated admixed populations. In the simulated admixed populations, the ancestry proportion for each individual is known, as is the mean ancestry proportion across individuals in the same population. Estimation accuracy by different measures was compared at two different levels. At the population level, the estimate of the mean ancestry contribution across individuals was compared with the true value and bias was calculated for the five measures. At the individual level, individual true and estimated admixture values were compared, and root mean square error (RMSE) was used as a summary measure of precision in the estimation of individual ancestry proportion. RMSE is defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{m}_{i} - m_{i}\right)^{2}}$$

where m_{i} and \hat{m}_{i} are the true and estimated ancestry proportions of individual i, and n is the number of individuals.
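The two accuracy summaries described above can be sketched in a few lines: population-level bias as the difference between the mean estimated and mean true ancestry proportion, and individual-level RMSE as the square root of the mean squared difference. The function names and the input layout (two equal-length lists, one value per individual) are assumptions.

```python
import math


def bias(true_props, est_props):
    """Mean estimated ancestry proportion minus mean true proportion."""
    n = len(true_props)
    return sum(est_props) / n - sum(true_props) / n


def rmse(true_props, est_props):
    """Root mean square error of individual ancestry estimates."""
    n = len(true_props)
    squared = sum((e - t) ** 2 for t, e in zip(true_props, est_props))
    return math.sqrt(squared / n)


if __name__ == "__main__":
    true_props = [0.20, 0.40, 0.60, 0.80]
    est_props = [0.25, 0.35, 0.65, 0.75]
    print(bias(true_props, est_props), rmse(true_props, est_props))
```

The example shows why both summaries are reported: the individual errors cancel at the population level, giving a bias near zero, while the RMSE remains positive.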

Abbreviations

AIMs: ancestry informative markers; GWAS: genome-wide association studies; YRI: Yoruban population in Ibadan, Nigeria; CEU: Caucasian population from the United States with northern and western European ancestry; ASW: African American population from Southwest USA; CHB: Chinese population from Beijing, China; JPT: Japanese population from Tokyo, Japan.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

TMB conceived of the study and drafted the manuscript. LD performed the data analysis and manuscript writing. HW, TA, MA, RCPG, CK, LM, GKKH and RC contributed in manuscript writing. All authors read and approved the final manuscript.

Web resources

HapMap:

PYTHON:

Acknowledgements

This work was supported by National Institutes of Health grant 1K01HL103165 (TMB) and MH066181 (RCPG).