Pyrosequencing as a method for SNP identification in the rhesus macaque (Macaca mulatta)

Satkoski, Jessica A; Malhi, RS; Kanthaswamy, S; Tito, RY; Malladi, VS; Smith, DG

doi:10.1186/1471-2164-9-256

Research article
Open access
Published: 29 May 2008

Pyrosequencing as a method for SNP identification in the rhesus macaque (Macaca mulatta)

Jessica A Satkoski¹,
RS Malhi^2,3,
S Kanthaswamy^1,4,
RY Tito²,
VS Malladi¹ &
…
DG Smith^1,4

BMC Genomics volume 9, Article number: 256 (2008) Cite this article

7188 Accesses
38 Citations
Metrics details

Abstract

Background

Rhesus macaques (Macaca mulatta) are the primate most used for biomedical research, but phenotypic differences between Indian-origin and Chinese rhesus macaques have encouraged genetic methods for identifying genetic differences between these two populations. The completion of the rhesus genome has led to the identification of many single nucleotide polymorphisms (SNPs) in this species. These single nucleotide polymorphisms have many advantages over the short tandem repeat (STR) loci currently used to assay genetic variation. However, the number of currently identified polymorphisms is too small for whole genome analysis or studies of quantitative trait loci. To that end, we tested a combination of methods to identify large numbers of high-confidence SNPs, and screen those with high minor allele frequencies (MAF).

Results

By testing our previously reported single nucleotide polymorphisms, we identified a subset of high-confidence, high-MAF polymorphisms. Resequencing revealed a large number of regionally specific SNPs not identified through a single pyrosequencing run. By resequencing a pooled sample of four individuals, we reliably identified loci with a MAF of at least 12.5%. Finally, we found that when applied to a larger, geographically variable sample of rhesus, a large proportion of our loci were variable in both populations, and very few loci were ancestry informative. Despite this fact, the SNP loci were more effective at discriminating Indian and Chinese rhesus than STR loci.

Conclusion

Pyrosequencing and pooled resequencing are viable methods for the identification of high-MAF SNP loci in rhesus macaques. These SNP loci are appropriate for screening both the inter- and intra-population genetic variation.

Background

Rhesus macaques (Macaca mulatta) are used more extensively as animal models for the study of human disease than any other primate species. They provide the primary model for research in infectious diseases, reproductive biology, behavior, neuroscience and immunology. More recently, they have been employed in research and vaccine development for the human immunodeficiency virus (HIV) [1–3]. The severe shortage of rhesus macaques as subjects for biomedical research prompted the establishment of national centers for breeding them in the US [4–6]. After their exportation from India ceased in 1978, China became the principal supplier of rhesus macaques to these centers and, thus, domestically bred rhesus macaques represent both countries of origin with only negligible contributions from rhesus macaques from other countries [7]. The particular shortage of Indian-derived rhesus macaques available for use as subjects in biomedical research and their desirability over Chinese rhesus macaques have led to efforts to acquire Indian-like rhesus macaques from sources outside India, such as Nepal and Bangladesh [7], and to establish close relationships with the National Center for Primate Breeding and Research currently under construction in Bombay, India.

Nozawa et al. [8] were the first to characterize genetic polymorphisms in a broad range of Asian species of genus Macaca, including M. mulatta. They reported differences in allele frequencies for electrophoretically defined protein polymorphisms among regional populations of the species. Later, short tandem repeat (STR) polymorphisms were identified in rhesus macaques by using the polymerase chain reaction (PCR) and cross-amplification with human primers [9–11]. Variation in the frequencies of MHC alleles [12–16], mitochondrial DNA (mtDNA) variation, restriction-site polymorphism (RFLP) haplotypes [17, 18] and sequence [19, 20], and Y-chromosome haplotypes [21] were also characterized.

Since then, specific genetic differences between populations of rhesus macaques from China and India have been characterized based on electrophoretically defined protein polymorphisms [22, 23], STR polymorphisms [24–29], major histocompatibility (MHC) alleles [30–32], mtDNA [17, 28, 33] and, most recently, single nucleotide polymorphism (SNPs) [34–37]. Unfortunately, the number of polymorphisms represented above is far too small, and their distribution throughout the genome too broad, to permit whole genome studies of quantitative trait loci (QTL) and disease association [35].

Until recently, STR loci, numbering in the tens of thousands in most mammalian genomes, have been the preferred polymorphisms for characterizing genetic variability and estimating parameters useful for genetic management. STRs serve this purpose because they are neutral to selection pressure, exhibit multiple alleles (sometimes dozens), and hence demonstrate high levels of heterozygosity under equilibrium conditions. They are also spread throughout the genome making it possible to use them create a linkage map the rhesus genome [38], albeit at a density insufficient for whole genome association studies.

The recent completion of the rhesus genome map [39], however, has made the discovery of SNPs much easier [35–37]. While SNPs are typically biallelic and, therefore, exhibit fewer alleles and lower expected heterozygosity (gene diversity) than many STRs, they have the following advantages over STRs, making them more desirable for genetic management and for biomedical and/or genomic research. First, several hundred STRs would be required to achieve reliable estimates of the phylogenetic relationship and divergence time between Indian and Chinese rhesus macaques [40], a process would be both time-consuming and expensive. In contrast, several thousand SNPs can be simultaneously assayed through automation, reducing both the size of the confidence intervals around parameter estimates and cost. Second, SNPs are free of many of the sources of error (e.g., non-specific amplification, shadow bands, null alleles) that characterize STR genotypes. Being biallelic, the assignment of SNP alleles is free from the subjectivity that plagues STR genotyping. Additionally, SNPs can be more reliably genotyped with identical results in different laboratories [41].

Additionally, when carefully selected, SNPs can be expected to exhibit less homoplasy than STRs, and hence provide fewer false signals of common ancestry (thus, SNPs will be more reliable and informative as ancestry informative markers, or AIMs). Because STRs evolve following a stepwise mutation model and exhibit size constraints on mutation rates and direction, genetic signatures resulting from genetic bottlenecks (or expansions) are more ambiguous than for SNPs, which follow an infinite allele model of mutation [42]. Moreover, size constraints on the evolution of STRs, absent in SNPs, can lead to serious mis-specification of branch lengths and topology of phylogenetic trees [43].

STRs have been used successfully to identify linkage disequilibrium between pairs of loci, such as CD4 [44] and G6PD [45]. However, SNPs are more plentiful in the genome than STRs by a half dozen orders of magnitude and can be located quite close to a greater number of functional genes, allowing the creation of a high-density map throughout the genome. SNPs located at various distances from the target of selection exhibit strong linkage disequilibrium and exhibit values of Fst that systematically decline with distance from the target of directional selection, a pattern that has been observed in the human Duffy blood group [46], lactase [47] loci, and loci that influence human skin pigmentation [48, 49].

SNPs with high levels of heterozygosity (e.g., between 0.4 and 0.5) will provide estimates of parameters important for genetic management (e.g., gene diversity, genetic subdivision and average inbreeding coefficient). The extraordinary quantity of SNPs present in most mammalian genomes insures that the large number of highly informative SNPs necessary for this purpose can be found.

A large number of SNPs evenly spaced throughout the genome (the current STR map includes only 241 loci [38]), could be used in many types of genomic research, including creating a haplotype map of the rhesus macaque genome and searching for evidence of selection, studying the histories/phylogenies of multiple genes linked to SNPs (SNPsters- [50]), mapping of QTLs or other important phenotypes [see [51, 52], and [53] for examples using the baboon genome]. In addition, a comparison of the human and chimpanzee genomes has detected genes with unusually high ratios of non-synonymous to synonymous mutations, suggesting selection on genes influencing immune defense and tumor suppression in the human lineage [56]. However, this inference requires the assumption that the chimpanzee always expresses the ancestral form of these genes, an assumption that is not always valid. The availability of a rhesus SNP map would allow testing of this assumption. Thus, a detailed knowledge of population structure of rhesus macaques based on SNPs at various distances and on different chromosomes is crucial to the interpretation of studies of association between specific loci and disease etiology and susceptibility. The pyrosequencing strategy could provide these SNPs relatively cheaply and quickly. We subjected a DNA sample from a Chinese rhesus macaque to pyrosequencing. After parsing the resulting pyrofragments, we compared them to the newly completed rhesus genome to identify candidate SNPs. We selected a sample of these SNPs and conducted the following three tests to evaluate pyrofragment comparison as a method of SNP detection in the rhesus macaque.

Test 1

Fragment Resequencing. Many of the pyrosequenced fragments overlapped each other, ranging from zero to three overlaps (fragments with more than three overlaps were removed from the data set, per [37]). The quality of each base in the sequence was assessed with the program Phred [54, 55]. Phred score is assigned relative to the probability of an incorrect base call, with each 10-point increase equating to a 10-fold decrease in the probability of a miscalled base. (For example, a score of 20 equals 99% base call accuracy, while a score of 30 indicates 99.9% accuracy.) The Phred scores in these fragments ranged from one to 33. We resequenced a sample of these pyrofragments in both the original pyrosequenced individual and several other rhesus macaques, to evaluate the impact of overlap number and Phred score on the accuracy of our SNP detection algorithm. The goal of this test was determine an appropriate cutoff for overlap number and Phred score during SNP locus selection.

Test 2

Screening for Informative SNPs. Because all SNPs were detected through the comparison of a pyrofragments from a single Chinese rhesus to the genome of a single Indian rhesus, we resequenced a pooled DNA sample of four different, geographically diverse rhesus macaques. By this method, only SNPs located on at least one of the 4 orthologous chromosomes in the pool would be verified (MAF of at least 12.5%), suggesting that the heterozygosity under equilibrium conditions exceeds 0.22. Resequencing a pooled DNA sample will allow us to screen a large group of loci for those most likely to be informative in a larger population of animals.

Test 3

Detecting Geographic Variation. Given the large number SNPs identified from a single pyrosequencing run (~23,000), it is necessary to develop a method to test the efficacy with which they measure genomic variation. The goal of this test was twofold. First, we sought to not just identify SNP loci with a MAF of greater than 12.5%, but to identify markers meeting this criterion that were also AIMs. Second, we wanted to test the utility of randomly identified SNP loci for population genetic analysis compared with STR loci. To this end, 96 SNPs fitting the criteria established by the first two tests were chosen for genotyping by the Illumina GoldenGate Assay System in a sample of 95 geographically diverse Indian and Chinese rhesus macaques. These same 95 DNA samples were amplified at 23 microsatellite loci dispersed throughout the genome, and the utility of these markers for capturing the genetic variation present in the populations were compared.

Results

SNP Detection

Details of our SNP detection protocol can be found in [37], and the bioinformatic methods are described in [57]. Our SNP database can be searched by chromosome, base pair location, or unique pyrofragment ID at our website, located at http://mamusnp.ucdavis.edu.

Fragment Resequencing

Individual and Primer information are shown in Table 1. There were no differences between the pyrosequenced 454 fragments and the Sanger re-sequenced CHIW sample (generated from the same individual), supporting the low pyrosequencing error rate suggested by the Phred scores, as shown in Figure 1. The only exception is in fragment D8YOWMI02H6KYZ, where Sanger sequencing revealed individual Sch00R1684 to be a heterozygote. A minimum Phred score of 20 (99% accuracy) was considered an acceptable error rate for all subsequent tests. To increase our confidence that identified SNPs were true polymorphisms rather than sequencing errors, a SNP had to have been identified in a minimum of two pyrofragments to be considered for further analysis. These sequence comparisons demonstrate, though, that the scores assigned by the Phred program are accurate, and can be used as a general guideline to allow future SNP selection without having to verify the polymorphism through sequencing.

Table 1 Sequence and position information for primers used to replicate 454 pyrosequenced fragments.

Full size table

Also shown in Figure 1, the SNPs identified through comparison of the 454 fragment with the NCBI rhesus genome do not tend to be restricted to a particular regional sample; indeed, although the NCBI genome was created from an Indian individual, the Indian rhesus included in our regional sample always carried the same allele as the pyrosequenced (Chinese) individual, suggesting that a very low percentage of the identified SNPs qualify as AIMs.

However, a significant result of our test was that the comparison of regional samples revealed a great deal of previously undiscovered sequence polymorphism, indicated by dark grey boxes in Figure 1. This polymorphism suggests that genomic comparison of individuals from different geographic regions has the potential to quickly generate many more informative SNP loci. This result suggests that pyrosequencing additional, regionally variable, individuals has the potential to reveal much more of the variation present in the rhesus genome.

Screening for Informative SNPs

48 SNPs were selected to screen the pooled DNA sample (individuals listed in Table 2). At least two SNPs were chosen on each chromosome, one on each arm. Primers were successfully generated for 43 loci. Of these 43, 60% (N = 26) were confirmed as polymorphic. The remaining 40% that could not be verified are classified as low-frequency SNPs, and in fact may have been alleles private to the original pyrosequenced individual. Consistent with the results of the first test, between one and 13 additional SNPs were identified in each of the 43 amplicons. All four individuals, when sequenced individually at three randomly selected loci, produced amplicons for two of the loci, while only three individuals amplified at a third. The former loci all exhibited minor allele frequencies of 25%, which the latter exhibited a minor allele frequency of 33%. Thus the strategy of using a pooled sample is an effective way to select loci with minor allele frequencies of at least 12.5%. Of course, use of sample pools representing only 2 or 3 samples should identify SNPs with proportionately higher MAFs.

Table 2 Provenience and accession information for the individuals included in all experiments described above.

Full size table

Detecting Geographic Variation

Of the 95 individuals shown in Table 3, 79 produced viable genotypes for at least 95% of the 96 loci (listed in Table 4). Ninety-two of the loci submitted for genotyping produced analyzable data. Fourteen of these loci were monomorphic in all individuals, suggesting that they were either extremely low-frequency SNPs, or a mutation novel to the pyrosequenced animal.

Table 3 Samples submitted for SNP genotyping.

Full size table

Table 4 The 96 SNP loci submitted for genotyping.

Full size table

Sixty-five SNPs were variable in Chinese animals, and sixty-four were variable in Indian animals (one locus did not amplify in any Indian individuals). Of all the variable SNPs, 67.9% were polymorphic in both Chinese and Indian individuals. A very small proportion of the polymorphisms were unique to either population – only 18.5% and 17.2% of markers were ancestry informative in Chinese and Indian populations, respectively. Of these private polymorphisms, the distribution of MAF is shown in Figure 2. There is no significant difference between the MAF of the Chinese or Indian samples for each frequency class (2 sample t-test, p = 1), nor is there a significant relationship between the MAF class and the number of SNPs. The average heterozygosity in the Chinese sample was 0.25 ± 0.18, while in the Indian sample, it was slightly higher, at 0.28 ± 0.18. We were not able to detect any significant difference in linkage disequilibrium between the Chinese and Indian individuals (data not shown).

The results of the STR analysis are shown in Table 5. The Chinese sample demonstrated unique alleles for each of the 23 loci, with an average of 4.7 unique alleles per locus. The Indian sample only demonstrated unique alleles at 7 loci, with an average of 1.6 unique alleles per locus. The observed heterozygosities of the Chinese and Indian sample were 0.76 ± 0.11 and 0.68 ± 0.14, respectively. The results of the principal component analysis (PCA) for the SNP and STR data are shown in Figure 3.

Table 5 Summary information for the 23 STR loci included in the study.

Full size table

Discussion

The results shown in Figure 2 suggest that the allelic variants discovered through comparison of the Chinese 454 fragments and the Indian NCBI rhesus genome are probably broadly polymorphic in the species, rather than being confined to a particular region. These results conflict with those reported by Ferguson et al. [36] for SNPs identified in 3' regions of coding sequences of the rhesus genome, and by Hernandez et al. [35] for SNPs identified by sequencing several ENCODE regions. While the SNPs identified by Ferguson et al. [36] and Hernandez et al. [35] will be extremely useful as AIMs for differentiating between rhesus macaques originating in India and China and for estimating the level of admixture in hybrid rhesus macaques with ancestry from both regions, they might be unsuitable for studies of the structure of the rhesus macaque genome if they prove to be unrepresentative of this structure.

Additionally, our comparison of regional samples revealed a great deal of previously undiscovered sequence polymorphism, indicated by dark grey boxes in Figure 1. This polymorphism suggests that genomic comparison of individuals from different geographic regions has the potential to quickly generate more informative SNP loci in the process of verifying SNPs that are discovered in the future using the methodology we have used. This might be of particular importance for identifying derived SNPs in non-rhesus species, such as M. fascicularis.

A single pyrosequencing run cannot discover all the variation present in the rhesus macaque genome. Because a single run does not produce complete coverage of the genome. However, as shown in Figure 1, the pyrosequencing of additional regional samples will reveal more of these region-specific polymorphisms. Thus, we can expect that many more additional candidate SNPs than the 23,000 we have already identified will be revealed using this re-sequencing procedure, when both additional animals and regionally variable animals are pyrosequenced. Moreover, recent advances in 454 technologies have significantly increased the efficiency of the pyrosequencing reaction, nearly tripling the number of reads per run and increasing by two and one-half fold (to 250 bp) the size of each read. Thus, we might be able to detect approximately 60,000 SNPs (two and one-half fold larger than the number identified in fragments of average size 103 bp) from each additional pyrosequencing experiment. Recently, researchers have developed a method for discovering SNP loci and estimating MAF in a single run, by pooling individuals and digesting with Hae III, selecting fragments in the 70–200 bp range (creating a reduced representation library, or RRL) and resequencing. This method has been used to successfully identify large numbers of SNPs in domestic cattle [58] and could easily be adapted to the rhesus macaque.

Finally, the SNP data presented in this study address many of the concerns of researchers regarding STR data. The SNPs in this study were identified from a pool of ~23,000 candidates. As shown in Figure 1, the error rate of the Illumina SNP genotyping process is very low, and PHRED score appears to be quite useful as a measure of genotype confidence. This is quite different from STR genotyping, where genotype reliability is generally a measure of how many times it can be replicated [59]. Additionally, because the SNP loci are biallelic, there is no subjectivity necessary in the assignment of genotypes, nor is there a need for a "universal" allelic ladder, such as that used for human forensic STR testing [60–62].

In addition to replicability, SNP genotypes can be collected more quickly than STRs. The SNP data presented in this paper was collected in a single run, while the comparable STR data required hundreds of separate PCR reactions and at least four "pool-plexed" passes or two multiplex passes through the sequencer to produce complete, high-confidence genotypes.

However, efficiency and thrift are poor tradeoffs if the resultant data does not adequately assay inter-and intra-population genetic variation. Even with the relatively low number of either Chinese- or Indian-specific polymorphisms, as shown in Figure 3, our loci, when taken together, provide sufficient phylogenetic signal to discriminate Indian and Chinese rhesus populations. The separation of the IND1 and Chinese populations using STR loci alone is not nearly as complete, although the position of IND2 individuals in the STR-based PCA in Figure 3 could be due to underlying genetic structure not detected by the SNPs. This PCA, in combination with the heterozygosity measures from both types of loci, suggest not that the STR loci currently in use are biased [28], but that that genetic data collected from SNP loci simply provide a more complete estimate of the genetic variation in the rhesus genome. There was at least one SNP on each arm of each chromosome (excluding chromosome 19), while the distribution of the STRs was not as extensive. Nearly 70% of our SNP loci were variable in both Chinese and Indian individuals, making them useful for comparing the breadth of genomic variation between these populations, something that cannot be done with AIMs. Within populations, the average observed heterozygosity of our SNP loci was generally lower than the comparable STR data. However, in both regional samples, high and low MAF loci were equally common. If a high MAF is the primary criterion for inclusion in a SNP-based genetic management panel, pyrosequencing appears to be an especially useful method for quickly identifying large numbers of these loci. Post-identification, pooled resequencing as described above identifies loci with a MAF of at least 12.5% with a high degree of accuracy (although this percentage depends on the number of samples included in the pool).

Conclusion

In this study, we sought to expand on the SNP detection methods presented by Malhi et al. [37], by applying three tests to the approximately 23,000 previously identified SNPs. First, we replicated several of the pyrosequenced fragments with varying Phred scores, to both check for sequencing error and to screen for additional, regionally specific polymorphisms. From this, we found that the Phred score was an accurate representation of sequencing error rate, and set a minimum Phred score of 20, or 99% base call accuracy, for all subsequent tests. Second, we used a pooled DNA sample containing four different individuals to resequence 48 SNP loci. By using this technique to screen for polymorphic loci, and then sequencing animals individually to quantify the allele frequencies, we were able to screen for loci with MAF of at least 12.5. Third, we genotyped loci in a larger sample of 95 regionally variable macaques. We found that although a small percentage of the loci qualified as AIMs for this sample, a large proportion of loci had MAF in the 40–50% range, suggesting that our method of SNP identification is appropriate when the goal is the assessment of within-population genetic variation. Finally, we were able to demonstrate that the SNP loci were at least as useful as STRs for screening within-population genetic variation, and were better at between-population discrimination for the IND1 and Chinese individuals.