High throughput genotyping of single nucleotide polymorphisms (SNPs) for genome-wide association requires technologies for generating millions of genotypes with relative ease but also at a reasonable cost and with high accuracy. In this work, we have developed a theoretical approach to estimate allele frequency in pooled DNA samples, based on the physical principles of DNA immobilization and hybridization on solid surface using the Langmuir kinetic model and quantitative analysis of the allelic signals.
This method can successfully distinguish allele frequencies differing by 0.01 in the actual pool of clinical samples, and detect alleles with a frequency as low as 2%. The accuracy of measuring known allele frequencies is very high, with the strength of correlation between measured and actual frequencies having an r2 = 0.9992. These results demonstrated that this method could allow the accurate estimation of absolute allele frequencies in pooled samples of DNA in a feasible and inexpensive way.
We conclude that this novel strategy for quantitative analysis of the ratio of SNP allelic sequences in DNA pools is an inexpensive and feasible alternative for detecting polymorphic differences in candidate gene association studies and genome-wide linkage disequilibrium scans.
Single nucleotide polymorphisms (SNPs) represent the most genetic variation in the human genome, and are thought to have a promising future in a wide range of human genetics applications such as pharmacogenomics, population evolution, functional genomics, forensic and identification of genes responsible for the susceptibility of complex diseases. It has been suggested that 30,000–500,000 SNPs are required for a whole-genome association study [1,2]. Accurate determination of allele frequencies of such a large number of SNPs in a large number of human samples is an unusual challenge in the whole genome association studies for genetic alterations of low relative risk . It not only involves heavy workload, unusual amount of time and cost, but also a large amount of DNA of each sample. Because only very few markers are expected to show linkage and/or association in family data, a simple, highly efficient and cost-effective screening approach to identification of genetic markers showing linkage and/or association is highly desirable. Using pooled DNA samples may significantly facilitate meeting this goal since hundreds of DNA samples can be reduced to a single sample. Although pooling DNA samples may result in a loss of information of haplotype information, it is still appealing because of the tradeoff of the significant reduction in the amounts of effort and cost.
A number of approaches used for SNP genotyping have been used to estimate allele frequency in pooled DNA samples. These include primer extension followed by DHPLC , allele-specific amplification with real-time PCR , BAMPER , TAQMAN™ and RFLP analysis , dynamic allele-specific hybridization (DASH) , MassARRAY™ , mass spectrometry , pyrosequencing [11,12], SSCP , the amplification refractory mutation system (ARMS) , and DNA microarrays [15,16]. However, most of these methods are based on substantial post-PCR processing, and for one or very few SNPs for one pooling sample as a time. In this report, we describe a new microarray-based method for estimating the allele frequency in pooled DNA samples based on the physical principles of DNAs immobilization and hybridization on solid surface. This method well suits large-scale genetic association study, and has a number of advantages: capability of scaling up both in the numbers of SNPs and pooled samples (cases and controls) by utilizing microarray platform, assay of thousands of SNPs on one chip under uniform conditions, employing only two universal fluorescently labeled tags for thousands of SNPs, and no post-PCR processing.
The physical principle of hybridization on chip surface, modeled as a Langmuir adsorption process, has been extensively studied in recent years. However, most studies concentrate on the kinetics and thermodynamics of hybridization for gene expression assay [17,18] and genotyping . To our knowledge, the present study is the first application of the Langmuir function to SNP allele-frequencies estimation using pooled DNA and microarray.
Six SNP markers, two, ESR1E-U11 (T/C, rs11155816) and ESR1F-U21 (A/G, rs9340799) in the ESR1 gene, one, TGFB1D-U2 (G/C, rs1800471) in the TGFB1 gene, and three, HBB17 (A/T, c.102 A > T), HBB28 (T/C, rs33931746) and HBB26 (C/T, rs33950507) in the HBB gene, were employed to demonstrate the feasibility of this approach. The estrogen receptor encoded by ESR1, is a ligand-activated transcription factor composed of several domain important for hormone binding, DNA binding and activation of transcription. It has widely been demonstrated that ESR1 polymorphism is associated with breast cancer and bone mineral density. TGFB is a multifunctional peptide that controls proliferation, differentiation, and other functions in many cell types. It is well known that TGFB plays an important role in human diseases. Mutations in the TGF-beta pathway are responsible for many biological processes in cancer development. The increased expression and a polymorphism of TGFB1 have also been associated with abdominal obesity in humans. The inherited blood disorder β-thalassemia is caused by mutations in the HBB gene, which markedly decreases or completely prevents the production of β-globin chains. It is the most common inherited single-gene disorders in the world with the highest prevalence in many areas of southern China including Taiwan, and has been and/or remains endemic.
Principle and Design of the Method
Six SNP markers, two, ESR1E-U11 (T/C, rs11155816) and ESR1F-U21 (A/G, rs9340799) in the ESR1 gene, one, TGFB1D-U2 (G/C, rs1800471) in the TGFB1 gene, and three, HBB17 (A/T, c.102 A > T), HBB28 (T/C, rs33931746), and HBB26 (C/T, rs33950507) in the HBB gene, were included to demonstrate the feasibility of our new approach. When the PCR product from a pooled DNA sample containing the two alleles at a given ratio for an SNP is spotted onto the glass slide surface, competition between the two allelic sequences may occur during surface immobilization (Figure 1) because the spot has limited binding capacity determined by the active groups and steric hindrance. As a result, the amounts of the two allelic sequences bound to the spot should be proportional to the initial amounts in the PCR product. When two fluorescently labeled allele-specific probes hybridize to their corresponding allelic sequences in the spot, the intensities of the two fluorescent colors in the spot should reflect the relative amounts of allelic sequences in the pooled DNA sample.
Figure 1. Schematic illustration of the competitive immobilization and hybridization processes on glass surface using the strategy of tag-probe-allele sandwich structure. The single-base variation is indicted by letters. Fluorescent dyes are represented by small red or green spheres.
Two allele-specific probes were designed for each SNP. Each probe was composed of a specific domain (~15 bp) to hybridize with its allelic sequences at their 5'-ends, and a 3'-segment consisting of one of two 20-bp universal sequences to hybridize with a fluorescence-labeled universal tag. The two universal tags (Cy5-TTACGTGATGGTAATAGTGCTG and Cy3-TTAGGTCGTAGGTGCGTTAGAT) labeled with different fluorescent dyes, were designed to perfectly match their corresponding universal sequences of the SNP probes to form a "sandwich" architecture with the PCR product. The two universal tags were used for all SNPs under analysis. The sequences of the probes and the universal tags were carefully selected to avoid formation of tag-to-tag hybridization and hair-pin structures, which may interfere with the hybridization between tags and their probes, and between probes and their allelic sequences . The strategy of tag-probe-alleles sandwich complex is compatible with high-throughput analysis, flexible in experimental design, and cost effective.
Based on the analysis of the thermodynamics of immobilization and hybridization process on the solid surface, we constructed the Langmuir-type isotherm model to integrate the two processes. The Langmuir-type parameters of the model were obtained from a series of allele frequency experiment. Several theoretical considerations and experimental constraints were imposed as follows: (a) the immobilization process was viewed as a reversible competitive bimolecular surface reaction (adsorption and desorption) between DNA molecules and the glass surface; (b) the immobilization was allowed to reach equilibrium under the defined experimental conditions; and (c) all PCR products were diluted to the same concentration before printed onto the slides. Under these conditions, we can deduce the equations to characterize the relationship between the concentrations and of the allele sequences, A and B, immobilized to the glass surface and the initial concentrations of these alleles in the spotted solution as follows [21,22],
where the superscripted "i" represents the surface immobilization values, , , and represent the rate constants of immobilization association and dissociation of alleles A and B, respectively, R is the maximum surface concentration of the active group on a spot, cA and cB represent the spotting concentrations of alleles A and B in the pooled DNA sample, and represent the surface concentrations of the immobilized allelic sequences A and B.
where the factor α is defined as,
Because the two alleles of each SNP only differ by one base, and the immobilization conditions and the chemical properties of glass surface are the same to these two alleles, in theory, the values of the immobilization equilibrium constants is similar to , () to certain extent. During the experiment, all PCR products of the pooled DNA samples were diluted to the same concentration for spotting, cR(cR = cA + cB), which means cR is a constant in this assay. Then Eq. (7) can be rewritten as,
On the basis of Eq. (8), α value only depends on the slide surface chemistry (R), immobilization reaction rate constants ( and ), and the allelic sequence concentrations in the spotted solution (cA and cB). Thus, α can be considered as a constant under the defined experimental conditions.
Generally, the hybridization kinetics on the solid surface is represented by the familiar relationship,
It is well known that the simplest model for DNA hybridization on a chip is the Langmuir kinetic model for adsorption [23,24]. Langmuir adsorption theory for microarray is based on the assumption that there are two competing processes: the adsorption process, which is binding the probe molecules to the immobilized DNA molecules to form duplexes, and desorption, which is the reverse process of duplexes dissociating into separate molecules. Thus the form of the equation results in a non-linear relation between the concentrations and the signal intensities. In our method, some theoretical considerations and experimental constraints are imposed to fit the conditions required by the Langmuir regime: (a) the labeled tag-probe is in large excess compared to allelic sequences immobilized on the surface; (b) the tag-probe concentration near the slide surface is assumed to be constant and equal to the bulk concentration during hybridization process;.(c) the hybridization reaction can achieve equilibrium.
If we neglect some effects such as that secondary structures and cross-hybridization, and there is only specific binding between a given probe and its complimentary allelic strand, and between labeled tag and its corresponding probe, we can deduce that the amount of labeled tag-probe bound to the complementary immobilized allelic sequence is proportional to the fluorescence intensity. Then the two fluorescence intensities on the same spot could be translated into the relative amounts of the two alleles in the pooled DNA sample. By recasting the result to yield the background-corrected intensity, IA and IB, as a function of a series of allele immobilization concentrations, and , we obtain the following equation (the superscripted "h" means the hybridization fluorescence values),
where and represent hybridization equilibrium constants, the factor IR,A and IR,B refer to the maximal values related to the total number of molecules of the alleles immobilized in the given spot, and the intensities IA and IB refer to the background-corrected intensities of alleles A and B from a given spot.
Combining with the Eqs. (5) and (6), we can rewrite the Eqs. (10) and (11) as follows,
where the factor KA and KB are defined as , and . The relationship between the fluorescence intensity of binding sandwich and allele spotting concentration approximately follows the Langmuir-type adsorption isotherm function (Eqs. 12 and 13). This observed Langmuir-type isotherm model integrates the thermodynamics of the immobilization and hybridization processes.
Assuming that the allele concentration ratio (cA/cB) remains constant in the spotting solution and the initial pooled DNA sample, which means the alleles are equally amplified. By transforming and combining Eqs. (12) and (13), we can determine the pool allele frequency by the following equations,
where fA and fB represent the alleles A and B frequencies in the pooled DNA sample.
Characterization of the kinetics of microarray
In order to determine the values of Ka (KA or KB) and IR, a series of pooled reference DNA samples with different ratios between the two allelic sequences of each of the six SNPs at different proportions for the six SNPs (ESR1E-U11(T/C), ESR1F-U21(A/G), TGFB1D-U2(G/C), HBB17(A/T), HBB28(T/C), and HBB26(C/T)) were prepared from two DNA samples homozygous for the two alleles (see Method). For each SNP, fourteen pooled reference DNA samples were prepared with allele ratios ranging from 0% to 100%. The samples were amplified by PCR, purified, brought to a final total concentration of 50 μM, and printed onto a glass slide. Each sample was spotted five times for evaluating spot-to-spot variation. The fabricated array was then incubated in a hybridization solution containing Cy3/Cy5 labeled tag-probe duplexes which were in large excess over the number of immobilized allelic molecules on the array. The plots of the allele concentrations in spotting solution versus average background-corrected fluorescence intensities for the six SNPs were analyzed using the Langmuir-type model (Eqs. 12 and 13). Results are shown in Figure 2.
Figure 2. Average background-corrected signal intensity as a function of allele spotting concentration for the six SNPs. All curves depict the non-linear relationship modeled as Langmuir kinetic function for signal intensity from microarray. The spotting concentration of the allelic sequences ranged from 0 to 50 μM for each SNP. The signal intensity is calculated by integrating line profiles for spots containing the perfect match and normalizing the signals. The solid and dashed lines in each plot show best fit to the Langmuir kinetic model for the two alleles.
The values of constants, Ka and IR, for the six SNPs derived from the best-fit Langmuir isotherms are presented in Table 1. IR is the maximum fluorescence intensity, representing the saturated hybridization signal for the immobilized allele. In general, IR is determined by hybridization conditions, immobilization conditions, and surface chemistry of the slide, but independent from the spotting concentration if the amount of allelic sequences in the spotting solution is in a great excess over the absorbance capacity of the slide surface. In Eqs. (12) and (13), Ka is the observed apparent equilibrium constant of reaction between the probes in hybridization solution and the allele DNA in spotting solution. It is determined by immobilization affinity (Ka (i)), hybridization affinity (Ka (h)), surface chemistry (R, surface concentration of active group), and the total spotting concentration (cR). As shown in Table 1, the observed equilibrium constant Ka has the same order of magnitude for six SNPs: 0.65 × 105 M-1, 0.66 × 105 M-1 for genotype T and C of ESR1E-U11; 0.99 × 105 M-1, 0.99 × 105 M-1 for genotype A and G of ESR1F-U21; 0.79 × 105 M-1, 0.80 × 105 M-1 for genotype G and C of TGFB1D-U2, 0.51 × 105 M-1, 0.52 × 105 M-1 for genotype A and T of HBB17; 0.77 × 105 M-1, 0.79 × 105 M-1 for genotype A and G of HBB28; 0.79 × 105 M-1, 0.80 × 105 M-1 for genotype G and A of HBB26 respectively. These results clearly demonstrated that the single-base difference of the PCR products of each SNP had negligible effect on immobilization and hybridization for the perfect-matched probes.
Table 1. The constants of Ka and IR for tag-probe-allele hybridization derived from spike-in experiments
Allele frequency estimation for pooled DNA samples from synthetic DNAs and plasmids
Since experimental factors such as immobilization condition, hybridization, and surface chemistry may affect the values of Ka and IR, it is better to include the reference samples with known allele frequencies as controls in each sub-array. With the values of Ka and IR derived from the experimental data, it is possible to estimate the allele frequencies in the pooled samples. This was demonstrated by using pooled DNA samples with different ratios (frequency from 5% to 100%) between the two allelic sequences, prepared by mixing two DNA samples homologous from either allele from synthetic or plasmid DNA stocks. The allele frequencies of all samples were calculated using Eqs. (14) and (15). As summarized in Table 2, the mean standard deviations (SD) between the five replicates for the six SNPs were from 0.23 to 0.26 with a maximal range of 0.011 to 0.041 for each SD. These results confirm the reliability of our method.
Table 2. Allele frequency estimates for the six SNPs from pooled DNA samples*
Allele frequency estimation of pooled genomic DNA samples
We also investigated the validity of our method for pooled genomic DNA samples from clinical specimens using the same protocols described above for the synthetic and plasmid DNA stocks. Ten genomic DNA samples with different ratios between allelic sequences were prepared by pooling 100 individual genomic DNA samples of known genotypes for SNP HBB28(T/C). The fraction of the minor allele "C" ranged from 2% to 10%, with 1% increments. Twenty replicas of each pooled sample were measured to test the repeatability of the method. The variations are expressed as ± SD and ± standard error of the mean (SEM). Results are listed in Table 3. As shown, SDs ranged from 0.005 to 0.010, and SEM ranged from 0.0044 to 0.0103, indicating that our method is highly reproducible.
Table 3. Allele frequency estimates from pooled DNA samples with the minor allele frequency ranging from 2% to 10% for the "C" allele of SNP HBB28
Robustness of the method
In an association studies, it is critical to learn which markers with their allele frequencies significantly different among populations. The simplest strategy is to compare between results from the pooled samples of all cases and those from pooled samples of all controls. In more complex design, creating sets of sub-pools allows stratification, not only on the basis of the disease trait but also on secondary and tertiary traits as well. In these cases, it is very important to detect minor differences in allele frequencies between pools or sub-pools. To evaluate sensitivity of our method, the significance level of the differences in the allele frequencies between the two pooled samples with the closest allele frequencies (Table 3) was assessed by the Kolmogorov-Smirnov test. The resultant P values are listed in Table 4. As shown, samples with a difference in their allele frequencies as small as 1% could be discriminated (P < 0.0078) if the sampling error is negligible. Therefore, our method provides a very powerful tool for association studies using pooled samples.
Table 4. P values calculated by the Kolmogorov-Smirnov test to assess the significance levels of the differences in allele frequencies between pooled samples with closest allelic frequencies
Figure 3 is a scatter plot that summarizes all allele frequency estimation data (Table 2 and Table 3) for the six markers. The estimated pooled allele frequencies are in good agreement with the results of individual genotyping. For the genomic DNA samples (Table 3), the estimates and known values of allele frequencies are highly correlated (r2 = 0.9917). For all assays in the present study, the known and the measured allele frequencies were highly correlated with a correlation coefficient of 0.9992 (P < 0.01). The mean value of the differences between known and measured frequencies was 0.007. These results indicate that our method is highly accurate and reproducible. According to the data shown in Table 3, it is possible to apply our method to determining allele frequencies ≥ 2%.
Figure 3. Scatter plot demonstrating the accuracy of allele frequency estimation using pooled DNA samples. The chart is drawn based on the results for allele frequency listed in Table 2 and Table 3 versus the known frequencies prior to PCR amplification for the six SNPs. Short horizontal bars are error bars. The diagonal line shows complete concordance between known and observed allelic fractions.
To obtain a high degree of accuracy and sensitivity, several factors need to be taken into consideration. These include the number of samples to pool, the volumes of DNA to transfer by pipetting, microarray preparation and hybridization. For this reason, the reproducibility for the quantification of this method was evaluated. We investigated the sampling and measurement errors  using samples with allele frequencies of 0.05 and 0.10 for HBB28(C). Each sample was repeated 20 times which were subdivided into four groups, five each. Then we calculated the measurement error using the standard error of the mean
where the number of measurements was five, the measurement error is independent of sample size. The expected sampling error can be expressed as (f = allele frequency and n = sample size) , and the equation of is for the estimated combined sampling and measurement error. Results for the two allele frequencies are shown in Figure 4. With an actual allele frequency of 0.05, the measurement error was ± 0.0077 (Panel (a), Figure 4). For an actual allele frequency of 0.10, the measurement error was ± 0.0103 (Panel (b), Figure 4). Results in Figure 1 show that when the sample size is smaller than 400, the sampling error is greater than the measurement error at allele frequency of 0.05; when the sample size is smaller than 436, the sampling error is greater than the measurement error at allele frequency of 0.10. Thus the measurement error will be dominant in allele frequency estimation for a large sample size such as n > 500. Generally, the measurement error would be smaller than the statistical error of sampling at a large sample size. To obtain a reliable allele frequency, the sample size ≥ 500 is necessary for an allele frequencies greater than 0.10.
Figure 4. Comparison of expected sampling errors and experimental errors of the frequencies for the two alleles of SNP HBB28. The chart was drawn based on the results in Table 3. (a) The solid line is the expected sampling error for SNP site (HBB28) for the allele frequency of 0.05 for sample size up to 1000. The upper broken line is the estimated combine sampling and measurement error for this method based on Table 3 (see text). The lower broken line is measurement error. (b) The same as (a) for the allele frequency of 0.10 described in Table 3.
Identification of genetic loci associated with genes responsible for susceptibility to complex human diseases with a clinically available sample size is still a major challenge for whole genome association study. In addition to including a large number of SNPs, the chance of detecting significant association also requires a very large number of samples owing to the low phenotypic effect of the genes involved in multifactorial diseases. Although, high-throughput genotyping techniques are readily available for handling large sample size and a large number of genetic markers, the cost is still very high. Therefore, methods for estimation of SNP allele frequencies in such studies should be amenable to scaling-up both in the number of loci and in the number of samples. DNA pooling is a well established method, which can vastly reduce the amounts of effort, labor and cost involved in large-scale association studies . Pooling allows one to estimate the allele frequency among large numbers of individuals by examining a single or much fewer samples, reducing the workload from hundreds of samples to one or very a few. Instead of genotyping the large numbers of SNPs in individual samples on Affymetrix SNP chip for genome-wide association scans, a large number of papers have addressed pooling the DNA from large numbers of individuals [27-30].
In the present study, we report a new microarray-based strategy to estimate the allele frequencies of pooled samples. The method is highly sensitive and can be used to analyze a large number of markers and multi-pools simultaneously. Association studies with multifactorial subdivision strategy provide a powerful tool for the study of complex diseases and quantitative traits, influenced by disease heterogeneity, gene-gene and gene-environment interactions. Our method meets the need for the association studies using pooling strategy more elaborate than the two-pool (case vs. control) design. The populations of cases and controls can be stratified on the basis of secondary and tertiary traits to construct a series of sub-pools. In addition to some traits, factors such as age at onset, sex, lifestyle, or other clinical descriptions can also be used for categorization. This might capture effects of environmental factors that are known to affect the disease trait in question. The association studies with multifactorial subdivision strategy provide a powerful tool for the study of complex common disease and quantitative traits, influenced by the effects of disease heterogeneity, gene-gene and gene-environment interactions.
Genotyping pooled samples based on subdivision using microarray significantly reduces workload and cost with similar statistical power compared with genotyping individual samples. Assuming 1000 case and 1000 control samples with 100 candidate SNPs for a disease are to be studied and 100 samples in each pool, using our method, the 2000 PCR products can be analyzed by a single microarray, and a single hybridization reaction. Its statistical power is equivalent to 2 × 105 individual genotyping (100 SNPs × 1000 cases + 100 SNPs × 1000 controls), which is over a 100-fold reduction. In addition, we introduce the strategy of employing two universal florescence-labeled (Cy3 and Cy5) tags to form the sandwich structure with SNP detection probes and allelic sequences. Regardless the number of SNPs that need to be analyzed, only two universal florescence-labeled tags are needed. Therefore, it vastly reduces the experiment cost.
The fluorescent intensities of the two colors on a microarray spot are not directly proportional to the allele frequencies in pooled DNA samples. Differences in various aspects of oligonucleotide hybridization make it difficult to estimate allele frequencies based on the fluorescence signals. We have tackled this problem by theoretically deducing a Langmuir-type formula for each allele. This Langmuir-type isotherm model integrates the thermodynamics of immobilization and hybridization processes on microarray surface, and describes the relationship between the fluorescence intensity and allele concentration in the spotting solution. Biased amplification of different alleles may occur when the size range of microsatellite alleles is sufficiently great. However, no significant biased amplification with SNP alleles (only one-base difference which is not size difference) was observed in our experiments. Therefore, allele frequencies in the spotting solution can be directly considered as allele frequencies in pooled DNA.
Our method has advantages over other published protocols in its high sensitivity and capability to detect minor difference between allele frequencies. Compared to determination of allele frequencies in individual pools, it is more important to learn whether the allele frequencies among pools and sub-pools are significantly different from each other. The experimental results demonstrate that our method can successfully distinguish allele frequencies differing by 0.01 in the actual pool of clinical samples (P < 0.0078). Our results also demonstrate that alleles with a frequency as low as 2% can be detected. Our approach exemplifies the reproducibility of measurement with the mean divergence between individual and pooled allele frequencies of 0.7%, ranging from 0.05% to 2.2%. The observed SEMs varied from 0.0044 to 0.0103.
In conclusion, we have developed an accurate, robust and high-throughput method to estimate the allele frequency in a large number of samples by pooled DNA samples followed by PCR amplification and microarray analysis. The dynamics of the immobilization and hybridization of the PCR products on the solid surface was studied. The kinetics of these two processes was integrated to estimate the allele frequency in pooled DNA by establishing a Langmuir-type kinetic model. Our approach is inexpensive, efficient and capable of detecting interesting polymorphic differences in candidate gene association studies and genome-wide linkage disequilibrium scans.
DNA stocks, oligonucleotides and other reagents
The synthetic DNA stocks for SNPs ESR1E-U11 (T/C) ESR1F-U21 (A/G) and, TGFB1D-U2 (G/C), primers, probes and fluorescence-labeled tags were purchased from Songon Inc. (Shanghai, China), purified by reverse-phase HPLC using a standard procedure, and dried in vacuo. The plasmids containing inserts with the three point mutations (HBB17(A/T), HBB28(T/C), HBB26(C/T)) were constructed using a site-directed mutation technology. The genomic DNAs with the HBB28(T/C) mutation were from thalassemia patients, and provided by the First Affiliated Hospital of Gongxi Medical University. All other chemicals and solvents were purchased from Sigma and Gibco BRL (Carlsbad, CA, USA), and used without additional purification. Unless otherwise noted, all samples and buffers were prepared in deionized water prepared using a Milli-Q water purification system (Millipore Corp., Bedford, MA, USA).
Preparation of pooled DNA samples
DNA samples including the synthetic DNA stocks and plasmids were homozygous for either one or the other allele of the SNPs. The pooled DNA samples for one, SNP (HBB28), were prepared from the genomic DNA of 100 patients and normal individuals. These genomic DNA samples were isolated from lymphocytes in peripheral blood of clinical patients using Trizol reagent and dissolved in TE buffer. In all experiments, all the synthetic DNA stocks, plasmids and genomic DNA samples, were initially diluted to a concentration of 30 ng/μl and then mixed gently and requantitated to a working concentration of 10 ng/μl (± 0.1 ng/μl) using NanoDrop ND-1000 Spectrophotometer (NanoDrop Technologies, DE, USA). Pooled DNA samples were prepares based on their genotype for each given SNP.
Pooled DNA samples were amplified by using the TaKaRa PCR kits (rTaq) (Kyoto, Japan) with a thermal cycler (PTC-225, MJ Research, Waltham, MA, USA). PCR samples were initially heated up to 96°C for 3 min, and then amplified for 35 cycles. Each thermal cycle consisted of 30s at 94°C, 30s at 55°C, and 30s at 72°C. A final extension step was included for 5 min at 72°C. The sequences of PCR primers for amplification of the polymorphic sequences containing SNPs were TTCATCTGAGTTCCAAATGTCC and AATATACAATTATTTCAGAACCATTAGAGAC for ESR1E-U11; AGCTGTTTTATGCTTTGTCTCTG and AGGAATATACAATTATTTCAGAACCATT for ESR1F-U21; TGCTGCYGCTGCTGCTAC and CTCCATGTCGATAGTCTTGCA for TGFB1D-U2; AGGGTTGGCCAATCTACTCC and GTCTCCACATGCCCAGTTTC for HBB17, HBB28 and HBB26. PCR products were analyzed by gel electrophoresis with 1% agarose, and purified by precipitation with sodium acetate and 100% ice-cold alcohol, then washed in 75% alcohol several times and finally dried in DNA vacuum.
Microarrays fabrication of the PCR products
All PCR products representing different allele ratios were dissolved in a printing buffer (3 × SSC solution), and quantitated to a working concentration of 50 μM. PCR products were spotted into five identical matrixes on an aldehyde-coated slide (CEL Associates, Pearland, TX, USA). Within each matrix PCR products for synthetic DNA and plasmid samples were spotted in quintuplicate, and those for genomics DNA samples were spotted twenty times. Spots were ~200 μm in diameter and ~300 μm between adjacent centers. Printing was performed using the contact printing robot (SpotBot, Telechem International, CA, USA). After printing, all printed slides were hydrated overnight at 37°C in a box containing 200 ml saturated NaCl solution. After hydration, the slides were exposed to UV light at 950 mJ/cm2 in Stratalinker 2400 (Stratagene, CA, USA), then rinsed in boiling water for 2 min to denature the PCR product followed by rinsing in 100% alcohol for 1 min, dried by centrifugation at 1000 r.p.m. for 1 min, and stored in a vacuum oven .
Hybridization and microarray scan
Two hybridization processes were involved in microarray analysis: hybridization of the tags to the probes, and hybridization of the tag-probe duplexes to the allelic sequences. All SNP probes, which would hybridize with the corresponding DNA pool samples of the studied SNP markers in a matrix, were dissolved in 2 × hybridization buffer (Agilent Technologies. Inc. USA), and mixed with equal volume of Cy3/Cy5 labeled tags solution at a molar ratio of 1:2 to react for 30 min. Then the mixture was added to the hybridization matrix in the glass. The microarray was incubated overnight at 42°C in a hybridization cassette (Telechem International). After hybridization, the slide was sequentially rinsed in 2 × SSC/0.1% SDS and in 2 × SSC at 37°C, 5 min for each step, and dried by centrifugation. The slides were then scanned using a Genepix 4000B scanner (Axon Instruments, Foster City, CA, USA) at a resolution of 5 μm with 100% excitation intensity with PMT set to 55% and 48% for Cy5 and Cy3 channels, respectively. Spot analysis and quantification of the original 16-bit TIF images were performed with the Genepix software (v 5.0).
Standard deviation (SD) was calculated for the signal intensities of the repeats of each pooled sample. The significance levels of the differences between the allele frequencies for the all comparison were analyzed by the Kolmogorov-Smirnov test. The P value was used as a measurement of the degree of significance between the measured and known allele frequencies. The statistical software Systat 12.0 (Systat Software Inc., San Jose, CA, USA) was used to perform the data analyses.
SNPs: Single nucleotide polymorphisms; BAMPER: Bioluminometric assay coupled with modified primer extension reactions; RFLP: Restriction fragment length polymorphism; DASH: Dynamic allele-specific hybridization; SSCP: Single strand conformation polymorphism; ARMS: Amplification refractory mutation system; HPLC: High performance liquid chromatography.
Human genes: ESR1: estrogen receptor 1, Genbank No. NM_000125.2, OMIM#133430; TGFB1: transforming growth factor, Genbank No. NM_000660.3, OMIM#190180; HBB: hemoglobin, beta, Genbank No. NM_000518.4, OMIM#141900.
BCYe and BCYin conceived and designed the study and manuscript preparation. BCYe oversaw the project. HL contributed to discussions and manuscript preparation.
This work was supported by grants of 20627005 and 20776039 from the National Natural Science Foundation, the National Special Fund for SKLBE(2060204), 2007AA02Z331 from the 863 high-tech project, NCET-07-0287 from the Program for New Century Excellent Talents, and 06SG32 from the Shanghai Shuguang Program.
Zhou GH, Kamahori M, Okano K, Gao C, Harada K, Kambara H: Quantitative detection of single nucleotide polymorphisms for a pooled sample by a bioluminometric assay coupled with modified primer extension reactions (BAMPER).
Biotechniques 2000, 28:464-470. PubMed Abstract
Buetow KH, Edmonson M, MacDonald R, Clifford R, Yip P, Kelley J, Little DP, Strausberg R, Koester H, Cantor CR: High-throughput development and characterization of a genomewide collection of gene-based single nucleotide polymorphism markers by chip-based matrix-assisted laser desorption/ionization time-of-flight mass spectrometry.
Simpson CL, Knight J, Butcher LM, Hansen VK, Meaburn E, Schalkwyk LC, Craig IW, Powell JF, Sham PC, Al-Chalabi A: A central resource for accurate allele frequency estimation from pooled DNA genotyped on DNA microarrays.
Mol Biotechnol 2007, 36:142-150. PubMed Abstract