Abstract
Background
Diagnosis of infectious diseases now benefits from advancing technology to perform multiplex analysis of a growing number of variables. These advances enable simultaneous surveillance of markers characterizing species and strain complexity, mutations associated with drug susceptibility, and antigenbased polymorphisms in relation to evaluation of vaccine effectiveness. We have recently developed assays detecting single nucleotide polymorphisms (SNPs) in the P. falciparum genome that take advantage of postPCR ligation detection reaction and fluorescent microsphere labeling strategies. Data from these assays produce a spectrum of outcomes showing that infections result from single to multiple strains. Traditional methods for distinguishing true positive signal from background can cause false positive diagnoses leading to incorrect interpretation of outcomes associated with disease treatment.
Results
Following analysis of Plasmodium falciparum dihydrofolate reductase SNPs associated with resistance to a commonly used antimalarial drug, Fansidar (Sulfadoxine/pyrimethamine), and presumably neutral SNPs for parasite strain differentiation, we first evaluated our data after setting a background signal based on the mean plus three standard deviations for known negative control samples. Our analysis of single allelic controls suggested that background for the absent allele increased as the concentration of the target allele increased. To address this problem, we introduced a simple change of variables from customary (X,Y) (Cartesian) coordinates to planar polar coordinates (X = rcos(θ), Y = rsin(θ)). Classification of multidimensional fluorescence signals based on histograms of angular and radial data distributions proved more effective than classification based on Cartesian thresholds. Comparison with known diallelic dilution controls suggests that histogrambased classification is effective for major:minor allele concentration ratios as high as 10:1.
Conclusion
We have observed that the diallelic SNP data resulting from analysis of P. falciparum mutations is more accurately diagnosed when a simple polar transform of the (X,Y) data into (r,θ) is used. The development of high throughput methods for genotyping P. falciparum SNPs and the refinement of analytical approaches for evaluating these molecular diagnostic results significantly advance the evaluation of parasite population diversity and antimalarial drug resistance.
Background
Evaluating genotype data of pathogens isolated from a single human host presents an important challenge for molecular diagnostic assays as each human sample can contain varying numbers of parasite strains [1]. Genotype data for many diagnostic technologies are reported across a scale of fluorescence, though further evaluation is necessary to understand how fluorescence units correspond to the number of infecting pathogens. Further complications are encountered if the pathogen is diploid, if target sequences of interest are repeated in singlecopy genomes, or if haplotype resolution is a necessary component of diagnosis [2].
We have recently developed a postPCR, ligase detection reaction  fluorescent microsphere assay (LDRFMA) to evaluate single nucleotide polymorphisms (SNPs) associated with drug resistance in the notorious malaria parasite, Plasmodium falciparum (Pf) [3]. This assay employs Luminex xMAP technology to differentiate between possible alleles at a given locus. The genotypic fluorescence data generated provides semiquantitative information regarding the relative abundance of haploid parasite strains contained within an individual human patient. In our earlier study, we observed that as the fluorescence signal for single allele positive controls increased with increasing DNA template concentration, fluorescence associated with the allele not present in the sample would increase above background [3].
We approach the problem of accurate genotype identification by transforming bivariate fluorescence data from standard Cartesian (x,y) coordinates into polar (r,θ) coordinates as described below. In the standard Cartesian representation, the xaxis represents the fluorescence signal in an individual sample associated with allele X (typically the wildtype or drug sensitive allele) and the yaxis represents the fluorescence signal associated with allele Y (typically the mutant or drug resistant allele). This polar transformation takes into account the apparently linear crosstalk interaction responsible for the background signal. A similar kind of bivariate transformation has been applied to genotyping of diploid systems [47]. In the diploid case, these methods effectively distinguish between small, discrete numbers of possible diploid copy numbers (e.g. 0, 1 or 2 allelic copies). Our method distinguishes infection by either, both, or neither form of an allele that may be present in any given sample in an effectively continuous range of possible copy numbers.
Accurate classification of background and positive signal is essential in evaluation of infection by any pathogenic species and its various strains. In the context of malaria, isolate differentiation is utilized for studies of parasite population diversity and structure, as well as in identification of drug resistant parasites. Precise estimation of parasite population diversity is essential as this diversity can affect the ability to monitor antimalarial drug efficacy in a region [8]. Genotyping of specific drug resistance loci carries special importance because improper treatment of P. falciparum can lead to death of infected patients. Moreover, better identification of drug resistant parasites in endemic populations could be used to revise treatment protocols recommended by Ministries of Health [9] as increasing prevalence of drug resistant parasites has been associated with increased malaria morbidity and mortality [10]. As the arsenal of antimalarial drugs is limited, improved monitoring of drug efficacy and diagnosis of mutations associated with drug resistance is important for programs attempting to control the prevalence of infection by Plasmodium species parasites and reduce the potential for mosquitoborne transmission.
Results and Discussion
Motivation for the polar coordinate transformation
When diallelic SNPs occur in a population of haploid P. falciparum bloodstage parasites, infected individuals may be positive or negative for infection by parasites carrying either allele. With respect to any single SNP locus there are four possible infection "states": X+/Y (infected by wildtype parasites only), X/Y+ (infected by mutant parasites only), X+/Y+ (infected by both wildtype and mutant parasites), or X/Y (no infection). Three of these states (X+/Y, X/Y+, X+/Y+) cannot be distinguished using standard blood smear microscopy because the SNP does not confer observable changes in parasite morphology. The goals of the present study were to distinguish the infection states of individuals in an endemic population and to establish levels of confidence in the diagnostic assessment.
Unlike genotyping results on singlecopy loci from diploid organisms, where allelespecific signals are based on discrete template quantities (0, 1 or 2 allelic copies), allelespecific signals for P. falciparum cover a continuous and varying range as a human infection can consist of multiple parasite strains with wide ranging parasitemia levels for each strain. The LDRFMA procedure produces fluorescence measurements associated with the presence of varying levels of one or two sitespecific alleles in a blood sample. Some lowlevel background signal is observed in control samples containing no DNA. The standard method for establishing the presence of a given allele in a sample is to impose a threshold based on the background signals obtained from a number of blank controls; background signals for alleles of the Pf dihydrofolate reductase gene (dhfr) when evaluated in uninfected North American controls ranged from 154 to 322 Median Fluorescence Intensity (MFI) units [3]. A typical threshold would then be set at a fluorescence level of at least two standard deviations above the mean obtained from this ensemble of DNAnegative blanks. In LDRFMA diagnosis interpretation of true positive from negative fluorescence measurements becomes complicated by background fluorescence signal when, for example, allele "Y" is not present, while allele "X" is present at high levels. Results in Figure 1 illustrate this outcome in a control experiment where LDRFMA was performed on the dhfr SNP occurring at amino acid 164 (sensitive allele = isoleucine [I; codon ATA]; resistant allele = leucine [L; codon TTA]). PCR amplification was performed on a dilution series of the Pf laboratoryadapted strain, 3D7 (dhfr 164I). The average background signal when no 3D7 genomic DNA was added to the initial PCR was 95.5 MFI units (3xSD = 156.6 MFI units). At higher concentrations of 3D7 genomic DNA the dhfr 164L signal was consistently above the conventional background value (> 200 MFI units).
Figure 1. Detection of the dhfr 164 SNP alleles: serial dilution of P. falciparum 3D7 genomic DNA. Serial dilutions were performed on the laboratoryadapted strain, 3D7 known to carry the 164 I allele in multiple experiments based on twofold, fivefold, and tenfold dilutions. The axes of this plot are in Median Fluorescence Intensity (MFI); xaxis is MFI for 164 I, yaxis is MFI for 164 L. The linear least squares line (solid black; r = 0.82), illustrates how the MFI for 164 L increases as the concentration of 3D7 genomic DNA, and MFI for 164 I increases despite the absence of the 164 L allele in the 3D7 genome. The mean of the background 164 L allele is plotted as a solid black horizontal line; dashed lines designate mean + 1 standard deviation (SD; black), +2 SD (black) and +3 SD (red). Additional experiments focused on the P. falciparum dhfr codon 59 SNP of strains Dd2 and 3D7, and dihydropterate synthetase (dhps) codon 613 SNP routinely showed this same linear increase in background signal associated with increased concentration of the dhfr PCR product.
Because the two sequences differ at only a single base pair, it was not surprising that a small fraction of LDR products representing the 164L allele (not present in 3D7) formed despite sequence mismatch when the concentration of the 164I template increased through PCR amplification. This is particularly true since the mismatched base pairs occur at the end of the oligonucleotide probe sequences and not in middle positions found to be more destabilizing to probe hybridization [11]. In our analyses of field samples from malariaendemic regions of Papua New Guinea (PNG), we have observed these same background trends in our results. Figure 2 shows an example of bivariate fluorescence data obtained from the LDRFMA analysis of 264 PNG patients at the dhfr 59 (C/R) locus. This locus was chosen as both alleles are found in the PNG parasite population, in contrast to the dhfr 164 locus. Using the threshold determined as three standard deviations above the mean calculated from known negative samples, these results suggest that 68 individuals (25.8%) were infected with mixtures of 59 C and R (Figure 2). When the XYplot of this data is examined by eye, there is an obvious increase in background levels for both alleles as the median fluorescence increases (Figure 2). Given that similar results were seen in our dilution control experiment (Figure 1), we were concerned that the conventional approach for differentiating positive from background fluorescence was resulting in overreporting the number of mixed strain infections. Motivated by the possibility that the background signal could be roughly linearly proportional to the concentration of the nonmatching sequence, we introduced a transformation of the (X,Y) fluorescence data that is compatible with this natural multiplicative structure in the distribution of the data. We introduce polar coordinates (r,θ) such that and (Figure 3). The magnitude r represents the distance in the plane from the point at (0,0) to the point at (x,y), and provides a measure of the overall level of infection. The quantity θ represents the angle between the xaxis and ray extending from the point at (0,0) to the point at (x,y). The angle θ depends only on the ratio y/x of the two fluorescence signals. If the background signal were truly linear, samples with single strain infections would appear in the Cartesian plane along lines with y/x ratios equal to a constant. Consequently we sought to determine diagnosis thresholds in terms of the angle θ (or equivalently, in terms of the ratio y/x) rather than in terms of x or y alone.
Figure 2. Detection of the dhfr 59 SNP alleles from infected Papua New Guinean study participants. LDRFMA was performed on 264 clinical samples and the allelespecific MFI data is plotted for dhfr 59 C/R (xaxis is MFI for 59 C, yaxis is MFI for 59 R). The mean plus three standard deviations thresholds for each allele are plotted as red dashed lines.
Figure 3. Polar Transform of Cartesian (X,Y) fluorescence data. The polar transformation is a method that converts Cartesian (X,Y) coordinates into their angular (θ) and magnitude (r) components. Each coordinate (X and Y) can be written as the product of its magnitude and angle components, i.e. X= rcos(θ) and Y = rcos(θ). From these expressions, we can solve for and
Histogram based threshold estimation
The four possible states of infection observed through the combination of two allelespecific fluorescence signals occupy four regions of the (X,Y) plane. Results for individuals with monoallelic infections lie near either the x or yaxis, whereas results for mixed strain infections lie somewhere in the upper right region of the plane. Uninfected individuals cluster together near the origin. The classification problem amounts to drawing boundaries between these groups and providing confidence level assessments of the locations of the boundaries. In the (r,θ) plane, the transformed fluorescence data again occupy four regions (Figure 4A to 4B). Following the transformation from Cartesian to polar coordinates, the boundaries between each region become parallel to the horizontal or vertical axes, making them straightforward to estimate.
Figure 4. Rationale for establishing thresholds to differentiate between background and truepositive fluorescence signals. Panel A: Beginning with raw data, the histogram method transforms the (X,Y) data into polar coordinates, (r,θ) (Panel B). Panel C: From the polar coordinates, the method generates a histogram of the magnitude, finds the first minimum (magnitude threshold), and removes the negative samples. Then the method generates a histogram of the angular components and determines the first and last minima as the two angular thresholds. Panel D: Finally, the thresholds are transformed back to Cartesian coordinates, where the angular thresholds become lines and the magnitude threshold becomes a quarter circle.
We now introduce a novel algorithmic procedure for determining allelecalling thresholds of individual samples. Our procedure is based on a segmentation technique applied to histograms of data obtained from an entire sample population. In practice we find that the density of bivariate fluorescence measurements obtained empirically typically falls into four distinct clusters corresponding to the four possible states of infection. Our technique exploits the naturally occurring gaps present in histograms produced by such a mixture of differently infected populations.
Our histogramsegmentation analysis of the diallelic population fluorescence data proceeded in two steps. First, we constructed histograms corresponding to the number of data points in the population with magnitudes r falling in each of n_{r }equally spaced bins beginning with the minimum magnitude, r_{ }= min(r), and finishing with the maximum magnitude, r_{+ }= max(r). We chose the number of bins as (this notation indicates that we round the quantity up to the nearest integer) giving bins of width approximately equal to 100 MFI units (as shown in Figure 4C). The maximum fluorescence levels obtained with Luminex FlexMap microspheres has been observed to be 25,000 MFI, and experimental blanks typically have levels less than 200 MFI. We expected uninfected individuals to have small positive background fluorescence signals clustered together around a central mean (see Methods). Therefore, we determined the first local minimum in the histogram as we increased from the lowest bin, r_{ }≤ r ≤ r_{ }+ 100. Heuristically, the density of points along the raxis should peak near r = r_{ }and decrease smoothly as we move to successive bins, leaving the uninfected population; the density should reach a minimum before climbing again as one enters the various infected populations. Practically, this method has worked well for data sets ranging in population size from 44 to 667. For very small data sets (n ≤ 40) this histogrambased method should be used with caution. For further details on implementing the algorithm, see Methods. Sample code written in MATLAB is available from the authors upon request.
Having set a threshold for distinguishing infected from uninfected individuals, we then constructed a histogram of the distribution of angles θ of the remaining, infected individuals (Figure 4C). We used 45 equally spaced bins ranging from θ = 0 to θ = π/2 (zero to ninety degrees, in increments of two degrees per bin). We set the threshold for the X/Y+ population (wildtype infection only) to be the first minimum in the histogram as θ increases past the local maximum near θ ≈ 0. The threshold for the X/Y+ population (mutant infection only) is the first minimum below the local maximum near (or 90 degrees). Individuals with wildtype only infections lie scattered around a vertical line at an angle θ that is positive (positive slope in the (X,Y) plane) but close to zero. Individuals with infections carrying only the mutant allele lie scattered around a vertical line at an angle close to but less than (or 90°). Individuals with mixed infections were scattered throughout the plane between these two vertical lines, and uninfected individuals lay along the horizontal axis with arbitrary values of θ. Once thresholds for distinguishing the four classes of infection/no infection in the polar coordinate plane were established, an inverse polar coordinate transformation returned the data and the diagnosis thresholds to the Cartesian (X,Y) plane (Figure 4D). This final recapitulation of the data back into the Cartesian plane illustrates that the background cutoff lines are not parallel with either the X or Y axes, as they would have been following previous commonly practiced data analysis strategies (cf. Figure 1).
Confidence Intervals
In order to estimate the probability of classification error following polar transformation of the data, we used bootstrap resampling with replacement to establish percentage confidence interval distributions for the three thresholds (uninfected vs. infected; X+/Y vs. X+/Y+; X/Y+ vs. X+/Y+) [12]. With each resampled data set, we used the histogram method to determine thresholds as described above. After running the bootstrap 1,000 to 100,000 times, we determined (1  α) percentile confidence intervals that were placed at the and quantiles of the bootstrap estimated thresholds (shown in Figure 4C). We used an increasing series of confidence levels, α_{1 }<α_{2 }<...<α_{n}. Each pair of confidence parameters α_{j }<α_{j + 1 }encloses a region of the (x,y) plane (or equivalently, the (r,θ) plane). A hypothetical point lying in such a region would have a classification confidence between α_{j }and 1α_{j + 1}. For example, the pair α_{j }= .05 <α_{j + 1 }= .10 would determine a region of the plane corresponding to points with diagnosis confidence exceeding 90%, but not higher than 95%. Table 1 shows the number of points in our sample population lying within a given confidence region. The weighted average of the confidence values gives a lower bound estimate of the average diagnosis error (see Equations 12 under Methods). We then used these confidence intervals to obtain upper bounds on the probability that the classification of a randomly chosen member of the population might change upon resampling. If the confidence intervals on the threshold settings were reasonably narrow, then only those individuals whose combined fluorescence measurements located them near one of the thresholds would be in danger of uncertain classification. By placing the thresholds at local minima in the density of sample points we minimized the number of individuals with ambiguous diagnoses. From these estimates, we can determine an upper bound on the probability of misdiagnosis across the entire population (denoted Pr (error)). This upper bound on the probability of error in turn gives us a lower bound on the overall confidence of our classifications (as 1Pr (error)), shown in Table 1. For details of this calculation, please see the Methods section.
Table 1. Locusspecific uncertainty and confidence intervals
The first column of Table 1 shows bootstrap based confidence interval (CI) results for Dhfr locus 59. Out of a total population of 264 samples, six fell within a 68% confidence interval for classification but not within an 80% CI (first row of table). One fell within a 90% CI but was uncertain at the 95% confidence level (third row of table), et cetera. The last two rows of Table 1 indicate that two samples were classified with 99.9% confidence, but not better than 99.9%, and the remaining 245 samples were classified with better than 99.9% confidence as estimated by bootstrap resampling. As shown in Table 1 the assignments at each SNP locus based on the polar transform technique had bootstrap classification confidence estimates of 89% or better when averaged across the population.
Mixed Dilution Control Experiment
In order to assess the sensitivity of the histogram segmentation based diagnostic method we used mixtures of laboratory clones (HB3, alleleT; K1, alleleA) with known genotypes to perform control LDRFMA assays on the chr1SNP (PlasmoDB identifier, CombinedSNP.MAL1.1085 ([13]). We used a total of 34 mixtures with known DNA concentration ratios ranging from 1:1 to 99:1 (major allele:minor allele) along with a pure sample of each allele and a blank (37 samples total). The LDRFMA assay was also performed on 264 field samples from human subjects collected in PNG. The histogram method was applied to the field data alone in order to generate diagnostic thresholds with 95% confidence intervals (see Figure 5). Subsequently the control samples were called using the thresholds obtained from the field data, and compared with the known DNA concentrations.
Figure 5. Control experiment using sample mixtures of known allelic ratios to assess the sensitivity of the histogram segmentation analysis. The LDRFMA assay was performed for chr1SNP (CombinedSNP.MAL1.1085, PlasmoDB SNP identifier [13]) on 264 samples collected in PNG (blue dots). The polar coordinate histogram method generated diagnostic thresholds (black solid lines) and 95% confidence intervals (red dotted lines). The LDRFMA assay was performed separately for the same SNP on 34 mixtures of control strains with known genotypes (HB3, alleleT; K1, alleleA) at DNA concentrations ranging from 1:1 to 99:1 (major allele:minor allele, crosshair symbols) along with a pure sample of each strain and a blank. Samples correctly called as "mixed" had concentration ratios ranging from 1:1 to as high as 15:1. All samples with major:minor allele ratios of 29:1 or higher were misclassified as "single strain infections". Of samples with major:minor allele ratios of 9:1 or smaller, 95% were called correctly (19 of 20); of those with ratios 10:1 or smaller, over 90% were called correctly (22 of 24).
The results show that the histogram based method provides reliable results for a range of DNA concentration ratios (analogous to relative parasitemias for two allele types in a given patient) and breaks down once the ratio between major and minor allele concentrations becomes too large. Samples with major:minor allele ratios ranging from 1:1 to 5:1 were called correctly (as mixed infections) 100% of the time (18 out of 18 samples); samples with ratios of 9:1 or 10:1 were called correctly 67% of the time (4 out of 6 samples); both pure samples were called correctly. Mixed samples with ratios of 15:1 were called correctly 50% of the time (1 out of 2 samples) and samples with ratios of 29:1 or higher were always miscalled (as single rather than mixed infections). This performance is expected because of the mismatch binding effect described earlier: when the lower concentration allele is present in too small an amount relative to the major allele, the minor allele disappears below the advancing background signal. Fortunately the method is nevertheless robust over a wide range. In particular, all of the 21 samples called as "mixed" with nominal confidence exceeding 95%, were in fact called correctly. By taking the natural twodimensional geometry of the data into account, the polar coordinate/histogram segmentation method greatly reduces the risk of false positive diagnoses of mixed infection, without increasing the risk of false negative diagnoses of overall infection. In practice diagnostic assays should therefore include dilutions of allelespecific controls to evaluate potential for misclassification of samples on any given 96well plate.
Diagnostic Threshold Variability Across SNPs
We applied the histogram segmentation based diagnostic algorithm to data for six diallelic SNPs (dhfr 59 and 5 additional SNPs listed in Additional File 1: Primers and PCR amplification conditions for additional SNPs) from field samples collected in PNG. Due to a range of factors the angle thresholds (θ) for distinguishing mixed from single strain infections varied significantly between SNPs. The X+/Y versus X+/Y+ threshold (wildtype only versus mixed) ranged from 0.12 to 0.40 radians (7° to 23°) with a mean ± s.d. of 0.22 ± 0.07 radians (12 ± 4)°. The X/Y+ versus X+/Y+ threshold (mutant only versus mixed) ranged from 1.34 to 1.48 radians (77° to 85°) with a mean ± s.d. of 1.42 ± 0.04 radians (83 ± 2)°. This variability in the natural breakpoint between single and mixed infections may result from SNPtoSNP variability of practical peak fluorescence, differences in population structure such as relative prevalence of particular SNPs, sequence specific differences in the likelihood of mismatch binding in the LDR assay, among other effects. Because of this variability we cannot fix a single diagnosis threshold in the angle θ (or equivalently, in the ratio y/x) that will work equally well for all populations and SNPs.
Additional file 1. Primers and PCR amplification conditions for additional SNPs. Primers and conditions for PCR amplification of additional SNPs utilized in assessing the sensitivity of the histogram segmentation analysis and for examining the diagnostic threshold variability.
Format: PDF Size: 20KB Download file
This file can be viewed with: Adobe Acrobat Reader
Comparison of histogram based threshold estimation to conventional Cartesian estimations
Following the finding that our histogram based method provides reliable results for a range of DNA concentration ratios, we can ask how well a method based on Cartesian thresholds would perform in comparison. The standard method employed in malarial SNP genotyping analysis is to set a positive infection threshold based on the distribution of fluorescence observed in an ensemble of blank controls included in the LDR/FMA assay. Typically, this threshold is set at two or three standard deviations above the mean of the blank ensemble fluorescence level.
Table 2 shows the number of individuals assigned to each of four possible states of infection with 95% or higher confidence by the polar transform histogram method, along with the number assigned to the corresponding categories using the standard Cartesian method (mean plus three standard deviations). The final row of the table lists the total number of samples reclassified when the two methods are compared. Even taking into account the number of points classified as "uncertain" by the bootstrap reclassification (an average of 31 samples +/ 15 for six loci examined), the Cartesian method misclassifies a very significant fraction of individuals at each SNP position. This misclassification ranges from 26 to 67% of the samples at each locus, or 76 to 237 samples (Table 2).
Table 2. Reclassification of SNP allele calls by use of polar transformation compared to conventional methods
This contrast in data analysis is apparent for the classification of infections carrying resistanceassociated mutations at dhfr codons 59. Whereas 87 samples would be classified by conventional methods at codon 59 as infections including a mixture of parasites carrying both drug sensitive and resistant alleles, only 17 samples were classified to carry this mixed strain infection by the polar transformation method. As mutations in dhfr codons 59 are associated with high levels of resistance to drugs targeting the folate biosynthesis pathway [14], misclassification of genotyping results would have significant bearing on the expected susceptibility phenotype for important antimalarial drugs such as sulfadoxine/pyrimethamine (SP, Fansidar) and LAPDAP [15,16].
The misclassification of allele calls by conventional methods at the chr1SNP and chr9SNP are even more apparent than the dhfr codon 59 data with 226 and 237 samples respectively reclassified following the use of the polar transformation method (Table 2). This reclassification is caused by high levels of background fluorescence observed for the alleles at these loci as shown in Additional File 2: Plot of chr1SNP fluorescence data with conventional Cartesian thresholds and thresholds generated by the polar coordinate histogram method, and Additional File 3: Plot of chr9SNP fluorescence data with conventional Cartesian thresholds and thresholds generated by the polar coordinate histogram method. While genotypes at these SNP loci carry no clinical significance, these data indicate that each individual allele specific primer and associated reagents will behave differently in regards to levels of background signal. Accurate allele discrimination for genotyping P. falciparum infections therefore requires refined data analysis tools adapted to the intrinsic geometrical structure occurring in population data, such as the histogram based approach introduced here.
Additional file 2. Plot of chr1SNP fluorescence data with conventional Cartesian thresholds and thresholds generated by the polar coordinate histogram method. Samples (n = 357) were genotyped at Chr1SNP using the LDRFMA method. Allele calls were then made using both the conventional methods (3× standard deviations above the mean of known negative samples) and by the polar coordinate histogram method. Conventional thresholds are indicated by the red dotted line, and the polar thresholds are indicated by the black solid line with the black dotted line showing the 95% confidence intervals.
Format: PDF Size: 49KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 3. Plot of chr9SNP fluorescence data with conventional Cartesian thresholds and thresholds generated by the polar coordinate histogram method. Samples (n = 357) were genotyped at Chr9SNP using the LDRFMA method. Allele calls were then made using both the conventional methods (3× standard deviations above the mean of known negative samples) and by the polar coordinate histogram method. Conventional thresholds are indicated by the red dotted line, and the polar thresholds are indicated by the black solid line with the black dotted line showing the 95% confidence intervals.
Format: PDF Size: 47KB Download file
This file can be viewed with: Adobe Acrobat Reader
Conclusions
It is to be expected that the data analysis challenges described herein will be observed in other postPCR genotyping systems, as offtarget background signals commonly increase with higher levels of PCR product amplified [17,18]. Accurate discrimination of true positive from background signal when genotyping microbial pathogens by a range of methods, including LDRFMA and Taq Man, requires novel strategies for data analysis [19,20]. In regards to malaria, isolate differentiation is utilized for studies of parasite population diversity and structure, as well as in identification of drug resistant parasites. The reliability of genotyping methods for strain differentiation can affect the ability to monitor antimalarial drug efficacy and the presence of drug resistance mutations in a parasite population. In the future, the methods described herein can be coupled with expectation maximization approaches to further evaluate multilocus genotype and haplotype frequencies in parasite populations [21].
Methods
Blood samples
Samples evaluated in this study were part of ongoing efforts to monitor Plasmodium species infections in the Wosera, East Sepik Province, Papua New Guinea (PNG) [22,23]. This region of northern, lowland PNG is holoendemic (parasite rate in oneyear old children > 0.75 [2426]) for malaria and all four human malaria parasite species, P. falciparum, P. vivax, P. malariae and P. ovale are observed. Informed consent was obtained from all study participants. This study was approved by the Medical Research Advisory Committee of PNG and by the Institutional Review Board for Human Investigation at University Hospitals of Cleveland, Ohio.
P. falciparum (Pf) laboratory strains
Pf laboratoryadapted strains obtained from the Malaria Research and Reference Reagent Resource (MR4, ATCC Manassas, Virginia) included 3D7 (MRA102), HB3 (MRA155), K1 (MRA159) and VS/1 (MR4176). In vitro growth of Pf was performed as described previously [27]. Thin blood smears were fixed with 100% methanol for 30 seconds, stained with 4% Giemsa for 30 minutes, and examined by microscopy with an oilimmersion objective (100×). Parasitemia was based on the number of infected red blood cells (IRBCs)/total red blood cells (infected plus uninfected RBCs; n = 1,000).
DNA template preparation
DNA was extracted from whole blood field samples (200 μL) using the QIAamp 96 DNA Blood Kit (Qiagen, Valencia, CA). Genomic DNA was extracted from cultures (200 μL) of laboratory adapted Pf strains using the QIAamp DNA blood mini kit (Qiagen, Valencia, CA).
PCR amplification of Pf SNPs
All reactions (25 μl) were performed in a buffer containing 3 pmoles of the appropriate upstream and downstream primers, 67 mM TrisHCl, pH 8.8, 6.7 mM MgSO_{4}, 16.6 mM (NH_{4})_{2 }SO_{4}, 10 mM 2mercaptoethanol, 100 μ M dATP, dGTP, dCTP, and dTTP, and 2.5 units of thermostable DNA polymerase. Amplification reactions were performed in a Peltier Thermal Cycler, PTC225 (MJ Research, Watertown, MA). Specific primers and thermocycling conditions used to amplify the Pf dihydrofolate reductase (dhfr) target sequence for evaluating polymorphisms associated with pyrimethamine resistance were described in Carnevale et al [3]. Additional, presumably neutral SNPs, chosen irrespective of location or function were utilized for assessing the sensitivity of the histogram segmentation analysis and for examining the diagnostic threshold variability were amplified using the primers and conditions listed in Additional File 1: Primers and PCR amplification conditions for additional SNPs. Following PCR amplification, products were loaded on 2% Agarose gels (Amresco, Solon, OH), and electrophoresis was performed in 1× TBE buffer (8.9 mM Tris, 8.9 mM boric acid, 2.0 mM EDTA). The gels were stained for 30 min with SYBR Gold (Molecular Probes, Eugene, OR), diluted 1:10,000 in 1× TBE buffer, and DNA products were visualized on a Storm 860 using ImageQuant software (Molecular Dynamics, Sunnyvale, CA).
LDRFMA evaluation of Pf SNPs
The methods and strategies used to perform the LDRFMA evaluation of Pf mutations in the dhfr gene have been previously described in detail [3]. Primers for the LDRFMA diagnosis of the additional, presumably neutral SNPs chr1SNP, chr7SNP, chr8SNP, chr9SNP, and chr13SNP are listed in Additional File 4: Ligase detection reaction primers for genotyping additional SNPs. The following brief description and summary in Figure 6 provide an overview of the threestep, postPCR, LDRFMA procedure.
Additional file 4. Ligation detection reaction primers for genotyping additional SNPs. Primers for the LDRFMA diagnosis of additional SNPs utilized in assessing the sensitivity of the histogram segmentation analysis and for examining the diagnostic threshold variability.
Format: PDF Size: 20KB Download file
This file can be viewed with: Adobe Acrobat Reader
Figure 6. Overview of postPCR, LDRFMA diagnosis of P. falciparum dhfr drug resistanceassociated SNPS. Allelespecific oligonucleotides include 5' "TAG" extensions that are 24 nucleotides in length. TAG sequences (n = 100), containing no G residues, have been designed to reduce potential for crosshybridization with sequence amplified from naturally occurring target sequences. Conserved sequence oligonucleotides are 5'phosphorylated and 3'biotinylated. Fluorescent microsphere sets (n = 100) are embedded with varying ratios of red and infrared fluorochromes to result in unique fluorescent "classification" signatures. Each microsphere set is precoupled to unique antiTAG oligonucleotides specifically complementary to the TAG sequences. Following ligation detection and TAG:antiTAG hybridization reactions, doubly labeled ligation products are detected and the "reporter" signal (Rphycoerythrin) is classified into allelespecific bins by the BioPlex array reader and BioPlex Manager 3.0 software. Data from these reactions are immediately available in Excel spreadsheets.
Following PCR amplification of the genespecific target sequences carrying the locus of interest, products were combined into a multiplex LDR (Step #1) where allelespecific upstream primers ligate to conserved sequence downstream primers. Upstream, allelespecific primers include 5' extensions of unique "TAG" sequences. Downstream, conserved sequence primers are modified by 5' phosphorylation and 3' biotinylation. The 5' ends of the LDR products receive "classification" labeling in a second multiplex (Step #2) reaction where hybridization occurs between the TAG sequences added to the allelespecific primers and antiTAG (complementary sequence) oligonucleotide probes bound to fluorescent microspheres. Following this hybridization reaction, products are incubated (Step #3) in a solution containing streptavidinRphycoerythrin (SAPE) to allow "reporter" labeling through binding to the 3'biotin on the conserved sequence primers. Detection of doubly labeled ligation products occurs through dual fluorescence flow cytometry in the BioPlex array reader (BioRad Laboratories, Hercules, CA) and leads to collection of "reporter" signal in unique allelespecific bins. AntiTAG oligonucleotide probes bound to fluorescent microspheres (2.5 × 10^{5 }beads/mL/US$25) are available from Luminex Corporation (Austin, TX).
Specific LDR primers/probes used for the dhfr locus have been previously described [3] and primers/probes for additional SNPs utilized here are listed in Additional Table 2. Individual reactions were performed in a solution (15 μL) containing 20 mM TrisHCl buffer, pH 7.6, 25 mM potassium acetate, 10 mM magnesium acetate, 1 mM NAD+, 10 mM dithiothrietol, 0.1% Triton X100, 10 nM (200 fmol) of each LDR probe, 1 μL of each PCR product, and 2 units of Taq DNA ligase (New England Biolabs, Beverly, MA). Reactions were initially heated at 95°C for one minute, followed by 32 thermal cycles at 95°C for 15 seconds (denaturation) and 58.0°C for 2 minutes (annealing/ligation). The multiplex LDR product (5 μL) was then added to 60 μL of hybridization solution (3 M tetramethylammonium chloride [TMAC], 50 mM TrisHCl, pH 8.0, 3 mM EDTA, pH 8.0, 0.10% sodium dodecyl sulfate) containing 250 Luminex FlexMAP microspheres from each allelic set (total number of alleles = 9). Mixtures were heated to 95°C for 90 seconds and incubated at 37°C for 40 minutes to allow hybridization between SNPspecific LDR products and beadlabeled antiTAG probes. Following hybridization, 6 μL of streptavidinRphycoerythrin (Molecular Probes, Eugene, OR) in TMAC hybridization solution (20 ng/μL) was added to the postLDR mixture and incubated at 37°C for 40 minutes in Costar6511M polycarbonate 96well Vbottom plates (Corning Inc., Corning, NY). Detection of SNPspecific LDR:microspherelabeled antiTAG hybrid complexes was performed using a BioPlex array reader (BioRad Laboratories, Hercules, CA); the plate temperature was set to 37°C throughout detection. Fluorescence data, reported as median fluorescence intensity (MFI; range 0 to 25,000), were collected using BioRad software, BioPlex Manager 3.0 (BioRad Laboratories, Hercules, CA).
As a first step in evaluating background signals generated by our LDRFMA diagnostic assays, we analyzed 70 samples from randomly selected American Red Cross blood donors who had no history of malaria exposure. Through this analysis we observed that median fluorescent intensity (MFI) LDRFMA signals from these samples were normally distributed. From these results, conventional methods (3× standard deviations above the mean) for establishing thresholds between negative and positive fluorescence were deemed appropriate for comparison against our polar transformation method (Table 2).
Statistical analyses and graphing
All statistical analyses were performed using MATLAB version 7.7 (R2008a or b) (MathWorks Inc., Boston, MA). After transforming a set of N bivariate (x,y) fluorescence values into polar (r,θ) coordinates and forming the histogram as described above, the "first minimum after the initial maximum" in the magnitude variable was found by applying three criteria. Let n(r) denote the number of counts in the histogram centered at magnitude r and let Δr denote the bin width. (1) The histogram count at the minimum should be less than the counts on either side, or possibly equal to the number on the right, i.e. n(rΔr) > n(r) ≤ n(r + Δr). (2) The location of the minimum should exceed the location of the first maximum. The first maximum is required to be a local maximum and to exceed a minimum count requirement, to avoid finding spurious local max/min combinations before the true first population maximum. We used bins of size Δr = 100 MFI units, approximately equal to a single standard deviation of the MFI distribution for uninfected blank samples. We find the following heuristic to be robust in practice: assuming at least 10% of the population falls in the "uninfected" class, we expect approximately 50% of that number to fall within one standard deviation of the uninfected population mean. Therefore we use a threshold for the bin count at the "first maximum" of max(8,N/20). (3) We impose a minimum value for the relevant "first minimum after the initial maximum." Assuming bin widths and standard deviation of MFI blank signals of 100 each, we set the bin occupancy threshold equal to twice the number of counts expected to fall between two and three standard deviations from the blank mean, if all N samples were actually uninfected. That is, we require
where erf denotes the standard error function.
After obtaining a magnitude cutoff (r_{*}) for distinguishing infected from uninfected samples, we produce a histogram of the angles θ for all N_{inf }≤ N "infected" samples, i.e. those with (r > r_{*}), in 45 bins with a width of 2 degrees ranging from zero to π/2 (ninety degrees). Here let n(θ) denote the number of counts in the histogram centered at angle θ and let Δθ denote the bin width for the angle histogram. Again we apply three criteria to find the wildtype/mixed cutoff (θ_{lo}, close to θ = 0) and the mutant/mixed cutoff (θ_{hi}, close to θ = π/2). First we find the largest count in the range 0 to π/4 (45 degrees) or π/4 to π/2 (45 to 90 degrees), respectively. (1) Given maximum counts at θ = θ_{0 }(the local maximum close to θ = 0) and θ = θ_{90 }(the local maximum close to θ = π/2) we require θ_{0 }<θ_{lo }and θ_{hi }<θ_{90}. (2) We require that the diagnosis thresholds be placed at local minima of the histogram, i.e. n(θ_{lo } Δθ) > n(θ_{lo}) ≤ n(θ_{lo }+ Δθ) and n(θ_{hi } Δθ) ≥ n(θ_{hi}) <n(θ_{hi }+ Δθ). (3) In order to be an appropriate local minimum, the count at θ_{lo }or θ_{hi }should not exceed the expected number of counts for the uniform distribution of N_{inf }samples over 45 bins, or N_{inf}/45.
Confidence Intervals
As described in the Results section, we estimated the probability of classification error via bootstrap resampling with replacement to establish percentage confidence interval distributions for the three thresholds (uninfected vs. infected; X+/Y vs. X+/Y+; X/Y+ vs. X+/Y+) [12]. With each resampled data set, we used the histogram method to determine thresholds. After running the bootstrap 1,000 to 100,000 times, we determined (1  α) percentile confidence intervals that were placed at the and quantiles of the bootstrap estimated thresholds (Figure 4C). We used these confidence intervals to obtain upper bounds on the probability that the classification of a randomly chosen member of the population might change upon resampling as follows. Let ρ(r,θ) denote the density of signal throughout the polar plane, and let f(r,θ) denote the classification confidence associated with any given point on the polar plane. We would like to approximate the following integral in order to bound the total probability of error by our thresholds (that is, the probability of misclassifying a sample):
for θ_{ }= min(θ) and θ_{+ }= max(θ) (respectively for r_{ }and r_{+}). We can rewrite this integral as:
where x_{i }= (r_{i},θ_{i}) and N is our total sample size. If we choose a subset of confidences, such as {α_{1},α_{2},...α_{n}} for α_{j }∈ (0,1) then we can obtain an upper bound on the probability of error as the sum of the desired confidences, α_{j}, multiplied by the fraction of the total sample size that lies inside those α_{j }confidence intervals. (For example, a value of α = 0.05 corresponds to a 95% confidence.) From these estimates, we obtain a lower bound on the overall confidence in our classifications (see Table 1).
Abbreviations
SNP: Single nucleotide polymorphism; PNG: Papua New Guinea; LDR: ligase detection reaction; FMA: fluorescent microsphere assay; Pf: Plasmodium falciparum; dhfr: dihydrofolate reductase; dhps: dihydropterate synthetase; pfcrt: P. falciparum chloroquine resistance transporter; IRBC: infected red blood cell; MFI: median fluorescence intensity; Chr1SNP: PlasmoDB SNP Identifier CombinedSNP.MAL1.1085; Chr7SNP: PlasmoDB SNP Identifier CombinedSNP.MAL7.5506; Chr8SNP: PlasmoDB SNP Identifier CombinedSNP.MAL8.6181; Chr9SNP: PlasmoDB SNP Identifier CombinedSNP.MAL9.4825; Chr13SNP: PlasmoDB SNP Identifier CombinedSNP.MAL13.6337.
Authors' contributions
PJT developed the mathematical and data analysis framework and advised DPK on implementation. DPK and PJT implemented the algorithms and performed the data analysis. DPK and JD performed the laboratory work and collected the LDR/FMA genotyping data. PAZ developed the PCRbased diagnostic assays. PAZ and PJT jointly supervised the overall project. JD, DPK, PAZ and PJT wrote the manuscript. All authors read and approved the final manuscript.
Acknowledgements
We thank David McNamara and Eric Carnevale for technical assistance during this study. We thank Drs. Sunil Rao, Thomas LaFramboise and Charles King for helpful comments and criticisms of the data transformation and interpretations. This study could not have been performed without the willing participation of the Wosera community and dedicated PNGIMR field study team. Support for this study was provided by grants from the National Institutes of Health (AI52312), Fogarty International Center (TW007872), and the National Science Foundation, Undergraduate Research at the Interface of Mathematics and Biology Program (DUE 0634612), and NSF DMS0720142. The funding organizations had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. PJT acknowledges research support from the Oberlin College Libraries.
References

Yang S, Rothman RE: PCRbased diagnostics for infectious diseases: uses limitations, and future applications in acutecare settings.
Lancet Infect Dis 2004, 4:337348. PubMed Abstract  Publisher Full Text

Takala SL, Smith DL, Stine OC, Coulibaly D, Thera MA, Doumbo OK, Plowe CV: A highthroughput method for quantifying alleles and haplotypes of the malaria vaccine candidate Plasmodium falciparum merozoite surface protein1 19 kDa.
Malar J 2006, 5:31. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Carnevale EP, Kouri D, DaRe JT, McNamara DT, Mueller I, Zimmerman PA: A multiplex ligase detection reactionfluorescent microsphere assay for simultaneous detection of single nucleotide polymorphisms associated with Plasmodium falciparum drug resistance.
J Clin Microbiol 2007, 45:752761. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Moorhead M, Hardenbol P, Siddiqui F, Falkowski M, Bruckner C, Ireland J, Jones HB, Jain M, Willis TD, Faham M: Optimal genotype determination in highly multiplexed SNP data.
Eur J Hum Genet 2006, 14:207215. PubMed Abstract  Publisher Full Text

Peiffer DA, Le JM, Steemers FJ, Chang W, Jenniges T, Garcia F, Haden K, Li J, Shaw CA, Belmont J, et al.: Highresolution genomic profiling of chromosomal aberrations using Infinium wholegenome genotyping.
Genome Res 2006, 16:11361148. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Plagnol V, Cooper JD, Todd JA, Clayton DG: A method to address differential bias in genotyping in largescale association studies.
PLoS Genet 2007, 3:e74. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Teo YY, Inouye M, Small KS, Gwilliam R, Deloukas P, Kwiatkowski DP, Clark TG: A genotype calling algorithm for the Illumina BeadArray platform.
Bioinformatics 2007, 23:27412746. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Greenhouse B, Dokomajilar C, Hubbard A, Rosenthal PJ, Dorsey G: Impact of transmission intensity on the accuracy of genotyping to distinguish recrudescence from new infection in antimalarial clinical trials.
Antimicrob Agents Chemother 2007, 51:30963103. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sibley CH, Ringwald P: A database of antimalarial drug resistance.
Malar J 2006., 5(48) PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Trape JF: The public health impact of chloroquine resistance in Africa.
Am J Trop Med Hyg 2001, 64:1217. PubMed Abstract  Publisher Full Text

Binder H, Preibisch S, Kirsten T: Base pair interactions and hybridization isotherms of matched and mismatched oligonucleotide probes on microarrays.
Langmuir 2005, 21:92879302. PubMed Abstract  Publisher Full Text

Wasserman L: All of Statistics: A Concise Course in Statistical Inference. 1st edition. New York: Springer; 2004.

Aurrecoechea C, Brestelli J, Brunk BP, Dommer J, Fischer S, Gajria B, Gao X, Gingle A, Grant G, Harb OS, et al.: PlasmoDB: a functional genomic database for malaria parasites.
Nucleic Acids Res 2009, 37:D539543. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sibley CH, Hyde JE, Sims PF, Plowe CV, Kublin JG, Mberu EK, Cowman AF, Winstanley PA, Watkins WM, Nzila AM: Pyrimethaminesulfadoxine resistance in Plasmodium falciparum: what next?
Trends Parasitol 2001, 17:582588. PubMed Abstract  Publisher Full Text

NzilaMounda A, Mberu EK, Sibley CH, Plowe CV, Winstanley PA, Watkins WM: Kenyan Plasmodium falciparum field isolates: correlation between pyrimethamine and chlorcycloguanil activity in vitro and point mutations in the dihydrofolate reductase domain.
Antimicrob Agents Chemother 1998, 42:164169. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wilairatana P, Kyle DE, Looareesuwan S, Chinwongprom K, Amradee S, White NJ, Watkins WM: Poor efficacy of antimalarial biguanidedapsone combinations in the treatment of acute uncomplicated, falciparum malaria in Thailand.
Ann Trop Med Parasitol 1997, 91:125132. PubMed Abstract  Publisher Full Text

Liu H, Li S, Wang Z, Ji M, Nie L, He N: Highthroughput SNP genotyping based on solidphase PCR on magnetic nanoparticles with dualcolor hybridization.
J Biotechnol 2007, 131:217222. PubMed Abstract  Publisher Full Text

Takatsu K, Yokomaku T, Kurata S, Kanagawa T: A FRETbased analysis of SNPs without fluorescent probes.
Nucleic Acids Res 2004, 32:e156. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Troyer RM, Lalonde MS, Fraundorf E, Demers KR, Kyeyune F, Mugyenyi P, Syed A, Whalen CC, Bajunirwe F, Arts EJ: A radiolabeled oligonucleotide ligation assay demonstrates the high frequency of nevirapine resistance mutations in HIV type 1 quasispecies of NVPtreated and untreated motherinfant pairs from Uganda.
AIDS Res Hum Retroviruses 2008, 24:235250. PubMed Abstract  Publisher Full Text

Daniels R, Volkman SK, Milner DA, Mahesh N, Neafsey DE, Park DJ, Rosen D, Angelino E, Sabeti PC, Wirth DF, Wiegand RC: A general SNPbased molecular barcode for Plasmodium falciparum identification and tracking.
Malar J 2008, 7:223. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Li X, Foulkes AS, Yucel RM, Rich SM: An expectation maximization approach to estimate malaria haplotype frequencies in multiply infected children.
Stat Appl Genet Mol Biol 2007., 6
Article33
PubMed Abstract  Publisher Full Text 
Genton B, alYaman F, Beck HP, Hii J, Mellor S, Narara A, Gibson N, Smith T, Alpers MP: The epidemiology of malaria in the Wosera area East Sepik Province Papua New Guinea in preparation for vaccine trials. I. Malariometric indices and immunity.
Ann Trop Med Parasitol 1995, 89:359376. PubMed Abstract

Kasehagen LJ, Mueller I, McNamara DT, Bockarie MJ, Kiniboro B, Rare L, Lorry K, Kastens W, Reeder JC, Kazura JW, Zimmerman PA: Changing patterns of Plasmodium bloodstage infections in the Wosera region of Papua New Guinea monitored by light microscopy and high throughput PCR diagnosis.
Am J Trop Med Hyg 2006, 75:588596. PubMed Abstract  Publisher Full Text

Lysenko A, Semashko I: Geography of malaria: A medicogeographic profile of an ancient disease. Moscow: Academy of Sciences USSR; 1968.

Snow RW, Guerra CA, Noor AM, Myint HY, Hay SI: The global distribution of clinical episodes of Plasmodium falciparum malaria.
Nature 2005, 434:214217. PubMed Abstract  Publisher Full Text

McNamara DT, Kasehagen LJ, Grimberg BT, ColeTobian J, Collins WE, Zimmerman PA: Diagnosing infection levels of four human malaria parasite species by a polymerase chain reaction/ligase detection reaction fluorescent microspherebased assay.
Am J Trop Med Hyg 2006, 74:413421. PubMed Abstract  Publisher Full Text