Abstract
Background
The advent of next generation sequencing technology has accelerated efforts to map and catalogue copy number variation (CNV) in genomes of important microorganisms for public health. A typical analysis of the sequence data involves mapping reads onto a reference genome, calculating the respective coverage, and detecting regions with toolow or toohigh coverage (deletions and amplifications, respectively). Current CNV detection methods rely on statistical assumptions (e.g., a Poisson model) that may not hold in general, or require finetuning the underlying algorithms to detect known hits. We propose a new CNV detection methodology based on two Poisson hierarchical models, the PoissonGamma and PoissonLognormal, with the advantage of being sufficiently flexible to describe different data patterns, whilst robust against deviations from the often assumed Poisson model.
Results
Using sequence coverage data of 7 Plasmodium falciparum malaria genomes (3D7 reference strain, HB3, DD2, 7G8, GB4, OX005, and OX006), we showed that empirical coverage distributions are intrinsically asymmetric and overdispersed in relation to the Poisson model. We also demonstrated a low baseline false positive rate for the proposed methodology using 3D7 resequencing data and simulation. When applied to the nonreference isolate data, our approach detected known CNV hits, including an amplification of the PfMDR1 locus in DD2 and a large deletion in the CLAG3.2 gene in GB4, and putative novel CNV regions. When compared to the recently available FREEC and cn.MOPS approaches, our findings were more concordant with putative hits from the highest quality array data for the 7G8 and GB4 isolates.
Conclusions
In summary, the proposed methodology brings an increase in flexibility, robustness, accuracy and statistical rigour to CNV detection using sequence coverage data.
Background
Recent genome research have highlighted the role of structural variants on natural phenotypic variations with vital importance for human health [1,2]. The advent of massively parallel sequencing technologies has resulted to a drastic cost reduction per megabase of DNA sequence, and is leading to unprecedented genomic resolution and large sample size applications. In a single run, these technologies are able to generate millions of DNA fragments (reads) from a target genome, which are then mapped onto a reference genome when available or undergo de novo assembly. The resulting mapped data with potentially high coverage is the core of structural variant detection, and several methods have been recently proposed depending on the type of polymorphism to be identified, as recently reviewed by Medvedev et al. [3].
The present work considers the detection of copy number variations (CNVs), such as deletions and amplifications, using sequence coverage data when mapped onto a reference genome. In theory, deletions are detected in regions with extremely low coverage whereas amplifications are typically located in regions with exceptionally high coverage [3]. The common strategy to analyse sequence coverage data is to divide the reference genome into nonoverlapping windows (or bins) of a given size [36]. Since the GC content is known to influence the resulting coverage distribution [79], the windows are further subdivided according to this genomic parameter and analysed separately. Finally, appropriate detection limits are calculated. In this regard, there are two main approaches to determine these thresholds. One approach is to assume a Poisson distribution for coverage when there is no copy number variation, as invoked by the EWT method [4]. In theory, this assumption entails an equality between the mean and the variance of the coverage distribution. However, there is a growing number of data sets whose variance of the coverage distribution is clearly greater than the mean coverage [4,10]. This statistical “overdispersion” implies that, at a given statistical significance level, any Poissonbased method tends to detect a higher number of CNVs in comparison to a situation where overdispersion is deemed an intrinsic property of the data, thereby increasing the false positive rate. Recently, the cn.MOPS approach has been proposed, where the analysis is performed across samples and the resulting coverage distribution of a given window (across samples) is modelled through a finite mixture of Poisson distributions [6]. However, within a window, this modelling approach reverts to the common Poisson distribution (with different parameters along the different segments comprising the genome) if there is no CNV present. Another approach assumes a proportionality between the underlying copy number and the median coverage after being adjusted for the underlying GC content, and smoothed by an appropriate segmentation/aggregation algorithm, as available in the FREEC software [5]. Notwithstanding its high computational efficiency, this software critically relies on the analyst to parameterise the underlying segmentation algorithm. If prior information is available from the samples under analysis, one can set key parameters tentatively until obtaining results in line with previous findings. This finetuning exercise becomes extremely time consuming in a high throughput setting, especially as expected differences in the patterns of data between samples cannot be captured by a single parameter set. Alternative methodologies are then required with the potential of being much more generalisable and applicable to a high throughput setting.
To improve current approaches for CNV detection, we propose a new methodology based on a Poisson hierarchical modelling approach. Our data analysis strategy is now outlined. First, we assume a Poisson distribution for coverage when there is no copy number variation, as previously done in EWT [4] and cn.MOPS [6]. We then extend this distribution to an overdispersion setting by allowing the respective rate parameter to vary according to a Gamma or a Lognormal distribution. The resulting models of this hierarchical structure are the PoissonGamma (also known as the Negative Binomial) and PoissonLognormal, respectively. In this way, different data patterns under the hypothesis of no CNVs can be captured due to the great flexibility of these secondlevel distributions. We adjust the results for the GC content as implemented elsewhere, i.e., we divide the reference genome into nonoverlapping windows and analyse those with similar GC content separately. The formal CNV detection is based on highest posterior density (HPD) credible intervals associated with the posterior predictive distribution for coverage. Hence the stringency of our method is controlled by the credibility level — hereafter denoted by the parameter γ — specified for the analysis. The final stage of the analysis encompasses merging contiguous hits and excluding putative deletions when the corresponding coverage per nucleotide position is greater than zero.
To assess the performance of our methodology, we use 7 publicly available Plasmodium falciparum (Pf, causes malaria) data sets. The choice of this realworld data set shows two major advantages. First, it includes the resequencing data of the 3D7 reference genome thus allowing the direct estimation of the baseline false positive rate. We also used this reference sample to perform a simulation study where data shows different read depth. Second, it encompasses data of 4 laboratory strains for which comparative genomic hybridization (CGH) array data is available, thus enabling ’external’ validation of the CNVs detected via sequencing data analysis. All results are compared to the FREEC and cn.MOPS softwares, two potentially promising approaches when applied to human genome data [6].
Results
Plasmodium falciparum genome data shows distinct random patterns for the underlying coverage distribution
Data under analysis comprises Pf samples of 5 wellcharacterised laboratory strains from different parts of the world (3D7  reference strain of African origin, HB3  Honduras, DD2  Indonesia, 7G8  Brazil, and GB4  Ghana) and of 2 clinical isolates from travellers to Africa (OX005  Ghana, and OX006  Kenya). Each sample consists of millions of reads mapped onto the 3D7 reference genome (version 3.0, 23Mb, 19% GC content, Table 1), which was divided into nonoverlapping 100bp windows and filtered for the respective exome (120,309 100bp windows in total). The resulting coverage distributions exhibit distinct shapes and summary statistics (Figure 1A and Table 1), thus the necessity of having flexible approaches for CNV detection. The percentage of windows with zero coverage or with coverage equal or greater than 500 reads is up to 0.51% (DD2), suggesting the presence of few CNVs in the samples.
Table 1. Statistical description of the coverage distributions using 100bp windows and after filtering the data for the Pf exome
Coverage distributions are intrinsically overdispersed, skewed, and longtailed
A brief description of the empirical coverage distributions led to two key observations. First, every coverage distribution is characterised by extreme overdispersion as the variance is greater than the mean in each sample (Table 1 and Figure 1B). This statistical property seems intrinsic to coverage data because of its observation in the 3D7 resequencing data, where just a few hits should be identified under the assumption of no errors in sequencing, mapping, and in the annotated genome. Second, these distributions are skewed and long tailed as suggested by the skewness and kurtosis coefficients (Additional file 1).
Additional file 1. Skewness and kurtosis of empirical coverage distributions.
Format: PDF Size: 667KB Download file
This file can be viewed with: Adobe Acrobat Reader
Coverage distributions are described well by a PoissonGamma model
To analyse the data, we devised a CNV detection strategy based on PoissonGamma and PoissonLognormal, two probability distributions known for their flexibility in tackling overdispersion. To estimate these models, we divided each coverage profile according to the respective GC content and analysed the corresponding data separately. The Poisson, PoissonLognormal and PoissonGamma distributions were then compared against each other being the latter the best model for the data irrespective of the criteria used (Additional file 2). According to this model, the expected coverage distributions are almost indistinguishable from the observed ones (Additional file 3). Therefore, the PoissonGamma distribution was used to determine robust CNV detection thresholds (shown in Additional file 4).
Additional file 2. Statistical model comparison between Poisson, PoissonGamma, and PoissonLognormal distributions. The Poisson and PoissonLognormal models were compared to the PoissonGamma using the Deviance Information Criteria (DIC) [30] and Bayes factors (BF). In the case of DIC, we calculated the ratio between that of the PoissonGamma and those of the remaining models. With respect to BF, they were estimated as the logratio between the corresponding predictive prior probabilities via the BICMC estimator [31].
Format: PDF Size: 898KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 3. Expected and empirical cumulative coverage distributions. Expected coverage distributions refer to the corresponding posterior predictive distributions for the set of all 100bp windows used in the analysis.
Format: PDF Size: 943KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 4. Limits for CNV detection used on each sample as function of the underlying GC content. CNV detection limits were determined according to the posterior predictive probability distribution of the PoissonGamma (the best model for every data set under analysis).
Format: PDF Size: 783KB Download file
This file can be viewed with: Adobe Acrobat Reader
The PoissonGamma approach shows a low baseline false positive rate
The baseline false positive rate of our method was first assessed through the analysis of the 3D7 resequencing data, where CNVs are known (e.g., GTP cyclohydrolase I gene, PFL1155w), and few hits should be produced. The corresponding data analysis led to a proportion of hits lower than the statistical stringency used in the analysis (e.g., 28 hits out of 120,309 100bp windows using γ=99.9%; see Table 2). Specifically, we obtained 11 windows with toolow coverage, being sparsely distributed across the genome. Conversely we found 17 windows with toohigh coverage, 13 of which are true positive hits located in the GTP cyclohydrolase I locus (Additional file 5), an amplified region already highlighted by CGH array technology [11]. When applied the FREEC software to the same data, there was a slightly lower proportion of hits (15/120,309) but this proportion can be increased by using alternative parameter settings. The crosssample cn.MOPS approach was applied to all isolates (3D7, HB3, DD2, 7G8, GB4, OX005, and OX006), leading to the highest proportion of hits in 3D7 (358/120,309). Unlike our approach and FREEC, cn.MOPS could only partially detect the amplification of the GTP cyclohydrolase I gene locus (11 hits out of 13 possible windows).
Table 2. Analysis of real and simulated 3D7 resequencing data
Additional file 5. A large amplification detected between PFL1125w and PFL1160w genes in the 3D7 reference genome data using the PoissonGamma model.
Format: PDF Size: 844KB Download file
This file can be viewed with: Adobe Acrobat Reader
We extended the above analysis by studying the influence of read depth on the false positive rate in a simulation study. Ten independent samples were generated for each of 3 read depths (10×, 20×, and 50×). The results showed that the false positive rate of our method is lower or in line with the corresponding statistical stringency adopted in the analysis (Table 2). For example, using γ=99.9%, the proportion of hits was 0.19%, 0.11%, and 0.05% for 10×, 20×, and 50× read depths, respectively. Moreover, our method was able to detect the amplified GTP cyclohydrolase I gene locus in every generated data set. Conversely the FREEC software produced no CNV calls in each simulated data set, even when parameterised to target a higher number of CNVs. The FREEC software seems then less sensitive when using low read depth data and there are just a few hits to be detected. We applied cn.MOPS approach to the set of simulated samples with the same read depth and found a lower mean proportion of hits in comparison to our method running at γ=99%. However, when we increased stringency of our method (γ=99.9%), cn.MOPS was outcompeted. Finally, likewise FREEC, cn.MOPS could not identify the amplified GTP cyclohydrolase I gene locus in any of the generated samples.
The PoissonGamma modelling approach detects known and novel CNV regions
The analysis of the remaining laboratory and clinical samples led to a total number of hits ranging from 257 (OX006) to 899 (DD2) using γ=99.9% (Table 3). The PoissonGamma approach could detect a large amplification located between the PFL1125w and PFL1160w genes (Figure 2A–D), which has been previously identified using CGH technology [1113]. Another important CNV hit is the amplified region spanning the PFE1095w and PFE1160w genes in the Indonesian DD2 sample. This locus includes the PfMDR1 gene (PFE1150w) whose increased copy number is usually associated with multidrug resistance of Southeast Asian Pf parasites [1416]. Finally, a large deletion of the CLAG3.2 gene (PFC0110w) was also targeted in both Ghanaian samples (GB4 and OX005), a result in agreement with field reports from that country [17,18]. Other large hits can be found in Additional file 6.
Table 3. Summary of CNVs detected by the PoissonGamma model across different laboratory strains and clinical samples
Figure 2. Copy number variation between the PFL1125w and PFL1160w genes across different laboratory and clinical samples. A. HB3 (Honduras); B. DD2 (Indonesia); C. 7G8 (Brazil); D. GB4 (Ghana); E. OX005 (Ghana); F. OX006 (Kenya). Note that the prefix PFL was removed from the corresponding gene names as available at genedb database (http://www.genedb.org webcite).
Additional file 6. CNVs larger than 500 bp detected using the PoissonGamma model (γ=99%), the FREEC software, and cn.MOPS approach.
Format: PDF Size: 52KB Download file
This file can be viewed with: Adobe Acrobat Reader
Comparison with FREEC and cn.MOPS approaches
The FREEC and cn.MOPS approaches were applied to the same laboratory and clinical samples; see Additional file 6 for a list of hits identified by these alternative methods. Table 4 shows the number of shared and exclusive hits of FREEC and cn.MOPS in relation to our approach (see also Additional files 7 and 8). For FREEC, the proportion of shared hits varies with the sample under analysis, ranging from 27.4% (GB4) to 63.9% (DD2) for deletions and from 4.2% (OX005) to 84.3% (DD2) for amplifications (using γ= 99.9% in our method). The high proportion of shared amplifications in the Indonesian DD2 sample is mainly due to the strong signal derived from the PfMDR1 locus. Our method produced a higher number of exclusive hits, most of them in loci not targeted by FREEC, including a large amplification between PFL1125w and PFL1160c genes in HB3 lab strain previously validated by CGH technology [11]. Conversely the majority of the FREECexclusive hits, except those from the OX005 sample, are located in genomic regions targeted by our methodology, filling ’gaps’ or increasing the size of the putative CNVs (Additional file 7). The low frequency of FREECexclusive hits in these ’shared’ loci shows that little is missed by not including a formal merging algorithm in our method. In fact, the procedure of simply merging adjacent hits was enough to reduce significantly the number of hits produced by our method (Table 3).
Table 4. Hits shared and exclusively detected by the PoissonGamma (PG) model, the FREEC and the cn.MOPS approaches
Additional file 7. Comparison between hits detected by the PoissonGamma model and the FREEC software.
Format: PDF Size: 43KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 8. Ternary diagrams plotting the joint proportions of shared and exclusively detected hits by the PG model, the FREEC software, and cn.MOPS.
Format: PDF Size: 2.7MB Download file
This file can be viewed with: Adobe Acrobat Reader
In the case of cn.MOPS, the proportion of shared hits ranges from 21.4% (7G8) to 69.3% (OX005) for deletions and from 5.7% (OX005) to 85.2% (DD2) for amplifications (using γ=99.9% in our method). Likewise in the above comparison, the high proportion of shared amplifications in DD2 laboratory strain also results from hits in the PfMDR1 locus. With respect to exclusive hits, our method tends to produce a higher number of these than cn.MOPS for deletions, but a lower number for amplifications. This observation suggests that our method tends to define higher thresholds than cn.MOPS for deletions, but lower thresholds for amplifications. However, it is difficult to generalise this result due to differences in stringency definition employed by each method.
Finally, the FREEC software running under default settings could not detect a large amplification between PFL1125w and PFL1160c genes in HB3 isolate identified by our method (Figure 2A) and validated by CGH technology [11]. Similarly, the cn.MOPS approach could not detect the amplification of the GTP cyclohydrolase I gene in the 3D7 resequencing data. However, we could recover some of these false negatives, and possible others, using alternative (nondefault) parameter settings (results not shown). This result demonstrates the possibility of missing important CNVs by applying these approaches, specifically, when one needs to set them up ’blindly’ in the case of new isolates, large sample sizes, or when their underlying assumptions are not fully met by the data.
Validation of coveragebased hits using CGH array data
The validity of coveragebased hits produced by each methodology was assessed using published CGH data (Table 5). For HB3 and DD2 lab strains, where we used a previously compiled list of CGH hits [19], the FREEC software led to the highest concordance (92.9%) between coverage and CGHbased hits. This apparent best performance of FREEC is mainly due to a lesser number of hits produced by this software. The use of greater stringency in our method (i.e., γ=99.9%) led to an increase in the corresponding concordance, from 75.9% to 78.9% in HB3, and from 89% to 92.6% in DD2. For 7G8 and GB4 strains, where we reanalysed the original CGH data, our methodology outperformed FREEC and cn.MOPS overall and, in the best case scenario (γ=99.9% in our method), we could confirm 78.3% (7G8) and 54.0% (GB4) of our hits. The cn.MOPS approach seems inferior to our method and FREEC, with the lowest concordance rates irrespective of the lab strain and CNV type. This result may be explained by the nature of the data under analysis (i.e., far from being Poisson distributed when there is no CNV present) and the small sample size.
Table 5. Hits shared between CGH and coverage data using the PoissonGamma (PG) model, the FREEC software, and the cn.MOPS approach
Discussion
We have proposed a Poisson hierarchical modelling approach for CNV detection, which is flexible and robust to the common problem of overdispersed coverage data. Using simulation and resequencing data of the 3D7 reference genome, we have demonstrated a low baseline false positive rate of the methodology across different read depth. However, this low baseline false positive rate needs to be assessed in other genomic settings, preferably where reference resequencing data is available, or potentially using a robust simulation strategy with realistic statistical assumptions and parameter settings. In general, one can reduce the baseline false positive rate of any coveragebased method if mapping distance information is also taken into account. True positive hits are then likely to be those whose coverage and mapping distance analyses agree with each other. In particular, strong evidence for deletions is provided from genomic regions with toolow coverage and average mapping distance greater than expected, while amplified regions entail extremely high coverage and average mapping distance less than expected [3]. This integrated data analysis, whilst increasing robustness and accuracy, remains to be developed.
The proposed approach was also applied to nonreference strain data and identified a large number of CNVs that could be validated by CGH data. The empirical and simulation results have demonstrated that our approach may be applicable to larger genomes where read depths can be lower, or in settings where overdispersion is present [4,10]. Recent deep sequencing technologies are currently generating data with high coverage (>50fold on average) irrespective of the organism under study. Thus, we do not expect that the accuracy and robustness of our methodology shown here would change in these cuttingedge data sets with similar genomic coverage. In practice, the potentially high computational burden associated with larger genomes or strong overdispersion is surmountable by implementing parallel computing techniques.
Our method seemed to outperform FREEC and cn.MOPS approaches with respect to concordance of hits confirmed by CGH data for 7G8 and GB4 strains. However, a more accurate comparison was compromised by difficulties in relating stringency. The stringency of our method is controlled by the credibility level, a rigorous statistical parameter, but more difficult to be inferred in algorithms that do not consider a specific statistical model, as in the FREEC software. Notwithstanding this difficulty, we showed that increasing the stringency of our methodology led to a high concordance with the FREECbased hits. However, the FREEC software could only detect a known amplification at the GTP cyclohydrolase locus in HB3 [11] using alternative parameter setting. The dependency of findings on parameter setting is mitigated in our methodology as the analyst only needs to specify the underlying stringency.
The Poisson hierarchical modelling approach has the advantage of handling with different data patterns but, as it stands, cannot estimate the corresponding copy number. To overcome this limitation, one can invoke a proportionality between mean (or median) coverage and the underlying copy number, as assumed elsewhere [46]. In theory, this assumption requires modelling the coverage distributions through a finite mixture of PoissonGamma (or PoissonLognormal) distributions where each component is associated with a given copy number, as implemented in the cn.MOPS approach but using a Poisson mixture model. A Monte Carlo Markov Chain method with reversible jumps [20] can be applied to the corresponding Bayesian estimation. Whilst the greater computational overhead involved in applying this more complex model is surmountable, its utility would rely on having sufficient data to estimate the additional parameters. In our setting, the very low number of hits detected suggests that data might be too scarce to fit models with increased complexity. Our approach seems then the perfect balance between model complexity and data information, thus a potentially useful addition to bioinformatic toolkits used to identify CNVs from sequence coverage data.
Conclusions
We have developed a robust Poisson hierarchical modelling approach for CNV detection using sequence coverage data. When applied to the Pf genome, the method shows a low false positive rate in the 3D7 resequencing data, and is able to detect known and putative novel CNV regions. This promising result suggests the application of this approach to different settings, such as the human or other microorganism genomes. Because the approach was developed under a strong but flexible statistical framework, it brings increased statistical rigour and robustness into the problem of CNV detection. In addition, it will allow important extensions, such as the estimation of the underlying copy number.
Methods
Sequence data and processing
Data consists of 7 Pf genomes of which 5 are of wellcharacterised laboratory strains (3D7  reference strain of African origin, HB3  Honduras, DD2  Indonesia, 7G8  Brazil, and GB4  Ghana) and 2 are of clinical origin (OX005  Ghana, OX006  Kenya) [21]. The 3D7 reference strain data was used to assess the baseline false positive rate. All seven genomes underwent whole genome sequencing on the Illumina Genome Analyzer II (54/76base paired read) platform and processed as described elsewhere [21]. In brief, multiple alignment (bam) files were generated from fastq files of each whole genome sequencing data set after mapping the reads onto the 3D7 reference genome (version 3.0) using the smalt software (http://www.sanger.ac.uk/resources/software/smalt/ webcite). Using the toolkit SAMtools, poorly mapped reads were removed from the analysis. The corresponding raw data sets are publicly available from the European Bioinformatics Institute website (http://www.ebi.ac.uk webcite, SRA study ERP0000190) and the PlasmoDB database (http://www.plasmodb.org webcite).
Estimating coverage profiles
In each sample, calculation of coverage profile followed the usual procedure for human data [46]. The 3D7 reference genome was first partitioned into nonoverlapping and equalsize windows. The size of each window was set at 100 bp as it seemed a good compromise between sufficient resolution for CNV detection and reasonable statistical properties of the data. In each window, we calculated the respective coverage as the number of mapped reads using their starting mapped positions. Since coverage can be confounded by the GC content, we also calculated the underlying GC content profile using the FREEC software. Windows within 100kb of subtelomeric and centromeric regions, as well as windows with poor mapping score, were excluded from the analysis, since they could introduce bias due to putative mapping errors [22]. Windows associated with antigenic diversity gene families (including vars, stevors, and rifins) were also discarded owing to the fact that they are intrinsically variable [11]. We also filtered out noncoding regions, thus focusing the analysis on the Pf exome as on average there is twice as much coverage in coding regions [21]. After this filtering process, the coverage profile of each target genome comprises 120,309 100bp windows accounting for nearly 53% of the 3D7 reference genome. We adjusted our results for GC content by splitting each coverage profile into separate data sets comprising the coverage values of windows with similar GC content and analysing them separately [4,5]. Since the GC content distribution does not follow a Uniform distribution in the 3D7 reference genome [23], we divided the coverage profile according to the 5%centiles (5%, 10%, 15%,) of the GC content distribution as a way to ensure similar statistical power across the different set of windows.
Detection of CNVs using a Poisson hierarchical modelling approach
When analysing each data set, we specified the following Multinomial distribution for the coverage values of windows with similar GC content
whereis the total number of windows with GC content g, n_{g,i} is the total number of windows with coverage i and GC content g, p_{g}(i) is the overall probability of mapping i reads onto any window with GC content g. The probabilities p_{g}(i) are usually described by a Poisson distribution as long as the reads are independently and equally distributed across the genome [4]. Because of overdispersion, we extended this model by varying its mean parameter according to another probability distribution. We chose the Gamma or Lognormal distributions to model the variation of the Poisson mean parameter due to their flexibility and successful application across different scientific fields [2427]. These two secondlevel distributions, when conjugated with the Poisson model, give rise to the socalled PoissonGamma (with parameters α and β) and PoissonLognormal (with parameters μ and σ^{2}), respectively. Mathematically, the PoissonGamma model is given by
where α and β are the shape and rate parameters of a Gamma distribution, respectively. The PoissonLognormal distribution does not show closedform expression but good numerical algorithms exist for its calculation when applied to a specific data set [28].
The estimation of these two models was performed through Bayesian methods using noninformative prior distributions for the respective parameters. With respect to the PoissonGamma, we used a Gamma prior distribution for the parameters α and β. The respective shape and scale parameters were set at 0.001, as often specified in Bayesian applications (see, for example, WinBUGS online documentation at http://www.mrcbsu.cam.ac.uk/bugs webcite). For the PoissonLognormal model, we set a Gaussian prior distribution with mean 0 and standard deviation 10^{4} for the parameter μ, and a Gamma prior distribution with the shape and scale parameters equal to 0.001 for the parameter 1/σ^{2}, respectively. To obtain posterior samples through parallel computing, we used WinBUGS (http://www.mrcbsu.cam.ac.uk/bugs webcite) and JAGS (http://www.mcmcjags.sourceforge.net/ webcite) coupled with the R software through the R2WinBUGS and R2JAGS packages [[29]. The respective R and WinBUGS/JAGS scripts are available from the authors upon request.
After obtaining the posterior parameter samples, the models were tested against each other using Bayes factors and the Deviance Information Criteria (DIC) [30]. With respect to Bayes factors, we calculated the underlying prior predictive probability (PPPs) of each model using the socalled BICMC estimator, which seems to provide robust and stable results under hierarchical modelling [31]. To this end, we generated posterior samples for the loglikelihood function required for the calculation of BICMC estimator. We performed this task in WinBUGS/JAGS for the PoissonGamma distribution since the corresponding likelihood function is known analytically. However, for the case of the PoissonLognormal model whose probability distribution has no closedform expression, we calculated the posterior samples of the loglikelihood function using subroutines available in the package PAM for the R software [27].
For the formal CNV detection, we used the corresponding posterior predictive distribution, which embodies all uncertainty regarding coverage given the observed data and prior information. The calculation was performed through the simulation of ’new’ coverage values according to the respective posterior parameter samples and the best model for the data. We then determined the corresponding HPD credible interval at γ=99% and 99.9% using ChenShao method [32] in order to set appropriate CNV detection limits. In particular, windows with coverage greater than the upper bound of HPD credible interval are likely to contain amplifications. Conversely windows with coverage lesser than the lower bound of HPD credible interval are deemed deletions. To reduce the false positive rate associated with deletions, we removed all putative hits whose coverage was greater than zero in every nucleotide position. Data of total coverage per position was obtained using the SAMtools (samtools.sourceforge.net). The final step of the analysis comprised merging contiguous hits. In theory, we could have introduced a more formal merging stage by adapting, for example, the popular Circular Binary Segmentation algorithm [33] for sequence coverage data. In practice, we did not intend ’forcing’ our method to generate larger genomic regions artificially. A recent study shows that simply merging contiguous hits is sufficient to generate a small number of loci in relation to the total number of hits [34]. Another reason for not including such formal procedure is due to the fact that its application is in rigour compromised while studying the exome (i.e., a ’fragmented’ version of a genome) of a given organism. In this case, formal merging can only be performed at the gene level.
Detection of CNVs using FREEC and cn.MOPS softwares
There are several CNV detection methods currently available in the literature [6]. We chose to compare our methodology against FREEC and the cn.MOPS, two potentially promising approaches when applied to human genome data [6]. However, the performance of the crosssample cn.MOPS approach is likely to be impaired because: (1) even that this approach invokes a finite mixture of Poisson distributions for the read counts of a given segment, it reverts to a Poisson distribution when there is no CNV present, a distribution that does not fully agree with our data; (2) since the corresponding analysis is performed across samples, the accuracy of the method is dependent on the sample size. In this regard, Klambauer et al. [6] suggested a minimum number of 6 samples for the cn.MOPS to work well. This minimum sample size seems rather low in comparison to those typically found in the literature of this type of modelling approach [35,36].
In general, the FREEC software divides the reference genome into nonoverlapping and equalsize windows, and calculates the corresponding coverage profile of the target sample. A polynomial regression model is then used to describe the dependency between coverage and GC content. The respective predicted values are first standardised and then smoothed out. The final stage of the analysis consists of estimating the copy number in each segment and merging the regions with similar copy number. With this purpose, the software assumes that the ploidy of the organism under analysis is known and the copy number of a given segment is proportional to the median coverage of all the windows with similar GC content. For the Pf sequence coverage data, we specified ploidy of 1 and used a quadratic regression model. We also analysed the data through a cubic regression model, as often suggested for human genome data [5], but obtained unrealistic copy number distributions (results not shown). The parameters of the segmentation algorithm were set at their default setting.
The cn.MOPS approach is also based on sequence coverage data partitioned into nonoverlapping windows. It assumes a finite mixture of Poisson distributions (with a known number of components) for the coverage across samples of any given window. In this approach, each component of the mixture describes the coverage distribution associated with a given copy number under the assumption of a linear relationship between mean coverage and copy number. The model is fitted to each segment via an EM algorithm and the most probable component determined. To apply this approach to our data, we set the copy number to be an integer from 0 to 4, where the value 1 is the ’normal’ copy number (or the ploidy of the organism under study). The remaining parameters were specified at their default settings as explained in the documentation of the respective R package (called cn.MOPS).
Simulation study based on 3D7 resequencing data
To assess the baseline false positive rate of our method, we performed a simulation study based on 3D7 resequencing data. We generated 10 independent data sets from the 3D7 resequencing sample according to read depths of 10 ×, 20×, and 50 ×, corresponding to a total of 1.25, 2.5, and 6 million reads, respectively. Each data set refers to the coverage profile of 120,309 100bp windows and was simulated according to a Multinomial distribution with a sample size given by the corresponding total number of reads associated with a specific read depth and probability vector defined by the relative coverage profile of the original 3D7 reference sample. We analysed each data set separately using our method and the FREEC software. In the former, we used the PoissonGamma (PG) distribution and two different credible levels (γ=99% and 99.9%), while in the latter we analysed the data using the default parameter setting; we tested alternative parameter sets but the corresponding results were qualitatively similar (not shown). To summarise the results, we calculated the mean percentage of detected hits in relation to the total number of windows under analysis. Finally, we analysed each batch of 10 independent samples of a given read depth altogether using the cn.MOPS software and compared the results to those obtained from the analysis of each individual sample.
Comparative genomic hybridisation array data
To assess the reliability of the coveragebased hits, we brought into the analysis available CGH data for the HB3, DD2, 7G8 and GB4 laboratory strains. In the first two strains, we used a precompiled list of CGH hits [19], identifying the corresponding 100bp windows used in the coverage analysis. In the last two strains, we reanalysed the original CGH data [37] accessible from NCBI’s Gene Expression Omnibus through GEO Series accession number [GEO:GSE25656]. Data refers to log2ratio between intensities of these two strains in relation to those obtained from the 3D7 reference strain. We used the software SnoopCGH as a visualisation tool [38]. After removing probes in regions not considered in our coverage data analysis, we applied the following strategy to detect CNVs: (1) obtain a segmented profile of each individual intensity data set using DNAcopy package for the R software, (2) divide each segmented intensity profile into nonoverlapping windows of size 100bp as done in the coverage data analysis, (3) calculate the corresponding empirical distribution of segmented intensities, and (4) compute the respective highest probability density interval with a given probability mass τ (say τ=95%), and (5) produce a CNV call whenever the intensity of a given window is not included in this interval. In this regard, a negative log2ratio intensity was considered to be indicative of a deletion while a positive log2ratio was deemed a putative amplification. For each pairwise comparison, concordance rates were calculated for each data set and defined as the number of CNV hits identified by coverage analysis and confirmed by the CGH technology divided by the total number of coveragebased hits.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
AP, SGC and TGC conceived the project. NS developed the Poisson hierarchical modelling approach and wrote the first draft of the manuscript, and modifications were performed by AP, CJS, SGC and TGC. NS, SAA and TGC performed the data analysis. CJS and SGC obtained and processed the clinical samples. All authors have read and approved the final version of the manuscript.
Acknowledgements
This work was partially funded by a grant from the Foundation for the National Institutes of Health (grant ref. #566), the Wellcome Trust (grant ref. 077383/Z/05/Z) through the Grand Challenges in Global Health Initiative, and Fundação para a Ciência e Tecnologia through the project PestOE/MAT/UI0006/2011. TC is funded by the UK Medical Research Council and Wellcome Trust. CJS is supported by the UK Health Protection Agency. AP is supported by his faculty baseline funding from KAUST. We thank the Kwiatkowski group at the Wellcome Trust Sanger Institute for putting the raw sequence data into the public domain. We also thank Valentina Boeva for the advices using the FREEC software, Mark Preston for commenting the manuscript, Michael Bretscher for calling our attention to the JAGS software, and Francesc Coll for some useful references.
References

Zhang F, Gu W, Hurles ME, Lupski JR: Copy number variation in human health, disease, and evolution.

Stankiewicz P, Lupski JR: Structural variation in the human genome and its role in disease.

Medvedev P, Stanciu M, Brudno M: Computational methods for discovering structural variation with nextgeneration sequencing.

Yoon S, Xuan Z, Makarov V, Ye K, Sebat J: Sensitive and accurate detection of copy number variants using read depth of coverage.

Boeva V, Zinovyev A, Bleakley K, Vert JP, JanoueixLerosey I, Delattre O, Barillot E: Controlfree calling of copy number alterations in deepsequencing data using GCcontent normalization.

Klambauer G, Schwarzbauer K, Mayr A, Clevert DA, Mitterecker A, Bodenhofer U, Hochreiter S: cn.MOPS: mixture of Poissons for discovering copy number variations in nextgeneration sequencing data with a low false discovery rate.

Quail MA, Kozarewa I, Smith F, Scally A, Stephens PJ, Durbin R, Swerdlow H, Turner DJ: A large genome center’s improvements to the Illumina sequencing system.

Dohm JC, Lottaz C, Borodina T, Himmelbauer H: Substantial biases in ultrashort read data sets from highthroughput DNA sequencing.

Cheung MS, Down TA, Latorre I, Ahringer J: Systematic bias in highthroughput sequencing data and its correction by BEADS.

Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL: Accurate whole human genome sequencing using reversible terminator chemistry.

Kidgell C, Volkman SK, Daily J, Borevitz JO, Plouffe D, Zhou Y, Johnson JR, Roch KL, Sarr O, Ndir O, Mboup S, Batalov S, Wirth DF, Winzeler EA: A systematic map of genetic variation in Plasmodium falciparum.

Ribacke U, Mok BW, Wirta V, Normark J, Lundeberg J, Kironde F, Egwang TG, Nilsson P, Wahlgren M: Genome wide gene amplifications and deletions in Plasmodium falciparum.

Cheeseman IH, GomezEscobar N, Carret CK, Ivens A, Stewart LB, Tetteh KKA, Conway DJ: Gene copy number variation throughout the Plasmodium falciparum genome.

Price RN, Uhlemann AC, Brockman A, McGready R, Ashley E, Phaipun L, Patel R, Laing K, Looareesuwan S, White NJ, Nosten F, Krishna S: Mefloquine resistance in Plasmodium falciparum and increased pfmdr1 gene copy number.

Sidhu ABS, Uhlemann AC, Valderramos SG, Valderramos JC, Krishna S, Fidock DA: Decreasing pfmdr1 copy number in Plasmodium falciparum malaria heightens susceptibility to mefloquine, lumefantrine, halofantrine, quinine, and artemisinin.

Lim P, Alker AP, Khim N, Shah NK, Incardona S, Doung S, Yi P, Bouth DM, Bouchier C, Puijalon OM, Meshnick SR, Wongsrichanalai C, Fandeur T, Bras JL, Ringwald P, Ariey F: Pfmdr1 copy number and arteminisin derivatives combination therapy failure in falciparum malaria in Cambodia.

Chung WY, Gardiner DL, Anderson KA, Hyland CA, Kemp DJ, Trenholme KR: The CLAG/RhopH1 locus on chromosome 3 of Plasmodium falciparum: two genes or two alleles of the same gene?

Iriko H, Kaneko O, Otsuki H, Tsuboi T, Su XZ, Tanabe K, Torii M: Diversity and evolution of the rhoph1/clag multigene family of Plasmodium falciparum.

Samarakoon U, Gonzales JM, Patel JJ, Tan A, Checkley L, Ferdig MT: The landscape of inherited and de novo copy number variants in a Plasmodium falciparum genetic cross.

Green PJ: Reversible jump Markov chain Monte Carlo computation and Bayesian model determination.

Robinson T, Campino SG, Auburn S, Assefa SA, Polley SD, Manske M, Macinnis B, Rockett KA, Maslen GL, Sanders M, Quail MA, Chiodini PL, Kwiatkowski DP, Clark TG, Sutherland CJ: Drugresistant genotypes and multiclonality in Plasmodium falciparum analysed by direct genome sequencing from peripheral blood of malaria patients.

Campbell PJ, Stephens PJ, Pleasance ED, O’Meara S, Li H, Santarius T, Stebbings LA, Leroy C, Edkins S, Hardy C, Teague JW, Menzies A, Goodhead I, Turner DJ, Clee CM, Quail MA, Cox A, Brown C, Durbin R, Hurles ME, Edwards PAW, Bignell GR, Stratton MR, Futreal PA: Identification of somatically acquired rearrangements in cancer using genomewide massively parallel pairedend sequencing.

Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, Carlton JM, Pain A, Nelson K E: Genome sequence of the human malaria parasite Plasmodium falciparum.

Fisher RA, Corbet AS, Williams CB: The relation between the number of species and the number of individuals in a random sample of an animal population.

Thygesen HH, Zwinderman AH: Modeling Sage data with a truncated GammaPoisson model.

Sepúlveda N, Paulino CD, Carneiro J: Estimation of Tcell repertoire diversity and clonal size distribution by Poisson abundance models.

Bulmer M: On fitting the Poisson Lognormal distribution to speciesabundance data.

Sturtz S, Ligges U, Gelman A: R2WinBUGS: a package for running WinBUGS from R.

Spiegelhalter D, Best N, Carlin B, van der Linden A: Bayesian measures of model complexity and fit (with discussion).

Raftery A, Newton MA, Satagopan SM, Krivitsky PN: Estimating the integrated likelihood via posterior simulation using the harmonic mean identity (with discussion). In Bayesian Statistics 8. Edited by Bernardo JM, Bayarri MJ, Berger JO, Dawid AP, Heckerman D, Smith AFM, West M. Oxford: Oxford University Press; 2007:371416.

Chen MH, Shao QM: Monte Carlo estimation of Bayesian credible and HPD intervals.

Olshen AB, Venkatraman ES: Circular binary segmentation for the analysis of arraybased DNA copy number data.

Valsesia A, Stevenson BJ, Waterworth D, Mooser V, Vollenweider P, Waeber G, Jongeneel CV, Beckmann JS, Kutalik Z, Bergmann S: Identification and validation of copy number variants using SNP genotyping arrays from a large clinical cohort.

Acuna JD, Munoz MA: Mixture analysis in biology: scope and limits.

Munoz MA, Acuna JD: Sample size requirements of a mixture analysis method with applications in systematic biology.

Jiang H, Li N, Gopalan V, Zilversmit MM, Varma S, Nagarajan V, Li J, Mu J, Hayton K, Henschen B, Yi M, Stephens R, McVean G, Awadalla P, Wellems TE, zhuan Su X: High recombination rates and hotspots in a Plasmodium falciparum genetic cross.

AlmagroGarcia J, Manske M, Carret C, Campino S, Auburn S, Macinnis BL, Maslen G, Pain A, Newbold CI, Kwiatkowski DP, Clark TG: SnoopCGH: software for visualizing comparative genomic hybridization data.