London School of Hygiene and Tropical Medicine, London, UK

Center of Statistics and Applications, University of Lisbon, Lisbon, Portugal

Wellcome Trust Sanger Institute, Hinxton, UK

Department of Clinical Parasitology, Hospital for Tropical Diseases, London, UK

King Abdullah University of Science and Technology, Thuwal, Saudi Arabia

Abstract

Background

The advent of next generation sequencing technology has accelerated efforts to map and catalogue copy number variation (CNV) in genomes of important micro-organisms for public health. A typical analysis of the sequence data involves mapping reads onto a reference genome, calculating the respective coverage, and detecting regions with too-low or too-high coverage (deletions and amplifications, respectively). Current CNV detection methods rely on statistical assumptions (

Results

Using sequence coverage data of 7

Conclusions

In summary, the proposed methodology brings an increase in flexibility, robustness, accuracy and statistical rigour to CNV detection using sequence coverage data.

Background

Recent genome research has highlighted the role of structural variants in natural phenotypic variation of vital importance for human health

The present work considers the detection of copy number variations (CNVs), such as deletions and amplifications, using sequence coverage data when mapped onto a reference genome. In theory, deletions are detected in regions with extremely low coverage whereas amplifications are typically located in regions with exceptionally high coverage

To improve current approaches for CNV detection, we propose a new methodology based on a Poisson hierarchical modelling approach. Our data analysis strategy is now outlined. First, we assume a Poisson distribution for coverage when there is no copy number variation, as previously done in EWT

To assess the performance of our methodology, we use 7 publicly available

Results

Data under analysis comprises

| Sample | Origin | Number of reads | Coverage mean | Coverage variance | Coverage range | Coverage =0 | Coverage ≤10 | Coverage ≥250 | Coverage ≥500 |
|---|---|---|---|---|---|---|---|---|---|
| 3D7 | Africa | 19,590,258 | 162.8 | 767.4 | 0–794 | 1 | 3 | 25 | 14 |
| HB3 | Honduras | 14,024,161 | 116.6 | 585.6 | 0–449 | 188 | 262 | 23 | 0 |
| DD2 | Indonesia | 21,080,366 | 175.2 | 1861.9 | 0–749 | 139 | 214 | 873 | 470 |
| 7G8 | Brazil | 13,736,522 | 114.2 | 2141.2 | 0–794 | 188 | 1419 | 365 | 29 |
| GB4 | Ghana | 17,157,171 | 142.6 | 2087.3 | 0–955 | 151 | 540 | 274 | 7 |
| OX005 | Ghana | 17,214,916 | 143.1 | 4387.9 | 0–1386 | 187 | 308 | 7691 | 109 |
| OX006 | Kenya | 20,850,309 | 173.3 | 1072.1 | 0–733 | 46 | 102 | 656 | 7 |


**Empirical coverage distributions are intrinsically overdispersed and skewed.** **A**. Observed coverage distributions. **B**. Overdispersion defined as the ratio between coverage mean and variance (see also Table

Coverage distributions are intrinsically overdispersed, skewed, and long-tailed

A brief description of the empirical coverage distributions led to two key observations. First, every coverage distribution is characterised by extreme overdispersion as the variance is greater than the mean in each sample (Table
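The first observation can be checked directly from the mean and variance values reported above for each sample. The sketch below assumes the variance-to-mean ratio (often called the index of dispersion) as the summary, for which a Poisson distribution would give a value of 1; the dictionary and function names are illustrative only.

```python
# Mean and variance of coverage per sample, copied from the values reported above.
coverage_summary = {
    "3D7": (162.8, 767.4),
    "HB3": (116.6, 585.6),
    "DD2": (175.2, 1861.9),
    "7G8": (114.2, 2141.2),
    "GB4": (142.6, 2087.3),
    "OX005": (143.1, 4387.9),
    "OX006": (173.3, 1072.1),
}


def dispersion_index(mean, variance):
    """Variance-to-mean ratio; equals 1 under a Poisson distribution."""
    return variance / mean


indices = {s: dispersion_index(m, v) for s, (m, v) in coverage_summary.items()}
for sample, value in indices.items():
    print(f"{sample}: variance/mean = {value:.1f}")
```

Every ratio is well above 1, which is the sense in which a plain Poisson model is too restrictive for these data.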

**Skewness and kurtosis of empirical coverage distributions.**


Coverage distributions are described well by a Poisson-Gamma model

To analyse the data, we devised a CNV detection strategy based on the Poisson-Gamma and Poisson-Lognormal models, two probability distributions known for their flexibility in tackling overdispersion. To estimate these models, we divided each coverage profile according to the respective GC content and analysed the corresponding data separately. The Poisson, Poisson-Lognormal and Poisson-Gamma distributions were then compared against each other, with the Poisson-Gamma proving the best model for the data irrespective of the criterion used (Additional file

**Statistical model comparison between Poisson, Poisson-Gamma, and Poisson-Lognormal distributions.** The Poisson and Poisson-Lognormal models were compared to the Poisson-Gamma using the Deviance Information Criterion (DIC)


**Expected and empirical cumulative coverage distributions.** Expected coverage distributions refer to the corresponding posterior predictive distributions for the set of all 100-bp windows used in the analysis.


**Limits for CNV detection used on each sample as function of the underlying GC content.** CNV detection limits were determined according to the posterior predictive probability distribution of the Poisson-Gamma (the best model for every data set under analysis).


The Poisson-Gamma approach shows a low baseline false positive rate

The baseline false positive rate of our method was first assessed through the analysis of the 3D7 resequencing data, where CNVs are known (

| Method | Overall hits: 10×^{*} | Overall hits: 20×^{*} | Overall hits: 50×^{*} | Overall hits: real data | PFL1155w locus: 10×^{*} | PFL1155w locus: 20×^{*} | PFL1155w locus: 50×^{*} | PFL1155w locus: real data |
|---|---|---|---|---|---|---|---|---|
| PG with | 0.82% | 0.58% | 0.36% | 0.08% | 13 | 13 | 13 | 13 |
| PG with | 0.19% | 0.11% | 0.05% | 0.02% | 13 | 13 | 13 | 13 |
| FREEC | 0.00% | 0.00% | 0.00% | 0.01% | 0 | 0 | 0 | 13 |
| cn.MOPS^{+} | 0.55% | 0.16% | 0.01% | 0.30% | 0 | 0 | 0 | 11 |

Results refer to the (mean) percentage of overall hits detected in relation to the total number of 100-bp windows (120,309) using real and simulated data from the 3D7 resequencing sample, and to the corresponding (average) number of 100-bp hits detected on the GTP cyclohydrolase I gene locus (PFL1155w, 1.3 kb in size).

^{*}Results based on 10 independent simulated data sets.

^{+}Analysis performed across samples (simulated or real where appropriate).

**A large amplification detected between PFL1125w and PFL1160w genes in the 3D7 reference genome data using the Poisson-Gamma model.**


We extended the above analysis by studying the influence of read depth on the false positive rate in a simulation study. Ten independent samples were generated for each of 3 read depths (10×, 20×, and 50×). The results showed that the false positive rate of our method is lower than, or in line with, the corresponding statistical stringency adopted in the analysis (Table

The Poisson-Gamma modelling approach detects known and novel CNV regions

The analysis of the remaining laboratory and clinical samples led to a total number of hits ranging from 257 (OX006) to 899 (DD2) using

| Sample | Type of CNV | γ=99%: # hits | γ=99%: # loci | γ=99%: # Gene | γ=99.9%: # hits | γ=99.9%: # loci | γ=99.9%: # Gene | Largest CNV (kb) |
|---|---|---|---|---|---|---|---|---|
| HB3 | Deletion | 322 | 109 | 60 | 305 | 101 | 56 | PFI1475 (2.0) |
| | Amplification | 246 | 206 | 119 | 60 | 53 | 46 | PF11_0503 (0.6) |
| DD2 | Deletion | 279 | 98 | 58 | 265 | 95 | 55 | PFL2550w (1.7) |
| | Amplification | 678 | 84 | 63 | 634 | 59 | 43 | PFE1120w (14.8) |
| 7G8 | Deletion | 243 | 125 | 83 | 205 | 101 | 61 | MAL7P1.64 (1.1) |
| | Amplification | 343 | 118 | 106 | 215 | 49 | 37 | PFL1130w (6.7) |
| GB4 | Deletion | 262 | 98 | 48 | 253 | 92 | 45 | PFC0110w (2.5) |
| | Amplification | 108 | 84 | 79 | 47 | 38 | 36 | PFL1155w (0.6) |
| OX005 | Deletion | 308 | 87 | 49 | 274 | 73 | 39 | PFC0110w (2.8) |
| | Amplification | 1019 | 772 | 516 | 192 | 140 | 118 | PFD0669c (1.0) |
| OX006 | Deletion | 170 | 65 | 35 | 167 | 62 | 33 | PF07_0013 (1.3) |
| | Amplification | 277 | 226 | 188 | 90 | 70 | 64 | MAL8P1.42 (1.1) |

Results refer to the number of individual hits (i.e., 100-bp windows) and loci (pooled hits where contiguous) using the credible levels
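The pooling of contiguous hits into loci can be sketched as follows. This is a minimal illustration that assumes hits are given as integer indices of 100-bp windows along a single chromosome; the function name and the example indices are hypothetical and are not taken from the analysis.

```python
def pool_contiguous_hits(window_indices):
    """Group window indices into loci, a locus being a run of consecutive indices."""
    loci = []
    for idx in sorted(set(window_indices)):
        if loci and idx == loci[-1][-1] + 1:
            loci[-1].append(idx)   # extends the current locus
        else:
            loci.append([idx])     # starts a new locus
    return loci


# Hypothetical hit positions (window indices), for illustration only.
hits = [12, 13, 14, 40, 41, 97]
loci = pool_contiguous_hits(hits)
largest_kb = max(len(locus) for locus in loci) * 100 / 1000  # 100-bp windows
print(len(hits), "hits pooled into", len(loci), "loci; largest spans", largest_kb, "kb")
```

Counting loci rather than hits avoids a single long CNV being reported as many separate events.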


**Copy number variation between the PFL1125w and PFL1160w genes across different laboratory and clinical samples.** **A**. HB3 (Honduras); **B**. DD2 (Indonesia); **C**. 7G8 (Brazil); **D**. GB4 (Ghana); **E**. OX005 (Ghana); **F**. OX006 (Kenya). Note that the prefix PFL was removed from the corresponding gene names as available at genedb database (

**CNVs larger than 500 bp detected using the Poisson-Gamma model (γ=99%**


Comparison with FREEC and cn.MOPS approaches

The FREEC and cn.MOPS approaches were applied to the same laboratory and clinical samples; see Additional file

| CNV | Sample | PG with γ: shared (PG-FREEC) | PG with γ: exclusive (PG) | PG with γ: exclusive (FREEC) | PG with γ: shared (PG-cn.MOPS) | PG with γ: exclusive (PG) | PG with γ: exclusive (cn.MOPS) |
|---|---|---|---|---|---|---|---|
| Deletions | HB3 | 175 (55.2) | 130 (41.0) | 12 (3.8) | 195 (63.5) | 110 (35.8) | 2 (0.7) |
| | DD2 | 175 (63.9) | 90 (32.8) | 9 (3.3) | 152 (57.4) | 113 (42.6) | 0 (0.0) |
| | 7G8 | 81 (29.1) | 124 (44.6) | 73 (26.3) | 120 (21.4) | 85 (15.1) | 357 (63.5) |
| | GB4 | 72 (27.4) | 181 (68.8) | 10 (3.8) | 150 (50.0) | 103 (34.3) | 47 (15.7) |
| | OX005 | 153 (51.0) | 121 (40.3) | 26 (8.7) | 205 (69.3) | 69 (23.3) | 22 (7.4) |
| | OX006 | 93 (55.0) | 74 (43.8) | 2 (1.2) | 62 (37.1) | 105 (62.9) | 0 (0.0) |
| Amplifications | HB3 | 19 (29.7) | 41 (64.1) | 4 (6.3) | 23 (12.2) | 37 (19.7) | 128 (68.1) |
| | DD2 | 586 (84.3) | 48 (6.9) | 61 (8.8) | 608 (85.2) | 26 (3.6) | 80 (11.2) |
| | 7G8 | 187 (37.6) | 28 (5.6) | 283 (56.8) | 212 (17.8) | 3 (0.3) | 973 (81.9) |
| | GB4 | 6 (10.5) | 41 (71.9) | 10 (17.5) | 38 (11.8) | 9 (2.8) | 274 (85.4) |
| | OX005 | 62 (4.2) | 130 (8.8) | 1291 (87.1) | 168 (5.7) | 24 (0.8) | 2473 (93.5) |
| | OX006 | 27 (22.9) | 63 (53.4) | 28 (23.7) | 64 (20.9) | 26 (8.5) | 216 (70.6) |

The frequencies (with the respective percentages in brackets) refer to the number of hits shared and exclusively detected by the PG model against FREEC and cn.MOPS. The PG-FREEC and PG-cn.MOPS columns give the hits shared between the respective pair of methods; the PG, FREEC and cn.MOPS columns give the exclusive hits produced by the corresponding methodology in the respective comparison. Percentages are in relation to the overall number of deletions and amplifications identified by the respective pair of methods.
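The shared and exclusive frequencies reported above amount to set operations on the windows called by each method. The sketch below assumes each method returns a collection of window indices; the function name is hypothetical, and the index ranges are synthetic, chosen only so that the counts match the HB3 deletion comparison between PG and FREEC (175 shared, 130 PG only, 12 FREEC only).

```python
def compare_hits(hits_a, hits_b):
    """Counts and percentages of shared and exclusive hits for two methods."""
    a, b = set(hits_a), set(hits_b)
    counts = {"shared": len(a & b), "only_a": len(a - b), "only_b": len(b - a)}
    total = sum(counts.values())  # all windows called by either method
    return {
        key: (n, round(100 * n / total, 1) if total else 0.0)
        for key, n in counts.items()
    }


# Synthetic window indices reproducing the HB3 deletion counts (PG vs FREEC).
pg_hits = range(0, 305)       # 305 windows called by PG
freec_hits = range(130, 317)  # 187 windows called by FREEC
comparison = compare_hits(pg_hits, freec_hits)
print(comparison)
```

The percentages are taken over the union of the two hit sets, which is why the three values in each comparison sum to 100.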

**Comparison between hits detected by the Poisson-Gamma model and the FREEC software.**


**Ternary diagrams plotting the joint proportions of shared and exclusively detected hits by the PG model, the FREEC software, and cn.MOPS.**


In the case of cn.MOPS, the proportion of shared hits ranges from 21.4% (7G8) to 69.3% (OX005) for deletions and from 5.7% (OX005) to 85.2% (DD2) for amplifications (using

Finally, the FREEC software running under default settings could not detect a large amplification between the PFL1125w and PFL1160c genes in the HB3 isolate that was identified by our method (Figure

Validation of coverage-based hits using CGH array data

The validity of coverage-based hits produced by each methodology was assessed using published CGH data (Table

| Strain | Methodology | Deletions | Amplifications | Overall |
|---|---|---|---|---|
| HB3 | FREEC | — | — | 195/210 (92.9%) |
| | cn.MOPS | — | — | 214/348 (61.5%) |
| | PG with | — | — | 431/568 (75.9%) |
| | PG with | — | — | 288/365 (78.9%) |
| DD2 | FREEC | — | — | 792/831 (95.3%) |
| | cn.MOPS | — | — | 746/840 (88.8%) |
| | PG with | — | — | 854/957 (89.0%) |
| | PG with | — | — | 826/899 (91.9%) |
| 7G8 | FREEC | 89/154 (57.8%) | 285/470 (60.6%) | 374/624 (59.9%) |
| | cn.MOPS | 91/477 (19.1%) | 236/1185 (19.9%) | 327/1662 (19.7%) |
| | PG with | 164/243 (67.5%) | 216/343 (63.0%) | 380/586 (64.9%) |
| | PG with | 153/205 (75.6%) | 176/215 (81.9%) | 329/420 (78.3%) |
| GB4 | FREEC | 32/82 (39.0%) | 4/16 (25.0%) | 36/98 (36.7%) |
| | cn.MOPS | 77/197 (39.1%) | 28/273 (10.3%) | 105/470 (22.3%) |
| | PG with | 152/262 (59.0%) | 24/108 (22.2%) | 176/370 (47.6%) |
| | PG with | 148/253 (58.5%) | 14/47 (29.8%) | 162/300 (54.0%) |

CGH hits of HB3 and DD2 lab strains were taken from Samarakoon

Discussion

We have proposed a Poisson hierarchical modelling approach for CNV detection, which is flexible and robust to the common problem of overdispersed coverage data. Using simulation and resequencing data of the 3D7 reference genome, we have demonstrated a low baseline false positive rate of the methodology across different read depths. However, this low baseline false positive rate needs to be assessed in other genomic settings, preferably where reference resequencing data is available, or potentially using a robust simulation strategy with realistic statistical assumptions and parameter settings. In general, one can reduce the baseline false positive rate of any coverage-based method if mapping distance information is also taken into account. True positive hits are then likely to be those whose coverage and mapping distance analyses agree with each other. In particular, strong evidence for deletions is provided from genomic regions with too-low coverage and average mapping distance greater than expected, while amplified regions entail extremely high coverage and average mapping distance less than expected
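The agreement rule between coverage and mapping distance described above can be written as a small decision function. This is only a sketch of the stated logic: the function name, the threshold arguments and the numeric values are hypothetical and would have to come from the fitted coverage model and the sequencing library in practice.

```python
def classify_window(coverage, mapping_distance, low_cov, high_cov, expected_distance):
    """Label a window only when coverage and mapping distance agree."""
    if coverage < low_cov and mapping_distance > expected_distance:
        return "deletion"
    if coverage > high_cov and mapping_distance < expected_distance:
        return "amplification"
    return "unsupported"


# Hypothetical thresholds and observations, for illustration only.
limits = dict(low_cov=20, high_cov=300, expected_distance=250)
print(classify_window(5, 900, **limits))    # low coverage, long mapping distance
print(classify_window(450, 120, **limits))  # high coverage, short mapping distance
print(classify_window(5, 240, **limits))    # low coverage only
```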

The proposed approach was also applied to non-reference strain data and identified a large number of CNVs that could be validated by CGH data. The empirical and simulation results have demonstrated that our approach may be applicable to larger genomes where read depths can be lower, or in settings where overdispersion is present

Our method seemed to outperform the FREEC and cn.MOPS approaches with respect to concordance of hits confirmed by CGH data for the 7G8 and GB4 strains. However, a more accurate comparison was compromised by difficulties in relating stringency across methods. The stringency of our method is controlled by the credibility level, a rigorous statistical parameter, whereas stringency is more difficult to infer in algorithms that do not consider a specific statistical model, as in the FREEC software. Notwithstanding this difficulty, we showed that increasing the stringency of our methodology led to a high concordance with the FREEC-based hits. However, the FREEC software could only detect a known amplification at the GTP cyclohydrolase locus in HB3

The Poisson hierarchical modelling approach has the advantage of handling different data patterns but, as it stands, cannot estimate the corresponding copy number. To overcome this limitation, one can invoke a proportionality between mean (or median) coverage and the underlying copy number, as assumed elsewhere
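As a rough illustration of the proportionality argument, a copy number could be read off the ratio of median coverages. The sketch assumes a baseline copy number of 1 and uses made-up coverage values; the function name is hypothetical, and this is not part of the proposed methodology.

```python
from statistics import median


def estimate_copy_number(region_coverage, baseline_coverage, baseline_copies=1):
    """Copy number assumed proportional to the median coverage of the region."""
    return baseline_copies * median(region_coverage) / median(baseline_coverage)


# Hypothetical coverage values for a candidate region and for 'normal' windows.
region = [310, 295, 320]
baseline = [150, 160, 155]
copies = estimate_copy_number(region, baseline)
print("estimated copy number:", copies)
```

Using the median rather than the mean limits the influence of a few extreme windows, which matters for long-tailed coverage distributions.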

Conclusions

We have developed a robust Poisson hierarchical modelling approach for CNV detection using sequence coverage data. When applied to the

Methods

Sequence data and processing

Data consists of 7

Estimating coverage profiles

In each sample, the coverage profile was calculated following the usual procedure for human data

Detection of CNVs using a Poisson hierarchical modelling approach

When analysing each data set, we specified the following Multinomial distribution for the coverage values of windows with similar GC content

where _{g,i} is the total number of windows with coverage _{g}(_{g}(^{2}), respectively. Mathematically, the Poisson-Gamma model is given by

where

The estimation of these two models was performed through Bayesian methods using non-informative prior distributions for the respective parameters. With respect to the Poisson-Gamma, we used a Gamma prior distribution for the parameters ^{4} for the parameter ^{2}, respectively. To obtain posterior samples through parallel computing, we used WinBUGS (
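To make the role of the Gamma layer concrete, the following sketch simulates counts from a Poisson distribution whose rate is itself Gamma-distributed and compares the sample mean and variance. It assumes a shape and scale parameterisation of the Gamma distribution with arbitrary values; these are not the estimates obtained in this work, and the sampler is a simple stand-in for the MCMC software.

```python
import math
import random
from statistics import mean, variance


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for moderate rates."""
    threshold = math.exp(-rate)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_poisson_gamma(shape, scale, n, rng):
    """Draw a Gamma(shape, scale) rate per window, then a Poisson count."""
    return [sample_poisson(rng.gammavariate(shape, scale), rng) for _ in range(n)]


rng = random.Random(1)
counts = sample_poisson_gamma(shape=5.0, scale=4.0, n=5000, rng=rng)
print("mean:", round(mean(counts), 1), "variance:", round(variance(counts), 1))
```

With these values the theoretical mean is shape × scale = 20 and the theoretical variance is mean + shape × scale² = 100, so the mixture is overdispersed relative to a Poisson with the same mean.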

After obtaining the posterior parameter samples, the models were tested against each other using Bayes factors and the Deviance Information Criterion (DIC)
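For readers unfamiliar with the DIC, the sketch below computes it from posterior samples using the standard definition: mean deviance plus the effective number of parameters, the latter being the mean deviance minus the deviance at the posterior mean. It assumes a deliberately simplified model (independent Poisson counts with a single rate) and hypothetical data and posterior draws, so it illustrates the criterion rather than the analysis performed here.

```python
import math
from statistics import mean


def poisson_deviance(counts, rate):
    """Deviance, i.e. -2 * log-likelihood, of independent Poisson counts."""
    loglik = sum(y * math.log(rate) - rate - math.lgamma(y + 1) for y in counts)
    return -2.0 * loglik


def dic(counts, posterior_rates):
    """Return (DIC, pD) with DIC = mean deviance + pD."""
    mean_deviance = mean(poisson_deviance(counts, r) for r in posterior_rates)
    deviance_at_mean = poisson_deviance(counts, mean(posterior_rates))
    p_d = mean_deviance - deviance_at_mean  # effective number of parameters
    return mean_deviance + p_d, p_d


# Hypothetical counts and posterior draws of the rate, for illustration only.
observed = [3, 5, 4, 6, 2]
draws = [3.5, 4.0, 4.5]
dic_value, p_d = dic(observed, draws)
print("DIC:", round(dic_value, 2), "pD:", round(p_d, 3))
```

Models with lower DIC are preferred, since the criterion rewards fit (low mean deviance) and penalises complexity (large pD).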

For the formal CNV detection, we used the corresponding posterior predictive distribution, which embodies all uncertainty regarding coverage given the observed data and prior information. The calculation was performed through the simulation of ’new’ coverage values according to the respective posterior parameter samples and the best model for the data. We then determined the corresponding HPD credible interval at
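The detection step can be sketched as follows, assuming that simulated 'new' coverage values from the posterior predictive distribution are already available as a list. The HPD interval is approximated by the shortest interval containing the requested fraction of draws, which is appropriate for unimodal distributions; the function names and the example draws are hypothetical.

```python
import math


def hpd_interval(samples, level):
    """Shortest interval containing at least `level` of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    k = max(1, math.ceil(level * n))  # number of draws inside the interval
    start = min(range(n - k + 1), key=lambda i: ordered[i + k - 1] - ordered[i])
    return ordered[start], ordered[start + k - 1]


def call_cnv(coverage, lower, upper):
    """Flag windows whose coverage falls outside the credible interval."""
    if coverage < lower:
        return "deletion"
    if coverage > upper:
        return "amplification"
    return "normal"


# Hypothetical posterior predictive draws of coverage for one GC stratum.
predictive_draws = list(range(1, 101)) + [500]
lower, upper = hpd_interval(predictive_draws, level=0.99)
print("detection limits:", lower, upper)
print(call_cnv(0, lower, upper), call_cnv(640, lower, upper), call_cnv(50, lower, upper))
```

Raising the credible level widens the interval, so fewer windows are flagged and the analysis becomes more stringent.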

Detection of CNVs using the FREEC and cn.MOPS software

There are several CNV detection methods currently available in the literature

In general, the FREEC software divides the reference genome into non-overlapping and equal-size windows, and calculates the corresponding coverage profile of the target sample. A polynomial regression model is then used to describe the dependency between coverage and GC content. The respective predicted values are first standardised and then smoothed out. The final stage of the analysis consists of estimating the copy number in each segment and merging the regions with similar copy number. With this purpose, the software assumes that the ploidy of the organism under analysis is known and the copy number of a given segment is proportional to the median coverage of all the windows with similar GC content. For the

The cn.MOPS approach is also based on sequence coverage data partitioned into non-overlapping windows. It assumes a finite mixture of Poisson distributions (with a known number of components) for the coverage across samples of any given window. In this approach, each component of the mixture describes the coverage distribution associated with a given copy number under the assumption of a linear relationship between mean coverage and copy number. The model is fitted to each segment via an EM algorithm and the most probable component determined. To apply this approach to our data, we set the copy number to be an integer from 0 to 4, where the value 1 is the ’normal’ copy number (or the ploidy of the organism under study). The remaining parameters were specified at their default settings as explained in the documentation of the respective R package (called
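The idea of mixture components with means proportional to copy number can be illustrated for a single window in a single sample. The sketch below is not the cn.MOPS implementation, which fits the mixture across samples with an EM algorithm; it assumes equal prior weights, a known rate per copy, and a small positive rate for copy number 0, all of which are simplifications introduced here.

```python
import math


def poisson_logpmf(x, rate):
    """Log of the Poisson probability mass function."""
    return x * math.log(rate) - rate - math.lgamma(x + 1)


def copy_number_posterior(coverage, rate_per_copy, copy_numbers=range(5), eps=0.05):
    """Posterior over copy numbers under equal prior weights (an assumption)."""
    # Copy number 0 gets a small positive rate so the Poisson term stays defined.
    rates = [rate_per_copy * (cn if cn > 0 else eps) for cn in copy_numbers]
    logp = [poisson_logpmf(coverage, r) for r in rates]
    top = max(logp)
    weights = [math.exp(lp - top) for lp in logp]  # stabilised exponentiation
    total = sum(weights)
    return {cn: w / total for cn, w in zip(copy_numbers, weights)}


# Hypothetical window with coverage 310 when one copy yields about 150 reads.
posterior = copy_number_posterior(coverage=310, rate_per_copy=150)
best = max(posterior, key=posterior.get)
print("most probable copy number:", best)
```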

Simulation study based on 3D7 resequencing data

To assess the baseline false positive rate of our method, we performed a simulation study based on 3D7 resequencing data. We generated 10 independent data sets from the 3D7 resequencing sample according to read depths of 10×, 20×, and 50×, corresponding to a total of 1.25, 2.5, and 6 million reads, respectively. Each data set refers to the coverage profile of 120,309 100-bp windows and was simulated according to a Multinomial distribution with a sample size given by the corresponding total number of reads associated with a specific read depth and probability vector defined by the relative coverage profile of the original 3D7 reference sample. We analysed each data set separately using our method and the FREEC software. In the former, we used the Poisson-Gamma (PG) distribution and two different credible levels (
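The Multinomial resampling scheme described above can be sketched with the standard library alone. The example uses a toy profile of six windows and 500 reads instead of the 120,309 windows and millions of reads of the actual study; the function name and all numeric values are illustrative.

```python
import random
from collections import Counter


def simulate_coverage(original_profile, total_reads, rng):
    """One Multinomial draw: reads are allocated to windows with probabilities
    proportional to the original coverage profile."""
    windows = range(len(original_profile))
    draws = rng.choices(windows, weights=original_profile, k=total_reads)
    counts = Counter(draws)
    return [counts.get(w, 0) for w in windows]


# Toy coverage profile (one value per window), for illustration only.
profile = [160, 170, 0, 155, 330, 150]
rng = random.Random(42)
simulated = simulate_coverage(profile, total_reads=500, rng=rng)
print(simulated, "total reads:", sum(simulated))
```

Because the probability vector is the relative coverage of the original sample, windows with zero original coverage remain at zero, and lowering the total number of reads mimics a lower read depth.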

Comparative genomic hybridisation array data

To assess the reliability of the coverage-based hits, we brought into the analysis available CGH data for the HB3, DD2, 7G8 and GB4 laboratory strains. In the first two strains, we used a pre-compiled list of CGH hits

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

AP, SGC and TGC conceived the project. NS developed the Poisson hierarchical modelling approach and wrote the first draft of the manuscript, and modifications were performed by AP, CJS, SGC and TGC. NS, SAA and TGC performed the data analysis. CJS and SGC obtained and processed the clinical samples. All authors have read and approved the final version of the manuscript.

Acknowledgements

This work was partially funded by a grant from the Foundation for the National Institutes of Health (grant ref. #566), the Wellcome Trust (grant ref. 077383/Z/05/Z) through the Grand Challenges in Global Health Initiative, and Fundação para a Ciência e Tecnologia through the project Pest-OE/MAT/UI0006/2011. TC is funded by the UK Medical Research Council and Wellcome Trust. CJS is supported by the UK Health Protection Agency. AP is supported by his faculty baseline funding from KAUST. We thank the Kwiatkowski group at the Wellcome Trust Sanger Institute for putting the raw sequence data into the public domain. We also thank Valentina Boeva for advice on using the FREEC software, Mark Preston for commenting on the manuscript, Michael Bretscher for calling our attention to the JAGS software, and Francesc Coll for some useful references.