
Copy number variation detection using next generation sequencing read counts

Abstract

Background

A copy number variation (CNV) is a difference between genotypes in the number of copies of a genomic region. Next generation sequencing (NGS) technologies provide sensitive and accurate tools for detecting genomic variations that include CNVs. However, statistical approaches for CNV identification using NGS are limited. We propose a new methodology for detecting CNVs using NGS data. This method (henceforth denoted by m-HMM) is based on a hidden Markov model with emission probabilities that are governed by mixture distributions. We use the Expectation-Maximization (EM) algorithm to estimate the parameters in the model.

Results

A simulation study demonstrates that our proposed m-HMM approach has greater power for detecting copy number gains and losses than existing methods. Furthermore, applying our m-HMM to DNA sequencing data from the two maize inbred lines B73 and Mo17, to identify CNVs that may contribute to phenotypic differences between these lines, yields results concordant with previous array-based efforts to identify CNVs.

Conclusions

The new m-HMM method is a powerful and practical approach for identifying CNVs from NGS data.

Background

Introduction

A copy number variation (CNV) is a variation between genomes in the number of copies of a genomic region that is 1,000 DNA bases (1 Kb) or larger [1]. A CNV is a type of structural variation (SV) because a CNV affects a relatively large region in a DNA molecule. Structural genomic duplications or deletions correspond to copy number gains or losses, respectively. CNVs play an important role in human hereditary illnesses [2] and in plant breeding and agricultural improvement [3].

Maize exhibits extensive variation in both genotype and phenotype relative to the variation seen in humans [4]. The genotypic diversity of maize permits a variety of uses, such as human and animal food and fuel. The maize genotype B73 was sequenced in 2009 [3], which enables further comparison and understanding of different types of maize. Swanson-Wagner et al. [5] and Belo et al. [6] compared a variety of maize inbreds with B73 using array comparative genomic hybridization (aCGH) and identified a considerable number of CNVs along the genome. Springer et al. [7] also analyzed the structural variation between the two maize genotypes B73 and Mo17 using aCGH.

Array comparative genomic hybridization, first proposed in 1997 [8], has served as a robust and effective approach for CNV screening [9]. Statistical methods for analyzing aCGH data are readily available and are described in review articles such as Wineinger et al. [10] and Medvedev et al. [11]. However, aCGH is expensive and has limited resolution and accuracy. Rapidly developing next generation sequencing (NGS) technologies now provide a sensitive and accurate alternative approach for assessing genomic variation. Their quality, speed, and affordability give NGS a significant advantage over microarrays [12, 13].

Despite the advantages of NGS over aCGH, the use of NGS for CNV identification has been limited by a lack of available and effective statistical approaches. The well-developed aCGH data analysis methods cannot take full advantage of NGS data, and thus new statistical analysis methods for NGS data are needed. Most of the existing methods for CNV detection using NGS data can be classified into two categories: sliding window methods and hidden Markov model (HMM) methods. Sliding window methods include SegSeq by Chiang et al. [14], event-wise testing by Yoon et al. [15], rSW-seq by Kim et al. [16] and JointSLM by Magi et al. [17], among others. Methods in this category must deal with a large number of simultaneous significance tests, and their results depend heavily on the choice of critical values. One example of an HMM-based method is CNAseg by Ivakhno et al. [18], which relies on borrowing information from additional samples of the two genomes being compared. In most situations, only one sequenced sample from each genotype is available. The method proposed in this paper is able to detect copy number variation change points even with just one sequenced sample of each genotype.

Data collection and terminology

To understand the data and our model for the data, it is necessary to introduce some NGS data collection details and terminology.

First of all, we use reference genome to describe the genome of the genotype that has been fully sequenced using whole-genome sequencing technologies. In contrast, the target genome is the genome of a genotype of interest that has not been fully sequenced. The goal is to use NGS data from the reference and target genotypes to identify regions of copy number variation between the reference and target genomes. We say that a genomic region in the target genotype where the number of copies is amplified relative to the reference genotype has a copy number gain. A target genomic region present, but at a reduced copy number relative to the reference genome, is said to have a copy number loss. A region that is present in the reference genome but absent in the target genome is described as absent. These three states (copy number gain, copy number loss, and absent) represent copy number variations in the target genome relative to the reference. A region with no difference in the number of copies between the target and the reference genotypes is said to be in the normal state. A genomic location where there is a change from one copy number state to another is called a copy number change point.

To identify copy number change points and copy number states, a DNA sample from each of the target and the reference genotypes is obtained. The DNA strands in a sample are fragmented into 100 to 1,000 base segments. At one end of every randomly sampled segment, a sequence of 100 to 300 bases is determined and recorded. Such a sequence of bases is called a read. Each of the reads is then aligned to the reference genome to determine its origin in the genome. The location of the first base of the read on the reference genome is recorded as the position of the read. The numbers of reads for the target genome and the reference genome are recorded as the target read counts and the reference read counts, respectively. If a location has a positive target read count or a positive reference read count, it is called a site. Thus, data from NGS technologies are small non-negative integer counts with associated site positions on the reference genome.
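As a toy illustration of this data structure (not part of the authors' pipeline; the function name and example positions are ours), per-site target and reference read counts can be tallied from the aligned read start positions:

```python
from collections import Counter

def site_read_counts(target_positions, reference_positions):
    """Tally per-position read counts from aligned read start positions.

    A position is a 'site' if it has a positive target or reference count;
    the result maps each site to its (target count, reference count) pair.
    """
    t_counts = Counter(target_positions)
    r_counts = Counter(reference_positions)
    sites = sorted(set(t_counts) | set(r_counts))
    return {pos: (t_counts.get(pos, 0), r_counts.get(pos, 0)) for pos in sites}

# Example: two genomes aligned to the same reference coordinates.
counts = site_read_counts([101, 101, 250, 512], [101, 250, 250, 400])
# counts == {101: (2, 1), 250: (1, 2), 400: (0, 1), 512: (1, 0)}
```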

Preliminary data processing

The data to be analyzed can be described as follows. Suppose the observed reference and target genome read counts are denoted by $o_i^{[r]}$ and $o_i^{[t]}$, respectively, and the corresponding genomic positions by $h_i$, $i = 1,\ldots,I$. The goal of the m-HMM is to find a segmentation $h_1 \equiv h_{i_0} < h_{i_1} < h_{i_2} < \cdots < h_{i_J} \equiv h_I$ such that the copy number of the target genome changes between two consecutive segments and remains the same within each segment. Because the read counts take small non-negative integer values, including a large number of zeros, it is difficult to carry out accurate modeling and inference using the original data. Thus, it is more practical to work with sums of counts rather than the original individual counts.

A common way to aggregate the data is to define windows with a specific width and calculate the sum of target and reference read counts within each window, so as to obtain shorter series of larger target and reference read counts. Kim et al. [16] defined windows using a fixed number of read counts in the reference genome. Chiang et al. [14], Xie et al. [19] and Ivakhno et al. [18] defined windows using a fixed genomic distance. These methods have an underlying assumption that the sites within a window share the same copy number state. Such an assumption may be reasonable because a CNV is a somewhat rare type of genomic mutation, and the closer any two sites are located on the genome, the less likely there is a CNV change point between these sites. However there is also a problem in implementing these methods. Sites are randomly located along a genomic sequence, with a high density in some parts of the genome and a low density in other parts of the genome. Rigidly defining windows with a fixed number of read counts has the potential to put sites physically far away from each other into one window, which increases the risk of including copy number change points in a window. Rigidly defining windows with a fixed genomic distance can produce high variation in the number of sites and in read counts across windows. This can lead to decreased accuracy for identifying CNVs.

In this paper, we propose a new grouping method that takes into consideration both the numbers of sites in windows and the distances among sites within and between windows. We propose to use K-means clustering on the physical site positions to define windows. This method is implemented on each chromosome separately. We first divide each chromosome into several large parts by defining breakpoints that correspond to the largest distances between adjacent sites. For the real data set analyzed here, we suggest dividing each chromosome into 20 parts. Then we perform K-means clustering for each of the 20 parts, where K is chosen for any particular part as the number of reference genomic sites in the part divided by a constant C. Thus C is, on average, the number of reference genome sites in each window after grouping. In this paper, C = 40 is used in the simulation study and the real maize data analysis. Finally, we obtain the windows, where each window is defined by the collection of sites in one cluster. The number of windows W may differ from one chromosome to another. In the maize application we present later in the paper, the number of windows per chromosome ranges from 2707 to 5831. For window $w = 1,\ldots,W$, let $g_w$ denote the set of indices corresponding to sites in window w. Thus we obtain a new sequence of target genome read counts

$$u_1^{[t]}, u_2^{[t]}, \ldots, u_W^{[t]},$$

and a new sequence of the reference genome read counts

$$u_1^{[r]}, u_2^{[r]}, \ldots, u_W^{[r]},$$

where the target and reference read counts for window w are the sums of the target read counts and of the reference read counts within that window: $u_w^{[t]} = \sum_{i \in g_w} o_i^{[t]}$, $u_w^{[r]} = \sum_{i \in g_w} o_i^{[r]}$, $w = 1,\ldots,W$. We use the median position of the sites within a window as the location for that window, $\ell_w = \mathrm{median}\{h_i, i \in g_w\}$, $w = 1,\ldots,W$, and obtain a series of genomic locations

$$\ell_1, \ell_2, \ldots, \ell_W.$$

By this method, the sites that are closest together are more likely to be grouped together in a window, which results in a more reasonable grouping than previously used approaches.
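To make the grouping concrete, the following Python sketch shows one way such a windowing scheme could be implemented under the settings described above (20 coarse parts per chromosome and roughly C = 40 reference sites per window); the function and variable names are ours and this is not the published m-HMM code.

```python
import numpy as np
from sklearn.cluster import KMeans

def make_windows(positions, C=40, n_parts=20, seed=0):
    """Group site positions from one chromosome into windows.

    First split the chromosome at the (n_parts - 1) largest gaps between
    adjacent sites, then run K-means on the positions within each part with
    K = (#sites in part) // C, so each window holds roughly C sites.
    Returns a list of index arrays, one per window, ordered by position.
    """
    positions = np.asarray(positions, dtype=float)
    order = np.argsort(positions)
    pos = positions[order]

    # Breakpoints at the largest gaps between adjacent (sorted) sites.
    gaps = np.diff(pos)
    cut_after = (np.sort(np.argsort(gaps)[-(n_parts - 1):])
                 if n_parts > 1 else np.array([], dtype=int))
    part_bounds = np.concatenate(([0], cut_after + 1, [len(pos)]))

    windows = []
    for start, end in zip(part_bounds[:-1], part_bounds[1:]):
        part_idx = order[start:end]
        part_pos = pos[start:end]
        k = max(1, len(part_pos) // C)
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(
            part_pos.reshape(-1, 1))
        for lab in range(k):
            members = part_idx[labels == lab]
            if members.size:
                windows.append(members)
    # Order the windows along the chromosome by their median position.
    windows.sort(key=lambda idx: np.median(positions[idx]))
    return windows
```

The returned index sets play the role of the $g_w$ above; summing the per-site counts over each set and taking the median position gives $u_w^{[t]}$, $u_w^{[r]}$ and $\ell_w$.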

Once sites are grouped into windows, we assume that the hidden states remain the same within windows, and our m-HMM method is used to obtain a window-based segmentation. Next, an adjustment procedure is further performed on the window-based result to obtain more accurate copy number change points. We refer to the final segmentation result as the site-based segmentation result.

Mixture-Hidden Markov Model (m-HMM) for a window based result

In this section, we describe the first step of the proposed mixture-hidden Markov model (m-HMM) that we use to estimate the window-based copy number change points along the genome. This CNV-detecting methodology is carried out separately on each chromosome.

HMMs were described by Baum et al. [20–23]. An HMM is constructed from a bivariate random process $\{S_w, U_w\}$, $w = 1,\ldots,W$. One component of this random process, $\{S_w\}$ ($w = 1,\ldots,W$), is an unobserved Markov chain with finitely many states. $\{S_w\}$ has the Markov property, which means that given the "current state" $S_w$, the "future state" $S_{w+1}$ and the "past state" $S_{w-1}$ are independent, i.e., $P(S_{w+1} \mid S_1, \ldots, S_w) = P(S_{w+1} \mid S_w)$, $w = 1,\ldots,W-1$. The probability $a_{kl} = P(S_{w+1} = l \mid S_w = k)$ is called the transition probability, which determines the probability of the state of window w + 1 given the state of window w. In the m-HMM, the unobserved copy number states of the windows along the chromosome are the hidden states. The copy number states take four values, where

1 = gain: copy number gain/amplification in the target relative to the reference,

2 = normal: no difference in copy number between the target and the reference,

3 = loss: region present in the target genome but at a reduced copy number relative to the reference,

4 = absent: region absent in the target genome but present in the reference.

Another important component of an HMM is $\{U_w\}$ ($w = 1,\ldots,W$), which is a sequence of observations. Each hidden state generates an observation with a specific probability, $P(U_w \mid S_w)$, called the emission probability. In the m-HMM, the observations are the target genome read count $u_w^{[t]}$ and the reference read count $u_w^{[r]}$ at each window w, i.e., $U_w$ with value $u_w = (u_w^{[t]}, u_w^{[r]})$, $w = 1,\ldots,W$. A detailed description of the m-HMM is presented in the following two subsections.

The transition probabilities

For the copy number states k and l, the transition probability $a_{kl}(w)$ ($k,l = 1,\ldots,4$) is defined as the conditional probability of the next window w + 1 taking copy number state l, given copy number state k for the current window w. Motivated by Marioni et al. [24], we define a transition matrix that takes the relative positions of windows into consideration. As the distance between adjacent windows increases, the transition probability from state k to state l ≠ k increases and approaches a positive constant. As the distance between adjacent windows decreases, the probability of a difference in copy number states between the windows diminishes.

For each window w, the transition probability is defined as a function of w given by $A_w = [a_{kl}(w)]_{4\times 4}$, where

$$a_{kl}(w) = P(S_{w+1} = l \mid S_w = k, \theta) = \begin{cases} p_{kl}\,(1 - e^{-\rho d_w}), & l \neq k, \\ 1 - \left(\sum_{j \neq k} p_{kj}\right)(1 - e^{-\rho d_w}), & l = k, \end{cases} \qquad (1)$$

for $k,l = 1,\ldots,4$ and $w = 1,\ldots,W-1$. Here the parameter $p_{kl} \in (0,1)$ affects the transition probabilities from state k to state l, subject to the constraint $\sum_{l \neq k} p_{kl} < 1$, $k = 1,\ldots,4$. The constant $d_w$ denotes the physical distance on the genome between the location of window w and the location of window w + 1 (i.e., $\ell_{w+1} - \ell_w$). The parameter ρ is a positive-valued parameter determining the effect of distance on the transition matrix; the distance effect diminishes as ρ approaches $\infty$. The parameter vector θ represents all the parameters in the model, including $p_{kl}$ ($k,l = 1,\ldots,4$; $l \neq k$), ρ and all the parameters in the emission distribution described in the following section.
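As an illustration, equation (1) can be implemented directly as a small function; this is our own sketch (names are ours), where p is the 4×4 array of off-diagonal transition parameters with its diagonal ignored.

```python
import numpy as np

def transition_matrix(p, rho, d_w):
    """Transition matrix A_w of equation (1) for one pair of adjacent windows.

    p    : 4x4 array of transition parameters p_kl (diagonal entries ignored),
           with sum_{l != k} p_kl < 1 for every row k.
    rho  : positive distance-effect parameter.
    d_w  : genomic distance between the locations of windows w and w + 1.
    """
    shrink = 1.0 - np.exp(-rho * d_w)         # -> 1 for distant windows, -> 0 for close ones
    A = p * shrink                             # off-diagonal: p_kl * (1 - exp(-rho * d_w))
    np.fill_diagonal(A, 0.0)
    np.fill_diagonal(A, 1.0 - A.sum(axis=1))   # diagonal: 1 - sum of the row's off-diagonal terms
    return A
```

By construction each row of the returned matrix sums to one.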

The emission distributions

The emission distributions define the emission probabilities, which are the conditional joint probabilities of the reference and target read counts given the copy number state of the window. Each window w has two observations, a reference read count $U_w^{[r]} = u_w^{[r]}$ and a target read count $U_w^{[t]} = u_w^{[t]}$. We model the reference read count as $U_w^{[r]} \mid \lambda_w^{[r]} \sim \mathrm{Poisson}(\lambda_w^{[r]})$, where $\lambda_w^{[r]}$ follows a Gamma distribution with parameters α and β.

Conditional on the state of window w, one natural choice for the emission distribution of the target read count for window w is

$$U_w^{[t]} \mid (\lambda_w^{[r]}, S_w = k) \sim \mathrm{Poisson}(K_k\, c_0\, \lambda_w^{[r]}). \qquad (2)$$

Here $K_1 = 2$, $K_2 = 1$, $K_3 = 0.5$ and $K_4 = 0$ are the CNV effects of the four copy number states "gain", "normal", "loss" and "absent", respectively. The parameter $c_0$ is a normalization factor that accounts for any discrepancy between the total numbers of reference and target reads in normal regions. However, in real data sets, extra variation and mis-alignments along the genomic sequence are inevitable. For example, in a normal genomic segment with no difference in copy number between the target and the reference, there still exist some windows with normalized target-to-reference read count ratios significantly higher or lower than 1; within a segment of copy number gain (or loss), we also find windows with normalized target-to-reference read count ratios significantly lower than 2 (or higher than 0.5). The original HMM introduced above would identify not only true copy number variation signals but also the local variations caused by random error. Without additional modification, the method tends to report more state changes than are justified by true CNV signals.

The problem of identifying too many CNV change points is also pointed out by Ivakhno et al. [18]. To address this problem, Ivakhno et al.’s CNAseg employs a merging adjustment procedure on the outcomes of the original HMM segmentations using Pearson’s χ2 statistics. However, CNAseg segmentation depends heavily on the determination of the merging threshold. In the method we propose, instead of using (2) to model the target read count distribution, we use a Poisson mixture model for each of the four copy number states k = 1,2,3,4:

$$U_w^{[t]} \mid (\lambda_w^{[r]}, S_w = k) \sim \sum_{j=1}^{4} q_{kj}\, \mathrm{Poisson}(v_{kj}\, c_0\, \lambda_w^{[r]}), \qquad (3)$$

where
$$Q = [q_{kj}]_{4\times 4} = \begin{pmatrix} q_{11} & 1-q_{11} & 0 & 0 \\ \frac{1-q_{22}}{2} & q_{22} & \frac{1-q_{22}}{2} & 0 \\ 0 & \frac{1-q_{33}}{2} & q_{33} & \frac{1-q_{33}}{2} \\ 0 & 0 & 1-q_{44} & q_{44} \end{pmatrix}$$
with $q_{kk} \in (0.5, 1)$ for $k = 1,\ldots,4$, and
$$V = [v_{kj}]_{4\times 4} = \begin{pmatrix} 2 & v_{12} & 0 & 0 \\ v_{21} & 1 & v_{23} & 0 \\ 0 & v_{32} & 0.5 & v_{34} \\ 0 & 0 & v_{43} & 0 \end{pmatrix}$$
with $v_{12}, v_{21} \in (1,2)$, $v_{23}, v_{32} \in (0.5,1)$ and $v_{34}, v_{43} \in (0,0.5)$. The diagonal elements $v_{kk}$ ($k = 1,\ldots,4$) of V denote the effects of the four copy number states gain, normal, loss and absent, and have the fixed constant values $v_{11} = 2$, $v_{22} = 1$, $v_{33} = 0.5$ and $v_{44} = 0$. The off-diagonal elements $v_{kj}$ ($k,j = 1,\ldots,4$; $k \neq j$) account for uncertainties due to random errors in the sequencing technology such as mis-alignments, and their values are estimated from the data. More specifically, parameter $v_{12}$ accounts for uncertainty in the copy number gain state that allows observed copy numbers to vary between 1 and 2; parameters $v_{21}$ and $v_{23}$ account for uncertainty in the normal state that allows observed copy numbers to vary between 0.5 and 2; parameters $v_{32}$ and $v_{34}$ allow observed copy numbers to vary between 0 and 1 within the copy number loss state; and parameter $v_{43}$ allows observed copy numbers to vary between 0 and 0.5 in the absent state. All other $v_{kj}$'s are restricted to 0, so that V is a tridiagonal matrix. The weight matrix Q is also tridiagonal, with $q_{kj}$ ($k,j = 1,\ldots,4$) being the weights of the Poisson components in the mixture distribution. Parameters $q_{kk}$ ($k = 1,\ldots,4$) denote the Poisson component weights of the four copy number states and are restricted to lie between 0.5 and 1. Accordingly, the off-diagonal $q_{kj}$'s are less than 0.5. In this way, the effects of the four copy number states dominate the effects of random uncertainties.
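The tridiagonal structure of Q and V can be written down directly from the free parameters; the helper below is our own illustration (not the authors' code) and builds both matrices from $q = (q_{11}, q_{22}, q_{33}, q_{44})$ and $v = (v_{12}, v_{21}, v_{23}, v_{32}, v_{34}, v_{43})$.

```python
import numpy as np

def build_Q_V(q, v):
    """Construct the tridiagonal weight matrix Q and effect matrix V of (3).

    q : (q11, q22, q33, q44), each in (0.5, 1).
    v : (v12, v21, v23, v32, v34, v43), with v12, v21 in (1, 2),
        v23, v32 in (0.5, 1) and v34, v43 in (0, 0.5).
    """
    q11, q22, q33, q44 = q
    v12, v21, v23, v32, v34, v43 = v
    Q = np.array([
        [q11,           1 - q11,       0.0,           0.0],
        [(1 - q22) / 2, q22,           (1 - q22) / 2, 0.0],
        [0.0,           (1 - q33) / 2, q33,           (1 - q33) / 2],
        [0.0,           0.0,           1 - q44,       q44],
    ])
    V = np.array([
        [2.0, v12, 0.0, 0.0],
        [v21, 1.0, v23, 0.0],
        [0.0, v32, 0.5, v34],
        [0.0, 0.0, v43, 0.0],
    ])
    return Q, V
```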

Based on the model as specified so far, the joint distribution for the target and the reference read counts at window w, conditional on the hidden state for window w being k, is

$$P\left(U_w^{[t]} = u_w^{[t]}, U_w^{[r]} = u_w^{[r]} \mid S_w = k, \theta\right) = \sum_j q_{kj}\, \frac{\Gamma\!\left(u_w^{[t]} + u_w^{[r]} + \alpha\right)\,(v_{kj} c_0)^{u_w^{[t]}}\,\beta^{\alpha}}{\Gamma(\alpha)\, u_w^{[r]}!\, u_w^{[t]}!\,(v_{kj} c_0 + 1 + \beta)^{u_w^{[t]} + u_w^{[r]} + \alpha}} \qquad (4)$$

with $u_w^{[r]} \in \{0, 1, 2, \ldots\}$ and $u_w^{[t]} \in \{0, 1, 2, \ldots\}$. A detailed derivation of (4) is provided in Additional file 1: Appendix 1. Note that

$$P\left(u_w^{[t]} = 0,\, u_w^{[r]} = 0 \mid S_w = k, \theta\right) = \sum_j q_{kj} \left(\frac{\beta}{v_{kj} c_0 + 1 + \beta}\right)^{\alpha},$$

which is greater than 0. However, it is not possible for both the target and the reference read counts to be zero in the same window, i.e., the joint distribution of $u_w^{[t]}$ and $u_w^{[r]}$ is truncated in the sense that

$$P\left(u_w^{[t]} = u_w^{[r]} = 0 \mid \theta\right) = 0. \qquad (5)$$

Consequently, we multiply the jth component of (4) by the constant $\frac{(v_{kj} c_0 + 1 + \beta)^{\alpha}}{(v_{kj} c_0 + 1 + \beta)^{\alpha} - \beta^{\alpha}}$ ($j = 1,2,3,4$) and obtain a joint probability that satisfies (5). For the target and reference reads at window w given $S_w = k$, the joint emission probability is

$$P(U_w = u_w \mid S_w = k, \theta) = \sum_j q_{kj}\, \frac{(v_{kj} c_0)^{u_w^{[t]}}\,\beta^{\alpha}\,\Gamma\!\left(u_w^{[t]} + u_w^{[r]} + \alpha\right)}{u_w^{[t]}!\, u_w^{[r]}!\,\Gamma(\alpha)\,(v_{kj} c_0 + 1 + \beta)^{u_w^{[t]} + u_w^{[r]}}\left[(v_{kj} c_0 + 1 + \beta)^{\alpha} - \beta^{\alpha}\right]}, \qquad (6)$$

where $u_w = (u_w^{[t]}, u_w^{[r]})'$ takes values in the set of pairs of non-negative integers excluding $(0, 0)'$, and $U_w = (U_w^{[t]}, U_w^{[r]})'$.
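For completeness, the truncated emission probability (6) can be evaluated in log space to keep the Gamma terms stable; the sketch below is our illustration of the formula (the function name and the guard for the zero-mean absent-state component are ours).

```python
import numpy as np
from scipy.special import gammaln

def emission_prob(u_t, u_r, k, Q, V, c0, alpha, beta):
    """Joint emission probability (6) of (u_t, u_r) given state k (0-based).

    Sums the four mixture components j; a component with v_kj = 0 contributes
    only when u_t = 0, since a Poisson with mean zero cannot emit positive counts.
    """
    total = 0.0
    for j in range(4):
        q_kj, mu = Q[k, j], V[k, j] * c0
        if q_kj == 0.0:
            continue
        if mu == 0.0 and u_t > 0:
            continue                      # zero-mean component cannot produce u_t > 0
        log_num = (u_t * np.log(mu) if u_t > 0 else 0.0) \
                  + alpha * np.log(beta) \
                  + gammaln(u_t + u_r + alpha)
        log_den = gammaln(u_t + 1) + gammaln(u_r + 1) + gammaln(alpha) \
                  + (u_t + u_r) * np.log(mu + 1.0 + beta) \
                  + np.log((mu + 1.0 + beta) ** alpha - beta ** alpha)
        total += q_kj * np.exp(log_num - log_den)
    return total
```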

Parameter estimation using the EM algorithm

We want to find the MLE of the parameter $\theta = \{p, \rho, \alpha, \beta, c_0, q, v\}$, where p and ρ are transition probability parameters, and α, β, $c_0$, $q = (q_{11}, q_{22}, q_{33}, q_{44})$ and $v = (v_{12}, v_{21}, v_{23}, v_{32}, v_{34}, v_{43})$ are emission probability parameters. It is not easy to directly maximize the likelihood with respect to this 26-dimensional parameter, so we use the Expectation-Maximization (EM) algorithm to iteratively maximize the likelihood

$$L(\theta \mid u) = \sum_{s} \pi_{s_1} \prod_{w=1}^{W-1} a_{s_w s_{w+1}}(w) \prod_{w=1}^{W} P(u_w \mid s, \alpha, \beta, c_0, q, v). \qquad (7)$$

In (7), $u = (u_1', \ldots, u_W')$ is the series of target and reference read counts for all the windows along the chromosome, $s = (s_1, \ldots, s_W) \in \{1,2,3,4\}^W$ is a vector of unobserved states for all the windows along the chromosome, and $\pi_{s_1} = P(S_1 = s_1)$ is the probability that the copy number state in the first window is $s_1$.

Characterizing the E and M steps

In the EM algorithm, the likelihood function above is also called the observed data likelihood $L^{(\mathrm{obs})} = L(\theta \mid u)$ because all the observations u are observed. If the hidden states were known, we would have the complete data likelihood function

$$L^{(\mathrm{comp})} = L(\theta \mid u, s) = \pi_{s_1} \prod_{w=1}^{W-1} a_{s_w s_{w+1}}(w) \prod_{w=1}^{W} P(u_w \mid s, \alpha, \beta, c_0, q, v). \qquad (8)$$

In these two likelihoods, $\pi_{s_1} \prod_{w=1}^{W-1} a_{s_w s_{w+1}}(w)$ is the probability that the hidden states for all the windows on the chromosome are $s_1, \ldots, s_W$, respectively, and $\prod_{w=1}^{W} P(u_w \mid s, \alpha, \beta, c_0, q, v)$ is the probability of the observed read counts u given the hidden states $s_1, \ldots, s_W$.

Given the parameter estimate $\theta^{(m)}$ from iteration m of the EM algorithm, we use the E-step and the M-step to update the parameter estimate.

  •  The E-step: Evaluate the expectation of the complete data log-likelihood with respect to the conditional distribution of the hidden states S given the observed data u, with $\theta = \theta^{(m)}$, i.e., evaluate $E_{S \mid u, \theta^{(m)}}\left[\log L(\theta \mid u, S)\right]$.

  •  The M-step: Find $\theta^{(m+1)}$, the value of θ that maximizes $E_{S \mid u, \theta^{(m)}}\left[\log L(\theta \mid u, S)\right]$.

A detailed look at both the E and M steps is provided in the Additional file 1.

Initialization, convergence, and prediction of hidden states

The initial values for all the parameters are defined as follows. For the transition probabilities, we set $p^{(0)} \triangleq \left(p_{12}^{(0)}, \ldots, p_{43}^{(0)}\right)' = 0.1 \cdot \mathbf{1}_{12 \times 1}$ and $\rho^{(0)} = 0.5$. For the parameters in the emission probabilities, we set $q_{11}^{(0)} = q_{22}^{(0)} = q_{33}^{(0)} = q_{44}^{(0)} = 0.5$, $v_{12}^{(0)} = v_{21}^{(0)} = 1.5$, $v_{23}^{(0)} = v_{32}^{(0)} = 0.75$, $v_{34}^{(0)} = v_{43}^{(0)} = 0.25$, and use the maximum likelihood estimates of α and β with all sites assigned the normal copy number state (state 2) as the initial values $\alpha^{(0)}$ and $\beta^{(0)}$.

We use
$$x^{[t]} = \sum_{w} u_w^{[t]}\, I\{\text{window } w \text{ has the normal copy number state}\}$$
and
$$x^{[r]} = \sum_{w} u_w^{[r]}\, I\{\text{window } w \text{ has the normal copy number state}\}$$
to denote the total target and reference aligned counts for sites in normal copy number regions, and $c_0 = x^{[t]} / x^{[r]}$ to denote the ratio between them. We obtain initial estimates of $x^{[t]}$ and $x^{[r]}$ as follows. First we calculate the ratio between the read counts of the target genome and the reference genome for all windows with positive reference genome counts across all chromosomes. Then we separate the ratios into three groups using K-means clustering to obtain three group means $M_1$, $M_2$ and $M_3$, with $M_1 < M_2 < M_3$. Then $x^{[t](0)}$ is the sum of the target read counts for the windows belonging to the cluster with mean $M_2$, and $x^{[r](0)}$ is the corresponding sum of reference read counts for those windows. After obtaining $x^{[t](0)}$ and $x^{[r](0)}$, we calculate $c_0^{(0)} = x^{[t](0)} / x^{[r](0)}$.
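A possible implementation of this initialization (our own sketch; names are illustrative) clusters the per-window ratios into three groups and sums the counts of the middle cluster:

```python
import numpy as np
from sklearn.cluster import KMeans

def initial_c0(u_t, u_r, seed=0):
    """Initial estimate of the normalization factor c0 = x[t] / x[r].

    u_t, u_r : arrays of target and reference read counts per window
               (all chromosomes pooled). Windows with zero reference
               counts are excluded before clustering the ratios.
    """
    u_t, u_r = np.asarray(u_t, float), np.asarray(u_r, float)
    keep = u_r > 0
    ratios = (u_t[keep] / u_r[keep]).reshape(-1, 1)

    km = KMeans(n_clusters=3, n_init=10, random_state=seed).fit(ratios)
    middle = np.argsort(km.cluster_centers_.ravel())[1]   # cluster with mean M2
    in_middle = np.zeros_like(keep)
    in_middle[keep] = km.labels_ == middle

    x_t0, x_r0 = u_t[in_middle].sum(), u_r[in_middle].sum()
    return x_t0 / x_r0, x_t0, x_r0
```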

The probabilities of the four hidden states for the first window, $\pi_{s_1}$ ($s_1 = 1,2,3,4$ with $\sum_{s_1=1}^{4} \pi_{s_1} = 1$), are determined by the target and reference read counts in the first window. We calculate

$$R_1 \triangleq \frac{u_1^{[t]}}{u_1^{[r]}} \cdot \frac{x^{[r](0)}}{x^{[t](0)}}.$$

If $R_1 > 1.5$, then $\pi_1 = 1$, i.e., we assign copy number gain as the initial copy number state of the first window; if $0.75 < R_1 \leq 1.5$, then $\pi_2 = 1$, i.e., we assign the normal state; if $0 < R_1 \leq 0.75$, then $\pi_3 = 1$, i.e., we assign copy number loss; if $R_1 = 0$, then $\pi_4 = 1$, i.e., we assign the absent state.
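In code, this rule is a simple set of thresholds on $R_1$; the sketch below is ours, with the state coding 1 = gain, 2 = normal, 3 = loss, 4 = absent used throughout the paper.

```python
def initial_first_window_state(u1_t, u1_r, x_t0, x_r0):
    """Initial copy number state of the first window from the ratio R1.

    At least one of u1_t, u1_r is positive by construction of the windows;
    a zero reference count with a positive target count is treated as a gain.
    """
    r1 = (u1_t / u1_r) * (x_r0 / x_t0) if u1_r > 0 else float("inf")
    if r1 == 0:
        return 4          # absent
    if r1 <= 0.75:
        return 3          # loss
    if r1 <= 1.5:
        return 2          # normal
    return 1              # gain
```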

For all other windows $S_w$ ($w \geq 2$), we use the normal state (state 2) as the initial state. All the parameters are updated using the EM algorithm. The iterations stop when the difference between $\theta^{(m)}$ and $\theta^{(m+1)}$ is small enough, and we use the value of θ at the final iteration (denoted by $\hat{\theta}$) as our estimate of θ.

After convergence, the conditional probability of each of the four states, $P_k(w) = P(S_w = k \mid u, \hat{\theta})$ ($k = 1,2,3,4$), is calculated, and the conditional hidden state prediction for window w is the state with the largest conditional probability, i.e., $\hat{S}_w = \arg\max_k P_k(w)$ for $w = 1,\ldots,W$.
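The posteriors $P_k(w)$ are the usual HMM smoothing probabilities and can be computed with a scaled forward–backward pass; the sketch below is our illustration (not the authors' code) and assumes the emission probabilities (6) have already been evaluated into a W×4 matrix and the transition matrices (1) into a list of 4×4 arrays.

```python
import numpy as np

def state_posteriors(pi, A_list, E):
    """Posterior state probabilities P_k(w) by scaled forward-backward.

    pi     : length-4 initial state distribution for the first window.
    A_list : list of W-1 transition matrices A_w (4x4), one per window pair.
    E      : (W, 4) matrix, E[w, k] = emission probability of u_w given state k.
    Returns a (W, 4) matrix of posteriors; argmax over rows gives S_hat_w.
    """
    W, K = E.shape
    alpha = np.zeros((W, K))
    beta = np.zeros((W, K))
    scale = np.zeros(W)

    # Scaled forward recursion.
    alpha[0] = pi * E[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for w in range(1, W):
        alpha[w] = (alpha[w - 1] @ A_list[w - 1]) * E[w]
        scale[w] = alpha[w].sum()
        alpha[w] /= scale[w]

    # Scaled backward recursion using the same scale factors.
    beta[W - 1] = 1.0
    for w in range(W - 2, -1, -1):
        beta[w] = A_list[w] @ (E[w + 1] * beta[w + 1])
        beta[w] /= scale[w + 1]

    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)
```

The predicted states are then post.argmax(axis=1) (plus one if states are coded 1–4).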

Change point adjustment

Sites are grouped into windows initially because we wish to have larger read counts to better capture copy number variation signals with the m-HMM, and the first step assumes that the copy number state remains the same within a window. In reality, copy number change points may occur within windows. Figure 1 illustrates the relation between the estimated change point before adjustment and the true change point. Suppose window w is the window identified by the algorithm where the copy number state changes from state 1 to state 2 when genomic sites are grouped into windows. In the figure, the brown and blue segments represent the two sides of the window-based segmentation cut point, and the vertical black bar is the boundary between windows w - 1 and w. Suppose $g_w = \{i_{II}, i_{II}+1, \ldots, i_{III}-1\}$, so $i_{II}$ is the index of the first site in window w, which would be identified as the change point if no adjustment were made. With $g_{w-1} = \{i_I, i_I+1, \ldots, i_{II}-1\}$ and $g_{w+1} = \{i_{III}, i_{III}+1, \ldots, i_{IV}-1\}$, $i_I$ and $i_{III}$ are the first sites of window w - 1 and window w + 1, respectively. The true change point $i^{(\mathrm{true})}$ may occur anywhere between sites $i_I$ and $i_{III}-1$; in the figure, it is represented by the vertical red bar and falls within window w - 1. To obtain a more accurate result, the following algorithm makes the adjustment (a code sketch follows the list):

  1. For each site i between site $i_I$ and site $i_{III}-1$, obtain the total target and reference counts from $i_I$ to $i - 1$, denoted $z_{i(L)}^{[t]}$ and $z_{i(L)}^{[r]}$, and the total target and reference counts from i to $i_{III}-1$, denoted $z_{i(R)}^{[t]}$ and $z_{i(R)}^{[r]}$.

  2. Calculate the Pearson's $\chi^2$ test statistic $\sum_{\tau}\sum_{\gamma} \frac{\left(z_{i(\gamma)}^{[\tau]} - \hat{z}_{i(\gamma)}^{[\tau]}\right)^2}{\hat{z}_{i(\gamma)}^{[\tau]}}$ from $\{z_{i(L)}^{[t]}, z_{i(L)}^{[r]}, z_{i(R)}^{[t]}, z_{i(R)}^{[r]}\}$, where τ = t or r, γ = L or R, and $\hat{z}_{i(\gamma)}^{[\tau]} = \frac{\left(\sum_{\gamma} z_{i(\gamma)}^{[\tau]}\right)\left(\sum_{\tau} z_{i(\gamma)}^{[\tau]}\right)}{\sum_{\gamma}\sum_{\tau} z_{i(\gamma)}^{[\tau]}}$.

  3. Perform steps 1 and 2 for every site i between $i_I$ and $i_{III}-1$; the adjusted breakpoint $i^{(\mathrm{adj})}$ is the site with the largest Pearson's $\chi^2$ test statistic.
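The adjustment amounts to scanning every candidate split of the sites between $i_I$ and $i_{III}-1$ and keeping the split that maximizes the 2×2 Pearson $\chi^2$ statistic. A minimal sketch (our own, operating on the per-site target and reference counts of those sites) is given below.

```python
import numpy as np

def adjust_change_point(o_t, o_r):
    """Refine a window-based change point within sites i_I, ..., i_III - 1.

    o_t, o_r : per-site target and reference read counts for the sites
               from i_I to i_III - 1, in genomic order. Returns the offset
               (relative to i_I) of the adjusted breakpoint i_adj.
    """
    o_t, o_r = np.asarray(o_t, float), np.asarray(o_r, float)
    best_i, best_stat = 1, -np.inf
    for i in range(1, len(o_t)):                      # split: [0, i) vs [i, end)
        # Observed 2x2 table of totals (rows: left/right, cols: target/reference).
        obs = np.array([[o_t[:i].sum(), o_r[:i].sum()],
                        [o_t[i:].sum(), o_r[i:].sum()]])
        expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()
        if np.any(expected == 0):
            continue
        stat = ((obs - expected) ** 2 / expected).sum()   # Pearson chi-square
        if stat > best_stat:
            best_i, best_stat = i, stat
    return best_i
```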

Figure 1: True change point and the estimate of the window-based change point.

Results and discussion

Simulation studies

To set up gold-standard test data sets with known CNV positions for evaluating the performance of our method and peer tools, we conducted three simulation studies. In the first and second studies, we simulated read counts at each of the target and reference positions and then simulated CNV segments on the target genome. The simulated sequences in the first two studies are low coverage, with depths of 2X. In the third simulation study, we first randomly simulated copy number variation segments, then simulated paired-end Illumina reads and mapped them back to the reference genome. The third simulation captures the sequencing process and uses high coverage, with a depth of 30X.

Simulation study 1

The first simulation study is based on real DNA sequencing data from chromosome 4 of the lung cancer cell line NCI-H2347 from Chiang et al. [14]. Simulation based on real data best preserves the characteristics of the data, including the variation and errors that affect actual data analyses. We first randomly simulated the genomic positions of the sites along the target and reference genomes using a uniform distribution. We generated the reference and target genome read counts by shuffling the reference genome read counts of chromosome 4 of NCI-H2347. After that, we randomly picked 90 CNV segments on the simulated target genome. We considered three sizes for the CNV segments (10 kb, 50 kb and 100 kb) and generated 10 segments of each size for each CNV type. We doubled the read counts for segments with copy number gains, halved the read counts for segments with copy number losses, and set the read counts to 0 for segments with no copies. This corresponds to low-coverage sequencing with a depth of 2X. We compare the results of the m-HMM with the mixture Poisson emission probability in (6), the original HMM with the Poisson emission probability in (2), and SegSeq by Chiang et al. [14].

Table 1 compares the original HMM, the proposed m-HMM, and the m-HMM without change point adjustment. The four true copy number states are listed in the first column. Sensitivity, specificity, empirical false positive rate (EFPR) and empirical false negative rate (EFNR) are computed using (9) and listed in Table 1. From (9), a more accurate method has greater sensitivity and specificity, and lower EFPR and EFNR.

$$\text{Sensitivity} = \frac{\#\,\text{correctly identified copy number changes of a given type}}{\#\,\text{true copy number changes of the given type}},$$
$$\text{Specificity} = \frac{\#\,\text{correctly identified normal sites}}{\#\,\text{true normal sites}},$$
$$\text{EFPR} = \frac{\#\,\text{incorrectly identified copy number changes of a given type}}{\#\,\text{identified copy number changes of the given type}},$$
$$\text{EFNR} = \frac{\#\,\text{incorrectly identified normal sites}}{\#\,\text{identified normal sites}}. \qquad (9)$$
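For reference, these empirical rates can be computed from per-site true and predicted state labels with a short helper such as the one below (our own convenience code; states are coded 1–4 with 2 = normal as above).

```python
import numpy as np

def cnv_metrics(true_states, pred_states, cnv_type):
    """Sensitivity, specificity, EFPR and EFNR of equation (9) for one CNV type.

    States are coded 1 = gain, 2 = normal, 3 = loss, 4 = absent; empty
    denominators yield nan.
    """
    t, p = np.asarray(true_states), np.asarray(pred_states)
    sensitivity = np.mean(p[t == cnv_type] == cnv_type)   # among true sites of this type
    specificity = np.mean(p[t == 2] == 2)                 # among true normal sites
    efpr = np.mean(t[p == cnv_type] != cnv_type)          # among sites called this type
    efnr = np.mean(t[p == 2] != 2)                        # among sites called normal
    return sensitivity, specificity, efpr, efnr
```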
Table 1 Comparison among m-HMM with and without change point adjustment, and original HMM in simulation study 1

From Table 1, the m-HMM both with and without change point adjustment has greater sensitivity and specificity in all four copy number states than the original HMM. The two m-HMM methods also have lower EFPR and EFNR for the copy number gain, loss and normal states. The original HMM has a slightly smaller EFNR for the absent state, but its EFNRs for copy number gains and losses are far greater than those of the other methods. This means that, with the mixture Poisson emission distribution, the m-HMM is less affected by errant local variation and better captures true CNV signals.

Comparing the m-HMM with and without change point adjustment, the sensitivities and specificities all increase with the adjustment. The EFPR and EFNR also decrease for the copy number gain, loss and normal states, while increasing slightly for the absent state. In general, the change point adjustment procedure gives the m-HMM better accuracy.

Table 2 compares the m-HMM and SegSeq. SegSeq classifies the copy number states into three categories: normal, copy number gain, and copy number loss; thus, SegSeq does not distinguish between the copy number loss state and the absent state. In Table 2, SegSeq did not identify any sites in the copy number gain state in this simulation study, so both the sensitivity and EFNR are 0 for that state. Moreover, SegSeq has a large EFNR and low sensitivity for the loss or absent state. Although SegSeq was slightly better at identifying normal states, this came at the cost of much poorer identification of gains, losses, and absent sites relative to the m-HMM.

Table 2 Comparison between m-HMM and SegSeq in simulation study 1

Simulation study 2

In this simulation study, we examined the m-HMM for copy number variation segments of different lengths. We simulated 30 copy number variation segments for each of the following lengths: 10, 20, 30, 50 and 100 kb. Within the 30 copy number variation segments of a given length, 10 segments have copy number gain, 10 have copy number loss, and 10 are absent in the sample but present in the reference. The simulation procedure is the same as in Simulation Study 1. An identification is considered successful if there is a non-empty intersection between an identified segment and a simulated copy number variation segment with the correct variation type. The numbers of successfully identified segments out of 10 are listed in Table 3. When the CNV segments are 30 kb or longer, the m-HMM method identifies almost all the copy number variation segments of any variation type.

Table 3 m-HMM segmentation with different CNV lengths in simulation study 2

Simulation study 3

In order to examine the m-HMM methodology with a simulated data set that captures the sequencing process, we conducted a third simulation study. First, we randomly introduced duplications (3 or 4 copies) and deletions (either homozygous 0 copies or heterozygous 1 copy) of different sizes (1 kb, 5 kb, 10 kb) into the maize reference genome version 2 (chromosome 6, from ftp://ftp.ensemblgenomes.org/pub/plants/release-10/fasta/zea_mays/dna/) to produce a new genome with CNVs. Five segments were simulated for each of the 12 combinations (4 copy number classes × 3 sizes). Consequently, we simulated 10 segments with copy number gains and 10 segments with copy number losses or absence for each of the 3 CNV segment sizes. After that, we simulated randomly sampled paired-end Illumina reads from the new genome sequence, with read length 100 bp for both ends, clone length averaging 500 bp with standard deviation 50 bp, coverage depth of 30X and a sequencing error rate of 0.002. The simulated sequence reads were then mapped back to the reference genome using the same parameters as before to produce our test data set with known CNV copy numbers and positions.

We compared the m-HMM with SegSeq and with CNVnator by Abyzov et al. [25]. The simulation results are listed in Table 4 and Table 5. Table 4 lists the number of correctly detected segments for the copy number gain, loss and absent states in the first two rows, and the number of incorrectly detected segments for the true normal state in the last row. Because neither SegSeq nor CNVnator distinguishes between copy number losses and absence, the table summarizes the two deletion states in one row. There are 10 segments of copy number gain and 10 of loss/absence for each of the three sizes. When the CNV lengths are at least 5 kb, the m-HMM identifies all the CNV segments. CNVnator detected all the segments in the loss or absent state and almost all segments with copy number gains. SegSeq detected all the segments with copy number gains, but only about half of the loss/absent segments. When the segment length is 1 kb, the m-HMM identifies all but one segment, while the detection accuracies of SegSeq and CNVnator are much lower. The last row of Table 4 lists the number of true normal segments that are incorrectly detected by each method; smaller numbers in this row represent better performance. Our proposed m-HMM has 3 incorrectly detected segments, fewer than the other two methods. CNVnator has a large number of false positive detections when the true copy number state is normal.

Table 4 Counts of correctly detected CNV segments and false positively detected normal segments, comparing between m-HMM, SegSeq and CNVnator in simulation study 3
Table 5 Overlap rate of true CNV segments and detected CNV segments in simulation study 3, comparing between m-HMM with and without change point adjustment

Table 5 lists the overlap rates between true CNV segments and CNV segments detected by the m-HMM with and without change point adjustment. The overlap rate is calculated as the overlapping genomic length divided by the true CNV genomic length. Among all correctly detected CNV segments, the overlap rates are quite high (over 70%), and the change point adjustment makes the overlap rates even higher (over 85%).
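The overlap rate used here is simply the length of the intersection between a detected segment and the true segment divided by the length of the true segment, e.g.:

```python
def overlap_rate(true_seg, detected_seg):
    """Fraction of a true CNV segment covered by a detected segment.

    Segments are (start, end) genomic coordinates with start < end.
    """
    t_start, t_end = true_seg
    d_start, d_end = detected_seg
    overlap = max(0, min(t_end, d_end) - max(t_start, d_start))
    return overlap / (t_end - t_start)

# Example: a 10 kb true CNV detected from 2 kb into it to 1 kb past its end.
print(overlap_rate((100_000, 110_000), (102_000, 111_000)))   # 0.8
```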

Application

We applied the proposed m-HMM to compare sequence data from two maize genotypes, B73 and Mo17. The previously sequenced B73 genome [3] was used as the reference genome with the Mo17 genome as the target. The original data were provided by Gore et al. [26] and pre-processed by the Schnable lab. In the data preparation stage, 44-base sequence reads from each of the B73 and Mo17 genomes were aligned to the reference B73 genome. We obtained 4.3 million aligned reads from B73 and 1.54 million aligned reads from Mo17, and 2.3 million genomic positions on the reference B73 genome had positive read counts. Using the m-HMM, we found 1096 segments of 2,000 bases or longer with copy number variations of at least 2-fold increase or decrease; among these, relative to B73, Mo17 has 14 segments in the copy number gain state, 835 segments in the copy number loss state and 247 segments in the absent state.

We present the window-based m-HMM results for chromosomes 1, 3, 6 and 10 in Figures 2, 3, 4 and 5. In each figure, the horizontal axis represents the genomic location on the reference B73 genome (in million bases, Mb), and the vertical axis represents the log of the normalized count ratio between Mo17 and B73, i.e., $\log\left(\frac{\text{Mo17}}{c_0 \cdot \text{B73}}\right)$, with the log ratio set to -7 when the Mo17 count is zero. The blue, teal, green and red colors represent the copy number gain, normal, copy number loss and absent states, respectively.

Figure 2: Log ratio between the counts of B73 and Mo17 on chromosome 1. Each point represents a window.

Figure 3: Log ratio between the counts of B73 and Mo17 on chromosome 3. Each point represents a window.

Figure 4: Log ratio between the counts of B73 and Mo17 on chromosome 6. Each point represents a window.

Figure 5: Log ratio between the counts of B73 and Mo17 on chromosome 10. Each point represents a window.

There are several large segments with few copy number variations between Mo17 and B73: 121.3 Mb to 130 Mb on chromosome 1 (Figure 2), 69.2 Mb to 82.0 Mb and 84.8 Mb to 95.9 Mb on chromosome 3 (Figure 3), and 49.5 Mb to 61.5 Mb on chromosome 10 (Figure 5). These results are concordant with previous results using aCGH data [6, 7].

The most extreme copy number loss or absent region detected in Mo17 is on chromosome 6 (Figure 4), located from 42.2 million to 46.2 million bases on the reference genome. Within this segment, the reference genome B73 has 8152 total read counts, while the target genome Mo17 has 542 total read counts. The copy number ratio is about 0.066, or 0.13 after accounting for the normalization factor $c_0$. Table 6 shows the m-HMM segmentation result between 35.3 million and 57.0 million bases. Two other long segments in the copy number loss or absent state on chromosome 6 extend from 26.5 million to 28.8 million bases and from 47.8 million to 49.7 million bases. These identifications are also concordant with results from other studies that compared the B73 and Mo17 sequences using aCGH data [6, 7].

Table 6 The m-HMM segmentation result for chromosome 6 from 35.3 Mb to 57.0 Mb

Table 7 lists the detected CNV segments that are longer than 2 Mb and show greater than 2-fold normalized changes between B73 and Mo17. Plots for all the maize chromosomes are provided in the Additional file 2: Appendix 2.

Table 7 Detected CNV segments longer than 2 Mb that present greater than 2-fold normalized read count differences between B73 and Mo17

Several challenging aspects of this application should be noted when interpreting our results. First, there is a high level of divergence between the target genome Mo17 and the reference genome B73. Because of this divergence, it is difficult to map the target reads to the reference genome in some regions, which might partly explain why many of the identified CNVs are classified as copy number loss or absent in Mo17 relative to B73. Moreover, the short read length of 44 bp in this data set makes it even more difficult to correctly map the target reads to the reference genome. Second, the average coverage depth is very low (less than 1X), which makes it difficult to identify small CNV segments. Although higher sequencing depth provides better CNV detection accuracy, Sims et al. [27] pointed out that reasonable detection can still be obtained for large CNVs when the coverage is at least 0.1X.

With rapidly developing sequencing technologies, higher coverage depths and longer reads are available in many of today's sequencing data sets. Reads of 100 bp are standard, and 150 to 300 bp Illumina reads are common. Coverage depths of 30X and higher are also becoming mainstream. With longer reads, more precise mapping and higher coverage depth, the m-HMM is expected to provide CNV detection with even higher accuracy.

It is also important to note that DNA sequencing technologies are potentially affected by biases arising from several biophysical and chemical features [28]. For example, favorable levels of GC-content can inflate coverage in parts of the genome and thereby affect CNV detection. Many recent methods have addressed GC-content bias in CNV detection; Risso et al. [29] and Janevski et al. [28] reviewed GC-content normalization tools and methods currently in use. Our method is based on direct comparison of two genomes within genomic windows. As long as the within-window GC content remains similar across genomes, variation in GC content across the genome should not bias our m-HMM results.

Conclusions

In this paper, we proposed a new methodology to detect CNVs between the DNA sequences of two genotypes using next generation sequencing data. We used a hidden Markov model with a mixture emission probability model to identify copy number variation change points. The simulation studies suggest that the m-HMM has better sensitivity and specificity in CNV identification compared to other methodologies.

The proposed m-HMM was applied to compare the two maize inbred lines B73 and Mo17 and identified CNV change points in the target sequence Mo17 relative to the reference sequence B73. The results of the m-HMM are concordant with previous genomic studies using aCGH data by Springer et al. [7] and Belo et al. [6].

In addition, the m-HMM can be used to compare two genotypes when only one sample of target reads and only one sample of reference reads are available. Many other existing methods require multiple read samples for both target and reference genotypes. Thus, the m-HMM approach we have proposed may be especially useful when financial constraints limit data collection.

Availability of the software and supporting data set

The m-HMM software and an example data set are available at https://www.stt.msu.edu/users/hengwang/mHMM.html. The maize sequencing data set supporting the results of this article can be downloaded from ftp://ftp.ensemblgenomes.org/pub/plants/release-10/fasta/zea_mays/dna/.

References

  1. Banerjee S, Oldridge D, Poptsova M, Hussain WM, Chakravarty D, Demichelis F: A computational framework discovers new copy number variants with functional importance. PLoS ONE. 2011, 6 (3).

  2. Gokcumen O, Lee C: Copy number variants (CNVs) in primate species using array-based comparative genomic hybridization. Methods. 2009, 49: 18-25. doi:10.1016/j.ymeth.2009.06.001.

  3. Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, Liang C, Zhang J, Fulton L, Graves T, Minx P, Reily A, Courtney L, Kruchowski S, Tomlinson C, Strong C, Delehaunty K, Fronick C, Courtney B, Rock S, Belter E, Du F, Kim K, Abbott R, Cotton M, Levy A, Marchetto P, Ochoa K, Jackson S, Gillam B, et al: The B73 maize genome: complexity, diversity, and dynamics. Science. 2009, 326: 1112-1115. doi:10.1126/science.1178534.

  4. Buckler ES, Gaut BS, McMullen MD: Molecular and functional diversity of maize. Curr Opin Plant Biol. 2006, 9: 172-176. doi:10.1016/j.pbi.2006.01.013.

  5. Swanson-Wagner RA, Eichten SR, Kumari S, Tiffin P, Stein JC, Ware D, Nathan M, Springer NM: Pervasive gene content variation and copy number variation in maize and its undomesticated progenitor. Genome Res. 2010, 20: 1689-1699. doi:10.1101/gr.109165.110.

  6. Belo A, Beatty MK, Hondred D, Fengler KA, Li B, Rafalski A: Allelic genome structural variation in maize detected by array comparative genome hybridization. Theor Appl Genet. 2010, 120: 355-367. doi:10.1007/s00122-009-1128-9.

  7. Springer NM, Ying K, Fu Y, Ji T, Yeh CT, Jia Y, Wu W, Richmond T, Kitzman J, Rosenbaum H, Iniguez AL, Barbazuk WB, Jeddeloh JA, Nettleton D, Schnable PS: Maize inbreds exhibit high levels of copy number variation (CNV) and presence/absence variation (PAV) in genome content. PLoS Genet. 2009, 5 (11).

  8. Solinas-Toldo S, Lampel S, Stilgenbauer S, Nickolenko J, Benner A, Dohner H, Cremer T, Lichter P: Matrix-based comparative genomic hybridization: biochips to screen for genomic imbalances. Gene Chromosome Canc. 1997, 20 (4): 399-407. doi:10.1002/(SICI)1098-2264(199712)20:4<399::AID-GCC12>3.0.CO;2-I.

  9. Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nat Rev Genet. 2006, 7: 85-97.

  10. Wineinger NE, Kennedy RE, Erickson SW, Wojczynski MK, Bruder CE, Tiwari HK: Statistical issues in the analysis of DNA copy number variations. Int J Comput Biol Drug Des. 2008, 1: 368-395. doi:10.1504/IJCBDD.2008.022208.

  11. Medvedev P, Stanciu M, Brudno M: Computational methods for discovering structural variation with next-generation sequencing. Nat Methods. 2009, 6: 13-20. doi:10.1038/nmeth.1374.

  12. Hurd PJ, Nelson CJ: Advantages of next-generation sequencing versus the microarray in epigenetic research. Brief Funct Genomics. 2009, 8 (3): 174-183. doi:10.1093/bfgp/elp013.

  13. Su Z, Li Z, Chen T, Li QZ, Fang H, Ding D, Ge W, Ning B, Hong H, Perkins RG, Tong W, Shi L: Comparing next-generation sequencing and microarray technologies in a toxicological study of the effects of aristolochic acid on rat kidneys. Chem Res Toxicol. 2011, 24 (9): 1486-1493. doi:10.1021/tx200103b.

  14. Chiang DY, Getz G, Jaffe DB, O'Kelly MJ, Zhao X, Carter SL, Russ C, Nusbaum C, Meyerson M, Lander ES: High-resolution mapping of copy-number alterations with massively parallel sequencing. Nat Methods. 2009, 6: 677-681. doi:10.1038/nmeth.1363.

  15. Yoon S, Xuan Z, Makarov V, Ye K, Sebat J: Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res. 2007, 19: 1586-1592.

  16. Kim TM, Luquette LJ, Xi R, Park PJ: rSW-seq: algorithm for detection of copy number alterations in deep sequencing data. BMC Bioinformatics. 2010, 11: 432. doi:10.1186/1471-2105-11-432.

  17. Magi A, Benelli M, Yoon S, Roviello F, Torricelli F: Detecting common copy number variants in high-throughput sequencing data by using JointSLM algorithm. Nucleic Acids Res. 2011, 39: e65. doi:10.1093/nar/gkr068.

  18. Ivakhno S, Royce T, Cox AJ, Evers DJ, Cheetham RK, Tavare S: CNAseg - a novel framework for identification of copy number changes in cancer from second-generation sequencing data. Bioinformatics. 2010, 26: 3051-3058. doi:10.1093/bioinformatics/btq587.

  19. Xie C, Tammi MT: CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinformatics. 2009, 10: 80. doi:10.1186/1471-2105-10-80.

  20. Baum LE, Petrie T: Statistical inference for probabilistic functions of finite state Markov chains. Ann Math Stat. 1966, 37 (6): 1554-1563. doi:10.1214/aoms/1177699147.

  21. Baum LE, Eagon JA: An inequality with applications to statistical estimation for probabilistic functions of a Markov process and to a model for ecology. Bull Am Math Soc. 1967, 73 (3): 360-363. doi:10.1090/S0002-9904-1967-11751-8.

  22. Baum LE, Petrie T, Soules G, Weiss N: A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann Math Stat. 1970, 41 (1): 164-171. doi:10.1214/aoms/1177697196.

  23. Baum LE: An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities III: Proceedings of the Third Symposium on Inequalities. Edited by: Shisha O. 1972, New York: Academic Press, 1-8.

  24. Marioni JC, Thorne NP, Tavare S: BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics. 2006, 22: 1144-1146. doi:10.1093/bioinformatics/btl089.

  25. Abyzov A, Urban AE, Snyder M, Gerstein M: CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 2011, 21 (6): 974-984. doi:10.1101/gr.114876.110.

  26. Gore MA, Chia JM, Elshire RJ, Sun Q, Ersoz ES, Hurwitz BL, Peiffer JA, McMullen MD, Grills GS, Ross-Ibarra J, Ware DH, Buckler ES: A first-generation haplotype map of maize. Science. 2009, 326 (5956): 1115-1117. doi:10.1126/science.1177837.

  27. Sims D, Sudbery I, Ilott N, Heger A, Ponting C: Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet. 2014, 15: 121-132. doi:10.1038/nrg3642.

  28. Janevski A, Varadan V, Kamalakaran S, Banerjee N, Dimitrova N: Effective normalization for copy number variation detection from whole genome sequencing. BMC Genomics. 2012, 13 (Suppl 6): S16. doi:10.1186/1471-2164-13-S6-S16.

  29. Risso D, Schwartz K, Sherlock G, Dudoit S: GC-content normalization for RNA-seq data. BMC Bioinformatics. 2011, 12: 480. doi:10.1186/1471-2105-12-480.

  30. Rabiner LR, Juang BH: An introduction to hidden Markov models. IEEE ASSP Mag. 1986, 3 (1): 4-16.


Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. 1027527.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Heng Wang.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

HW designed the model and the algorithm, carried out the simulation studies and data application, and drafted the manuscript. DN supervised the study, participated in the model design, and helped draft the manuscript. KY provided the data pre-processing and the model testing. KY also simulated the data set for the third simulation study. All authors read and approved the final manuscript.

Electronic supplementary material


Additional file 1: Appendix 1 – More Details in the EM Algorithm [30]. (PDF 172 KB)

Additional file 2: Appendix 2 – Maize CNV Detection Results.(PDF 291 KB)


Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License(http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.


About this article


Cite this article

Wang, H., Nettleton, D. & Ying, K. Copy number variation detection using next generation sequencing read counts. BMC Bioinformatics 15, 109 (2014). https://doi.org/10.1186/1471-2105-15-109
