Department of Oncology, Johns Hopkins University, Baltimore, MD, USA

Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA

Mathematical Institute, Heinrich-Heine-University Düsseldorf, 40225 Düsseldorf, Germany

Department of Medicine, Johns Hopkins University, Baltimore, MD, USA

Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA

Abstract

Background

In studies of case-parent trios, we define copy number variants (CNVs) in the offspring that differ from the parental copy numbers as de novo and of interest for their potential functional role in disease. Among the leading array-based methods for discovery of de novo CNVs in case-parent trios is the joint hidden Markov model (HMM) implemented in the PennCNV software. However, the computational demands of the joint HMM are substantial and the extent to which false positive identifications occur in case-parent trios has not been well described. We evaluate these issues in a study of oral cleft case-parent trios.

Results

Our analysis of the oral cleft trios reveals that genomic waves represent a substantial source of false positive identifications in the joint HMM, despite a wave-correction implementation in PennCNV. In addition, the noise of low-level summaries of relative copy number (log R ratios) is strongly associated with batch and correlated with the frequency of de novo CNV calls. Exploiting the trio design, we propose a univariate statistic for relative copy number referred to as the minimum distance.

Conclusions

Our results indicate that batch effects and genomic waves are important considerations for case-parent studies of de novo CNV, and that the minimum distance is an effective statistic for reducing technical variation contributing to false de novo discoveries. Coupled with segmentation and maximum a posteriori estimation, our algorithm compares favorably to the joint HMM with MinimumDistance being much faster.

Background

High-throughput arrays such as array comparative genomic hybridization (aCGH) and single nucleotide polymorphism (SNP) arrays provide high resolution maps of deletions and duplications. Such maps have been used to characterize the extent of CNVs in normal populations such as HapMap

Among the predominant algorithms for array-based CNV discovery are segmentation algorithms that segment the genome into regions of constant copy number

Statistical methods for the detection of de novo CNVs in case-parent trios have evolved from two-stage models to joint models. For the former, an HMM or segmentation method is fit independently to each sample of a trio and post hoc classification is obtained by identifying non-overlapping CNV in the offspring

In this paper, we apply a wave correction procedure

Results and discussion

Motivation

The main objective of our research is the delineation of copy number alterations present in the offspring that differ from parental copy numbers (defined as de novo), with an emphasis on false positive identifications and computational speed. We evaluate these issues on a case-parent study of 2,082 oral cleft trios.

We applied the joint HMM implemented in PennCNV with wave correction to the oral cleft trios. The analysis required an average of 130 minutes for a single trio and approximately 2.5 weeks for the oral cleft study when computation was distributed across 10 high performance nodes. Among 1,741 trios passing quality control (see Methods), the median number of de novo calls was 3 with an interquartile range of 2 to 5. To assess batch differences in the de novo call frequencies, we use the chemistry plate on which the samples were processed as a surrogate. We observed statistically significant differences by batch for the median absolute deviation (MAD) log R ratio (analysis of variance F-statistic with 76 and 4726 degrees of freedom was 25.07). While quality control removed trios for which the MAD and corresponding call frequencies were extreme, the mean MAD for each batch was positively correlated with the mean frequency of de novo deletion calls (Spearman correlation coefficient 0.54).
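The batch summary described above can be sketched in a few lines. This is a minimal illustration, assuming each sample has a vector of log R ratios and a chemistry plate label; the function names `mad` and `mean_mad_by_plate` are hypothetical and are not part of PennCNV or MinimumDistance.

```python
import statistics
from collections import defaultdict

def mad(values):
    """Median absolute deviation of one sample's log R ratios (unscaled)."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])

def mean_mad_by_plate(sample_mads, sample_plates):
    """Average the per-sample MAD within each chemistry plate (the batch surrogate)."""
    by_plate = defaultdict(list)
    for m, plate in zip(sample_mads, sample_plates):
        by_plate[plate].append(m)
    return {plate: statistics.mean(ms) for plate, ms in by_plate.items()}

# Hypothetical log R ratios for one sample; the value 0.40 is an outlier.
lrr = [0.02, -0.05, 0.10, -0.01, 0.40, 0.03]
sample_mad = mad(lrr)
```

The plate-level means returned by `mean_mad_by_plate` could then be compared with plate-level de novo call frequencies, for example with a rank correlation.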

To identify data characteristics contributing to unusually high de novo deletion call frequencies, we plotted the log R ratios and B allele frequencies against their genomic physical position. In many trios with high de novo deletion frequencies, we observed smooth genomic waves with inferred breakpoints alternating between diploid and deletion states coinciding with regions of homozygosity. For example, a trough of approximately 5 Mb on chromosome 8 spans a 600 kb de novo deletion as well as several transmitted CNV called by PennCNV (Figure

**Genomic waves appearing in parents and offspring induce false positive de novo deletions.** Panels in **(a)** and **(b)** plot log R ratios and B allele frequencies, respectively, for father (top), mother (middle), and offspring (bottom). Overlaying the log R ratio plots is a lowess curve with a span of 1/10. The calls from the joint HMM are indicated in the offspring panel of Figures (a) and (b). The state codes 232, 322, and 332 correspond to a paternally inherited deletion, a maternally inherited deletion, and de novo deletion, respectively. **(c)** The minimum distance calculated from the log R ratios in (a). Overlaying the minimum distance is a lowess curve with the same span as in (a) and (b). The blue rectangle demarcates the de novo hemizygous deletion called by PennCNV.

Algorithm

Definition of the minimum distance

Consider the difference in log R ratio ($r$) between the offspring ($O$) and the father ($F$) at a single marker, $r_O - r_F$. We denote the paternal distance by $d_F$. A comparable calculation for offspring and mother provides a measure of the maternal distance, denoted by $d_M$. We define the minimum distance between parents and offspring as

$$
d = \begin{cases} d_F & \text{if } |d_F| \le |d_M| \\ d_M & \text{otherwise.} \end{cases} \tag{1}
$$

The calculation is easily vectorized. For the vector of minimum distances across markers, denoted by **d**, consecutive negative or positive values in a genomic interval suggest DNA copy number loss or gain, respectively, relative to the most similar parental copy number. Although its calculation at a given marker is independent of the neighboring markers, the minimum distance can reduce technical variation from correlated probe-effects as well as the peaks and troughs of genomic waves that vary smoothly over large regions of the genome (e.g., Figure
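The marker-wise definition of the minimum distance translates directly into code. The sketch below is an illustration in plain Python rather than the authors' implementation; the function name `minimum_distance` and the example values are hypothetical, and ties are resolved in favour of the paternal distance as an assumption.

```python
def minimum_distance(offspring, father, mother):
    """Signed offspring-minus-parent difference with the smaller magnitude, per marker."""
    d = []
    for r_o, r_f, r_m in zip(offspring, father, mother):
        d_f = r_o - r_f  # paternal distance
        d_m = r_o - r_m  # maternal distance
        d.append(d_f if abs(d_f) <= abs(d_m) else d_m)
    return d

# Hypothetical log R ratios at three markers. A trough shared by the whole trio
# (first two markers) cancels, while a drop seen only in the offspring
# (third marker) is retained.
offspring = [-0.2, -0.2, -0.7]
father = [-0.2, -0.25, -0.2]
mother = [-0.15, -0.2, -0.2]
d = minimum_distance(offspring, father, mother)
```

Because each marker is handled independently, the same calculation applies unchanged to a whole chromosome arm.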

Segmentation of the minimum distance

Single-sample segmentation algorithms applied to the univariate **d** can be used to identify breakpoints of potentially de novo CNVs. We currently favor circular binary segmentation (CBS)
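To make the segmentation step concrete, the sketch below uses a deliberately simplified stand-in. It is not CBS: it applies plain recursive binary segmentation with a fixed threshold instead of the circular statistic and permutation-based significance test that CBS uses. The names `best_split` and `segment`, the noise standard deviation `sd`, and the threshold `z` are all assumptions made for illustration.

```python
import math

def best_split(x):
    """Return (index, statistic) for the single split that best separates two means."""
    n = len(x)
    total = sum(x)
    best_i, best_t = None, 0.0
    left = 0.0
    for i in range(1, n):
        left += x[i - 1]
        diff = left / i - (total - left) / (n - i)
        # Mean difference scaled by the segment lengths; dividing by the
        # noise SD (done in `segment`) gives a z-like statistic.
        t = abs(diff) * math.sqrt(i * (n - i) / n)
        if t > best_t:
            best_i, best_t = i, t
    return best_i, best_t

def segment(x, sd, z=4.0, offset=0):
    """Recursively split x; returns (start, end, mean) tuples with end exclusive."""
    i, t = best_split(x)
    if i is None or t / sd < z:
        return [(offset, offset + len(x), sum(x) / len(x))]
    return segment(x[:i], sd, z, offset) + segment(x[i:], sd, z, offset + i)

# Noise-free toy minimum distance: 10 markers at -0.5 flanked by zeros.
d = [0.0] * 20 + [-0.5] * 10 + [0.0] * 20
segments = segment(d, sd=0.1)
```

On this toy input the recursion recovers the two breakpoints at markers 20 and 30. Real minimum distance values are noisy, so the threshold would need calibration.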

**Optional post-processing of CBS segments.** Removing splits from CBS as a function of coverage and the standardized difference in segment means. **Baum-Welch updates.** Initialization and updating of parameters for the emission distributions. **Models for Mendelian transmission of the offspring copy number.** Details regarding the adaptation of the PennCNV probabilistic model of Mendelian transmission. **PennCNV annotation for trio copy number states.** Annotation of trio copy number states in PennCNV. **Empirical estimation of simulation parameters in the oral cleft study.** Estimation of simulation parameters from the oral cleft study. **R environment and software versions.**

The minimum distance can reduce artifacts that are shared by one or both parents and the offspring. In the motivating example (Figure
**d** calculated in the motivating example smooths the trough of the genomic wave (not shown), thereby avoiding local maxima in the likelihood identified by the joint HMM. The subsequent classification of the trio copy number (discussed next) for the minimum distance segment spanning the trough overwhelmingly favors a diploid trio copy number state due to the large number of heterozygous genotypes in the broader region.

As the minimum distance is a relative measure, regions with non-zero minimum distance do not necessarily indicate de novo CNVs. For example, a 300 kb region with positive **d** on chromosome 14 suggests a de novo duplication (bottom panel, Additional file

**Supplementary figures.** Supporting figures.

Maximum a posteriori estimation

We classify the copy number states of the minimum distance segments using a fully probabilistic model based in part on the joint HMM. Our approach delineates de novo events by finding the mode of the distribution of

The vector **s**

The conditional probability of the trio copy number in equation (2) can be re-expressed using Bayes’ rule as a product of the likelihood and the joint probability of the copy number states. (Hereafter, we refer to the conditional probability in equation (2) as a posterior probability.) Factoring the joint probability of the trio state as in Wang

for the first segment and

for segments $l > 1$. We use the tabled probabilities $p(s_{1,O} \mid \ldots)$ and $p(s_{l,O} \mid s_{l-1,O}, \ldots)$ for Mendelian transmission of CNVs as implemented in the joint HMM

As copy number estimates from hybridization-based arrays are noisy, our goal is to estimate the likelihood robustly.

Our approach for robust-to-outlier estimation of a sample’s log R ratio likelihood is predicated on a mixture distribution for the emitted log R ratios. Specifically, for individual k of a trio and marker i, we assume a mixture distribution for the log R ratio given by

where the normal component captures within-sample variation for copy number state $s$ and $\epsilon_{r,k}$ is the probability of observing an outlier log R ratio in sample $k$.
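The effect of the outlier component on the log R ratio likelihood can be illustrated with a small sketch. It assumes a two-component mixture of a normal density and a uniform density; the uniform range (`lo`, `hi`), the outlier probability `eps`, and the state means and standard deviation are illustrative assumptions, not values taken from the paper.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and SD."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def lrr_loglik(lrr, mean, sd, eps=0.01, lo=-5.0, hi=3.0):
    """Log likelihood of log R ratios under (1 - eps) * Normal + eps * Uniform(lo, hi)."""
    unif = 1.0 / (hi - lo)  # assumed flat outlier density over a plausible range
    return sum(
        math.log((1.0 - eps) * normal_pdf(r, mean, sd) + eps * unif)
        for r in lrr
    )

# Hypothetical segment: four well-behaved markers and one extreme outlier.
lrr = [0.02, -0.03, 0.01, -2.0, 0.00]
diploid = lrr_loglik(lrr, mean=0.0, sd=0.1)     # assumed mean for two copies
deletion = lrr_loglik(lrr, mean=-0.5, sd=0.1)   # assumed mean for one copy
```

With the uniform component, the single outlier contributes a bounded penalty to every state, so the comparison between states is driven by the remaining markers.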

With the exception of the homozygous null state, robust-to-outlier estimation of the B allele frequency likelihood for a sample is also implemented via a mixture model. In particular, for positive copy number states we assume a theoretical mixture distribution given by

where the truncated-normal components are specific to the genotypes possible under copy number state $s$ and the uniform zero-one density captures technical variation that we assume to be independent of the genotype and copy number state. As B allele frequencies are thresholded to the [0,1] interval, the proportion of outlier log R ratios, $\epsilon_{r,k}$, does not necessarily correspond to the proportion of outlier B allele frequencies given by $\epsilon_{b,k}$, motivating their separate parameterization. The mixture probability $p_{i,g}$ is estimated from a binomial density parameterized by the frequency of the A allele for genotype

The likelihood in equations (3) and (4) is multiplied by terms involving the conditional probability of the offspring copy number, the initial state probability of the parental copy numbers (if

Segmentation and maximum a posteriori estimation are performed independently for each chromosomal arm and each trio, enabling an embarrassingly parallel implementation. Computational speed is derived from the parallel architecture and the implementation of the computationally intensive maximum a posteriori estimation (121 calculations) on a set of segments that is typically several orders of magnitude smaller than the number of markers on the array.
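For a single segment, the maximum a posteriori step amounts to combining a log likelihood and a log prior for each candidate trio state and taking the mode. The sketch below assumes those two quantities have already been computed; the function name `map_state` and all numeric values are hypothetical. The three-digit labels follow the PennCNV state codes quoted in the figure legend (232, 322, 332), and reading `333` as the all-diploid state is an assumption.

```python
import math

def map_state(log_lik, log_prior):
    """Return the trio state with the highest posterior and its posterior probability."""
    scores = {s: log_lik[s] + log_prior[s] for s in log_lik}
    top = max(scores.values())
    # Log-sum-exp normalization keeps the computation stable for very negative scores.
    log_norm = top + math.log(sum(math.exp(v - top) for v in scores.values()))
    best = max(scores, key=scores.get)
    return best, math.exp(scores[best] - log_norm)

# Hypothetical values for one segment (father, mother, offspring state codes).
log_lik = {"333": -40.0, "332": -12.0, "232": -30.0, "322": -31.0}
log_prior = {
    "333": math.log(0.97),
    "332": math.log(0.0001),   # de novo events are assumed rare a priori
    "232": math.log(0.015),
    "322": math.log(0.0149),
}
state, posterior = map_state(log_lik, log_prior)
```

Since each segment, chromosomal arm, and trio is scored independently, calls like this can be distributed across workers without coordination.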

Simulation study

To assess the performance of PennCNV and MinimumDistance when the true CNV are known, we simulated chromosomes containing four de novo and four inherited copy number deletions spanning as few as 10 markers and as many as 100 markers. We additionally simulated three regions of homozygosity of 50, 100, and 500 markers in the offspring that were diploid in copy number and spanned by the trough of a simulated wave (see Methods). Log R ratios for a trio were sampled from a 3-dimensional multivariate normal distribution under 12 different parameterizations of the covariance for the trio (see Methods). B allele frequencies for the offspring were simulated to be consistent with Mendelian transmission.

We define false positives (FP) as the number of markers in normal regions called de novo and false negatives (FN) as the number of markers in de novo regions called normal. Overall, the correlation of FP for MinimumDistance and PennCNV was low. On average, the FP frequency is higher for the joint HMM than for MinimumDistance with several chromosomes having relatively high FP in PennCNV and low FP in MinimumDistance (bottom right quadrants of panels in Figure
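The marker-level definitions of FP and FN can be written out as a short sketch. It assumes two boolean vectors per simulated chromosome, one marking the markers that truly lie in de novo regions and one marking the markers called de novo by a method; the function name `fp_fn` is hypothetical, and treating every non-de-novo marker as "normal" is a simplification.

```python
def fp_fn(truth, called):
    """Count markers falsely called de novo (FP) and de novo markers missed (FN)."""
    fp = sum(1 for t, c in zip(truth, called) if c and not t)
    fn = sum(1 for t, c in zip(truth, called) if t and not c)
    return fp, fn

# Hypothetical chromosome of 12 markers with a 4-marker de novo deletion.
truth = [False] * 5 + [True] * 4 + [False] * 3
called = [False, False, True, False, False,
          True, True, True, False,
          False, False, False]
fp, fn = fp_fn(truth, called)
```

Dividing these counts by the number of normal and de novo markers, respectively, would give the corresponding rates.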

**Performance of PennCNV and MinimumDistance on simulated data.** Each point represents a synthetic chromosome of 25,000 markers for which the number of markers incorrectly called de novo **(a)** or not de novo **(b)** was tabulated for PennCNV and MinimumDistance. Log R ratios were simulated with three different levels of correlation between individuals in the trio (columns) and four different levels of variance (rows). The diagonal line in each panel is the identity. **(a)** False positive frequencies in PennCNV and MinimumDistance are uncorrelated, with more skewed frequencies in PennCNV that were thresholded at 80 to fit on the display. The mean false positive frequency in MinimumDistance is lower than in PennCNV over a range of variance and correlation settings (large circles). The gray horizontal and vertical dashed lines correspond to false positive rates of 0.001. **(b)** The number of markers falsely called not de novo is highly correlated between methods. The mean false negative frequency is comparable in PennCNV and MinimumDistance (large circles). The gray horizontal and vertical dashed lines denote false negative rates of 0.1.

To assess how incorrect calls were distributed among the different CNVs, we calculated the proportion of the 25 chromosomes for which 50 percent or more of the markers in the CNV were classified incorrectly. None of the transmitted deletions had more than 50% of the markers called de novo by either method. Diploid regions of homozygosity had elevated FP rates in PennCNV, although the difference was not statistically significant (data not shown). For de novo CNV, MinimumDistance correctly called a higher percentage of the 10-marker features than PennCNV (column 1, Additional file

Case study of oral clefts

We assessed the performance of MinimumDistance and PennCNV on a set of oral cleft trios obtained from the International Consortium of Oral Clefts and genotyped on Illumina’s 610 quad array as part of the Gene, Environment, Association Studies consortium

When assessing the concordance of de novo hemizygous deletions called by MinimumDistance and PennCNV on 1,741 oral cleft trios that passed quality control (see Methods), we found that the 50th and 75th.. the 95th and 99th corresponding to 5 and 23 de novo alterations, respectively, in PennCNV compared to 2 and 7.5 alterations in MinimumDistance. MinimumDistance called a total of 1,261 de novo deletions in 651 trios versus 3,006 de novo deletions in 824 trios called by PennCNV. Nearly 40 percent of the PennCNV de novo deletions (1,174) occur in just 12 percent (212) of the trios. The 212 trios that harbor 40 percent of the de novo deletions were processed on the 15 chemistry plates having the highest log R ratio MAD (top, Figure

**Plate-effect for de novo deletion frequencies.** The square root of de novo deletion frequencies stratified by chemistry plate for PennCNV (top) and MinimumDistance (bottom). Plates are ordered by the median MAD from high (left) to low (right). F-statistics from an analysis of variance of the square root frequencies by plate are displayed in the top right legend of each panel.

To systematically evaluate concordance of PennCNV and MinimumDistance, we created a list of the de novo deletions for each method ordered by decreasing coverage. We assessed concordance using three complementary approaches: (i) the concordance at the top (CAT) defined as the proportion of de novo deletions appearing in the top of both lists

**Concordance of PennCNV and MinimumDistance as a function of list size.** De novo hemizygous deletions identified by PennCNV and MinimumDistance were ranked by coverage. Plotted on the vertical axis is the proportion of de novo hemizygous deletions identified by both methods as a function of list size. For concordance at the top (CAT), the proportion in common is calculated for the top hits in each list (gray circles). We also plot the proportion of top hits detected by one method that were called de novo by the second method (squares and diamonds). Ranking the MinimumDistance list by the ratio of the maximum a posteriori probability to the posterior probability of diploid copy number improved the concordance (≈ 75%).
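The concordance at the top can be sketched as a function of list size. The example assumes each de novo deletion has already been reduced to an identifier shared between methods (in practice, matching calls would require comparing overlapping genomic intervals within a trio); the function name `concordance_at_top` and the identifiers are hypothetical.

```python
def concordance_at_top(list_a, list_b, k):
    """Proportion of the top-k entries that appear in the top k of both ranked lists."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_a = set(list_a[:k])
    top_b = set(list_b[:k])
    return len(top_a & top_b) / k

# Hypothetical calls ranked by decreasing coverage for each method.
method_a = ["t1", "t2", "t3", "t4"]
method_b = ["t2", "t1", "t5", "t3"]
cat_curve = [concordance_at_top(method_a, method_b, k) for k in range(1, 5)]
```

Plotting `cat_curve` against `k` gives a curve of the kind summarized in the figure legend above.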

For de novo deletions with high coverage called by only one method, many appear to be artifacts with the number of apparent false positives in PennCNV nearly double that of MinimumDistance. As in the motivating example (Figure

In terms of concordance, de novo CNVs identified by both methods appear to be more amenable to experimental validation. Nearly half of the 40 concordant de novo calls that rank high by each method in terms of coverage occur on chromosome 22 (Figure
CopyCaller™ copy number estimates from the qPCR platform

**Concordance assessment of methods (a) and platforms (b) for de novo CNVs in the DiGeorge critical region on chromosome 22.** **(a)** The physical position of de novo deletions and amplifications for 25 trios are indicated by orange and blue boxes, respectively, for MinimumDistance (left panel) and PennCNV (right panel). Boxed numbers indicate the number of Illumina markers. Blue y-axis labels indicate trios validated by qPCR. **(b)** Each point is the minimum distance of the CopyCaller™ copy number estimates (y-axis) from the qPCR platform and the minimum distance of the spanning CBS segment from the Illumina platform (x-axis) for one trio. A total of eight trios and three TaqMan probes were used in the validation experiment, generating 24 points on the scatterplot. The inter-platform concordance is high as indicated by the clusters at the bottom left and top right of the display. The four trios at the bottom left have a putative de novo CNV called by MinimumDistance and PennCNV at the 19.60 Mb locus. TaqMan probes flanking the de novo deletion (16.04 and 20.64 Mb) for these trios have minimum distance estimates that cluster near zero (top right). The top right cluster also contains four trios for whom the putative copy numbers were inferred to be diploid by MinimumDistance and PennCNV at the 19.60 Mb locus.

Conclusions

Genomic wave correction in conjunction with the joint HMM for case-parent trios is perhaps the de facto analysis for inferring de novo CNV, yet we find a number of de novo calls that appear to be artifacts of genomic waves and call rates that are correlated with batch (chemistry plate). We propose a simple, univariate measure of relative copy number that can reduce local and global sources of heterogeneity such as probe-effects and genomic waves, respectively, and can be segmented by standard, single-sample segmentation algorithms. We use the method of maximum a posteriori estimation for inferring the de novo status of segments. Key terms in the posterior probability are the likelihood, which we estimate robustly, and the probability of the offspring copy number conditional on the parental copy numbers. We compute the latter term by integrating over Mendelian and non-Mendelian models for CNV transmission, using tabled probabilities from the joint HMM directly for the Mendelian model. The MinimumDistance algorithm is several-fold faster than the joint HMM without any apparent trade-off in sensitivity or specificity as assessed by simulation. Unlike PennCNV, the frequency of de novo calls by MinimumDistance appears robust to differences in noise across batches and robust to genomic waves occurring in trios. De novo calls with high coverage that were concordant between methods include several de novo deletions and amplifications in the DiGeorge critical region on chromosome 22, four of which were subsequently validated by qPCR. As the DiGeorge critical region is known to be important for syndromic disorders that include craniofacial abnormalities, the de novo deletions from independent trios with non-syndromic oral cleft may help identify genes responsible for oral clefts. This finding, verifiable by both de novo detection algorithms, was obtained with a nearly 8-fold reduction in computational time using MinimumDistance.

Our approach for de novo CNV detection has several limitations. First, the set of candidate breakpoints identified by segmenting the minimum distance is relevant only for identifying genomic regions in which the offspring copy numbers differ from the parental copy numbers. Breakpoints for transmitted CNV are only detectable when the copy number estimates within the CNV differ in magnitude between parents and offspring. Secondly, while genomic waves are strongly correlated with GC content, differences in direction or magnitude of waves across samples are not uncommon. Previous studies suggest that differences in DNA quantity contribute to inversions of the genomic waves between samples

A potential criticism of the current study is that we have evaluated a novel method on a dataset that has not been well studied for CNVs in the literature. While HapMap has been comprehensively characterized by several platforms and statistical methods, there are limitations. First, the cell lines used in HapMap studies have a signal to noise ratio much higher than the signal to noise ratio observed in DNA isolated from experimental studies such as the oral cleft dataset. In fact, our approach was motivated by the technical variation shared among trios in the oral cleft study. Secondly, a recent study failed to identify de novo CNVs in HapMap, identifying instead somatic changes or possible problems with the cell lines

Methods

Case study samples and data

The case-parent trio study for oral clefts is part of the Gene, Environment Association Studies consortium, commonly known as GENEVA
™ (v2.0). All other statistical analyses were performed using the statistical environment

Quality control

We applied the joint HMM implemented in PennCNV with wave correction to 6,202 samples comprising 2,082 nuclear families in the oral cleft study. Using default settings for PennCNV, 560 samples were flagged for log R ratio standard deviations exceeding 0.3, B allele frequency drift greater than 0.01, or wave factor greater than 0.05

MinimumDistance

The minimum distance was computed directly from BeadStudio log R ratios. We applied CBS independently to each chromosomal arm using default values of the

Estimation of the likelihood of the resulting segments requires parameterizing the mixture distributions for the log R ratios and B allele frequencies (see equations 5 and 6, Section Results and discussion). Initial versions of MinimumDistance used theoretical means shared by all samples and estimated the log R ratio variances using an empirical Bayes approach that incorporated a term for the cross-sample variance at each marker. Disadvantages of this approach included means that were less robust to departures from the theoretical values and inflated variance estimates for copy number polymorphic regions due to the higher variability of the log R ratios across samples. These observations led us to implement the Baum-Welch algorithm to update parameters $\mu_{b,k,g}$, $\sigma_{b,k,g}$, $\epsilon_{b,k}$, $\mu_{r,k,s}$, $\sigma_{r,k,s}$, and $\epsilon_{r,k}$ from their initial values (see equations (6) and (5)). Issues of identifiability and our desire to parallelize across chromosomes for computational speed have led to several constraints for the Baum-Welch update (see Additional file

To calculate posterior probabilities, the likelihood is multiplied by the initial state probability of the parental copy numbers (if **s**

Simulation

We simulated chromosomes of 25,000 markers containing four de novo and four inherited copy number deletions that differ in the number of markers: 10, 25, 50, or 100 markers. In addition, we simulated three regions for which the offspring genotypes were homozygous with copy number two. Coverage in the three regions of homozygosity was 50, 100, and 500. Parameters of our simulation are the means and covariance of a three-dimensional multivariate normal distribution from which the log R ratios for a trio were sampled. Off-diagonal elements of the 3×3 correlation matrix of the trio were assumed to be the same with settings corresponding to independence (_{r}= 0 _{r}= 0 _{r}= 0 _{r}= 0

For de novo hemizygous deletions, the mean for the parental log R ratios is zero and the mean for the offspring log R ratios is -0.5, approximating what we observe empirically. For transmitted deletions, the log R ratios for the father and offspring were simulated from normal distributions with mean -0.5. To simulate genomic waves spanning regions of homozygosity, we changed the mean smoothly from 0.0 to -0.2 as a function of the marker index along the chromosome. The correlation parameter of the log R ratios is the same for each pair of individuals in the father-mother-offspring trio. For deletions and genomic wave features, the B allele frequencies were simulated to be consistent with Mendelian inheritance of the transmitted allele(s). Twenty-five synthetic chromosomes were simulated for each covariance matrix.
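The log R ratio part of this simulation can be sketched with the standard library alone. The example assumes a common non-negative correlation `rho` between every pair of individuals, obtained from a shared latent factor, which yields a trivariate normal with equal off-diagonal correlations. The function name `simulate_trio`, the deletion coordinates, and the parameter values are hypothetical; waves, transmitted deletions, and B allele frequencies are omitted.

```python
import math
import random

def simulate_trio(n_markers, sd, rho, deletion=(400, 450), seed=1):
    """Sample father, mother, and offspring log R ratios with one de novo deletion."""
    rng = random.Random(seed)
    father, mother, offspring = [], [], []
    for i in range(n_markers):
        shared = rng.gauss(0.0, 1.0)  # latent factor shared by the trio
        noise = [
            math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(3)
        ]
        # Offspring mean drops to -0.5 inside the de novo hemizygous deletion.
        off_mean = -0.5 if deletion[0] <= i < deletion[1] else 0.0
        father.append(sd * noise[0])
        mother.append(sd * noise[1])
        offspring.append(off_mean + sd * noise[2])
    return father, mother, offspring

father, mother, offspring = simulate_trio(n_markers=1000, sd=0.1, rho=0.5)
```

Increasing `rho` mimics technical variation shared within a trio, which is the setting in which differences between parents and offspring are most informative.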

Abbreviations

SNP: Single nucleotide polymorphism; aCGH: Array comparative genomic hybridization; HMM: Hidden Markov model; MCMC: Markov chain Monte Carlo; CNV: Copy number variant; CBS: Circular binary segmentation; FP: False positive; FN: False negative; MAD: Median absolute deviation; qPCR: Quantitative polymerase chain reaction; CAT: Concordance at the top; GENEVA: Gene-Environment Association Studies consortium.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

RBS, IR, AFS, SGY, HS, and THB conceived of the study and participated in the drafting of the manuscript. RBS performed the data analysis and wrote the software. HS carried out the analysis of the PennCNV algorithm and provided useful suggestions for the manuscript. AFS performed qPCR validation of candidate de novo regions. SGY contributed to the development of the MinimumDistance

Acknowledgements

We sincerely thank all of the families from each recruitment site for participating in this international study, and we gratefully acknowledge the assistance of clinical, field and laboratory staff whose work made this study possible. We thank Drs. J.C. Murray, M.L. Marazita, R.G. Munger, A.J. Wilcox and R.T. Lie who directed individual research projects contributing to the International Cleft Consortium, which was part of the Gene, Environment Association Studies (GENEVA) Consortium. Our group benefited greatly from the work of the entire GENEVA consortium, and especially its Coordinating Center (directed by Drs. B. Weir and C. Laurie of the University of Washington) in data cleaning and preparation for submission to the Database for Genotypes and Phenotypes (dbGaP). We acknowledge the leadership of Dr. T. Manolio of NHGRI and Dr. E.L. Harris of NIDCR. Genotyping services were provided by the Center for Inherited Disease Research (CIDR), with substantial input from Drs. K. Doheny, H. Ling and E.W. Pugh. Raw data used for these analyses are available for further research into the etiology of craniofacial malformations from dbGaP

RBS is supported by NIH grant R00HG005015. IR and SY are supported by R01GM083084 and R03DE021437. HS is supported by the DFG (Research Training Group 1032 ”Statistical Modeling”) and grant SCHW1508/3-1. The consortium for GWAS genotyping and analysis was supported by the National Institute for Dental and Craniofacial Research through U01-DE-018993; the International Consortium to Identify Genes and Interactions Controlling Oral Clefts, 2007-2009. This project was part of the Gene, Environment Association Studies Consortium (GENEVA) funded by the National Human Genome Research Institute (NHGRI) to enhance communication and collaboration among investigators conducting genome-wide studies for a variety of complex diseases. Genotyping services were provided by the Center for Inherited Disease Research, funded through a federal contract from the US National Institutes of Health to Johns Hopkins University (contract number HHSN268200782096C). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.