Channing Division of Network Medicine, Brigham and Women’s Hospital, Boston MA, USA

Department of Pathology, Molecular Genetic Research Unit, Brigham and Women’s Hospital, Boston MA, USA

Division of Pulmonary and Critical Care Medicine, Brigham and Women’s Hospital, Boston MA, USA

Center for Genomic Medicine, Brigham and Women’s Hospital, Boston MA, USA

Mailman School of Public Health, New York NY, USA

Department of Human Genetics, University of Michigan, Ann Arbor MI, USA

Abstract

Background

In recent years there has been growing interest in the role of copy number variations (CNVs) in genetic diseases. Although technologies and statistical methods for detecting CNVs from array data have developed rapidly, the data quality limitations inherent to most hybridization techniques remain a challenging problem in CNV association studies.

Results

To help address these data quality issues in the context of family-based association studies, we introduce a statistical framework for intensity-based array data that takes family information into account for copy-number assignment. The method is an adaptation of traditional methods for modeling SNP genotype data that assume a Gaussian mixture model, whereby CNV calling is performed for all family members simultaneously, leveraging within-family data to reduce CNV calls that are incompatible with Mendelian inheritance while still allowing de novo CNVs. Applying this method to simulation studies and a genome-wide association study in asthma, we find that our approach significantly improves CNV call accuracy and reduces Mendelian inconsistency rates and false positive genotype calls. The results were validated using qPCR experiments.

Conclusions

In conclusion, we have demonstrated that the use of family information can improve the quality of CNV calling, which may in turn yield more powerful association tests of CNVs.

Background

Copy number variants (CNVs) are DNA segments whose copy number deviates from the expected two copies observed in diploid genomes.

Technologies have been developed for both CNV discovery and genotyping, the majority of which are array based, including comparative genomic hybridization (CGH) and SNP genotyping arrays.

One way to help overcome such data quality issues is to use a family-based design for genetic association studies. When available, family data can be incorporated to improve copy-number assignment of genotyped CNVs. In this paper we introduce a statistical framework for family-based CNV studies based on the Gaussian mixture model described in

Methods

Gaussian mixture model

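We model the distribution of the log2 ratios with a Gaussian mixture model (GMM). The observed log2 ratios $\{x_{1},\ldots,x_{n}\}$ are assumed to be generated from a mixture model with $G$ components:

$$f(x_{i}) = \sum_{k=1}^{G} \tau_{k}\, f_{k}(x_{i} \mid \theta_{k})$$

where $\tau_{k}$ is the mixing proportion of component $k$ and $f_{k}(x_{i} \mid \theta_{k})$ is a normal distribution with mean $\mu_{k}$ and variance $\sigma_{k}^{2}$:

$$f_{k}(x_{i} \mid \theta_{k}) = \frac{1}{\sqrt{2\pi\sigma_{k}^{2}}} \exp\left(-\frac{(x_{i}-\mu_{k})^{2}}{2\sigma_{k}^{2}}\right)$$

with $\theta_{k} = (\mu_{k}, \sigma_{k})$. The components $1,\ldots,G$ correspond to discrete copy numbers (0, 1, 2, ...). The parameters of the model $\{\tau_{k}, \mu_{k}, \sigma_{k}\}$ can be estimated using the E-M (Expectation-Maximization) algorithm, with the component membership of each sample encoded by the indicator $z_{ik}$:

$$z_{ik} = \begin{cases} 1 & \text{if } x_{i} \text{ belongs to component } k \\ 0 & \text{otherwise} \end{cases}$$

Then the “complete data” log likelihood becomes:

$$\ell(\theta, \tau, z \mid x) = \sum_{i=1}^{n} \sum_{k=1}^{G} z_{ik} \log\left[\tau_{k}\, f_{k}(x_{i} \mid \theta_{k})\right]$$

The E-step (Expectation): Computing the conditional probability of sample $i$ belonging to component $k$:

$$\hat{z}_{ik} = \frac{\hat{\tau}_{k}\, f_{k}(x_{i} \mid \hat{\theta}_{k})}{\sum_{j=1}^{G} \hat{\tau}_{j}\, f_{j}(x_{i} \mid \hat{\theta}_{j})}$$

The M-step (Maximization): The parameters are estimated given the conditional probability $\hat{z}_{ik}$.

$$\hat{\tau}_{k} = \frac{n_{k}}{n}, \qquad \hat{\mu}_{k} = \frac{1}{n_{k}} \sum_{i=1}^{n} \hat{z}_{ik}\, x_{i}, \qquad \hat{\sigma}_{k}^{2} = \frac{1}{n_{k}} \sum_{i=1}^{n} \hat{z}_{ik}\, (x_{i} - \hat{\mu}_{k})^{2}$$

with

$$n_{k} = \sum_{i=1}^{n} \hat{z}_{ik}$$

The E-step and M-step are iterated until convergence.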
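The E-step and M-step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper (which relies on the R package mclust): the function names, the starting values, the variance floor and the simulated two-cluster data are all hypothetical choices made for the example.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_gmm_em(xs, means, sds, weights, max_iter=200, tol=1e-8):
    """Fit a one-dimensional Gaussian mixture by E-M from the given starting values."""
    means, sds, weights = list(means), list(sds), list(weights)
    n, g = len(xs), len(means)
    prev_loglik = -math.inf
    resp = []
    for _ in range(max_iter):
        # E-step: conditional probability that each sample belongs to each component.
        resp = []
        loglik = 0.0
        for x in xs:
            dens = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in range(g)]
            total = sum(dens)
            loglik += math.log(total)
            resp.append([d / total for d in dens])
        # M-step: update proportions, means and variances from those probabilities.
        for k in range(g):
            n_k = sum(r[k] for r in resp)
            weights[k] = n_k / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / n_k
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / n_k
            sds[k] = math.sqrt(max(var, 1e-6))  # hypothetical floor against collapse
        # Stop when the observed-data log likelihood no longer changes.
        if abs(loglik - prev_loglik) < tol:
            break
        prev_loglik = loglik
    # resp holds the probabilities from the last E-step that was run.
    return means, sds, weights, resp


# Hypothetical log2 ratios: a one-copy cluster near -0.5 and a two-copy cluster near 0.
random.seed(1)
xs = [random.gauss(-0.5, 0.1) for _ in range(60)]
xs += [random.gauss(0.0, 0.1) for _ in range(240)]

fit_means, fit_sds, fit_weights, fit_resp = fit_gmm_em(
    xs, means=[-0.4, 0.1], sds=[0.2, 0.2], weights=[0.5, 0.5]
)
# Assign each sample to the component with the highest conditional probability.
calls = [max(range(2), key=lambda k: r[k]) for r in fit_resp]
print(fit_means, fit_weights)
```

The sketch has no guard against empty components or samples far from every component; a production fit would need both, which is one reason to prefer an established package.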

We use the R package mclust

Incorporating family data

To appropriately model the probabilities of specific parent-child copy-number configurations, we use the following probabilistic model from

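| Total copy number | Chromosome-specific copy number | Probability |
| --- | --- | --- |
| 0 | 0/0 | 1 |
| 1 | 0/1 | 1 |
| 2 | 1/1 (common form) | 1 − P(rare form) |
| 2 | 0/2 (rare form) | P(rare form) |
| 3 | 1/2 (common form) | 1 − P(rare form) |
| 3 | 0/3 (rare form) | P(rare form) |
| 4 | 2/2 | 0.5 |
| 4 | 1/3 | 0.5 |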

**Table S1.** Conditional probability table. The Conditional probability of total copy number of an offspring (O) given the copy number of mother (M) and father (F). The parameter


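Let $z^{f}$, $z^{m}$ and $z^{o}$ represent the copy number distribution for the father, mother and offspring, respectively, and let $x = (x^{o}, x^{f}, x^{m})$ denote the log2 ratios of the trio. The posterior probability of the trio is

$$P(z^{o}, z^{f}, z^{m} \mid x) \propto P(z^{o} \mid x^{o})\, P(z^{f} \mid x^{f})\, P(z^{m} \mid x^{m})\, P(z^{o} \mid z^{f}, z^{m})$$

where $P(z^{o} \mid z^{f}, z^{m})$ is the inheritance probability in the CNV inheritance matrix. Therefore, in the E-M algorithm we can simply reweight the E-step for the offspring:

$$\tilde{z}^{o}_{k} \propto \hat{z}^{o}_{k} \sum_{a}\sum_{b} P(z^{o}=k \mid z^{f}=a, z^{m}=b)\, \hat{z}^{f}_{a}\, \hat{z}^{m}_{b}$$

to obtain the conditional probability distribution of the offspring. The parents’ probability distribution will not be affected in this step. When we perform the M-step the joint conditional probability of the trio $P(z^{o}, z^{f}, z^{m} \mid x)$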
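The reweighting of the offspring's E-step can be illustrated with a small sketch. It is not the authors' code and rests on several assumptions that are mine: only copy numbers 0, 1 and 2 are considered, the rare chromosome-specific forms are given probability zero (so a two-copy parent is always 1/1), each parent transmits one of its two chromosomes with equal probability, and a hypothetical `de_novo` rate is mixed in uniformly so that de novo events keep a non-zero probability.

```python
def transmit_prob(parent_cn):
    """P(transmitted chromosome carries 0 or 1 copies) for a parent with 0, 1 or 2 copies."""
    p_one = parent_cn / 2.0  # 0/0 -> 0, 0/1 -> 0.5, 1/1 -> 1
    return [1.0 - p_one, p_one]


def inheritance_matrix(de_novo=0.01):
    """table[f][m][o] = P(offspring copy number o | father f, mother m)."""
    table = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
    for f in range(3):
        for m in range(3):
            pf, pm = transmit_prob(f), transmit_prob(m)
            for a in range(2):
                for b in range(2):
                    table[f][m][a + b] += pf[a] * pm[b]
            # Mix with a uniform distribution (hypothetical de novo model).
            table[f][m] = [(1.0 - de_novo) * p + de_novo / 3.0 for p in table[f][m]]
    return table


def reweight_offspring(post_o, post_f, post_m, table):
    """Multiply the offspring's E-step probabilities by the expected inheritance probability."""
    weighted = []
    for o, p_o in enumerate(post_o):
        prior = sum(
            post_f[f] * post_m[m] * table[f][m][o]
            for f in range(3)
            for m in range(3)
        )
        weighted.append(p_o * prior)
    total = sum(weighted)
    return [w / total for w in weighted]


table = inheritance_matrix()
parent = [0.0, 0.01, 0.99]  # both parents are confidently two-copy

# An ambiguous offspring is pulled toward the Mendelian-compatible call.
ambiguous = reweight_offspring([0.0, 0.55, 0.45], parent, parent, table)

# Strong intensity evidence for one copy survives the reweighting (de novo call).
strong = reweight_offspring([0.0, 0.9999, 0.0001], parent, parent, table)
print(ambiguous, strong)
```

The two cases show the intended behavior: weak evidence for a Mendelian-incompatible call is overridden by the family data, while strong evidence is retained.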

Applied dataset

The study population has been described previously. Log2 ratios of each probe were calculated using the normalized intensities of the Cy5 (sample) and Cy3 (reference) channels. We then assessed all probes for variability using the Bioconductor package CNVTools, and eliminated probes without variability. A mean log2 ratio for each CNV region was then calculated and analyzed directly (total N after QC = 17,957 autosomal CNV regions). CNV frequency calls were based on CNVTools, with the largest bin assumed to be the 2-copy version. For validation, a small subset of regions was genotyped for copy number by real-time PCR with the Applied Biosystems TaqMan copy number assay on a 7900HT instrument.

Results

Simulation study

To assess the performance of the family-adjustment algorithm under various scenarios, we performed a simulation study. We generated intensity data based on similar scenarios in

**Sensitivity**

| MAF | Method | SNR = 3 | SNR = 4 | SNR = 5 | SNR = 6 | SNR = 7 |
| --- | --- | --- | --- | --- | --- | --- |
| 0.1 | Unadjusted | 0.9095 | 0.9240 | 0.9773 | 0.9942 | 0.9988 |
| 0.1 | Family adjusted | 0.7106 | 0.9114 | 0.9757 | 0.9937 | 0.9987 |
| 0.2 | Unadjusted | 0.9340 | 0.9777 | 0.9946 | 0.9990 | 0.9991 |
| 0.2 | Family adjusted | 0.8828 | 0.9698 | 0.9922 | 0.9984 | 0.9997 |
| 0.3 | Unadjusted | 0.9368 | 0.9796 | 0.9925 | 0.9852 | 0.9812 |
| 0.3 | Family adjusted | 0.8570 | 0.9733 | 0.9950 | 0.9990 | 0.9991 |

**Specificity**

| MAF | Method | SNR = 3 | SNR = 4 | SNR = 5 | SNR = 6 | SNR = 7 |
| --- | --- | --- | --- | --- | --- | --- |
| 0.1 | Unadjusted | 0.2867 | 0.9789 | 0.9946 | 0.9990 | 0.9999 |
| 0.1 | Family adjusted | 0.9411 | 0.9864 | 0.9961 | 0.9993 | 0.9999 |
| 0.2 | Unadjusted | 0.8975 | 0.9591 | 0.9776 | 0.8582 | 0.6295 |
| 0.2 | Family adjusted | 0.9468 | 0.9740 | 0.9917 | 0.9981 | 0.9997 |
| 0.3 | Unadjusted | 0.9353 | 0.9800 | 0.9919 | 0.9812 | 0.9760 |
| 0.3 | Family adjusted | 0.8991 | 0.9684 | 0.9927 | 0.9984 | 0.9990 |

**Overall accuracy**

| MAF | Method | SNR = 3 | SNR = 4 | SNR = 5 | SNR = 6 | SNR = 7 |
| --- | --- | --- | --- | --- | --- | --- |
| 0.1 | Unadjusted | 0.5253 | 0.9707 | 0.9838 | 0.9878 | 0.9888 |
| 0.1 | Family adjusted | 0.9274 | 0.9697 | 0.9838 | 0.9878 | 0.9888 |
| 0.2 | Unadjusted | 0.9244 | 0.9704 | 0.9620 | 0.8996 | 0.7848 |
| 0.2 | Family adjusted | 0.9152 | 0.9708 | 0.9915 | 0.9980 | 0.9997 |
| 0.3 | Unadjusted | 0.8761 | 0.9709 | 0.9895 | 0.9809 | 0.9761 |
| 0.3 | Family adjusted | 0.8282 | 0.9545 | 0.9896 | 0.9977 | 0.9987 |
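For readers who want to compute comparable summaries on their own simulated calls, the following sketch shows one way to derive the three metrics from true and called copy numbers. The definitions are assumptions on my part, since this section does not spell them out: a "positive" is any copy number other than two, sensitivity and specificity are computed on that CNV versus non-CNV distinction, and overall accuracy requires the exact copy number to match. The function name and the toy data are hypothetical.

```python
def call_metrics(true_cn, called_cn, normal_cn=2):
    """Sensitivity, specificity and exact-match accuracy of copy number calls."""
    tp = fn = tn = fp = correct = 0
    for truth, call in zip(true_cn, called_cn):
        if truth == call:
            correct += 1
        if truth != normal_cn:
            # True CNV carrier: detected if the call is any non-normal copy number.
            if call != normal_cn:
                tp += 1
            else:
                fn += 1
        else:
            # True two-copy sample: correct if it is also called two-copy.
            if call == normal_cn:
                tn += 1
            else:
                fp += 1
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": correct / len(true_cn),
    }


# Toy example with ten samples (hypothetical values).
true_cn = [2, 2, 2, 2, 1, 1, 0, 2, 1, 2]
called_cn = [2, 2, 1, 2, 1, 2, 0, 2, 0, 2]
metrics = call_metrics(true_cn, called_cn)
print(metrics)
```

Under these definitions a sample called with the wrong non-normal copy number (here a true 1 called as 0) counts toward sensitivity but against overall accuracy, so the three metrics need not move together.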


**Simulation: Gaussian mixture models.** Gaussian mixture model fit for one of the simulated CNV regions with MAF=0.1 and SNR=3. The Gaussian mixture components are shown in different colors and overlaid on the histogram.


**Simulation: Before and after family adjustment.** The raw intensity values and CNV calls from the same simulated CNV regions in Figure

Application on real data

For the real data application, we refitted the Gaussian mixture model to an aCGH dataset from a genome-wide CNV association study of asthma. A total of 14,234 polymorphic CNV regions (i.e., those with two or more clusters) assayed on the custom-designed array were evaluated. The GMM was applied with the same fixed parameters

We next assessed the impact of family-based adjustment on association testing. Using the genome-wide aCGH data in 385 parent-child trios, we applied the CNV-FBAT algorithm


**The p-values shift after family adjustment.** The log fold changes of p-values for association testing after family adjustment for all 14,234 regions (black) and 1,319 “high confidence” regions. CNV frequency is defined as the percentage of subjects in our population with copy number gain or loss.


**The p-values shift after family adjustment.** Boxplots of the log fold changes in p-values, by CNV frequency.


**QQ-plot for asthma association test after family adjustment.** The QQ-plots after family adjustment for 661 “high confidence” regions with CNV frequency greater than 10%.

We also assessed the utility of our method in the analysis of rare variants. We focused on 50 CNV regions overlapping or near known asthma candidate genes

| Method | Total CNV | Offspring CNV | De novo |
| --- | --- | --- | --- |
| Gaussian mixture model | 1157 | 398 | 227 |
| Family-adjusted | 749 | 205 | 73 |
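Counts of this kind can be tallied from trio calls with a short routine. The sketch below is illustrative only and uses assumed definitions that this section does not confirm: a CNV call is any copy number other than two, and an offspring CNV is labeled de novo when both parents are called two-copy. The function name and the example trios are hypothetical.

```python
def summarize_trios(trios, normal_cn=2):
    """Count CNV calls, offspring CNV calls and apparent de novo calls in trios.

    Each trio is a (father, mother, offspring) tuple of called copy numbers.
    """
    total = offspring = de_novo = 0
    for father, mother, child in trios:
        # Every non-normal call in the trio counts toward the total.
        total += sum(cn != normal_cn for cn in (father, mother, child))
        if child != normal_cn:
            offspring += 1
            # Assumed definition: de novo when neither parent carries a CNV call.
            if father == normal_cn and mother == normal_cn:
                de_novo += 1
    return {"total": total, "offspring": offspring, "de_novo": de_novo}


# Five hypothetical trios as (father, mother, offspring) copy numbers.
trios = [(2, 2, 2), (1, 2, 1), (2, 2, 1), (2, 3, 2), (2, 2, 3)]
summary = summarize_trios(trios)
print(summary)
```

Because true de novo CNVs are expected to be uncommon, a large share of apparent de novo calls is a useful warning sign of false positives in the offspring.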


**Histograms and scatter plots for 2 asthma-associated CNV regions validated with qPCR.** The histograms (panels **A/D**) are of all samples in the two asthma CNV regions. The scatter plots (panels **B/C/E/F**) are of the 46 samples with both Agilent array and qPCR measurements. The x-axis (Agilent) represents the log2 ratios from CGH arrays and the y-axis represents the copy number estimates from qPCR. The scatter plots for the unadjusted GMM (panels **B/E**) and the family-adjusted GMM (panels **C/F**) show the same data but are colored differently to indicate CNV calls (clusters).

Although the family adjustment algorithm generally reduces the number of CNV calls and false positives, it is important to know how the algorithm performs when the CNVs are real. To demonstrate this point, we performed qPCR on four CNV regions with frequency ≥5


**Histograms and scatter plots for 2 CNV regions validated with qPCR.** The histograms (panels **A/D**) are of all samples in the two CNV regions. The scatter plots (panels **B/C/E/F**) are of the 72 samples with both Agilent array and qPCR measurements. The x-axis (Agilent) represents the log2 ratios from CGH arrays and the y-axis represents the copy number estimates from qPCR. The scatter plots for the unadjusted GMM (panels **B/E**) and the family-adjusted GMM (panels **C/F**) show the same data but are colored differently to indicate CNV calls (clusters).

**Agilent CGH arrays GMM results versus qPCR results**

Rows give the qPCR copy number and columns give the copy number called by the GMM on the Agilent CGH arrays. The numbers in parentheses show the estimates after family adjustment. The qPCR estimates are rounded to the nearest integer and shifted to correspond to the CGH array estimates, which designate the cluster closest to zero as the two-copy group. The overall accuracy goes from 70% to 77% for region 1 (chr19:62166726-62167416) and from 72% to 82% for region 2 (chr4:43446373-43446839).

**Region 1 (chr19:62166726-62167416)**

| qPCR results | Calls | **2** | **3** | **4** |
| --- | --- | --- | --- | --- |
| 2 | Unadjusted | 11 | 6 | 0 |
| 2 | Family adjusted | (12) | (5) | (0) |
| 3 | Unadjusted | 0 | 26 | 11 |
| 3 | Family adjusted | (0) | (34) | (3) |
| 4 | Unadjusted | 0 | 4 | 13 |
| 4 | Family adjusted | (0) | (8) | (9) |

**Region 2 (chr4:43446373-43446839)**

| qPCR results | Calls | **0** | **1** | **2** |
| --- | --- | --- | --- | --- |
| 0 | Unadjusted | 14 | 9 | 1 |
| 0 | Family adjusted | (15) | (9) | (0) |
| 1 | Unadjusted | 0 | 26 | 10 |
| 1 | Family adjusted | (0) | (34) | (2) |
| 2 | Unadjusted | 0 | 0 | 11 |
| 2 | Family adjusted | (0) | (2) | (9) |
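The overall accuracies quoted for the two regions can be checked directly from the reported counts. The short sketch below assumes, as my reading of the table layout, that rows are qPCR copy numbers and columns are array calls, so that the diagonal holds the concordant samples; the variable and function names are hypothetical.

```python
def overall_accuracy(matrix):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Rows: qPCR copy number; columns: array GMM call (counts from the table above).
region1_unadjusted = [[11, 6, 0], [0, 26, 11], [0, 4, 13]]
region1_adjusted = [[12, 5, 0], [0, 34, 3], [0, 8, 9]]
region2_unadjusted = [[14, 9, 1], [0, 26, 10], [0, 0, 11]]
region2_adjusted = [[15, 9, 0], [0, 34, 2], [0, 2, 9]]

percentages = [
    round(100 * overall_accuracy(m))
    for m in (region1_unadjusted, region1_adjusted,
              region2_unadjusted, region2_adjusted)
]
print(percentages)  # [70, 77, 72, 82]
```

Each matrix sums to 71 samples, and the diagonal counts (50, 55, 51 and 58) reproduce the 70%, 77%, 72% and 82% figures given in the text.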

Discussion

We have introduced a formal statistical framework for calling CNVs in family-based designs using Gaussian mixture models. The method considers both the family relationships and the log2 ratios of each individual, thereby reducing the number of Mendelian inconsistencies while still allowing the detection of de novo events. Results from the analysis of CAMP CNV data show that our method improves CNV call accuracy and reduces the number of Mendelian errors and false positive CNV calls for both common and rare CNV regions, and that the results can be validated with qPCR. Though we only included parent-child trios in our study, the method can easily be extended to larger pedigrees with multiple generations. Our method works especially well for regions with moderate data quality, as opposed to extremely well-clustered or poor data. For well-clustered regions, the Gaussian mixture models give extremely high confidence (close to 100% posterior probability) for CNV calls, so re-weighting with family data will not change the results by much. On the other hand, a poorly clustered region often contains many Mendelian-incompatible trios that the algorithm cannot reconcile. Therefore, our method is most useful for the “questionable” regions, where the family data can help identify the real CNV regions.

We also examined the effects of family-based adjustment on association testing. Though it is possible to perform CNV association testing using either raw intensity data or derived copy number, we and others note that the latter is preferable in most situations

Compared to other current methods for family-based CNV studies, such as PennCNV

Conclusions

In conclusion, though our method does not completely solve the data quality issue for CNV studies, we have shown through our analysis that incorporating family data is a necessary step toward better-quality CNV calls, which we hope will lead to more powerful family-based CNV association tests.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

JC developed the main mathematical models and implemented the algorithm. Additional analyses were performed by AR and IIL. KD, RM and CL designed the CGH array and performed the assays for the CNV association study of asthma. BAR is principal investigator of the primary grant supporting this work, “Structural Genetic Variation in Asthma”, and together with JC conceptualized the algorithm. JC and BAR were responsible for manuscript preparation. All authors have read the manuscript and approved the final version.

Acknowledgements

We thank all subjects for their ongoing participation in this study. We acknowledge the CAMP investigators and research team, supported by the National Heart, Lung and Blood Institute (NHLBI) of the National Institutes of Health (NIH), for collection of CAMP Genetic Ancillary Study data. All work on data collected from the CAMP Genetic Ancillary Study was conducted at the Channing Laboratory of the Brigham and Women’s Hospital under appropriate CAMP policies and human subjects’ protections. The CAMP Genetics Ancillary Study is supported by U01 HL075419, U01 HL65899, P01 HL083069, and T32 HL07427 from the NIH/NHLBI. Investigation of the role of structural genetic variation in the pathogenesis of asthma is supported by RHL093076, “Structural Genetic Variation in Asthma”, from the NHLBI.