Maximum-parsimony haplotype frequencies inference based on a joint constrained sparse representation of pooled DNA
1 Translational and Molecular Imaging Institute, Icahn School of Medicine at Mount Sinai, New York NY 10029, USA
2 Electrical Engineering Department, Columbia University, New York NY 10027, USA
3 Center for Computational Biology and Bioinformatics, Columbia University, New York, NY 10027, USA
BMC Bioinformatics 2013, 14:270 doi:10.1186/1471-2105-14-270
Published: 8 September 2013
DNA pooling is a cost-effective alternative in genome-wide association studies. In DNA pooling, equimolar amounts of DNA from different individuals are mixed into one sample, and the frequency of each allele at each position is observed in a single genotyping experiment. Beyond single-locus analysis, identifying haplotype frequencies from pooled data is of independent interest within these studies, as haplotypes could increase statistical power and provide additional insight.
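To make the pooled measurement concrete, the following minimal Python sketch computes per-position allele frequencies for a pool, assuming biallelic SNPs coded 0/1 and equal contribution from every haplotype. The function name `pooled_allele_frequencies` and the two toy pools are illustrative assumptions, not data from the study. The example also shows that different haplotype sets can yield identical pooled frequencies, which is why recovering haplotypes from pools requires an additional criterion such as parsimony.

```python
from fractions import Fraction


def pooled_allele_frequencies(haplotypes):
    """Frequency of allele 1 at each SNP position in a pool.

    `haplotypes` is a list of equal-length 0/1 tuples, one per chromosome
    in the pool (two per diploid individual). Equal contribution of every
    chromosome is assumed (hypothetical, noise-free pooling).
    """
    n = len(haplotypes)
    num_snps = len(haplotypes[0])
    return [Fraction(sum(h[j] for h in haplotypes), n) for j in range(num_snps)]


# Two toy pools (illustrative only) built from different haplotypes.
pool_a = [(0, 0), (1, 1)]
pool_b = [(0, 1), (1, 0)]

# Both pools give the same per-position frequencies, so the pooled
# measurement alone cannot distinguish them.
print(pooled_allele_frequencies(pool_a))
print(pooled_allele_frequencies(pool_b))
```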
We developed a method for maximum-parsimony haplotype frequency estimation from pooled DNA data based on the sparse representation of the DNA pools in a dictionary of haplotypes. Extensions to scenarios where data are noisy or missing are also presented. The method is first applied to simulated data based on the haplotypes of the AGT gene and their associated frequencies. We further evaluate our methodology on datasets consisting of SNPs from the first 7 Mb of the HapMap CEU population. Noise and missing data were then introduced into these datasets in order to test the extensions of the proposed method. HIPPO and HAPLOPOOL were applied to the same datasets to compare performance.
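The idea of a joint sparse representation in a haplotype dictionary can be illustrated with a small brute-force sketch. It is not the algorithm of the paper: it assumes noise-free data, a dictionary containing every possible 0/1 haplotype, and pools small enough for exhaustive search. Each pool is given as integer counts of allele 1 per SNP, and the search returns the smallest set of haplotypes that reproduces all pools simultaneously. All function names and the toy pools are hypothetical.

```python
from itertools import combinations, product


def nonneg_compositions(total, parts):
    """Yield tuples of `parts` non-negative integers that sum to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in nonneg_compositions(total - first, parts - 1):
            yield (first,) + rest


def explain_pool(support, allele_counts, pool_size):
    """Return haplotype counts on `support` that reproduce one pool, or None."""
    for counts in nonneg_compositions(pool_size, len(support)):
        pooled = tuple(
            sum(c * h[j] for c, h in zip(counts, support))
            for j in range(len(allele_counts))
        )
        if pooled == tuple(allele_counts):
            return counts
    return None


def most_parsimonious(pools, pool_size):
    """Smallest haplotype set explaining every pool (exhaustive, noise-free).

    `pools` holds one tuple of allele-1 counts per pool; `pool_size` is the
    number of chromosomes per pool (twice the number of individuals).
    """
    dictionary = list(product((0, 1), repeat=len(pools[0])))
    for k in range(1, len(dictionary) + 1):  # try smaller supports first
        for support in combinations(dictionary, k):
            fits = [explain_pool(support, pool, pool_size) for pool in pools]
            if all(f is not None for f in fits):
                return support, fits
    return None


# Toy input (illustrative): two pools of two individuals, three SNPs.
support, fits = most_parsimonious([(1, 1, 3), (2, 2, 2)], pool_size=4)
print(support)  # haplotypes in the joint support
print(fits)     # per-pool haplotype counts; divide by pool_size for frequencies
```

For these toy pools the search returns the haplotypes (0, 0, 1) and (1, 1, 0), with counts (3, 1) in the first pool and (2, 2) in the second. Exhaustive search scales exponentially in the number of SNPs, so it serves only to illustrate the parsimony criterion on very small inputs.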
We evaluate our methodology in scenarios where pooling is more efficient than individual genotyping, that is, in datasets whose pools contain a small number of individuals. We show that in such scenarios our methodology outperforms state-of-the-art methods such as HIPPO and HAPLOPOOL.