Discovery of common sequences absent in the human reference genome using pooled samples from next generation sequencing
1 Center for Proteomics and Bioinformatics, Case Western Reserve University, Cleveland, OH, USA
2 Department of Electrical Engineering & Computer Science, Case Western Reserve University, Cleveland, OH, USA
3 Case Comprehensive Cancer Center, Case Western Reserve University, Cleveland, OH, USA
4 Department of Genetics and Genome Science, Case Western Reserve University, Cleveland, OH, USA
5 Department of Family Medicine and Community Health, Case Western Reserve University, Cleveland, OH, USA
6 Department of General Medical Sciences, Case Western Reserve University, Cleveland, OH, USA
7 Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA
8 Department of Pharmacy, Suzhou Health College, Suzhou, Jiangsu 215009, P. R. China
9 Department of Public Health Sciences, Stritch School of Medicine, Loyola University Chicago, Maywood, IL, USA
BMC Genomics 2014, 15:685 doi:10.1186/1471-2164-15-685
Published: 16 August 2014
Sequences up to several megabases in length have been found to be present in individual genomes but absent in the human reference genome. These sequences may be common in populations, and their absence in the reference genome may indicate rare variants in the genomes of the individuals who served as donors for the Human Genome Project. Because the reference genome is used in probe design for microarray technology and in mapping short reads in next generation sequencing (NGS), these missing sequences could be a source of bias in functional genomic studies and variant analysis. One End Anchor (OEA) and/or orphan reads from paired-end sequencing have been used to identify novel sequences that are absent in the reference genome. However, no study has yet investigated the distribution, evolution and functionality of these sequences in human populations.
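As an illustration of how OEA and orphan reads can be recognized in paired-end alignments, the following minimal sketch classifies reads by their SAM flag bits and pools the unmapped ends of OEA pairs across samples. The flag bits (0x1, 0x4, 0x8) come from the SAM format; the function names `classify_read` and `pool_oea_reads`, the category labels, and the input layout are hypothetical and are not taken from the pipeline described here, whose exact selection criteria are not given in this abstract.

```python
# SAM flag bits (defined by the SAM format specification)
PAIRED = 0x1         # read is part of a pair
READ_UNMAPPED = 0x4  # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def classify_read(flag: int) -> str:
    """Classify a read by the mapping status of itself and its mate.

    The labels are illustrative assumptions, not terms from the paper's code.
    """
    if not flag & PAIRED:
        return "unpaired"
    read_unmapped = bool(flag & READ_UNMAPPED)
    mate_unmapped = bool(flag & MATE_UNMAPPED)
    if read_unmapped and mate_unmapped:
        return "orphan"        # neither end maps to the reference
    if read_unmapped:
        return "oea_unmapped"  # unmapped end of a one-end-anchored pair
    if mate_unmapped:
        return "oea_anchor"    # mapped end that anchors the pair
    return "both_mapped"


def pool_oea_reads(sam_lines_by_sample):
    """Pool the unmapped ends of OEA pairs across samples.

    sam_lines_by_sample maps a sample name to an iterable of SAM text lines.
    Returns a list of (sample, read_name, sequence) tuples.
    """
    pooled = []
    for sample, lines in sam_lines_by_sample.items():
        for line in lines:
            if line.startswith("@"):  # skip SAM header lines
                continue
            fields = line.rstrip("\n").split("\t")
            name, flag, seq = fields[0], int(fields[1]), fields[9]
            if classify_read(flag) == "oea_unmapped":
                pooled.append((sample, name, seq))
    return pooled
```

In a real analysis the pooled reads would then be assembled and filtered; those steps are outside the scope of this sketch.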
To systematically identify and study the missing common sequences (micSeqs), we extended the previous method by pooling OEA reads from a large number of individuals and applying strict filtering methods to remove false sequences. The pipeline was applied to data from phase 1 of the 1000 Genomes Project. We identified 309 micSeqs that are present in at least 1% of the human population but absent in the reference genome. We confirmed 76% of these 309 micSeqs by comparison to other primate genomes, individual human genomes, and gene expression data. Furthermore, we randomly selected fifteen micSeqs and confirmed their presence using PCR validation in 38 additional individuals. Functional analysis using published RNA-seq and ChIP-seq data showed that eleven micSeqs are highly expressed in human brain and three micSeqs contain transcription factor (TF) binding regions, suggesting that they are functional elements. In addition, the identified micSeqs are absent in non-primates and show dynamic acquisition during primate evolution, culminating with most micSeqs being present in Africans, which suggests that some micSeqs may be important sources of human diversity.
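The "present in at least 1%" criterion can be illustrated with a small sketch that keeps candidate sequences whose carrier fraction in a sample of individuals meets a threshold. This is only an assumed reading of the criterion: the function name `common_sequences`, the input layout, and the use of the sample fraction as a stand-in for population frequency are hypothetical, and the toy data below are not results from the study.

```python
def common_sequences(carriers_by_seq, n_individuals, min_fraction=0.01):
    """Return {sequence_id: carrier_fraction} for sequences at or above min_fraction.

    carriers_by_seq maps a candidate sequence ID to the individuals in which
    supporting reads were found (duplicates are ignored).
    n_individuals is the total number of individuals screened.
    """
    if n_individuals <= 0:
        raise ValueError("n_individuals must be positive")
    kept = {}
    for seq_id, carriers in carriers_by_seq.items():
        fraction = len(set(carriers)) / n_individuals
        if fraction >= min_fraction:
            kept[seq_id] = fraction
    return kept


# Toy example with made-up identifiers: 200 individuals screened.
toy_carriers = {
    "candidate_A": ["ind1", "ind2", "ind3", "ind4", "ind5"],  # 5 of 200 carriers
    "candidate_B": ["ind7"],                                  # 1 of 200 carriers
}
kept = common_sequences(toy_carriers, n_individuals=200)
```

With these toy inputs only `candidate_A` is retained, because one carrier among 200 individuals falls below the 1% threshold.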
A comparative genomics approach confirmed 76% of the micSeqs. Fourteen micSeqs are expressed in human brain or contain TF binding regions. Some micSeqs are primate-specific and conserved, and may play a role in the evolution of primates.