Open Access research article

Information compression exploits patterns of genome composition to discriminate populations and highlight regions of evolutionary interest

Nicholas J Hudson1, Laercio R Porto-Neto1, James Kijas1, Sean McWilliam1, Ryan J Taft2* and Antonio Reverter1*

Author Affiliations

1 Computational and Systems Biology, CSIRO Animal, Food and Health Sciences, St. Lucia, Brisbane, QLD 4067, Australia

2 Institute for Molecular Bioscience, The University of Queensland, St. Lucia, Brisbane, QLD 4067, Australia


BMC Bioinformatics 2014, 15:66. doi:10.1186/1471-2105-15-66

Published: 7 March 2014



Genomic information allows population relatedness to be inferred and selected genes to be identified. Single nucleotide polymorphism microarray (SNP-chip) data, a proxy for genome composition, contain patterns in allele order and proportion. These patterns can be quantified by compression efficiency (CE), a measure of how much a compression algorithm can shrink the data. In principle, the composition of an entire genome can be represented by a single CE number that quantifies allele representation and order.
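As a rough illustration of how CE can quantify such patterns, the sketch below compresses a genotype string with Python's `zlib` module, which wraps DEFLATE in a small header and checksum. The abstract does not give the CE formula, so the definition used here (one minus the ratio of compressed to raw size), the 0/1/2 genotype coding and the function name `compression_efficiency` are all assumptions made for the example.

```python
import random
import zlib


def compression_efficiency(genotypes: str, level: int = 9) -> float:
    """Return 1 - compressed_size / raw_size for a genotype string.

    Assumed definition: higher values mean the string is more compressible.
    """
    raw = genotypes.encode("ascii")
    if not raw:
        raise ValueError("genotype string must not be empty")
    compressed = zlib.compress(raw, level)  # zlib stream containing DEFLATE data
    return 1.0 - len(compressed) / len(raw)


# Hypothetical coding: 0 and 2 are the two homozygotes, 1 is the heterozygote.
rng = random.Random(0)
homozygous_run = "0" * 10_000
mixed = "".join(rng.choice("012") for _ in range(10_000))

ce_run = compression_efficiency(homozygous_run)
ce_mixed = compression_efficiency(mixed)
print(f"long homozygous run: CE = {ce_run:.3f}")
print(f"random genotypes:    CE = {ce_mixed:.3f}")
```

A long run of identical genotypes compresses far better than a random sequence of the same length, which is the sense in which allele order and proportion both feed into a single CE value.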


We applied a compression algorithm (DEFLATE) to genome-wide high-density SNP data from 4,155 human, 1,800 cattle, 1,222 sheep, 81 dog and 49 mouse samples. All human ethnic groups can be clustered by CE, and the clusters recover the phylogeography obtained from traditional fixation index (FST) analyses. CE analysis of the other mammals segregates samples by breed or species, and is sensitive to admixture and past effective population size. This clustering is a consequence of individual patterns such as runs of homozygosity. Intriguingly, a related approach can also be used to identify genomic loci that show population-specific CE segregation. A high-resolution CE ‘sliding window’ scan across the human genome, organised at the population level, revealed genes known to be under evolutionary pressure. These include SLC24A5 (European and Gujarati Indian skin pigmentation), HERC2 (European eye color), LCT (European and Maasai milk digestion) and EDAR (Asian hair thickness). We also identified a set of previously unidentified loci with high population-specific CE scores, including the chromatin remodeler SCMH1 in Africans and EDA2R in Asians. Closer inspection reveals that these prioritised genomic regions do not correspond to simple runs of homozygosity but rather to compositionally complex regions that are shared by many individuals of a given population. Unlike FST, CE analyses do not require ab initio population comparisons and are amenable to the hemizygous X chromosome.
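The sliding-window scan can be sketched in the same spirit. The abstract does not specify how windows are scored at the population level, so this toy version makes one possible choice: within each window, the genotype strings of all individuals in a population are concatenated and compressed together, so that a complex pattern shared across individuals is rewarded. The window size, step, simulated data and all function names are hypothetical.

```python
import random
import zlib


def ce(text: str) -> float:
    """Assumed CE definition: 1 - compressed_size / raw_size."""
    raw = text.encode("ascii")
    return 1.0 - len(zlib.compress(raw, 9)) / len(raw)


def population_scan(individuals: list[str], window: int, step: int) -> list[float]:
    """CE per window, compressing all individuals' window slices together."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    n_snps = min(len(g) for g in individuals)
    scores = []
    for start in range(0, n_snps - window + 1, step):
        joined = "".join(g[start:start + window] for g in individuals)
        scores.append(ce(joined))
    return scores


rng = random.Random(1)


def random_genotypes(n: int) -> str:
    return "".join(rng.choice("012") for _ in range(n))


# Simulated data: every pop_a individual carries the same complex (random)
# 200-SNP segment at positions 400-599; pop_b genotypes are independent.
shared_segment = random_genotypes(200)
pop_a = []
for _ in range(5):
    g = random_genotypes(1000)
    pop_a.append(g[:400] + shared_segment + g[600:])
pop_b = [random_genotypes(1000) for _ in range(5)]

scan_a = population_scan(pop_a, window=200, step=100)
scan_b = population_scan(pop_b, window=200, step=100)
difference = [a - b for a, b in zip(scan_a, scan_b)]
peak_window = max(range(len(difference)), key=difference.__getitem__)
print("window start with largest pop_a excess:", peak_window * 100)
```

In this toy data the shared segment is not a run of homozygosity, and each individual on its own looks random. The signal appears only because the same composition recurs across individuals of one population, which is the kind of region the scan above is described as prioritising.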


We conclude with a discussion of the implications of CE for a complex systems science view of genome evolution. CE allows the evolution of individual genomes and populations to be visualised within a formal, mathematically rigorous information space. Overall, CE makes a set of biological predictions, some of which are unique and await functional validation.

Keywords: Information compression; Phylogeography; Selection signatures