Open Access Highly Accessed Software

SHEAR: sample heterogeneity estimation and assembly by reference

Sean R Landman (1), Tae Hyun Hwang (2,3,4)*, Kevin AT Silverstein (5), Yingming Li (6), Scott M Dehm (6,7), Michael Steinbach (1) and Vipin Kumar (1)

Author Affiliations

1 Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, USA

2 Quantitative Biomedical Research Center, University of Texas Southwestern Medical Center, Dallas, TX, USA

3 Department of Clinical Sciences, University of Texas Southwestern Medical Center, Dallas, TX, USA

4 Harold C. Simmons Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA

5 Research Informatics Support Systems, Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN, USA

6 Masonic Cancer Center, University of Minnesota, Minneapolis, MN, USA

7 Department of Laboratory Medicine and Pathology, University of Minnesota, Minneapolis, MN, USA


BMC Genomics 2014, 15:84  doi:10.1186/1471-2164-15-84

Published: 29 January 2014



Personal genome assembly is a critical step when studying tumor genomes and other highly divergent sequences. The accuracy of downstream analyses, such as RNA-seq and ChIP-seq, can be greatly enhanced by using personal genomic sequences rather than standard references. Unfortunately, reads sequenced from such samples often come from a heterogeneous mix of subpopulations carrying different variants, which makes assembly extremely difficult with existing tools. To address these challenges, we developed SHEAR (Sample Heterogeneity Estimation and Assembly by Reference), a tool that predicts structural variants (SVs), accounts for heterogeneous variants by estimating their representative percentages, and generates personal genomic sequences to be used for downstream analysis.


By making use of structural variant detection algorithms, SHEAR offers improved performance in two respects: a stronger ability to handle difficult structural variant types and better computational efficiency. We compare SHEAR against the leading competing approach using a variety of simulated scenarios as well as real tumor cell line data with known heterogeneous variants. SHEAR successfully estimates heterogeneity percentages in both cases, and demonstrates improved efficiency and a better ability to handle tandem duplications.


SHEAR allows for accurate and efficient SV detection and personal genomic sequence generation. It is also able to account for heterogeneous sequencing samples, such as those from tumor tissue, by estimating the subpopulation percentage for each heterogeneous variant.

Keywords: Genomics; Next-generation sequencing; Sequence analysis; Assembly; Personal genome; Heterogeneity; Structural variation; Prostate cancer