Research article

The effects of contig length and depth on the estimation of SNP frequencies, and the relative abundance of SNPs in protein-coding and non-coding transcripts of tiger salamanders (Ambystoma tigrinum)

Soo Hyung Eo1,3* and J Andrew DeWoody1,2

Author Affiliations

1 Department of Forestry & Natural Resources, Purdue University, West Lafayette, IN, 47907, USA

2 Department of Biological Sciences, Purdue University, West Lafayette, IN, 47907, USA

3 Current address: Department of Zoology, University of Wisconsin, Madison, WI, 53706, USA

BMC Genomics 2012, 13:259  doi:10.1186/1471-2164-13-259

Published: 20 June 2012



Next-generation sequencing methods have contributed to rapid progress in the fields of genomics and population genetics. Using this high-throughput and cost-effective technology, a number of studies have estimated single nucleotide polymorphism (SNP) frequency by calculating the mean number of SNPs per unit sequence length (e.g., mean SNPs/kb). However, both read length and contig depth are highly variable and thus raise doubt about simple methods of SNP frequency estimation.
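As a rough illustration of the simple length-based estimate described above, the sketch below pools SNP counts over contigs and divides by total sequence length. The function name and the example contigs are hypothetical and are not taken from the study's data; note that the calculation ignores contig depth entirely.

```python
from typing import List, Tuple


def pooled_snps_per_kb(contigs: List[Tuple[int, int]]) -> float:
    """Return SNPs per kilobase pooled over all contigs.

    Each contig is a (snp_count, length_bp) pair. Depth is not used,
    which is exactly the limitation of this simple estimate.
    """
    total_snps = sum(snps for snps, _ in contigs)
    total_bp = sum(length for _, length in contigs)
    if total_bp == 0:
        raise ValueError("total contig length must be positive")
    return 1000.0 * total_snps / total_bp


# Hypothetical contigs: (SNP count, length in bp); not real data.
example_contigs = [(2, 500), (0, 1500), (4, 2000)]
rate = pooled_snps_per_kb(example_contigs)
print(f"{rate:.2f} SNPs/kb")
```

Two contigs with the same length and true polymorphism rate would be expected to yield different SNP counts if one were sequenced far more deeply, and this estimate cannot tell them apart.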


We used 454 pyrosequencing to identify 2,980 putative SNPs in the eastern tiger salamander (Ambystoma tigrinum tigrinum) transcriptome, then constructed analytical models to estimate SNP frequency. The model that considered only contig length (i.e., the method employed in most published papers) had very poor likelihood. Our most robust model considered read depth as well as contig length, and was 7.5 × 10^55 times more likely than the length-only model. Using this novel modeling approach, we estimated SNP frequency in protein-coding (mRNA) and non-coding transcripts (e.g., small RNAs). We found little difference in SNP frequency in the contigs, but we found a trend toward a higher frequency of SNPs in long contigs representing non-coding transcripts than in those representing protein-coding transcripts. These results support the hypothesis that long non-coding transcripts are less conserved than long protein-coding transcripts.
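A statement that one model is some number of times "more likely" than another is often an evidence ratio derived from an information criterion. The abstract does not say which criterion was used, so the sketch below assumes AIC purely for illustration; the function names are hypothetical. It shows how an AIC difference maps to an evidence ratio, and what difference a ratio of 7.5 × 10^55 would imply under that assumption.

```python
import math


def evidence_ratio(aic_best: float, aic_other: float) -> float:
    """Evidence ratio of the better model over the other one.

    Assumes AIC-based comparison: ratio = exp(delta_AIC / 2),
    where delta_AIC = aic_other - aic_best.
    """
    delta = aic_other - aic_best
    return math.exp(delta / 2.0)


def delta_aic_for_ratio(ratio: float) -> float:
    """Invert the relation above: the AIC gap implied by a given ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 2.0 * math.log(ratio)


# AIC gap that would correspond to the ratio reported in the abstract,
# under the (unconfirmed) assumption that the ratio is AIC-based.
implied_gap = delta_aic_for_ratio(7.5e55)
print(f"implied delta AIC: {implied_gap:.1f}")
```

Under this assumption the length-only model would trail the length-plus-depth model by roughly 257 AIC units, far beyond the gap of about 10 that is conventionally read as essentially no support for the weaker model.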


A modeling approach (i.e., constructing multiple candidate models and selecting among them) can be a powerful tool for identifying selection on specific functional sequence groups by comparing the frequency and distribution of polymorphisms.

Keywords: Contig depth; Contig length; Model selection; SNP frequency; Transcriptome; 454 sequencing