Identification of single nucleotide polymorphisms from the transcriptome of an organism with a whole genome duplication
1 School of Molecular Biosciences, Washington State University, Pullman WA 99164-4660, USA
2 School of Biological Sciences, Washington State University, Pullman WA 99164-4236, USA
3 School of Biological Sciences, Washington State University, Vancouver, 14204 NE Salmon Creek Ave, Vancouver WA 98686-9600, USA
4 Center for Reproductive Biology, Washington State University, Pullman WA 99164-7520, USA
BMC Bioinformatics 2013, 14:325 doi:10.1186/1471-2105-14-325
Published: 16 November 2013
The common ancestor of salmonid fishes, including rainbow trout (Oncorhynchus mykiss), experienced a whole genome duplication between 20 and 100 million years ago, and many of the duplicated genes have been retained in the trout genome. This retention complicates efforts to detect allelic variation in salmonid fishes. Specifically, single nucleotide polymorphism (SNP) detection is problematic because nucleotide variation can be found between the duplicate copies (paralogs) of a gene as well as between alleles.
We present a method for differentiating between allelic and paralogous (gene copy) sequence variants, allowing identification of SNPs in organisms with multiple copies of a gene or set of genes. The basic strategy is to: 1) identify windows of unique cDNA sequences with homology to each other, 2) compare these unique cDNAs when they are not shared between individuals (i.e. one individual is homozygous for one cDNA and the other individual is homozygous for a different cDNA), and 3) assign each candidate sequence variant a “SNP score” between zero and one based on six criteria. Using this strategy, we detected about seven thousand potential SNPs from the transcriptomes of several clonal lines of rainbow trout. By applying the method to polyploid wheat and comparing the results directly with a pre-validated set of SNPs, we also estimated the false-positive rate of this strategy to be 0 to 28%, depending on the parameters used.
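Step 2 of the strategy above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each individual's unique cDNA sequences for one homologous window are already available as pre-aligned, equal-length strings. The function name `candidate_snps` and the toy sequences are hypothetical, and the six-criteria SNP score is omitted because the criteria are not listed here. Sequences found in both individuals are set aside, and only sequences private to each individual are compared position by position.

```python
from itertools import product


def candidate_snps(seqs_a, seqs_b):
    """Compare cDNA window sequences that are NOT shared between two individuals.

    seqs_a, seqs_b: iterables of equal-length sequence strings for one
    homologous window (hypothetical input format, assumed pre-aligned).
    Returns a list of (seq_a, seq_b, mismatch_positions) tuples.
    """
    # Sequences present in both individuals are set aside; only the
    # sequences private to each individual are compared.
    private_a = sorted(set(seqs_a) - set(seqs_b))
    private_b = sorted(set(seqs_b) - set(seqs_a))

    results = []
    for a, b in product(private_a, private_b):
        if len(a) != len(b):
            continue  # assumption: windows are pre-aligned to equal length
        # 0-based positions where the two private sequences differ
        diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
        results.append((a, b, diffs))
    return results


# Toy example: both lines share one copy, and each carries a private
# version of the other copy that differs at a single position.
line_a = {"ACGTACGT", "ACGAACGT"}
line_b = {"ACGTACGT", "ACGAACCT"}
print(candidate_snps(line_a, line_b))
```

In this toy input the shared sequence is ignored, and the single mismatch between the two private sequences is reported as a candidate variant. A real pipeline would then score each candidate before accepting it as a SNP.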
This strategy has an advantage over traditional techniques of SNP identification because another dimension of sequencing information is utilized. This method is especially well suited for identifying SNPs in polyploids, both outbred and inbred, but would tend to be conservative for diploid organisms.