Open Access Highly Accessed Research article

De novo assembly of the pepper transcriptome (Capsicum annuum): a benchmark for in silico discovery of SNPs, SSRs and candidate genes

Hamid Ashrafi1*, Theresa Hill1, Kevin Stoffel1, Alexander Kozik2, JiQiang Yao13, Sebastian Reyes Chin-Wo12 and Allen Van Deynze1

Author Affiliations

1 Seed Biotechnology Center, University of California, Davis, 1 Shields Ave, Davis, CA 95616, USA

2 Genome Center, University of California, Davis, 1 Shields Ave, Davis, CA, 95616, USA

3 Present address: Interdisciplinary Center for Biotechnology Research, University of Florida, Gainesville, FL, 32610, USA

For all author emails, please log on.

BMC Genomics 2012, 13:571  doi:10.1186/1471-2164-13-571

Published: 30 October 2012

Abstract

Background

Molecular breeding of pepper (Capsicum spp.) can be accelerated by developing DNA markers associated with transcriptomes in breeding germplasm. Before the advent of next generation sequencing (NGS) technologies, the majority of sequencing data were generated by the Sanger sequencing method. By leveraging Sanger EST data, we have generated a wealth of genetic information for pepper including thousands of SNPs and Single Position Polymorphic (SPP) markers. To complement and enhance these resources, we applied NGS to three pepper genotypes: Maor, Early Jalapeño and Criollo de Morelos-334 (CM334) to identify SNPs and SSRs in the assembly of these three genotypes.

Results

Two pepper transcriptome assemblies were developed with different purposes. The first reference sequence, assembled by CAP3 software, comprises 31,196 contigs from >125,000 Sanger-EST sequences that were mainly derived from a Korean F1-hybrid line, Bukang. Overlapping probes were designed for 30,815 unigenes to construct a pepper Affymetrix GeneChip® microarray for whole genome analyses. In addition, custom Python scripts were used to identify 4,236 SNPs in contigs of the assembly. A total of 2,489 simple sequence repeats (SSRs) were identified from the assembly, and primers were designed for the SSRs. Annotation of contigs using Blast2GO software resulted in information for 60% of the unigenes in the assembly. The second transcriptome assembly was constructed from more than 200 million Illumina Genome Analyzer II reads (80–120 nt) using a combination of Velvet, CLC workbench and CAP3 software packages. BWA, SAMtools and in-house Perl scripts were used to identify SNPs among three pepper genotypes. The SNPs were filtered to be at least 50 bp from any intron-exon junctions as well as flanking SNPs. More than 22,000 high-quality putative SNPs were identified. Using the MISA software, 10,398 SSR markers were also identified within the Illumina transcriptome assembly and primers were designed for the identified markers. The assembly was annotated by Blast2GO and 14,740 (12%) of annotated contigs were associated with functional proteins.

Conclusions

Before availability of pepper genome sequence, assembling transcriptomes of this economically important crop was required to generate thousands of high-quality molecular markers that could be used in breeding programs. In order to have a better understanding of the assembled sequences and to identify candidate genes underlying QTLs, we annotated the contigs of Sanger-EST and Illumina transcriptome assemblies. These and other information have been curated in a database that we have dedicated for pepper project.

Keywords:
Pepper; Capsicum spp; Molecular Markers; EST; Transcriptome; RNAseq; Annotation; SNP; SSR; SPP