Biofuels extracted from the seeds of Camelina sativa have recently been used successfully as environmentally friendly jet-fuel to reduce greenhouse gas emissions. Camelina sativa is genetically very close to Arabidopsis thaliana, and both are members of the Brassicaceae. Although public databases are currently available for some members of the Brassicaceae, such as A. thaliana, A. lyrata, Brassica napus, B. juncea and B. rapa, there are no public Expressed Sequence Tags (EST) or genomic data for Camelina sativa. In this study, a high-throughput, large-scale RNA sequencing (RNA-seq) of the Camelina sativa transcriptome was carried out to generate a database that will be useful for further functional analyses.
Approximately 27 million clean “reads” filtered from raw reads by removal of adaptors, ambiguous reads and low-quality reads (2.42 gigabase pairs) were generated by Illumina paired-end RNA-seq technology. All of these clean reads were assembled de novo into 83,493 unigenes and 103,196 transcripts using SOAPdenovo and Trinity, respectively. The average length of the transcripts generated by Trinity was 697 bp (N50 = 976), which was longer than the average length of unigenes (319 bp, N50 = 346 bp). Nonetheless, the assembly generated by SOAPdenovo produced similar number of non-redundant hits (22,435) with that of Trinity (22,433) in BLASTN searches of the Arabidopsis thaliana CDS sequence database (TAIR). Four public databases, the Kyoto Encyclopedia of Genes and Genomes (KEGG), Swiss-prot, NCBI non-redundant protein (NR), and the Cluster of Orthologous Groups (COG), were used for unigene annotation; 67,791 of 83,493 unigenes (81.2%) were finally annotated with gene descriptions or conserved protein domains that were mapped to 25,329 non-redundant protein sequences. We mapped 27,042 of 83,493 unigenes (32.4%) to 119 KEGG metabolic pathways.
This is the first report of a transcriptome database for Camelina sativa, an environmentally important member of the Brassicaceae. We showed that C. savita is closely related to Arabidopsis spp. and more distantly related to Brassica spp. Although the majority of annotated genes had high sequence identity to those of A. thaliana, a substantial proportion of disease-resistance genes (NBS-encoding LRR genes) were instead more closely similar to the genes of other Brassicaceae; these genes included BrCN, BrCNL, BrNL, BrTN, BrTNL in B. rapa. As plant genomes are under long-term selection pressure from environmental stressors, conservation of these disease-resistance genes in C. sativa and B. rapa genomes implies that they are exposed to the threats from closely-related pathogens in their natural habitats.
Keywords:Brassicaceae; Camelina sativa; Transcriptome; de novo; Paired-end sequencing; NBS-LRR
Camelina sativa is a dicotyledonous plant in the Family Brassicaceae. It is commonly known as Camelina, “gold-of-pleasure” or false flax. It has a growth cycle similar to that of Arabidopsis. Camelina has been cultivated as a source of vegetable oil in Europe, central Asia and North America. The life cycle of C. sativa is relatively short, spanning approximately 100–120 days; thus, the species is very suitable for renewable-resource generation and as a spring or fall rotation crop . C. sativa has 3,500 years of cultivation history. Although it is an ancient crop, Camelina cultivation has decreased gradually in modern times in relation to rapeseed . The majority (80%) of fatty acids in Camelina oil are unsaturated; these are an important source of omega-3 fatty acids . In Camelina seeds, polyunsaturated fatty acids constitute more than 50% of the total, and linolenic acid (18:3n-3) makes up about 35-40% of total fatty acids . Camelina is recommended as a dietary supplement because of these benefits. In addition to its dietary use, Camelina oil has non-food applications, such as soaps, varnishes and biodiesel [5,6]. Production of this oil may solve the problem of limited feedstock availability for bio-diesel production. Camelina sativa would be useful as an alternative crop for biodiesel due to its low cost of production and high energy content. Because of the relatively high ester yield, alkyl esters from Camelina oil have been used as biodiesel . Camelina oil has also been used directly as a fuel for diesel transport engines . A further advantage of Camelina seed oil is that it produces less CO2 than traditional mineral oil products . Moreover, Camelina is more drought-resistant and frost-tolerant than rapeseed and can thus be grown on land with little fertilizer or on land that is fallow. In comparison with other oilseed plants, Camelina is particularly competitive on highly saline soils . The species is also well-known for its elevated resistance to insect pests and pathogens .
A number of plant transcriptomes have been deeply sequenced and subjected to further analysis over the last decade, particularly in model species of monocotyledons (Oryza sativa) and dicotyledons (Arabidopsis). These analyses provide valuable databases for non-model plant species [11,12]. However, there is no EST or genomic sequence currently available for Camelina sativa in the GenBank database. Transcriptomic sequence data for this low-cost oilseed plant will provide a valuable source of genomic information for practitioners of plant sciences. In the present study, we adopted Illumina paired-end sequencing to analyze the leaf transcriptome of Camelina sativa. In total, 2.42 gigabase pairs were obtained by deep sequencing; unigenes involved in most metabolic pathways were detected by our procedures. This is the first report of a transcriptome database for this oilseed plant. We provide a public dataset for genetic analysis and biological study of Camelina sativa.
Sequencing and transcriptome assembly by SOAPdenovo
Total RNA was extracted from young and mature leaves of Camelina sativa. Poly A+ mRNA was obtained by passing total RNA through a column of beads conjugated with oligo (dT); the product was then fragmented into short sequences (200-700nt). The shortened mRNA was transcribed to cDNA by reverse transcriptase before sequencing. Clean “reads” were filtered from raw reads by removal of adaptors, ambiguous reads and low-quality reads. Approximate 27 million clean reads (2.42 Gbp) with a mean length of 90 bp were obtained (GenBank accession number: SRA057100). Two assembly methods were adopted and compared (Table 1). We used SOAPdenovo to assemble all high-quality clean reads into contigs (37.24 Mbp longer than 75 bp), which were assembled into scaffolds (32.72 Mbp longer than 100 bp) that were in turn assembled de novo into unigenes (26.65 Mbp longer than 100 bp). We generated 204,190 contigs (length ≥ 75 bp) with a mean length of 182 bp and an N50 of 194 bp (Table 2). The majority of contigs (175,262) were in the range 100 – 400 bp, which accounted for 71.9% of total contigs (227,194 total reads) (Additional file 1: Figure S1a). In order to assemble scaffolds, the contigs were connected with N to represent unknown sequences between two contiguous contigs. Details of scaffold total number, mean length and the N50 value are given in Table 2. Most of the scaffolds (118,708/129,539 = 89.5%) reads had gap length ratios of < 0.01 (Additional file 1: Figure S1b) and the most frequent lengths (91.6%) were between 100 and 500 nt (Additional file 1: Figure S1c). After assembling the scaffolds de novo with SOAPdenovo software, we filled scaffold gaps with the lowest number of Ns so that each scaffold could not be extended from either side. Resulting sequences were defined as unigenes. In total, 83,493 unigenes were generated with a mean unigene length of 319 bp and an N50 of 346 bp (Table 2) (GenBank accession number: GABO00000000). There were 10,877 unigenes with ≥ 500 bp, and 1,871 unigenes with ≥ 1000 bp; the majority of unigenes (87.0%) had 100 to 500 bp (Figure S1d). The frequency distributions of unigene lengths and ratios of gap length to size of assembled unigene are depicted in Figures S1d and Figure S1e, respectively. Among all unigenes, 99.87% (83,386 unigenes) had gap lengths that were < 5% of total length (Figure S1e). The random distribution of reads in assembled unigenes is presented in Figure S1f to display sequencing bias.
Additional file 1. Supplementary Figures for this study. Figure S1 Overview of assembly by SOAPdenovo. (a) Length frequency distribution of contigs obtained from de novo assembly of high-quality clean “reads”. (b) Length frequency distribution of gap ratios (N/size) in assembled scaffolds. (c) Frequency distribution of assembled scaffold lengths. (d) Length frequency distribution of unigenes produced by contig joining, gap filling, and scaffold clustering. (e) Gap frequency distribution of assembled unigenes. x-axis values are ratios of gap length to length of assembled unigenes. y-axis values are frequencies of unigenes containing gaps. (f) Random frequency distribution of Illumina sequencing reads in assembled unigenes. x-axis values are relative positions of sequencing reads in assembled unigenes. The orientation of unigenes is from the 5’ end to the 3’ end. Figure S2 Venn Diagrams of the three categories of GO. In total, 33,475 unigenes were assigned to at least one GO category. Figure S3 Venn diagram results from diverse databases. (a) Venn diagram showing the number of unigenes matched to sequences in NR, Swiss-Prot and KEGG databases. All annotations were based on best BLASTX hits with E-Values ≤ 1.0E-5. The overlapping regions represent the number of unigenes that matched in different databases. (b) Venn diagram showing the number of unigenes in NR, Swiss-Prot, KEGG and COG databases. All annotations were based on the best BLASTX hits with E-Values ≤ 1.0E-5. Figure S4Camelina sativa transcriptome coding sequence (CDS) predicted by BLASTX and ESTScan software. (a) Number of predicted CDS with gap ratio frequency distribution (N/size). (b) Length frequency distribution of predicted CDS. (c) Length frequency distribution of predicted protein sequences. (d) Gap ratio frequency distribution of CDS predicted by ESTScan software. (e) Length frequency distribution of CDS predicted by ESTScan software. (f) Length frequency distribution of protein sequences predicted by ESTScan software.
Format: DOCX Size: 1.1MB Download file
Table 1. Comparison of SOAPdenovo and Trinity assembly results
Table 2. Summary details of sequences produced by SOAPdenovo assembly after Illumina sequencing
De novo transcriptome assembly by Trinity
To verify SOAPdenovo assembly results, we also made use of the Trinity assembly method, which assembles full-length transcripts without reference genomes . In total 103,196 transcripts with a total length of 71,935,591 bp were generated (GenBank accession number: GABL00000000), significantly exceeding the output from SOAPdenovo (26,651,285 bp for 83,493 unigenes). The mean length and N50 obtained by Trinity assembly were also longer than those obtained by SOAPdenovo (Table 1).
To further verify the assembly results from both methods, we mapped both unigenes (SOAPdenovo) and transcripts (Trinity) to Arabidopsis thaliana (TAIR10_cds_20101214) CDS sequences using BLAT (Table 3). Parameters of BLAT were set as default (minMatch=2, minScore=30, minIdentity=90, maxGap=2, tileSize=11 and stepSize=11). Of 83,493 unigenes, 52,864 were successfully mapped to CDS in TAIR; 70,753 of 103,196 transcripts were also successfully mapped. Of the total CDS number in TAIR, 63.4% (22,435/35,386) and 61.4% (22,433/35,386) were mapped with sequences assembled by SOAPdenovo and Trinity, respectively. Among the mapped CDS in TAIR, 59.6% (21,084/35,386) were generated by both methods. Thus, similar mapping results were obtained from the two assembly methods when the Arabidopsis thaliana genome was taken as a reference genome. We adopted SOAPdenovo output for further analyses.
Table 3. Assembly results with TAIR 10.0 CDS using SOAPdenovo and Trinity software
Annotation of predicted proteins in NR and Swiss-Prot databases
In total, 67,791 unigenes were annotated with 25,329 unique protein IDs. Most protein IDs (25,075 of 25,329) were annotated by the NR database. To annotate gene names and protein coding sequences among unigenes, we searched all six-frame translations of unigenes against the NR plant protein database in NCBI by running BLASTX with an E-value cut-off 1.0E-5. In total, 67,497 BLASTX hits (matches) were obtained, covering 80.8% of all unigenes (Table 4 and Additional file 2). Of these, 45,307 and 16,969 BLASTX hits were matched to Arabidopsis thaliana (TAIR 9) and A. lyrata proteins, respectively. Only 722 BLASTX hits were aligned to NR protein of Brassica spp. (Table 4). Thus, Camelina sativa was genetically more closely related to Arabidopsis than Brassica. The E-value frequency distribution of significant hits (E-value ≤ 1.0E-50) showed that 26% of the sequences shared strong homologies; the majority of matched sequences (74%) had E-values in the range 1.0E-50–1.0E-5 (Figure 1a). The translated amino acid sequences of unigenes were also closely similar to sequences from the NR database; nearly 90% of the BLASTX hits were in a similarity range between 40% and 100%. Only 9.6% of hits had similarity values less than 40% (Figure 1c). Homologies between different species are depicted in Figure 1e. Among hits, 67.1% matched to Arabidopsis thaliana, followed in sequence by A. lyrata (25.1%) and Brassica (1.1%) (Figure 1e).
Format: XLSX Size: 8.5MB Download file
Table 4. Summary of Camelina unigene annotations from SOAPdenovo assembly
Figure 1. Unigene homology searches against NR and Swiss-Prot databases. (a) E-value proportional frequency distribution of BLAST hits against the NR database. (b) E-value proportional frequency distribution of BLAST hits against the Swiss-Prot database. (c) Proportional frequency distribution of unigene similarities against the NR database based on the best BLAST hits (E-value ≤ 1.0E-5). (d) Proportional frequency distribution of unigene similarities against the Swiss-Prot database based on the best BLAST hits (E-value ≤ 1.0 E-5). (e) Proportional homology distribution among other plant species based on the best BLAST hits against the NR database (E-value ≤ 1.0 E-5).
We also matched protein coding sequences of unigenes with the protein database at Swiss-Prot using BLASTX; 40,804 of 83,493 unigenes (48.87%) returned hits at an E-value threshold of ≤ 1.0E-5 (Table 4). Most of the matched sequences (79.2%) had E-values between 1.0E-50 to 1.0E-5, and the remaining 20.8% had strong homologies with E-values of ≤1.0E-50 (Figure 1b). The similarity frequency distribution against Swiss-Prot was different from that obtained against the NR database; 86.9% of query sequences against Swiss-Prot had similarities between 20% and 80%; only 13.1% of sequences had strong homologies with >80% identity (Figure 1d).
Conserved domain annotation and COG classification
COG is a database built on phylogenetic relationships of protein sequences from 66 genomes, including bacteria, plants and animals. Individual proteins or paralogs from at least three lineages are categorized in each COG to represent an ancient conserved domain. Within the Camelina sativa unigene set, 14,190 were categorized (E-value ≤ 1.0E-5) in 25 functional COG clusters (Table 4, Figure 2). Thus, only a small proportion of the unigenes (17.0%) carried protein domains with annotation for COG categories. The five largest categories were: 1) general function predictions only (15.6%), 2) post-translational modification, protein turnover, chaperones (8.6%), 3) transcription (8.3%), 4) translation, ribosomal structure and biogenesis (7.7%), and 5) replication, recombination and repair (7.5%).
Figure 2. COG functional classification of the Camelina sativa transcriptome. Of 67,497 hits in the NR database, 14,190 unigenes with significant homologies in the COG database (E-value ≤ 1.0 E-5) were classified into 25 COG categories.
Figure 3. Gene Ontology (GO) term assignment of Camelina unigenes. Based on high-score BLASTX matches in the NR plant proteins database, Camelina unigenes were classified into three main GO categories and 46 sub-categories. The left y-axis indicates the percentage of a specific category of genes in each main category. The right y-axis indicates the number of genes in the same category. In total, we assigned 33,475 unigenes with BLASTX matches to known proteins.
Gene ontology (GO) classification
To functionally categorize Camelina sativa unigenes, we assigned GO terms to each assembled unigene . The Camelina unigenes were categorized in three main GO categories: biological process (23,524, 30.9%), cellular component (25,885, 34.0%) and molecular function (26,825, 35.2%). These GO terms were further subdivided into 46 sub-categories (Table 4 and Figure 3). In total, 33,475 unigenes were assigned to at least one of the GO categories of biological process, cellular component and molecular function (Additional file 1: Figure S2). The transcriptome of Camelina sativa was closely related to Arabidopsis thaliana sequences (Figure 1e). Only 1.1% of unigenes (722 unigenes) had higher homology with genes from Brassica spp. A substantial number (190) of these 722 unigenes were annotated as disease resistance proteins. They accounted for 26.3% of total unigenes annotated to Brassica sequences (Figure 4 and Additional file 3). After overlapping unigenes were filtered, 381 individual NR IDs matched Brassica sequences; among these, 50 NR IDs matched Brassica disease resistance proteins (Additional file 3).
Figure 4. Frequency distribution of unigenes that mapped to Brassica sequences in the NR database.
Format: XLSX Size: 116KB Download file
KEGG pathway mapping
Unigenes of Camelina sativa were mapped to KEGG pathways by using the translated amino acids to run BLASTX against the KEGG database. All returned hits with an E-value ≤ 1.0E-5 were annotated with Enzyme Commission (EC) numbers . Using Arabidopsis thaliana as a reference for pathways analysis, we annotated and mapped 27,042 of 83,493 unigenes (32.4%) to EC numbers in 119 KEGG pathways (Figure S3 in Additional file 1 and Additional file 4). Metabolic pathways had the largest number of unigenes (6,469 members, 23.9%, ko01100), followed in sequence by secondary metabolite biosynthesis (3,600 members, 13.3%, ko01110), plant-pathogen interaction (2,004 members, 7.4%, ko04626), spliceosome (1,333 members, 4.9%, ko03040), protein processing in the endoplasmic reticulum (973 members, 3.6%, ko04141), and starch and sucrose metabolism (777 members, 2.9%, ko00500) (Additional file 4).
Format: XLSX Size: 130KB Download file
Overall, 67,791 unique sequence-based annotations had BLASTX scores with E-values ≤ 1.0E-5 in the NR, Swiss-Prot and KEGG databases (Additional file 1: Figure S3). Among this number, 11,685 unigenes had hits in all four public databases (NR, COG, Swiss-Prot and KEGG), with relatively well defined functional annotations (Table 4, Figure S3 in Additional file 1). These annotations provide a valuable resource for investigating specific processes, structures, functions, and pathways that will guide research on Camelina sativa.
Protein coding sequence (CDS) prediction
Unigenes were firstly aligned by BLASTX (E-value ≤ 1.0E-5) to protein databases in the priority order NR, Swiss-Prot, KEGG, COG. Unigenes aligned to a high priority database were not aligned to databases of lower priority. The process ended when all alignments had been performed. The correct reading frame of the nucleotide sequences of unigenes (5’-3’ direction) was defined by the highest rank in the BLAST results, and the corresponding protein sequences were obtained from the standard codon table. Unigenes that could not be aligned to any database were scanned with ESTScan  to produce the nucleotide and amino acid sequences of the predicted region. In total, CDS of 67,739 unigenes were generated by BLASTX protein database searches described above. To evaluate the quality of the databases, we analyzed the ratio of gap lengths to the sizes of unigene CDS. Almost all of the unigenes had gap lengths that were < 5% of total length, accounting for 99.96% of total unigenes with CDS sequences (67,739 unigenes) (Additional file 1: Figure S4a). The majority of the unigene CDS (88.30%) had < 500 bp. Of 67,739 unigenes with CDS sequences, 7,923 had ≥ 500 bp and 1,340 had ≥ 1000 bp. The size frequency distributions of these unigene CDS and proteins are depicted in Figure S4b and S4c, respectively (Additional file 1). We scanned 1,042 CDS of unigenes that could not be aligned to any database by ESTScan. Of these, only two had gaps (Additional file 1: Figure S4d). The majority of the unigene CDS assigned by ESTScan (96.93%) were shorter than 500 bp (Additional file 1: Figure S4e); this was also the case for protein sequences obtained from ESTScan (Additional file 1: Figure S4f).
Fatty acid pathways in Camelina sativa and Arabidopsis thaliana
Fatty acids are carboxylic acids with long hydrocarbon chains that are saturated or unsaturated. The carbon atoms per molecule vary from 10 to 30. In Camelina seeds, most of the fatty acids are unsaturated and the oil has high omega-3 fatty acid content (35-40%). Eggs from hens and milk from cows fed Camelina meal contain elevated contents of unsaturated fatty acids [17,18]. In our annotated Camelina sativa transcriptome unigene dataset, we identified key genes involved in fatty acid biosynthesis and unsaturated fatty acid biosynthesis (Additional file 5). Camelina sativa shares pathways similar to those of Arabidopsis thaliana. However, not all gene members in the Arabidopsis genome were identified in the transcriptome dataset. For example, there are 7 homologous stearoyl-ACP desaturase genes (FAB2, DES1-6) in the genome of Arabidopsis, but only four of these can find highly homologous unigenes (E-values ≤ 1.0E-68) in the transcriptome dataset (data not shown). The other three members may not be expressed in leaves, or may be absent from the genome of C. sativa.
Format: XLSX Size: 13KB Download file