Open Access Research article

Cutoffs and k-mers: implications from a transcriptome study in allopolyploid plants

Nicole Gruenheit1*, Oliver Deusch1, Christian Esser2, Matthias Becker1, Claudia Voelckel1 and Peter Lockhart1

Author Affiliations

1 Institute of Molecular Biosciences, Massey University, Palmerston North, New Zealand

2 Institute for Computer Science, Heinrich-Heine-University, 40225 Düsseldorf, Germany

BMC Genomics 2012, 13:92  doi:10.1186/1471-2164-13-92

Published: 14 March 2012

Additional files

Additional file 1:

Table S1. Number and length of contigs per k-mer size and coverage cutoff for P. fastigiatum and P. cheesemanii. The sequences obtained per coverage cutoff were counted and divided into four size classes: a) sequences longer than 1000 bp, b) sequences shorter than 1000 bp but longer than 500 bp, c) sequences shorter than 500 bp but longer than 200 bp, and d) sequences shorter than 200 bp but longer than 100 bp.

Format: XLSX Size: 79KB

Open Data
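
The four size classes in Table S1 can be illustrated with a minimal Python sketch. The function names are hypothetical, and the description does not say how a contig of exactly 1000, 500 or 200 bp is classified; assigning such contigs to the next smaller class is an assumption made here only for illustration.

```python
from collections import Counter


def size_class(length):
    """Return the Table S1 size-class label for a contig length in bp.

    Assumption: a contig sitting exactly on a boundary (1000, 500 or
    200 bp) falls into the next smaller class; the source does not say.
    Contigs of 100 bp or less belong to no class and return None.
    """
    if length > 1000:
        return "a"
    if length > 500:
        return "b"
    if length > 200:
        return "c"
    if length > 100:
        return "d"
    return None


def count_size_classes(lengths):
    """Count how many contig lengths fall into each of the four classes."""
    counts = Counter({"a": 0, "b": 0, "c": 0, "d": 0})
    for length in lengths:
        label = size_class(length)
        if label is not None:
            counts[label] += 1
    return dict(counts)
```

Calling `count_size_classes` once per assembly, on that assembly's contig lengths, would give one row of counts per k-mer size and coverage cutoff.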

Additional file 2:

Table S2. Statistics and longest sequences for 380 assemblies of the P. fastigiatum library. The longest sequence was identified in each of the 380 assemblies and annotated according to its homologue in A. thaliana. The N50 and N90 values for each assembly were computed as well.

Format: XLSX Size: 59KB
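
The description of Table S2 does not define N50 and N90. The sketch below assumes the common convention: contigs are sorted from longest to shortest, and Nx is the length of the contig at which the running total first reaches x% of the summed contig length. The function name `nx` is hypothetical.

```python
def nx(lengths, percent):
    """Return the Nx value (e.g. percent=50 for N50) of a list of contig lengths.

    Assumed definition: the length of the contig at which the cumulative
    length, taken from the longest contig downwards, first reaches
    `percent` percent of the total assembly length.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * percent:
            return length


contigs = [80, 70, 50, 40, 30, 20]  # toy lengths, not data from the study
n50 = nx(contigs, 50)
n90 = nx(contigs, 90)
```

For these toy lengths the total is 290 bp, so N50 is reached at the second contig (70 bp) and N90 at the fifth (30 bp).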

Additional file 3:

Figure S1. Number of complete transcripts identified in different assemblies of P. cheesemanii reads. A total of 380 assemblies were conducted using ABySS [25,26] with combinations of i) coverage cutoffs between 2 and 20 and ii) k-mer sizes between 25 and 63. Transcripts covering the complete coding sequence of the homologue from A. lyrata or A. thaliana were identified and counted. The highest number of complete transcripts (558) was identified for coverage cutoff 5 and k-mer size 41, while the lowest number (58) was identified for coverage cutoff 19 and k-mer size 63.

Format: PDF Size: 160KB
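
The count of 380 assemblies can be checked against the stated parameter ranges. The sketch below assumes integer coverage cutoffs from 2 to 20 and odd k-mer sizes from 25 to 63; the description gives only the ranges, and the step sizes are an assumption chosen because they give exactly 19 × 20 = 380 combinations.

```python
from itertools import product

# Assumption: every integer coverage cutoff from 2 to 20 (19 values).
coverage_cutoffs = range(2, 21)
# Assumption: every odd k-mer size from 25 to 63 (20 values).
kmer_sizes = range(25, 64, 2)

# One (coverage cutoff, k-mer size) pair per assembly.
grid = list(product(coverage_cutoffs, kmer_sizes))

print(len(coverage_cutoffs), len(kmer_sizes), len(grid))
```

Under these assumptions both combinations named in the caption, (5, 41) and (19, 63), are members of the grid.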

Additional file 4:

Supplementary file S1: 3,912 complete annotated transcripts from the library of P. fastigiatum. Contigs from all 380 assemblies of the P. fastigiatum reads were searched against a combined library of A. lyrata and A. thaliana coding sequences using BLAT [47]. Transcripts that spanned more than 55% of a reference coding sequence were extracted and, if none of the contigs allocated to a specific coding sequence spanned the whole sequence, assembled further with CAP3 [32]. All 3,912 transcripts that, after this step, spanned more than 95% of the reference coding sequence were considered 'complete transcripts' and were annotated with the A. thaliana TAIR accession number where possible and with the A. lyrata transcript number otherwise.

Format: TXT Size: 4.5MB
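
The two thresholds in this description (more than 55% to extract a transcript, more than 95% to call it complete) can be expressed as a simple span test. This is a minimal sketch, not the authors' pipeline: the function name, the half-open coordinate convention and the example numbers are assumptions, and the BLAT and CAP3 steps themselves are not reproduced.

```python
def spans_more_than(aln_start, aln_end, ref_length, percent):
    """Return True if an alignment covers more than `percent` percent of a reference.

    Assumption: aln_start and aln_end are 0-based, half-open coordinates
    on the reference coding sequence, so the aligned span is
    aln_end - aln_start bases long.
    """
    aligned = aln_end - aln_start
    # Integer comparison keeps the "more than" test exact at the threshold.
    return aligned * 100 > percent * ref_length


ref_length = 1000  # hypothetical reference coding sequence length in bp

# A contig covering 600 of 1000 bp passes the 55% extraction threshold
# but not the 95% completeness threshold.
extracted = spans_more_than(0, 600, ref_length, 55)
complete = spans_more_than(0, 600, ref_length, 95)
```

Because both thresholds are described as "more than", a transcript spanning exactly 55% or exactly 95% of the reference fails the corresponding test in this sketch.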

Additional file 5:

Supplementary file S2: 2,442 complete annotated transcripts from the library of P. cheesemanii. Contigs from all 380 assemblies of the P. cheesemanii reads were searched against a combined library of A. lyrata and A. thaliana coding sequences using BLAT [47]. Transcripts that spanned more than 55% of a reference coding sequence were extracted and, if none of the contigs allocated to a specific coding sequence spanned the whole sequence, assembled further with CAP3 [32]. All 2,442 transcripts that, after this step, spanned more than 95% of the reference coding sequence were considered 'complete transcripts' and were annotated with the A. thaliana TAIR accession number where possible and with the A. lyrata transcript number otherwise.

Format: TXT Size: 3.1MB

Additional file 6:

Table S3. Coverage cutoff values for seven genes, obtained from assemblies of reads that mapped with up to three mismatches. All reads in the P. fastigiatum dataset that mapped with up to three mismatches to the sequences of AT1G67090, AT3G14210, three sequences of AT1G54030 (two homeologues and one paralogue), ATCG00490, and AT1G75680 were identified and assembled separately using ABySS [25,26] with k-mer sizes 25 to 63. The automatically chosen mean coverage cutoff for each assembly was extracted from the log files.

Format: XLSX Size: 9KB
