Open Access Research article

Single nucleotide polymorphism discovery from expressed sequence tags in the waterflea Daphnia magna

Luisa Orsini1*, Mieke Jansen1*, Erika L Souche2,3, Sarah Geldof1,2 and Luc De Meester1

Author Affiliations

1 Laboratory of Aquatic Ecology and Evolutionary Biology, K.U. Leuven, Ch. Deberiotstraat 32, 3000 Leuven, Belgium

2 Laboratory of Animal Diversity and Systematics, K.U. Leuven, Ch. Deberiotstraat 32, 3000 Leuven, Belgium

3 Institut Pasteur, Plate-Forme Intégration et Analyse Génomiques, 28 Rue du Docteur Roux, 75724 Paris Cedex 15, France

BMC Genomics 2011, 12:309  doi:10.1186/1471-2164-12-309

Published: 13 June 2011

Additional files

Additional file 1:

Daphnia magna cDNA libraries from NCBI GenBank. EST sequences and cDNA library types of D. magna sequences retrieved from GenBank at the time of the analysis.

Format: XLS Size: 27KB

Additional file 2:

Natural populations of Daphnia magna used for SNP validation and their environmental characteristics. List of populations from Belgium used for SNP validation and their environmental characteristics. N = population size; Fish = presence (1)/absence (0) of fish; Land use = high (1)/low (0) land use intensity; Parasite = presence (1)/absence (0) of the parasite Pasteuria ramosa. The sampling date and the environmental variables measured at the sampling sites are also listed. Transparency was measured with a Secchi disk.

Format: XLS Size: 20KB

Additional file 3:

PCR and oligonucleotide probes used in the Sequenom MassARRAY platform for SNP typing. List of SNP loci genotyped using the Sequenom MassARRAY platform. The PCR primers, the oligonucleotide probes and the multiplex information are shown. The sequences of the SNP flanking regions have been deposited in NCBI dbSNP.

Format: XLS Size: 46KB

Additional file 4:

Summary of the gene annotation of the EST sequences. In this file we report the gene annotation for three sets of sequences, based on BLAST searches in NCBI and in the Daphnia portal (http://wfleabase.org/, called wfleabase in the remaining text):

1) ESTs generated for this study by exposing animals to three key environmental stressors and using suppressive subtractive hybridization. The results for this set of sequences are summarized in the spreadsheets EST_1070_NCBI and EST_1070_wfleabase_aa. In EST_1070_NCBI we summarize the gene annotation results obtained from BLAST searches in the NCBI non-redundant protein database using the program tblastx. In EST_1070_wfleabase_aa we summarize the results obtained from BLAST searches in the non-redundant protein database of the Daphnia portal (wfleabase) using the program tblastx.

2) Contigs obtained by assembling the EST sequences produced in this study (see point 1 above) and the sequences of Daphnia magna downloaded from NCBI GenBank at the time of the analysis. The results for this set of sequences are summarized in the spreadsheets Contigs_NCBI_1812, Contigs_wfleabase_aa_1812, and Contigs_wfleabase_na_1812. In Contigs_NCBI_1812 we summarize the gene annotation results obtained from BLAST searches in the NCBI non-redundant protein database using the program tblastx. In Contigs_wfleabase_aa_1812 and Contigs_wfleabase_na_1812 we summarize the results obtained from BLAST searches in the non-redundant protein database and in the nucleotide database of the Daphnia portal (wfleabase) using the programs tblastx and tblastn, respectively.

3) Contigs obtained from clusters of sequences mined for SNP markers. The number of contigs mined for SNPs is lower than the total number of contigs that include our sequences and sequences from GenBank (point 2 above), because several stringent criteria were adopted to select them (see Methods). The results for this set of sequences are summarized in the spreadsheets Contigs_NCBI_574, Contigs_wfleabase_aa_574, and Contigs_wfleabase_na_574. Results from BLAST searches were obtained as in point 2 above.

The column IDs in the spreadsheets EST_1070_NCBI, Contigs_NCBI_1812, and Contigs_NCBI_574 are as follows: 1) SID - sequence identity; 2) GOID - Gene ontology term identity; 3) PID - protein identity as from BLAST searches; 4) P_desc - gene description as from BLAST searches and indication of the species where it was identified; 5) e-value - significance of the homology between the query sequence and the hit in NCBI; 6) Paralog - the paralog group identity (several members may be shown); 7) Start-End: FrameFS - open reading frame predictor results with indication of the start and end coordinates and the frame; 8) DomainID:desc - protein site scan domain identity and description of the protein domain; 9) length - length of the EST; 10) OG_ID - group identity of the ortholog group of protein sequences (this analysis is based on searches for orthologs in several genomes); 11) E-value - significance of the homology to the ortholog group of protein sequences; 12) Score - score for the ortholog group of protein sequences analysis.

In the remaining spreadsheets the following column IDs are present: 1) query id - query identity; 2) database sequence (subject) id - sequence identity in wfleabase; 3) gene id - gene identity in wfleabase; 4) percent identity - percentage of identity between the query and the gene in wfleabase; 5) alignment length - match in bp between the query and the gene in wfleabase; 6) number of mismatches - number of mismatches between the query and the gene in wfleabase; 7) number of gap openings - gap openings between the query and the gene in wfleabase; 8) query start; 9) query end; 10) subject start - database sequence (subject) start; 11) subject end - database sequence (subject) end; 12) Expect value - E-value of the match between the query and the subject; 13) HSP bit score - blastp e-value score; 14) Gene_ID - gene identity in wfleabase; 15) Gname - gene name; 16) Gnomon - gene prediction in NCBI; 17) Paralog; 18) Paralog,# - number of paralogs identified; 19) OrthoID - ortholog identity; 20) ArpGene - homology to the arthropod gene list; 21) ArpDE - arthropod gene description; 22) Scaffold - scaffold number where the query was annotated; 23) Begin - query start on the scaffold; 24) End - query end on the scaffold; 25) Or - orphan gene; 26) KOG_JGI - ortholog and paralog protein identities provided for a JGI-sequenced organism; 27) KOG_EMBL - ortholog and paralog protein identities provided in the EMBL database; 28) meNOG_EMBL - evolutionary genealogy of genes; 29) Enzyme_JGI - protein identity reported in JGI; 30) Enzyme_JGI - protein identity reported in EMBL; 31) Description_JGI - protein description based on the JGI database; 32) GeneOntology_JGI - Gene ontology as described in the JGI database; 33) Tandem_ID - identity of tandem gene arrangements. The column IDs are listed in the column_IDs spreadsheet.

Format: XLS Size: 3.4MB

Additional file 5:

BLAST hit results based on the NCBI non-redundant protein database. List of species whose sequences showed significant homology to the EST sequences from Daphnia magna, based on BLAST searches in the NCBI non-redundant protein database. For each species, the number of hits found is listed in the second column of the table. In total, 651 of the 685 EST sequences showed homology to sequences in other species. The list of different genes identified in the dataset ('genes'), the redundancy of the identified genes ('genes redundancy') and the number of times each gene was found in different species ('redundancy in species') are also shown.

Format: XLS Size: 50KB

Additional file 6:

List of the EST-linked SNPs and descriptive statistics. List of SNP markers in the set of 147 SNPs targeted for genotyping with the Sequenom MassARRAY platform. The protein changes at both the synonymous (S) and the non-synonymous (NS) sites, the codon position of the point mutation, the genotyping success rate, and the minor allele frequency are shown. The characteristics of the contigs from which the SNPs were developed are also shown, in terms of length, polymorphism, and number of sequences in the contig. Nr: SNPs that did not fit in any assay design.

Format: XLS Size: 56KB
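The two per-SNP descriptive statistics named in the legend of Additional file 6, the genotyping success rate and the minor allele frequency, can be illustrated with a minimal sketch. The function name `snp_summary`, the two-letter genotype encoding, and the use of `None` for a failed call are assumptions made for this example; they do not describe the layout of the spreadsheet or the software used in the study. The sketch also assumes a biallelic SNP.

```python
from collections import Counter


def snp_summary(genotypes):
    """Return (success_rate, minor_allele_frequency) for one biallelic SNP.

    `genotypes` is a list with one entry per individual: a two-letter
    string such as "AG" for a successful call, or None for a failed call
    (hypothetical encoding chosen for this sketch).
    """
    if not genotypes:
        raise ValueError("no individuals supplied")

    # Success rate: fraction of individuals with a genotype call.
    called = [g for g in genotypes if g is not None]
    success_rate = len(called) / len(genotypes)

    # Count alleles over all called genotypes (two alleles per individual).
    allele_counts = Counter(allele for g in called for allele in g)
    total = sum(allele_counts.values())

    if total == 0:
        maf = None   # no calls at all, frequency undefined
    elif len(allele_counts) == 1:
        maf = 0.0    # monomorphic in this sample
    else:
        # Biallelic assumption: the rarer of the two alleles is the minor one.
        maf = min(allele_counts.values()) / total

    return success_rate, maf


rate, maf = snp_summary(["AA", "AG", "GG", "AA", None])
print(rate, maf)
```

In this toy input four of five individuals are called, and the G allele accounts for three of the eight called alleles.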

Additional file 7:

Features of the EST contigs from which SNP markers were developed. Main features of the EST contigs from which SNP markers were designed. Contigs are shown both for the SNPs that failed in the genotyping process and for the SNPs with a success rate higher than 70%.

Format: XLS Size: 18KB

Additional file 8:

Population genetic statistics in the six natural populations used for SNP validation. Population genetic statistics in the set of six populations used to validate the SNP markers. Ho = observed heterozygosity; He = expected heterozygosity; H-W = Hardy-Weinberg disequilibrium test (P < 0.05). The frequency of the two SNP alleles in each population is also shown.

Format: XLS Size: 60KB
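As an illustration of the statistics listed for Additional file 8, the sketch below computes Ho, He and a Hardy-Weinberg test for a single biallelic SNP from genotype counts. The legend does not state which test was applied, so the chi-square goodness-of-fit test with one degree of freedom used here is an assumption, as are the function name `hw_stats` and the use of the uncorrected estimate He = 2pq.

```python
import math


def hw_stats(n_aa, n_ab, n_bb):
    """Return (Ho, He, chi2, p_value) for one biallelic SNP.

    n_aa, n_ab, n_bb are the observed counts of the two homozygotes
    and the heterozygote in one population.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")

    # Allele frequencies from genotype counts.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p

    ho = n_ab / n      # observed heterozygosity
    he = 2 * p * q     # expected heterozygosity under Hardy-Weinberg

    # Chi-square goodness of fit against Hardy-Weinberg expectations
    # (assumed test; classes with zero expectation are skipped).
    observed = (n_aa, n_ab, n_bb)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return ho, he, chi2, p_value


ho, he, chi2, p_value = hw_stats(50, 0, 50)
print(ho, he, chi2, p_value < 0.05)
```

With these counts no heterozygotes are observed although half the sample is expected to be heterozygous, so the locus would be flagged at the P < 0.05 threshold given in the legend. For small samples an exact test is often preferred over the chi-square approximation.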

Additional file 9:

Gene function of the contigs where SNP outliers were detected. List of outlier loci and the corresponding EST sequences, with NCBI GenBank accession numbers, from which the SNPs were developed. The gene function was inferred from the EST contigs.

Format: XLS Size: 37KB