Open Access, Highly Accessed, Research article

Mining housekeeping genes with a Naive Bayes classifier

Luna De Ferrari and Stuart Aitken

School of Informatics, the University of Edinburgh, Edinburgh EH8 9LE, UK


BMC Genomics 2006, 7:277 doi:10.1186/1471-2164-7-277

Published: 30 October 2006

Additional files

Additional file 1:

Attribute values and housekeeping probabilities for all EMBL human genes. The file is tab separated and contains the following attributes:
1. EMBL_gene_id = the EMBL gene identifier
2. HGNC_symbol = the HUGO Gene Nomenclature Committee (HGNC) identifier
3. description = a textual description of the gene function
4. EMBL_transcript_id = the EMBL transcript identifier
5. cDNA_length = cDNA length (entire pre-splicing mRNA length: exons + introns + other untranslated regions)
6. cds_length = coding sequence length (exons only)
7. exons_nr = number of exons
8. 3_MAR_presence = presence of S/MAR in the 3' region
9. 5_MAR_presence = presence of S/MAR in the 5' region
10. 5_polyA_18_presence = presence of Poly(dA-dT) (with a length of 18 bp or more) in the 5' region
11. 5_CCGNN_2_5_presence = presence of (CCGNN)2–5 in the 5' region
12. perc_go_ts_match = percentage of GO terms for the gene that match the tissue specific GO terms list
13. perc_go_hk_match = percentage of GO terms for the gene that match the housekeeping GO terms list
14. is_hk = the earlier housekeeping or tissue specific classification from published lists (when known)
15. predicted_class = the predicted class given the probability (housekeeping if the housekeeping probability is ≥ 50%, tissue specific if it is < 50%)
16. hk_probability = the new housekeeping probability generated by the Naive Bayes classifier
An unknown value is represented by a question mark, following the "arff" file standard for machine learning.

Format: TSV. Size: 6.5 MB.

Additional file 2:

Attribute values and housekeeping probabilities for all EMBL mouse genes. The file is tab separated and contains the following attributes:
1. EMBL_gene_id = the EMBL gene identifier
2. MGI_symbol = the Mouse Genome Informatics (MGI) symbol
3. description = a textual description of the gene function
4. EMBL_transcript_id = the EMBL transcript identifier
5. cDNA_length = cDNA length (entire pre-splicing mRNA length: exons + introns + other untranslated regions)
6. cds_length = coding sequence length (exons only)
7. exons_nr = number of exons
8. 3_MAR_presence = presence of S/MAR in the 3' region
9. 5_MAR_presence = presence of S/MAR in the 5' region
10. 5_polyA_18_presence = presence of Poly(dA-dT) (with a length of 18 bp or more) in the 5' region
11. 5_CCGNN_2_5_presence = presence of (CCGNN)2–5 in the 5' region
12. perc_go_ts_match = percentage of GO terms for the gene that match the tissue specific GO terms list
13. perc_go_hk_match = percentage of GO terms for the gene that match the housekeeping GO terms list
14. is_hk = the earlier housekeeping or tissue specific classification from published lists (when known)
15. predicted_class = the predicted class given the probability (housekeeping if the housekeeping probability is ≥ 50%, tissue specific if it is < 50%)
16. hk_probability = the new housekeeping probability generated by the Naive Bayes classifier
An unknown value is represented by a question mark, following the "arff" file standard for machine learning.

Format: TSV. Size: 3.3 MB.

Additional file 3:

Attribute values and housekeeping probabilities for all EMBL fruit fly genes. The file is tab separated and contains the following attributes:
1. EMBL_gene_id = the EMBL gene identifier
2. FlyBase_symbol = the FlyBase symbol
3. description = a textual description of the gene function
4. EMBL_transcript_id = the EMBL transcript identifier
5. cDNA_length = cDNA length (entire pre-splicing mRNA length: exons + introns + other untranslated regions)
6. cds_length = coding sequence length (exons only)
7. exons_nr = number of exons
8. 3_MAR_presence = presence of S/MAR in the 3' region
9. 5_MAR_presence = presence of S/MAR in the 5' region
10. 5_polyA_18_presence = presence of Poly(dA-dT) (with a length of 18 bp or more) in the 5' region
11. 5_CCGNN_2_5_presence = presence of (CCGNN)2–5 in the 5' region
12. perc_go_ts_match = percentage of GO terms for the gene that match the tissue specific GO terms list
13. perc_go_hk_match = percentage of GO terms for the gene that match the housekeeping GO terms list
14. is_hk = the earlier housekeeping or tissue specific classification from published lists (when known)
15. predicted_class = the predicted class given the probability (housekeeping if the housekeeping probability is ≥ 50%, tissue specific if it is < 50%)
16. hk_probability = the new housekeeping probability generated by the Naive Bayes classifier
An unknown value is represented by a question mark, following the "arff" file standard for machine learning.

Format: TSV. Size: 2.1 MB.
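
The three files share the same tab separated layout, so a single reader can serve all of them. The following is a minimal Python sketch, not the authors' code, of how such a file might be read. It rests on several assumptions that the descriptions above do not confirm: that the first row is a header carrying the attribute names, that hk_probability is stored as a fraction between 0 and 1 (pass threshold=50 if it is stored as a percentage), and that no field uses quoting. The function names read_genes and predicted_class, the class labels they return, and the two sample rows are invented for illustration. A probability of exactly 50% is treated as housekeeping.

```python
import csv
import io

MISSING = "?"  # marker for unknown values, as stated in the file descriptions


def read_genes(handle):
    """Yield one dict per gene row, with "?" replaced by None.

    Assumes (not confirmed by the source) that the first row is a header.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {key: (None if value == MISSING else value)
               for key, value in row.items()}


def predicted_class(hk_probability, threshold=0.5):
    """Apply the 50% rule: housekeeping at or above the threshold.

    The returned label strings are hypothetical, not taken from the files.
    """
    if hk_probability >= threshold:
        return "housekeeping"
    return "tissue specific"


# Invented sample with only a few of the 16 columns, for demonstration.
SAMPLE = (
    "EMBL_gene_id\tcds_length\tis_hk\thk_probability\n"
    "gene_A\t1200\t?\t0.91\n"
    "gene_B\t?\t?\t0.12\n"
)

rows = list(read_genes(io.StringIO(SAMPLE)))
for row in rows:
    probability = float(row["hk_probability"])
    print(row["EMBL_gene_id"], predicted_class(probability))
```

To read a downloaded file instead of the sample, an opened file handle (for example `open(path, newline="")`, with a path of your choosing) can be passed to read_genes in place of the io.StringIO object.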

