In silico prioritisation of candidate genes for prokaryotic gene function discovery: an application of phylogenetic profiles

1 Centre for Health Informatics, University of New South Wales, Sydney, Australia
2 School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, Australia
3 Centre for Infectious Diseases and Microbiology, Western Clinical School, University of Sydney, Sydney, Australia
BMC Bioinformatics 2009, 10:86 doi:10.1186/1471-2105-10-86
Additional files

Additional file 1: This file lists the mathematical definitions of the statistical scoring functions evaluated in Case studies 1 and 2. Format: PDF. Size: 51KB.

Additional file 2: The C validation set (shaded area) includes genes responsible for synthesis of the peptidoglycan backbone. The B validation set includes various accessory pathways (UDP-NAG synthesis, D-Glu and D-Ala synthesis, meso-DAP synthesis, and und-PP synthesis and recycling). The M validation set further includes genes responsible for transpeptidation and transglycosylation, and other genes involved in peptidoglycan metabolism. Abbreviations: UDP: uridine diphosphate; NAG: N-acetylglucosamine; NAG-1P: N-acetylglucosamine-1-phosphate; NAM: N-acetylmuramate; NAG-EP: N-acetylglucosamine-enolpyruvate; Ala: alanine; Glu: glutamate; (D-Ala)2: D-alanyl-D-alanine; m-DAP: meso-diaminopimelate; Und-PP: undecaprenyl diphosphate; Und-P: undecaprenyl phosphate; F6P: fructose-6-phosphate; D-Glc: D-glucosamine; D-Glc-6P: D-glucosamine-6-phosphate; D-Glc-1P: D-glucosamine-1-phosphate; L-Asp: L-aspartate; L-Asp-4P: L-aspartate-4-phosphate; ASA: aspartate semialdehyde; DHDP: L-2,3-dihydrodipicolinate; THDP: tetrahydrodipicolinate; NS-AKP: N-succinyl-2-amino-6-ketopimelate; NS-DAP: N-succinyl-L,L-2,6-diaminopimelate; L,L-DAP: L,L-diaminopimelate. Format: EPS. Size: 233KB.

Additional file 3: The genes and the validation sets of peptidoglycan-related genes used in Case study 1. Format: PDF. Size: 27KB.

Additional file 4: This file lists the 400 positive and 17 negative genome examples used in statistical CGP of peptidoglycan-related genes. Format: PDF. Size: 71KB.

Additional file 5: This file lists the positions of glycolysis genes in the ranks produced by statistical CGP of peptidoglycan genes. Format: PDF. Size: 17KB.

Additional file 6: This file lists the 200 positive and 142 negative genome examples used in statistical CGP of anaerobic mixed-acid fermentation genes. Format: PDF. Size: 64KB.

Additional file 7: The rank positions, rank fractions (in pct), cluster of orthologous groups (COG), and the positions of candidate genes in the reference genome (SA-2603), ranked by the amss scoring function. Format: PDF. Size: 28KB.

Additional file 8: This figure shows the alternating decision tree (ADTree) model induced from the M validation set of the SA-2603 genome. The model predicts whether a gene is related to peptidoglycan metabolism by summing the scores of all preceding nodes from the root (Start). A higher score ranks the candidate gene higher. The model shown in this figure achieved an AUC of 0.975, as estimated by stratified 10-fold cross-validation. Abbreviations of genome names: Nit. europ.: Nitrosomonas europaea (GenBank accession: AL954747); Wig. brevipalpis.: Wigglesworthia brevipalpis (AB063523, BA000021); Oen. Oeni PSU-1: Oenococcus oeni PSU-1 (CP000411); Clos. tetan. E88: Clostridium tetani E88 (AE015927, AF528097); Myc. mycoides.: Mycoplasma mycoides (BX293980); Ehr. ruminantium str.: Ehrlichia ruminantium str. Welgevonden (CR925678); Buc. aphidicol. Cc Cinara cedri.: Buchnera aphidicola Cc Cinara cedri (CP000263); Hah. chejuensis: Hahella chejuensis KCTC 2396 (CP000155); Ric. felis URRWXCal2: Rickettsia felis URRWXCal2 (CP000053–CP000055); Por. gingivalis. W83: Porphyromonas gingivalis W83 (AE015924). Format: EPS. Size: 26KB.

Additional file 9: The rank fraction (in pct) of genes prioritised by the amss scoring function in the EC-K12 genome. Format: PDF. Size: 26KB.

Additional file 10: The rank positions, rank fractions (in pct), cluster of orthologous groups (COG), and the positions of candidate genes in the reference genome (E. coli K-12), prioritised by the amss scoring function. Format: PDF. Size: 27KB.

Additional file 11: This is the tabular representation of the results in Figure 5, showing the AUCs of 10-fold cross-validations in rediscovering genes in the 31 KEGG pathways evaluated in Case study 3. Format: PDF. Size: 30KB.

Additional file 12: In Case study 1, evaluation experiments were performed on candidate genes selected from one S. agalactiae genome and one E. coli genome. These bacterial genomes belong to the divisions Firmicutes and Gamma-proteobacteria, both of which contribute a large number of closely related sequences to the positive examples; this over-representation could have favourably biased the performance. This file describes an additional CGP experiment on a less well-represented genome from the NCBI database, Prochlorococcus marinus MIT9313, to investigate this effect. Format: PDF. Size: 69KB.

Additional file 13: There was considerable variation in inductive CGP performance in Case study 3, some of which is attributable to statistical uncertainty or algorithmic differences. The influence of pathway function on CGP performance was, however, unclear. Nevertheless, it was observed that there may be limitations in using KEGG pathways as a validation source, and potential sampling biases could explain a significant proportion of the variation. This file describes an additional experiment performed to illustrate this effect. Format: PDF. Size: 88KB.
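The description of Additional file 8 states that the ADTree model scores a gene by summing the scores of the nodes reached from the root (Start). The sketch below illustrates that summation only; it is not the authors' model. The class names, the function `adtree_score`, the genome labels, and every node value in `toy_tree` are hypothetical, and each test is assumed to check whether the gene has a homolog in a given genome, in line with the phylogenetic-profile setting of the article.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Prediction:
    """Prediction node: contributes `value` and may carry further tests."""
    value: float
    splitters: List["Splitter"] = field(default_factory=list)


@dataclass
class Splitter:
    """Test node: checks the profile bit of one genome (assumed test form)."""
    genome: str
    if_present: Prediction
    if_absent: Prediction


def adtree_score(node: Prediction, profile: Dict[str, bool]) -> float:
    """Sum the values of every prediction node reached for this profile."""
    total = node.value
    for splitter in node.splitters:
        # A genome missing from the profile is treated as "absent".
        present = profile.get(splitter.genome, False)
        branch = splitter.if_present if present else splitter.if_absent
        total += adtree_score(branch, profile)
    return total


# Hypothetical toy tree; the values are illustrative, not taken from the figure.
toy_tree = Prediction(
    value=-0.5,  # root (Start) score
    splitters=[
        Splitter(
            genome="genome_A",
            if_present=Prediction(
                value=1.0,
                splitters=[
                    Splitter(
                        genome="genome_B",
                        if_present=Prediction(0.25),
                        if_absent=Prediction(-0.75),
                    )
                ],
            ),
            if_absent=Prediction(-1.0),
        )
    ],
)

score = adtree_score(toy_tree, {"genome_A": True, "genome_B": False})
print(score)
```

Candidate genes would then be sorted by descending score, so that a higher sum places a gene nearer the top of the ranking, as the file description indicates.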



