Generation of Gene Ontology benchmark datasets with various types of positive signal

1 The Holm Group, Biocenter II, Institute of Biotechnology, PO Box 56, 00014 University of Helsinki, Finland
2 Department of Biological and Environmental Sciences, P.O. Box 56, 00014 University of Helsinki, Finland
3 Department of Biosciences, P.O. Box 1627, 70211 University of Kuopio, Finland
BMC Bioinformatics 2009, 10:319 doi:10.1186/1471-2105-10-319
Abstract

Background
The analysis of over-represented functional classes in a list of genes is one of the most essential bioinformatics research topics. Typical examples of such lists are the differentially expressed genes from transcriptional analysis, which need to be linked to functional information represented in the Gene Ontology (GO). Despite the importance of this procedure, there is little work on the consistent evaluation of various GO analysis methods. In particular, there is no literature on creating benchmark datasets for GO analysis tools.

Results
We propose a methodology for the evaluation of GO analysis tools, which consists of creating gene lists with a selected signal level and a selected number of independent over-represented classes. The methodology starts with a real-life GO data matrix, and therefore the generated datasets have features similar to those of real positive datasets. The user can select the signal level for over-representation, the number of independent positive classes in the dataset, and the size of the final gene list. We present the use of the effective number and various normalizations for embedding the signal into a selected class or classes, and the use of binary correlation to ensure that the selected signal classes are independent of each other. The usefulness of the generated datasets is demonstrated by comparing different GO class ranking and GO clustering methods.

Conclusion
The presented methods aid the development and evaluation of GO analysis methods, as they enable thorough testing with different signal types and different signal levels. As an example, our comparisons reveal clear differences between the compared GO clustering and GO de-correlation methods. The implementation is coded in Matlab and is freely available at the dedicated website http://ekhidna.biocenter.helsinki.fi/users/petri/public/POSGODA/POSGODA.html
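To make the idea in the Results paragraph concrete, the following is a minimal Python sketch of two of its ingredients: choosing signal classes that are mutually uncorrelated in a binary gene-by-class matrix, and drawing a gene list enriched for one of them. It is not the authors' Matlab implementation. The function names, the greedy selection, the `max_abs_corr` threshold, and the use of a plain `signal_fraction` in place of the paper's signal level (which involves the effective number and normalizations not reproduced here) are all assumptions made for illustration.

```python
import numpy as np


def phi_correlation(a, b):
    """Pearson correlation of two binary vectors (the phi coefficient)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.std() == 0 or b.std() == 0:
        return 0.0  # a constant column carries no correlation information
    return float(np.corrcoef(a, b)[0, 1])


def pick_independent_classes(go_matrix, n_classes, max_abs_corr=0.1, rng=None):
    """Greedily pick columns whose pairwise |phi| stays within max_abs_corr.

    Hypothetical helper: the greedy scan and the threshold are illustrative
    assumptions, not the selection rule used in the paper.
    """
    rng = np.random.default_rng(rng)
    chosen = []
    for j in rng.permutation(go_matrix.shape[1]):
        if all(
            abs(phi_correlation(go_matrix[:, j], go_matrix[:, k])) <= max_abs_corr
            for k in chosen
        ):
            chosen.append(int(j))
        if len(chosen) == n_classes:
            break
    return chosen


def sample_gene_list(go_matrix, signal_class, list_size, signal_fraction, rng=None):
    """Draw a gene list in which a fixed fraction belongs to signal_class.

    signal_fraction is a simplified stand-in for the paper's signal level.
    """
    rng = np.random.default_rng(rng)
    members = np.flatnonzero(go_matrix[:, signal_class] == 1)
    others = np.flatnonzero(go_matrix[:, signal_class] == 0)
    n_signal = min(int(round(signal_fraction * list_size)), len(members))
    n_background = list_size - n_signal
    picked_signal = rng.choice(members, size=n_signal, replace=False)
    picked_background = rng.choice(others, size=n_background, replace=False)
    return np.concatenate([picked_signal, picked_background])


# Toy gene-by-class matrix (8 genes, 4 classes); column 2 duplicates column 0.
toy = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
])
classes = pick_independent_classes(toy, n_classes=3, rng=0)
genes = sample_gene_list(toy, classes[0], list_size=4, signal_fraction=0.75, rng=0)
```

In the toy matrix, columns 0 and 2 are identical, so the selection can keep only one of them alongside the two uncorrelated columns. A real run would start from an actual GO annotation matrix, as the abstract describes.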











