Open Access Highly Accessed Software

GOParGenPy: a high throughput method to generate Gene Ontology data matrices

Ajay Anand Kumar1*, Liisa Holm1,2 and Petri Toronen1

Author Affiliations

1 Institute of Biotechnology, University of Helsinki, (Viikinkaari 5), PO Box 56, Helsinki 00014, Finland

2 Department of Biosciences, Division of Genetics, University of Helsinki, (Viikinkaari 5), PO Box 56, Helsinki 00014, Finland

BMC Bioinformatics 2013, 14:242  doi:10.1186/1471-2105-14-242

Published: 8 August 2013

Abstract

Background

Gene Ontology (GO) is a popular standard in the annotation of gene products and provides information related to genes across all species. The structure of GO is dynamic and is updated on a daily basis. However, the popular existing methods use outdated versions of GO. Moreover, these tools are slow to process large datasets consisting of more than 20,000 genes.

Results

We have developed GOParGenPy, a platform-independent software tool that generates a binary data matrix showing the GO class membership, including parental classes, of a set of GO-annotated genes. GOParGenPy is at least an order of magnitude faster than popular tools for Gene Ontology analysis, and it can handle larger datasets than the existing tools. It can use any available version of the GO structure and allows the user to select the source of GO annotation. GO structure selection is critical for analysis, as we show that GO classes have rapid turnover between different GO structure releases.
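The idea of a binary membership matrix that includes parental classes can be illustrated with a minimal Python sketch. It is not GOParGenPy's code or interface: the term identifiers, the gene names, and the functions `ancestors` and `membership_matrix` are all hypothetical, and the five-term hierarchy is a toy stand-in for a real GO structure file.

```python
from typing import Dict, List, Set, Tuple

# Toy child -> parents map. These identifiers are made up for illustration
# and are NOT real GO classes.
PARENTS: Dict[str, List[str]] = {
    "T:root": [],
    "T:binding": ["T:root"],
    "T:catalysis": ["T:root"],
    "T:dna_binding": ["T:binding"],
    "T:kinase": ["T:catalysis", "T:binding"],  # a term may have several parents
}

# Toy direct annotations (gene -> directly assigned terms), also hypothetical.
ANNOTATIONS: Dict[str, List[str]] = {
    "geneA": ["T:dna_binding"],
    "geneB": ["T:kinase"],
    "geneC": ["T:catalysis"],
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return the term itself plus every class reachable through parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return seen


def membership_matrix(
    annotations: Dict[str, List[str]], parents: Dict[str, List[str]]
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Build a genes x terms 0/1 matrix after propagating to parental classes."""
    expanded = {
        gene: set().union(*(ancestors(t, parents) for t in terms))
        for gene, terms in annotations.items()
    }
    genes = sorted(expanded)
    terms = sorted(set().union(*expanded.values()))
    matrix = [[1 if t in expanded[g] else 0 for t in terms] for g in genes]
    return genes, terms, matrix


genes, terms, matrix = membership_matrix(ANNOTATIONS, PARENTS)
for gene, row in zip(genes, matrix):
    print(gene, row)
```

In this toy case `geneB` is annotated only to `T:kinase`, yet its row also has a 1 in the `T:binding`, `T:catalysis`, and `T:root` columns, which is the effect of including parental classes.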

Conclusions

GOParGenPy is an easy-to-use software tool that can generate sparse or full binary matrices from GO-annotated gene sets. The resulting binary matrix can then be used in any analysis environment and with any analysis method.
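The difference between a sparse and a full binary matrix can be sketched as follows. The coordinate-pair layout used here is an assumption chosen for illustration, not the output format documented for GOParGenPy, and `to_sparse` and `to_full` are hypothetical helper names.

```python
from typing import List, Tuple


def to_sparse(matrix: List[List[int]]) -> List[Tuple[int, int]]:
    """Keep only the (row, column) positions that hold a 1."""
    return [
        (i, j)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value
    ]


def to_full(pairs: List[Tuple[int, int]], n_rows: int, n_cols: int) -> List[List[int]]:
    """Rebuild the full 0/1 matrix from the stored positions."""
    full = [[0] * n_cols for _ in range(n_rows)]
    for i, j in pairs:
        full[i][j] = 1
    return full


# Toy 3 genes x 4 classes matrix (values are illustrative only).
full_matrix = [
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 1, 0, 1],
]
sparse = to_sparse(full_matrix)
restored = to_full(sparse, n_rows=3, n_cols=4)
```

Because each gene belongs to only a small fraction of all GO classes, storing just the nonzero positions is usually far more compact than the full matrix, while the full form is convenient for tools that expect a dense table.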

Keywords:
Gene Ontology; Large-scale datasets; Data Mining; Machine learning; Bioinformatics