Public microarray repository semantic annotation with ontologies employing text mining and expression profile correlation

Ruau, David; Kolárik, Corinna; Mevissen, Heinz-Theodor; Müller, Emmanuel; Assent, Ira; Krieger, Ralph; Seidl, Thomas; Hofmann-Apitius, Martin; Zenke, Martin

doi:10.1186/1471-2105-9-S10-O5

Volume 9 Supplement 10

Highlights from the Fourth International Society for Computational Biology (ISCB) Student Council Symposium

Oral presentation
Open access
Published: 30 October 2008

David Ruau¹,
Corinna Kolárik²,³,
Heinz-Theodor Mevissen²,
Emmanuel Müller⁴,
Ira Assent⁴,
Ralph Krieger⁴,
Thomas Seidl⁴,
Martin Hofmann-Apitius²,³ &
Martin Zenke¹,⁵

BMC Bioinformatics volume 9, Article number: O5 (2008)

Public microarray repository annotation

Gene Expression Omnibus (GEO) [1] is the largest public web repository of microarray experiments. GEO, like ArrayExpress and the Stanford MicroArray Database, describes microarray experiments in free text, which makes it difficult to search those data and to link them comprehensively to other knowledge resources. Text mining applied to microarray experiment annotation is hampered by poor or ambiguous free-text descriptions and consequently leaves some objects unlabeled. Previous work organized GEO entries at the level of series (GSE) and data sets (GDS) [2] using the Unified Medical Language System (UMLS) [3]. GSE and GDS descriptions are often too broad, and better annotation quality can be achieved if the GEO samples (GSM) are considered directly. Here we report a novel approach for annotating GSM objects that combines text mining with global gene expression similarity. We hypothesize that if an unlabeled and a labeled object are highly similar in their expression values, the biological material analyzed on the two microarrays is related, and hence the class/annotation of the labeled object can help annotate the unlabeled one. By combining both types of information stored in microarray databases, our method achieves a higher percentage of semantic annotation.

Results

The GSM free-text descriptions (downloaded from GEO in November 2007) were mined using ProMiner [4], a Named Entity Recognition tool, with dictionaries derived from the cell, tissue and disease ontologies from OBO [5] plus cell line resources. This resulted in class labeling of 73.5–97.6% of the GSM objects (Table 1). Next, the labeled objects were used to annotate the unlabeled objects. We computed the correlation matrix for all objects for which raw data were available and followed the nearest neighbor approach [6] to identify the nearest labeled object within a δ range. The δ value is an empirically determined input parameter that limits the propagation of annotations between objects that are too dissimilar. In this study we selected a δ value of 0.04 and observed an increase in the annotation percentage of up to 4.9%, depending on the platform. After annotation propagation, the overall class labeling percentage reached 78.4–99.4% (Table 1). The results were then stored in a relational database that allows semantic searching for microarray experiments.
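The propagation step can be illustrated with a minimal sketch. It is not the implementation used in this study: it assumes Pearson correlation between samples and treats 1 − r as the distance compared against δ, neither of which is specified above. The function name `propagate_labels` and the toy expression values are hypothetical and serve only as an illustration.

```python
import numpy as np

def propagate_labels(expr, labels, delta=0.04):
    """Copy the label of the nearest labeled sample to each unlabeled sample,
    but only if that neighbor lies within `delta`.

    expr   : array of shape (n_samples, n_genes) with expression values
    labels : list of length n_samples; None marks an unlabeled sample
    delta  : maximum allowed distance (assumed here to be 1 - Pearson r)
    """
    corr = np.corrcoef(expr)      # sample-by-sample correlation matrix
    dist = 1.0 - corr             # assumption: correlation distance
    labeled_idx = np.array([i for i, lab in enumerate(labels) if lab is not None])
    result = list(labels)
    if labeled_idx.size == 0:
        return result             # nothing to propagate from
    for i, lab in enumerate(labels):
        if lab is not None:
            continue
        d = dist[i, labeled_idx]  # distances to labeled samples only
        nearest = int(np.argmin(d))
        if d[nearest] <= delta:   # reject neighbors that are too dissimilar
            result[i] = labels[labeled_idx[nearest]]
    return result

# Toy data (invented): four samples measured on four genes.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0],   # labeled
    [1.1, 2.0, 3.1, 4.0],   # unlabeled, nearly identical profile to the first
    [4.0, 1.0, 3.0, 2.0],   # labeled
    [2.0, 4.0, 1.0, 3.0],   # unlabeled, uncorrelated with any labeled sample
])
labels = ["liver", None, "brain", None]

result = propagate_labels(expr, labels, delta=0.04)
print(result)
```

In this toy example the second sample inherits the label of the first, while the fourth stays unlabeled because no labeled sample lies within δ. Raising δ would label more samples at the cost of propagating annotations between less similar profiles.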

Table 1 GSM object annotation coverage.

Conclusions and perspectives

The propagation of a class/annotation from a labeled object to an unlabeled object works only if at least one labeled object lies within the δ range. Thus the chances of class propagation increase with the number of available objects. We plan to improve on this by merging data from different types of microarray platforms using tools like AILUN [7], as well as by adding a confidence score to the propagated annotations. Ultimately, the annotation process will be automated and the resulting database made freely available.

References

1. Gene Expression Omnibus [http://www.ncbi.nlm.nih.gov/geo/]
2. Butte AJ, Kohane IS: Creation and implications of a phenome-genome network. Nat Biotechnol 2006, 24: 55–62. 10.1038/nbt1150
3. Unified Medical Language System [http://www.nlm.nih.gov/research/umls/]
4. Hanisch D, Fundel K, Mevissen HT, Zimmer R, Fluck J: ProMiner: rule-based protein and gene entity recognition. BMC Bioinformatics 2005, 6(Suppl 1): S14. 10.1186/1471-2105-6-S1-S14
5. The Open Biomedical Ontologies [http://obofoundry.org/]
6. Dasarathy BV: Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press; 1991.
7. Chen R, Li L, Butte AJ: AILUN: reannotating gene expression data automatically. Nature Methods 2007, 4: 879. 10.1038/nmeth1107-879

Author information

Authors and Affiliations

Department of Cell Biology, Institute for Biomedical Engineering, RWTH Aachen, University Medical School, 52074, Aachen, Germany
David Ruau & Martin Zenke
Fraunhofer Institute SCAI, Schloss Birlinghoven, 53754, Sankt Augustin, Germany
Corinna Kolárik, Heinz-Theodor Mevissen & Martin Hofmann-Apitius
Department of Applied Life Science Informatics, Bonn-Aachen International Center for Information Technology (B-IT), 53113, Bonn, Germany
Corinna Kolárik & Martin Hofmann-Apitius
Data management and data exploration group, RWTH Aachen University, Germany
Emmanuel Müller, Ira Assent, Ralph Krieger & Thomas Seidl
Helmholtz Institute for Biomedical Engineering, RWTH Aachen University, 52074, Aachen, Germany
Martin Zenke

Corresponding author

Correspondence to David Ruau.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Ruau, D., Kolárik, C., Mevissen, HT. et al. Public microarray repository semantic annotation with ontologies employing text mining and expression profile correlation. BMC Bioinformatics 9 (Suppl 10), O5 (2008). https://doi.org/10.1186/1471-2105-9-S10-O5

Published: 30 October 2008
DOI: https://doi.org/10.1186/1471-2105-9-S10-O5
