Database

Text-mining applied to autoimmune disease research: the Sjögren’s syndrome knowledge base

Sven-Ulrik Gorr1*, Trevor J Wennblom2, Steve Horvath3, David TW Wong4 and Sara A Michie5

Author Affiliations

1 Department of Diagnostic and Biological Sciences, University of Minnesota School of Dentistry, Minneapolis, MN, 55455, USA

2 Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN, 55455, USA

3 Department of Biostatistics, School of Public Health, University of California, Los Angeles, CA, 90095, USA

4 School of Dentistry, University of California, Los Angeles, CA, 90095, USA

5 Department of Pathology, Stanford University School of Medicine, Stanford, CA, 94305, USA

BMC Musculoskeletal Disorders 2012, 13:119. doi:10.1186/1471-2474-13-119

Published: 3 July 2012



Sjögren’s syndrome is a tissue-specific autoimmune disease that affects exocrine tissues, especially salivary glands and lacrimal glands. Despite a large body of evidence gathered over the past 60 years, significant gaps still exist in our understanding of Sjögren’s syndrome. The goal of this study was to develop a database that collects and organizes gene and protein expression data from the existing literature for comparative analysis with future gene expression and proteomic studies of Sjögren’s syndrome.


To catalog the existing knowledge in the field, we used text mining to generate the Sjögren’s Syndrome Knowledge Base (SSKB) of published gene/protein data. Text mining of over 7,700 PubMed abstracts yielded approximately 500 potential genes/proteins. The raw data were manually evaluated to remove duplicates and false positives and to assign gene names. The database was manually curated to 477 entries, including 377 potential functional genes, which were used for enrichment and pathway analysis based on Gene Ontology and KEGG pathways.
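The two computational ideas in this workflow, collapsing duplicate hits and testing a gene list for term enrichment, can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the abstract states that curation was manual and does not name the enrichment tool or statistic. The function names `dedupe_symbols` and `enrichment_p_value` are hypothetical, and the one-sided hypergeometric test is assumed here only because it is a common choice for over-representation analysis.

```python
from math import comb


def dedupe_symbols(raw_hits):
    """Collapse case and whitespace variants of a symbol, keeping first-seen order.

    Hypothetical helper: the source describes manual curation, not this rule.
    """
    seen = set()
    unique = []
    for hit in raw_hits:
        key = hit.strip().upper()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def enrichment_p_value(population, annotated, sample, overlap):
    """One-sided hypergeometric p-value, P(X >= overlap).

    population: number of genes in the background set
    annotated:  background genes carrying the term (e.g. one pathway)
    sample:     size of the curated gene list
    overlap:    genes in the list that carry the term
    Assumed statistic; the source does not specify which test was used.
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy input, made up for illustration only.
print(dedupe_symbols([" il6", "IL6", "TNF", "tnf ", ""]))  # ['IL6', 'TNF']
```

In practice a symbol-normalization step like this only catches trivial variants; synonyms and false positives still require the kind of manual review the authors describe.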


The Sjögren’s Syndrome Knowledge Base can form the foundation for an informed search of existing knowledge in the field as new potential therapeutic targets are identified by conventional or high-throughput experimental techniques.