The automated retrieval and integration of information about protein point mutations in combination with structure, domain and interaction data from literature and databases promises to be a valuable approach to study structure-function relationships in biomedical data sets.
We developed a rule- and regular expression-based protein point mutation retrieval pipeline for PubMed abstracts, which shows an F-measure of 87% for the mutation retrieval task on a benchmark dataset. In order to link mutations to their proteins, we utilize a named entity recognition algorithm for the identification of gene names co-occurring in the abstract, and establish links based on sequence checks. Conversely, we show that gene recognition improved from 77% to 91% F-measure when considering mutation information given in the text. To demonstrate practical relevance, we utilize mutation information from text to evaluate a novel solvation energy-based model for the prediction of stabilizing regions in membrane proteins. For five G protein-coupled receptors we identified 35 relevant single mutations and associated phenotypes, of which none had been annotated in the UniProt or PDB database. In 71% of cases, the reported phenotypes were in compliance with the model predictions, supporting a relation between mutations and stability issues in membrane proteins.
We present a reliable approach for the retrieval of protein mutations from PubMed abstracts for any set of genes or proteins of interest. We further demonstrate how amino acid substitution information from text can be utilized for protein structure stability studies on the basis of a novel energy model.
Proteins carry out most cellular functions, acting as building blocks for structures, as enzymes, and as gene regulators, and are involved in cell mobility and communication. Proteins may interact briefly with each other in an enzymatic reaction, or for a long time to form part of a protein complex. The interactions between proteins are of central importance for almost all processes in living cells, and are described by numerous distinct pathways in databases such as KEGG. Malfunctions or alterations in such pathways can be the cause of many diseases, for instance when the biosynthesis of involved proteins is repressed or proteins do not interact the way they should. The latter can be due to structural changes in one of the interacting proteins, caused by point mutations, i.e. substitutions of single wild-type amino acids. Indeed, it is already well known that such mutations are the cause of many hereditary diseases. Thus the large-scale analysis of point mutation data in combination with information about protein interactions, protein structure, and disease pathogenesis might facilitate the study of still unresolved phenotypes and diseases. Despite the availability of numerous biomedical data collections, valuable information about mutation-phenotype associations is still hidden in non-structured text in the biomedical literature. This knowledge can be extracted by text mining, stored in a homogeneous data store, and integrated with already available data from suitable databases. Combining all data, new hypotheses can be formulated, such as the prediction of phenotypic effects induced by mutations.
Genomic variation data have already been collected for many years. Single nucleotide polymorphisms (SNPs), which make up about 90% of all human genetic variation and occur every 100 to 300 bases along the 3-billion-base human genome, are available as large collections. Single amino acid polymorphisms (SAPs), originating from wet lab experiments, are often manually extracted from literature and curated into databases. Additionally, the structures of some mutants may be resolved in crystallography experiments and might eventually end up as distinct structures in the Protein Data Bank (PDB). Of particular interest is the identification of mutations which have a strong influence on the stability of proteins. Therefore, the biomedical literature can be systematically searched for information about mutation-phenotype associations by text mining, which may lead to new insights beyond the information in existing databases. For the text-mined data it is additionally possible to weight or prioritize information according to publication date, the involved authors, and journals. Consideration of such metadata can be relevant for detecting that an already published assumption has been proven wrong in a more recent publication, or for determining whether a protein has only recently attracted interest or whether the information has already been available for years. Furthermore, it is possible to obtain a more detailed view of a protein's characteristics, for example, if a certain interaction only takes place under specific conditions, or if an interaction is prevented by the conformational change of a protein domain triggered by a point mutation.
Data on mutations have been collected for years, for numerous species and by different organizations for diverse purposes. There are many efforts to cope with the data, which are being made available in a growing number of databases. The Human Genome Variation Society promotes the collection, documentation and free distribution of genomic variation information. New mutation databases are reported in the journal Human Mutation on a regular basis. There are manually curated databases like OMIM and the UniProt Knowledgebase [6,7], and general central repositories like the Human Gene Mutation Database HGMD (now part of BIOBASE), the Universal Mutation Database, the Human Genome Variation Database, or MutDB. Besides these central repositories, there are small specialized databases, such as the infevers autoinflammatory mutation online registry, the GPCR NaVa database for natural variants in human G protein-coupled receptors, or the Pompe disease mutation database with 107 sequence variants. Table 1 compares available mutation databases in terms of their scope and information content.
Table 1. Mutation databases: Most of the available mutation databases focus on mutations from human, or on specific protein families (e.g. G protein-coupled receptors). Some lack well-defined information on mutant phenotypes and only a few link to interaction data. Half of the databases also contain data retrieved by text mining methods.
In contrast, unpublished SNPs normally make their way into large locus-specific data repositories. Since August 2006, the wiki-based SNPedia (http://www.snpedia.com/index.php/SNPedia) has offered an alternative to classical databases for collecting information on variations in human DNA.
Despite the availability of numerous biomedical data collections, valuable information about mutation-phenotype associations is still hidden in non-structured text in the biomedical literature. Hence, text mining methods are implemented to automatically retrieve these data from the 18 million referenced articles in PubMed [15-19]. Text mining aims to generate new hypotheses through the automatic extraction and integration of information spread over several natural language texts. One of the key prerequisites for finding new facts (e.g. interactions or mutations) is named entity recognition (NER) in text [20,21]: the assignment of a class to an entity (e.g. protein), as well as a preferred term or identifier in case an entry exists in a database, such as UniProt, or in a controlled vocabulary like the Gene Ontology (GO). For the task of named entity recognition, usually a dictionary is used, which contains a list of all known entity names of a class (e.g. human proteins) including synonyms. For the recognition of patterns (e.g. database identifiers like NM_12345), regular expressions can be defined. For the analysis of whole sentences, natural language processing (NLP) techniques are used, which aim to understand text on a syntactic and semantic level. This approach is often paired with systems which are based on a set of manually defined rules or which make use of (semi-)supervised machine learning algorithms.
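The combination of dictionary lookup and regular expressions described above can be illustrated with a minimal Python sketch. The dictionary entries, the `TOY:` identifiers, and the function name `tag_entities` are hypothetical and serve only as an illustration; real systems use far larger dictionaries and additional disambiguation.

```python
import re

# Toy dictionary: entity name or synonym -> preferred identifier.
# Names and "TOY:" identifiers are illustrative assumptions only.
PROTEIN_DICTIONARY = {
    "rhodopsin": "TOY:1",
    "opsin 2": "TOY:1",
    "cytochrome c peroxidase": "TOY:2",
}

# Pattern for database identifiers shaped like NM_12345 (two capital
# letters, underscore, digits, optional version suffix).
IDENTIFIER_PATTERN = re.compile(r"\b[A-Z]{2}_\d+(?:\.\d+)?\b")


def tag_entities(text):
    """Return sorted (start, end, class, identifier) tuples found in text."""
    hits = []
    lowered = text.lower()
    # Dictionary-based recognition: case-insensitive whole-word lookup.
    for name, identifier in PROTEIN_DICTIONARY.items():
        for match in re.finditer(rf"\b{re.escape(name)}\b", lowered):
            hits.append((match.start(), match.end(), "protein", identifier))
    # Pattern-based recognition of database identifiers.
    for match in IDENTIFIER_PATTERN.finditer(text):
        hits.append((match.start(), match.end(), "identifier", match.group()))
    return sorted(hits)


if __name__ == "__main__":
    for hit in tag_entities("Rhodopsin (NM_12345) is a G protein-coupled receptor."):
        print(hit)
```

Mapping both "rhodopsin" and "opsin 2" to the same identifier shows how synonyms are normalized to one preferred term.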
In recent years, there have been diverse examples of the successful application of text mining to the mutation retrieval task. Early examples are the automatic extraction of mutations from Medline and cross-validation with OMIM, and mining OMIM for phenotypic and genetic information to gain insights into complex diseases. More recently, a concept recognition system based on regular expressions was applied to the mutation mining task. GraB, a tool for the automatic extraction of protein point mutations using a graph bigram association, was reported to reliably find gene-mutation associations in full text. For identifying gene-specific variations in biomedical text, the ProMiner system developed for the recognition and normalization of gene and protein names was integrated with a conditional random field (CRF)-based recognition system. As an answer to the diverse approaches developed over the past years, a framework for the systematic analysis of mutation extraction systems was proposed.
A growing number of groups are working on protein mutations and their involvement in diseases. A recent overview is given at . Kanagasabai et al. developed mSTRAP (Mutation extraction and STRucture Annotation Pipeline) for mining mutation annotations from full-text biomedical literature, which they subsequently used for protein structure annotation and visualization. Worth et al. use structure prediction to analyse the effects of non-synonymous single nucleotide polymorphisms (nsSNPs) with regard to diseases. Focusing on Alzheimer's disease, Erdogmus et al. developed MuGeX to extract mutation-gene pairs, with an estimated recall of 91.3% and a precision of 88.9%. Lage et al. realized a human phenome-interactome network of protein complexes implicated in genetic disorders by integrating quality-controlled interactions of human proteins with a validated, computationally derived phenotype similarity score.
We compared the above-mentioned mutation extraction approaches with regard to their strengths and weaknesses. MutationFinder is still used as a reference system for the pure mutation extraction task, although it does not distinguish between mutations on the DNA and protein level, and does not support grounding to genes. MuGeX finds textual descriptions of mutations and distinguishes between DNA and protein mutations, but its mutation grounding relies only on proximity and does not consider sequence information. The mutation grounding approach used in mSTRAP considers sequence information, but allows only mutation-protein pairs that co-occur in one sentence, and its mutation extraction relies on simple regular expressions. Finally, GraB is a successful approach which implements the grounding and disambiguation techniques discussed above, but might be computationally too expensive for large data sets. Towards the development of an automated system for the interpretation of structure-function relations in the context of genetic variability data, we chose to design our own protein mutation retrieval system. We aim at a system which identifies and grounds protein mutations based on sequence information and proximity at a high recall. At the same time, we need a flexible system that can be applied to diverse biomedical questions and has moderate computational requirements.
As we have motivated above, novel gene-disease associations or the influence of mutations on protein-protein interactions can be discovered through combination of data from literature and databases. Hence, we designed a generic mutation centred approach that can be applied to any kind of genetic data for answering disease-centred questions. As a prerequisite, we consider available high quality data on protein point mutations from curated databases and from peer-reviewed literature. For the latter, we present a flexible approach for both the specific and high-throughput retrieval of mutations. In detail, the following tasks have to be performed: (1) Identify genes/proteins in abstracts. (2) From this subset of abstracts consider only those which additionally contain information about mutations. (3) Propose potential protein-mutation pairs. (4) Filter proposed pairs by sequence checks. (5) Utilize this information for the refinement of the original gene/protein identifier.
This module allows for the automated named entity recognition of genes and proteins. Our approach performs gene name disambiguation by using background knowledge to match a gene with its context against the text as a whole. A gene's context contains information on Gene Ontology annotations, functions, tissues, diseases etc. extracted from the databases Entrez Gene and UniProt. A comparison of gene contexts against the text gives a ranking of candidate identifiers, and the top-ranked identifier is taken if it scores above a defined threshold. This approach has recently been extended for inter-species gene normalization and achieves an 81% success rate on a mixed dataset of 13 species.
We implemented an entity recognition algorithm (MutationTagger) to automatically extract protein point mutation mentions from PubMed abstracts. The wild-type and mutant amino acids, as well as the sequence position of the substitution, are extracted by means of both a set of regular expressions for pattern recognition of one- or three-letter notations (e.g. E312A or Glu(312) → Ala), and rules for the more complex identification of textual mutation descriptions (e.g. Glu312 was replaced with alanine). Problems concerning the full text representations (detecting the correct sequence position of the mutated residue and unraveling enumerations) have been addressed by additional extraction algorithms and the implementation of a sequence check. An evaluation of our method on the test data from MutationFinder showed comparable success rates of 88% F-measure for mutation mention extraction (see Table 2).
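To make the three mention styles named above concrete, the following Python sketch recognizes one-letter notation (E312A), three-letter notation (Glu(312) → Ala) and a simple textual description (Glu312 was replaced with alanine). It is not the MutationTagger rule set: the patterns, the name `extract_mutations`, and the choice of trigger words are assumptions made for illustration, and it handles neither enumerations nor false positives such as cell line names.

```python
import re

# (one-letter code, three-letter code, full name) of the 20 standard amino acids
AMINO_ACIDS = [
    ("A", "Ala", "alanine"), ("R", "Arg", "arginine"), ("N", "Asn", "asparagine"),
    ("D", "Asp", "aspartate"), ("C", "Cys", "cysteine"), ("Q", "Gln", "glutamine"),
    ("E", "Glu", "glutamate"), ("G", "Gly", "glycine"), ("H", "His", "histidine"),
    ("I", "Ile", "isoleucine"), ("L", "Leu", "leucine"), ("K", "Lys", "lysine"),
    ("M", "Met", "methionine"), ("F", "Phe", "phenylalanine"), ("P", "Pro", "proline"),
    ("S", "Ser", "serine"), ("T", "Thr", "threonine"), ("W", "Trp", "tryptophan"),
    ("Y", "Tyr", "tyrosine"), ("V", "Val", "valine"),
]

# Lookup from any notation to the one-letter code.
TO_ONE = {}
for one, three, name in AMINO_ACIDS:
    TO_ONE[one] = TO_ONE[three.lower()] = TO_ONE[name] = one

ONE = "".join(a[0] for a in AMINO_ACIDS)
THREE = "|".join(a[1] for a in AMINO_ACIDS)
NAMES = "|".join(a[2] for a in AMINO_ACIDS)

PATTERNS = [
    # One-letter notation, e.g. E312A
    re.compile(rf"\b(?P<wt>[{ONE}])(?P<pos>[1-9]\d*)(?P<mt>[{ONE}])\b"),
    # Three-letter notation, e.g. Glu312Ala, Glu(312) → Ala, Glu312 to Ala
    re.compile(
        rf"\b(?P<wt>{THREE})\s*\(?(?P<pos>[1-9]\d*)\)?\s*(?:→|->|to)?\s*(?P<mt>{THREE})\b"
    ),
    # Textual description, e.g. Glu312 was replaced with alanine
    re.compile(
        rf"\b(?P<wt>{THREE})\s*\(?(?P<pos>[1-9]\d*)\)?\s+(?:was\s+)?"
        rf"(?:replaced|substituted)\s+(?:with|by)\s+(?P<mt>{NAMES})\b",
        re.IGNORECASE,
    ),
]


def to_one(token):
    """Normalize a one-letter, three-letter or full amino acid name."""
    return TO_ONE[token if len(token) == 1 else token.lower()]


def extract_mutations(text):
    """Return a set of (wild_type, position, mutant) tuples in one-letter code."""
    found = set()
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.add(
                (to_one(match.group("wt")), int(match.group("pos")), to_one(match.group("mt")))
            )
    return found


if __name__ == "__main__":
    print(extract_mutations("Glu312 was replaced with alanine (E312A)."))
```

Normalizing every mention to the same tuple means that different spellings of one substitution collapse into a single entry, which simplifies the later sequence check.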
Table 2. Mutation retrieval task: Evaluation of precision (P), recall (R), and F-measure (F) on a benchmark set provided with the MutationFinder algorithm. Our MutationTagger performs in general comparably to MutationFinder. Although MutationFinder shows a slightly better overall performance, in the high recall mode MutationTagger extracts more mutations, which is desirable for the subsequent grounding and gene normalization improvement task.
In the process of recognizing mutations in text the direct association to specific proteins and genes remains a challenge. This is due to the fact that the abstracts of relevant publications typically mention more than one mutation or protein, respectively. Thus, a mutation – protein association purely based on their co-occurrence in one abstract is not sufficient, as the consideration of all possible combinations of mutations and proteins would result in a significant number of false positive predictions. The problem becomes even more evident, when considering that both gene and mutation tagging are imperfect, achieving a precision of 80 to 90% each.
We are aiming at an approach that disambiguates the relations of candidate mutations with proteins, and at the same time filters out false positives from the underlying mutation and gene recognition tasks. Approaches have already been developed which apply a word distance metric for assigning a mutation to its nearest occurring protein term. This is error-prone, as a matching mutation and protein do not necessarily have to occur close to each other in the abstract or even in the same sentence. The statistical approach GraB, a tool for the automatic extraction of protein point mutations using a graph bigram association, achieves good results of up to 79% F-measure for mutation-protein association, but on its own would not fulfil the second requirement of filtering out false positives.
Mutations are commonly described as the substitution of a wild-type by a mutant amino acid at a given position. Our method compares the wild-type residue as described in a mutation mention with the UniProt/Swiss-Prot and PDB protein sequences for all candidate proteins. It is important to incorporate sequences from both repositories, as the sequence numbering can differ and it is not always evident from a publication's abstract which numbering the authors are referring to. To map UniProt IDs to PDB and vice versa, we used the PDB cross-references in UniProtKB/Swiss-Prot (http://beta.uniprot.org/docs/pdbtosp) and the residue-specific comparison between PDB and Swiss-Prot sequences (http://www.bioinf.org.uk/pdbsws/) as provided by Martin et al. Only associations between mutations and proteins with matching amino acids are considered, whereas the score of mismatches is set to 0. Matching pairs are scored based on their proximity, favouring pairs that co-occur in the same sentence. We assign the score to the gene-mutation pair, but also keep track of the particular Swiss-Prot and/or PDB sequence (including chain information) that matched the mutation. In the case of a shift between Swiss-Prot and PDB sequences, we calculate the correct numbering for the shifted sequence utilizing the mapping table by Martin et al. Through the consideration of both sequence and proximity information, exactly one gene match is determined for each mutation, even if more than one protein-mutation pair is possible.
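The sequence check and proximity scoring can be sketched as follows. The text only states that mismatches score 0 and that pairs in the same sentence are favoured; the concrete score 1 / (1 + sentence distance), the class and function names, and the toy sequences are assumptions for illustration, and the Swiss-Prot/PDB numbering shift is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Mutation:
    wild_type: str   # one-letter code of the wild-type residue
    position: int    # 1-based sequence position
    mutant: str      # one-letter code of the mutant residue
    sentence: int    # index of the sentence containing the mention


@dataclass
class GeneMention:
    gene_id: str
    sentence: int
    # sequence identifier (e.g. a Swiss-Prot or PDB chain ID) -> sequence
    sequences: dict = field(default_factory=dict)


def sequence_matches(mutation, sequence):
    """True if the sequence carries the wild-type residue at the position."""
    index = mutation.position - 1
    return 0 <= index < len(sequence) and sequence[index] == mutation.wild_type


def score_pair(mutation, gene):
    """Return (score, matching sequence IDs); mismatches score 0."""
    matched = [sid for sid, seq in gene.sequences.items() if sequence_matches(mutation, seq)]
    if not matched:
        return 0.0, []
    distance = abs(mutation.sentence - gene.sentence)
    # Assumed proximity score: highest for pairs in the same sentence.
    return 1.0 / (1 + distance), matched


def ground(mutation, genes):
    """Return the single best-scoring gene ID, or None if nothing matches."""
    best_id, best_score = None, 0.0
    for gene in genes:
        score, _ = score_pair(mutation, gene)
        if score > best_score:
            best_id, best_score = gene.gene_id, score
    return best_id


if __name__ == "__main__":
    mutation = Mutation("W", 4, "G", sentence=0)
    genes = [
        GeneMention("geneA", sentence=2, sequences={"seqA": "MKTWA"}),
        GeneMention("geneB", sentence=0, sequences={"seqB": "MKTAA"}),
    ]
    print(ground(mutation, genes))  # geneB is closer, but only geneA passes the check
```

In this toy case the nearer gene is rejected because its sequence lacks the wild-type residue, which is the situation where a purely distance-based assignment would fail.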
The developed mutation retrieval pipeline can be accessed through two different interfaces (see Figure 1), which offer either a systematic or quick and flexible solution, dependent on the annotation task. The following approaches have been implemented:
Organism-centred approach (database)
All available mutations for a given organism will be retrieved in one literature screening and stored in the Mutation database. This approach relies on the large-scale identification of gene mentions in PubMed abstracts, which have to be compiled for organisms of interest prior to a mutation screening. As of now, gene mention data is available for Human, Mouse, Yeast, Rat, Fruit Fly, E. coli, A. thaliana, C. elegans, Zebrafish, and H. pylori. Data for additional relevant organisms will be added on a regular basis in the near future.
Protein-centred approach (on-the-fly)
It is possible to retrieve relevant data for a single gene or a list of genes/proteins for any organism. For this purpose, relevant documents are obtained via keyword searches from the PubMed library using the Entrez Programming Utilities. As with the large-scale identification of gene mentions in PubMed abstracts in the organism-centred approach, the result is a set of abstracts, which is subsequently processed by the MutationTagger.
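A keyword search of this kind can be issued against the ESearch endpoint of the Entrez Programming Utilities. The sketch below only assembles the request and, optionally, reads the returned PubMed IDs; the query construction and the mutation keywords are assumptions, since the actual search terms used in the pipeline are not specified here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Assumed keywords for restricting results to mutation-related abstracts.
MUTATION_KEYWORDS = ("mutation", "mutant", "substitution")


def build_query(protein_names):
    """Combine protein names (title/abstract search) with mutation keywords."""
    names = " OR ".join(f'"{name}"[Title/Abstract]' for name in protein_names)
    keywords = " OR ".join(MUTATION_KEYWORDS)
    return f"({names}) AND ({keywords})"


def esearch_url(protein_names, retmax=100):
    """Build an ESearch URL that returns PubMed IDs as JSON."""
    params = {
        "db": "pubmed",
        "term": build_query(protein_names),
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{ESEARCH}?{urlencode(params)}"


def fetch_pmids(protein_names, retmax=100):
    """Run the search (requires network access) and return a list of PMIDs."""
    with urlopen(esearch_url(protein_names, retmax)) as response:
        return json.load(response)["esearchresult"]["idlist"]


if __name__ == "__main__":
    print(esearch_url(["rhodopsin", "opsin 2"]))
```

The returned PubMed IDs would then be used to download the abstracts (for example with EFetch) before mutation tagging.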
Figure 1. Mutation retrieval workflow. Workflow of mutation data retrieval with MutationTagger. A: PubMed IDs of abstracts mentioning proteins for given species are retrieved from a local database (gene2pubmed), which contains the results of our gene normalizing approach. Mutations are identified in the abstracts and stored (mutation2pubmed). The gene and mutation data is joined, filtered by sequence checks, and stored (mutation2gene). B: For a queried protein or gene relevant articles are retrieved from the Entrez database. Mutations are identified in the abstracts, sequence checks against the queried protein are performed, and the checked mutation data is exported to HTML or SQL.
Improvement of gene normalization
As described above, we defined the input set of documents for the organism-centred mutation mining approach by scanning the whole PubMed database for abstracts mentioning at least one gene or protein of a pre-defined species. For this filtering step, we relied on the gene normalization techniques of our gene normalizer, which was applied to all PubMed abstracts in advance and has shown 85% F-measure for human genes and slightly lower values for other species. However, the gene normalization proposes only one identifier per gene mention, even if a set of different candidate identifiers was computed. According to internal ranking mechanisms, only the top-scoring candidate is considered. As a consequence, in some cases the correct identifier is ranked lower and is neglected in any subsequent data processing. In the case of our mutation mining algorithm, we assume that some mutations cannot be associated with the correct protein because the gene tagging task has already failed.
On the other hand, it should be possible to improve the performance of both entity recognition techniques for genes and mutations by combining the results. The idea is to run both approaches in a low-precision, high-recall setting, associate all genes with all mutations, and then consider the intersection of all combinations that fit. A mutation and a gene product are considered to be a valid pair if the wild-type residue at the mutated position in the protein sequence matches the one in the reported mutation (as described in section Sequence Checks). For all proposed gene identifiers, protein sequences are obtained and checked for compliance with the reported wild-type amino acid. The score of identifiers that show a match is increased, which might lead to a re-ranking of the identifiers for one gene entity. This could further improve the original gene normalization approach for candidate entities which are reported to show a mutation.
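The re-ranking step can be illustrated with a small Python sketch. The additive bonus per supported mutation, the function name `rerank_candidates`, and the toy scores and sequences are assumptions; the text states only that the score of matching identifiers is increased.

```python
def supports(sequence, wild_type, position):
    """True if the sequence has the wild-type residue at the 1-based position."""
    return 0 < position <= len(sequence) and sequence[position - 1] == wild_type


def rerank_candidates(candidates, mutations, bonus=1.0):
    """Re-rank gene candidates using mutation support.

    candidates: list of (gene_id, normalizer_score, list_of_sequences)
    mutations:  list of (wild_type, position, mutant) tuples
    bonus:      assumed score increase per mutation supported by any sequence
    """
    rescored = []
    for gene_id, score, sequences in candidates:
        supported = sum(
            1
            for wild_type, position, _ in mutations
            if any(supports(seq, wild_type, position) for seq in sequences)
        )
        rescored.append((gene_id, score + bonus * supported))
    # Highest score first; ties keep the original candidate order.
    return sorted(rescored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy data: the initially top-ranked candidate lacks the wild-type residue.
    candidates = [
        ("candidate_1", 0.75, ["MKTAG"]),
        ("candidate_2", 0.5, ["MKTWG"]),
    ]
    print(rerank_candidates(candidates, [("W", 4, "G")]))
```

With this toy input the second candidate moves to the top because only its sequence carries the reported wild-type residue at position 4.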
As shown in Figure 2, our gene normalizer identified CCP (human crystallin, gamma D) with EntrezGene ID 1421 as the top candidate gene for abstract PMID 8142383. MutationTagger identified a replacement of tryptophan with glycine at position 191 as the only mutation mentioned in the paper. None of the protein sequences retrieved for human CCP showed a tryptophan residue at position 191, which means that this gene identifier was not supported by mutation information. However, besides human crystallin, cytochrome-c peroxidase in yeast (EntrezGene ID 853940) was also proposed as an alternative, lower-ranked identifier. As the product of this gene showed a tryptophan residue at position 191 (according to the PDB sequence), the score was increased, making it the new top candidate. Indeed, manual curation of the corresponding literature confirmed that the only gene mentioned in the abstract is cytochrome-c peroxidase in yeast.
Figure 2. Improvement of gene normalization. Example for gene name normalization with the help of mutation mining. Initially, our gene normalizer proposed the human gene CCP as its context fits the text best (abstract not fully shown). However, when comparing the recognized mutation at position 191 with the sequences of all three candidates, only CCP in yeast contains the wild-type tryptophan at the specified position (PDB entry). After checking the full text of this publication, we found that CCP indeed refers to the gene in Saccharomyces cerevisiae.