State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, PR China

Abstract

Background

Considerable efforts have been made to extract protein-protein interactions from the biological literature, but little work has been done on the extraction of interaction detection methods. It is crucial to annotate the detection methods in the literature, since different detection methods confer different degrees of reliability on the reported interactions. However, the diversity of method mentions in the literature makes automatic extraction quite challenging.

Results

In this article, we develop a generative topic model, the Correlated Method-Word model (CMW model), to extract the interaction detection methods from the biological literature. By introducing latent topic factors, the model formulates the correlation between the detection methods and related words in a probabilistic framework, so that the potential methods can be inferred from the observed words.

Conclusion

From the promising experimental results, we can see that the proposed CMW model captures the underlying correlations between the detection methods and related words and achieves satisfactory extraction performance.

Background

Interaction detection method extraction

The study of protein interactions is one of the most pressing biological problems. In the literature mining community, considerable efforts have been made to automatically extract protein-protein interactions (PPIs) from the text.

Nevertheless, little work has been done to automatically extract the interaction detection methods from the literature. The detection methods available to identify protein interactions vary in their level of resolution and confidence of reliability. Therefore, it is important to identify such detection methods in order to validate the reported interactions. Some interaction databases, such as IntAct and MINT, annotate the detection method supporting each curated interaction.

The first critical assessment of detection method extraction was carried out in the BioCreative II challenge, where the interaction method subtask required participants to identify the detection methods, defined in the PSI-MI ontology, that were applied to support the reported interactions.

The diversity of method mentions in the literature is the major obstacle precluding automatic extraction. In practice, different authors prefer different words and phrases to describe the same method. For example, the detection method "two hybrid" (MI:0018) may be mentioned as "yeast two-hybrid", "Y2H" or "two-hybrid screen" in different articles.

To validate this diversity, we apply a string matching algorithm with all the names/synonyms from the MI ontology on a set of 740 documents, annotated with 96 methods and provided by the BioCreative II challenge.

String matching performance.

| Data set | Precision | Recall | F-measure |
| --- | --- | --- | --- |
| 740 Full Texts | 0.090 | 0.107 | 0.098 |

We apply a string matching algorithm with all the names/synonyms from the MI ontology on a set of 740 documents, annotated with 96 methods and provided by the BioCreative II challenge.

As Table 1 shows, the simple string matching approach performs quite poorly, which confirms that the diverse method mentions can hardly be covered by the ontology names and synonyms alone.
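For illustration, such a dictionary matcher might be sketched as follows (the synonym table here is a tiny hypothetical fragment of the MI ontology, not the real resource):

```python
import re

# Hypothetical fragment of the MI ontology: method id -> names/synonyms.
MI_SYNONYMS = {
    "MI:0018": ["two hybrid", "yeast two-hybrid", "2 hybrid"],
    "MI:0114": ["x-ray crystallography", "x-ray diffraction"],
}

def match_methods(text, synonyms=MI_SYNONYMS):
    """Return the set of MI ids whose name/synonym appears verbatim in text."""
    text = text.lower()
    hits = set()
    for mi_id, names in synonyms.items():
        for name in names:
            # word-boundary match to avoid partial-word hits
            if re.search(r"\b" + re.escape(name.lower()) + r"\b", text):
                hits.add(mi_id)
                break
    return hits
```

Because such a matcher only fires on verbatim mentions, any paraphrase or abbreviation outside the synonym list is missed, which is consistent with the low recall in Table 1.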

Another straightforward solution is to treat the extraction issue as a classification problem: for each detection method in the ontology definition, a binary classifier is built to make the yes/no decision.

From another point of view, that of the involvement of domain experts, some approaches achieved acceptable results on small data sets. In Rinaldi's work, expert-revised patterns were employed to perform the extraction, at the cost of considerable manual effort from domain experts.

Generative topic model

Nowadays, in the machine learning community, the generative topic model is receiving more and more attention. Latent Dirichlet Allocation (LDA) is a representative generative topic model, which represents each document as a mixture over latent topics, with each topic being a distribution over words.

The advantages of generative topic models are: 1) it is easy to postulate complex latent structures responsible for a set of observations; 2) the correlation between different factors can be easily exploited by introducing latent topic variables.

In this article, in order to extract the detection methods from the biological literature, we propose to formulate the correlation between the detection methods and related word occurrences in a probabilistic framework. In particular, we assume the applied methods are governed by a set of latent topics and the corresponding word descriptions are influenced by the same topic factors, which characterize the correlation between the methods and related words. Under this setting, we appeal to the generative topic model to capture such latent correlations and infer the potential methods from the observed words by statistical inference techniques.

The intuitive notion behind the proposed model is that different documents contain informative commonality in the descriptions of the same methods, so we propose to discover the common usage patterns for the desired methods from the latent correlations between the methods and related words. This is somewhat analogous to extracting templates from the overlap of different method descriptions. However, the diversity of method mentions causes traditional template generation algorithms to suffer from low support and low confidence, and when a document contains multiple methods, the traditional approach fails to figure out the latent correspondence. In contrast, the generative model deals naturally with such missing data and provides a more feasible and theoretically grounded framework.

The paper is organized as follows: in the Methods section, we present a detailed description of the proposed model and discuss the inference and parameter estimation procedures; in the Results section, we perform extensive experiments to validate the proposed model; and in the Conclusion section, we conclude the work and summarize our contributions.

Methods

Correlated Method-Word model

We present the Correlated Method-Word model (CMW model); its graphical representation is shown in Figure 1.

Graphical model representation of the CMW model

**Graphical model representation of the CMW model**. Following the standard graphical model formalism, nodes represent random variables, shaded nodes are observed variables, edges indicate conditional dependencies, and plates denote replication.

The model can be viewed in terms of a generative process: an author first selects a set of topics for his/her manuscript (e.g. physical protein-protein interactions); under different kinds of topics, there are different choices of detection methods to confirm the findings (e.g. the two hybrid method); finally, the applied methods are described with the topic-specific words.

Formally, we define a corpus of D documents. For each document, let **e** = {e_1, e_2, e_3,..., e_N} denote the N observed methods and **w** = {w_1, w_2, w_3,..., w_M} the M observed words; **z** = {z_1, z_2, z_3,..., z_N} are the discrete topic assignments for the methods, and **y** = {y_1, y_2, y_3,..., y_M} are the indexing variables that indicate which topic factor generates the corresponding word.

Conditioned on the model parameters (the Dirichlet parameter α, the topic-specific method distributions β and the topic-specific word distributions Ω), the CMW model assumes each document is generated by the following process:

1. Sample the topic proportion θ from the Dirichlet distribution: θ ~ Dir(α).

2. For each method e_n, n ∈ {1, 2, 3,..., N}:

a. Sample the topic factor z_n from the multinomial distribution θ: z_n ~ Mult(θ).

b. Sample the method e_n from the multinomial distribution conditioned on z_n: e_n ~ p(e_n|z_n, β).

3. For each related word w_m, m ∈ {1, 2, 3,..., M}:

a. Sample the indexing variable y_m from the uniform distribution conditioned on N: y_m ~ Unif(1,..., N).

b. Sample the word w_m from the multinomial distribution conditioned on z_{y_m}: w_m ~ p(w_m|z_{y_m}, Ω).
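The generative process above can be sketched in code (a toy sketch; the parameter shapes are assumptions, with beta[k] the topic-k distribution over methods and omega[k] the topic-k distribution over words):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(alpha, beta, omega, N, M):
    """Sample one document from the CMW generative process (a sketch;
    alpha, beta, omega stand in for the model parameters)."""
    theta = rng.dirichlet(alpha)                        # 1. topic proportion
    z = rng.choice(len(alpha), size=N, p=theta)         # 2a. topic per method
    e = [rng.choice(beta.shape[1], p=beta[k]) for k in z]        # 2b. methods
    y = rng.integers(0, N, size=M)                      # 3a. uniform index
    w = [rng.choice(omega.shape[1], p=omega[z[i]]) for i in y]   # 3b. words
    return e, w, z, y
```

Note how each word inherits the topic of a uniformly chosen method via y, which is exactly where the method-word correlation enters the model.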

Our basic notion about each component of this model is that the discrete occurrences of detection methods and related words in a given document are governed by the topic-specific distributions (i.e. the rows of β and Ω), and the indexing variables **y** are introduced to indicate the latent correspondence between the words and the methods' topic factors.

Thus, the joint probability of the observed methods, words and latent variables in one document is given as follows:

p(θ, **z**, **e**, **y**, **w**) = p(θ|α) ∏_{n=1}^{N} p(z_n|θ) p(e_n|z_n, β) ∏_{m=1}^{M} p(y_m|N) p(w_m|z_{y_m}, Ω)

An intuitive comparison between the traditional approaches (e.g. discriminative classification and template matching methods) and the proposed CMW model is illustrated in Figure 2.

Comparison between the traditional approach and the CMW model

**Comparison between the traditional approach and the CMW model**. In this representation, the traditional approach (left) models a direct mapping from words to methods, while the CMW model (right) routes the mapping through the latent topic factors.

The traditional approach (the left panel of Figure 2) has to model a direct mapping from the high-dimensional word space to the method space. In contrast, the CMW model (the right panel of Figure 2) decomposes this mapping through the low-dimensional latent topic space, so an efficient dimensional decomposition is explicitly implemented.

Inference and parameter estimation

Variational inference

In order to utilize the CMW model, the key inferential problem we need to solve is computing the posterior distribution of the latent variables given a document:

Unfortunately, this posterior distribution is intractable to compute exactly: the coupling between the continuous variable θ and the discrete variables **z** and **y** makes the normalizing constant infeasible to evaluate.

Although exact inference is intractable, a wide variety of approximate inference algorithms can serve the purpose, including expectation propagation, Markov chain Monte Carlo sampling and variational inference. In this work, we adopt the variational mean-field approximation for its efficiency.

In particular, we define the following fully factorized distribution on the latent variables:

q(θ, **z**, **y**) = q(θ|γ) ∏_{n=1}^{N} q(z_n|φ_n) ∏_{m=1}^{M} q(y_m|λ_m)

where the Dirichlet parameter γ and the multinomial parameters {φ_n} and {λ_m} are the free variational parameters.

The meaning of the above variational distribution is that we discard the dependencies among the latent variables by assuming they are independently drawn from their respective distributions. The aim of the variational inference is then to find the optimal variational parameters that minimize the Kullback-Leibler (KL) divergence between the variational distribution and the true posterior.

Following the general recipe for the variational approximation, we take derivatives with respect to the variational parameters and obtain the following coordinate ascent algorithm:

1. Dirichlet parameter

2. Multinomial parameter

where

3. Multinomial parameter

4. Dirichlet parameter

These update equations are invoked repeatedly until the relative change in the variational lower bound falls below a small threshold.

Having obtained the approximate posterior, we can compute the conditional distribution of interest, p(**e**|**w**), and thus infer the potential methods from the observed words.

Parameter estimation

Following a similar procedure to the variational inference, in this section we utilize an empirical Bayesian method to estimate the parameters of the CMW model: given a corpus, we maximize the variational lower bound on the log likelihood with respect to the model parameters, which yields the following update equations:

1. Update the Dirichlet parameter

where

2. Update the Dirichlet parameter

3. Update the Multinomial parameter

These update equations correspond to finding the maximum likelihood estimates using the expected sufficient statistics for each document computed under the variational posterior.

We develop an alternating variational EM procedure:

1. (E-step) For each document, find the optimal values of the variational parameters by the variational inference procedure described above.

2. (M-step) Maximize the resulting lower bound on the log likelihood with respect to the model parameters, using the update equations above.

The two steps are alternated until the lower bound on the log likelihood converges.
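The alternation can be sketched as a generic loop (schematic only; `e_step` and `m_step` stand in for the variational updates described above):

```python
def variational_em(corpus, params, e_step, m_step, tol=1e-4, max_iter=100):
    """Alternating variational EM skeleton (a sketch; e_step and m_step
    are placeholders for the document-level inference and the parameter
    re-estimation steps)."""
    bound_old = float("-inf")
    for _ in range(max_iter):
        # E-step: per-document variational inference -> expected statistics
        stats = [e_step(doc, params) for doc in corpus]
        # M-step: maximize the lower bound w.r.t. the model parameters
        params, bound = m_step(stats)
        if bound - bound_old < tol * max(1.0, abs(bound)):  # converged
            break
        bound_old = bound
    return params
```

Because the E-step decomposes over documents, this loop parallelizes naturally across the corpus.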

Results and discussion

We collect 5319 full-text documents, each annotated with the interaction detection methods applied in it, to build the test corpora for the experiments.

Test corpora

The whole corpus contains 115 unique method annotations, and each document is associated with 1.99 different methods on average. Unfortunately, the standard deviation of the method frequency is so large that the corpus is heavily unbalanced: a handful of popular methods account for most of the annotations, while the majority of methods occur only rarely.

Statistics of the corpus

**Statistics of the corpus**. In the whole corpus, the 5 dominant detection methods account for nearly 59.3% of the occurrences, and 86.1% (99 out of 115) of the methods occur in less than 10% of the documents.

From Figure 3 we can see that the corpus is heavily unbalanced, which poses a great challenge to the extraction task.

Feature selection

The χ² statistic is a commonly used measure of the dependence between two discrete variables; we employ it here to select the most informative terms.

The χ² value of word t with respect to method e_i is computed from their document co-occurrence statistics:

χ²(t, e_i) = N(AD − CB)² / ((A + C)(B + D)(A + B)(C + D))

where A is the number of documents containing both t and e_i, B the number containing t but not e_i, C the number containing e_i but not t, D the number containing neither, and N the total number of documents.

Using the χ² statistic, we approximate the dependence between word t and each method. Since a word may relate to several different methods, we combine the method-specific scores with the prior-weighted average:

χ²_avg(t) = Σ_i P(e_i) χ²(t, e_i)     (13)

where P(e_i) is the prior probability of method e_i.

In the following experiments, we select the top 3000 terms to build up the feature set according to Eq(13).
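A minimal sketch of this prior-weighted χ² selection (the `docs` format, a list of (word set, method set) pairs, is an assumed toy representation, not the paper's actual pipeline):

```python
def chi2(A, B, C, D):
    """Chi-square statistic for the 2x2 term/method contingency table:
    A = docs with both, B = term only, C = method only, D = neither."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

def avg_chi2(term, methods, docs):
    """Prior-weighted chi-square of a term, cf. Eq(13):
    sum_i P(e_i) * chi2(term, e_i)."""
    score = 0.0
    for m in methods:
        A = sum(1 for ws, ms in docs if term in ws and m in ms)
        B = sum(1 for ws, ms in docs if term in ws and m not in ms)
        C = sum(1 for ws, ms in docs if term not in ws and m in ms)
        D = len(docs) - A - B - C
        prior = (A + C) / len(docs)          # empirical P(e_i)
        score += prior * chi2(A, B, C, D)
    return score
```

Ranking the vocabulary by `avg_chi2` and keeping the top 3000 terms would reproduce the selection step described above.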

Effect of topic factors

We first use the perplexity as the criterion to evaluate the effect of the number of topic factors, which is the only free parameter in the CMW model. The perplexity of the held-out methods is defined as:

perplexity = exp( − Σ_d log p(**e**_d) / Σ_d N_d )

where N_d is the number of methods in document d and p(**e**_d) is the likelihood of the observed methods in document d under the trained model.

Better generalization capability is indicated by a lower perplexity over the held-out testing samples. We held out 20% of the collection for testing and used the remaining 80% to train the model, in accordance with 5-fold cross-validation.
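The perplexity computation itself is straightforward; a minimal sketch of the formula above:

```python
import math

def perplexity(log_probs, method_counts):
    """Method perplexity over held-out documents:
    exp(- sum_d log p(e_d) / sum_d N_d); lower is better.
    log_probs[d] is the model's log-likelihood of the methods in
    document d, method_counts[d] is N_d."""
    return math.exp(-sum(log_probs) / sum(method_counts))
```

For instance, a model that assigns each held-out method a likelihood of 0.25 yields a perplexity of 4, i.e. it is as uncertain as a uniform choice among four methods.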

Figure 4 shows the perplexity results under different numbers of topic factors.

Methods perplexity

**Methods perplexity**. Lower perplexity on the testing data indicates a better generalization capability. Here we held out 20% of collection for the testing purpose and used the remaining 80% to train the model, in accordance with 5-fold cross-validation.

Besides understanding the impact of the number of topic factors on the generalization capability, we are more interested in their explicit effect on the extraction performance. Here, we evaluate the precision and recall of the model under different numbers of topic factors, using the same data set partition as in Figure 4.

From Figure 5 we can see how the number of topic factors affects the extraction performance.

Performance on the number of topics

**Performance on the number of topics**. We use the same data set partition as in Figure 4 and evaluate the extraction performance of the CMW model under different numbers of topics.

Extraction performance

Since there is little previous work to compare with, we employ several well-studied classification models as baselines.

Each baseline treats the extraction as a set of binary classification tasks, one classifier for each detection method in the corpus. For the SVM baseline, we adopt the SVM^light implementation.

We perform comparisons on different proportions of the data used for training. In this comparison, we fix the number of topic factors in the CMW model.

From Figure 6 we can see the comparison between the CMW model and the baseline models under the different training proportions.

Comparison with the baseline models

**Comparison with the baseline models**. We compare the CMW model with the baseline models on different proportions of the data used for training.

One thing we should note is that, since the data set is unbalanced, we should also attend to the retrieval performance on the minor methods. In the method-level evaluation, the baseline models only retrieve most of the major methods (e.g. the top 5 methods) while ignoring the other minor ones, whereas the CMW model also covers many of the minor methods.

Figure 7 shows the coverage comparison with the baseline models.

Coverage comparison with the baseline models

**Coverage comparison with the baseline models**. We compare the method coverage of the CMW model with that of the baseline models.

Rinaldi utilized expert-revised patterns to perform the extraction and achieved the best performance in the BioCreative II challenge. We compare the CMW model with this best result in Table 2.

Comparison with BioCreative II best result.

| Model | Precision | Recall | F-measure |
| --- | --- | --- | --- |
| BioCreative II best | 0.506 | 0.522 | 0.483 |
| CMW model | **0.654** | **0.545** | **0.543** |
| Improvement | **+29.2%** | **+4.4%** | **+12.4%** |

We run the CMW model on the same task and compare it with the best result in the BioCreative II challenge.

Here, we briefly summarize the performance of the CMW model: it outperforms the baseline models, achieves better coverage on the minor methods, and improves on the best reported BioCreative II result.

Correlation between methods and words

To demonstrate the correlation between the different methods and words exploited by the CMW model, we compute a relevance score between each word and each method according to Eq(15),

where N_d is the number of words in document d.

In Table 3, we list the top 20 relevant terms for 6 different methods.

Top 20 relevant terms for methods.

**x-ray**

(MI:0114)

structure, crystal, residue, molecule, model, site, form, interface, chain, contact, bond, hydrogen, helix, pp, record, helical, window, surface, linker, segment

**two hybrid**

(MI:0018)

yeast, two-hybrid, interact, assay, fusion, system, plasmid, clone, cdna, screen, bait, sequence, acid, amino, encode, site, pp, record, domain, plant

**pull down**

(MI:0096)

gst, fusion, glutathione, pull-down, assay, interact, bead, buffer, wash, yeast, scopus, min, incubate, two-hybrid, antibody, pp, record, system, plasmid, sequence

**anti tag coip**

(MI:0007)

record, pp, cite, yeast, antibody, strain, panel, anti-flag, saccharomyces, flag, cerevisia, growth, blot, western, flag-tagg, gene, grow, medline, ha, anti-ha

**anti bait coip**

(MI:0006)

control, buffer, pp, record, isi, bait, cancer, antibody, extract, c-terminus, bead, sirna, tumor, stain, gene, yeast, sds, luciferase, embo, cdna

**coip**

(MI:0019)

antibody, pp, record, extract, yeast, domain, sequence, expression, blot, cdna, clone, activity, luciferase, growth, transfect, acid, fusion, sirna, mmedta, link

We collect top 20 terms for 6 different methods according to Eq(15) from the corpus.
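Given any word-method relevance matrix, collecting the top-k terms per method reduces to a sort; a toy sketch (the matrix and vocabulary here are illustrative stand-ins, not values from the corpus):

```python
import numpy as np

# Toy word-method relevance matrix (methods x vocab) -- stands in for the
# quantity computed via Eq(15); the values are illustrative only.
vocab = ["crystal", "yeast", "bait", "structure"]
rel = np.array([[0.5, 0.0, 0.1, 0.4],   # e.g. x-ray
                [0.0, 0.6, 0.3, 0.1]])  # e.g. two hybrid

def top_terms(rel_row, vocab, k=2):
    """Return the k most relevant vocabulary terms for one method."""
    idx = np.argsort(rel_row)[::-1][:k]   # indices sorted by descending score
    return [vocab[i] for i in idx]
```

Applying this with k = 20 over the learned relevance scores would reproduce the term lists of Table 3.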

Methods correlation analysis

By the CMW model, each method can be represented in the latent topic space, which allows us to analyze the correlations among the different methods.

To represent a given method in the latent topic space, we re-normalize the topic-specific method distribution matrix β along its columns:

β̂_i = β_{·i} / Σ_k β_{ki}

where β̂_i is the representation of the i-th method in the latent topic space.

Recall that each row of the multinomial parameter β is a topic-specific distribution over the methods; after the column-wise re-normalization, each column β̂_i becomes a distribution over the topic factors for method e_i.

Based on this representation, we employ an accumulative clustering algorithm to perform the hierarchical clustering and utilize a visualization tool to build the clustering tree.
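As a toy illustration of the re-normalization and clustering step (the matrix values and the plain average-link implementation are assumptions, standing in for the learned parameters and the actual clustering tool):

```python
import numpy as np

# Toy topic-method matrix (2 topics x 3 methods) -- a stand-in for the
# learned multinomial parameter; the values are illustrative only.
beta = np.array([[0.8, 0.7, 0.1],
                 [0.2, 0.3, 0.9]])

# Re-normalize each column so method i becomes a distribution over topics.
method_vecs = (beta / beta.sum(axis=0)).T    # shape: (n_methods, n_topics)

def agglomerate(vecs, n_clusters):
    """Plain average-link agglomerative clustering down to n_clusters."""
    clusters = [[i] for i in range(len(vecs))]
    while len(clusters) > n_clusters:
        best = None
        # find the closest pair of clusters by average pairwise distance
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.mean([np.linalg.norm(vecs[i] - vecs[j])
                             for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)       # merge the closest pair
    return clusters
```

In this toy matrix the first two methods share their dominant topic, so they merge first, mirroring how semantically related detection methods end up adjacent in the clustering tree.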

From the clustering result in Figure 8, we can see that semantically related methods tend to be grouped together, which indicates that the latent topic representation captures meaningful correlations among the methods.

Methods clustering tree

**Methods clustering tree**. We utilize an accumulative clustering algorithm to perform the hierarchical clustering and build up the methods clustering tree.

Classify irrelevant documents

Although the CMW model is trained on documents known to contain detection methods, a practical extraction system must also cope with irrelevant documents. Here we examine whether the relevance information captured by the model can be used to classify them.

We randomly select 1000 irrelevant documents as the negative set. For each document d, we compute a relevance score as the maximum posterior probability over all methods:

relevance(d) = max_i p(e_i|**w**_d)

This measurement indicates the maximum probability of a document containing at least one interaction detection method.

We arrange the relevance scores in descending order in Figure 9 for both the relevant and the irrelevant document sets.
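The thresholding and the ranking-based evaluation can be sketched as follows (a minimal sketch; the score lists stand in for the model's relevance scores on the two document sets):

```python
def precision_recall(scores_pos, scores_neg, threshold):
    """Classify a document as relevant when its score >= threshold;
    scores_pos/scores_neg are scores of truly relevant/irrelevant docs."""
    tp = sum(s >= threshold for s in scores_pos)   # relevant, kept
    fp = sum(s >= threshold for s in scores_neg)   # irrelevant, kept
    fn = len(scores_pos) - tp                      # relevant, discarded
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def auc(scores_pos, scores_neg):
    """AUC as the probability that a relevant document outranks an
    irrelevant one (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))
```

Sweeping the threshold over the sorted scores traces out the precision/recall trade-off behind the operating point reported in Figure 9.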

Relevance distribution in relevant and irrelevant documents

**Relevance distribution in relevant and irrelevant documents**. In the diagram, the red line indicates the relevance scores in the relevant document set and the blue dots indicate the relevance scores in the irrelevant document set. If we select the classification threshold as the green line indicates, we achieve a promising classification performance: precision 0.745, recall 0.676 and AUC 0.819.

Conclusion

In this paper, we propose a generative probabilistic model, the Correlated Method-Word model, to automatically extract the interaction detection methods from the biological literature. This problem has not been well studied in previous research. By introducing latent topic factors, the proposed model formulates the correlation between the detection methods and related words in a probabilistic framework in order to infer the potential methods from the observed words.

In our experiments, the proposed CMW model achieves promising extraction performance, outperforming the baseline models and improving on the best result in the BioCreative II challenge.

Our contributions in this paper are twofold: 1) we propose a generative probabilistic model with proper underlying semantics for the detection method extraction task, which achieves promising performance; 2) we properly model the correlation between the detection methods and related words in the biological literature, capturing the in-depth relationships not only between the methods and related words but also among the different methods.

The proposed framework is general and could be extended to other extraction tasks in the biological literature in the future.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

Wang carried out the major work of the paper, proposed the model, implemented the experiments and drafted the manuscript. Huang gave directions in the process and revised the draft. Zhu supervised the whole work, gave great amount of valuable suggestions and helped to revise the manuscript. All authors have read and approved the final manuscript.

Acknowledgements

This work was supported by the Chinese Natural Science Foundation under grant No. 60572084, National High Technology Research and Development Program of China (863 Program) under No. 2006AA02Z321, as well as Tsinghua Basic Research Foundation under grant No. 052220205 and No. 053220002.

This article has been published as part of a journal supplement.