Abstract
Background
The growth of biomedical information requires information retrieval systems to provide short and specific answers in response to complex user queries. Semantic information in the form of free text is structured in a way that makes it straightforward for humans to read but more difficult for computers to interpret automatically and search efficiently. One reason is that most traditional information retrieval models assume terms are conditionally independent given a document/passage. We are therefore motivated to consider term associations within different contexts to help the models understand semantic information and use it to improve biomedical information retrieval performance.
Results
We propose a term association approach to discover term associations among the keywords from a query. The experiments are conducted on the TREC 2004-2007 Genomics data sets and the TREC 2004 HARD data set. The proposed approach is promising and outperforms both the baselines and the GSP results. We investigate the parameter settings and different indices and find that the sentence-based index produces the best results in terms of the document-level, the word-based index the best results in terms of the passage-level, and the paragraph-based index the best results in terms of the passage2-level. Furthermore, the best term association results always come from the best baseline. The tuning number k in the proposed recursive reranking algorithm is discussed and locally optimized to be 10.
Conclusions
First, modelling term association with factor analysis to improve biomedical information retrieval is one of the major contributions of our work. Second, the experiments confirm that term association, which considers co-occurrence and dependency among the keywords, can produce better results than the baselines that treat the keywords independently. Third, the baselines are reranked according to the importance and reliance of the latent factors behind term associations. These latent factors are decided by the proposed model and by the term appearances in the passages retrieved in the first round.
Background
The use of large-scale experimental techniques and biomedical tools has increased the pace at which biologists produce useful information. This also promotes the growth of the scientific literature, which contains information on those experimental results in the form of free text that is structured in a way which makes it straightforward for humans to read but more difficult for computers to interpret automatically and search efficiently. As a consequence, there is increasing interest in methods that can handle collections of biomedical texts. Such methods include systems that efficiently retrieve and classify information in response to complex user queries, and beyond this, systems that carry out a deeper analysis of the literature to extract specific associations.
Information retrieval (IR) deals with text analysis, text storage, and the retrieval of stored records having similarity between them [1]. In the context of the biomedical domain, IR systems are to retrieve documents/passages that a user might find relevant to his or her information need. What many information seekers really desire is to be provided with short, specific answers to questions, put in context by supporting information and links to original sources [2]. There are situations in which treating the terms retrieved by IR systems as independent is not desirable. Instead, associations among the terms within different contexts or a single text, which provide insight into the text as answers, might be of interest in specific domains such as the biomedical domain, text summarization and question answering systems.
In this paper, we focus on discovering term associations among the keywords from a query. Taking all the keywords as a sequence, we consider some subsequences as terms and propose a factor analysis based model to provide knowledge for finding the importance of term associations statistically. In many scientific fields, variables such as "intelligence" or "leadership quality" cannot be measured directly. Such variables, called latent variables, can be measured by other "quantifiable" variables, which reflect the underlying variables of interest. Factor analysis attempts to explain the correlations between the observed term associations in terms of the underlying factors, which are not directly observable. These latent factors can be considered the same as the hidden variables of "eliteness" introduced by Robertson et al. [3] in order to gain some understanding of the relation among multiple term occurrences and relevance. The observations for the proposed approach can be obtained from the keywords that are extracted from the queries, and from the passages retrieved by an IR system. In order to find the latent factors for term associations, we compute the factor loadings [4] using MATLAB [5]. Then we calculate the communalities [4] based on the factor loadings to indicate the importance and reliance of the latent factors and use them to recursively rerank the baseline result for improving retrieval performance. In addition, in order to evaluate the superiority of the proposed approach, the generalized sequential pattern (GSP) algorithm is adopted as a comparison.
The paper is organized as follows. First, we briefly present the experimental results and discussions in the results and discussion section, where the IR environment is introduced with the descriptions of the data sets, queries, evaluation measures, the IR system and indices. The comprehensive empirical study includes the analysis for the baselines, the proposed term association, the influence of different indices and k for the recursive reranking algorithm, the comparisons to the GSP algorithm and the official submissions. Second, we show our contributions in the conclusion section. Third, in the methods section, we propose our methods systematically and consistently. A term association approach is presented, followed by a factor analysis based model and a corresponding algorithm, including a recursive reranking algorithm. The related work is also presented in this section.
Results and discussions
Here we report the results obtained from a set of experiments conducted on the TREC 2004-2007 Genomics data sets and the TREC 2004 HARD data set, in order to evaluate the effectiveness of the proposed model and algorithms.
Experimental environment
Data sets and queries
We evaluate the proposed model and algorithms on the TREC 2004-2007 Genomics data sets, since we focus on the biomedical domain. Furthermore, we also apply the TREC 2004 HARD data set for evaluation.
The TREC 2007 and 2006 Genomics data sets provide a test collection of 162,259 full-text documents assembled with 36 queries in 2007 and 28 queries in 2006. The TREC 2007 queries are in the form of questions asking for lists of specific entities. The definitions for these entity types are based on controlled terminologies from different sources, with the source of the terms depending on the entity type [6]. The TREC 2006 queries are derived from the set of biologically relevant questions based on the Generic Topic Types (GTTs) [2]. All these queries are listed on the official genomics website at: http://ir.ohsu.edu/genomics.
The TREC 2005 and 2004 Genomics data sets consist of a document collection for the ad hoc retrieval task, which is a 10-year subset of MEDLINE with completed citations from the database, inclusive from 1994 to 2003. This provides a total of 4,591,008 records [2]. Each record is an abstract of a document, and in this paper we take an abstract as a passage. There are 50 queries for each year. More information can be found at: http://www.ncbi.nlm.nih.gov/.
The TREC 2004 HARD data set consists entirely of English text from sources such as the Agence France Press (AFP), Associated Press (APW), Central News Agency (CNA), LA Times/Wash Post (LAT), New York Times (NYT), Salon.com (SLN), Ummah Press (UMM) and Xinhua English (XIN), with a total collection of 652,710 documents. In our research, we parse the documents into passages [7]. There are 25 queries used in this paper.
Evaluation measures
The TREC Genomics Track has three evaluation measures: the document-level, the aspect-level and the passage2-level (a new measure for the TREC 2007 queries) [6]. Each provides insight into the overall performance for a user trying to answer the given queries and is measured by some variant of mean average precision (MAP). The measures used in this paper are briefly described as follows.
Document-level
This is a standard IR measure. The precision is measured at every point where a relevant document is obtained and then averaged over all relevant documents to obtain the average precision for a given query. For a set of queries, the mean of the average precision over all queries is the mean average precision (MAP) of that IR system.
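The document-level computation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the official TREC evaluation script: the function names and the assumption that each ranked list contains no duplicate document identifiers are ours.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Average precision for one query.

    Precision is taken at each rank where a relevant document appears and
    averaged over ALL relevant documents, so relevant documents that are
    never retrieved contribute a precision of 0.
    """
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of the per-query average precision over a set of queries."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)
```

For example, a ranking d1, d2, d3, d4 with relevant documents d1 and d3 has precision 1/1 at rank 1 and 2/3 at rank 3, giving an average precision of 5/6.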
Passage-level
As described in [8], this is a character-based precision calculated as follows. For each relevant retrieved passage, precision will be computed as the fraction of characters overlapping with the gold standard passages divided by the total number of characters included in all nominated passages from this system for the topic up until that point. Similar to regular MAP, relevant passages that are not retrieved will be added into the calculation as well, with precision set to 0 for relevant passages not retrieved. Then the mean of these average precisions over all topics will be calculated to compute the mean average passage precision.
Passage2-level
This is a new character-based MAP measure which is added to compare the accuracy of the extracted answers and is modified from the original Passage MAP measure. Passage2 treats each individually retrieved character in published order as relevant or not, in a sort of "every character is a mini relevance-judged document" approach [6]. This is done to increase the stability of the passage MAP measure against arbitrary passage splitting techniques.
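The "every character is a mini relevance-judged document" idea can be illustrated with the following sketch. It is only an approximation of the official measure: the passage representation as (document id, start offset, end offset) triples, the gold standard as a set of (document id, offset) pairs, and the choice to count a repeated character only once are all assumptions made for this example.

```python
from typing import Iterable, Set, Tuple

Passage = Tuple[str, int, int]   # (doc_id, start, end), end exclusive (assumed layout)
CharKey = Tuple[str, int]        # (doc_id, character offset)


def passage2_average_precision(ranked_passages: Iterable[Passage],
                               gold_chars: Set[CharKey]) -> float:
    """Character-level average precision for one topic.

    Characters are visited in retrieval order; each one is judged relevant
    if it belongs to the gold standard. Gold characters that are never
    retrieved contribute a precision of 0.
    """
    if not gold_chars:
        return 0.0
    seen: Set[CharKey] = set()
    retrieved = 0
    hits = 0
    precision_sum = 0.0
    for doc_id, start, end in ranked_passages:
        for offset in range(start, end):
            key = (doc_id, offset)
            if key in seen:          # assumption: a character is counted once
                continue
            seen.add(key)
            retrieved += 1
            if key in gold_chars:
                hits += 1
                precision_sum += hits / retrieved
    return precision_sum / len(gold_chars)
```

Because the score is accumulated per character, splitting one retrieved passage into two adjacent passages does not change the result, which is the stability property the measure is designed for.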
Gold standard
A gold standard is created by extracting the relevant passages and entities for each topic. Judges for the relevant passages and entities are recruited from the institutions of track participants and other academic or research centres. They are required to have significant domain knowledge, typically in the form of a PhD in a life science. In summary, judges are given the following three instructions. First, review the topic question and identify key concepts. Second, identify relevant paragraphs and select minimum complete and correct excerpts. Third, develop a controlled vocabulary for entities based on the relevant passages and code entities for each relevant passage based on this vocabulary [8].
System
We used Okapi BSS (Basic Search System) as our main search system. Okapi is an information retrieval system based on the probability model of Robertson and Sparck Jones [3,9-14]. The retrieved documents are ranked in the order of their probabilities of relevance to the query. Each search term is assigned a weight based on its within-document term frequency and query term frequency. The weighting function used is BM25:
w = \log \frac{(r + 0.5)/(R - r + 0.5)}{(n - r + 0.5)/(N - n - R + r + 0.5)} \times \frac{(k_{1} + 1) \, tf}{K + tf} \times \frac{(k_{3} + 1) \, qtf}{k_{3} + qtf} \oplus k_{2} \times nq \times \frac{avdl - dl}{avdl + dl}    (1)
where N is the number of indexed documents in the collection, n is the number of documents containing a specific term, R is the number of documents known to be relevant to a specific topic, r is the number of relevant documents containing the term, tf is the within-document term frequency, qtf is the within-query term frequency, dl is the length of the document, avdl is the average document length, nq is the number of query terms, the k_{i}s are tuning constants (which depend on the database and possibly on the nature of the queries and are empirically determined), K equals k_{1} * ((1 - b) + b * dl/avdl), and ⊕ indicates that its following component is added only once per document, rather than for each term.
In our experiments, the tuning constants k_{1} and b are set to different values, while k_{2} and k_{3} are set to 0 and 8 respectively. Furthermore, we have added a query expansion module to Okapi BSS, which provides two query expansion algorithms for constructing structured queries to deal with synonyms, the frequent use of acronyms and homonyms [15].
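As an illustration of how the quantities defined above combine, the following Python sketch computes a per-term BM25 weight and the once-per-document length correction in the standard Okapi form. It is not the Okapi BSS implementation: the function names are hypothetical, and only the defaults k_{2} = 0 and k_{3} = 8 are taken from the text; k_{1} and b must be supplied by the caller.

```python
import math


def bm25_term_weight(tf: float, qtf: float, n: int, N: int,
                     dl: float, avdl: float, k1: float, b: float,
                     k3: float = 8.0, r: int = 0, R: int = 0) -> float:
    """Weight of one query term in one document (standard Okapi BM25 form).

    With no relevance information (r = R = 0) the first factor reduces to
    an inverse-document-frequency-like component.
    """
    K = k1 * ((1 - b) + b * dl / avdl)
    rsj = math.log(((r + 0.5) / (R - r + 0.5)) /
                   ((n - r + 0.5) / (N - n - R + r + 0.5)))
    tf_part = ((k1 + 1) * tf) / (K + tf)
    qtf_part = ((k3 + 1) * qtf) / (k3 + qtf)
    return rsj * tf_part * qtf_part


def bm25_length_correction(nq: int, dl: float, avdl: float, k2: float = 0.0) -> float:
    """Document-length correction, added once per document rather than per term."""
    return k2 * nq * (avdl - dl) / (avdl + dl)


def bm25_document_score(term_stats, nq, dl, avdl, k1, b, k2=0.0, k3=8.0):
    """Sum the per-term weights, then add the length correction once.

    `term_stats` is an iterable of (tf, qtf, n, N) tuples, a layout assumed
    for this example.
    """
    total = sum(bm25_term_weight(tf, qtf, n, N, dl, avdl, k1, b, k3)
                for tf, qtf, n, N in term_stats)
    return total + bm25_length_correction(nq, dl, avdl, k2)
```

With k_{2} = 0, as in the experiments, the length correction vanishes and the document score is simply the sum of the term weights.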
Indexing
One important issue that IR systems have to deal with is the size of the retrieved passages and the granularity of the indexed information. In the context of text retrieval, the granularity of the indexed text can be defined as the length of the indexed text unit, and the size can be defined as the length of the retrieved passage. In this paper, we call an indexed text unit a passage.
Three indices are built on the 2007 and 2006 Genomics data sets according to three passage extraction methods, and a paragraph-based index is built on the 2005 and 2004 Genomics data sets [16]. A paragraph-based index is set up on the 2004 HARD data set as well. The sentence-based indexing is based on passages each of which has up to 3 sentences. The paragraph-based indexing is generated on passages each of which is a paragraph. Here a paragraph is defined as the sequence of sentences between the <p> and </p> tags from the HTML data set. The word-based indexing forms passages using a dynamic window [16,17].
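To make the sentence-based passage unit concrete, here is a small sketch that groups text into passages of up to 3 sentences. The naive regular-expression sentence splitter and the use of non-overlapping groups are assumptions for illustration; the actual passage extraction methods are described in [16,17].

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_passages(text: str, max_sentences: int = 3) -> List[str]:
    """Group consecutive sentences into passages of at most `max_sentences`."""
    sentences = split_sentences(text)
    return [" ".join(sentences[i:i + max_sentences])
            for i in range(0, len(sentences), max_sentences)]
```

A text of four sentences therefore yields two passages, the first with three sentences and the second with one.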
Experimental results
We report the baseline results in Table 1, which shows the performance under five parameter settings with three different indices in terms of the document-level, the passage-level and the passage2-level on the Genomics 2004-2007 data sets and the HARD 2004 data set respectively. Five groups have been set for the parameters (k_{1}, b) with their indices. Therefore, there are 15 runs on all five TREC data sets. Note that only a paragraph-based index is set up for the TREC 2005 and 2004 Genomics data sets and the TREC 2004 HARD data set.
Table 1. Performance of baselines
Corresponding to the baseline results, we generate the results of the term association approach using our proposed algorithms. The performance and improvements are presented in Table 2. The values in the parentheses are the relative rates of improvement over the original results.
Table 2. Performance of the term association approach
Influence of parameter settings and indices
In order to investigate the influence of different indices and parameter settings, we analyse the experimental results in depth. First, taking the TREC Genomics 2007 and 2006 data sets as an example, we compute the max, min, mean and sample standard deviation (SSD) of the baselines in Table 3. From this table, we can see how these settings affect the results, since there is a disparity between the max and the min values under all the measures. The SSD values are calculated as the sample standard deviation of a discrete random variable. Compared to the mean, the SSD also shows the influence of the different indices and parameter settings.
Table 3. MAX, MIN, mean and SSD of the Genomics 2007 and 2006 baselines
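The four summary statistics reported in Table 3 can be computed as in the following sketch. The function name and any MAP values passed to it are hypothetical and are not taken from the tables in this paper.

```python
import statistics
from typing import Dict, Sequence


def summarize_runs(map_values: Sequence[float]) -> Dict[str, float]:
    """Max, min, mean and sample standard deviation (n - 1 denominator)."""
    return {
        "max": max(map_values),
        "min": min(map_values),
        "mean": statistics.mean(map_values),
        "ssd": statistics.stdev(map_values),  # sample standard deviation
    }
```

A large gap between "max" and "min", or an SSD that is large relative to the mean, indicates that the choice of index and parameter setting matters for that measure.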
To illustrate the results in Table 1 graphically, we replot these data in Figure 1 and Figure 2. The performance of the baseline results is shown in terms of the document-level, the passage-level and the passage2-level. The x-axis represents the evaluation measures, where "word", "sen" and "par" stand for the word-based, the sentence-based and the paragraph-based indices. The y-axis shows the MAP performance. Figure 1 shows that the sentence-based index produces the best results in terms of the document-level, the word-based index the best results in terms of the passage-level and the paragraph-based index the best results in terms of the passage2-level. This finding also confirms our motivation for building different indices for different information needs.
Figure 1. Performance of baselines: Genomics 2007 and 2006. The influence of index and parameter settings is investigated on the baselines: (1) the circles highlight the best results generated by the three different indices; (2) the x-axis represents the evaluation measures, where "word", "sen" and "par" stand for the word-based, the sentence-based and the paragraph-based indices; the parameter settings are specified in the legend corresponding to the indices; (3) one conclusion is that the sentence-based index produces the best results in terms of the document-level, the word-based index the best results in terms of the passage-level and the paragraph-based index the best results in terms of the passage2-level; (4) the data correspond to Table 2.
Figure 2. Performance of baselines: Genomics 2005 and 2004, HARD 2004. The influence of index and parameter settings is investigated on the baselines: (1) the circles highlight the best results; (2) only the paragraph-based index has been generated on the Genomics 2005 and 2004 data sets and the HARD 2004 data set, as mentioned in the indexing section; (3) the parameter settings are specified in the legend corresponding to the indices; (4) the data correspond to Table 2.
Influence of term association
In order to illustrate the term association results in Table 2, we plot them graphically in Figure 3 and Figure 4. The figures clearly show that, for all the measures on the five TREC data sets, the term association approach always outperforms the baselines. The improvements given in parentheses make this evident. More interestingly, the figures of the factor analysis results have almost the same distributions as the figures of the baselines. The best factor analysis results always come from the best baseline results. The sentence-based index produces the best factor analysis results in terms of the document-level, the word-based index the best factor analysis results in terms of the passage-level and the paragraph-based index the best factor analysis results in terms of the passage2-level.
Figure 3. Performance of the term association approach: Genomics 2007 and 2006. The influence of index and parameter settings is continued on term association: (1) the circles highlight the best term association results, where the best term association results come from the best baselines; (2) the same index finding can be observed, namely that the sentence-based index produces the best term association results in terms of the document-level, the word-based index the best in terms of the passage-level and the paragraph-based index the best in terms of the passage2-level; (3) the data correspond to Table 3.
Figure 4. Performance of the term association approach: Genomics 2005 and 2004, HARD 2004. The influence of index and parameter settings is continued on term association: (1) the circles highlight the best term association results, where the best term association results come from the best baselines; (2) the data correspond to Table 3.
In order to illustrate the improvements of term association in Table 2, we plot them graphically in Figure 5, Figure 6, Figure 7 and Figure 8. There are two observations. First, the positive values of the improvements indicate that term association carries important weight in the retrieval results and performs much better than the baselines, which only consider the unigram keywords independently. In other words, the bigram and trigram associations have more influence on the retrieval results than the independent keywords. Second, the influence in terms of the passage levels (the passage2-level and the passage-level) is greater than that in terms of the document-level. We can also see in Figure 5, Figure 6 and Figure 8 that the absolute values of the improvements on the passage-level are much higher than those on the document-level. This can be explained by the fact that term association is more effective when applied to sentences or paragraphs than to whole documents.
Figure 5. Improvements of the term association approach over baselines: Genomics 2007. The improvements of term association over baselines are investigated: (1) the proposed approach outperforms the baselines, since the lines are in the first quadrant; (2) the influence on the passage levels is greater than that on the document-level; (3) the data correspond to Table 3.
Figure 6. Improvements of the term association approach over baselines: Genomics 2006. The improvements of term association over baselines are investigated: (1) the proposed approach outperforms the baselines, since the lines are in the first quadrant; (2) the influence on the passage levels is greater than that on the document-level; (3) the data correspond to Table 3.
Figure 7. Improvements of the term association approach over baselines: Genomics 2005 and 2004. The improvements of term association over baselines are investigated: (1) the proposed approach outperforms the baselines, since the lines are in the first quadrant; (2) no passage-level improvement lines on the Genomics 2005 and 2004 data sets are presented, since there is only the document-level; (3) the data correspond to Table 3.
Figure 8. Improvements of the term association approach over baselines: HARD 2004. The improvements of term association over baselines are investigated: (1) the proposed approach outperforms the baselines, since all the lines are in the first quadrant; (2) the influence on the passage levels is greater than that on the document-level; (3) the data correspond to Table 3.
Influence of k for recursive reranking
We initialize the depth as k = 10 in the recursive reranking algorithm. The number k stands for the top k term associations weighted by the factor analysis based model. We recursively rerank the retrieved passages according to whether or not the passages contain the top k term associations. We conduct a series of experiments with different values of k in order to investigate its influence and find a locally optimal value for the proposed algorithm. We first randomly choose five original baselines, one from each of our five data sets, namely Genomics 2007, Genomics 2006, Genomics 2005, Genomics 2004 and HARD 2004. Then the factor analysis model is applied to the baselines. Five values, namely 1, 5, 10, 20 and 100, are tested and the performance is shown in Table 4. We can see that k affects the performance greatly when it is smaller than 10. However, when k becomes larger than 10, the final performance hardly changes. Therefore, we take 10 as the locally optimal value of k in the recursive reranking algorithm for all the runs.
Table 4. Number k discussion
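One way to picture recursive reranking to depth k is the sketch below. It is an interpretation, not the authors' algorithm: it assumes that at each level the passages containing the current term association are moved ahead of those that do not, with the original order preserved inside each group, and that the next association is then applied within each group. The passage layout (identifier, set of stemmed terms) and the `contains` predicate are hypothetical.

```python
from typing import Callable, List, Sequence, Set, Tuple

Passage = Tuple[str, Set[str]]      # (passage id, terms in the passage), assumed layout
Association = Tuple[str, ...]       # e.g. ("high", "lupus", "serum")


def contains(passage: Passage, association: Association) -> bool:
    """Assumed test: every keyword of the association occurs in the passage."""
    return set(association) <= passage[1]


def recursive_rerank(passages: Sequence[Passage],
                     associations: Sequence[Association],
                     depth: int = 10,
                     has: Callable[[Passage, Association], bool] = contains) -> List[Passage]:
    """Rerank by the top `depth` associations, most important first."""
    if depth == 0 or not associations or len(passages) <= 1:
        return list(passages)
    head, rest = associations[0], associations[1:]
    with_assoc = [p for p in passages if has(p, head)]
    without_assoc = [p for p in passages if not has(p, head)]
    return (recursive_rerank(with_assoc, rest, depth - 1, has)
            + recursive_rerank(without_assoc, rest, depth - 1, has))
```

Under this interpretation, increasing k beyond the point where every group holds a single passage, or where the remaining associations no longer split any group, cannot change the ranking, which is consistent with the flat performance observed for k larger than 10.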
Comparison with GSP algorithm
We adopt the GSP algorithm as a comparison to our proposed approach. In order to map the GSP algorithm to our research problem, we treat the keywords extracted from the queries as the singleton items and the N passages retrieved by the system for each query as the transaction database. Therefore, the candidate 1-sequences are all the keywords, and the candidate k-sequences are generated from the frequent (k - 1)-sequences. For the support counting, we define the minimum support value corresponding to each query as follows. First, the counts of candidates are automatically calculated by the modified GSP algorithm, including all k-sequences. Second, we simulate the counts as a nonparametric distribution. Third, the 95% confidence interval of this distribution is computed, where the lower bound is the minimum support value for this GSP algorithm.
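The two core GSP steps mentioned above, support counting and candidate generation, can be sketched as follows for sequences of single keywords. This is a simplified illustration: the pruning of candidates with infrequent subsequences is omitted, the minimum support is passed in rather than derived from the confidence interval, and all function names are ours.

```python
from typing import Iterable, List, Sequence, Set, Tuple

Seq = Tuple[str, ...]


def is_subsequence(candidate: Seq, tokens: Sequence[str]) -> bool:
    """True if the candidate's keywords occur in `tokens` in the same order."""
    it = iter(tokens)
    return all(term in it for term in candidate)


def support(candidate: Seq, passages: Iterable[Sequence[str]]) -> int:
    """Number of passages (transactions) that contain the candidate sequence."""
    return sum(1 for tokens in passages if is_subsequence(candidate, tokens))


def generate_candidates(frequent: Iterable[Seq]) -> Set[Seq]:
    """Join frequent (k - 1)-sequences into candidate k-sequences.

    s1 and s2 are joined when s1 without its first item equals s2 without
    its last item; the candidate is s1 extended by the last item of s2.
    """
    frequent = set(frequent)
    return {s1 + (s2[-1],) for s1 in frequent for s2 in frequent
            if s1[1:] == s2[:-1]}


def frequent_sequences(candidates: Iterable[Seq],
                       passages: List[Sequence[str]],
                       min_support: int) -> Set[Seq]:
    """Keep the candidates whose support reaches the minimum support value."""
    return {c for c in candidates if support(c, passages) >= min_support}
```

Note that `is_subsequence` accepts keywords that are far apart in the token list, as long as their order is preserved.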
In this section, we study how the GSP algorithm performs on our five data sets. Here we focus on the experimental results with the paragraph index under five parameter settings, as shown in Table 5.
Table 5. Performance of GSP algorithm
Furthermore, we compare the best results of the GSP algorithm, the baselines and the proposed term association approach.
An interesting finding is drawn from the results of the GSP algorithm. The GSP algorithm works very well in terms of the passage-level and the passage2-level, while it does not perform well in terms of the document-level. This can be explained by the following scenario. The frequent 3-sequence T_{1}T_{3}T_{4} is found in the documents D_{1} and D_{2}. In D_{1}, T_{1}T_{3}T_{4} is contained in a short passage, so that D_{1} earns good MAP results on the document-level and the passage-level. In the document D_{2}, T_{1}, T_{3} and T_{4} are found in three different passages. Since T_{1}T_{3}T_{4} is still found as a sequence based on the definitions, D_{2} is given a high weight and is expected to earn good performance at least on the document-level. However, the standard evaluation does not judge D_{2} to be a relevant document, so that D_{2} decreases the performance on the document-level.
Compared to the GSP algorithm, the proposed term association approach outperforms the baselines and the GSP results on all the measures. The factor analysis based model considers not only the co-occurrence of the terms, but also their dependency, especially in the high order structure. In the GSP algorithm, the document D_{2} is given a good score. However, in the factor analysis based model, the factor loadings of T_{1}T_{3}T_{4} in D_{2} are very small, since T_{1}T_{3}T_{4} is not treated as a trigram term association. T_{1}, T_{3} and T_{4} are three unigram terms, while T_{1}T_{3}T_{4} is a frequent 3-sequence in the GSP algorithm. So the proposed approach avoids assigning a high weight to the document D_{2}.
The major difference between our proposed approach and the n-gram and PLSA models is that term associations are not dependent on the previous associations. Their reliance and importance are decided by the dependencies among the keywords in the passages, not by their probabilities given the previous terms. For example, an interesting finding using factor analysis in this work is that the bigram k_{1}k_{j} (j ≠ 1) might have the highest reliance, even though its constituent unigram k_{1} or k_{j} is not the most important for a query in some IR systems. Our experiment confirms that k_{1}k_{j} plays an important role in the improved reranking result. Therefore, one of the major contributions of the proposed approach is to extract subsequences as term associations from a query without preliminary knowledge. This prompts us to employ the GSP algorithm as a comparison to evaluate the proposed approach statistically, rather than to compare this approach with PLSA and PCA.
Comparison with official submissions
In order to further evaluate the term association approach, we compare its performance with the best and mean values of the official submissions on the five TREC data sets in Table 6. Since the submissions for the 2004 HARD data set are not officially released, we focus on the genomics data sets. We can observe that, for the mean performance, term association outperforms the baselines and the official submissions. For some of the best performances, term association improves on the baselines, but is not as good as the official submissions. However, based on the discussion in the section on the influence of term association, we believe we could achieve higher performance if we had better baselines.
Table 6. Comparisons of baselines, term associations and official submissions
A case study
Topic 200 of the TREC 2007 queries is taken as an example. The description for Topic 200 is "What serum [PROTEINS] change expression in association with high disease activity in lupus?". Nine keywords are extracted: serum, proteins, change, expression, association, high, disease, activity and lupus. The remaining words are removed by the system as stop words. The system stems the keywords as serum, protein, chang, express, associ, high, diseas, activ and lupus.
Table 7 shows the baseline whose parameters are set as (k_{1}, b) = (2.0, 0.4) with the paragraph-based index. The keywords, the term count, the frequency and the rank are presented for Topic 200. There are totally term associations generated by the proposed approach. Table 8 presents the top 10 term associations after applying the factor analysis based model, where the terms, term counts and their communalities are presented. Then in Table 9, the performance of term association is compared with the performance of the baseline for Topic 200 in terms of the document-level, the passage-level and the passage2-level.
Table 7. Topic 200: keyword frequency rank
Table 8. Topic 200: ranking term associations
Table 9. Topic 200: performance comparison
First of all, we can see that no unigram is in the ranked association list. All the term associations in Table 8 are bigrams and trigrams. Since the result improved by term association outperforms the baseline on all the measures, considering term association is better than considering the keywords independently. Second, the trigram "high lupus serum" has higher reliance than the bigram "activ serum", although the trigram's term count is only 7, which is much less than the bigram's term count of 118. This tells us that term frequency alone can be a poor indicator of importance when compared to term association.
Conclusions
Modelling term association with factor analysis to improve biomedical information retrieval is one of the major contributions of our work. We investigate term association among the keywords from a query and then build a factor analysis based model to investigate the significance of term association. The proposed approach works very well on five large TREC data sets. Our improved performance is among the top TREC official results submitted for the TREC 2004-2007 Genomics data sets and the TREC 2004 HARD data set.
Term association, which considers co-occurrence and dependency among the keywords, produces better results than the baselines that treat the keywords independently. On the other hand, the unigrams, bigrams and trigrams are terms independently computed by the factor analysis based model, which means that the trigrams are not dependent on the bigrams' importance, and the bigrams are not dependent on the unigrams' importance. Their importance is decided by the model and their appearances in the passages. This is also confirmed by the GSP algorithm.
In the term association approach, keywords and the retrieved passages are the observable data, and the factor analysis based model is built to discover the unobservable latent factors. Factor loadings are computed to indicate the weights of the common factors. Communalities are calculated based on the factor loadings to represent the importance and reliance of the corresponding term associations. Finally, a ranked term association list is given by the model. Then we recursively rerank the baselines and report the experimental results.
The experimental results show that term association outperforms the baselines and the GSP results on all the evaluation measures, which provides a promising avenue for improving the information retrieval performance. Our future work includes investigating the PLSA model on the genomics research. This is also our ongoing work.
Methods
We first introduce the observations. Then a factor analysis based model is proposed, in which common factors, factor loadings and communalities are defined. The pseudocode for the factor analysis based algorithm and for the recursive reranking algorithm is then shown.
Observations
In traditional IR systems, keywords extracted from the queries are used to retrieve documents/passages with some weighting functions. In this paper, we examine term associations among keywords to improve information retrieval performance. Suppose that n keywords are extracted from a query and that the system returns N passages for each retrieval baseline result. Term associations among these n keywords are extracted and used for reranking the N passages.
Our two main observation files from the system are: 1) the baseline result retrieved by the system with N passages for each query; 2) the corresponding term file, which displays how many and which keywords are retrieved in each passage. The sample data are presented in Tables 10 and 11.
Taking the n keywords as a sequence, we study the 1-keyword, 2-keyword and 3-keyword subsequences as unigram, bigram and trigram term associations. If a term appears in a passage, it scores 1; if not, it scores 0. Therefore each passage can be represented as a 0-1 vector, as shown in Table 12.
Table 12. Observation of keyword associations
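The construction of the 0-1 observation vectors can be illustrated with a minimal Python sketch. The keyword list, the passage identifiers and the function names below are hypothetical, and the sketch assumes that an association counts as present when all of its keywords appear in the passage, regardless of their distance.

```python
from itertools import combinations

def term_associations(keywords, max_len=3):
    """Return all 1-, 2- and 3-keyword subsequences, preserving query order."""
    assocs = []
    for size in range(1, max_len + 1):
        assocs.extend(combinations(keywords, size))
    return assocs

def observation_vector(assocs, passage_terms):
    """Score 1 if every keyword of an association appears in the passage, else 0.

    Assumption: presence means co-occurrence in the passage, not adjacency.
    """
    present = set(passage_terms)
    return [1 if all(t in present for t in assoc) else 0 for assoc in assocs]

# Hypothetical stemmed keywords and term-file content, for illustration only.
keywords = ["gene", "protein", "mutat"]
assocs = term_associations(keywords)
passages = {
    "p1": ["gene", "protein"],
    "p2": ["mutat"],
    "p3": ["gene", "protein", "mutat"],
}
matrix = {pid: observation_vector(assocs, terms) for pid, terms in passages.items()}
```

Each row of `matrix` is one passage; stacking the rows and transposing gives one row per association and one column per passage.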
A factor analysis based model
Factor analysis is a method for investigating whether a number of variables of interest T_{1}, T_{2}, . . . , T_{n}, are linearly related to a smaller number of unobservable factors F_{1}, F_{2}, . . . , F_{m}.
Based on the observation data, we suggest that the observations are functions of a number of common underlying factors. The underlying factors tentatively, and rather loosely, describe the unobservable features of the retrieved passages. The score over all term associations is the sum of a constant times a common factor, i.e., it is a linear combination of those common factors, as in Equation 2.
$$T = \ell_{1} F_{1} + \ell_{2} F_{2} + \cdots + \ell_{m} F_{m} \qquad (2)$$
where m stands for the number of common factors, m ≤ n. The numbers ℓ_{1}, . . . , ℓ_{m} are the factor loadings associated with this term association.
In this paper, term associations contain unigrams, bigrams and trigrams. The data used by the factor analysis based model are therefore n' associations and N passages for each query, which form an n' × N matrix. The factor loadings and the common factors for each query must be inferred from the data. Here we use n' to denote the number of term associations.
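In order to compute the reliance of the associations, communality is defined for the n' associations as the sum of the squared factor loadings of each association:
$$h_{k}^{2} = \sum_{j=1}^{m} \ell_{k,j}^{2}, \qquad k = 1, \ldots, n'$$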
The larger the communalities are, the more important the common factors are in representing the keywords.
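As a minimal sketch of how communalities could be turned into a ranked association list, the Python fragment below assumes the standard factor-analysis definition of communality (the sum of squared loadings on the common factors, excluding any constant term). The loading values and the function names are hypothetical and are not taken from the experiments.

```python
def communalities(loadings):
    """Sum of squared factor loadings per association (one row per association)."""
    return [sum(l * l for l in row) for row in loadings]

def top_k_associations(names, loadings, k=10):
    """Rank associations by communality, highest first, and keep the top k."""
    scored = sorted(zip(names, communalities(loadings)),
                    key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical loadings on m = 2 common factors for three associations.
names = ["gene", "protein", "gene protein"]
loadings = [[0.5, 0.5], [1.0, 0.0], [0.0, 0.5]]
ranked = top_k_associations(names, loadings, k=2)
```

Here `ranked` keeps the two associations with the largest communalities, which mirrors the selection of the top k associations for reranking.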
It is assumed that each term association is related to m factors. Therefore, the mathematical model for the above example can be written as follows.
$$T_{k} = \ell_{k,0} + \ell_{k,1} f_{1} + \ell_{k,2} f_{2} + \cdots + \ell_{k,m} f_{m} + \varepsilon_{k}$$
where T_{k} is the score of the k^{th} term association, with k = 1, . . . , n'; <f_{1}, . . . , f_{m}> is the unobserved common factor vector for the k^{th} term association; <ℓ_{k,0}, ℓ_{k,1}, . . . , ℓ_{k,m}> is the factor loading vector of the k^{th} term association; ε_{k} is the error term, which serves to indicate that the hypothesized relationships are not exact. In matrix notation, we have
$$T = LF + \varepsilon$$
where T is an n' × N matrix of observable data; L is an n' × (m + 1) matrix of factor loadings, which are unobservable constants; F is an (m + 1) × N matrix of unobservable common factors, whose first row is constant and multiplies the term ℓ_{k,0}; ε is an n' × N matrix of unobservable error variables.
Observe that doubling the scale on which f_{1} of F is measured, while simultaneously halving the factor loadings for f_{1}, makes no difference to the model. Thus, no generality is lost by assuming that the standard deviation of f_{1} is 1. Likewise for f_{j} (j = 2..m). Moreover, for similar reasons, no generality is lost by assuming that every two factors f_{i} and f_{j} (i ≠ j) are uncorrelated with each other. The "errors" ε are taken to be independent of each other. The variances of the "errors" associated with the n' different associations are not assumed to be equal. The values of the factor loadings L and the variances of the "errors" ε can be estimated given the observed data T.
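Under these assumptions, the variance of each association score splits into a part explained by the common factors and a part specific to the association. The identity below is a standard factor-analysis result and is not quoted from the experiments; the symbol ψ_{k} for the error variance is introduced here for illustration, and the derivation additionally assumes that the errors are uncorrelated with the factors.

```latex
\operatorname{Var}(T_{k})
  = \sum_{j=1}^{m} \ell_{k,j}^{2}\,\operatorname{Var}(f_{j}) + \operatorname{Var}(\varepsilon_{k})
  = \sum_{j=1}^{m} \ell_{k,j}^{2} + \psi_{k}
```

The first term on the right is the part of the variance shared with the common factors, and ψ_{k} may differ from one association to another, in line with the unequal error variances allowed above.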
A factor analysis based algorithm
We present the proposed factor analysis based algorithm as follows; it includes eight phases. The Initialization phase gives the initial values for the algorithm, such as N = 1000. The Matrices generation phase creates the matrices of the associations. In this research, we only consider unigrams, bigrams and trigrams. The Communality calculation phase computes the communalities for all the associations. Finally, the Reranking phase uses the recursive reranking algorithm, presented in the section on the recursive reranking algorithm, to rerank the original result.
begin
0. Input
The baseline result for the queries on each data set.
The term file corresponding to the baseline result.
1. Output
A reranking result for the queries on each data set.
2. Initialization
N = 1000;
k = 10;
3. Keyword extraction
Read the term file;
For each query {
Get the keyword sequence;
Get the value of n;}
4. Matrices generation
For each query {
For (i = 1; i <= 3; i++) {
Generate the i-keyword subsequences;
For all the i-keyword subsequences {
Search the subsequence in the term file;
If the subsequence exists, it scores 1; Else it scores 0;}}}
5. Communality calculation
For each query {
Mathematically set up the factor analysis model;
Estimate the factor loadings for the common factors;
Compute the communalities; }
Sort the associations according to their communalities;
Get top k associations as the ranking association list.
6. Reranking
For each query {
Call the recursive reranking algorithm presented
in the following section of the recursive reranking algorithm; }
7. Final result generation
A final reranking result is generated.
end
First, keywords are directly extracted from the queries. There is a term file which displays how many and which keyword terms are retrieved for each passage by the system. In other words, all the retrieved passages can be labelled by the keywords. Furthermore, stemming, but no query expansion, is applied to the keywords in the queries. For example, "change" can have several forms such as "changeless", "changing", "changeable", and so on, so our system treats "change" as "chang". The process is done automatically in the system [13,15].
Second, according to the keyword sequence, unigrams, bigrams and trigrams are generated as term associations for each query, which makes a matrix.
Third, we set up the factor analysis model after generating the matrices. Through sorting the communalities, we can find which term association is more important according to its communality. The larger the communalities are, the more important the corresponding associations are. Finally, we recursively rerank the passages as the output result using the recursive reranking algorithm introduced in the following section.
A recursive reranking algorithm
A recursive reranking algorithm is called in the Reranking phase of the previous factor analysis based algorithm. Here we present the pseudocode as follows.
begin
0. Input
The baseline result for the queries on each data set.
The ranking association list generated by the factor analysis model.
1. Output
A reranking passage list for the queries on each data set.
2. Initialization
k = 10, which is discussed in the section on the influence of k for recursive reranking;
3. Recursive division
For the 1st association in the ranking association list,
The result list is divided into 2 parts: the 1st part contains the association and the 2nd part does not.
For each part, the passages are sorted by their given weights.
For the 2nd association in the list,
The result list is divided into 4 parts, each of the 2 previous parts being split in two,
where the 1st and 3rd parts contain the association while the 2nd and 4th parts do not.
Repeat to rerank the result list for k times.
The result list is divided into 2^{k} parts.
The odd parts contain the associations while the even parts do not.
4. Reranking
For (i = 1; i ≤ 2^{k}; i++) {
Sort the passages in the i-th part according to their weights;
Let RL be the final reranking list and v the size of RL;
The first q(1) passages in RL are the passages in the 1st part;
v(i) = # of RL;
The passages from (v(i) + 1) to (v(i) + q(i + 1)) in RL are the passages in the (i + 1)-th part; }
5. Final result generation
A reranking result list is generated.
end
There are three main phases. The Initialization phase sets the initial values, i.e. k = 10, which is discussed in detail in the section on the influence of k for recursive reranking. The Recursive division phase divides the passages into the base cases according to the ranked associations. This procedure is displayed in Figure 9 and is very similar to a binary tree. For example, suppose the factor analysis based model gives a ranking list of terms {T_{1}, T_{2}, T_{3}} for reranking. The baseline results are first divided by T_{1} into the results containing T_{1} and the results not containing T_{1}. Second, both parts are recursively reranked by T_{2}, which yields four parts: the results containing both T_{1} and T_{2}, the results containing T_{1} but not T_{2}, the results containing T_{2} but not T_{1}, and the results containing neither T_{1} nor T_{2}. Similarly, these four parts are reranked by T_{3} at the third step. Finally, the Reranking phase collects the passages from the base cases in order, and a reranked result list is generated.
Figure 9. Procedure of recursive reranking. A recursive reranking algorithm is called in the Reranking phase of the factor analysis based algorithm. The recursive division divides the passages into the base cases according to the sorted term associations, which is very similar to a binary tree. For example, suppose the factor analysis based model gives a ranking list of terms {T_{1}, T_{2}, T_{3}} for reranking. The baseline results are first divided by T_{1} into the results containing T_{1} and the results not containing T_{1}. Second, both parts are recursively reranked by T_{2}, which yields the results containing both T_{1} and T_{2}, those containing T_{1} but not T_{2}, those containing T_{2} but not T_{1}, and those containing neither. Similarly, these four parts are reranked by T_{3} at the third step.
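The recursive division and the final concatenation can be sketched in Python as follows. This is an illustrative sketch rather than the system's implementation: the passage tuples, the `contains` and `weight` callbacks and the function name are hypothetical.

```python
def recursive_rerank(passages, ranked_assocs, contains, weight):
    """Split by each association in turn (containing part first),
    then sort every leaf part by weight and concatenate the parts in order."""
    parts = [list(passages)]
    for assoc in ranked_assocs:
        next_parts = []
        for part in parts:
            # Odd-numbered part: passages containing the association.
            next_parts.append([p for p in part if contains(p, assoc)])
            # Even-numbered part: passages not containing it.
            next_parts.append([p for p in part if not contains(p, assoc)])
        parts = next_parts
    reranked = []
    for part in parts:
        reranked.extend(sorted(part, key=weight, reverse=True))
    return reranked

# Hypothetical baseline: (passage id, baseline weight, associations present).
baseline = [
    ("p1", 0.9, {"T2"}),
    ("p2", 0.8, {"T1"}),
    ("p3", 0.7, {"T1", "T2"}),
    ("p4", 0.6, set()),
]
result = recursive_rerank(
    baseline,
    ["T1", "T2"],
    contains=lambda p, a: a in p[2],
    weight=lambda p: p[1],
)
order = [p[0] for p in result]
```

With k associations the list is split into 2^{k} parts, many of which may be empty; the sketch keeps empty parts so that the odd/even indexing matches the description above.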
Related work
Modelling and mining term association is important for information retrieval because it allows an IR system, given a user's query terms, to retrieve relevant documents more precisely.
Metzler and Croft [18] developed a general, formal framework for modelling term dependencies via Markov random fields. They not only made use of features based on occurrences of single terms, ordered phrases, and unordered phrases, but also explored the full independence, sequential dependence and full dependence variants of the model. In addition, training data were needed to set the parameters of the model. Their ad hoc retrieval experiments showed improvements from modelling dependencies, especially on the larger collections.
Deerwester et al. [19] proposed an approach to automatic indexing and retrieval that took advantage of implicit higher-order structure in the association of terms with documents in order to improve the detection of relevant documents on the basis of terms found in queries. The approach tried to overcome the deficiencies of term-matching retrieval by treating the unreliability of observed term-document association data as a statistical problem. They assumed that some underlying latent semantic structure in the data was obscured by the randomness of word choice with respect to retrieval. They then used statistical techniques to estimate the latent semantic structure for indexing and retrieval.
Grefenstette [20] proposed an extraction technique using coarse syntactic analysis without domain knowledge, which produced word associations as lists of words related to the words appearing in a corpus. The experimental results confirmed that, when the closest related terms were used in query expansion on a standard information retrieval data set, the results were much better than those given by document co-occurrence techniques, and slightly better than those obtained with unexpanded queries.
Kaji et al. [21] presented a method for automatically generating a corpus-dependent association thesaurus from a text corpus. This method consisted of extracting terms and co-occurrence data from a corpus and analysing the correlation between terms statistically. They conducted experiments on a newspaper article corpus, which showed that the thesaurus navigator efficiently explored information in a text corpus when the information needs were vague.
Manna and Gedeon [22] proposed a term association model which extracted significant terms as well as the important regions from a single document, and which was based on subjective data analysis without predefined knowledge. They claimed that the model overcame the basic drawback of existing language models for choosing significant terms in single documents.
Wei et al. [23] proposed a technique using association rule mining for the discovery of associations, which took into account not only the co-occurrence frequency but also the confidence and direction of the association rules. Using query expansion with term association rules, they consistently improved retrieval effectiveness over the set of 48 test queries on the Associated Press 1990 news wires corpus of the TREC-4 benchmark.
In this work, we propose a term association approach that customizes a factor analysis based model to quantify the importance and reliance of term associations. Independent keywords, unordered dependent phrases and high-order structure are considered at the same time in the proposed approach. In addition, we focus statistically on the appearance of the terms in the same context rather than on the distance among the terms.
As a popular analysis method, factor analysis is attractive in IR for two main reasons. One apparent advantage is that it can be used to reduce the dimensionality of the data. The other is that it can find hidden patterns. Mandl [24] discussed methods for dimensionality reduction using factor analysis in IR. Machado et al. [25] presented an approach to image retrieval based on multivariate factor analysis to minimize data redundancy and reveal hidden patterns. Mehta et al. [26] proposed an approach for cross-system personalization by factor analysis. Their factor analysis method offered an algorithmic improvement over their previous work by taking into account the incompleteness of data. In our proposed approach, factor analysis is applied to discover some hidden common factors as the "eliteness" variables that can be used to estimate the importance of term associations.
Some related work has been done in the biomedical domain during the past few years. We investigated the optimization of multiple sources in [27], where a robust approach to optimizing multiple sources was proposed. In that metasearch system, the approach has access to the baselines from three IR models, namely DFR, BM25 and a language model. In [16], we concentrated on passage extraction and result combination. Three algorithms were presented for passage extraction to build indices, and two result combination methods were proposed to combine the retrieval results from different indices. A naive model using factor analysis was also applied to improve the baselines for result combination, where unigrams and bigrams were considered. We also studied a Bayesian learning approach to promoting diversity in ranking in [28]. In this approach, a reranking model computed the maximum posterior probability of the hidden property corresponding to each retrieved passage. It then iteratively grouped the passages into subsets according to their properties. In this paper, we focus on modelling term associations. The latent factors behind term associations reflect the importance and reliance of these term associations. They are decided by the proposed factor analysis based model and their term appearances in the first round retrieved passages.
Principal Component Analysis (PCA) [29] and factor analysis are two methods that can help reveal simpler patterns within a complex set of variables. In particular, they seek to discover whether the observed variables can be explained largely or entirely in terms of factors. The main commonality between PCA and factor analysis is that they both have eigenvectors, eigenvalues, factor loadings and scores. The differences are: (1) PCA is often used as a simple starting point in multivariate analysis; (2) factor analysis is often considered to be "statistical" in nature rather than purely mathematical as in PCA, since PCA eigenvectors cumulatively account for all the variability in the data set whereas factor analysis results include an unresolved component; (3) factor analysis results are often transformed through varimax and other methods to optimize eigenvectors for interpretation. This motivates us to choose factor analysis to compute the importance and reliance of term associations, in order to find the hidden "eliteness" variables. An n-gram [30] is a subsequence of n items from a given sequence. An n-gram model is a type of probabilistic model for predicting the next item in such a sequence. Some language models built from n-grams are "(n - 1)-order Markov models". An n-gram grammar is a representation of an n^{th} order Markov model in which the probability of occurrence of a symbol is conditioned upon the prior occurrence of (n - 1) other symbols. Probabilistic latent semantic analysis (PLSA) [31] is a method of latent semantic analysis that uses probabilistic means to obtain the hidden topics and their relationships to terms and documents. In this paper, we use factor analysis to estimate the latent factors and compute the communalities for term associations statistically.
We adopt the Generalized Sequential Pattern (GSP) algorithm [32] as a comparison to our proposed approach. GSP contains two main steps: candidate generation and support counting. At first, all single items (1-sequences) are counted. Then, from the frequent single items, a set of candidate 2-sequences is formed, and the non-frequent candidates are removed based on the minimum support. The frequent 2-sequences are used to generate the candidate 3-sequences. This process is repeated until no more frequent sequences are found. The support counting is based on the minimum support value.
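The level-wise loop of candidate generation and support counting can be illustrated with a simplified Python sketch. It is not the full GSP algorithm [32]: it assumes every element is a single item, ignores time constraints and candidate pruning, and uses hypothetical function names and toy data.

```python
def is_subsequence(candidate, sequence):
    """True if the candidate's items occur in the sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in candidate)

def gsp(sequences, min_support):
    """Simplified level-wise mining of frequent subsequences of single items."""
    def support(cand):
        # Number of data sequences that contain the candidate.
        return sum(is_subsequence(cand, s) for s in sequences)

    items = sorted({item for s in sequences for item in s})
    frequent = [(i,) for i in items if support((i,)) >= min_support]
    result = {}
    while frequent:
        for f in frequent:
            result[f] = support(f)
        # Join step: a and b combine when a without its first item
        # equals b without its last item.
        candidates = {a + (b[-1],) for a in frequent for b in frequent
                      if a[1:] == b[:-1]}
        frequent = sorted(c for c in candidates if support(c) >= min_support)
    return result

# Toy keyword sequences, for illustration only.
data = [["a", "b", "c"], ["a", "c"], ["a", "b", "c"], ["b", "c"]]
patterns = gsp(data, min_support=3)
```

The loop stops as soon as a level produces no frequent candidates, which corresponds to the stopping condition described above.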
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
This work was carried out by QH as a part of her Ph.D. thesis. JXH supervised the project and revised the manuscript. JXH and XH contributed to the study design and experiments. All authors read and approved the final manuscript.
Acknowledgements
This research is supported by the research grant from the Natural Sciences & Engineering Research Council (NSERC) of Canada and the Early Researcher Award/Premier's Research Excellence Award. The authors would like to thank anonymous reviewers for their valuable comments and suggestions.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 9, 2012: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2011: Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S9.
References

Salton G, Fox EA, Wu H: Extended Boolean information retrieval.
Commun ACM 1983, 26(11):1022-1036.

Hersh W, Cohen A, Yang J: TREC 2005 Genomics Track overview.
Proceedings of 14th Text REtrieval Conference, NIST Special Publication 2005.

Robertson SE, Walker S: Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 3-6 July 1994, Dublin, Ireland. ACM/Springer; 1994:232-241.

Subbaraoand C, Subbarao N, Chandu S: Characterisation of groundwater contamination using factor analysis.

Applied Factor Analysis in the Natural Sciences. 2nd edition. 1996.

Hersh W, Cohen AM, Roberts P: TREC 2007 Genomics Track overview.
Proceedings of 16th Text REtrieval Conference, NIST Special Publication 2007.

Allan J: HARD Track overview in TREC 2004.
Proceedings of 13th Text REtrieval Conference, NIST Special Publication 2004.

Hersh W, Cohen AM, Roberts P: TREC 2006 Genomics Track overview.
Proceedings of 15th Text REtrieval Conference, NIST Special Publication 2006.

Robertson SE, Sparck J: Relevance weighting of search terms.
JASIS 1976, 27(3):129-146.

Beaulieu M, Gatford M, Huang X, Robertson S, Walker S, Williams P: Okapi at TREC-5.
Proceedings of TREC-5, NIST Special Publication 1997, 143-166.

Huang X, Huang Y, Wen M: A dual index model for contextual information retrieval. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 15-19 2005, Salvador, Brazil. ACM; 2005:613-614.

Huang X, Wen M, An A, Huang Y: A platform for Okapi-based contextual information retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 6-11 2006, Seattle, Washington, USA. ACM; 2006:728.

Zhong M, Huang X: Concept-based biomedical text retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 6-11 2006, Seattle, Washington, USA. ACM; 2006:723-724.

Yin X, Huang X, Li Z: Promoting ranking diversity for biomedical information retrieval using Wikipedia.

Huang X, Zhong M, Si L: York University at TREC 2005: Genomics Track.

Hu Q, Huang JX: Passage extraction and result combination for genomics information retrieval.
J Intell Inf Syst 2010, 34(3):249-274.

Hu Q, Huang X: A dynamic window based passage extraction algorithm for genomics information retrieval.
ISMIS 2008, Foundations of Intelligent Systems, 17th International Symposium, May 20-23 2008, Toronto, Canada 2008, 434-444.

Metzler D, Croft WB: A Markov random field model for term dependencies. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '05. New York, NY, USA: ACM; 2005:472-479.

Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R: Indexing by latent semantic analysis.
J Am Soc Inf Sci 1990, 41(6):391-407.

Grefenstette G: Use of syntactic context to produce term association lists for text retrieval. In SIGIR '92: Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA: ACM; 1992:89-97.

Kaji H, Morimoto Y, Aizono T, Yamasaki N: Corpus-dependent association thesauri for information retrieval. In Proceedings of the 18th Conference on Computational Linguistics. Morristown, NJ, USA: Association for Computational Linguistics; 2000:404-410.

Manna S, Gedeon T: A term association inference model for single documents: a stepping stone for investigation through information extraction. In Proceedings of the IEEE ISI 2008 PAISI, PACCF, and SOCO International Workshops on Intelligence and Security Informatics, PAISI, PACCF and SOCO '08. Berlin, Heidelberg: Springer-Verlag; 2008:14-20.

Wei J, Bressan S, Ooi BC: Mining term association rules for automatic global query expansion: methodology and preliminary results. [http://dl.acm.org/citation.cfm?id=882511.885386]
Proceedings of the First International Conference on Web Information Systems Engineering (WISE'00) Washington, DC, USA: IEEE Computer Society; 2000, 1:366.

Mandl T: Efficient preprocessing for information retrieval with neural networks.
Proceedings of the 7th European Congress on Intelligent Techniques and Soft Computing 1999.

Machado AMC, Marinho CNJ, Campos MFM: An image retrieval method based on factor analysis.
Computer Graphics and Image Processing, Brazilian Symposium on 2003, 0:191.

Mehta B, Hofmann T, Frankhaser P: Cross system personalization by factor analysis.

Hu Q, Huang JX, Miao J: A robust approach to optimizing multisource information for enhancing genomics retrieval performance.
BMC Bioinformatics 2011, 12(Suppl 5):S6.

Huang JX, Hu Q: A bayesian learning approach to promoting diversity in ranking for biomedical information retrieval.

Olivas ES, Guerrero JDM, Martinez-Sober M, Magdalena-Benedito JR, López AJS: Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques. IGI Global; 2009.

Manning C, Schütze H: Foundations of Statistical Natural Language Processing. MIT Press; 1999.

Hofmann T: Probabilistic latent semantic indexing. In SIGIR '99: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA: ACM; 1999:50-57.

Srikant R, Agrawal R: Mining sequential patterns: generalizations and performance improvements. [http://dl.acm.org/citation.cfm?id=645337.650382]
Proceedings of the 5th International Conference on Extending Database Technology: Advances in Database Technology (EDBT '96) London, UK: Springer-Verlag; 1996, 3-17.