
December 1, 2003

INTERVIEW

Fingerprinting the literature

There is a growing need to develop more sophisticated strategies for searching and accessing complex information and establishing relevant connections hidden away in the scientific literature. Open Access Now talked to Les Grivell, who is the Director of E-BioSci, an ambitious European initiative to create a powerful electronic information platform for the life sciences.

A European information platform
The E-BioSci program was initiated three years ago by the European Molecular Biology Organisation (EMBO), after discussions with a number of interested parties in the research community and the publishing industry. Following the debate that led to the establishment of PubMed Central by the National Institutes of Health (see Open Access Now, July 28, 2003), there were similar discussions in Europe about electronic access to scientific literature. "At a certain moment EMBO took the lead to get the European stakeholders involved," recalls Grivell. "Frank Gannon (EMBO's Executive Director) called a number of meetings with researchers, publishers and librarians to try and get the ball rolling and work out what people wanted from an electronic information resource."

The outcome of these discussions was a decision to develop the E-BioSci platform. The project is currently funded by a grant of €2.4 million (roughly US$2.75 million) from the European Commission, through the Research Infrastructures section of the Fifth Framework Quality of Life Programme.

The E-BioSci development team, housed in a building on the same campus as the prestigious European Molecular Biology Laboratory (EMBL) in Heidelberg, Germany, is sensitive to the reservations of commercial publishers and their fears about losing control of their content. "E-BioSci has to be all things for all people - to try to help scientists without posing a threat to publishers," explains Grivell. "I think that publishers were already aware when we started that there would be changes but they wanted time to see how best to make a transition to a new model."

"When I joined the project just over two-and-a-half years ago, our aim was to set up an information resource that did many things, and at the top of the list was improved access to the literature. The other main goal was to improve the integration between scientific information resources and the literature." Ultimately, E-BioSci hopes to allow its users to navigate seamlessly between bibliographic, sequence or image databases and the relevant full-text published literature.

"When I took over it was clear to us that it would take some time before all commercial publishers would be willing to release control of their content," recalls Grivell. This meant providing a system that allowed users to query content that was held by commercial publishers, without violating any of the access controls that were in place. "But we also wanted a system that was using all the benefits of Open Access. That's how we came to choose the technology that we have now implemented."


"There is no systematic way of linking everything associated with a particular published article in such a way that the reader can easily find it"

Les Grivell


"Developing ways to search images was a hot item," recalls Grivell. "Scientific publication is changing, with more emphasis on what is generally called 'supplementary material'. There isn't a good way of accessing a lot of that material. You find it if you look up the original article, but there is no systematic way of linking everything that is associated with a particular published article in such a way that the reader can easily find it."

A few additional features were added as the team went along. For example, there was a need for multilingual features, so that users could access literature in different European languages or be able to access the English-language literature using another European language as a query.

Finding fingerprints
The E-BioSci developers decided to create an approach for searching the literature that differed from conventional technologies. They see it as complementary to services such as PubMed Central. "Searching full-text is extremely complicated," says Grivell. "That's why PubMed Central decided to centralize everything. It's much easier to have one large archive that you let an indexing engine loose on. If you have your literature distributed over several different locations then that makes problems for your search engine which has to go out to each of these locations."


"E-BioSci has to be all things for all people"

Les Grivell


But by linking distributed sets of resources, E-BioSci could attract commercial publishers and owners of other resources, such as genomic and multi-dimensional image databases. The dispersed information is interpreted by the semantic matching of conceptual 'fingerprints'. The fingerprint is generated by indexing full-text and extracting words and phrases that are then matched against concepts that are hierarchically organized and numerically identified. The fingerprints are centralized into a search database, which can in principle be mirrored in many locations.

Fingerprints can be produced by indexing any type of text in any type of format (HTML, PDF or plain text): all the words are indexed, and then in a second step they are looked up in a thesaurus. The thesaurus is based on the Medical Subject Headings (MeSH) terms linked to UMLS (Unified Medical Language System) identifiers, defined by the US National Library of Medicine. The words are then identified as concepts in the thesaurus. Where several words form a phrase that is itself a concept, the phrase is identified and used as an extra level or hierarchy in the search. "So, you end up with a very small file that contains a list of concepts that the article contains," says Grivell. Typically there are around thirty concepts per article, and sometimes up to one hundred. In addition, each of these concepts has a 'weight' in the article, which is determined by the frequency with which a phrase or word occurs, and its context.
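The two-step process described above (index the words, then look them up in a thesaurus) can be illustrated with a minimal sketch. It is not the Collexis implementation: the thesaurus entries, the numeric identifiers and the `fingerprint` function are all hypothetical, and the weights here reflect frequency only, whereas the article says context also plays a part.

```python
import re
from collections import Counter

# Hypothetical mini-thesaurus: term or phrase -> numeric concept identifier.
# Real identifiers would come from MeSH/UMLS; these numbers are invented.
THESAURUS = {
    "heart attack": 1001,
    "myocardial infarction": 1001,  # synonym of the same concept
    "aspirin": 1002,
    "blood": 1003,
}
MAX_PHRASE_WORDS = 2  # longest phrase in the toy thesaurus


def fingerprint(text):
    """Return {concept_id: weight}; weights are relative frequencies."""
    # Step 1: index every word in the text.
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    i = 0
    # Step 2: look words up in the thesaurus, preferring the longest
    # phrase that is itself a concept.
    while i < len(words):
        for n in range(MAX_PHRASE_WORDS, 0, -1):
            chunk = words[i:i + n]
            phrase = " ".join(chunk)
            if len(chunk) == n and phrase in THESAURUS:
                counts[THESAURUS[phrase]] += 1
                i += n
                break
        else:
            i += 1  # word is not a known concept; skip it
    total = sum(counts.values())
    return {cid: c / total for cid, c in counts.items()} if total else {}


sample = ("Aspirin after a heart attack: aspirin thins the blood. "
          "Myocardial infarction risk.")
print(fingerprint(sample))
```

The result is a small mapping of concept identifiers to weights, which is the kind of compact record that could be centralized while the full text stays where it is.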

The technology for generating fingerprints is based on a collaboration between E-BioSci and a small commercial software company, Collexis B.V., based in the Netherlands. The thesaurus has been extended with a gene symbol catalog developed in cooperation with the Department of Medical Informatics at the University of Rotterdam. The full-text literature can be searched using the conceptual fingerprint rather than keywords. "Because the text document itself never moves from its original location, you have a model that makes both Open Access and commercial publishers happy," notes Grivell.

Searching with fingerprints has some unique features. Fingerprints are typically 400 bytes in size. They can be generated very fast - the team is currently processing about 250,000 pages of text per day. And searching is fast too - 500,000 fingerprints can be compared in 40 milliseconds.
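A quick back-of-the-envelope calculation puts these figures in perspective. The sketch below only rearranges the numbers quoted above; it assumes decimal megabytes and that every fingerprint has the typical size.

```python
# Figures quoted in the article.
FINGERPRINT_BYTES = 400
FINGERPRINTS_COMPARED = 500_000
COMPARE_MILLISECONDS = 40
PAGES_PER_DAY = 250_000

# Storage for 500,000 typical fingerprints (assumption: 1 MB = 1,000,000 bytes).
index_megabytes = FINGERPRINT_BYTES * FINGERPRINTS_COMPARED / 1_000_000

# Comparison throughput implied by 500,000 comparisons in 40 milliseconds.
comparisons_per_second = FINGERPRINTS_COMPARED * 1000 / COMPARE_MILLISECONDS

# Indexing throughput implied by 250,000 pages per day.
pages_per_second = PAGES_PER_DAY / (24 * 60 * 60)

print(f"{index_megabytes:.0f} MB for {FINGERPRINTS_COMPARED:,} fingerprints")
print(f"{comparisons_per_second:,.0f} comparisons per second")
print(f"{pages_per_second:.1f} pages indexed per second")
```

On these assumptions, half a million fingerprints occupy about 200 MB, small enough to mirror in many locations.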

E-BioSci released a new version of the prototype software in mid-August and is currently working on ironing out bugs, increasing content, improving functionality, and so on. "In fact, it does everything that we originally planned to do - deep searching for full-text, interlinking between different databases, and multilingual searches (in French or German at the moment). The main thing that we are still working on is the image-literature connection," says Grivell.

The E-BioSci system is being regularly updated. "During the process of assigning terms to concepts you also see that a large number of terms occur that are not official concepts, but often with time these become accepted; they can be inserted and used in subsequent updates," notes Grivell.


"But we also wanted a system that was using all the benefits of Open Access"

Les Grivell


An interactive discovery tool
Grivell likes to think of E-BioSci as a discovery tool, and he emphasizes the differences between E-BioSci and more conventional bibliographic services such as Entrez-PubMed (www.ncbi.nlm.nih.gov/PubMed). "PubMed simply indexes everything. When you do a search you are actually looking up an entry in the index and that points you to a number of abstracts," explains Grivell. Of course, one can do more advanced searches using several keywords and Boolean terms (such as 'AND' and 'NOT'). "But every term is equally weighted and it's black and white - it's there or it's not there," says Grivell. "We deliberately took a different approach. Here, the concepts are derived from the article itself. We take out the words and use them to generate the fingerprint that forms the basis of the search. The search process itself is very interactive. The user can look at the fingerprint and modify the weight given to each concept. In that way you can change the focus or sharpen a search. You may end up with something that is similar to that which conventional searches produce, but there are always the additional unexpected results in there, which people may have missed."
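The contrast between Boolean matching and weighted matching can be sketched in a few lines. The article does not say which similarity measure E-BioSci uses, so cosine similarity is an assumption here, and the concept names, documents and helper functions are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse {concept: weight} fingerprints."""
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, docs):
    """Return document names ordered from best to worst match."""
    return sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)


def reweight(query, concept, factor):
    """Return a copy of the query with one concept's weight scaled."""
    adjusted = dict(query)
    adjusted[concept] = adjusted.get(concept, 0.0) * factor
    return adjusted


# Hypothetical fingerprints for two articles and one query.
docs = {
    "A": {"aspirin": 0.8, "stroke": 0.2},
    "B": {"aspirin": 0.2, "stroke": 0.8},
}
query = {"aspirin": 0.6, "stroke": 0.4}

print(rank(query, docs))                         # aspirin-heavy article first
print(rank(reweight(query, "stroke", 3), docs))  # emphasis shifted to stroke
```

Both articles would satisfy a Boolean query for 'aspirin AND stroke'. With weights, raising the importance of one concept changes the order of the results, which is the kind of refocusing Grivell describes.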

"We are very aware that there are differences in maintaining this system compared to PubMed," notes Grivell. "Much depends on the thesaurus. The subject area is also relevant - some areas are better defined in the MeSH framework than others. Also, a lot depends to some extent on how many synonyms (many names for the same thing) are present for each concept. In English, for example, we have many synonyms, whereas in German and French there are fewer. When you search for a word in PubMed you will miss an article that only uses a synonym for that word - but with us you will find it if the synonym has been put into the thesaurus tree."
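The synonym point can be shown with a toy comparison. The documents, the synonym table and the concept label are hypothetical, and simple substring matching stands in for real indexing.

```python
# Hypothetical thesaurus fragment: several names map to one concept label.
SYNONYMS = {
    "heart attack": "C_MI",
    "myocardial infarction": "C_MI",
    "cardiac infarction": "C_MI",
}


def keyword_hits(term, docs):
    """Plain keyword search: only documents containing the literal term."""
    return [name for name, text in docs.items() if term.lower() in text.lower()]


def concept_hits(term, docs):
    """Concept search: documents containing any synonym of the term's concept."""
    concept = SYNONYMS.get(term.lower())
    if concept is None:
        return []  # term is not in the thesaurus
    names = [s for s, c in SYNONYMS.items() if c == concept]
    return [name for name, text in docs.items()
            if any(s in text.lower() for s in names)]


docs = {
    "doc1": "Outcomes after heart attack in younger adults",
    "doc2": "Myocardial infarction in older patients",
}

print(keyword_hits("heart attack", docs))  # misses doc2
print(concept_hits("heart attack", docs))  # finds both
```

The concept search only helps when the synonym is already in the thesaurus, which is why so much depends on how well the thesaurus covers a subject area.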

Homonyms - single words with multiple meanings - are an even greater challenge. "This is something we have not yet solved, because it is really difficult," confesses Grivell. "With gene symbols, homonyms are very common. A gene, however, can be defined by its context and this should help resolve ambiguities. It is a difficult problem and the solution will take a while."
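One simple way to use context, offered purely as an illustration and not as E-BioSci's approach (which Grivell says is still unsolved), is to score each possible meaning of a symbol by how many of its typical context words appear nearby. The sense table and context words below are hypothetical.

```python
import re

# Hypothetical sense inventory: symbol -> {meaning: typical context words}.
SENSES = {
    "CAT": {
        "catalase (gene)": {"peroxide", "oxidative", "enzyme", "peroxisome"},
        "cat (animal)": {"feline", "pet", "veterinary", "kitten"},
    }
}


def disambiguate(symbol, text):
    """Pick the meaning whose context words overlap most with the text.

    Returns None when no context word is found, i.e. the symbol
    stays ambiguous.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {sense: len(words & context)
              for sense, context in SENSES[symbol].items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(disambiguate("CAT", "CAT activity protects cells from hydrogen "
                          "peroxide and oxidative stress"))
print(disambiguate("CAT", "The veterinary clinic examined the cat"))
print(disambiguate("CAT", "CAT was measured"))
```

The last call returns None, which shows why the problem is hard: short or uninformative passages often give no usable context.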

The E-BioSci project runs together with another European program called ORIEL (Online Research Information Environment for the Life Sciences), which is developing technologies to manage large datasets. Among future challenges for the two projects is the development of methodologies for searching image databases. One ORIEL group (led by Dr David Shotton, University of Oxford) is working on Bioimage to develop a really well-structured image database that will cover conventional kinds of searches, based on metadata and image descriptions, as well as a link to E-BioSci searches based on fingerprints.

Having shown that the technology works well, the next step will be to scale up the number of resources that are linked through E-BioSci. The prototype at the moment makes use mainly of test collections of fingerprints that include all MEDLINE abstracts and commercial publisher collections. But the success of the system will surely be linked to the breadth of the literature sampled. "If publishers are interested in working with us, then creating fingerprints is not difficult or time-consuming," says Grivell. Open Access journals and Open Archives are ideally suited to this technology. "I am certainly keen to apply the technology to any archive that would be interested in seeing whether access is improved," Grivell offers. "One problem for repositories is that institutions find it hard to convince scientists to convert information into metadata formats. When you use E-BioSci any text format can be searched. The system could also generate fingerprints from different parts of documents - for example looking specifically through methods sections."

E-BioSci works on a combination of WSDL (Web Services Description Language) and SOAP (Simple Object Access Protocol). These are protocols that can be used to tell the user how a database is structured and how to query it. E-BioSci is looking for other WSDL/SOAP partners, with reciprocal benefits to both in terms of sophisticated multi-dimensional database searches.
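To give a sense of what a SOAP query looks like on the wire, the sketch below builds a SOAP 1.1 envelope with Python's standard library. The envelope namespace is the standard SOAP 1.1 one, but the service namespace, the `SearchByFingerprint` operation and its parameters are invented for illustration; the article does not describe E-BioSci's actual interface.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
SERVICE_NS = "urn:example:fingerprint-search"          # hypothetical service


def build_search_request(concept_weights, max_results=10):
    """Return a SOAP 1.1 request as a string for a hypothetical search call."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # Hypothetical operation: a WSDL file would normally declare its name,
    # parameters and types so that clients know how to call it.
    operation = ET.SubElement(body, f"{{{SERVICE_NS}}}SearchByFingerprint")
    ET.SubElement(operation, f"{{{SERVICE_NS}}}maxResults").text = str(max_results)
    for concept_id, weight in concept_weights.items():
        concept = ET.SubElement(operation, f"{{{SERVICE_NS}}}concept",
                                {"id": str(concept_id)})
        concept.text = f"{weight:.3f}"
    return ET.tostring(envelope, encoding="unicode")


print(build_search_request({1001: 0.4, 1002: 0.6}, max_results=5))
```

In a WSDL/SOAP arrangement, the WSDL document plays the role of the published description, and messages like this one carry the individual queries.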

The future of E-BioSci will depend on how many people use it and find it helpful for their research. This will in turn influence E-BioSci's capacity to seek financial support. "This coming year will be quite crucial for us - we will have to discover from people who use the system how useful they have found it," notes Grivell. The proof of this ambitious data-mining service will lie with the users. Only they can demonstrate whether following the fingerprint clues generated by E-BioSci will lead to novel scientific discoveries.

www.e-biosci.org
www.oriel.org

Open Access Now is published by BioMed Central.
Editor: Jonathan B Weitzman.