December 1, 2003
INTERVIEW
Fingerprinting the literature
There is a growing
need to develop more
sophisticated strategies
for searching and
accessing complex information
and establishing relevant
connections hidden away in
the scientific literature.
Open Access Now talked
to Les Grivell, who is the
Director of E-BioSci, an
ambitious European initiative
to create a powerful electronic
information platform for the
life sciences.
A European information platform
The E-BioSci program was initiated
three years ago by the European
Molecular Biology Organisation
(EMBO), after discussions with
a number of interested parties in
the research community and the
publishing industry. Following the
debate that led to the establishment
of PubMed Central by the National
Institutes of Health (see Open
Access Now, July 28, 2003), there were
similar discussions in Europe about
electronic access to scientific literature.
"At a certain moment EMBO took
the lead to get the European stakeholders
involved," recalls Grivell. "Frank
Gannon (EMBO's Executive Director)
called a number of meetings with
researchers, publishers and librarians
to try and get the ball rolling and
work out what people wanted from an
electronic information resource."
The outcome of these discussions
was a decision to develop the E-BioSci
platform. The project is currently
funded by a grant of €2.4 million
(roughly US$2.75 million) from the
European Commission, through the
Research Infrastructures section of
the Fifth Framework Quality of Life
Programme.
The E-BioSci development team,
housed in a building on the same
campus as the prestigious European
Molecular Biology Laboratory
(EMBL) in Heidelberg, Germany, is
sensitive to the reservations of
commercial publishers and their
fears about losing control of their
content. "E-BioSci has to be all things
for all people - to try to help scientists
without posing a threat to publishers,"
explains Grivell. "I think that publishers
were already aware when we
started that there would be changes
but they wanted time to see how best
to make a transition to a new model."
"When I joined the project just over
two-and-a-half years ago, our aim
was to set up an information resource
that did many things, and at the top
of the list was improved access to the
literature. The other main goal was
to improve the integration between
scientific information resources and
the literature." Ultimately, E-BioSci
hopes to allow its users to navigate
seamlessly between bibliographic,
sequence or image databases and the
relevant full-text published literature.
"When I took over it was clear to
us that it would take some time before
all commercial publishers would be
willing to release control of their
content," recalls Grivell. This meant
providing a system that allowed users
to query content that was held by
commercial publishers, without
violating any of the access controls
that were in place. "But we also
wanted a system that was using all the
benefits of Open Access. That's how
we came to choose the technology
that we have now implemented."
"Developing ways to search images
was a hot item," recalls Grivell.
"Scientific publication is changing,
with more emphasis on what is
generally called 'supplementary
material'. There isn't a good way of
accessing a lot of that material. You
find it if you look up the original
article, but there is no systematic
way of linking everything that is
associated with a particular published
article in such a way that the reader
can easily find it."
A few additional features were added
as the team went along. For example,
there was a need for multilingual features,
so that users could access literature
in different European languages or
be able to access the English-language
literature using another European language
as a query.
Finding fingerprints
The E-BioSci developers decided to
create an approach for searching the
literature that differed from conventional
technologies. They see it as
complementary to services such as
PubMed Central. "Searching full-text
is extremely complicated," says
Grivell. "That's why PubMed Central
decided to centralize everything.
It's much easier to have one large
archive that you let an indexing engine
loose on. If you have your literature
distributed over several different
locations then that makes problems
for your search engine which has to
go out to each of these locations."
But by linking distributed sets of
resources, E-BioSci could attract
commercial publishers and owners of
other resources, such as genomic and
multi-dimensional image databases.
The dispersed information is interpreted
by the semantic matching of conceptual
'fingerprints'. The fingerprint
is generated by indexing full-text and
extracting words and phrases that are
then matched against concepts that
are hierarchically organized and
numerically identified. The fingerprints
are centralized into a search
database, which can in principle be
mirrored in many locations.
Fingerprints can be produced by
indexing any type of text in any type
of format (HTML, PDF or plain text):
all the words are indexed, and then in a
second step they are looked up in a
thesaurus. The thesaurus is based on
the Medical Subject Headings (MeSH)
terms linked to UMLS (Unified
Medical Language System) identifiers,
defined by the US National Library
of Medicine. The words are then
identified as concepts in the thesaurus.
Where words form a phrase that is itself
a concept, the phrase is identified
and used as an extra level or hierarchy
in the search. "So, you end up with a
very small file that contains a list of
concepts that the article contains," says
Grivell. Typically there are around
thirty concepts per article, and sometimes
up to one hundred. In addition,
each of these concepts has a 'weight' in
the article, which is determined by the
frequency with which a phrase or word
occurs, and its context.
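
The two-step process described above (index the words, then look them up in a thesaurus) can be illustrated with a minimal sketch. It is not the Collexis implementation: the toy thesaurus, the concept IDs and the function name make_fingerprint are all invented for illustration, and the weight here reflects frequency only, whereas the article says context also plays a part.

```python
import re
from collections import Counter

# Hypothetical toy thesaurus: term (lowercase) -> numeric concept ID.
# Real entries would come from MeSH/UMLS; these terms and IDs are invented.
THESAURUS = {
    "heat shock protein": 101,
    "protein": 102,
    "yeast": 103,
    "saccharomyces cerevisiae": 103,  # a synonym shares the concept ID
    "mitochondria": 104,
}
MAX_PHRASE_WORDS = max(len(term.split()) for term in THESAURUS)


def make_fingerprint(text):
    """Return {concept_id: weight}; weights are relative frequencies."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    i = 0
    while i < len(words):
        # Prefer the longest thesaurus phrase starting at position i.
        for n in range(min(MAX_PHRASE_WORDS, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in THESAURUS:
                counts[THESAURUS[phrase]] += 1
                i += n
                break
        else:
            i += 1  # word is not in the thesaurus; skip it
    total = sum(counts.values())
    return {cid: c / total for cid, c in counts.items()} if total else {}


text = ("Yeast mitochondria import a heat shock protein. "
        "The protein is made in Saccharomyces cerevisiae.")
fingerprint = make_fingerprint(text)
```

In this sketch the phrase "heat shock protein" is matched as a single concept before the single word "protein" is considered, and "yeast" and "Saccharomyces cerevisiae" are counted under the same concept.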
The technology for generating
fingerprints is based on a collaboration
between E-BioSci and a small commercial
software company, Collexis
B.V., based in the Netherlands. The
thesaurus has been extended with a
gene symbol catalog developed in
cooperation with the Department of
Medical Informatics at the University
of Rotterdam. The full-text literature
can be searched using the conceptual
fingerprint rather than keywords.
"Because the text document itself
never moves from its original
location, you have a model that makes
both Open Access and commercial
publishers happy," notes Grivell.
Searching with fingerprints has some
unique features. Fingerprints are
typically 400 bytes in size. They can be
generated very fast - the team is
currently processing about 250,000
pages of text per day. And searching is
fast too - 500,000 fingerprints can be
compared in 40 milliseconds.
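
To put these figures in perspective, the following back-of-the-envelope calculation uses only the numbers quoted above. It assumes that every fingerprint is exactly 400 bytes (the article says "typically") and that processing runs around the clock.

```python
# Figures quoted in the article.
FINGERPRINT_BYTES = 400
PAGES_PER_DAY = 250_000
COMPARED = 500_000
COMPARE_MS = 40

# Assumption: every fingerprint is exactly 400 bytes.
index_megabytes = COMPARED * FINGERPRINT_BYTES / 1_000_000
comparisons_per_second = COMPARED * 1000 // COMPARE_MS
# Assumption: indexing runs 24 hours a day.
pages_per_second = PAGES_PER_DAY / (24 * 60 * 60)

print(f"{index_megabytes:.0f} MB for {COMPARED:,} fingerprints")
print(f"{comparisons_per_second:,} comparisons per second")
print(f"{pages_per_second:.1f} pages per second")
```

Under these assumptions, half a million fingerprints occupy about 200 MB, the comparison rate is 12.5 million fingerprints per second, and indexing averages roughly three pages per second.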
E-BioSci released a new version of
the prototype software in mid-August
and is currently working on ironing out
bugs, increasing content, improving
functionality, and so on. "In fact, it
does everything that we originally
planned to do - deep searching for
full-text, interlinking between different
databases, and multilingual searches
(in French or German at the moment).
The main thing that we are still
working on is the image-literature
connection," says Grivell.
The E-BioSci system is being
regularly updated. "During the process
of assigning terms to concepts you also
see that a large number of terms
occur that are not official concepts,
but often with time these become
accepted; they can be inserted and used
in subsequent updates," notes Grivell.
An interactive discovery tool
Grivell likes to think of E-BioSci as
a discovery tool, and he emphasizes
the differences between E-BioSci and
more conventional bibliographic
services such as Entrez-PubMed
(www.ncbi.nlm.nih.gov/PubMed).
"PubMed simply indexes everything.
When you do a search you are actually
looking up an entry in the index and
that points you to a number of
abstracts," explains Grivell. Of course,
one can do more advanced searches
using several keywords and Boolean
terms (such as 'AND' and 'NOT').
"But every term is equally weighted
and it's black and white - it's
there or it's not there," says Grivell.
"We deliberately took a different
approach. Here, the concepts are
derived from the article itself.
We take out the words and use them to
generate the fingerprint that forms the
basis of the search. The search process
itself is very interactive. The user can
look at the fingerprint and modify the
weight given to each concept. In that
way you can change the focus or
sharpen a search. You may end up
with something that is similar to
that which conventional searches
produce, but there are always the
additional unexpected results in there,
which people may have missed."
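
The interactive reweighting Grivell describes can be sketched as follows. The article does not say how E-BioSci scores a match, so cosine similarity is used here purely as an assumption; the document names, concept IDs and weights are invented.

```python
import math


def similarity(query, doc):
    """Cosine similarity between two {concept_id: weight} fingerprints."""
    dot = sum(w * doc.get(cid, 0.0) for cid, w in query.items())
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_d = math.sqrt(sum(w * w for w in doc.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0


def rank(query, collection):
    """Return document names sorted from best to worst match."""
    return sorted(collection,
                  key=lambda name: similarity(query, collection[name]),
                  reverse=True)


# Hypothetical fingerprints; concept IDs and weights are invented.
collection = {
    "paper_a": {101: 0.7, 103: 0.3},
    "paper_b": {101: 0.2, 104: 0.8},
}
query = {101: 0.5, 104: 0.5}
first = rank(query, collection)

# The user raises the weight of concept 101 and lowers concept 104.
refocused = {**query, 101: 0.9, 104: 0.1}
second = rank(refocused, collection)
```

With the initial weights, paper_b ranks first; after the user shifts the emphasis to concept 101, paper_a moves to the top. Unlike a Boolean query, no document is excluded outright; only the ordering changes.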
"We are very aware that there are
differences in maintaining this system
compared to PubMed," notes Grivell.
"Much depends on the thesaurus.
The subject area is also relevant -
some areas are better defined in the
MeSH framework than others. Also,
a lot depends to some extent on how
many synonyms (many names for the
same thing) are present for each
concept. In English, for example, we
have many synonyms, whereas in
German and French there are fewer.
When you search for a word in
PubMed you will miss an article
that only uses a synonym for that
word - but with us you will find it
if the synonym has been put into
the thesaurus tree."
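
The synonym point can be made concrete with a small sketch that contrasts a literal keyword match with a concept lookup. The synonym table, concept IDs and documents are hypothetical and far simpler than a MeSH-based thesaurus.

```python
# Hypothetical synonym table: every term maps to an invented concept ID.
CONCEPT_OF = {"tumor": 7, "tumour": 7, "neoplasm": 7, "kidney": 8, "renal": 8}

docs = {
    "doc1": "a kidney tumor was removed",
    "doc2": "the renal neoplasm was removed",
}


def keyword_search(term, docs):
    """Literal match: the exact word must appear in the text."""
    return sorted(name for name, text in docs.items() if term in text.split())


def concept_search(term, docs):
    """Match any word that maps to the same concept as the query term."""
    target = CONCEPT_OF.get(term)
    if target is None:
        return []  # the term is not in the thesaurus, so nothing can match
    return sorted(
        name for name, text in docs.items()
        if any(CONCEPT_OF.get(word) == target for word in text.split())
    )


keyword_hits = keyword_search("tumor", docs)
concept_hits = concept_search("tumor", docs)
```

Here the keyword search for "tumor" returns only doc1, while the concept search also returns doc2, which uses the synonym "neoplasm". As Grivell notes, this only works if the synonym has been entered in the thesaurus.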
Homonyms - single words with
multiple meanings - are an even
greater challenge. "This is something
we have not yet solved, because it
is really difficult," confesses Grivell.
"With gene symbols, homonyms are
very common. A gene, however, can
be defined by its context and this
should help resolve ambiguities. It is
a difficult problem and the solution
will take a while."
The E-BioSci project runs together
with another European program called
ORIEL (Online Research Information
Environment for the Life Sciences),
which is developing technologies
to manage large datasets. Among
future challenges for the two projects
is the development of methodologies
for searching image databases.
One ORIEL group (led by Dr David
Shotton, University of Oxford) is
working on Bioimage to develop a
really well-structured image database
that will cover conventional kinds
of searches, based on metadata and
image descriptions, as well as a
link to E-BioSci searches based on
fingerprints.
Having shown that the technology
works well, the next step will be to
scale up the number of resources that
are linked through E-BioSci. The
prototype at the moment makes use
mainly of test collections of fingerprints
that include all MEDLINE
abstracts and commercial publisher
collections. But the success of the
system will surely be linked to the
breadth of the literature sampled.
"If publishers are interested in working
with us, then creating fingerprints
is not difficult or time-consuming,"
says Grivell. Open Access journals and
Open Archives are ideally suited to
this technology. "I am certainly keen
to apply the technology to any archive
that would be interested in seeing
whether access is improved," Grivell
offers. "One problem for repositories
is that institutions find it hard to
convince scientists to convert
information into metadata formats.
When you use E-BioSci any text
format can be searched. The system
could also generate fingerprints
from different parts of documents - for
example looking specifically through
methods sections."
E-BioSci works on a combination of
WSDL (Web Service Description
Language) and SOAP (Simple Object
Access Protocol). These are protocols
that can be used to tell the user how a
database is structured and how to
query it. E-BioSci is looking for
other WSDL/SOAP partners, with
reciprocal benefits to both in terms of
sophisticated multi-dimensional database
searches.
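
For readers unfamiliar with SOAP, the sketch below builds the kind of XML message such a service exchanges. Only the SOAP 1.1 envelope namespace is standard; the service namespace, the "search" operation and the "concept" elements are hypothetical and do not describe E-BioSci's actual interface, which would be defined in its WSDL file.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:fingerprint-search"  # hypothetical, not E-BioSci's


def build_request(concepts):
    """Build a SOAP 1.1 envelope for a hypothetical 'search' operation."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    search = ET.SubElement(body, f"{{{SERVICE_NS}}}search")
    for concept_id, weight in concepts.items():
        # One element per weighted concept in the query fingerprint.
        item = ET.SubElement(search, f"{{{SERVICE_NS}}}concept")
        item.set("id", str(concept_id))
        item.set("weight", str(weight))
    return ET.tostring(envelope, encoding="unicode")


request_xml = build_request({101: 0.9, 104: 0.1})
```

A client would send such a message over HTTP to the address given in the service's WSDL description and receive a similar envelope in reply.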
The future of E-BioSci will depend on
how many people use it and find it
helpful for their research. This will in
turn influence E-BioSci's capacity to
seek financial support. "This coming
year will be quite crucial for us - we
will have to discover from people who
use the system how useful they have
found it," notes Grivell. The proof
of this ambitious data-mining service
will lie with the users. Only they
can demonstrate whether following
the fingerprint clues generated by
E-BioSci will lead to novel scientific
discoveries.
www.e-biosci.org
www.oriel.org