Freedom of Information Conference 2000
Dr John Wilbur
National Centre for Biotechnology Information
Prospects for improved access to large document sets
How could you access data? The problem is getting bigger and bigger as
databases get larger and larger. That's unavoidable. I'd like to give
you a picture of what we are like today and perhaps what we may be like
tomorrow.
You'd like retrieval of information to be intelligent; it's always nicer
to deal with a system that is intelligent. That makes you think you'd like
to use a computer, so you've got to talk about artificial intelligence. Most
of our experiences haven't been too great with artificial intelligence,
but there is one outstanding area where it has been a great success and
that's what's called "expert systems". You've probably heard
of the programme developed at Stanford University for prescribing antibiotics
to meningitis patients at a fully expert level just as well as any human
could do it. There are thousands of these systems in the industry and
one of the real problems is that they are quite limited. You really have
to restrict yourself to a small area and you have to coax out of the expert
how it is he's doing what he's doing and encode that into a programme.
Trying to define intelligence
If you take
a bunch of expert systems and put them all together you just get chaos.
The expert systems are built on rule sets or on propositions and there
are logical calculations going on. As you increase the size of the rule
sets, as you increase the number of propositions or statements of truth,
what happens is your system becomes overwhelmed by so many possible lines
of reasoning that it just loses focus and becomes impotent. So one then
has to try to re-evaluate what we mean by intelligence.
A researcher in artificial intelligence gave a very provocative definition
of intelligence:
"Perfect intelligence is the optimal use of the information you have
to achieve the goals you have". Now that's a very profound statement
and it actually makes the possibility of measuring intelligence seem like
a reasonable thing. If you know what perfect intelligence is you can assign
numerical measures. But there's a little hitch here; it's only a workable
definition if you know what the goal is. You can talk about perfection
in reaching a goal and perhaps you can actually design a system that's
perfect in that sense. But if you talk about general intelligence this
sort of thing begins to break down because we don't know what that is.
If you don't know what the goal is you're handicapped when designing a
programme.
You might say we do know what intelligence is. We know each other are
intelligent, so we think we know what intelligence is, but people argue about
what intelligence is and they argue over tests for intelligence. We don't know
what general intelligence is, but we at least know it doesn't seem to be
algorithmic. It's
not just some fancy algorithm that somebody put together one night when
they were drinking a lot of coffee, it doesn't work that way. At Stanford
they are quite persuasive in arguing that if you're going to have an intelligent
system you've got to have one that has a lot of knowledge in it. It can't
be intelligent like you and I are if it doesn't know a lot of facts. You
can't read a newspaper and make sense of it if you don't know the background
so one could argue that you have to have a lot of information and knowledge
built into the system if you're going to make it intelligent.
Knowledge bases and document retrieval
Then there's
the real issue; would that be enough? If I carry that reasoning further
I could say you don't have to be too sophisticated, if you've got enough
information built into the system it'll just be intelligent. Well, they
haven't been able to prove that point but it does lead to something called
the "knowledge acquisition bottleneck" - you've got to be intelligent
to learn and if you don't have knowledge you can't learn so you can't
get the knowledge. So you have this bottleneck, which is a real problem.
So science
has moved somewhat from artificial intelligence and is concentrating on
what's called "machine learning". It's an active field and it
is having some success. People are working on knowledge bases - places
to store all that knowledge that machines are going to learn when they
learn how to learn. But that's getting a little less interesting because
people aren't learning how to learn fast enough to warrant much in the
way of knowledge bases. There are a lot of expectations here, there always
have been, but one of the expectations is that somehow machine learning
and knowledge bases will eventually lead us to improved document retrieval.
Where we are right now - Medline and PubMed
Next I want to talk to you about where we are right now in regard to document
retrieval. The current state of the art, and it's not intelligent you
might say, is Medline. It's statistical document retrieval and not really
that intelligent because you just throw all the words in a bag, shake
it, and then use the result. You take a document with a title, an abstract,
authors, and such like, and use the MeSH index terms that are assigned
by the library. Some of those have stars which say they're really important
and sometimes we have a hard time knowing if they're really that important.
And then there are qualifiers on those that sometimes focus a MeSH term
more specifically on a subject. Then we have titles which are natural
language objects, just strings of text, and we break out the terms, the
particular words we think are useful. The same goes for abstracts. Those
are the basic pieces of information or the features on which we base our
efforts to retrieve information.
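To make the "words in a bag" picture concrete, here is a minimal Python sketch of turning one record into a bag of features. The function name, the `mesh:` prefix, and the `(term, is_starred)` layout are assumptions made for illustration; they are not the actual Medline record format or indexing code.

```python
import re
from collections import Counter

def extract_features(title, abstract, mesh_terms):
    """Return a bag (Counter) of features for one document.

    mesh_terms is a list of (term, is_starred) pairs. This layout is a
    hypothetical stand-in for the library-assigned index terms.
    """
    bag = Counter()
    for field in (title, abstract):
        # Break the natural-language text into lowercase word tokens.
        bag.update(re.findall(r"[a-z0-9]+", field.lower()))
    for term, is_starred in mesh_terms:
        # Keep each index term whole and mark the starred ones with "*".
        bag["mesh:" + term.lower() + ("*" if is_starred else "")] += 1
    return bag

doc = extract_features(
    "Antibiotic therapy in meningitis",
    "We review antibiotic therapy for bacterial meningitis.",
    [("Meningitis", True), ("Anti-Bacterial Agents", False)],
)
```

Word order is thrown away on purpose: only which features occur, and how often, is kept.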
Then you talk about a particular word, term, or phrase. There are two
questions you could ask. Firstly, is it useful somewhere in some subject area? And then,
in this particular document, is this a useful term, does it somehow give
me information about the subject of this particular document? We don't
make categorical decisions about those things but we try to assign a numerical
weight as to how important the term is. Is it generally useful? We base
it on the frequency of that term. If the term is in all the documents
it's of no use because it won't help us distinguish documents.
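One common way to turn that idea into a number is an inverse-document-frequency weight. The sketch below uses log(N / df), which is an assumption chosen for illustration and not necessarily the formula used in PubMed; it does have the property described above, that a term appearing in every document gets a weight of zero.

```python
import math

def global_weight(term, documents):
    """Weight a term by how rare it is across the collection.

    documents is a list of sets of terms. Returns log(N / df), where df is
    the number of documents containing the term, or 0.0 for an unseen term.
    """
    n_docs = len(documents)
    doc_freq = sum(1 for doc in documents if term in doc)
    if doc_freq == 0:
        return 0.0
    return math.log(n_docs / doc_freq)

docs = [{"the", "gene"}, {"the", "protein"}, {"the", "gene", "yeast"}]
weights = {t: global_weight(t, docs) for t in ("the", "gene", "yeast")}
```

Here "the" is in all three documents and so cannot help tell them apart, while "yeast", which is in only one, gets the largest weight.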
And there's
the situation in a particular document and that really is based on the
frequency of the term in the document. If the term appears only once then
maybe it's not that important but if it's there four or six times then
it's probably an important term and so we do another calculation to give
a weight to that term.
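As a rough sketch of that second calculation, a local weight can grow with the count of the term in the document, but more slowly than the count itself. The 1 + log(count) form below is an assumed, commonly used choice and not the actual calculation behind PubMed.

```python
import math

def local_weight(count):
    """Weight a term by how often it occurs inside one document.

    A single occurrence scores 1.0; repeated occurrences score more,
    with diminishing returns. A count of zero or less scores 0.0.
    """
    if count <= 0:
        return 0.0
    return 1.0 + math.log(count)

once, four_times, six_times = local_weight(1), local_weight(4), local_weight(6)
```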
Then what do we do with the weight? In PubMed when we ask for related
documents we simply ask how many terms do they have in common? For every one
of those terms they have in common we multiply the local weight in one
document, the local weight in the other document, and the global weight,
and that's a measure of how similar those two documents are. That seems
pretty straightforward. But it has one little hitch and that is that if
I have a big document that takes up a lot of space it's likely to overlap
with the first document in a bigger way so it gets a bigger score, even
though it doesn't really mean anything. So we do a correction. We say
let's take that raw score, divide it by some figure for the length of
the first document and some figure for the length of the second document
and compute a normalised score that sort of takes into account that documents
aren't all the same length. This gives them all the same potential to
do well. So that is where we are in terms of document retrieval.
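The scoring just described can be sketched in a few lines of Python. The multiplication of two local weights and one global weight over the shared terms follows the description above; the particular local weight (1 + log of the count) and the particular length figure (a Euclidean norm of the weighted term vector) are assumptions for illustration, not the formulas PubMed actually uses.

```python
import math
from collections import Counter

def similarity(doc_a, doc_b, global_weights):
    """Length-normalised similarity of two documents.

    doc_a and doc_b map each term to a positive count; global_weights maps
    each term to a non-negative collection-wide weight.
    """
    def local(count):
        # Assumed local weight: damped count of the term in the document.
        return 1.0 + math.log(count)

    def length(doc):
        # Assumed length figure: norm of the document's weighted terms.
        return math.sqrt(sum(local(c) ** 2 * global_weights.get(t, 0.0)
                             for t, c in doc.items()))

    # Raw score: local weight * local weight * global weight, summed over
    # the terms the two documents have in common.
    raw = sum(local(doc_a[t]) * local(doc_b[t]) * global_weights.get(t, 0.0)
              for t in doc_a.keys() & doc_b.keys())
    denom = length(doc_a) * length(doc_b)
    return raw / denom if denom else 0.0

gw = {"gene": 2.0, "protein": 1.0, "yeast": 1.5}
a = Counter({"gene": 3, "protein": 1})
b = Counter({"gene": 1, "yeast": 2})
score = similarity(a, b, gw)
```

With this choice of length figure a document compared with itself scores 1, and no pair can score higher, so a long document gains nothing just by being long.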
Comparing artificial and human intelligence
Now let's
ask ourselves, what if we had intelligent retrieval? How would it compare
with what we have right now, statistical retrieval? How much improvement
would we have if we really could get artificial intelligence to work?
Well, you can actually do head to head competition here. You set a query
for a computer and a human. You then judge how useful in answering the
query are the documents retrieved by the human and those retrieved by
the computer. Way back in 1968 this was done and they discovered the computer
was not quite as good as the human. The computer did about 75% of what
the human could do and they suggested that that was as much progress as
they could hope for in the future but if they got as good as a human they
were not going to do any better.
We've actually
redone their experiment because we have developed two test sets of documents
in Medline. The first test set was developed on 72,000 documents. We randomly
picked a hundred documents and said that we'll treat these as the queries
and for each of those we'll pull up the top fifty by computer and ask
how related are they. We then had a human go through and judge how related
those fifty were to that query document. Fifty documents to each query,
and there's a hundred queries, that's 5000. It's a rather mind-deadening
task, as I know because I had to do it along with a few other people! So five of
us did that and it turned out the computer did as well as anybody and
actually better than most of us.
We then moved to a larger database because 72,000 is still a small database by
today's standards. We had 1.2 million documents. Now we repeated the experiment
so again we chose 100 documents randomly and for each we picked out the
fifty that looked the closest by computer. Then we hired seven molecular
biologists with PhDs or equivalent to go through and judge all of those
5000 pairs of documents. It took them about a hundred hours each but they
got through. The computer did, on average, about 80% of what the humans
could do, although I will say that the computer did as well as one of
those people at least.
Understanding why people are so awful
Now why is it that people are so awful? Well, people worry about lengths of
queries; people who query net search engines type in one or two words and
expect to get
a reasonable answer from that. But that's not the issue here because our
documents have several hundred words in them so it's not a short document
problem, it's an ambiguity of language problem. I mean, language is ambiguous
and people's judgements are extremely variable so there's plenty of reason
to believe that this is the reason why people don't do any better than
this. Well, is it a hopeless situation? Is it really impossible for people
to retrieve documents better than this or is there some hope? Well there
is some hope, I'd hate to end on a pessimistic note.
We ranked
the top twenty documents and looked at what percentage of the documents
chosen by the computer and the people included that twenty. Before we
look at the figures I must explain that 42% is just random, it's what
you get when you just shuffle the papers and take out twenty. So, here
are the results: the computer got about 53% and the individual person
was somewhere around 54%, 55%. Now why do I say there's hope? Because
we discovered that if you pool the judgements of the judges together you
could predict better and better what the unknown guy - the guy outside
the pool - wanted to see. He would judge it better and better if you pool
a committee, so there is wisdom in numbers. We found you really could
predict much better if you took several people - and it didn't take a
lot of people to make a difference here. You could eventually predict
almost twice as well as the computer, so there's a lot of room for improvement
and it's actually doable.
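The pooling idea can be illustrated with a small Python sketch. The function names and the toy relevance scores below are invented for illustration and are not data from the experiment: each judge scores the same list of documents, the committee's scores are averaged, and we check how much of the held-out judge's top choices the committee predicts.

```python
def top_k(scores, k):
    """Indices of the k highest-scoring documents (ties broken by index)."""
    return set(sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k])

def pooled_scores(committee):
    """Average each document's score across the committee's judges."""
    return [sum(column) / len(committee) for column in zip(*committee)]

def agreement(predicted, held_out, k):
    """Fraction of the held-out judge's top k also in the predicted top k."""
    return len(top_k(predicted, k) & top_k(held_out, k)) / k

# Toy scores for six documents (made up, higher means more related).
held_out = [3, 0, 2, 0, 1, 0]
judge_1 = [3, 2, 1, 0, 0, 0]
judge_2 = [0, 0, 3, 2, 0, 0]

alone_1 = agreement(judge_1, held_out, 2)
alone_2 = agreement(judge_2, held_out, 2)
together = agreement(pooled_scores([judge_1, judge_2]), held_out, 2)
```

In this toy case each judge alone matches half of the held-out judge's top two, while the pooled pair matches both, which is the kind of effect the committee result describes.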
Is common sense a factor?
But then
there's the question, "well, if it's doable, how could we do it?"
Well we see we get a super-human performance by a committee, it clearly
outclasses the individual. The natural question then is, "what is
that based on?" Now if you've read any artificial intelligence literature
you've probably run into the statement that the hardest thing to get a
computer to do is use common sense. Common sense tells us most vehicles
ride on four wheels, that water runs downhill, that animals get diseases,
the kinds of things that everybody knows. But what we see in the experiment,
the improvement, is not common sense because everybody should have it.
Therefore the individual should have done just as well as the committee.
So then perhaps it's educated sense. It must be that people have knowledge
of molecular biology and because they know something but they don't know
it all, when you get a bunch of them together they know more. I thought
that was a reasonable explanation but it didn't turn out to be right.
So why do
I say it's not right? We decided to test this theory so we got six educated
people who didn't know molecular biology and six molecular biologists
and we got them to judge the papers. We noticed that there's almost no
difference at all between the trained and the untrained!
So what good
is training? Well that's a hard issue but all I could come to is that
the way you decide if two documents are related is some kind of generic
pattern-recognition ability that you learn when you're growing up and
yes, it's based on experience. But it doesn't have anything to do with
college or beyond. So, we know we can capture this information if you
just lock up enough people, make them make these judgements, then free
them. And we know we can improve retrieval this way. So we could actually
take the information generated by these people and put it into a computer
programme and get the gist of what's going on. So we know we can get this
information and actually use it but the next question is how are we going
to get this stuff? And that's the problem. This is where open systems
have a real contribution to make.
A model for the future
The best
source of population data is the system that is the most open to the most
people because you've got the users there and if you can capture what
they're doing then hopefully you can use that to improve retrieval. A
possible way of doing this that I have designed is what I call informative
triples which are an extremely simple-minded idea. If you're looking at
a document and ask for related documents the computer comes back and offers
a bunch. It effectively says here are a load of options but I don't know
which one is really right. Then your response is to look at the list given
to you by the computer and you ask for, say, number two. That way you've
given the computer some information because you had to look beyond number
one to ask for number two. If the computer observes enough people it'll
begin to get the idea that maybe number one isn't such a good document.
To answer the query maybe I really need to put number two where number
one is. So you can see how the computer might use this to capture the
information needed to make it more convenient.
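Here is one way the idea might be sketched in Python. Reading a "triple" as (query document, document the user asked for, document the user passed over) is an assumption, as are the class name, the `min_votes` threshold, and the reordering rule; none of this is an existing PubMed feature.

```python
from collections import defaultdict

class TripleLog:
    """Collects (query, preferred, skipped) triples from user choices."""

    def __init__(self):
        self.counts = defaultdict(int)

    def record_click(self, query, ranked, clicked):
        # Every document listed above the clicked one was passed over.
        for skipped in ranked[:ranked.index(clicked)]:
            self.counts[(query, clicked, skipped)] += 1

    def rerank(self, query, ranked, min_votes=3):
        """Move a document above its neighbour once enough users preferred it."""
        result = list(ranked)
        for _ in range(len(result)):  # repeated passes let a document climb
            for i in range(len(result) - 1):
                upper, lower = result[i], result[i + 1]
                net = (self.counts.get((query, lower, upper), 0)
                       - self.counts.get((query, upper, lower), 0))
                if net >= min_votes:
                    result[i], result[i + 1] = lower, upper
        return result

log = TripleLog()
ranking = ["doc1", "doc2", "doc3"]
for _ in range(3):  # three users skip number one and ask for number two
    log.record_click("query_doc", ranking, "doc2")
new_ranking = log.rerank("query_doc", ranking)
```

The threshold is there so that one stray click does not reorder the list; only a consistent pattern across several users does.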
So those
are our options which we haven't really investigated adequately but I
think the more users the better is the real hope for success.