
John Wilbur
National Center for Biotechnology Information


BMC Freedom of Information Conference 2000

Dr John Wilbur
National Center for Biotechnology Information

Prospects for improved access to large document sets

How could you access data? The problem is getting bigger and bigger as databases get larger and larger. That's unavoidable. I'd like to give you a picture of what we are like today and perhaps what we may be like tomorrow.

You'd like retrieval of information to be intelligent; it's always nicer to deal with a system that is intelligent. That makes you think you'd like to use a computer so you've got to talk about artificial intelligence. Most of our experiences haven't been too great with artificial intelligence, but there is one outstanding area where it has been a great success and that's what's called "expert systems". You've probably heard of the programme developed at Stanford University for prescribing antibiotics to meningitis patients at a fully expert level, just as well as any human could do it. There are thousands of these systems in industry and one of the real problems is that they are quite limited. You really have to restrict yourself to a small area and you have to coax out of the expert how it is he's doing what he's doing and encode that into a programme.

Trying to define intelligence

If you take a bunch of expert systems and put them all together you just get chaos. The expert systems are built on rule sets or on propositions and there are logical calculations going on. As you increase the size of the rule sets, as you increase the number of propositions or statements of truth, what happens is your system becomes overwhelmed by so many possible lines of reasoning that it just loses focus and becomes impotent. So one then has to try to re-evaluate what we mean by intelligence.

A researcher in artificial intelligence gave a very provocative definition for intelligence: "Perfect intelligence is the optimal use of the information you have to achieve the goals you have". Now that's a very profound statement and it actually makes the possibility of measuring intelligence seem like a reasonable thing. If you know what perfect intelligence is you can assign numerical measures. But there's a little hitch here: it's only a workable definition if you know what the goal is. You can talk about perfection in reaching a goal and perhaps you can actually design a system that's perfect in that sense. But if you talk about general intelligence this sort of thing begins to break down because we don't know what that is. If you don't know what the goal is you're handicapped when designing a programme.

You might say we do know what intelligence is. We know each other are intelligent so we think we know what intelligence is, but people argue about what intelligence is and they argue over tests for intelligence. We don't know what general intelligence is, but we at least know it doesn't seem to be algorithmic. It's not just some fancy algorithm that somebody put together one night when they were drinking a lot of coffee; it doesn't work that way. At Stanford they are quite persuasive in arguing that if you're going to have an intelligent system you've got to have one that has a lot of knowledge in it. It can't be intelligent like you and I are if it doesn't know a lot of facts. You can't read a newspaper and make sense of it if you don't know the background, so one could argue that you have to have a lot of information and knowledge built into the system if you're going to make it intelligent.

Knowledge bases and document retrieval

Then there's the real issue: would that be enough? If I carry that reasoning further I could say you don't have to be too sophisticated; if you've got enough information built into the system it'll just be intelligent. Well, they haven't been able to prove that point but it does lead to something called the "knowledge acquisition bottleneck" - you've got to be intelligent to learn and if you don't have knowledge you can't learn so you can't get the knowledge. So you have this bottleneck, which is a real problem.

So science has moved somewhat from artificial intelligence and is concentrating on what's called "machine learning". It's an active field and it is having some success. People are working on knowledge bases - places to store all that knowledge that machines are going to learn when they learn how to learn. But that's getting a little less interesting because people aren't learning how to learn fast enough to warrant much in the way of knowledge bases. There are a lot of expectations here, there always have been, but one of the expectations is that somehow machine learning and knowledge bases will eventually lead us to improved document retrieval.

Where we are right now - Medline and PubMed

Now I want to talk to you next about where we are right now in regard to document retrieval. The current state of the art, and it's not intelligent you might say, is Medline. It's statistical document retrieval and not really that intelligent because you just throw all the words in a bag, shake it, and then use the result. You take a document with a title, an abstract, authors, and suchlike, and use the index MeSH terms that are assigned by the library. Some of those have stars which say they're really important and sometimes we have a hard time knowing if they're really that important. And then there are qualifiers on those that sometimes focus a MeSH term more specifically on a subject. Then we have titles which are natural language objects, just strings of text, and we break out the terms, the particular words we think are useful. The same goes for abstracts. Those are the basic pieces of information or the features on which we base our efforts to retrieve information.

Then you talk about a particular word, term, or phrase. There are two questions you could ask. Firstly, is it useful somewhere in some subject area? And then, in this particular document, is this a useful term, does it somehow give me information about the subject of this particular document? We don't make categorical decisions about those things but we try to assign a numerical weight as to how important the term is. Is it generally useful? We base that on the frequency of the term across the documents. If the term is in all the documents it's of no use because it won't help us distinguish documents.

And there's the situation in a particular document and that really is based on the frequency of the term in the document. If the term appears only once then maybe it's not that important but if it's there four or six times then it's probably an important term and so we do another calculation to give a weight to that term.
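The two kinds of weight can be sketched in a few lines. This is only an illustration: the talk gives no formulas, so the logarithmic forms below (an inverse-document-frequency global weight and a damped term-count local weight) are assumptions, and the function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def global_weights(docs):
    """Weight each term by how rare it is across the collection.

    Assumed form: log(N / df). A term found in every document gets 0.
    """
    n = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n / df) for term, df in doc_freq.items()}


def local_weights(tokens):
    """Weight each term by how often it occurs in one document.

    Assumed form: 1 + log(tf), so repeated terms count for more,
    but not in direct proportion to their count.
    """
    term_freq = Counter(tokens)
    return {term: 1.0 + math.log(tf) for term, tf in term_freq.items()}


# Toy collection, invented for illustration.
docs = [
    ["gene", "expression", "in", "yeast"],
    ["gene", "gene", "gene", "gene", "therapy", "in", "mice"],
    ["protein", "folding", "in", "yeast"],
]
gw = global_weights(docs)
lw = local_weights(docs[1])
```

Here "in" appears in all three documents, so its global weight is zero, while "gene" appears four times in the second document and so gets a larger local weight there than "therapy".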

Then what do we do with the weight? In PubMed when we ask for related documents we simply ask how many terms do they have in common? For every one of those terms they have in common we multiply the local weight in one document, the local weight in the other document, and the global weight, and we add those products up, and that's a measure of how similar those two documents are. That seems pretty straightforward. But it has one little hitch and that is that if I have a big document that takes up a lot of space it's likely to overlap with the first document in a bigger way so it gets a bigger score, even though it doesn't really mean anything. So we do a correction. We say let's take that raw score, divide it by some figure for the length of the first document and some figure for the length of the second document and compute a normalised score that sort of takes into account that documents aren't all the same length. This gives them all the same potential to do well. So that is where we are in terms of document retrieval.
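A minimal sketch of that scoring, under stated assumptions: the raw score sums local weight times local weight times global weight over shared terms, as described. The talk does not say what the length figure is, so the cosine-style length used here is an assumption, and the function names and weights are hypothetical, not PubMed's.

```python
import math


def raw_score(local_a, local_b, global_w):
    """Sum local_a * local_b * global over the terms both documents share."""
    common = local_a.keys() & local_b.keys()
    return sum(local_a[t] * local_b[t] * global_w.get(t, 0.0) for t in common)


def length_figure(local, global_w):
    """Assumed length figure: square root of the document's score with itself."""
    return math.sqrt(sum(w * w * global_w.get(t, 0.0) for t, w in local.items()))


def normalised_score(local_a, local_b, global_w):
    """Raw score divided by the length figures of both documents."""
    denom = length_figure(local_a, global_w) * length_figure(local_b, global_w)
    return raw_score(local_a, local_b, global_w) / denom if denom else 0.0


# Invented weights: a short document and a much longer one that contains it.
global_w = {"gene": 1.0, "yeast": 2.0, "mice": 2.0, "therapy": 2.0}
short_doc = {"gene": 1.0, "yeast": 1.0}
long_doc = {"gene": 3.0, "yeast": 3.0, "mice": 3.0, "therapy": 3.0}

raw_self = raw_score(short_doc, short_doc, global_w)
raw_long = raw_score(short_doc, long_doc, global_w)
norm_self = normalised_score(short_doc, short_doc, global_w)
norm_long = normalised_score(short_doc, long_doc, global_w)
```

On raw score the long document beats the short document's match with itself, which is the hitch described above. After normalisation the self-match scores 1 and the long document scores lower.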

Comparing artificial and human intelligence

Now let's ask ourselves, what if we had intelligent retrieval? How would it compare with what we have right now, statistical retrieval? How much improvement would we have if we really could get artificial intelligence to work? Well, you can actually do head-to-head competition here. You set a query for a computer and a human. You then judge how useful in answering the query are the documents retrieved by the human and those retrieved by the computer. Way back in 1968 this was done and they discovered the computer was not quite as good as the human. The computer did about 75% of what the human could do and they suggested that that was as much progress as they could hope for in the future, because even if they got as good as a human they were not going to do any better.

We've actually redone their experiment because we have developed two test sets of documents in Medline. The first test set was developed on 72,000 documents. We randomly picked a hundred documents and said that we'll treat these as the queries and for each of those we'll pull up the top fifty by computer and ask how related are they. We then had a human go through and judge how related those fifty were to that query document. Fifty documents to each query, and there are a hundred queries, that's 5000. It's a rather mind-deadening task, and I know because I had to do it along with a few other people! So five of us did that and it turned out the computer did as well as anybody and actually better than most of us.

We got a larger and larger database because 72,000 is still a small database by today's standards. We had 1.2 million documents. Now we repeated the experiment so again we chose 100 documents randomly and for each we picked out the fifty that looked the closest by computer. Then we hired seven molecular biologists with PhDs or equivalent to go through and judge all of those 5000 pairs of documents. It took them about a hundred hours each but they got through. The computer did, on average, about 80% of what the humans could do, although I will say that the computer did as well as one of those people at least.

Understanding why people are so awful

Now why is it that people are so awful? Well, people worry about lengths of queries; people who query net search engines type in one or two words and expect to get a reasonable answer from that. But that's not the issue here because our documents have several hundred words in them so it's not a short document problem, it's an ambiguity of language problem. I mean, language is ambiguous and people's judgements are extremely variable so there's plenty of reason to believe that this is the reason why people don't do any better than this. Well, is it a hopeless situation? Is it really impossible for people to retrieve documents better than this or is there some hope? Well there is some hope, I'd hate to end on a pessimistic note.

We ranked the top twenty documents and looked at what percentage of the documents chosen by the computer and the people included that twenty. Before we look at the figures I must explain that 42% is just random, it's what you get when you just shuffle the papers and take out twenty. So, here are the results: the computer got about 53% and the individual person was somewhere around 54%, 55%. Now why do I say there's hope? Because we discovered that if you pool the judgements of the judges together you could predict better and better what the unknown guy - the guy outside the pool - wanted to see. You predict his judgements better and better if you pool a committee, so there is wisdom in numbers. We found you really could predict much better if you took several people - and it didn't take a lot of people to make a difference here. You could eventually predict almost twice as well as the computer, so there's a lot of room for improvement and it's actually doable.
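The pooling idea can be sketched as follows. This is a toy illustration, not the procedure or the data from the experiment: the judge names, the 0 to 2 relevance scores, the averaging rule and the agreement measure are all assumptions.

```python
from statistics import mean


def pooled_top_k(judgements, pool, k):
    """Rank documents by the average score of the judges in the pool."""
    n_docs = len(next(iter(judgements.values())))
    avg = [mean(judgements[j][i] for j in pool) for i in range(n_docs)]
    # Highest average first; ties are broken by document index.
    order = sorted(range(n_docs), key=lambda i: (-avg[i], i))
    return order[:k]


def agreement(top_docs, outsider_scores, threshold=1):
    """Fraction of the chosen documents the outside judge also finds relevant."""
    hits = sum(1 for i in top_docs if outsider_scores[i] >= threshold)
    return hits / len(top_docs)


# Invented relevance scores (0 = unrelated, 2 = closely related) for six documents.
judgements = {
    "A": [2, 2, 0, 0, 1, 0],
    "B": [2, 0, 2, 0, 0, 1],
    "C": [1, 1, 1, 1, 0, 0],
    "D": [2, 1, 1, 0, 0, 0],  # the judge held outside the pool
}

single = pooled_top_k(judgements, ["A"], k=3)
committee = pooled_top_k(judgements, ["A", "B", "C"], k=3)
single_agreement = agreement(single, judgements["D"])
committee_agreement = agreement(committee, judgements["D"])
```

In this made-up example judge A alone picks one document that D considers unrelated, while the committee of A, B and C picks only documents D considers relevant. Each judge's idiosyncrasies tend to average out in the pool.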

Is common sense a factor?

But then there's the question, "well, if it's doable, how could we do it?" Well we see we get a super-human performance by a committee; it clearly outclasses the individual. The natural question then is, "what is that based on?" Now if you've read any artificial intelligence literature you've probably run into the statement that the hardest thing to get a computer to do is use common sense. Common sense tells us most vehicles ride on four wheels, that water runs downhill, that animals get diseases, the kinds of things that everybody knows. But what we see in the experiment, the improvement, is not common sense because everybody should have it. If it were, the individual should have done just as well as the committee. So then perhaps it's educated sense. It must be that people have knowledge of molecular biology and because they know something but they don't know it all, when you get a bunch of them together they know more. I thought that was a reasonable explanation but it didn't turn out to be right.

So why do I say it's not right? We decided to test this theory so we got six educated people who didn't know molecular biology and six molecular biologists and we got them to judge the papers. We noticed that there's almost no difference at all between the trained and the untrained!

So what good is training? Well that's a hard issue but all I could come to is that the way you decide if two documents are related is some kind of generic pattern-recognition ability that you learn when you're growing up and yes, it's based on experience. But it doesn't have anything to do with college or beyond. So, we know we can capture this information if you just lock up enough people, make them make these judgements, then free them. And we know we can improve retrieval this way. So we could actually take the information generated by these people and put it into a computer programme and get the gist of what's going on. So we know we can get this information and actually use it but the next question is how are we going to get this stuff? And that's the problem. This is where open systems have a real contribution to make.

A model for the future

The best source of population data is the system that is the most open to the most people because you've got the users there and if you can capture what they're doing then hopefully you can use that to improve retrieval. A possible way of doing this that I have designed is what I call informative triples which are an extremely simple-minded idea. If you're looking at a document and ask for related documents the computer comes back and offers a bunch. It effectively says here are a load of options but I don't know which one is really right. Then your response is to look at the list given to you by the computer and you ask for, say, number two. That way you've given the computer some information because you had to look beyond number one to ask for number two. If the computer observes enough people it'll begin to get the idea that maybe number one isn't such a good document. To answer the query maybe I really need to put number two where number one is. So you can see how the computer might use this to capture the information needed to make it more convenient.
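One way to picture this is the sketch below. It rests on assumptions: the talk does not spell out what goes into a triple, so here a triple is taken to be (query document, document looked past, document chosen), and the vote threshold and swap rule are hypothetical.

```python
from collections import Counter


def triples_from_click(query, ranked, clicked):
    """Every document listed above the clicked one was looked past."""
    position = ranked.index(clicked)
    return [(query, skipped, clicked) for skipped in ranked[:position]]


def rerank(query, ranked, triples, min_votes=3):
    """Swap neighbours when enough users preferred the lower one."""
    votes = Counter(triples)
    result = list(ranked)
    for _ in range(len(result)):  # bounded number of passes
        for i in range(len(result) - 1):
            upper, lower = result[i], result[i + 1]
            net = votes[(query, upper, lower)] - votes[(query, lower, upper)]
            if net >= min_votes:
                result[i], result[i + 1] = lower, upper
    return result


# Invented example: three users look past d1 and ask for d2.
ranked = ["d1", "d2", "d3"]
observed = []
for _ in range(3):
    observed.extend(triples_from_click("q", ranked, "d2"))

new_order = rerank("q", ranked, observed)
```

With three users choosing the second document over the first, the sketch moves it to the top. With fewer votes than the threshold the original order is kept, so a single unusual user does not reshuffle the list.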

So those are our options which we haven't really investigated adequately but I think the more users the better is the real hope for success.




© 1999-2008 BioMed Central Ltd unless otherwise stated