July 28, 2003

Interview

Public archives ensure a bright future for the scientific literature

Can you imagine trying to do your research without GenBank and PubMed? These public databases of sequences and literature citations have become such essential tools for the research scientist that it's hard to imagine life without them. Open Access Now talks to David Lipman, the man behind both of these as well as his latest project – PubMed Central, a digital archive of the full-text biomedical literature that will be a key component of life in the Open Access future.

Integrating literature and databases
PubMed and GenBank are services run by the National Center for Biotechnology Information (NCBI), part of the National Library of Medicine (NLM), based at the US National Institutes of Health campus in Bethesda, Maryland. NCBI’s Director David Lipman was there at the beginning, when NCBI began pioneering database integration. “PubMed came out of two streams that merged together,” recalls Lipman. “One was that the NLM had been producing MEDLINE for many years as an electronic index and bibliographic search service for the biomedical and life sciences literature. MEDLINE contained the citation and abstract, with some keywords and indexing terms. It was being made available online, usually by telephone-based access and then via internet as well, for a small fee to recover certain costs. About ten years ago NCBI started producing GenBank and providing some CD-ROM-based services for DNA and protein sequences.”

“We started to integrate MEDLINE with the sequences, so that if someone published a paper with a DNA sequence in it then we would have the corresponding MEDLINE record integrated with the sequence record,” says Lipman. “That was very popular and we kept increasing the MEDLINE coverage because the people using the service found it very useful. The set started getting bigger and bigger and then we thought about just including all of MEDLINE and making it a free online service.” That’s how PubMed was born.

The NCBI made it available during a test period and then the NLM Director, Dr. Donald Lindberg, decided to make it an ongoing free service. “We showed that some of the costs could be cut by using the internet and that by making it freely available it would be used a lot more,” Lipman says. “Because PubMed was a free service offered by a dependable organization – the NLM has been around since the Civil War – academic and commercial groups felt confident incorporating PubMed into their own information systems.”


PubMed demonstrated that the internet could make information openly available more cheaply and that it would be more widely used


PubMed was an instant success and within a short period of time its usage had shot up 100-fold. “PubMed was very popular and widely used,” says Lipman. “Two thirds of the people who access the system are not academics – they may be healthcare professionals, teachers or students. PubMed also creates a link-point for a lot of other kinds of information. Because it’s there, and it’s free and it’s dependable, a lot of other information providers use it in different and interesting ways – ways that we never would have imagined when we first started.”

PubMed Central: a central archive

A few years ago discussions about Open Access to the literature led to the creation of PubMed Central. “From the beginning, PubMed Central really had two roles,” says Lipman. “One was to try to increase access to the information and the other was electronic archiving. As electronic journals became more important and digital versions of articles, with more information, started to replace print as the definitive versions, the issue of archiving came up. For paper journals, archiving was always done by libraries, so it was natural for the NLM to work with publishers to provide digital archiving through PubMed Central. So we started PubMed Central firstly to see if we could get the publishers to provide open access through our system and secondly to provide digital archiving.”

“While Open Access remains controversial, the digital archiving side of it is not,” Lipman says. The publisher provides the scientific content tagged in an explicit format, and NCBI converts it to a common format and makes it openly available. “If the technology changes, or there are doubts about publishers’ stability, then we have a single format for all this content so that we won’t have to spend outrageous amounts of money to update everything,” explains Lipman. “The NLM has a long history of active work in archiving. I think that one of the most important things that we are doing is dealing with the issue of stable archiving of digital scientific content. To that end, we have put a lot of work into developing standard archiving formats, working together with academics, publishers and university librarians.”

“The current status of PubMed Central is that we have a number of largely society publishers participating – some of them provide their content immediately, and free, while others have up to 12 months delay,” says Lipman. “We keep getting more participants, but it’s a slow process. Usage has been increasing – it’s now several times what it used to be. We have done nice work integrating PubMed Central into our overall Entrez system, so that you can link more directly to our other factual databases and, for example, find all the protein structures in a set of papers, or find all papers associated with a taxonomic group like the metazoans.”


One of the most important things that we are doing is dealing with the issue of stable archiving of digital scientific content


Another important aspect is how PubMed Central intersects with the Open Access movement. “There are going to be new journals starting up and one way to ensure the stability of this information is that there be national archives like PubMed Central, so that even if the journal is short-lived PubMed Central will continue to make it available for many, many years to come,” explains Lipman. “We have already started formal arrangements with groups in other countries to set up their own archives to contain the same content.” NCBI is encouraging others to have their own software systems, as there may be some advantages to providing diversity and having additional automatic checking of content. One of the critical aspects is that as the Open Access movement grows, people are going to want to have national archives with long-standing commitments to preservation. And that’s where PubMed Central comes in.

Bringing papers to life

Lipman predicts that the growth of PubMed Central will bring new approaches to how data and literature are integrated. “There are some obvious things that we can do now with the current PubMed Central system that clearly add value. For example, we can integrate a figure hidden away in a paper with factual databases of protein domains linked to structures and so forth. So that after the paper is published you could look back at the figure but see it actually updated with new information about structure. Then you would really be able to look at the paper and understand and extrapolate further what they were trying to do.”


When we have Open Access, people will be able to create and use databases and the literature in ways that will make both better


Lipman predicts an exciting future for further integration. “When people really view the literature as being owned by the community and as the tools get more sophisticated, then I think we will see people provide links in their papers to the factual databases in such a way that the paper is sort of alive. And setting up new kinds of databases, things like image databases, will be almost trivial.”

“When we have Open Access, then more people can get at it and that’s obviously better because more people can think about the science,” says Lipman. “But when it’s a community property and people are creating databases along with creating the literature, then the databases get better. Sometimes it’s harder to interpret the databases independently, but when you know that this paper is connected to a database you can take advantage of that. And likewise, the papers can have a more dynamic nature because they are linked to databases that are being continuously updated.”

“We have built a lot of this into PubMed Central already. But I am confident that as more people are working with it, it will start to evolve in directions that we could not have predicted right now. People will be able to create and use databases and the literature in ways that will make both better. And creating better databases, for less money, will allow better integration with the literature.”

Lipman sees the scientific paper as evolving over time as new discoveries are made. “Open Access will speed up the evolution of the scientific paper. When I see how much the biological databases have changed and improved over the last two decades, it’s exciting to think about what’s in store for the scientific literature in an Open Access future. I think that the system will become much more powerful as more and more of the literature becomes more open. I think the nature of the scientific paper will change – and we don’t know what it’s going to turn into.”


www.pubmedcentral.gov/

The National Library of Medicine and the National Center for Biotechnology Information run a large number of online services

NLM
The National Library of Medicine (NLM) is the world's largest biomedical library. NLM explores the use of computer and communication technologies to improve the organization and use of biomedical information. NLM creates databases and databanks and aims to educate users about available sources of information for biomedical research.

NCBI
The National Center for Biotechnology Information (NCBI) was established in 1988 as a division of NLM. NCBI creates automated systems for storing and analyzing knowledge about molecular biology, biochemistry, and genetics; facilitates the use of such databases and software by the research community; coordinates efforts to gather biotechnology information; and performs research into advanced methods of computer-based information processing for analyzing the structure and function of biologically important molecules.

MEDLINE
MEDLINE is NLM’s database of indexed journal citations and abstracts covering nearly 4,500 journals published in the US and more than 70 other countries. MEDLINE includes references to articles indexed from 1966 to the present and has been available online for searching since 1971. New citations are added weekly. MEDLINE citations and abstracts are available as the primary component of NLM's PubMed database, which is searchable via the internet.

PubMed
Access to over 12 million citations from MEDLINE and additional journals is provided online by PubMed. PubMed includes links to many sites providing full-text articles and other related resources. In addition to providing access to MEDLINE, PubMed provides access to out-of-scope citations from certain journals, citations that precede the date that a journal was selected for MEDLINE indexing, and some additional life science journals that submit full text to PubMed Central.

PubMed Central
PubMed Central is a digital archive of the life sciences journal literature, developed and managed by NCBI. Through PubMed Central, NCBI aims to preserve and maintain Open Access to the electronic literature (see the WWW? article on page 4 of this issue).

GenBank
The GenBank database of nucleotide sequences includes sequences from over 130,000 organisms. GenBank belongs to an international collaboration of sequence databases that includes EMBL and DDBJ. GenBank is updated daily in NCBI search systems, and a full release is issued on the FTP site about six times a year. GenBank records include information about accession number formats, sequence identifiers, annotated biological features, and links to other relevant information and databases.

Entrez
The Entrez gateway provides integrated access to nucleotide and protein sequence data, three-dimensional protein structures, genomic mapping information, PubMed, MEDLINE, and other databases. Two unique features of Entrez are, first, pre-computed similarity searches for each database record, identifying the related records (‘neighbors’) within that database, and second, links from a record in one database to associated records in the other Entrez databases, providing integrated access across the various databases.
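
The two Entrez features described here, pre-computed neighbors within a database and links between databases, can be pictured with a small in-memory sketch. This is an illustration only and not NCBI's implementation or API: the class name ToyGateway, the record identifiers, the sample text, and the word-overlap similarity measure are all assumptions made up for this example.

```python
class ToyGateway:
    """Hypothetical in-memory stand-in for a multi-database gateway."""

    def __init__(self):
        # (database, record_id) -> set of words describing the record
        self.records = {}
        # (database, record_id) -> set of linked records in other databases
        self.links = {}
        # (database, record_id) -> same-database neighbors, most similar first
        self.neighbors = {}

    def add_record(self, db, record_id, text):
        self.records[(db, record_id)] = set(text.lower().split())

    def add_link(self, a, b):
        # Cross-database links are stored in both directions.
        self.links.setdefault(a, set()).add(b)
        self.links.setdefault(b, set()).add(a)

    def precompute_neighbors(self, top_n=2):
        # Similarity is Jaccard word overlap, an assumed stand-in for
        # whatever measure a real system would use.
        for key, words in self.records.items():
            scored = []
            for other, other_words in self.records.items():
                if other == key or other[0] != key[0]:
                    continue  # neighbors stay within the same database
                union = words | other_words
                score = len(words & other_words) / len(union) if union else 0.0
                if score > 0:
                    scored.append((score, other))
            scored.sort(key=lambda pair: (-pair[0], pair[1]))
            self.neighbors[key] = [other for _, other in scored[:top_n]]

    def linked(self, key, target_db):
        # Follow links from one record to records in another database.
        return sorted(k for k in self.links.get(key, ()) if k[0] == target_db)


# Invented sample data: three literature records and one sequence record.
gateway = ToyGateway()
gateway.add_record("literature", "paper-1", "kinase domain structure in yeast")
gateway.add_record("literature", "paper-2", "yeast kinase signalling pathway")
gateway.add_record("literature", "paper-3", "bird migration routes")
gateway.add_record("sequence", "seq-A", "yeast kinase gene")
gateway.add_link(("literature", "paper-1"), ("sequence", "seq-A"))
gateway.precompute_neighbors()

print(gateway.neighbors[("literature", "paper-1")])
print(gateway.linked(("literature", "paper-1"), "sequence"))
```

Because the neighbor lists are computed ahead of time, looking up related records at query time is a plain dictionary access, which is the point of pre-computing them. Storing each link in both directions lets a reader move from a paper to its sequence and back again.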

OMIM
The Online Mendelian Inheritance in Man (OMIM) database is a catalog of human genes and genetic disorders developed by NCBI. The database includes textual information and references. It also has links to MEDLINE and sequence records in the Entrez system, and links to additional related resources. OMIM is intended for use by physicians, researchers, and students concerned with genetic disorders.

All these services can be accessed free of charge at the NCBI website
http://www.ncbi.nlm.nih.gov

Open Access Now is published by BioMed Central.
Editor: Jonathan B Weitzman.