Freedom of Information Conference 2000
David Lipman
National Center for Biotechnology Information
PubMed Central: still on course to revolutionise biomedical publishing
A universal electronic template will create a reliable and flexible interface
for journals
The concept of making the results of primary research freely available to
anyone with an internet connection caused a great stir in the media and
biomedical science community when proposed last year by the National
Institutes of Health (NIH). After some revision to the original proposal,
PubMed Central was launched early this year. So where is the promised
publishing revolution? As this article explains, addressing the technical
challenges presented by such an ambitious project has kept us busy behind
the scenes, but we are now moving ahead to make PubMed Central a reality.
What is PubMed Central?
The aim of PubMed Central is to deliver primary literature research findings
to the scientific community free of charge, without registration, advertisements,
or other barriers. Participation by publishers in PubMed Central is voluntary,
but participating publishers must meet the minimum standard of having
at least three members on their editorial boards who are currently principal
investigators on research grants from major funding agencies. Copyright
of material remains with the publisher or the author of an article, and
not with PubMed Central. There is currently no provision for non-peer
reviewed literature on the PubMed Central site.
PubMed Central is still a fledgling system. If you visit the PubMed Central
website, you will see a modest number of articles available from a handful
of journals. We would be delighted if more content were already there.
However, we underestimated the technical issues involved in
displaying content from different sources. Resolving these issues to the
satisfaction of all concerned has proved to be a non-trivial task.
Making it happen
The technical approach being taken by PubMed Central is to display journal
articles in a web browser by conversion 'on the fly' from the source data,
which are tagged in Standard Generalized Markup Language (SGML). Currently,
SGML tagging of articles is usually a by-product of the printing process.
In PubMed Central we require the SGML version of an article to be the
definitive source.
The advantage of this approach - working directly from the SGML - is
twofold. First, SGML is an international standard, which means that the
data are portable and can be used by others. Second, the maximum amount of
information about the actual content of an article is retained. This is
obviously desirable for the working archive that PubMed Central hopes to
become. Future users of the archive will not be dependent on a particular
technology for continued access to its contents.
True, we could store articles as HTML (itself a simple application of
SGML); this would be fine for merely displaying articles, but is
inadequate for defining the structure of an article. For example, in SGML,
an article title is described as such and retains the title tag if the
article is presented in different display styles; in HTML, it is merely
a set of large, bold letters. For this reason, many are looking to XML
- a halfway house between the complex SGML and the overly simplistic HTML -
to use as the standard for text-based information. In order for online
publishing to evolve from a process based around journal articles into a
dynamic and rich set of information, it will be essential to keep the
source document, tagged in SGML or XML, as the archive copy.
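To illustrate the point about structure surviving a change of display style, here is a minimal Python sketch. It is not PubMed Central's actual conversion code: the element names (article, title, para), the sample text, and the render function are all assumptions made for the example, and XML stands in for SGML because Python's standard library parses it directly.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical structurally tagged source; element names are illustrative only.
SOURCE = (
    "<article>"
    "<title>Gene mapping in yeast</title>"
    "<para>We report a new map.</para>"
    "</article>"
)


def render(source: str, title_tag: str) -> str:
    """Convert the tagged source to HTML on the fly.

    The display style of the title is chosen by the caller; the archived
    source still records which text is the title.
    """
    root = ET.fromstring(source)
    title = root.findtext("title", default="")
    parts = [f"<{title_tag}>{escape(title)}</{title_tag}>"]
    for para in root.iter("para"):
        parts.append(f"<p>{escape(para.text or '')}</p>")
    return "\n".join(parts)


# Two display styles generated from the same archive copy.
journal_style = render(SOURCE, "h1")
compact_style = render(SOURCE, "b")
```

Either output is only a set of styled letters; the knowledge that "Gene mapping in yeast" is the title lives in the source, which is why the source rather than the HTML is the sensible archive copy.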
A more streamlined approach
The disadvantage of this 'SGML first' approach is that different publishers
use different sets of rules or templates, known as document type definitions
(DTDs), for tagging SGML. Furthermore, the content, tags, and DTD have
to be in perfect synchrony for a document to be displayed correctly. One
of the teething problems encountered by PubMed Central has been a lack
of this synchrony in many cases. Compounding these problems has been the
need to fine-tune the translation from SGML to HTML for each journal,
to conform to the distinct display styles of different journals. We have
come to question whether this approach makes the best use of our time
and resources.
We are now considering a more streamlined approach, which we expect will
deliver a reliable interface to the articles in PubMed Central, while also
allowing flexibility for the development of the fabric of new articles in
the future. Under this approach we would continue to display SGML or
XML tagged articles on the fly, but would first convert the SGML/XML
supplied by the publisher so that it conforms to a common set of tagging
rules - in other words, a single PubMed Central DTD. This would mean, for
example, that all article titles are called <article title> rather than a
mixture of <article title>, <article name>, <paper title>, <paper heading>,
etc. Such an approach would make it more feasible to develop novel
information retrieval methods, a robust archiving system, and computational
analysis tools.
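The conversion to a common set of tagging rules can be pictured with a short Python sketch. Everything in it is an assumption made for illustration: the hyphenated element names stand in for the tag names quoted above (names with spaces are not valid in XML), the synonym table is invented, and a real conversion to a shared DTD would also have to reconcile document structure, not just rename elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical publisher-specific names for the title element.
TITLE_SYNONYMS = {"article-name", "paper-title", "paper-heading"}
# Hypothetical name used by the single common set of tagging rules.
COMMON_TITLE = "article-title"


def normalise(source: str) -> str:
    """Rename publisher-specific title elements to the common name."""
    root = ET.fromstring(source)
    for element in root.iter():
        if element.tag in TITLE_SYNONYMS:
            element.tag = COMMON_TITLE
    return ET.tostring(root, encoding="unicode")


# The same article as two publishers might supply it.
publisher_a = "<article><paper-title>Gene mapping</paper-title></article>"
publisher_b = "<article><article-name>Gene mapping</article-name></article>"

normalised_a = normalise(publisher_a)
normalised_b = normalise(publisher_b)
```

After this step both inputs carry the same element name for the title, so a single display routine, search index, or analysis tool can be written once against the common tagging rather than once per journal.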
Why we need PubMed Central
Some have questioned the need for a system such as PubMed Central, given
the growing free access to article archives at many journals' own sites,
combined with the reference linking that will be available through CrossRef.
Our response is that PubMed--the citation retrieval system with more than
10 million entries, to which PubMed Central is linked--already provides
a more powerful, fully operational, and completely free linking facility,
which reaches beyond simple bibliographic links to factual databases and
other resources. Even with alternative sources of free literature, an
advantage of PubMed Central is that it will provide access to all articles
in a single place, regardless of where they are published.
The NIH, through the National Library of Medicine (NLM), has a strong
commitment to making PubMed Central a valuable resource for the life
sciences. For the NLM, PubMed Central is an extension of its longstanding
commitment to preserve and maintain open access to the world's
biomedical literature. Clearly it will take some time to resolve some of
the technical issues discussed here, but we believe that the end results
will more than justify the effort, and we encourage other publishers to
join this initiative.
David J Lipman, Director
National Center for Biotechnology Information, National Library of Medicine,
National Institutes of Health, Bethesda, Maryland, USA
Competing interests: The author has responsibility for PubMed Central.