Freedom of Information Conference 2000
Paul Ginsparg
Los Alamos National Laboratory
Creating a global knowledge network
Don't just clone the paper methodology
How should
our scientific research communications infrastructure be reconfigured
to take maximal advantage of newly evolving electronic resources? Rather
than "electronic publishing," which connotes a rather straightforward
cloning of the paper methodology to the electronic network, many researchers
would prefer to see the new technology lead to some form of global "knowledge
network," and sooner rather than later.
Some of the
possibilities offered by a unified global archive are suggested by the
Los Alamos e-print archives (where "e-print" denotes self archiving
by the author), which since their inception in 1991 have become a major
forum for dissemination of results in physics and mathematics. These e-print
archives are entirely scientist driven, and are flexible enough either
to co-exist with the pre-existing publication system, or to help it evolve
into something better able to meet researcher needs. The archives are
an example of a service created by a group of specialists for their own
use. It is also important to note that the rapid dissemination they provide
is not in the least inconsistent with concurrent or subsequent peer review,
and in the long run offers a possible framework for a more functional
archival structuring of the literature than is provided by current peer
review processes.
The electronic medium can do it cheaper and better
As argued
by Odlyzko,[1] the current methodology of research dissemination and validation
is premised on a paper medium that was difficult to produce, difficult
to distribute, difficult to archive, and difficult to duplicate -- a medium
that hence required numerous local redistribution points in the form of
research libraries. The electronic medium is opposite in each of the above
regards, and, hence, if we were to start from scratch today to design
a quality controlled distribution system for research findings, it would
likely take a very different form both from the current system and from
the electronic clone it would spawn without more constructive input from
the research community.
The need
to reconsider the above methodology is reinforced by noting that each
article typically costs many tens of thousands of dollars to produce in
salaries, and much more in equipment and overhead. A key point of the
electronic communication medium is that, for a minuscule additional fraction
of this amount, it is possible to archive the article and make it freely
available to the entire world in perpetuity. Moreover, this is consistent
with public policy goals for what is in large part publicly funded research.[3]
The nine-year lesson so far from the Los Alamos archives is that this
additional cost, including the cost of the global mirror network, can
be as little as a dollar per article, and there is no indication that
maintenance of the archival portion of the database will require an increasing
fraction of the time, cost, or effort.
Odlyzko has
also pointed out that average aggregate publisher revenues are roughly
$4000 per article, and that since acquisition costs are typically one
third of library budgets, the current system expends an additional $8000
per article in other library costs.[1] [2] Of course, some of the publisher
revenues are necessary to organize peer review, although the latter depends
on the donated time and energy of the research community and is subsidized
by the same grant funds and institutions that sponsor the research in
the first place. The question crystallized by the new communications medium
is whether this arrangement is the most efficient way to organize the
review and certification functions, or if the dissemination and authentication
systems can be naturally disentangled to create a more forward looking
research communications infrastructure.
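The arithmetic behind these figures can be made explicit. What follows is a minimal sketch in Python that uses only the round numbers quoted above: $4000 in publisher revenue per article, acquisitions at one third of library budgets, and roughly a dollar per article for the archive. It assumes, as a simplification, that publisher revenue per article equals library acquisition spending per article. The variable names are illustrative, and the results are only as precise as the estimates themselves.

```python
from fractions import Fraction

# Round figures quoted in the text (estimates, not measurements).
publisher_revenue_per_article = 4000          # dollars
acquisition_share_of_budget = Fraction(1, 3)  # acquisitions as a share of library budgets
archive_cost_per_article = 1                  # dollars, low-end figure for the e-print archives

# Simplifying assumption: publisher revenue per article equals what
# libraries spend on acquisitions per article. If that spend is one third
# of the budget, the total library cost is three times as large.
total_library_cost = publisher_revenue_per_article / acquisition_share_of_budget
other_library_costs = total_library_cost - publisher_revenue_per_article

# How many times larger publisher revenue is than the archive's cost.
revenue_to_archive_ratio = publisher_revenue_per_article / archive_cost_per_article

print(f"Total library cost per article: ${int(total_library_cost)}")
print(f"Other library costs per article: ${int(other_library_costs)}")
print(f"Publisher revenue / archive cost: {revenue_to_archive_ratio:.0f}x")
```

Exact fractions are used so that the one-third share does not introduce rounding error into the totals.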
A new model for research communications
The figure (figure 1) illustrates one such possible hierarchical
structuring of our research communications infrastructure. It also represents
graphically the key possibility in the new electronic architecture: that
of disentangling and decoupling the production and dissemination on the
one hand, from the quality control and validation on the other, in a way
that is not possible in the paper realm. The figure shows three electronic
service layers, as viewed by the interested reader/researcher, who can
choose the most auspicious access method for navigating the electronic
literature. The three layers are the data, information, and knowledge
networks--where information is taken to mean data plus metadata (i.e.
descriptive data), and knowledge signifies information plus synthesis
(i.e. additional synthesizing information).
The knowledge
layer includes third parties that can overlay the information and data
levels with synthesizing information, and can partition the information
into sectors according to subject area, overall importance, quality of
research, degree of pedagogy, interdisciplinarity, or other useful criteria.
They can also maintain other useful retrospective resources, such as suggesting
a minimal path through the literature to understand a given article, and
suggesting pointers to outstanding lines of research later spawned by
it.
The three
layers depicted are multiply interconnected. The information layer can
harvest and index metadata from the data layer to generate an aggregation
which can in turn span more than one particular archive or discipline.
The knowledge layer points to useful resources in the information layer.
The synthesizing information in the knowledge layer is the glue that assembles
the building blocks from the lower layers into a knowledge structure more
accessible to both experts and non-experts.
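The harvesting step can be illustrated with a toy example. The sketch below is a hedged, in-memory illustration in Python, not an implementation of any real harvesting protocol: the archive names, identifiers, titles, and the Record and harvest names are all hypothetical. It shows only the idea that an information-layer service can collect metadata from several data-layer archives and build an index that spans them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    """Metadata describing one article held at the data layer."""
    archive: str      # which data-layer archive holds the full text
    identifier: str   # archive-local identifier (made up for this sketch)
    title: str
    subject: str

def harvest(archives):
    """Gather records from every archive into one index keyed by subject."""
    index = {}
    for records in archives.values():
        for rec in records:
            index.setdefault(rec.subject, []).append(rec)
    return index

# Two hypothetical data-layer archives with placeholder content.
data_layer = {
    "archive-A": [
        Record("archive-A", "a-001", "Example paper one", "physics"),
        Record("archive-A", "a-002", "Example paper two", "mathematics"),
    ],
    "archive-B": [
        Record("archive-B", "b-001", "Example paper three", "physics"),
    ],
}

subject_index = harvest(data_layer)

# The aggregated "physics" entry spans more than one archive.
physics_sources = sorted({rec.archive for rec in subject_index["physics"]})
print(physics_sources)
```

The index stores only metadata and identifiers; the full text stays in the archive that holds it, which is what keeps the layers decoupled.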
The role
of journals in this new hierarchy is to serve as pointers to selected
entries at the data level. This is identical to the current primary role
of journals: to select and certify specific subsets of the literature
for the benefit of the reader. A heterodox point that arises in this model
is that a given article at the data level can be pointed to by multiple
such virtual journals, insofar as they are trying to provide a useful
guide to the reader. Such multiple appearance would no longer waste space
on library shelves, nor be viewed as dishonest. This could tend to reduce
the overall article flux and any tendency on the part of authors towards
creating "least publishable units." The author of the future
could thereby be promoted on the basis of quality rather than quantity:
instead of 25 articles on a given subject, the author can point to a single
critical article that "appears" in 25 different journals.
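The idea of journals as pointers can be sketched as a small data structure. The following Python fragment is an illustrative assumption rather than a description of any existing system: the journal names and article identifiers are invented. Each virtual journal is just a set of identifiers referring to articles stored once at the data level, so one article can "appear" in several journals without being duplicated.

```python
from collections import defaultdict

# Data level: each article is stored once, under a hypothetical identifier.
articles = {
    "eprint-1": "A single critical article",
    "eprint-2": "Another article",
}

# Virtual journals: each one is only a set of pointers into the data level.
virtual_journals = {
    "Journal X": {"eprint-1"},
    "Journal Y": {"eprint-1", "eprint-2"},
    "Journal Z": {"eprint-1"},
}

def appearances(journals):
    """Map each article identifier to the journals that point to it."""
    result = defaultdict(list)
    for name, pointers in sorted(journals.items()):
        for article_id in sorted(pointers):
            result[article_id].append(name)
    return dict(result)

cited_in = appearances(virtual_journals)
print(cited_in["eprint-1"])
```

Because a journal holds identifiers rather than copies, adding an article to another journal costs one more pointer and leaves the stored article untouched.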
The reader
can choose how best to proceed for any given application: either trolling
for gems directly from the data level (as many graduate students are occasionally
wont to do, hoping to find a key insight missed by the mainstream), or
instead beginning the quest at the information or knowledge levels, in
order to benefit from some form of prefiltering or organization. The reader
most in need of a structured guide would turn directly to the highest
level of "value added" knowledge in the "knowledge network."
This is where
capitalism should return to the fore: researchers can and should be willing
to pay a fair market value for services provided at the information or
knowledge levels that facilitate and enhance the research experience.
However, for reasons detailed above, we expect that access at the raw
data level can be provided without charge to readers. In the future this
raw access can be further assisted not only by full text search engines
but also by automatically generated reference and citation linking. The
experience from the physics e-print archives is that this raw access is
extremely useful to researchers, and the small admixture of noise from
a non-peer reviewed sector has not constituted a major problem. Research
in science has certain well defined checks and balances, and is ordinarily
pursued by certain well defined communities.
Change will come through experiment and evolutionary forces
Ultimately,
issues regarding the correct configuration of electronic research infrastructure
will be decided experimentally, and it will be edifying to watch the evolving
roles of the current participants. Some remain very attached to the status
quo, as evidenced by responses to successive forms of the PubMedCentral
proposal from professional societies and other agencies, ostensibly acting
on behalf of researchers but sometimes disappointingly unable to recognize
or consider potential benefits to them. It is also useful to bear in mind
that much of the entrenched current methodology is a post-World
War II construct, including both the large scale entry of commercial publishers
and the widespread use of peer review for mass implementation of quality
control (neither necessary to, nor a guarantee of, good science). Ironically,
the new technology may allow the traditional players from a century ago,
namely the professional societies and institutional libraries, to return
to their dominant role in support of the research enterprise.
The original
objective of the Los Alamos archives was to provide functionality that
was not otherwise available, and to provide a level playing field for
researchers at different academic levels and different geographic locations
-- the dramatic reduction in cost of dissemination came as an unexpected
bonus. As Andy Grove of Intel has pointed out,[4] when a critical business
element is changed by a factor of 10, it is necessary to rethink the entire
enterprise. The Los Alamos e-print archives suggest that dissemination
costs can be lowered by more than two orders of magnitude, not just one.
In the next
10 to 20 years, it is likely that many research communities will move
to some form of global unified archive system, without the current partitioning
and access restrictions familiar from the paper medium, for the simple
reason that it is the best way to communicate knowledge and hence to create
new knowledge.
Figure
Data level:
the figure shows a small number of potentially representative providers,
including the Los Alamos e-print arXiv (and implicitly its international
mirror network), a university library system such as the California Digital
Library (CDL), and a typical foreign funding agency, such as the French
Centre National de la Recherche Scientifique (CNRS). These are intended
to convey the likely importance of library and international components.
Note that there already exist cooperative agreements with each of these
to coordinate via the "open archives" protocols (http://www.openarchives.org/)
to facilitate aggregate distributed collections.
Information
level: the figure shows a generic public search engine (Google), a generic
commercial indexer (Institute for Scientific Information, ISI), and a
generic government resource (the PubScience initiative), suggesting a
mixture of free, commercial, and publicly funded resources at this level.
For the biomedical audience at hand, I might have included services like
Chemical Abstracts and PubMed at this level. A service such as GenBank
is a hybrid in this setting, with components at both the data and information
layers. The proposed role of PubMedCentral would be to fill the electronic
gaps in the data layer highlighted by the more complete PubMed metadata.
Knowledge
level: the figure shows a tiny set of existing physics publishers: American
Physical Society (APS), Journal of High Energy Physics (JHEP), and Advances
in Theoretical and Mathematical Physics (ATMP); the second is based in Italy
and the third already uses the arXiv entirely for its electronic dissemination.
It also shows BioMed Central (BMC). These are the third parties that can
overlay additional synthesizing information on top of the information
and data levels; partition the information into sectors according to subject
area, overall importance, quality of research, degree of pedagogy, interdisciplinarity,
or other useful criteria; and maintain other useful retrospective resources, such
as suggesting a minimal path through the literature to understand a given
article, and suggesting pointers to outstanding lines of research later
spawned by it. The synthesizing information in the knowledge layer is
the glue that assembles the building blocks from the lower layers into
a knowledge structure more accessible to both experts and non-experts.
The three
layers depicted are multiply interconnected. The green arrows indicate
that the information layer can harvest and index metadata from the data
layer to generate an aggregation which can in turn span more than one
particular archive or discipline. The red arrows suggest that the knowledge
layer points to useful resources in the information layer. The blue arrows
-- critical here -- represent how journals of the future can exist in
an "overlay" form, i.e. as a set of pointers to selected entries
at the data level. The black arrows suggest how the reader might best
proceed for any given application: either trolling for gems directly from
the data level (as many graduate students are occasionally wont to do,
hoping to find a key insight missed by the mainstream), or instead beginning
the quest at the information or knowledge levels, in order to benefit
from some form of pre-filtering or organization.
Paul Ginsparg
Los Alamos National Laboratory
New Mexico USA
References
1. Odlyzko A. Tragic loss or good riddance? The impending demise of traditional
scholarly journals. Intern J Human-Computer Studies 1995;42:71-122. Also
available in the electronic J Univ Comp Sci pilot issue, 1994.
2. Odlyzko A. Competition and cooperation: libraries and publishers in the
transition to electronic scholarly journals. Journal of Electronic Publishing
1999; http://www.press.umich.edu/jep. Also in J Scholarly Publishing 1999;30:163-85.
Articles also available at http://www.research.att.com/~amo/doc/eworld.html
3. Bachrach S et al. Who Should 'Own' Scientific Papers? Science 1998;281:1459-60
(http://www.sciencemag.org/cgi/content/full/281/5382/1459)
4. Grove A. Only the Paranoid Survive: How to Exploit the Crisis Points That
Challenge Every Company and Career. Bantam Doubleday Dell, 1996.