March 15, 2004
COMMENTARY
The many-copy problem and the many-copy solution
Peter Suber is a research
professor at Earlham
College in Richmond,
Indiana, and the Open Access
Project Director at Public
Knowledge in Washington,
D.C. He is one of the leading
advocates of Open Access and
an eloquent commentator on
the trends and promises of
Open Access models. Here,
Suber writes about the
response he anticipates to
what he calls "the many-copy
problem", the multitude of
mirrored articles that Open
Access could generate.
Too many copies?
As soon as we provide open access to
an article, we should expect copies to
proliferate around the world. The
archive or journal where the article
first appeared will make back-ups and
may have mirror sites. The Internet
Archive will make and store copies,
and search engines, like Google, will
put copies in their cache. Readers who
find the article especially important for
their teaching or research might post
copies on their own web sites. Others
will circulate copies as e-mail attachments.
Readers will have offline
copies on their hard drives that have
been produced by their browsers,
locally searchable databases, or other
applications. Many users will make
and keep printouts.
Insofar as this proliferation causes
trouble, let me call it the "many-copy
problem". Insofar as it solves problems,
let me call it the "many-copy
solution". Open access undoubtedly
triggers both.
The many-copy solution
Before we despair, let's start with the
good news. Here's a sketch of the
many-copy solution.
Mirror, mirror on the web, which of the copies should be read?
The proliferation of copies shows that
copying is physically or technically
possible. In the case of open-access
literature, copying is also legally
permissible. When it is both, then
licensing agreements and software to
enforce them haven't locked up the
content and made it uncopyable. The
proliferation of permissible copies
shows that the technical and legal freedom
to make and distribute copies is
intact, which is a key part of the free
exchange of information.
The proliferation of copies is insurance
against disaster. If one copy is
deleted or corrupted, the other copies
will probably survive. This fact was
made a deliberate preservation strategy
by LOCKSS (Lots of Copies Keeps
Stuff Safe), a peer-to-peer (P2P) network
of self-correcting archive mirrors
(see Who, What & Why? in this issue).
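The self-correcting idea can be illustrated with a small sketch. This is not the LOCKSS protocol itself, only a simplified majority vote under stated assumptions: the mirror names, the sample text, and the function names are hypothetical, and ties between versions are left unresolved.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    """Return the SHA-256 hex digest of one copy's bytes."""
    return hashlib.sha256(content).hexdigest()


def repair_by_majority(copies: dict[str, bytes]) -> dict[str, bytes]:
    """Replace every copy that disagrees with the most common version.

    `copies` maps a hypothetical mirror name to the bytes it holds.
    Ties are not resolved here; a real system would need a rule for them.
    """
    votes = Counter(digest(content) for content in copies.values())
    winning_digest, _ = votes.most_common(1)[0]
    # Use any copy that matches the winning digest as the repair source.
    good = next(c for c in copies.values() if digest(c) == winning_digest)
    return {mirror: good for mirror in copies}


mirrors = {
    "mirror-a": b"Open access article, version 1",
    "mirror-b": b"Open access article, version 1",
    "mirror-c": b"Open access art\x00cle, version 1",  # simulated corruption
}
repaired = repair_by_majority(mirrors)
```

In this sketch the damaged copy on "mirror-c" is overwritten by the version that two of the three mirrors agree on. The more independent copies exist, the harder it is for a single corruption to win the vote.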
The proliferation of copies is a hedge
against censorship, not just against
deletion and corruption. When the
Bush administration started pruning
web sites controlled by the US government,
removing valid science that
might help terrorists and valid science
that might support abortion-choice
advocates, it was serenely unaware
that copies of the same files existed
elsewhere on the Internet. It doesn't
matter whether the censor is trying to
save lives or distort science. You can
only remove the copies you control
and the copies you know about. Open
access increases the odds that these
aren't the only copies online, let alone
offline in printouts and on hard drives.
"Having multiple copies helps ensure
that the articles will remain open
access, even if the original journals
die, are bought out, or change
their access policies"
Peter Suber
The proliferation of copies increases
the chances not only that a copy will
survive disaster, uncensored, but also
that open-access copies will survive. This
is one reason why BioMed Central and
the Public Library of Science deposit
copies of all their published articles in
the PubMed Central repository. The
existence of the PubMed Central
copies helps ensure that the articles
will remain open access, even if the
original journals die, are bought out, or
change their access policies.
The proliferation of copies also
increases the likelihood that at least
one copy will be indexed by a popular
search engine. Some online journals
have terrible search engines for their
content. Some archives are not compliant
with Open Archives Initiative
(OAI) standards and cannot benefit
from cross-archive OAI search
engines. But most open-access copies
in the surface (as opposed to 'deep')
web will be crawled by Google and
other major search engines. Some will
be indexed by OAI-specific engines
and other specialized academic search
engines. There is no single index that
represents the gold standard for
content trying to become visible and
discoverable. But every new copy
increases the number of pathways
between readers and copies, and
increases the odds that a random reader
will discover a copy by entering relevant
terms into his or her favourite
search engine, no matter how provincial
or peculiar he, she, or it may be.
Finally, the proliferation of copies
speeds access and thereby supports the
basic function of open access, which is
to accelerate research. If all copies of
an article had to be served from a central
location, with no caching or storage
on local machines, no printing,
and no forwarding, then the literature
might be nearly as difficult to reach
and share as it was in the era of print.
In this sense, open access doesn't one-sidedly
cause the proliferation of
copies; the relationship is reciprocal.
Open access triggers copying by permitting
it, while copying improves
access by multiplying access points
and cutting delays.
"The proliferation of copies speeds
access and thereby supports the
basic function of open access,
which is to accelerate research"
Peter Suber
The many-copy problem
How can the proliferation of copies
cause trouble? One reason that people
might want a controlled, single copy is
to provide a mechanism for measuring
the online traffic. Copies interfere with
the measurement of traffic and usage.
A given archive or journal might
measure usage very well. But with an
unknown number of copies elsewhere
on the Internet, and an unknown percentage
of readers using those other
copies, the local measurements
will be inaccurate to an unknown
degree. We might know that all verified
counts are undercounts, but we
won't know by how much.
If we had perfect indices of the entire
Internet or perfect spybots in every
browser, then the proliferation of
copies would be compatible with perfect
measurement of traffic and usage.
Perfect indices are very desirable, and
perfect spybots very undesirable. But
we're very far from both, and there are
good reasons to think that the desirable
method of achieving this goal will
always be out of reach, even if (big
'if') we continually approach completeness
as an asymptote.
Or, perhaps a perfect index of the
Internet is not even desirable. It would
only solve the measurement problem if
it counted all copies in use. But then it
would have to count even offline copies
on hard drives, threatening the private
exchange and storage of information.
At some point, improving our usage
metrics will violate privacy and protecting
privacy will thwart usage metrics.
What if open access articles carried
code to report back to a scientometric
counting station now and then? This is
technically possible today. Copying
the file would also copy the code with
it, at least in the absence of a fairly
sophisticated hack. But even if the
code only reported anonymized traffic
and usage data, many users would
worry that it would report more,
invade privacy, and compromise
anonymous inquiry. Open-source code
would help allay fears, but would it
help enough? Either way, we're likely
to see closed-source versions of this
code become common.
Note that the proliferation of copies
only hinders metrics that count downloads,
search hits, and other forms of
usage. It does not affect the count of
citations or impact measurements based
on citation counts. Of course, automated
citation counts might fail too, for example
because an article citing my work is
offline or invisible to the counter. But if
so, the fault does not lie with the many-copy
problem.
Tracking dynamic changes
The proliferation of copies harms what
we could call 'dynamic' works, which
are periodically revised or updated.
Even if each update carries a revision
date, and all copies carry the revision
date, a reader will not know whether
there is a more recent copy elsewhere.
When I maintained a list of links to sites
in philosophy, I dated every revision of
the file. But I was frustrated when other
philosophers used copies, rather than
links, to share it with students or colleagues.
They would invariably fail to
keep their copies up-to-date. The result
was that readers who consulted their
copies, rather than my original, would
think that I was slow to update the file
(or slower than I really was).
If the dynamic work is an article or
book, then readers of out-of-date
copies will think the author is guilty of
errors or omissions that have been corrected
in newer versions. If the
dynamic work has legal implications,
like a website privacy policy, then out-of-date
copies will mislead users about
their rights.
While I consent to open access for all
my online writings, I do try to control
the copying of my dynamic works.
When I find out-of-date copies on the
Internet, I ask the host to bring them up-to-date
or take them down. I consent to
mirrors of my dynamic works only
when I am confident that the mirror will
remain in sync with the original.
The proliferation of copies makes it
more difficult to know when the version
of an article you're reading is the
same version as was approved by a
journal's peer-review process. The text
might give no indication. It might say
that it was approved, but it might be an
altered copy of a version that was truly
approved, or a fraudulent copy that
was never approved.
For better or worse, we're refining our
rules of thumb for deciding when to
trust online content. One rule might be,
"If the copy doesn't say where it was
refereed, then assume it was never refereed."
(It might be an honest preprint
or it might be a fraud.) This particular
rule may err too far on the side of scepticism,
for we all know peer-reviewed
papers on author web sites that show
no sign of their approval by a peer-review
process. The question is how
the many-copy problem interferes with
our attempt to make the rules more discriminating
and less crude.
One way to deal with the authentication
problem is for the journal that
conducted peer review to host its own
copy of the approved article. If you
distrust the copy you're reading, then
visit the source and read an authenticated
copy, or run file-comparison
software across the authenticated and
questionable copies.
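As a rough illustration of what running file-comparison software might look like, here is a minimal sketch using Python's standard library. The function names and the sample texts are hypothetical. It assumes both copies are available as plain text, which will often not be true of formatted articles.

```python
import difflib
import hashlib


def same_bytes(a: bytes, b: bytes) -> bool:
    """Quick check: do the two copies have identical SHA-256 digests?"""
    return hashlib.sha256(a).digest() == hashlib.sha256(b).digest()


def changed_lines(authenticated: str, questionable: str) -> list[str]:
    """List the lines that differ between the two copies.

    Lines prefixed "- " appear only in the authenticated copy;
    lines prefixed "+ " appear only in the questionable copy.
    """
    diff = difflib.ndiff(authenticated.splitlines(),
                         questionable.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical sample texts standing in for the two copies.
journal_copy = "Results\nThe effect was small.\n"
found_copy = "Results\nThe effect was large.\n"

identical = same_bytes(journal_copy.encode(), found_copy.encode())
differences = changed_lines(journal_copy, found_copy)
```

A digest comparison only says whether the copies differ at all. The line-by-line comparison shows where, which matters when the difference is a harmless reformatting rather than an altered finding.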
Another approach is encryption, used
in Surfaces, an early peer-reviewed,
open access journal edited by Jean-Claude
Guédon. With encryption,
receiving an authenticated copy of an
article is as easy as receiving an
authenticated signature or credit card
payment. (I still don't understand why
this powerful idea from 1991 has not
been more widely imitated or criticized.
Is it because there's little
urgency to solve the authentication
problem itself?)
Another approach is to let articles
carry their metadata with them.
Metadata fields could indicate not only
authorship and date, but also whether
the article was refereed and where.
Embedded metadata would not be as
secure as encryption, but more convenient
for the reader and less likely
than 'self-reporting code' to aggravate
the suspicions of suspicious users.
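One way such embedded metadata might be read is sketched below, assuming the article is an HTML page that carries its metadata in meta tags. The field names ("author", "revision-date", "refereed", "refereed-by") and the sample page are hypothetical, not an established standard. As noted above, the result is only a claim made by the copy, not proof of review.

```python
from html.parser import HTMLParser


class MetaReader(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.fields: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.fields[name] = content


def read_metadata(html: str) -> dict[str, str]:
    """Return the metadata fields that the copy carries with it."""
    reader = MetaReader()
    reader.feed(html)
    return reader.fields


# Hypothetical page; the field names are assumptions for illustration.
page = """<html><head>
<meta name="author" content="A. Author">
<meta name="revision-date" content="2004-03-15">
<meta name="refereed" content="yes">
<meta name="refereed-by" content="Example Journal">
</head><body>...</body></html>"""

metadata = read_metadata(page)
claims_review = metadata.get("refereed") == "yes"
```

Because the fields travel inside the file, every copy carries the same statement about authorship, date, and refereeing. A reader who doubts the claim can still go back to the journal named in the metadata.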
Editor's Note: Suber's analysis
offers some thought-provoking
ideas about how to view the
generation of multiple copies
because of Open Access. The life
sciences community will have to
decide whether it will focus on
the problems created by the existence
of multiple copies or
whether it will celebrate the benefits
of the "many-copy solution".
This version of Suber's article has
been approved by the author.
The original copy can be found at this page.