
March 15, 2004

COMMENTARY

The many-copy problem and the many-copy solution

Peter Suber is a research professor at Earlham College in Richmond, Indiana, and the Open Access Project Director at Public Knowledge in Washington, D.C. He is one of the leading advocates of Open Access and an eloquent commentator on the trends and promises of Open Access models. Here, Suber writes about the response he anticipates to what he calls "the many-copy problem", the multitude of mirrored articles that Open Access could generate.

Too many copies?
As soon as we provide open access to an article, we should expect copies to proliferate around the world. The archive or journal where the article first appeared will make back-ups and may have mirror sites. The Internet Archive will make and store copies, and search engines, like Google, will put copies in their cache. Readers who find the article especially important for their teaching or research might post copies on their own web sites. Others will circulate copies as e-mail attachments. Readers will have offline copies on their hard drives that have been produced by their browsers, locally searchable databases, or other applications. Many users will make and keep printouts.

Insofar as this proliferation causes trouble, let me call it the "many-copy problem". Insofar as it solves problems, let me call it the "many-copy solution". Open access undoubtedly triggers both.

The many-copy solution
Before we despair, let's start with the good news. Here's a sketch of the many-copy solution.
Mirror, mirror on the web, which of the copies should be read?

The proliferation of copies shows that copying is physically or technically possible. In the case of open-access literature, copying is also legally permissible. When it is both, then licensing agreements and software to enforce them haven't locked up the content and made it uncopyable. The proliferation of permissible copies shows that the technical and legal freedom to make and distribute copies is intact, which is a key part of the free exchange of information.

The proliferation of copies is insurance against disaster. If one copy is deleted or corrupted, the other copies will probably survive. This fact was made a deliberate preservation strategy by LOCKSS (Lots of Copies Keeps Stuff Safe), a peer-to-peer (P2P) network of self-correcting archive mirrors (see Who, What & Why? in this issue).

The proliferation of copies is a hedge against censorship, not just against deletion and corruption. When the Bush administration started pruning web sites controlled by the US government, removing valid science that might help terrorists and valid science that might support abortion-choice advocates, it was serenely unaware that copies of the same files existed elsewhere on the Internet. It doesn't matter whether the censor is trying to save lives or distort science. You can only remove the copies you control and the copies you know about. Open access increases the odds that these aren't the only copies online, let alone offline in printouts and on hard drives.

The proliferation of copies increases the chances not only that a copy will survive disaster, uncensored, but also that open-access copies will survive. This is one reason why BioMed Central and the Public Library of Science deposit copies of all their published articles in the PubMed Central repository. The existence of the PubMed Central copies helps ensure that the articles will remain open access, even if the original journals die, are bought out, or change their access policies.

The proliferation of copies also increases the likelihood that at least one copy will be indexed by a popular search engine. Some online journals have terrible search engines for their content. Some archives are not compliant with Open Archives Initiative (OAI) standards and cannot benefit from cross-archive OAI search engines. But most open-access copies in the surface (as opposed to 'deep') web will be crawled by Google and other major search engines. Some will be indexed by OAI-specific engines and other specialized academic search engines. There is no single index that represents the gold standard for content trying to become visible and discoverable. But every new copy increases the number of pathways between readers and copies, and increases the odds that a random reader will discover a copy by entering relevant terms into his or her favourite search engine, no matter how provincial or peculiar he, she, or it may be.

Finally, the proliferation of copies speeds access and thereby supports the basic function of open access, which is to accelerate research. If all copies of an article had to be served from a central location, with no caching or storage on local machines, no printing, and no forwarding, then the literature might be nearly as difficult to reach and share as it was in the era of print. In this sense, open access doesn't one-sidedly cause the proliferation of copies; the relationship is reciprocal. Open access triggers copying by permitting it, while copying improves access by multiplying access points and cutting delays.

The many-copy problem
How can the proliferation of copies cause trouble? One reason that people might want a controlled, single copy is to provide a mechanism for measuring online traffic. Copies interfere with the measurement of traffic and usage. A given archive or journal might measure its own usage very well. But with an unknown number of copies elsewhere on the Internet, and an unknown percentage of readers using those other copies, the local measurements will be inaccurate to an unknown degree. We might know that all verified counts are undercounts, but we won't know by how much.

If we had perfect indices of the entire Internet or perfect spybots in every browser, then the proliferation of copies would be compatible with perfect measurement of traffic and usage. Perfect indices are very desirable, and perfect spybots very undesirable. But we're very far from both, and there are good reasons to think that the desirable method of achieving this goal will always be out of reach, even if (big 'if') we continually approach completeness as an asymptote.

Or, perhaps a perfect index of the Internet is not even desirable. It would only solve the measurement problem if it counted all copies in use. But then it would have to count even offline copies on hard drives, threatening the private exchange and storage of information. At some point, improving our usage metrics will violate privacy and protecting privacy will thwart usage metrics.

What if open access articles carried code to report back to a scientometric counting station now and then? This is technically possible today. Copying the file would also copy the code with it, at least in the absence of a fairly sophisticated hack. But even if the code only reported anonymized traffic and usage data, many users would worry that it would report more, invade privacy, and compromise anonymous inquiry. Open-source code would help allay fears, but would it help enough? Either way, we're likely to see closed-source versions of this code become common.

Note that the proliferation of copies only hinders metrics that count downloads, search hits, and other forms of usage. It does not affect the count of citations or impact measurements based on citation counts. Of course, automated citation counts might fail too, for example because an article citing my work is offline or invisible to the counter. But if so, the fault does not lie with the many-copy problem.

Tracking dynamic changes
The proliferation of copies harms what we could call 'dynamic' works, which are periodically revised or updated. Even if each update carries a revision date, and all copies carry the revision date, a reader will not know whether there is a more recent copy elsewhere.

When I maintained a list of links to sites in philosophy, I dated every revision of the file. But I was frustrated when other philosophers used copies, rather than links, to share it with students or colleagues. They would invariably fail to keep their copies up-to-date. The result was that readers who consulted their copies, rather than my original, would think that I was slow to update the file (or slower than I really was).

If the dynamic work is an article or book, then readers of out-of-date copies will think the author is guilty of errors or omissions that have been corrected in newer versions. If the dynamic work has legal implications, like a website privacy policy, then out-of-date copies will mislead users about their rights.

While I consent to open access for all my online writings, I do try to control the copying of my dynamic works. When I find out-of-date copies on the Internet, I ask the host to bring them up-to-date or take them down. I consent to mirrors of my dynamic works only when I am confident that the mirror will remain in sync with the original.

The proliferation of copies makes it more difficult to know when the version of an article you're reading is the same version as was approved by a journal's peer-review process. The text might give no indication. It might say that it was approved, but it might be an altered copy of a version that was truly approved, or a fraudulent copy that was never approved.

For better or worse, we're refining our rules of thumb for deciding when to trust online content. One rule might be, "If the copy doesn't say where it was refereed, then assume it was never refereed." (It might be an honest preprint or it might be a fraud.) This particular rule may err too far on the side of scepticism, for we all know of peer-reviewed papers on author web sites that show no sign of their approval by a peer-review process. The question is how the many-copy problem interferes with our attempt to make the rules more discriminating and less crude.

One way to deal with the authentication problem is for the journal that conducted peer review to host its own copy of the approved article. If you distrust the copy you're reading, then visit the source and read an authenticated copy, or run file-comparison software across the authenticated and questionable copies.
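As a rough illustration of the file-comparison step, here is a minimal Python sketch. It assumes both copies are available as plain text; the function names and the sample texts are hypothetical and chosen only for this example. It first compares SHA-256 digests, and only if they differ does it list the changed lines.

```python
import difflib
import hashlib


def digest(text: str) -> str:
    """Return the SHA-256 hex digest of the text, encoded as UTF-8."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def compare_copies(authenticated: str, questionable: str) -> list[str]:
    """Return unified-diff lines between two copies (empty if identical)."""
    if digest(authenticated) == digest(questionable):
        return []
    return list(
        difflib.unified_diff(
            authenticated.splitlines(),
            questionable.splitlines(),
            fromfile="authenticated",
            tofile="questionable",
            lineterm="",
        )
    )


# Hypothetical sample texts, not taken from any real article.
journal_copy = "Results\nThe effect was not significant.\n"
mirror_copy = "Results\nThe effect was significant.\n"

for line in compare_copies(journal_copy, mirror_copy):
    print(line)
```

A comparison like this only shows that two copies differ and where. It still depends on the reader fetching the reference copy from a source they already trust, such as the journal's own site.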

Another approach is encryption, used in Surfaces, an early peer-reviewed, open access journal edited by Jean-Claude Guédon. With encryption, receiving an authenticated copy of an article is as easy as receiving an authenticated signature or credit card payment. (I still don't understand why this powerful idea from 1991 has not been more widely imitated or criticized. Is it because there's little urgency to solve the authentication problem itself?)

Another approach is to let articles carry their metadata with them. Metadata fields could indicate not only authorship and date, but also whether the article was refereed and where. Embedded metadata would not be as secure as encryption, but it would be more convenient for the reader and less likely than 'self-reporting code' to aggravate the suspicions of suspicious users.
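To make the idea of embedded metadata concrete, here is a minimal Python sketch. It assumes, purely for illustration, that a copy begins with a block of "Field: value" lines followed by a blank line; the field names (Author, Date, Refereed, Refereed-By) and the sample text are hypothetical and do not represent any existing standard. The standard-library e-mail parser is used only because it already understands this header layout.

```python
from email import message_from_string
from typing import Optional

# Hypothetical copy of an article with an embedded metadata block.
copy_text = """\
Author: A. Researcher
Date: 2004-03-15
Refereed: yes
Refereed-By: Journal of Hypothetical Results

Body of the article starts here.
"""


def read_metadata(text: str) -> dict[str, str]:
    """Parse the leading 'Field: value' block into a dictionary."""
    message = message_from_string(text)
    return {key: value for key, value in message.items()}


def refereed_by(text: str) -> Optional[str]:
    """Return the stated refereeing venue, or None if the copy names none."""
    metadata = read_metadata(text)
    if metadata.get("Refereed", "").strip().lower() == "yes":
        return metadata.get("Refereed-By")
    return None


print(refereed_by(copy_text))
```

A copy with no metadata block yields None, which matches the sceptical rule of thumb of treating an unlabelled copy as unrefereed. The fields are self-reported and can be altered along with the text, so this offers convenience rather than proof.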

Editor's Note: Suber's analysis offers some thought-provoking ideas about how to view the generation of multiple copies because of Open Access. The life sciences community will have to decide whether it will focus on the problems created by the existence of multiple copies or whether it will celebrate the benefits of the "many-copy solution". This version of Suber's article has been approved by the author. The original copy can be found at this page.

Open Access Now is published by BioMed Central.
Editor: Jonathan B Weitzman.