This article is part of the supplement: Proceedings of the Tenth Annual MCBIOS Conference
A cross disciplinary study of link decay and the effectiveness of mitigation techniques
Department of Mathematics and Statistics, South Dakota State University, Box 2220, Brookings, SD 57007, USA
BMC Bioinformatics 2013, 14(Suppl 14):S5 doi:10.1186/1471-2105-14-S14-S5
Published: 9 October 2013
The dynamic, decentralized World Wide Web has become an essential part of scientific research and communication. Researchers create thousands of web sites every year to share software, data and services. These valuable resources tend to disappear over time, a problem that has been documented in many subject areas. Our goal is to conduct a cross-disciplinary investigation of the problem and to test the effectiveness of existing remedies.
We accessed 14,489 unique web pages found in abstracts within Thomson Reuters' Web of Science citation index that were published between 1996 and 2010. The median lifespan of these web pages was 9.3 years, and 62% of them were archived. Survival analysis and logistic regression were used to find significant predictors of URL lifespan. The availability of a web page depends most on when it was published and on its top-level domain name. Similar statistical analysis revealed biases in current solutions: the Internet Archive favors web pages with fewer layers in the Uniform Resource Locator (URL), while WebCite is significantly influenced by the source of publication. We also created a prototype of a process for submitting web pages to the archives, which increased the coverage of our list of scientific web pages in the Internet Archive and WebCite by 22% and 255%, respectively.
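As a rough illustration of the kind of analysis described above, the following Python sketch derives two URL-level predictors and estimates a median lifespan from censored observations. It is not the authors' code. The function names `url_features` and `km_median` are hypothetical, counting non-empty path segments is only one assumed reading of "layers in the URL", and the Kaplan-Meier estimator is a standard survival-analysis technique assumed here for illustration, since the abstract does not name the specific method used.

```python
from urllib.parse import urlsplit


def url_features(url):
    """Return (top_level_domain, path_depth) for a URL.

    path_depth counts non-empty path segments. This is an assumed
    definition of URL "layers", not one taken from the paper.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    depth = len([seg for seg in parts.path.split("/") if seg])
    return tld, depth


def km_median(durations, observed):
    """Kaplan-Meier median lifespan.

    durations[i] is how long URL i was followed (e.g. in years).
    observed[i] is True if the URL was found dead at that time and
    False if it was still alive when last checked (censored).
    Returns the first time at which estimated survival drops to 0.5
    or below, or None if that never happens.
    """
    pairs = sorted(zip(durations, observed))
    at_risk = len(pairs)
    survival = 1.0
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        seen_at_t = 0
        # Group all observations that share the same time t.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += 1 if pairs[i][1] else 0
            seen_at_t += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            if survival <= 0.5:
                return t
        # Both dead and censored URLs leave the risk set after time t.
        at_risk -= seen_at_t
    return None
```

Treating still-live pages as censored rather than dropping them matters here: many pages in such a sample have not yet disappeared, and ignoring them would bias the estimated lifespan downward.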
Our results show that link decay continues to be a problem across different disciplines, and that current solutions for static web pages are helping but can be improved.