<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art><ui>1471-2105-12-187</ui><ji>1471-2105</ji><fm>
<dochead>Database</dochead>
<bibl>
<title>
<p>Extracting scientific articles from a large digital archive: BioStor and the Biodiversity Heritage Library</p>
</title>
<aug>
<au ca="yes" id="A1"><snm>Page</snm><mi>DM</mi><fnm>Roderic</fnm><insr iid="I1"/><email>Roderic.Page@glasgow.ac.uk</email></au>
</aug>
<insg>
<ins id="I1"><p>Institute of Biodiversity, Animal Health and Comparative Medicine, College of Medical, Veterinary and Life Sciences, Graham Kerr Building, University of Glasgow, Glasgow G12 8QQ, UK</p></ins>
</insg>
<source>BMC Bioinformatics</source>
<issn>1471-2105</issn>
<pubdate>2011</pubdate>
<volume>12</volume>
<issue>1</issue>
<fpage>187</fpage>
<url>http://www.biomedcentral.com/1471-2105/12/187</url>
<xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-12-187</pubid><pubid idtype="pmpid">21605356</pubid></pubidlist></xrefbib>
</bibl>
<history><rec><date><day>21</day><month>9</month><year>2010</year></date></rec><acc><date><day>23</day><month>5</month><year>2011</year></date></acc><pub><date><day>23</day><month>5</month><year>2011</year></date></pub></history>
<cpyrt><year>2011</year><collab>Page; licensee BioMed Central Ltd.</collab><note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note></cpyrt>
<abs>
<sec>
<st>
<p>Abstract</p>
</st>
<sec>
<st>
<p>Background</p>
</st>
<p>The Biodiversity Heritage Library (BHL) is a large digital archive of legacy biological literature, comprising over 31 million pages scanned from books, monographs, and journals. During the digitisation process basic metadata about the scanned items is recorded, but not article-level metadata. Given that the article is the standard unit of citation, this makes it difficult to locate cited literature in BHL. Adding the ability to easily find articles in BHL would greatly enhance the value of the archive.</p>
</sec>
<sec>
<st>
<p>Description</p>
</st>
<p>A service was developed to locate articles in BHL based on matching article metadata to BHL metadata using approximate string matching, regular expressions, and string alignment. This article locating service is exposed as a standard OpenURL resolver on the BioStor web site <url>http://biostor.org/openurl/</url>. This resolver can be used on the web, or called by bibliographic tools that support OpenURL.</p>
</sec>
<sec>
<st>
<p>Conclusions</p>
</st>
<p>BioStor provides tools for extracting, annotating, and visualising articles from the Biodiversity Heritage Library. BioStor is available from <url>http://biostor.org/</url>.</p>
</sec>
</sec>
</abs>
</fm><meta>
<classifications>
<classification id="biodiversity_research" subtype="cross_series_title" type="BMC">Open Access Biodiversity Research</classification>
<classification id="biodiversity_research" subtype="cross_series_editor" type="BMC"/>
</classifications>
</meta><bdy>
<sec>
<st>
<p>Background</p>
</st>
<p>In July 2010 Lambert et al. <abbrgrp>
<abbr bid="B1">1</abbr>
</abbrgrp> published a paper in <it>Nature </it>that described an extinct sperm whale possessing the biggest bite of any tetrapod known. They named this formidable predator <it>Leviathan melvillei</it>, the genus name <it>Leviathan </it>being derived from the Hebrew 'Livyatan', the species name honouring Herman Melville (author of Moby Dick <abbrgrp>
<abbr bid="B2">2</abbr>
</abbrgrp>). As appropriate as this name was, it quickly ran foul of the rules of zoological nomenclature <abbrgrp>
<abbr bid="B3">3</abbr>
</abbrgrp> because <it>Leviathan </it>had been used 169 years previously for an extinct species of mammoth <abbrgrp>
<abbr bid="B4">4</abbr>
</abbrgrp>. Although the name <it>Leviathan </it>Koch <abbrgrp>
<abbr bid="B4">4</abbr>
</abbrgrp> had lapsed into obscurity (as a synonym of <it>Mammut </it>Blumenbach), its existence meant the newly discovered whale had to be renamed, which it duly was a month after the original publication <abbrgrp>
<abbr bid="B5">5</abbr>
</abbrgrp>.</p>
<p>The fate of Lambert et al.'s <it>Leviathan </it>illustrates a significant challenge facing researchers finding and naming new species - the discoverability of existing names. In the absence of a global register of all taxonomic names that have ever been published, a researcher about to publish a new name may struggle to establish that it has not already been used. Zoological nomenclature dates from 1758, botanical nomenclature from 1753, hence a comprehensive list of taxonomic names must survey some 250 years of literature <abbrgrp>
<abbr bid="B6">6</abbr>
</abbrgrp>, much of which is obscure and may not exist in digital form. Digitising this legacy literature is the goal of the Biodiversity Heritage Library (BHL) <abbrgrp>
<abbr bid="B7">7</abbr>
<abbr bid="B8">8</abbr>
</abbrgrp>, a consortium of natural history museum libraries, botanic libraries, and research institutions. The bulk of this digitisation is carried out by the Internet Archive <abbrgrp>
<abbr bid="B9">9</abbr>
</abbrgrp>, which scans books (broadly defined to include bound issues of journals), creating a set of electronic files for each scanned item, which includes images of individual pages, and text extracted from those pages using Optical Character Recognition (OCR). BHL takes these files (together with the output from the scanning projects of individual BHL members), indexes them by bibliographic metadata and taxonomic names, and makes the content available on its web site <abbrgrp>
<abbr bid="B7">7</abbr>
</abbrgrp> (both as web pages and web services). Although the bulk of BHL's scanning activity focuses on pre-1923 content that is out of copyright, BHL also holds a considerable amount of post-1923 content contributed by its member institutions, notably publications by various natural history museums.</p>
<p>The inability to easily locate articles in BHL is a substantial obstacle to integrating this legacy biodiversity literature into mainstream scientific publishing. The goal of BioStor is to provide tools to locate and extract articles from the BHL archive. BioStor differs from search engines such as PubMed <abbrgrp>
<abbr bid="B10">10</abbr>
</abbrgrp> and Google Scholar <abbrgrp>
<abbr bid="B11">11</abbr>
</abbrgrp>, which support free-form queries such as "what articles have been published on this topic?", or "what papers has this author published?" BioStor addresses a different question, namely "does this article exist in the BHL archive?" It is a tool to find out whether a specific article exists in the archive, as opposed to finding what articles exist on a particular topic.</p>
<sec>
<st>
<p>Locating articles in BHL</p>
</st>
<p>The BHL archive comprises "items" corresponding to physical objects which are scanned. Items are grouped together into "titles". A single volume book corresponds to a single title and item, whereas a multi-volume work, such as a journal, will comprise several items grouped under the same title (Figure <figr fid="F1">1</figr>). Noticeably absent from the BHL model is the standard unit of scientific citation, the article.</p>
<fig id="F1"><title><p>Figure 1</p></title><caption><p>Simplified model of Biodiversity Heritage Library content</p></caption><text>
   <p><b>Simplified model of Biodiversity Heritage Library content</b>. Each scanned item comprises one or more page images. Items are grouped together into titles.</p>
</text><graphic file="1471-2105-12-187-1" hint_layout="single"/></fig>
<p>For most modern articles the triple of journal name, volume, and starting page is sufficient to uniquely identify an article <abbrgrp>
<abbr bid="B12">12</abbr>
</abbrgrp>, and tools such as CrossRef's OpenURL resolver <abbrgrp>
<abbr bid="B13">13</abbr>
</abbrgrp> can take this triple and discover whether a Digital Object Identifier (DOI) <abbrgrp>
<abbr bid="B14">14</abbr>
</abbrgrp> exists for that article. Publishers make use of this tool to map the literature cited in a manuscript to the corresponding DOI. In an ideal world the BHL model of (title, item, page) (Figure <figr fid="F1">1</figr>) would map exactly to (journal, volume, page), such that an individual journal would correspond to a title in BHL, and each volume of that journal would be a separate item. Given that BHL stores page numbers for each scanned page <abbrgrp>
<abbr bid="B8">8</abbr>
</abbrgrp>, locating articles would then be trivial and linking to BHL content could be readily integrated into existing publication processes, as well as bibliographic management tools that make use of CrossRef's services to augment user-provided metadata (e.g., Mendeley <abbrgrp>
<abbr bid="B15">15</abbr>
</abbrgrp>).</p>
<p>Unfortunately, the actual mapping between articles and BHL content is often rather more complicated. Large articles (e.g., monographs) may be treated as separate "titles" (effectively as if they were books), rather than parts of the same title. A contributing library may have bound several volumes of a journal together, such that a single "item" may comprise multiple volumes. Volume numbers themselves may not be unique within a journal. <it>The Annals and Magazine of Natural History </it>(ISSN 0374-5481), published from 1828 until 1967 (being succeeded by the <it>Journal of Natural History</it>, ISSN 0022-2933), is divided into 13 "series", each series numbering its volumes from one onwards. Hence, "volume 1" of <it>Annals and Magazine of Natural History </it>may refer to any one of 13 volumes spanning 138 years <abbrgrp>
<abbr bid="B16">16</abbr>
</abbrgrp>. Journals also differ in whether pagination is unique within a volume, or within parts of a volume. For example, in the journal <it>Arkiv f&#246;r Zoologi </it>(ISSN 0004-2110) each article starts on page 1, so that the triple (<it>Arkiv f&#246;r Zoologi</it>, 13, 1) may refer to <abbrgrp>
<abbr bid="B17">17</abbr>
<abbr bid="B18">18</abbr>
</abbrgrp>, or any of 23 other articles in volume 13 of that journal.</p>
<p>Discovering articles also assumes that the pagination in BHL is complete and correct, and that one side of a sheet of paper corresponds to a "page". BHL records the page number of regular pages, but not pages that are classified as special in some way, such as title pages, or tables of contents. For example, page 1 in Lynch et al. <abbrgrp>
<abbr bid="B19">19</abbr>
</abbrgrp> is recorded in BHL as being the title page without any number, which will frustrate efforts to locate this article by starting page alone.</p>
<p>While the triple (journal, volume, starting page) is usually sufficient - subject to the caveats above - to locate the start of an article, we want to recover all the pages in the article, hence we need both the starting and ending pages. Ideally we could then extract the corresponding set of page images from BHL and join them together to form an article. However, it is not uncommon for older articles to have discontinuous physical pagination, for example by having plates inserted between pages in the text. In some publications, such as <it>Isis von Oken</it>, the text on a page forms two columns, each with its own page number (Figure <figr fid="F2">2</figr>), hence one physical page need not equate to a bibliographic page.</p>
<fig id="F2"><title><p>Figure 2</p></title><caption><p>Physical page with two page numbers</p></caption><text>
   <p><b>Physical page with two page numbers</b>. Example of a physical page in the journal <it>Isis von Oken </it>with two columns, each of which has its own page number (249 and 250, respectively).</p>
</text><graphic file="1471-2105-12-187-2" hint_layout="double"/></fig>
</sec>
<sec>
<st>
<p>Metadata matters</p>
</st>
<p>Given that locating articles in an archive of legacy literature such as BHL is a non-trivial task, it is worth considering why such an undertaking is worthwhile, beyond integrating BHL with existing citation practices. Indeed, one could argue that, given that the OCR text for BHL content has been indexed by taxonomic name, the need for indexing by article has been greatly reduced - the user could simply search by taxonomic name and find the content they require. This would be sufficient for many users, especially if we were confident that BHL had correctly indexed all the taxonomic names contained in the pages it has scanned. However, OCR errors mean that a significant fraction of names will be missed <abbrgrp>
<abbr bid="B20">20</abbr>
</abbrgrp>. An obvious approach to discovering these missing names would be to take existing databases of taxonomic names and publications and search for those publications in BHL.</p>
<p>Metadata also provides ways for clients to aggregate and filter search results. The Encyclopedia of Life <abbrgrp>
<abbr bid="B21">21</abbr>
</abbrgrp> incorporates search results from BHL in its taxon pages, but the user has no obvious means of discovering whether the results are from the same article or not, nor can they order the results by date. As an example of one way the display of search results can be improved by sorting, consider the dispute concerning the correct scientific name for the sperm whale, which is debated in both the scientific literature <abbrgrp>
<abbr bid="B22">22</abbr>
<abbr bid="B23">23</abbr>
<abbr bid="B24">24</abbr>
</abbrgrp> and, more vociferously, Wikipedia <abbrgrp>
<abbr bid="B25">25</abbr>
</abbrgrp>. Being able to extract basic metadata from BHL would enable us to visualise the relative popularity of the two alternatives, <it>Physeter catodon </it>and <it>Physeter macrocephalus</it>, over time (Figure <figr fid="F3">3</figr>). With the obvious caveat that the literature in BHL is a biased sample of the taxonomic literature, it is clear that <it>Physeter macrocephalus </it>is the more commonly used name, but its usage peaked around the start of the twentieth century. By the 1950s, the sperm whale was more commonly referred to as <it>Physeter catodon</it>. Navigating BHL content by date may help the user discover why the relative usage frequency of these two names changed in the previous century.</p>
<fig id="F3"><title><p>Figure 3</p></title><caption><p>Usage of two names for the sperm whale over time</p></caption><text>
   <p><b>Usage of two names for the sperm whale over time</b>. Approximate distribution over time of two alternative names for the sperm whale (<it>Physeter catodon </it>and <it>Physeter macrocephalus</it>) in items scanned by the Biodiversity Heritage Library. Date of publication was extracted from the <monospace>StartYear</monospace> and <monospace>EndYear</monospace> fields of the <monospace>Title</monospace> table (see Fig. 4) using regular expressions.</p>
</text><graphic file="1471-2105-12-187-3" hint_layout="single"/></fig>
</sec>
</sec>
<sec>
<st>
<p>Construction and content</p>
</st>
<p>A local copy of the core BHL tables (Figure <figr fid="F4">4</figr>) was created in MySQL using the data dump provided by BHL <url>http://www.biodiversitylibrary.org/data/data.zip</url>. Page images and OCR text for individual pages are retrieved as needed using the BHL API and cached locally (together with a thumbnail of the page image).</p>
<fig id="F4"><title><p>Figure 4</p></title><caption><p>Simplified BHL schema</p></caption><text>
   <p><b>Simplified BHL schema</b>. Simplified database schema for the core tables in the Biodiversity Heritage Library. The fields referred to in the text are shown, together with a brief explanation of their contents.</p>
</text><graphic file="1471-2105-12-187-4" hint_layout="single"/></fig>
<sec>
<st>
<p>Locating an article</p>
</st>
<p>BioStor provides an OpenURL <abbrgrp>
<abbr bid="B26">26</abbr>
</abbrgrp> resolver service to locate articles in BHL. At a minimum the resolver requires the journal name, volume, and starting page of the article being searched for. It may also make use of journal series and date, if these are provided. This service first checks whether the article already exists in the BioStor database. If the article is not found, the algorithm outlined in Figure <figr fid="F5">5</figr> is used to search for the article in BHL.</p>
<fig id="F5"><title><p>Figure 5</p></title><caption><p>Flow chart of algorithm for finding an article in BHL</p></caption><text>
   <p><b>Flow chart of algorithm for finding an article in BHL</b>. Steps 1-4 are explained in the text.</p>
</text><graphic file="1471-2105-12-187-5" hint_layout="double"/></fig>
<sec>
<st>
<p>Step 1 - Finding the journal</p>
</st>
<p>The first step is to determine whether BHL includes the journal containing the article. BioStor uses a service provided by bioGUID <abbrgrp>
<abbr bid="B27">27</abbr>
<abbr bid="B28">28</abbr>
</abbrgrp> to find the ISSN <abbrgrp>
<abbr bid="B29">29</abbr>
</abbrgrp> for the journal. If the bioGUID service returns an ISSN, the algorithm looks up the ISSN in the <monospace>Title Identifier</monospace> table (Figure <figr fid="F1">1</figr>) and retrieves the corresponding BHL <monospace>TitleID</monospace>. If the bioGUID service doesn't return an ISSN the algorithm attempts to find the journal title in the <monospace>ShortTitle</monospace> field in the <monospace>Title</monospace> table using approximate string matching. If it fails to find the title it then searches the <monospace>VolumeInfo</monospace> field in the <monospace>Item</monospace> table - for some journals (e.g., <it>Fieldiana Zoology</it>, ISSN 0015-0754) the journal title is stored in that field. If at this point we can't find the journal we exit.</p>
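<p>The fallback title lookup can be illustrated with a minimal sketch. The text does not say which approximate string matching method BioStor uses, so this example assumes the similarity ratio from Python's standard <monospace>difflib</monospace> module; the sample titles, the <monospace>TitleID</monospace> values, the function names, and the cutoff of 0.8 are all hypothetical.</p>

```python
import difflib

# Hypothetical sample of ShortTitle values mapped to TitleID (not real BHL data).
SHORT_TITLES = {
    "Annals and Magazine of Natural History": 101,
    "Tijdschrift voor Entomologie": 102,
    "Proceedings of the Zoological Society of London": 103,
}


def normalise(title):
    """Lower-case a title and replace punctuation with spaces."""
    kept = [ch if ch.isalnum() or ch.isspace() else " " for ch in title.lower()]
    return " ".join("".join(kept).split())


def find_title_id(query, titles=SHORT_TITLES, cutoff=0.8):
    """Return the TitleID of the closest ShortTitle, or None if none is close enough."""
    by_normalised = {normalise(name): title_id for name, title_id in titles.items()}
    # get_close_matches ranks candidates by difflib's similarity ratio (an assumed metric).
    matches = difflib.get_close_matches(
        normalise(query), list(by_normalised), n=1, cutoff=cutoff
    )
    return by_normalised[matches[0]] if matches else None
```

<p>With these sample data a misspelt query such as "Tijdschrift voor Entomolgie" still resolves to the hypothetical <monospace>TitleID</monospace> 102, whereas a heavily abbreviated title such as "Ann. Mag. nat. Hist." falls below the cutoff and returns nothing, which suggests why resolving the journal to an ISSN first is the preferred route.</p>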
</sec>
<sec>
<st>
<p>Step 2 - Finding scanned items for the journal</p>
</st>
<p>Ideally each journal corresponds to a single BHL title, but in some cases the same journal may be represented by more than one BHL title, and hence have more than one <monospace>TitleID</monospace>. Step 2 uses a hard-coded table of such cases to ensure that all items for a given journal are considered by Step 3.</p>
</sec>
<sec>
<st>
<p>Step 3 - Finding the volume and page</p>
</st>
<p>Ideally the <monospace>VolumeInfo</monospace> field in the <monospace>Item</monospace> table would contain just the volume number; however, all manner of free-form text may be found there. The volume may be recorded as simple numbers or as strings, sometimes indicating volume, page or date ranges, notes on completeness of the volume, or other comments (e.g., "Index"). Metadata may also be in a variety of languages, such that the field may refer to "Volume", "Band", or "Tome". Nor is metadata always recorded consistently within a journal. For example, the <monospace>VolumeInfo</monospace> field for scanned items belonging to the journal <it>Proceedings of the Zoological Society of London </it>contains strings such as:</p>
<p indent="1">&#8226; Part 1- Part 4 (1833-38)</p>
<p indent="1">&#8226; 1856</p>
<p indent="1">&#8226; 1901, v. 1 (Jan.-Apr.)</p>
<p indent="1">&#8226; Jan-Apr 1906</p>
<p indent="1">&#8226; 1912 v. 2</p>
<p indent="1">&#8226; 1923, pt. 1-2 (pp. 1-481)</p>
<p>BioStor uses a set of ad-hoc regular expressions to extract the volume (and, where present, other information such as series, issue, and date) from the <monospace>VolumeInfo</monospace> field. If no match to the target volume is found the algorithm exits.</p>
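<p>As a rough illustration of this step, the following sketch applies three regular expressions to strings like those listed above. The patterns and the function name <monospace>parse_volume_info</monospace> are assumptions made for this example; the expressions BioStor actually uses are not given in the text and are likely to be more numerous.</p>

```python
import re

# Illustrative patterns only; they are not the expressions used by BioStor.
YEAR = re.compile(r"\b(1[5-9]\d\d|20\d\d)\b")
VOLUME = re.compile(r"\b(?:v|vol|volume|band|bd|tome|t)\.?\s*(\d+)", re.IGNORECASE)
PART = re.compile(r"\b(?:pt|part)\.?\s*(\d+)", re.IGNORECASE)


def parse_volume_info(text):
    """Pull the first year, the first volume, and all part numbers from a VolumeInfo string."""
    year = YEAR.search(text)
    volume = VOLUME.search(text)
    return {
        "year": int(year.group(1)) if year else None,
        "volume": int(volume.group(1)) if volume else None,
        "parts": [int(p) for p in PART.findall(text)],
    }


for sample in ("1901, v. 1 (Jan.-Apr.)", "1923, pt. 1-2 (pp. 1-481)", "Jan-Apr 1906"):
    print(sample, parse_volume_info(sample))
```

<p>Strings that carry only a year, such as "1856", yield no volume in this sketch, so a caller would need a rule for journals whose volumes are cited by year.</p>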
</sec>
<sec>
<st>
<p>Step 4 - Checking the match</p>
</st>
<p>At this stage in the algorithm we will have one or more candidates for the first page in the article. Multiple candidates may occur because the article has been scanned by more than one BHL contributor, or because there may be more than one article with the same metadata (see examples of <it>Annals and Magazine of Natural History </it>and <it>Arkiv f&#246;r Zoologi </it>discussed above). Some of these matches can be filtered by series or date, if the user has supplied that information. For each remaining match we take the OCR text for the first page in the candidate and compare it to the article title by computing a local alignment between words in the page and words in the title using the Smith-Waterman <abbrgrp>
<abbr bid="B30">30</abbr>
</abbrgrp> algorithm. Each pair of words that matches exactly is scored +2; mismatches, deletions, and insertions are each scored -1. The score for the alignment is normalised by dividing it by the match score &#215; the number of words in the title, so that a perfect match has a score of 1. As an illustration, Figure <figr fid="F6">6</figr> shows the distribution of alignment scores for the <it>Annals and Magazine of Natural History</it>. Most articles in this journal have a score &gt; 0.5; however, some articles have very low scores due to poor OCR quality. For example, for the article "Preliminary notice of the Schizopoda collected by H. M.S. Discovery in the Antarctic region" <abbrgrp>
<abbr bid="B31">31</abbr>
</abbrgrp> the corresponding OCR text is "Preltiniiiari/Xutice of I he Sc/ti:oj/0(/a collcxted hy 11. M.S. 'Dixcovenj' in the Antarctic Rec/io".</p>
<fig id="F6"><title><p>Figure 6</p></title><caption><p>Alignment scores for Annals and Magazine of Natural History</p></caption><text>
   <p><b>Alignment scores for Annals and Magazine of Natural History</b>. Frequency distribution of scores for Smith-Waterman alignment between article title and OCR text for 314 articles from <it>Annals and Magazine of Natural History </it>in the Biodiversity Heritage Library.</p>
</text><graphic file="1471-2105-12-187-6" hint_layout="single"/></fig>
</sec>
</sec>
<sec>
<st>
<p>Storing articles</p>
</st>
<p>Articles extracted from BHL are stored in the same MySQL database that stores the BHL tables, using a simple schema comprising a table for article bibliographic metadata, a table for authors, and a table that joins the authors to the individual articles they've authored. A further table joins the article to the BHL Page table (Figure <figr fid="F7">7</figr>).</p>
<fig id="F7"><title><p>Figure 7</p></title><caption><p>Simplified BioStor database schema</p></caption><text>
   <p><b>Simplified BioStor database schema</b>. Simplified database schema for the core tables in the BioStor database.</p>
</text><graphic file="1471-2105-12-187-7" hint_layout="single"/></fig>
</sec>
</sec>
<sec>
<st>
<p>Utility and Discussion</p>
</st>
<p>The BioStor database is available at <url>http://biostor.org/</url>. It features an OpenURL resolver, and can display individual articles, lists of publications by author, by taxonomic name, and by journal. At the time of writing the database contains 26,784 articles extracted from BHL.</p>
<sec>
<st>
<p>OpenURL resolver</p>
</st>
<p>BioStor provides an OpenURL resolver at <url>http://biostor.org/openurl/</url>. If accessed using a web browser the user is presented with a form where they can enter the bibliographic details of an article individually (Figure <figr fid="F8">8a</figr>), or paste in a full citation and have BioStor attempt to parse it. BioStor's article parser uses regular expressions and is limited to simple citations of the form &lt;author(s)&gt; &lt;(Year)&gt; &lt;article title&gt;. &lt;journal&gt;. &lt;volume&gt;: &lt;starting page&gt;-&lt;end page&gt;. If the article is already in the BioStor database the article will be displayed; if not, BioStor attempts to locate the article in BHL. If it finds potential matches, these are displayed to the user (Figure <figr fid="F8">8b</figr>). For each match the page displays the score based on Smith-Waterman alignment between the page OCR text and the article title. In the example shown in Figure <figr fid="F8">8b</figr>, there are three potential matches, two of which have high scores (they are duplicates resulting from two BHL contributors having scanned the same journal). A thumbnail of the first page in each possible match is shown; the user can click on this to view a larger version of the page if they wish to inspect the match more closely. If they are happy that one of the matches is indeed the article they were looking for, the user can fill in the reCAPTCHA test <abbrgrp>
<abbr bid="B32">32</abbr>
<abbr bid="B33">33</abbr>
</abbrgrp> and click on the corresponding button. BioStor will then retrieve the remaining page images and OCR text from BHL, store the article in its database, then display it to the user.</p>
<fig id="F8"><title><p>Figure 8</p></title><caption><p>BioStor OpenURL resolver</p></caption><text>
   <p><b>BioStor OpenURL resolver</b>. (a) Example of using the web interface to the OpenURL resolver. The user has entered bibliographic details for the reference "On the Arachnida taken in the Transvaal and in Nyasaland by Mr W. L. Distant and Dr Percy Rendall" <abbrgrp><abbr bid="B53">53</abbr></abbrgrp>. (b) The resolver has found three possible matches in the Biodiversity Heritage Library. For each match the best alignment between the article title and the OCR text is highlighted in yellow. The user can then choose which match will be stored in BioStor.</p>
</text><graphic file="1471-2105-12-187-8" hint_layout="double"/></fig>
<p>Cutting and pasting bibliographic details into web forms is tedious, so the web interface to the OpenURL resolver is intended for casual use only. Instead, it is envisaged that users will interact with the OpenURL resolver using one of the bibliographic tools that supports the protocol, such as EndNote <abbrgrp>
<abbr bid="B34">34</abbr>
</abbrgrp> and Zotero <abbrgrp>
<abbr bid="B35">35</abbr>
</abbrgrp>, or a web browser that supports OpenURL ContextObject in SPAN (COinS) <abbrgrp>
<abbr bid="B36">36</abbr>
</abbrgrp>, such as Firefox with the OpenURL Referrer add-on <abbrgrp>
<abbr bid="B37">37</abbr>
</abbrgrp>. For example, the following OpenURL corresponds to the web form shown in Figure <figr fid="F8">8a</figr> (with line breaks added for clarity):</p>
<p>http://biostor.org/openurl</p>
<p>?genre=article</p>
<p>&amp;atitle=On the Arachnida taken in the Transvaal and in Nyasaland by Mr W. L. Distant and Dr Percy Rendall</p>
<p>&amp;title=Ann. Mag. nat. Hist.</p>
<p>&amp;volume=1</p>
<p>&amp;spage=308</p>
<p>&amp;epage=321</p>
<p>&amp;date=1898</p>
<p>Appending "&amp;format=json" to the OpenURL returns the result in JavaScript Object Notation (JSON), hence the service can be used as an API by other developers.</p>
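<p>A client script might assemble such a query as in the sketch below, which uses the parameter names from the example OpenURL and percent-encodes the values. The function name <monospace>build_openurl</monospace> and its signature are assumptions made for this example; fetching the URL and interpreting the JSON response are left out because the response structure is not described here.</p>

```python
from urllib.parse import parse_qs, urlencode, urlsplit

RESOLVER = "http://biostor.org/openurl"


def build_openurl(atitle, title, volume, spage, epage=None, date=None, as_json=True):
    """Build an OpenURL query for an article, omitting optional fields that are not supplied."""
    params = {"genre": "article", "atitle": atitle, "title": title,
              "volume": volume, "spage": spage, "epage": epage, "date": date}
    if as_json:
        params["format"] = "json"  # request the JSON form of the result
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return RESOLVER + "?" + query


url = build_openurl(
    "On the Arachnida taken in the Transvaal and in Nyasaland "
    "by Mr W. L. Distant and Dr Percy Rendall",
    "Ann. Mag. nat. Hist.", 1, 308, epage=321, date=1898,
)
# Decode the query again to confirm that the values survive the round trip.
fields = parse_qs(urlsplit(url).query)
print(url)
```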
</sec>
<sec>
<st>
<p>Retrieval performance</p>
</st>
<p>The ability of BioStor to find articles in BHL depends on several factors. An obvious reason BioStor may fail to find an article is that it simply has not been scanned by BHL. Alternatively, it may have been scanned by BHL but not yet added to the local copy of BHL used by BioStor. Even if an article exists in BHL, BioStor may fail to find it if the metadata describing the item that contains the article doesn't conform to one of the regular expressions BioStor uses to interpret the <monospace>VolumeInfo</monospace> field in the <monospace>Item</monospace> table. Because BioStor evaluates the quality of a match by comparing the title of the target article with the OCR text (Figure <figr fid="F6">6</figr>), OCR errors may result in the match being deemed too poor to be correct. If the metadata for the target article contains significant errors, such as incorrect pagination, then BioStor may also fail to find an article.</p>
<sec>
<st>
<p>Retrieval of articles in the journal Tijdschrift voor Entomologie</p>
</st>
<p>To provide a benchmark for BioStor's performance I used an EndNote database of 2330 articles from the journal <it>Tijdschrift voor Entomologie </it>spanning the years 1858 to 1999, inclusive, assembled by E. J. van Nieukerken as part of a complete index of the journal <abbrgrp>
<abbr bid="B38">38</abbr>
</abbrgrp>. Almost all volumes of <it>Tijdschrift voor Entomologie </it>for this period have been scanned by BHL, so ideally BioStor should recover most, if not all, of the articles from this journal. This database was chosen because of the quality of the bibliographic metadata, and the fact it spanned some 150 years, during which time the typeface and layout of the journal changed significantly.</p>
<p>The EndNote file for <it>Tijdschrift voor Entomologie </it>was converted into a Research Information Systems (RIS) format file, which was then parsed by a script which extracted each article, constructed an OpenURL query, and forwarded it to BioStor, which returned a response in JSON format. The script recorded whether a match for each article was found, ignoring matches with an alignment score of less than 0.5. As part of the output the script created web pages displaying details of each putative match including a thumbnail image of the first page of the article, making it possible to quickly evaluate whether the match was correct. The database, scripts, and HTML output are available from <url>http://biostor.org/ms/</url>.</p>
<p>Of the 2330 articles in the database, 94 articles are in volumes not presently available in BHL, and 224 articles have pages labelled with Roman numerals which weren't recorded by BHL. This left 2012 articles in the BHL archive, of which BioStor found matches for 1429 (71%), doing noticeably better for articles published after 1950 (Figure <figr fid="F9">9</figr>). Only fifteen matches (1%) were found to be incorrect, in each case due to pagination errors in the corresponding scanned items in BHL (typically the pagination recorded by BHL was offset from the correct pagination by 2-3 pages).</p>
<fig id="F9"><title><p>Figure 9</p></title><caption><p>Success in locating articles from the journal Tijdschrift voor Entomologie</p></caption><text>
   <p><b>Success in locating articles from the journal Tijdschrift voor Entomologie</b>. Percentage of articles in the journal <it>Tijdschrift voor Entomologie </it>for the years 1858-1999 that BioStor found in the Biodiversity Heritage Library (BHL). 0% values represent volumes of <it>Tijdschrift voor Entomologie </it>that have not been scanned by BHL.</p>
</text><graphic file="1471-2105-12-187-9" hint_layout="single"/></fig>
<p>
<it>Tijdschrift voor Entomologie </it>is just one of the journals scanned by BHL, and it would be desirable to evaluate BioStor's performance across a range of journals. However, at present evaluation is hampered by the lack of freely available, comprehensive bibliographic databases for taxonomic journals.</p>
</sec>
</sec>
<sec>
<st>
<p>Displaying articles</p>
</st>
<p>Articles found by the OpenURL resolver are stored in the BioStor database, and given a unique URL of <url>http://biostor.org/reference/n</url> where <it>n </it>is a unique integer. Figure <figr fid="F10">10</figr> shows an article <abbrgrp>
<abbr bid="B39">39</abbr>
</abbrgrp> being displayed in BioStor. A simple JavaScript-based viewer displays a single page as an image, with thumbnails of all the pages in the article shown in a scrolling list. To minimise the time the article page takes to load, the thumbnails are only loaded when visible, using a delayed JavaScript image loader <abbrgrp>
<abbr bid="B40">40</abbr>
</abbrgrp>. The user can navigate through the article by clicking on the thumbnail for a given page. To smooth the transition between individual pages, when the user clicks on the thumbnail for a new page the thumbnail is displayed in place of the full page image while that page image loads. When the page image has loaded the low resolution thumbnail (which will appear fuzzy to the user) is replaced by the higher resolution image, giving the user the sensation that the page has come into focus.</p>
<fig id="F10"><title><p>Figure 10</p></title><caption><p>Example of page displaying an article in BioStor</p></caption><text>
   <p><b>Example of page displaying an article in BioStor</b>. The article being displayed is <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>.</p>
</text><graphic file="1471-2105-12-187-10" hint_layout="double"/></fig>
<p>The metadata (title, authors, journal name, etc.) can all be edited by the user. These edits will be saved if the user passes a reCAPTCHA test. The metadata can be retrieved in standard formats such as Reference Manager (RIS), EndNote XML, and BibTeX. The web page also contains bibliographic metadata embedded using the Context Object in Span (COinS) technique <abbrgrp>
<abbr bid="B36">36</abbr>
</abbrgrp>, and &lt;meta&gt; tags using the Dublin Core <abbrgrp>
<abbr bid="B41">41</abbr>
</abbrgrp> and Google Scholar <abbrgrp>
<abbr bid="B11">11</abbr>
</abbrgrp> vocabularies. The article itself can also be downloaded as a PDF file, with bibliographic metadata embedded using Adobe's Extensible Metadata Platform (XMP) <abbrgrp>
<abbr bid="B42">42</abbr>
</abbrgrp>. Desktop bibliographic software that can read XMP, such as Mendeley <abbrgrp>
<abbr bid="B15">15</abbr>
<abbr bid="B43">43</abbr>
</abbrgrp> and Papers <abbrgrp>
<abbr bid="B44">44</abbr>
</abbrgrp>, can extract this metadata so that the user need not manually re-enter bibliographic details for the paper.</p>
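<p>As an illustration of the RIS export mentioned above, the following Python sketch serialises a bibliographic record as RIS text. It is not taken from the BioStor source code (which is written in PHP): the function name to_ris and the dictionary keys are hypothetical, and only a handful of common RIS tags are shown. The example record is the article displayed in Figure 10.</p>

```python
def to_ris(record):
    """Serialise a journal-article record (a plain dict) as RIS text."""
    lines = ["TY  - JOUR"]
    for author in record.get("authors", []):
        lines.append("AU  - " + author)
    # Map the (hypothetical) record keys to common RIS tags.
    fields = [("title", "TI"), ("journal", "JO"), ("year", "PY"),
              ("volume", "VL"), ("spage", "SP"), ("epage", "EP"),
              ("url", "UR")]
    for key, tag in fields:
        value = record.get(key)
        if value:
            lines.append(tag + "  - " + str(value))
    # Every RIS record is closed by an ER tag.
    lines.append("ER  - ")
    return "\n".join(lines) + "\n"


example = {
    "authors": ["Raselimanana, AP", "Raxworthy, CJ", "Nussbaum, RA"],
    "title": ("A revision of the dwarf Zonosaurus Boulenger (Reptilia: "
              "Squamata: Cordylidae) from Madagascar, including "
              "descriptions of three new species"),
    "journal": "Scientific Papers Natural History Museum University of Kansas",
    "year": 2000,
    "volume": 18,
    "spage": 1,
    "epage": 16,
    "url": "http://biostor.org/reference/50335",
}

if __name__ == "__main__":
    print(to_ris(example))
```

<p>Fields that are missing from the record are simply omitted, so a partially edited record can still be exported.</p>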
<p>The article page also displays the taxonomic and, where possible, geographic scope of the article. Taxonomic scope is represented by a tag cloud of the taxonomic names that BHL has found in the OCR text for the article, and by a taxonomic classification of those names based on the 2008 edition of the Catalogue of Life <abbrgrp>
<abbr bid="B45">45</abbr>
</abbrgrp>. When an article is added to the BioStor database the OCR text is searched for strings that represent latitude and longitude values for point locations. Any points found are displayed on a Google Map.</p>
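<p>The search for point localities in OCR text can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pattern BioStor itself uses: the regular expression only recognises coordinates written as degrees, optional minutes, and a hemisphere letter, and the names COORD, to_decimal, and find_points are hypothetical. Real OCR text contains many more variants and recognition errors.</p>

```python
import re

# Hypothetical pattern: degrees, optional minutes, and a hemisphere letter
# for latitude, followed by the same for longitude, e.g. 18°56'S, 48°25'E.
COORD = re.compile(
    r"(\d{1,3})\s*°\s*(?:(\d{1,2})\s*['′]\s*)?([NS])"
    r"\s*[,;]?\s*"
    r"(\d{1,3})\s*°\s*(?:(\d{1,2})\s*['′]\s*)?([EW])"
)


def to_decimal(degrees, minutes, hemisphere):
    """Convert degrees and optional minutes to signed decimal degrees."""
    value = int(degrees) + (int(minutes) if minutes else 0) / 60.0
    # Southern and western hemispheres are negative by convention.
    return -value if hemisphere in "SW" else value


def find_points(ocr_text):
    """Return a list of (latitude, longitude) pairs found in the text."""
    points = []
    for match in COORD.finditer(ocr_text):
        lat = to_decimal(match.group(1), match.group(2), match.group(3))
        lon = to_decimal(match.group(4), match.group(5), match.group(6))
        # Discard values outside the valid coordinate range.
        if abs(lat) > 90 or abs(lon) > 180:
            continue
        points.append((lat, lon))
    return points
```

<p>Each pair returned by find_points is in signed decimal degrees, which is the form a mapping service expects when placing markers.</p>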
</sec>
<sec>
<st>
<p>Displaying authors</p>
</st>
<p>BioStor displays a summary page for each author in the database. To mitigate the problem of an author having more than one spelling of their name, BioStor clusters names using a web service provided by bioGUID <abbrgrp>
<abbr bid="B27">27</abbr>
</abbrgrp>, which implements Feitelson's <abbrgrp>
<abbr bid="B46">46</abbr>
</abbrgrp> weighted clique algorithm for finding equivalent names. The summary page aggregates publications and coauthorships across this set of names. The page uses Exhibit <abbrgrp>
<abbr bid="B47">47</abbr>
</abbrgrp> to create a faceted browser, enabling the user to browse an author's publications by date, journal, and coauthors.</p>
</sec>
<sec>
<st>
<p>Displaying journals</p>
</st>
<p>By default BioStor uses the ISSN to identify journals. Where an ISSN is not available, BioStor uses an OCLC number from the WorldCat service <abbrgrp>
<abbr bid="B48">48</abbr>
</abbrgrp>. A user can see all the articles for a given journal by appending the journal's ISSN to the URL http://biostor.org/issn/ (or its OCLC number to the URL http://biostor.org/oclc/). The resulting web page lists the articles for that journal, and shows a graphical representation of how many articles for that journal have been located in BHL. Figure <figr fid="F11">11</figr> shows the coverage of the journal <it>Proceedings of the United States National Museum </it>(ISSN 0096-3801), published from 1878 to 1968.</p>
<fig id="F11"><title><p>Figure 11</p></title><caption><p>Summary of coverage of the journal Proceedings of the United States National Museum in BioStor</p></caption><text>
   <p><b>Summary of coverage of the journal Proceedings of the United States National Museum in BioStor</b>. Dark blue bars represent pages that have been assigned to an article in BioStor. A sparkline depicts the distribution of these articles over time.</p>
</text><graphic file="1471-2105-12-187-11" hint_layout="single"/></fig>
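<p>The journal URL scheme described above can be sketched in Python as follows. The function names are hypothetical, and the check-digit test (the weighted sum modulo 11 defined by the ISSN standard) is an addition for illustration; it is not stated here whether BioStor validates identifiers in this way.</p>

```python
def issn_is_valid(issn):
    """Check the ISSN check digit (weighted sum modulo 11)."""
    chars = issn.replace("-", "").upper()
    if len(chars) != 8 or not chars[:7].isdigit():
        return False
    # The first seven digits are weighted 8, 7, 6, 5, 4, 3, 2.
    total = sum(int(c) * w for c, w in zip(chars[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return chars[7] == expected


def journal_url(identifier, scheme="issn"):
    """Build the BioStor journal URL for an ISSN or an OCLC number."""
    if scheme not in ("issn", "oclc"):
        raise ValueError("scheme must be 'issn' or 'oclc'")
    if scheme == "issn" and not issn_is_valid(identifier):
        raise ValueError("invalid ISSN: " + str(identifier))
    return "http://biostor.org/" + scheme + "/" + str(identifier)


if __name__ == "__main__":
    # ISSN of the Proceedings of the United States National Museum.
    print(journal_url("0096-3801"))
```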
</sec>
<sec>
<st>
<p>Displaying taxonomic names</p>
</st>
<p>If the user clicks on a name in the taxonomic tag cloud (Figure <figr fid="F10">10</figr>), or appends a taxonomic name (or uBio NameBankID <abbrgrp>
<abbr bid="B49">49</abbr>
</abbrgrp>) to the URL http://bioguid.org/name/ for a name that has been taxonomically indexed by BHL, BioStor displays a web page listing the articles in BioStor that contain that name. The page also displays a sparkline showing the distribution of that name over time in the local copy of BHL, and lists taxonomic synonyms of the name according to the 2008 edition of the Catalogue of Life <abbrgrp>
<abbr bid="B45">45</abbr>
</abbrgrp>.</p>
</sec>
<sec>
<st>
<p>Searching and browsing</p>
</st>
<p>BioStor supports rudimentary full-text search of author names and article titles. It also provides an interactive way to browse articles geographically using Google Maps <url>http://biostor.org/maps/</url> (Figure <figr fid="F12">12</figr>). When the user pans or zooms the map, the web page displays the set of articles (up to a limit of 20) whose OCR text includes (latitude, longitude) pairs contained within the current bounds of the map.</p>
<fig id="F12"><title><p>Figure 12</p></title><caption><p>Browsing BioStor content geographically using Google Maps</p></caption><text>
   <p><b>Browsing BioStor content geographically using Google Maps</b>. Listed below the map are the articles in the BioStor database with localities contained within the geographic area being displayed in the map.</p>
</text><graphic file="1471-2105-12-187-12" hint_layout="single"/></fig>
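<p>The selection of articles for the current map view can be sketched in Python as follows. This is an in-memory illustration under stated assumptions rather than the query BioStor runs against its database: the function name, the dictionary keys "id" and "points", and the handling of map views that cross the 180 degree meridian are all hypothetical.</p>

```python
def articles_in_bounds(articles, south, west, north, east, limit=20):
    """Return up to `limit` articles with a point inside the map bounds."""

    def inside(lat, lon):
        if not (north >= lat >= south):
            return False
        if west > east:
            # The view crosses the 180 degree meridian.
            return lon >= west or east >= lon
        return east >= lon >= west

    hits = []
    for article in articles:
        if len(hits) >= limit:
            break
        # An article matches if any of its points falls inside the view.
        if any(inside(lat, lon) for lat, lon in article["points"]):
            hits.append(article)
    return hits
```

<p>Capping the result at a small number of articles keeps the list below the map short, at the cost of hiding some matches until the user zooms in.</p>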
</sec>
<sec>
<st>
<p>Future directions</p>
</st>
<p>BioStor locates articles by matching existing bibliographies to BHL content, hence it relies on external sources of metadata to find articles. Typically these are bibliographies assembled by individual taxonomists for particular taxonomic groups, or lists of articles published in a single journal. An alternative approach would be to extract articles directly from the archive. Lu et al. <abbrgrp>
<abbr bid="B50">50</abbr>
</abbrgrp> used feature extraction and a mixture of rule-based and machine-learning techniques to extract metadata from BHL OCR text, recovering between 66% and 94% of articles in a selection of three journals. The set of articles in BioStor could be used as a training data set to help further develop these methods. Another approach to article extraction is crowd sourcing, where the task of identifying articles would be devolved to users. Ultimately, crowd sourcing could become important in cleaning metadata, but it may prove challenging to engage users in creating metadata from scratch.</p>
<p>The BHL archive has extracted taxonomic names from the OCR text, and BioStor looks for geographic localities encoded as latitude and longitude pairs. We could make more extensive use of the OCR text, for example by using autonomous citation indexing <abbrgrp>
<abbr bid="B51">51</abbr>
</abbrgrp> to extract citations from the literature cited section of each article. These citations could in turn be fed into the BioStor OpenURL resolver to attempt to locate them in BHL. The combination of variable citation styles and OCR errors means that the same reference may be represented by several different citations, requiring tools for cleaning and merging citation data (e.g., <abbrgrp>
<abbr bid="B52">52</abbr>
</abbrgrp>).</p>
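<p>Feeding an extracted citation to an OpenURL resolver amounts to encoding its fields as a query string, as in the following Python sketch. The resolver address, the function name, and the dictionary layout are assumptions made for illustration; the parameter names follow common OpenURL 0.1 usage. The example citation is reference 31.</p>

```python
from urllib.parse import urlencode

# Assumed resolver address; confirm it against the BioStor documentation.
RESOLVER = "http://biostor.org/openurl"


def openurl_query(citation):
    """Build an OpenURL 0.1 style query for a parsed citation (a dict)."""
    keys = ["genre", "aulast", "atitle", "title", "volume", "spage", "date"]
    # Skip fields that the citation parser could not recover.
    params = [(key, str(citation[key])) for key in keys if citation.get(key)]
    return RESOLVER + "?" + urlencode(params)


example = {
    "genre": "article",
    "aulast": "Holt",
    "title": "Ann Mag Nat Hist",
    "volume": 17,
    "spage": 1,
    "date": 1906,
}

if __name__ == "__main__":
    print(openurl_query(example))
```

<p>Because citations recovered from OCR text are often incomplete, the sketch sends only the fields that are present and leaves the matching to the resolver.</p>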
<p>BioStor is built as a service on top of a copy of data from BHL, and creates a local bibliographic database of articles. One future direction would be to integrate this data with BHL itself. BHL has an OpenURL resolver <url>http://www.biodiversitylibrary.org/openurlhelp.aspx</url> that primarily supports books rather than articles. Adding metadata from BioStor could enhance the BHL OpenURL service, and provide the biodiversity community with a single source for BHL-derived content. BioStor content could also be added to other bibliographic databases, in particular Mendeley <abbrgrp>
<abbr bid="B15">15</abbr>
<abbr bid="B43">43</abbr>
</abbrgrp>. Mendeley is developing an API for storing and retrieving documents and associated metadata, hence it might be possible to devolve the storing of basic bibliographic metadata to Mendeley, BioStor then becoming simply an OpenURL resolver.</p>
</sec>
</sec>
<sec>
<st>
<p>Conclusions</p>
</st>
<p>The 31 million scanned pages made available by the Biodiversity Heritage Library (BHL) represent a substantial resource of biological literature. BioStor provides an OpenURL resolver to locate articles in this archive. Each article extracted from BHL is given a unique URL, corresponding to a web page that displays the article pages, and information about the taxonomic names and geographic localities mentioned in the article. BioStor is available at <url>http://biostor.org/</url>.</p>
</sec>
<sec>
<st>
<p>Availability and requirements</p>
</st>
<p indent="1">&#8226; <b>Project Name: </b>BioStor</p>
<p indent="1">&#8226; <b>Project Home Page: </b>
<url>http://biostor.org/</url>. Source code is available from <url>http://code.google.com/p/bioguid/source/browse/#svn/trunk/biostor</url>.</p>
<p indent="1">&#8226; <b>Operating System: </b>The BioStor web site is usable with any modern web browser. The source code can be easily installed on a Mac OS X or Linux server. It has not been tested on a Windows machine.</p>
<p indent="1">&#8226; <b>Programming Language: </b>PHP</p>
<p indent="1">&#8226; <b>Other Requirements: </b>Web server</p>
<p indent="1">&#8226; <b>License: </b>GNU General Public License version 2</p>
<p indent="1">&#8226; <b>Any restrictions to use by non-academics: </b>None</p>
</sec>
<sec>
<st>
<p>Abbreviations</p>
</st>
<p>API: Application Programming Interface; BHL: Biodiversity Heritage Library; DOI: Digital Object Identifier; ISSN: International Standard Serial Number; JSON: JavaScript Object Notation; OCR: Optical Character Recognition; URL: Uniform Resource Locator.</p>
</sec>
<sec>
<st>
<p>Competing interests</p>
</st>
<p>The author declares that they have no competing interests.</p>
</sec>
</bdy><bm>
<ack>
<sec>
<st>
<p>Acknowledgements</p>
</st>
<p>The core data for BioStor comes from the Biodiversity Heritage Library <abbrgrp>
<abbr bid="B7">7</abbr>
</abbrgrp>. Chris Freeland, Phil Cryer, and Mike Lichtenberg provided data dumps from BHL, and answered queries regarding the BHL database schema. E. J. van Nieukerken kindly provided the EndNote database for <it>Tijdschrift voor Entomologie</it>. I thank the anonymous referees for their comments.</p>
</sec>
</ack>
<refgrp><bibl id="B1"><title><p>The giant bite of a new raptorial sperm whale from the Miocene epoch of Peru</p></title><aug><au><snm>Lambert</snm><fnm>O</fnm></au><au><snm>Bianucci</snm><fnm>G</fnm></au><au><snm>Post</snm><fnm>K</fnm></au><au><snm>de Muizon</snm><fnm>C</fnm></au><au><snm>Salas-Gismondi</snm><fnm>R</fnm></au><au><snm>Urbina</snm><fnm>M</fnm></au><au><snm>Reumer</snm><fnm>J</fnm></au></aug><source>Nature</source><pubdate>2010</pubdate><volume>466</volume><issue>7302</issue><fpage>105</fpage><lpage>108</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nature09067</pubid><pubid idtype="pmpid" link="fulltext">20596020</pubid></pubidlist></xrefbib></bibl><bibl id="B2"><aug><au><snm>Melville</snm><fnm>H</fnm></au></aug><source>Moby-Dick</source><publisher>Richard Bentley, London</publisher><pubdate>1851</pubdate></bibl><bibl id="B3"><aug><au><cnm>International Commission on Zoological Nomenclature</cnm></au></aug><source>International code of zoological nomenclature. International Trust for Zoological Nomenclature</source><edition>4</edition><pubdate>1999</pubdate></bibl><bibl id="B4"><aug><au><snm>Koch</snm><fnm>AC</fnm></au></aug><source>Description of the Missourium, or Missouri Leviathan: together with its supposed habits and Indian traditions concerning the location from whence it was exhumed; also, comparisons of the whale, crocodile and missourium with the leviathan, as described in 41st chapter of the book of Job</source><publisher>Prentice and Weissinger</publisher><edition>2</edition><pubdate>1841</pubdate><url>http://www.biodiversitylibrary.org/item/81522</url></bibl><bibl id="B5"><title><p>The giant bite of a new raptorial sperm whale from the Miocene epoch of Peru</p></title><aug><au><snm>Lambert</snm><fnm>O</fnm></au><au><snm>Bianucci</snm><fnm>G</fnm></au><au><snm>Post</snm><fnm>K</fnm></au><au><snm>de 
Muizon</snm><fnm>C</fnm></au><au><snm>Salas-Gismondi</snm><fnm>R</fnm></au><au><snm>Urbina</snm><fnm>M</fnm></au><au><snm>Reumer</snm><fnm>J</fnm></au></aug><source>Nature</source><pubdate>2010</pubdate><volume>466</volume><issue>7310</issue><fpage>1134</fpage><xrefbib><pubid idtype="doi">10.1038/nature09381</pubid></xrefbib></bibl><bibl id="B6"><title><p>The legacy of Linnaeus</p></title><aug><au><cnm>Anonymous</cnm></au></aug><source>Nature</source><pubdate>2007</pubdate><volume>446</volume><fpage>231</fpage><lpage>232</lpage><xrefbib><pubid idtype="pmpid" link="fulltext">17361138</pubid></xrefbib></bibl><bibl id="B7"><title><p>Biodiversity Heritage Library</p></title><url>http://biodiversitylibrary.org</url></bibl><bibl id="B8"><title><p>The Biodiversity Heritage Library: Advancing Metadata Practices in a Collaborative Digital Library</p></title><aug><au><snm>Pilsk</snm><fnm>S</fnm></au><au><snm>Person</snm><fnm>M</fnm></au><au><snm>Deveer</snm><fnm>J</fnm></au><au><snm>Furfey</snm><fnm>J</fnm></au><au><snm>Kalfatovic</snm><fnm>M</fnm></au></aug><source>Journal of Library Metadata</source><pubdate>2010</pubdate><volume>10</volume><issue>2</issue><fpage>136</fpage><lpage>155</lpage><xrefbib><pubid idtype="doi">10.1080/19386389.2010.506400</pubid></xrefbib></bibl><bibl id="B9"><title><p>Internet Archive</p></title><url>http://www.archive.org/</url></bibl><bibl id="B10"><title><p>PubMed</p></title><url>http://www.ncbi.nlm.nih.gov/pubmed/</url></bibl><bibl id="B11"><title><p>Google Scholar</p></title><url>http://scholar.google.com/</url></bibl><bibl id="B12"><title><p>Scholar-Friendly DOI Suffixes with JACC: Journal Article Citation Convention</p></title><aug><au><snm>Cameron</snm><fnm>RD</fnm></au></aug><source>Tech. Rep. 
CMPT TR 1998-08, School of Computing Science, Simon Fraser University</source><pubdate>1998</pubdate></bibl><bibl id="B13"><title><p>CrossRef OpenURL</p></title><url>http://www.crossref.org/openurl</url></bibl><bibl id="B14"><title><p>The Digital Object Identifier System</p></title><url>http://www.doi.org/</url></bibl><bibl id="B15"><title><p>Mendeley</p></title><url>http://www.mendeley.com/</url></bibl><bibl id="B16"><title><p>Publication and dating of the journals forming the <it>Annals and Magazine of Natural History </it>and the <it>Journal of Natural History</it></p></title><aug><au><snm>Evenhuis</snm><fnm>NL</fnm></au></aug><source>Zootaxa</source><pubdate>2003</pubdate><volume>385</volume><fpage>1</fpage><lpage>68</lpage></bibl><bibl id="B17"><title><p>The crane-flies collected by the Swedish expedition (1895-1896) to southern Chile and Tierra del Fuego (Tipulidae, Diptera)</p></title><aug><au><snm>Alexander</snm><fnm>CP</fnm></au></aug><source>Arkiv f&#246;r Zoologi</source><pubdate>1920</pubdate><volume>13</volume><issue>6</issue><fpage>1</fpage><lpage>32</lpage><url>http://biostor.org/reference/13820</url></bibl><bibl id="B18"><title><p>Neue und wenig bekannte Oligoch&#228;ten aus skandinavischen Sammlungen</p></title><aug><au><snm>Michaelsen</snm><fnm>W</fnm></au></aug><source>Arkiv f&#246;r Zoologi</source><pubdate>1921</pubdate><volume>13</volume><issue>19</issue><fpage>1</fpage><lpage>25</lpage><url>http://biostor.org/reference/14784</url></bibl><bibl id="B19"><title><p>The identities of the Colombian frogs confused with <it>Eleutherodactylus latidiscus </it>(Boulenger) (Amphibia: Anura: Leptodactylidae)</p></title><aug><au><snm>Lynch</snm><fnm>JD</fnm></au><au><snm>Ru&#237;z-Carranza</snm><fnm>PM</fnm></au><au><snm>Ardila-Robayo</snm><fnm>MC</fnm></au></aug><source>Occasional Papers of the Museum of Natural History University of 
Kansas</source><pubdate>1994</pubdate><volume>170</volume><fpage>1</fpage><lpage>42</lpage><url>http://biostor.org/reference/228</url></bibl><bibl id="B20"><title><p>Name Matters: Taxonomic Name Recognition (TNR) in Biodiversity Heritage Library (BHL)</p></title><aug><au><snm>Wei</snm><fnm>Q</fnm></au><au><snm>Heidorn</snm><fnm>PB</fnm></au><au><snm>Freeland</snm><fnm>C</fnm></au></aug><source>iConference 2010 Proceedings</source><pubdate>2010</pubdate><fpage>284</fpage><lpage>288</lpage><url>http://hdl.handle.net/2142/14919</url></bibl><bibl id="B21"><title><p>Encyclopedia of Life</p></title><url>http://www.eol.org/</url></bibl><bibl id="B22"><title><p>The Scientific Name of the Sperm Whale</p></title><aug><au><snm>Holthuis</snm><fnm>LB</fnm></au></aug><source>Marine Mammal Science</source><pubdate>1987</pubdate><volume>3</volume><fpage>87</fpage><lpage>89</lpage><xrefbib><pubid idtype="doi">10.1111/j.1748-7692.1987.tb00154.x</pubid></xrefbib></bibl><bibl id="B23"><title><p>Mr. Schevill replies</p></title><aug><au><snm>Schevill</snm><fnm>WE</fnm></au></aug><source>Marine Mammal Science</source><pubdate>1987</pubdate><volume>3</volume><fpage>89</fpage><lpage>90</lpage><xrefbib><pubid idtype="doi">10.1111/j.1748-7692.1987.tb00155.x</pubid></xrefbib></bibl><bibl id="B24"><title><p>The International Code of Zoological Nomenclature and a paradigm: the name <it>Physeter catodon </it>Linnaeus 1758</p></title><aug><au><snm>Schevill</snm><fnm>WE</fnm></au></aug><source>Marine Mammal Science</source><pubdate>1986</pubdate><volume>2</volume><issue>2</issue><fpage>153</fpage><lpage>157</lpage><xrefbib><pubid idtype="doi">10.1111/j.1748-7692.1986.tb00036.x</pubid></xrefbib></bibl><bibl id="B25"><title><p>Wikipedia as an encyclopaedia of life</p></title><aug><au><snm>Page</snm><fnm>RDM</fnm></au></aug><source>Organisms Diversity and Evolution</source><pubdate>2010</pubdate><volume>10</volume><issue>4</issue><fpage>343</fpage><lpage>349</lpage><xrefbib><pubid 
idtype="doi">10.1007/s13127-010-0028-9</pubid></xrefbib></bibl><bibl id="B26"><title><p>Open Linking in the Scholarly Information Environment Using the OpenURL Framework</p></title><aug><au><snm>de Sompel</snm><fnm>HV</fnm></au><au><snm>Beit-Arie</snm><fnm>O</fnm></au></aug><source>D-Lib Magazine</source><pubdate>2001</pubdate><volume>7</volume><issue>3</issue><xrefbib><pubid idtype="doi">10.1045/march2001-vandesompel</pubid></xrefbib></bibl><bibl id="B27"><title><p>bioGUID: resolving, discovering, and minting identifiers for biodiversity informatics</p></title><aug><au><snm>Page</snm><fnm>RDM</fnm></au></aug><source>BMC Bioinformatics</source><pubdate>2009</pubdate><volume>10</volume><issue>Suppl 14</issue><fpage>S5</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-10-S14-S5</pubid><pubid idtype="pmcid">2788356</pubid><pubid idtype="pmpid" link="fulltext">19958515</pubid></pubidlist></xrefbib></bibl><bibl id="B28"><title><p>bioGUID</p></title><url>http://bioguid.info/</url></bibl><bibl id="B29"><title><p>ISSN International Centre</p></title><url>http://www.issn.org</url></bibl><bibl id="B30"><title><p>Identification of common molecular subsequences</p></title><aug><au><snm>Smith</snm><fnm>TF</fnm></au><au><snm>Waterman</snm><fnm>MS</fnm></au></aug><source>Journal of Molecular Biology</source><pubdate>1981</pubdate><volume>147</volume><fpage>195</fpage><lpage>197</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/0022-2836(81)90087-5</pubid><pubid idtype="pmpid" link="fulltext">7265238</pubid></pubidlist></xrefbib></bibl><bibl id="B31"><title><p>Preliminary notice of the Schizopoda collected by H. M.S. 
Discovery in the Antarctic region</p></title><aug><au><snm>Holt</snm><fnm>EWL</fnm></au><au><snm>Tattersall</snm><fnm>WM</fnm></au></aug><source>Ann Mag Nat Hist</source><pubdate>1906</pubdate><volume>17</volume><fpage>1</fpage><lpage>11</lpage><url>http://biostor.org/reference/50163</url></bibl><bibl id="B32"><title><p>reCAPTCHA</p></title><url>http://www.google.com/recaptcha</url></bibl><bibl id="B33"><title><p>reCAPTCHA: Human-Based Character Recognition via Web Security Measures</p></title><aug><au><snm>von Ahn</snm><fnm>L</fnm></au><au><snm>Maurer</snm><fnm>B</fnm></au><au><snm>McMillen</snm><fnm>C</fnm></au><au><snm>Abraham</snm><fnm>D</fnm></au><au><snm>Blum</snm><fnm>M</fnm></au></aug><source>Science</source><pubdate>2008</pubdate><volume>321</volume><issue>5895</issue><fpage>1465</fpage><lpage>1468</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1126/science.1160379</pubid><pubid idtype="pmpid" link="fulltext">18703711</pubid></pubidlist></xrefbib></bibl><bibl id="B34"><title><p>EndNote</p></title><url>http://www.endnote.com/</url></bibl><bibl id="B35"><title><p>Zotero</p></title><url>http://www.zotero.org/</url></bibl><bibl id="B36"><title><p>OpenURL ContextObject in SPAN (COinS)</p></title><url>http://ocoins.info/</url></bibl><bibl id="B37"><title><p>OpenURL Referrer</p></title><url>https://addons.mozilla.org/en-US/firefox/addon/4150</url></bibl><bibl id="B38"><title><p>Tijdschrift voor Entomologie 150 volumes: one and a half century of Systematic Entomology in a changing world</p></title><aug><au><snm>van Nieukerken</snm><fnm>EJ</fnm></au></aug><source>Tijdschrift voor Entomologie</source><pubdate>2007</pubdate><volume>1</volume><issue>2</issue><fpage>245</fpage><lpage>261</lpage><url>http://www.repository.naturalis.nl/document/93299</url></bibl><bibl id="B39"><title><p>A revision of the dwarf <it>Zonosaurus </it>Boulenger (Reptilia: Squamata: Cordylidae) from Madagascar, including descriptions of three new 
species</p></title><aug><au><snm>Raselimanana</snm><fnm>AP</fnm></au><au><snm>Raxworthy</snm><fnm>CJ</fnm></au><au><snm>Nussbaum</snm><fnm>RA</fnm></au></aug><source>Scientific Papers Natural History Museum University of Kansas</source><pubdate>2000</pubdate><volume>18</volume><fpage>1</fpage><lpage>16</lpage><url>http://biostor.org/reference/50335</url></bibl><bibl id="B40"><title><p>lazierLoad - Javascript Image Lazy Loader for Prototype</p></title><url>http://www.bram.us/projects/js_bramus/lazierload/</url></bibl><bibl id="B41"><title><p>Dublin Core Metadata Initiative</p></title><url>http://dublincore.org/</url></bibl><bibl id="B42"><title><p>Adobe XMP</p></title><url>http://www.adobe.com/products/xmp/index.html</url></bibl><bibl id="B43"><title><p>Mendeley - A Last.fm For Research?</p></title><aug><au><snm>Henning</snm><fnm>V</fnm></au><au><snm>Reichelt</snm><fnm>J</fnm></au></aug><source>eScience &apos;08. IEEE Fourth International Conference on eScience, 2008</source><pubdate>2008</pubdate><fpage>327</fpage><lpage>328</lpage><xrefbib><pubid idtype="pmpid" link="fulltext">21565935</pubid></xrefbib></bibl><bibl id="B44"><title><p>Papers</p></title><url>http://mekentosj.com/papers/</url></bibl><bibl id="B45"><title><p>The Species 2000 and ITIS Catalogue of Life</p></title><url>http://www.catalogueoflife.org</url></bibl><bibl id="B46"><title><p>On identifying name equivalences in digital libraries</p></title><aug><au><snm>Feitelson</snm><fnm>DG</fnm></au></aug><source>Information Research</source><pubdate>2004</pubdate><volume>9</volume><url>http://informationr.net/ir/9-4/paper192.html</url></bibl><bibl id="B47"><title><p>Exhibit: Publishing Framework for Data-Rich Interactive Web Pages</p></title><url>http://www.simile-widgets.org/exhibit/</url></bibl><bibl id="B48"><title><p>WorldCat.org: The World's Largest Library Catalog</p></title><url>http://www.worldcat.org/</url></bibl><bibl id="B49"><title><p>Universal Biological Indexer and Organizer 
(uBio)</p></title><url>http://www.ubio.org/</url></bibl><bibl id="B50"><title><p>A metadata generation system for scanned scientific volumes</p></title><aug><au><snm>Lu</snm><fnm>X</fnm></au><au><snm>Kahle</snm><fnm>B</fnm></au><au><snm>Wang</snm><fnm>JZ</fnm></au><au><snm>Giles</snm><fnm>CL</fnm></au></aug><source>Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries</source><pubdate>2008</pubdate><fpage>167</fpage><lpage>179</lpage><xrefbib><pubid idtype="doi">10.1145/1378889.1378918</pubid></xrefbib></bibl><bibl id="B51"><title><p>Digital libraries and autonomous citation indexing</p></title><aug><au><snm>Lawrence</snm><fnm>S</fnm></au><au><snm>Giles</snm><fnm>CL</fnm></au><au><snm>Bollacker</snm><fnm>K</fnm></au></aug><source>IEEE COMPUTER</source><pubdate>1999</pubdate><volume>32</volume><issue>6</issue><fpage>67</fpage><lpage>71</lpage><xrefbib><pubid idtype="doi">10.1109/2.769447</pubid></xrefbib></bibl><bibl id="B52"><title><p>Learning metadata from the evidence in an on-line citation matching scheme</p></title><aug><au><snm>Councill</snm><fnm>IG</fnm></au><au><snm>Li</snm><fnm>H</fnm></au><au><snm>Zhuang</snm><fnm>Z</fnm></au><au><snm>Debnath</snm><fnm>S</fnm></au><au><snm>Bolelli</snm><fnm>L</fnm></au><au><snm>Lee</snm><fnm>WC</fnm></au><au><snm>Sivasubramaniam</snm><fnm>A</fnm></au><au><snm>Giles</snm><fnm>CL</fnm></au></aug><source>JCDL &apos;06: Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries</source><publisher>New York, NY, USA: ACM</publisher><pubdate>2006</pubdate><fpage>276</fpage><lpage>285</lpage><xrefbib><pubid idtype="doi">10.1145/1141753.1141817</pubid></xrefbib></bibl><bibl id="B53"><title><p>On the Arachnida taken in the Transvaal and in Nyasaland by Mr W. L. 
Distant and Dr Percy Rendall</p></title><aug><au><snm>Pocock</snm><fnm>RI</fnm></au></aug><source>Ann Mag nat Hist</source><pubdate>1898</pubdate><volume>1</volume><fpage>308</fpage><lpage>321</lpage><url>http://biostor.org/reference/52084</url></bibl></refgrp>
</bm></art>
