<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-6-208</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Software</dochead>
      <bibl>
         <title>
            <p>An edit script for taxonomic classifications</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Page</snm>
               <mi>DM</mi>
               <fnm>Roderic</fnm>
               <insr iid="I1"/>
               <email>r.page@bio.gla.ac.uk</email>
            </au>
            <au id="A2">
               <snm>Valiente</snm>
               <fnm>Gabriel</fnm>
               <insr iid="I2"/>
               <email>valiente@lsi.upc.edu</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>DEEB, IBLS, University of Glasgow, Glasgow G12 8QQ, UK</p>
            </ins>
            <ins id="I2">
               <p>Department of Software, Technical University of Catalonia, E-08034 Barcelona, Spain</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2005</pubdate>
         <volume>6</volume>
         <issue>1</issue>
         <fpage>208</fpage>
         <url>http://www.biomedcentral.com/1471-2105/6/208</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">16122379</pubid>
               <pubid idtype="doi">10.1186/1471-2105-6-208</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>21</day>
               <month>6</month>
               <year>2005</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>25</day>
               <month>8</month>
               <year>2005</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>25</day>
               <month>8</month>
               <year>2005</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2005</year>
         <collab>Page and Valiente; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>The NCBI taxonomy provides one of the most powerful ways to navigate sequence databases, but currently users are forced to formulate queries according to a single taxonomic classification. Given that there is no universal agreement on the classification of organisms, providing a single classification places constraints on the questions biologists can ask. However, maintaining multiple classifications is burdensome in the face of a constantly growing NCBI classification.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>In this paper, we present a solution to the problem of generating modifications of the NCBI taxonomy, based on the computation of an edit script that summarises the differences between two classification trees. Our algorithms find the shortest possible edit script based on the identification of all shared subtrees, and only take time quasi linear in the size of the trees because classification trees have unique node labels.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>These algorithms have been recently implemented, and the software is freely available for download from <url>http://darwin.zoology.gla.ac.uk/~rpage/forest/</url>.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>The NCBI Taxonomy <abbrgrp><abbr bid="B1">1</abbr></abbrgrp> provides one of the most powerful ways to navigate the National Center for Biotechnology Information (NCBI) sequence databases. Every sequence in GenBank is associated with a taxon (which, however, may be unidentified), and each taxon has a unique place in the NCBI taxonomy. Hence, the user can retrieve sequences not only for a given species (such as <it>Homo sapiens</it>), but also for a group of species, such as mammals (Mammalia) or animals (Animalia).</p>
         <p>The NCBI provides a single classification, assembled from a variety of sources including published literature, a panel of expert advisors, and the taxonomy provided by users when they submit new sequences. Given that there is not universal agreement on the classification of organisms, providing a single classification places constraints on the questions biologists can ask.</p>
         <p>To give a concrete example, Figure <figr fid="F1">1</figr> shows a simplified classification of animals, based on the current NCBI taxonomy. In this classification, the Bilateria are split into three groups (Acoelomata, Pseudocoelomata, and Coelomata) based on the nature of the internal body cavity (coelom). The Coelomata are themselves split into two groups, the Protostomia and the Deuterostomia, characterised by the fate of the blastopore during development (in the Protostomia this becomes the mouth, in the Deuterostomia it becomes the anus).</p>
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>Traditional view of animals</p>
            </caption>
            <text>
               <p><b>Traditional view of animals</b>. A "traditional" view of animal relationships, based on the NCBI classification.</p>
            </text>
            <graphic file="1471-2105-6-208-1"/>
         </fig>
         <p>An alternative view of animal classification is shown in Figure <figr fid="F2">2</figr>. The three-fold division based on body cavity disappears, leaving the fundamental split between the Protostomia and the Deuterostomia. The Protostomia are divided into the Lophotrochozoa and the Ecdysozoa, the latter comprising arthropods, nematodes, and other moulting animals <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>. This classification has implications for comparative genomics. The best known animal genomes are <it>Homo sapiens </it>(human), <it>Drosophila melanogaster </it>(fly), and <it>Caenorhabditis elegans </it>(nematode). Under the classical classification (Fig. <figr fid="F1">1</figr>), the coelomates human and <it>Drosophila </it>are more closely related to each other than either is to the pseudocoelomate <it>C. elegans</it>, suggesting it would be most productive to compare our genome with that of <it>Drosophila</it>, rather than the more distant nematode. However, in the alternative classification (Fig. <figr fid="F2">2</figr>) <it>Drosophila </it>and <it>C. elegans </it>are more closely related to each other than either is to humans, and we have no (phylogenetic) reason for choosing one over the other as a point of reference for interpreting the human genome. There is considerable debate about the merits of the two classifications <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr></abbrgrp>. However, because the NCBI provides only one classification, users cannot, for example, easily query GenBank for all ecdysozoan sequences &#8211; this taxon simply does not exist in the NCBI database. Instead, users are forced to construct Boolean queries such as (Arthropoda OR Nematoda). While in this simplified example this is not a great hardship, as the trees get larger and the differences more profound, it becomes harder to pose a query that captures the taxa required.</p>
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>An alternative view of animals</p>
            </caption>
            <text>
               <p><b>An alternative view of animals</b>. An alternative tree of animals reflecting the "new animal classification".</p>
            </text>
            <graphic file="1471-2105-6-208-2"/>
         </fig>
         <p>One solution is simply to download the NCBI taxonomy, edit it to reflect the desired alternative classification, then use that to obtain sequences from taxa such as Ecdysozoa. It is reasonably straightforward to store a tree in a relational database and query it using SQL <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. However, the NCBI taxonomy is continually growing as new organisms are sequenced. Hence, a locally edited classification will quickly become obsolete, and having to download a fresh copy and manually edit it each time would be tedious.</p>
      </sec>
      <sec>
         <st>
            <p>Implementation</p>
         </st>
         <sec>
            <st>
               <p>Taxonomic classifications</p>
            </st>
            <p>Although ideally classifications mirror phylogenetic relationships, it is important to distinguish between classifications and phylogenies. A taxonomic classification can be modelled as a rooted, labelled, unordered tree. Unlike those of a classification, the internal nodes of a phylogenetic tree need not be labelled, although they may be decorated with measures of support (such as bootstrap values or Bayesian posterior probabilities).</p>
         </sec>
         <sec>
            <st>
               <p>Subtree isomorphism</p>
            </st>
            <p>Our approach is to first find subtree isomorphisms between the two trees, <it>T</it><sub>1 </sub>and <it>T</it><sub>2</sub>. A subtree is a connected subgraph of a tree. We distinguish between <it>top-down </it>and <it>bottom-up </it>subtree isomorphism. In a top-down matching, the parent of each node in the matching is itself in the matching (excluding the root, which has no parent). In a bottom-up matching, all the children of a node in the matching are also in the matching (Fig. <figr fid="F3">3</figr>).</p>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Connected subgraph and top-down and bottom-up subtrees</p>
               </caption>
               <text>
                  <p><b>Connected subgraph and top-down and bottom-up subtrees</b>. In the top-down subtree the parent of any node in the subtree is itself in the subtree. In the bottom-up matching, the children of any node in the matching are also in the matching. Modified from [10].</p>
               </text>
               <graphic file="1471-2105-6-208-3"/>
            </fig>
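            <p>The two closure conditions can be checked directly. The following is a minimal Python sketch, not the authors' C++ code; it assumes a tree stored as a child-to-parent map, and the function names is_top_down and is_bottom_up are hypothetical.</p>

```python
def is_top_down(parent: dict, nodes: set) -> bool:
    """True if the parent of every selected non-root node is also selected."""
    return all(parent[v] is None or parent[v] in nodes for v in nodes)


def is_bottom_up(parent: dict, nodes: set) -> bool:
    """True if every child of a selected node is also selected."""
    return all(v in nodes for v, p in parent.items() if p in nodes)


# Hypothetical fragment of a classification, stored as child -> parent.
parent = {
    "Bilateria": None,
    "Protostomia": "Bilateria",
    "Deuterostomia": "Bilateria",
    "Ecdysozoa": "Protostomia",
    "Lophotrochozoa": "Protostomia",
    "Arthropoda": "Ecdysozoa",
    "Nematoda": "Ecdysozoa",
}

top = {"Bilateria", "Protostomia"}                 # closed under "parent of"
bottom = {"Ecdysozoa", "Arthropoda", "Nematoda"}   # closed under "children of"
```

            <p>In this fragment the set top satisfies only the top-down condition and the set bottom satisfies only the bottom-up condition, which illustrates why the two kinds of subtree are found separately.</p>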
            <p>The algorithm first finds all subtrees, including bottom-up and top-down subtrees, that are common to <it>T</it><sub>1 </sub>and <it>T</it><sub>2</sub>. We find all kinds of subtree because, by themselves, the subtrees found by each method can be small (Fig. <figr fid="F4">4</figr>).</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Subtree isomorphisms</p>
               </caption>
               <text>
                  <p><b>Subtree isomorphisms</b>. The top-down and bottom-up subtree isomorphisms between the animal classifications shown in Figs. 1 and 2 (ignoring the trivial bottom-up subtrees that comprise a single leaf).</p>
               </text>
               <graphic file="1471-2105-6-208-4"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Script</p>
            </st>
            <p>Having identified common subtrees, we then list the operations needed to transform <it>T</it><sub>1 </sub>into <it>T</it><sub>2</sub>. The first step is to delete nodes in <it>T</it><sub>1 </sub>that are not in any of the shared subtrees. The deletion of a node entails deletion of all the edges incident with the deleted node. We then add nodes found only in <it>T</it><sub>2</sub>, and the corresponding edges. The size of the script depends on the size of the shared subtrees, hence it is desirable to find the largest such subtrees.</p>
         </sec>
         <sec>
            <st>
               <p>Complexity</p>
            </st>
            <p>In general, computation of the least number of operations needed to transform <it>T</it><sub>1 </sub>into <it>T</it><sub>2 </sub>is an NP-hard problem <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>, even for binary trees with a label alphabet of size two, as long as node and edge deletions, insertions, and label substitutions are allowed. However, in the case of trees with unique node labels, node label substitutions are forbidden because they may generate trees with non-unique node labels <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>, and the least number of operations or edit distance becomes a function of the size of shared subtrees <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. By identifying the largest common subtrees, we obtain the shortest possible edit script.</p>
         </sec>
         <sec>
            <st>
               <p>Computing an edit script</p>
            </st>
            <p>Taxonomic classifications are modelled as trees with unique node labels, and this fact makes it easier to deal with trees in terms of their sets of node labels and node label pairs, as done for graphs with unique node labels in <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>.</p>
            <p><b>Definition 1 </b><it>Let T </it>= (<it>V,E</it>) <it>be a tree</it>. <it>The label representation of T, denoted by R(T), is given by R</it>(<it>T</it>) = (<it>L,C</it>), <it>where L </it>= {&#8467;(<it>v</it>) | <it>v </it>&#8712; <it>V</it>} <it>and C </it>= {(&#8467;(<it>v</it>),&#8467;(<it>w</it>)) | (<it>v,w</it>) &#8712; <it>E</it>}.</p>
            <p>Thus, the label representation <it>R</it>(<it>T</it>) of a tree <it>T </it>defines the equivalence class of all those trees that are isomorphic to <it>T</it>. The use of label representations simplifies the notation, because isomorphic trees have exactly the same label representation.</p>
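            <p>As an illustration of Definition 1, the following Python sketch computes a label representation. It assumes a tree stored as a map from each node to its list of children, with a separate node-to-label map; this layout and the function name label_representation are assumptions made for the example only.</p>

```python
def label_representation(children: dict, label: dict) -> tuple:
    """Return R(T) = (L, C): the set of node labels and the set of label pairs."""
    labels = {label[v] for v in children}
    pairs = {(label[v], label[w]) for v, kids in children.items() for w in kids}
    return labels, pairs


# Two trees that use different internal node identifiers but the same labels.
t1 = {0: [1, 2], 1: [], 2: []}
l1 = {0: "Bilateria", 1: "Protostomia", 2: "Deuterostomia"}

t2 = {"r": ["y", "x"], "x": [], "y": []}
l2 = {"r": "Bilateria", "x": "Protostomia", "y": "Deuterostomia"}

same = label_representation(t1, l1) == label_representation(t2, l2)
```

            <p>Because node labels are unique, the two isomorphic trees yield identical pairs of sets, so comparing trees reduces to comparing sets.</p>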
            <p>The edit operations of node and edge deletion and insertion allow one to transform any given tree into any other tree. Label substitutions are forbidden because they may generate trees with non-unique node labels <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>.</p>
            <p><b>Definition 2 </b><it>Let T</it><sub>1 </sub>= (<it>V</it><sub>1</sub>,<it>E</it><sub>1</sub>) <it>and T</it><sub>2 </sub>= (<it>V</it><sub>2</sub>,<it>E</it><sub>2</sub>) <it>be trees, let R</it>(<it>T</it><sub>1</sub>) = (<it>L</it><sub>1</sub>, <it>C</it><sub>1</sub>), <it>and let R</it>(<it>T</it><sub>2</sub>) = (<it>L</it><sub>2</sub>,<it>C</it><sub>2</sub>). <it>Let also C </it>= <it>L</it><sub>1 </sub>&#8746; <it>L</it><sub>2 </sub>&#8746; {&#955;}.</p>
            <p><it>A node edit operation between T</it><sub>1 </sub><it>and T</it><sub>2 </sub><it>is a pair </it>(<it>a</it>, <it>b</it>) &#8712; <it>C </it>&#215; <it>C </it>with <it>a </it>&#8800; &#955; <it>or b </it>&#8800; &#955;. <it>A node edit operation of the form </it>(<it>a</it>, &#955;) <it>establishes deletion of the node v </it>&#8712; <it>V</it><sub>1 </sub><it>with </it>&#8467;(<it>v</it>) = <it>a together with the edge </it>(<it>parent</it>(<it>v</it>), <it>v</it>), <it>if v is not the root of T</it><sub>1</sub>, <it>and deletion of edge </it>(<it>v</it>,<it>x</it>) <it>for each child x of v in T</it><sub>1</sub>. <it>A node edit operation of the form </it>(&#955;,<it>b</it>) <it>establishes insertion of the node w </it>&#8712; <it>V</it><sub>2 </sub><it>with </it>&#8467;(<it>w</it>) = <it>b</it>.</p>
            <p><it>An edge edit operation between T</it><sub>1 </sub><it>and T</it><sub>2 </sub><it>is a triple </it>(<it>a</it>, <it>b</it>, <it>c</it>) &#8712; <it>C </it>&#215; <it>C </it>&#215; <it>C with b </it>&#8800; &#955; <it>and a </it>&#8800; &#955; <it>or c </it>&#8800; &#955;. <it>An edge edit operation of the form </it>(<it>a</it>, <it>b</it>, &#955;) <it>establishes deletion of the edge </it>(<it>v</it>, <it>x</it>) &#8712; <it>E</it><sub>1 </sub><it>with </it>&#8467;(<it>v</it>) = <it>a and </it>&#8467;(<it>x</it>) = <it>b</it>, <it>and an edge edit operation of the form </it>(&#955;, <it>b</it>, <it>c</it>) <it>establishes insertion of the edge </it>(<it>w</it>,<it>y</it>) &#8712; <it>E</it><sub>2 </sub><it>with </it>&#8467;(<it>w</it>) = <it>b and </it>&#8467;(<it>y</it>) = <it>c</it>.</p>
            <p>
               <it>An edit operation is either a node edit operation or an edge edit operation.</it>
            </p>
            <p>An edit script between two trees is just a set of edit operations that, if applied in the right order (essentially, inserting an edge only after having inserted the nodes incident with the inserted edge), allow one to transform one tree into the other.</p>
            <p><b>Definition 3 </b><it>An edit script between two trees T</it><sub>1 </sub>= (<it>V</it><sub>1</sub>,<it>E</it><sub>1</sub>) <it>and T</it><sub>2 </sub>= (<it>V</it><sub>2</sub>,<it>E</it><sub>2</sub>) <it>is a set of edit operations that transform R</it>(<it>T</it><sub>1</sub>) <it>into R</it>(<it>T</it><sub>2</sub>).</p>
            <p>Given <it>R</it>(<it>T</it><sub>1</sub>) = (<it>L</it><sub>1</sub>, <it>C</it><sub>1</sub>) and <it>R</it>(<it>T</it><sub>2</sub>) = (<it>L</it><sub>2</sub>, <it>C</it><sub>2</sub>), an edit script between <it>T</it><sub>1 </sub>and <it>T</it><sub>2 </sub>can be easily obtained by sorting the label sets and computing set differences, as follows:</p>
            <p>&#8226; Delete all nodes with labels in <it>L</it><sub>1 </sub>\ <it>L</it><sub>2</sub></p>
            <p>&#8226; Insert all nodes with labels in <it>L</it><sub>2 </sub>\ <it>L</it><sub>1</sub></p>
            <p>&#8226; Delete all edges with labels in <it>C</it><sub>1 </sub>\ <it>C</it><sub>2</sub></p>
            <p>&#8226; Insert all edges with labels in <it>C</it><sub>2 </sub>\ <it>C</it><sub>1</sub></p>
            <p>However, such a procedure does not, in general, lead to the shortest possible edit script, because some of the edge deletion operations may be redundant, given that deletion of a node entails deletion of all the edges incident with the deleted node. While any edit script would suffice to transform one tree into the other, the shortest edit script leads to a faster computation of the edited tree, given the script and the original tree.</p>
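            <p>The set-difference procedure, and the redundancy it can produce, might be sketched in Python as follows. The function name naive_edit_script and the tuple encoding of operations are assumptions for illustration; the three-node trees are a made-up example.</p>

```python
def naive_edit_script(r1: tuple, r2: tuple) -> list:
    """Build an edit script from plain set differences of label representations."""
    (l1, c1), (l2, c2) = r1, r2
    script = [("delete node", a) for a in sorted(l1 - l2)]
    script += [("insert node", b) for b in sorted(l2 - l1)]
    script += [("delete edge", e) for e in sorted(c1 - c2)]
    script += [("insert edge", e) for e in sorted(c2 - c1)]
    return script


# T1 is the chain A -> B -> C; T2 drops B and attaches C directly to A.
r1 = ({"A", "B", "C"}, {("A", "B"), ("B", "C")})
r2 = ({"A", "C"}, {("A", "C")})

script = naive_edit_script(r1, r2)
# Edge deletions that touch a deleted node are already implied by the node deletion.
deleted = {arg for op, arg in script if op == "delete node"}
redundant = [
    (op, arg) for op, arg in script
    if op == "delete edge" and (arg[0] in deleted or arg[1] in deleted)
]
```

            <p>Here the script contains four operations, two of which are edge deletions already implied by deleting node B.</p>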
            <p>The following, alternative procedure is based on the set of common node labels between the two trees, which can be easily obtained as the intersection of the sets of node labels in the label representation of the trees, that is, <it>C </it>= <it>L</it><sub>1 </sub>&#8745; <it>L</it><sub>2 </sub>= {&#8467;(<it>v</it>) | <it>v </it>&#8712; <it>V</it><sub>1</sub>} &#8745; {&#8467;(<it>w</it>) | <it>w </it>&#8712; <it>V</it><sub>2</sub>}. The procedure can be sketched as follows:</p>
            <p>&#8226; Delete all nodes <it>v </it>&#8712; <it>V</it><sub>1 </sub>with &#8467;(<it>v</it>) &#8713; <it>C</it>.</p>
            <p>&#8226; Insert all nodes <it>w </it>&#8712; <it>V</it><sub>2 </sub>with &#8467;(<it>w</it>) &#8713; <it>C</it>.</p>
            <p>&#8226; Delete all edges (<it>v</it>,<it>x</it>) &#8712; <it>E</it><sub>1 </sub>with &#8467;(<it>v</it>), &#8467;(<it>x</it>) &#8712; <it>C </it>and such that the node <it>w </it>&#8712; <it>V</it><sub>2 </sub>with &#8467;(<it>v</it>) = &#8467;(<it>w</it>) is not the parent in <it>T</it><sub>2 </sub>of the node <it>y </it>&#8712; <it>V</it><sub>2 </sub>with &#8467;(<it>x</it>) = &#8467;(<it>y</it>).</p>
            <p>&#8226; Insert all edges (<it>w</it>,<it>y</it>) &#8712; <it>E</it><sub>2 </sub>with &#8467;(<it>w</it>), &#8467;(<it>y</it>) &#8712; <it>C </it>and such that the node <it>v </it>&#8712; <it>V</it><sub>1 </sub>with &#8467;(<it>v</it>) = &#8467;(<it>w</it>) is not the parent in <it>T</it><sub>1 </sub>of the node <it>x </it>&#8712; <it>V</it><sub>1 </sub>with &#8467;(<it>x</it>) = &#8467;(<it>y</it>).</p>
            <p>&#8226; Insert all edges (<it>w</it>, <it>y</it>) &#8712; <it>E</it><sub>2 </sub>such that &#8467;(<it>w</it>) &#8713; <it>C </it>or &#8467;(<it>y</it>) &#8713; <it>C</it>.</p>
            <p>A detailed description of the algorithm is given in Fig. <figr fid="F5">5</figr>. Correctness of the edit script algorithm is easy to establish.</p>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Algorithm for computing edit script</p>
               </caption>
               <text>
                  <p><b>Algorithm for computing edit script</b>. Let <it>C </it>be a set of common node labels of <it>T</it><sub>1 </sub>and <it>T</it><sub>2</sub>. A function call of the form <it>edit script </it>(<it>T</it><sub>1</sub>, <it>T</it><sub>2</sub>, <it>C</it>) returns a set <it>E </it>of elementary edit operations that transform <it>T</it><sub>1 </sub>into <it>T</it><sub>2</sub>.</p>
               </text>
               <graphic file="1471-2105-6-208-5"/>
            </fig>
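            <p>The five steps listed above can be sketched in Python as follows. This is a simplified illustration rather than the algorithm of Fig. 5 itself: it takes the set of common labels to be the full intersection of the two label sets, works on label representations, and uses the hypothetical function name edit_script_sketch.</p>

```python
def edit_script_sketch(r1: tuple, r2: tuple) -> list:
    """Edit script that omits edge deletions implied by node deletions."""
    (l1, c1), (l2, c2) = r1, r2
    common = l1.intersection(l2)

    def inside(edge: tuple) -> bool:
        # True if both endpoints of the edge carry common labels.
        return edge[0] in common and edge[1] in common

    script = [("delete node", a) for a in sorted(l1 - common)]
    script += [("insert node", b) for b in sorted(l2 - common)]
    # Edges between common nodes that exist only in T1 are deleted.
    script += [("delete edge", e) for e in sorted(c1 - c2) if inside(e)]
    # Edges between common nodes that exist only in T2 are inserted.
    script += [("insert edge", e) for e in sorted(c2 - c1) if inside(e)]
    # Edges of T2 touching a newly inserted node are inserted.
    script += [("insert edge", e) for e in sorted(c2) if not inside(e)]
    return script


# T1 is the chain A -> B -> C; T2 drops B and attaches C directly to A.
r1 = ({"A", "B", "C"}, {("A", "B"), ("B", "C")})
r2 = ({"A", "C"}, {("A", "C")})
script = edit_script_sketch(r1, r2)
```

            <p>For this made-up pair of trees the script has two operations (delete node B, insert edge A -> C), whereas plain set differences would also list the two edges incident with B.</p>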
            <p><b>Theorem 1 </b><it>Let T</it><sub>1 </sub><it>and T</it><sub>2 </sub><it>be trees, let C </it>&#8838; &#8467;(<it>V</it><sub>1</sub>) &#8745; &#8467;(<it>V</it><sub>2</sub>), <it>let E </it>= <it>edit script </it>(<it>T</it><sub>1</sub>, <it>T</it><sub>2</sub>, <it>C</it>), <it>and let </it><graphic file="1471-2105-6-208-i1.gif"/><it>be the result of applying the set of edit operations in E to T</it><sub>1</sub>. <it>Then</it>, <graphic file="1471-2105-6-208-i1.gif"/><it>and T</it><sub>2 </sub><it>are isomorphic</it>.</p>
            <p><b>Proof </b>It has to be shown that <graphic file="1471-2105-6-208-i2.gif"/>. Let <it>R</it>(<it>T</it><sub>1</sub>) = (<it>L</it><sub>1</sub>, <it>C</it><sub>1</sub>) and <it>R</it>(<it>T</it><sub>2</sub>) = (<it>L</it><sub>2</sub>, <it>C</it><sub>2</sub>). The edit script establishes the deletion of all nodes with labels in <it>L</it><sub>1</sub>\ <it>C </it>and the insertion of all nodes with labels in <it>L</it><sub>2 </sub>\ <it>C</it>. Thus, <graphic file="1471-2105-6-208-i3.gif"/> = <it>L</it><sub>1 </sub>\ (<it>L</it><sub>1 </sub>\ <it>C</it>) &#8746; (<it>L</it><sub>2 </sub>\ <it>C</it>) = <it>C </it>&#8746; (<it>L</it><sub>2 </sub>\ <it>C</it>) = <it>L</it><sub>2</sub>.</p>
            <p>The edit script also establishes the deletion of all edges with source and target labels in (<it>C</it><sub>1 </sub>&#8745; <it>C </it>&#215; <it>C</it>) \ <it>C</it><sub>2</sub>, the insertion of all edges with source and target labels in (<it>C</it><sub>2 </sub>&#8745; <it>C </it>&#215; <it>C</it>) \ <it>C</it><sub>1</sub>, and the insertion of all edges with source or target label in <it>L</it><sub>2 </sub>\ <it>C</it>, that is, of all edges in <it>C</it><sub>2 </sub>\ (<it>C</it><sub>2 </sub>&#8745; <it>C </it>&#215; <it>C</it>). Furthermore, the deletion of all nodes with labels in <it>L</it><sub>1 </sub>\ <it>C </it>entails the deletion of all edges with source or target label in <it>L</it><sub>1 </sub>\ <it>C</it>, that is, of all edges in <it>C</it><sub>1 </sub>\ (<it>C</it><sub>1 </sub>&#8745; <it>C </it>&#215; <it>C</it>). (See Fig. <figr fid="F6">6</figr>.)</p>
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>Illustration for the proof of Theorem 1</p>
               </caption>
               <text>
                  <p><b>Illustration for the proof of Theorem 1</b>. Given the label representations <it>R</it>(<it>T</it><sub>1</sub>) = (<it>L</it><sub>1</sub>, <it>C</it><sub>1</sub>) and <it>R</it>(<it>T</it><sub>2</sub>) = (<it>L</it><sub>2</sub>, <it>C</it><sub>2</sub>) of two trees, and a set of common node labels <it>C </it>&#8838; <it>L</it><sub>1 </sub>&#8745; <it>L</it><sub>2</sub>, <it>T</it><sub>1 </sub>can be transformed into <it>T</it><sub>2 </sub>by deleting all nodes with labels in <it>L</it><sub>1 </sub>\ <it>C</it>, which implies deletion of all edges with source or target node labels in <it>L</it><sub>1 </sub>\ <it>C</it>, that is, of all edges in <it>C</it><sub>1 </sub>\ (<it>C</it><sub>1 </sub>&#8745; <it>C </it>&#215; <it>C</it>); inserting all nodes with labels in <it>L</it><sub>2 </sub>\ <it>C</it>; deleting all edges with source and target node labels in (<it>C</it><sub>1 </sub>&#8745; <it>C </it>&#215; <it>C</it>) \ <it>C</it><sub>2</sub>; inserting all edges with source and target node labels in (<it>C</it><sub>2 </sub>&#8745; <it>C </it>&#215; <it>C</it>) \ <it>C</it><sub>1</sub>; and inserting all edges with source or target node labels in <it>L</it><sub>2 </sub>\ <it>C</it>, that is, all edges in <it>C</it><sub>2 </sub>\ (<it>C</it><sub>2 </sub>&#8745; <it>C </it>&#215; <it>C</it>).</p>
               </text>
               <graphic file="1471-2105-6-208-6"/>
            </fig>
            <p>Now, <it>C</it><sub>1 </sub>= (<it>C</it><sub>1 </sub>\ <it>C</it><sub>2</sub>) &#8746; (<it>C</it><sub>1 </sub>&#8745; <it>C</it><sub>2</sub>) = ((<it>C</it><sub>1 </sub>&#8745; <it>C </it>&#215; <it>C</it>) \ <it>C</it><sub>2</sub>) &#8746; ((<it>C</it><sub>1 </sub>\ (<it>C</it><sub>1 </sub>&#8745; <it>C </it>&#215; <it>C</it>)) \ <it>C</it><sub>2</sub>) &#8746; (<it>C</it><sub>1 </sub>&#8745; <it>C</it><sub>2</sub>). In a similar vein, <it>C</it><sub>2 </sub>= ((<it>C</it><sub>2 </sub>&#8745; <it>C </it>&#215; <it>C</it>) \ <it>C</it><sub>1</sub>) &#8746; ((<it>C</it><sub>2 </sub>\ (<it>C</it><sub>2 </sub>&#8745; <it>C </it>&#215; <it>C</it>)) \ <it>C</it><sub>1</sub>) &#8746; (<it>C</it><sub>1 </sub>&#8745; <it>C</it><sub>2</sub>).</p>
            <p>Thus,</p>
            <p><graphic file="1471-2105-6-208-i4.gif"/> = <it>C</it><sub>1 </sub>\ ((<it>C</it><sub>1 </sub>&#8745; <it>C </it>&#215; <it>C</it>) \ <it>C</it><sub>2</sub>) \ ((<it>C</it><sub>1 </sub>\ (<it>C</it><sub>1 </sub>&#8745; <it>C </it>&#215; <it>C</it>)) \ <it>C</it><sub>2</sub>) &#8746; ((<it>C</it><sub>2 </sub>&#8745; <it>C </it>&#215; <it>C</it>) \ <it>C</it><sub>1</sub>) &#8746; ((<it>C</it><sub>2 </sub>\ (<it>C</it><sub>2 </sub>&#8745; <it>C </it>&#215; <it>C</it>)) \ <it>C</it><sub>1</sub>) = (<it>C</it><sub>1 </sub>&#8745; <it>C</it><sub>2</sub>) &#8746; ((<it>C</it><sub>2 </sub>&#8745; <it>C </it>&#215; <it>C</it>) \ <it>C</it><sub>1</sub>) &#8746; ((<it>C</it><sub>2 </sub>\ (<it>C</it><sub>2 </sub>&#8745; <it>C </it>&#215; <it>C</it>)) \ <it>C</it><sub>1</sub>) = <it>C</it><sub>2 </sub>and therefore, <graphic file="1471-2105-6-208-i5.gif"/> = (<it>L</it><sub>2</sub>, <it>C</it><sub>2</sub>) = <it>R</it>(<it>T</it><sub>2</sub>), that is, <graphic file="1471-2105-6-208-i1.gif"/> and <it>T</it><sub>2 </sub>are isomorphic.</p>
            <p>The edit script algorithm can be implemented to take time quasi linear in the size of the trees, by using any efficient dictionary data structure to represent the set of common node labels. The same dictionary data structure allows one to compute the set of common node labels within the same time bound and thus, the whole procedure can be implemented to take time quasi linear in the size of the trees. In particular, our C++ implementation uses the STL associative container set&lt;string> as representation of the set of shared node labels.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <p>Here we suggest a solution based on the notion of an "edit script" that summarises the differences between two trees. Given two trees, <it>T</it><sub>1 </sub>and <it>T</it><sub>2</sub>, a script lists the operations required to convert <it>T</it><sub>1 </sub>into <it>T</it><sub>2</sub>. The script could be constructed manually, but it would be more efficient to generate it automatically. Hence, we imagine the following scenario. A user downloads the NCBI taxonomy tree (or that subtree relevant to their interests), then edits the tree to reflect their preferred classification. Using the algorithm described above, the user then computes the edit script that transforms the NCBI tree into their classification. When a new NCBI tree appears on the NCBI ftp site, the user downloads that tree and applies the edit script to regenerate their classification. In this way, the user need only edit the NCBI tree once.</p>
         <p>As an example, given the two trees in Figs. <figr fid="F1">1</figr> and <figr fid="F2">2</figr>, the edit script for these trees is:</p>
         <p>delete node Pseudocoelomata</p>
         <p>delete node Coelomata</p>
         <p>delete node Protostomia</p>
         <p>delete node Acoelomata</p>
         <p>insert node Ecdysozoa</p>
         <p>insert node Lophotrochozoa</p>
         <p>insert node Protostomia</p>
         <p>insert edge Bilateria -> Deuterostomia</p>
         <p>insert edge Bilateria -> Protostomia</p>
         <p>insert edge Ecdysozoa -> Nematoda</p>
         <p>insert edge Ecdysozoa -> Arthropoda</p>
         <p>insert edge Lophotrochozoa -> Annelida</p>
         <p>insert edge Lophotrochozoa -> Brachiopoda</p>
         <p>insert edge Lophotrochozoa -> Bryozoa</p>
         <p>insert edge Lophotrochozoa -> Mollusca</p>
         <p>insert edge Lophotrochozoa -> Nemertea</p>
         <p>insert edge Lophotrochozoa -> Platyhelminthes</p>
         <p>insert edge Protostomia -> Lophotrochozoa</p>
         <p>insert edge Protostomia -> Ecdysozoa</p>
         <p>Applying the script to the NCBI tree (Fig. <figr fid="F1">1</figr>) yields the tree shown in Fig. <figr fid="F7">7</figr>, which is identical to the tree shown in Fig. <figr fid="F2">2</figr>.</p>
         <fig id="F7">
            <title>
               <p>Figure 7</p>
            </title>
            <caption>
               <p>Result of applying the edit script</p>
            </caption>
            <text>
               <p><b>Result of applying the edit script</b>. The result of applying the edit script to the tree in Fig. 1. This tree is the same as that shown in Fig. 2. Nodes which have been inserted into the tree are filled with light grey. A dashed line represents an edge that has been added to the original tree.</p>
            </text>
            <graphic file="1471-2105-6-208-7"/>
         </fig>
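         <p>To make the application step concrete, the following Python sketch applies a script to a label representation. It is not the authors' script program: the function name apply_script, the tuple encoding of operations, and the reduced four-node tree are assumptions chosen for illustration.</p>

```python
def apply_script(rep: tuple, script: list) -> tuple:
    """Apply edit operations, in order, to a label representation (L, C)."""
    labels, edges = set(rep[0]), set(rep[1])
    for op, arg in script:
        if op == "delete node":
            labels.discard(arg)
            # Deleting a node also deletes every edge incident with it.
            edges = {e for e in edges if arg not in e}
        elif op == "insert node":
            labels.add(arg)
        elif op == "delete edge":
            edges.discard(arg)
        elif op == "insert edge":
            edges.add(arg)
    return labels, edges


# Hypothetical reduced tree: Bilateria -> Coelomata -> {Protostomia, Deuterostomia}.
original = (
    {"Bilateria", "Coelomata", "Protostomia", "Deuterostomia"},
    {
        ("Bilateria", "Coelomata"),
        ("Coelomata", "Protostomia"),
        ("Coelomata", "Deuterostomia"),
    },
)
script = [
    ("delete node", "Coelomata"),
    ("insert edge", ("Bilateria", "Deuterostomia")),
    ("insert edge", ("Bilateria", "Protostomia")),
]
edited = apply_script(original, script)
```

         <p>The operations are applied in the order given, so a script should list node insertions before the edges that are incident with the inserted nodes.</p>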
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>The size of the edit script will be a function of the size of the input trees, and the degree to which they differ. At the time of writing, there are 83,802 metazoan taxa in GenBank. Given that the disagreement between the Coelomata and Ecdysozoa hypotheses concerns the deep level relationships, we can simplify the task by reducing the subtrees about which there is little or no disagreement to single nodes. For example, the 36,746 arthropod taxa can be represented by a single node. Hence, the tree shown in Fig. <figr fid="F1">1</figr> is greatly simplified, compared to the complete NCBI tree.</p>
         <p>One issue we do not directly address here is how to use the tree that results from applying the edit script to query GenBank. There are at least two approaches to doing this. The first is to store the tree in a local database and use a method such as visitation numbers <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> to generate queries involving higher taxa (such as listing all sequences from the Ecdysozoa).</p>
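         <p>One common way to realise visitation numbers is to number each node on entry and on exit during a depth-first traversal, so that the descendants of a taxon are exactly the nodes whose numbers fall inside its interval. The Python sketch below assumes this reading of the method; the function names and the small tree are hypothetical, and the recursive traversal is only suitable for trees of modest depth.</p>

```python
def visitation_numbers(children: dict, root: str) -> dict:
    """Return {node: (left, right)} from a depth-first traversal."""
    numbers = {}
    counter = 0

    def visit(node: str) -> None:
        nonlocal counter
        counter += 1
        left = counter
        for child in children.get(node, []):
            visit(child)
        counter += 1
        numbers[node] = (left, counter)

    visit(root)
    return numbers


def descendants(numbers: dict, taxon: str) -> list:
    """List nodes whose interval lies strictly inside the interval of taxon."""
    left, right = numbers[taxon]
    return sorted(n for n, (a, b) in numbers.items() if a > left and right > b)


children = {
    "Protostomia": ["Ecdysozoa", "Lophotrochozoa"],
    "Ecdysozoa": ["Arthropoda", "Nematoda"],
    "Lophotrochozoa": ["Annelida", "Mollusca"],
}
numbers = visitation_numbers(children, "Protostomia")
ecdysozoans = descendants(numbers, "Ecdysozoa")
```

         <p>In a relational database the same interval test becomes a range condition on two integer columns, which is what makes queries on higher taxa inexpensive.</p>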
         <p>The second approach would be to use the tree to rewrite queries in terms of the original GenBank taxonomy. For example, in the simplified tree in Fig. <figr fid="F2">2</figr>, we could automatically rewrite the query term "Ecdysozoa" as the union of its children (Arthropoda and Nematoda), because both trees (Fig. <figr fid="F1">1</figr> and Fig. <figr fid="F2">2</figr>) agree on the composition of these two taxa. One advantage of this approach is that we can continue to use tools such as BLAST, but in the context of a different taxonomic classification.</p>
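         <p>The query rewriting idea can be sketched in a few lines of Python. The sketch assumes that any taxon name present in both trees has the same composition in both, as is the case for Arthropoda and Nematoda in the example; a taxon unknown to the original taxonomy is replaced, recursively, by its children in the edited tree. The function <b>rewrite_term</b>, the toy data and the " OR " joining are hypothetical and do not correspond to any particular GenBank query syntax.</p>

```python
# Sketch only: rewrite a taxon of the edited tree in terms of taxa that the
# original taxonomy already knows. All names and data here are hypothetical.
EDITED_CHILDREN = {
    "Protostomia": ["Ecdysozoa"],
    "Ecdysozoa": ["Arthropoda", "Nematoda"],
}
ORIGINAL_TAXA = {"Bilateria", "Protostomia", "Arthropoda", "Nematoda"}


def rewrite_term(term, edited_children, original_taxa):
    """Expand `term` into a list of taxa present in the original taxonomy.

    Assumes that a name shared by both trees denotes the same set of taxa.
    """
    if term in original_taxa:
        return [term]
    children = edited_children.get(term, [])
    if not children:
        raise ValueError(
            f"{term!r} cannot be expressed in the original taxonomy"
        )
    expanded = []
    for child in children:
        expanded.extend(rewrite_term(child, edited_children, original_taxa))
    return expanded


TERMS = rewrite_term("Ecdysozoa", EDITED_CHILDREN, ORIGINAL_TAXA)
QUERY = " OR ".join(TERMS)   # illustrative joining, not a real query syntax
```

         <p>A taxon that the two trees define differently would need additional checks before it could be kept unchanged; the sketch does not attempt this.</p>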
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>We present a solution to the problem of generating modifications of the NCBI taxonomy, based on the computation of an edit script that summarises the differences between two classification trees. Our algorithms find the shortest possible edit script based on the identification of all shared subtrees, and take time that is only quasi-linear in the size of the trees because classification trees have unique node labels. We have implemented the edit function in a C++ program that makes use of the Graph Template Library (GTL), available from <url>http://infosun.fmi.uni-passau.de/GTL/</url>. The code has been compiled and tested with the GNU gcc compiler on Mac OS X and Linux machines, and is available from <url>http://darwin.zoology.gla.ac.uk/~rpage/forest/</url>. The software comprises two programs, forest and script. The program forest takes two trees in GML format (the original tree and the edited tree) and computes an edit script. Given this script and the original tree, the program script generates the edited tree.</p>
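         <p>The division of labour between the two programs can be illustrated with a small Python sketch. It is not the authors' algorithm, does not read GML, and makes no claim about minimality: it simply exploits unique node labels to compare the node and edge sets of two trees, emits delete operations before insert operations, and then replays the operations on the original tree. Trees are represented as a mapping from each node to its parent; the operation names other than "insert edge", the toy trees, and the functions <b>edit_script</b> and <b>apply_script</b> are assumptions made for this example.</p>

```python
# Sketch only: set-based edit script for trees with unique node labels.
# A tree is a dict {node: parent}, with None as the parent of the root.
# Both toy trees are hypothetical simplifications, not the paper's figures.
ORIGINAL = {
    "Bilateria": None,
    "Coelomata": "Bilateria",
    "Protostomia": "Coelomata",
    "Arthropoda": "Protostomia",
    "Pseudocoelomata": "Bilateria",
    "Nematoda": "Pseudocoelomata",
}
EDITED = {
    "Bilateria": None,
    "Protostomia": "Bilateria",
    "Ecdysozoa": "Protostomia",
    "Arthropoda": "Ecdysozoa",
    "Nematoda": "Ecdysozoa",
}


def edges(parent_of):
    """Set of (parent, child) pairs of a tree."""
    return {(p, c) for c, p in parent_of.items() if p is not None}


def edit_script(original, edited):
    """List the operations that turn `original` into `edited`."""
    script = []
    for parent, child in sorted(edges(original) - edges(edited)):
        script.append(("delete edge", parent, child))
    for node in sorted(set(original) - set(edited)):
        script.append(("delete node", node))
    for node in sorted(set(edited) - set(original)):
        script.append(("insert node", node))
    for parent, child in sorted(edges(edited) - edges(original)):
        script.append(("insert edge", parent, child))
    return script


def apply_script(original, script):
    """Replay `script` on a copy of `original` and return the new tree."""
    tree = dict(original)
    for op, *args in script:
        if op == "delete edge":
            tree[args[1]] = None        # the child loses its parent
        elif op == "delete node":
            del tree[args[0]]
        elif op == "insert node":
            tree[args[0]] = None        # new node, not yet attached
        elif op == "insert edge":
            parent, child = args
            tree[child] = parent
        else:
            raise ValueError(f"unknown operation {op!r}")
    return tree


SCRIPT = edit_script(ORIGINAL, EDITED)
RESULT = apply_script(ORIGINAL, SCRIPT)
```

         <p>Replaying the computed operations on the original toy tree reproduces the edited toy tree, which mirrors the round trip described above for forest and script.</p>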
      </sec>
      <sec>
         <st>
            <p>Availability and requirements</p>
         </st>
         <p>&#8226; <b>Project name: </b>Forest</p>
         <p>&#8226; <b>Project home page: </b><url>http://darwin.zoology.gla.ac.uk/~rpage/forest/</url></p>
         <p>&#8226; <b>Operating system(s): </b>Unix/Linux, tested on Mac OS X and Red Hat 8.0</p>
         <p>&#8226; <b>Programming language: </b>C++</p>
         <p>&#8226; <b>Other requirements: </b>Graph Template Library (GTL) (<url>http://infosun.fmi.uni-passau.de/GTL/</url>)</p>
         <p>&#8226; <b>License: </b>GNU GPL</p>
         <p>&#8226; <b>Any restrictions to use by non-academics: </b>Forest depends on GTL, which can be downloaded free of charge for non-commercial use. Commercial use of GTL requires a licence from BRAINSYS &#8211; Informatiksysteme GmbH (<url>http://www.brainsys.de/</url>).</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>RDMP posed the problem, and GV developed the algorithm. RDMP and GV jointly developed the software and wrote the manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>This work was funded by BBSRC grant BB/C004310/1 to RDMP, by the Spanish CICYT, project GRAMMARS (TIN2004-07925-C03-01), and by the Japan Society for the Promotion of Science through Long-term Invitation Fellowship L05511 for visiting JAIST (Japan Advanced Institute of Science and Technology).</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>NCBI Taxonomy</p>
            </title>
            <url>http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html</url>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Evidence for a clade of nematodes, arthropods and other moulting animals</p>
            </title>
            <aug>
               <au>
                  <snm>Aguinaldo</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Turbeville</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Linford</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Rivera</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Garey</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Raff</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Lake</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>1997</pubdate>
            <volume>387</volume>
            <issue>6632</issue>
            <fpage>489</fpage>
            <lpage>93</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/387489a0</pubid>
                  <pubid idtype="pmpid">9168109</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Coelomata and not Ecdysozoa: evidence from genome-wide phylogenetic analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Wolf</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Rogozin</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Koonin</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2004</pubdate>
            <volume>14</volume>
            <fpage>29</fpage>
            <lpage>36</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">314272</pubid>
                  <pubid idtype="pmpid" link="fulltext">14707168</pubid>
                  <pubid idtype="doi">10.1101/gr.1347404</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>The Opisthokonta and the Ecdysozoa may not be Clades: stronger support for the grouping of plant and animal than for animal and fungi and stronger support for the Coelomata than Ecdysozoa</p>
            </title>
            <aug>
               <au>
                  <snm>Philip</snm>
                  <fnm>GK</fnm>
               </au>
               <au>
                  <snm>Creevey</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>McInerney</snm>
                  <fnm>JO</fnm>
               </au>
            </aug>
            <source>Mol Biol Evol</source>
            <pubdate>2005</pubdate>
            <volume>22</volume>
            <fpage>1175</fpage>
            <lpage>1184</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/molbev/msi102</pubid>
                  <pubid idtype="pmpid" link="fulltext">15703245</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Multigene analyses of bilaterian animals corroborate the monophyly of Ecdysozoa, Lophotrochozoa and Protostomia</p>
            </title>
            <aug>
               <au>
                  <snm>Philippe</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Lartillot</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Brinkmann</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Mol Biol Evol</source>
            <pubdate>2005</pubdate>
            <volume>22</volume>
            <fpage>1246</fpage>
            <lpage>1253</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/molbev/msi111</pubid>
                  <pubid idtype="pmpid" link="fulltext">15703236</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <aug>
               <au>
                  <snm>Celko</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>SQL for Smarties: Advanced SQL Programming</source>
            <publisher>San Francisco: Morgan Kaufmann</publisher>
            <pubdate>1999</pubdate>
         </bibl>
         <bibl id="B7">
            <title>
               <p>On the editing distance between unordered labeled trees</p>
            </title>
            <aug>
               <au>
                  <snm>Zhang</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Statman</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Shasha</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Inform Process Lett</source>
            <pubdate>1992</pubdate>
            <volume>42</volume>
            <fpage>133</fpage>
            <lpage>139</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/0020-0190(92)90136-J</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>On graphs with unique node labels</p>
            </title>
            <aug>
               <au>
                  <snm>Dickinson</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Bunke</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Dadej</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Kraetzl</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Proc. 4th IAPR Int. Workshop Graph Based Representations in Pattern Recognition</source>
            <publisher>Springer-Verlag</publisher>
            <pubdate>2003</pubdate>
            <fpage>13</fpage>
            <lpage>23</lpage>
         </bibl>
         <bibl id="B9">
            <title>
               <p>On a relation between graph edit distance and maximum common subgraph</p>
            </title>
            <aug>
               <au>
                  <snm>Bunke</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Pattern Recogn Lett</source>
            <pubdate>1997</pubdate>
            <volume>18</volume>
            <fpage>689</fpage>
            <lpage>694</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/S0167-8655(97)00060-3</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <aug>
               <au>
                  <snm>Valiente</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Algorithms on Trees and Graphs</source>
            <publisher>Berlin: Springer-Verlag</publisher>
            <pubdate>2002</pubdate>
         </bibl>
      </refgrp>
   </bm>
</art>

