<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1748-7188-2-10</ui>
   <ji>1748-7188</ji>
   <fm>
      <dochead>Software article</dochead>
      <bibl>
         <title>
            <p>A basic analysis toolkit for biological sequences</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Giancarlo</snm>
               <fnm>Raffaele</fnm>
               <insr iid="I1"/>
               <email>raffaele@math.unipa.it</email>
            </au>
            <au id="A2">
               <snm>Siragusa</snm>
               <fnm>Alessandro</fnm>
               <insr iid="I1"/>
               <email>alessandro.siragusa@gmail.com</email>
            </au>
            <au id="A3">
               <snm>Siragusa</snm>
               <fnm>Enrico</fnm>
               <insr iid="I1"/>
               <email>enricos@imap.cc</email>
            </au>
            <au id="A4">
               <snm>Utro</snm>
               <fnm>Filippo</fnm>
               <insr iid="I1"/>
               <email>filippo.utro@gmail.com</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Dipartimento di Matematica Applicazioni, Universit&#224; di Palermo, Italy</p>
            </ins>
         </insg>
         <source>Algorithms for Molecular Biology</source>
         <issn>1748-7188</issn>
         <pubdate>2007</pubdate>
         <volume>2</volume>
         <issue>1</issue>
         <fpage>10</fpage>
         <url>http://www.almob.org/content/2/1/10</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">17877802</pubid>
               <pubid idtype="doi">10.1186/1748-7188-2-10</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>07</day>
               <month>5</month>
               <year>2007</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>18</day>
               <month>9</month>
               <year>2007</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>18</day>
               <month>9</month>
               <year>2007</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2007</year>
         <collab>Giancarlo et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <p>This paper presents a software library, nicknamed BATS, for some basic sequence analysis tasks, namely local alignments, via approximate string matching, and global alignments, via longest common subsequence and alignments with affine and concave gap cost functions. It also supports filtering operations to select strings from a set and establish their statistical significance, via z-score computation. None of the algorithms is new. However, although they are generally regarded as fundamental for sequence analysis, they have not previously been implemented in a single, consistent software package, as we do here. Therefore, our main contribution is to fill this gap between algorithmic theory and practice by providing an extensible and easy-to-use software library that includes algorithms for the mentioned string matching and alignment problems. The library consists of C/C++ library functions as well as Perl library functions. It can be interfaced with Bioperl and can also be used as a stand-alone system with a GUI. The software is available at <url>http://www.math.unipa.it/~raffaele/BATS/</url> under the GNU GPL.</p>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>1 Introduction</p>
         </st>
         <p>Computational analysis of biological sequences has become an extremely rich field of modern science and a highly interdisciplinary area, where statistical and algorithmic methods play a key role <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp>. In particular, sequence alignment tools have been at the heart of this field for nearly 50 years and it is commonly accepted that the initial investigation of the mathematical notion of alignment and distance is one of the major contributions of S. Ulam to sequence analysis in molecular biology <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. Moreover, alignment techniques have a wealth of applications in other domains, as pointed out for the first time in <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>.</p>
         <p>Here we concentrate on alignment problems involving only two sequences. In general, they can be divided into two areas: local and global alignments <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. Local alignment methods try to find regions of high similarity between two strings, e.g. BLAST <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>, as opposed to global alignment methods that assess an overall structural similarity between the two strings, e.g. the Gotoh alignment algorithm <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. However, at the algorithmic level, both classes often share the same ideas and techniques, since in most cases they are based on dynamic programming algorithms and related speed-ups <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. In more detail, BATS provides implementations of the following (see also Fig. <figr fid="F1">1</figr> for the corresponding functions in the GUI):</p>
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>a snapshot of the GUI</p>
            </caption>
            <text>
               <p><b>a snapshot of the GUI</b>. An overview of the GUI of BATS. The top bar has a specific button for each of the algorithms and functions implemented. Then, each function has its own parameter selection interface. The Edit Distance function interface is shown here.</p>
            </text>
            <graphic file="1748-7188-2-10-1"/>
         </fig>
         <p>(a) Approximate string matching with <it>k </it>mismatches. That is, given a pattern and text string and an integer <it>k</it>, we are interested in finding all occurrences of the pattern in the text with at most <it>k </it>mismatching characters per occurrence. We provide an implementation of an algorithm given in <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. It is a simplification of the first efficient algorithm obtained for this problem, due to Landau and Vishkin <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. The asymptotically fastest known algorithm to date is due to Amir, Lewenstein and Porat <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. Formalization of the problem, as well as description of the algorithm and library functions, both in C/C++ and Perl, is given in section 2.</p>
         <p>(b) Approximate string matching with <it>k </it>differences. That is, given a pattern and text string and an integer <it>k</it>, we are interested in finding all occurrences of the pattern in the text with at most <it>k </it>differences where, for each occurrence a "difference" is a character to be inserted, deleted or substituted in the pattern. We provide an implementation of the algorithm by Landau and Vishkin <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>, although the asymptotically most efficient one, to date, has been recently obtained by Cole and Hariharan <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. Formalization of the problem, as well as description of the algorithm and library functions, both in C/C++ and Perl, is given in section 3.</p>
         <p>(c) The longest common subsequence from fragments, a generalization of the well known longest common subsequence problem <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>, considered by Baker and Giancarlo <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. Formalization of the problem, as well as description of the algorithm and library functions, both in C/C++ and Perl, is given in section 4.</p>
         <p>(d) Edit distance with concave and affine gap penalties. It is the well known generalization of the edit distance between two strings introduced by M.S. Waterman <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>, i.e., with the use of concave gap costs. We provide an implementation of the algorithm obtained by Galil and Giancarlo <abbrgrp><abbr bid="B15">15</abbr></abbrgrp> (<b>GG </b>algorithm for short). An analogous algorithm was obtained independently by Miller and Myers <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. We also point out that the asymptotically most efficient algorithm, to date, is still the one given by Klawe and Kleitman <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>, although it seems to be mainly of theoretic interest. It is also worth mentioning that the <b>GG </b>algorithm naturally specializes to deal with affine gap costs. Formalization of the problem, as well as description of the algorithm and library functions, both in C/C++ and Perl, is given in section 5.</p>
         <p>(e) Filtering, statistical significance computation and organism model generation. The first two functions allow the user to select a subset of strings from a given set and to assess its statistical significance via z-score computation <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. The third function is required in order to supply the first two with a probabilistic model of the input data. While the filtering techniques are quite standard, the implementation of the z-score computation is a specialization of a non-trivial implementation by Sinha and Tompa, used for motif discovery <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>. Our code, like the one by Sinha and Tompa, works only for DNA sequences. The function that generates a user-specified model organism provides, in a suitable format, all the probabilistic information needed by the z-score function. Description of this part of the system, as well as presentation of the corresponding library functions, both in C/C++ and Perl, is given in section 6.</p>
         <p>As is evident from the description just given, this software library is not intended as a generic programming environment, like Leda for combinatorial and geometric computing <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. An initial attempt in that direction for string algorithms is described in <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. The software presented here is tailored to specific alignment problems. We also point out that most of the algorithms implemented in BATS are based on suffix trees <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>. Here we use the implementation of the algorithm by Ukkonen <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> provided in the Strmat library <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. That implementation is not particularly memory-efficient (17 bytes/character), which may be problematic for genome-wide applications of the corresponding algorithms. Finally, the entire library can be used as a stand-alone system with a GUI and it can be interfaced with Bioperl. A detailed user manual, together with installation procedures, file formats etc., is given at the supplementary web site <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>.</p>
      </sec>
      <sec>
         <st>
            <p>2 Approximate string matching with <it>k </it>mismatches</p>
         </st>
         <p>Given a text string <it>text </it>= <it>t</it>[1, <it>n</it>], a pattern string <it>pattern </it>= <it>p</it>[1, <it>m</it>] and an integer <it>k</it>, <it>k </it>&#8804; <it>m </it>&#8804; <it>n</it>, we are interested in finding all occurrences of the pattern in the text with at most <it>k </it>mismatches, i.e. with at most <it>k </it>locations in which the pattern and a text substring have different symbols.</p>
         <p>Let <it>Prefix</it>(<it>i</it>, <it>j</it>) be a function that returns the length of the longest common prefix between <it>p</it>[<it>i</it>, <it>m</it>] and <it>t</it>[<it>j</it>, <it>n</it>]. It can be computed in <it>O</it>(1) time, after the following preprocessing step: (A) build the suffix tree <it>T </it><abbrgrp><abbr bid="B22">22</abbr></abbrgrp> of the strings <it>p</it>[1, <it>m</it>]$<it>t</it>[1, <it>n</it>], where $ is a delimiter not appearing anywhere else in the two strings; (B) preprocess <it>T </it>so that Lowest Common Ancestor (LCA for short) queries can be answered in constant time <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. The preprocessing step takes <it>O</it>(<it>n </it>+ <it>m</it>) time and it is well known that the computation of <it>Prefix</it>(<it>i</it>, <it>j</it>) reduces to the computation of one LCA query on the leaves of <it>T </it><abbrgrp><abbr bid="B8">8</abbr></abbrgrp>.</p>
         <p>Once the preprocessing step is completed, we can find the first (leftmost) mismatch between <it>p</it>[1, <it>m</it>] and <it>t</it>[<it>j</it>, <it>j </it>+ <it>m </it>- 1] in <it>O</it>(1) time by use of <it>Prefix</it>(1, <it>j</it>). If we keep track of where this mismatch occurs, say at position <it>l </it>of <it>pattern</it>, we can locate the second mismatch, in <it>O</it>(1) time, by finding the leftmost mismatch between <it>p</it>[<it>l </it>+ 1, <it>m</it>] and <it>t</it>[<it>j </it>+ <it>l</it>, <it>j </it>+ <it>m </it>- 1]. In general, the <it>q</it>-th mismatch between <it>p</it>[1, <it>m</it>] and <it>t</it>[<it>j</it>, <it>j </it>+ <it>m </it>- 1] can be found in <it>O</it>(1) time by knowing the location of the (<it>q </it>- 1)-th mismatch. Algorithm <b>SM </b>below gives the needed pseudo-code; its running time is stated in Theorem 2.1.</p>
         <p>Algorithm <b>SM</b></p>
         <p><b>for </b><it>j </it>= 1 <b>to </b><it>n </it>- <it>m </it>+ 1 <b>do</b></p>
         <p>&#160;&#160;&#160;<it>pt </it>&#8592; <it>j</it>; <it>v </it>&#8592; 1; <it>num_mismatch </it>&#8592; 0;</p>
         <p>&#160;&#160;&#160;**<it>t</it>[<it>j</it>, <it>j </it>+ <it>m </it>- 1] is aligned with <it>p</it>[1, <it>m</it>] and no mismatch has been found**</p>
         <p>&#160;&#160;&#160;<b>while </b><it>v </it>&#8804; <it>m </it><b>and </b><it>num_mismatch </it>&#8804; <it>k </it><b>do</b></p>
         <p>&#160;&#160;&#160;&#160;&#160;&#160;**find leftmost mismatch between <it>t</it>[<it>pt</it>, <it>j </it>+ <it>m </it>- 1] and <it>p</it>[<it>v</it>, <it>m</it>]**</p>
         <p>&#160;&#160;&#160;&#160;&#160;&#160;&#8467; &#8592; <it>Prefix</it>(<it>v</it>, <it>pt</it>)</p>
         <p>&#160;&#160;&#160;&#160;&#160;&#160;<b>if </b><it>v </it>+ &#8467; &#8804; <it>m </it><b>then</b></p>
         <p>&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<it>num_mismatch </it>&#8592; <it>num_mismatch </it>+ 1</p>
         <p>&#160;&#160;&#160;&#160;&#160;&#160;<b>end if</b></p>
         <p>&#160;&#160;&#160;&#160;&#160;&#160;<it>pt </it>&#8592; <it>pt </it>+ &#8467; + 1; <it>v </it>&#8592; <it>v </it>+ &#8467; + 1;</p>
         <p>&#160;&#160;&#160;<b>end while</b></p>
         <p>&#160;&#160;&#160;<b>if </b><it>num_mismatch </it>&#8804; <it>k </it><b>then</b></p>
         <p>&#160;&#160;&#160;&#160;&#160;&#160;<b>found match</b></p>
         <p>&#160;&#160;&#160;<b>end if</b></p>
         <p><b>end for</b></p>
         <p><b>Theorem 2.1 </b><abbrgrp><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp><it>Given a pattern p and a text t of length m and n respectively, Algorithm </it><b>SM </b><it>finds all occurrences of p in t with at most k mismatches in O</it>(<it>m </it>+ <it>n </it>+ <it>nk</it>) <it>time, including the preprocessing step</it>.</p>
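         <p>The jumping scheme of Algorithm <b>SM </b>can be illustrated with a minimal C sketch. It is not the BATS implementation: the names <ul>prefix_naive</ul> and <ul>sm_sketch</ul> are hypothetical, indices are 0-based, and <it>Prefix </it>is computed by direct character comparison instead of a suffix tree with constant-time LCA queries, so the sketch does not achieve the time bound of Theorem 2.1.</p>
         <p><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for Prefix(i, j): length of the longest common
   prefix of p[i..m-1] and t[j..n-1] (0-based), found by direct scanning.
   The paper answers this query in O(1) after suffix tree preprocessing. */
static size_t prefix_naive(const char *p, size_t m, size_t i,
                           const char *t, size_t n, size_t j)
{
    size_t len = 0;
    while (i + len < m && j + len < n && p[i + len] == t[j + len])
        len++;
    return len;
}

/* Hypothetical function: stores in out[] the 0-based start positions of all
   occurrences of p in t with at most k mismatches and returns their count.
   out must have room for strlen(t) - strlen(p) + 1 entries. */
size_t sm_sketch(const char *t, const char *p, size_t k, size_t *out)
{
    assert(t != NULL && p != NULL && out != NULL);
    size_t n = strlen(t), m = strlen(p), count = 0;
    if (m == 0 || m > n)
        return 0;
    for (size_t j = 0; j + m <= n; j++) {
        size_t v = 0;            /* next pattern position to compare */
        size_t mismatches = 0;
        while (v < m && mismatches <= k) {
            size_t len = prefix_naive(p, m, v, t, n, j + v);
            if (v + len < m)     /* stopped on a mismatch, not at the end of p */
                mismatches++;
            v += len + 1;        /* jump past the mismatching position */
        }
        if (mismatches <= k)
            out[count++] = j;
    }
    return count;
}
```
]]></p>
         <p>For the text <it>abcabd</it>, the pattern <it>abd </it>and <it>k </it>= 1, this sketch reports the 0-based start positions 0 and 3. Each alignment performs at most <it>k </it>+ 1 calls to the prefix routine, which is where the <it>nk </it>term of the time bound comes from once each call costs <it>O</it>(1).</p>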
         <sec>
            <st>
               <p>2.1 The C/C++ library functions</p>
            </st>
            <p>The function below returns all occurrences, with at most <it>k </it>mismatches, of a pattern within a text.</p>
            <p>
               <b>Synopsis</b>
            </p>
            <p>
               <b>#include "k_mismatch.h"</b>
            </p>
            <p>
               <ul>OCCURRENCES</ul>
            </p>
            <p><b><ul>k_mismatch</ul></b>(<ul>char</ul><it><ul>*text</ul></it>, <ul>char</ul><it><ul>*pattern</ul></it>, <ul>int </ul><it><ul>k</ul></it>);</p>
            <p><b>Arguments</b>:</p>
            <p>&#8226; <it><ul>text</ul></it>: points to a text string;</p>
            <p>&#8226; <it><ul>pattern</ul></it>: points to a pattern string;</p>
            <p>&#8226; <it><ul>k</ul></it>: is an integer giving the maximum number of allowed mismatches.</p>
            <p><b>Return Values</b>: <b><ul>k_mismatch</ul></b> returns a pointer to <ul>OCCURRENCES_STRUCT</ul>, defined as:</p>
            <p>typedef struct <ul>occurrences</ul></p>
            <p>{</p>
            <p>&#160;&#160;&#160;<ul>int</ul><it><ul> start</ul></it>, <it><ul>end</ul></it>;</p>
            <p>&#160;&#160;&#160;<ul>int</ul><it><ul> errors</ul></it>;</p>
            <p>&#160;&#160;&#160;<ul>char</ul><it><ul>*text</ul></it>;</p>
            <p>&#160;&#160;&#160;<ul>char</ul><it><ul>*pattern</ul></it>;</p>
            <p>&#160;&#160;&#160;struct <ul>occurrences</ul><it><ul>*next</ul></it>;</p>
            <p>} <ul>OCCURRENCES_STRUCT</ul>, <it><ul>*OCCURRENCES</ul></it>;</p>
            <p>where:</p>
            <p>&#8226; <it><ul>start</ul></it>: is the start position of this occurrence in the text string;</p>
            <p>&#8226; <it><ul>end</ul></it>: is the end position of this occurrence in the text string;</p>
            <p>&#8226; <it><ul>errors</ul></it>: is the number of mismatches of this occurrence;</p>
            <p>&#8226; <it><ul>text</ul></it>: is a pointer to the aligned substring corresponding to the occurrence found;</p>
            <p>&#8226; <it><ul>pattern</ul></it>: is a pointer to the aligned pattern string.</p>
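            <p>A caller would typically walk the returned list through the <it><ul>next</ul></it> field. The sketch below is only an illustration under stated assumptions: the structure is copied from the definition above so that the example is self-contained (real code would include <b>k_mismatch.h</b>), the helper <ul>print_occurrences</ul> is hypothetical and not part of BATS, and the list is assumed to end with a NULL <it><ul>next</ul></it> pointer, which the text above does not state explicitly.</p>
            <p><![CDATA[
```c
#include <assert.h>
#include <stdio.h>
#include <stddef.h>

/* Layout copied from the definition in the text; in real code it would
   come from k_mismatch.h. */
typedef struct occurrences {
    int start, end;
    int errors;
    char *text;
    char *pattern;
    struct occurrences *next;
} OCCURRENCES_STRUCT, *OCCURRENCES;

/* Hypothetical helper (not part of BATS): prints one line per occurrence
   and returns the number of occurrences in the list.
   Assumption: the last element has next == NULL. */
int print_occurrences(OCCURRENCES list, FILE *stream)
{
    assert(stream != NULL);
    int count = 0;
    for (OCCURRENCES o = list; o != NULL; o = o->next) {
        fprintf(stream, "[%d, %d] errors=%d\n", o->start, o->end, o->errors);
        count++;
    }
    return count;
}
```
]]></p>
            <p>With the library installed, the list passed to such a helper would be the value returned by <b><ul>k_mismatch</ul></b>.</p>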
         </sec>
         <sec>
            <st>
               <p>2.2 The Perl library functions</p>
            </st>
            <p>The function below returns all occurrences, with at most <it>k </it>mismatches, of a pattern within a text.</p>
            <p>
               <b>Synopsis</b>
            </p>
            <p>
               <b>use BSAT::K_Mismatch;</b>
            </p>
            <p>K_Mismatch <it>Text Pattern K</it></p>
            <p><b>Arguments</b>:</p>
            <p>&#8226; <it>Text</it>: is a scalar containing the text string;</p>
            <p>&#8226; <it>Pattern</it>: is a scalar containing the pattern string;</p>
            <p>&#8226; <it>K</it>: is a scalar giving the maximum number of allowed mismatches.</p>
            <p><b>Return values: </b>The function returns an array of occurrences. Each occurrence consists of a hash:</p>
            <p>my %occurrence = (</p>
            <p>&#160;&#160;&#160;errors => 0,</p>
            <p>&#160;&#160;&#160;start => 0,</p>
            <p>&#160;&#160;&#160;end => 0,</p>
            <p>&#160;&#160;&#160;text => "",</p>
            <p>&#160;&#160;&#160;pattern => "");</p>
            <p>where the above fields are as in the <ul>OCCURRENCES_STRUCT</ul> defined earlier.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>3 Approximate string matching with <it>k </it>differences</p>
         </st>
         <p>In this section we consider a more general problem of approximate string matching by extending the set of allowed differences between strings. Letting <it>text</it>, <it>pattern </it>and <it>k </it>be as in section 2, we are interested in finding all occurrences of <it>pattern </it>in <it>text </it>with at most <it>k </it>differences. The allowed differences are:</p>
         <p>(a) A symbol of the pattern corresponds to a different symbol of the text, i.e., a mismatch.</p>
         <p>(b) A symbol of the pattern corresponds to no symbol in the text.</p>
         <p>(c) A symbol of the text corresponds to no symbol in the pattern.</p>
         <p>Let <it>A </it>be an (<it>m </it>+ 1) &#215; (<it>n </it>+ 1) dynamic programming matrix and consider the following recurrence:</p>
         <p>
            <display-formula id="M1"><it>A</it>[0, <it>j</it>] = 0, 0 &#8804; <it>j </it>&#8804; <it>n</it>.</display-formula>
         </p>
         <p>
            <display-formula id="M2"><it>A</it>[<it>i</it>, 0] = <it>i</it>, 0 &#8804; <it>i </it>&#8804; <it>m</it>.</display-formula>
         </p>
         <p>
            <display-formula id="M3"><it>A</it>[<it>i</it>, <it>j</it>] = <it>min</it>(<it>A</it>[<it>i </it>- 1, <it>j</it>] + 1, <it>A</it>[<it>i</it>, <it>j </it>- 1] + 1, <it>if p</it>[<it>i</it>] = <it>t</it>[<it>j</it>] <it>then A</it>[<it>i </it>- 1, <it>j </it>- 1] <it>else A</it>[<it>i </it>- 1, <it>j </it>- 1] + 1), 1 &#8804; <it>i </it>&#8804; <it>m</it>, 1 &#8804; <it>j </it>&#8804; <it>n</it>.</display-formula>
         </p>
         <p>Matrix <it>A </it>can be computed row by row, or column by column, in <it>O</it>(<it>nm</it>) time. Moreover, it can be easily shown that <it>A</it>[<it>i</it>, <it>j</it>] is the minimal edit distance between <it>p</it>[1, <it>i</it>] and a substring of <it>text </it>ending at position <it>j</it>. Thus, it follows that there is an occurrence of the pattern in the text ending at position <it>j </it>of the text if and only if <it>A</it>[<it>m</it>, <it>j</it>] &#8804; <it>k</it>. The computation of <it>A </it>can be substantially sped up by observing that, for any <it>i </it>and <it>j</it>, either <it>A</it>[<it>i </it>+ 1, <it>j </it>+ 1] = <it>A</it>[<it>i</it>, <it>j</it>] or <it>A</it>[<it>i </it>+ 1, <it>j </it>+ 1] = <it>A</it>[<it>i</it>, <it>j</it>] + 1. That is, the elements along any diagonal of <it>A </it>form a non-decreasing sequence of integers. Thus, the computation of <it>A </it>can be performed by finding, for all diagonals, the rows in which <it>A</it>[<it>i </it>+ 1, <it>j </it>+ 1] = <it>A</it>[<it>i</it>, <it>j</it>] + 1 &#8804; <it>k</it>. This observation was exploited by Ukkonen <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> in order to obtain a space-efficient algorithm for the computation of the edit distance between two strings. Landau and Vishkin <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> cleverly extended the method by Ukkonen to obtain an efficient algorithm that handles the more general problem of string matching with <it>k </it>differences. We present their algorithm here, although the asymptotically most efficient one, to date, was recently obtained by Cole and Hariharan <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>.</p>
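         <p>As a point of reference, the plain <it>O</it>(<it>nm</it>) evaluation of the recurrence can be sketched in C as follows. This is an illustrative sketch, not the BATS code: the function name <ul>kdiff_dp_sketch</ul> is hypothetical, the matrix is evaluated column by column while keeping a single column in memory, and the reported values are the 1-based text positions <it>j </it>with <it>A</it>[<it>m</it>, <it>j</it>] &#8804; <it>k</it>.</p>
         <p><![CDATA[
```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical function: writes to out[] every 1-based text position j with
   A[m, j] <= k and returns how many were written, or -1 if memory cannot be
   allocated. out must have room for strlen(t) entries. */
int kdiff_dp_sketch(const char *t, const char *p, int k, int *out)
{
    assert(t != NULL && p != NULL && out != NULL);
    int n = (int)strlen(t), m = (int)strlen(p), count = 0;
    int *col = malloc((size_t)(m + 1) * sizeof *col);
    if (col == NULL)
        return -1;
    for (int i = 0; i <= m; i++)
        col[i] = i;                          /* A[i, 0] = i */
    for (int j = 1; j <= n; j++) {
        int diag = col[0];                   /* A[0, j - 1] */
        col[0] = 0;                          /* A[0, j] = 0 */
        for (int i = 1; i <= m; i++) {
            int up = col[i - 1] + 1;         /* A[i - 1, j] + 1 */
            int left = col[i] + 1;           /* A[i, j - 1] + 1 */
            int sub = diag + (p[i - 1] != t[j - 1]);
            int best = up < left ? up : left;
            diag = col[i];                   /* A[i, j - 1], used as A[i - 1, j - 1] next */
            col[i] = best < sub ? best : sub;
        }
        if (col[m] <= k)
            out[count++] = j;                /* an occurrence ends at position j */
    }
    free(col);
    return count;
}
```
]]></p>
         <p>For the text <it>abcabd</it>, the pattern <it>abd </it>and <it>k </it>= 1, the sketch reports the end positions 2, 3, 5 and 6. With <it>k </it>= 0 only position 6 remains.</p>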
         <p>Let <it>L</it><sub><it>d</it>,<it>e </it></sub>denote the largest row <it>i </it>such that <it>A</it>[<it>i</it>, <it>j</it>] = <it>e </it>and <it>j </it>- <it>i </it>= <it>d</it>. The definition of <it>L</it><sub><it>d</it>, <it>e </it></sub>implies that <it>e </it>is the minimal number of differences between <it>p</it>[1, <it>L</it><sub><it>d</it>,<it>e</it></sub>] and the substrings of the text ending at <it>t</it>[<it>L</it><sub><it>d</it>,<it>e </it></sub>+ <it>d</it>], with <it>p</it>[<it>L</it><sub><it>d</it>,<it>e </it></sub>+ 1] &#8800; <it>t</it>[<it>L</it><sub><it>d</it>,<it>e </it></sub>+ <it>d </it>+ 1]. In order to solve the <it>k </it>differences problem, we need to compute the values of <it>L</it><sub><it>d</it>,<it>e </it></sub>that satisfy <it>e </it>&#8804; <it>k</it>. Assuming that <it>L</it><sub><it>d</it>+1,<it>e</it>-1</sub>, <it>L</it><sub><it>d</it>-1,<it>e</it>-1 </sub>and <it>L</it><sub><it>d</it>,<it>e</it>-1 </sub>have been correctly computed, <it>L</it><sub><it>d</it>,<it>e </it></sub>is computed as follows. Let <it>row </it>= <it>max</it>(<it>L</it><sub><it>d</it>+1,<it>e</it>-1 </sub>+ 1, <it>L</it><sub><it>d</it>-1,<it>e</it>-1</sub>, <it>L</it><sub><it>d</it>,<it>e</it>-1 </sub>+ 1) and let &#8467; be the largest integer such that <it>p</it>[<it>row </it>+ 1, <it>row </it>+ &#8467;] = <it>t</it>[<it>d </it>+ <it>row </it>+ 1, <it>d </it>+ <it>row </it>+ &#8467;]. Then, <it>L</it><sub><it>d</it>,<it>e </it></sub>= <it>row </it>+ &#8467;. The proof of correctness of such a computation is a simple exercise, left to the reader. Moreover, if one makes use of the preprocessing algorithms presented in section 2, <it>L</it><sub><it>d</it>,<it>e </it></sub>can be computed in <it>O</it>(1) time as follows:</p>
         <p><it>L</it><sub><it>d</it>,<it>e </it></sub>= <it>row </it>+ <it>Prefix</it>(<it>row </it>+ 1, <it>row </it>+ <it>d </it>+ 1). Algorithm <b>SD </b>gives the needed pseudo-code. We have:</p>
         <p><b>Theorem 3.1 </b><abbrgrp><abbr bid="B11">11</abbr></abbrgrp><it>Given a pattern p and a text t, of length m and n, respectively, Algorithm </it><b>SD </b><it>finds all occurrences of p in t with at most k differences in O</it>(<it>m </it>+ <it>n </it>+ <it>nk</it>) <it>time, including the preprocessing step</it>.</p>
         <sec>
            <st>
               <p>3.1 The C/C++ library functions</p>
            </st>
            <p>The function below returns all occurrences of a pattern within a text with at most k differences.</p>
            <p>
               <b>Synopsis</b>
            </p>
            <p>
               <b>#include "k_difference.h"</b>
            </p>
            <p>
               <ul>OCCURRENCES</ul>
            </p>
            <p><b><ul>k_difference</ul></b>(<ul>char</ul><it><ul>*text</ul></it>, <ul>char</ul><it><ul>*pattern</ul></it>, <ul>int </ul><it><ul>k</ul></it>);</p>
            <p><b>Arguments</b>: As in function <b><ul>k_mismatch</ul></b></p>
            <p><b>Return Values</b>: As in function <b><ul>k_mismatch</ul></b></p>
            <p>Algorithm <b>SD</b></p>
            <p>**Initial Conditions Start Here**</p>
            <p><b>for </b><it>d </it>:= 0 <b>to </b><it>n </it><b>do</b></p>
            <p>&#160;&#160;&#160;<it>L</it>[<it>d</it>, -1] &#8592; -1</p>
            <p><b>end for</b></p>
            <p><b>for </b><it>d </it>:= -(<it>k </it>+ 1) <b>to </b>-1 <b>do</b></p>
            <p>&#160;&#160;&#160;<it>L</it>[<it>d</it>, |<it>d</it>| - 1] &#8592; |<it>d</it>| - 1</p>
            <p>&#160;&#160;&#160;<it>L</it>[<it>d</it>, |<it>d</it>| - 2] &#8592; |<it>d</it>| - 2</p>
            <p><b>end for</b></p>
            <p><b>for </b><it>e </it>:= -1 <b>to </b><it>k </it><b>do</b></p>
            <p>&#160;&#160;&#160;<it>L</it>[<it>n </it>+ 1, <it>e</it>] &#8592; -1</p>
            <p><b>end for</b></p>
            <p>**Initial Conditions End Here**</p>
            <p><b>for </b><it>e </it>:= 0 <b>to </b><it>k </it><b>do</b></p>
            <p>&#160;&#160;&#160;<b>for </b><it>d </it>:= -<it>e </it><b>to </b><it>n </it><b>do</b></p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;<it>row </it>&#8592; max(<it>L</it>[<it>d</it>, <it>e </it>- 1] + 1, <it>L</it>[<it>d </it>- 1, <it>e </it>- 1], <it>L</it>[<it>d </it>+ 1, <it>e </it>- 1] + 1)</p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;<it>row </it>&#8592; min(<it>row</it>, <it>m</it>)</p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;<b>if </b><it>row </it>&lt;<it>m </it><b>and </b><it>row </it>+ <it>d </it>&lt;<it>n </it><b>then</b></p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<it>row </it>&#8592; <it>row </it>+ <it>Prefix</it>(<it>row </it>+ 1, <it>row </it>+ <it>d </it>+ 1)</p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;<b>end if</b></p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;<it>L</it>[<it>d</it>, <it>e</it>] &#8592; <it>row</it></p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;<b>if </b><it>L</it>[<it>d</it>, <it>e</it>] = <it>m </it><b>and </b><it>d </it>+ <it>m </it>&#8804; <it>n </it><b>then</b></p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;**Occurrence Found**</p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;<b>end if</b></p>
            <p>&#160;&#160;&#160;<b>end for</b></p>
            <p><b>end for</b></p>
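            <p>The pseudo-code of Algorithm <b>SD </b>translates fairly directly into C. The following is a minimal sketch under stated assumptions, not the BATS source: the names <ul>prefix_scan</ul>, <ul>max3</ul> and <ul>sd_sketch</ul> are hypothetical, <it>Prefix </it>is computed by direct comparison instead of suffix tree and LCA queries (so the bound of Theorem 3.1 is not achieved), and a flag array is added so that each end position is reported only once, in increasing order.</p>
            <p><![CDATA[
```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for Prefix(i, j): longest common prefix of p[i..m]
   and t[j..n], with 1-based indices as in the paper, by direct comparison. */
static int prefix_scan(const char *p, int m, int i, const char *t, int n, int j)
{
    int len = 0;
    while (i + len <= m && j + len <= n && p[i + len - 1] == t[j + len - 1])
        len++;
    return len;
}

static int max3(int a, int b, int c)
{
    int x = a > b ? a : b;
    return x > c ? x : c;
}

/* Hypothetical function: writes to out[] the 1-based end positions of all
   occurrences of p in t with at most k differences, in increasing order,
   and returns their number (-1 if memory cannot be allocated).
   out must have room for strlen(t) entries. */
int sd_sketch(const char *t, const char *p, int k, int *out)
{
    assert(t != NULL && p != NULL && out != NULL && k >= 0);
    int n = (int)strlen(t), m = (int)strlen(p), count = 0;
    int width = k + 2;                       /* e ranges over -1..k */
    int *tab = malloc((size_t)(n + k + 3) * (size_t)width * sizeof *tab);
    char *found = calloc((size_t)n + 1, 1);  /* found[j] set if an occurrence ends at j */
    if (tab == NULL || found == NULL) {
        free(tab);
        free(found);
        return -1;
    }
/* L(d, e) for diagonals d in -(k+1)..n+1 and error counts e in -1..k */
#define L(d, e) tab[((d) + k + 1) * width + ((e) + 1)]
    for (int d = 0; d <= n; d++)
        L(d, -1) = -1;
    for (int d = -(k + 1); d <= -1; d++) {
        L(d, -d - 1) = -d - 1;
        L(d, -d - 2) = -d - 2;
    }
    for (int e = -1; e <= k; e++)
        L(n + 1, e) = -1;
    for (int e = 0; e <= k; e++) {
        for (int d = -e; d <= n; d++) {
            int row = max3(L(d, e - 1) + 1, L(d - 1, e - 1), L(d + 1, e - 1) + 1);
            if (row > m)
                row = m;
            if (row < m && row + d < n)
                row += prefix_scan(p, m, row + 1, t, n, row + d + 1);
            L(d, e) = row;
            if (row == m && d + m >= 1 && d + m <= n)
                found[d + m] = 1;            /* occurrence ending at t[d + m] */
        }
    }
#undef L
    for (int j = 1; j <= n; j++)
        if (found[j])
            out[count++] = j;
    free(tab);
    free(found);
    return count;
}
```
]]></p>
            <p>For the text <it>abcabd</it>, the pattern <it>abd </it>and <it>k </it>= 1, the sketch reports the end positions 2, 3, 5 and 6, which are exactly the positions <it>j </it>with <it>A</it>[<it>m</it>, <it>j</it>] &#8804; 1 in the dynamic programming matrix of section 3.</p>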
         </sec>
         <sec>
            <st>
               <p>3.2 The Perl library functions</p>
            </st>
            <p>The function below returns all occurrences of a pattern within a text with at most k differences.</p>
            <p>
               <b>Synopsis</b>
            </p>
            <p>
               <b>use BSAT::K_Difference;</b>
            </p>
            <p>K_Difference <it>Text Pattern K</it></p>
            <p><b>Arguments</b>: As in function <b><ul>K_Mismatch</ul></b></p>
            <p><b>Return values: </b>As in function <b><ul>K_Mismatch</ul></b></p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>4 Longest common subsequence from fragments</p>
         </st>
         <p>In this section we consider the problem of identifying a longest common subsequence (LCS for short) of two strings <it>X </it>and <it>Y</it>, using a set <it>M </it>of matching fragments, that is, strings of a given length that appear in both <it>X </it>and <it>Y</it>. We start by reviewing some basic notions about LCS computation and relate them to approximate string matching, discussed in sections 2 and 3. Then, we outline the algorithm presented in <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>.</p>
         <sec>
            <st>
               <p>4.1 LCS from fragments and edit graphs</p>
            </st>
            <p>It is well known that finding the LCS of <it>X </it>and <it>Y </it>is equivalent to finding the Levenshtein edit distance between the two strings <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>, where the "edit operations" are insertion and deletion of a single character. Those edit operations naturally correspond to the differences of type (b) and (c) introduced in section 3 for approximate string matching. Although there is an analogy between approximate string matching and LCS computation, the former can be regarded as a local alignment method, whereas the latter is a global alignment method <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. Following Myers <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>, we phrase the LCS problem as the computation of a shortest path in the edit graph for <it>X </it>and <it>Y</it>, defined as follows. It is a directed grid graph (see Fig. <figr fid="F2">2</figr>) with vertices (<it>i</it>, <it>j</it>), where 0 &#8804; <it>i </it>&#8804; <it>n </it>and 0 &#8804; <it>j </it>&#8804; <it>m</it>, |<it>X</it>| = <it>n </it>and |<it>Y</it>| = <it>m</it>. We refer to the vertices also as <it>points</it>. There is a vertical edge from each non-bottom point to its neighbor below. There is a horizontal edge from each non-rightmost point to its right neighbor. Finally, if <it>X</it>[<it>i</it>] = <it>Y</it>[<it>j</it>], there is a diagonal edge from (<it>i </it>- 1, <it>j </it>- 1) to (<it>i</it>, <it>j</it>). Assume that each non-diagonal edge has weight 1 and the remaining edges weight 0. Then, the Levenshtein edit distance is given by the minimum cost of any path from (0, 0) to (<it>n</it>, <it>m</it>). We assume the reader is familiar with the notion of edit script corresponding to the min-cost path and how to recover an LCS from an edit script <abbrgrp><abbr bid="B28">28</abbr><abbr bid="B29">29</abbr><abbr bid="B30">30</abbr></abbrgrp>. Our LCS from Fragments problem also corresponds naturally to an edit graph. The vertices and the horizontal and vertical edges are as before, but the diagonal edges correspond to a given set of fragments. Each fragment, formally described as a triple (<it>i</it>, <it>j</it>, <it>k</it>), represents a sequence of diagonal edges from (<it>i </it>- 1, <it>j </it>- 1) (the <it>start </it>point) to (<it>i </it>+ <it>k </it>- 1, <it>j </it>+ <it>k </it>- 1) (the <it>end </it>point). For a fragment <it>f</it>, the start and end points of <it>f </it>are denoted by <it>start</it>(<it>f</it>) and <it>end</it>(<it>f</it>), respectively. In the example of Figure <figr fid="F3">3</figr>, the fragments are the sequences of at least 2 diagonal edges of Fig. <figr fid="F2">2</figr>. The LCS from Fragments problem is equivalent to finding a minimum-cost path in the edit graph from (0, 0) to (<it>n</it>, <it>m</it>), where each diagonal edge has weight 0 and each non-diagonal edge has weight 1. The problem has an obvious dynamic programming solution since the graph naturally corresponds to an <it>n </it>&#215; <it>m </it>dynamic programming matrix. However, it also falls into the more efficient algorithmic paradigm of Sparse Dynamic Programming <abbrgrp><abbr bid="B31">31</abbr><abbr bid="B32">32</abbr></abbrgrp>, as discussed in <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> and outlined next.</p>
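            <p>The "obvious" dynamic programming solution for the plain LCS edit graph can be sketched in C as follows. The function names <ul>indel_distance_sketch</ul> and <ul>lcs_length_sketch</ul> are hypothetical and the code is only an illustration of the min-cost path computation, not the sparse algorithm of <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. It computes the minimum path cost <it>D </it>from (0, 0) to (<it>n</it>, <it>m</it>) and uses the standard identity that the LCS length equals (<it>n </it>+ <it>m </it>- <it>D</it>)/2 when only insertions and deletions are allowed.</p>
            <p><![CDATA[
```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical function: minimum cost of a path from (0, 0) to (n, m) in the
   edit graph of x and y, where horizontal and vertical edges cost 1 and
   diagonal edges (x[i] == y[j]) cost 0. Returns -1 on allocation failure. */
int indel_distance_sketch(const char *x, const char *y)
{
    assert(x != NULL && y != NULL);
    int n = (int)strlen(x), m = (int)strlen(y);
    int *row = malloc((size_t)(m + 1) * sizeof *row);
    if (row == NULL)
        return -1;
    for (int j = 0; j <= m; j++)
        row[j] = j;                          /* D[0][j]: j horizontal edges */
    for (int i = 1; i <= n; i++) {
        int diag = row[0];                   /* D[i-1][0] */
        row[0] = i;                          /* D[i][0]: i vertical edges */
        for (int j = 1; j <= m; j++) {
            int up = row[j] + 1;             /* vertical edge from (i-1, j) */
            int left = row[j - 1] + 1;       /* horizontal edge from (i, j-1) */
            int best = up < left ? up : left;
            if (x[i - 1] == y[j - 1] && diag < best)
                best = diag;                 /* diagonal edge of weight 0 */
            diag = row[j];                   /* D[i-1][j], the next D[i-1][j-1] */
            row[j] = best;
        }
    }
    int d = row[m];
    free(row);
    return d;
}

/* Hypothetical function: LCS length derived from the path cost. */
int lcs_length_sketch(const char *x, const char *y)
{
    int d = indel_distance_sketch(x, y);
    if (d < 0)
        return -1;
    return ((int)strlen(x) + (int)strlen(y) - d) / 2;
}
```
]]></p>
            <p>For the strings <it>X </it>= <it>CDABAC </it>and <it>Y </it>= <it>ABCABBA </it>of Fig. <figr fid="F2">2</figr>, the sketch returns a path cost of 5 and an LCS length of 4 (for instance <it>CABA</it>).</p>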
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>an edit graph</p>
               </caption>
               <text>
                  <p><b>an edit graph</b>. An edit graph for the strings <it>X </it>= <it>CDABAC </it>and <it>Y </it>= <it>ABCABBA</it>. It naturally corresponds to a <b>DP </b>matrix. The bold path from (0, 0) to (6, 7) gives an edit script from which we can recover the LCS between <it>X </it>and <it>Y</it>.</p>
               </text>
               <graphic file="1748-7188-2-10-2"/>
            </fig>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>an edit graph with fragments</p>
               </caption>
               <text>
                  <p><b>an edit graph with fragments</b>. An LCS from Fragments edit graph for the same strings as in Figure 2, where the fragments are the sequences of at least two diagonal edges of Figure 2. The bold path from (0, 0) to (6, 7) corresponds to a minimum-cost path under the Levenshtein edit distance.</p>
               </text>
               <graphic file="1748-7188-2-10-3"/>
            </fig>
            <p>For a point <it>p</it>, define <it>x</it>(<it>p</it>) and <it>y</it>(<it>p</it>) to be the <it>x</it>- and <it>y</it>-coordinates of <it>p</it>, respectively. We also refer to <it>x</it>(<it>p</it>) as the <it>row </it>of <it>p </it>and <it>y</it>(<it>p</it>) as the <it>column </it>of <it>p</it>. For a fragment <it>f</it>, define its diagonal number to be <it>d</it>(<it>f</it>) = <it>y</it>(<it>start</it>(<it>f</it>)) - <it>x</it>(<it>start</it>(<it>f</it>)).</p>
            <p>We say a fragment <it>f' </it>is <it>left of start</it>(<it>f</it>) if some point of <it>f' </it>besides <it>start</it>(<it>f'</it>) is to the left of <it>start</it>(<it>f</it>) on a horizontal line through <it>start</it>(<it>f</it>), or <it>start</it>(<it>f</it>) lies on <it>f' </it>and <it>x</it>(<it>start</it>(<it>f'</it>)) &lt;<it>x</it>(<it>start</it>(<it>f</it>)). (In the latter case, <it>f </it>and <it>f' </it>are in the same diagonal and overlap.) A fragment <it>f' </it>is <it>above start</it>(<it>f</it>) if some point of <it>f' </it>besides <it>start</it>(<it>f'</it>) is strictly above <it>start</it>(<it>f</it>) on a vertical line through <it>start</it>(<it>f</it>).</p>
            <p>Define <it>visl</it>(<it>f</it>) to be the first fragment to the left of <it>start</it>(<it>f</it>) if one exists, and undefined otherwise. Define <it>visa</it>(<it>f</it>) to be the first fragment above <it>start</it>(<it>f</it>) if one exists, and undefined otherwise.</p>
            <p>We say that fragment <it>f </it>precedes fragment <it>f' </it>if <it>x</it>(<it>end</it>(<it>f</it>)) &lt;<it>x</it>(<it>start</it>(<it>f'</it>)) and <it>y</it>(<it>end</it>(<it>f</it>)) &lt;<it>y</it>(<it>start</it>(<it>f'</it>)), i.e. if the end point of <it>f </it>is strictly inside the rectangle with opposite corners (0, 0) and <it>start</it>(<it>f'</it>).</p>
            <p>Suppose that fragment <it>f </it>precedes fragment <it>f'</it>. The shortest path from <it>end</it>(<it>f</it>) to <it>start</it>(<it>f'</it>) with no diagonal edges has cost <it>x</it>(<it>start</it>(<it>f'</it>)) - <it>x</it>(<it>end</it>(<it>f</it>)) + <it>y</it>(<it>start</it>(<it>f'</it>)) - <it>y</it>(<it>end</it>(<it>f</it>)), and the minimum cost of any path from (0, 0) to <it>start</it>(<it>f'</it>) through <it>f </it>is that value plus <it>mincost</it><sub>0</sub>(<it>f</it>). It will be helpful to separate out the part of this cost that depends on <it>f </it>by the definition <it>Z</it>(<it>f</it>) = <it>mincost</it><sub>0</sub>(<it>f</it>) - <it>x</it>(<it>end</it>(<it>f</it>)) - <it>y</it>(<it>end</it>(<it>f</it>)). Note that <it>Z</it>(<it>f</it>) &#8804; 0 since <it>mincost</it><sub>0</sub>(<it>f</it>) &#8804; <it>x</it>(<it>start</it>(<it>f</it>)) + <it>y</it>(<it>start</it>(<it>f</it>)). The following lemma states that we can compute LCS from fragments by considering only end-points of some fragments rather than all points in the dynamic programming matrix. Moreover, it also gives the appropriate recurrence relations that we need to compute.</p>
            <p><b>Lemma 4.1 </b><abbrgrp><abbr bid="B13">13</abbr></abbrgrp><it>For any fragment f and any point p on f, mincost</it><sub>0</sub>(<it>p</it>) = <it>mincost</it><sub>0</sub>(<it>start</it>(<it>f</it>)).</p>
            <p><it>Moreover, mincost</it><sub>0</sub>(<it>f</it>) <it>is the minimum of x</it>(<it>start</it>(<it>f</it>)) + <it>y</it>(<it>start</it>(<it>f</it>)) <it>and any of c<sub>p</sub>, c<sub>l</sub>, and c<sub>a </sub>that are defined according to the following:</it></p>
            <p><it>1. If at least one fragment precedes f, c</it><sub><it>p </it></sub>= <it>x</it>(<it>start</it>(<it>f</it>)) + <it>y</it>(<it>start</it>(<it>f</it>)) + min{<it>Z</it>(<it>f'</it>): <it>f' </it><it>precedes f</it>};</p>
            <p><it>2. If visl</it>(<it>f</it>) <it>is defined, c</it><sub><it>l </it></sub>= <it>mincost</it><sub>0</sub>(<it>visl</it>(<it>f</it>)) + <it>d</it>(<it>f</it>) - <it>d</it>(<it>visl</it>(<it>f</it>));</p>
            <p><it>3. If visa</it>(<it>f</it>) <it>is defined, c</it><sub><it>a </it></sub>= <it>mincost</it><sub>0</sub>(<it>visa</it>(<it>f</it>)) + <it>d</it>(<it>visa</it>(<it>f</it>)) - <it>d</it>(<it>f</it>).</p>
         </sec>
         <sec>
            <st>
               <p>4.2 Outline of the algorithm</p>
            </st>
            <p>Based on Lemma 4.1, we now present the main steps of the algorithm in <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> computing the required optimal path, given a list <it>M </it>of fragments (represented as triples of integers). It uses a sweepline approach where successive rows are processed, and within rows, points are processed from left to right. Lexicographic sorting of (<it>x</it>, <it>y</it>)-values is needed. The algorithm consists of two main phases, one in which it computes visibility information, i.e., <it>visl</it>(<it>f</it>) and <it>visa</it>(<it>f</it>) for each fragment <it>f</it>, and the other in which it computes Recurrences (1)&#8211;(3) in Lemma 4.1.</p>
            <p>Not every row or column needs to contain a start point or an end point, and for efficiency we generally wish to skip empty rows and columns. For any row <it>x </it>(column <it>y</it>, resp.), let <it>R</it>(<it>x</it>) (<it>C</it>(<it>y</it>), resp.) be the index <it>i </it>such that <it>x </it>(<it>y</it>, resp.) is the <it>i</it>-th non-empty row (column, resp.). These values can be calculated within the same time bounds as the lexicographic sorting. From now on, we assume that the algorithm processes only non-empty rows and columns.</p>
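            <p>Skipping empty rows and columns amounts to ranking the coordinates that actually occur. A minimal Python sketch follows, in which R ranks row coordinates and C ranks column coordinates, in line with the loop bounds R(0) to R(n) of Algorithm FLCS; the helper name nonempty_ranks and the sample fragments are illustrative assumptions.</p>

```python
def nonempty_ranks(values):
    """Map each distinct coordinate to its 1-based rank among the distinct ones."""
    return {v: rank for rank, v in enumerate(sorted(set(values)), start=1)}


# Fragments as 1-based triples (i, j, k):
# start = (i - 1, j - 1), end = (i + k - 1, j + k - 1).
fragments = [(3, 1, 2), (3, 4, 2), (4, 6, 2)]
rows, cols = [], []
for i, j, k in fragments:
    rows += [i - 1, i + k - 1]
    cols += [j - 1, j + k - 1]

R = nonempty_ranks(rows)   # row coordinate -> index of its non-empty row
C = nonempty_ranks(cols)   # column coordinate -> index of its non-empty column
print(R)
print(C)
```

            <p>The ranking is dominated by one sort of the coordinates, so it stays within the time bound of the lexicographic sorting.</p>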
            <p>For the lexicographic sorting and both phases, we assume the existence of a data structure of type <it>D </it>that stores integers <it>j </it>in some range [0, <it>u</it>] and supports the following operations: (1) insert, (2) delete, (3) member, (4) min, (5) successor: given <it>j</it>, find the smallest value larger than <it>j </it>in <it>D</it>, (6) max: given <it>j</it>, find the largest value smaller than <it>j </it>in <it>D</it>, i.e., the predecessor of <it>j</it>. In our toolkit, <it>D </it>is implemented via balanced trees <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>. Therefore, if <it>d </it>elements are stored in it, each operation takes <it>O</it>(log <it>d</it>) time. More complex schemes are proposed and analyzed in <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>, yielding better asymptotic performance. With the mentioned data structures, lexicographic sorting of (<it>x</it>, <it>y</it>)-values can be done in <it>O</it>(<it>d </it>log <it>d</it>) time. In our case <it>u </it>&#8804; <it>n </it>+ <it>m </it>and <it>d </it>&#8804; |<it>M</it>|.</p>
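            <p>The six operations of a structure of type D can be illustrated with a sorted Python list and the standard bisect module. This is only a stand-in under stated assumptions: the class name SortedInts is hypothetical, and list insertion and deletion take linear time, whereas the balanced trees used in the toolkit take logarithmic time per operation.</p>

```python
import bisect


class SortedInts:
    """Sorted-list stand-in for D; queries are O(log d), updates are O(d)."""

    def __init__(self):
        self._keys = []

    def insert(self, j):
        pos = bisect.bisect_left(self._keys, j)
        if pos == len(self._keys) or self._keys[pos] != j:
            self._keys.insert(pos, j)

    def delete(self, j):
        pos = bisect.bisect_left(self._keys, j)
        if pos != len(self._keys) and self._keys[pos] == j:
            del self._keys[pos]

    def member(self, j):
        pos = bisect.bisect_left(self._keys, j)
        return pos != len(self._keys) and self._keys[pos] == j

    def min(self):
        return self._keys[0] if self._keys else None

    def successor(self, j):
        """Smallest stored value strictly larger than j, or None."""
        pos = bisect.bisect_right(self._keys, j)
        return self._keys[pos] if pos != len(self._keys) else None

    def max(self, j):
        """Largest stored value strictly smaller than j, or None."""
        pos = bisect.bisect_left(self._keys, j)
        return self._keys[pos - 1] if pos > 0 else None


d = SortedInts()
for value in (7, 2, 5):
    d.insert(value)
print(d.min(), d.successor(5), d.max(5))
```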
            <p>&#8226; <b>Visibility Computation</b>. We now briefly outline how to compute <it>visl</it>(<it>f</it>) and <it>visa</it>(<it>f</it>) for each fragment <it>f </it>via a sweepline algorithm. We describe the computation of <it>visl</it>(<it>f</it>); that of <it>visa</it>(<it>f</it>) is similar. For <it>visl</it>(<it>f</it>), the sweepline algorithm sweeps along successive rows. Assume that we have reached row <it>i</it>. We keep all fragments crossing row <it>i </it>sorted by diagonal number in a data structure <it>V</it>. For each fragment <it>f </it>such that <it>x</it>(<it>start</it>(<it>f</it>)) = <it>i</it>, we look up the fragment <it>f' </it>immediately to the left of <it>start</it>(<it>f</it>) in the sorted list of fragments; if it exists, <it>visl</it>(<it>f</it>) = <it>f'</it>. Then, for each fragment <it>f </it>with <it>x</it>(<it>start</it>(<it>f</it>)) = <it>i</it>, we insert <it>f </it>into <it>V</it>. Finally, we remove fragments <inline-formula><m:math name="1748-7188-2-10-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>f</m:mi><m:mo>^</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGMbGzgaqcaaaa@2E11@</m:annotation></m:semantics></m:math></inline-formula> such that <it>x</it>(<it>end</it>(<inline-formula><m:math name="1748-7188-2-10-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>f</m:mi><m:mo>^</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGMbGzgaqcaaaa@2E11@</m:annotation></m:semantics></m:math></inline-formula>)) = <it>i</it>, i.e., those whose end point lies in row <it>i</it>. If the data structure <it>V </it>is implemented as a balanced search tree, the total time for this computation is <it>O</it>(|<it>M</it>| log |<it>M</it>|).</p>
            <p>&#8226; <b>The Main Algorithm</b>. Again, we use a sweepline approach of processing successive rows. It follows the same paradigm as the Hunt-Szymanski LCS algorithm <abbrgrp><abbr bid="B34">34</abbr></abbrgrp> and the computation of the <it>RNA </it>secondary structure (with linear cost functions) <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>.</p>
            <p>We use another data structure <it>B </it>of type <it>D</it>, but this time <it>B </it>stores column numbers (and a fragment associated with each one). The values stored in <it>B </it>will represent the columns at which the minimum value of <it>Z</it>(<it>f</it>) decreases compared to any columns to the left, i.e. the columns containing an end point of a fragment <it>f </it>for which <it>Z</it>(<it>f</it>) is smaller than <it>Z</it>(<it>f'</it>) for any <it>f' </it>whose end point has already been processed and which is in a column to the left. Notice that, once we fix a row, <it>B </it>gives a partition of that row in terms of columns. Within a row, first process any start points in the row from left to right. For each start point of a fragment, compute <it>mincost</it><sub>0 </sub>using Lemma 4.1. Note that when the start point of a fragment <it>f </it>is processed, <it>mincost</it><sub>0 </sub>has already been computed for each fragment that precedes <it>f </it>and each fragment that is <it>visa</it>(<it>f</it>) or <it>visl</it>(<it>f</it>). To find the minimum value of <it>Z</it>(<it>f'</it>) over all predecessors <it>f' </it>of <it>f</it>, the data structure <it>B </it>is used. The minimum relevant value for <it>Z</it>(<it>f'</it>) is obtained from <it>B </it>by using the max operation to find the max <it>j </it>&lt;<it>y</it>(<it>start</it>(<it>f</it>)) in <it>B</it>; the fragment <it>f' </it>associated with that <it>j </it>is one for which <it>Z</it>(<it>f'</it>) is the minimum (based on endpoints processed so far) over all columns to the left of the column containing <it>start</it>(<it>f</it>), and in fact this value of <it>Z</it>(<it>f'</it>) is the minimum value over all predecessors of <it>f</it>.</p>
            <p>Algorithm <b>FLCS</b></p>
            <p>For each fragment <it>f</it>, compute <it>visl</it>(<it>f</it>) and <it>visa</it>(<it>f</it>)</p>
            <p><b>for </b><it>i </it>= <it>R</it>(0) to <it>R</it>(<it>n</it>) <b>do</b></p>
            <p>&#160;&#160;&#160;<b>for </b>each fragment <it>f </it>s.t. <it>x</it>(<it>start</it>(<it>f</it>)) = <it>i </it><b>do</b></p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;<it>f' </it>&#8592; max on <it>B </it>with key <it>y</it>(<it>start</it>(<it>f</it>))</p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;<b>if </b><it>f' </it>is defined <b>then</b></p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;compute <it>c</it><sub><it>p</it></sub></p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;<b>end if</b></p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;<b>if </b><it>visl</it>(<it>f</it>) is defined <b>then</b></p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;compute <it>c</it><sub><it>l</it></sub></p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;<b>end if</b></p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;<b>if </b><it>visa</it>(<it>f</it>) is defined <b>then</b></p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;compute <it>c</it><sub><it>a</it></sub></p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;<b>end if</b></p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;compute <it>mincost</it><sub>0</sub>(<it>f</it>)</p>
            <p>&#160;&#160;&#160;<b>end for</b></p>
            <p>&#160;&#160;&#160;<b>for </b>each fragment <it>f </it>s.t. <it>x</it>(<it>end</it>(<it>f</it>)) = <it>i </it><b>do</b></p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;<it>f' </it>&#8592; max on <it>B </it>with key <it>y</it>(<it>end</it>(<it>f</it>)) + 1</p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;<b>if </b><it>f' </it>is not defined <b>or </b><it>Z</it>(<it>f</it>) &lt;<it>Z</it>(<it>f'</it>) <b>then</b></p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;INSERT <it>f </it>into <it>B </it>with key <it>y</it>(<it>end</it>(<it>f</it>))</p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;<b>end if</b></p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;<b>for </b>each fragment <it>f' </it>:= SUCCESSOR(<it>f</it>) in <it>B </it>such that <it>Z</it>(<it>f'</it>) &#8805; <it>Z</it>(<it>f</it>) <b>do</b></p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;DELETE(<it>f'</it>) from <it>B</it></p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;<b>end for</b></p>
            <p>&#160;&#160;&#160;<b>end for</b></p>
            <p><b>end for</b></p>
            <p>After any start points for a row have been processed, process the end points. When an end point of a fragment <it>f </it>is processed, <it>B </it>is updated as necessary if <it>Z</it>(<it>f</it>) represents a new minimum value at the column <it>y</it>(<it>end</it>(<it>f</it>)); successor and deletion operations may be needed to find and remove any values that have been superseded by the new minimum value. Algorithm <b>FLCS </b>gives the pseudo-code of the method just outlined, with the visibility computation omitted for conciseness. In conclusion, we have:</p>
            <p><b>Theorem 4.2 </b><abbrgrp><abbr bid="B13">13</abbr></abbrgrp><it>Suppose X </it>[1 : <it>n</it>] <it>and Y </it>[1 : <it>m</it>] <it>are strings, and a set M of fragments relating substrings of X and Y is given. One can compute the LCS from Fragments in O</it>(|<it>M</it>|log|<it>M</it>|) <it>time and O</it>(|<it>M</it>|) <it>space using standard balanced search tree schemes</it>.</p>
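            <p>The way B is queried at start points and updated at end points, as described before Theorem 4.2, can be sketched as follows. The class MinStaircase and its method names are hypothetical, and a sorted Python list replaces the balanced tree, so updates are not logarithmic here. The invariant is that the stored Z-values strictly decrease as the column grows, which is why a single predecessor lookup returns the minimum over all columns to the left.</p>

```python
import bisect


class MinStaircase:
    """Columns at which the running minimum of Z strictly decreases."""

    def __init__(self):
        self._cols = []   # sorted column numbers
        self._z = []      # _z[t] is the Z-value attached to _cols[t]

    def query(self, y):
        """Minimum Z among stored end points in columns strictly left of y."""
        pos = bisect.bisect_left(self._cols, y)
        return self._z[pos - 1] if pos > 0 else None

    def update(self, col, z):
        """Record an end point in column col whose fragment has value z."""
        pos = bisect.bisect_right(self._cols, col)   # keys not exceeding col
        if pos > 0 and z >= self._z[pos - 1]:
            return                                   # not a new minimum here
        if pos > 0 and self._cols[pos - 1] == col:
            pos -= 1                                 # same column: overwrite
            del self._cols[pos], self._z[pos]
        self._cols.insert(pos, col)
        self._z.insert(pos, z)
        # Remove successors that the new minimum has superseded.
        while pos + 1 != len(self._cols) and self._z[pos + 1] >= z:
            del self._cols[pos + 1], self._z[pos + 1]


B = MinStaircase()
B.update(5, -3)
B.update(2, -1)
print(B.query(3), B.query(6))   # minimum left of column 3, then of column 6
B.update(3, -5)                 # supersedes the entry stored at column 5
print(B.query(6))
```

            <p>Each end point is inserted at most once and deleted at most once, which is what keeps the total number of operations on B proportional to the number of fragments.</p>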
         </sec>
         <sec>
            <st>
               <p>4.3 The C/C++ library functions</p>
            </st>
            <p>The function below computes the longest common subsequence from fragments and returns the corresponding alignment.</p>
            <p>
               <b>Synopsis</b>
            </p>
            <p>
               <b>#include "flcs.h"</b>
            </p>
            <p>
               <ul>ALIGNMENTS</ul>
            </p>
            <p><b><ul>flcs</ul></b> (<ul>char</ul><it><ul>*X</ul></it>, <ul>char</ul><it><ul>*Y</ul></it>, <ul>FRAGSET</ul><it><ul>M</ul></it>);</p>
            <p><b>Arguments</b>:</p>
            <p>&#8226; <it><ul>X</ul></it>: points to a string;</p>
            <p>&#8226; <it><ul>Y</ul></it>: points to a string;</p>
            <p>&#8226; <it><ul>M</ul></it>: points to a <ul>FRAGSET_STRUCT</ul> that represents a set of fragments.</p>
            <p><b>Return Values</b>: A pointer to <ul>ALIGNMENTS_STRUCT</ul>, which is defined as:</p>
            <p>typedef struct <ul>alignments</ul></p>
            <p>{</p>
            <p>&#160;&#160;&#160;<ul>double </ul><it><ul>distance</ul></it>;</p>
            <p>&#160;&#160;&#160;<ul>char</ul><it><ul>*X</ul></it>;</p>
            <p>&#160;&#160;&#160;<ul>char</ul><it><ul>*Y</ul></it>;</p>
            <p>&#160;&#160;&#160;struct <ul>alignments</ul><it><ul> *next</ul></it>;</p>
            <p>} <ul>ALIGNMENTS_STRUCT</ul>, <it><ul>*ALIGNMENTS</ul></it>;</p>
            <p>where:</p>
            <p>&#8226; <it><ul>distance</ul></it>: is the Levenshtein Distance between strings <it><ul>X</ul></it> and <it><ul>Y</ul></it>, computed using only fragments;</p>
            <p>&#8226; <it><ul>X</ul></it>: is a pointer to the aligned string <it><ul>X</ul></it>, i.e., the string with appropriate spacers inserted;</p>
            <p>&#8226; <it><ul>Y</ul></it>: is a pointer to the aligned string <it><ul>Y</ul></it>, with appropriate spacers inserted.</p>
            <p>One can create a set of fragments from all the matching <it>k</it>-tuples between <it><ul>X</ul></it> and <it><ul>Y</ul></it>, using the function:</p>
            <p>
               <ul>FRAGSET</ul>
            </p>
            <p><b><ul>fragset_create_ktuples </ul></b>(<ul>char</ul><it><ul>*X</ul></it>, <ul>char</ul><it><ul>*Y</ul></it>, <ul>int</ul><it><ul>k</ul></it>);</p>
            <p>where:</p>
            <p>&#8226; <it><ul>X</ul></it>: points to a string;</p>
            <p>&#8226; <it><ul>Y</ul></it>: points to a string;</p>
            <p>&#8226; <it><ul>k</ul></it>: is the fragment length.</p>
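            <p>What a set of matching k-tuples looks like can be illustrated with a short Python sketch. It does not call the library: the function ktuple_fragments is hypothetical, positions are assumed to be 1-based as in the triples (i, j, k) of section 4.1, and the actual C routine may differ, for instance in its indexing or in how it treats overlapping k-tuples.</p>

```python
from collections import defaultdict


def ktuple_fragments(x, y, k):
    """Return 1-based triples (i, j, k) with x[i..i+k-1] == y[j..j+k-1]."""
    positions = defaultdict(list)          # k-tuple of y -> its start positions
    for j in range(len(y) - k + 1):
        positions[y[j:j + k]].append(j + 1)
    fragments = []
    for i in range(len(x) - k + 1):
        for j in positions.get(x[i:i + k], ()):
            fragments.append((i + 1, j, k))
    return fragments


# The strings of Figure 2 with k = 2.
print(ktuple_fragments("CDABAC", "ABCABBA", 2))
```

            <p>Indexing the k-tuples of one string first means each k-tuple of the other string is looked up once, so the work is proportional to the input length plus the number of fragments reported.</p>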
            <p>Auxiliary functions destroying, creating or incrementally updating a set of fragments are the following:</p>
            <p>
               <ul>void</ul>
            </p>
            <p><b><ul>fragset_destroy</ul></b>(<ul>FRAGSET</ul><it><ul> fragset</ul></it>);</p>
            <p>
               <ul>FRAGSET</ul>
            </p>
            <p><b><ul>fragset_create</ul></b>(<ul>int</ul><it><ul>*max_cardinality</ul></it>);</p>
            <p>
               <ul>int</ul>
            </p>
            <p><b><ul>fragset_frag_add</ul></b>(<ul>FRAGSET </ul><it><ul>fragset</ul></it>, <ul>int </ul><it><ul>i</ul></it>, <ul>int</ul><it><ul> j</ul></it>, <ul>int</ul><it><ul> length</ul></it>);</p>
            <p>where</p>
            <p>&#8226; <it><ul>fragset</ul></it>: points to a FRAGSET_STRUCT;</p>
            <p>&#8226; <it><ul>i</ul></it>: fragment starting position in the first string <it><ul>X</ul></it>;</p>
            <p>&#8226; <it><ul>j</ul></it>: fragment starting position in the second string <it><ul>Y</ul></it>;</p>
            <p>&#8226; <it><ul>length</ul></it>: fragment length.</p>
         </sec>
         <sec>
            <st>
               <p>4.4 The PERL library functions</p>
            </st>
            <p>The function FLCS computes the longest common subsequence from fragments. It returns the corresponding alignment.</p>
            <p>
               <b>Synopsis</b>
            </p>
            <p>
               <b>use BSAT::FLCS;</b>
            </p>
            <p>FLCS <it>X Y Frags</it></p>
            <p><b>Arguments</b>:</p>
            <p>&#8226; <it>X</it>: is a scalar containing string X.</p>
            <p>&#8226; <it>Y</it>: is a scalar containing string Y.</p>
            <p>&#8226; <it>Frags</it>: is a hash reference (see below).</p>
            <p><b>Return values: </b>FLCS returns a hash corresponding to the alignment between X and Y:</p>
            <p>my %alignment = (</p>
            <p>&#160;&#160;&#160;distance => 0,</p>
            <p>&#160;&#160;&#160;X => "",</p>
            <p>&#160;&#160;&#160;Y => "");</p>
            <p>where:</p>
            <p>&#8226; distance: is a scalar containing the Levenshtein Distance between strings <it><ul>X</ul></it> and <it><ul>Y</ul></it>, computed using only fragments;</p>
            <p>&#8226; X: is a scalar containing the aligned string X;</p>
            <p>&#8226; Y: is a scalar containing the aligned string Y.</p>
            <p>The hash reference Frags is defined as:</p>
            <p>my %Frags = (</p>
            <p>&#160;&#160;&#160;K => 0,</p>
            <p>&#160;&#160;&#160;Set => []);</p>
            <p>where:</p>
            <p>&#8226; K: is a scalar giving the fragment length;</p>
            <p>&#8226; Set: is an array of three elements (<it>i</it>, <it>j</it>, <it>length</it>) specifying a fragment.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>5 Edit distance with gaps</p>
         </st>
         <sec>
            <st>
               <p>5.1 The dynamic programming recurrences</p>
            </st>
            <p>We refer to the edit operations of substitution of one symbol for another (point mutation), deletion of a single symbol, and insertion of a single symbol as <it>basic operations</it>. They are related in a natural way to the differences introduced in section 3. Let a <it>gap </it>be a run of consecutive deleted symbols in one string or inserted symbols in the other string. With the basic set of operations, the cost of a gap is the sum of the costs of the individual insertions or deletions that compose it. Therefore, a gap is considered as a sequence of homogeneous elementary events (insertion or deletion) rather than as an elementary event itself. However, both theoretical and experimental considerations <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B14">14</abbr><abbr bid="B35">35</abbr></abbrgrp> suggest that the cost <it>w</it>(<it>i</it>, <it>j</it>) of a generic gap <it>X</it>[<it>i</it>, <it>j</it>] should be of the form</p>
            <p>
               <display-formula id="M4"><it>w</it>(<it>i</it>, <it>j</it>) = <it>f</it><sup>1</sup>(<it>X</it>[<it>i</it>]) + <it>f</it><sup>2</sup>(<it>X</it>[<it>j</it>]) + <it>g</it>(<it>j </it>- <it>i</it>)</display-formula>
            </p>
            <p>where <it>f</it><sup>1 </sup>and <it>f</it><sup>2 </sup>are the costs of breaking the string at the endpoints of the gap and <it>g </it>is a function that increases with the gap length.</p>
            <p>In molecular biology, the most likely choices for <it>g </it>are affine or concave functions of the gap lengths, e.g., <it>g</it>(&#8467;) = <it>c</it><sub>1 </sub>+ <it>c</it><sub>2</sub>&#8467; or <it>g</it>(&#8467;) = <it>c</it><sub>1 </sub>+ <it>c</it><sub>2 </sub>log &#8467;, where <it>c</it><sub>1 </sub>and <it>c</it><sub>2 </sub>are constants. With such a choice of <it>g</it>, the cost of a long gap is less than or equal to the sum of the costs of any partition of the gap into smaller gaps. That is, each gap is treated as a unit. Such a constraint on <it>g </it>induces a constraint on the function <it>w</it>. Indeed, <it>w </it>must satisfy the following inequality, known as the <it>concave Monge condition </it><abbrgrp><abbr bid="B7">7</abbr></abbrgrp>:</p>
            <p>
               <display-formula id="M5"><it>w</it>(<it>a</it>, <it>c</it>) + <it>w</it>(<it>b</it>, <it>d</it>) &#8805; <it>w</it>(<it>b</it>, <it>c</it>) + <it>w</it>(<it>a</it>, <it>d</it>) for all <it>a </it>&lt;<it>b </it>and <it>c </it>&lt;<it>d</it>.</display-formula>
            </p>
            <p>This inequality is extremely useful, since it yields speed-ups in Dynamic Programming <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>.</p>
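            <p>The concave Monge condition can be checked numerically for small instances. The following Python sketch makes several simplifying assumptions that are not in the text: the breaking costs are set to zero, so that w(i, j) = g(j - i); only quadruples a, b, c, d in strictly increasing order are tested, so that every gap has positive length; and a small tolerance absorbs floating point rounding. The function name and the constants are illustrative.</p>

```python
import math
from itertools import combinations


def satisfies_concave_monge(w, size, tol=1e-12):
    """Check w(a,c) + w(b,d) >= w(b,c) + w(a,d) on all ordered quadruples."""
    for a, b, c, d in combinations(range(size + 1), 4):
        # combinations yields a, b, c, d in strictly increasing order, so
        # every interval below is a proper gap with positive length.
        if w(a, c) + w(b, d) + tol >= w(b, c) + w(a, d):
            continue
        return False
    return True


c1, c2 = 4.0, 1.0
affine = lambda i, j: c1 + c2 * (j - i)
logarithmic = lambda i, j: c1 + c2 * math.log(j - i)
quadratic = lambda i, j: float((j - i) ** 2)   # convex, so it should fail

print(satisfies_concave_monge(affine, 12),
      satisfies_concave_monge(logarithmic, 12),
      satisfies_concave_monge(quadratic, 12))
```

            <p>For an affine g the two sides coincide, since the four gap lengths add up to the same total on both sides; the inequality is strict only when g is strictly concave.</p>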
            <p>The gap sequence alignment problem can be solved by computing the following dynamic programming equation (<it>w' </it>is a cost function analogous to <it>w</it>):</p>
            <p>
               <display-formula id="M6"><it>D</it>[<it>i</it>, <it>j</it>] = min{<it>D</it>[<it>i </it>- 1, <it>j </it>- 1] + <it>sub</it>(<it>X</it>[<it>i</it>], <it>Y</it>[<it>j</it>]), <it>E</it>[<it>i</it>, <it>j</it>], <it>F</it>[<it>i</it>, <it>j</it>]}</display-formula>
            </p>
            <p>where</p>
            <p>
               <display-formula id="M7">
                  <m:math name="1748-7188-2-10-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>E</m:mi>
                           <m:mrow>
                              <m:mo>[</m:mo>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mo>,</m:mo>
                                 <m:mi>j</m:mi>
                              </m:mrow>
                              <m:mo>]</m:mo>
                           </m:mrow>
                           <m:mo>=</m:mo>
                           <m:munder>
                              <m:mrow>
                                 <m:mi>min</m:mi>
                                 <m:mo>&#8289;</m:mo>
                              </m:mrow>
                              <m:mrow>
                                 <m:mn>0</m:mn>
                                 <m:mo>&#8804;</m:mo>
                                 <m:mi>k</m:mi>
                                 <m:mo>&#8804;</m:mo>
                                 <m:mi>j</m:mi>
                                 <m:mo>&#8722;</m:mo>
                                 <m:mn>1</m:mn>
                              </m:mrow>
                           </m:munder>
                           <m:mrow>
                              <m:mo>{</m:mo>
                              <m:mrow>
                                 <m:mi>D</m:mi>
                                 <m:mrow>
                                    <m:mo>[</m:mo>
                                    <m:mrow>
                                       <m:mi>i</m:mi>
                                       <m:mo>,</m:mo>
                                       <m:mi>k</m:mi>
                                    </m:mrow>
                                    <m:mo>]</m:mo>
                                 </m:mrow>
                                 <m:mo>+</m:mo>
                                 <m:mi>w</m:mi>
                                 <m:mrow>
                                    <m:mo>(</m:mo>
                                    <m:mrow>
                                       <m:mi>k</m:mi>
                                       <m:mo>,</m:mo>
                                       <m:mi>j</m:mi>
                                    </m:mrow>
                                    <m:mo>)</m:mo>
                                 </m:mrow>
                              </m:mrow>
                              <m:mo>}</m:mo>
                           </m:mrow>
                           <m:mo>,</m:mo>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGfbqrdaWadaqaaiabdMgaPjabcYcaSiabdQgaQbGaay5waiaaw2faaiabg2da9maaxababaGagiyBa0MaeiyAaKMaeiOBa4galeaacqaIWaamcqGHKjYOcqWGRbWAcqGHKjYOcqWGQbGAcqGHsislcqaIXaqmaeqaaOWaaiWabeaacqWGebardaWadaqaaiabdMgaPjabcYcaSiabdUgaRbGaay5waiaaw2faaiabgUcaRiabdEha3naabmaabaGaem4AaSMaeiilaWIaemOAaOgacaGLOaGaayzkaaaacaGL7bGaayzFaaGaeiilaWcaaa@52D2@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>
               <display-formula id="M8">
                  <m:math name="1748-7188-2-10-i3" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>F</m:mi>
                           <m:mrow>
                              <m:mo>[</m:mo>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mo>,</m:mo>
                                 <m:mi>j</m:mi>
                              </m:mrow>
                              <m:mo>]</m:mo>
                           </m:mrow>
                           <m:mo>=</m:mo>
                           <m:munder>
                              <m:mrow>
                                 <m:mi>min</m:mi>
                                 <m:mo>&#8289;</m:mo>
                              </m:mrow>
                              <m:mrow>
                                 <m:mn>0</m:mn>
                                 <m:mo>&#8804;</m:mo>
                                 <m:mi>l</m:mi>
                                 <m:mo>&#8804;</m:mo>
                                 <m:mi>i</m:mi>
                                 <m:mo>&#8722;</m:mo>
                                 <m:mn>1</m:mn>
                              </m:mrow>
                           </m:munder>
                           <m:mrow>
                              <m:mo>{</m:mo>
                              <m:mrow>
                                 <m:mi>D</m:mi>
                                 <m:mrow>
                                    <m:mo>[</m:mo>
                                    <m:mrow>
                                       <m:mi>l</m:mi>
                                       <m:mo>,</m:mo>
                                       <m:mi>j</m:mi>
                                    </m:mrow>
                                    <m:mo>]</m:mo>
                                 </m:mrow>
                                 <m:mo>+</m:mo>
                                 <m:msup>
                                    <m:mi>w</m:mi>
                                    <m:mo>&#8242;</m:mo>
                                 </m:msup>
                                 <m:mrow>
                                    <m:mo>(</m:mo>
                                    <m:mrow>
                                       <m:mi>l</m:mi>
                                       <m:mo>,</m:mo>
                                       <m:mi>i</m:mi>
                                    </m:mrow>
                                    <m:mo>)</m:mo>
                                 </m:mrow>
                              </m:mrow>
                              <m:mo>}</m:mo>
                           </m:mrow>
                           <m:mo>,</m:mo>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGgbGrdaWadaqaaiabdMgaPjabcYcaSiabdQgaQbGaay5waiaaw2faaiabg2da9maaxababaGagiyBa0MaeiyAaKMaeiOBa4galeaacqaIWaamcqGHKjYOcqWGSbaBcqGHKjYOcqWGPbqAcqGHsislcqaIXaqmaeqaaOWaaiWabeaacqWGebardaWadaqaaiabdMgaPjabcYcaSiabdQgaQbGaay5waiaaw2faaiabgUcaRiqbdEha3zaafaWaaeWaaeaacqWGSbaBcqGGSaalcqWGPbqAaiaawIcacaGLPaaaaiaawUhacaGL9baacqGGSaalaaa@52DE@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p><it>sub </it>is a symbol substitution cost matrix and the initial conditions of recurrence (6) are <it>D</it>[<it>i</it>, 0] = <it>w'</it>(0, <it>i</it>), 1 &#8804; <it>i </it>&#8804; <it>m </it>and <it>D</it>[0, <it>j</it>] = <it>w</it>(0, <it>j</it>), 1 &#8804; <it>j </it>&#8804; <it>n</it>.</p>
            <p>We observe that the computation of recurrence (6) consists of <it>n </it>+ <it>m </it>interleaved subproblems that have the following general form: Compute</p>
            <p>
               <display-formula id="M9">
                  <m:math name="1748-7188-2-10-i4" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mtable>
                              <m:mtr>
                                 <m:mtd>
                                    <m:mrow>
                                       <m:mi>E</m:mi>
                                       <m:mrow>
                                          <m:mo>[</m:mo>
                                          <m:mi>j</m:mi>
                                          <m:mo>]</m:mo>
                                       </m:mrow>
                                       <m:mo>=</m:mo>
                                       <m:munder>
                                          <m:mrow>
                                             <m:mi>min</m:mi>
                                             <m:mo>&#8289;</m:mo>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mn>0</m:mn>
                                             <m:mo>&#8804;</m:mo>
                                             <m:mi>k</m:mi>
                                             <m:mo>&#8804;</m:mo>
                                             <m:mi>j</m:mi>
                                             <m:mo>&#8722;</m:mo>
                                             <m:mn>1</m:mn>
                                          </m:mrow>
                                       </m:munder>
                                       <m:mrow>
                                          <m:mo>{</m:mo>
                                          <m:mrow>
                                             <m:mi>D</m:mi>
                                             <m:mrow>
                                                <m:mo>[</m:mo>
                                                <m:mi>k</m:mi>
                                                <m:mo>]</m:mo>
                                             </m:mrow>
                                             <m:mo>+</m:mo>
                                             <m:mi>w</m:mi>
                                             <m:mrow>
                                                <m:mo>(</m:mo>
                                                <m:mrow>
                                                   <m:mi>k</m:mi>
                                                   <m:mo>,</m:mo>
                                                   <m:mi>j</m:mi>
                                                </m:mrow>
                                                <m:mo>)</m:mo>
                                             </m:mrow>
                                          </m:mrow>
                                          <m:mo>}</m:mo>
                                       </m:mrow>
                                       <m:mo>,</m:mo>
                                    </m:mrow>
                                 </m:mtd>
                                 <m:mtd>
                                    <m:mrow>
                                       <m:mi>j</m:mi>
                                       <m:mo>=</m:mo>
                                       <m:mn>1</m:mn>
                                       <m:mo>,</m:mo>
                                       <m:mo>&#8943;</m:mo>
                                       <m:mo>,</m:mo>
                                       <m:mi>n</m:mi>
                                    </m:mrow>
                                 </m:mtd>
                              </m:mtr>
                           </m:mtable>
                           <m:mo>,</m:mo>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaafaqabeqacaaabaGaemyrau0aamWaaeaacqWGQbGAaiaawUfacaGLDbaacqGH9aqpdaWfqaqaaiGbc2gaTjabcMgaPjabc6gaUbWcbaGaeGimaaJaeyizImQaem4AaSMaeyizImQaemOAaOMaeyOeI0IaeGymaedabeaakmaacmqabaGaemiraq0aamWaaeaacqWGRbWAaiaawUfacaGLDbaacqGHRaWkcqWG3bWDdaqadaqaaiabdUgaRjabcYcaSiabdQgaQbGaayjkaiaawMcaaaGaay5Eaiaaw2haaiabcYcaSaqaaiabdQgaQjabg2da9iabigdaXiabcYcaSiabl+UimjabcYcaSiabd6gaUbaacqGGSaalaaa@57AF@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p><it>D</it>[0] is given and for every <it>k </it>= 1,..., <it>n</it>, <it>D </it>[<it>k</it>] is easily computed from <it>E</it>[<it>k</it>]. We now concentrate on a general algorithm computing (9).</p>
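            <p>As a point of reference, recurrence (9) can be evaluated directly in quadratic time. The following C fragment is a minimal sketch of that baseline and is not part of the library. The names <it>recurrence_naive</it>, <it>gap_cost_fn</it> and <it>example_gap</it> are hypothetical, and we assume for simplicity that <it>D</it>[<it>j</it>] = <it>E</it>[<it>j</it>], whereas in the alignment recurrence <it>D</it>[<it>j</it>] also depends on other terms.</p>
            <p><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Gap cost w(k, j), defined for 0 <= k < j <= n. */
typedef double (*gap_cost_fn)(int k, int j);

/* Example concave gap cost, for illustration only:
 * g(l) = min(l, 2 + l/4) with l = j - k. */
double example_gap(int k, int j)
{
    double len = (double)(j - k);
    double flat = 2.0 + 0.25 * len;
    return len < flat ? len : flat;
}

/*
 * Direct evaluation of E[j] = min_{0 <= k <= j-1} { D[k] + w(k, j) },
 * for j = 1..n, in O(n^2) time.
 * ASSUMPTION: D[j] = E[j]. D and E must hold n + 1 entries; E[0] is unused.
 */
void recurrence_naive(int n, double d0, gap_cost_fn w, double *D, double *E)
{
    assert(n >= 1 && w != NULL && D != NULL && E != NULL);
    D[0] = d0;
    for (int j = 1; j <= n; j++) {
        double best = D[0] + w(0, j);
        for (int k = 1; k < j; k++) {
            double b = D[k] + w(k, j);
            if (b < best)
                best = b;
        }
        E[j] = best;
        D[j] = E[j];   /* simplifying assumption, see above */
    }
}
```
]]></p>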
         </sec>
         <sec>
            <st>
               <p>5.2 The GG algorithm</p>
            </st>
            <p>From now on, unless otherwise specified, we assume that <it>w </it>satisfies the concave Monge condition (5). An important notion related to the concave Monge condition is concave total monotonicity of an <it>s </it>&#215; <it>p </it>matrix <it>A</it>. <it>A </it>is <it>concave totally monotone </it>if and only if</p>
            <p>
               <display-formula id="M10"><it>A</it>[<it>a</it>, <it>c</it>] &#8804; <it>A</it>[<it>b</it>, <it>c</it>] &#8658; <it>A</it>[<it>a</it>, <it>d</it>] &#8804; <it>A</it>[<it>b</it>, <it>d</it>]</display-formula>
            </p>
            <p>for all <it>a </it>&lt;<it>b </it>and <it>c </it>&lt;<it>d</it>.</p>
            <p>It is easy to check that if <it>w </it>is seen as a two-dimensional matrix, the concave Monge condition implies concave total monotonicity of <it>w</it>. Notice that the converse is not true. Total monotonicity and the Monge condition of a matrix <it>A </it>are relevant to the design of algorithms because of the following observations. Let <it>r</it><sub><it>j </it></sub>denote the row index such that <it>A</it>[<it>r</it><sub><it>j</it></sub>, <it>j</it>] is the minimum value in column <it>j</it>. Concave total monotonicity implies that the minimum row indices are nonincreasing, i.e., <it>r</it><sub>1 </sub>&#8805; <it>r</it><sub>2 </sub>&#8805; ... &#8805; <it>r</it><sub><it>m</it></sub>. We say that an element <it>A</it>[<it>i</it>, <it>j</it>] is <it>dead </it>if <it>i </it>&#8800; <it>r</it><sub><it>j </it></sub>(i.e., <it>A</it>[<it>i</it>, <it>j</it>] is not the minimum of column <it>j</it>). A submatrix of <it>A </it>is dead if all of its elements are dead.</p>
            <p>Let <it>B</it>[<it>i</it>, <it>j</it>] = <it>D</it>[<it>i</it>] + <it>w</it>(<it>i</it>, <it>j</it>), for 0 &#8804; <it>i </it>&#8804; <it>j </it>&#8804; <it>n</it>. We say that <it>B</it>[<it>i</it>, <it>j</it>] is <it>available </it>if <it>D</it>[<it>i</it>] is known and therefore <it>B</it>[<it>i</it>, <it>j</it>] can be computed in constant time. That is, <it>B</it>[<it>i</it>, <it>j</it>] is available only when the column minima for columns 1, 2,..., <it>i </it>have been found. We say that <it>B </it>is <it>on-line</it>, since its entries become available as the computation proceeds.</p>
            <p>The computation of recurrence (9) reduces to the identification of the column minima in an on-line upper triangular matrix <it>B</it>. One can easily show that when <it>w </it>satisfies the concave Monge condition, <it>B </it>is totally monotone. We make use of this fact to obtain an efficient algorithm.</p>
            <p>The algorithm outlined here finds column minima one at a time and processes available entries so that it keeps only possible candidates for future column minima. In the concave case, we use a stack to maintain the candidates. The algorithm can be sketched as follows (a proof of correctness can be found in <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>).</p>
            <p>For each <it>j</it>, 2 &#8804; <it>j </it>&#8804; <it>n</it>, we find the minimum at column <it>j </it>as follows. Assume that (<it>i</it><sub>1</sub>, <it>h</it><sub>1</sub>),..., (<it>i</it><sub><it>k</it></sub>, <it>h</it><sub><it>k</it></sub>) are on the stack ((<it>i</it><sub>1</sub>, <it>h</it><sub>1</sub>) is at the top of the stack). Initially, (0, <it>n</it>) is on the stack. The invariant on the stack elements is that in submatrix <it>B</it>[0 : <it>j </it>- 2, <it>j </it>: <it>n</it>] row <it>i</it><sub><it>r</it></sub>, for 1 &#8804; <it>r </it>&#8804; <it>k</it>, is the best (gives the minimum) in the column interval [<it>h</it><sub><it>r</it>-1 </sub>+ 1, <it>h</it><sub><it>r</it></sub>] (assuming <it>h</it><sub>0 </sub>+ 1 = <it>j</it>). By the concave total monotonicity of <it>B</it>, <it>i</it><sub>1</sub>,..., <it>i</it><sub><it>k </it></sub>are nonincreasing. Thus the minimum at column <it>j </it>is the minimum of <it>B</it>[<it>i</it><sub>1</sub>, <it>j</it>] and <it>B</it>[<it>j </it>- 1, <it>j</it>].</p>
            <p>Now we update the stack with row <it>j </it>- 1 as follows.</p>
            <p>(<b>GG1) </b>If <it>B</it>[<it>i</it><sub>1</sub>, <it>j</it>] &#8804; <it>B</it>[<it>j </it>- 1, <it>j</it>], row <it>j </it>- 1 is dead by concave total monotonicity. If <it>h</it><sub>1 </sub>= <it>j</it>, we pop the top element because it will not be useful.</p>
            <p>(<b>GG2) </b>If <it>B</it>[<it>i</it><sub>1</sub>, <it>j</it>] > <it>B</it>[<it>j </it>- 1, <it>j</it>], we compare row <it>j </it>- 1 with row <it>i</it><sub><it>r </it></sub>at <it>h</it><sub><it>r </it></sub>(i.e., <it>B</it>[<it>i</it><sub><it>r</it></sub>, <it>h</it><sub><it>r</it></sub>] vs. <it>B</it>[<it>j </it>- 1, <it>h</it><sub><it>r</it></sub>]), for <it>r </it>= 1, 2,..., until row <it>i</it><sub><it>r </it></sub>is better than row <it>j </it>- 1 at <it>h</it><sub><it>r</it></sub>. If row <it>j </it>- 1 is better than row <it>i</it><sub><it>r </it></sub>at <it>h</it><sub><it>r</it></sub>, row <it>i</it><sub><it>r </it></sub>cannot give the minimum for any column because row <it>j </it>- 1 is better than row <it>i</it><sub><it>r </it></sub>for column <it>l </it>&#8804; <it>h</it><sub><it>r </it></sub>and row <it>i</it><sub><it>r</it>+1 </sub>is better than row <it>i</it><sub><it>r </it></sub>for column <it>l </it>> <it>h</it><sub><it>r</it></sub>. We pop the element (<it>i</it><sub><it>r</it></sub>, <it>h</it><sub><it>r</it></sub>) from the stack and continue to compare row <it>j </it>- 1 with row <it>i</it><sub><it>r</it>+1</sub>. If row <it>i</it><sub><it>r </it></sub>is better than row <it>j </it>- 1 at <it>h</it><sub><it>r</it></sub>, we need to find the border of the two rows <it>j </it>- 1 and <it>i</it><sub><it>r</it></sub>, which is the largest <it>h </it>&lt;<it>h</it><sub><it>r </it></sub>such that row <it>j </it>- 1 is better than row <it>i</it><sub><it>r </it></sub>for column <it>l </it>&#8804; <it>h</it>; i.e., finding the zero <it>z </it>of <it>f</it>(<it>x</it>) = <it>B</it>[<it>j </it>- 1, <it>x</it>] - <it>B</it>[<it>i</it><sub><it>r</it></sub>, <it>x</it>] = <it>w</it>(<it>j </it>- 1, <it>x</it>) - <it>w</it>(<it>i</it><sub><it>r</it></sub>, <it>x</it>) + (<it>D</it>[<it>j </it>- 1] - <it>D</it>[<it>i</it><sub><it>r</it></sub>]), then <it>h </it>= &#8970;<it>z</it>&#8971;. 
If <it>h </it>&#8805; <it>j </it>+1, we push (<it>j </it>- 1, <it>h</it>) into the stack.</p>
            <p>In the pseudo-code of Algorithm <b>GG</b>, let <it>I</it>(<it>top</it>) and <it>H</it>(<it>top</it>) denote (<it>i</it><sub>1</sub>, <it>h</it><sub>1</sub>). Moreover, let <it>CLOSEST</it>(<it>j </it>- 1, <it>I</it>(<it>top</it>)) be a function that returns the zero of <it>f</it>(<it>x</it>) (defined in step <b>GG2</b>) closest to <it>j </it>- 1. Notice that, using the monotonicity conditions on <it>w</it>, <it>CLOSEST</it>(<it>j </it>- 1, <it>I</it>(<it>top</it>)) can be computed in <it>O</it>(log <it>n</it>) time. Moreover, we say that <it>f </it>satisfies the <it>closest zero property </it>if such a zero can be computed in constant time. We also notice that when <it>w </it>is a linear function, <it>f </it>obviously satisfies the closest zero property. Moreover, for linear functions, lines 9&#8211;20 of Algorithm <b>GG </b>become useless since only one element can be on the stack: the winner (the minimum) of the comparison on line 5 of the algorithm. We have:</p>
            <p><b>Theorem 5.1 </b><it>Recurrence (9) can be computed in O</it>(<it>n </it>log <it>n</it>) <it>time when w satisfies the concave Monge conditions. The time reduces to O</it>(<it>n</it>) <it>when the closest zero property is satisfied or w is linear. Therefore, given two strings X and Y, their edit distance with gaps can be computed in O</it>(<it>nm </it>log max(<it>n</it>, <it>m</it>)) <it>time when both w and w' satisfy the concave Monge conditions, and in O</it>(<it>nm</it>) <it>time when both functions satisfy the closest zero property or are affine gap costs</it>.</p>
            <p>Two remarks are in order regarding the implementation of the <b>GG </b>algorithm provided here:</p>
            <p>1: Algorithm <b>GG</b></p>
            <p>2: <b>push </b>(0, <it>n</it>) on <it>S</it></p>
            <p>3: <b>for </b><it>j </it>:= 2 <b>to </b><it>n </it><b>do</b></p>
            <p>4: &#160;&#160;&#160;&#8467; &#8592; <it>I</it>(<it>top</it>)</p>
            <p>5: &#160;&#160;&#160;<b>if </b><it>B</it>[<it>j </it>- 1, <it>j</it>] &#8805; <it>B</it>[&#8467;, <it>j</it>] <b>then</b></p>
            <p>6: &#160;&#160;&#160;&#160;&#160;&#160;min is <it>B</it>[&#8467;, <it>j</it>]</p>
            <p>7: &#160;&#160;&#160;<b>else</b></p>
            <p>8: &#160;&#160;&#160;&#160;&#160;&#160;min is <it>B</it>[<it>j </it>- 1, <it>j</it>]</p>
            <p>9: &#160;&#160;&#160;&#160;&#160;&#160;<b>while </b><it>S </it>&#8800; &#8709; <b>and </b><it>B</it>[<it>j </it>- 1, <it>H</it>(<it>top</it>)] &#8804; <it>B</it>[<it>I</it>(<it>top</it>), <it>H</it>(<it>top</it>)] <b>do</b></p>
            <p>10: &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<b>pop</b></p>
            <p>11: &#160;&#160;&#160;&#160;&#160;&#160;<b>end while</b></p>
            <p>12: &#160;&#160;&#160;&#160;&#160;&#160;<b>if </b><it>S </it>= &#8709; <b>then</b></p>
            <p>13: &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<b>push </b>(<it>j </it>- 1, <it>n</it>)</p>
            <p>14: &#160;&#160;&#160;&#160;&#160;&#160;<b>else</b></p>
            <p>15: &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<it>h </it>&#8592; <it>CLOSEST</it>(<it>j </it>- 1, <it>I</it>(<it>top</it>))</p>
            <p>16: &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<b>push </b>(<it>j </it>- 1, <it>h</it>)</p>
            <p>17: &#160;&#160;&#160;&#160;&#160;&#160;<b>end if</b></p>
            <p>18: &#160;&#160;&#160;<b>end if</b></p>
            <p>19: &#160;&#160;&#160;<b>if </b><it>H</it>(<it>top</it>) = <it>j </it><b>then</b></p>
            <p>20: &#160;&#160;&#160;&#160;&#160;&#160;<b>pop</b></p>
            <p>21: &#160;&#160;&#160;<b>end if</b></p>
            <p>22: <b>end for</b></p>
            <p>(a) It takes as input a character substitution matrix. Such a matrix could be one of the well-known PAM <abbrgrp><abbr bid="B36">36</abbr></abbrgrp> or BLOSUM <abbrgrp><abbr bid="B37">37</abbr><abbr bid="B38">38</abbr></abbrgrp> matrices. However, those matrices have been designed for maximization problems, while we have stated our alignment problem as a minimization problem. Therefore, in order to use those matrices, we need to change the sign of each entry, i.e., take its dual.</p>
            <p>(b) It takes as input two default gap cost functions, one affine and the other concave: <it>g</it>(&#8467;) = <it>c</it><sub>1 </sub>+ <it>c</it><sub>2</sub>&#8467; and <it>g</it>(&#8467;) = <it>c</it><sub>1 </sub>+ <it>c</it><sub>2 </sub>log &#8467;, where <it>c</it><sub>1 </sub>and <it>c</it><sub>2 </sub>are constants. In this case, the closest zero property holds and the program uses this condition to avoid the binary search. However, the user can also specify a concave cost function by simply providing a pointer to the executable computing it. In this case, the binary search is used.</p>
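            <p>To make the stack manipulation of steps <b>GG1 </b>and <b>GG2 </b>concrete, the following C fragment gives a minimal sketch of the concave case. It is an illustration under stated assumptions and not the code distributed with the library: the names <it>gg_concave</it>, <it>border</it>, <it>example_gap</it> and <it>example_d_from_e</it> are hypothetical, <it>CLOSEST </it>is realized by a binary search over columns, and the rule that yields <it>D</it>[<it>j</it>] from <it>E</it>[<it>j</it>] is passed in as a callback because it depends on the enclosing recurrence.</p>
            <p><![CDATA[
```c
#include <assert.h>
#include <stdlib.h>

typedef double (*gap_cost_fn)(int k, int j);     /* w(k, j), 0 <= k < j <= n */
typedef double (*d_from_e_fn)(int j, double e);  /* hypothetical: D[j] from E[j] */

/* Stack entry (i, h): row i is the best candidate up to column h. */
typedef struct { int row; int last_col; } candidate;

/* Example concave gap cost, illustration only: min(l, 2 + l/4, 5 + l/8). */
double example_gap(int k, int j)
{
    double len = (double)(j - k), best = len;
    if (2.0 + 0.25 * len < best) best = 2.0 + 0.25 * len;
    if (5.0 + 0.125 * len < best) best = 5.0 + 0.125 * len;
    return best;
}

/* Example rule, illustration only: some entries receive a fixed bonus. */
double example_d_from_e(int j, double e)
{
    return (j % 3 == 2) ? e - 1.0 : e;
}

/* B[i, x] = D[i] + w(i, x); available once D[i] is known. */
static double B(const double *D, gap_cost_fn w, int i, int x)
{
    return D[i] + w(i, x);
}

/* Largest column x in [lo, hi] at which new_row is strictly better than
 * old_row. Under the concave Monge condition the difference
 * B[new_row, x] - B[old_row, x] is nondecreasing in x, so binary search
 * applies. The caller guarantees that new_row is better at column lo. */
static int border(const double *D, gap_cost_fn w,
                  int new_row, int old_row, int lo, int hi)
{
    while (lo < hi) {
        int mid = lo + (hi - lo + 1) / 2;
        if (B(D, w, new_row, mid) < B(D, w, old_row, mid))
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}

/* Computes E[1..n] of recurrence (9); D and E must hold n + 1 entries. */
void gg_concave(int n, double d0, gap_cost_fn w, d_from_e_fn d_from_e,
                double *D, double *E)
{
    candidate *S = malloc((size_t)(n + 1) * sizeof *S);
    int top = -1;                        /* index of the top entry, -1 if empty */
    assert(n >= 1 && S != NULL);
    D[0] = d0;
    E[1] = B(D, w, 0, 1);                /* column 1 has a single entry */
    D[1] = d_from_e(1, E[1]);
    S[++top] = (candidate){0, n};
    for (int j = 2; j <= n; j++) {
        int l = S[top].row;
        if (B(D, w, j - 1, j) >= B(D, w, l, j)) {
            E[j] = B(D, w, l, j);        /* GG1: row j - 1 is dead */
        } else {
            E[j] = B(D, w, j - 1, j);    /* GG2: row j - 1 wins column j */
            while (top >= 0 &&
                   B(D, w, j - 1, S[top].last_col) <=
                   B(D, w, S[top].row, S[top].last_col))
                top--;                   /* the popped row can no longer win */
            if (top < 0) {
                S[++top] = (candidate){j - 1, n};
            } else {
                int h = border(D, w, j - 1, S[top].row, j, S[top].last_col - 1);
                S[++top] = (candidate){j - 1, h};
            }
        }
        if (top >= 0 && S[top].last_col == j)
            top--;                       /* interval ends at the current column */
        D[j] = d_from_e(j, E[j]);
    }
    free(S);
}
```
]]></p>
            <p>Each row is pushed at most once and popped at most once, so the stack operations take linear time overall; the logarithmic factor comes only from the binary search in <it>border</it>.</p>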
         </sec>
         <sec>
            <st>
               <p>5.3 The C/C++ library functions</p>
            </st>
            <p>The function below computes the edit distance between two strings, using concave or affine gap costs. It returns the corresponding alignment.</p>
            <p>
               <b>Synopsis</b>
            </p>
            <p>
               <b>#include "edit_distance_gaps.h"</b>
            </p>
            <p>
               <ul>ALIGNMENTS</ul>
            </p>
            <p><b><ul>edit_distance_gaps</ul></b>(<ul>char</ul> <it><ul>*X</ul></it>, <ul>char</ul> <it><ul>*Y</ul></it>, <ul>WEIGHT</ul><it><ul> Xw</ul></it>, <ul>WEIGHT </ul><it><ul>Yw</ul></it>, <ul>MATRIX</ul><it><ul> substitution</ul></it>);</p>
            <p><b>Arguments</b>:</p>
            <p>&#8226; <it><ul>X</ul></it>: points to a string;</p>
            <p>&#8226; <it><ul>Y</ul></it>: points to a string;</p>
            <p>&#8226; <it><ul>Xw</ul></it>: is a pointer to a <ul>WEIGHT_STRUCT</ul>;</p>
            <p>&#8226; <it><ul>Yw</ul></it>: is a pointer to a <ul>WEIGHT_STRUCT</ul>;</p>
            <p>&#8226; <it><ul>substitution</ul></it>: is a pointer to <ul>MATRIX_STRUCT</ul>, a data structure (detailed below) defining an upper triangular substitution cost matrix.</p>
            <p><ul>WEIGHT_STRUCT</ul> defines a generic cost function for gaps, as follows:</p>
            <p>typedef struct <ul>weight</ul></p>
            <p>{</p>
            <p>&#160;&#160;&#160;<ul>int</ul><it><ul> type</ul></it>;</p>
            <p>&#160;&#160;&#160;<ul>double</ul><it><ul> Wa</ul></it>, <it><ul>Wg</ul></it>, <it><ul>base</ul></it>;</p>
            <p>&#160;&#160;&#160;<ul>double</ul> (<it><ul>*w</ul></it>)(<ul>int </ul><it><ul>l</ul></it>, <ul>int</ul><it><ul> k</ul></it>);</p>
            <p>} <ul>WEIGHT_STRUCT</ul>, <it><ul>*WEIGHT</ul></it>;</p>
            <p>The <it><ul>type</ul></it> is a mandatory field that takes one of two values: F_AFFINE and F_CONCAVE. In both cases, the total of gap opening and closing costs, i.e., <it><ul>Wg</ul></it>, and the gap extension cost, i.e., <it><ul>Wa</ul></it>, must also be specified. Then, the affine function is <it>W</it><sub><it>a </it></sub>+ <it>W</it><sub><it>g</it></sub>&#8467;, for a gap of length &#8467;. For the concave cost function, we can use the default <it>W</it><sub><it>a </it></sub>+ <it>W</it><sub><it>g</it></sub><it>log</it><sub><it>base</it></sub>(&#8467;), where the <ul>base</ul> of the logarithm must also be specified. One can also use a user-defined concave cost function <it>w </it>by specifying a pointer to a function defined as:</p>
            <p>
               <ul>double</ul>
            </p>
            <p><b><ul>weight_function</ul></b>(<ul>int </ul><it><ul>l</ul></it>, <ul>int</ul><it><ul> k</ul></it>);</p>
            <p><ul>MATRIX_STRUCT</ul> defines a generic cost substitution matrix, as follows:</p>
            <p>typedef struct <ul>matrix</ul></p>
            <p>{</p>
            <p>&#160;&#160;&#160;<ul>char</ul><it><ul>*alphabet</ul></it>;</p>
            <p>&#160;&#160;&#160;<ul>int</ul><it><ul> size</ul></it>;</p>
            <p>&#160;&#160;&#160;<ul>double</ul><it><ul>**matrix</ul></it>;</p>
            <p>} <ul>MATRIX_STRUCT</ul>, <it><ul>*MATRIX</ul></it>;</p>
            <p>where <it><ul>alphabet</ul></it> is a pointer to the alphabet array (case insensitive) of cardinality <it><ul>size</ul></it>. The last field, <it><ul>matrix</ul></it>, is a pointer to an upper triangular symbol substitution cost matrix. To use the default matrix, i.e., cost 0 for a match and 1 for a mismatch, it suffices to set the field <it>size </it>equal to zero.</p>
            <p><b>Return Values</b>: A pointer to <ul>ALIGNMENTS_STRUCT</ul>, which is defined as in section 4.3, except that <it><ul>distance</ul></it> now refers to the edit distance with gaps.</p>
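            <p>The following C fragment sketches how a substitution cost could be looked up in a structure laid out like <ul>MATRIX_STRUCT</ul>. It only illustrates the conventions described above and is not taken from the library: the helpers <it>symbol_index</it> and <it>substitution_cost</it> are hypothetical, and we assume that <it>matrix</it> is a full <it>size </it>&#215; <it>size </it>array of which only the entries on or above the diagonal are read.</p>
            <p><![CDATA[
```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Mirrors the MATRIX_STRUCT declaration given in the text. */
typedef struct matrix {
    char *alphabet;
    int size;
    double **matrix;
} MATRIX_STRUCT, *MATRIX;

/* Hypothetical helper: position of symbol c in the alphabet,
 * ignoring case; returns -1 if c is not in the alphabet. */
int symbol_index(const MATRIX_STRUCT *m, char c)
{
    for (int i = 0; i < m->size; i++)
        if (toupper((unsigned char)m->alphabet[i]) == toupper((unsigned char)c))
            return i;
    return -1;
}

/* Hypothetical helper: cost of substituting symbol a with symbol b.
 * ASSUMPTION: only entries matrix[i][j] with i <= j are meaningful.
 * When size is zero the default costs apply: 0 for a match, 1 otherwise. */
double substitution_cost(const MATRIX_STRUCT *m, char a, char b)
{
    assert(m != NULL);
    if (m->size == 0)
        return toupper((unsigned char)a) == toupper((unsigned char)b) ? 0.0 : 1.0;
    int i = symbol_index(m, a);
    int j = symbol_index(m, b);
    assert(i >= 0 && j >= 0);            /* both symbols must be in the alphabet */
    return i <= j ? m->matrix[i][j] : m->matrix[j][i];
}
```
]]></p>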
         </sec>
         <sec>
            <st>
               <p>5.4 The Perl library functions</p>
            </st>
            <p>The function <b>Edit_Distance_Gaps </b>computes the edit distance with gaps between two strings.</p>
            <p>
               <b>Synopsis</b>
            </p>
            <p>
               <b>use BATS::Edit_Distance_Gaps;</b>
            </p>
            <p>Edit_Distance_Gaps <it>X Y Xw Yw Substitution</it></p>
            <p><b>Arguments</b>:</p>
            <p>&#8226; <it>X</it>: is a scalar containing string X;</p>
            <p>&#8226; <it>Y</it>: is a scalar containing string Y;</p>
            <p>&#8226; <it>Xw</it>: is a hash reference defined below;</p>
            <p>&#8226; <it>Yw</it>: is a hash reference defined below;</p>
            <p>&#8226; <it>Substitution</it>: is a list reference containing an upper triangular symbol substitution cost matrix. If undefined, the default values are used, as in section 5.3;</p>
            <p>&#8226; <it>Alphabet</it>: is a list reference containing the characters of alphabet (case insensitive). If undefined, the default values are used, as in section 5.3.</p>
            <p>Xw is defined as (Yw is analogous):</p>
            <p>my %Xw = (</p>
            <p>&#160;&#160;&#160;<it>Type </it>=> "",</p>
            <p>&#160;&#160;&#160;<it>Wa </it>=> 0,</p>
            <p>&#160;&#160;&#160;<it>Wg </it>=> 0,</p>
            <p>&#160;&#160;&#160;<it>Base </it>=> 0,</p>
            <p>&#160;&#160;&#160;<it>w </it>=> \&amp;<it>custom_function</it>);</p>
            <p>where the fields are as in the specification of the cost function in section 5.3.</p>
            <p><b>Return values</b>: <b>Edit_Distance_Gaps </b>returns a hash corresponding to the computed alignment. It is defined as in section 4.4, except that the distance is now the value of the edit distance with gaps:</p>
            <p>my %alignment = (</p>
            <p>&#160;&#160;&#160;distance => 0,</p>
            <p>&#160;&#160;&#160;X => "",</p>
            <p>&#160;&#160;&#160;Y => "");</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>6 Filtering, statistical scores and model organism generation</p>
         </st>
         <p>In this section we outline the filtering and statistical functions present in the system, starting with the filter. Let <it>O</it><sub>1</sub>,...,<it>O</it><sub><it>s </it></sub>be the output of algorithm <b>SM </b>on the pattern strings <it>p</it><sub>1</sub>,...,<it>p</it><sub><it>s </it></sub>and text strings <it>t</it><sub>1</sub>,...,<it>t</it><sub><it>s</it></sub>, respectively. We assume that the algorithm has been used with the same value of <it>k </it>in all <it>s </it>instances. The procedure takes as input the sets <it>O</it><sub><it>i </it></sub>and <it>t</it><sub><it>i</it></sub>, 1 &#8804; <it>i </it>&#8804; <it>s</it>, and a threshold parameter <it>th</it>. It returns a set <it>W </it>consisting of all strings in <it>O</it><sub><it>i </it></sub>that appear in at least <it>th </it>of the text strings. Since each <it>O</it><sub><it>i </it></sub>consists of the occurrences of a pattern <it>p</it><sub><it>i </it></sub>in <it>t</it><sub><it>i</it></sub>, with mismatches, <it>W </it>corresponds to a set of strings representing common occurrences of all patterns in the text strings, i.e., it is a consensus set. The algorithmic details yielding an efficient implementation of the filtering operation are straightforward and therefore omitted.</p>
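         <p>A direct, unoptimized version of the filter can be sketched in C as follows. This is only an illustration of the definition of <it>W </it>and not the implementation used in the system: the names <it>texts_containing</it> and <it>filter_consensus</it> are hypothetical, and we assume that a string "appears" in a text when it occurs there as an exact substring.</p>
         <p><![CDATA[
```c
#include <assert.h>
#include <string.h>

/* Number of the s texts that contain u as an exact substring. */
int texts_containing(const char *u, const char *const *texts, int s)
{
    int count = 0;
    for (int i = 0; i < s; i++)
        if (strstr(texts[i], u) != NULL)
            count++;
    return count;
}

/*
 * Stores in W pointers to the distinct candidate strings (the union of
 * the sets O_i) that appear in at least th of the s texts, and returns
 * how many were kept. W must have room for n_candidates pointers.
 */
int filter_consensus(const char *const *candidates, int n_candidates,
                     const char *const *texts, int s, int th,
                     const char **W)
{
    int n_kept = 0;
    assert(th >= 1 && th <= s);
    for (int c = 0; c < n_candidates; c++) {
        int seen = 0;
        for (int k = 0; k < n_kept; k++) {   /* W is a set: skip duplicates */
            if (strcmp(W[k], candidates[c]) == 0) {
                seen = 1;
                break;
            }
        }
        if (!seen && texts_containing(candidates[c], texts, s) >= th)
            W[n_kept++] = candidates[c];
    }
    return n_kept;
}
```
]]></p>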
         <p>We now turn to the z-score. The assessment of the statistical significance of the occurrences of a set of strings <it>W </it>in a set of text strings <it>t</it><sub>1</sub>,...,<it>t</it><sub><it>s </it></sub>is a well established procedure for analysis of biological sequences, in particular via z-score functions <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. Intuitively, the value of the z-score for a set of strings <it>W </it>gives an indication of how relevant the occurrences of the strings in <it>W </it>in the text strings <it>t</it><sub>1</sub>,...,<it>t</it><sub><it>s </it></sub>are, with respect to "a random event" as characterized by a background model. We limit ourselves to giving formal definitions for the case in which <it>W </it>contains only one string and <it>s </it>= 1. For the generalization to the case in which <it>W </it>contains more than one string and the rather involved algorithmic details, the reader is referred to <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>.</p>
         <p>Let <it>p </it>be a string and let <it>X </it>be a set of random strings, generated according to some "background probabilistic model", usually a Markov source. Let <it>X</it><sub><it>p </it></sub>be the random variable indicating the number of occurrences of <it>p </it>in <it>X </it>and let <it>E</it>(<it>X</it><sub><it>p</it></sub>) and <it>&#963;</it>(<it>X</it><sub><it>p</it></sub>) be the mean and standard deviation, respectively. Then, the <it>z-score </it>associated with <it>p </it>is</p>
         <p>
            <display-formula id="M11">
               <m:math name="1748-7188-2-10-i5" xmlns:m="http://www.w3.org/1998/Math/MathML">
                  <m:semantics>
                     <m:mrow>
                        <m:msub>
                           <m:mi>z</m:mi>
                           <m:mi>p</m:mi>
                        </m:msub>
                        <m:mo>=</m:mo>
                        <m:mfrac>
                           <m:mrow>
                              <m:msub>
                                 <m:mi>N</m:mi>
                                 <m:mi>p</m:mi>
                              </m:msub>
                              <m:mo>&#8722;</m:mo>
                              <m:mi>E</m:mi>
                              <m:mrow>
                                 <m:mo>(</m:mo>
                                 <m:mrow>
                                    <m:msub>
                                       <m:mi>X</m:mi>
                                       <m:mi>p</m:mi>
                                    </m:msub>
                                 </m:mrow>
                                 <m:mo>)</m:mo>
                              </m:mrow>
                           </m:mrow>
                           <m:mrow>
                              <m:mi>&#963;</m:mi>
                              <m:mrow>
                                 <m:mo>(</m:mo>
                                 <m:mrow>
                                    <m:msub>
                                       <m:mi>X</m:mi>
                                       <m:mi>p</m:mi>
                                    </m:msub>
                                 </m:mrow>
                                 <m:mo>)</m:mo>
                              </m:mrow>
                           </m:mrow>
                        </m:mfrac>
                     </m:mrow>
                     <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWG6bGEdaWgaaWcbaGaemiCaahabeaakiabg2da9maalaaabaGaemOta40aaSbaaSqaaiabdchaWbqabaGccqGHsislcqWGfbqrdaqadaqaaiabdIfaynaaBaaaleaacqWGWbaCaeqaaaGccaGLOaGaayzkaaaabaacciGae83Wdm3aaeWaaeaacqWGybawdaWgaaWcbaGaemiCaahabeaaaOGaayjkaiaawMcaaaaaaaa@402E@</m:annotation>
                  </m:semantics>
               </m:math>
            </display-formula>
         </p>
         <p>where <it>N</it><sub><it>p </it></sub>is the number of occurrences of <it>p </it>in the strings in <it>X</it>. Notice that <it>z</it><sub><it>p </it></sub>gives the number of standard deviations by which the observed value <it>N</it><sub><it>p </it></sub>exceeds its expected value. It is normalized to have mean zero and standard deviation one, so that the z-scores of different strings can be compared.</p>
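         <p>As a small worked instance of the formula, the C fragment below evaluates a z-score from an observed count and from a mean and standard deviation that are assumed to be already known. The function name <it>z_score_from_moments</it> is hypothetical; computing the mean and the standard deviation under a Markov source is the involved part and is not shown.</p>
         <p><![CDATA[
```c
#include <assert.h>

/*
 * z-score of an observed count, given the mean and the standard
 * deviation of the count under the background model. The standard
 * deviation must be strictly positive.
 */
double z_score_from_moments(double observed, double mean, double std_dev)
{
    assert(std_dev > 0.0);
    return (observed - mean) / std_dev;
}
```
]]></p>
         <p>For example, an observed count of 12 against a mean of 8 and a standard deviation of 2 lies two standard deviations above expectation.</p>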
         <p>The module that computes the z-score in our system takes as input the set <it>W </it>output by the filtering function, the text strings <it>t</it><sub>1</sub>,...,<it>t</it><sub><it>s </it></sub>and a model, i.e., a table encoding a Markov source of order 3, together with additional information needed for the computation of the variance (see Appendix A in <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>). The software computing the z-score is a specialization of the software of Sinha and Tompa for the computation of the z-score in YMF, which is designed to work for motifs (a concise and general encoding of a set of strings). As in their case, the code is designed to work only for DNA sequences. Therefore, care must be taken in computing the number of occurrences of a string <it>p </it>in a string <it>t</it>. In fact, one must count occurrences on both DNA strands. That is done by including, for each string in the input set <it>W</it>, its reverse complement.</p>
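         <p>The reverse complement mentioned above can be computed as in the following C sketch. The function names are hypothetical and the fragment is not taken from the system; we assume that any symbol other than A, C, G and T (in either case) is mapped to N.</p>
         <p><![CDATA[
```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Watson-Crick complement of a single base; unknown symbols map to 'N'. */
char complement_base(char c)
{
    switch (toupper((unsigned char)c)) {
    case 'A': return 'T';
    case 'C': return 'G';
    case 'G': return 'C';
    case 'T': return 'A';
    default:  return 'N';
    }
}

/*
 * Returns a newly allocated string holding the reverse complement of p,
 * or NULL if memory cannot be allocated. The caller frees the result.
 */
char *reverse_complement(const char *p)
{
    size_t len = strlen(p);
    char *rc = malloc(len + 1);
    if (rc == NULL)
        return NULL;
    for (size_t i = 0; i < len; i++)
        rc[i] = complement_base(p[len - 1 - i]);   /* read p backwards */
    rc[len] = '\0';
    return rc;
}
```
]]></p>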
         <p>Two model organisms are available, Human and Yeast, as they are given by the YMF software distribution of Sinha and Tompa <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>. Moreover, via the function that generates a model organism, the user can specify a new model for her/his sequences. Details on input formats for the model are given in the User Guide.</p>
         <sec>
            <st>
               <p>6.1 The C/C++ library functions</p>
            </st>
            <p>The function below computes the z-score value of a set of patterns (all of the same length) with respect to a set of sequences (all of the same length). It works for DNA only.</p>
            <p>
               <b>Synopsis</b>
            </p>
            <p>
               <b>#include "z_score.h"</b>
            </p>
            <p>
               <ul>double</ul>
            </p>
            <p><b><ul>z_score</ul></b> (<ul>char</ul><it><ul>**patterns</ul></it>, <ul>char</ul><it><ul>**texts</ul></it>, <ul>char</ul><it><ul>*organismpath</ul></it>);</p>
            <p><b>Arguments</b>:</p>
            <p>&#8226; <it><ul>patterns</ul></it>: is a column vector in which each item points to a pattern string. The last item points to NULL;</p>
            <p>&#8226; <it><ul>texts</ul></it>: is a column vector in which each item points to a text string. The last item points to NULL;</p>
            <p>&#8226; <it><ul>organismpath</ul></it>: is the path to the file containing all probabilistic information for an organism.</p>
            <p>
               <b>Return Values</b>
            </p>
            <p>Upon successful completion, <b><ul>z_score</ul></b> returns a double value corresponding to the z-score.</p>
            <p>The function below generates a Markov model of order 3, from a set of strings. It works for DNA only.</p>
            <p>
               <b>Synopsis</b>
            </p>
            <p>
               <b>#include "model_generatation.h"</b>
            </p>
            <p>
               <ul>int</ul>
            </p>
            <p><b><ul>model_generatation </ul></b>(<ul>char</ul><it><ul>**strings</ul></it>, <ul>char</ul><it><ul>*path</ul></it>, <ul>char</ul><it><ul>*organism</ul></it>);</p>
            <p><b>Arguments</b>:</p>
            <p>&#8226; <it><ul>strings</ul></it>: is a column vector, each item points to a string. The last item point to NULL;</p>
            <p>&#8226; <ul>path</ul>: is a output path;</p>
            <p>&#8226; <ul>organism</ul>: is the organism name;</p>
            <p>
               <b>Return Values</b>
            </p>
            <p><b><ul>model_generation</ul></b> returns zero if the computation is completed successfully and 1 otherwise.</p>
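            <p>To illustrate what a Markov model of order 3 captures, the sketch below estimates the probability of each base given the three preceding bases. It is an illustration under stated assumptions and not the BATS model format. The names base_index, context_index, count_transitions and transition_probability are hypothetical, windows containing characters other than A, C, G, T are skipped, and a context that was never observed falls back to the uniform probability 0.25. The additional information needed for the variance is not computed here.</p>
            <p><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NUM_CONTEXTS 64 /* 4^3 possible contexts of three bases */

/* Maps A, C, G, T to 0..3; any other character (including '\0') to -1. */
int base_index(char b)
{
    switch (b) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Index in 0..63 of the three bases starting at kmer, or -1 if one of
   them is not A, C, G or T. Stops at the terminating '\0'. */
int context_index(const char *kmer)
{
    int idx = 0, i;

    for (i = 0; i < 3; i++) {
        int b = base_index(kmer[i]);
        if (b < 0)
            return -1;
        idx = idx * 4 + b;
    }
    return idx;
}

/* Sets counts[c][x] to the number of times base x follows context c
   in the NULL-terminated vector of strings. */
void count_transitions(char **strings, unsigned long counts[NUM_CONTEXTS][4])
{
    size_t s, i;

    assert(strings != NULL);
    memset(counts, 0, sizeof(unsigned long) * NUM_CONTEXTS * 4);
    for (s = 0; strings[s] != NULL; s++) {
        const char *t = strings[s];
        size_t n = strlen(t);
        for (i = 0; i + 3 < n; i++) {
            int c = context_index(t + i);
            int x = base_index(t[i + 3]);
            if (c >= 0 && x >= 0)
                counts[c][x]++;
        }
    }
}

/* Maximum-likelihood estimate of P(x | context); returns the uniform
   value 0.25 for a context that was never observed (an assumption). */
double transition_probability(unsigned long counts[NUM_CONTEXTS][4],
                              int context, int x)
{
    unsigned long total;

    assert(context >= 0 && context < NUM_CONTEXTS && x >= 0 && x < 4);
    total = counts[context][0] + counts[context][1]
          + counts[context][2] + counts[context][3];
    if (total == 0)
        return 0.25;
    return (double)counts[context][x] / (double)total;
}
```
]]></p>
            <p>In the single string ACGTACGA, the context ACG is followed once by T and once by A, so the sketch estimates both conditional probabilities as 0.5.</p>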
         </sec>
         <sec>
            <st>
               <p>6.2 The Perl library functions</p>
            </st>
            <p>The function below performs a filtering operation on a set of sequences.</p>
            <p>
               <b>use BATS::Filter;</b>
            </p>
            <p>Filter <it>files hits score Hitsthreshold Filesthreshold</it></p>
            <p><b>Arguments</b>:</p>
            <p>&#8226; <it>files</it>: is an array of strings containing the filenames.</p>
            <p>&#8226; <it>hits</it>: is a hash reference containing the number of hits for each occurrence per file.</p>
            <p>&#8226; <it>score</it>: is a hash reference containing the number of errors for each occurrence.</p>
            <p>&#8226; <it>Hitsthreshold</it>: is a scalar containing the minimum number of hits required for an occurrence to be retained.</p>
            <p>&#8226; <it>Filesthreshold</it>: is a scalar containing the minimum percentage of files in which an occurrence needs to be present.</p>
            <p><b>Return values: </b>Filter returns an array containing the indices of the hits that satisfy the thresholds.</p>
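            <p>As a rough illustration of thresholding on both quantities, the C sketch below keeps an occurrence when its total number of hits reaches a hit threshold and the percentage of files containing it reaches a file threshold. This reading of the two thresholds is an assumption, as are the name filter_occurrences and the row-major matrix layout. The score argument of Filter is not modelled.</p>
            <p><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* hits is an n_occ x n_files matrix in row-major order:
   hits[i * n_files + f] is the number of hits of occurrence i in file f.
   Indices of the retained occurrences are written to kept, which must
   have room for n_occ entries. Returns the number of retained indices. */
size_t filter_occurrences(const unsigned *hits, size_t n_occ, size_t n_files,
                          unsigned long hits_threshold,
                          double files_threshold_pct, size_t *kept)
{
    size_t i, f, n_kept = 0;

    assert(hits != NULL && kept != NULL && n_files > 0);
    for (i = 0; i < n_occ; i++) {
        unsigned long total = 0;    /* hits summed over all files */
        size_t files_with_hit = 0;  /* files with at least one hit */

        for (f = 0; f < n_files; f++) {
            unsigned h = hits[i * n_files + f];
            total += h;
            if (h > 0)
                files_with_hit++;
        }
        /* Compare percentages without dividing. */
        if (total >= hits_threshold &&
            100.0 * (double)files_with_hit
                >= files_threshold_pct * (double)n_files)
            kept[n_kept++] = i;
    }
    return n_kept;
}
```
]]></p>
            <p>With three files, a hit threshold of 2 and a file threshold of 50, an occurrence with hits (3, 0, 0) is discarded because it appears in only one third of the files, whereas one with hits (1, 1, 1) is kept.</p>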
            <p>The function below computes the z-score value of a set of patterns (all of the same length) with respect to a set of sequences (all of the same length). It works for DNA only.</p>
            <p>
               <b>Synopsis</b>
            </p>
            <p>
               <b>use BATS::Z_Score;</b>
            </p>
            <p>Z_Score <it>patterns texts organismpath</it></p>
            <p><b>Arguments</b>:</p>
            <p>&#8226; <it>patterns</it>: is an array of strings containing the set of patterns;</p>
            <p>&#8226; <it>texts</it>: is an array of strings containing the text strings;</p>
            <p>&#8226; <it>organismpath</it>: it is the path to the file containing all probabilistic information for an organism.</p>
            <p>
               <b>Return values:</b>
            </p>
            <p>Z_Score returns a scalar containing the z-score value of the pattern set.</p>
            <p>The function below generates a Markov model of order 3, from a set of strings. It works for DNA only.</p>
            <p>
               <b>Synopsis</b>
            </p>
            <p>
               <b>use BATS::Model_Generation;</b>
            </p>
            <p>Model_Generation <it>strings path organism</it></p>
            <p><b>Arguments</b>:</p>
            <p>&#8226; <it>strings</it>: is an array of strings;</p>
            <p>&#8226; <it>path</it>: is a scalar containing the string of the output path;</p>
            <p>&#8226; <it>organism</it>: is a scalar containing the name of the organism.</p>
            <p><b>Return values: </b>Model_Generation returns a scalar containing 0 if the computation is completed successfully and 1 otherwise.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>7 Conclusion</p>
         </st>
         <p>We have presented a software library for some basic global and local sequence alignment tasks. Moreover, procedures to assess the statistical significance of the occurrence of a set of DNA pattern strings in a set of DNA text strings have also been provided. Although none of the presented algorithms is new, this is the first software library that provides their implementation in one consistent and ready-to-use package.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>The authors are deeply indebted to S. Sinha and M. Tompa for allowing us to modify their software so that it could be included in BATS. RG is partially supported by the Italian MIUR FIRB project "Bioinformatica per la Genomica e la Proteomica" and by the MIUR FIRB Italy-Israel project "Pattern Matching and Discovery in Discrete Structures, with applications to Bioinformatics".</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <aug>
               <au>
                  <snm>Gusfield</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology</source>
            <publisher>Cambridge University Press</publisher>
            <pubdate>1997</pubdate>
         </bibl>
         <bibl id="B2">
            <aug>
               <au>
                  <snm>Waterman</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Introduction to Computational Biology. Maps, Sequences and Genomes</source>
            <publisher>Chapman &amp; Hall</publisher>
            <pubdate>1995</pubdate>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Sequence analysis: contributions by Ulam to molecular genetics</p>
            </title>
            <aug>
               <au>
                  <snm>Goad</snm>
                  <fnm>W</fnm>
               </au>
            </aug>
            <source>From Cardinals to Chaos. Reflections on the life and legacy of Stanislaw Ulam</source>
            <publisher>Cambridge University Press</publisher>
            <editor>Cooper N</editor>
            <pubdate>1989</pubdate>
            <fpage>288</fpage>
            <lpage>291</lpage>
         </bibl>
         <bibl id="B4">
            <aug>
               <au>
                  <snm>Kruskal</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Sankoff</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <cnm>Eds</cnm>
               </au>
            </aug>
            <source>Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison</source>
            <publisher>Addison-Wesley</publisher>
            <pubdate>1983</pubdate>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Basic Local Alignment Search Tool</p>
            </title>
            <aug>
               <au>
                  <snm>Altschul</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Gish</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Myers</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Lipman</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>J of Molecular Biology</source>
            <pubdate>1990</pubdate>
            <volume>215</volume>
            <fpage>403</fpage>
            <lpage>410</lpage>
         </bibl>
         <bibl id="B6">
            <title>
               <p>An Improved Algorithm for Matching of Biological Sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Gotoh</snm>
                  <fnm>O</fnm>
               </au>
            </aug>
            <source>Journal of Molecular Biology</source>
            <pubdate>1982</pubdate>
            <volume>162</volume>
            <fpage>705</fpage>
            <lpage>708</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/0022-2836(82)90398-9</pubid>
                  <pubid idtype="pmpid" link="fulltext">7166760</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Dynamic Programming: Special Cases</p>
            </title>
            <aug>
               <au>
                  <snm>Giancarlo</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Pattern Matching Algorithms</source>
            <publisher>Oxford University Press</publisher>
            <editor>Apostolico A, Galil Z</editor>
            <pubdate>1997</pubdate>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Data Structures and Algorithms for Approximate String Matching</p>
            </title>
            <aug>
               <au>
                  <snm>Galil</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Giancarlo</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>J of Complexity</source>
            <pubdate>1988</pubdate>
            <volume>4</volume>
            <fpage>32</fpage>
            <lpage>72</lpage>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Efficient String Matching with k Mismatches</p>
            </title>
            <aug>
               <au>
                  <snm>Landau</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Vishkin</snm>
                  <fnm>U</fnm>
               </au>
            </aug>
            <source>Theoretical Computer Science</source>
            <pubdate>1986</pubdate>
            <volume>43</volume>
            <fpage>239</fpage>
            <lpage>249</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/0304-3975(86)90178-7</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Faster algorithms for string matching with k mismatches</p>
            </title>
            <aug>
               <au>
                  <snm>Amir</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Lewenstein</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Porat</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>J of Algorithms</source>
            <pubdate>2004</pubdate>
            <volume>50</volume>
            <fpage>257</fpage>
            <lpage>275</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/S0196-6774(03)00097-X</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Introducing Efficient Parallelism into Approximate String Matching and a New Serial Algorithm</p>
            </title>
            <aug>
               <au>
                  <snm>Landau</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Vishkin</snm>
                  <fnm>U</fnm>
               </au>
            </aug>
            <source>Proc. 18th Symposium on Theory of Computing, ACM</source>
            <pubdate>1986</pubdate>
            <fpage>220</fpage>
            <lpage>230</lpage>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Approximate String Matching: A Simpler Faster Algorithm</p>
            </title>
            <aug>
               <au>
                  <snm>Cole</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Hariharan</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>SIAM J Comput</source>
            <pubdate>2002</pubdate>
            <volume>31</volume>
            <fpage>1761</fpage>
            <lpage>1782</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1137/S0097539700370527</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Sparse Dynamic Programming for Longest Common Subsequence from Fragments</p>
            </title>
            <aug>
               <au>
                  <snm>Baker</snm>
                  <fnm>BS</fnm>
               </au>
               <au>
                  <snm>Giancarlo</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>J Algorithms</source>
            <pubdate>2002</pubdate>
            <volume>42</volume>
            <issue>2</issue>
            <fpage>231</fpage>
            <lpage>254</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1006/jagm.2002.1214</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Efficient Sequence Alignment Algorithms</p>
            </title>
            <aug>
               <au>
                  <snm>Waterman</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Journal of Theoretical Biology</source>
            <pubdate>1984</pubdate>
            <volume>108</volume>
            <fpage>333</fpage>
            <lpage>337</lpage>
            <xrefbib>
               <pubid idtype="pmpid">6748696</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Speeding Up Dynamic Programming with Applications to Molecular Biology</p>
            </title>
            <aug>
               <au>
                  <snm>Galil</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Giancarlo</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Theor Comput Sci</source>
            <pubdate>1989</pubdate>
            <volume>64</volume>
            <fpage>107</fpage>
            <lpage>118</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/0304-3975(89)90101-1</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Sequence Comparison with Concave Weighting Functions</p>
            </title>
            <aug>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Myers</snm>
                  <fnm>EW</fnm>
               </au>
            </aug>
            <source>Bull Math Biol</source>
            <pubdate>1988</pubdate>
            <volume>50</volume>
            <fpage>97</fpage>
            <lpage>120</lpage>
            <xrefbib>
               <pubid idtype="pmpid">3207952</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>An Almost Linear Algorithm for Generalized Matrix Searching</p>
            </title>
            <aug>
               <au>
                  <snm>Klawe</snm>
                  <fnm>MM</fnm>
               </au>
               <au>
                  <snm>Kleitman</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>SIAM J on Disc Math</source>
            <pubdate>1990</pubdate>
            <volume>3</volume>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Over- and underrepresentation of short DNA words in herpesvirus genomes</p>
            </title>
            <aug>
               <au>
                  <snm>Leung</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Marsh</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Speed</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>J Comput Biol</source>
            <pubdate>1996</pubdate>
            <volume>3</volume>
            <fpage>345</fpage>
            <lpage>360</lpage>
            <xrefbib>
               <pubid idtype="pmpid">8891954</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>A Statistical Method for Finding Transcription Factors Binding Sites</p>
            </title>
            <aug>
               <au>
                  <snm>Sinha</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Tompa</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>8-th ISMB Conference, AAAI</source>
            <pubdate>2000</pubdate>
            <fpage>344</fpage>
            <lpage>354</lpage>
         </bibl>
         <bibl id="B20">
            <aug>
               <au>
                  <snm>Mehlhorn</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>N&#228;her</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>LEDA: A Platform for Combinatorial and Geometric Computing</source>
            <publisher>Cambridge, UK: Cambridge University Press</publisher>
            <pubdate>1999</pubdate>
         </bibl>
         <bibl id="B21">
            <title>
               <p>The Architecture of a Software Library for String Processing</p>
            </title>
            <aug>
               <au>
                  <snm>Czumaj</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Ferragina</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Gasieniec</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Muthukrishnan</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Traeff</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Workshop on Algorithm Engineering</source>
            <publisher>University of Venice</publisher>
            <pubdate>1997</pubdate>
            <fpage>166</fpage>
            <lpage>176</lpage>
         </bibl>
         <bibl id="B22">
            <title>
               <p>A space-economical suffix tree construction algorithm</p>
            </title>
            <aug>
               <au>
                  <snm>McCreight</snm>
                  <fnm>EM</fnm>
               </au>
            </aug>
            <source>Journal of the ACM</source>
            <pubdate>1976</pubdate>
            <volume>23</volume>
            <issue>2</issue>
            <fpage>262</fpage>
            <lpage>272</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1145/321941.321946</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>On-Line Construction of Suffix Trees</p>
            </title>
            <aug>
               <au>
                  <snm>Ukkonen</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Algorithmica</source>
            <pubdate>1995</pubdate>
            <volume>14</volume>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Strmat</p>
            </title>
            <url>http://www.cs.ucdavis.edu/~gusfield/strmat.html</url>
         </bibl>
         <bibl id="B25">
            <title>
               <p>BATS Supplementary Material Web Page</p>
            </title>
            <url>http://www.math.unipa.it/~raffaele/BATS</url>
         </bibl>
         <bibl id="B26">
            <title>
               <p>On Finding Lowest Common Ancestors: Simplification and Parallelization</p>
            </title>
            <aug>
               <au>
                  <snm>Schieber</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Vishkin</snm>
                  <fnm>U</fnm>
               </au>
            </aug>
            <source>SIAM J on Computing</source>
            <pubdate>1988</pubdate>
            <volume>17</volume>
            <fpage>1253</fpage>
            <lpage>1262</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1137/0217079</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Algorithms for Approximate String Matching</p>
            </title>
            <aug>
               <au>
                  <snm>Ukkonen</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Information and Control</source>
            <pubdate>1985</pubdate>
            <volume>64</volume>
            <fpage>100</fpage>
            <lpage>118</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/S0019-9958(85)80046-2</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>An O(ND) Difference Algorithm and Its Variations</p>
            </title>
            <aug>
               <au>
                  <snm>Myers</snm>
                  <fnm>EW</fnm>
               </au>
            </aug>
            <source>Algorithmica</source>
            <pubdate>1986</pubdate>
            <volume>1</volume>
            <fpage>251</fpage>
            <lpage>266</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1007/BF01840446</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>String Editing and Longest Common Subsequence</p>
            </title>
            <aug>
               <au>
                  <snm>Apostolico</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Handbook of Formal Languages</source>
            <publisher>Berlin: Springer Verlag</publisher>
            <editor>Rozenberg G, Salomaa A</editor>
            <pubdate>1997</pubdate>
            <volume>2</volume>
            <fpage>361</fpage>
            <lpage>398</lpage>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Serial Computations of Levenshtein Distances</p>
            </title>
            <aug>
               <au>
                  <snm>Hirschberg</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Pattern Matching Algorithms</source>
            <publisher>Oxford: Oxford University Press</publisher>
            <editor>Apostolico A, Galil Z</editor>
            <pubdate>1997</pubdate>
            <fpage>123</fpage>
            <lpage>142</lpage>
         </bibl>
         <bibl id="B31">
            <title>
               <p>Sparse Dynamic Programming I: Linear Cost Functions</p>
            </title>
            <aug>
               <au>
                  <snm>Eppstein</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Galil</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Giancarlo</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Italiano</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>J of ACM</source>
            <pubdate>1992</pubdate>
            <volume>39</volume>
            <fpage>519</fpage>
            <lpage>545</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1145/146637.146650</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>Sparse Dynamic Programming II: Convex and Concave Cost Functions</p>
            </title>
            <aug>
               <au>
                  <snm>Eppstein</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Galil</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Giancarlo</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Italiano</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>J of ACM</source>
            <pubdate>1992</pubdate>
            <volume>39</volume>
            <fpage>546</fpage>
            <lpage>567</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1145/146637.146656</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B33">
            <aug>
               <au>
                  <snm>Aho</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Hopcroft</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Ullman</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Data Structures and Algorithms</source>
            <publisher>Reading, MA.: Addison-Wesley</publisher>
            <pubdate>1983</pubdate>
         </bibl>
         <bibl id="B34">
            <title>
               <p>A Fast Algorithm for Computing Longest Common Subsequences</p>
            </title>
            <aug>
               <au>
                  <snm>Hunt</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Szymanski</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Comm of the ACM</source>
            <pubdate>1977</pubdate>
            <volume>20</volume>
            <fpage>350</fpage>
            <lpage>353</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1145/359581.359603</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B35">
            <title>
               <p>Optimal Sequence Alignment</p>
            </title>
            <aug>
               <au>
                  <snm>Fitch</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Smith</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Proc Nat Acad of Sci USA</source>
            <pubdate>1983</pubdate>
            <volume>80</volume>
            <fpage>1382</fpage>
            <lpage>1385</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1073/pnas.80.5.1382</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B36">
            <title>
               <p>A model of evolutionary change in proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Dayhoff</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Schwartz</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Orcutt</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Atlas of Protein Sequence and Structure</source>
            <editor>Dayhoff M</editor>
            <pubdate>1978</pubdate>
            <fpage>345</fpage>
            <lpage>352</lpage>
         </bibl>
         <bibl id="B37">
            <title>
               <p>Amino acid substitution matrices from protein blocks</p>
            </title>
            <aug>
               <au>
                  <snm>Henikoff</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Henikoff</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Proc Nat Acad of Sci USA</source>
            <pubdate>1992</pubdate>
            <volume>89</volume>
            <fpage>10915</fpage>
            <lpage>10919</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1073/pnas.89.22.10915</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B38">
            <title>
               <p>Performance evaluation of amino acid substitution matrices</p>
            </title>
            <aug>
               <au>
                  <snm>Henikoff</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Henikoff</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Proteins: Structure, function and genetics</source>
            <pubdate>1993</pubdate>
            <volume>17</volume>
            <fpage>49</fpage>
            <lpage>61</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1002/prot.340170108</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B39">
            <title>
               <p>YMF: A Program for Discovery of Novel Transcription Factor Binding Sites by Statistical Overrepresentation</p>
            </title>
            <aug>
               <au>
                  <snm>Sinha</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Tompa</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <fpage>3586</fpage>
            <lpage>3588</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">169024</pubid>
                  <pubid idtype="pmpid" link="fulltext">12824371</pubid>
                  <pubid idtype="doi">10.1093/nar/gkg618</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
