This article is part of the supplement: Proceedings of the Twelfth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics

Open Access Research

Identifying gene clusters by discovering common intervals in indeterminate strings

Daniel Doerr1,2*, Jens Stoye1,2, Sebastian Böcker3 and Katharina Jahn1,2,4

Author Affiliations

1 Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld, Germany

2 Institute for Bioinformatics, Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany

3 Lehrstuhl für Bioinformatik, Friedrich-Schiller-Universität Jena, Jena, Germany

4 Computational Biology Group, Department of Biosystems Science and Engineering, ETH Zürich, Basel, Switzerland

BMC Genomics 2014, 15(Suppl 6):S2 doi:10.1186/1471-2164-15-S6-S2

The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2164/15/S6/S2


Published: 17 October 2014

© 2014 Doerr et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Abstract

Background

Comparative analyses of chromosomal gene orders are successfully used to predict gene clusters in bacterial and fungal genomes. Present models for detecting sets of co-localized genes in chromosomal sequences require prior knowledge of gene family assignments of genes in the dataset of interest. These families are often computationally predicted on the basis of sequence similarity or higher order features of gene products. Errors introduced in this process amplify in subsequent gene order analyses and thus may deteriorate gene cluster prediction.

Results

In this work, we present a new dynamic model and efficient computational approaches for gene cluster prediction suitable in scenarios ranging from traditional gene family-based gene cluster prediction, via multiple conflicting gene family annotations, to gene family-free analysis, in which gene clusters are predicted solely on the basis of a pairwise similarity measure of the genes of different genomes. We evaluate our gene family-free model against a gene family-based model on a dataset of 93 bacterial genomes.

Conclusions

Our model is able to detect gene clusters that would also be detected with well-established gene family-based approaches. Moreover, we show that it is able to detect conserved regions which are missed by gene family-based methods due to wrong or deficient gene family assignments.

Keywords:
common intervals; indeterminate strings; gene cluster detection

Background

Gene clusters are sets of functionally associated genes in prokaryotic and fungal genomes that remain located close to each other over a longer period of evolutionary time, despite the genome undergoing significant rearrangements. Consequently, gene clusters may be found in several related species by means of comparative gene order analysis. Over the past years several such methods have been proposed and subsequently improved in their sensitivity. Initial gene cluster models considered only completely conserved genomic segments that retain gene order and orientation [1,2]. Later models still required gene clusters to be contiguous and complete but dropped the requirement for co-linearity [3-5]. The most powerful class of approaches can handle imperfect conservation of gene content by allowing, to some extent, either inserted genes [6-8] or both inserted and deleted genes [9-11].

All above methods require prior knowledge of homology relations between genes, using either a one-to-one mapping between the gene sets of different genomes [3,6,5], or a general partitioning into gene families [4,7-11]. In the latter, a genome is modeled as a set of sequences over the alphabet of gene families, where each sequence corresponds to a particular chromosome of the organism.

Most commonly, gene families are predicted computationally on the basis of sequence similarity. Various databases exist that offer information on precomputed gene families [12-14]. Furthermore, several software tools are freely available that allow for direct computation of gene family assignments in a dataset of interest [15-17]. Typically, these approaches assume that gene families naturally cluster into densely connected subgraphs in the gene similarity network. However, multi-domain proteins can have strong ties not only to their own family but also to other families they share a domain with. Some of these proteins may not be traceable back to a single gene family at all. While some recent approaches can deal with the ambiguities caused by multi-domain proteins [18,19], it is still a major challenge to define cut-offs in the clustering process that at the same time eliminate spurious edges and keep gene families at a reasonable granularity [20,21].

In this paper, we present a new dynamic model and efficient computational approaches for gene cluster prediction suitable in scenarios ranging from traditional gene family-based gene cluster prediction, via multiple conflicting gene family annotations, to gene family-free analysis, in which gene clusters are predicted solely on the basis of a pairwise similarity measure between the genes of different genomes. We do this by introducing the concept of common intervals to indeterminate strings, which are a class of strings that can have more than one character at every position. We then extend this concept to allow for a limited number of insertions and deletions. Furthermore, we present algorithms that solve related discovery problems of finding all weak common intervals and approximate weak common intervals in indeterminate strings. Finally, we propose a new method for gene family-free discovery of gene clusters based on (approximate) weak common intervals in indeterminate strings.

Methods

Definitions

Indeterminate strings, also known as degenerate strings, are formally defined as [22]:

Definition 1 (indeterminate string) For a given finite alphabet $\Sigma$, let $\mathcal{P}(\Sigma)$ be the power set of $\Sigma$. An indeterminate string is a sequence of character sets, which are elements of $\mathcal{P}(\Sigma) \setminus \{\emptyset\}$.

In other words, for an indeterminate string S with n index positions, it must hold for every $i$, $1 \le i \le n$, that $S[i] \subseteq \Sigma$ and $S[i] \neq \emptyset$, where $S[i]$ denotes the character set associated with the i-th position in S. In the special case where every position of indeterminate string S holds a singleton set, S is equivalent to an ordinary string. We denote the length of an indeterminate string S with n index positions by $|S| := n$ and its cardinality, i.e. the number of all elements in S, by $\|S\| := \sum_{i=1}^{n} |S[i]|$. Two positions a and b, $1 \le a \le b \le n$, induce the indeterminate substring $S[a, b] := S[a]\,S[a+1] \cdots S[b]$. To distinguish intervals in different indeterminate strings, we indicate the affiliation of an interval $[i, j]$ to indeterminate string S by the subscript notation $[i, j]_S$.

Example 1 <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M73','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M73">View MathML</a>is an indeterminate string of length <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M74">View MathML</a>and cardinality <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M75','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M75">View MathML</a>over alphabet <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M76','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M76">View MathML</a>. The third element of S is given by character set <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M77','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M77">View MathML</a>. Interval <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M78','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M78">View MathML</a>induces the substring <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M79','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M79">View MathML</a>.
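As a minimal sketch of these notions, an indeterminate string can be held as a list of non-empty sets. The helper names (`make_indeterminate`, `length`, `cardinality`, `substring`) and the example string below are illustrative assumptions; the string is not the one used in Example 1.

```python
from typing import List, Set

IndeterminateString = List[Set[str]]


def make_indeterminate(sets: List[Set[str]], alphabet: Set[str]) -> IndeterminateString:
    """Check Definition 1: every position is a non-empty subset of the alphabet."""
    for pos, chars in enumerate(sets, start=1):
        if not chars:
            raise ValueError(f"position {pos} is empty")
        if not chars <= alphabet:
            raise ValueError(f"position {pos} uses characters outside the alphabet")
    return [set(chars) for chars in sets]


def length(s: IndeterminateString) -> int:
    """Number of index positions."""
    return len(s)


def cardinality(s: IndeterminateString) -> int:
    """Total number of characters summed over all positions."""
    return sum(len(chars) for chars in s)


def substring(s: IndeterminateString, a: int, b: int) -> IndeterminateString:
    """Substring induced by the 1-based, inclusive positions a and b."""
    if not 1 <= a <= b <= len(s):
        raise ValueError("require 1 <= a <= b <= length")
    return s[a - 1:b]


# Hypothetical example string over the alphabet {a, b, c, d}.
S = make_indeterminate([{"a"}, {"b", "c"}, {"a", "d"}, {"c"}], {"a", "b", "c", "d"})
print(length(S), cardinality(S), substring(S, 2, 3))
```

Positions are 1-based in the text, so the sketch converts to Python's 0-based slices only inside `substring`.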

In this work, we generalize the concept of common intervals, which were initially introduced on permutations [23] and subsequently extended to strings [24]. The idea behind common intervals is to compare strings, or rather substrings, based on their character sets. The character set of an ordinary string S is defined as $\mathcal{C}(S) := \{S[i] \mid 1 \le i \le |S|\}$. The equivalent concept on indeterminate strings is the following:

Definition 2 (character set) The character set of an indeterminate string $S$ is defined by $\mathcal{C}(S) := \bigcup_{i=1}^{|S|} S[i]$.

In two ordinary strings S and T over a finite alphabet Σ, two intervals, [i, j] in S and [k, l] in T, are called common intervals if $\mathcal{C}(S[i, j]) = \mathcal{C}(T[k, l])$. The analogue for indeterminate strings is:

Definition 3 (strict common intervals) Given two indeterminate strings $S$ and $T$, two intervals, $[i, j]$ in $S$ and $[k, l]$ in $T$, are said to be strict common intervals if and only if their character sets $\mathcal{C}(S[i, j])$ and $\mathcal{C}(T[k, l])$ are equal.

A weaker definition based on the intersection relation between character sets is:

Definition 4 (weak common intervals) Given two indeterminate strings $S$ and $T$, two intervals, $[i, j]$ in $S$ and $[k, l]$ in $T$, are weak common intervals with common character set $C := \mathcal{C}(S[i, j]) \cap \mathcal{C}(T[k, l])$ if for each $i'$, $i \le i' \le j$, it holds that $S[i'] \cap C \neq \emptyset$ and for each $k'$, $k \le k' \le l$, it holds that $T[k'] \cap C \neq \emptyset$.
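The difference between the strict and the weak notion can be sketched in a few lines. The sketch assumes that the common character set is the intersection of the two intervals' character sets; the function names and the example strings are hypothetical.

```python
from typing import List, Set

IndeterminateString = List[Set[str]]


def char_set(x: IndeterminateString, a: int, b: int) -> Set[str]:
    """Union of the character sets at the 1-based positions a..b."""
    return set().union(*x[a - 1:b])


def is_strict_common(s, i, j, t, k, l) -> bool:
    """Strict common intervals: both character sets are equal."""
    return char_set(s, i, j) == char_set(t, k, l)


def is_weak_common(s, i, j, t, k, l) -> bool:
    """Weak common intervals: every position of both intervals
    shares at least one character with the common character set."""
    common = char_set(s, i, j) & char_set(t, k, l)  # assumed reading of "common"
    in_s = all(s[p - 1] & common for p in range(i, j + 1))
    in_t = all(t[q - 1] & common for q in range(k, l + 1))
    return in_s and in_t


# Hypothetical example strings.
S = [{"a", "b"}, {"c"}, {"d"}]
T = [{"b"}, {"c", "e"}]

print(is_strict_common(S, 1, 2, T, 1, 2))  # character sets {a,b,c} and {b,c,e} differ
print(is_weak_common(S, 1, 2, T, 1, 2))    # every position meets {b,c}
print(is_weak_common(S, 1, 3, T, 1, 2))    # position 3 of S ({d}) meets nothing
```

In this example the intervals [1, 2] in S and [1, 2] in T are weak but not strict common intervals, which is the situation that arises when a gene carries several conflicting family labels.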

In all our use cases, in particular when dealing with conflicting gene family assignments as well as gene family-free gene cluster detection, the concept of weak common intervals appears to be more appropriate. Thus, in the following, we restrict ourselves to the study of weak common intervals.

Continuing a line of research initially proposed by Schmidt and Stoye [4], we further extend weak common intervals by allowing a limited number of insertions and deletions:

Definition 5 (approximate weak common intervals) Given two indeterminate strings $S$ and $T$ and a threshold $\delta \ge 0$, two intervals, $[i, j]$ in $S$ and $[k, l]$ in $T$, are approximate weak common intervals with common character set $C := \mathcal{C}(S[i, j]) \cap \mathcal{C}(T[k, l])$ if the number of positions with no intersection with $C$ is limited by δ, i.e. $|\{i' \mid i \le i' \le j,\ S[i'] \cap C = \emptyset\}| + |\{k' \mid k \le k' \le l,\ T[k'] \cap C = \emptyset\}| \le \delta$. These positions are called indels.
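A small sketch of the indel count may help here. It assumes that the common character set is the intersection of the two intervals' character sets and that indels in both intervals are counted together against δ; both readings, the function names and the example strings are assumptions of this sketch.

```python
from typing import List, Set

IndeterminateString = List[Set[str]]


def char_set(x: IndeterminateString, a: int, b: int) -> Set[str]:
    """Union of the character sets at the 1-based positions a..b."""
    return set().union(*x[a - 1:b])


def count_indels(s, i, j, t, k, l) -> int:
    """Number of positions in either interval that share no character
    with the common character set (assumed: intersection of both sets)."""
    common = char_set(s, i, j) & char_set(t, k, l)
    in_s = sum(1 for p in range(i, j + 1) if not (s[p - 1] & common))
    in_t = sum(1 for q in range(k, l + 1) if not (t[q - 1] & common))
    return in_s + in_t


def is_approx_weak_common(s, i, j, t, k, l, delta: int) -> bool:
    """Approximate weak common intervals with at most delta indels."""
    return count_indels(s, i, j, t, k, l) <= delta


# Hypothetical example strings.
S = [{"a", "b"}, {"c"}, {"d"}]
T = [{"b"}, {"c", "e"}]

print(count_indels(S, 1, 3, T, 1, 2))              # only position 3 of S is an indel
print(is_approx_weak_common(S, 1, 3, T, 1, 2, 0))  # too many indels for delta = 0
print(is_approx_weak_common(S, 1, 3, T, 1, 2, 1))  # accepted for delta = 1
```

With δ = 0 the test reduces to the weak common interval condition.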

Generally, algorithms for discovering common intervals of ordinary strings only report pairs of intervals that both are maximal, whereby maximality is defined as follows: An interval $[i, j]$ in string X is called maximal if its immediate left and right neighboring characters, $X[i-1]$ and $X[j+1]$ (if such exist), are not contained in $\mathcal{C}(X[i, j])$. In other words, interval $[i, j]$ cannot be extended to its left or right without expanding the character set of the interval.
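For ordinary strings, the maximality test is short enough to sketch directly. The function name `is_maximal` and the example strings are illustrative assumptions.

```python
def is_maximal(x: str, i: int, j: int) -> bool:
    """True if the 1-based interval [i, j] of the ordinary string x cannot be
    extended to the left or right without adding a new character."""
    chars = set(x[i - 1:j])
    left_ok = i == 1 or x[i - 2] not in chars      # left neighbour, if any
    right_ok = j == len(x) or x[j] not in chars    # right neighbour, if any
    return left_ok and right_ok


print(is_maximal("abab", 1, 2))   # right neighbour 'a' is already in {a, b}
print(is_maximal("abab", 1, 4))   # no neighbours left to absorb
print(is_maximal("abcab", 1, 2))  # right neighbour 'c' would enlarge the set
```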

In terms of weak common intervals, we introduce the following property derived from [11]:

Definition 6 ($C$-closed) Given an indeterminate string $S$, an interval $[i, j]$, and a character set $C \subseteq \Sigma$, interval $[i, j]$ is $C$-closed if $S[i] \cap C \neq \emptyset$, $S[j] \cap C \neq \emptyset$, and if $i = 1$ or $S[i-1] \cap C = \emptyset$, and if $j = |S|$ or $S[j+1] \cap C = \emptyset$.

A reasonable balance between omitting useful weak common intervals and including apparently redundant ones is struck by the subset of those that are mutually-closed, defined as follows:

Definition 7 (mutually-closed) Given a pair of intervals $([i, j]_S, [k, l]_T)$ of indeterminate strings $S$ and $T$, $[i, j]_S$ and $[k, l]_T$ are mutually-closed if $[i, j]_S$ is $\mathcal{C}(T[k, l])$-closed and $[k, l]_T$ is $\mathcal{C}(S[i, j])$-closed.
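The two closedness properties can be sketched as follows. The sketch assumes that an interval is closed with respect to a character set when both of its boundary positions intersect the set and neither outside neighbour does, and that mutual closedness tests each interval against the character set of the other; function names and example strings are hypothetical.

```python
from typing import List, Set

IndeterminateString = List[Set[str]]


def char_set(x: IndeterminateString, a: int, b: int) -> Set[str]:
    """Union of the character sets at the 1-based positions a..b."""
    return set().union(*x[a - 1:b])


def is_closed(x: IndeterminateString, a: int, b: int, c: Set[str]) -> bool:
    """Assumed reading of C-closed: both borders hit c, no outside neighbour does."""
    if not (x[a - 1] & c) or not (x[b - 1] & c):
        return False
    left_ok = a == 1 or not (x[a - 2] & c)
    right_ok = b == len(x) or not (x[b] & c)
    return left_ok and right_ok


def is_mutually_closed(s, i, j, t, k, l) -> bool:
    """Each interval is closed with respect to the other's character set."""
    return (is_closed(s, i, j, char_set(t, k, l))
            and is_closed(t, k, l, char_set(s, i, j)))


# Hypothetical example strings.
S = [{"x"}, {"a", "b"}, {"c"}, {"d"}]
T = [{"b"}, {"c", "e"}, {"y"}]

print(is_mutually_closed(S, 2, 3, T, 1, 2))  # neither interval can be extended
print(is_mutually_closed(S, 2, 2, T, 1, 2))  # S[3] = {c} still meets {b, c, e}
```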

We consequently restrict the enumeration of weak common intervals and approximate weak common intervals to those that are mutually-closed.

Combinatorial complexity. The maximal number of mutually-closed weak common intervals of two indeterminate strings S and T of lengths n and m, respectively, is bounded by nm. This result follows from the fact that the number of intervals [k, l] in T that are mutually-closed weak common intervals with any interval with fixed left bound i in S is bounded by m. Likewise, the maximal number of mutually-closed approximate weak common intervals of indeterminate strings S and T is bounded by $(\delta + 1)^2 nm$.
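The nm bound can be checked on small inputs with a brute-force enumeration. This is a naive sketch that tries every pair of intervals, not an efficient algorithm; it assumes the intersection reading of the common character set and the border-based reading of closedness, and all names and example strings are hypothetical.

```python
from typing import List, Set, Tuple

IndeterminateString = List[Set[str]]
Pair = Tuple[Tuple[int, int], Tuple[int, int]]


def char_set(x, a, b) -> Set[str]:
    """Union of the character sets at the 1-based positions a..b."""
    return set().union(*x[a - 1:b])


def is_weak_common(s, i, j, t, k, l) -> bool:
    common = char_set(s, i, j) & char_set(t, k, l)
    return (all(s[p - 1] & common for p in range(i, j + 1))
            and all(t[q - 1] & common for q in range(k, l + 1)))


def is_closed(x, a, b, c) -> bool:
    if not (x[a - 1] & c) or not (x[b - 1] & c):
        return False
    return ((a == 1 or not (x[a - 2] & c))
            and (b == len(x) or not (x[b] & c)))


def enumerate_pairs(s: IndeterminateString, t: IndeterminateString) -> List[Pair]:
    """All mutually-closed weak common interval pairs, by exhaustive search."""
    found = []
    for i in range(1, len(s) + 1):
        for j in range(i, len(s) + 1):
            for k in range(1, len(t) + 1):
                for l in range(k, len(t) + 1):
                    if (is_weak_common(s, i, j, t, k, l)
                            and is_closed(s, i, j, char_set(t, k, l))
                            and is_closed(t, k, l, char_set(s, i, j))):
                        found.append(((i, j), (k, l)))
    return found


# Hypothetical example strings with n = 4 and m = 3.
S = [{"x"}, {"a", "b"}, {"c"}, {"d"}]
T = [{"b"}, {"c", "e"}, {"y"}]
pairs = enumerate_pairs(S, T)
print(pairs, len(pairs) <= len(S) * len(T))
```

The exhaustive search inspects on the order of n²m² interval pairs, so it is only useful for sanity checks on toy inputs.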

Gene family-free analysis. In the absence of gene family assignments, each gene in the dataset is represented by a unique character, linearly ordered along a chromosomal string. Therefore, the n characters of a chromosomal string can be identified by their integer index set {1, 2, ..., n}. Relating characters of one chromosomal string to characters of another, we presume that we are given a symmetric similarity measure $\sigma_{AB} : A \times B \to \mathbb{R}_{\ge 0}$ for any two index sets A and B.

In gene family-free gene cluster analysis we aim at finding pairs of intervals in two chromosomal strings whose characters are similar. We can reduce this problem to finding (approximate) weak common intervals between two indeterminate strings. To this end, we construct an index mapping $B_A$:

<a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M3">View MathML</a>

Thus, $B_A$ is an indeterminate string over alphabet $\{1, \ldots, |A|\}$. Let $I_A := \{1\}\{2\} \cdots \{|A|\}$ represent the indeterminate string of A. A position in $I_A$ shares a character with a position in $B_A$ if and only if the similarity of the two corresponding characters is non-zero. Thus, finding intervals in chromosomal strings A and B whose characters are similar is equivalent to finding (approximate) weak common intervals of indeterminate strings $I_A$ and $B_A$. Note that the set of (approximate) weak common intervals of $I_A$ and $B_A$ is identical to the one of $I_B$ and $A_B$. The (approximate) weak common intervals differ in size and, most substantially, in the similarities between characters within the interval pairs. Therefore, we introduce a simple scoring scheme by which the solution space can be arranged into a landscape of highs and lows of (approximate) weak common intervals, ranked by taking into account the number and the similarities of the contained characters. We define a score function $\mu_{xy}$ over an index x in index set X and an interval $[a, b]_Y$ in index set Y as

<a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M4">View MathML</a>

so that $\mu_{xy}$ takes values between 0 and 1, being 1 if the gene with highest similarity to x lies within interval $[a, b]_Y$. The overall score of an interval pair $([i, j]_A, [k, l]_B)$ is then

<a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M5">View MathML</a>
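The reduction and the scoring idea can be illustrated with a short sketch that follows only the verbal description above. Several things are assumptions of the sketch and not taken from the displayed formulas: position j of the index mapping collects the indices of A with non-zero similarity to gene j of B; the per-gene score is the best similarity inside the interval divided by the best similarity overall; and the pair score sums these values over both intervals. All names and the similarity values are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Sparse similarity: (index in A, index in B) -> positive value; missing means 0.
Similarity = Dict[Tuple[int, int], float]


def index_string(n: int) -> List[Set[int]]:
    """Indeterminate string {1}{2}...{n} of singleton sets."""
    return [{i} for i in range(1, n + 1)]


def index_mapping(sigma: Similarity, n_a: int, n_b: int) -> List[Set[int]]:
    """Assumed construction: position j holds all A-indices similar to gene j of B.
    Genes of B without any similar gene yield an empty set in this sketch."""
    return [{i for i in range(1, n_a + 1) if sigma.get((i, j), 0.0) > 0}
            for j in range(1, n_b + 1)]


def mu(sigma: Similarity, x: int, interval: Tuple[int, int],
       n_other: int, x_in_a: bool = True) -> float:
    """Assumed per-gene score in [0, 1]: best similarity of x inside the
    interval, relative to its best similarity anywhere in the other genome."""
    def sim(y: int) -> float:
        return sigma.get((x, y) if x_in_a else (y, x), 0.0)
    best_overall = max(sim(y) for y in range(1, n_other + 1))
    if best_overall == 0:
        return 0.0
    a, b = interval
    return max(sim(y) for y in range(a, b + 1)) / best_overall


def score(sigma: Similarity, n_a: int, n_b: int,
          ij: Tuple[int, int], kl: Tuple[int, int]) -> float:
    """Assumed pair score: per-gene scores summed over both intervals."""
    from_a = sum(mu(sigma, x, kl, n_b, True) for x in range(ij[0], ij[1] + 1))
    from_b = sum(mu(sigma, y, ij, n_a, False) for y in range(kl[0], kl[1] + 1))
    return from_a + from_b


# Hypothetical similarities between three genes of A and three genes of B.
sigma = {(1, 1): 0.9, (2, 1): 0.3, (2, 2): 0.8, (3, 3): 0.5}
I_A = index_string(3)
B_A = index_mapping(sigma, 3, 3)
print(I_A, B_A)
print(mu(sigma, 2, (1, 1), 3), score(sigma, 3, 3, (1, 2), (1, 2)))
```

Under these assumptions, a gene contributes 1 to the score whenever its best match falls inside the partner interval, so intervals that capture mutual best matches rank highest.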

We now describe three algorithms to compute all mutually-closed weak common intervals and all mutually-closed approximate weak common intervals with at most δ indels in two indeterminate strings. Note that mutually-closed weak common intervals are a special subclass of mutually-closed approximate weak common intervals for δ = 0.

In the following, we consider two indeterminate strings S of length n and T of length m.

Discovering weak common intervals

We now describe the algorithm Weak Common Intervals on Indeterminate Strings (WCII) as presented in Figure 1. It solves the following problem:

Figure 1. WCII algorithm. WCII adapts the search strategy of Didier's Algorithm [24] for common intervals in strings to the computation of weak common intervals in indeterminate strings.

Problem 1 Given two indeterminate strings <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M82','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M82">View MathML</a>and <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M83','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M83">View MathML</a>, discover all mutually-closed weak common intervals of <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M82','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M82">View MathML</a>and <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M83','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M83">View MathML</a>.

To tackle this problem we make use of the following constructs:

Definition 8 (index string) Given an indeterminate string <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M82','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M82">View MathML</a>of length <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M115','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M115">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M116','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M116">View MathML</a>denotes the index string of <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M82','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M82">View MathML</a>.

Definition 9 (index mapping) Given two indeterminate strings <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M82','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M82">View MathML</a>and <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M83','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M83">View MathML</a>of lengths <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M115','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M115">View MathML</a>and <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M117','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M117">View MathML</a>respectively, the index mapping of <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M82','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M82">View MathML</a>onto <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M83','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M83">View MathML</a>is given by <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M118','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M118">View MathML</a>, where

<a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M6">View MathML</a>

Note that index strings and index mappings are again indeterminate strings. The key idea of WCII arises from the following observation:

Observation 1 Given two indeterminate strings <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M82','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M82">View MathML</a>and <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M83','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M83">View MathML</a>with index string <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M119','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M119">View MathML</a>and index mapping <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M120','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M120">View MathML</a>, two intervals <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M71','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M71">View MathML</a>in <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M82','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M82">View MathML</a>and <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M84','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M84">View MathML</a>in <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M83','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M83">View MathML</a>are weak common intervals if and only if <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M7','MathML',630,470);return false;" target="_blank" 
href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M7">View MathML</a>and <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M8">View MathML</a>are weak common intervals.

This equivalence holds because any two positions x in S and y in T intersect if and only if IS[x] and TS[y] intersect. Since IS[x] = {x} for all x = 1, ..., n, we simplify the notation of the single-character set IS[x] to just x and of the character set <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M121','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M121">View MathML</a> to just <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M71','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M71">View MathML</a>. Note that character <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M122','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M122">View MathML</a> subsequently serves both as character c ∈ [i, j] and as index in IS.
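The two constructs can be illustrated with a minimal Python sketch. The formula of Definition 9 is not reproduced in this text, so the sketch assumes, in line with the equivalence argument above, that position y of the index mapping holds every index x of S whose character set intersects T[y]. The function names and the toy strings are hypothetical.

```python
def index_string(S):
    """I_S: position x (1-based) carries the singleton set {x}."""
    return [{x} for x in range(1, len(S) + 1)]


def index_mapping(S, T):
    """T_S (assumed form): position y carries all indices x of S
    whose character set shares at least one character with T[y]."""
    return [{x for x, s in enumerate(S, start=1) if s & t} for t in T]


# Toy indeterminate strings: each position is a set of characters.
S = [{"a"}, {"b", "c"}, {"d"}]
T = [{"c"}, {"a", "d"}, {"e"}]

I_S = index_string(S)
T_S = index_mapping(S, T)

# Positions x in S and y in T intersect exactly when I_S[x] and T_S[y] do.
for x in range(1, len(S) + 1):
    for y in range(1, len(T) + 1):
        assert bool(S[x - 1] & T[y - 1]) == bool(I_S[x - 1] & T_S[y - 1])
```

Under this assumption the original alphabet no longer matters: all later steps can work on integer indices of S alone.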

WCII is an adaptation of Didier's Algorithm [24] for enumerating maximal common intervals in ordinary strings. Didier's strategy can be described as follows: The algorithm iterates over all positions i in S as possible left interval bounds. In each iteration all mutually-closed weak common interval pairs are reported that share the same left bound i in IS. For each possible right bound j ≥ i, the algorithm iterates through the set of positions in TS that contain j in their character set. To this end, we make use of an array POS, where POS[j], 1 ≤ j ≤ n, is a sorted list of positions in TS containing character j. Each position y ∈ POS[j] is associated with an interval <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M9">View MathML</a>, k ≤ y ≤ l, called the min-rank interval of character j for position y. It is the largest interval around y for which every position in <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M10">View MathML</a> contains at least one character in [i, j]. Obviously, <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M11">View MathML</a> is [i, j]-closed. It remains to be checked whether <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M12">View MathML</a> is closed w.r.t.
<a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M123','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M123">View MathML</a> and that every position in <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M13">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M14">View MathML</a> contains a character from <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M124','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M124">View MathML</a>. To show the latter, it is sufficient to show that <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M125','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M125">View MathML</a>, because the character set of each position in IS corresponds to the single element set of its index. The details of both tests are explained below, after relevant data structures are introduced. If both conditions are satisfied, a mutually-closed weak common interval pair is found and subsequently reported.
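The array POS and the min-rank interval can be sketched naively as follows. This is only an illustration of the definitions, not the precomputation used by WCII: the interval is grown position by position, and the index mapping is assumed to be given as a list of integer sets. All names are hypothetical.

```python
def build_pos(T_S, n):
    """POS[j]: sorted list of 1-based positions of T_S that contain character j."""
    pos = {j: [] for j in range(1, n + 1)}
    for y, chars in enumerate(T_S, start=1):  # ascending y keeps lists sorted
        for j in chars:
            pos[j].append(y)
    return pos


def min_rank_interval(T_S, i, j, y):
    """Largest interval [k, l] around y in which every position of T_S
    contains at least one character from [i, j] (naive linear scan)."""
    def hit(z):
        return any(i <= c <= j for c in T_S[z - 1])

    if not hit(y):
        raise ValueError("position y holds no character in [i, j]")
    k = l = y
    while k > 1 and hit(k - 1):
        k -= 1
    while l < len(T_S) and hit(l + 1):
        l += 1
    return k, l


# Toy index mapping over a string S of length n = 3.
T_S = [{2}, {1, 3}, set(), {1, 2}, {3}]
POS = build_pos(T_S, 3)

# For left bound i = 1, walk over right bounds j and the positions in POS[j].
for j in range(1, 4):
    for y in POS[j]:
        print(j, y, min_rank_interval(T_S, 1, j, y))
```

The scan costs time linear in the interval length per query, which is what the precomputed table described next avoids.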

As in Didier's Algorithm, we employ two tricks that improve the performance: precomputing min-rank intervals and following paths of rank-nearest successors.

Precomputing min-rank intervals. In order to identify min-rank intervals, it is sufficient to observe the smallest character c ≥ i in each position. To this end, we make use of the following construct:

Definition 10 (i-reduced string) Given index mapping <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M126','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M126">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M15">View MathML</a>is the <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M127','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M127">View MathML</a>-reduced string of <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M126','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M126">View MathML</a> of the ith iteration, where <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M16">View MathML</a>.

Min-rank intervals in <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M17">View MathML</a> are identical to rank intervals as initially defined by Didier et al. [24]. Interestingly, rank intervals in <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M18">View MathML</a> correspond directly to min-rank intervals in TS:

Lemma 1 The set of min-rank intervals in <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M126','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M126">View MathML</a>is identical to the set of rank intervals in <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M19">View MathML</a>.

Proof: Didier et al. [24] show that rank intervals in a string are nested and that their number is bounded by the length of the string.

Observe that for any position y in <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M20">View MathML</a> the rank interval of character <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M21">View MathML</a> is identical to the min-rank interval of j at position y in TS. Let y be a position in TS and j ϵ TS[y] such that <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M22">View MathML</a>. Further, let <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M23">View MathML</a> be the min-rank interval of j at TS[y], <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M24">View MathML</a>, and <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M25','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M25">View MathML</a> be the min-rank interval of j' at its corresponding position in TS. Because j' ≤ j it consequently holds that <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M26">View MathML</a>. 
Now, according to the definition of min-rank intervals, <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M27','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M27">View MathML</a>, if such a position exists. Since j' is the largest character in <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M28','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M28">View MathML</a> that is smaller than or equal to j, it must also hold that <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M29','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M29">View MathML</a>. The same argument holds for <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M30">View MathML</a> if such a position exists; therefore <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M31">View MathML</a> is the min-rank interval of both characters j' and j. We conclude that all min-rank intervals for any character in TS at iteration i are contained in the set of rank intervals of <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M32','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M32">View MathML</a>.   □
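Lemma 1 can be checked empirically on a small example. The sketch below rests on two assumptions that are not spelled out in this text: the i-reduced string keeps, per position, the smallest character greater than or equal to i (with None where no such character exists), and the rank interval of a position in that string is the largest surrounding interval whose reduced characters do not exceed the one at that position. The helper names are hypothetical, and the brute-force enumeration is for illustration only.

```python
def reduce_string(T_S, i):
    """Assumed i-reduced string: smallest character >= i per position, else None."""
    return [min((c for c in chars if c >= i), default=None) for chars in T_S]


def grow(ok, y, m):
    """Largest interval [k, l] around 1-based position y on which ok(z) holds."""
    k = l = y
    while k > 1 and ok(k - 1):
        k -= 1
    while l < m and ok(l + 1):
        l += 1
    return k, l


def min_rank_intervals(T_S, i):
    """All min-rank intervals of characters j >= i, by brute force."""
    m, found = len(T_S), set()
    for y, chars in enumerate(T_S, start=1):
        for j in (c for c in chars if c >= i):
            inside = lambda z, j=j: any(i <= c <= j for c in T_S[z - 1])
            found.add(grow(inside, y, m))
    return found


def rank_intervals(R):
    """All rank intervals of the (reduced) string R, by brute force."""
    m, found = len(R), set()
    for y, r in enumerate(R, start=1):
        if r is not None:
            inside = lambda z, r=r: R[z - 1] is not None and R[z - 1] <= r
            found.add(grow(inside, y, m))
    return found


T_S = [{2}, {1, 3}, set(), {1, 2}, {3}]
for i in (1, 2, 3):
    R = reduce_string(T_S, i)
    assert min_rank_intervals(T_S, i) == rank_intervals(R)
```

On this toy input both sets coincide for every left bound i, which is the behaviour the lemma states.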

Consequently, all min-rank intervals in TS in the ith iteration (i.e. for a fixed left bound i in IS) can be precomputed in <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M141','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M141">View MathML</a> time using the algorithm given by Didier et al. [24]. They are stored in table INT. For a currently processed character j at position y in TS, INT[y] contains its corresponding min-rank interval. Unlike in Didier's Algorithm, INT must be updated after each iteration such that all positions in INT accessed in the following (j + 1)th iteration contain the corresponding min-rank intervals of character j + 1. Details of the update step can be found in Additional file 1, Section 1.1.

Following paths of rank-nearest successors. The second trick in the algorithm consists in increasing the right bound j in IS while walking through positions and characters of TS. Thereby the algorithm jumps from a current position y that contains character j to its rank-nearest successor, which is the position y' containing character j + 1 with the smallest min-rank distance to y as defined as follows:

Definition 11 (min-rank distance) The min-rank distance of any two positions <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M128','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M128">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M129','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M129">View MathML</a> in indeterminate string <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M126','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M126">View MathML</a> for the ith iteration is given by:

<a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M33','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M33">View MathML</a>

If several co-optimal positions are available, the tie is broken by choosing the leftmost one as rank-nearest successor. In case no position with character j + 1 exists, or the smallest min-rank distance is ∞, j has no successor. For the ith iteration, all rank-nearest successors are precomputed and stored in table SUCC, which is explained in more detail in Additional file 1, Section 1.2.

Connecting characters larger than or equal to i at their corresponding positions in TS with their rank-nearest successors through directed edges results in a forest of rooted trees. Nodes (across all trees) sharing the same character are said to reside on the same level. In lines 8-28 of Figure 1, the algorithm traverses paths through this forest in a bottom-up procedure, from one level to the next, starting at those leaves with character i. Besides the currently visited nodes of the level, the algorithm keeps track of the path bounds, which are the outermost positions in TS a path has visited thus far. The currently visited nodes of the paths and their corresponding path bounds are stored in a list labeled LIST. Only after all nodes of the same level j are processed does the algorithm follow all current paths to nodes of the next level j + 1, thereby ensuring that each character in TS is processed at most once. To this end, for all positions containing character j that have the same rank-nearest successor y', the algorithm discontinues the paths of all but the leftmost one with shortest min-rank distance to y' (line 19). Traversing paths of rank-nearest successors in WCII differs from Didier's Algorithm in that a position in TS may be visited by the same path several times on different levels.

For any given min-rank interval <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M34">View MathML</a> there cannot be more than one weak common interval partner in IS starting at position i. Therefore it is sufficient to track at least one path in each min-rank interval to find all mutually-maximal intervals of IS and TS. Positions in POS are sorted, thus paths leading to the same weak common interval pair appear adjacent to each other in LIST and the common interval pair is reported only for the first (lines 15-17).

For each node in LIST, associated with character j and position y, the algorithm checks if the min-rank interval <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M35','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M35">View MathML</a> of j encloses the path bounds up to position y (see condition in line 15). If validated, a weak common interval pair has been found, given by <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M36','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M36">View MathML</a>. To ensure mutual closedness, the interval pair is only reported if i − 1 is not contained in the character set <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M130','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M130">View MathML</a> and the successor of y is not within the current bounds of its path (see conditions in lines 13 and 15). Checking for the former can be achieved in <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M142','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M142">View MathML</a> time after <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M141','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M141">View MathML</a> time preprocessing by performing a range minimum query on an array of size <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M141','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M141">View MathML</a> where each position containing character i − 1 is assigned 0 and 1 otherwise.
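The containment test for character i − 1 can be sketched as follows. Note that this sketch substitutes a different technique for the range minimum query named above: because the underlying array is binary, a prefix-count array answers the same question with one linear pass of preprocessing and one subtraction per query. The function names are hypothetical.

```python
from itertools import accumulate


def build_prefix(T_S, c):
    """prefix[y] = number of positions 1..y of T_S whose set contains c."""
    return [0] + list(accumulate(1 if c in chars else 0 for chars in T_S))


def interval_contains(prefix, k, l):
    """True if some position in [k, l] (1-based, inclusive) contains c."""
    return prefix[l] - prefix[k - 1] > 0


T_S = [{2}, {1, 3}, set(), {1, 2}, {3}]
i = 2
prefix = build_prefix(T_S, i - 1)  # marks positions containing character 1

# An interval pair with left bound i is only reported if i - 1 is absent.
print(interval_contains(prefix, 1, 1))  # position 1 holds only {2}
print(interval_contains(prefix, 4, 5))  # position 4 holds {1, 2}
```

The prefix array has to be rebuilt for each left bound i, which matches the per-iteration preprocessing described above.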

The overall complexity of the algorithm can be summarized as follows: Each position in IS is regarded exactly once as left bound i for all weak common intervals that are reported in one iteration. Once <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M37','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M37">View MathML</a> is computed for i = 1 it can be updated using array POS, taking overall <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M131','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M131">View MathML</a> time for all left bounds i = 1, ..., n. Further, for each left bound the algorithm performs <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M141','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M141">View MathML</a> steps to precompute all min-rank intervals and <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M131','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M131">View MathML</a> steps to precompute all rank-nearest successors. The subsequent bottom-up procedure and the reporting of weak common intervals again require <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M131','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M131">View MathML</a> time. Therefore we have:

Theorem 1 Given two indeterminate strings <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M82','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M82">View MathML</a>and <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M83','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M83">View MathML</a>, Algorithm WCII finds all pairs of mutually-closed weak common intervals of <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M82','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M82">View MathML</a>and <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M83','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M83">View MathML</a>in <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M132','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M132">View MathML</a> time.

Discovering approximate weak common intervals

We now present the algorithm Approximate Weak Common Intervals on Indeterminate Strings (AWCII), shown in Figure 2. Line numbers mentioned in this subsection refer to that figure. AWCII solves the following problem:

Figure 2. AWCII algorithm. AWCII is a search algorithm for approximate weak common intervals in indeterminate strings. It is an adaptation of RGC [11], an algorithm for computing approximate common intervals in strings.

Problem 2 Given two indeterminate strings <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M82','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M82">View MathML</a>and <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M83','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M83">View MathML</a>and indel threshold <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M94','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M94">View MathML</a>, discover all mutually-closed approximate weak common intervals of <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M82','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M82">View MathML</a>and <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M83','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M83">View MathML</a>with no more than <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M96','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M96">View MathML</a> indels.

Following a strategy similar to WCII, AWCII solves Problem 2 for index mappings IS and TS, instead of S and T. As before, in each iteration the algorithm maintains a fixed left bound i in IS. For each character j ∈ [i, n] all positions y in TS are processed that contain character j (lines 5-25). Thereby character j at position y in TS can be associated with several different intervals around y that are candidates of mutually-closed approximate weak common interval partners for interval <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M38">View MathML</a>. Only intervals surrounding one (or several) positions y can be mutually-closed. However, for an interval <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M39">View MathML</a> containing indels, it no longer holds that the min-rank distance of any two positions within the interval is always smaller than the min-rank distance from any position inside to any position outside the interval. As a result, neither precomputed min-rank intervals nor following paths of rank-nearest successors can be used for improving the algorithm's performance. Instead we pursue a different approach, thereby making AWCII an adaptation of the RGC algorithm of Jahn [11].

For each dk = 1,..., δ (lines 7-23) AWCII identifies the leftmost position k in TS such that at most dk indels are contained in interval <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M40','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M40">View MathML</a> and TS[k] ∩ [i, j] ≠ ∅. Let <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M41','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M41">View MathML</a> be the observed number of indels in <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M42','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M42">View MathML</a> (see line 9). The algorithm then finds for each dl = 1,..., δ − d'k (lines 14-21) the rightmost position l such that again TS[l] ∩ [i, j] ≠ ∅ and the number of indels in <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M43','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M43">View MathML</a> does not exceed dl. Each (k, l) of the at most (δ + 1)² combinations of leftmost and rightmost positions gives rise to a candidate pair of mutually-closed approximate weak common intervals <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M44','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M44">View MathML</a>.
For each candidate pair it is checked that the number of characters in [i, j] not contained in <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M123','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M123">View MathML</a> plus the already consumed number of indels in <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M45','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M45">View MathML</a> does not exceed δ. Finally, it is tested if <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M46','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M46">View MathML</a> is <a onClick="popup('http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M123','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/15/S6/S2/mathml/M123">View MathML</a>-closed. If both conditions are satisfied, a mutually-closed approximate weak common interval pair is found and is subsequently reported (line 18).
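The indel check for a single candidate pair can be sketched as follows. The formal definition of approximate weak common intervals is given earlier in the article and is not repeated here, so the sketch follows only the wording of the check above: it counts positions in the candidate interval of TS that hold no character from [i, j], plus characters of [i, j] that occur nowhere in that interval. The closedness test is not included, and all names are hypothetical.

```python
def count_indels(T_S, i, j, k, l):
    """Return (gaps_in_T, missing_from_S) for candidate pair ([i, j], [k, l]).

    gaps_in_T:      positions in T_S[k..l] with no character in [i, j]
    missing_from_S: characters of [i, j] absent from every position in T_S[k..l]
    Positions and characters are 1-based, intervals inclusive.
    """
    window = T_S[k - 1:l]
    gaps_in_T = sum(
        1 for chars in window if not any(i <= c <= j for c in chars)
    )
    covered = set().union(*window)
    missing_from_S = sum(1 for c in range(i, j + 1) if c not in covered)
    return gaps_in_T, missing_from_S


def within_threshold(T_S, i, j, k, l, delta):
    """True if the candidate pair needs no more than delta indels in total."""
    return sum(count_indels(T_S, i, j, k, l)) <= delta


T_S = [{2}, {1, 3}, set(), {1, 2}, {3}]
print(count_indels(T_S, 1, 3, 1, 5))         # one empty position inside [1, 5]
print(within_threshold(T_S, 1, 3, 1, 5, 1))  # accepted for delta = 1
print(within_threshold(T_S, 1, 3, 1, 5, 0))  # rejected for delta = 0
```

This recomputes both counts from scratch for every candidate, which is the cost the precomputed tables described next are meant to avoid.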

Runtime improvements are achieved by precomputing right and left bounds of candidate intervals [MathML M47] for each character j in TS. These bounds are computed within [MathML M133] time for a fixed left bound i in IS and stored in tables L and R, respectively. Further, for each such candidate interval [MathML M48] the number of characters that are within [i, j] can also be precomputed. This number is used to determine δS in line 16. The construction of the corresponding table, called RANGECONTENT, is achieved within [MathML M134] time for a fixed left bound i. The details of constructing tables L, R, and RANGECONTENT can be found in Additional file 1 Section 2. Note that RANGECONTENT differs significantly from the data structure NUM used in RGC to count characters in intervals.

In terms of overall runtime, for each fixed bound i in IS the algorithm spends [MathML M134] time on computation of the above-mentioned auxiliary tables. Thereafter, AWCII requires [MathML M134] time to iterate through all combinations of candidate intervals in L and R and to identify approximate weak common intervals.

Testing for [MathML M123]-closedness of interval [MathML M49] can be achieved in [MathML M142] time by precomputing a table for all candidate intervals in TS of the ith iteration, where each entry indicates if a character i − 1 or j + 1 is contained in the corresponding candidate interval. Such a table can be constructed within [MathML M135] time, again using a simple sweep algorithm. We conclude with the following theorem:

Theorem 2 Given two indeterminate strings [MathML M82] and [MathML M83] and indel threshold [MathML M94], algorithm AWCII computes all pairs of mutually-closed approximate weak common intervals of [MathML M82] and [MathML M83] in [MathML M136] time.

A runtime heuristic for discovering approximate weak common intervals

Our third algorithm, ACSI (see Figure 3), is a runtime heuristic that solves Problem 2 exactly and in practice outperforms both WCII and AWCII in gene family-free analysis by orders of magnitude.

Figure 3. ACSI algorithm. ACSI is a runtime heuristic that computes all approximate weak common intervals in indeterminate strings.

Like the two preceding algorithms, ACSI operates on index string IS and index mapping TS instead of on the indeterminate strings S and T directly. For every fixed interval [i, j] in IS, ACSI identifies mutually-closed approximate weak common interval partners [k, l] in TS. To this end, it iterates through the elements of POS[i], i.e. the positions in TS that contain character i (lines 3-7 of Figure 3). For each such position y ∈ POS[i] the algorithm calls a recursive procedure, denoted EXTEND (line 5). This recursive procedure requires five parameters, corresponding to fixed bounds [MathML M50], the currently processed interval [k, l] in TS, and the current number of allowed indels, d. In the initial call, interval [MathML M51] is set to [MathML M138] and d = δ. EXTEND then extends interval [MathML M52] on both sides until [i, j] ∩ TS[k − 1] = ∅ and [i, j] ∩ TS[l + 1] = ∅ (line 10).
If in this process the algorithm observes characters i − 1 or j + 1 in [MathML M123], EXTEND returns to the previous call (lines 11-13). Otherwise, it verifies whether [MathML M53] is a mutually-closed approximate weak common interval pair by testing if the number of characters in [i, j] that are missing in [MathML M123] is less than or equal to the current d and if [MathML M137] (line 14). The interval pair is reported if both conditions hold. EXTEND then extends [MathML M54] to the left, thereby consuming indel positions as long as their overall number remains less than or equal to the current d (line 17).
If a position k' < k − 1 has been found such that [i, j] ∩ TS[k'] ≠ ∅, EXTEND is called recursively with parameter values [MathML M55], [MathML M56], and the remaining number of allowed indels, given by d + k' + 1 − k (lines 18-20). This step is executed symmetrically for the right side of [MathML M57] (lines 21-24).
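The recursion described above can be sketched as follows. This is an illustration under stated assumptions and not the pseudocode of Figure 3: TS is modelled as a 0-based list of sets of index characters, the second condition tested in line 14 is not reproduced in the text and is therefore omitted, the closedness test inspects the whole interval rather than only newly added positions, and a set (the hypothetical parameter `found`) collects the reported intervals so that an interval reached via different extension orders appears once.

```python
def hits(ts, p, i, j):
    """True if position p of ts contains a character of [i, j]."""
    return any(i <= c <= j for c in ts[p])


def extend(ts, i, j, k, l, d, found):
    """Collect intervals [k, l] of ts that pair with [i, j] using at most d indels."""
    n = len(ts)
    # Grow the interval while the neighbouring position shares a character with [i, j].
    while k > 0 and hits(ts, k - 1, i, j):
        k -= 1
    while l < n - 1 and hits(ts, l + 1, i, j):
        l += 1
    chars = set().union(*ts[k:l + 1])
    # Characters i - 1 or j + 1 inside the interval: stop exploring this branch.
    if (i - 1) in chars or (j + 1) in chars:
        return
    missing = sum(1 for c in range(i, j + 1) if c not in chars)
    if missing <= d:
        found.add((k, l))
    # Extend to the left: skip `gap` indel positions, then land on a hitting one.
    for gap in range(1, d + 1):
        kp = k - 1 - gap
        if kp < 0:
            break
        if hits(ts, kp, i, j):
            # Remaining budget d - gap equals d + k' + 1 - k.
            extend(ts, i, j, kp, l, d - gap, found)
            break
    # Symmetric extension to the right.
    for gap in range(1, d + 1):
        lp = l + 1 + gap
        if lp >= n:
            break
        if hits(ts, lp, i, j):
            extend(ts, i, j, k, lp, d - gap, found)
            break


ts = [{1}, {5}, {2}, {6}, {6}, {3}]
found = set()
extend(ts, 1, 3, 0, 0, 3, found)  # seed position y = 0, delta = 3
print(sorted(found))
```

Every recursive call strictly enlarges the interval, so the recursion terminates; the indel budget bounds how far each side can jump.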

The actual heuristic speed-up of the algorithm is achieved by calling procedure EXTEND in line 5 not for all intervals [i, j] in IS, but only for those that have a chance of forming a weak common interval pair with an interval [k, l] around a position y ∈ POS[i]. To this end, the neighborhood around position y is scanned for suitable values of j prior to the execution of EXTEND. The details are described in Additional file 1 Section 3.

Results and discussion

In the following, we highlight the benefit of our dynamic model in comparison with present approaches. Although conflicting gene family assignments are extremely common in practice, this scenario is difficult to evaluate. Assuming the existence of an ultimately true gene family assignment, conflicts arise from incorrect gene family assignments. Therefore, an evaluation would inevitably result in benchmarking gene family prediction tools, rather than scrutinizing our model.

Instead, we decided to evaluate our gene family-free model against the traditional gene family-based approach. To this end, we chose a genomic dataset of bacterial genomes that has been used in a prior gene cluster study [8] and was originally obtained from [25]. The dataset features 133 chromosomal sequences, from which we removed all sequences originating from plasmids.

In practice ACSI outperforms both WCII and AWCII as shown by Figure 4. Thus, in all subsequent results, we used ACSI to compute mutually-closed (approximate) weak common intervals.

Figure 4. Runtimes of presented algorithms in practice. Running times of ACSI and AWCII with δ = 0 and WCII, measured in a sample of 24 arbitrarily chosen pairwise comparisons of genomes that are contained in the studied dataset. All algorithms produced identical output (as expected). Running times are plotted against the number of pairwise gene similarities (equivalent to the size of [MathML M143]) contained in the pairwise comparison.

Gene family-based dataset. Genes in this dataset are annotated with COG (Clusters of Orthologous Groups) identifiers [12], which are used to establish homology relationships between genes. The set of genes in the dataset was revised using the latest available gene information under the accession numbers of the respective genomes at NCBI. To this end, genes that have since been marked as pseudogenes were removed from the dataset. No genes were added, since COG annotations of new genes are not available. We further omitted from subsequent analyses all genomes from which more than 10 pseudogenes were removed in this process. 93 genomes remained, comprising on average 2726 genes (minimum/maximum number of genes: 784/8317).

Gene family-free dataset. Pairwise similarities between genes in the dataset were obtained using the relative reciprocal BLAST score (RRBS) [26]. Genes were compared on the basis of their encoded protein sequences using BLASTP+ [27] with an e-value threshold of 0.1 and disabled composition-based score adjustments.
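As an illustration of the similarity measure, the following sketch computes an RRBS value from a table of BLAST bit scores. It assumes the commonly used definition, in which the sum of the two reciprocal scores is divided by the sum of the two self-scores; the dictionary layout, the function name `rrbs`, and the convention that a missing hit counts as score 0 are assumptions made for this example and are not taken from the article.

```python
def rrbs(bitscore, a, b):
    """Relative reciprocal BLAST score of genes a and b.

    `bitscore` maps (query, subject) pairs to BLAST bit scores.
    Assumed definition: (s(a,b) + s(b,a)) / (s(a,a) + s(b,b)).
    """
    reciprocal = bitscore.get((a, b), 0.0) + bitscore.get((b, a), 0.0)
    self_hits = bitscore.get((a, a), 0.0) + bitscore.get((b, b), 0.0)
    if self_hits == 0:
        # Without self-scores the ratio is undefined; treat as no similarity.
        return 0.0
    return reciprocal / self_hits


# Hypothetical bit scores for two genes g1 and g2.
scores = {
    ("g1", "g1"): 100.0,
    ("g2", "g2"): 200.0,
    ("g1", "g2"): 60.0,
    ("g2", "g1"): 90.0,
}
print(rrbs(scores, "g1", "g2"))
```

Normalizing by the self-scores makes the value symmetric in the two genes and independent of sequence length, which is convenient when the scores serve as edge weights of a gene similarity graph.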

For evaluation purposes, we produced gene similarity sets pruned to different degrees by filtering spurious gene similarities. For this, we employed an undirected variant of the stringency criterion (parameterized by f ∈ [0, 1]) for gene similarities proposed in [28], which is described in more detail in Additional file 1 Section 4.1.

To evaluate the gene family-free model, we ran an implementation of ACSI for δ = 0 on the unpruned gene similarity graph of our dataset and compared the 4015841 interval pairs with respect to the contained COG identifiers. We discarded all pairs for which at least one interval contained fewer than two genes with a COG identifier. In the remaining 1194036 interval pairs, we observed that the similarity in the set of COG identifiers depends strongly on the intervals' score (Table 1). Among the clusters with a score greater than or equal to 10, 95% have the same set of identifiers in both intervals. While this number decreases for smaller scores, still a quarter of the interval pairs with a score lower than 1 do not differ in their COG identifiers. This shows that our approach is able to detect gene clusters that would also be detected with well-established gene family-based approaches.

Table 1. Statistics of overlaps between the COG identifier sets of pairs of weak common intervals.

This is not a surprise, as weak common intervals are in fact a generalization of the classic common intervals model: A run of ACSI on a dataset where similarity scores are only set between members with the same COG identifiers finds the exact same set of clusters as the common intervals based approach.

To evaluate the predictive power of our approach, we compare the output of our program to gene clusters predicted by the reference gene cluster algorithm (RGC) [11]. While this algorithm is capable of multiple genome comparison and the detection of faint conservation patterns, we use it in this context for pairwise genome comparison to detect interval pairs (I1, I2) whose gene sets have a symmetric set distance of at most 2. It has been previously observed that the generalization to approximate conservation underlying the reference gene cluster approach is not only a way to find imperfectly conserved clusters, but also a means to add robustness against errors in gene family assignment. For example, an interval pair may appear to have a set distance of two because, besides the shared genes, there is one gene unique to I1 and one gene unique to I2. However, a closer inspection may reveal that these genes are in fact homologs that were not recognized in the preceding partitioning of genes into homology families. We ran RGC on all pairs of the 93 genomes, setting parameters δ = 2 (maximal tolerated symmetric set distance) and s = 3 (the minimum cluster size). The program returned, among others, 192900 "single-mismatch clusters", i.e. clusters that have exactly one extra gene in each interval. In 47453 (24.60%) of the single-mismatch clusters, we observe a similarity score between the two extra genes in our BLAST dataset. ACSI found 89.84% of the single-mismatch clusters, and for 75.24% the extra genes turned out to be pairwise best hits. Moreover, we observe that in 18143 of the single-mismatch gene clusters predicted by RGC the two extra genes have exactly the same annotation string. (Annotations containing the word "hypothetical" were ignored.) ACSI finds 90.19% of these clusters. Surprisingly, 4.59% of the single-mismatch clusters in which the two extra genes had best hits to each other were not found by ACSI.
This is because for one or more of the other genes in the cluster our BLAST results did not return any similarity score to a gene in the other interval. Apparently, the elements of a cluster of orthologous groups can be very faintly related in terms of sequence similarity.
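The notion of a single-mismatch cluster can be made concrete with a small sketch. Intervals are modelled here as lists of (gene, family) pairs; this representation and the function names are assumptions for illustration and do not reflect the data structures of RGC or ACSI.

```python
def symmetric_set_distance(interval1, interval2):
    """Number of gene families that occur in exactly one of the two intervals."""
    fam1 = {family for _, family in interval1}
    fam2 = {family for _, family in interval2}
    return len(fam1 ^ fam2)


def extra_genes(interval1, interval2):
    """Genes whose family is absent from the respective other interval."""
    fam1 = {family for _, family in interval1}
    fam2 = {family for _, family in interval2}
    only1 = [gene for gene, family in interval1 if family not in fam2]
    only2 = [gene for gene, family in interval2 if family not in fam1]
    return only1, only2


def is_single_mismatch(interval1, interval2):
    """True if each interval contains exactly one gene without a family partner."""
    only1, only2 = extra_genes(interval1, interval2)
    return len(only1) == 1 and len(only2) == 1


# Hypothetical intervals: gene a3 and gene b3 were assigned to different families.
i1 = [("a1", "COG1"), ("a2", "COG2"), ("a3", "COG7")]
i2 = [("b1", "COG1"), ("b2", "COG2"), ("b3", "COG9")]
print(symmetric_set_distance(i1, i2), is_single_mismatch(i1, i2), extra_genes(i1, i2))
```

For such a pair, looking up a sequence similarity score between the two extra genes (here a3 and b3) is the kind of test that separates a missed homology from a genuine difference in gene content.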

Comparison with RegulonDB data. Among other information about transcriptional regulation, RegulonDB [29] provides a list of operon locations in Escherichia coli K12. While the majority of operons in RegulonDB are computationally predicted, some are also experimentally confirmed. Of the 2649 operons reported in RegulonDB, 846 span two or more genes. We mapped these operons to the annotation of the E. coli K12 genome in our dataset. However, 104 operons contain genes that are not annotated in our dataset and thus were omitted from subsequent analysis. The remaining 742 operons span between 2 and 16 genes, 71.83% of which span 2 or 3 genes. The number of detected gene clusters depends strongly on the degree of evolutionary relatedness between the E. coli K12 genome and other genomes in the dataset. While ACSI and RGC predicted many occurrences in other closely related γ-proteobacteria in our dataset, for the majority of genomes only a few occurrences of operons were reported. Additional file 1 Section 4.2 gives an overview of the number of gene clusters found in the dataset. The sets of reported operons found by ACSI and RGC are not entirely overlapping. Instead, ACSI finds operons which RGC does not find and vice versa. A complete overview of unique findings for algorithms and parameter settings is shown in Table 2.

Table 2. Unique findings (with 100% overlap) of operons by ACSI and RGC with minimum cluster size s = 2 and varying parameters.

Conclusions

In this work we introduced a new model to detect gene clusters based on the study of (approximate) weak common intervals in indeterminate strings. In the context of gene family-free analysis, we presented a scoring scheme for (approximate) weak common intervals which rates both interval size and the degree of similarity between the contained genes of an (approximate) weak common interval pair. We used our gene family-free model to predict gene clusters between pairs of genomes. This approach was evaluated in comparison with the common intervals-based reference gene cluster model.

In addition to the use case of detecting gene clusters, our algorithms can also be helpful to identify syntenic blocks in a gene family-free analysis. The hierarchical nature of common intervals is maintained in our weak common intervals model, which makes it well suited for studying potential syntenic blocks of arbitrary resolution. The basic concept of common intervals in strings has seen many generalizations in recent years which have greatly increased its utility for biological studies, in particular the simultaneous consideration of more than two strings, requiring common intervals to occur in all or at least a certain number of them. Such a generalization of (approximate) weak common intervals in indeterminate strings is an interesting direction for future work.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

All authors were involved in the early conception of the project. DD, KJ and JS developed the methods and designed the analysis. DD and KJ performed the evaluation and wrote the manuscript; all authors discussed the results, commented on the manuscript, and read and approved its final version.

Acknowledgements

DD receives a scholarship from the CLIB Graduate Cluster Industrial Biotechnology. KJ is funded by DFG grant ST 431/5-1.

Declarations

We acknowledge support for the Article Processing Charge by the German Research Foundation and the Open Access Publication Fund of Bielefeld University Library.

This article has been published as part of BMC Genomics Volume 15 Supplement 6, 2014: Proceedings of the Twelfth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S6.

References

  1. Tamames J, et al.: Evolution of gene order conservation in prokaryotes.

    Genome Biol 2001, 2(6):1-0020.

  2. Wolfe KH, Shields DC: Molecular evidence for an ancient duplication of the entire yeast genome.

    Nature 1997, 387:708-713.

  3. Heber S, Stoye J: Algorithms for finding gene clusters.

    Proceedings of WABI 2001 LNCS 2001, 2149:252-263.

  4. Schmidt T, Stoye J: Quadratic time algorithms for finding common intervals in two and more sequences.

    Proc of CPM 2004 LNCS 2004, 3109:347-358.

  5. Heber S, Mayr R, Stoye J: Common intervals of multiple permutations.

    Algorithmica 2011, 60(2):175-206.

  6. Bergeron A, Corteel S, Raffinot M: The algorithmic of gene teams.

    Proceedings of WABI 2002 LNCS 2002, 2452:464-476.

  7. He X, Goldwasser MH: Identifying conserved gene clusters in the presence of homology families.

    J Comp Biol 2005, 12(6):638-656.

  8. Ling X, He X, Xin D: Detecting gene clusters under evolutionary constraint in a large number of genomes.

    Bioinformatics 2009, 25(5):571.

  9. Rahmann S, Klau GW: Integer linear programs for discovering approximate gene clusters.

    Proceedings of WABI 2006 LNBI 2006, 4175:298-309.

  10. Böcker S, Jahn K, Mixtacki J, Stoye J: Computation of median gene clusters.

    J Comput Biol 2009, 16(8):1085-1099.

  11. Jahn K: Efficient computation of approximate gene clusters based on reference occurrences.

    J Comput Biol 2011, 18(9):1255-1274.

  12. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA: The COG database: an updated version includes eukaryotes.

    BMC Bioinformatics 2003, 4:41.

  13. Powell S, Szklarczyk D, Trachana K, Roth A, Kuhn M, Muller J, Arnold R, Rattei T, Letunic I, Doerks T, Jensen LJ, von Mering C, Bork P: eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges.

    Nucleic Acids Res 2012, 40(Database):284-9.

  14. Waterhouse RM, Zdobnov EM, Tegenfeldt F, Li J, Kriventseva EV: OrthoDB: the hierarchical catalog of eukaryotic orthologs in 2011.

    Nucleic Acids Res 2011, 39(Database):283-8.

  15. Shi G, Peng MC, Jiang T: MultiMSOAR 2.0: an accurate tool to identify ortholog groups among multiple genomes.

    PLoS one 2011, 6(6):20892.

  16. Li L, Stoeckert CJ, Roos DS: OrthoMCL: identification of ortholog groups for eukaryotic genomes.

    Genome Res 2003, 13(9):2178-2189.

  17. Ostlund G, Schmitt T, Forslund K, Köstler T, Messina DN, Roopra S, Frings O, Sonnhammer ELL: InParanoid 7: new algorithms and tools for eukaryotic orthology analysis.

    Nucleic Acids Res 2010, 38(Database):196-203.

  18. Song N, Sedgewick RD, Durand D: Domain architecture comparison for multidomain homology identification.

    J Comput Biol 2007, 14(4):496-516.

  19. Joseph JM, Durand D: Family classification without domain chaining.

    Bioinformatics 2009, 25(12):45-53.

  20. Frech C, Chen N: Genome-wide comparative gene family classification.

    PLoS one 2010, 5(10):13409.

  21. Liu J, Rost B: Domains, motifs and clusters in the protein universe.

    Current Opinion in Chemical Biology 2003, 7(1):5-11.

  22. Holub J, Smyth WF: Algorithms on indeterminate strings.

    Proc of AWOCA 2003 2003, 36-45.

  23. Uno T, Yagiura M: Fast algorithms to enumerate all common intervals of two permutations.

    Algorithmica 2000, 26(2):290-309.

  24. Didier G, Schmidt T, Stoye J, Tsur D: Character sets of strings.

    J Discr Alg 2007, 5(2):330-340.

  25. Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P: Toward automatic reconstruction of a highly resolved tree of life.

    Science 2006, 311(5765):1283-1287.

  26. Pesquita C, Faria D, Bastos H, Ferreira AE, Falcão AO, Couto FM: Metrics for GO based protein semantic similarity: a systematic evaluation.

    BMC Bioinformatics 2008, 9(Suppl 5):4.

  27. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool.

    J Mol Biol 1990, 215(3):403-410.

  28. Lechner M, Findeiß S, Steiner L, Marz M, Stadler PF, Prohaska SJ: Proteinortho: detection of (co-)orthologs in large-scale analysis.

    BMC Bioinformatics 2011, 12:124.

  29. Salgado H, Peralta-Gil M, Gama-Castro S, Santos-Zavaleta A, Muñiz-Rascado L, García-Sotelo JS, Weiss V, Solano-Lira H, Martínez-Flores I, Medina-Rivera A, Salgado-Osorio G, Alquicira-Hernández S, Alquicira-Hernández K, López-Fuentes A, Porrón-Sotelo L, Huerta AM, Bonavides-Martínez C, Balderas-Martínez YI, Pannier L, Olvera M, Labastida A, Jiménez-Jacinto V, Vega-Alvarado L, Del Moral-Chávez V, Hernández-Alvarez A, Morett E, Collado-Vides J: RegulonDB v8.0: omics data sets, evolutionary conservation, regulatory phrases, cross-validated gold standards and more.

    Nucleic Acids Res 2013, 41(Database):203-13.