Abstract
Background
Comparative analyses of chromosomal gene orders are successfully used to predict gene clusters in bacterial and fungal genomes. Present models for detecting sets of colocalized genes in chromosomal sequences require prior knowledge of gene family assignments of genes in the dataset of interest. These families are often computationally predicted on the basis of sequence similarity or higher order features of gene products. Errors introduced in this process amplify in subsequent gene order analyses and thus may deteriorate gene cluster prediction.
Results
In this work, we present a new dynamic model and efficient computational approaches for gene cluster prediction suitable in scenarios ranging from traditional gene familybased gene cluster prediction, via multiple conflicting gene family annotations, to gene familyfree analysis, in which gene clusters are predicted solely on the basis of a pairwise similarity measure of the genes of different genomes. We evaluate our gene familyfree model against a gene familybased model on a dataset of 93 bacterial genomes.
Conclusions
Our model is able to detect gene clusters that would be also detected with wellestablished gene familybased approaches. Moreover, we show that it is able to detect conserved regions which are missed by gene familybased methods due to wrong or deficient gene family assignments.
Keywords:
common intervals; indeterminate strings; gene cluster detectionBackground
Gene clusters are sets of functionally associated genes in prokaryotic and fungal genomes that are located close to each other over a longer period of evolutionary time, despite the genome undergoing significant rearrangements. Consequently, gene clusters may be found in several related species by means of comparative gene order analysis. Over the past years several such methods have been proposed and subsequently improved in their sensitivity. Initial gene cluster models considered only completely conserved genomic segments that retain gene order and orientation [1,2]. Later models still required gene clusters to be contiguous and complete but dropped the requirement for colinearity [35]. The most powerful class of approaches can handle imperfect conservation of gene content by allowing to some extent either inserted [68] or both inserted and deleted genes [911].
All above methods require prior knowledge of homology relations between genes, using either a onetoone mapping between the gene sets of different genomes [3,6,5], or a general partitioning into gene families [4,711]. In the latter, a genome is modeled as a set of sequences over the alphabet of gene families, where each sequence corresponds to a particular chromosome of the organism.
Most commonly, gene families are predicted computationally on the basis of sequence similarity. Various databases exist that offer information of precomputed gene families [1214]. Furthermore, several software tools are freely available that allow for direct computation of gene family assignments in a dataset of interest [1517]. Typically, these approaches assume that gene families naturally cluster into densely connected subgraphs in the gene similarity network. However, multidomain proteins can have strong ties not only to their own family but also to other families they share a domain with. Some of these proteins may not be at all traceable back to a single gene family. While some recent approaches can deal with the ambiguities caused by multidomain proteins [18,19], it is still a major challenge to define cutoffs in the clustering process that at the same time eliminate spurious edges and keep gene families at a reasonable granularity[20,21].
In this paper, we present a new dynamic model and efficient computational approaches for gene cluster prediction suitable in scenarios ranging from traditional gene familybased gene cluster prediction, via multiple conflicting gene family annotations, to gene familyfree analysis, in which gene clusters are predicted solely on the basis of a pairwise similarity measure between the genes of different genomes. We do this by introducing the concept of common intervals to indeterminate strings, which are a class of strings that can have more than one character at every position. We then extend this concept to allow for a limited number of insertions and deletions. Furthermore, we present algorithms that solve related discovery problems of finding all weak common intervals and approximate weak common intervals in indeterminate strings. Finally, we propose a new method for gene familyfree discovery of gene clusters based on (approximate) weak common intervals in indeterminate strings.
Methods
Definitions
Indeterminate strings, also known as degenerate strings are formally defined as [22]:
Definition 1 (indeterminate string) For a given finite alphabet , let be the power set of . An indeterminate string is a sequence of character sets, which are elements of .
In other words, for an indeterminate string S with n index positions must hold that for every i, , and , where denotes the character set associated with the ith position in S. In the special case where every position of indeterminate string S holds a singleton set, S is equivalent to an ordinary string. We denote the length of an indeterminate string S with n index positions by and its cardinality, i.e. the number of all elements in S, by . Two positions a and b, , induce the indeterminate substring . To distinguish intervals in different indeterminate strings, we indicate the affiliation of an interval to indeterminate string S by the subscript notation .
Example 1 is an indeterminate string of length and cardinality over alphabet . The third element of S is given by character set . Interval induces the substring .
In this work, we generalize the concept of common intervals, which were initially introduced on permutations [23] and subsequently extended to strings [24]. The idea behind common intervals is to compare strings, or rather substrings, based on their character sets. The character set of an ordinary string S is defined as . The equivalent concept on indeterminate strings is the following:
Definition 2 (character set) The character set of an indeterminate string is defined by .
In two ordinary strings S and T over a finite alphabet Σ, two intervals, [i, j] in S and [k, l] in T, are called common intervals if ). The analogon for indeterminate strings is:
Definition 3 (strict common intervals) Given two indeterminate strings and , two intervals, in and in , are said to be strict common intervals if and only if their character sets and are equal.
A weaker definition based on the intersection relation between character sets is:
Definition 4 (weak common intervals) Given two indeterminate strings and , two intervals, in and in , are weak common intervals with common character set if for each , , it holds that and for each , , it holds that .
In all our use cases, in particular when dealing with conflicting gene family assignments as well as gene familyfree gene cluster detection, the concept of weak common intervals appears to be more appropriate. Thus, in the following, we restrict ourselves to the study of weak common intervals.
Furthermore, continuing a previous line of research initially proposed by Schmidt and Stoye in [4], we further extend weak common intervals by allowing a limited number of insertions and deletions:
Definition 5 (approximate weak common intervals) Given two indeterminate strings and and a threshold , two intervals, in and in , are approximate weak common intervals with common character set if the number of positions with no intersection with is limited by δ, i.e. . These positions are called indels.
Generally, algorithms for discovering common intervals of ordinary strings only report pairs of intervals that both are maximal, whereby maximality is defined as follows: An interval in string X is called maximal if its immediate left and right neighboring characters, and (if such exist), are not contained in . In other words, interval cannot be extended to its left or right without expanding the character set of the interval.
In terms of weak common intervals, we introduce the following property derived from [11]:
Definition 6 (closed) Given an indeterminate string , an interval , and a character set , interval is closed if , , and if or , and if or .
A reasonable balance between omitting expedient and including apparently redundant weak common intervals is found by the subset of those that are mutuallyclosed, as defined as follows:
Definition 7 (mutuallyclosed) Given a pair of intervals of indeterminate strings and , and are mutuallyclosed if is closed and closed.
We consequently restrict the enumeration of weak common intervals and approximate weak common intervals to those that are mutuallyclosed.
Combinatorial complexity. The maximal number of mutuallyclosed weak common intervals of two indeterminate strings S and T of length n and m, respectively, is bounded by nm. This result follows from the fact that the number of intervals [k, l] in T that are mutuallyclosed weak common intervals with any interval with fixed left bound i in S is bounded by m. Likewise, the maximal number of mutuallyclosed approximate weak common intervals of indeterminate strings S and T is bounded by (δ + 1)^{2}nm.
Gene familyfree analysis. In absence of gene family assignments, each gene in the dataset is represented by a unique character, linearly ordered along a chromosomal string. Therefore, the n characters of a chromosomal string can be identified by their integer index set {1, 2, . . . , n}. Relating characters of one chromosomal string to characters of another, we presume that we are given a symmetric similarity measure σ_{AB }: A × B → ℝ_{≥0 }for any two index sets A and B.
In gene familyfree gene cluster analysis we aim at finding pairs of intervals in two chromosomal strings, whose characters are similar. We can reduce this problem to finding (approximate) weak common intervals between two indeterminate strings. To this end, we construct an index mapping B_{A}:
Thus, B_{A }is an indeterminate string over alphabet . Let represent the indeterminate string of A, a position in I_{A }shares a character with a position in B_{A }if and only if the similarity of the two corresponding characters is nonzero. Thus, finding intervals in chromosomal strings A and B, whose characters are similar, is equivalent to finding (approximate) weak common intervals of indeterminate strings I_{A }and B_{A}. Note that the set of (approximate) weak common intervals of I_{A }and B_{A }is identical to the one of I_{B }and A_{B}. The (approximate) weak common intervals differ in size and, most substantially, in the similarities between characters within the interval pairs. Therefore, we introduce a simple scoring scheme by which the solution space can be arranged into a landscape of highs and lows of (approximate) weak common intervals, ranked by taking into account the number and the similarities of the contained characters. We define a score function µ_{xy }over an index x in index set X and an interval [a, b]_{Y }in index set Y as
so that µ_{xy }takes values between 0 and 1, being 1 if the gene with highest similarity to x lies within interval [a, b]_{Y}. The overall score of two interval pairs ([i, j]_{A}, [k, l]_{B}) is then
We now describe three algorithms to compute all mutuallyclosed weak common intervals and all mutuallyclosed approximate weak common intervals with at most δ indels in two indeterminate strings. Note that mutuallyclosed weak common intervals are a special subclass of mutuallyclosed approximate weak common intervals for δ = 0.
In the following, we consider two indeterminate strings S of length n and T of length m.
Discovering weak common intervals
We now describe the algorithm Weak Common Intervals on Indeterminate Strings (WCII) as presented in Figure 1. It solves the following problem:
Problem 1 Given two indeterminate strings and , discover all mutuallyclosed weak common intervals of and .
To tackle this problem we make use of the following constructs:
Definition 8 (index string) Given an indeterminate string of length , denotes the index string of .
Definition 9 (index mapping) Given two indeterminate strings and of lengths and respectively, the index mapping of onto is given by , where
Note that index strings and index mappings are again indeterminate strings. The key idea of WCII arises from the following observation:
Observation 1 Given two indeterminate strings and with index string and index mapping , two intervals in and in are weak common intervals if and only if and are weak common intervals.
This equivalence holds because any two positions, x in S and y in T intersect if and only if I_{S}[x] and T_{S}[y] intersect. Since it holds that I_{S}[x] = {x} for all x = 1, . . . , n, we simplify the notation of single character set I_{S}[x] to just x and character set to just . Note that character serves subsequently both as character c ϵ [i, j] as well as index in I_{S}.
WCII is an adaptation of Didier's Algorithm [24] of enumerating maximal common intervals in ordinary strings. Didier's strategy can be described as follows: The algorithm iterates over all positions i in S as possible left interval bounds. In each iteration all mutuallyclosed weak common interval pairs are reported that share the same left bound i in I_{S}. For each possible right bound j ≥ i, the algorithm iterates through the set of positions in T_{S }that contain j in their character set. To this end, we make use of an array POS, where POS[j], 1 ≤ j ≤ n, is a sorted list of positions in T_{S }containing character j. Each position y ϵ POS[j] is associated with an interval , k ≤ y ≤ l, called the minrank interval of character j for position y. It is the largest interval around y for which every position in contains at least one character in [i, j]. Obviously, is [i, j]closed. It remains to be checked if is closed w.r.t. and that every position in and contains a character from . To show the latter, it is sufficient to show that , because the character set of each position in I_{S }corresponds to the single element set of its index. The details of both tests are explained below, after relevant data structures are introduced. If both conditions are satisfied, a mutuallyclosed weak common interval pair is found and subsequently reported.
Like in Didier's Algorithm, we employ two tricks that improve the performance: precomputing minrank intervals and following paths of ranknearest successors.
Precomputing minrank intervals. In order to identify minrank intervals, it is sufficient to observe the smallest character c ≥ i in each position. To this end, we make use of the following construct:
Definition 10 (ireduced string) Given index mapping , is the reduced string of of the ith iteration, where .
Minrank intervals in are identical to rank intervals as initially defined by Dider et al. [24]. Interestingly, rank intervals in correspond directly to minrank intervals in T_{S}:
Lemma 1 The set of minrank intervals in is identical to the set of rank intervals in .
Proof: Didier et al.[24] show that rank intervals in a string are nested and that their number is bounded by the length of the string.
Observe that for any position y in the rank interval of character is identical to the minrank interval of j at position y in T_{S}. Let y be a position in T_{S }and j ϵ T_{S}[y] such that . Further, let be the minrank interval of j at T_{S}[y], , and be the minrank interval of j' at its corresponding position in T_{S}. Because j' ≤ j it consequently holds that . Now, according to the definition of minrank intervals, , if such position exists. Since j', is the largest character in that is smaller than or equal to j, it must also hold that . The same argument holds for if such position exists, therefore is the minrank interval of both characters j' and j. We conclude that all minrank intervals for any character in T_{S }at iteration i are contained in the set of rank intervals of . □
Consequently, all minrank intervals in T_{S }in the ith iteration (i.e. for a fixed left bound i in I_{S }) can be precomputed in time using the algorithm given by Didier et al. [24]. They are stored in table INT. For a currently processed character j at position y in T_{S}, INT[y] contains its corresponding minrank interval. Unlike Didier's Algorithm, INT must be updated after each iteration such that all positions in INT accessed in the following (j + 1)th iteration contain the corresponding minrank intervals of character j + 1. Details of the update step can be found in Additional file 1 Section 1.1.
Format: PDF Size: 328KB Download file
This file can be viewed with: Adobe Acrobat Reader
Following paths of ranknearest successors. The second trick in the algorithm consists in increasing the right bound j in I_{S }while walking through positions and characters of T_{S}. Thereby the algorithm jumps from a current position y that contains character j to its ranknearest successor, which is the position y' containing character j + 1 with the smallest minrank distance to y as defined as follows:
Definition 11 (minrank distance) The minrank distance of any two positions and in indeterminate string for the ith iteration is given by:
If several cooptimal positions are available, the tie is broken by choosing the leftmost one as ranknearest successor. In case no position with character j + 1 exists, or the smallest minrank distance is '∞', j has no successor. For the ith iteration, all ranknearest successors are precomputed and stored in table SUCC which is explained in more detail in Additional file 1 Section 1.2.
Connecting characters larger than or equal to i at their corresponding positions in T_{S }with their ranknearest successors through directed edges results in a forest of rooted trees. Nodes (across all trees) sharing the same character are said to reside on the same level. In lines 828 of Figure 1, the algorithm traverses along paths through this forest in a bottomup procedure, from one level to the next, starting at those leaves with character i. Besides the currently visited nodes of the level, the algorithm keeps track of the path bounds, which are the outermost positions in T_{S }a path has visited thus far. The currently visited nodes of the paths and their corresponding path bounds are stored in a list labeled LIST. Only after all nodes of the same level j are processed, the algorithm follows all current paths to nodes of the next level j + 1, thereby ensuring that each character in T_{S }is processed at most once. To this end, for all positions containing character j that have the same ranknearest successor y', the algorithm discontinues the paths of all but the leftmost one with shortest minrank distance to y' (line 19). Traversing along paths of ranknearest successors in WCII differs from Didier's Algorithm by the fact that a position in T_{S }may be visited by the same path several times on different levels.
For any given minrank interval there cannot be more than one weak common interval partner in I_{S }starting at position i. Therefore it is sufficient to track at least one path in each minrank interval to find all mutuallymaximal intervals of I_{S }and T_{S}. Positions in POS are sorted, thus paths leading to the same weak common interval pair appear adjacent to each other in LIST and the common interval pair is reported only for the first (lines 1517).
For each node in LIST, associated with character j and position y, the algorithm checks if the minrank interval of j encloses the path bounds up to position y (see condition in line 15). If validated, a weak common interval pair has been found, given by . To ensure mutual closedness, the interval pair is only reported if i − 1 is not contained in the character set and the successor of y is not within the current bounds of its path (see conditions in lines 13 and 15). Checking for the former can be achieved in time after time preprocessing by performing a range minimum query on an array of size where each position containing character i − 1 is assigned 0 and 1 otherwise.
The overall complexity of the algorithm can be summarized as follows: Each position in I_{S }is regarded exactly once as left bound i for all weak common intervals that are reported in one iteration. Once is computed for i = 1 it can be updated using array POS, taking overall time for all left bounds i = 1, . . . , n. Further, for each left bound the algorithm performs steps to precompute all minrank intervals and steps to precompute all ranknearest successors. The subsequent bottomup procedure and the reporting of weak common intervals requires again time. Therefore we have:
Theorem 1 Given two indeterminate strings and , Algorithm WCII finds all pairs of mutuallyclosed weak common intervals of and in time.
Discovering approximate weak common intervals
We now present the algorithm Approximate Weak Common Intervals on Indeterminate Strings (AWCII) as presented in Figure 2, thus line numbers mentioned in this subsection refer to Figure 2. AWCII solves the following problem:
Problem 2 Given two indeterminate strings and and indel threshold , discover all mutuallyclosed approximate weak common intervals of and with no more than indels.
Following a strategy similar to WCII, AWCII solves Problem 2 for index mappings I_{S }and T_{S}, instead of S and T. As before, in each iteration the algorithm maintains a fixed left bound i in I_{S}. For each character j ϵ [i, n] all positions y in T_{S }are processed that contain character j (lines 525). Thereby character j at position y in T_{S }can be associated with several different intervals around y that are candidates of mutuallyclosed approximate weak common interval partners for interval . Only intervals surrounding one (or several) positions y can be mutuallyclosed. However, for an interval containing indels, it no longer holds that the minrank distance of any two positions within the interval is always smaller than the minrank distance from any position inside to any position outside the interval. As a result, neither precomputed minrank intervals nor following paths of ranknearest successors can be used for improving the algorithm's performance. Instead we pursue a different approach, thereby making AWCII an adaptation of the RGC algorithm of Jahn [11].
For each d_{k }= 1,..., δ (lines 723) AWCII identifies the leftmost position k in T_{S }such that at most d_{k }indels are contained in interval and T_{S}[k] ∩ [i, j] ≠ ∅. Let be the observed number of indels in (see line 9), the algorithm then finds for each d_{l }= 1,..., δ − d'_{k }(lines 1421) the rightmost position l such that again T_{S}[l] ∩ [i, j] ≠ ∅ and the number of indels in does not exceed d_{l}. Each (k, l) of the at most (δ + 1)^{2 }combinations of leftmost and rightmost positions gives rise to a candidate pair of mutuallyclosed approximate weak common intervals . For each candidate pair it is checked that the number of characters in [i, j] not contained in plus the already consumed number of indels in does not exceed δ. Finally, it is tested if is closed. If both conditions are satisfied, a mutuallyclosed approximate weak common interval pair is found and is subsequently reported (line 18).
Runtime improvements are achieved by precomputing right and left bounds of candidate intervals for each character j in T_{S}. These bounds are computed within time for a fixed left bound i in I_{S }and stored in tables L and R respectively. Further, for each such candidate interval the number of characters that are within [i, j] can also be precomputed. This number is used to determine δ_{S }in line 16. The construction of the corresponding table, called RANGECONTENT, is achieved within time for a fixed left bound i. The details of constructing tables L, R, and RANGECONTENT can be found in Additional file 1 Section 2. Note that RANGECONTENT differs significantly from the data structure NUM used in RGC to count characters in intervals.
In terms of overall runtime, for each fixed bound i in I_{S }the algorithm spends time on computation of the above mentioned auxiliary tables. Thereafter, AWCII requires time to iterate through all combinations of candidate intervals in L and R and to identify approximate weak common intervals.
Testing for closedness of interval can be achieved in time by precomputing a table for all candidate intervals in T_{S }of the ith iteration, where each entry indicates if a character i − 1 or j + 1 is contained in the corresponding candidate interval. Such a table can be constructed within time using again a simple sweep algorithm. We conclude with the following theorem:
Theorem 2 Given two indeterminate strings and and indel threshold , algorithm AWCII computes all pairs of mutuallyclosed approximate weak common intervals of and in time.
A runtime heuristic for discovering approximate weak common intervals
Our third algorithm, ACSI (see Figure 3) represents a runtime heuristic that solves Problem 2 exactly and in practice outperforms both WCII and AWCII in gene familyfree analysis by orders of magnitude.
Figure 3. ACSI algorithm. ACSI is a runtime heuristic that computes all approximate weak common intervals in indeterminate strings.
Just as the two algorithms before, ACSI operates on index string I_{S }and index mapping T_{S }instead of indeterminate strings S and T directly. For every fixed interval [i, j] in I_{S}, ACSI identifies mutuallyclosed approximate weak common interval partners [k, l] in T_{S}. To this end, it iterates through elements of POS[i], i.e. positions in T_{S }that contain character i (lines 37 of Figure 3). For each such position y ϵ POS[i] the algorithm calls a recursive procedure, denoted EXTEND (line 5). This recursive procedure requires 5 parameters, corresponding to fixed bounds , the currently processed interval [k, l] in T_{S}, and the current number of allowed indels, d. In the initial call, interval is set to and d = δ. EXTEND then increases interval to both sides until [i, j] ∩ T_{S}[k − 1] = ∅ and [i, j] ∩ T_{S}[l + 1] = ∅ (line 10). If in this process the algorithm observes characters i − 1 or j + 1 in , EXTEND returns to the previous call (lines 1113). Otherwise, it verifies if is a mutuallyclosed approximate weak common interval pair by testing if the number of characters in [i, j] that are missing in is less than or equal to the current d and if (line 14). The interval pair is reported if both conditions are validated. EXTEND then increases to the left, thereby consuming indel positions as long as their overall number remains less than or equal to the current d (line 17). If a position k' < k − 1 has been found such that [i, j] ∩ T[k'] ≠ ∅, EXTEND is called recursively with parameter values , , and the remaining number of allowed indels, given by d + k' + 1 − k (lines 1820). This step is also symmetrically executed for the right side of (lines 2124).
The actual heuristic speedup of the algorithm is achieved by calling procedure EXTEND in line 5 not for all intervals [i, j] in I_{S }but only for those that have chances of success for being a weak common intervals pair with an interval [k, l] around a position y ϵ POS[i]. Thus, the neighborhood around position y is scanned for suitable values of j prior to the execution of EXTEND. The details are described in Additional file 1 Section 3.
Results and discussion
In the following, we highlight the benefit of our dynamic model in comparison with present approaches. Although conflicting gene family assignments are extremely common in practice, this scenario is difficult to evaluate. Assuming the existence of an ultimately true gene family assignment, conflicts arise by incorrect gene family assignments. Therefore an evaluation would inevitably result in benchmarking gene family prediction tools, rather than scrutinizing our model.
Instead, we decided to evaluate our gene familyfree model against the traditional gene familybased approach. To this end, we chose a genomic dataset of bacterial genomes that has been used in a prior gene cluster study [8] and was originally obtained from [25]. The dataset features 133 chromosomal sequences, of which we removed all sequences originating from plasmids.
In practice ACSI outperforms both WCII and AWCII as shown by Figure 4. Thus, in all subsequent results, we used ACSI to compute mutuallyclosed (approximate) weak common intervals.
Figure 4. Runtimes of presented algorithms in practice. Running times of ACSI and AWCII with δ = 0 and WCII, measured in a sample of 24 arbitrarily chosen pairwise comparisons of genomes that are contained in the studied dataset. All algorithms produced identical output (as expected). Running times are plotted against the number of pairwise gene similarities (equivalent to the size of ) contained in the pairwise comparison.
Gene familybased dataset. Genes in this dataset are annotated with COG (Clusters of Orthologous Groups) identifiers [12] which are used to establish homology relationships between genes. The set of genes in the dataset was revised by the latest available gene information under the accession numbers of the respective genomes at NCBI. To this end, genes that are meanwhile marked as pseudo genes were removed from the dataset. No genes were added, since COG annotations of new genes are not available. We further omitted all genomes from subsequent analyses of which more than 10 pseudo genes were removed in this process. 93 genomes remained, comprising on average 2726 genes (minimum/ maximum number of genes: 784/8317).
Gene familyfree dataset. Pairwise similarities between genes in the dataset were obtained using the relative reciprocal BLAST score (RRBS) [26]. Genes were compared on the basis of their encoding protein sequence using BLASTP+ [27] with an evalue threshold of 0.1 and disabled compositionbased score adjustments.
For evaluation purposes, we produced different degrees of pruned gene similarity sets by filtering spurious gene similarities. For this, we employed an undirected variant of the stringency criterion (parameterized by f ϵ [0, 1]) for gene similarities proposed in [28], which is described in more detail in Additional file 1 Section 4.1.
To evaluate the gene familyfree model, we ran an implementation of ACSI for δ = 0 on the unpruned gene similarity graph of our dataset and compared the 4015841 interval pairs with respect to the contained COG identifiers. We discarded all pairs for which at least one interval contained less than two genes with a COG identifier. In the remaining 1194036 interval pairs, we observed that the similarity in the set of COG identifiers depends strongly on the intervals' score (Table 1). Among the clusters with a score greater or equal 10, 95% have the same set of identifiers in both intervals. While this number decreases for smaller scores, still a quarter of the interval pairs with a score lower than 1 do not differ in their COG identifiers. This shows that our approach is able to detect gene clusters that would also be detected with wellestablished gene family based approaches.
Table 1. Statistics of overlaps between the COG identifier sets of pairs of weak common intervals.
This is not a surprise, as weak common intervals are in fact a generalization of the classic common intervals model: A run of ACSI on a dataset where similarity scores are only set between members with the same COG identifiers finds the exact same set of clusters as the common intervals based approach.
To evaluate the predictive power of our approach, we compare the output of our program to gene clusters predicted by the reference gene cluster algorithm (RGC) [11]. While this algorithm is capable of multiple genome comparison and the detection of faint conservation patterns, we use it in this context for pairwise genome comparison to detect interval pairs (I_{1},I_{2}) whose gene sets have a symmetric set distance of at most 2. It has been previously observed that the generalization to approximate conservation underlying the reference gene cluster approach is not only a way to find imperfectly conserved clusters, but also a means to add robustness against errors in gene family assignment. For example, an interval pair may appear to have a set distance of two because besides the shared genes, there is one gene unique to I_{1 }and one gene unique to I_{2}. However a closer inspection of the genes reveals that these genes are in fact homologs that were not recognized in the preceding partitioning of genes into homology families. We ran RGC on all pairs of the 93 genomes setting parameters δ = 2 (maximal tolerated symmetric set distance) and s = 3 (the minimum cluster size). The program returned among others 192900 "singlemismatch clusters", i.e. clusters that have exactly one extra gene in each interval. In 47453 (24.60%) of the singlemismatch clusters, we observe a similarity score between the two extra genes in our BLAST dataset. ACSI found 89.84% of the singlemismatch clusters and for 75.24% the extra genes turned out to be pairwise best hits. Moreover we observe that in 18143 among the singlemismatch gene clusters predicted by RGC the two extra genes have exactly the same annotation string. (Annotations containing the word "hypothetical" were ignored.) ACSI finds 90.19% of these clusters. Surprisingly, 4.59% of the singlemismatch clusters in which the two extra genes had best hits to each other were not found by ACSI. This is because for one or more of the other genes in the cluster our BLAST results did not return any similarity score to a gene in the other interval. Apparently the elements of a cluster of orthologous groups can be very faintly related in terms of sequence similarity.
Comparison with RegulonDB data. Among other information about transcriptional regulation, RegulonDB [29] provides a list of operon locations in Escherichia coli K12. While the majority of operons in RegulonDB are computationally predicted, some are also experimentally confirmed. From 2649 operons reported in RegulonDB, 846 span two or more genes. We mapped these operons to the annotation of the E. coli K12 genome in our data set. However, 104 operons contain genes that are not annotated in our dataset and thus were omitted from subsequent analysis. The remaining 742 operons span between 2 and 16 genes, 71.83% of which span 2 or 3 genes. The number of detected gene clusters depends strongly on the degree of evolutionary relatedness between the E. coli K12 genome and other genomes in the dataset. While ACSI and RGC predicted many occurrences in other close related γproteobacteria in our dataset, for the majority of genomes only few occurrences of operons were reported. Additional file 1 Section 4.2, gives an overview of the number of found gene clusters in the dataset. The sets of reported operons found by ACSI and RGC are not entirely overlapping. Instead, ACSI finds operons which RGC does not find and vice versa. A complete overview of unique findings for algorithms and parameter settings is shown in Table 2.
Table 2. Unique findings (with 100% overlap) of operons by ACSI and RGC with minimum cluster size s = 2 and varying parameters.
Conclusions
In this work we introduced a new model to detect gene clusters based on the study of (approximate) weak common intervals in indeterminate strings. In context of gene familyfree analysis, we presented a scoring scheme for (approximate) weak common intervals which rates both interval size and the degree of similarity between the contained genes of an (approximate) weak common interval pair. We use our gene familyfree model to predict gene clusters between pairs of genomes. This approach is evaluated in comparison with the common intervalsbased reference gene cluster model.
In addition to the use case of detecting gene clusters, our algorithms can also be helpful to identify synteneous blocks in a gene familyfree analysis. The hierarchical nature of common intervals is maintained in our weak common intervals model, which makes it ideal for studying potential synteneous blocks of arbitrary resolution. The basic concept of common intervals in strings has seen many generalizations in the past years which have greatly increased its utility for biological studies, in particular the simultaneous consideration of more than two strings, requiring common intervals to occur in all or at least a certain number of them. This generalization of (approximate) weak common intervals in indeterminate strings is undoubtedly an interesting direction for future work.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
All authors were involved in the early conception of the project. DD, KJ and JS developed the methods and designed the analysis. DD and KJ performed the evaluation and wrote the manuscript; all authors discussed the results, commented on the manuscript, and read and approved its final version.
Acknowledgements
DD receives a scholarship from the CLIB Graduate Cluster Industrial Biotechnology. KJ is funded by DFG grant ST 431/51.
Declarations
We acknowledge support for the Article Processing Charge by the German Research Foundation and the Open Access Publication Fund of Bielefeld University Library.
This article has been published as part of BMC Genomics Volume 15 Supplement 6, 2014: Proceedings of the Twelfth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S6.
References

Tamames J, et al.: Evolution of gene order conservation in prokaryotes.

Wolfe KH, Shields DC: Molecular evidence for an ancient duplication of the entire yeast genome.
Nature 1997, 387:708713. PubMed Abstract  Publisher Full Text

Schmidt T, Stoye J: Quadratic time algorithms for finding common intervals in two and more sequences.

Heber S, Mayr R, Stoye J: Common intervals of multiple permutations.
Algorithmica 2011, 60(2):175206. Publisher Full Text

Bergeron A, Corteel S, Raffinot M: The algorithmic of gene teams.

He X, Goldwasser MH: Identifying conserved gene clusters in the presence of homology families.
J Comp Biol 2005, 12(6):638656. Publisher Full Text

Ling X, He X, Xin D: Detecting gene clusters under evolutionary constraint in a large number of genomes.
Bioinformatics 2009, 25(5):571. PubMed Abstract  Publisher Full Text

Rahmann S, Klau GW: Integer linear programs for discovering approximate gene clusters.

Böcker S, Jahn K, Mixtacki J, Stoye J: Computation of median gene clusters.
J Comput Biol 2009, 16(8):10851099. PubMed Abstract  Publisher Full Text

Jahn K: Efficient computation of approximate gene clusters based on reference occurrences.
J Comput Biol 2011, 18(9):12551274. PubMed Abstract  Publisher Full Text

Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA: The COG database: an updated version includes eukaryotes.
BMC Bioinformatics 2003, 4:41. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Powell S, Szklarczyk D, Trachana K, Roth A, Kuhn M, Muller J, Arnold R, Rattei T, Letunic I, Doerks T, Jensen LJ, von Mering C, Bork P: eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges.
Nucleic Acids Res 2012, 40(Database):2849. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Waterhouse RM, Zdobnov EM, Tegenfeldt F, Li J, Kriventseva EV: OrthoDB: the hierarchical catalog of eukaryotic orthologs in 2011.
Nucleic Acids Res 2011, 39(Database):2838. Publisher Full Text

Shi G, Peng MC, Jiang T: MultiMSOAR 2.0: an accurate tool to identify ortholog groups among multiple genomes.
PLoS one 2011, 6(6):20892. Publisher Full Text

Li L, Stoeckert CJ, Roos DS: OrthoMCL: identification of ortholog groups for eukaryotic genomes.
Genome Res 2003, 13(9):21782189. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ostlund G, Schmitt T, Forslund K, Köstler T, Messina DN, Roopra S, Frings O, Sonnhammer ELL: InParanoid 7: new algorithms and tools for eukaryotic orthology analysis.
Nucleic Acids Res 2010, 38(Database):196203. Publisher Full Text

Song N, Sedgewick RD, Durand D: Domain architecture comparison for multidomain homology identification.
J Comput Biol 2007, 14(4):496516. PubMed Abstract  Publisher Full Text

Joseph JM, Durand D: Family classification without domain chaining.
Bioinformatics 2009, 25(12):4553. Publisher Full Text

Frech C, Chen N: Genomewide comparative gene family classification.
PLoS one 2010, 5(10):13409. Publisher Full Text

Liu J, Rost B: Domains, motifs and clusters in the protein universe.
Current Opinion in Chemical Biology 2003, 7(1):511. PubMed Abstract  Publisher Full Text

Uno T, Yagiura M: Fast algorithms to enumerate all common intervals of two permutations.
Algorithmica 2000, 26(2):290309. Publisher Full Text

Didier G, Schmidt T, Stoye J, Tsur D: Character sets of strings.
J Discr Alg 2007, 5(2):330340. Publisher Full Text

Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P: Toward automatic reconstruction of a highly resolved tree of life.
Science 2006, 311(5765):12831287. PubMed Abstract  Publisher Full Text

Pesquita C, Faria D, Bastos H, Ferreira AE, Falcão AO, Couto FM: Metrics for GO based protein semantic similarity: a systematic evaluation.
BMC Bioinformatics 2008, 9(Suppl 5):4. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool.
J Mol Biol 1990, 215(3):403410. PubMed Abstract  Publisher Full Text

Lechner M, Findeiß S, Steiner L, Marz M, Stadler PF, Prohaska SJ: Proteinortho: detection of (co)orthologs in largescale analysis.
BMC Bioinformatics 2011, 12:124. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Salgado H, PeraltaGil M, GamaCastro S, SantosZavaleta A, MuñizRascado L, GarcíaSotelo JS, Weiss V, SolanoLira H, MartínezFlores I, MedinaRivera A, SalgadoOsorio G, AlquiciraHernández S, AlquiciraHernández K, LópezFuentes A, PorrónSotelo L, Huerta AM, BonavidesMartínez C, BalderasMartínez YI, Pannier L, Olvera M, Labastida A, JiménezJacinto V, VegaAlvarado L, Del MoralChávez V, HernándezAlvarez A, Morett E, ColladoVides J: RegulonDB v8.0: omics data sets, evolutionary conservation, regulatory phrases, crossvalidated gold standards and more.
Nucleic Acids Res 2013, 41(Database):20313. PubMed Abstract  Publisher Full Text  PubMed Central Full Text