Abstract
Background
Uncovering the relationship between the conserved chromosomal segments and the functional relatedness of elements within these segments is an important question in computational genomics. We build upon the series of works on gene teams and homology teams.
Results
Our primary contribution is a local slidingwindow SYNS (SYNtenic teamS) algorithm that refines an existing family structure into orthologous subfamilies by analyzing the neighborhoods around the members of a given family with a locally sliding window. The neighborhood analysis is done by computing conserved gene clusters. We evaluate our algorithm on the existing homologous families from the Genolevures database over five genomes of the Hemyascomycete phylum.
Conclusions
The result is an efficient algorithm that works on multiple genomes, considers paralogous copies of genes and is able to uncover orthologous clusters even in distant genomes. Resulting orthologous clusters are comparable to those obtained by manual curation.
Background
Uncovering the relationship between the conserved chromosomal segments and the functional relatedness of elements within these segments is an important question in computational genomics. It is often suggested that regions with similar gene content among different species are evidence for phylogenetic relationship and trace through evolution the inheritance of function from a common ancestor. Within one genome, the presence of large duplicated blocks may be due to the ancient largescale or whole genome duplication, while presence of segments with homologous genes, named conserved gene clusters in multiple genomes more likely indicates an evolutionary constraint for a functionally related group. Our primary contribution is a local slidingwindow algorithm that starts from an existing protein family classification and produces two results: first, concerved gene clusters, and second, a subdivision of families into orhtologous subgroups. Our approach can be seen as using conserved gene clusters in order to sift through the family structure to uncover orthology. We evaluate the biological relevance of our approach on the example of Protoploid yeasts [1].
A number of studies indicate that regions of conserved homology among multiple species may result from functional pressure to keep these genes close, but it may also be conserved because the genomes under study have not sufficiently diverged. For the former, the most well known examples are that of operons in prokaryotes [2], but also the existence of functional interactions [3] and similar expression patterns [4] in closely located genes. For the latter, existence of conserved gene clusters is the computational basis for ancestral genome reconstruction [5] and search for ancestral homologs among genes in the same family [6]. Orthologs are homologous genes related by speciation [7,8] which retain the same functionality as their common ancestors. Homologous genes related by duplication within one lineage are called paralogs and generally differ in functionality [912]. A number of papers introduce algorithms to compute conserved gene clusters and orthologous groups, see for example, [1316]. These approaches vary on a number of parameters. First, there are authors who consider strictly conserved chromosomal segments with similar gene order and orientation [1719]. Second, come the approaches where one considers conserved contiguous regions but without colinearity [13,20]. Third, the authors relax the definition of conserved regions by allowing gaps [18,2124]. Four, paralogous gene copies within one chromosome are allowed in order to explore manytomany homologous relationships [13,25]. Finally, some authors study the effect of varying the gap between adjacent neighbors [24,26,27].
In this paper we start from the notion of gene teams introduced in [28]. This model allows only one copy of a gene on a given chromosome. We relax this restriction by following the approach of homology teams defined in [13]. Furthermore, we set the gap threshold not only for adjacent genes, but by requiring the distance for any two genes considered as being neighbors to be smaller than a certain threshold. A similar choice was made in [20]. We call the obtained gene clusters synteny teams.
Our SYNS (SYNtenic teamS) algorithm refines existing families into orthologous subfamilies, by analyzing the neighborhoods around the members of a given family with a locally sliding window. This is done for all pairs of chromosomes in multiple genomes on which family members appear. The pairwise conserved contiguous segments are agglomerated by relying on a partial homology and biological criteria introduced in [1] between segments. This results in larger conserved segments that we call syntenic zones. We evaluate our algorithm on the existing homologous families for five genomes of the Hemyascomycete yeasts from the Genolevures database [29]. Indeed, there already exists a subclassification of these families into orthologous subfamilies [1] that has undergone expert validation and thus can be used as a reference point for the evaluation of biological relevance of our results. We further illustrate the results of our method for the particular case of the Pdrp (pleiotropic drug resistance proteins) phylogenetic subfamily of ABC transporters that has been manually analyzed in [6].
Methods
In this section we define the notion of unordered conserved gene clusters that allows for paralogous copies and gaps on multiple genomes. Following the work of [20,30,31], we allow one homologous gene to appear more than once in one chromosome. We refine the approach of homology teams [13] by distinguishing between orthologous and paralogous copies of genes. Large syntenic zones are built my merging clusters based on genes common among them instead of directly merging the ordered chains with overlapping families as in [32]. For mathematical notations and examples in a textual format we follow [28].
Definition 1 A chromosome is defined as a pair c = (Σ, G), where Σ = {f_{1}, f_{2}, …, f_{m}} is the set of homologous families and G = (g_{1}, g_{2}, ..., g_{n}) is an ordered sequence of genes. Each gene g_{i} ∈ G is a couple (p_{i}, f_{i}), where p_{i} is the position of gene g_{i}on c and g_{i}belongs to some homologous group f_{i} ∈ Σ.
Here, Σ is the alphabet for any chromosome c and p_{i} is an integer. When it is necessary to indicate to which chromosome belongs a given gene, this is done by a subscript: (p_{i}, f_{i})_{c}.
Definition 2 Given a chromosome c, with two genes g_{i} = (p_{i}, f_{i}) and g_{j} = (p_{j}, f_{j}), the distance between g_{i} and g_{j} is defined by Δ(g_{i}, g_{j}) = p_{i}– p_{j}.
Example 1 Let c_{1}and c_{2}be two chromosomes over the same alphabet Σ = {f_{1}, f_{2}, f_{3}, f_{4}} of homologous families with genes on c_{1}being (1, f_{2}), (2, f_{1}), (4, f_{4}), (7, f_{3}), (8, f_{1}), and on c_{2}being (1, f_{1}), (2, f_{2}), (3, f_{2}), (4, f_{3}), (6, f_{4}). This is denoted by:
c_{1} = 〈f_{2}f_{1}*f_{4}**f_{3}f_{1}〉,
c_{2} = 〈f_{1}f_{2}f_{2}f_{3}*f_{4}〉.
Asterisks stand for genes that are unassigned to homologous groups; notice that * is not part of the alphabet Σ.
A gene subset G′ ⊆ G induces the subset of families Σ′ denoted by Σ(G′) such that f_{i} ∈ Σ′ if and only if there exists g_{i} ∈ G′ such that g_{i} = (p_{i}, f_{i}). A set of genes G′ from the same chromosome, forms a chromosomal segment s = (Σ′, G′, c) with or without gaps. When it is clear from the context, we will assimilate a set of genes G′ with the corresponding chromosomal segment.
For example, in the case of G′ = {(2, f_{1}), (4, f_{4}), (8, f_{1})} and alphabet Σ′ = Σ(G′) = {f_{1}, f_{4}}, G′ defines a chromosomal segment with gaps on c_{1} = 〈f_{2}f_{1} * f_{4} * *f_{3}f_{1}〉. This segment G′ is noncontiguous on c_{1}; the gaps correspond to (5, *), (6, *) and (7, f_{3}).
Definition 3 A chromosomal segment s = (Σ′, G′, c) is contiguous if for any two genes g_{i} = (p_{i}, f_{i}) and g_{j} = (P_{i}, f_{j}) from G′ and any psuch that p_{i} <p <p_{j}, either the gene g = (p, f) at the position p belongs to G' or this position corresponds to an asterisk. Otherwise, the segment is said to be noncontiguous For example, G′ = {(4, f_{4}), (7, f_{3}), (8, f_{1})} on c_{1} = 〈f_{2}f_{1} * f_{4} * *f_{3}f_{1}〉 forms a contiguous segment.
Synteny teams
Two genes g_{i} = (p_{i}, f_{i}) and g_{j} = (p_{j}, f_{j}) on the same chromosome are considered to be neighbors when Δ(g_{i}– g_{j}) <δ for a given threshold δ > 0. For a gene g_{i}, we denote the set of neighbor genes N_{i} to be centered around it, that is N_{i} = {g_{k} = (p_{k}, f_{k})  p_{i} – ⌊δ/2⌋ ≤ p_{k} ≤ p_{i} + ⌊δ/2⌋}.
Definition 4 A chromosomal segment s is called a δ—segment if every pair of genes of s is separated by a distance smaller than δ, that is s = {g_{i}  ∀g_{j} ∈ s, Δ(g_{i}, g_{j}) <δ}. A window w is a contiguous δsegment.
Definition 5 We say that Σ′ ⊆ Σ is a δ—subset if there exists at least one δ—segment s′ = (Σ′, G′, c) such that Σ' = Σ(G'). We say that s' is the witness of this δ—subset.
Example 2 For δ = 3, the δ—subsets on chromosome c_{2} = 〈f_{1}f_{2}f_{2}f_{3} * f_{4}〉 are the following:
 {f_{1}, f_{2}} as witnessed by ((1, f_{1}), (2, f_{2})), ((1, f_{1}), (3, f_{2})), and ((1, f_{1}), (2, f_{2}), (3, f_{2}));
 {f_{2}, f_{3}} as witnessed by ((2, f_{2}), (4, f_{3})), ((3, f_{2}), (4, f_{3})), and ((2, f_{2}), (3, f_{2}), (4, f_{3}));
 {f_{3}, f_{4}} as witnessed by ((4, f_{3}), (6, f_{4})).
Definition 6 Let Σ be the set of homologous families over a set of chromosomes C. We say that Σ′ ⊆ Σ is a δ—cluster if Σ′ is a δ—subset for all chromosomes in some C′ ⊆ C, where C′ ≥ 2. We say that the set of genes
witnesses the δ—cluster Σ′.
A witness S is thus a set of all genes that participate in the segments witnessing the relevant (δsubsets. Let Σ and Σ' to be two (δclusters such that Σ ∩ Σ′ ≠ ∅. Let S and S′ be the corresponding witness sets. Denote by S_{∩} and the sets of genes in each of these witness sets that are members of the families in Σ ∩ Σ′.
Definition 7 A δ—cluster Σ is said to be a (δsynteny if (a) the corresponding witness set S has genes belonging to at least two different chromosomes and (b) there does not exist a δcluster Σ′ with a witness set S′ such that
Example 3 Let c_{1}, c_{2}and c_{3}be chromosomes as shown in figure1.
Figure 1. Example of δclusters and δsyntenies The nontrivial δ—syntenies for the example 3 that cover c_{1}, c_{2} and c_{3} are {f_{4}, f_{5}}, {f_{4}, f_{1}} and {f_{5}, f_{1}}; that cover c_{1} and c_{2} are {f_{1}, f_{2}} and {f_{4}, f_{5}, f_{1}}. Colors indicate homology relationships. Connections indicate the relevant δ—clusters.
c_{1} = 〈f_{3} * *f_{5}f_{4}f_{1} * f_{2}* f_{5}f_{4}〉
c_{2} = 〈f_{1}f_{2} * *f_{3}*f_{4}f_{5}f_{1} * f_{5}〉
c_{3} = 〈f_{2} * f_{3} **f_{5}*f_{1}f_{4} * f_{5}〉
Let (δ = 3. We obtain the following nontrivial δ—clusters: {f_{4}, f_{1}}, {f_{5}, f_{1}}, {f_{4}, f_{5}, f_{1}}, {f_{1}, f_{2}} and {f_{4}, f_{5}} between c_{1}and c_{2}; and {f_{1}, f_{5}}, {f_{1}, f_{4}} and {f_{4}, f_{5}} between c_{1}and c_{3}. The nontrivial δsyntenies are {f_{4}, f_{5}}, {f_{1}, f_{2}}, {f_{4}, f_{1}}, {f_{5}, f_{1}} and {f_{4}, f_{5}, f_{1}}.
The superset inclusion in definition 7 implies that for the computational purposes there is no need to consider the smaller of the two sets and thus causes merging of the syntenies if the witness of one synteny is a complete subset of another in our algorithm.
Example 4 Let c_{1}, c_{2}and c_{3}be three chromosomes in figure2.
Figure 2. Merging of δsyntenies Merging among δsyntenies for chromosomes in the example 4: when considering f_{8}, δcluster {f_{8}, f_{9}} is merged in the <5synteny {f_{7}, f_{8}, f_{9}}. Colors indicate the homology relationship. Connections indicate the relevant δ—clusters (only those relevant for merging are shown).
c_{1} = 〈f_{1} * *f_{4}* f_{6}f_{6}f_{7}f_{8}f_{9}〉
c_{2} = 〈*f_{5}* f_{3}f_{6}* f_{2}f_{4}f_{8}f_{7}f_{9}〉
c_{3} = 〈f_{4}f_{8}f_{4}f_{7}f_{8}f_{8}* f_{8}f_{2}〉
Let (δ = 3 and consider f_{8}. Nontrivial δ—clusters are: {f_{7}, f_{8}, f_{9}}, {f_{7}, f_{8}} and {f_{8}, f_{9}} between c_{1}and c_{2}, {f_{7}, f_{8}} between c_{1}and c_{3}and { f_{8}, f_{2}}, {f_{7}, f_{8}} and {f_{8}, f_{4}} between c_{2}and c_{3}. Therefore, we obtain the following nontrivial δ—syntenies: {f_{7}, f_{8}, f_{9}}, {f_{7}, f_{8}}, {f_{8}, f_{2}} and {f_{8}, f_{4}}. Notice that the δcluster {f_{7}, f_{8}, f_{9}} covers witnesses of the δcluster {f_{8}, f_{9}}, but the witnesses of the δcluster {f_{7}, f_{8}} on chromosome c_{3}do not witness the δcluster {f_{7}, f_{8}, f_{9}}. Therefore, we merge the δcluster {f_{8}, f_{9}} in th e δsynteny {f_{7}, f_{8}, f_{9}}; however, {f_{7}, f_{8}} remains as a separate δsynteny.
We have seen that a (δ—synteny must contain the maximal (δ—cluster with respect to subset inclusion. All (δ—syntenies for a set of chromosomes C, with C >= 2 are included in the result. Such a synteny set is informally called a synteny team following the terminology introduced in [28,32] for gene teams.
Definition 8 Given a δ—synteny teamwe say that Σ_{i}and Σ_{j}are transitively connected if the witnesses S_{i} and S_{j} overlap, that is S_{i} ∩ S_{j} ≥ 1. We further define a δzone as a union of transitively connected δsyntenies Σ_{i}and Σ_{j}.
Example 5 Consider C = {c_{1}, c_{2}, c_{3}} from example 4 and δ = 3. Suppose that we compute clusters in the neighborhood of f_{8}. Nontrivial syntenies are the following: Σ_{1} = {f_{7}, f_{8}} for witness S_{1} = {(8, f_{7})_{c1}, (9, f_{8})_{c1}, (9, f_{8})_{c2}, (10, f_{7})_{c2}, (2, f_{8})_{c3} (4, f_{7})_{c3}, (5, f_{8})_{c3}, (6, f_{8})_{c3}} and Σ_{2} = {f_{4}, f_{8}} for witness S_{2} = {(8, f_{4})_{c2}, (9, f_{8})_{c2}, (1, f_{4})_{c3}, (3, f_{4})_{c3}, (2, f_{8})_{c3}}. Notice, that Σ_{1} ∩ Σ_{2} = {f_{8}} ≠ ∅and S_{1} ∩ S_{2} = {(9, c_{2}, f_{8}), (2, c_{3}, f_{8})} ≠ ∅. We obtain one nontrivial δ—zone {f_{4}, f_{8}, f_{7}} by agglomerating δ—syntenies Σ_{1}and Σ_{2}based on the transitivity (see figure3). Notice that this leaves the gene (8, f_{8})_{c3}, out of the δzone. The transitivity relationship in the SYNS algorithm combines each pair of two δ—syntenies sharing at least one witness into one δzone. The notion of a δ—zone aims at uncovering even distant evolutionary relationships based on conservation of gene content within neighborhoods. It is slightly amended based on the following two considerations.
Figure 3. Agglomeration of δsyntenies in δzones Example of δsyntenies’ agglomeration by transitivity (see example 5): δclusters Σ_{1} = {f_{7}, f_{8}} and Σ_{2} = {f_{4}, f_{8}} are witnessed by gene sets indicated by ✰ and • symbols, respectively. They are merged in one δ—zone {f_{4}, f_{8}, f_{7}} based on witness intersection {(9, f_{8})_{c2}, (2, f_{8})_{c3}}.
1. Several paralogous genes may exist on the same chromosome. When two or more paralogs appear within one window of size δ, we include them in the same witness set of a δsynteny since it is not possible to computationally distinguish between them.
2. It may happen that two distinct δsyntenies share only one paralogous gene. This is what we call a weak bond. Creating a δzone based on a single gene intersection may either lead to a δzone that is phylogenetically valid or may create an erroneous result (see [6]).
Definition 9 Given a δ—synteny teamand its witness set S = {S_{i}, S_{j}} we say that S forms a weak bond if S_{i} ∩ S_{j} = 1. We further define g = S_{i} ∩ S_{j} to be the witness of a weak bond.
The δ—zone {Σ_{i}, Σ_{j}} resulting from a weak bond may be erroneous. We rely on phylogeny to solve this issue. We consider a total order over all the species under study defined by phylogeny: a ≺ b if species b has diverged from the common ancestor earlier than species a (≺ corresponds then to the relative speciation time). When no other witness from a other than g exists, we split the erroneously obtained synteny in two parts: one that contains the orthology relationships within a given family f and another one that keeps the supposed paralogs. The details of how this is done are presented in Results section.
Definition 10 Letbe a δ— synteny team over the witness set S = {S_{i}, S_{j}} such that S_{i} > S_{j} and let g = S_{i} ∩ S_{j} be the witness of a weak bond. If g is from the biggest species according to ≺ in S_{j}, we say that S_{i}witnesses a maximal orthologous (δsynteny Σ_{i}and S'_{j} = S_{j} \ g witnesses a paralogous δsynteny Σ_{j}.
Example 6 Consider C′ = {c_{2}, c_{3}} from example 4 and figure4supposing that c_{3} ≺ c_{2}and consider neighborhoods around f_{8}with (δ = 3. Two nontrivial δsyntenies are connected by a weak bond: Σ_{1} = {f_{8}, f_{2}} with witness S_{1} = {(8, f_{2})_{c2}, (9, f_{8})_{c2}, (8, f_{8})_{c3}, (9, f_{2})_{c3}} and Σ_{2} = { f_{4}, f_{8}} with witness S_{2} = {(7, f_{4})_{c2}, (9, f_{8})_{c2}, (1, f_{4})_{c3}, (2, f_{8})_{c3} (3, f_{4})_{c3}, (5, f_{8})_{c3}}. Indeed, {(9, f_{8})_{c2}} is the witness of this weak bond. Since c_{3} ≺ c_{2}, then Σ_{2}is the maximal orthologous δsynteny with witness S_{2}, while Σ_{1}is the one with the paralogous copy of f_{8} (at position 9 on c_{2}). The set S_{1}becomes Members of a family are split into an orthologous and paralogous subsets present in different syntenies. At the end of our procedure, only the largest orthologous (δzone and the nonintersecting paralogous (δzones covering any given homologous family remain in the result.
Figure 4. Example of a weak bond Example of a weak bond among δsyntenies considering family f_{8}: {(9, f_{8})_{c2}} is the witness of a weak bond between the δsyntenies {f_{8}, f_{2}} and {f_{4}, f_{8}}. Colors indicate the homology relationship. Connections indicate the relevant δ—clusters. Crossed box indicates the witness of a weak bond.
Syntenic TeamS algorithm
In this section, we present the SYntenic TeamS (SYNS) algorithm which computes δ—zones in multiple genomes. In previous work gene teams between two chromosomes of size m and n are computed by an O(m + n)log^{2}(m + n) algorithm consisdering only onetoone homologous relationships [32]. The approach by [20] solves the ordered gene clusters problem by proposing a directed acyclic graph model and an NPhard longest path solution; results contain maximal but also nonmaximal orthologous clusters. Our approach relies on the same slidingwindow general approach as in [20]. However, we gain in time efficiency by limiting the sliding of the window only around positions of family members. Given a set of families Σ and a predefined window size δ, we examine neighborhoods of each family f ∈ Σ in all chromosomes. For all genes of f including paralogous copies, we consider a neighborhood from –δ to +δ around them. This neighborhood is examined by a sliding window of size δ and we form sets of genes corresponding to families in a given window position. These sets are intersected to look for common gene content if they belong to different chromosomes. The intersections define synteny conservation within the family neighborhoods by using definitions in Methods section. We further look for transitivity among δ—syntenies and build (δzones. To do this, we search for overlaps among witnesses of δ—clusters. If the witness intersection size is > 1 then the δ— syntenies are agglomerated to form one δ—zone. Three different cases corresponding to phylogenetic topologies shown in figure 5 are considered for solving the weak bond problem. Let S_{i} and S_{j} to be the two witnesses connected by a weak bond, we sort the genes of these witnesses according to the ≺ order of speciation. If the witness of a weak bond occurs in the biggest species according to ≺ or if there is no any other witness from a bigger species, then we consider that (cases A and B in figure 5) the two clusters define a valid (δzone. Case C in figure 5 shows the situation where forming a (δzone can not be justified from the evolutionary perspective. For cases A and B we continue to search for paralogous gene clusters. We gather all maximal δ—zones in the final result.
Figure 5. Three topologies for a weak bond Examples of topologies where two δsyntenies Σ_{i} and Σ_{j} with witness sets S_{i} and S_{j} have a weak bond. Species are ordered by phylogenetic order Sp_{i} ≺ Sp_{i}_{+1}. Cases A (g_{1} is the witness of weak bond) and B (g_{2} is the witness of weak bond) are considered to be plausible from the evolutionary perspective, while case C (g_{2} is the witness of weak bond) is difficult to explain. Different colors represent the orthologous and paralogous δsyntenies emerging from these cases. Vertical links represent synteny while horizontal arrows represent the weak bond.
Time complexity
Table 1 shows the comparative time complexity analysis of our approach and other existing ortholog detection algorithms for the cases where such information is available. In the SYNS algorithm, we consider that one homologous family f may appear in at most c × t locations in all genomes, where c is the total number of chromosomes and t is the maximal number of paralogous copies. Given that we explore neighborhoods of size 2 × δ + 1, the number of all windows of size δ for f is (δ + 1) × c × t. The computation of all witnesses for a given family takes O(((δ + 1) × c × t)^{2}). If in this computation all the possible intersections are nonempty, then in the worst case scenario we obtain for f the set of (δclusters of size ((δ +1) × c × t)^{2}. Which implies that the (δsynteny computation takes O(((δ + 1) × c × t)^{4}); which is repeated for all families f ∊ Σ.
Table 1. Comparison of time complexity of OrthoMCL, GCFinder and SYNS All experiments have been run on the dualcore Intel Xeon 2.33 GHz server. Results are also available for MultiParanoid (approx. 2 hours run time) and CoCoCL (approx 3 hours run time) for which no time complexity is found in literature.
Evaluation of results
The Genolevures database provides families of proteins across the phylum of hemyascomycetous yeasts. To evaluate the performance of our algorithm, we have executed it on the existing families from the Genolevures Release 3 Candiate 3 (20080924) [33], [29] with 4949 families covering 25196 protein coding genes from five protoploid yeast species [1].
Comparison with other methods
The critical windowsize parameter δ of SYNS was set to 7 for all experiments. This value was obtained in order to match our results with the previously defined and expert validated orthologous subgroups [1]. We have compared the orthologous groups obtained by SYNS on the yeast data to those obtained by the following methods: Cococl [34], MultiParanoid [35] and OrthoMCL [36]. Table 2 shows the numbers of orthologous groups classified by these methods. OrthoMCL [36] was run with default inflation index= 1.5, evalue cutoff= –5 and percent match cutoff = 50 values starting from input fasta files. Cococl was run recursively starting with fasta sequences with boostrap threshold score= 1 and split score= 0.5 and using ClustalW for multiple sequence alignment. Multiparanoid was run using default parameters (no cutoff and no duplicate appearance of gene in clusters), using BLOSUM62 matrix for Blast alignments. Table 2 shows the total number of classified proteins and the total number of orthologous groups detected by SYNS and these algorithms using the original Genolevures families as a baseline [33]. In comparison with the SONS method, the SYNS classifies a comparable number of proteins, but generates more orthologous groups, implying that these groups are more finegrained.
We compare the orthologous groups between the SYNS method and those obtained by other algorithms in table 3. To compare two classifications we first look at how many groups are identical between two methods (Id column) and compute the similarity value (between 0 and 1) over the intersection of the covered protein sets (for definition see [33]). Second, we analyze the differences between two classifications. For these we report the number of proteins that are classified only by the SYNS (SYNS column) when compared to those only classified a given method (meth. column). The remaining differences are classified according to granularity: a split when a group obtained by a given method is split into multiple subgroups by the SYNS algorithm, a merge in the opposite case, and messy when the split/merge relationship is complicated. We further analyze the differences with respect to SONS classification case by case (available at http://www.cbib.ubordeaux2.fr/redmine/projects/syns/files webcite). We have found that in the case of splits between the resulting groups (50 groups in table 3, the more finegrained groups obtained by the SYNS algorithm are more functionally relevant in general. For the cases of merges (141 groups) and messy events (70 groups) there is no clearcut qualitative difference. However, for these 211 cases more functionally plausible groups can be obtained by SYNS when using a smaller window size δ = 5. Overall, SYNS method appears to be the best match with the curated SONS results [1], while relying on a clear mathematical definitions and having satisfactory running time.
Table 2. Comparisons of SYNS and other classifications with the existing family structure as baseline
Table 3. Comparison of different computations of orthologous clusters with SYNS results on the Genolevures data Each line compares a given method with the SYNS; we report the number of genes classified only in the given method (meth), only by the SYNS algorithm (SYNS), the similarity value (sim) between two cluster sets (varying between 0 and 1 as defined in [33]), the number of genes that appear as singletons, the number of splits and merges between two cluster sets as well as the number of unclassifiable cases (messy).
Analysis of two protein families
We illustrate the functional relevance the SYNS algorithm by considering the classification of Pdrp (pleiotropic drug resistance transporter proteins) subfamily performed in [6]. This is a subset of the PDR proteins from the GL3C0025 (total 60 proteins) Genolevures family. We compare this manual analysis with the results obtained automatically by SONS and SYNS algorithms.
Seven SONS, six SYNS and seven groups obtained by manual curation provide hypothethis on the evolution of this protein family. The manually curated orthologous groups are confirmed by gene cluster analysis. But in some cases the results differ. Groups P_{1} through P_{4} in table 4 denote four orthologous groups over five species annotated in [6] according to their S. cerevisiae members, namely Pdr12p group (P_{3}, 5 members), Snq2p group (P_{1} + P_{2}, 5+4=9 members) and Pdrp5p/15p group (P_{4}, 3 members). Groups P_{5} through P_{7} in table 4 contain genes whose relationship to Pdr5p/15p is based on phylogenetic evidence only [6]. Three tandem gene repeats appear in ERGO (Eremothecium gossypii), KLLA (Kluyveromyces lactis) and SAKL (Saccharomyces kluyveri) and are found in a similar neighborhood [6] in groups P_{1} and P_{2}.
Comparatively to the SONS classification, our approach proposes a more conservative classification for these proteins into orthologous groups. Indeed, SONS exclude ZYRO0D17710g from the Snq2/YNR070w phylogenetic cluster, while regrouping the remaining proteins belonging to P_{1} and P_{2}. Moreover, according to [6], SAKL0F04312g belongs to the Aus1p/Pdr11p group which has no shared neighborhood in preWGD five species. Thus, it is not surprising that this gene is missing in the SYNS classification (SONS algorithm classifies it in an independent group, not shown in table 4).
A similar analysis is done for the GL3C0026 family that has 57 members and four different functionally annotated groups. Figure 6 illustrates the evolutionary pattern based on the combination of phylogenetic analysis and functional annotations of this family. SONS algorithm produces 7 orthologous gene clusters, while SYNS generates 8 clusters functionally more relevant. Both SONS and SYNS successfully classify the Lornithine transaminase (OTAse) group (with the S. cerevisiae member YLR438w CAR2). However, SONS classification fails to distinguish the YGR019w UGA1 Gammaaminbutyrate (GABA) transaminase group from the YNR058w aminopelargonic acid aminotransferase (DAPA) group. On the contrary, SYNS method separates the cluster having the YGR019w UGA1 gene according to its functional anotation. Our algorithm also succeeds to correctly distinguish the single orthog gene clusters from the YGR019w UGA1 group. For the YOL140w ARG8 Acetylornithine aminotransferase group, both SONS and SYNS algorithms provide similar conserved gene clusters. However, SONS erroneously mixes some genes of this group with YGR019w UGA1 cluster and YNR058w BIO3 cluster, whereas SYNS algorithm succeeds to distinguish them. The combined functional annotations and neighborhood analysis support the evolutionary pattern illustrated in figure 6 for the GL3C0026 family. Therefore we can conclude that the final δzones in our algorithm may preserve a functionally meaningful conserved gene clusters.
Table 4. Comparisons of orthologous clusters subdividing the Pdrp Genolevures family The Pdrp Genolevures family GL30025 as analysed by a) SONS results b) SYNS results c) after manual curation. The comparisons have been performed over the same sets of genes as in figure 3 in [6] for the Pdrp ”sensu stricto” proteins subset of the GL3C0025 family.
Figure 6. Analysis of the Pdrp family Relationships between the 57 members of GL3C0026 family based on their functional annotations. Each line lists genes from one species (indicated on the left); each box represents one gene. For example line ZYRO, first box on the left A11990 stands for ZYROA11990 gene. The numbers below the boxes represent the relative gene order (position) on the chromosomes. Genes with similar functional annotations are connected using the same color.
Conclusion
The double goal of this study is to identify locally conserved gene clusters and to use them in order to subdivide an existing family structure into orthologous groups. To this end, we define a model for unordered local synteny and propose an algorithm to identify conserved gene clusters and their division into orthologous and paralogous clusters among multiple genomes. To validate our approach we have executed our method for the five Hemyascomycetous yeasts and genomes and examined the conserved nonoverlapping gene clusters that arise from each homologous family of Genolevures database [29]. Our approach shows 99% protein coverage for existing homologous groups.
We perform similar comparisons with the existing SONS groups [6] over the Genolevures families. The 90% similarity between our approach and SONS groups indicates that our automatic method comes close to the manually curated results, especially since part of the differences between these groups can be explained by the nonclassification of the paralogous conserved gene clusters by SONS. This confirms the pertinence of our definition of conserved neighborhoods based on transitivity and phylogenetic constraints that make it possible to include tandem repeats as well as loss, fusions or transpositions of gene copies in chromosomal rearrangements of genomes. The SYNS method makes it possible to distinguish between orthologous and paralogous conserved gene clusters and thus makes it possible to include tandem repeats as well as loss, fusions or transpositions of gene copies in chromosomal rearrangements of genomes. This implies that the proposed sliding window and partial traversal approach, efficiently produces biologically relevant conserved gene clusters and corresponding orthologous groups with O(n((δ + 1) × c × t)^{4}) worstcase complexity, for a predefined window size δ.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
Conceived and designed the experiments: AS, MN. Performed the experiments and analyzed the data: AS. Wrote the paper: AS, HS, MN.
Acknowledgements
The authors would like to thank Pascal Durrens for constructive discussions and help with the analysis of biological relevance of results.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 9, 2011: Proceedings of the Ninth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/12?issue=S9.
References

Consortium G: Comparative genomics of protoploid Saccharomycetaceae.
Genome Res 2009, 19(10):1696709. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ermolaeva M: Operon finding in bacteria.
Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics 2005, 28862891.

Snel B, Bork P, Huynen M: The identification of functional modules from the genomic association of gene.
Proc Natl Acad Sci USA 2002, 99(9):58905895. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Dandekar T, Snel B, Huynen M, Bork P: Conservation of gene order: A fingerprint of proteins that physically interact.
Trends Biochem. Sci. 1998, 23(9):324328. PubMed Abstract  Publisher Full Text

Bergeron A, Blanchette M, Chateau A, Chauve C: Reconstructing Ancestral Gene Orders Using Conserved Intervals. In WABI. Volume 3240. Edited by Jonassen I KJ, Lecture Notes in Computer Science. Springer; 2004::1425.

Seret ML, Diffels JF, Goffeau A, Baret PV: Combined phylogeny and neighborhood analysis of the evolution of the ABC transporters conferring multiple drug resistance in hemiascomycete yeasts.
BMC genomics 2009, 10(459):111. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Fitch WM: Distinguishing homologous from analogous proteins.
Syst. Zool 1970, 19:99113. PubMed Abstract  Publisher Full Text

Fitch W: Homology a personal view on some of the problems.
Trends Genet 2000, 16:227231. PubMed Abstract  Publisher Full Text

Kondrashov FA, Rogozin IB, Wolf YI, Koonin EV: Selection in the evolution of gene duplications.
Genome Biol 2002, 3:RESEARCH0008. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lynch M, Conery JS: The evolutionary fate and consequences of duplicate genes.
Science 2000, 290:11511155. PubMed Abstract  Publisher Full Text

Lynch M, Force A: The probability of duplicated gene preseration by subfunctionalization.
Genetics 2000, 154:459473. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ohno S: Evolution be gene duplication. New York: Springer; 1970.

He X, Goldwasser MH: Identifying conserved gene clusters in the presence of homology families.
Journal of computational biology 2005, 12(6):638656. PubMed Abstract  Publisher Full Text

Bansal AK: An automated comparative analysis of 17 complete microbial genomes.
Bioinformatics 1999, 15:900908. PubMed Abstract  Publisher Full Text

Goldberg D, McCouch S, Kleinberg J: Algorithms for constructing comparative maps. In Comparative Genomics. Edited by Shankoff D, Nadeau JH. NL: Kluwer Academic Press; 2000:281294.

Housworth EA, Postlethwait J: Measures of synteny conservation between species pairs.
Genetics 2002, 162:441448. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Nadeau JH, Shankoff D: Counting on comparative maps.
Trends Genet 1998, 14(12):495501. PubMed Abstract  Publisher Full Text

Tamames J: Evolution of gene order conservation in prokaryotes.

Wolfe KH, Shields DC: Molecular evidence for an ancient duplication of the entire yeast genome.
Nature 1997, 387:708713. PubMed Abstract  Publisher Full Text

Yang Q, Yi G, Zhang F, Thon MR, Sze SH: Identifying gene clusters within localized regions in multiple genomes.
Journal of Computational Biology 2010, 17(5):657668. PubMed Abstract  Publisher Full Text

Overbeek R, Fonstein M, D’Souza M, Pusch GD, Maltsev N: The use of gene clusters to infer functional coupling.
PNAS 1999, 96:28962901. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Vandepoele K, Saeys Y, Simillion C, Raes J, Peer YVD: The automatic detection of homologous regions (ADHoRe) and its application to microcolinearity between arabidopsis and rice.
Genome Research 2002, 12(11):17921801. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Vision TJ, Brown DG, Tanksley SD: The origins of genomic duplications in Arabidopsis.
Science 2000, 290:21142117. PubMed Abstract  Publisher Full Text

Hoberman R, Sankoff D, Durand D: The Statistical Significance of MaxGap Clusters. In Comparative Genomics. Volume 3388. Edited by Lecture Notes in Computer Science. Edited by Lagergren J. Springer Berlin / Heidelberg; 2005::5571. Publisher Full Text

Parida L: Gapped permutation pattern discovery for gene order comparisons.
J. Comput. Biol. 2007, 14:4555. PubMed Abstract  Publisher Full Text

Heber S, Stoye J: Algorithms for finding gene clusters.
Lect. notes Comput. Sci. 2001, 2149:252263. Publisher Full Text

Kim S, Choi JH, Yang J: Gene teams with relaxed proximity constraint.

Bergeron A, Corteel S, Raffinot M: The algorithmic of gene teams. In Proc. 2nd Annual Workshop on Algorithms in Bioinformatics (WABI), Volume 2452 of Lectures Notes in Computer Science. New York: SpringerVerlag; 2002:464476.

Sherman DJ, Martin T, Nikolski M, Cayla C, Souciet J, Durrens P: Genolevures: protein families and synteny among complete hemiascomycetous yeast proteomes and genomes.
Nucleic Acids Research 2009, 37(Database issue):D550D554. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Didier G: Common intervals of two sequences.
Lect. Notes Comput. Sci 2003, 2812:1724. Publisher Full Text

Schmidt T, Stoye J: Quadratic time algorithms for finding common intervals in two or more sequences.
Lect. Notes Comput. Sci 2004, 3109:347358. Publisher Full Text

Beal MP, Bergeron A, Corteel S, Raffinot M: An algorithmic view of gene teams.
Theoret. Comput. Sci 2004, 320(23):395418. Publisher Full Text

Nikolski M, Sherman D: Family relationships: should consensus reign?  consensus clustering for protein families.
Bioinformatics 2007, 23(2):e71e76. PubMed Abstract  Publisher Full Text

Jothi R, Zotenko E, Tasneem A, Przytycka TM: COCOCL: hierarchical clustering of homology relations based on evolutionary correlations.
Bioinformatics 2006, 22(7):779788. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Alexeyenko A, Tamas I, Liu G, Sonnhammer E: Automatic clustering of orthologs and inparalogs shared by multiple proteomes.
Bioinformatics 2006, 22(14):e9e15. PubMed Abstract  Publisher Full Text

Li L, Stoeckert C, Roos D: OrthoMCL: identification of ortholog groups for eukaryotic genomes.
Genome Res 2003, 13(9):217889. PubMed Abstract  Publisher Full Text  PubMed Central Full Text