<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
	<ui>gb-2004-5-5-r32</ui>
	<ji>GBJ</ji>
	<fm>
		<dochead>Research</dochead>
		<bibl>
			<title>
				<p>Detection of evolutionarily stable fragments of cellular pathways by hierarchical clustering of phyletic patterns</p>
			</title>
			<aug>
				<au id="A1" ca="yes">
					<snm>Glazko</snm>
					<mi>V</mi>
					<fnm>Galina</fnm>
					<insr iid="I1"/>
					<email>gvg@stowers-institute.org</email>
				</au>
				<au id="A2">
					<snm>Mushegian</snm>
					<mi>R</mi>
					<fnm>Arcady</fnm>
					<insr iid="I1"/>
					<insr iid="I2"/>
				</au>
			</aug>
			<insg>
				<ins id="I1">
					<p>Stowers Institute for Medical Research, 1000 E 50th Street, Kansas City, MO 64110, USA</p>
				</ins>
				<ins id="I2">
					<p>Department of Microbiology, Molecular Genetics, and Immunology, University of Kansas Medical Center, Kansas City, KS 66160, USA</p>
				</ins>
			</insg>
			<source>Genome Biology</source>
			<issn>1465-6906</issn>
			<pubdate>2004</pubdate>
			<volume>5</volume>
			<issue>5</issue>
			<fpage>R32</fpage>
			<url>http://genomebiology.com/2004/5/5/R32</url>
			<xrefbib>
				<pubid idtype="pmpid">15128446</pubid>
			</xrefbib>
		</bibl>
		<history>
			<rec>
				<date>
					<day>11</day>
					<month>11</month>
					<year>2003</year>
				</date>
			</rec>
			<revrec>
				<date>
					<day>19</day>
					<month>2</month>
					<year>2004</year>
				</date>
			</revrec>
			<acc>
				<date>
					<day>31</day>
					<month>3</month>
					<year>2004</year>
				</date>
			</acc>
			<pub>
				<date>
					<day>27</day>
					<month>4</month>
					<year>2004</year>
				</date>
			</pub>
		</history>
		<cpyrt>
			<year>2004</year>
			<collab>Glazko et al.; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.</collab>
		</cpyrt>
		<shorttitle>
			<p>Detection of evolutionarily stable fragments of cellular pathways by hierarchical clustering of phyletic patterns</p>
		</shorttitle>
		<shortabs>
			<p>A hierarchy of 3,688 phyletic patterns, encompassing more than 5,000 known protein-coding genes from 66 complete microbial genomes, was characterized. The results indicate that gene loss and displacement have occurred in the evolution of most pathways.</p>
		</shortabs>
		<abs>
			<sec>
				<st>
					<p>Abstract</p>
				</st>
				<sec>
					<st>
						<p>Background</p>
					</st>
					<p>Phyletic patterns denote the presence and absence of orthologous genes in completely sequenced genomes and are used to infer functional links between genes, on the assumption that genes involved in the same pathway or functional system are co-inherited by the same set of genomes. However, this basic premise has not been quantitatively tested, and the limits of applicability of the phyletic-pattern method remain unknown.</p>
				</sec>
				<sec>
					<st>
						<p>Results</p>
					</st>
					<p>We characterized a hierarchy of 3,688 phyletic patterns encompassing more than 5,000 known protein-coding genes from 66 complete microbial genomes, using different distances, clustering algorithms, and measures of cluster quality. The most sensitive set of parameters recovered 223 clusters, each consisting of genes that belong to the same metabolic pathway or functional system. Fifty-six clusters included unexpected genes with plausible functional links to the rest of the cluster. Only a small percentage of known pathways and multiprotein complexes are co-inherited as one cluster; most are split into many clusters, indicating that gene loss and displacement have occurred in the evolution of most pathways.</p>
				</sec>
				<sec>
					<st>
						<p>Conclusions</p>
					</st>
					<p>Phyletic patterns of functionally linked genes are perturbed by differential gains, losses and displacements of orthologous genes in different species, reflecting the high plasticity of microbial genomes. Groups of genes that are co-inherited can, however, be recovered by hierarchical clustering, and may represent elementary functional modules of cellular metabolism. The phyletic patterns approach alone can confidently predict the functional linkages for about 24% of the entire data set.</p>
				</sec>
			</sec>
		</abs>
	</fm>
	<meta>
		<classifications>
			<classification type="BMC" subtype="man_spc_id" id="30010002">Bioinformatics</classification>
			<classification type="BMC" subtype="man_spc_id" id="30010001">Biochemistry and structural biology</classification>
			<classification type="BMC" subtype="man_spc_id" id="30010010">Genome studies</classification>
			<classification type="BMC" subtype="man_spc_id" id="30010014">Microbiology and parasitology</classification>
		</classifications>
	</meta>
	<bdy>
		<sec>
			<st>
				<p>Background</p>
			</st>
			<p>Completely sequenced genomes and their gene repertoires are an important resource for studying biological evolution and cellular function. A crucial step in genome analysis, and the foundation of evolutionary and metabolic reconstructions, is the determination of orthologous relationships between genes in different genomes <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. In 1997, Tatusov, Koonin and Lipman combined orthologs and their lineage-specific duplicates into clusters of orthologous groups (COGs) and proposed the first practical algorithm for finding orthologs on a large scale <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>. They introduced phyletic patterns as a representation of the distribution of COGs across genomes, useful for tracking evolutionary events such as vertical gene inheritance, gene loss and horizontal transfer.</p>
			<p>Pellegrini and co-workers <abbrgrp><abbr bid="B3">3</abbr></abbrgrp> emphasized the idea that phyletic patterns can also be used as a post-homology method of predicting protein function, on the premise that genes/COGs encoding functionally linked proteins are co-inherited (simultaneously present or simultaneously absent) in the same subsets of genomes. A functional link between two proteins can be understood either as physical interaction between them, or, more broadly, as their involvement in the same metabolic pathway or functional system. Phyletic patterns are coded as strings of bits that stand for the presence or absence of homologs in different genomes. It has been proposed that a Hamming distance of 3 bits or less between phyletic patterns is a useful similarity threshold for detecting functionally linked genes <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. <it>Ad hoc </it>application of the method produced several experimentally validated predictions, such as a novel type of isopentenyl pyrophosphate isomerase in archaea and some bacteria <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr></abbrgrp>, several participants in the 2-C-methyl-<smcaps>D</smcaps>-erythritol-4-phosphate (MEP) pathway of isoprenoid biosynthesis in bacteria and plants <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>, and new components of the queuosine biosynthesis pathway in Gram-positive bacteria <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>.</p>
			<p>Even with complete genome sequencing and high-throughput determination of gene function, many central metabolic pathways remain only partially characterized. The candidate genes filling the 'missing' steps are sought, and phyletic patterns may be used to identify many more such candidates. In practice, this approach is usually combined with other homology and post-homology methods, such as measurement of gene coexpression, prediction of coexpression from operon structure, and identification of multidomain fusions <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr></abbrgrp>. We do not know how many functional connections between genes/COGs can be inferred solely from their co-inheritance. On a more general note, co-occurrence of genes in genomes is one measure of their association in gene networks, and quantification of this association is needed for any system-wide study of gene function and evolution.</p>
			<p>To utilize fully the information offered by phyletic patterns, and to understand their limitations, we seek a better understanding of general properties of patterns and distances between them. A possible limitation of the phyletic-pattern method is that lineage-specific gains and losses of genes, thought to be pervasive in microbial evolution <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>, will corrupt the similarity, increasing distance between functionally linked genes. One example of a pathway teeming with differential gains and losses is the tricarboxylic acid (TCA) cycle, which is present in its 'full' (that is, <it>E. coli</it>-like) form in only a few species, mostly within the proteobacterial clade, but is rearranged in other microbial lineages, presumably in connection with adaptation to changes in the redox status of the environment (Figure <figr fid="F1">1</figr> and <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>).</p>
			<fig id="F1">
				<title>
					<p>Figure 1</p>
				</title>
				<caption>
					<p>Phyletic patterns are corrupted by gene gains and losses</p>
				</caption>
				<text>
					<p>Phyletic patterns are corrupted by gene gains and losses. The consensus phylogenetic tree on top is the species' tree based on genomic content <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. Small black and white squares indicate, respectively, presences and absences of genes in each species. (<b>a</b>) TCA cycle. Blue box indicates the 'canonical' cycle, as known from saprophytic Enterobacteriaceae with large genomes. (<b>b</b>) Glycolysis. The green box indicates omnipresent COGs in the evolutionarily ancient bottom part of glycolysis, and the red box indicates three COGs coding for phosphoglycerate mutase activity. None of the patterns in the red box is close to the patterns in the green box, even though all these COGs are functionally linked. (<b>c</b>) Most genomes have just one of the two types of thymidylate synthase, but the blue boxes indicate several exceptions to this rule. (<b>d</b>) The full names of the species listed along the top of (a) and the TCA enzymes corresponding to the COGs shown in (a-c).</p>
				</text>
				<graphic file="gb-2004-5-5-r32-1"/>
			</fig>
			<p>A special case of gene gain/loss is gene displacement, when the same function is performed by non-orthologous genes in different species <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. For example, most enzymes from the triose part of the glycolytic pathway are present in almost every species, but one activity, phosphoglycerate mutase, can be carried out by three non-orthologous genes, and the pattern for each of these COGs is not a good match to the rest of the pathway (Figure <figr fid="F1">1</figr>). Phyletic patterns themselves, however, may be used to track displacements, by assuming that the alternative isofunctional genes display negative correlation, or 'complementarity'. A recent example of such an approach is the discovery of the novel type of thymidylate synthase, flavin-dependent ThyX, deduced by reversing presences and absences in a pattern of the conventional, folate-dependent thymidylate synthase ThyA <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>. As with positive correlations, the complementary relationship is obscured by asynchronous gains and losses and by functional redundancy, when two genes performing the same molecular function are encoded by the same genome (Figure <figr fid="F1">1</figr>).</p>
			<p>Recent attempts at a more quantitative understanding of phyletic patterns include devising a scoring function for negative correlation, which has helped to find displacements of thiamine biosynthesis genes among the candidates shortlisted by other methods <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>, and development of significance tests for similarities between two patterns <abbrgrp><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr></abbrgrp>. It has also been proposed to improve the sensitivity of phyletic pattern matching by combining binary information of gene presence/absence and phylogenetic distance between orthologs <abbrgrp><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr></abbrgrp>.</p>
			<p>In this work, we characterize the relationships between functionally linked genes/COGs across multiple genomes, and ask what can be inferred, in a systematic way, about the metabolism and evolution of prokaryotes, on the basis of phyletic patterns alone. Four main components of our quantitative analysis are: distance between patterns; method for producing graphs based on the distance data; method for partitioning the graph into subsets; and estimation of error rate in predicting functional links. Generally speaking, phyletic patterns are binary vectors in species space, and distance between them can be measured in many ways. Patterns and the set of distances between each pair of them define a graph, in which one may discern subgraphs, or clusters, of similar pattern vectors. The quest for finding functionally linked genes/COGs then amounts to constructing a graph in which the number of automatically identifiable, biologically relevant clusters is maximized.</p>
		</sec>
		<sec>
			<st>
				<p>Results</p>
			</st>
			<sec>
				<st>
					<p>Hierarchical clustering of phyletic patterns</p>
				</st>
				<p>The key question in any clustering is the choice of the appropriate combination of distance measure and clustering algorithm. We investigated the effect of various distances between patterns, of different clustering approaches, and of several methods of tree splitting on the recovery of functionally linked proteins.</p>
				<p>Several measures of distance between phyletic patterns have been proposed <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B18">18</abbr><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr></abbrgrp>. Most of them do not address a crucial requirement, which we illustrate in the following example. Consider two pairs of proteins (x<sub>1</sub>, y<sub>1</sub>) and (x<sub>2</sub>, y<sub>2</sub>), with patterns x<sub>1 </sub>= (1011110), y<sub>1 </sub>= (0111110), x<sub>2 </sub>= (1000000), y<sub>2 </sub>= (0000001). We are interested in whether there is a functional link between x<sub>1 </sub>and y<sub>1</sub>, and between x<sub>2 </sub>and y<sub>2</sub>. Clearly, only in the case (x<sub>1</sub>, y<sub>1</sub>) can it be said that 'two proteins tend to be found together'. Yet, most distances, including Euclidean and other <it>l</it><sub><it>p</it></sub>-norms, Hamming distance, and J-divergence, are the same in both cases (see Materials and methods for details). The two cases are nevertheless readily distinguishable by the mutual information (MI) measure, and are placed even further apart when using the complement of the correlation coefficient <graphic file="gb-2004-5-5-r32-i1.gif"/>, or its modifications, such as squared anticorrelation, also called diametric distance <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>, and absolute anticorrelation (see Materials and methods). MI- and correlation-based measures are further compared in the next section.</p>
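				<p>The example above can be checked numerically. The Python sketch below is an illustration only, not the implementation used in this study: it assumes that the correlation-based distance is 1 - r, where r is the Pearson correlation coefficient of the two bit vectors, and the function names are hypothetical.</p>

```python
from math import sqrt

def hamming(a, b):
    """Number of genomes in which exactly one of the two genes is present."""
    return sum(1 for p, q in zip(a, b) if p != q)

def pearson(a, b):
    """Pearson correlation of two presence/absence vectors.

    Undefined (division by zero) if either pattern is constant.
    """
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)

# The four patterns from the text.
x1 = [1, 0, 1, 1, 1, 1, 0]
y1 = [0, 1, 1, 1, 1, 1, 0]
x2 = [1, 0, 0, 0, 0, 0, 0]
y2 = [0, 0, 0, 0, 0, 0, 1]

# Hamming distance cannot tell the two pairs apart.
print(hamming(x1, y1), hamming(x2, y2))  # 2 2

# The assumed correlation distance 1 - r separates them.
print(round(1 - pearson(x1, y1), 3), round(1 - pearson(x2, y2), 3))  # 0.7 1.167
```

				<p>In this toy calculation r is 0.3 for the first pair and -1/6 for the second, so the pair that actually co-occurs in four genomes is the closer one under the correlation-based distance.</p>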
				<p>To derive clusters of related patterns from their pairwise distances, we have explored several unsupervised clustering techniques, of both agglomerative and divisive type. Divisive algorithms, such as K-means clustering and bisection <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>, need an <it>a priori </it>fixed number of clusters, which is unknown. When we fixed this number using the average number of the UPGMA clusters, we found that divisive methods underperform compared to agglomerative algorithms. Therefore, agglomerative approaches were used throughout most of the study, in particular the hierarchical clustering methods UPGMA (unweighted pair-group method with arithmetic mean) and neighbor joining (NJ). The programs that we used produce a tree-like graph of phyletic patterns, exemplified in Figure <figr fid="F2">2</figr>.</p>
				<fig id="F2">
					<title>
						<p>Figure 2</p>
					</title>
					<caption>
						<p>Groups of phyletic patterns and COGs revealed by hierarchical clustering of patterns in species space</p>
					</caption>
					<text>
						<p>Groups of phyletic patterns and COGs revealed by hierarchical clustering of patterns in species space. The presentation is similar to Figure <figr fid="F1">1</figr>, but the black and white squares are vertically compressed in order to show all 4,589 COGs in one figure. The full tree of COGs is shown at the left; at 170 COGs per 1 mm height, it is not particularly suitable for visual consumption, but some closely linked clusters (short branches) can be discerned.</p>
					</text>
					<graphic file="gb-2004-5-5-r32-2"/>
				</fig>
				<p>The large tree-like graph produced by hierarchical clustering of phyletic patterns has to be partitioned into smaller graphs in order to find groups of functionally linked proteins. The splitting criterion can be chosen beforehand, by, say, deciding on the upper level of distance at which two phyletic patterns are still considered similar, or by controlling the size or number of clusters. Instead of making a more or less arbitrary choice of such parameters, we used the distribution of similarities between patterns to infer the threshold at which a similarity becomes significantly higher than average <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>. We settled on a threshold at which about 90% of the entire dataset was included in the partitioned clusters (see Materials and methods for details).</p>
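				<p>The simplest splitting criterion mentioned above, an upper limit on the distance at which patterns are still considered similar, can be sketched as follows. This is a simplified stand-in, not the significance-based threshold used in this study: the function name, the toy labels and the toy distances are all hypothetical, and average linkage (UPGMA) is applied directly to pairwise distances and stopped once the closest pair of clusters exceeds the cut-off.</p>

```python
from operator import gt  # gt(a, b) tests whether a is greater than b

def upgma_cut(names, dist, cutoff):
    """Average-linkage agglomeration that stops merging at `cutoff`.

    names: list of item labels.
    dist: dict keyed by frozenset of two labels, giving their distance.
    Returns a sorted list of clusters, each a sorted list of labels.
    """
    clusters = [[n] for n in names]

    def average(c1, c2):
        # UPGMA distance: mean of all pairwise distances between clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) not in (0, 1):
        candidates = [
            (average(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        best, i, j = min(candidates)
        if gt(best, cutoff):
            break  # every remaining merge would exceed the cut-off
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(clusters)

# Hypothetical distances between four toy patterns.
toy = {
    frozenset(("a", "b")): 0.1, frozenset(("c", "d")): 0.2,
    frozenset(("a", "c")): 0.8, frozenset(("a", "d")): 0.9,
    frozenset(("b", "c")): 0.7, frozenset(("b", "d")): 0.9,
}
print(upgma_cut(["a", "b", "c", "d"], toy, 0.4))   # [['a', 'b'], ['c', 'd']]
print(upgma_cut(["a", "b", "c", "d"], toy, 0.15))  # [['a', 'b'], ['c'], ['d']]
```

				<p>Lowering the cut-off yields smaller, tighter clusters and leaves more patterns unclustered, which is the trade-off that a data-driven threshold is meant to settle.</p>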
				<p>Species themselves can be seen as vectors in the COG space, and distances between such vectors can be used to build the species' phylogeny <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>; this aspect is not considered in the current work, except for illustrative purposes (Figure <figr fid="F2">2</figr>, tree at the top).</p>
			</sec>
			<sec>
				<st>
					<p>The quality of clustering solutions</p>
				</st>
				<p>We studied the quality of eight clustering solutions produced by combining two clustering algorithms - UPGMA and NJ - and four variants of correlation-based distance (all graphs are available from the authors on request). We were interested in two criteria of quality: sensitivity and percentage of lost data. Sensitivity of a solution is defined as the percentage of genes/COGs that belong to the same pathway or functional system and are assigned to the same cluster, counted for each pathway and averaged. The percentage of lost data is the fraction of COGs that belong to the set of known pathways, but were not included in our clusters at a given similarity threshold. In these tests, we used 52 pathways and functional systems, containing 716 COGs altogether, from the COG database <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>. The sensitivity and the percentage of lost data both depended on the clustering algorithm and distance measure. Use of diametric distance resulted in a major improvement in sensitivity and the lowest percentage of lost data (Figure <figr fid="F3">3a</figr>).</p>
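				<p>One possible reading of the two quality criteria is sketched below in Python. The interpretation is an assumption: sensitivity is taken, for each pathway, as the percentage of its COGs that fall into the single cluster holding most of them, and these percentages are then averaged. The function name and the COG and pathway labels are hypothetical toy values.</p>

```python
from collections import Counter

def pathway_scores(pathways, cluster_of):
    """Return (mean sensitivity, percentage of lost data).

    pathways: dict mapping a pathway name to the set of its COG ids.
    cluster_of: dict mapping a COG id to its cluster id; COGs left out
    of every cluster are simply missing from this dict.
    """
    sensitivities = []
    for members in pathways.values():
        counts = Counter(cluster_of[c] for c in members if c in cluster_of)
        largest = max(counts.values(), default=0)
        sensitivities.append(100.0 * largest / len(members))
    known = set().union(*pathways.values())
    lost = 100.0 * sum(1 for c in known if c not in cluster_of) / len(known)
    return sum(sensitivities) / len(sensitivities), lost

# Toy input: pathway A is split over two clusters and loses one COG.
toy_pathways = {"A": {"c1", "c2", "c3", "c4"}, "B": {"c5", "c6"}}
toy_clusters = {"c1": 1, "c2": 1, "c3": 2, "c5": 3, "c6": 3}

sensitivity, lost = pathway_scores(toy_pathways, toy_clusters)
print(sensitivity, round(lost, 2))  # 75.0 16.67
```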
				<fig id="F3">
					<title>
						<p>Figure 3</p>
					</title>
					<caption>
						<p>Comparison of distance measures and clustering algorithms</p>
					</caption>
					<text>
						<p>Comparison of distance measures and clustering algorithms. (<b>a</b>) Diametric distance combined with NJ clustering results in the highest sensitivity and the smallest percentage of lost data. (<b>b</b>) The effect of selected distance measures between phyletic patterns on the recovery of functionally linked pairs of genes. The criteria of functional linkages on the basis of the KEGG maps, as well as the values for mutual information are as in <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>.</p>
					</text>
					<graphic file="gb-2004-5-5-r32-3"/>
				</fig>
				<p>Functional inference on the basis of phyletic patterns has been benchmarked by von Mering <it>et al</it>. <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. They used a smaller set of species and distances based on mutual information, and evaluated the performance of the method by comparing linked pairs of genes in <it>E. coli </it>and their co-occurrence in the KEGG metabolic maps (see Figure <figr fid="F2">2</figr> in <abbrgrp><abbr bid="B28">28</abbr></abbrgrp> for details). We compared their approach and ours in the context of the current dataset, by clustering our phyletic patterns using their MI-derived distance, and interrogating KEGG maps with <it>E. coli </it>proteins. The performances of correlation and MI-based distances were quite similar in this test (Figure <figr fid="F3">3b</figr>). Apparently, however, correlation distance is more accurate in assigning genes to metabolic pathways as defined in the COG database than to larger KEGG charts (Figure <figr fid="F3">3a</figr> and G.V.G. and A.R.M., unpublished data).</p>
				<p>Analysis of the clustering quality within the individual pathways and functional systems indicates that they tend to fall into three broad categories. For some pathways, such as heme biosynthesis or TCA cycle, the specificity of clustering was similar and low, regardless of the methods. Other pathways and systems were confidently clustered regardless of the protocol. These include, for example, the MEP pathway of terpenoid biosynthesis, lipid A biosynthesis, and the NADH-ubiquinone oxidoreductase complex. The third, and largest, category included pathways for which recovery in a cluster was dependent on the clustering method. Perhaps predictably, the percentage of correctly extracted genes in a pathway correlates significantly (<it>p </it>&lt; 0.05) with its average information content; that is, with the conservation of phyletic patterns among the members of a pathway (Additional data file 1). It seems likely that, given the genes already assigned to a partially characterized pathway or function, one might be able to estimate the probability that phyletic patterns will be helpful in finding the functionally linked genes.</p>
			</sec>
			<sec>
				<st>
					<p>Partitioning of the best clustering solution</p>
				</st>
				<p>The NJ algorithm in combination with diametric distance between phyletic patterns produced the clustering solution in which the known pathways were optimally recovered. To detect more relationships between phyletic patterns, and novel functional links between genes, we analyzed that clustering solution manually, by studying all clusters of similar phyletic patterns within the graph.</p>
				<p>One obvious result of our analysis is the existence of several large clusters of co-inherited COGs, seen as prominent rectangles of black and white (Figure <figr fid="F2">2</figr>). Inspection of the corresponding phyletic patterns indicates mostly phylogenetic, rather than functional, relationships, namely the presence of these COGs in all species, or only in bacteria, or only in archaea/eukarya. The first type of pattern reflects the minimal gene set compatible with a modern-type cell <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. The latter two patterns apparently indicate extreme divergence of some pathways between bacteria and archaea/eukarya, and the independent origin of other pathways in these domains of life <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>.</p>
				<p>Each of these three clusters contains COGs from more than one functional system. The minimal gene set (about 70 COGs) is dominated by proteins involved in translation and transcription, and also includes components of other systems, such as protein maturation and nucleotide salvage <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>. Archaea and eukarya share, to the exclusion of bacteria, many ribosomal proteins, basic machinery for DNA replication and transcription, some factors of RNA transcription, translation and decay, and a few metabolic enzymes (about 55 COGs <abbrgrp><abbr bid="B30">30</abbr><abbr bid="B31">31</abbr></abbrgrp>). Forty-seven COGs found in all bacteria but not in archaea have roles in replication, transcription, translation and protein secretion. Thus, if an uncharacterized protein has a phyletic pattern similar to any of these three patterns, this would suggest a shortened list of functional possibilities, but would not be sufficient to pinpoint the pathway.</p>
				<p>We removed these large clusters and focused on identifying every small cluster that consisted of proteins with experimentally established functional connections. We called these functionally linked clusters of genes 'PP-clusters', because genes in these clusters share similar phyletic patterns. There were 223 PP-clusters, ranging in size from two to 23 COGs, with diametric distance from zero to 0.4, and including 890 COGs (24% of the entire dataset) altogether (see the list of PP-clusters in Additional data file 2).</p>
				<p>To estimate the probability of obtaining these functional connections by chance, all COGs were randomly assigned to clusters, so that the average size was the same as the average PP-cluster size (327 random clusters, 14 COGs per cluster on average). The ratios of experimentally established functional connections observed within PP-clusters and at random were computed for 100 independent replicates of random clusters. The probability of getting, in random trials, as many or more functional connections as found in PP-clusters was estimated to be less than 3%. Thus, the functional linkage of COGs in PP-clusters was highly significant.</p>
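				<p>The logic of this randomization test can be sketched in Python as follows. The sketch makes simplifying assumptions that differ from the procedure described above: it counts co-clustered pairs of COGs that share a pathway label instead of computing ratios, and it shuffles COGs among clusters of the observed sizes. All function names and the toy labels are hypothetical.</p>

```python
import random
from itertools import combinations
from operator import ge  # ge(a, b) tests whether a is at least b

def linked_pairs(clusters, pathway_of):
    """Count within-cluster pairs whose members share a pathway label."""
    return sum(
        1
        for members in clusters
        for a, b in combinations(members, 2)
        if pathway_of.get(a) is not None
        and pathway_of.get(a) == pathway_of.get(b)
    )

def chance_probability(clusters, pathway_of, replicates=100, seed=0):
    """Fraction of shuffled replicates with at least as many linked pairs."""
    rng = random.Random(seed)
    observed = linked_pairs(clusters, pathway_of)
    pool = [cog for members in clusters for cog in members]
    sizes = [len(members) for members in clusters]
    hits = 0
    for _ in range(replicates):
        rng.shuffle(pool)
        start, shuffled = 0, []
        for size in sizes:
            shuffled.append(pool[start:start + size])
            start += size
        if ge(linked_pairs(shuffled, pathway_of), observed):
            hits += 1
    return hits / replicates

# Toy input: four clusters of three COGs, each matching one pathway.
toy_clusters = [[f"cog{4 * p + k}" for k in range(3)] for p in range(4)]
toy_pathway = {cog: p for p, members in enumerate(toy_clusters)
               for cog in members}

print(linked_pairs(toy_clusters, toy_pathway))  # 12
print(chance_probability(toy_clusters, toy_pathway))
```

				<p>A small returned fraction indicates that random groupings rarely contain as many functional connections as the observed clusters.</p>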
				<p>We were next interested in how many of these tightly linked PP-clusters could be derived automatically, without manual inspection. We computed the range of the average within PP-cluster branch lengths, which, in the case of diametric distance, were found to vary from 0 to 0.4, and derived clusters in one step, by cutting the graph in Figure <figr fid="F2">2</figr> at several fixed lengths within this range. Cutting at two different branch lengths produced the same number (89) of automatically derived PP-clusters, but the number of COGs included in these clusters was different (Table <tblr tid="T1">1</tblr>). The number of false positives, estimated as the percentage of automatically derived PP-clusters that were not present among the manually derived PP-clusters, was less than 20% in each case.</p>
				<tbl id="T1">
					<title>
						<p>Table 1</p>
					</title>
					<caption>
						<p>Manually and automatically derived PP-clusters*</p>
					</caption>
					<tblbdy cols="8">
						<r>
							<c ca="left">
								<p>Procedure of PP-cluster definition</p>
							</c>
							<c ca="center">
								<p>Number of PP-clusters</p>
							</c>
							<c ca="center">
								<p>Total number of COGs in all clusters</p>
							</c>
							<c ca="center">
								<p>COGs shared with manually derived PP-clusters</p>
							</c>
							<c ca="center">
								<p>Average number of COGs in a cluster</p>
							</c>
							<c ca="center">
								<p>Number of clusters absent in manually derived PP-clusters</p>
							</c>
							<c ca="center">
								<p>Number of pure RS<sup>&#8224; </sup>clusters</p>
							</c>
							<c ca="center">
								<p>FPs<sup>&#8225;</sup></p>
							</c>
						</r>
						<r>
							<c cspan="8">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Manual annotation</p>
							</c>
							<c ca="center">
								<p>223</p>
							</c>
							<c ca="center">
								<p>890</p>
							</c>
							<c ca="center">
								<p>N/A</p>
							</c>
							<c ca="center">
								<p>4.1</p>
							</c>
							<c ca="center">
								<p>N/A</p>
							</c>
							<c ca="center">
								<p>-</p>
							</c>
							<c ca="center">
								<p>-</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Automated tree cutting at average branch length 0.2</p>
							</c>
							<c ca="center">
								<p>89</p>
							</c>
							<c ca="center">
								<p>1,774</p>
							</c>
							<c ca="center">
								<p>315</p>
							</c>
							<c ca="center">
								<p>19.9</p>
							</c>
							<c ca="center">
								<p>38</p>
							</c>
							<c ca="center">
								<p>20</p>
							</c>
							<c ca="center">
								<p>0.19</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Automated tree cutting at average branch length 0.3</p>
							</c>
							<c ca="center">
								<p>89</p>
							</c>
							<c ca="center">
								<p>3,960</p>
							</c>
							<c ca="center">
								<p>395</p>
							</c>
							<c ca="center">
								<p>44.5</p>
							</c>
							<c ca="center">
								<p>26</p>
							</c>
							<c ca="center">
								<p>12</p>
							</c>
							<c ca="center">
								<p>0.16</p>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>*PP-clusters, clusters of COGs functionally linked on the basis of similar phyletic patterns. <sup>&#8224;</sup>RS clusters: clusters containing only COGs annotated as 'poorly characterized' in the COG database, where R stands for 'general function prediction only' and S stands for 'function unknown'. <sup>&#8225;</sup>The number of false positives (FPs) is the proportion of clusters that were not present among the manually derived PP-clusters.</p>
					</tblfn>
				</tbl>
				<p>The largest distance between two COGs in one PP-cluster (0.36) was observed for two subunits of NADH:ubiquinone oxidoreductase - COG1143 and COG1894 - linked into a PP-cluster both manually and automatically. Among the 66 genomes, 25 contain both these COGs, and 15 genomes contain either one or the other, giving a Hamming distance of 15. Although this is an extreme case, many COGs in other PP-clusters were separated by Hamming distances as high as 8 to 10. Thus, hierarchical clustering with diametric distance can detect functional links in the zone where simpler measures are not particularly helpful.</p>
			</sec>
			<sec>
				<st>
					<p>Case-by-case analysis of phyletic pattern hierarchy: known and new functional connections</p>
				</st>
				<p>The PP-clusters are dominated by groups of COGs from the same metabolic pathway or functional system. In 56 cases, however, a PP-cluster contained component(s) without an established functional connection to the rest of the cluster. In 17 cases, such COGs were the ingroups within the PP-cluster; that is, the distance between a COG and the rest of the PP-cluster was smaller than between some of the functionally linked PP-cluster members. In 23 cases, the connection between the 'unexpected' COG and the rest of the PP-cluster could be tentatively proposed. Examples of such novel functional connections follow (see Additional data file 2 for complete listing of COGs and additional predictions).</p>
				<sec>
					<st>
						<p>PP-cluster new005</p>
					</st>
					<p>PP-cluster new005 (genes found in archaea, eukarya and gammaproteobacteria) is a multienzyme system probably involved in RNA maturation. It contains RNA 3'-terminal phosphate cyclase (COG0430), pseudouridylate synthase distantly related to TruB (COG0585), and a multifunctional protein (COG1444) that is found in the rRNA processosome <abbrgrp><abbr bid="B32">32</abbr></abbrgrp> and contains an uncharacterized enzymatic domain with a Rossmann-like fold, a Walker-type ATPase domain, a GNAT-type acetyltransferase and a putative nucleic acid-binding domain <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>.</p>
				</sec>
				<sec>
					<st>
						<p>PP-cluster new023</p>
					</st>
					<p>PP-cluster new023 links cell-shape determination genes <it>mreA</it> (COG1077), <it>mreB</it> (COG1792), <it>ccmA</it> (COG1664) and COGs involved in flagellum biosynthesis and chemotaxis (in diverse bacteria, including spirochetes, proteobacteria and cyanobacteria).</p>
				</sec>
				<sec>
					<st>
						<p>PP-cluster new015</p>
					</st>
					<p>PP-cluster new015 suggests novel activities involved in the MEP pathway (most bacteria, except Gram-positives) and links it to the biosynthesis of cell-wall components (COG0860, COG0791).</p>
				</sec>
				<sec>
					<st>
						<p>PP-cluster new012</p>
					</st>
					<p>PP-cluster new012 from proteobacteria links a component of the N-end rule protein degradation pathway - Leu/Phe-tRNA-protein transferase (COG2360) - to the putative executive components of the pathway, two metalloproteases (COG2377 and COG0339).</p>
				</sec>
				<sec>
					<st>
						<p>PP-cluster new001 and PP-cluster new006</p>
					</st>
					<p>Two specialized systems consist of divalent cation transporters and enzymes predicted to require these cations for activity. PP-cluster new001 (diverse bacteria, archaea and some fungi) contains a zinc transporter (COG0053) and a membrane zinc-dependent hydrolase (COG2220). PP-cluster new006 (many bacteria and some archaea) contains thymidine phosphorylase (COG0213) and two proteins transporting cobalt or a similar divalent cation (COG0619 and COG1122); an unidentified cation has been detected in thymidine phosphorylase crystals and is thought to be involved in enzyme function <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>.</p>
				</sec>
				<sec>
					<st>
						<p>Other PP-clusters</p>
					</st>
					<p>There were 16 putative PP-clusters composed only of COGs that had at best only a very generic functional prediction ('putative hydrolase') or none at all. These clusters may represent pathways and systems that we still have to discover.</p>
					<p>Finally, a distinct type of PP-cluster is recovered by two of the distances we used in this study, <it>d</it><sub>|<it>r</it>| </sub>and <it>d</it><sub><it>r2</it></sub>. Both of these distances approach zero not only when two patterns are similar, but also when they are close to complementarity. This can indicate mutual exclusion between two COGs, as often observed with non-orthologous gene displacements <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>. We found 12 PP-clusters that included COGs with complementary patterns (Additional data file 2). Some of them represent well-known pairs of mutually displacing COGs, for example, the two types of thymidylate synthases (COG0207 and COG1351), two ribose 5-phosphate isomerases (COG0120 and COG0698), or two classes of lysyl-tRNA synthetases (COG1190 and COG1384). Other PP-clusters of that type seem to predict previously unknown gene displacements, such as the HD-superfamily phosphohydrolase (COG1078), probably substituting for some function of the Holliday junction resolvase RuvC (COG0817). Thus, the diametric distance is not only the most sensitive distance measure for PP-cluster definition, but it also has the advantage of finding some gene displacements. We expect to detect many more as our methodology of pattern comparison improves.</p>
				</sec>
			</sec>
			<sec>
				<st>
					<p>Hierarchical clustering decomposes pathways and systems into blocks of genes with tight co-inheritance</p>
				</st>
				<p>One result of this work is that, whatever we tried, most of the PP-clusters recovered only fragments of the known pathways and functional complexes. This fragmentation affects all classes of processes - biosynthesis and degradation of all classes of molecules, signal transduction, cell division, and so on - and was especially evident in the case of long biosynthetic pathways. Indeed, of the 52 pathways represented among the PP-clusters, only the MEP pathway, lipid A biosynthesis and the aerobic branch of cobalamin biosynthesis were completely covered by one specific PP-cluster each (Additional data file 2), whereas most of the other pathways were distributed among two, three or four PP-clusters, and some of their components may not be included in any PP-cluster at all.</p>
				<p>One reason for this fragmentation may be a rigid hierarchical clustering procedure, which forces each COG into a cluster once and for all. For example, the path of riboflavin biosynthesis was split between PP-cluster 211 and PP-cluster 220, and the latter cluster also included the components of two pathways for biosynthesis of several different amino acids; there are no obvious links between biosynthesis of all those compounds (unless one resorts to the general arguments of carbon-pool availability). Because a COG cannot be included in more than one PP-cluster, there is also a possibility of COG misplacement, which may happen more readily in the case of the larger COGs that include paralogs with different functions. This phenomenon deserves further investigation.</p>
				<p>At least in some cases, however, the fragmentation of pathways into PP-clusters seems also to reflect different functional roles and evolutionary fates of COGs within the same pathway. Indeed, further inspection of the split between the components of the riboflavin pathway (Figure <figr fid="F4">4</figr>) indicates that PP-cluster 211 contains the components of the pathway that are missing from most archaea (the archaeal protein with riboflavin synthase activity <abbrgrp><abbr bid="B35">35</abbr></abbrgrp> belongs to COG1731, which appears to be a distant paralog of COG0054; A.R.M., unpublished data). COGs 1985, 0108 and 0054 (PP-cluster 220) define the evolutionarily most conserved core of the pathway, whereas the entrance into it (COGs 0807 and 0117), as well as the last step, enabled by COGs 0307 or 1731, but also known to occur spontaneously <abbrgrp><abbr bid="B36">36</abbr></abbrgrp>, are more variable and prone to gene displacements.</p>
				<fig id="F4">
					<title>
						<p>Figure 4</p>
					</title>
					<caption>
						<p>Fragmentation of riboflavin biosynthesis</p>
					</caption>
					<text>
						<p>Fragmentation of riboflavin biosynthesis. (<b>a</b>) PP-cluster 211 contains the volatile part of riboflavin biosynthesis that is mostly missing in archaea (COGs 0307, 0117, 0807). (<b>b</b>) PP-cluster 220 contains the evolutionarily most conserved part of the pathway (COGs 1985, 0108, 0054). Gray shading indicates enzymes in PP-cluster 220 that are unrelated to riboflavin biosynthesis.</p>
					</text>
					<graphic file="gb-2004-5-5-r32-4"/>
				</fig>
				<p>In another example, the bacterial type IV secretion apparatus came out as four PP-clusters, one of which (PP-cluster 067) consisted of genes <it>virB8</it>, <it>virB9</it>, <it>virB10 </it>and <it>virB4 </it>(the names are from the operon involved in transfer of plasmid DNA in <it>Agrobacterium</it>). Recent studies indicate that the <it>virB7-virB8-virB9-virB10 </it>subset of the VirB operon is indeed a module sufficient for DNA uptake by the recipient, but some of the VirB1-VirB4 components are additionally required for maximum recipient activity <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>.</p>
				<p>These two examples may represent two facets of pathway decomposition into PP-clusters. In the case of the type IV secretion apparatus, at least some of the components of the system appear to represent a functionally and perhaps structurally discrete subsystem, which may be inherited semi-autonomously and retain its own phyletic pattern. In the case of riboflavin biosynthesis, evolutionary variation at the first step of the pathway remains unexplained, while a non-orthologous gene displacement appears to have perturbed the phyletic pattern of the last step.</p>
			</sec>
		</sec>
		<sec>
			<st>
				<p>Discussion</p>
			</st>
			<p>Here we have examined the quantitative aspects of deducing functional links between proteins on the basis of their simultaneous presences and absences in completely sequenced genomes. Whereas the post-homology methods, including definition of operons, multidomain proteins and phyletic patterns, work quite well when combined with each other <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B28">28</abbr><abbr bid="B30">30</abbr></abbrgrp>, very little is known about the efficiency and limitations of each method. It has been noted that a high 'co-occurrence score' (essentially, the distance between phyletic patterns based on the complement of mutual information) is less indicative of a functional link than chromosomal proximity of genes or translational fusion of domains <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. We were interested in whether the comparison of phyletic patterns can be improved, in order to detect functional links and to separate them from the phylogenetic signal <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>.</p>
			<p>One notable result of our investigation is that the use of correlation-based measures, and, in particular, of the diametric distance between patterns, substantially improves recovery of functional links between genes. This choice of distance measure appears to distinguish well between true co-inheritance and pairs of rare genes whose patterns are dominated by zeros. Moreover, diametric distance groups not only patterns that are close to identity, but also those that are close to complementarity, thus helping to detect gene displacements. We focused on the algorithms producing hierarchical trees, that is, directed acyclic graphs. Other, non-hierarchical types of graph have also been used to represent the relationships between proteins; for example, pairwise linkage graphs with scale-free properties have been used to describe the network of protein-protein interactions <abbrgrp><abbr bid="B38">38</abbr></abbrgrp> and the space of protein structures <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>. Some of these approaches may complement our pattern-clustering procedure, and different types of graphs may discover different subsets of functional links.</p>
			<p>Another result of our study is the evidence that the co-inheritance of functionally linked genes is constantly perturbed by differential gains, losses and displacements of orthologous genes. This volatility of phyletic patterns reflects the high plasticity and rapid evolution of gene content in microbial genomes <abbrgrp><abbr bid="B40">40</abbr></abbrgrp> and calls for improving the techniques for phyletic pattern comparison. When this manuscript was under revision, Snel and Huynen <abbrgrp><abbr bid="B41">41</abbr></abbrgrp> reported a similar set of observations of perturbation of gene co-inheritance in microbial evolution. It did not escape our attention that the two-dimensional image of clustered patterns is similar to the now-familiar presentation of whole-genome gene-expression arrays, and that our PP-cluster discovery process is akin to inferring functional links from co-activation and co-inhibition of gene activity. The analysis of gene expression makes extensive use of hierarchical clustering of gene-expression patterns, and many techniques involved will be the same as in the case of phyletic patterns <abbrgrp><abbr bid="B42">42</abbr></abbrgrp>. We note, however, that there is currently no clear quantitative model of the process that produces gene-expression values. In contrast, in our case, phyletic patterns and distances between them can be understood, in quantitative detail, in terms of gene gains and losses in the course of genome evolution <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>.</p>
		</sec>
		<sec>
			<st>
				<p>Materials and methods</p>
			</st>
			<sec>
				<st>
					<p>The data</p>
				</st>
				<p>Gene presences and absences are summarized in the COG database <abbrgrp><abbr bid="B43">43</abbr></abbrgrp>. There were 4,873 COGs from 66 complete genomes of unicellular organisms in the COG database as of 21 September 2003 <abbrgrp><abbr bid="B44">44</abbr></abbrgrp>. After exclusion of 284 fungus-specific COGs, we had 3,372 patterns containing one COG and 316 patterns containing two or more COGs, 4,589 COGs in total. Each <it>i</it>th COG (<it>i </it>= 1,..., 4,589) is represented by a vector, where the <it>j</it>th coordinate (<it>j </it>= 1,..., 66) is set at 1 if the COG is represented in the <it>j</it>th genome, and at 0 if it is not. This vector is equivalent to what has been called 'phylogenetic pattern' in <abbrgrp><abbr bid="B2">2</abbr></abbrgrp> and 'phylogenetic profile' in <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. We feel 'phyletic' is preferable to 'phylogenetic', because a pattern explicitly tells us what is going on in each phylum, whereas phylogeny of a set of species is not necessarily recoverable from a pattern or even from a set of patterns.</p>
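				<p>As an illustration of this encoding, the following Python sketch builds binary pattern vectors from a presence table and groups COGs that share an identical pattern. The COG labels, genome names and the <it>presence</it> table are hypothetical toy data, not taken from the COG database, and the helper names are ours.</p>
				<p><![CDATA[
```python
from collections import defaultdict

# Hypothetical toy data: the set of genomes in which each COG is represented.
genomes = ["g1", "g2", "g3", "g4", "g5"]
presence = {
    "COG_A": {"g1", "g3", "g4"},
    "COG_B": {"g1", "g3", "g4"},
    "COG_C": {"g2", "g5"},
    "COG_D": {"g1", "g2", "g3", "g4", "g5"},
}

def phyletic_pattern(cog_genomes, genome_order):
    """Return a tuple of 0/1 values, one coordinate per genome."""
    return tuple(1 if g in cog_genomes else 0 for g in genome_order)

def group_by_pattern(presence_table, genome_order):
    """Map each distinct pattern to the sorted list of COGs that share it."""
    groups = defaultdict(list)
    for cog in sorted(presence_table):
        pattern = phyletic_pattern(presence_table[cog], genome_order)
        groups[pattern].append(cog)
    return dict(groups)

patterns = group_by_pattern(presence, genomes)
for pattern, cogs in patterns.items():
    print("".join(map(str, pattern)), cogs)
```
]]></p>
				<p>In this toy table, COG_A and COG_B collapse into a single pattern, which mirrors the distinction made above between patterns containing one COG and patterns containing two or more.</p>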
				<p>Some COGs contain a mix of orthologs and lineage-specific gene duplications <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>. In some cases, functions of genes within such an enlarged COG diverge substantially, which may produce artifacts in the process of functional inference <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. In our final set of PP-clusters (see Results and Discussion sections for details) there were only 26 (3%) of these 'multifunctional' COGs. An average COG in PP-clusters contained 1.2 genes per species, and 85% of all COGs in the database had fewer than two genes per species (counting in the denominator only species that had genes in this COG - that is, the 'ones' in the phyletic pattern). The impact of large, functionally heterogeneous COGs on our analysis thus appears to be slight.</p>
			</sec>
			<sec>
				<st>
					<p>The choice of distance measure</p>
				</st>
				<p>The successful discovery of a relationship between phyletic patterns depends on the way the distance and similarity between two pattern vectors <graphic file="gb-2004-5-5-r32-i2.gif"/> are measured. We considered a variety of distance measures. These include: <it>l</it><sub><it>p </it></sub>norm (that is,</p>
				<p><graphic file="gb-2004-5-5-r32-i3.gif"/>,</p>
				<p>where <it>p </it>= 1: Manhattan; <it>p </it>= 2: Euclidean; <it>p </it>= &#8734;: Chebyshev distance); Hamming distance, that is, the number of mismatched vector coordinates between two patterns, <it>d</it><sub><it>H </it></sub>= #(<it>x</it><sub><it>i </it></sub>&#8800; <it>y</it><sub><it>i</it></sub>); the complement <it>d</it><sub><it>MS </it></sub>= (1 - <it>J</it>) of Jaccard's similarity index <it>J</it>, which is the cardinality of vectors' intersection divided by the cardinality of their union, <it>J </it>= #(<it>x</it><sub><it>i </it></sub>&#8745; <it>y</it><sub><it>i</it></sub>)/#(<it>x</it><sub><it>i </it></sub>&#8746; <it>y</it><sub><it>i</it></sub>) (this is also known as the Marczewski-Steinhaus distance); the complement of the correlation coefficient, <graphic file="gb-2004-5-5-r32-i1.gif"/>, where</p>
				<p>
					<graphic file="gb-2004-5-5-r32-i4.gif"/>
				</p>
				<p>is the Pearson correlation coefficient; squared anticorrelation, or diametric distance <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>, <graphic file="gb-2004-5-5-r32-i5.gif"/>; absolute anticorrelation distance, <graphic file="gb-2004-5-5-r32-i6.gif"/>; mutual information</p>
				<p><graphic file="gb-2004-5-5-r32-i7.gif"/><abbrgrp><abbr bid="B9">9</abbr><abbr bid="B45">45</abbr></abbrgrp>,</p>
				<p>where <it>P</it><sub><it>i</it></sub>, <it>P</it><sub><it>j</it></sub>, <it>P</it><sub><it>ij </it></sub>are the frequencies of occurrences for, respectively, genes <it>i</it>, <it>j </it>and gene pairs (<it>i, j</it>) in two genomes; Kullback-Leibler (KL) distance and J-divergence. KL is the relative entropy of two probability mass functions <it>p(x) </it>and <it>q(x) </it>over the random variable <it>X </it></p>
				<p><graphic file="gb-2004-5-5-r32-i8.gif"/><abbrgrp><abbr bid="B46">46</abbr></abbrgrp>.</p>
				<p>The average of two KL distances between two distributions (J-divergence) is symmetric and therefore more applicable for clustering <abbrgrp><abbr bid="B47">47</abbr></abbrgrp>.</p>
				<p>In the challenge example that we discuss in Results, that of two gene pairs (x<sub>1</sub>, y<sub>1</sub>) and (x<sub>2</sub>, y<sub>2</sub>), with patterns x<sub>1 </sub>= (1011110), y<sub>1 </sub>= (0111110), x<sub>2 </sub>= (1000000), y<sub>2 </sub>= (0000001), all <it>l</it><sub><it>p</it></sub>-norm distances are the same for both pairs: for example, Euclidean, <it>d</it><sub>2</sub>(<it>x</it><sub>1</sub>, <it>y</it><sub>1</sub>) = <it>d</it><sub>2</sub>(<it>x</it><sub>2</sub>, <it>y</it><sub>2</sub>) = <graphic file="gb-2004-5-5-r32-i9.gif"/>; or Hamming, <it>d</it><sub><it>H</it></sub>(<it>x</it><sub>1</sub>, <it>y</it><sub>1</sub>) = <it>d</it><sub><it>H</it></sub>(<it>x</it><sub>2</sub>, <it>y</it><sub>2</sub>) = 2. J-divergence is zero in both cases. The MI measure distinguishes between the two cases: <it>M</it>(<it>x</it><sub>1</sub>, <it>y</it><sub>1</sub>) = 0.019 and <it>M</it>(<it>x</it><sub>2</sub>, <it>y</it><sub>2</sub>) = 0.010. The difference, however, is more pronounced in the case of the correlation coefficient <it>r</it> that underlies the correlation distance <it>d</it><sub><it>r </it></sub>(<it>r </it>= 0.3 and -0.17, respectively). The <it>d</it><sub><it>r2 </it></sub>and <it>d</it><sub>|<it>r</it>| </sub>distances also readily distinguish between these cases, as does Jaccard's similarity index (<it>J</it>(<it>x</it><sub>1</sub>, <it>y</it><sub>1</sub>) = 0.67, <it>J</it>(<it>x</it><sub>2</sub>, <it>y</it><sub>2</sub>) = 0). Note that, while all distances equal zero for two identical phyletic patterns, only the squared correlation and the absolute anticorrelation distances also equal zero for two complementary patterns. This is a useful property when one wants to look for gene displacements (see Results and Discussion).</p>
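				<p>The arithmetic of this example can be checked with a short Python sketch. Because the formulas for the correlation-based distances are given here only as images, the sketch assumes the forms 1 - <it>r</it>, 1 - <it>r</it><sup>2</sup> and 1 - |<it>r</it>| for the correlation, diametric and absolute anticorrelation distances, respectively, where <it>r</it> is the Pearson correlation coefficient; the function names are illustrative and are not part of the original analysis.</p>
				<p><![CDATA[
```python
import math

def hamming(x, y):
    """Number of coordinates at which two patterns differ."""
    return sum(a != b for a, b in zip(x, y))

def euclidean(x, y):
    """l_p norm of the difference between two patterns, with p = 2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient; undefined if a pattern is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_distances(x, y):
    """Assumed forms: d_r = 1 - r, d_r2 = 1 - r**2, d_abs_r = 1 - |r|."""
    r = pearson(x, y)
    return {"r": r, "d_r": 1 - r, "d_r2": 1 - r * r, "d_abs_r": 1 - abs(r)}

def bits(s):
    """Turn a string such as '1011110' into a list of 0/1 integers."""
    return [int(c) for c in s]

x1, y1 = bits("1011110"), bits("0111110")
x2, y2 = bits("1000000"), bits("0000001")

for label, x, y in (("pair 1", x1, y1), ("pair 2", x2, y2)):
    d = correlation_distances(x, y)
    print(label, hamming(x, y), round(euclidean(x, y), 3),
          {k: round(v, 3) for k, v in d.items()})

# Complementary patterns: r = -1, so d_r2 and d_abs_r vanish while d_r does not.
print(correlation_distances(bits("1010"), bits("0101")))
```
]]></p>
				<p>Under these assumptions both pairs have a Hamming distance of 2 and the same Euclidean distance, whereas <it>r</it> is 0.3 for the first pair and about -0.17 for the second.</p>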
			</sec>
			<sec>
				<st>
					<p>Clustering and preliminary partitioning</p>
				</st>
				<p>Algorithms of supervised, parametric or partitional clustering are of limited use for our purpose, because of the lack, respectively, of a well-defined training set, a statistical model of pattern distribution, and the knowledge of the underlying cluster number. We studied several algorithms of divisive clustering included in the CLUTO package <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>, as well as two standard agglomerative algorithms for hierarchical clustering, familiar from phylogenetic studies - average linkage (UPGMA) and neighbor joining (NJ) from the PAUP* 4.0b8 package <abbrgrp><abbr bid="B48">48</abbr></abbrgrp>. Agglomerative clustering was the most sensitive and specific, as described in detail in Results. Because the divisive clustering algorithms need an <it>a priori </it>fixed number of clusters, we estimated such numbers on the basis of the average number of UPGMA clusters (from 67 to 157, depending on the parameters). The quality of the clustering solution, however, was lower for K-means and other divisive algorithms (for example, repeated bisections) than in the case of agglomerative algorithms. Results of all clustering experiments are shown in Tables 1 and 2 in Additional data file 3.</p>
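				<p>For readers unfamiliar with average linkage, the following Python sketch shows a naive agglomerative procedure in the spirit of UPGMA, applied to a few toy patterns. It is not the PAUP* implementation used in this study: the patterns, the stopping threshold of 0.5 and the assumed form 1 - <it>r</it><sup>2</sup> of the diametric distance are illustrative choices, and the cubic-time loop is suitable only for small examples.</p>
				<p><![CDATA[
```python
import math

def pearson(x, y):
    """Pearson correlation coefficient; undefined if a pattern is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def diametric(x, y):
    """Assumed form of the diametric distance: 1 - r**2."""
    r = pearson(x, y)
    return 1 - r * r

def average_linkage(items, dist, threshold):
    """Merge the two closest clusters until their average distance exceeds threshold."""
    clusters = [[i] for i in range(len(items))]

    def cluster_distance(a, b):
        # Unweighted mean of all pairwise distances between members of a and b.
        total = sum(dist(items[i], items[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = cluster_distance(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        if d > threshold:
            break
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)] + [merged]
    return sorted(sorted(c) for c in clusters)

# Hypothetical toy patterns over eight genomes.
toy = [
    [1, 1, 1, 1, 0, 0, 0, 0],  # 0
    [1, 1, 1, 0, 0, 0, 0, 0],  # 1: similar to pattern 0
    [0, 0, 0, 0, 1, 1, 1, 1],  # 2: complementary to pattern 0
    [1, 0, 1, 0, 1, 0, 1, 0],  # 3
    [1, 0, 1, 0, 1, 0, 0, 0],  # 4: similar to pattern 3
]

result = average_linkage(toy, diametric, threshold=0.5)
print(result)
```
]]></p>
				<p>With this distance, the complementary patterns 0 and 2 fall into the same cluster as the similar patterns 0 and 1, which illustrates why a diametric-type distance can group candidate gene displacements together with co-inherited genes.</p>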
				<p>To partition the space of clustered patterns into groups of functionally linked proteins, we used the cutoffs derived from comparing similarity between random patterns, as well as between the functionally linked ones. In each phyletic pattern, all ones and zeros were randomly shuffled to destroy the existing correlations. The figures in Additional data file 4 show distributions of 10,527,166 correlation coefficients among phyletic patterns and shuffled phyletic patterns from the COG database. Among shuffled patterns, 99% of correlation coefficients were below 0.3, corresponding to the distance <it>d</it><sub><it>r </it></sub>= 0.7. Therefore, if we choose this distance value as a threshold, the probability that two uncorrelated patterns have a correlation coefficient of more than 0.3 is less than 1%. At this threshold, however, only several huge clusters can be found. In another test, we inferred the similarity threshold from the distribution of correlation coefficients among original non-shuffled patterns. The distribution of all pairwise correlation coefficients among original patterns does not differ significantly from the normal distribution (Figure A in Additional data file 4, &#967;<sup>2</sup> = 1.34) and 99% of correlation coefficients are below 0.8, corresponding to the distance <it>d</it><sub><it>r </it></sub>= 0.2 and <graphic file="gb-2004-5-5-r32-i10.gif"/> (average branch length in cluster). At this threshold, only about 30% of the entire dataset was included in the clusters (see Table 1 in Additional data file 3).</p>
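				<p>The shuffling test can be sketched in Python as follows. The sketch uses randomly generated toy patterns rather than the COG data, so the cutoff it prints is not the value reported above; the number of toy patterns, the random seed and the helper names are assumptions made for illustration only.</p>
				<p><![CDATA[
```python
import math
import random
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient; undefined if a pattern is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def shuffle_pattern(pattern, rng):
    """Return a copy with ones and zeros randomly permuted (same number of ones)."""
    copy = list(pattern)
    rng.shuffle(copy)
    return copy

def upper_quantile(values, fraction):
    """Value below which roughly `fraction` of the sorted values fall."""
    ordered = sorted(values)
    return ordered[int(fraction * (len(ordered) - 1))]

rng = random.Random(0)          # fixed seed so the sketch is repeatable
n_genomes, n_patterns = 66, 60  # 66 genomes as in the text; 60 toy patterns

# Hypothetical toy patterns: each gene is present in 5 to 60 randomly chosen genomes.
patterns = []
for _ in range(n_patterns):
    k = rng.randint(5, 60)
    patterns.append(shuffle_pattern([1] * k + [0] * (n_genomes - k), rng))

# Shuffle every pattern independently, then collect all pairwise correlations.
shuffled = [shuffle_pattern(p, rng) for p in patterns]
null_r = [pearson(a, b) for a, b in combinations(shuffled, 2)]

r_cut = upper_quantile(null_r, 0.99)
print("pairs compared:", len(null_r))
print("99% of shuffled-pattern correlations lie below r =", round(r_cut, 2))
print("corresponding d_r threshold =", round(1 - r_cut, 2))
```
]]></p>
				<p>Shuffling within a pattern preserves the number of genomes in which a gene is present, so the null distribution reflects pattern density but not which genomes carry the gene.</p>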
			</sec>
			<sec>
				<st>
					<p>The quality of a clustering solution</p>
				</st>
				<p>At the first step, in order to estimate the quality of each clustering solution, we introduced three empirical indices - 'group homogeneity' (GrH), 'functional homogeneity' (FunH) and 'uncertainty' (Unc) - together with the percentage of data lost. The first two indices indicate the percentage of COGs from the same group/functional category in the cluster (we used definitions of groups and functional categories from the COGs database <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>). Unc is computed as the percentage of poorly characterized COGs in the cluster. Statistical properties of the cluster were evaluated using three other indices, namely 'consistency' (Cons), 'average distance between cluster members' (AveD) and 'in-cluster variance' (Var) (see Additional data file 1 for computational details). For best functional parsing of the metabolic map, GrH<sub>Max</sub>, FunH<sub>Max </sub>and Cons<sub>Max </sub>as well as Unc<sub>Min</sub>, AveD<sub>Min </sub>and Var<sub>Min </sub>should be found. In practice, these measures are highly correlated; for example, the lower AveD<sub>Min </sub>is, the higher FunH<sub>Max </sub>is (Table 1 in Additional data file 1). Moreover, most of these indices were almost the same in all clustering solutions. The only exception was the percentage of data lost, which showed about 10% difference between solutions (Figure <figr fid="F3">3a</figr>).</p>
				<p>The other measure of quality of a clustering solution is its sensitivity, which is the proportion of COGs from the same pathway or functional category that are included in the same cluster. This measure was strongly dependent on the distance and clustering algorithm (Table 2 in Additional data file 3). Diametric distance <it>d</it><sub><it>r2 </it></sub>tends to simultaneously minimize data loss and recover the largest number of statistically significant clusters (Figure <figr fid="F3">3a</figr> and see also Table 1 in Additional data file 3), most likely because squaring the correlation coefficient decreases its value, thus increasing the allowed distance between patterns.</p>
				<p>The information content of a pathway is <it>I</it><sub><it>p </it></sub>= <it>H</it><sub><it>r </it></sub>- <it>H</it><sub><it>p</it></sub>, where <it>H</it><sub><it>p </it></sub>is the sum over uncertainties of every position in the patterns in a pathway:</p>
				<p>
					<graphic file="gb-2004-5-5-r32-i11.gif"/>
				</p>
				<p>(<it>j </it>= 1,..., 66, <it>i </it>= 0 or 1). The frequency <graphic file="gb-2004-5-5-r32-i12.gif"/> stands for the patterns' 'support' for the <it>j</it>th species, <graphic file="gb-2004-5-5-r32-i13.gif"/><abbrgrp><abbr bid="B49">49</abbr></abbrgrp>. <it>H</it><sub><it>r </it></sub>is computed similarly, but for reshuffled patterns.</p>
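				<p>A minimal Python sketch of this calculation is given below. Because the formula itself is available here only as an image, the sketch assumes that the uncertainty at each position is the Shannon entropy (base 2) of the frequencies of zeros and ones among the patterns of the pathway, and that <it>H</it><sub><it>r </it></sub>is averaged over a number of independent reshufflings; both choices, the toy patterns and the function names are our assumptions rather than details confirmed by the text.</p>
				<p><![CDATA[
```python
import math
import random

def positional_uncertainty(patterns):
    """Sum over genomes j of -sum_i f_ij * log2(f_ij), with i in {0, 1}."""
    n = len(patterns)
    total = 0.0
    for j in range(len(patterns[0])):
        ones = sum(p[j] for p in patterns)
        for count in (ones, n - ones):
            if count:  # treat 0 * log(0) as 0
                f = count / n
                total -= f * math.log2(f)
    return total

def information_content(patterns, rng, n_shuffles=100):
    """I_p = H_r - H_p, with H_r averaged over reshuffled copies of the patterns."""
    h_p = positional_uncertainty(patterns)
    h_r = 0.0
    for _ in range(n_shuffles):
        reshuffled = []
        for p in patterns:
            copy = list(p)
            rng.shuffle(copy)
            reshuffled.append(copy)
        h_r += positional_uncertainty(reshuffled)
    return h_r / n_shuffles - h_p

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical pathway whose four genes share one phyletic pattern.
coherent = [[1, 1, 1, 1, 0, 0, 0, 0] for _ in range(4)]
# Hypothetical pathway whose four genes have unrelated patterns.
scattered = [
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [1, 1, 0, 0, 0, 0, 1, 1],
]

i_coherent = information_content(coherent, rng)
i_scattered = information_content(scattered, rng)
print("coherent pathway:", round(i_coherent, 2))
print("scattered pathway:", round(i_scattered, 2))
```
]]></p>
				<p>In this sketch a pathway whose genes share the same pattern has zero positional uncertainty and therefore a higher information content than a pathway whose patterns disagree at every position.</p>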
			</sec>
		</sec>
		<sec>
			<st>
				<p>Additional data files</p>
			</st>
			<p>The following additional files are included with the online version of this paper. Additional data file 1 is a figure showing correlations between the percentage of correctly predicted pathway and its information content (Additional data file <supplr sid="s1">1</supplr>). Additional data file 2 is a list of PP-clusters describing (1) functional predictions and gene displacements and (2) functionally linked clusters of genes, PP-clusters (Additional data file <supplr sid="s2">2</supplr>). Additional data file 3 contains tables describing the results of clustering experiments: Table 1 shows the values of classification quality indices for UPGMA/NJ algorithms with different distance measures and Table 2 shows the performance of UPGMA/NJ algorithms with different distance measures (Additional data file <supplr sid="s3">3</supplr>). Additional data file 4 is a figure showing the distributions of correlation coefficients between phyletic patterns. The distributions of 10,527,166 correlation coefficients and modified correlation coefficients between original (red bars) and shuffled (blue bars) phyletic patterns from the COG database are shown (Additional data file <supplr sid="s4">4</supplr>).</p>
			<suppl id="s1">
				<title>
					<p>Additional data file 1</p>
				</title>
				<caption>
					<p>A figure showing correlations between the percentage of correctly predicted pathway and its information content</p>
				</caption>
				<text>
					<p>A figure showing correlations between the percentage of correctly predicted pathway and its information content</p>
				</text>
				<file name="gb-2004-5-5-r32-s1.pdf">
					<p>Click here for additional data file</p>
				</file>
			</suppl>
			<suppl id="s2">
				<title>
					<p>Additional data file 2</p>
				</title>
				<caption>
					<p>A list of PP-clusters describing (1) functional predictions and gene displacements and (2) functionally linked clusters of genes, PP-clusters</p>
				</caption>
				<text>
					<p>A list of PP-clusters describing (1) functional predictions and gene displacements and (2) functionally linked clusters of genes, PP-clusters</p>
				</text>
				<file name="gb-2004-5-5-r32-s2.pdf">
					<p>Click here for additional data file</p>
				</file>
			</suppl>
			<suppl id="s3">
				<title>
					<p>Additional data file 3</p>
				</title>
				<caption>
					<p>Tables describing the results of clustering experiments</p>
				</caption>
				<text>
					<p>Tables describing the results of clustering experiments</p>
				</text>
				<file name="gb-2004-5-5-r32-s3.pdf">
					<p>Click here for additional data file</p>
				</file>
			</suppl>
			<suppl id="s4">
				<title>
					<p>Additional data file 4</p>
				</title>
				<caption>
					<p>A figure showing the distributions of correlation coefficients between phyletic patterns</p>
				</caption>
				<text>
					<p>A figure showing the distributions of correlation coefficients between phyletic patterns</p>
				</text>
				<file name="gb-2004-5-5-r32-s4.pdf">
					<p>Click here for additional data file</p>
				</file>
			</suppl>
		</sec>
	</bdy>
	<bm>
		<refgrp>
			<bibl id="B1">
				<title>
					<p>Homology: a personal view on some of the problems.</p>
				</title>
				<aug>
					<au>
						<snm>Fitch</snm>
						<fnm>WM</fnm>
					</au>
				</aug>
				<source>Trends Genet</source>
				<pubdate>2000</pubdate>
				<volume>16</volume>
				<fpage>227</fpage>
				<lpage>231</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1016/S0168-9525(00)02005-9</pubid>
						<pubid idtype="pmpid" link="fulltext">10782117</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B2">
				<title>
					<p>A genomic perspective on protein families.</p>
				</title>
				<aug>
					<au>
						<snm>Tatusov</snm>
						<fnm>RL</fnm>
					</au>
					<au>
						<snm>Koonin</snm>
						<fnm>EV</fnm>
					</au>
					<au>
						<snm>Lipman</snm>
						<fnm>DJ</fnm>
					</au>
				</aug>
				<source>Science</source>
				<pubdate>1997</pubdate>
				<volume>278</volume>
				<fpage>631</fpage>
				<lpage>637</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1126/science.278.5338.631</pubid>
						<pubid idtype="pmpid" link="fulltext">9381173</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B3">
				<title>
					<p>Assigning protein functions by comparative genome analysis: protein phylogenetic profiles.</p>
				</title>
				<aug>
					<au>
						<snm>Pellegrini</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Marcotte</snm>
						<fnm>EM</fnm>
					</au>
					<au>
						<snm>Thompson</snm>
						<fnm>MJ</fnm>
					</au>
					<au>
						<snm>Eisenberg</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Yeates</snm>
						<fnm>TO</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci USA</source>
				<pubdate>1999</pubdate>
				<volume>96</volume>
				<fpage>4285</fpage>
				<lpage>4288</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1073/pnas.96.8.4285</pubid>
						<pubid idtype="pmpid" link="fulltext">10200254</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B4">
				<title>
					<p>Biosynthesis of isoprenoids via mevalonate in Archaea: the lost pathway.</p>
				</title>
				<aug>
					<au>
						<snm>Smit</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Mushegian</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>Genome Res</source>
				<pubdate>2000</pubdate>
				<volume>10</volume>
				<fpage>1468</fpage>
				<lpage>1484</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1101/gr.145600</pubid>
						<pubid idtype="pmpid" link="fulltext">11042147</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B5">
				<title>
					<p>An unusual isopentenyl diphosphate isomerase found in the mevalonate pathway gene cluster from <it>Streptomyces </it>sp. strain CL190.</p>
				</title>
				<aug>
					<au>
						<snm>Kaneda</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Kuzuyama</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Takagi</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Hayakawa</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Seto</snm>
						<fnm>H</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci USA</source>
				<pubdate>2001</pubdate>
				<volume>98</volume>
				<fpage>932</fpage>
				<lpage>937</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1073/pnas.020472198</pubid>
						<pubid idtype="pmpid" link="fulltext">11158573</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B6">
				<title>
					<p>The non-mevalonate pathway of isoprenoids: genes, enzymes and intermediates.</p>
				</title>
				<aug>
					<au>
						<snm>Rohdich</snm>
						<fnm>F</fnm>
					</au>
					<au>
						<snm>Kis</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Bacher</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Eisenreich</snm>
						<fnm>W</fnm>
					</au>
				</aug>
				<source>Curr Opin Chem Biol</source>
				<pubdate>2001</pubdate>
				<volume>5</volume>
				<fpage>535</fpage>
				<lpage>540</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1016/S1367-5931(00)00240-4</pubid>
						<pubid idtype="pmpid" link="fulltext">11578926</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B7">
				<title>
					<p>Identification of four genes necessary for biosynthesis of the modified nucleoside queuosine.</p>
				</title>
				<aug>
					<au>
						<snm>Reader</snm>
						<fnm>JS</fnm>
					</au>
					<au>
						<snm>Metzgar</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Schimmel</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>De Crecy-Lagard</snm>
						<fnm>V</fnm>
					</au>
				</aug>
				<source>J Biol Chem</source>
				<pubdate>2004</pubdate>
				<volume>279</volume>
				<fpage>6280</fpage>
				<lpage>6285</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1074/jbc.M310858200</pubid>
						<pubid idtype="pmpid" link="fulltext">14660578</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B8">
				<title>
					<p>Inference of protein function and protein linkages in <it>Mycobacterium tuberculosis </it>based on prokaryotic genome organization: a combined computational approach.</p>
				</title>
				<aug>
					<au>
						<snm>Strong</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Mallick</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Pellegrini</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Thompson</snm>
						<fnm>MJ</fnm>
					</au>
					<au>
						<snm>Eisenberg</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Genome Biol</source>
				<pubdate>2003</pubdate>
				<volume>4</volume>
				<fpage>R59</fpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1186/gb-2003-4-9-r59</pubid>
						<pubid idtype="pmpid" link="fulltext">12952538</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B9">
				<title>
					<p>Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages.</p>
				</title>
				<aug>
					<au>
						<snm>Date</snm>
						<fnm>SV</fnm>
					</au>
					<au>
						<snm>Marcotte</snm>
						<fnm>EM</fnm>
					</au>
				</aug>
				<source>Nat Biotechnol</source>
				<pubdate>2003</pubdate>
				<volume>21</volume>
				<fpage>1055</fpage>
				<lpage>1062</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1038/nbt861</pubid>
						<pubid idtype="pmpid" link="fulltext">12923548</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B10">
				<title>
					<p>Genome evolution reveals biochemical networks and functional modules.</p>
				</title>
				<aug>
					<au>
						<snm>von Mering</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Zdobnov</snm>
						<fnm>EM</fnm>
					</au>
					<au>
						<snm>Tsoka</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Ciccarelli</snm>
						<fnm>FD</fnm>
					</au>
					<au>
						<snm>Pereira-Leal</snm>
						<fnm>JB</fnm>
					</au>
					<au>
						<snm>Ouzounis</snm>
						<fnm>CA</fnm>
					</au>
					<au>
						<snm>Bork</snm>
						<fnm>P</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci USA</source>
				<pubdate>2003</pubdate>
				<volume>100</volume>
				<fpage>15428</fpage>
				<lpage>15433</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1073/pnas.2136809100</pubid>
						<pubid idtype="pmpid" link="fulltext">14673105</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B11">
				<title>
					<p>Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes.</p>
				</title>
				<aug>
					<au>
						<snm>Mirkin</snm>
						<fnm>BG</fnm>
					</au>
					<au>
						<snm>Fenner</snm>
						<fnm>TI</fnm>
					</au>
					<au>
						<snm>Galperin</snm>
						<fnm>MY</fnm>
					</au>
					<au>
						<snm>Koonin</snm>
						<fnm>EV</fnm>
					</au>
				</aug>
				<source>BMC Evol Biol</source>
				<pubdate>2003</pubdate>
				<volume>3</volume>
				<fpage>2</fpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1186/1471-2148-3-2</pubid>
						<pubid idtype="pmpid" link="fulltext">12515582</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B12">
				<title>
					<p>Variation and evolution of the citric-acid cycle: a genomic perspective.</p>
				</title>
				<aug>
					<au>
						<snm>Huynen</snm>
						<fnm>MA</fnm>
					</au>
					<au>
						<snm>Dandekar</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Bork</snm>
						<fnm>P</fnm>
					</au>
				</aug>
				<source>Trends Microbiol</source>
				<pubdate>1999</pubdate>
				<volume>7</volume>
				<fpage>281</fpage>
				<lpage>291</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1016/S0966-842X(99)01539-5</pubid>
						<pubid idtype="pmpid" link="fulltext">10390638</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B13">
				<title>
					<p>Non-orthologous gene displacement.</p>
				</title>
				<aug>
					<au>
						<snm>Koonin</snm>
						<fnm>EV</fnm>
					</au>
					<au>
						<snm>Mushegian</snm>
						<fnm>AR</fnm>
					</au>
					<au>
						<snm>Bork</snm>
						<fnm>P</fnm>
					</au>
				</aug>
				<source>Trends Genet</source>
				<pubdate>1996</pubdate>
				<volume>12</volume>
				<fpage>334</fpage>
				<lpage>336</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1016/0168-9525(96)20010-1</pubid>
						<pubid idtype="pmpid" link="fulltext">8855656</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B14">
				<title>
					<p>An alternative flavin-dependent mechanism for thymidylate synthesis.</p>
				</title>
				<aug>
					<au>
						<snm>Myllykallio</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Lipowski</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Leduc</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Filee</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Forterre</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Liebl</snm>
						<fnm>U</fnm>
					</au>
				</aug>
				<source>Science</source>
				<pubdate>2002</pubdate>
				<volume>297</volume>
				<fpage>105</fpage>
				<lpage>107</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1126/science.1072113</pubid>
						<pubid idtype="pmpid" link="fulltext">12029065</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B15">
				<title>
					<p>Systematic discovery of analogous enzymes in thiamin biosynthesis.</p>
				</title>
				<aug>
					<au>
						<snm>Morett</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Korbel</snm>
						<fnm>JO</fnm>
					</au>
					<au>
						<snm>Rajan</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Saab-Rincon</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Olvera</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Olvera</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Schmidt</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Snel</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Bork</snm>
						<fnm>P</fnm>
					</au>
				</aug>
				<source>Nat Biotechnol</source>
				<pubdate>2003</pubdate>
				<volume>21</volume>
				<fpage>790</fpage>
				<lpage>795</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1038/nbt834</pubid>
						<pubid idtype="pmpid" link="fulltext">12794638</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B16">
				<title>
					<p>Genomic functional annotation using co-evolution profiles of gene clusters.</p>
				</title>
				<aug>
					<au>
						<snm>Zheng</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Roberts</snm>
						<fnm>RJ</fnm>
					</au>
					<au>
						<snm>Kasif</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Genome Biol</source>
				<pubdate>2002</pubdate>
				<volume>3</volume>
				<fpage>research0060.1</fpage>
				<lpage>0060.9</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmpid" link="fulltext">12429059</pubid>
						<pubid idtype="doi">10.1186/gb-2002-3-11-research0060</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B17">
				<title>
					<p>Identification of functional links between genes using phylogenetic profiles.</p>
				</title>
				<aug>
					<au>
						<snm>Wu</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Kasif</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>DeLisi</snm>
						<fnm>C</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2003</pubdate>
				<volume>19</volume>
				<fpage>1524</fpage>
				<lpage>1530</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/btg187</pubid>
						<pubid idtype="pmpid" link="fulltext">12912833</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B18">
				<title>
					<p>The use of phylogenetic profiles for gene predictions.</p>
				</title>
				<aug>
					<au>
						<snm>Liberles</snm>
						<fnm>DA</fnm>
					</au>
					<au>
						<snm>Thoren</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>von Heijne</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Elofsson</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>Curr Genomics</source>
				<pubdate>2002</pubdate>
				<volume>3</volume>
				<fpage>131</fpage>
				<lpage>138</lpage>
			</bibl>
			<bibl id="B19">
				<title>
					<p>A tree kernel to analyse phylogenetic profiles</p>
				</title>
				<aug>
					<au>
						<snm>Vert</snm>
						<fnm>JP</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2002</pubdate>
				<volume>18 Suppl 1</volume>
				<fpage>S276</fpage>
				<lpage>S284</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">12169557</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B20">
				<title>
					<p>Localizing proteins in the cell from their phylogenetic profiles.</p>
				</title>
				<aug>
					<au>
						<snm>Marcotte</snm>
						<fnm>EM</fnm>
					</au>
					<au>
						<snm>Xenarios</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>van der Bliek</snm>
						<fnm>AM</fnm>
					</au>
					<au>
						<snm>Eisenberg</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci USA</source>
				<pubdate>2000</pubdate>
				<volume>97</volume>
				<fpage>12115</fpage>
				<lpage>12120</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1073/pnas.220399497</pubid>
						<pubid idtype="pmpid" link="fulltext">11035803</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B21">
				<title>
					<p>Annotation of bacterial genomes using improved phylogenomic profiles.</p>
				</title>
				<aug>
					<au>
						<snm>Enault</snm>
						<fnm>F</fnm>
					</au>
					<au>
						<snm>Suhre</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Abergel</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Poirot</snm>
						<fnm>O</fnm>
					</au>
					<au>
						<snm>Claverie</snm>
						<fnm>JM</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2003</pubdate>
				<volume>19 Suppl 1</volume>
				<fpage>I105</fpage>
				<lpage>I107</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmpid" link="fulltext">12855445</pubid>
						<pubid idtype="doi">10.1093/bioinformatics/btg1013</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B22">
				<title>
					<p>Visualization and interpretation of protein networks in <it>Mycobacterium tuberculosis</it> based on hierarchical clustering of genome-wide functional linkage maps.</p>
				</title>
				<aug>
					<au>
						<snm>Strong</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Graeber</snm>
						<fnm>TG</fnm>
					</au>
					<au>
						<snm>Beeby</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Pellegrini</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Thompson</snm>
						<fnm>MJ</fnm>
					</au>
					<au>
						<snm>Yeates</snm>
						<fnm>TO</fnm>
					</au>
					<au>
						<snm>Eisenberg</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2003</pubdate>
				<volume>31</volume>
				<fpage>7099</fpage>
				<lpage>7109</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/nar/gkg924</pubid>
						<pubid idtype="pmpid" link="fulltext">14654685</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B23">
				<title>
					<p>Diametrical clustering for identifying anti-correlated gene clusters.</p>
				</title>
				<aug>
					<au>
						<snm>Dhillon</snm>
						<fnm>IS</fnm>
					</au>
					<au>
						<snm>Marcotte</snm>
						<fnm>EM</fnm>
					</au>
					<au>
						<snm>Roshan</snm>
						<fnm>U</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2003</pubdate>
				<volume>19</volume>
				<fpage>1612</fpage>
				<lpage>1619</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/btg209</pubid>
						<pubid idtype="pmpid" link="fulltext">12967956</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B24">
				<title>
					<p>Evaluation of hierarchical clustering algorithms for document datasets</p>
				</title>
				<aug>
					<au>
						<snm>Zhao</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Karypis</snm>
						<fnm>G</fnm>
					</au>
				</aug>
				<url>http://www-users.cs.umn.edu/~karypis/publications/Papers/PDF/vhcluster2.pdf</url>
			</bibl>
			<bibl id="B25">
				<title>
					<p>A hierarchical unsupervised growing neural network for clustering gene expression patterns.</p>
				</title>
				<aug>
					<au>
						<snm>Herrero</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Valencia</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Dopazo</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2001</pubdate>
				<volume>17</volume>
				<fpage>126</fpage>
				<lpage>136</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/17.2.126</pubid>
						<pubid idtype="pmpid" link="fulltext">11238068</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B26">
				<title>
					<p>Genome trees and the tree of life.</p>
				</title>
				<aug>
					<au>
						<snm>Wolf</snm>
						<fnm>YI</fnm>
					</au>
					<au>
						<snm>Rogozin</snm>
						<fnm>IB</fnm>
					</au>
					<au>
						<snm>Grishin</snm>
						<fnm>NV</fnm>
					</au>
					<au>
						<snm>Koonin</snm>
						<fnm>EV</fnm>
					</au>
				</aug>
				<source>Trends Genet</source>
				<pubdate>2002</pubdate>
				<volume>18</volume>
				<fpage>472</fpage>
				<lpage>479</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1016/S0168-9525(02)02744-0</pubid>
						<pubid idtype="pmpid" link="fulltext">12175808</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B27">
				<title>
					<p>COGs database: pathways and functional systems</p>
				</title>
				<url>http://www.ncbi.nlm.nih.gov/cgi-bin/COG/palox?sys=all</url>
			</bibl>
			<bibl id="B28">
				<title>
					<p>STRING: a database of predicted functional associations between proteins.</p>
				</title>
				<aug>
					<au>
						<snm>von Mering</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Huynen</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Jaeggi</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Schmidt</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Bork</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Snel</snm>
						<fnm>B</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2003</pubdate>
				<volume>31</volume>
				<fpage>258</fpage>
				<lpage>261</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/nar/gkg034</pubid>
						<pubid idtype="pmpid" link="fulltext">12519996</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B29">
				<title>
					<p>The minimal genome concept.</p>
				</title>
				<aug>
					<au>
						<snm>Mushegian</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>Curr Opin Genet Dev</source>
				<pubdate>1999</pubdate>
				<volume>9</volume>
				<fpage>709</fpage>
				<lpage>714</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1016/S0959-437X(99)00023-4</pubid>
						<pubid idtype="pmpid" link="fulltext">10607608</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B30">
				<aug>
					<au>
						<snm>Koonin</snm>
						<fnm>EV</fnm>
					</au>
					<au>
						<snm>Galperin</snm>
						<fnm>MY</fnm>
					</au>
				</aug>
				<source>Sequence - Evolution - Function: Computational Approaches in Comparative Genomics</source>
				<publisher>Norwell, MA: Kluwer Academic Publishers</publisher>
				<pubdate>2003</pubdate>
			</bibl>
			<bibl id="B31">
				<title>
					<p>Comparative genomics of archaea: how much have we learned in six years, and what's next?</p>
				</title>
				<aug>
					<au>
						<snm>Makarova</snm>
						<fnm>KS</fnm>
					</au>
					<au>
						<snm>Koonin</snm>
						<fnm>EV</fnm>
					</au>
				</aug>
				<source>Genome Biol</source>
				<pubdate>2003</pubdate>
				<volume>4</volume>
				<fpage>115</fpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1186/gb-2003-4-8-115</pubid>
						<pubid idtype="pmpid" link="fulltext">12914651</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B32">
				<title>
					<p>Making ribosomes.</p>
				</title>
				<aug>
					<au>
						<snm>Fatica</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Tollervey</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Curr Opin Cell Biol</source>
				<pubdate>2002</pubdate>
				<volume>14</volume>
				<fpage>313</fpage>
				<lpage>318</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1016/S0955-0674(02)00336-8</pubid>
						<pubid idtype="pmpid" link="fulltext">12067653</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B33">
				<title>
					<p>Evolution and function of processosome, the complex that assembles ribosomes in eukaryotes: clues from comparative sequence analysis.</p>
				</title>
				<aug>
					<au>
						<snm>Mushegian</snm>
						<fnm>AR</fnm>
					</au>
				</aug>
				<source>Prog Nucl Acids Mol Biol</source>
				<pubdate>2004</pubdate>
				<inpress/>
			</bibl>
			<bibl id="B34">
				<title>
					<p>Structural analyses reveal two distinct families of nucleoside phosphorylases.</p>
				</title>
				<aug>
					<au>
						<snm>Pugmire</snm>
						<fnm>MJ</fnm>
					</au>
					<au>
						<snm>Ealick</snm>
						<fnm>SE</fnm>
					</au>
				</aug>
				<source>Biochem J</source>
				<pubdate>2002</pubdate>
				<volume>361</volume>
				<fpage>1</fpage>
				<lpage>25</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1042/0264-6021:3610001</pubid>
						<pubid idtype="pmpid" link="fulltext">11743878</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B35">
				<title>
					<p>Biosynthesis of riboflavin: an unusual riboflavin synthase of <it>Methanobacterium thermoautotrophicum.</it></p>
				</title>
				<aug>
					<au>
						<snm>Eberhardt</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Korn</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Lottspeich</snm>
						<fnm>F</fnm>
					</au>
					<au>
						<snm>Bacher</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>J Bacteriol</source>
				<pubdate>1997</pubdate>
				<volume>179</volume>
				<fpage>2938</fpage>
				<lpage>2943</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">9139911</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B36">
				<title>
					<p>Biosynthesis of vitamin B2 (riboflavin).</p>
				</title>
				<aug>
					<au>
						<snm>Bacher</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Eberhardt</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Fischer</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Kis</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Richter</snm>
						<fnm>G</fnm>
					</au>
				</aug>
				<source>Annu Rev Nutr</source>
				<pubdate>2000</pubdate>
				<volume>20</volume>
				<fpage>153</fpage>
				<lpage>167</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1146/annurev.nutr.20.1.153</pubid>
						<pubid idtype="pmpid" link="fulltext">10940330</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B37">
				<title>
					<p>Functional subsets of the virB type IV transport complex proteins involved in the capacity of <it>Agrobacterium tumefaciens</it> to serve as a recipient in virB-mediated conjugal transfer of plasmid RSF1010.</p>
				</title>
				<aug>
					<au>
						<snm>Liu</snm>
						<fnm>Z</fnm>
					</au>
					<au>
						<snm>Binns</snm>
						<fnm>AN</fnm>
					</au>
				</aug>
				<source>J Bacteriol</source>
				<pubdate>2003</pubdate>
				<volume>185</volume>
				<fpage>3259</fpage>
				<lpage>3269</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1128/JB.185.11.3259-3269.2003</pubid>
						<pubid idtype="pmpid" link="fulltext">12754223</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B38">
				<title>
					<p>Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome.</p>
				</title>
				<aug>
					<au>
						<snm>Rzhetsky</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Gomez</snm>
						<fnm>SM</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2001</pubdate>
				<volume>17</volume>
				<fpage>988</fpage>
				<lpage>996</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/17.10.988</pubid>
						<pubid idtype="pmpid" link="fulltext">11673244</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B39">
				<title>
					<p>Expanding protein universe and its origin from the biological Big Bang.</p>
				</title>
				<aug>
					<au>
						<snm>Dokholyan</snm>
						<fnm>NV</fnm>
					</au>
					<au>
						<snm>Shakhnovich</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Shakhnovich</snm>
						<fnm>EI</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci USA</source>
				<pubdate>2002</pubdate>
				<volume>99</volume>
				<fpage>14132</fpage>
				<lpage>14136</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1073/pnas.202497999</pubid>
						<pubid idtype="pmpid" link="fulltext">12384571</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B40">
				<title>
					<p>Genomes in flux: the evolution of archaeal and proteobacterial gene content.</p>
				</title>
				<aug>
					<au>
						<snm>Snel</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Bork</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Huynen</snm>
						<fnm>MA</fnm>
					</au>
				</aug>
				<source>Genome Res</source>
				<pubdate>2002</pubdate>
				<volume>12</volume>
				<fpage>17</fpage>
				<lpage>25</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1101/gr.176501</pubid>
						<pubid idtype="pmpid" link="fulltext">11779827</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B41">
				<title>
					<p>Quantifying modularity in the evolution of biomolecular systems.</p>
				</title>
				<aug>
					<au>
						<snm>Snel</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Huynen</snm>
						<fnm>MA</fnm>
					</au>
				</aug>
				<source>Genome Res</source>
				<pubdate>2004</pubdate>
				<volume>14</volume>
				<fpage>391</fpage>
				<lpage>397</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1101/gr.1969504</pubid>
						<pubid idtype="pmpid" link="fulltext">14993205</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B42">
				<title>
					<p>Comparisons and validation of statistical clustering techniques for microarray gene expression data.</p>
				</title>
				<aug>
					<au>
						<snm>Datta</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Datta</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2003</pubdate>
				<volume>19</volume>
				<fpage>459</fpage>
				<lpage>466</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/btg025</pubid>
						<pubid idtype="pmpid" link="fulltext">12611800</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B43">
				<title>
					<p>Clusters of Orthologous Groups (COGs)</p>
				</title>
				<url>http://www.ncbi.nlm.nih.gov/COG/new</url>
			</bibl>
			<bibl id="B44">
				<title>
					<p>The COG database: an updated version includes eukaryotes.</p>
				</title>
				<aug>
					<au>
						<snm>Tatusov</snm>
						<fnm>RL</fnm>
					</au>
					<au>
						<snm>Fedorova</snm>
						<fnm>ND</fnm>
					</au>
					<au>
						<snm>Jackson</snm>
						<fnm>JJ</fnm>
					</au>
					<au>
						<snm>Jacobs</snm>
						<fnm>AR</fnm>
					</au>
					<au>
						<snm>Kiryutin</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Koonin</snm>
						<fnm>EV</fnm>
					</au>
					<au>
						<snm>Krylov</snm>
						<fnm>DM</fnm>
					</au>
					<au>
						<snm>Mazumder</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Mekhedov</snm>
						<fnm>SL</fnm>
					</au>
					<au>
						<snm>Nikolskaya</snm>
						<fnm>AN</fnm>
					</au>
					<etal/>
				</aug>
				<source>BMC Bioinformatics</source>
				<pubdate>2003</pubdate>
				<volume>4</volume>
				<fpage>41</fpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1186/1471-2105-4-41</pubid>
						<pubid idtype="pmpid" link="fulltext">12969510</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B45">
				<title>
					<p>Predicting protein function by genomic context: quantitative evaluation and qualitative inferences.</p>
				</title>
				<aug>
					<au>
						<snm>Huynen</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Snel</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Lathe</snm>
						<fnm>W</fnm>
						<suf>3rd</suf>
					</au>
					<au>
						<snm>Bork</snm>
						<fnm>P</fnm>
					</au>
				</aug>
				<source>Genome Res</source>
				<pubdate>2000</pubdate>
				<volume>10</volume>
				<fpage>1204</fpage>
				<lpage>1210</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1101/gr.10.8.1204</pubid>
						<pubid idtype="pmpid" link="fulltext">10958638</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B46">
				<aug>
					<au>
						<snm>Cover</snm>
						<fnm>TM</fnm>
					</au>
					<au>
						<snm>Thomas</snm>
						<fnm>JA</fnm>
					</au>
				</aug>
				<source>Elements of Information Theory</source>
				<publisher>New York: Wiley</publisher>
				<pubdate>1991</pubdate>
			</bibl>
			<bibl id="B47">
				<title>
					<p>Symmetrizing the Kullback-Leibler Distance</p>
				</title>
				<aug>
					<au>
						<snm>Johnson</snm>
						<fnm>DH</fnm>
					</au>
					<au>
						<snm>Sinanovic</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<url>http://cmc.rice.edu/docs/docs/Joh2001Mar1Symmetrizi.pdf</url>
			</bibl>
			<bibl id="B48">
				<aug>
					<au>
						<snm>Swofford</snm>
						<fnm>DL</fnm>
					</au>
				</aug>
				<source>PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods), Version 4</source>
				<publisher>Sunderland, MA: Sinauer Associates</publisher>
				<pubdate>2000</pubdate>
			</bibl>
			<bibl id="B49">
				<title>
					<p>Information content of binding sites on nucleotide sequences.</p>
				</title>
				<aug>
					<au>
						<snm>Schneider</snm>
						<fnm>TD</fnm>
					</au>
					<au>
						<snm>Stormo</snm>
						<fnm>GD</fnm>
					</au>
					<au>
						<snm>Gold</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Ehrenfeucht</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>J Mol Biol</source>
				<pubdate>1986</pubdate>
				<volume>188</volume>
				<fpage>415</fpage>
				<lpage>431</lpage>
				<xrefbib>
					<pubid idtype="pmpid">3525846</pubid>
				</xrefbib>
			</bibl>
		</refgrp>
	</bm>
</art>
