Phylogenetic analyses of protein families are used to define the evolutionary relationships between homologous proteins. Interpreting a protein-sequence phylogenetic tree requires examining the taxonomic properties of the species associated with its sequences. However, there is no online tool to facilitate this interpretation, for example by automatically attaching taxonomic information to the nodes of a tree, or by interactively colouring the branches of a tree according to any combination of taxonomic divisions. This is especially problematic when the tree contains hundreds of sequences, a situation that is becoming common given the accelerating growth of protein sequence databases.
We have developed PhyloView, a web-based tool for colouring phylogenetic trees according to arbitrary taxonomic properties of the species represented in a protein sequence phylogenetic tree. Provided that the tree contains SwissProt, SpTrembl, or GenBank protein identifiers, the tool retrieves the taxonomic information from the corresponding database. A colour picker displays a summary of the findings and allows the user to associate colours with the leaves of the tree according to any number of taxonomic partitions. The colours are then propagated to the branches of the tree.
Phylogenetic trees based upon multiple sequence alignments of proteins from many species are commonly used to determine the evolutionary relationships between homologous sequences, which can give insights into the evolution of a protein family and the functional specificity of the members of the family.
A phylogenetic tree is expected to reflect the duplication and speciation events of the proteins it contains. Speciation generates related proteins in different organisms, and the relationships between those proteins reflect the taxonomic relations of the organisms. However, phylogenetic associations that are at odds with known taxonomy can be interesting anomalies worthy of investigation, perhaps indicating problems in the generation of the multiple sequence alignment or events of lateral gene transfer. Also, events of gene duplication (which may define subfamilies with subtle variations of the common functional theme of the family) can be observed as the repetition of a taxonomic structure at multiple places in the tree.
Unfortunately, available software for rendering phylogenetic trees does not provide a simple means of automatically retrieving taxonomic information for the sequences represented in the tree, or of graphically representing arbitrary taxonomic properties of trees. Either capability would make it easier to compare a phylogenetic tree with the taxonomic relations between the species represented in it. PhyloView was developed to address this limitation in a simple and generic way.
Results and discussion
Upon initial loading, the script provides a form where the user may upload a phylogenetic tree in Newick (New Hampshire) format, which is used by most phylogenetic packages. The tree should contain SwissProt, SpTrembl, or GenBank GI protein identifiers in the leaf node names. The associated records are then retrieved dynamically from the appropriate online database and the taxonomic information is extracted from them. Processing times for the initial upload of a tree vary with the number of sequences in the tree, the load on the server, and the response speed of the public databases queried. For example, a tree with 600 sequences (which in our experience is large for a phylogenetic tree) takes about one minute to load. Subsequent processing of the same tree tends to be much faster, because the taxonomic information associated with the protein sequence identifiers is cached and no further queries of public databases are required.
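As an illustration of the upload step, the following Python sketch pulls leaf names out of a Newick string and caches taxonomy look-ups so that each identifier is fetched only once. It is not taken from the PhyloView source: the names extract_leaf_labels and TaxonomyCache are hypothetical, the regular expression assumes unquoted labels and a tree with at least one pair of parentheses, and the fetch function is a placeholder supplied by the caller rather than a real database client.

```python
import re

# A leaf label directly follows "(" or ","; labels after ")" name internal nodes.
# Assumption: labels are unquoted and contain no parentheses, commas, colons,
# semicolons or whitespace.
_LEAF_LABEL = re.compile(r"(?<=[(,])\s*([^(),:;\s]+)")


def extract_leaf_labels(newick):
    """Return the leaf labels of a Newick string in left-to-right order."""
    return _LEAF_LABEL.findall(newick)


class TaxonomyCache:
    """Remember the lineage returned for each identifier (hypothetical helper)."""

    def __init__(self, fetch):
        # fetch is any callable mapping an identifier to its lineage;
        # it stands in for a query against a public database.
        self._fetch = fetch
        self._store = {}

    def lineage(self, identifier):
        """Return the cached lineage, calling fetch only on the first request."""
        if identifier not in self._store:
            self._store[identifier] = self._fetch(identifier)
        return self._store[identifier]
```

Passing the fetch function in as an argument keeps the sketch independent of any particular database interface, and makes the caching behaviour easy to check with a stand-in function.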
To allow users to show their own identifiers in the tree, we use the following internal format for leaf names: "DBID*YOURID", where DBID is the database identifier used by PhyloView to extract the taxonomic information, YOURID is the identifier to be displayed in the tree, and * stands for a user-defined separator (the default symbol is "/"). Optionally, the DBID can be removed from the rendered tree, leaving only the user identifiers.
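The naming convention can be sketched in a few lines of Python. The function names are hypothetical and the behaviour for names without a separator (the database identifier doubles as the displayed identifier) is an assumption, not something the tool documents.

```python
def split_leaf_name(name, separator="/"):
    """Split "DBID<separator>YOURID" into (dbid, user_id).

    Assumption: when the separator or the user part is missing, the
    database identifier is also used as the displayed identifier.
    """
    dbid, found, user_id = name.partition(separator)
    if not found or not user_id:
        return dbid, dbid
    return dbid, user_id


def display_label(name, separator="/", hide_dbid=False):
    """Return the label to draw: the full name, or only the user part."""
    _, user_id = split_leaf_name(name, separator)
    return user_id if hide_dbid else name
```

For example, split_leaf_name("P12345/my_protein") yields the pair ("P12345", "my_protein"), and display_label("P12345/my_protein", hide_dbid=True) yields "my_protein".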
Figure 1. PhyloView web interface. The phylogenetic tree is entered in the top-left window. Bottom-left: summary of the taxonomic levels present in the tree (with the number of sequences at each level in brackets), which can be expanded and contracted at will. A colour picker allows a colour to be associated with any taxonomic level. Right: interpretation of a tree. In this example, a multiple sequence alignment of putative transcription initiation factor 2, gamma subunit, and related sequences is used to illustrate PhyloView (the example is available at the web site). The colouring chosen is: Archaea:red; Bacteria:pink; Cyanobacteria:light pink; Eukarya:blue; Viridiplantae:green; Mammals:light blue. Repeating phylogenetic structures make obvious the existence of two subfamilies (IF2G, and a hypothetical IF2P), and the presence of three outliers (top: three GTPases of unknown function, wrongly included in the alignment). The plant sequence that groups with the Cyanobacteria (IF2C_ARATH) is a chloroplast IF2G. The eukaryotic members that group with bacteria (IF2M) are mitochondrial IF2Gs. Recent duplications of mammalian IF2Gs are also apparent.
Once the colours have been chosen, resubmitting the form renders a new tree in which the nodes and branches are coloured according to those choices. For every branch of the tree, the colouring algorithm considers the taxonomic group with an assigned colour that has the most members under that branch. In case of a tie between groups, the more specific one is given precedence (for example, Viridiplantae over Eukarya). The branch receives the colour of that group only if more than 50% of the sequences under the branch belong to it; otherwise the branch is left uncoloured.
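The colouring rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the PhyloView implementation: the tree is represented as nested lists of leaf names, each leaf has a lineage listed from the most general to the most specific taxon, "more specific" is taken to mean deeper in that lineage, and all function and variable names are hypothetical.

```python
from collections import Counter


def leaves_under(node):
    """Return the leaf names under a node (a leaf name or a list of child nodes)."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves_under(child)]


def branch_colour(leaf_names, lineages, colours):
    """Pick the colour for the branch above the given leaves, or None."""
    counts = Counter()
    depth = {}
    for leaf in leaf_names:
        for level, taxon in enumerate(lineages[leaf]):
            if taxon in colours:  # only taxa with an assigned colour compete
                counts[taxon] += 1
                depth[taxon] = max(depth.get(taxon, 0), level)
    if not counts:
        return None
    # Most members wins; on a tie, the deeper (more specific) taxon wins.
    best = max(counts, key=lambda taxon: (counts[taxon], depth[taxon]))
    # Colour only if strictly more than 50% of the leaves belong to that taxon.
    if 2 * counts[best] <= len(leaf_names):
        return None
    return colours[best]


def colour_branches(node, lineages, colours):
    """Map each branch, keyed by the tuple of leaves under it, to its colour."""
    names = leaves_under(node)
    result = {tuple(names): branch_colour(names, lineages, colours)}
    if not isinstance(node, str):
        for child in node:
            result.update(colour_branches(child, lineages, colours))
    return result


# Made-up example data, for illustration only.
LINEAGES = {
    "plant1": ["Eukarya", "Viridiplantae"],
    "plant2": ["Eukarya", "Viridiplantae"],
    "animal1": ["Eukarya", "Metazoa"],
    "archaeon1": ["Archaea"],
}
COLOURS = {"Eukarya": "blue", "Viridiplantae": "green", "Archaea": "red"}
TREE = [[["plant1", "plant2"], "animal1"], "archaeon1"]
BRANCH_COLOURS = colour_branches(TREE, LINEAGES, COLOURS)
```

In this example the branch above the two plant sequences is tied between Eukarya and Viridiplantae and so takes the more specific colour, green, while the branch that also includes the animal sequence has a Eukarya majority and is coloured blue.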
The preferred form of output for the tree is an SVG image. SVG is an XML-based standard for vector graphics. Though it is not natively supported by most browsers, a number of plug-ins are freely available, for example the Adobe SVG viewer. In SVG mode, moving the mouse over a leaf node of the phylogenetic tree displays a floating tool-tip with the full taxonomic information for the sequence.
We plan to extend PhyloView as a visualization framework for enhancing sequence phylogenetic tree images with associated data. We welcome feedback and proposals for additional features from users.
PhyloView is the first web server dedicated to colouring phylogenetic trees according to taxonomy. Other software may be used to attain similar results, but with considerably more effort. For example, Mesquite (an open source modular software system for evolutionary analysis written in Java) and MacClade (a commercial computer program for phylogenetic analysis that runs only on MacOS) allow the manual colouring of the branches of a phylogenetic tree, but these are complex general-purpose programs and the process is laborious. PhyloView is intended to streamline and simplify this task, allowing the user to rapidly explore different combinations of colours and taxonomic partitions for the best visual result.
Availability and requirements
PhyloView requires Internet Explorer with the Adobe SVG viewer plug-in and can be used at . Source code is available from that location as well.
MA and ER conceived the tool. GP implemented the tool. GP and MA drafted the manuscript. All authors tested the tool during its development, and read and approved the final manuscript. MA was previously known as Miguel A. Andrade.
This work was supported by projects funded by the Canadian Foundation for Innovation, and the Ontario Research and Development Challenge Funds. MA is the recipient of a Canada Research Chair of Bioinformatics. We thank the members of the Bioinformatics group of the Ontario Genomics Innovation Centre for helpful discussions.
J Theor Biol 1965, 8:357-366.
Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JGR, Korf I, Lapp H, Lehvaslaiho H, Matsalla C, Mungall CJ, Osborne BI, Pocock MR, Schattner P, Senger M, Stein LD, Stupka ED, Wilkinson M, Birney E: The Bioperl Toolkit: Perl modules for the life sciences.
Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS: The Universal Protein Resource (UniProt).
Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Church DM, DiCuccio M, Edgar R, Federhen S, Helmberg W, Kenton DL, Khovayko O, Lipman DJ, Madden TL, Maglott DR, Ostell J, Pontius JU, Pruitt KD, Schuler GD, Schriml LM, Sequeira E, Sherry ST, Sirotkin K, Starchenko G, Suzek TO, Tatusov R, Tatusova TA, Wagner L, Yaschenko E: Database resources of the National Center for Biotechnology Information.
Folia Primatol (Basel) 1989, 53:190-202.