Cophenetic correlation analysis as a strategy to select phylogenetically informative proteins: an example from the fungal kingdom
- Equal contributors
Yeast Research, CBS-Fungal Biodiversity Centre, Uppsalalaan 8, 3584 CT Utrecht, The Netherlands
BMC Evolutionary Biology 2007, 7:134 doi:10.1186/1471-2148-7-134Published: 9 August 2007
The construction of robust and well resolved phylogenetic trees is important for our understanding of many, if not all biological processes, including speciation and origin of higher taxa, genome evolution, metabolic diversification, multicellularity, origin of life styles, pathogenicity and so on. Many older phylogenies were not well supported due to insufficient phylogenetic signal present in the single or few genes used in phylogenetic reconstructions. Importantly, single gene phylogenies were not always found to be congruent. The phylogenetic signal may, therefore, be increased by enlarging the number of genes included in phylogenetic studies. Unfortunately, concatenation of many genes does not take into consideration the evolutionary history of each individual gene. Here, we describe an approach to select informative phylogenetic proteins to be used in the Tree of Life (TOL) and barcoding projects by comparing the cophenetic correlation coefficients (CCC) among individual protein distance matrices of proteins, using the fungi as an example. The method demonstrated that the quality and number of concatenated proteins is important for a reliable estimation of TOL. Approximately 40–45 concatenated proteins seem needed to resolve fungal TOL.
In total 4852 orthologous proteins (KOGs) were assigned among 33 fungal genomes from the Asco- and Basidiomycota and 70 of these represented single copy proteins. The individual protein distance matrices based on 531 concatenated proteins that has been used for phylogeny reconstruction before  were compared one with another in order to select those with the highest CCC, which then was used as a reference. This reference distance matrix was compared with those of the 70 single copy proteins selected and their CCC values were calculated. Sixty four KOGs showed a CCC above 0.50 and these were further considered for their phylogenetic potential. Proteins belonging to the cellular processes and signaling KOG category seem more informative than those belonging to the other three categories: information storage and processing; metabolism; and the poorly characterized category. After concatenation of 40 proteins the topology of the phylogenetic tree remained stable, but after concatenation of 60 or more proteins the bootstrap support values of some branches decreased, most likely due to the inclusion of proteins with lowers CCC values. The selection of protein sequences to be used in various TOL projects remains a critical and important process. The method described in this paper will contribute to a more objective selection of phylogenetically informative protein sequences.
This study provides candidate protein sequences to be considered as phylogenetic markers in different branches of fungal TOL. The selection procedure described here will be useful to select informative protein sequences to resolve branches of TOL that contain few or no species with completely sequenced genomes. The robust phylogenetic trees resulting from this method may contribute to our understanding of organismal diversification processes. The method proposed can be extended easily to other branches of TOL.