The COG database: an updated version includes eukaryotes
Roman L Tatusov1*, Natalie D Fedorova1, John D Jackson1, Aviva R Jacobs1, Boris Kiryutin1, Eugene V Koonin1, Dmitri M Krylov1, Raja Mazumder2, Sergei L Mekhedov1, Anastasia N Nikolskaya2, B Sridhar Rao1, Sergei Smirnov1, Alexander V Sverdlov1, Sona Vasudevan1, Yuri I Wolf1, Jodie J Yin1 and Darren A Natale2
The availability of multiple, essentially complete genome sequences of prokaryotes
and eukaryotes spurred both the demand and the opportunity for the construction of
an evolutionary classification of genes from these genomes. Such a classification
system based on orthologous relationships between genes appears to be a natural framework
for comparative genomics and should facilitate both functional annotation of genomes
and large-scale evolutionary studies.
We describe here a major update of the previously developed system for delineation
of Clusters of Orthologous Groups of proteins (COGs) from the sequenced genomes of
prokaryotes and unicellular eukaryotes and the construction of clusters of predicted
orthologs for 7 eukaryotic genomes, which we named KOGs after eukaryotic orthologous
groups. The COG collection currently consists of 138,458 proteins, which form 4873
COGs and comprise 75% of the 185,505 (predicted) proteins encoded in 66 genomes of
unicellular organisms. The eukaryotic orthologous
groups (KOGs) include proteins from 7 eukaryotic genomes: three animals (the nematode
Caenorhabditis elegans, the fruit fly Drosophila melanogaster and Homo sapiens), one plant, Arabidopsis thaliana, two fungi (Saccharomyces cerevisiae and Schizosaccharomyces pombe), and the intracellular microsporidian parasite Encephalitozoon cuniculi. The current KOG set consists of 4852 clusters of orthologs, which include 59,838
proteins, or ~54% of the 110,655 analyzed eukaryotic gene products. Compared to the
coverage of the prokaryotic genomes with COGs, a considerably smaller fraction of
eukaryotic genes could be included in the KOGs; addition of new eukaryotic genomes
is expected to result in a substantial increase in the coverage of eukaryotic genomes
with KOGs. Examination of the phyletic patterns of KOGs reveals a conserved core represented
in all analyzed species and consisting of ~20% of the KOG set. This conserved portion
of the KOG set is much greater than the ubiquitous portion of the COG set (~1% of
the COGs). In part, this difference is probably due to the small number of included
eukaryotic genomes, but it could also reflect the relative compactness of eukaryotes
as a clade and the greater evolutionary stability of eukaryotic genomes.
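The coverage figures and the notion of a conserved core can be illustrated with a minimal Python sketch. It is not the procedure used to build the COGs or KOGs. The species codes, the cluster names (toy1 to toy4) and the function names are hypothetical and serve only to show how a phyletic pattern (the presence or absence of a cluster in each species) and the fraction of clusters found in all species might be computed. The only values taken from the text are the protein counts passed to coverage.

```python
from typing import Dict, List, Set

# Species codes are illustrative assumptions, not official database abbreviations.
SPECIES: List[str] = ["Cel", "Dme", "Hsa", "Ath", "Sce", "Spo", "Ecu"]


def phyletic_pattern(members: Set[str], species: List[str] = SPECIES) -> str:
    """Return a presence/absence string with one character per species."""
    return "".join("1" if s in members else "0" for s in species)


def core_fraction(clusters: Dict[str, Set[str]], species: List[str] = SPECIES) -> float:
    """Return the fraction of clusters represented in every listed species."""
    if not clusters:
        return 0.0
    everyone = set(species)
    core = sum(1 for members in clusters.values() if everyone <= members)
    return core / len(clusters)


def coverage(clustered: int, total: int) -> float:
    """Return the fraction of proteins that belong to some cluster."""
    return clustered / total


# Hypothetical toy clusters: cluster name -> set of species with a member.
toy_clusters: Dict[str, Set[str]] = {
    "toy1": set(SPECIES),            # present in all seven species
    "toy2": {"Cel", "Dme", "Hsa"},   # animals only
    "toy3": {"Sce", "Spo"},          # fungi only
    "toy4": {"Ath", "Hsa", "Sce"},   # scattered
}

if __name__ == "__main__":
    for name, members in toy_clusters.items():
        print(name, phyletic_pattern(members))
    print("core fraction (toy data):", core_fraction(toy_clusters))
    # Counts quoted in the abstract.
    print("COG coverage:", round(coverage(138458, 185505), 2))
    print("KOG coverage:", round(coverage(59838, 110655), 2))
```

With the counts quoted above, the two coverage calls round to the 75% and ~54% stated in the text.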
The updated collection of orthologous protein sets for prokaryotes and eukaryotes
is expected to be a useful platform for functional annotation of newly sequenced genomes,
including those of complex eukaryotes, and genome-wide evolutionary studies.