Open Access Methodology article

Large-scale clustering of CAGE tag expression data

Kazuro Shimokawa1*, Yuko Okamura-Oho1, Takio Kurita2, Martin C Frith1,3, Jun Kawai1,4, Piero Carninci1,4 and Yoshihide Hayashizaki1,4

Author Affiliations

1 Genome Exploration Research Group, RIKEN Genomic Sciences Center (GSC), RIKEN Yokohama Institute, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan

2 National Institute of Advanced Industrial Science and Technology, Tsukuba, Ibaraki 305-8568, Japan

3 Institute for Molecular Bioscience, University of Queensland, Brisbane, Qld 4072, Australia

4 Genome Science Laboratory, Discovery Research Institute, RIKEN Wako Institute, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan


BMC Bioinformatics 2007, 8:161  doi:10.1186/1471-2105-8-161

Published: 21 May 2007

Abstract

Background

Recent analyses suggest that many genes possess multiple transcription start sites (TSSs) that are used differentially across tissues and cell lines. Using the cap analysis of gene expression (CAGE) method, we have identified a very large number of TSSs mapped onto the mouse genome. The standard hierarchical clustering algorithm, which yields easily understandable graphical tree images, has difficulty processing such large amounts of TSS data, so a better method to calculate and display the results is needed.

Results

To profit from the strengths of both methods, we combine hierarchical and non-hierarchical clustering to cluster the expression profiles of TSSs derived from a large amount of CAGE data. We processed genome-wide expression data, including 159,075 TSSs derived from 127 RNA samples from various mouse organs, and succeeded in categorizing them into 70–100 clusters. The clusters exhibited intriguing biological features: a cluster supergroup with a ubiquitous expression profile, tissue-specific patterns, a distinct distribution of non-coding RNA, and functional TSS groups.
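
As a rough illustration of why such a combination reduces cost: hierarchical clustering of all 159,075 profiles directly would involve on the order of 10^10 pairwise distances, whereas a tree built over 70–100 cluster representatives needs only a few thousand. The sketch below shows one possible way to combine the two steps, under assumptions that this abstract does not state: k-means as the non-hierarchical step, average-linkage merging of the resulting centroids as the hierarchical step, Euclidean distance throughout, and a toy data set. The function names (`kmeans`, `agglomerate`) are hypothetical and are not taken from the article.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two expression profiles (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def nearest(profile, centroids):
    """Index of the centroid closest to the given profile."""
    return min(range(len(centroids)), key=lambda c: euclidean(profile, centroids[c]))


def kmeans(profiles, k, iterations=20, seed=0):
    """Non-hierarchical step: partition profiles into at most k groups."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(profiles, k)]
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in profiles]
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = mean_vector(members)
    labels = [nearest(p, centroids) for p in profiles]
    return labels, centroids


def agglomerate(points):
    """Hierarchical step: average-linkage merging of a small set of points.

    Returns the merges as (members_a, members_b, distance) in merge order.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(euclidean(points[a], points[b])
                            for a in clusters[i] for b in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


# Toy profiles: rows are TSSs, columns are RNA samples (invented values).
profiles = [
    [9.0, 1.0, 0.5], [8.5, 1.2, 0.4],  # enriched in sample 1
    [0.5, 9.0, 1.0], [0.4, 8.8, 1.1],  # enriched in sample 2
    [1.0, 0.6, 9.2], [1.1, 0.5, 8.9],  # enriched in sample 3
    [5.0, 5.1, 4.9], [5.2, 4.8, 5.0],  # roughly ubiquitous
]
labels, centroids = kmeans(profiles, k=4)
tree = agglomerate(centroids)  # the tree is built over 4 centroids, not 8 profiles
for left, right, dist in tree:
    print(left, right, round(dist, 2))
```

In this arrangement the expensive all-pairs comparison is confined to the small set of centroids, while the tree over those centroids still provides the kind of graphical overview that hierarchical clustering is valued for.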

Conclusion

Our approach greatly reduced the computational cost and is an appropriate solution for analyzing large-scale TSS usage data.