Open Access Highly Accessed Software

GO-Diff: Mining functional differentiation between EST-based transcriptomes

Zuozhou Chen1,2, Weilin Wang3, Xuefeng Bruce Ling4, Jane Jijun Liu4 and Liangbiao Chen2*

Author Affiliations

1 College of Life Science, Zhejiang University, Hangzhou 310029, China

2 Laboratory of Molecular and Developmental Biology, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100080, China

3 Center of Organ Transplantation, First Affiliated Hospital, College of Medicine, Zhejiang University, Hangzhou, 310003, China

4 Amgen Inc., South San Francisco, CA 94080, USA

BMC Bioinformatics 2006, 7:72  doi:10.1186/1471-2105-7-72

The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/7/72


Received: 13 September 2005
Accepted: 16 February 2006
Published: 16 February 2006

© 2006 Chen et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Large-scale sequencing efforts have produced millions of Expressed Sequence Tags (ESTs) collectively representing differentiated biochemical and functional states. Analysis of these EST libraries reveals differential gene expression, and therefore EST data sets constitute valuable resources for comparative transcriptomics. To translate differentially expressed genes into a better understanding of the underlying biological phenomena, existing microarray analysis approaches usually integrate gene expression with Gene Ontology (GO) databases to derive comparable functional profiles. However, no methods are yet available to process EST-derived transcription maps for GO-based global functional profiling in comparative transcriptomics in a high throughput manner.

Results

Here we present GO-Diff, a GO-based functional profiling approach towards high throughput EST-based gene expression analysis and comparative transcriptomics. Utilizing holistic gene expression information, the software converts EST frequencies into EST Coverage Ratios of GO Terms. The ratios are then tested for statistical significance to uncover differentially represented GO terms between the compared transcriptomes, and functional differences are thus inferred. We demonstrated the validity and the utility of this software by identifying differentially represented GO terms in three application cases: intra-species comparison; meta-analysis to test a specific hypothesis; inter-species comparison. GO-Diff findings were consistent with previous knowledge and provided new clues for further discoveries. A comprehensive test of the GO-Diff results using a series of comparisons between EST libraries of human and mouse tissues showed acceptable levels of consistency: 61% for human-human; 69% for mouse-mouse; 47% for human-mouse.

Conclusion

GO-Diff is the first software integrating EST profiles with GO knowledge databases to mine functional differentiation between biological systems, e.g. tissues of the same species or the same tissue across species. With the rapid accumulation of EST resources in the public domain and expanding sequencing efforts in individual laboratories, GO-Diff is useful as a screening tool before undertaking serious expression studies.

Background

Cellular development and its associated biochemical processes within and between various cell types are determined by the relevant cellular proteomes, which are tightly regulated by biochemical synthesis, different stage genetic interactions and various metabolic pathways. The proteome of a cell is largely (but not exclusively) regulated by gene expression [1], and the transcriptome can be regarded as a sensitive read-out of the proteome revealing the biochemical state of the cell. Currently the most popular gene expression analysis platforms include gene microarray [2] and the serial analysis of gene expression (SAGE) [3]. To analyze the molecular and cellular processes and probe the principles, mechanisms, and major developmental events giving rise to diverse tissue types, gene expression analysis has become an indispensable approach to facilitate our understanding of biology. Developmental abnormalities, including tumor, have also been explored through tumor expression profiling analysis to discover the contributing genetic and extrinsic factors.

Many genes participating in the same biological process are co-regulated and these periodically expressed genes drive the dynamics of the underlying biological processes, such as the periodically expressed protein complexes during the yeast cell cycles [4]. However, to discover such functional dynamics and their associated gene members directly from expression data is both biologically important and computationally challenging [5,6]. Nevertheless, from the biological perspective, it is imperative to integrate and associate gene expression with molecular functions, cellular components, and biological processes, thus allowing the comparative transcriptomic analysis to be an effective biological knowledge mining process. Through a taxonomy of biological concepts and their species-independent attributes for annotating gene sequences, the Gene Ontology (GO) [7,8], serves as a shared language, standardizing biological vocabularies, for communicating biological data and knowledge for comparative genomics and comparative transcriptomics.

The GO database schema models a directed acyclic graph (DAG) relationally, and the terms (graph nodes) and term-term relationships provide the conceptualizations of biological domains of knowledge [9]. High throughput annotation methods [10-13] can electronically annotate any uncharacterized protein or transcript by identifying GO annotated domains or by aligning it with GO annotated model organism sequences. For example, DIAN [10] and InterProScan [14] apply domain-mapping approaches to assign GO terms to sequences, GOtcha [11] predicts the GO associations of uncharacterized sequences by assigning each association a term-specific probability (P-score) as a measure of confidence, and AutoFACT [12] combines multiple BLAST reports from several user-selected databases to predict GO associations. These tools are well suited to genome annotators, whose goal is gene annotation and classification. Thanks to the GO consortium, gene sequences of model organisms have been well characterized with high quality GO annotations, derived either from manual curation or from direct experimental evidence. High-quality manual and computational GO annotations provide an invaluable resource and solid groundwork for additional data mining and biological mechanism characterization.

The advances in microarray technology and data mining studies allow the simultaneous analysis of all genes in the entire transcriptome, producing lists of differentially regulated genes in the condition under study. To obtain the biological significance, these differentially expressed genetic profiles should be interpreted in the contexts of molecular functions, biological processes and cellular components. The GO databases have been utilized as tools to annotate these differentially expressed genes [15]. By comparing the number of differentially expressed genes with that of background genes at each GO graph node, over-represented GO terms can be identified to translate the gene lists into a better understanding of the biological phenomena involved [16-21]. This approach of focusing on the genes with a high magnitude of change and relying on these sparse annotations with specific GO terms ignores the majority of the expression data sets, and may fail to detect considerably more subtle changes in gene networks [22]. To address this, methods have been developed to evaluate GO terms utilizing Holistic Expression information (GHE) to obtain functional analysis, such as GO-Mapper, GOAL and GOdist [22-24].

The availability of a huge number of expressed sequence tags (ESTs) [25] has made it possible to construct various tissue specific transcriptomes, thus allowing much more flexibility in large scale comparative transcriptomics analysis between different biological systems. Specifically, dbEST, a division of GenBank, has collected 31,307,034 ESTs from 976 species, of which 474 species have at least 1,000 sequence tags (dbEST release 111105, Nov 11, 2005). To support EST-based gene expression analysis, software tools have been developed to convert EST frequencies into readily analyzable transcription maps to identify differentially expressed genes, including Digital Differential Display [26,27], cDNA xProfiler [28], cDNA Digital Gene Expression Displayer (DGED) [29], and DigiNorthern [30]. However, no methods are yet available to analyze EST derived transcription maps to extract GO terms that are significantly over- or under-represented, which would enable global functional profiling for comparative transcriptomics.

GO-based microarray profiling approaches, however, cannot readily be applied to EST-based transcription analysis and functional profiling. First, unlike microarray data, where gene expression is normally distributed, EST (and also SAGE) data are generated by random sampling, which results in "tag counts" governed by the Poisson distribution [31,32]. Thus, the statistical approaches for EST analyses are different. It has been shown that the Chi-square test performed best among several statistical methods in EST and SAGE analyses [32]. Second, EST analysis is based upon the counts of sequence tags, where some genes have sufficient tag counts while others do not. As a consequence, microarray analysis approaches are not directly applicable. Third, the gene expression representations differ between microarray and EST data sets and are not easily accommodated by current microarray analytical tools. In contrast to the difficulty of comparing microarray data across array platforms, unbiased EST libraries can be easily combined and compared. This is because EST data sets share the same data format and are generated and processed with similar procedures.

In this study, we present GO-Diff, a GO-based systems biology approach for high throughput comparative transcriptomics. The implementation can comprehensively integrate and efficiently process large EST-based transcription maps, and directly compare different biological systems, e.g. same-type tissue samples from different developmental stages or from different species, based upon GO term representation analysis. Three comparative transcriptomics analyses are described to demonstrate GO-Diff's validity and data mining utility. A quantitative evaluation was also conducted to assess the consistency of GO-Diff performance.

Implementation

GO-Diff knowledge base

The GO-Diff knowledge base comes from three contributing resources. The GO structure information, as described in the standard OBO file, was downloaded from the GO website [33]. The mapping between Unigenes and GO terms was constructed through the integration of the Gene-GO mapping and the Gene-Unigene mapping [34]. The Unigene-GO mapping is also readily available from other resources including the GOA Uniprot-GO and the Uniprot-Gene [35] mappings in human, mouse, rat and zebrafish.
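The integration of the Gene-GO and Gene-Unigene mappings described above amounts to a join on the gene identifier. The following is a minimal sketch of that idea, not the authors' implementation; the function name `build_unigene_to_go` and the gene and cluster identifiers are hypothetical placeholders.

```python
from collections import defaultdict

def build_unigene_to_go(gene_to_go, gene_to_unigene):
    """Join Gene->GO and Gene->Unigene mappings into a Unigene->GO mapping.

    gene_to_go:      dict mapping a gene id to a set of GO term ids
    gene_to_unigene: dict mapping a gene id to its Unigene cluster id
    """
    unigene_to_go = defaultdict(set)
    for gene, unigene in gene_to_unigene.items():
        # A cluster inherits the GO terms of every gene assigned to it.
        unigene_to_go[unigene].update(gene_to_go.get(gene, ()))
    # Clusters without any GO annotation are left out of the mapping.
    return {u: terms for u, terms in unigene_to_go.items() if terms}

# Hypothetical toy input (identifiers are illustrative only).
gene_to_go = {
    "geneA": {"GO:0006281"},
    "geneB": {"GO:0005764", "GO:0006281"},
    "geneC": set(),
}
gene_to_unigene = {"geneA": "Mm.1", "geneB": "Mm.1", "geneC": "Mm.2"}
mapping = build_unigene_to_go(gene_to_go, gene_to_unigene)
```

In this toy input, the cluster without annotated genes is dropped, which mirrors the later definitions that consider only ESTs in clusters with GO annotations.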

EST frequencies are computed for all Unigenes in each dbEST tissue specific transcription map within the knowledge base. Source files are downloaded from the Unigene FTP site [36]. The GO-Diff Knowledge base can be updated via the GO-Diff update programs to integrate latest EST data sets and GO knowledge data sets. The GO evidence codes are integrated as part of the knowledge base. In order to assist the user to focus and limit the search space, those GO terms corresponding to irrelevant biological knowledge can be excluded from further analysis if relevant GO term evidence codes are selected.

GO-Diff algorithm

The algorithm flow chart is diagramed in Figure 1A and 1B. GO-Diff is designed and implemented to perform comparative transcriptomics with the following three analytical options: comparing dbEST libraries captured in GO-Diff knowledge base; comparing dbEST libraries captured in GO-Diff knowledge base with a user-defined EST transcriptome; comparing EST transcriptomes which are both defined by the user. The EST libraries within the knowledge base can be selected through the descriptive keywords or dbEST library identification numbers for comparative analysis. The expression profile for each sample of interest is based upon the EST frequencies of Unigenes computed from that sample's unbiased EST libraries. The Unigene clusters serve as the bridge between the EST-based gene expression and biological knowledge encapsulated by GO terms, leading to the construction of the "GO profiles" for the biological samples. The EST-based gene expression analysis essentially is the comparative dissection of the two GO profiles.

Figure 1. A. Flow diagram of GO term representation calculation. B. Overview diagram of GO-Diff algorithm.

Independent of Unigene, another approach to link ESTs to GO terms and construct an EST-based expression profile is through sequence assembly and direct GO annotation of the sequences. This approach has the advantage that GO-Diff analysis can be performed de novo, without depending on previous Unigene annotations. This is especially useful when the EST sequences are novel and fresh from an ongoing sequencing project. In addition, this approach can maximally utilize the EST information in a given transcriptome. However, the computational cost might be heavy. EST sequences are assembled into contigs using sequence assembly tools such as CAP3 [37], Phrap [38] and TIGR assembler [39], and the contigs are then searched with BLAST against the GO annotated databases. GoPipe [40,41] and other tools are used to post-process the BLAST results and extract GO annotations for the assembled contigs. The expression profile for each sample of interest is based upon the EST frequencies of these contigs. Like the Unigene clusters, the assembled contigs link the EST information to the GO terms to construct a GO profile for that particular transcriptome.

We define the "EST Coverage Level of a GO Term" (ECLG) as the total of the ESTs of the Unigenes or contigs mapped to a specific GO term, where $x_i$ is the EST count of Unigene cluster or contig $i$ that is associated with the specific GO term.

$$\mathrm{ECLG}_{\mathrm{GO\text{-}Id}} = f_{\mathrm{GO\text{-}Id}}(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n} x_i \tag{1}$$

The ECLG not only covers the ESTs directly linked to a GO term, but also includes the ESTs associated with its children GO nodes due to the "true path rule".
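To make equation (1) and the true path rule concrete, the sketch below propagates each cluster's EST count to its annotated GO terms and to all of their ancestors. It is an illustration under stated assumptions rather than the GO-Diff code: the names `ancestors` and `eclg`, the `parents` dictionary encoding of the DAG, and the toy term names are all hypothetical, and each cluster is assumed to be counted once per GO term even when it reaches that term along several paths.

```python
def ancestors(term, parents):
    """Return the term itself plus all of its ancestors in the GO DAG."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen

def eclg(est_counts, cluster_to_go, parents):
    """EST Coverage Level for every GO term reached by at least one cluster.

    est_counts:    dict cluster id -> EST count (x_i)
    cluster_to_go: dict cluster id -> set of directly annotated GO terms
    parents:       dict GO term -> iterable of parent GO terms
    """
    levels = {}
    for cluster, count in est_counts.items():
        covered = set()
        for term in cluster_to_go.get(cluster, ()):
            covered |= ancestors(term, parents)
        # Assumption: a cluster contributes once per term, however many
        # paths lead from its annotations up to that term.
        for term in covered:
            levels[term] = levels.get(term, 0) + count
    return levels

# Hypothetical three-level toy ontology: child -> parent -> root.
parents = {"child": ["parent"], "parent": ["root"]}
est_counts = {"c1": 5, "c2": 3}
cluster_to_go = {"c1": {"child"}, "c2": {"parent"}}
levels = eclg(est_counts, cluster_to_go, parents)
```

Here the cluster annotated to the child term also contributes to the parent and root terms, so the parent's coverage level includes ESTs from both clusters.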

We define the "Relative EST Coverage Level of a GO Term" (RECLG) as the proportion of the ESTs under the specific GO term among all ESTs with GO term annotations. $X_{\mathrm{All\text{-}GO}}$ is the total number of ESTs within the Unigene clusters or contigs that have GO term annotations.

$$\mathrm{RECLG}_{\mathrm{GO\text{-}Id}} = \sum_{i=1}^{n} x_i \Big/ X_{\mathrm{All\text{-}GO}} = \sum_{i=1}^{n} x_i \Big/ \sum_{i=1}^{m} x_i \tag{2}$$

The "EST Coverage Ratio of a GO Term" (ECRG) is defined as the RECLG ratio of the two transcriptomes in the study.

$$\mathrm{ECRG}_{\mathrm{GO\text{-}Id}} = \frac{\mathrm{RECLG}_{\mathrm{GO\text{-}Id} \mid \mathrm{set2}}}{\mathrm{RECLG}_{\mathrm{GO\text{-}Id} \mid \mathrm{set1}}} = \frac{\sum_{i=1}^{n_2} x_i \Big/ \sum_{i=1}^{m_2} x_i}{\sum_{i=1}^{n_1} x_i \Big/ \sum_{i=1}^{m_1} x_i} \tag{3}$$
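Equations (2) and (3) can be sketched in a few lines once the ECLG and the total number of GO-annotated ESTs are known for each transcriptome. The function names and the numbers below are hypothetical; returning infinity when set 1 has no ESTs under the term is an assumption, chosen to match the "inf" entries described for Additional file 3.

```python
def reclg(eclg_value, total_annotated_ests):
    """Relative EST coverage level: share of GO-annotated ESTs under a term."""
    return eclg_value / total_annotated_ests

def ecrg(eclg_set1, total_set1, eclg_set2, total_set2):
    """EST coverage ratio of a GO term: RECLG of set 2 over RECLG of set 1."""
    r1 = reclg(eclg_set1, total_set1)
    r2 = reclg(eclg_set2, total_set2)
    if r1 == 0:
        # Assumption: report infinity when only set 2 covers the term,
        # and NaN when neither set does.
        return float("inf") if r2 > 0 else float("nan")
    return r2 / r1

# Hypothetical numbers: 20 of 10,000 annotated ESTs in set 1
# versus 90 of 15,000 annotated ESTs in set 2.
ratio = ecrg(20, 10_000, 90, 15_000)
```

With these made-up counts the term covers 0.2% of set 1 and 0.6% of set 2, so the coverage ratio is about 3, which would just reach the default ratio cut-off mentioned later in the text.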

Here we first calculate the ECLG before calculating the ratio, which inverts the order of calculation used in other GHE approaches such as GO-Mapper. This feature is suitable for mining possible differentially expressed genes represented by low abundance ESTs, which are prone to be either omitted or over-weighted if the GO-Mapper approach is directly adopted. GO-Mapper probably performs well in microarray analysis, but it is not sensitive or accurate enough when applied to EST analysis, in which a great portion of "N vs. 0" (N <= 5) and other low abundance tags exist. Under this condition, the GO-Mapper approach would average the insignificant but "highly" ratio-ed genes (for example, 1/0 is mathematically infinite, but is not a significant difference in gene expression) with other significantly ratio-ed genes (e.g. "1000 vs. 6") of the same GO term, yielding many false positives. Conversely, if those insignificant but "highly" ratio-ed genes were pre-filtered by the users, a great loss of information would occur.
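The difference between summing tag counts before taking the ratio and averaging per-gene ratios can be seen in a small numeric sketch. The averaging function below is a deliberately simplified stand-in for a ratio-averaging scheme and should not be read as GO-Mapper's actual algorithm; the function names, counts and library sizes are hypothetical.

```python
import math

def pooled_ratio(counts1, counts2, total1, total2):
    """Sum the tag counts under a GO term first, then take one ratio."""
    return (sum(counts2) / total2) / (sum(counts1) / total1)

def mean_of_gene_ratios(counts1, counts2, total1, total2):
    """Simplified stand-in: average the per-gene expression ratios."""
    ratios = []
    for a, b in zip(counts1, counts2):
        if a == 0:
            # A "1 vs. 0" gene yields an infinite ratio despite weak evidence.
            ratios.append(math.inf if b > 0 else 1.0)
        else:
            ratios.append((b / total2) / (a / total1))
    return sum(ratios) / len(ratios)

# Two genes under one GO term: a "0 vs. 1" gene and a "50 vs. 50" gene,
# in two hypothetical libraries of 1,000 annotated ESTs each.
counts_set1 = [0, 50]
counts_set2 = [1, 50]
pooled = pooled_ratio(counts_set1, counts_set2, 1000, 1000)
averaged = mean_of_gene_ratios(counts_set1, counts_set2, 1000, 1000)
```

In this toy case the single low-abundance tag makes the averaged ratio infinite, while the pooled ratio stays close to 1, which is the behaviour the paragraph above argues for.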

To analyze whether the GO terms are significantly differentially represented between the two transcriptomes in the study, a 2 by 2 contingency table will be constructed for Chi-square test. If the Chi-square test does not meet the empirical criterion, Fisher's Exact test will be used instead. These tests reveal differentially represented GO terms between two GO Profiles. However, additional measures are necessary in order to calculate the global similarity or dissimilarity between the two transcriptomes of interest. To address this, Pearson correlation coefficient is calculated between the two GO profiles to report the global similarities.
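A minimal sketch of the per-term test follows, using only the Python standard library. Several details are assumptions not stated in the text: the 2 by 2 table is taken to be (ESTs under the term, other GO-annotated ESTs) for each transcriptome, the "empirical criterion" is taken to be the common rule that every expected cell count is at least 5, the Chi-square test is computed without continuity correction, and Fisher's Exact test is two-sided. All function names are hypothetical.

```python
import math

def chi_square_p(a, b, c, d):
    """P-value of the Pearson chi-square test (1 d.f.) for a 2x2 table."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 d.f.
    return math.erfc(math.sqrt(chi2 / 2.0))

def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher's Exact test p-value for a 2x2 table."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return math.comb(r1, x) * math.comb(r2, c1 - x) / math.comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def go_term_p_value(eclg1, total1, eclg2, total2, min_expected=5.0):
    """Chi-square test when all expected counts are large enough, else Fisher."""
    a, b = eclg1, total1 - eclg1
    c, d = eclg2, total2 - eclg2
    n = a + b + c + d
    expected = [r * k / n for r in (a + b, c + d) for k in (a + c, b + d)]
    if min(expected) >= min_expected:
        return "chi-square", chi_square_p(a, b, c, d)
    return "fisher", fisher_exact_p(a, b, c, d)

# Hypothetical counts: 10 of 100 ESTs versus 30 of 100 ESTs under a term.
method, p_value = go_term_p_value(10, 100, 30, 100)
```

For very small tables, such as 3 of 4 versus 1 of 4 ESTs, the expected counts fall below the assumed threshold and the sketch switches to Fisher's Exact test.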

Since all the GO terms are sampled during the analysis, the potential issues with multiple testing should be addressed. Within the GO-Diff algorithm, the linear step-up procedure [42] is adopted to adjust the False Discovery Rate (FDR). The algorithm can be fine tuned through parameters including the FDR cut-off defaulting at 0.1, the EST coverage ratio cut-off defaulting at 3, and unwanted GO associations can be excluded by their evidence codes.
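The linear step-up procedure is commonly known as the Benjamini-Hochberg procedure. The sketch below computes adjusted p-values and then applies the two cut-offs named above; treating the ratio cut-off symmetrically (ratio >= 3 or <= 1/3) is an assumption based on the symmetric criterion given in the Table 3 caption. The function names and the toy results are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg (linear step-up) adjusted p-values, in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def select_terms(results, fdr_cutoff=0.1, ratio_cutoff=3.0):
    """Keep GO terms passing both the FDR and the coverage-ratio cut-offs.

    results: dict GO term -> (ECRG, raw p-value)
    """
    terms = list(results)
    adjusted = bh_adjust([results[t][1] for t in terms])
    selected = []
    for term, q in zip(terms, adjusted):
        ratio = results[term][0]
        # Assumption: the ratio cut-off applies in both directions.
        if q <= fdr_cutoff and (ratio >= ratio_cutoff or ratio <= 1.0 / ratio_cutoff):
            selected.append(term)
    return selected

# Hypothetical per-term results: (coverage ratio, raw p-value).
results = {
    "GO:A": (4.0, 0.01),
    "GO:B": (1.2, 0.03),
    "GO:C": (0.2, 0.04),
    "GO:D": (5.0, 0.5),
}
adjusted = bh_adjust([p for _, p in results.values()])
selected = select_terms(results)
```

In this toy input one term fails the ratio cut-off despite a small adjusted p-value, and another fails the FDR cut-off despite a large ratio, so only two terms are reported.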

Results

Extensive EST sequencing projects provide a vast repository of EST information, which can be an alternative resource for gene expression analysis across different biological systems. GO-Diff is the first software to integrate EST-based expression profiles with the GO knowledge database to achieve functional differentiation analysis between transcriptomes. Three comparative transcriptomics analyses were performed to demonstrate GO-Diff's data mining utilities and software processing capabilities. Where possible, GO-Diff results were validated against existing biological knowledge.

Functional differences between mouse oocyte and preimplantation embryos – intra-species comparative transcriptomics

To study the functional differences among transcriptomes from the same species, we applied GO-Diff to analyze dbEST libraries of mouse-unfertilized eggs and different developmentally staged mouse preimplantation embryos.

Using GO-Diff, four different embryonic staged libraries were pooled and compared to that of unfertilized eggs in order to reveal transcriptome dynamics and extract functional and developmental perspectives between oocytes and early embryos. In this study, 121 differentially represented GO terms were revealed under the criteria of a false discovery rate of 0.1 and an EST coverage ratio of at least 1.5-fold. Results are summarized in Table 1 and details can be found in Additional files 1 and 2.

Table 1. GO terms revealed by comparing mouse oocyte and preimplantation embryonic transcriptomes. F: molecular function. P: biological process. C: cellular component.

Additional File 1. Full list of the differentially represented GO terms between transcriptomes of mouse oocyte and preimplantation embryos.

Additional File 2. Full list of the Unigene clusters associated with the differentially represented GO terms in transcriptomes of mouse oocyte and preimplantation embryos. Description: The four columns of numbers from left to right are: tag number of the Unigene cluster and the relative abundance of the Unigene cluster in the oocyte transcriptome and the preimplantation embryos respectively.

The findings by GO-Diff agreed well with previous studies. The absolute rate of protein synthesis increased during preimplantation development from the oocyte to 8-cell stage [43], and this biological process was found to be up-regulated 3-fold in this study. Cellular components involved in protein synthesis, e.g. the ribosome-related GO categories, were simultaneously enriched, also consistent with previous findings [44]. For profiling of preimplantation mouse development, our GO-Diff analyses were highly consistent with a recent microarray study [45] that used the EASE program [21]. Our analyses also confirmed assumptions or observations that had not been fully investigated, thereby providing some clues for new discoveries. As shown in Table 2 and in comprehensive detail in Additional file 2, GO-Diff revealed the enrichment of transcripts encoding cathepsins localized to the lysosome during development, indicating active protein degradation in mouse preimplantation embryos [46-48].

Table 2. Cellular component "lysosome" – GO-Diff analysis of dbEST libraries of unfertilized egg and embryos

Representation of "DNA damage response" related GO terms in mouse oocyte and preimplantation embryos – meta-analysis to test a specific hypothesis

Consistent with our GO-Diff results, the recent microarray study of preimplantation embryos [45] observed an over-representation of transcripts involved in DNA damage response and DNA repair in oocytes relative to the preimplantation stages, and suggested that this over-representation reflected the oocyte's possible response to selective pressures to ensure genomic integrity. However, the over-representation could well be a data analysis artifact, because both over-representation in oocytes and under-representation of those transcripts in embryonic cells would, on the surface, yield similar over-representation results. Under this circumstance, comparisons with other tissues could provide additional evidence and even definitive answers. With the GO-Diff knowledge base integrating various types of dbEST libraries, this kind of analysis is straightforward. Pairwise comparative analyses of dbEST libraries of eight other tissues with those of oocyte and preimplantation embryos yielded many differentially represented GO terms related to DNA damage response (Table 3). With these cross-tissue examinations, we observed that transcripts associated with such processes were indeed highly represented in both the preimplantation embryos and the oocytes. With a number of transcriptomes as references, oocytes showed more pronounced transcription under those GO terms than all other samples analyzed, including embryos, leading to the conclusion that GO terms in the area of DNA damage response and DNA repair are more highly represented in oocytes.

Table 3. "DNA damage response" related GO terms that are differentially represented between oocyte and preimplantation embryos with respect to common reference tissues. Pairwise comparisons of oocyte and preimplantation dbEST libraries to eight reference libraries revealed five GO terms related to the biological process of "DNA damage response": GO:0006974 ("response to DNA damage stimulus"), GO:0042770 ("DNA damage response, signal transduction"), GO:0000077 ("DNA damage checkpoint"), GO:0003684 ("damaged DNA binding") and GO:0006281 ("DNA repair"). GO terms with EST Coverage Ratio >= 1.5 or <= 1/1.5 and meeting a corrected P-value cut-off of 0.1 were selected.

Preliminary characterization of functional differences between human and mouse liver – inter-species comparative transcriptomics

It is interesting to explore how transcriptome variations relate to physiological differences between species. Comparing transcriptomes from a functional perspective may help explore physiological diversity. In this study, we explored the functional differences of liver transcriptomes between human and mouse. Inter-species transcriptome comparison is not as straightforward as intra-species comparison due to the unequal GO annotation coverage between species. To reduce false positives caused by biased GO annotation, we incorporated multiple GO-Diff results into a meta-analysis using GO associations both from the background database and from BLAST searches. Following this strategy, we compared a series of dbEST libraries of human and mouse liver as shown in Table 4. In total, 261 GO terms were found to be differentially represented between human and mouse liver (Additional file 3).

Table 4. The pairwise comparative analysis of the relevant mouse and human dbEST libraries

Additional File 3. List of meta-analysis-supported GO terms identified by GO-Diff in the comparison between human and mouse liver. Description: The number in the table is the EST coverage ratio (human/mouse) of the GO term. It is represented by "inf" when the EST coverage level in mouse is zero, and by "1" when no significant (ECRG >= 3, FDR <= 0.1) differences are found in the two sets of dbEST libraries.

Currently, comparative transcriptomic analyses between mouse and human are challenging both experimentally and statistically. Therefore it is difficult to validate GO-Diff's results in this regard. Nevertheless, the current results may provide some evidence relating to the physiological divergence between human and mouse. In liver, the GO categories related to aerobic metabolism are represented at higher levels in the mouse, such as "mitochondrion", "hemoglobin complex", "proton-transporting ATP synthase complex (sensu Eukarya)", "ATP synthesis coupled proton transport", "oxidoreductase activity, acting on peroxide as acceptor" and "oxygen transporter activity", among others. These results may simply reflect the faster metabolic rate in mouse due to the body mass effect. Our findings may provide a gene expression perspective for exploring relationships between body mass and standard metabolic rate.

Quantitative and qualitative estimation of GO-Diff performance

Neither quantitative benchmark data sets nor other similar tools are currently available to accurately evaluate GO-Diff performance. We selected 16 unbiased EST libraries from human and mouse brain and liver, and ran GO-Diff to determine the consistency and reproducibility of its results (for the detailed method see Additional file 4). Tables 5 and 6 show the match-up schemes of the pairwise transcriptome comparisons, and Table 7 shows the consistencies of the comparison results within and across species. Because small EST libraries contain fewer genes, a portion of the differentially represented GO terms identified in large-library comparisons may not be detected in smaller-library comparisons. This may reduce the observed consistency between the small library comparisons. To reduce this artifact, results from low-volume library comparisons were paired solely with that of the largest library comparison of that group for evaluation, as shown in Table 7.

Additional File 4. Procedures to evaluate GO-Diff consistency.

Table 5. Human EST libraries and their match-up grid in consistency test of GO-Diff. Libraries are listed by their IDs, and each pairwise comparison is numbered. The evaluation criteria are also shown.

Table 6. Mouse EST libraries and their match-up grid in consistency test of GO-Diff

Table 7. Consistency evaluation of GO-Diff results from the above comparisons in pairs. GO terms identified as differentially represented in both sides are listed as 'Identical', those in the same GO paths are listed as 'Parent-Child', and those that appeared in only one side are listed as 'Different'. Consistency rate is calculated by ('Identical'+'Parent-Child')/('Identical'+'Parent-Child'+'Different').
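The consistency rate defined in the Table 7 caption can be sketched as follows. The exact counting rules are given in Additional file 4 and are not reproduced here, so the classification below is an assumption: a term found on one side only is counted as 'Parent-Child' when an ancestor or descendant of it was found on the other side. The function names and the toy ontology are hypothetical.

```python
def ancestors(term, parents):
    """Return the term itself plus all of its ancestors in the GO DAG."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen

def classify(terms_a, terms_b, parents):
    """Count 'Identical', 'Parent-Child' and 'Different' GO terms."""
    counts = {"Identical": 0, "Parent-Child": 0, "Different": 0}
    for term in terms_a | terms_b:
        if term in terms_a and term in terms_b:
            counts["Identical"] += 1
            continue
        other = terms_b if term in terms_a else terms_a
        # Related if one term lies on the other's path to the root.
        related = any(
            term in ancestors(o, parents) or o in ancestors(term, parents)
            for o in other
        )
        counts["Parent-Child" if related else "Different"] += 1
    return counts

def consistency_rate(counts):
    """('Identical' + 'Parent-Child') / all classified terms."""
    agree = counts["Identical"] + counts["Parent-Child"]
    return agree / (agree + counts["Different"])

# Hypothetical results of two comparisons; "c" is a child of "p".
parents = {"c": ["p"]}
counts = classify({"x", "c", "y"}, {"x", "p", "z"}, parents)
rate = consistency_rate(counts)
```

In this toy example one term is shared, two are linked through a parent-child relationship, and two appear on one side only, giving a rate of 0.6.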

The average consistencies of human-human, mouse-mouse and human-mouse comparisons were 60.9%, 69.2% and 47.1% respectively. Recent studies showed that 18%-94% of genes could be differentially expressed among individuals of the same species [49-51], which adversely affected the results of consistency test. Detailed discussion of intra- and inter- species expression variations falls out of the scope of the current work. Nevertheless, even in the contexts of the high intra-species variation and even higher variation between species, GO-Diff can generate repeatable and reliable results.

Discussion

GO-Diff is a knowledge-based data mining method, and its implementation analyzes EST transcription maps from a functional perspective, building upon the biological domain knowledge encapsulated by GO terms. In our case studies of mouse preimplantation development and human/mouse liver comparison, GO-Diff revealed many differentially represented GO categories, some of which were consistent with previous findings, while others may suggest future follow-up studies.

When exploring biological mechanisms of non-model organisms or un-profiled tissues, EST analysis is usually the first step to systematically study gene constitution and gene expression. Given that GO terms are designed to be species independent, GO-Diff can facilitate the comparison of the transcriptomes of new species according to molecular function, biological process and cellular component. In addition, the GO-Diff framework can quickly establish an analysis process for whole-transcriptome comparison of the transcriptome of interest against a large repository of pre-sorted transcriptomes within the knowledge base, which span different species or different tissue origins. Recently, it has been suggested that many tissue-specific differences in gene expression are unique to only one population and thus are unlikely to contribute to fundamental differences between tissue types [52]. In this regard, the GO-Diff approach offers the benefit of quickly constructing several transcriptomes of the same type and allowing global analysis of different populations of the same tissue. The comparative analysis of these transcriptomes against various reference transcriptomes can weed out population specific sampling artifacts. This kind of analysis would be difficult to perform across different platforms with conventional microarray or SAGE technologies when multiple transcriptomes are profiled and analyzed simultaneously and comparatively.

EST sequencing is not as high throughput as array technology. Based on Fisher's Exact test, Table 8 lists the tag counts required to achieve 95% confidence in determining differential expression. In gene-based differential expression analysis, the number in the table is the tag count of a given gene; in GO-based analysis, it is the ECLG. For libraries containing a few thousand tags, a tag count ratio of at least 0 vs. 6 is required to call a difference significant. The criterion is even more restrictive when multiple testing is taken into account. As a result, only a few highly expressed genes in the libraries can be evaluated, rendering the GO over-representation analysis unrealistic. GO-Diff attempts to solve this problem with the following features: it incorporates the entire body of expression information; it optionally combines multiple libraries of the same kind; and it adds up the tag counts of the same GO term before calculating the ratio, the EST Coverage Ratio of a GO term (ECRG), instead of averaging the expression ratios of a GO term. This makes it more sensitive and accurate in detecting differential GO terms represented by low-abundance ESTs.
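The difference between pooling tag counts and averaging per-gene ratios can be illustrated with a minimal Python sketch. It is not taken from the GO-Diff source: the function names, the normalisation by library size and the tag counts are assumptions made only for illustration.

```python
def pooled_ratio(counts_a, counts_b, size_a, size_b):
    """Ratio of library-size-normalised tag totals for one GO term (A over B)."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    if total_b == 0:
        return float("inf") if total_a else float("nan")
    return (total_a * size_b) / (total_b * size_a)


def mean_of_gene_ratios(counts_a, counts_b, size_a, size_b):
    """Average of per-gene normalised ratios; genes with no tags in B are skipped."""
    ratios = [(a * size_b) / (b * size_a)
              for a, b in zip(counts_a, counts_b) if b > 0]
    return sum(ratios) / len(ratios) if ratios else float("nan")


# Made-up tag counts for four genes annotated to the same GO term.
counts_a = [1, 2, 0, 3]
counts_b = [0, 1, 1, 0]
size_a = size_b = 2000  # total ESTs per library (illustrative)

pooled = pooled_ratio(counts_a, counts_b, size_a, size_b)
averaged = mean_of_gene_ratios(counts_a, counts_b, size_a, size_b)
print(f"pooled ratio: {pooled:.2f}, mean of per-gene ratios: {averaged:.2f}")
```

In this toy case, two of the four genes have no tags in library B and drop out of the per-gene average, whereas the pooled ratio still uses all six tags observed in library A.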

Table 8. The minimum tags in the compared libraries required for a 'significant' evaluation based on the Fisher's Exact test calculation. The number of tags at the top of each column indicates the total number of ESTs in each library.
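A threshold of this kind can be approximated with the following Python sketch. It assumes a two-sided Fisher's Exact test on a 2x2 table of tag counts and a significance level of 0.05; the function names are hypothetical and the calculation behind Table 8 may differ in detail.

```python
from math import comb


def fisher_exact_two_sided(x, y, n1, n2):
    """Two-sided Fisher's exact p-value for x of n1 tags versus y of n2 tags."""
    total = x + y
    denom = comb(n1 + n2, total)

    def prob(k):
        # Probability that k of the `total` tags fall in library 1.
        return comb(n1, k) * comb(n2, total - k) / denom

    p_obs = prob(x)
    lo, hi = max(0, total - n2), min(total, n1)
    # Sum over all tables that are at most as likely as the observed one.
    p = sum(q for q in (prob(k) for k in range(lo, hi + 1))
            if q <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def min_tags_vs_zero(n1, n2, alpha=0.05, limit=100):
    """Smallest y for which '0 vs. y' is significant at level alpha, or None."""
    for y in range(1, limit + 1):
        if fisher_exact_two_sided(0, y, n1, n2) < alpha:
            return y
    return None


threshold = min_tags_vs_zero(2000, 2000)
print(threshold)
```

Under these assumptions, two libraries of 2,000 tags each give a threshold of 6, in line with the "0 vs. 6" ratio quoted above for libraries of a few thousand tags.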

Genes commonly play multiple biological roles in different tissues or different species. This can be a source of false positives, in which physiologically irrelevant GO terms make it into the final analysis report. For example, the Unigene cluster Mm.5098 is a component of the transcriptional repressor complex (GO:0017053) and also plays a role in lung development (GO:0030324). In the case study comparing oocytes and preimplantation embryos, both GO terms were found to be differentially represented. Obviously, "role in lung development" is a false positive in this context. This phenomenon appears more frequently when a highly expressed gene dominates several GO terms. GO-Diff tries to address these issues by providing detailed information on the significant GO terms for manual verification and follow-up analysis. First, the expression levels of the Unigene clusters associated with the significant GO terms are displayed, which allows researchers to find significant GO terms that may have been dominated by the same Unigene cluster. Once identified, GO terms that are dominated by spurious gene expression artifacts and are obviously irrelevant to the particular research focus can be excluded. Second, GO-Diff produces additional HTML-formatted outputs with links to AmiGO [53] and the NCBI Unigene database to gather relevant information for additional analysis. Third, the graphical user interface facilitates interactive use of the program. In this regard, GO-Diff provides not only a high-throughput processing method but also an iterative data analysis platform that keeps the researcher closely involved.
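The manual check described above could be supported by a small filter such as the following Python sketch. It is only an illustration and not part of GO-Diff: the 0.8 cut-off, the tag counts and the cluster names other than Mm.5098 are hypothetical.

```python
def dominated_terms(term_to_cluster_counts, threshold=0.8):
    """Return {go_term: (cluster, share)} for terms where a single Unigene
    cluster contributes at least `threshold` of the term's tag count."""
    flagged = {}
    for term, counts in term_to_cluster_counts.items():
        total = sum(counts.values())
        if total == 0:
            continue
        cluster, top = max(counts.items(), key=lambda item: item[1])
        share = top / total
        if share >= threshold:
            flagged[term] = (cluster, share)
    return flagged


# Made-up tag counts; only Mm.5098 and the two GO identifiers come from the text.
example = {
    "GO:0030324": {"Mm.5098": 18, "cluster_A": 2},
    "GO:0017053": {"Mm.5098": 18, "cluster_B": 12, "cluster_C": 10},
}

flagged = dominated_terms(example)
for term, (cluster, share) in flagged.items():
    print(f"{term} is dominated by {cluster} ({share:.0%} of tags)")
```

A term flagged in this way is not necessarily a false positive; the flag only marks it as a candidate for the manual inspection described above.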

Inter-species comparisons are essential and increasingly in demand as genomes and transcriptomes of many organisms from various evolutionary lineages become available. However, inter-species transcriptome comparisons lack a common reference set. For transcriptomes of the same species, a set of common genes or transcripts serves as the reference, and the expression level of each reference sequence can be evaluated uniformly across the experimental samples. Transcriptomes from different species usually do not share the same set of reference sequences, which makes the comparisons methodologically more challenging. One solution is to employ a set of orthologous genes from the compared species as a reference set, as implemented explicitly or implicitly in the methods of [54-58]. This approach suffers by design from some limitations, especially for moderately related species and for EST analysis. In moderately related species, many orthologs are no longer in a simple one-to-one relation, and when alternative splicing and EST assembly errors are taken into account, a common unique-transcript set between two species becomes very difficult to establish. GO-Diff makes a first attempt to use the GO structure as the common reference set to organize transcripts into functional groups and perform meaningful comparisons.
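The idea of using GO terms as the shared reference set can be sketched in Python as follows. The transcript identifiers, tag counts and annotations are invented, and the sketch ignores the propagation of counts to parent terms in the GO hierarchy, which a full implementation may perform.

```python
from collections import defaultdict


def go_term_counts(tag_counts, annotations):
    """Sum EST tag counts per GO term; a transcript annotated to several
    terms contributes its count to each of them."""
    totals = defaultdict(int)
    for transcript, count in tag_counts.items():
        for term in annotations.get(transcript, ()):
            totals[term] += count
    return dict(totals)


# Hypothetical transcripts from two species, annotated to the same GO terms.
annotations = {
    "human_tx1": ["GO:0017053"],
    "human_tx2": ["GO:0017053", "GO:0030324"],
    "mouse_tx1": ["GO:0030324"],
    "mouse_tx2": ["GO:0017053"],
}
human_counts = {"human_tx1": 4, "human_tx2": 1}
mouse_counts = {"mouse_tx1": 2, "mouse_tx2": 3}

human_terms = go_term_counts(human_counts, annotations)
mouse_terms = go_term_counts(mouse_counts, annotations)

# The GO identifiers, not the transcripts, are the keys shared by both species.
shared_terms = sorted(set(human_terms) & set(mouse_terms))
for term in shared_terms:
    print(term, human_terms[term], mouse_terms[term])
```

No ortholog mapping between the two transcript sets is needed here, because the comparison is made per GO term.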

The current GO-Diff implementation focuses on leveraging EST resources for comparative transcriptomics. However, because GO-Diff compares GO term representations rather than expression levels directly to interpret biology, the algorithm is flexible and can be further applied to SAGE data analysis. With the rapid accumulation of different gene expression resources in the public domain, GO-Diff can have broad applications and can serve as a knowledge-driven data mining platform for comparative transcriptomics analysis.

Conclusion

GO-Diff is the first software to mine functional differentiation between EST-based transcriptomes by integrating EST profiles with GO knowledge databases. It efficiently and effectively translates EST frequencies in transcriptomes of various tissues, or of the same tissue across different species, into EST Coverage Ratios of GO Terms. Each ratio is then tested for statistical significance to uncover differentially represented GO terms between the transcriptomes, from which functional differences are inferred. With the rapid accumulation of different EST resources in the public domain, GO-Diff can have broad applications and can serve as a knowledge-driven data mining platform for comparative transcriptomic analysis.

Abbreviations

EST: Expressed Sequence Tag; GO: Gene Ontology; SAGE: Serial Analysis of Gene Expression; FDR: False Discovery Rate; ECLG: EST Coverage Level of a GO Term; RECLG: Relative EST Coverage Level of a GO Term; ECRG: EST Coverage Ratio of a GO Term; GHE: Evaluate GO terms utilizing Holistic Expression information

Authors' contributions

ZC, WW and LC designed and developed the methodology. ZC programmed the software. ZC, WW and XBL carried out the transcriptome comparisons and analysis. ZC, XBL, JJL and LC wrote the manuscript. JJL tested the software.

Availability and requirements

- Project name: GO-Diff

- Project home page: http://www.fishgenome.org/bioinfo/godiff/index.htm

- Operating system(s): Linux, Unix (no GUI)

- Programming language: Perl

- Other requirements: X Window System for GUI, Gtk and Perl-Gtk for x_godiff.pl

- License: GPL

- Restrictions to use by non-academics: on request

Acknowledgements

This work is supported in part by grants from the Natural Science Foundation of China to LC (30330080) and the 973 Program to LC (2004CB117404). We would like to thank NCBI, the GOA project and the GO Consortium for their data sets. We would also like to thank Zhenjiang Ning for programming assistance, Hua Ye for preparing the software manual and Sheng Zhu for the Web support.

References

  1. Kanapin A, Batalov S, Davis MJ, Gough J, Grimmond S, Kawaji H, Magrane M, Matsuda H, Schonbach C, Teasdale RD, Yuan Z: Mouse proteome analysis.

    Genome Res 2003, 13:1335-1344.

  2. Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray.

    Science 1995, 270:467-470.

  3. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW: Serial analysis of gene expression.

    Science 1995, 270:484-487.

  4. de Lichtenberg U, Jensen LJ, Brunak S, Bork P: Dynamic complex formation during the yeast cell cycle.

    Science 2005, 307:724-727.

  5. Peng S, Xu Q, Ling XB, Peng X, Du W, Chen L: Molecular classification of cancer types from microarray data using the combination of genetic algorithms and support vector machines.

    FEBS Lett 2003, 555:358-362.

  6. Liu JJ, Cutler G, Li W, Pan Z, Peng S, Hoey T, Chen L, Ling XB: Multiclass cancer classification and biomarker discovery using GA-based algorithms.

    Bioinformatics 2005, 21:2691-2697.

  7. Gene Ontology Home Page [http://www.geneontology.org]

  8. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

    Nat Genet 2000, 25:25-29.

  9. Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C, Richter J, Rubin GM, Blake JA, Bult C, Dolan M, Drabkin H, Eppig JT, Hill DP, Ni L, Ringwald M, Balakrishnan R, Cherry JM, Christie KR, Costanzo MC, Dwight SS, Engel S, Fisk DG, Hirschman JE, Hong EL, Nash RS, Sethuraman A, Theesfeld CL, Botstein D, Dolinski K, Feierbach B, Berardini T, Mundodi S, Rhee SY, Apweiler R, Barrell D, Camon E, Dimmer E, Lee V, Chisholm R, Gaudet P, Kibbe W, Kishore R, Schwarz EM, Sternberg P, Gwinn M, Hannick L, Wortman J, Berriman M, Wood V, de la Cruz N, Tonellato P, Jaiswal P, Seigfried T, White R: The Gene Ontology (GO) database and informatics resource.

    Nucleic Acids Res 2004, 32:D258-261.

  10. Pouliot Y, Gao J, Su QJ, Liu GG, Ling XB: DIAN: a novel algorithm for genome ontological classification.

    Genome Res 2001, 11:1766-1779.

  11. Martin DM, Berriman M, Barton GJ: GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes.

    BMC Bioinformatics 2004, 5:178.

  12. Koski LB, Gray MW, Lang BF, Burger G: AutoFACT: an automatic functional annotation and classification tool.

    BMC Bioinformatics 2005, 6:151.

  13. Camon E, Magrane M, Barrell D, Binns D, Fleischmann W, Kersey P, Mulder N, Oinn T, Maslen J, Cox A, Apweiler R: The Gene Ontology Annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro.

    Genome Res 2003, 13:662-672.

  14. Quevillon E, Silventoinen V, Pillai S, Harte N, Mulder N, Apweiler R, Lopez R: InterProScan: protein domains identifier.

    Nucleic Acids Res 2005, 33:W116-120.

  15. Zhong S, Li C, Wong WH: ChipInfo: Software for extracting gene annotation and gene ontology information for microarray analysis.

    Nucleic Acids Res 2003, 31:3483-3486.

  16. Draghici S, Khatri P, Martins RP, Ostermeier GC, Krawetz SA: Global functional profiling of gene expression.

    Genomics 2003, 81:98-104.

  17. Doniger SW, Salomonis N, Dahlquist KD, Vranizan K, Lawlor SC, Conklin BR: MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data.

    Genome Biol 2003, 4:R7.

  18. Zeeberg BR, Feng W, Wang G, Wang MD, Fojo AT, Sunshine M, Narasimhan S, Kane DW, Reinhold WC, Lababidi S, Bussey KJ, Riss J, Barrett JC, Weinstein JN: GoMiner: a resource for biological interpretation of genomic and proteomic data.

    Genome Biol 2003, 4:R28.

  19. Al-Shahrour F, Diaz-Uriarte R, Dopazo J: FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes.

    Bioinformatics 2004, 20:578-580.

  20. Beissbarth T, Speed TP: GOstat: find statistically overrepresented Gene Ontologies within a group of genes.

    Bioinformatics 2004, 20:1464-1465.

  21. Hosack DA, Dennis G Jr, Sherman BT, Lane HC, Lempicki RA: Identifying biological themes within lists of genes with EASE.

    Genome Biol 2003, 4:R70.

  22. Smid M, Dorssers LC: GO-Mapper: functional analysis of gene expression data using the expression level as a score to evaluate Gene Ontology terms.

    Bioinformatics 2004, 20:2618-2625.

  23. Volinia S, Evangelisti R, Francioso F, Arcelli D, Carella M, Gasparini P: GOAL: automated Gene Ontology analysis of expression profiles.

    Nucleic Acids Res 2004, 32:W492-499.

  24. Ben-Shaul Y, Bergman H, Soreq H: Identifying subtle interrelated changes in functional gene categories using continuous measures of gene expression.

    Bioinformatics 2005, 21:1129-1137.

  25. Boguski MS, Lowe TM, Tolstoshev CM: dbEST – database for "expressed sequence tags".

    Nat Genet 1993, 4:332-333.

  26. Digital Differential Display [http://www.ncbi.nlm.nih.gov/UniGene/info_ddd.html]

  27. Scheurle D, DeYoung MP, Binninger DM, Page H, Jahanzeb M, Narayanan R: Cancer gene discovery using digital differential display.

    Cancer Res 2000, 60:4037-4043.

  28. cDNA xProfiler [http://cgap.nci.nih.gov/Tissues/xProfiler]

  29. cDNA Digital Gene Expression Displayer [http://cgap.nci.nih.gov/Tissues/GXS]

  30. Wang J, Liang P: DigiNorthern, digital expression analysis of query genes based on ESTs.

    Bioinformatics 2003, 19:653-654.

  31. Cai L, Huang H, Blackshaw S, Liu JS, Cepko C, Wong WH: Clustering analysis of SAGE data using a Poisson approach.

    Genome Biol 2004, 5:R51.

  32. Man MZ, Wang X, Wang Y: POWER_SAGE: comparing statistical tests for SAGE experiments.

    Bioinformatics 2000, 16:953-959.

  33. Gene Ontology OBO file [http://www.geneontology.org/ontology/gene_ontology.obo]

  34. Gene-GO mapping and Gene-Unigene mappings [ftp://ftp.ncbi.nih.gov/gene/DATA]

  35. GOA Uniprot-GO, Uniprot-Gene mappings [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa]

  36. Unigene FTP site [ftp://ftp.ncbi.nih.gov/repository/UniGene/]

  37. Huang X, Madan A: CAP3: A DNA sequence assembly program.

    Genome Res 1999, 9:868-877.

  38. Phrap [http://www.phrap.org/]

  39. TIGR Assembler [http://www.tigr.org/software/assembler/]

  40. GoPipe [http://www.fishgenome.org/bioinfo/gopipe/index.php]

  41. Chen Z, Xue C, Zhu SX, Zhou F, Ling XB, Liu G, Chen L: GoPipe: Streamlined Gene Ontology Annotation for Batch Anonymous Sequences with Statistics.

    Prog Biochem Biophys 2005, 32:187-191.

  42. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing.

    J Roy Stat Soc B 1995, 57:289-300.

  43. Schultz RM, Letourneau GE, Wassarman PM: Program of early development in the mammal: changes in patterns and absolute rates of tubulin and total protein synthesis during oogenesis and early embryogenesis in the mouse.

    Dev Biol 1979, 68:341-359.

  44. LaMarca MJ, Wassarman PM: Program of early development in the mammal: changes in absolute rates of synthesis of ribosomal proteins during oogenesis and early embryogenesis in the mouse.

    Dev Biol 1979, 73:103-119.

  45. Zeng F, Baldwin DA, Schultz RM: Transcript profiling during preimplantation mouse development.

    Dev Biol 2004, 272:483-496.

  46. Stanton JL, Green DP: Meta-analysis of gene expression in mouse preimplantation embryo development.

    Mol Hum Reprod 2001, 7:545-552.

  47. Hamatani T, Carter MG, Sharov AA, Ko MS: Dynamics of global gene expression changes during mouse preimplantation development.

    Dev Cell 2004, 6:117-131.

  48. Merz EA, Brinster RL, Brunner S, Chen HY: Protein degradation during preimplantation development of the mouse.

    J Reprod Fertil 1981, 61:415-418.

  49. Cheung VG, Conlin LK, Weber TM, Arcaro M, Jen KY, Morley M, Spielman RS: Natural variation in human gene expression assessed in lymphoblastoid cells.

    Nat Genet 2003, 33:422-425.

  50. Oleksiak MF, Churchill GA, Crawford DL: Variation in gene expression within and among natural populations.

    Nat Genet 2002, 32:261-266.

  51. Oleksiak MF, Roach JL, Crawford DL: Natural variation in cardiac metabolism and gene expression in Fundulus heteroclitus.

    Nat Genet 2005, 37:67-72.

  52. Whitehead A, Crawford DL: Variation in tissue-specific gene expression among natural populations.

    Genome Biol 2005, 6:R13.

  53. AmiGO [http://www.godatabase.org/cgi-bin/amigo/go.cgi]

  54. Zhou XJ, Gibson G: Cross-species comparison of genome-wide expression patterns.

    Genome Biol 2004, 5:232.

  55. Bergmann S, Ihmels J, Barkai N: Similarities and differences in genome-wide expression data of six organisms.

    PLoS Biol 2004, 2:E9.

  56. Rifkin SA, Kim J, White KP: Evolution of gene expression in the Drosophila melanogaster subgroup.

    Nat Genet 2003, 33:138-144.

  57. McCarroll SA, Murphy CT, Zou S, Pletcher SD, Chin CS, Jan YN, Kenyon C, Bargmann CI, Li H: Comparing genomic expression patterns across species identifies shared transcriptional profile in aging.

    Nat Genet 2004, 36:197-204.

  58. Caceres M, Lachuer J, Zapala MA, Redmond JC, Kudo L, Geschwind DH, Lockhart DJ, Preuss TM, Barlow C: Elevated gene expression levels distinguish human from non-human primate brains.

    Proc Natl Acad Sci U S A 2003, 100:13030-13035.