Abstract
Background
The search for enriched (aka overrepresented or enhanced) ontology terms in a list of genes obtained from microarray experiments is becoming a standard procedure for a systemlevel analysis. This procedure tries to summarize the information focussing on classification designs such as Gene Ontology, KEGG pathways, and so on, instead of focussing on individual genes. Although it is well known in statistics that association and significance are distinct concepts, only the former approach has been used to deal with the ontology term enrichment problem.
Results
BayGO implements a Bayesian approach to search for enriched terms from microarray data. The R sourcecode is freely available at http://blasto.iq.usp.br/~tkoide/BayGO webcite in three versions: Linux, which can be easily incorporated into preexistent pipelines; Windows, to be controlled interactively; and as a webtool. The software was validated using a bacterial heat shock response dataset, since this stress triggers known systemlevel responses.
Conclusion
The Bayesian model accounts for the fact that, eventually, not all the genes from a given category are observable in microarray data due to low intensity signal, quality filters, genes that were not spotted and so on. Moreover, BayGO allows one to measure the statistical association between generic ontology terms and differential expression, instead of working only with the common significance analysis.
Background
A systems biology approach which is becoming increasingly used in microarray data analysis is the search for enriched (aka overrepresented or enhanced) terms in a list of interesting genes. This kind of approach tries to disclose the biological meaning behind the massive amount of data derived from highthroughput techniques [1,2]. By translating the results into a more humanfriendly output, the search for overrepresented terms can reveal pathway connections to track biological processes, helping the biologists to build systemlevel hypotheses.
The problem of ontology term enrichment is generally treated as the test for terms in a gene list that are present in higher numbers than it would be expected only by chance. The terms analyzed are generally derived from an ontology or a classification design. Many genome sequencing projects define their own organismspecific gene classification but recently, we are witnessing an increase in the use of general and standardized vocabularies and classification procedures, such as the one proposed by The Gene Ontology Consortium [3] or by the KEGG database [4].
There are many software packages and web sites addressing the ontology term enrichment problem and it is difficult to acknowledge all of them. The most commonly known are those listed in the Gene Ontology [5] website [6], such as: OntoExpress [7], GeneMerge [8], FuncAssociate [9], FatiGO [10], GOstat [11], GOArray [12], GO::TermFinder [13], THEA [14] and OntologyTraverser [15]; but there are other options such as LACK [16], which deals with generic lexical bias. Recently, more than ten different tools were compared and reviewed in [17].
Despite several options, there is no software that searches for enriched ontology terms using a Bayesian statistical framework. Moreover, to our knowledge, all the software packages available are based on the hypothesis test paradigm. We believe that, with the aim of defining enriched terms, the measure of statistical association can be an informative alternative or a useful complement to the usual statistical significance. Thus, our original contributions in this work are: (i) to provide a software that measures the statistical association between differential gene expression and a given ontology term and (ii) to present a statistical model for the enrichment problem that takes into account the realistic fact that, sometimes, not all the genes previously known as related to a given property are in fact observed.
We provide a mathematical sourcecode that can be easily accommodated into preexistent software packages or pipelines, in addition to a version for Windows that can be interactively controlled and a webtool that supports organisms that share special local interest and are neglected by mainstream tools.
Implementation
Measure of statistical association
The genes in a microarray dataset can be organized using a classification scheme such as GO [3], KEGG [4] or an organismspecific categorization. Let i be the list of genes related to the term(s) in focus. Let j be the list of genes related to term(s) different from the term(s) i. The dataset is described by a 2 × 3 contingency table:
Table 1. Gene Ontology terms considered significantly enriched (P < 0.05) by the Bayesian approach. The GO terms marked with an asterisk are all those considered significant (pvalue < 0.05) by the Frequentist approach. G is the gamma measure of statistical association and G90% is its 90% credibility interval ("errorbar")
where: X_{ij }is the number of differentially expressed genes that are iexclusive, N_{ij }is the total number of genes that are iexclusive, X_{ij }is the number of differentially expressed genes belonging to the i and j intersection, N_{ij }is the total number of genes belonging to the i and j intersection, X_{ji }is the number of differentially expressed genes that are not related to the term i (thus jexclusive) and N_{ji }is the total number of genes that are not related to the term i (thus jexclusive). Note that a given gene must be counted in one and only one of these cases and that summing over all N. yields the total number of considered genes.
In 1954, Goodman and Kruskal wrote a classical work discussing several issues on measures of association for contingency tables [18]. We choose to report the Goodman and Kruskal's gamma [19] since it is symmetric and normalized between 1 and 1. The gamma measure (G) is a 2 × 3 extension of the known 2 × 2 Yule's Q:
G = (p  q)/(p + q) (1)
where p = X_{ij}(N_{ij } X_{ij }+ N_{ji } X_{ji}) + X_{ij}(N_{ji } X_{ji}) and q = X_{ji}(N_{ij } X_{ij }+ N_{ij } X_{ij}) + X_{ij}(N_{ij } X_{ij}). G values near 1 indicate that the property described by the term(s) in focus have an important role in the biological phenomena studied. G values below or near zero have no relevant biological interpretation since they mean the absence of association with differential gene expression.
Probabilistic model and bayesian inference
Focussing on a given term(s) and using the same notation from the previous subsection, let x_{ij }be the number of differentially expressed genes observed that are iexclusive, n_{ij }be the total number of observed genes that are iexclusive, x_{ij }is the number of observed differentially expressed genes belonging to the i and j intersection, n_{ij }is the total number of observed genes belonging to the i and j intersection, x_{ji }is the number of differentially expressed genes observed that are not related to the term i and n_{ji }is the total number of genes that are not related to the term i. Note that N. ≥ n. ≥ x. for any subindex, represented by the "dot".
It is possible to realize that x.X.,N.,n. follows a hypergeometric distribution [20]:
We will omit N. and n. from the notation since they are always known numbers. Using a noninformative uniform as the a priori distribution for X. and the hypergeometric likelihood above, it is possible to show that the a posteriori distribution of (X.  x.)x. is a BetaBinomial distribution [20]:
where B is the beta special function.
Therefore, the 2 × 3 contingency table can be rewritten as:
The degree of association determined in Eq.1 is now a function of the random variables X_{ij}, X_{ij }and X_{ji}, described by the BetaBinomial a posteriori probability functions.
In computational terms, it means that several independent 2 × 3 contingency tables are built using random variates drawn according to their respective BetaBinomial probability distributions. The empirical probability distribution of the association level is obtained by applying Eq.1 to each of these random contingency tables. The significance analysis and the construction of credibility intervals ("errorbars") are based on this empirical distribution of G.
We have used 90% credibility intervals as default, but this parameter can be easily adjusted in the sourcecode. The credibility intervals are defined as previously described [21].
Assessing the significance of association measurements
To evaluate the significance of a measured statistical association, we have determined the probability P = Pr(G_{M }≥ G_{obs}), where G_{obs }is the observed association and G_{M }is the association obtained from MonteCarlo simulations. This simulation process accounts for the ontology's structure since it is influenced by the connectivity of the ontology graph.
The simulation consists in uniformly sampling lists from the observed genes. In each simulation round, this random list of length M is considered as the list of differentially expressed genes. Ideally, the number of genes used in each round considers that there is an intrinsic error when one defines a gene as differentially expressed. Keeping the length constant M = n_{ij }+ n_{ij }+ n_{ji }could be an unrealistic approximation, due to the possibility of a global falsecalling rate. We used a generic and heuristic sampling rule considering this length as M ~ Poisson(n_{ij }+ n_{ij }+ n_{ji}). It is possible to incorporate other rules to define the length M in BayGO sourcecode. The rule can be derived from the particular method used for the identification of differentially expressed genes.
The definition of P is similar to the Frequentist pvalue if G_{obs }is a real number and M is kept constant. It is known that, in this configuration, simulations tend to agree with the theoretical results from Fisherlike methods. However, if we consider that not all the genes were observed and thus, the association is also a random variable, then the Frequentist pvalue is not defined. On the other hand, the Bayesian analog P is still valid since the event {G_{M }≥ G} is well defined. This probability is calculated in the BayGO software by counting the number of times that this event occurs during the MonteCarlo rounds. Note that, in each round, a new G value is calculated based on random variates drawn according to the BetaBinomial model described in the previous subsection.
BayGO sourcecode
BayGO sourcecode is provided in three versions: Linux, Windows^{® }and as a package for building a custom webbased interface. The core of BayGO is written as R language scripts [22]. The package for webbased interfaces is a set of Perl, HTML and R sourcecodes.
The main sourcecode of BayGO receives as inputs: an R binary "database" file, a list of differentially expressed genes in a plaintext file, a list of genes that were not differentially expressed and the number of simulation rounds desired for the significance and errorbar determination. The R binary "database" file keeps the organism information: an ontology table, containing all available ontology terms and descriptions, and an ontologytogene table, listing all the genes classified under each term. This file is built only once by an auxiliary script and it is used in several analysis. To build this "database" it is necessary to provide as inputs: a genetoontology plaintext file (similar to the GO .goa files, for example) and an ontology description plaintext file. Detailed examples are provided in the BayGO usermanual available at BayGO Homepage [23].
BayGO assumes that the ontologytogene or genetoontology tables are adequately built by the user, considering the different levels of abstraction of the GO graph desired. These issues include taking into account the hierarchical structure when considering father/sons ontology terms and defining an arbitrary truncation level of the graph [17].
The Linux version was designed to be easily incorporated into preexisting tools or pipelines. There are several software packages that deal with the ontology term enrichment problem [17] and many of them are integrated with relevant databases, creating a knowledge discovery environment for the biologist user. We did not try to surpass these efforts and therefore, we provide a software that receives and exports simple plaintext files and is operated in a noninteractive form. These features allow BayGO to be incorporated as a mathematical module in complex and complete pipelines. An example of an OpenSource software which has a powerful GUI is Clutree [24].
The version for Windows^{® }was designed to be extremely userfriendly. It does not require any programming skills and it is controlled interactively by menus. The functions to analyze the data and build R binary "database" files are integrated in the windowsstyle menus.
Finally, the webbased version was designed to allow the researchers to establish their own local webtool for the ontology term enrichment analysis. It is easy to realize that the organisms supported by the webbased tools are restricted to those with a vast research community such as human, mouse, yeast and so on. This version will enable research communities focussed on other organisms to settle a useful resource, assuming a simple infrastructure such as a webserver with PERL/CGI allowed.
Microarray experiments
To test the BayGO software, we used microarray data from the heat shock response in bacteria. Since this stress triggers a conserved biological response from a systemlevel viewpoint, we believe that meaningful results could show BayGO usefulness. Therefore, we analyzed the transcriptional response of the phytopathogenic bacterium Xylella fastidiosa exposed to a temperature shiftup from 29°C to 40°C for 25 minutes.
The complete dataset and all the MIAMErequired details regarding the construction, hybridization, image acquisition and expression ratio normalization are publicly available at GEO database [25,26] under the accession number GSE3044. The microarray slides and data were obtained according to [27]. The expression ratio between the 40°C and 29°C (control) conditions was normalized using the LOWESS fitting exactly as in [27].
Unreliable spots, presenting intensities too similar to the local background or saturated were filtered out. Spots presenting mean signal intensity below the mean background plus 2 times its standard deviation in Cy3 and Cy5 simultaneously were eliminated from subsequent analysis. Saturated spots were also discarded. The HTself method [28] was used to define the differentially expressed and non differentially expressed genes. Briefly, the HTself method uses selfself experiments (cDNA from the 29°C control condition labeled independently with both Cy3 and Cy5 fluorescent dye and hybridized simultaneously in the same microarray slide) to derive an intensitydependent cutoff curve. We used credibility intervals of 0.99, window size of 1.0 and window step of 0.2. It is assumed that genes found consistently outside this cutoff curve are differentially expressed. A gene was classified as differentially expressed or not if it has at least 5 measured replicates and 80% of the replicates are outside or inside the intensitydependent cutoff curves, respectively.
The pathway information about the bacterium Xylella fastidiosa was obtained directly from the KEGG website [29,4] and the Gene Ontology information from the GOA@EBI website [5,30].
We considered an ontology term as significantly enriched if it presents a pvalue smaller than 0.05. To reach the conclusions using the Frequentist method, we adopted the default parameters in GeneMerge [8] software, including Bonferroni correction and using the arrayed genes as the reference set. To reach the conclusions using our Bayesian method, we used the wholegenome as the reference set, as required by our proposed model.
Results and discussion
Measurement of statistical association in addition to statistical significance
The measurement of association is an extensively studied issue in statistics [18,19,31]. The usual meaning of "association" refers to coefficients that measure the strength of relationships; they are used to measure the relationship when there is a dichotomy. Two other kinds of association measurements are the wellknown "correlation", which is used when both variables are intervals, and the "reliability", which is used to analyze a variable with itself. The concept of statistical association is different from statistical significance. Measures of significance compare the strength of an observed relationship with one that would be expected by chance, if one performs random sampling. The significance analysis is influenced not only by the strength of a relationship but also by other parameters such as the sample size or the sampling procedure. It is possible to have a relationship that presents a nonsignificant strong association or a significant weak relationship. Significance is relevant only when one has a random sample whereas association is always relevant to draw conclusions. Since significance and association are not equivalent, researchers may report both measures when discussing their findings. The use of association measurements for the problem of ontology term enrichment in bioinformatics has not been explored yet; the predominance of significance approaches is evident.
We propose to measure the statistical association in the context of the ontology term enrichment problem using the Goodman and Kruskal's gamma factor G [19] (Eq.1). G is a symmetric index, normalized between 1 and 1. The interpretation of this measure is that G values near 1 indicate strong positive association. The positive association of a given term with differential gene expression means that the property described by the ontology term on focus can be involved in the biological phenomenon studied. Negative and zero association are biologically meaningless in this context. The zero value means that there is no statistical association and the negative values indicate association with nondifferential gene expression.
It is not possible to define in advance a cutoff value above which one can consider an ontology term as presenting a relevant association. On the other hand, there are some traditionally used cutoffs for significance such as 0.01 or 0.05. It is intuitive to rely on a conclusion showing significance of 0.05, but it is difficult to foresee the value of association that represents a strong association in a general context (the same argument holds for correlation measurements). Depending on the application, the value of 0.95 can represent a strong association (or correlation); in other context, the same value can be an insufficient evidence of association between two facts. Ideally, sorting ontology terms by their G values and performing external independent biological assays could provide the association cutoffs. In order to evaluate the significance of the obtained associations, we calculated the probability P = Pr(G_{M }≥ G), where G_{obs }is the observed association and G_{M }is a random variable obtained by MonteCarlo simulation (see Implementation section for details).
Bayesian model accounting for nonobserved genes
In a typical microarray dataset, a gene could not be detected for several reasons: it was not spotted on the array due to a selection procedure; it did not meet the quality standards on intensity, integrity, technical reproducibility, etc. If we cannot observe all the genes, the number of genes differentially expressed observed is an estimate and, therefore, we need an inferential framework.
The advantages and disadvantages of the Bayesian statistics over the Frequentist approach belong to an old and endless debate in statistics. Some advantages of the Bayesian approach for microarray analysis are well discussed elsewhere [32]. In the problem of ontology term enrichment, the Bayesian framework allows the incorporation of prior information regarding the known size of each set of genes associated to an ontology term.
If we consider that all genes associated to a certain term were observed, the association measure G is a real number. In addition, if the significance analysis is made upon random lists of differentially expressed genes with the same length of the observed differentially expressed gene list, the probability P = Pr(G_{M }≥ G) is equivalent to the usual Frequentist pvalue. These assumptions are tacitly used when one performs all the Fisherlike tests [17].
However, if one cannot observe all the genes associated to a term, then the association should be treated as a random variable whose probability distribution reflects our ignorance about the true value. Our ignorance becomes smaller as we observe more genes. Moreover, we can establish credibility intervals ("errorbars") for the degree of association, similarly as discussed previously in a SAGE analysis context [21]. The Bayesian analogue for the pvalue, P = Pr(G_{M }≥ G_{obs}), is still valid even if both G_{M }and G_{obs }are random variables, whereas the Frequentist pvalue is not defined in this case.
It is possible to show that, under certain conditions, once one observes a portion of the differentially expressed genes related to a given ontology term, the unknown remainder portion is described by a BetaBinomial probability distribution [20]. This result holds under a noninformative uniform a priori (see Implementation section for mathematical details) given the number of differentially and non differentially expressed genes that were observed and the number of genes known to be related to the ontology term. In other words, as it is performed in an electoral process, we estimate the behavior of the finite population of genes related to an ontology term using a sample of it.
Examples of the BayGO usefulness
To show the usefulness of the BayGO software, we devise some illustrative examples of situations that one can face when analyzing microarray data. One of the examples shows the usefulness of the statistical association and the other examples show the Bayesian feature of considering the sampling effect.
To highlight the difference between our Bayesian method and one that does not account for the sampling effect, we have arbitrarily chosen the "hypergeometric" test to represent the Frequentist Fisherlike approaches [8,17].
We will consider a fictitious testorganism with characteristics compatible with several real organisms. Suppose that this testorganism has 4000 genes, classified into several categories that are described by ontology terms. Let us focus on an arbitrary class and list all the genes related to a term i that describes this class. The size of the set i is 30, and only 5 genes are exclusively related to the property described by the term i. Let j be the list of genes that have other terms different from i related to them. The size of this set j is 3975. The set ij contains the genes that are exclusively related to the ontology term i while the set ji contains the genes that are not related to the ontology term i. The set ij contains the genes that are not i exclusive. Finally, suppose that we carried out microarray experiments comprising all the known genes.
The genetoontology table for this testorganism, the complete Bayesian and Frequentist results and the corresponding gene lists used in the following illustrative examples are available as 1.
Additional File 1. Zipped file containing all the data relative to the artificial testorganism. It contains: the genetoGO table, the complete genome, the input data for creating the examples of "Examples of the BayGO usefulness" section and all the outputs/results obtained for those examples.
Format: ZIP Size: 92KB Download file
In the first example, suppose that all the 4000 genes are observable and the microarray experiment yielded 400 differentially expressed genes, distributed as follows:
In this 1^{st }scenario, there is no qualitative difference between the two approaches. The pvalue obtained by the Frequentist test is <10^{13}, indicating a significant enrichment of the term i. Accordingly, using the Goodman and Kruskal's gamma (Eq.1) to measure the statistical association between the term i and differential expression, one obtains G_{obs }= 0.899, a value close to 1 which means strong association. Moreover, since all available genes were observed, G is known without uncertainty and its 90% credibility interval [0.899; 0.899] has zero length. The MonteCarlo simulations with 1000 rounds indicate significant association since P = 0.
Suppose that, for the same dataset, the result from half of the genes related to term i was not measured by typical reasons such as: it did not meet quality standards on intensity, integrity, technical reproducibility, and so on. Now the contingency table is:
In this 2^{nd }scenario, both approaches are still indicating a significant enrichment of the term i. However, the statistical association measured is now a random variable, since the data is a sample from the complete scenario. The Frequentist test's pvalue is 3.5·10^{4}. The counts for all irelated genes were diminished proportionally (e.g. approximately half), thus, the measured association remains the same G_{obs }= 0.899. Nevertheless, there is now an uncertainty related to the measured statistical association and its 90% credibility interval is [0.829;0.937]. Despite the uncertainty, the association measured is much more probable than the association obtained from 1000 uniformly generated random lists of differentially expressed genes, since P = 0.
Now, suppose that only one fifth of the i related genes had their expression status defined. The contingency table is:
In this 3^{rd }scenario, the two approaches clearly disagree. The Frequentist test's pvalue is 0.32 indicating no significant enrichment. The association measure gives the same result G_{obs }= 0.899 with the 90% credibility interval [0.724;0.948]. The significance of the obtained association is P = 0.04, indicating that the enrichment could be accepted in face of the usual 0.05 cutoff.
Intuitively, we believe that finding an abundance of 4/400 for the differentially expressed genes is relevant when comparing with the total abundance of 6/3976 for all genes. Therefore, we believe that the association measurement is capturing the essential feature of the data, even if subjected to uncertainty.
In the following, we will consider another extreme example: when the observed genes are highly biased. Toward this aim, suppose that the testorganism has a gene annotation slightly different from the one used above, allocating the same 30 genes in a different way among the ontology terms. In the previous example, we emphasized the consistent pattern of the association measurement if the observed sample is "equilibrated" over the possible results. Now, using the same testorganism and the same microarray setup, suppose that we could observe all available genes, yielding the following contingency table:
It is clear that, from both Frequentist test and Bayesian model, there is no significant enrichment of the term i, since the pvalue is 0.17, G_{obs }= 0.282 without uncertainty and P = 0.11.
Suppose that one cannot observe the result from the genes related exclusively to the term i. Let us consider, for instance, that the 25 iexclusive genes were not spotted on the microarray slide due to gene selection. Therefore, the contingency table is:
In this scenario, one is led to believe that there is a significant enrichment, since all spotted genes related to the term i were found to be differentially expressed. In fact, the pvalue is <10^{5 }and the association is very strong with G_{obs }= 1. However, the 90% credibility interval of the association measurement is large [0.382;1.00] and the significance is low with P = 0.34.
We believe that this example shows the importance of considering the sampling effect in the significance or in the association result, manifested by the large "errorbar".
Application to real microarray data
To validate the BayGO approach, we used the microarray experiment data corresponding to a relatively wellcharacterized biological phenomenon: the heat shock response in bacteria. The high temperature stress is an aggressive environmental perturbation for bacteria, triggering system level changes in their transcriptional program. Although this stress response has many peculiarities depending on the particular bacteria studied, there are known conserved responses that can guide the validation of our model.
In this sense, we analyzed the response of the phytopathogenic bacterium Xylella fastidiosa to a temperature shiftup from 29°C to 40°C for 25 minutes, using DNA microarrays fully described in the GEO database under the accession number GSE3044. To define the upregulated and downregulated genes, as well the genes whose expression was unchanged, we used the HTself method [28] (see more details in the Implementation section).
In this example, we will focus on the upregulated genes induced by the temperature stress; therefore, the noninduced genes are the downregulated plus the nonperturbed genes. It is not our aim in this work to explore the particular biological implications of this dataset. The complete study of heat shock response comprising several timepoints, along with the biological rationale and conclusions will appear elsewhere (T. Koide, R.Z.N. Vêncio and S.L.Gomes, submitted).
We avoid making extensive comparisons with the methods available since there are several combinations of ways in which a term enrichment analysis can be performed [17]. A given method can yield a different qualitative result depending on the multipletest correction used, depending on several free parameters or depending on the input ontology (for example, GO analysis can be performed considering or not certain connections of the graph). Furthermore, we believe that presently, there is no reasonable independent methodology of biological validation to confront the systemlevel biological conclusions obtained with different softwares.
We chose to present the BayGO results along with the results obtained by GeneMerge [8]. GeneMerge is one of the few methods available that accepts arbitrary ontologies and supports arbitrary organisms, not being limited to the most studied ones [17]. Note, however, that this is an illustrative comparison.
The first analysis is relative to KEGG pathways. 2 contains the complete output from the Bayesian and Frequentist softwares for KEGG terms analysis. The pathway terms significantly enriched (pvalue < 0.05) according to the Frequentist methods are: Protein folding and associated processing and Folding, Sorting and Degradation. The terms enriched (P < 0.05) according to our Bayesian methods are: Protein folding and associated processing, Folding, Sorting and Degradation and Genetic Information Processing. These results are completely compatible with what is known about the bacterial heat shock response [33]. It seems that BayGO correctly captured the KEGG term Genetic Information Processing since it is known that bacteria facing such stress change their gene expression program in a broad systemlevel sense.
Additional File 2. Excel or Gnumeric worksheet containing all the results from term enrichment analysis of Xylella fastidiosa heat shock experiment, described in the "Application to real microarray data" section.
Format: XLS Size: 230KB Download file
This file can be viewed with: Microsoft Excel Viewer
The second analysis is relative to Gene Ontology. 2 contains the outputs from the BayGO and GeneMerge softwares for GO terms analysis. The ontology terms significantly enriched (pvalue < 0.05) according to the Frequentist method are: Protein folding (GO:0006457), Unfolded protein binding (GO:0051082), Response to unfolded protein (GO:0006986) and Protein binding (GO:0005515). The terms enriched according to our Bayesian method (P < 0.05) compose a more numerous set, including the 4 terms found by the Frequentist method. Table 1 shows these significant terms.
The biological insight provided by the Bayesian method seems to be richer when one is trying to elaborate a systemlevel picture of the transcriptional disturbance caused by heat shock. For instance, BayGO was able to highlight the term HslUV protease activity (GO:0009377) and Heat shock protein binding (GO:0031072) that were not highlighted by the Frequentist analysis. Both activities are classically known to be involved in the heat shock response [33].
We restricted our illustrative comparisons to the significance analysis since there is no similar software that approaches the statistical association issue presented in this work. We believe that the use of association measurements can help the biologists in their microarray analysis. For instance, the term Chaperone activator activity (GO:0030189) presents a strong association level (G_{obs }= 1) but not a high significance (P < 0.17) and the chaperone molecules are known to be related to the heat shock response. Results for all terms are available in 2.
It is important to note that the comparison performed in this section should be carefully considered since there are many possible variations for the Frequentist analysis [17]. However, we believe that these experimental results show the usefulness of BayGO software.
BayGO sourcecode
Our sourcecode is provided as R language [22] scripts in three versions: Linux, Windows. and as a package for building custom webbased interface locally. The Linux version receives its parameters noninteractively by commandline arguments and thus, it can be easily incorporated into preexisting pipelines or webbased tools. The version for Windows is a userfriendly program that is controlled interactively by menus, without the necessity of programming, being more appropriate for endusers. Finally, the webbased version is a set of R, Perl and HTML sourcecodes that allows one to build his own webbased tool locally, focussing on the organisms of interest.
Using the webbased version, we have created a useful webtool that supports organisms of interest to our local research community [23]. The evergrowing list of supported organisms includes some of those that are neglected by the most used similar tools, such as Xylella fastidiosa, Xanthomonas citri, Blastocladiella emersonii, etc. Although BayGO was designed based on Gene Ontology terms, the software can also be used with other classification designs. For this, the user has to provide a genetoontology hash table for all known genes and not only for those spotted in the microarray slide.
Caveats, limitations and recommendations
The approach of finding enriched ontology terms in microarray data has several known intrinsic limitations. In our opinion, the most severe limitations are: (i) the practical difficulty in the experimental validation of the conclusions derived from such methods, since it is difficult to manipulate a large set of genes to observe which systemlevel deductions are real or methods' artifacts; (ii) the large quantity of "free parameters" that can be changed in routinely use of these methods, from the hierarchical structure of ontologies limited by the user to the choice of cutoff pvalues; (iii) the tacit assumption that a given function, pathway, etc., is "important" or not solely based on numerical aspects (number of genes, statistical association, etc) ignoring the qualitative aspects involved (few differentially expressed genes in a pathway might be sufficient to trigger systemlevel responses); and (iv) the results rely heavily on gene annotation provided for the studied organism.
It is likely that, even for well studied prokaryotes, many genes have no known function and others are likely to be involved in different processes and therefore, they are associated to multiple presently unassigned ontology terms. A potential implication is that the contingency tables probably hold more uncertainty than it is taken into account by the present models. There are some efforts in trying to quantify the effects of misannotation or to establish some form of probabilistic gene annotation, including Bayesian models [34,35]. Future directions of works in the ontology term enrichment problem should include such gene annotation uncertainties.
The ontology term enrichment problem is not restricted to differential gene expression analysis or to microarray derived data. The current models, including the one implemented in BayGO, are capable of analysing the enrichment relative to other dichotomous properties such as "belonging to a particular gene cluster" vs. "not belonging", in clustering analysis [24]; "it is expressed (ON)" vs. "it is not expressed (OFF)", in gene transcription analysis; "narrowly expressed transcripts" vs. "prevalently expressed transcripts" [36]; and so on.
Particularly for BayGO, the dichotomous properties are defined in the two rows of the contingency table. The significance of the statistical association is measured relative to the first row, however, since gamma (Eq.1) is symmetric around zero, flipping rows order will yield the "opposite" analysis. In the illustration presented in the section Application to real microarray data, we chose to analyze the induced genes after temperature stress, defining the dichotomous property as "upregulated" vs. "not upregulated" (this class includes not differentially expressed and downregulated genes). Conversely, one may want to focus on the repressed genes defining the contigency table's rows as "downregulated" vs. "not downregulated". Another possibility is to analyze "perturbed" vs "no change", independent of the direction of the change.
Generally, the methods to find enriched terms work using nominal lists of genes without knowledge of how the list was obtained. The means by which these gene lists were obtained define the interpretation and meaning of the result. It is important to note that the method used to obtain the differentially expressed genes (or any other property being studied) has an important impact on the enrichment analysis since they determine the gene list that is inputed. In this work, we analysed our microarray data using our recently published HTself method [28] but we provide the same analysis using widely used approaches as well at the Supplemental material [23].
BayGO was designed to be a mathematical module and therefore, it has a limited interactive potential. For instance, it does not provide a graphical user interface (GUI) that allows the user to browse his/her results. These features are complex and are beyond BayGO's scope. It was designed to be incorporated into preexisting software with elaborated GUI frontend. One good option of such frontend is the OpenSource software Clutree [24]. One of the future directions for BayGO evolution could be its incorporation on available GUIs.
Conclusion
When dealing with the systemlevel problem of finding enriched terms from microarray data, most of the softwares available use only the significance analysis. In this work, we have introduced the use of a measure of statistical association between ontology terms and differentially expressed genes, in addition to the common significance analysis. We elaborated a Bayesian statistical model for the ontology term enrichment problem that incorporates information about the composition of the ontology by taking into account genes that were not observed in the microarray data. The examples given in this work aim at making the investigators aware of the necessity of considering the sampling problem when drawing systemlevel conclusions.
The BayGO software can be used with generic genetoontology tables and not only with GO classification. The webtool is implemented for a set of organisms with particular interest to our local research community, but we made available the sourcecode that allows one to build custom webtools for other organisms. We also made available an R sourcecode and a webbased tool that calculates these association measurements and their significances.
Finally, we would like to highlight that the estimation of statistical association and statistical significance are not equivalent procedures. We believe that the use of statistical association should be more explored in bioinformatics. Statistical association is an established tool in statistics, widely used in other research fields since it allows the visualization of relationships inside the data that are not considered when using only significance analysis
Availability and requirements
• Project name: BayGO
• Project home page: http://blasto.iq.usp.br/~tkoide/BayGO webcite
• Operating system(s): Platform independent (Linux, Windows, Mac OS X and webservice for supported organisms)
• Programming language: R.
• Other requirements: to build a local version of the webservice it is necessary to have Apache with CGI allowed and Perl.
• License: under the GNU General Public License
List of abbreviations
GO: Gene Ontology
KEGG: Kyoto Encyclopedia of Genes and Genomes
SAGE: Serial Analysis of Gene Expression
GUI: graphical user interface
Authors' contributions
RZNV and TK conducted the work and drafted the manuscript. RZNV and CABP are responsible for the statistical model and software. TK and SLG are responsible for all the experimental work and biological interpretation. All the authors wrote and approved the manuscript.
Acknowledgements
We thank Thiago M. Venancio for several suggestions and critical reading of the manuscript. TK is supported by a doctoral fellowship from Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP). SLG and CABP are partially supported by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq).
References

Cavalieri D, De Filippo C: Bioinformatic methods for integrating wholegenome expression results into cellular networks.
Drug Discov Today 2005, 10:727734. PubMed Abstract  Publisher Full Text

Yue L, Reisdorf WC: Pathway and ontology analysis: emerging approaches connecting transcriptome data and clinical endpoints.
Curr Mol Med 2005, 5:1121. PubMed Abstract  Publisher Full Text

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, IsselTarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.
Nat Genet 2000, 25:2529. PubMed Abstract  Publisher Full Text

Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes.
Nucleic Acids Res 2000, 28:2730. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

GO: GOA@EBI Home Page. [http://www.ebi.ac.uk/GOA/proteomes.html] webcite

Consortium TGO: Tools for Gene Expression Analysis. [http://www.geneontology.org/GO.tools.microarray.shtml] webcite

Khatri P, Draghici S, Ostermeier GC, Krawetz SA: Profiling gene expression using ontoexpress.
Genomics 2002, 79:266270. PubMed Abstract  Publisher Full Text

CastilloDavis CI, Hartl DL: GeneMergepostgenomic analysis, data mining, and hypothesis testing.
Bioinformatics 2003, 19:891892. PubMed Abstract  Publisher Full Text

Berriz GF, King OD, Bryant B, Sander C, Roth FP: Characterizing gene sets with FuncAssociate.
Bioinformatics 2003, 19:25022504. PubMed Abstract  Publisher Full Text

AlShahrour F, DiazUriarte R, Dopazo J: FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes.
Bioinformatics 2004, 20:578580. PubMed Abstract  Publisher Full Text

Beissbarth T, Speed TP: GOstat: find statistically overrepresented Gene Ontologies within a group of genes.
Bioinformatics 2004, 20:14641465. PubMed Abstract  Publisher Full Text

Osier MV, Zhao H, Cheung KH: Handling multiple testing while interpreting microarrays with the Gene Ontology Database.
BMC Bioinformatics 2004, 5:124. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G: GO::TermFinderopen source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes.
Bioinformatics 2004, 20:37103715. PubMed Abstract  Publisher Full Text

Pasquier C, Girardot F, Jevardat de Fombelle K, Christen R: THEA: ontologydriven analysis of microarray data.
Bioinformatics 2004, 20:26362643. PubMed Abstract  Publisher Full Text

Young A, Whitehouse N, Cho J, Shaw C: OntologyTraverser: an R package for GO analysis.
Bioinformatics 2005, 21:275276. PubMed Abstract  Publisher Full Text

Kim CC, Falkow S: Significance analysis of lexical bias in microarray data.
BMC Bioinformatics 2003, 4:12. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Khatri P, Draghici S: Ontological analysis of gene expression data: current tools, limitations, and open problems.
Bioinformatics 2005, 21:35873595. PubMed Abstract  Publisher Full Text

Goodman LA, Kruskal WH: Measures of association for cross classifications.

Garson GD: Political Science Methods. Boston, Holbrook Press; 1976.

Basu D, Pereira CAB: On the Bayesian analysis of categorical data: the problem of nonresponse.
J Statist Plan Inf 1982, 6:345362. Publisher Full Text

Vencio RZ, Brentani H, Pereira CA: Using credibility intervals instead of hypothesis tests in SAGE analysis.
Bioinformatics 2003, 19:24612464. PubMed Abstract  Publisher Full Text

Team RDC: R: A language and environment for statistical computing. [http://www.Rproject.org] webcite

Vencio RZ, Koide T: BayGO Home Page. [http://blasto.iq.usp.br/~tkoide/BayGO] webcite

Bidaut G, Ochs MF: ClutrFree: cluster tree visualization and interpretation.
Bioinformatics 2004, 20:28692871. PubMed Abstract  Publisher Full Text

Barrett T, Suzek TO, Troup DB, Wilhite SE, Ngau WC, Ledoux P, Rudnev D, Lash AE, Fujibuchi W, Edgar R: NCBI GEO: mining millions of expression profilesdatabase and tools.
Nucleic Acids Res 2005, 33:D5626. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

NCBI: Gene Expression Omnibus. [http://www.ncbi.nlm.nih.gov/geo] webcite

Koide T, Zaini PA, Moreira LM, Vencio RZ, Matsukuma AY, Durham AM, Teixeira DC, ElDorry H, Monteiro PB, da Silva AC, VerjovskiAlmeida S, da Silva AM, Gomes SL: DNA microarraybased genome comparison of a pathogenic and a nonpathogenic strain of Xylella fastidiosa delineates genes important for bacterial virulence.
J Bacteriol 2004, 186:54425449. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Vencio RZ, Koide T: HTself: SelfSelf Based Statistical Test for Low Replication Microarray Studies.
DNA Res 2005, 12(3):211214. PubMed Abstract  Publisher Full Text

KEGG: KEGG Home Page. [http://www.genome.jp/kegg/] webcite

Camon EB, Barrell DG, Dimmer EC, Lee V, Magrane M, Maslen J, Binns D, Apweiler R: An evaluation of GO annotation retrieval for BioCreAtIvE and GOA.
BMC Bioinformatics 2005, 6 Suppl 1:S17. PubMed Abstract  BioMed Central Full Text

Liebetrau AM: Measures of association. Newbury Park, Sage Publications; 1983.

Yang D, Zakharkin SO, Page GP, Brand JP, Edwards JW, Bartolucci AA, Allison DB: Applications of Bayesian statistical methods in microarray data analysis.
Am J Pharmacogenomics 2004, 4:5362. PubMed Abstract  Publisher Full Text

Yura T, Nagai H, Mori H: Regulation of the heatshock response in bacteria.
Annu Rev Microbiol 1993, 47:321350. PubMed Abstract  Publisher Full Text

Levy ED, Ouzounis CA, Gilks WR, Audit B: Probabilistic annotation of protein sequences based on functional classifications.
BMC Bioinformatics 2005, 6:302. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Engelhardt BE, Jordan MI, Muratore KE, Brenner SE: Protein Molecular Function Prediction by Bayesian Phylogenomics.
PLoS Comput Biol 2005, 1:e45. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zhang J, Zhang L, Coombes KR: Gene sequence signatures revealed by mining the UniGene affiliation network.
Bioinformatics 2005, 22:385391. PubMed Abstract  Publisher Full Text