Abstract
Background
One main research challenge in the post-genomic era is to understand the relationship between protein sequences and their biological functions. In recent years, several automated annotation systems have been developed for the functional assignment of uncharacterized proteins. The underlying assumption of these systems is that similar sequences imply similar biological functions. However, it has been noted that matching sequences do not always imply similar functions.
Results
In this paper, we present the correlation between protein sequences and protein functions for the yeast proteome in the context of gene ontology. A novel measure is introduced to define the overall similarity between two protein sequences. The effects of the level as well as the size of a gene ontology group on the degree of similarity were studied. The similarity distributions at different levels of gene ontology trees are presented. To evaluate the theoretical prediction power of similar sequences, we computed the posterior probability of correct predictions.
Conclusion
The results indicate that protein pairs of similar biological functions tend to have higher sequence similarity, although the similarity distribution in each functional group is heterogeneous and varies from group to group. We conclude that sequence similarity can serve as a key measure in protein function prediction. However, the resulting annotations must be verified through other means. A method that combines a broader range of measures is likely to provide more accurate predictions. Our study indicates that the posterior probability of a correct prediction could serve as one of the key measures.
Background
The human genome project and numerous other genome projects have produced a large and ever-increasing amount of sequence data. One of the main research challenges in the post-genomic era is to understand the relationship between the nucleotide sequences of genes and the functions of the proteins they encode. Traditionally, the functional annotation of genes has been done manually by experienced individual curators with the help of advanced searching tools. However, to unlock the potential of the huge amount of genome-wide sequence data, it is necessary to develop large-scale approaches for the functional assignment of uncharacterized proteins [1-10]. In recent years, several automated annotation systems have been developed based on homologues identified from database searches, text mining, gene ontologies, and co-expression relationships obtained from microarray gene expression patterns [11-23]. In sequence similarity-based approaches, the function of a query protein is deduced from those of homologous proteins of known function obtained from database searches. The underlying assumption of these approaches is that similar sequences imply similar biological functions. Since this assumption is true in many cases and the approaches are simple, this type of sequence-matching scheme has been the most popular and widely used, although it has been noted that matching sequences do not always imply similar functions [24-26].
The Gene Ontology (GO) consortium provides a vocabulary to describe gene and gene product attributes in any organism [27]. GO includes three ontological categories: molecular function, biological process, and cellular component. A molecular function GO term represents a biological activity involving one or more gene products. A biological process GO term represents a series of biological activities. And a cellular component GO term, as the name suggests, represents a component of a cell. The GO terms in each category are organized in a directed acyclic graph (DAG), i.e., a specialized GO term (child) could be associated with one or several less specialized GO terms (parents).
Since the establishment of GO, many ontology-based sequence annotation approaches have been developed [16-23], including several web-based automated GO annotation software tools [18,19]. These attempts typically involve a search for homologous proteins in GO-mapped databases including GenBank and Swiss-Prot. Zehetner's OntoBlast and Hennig et al.'s GOblet present a list of homologues together with their GO terms [18,19]. Martin et al.'s GOtcha searches a set of seven model genomes and returns scored matches [20]. Xie et al.'s GO Engine combines homology search with text mining [17]. Schug et al. developed a rule-based function prediction method based on the intersection of GO terms that contain protein domains at different similarity levels [16]. Abascal et al. presented an automatic annotation method based on protein family identification [21]. Jensen et al. used neural networks for the prediction while Vinayagam et al. used support vector machines [22,23]. The appeal of these approaches is that they can directly assign a biological meaning to an uncharacterized protein sequence.
In this study, we investigate the mathematical underpinnings of the automated sequence annotation approaches that are based on sequence similarity and gene ontology. We explore the structures of the three ontology categories and reevaluate the assumption that similar sequences give rise to similar biological functions. We introduce a novel measure of overall similarity between two protein sequences based on a set of local BLAST alignments. Using the complete proteome from the model organism yeast, we study the degree of overall similarity of yeast protein sequences in each functional group defined by GO terms. We examine the effects of the level of GO terms and the size of GO groups on the degree of similarity. We present the sequence similarity distributions at different levels of GO DAGs and the distributions of siblings of GO groups. To evaluate the theoretical prediction power of similar sequences, we compute the posterior probability of the hypothesis that protein A possesses the same biological function as protein B, given B's biological function is known and A and B are similar.
Results and Discussion
All-to-all pairwise protein sequence local alignments were performed using the alignment tool for blasting two sequences (bl2seq), which was retrieved from the NCBI ftp site [28]. The p-values were calculated based on a novel measure (Equation (2) in the Methods section) of the overall similarity of two protein sequences. The distributions of the p-values are shown in Table 1. The first column presents the p-value distribution of protein sequence pairs from the complete yeast proteome. This distribution serves as a control for the distribution of the whole population. The second, third and fourth columns show the distributions for sequences annotated for biological processes, molecular functions and cellular components, respectively. The four distributions are quite similar, indicating that the annotated proteins in each of the three gene ontology categories provide a representative sample of sequences from the complete yeast proteome. On the other hand, the majority of sequence pairs are clearly not similar. Only about 4% of the sequence pairs have p-values less than 0.01.
Table 1. The p-value distributions of protein sequence pairs.
The distributions of the number of GO groups at different levels of the gene ontologies are shown in Figure 1. In this study, when a pair of sequences appears on multiple levels, the highest level (most specialized level) was chosen for the analysis. The GO groups in molecular function and cellular component mainly populate the third and fourth levels of the ontologies, while the biological process GO groups are mainly distributed around level six. The average sizes of GO groups at different levels of the ontologies are shown in Figure 2. In all three GO categories the average size of the GO groups decreases in most cases as their level increases. Note that groups of fewer than six protein sequences are not shown.
Figure 1. Numbers of GO groups at different levels of the gene ontologies.
Figure 2. Average size of GO groups at different levels of the ontologies.
Figures 3, 4, 5 show the p-value distributions of protein sequence pairs annotated for molecular function, biological process and cellular component GO terms at different levels of the GO categories. Each curve in the figures represents the percentages of sequence pairs with p-values less than or equal to a given threshold across different levels of the GO categories. Some curves do not include the percentages for all levels because no sequence pair on those levels has a p-value less than or equal to the corresponding threshold. For the sequences annotated for molecular function and cellular component GO terms, the majority of the sequence pairs are considered non-similar throughout the levels. Over 59% of sequence pairs at all levels have p-values greater than 0.01. However, the number of similar sequence pairs does increase steadily with their GO levels. In particular, the percentage of pairs with high similarity scores (p ≤ 10^{-10}) has a steep increase from the root level to level 5. Level 9 has the highest percentage of similar pairs for the molecular function ontology. At this level over 35% of the sequence pairs have similarity p-values less than or equal to 10^{-3}, while for levels 5 through 8, over 13% of the sequence pairs have p-values less than or equal to 10^{-3}. These percentages are significantly higher than the 1.8% extracted from the p-value distribution of the entire population of sequence pairs annotated for molecular functions (Table 1). The percentage increase is also not monotonic between levels 6 and 9; over a short stretch the percentage decreases with level. We believe that this result is mainly due to the nature of the ontology graph, in which fewer GO terms are on levels higher than 6. Another possible contributing factor is that, in our analysis, the level of a GO term is defined to be the lowest level on which it appears in the GO DAG.
For the cellular component ontology, level 7 has about 12% of the sequence pairs with p-values less than or equal to 10^{-3}. The average for levels 5 and 6 is about 3.6%. For these two ontologies, significantly more pairs have high similarity scores (p ≤ 10^{-10}) at levels 5 or above than at levels below 5. For the biological process ontology, the increase in the number of similar pairs starts to level off around level 7, a noticeably higher level than for the molecular function and cellular component ontologies. About 6.2% of pairs at levels 7 through 11 have similarity p-values less than or equal to 10^{-3}, compared to an average of 1.74% at levels below 7. Similar to the two other ontologies, there are significantly more pairs annotated for biological process GO terms having high similarity scores (p ≤ 10^{-10}) at levels 7 and above than at levels below 7.
Figure 3. The p-value distributions of sequence pairs annotated for molecular function. Each curve represents the percentages of sequence pairs with p-values less than or equal to a given threshold across different levels of the GO category. (Some curves do not include the percentages for all levels because no sequence pair on those levels has a p-value less than or equal to the corresponding threshold.)
Figure 4. The p-value distributions of sequence pairs annotated for biological process. Each curve represents the percentages of sequence pairs with p-values less than or equal to a given threshold across different levels of the GO category. (Some curves do not include the percentages for all levels because no sequence pair on those levels has a p-value less than or equal to the corresponding threshold.)
Figure 5. The p-value distributions of sequence pairs annotated for cellular component. Each curve represents the percentages of sequence pairs with p-values less than or equal to a given threshold across different levels of the GO category. (Some curves do not include the percentages for all levels because no sequence pair on those levels has a p-value less than or equal to the corresponding threshold.)
The complete p-value distributions of sequence pairs for each GO group of the three ontologies are shown in the supplement tables I, II, III (Additional data file 1, 2, 3). Table 2 shows a typical part of the supplement table II. It presents the p-value distribution of sequence pairs in some GO groups on the transporter activity branch of the molecular function ontology tree. Numbers in each row of the table represent the percentages of sequence pairs with p-values within a certain range. The p-value distributions are highly diverse across GO groups. Most of the distributions are independent of the sizes of the groups. Most noticeably, the sequence pairs in the carbohydrate transporter activity group have much higher similarity scores. Over 75% of pairs have p-values less than or equal to 10^{-5}. In particular, the 17 sequences in the subgroup monosaccharide transporter activity are extremely similar to each other. All the p-values of the 136 pairs are less than or equal to 10^{-50}. On the other hand, the sequences in the GO group monovalent inorganic cation transporter activity, which is at the same level as monosaccharide transporter activity, exhibit much lower similarity scores. More than 96% of sequence pairs have p-values greater than 10^{-2}. In general, within one branch of the ontology tree, the higher the level of a GO group, the higher the similarity of its sequence pairs. More convincingly, 707 out of 903 biological process groups, 304 out of 362 molecular function groups, and 216 out of 284 cellular component groups have a higher percentage of sequence pairs with p-values less than 10^{-3} than their parents (in the case of multiple parents, the averages of the similarity scores of the parents are considered). This result indicates a strong correlation between sequence similarity and function similarity/specificity.
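The child-versus-parent comparison described above can be sketched in Python. This is only an illustrative sketch: the function names, the dictionary layout, and the toy p-values are hypothetical and are not taken from the authors' scripts.

```python
def fraction_below(pvalues, threshold=1e-3):
    """Fraction of sequence pairs in a group with p-value below the threshold."""
    return sum(p < threshold for p in pvalues) / len(pvalues)


def child_more_similar(child, parents, group_pvalues, threshold=1e-3):
    """True if the child group has a higher fraction of similar pairs than
    the average over its parent groups (multiple parents are averaged)."""
    parent_mean = sum(
        fraction_below(group_pvalues[g], threshold) for g in parents
    ) / len(parents)
    return fraction_below(group_pvalues[child], threshold) > parent_mean


# Hypothetical toy data: pairwise p-values per GO group (not from the paper).
group_pvalues = {
    "parent1": [0.5, 0.2, 1e-4, 0.9],
    "parent2": [0.3, 0.4],
    "child": [1e-5, 1e-4, 0.2],
}
print(child_more_similar("child", ["parent1", "parent2"], group_pvalues))
```

Counting the groups for which this comparison holds, per ontology, would give tallies of the kind reported above.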
Additional data file 1. The p-value distribution of sequence pairs in all GO groups of the molecular function ontology. Format: XLS. Size: 44KB.
Additional data file 2. The p-value distribution of sequence pairs in all GO groups of the biological process ontology. Format: XLS. Size: 1.1MB.
Additional data file 3. The p-value distribution of sequence pairs in all GO groups of the cellular component ontology. Format: XLS. Size: 146KB.
Table 2. The p-value distribution of sequence pairs in some GO groups on the transporter activity branch.
The dependence of sequence similarity on group size was also examined. No strong correlation was found, although there is a vague trend of increasing sequence similarity as the group gets smaller. The Pearson correlation coefficient of group size versus percentage of sequence pairs with p-values less than or equal to 10^{-5} for the molecular function ontology is about -0.124. The coefficients for biological process and cellular component are -0.137 and -0.136, respectively. As an example, the protein kinase activity group has 94 annotated sequences. About 70% of the 4371 sequence pairs have p-values less than or equal to 10^{-5}. On the other hand, the nucleobase, nucleoside, nucleotide kinase activity group has only 10 annotated sequences. Only 10 out of the 45 pairs have p-values less than or equal to 10^{-5}, although both groups are at the same level (level 5) of the molecular function ontology tree. The fact that most child groups have higher similarity scores than their parents might be the main factor contributing to the weak negative correlation between sizes and similarity scores.
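The pair counts and the correlation coefficient discussed above can be checked with a few lines of Python. The sketch below is illustrative only: the helper names are hypothetical, and the example arrays passed to `pearson` are made-up values, not data from the study.

```python
import math


def n_pairs(n):
    """Number of unordered sequence pairs in a group of n sequences."""
    return n * (n - 1) // 2


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples.

    Assumes both samples have non-zero variance.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Group sizes quoted in the text: 94 and 10 annotated sequences.
print(n_pairs(94), n_pairs(10))

# Hypothetical (group size, percentage of similar pairs) values.
sizes = [94, 40, 17, 10]
percent_similar = [12.0, 20.0, 35.0, 22.0]
print(pearson(sizes, percent_similar))
```

A negative coefficient here corresponds to the trend noted in the text: smaller groups tending to contain a larger share of similar pairs.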
The above results indicate that proteins of similar biological functions tend to have higher sequence similarity. The level of GO groups on a gene ontology tree depicts to a certain degree the functional similarity of the groups, although it is far from being able to accurately characterize the relationship between protein sequence similarity and biological function similarity. To evaluate how much protein sequence similarity can contribute to biological function prediction, we computed the posterior probabilities of correct predictions using equation (4). The results for the protein kinase activity branch of the molecular function ontology tree are presented in Table 3, while the p-value distributions of sequence pairs for the branch are shown in Table 4 for comparison. The posterior probability of a correct assignment varies greatly from group to group. For example, if a database search hits the nucleotide kinase activity group with a p-value less than or equal to 10^{-100}, one can be almost certain that the protein with that query sequence belongs to the nucleotide kinase activity group. On the other hand, a hit to the protein kinase activity group with the same p-value would carry only 13% confidence that the protein belongs to the group. We believe that the high degree of variation observed in the posterior probabilities indicates that the posterior probability could serve as a key measure in protein function predictions.
Conclusion
In this paper, we studied the correlation between protein sequence similarity and function similarity for the yeast proteome in the context of the three gene ontologies. The results indicate that protein pairs in a GO group tend to have higher sequence similarity than a randomly drawn sequence pair, although the p-value distributions of sequence pairs in GO groups are heterogeneous and vary from group to group. We conclude that sequence similarity can serve as one of the key measures in protein function prediction. However, the results do not directly translate into a high confidence of the function prediction provided by automated protein annotation systems that are solely based on sequence similarity and GO definitions. These methods can serve as a preliminary tool for functional predictions. The resulting annotations have to be verified through other means. A method that combines a broader range of measures, including sequence similarity, GO definitions, gene expression patterns, as well as available knowledge of the organism under study, is more likely to provide more accurate function prediction. Our study indicates that the posterior probability of a correct prediction could serve as one of those key measures.
Methods
The complete yeast (Saccharomyces cerevisiae) proteome was obtained from Swiss-Prot [29] in July 2005. It includes 6467 protein sequences. GO definition files were obtained from the Gene Ontology consortium web site [30]. In the version of July 1, 2005, there are 19094 GO terms including 9856 biological process terms, 7559 molecular function terms, and 1679 cellular component terms. Among the 6467 protein sequences, 4175 are annotated with 1084 biological process terms, 3317 are annotated with 1060 molecular function terms, and 4735 are annotated with 354 cellular component terms.
Sequence similarities can be measured through pairwise global alignments or local alignments. Homologous protein sequences are usually similar over active domains and thus share common folds and functions. Therefore, local alignment is a more appropriate method for comparing protein sequences for their functional similarity. There are several local alignment schemes for comparing protein sequences, including BLAST that can be used together with different scoring systems such as BLOSUM62 and BLOSUM80. The program returns a list of local alignments of certain statistical significance. However, how to measure the overall similarity of two protein sequences is not obvious. For example, proteins with two similar domains with certain similarity scores could be considered to be much more similar than proteins with only one domain with a higher similarity score. In this study, we introduce a novel measure of overall similarity of two protein sequences. We utilize the alignment tool for blasting two sequences to obtain the list of optimal local alignments. Let {S_{1},...,S_{n}} be scores of a list of best local alignments with certain statistical significance. Instead of using the highest score (max{S_{1},...,S_{n}}) in the list to measure the overall similarity of the two protein sequences, we use the following score S to measure the overall similarity of two sequences:
S = -\sum_{i=1}^{n} \ln p_{i} = -\sum_{i=1}^{n} \ln(1 - e^{-E_{i}}), (1)
where p_{i} = 1 - e^{-E_{i}} stands for the probability of finding a high-scoring segment pair (HSP) with a local alignment score of at least S_{i}, and E_{i} is the expected number of HSPs of score at least S_{i}, which can be obtained directly from the alignment tool. Assuming the HSPs are independent of each other, the p-value:
p = e^{-S} = \prod_{i=1}^{n} p_{i} (2)
measures the probability of finding a pair of protein sequences with a list of scores at least {S_{1}, ..., S_{n}}. The corresponding E-value for the overall similarity (the expected number of lists that have scores at least {S_{1}, ..., S_{n}}) can therefore be written as:
E = -\ln(1 - e^{-S}). (3)
We use the p-values and the E-values to measure the overall similarity of a pair of protein sequences. Since -\ln(1 - x) = x + x^{2}/2 + O(x^{3}) when |x| < 1, the E-values and p-values are essentially the same when they are small. For example, when p = 10^{-5}, the difference between E and p is of order 10^{-10}. For convenience, we use p-values to present our results in this paper. The alignment tool used for blasting two sequences (bl2seq) was retrieved from the NCBI ftp site [28]. We used version 2.2.11 with default parameters and the substitution matrix BLOSUM62. All processing scripts were written in Perl.
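As a rough illustration of how the overall score, p-value and E-value relate, the following Python sketch combines a list of per-HSP E-values. It assumes the standard BLAST relation p_{i} = 1 - e^{-E_{i}} and independence between HSPs; the function name and the handling of E-values reported as zero are assumptions made here, not part of the authors' Perl pipeline.

```python
import math


def overall_similarity(e_values):
    """Return (S, p, E) for a list of non-negative per-HSP E-values.

    S = -sum(ln p_i) with p_i = 1 - exp(-E_i), p = exp(-S),
    and E = -ln(1 - p).
    """
    if not e_values:
        raise ValueError("at least one HSP E-value is required")
    s = 0.0
    for e_i in e_values:
        p_i = -math.expm1(-e_i)  # 1 - exp(-E_i), accurate for small E_i
        if p_i <= 0.0:
            # Assumption: an E-value reported as 0 is treated as a perfect hit.
            return math.inf, 0.0, 0.0
        s -= math.log(p_i)
    p = math.exp(-s)
    # -ln(1 - p); undefined at p == 1, where the expected count is unbounded.
    e = -math.log1p(-p) if p < 1.0 else math.inf
    return s, p, e


s, p, e = overall_similarity([1e-3, 1e-2])
print(s, p, e)
```

With a single HSP the combined E-value reduces to the HSP's own E-value, and each additional significant HSP lowers the combined p-value, which is the behaviour the measure is designed to capture for multi-domain matches.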
GO terms in each of the three ontology categories were parsed and stored in a tree structure similar to the one used in AmiGO [31] to form gene ontology trees. Since GO terms are originally organized in a DAG, a GO term may have several parent terms. In this case, the child term appears multiple times on the same or different levels of the tree. In this paper, we define the level of a GO term to be the lowest level on which it appears, i.e. the shortest distance of the GO term from the root. Protein sequences were then parsed and mapped onto the gene ontology trees to form GO groups. GO groups with fewer than six protein sequences were removed to keep the results statistically meaningful. As a result, 4175 sequences in 906 distinct biological process groups, 3317 in 362 distinct molecular function groups, and 4735 in 284 distinct cellular component groups are included in the analysis. The final biological process ontology tree consists of 11 levels and 11091 tree nodes (GO groups), of which 903 are unique. The molecular function ontology tree consists of 9 levels and 471 tree nodes, of which 362 are unique. The cellular component tree has 7 levels and 1692 tree nodes, of which 284 are unique.
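The level definition above (shortest distance from the root in the GO DAG) corresponds to a breadth-first search. The Python sketch below illustrates this under stated assumptions: the input is a hypothetical child-to-parents mapping, the root is assigned level 0 (the numbering convention used in the paper is not stated here), and the term names are placeholders rather than real GO identifiers.

```python
from collections import deque


def go_levels(parents, root):
    """Map each reachable term to its shortest distance from the root.

    parents: dict mapping a child term to a list of its parent terms.
    """
    # Invert the child -> parents mapping into parent -> children.
    children = {}
    for child, parent_terms in parents.items():
        for parent in parent_terms:
            children.setdefault(parent, []).append(child)

    levels = {root: 0}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children.get(term, []):
            if child not in levels:  # first BFS visit gives the shortest path
                levels[child] = levels[term] + 1
                queue.append(child)
    return levels


# Placeholder DAG: term "d" has two parents at different depths.
toy_parents = {"a": ["root"], "b": ["root"], "c": ["a"], "d": ["c", "b"]}
print(go_levels(toy_parents, "root"))
```

In the toy DAG, the term with two parents is reachable by a path of length 3 and a path of length 2, and the breadth-first search assigns it the shorter one.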
To evaluate how much protein sequence similarity can contribute to biological function prediction, the posterior probabilities of correct predictions can be computed using Bayes' theorem [32]:
where G represents a GO group, ε represents the p-value threshold for a sequence pair s_{1} and s_{2}, while the p-value p(s_{1}, s_{2}) is calculated based on equation (2).
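One form of the posterior that is consistent with the description above is P(s_{1} ∈ G | s_{2} ∈ G, p(s_{1}, s_{2}) ≤ ε) = P(p(s_{1}, s_{2}) ≤ ε | s_{1} ∈ G, s_{2} ∈ G) · P(s_{1} ∈ G | s_{2} ∈ G) / P(p(s_{1}, s_{2}) ≤ ε | s_{2} ∈ G). This is an assumed reading of Bayes' theorem for this setting and may differ in detail from the authors' equation (4). The Python sketch below estimates each term by counting pairs; the function name, the data layout, and the toy p-values are all hypothetical.

```python
def posterior_same_group(pvalue, sequences, group, eps):
    """Estimate P(s1 in G | s2 in G, p(s1, s2) <= eps) via Bayes' theorem.

    pvalue:    dict mapping frozenset({a, b}) to the overall p-value of the pair
    sequences: list of all sequence identifiers
    group:     set of identifiers annotated with GO group G
    """
    # All ordered pairs (s1, s2) with s2 in G and s1 different from s2.
    candidates = [(a, b) for b in group for a in sequences if a != b]
    both_in_g = [(a, b) for a, b in candidates if a in group]

    def hit(a, b):
        return pvalue[frozenset((a, b))] <= eps

    prior = len(both_in_g) / len(candidates)                    # P(s1 in G | s2 in G)
    likelihood = sum(hit(a, b) for a, b in both_in_g) / len(both_in_g)
    evidence = sum(hit(a, b) for a, b in candidates) / len(candidates)
    if evidence == 0:
        return float("nan")  # no pair passes the threshold
    return likelihood * prior / evidence


# Hypothetical toy data: four sequences, group G = {"a", "b"}.
toy_p = {
    frozenset(("a", "b")): 1e-6, frozenset(("a", "c")): 1e-4,
    frozenset(("a", "d")): 0.5, frozenset(("b", "c")): 0.3,
    frozenset(("b", "d")): 0.2, frozenset(("c", "d")): 1e-8,
}
print(posterior_same_group(toy_p, ["a", "b", "c", "d"], {"a", "b"}, 1e-3))
```

In the toy data, three ordered pairs pass the threshold and two of them have both members in G, so the estimate is 2/3; a group whose members also resemble many outsiders would yield a lower posterior even at a very small ε.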
Authors' contributions
ZHD proposed the research idea, designed and implemented the research approaches, and drafted the manuscript. BH participated in the implementation of the research approaches and the result analysis. LR contributed to the idea of introducing a novel measure for the overall similarity of two protein sequences and participated in the result discussion and the manuscript writing. DMP and TS participated in developing the research idea, provided the explanation of the biological meaning of the results, and contributed to the manuscript writing. All authors read and approved the final manuscript.
Acknowledgements
The work is partially supported by NSF DUE 0410727 (ZHD), NIH 1R01HL061438 (DMP), and UA faculty research fellowship (ZHD).
This article has been published as part of BMC Bioinformatics Volume 7, Supplement 4, 2006: Symposium of Computations in Bioinformatics and Bioscience (SCBB06). The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/7?issue=S4.
References

1. Altschul SF, Boguski MS, Gish W, Wootton JC: Issues in searching molecular sequence databases. Nature Genetics 1994, 6:119-129.
2. Casari G, Andrade MA, Bork P, Boyle J, Daruvar A, Ouzounis C, Schneider R, Tamames J, Valencia A, Sander C: Challenging times for bioinformatics. Nature 1995, 376:647-648.
3. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 1997, 25:3389-3402.
4. Smith TF, Zhang X: The challenges of genome sequence annotation or "The devil is in the details". Nature Biotechnology 1997, 15:1222-1223.
5. Bork P, Dandekar T, Diaz-Lazcoz Y, Eisenhaber F, Huynen M, Yuan Y: Predicting function: from genes to genomes and back. Journal of Molecular Biology 1998, 283:707-725.
6. Bork P, Koonin EV: Predicting functions from protein sequences: where are the bottlenecks? Nature Genetics 1998, 18:313-318.
7. Doerks T, Bairoch A, Bork P: Protein annotation: detective work for function prediction. Trends in Genetics 1998, 14:248-250.
8. Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO, Eisenberg D: A combined algorithm for genome-wide prediction of protein function. Nature 1999, 402:83-86.
9. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA 1999, 96:4285-4288.
10. Overbeek R, Fonstein M, D'Souza M, Pusch GD, Maltsev N: The use of gene clusters to infer functional coupling. Proc Natl Acad Sci USA 1999, 96:2896-2901.
11. Hoersch S, Leroy C, Brown NP, Andrade MA, Sander C: The GeneQuiz web server: protein functional analysis through the Web. Trends in Biochemical Sciences 2000, 25:33-35.
12. Sakata K, Nagamura Y, Numa H, Antoniol BA, Nagasaki H, Idonuma A, Watanabe W, Shimizu Y, Horiuchi I, Matsumoto T, Sasaki T, Higo K: RiceGAAS: an automated annotation system and database for rice genome sequence. Nucleic Acids Research 2002, 30:98-102.
13. Riley ML, Schmidt T, Wagner C, Mewes HW, Frishman D: The PEDANT genome database in 2005. Nucleic Acids Research 2005, 33(Database):D308-D310.
14. Troyanskaya OG, Dolinski K, Owen AB, Altman RB, Botstein D: A Bayesian framework for combining heterogeneous data sources for gene function prediction. Proc Natl Acad Sci USA 2003, 100:8348-8353.
15. Zhou Y, Young JA, Santrosyan A, Chen K, Yan SF, Winzeler EA: In silico gene function prediction using ontology-based pattern identification. Bioinformatics 2005, 21:1237-1245.
16. Schug J, Diskin S, Mazzarelli J, Brunk BP, Stoeckert CJ Jr: Predicting gene ontology functions from ProDom and CDD protein domains. Genome Research 2002, 12:648-655.
17. Xie H, Wasserman A, Levine Z, Novik A, Grebinskiy V, Shoshan A, Mintz L: Large-scale protein annotation through Gene Ontology. Genome Research 2002, 12:785-794.
18. Hennig S, Groth D, Lehrach H: Automated Gene Ontology annotation for anonymous sequence data. Nucleic Acids Res 2002, 31:3712-3715.
19. Zehetner G: OntoBlast function: From sequence similarities directly to potential functional annotations by ontology terms. Nucleic Acids Research 2003, 31:3799-3803.
20. Martin DM, Berriman M, Barton GJ: GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes. BMC Bioinformatics 2004, 5:178.
21. Abascal F, Valencia A: Automatic annotation of protein function based on family identification. PROTEINS: Structure, Function, and Genetics 2003, 53:683-692.
22. Jensen LJ, Gupta R, Staerfeldt HH, Brunak S: Prediction of human protein function according to Gene Ontology categories. Bioinformatics 2003, 19:635-642.
23. Vinayagam A, Konig R, Moormann J, Schubert F, Eils R, Glatting KH, Suhai S: Applying support vector machines for Gene Ontology based gene function prediction. BMC Bioinformatics 2004, 5:116.
24. Gerlt JA, Babbitt PC: Can sequence determine function? Genome Biol 2000, 1:Reviews0005.1-0005.10.
25. Ouzounis CA, Karp PD: The past, present and future of genome-wide re-annotation. Genome Biol 2002, 3:Comment2001.1-2001.6.
26. Sali A: Functional links between proteins. Nature 1999, 402:23-26.
27. The Gene Ontology Consortium: Gene ontology: tool for the unification of biology. Nature Genetics 2000, 25:25-29.
28. Tatusova TA, Madden TL: Blast 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiology Letters 1999, 174:247-250.
29. Yeast proteome [ftp://ftp.expasy.org/databases/complete_proteomes/entries/eukaryota/]
30. GO terms [ftp://ftp.geneontology.org/pub/go/ontologyarchive/]
31. AmiGO [http://www.godatabase.org/cgibin/amigo/go.cgi]
32. Weiss NA: Bayes's Rule. In Introductory Statistics. 7th edition. New York: Addison Wesley; 2004:195-200.