Evaluation and integration of functional annotation pipelines for newly sequenced organisms: the potato genome as a test case

Amar, David; Frades, Itziar; Danek, Agnieszka; Goldberg, Tatyana; Sharma, Sanjeev K; Hedley, Pete E; Proux-Wera, Estelle; Andreasson, Erik; Shamir, Ron; Tzfadia, Oren; Alexandersson, Erik

doi:10.1186/s12870-014-0329-9

Research article
Open access
Published: 05 December 2014

Evaluation and integration of functional annotation pipelines for newly sequenced organisms: the potato genome as a test case

David Amar¹,
Itziar Frades²,
Agnieszka Danek³,
Tatyana Goldberg⁴,
Sanjeev K Sharma⁵,
Pete E Hedley⁵,
Estelle Proux-Wera^2,6,
Erik Andreasson²,
Ron Shamir¹,
Oren Tzfadia⁷ &
…
Erik Alexandersson²

BMC Plant Biology volume 14, Article number: 329 (2014) Cite this article

4906 Accesses
16 Citations
1 Altmetric
Metrics details

Abstract

Background

For most organisms, even if their genome sequence is available, little functional information about individual genes or proteins exists. Several annotation pipelines have been developed for functional analysis based on sequence, ‘omics’, and literature data. However, researchers encounter little guidance on how well they perform. Here, we used the recently sequenced potato genome as a case study. The potato genome was selected since its genome is newly sequenced and it is a non-model plant even if there is relatively ample information on individual potato genes, and multiple gene expression profiles are available.

Results

We show that the automatic gene annotations of potato have low accuracy when compared to a “gold standard” based on experimentally validated potato genes. Furthermore, we evaluate six state-of-the-art annotation pipelines and show that their predictions are markedly dissimilar (Jaccard similarity coefficient of 0.27 between pipelines on average). To overcome this discrepancy, we introduce a simple GO structure-based algorithm that reconciles the predictions of the different pipelines. We show that the integrated annotation covers more genes, increases by over 50% the number of highly co-expressed GO processes, and obtains much higher agreement with the gold standard.

Conclusions

We find that different annotation pipelines produce different results, and show how to integrate them into a unified annotation that is of higher quality than each single pipeline. We offer an improved functional annotation of both PGSC and ITAG potato gene models, as well as tools that can be applied to additional pipelines and improve annotation in other organisms. This will greatly aid future functional analysis of ‘-omics’ datasets from potato and other organisms with newly sequenced genomes. The new potato annotations are available with this paper.

Background

Potato (Solanum tuberosum) is the 3rd largest food crop in terms of human consumption [1]. It is therefore important for our food security, and understanding its genome is called for. Examples of major challenges in potato research are its sensitivity to drought stress and its lack of resistance to certain diseases, e.g., the oomycete Phytopthora infestans, which caused the Irish famine in the 1840’s. Farmers need to use large amounts of fungicides to protect their potato crops, thereby increasing the cost of cultivation and threatening the environment. For example, the global cost of protection and yield loss due to P. infestans has been estimated at €4800 M annually [2].

Recently, the potato genome (Solanum tuberosum group Phureja) was sequenced by the Potato Genome Sequencing Consortium (PGSC). The PGSC analysis of the genome reported gene models for 39,031 representative transcripts, and 56,218 including splicing variants [3]. In a later effort, the International Tomato Annotation Group (ITAG) produced new gene models by jointly analyzing the tomato and potato genomes [4]. These new gene models covered 34,727 and 35,004 predicted protein-coding genes for the tomato and the potato genomes, respectively. Unfortunately, few experimentally validated genes (e.g., by fluorescent-tagged proteins, or gene knock-outs) are available in newly sequenced genomes in which, unlike established model organisms, few genes have verified functions such as the case is for potato. Comprehensive and accurate functional annotation of the genes in such recently sequenced genomes is a prerequisite to efficient exploitation of these genomic data.

A key tool for functional annotation is the Gene Ontology (GO), which provides a structured set of defined terms representing gene properties [5]. The structure of gene ontology is composed of three major domains: cellular component (CC), the parts of a cell or its extracellular environment; molecular function (MF), the elemental activities of a gene product at the molecular level; and biological process (BP), which describes a set of functionally related molecular events. Thus, the complete GO structure provides a unified vocabulary of biological terms, which can also be used to evaluate biological similarity of different terms [6]. Annotating a gene means placing it within some or all of the three gene ontology domains.

Recent advances in plant science are marked by the rapidly increasing availability and quality of high-throughput sequencing data. The most basic usage of these data is gene function prediction, wherein GO plays a pivotal part. There are several computational suites like EXPANDER [7], MapMan [8], Mercator [9] and AmiGO [10] that enable biologists to run GO enrichment analyses in several plant model systems. This is usually done by first identifying a group of genes that behave similarly in a given expression dataset, seeking ontology terms highly enriched in the group, and associating the highly enriched functions with unannotated genes that belong to the same group. This process is sometimes called “guilt by association”. Automated gene function annotation is also relevant for well-investigated plant model organisms, such as Arabidopsis thaliana, tomato, Brachypodium and rice, wherein ~40% of the genes still do not have any known function [11].

In order to assign functional annotation to sequenced plant transcripts, researchers can use several sequence-based annotation pipelines. For a comprehensive summary of methods and principles behind automated functional annotation see [12]. Some recent efforts have been made to characterize the annotation quality of plant genomes. For example, Jaramillo-Garzón, et al. [13] used sequence features and showed high predictability of MF and CC terms and lower predictability of BP terms. However, the analysis was limited to a small subset of the GO terms (GO-Slim). Ramsak, et al. [8] presented GOMapMan, a tool for visualization and analysis of gene annotation in plants. In potato, information from orthologous gene families across 26 sequenced plant genomes was analyzed in order to increase the number of potato genes associated with GO terms [14]. Still, a robust, automated approach to evaluate and compare genome-wide annotation pipelines is direly needed.

A typical genome-wide functional annotation of newly sequenced organisms starts by using a single ‘default’ pipeline. Here, we analyzed the two sets of potato gene models, from the ITAG and PGSC. We compared six annotation pipelines: Trinotate HMM, Trinotate BLAST [15], OrthoMCL-UniProt [16], BLAST2GO [17], Phytozome [18] and InterPro2GO provided in BioMart [19] (Figure 1). These pipelines were chosen because they seek to provide a comprehensive annotation of the whole genome. Some of these pipelines are based solely on sequence similarity (BLAST), others rely on specific domains and some are based on clustering of groups of orthologous gene families. As we shall show, one clear conclusion of this work is that functional annotations of genomes should rely on more than one annotation pipeline.

By examining the GO terms generated by these pipelines, we demonstrate that they predict very dissimilar annotations (e.g., on average, less than 30% of the genes annotated by two pipelines are assigned with the same function). To evaluate the performance of the pipelines we first created a set of potato genes (hereafter referred to as “gold standard”), with known functional characterization, including genes from the well characterized biosynthetic Carotenoids pathway. We show that pipelines may have rather low accuracy compared to the gold standard. Since the size of the gold standard is rather modest (116 PGSC genes ids), we used an additional validation scheme based on gene expression data. Under the premise that genes participating in the same biological process should have more similar expression pattern than expected by chance, we evaluated the predictions of each pipeline based on its intra-process gene co-expression level. We show that while all pipelines provide much higher intra-process co-expression than expected by chance, there are large differences among the methods. We introduce a simple method to combine the results of the different pipelines into a single integrated annotation. Compared to the single pipelines, it improved gene coverage, prediction precision, and the overall co-expression of predicted GO processes. In addition to improved annotation of potato genes, our analysis provides generic tools that can be applied to improve the annotation of other newly sequenced plants.

Results and discussion

A compendium of the state-of-the-art annotation tools

In this study, we tested automatic annotation pipelines on the potato genome. We used six state-of-the-art tools for GO gene function prediction: (1) Trinotate HMM, (2) Trinotate BLAST [15], (3) OrthoMCL-UniProt [16], (4) BLAST2GO [17], (5) Phytozome [18], and (6) InterPro2GO [19]. See Methods and Additional file 1: Methods S1-4 for details. We note that every program has its own set of parameters and fitting the best parameter combination for a particular dataset is a substantial effort. The common practice in this area is to use published tools with the default parameter values (see e.g. [20],[21]. If necessary, we then mapped its predicted functions to GO terms using automated mapping files such as Pfam2GO, and the genes and transcripts to protein identifiers. Thus, in our analysis a gene corresponds to either a transcript or a protein that appeared in the output of the pipelines. Next, the output of each pipeline was summarized as a set of predicted gene-GO term pairs. For each gene we then retained only the most “specific” GO terms. That is, in case a gene is associated with two GO terms A and B, but B is a generalization of A (i.e. an ancestor of A in the GO hierarchy), we excluded B. We call this step ancestor removal. Note that after filtering, many genes were still associated with more than one GO term, since a gene can have several associated annotations none of which is an ancestor of another. For the output of all pipelines, see Additional file 2: Table S1, Additional file 3: Table S2, Additional file 4: Table S3, Additional file 5: Table S4, Additional file 6: Table S5 and Additional file 7: Table S6 for PGSC, and Additional file 8: Table S7, Additional file 9: Table S8, Additional file 10: Table S9, Additional file 11: Table S10, Additional file 12: Table S11 and Additional file 13: Table S12 for ITAG. Although Gene Ontology has its limitations as it is biased towards what is already known, it is still a universal key tool for functional annotation inferring functionality based on sequence identity, domains and structure, and literature studies.

Disparity among pipelines

The output from each pipeline can be represented as a triplet (P, G, GO) where P is the set of all predicted gene-GO term pairs (after ancestor removal), G is the set of genes covered by P, and GO is the set of GO terms covered by P. We measured the pairwise similarity between the triplets obtained from the six pipelines used in the study. Three different ways were used to compare the output of two pipelines A = (P_A, G_A, GO_A) and B = (P_B, G_B, GO_B). First, we measured the overlap between the predictions of the pipelines P_A and P_B. This was done by calculating the ratio between the size of the intersection of P_A and P_B and the size of the union of P_A and P_B. This measure is called the Jaccard score [22],[23]. Second, we measured the similarity between the covered gene sets G_A and G_B of the pipelines by calculating their Jaccard scores. These two scores are complementary: the first measures the overall similarity between A and B, whereas the second measures the tendency of A and B to cover the same genes. However, these scores ignore the GO structure and thus they are oblivious to the functional similarity among different GO terms. Therefore, we also used a similarity score based on the semantic similarity of GO terms [24]. Given a specific GO type GT (BP or MF), for each gene we measured the semantic similarity between its GO terms in A and its GO terms in B. We then took the average over all genes as the similarity of A and B in GT (see Methods for details). As this score uses the structure of the GO hierarchy, we call it structure-based.

An example of the structure-free similarity of the predictions is shown in Figure 2A. The figure shows the pairwise Jaccard score between the PGSC MF predictions of the pipelines. Overall the similarity is low, averaging 0.27. Nevertheless, local patterns can be observed. For example, InterPro2GO, Trinotate HMM, and Phytozome were more similar (average 0.46). Figure 2B shows the Jaccard similarity between the PGSC genes annotated by the different pipelines. The mean similarity was a higher 0.54, which is still quite low. This indicates that different pipelines tend to cover different genes and, even when covering the same genes, they often associate distinct annotations to them. Even when re-computing the structure-free similarity restricted only for the genes shared by each pair of pipelines (considering both MF and BP predictions), the average score was only 0.27.

The structure-based MF and BP similarity of PGSC genes is summarized in Figure 2C and 2D. Similar matrices on ITAG data are shown in Additional file 1: Figure S1. Again, pipelines tend to be very different, with average similarity of 0.29 in BP and 0.42 in MF. The scores are higher than for the structure-free approach because the structure-based approach assigns higher scores when predictions are different but biologically similar. Also, like in the structure-free scores in Figure 2A, InterPro2GO, Trinotate HMM, and Phytozome formed a cluster both in BP and in MF. Taken together, the discrepancies among pipelines show that pipelines differ in the sets of genes they cover, and the annotation of the same genes in different pipelines can be quite dissimilar.

Ensemble of pipelines

The marked disparity in gene annotation by different pipelines calls for an integration of the different predictions in order to provide a unified potato gene annotation. We developed a simple ensemble algorithm inspired by previous studies [25]. Our algorithm takes as input the predictions of all pipelines and for each gene merges its predictions into a vector of scores denoted as the gene’s combined profile (Figure 3). Briefly, we first calculate the pipeline-specific gene profiles. For a specific pipeline that predicted the pair (G, t), where G is a gene and t is a GO term, the t-th position of the profile is 1 if G is associated with t or at least one of its descendants, and otherwise it is 0 (top right in Figure 3). The combined profile of each gene G is the sum of its pipeline-specific profiles (Figure 3 right). The value in the combined profile of a gene shows how many pipelines agree with each gene-GO term association. Given a threshold k, for each gene we report all GO terms with a combined score ≥ k. This process produces a list of GO terms for each gene. We call this variant Ensemble-k. Finally, we apply the ancestor removal filter described above. Thus, each value of k produces a different variant of the ensemble algorithm. Figure 3 shows a toy example of Ensemble-1 and 2. For clarity, in the next sections we use the name annotation method for both pipelines and variants of the ensemble algorithm. We also tested a more involved supervised ensemble method, which in addition ranks the pipelines by their average F-measure against a gold standard (see below), but this did not improve the results (see Additional file 1: Method S6).

We compared the annotation methods in terms of gene coverage and the average number of GO terms per gene, which we denote as NGPG. Ideally, gene coverage should be as high as possible, while NGPG should be low [26]. The results are shown in Figure 4A and 4B. One can observe marked differences between the different pipelines, and between ITAG and PGSC gene models. For example, based on PGSC data, InterPro2GO and OrthoMCL-UniProt have the highest gene coverage (29,445 and 26,371, respectively), and NGPG score (7 and 7.1, respectively). However, based on ITAG data, OrthoMCL-UniProt’s results were similar to those for PGSC, while for InterPro2GO the number of genes dropped under 20,000 and the NGPG score increased to 8.1 (Figure 4B).

Figure 4A and 4B also show the gene coverage and the NGPG of the ensemble algorithm. As expected, using either Ensemble-1 or 2 increased the gene coverage compared to the single pipelines using both ITAG and PGSC gene models. For example, based on PGSC the number of covered gene models (including splicing variants) was 41,668 (k = 1) and 29,495 (k = 2). Larger k values led to a sharp decrease in gene coverage, such that even single pipelines covered more genes. Using Ensemble-1, the NGPG score was similar to the highest score obtained by a single pipeline, reaching a score of 6.70 on PGSC data, and 8.15 on ITAG data. Ensemble-2 led to a sharp decrease in NGPG: 4.39 on PGSC, and 4.68 on ITAG.

In summary, our results show that the ensemble algorithm increases the gene coverage considerably without increasing the NGPG score. Ensemble-1 increased gene coverage by more than 5000 genes on both ITAG and PGSC data, while keeping the NGPG score similar to that of the highest single pipelines. Ensemble-2 increased the gene coverage only moderately compared to the single pipelines but the NGPG score declined sharply compared to all pipelines (except Phytozome, but the latter has low gene coverage), hence providing much more focused annotations. In the next sections we demonstrate that the aforementioned improvements were not achieved at the expense of precision.

Validation using the potato gold standard

To evaluate predictions of the different annotation methods we compiled a gold standard of 838 and 724 gene-GO term pairs based on PGSC and ITAG data, respectively, using manual annotation by experts (see Methods and Additional file 14: Table S13, Additional file 15: Table S14 and Additional file 16: Table S15). The number of genes included in the gold standard (43 with literature references, which are mapped to 116 PGSC gene ids, see Additional file 14: Table S13), is small, but in an organism such as potato it still contains the majority of genes with experimental evidence. We evaluated the annotation methods by calculating their GO-based precision and recall. Use of the GO structure to calculate scores for gold standard validation has been previously suggested by [27]. The GO-based recall of a gene measures the extent to which its terms according to the gold standard are covered by its predicted GO terms. The GO-based precision of a gene measures the extent its predicted GO terms match the gold standard terms. For each pipeline we calculated the average precision and average recall (over the genes) and report the F-measure, which is the harmonic mean of the precision and the recall [28]. See Methods for a full description of these calculations.

The results of the validation based on PGSC and ITAG data are illustrated in Figures 5 and Additional file 1: Figure S2, respectively. Figure 5A shows the F-measure for BP GO terms. Ensemble-1 and 2 reached F-measures of 0.8 and 0.77, respectively, while the top performing pipeline was InterPro2GO with only 0.61. Figure 5B shows the F-measure on the MF gold standard. Ensemble-1 and 2 reached F-measures of 0.84 and 0.83, respectively, whereas the top performing pipeline was InterPro2GO with an F-measure of only 0.71. Thus, the results are in agreement with the BP validation: Ensemble-1 and 2 performed best and improved upon the single pipelines. Taken together, our results indicate that Ensemble-1 and 2 provide a significant improvement in comparison to single pipelines.

Validation using gene expression data

An obvious disadvantage of any gold standard is that it is limited to experimentally validated genes and subject to the opinion of experts. Consequently, we added an additional validation based on gene co-expression analysis, where we measured the ability of pipelines to predict the same GO-term to highly co-expressed genes. Our co-expression analysis is based on the gene expression of 12,956 genes in 326 expression profiles from over 20 microarray studies. We used the Pearson correlation coefficient to measure co-expression between genes.

We used the gene pairwise co-expression scores to validate predicted GO BP terms. In order to reduce noise, we ignored terms with >500 genes, or with fewer than five genes. Given a set of genes predicted to be associated with the same GO term according to a specific annotation method, we tested if the level of co-expression among its genes is higher than expected by chance (see Methods for details). Thus, for each term in a specific annotation method we calculated a single p-value. To summarize these values when comparing methods we calculated two scores: (1) the number of GO terms with p <0.001, and (2) the percentage of GO terms with p <0.001 (out of all predicted terms with at least three genes). The former is a measure of coverage of significant GO terms, whereas the latter is a measure of quality of the predicted GO BP terms. Similarly to the gold standard, this analysis simply aimed to compare pipelines. Future work can use similar approaches to select highly co-expressed GO terms from different pipelines for subsequent analyses.

The results of the gene co-expression validation based on PGSC data are shown in Figure 6. See Additional file 1: Figure S3 for results of ITAG. The top two pipelines in terms of the number of significant GO terms were InterPro2GO (n = 411) and BLAST2GO (n = 345). The top two pipelines in terms of the percentage of significant GO terms were InterPro2GO (35%) and Phytozome (30%). The ensemble algorithm markedly improved the number of significant GO terms: Ensemble-1 achieved 718, and Ensemble-2 achieved 650. However, the ensemble methods did not improve upon the single pipelines in terms of the percentage of significant GO terms: Ensemble-1 and 2 achieved 22% and 27%, respectively. Nevertheless, the score of Ensemble-2 was better than all pipelines except for InterPro2GO and Phytozome. Thus, the ensemble approach provided an improvement of at least 1.5-fold in the number of significant GO terms, at the expense of a drop of 8% in the percentage of significant GO terms compared to the best pipeline. Note that the co-expression and the GO analyses are complementary, since the gold standard genes do not manifest unusually high co-expression (see Additional file 1: Methods S7).

Merging the different merits using a rank-based comparison

Our analysis shows that the ensemble approach is beneficial according to most criteria. However, since we used multiple ways to score the methods, it is hard to decide which k value is best and which pipelines are better. To provide a clear unified view we used a non-parametric rank-based consolidation of the different scores [29]. In the previous sections, for each annotation method we calculated two F-measure scores in the gold standard analysis and two scores in the gene co-expression analysis. In addition, we compared the annotation methods by their gene coverage and NGPG. Note that when ranking methods by their NGPG score, lower scores are better. In contrast, when ranking methods by their gene coverage, higher scores are better. To consolidate these different scores, we used six rankings: by gene coverage and the NGPG score, by the two F-measures of the gold standard validation and by the two scores of the gene co-expression validation. We reversed the scores when necessary so that rank 1 was the best for each method, averaged the rankings and ranked the methods by their average rank. We call this score rank-merge.

Figure 7 displays the rank-merge results on PGSC (A) and ITAG (B) data. The top three methods are colored black. In both cases the top method was Ensemble-2, with an average rank of 1.66 in PGSC and 1.16 in ITAG. Among the different pipelines evaluated, Phytozome obtained the top score for PGSC data with an average rank of 3.66 while BLAST2GO obtained top score for ITAG data with an average rank of 3.50. Note that Ensemble-1, 2, and 3 were ranked consistently high in both tests. See also Additional file 17: Table S16 for PGSC and Additional file 18: Table S17 for ITAG. Thus, we conclude that the ensemble approach, especially with k = 2, is beneficial and can assist in integration of different gene function prediction pipelines. See Additional file 1: Method S5 for details on reproducing the results and applying the pipeline to new genomes.

Note that using k = 1 is equivalent to assigning to each gene all its annotations from all pipelines (and their ancestors) and then performing ancestor removal. While this method is the most intuitive ensemble, we show here that varying the k parameter can improve the annotation of genomes.

A seemingly natural test case for our approach is to evaluate it in predicting function of Arabidopsis genes. However, it is not clear how this can be done in a rigorous and unbiased manner. Tools for functional annotation of genes in newly sequenced plants are heavily dependent on sequence similarity to genes in model species such as Arabidopsis. In order to test such tools in predicting Arabidopsis gene functions, one has to exclude all the annotations directly – or indirectly – derived from Arabidopsis. Doing so would entail tracing indirect annotation sources, which often are not recorded in the pipelines. Instead, we used the newly sequenced potato genome along with experimentally verified gene functions and rich gene expression data in our evaluation.

Conclusion

For recently sequenced, non-model organisms, automatic functional annotation of genes, which also mainly relies on sequence-based prediction, often suffers from low gene coverage and poor specificity. We confirmed that this is the case for the potato genome by analyzing six state of the art annotation pipelines.

We observed that the predictions of different pipelines for functional annotations of genes are markedly different, in spite of the fact that all pipelines are based on sequence analysis. We showed that combining predictions from several pipelines increases both the coverage and the accuracy of gene ontology predictions. The simple ensemble approach used here could be applied easily to other sequenced genomes and improve functional annotation by taking advantage of different GO prediction tools. However, a comparison of the consistency among pipelines is not enough when the goal is to either select the best pipeline or to integrate the different predictions. The pipelines should also be evaluated based on the precision of their predictions. The most intuitive way is to compare the pipelines to a set of known annotations. However, in newly sequenced organisms such as potato, known annotations are scarce in the main public databases. To overcome this, we compiled a gold standard of experimentally-validated gene-GO associations. Although this gold standard is relatively small, we have found it useful for comparing pipelines. Furthermore, to overcome the limited number of genes in the gold standard, we used a second validation method based on gene co-expression testing the ability of pipelines to predict co-expression of genes associated to the same GO-term.

Finally, we introduced an integrated annotation of the different pipelines that outperformed the single pipelines both in the gold standard validation and in the co-expression validation. Our integration approach depends on selecting a parameter k that corresponds to the stringency by which we filter out gene-GO associations. That is, when associating a gene to a GO term, at least k pipelines must agree with this association. Thus, we have implicitly assumed that each of the pipelines we used has meaningful predictions. Moreover, all pipelines have the same weight in the integration process. Future analyses can seek methods that give more weight to better pipelines, or add an initial step that filters out pipelines of exceptionally low prediction quality. The new functional annotations of the potato genome as well as for the probes on the JHI Solanum tuberosum microarray are available with this paper (Additional file 17: Table S16, Additional file 18: Table S17 and Additional file 19: Table S18). We also provide tools as open source R code for implementing the methodology with additional pipelines and for other sequenced organisms.

Methods

Executing the functional annotation pipelines

We defined a pipeline as an automated process that predicts association between genes and functions. The input to a pipeline can be DNA sequence, protein sequence, or protein domains. The output of a pipeline is a set of pairs in the form of (gene ID, GO term ID). We ran all pipelines for the ITAG (potato.Sotub.proteins.itag.v1.fasta) and PGSC (PGSC_DM_v3.4_pep_representative.fasta) gene models separately, using default settings as follows:

The OrthoMCL-UniProt pipeline

We ran the OrthoMCL [16] pipeline in two steps:

1.
Building the clusters of homologs: We retrieved from Phytozome (v9.1) 16 plant proteomes, covering the whole plant phylogeny. Together with the proteomes predicted from the potato PGSC and ITAG gene models, we aligned the proteomes against each other using blastp [30]; (parameters: −e-value: 1e-05 -outfmt 6). We then used OrthoMCL v2 to build clusters of homologous proteins.
2.
Annotating GO terms: To annotate every protein sequence of the 18 complete plant proteomes with GO terms we ran a blast search against the entire UniProt database (version 2013_08) [31] with an e-value cut-off of 1e-10. For every protein sequence we kept a ranked list of the ten best hits (i.e. hits with the lowest e-value). We associated the first hit in the list that had GO annotation in UniProt. An OrthoMCL cluster then inherits all GO terms associated with its proteins, and each PGSC (and ITAG) protein inherits the GO terms of its cluster.

For complete protocol details refer to the Additional file 1: Method S2.

The BLAST2GO pipeline

Using the BLAST2GO interface [17], we blasted the PGSC and ITAG protein sequences against the NCBI NR database (blastp parameters: −e-value: 1e-05 -max_target_seqs 20 -outfmt 5). We then loaded the blastp output files into Blast2GO (v2.6.6, with default parameters) and assigned GO terms to the PGSC and ITAG sequences according to its output.

The trinotate pipeline

In the Trinotate suite [15] we used default settings for the NCBI-BLAST (SwissProt), HMMER [32], and Pfam [33]. For complete protocol details refer to the Additional file 1: Method S3.

The phytozome pipeline

We downloaded the potato annotation from Phytozome v9.1 [http://www.phytozome.net/potato.php; [18] (http://www.phytozome.net/potato.php). The gene annotation is Solanum tuberosum Group Phureja DM1-3 516R44 (CIP801092) Genome Annotation v3.4 mapped to pseudomolecule sequence (PGSC_DM_v3_2.1.10_pseudomolecules.fa).

InterPro2GO data from BioMart

We downloaded the potato data from (http://central.biomart.org/). GO terms in BioMart are derived from the semi-automated InterPro2GO [19].

Formatting pipelines

In order to compare pipelines, we mapped their predicted annotation to a set of common Gene Ontology (GO) terms. If the original pipeline output was not in GO term IDs it was mapped to GO IDs using the gene ontology consortium mapping files for GO terms. We applied this procedure to the pipelines Trinotate, InterPro2GO, BLAST2GO, Phytozome, and in mapping of orthologous and paralogous gene families in 18 sequenced plant species by OrthoMCL clustering.

Composing the potato ‘gold standard’

A ‘gold standard’ set of potato genes was constructed based on literature evidence from functional gene studies by wet-lab experiments in potato reported in PlantCyc [http://pmn.plantcyc.org/PLANT/organism-summary] and a few additional studies on potato [34]-[37]. In total a list of 43 potato genes/proteins was created (Additional file 14: Table S13). These protein names were searched for their corresponding identifiers published by the PGSC [3], resulting in 116 unique PGSC gene identifiers.

The aforementioned list of genes matched 1658 GO terms from all six tested pipelines. Each gene-GO term association was then manually scored with the help of literature searches in an unbiased manner, where the experts assigning scores to GO-associations did not know from which pipeline the annotation originated. Every GO term in the set was scored as ‘1’ (low evidence), ‘2’ (neutral or unknown) and ‘3’ (high evidence). In the final analysis only association scores of 3 were used for the gold standard, producing 838 annotations (Additional file 15: Table S14). To perform analyses on both gene models, PGSC genes were mapped to ITAG genes using BLAST (identity >95%, length >100 amino acids). This produced an ITAG gold standard with 724 annotations (Additional file 16: Table S15).

Comparing pipelines and gold-standard evaluation

Mathematical notations

In the Results section we sketched the calculations for comparing pipelines and evaluating pipelines against the gold standard. Here, we provide a full description of these calculations. For this purpose we start here with more detailed definitions.

Let G be the set of all genes in the tested organism and let T be the set of all GO terms. The output of a pipeline P is a set of pairs P = {p ₁, …, p _k} where each annotation pair p_i = (g_i,t_i) is an association between a gene g_i (in G) and a GO term t_i (in GO). Let BP(P) be the subset of P resulted from taking all pairs in P in which the term t is a biological process. Similarly, define MF(P) for molecular function, and CC(P) for cellular component. Below we define functions of pipelines. Note that by definition each of BP(P), MF(P), and CC(P) is a set of pairs. Thus, in the definitions below P is either the original output of a pipeline or the result of applying BP, MF, or CC on it.

We define Genes(P) as the set of genes covered by P and Terms(P) as the set of GO terms covered by P. We define Genes(P,t) as the set of genes associated with a GO term t according to P, and Terms(P,g) as the set of GO terms associated with a gene g according to P. Finally, we denote Sem(t _i,t _j) as the semantic similarity between two GO terms t _i and t _j. Semantic similarity here is a measure that quantifies the closeness of two terms in the GO graph. There are several ways to calculate semantic similarity among GO terms. In this study we used Wang’s method [6],[24].

Jaccard coefficient between two pipelines

The Jaccard coefficient is a generic measure of similarity between two sets. It is defined as the ratio between the size of the intersection of the sets and the size of the union of the sets. For example, given two pipelines P ₁ and P ₂, denote intersect(P ₁,P ₂) as the set of annotation pairs that are both in P ₁ and in P ₂, and let union(P ₁,P ₂) be the set of annotation pairs that are either in P ₁ or in P ₂. The Jaccard coefficient J _pipeline(P ₁,P ₂) is the ratio between the number of annotation pairs in intersect(P ₁,P ₂) and the number of annotation pairs in union(P ₁,P ₂). In addition, we calculate the Jaccard coefficient J _Genes(P ₁,P ₂) between the gene sets Genes(P ₁) and Genes(P ₂) to measure the tendency of two pipelines to annotate the same genes.

Structure-based similarity between two pipelines

The Jaccard measure above is oblivious to the functional similarity among GO terms. Thus, we used semantic similarity as a means to define a structure-based similarity between two pipelines P ₁ and P ₂. We start by defining the similarity between the set of annotations of a single gene. Given a gene g our goal is to measure the semantic similarity between Terms(P ₁,g) and Terms(P ₂,g). As a first step we define the similarity between a single GO term t and a set of GO terms T’ as:

S i m' (t, T^{'}) = max_{t' \in T'} S e m (t, t')

This score is high only if T’ contains t or similar GO terms. Next, we use this score to calculate the similarity between Terms(P ₁,g) and Terms(P ₂,g) using the running-max-average [6]:

rmaxa (P_{1}, P_{2}, g) = \frac{\sum_{t_{i} \in Terms (P_{1}, g)} S i m' (t_{i}, Terms (P_{2}, g)) + \sum_{t_{j} \in Terms (P_{2}, g)} S i m' (t_{j}, Terms (P_{1}, g))}{|Terms (P_{2}, g)| + | Terms (P_{1}, g) |}

This score will be high only if Terms(P ₁,g) covers the biological functionalities of Terms(P ₂,g) and vice versa. Finally, the overall similarity between P ₁ and P ₂ is the average gene-wise similarity:

S i m (P_{1}, P_{2}) = \frac{\sum_{g \in Genes (P) \cup Genes (P_{2})} rmaxa (P_{1}, P_{2}, g)}{|Genes (P) \cup Genes (P_{2})|}

GO-based precision and recall

The calculations above measure similarity among pipelines. Here we define a way to measure the precision and recall of a pipeline P compared to a gold standard GS. Similarly to P, GS is a set of annotation pairs {gs ₁, …, gs _k} where each pair gs _i = (g _i,t _i) is an association between a gene g _i (in G) and a GO term t _i (in T). We first define the precision of a single gene g. The GO-based precision of pipeline P for gene g measures the extent by which Terms(P,g) is covered by Terms(GS,g):

prec (P, G S, g) = \frac{\sum_{t_{i} \in Terms (P, g)} S i m' (t_{i}, Terms (G S, g))}{|Terms (P, g)|}

The precision of P is defined as the average precision of the genes in Genes(G):

prec (P, G S) = \frac{\sum_{g \in Genes (G S)} prec (P, G S, g)}{|Genes (G S)|}

The GO-based recall of pipeline P for gene g measures the extent by which Terms(P,g) covers Terms(GS,g):

recall (P, G S, g) = \frac{\sum_{t_{i} \in Terms (G S, g)} S i m^{'} (t_{i}, Terms (P, g))}{|Terms (G S, g)|}

The recall of P is defined as the average recall of the genes in Genes (G):

recall (P, G S) = \frac{\sum_{g \in Genes (G S)} recall (P, G S, g)}{|Genes (G S)|}

Microarray data preprocessing and normalization

We have integrated potato gene expression data from over 20 studies based on the Agilent JHI Solanum tuberosum 60 k v1 microarray (ArrayExpress ID: E-MTAB-1655) processed at the James Hutton Institute using standard Agilent recommended methodologies [38]. The studies included 326 conditions derived from the following treatments: moderate heat-stress [38], short- and long-day growth regimes [39], bruising, phosphorous growth regimes, acidity, Phytopthora infestans infection [40], and phosphite [41], BABA [14], ABA, brassinosteroid, SA treatment. Varietal differences and tuber, stem and leaf tissues were included.

We applied quantile normalization using the Limma package [42] and subtracted the background intensity from the foreground intensity for each spot using the ‘normexp’ method [43]. Our normalized expression matrix contained 52,998 probes. In order to reduce statistical noise and to focus on genes with high variation we removed both probes with consistently low expression values across the samples and probes with low variance. Thresholds for probe removal were adjusted as proposed in [44], see Additional file 1: Method S4 for more details. 14,000 probes remained in the data. These probes were mapped to 12,956 genes, approximately the same amount of genes analyzed in Tzfadia, et al. [44].

Evaluating co-expression of predicted GO processes

Given a gene set U associated with a specific GO term, and a gene expression matrix X with genes as rows, we first calculate the Pearson correlation between all pair of genes in U using their expression profiles in X. To evaluate if the correlations in U tend to be higher than expected by chance we sample random gene pairs in X and calculate their correlation to get a distribution of random correlation scores. We used the Kolmogorov-Smirnov test to compare the real correlations scores of U to the random correlation scores. To improve robustness, we repeated this process 50 times for each gene set U and used the mean p-value over all repeats.

Additional files

Abbreviations

GO:: Gene ontology
PGSC:: Potato genome sequencing consortium
ITAG:: International tomato annotation group
CC:: Cellular component
MF:: Molecular function
BP:: Biological process
NGPG:: Number of GO terms per gene

References

Birch PR, Bryan G, Fenton B, Gilroy EM, Hein I, Jones JT, Prashar A, Taylor MA, Torrance L, Toth IK: Crops that feed the world 8: Potato: are the trends of increased global production sustainable?. Food Security. 2012, 4 (4): 477-508. 10.1007/s12571-012-0220-1.
Article Google Scholar
Haverkort A, Boonekamp P, Hutten R, Jacobsen E, Lotz L, Kessel G, Visser R, Van der Vossen E: Societal costs of late blight in potato and prospects of durable resistance through cisgenic modification. Potato Res. 2008, 51 (1): 47-57. 10.1007/s11540-008-9089-y.
Article Google Scholar
Potato Genome Sequencing Consortium: Genome sequence and analysis of the tuber crop potato. Nature. 2011, 475 (7355): 189-195. 10.1038/nature10158.
Article Google Scholar
Zouine M, Latché A, Rousseau C, Regad F, Pech J-C, Philippot M, Bouzayen M, Delalande C, Frasse P, Schiex T: The tomato genome sequence provides insights into fleshy fruit evolution. Nature. 2012, 485: 635-641. 10.1038/nature11119.
Article Google Scholar
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The gene ontology consortium. Nat Genet. 2000, 25 (1): 25-29. 10.1038/75556.
Article PubMed Central CAS PubMed Google Scholar
Wang JZ, Du Z, Payattakool R, Yu PS, Chen CF: A new method to measure the semantic similarity of GO terms. Bioinformatics. 2007, 23 (10): 1274-1281. 10.1093/bioinformatics/btm087.
Article CAS PubMed Google Scholar
Ulitsky I, Maron-Katz A, Shavit S, Sagir D, Linhart C, Elkon R, Tanay A, Sharan R, Shiloh Y, Shamir R: Expander: from expression microarrays to networks and functions. Nat Protoc. 2010, 5 (2): 303-322. 10.1038/nprot.2009.230.
Article CAS PubMed Google Scholar
Ramsak Z, Baebler S, Rotter A, Korbar M, Mozetic I, Usadel B, Gruden K: GoMapMan: integration, consolidation and visualization of plant gene annotations within the MapMan ontology. Nucleic Acids Res. 2013, 42: D1167-D1175. 10.1093/nar/gkt1056.
Article PubMed Central PubMed Google Scholar
Lohse M, Nagel A, Herter T, May P, Schroda M, Zrenner R, Tohge T, Fernie AR, Stitt M, Usadel B: Mercator: a fast and simple web server for genome scale functional annotation of plant sequence data. Plant Cell Environ. 2014, 37 (5): 1250-1258. 10.1111/pce.12231.
Article CAS PubMed Google Scholar
Carbon S, Ireland A, Mungall CJ, Shu S, Marshall B, Lewis S: AmiGO: online access to ontology and annotation data. Bioinformatics. 2009, 25 (2): 288-289. 10.1093/bioinformatics/btn615.
Article PubMed Central CAS PubMed Google Scholar
Lamesch P, Berardini TZ, Li D, Swarbreck D, Wilks C, Sasidharan R, Muller R, Dreher K, Alexander DL, Garcia-Hernandez M, Karthikeyan AS, Lee CH, Nelson WD, Ploetz L, Singh S, Wensel A, Huala E: The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res. 2012, 40 (Database issue): D1202-D1210. 10.1093/nar/gkr1090.
Article PubMed Central CAS PubMed Google Scholar
Promponas VJ, Ouzounis CA, Iliopoulos I: Experimental evidence validating the computational inference of functional associations from gene fusion events: a critical survey. Brief Bioinform. 2012, 15 (3): 443-454. 10.1093/bib/bbs072.
Article PubMed Central PubMed Google Scholar
Jaramillo-Garzón JA, Gallardo-Chacón JJ, Castellanos-Domínguez CG, Perera-Lluna A: Predictability of gene ontology slim-terms from primary structure information in Embryophyta plant proteins. BMC Bioinformatics. 2013, 14 (1): 68-10.1186/1471-2105-14-68.
Article PubMed Central PubMed Google Scholar
Bengtsson T, Weighill D, Proux-Wera E, Levander F, Resjo S, Burra DD, Moushib LI, Hedley PE, Liljeroth E, Jacobson D, Alexandersson E, Andreasson E: Proteomics and transcriptomics of the BABA-induced resistance response in potato using a novel functional annotation approach. BMC Genomics. 2014, 15 (1): 315-10.1186/1471-2164-15-315.
Article PubMed Central PubMed Google Scholar
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011, 29 (7): 644-652. 10.1038/nbt.1883.
Article PubMed Central CAS PubMed Google Scholar
Li L, Stoeckert CJ, Roos DS: OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003, 13 (9): 2178-2189. 10.1101/gr.1224503.
Article PubMed Central CAS PubMed Google Scholar
Conesa A, Götz S, García-Gómez JM, Terol J, Talón M, Robles M: Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics. 2005, 21 (18): 3674-3676. 10.1093/bioinformatics/bti610.
Article CAS PubMed Google Scholar
Goodstein DM, Shu S, Howson R, Neupane R, Hayes RD, Fazo J, Mitros T, Dirks W, Hellsten U, Putnam N: Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res. 2012, 40 (D1): D1178-D1186. 10.1093/nar/gkr944.
Article PubMed Central CAS PubMed Google Scholar
Kasprzyk A: BioMart: driving a paradigm change in biological data management. Database (Oxford). 2011, 2011: bar049-10.1093/database/bar049.
Article Google Scholar
Zhao K, Bartley LE: Comparative genomic analysis of the R2R3 MYB secondary cell wall regulators of Arabidopsis, poplar, rice, maize, and switchgrass. BMC Plant Biol. 2014, 14 (1): 135-10.1186/1471-2229-14-135.
Article PubMed Central CAS PubMed Google Scholar
Kim HA, Lim CJ, Kim S, Choe JK, Jo S-H, Baek N, Kwon S-Y: High-throughput sequencing and De Novo Assembly of Brassica oleracea var. Capitata L. for transcriptome analysis. PLoS One. 2014, 9 (3): e92087-10.1371/journal.pone.0092087.
Article PubMed Central PubMed Google Scholar
Jaccard P: Etude comparative de la distribution florale dans une portion des Alpes et du Jura: Impr. Corbaz; 1901.
Google Scholar
Jaccard P: The distribution of the flora in the alpine zone. 1. New Phytol. 1912, 11 (2): 37-50. 10.1111/j.1469-8137.1912.tb05611.x.
Article Google Scholar
Yu G, Li F, Qin Y, Bo X, Wu Y, Wang S: GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics. 2010, 26 (7): 976-978. 10.1093/bioinformatics/btq064.
Article CAS PubMed Google Scholar
Khatri P, Done B, Rao A, Done A, Draghici S: A semantic analysis of the annotations of the human genome. Bioinformatics. 2005, 21 (16): 3416-3421. 10.1093/bioinformatics/bti538.
Article PubMed Central CAS PubMed Google Scholar
Klie S, Nikoloski Z: The choice between mapman and gene ontology for automated gene function prediction in plant science. Front Genet. 2012, 3: 115-10.3389/fgene.2012.00115.
Article PubMed Central PubMed Google Scholar
Defoin-Platel M, Hindle M, Lysenko A, Powers S, Habash D, Rawlings C, Saqi M: AIGO: Towards a unified framework for the Analysis and the Inter-comparison of GO functional annotations. BMC Bioinformatics. 2011, 12 (1): 431-10.1186/1471-2105-12-431.
Article PubMed Central PubMed Google Scholar
Powers D: Evaluation: From precision, recall and f-measure to roc., informedness, markedness & correlation. J Mach Learn Technol. 2011, 2 (1): 37-63.
Google Scholar
Datta S, Pihur V: An adaptive optimal ensemble classifier via bagging and rank aggregation with applications to high dimensional data. BMC Bioinformatics. 2010, 11: 427-10.1186/1471-2105-11-427.
Article PubMed Central PubMed Google Scholar
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410. 10.1016/S0022-2836(05)80360-2.
Article CAS PubMed Google Scholar
Magrane M, Consortium U: UniProt Knowledgebase: a hub of integrated protein data. Database (Oxford). 2011, 2011: bar009-10.1093/database/bar009.
Article Google Scholar
Finn RD, Clements J, Eddy SR: HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011, 39 (suppl 2): W29-W37. 10.1093/nar/gkr367.
Article PubMed Central CAS PubMed Google Scholar
Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J: The Pfam protein families database. Nucleic Acids Res. 2012, 40 (D1): D290-D301. 10.1093/nar/gkr1065.
Article PubMed Central CAS PubMed Google Scholar
Pasare SA, Ducreux LJ, Morris WL, Campbell R, Sharma SK, Roumeliotis E, Kohlen W, van der Krol S, Bramley PM, Roberts AG, Fraser PD, Taylor MA: The role of the potato (Solanum tuberosum) CCD8 gene in stolon and tuber development. New Phytol. 2013, 198 (4): 1108-1120. 10.1111/nph.12217.
Article CAS PubMed Google Scholar
Sharma SK, Millam S, Hein I, Bryan GJ: Cloning and molecular characterisation of a potato SERK gene transcriptionally induced during initiation of somatic embryogenesis. Planta. 2008, 228 (2): 319-330. 10.1007/s00425-008-0739-8.
Article CAS PubMed Google Scholar
Navarro C, Abelenda JA, Cruz-Oro E, Cuellar CA, Tamaki S, Silva J, Shimamoto K, Prat S: Control of flowering and storage organ formation in potato by FLOWERING LOCUS T. Nature. 2011, 478 (7367): 119-122. 10.1038/nature10431.
Article CAS PubMed Google Scholar
Kloosterman B, Abelenda JA, Gomez Mdel M, Oortwijn M, de Boer JM, Kowitwanich K, Horvath BM, van Eck HJ, Smaczniak C, Prat S, Visser RG, Bachem CW: Naturally occurring allele diversity allows potato cultivation in northern latitudes. Nature. 2013, 495 (7440): 246-250. 10.1038/nature11912.
Article CAS PubMed Google Scholar
Hancock RD, Morris WL, Ducreux LJ, Morris JA, Usman M, Verrall SR, Fuller J, Simpson CG, Zhang R, Hedley PE, Taylor MA: Physiological, biochemical and molecular responses of the potato (Solanum tuberosum L.) plant to moderately elevated temperature. Plant Cell Environ. 2014, 37 (2): 439-450. 10.1111/pce.12168.
Article CAS PubMed Google Scholar
Morris WL, Hancock RD, Ducreux LJM, Morris JA, Usman M, Verrall SR, Sharma SK, Bryan G, Mcnicol JW, Hedley PE: Day length dependent restructuring of the leaf transcriptome and metabolome in potato genotypes with contrasting tuberization phenotypes. Plant Cell Environ. 2014, 37 (6): 1351-1363. 10.1111/pce.12238.
Article CAS PubMed Google Scholar
Ali A, Alexandersson E, Sandin M, Resjö S, Lenman M, Hedley P, Levander F, Andreasson E: Quantitative proteomics and transcriptomics of potato in response to Phytophthora infestans in compatible and incompatible interactions. BMC Genomics. 2014, 15 (1): 497-10.1186/1471-2164-15-497.
Article PubMed Central PubMed Google Scholar
Burra DD, Berkowitz O, Hedley PE, Morris J, Resjö S, Levander F, Liljeroth E, Andreasson E, Alexandersson E: Phosphite-induced changes of the transcriptome and secretome in Solanum tuberosum leading to resistance against Phytophthora infestans. BMC Plant Biol. 2014, 14 (1): 254-10.1186/s12870-014-0254-y.
Article PubMed Central PubMed Google Scholar
Smyth GK: Limma: Linear Models for Microarray Data. Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Edited by: Gentleman R, Carey V, Dudoit S, R Irizarry WH. 2005, Springer, New York, 397-420. 10.1007/0-387-29362-0_23.
Chapter Google Scholar
Ritchie ME, Silver J, Oshlack A, Holmes M, Diyagama D, Holloway A, Smyth GK: A comparison of background correction methods for two-colour microarrays. Bioinformatics. 2007, 23 (20): 2700-2707. 10.1093/bioinformatics/btm412.
Article CAS PubMed Google Scholar
Tzfadia O, Amar D, Bradbury LM, Wurtzel ET, Shamir R: The MORPH algorithm: ranking candidate genes for membership in Arabidopsis and tomato pathways. Plant cell. 2012, 24 (11): 4389-4406. 10.1105/tpc.112.104513.
Article PubMed Central CAS PubMed Google Scholar

Download references

Acknowledgements

The authors would like to thank SURF-sara (https://www.surfsara.nl/) for hosting the hack-a-thon sessions and providing high performance computing services. This work is part of the Allbio initiative and was partially supported by grant number EU FP7; 289452; KBBE.2011.3.6-02). David Amar was supported in part by fellowships from the Azrieli foundation, and the Edmond J. Safra center for Bioinformatics at Tel Aviv University. Erik Alexandersson and Itziar Frades were supported by Crafoord grant (20120533) and the Swedish Foundation for Strategic Research (RB608-0006) and Estelle Proux-Wera by PlantLink. Agnieszka Danek was supported by POIG.02.03.01-24-099/13 grant: “GeCONiI - Upper Silesian Center for Computational Science and Engineering”. We thank the MapMan team for assistance. We would also like to thank Ashfaq Ali, Kate Dreher and Paul Kersey for helpful discussions and input, and Efrat Weithorn for manuscript edit help.

Author information

Authors and Affiliations

Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
David Amar & Ron Shamir
Deptartment of Plant Protection Biology, Swedish University of Agricultural Sciences, Alnarp, Sweden
Itziar Frades, Estelle Proux-Wera, Erik Andreasson & Erik Alexandersson
Institute of Informatics, Silesian University of Technology, Akademicka 2A, Gliwice, 44-100, Poland
Agnieszka Danek
Department for Bioinformatics and Computational Biology, Technical University of Munich, Arcisstraße 21, Munich, 80333, Germany
Tatyana Goldberg
Cell and Molecular Sciences, The James Hutton Institute, Aberdeen, Scotland, UK
Sanjeev K Sharma & Pete E Hedley
Current affiliation: SciLifeLab, Stockholm University, Universitetsvägen 10, Stockholm, 114 18, Sweden
Estelle Proux-Wera
Department of Plant Science, The Weizmann Institute of Science, Rehovot, Israel
Oren Tzfadia

Authors

David Amar
View author publications
You can also search for this author in PubMed Google Scholar
Itziar Frades
View author publications
You can also search for this author in PubMed Google Scholar
Agnieszka Danek
View author publications
You can also search for this author in PubMed Google Scholar
Tatyana Goldberg
View author publications
You can also search for this author in PubMed Google Scholar
Sanjeev K Sharma
View author publications
You can also search for this author in PubMed Google Scholar
Pete E Hedley
View author publications
You can also search for this author in PubMed Google Scholar
Estelle Proux-Wera
View author publications
You can also search for this author in PubMed Google Scholar
Erik Andreasson
View author publications
You can also search for this author in PubMed Google Scholar
Ron Shamir
View author publications
You can also search for this author in PubMed Google Scholar
Oren Tzfadia
View author publications
You can also search for this author in PubMed Google Scholar
Erik Alexandersson
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Oren Tzfadia.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

OT and E Alexandersson designed the study and DA, IF, AD, TG, SKS, EPW, OT and E Alexandersson performed the research. DA, IF and AD contributed new analytic computational tools. PH, E Andreasson and SKS provided the gene expression data sets. DA, IF, AD, TG, SKS, EPW OT and E Alexandersson analyzed the data. DA, IF, RS, OT and E Alexandersson wrote the paper. All authors read and approved the final manuscript.

Electronic supplementary material

12870_2014_329_MOESM1_ESM.pdf

Additional file 1: Figure S1.: ITAG pipeline similarity, Figure S2. ITAG gold standard validation, Figure S3. ITAG gene expression validation and Methods S1-5. (PDF 840 KB)

Additional file 2: Table S1.: InterPro2GO PGSC pipeline output. (XLS 3 MB)

Additional file 3: Table S2.: BLAST2GO PGSC pipeline output. (XLS 3 MB)

Additional file 4: Table S3.: OrthoMCL-UniProt PGSC pipeline output. (XLS 6 MB)

Additional file 5: Table S4.: Phytozome PGSC pipeline output. (XLS 2 MB)

Additional file 6: Table S5.: Tri_BLAST PGSC pipeline output. (XLS 4 MB)

Additional file 7: Table S6.: Tri_HMM PGSC pipeline output. (XLS 1 MB)

Additional file 8: Table S7.: BioMart ITAG pipeline output. (XLS 2 MB)

Additional file 9: Table S8.: BLAST2GO ITAG pipeline output. (XLS 4 MB)

Additional file 10: Table S9.: OrthoMCL-UniProt ITAG pipeline output. (XLS 5 MB)

Additional file 11: Table S10.: Phytozome ITAG pipeline output. (XLS 1 MB)

Additional file 12: Table S11.: Tri_BLAST ITAG pipeline output. (XLS 4 MB)

Additional file 13: Table S12.: Tri_HMM ITAG pipeline output. (XLS 1 MB)

Additional file 14: Table S13.: Potato gold standard genes with literature references. (XLSX 17 KB)

Additional file 15: Table S14.: PGSC gold standard. (XLSX 20 KB)

Additional file 16: Table S15.: ITAG gold standard. (XLSX 19 KB)

Additional file 17: Table S16.: PGSC ensemble output with k = 2. (XLS 4 MB)

Additional file 18: Table S17.: ITAG ensemble output with k = 2. (XLS 4 MB)

12870_2014_329_MOESM19_ESM.xlsx

Additional file 19: GO annotation file based on ensemble k = 2 for JHI Solanum tuberosum 60 k v1 microarray (ArrayExpress ID: E-MTAB-1655). (XLSX 3 MB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Authors’ original file for figure 7

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.

The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Amar, D., Frades, I., Danek, A. et al. Evaluation and integration of functional annotation pipelines for newly sequenced organisms: the potato genome as a test case. BMC Plant Biol 14, 329 (2014). https://doi.org/10.1186/s12870-014-0329-9

Download citation

Received: 12 June 2014
Accepted: 10 November 2014
Published: 05 December 2014
DOI: https://doi.org/10.1186/s12870-014-0329-9

Evaluation and integration of functional annotation pipelines for newly sequenced organisms: the potato genome as a test case

Abstract

Background

Results

Conclusions

Background

Results and discussion

A compendium of the state-of-the-art annotation tools

Disparity among pipelines

Ensemble of pipelines

Validation using the potato gold standard

Validation using gene expression data

Merging the different merits using a rank-based comparison

Conclusion

Methods

Executing the functional annotation pipelines

The OrthoMCL-UniProt pipeline

The BLAST2GO pipeline

The trinotate pipeline

The phytozome pipeline

InterPro2GO data from BioMart

Formatting pipelines

Composing the potato ‘gold standard’

Comparing pipelines and gold-standard evaluation

Mathematical notations

Jaccard coefficient between two pipelines

Structure-based similarity between two pipelines

GO-based precision and recall

Microarray data preprocessing and normalization

Evaluating co-expression of predicted GO processes

Additional files

Abbreviations

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors’ contributions

Electronic supplementary material

Authors’ original submitted files for images

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Plant Biology

Contact us