### Abstract

#### Background

Any method that predicts protein function *de novo* should do better than random. More challenging, it also
ought to outperform simple homology-based inference.

#### Methods

Here, we describe a few methods that predict protein function exclusively through homology. Together, they set the bar or lower limit for future improvements.

#### Results and conclusions

During the development of these methods, we faced two surprises. Firstly, our most successful implementation for the baseline ranked very high at CAFA1. In fact, our best combination of homology-based methods fared only slightly worse than the top-of-the-line prediction method from the Jones group. Secondly, although the concept of homology-based inference is simple, this work revealed that the precise details of the implementation are crucial: not only did the methods span from top to bottom performers at CAFA, but also the reasons for these differences were unexpected. In this work, we also propose a new rigorous measure to compare predicted and experimental annotations. It puts more emphasis on the details of protein function than the other measures employed by CAFA and may best reflect the expectations of users. Clearly, the definition of proper goals remains one major objective for CAFA.

### Background

#### Our contribution to CAFA1

UniProt [1] holds over 22 million sequences (May 2012), but reliable and detailed experimental
annotations exist for fewer than 1% of these. GO, the Gene Ontology [2], has become the *gold standard* for function annotation, and many methods have emerged that predict GO annotations
[3]. Due to various problems, it has been almost impossible to assess how well these
methods perform. CAFA (Critical Assessment of Function Annotations) has arisen to
address the challenges by a comprehensive independent comparison [4].

CAFA also drove our work presented here. Three teams of students implemented three different methods predicting function through homology, i.e. through inference from experimental annotations of related proteins. All three groups began with the same description of what to do, and that description was more comprehensive and detailed than many method descriptions in papers. Two of the three methods were surprisingly competitive in CAFA and outperformed other similar methods. This triggered the decision to enhance and combine these classifiers into one meta predictor. This post-CAFA method did NOT participate in CAFA. Had it participated, it might have reached the top 10 among all participants. Either way, it suggests several ways to improve function prediction by homology, as demonstrated in this post-CAFA evaluation.

Additionally, we developed a new metric to compare predicted and actual GO annotations. It provides insight into how methods perform with respect to the prediction of the exact functions. This turns out to be largely neglected by existing measures.

#### Related work

Several advanced methods have appeared that also predict protein function through
homology-based inference. *ConFunc* [5] first assigns the proteins found via PSI-BLAST to groups so that all members of a
group share a particular GO term. Looking at the alignments in each group, the method
then identifies conserved functional residues, scores them and only outputs the groups
above a certain combined score. *GOSLING* [6] first derives various features of the terms found in the BLAST result (e.g. GO evidence code, E-Value and bit score). Using many decision trees, the prediction is then flagged as either correct or incorrect. *PFP* [7] follows an approach very similar to *GOtcha* [8], but also considers highly insignificant BLAST hits and co-occurrence between GO terms. An extension of *GOtcha*, *ESG* [9], additionally differentiates between the hits found in different PSI-BLAST iterations. *GOstruct* [10] takes the idea of co-occurrence to the next level and builds a sophisticated SVM machinery around "structured output spaces". This refers to the extension of the input space (E-Values, etc.) with all experimentally observed GO subgraphs. *FANN-GO* [11] uses E-Values as inputs to neural networks. Methods based on data sources other than
similarity to already annotated proteins are described in a recent review [3].

### Methods

#### GO (Gene Ontology) for CAFA

GO has three parts: Molecular Function Ontology (MFO), Biological Process Ontology
(BPO) and Cellular Component Ontology (CCO). CAFA considered only the MFO and BPO.
Both correspond to two directed acyclic graphs and capture different aspects of protein
function. Functional keywords ("GO terms") are nodes and their relationships are labeled
edges. The ontology is hierarchical: following the edges from a node, each new term
corresponds to a more general concept of the original function. All paths converge
at the root node, which can simply be interpreted as, e.g., *has a molecular function*.

The complete functional annotation of each protein has two subgraphs (MFO and BPO).
If a subgraph does not contain all the terms that can be inferred by going from its
nodes to the root, we perform a *propagation*. Given a set of GO terms, this operation extends the set with all ancestral terms. Ancestors can be found by following all outgoing paths from the terms to the root: each visited node is an ancestor. If the GO terms have scores (e.g. to reflect their reliability), these are also propagated: each parent term is simply assigned the maximum score of its children. Sometimes, a propagated subgraph needs to be reduced again to its *leaf terms*. A leaf term is not the parent of any other term in the propagation and corresponds to the most exact description of a function for the given protein.

In order to integrate the operations above into our methods, we used the graph_path table provided by the GO consortium. It contains all possible paths in the entire GO graph, pre-calculated by specific path inference rules.
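These operations can be sketched in a few lines, assuming a precomputed ancestor lookup that stands in for the graph_path table (the toy ontology and all names below are hypothetical):

```python
# Toy ancestor lookup standing in for the GO consortium's graph_path table.
ANCESTORS = {
    "kinase": ["transferase", "catalytic", "root"],
    "transferase": ["catalytic", "root"],
    "catalytic": ["root"],
    "root": [],
}

def propagate(scored_terms):
    """Extend a {term: score} dict with all ancestral terms; each ancestor
    is assigned the maximum score of any of its (in)direct children."""
    result = dict(scored_terms)
    for term, score in scored_terms.items():
        for anc in ANCESTORS[term]:
            result[anc] = max(result.get(anc, 0.0), score)
    return result

def leaf_terms(propagated):
    """Reduce a propagation to its leaves: terms that are not ancestors
    (parents) of any other term in the set."""
    ancestors_present = {a for t in propagated for a in ANCESTORS[t]}
    return {t: s for t, s in propagated.items() if t not in ancestors_present}
```

With this toy ontology, propagating {"kinase": 0.8, "catalytic": 0.3} lifts the score 0.8 to all ancestors of "kinase", and the only leaf of the result is "kinase" itself.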

#### Assessment of predicted GO annotations

Analogously to CAFA, we use fixed sets of target proteins to compare prediction methods. Each target corresponds to one or two propagated GO subgraphs of experimentally validated terms (depending on whether both BPO and MFO annotations are available or only one of the two). A method is supposed to predict these subgraphs (e.g. the left tree in Figure 1) and assign a reliability between 0.0 and 1.0 to each predicted term (e.g. green nodes in Figure 1). We then assess their accuracy in the following ways, separately for MFO and BPO. For the first two measures, we exclusively used the original CAFA implementations, GO version, targets and target annotations. Only for our new leaf threshold measure did we slightly adapt the programs.

#### Top-20

Given the prediction of a single protein, the *top-20 measure* first reduces the prediction to the terms with the highest reliability (Figure 1: green nodes with score 0.8). It then defines recall as the number of correctly predicted
GO terms divided by the number of all true GO terms. Precision corresponds to the
number of correctly predicted GO terms divided by the number of all predicted GO terms.
In Figure 1, for example, recall is 1/11 = 0.09 and precision is 1/2 = 0.5. If a target is not
predicted at all, it is assigned a recall of 0.0. Precision is not calculated in such
a case and has no influence. Repeating this for all targets, we obtain the average
recall and precision. This is the first point in the recall-precision curve. In order
to fill the curve, we gradually add the terms with 2nd, 3rd, ..., 20th highest reliability
to the predictions and recalculate all of the above.
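The bookkeeping above can be sketched as follows; the function name and data layout are our own, and ties at the cutoff are handled by simply truncating to k terms:

```python
def recall_precision_at_k(predictions, truth, k):
    """predictions: {target: {term: score}}, truth: {target: set of true terms}.
    Keep the k highest-scoring terms per target, then average recall and
    precision over all targets (precision only over predicted targets)."""
    recalls, precisions = [], []
    for target, true_terms in truth.items():
        pred = predictions.get(target, {})
        top = sorted(pred, key=pred.get, reverse=True)[:k]
        correct = len(set(top) & true_terms)
        recalls.append(correct / len(true_terms))
        if top:  # unpredicted targets: recall 0.0, no precision contribution
            precisions.append(correct / len(top))
    avg_prec = sum(precisions) / len(precisions) if precisions else 0.0
    return sum(recalls) / len(recalls), avg_prec
```

Reproducing the Figure 1 example (11 true terms, two predicted terms of which one is correct) yields recall 1/11 and precision 0.5.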

**Figure 1.** **A functional annotation and its prediction**. This Figure shows one annotation of a sample protein A and its prediction. Each
node in a graph corresponds to one GO term and the edges to relationships such as
"is a" or "part of". The edges always point to the root node (either "MFO" or "BPO"),
which by itself is not informative and is discarded in every evaluation. For clarity,
the left subgraph only shows the experimental annotation of A. This means that all GO
terms have either been experimentally verified or inferred from the same. The red
circles indicate the leaf terms, i.e. the nodes which are not a parent of any other
term. In the right subgraph, we see the experimental annotation again, but overlaid
with predicted terms (green) and their reliabilities. This time, the leaf terms correspond
to the predicted GO annotation, instead of the actual annotation.

#### Threshold

The *threshold measure* [4] follows a concept similar to *top-20*. Instead of considering a certain number of terms for each target at a time, the measure
demands a threshold between 0.0 and 1.0. In case of a threshold of 0.82, for example,
each prediction is reduced to terms with a reliability greater than or equal to 0.82.
Recall and precision can then be calculated analogously to the *top-20* measure. A curve is obtained by lowering the threshold in steps of 0.01 from 1.0 to
0.0.
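A minimal sketch of the reduction step and the threshold grid, assuming predictions are stored as a term-to-reliability mapping (names hypothetical):

```python
def filter_by_threshold(prediction, t):
    """Reduce a {term: reliability} prediction to the terms with
    reliability greater than or equal to t."""
    return {term: r for term, r in prediction.items() if r >= t}

def threshold_curve_points():
    """The curve lowers t in steps of 0.01 from 1.0 to 0.0 (101 points)."""
    return [round(1.0 - 0.01 * i, 2) for i in range(101)]
```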

#### Leaf threshold

The *leaf threshold measure*, finally, operates exclusively on the leaves of a propagation (red nodes in Figure
1). First, predicted and experimental subgraphs are reduced to their leaf terms (Figure
1: experimental leaves on the left, predicted leaves on the right). Then, we define
a threshold T as before, e.g. T = 0.82, and reduce each prediction to the leaves with
a reliability ≥ T. The recall of a single prediction is given by the number of correctly
predicted leaves divided by the number of all experimental leaves. Precision is defined
analogously. Consequently, we can derive a recall-precision curve in the same way
as for the threshold measure. In Figure 1, we obtain the first non-empty prediction as soon as the threshold reaches 0.80 (the
highest score of all predicted leaves is 0.8). In this case, recall and precision
correspond to 0/3 = 0.0 and 0/1 = 0.0.

The leaf threshold measure is orthogonal to the top-20 and threshold measure: in the
case of low recall, for example, the former two measures remove specific GO terms
from the prediction and retain only the more general terms. Naturally, more general
terms have a higher chance to overlap with the experimental propagation than specific
terms, resulting in higher precision. However, the *leaves* of this reduced prediction are not more likely to overlap with the *leaves* of the experimental annotation. If the full prediction was the best estimate of the
experimental leaves, the reduced version could even result in recall = precision =
0.0 by the leaf threshold measure, because the reduction might remove all correctly
predicted leaves. Our new measure assesses how well the exact functions of a protein
are predicted. Predictions that are too general or too specific are penalized.

#### Maximum F1 score

The *top-20* and threshold measures were the two main metrics in the CAFA meeting. The leaf measure is introduced here for the first time. In order to rank methods, the CAFA organizers additionally used the maximum F1 score over all recall-precision pairs obtained with the threshold measure (*Fmax*). The F1 score is defined as:

F1 = 2 · precision · recall / (precision + recall)    (Eq. 1)
We also employed Fmax in order to choose among alternative parameters during method development after CAFA.
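A minimal sketch of F1 and Fmax as used for ranking (names are our own):

```python
def f1(recall, precision):
    """Harmonic mean of recall and precision (0.0 if both are 0)."""
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def fmax(curve):
    """Maximum F1 over a list of (recall, precision) pairs, e.g. one pair
    per threshold step of the threshold measure."""
    return max(f1(r, p) for r, p in curve)
```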

#### Homology-based methods

All the methods that we presented at CAFA were developed as part of the exercises of a regular lecture at the TUM in Munich (years 1-3 of the bioinformatics master). Due to limitations in time and resources, our methods had to focus on a k-nearest-neighbor approach and to use only the hits returned from a PSI-BLAST [12] query against Swiss-Prot [1]. They were supposed to optimize the F1 score (threshold measure) of the leaf term with the highest reliability. We split the 16 participating students into three groups, each of which had to develop its own implementation. After eight weeks, one week before the CAFA submission deadline, we received the following three methods. Their key features are summarized in Table 1.

**Table 1.** Comparison of student methods.

#### StudentA (Figure 2)

Begin with 2-iteration PSI-BLAST against all Swiss-Prot proteins with GO annotations
(E-Value < 0.1). Extract GO terms of the six top PSI-BLAST hits (or all if fewer than
6 hits found). Each identified GO term is scored 1.0 if the term is found in all 6
hits and 0.5 otherwise. Once the term-score pairs have been extracted, only the leaf
terms of their propagation are retained. Then apply the following filter to reduce
functional redundancy: (i) create *branches *by propagating each predicted leaf term separately; (ii) calculate all pairwise branch
overlaps, with the overlap being defined as the number of common GO terms in both
branches divided by the average branch size.

**Figure 2.** **Flow chart of StudentA**.

*StudentA* first reduces the BLAST output to the best 6 hits. GO terms that are part of the annotation in all 6 hits are assigned a score of 1.0, all others 0.5. Then the predicted GO graph is assembled by propagating the scores and pruned again during a functional redundancy reduction (see text). This reduced graph is output to the user.

Next, cluster all branches so that any pair of branches from two different clusters overlaps by less than 10%. For each cluster, the branch with the longest path to the root is chosen, reduced to its original leaf term with the original score, and output to the user. As the redundancy reduction may filter out highly supported terms, we apply a final correction: if any pair of branches from the previous steps overlaps by more than 90%, the term common to both with the longest path to the root, i.e., the lowest common ancestor, is added to the result.
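The branch overlap at the heart of StudentA's redundancy filter can be sketched as follows, on a hypothetical toy ontology (names and data are illustrative):

```python
# Hypothetical toy ontology: term -> ancestors (root omitted, as it is
# discarded in every evaluation anyway).
TOY_ANCESTORS = {
    "kinase": ["transferase", "catalytic"],
    "hydrolase": ["catalytic"],
    "transferase": ["catalytic"],
    "catalytic": [],
}

def branch(leaf):
    """A branch is the separate propagation of a single predicted leaf term."""
    return {leaf, *TOY_ANCESTORS[leaf]}

def overlap(a, b):
    """Number of common GO terms divided by the average branch size."""
    return len(a & b) / ((len(a) + len(b)) / 2)
```

Branches overlapping by less than 10% end up in different clusters; pairs overlapping by more than 90% trigger the lowest-common-ancestor correction.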

#### StudentB (Figure 3)

Begin with 2-iteration PSI-BLAST against all Swiss-Prot proteins with GO annotations
(E-Value < 0.002 for the 1^{st} and E-Value < 0.01 for the 2^{nd} round). Each PSI-BLAST hit is associated with the propagation of its GO terms and
each term in the propagation is associated with the PSI-BLAST E-Value of the hit.
We then define two scores.

**Figure 3.** **Flow chart of StudentB**.

*StudentB* first logarithmizes the E-Values of all BLAST hits and averages them. The result is mapped into a range from 0 to 1 by looking up its percentile in a precompiled distribution. This percentile is the *Template Quality Score* and reflects how well we can predict the entire target. To score single terms, we multiply it with the score of each predicted leaf, i.e. the *Combined Leaf Score*. This is the average of the logarithmized E-Values of the nodes on the path from the leaf to the root. The propagation of the leaf terms and their scores is output to the user.

The *template quality score* gauges the reliability of the entire PSI-BLAST query with respect to the goal of assigning function. First, we calculate the *raw template score* as the average over the logarithms of all returned PSI-BLAST E-Values plus twice the standard deviation (also derived from the log(E-Value)). The standard deviation is added to correct for cases with relatively low top scores and relatively high averages. This raw template score is normalized into a value from 0 to 1 by mapping it into a percentile bin obtained from running PSI-BLAST in the same fashion on a sample of all Swiss-Prot proteins (e.g. a score reached by 90% of the samples is scored 1 - 0.9 = 0.1). We call this percentile the template quality score.

The *combined leaf score* measures the reliability of each predicted leaf. First, we compile the propagated set of all GO terms for all PSI-BLAST hits. Each term can occur in multiple hits and thus be associated with multiple E-Values. The *support* of a term is defined as the sum over the logarithms of its E-Values divided by the sum over the logarithms of the E-Values of all hits. The *combined leaf score* of a leaf in the set of GO terms above is then given by the average support of itself and all of its ancestors.

Finally, we multiply the *template quality* and *combined leaf* scores for each leaf, combine all the leaf-score pairs in one set and output its propagation to the user.

#### StudentC (Figure 4)

Begin with 2-iteration PSI-BLAST against all Swiss-Prot proteins with GO annotations
(E-Value < 0.1). Count how often a particular GO term appeared in the PSI-BLAST hits
(without propagation). All nodes with counts are propagated through the GO tree. Instead
of taking the maximum count of all children at each parent node, however, their values are summed up and added to that of the parent node (normalized to [0,1] by division by the maximal value). We call this type of scoring the *max support*. The PSI-BLAST scores, on the other hand, are considered as follows.

**Figure 4.** **Flow chart of StudentC**.

*StudentC* first counts how often each GO term appeared in the BLAST hits and performs a cumulative propagation: for each inner node, all counts of its child terms are summed up and added to its own count (depth-first traversal). Dividing each count by that of the root term, we obtain a score between 0 and 1 for each term. In parallel, we calculate a second score for each term by assigning it the maximum percentage of positives of all associated hits (see text). Finally, we multiply the two scores, determine the highest scoring leaf term and output only its propagation to the user.

For each PSI-BLAST hit, we first read off the *positive identity*. This value is included in the default BLAST output and corresponds to the number of *positives* divided by the alignment length. (Each mutation column in the default BLAST output with a positive BLOSUM62 score is a *positive*.) Then, we multiply the *max support* of each term with the highest associated *positive identity* (we may have many *positive identities*, because a GO term can be associated with multiple PSI-BLAST hits). The method outputs only the one branch corresponding to the highest scoring leaf term.
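StudentC's cumulative propagation can be sketched on a hypothetical single-parent toy ontology (GO is in fact a DAG; names and data are illustrative):

```python
# Hypothetical toy ontology given as child -> parent links.
PARENT = {"kinase": "transferase", "transferase": "catalytic",
          "hydrolase": "catalytic", "catalytic": None}

def cumulative_scores(counts):
    """Add each term's raw hit count to all of its ancestors, then normalize
    by the maximum (root) total to obtain scores in [0, 1]."""
    totals = dict.fromkeys(PARENT, 0)
    for term, n in counts.items():
        while term is not None:       # walk up to the root, accumulating
            totals[term] += n
            term = PARENT[term]
    top = max(totals.values())
    return {t: c / top for t, c in totals.items() if c > 0}
```

Multiplying these scores with the best *positive identity* per term and keeping the branch of the highest-scoring leaf would complete the sketch.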

#### Post-CAFA re-parameterization

After CAFA, we re-parameterized the three basic homology-inference methods above. For *StudentA*, we introduced options to exclude predictions with a score of 0.5 and to choose the number of PSI-BLAST hits to consider (before: 6; now: 1, 5 or 9). For *StudentC*, we added alternative PSI-BLAST E-Value thresholds (before: 1e-01; now: 1e00, 1e-03 or 1e-06) and percentage pairwise sequence identity as an alternative to the *positive identity*. We also enabled the optional output of all branches, instead of restricting it to the most probable one. The original implementation of *StudentB* had a bug: an alternative graph_path table inverted the order of the columns by mistake. The results of this bug were submitted to CAFA. We fixed the bug and allowed for alternatives in the thresholds for E-Values and maximum numbers of PSI-BLAST hits (E-Value before: 1e-02; now: 1e00, 1e-03 or 1e-06; max. number of hits before: 250 [PSI-BLAST default]; now: 5, 50 or 500).

For all methods, we also added the choice of the number of PSI-BLAST iterations (before: 2 for all methods; now: 1, 2 or 3). Finally, we enabled the filtering out of Swiss-Prot annotations with unclear experimental support (optional restriction to the following experimental GO evidence codes: IDA, IMP, IPI, IGI, IEP, TAS, IC, EXP).

The re-parameterization created 36, 54, and 72 different parameter combinations for
*StudentA-C*, respectively. We optimized the parameters by picking the combination leading to
the highest Fmax (threshold measure; Eq. 1) on a hold-out data set. This data set
comprised all Swiss-Prot proteins annotated with experimentally verified GO terms
in 2010 ("Set 2010"). All proteins annotated before 2010 served as templates ("Set
< 2010"). This ensured that there was no overlap with the CAFA targets. In the following, we refer to the optimized student methods as *StudentA'-C'*.

#### Post-CAFA method combination

Because the lecture during which the methods were developed had ended, we could not combine them before CAFA; we did this post-CAFA as well. We randomly split Set 2010 into two equal
parts (Set 2010a and 2010b). Parameters were optimized on the first split (2010a;
as before, only with 2010a instead of 2010). These optimized variants of *StudentA-C* (say *StudentA''-C''*) were applied to the second split (2010b). Then, we switched the roles of the two
sets and repeated the procedure to obtain predictions for each protein in Set 2010.
With these predictions, we trained a commonly used meta classifier [13], namely a weighted least-squares linear regression model. This corresponded to the
formula x*A' + y*B' + z*C' + i = p, where A', B' and C' are the results of the student
methods for each predicted GO term and [x-z] and i are the coefficients to optimize
in the regression so that p reflects the reliability of the GO term. In order to meta-predict
a new target protein, we first annotate it with methods *StudentA'-C'*. Each predicted GO term is then converted into a vector of three elements (one dimension
for each method) and put into the formula above. The resulting value of p is the reliability
of the GO term for the given target. We refer to this predictor as *MetaStudent'*.
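The regression can be sketched as an ordinary least-squares fit via the normal equations (the paper uses a weighted variant; all data and names here are illustrative):

```python
def fit_linear(X, y):
    """Ordinary least-squares fit of p = x*A' + y*B' + z*C' + i.
    X: list of [A', B', C'] score triples, y: target reliabilities.
    Returns the coefficients [x, y, z, i]."""
    rows = [list(xi) + [1.0] for xi in X]        # append intercept column
    n = len(rows[0])
    # Normal equations: (A^T A) w = A^T y
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    aty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda k: abs(ata[k][col]))
        ata[col], ata[piv] = ata[piv], ata[col]
        aty[col], aty[piv] = aty[piv], aty[col]
        for r in range(col + 1, n):
            f = ata[r][col] / ata[col][col]
            for c in range(col, n):
                ata[r][c] -= f * ata[col][c]
            aty[r] -= f * aty[col]
    # Back substitution
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (aty[r] - sum(ata[r][c] * w[c] for c in range(r + 1, n))) / ata[r][r]
    return w
```

`fit_linear` returns [x, y, z, i]; applying them to a new term's three scores via x*A' + y*B' + z*C' + i yields the meta reliability p.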

#### Baseline classifiers

The CAFA organizers implemented the following three baseline classifiers to gauge
the improvement of current function predictors over old or naïve methods [11]. (1) *Priors*. Every target has the same annotations and each term's score is the probability of
that term occurring in Swiss-Prot. (2) *BLAST*. Each term of a target is simply scored with the maximum sequence identity returned by BLAST under
default parameters when aligning a target with all proteins annotated with a given
term. (3) *GOtcha*. Using the same BLAST results as *BLAST*, *GOtcha* [8] I-scores are calculated as the sum of the negative logarithm of the E-Value of the
alignment between the target protein and all proteins associated with a given term.
Additionally, we introduce *Priors'*, which simply returns the entire GO annotation of a random Swiss-Prot protein. Scores
are assigned as in *Priors*.
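As a sketch of the *GOtcha* baseline's scoring (the log base is our assumption; names our own):

```python
import math

def i_score(evalues):
    """GOtcha-style I-score: sum of -log(E-Value) over all BLAST alignments
    between the target and the proteins annotated with a given GO term."""
    return sum(-math.log10(e) for e in evalues)  # log base 10 is an assumption
```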

#### Data sets

We used five different data sets for method development and evaluation. All are derived exclusively from GO and the GO-annotated proteins in Swiss-Prot and differ only in their release dates:

- "Set < 2010_10": GO/Swiss-Prot releases from Oct. 2010, used by the three original student methods
- "Set < 2010_12": GO/Swiss-Prot releases from Dec. 2010
- "Set < 2010": all proteins annotated with experimentally verified GO terms before 2010 (used as templates)
- "Set 2010": all proteins annotated with experimentally verified GO terms in 2010
- "Set 2011": the original CAFA targets provided by the organizers (436 with BPO and 366 with MFO annotations), corresponding to proteins annotated between January and May 2011

Note that this implied that all our post-CAFA optimization could have been accomplished completely BEFORE the submission to CAFA (we just had not been fast enough). Nevertheless, we clearly label all the new methods as "post-CAFA" in this work.