Figure 8.
Example of the generation of cut-offs for classification of ssd-orthologs and probable
paralogs, based on an iterative-true-negative analysis (i.e. based on an introduction
of random sets of true-negatives). The particular analysis illustrated here is a Ratio1
analysis for the mouse, rat, human RefSeq RBH dataset, with true-negatives introduced
into the mouse (ingroup1) set. In panel A, the number of putative orthologous groups
in each ratio range for the true-negative-transformed data set is shown for the whole
data set (light shaded bars) and for the introduced true-negatives alone (dark
shaded bars). Note how the distribution of the whole data set differs from that of the
true-negatives (i.e. the introduced paralogs). In panel B, the proportion of randomly introduced
true-negatives at 0.5 ratio range intervals is used to formulate cut-offs (denoted
by dashed lines) for classifying ssd-orthologs and probable paralogs for the analysis.
The ssd-orthologs cut-off (left-most dashed line) is placed so that no ratio range
within the ssd-orthologs region contains more than 10% true-negatives. The probable
paralogs cut-off (right-most dashed line) marks where the proportion of true-negatives
is at or above 50%. The middle region bounded by these two cut-off points
establishes the "uncertain" orthology class ratio range. Dashed lines denoting these
particular cut-offs are also shown in panel A for reference. This
approach to true-negative analysis and cut-off generation is also applied to
Ratio2 [1], and the combination of cut-offs for Ratio1 and Ratio2 is used to classify putative
orthologous groups from another data set (such as an RBH-predicted data set) into
the three classification levels of "probable ssd-ortholog", "uncertain" and "probable
paralogs". Panel C schematically shows the areas of an R1 × R2 plot that would be classified
in this way, with the cut-off numbers in this particular example matching the RefSeq
RBH-based mouse-rat-human analysis (see Table 2 for how these ranges are numerically
determined).
Fulton et al. BMC Bioinformatics 2006 7:270 doi:10.1186/1471-2105-7-270
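The cut-off generation described in panel B can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`bin_proportions`, `derive_cutoffs`, `classify`) are hypothetical. The sketch also assumes that the ssd-orthologs cut-off is the upper edge of the last consecutive 0.5-wide ratio range, scanning from the lowest ratio, with at most 10% true-negatives, and that the probable paralogs cut-off is the lower edge of the first ratio range with at least 50% true-negatives. It handles a single ratio only; how the Ratio1 and Ratio2 cut-offs are combined is not covered here.

```python
from math import floor


def bin_proportions(ratios, is_true_negative, width=0.5):
    """Map each occupied bin's lower edge to its fraction of true-negatives."""
    totals, negatives = {}, {}
    for ratio, neg in zip(ratios, is_true_negative):
        edge = floor(ratio / width) * width  # lower edge of the ratio range
        totals[edge] = totals.get(edge, 0) + 1
        negatives[edge] = negatives.get(edge, 0) + (1 if neg else 0)
    # Build the result in ascending bin order so later scans go left to right.
    return {edge: negatives[edge] / totals[edge] for edge in sorted(totals)}


def derive_cutoffs(proportions, width=0.5, ortholog_max=0.10, paralog_min=0.50):
    """Return (ortholog_cutoff, paralog_cutoff); either may be None.

    Assumption: bins are scanned from the lowest ratio upwards, and
    unoccupied bins are skipped rather than treated as failures.
    """
    ortholog_cutoff = None
    for edge, proportion in proportions.items():
        if proportion > ortholog_max:
            break
        ortholog_cutoff = edge + width  # upper edge of the last acceptable bin
    paralog_cutoff = next(
        (edge for edge, p in proportions.items() if p >= paralog_min), None
    )
    return ortholog_cutoff, paralog_cutoff


def classify(ratio, ortholog_cutoff, paralog_cutoff):
    """Assign one of the three classification levels for a single ratio."""
    if ortholog_cutoff is not None and ratio < ortholog_cutoff:
        return "probable ssd-ortholog"
    if paralog_cutoff is not None and ratio >= paralog_cutoff:
        return "probable paralogs"
    return "uncertain"
```

Given a list of ratio values and a parallel list of flags marking the introduced true-negatives, `bin_proportions` produces the per-range proportions plotted in panel B, `derive_cutoffs` yields the two dashed-line positions, and `classify` places a new ratio value into one of the three classes.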