Distribution of gene scores according to four methods. The test set consisted of the coding genes of Syn contaminated with 114 genes from Tel (3% of the total number of coding genes of Syn). The z-score of a gene is the deviation of its score from the mean score, in units of standard deviations. Z-scores were binned every 0.25 units. For the two scoring methods that use covariance (W8 and CGS), the signs of the z-scores were reversed so that putative foreign genes would lie on the right side of the graph (see Methods). The thick, thin, and dashed lines show the distributions of scores for all coding genes, test core genes, and introduced foreign genes, respectively. The right-most arrow identifies the z-score that splits the test core gene distribution in a ratio of 95:5. The left-most arrow identifies the z-score that maximizes the difference between the number of introduced genes and the number of test core genes with scores to the right of the arrow; this z-score occurs at the intersection of the two curves. The shaded area is the maximal discrimination, i.e., the area under the dashed line minus the area under the thin line (the number of true positives minus the number of false positives) when the threshold marked by the left-most arrow is used. (A) GC, (B) Codon bias, (C) W8, (D) CGS.
Elhai et al. BMC Genomics 2012 13:245 doi:10.1186/1471-2164-13-245
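The quantities described in the caption (z-scores, 0.25-unit bins, the 95:5 threshold, and the maximal-discrimination threshold) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the use of the population standard deviation, the labelling of bins by their left edge, and the convention that a score equal to the threshold counts as lying to its right are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def z_scores(scores, flip_sign=False):
    """Deviation of each score from the mean, in standard-deviation units.

    flip_sign=True reverses the sign, as the caption describes for the
    covariance-based methods. Population SD is an assumption here.
    """
    mu = mean(scores)
    sd = pstdev(scores)
    sign = -1.0 if flip_sign else 1.0
    return [sign * (s - mu) / sd for s in scores]


def bin_counts(zs, width=0.25):
    """Count z-scores per bin; each bin is keyed by its left edge (assumed)."""
    counts = {}
    for z in zs:
        left = math.floor(z / width) * width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))


def split_threshold(core_z, percent=95):
    """Smallest observed core z-score with at least `percent`% of core
    genes at or below it (the 95:5 split when percent=95)."""
    ordered = sorted(core_z)
    # Integer ceiling of percent * n / 100, avoiding float rounding.
    rank = (percent * len(ordered) + 99) // 100
    return ordered[rank - 1]


def max_discrimination(core_z, foreign_z):
    """Threshold maximizing (true positives - false positives).

    True positives are foreign genes at or above the threshold; false
    positives are core genes at or above it. Ties keep the lowest threshold.
    """
    best_t, best_d = None, -math.inf
    for t in sorted(set(core_z) | set(foreign_z)):
        tp = sum(z >= t for z in foreign_z)
        fp = sum(z >= t for z in core_z)
        if tp - fp > best_d:
            best_t, best_d = t, tp - fp
    return best_t, best_d


if __name__ == "__main__":
    # Toy z-scores, invented purely to exercise the functions.
    core = [-1.0, -0.5, 0.0, 0.5, 1.0]
    foreign = [0.75, 1.5, 2.0]
    print(bin_counts(core + foreign))
    print(split_threshold(core))
    print(max_discrimination(core, foreign))
```

Scanning only the observed z-scores as candidate thresholds is sufficient in this sketch, because the difference between true and false positives can change only at those values.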