Figure 3.

Predicting the usage of core promoters. A) Visual representation of the performance of promoter usage rate prediction. We plot predicted vs actual promoter usage rates (as measured by RNA-seq in the gene body, RNAPII in the promoter region and RNAPII in the gene body), expressed as log2 of tags per million (TPM). The predicted values are obtained using a linear model with 10-fold cross-validation. B) Summary of promoter usage rate prediction performance. The box plots summarize the Pearson correlation coefficients (PCC) calculated between actual and predicted promoter usage measurements; a perfect positive correlation gives a PCC of 1. We tested the framework using either the 9 epigenetic modifications alone or these together with additional features (methylation status, dinucleotide content and normalized GC content). The variation was estimated by performing a 10% holdout experiment on 10 random non-overlapping splits. We used both linear models and Random Forest methods: the Random Forest consistently outperforms the linear model, but the absolute differences in mean PCC values are small (~0.05).
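
The evaluation scheme described in panel B can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes a single synthetic feature in place of the epigenetic features, fits an ordinary least-squares line, and uses hypothetical helper names (`pearson`, `fit_line`, `holdout_pccs`). It only shows how 10 non-overlapping 10% holdout splits yield a distribution of PCC values between actual and predicted measurements.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept (one feature only)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

def holdout_pccs(x, y, n_splits=10, seed=0):
    """Hold out each of n_splits non-overlapping subsets once; return one PCC per split."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_splits] for i in range(n_splits)]  # disjoint ~10% subsets
    pccs = []
    for test in folds:
        held = set(test)
        train = [i for i in idx if i not in held]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        predicted = [slope * x[i] + intercept for i in test]
        pccs.append(pearson(predicted, [y[i] for i in test]))
    return pccs

# Synthetic stand-in data (an assumption, not values from the paper).
rng = random.Random(1)
signal = [rng.uniform(0, 10) for _ in range(200)]   # hypothetical feature
usage = [0.8 * s + rng.gauss(0, 1) for s in signal]  # hypothetical log2(TPM)
pccs = holdout_pccs(signal, usage)
print("mean PCC over splits:", sum(pccs) / len(pccs))
```

The list `pccs` holds one value per split, which is the kind of data a box plot such as the one in panel B would summarize.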

Chen et al. BMC Genomics 2011 12:544   doi:10.1186/1471-2164-12-544