Predicting the usage of core promoters. A) Visual representation of promoter usage rate prediction performance. Predicted promoter usage rates are plotted against actual rates (as measured by RNA-seq in the gene body, RNAPII in the promoter region, and RNAPII in the gene body), expressed as log2(Tags Per Million, TPM). The predicted values are obtained using a linear model with 10-fold cross-validation. B) Summary of promoter usage rate prediction performance. The box plots summarize the Pearson Correlation Coefficients (PCC) calculated between actual and predicted promoter usage measurements; a perfect correlation gives a PCC of 1. We tested the framework using either only 9 epigenetic modifications or these together with additional features (methylation status, dinucleotide and normalized GC content). The variation estimates were obtained by performing a 10% holdout experiment on 10 random non-overlapping splits. We used both linear models and Random Forest methods: the Random Forest consistently outperforms the linear model, but the absolute differences in mean PCC values are small (~0.05).
Chen et al. BMC Genomics 2011 12:544 doi:10.1186/1471-2164-12-544
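
The evaluation described in panel A (a linear model scored by 10-fold cross-validation, with the PCC computed between actual and predicted values) can be illustrated with a minimal sketch. This is not the authors' code: the data are synthetic stand-ins for 9 feature signals and a log2(TPM)-like target, all function names (`solve`, `fit_linear`, `cross_val_predict`, `pearson`) are hypothetical, and pooling the out-of-fold predictions before computing a single PCC is an assumption about how the correlation is aggregated. Only the Python standard library is used, to keep the example self-contained.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear(X, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(coef, X):
    """Apply fitted coefficients (intercept first) to each feature row."""
    return [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in X]


def cross_val_predict(X, y, k=10, seed=0):
    """Out-of-fold predictions: each sample is predicted by a model that never saw it."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all samples
    pred = [0.0] * len(y)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        coef = fit_linear([X[i] for i in train], [y[i] for i in train])
        for i, value in zip(held_out, predict(coef, [X[i] for i in held_out])):
            pred[i] = value
    return pred


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


# Synthetic stand-in data (assumption): 500 "promoters", 9 feature signals,
# and a target that is a noisy linear combination of the features.
rng = random.Random(42)
n_promoters, n_features = 500, 9
weights = [rng.uniform(-1, 1) for _ in range(n_features)]
X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_promoters)]
y = [sum(w * v for w, v in zip(weights, row)) + rng.gauss(0, 0.5) for row in X]

predicted = cross_val_predict(X, y, k=10)
pcc = pearson(y, predicted)
print(f"PCC between actual and cross-validated predictions: {pcc:.3f}")
```

In practice an established statistics or machine-learning library would likely replace the hand-written solver. The 10% holdout experiment in panel B could be sketched similarly by computing one PCC per held-out split instead of pooling predictions.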