Validating clustering results by the mutual information: A schematic example. Each gene is uniquely assigned to one functional category Ai and grouped into cluster Cj by a given clustering algorithm. The joint probabilities can be estimated directly from the associated contingency table, and the mutual information is calculated according to Eq. (1). To assess how strongly the clustering is related to the annotation, the value of the mutual information is compared to the values obtained for random assignments of genes to clusters, i.e. each gene is randomly assigned to a cluster, preserving the total number of genes within each cluster but destroying any possible relationship between the clustering and the functional annotation. The lower right plot shows the mutual information compared to an ensemble of 500 randomized assignments. In this example, the z-score, estimated according to Eq. (8), is S ≈ 3.8. For a z-score to be deemed significant, we further require that no random assignment results in a mutual information equal to or larger than that of the tested annotation. Note that, though we expect the mutual information to be zero for the randomized assignments, the average estimated mutual information for randomized data is biased towards positive values due to finite-size effects [19,20]. As a rule of thumb, to obtain a reliable estimate of the mutual information, the number of genes should be at least three times larger than the number of clusters or functional categories.
Steuer et al. BMC Bioinformatics 2006 7:380 doi:10.1186/1471-2105-7-380
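The procedure described in the caption can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation. Eq. (1) and Eq. (8) are not reproduced in this section, so the code assumes the standard plug-in estimate of the mutual information (in bits) and a z-score defined as the observed value minus the mean of the randomized values, divided by their standard deviation. The function names `mutual_information` and `randomization_test` and all parameter names are hypothetical.

```python
import numpy as np


def mutual_information(labels_a, labels_c):
    """Plug-in estimate of I(A;C) in bits from two label vectors.

    Assumption: the standard definition
    I = sum_ij p_ij * log2(p_ij / (p_i * p_j)),
    with probabilities estimated from the contingency table.
    """
    # Map arbitrary labels to consecutive integer indices.
    a_idx = np.unique(np.asarray(labels_a).ravel(), return_inverse=True)[1]
    c_idx = np.unique(np.asarray(labels_c).ravel(), return_inverse=True)[1]

    # Contingency table: rows are categories A_i, columns are clusters C_j.
    table = np.zeros((a_idx.max() + 1, c_idx.max() + 1))
    np.add.at(table, (a_idx, c_idx), 1)

    p_joint = table / table.sum()
    p_a = p_joint.sum(axis=1, keepdims=True)   # marginal over clusters
    p_c = p_joint.sum(axis=0, keepdims=True)   # marginal over categories
    p_indep = p_a @ p_c                        # product of the marginals

    nonzero = p_joint > 0                      # empty cells contribute zero
    return float(np.sum(p_joint[nonzero]
                        * np.log2(p_joint[nonzero] / p_indep[nonzero])))


def randomization_test(labels_a, labels_c, n_random=500, seed=0):
    """Compare the observed mutual information to shuffled assignments.

    Permuting the cluster labels preserves the number of genes in each
    cluster while destroying any relationship to the annotation.
    Assumption: z = (observed - mean(null)) / std(null).
    """
    rng = np.random.default_rng(seed)
    labels_c = np.asarray(labels_c).ravel()
    observed = mutual_information(labels_a, labels_c)

    null = np.array([
        mutual_information(labels_a, rng.permutation(labels_c))
        for _ in range(n_random)
    ])

    z_score = (observed - null.mean()) / null.std(ddof=1)
    # Additional criterion from the caption: no randomized assignment may
    # reach or exceed the observed mutual information.
    no_random_exceeds = bool(np.all(null < observed))
    return observed, z_score, null, no_random_exceeds
```

Calling `randomization_test(categories, clusters)` with two equal-length label vectors returns the observed mutual information, the z-score, the ensemble of randomized values, and the outcome of the stricter "no random assignment is as large" check. The mean of the returned ensemble is typically above zero, which reflects the finite-size bias mentioned in the caption.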