Table 4

The distribution of clusters with their characteristics given different values for k (the number of clusters) from 500 to 3,000.

| k | 500 | 1,000 | 2,000 | 3,000 |
|---|---|---|---|---|
| Single Species cluster | 422 (84.4%) | 904 (90.4%) | 1897 (94.9%) | 2894 (96.5%) |
| # of Phenocopy-Pairs (of 25) | 25 (100%) | 13 (52%) | 12 (48%) | 8 (32%) |
| Cluster w/PT-Sim ≥ 0.4 | 92 (18.4%) | 293 (29.3%) | 526 (26.3%) | 810 (40.5%) |
| # Genes | 3221 | 5886 | 6379 | 6878 |
| Cluster w/GO-Sim ≥ 0.4 | 51 (10.2%) | 206 (20.6%) | 522 (26.1%) | 921 (46.1%) |
| Correlation GO-Sim vs PT-Sim | 0.53 | 0.41 | 0.37 | 0.28 |
| # Genes | 863 | 1800 | 2392 | 3065 |
| Cluster w/PPi ≥ 75% | 21 (4.2%) | 60 (6.0%) | 174 (8.7%) | 305 (10.2%) |
| # Genes | 1497 | 1858 | 2335 | 2702 |
| Cluster w/PPi ≥ 33% | 63 (12.6%) | 138 (13.8%) | 286 (14.3%) | 413 (13.8%) |
| # Genes | 3890 | 4322 | 4965 | 4996 |
| Cluster for GO-Predictions | 90 (18%) | 196 (19.6%) | 393 (19.7%) | 611 (20.4%) |
| # Genes | 2820 | 3213 | 4145 | 4546 |
| # Terms | 142 | 345 | 730 | 1226 |
| Precision | 72.55% | 67.91% | 63.40% | 60.31% |
| Recall | 16.73% | 22.98% | 25.63% | 28.32% |
| Avg. Genes/Cluster | 54 | 29 | 16 | 11 |


As an internal measure of cluster quality, we sought to gain insight into how the data structure changes with different values of k, ranging from 500 to 3,000. Here, Filter 1 has been applied for the GO predictions. For details, see text.
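
The percentages in parentheses for the cluster-count rows of Table 4 appear to be each count divided by k. The sketch below recomputes them as a sanity check, assuming that reading; the names `ROWS`, `percent_of_k`, and `find_mismatches` are hypothetical, and the 0.1-point tolerance is an arbitrary allowance for rounding. The phenocopy-pair row is left out because its percentages are relative to 25 pairs, not to k.

```python
# Counts and printed percentages transcribed from Table 4, one cell per k.
K_VALUES = [500, 1000, 2000, 3000]

ROWS = {
    "Single Species cluster": [(422, 84.4), (904, 90.4), (1897, 94.9), (2894, 96.5)],
    "Cluster w/PT-Sim >= 0.4": [(92, 18.4), (293, 29.3), (526, 26.3), (810, 40.5)],
    "Cluster w/GO-Sim >= 0.4": [(51, 10.2), (206, 20.6), (522, 26.1), (921, 46.1)],
    "Cluster w/PPi >= 75%": [(21, 4.2), (60, 6.0), (174, 8.7), (305, 10.2)],
    "Cluster w/PPi >= 33%": [(63, 12.6), (138, 13.8), (286, 14.3), (413, 13.8)],
    "Cluster for GO-Predictions": [(90, 18.0), (196, 19.6), (393, 19.7), (611, 20.4)],
}


def percent_of_k(count, k):
    """Share of all k clusters, in percent (assumed meaning of the parentheses)."""
    return 100.0 * count / k


def find_mismatches(rows=ROWS, k_values=K_VALUES, tolerance=0.1):
    """Return cells whose printed percentage differs from count / k beyond rounding."""
    mismatches = []
    for label, cells in rows.items():
        for k, (count, printed) in zip(k_values, cells):
            recomputed = percent_of_k(count, k)
            if abs(recomputed - printed) > tolerance:
                mismatches.append((label, k, count, printed, round(recomputed, 1)))
    return mismatches


if __name__ == "__main__":
    for label, k, count, printed, recomputed in find_mismatches():
        print(f"{label}, k={k}: {count} printed as {printed}%, count/k = {recomputed}%")
```

On the transcribed values, this flags two cells at k = 3,000, the PT-Sim and GO-Sim cluster counts (810 and 921), whose printed percentages correspond to a denominator of 2,000. The source does not say which figure is intended, so the table values are reproduced as published.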

Groth et al. BMC Bioinformatics 2008 9:136   doi:10.1186/1471-2105-9-136
