Despite their involvement in the regulation of gene expression and their importance as genomic markers for promoter prediction, no objective standard exists for defining CpG islands (CGIs), since all current approaches rely on a large parameter space formed by the thresholds of length, CpG fraction and G+C content.
Given the higher frequency of CpG dinucleotides at CGIs, as compared to bulk DNA, the distance distributions between neighboring CpGs should differ for bulk and island CpGs. A new algorithm (CpGcluster) is presented, based on the physical distance between neighboring CpGs on the chromosome and able to predict directly clusters of CpGs, while not depending on the subjective criteria mentioned above. By assigning a p-value to each of these clusters, the most statistically significant ones can be predicted as CGIs. CpGcluster was benchmarked against five other CGI finders by using a test sequence set assembled from an experimental CGI library. CpGcluster reached the highest overall accuracy values, while showing the lowest rate of false-positive predictions. Since a minimum-length threshold is not required, CpGcluster can find short but fully functional CGIs usually missed by other algorithms. The CGIs predicted by CpGcluster present the lowest degree of overlap with Alu retrotransposons and, simultaneously, the highest overlap with vertebrate Phylogenetic Conserved Elements (PhastCons). CpGcluster's CGIs overlapping with the Transcription Start Site (TSS) show the highest statistical significance, as compared to the islands in other genome locations, thus qualifying CpGcluster as a valuable tool in discriminating functional CGIs from the remaining islands in the bulk genome.
CpGcluster uses only integer arithmetic, thus being a fast and computationally efficient algorithm able to predict statistically significant clusters of CpG dinucleotides. Another outstanding feature is that all predicted CGIs start and end with a CpG dinucleotide, which should be appropriate for a genomic feature whose functionality is based precisely on CpG dinucleotides. The only search parameter in CpGcluster is the distance between two consecutive CpGs, in contrast to previous algorithms. Therefore, none of the main statistical properties of CpG islands (neither G+C content, CpG fraction nor length threshold) are needed as search parameters, which may lead to the high specificity and low overlap with spurious Alu elements observed for CpGcluster predictions.
Given the inherent mutability of methylated cytosine, the human genome has only a fraction (≈ 20%) of the CpG dinucleotides expected on the basis of its G+C content [1,2]. However, the resulting scarcity of CpGs is not uniform throughout the chromosome: there are many DNA tracts (CpG islands or CGIs), totaling 1% of the genome, where CpGs are abundant [3-6]. The lack of methylation at CGIs, together with their elevated G+C content relative to the genome average, results in a frequency of CpG dinucleotides that is about 10-fold higher than in bulk DNA [5,6]. About 60% of all genes have a CGI, normally unmethylated, in their promoter region [2,6,7]. However, in some physiological or pathological situations promoter-associated CGIs can be methylated, then provoking a change in the expression of the associated gene [8-11]. The maintenance of a particular genomic pattern of methylated CpGs provides an epigenetic means for differential regulation of gene expression [2,7,12].
Approximately 80% of all CpGs are methylated in human and mouse genomes, which makes the hypomethylated and GC-rich CGIs an outstanding genomic property. Given their putative function in gene regulation and their importance as genomic markers in promoter prediction, over recent years there has been a considerable effort to predict CGIs in silico. Current algorithms (newcpgreport ; cpg ; CpGProD ; CpGIS [16,17]; CpGIE ; CpGED ) rely on the ad hoc thresholds of length, CpG O/E ratio and G+C content early defined by Gardiner-Garden and Frommer . These three thresholds lead to a parameter space which is relatively large and difficult to explore completely. Consequently, in many publications, these parameters have been fine-tuned in different ways -for example, to filter out spurious Alu elements or restricting the prediction to putative promoter CGIs. However, in every fine-tuning, "valid" CGIs also become filtered out, as a consequence of using the same parameters for both prediction and filtering; this suggests the use of different parameters in both steps, as proposed below in the CpGcluster algorithm. Another shortcoming, shared by the algorithms using the conventional moving-window approach (newcpgreport, CpGProdD, CpGIS and CpGIE), but not by the cpg script (which uses compositional segmentation) or CpGED (which uses a sliding double window), is that the island boundaries cannot be accurately defined to single base-pair resolution. As is well known (see, for example ), the methods using a moving window add another level of subjectivity in choosing both the length of the window and the step size. Taking this problem into account, the algorithm CpGcluster is able to predict the island boundaries to a single base-pair resolution by definition.
Bulk CpGs are thought to be in a dynamic equilibrium between the decay of methylated CpGs and the generation of new ones due to point mutations . This is a random process and therefore the CpG distance distributions should be strikingly different for bulk and island CpGs, which motivated our approach. In particular, the distances between consecutive bulk CpGs, as the result of a random process, should follow the geometric distribution, while the distance distribution for in-island CpGs must contain information on the high local clustering. Taking advantage of this high local clustering of CpG dinucleotides at CGIs, CpGcluster directly predicts clusters of CpGs on the chromosome. Predicted clusters with high enough statistical significance can then be identified as CGIs (see Methods).
The accuracy of CpGcluster was evaluated by comparing it to five commonly used CGI finder programs (Table 1). To benchmark the programs, a set of test sequences containing experimentally determined CGIs in a random background was assembled, as described in the Methods section. As in the gene-finding field , the accuracy of a prediction was measured by comparing the predicted island value (island or non-island) with the true island value for each nucleotide along the test sequence, then deriving estimates for Sensitivity (Sn, the proportion of island nucleotides that have been correctly predicted as islands), Specificity (Sp, the proportion of predicted island nucleotides that are actually islands; this measure is also known as Positive Predictive Value, or PPV) and the Correlation Coefficient (CC, a single scalar value that summarizes Sn and Sp as a measure of global accuracy). The averaged values over ten test sequences, each with 400 experimental islands randomly distributed over a randomized sequence, are shown in Table 1. When the threshold distance was set to the median of the observed distribution, CpGcluster showed moderate values for Sn, while reaching the highest ones for both CC and Sp. The high specificity achieved by our algorithm indicates that it has the lowest rate of false-positive predictions (i.e. only 2.4% of the predicted nucleotides turned out actually not to be part of a CGI).
Table 1. Benchmarking of CpGcluster
For these results, the median distance between neighbor CpGs and a p-value cutoff of 10-5 were used to run CpGcluster. As shown in the last row of Table 1, the raising of the distance threshold to the 75th percentile, thereby obtaining longer islands, increased sensitivity by more than 20% while only minimally improving overall accuracy. However, this led to a smaller fraction of CGI overlapping with PhastCons (shown below). On the other hand, lowering the p-value threshold beyond 10-5 slightly increased Sp but also clearly decreased Sn, thus lowering overall global accuracy (not shown). Consequently, the median distance was used as the only parameter for the island prediction and the 10-5 cutoff for the filtering in all subsequent analyses.
Finally, we examined another accuracy indicator, the Hit percentage, which gives the proportion of experimental CGIs which have been at least partially overlapped by the predicted islands. Table 1 shows that CpGcluster "hits" a higher number of islands than any other algorithm. This highest partial overlap (at least the core region of the CGI is predicted), together with the highest specificity mentioned above, might indicate an advantage of CpGcluster over the other tested algorithms.
Statistical analysis of predicted islands in human and mouse genomes
Basic statistics of the CGIs predicted by CpGcluster in the human (hg17) and mouse (mm7) genome assemblies are shown in Table 2. For comparison, this Table also includes an analysis of the islands predicted by CpGProD. The number of CGIs predicted by CpGcluster in both genomes is higher, and their average length shorter than those predicted by CpGProD or the ones previously reported in the literature (see, for example, ). Some of the short CGIs predicted by CpGcluster might be spurious, but some others may play a true functional role (see below). The spurious fraction may be due, for example, to the presence of arginine-rich exons, which are then rich in CGN codons, and therefore are prone to erroneous identification as CGIs. Comparing our prediction to annotated exon boundaries, we estimated this fraction to be relatively low (only about 3.4% of the predicted CGIs in the human genome correspond actually with exons).
Table 2. Basic statistics of CpGcluster and CpGProD islands
The hypothesis that some short CGIs could be truly functional is based on the fact that many known functional CGIs are shorter than commonly assumed -the extreme example being Xenopus CGIs, which are known to be shorter and have a lower G+C content than the CGIs found in mammals . However, short and functional CGIs exist also in the human genome. One example concerns the CGI of the human tissue-specific SERPINB5 gene. The promoter of this gene is associated with a GC-rich region that, while fulfilling the conventional %G+C and CpG fraction defining criteria for CGIs, is significantly shorter than the average  and consequently goes unnoticed in most annotations . To our knowledge, CpGcluster is the only algorithm capable of catching the core of this fully functional CGI [see 1]. A second example refers to MAGE genes, which are found as antigens in a wide variety of tumors, and become methylated during normal mammalian development. They have a CpG-rich region 300–650 bp long at their 5' end that, although shorter than average CGIs, remains nonmethylated in sperm but methylated in somatic tissues, where the genes are not expressed. Therefore, these genes represent clear examples of tissue-specific genes that use DNA methylation as a primary mechanism for their regulation . The ability to detect the CpG-rich regions enabling this type of regulation is an important measure of quality for any CGI finder and was tested on ten MAGE genes having known TSS (Table 3). Our algorithm detected CGIs in eight of the ten MAGE genes analyzed, while the number of islands reported by the other programs in this gene set was significantly lower.
Additional file 1. Alignment with SERPINB5 promoter. Sequence alignment between the promoter of SERPINB5 (SPR, associated with a GC-rich region) and the CGI predicted by CpGcluster. None of the remaining CGI finders were able to detect any CGI in this region. The transcription start site (TSS), as given by the DBTSS data base , is shown in bold type.
Format: DOC Size: 23KB Download file
This file can be viewed with: Microsoft Word Viewer
Table 3. Overlap with PhastCons and MAGE genes
The minimum length of a functional CGI is a difficult question, but insights can be derived from recent advances in mapping functional promoters. The shortest island in our prediction which overlaps with a TSS from DBTSS is 33 bp in length. When functional promoters are determined through ChIP-on-chip technology , that length goes down to 13 bp. Finally, when promoters are determined by using the cap analysis of gene expression (CAGE) approach  the minimum island length is just 11 bp long, thus approaching the minimum lengths observed in both DerLab and CpGcluster databases. Thus, it seems that even very short islands may be functional. Also, it bears mentioning that short islands (<200 bp) predicted by CpGcluster which overlap with a TSS also show a very high degree of overlap (37%) with conserved elements (see below), thus suggesting probable biological relevance.
Further insight into the possible role of short CGIs is suggested by the recent finding of CpG "islets", genomic regions that are not conventionally classified as CpG islands because of their short length (<200 bp), but have a GC content and observed-to-expected CpG ratio that is characteristic of a CpG island. CpG islets may be non-methylated, corresponding to sites of active transcription and/or boundaries that separate major centromeric chromatin sub-domains .
All in all, these data support the possibility that genomic tracts with GC content and CpG Obs/Exp ratios typical of CGIs, but below the detection threshold of conventional CGI finders, may have a functional role in the genome. CpGcluster represents a new tool that may help to uncover such regions.
Minimal overlap between CGIs predicted by CpGcluster and Alu retrotransposons
A major source of uncertainty in CGI prediction is the interference of Alu retrotransposons. These elements, abundant in primate genomes, have often been falsely identified as CGIs by conventional CGI finders. To cope with this problem, some authors [15,16,18] have proposed a simple increment in the value of some of the thresholds used. The drawback of such a strategy is that some CGIs associated with genes would also be excluded under these more stringent criteria. Even so, the fraction of Alu overlap shown by the islands predicted by most programs is still rather large, while CpGcluster's CGIs demonstrate the least amount of overlap with Alu elements (Table 3). We wish to stress especially that CpGcluster does not need any minimum-length criterion to exclude a higher proportion of Alu elements than any of the previously existing algorithms tested.
Highest degree of overlap between CpGcluster islands and phylogenetic conserved elements (PhastCons) from vertebrates
Functional genomic elements, being under natural selection, are expected to be highly conserved during evolution. Therefore, if the predicted CGIs truly play a functional role, they should show a high degree of overlap with vertebrate PhastCons . Taking advantage of the 'most conserved' track (based on the best-in-genome pairwise alignments between human and other seven vertebrate genomes) at UCSC Genome Browser , we computed the percentage of overlap between PhastCons and the CGIs predicted by the different finders. As seen in Table 3, the islands predicted by CpGcluster show the highest degree of overlap with PhastCons, thus indicating that our algorithm predicts a higher proportion of evolutionarily conserved, functionally relevant CGIs than does any other tested algorithm.
Promoter and CpG island co-location
For a further quality assessment for the islands predicted by CpGcluster, we assigned them to five classes according to their co-location with annotated genes from the RefSeq database . The classification proposed by Ioshikhes and Zhang  was improved by using exon boundaries (instead of absolute positions) to define the different classes. Accordingly, we divided CGIs into five classes defined as follows: L1, the island overlaps with the TSS; L2, the island does not overlap with the TSS but is located somewhere between 2 kb upstream of the TSS and the end of the first exon; L3, the island is located somewhere between the end of the first exon and the start of the last exon; L4, the island is located between the start of the last exon and 2 kb downstream of the Transcription End Site (TES); NG, the island is outside the gene environment.
Most of the islands predicted by CpGcluster are located outside of the gene environment (Table 4). Only 56750 (or 28.7%) in humans and 40348 (or 34.9%) in mice are within or around the genes. Note, however, that the curate samples of RefSeq genes used in elaborating this table represent less than half of the existing genes in both species. When we analyzed the entire RefSeq database without any filtering, these percentages rose to 53.4% and 47.2%, respectively. When Genscan gene predictions, another well-known gene-finder track available at the UCSC Genome Browser, were considered, the percentages rose to 83.4% and 82.2%, respectively. These results indicate that a substantial fraction of the CGIs predicted by CpGcluster may overlap the putative, non yet confirmed genes predicted by this popular gene-finder.
Table 4. Location of CpGcluster islands
Table 4 also shows that both in humans and in mice CpGcluster predicts drastically different islands as a function of genomic location: promoter CGIs (L1) are longer and have lower p-values than do the rest of the classes. Another important observation is that both in humans and mice, promoter CGIs are much richer in vertebrate PhastCons elements than are non-genic islands (NG).
In addition, two surprising observations were made in the mouse genome: 1) promoter islands have smaller CpG densities and CpG fractions than non-genic islands have; and 2) L4 islands, located mainly in the 3' untranslated regions (3' UTRs), show a high proportion of PhastCons overlap. It is well known that when mouse and human orthologous genes are compared, the mouse line shows a net loss of CpG islands . This probably indicates a higher "pressure" on CGIs in mice, which may account for these findings.
For a comparison, we also analyzed the co-location of genes with the CGIs predicted by CpGProD (Table 5). As shown in Table 2, the statistical properties of the CGIs predicted by this finder are quite similar for mouse and human. However, as shown in Tables 2 and 4, CpGcluster predicts islands with striking differences between human and mouse, especially when looking at the co-location with genes. For example, mouse CGIs overlapping with TSS have lower CpG densities and Obs/Exp ratios than non-genic islands. We interpret these as being the true statistical properties of those islands (overlapping with a TSS), as CpGcluster does not predetermine these values in the detection process.
"Stretches of DNA with a high G+C content, and a frequency of CpG dinucleotides close to the expected value, appear as CpG clusters within the CpG-depleted bulk DNA, and are now generally known as CpG islands". This original description of CpG islands by Gardiner and Frommer in 1987  formulates the basic idea underlying the present work: CpG dinucleotides appear clustered within the CpG-depleted bulk DNA and these clusters should be able to be associated with CpG islands. In the same work , the above authors also proposed a criterion for CpG islands based on thresholds which later became the basic principle of practically all existing CpG island finders. They justify these criteria by assuming that CpG-rich regions over 200 bp in length are unlikely to have occurred by chance alone, which points out another important property of CpG islands implemented in this work: the statistical significance. Some years before, McClelland and Ivarie  had introduced a Chi-square test to assign a statistical significance to CpG islands. Therefore, our approach is probably more related to the original perception of CpG islands as statistically significant CpG clusters within CpG-depleted regions.
Both our distance approach (which directly predicts CpG clusters) and the threshold approach are derived from the same original idea stating that the CpGs form clusters in the genome. However, the main disadvantage of any threshold approach is that generally valid CpG islands may become discarded as well, an effect that is aggravated as the dimension of the parameter space grows. In our distance approach, we reduced the parameter space notably, furthermore linking the distance parameter to intrinsic statistical properties of the sequence. The chosen median distance between two CpGs in a given chromosome separates fairly well the CpG clustering from the inter-cluster distances (see Fig. 1) and therefore affords certain objectivity to the choice of this parameter. Note furthermore that the median distance is correlated to the G+C content of the chromosome sequence. The higher the G+C content of the chromosome, the higher the probability that a CpG appears and consequently the lower will be the median distance. In this way, the median distance adjusts itself to the global conditions dictated by the given input sequence. This can hardly be achieved using the conventional large-dimension threshold parameter space and therefore, in previous work, the same threshold values were used indiscriminately for all the chromosomes.
Figure 1. Probability density function of distances between neighboring CpGs. Distribution of distances between neighboring CpG dinucleotides in the human chromosome 1. The observed distribution is represented in symbols, while the random expectation corresponding to the geometric distribution [Eq. 1] is represented in a solid line. Note that, in a good approximation, the median separates over-represented distances from under-represented ones.
The first consequence of the difference between the distance and threshold approaches is that, on average, CpGcluster islands are shorter. However, they show higher mean G+C content, CpG density, and CpG fractions than do any of the other five tested algorithms (Table 2). The lower values shown by these threshold-based algorithms may be an inherited consequence of the general approach shared by most of them. To some extent, the chosen thresholds predetermine the statistical properties of the islands, since these usually become enlarged as long as the thresholds are not violated. This threshold-dependent enlargement in the search process may also lead to the observed over-prediction of CpG islands and high Alu overlap shown by most threshold-based algorithms. On the contrary, CpGcluster overcomes this drawback since statistical properties of the CGIs, such as G+C content or CpG fraction are not used as search parameters. Note furthermore that the p-value is a crucial filter parameter to sort out spurious Alu elements. Young Alus have p-values around 10-7 (with slight variations among chromosomes); therefore, the high substitution rates on the Alu CpG sites produce a fast loss of statistical significance, which explains the low overlap with spurious Alu elements shown by the islands predicted by CpGcluster.
Finally, we wish to discuss briefly the lack of any length filter in CpGcluster which allows the prediction of extremely short islands and which, at first glance, could be interpreted as a disadvantage. It should be noted that in all of the previous algorithms the length is not used for prediction purposes, and is considered only in the final filtering process. In fact, the original idea of the length threshold was to guarantee that the predicted islands are not just a mere product of chance alone. Instead, we change the length filter by a statistically stricter criterion: the p-value. In this way, all predicted CGIs are statistically significant CpG clusters. We are aware that the putative functional CGIs are on average very long (as for example the L1 class in Table 4). However, it is important to stress the conceptual difference between the detection of CpG clusters and the subsequent filtering for a particular subset (e.g. promoter overlapping CGIs). These two steps should be clearly distinguished.
The distance-based CGI-finder algorithm described here presents three outstanding features: i) all the predicted CGIs start and end with a CpG dinucleotide; ii) all the computations needed use integer arithmetic, thus leading to a fast and computationally efficient CGI finder, and iii) a p-value is associated with each of the predicted islands. When compared to other CGI finders,CpGcluster is able to predict CGIs with the highest global accuracy and specificity, thus indicating a low rate of false-positive predictions. Short but fully functional CGIs are also predicted by CpGcluster. Furthermore, the degree of overlap of predicted CGIs with Alu retrotransposons is minimal, while the overlap with vertebrate PhastCons is maximal. The promoter CGIs predicted by CpGcluster also show the highest statistical significance, thus qualifying CpGcluster as a valuable tool to distinguish functional CGIs from the remaining islands in the bulk genome.
The algorithm CpGcluster presented in this work consists of two main steps: i) a distance-based algorithm searches for clusters of CpGs in the chromosome sequence. ii) a p-value is associated with each of these clusters, then predicting as CGIs only those clusters with large enough statistical significance (i.e. for which their p-values are below the selected threshold). These two steps are explained in detail in the next subsections.
CpG cluster-searching algorithm
The cluster-searching method is based on the statistical properties of the physical distances between neighboring CpG dinucleotides on the DNA sequence. In principle, if CpGs are distributed totally at random along the chromosome sequence, the distances between neighboring CpG dinucleotides should follow the geometric distribution:
P(d) = (1 - p)d-1 p 
where P(d) represents the probability of finding a distance d between neighboring CpGs and p corresponds to the probability of CpGs in the sequence, calculated as the ratio between CpGs and the total number of dinucleotides in the DNA sequence.
The working hypothesis behind the cluster-searching algorithm is that the abundant CpGs in CGIs may be separated by shorter distances (thereby forming clusters) than the distances between bulk CpGs, which in principle should follow the geometric distribution [Eq. 1].
To test our working hypothesis, in Figure 1 we represent the normalized probability distribution of distances between neighboring CpGs corresponding to human chromosome 1 (Additional files 2, 3, 4, 5 show the normalized probabilities of distances for all the human chromosomes). The median/mean values of CpG distances for all chromosomes are shown in Table 6. Figure 1 shows that short distances are over-represented when compared to the geometric distribution, while intermediate distances are less abundant than the theoretical random expectations. Large distances are also over-represented when compared to the geometric distribution. For a clear display of these features, in Figure 2 we represent the same as in Figure 1, but using logarithmic axis: distances below 40 bp (the median) and above 300 bp are over-represented, while the intermediate values are under-represented. Both facts clearly indicate strong clustering: the abundant short distances separate intra-cluster CpGs while the large distances (also more abundant than randomly expected) separate the clusters themselves.
Format: DOC Size: 921KB Download file
This file can be viewed with: Microsoft Word Viewer
Format: DOC Size: 844KB Download file
This file can be viewed with: Microsoft Word Viewer
Format: DOC Size: 845KB Download file
This file can be viewed with: Microsoft Word Viewer
Format: DOC Size: 847KB Download file
This file can be viewed with: Microsoft Word Viewer
Figure 2. Probability density function of distances between neighboring CpGs (log-scale). The same as in Figure 1, but using logarithmic axis; over-represented large distances can be appreciated.
Therefore, the DNA sequence was scanned looking for the presence of such CpG clusters. The algorithm performs the following steps:
1. The DNA chromosome sequence is scanned for CpG dinucleotides, then recording the positions occupied by the 'C': x1, x2 ... xN, N being the total number of CpGs in the sequence. The sequence was usually scanned in the 5' → 3' direction. Trivially, the reverse scan (3' → 5') produces the same results.
2. As a convention, the physical distance separating two neighboring CpGs is defined as:
di = xi+1 - xi - 1, 
so that the minimal distance between two neighboring CpGs (i.e. CGCG) is equal to 1.
3. In the course of the scan, the first distance below a given threshold (dt) identifies the first CpG cluster. The threshold dt can be conveniently derived from the distribution of distances between neighboring CpGs in the chromosome sequence. The median distance (Figs. 1, 2) often gives the best results because the median distance of the observed distribution is approximately at the transition point of the over-represented (intra-cluster) small distances and the under-represented intermediate ones. This is not an exclusive property of chromosome 1, as it is shared by all the chromosomes (see Additional files 2, 3, 4, 5), thus indicating that the median distance can be chosen in general as a good threshold (dt).
4. We then try to extend this first cluster downstream (→ 3') by adding the next CpG while the distances are below dt. When a distance exceeding dt is found, the cluster is completed, and the search for a new one continues downstream.
5. Steps 3 and 4 are iterated until all the CpG clusters in the sequence are identified.
Note that this algorithm acquires two important and distinctive features by construction. First, all predicted CGIs start and end with a CpG dinucleotide, which seems appropriate. Secondly, the algorithm uses only integer arithmetic, thus being computationally efficient. No other CGI searching algorithm shares these two important properties.
Assigning p-values to CpG-clusters
Once all the CpG clusters are found in the sequence following the algorithm described above, the next step is to associate a p-value with each one – i.e. the probability of such a cluster appearing by chance in a random sequence. Such a probability can be estimated either numerically by a randomization test on the DNA sequence or by means of a theoretical probability function (both cases shown in 6). For the latter case, the negative binomial distribution (also known as Pascal or Pólya distribution) can be conveniently tailored to the requirements of CpG clusters. In general, this distribution can be applied to experiments with dichotomous outcomes (either success or failure) and gives the probability of having a certain number of failures when the number of successes was fixed in advance, taking into account that the experiment must always end with a success. By translating these requirements to a genomic context, the successes were equated with CpG dinucleotides and the failures with non-CpGs (all other 15 possible dinucleotides). One prerequisite is that all trials be independent, which is not automatically fulfilled when dealing with overlapping dinucleotides (note that a CpG dinucleotide will always be followed by a non-CpG [GN] dinucleotide). Therefore, these "forced" non-CpGs need to be considered when calculating the success probabilities. Thus, the probability for a cluster with a number (N) of CpGs is given by
Additional file 6. Comparison between analytical and experimental randomization. Length distribution of clusters with different numbers of CpGs. The analytical distribution (black line) is virtually identical to the experimental one obtained by a randomization that preserves the CpG frequency.
Format: DOC Size: 459KB Download file
This file can be viewed with: Microsoft Word Viewer
This formula takes into account that all our clusters start with a CpG, and therefore the number of successes is N-1 (instead of N, the number of CpGs in the cluster). The number of independent non-CpGs (nf) in a cluster (failures) can be calculated as:
nf = L - 2 · N 
L being the cluster length (in nucleotides). The success probability p (probability of finding a CpG) is calculated as:
Ns being the number of CpG dinucleotides in the sequence and nis the number of independent dinucleotides (i.e. including the CpGs but excluding the forced non-CpGs). The theoretical probabilities determined by this analytical method agree well with those found by numerical simulation, as shown in 6. Given its lower computational cost, the theoretical approach was implemented in our software.
The negative binomial is a two-tailed distribution (6). The left tail indicates a high local CpG clustering (accumulation of CpGs) while the right tail is comprised of CpG-depleted regions. Therefore, the probability that an observed local CpG frequency is significantly higher than those expected under random conditions (CpG clustering) is given by the cumulative density function of the CpG cluster at point nf, which can therefore be taken as its p-value:
The use of this latter expression allows us to discriminate between the clusters found in the first step of the algorithm: those clusters with a p-value below a given threshold (usually 10-5, see Section "Benchmarking CpGcluster") are predicted as CGIs, while the rest of the clusters are discarded.
Assembling test sequences containing CpG islands
To evaluate the accuracy of CpGcluster and compare it to other programs, we assembled a set of test sequences on the basis of an experimental CGI library . This physical library was constructed using a two-step cloning strategy involving the isolation of GC-rich chromosomal fragments based on their lack of methylation in vivo, followed by an enrichment of fragments that could be methylated in vitro. The following steps were taken to assemble the test sequences:
(1) The full list of DerLab CGIs  was retrieved. These experimental CGIs can be quite short and the minimum length is actually 8. Out of the 6235 CGIs, 1612 (or 26 %) were shorter than 200 bp. The experimental islands were then divided into two groups: those that overlapped with the TSS and those that did not. The TSS coordinates were taken from the DBTSS database . These two groups differed significantly in their mean length, CpG density, and CpG fraction, with all values being higher for the TSS group. In assembling the test sequences, we exclusively used the TSS group, which had a greater average length.
(2) In addition, non-island sequences – the sequences located between the CGIs of chromosome 22, as specified by the UCSC annotation  – were extracted.
(3) To further ensure a random background for the CGIs in our test sequences, all non-island segments were randomly shuffled using an algorithm that preserves dinucleotide frequencies . As non-island segments could contain some non-annotated CGIs, this step ensures the randomness of the non-island segments, at the same time conserving nucleotide and dinucleotide compositions. This setup is the less biased, as none of the finders is expected to predict CGIs on randomized sequences.
(4) The shuffled non-island segments were then alternatively combined with 400 island segments overlapping with TSS, chosen at random from the DerLab sample, thus assembling a test sequence of approximately 18 Mb in length.
(5) Using the assembling process described in step (4), we generated a set of 10 test sequences containing experimental CGIs alternating with shuffled non-island segments.
Availability and requirements
Project name: CpGcluster
Operating system(s): platform independent
Programming language: Perl 5 (see 7 for source code)
Format: ZIP Size: 4KB Download file
Licence: open source
CC: Correlation Coefficient
CGI: CpG island
CpG O/E ratio: Ratio between observed and expected CpG frequencies
CpG: dinucleotide CG
G+C content, %G+C: Molecular fraction of guanine and cytosine
NG: non-genic islands
PhastCons: Phylogenetic Conserved Elements
Sn: Sensitivity – the proportion of island nucleotides which have been correctly predicted as islands
Sp: Specificity – the proportion of predicted island nucleotides that are actually islands
SPR: promoter of SERPINB5 gene
TES: Transcription End Site
TSS: Transcription Start Site
UTR: Untranslated Regions
JLO proposed the distance approach, then associating a p-value with each predicted island by means of a randomization test. JMA and PLE wrote and analyzed a prototype program based on distances and suggested to assign p-values analytically. MH proposed the use of the negative binomial distribution for computing p-values, carried out the randomization tests and implemented the final version into a Perl script. CP evaluated the accuracy of CpGcluster against other island-finding programs. PC carried out the analysis of the CpG distance geometric distribution which motivates the use of the median distance as a threshold. JLO, MH and PC drafted the manuscript and edited the contributions from the remaining authors. All the authors read and approved the final version.
Helpful comments by Antonio Marín and two anonymous reviewers are greatly appreciated. This work was supported by the Spanish Government (BIO2005-09116-C03-01 to JLO, MH, PC and CP and BIO2002-04014-C03-03 to PLE and JMA) and Plan Andaluz de Investigación (CVI-162 and FQM-322). MH and CP acknowledge the grants from the University of Granada (Spain) and the German Academic Exchange Service (DAAD), respectively. CP is grateful for the use of the facilities of the German Cancer Research Center, Heidelberg, Germany). The help of David Nesbitt with the English version of the manuscript is also appreciated.
In Silico Biol 2003, 3(3):235-40. PubMed Abstract
Carninci P, Sandelin A, Lenhard B, Katayama S, Shimokawa K, Ponjavic J, Semple CA, Taylor MS, Engstrom PG, Frith MC, Forrest AR, Alkema WB, Tan SL, Plessy C, Kodzius R, Ravasi T, Kasukawa T, Fukuda S, Kanamori-Katayama M, Kitazume Y, Kawaji H, Kai C, Nakamura M, Konno H, Nakano K, Mottagui-Tabar S, Arner P, Chesi A, Gustincich S, Persichetti F, Suzuki H, Grimmond SM, Wells CA, Orlando V, Wahlestedt C, Liu ET, Harbers M, Kawai J, Bajic VB, Hume DA, Hayashizaki Y: Genome-wide analysis of mammalian promoter architecture and evolution.
Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes.
Heisler LE, Torti D, Boutros PC, Watson J, Chan C, Winegarden N, Takahashi M, Yau P, Huang TH, Farnham PJ, Jurisica I, Woodgett JR, Bremner R, Penn LZ, Der SD: CpG Island microarray probe sequences derived from a physical library are representative of CpG Islands annotated on the human genome.