Global Pre-Clinical Statistics, Pfizer Global Research and Development, 10777 Science Center Drive, San Diego, CA, 92121, USA

Computational Biology Group, Oncology Research Unit, Pfizer Global Research and Development, San Diego, CA, 92121, USA

Statistics, Corporate Analytics, Amylin Pharmaceuticals Inc, 9360 Towne Centre Drive, San Diego, CA, 92121, USA

Abstract

Background

Human cancer is caused by the accumulation of tumor-specific mutations in oncogenes and tumor suppressors that confer a selective growth advantage to cells. As a consequence of genomic instability and high levels of proliferation, many passenger mutations that do not contribute to the cancer phenotype arise alongside mutations that drive oncogenesis. While several approaches have been developed to separate driver mutations from passengers, few approaches can specifically identify activating driver mutations in oncogenes, which are more amenable to pharmacological intervention.

Results

We propose a new statistical method for detecting activating mutations in cancer by identifying nonrandom clusters of amino acid mutations in protein sequences. A probability model is derived using order statistics, assuming that the location of amino acid mutations on a protein follows a uniform distribution. Our statistical measure is the difference between pair-wise order statistics, which is equivalent to the size of an amino acid mutation cluster, and the probabilities are derived from exact and approximate distributions of the statistical measure. Using data in the Catalog of Somatic Mutations in Cancer (COSMIC) database, we have demonstrated that our method detects well-known clusters of activating mutations in KRAS, BRAF, PI3K, and

Conclusions

Our proposed method is useful to discover activating driver mutations in cancer by identifying nonrandom clusters of somatic amino acid mutations in protein sequences.

Background

Cancer is a genetic disease caused by the accumulation of tumor-specific (somatic) mutations in two broadly defined types of genes called tumor suppressors and oncogenes (Vogelstein and Kinzler (2004)

Several methods have been developed for the automated prediction of driver oncogenic mutations in individual genes, yet few are suitable for detecting activating mutations. The most straightforward method predicts that driver mutations have a large number of mutations relative to the estimated background mutational rate, after normalizing for gene size (Wang et al. (2002)

We propose an alternative approach to detect activating mutations in oncogenes, based on the hypothesis that only a small number of specific mutations can activate a protein. To be precise, we hypothesize that a localized cluster of amino acid mutations within a protein sequence, especially in the absence of obvious mutational hotspots, is a fingerprint of selection for the oncogenic phenotype associated with activating driver mutations. Evolutionary studies demonstrate that most amino acid replacements are either neutral or incompatible with protein function (Graur and Li (2000)

Several methods in the statistics literature can be applied to detect mutation clusters. For example, Naus (1965)

In this work, a new statistical method is introduced that identifies nonrandom mutation clustering without specifying the number of mutations or the cluster length. The exact and approximate distributions of the statistical measure are derived and a nonrandom mutation clustering (NMC) algorithm is developed based on the measure. We confirmed the utility of this approach by detecting well-known activating mutations in KRAS, BRAF, PI3K, and

Results

Data Description

Data used in this study are from the COSMIC (Catalog of Somatic Mutations in Cancer) database, version 40 (Forbes et al. (2008)

Nonrandom clusters in cancer genes

Using the NMC algorithm (see Methods), 12 different proteins out of 446 contain nonrandom amino acid mutation clusters with cutoff probability of less than 0.05, with the most significant clusters listed in Table

Genes with significant mutation clusters (Probability < 0.01)

| **Gene** | **Cluster size** | **Cluster positions** | **Number of mutations in cluster** | **Cumulative cluster probability*** |
| --- | --- | --- | --- | --- |
| KRAS (188 aa) | 2 | 12-13 | 131 | 1.47E-234 |
| BRAF (766 aa) | 1 | 600-600 | 60 | 2.02E-157 |
| TP53 (393 aa) | 155 | 132-286 | 326 | 3.07E-101 |
| NRAS (189 aa) | 1 | 61-61 | 33 | 7.11E-62 |
| PIK3CA (1068 aa) | 5 | 542-546 | 27 | 7.09E-46 |
| CTNNB1 (781 aa) | 13 | 33-45 | 12 | 8.54E-19 |
| ERBB2 (1255 aa) | 1 | 776-776 | 2 | 7.97E-04 |
| HRAS (189 aa) | 1 | 61-61 | 4 | 2.06E-06 |
| PTEN (403 aa) | 63 | 111-173 | 8 | 5.50E-05 |
| MAP2K7 (419 aa) | 1 | 162-162 | 2 | 0.002386 |
| LRRK2 (2534 aa) | 4 | 1723-1726 | 2 | 0.003547 |

*: only the most significant cluster per gene is listed

Mutation hotspots in classical oncogenes

Table

Mutation positions for selected oncogenes

| **Gene** | **Position (# of mutations)** |
| --- | --- |
| BRAF (766 aa) | 464(1), 466(2), 469(4), 581(1), 596(2), 597(2), **600**(60), 601(2) |
| KRAS (188 aa) | **12**(99), **13**(32), 22(1), 23(1), 61(6), 117(1), 146(10) |
| CTNNB1 (781 aa) | 6(1), **33**(3), **34**(2), **37**(3), **41**(2), **45**(2) |
| PIK3CA (1068 aa) | 88, 449(1), 453(1), 539(1), **542**(5), **545**(20), **546**(2), 549(1), 1023, 1025(1), 1047(21), 1049(1), 1066(1) |

The number of mutations for each position is shown in parentheses, positions within clusters from Table 1 are highlighted in bold, and CpG positions are underlined.

**Ribbon representation of the PI3K α**. Ribbon representation of the PI3K

For most genes in Table

General remarks on detected mutation hotspots

In addition to known clusters of activating mutations in major oncogenes, several other genes have significant mutation hot-spots. For example, two mutations between the Roc (Ras of complex proteins) and kinase domains in the LRRK2 locus form a significant cluster. The LRRK2 kinase, also known as PARK8, is not considered to be a classical cancer gene. It most closely resembles the family of tyrosine-like kinases that phosphorylate serine/threonine residues and lies upstream of mitogen-activated protein kinase (MAPK) pathways (Mata et al. (2006)

As expected, we found fewer significant mutation hot-spots in tumor suppressors, and these hot-spots were typically much larger than those associated with oncogenes. In general, inactivating amino acid mutations are not expected to form localized nonrandom clusters, but rather to span many residues in highly conserved regions (e.g. Nigro et al. (1989)

**Ribbon representation of the human p53**. Ribbon representation of the human p53 core domain X-ray structure (PDB Code: 2OCJ; Wang et al. (2007)

Discussion and Conclusions

A new method for the identification of nonrandom mutation clusters in biological sequences is presented. The method is fast and robust and, unlike many previous methods, it does not require a fixed window length. This enables the identification of significant clusters of variable sizes, which is particularly important for the detection of activating mutations. We have applied this method to investigate somatic amino acid mutations in the COSMIC database. Our method detected very short clusters spanning a few individual amino acid positions in the case of the oncogenes BRAF or KRAS, as well as larger regions in the tumor suppressors p53 and PTEN.

A recent paper by Wagner (2007)

Our method has several potential limitations. First of all, the status of all coding positions must be determined. This is primarily a limitation for older studies, where typically only those exons with known mutations were screened. However, with the explosion of large-scale cancer genome sequencing (e.g. Sjöblom et al. (2006)

The aim of the method is to detect activating mutations that are assumed to be concentrated in specific amino acid positions. Activating mutations are typical for cellular proto-oncogenes and, as expected, significant clusters are detected in oncogenes such as BRAF, RAS genes, CTNNB1/

In conclusion, we propose a new method for discovering nonrandom clusters of mutations in biological sequences. Unlike previous approaches, the method does not use fixed-length windows and therefore can be used to detect clusters of highly variable sizes. We demonstrated the value of this method to detect activating amino acid mutations in human tumors and confirmed nonrandom clustering of well-known oncogenic mutations in several classical oncogenes. The method can also be used to discover new oncogenes from large-scale cancer genome data and to identify gain-of-function mutations in tumor suppressors. Finally, detection of nonrandom sequence changes is a general problem and the method may be useful in other areas such as DNA polymorphism analysis and comparative evolutionary studies (Wagner (2007)

Methods

Single amino acid mutations may lead to changes in protein function. Because missense mutations are the most likely single-point genetic mutation to have an effect on protein function, the nonrandom mutation clustering (NMC) algorithm is applied to missense mutations in individual genes in this work.

The NMC algorithm is derived under the following assumptions: 1. each amino acid residue in a protein sequence has equal mutation probability; 2. mutations between amino acid positions are independent; 3. mutations between samples are independent; and 4. the number of potentially available samples is larger than the number of mutations.

Denote X_{i}, a random variable between 1 and N, as the position of the ith mutation, i = 1, ..., n, where N is the number of amino acids in the protein and n is the number of mutations. Under the assumptions above, X_{i} follows a discrete uniform distribution with Pr(X_{i} = j) = 1/N, where j = 1, ..., N.

By assumption, mutations are random and can occur at the same position more than once. The data are transferred into order statistics by ordering the X_{i} into X_{(1)} ≤ ... ≤ X_{(i)} ≤ ... ≤ X_{(n)}, where X_{(i)} is the ith smallest number in the sample, i = 1, ..., n. The statistical measure is R_{ki} = X_{(k)} - X_{(i)}, for any pair i, k, i < k, i, k = 1, ..., n. We calculate the probability of R_{ki} and declare the clustering to be nonrandom when the probability of a distance between order statistics as small as the observed R_{ki} is less than a pre-defined significance level α. That is, Pr(R_{ki} ≤ r) is the chance that the distance between order statistics X_{(i)} and X_{(k)} is as close or closer than the observed distance r. R_{ki} has the simple interpretation of the size of the mutation cluster.
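To make the statistical measure concrete, the following minimal Python sketch sorts a list of mutation positions into order statistics and tabulates the difference between the kth and ith order statistics for every pair i < k. The function name `pairwise_distances` and the example positions are illustrative assumptions, not part of the authors' R implementation.

```python
def pairwise_distances(positions):
    """Return {(i, k): x_(k) - x_(i)} for all 1 <= i < k <= n.

    `positions` holds one amino acid position per observed mutation;
    repeated positions are allowed, as in the model assumptions.
    """
    x = sorted(positions)          # order statistics x_(1) <= ... <= x_(n)
    n = len(x)
    return {
        (i, k): x[k - 1] - x[i - 1]
        for i in range(1, n + 1)
        for k in range(i + 1, n + 1)
    }


# Hypothetical example: four mutations in one protein.
distances = pairwise_distances([13, 12, 61, 12])
print(distances[(1, 2)], distances[(1, 3)], distances[(1, 4)])
```

With n mutations there are n(n - 1)/2 such pairs, which is the number of tests that the multiple-testing correction later has to account for.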

1.1 Derivation of the distribution of statistical measure

While distributions of order statistics are usually derived for continuous distributions, they have also been derived for discrete distributions. Burr (1955) _{ki}, where i = 1 and k = _{ki}.

The distribution of R_{ki} is developed from the joint distribution of order statistics X_{(i)} and X_{(k)} for any pair i, k, i < k, i, k = 1, ..., n. R_{ki}, the distance between order statistics X_{(i)} and X_{(k)}, can range from 0, which means both mutations are located at the same position, to N - 1; R_{ki} = 1 implies that the mutations are adjacent to each other, and so on. We develop the distribution of R_{ki} for each possible scenario.

_{ki }= 0, for any pair i, k, i < k, i, k = 1, .., _{(i) }and _{(k) }are located at the same position. Taking the _{ki }= 0 is written as

The distribution is derived using the properties of order statistics. For example, when _{(i) }= _{(k) }= 1, the first _{(i) }= _{(k) }=

For R_{ki} = 1, for any pair i, k, i < k, i, k = 1, ..., n, the order statistics X_{(i)} and X_{(k)} are adjacent to each other. The probability distribution can be written as:

For _{ki }=

The distributions for _{ki }= 1 and _{ki }= _{ki}= 0. The _{(i)}, and the _{(k)}. For the remaining _{(i)}, _{(i) }and _{(k) }and the remaining _{(k)}, where _{ki }= 1 and _{ki }=

Finally, for the special case of i = 1 and k = _{ki }may be simplified as

Note that Pr(R_{n1 }≤
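As an illustrative cross-check of the exact discrete distribution, the Python sketch below computes the probability that the kth and ith order statistics lie within distance r of each other when n positions are drawn independently from a discrete uniform distribution on 1, ..., N. It conditions on the position a of the ith order statistic and sums multinomial probabilities over four groups of draws (below a, at a, within r residues above a, and further away). This is an assumption-laden sketch rather than the authors' closed-form expressions or R code: the function name `exact_cluster_cdf` and the counters `c1` to `c4` are introduced here, and the nested loops are practical only for small n.

```python
from math import comb


def exact_cluster_cdf(r, i, k, n, N):
    """Pr(X_(k) - X_(i) <= r) for n i.i.d. draws, uniform on 1..N (i < k)."""
    total = 0.0
    for a in range(1, N + 1):               # candidate value of X_(i)
        near = min(r, N - a)                # positions in (a, a + r]
        p_below = (a - 1) / N
        p_at = 1 / N
        p_near = near / N
        p_far = (N - a - near) / N
        for c1 in range(i):                 # fewer than i draws below a
            for c2 in range(i - c1, n - c1 + 1):      # so that X_(i) == a
                # at least k draws must fall at or below a + r
                for c3 in range(max(0, k - c1 - c2), n - c1 - c2 + 1):
                    c4 = n - c1 - c2 - c3
                    ways = (comb(n, c1) * comb(n - c1, c2)
                            * comb(n - c1 - c2, c3))
                    total += (ways * p_below ** c1 * p_at ** c2
                              * p_near ** c3 * p_far ** c4)
    return total


# Two mutations in a 10-residue protein coincide with probability 1/10.
print(exact_cluster_cdf(0, 1, 2, 2, 10))
```

Setting r = 0 or r = 1 gives the same-position and adjacent-position cases, and setting i = 1 and k = n gives the range statistic.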

1.2 Approximation of the distribution

The derivation in section 1.1 is the exact distribution of the statistical measure for nonrandom mutation clustering in the discrete uniform distribution. Proteins typically contain hundreds or thousands of amino acids and it is convenient to approximate the discrete uniform distribution with a continuous uniform distribution (0, 1) because calculating the distribution of _{ki }=

When the _{(i) }and _{(k)}, for any pair i, k, i < k, i, k = 1, .., is:

where distance is normalized to be in the range (0,1), so the distance _{ki }= (_{(k) }- _{(i)})/_{ki}= _{(k) }- _{(i)}. The cumulative distribution can be written as Pr(_{ki }≤

which by iterated integration by parts gives:
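For reference, the standard results for order statistics of n independent uniform (0, 1) variables give the explicit forms below. The symbols u, v, t and r, and the binomial-sum form of the cumulative distribution, are notation assumed here for illustration; R_{ki} denotes the normalized distance between the kth and ith order statistics.

```latex
\begin{align}
f_{(i),(k)}(u, v) &= \frac{n!}{(i-1)!\,(k-i-1)!\,(n-k)!}\;
  u^{\,i-1}\,(v-u)^{\,k-i-1}\,(1-v)^{\,n-k},
  \qquad 0 < u < v < 1, \\
\Pr(R_{ki} \le r) &= \frac{n!}{(k-i-1)!\,(n-k+i)!}
  \int_{0}^{r} t^{\,k-i-1}\,(1-t)^{\,n-k+i}\,dt \\
 &= \sum_{j=k-i}^{n} \binom{n}{j}\, r^{\,j}\,(1-r)^{\,n-j},
  \qquad 0 \le r \le 1 .
\end{align}
```

The integrand in the second line is the Beta density with parameters k - i and n - k + i + 1, and the last line is the probability that a binomial (n, r) variable is at least k - i.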

Using the continuous uniform distribution, R_{ki} simply follows a Beta distribution with parameters k - i and n - k + i + 1, and Pr(R_{ki} ≤ 1) = 1. This result was reported in Johnson et al. (1995)
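A minimal Python sketch of the continuous approximation follows. It assumes the standard fact that the difference between the kth and ith order statistics of n uniform (0, 1) variables follows a Beta distribution with parameters k - i and n - k + i + 1, whose cumulative distribution for integer parameters equals a binomial upper tail. The function name `approx_cluster_cdf` is hypothetical, and log-gamma terms are used only to avoid overflow for large n.

```python
from math import exp, lgamma, log, log1p


def approx_cluster_cdf(r, i, k, n):
    """Beta(k - i, n - k + i + 1) CDF at r, computed as Pr(Bin(n, r) >= k - i).

    `r` is the normalized distance (position difference divided by the
    protein length), so it lies in [0, 1].
    """
    if not 1 <= i < k <= n:
        raise ValueError("need 1 <= i < k <= n")
    if r <= 0.0:
        return 0.0
    if r >= 1.0:
        return 1.0
    log_r, log_q = log(r), log1p(-r)
    total = 0.0
    for j in range(k - i, n + 1):
        # log of the binomial coefficient C(n, j)
        log_binom = lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
        total += exp(log_binom + j * log_r + (n - j) * log_q)
    return min(total, 1.0)


# Range of two uniform draws: Pr(R <= 0.3) = 1 - (1 - 0.3)**2
print(approx_cluster_cdf(0.3, 1, 2, 2))
```

Because the continuous approximation assigns probability zero to a distance of exactly zero, it is not informative for mutations that share a position, which is consistent with reserving the exact discrete distribution for the smallest distances.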

1.3 Correction for multiple testing

For each pair-wise order statistic, the exact and continuous distributions can be calculated using formulas in sections 1.1 and 1.2. Clusters are evaluated for each pair of order statistics, which can elevate the false positive rate due to multiple testing. A Bonferroni correction can be chosen to correct the false positive rate because it doesn't require an independent hypotheses assumption and it is a conservative test. The false discovery rate (FDR) developed by Benjamini and Hochberg (1995)
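The two corrections mentioned here can be sketched in a few lines of Python. The function names are illustrative, the Bonferroni adjustment multiplies each probability by the number of tests, and the FDR adjustment follows the usual Benjamini-Hochberg step-up procedure; neither function is taken from the authors' R code.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted probabilities, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted probabilities, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda idx: pvalues[idx])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest probability so that the
    # adjusted values are monotone in the rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical cluster probabilities for four pairs of order statistics.
raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

With n mutations in a protein, the number of tests is the number of pairs of order statistics, n(n - 1)/2.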

1.4 NMC algorithm

The exact and approximate distributions of the distance between pair-wise order statistics were derived in sections 1.1 and 1.2. The calculation is rapid for the special cases in which R_{ki} is 0 or 1 and for the range statistic, and we use the exact distribution derived in section 1.1 to ensure accuracy for these cases. For further efficiency when calculating the distribution for R_{ki} = 1, the algorithm is stopped when the iterated summation in the distribution reaches the significance level, because the full summation is larger than the partial summation and the difference cannot be significant. The continuous distribution is used for computational efficiency when the difference R_{ki} is greater than 1. The nonrandom mutation clustering (NMC) algorithm is summarized in the following procedure:

• **Input**: Number and location of missense mutations in a protein

• **Output**: A table with columns of nonrandom mutation cluster size, starting location of the cluster, ending location of the cluster, number of mutations observed in the cluster and probability of the cluster that is significant after Bonferroni or FDR correction.

• **NMC algorithm**:

◦ Step 1: Reorder the mutation positions into order statistics and set the significance level

◦ Step 2: For each pair-wise order statistics, calculate the probability Pr(_{ki }≤

◦ Step 3: Calculate the Bonferroni or FDR corrected probabilities.

◦ Step 4: Report the multiple-testing corrected significant clusters in the output table after sorting from the lowest probability to the highest.
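The four steps above can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not the authors' R implementation: the function names are hypothetical, only the Bonferroni correction is shown, the exact probability is computed by a brute-force multinomial sum that is practical only for a small number of mutations, the early-stopping shortcut is omitted, and the reported cluster size is taken as end minus start plus one.

```python
from math import comb, exp, lgamma, log, log1p


def exact_cdf(r, i, k, n, N):
    """Pr(X_(k) - X_(i) <= r), n draws from discrete uniform on 1..N."""
    total = 0.0
    for a in range(1, N + 1):                      # value of X_(i)
        near = min(r, N - a)
        probs = ((a - 1) / N, 1 / N, near / N, (N - a - near) / N)
        for c1 in range(i):
            for c2 in range(i - c1, n - c1 + 1):
                for c3 in range(max(0, k - c1 - c2), n - c1 - c2 + 1):
                    c4 = n - c1 - c2 - c3
                    ways = (comb(n, c1) * comb(n - c1, c2)
                            * comb(n - c1 - c2, c3))
                    total += (ways * probs[0] ** c1 * probs[1] ** c2
                              * probs[2] ** c3 * probs[3] ** c4)
    return total


def approx_cdf(r, i, k, n):
    """Beta(k - i, n - k + i + 1) CDF at r, as a binomial upper tail."""
    if r <= 0.0:
        return 0.0
    if r >= 1.0:
        return 1.0
    return sum(
        exp(lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
            + j * log(r) + (n - j) * log1p(-r))
        for j in range(k - i, n + 1)
    )


def nmc(positions, N, alpha=0.05):
    """Rows of (size, start, end, mutations spanned, corrected probability)."""
    x = sorted(positions)                          # Step 1: order statistics
    n = len(x)
    pairs = [(i, k) for i in range(1, n + 1) for k in range(i + 1, n + 1)]
    rows = []
    for i, k in pairs:                             # Step 2: pair-wise tests
        d = x[k - 1] - x[i - 1]
        if d <= 1:
            p = exact_cdf(d, i, k, n, N)           # exact for distance 0 or 1
        else:
            p = approx_cdf(d / N, i, k, n)         # continuous approximation
        p_adj = min(1.0, p * len(pairs))           # Step 3: Bonferroni
        if p_adj < alpha:
            rows.append((d + 1, x[i - 1], x[k - 1], k - i + 1, p_adj))
    return sorted(rows, key=lambda row: row[-1])   # Step 4: sort and report


# Hypothetical protein of 50 residues with a hotspot at positions 12-13.
for row in nmc([12, 13, 12, 40, 12, 12], N=50):
    print(row)
```

Overlapping pairs within the same hotspot are all reported by this sketch; keeping only the most significant cluster per gene, as in the results table, would require an additional filtering step.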

The R source code is available in Additional file

**NMC**. R source code of NMC algorithm.

**Poweranalysis**. Analysis of minimum number of mutations required for NMC algorithm

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JY designed and developed the statistical method, and coded the NMC algorithm in R. AP and PAR proposed the idea of detecting activating mutations with nonrandom clusters. AP acquired the COSMIC database and prepared the data. JY and AP performed the analysis and drafted the manuscript. EAL and PAR contributed the idea of three-dimensional mutation detection. CT contributed the idea of the statistical method. EAL, PAR and CT revised the manuscript. PAR finalized the manuscript. All authors read and approved the final manuscript.

Acknowledgements

JY, AP, EAL and PAR are full-time Pfizer employees. CT was a full-time Pfizer employee at the time of the work. The authors thank Professor David M. Rocke from University of California, Davis for helpful discussions and suggestions on the paper. In addition, the authors thank two anonymous referees for their insightful comments.