Abstract
Background
Aneuploidy has long been recognized to be associated with cancer. A growing body of evidence suggests that tumorigenesis, the formation of new tumors, can be attributed to some extent to errors occurring at the mitotic checkpoint, a major cell cycle control mechanism that acts to prevent chromosome missegregation. However, so far no statistical model has been available to quantify the role aneuploidy plays in determining cancer.
Methods
We develop a statistical model for testing the association between aneuploidy loci and cancer risk in a genome-wide association study. The model incorporates quantitative genetic principles into a mixture-model framework in which various genetic effects, including additive, dominant, imprinting, and their interactions, are estimated by implementing the EM algorithm.
Results
Under the new model, a series of hypothesis tests is formulated to explain the pattern of the genetic control of cancer through aneuploid loci. Simulation studies were performed to investigate the statistical behavior of the model.
Conclusions
The model will provide a tool for estimating the effects of genetic loci on aneuploidy abnormality in genome-wide studies of cancer cells.
Background
In recent years, there has been a wealth of literature on the development of statistical methods for genetic analysis of complex diseases, such as cancer [1,2]. These methods, mostly founded on rigorous statistical theory and models, have been instrumental in the analysis and modeling of genetic data, leading to the identification of significant genetic variants involved in pathogenesis [3,4]. However, many existing statistical methods neglect biological principles that have been refreshed and updated by the latest scientific discoveries made with new genomic technologies. A lack of integration between statistics and biology significantly limits the detection and characterization of the new genetic underpinnings of a disease. The motivation of this study is to develop a novel statistical model for detecting the genetic control of cancer through chromosomal loci predisposing to aneuploidy.
Aneuploidy occurs when an individual has an abnormal number of chromosomes. Partial or whole chromosomes may be duplicated or missing in individuals with this condition. Cytological studies show that aneuploidy is one of the most pronounced differences between normal and cancer cells [5]. However, debates have arisen over how aneuploid cells are produced and whether they are a cause or a consequence of tumorigenesis [6,7]. A growing body of evidence from molecular genetic studies supports a role of aneuploidy in the genetic underpinning of cancer [6,8-12]. According to extensive work by Duesberg and his group, the impact of aneuploidy on cancer is embodied in the following aspects:
(1) Aneuploidy is confirmed to generate abnormal phenotypes, such as Down syndrome in humans and cancer in animals;
(2) The degree of aneuploidy is correlated with phenotype abnormality;
(3) Since aneuploidy imbalances the highly balance-sensitive components of the spindle apparatus, it destabilizes symmetrical chromosome segregation;
(4) Both non-genotoxic and genotoxic carcinogens can cause aneuploidy by physical or chemical interaction with mitosis proteins.
Similar to point (2), there is additional evidence that cancer-specific phenotypes result when aneuploidy exceeds a certain threshold [13,14]. Kops et al. [15] outlined the cytological mechanisms for aneuploid formation from checkpoint signalling. Normally, chromosome missegregation can be prevented at the mitotic checkpoint by delaying cell-cycle progression through mitosis until all chromosomes have successfully made spindle-microtubule attachments, but a defect in the mitotic checkpoint can generate aneuploidy, facilitate tumorigenesis, and increase resistance to anti-cancer therapies [16].
The statistical model developed to detect cancer genes is constructed with a random sample of aneuploid patients with cancer drawn from a natural population. At an aneuploid locus, polyploids occur because of the duplication of one or two parental chromosomes and, thus, the model can be formulated to test the genetic imprinting of alleles due to their different parental origins.
If the aneuploidy hypothesis continues to be confirmed, this model will provide a timely tool to quantify the genetic effects of aneuploidy loci on cancer susceptibility by integrating the genetic data from the cancer genome project. Also, by comparison with the model for detecting somatic mutations, this new model will help to determine the relative importance of the aneuploidy and mutation hypotheses in cancer studies.
Methods
Study Design
Suppose there is a normal human diploid population at Hardy-Weinberg equilibrium (HWE). Some individuals in this population develop cancer because particular regions of their chromosomes are multiplied to form a triploid, tetraploid, or a polyploid of any higher order. To keep the description of our model simple, we consider only a triploid. As proven below, the population after chromosomal multiplication will deviate from HWE. We assume that a total of n cancer patients are randomly sampled from this population. Each sampled patient is a triploid at a particular aneuploid locus. We genotype all these patients at duplicated chromosomal segments with molecular markers, although the parental origin of chromosomal duplication is unknown. A phenotype that defines cancer is measured for all subjects. A model will be derived to distinguish between the genetic effects of alleles inherited from the maternal (M) and paternal (P) parents.
Chromosome Duplication
Triploid Model
Consider a gene of interest A, with two alleles A and a, on a chromosome (say chromosome 3). Figure 1 describes the process by which a pair of normal chromosomes is duplicated into a triploid for a portion of chromosome 3. For a normal diploid, the genotypes at this gene may be AA, Aa, or aa. Considering parent-specific origins of alleles, we use A|A, A|a, a|A, and a|a to denote the configurations of these genotypes, respectively, where the alleles on the left and right sides of the vertical lines represent two alleles from different parents. Of these, configurations A|a and a|A are genotypically observed as the same genotype Aa. When the chromosomal segment that harbors this gene is duplicated for only one single chromosome, triploids with two copies from one parent and the third copy from the other parent will result. Chromosomes derived from the maternal and the paternal parent may both be duplicated, but with different frequencies. Thus, through such a duplication, the four configurations in the normal diploid will form a total of eight triploid configurations, which are classified into four different genotypes:
(1) AAA, including configurations AA|A, duplicated from the left-side parent, and A|AA, duplicated from the right-side parent, of configuration A|A;
(2) AAa, including configurations AA|a, duplicated from the left-side parent of configuration A|a, and a|AA, duplicated from the right-side parent of configuration a|A;
(3) Aaa, including configurations aa|A, duplicated from the left-side parent of configuration a|A, and A|aa, duplicated from the right-side parent of configuration A|a;
(4) aaa, including configurations aa|a, duplicated from the left-side parent, and a|aa, duplicated from the right-side parent, of configuration a|a.
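The enumeration above can be checked mechanically. The following minimal Python sketch duplicates either the left-side or the right-side allele of each diploid configuration and groups the resulting triploid configurations by genotype. The function name `triploid_configurations` and the `left|right` string encoding are illustrative assumptions, not part of the model.

```python
from collections import defaultdict


def triploid_configurations():
    """Group the eight triploid configurations by their observed genotype."""
    # Diploid configurations written as (left-side allele, right-side allele).
    diploid = [("A", "A"), ("A", "a"), ("a", "A"), ("a", "a")]
    groups = defaultdict(list)
    for left, right in diploid:
        # Duplicate the left-side allele, then the right-side allele.
        for config in (f"{left}{left}|{right}", f"{left}|{right}{right}"):
            n_upper = config.count("A")  # number of copies of allele A
            genotype = "A" * n_upper + "a" * (3 - n_upper)
            groups[genotype].append(config)
    return dict(groups)


for genotype, configs in triploid_configurations().items():
    print(genotype, configs)
```

Each of the four genotypes collects exactly two configurations, which is why the parental origin of the duplicated copy is treated as missing data later in the model.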
Let p and q (p + q = 1) be the allele frequencies of A and a in the original population before chromosome duplication. For a natural population at HWE, genotype frequencies can be expressed as p^{2} for genotype AA, 2pq for genotype Aa, and q^{2} for genotype aa.
Theorem
For an HWE diploid population, chromosome duplication operating on particular loci can violate the equilibrium status of the population.
Proof: Let g and h denote the proportions of alleles A and a that are duplicated, respectively. Thus, of diploid genotype AA, a proportion g will become AAA, with the remaining proportion 1 - g unduplicated. Similarly, proportions h and 1 - h of diploid genotype aa will be aaa and aa after duplication. Diploid genotype Aa will have three possibilities, AAa with a proportion of g, Aaa with a proportion of h, and Aa with a proportion of 1 - g - h. In a duplicated population purely composed of triploids, we will have allele frequencies for A and a, respectively, as
The genotype frequency of triploid AAA in the duplicated population is expressed as
Similarly, we have the genotype frequency of triploid aaa as
Thus, unless and , the duplicated population will be at Hardy-Weinberg disequilibrium.
This theorem shows that traditional HWE theory for population genetic studies will not be useful for cancer gene identification. Meanwhile, this theorem provides a foundation for deriving a statistical model to conduct genome-wide association studies of cancer.
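The statement of the theorem can be illustrated numerically. The sketch below assumes that the triploid population consists only of the duplicated fraction of each diploid genotype, renormalized to sum to one, and that allele frequencies are counted over the three copies carried by each triploid. These are assumptions of the sketch rather than the equations of the proof, and the values p = 0.6, g = 0.3, and h = 0.4 are purely illustrative.

```python
def triploid_population(p, g, h):
    """Genotype and allele frequencies among triploids formed from an HWE diploid population."""
    q = 1.0 - p
    # Unnormalized mass of each triploid genotype (assumption: a proportion g of
    # A-duplications and h of a-duplications, as described in the proof).
    weights = {
        "AAA": g * p * p,
        "AAa": g * 2.0 * p * q,
        "Aaa": h * 2.0 * p * q,
        "aaa": h * q * q,
    }
    total = sum(weights.values())
    freq = {name: w / total for name, w in weights.items()}
    # Allele A makes up 3/3, 2/3, 1/3 and 0/3 of the four genotypes.
    p_new = freq["AAA"] + 2.0 / 3.0 * freq["AAa"] + 1.0 / 3.0 * freq["Aaa"]
    return freq, p_new


freq, p_new = triploid_population(p=0.6, g=0.3, h=0.4)
q_new = 1.0 - p_new
dev_AAA = freq["AAA"] - p_new ** 3  # deviation of AAA from its equilibrium value
dev_aaa = freq["aaa"] - q_new ** 3  # deviation of aaa from its equilibrium value
print(round(p_new, 4), round(dev_AAA, 4), round(dev_aaa, 4))
```

With these illustrative inputs both deviations are clearly non-zero, so the homozygous triploid frequencies are not the cubes of the allele frequencies.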
Quantitative Genetic Parameters
The cancer patients sampled are purely composed of triploids at an aneuploid locus. Each patient is typed for DNA-based markers at aneuploid loci and also phenotyped for a cancer trait. Let n_{k} denote the observed number of triploid genotype k (k = 1 for AAA, 2 for AAa, 3 for Aaa, and 4 for aaa). The total number of patients sampled is n = n_{1} + n_{2} + n_{3} + n_{4}. It has been proven above that chromosomal duplication may violate the Hardy-Weinberg equilibrium of the population. Thus, genotype frequencies, P_{k}, are expressed as the products of allele frequencies plus disequilibrium parameters. Let D_{1} and D_{2} denote the Hardy-Weinberg disequilibrium (HWD) coefficients associated with alleles A and a, respectively, at the duplicated gene. Thus, the frequencies of the four genotypes can be expressed in terms of allele frequencies and disequilibria (Table 1).
Table 1. The changes of genotypes and genotype frequencies after chromosomal duplication.
The same triploid genotype at the duplicated gene may have different values when the expression of its alleles depends on their parental origin. For example, triploid genotype AAA may be formed from normal diploid genotype AA when either the maternally derived (M) or the paternally derived (P) allele is doubled. Thus, the configuration of genotype AAA can be either A_{M}A_{M}A_{P} or A_{M}A_{P}A_{P}, where the subscripts denote the parental origin of alleles. Table 2 gives the genotypic values (μ_{k1}, μ_{k2}) of the two possible configurations of each triploid genotype. These genotypic values are partitioned into eight different components: the overall mean (μ), the additive genetic effect (a), the dominance genetic effects of AA over a (d) and of A over aa (d'), the genetic imprinting effect due to different origins of alleles (λ), and the interactions between the additive and imprinting effects (I_{aλ}), the d dominance and imprinting effects (I_{dλ}), and the d' dominance and imprinting effects (I_{d'λ}).
Table 2. Genotypic values and proportions of different configurations of a triploid genotype at a duplicated gene.
For each triploid genotype, the relative proportions of the two underlying configurations can be different, depending on the rate of the duplication of parent-specific chromosomes. Let u and 1 - u be the proportions of the duplication of allele A derived from the maternal and paternal parents, respectively. Similarly, let v and 1 - v be the proportions of the duplication of allele a derived from the maternal and paternal parents, respectively (Table 2). These proportions can be estimated from genotype data.
Estimation
It is straightforward to estimate the frequency of a triploid genotype from the genotype observations using
\hat{P}_{k} = n_{k}/n, k = 1, ..., 4,
which is derived from a multinomial likelihood. The EM algorithm is implemented to estimate the allele frequencies (p and q) and HWD coefficients from the triploid genotype observations of the aneuploid population sampled (Table 1). It is described as follows:
In the E step, we calculate the proportion of an allele within a triploid genotype using
for allele A, and
for allele a.
In the M step, the allele frequencies are then estimated with the following equations:
and the HWD coefficients are calculated by
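As an illustration of this estimation step, the following Python sketch computes the genotype frequencies, allele frequencies, and HWD coefficients from the four genotype counts. It uses direct allele counting rather than the EM iteration itself, and it assumes that D_{1} and D_{2} are the deviations of the AAA and aaa frequencies from p^{3} and q^{3}, consistent with the constraints used in the Hypothesis Tests section. The function name and the example counts are hypothetical.

```python
def estimate_population_parameters(counts):
    """Estimate genotype frequencies, allele frequencies and HWD coefficients.

    counts: observed numbers (n1, n2, n3, n4) of AAA, AAa, Aaa and aaa.
    """
    n1, n2, n3, n4 = counts
    n = n1 + n2 + n3 + n4
    # Multinomial estimates of the genotype frequencies P_k = n_k / n.
    freqs = [c / n for c in counts]
    # Allele A occupies 3, 2, 1 and 0 of the three copies in each genotype.
    p = (3 * n1 + 2 * n2 + n3) / (3 * n)
    q = 1.0 - p
    # Assumed definitions: deviations of the homozygous triploids from p^3 and q^3.
    d1 = freqs[0] - p ** 3
    d2 = freqs[3] - q ** 3
    return freqs, p, q, d1, d2


freqs, p, q, d1, d2 = estimate_population_parameters((296, 332, 248, 124))
print(round(p, 3), round(q, 3), round(d1, 3), round(d2, 3))
```

The hypothetical counts were chosen so that the estimates fall on round values; with real data the same function would be applied to the observed n_{k}.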
To estimate the duplication rate and genotypic values for each configuration, we need to formulate a mixture model because each triploid genotype contains two unknown configurations. The likelihood of genotype observations at the duplicated gene (Table 1) and phenotypic values (y) measured for all subjects is constructed as
where is the vector of unknown parameters, and exp (k = 1, ..., 4; j = 1, 2) is the normal distribution of the phenotypic trait with mean μ_{kj }and variance σ^{2}.
To obtain the maximum likelihood estimates (MLEs) of the parameters, we apply the EM algorithm to the likelihood (8). In the E step, the posterior probability with which a subject i with a specific triploid genotype has a configuration j is calculated using
In the M step, by solving the log-likelihood equations, the parameters are estimated with the calculated posterior probabilities, i.e.,
A loop of the E and M steps is iterated between equations (9) and (10), (11), (12) and (13). Thus, the parameter estimates are obtained when the estimates converge to stable values. The MLEs of genetic effects can be obtained by solving a system of linear equations given in Table 2, i.e.,
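The E and M steps for the configuration mixture can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: genotypes are coded 0 to 3 for AAA, AAa, Aaa, and aaa; the first configuration of AAA and AAa is assumed to have mixing proportion u and that of Aaa and aaa proportion v; all configurations share one residual variance; and the starting values are an arbitrary choice. The function names are hypothetical.

```python
import math


def normal_pdf(y, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_configurations(y, k, n_iter=200, tol=1e-8):
    """EM for a two-configuration mixture within each triploid genotype.

    y: phenotypes; k: genotype codes 0..3 (every code must occur at least once).
    Returns (u, v, mu, var, loglik), where mu[g][j] is the mean of configuration j.
    """
    n = len(y)
    groups = [[i for i in range(n) if k[i] == g] for g in range(4)]
    mean_y = sum(y) / n
    var = sum((yi - mean_y) ** 2 for yi in y) / n
    sd = math.sqrt(var)
    # Arbitrary start: place the two configuration means around each genotype mean.
    mu = []
    for idx in groups:
        centre = sum(y[i] for i in idx) / len(idx)
        mu.append([centre - 0.5 * sd, centre + 0.5 * sd])
    u = v = 0.5
    loglik = []
    for _ in range(n_iter):
        # E step: posterior probability that subject i carries configuration 1.
        post = [0.0] * n
        ll = 0.0
        for i in range(n):
            w = u if k[i] <= 1 else v  # assumption: u for AAA/AAa, v for Aaa/aaa
            f1 = w * normal_pdf(y[i], mu[k[i]][0], var)
            f2 = (1.0 - w) * normal_pdf(y[i], mu[k[i]][1], var)
            post[i] = f1 / (f1 + f2)
            ll += math.log(f1 + f2)
        loglik.append(ll)
        if len(loglik) > 1 and loglik[-1] - loglik[-2] < tol:
            break
        # M step: closed-form updates weighted by the posterior probabilities.
        idx_A = groups[0] + groups[1]
        idx_a = groups[2] + groups[3]
        u = sum(post[i] for i in idx_A) / len(idx_A)
        v = sum(post[i] for i in idx_a) / len(idx_a)
        for g, idx in enumerate(groups):
            s1 = sum(post[i] for i in idx)
            mu[g][0] = sum(post[i] * y[i] for i in idx) / s1
            mu[g][1] = sum((1.0 - post[i]) * y[i] for i in idx) / (len(idx) - s1)
        var = sum(
            post[i] * (y[i] - mu[k[i]][0]) ** 2
            + (1.0 - post[i]) * (y[i] - mu[k[i]][1]) ** 2
            for i in range(n)
        ) / n
    return u, v, mu, var, loglik
```

The returned log-likelihood trace omits the genotype-frequency term, which does not depend on these parameters. Because the two configurations of a genotype are distinguished only by their means, the labels of configurations 1 and 2 depend on the starting values, so the estimates of u and v should be read together with the estimated means.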
Hypothesis Tests
Whether a duplicated gene deviates from Hardy-Weinberg equilibrium can be tested by formulating the null hypothesis as follows:
H_{0}: D_{1} = D_{2} = 0,
under which genotype frequencies can be estimated from the estimated allele frequencies using equation (1). The log-likelihood ratio calculated under the null and alternative hypotheses asymptotically follows a χ^{2} distribution with 2 degrees of freedom. It is also of interest to test the two disequilibria separately. Under the null hypothesis H_{0}: D_{1} = 0, genotype frequencies are estimated using equation (1), but with a constraint P_{1} = p^{3}, in addition to constraint P_{1} + P_{2} + P_{3} + P_{4} = 1. Similarly, genotype frequencies are estimated with a constraint P_{4} = q^{3} for testing whether D_{2} = 0.
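The joint test of Hardy-Weinberg equilibrium described above can be sketched in Python. The sketch assumes that, under the null hypothesis, the four triploid genotype frequencies are the binomial terms p^{3}, 3p^{2}q, 3pq^{2}, and q^{3}, with p estimated by allele counting; this agrees with the constraints on the AAA and aaa frequencies but is otherwise an assumption. For 2 degrees of freedom the χ^{2} tail probability has the closed form exp(-x/2), so no statistics library is needed. The function name and the counts are hypothetical.

```python
import math


def hwe_likelihood_ratio_test(counts):
    """Likelihood ratio test of D1 = D2 = 0 from counts of AAA, AAa, Aaa, aaa."""
    n1, n2, n3, n4 = counts
    n = n1 + n2 + n3 + n4
    p = (3 * n1 + 2 * n2 + n3) / (3 * n)  # allele-counting estimate of p
    q = 1.0 - p
    null_freqs = [p ** 3, 3 * p * p * q, 3 * p * q * q, q ** 3]
    alt_freqs = [c / n for c in counts]
    # Multinomial log-likelihoods; empty genotype classes contribute nothing.
    ll_null = sum(c * math.log(f) for c, f in zip(counts, null_freqs) if c > 0)
    ll_alt = sum(c * math.log(f) for c, f in zip(counts, alt_freqs) if c > 0)
    lr = max(2.0 * (ll_alt - ll_null), 0.0)
    # Chi-square survival function with 2 degrees of freedom.
    p_value = math.exp(-lr / 2.0)
    return lr, p_value


print(hwe_likelihood_ratio_test((100, 300, 300, 100)))  # counts in exact equilibrium
print(hwe_likelihood_ratio_test((300, 100, 100, 300)))  # strong excess of homozygotes
```

The separate tests of D_{1} = 0 and D_{2} = 0 would each use one constraint and 1 degree of freedom, and are not covered by this sketch.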
Whether the duplicated gene is significantly associated with cancer susceptibility can be tested using the null hypothesis μ_{kj} ≡ μ for k = 1, ..., 4; j = 1, 2. The additive effect and two types of dominance effects can be tested jointly or separately by formulating the relevant null hypotheses based on equations (14), (15), and (16). The imprinting effect and its interactions with additive and dominance effects can be tested by using the null hypotheses H_{0}: λ = 0, H_{0}: I_{aλ} = 0, H_{0}: I_{dλ} = 0, and H_{0}: I_{d'λ} = 0 constructed with equations (??), (18), (18), and (19), respectively.
The model can also be used to test the significance of the duplication rate for a parent-specific chromosome by formulating the null hypothesis H_{0}: u = 1 or H_{0}: v = 1. This information helps to understand the genetic structure and evolutionary process of cancer risk.
Results
Simulation studies were used to investigate the statistical properties of the model in terms of estimation precision, power and false positive rates. We simulated a cancer population of triploids for a portion of a chromosome. The allele frequencies at a triploid locus are p = 0.6 and q = 0.4. The HWD coefficients at this locus are assumed to be D_{1} = 0.08 and D_{2} = 0.06. By assuming duplication rates of 0.3 and 0.4 for the two parental chromosomes, respectively, the distribution of the four different triploid genotypes AAA, AAa, Aaa, and aaa can be simulated. The phenotypic values of cancer traits were simulated by summing the additive, dominance, imprinting, and their interaction effects, given particular values, and the errors of measurement within each triploid genotype, which follow a normal distribution with variance scaled by a heritability of 0.05, 0.10, or 0.20. Different sample sizes, 400, 800, and 2,000, are considered.
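A simulation of this kind can be sketched in Python as follows. Several elements are assumptions of the sketch rather than details given in the text: the frequencies of AAa and Aaa are derived from the requirements that the four frequencies sum to one and that the frequency of allele A equals p; the duplication rates 0.3 and 0.4 are taken as u and v; the residual variance is set so that the genetic variance divided by the total variance equals the heritability; and the genotypic values in `mu` are hypothetical placeholders.

```python
import random


def genotype_frequencies(p, d1, d2):
    """Frequencies of AAA, AAa, Aaa, aaa given p and the HWD coefficients (assumed form)."""
    q = 1.0 - p
    return [
        p ** 3 + d1,
        3 * p * p * q - 2 * d1 + d2,
        3 * p * q * q + d1 - 2 * d2,
        q ** 3 + d2,
    ]


def simulate_triploids(n, p, d1, d2, u, v, mu, heritability, seed=0):
    """Draw (genotype code, phenotype) pairs for n triploid patients."""
    rng = random.Random(seed)
    freqs = genotype_frequencies(p, d1, d2)
    # Assumed configuration proportions: u for AAA and AAa, v for Aaa and aaa.
    weights = [(u, 1 - u), (u, 1 - u), (v, 1 - v), (v, 1 - v)]
    cells = [(freqs[g] * weights[g][j], mu[g][j]) for g in range(4) for j in range(2)]
    mean = sum(w * m for w, m in cells)
    var_g = sum(w * (m - mean) ** 2 for w, m in cells)
    # Heritability = var_g / (var_g + var_e), solved for the residual variance.
    var_e = var_g * (1.0 - heritability) / heritability
    data = []
    for _ in range(n):
        g = rng.choices(range(4), weights=freqs)[0]
        j = 0 if rng.random() < weights[g][0] else 1
        data.append((g, rng.gauss(mu[g][j], var_e ** 0.5)))
    return data, var_e


# Hypothetical genotypic values for the two configurations of each genotype.
mu = [[12.0, 11.0], [10.5, 9.5], [8.5, 8.0], [6.0, 5.5]]
data, var_e = simulate_triploids(400, 0.6, 0.08, 0.06, 0.3, 0.4, mu, heritability=0.2)
print(genotype_frequencies(0.6, 0.08, 0.06), round(var_e, 3))
```

Repeating the call over sample sizes 400, 800, and 2,000 and heritabilities 0.05, 0.10, and 0.20 would reproduce the layout of the simulation design, though not the specific effect values used in the study.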
The model was used to estimate allele frequencies, HWD, parent-specific duplication rates, and genetic effects for a cancer population (Table 3). As shown by the small sampling errors of the estimates, allele frequencies can be precisely estimated with a modest sample size (400). A larger sample size (say 800) is needed to provide precise estimation of the two HWD coefficients D_{1} and D_{2}. Because duplication rates determine the mixture proportions for each triploid genotype, their estimates will be affected by the heritability level. If the cancer trait has a larger heritability, then a sample size of 400 will provide good estimates of duplication rates. For a less heritable trait, a large sample size (2000 or even more) is needed for good estimation. The additive effect can generally be well estimated, but the estimates of the dominant effects need a much larger sample size. The estimation precision of the imprinting effect seems to be intermediate between that of the additive and dominant effects. It is interesting to see that the additive × imprinting interaction effect can be better estimated than the imprinting effect alone. It is hard to estimate the interactions between the dominant and imprinting effects unless an extremely large sample size (> 2000) is used. Overall, the impact of heritability is larger than that of sample size, suggesting that it is important to allocate limited resources to measuring phenotypes precisely rather than simply increasing the sample size.
Table 3. The estimates of population genetic parameters (p, u, v) and quantitative genetic parameters (a, d, d', λ, I_{aλ}, I_{dλ}, I_{d'λ}) from simulated data with different sample size and heritability combinations.
The power to detect the overall genetic effect and imprinting effect was investigated. In general, the model has great power for the identification of aneuploid loci causing cancer. To achieve adequate power for imprinting effect detection, a large sample size and/or large heritability is required. Overall, a sample size of 400 with a heritability of 0.2 can reach power of over 0.75 for the detection of imprinting effects. We also performed simulation studies to examine the false positive rates for detecting overall genetic effects and imprinting effects at aneuploid loci. It appears that in each case the false positive rates can be controlled to be below 5-10%.
Discussion
Over the past 100 years since Theodor Boveri hypothesized that mitotic defects that result in tetraploidy promote oncogenesis [17], tremendous effort has been devoted to exploring the genetic cause of tumorigenesis. It has been partly established that aneuploidy has an effect on proliferation and survival of tumors [5]. The recent discovery of components of the mitotic checkpoint, as well as the realization that many of the classic tumour suppressors and oncogene products regulate mitotic progression, has renewed interest in the role of aneuploidy in tumorigenesis [10,15,16]. With the completion of the human genome project and the HapMap project, there is a pressing need for the development of statistical models for estimating the genetic effect of aneuploid loci on cancer risk.
In this article, we present a statistical strategy for detecting the genetic control of cancer traits through genotyping aneuploids of cancer cells. The model proposed presents two novelties. First, it integrates for the first time the latest discoveries of cancer genetic studies with statistical principles and pushes the modeling effort of cancer gene identification to the frontier of cancer biology. The experimental design used is founded on biologically relevant hypotheses from which data can be collected in an effective way. The closed forms derived for the EM algorithm to estimate various parameters will provide efficient computation for any data set. Second, the model capitalizes on traditional quantitative genetic theory, allowing the partition of overall genetic control into different components. In particular, we are able to estimate and test the effect of genetic imprinting on cancer risk [18,19] and, thus, draw a detailed picture of genetic control triggered from different parental chromosomes. The model can also characterize the interactions of additive and dominant effects with imprinting effects, helping to gain a better insight into the complexity of the genetic architecture of cancer.
We performed computer simulation to examine the statistical properties of the model. From the simulation results, an appropriate sample size can be determined for a cancer trait with a particular heritability. Analyses of model power and false positive rates support the potential usefulness of the model when practical data sets are available. Through a simple mathematical proof, we found that the Hardy-Weinberg equilibrium of an original population can be destroyed when some chromosomes are duplicated.
The idea of the model can be extended to several more complicated situations. First, the aneuploidy control of cancer may be derived from high-order aneuploids, such as tetraploids. A high-order polyploid not only contains more allelic combinations, but also involves a larger amount of missing data due to the duplication of different chromosomes with unknown parental origins. To model the tetraploidy control of cancer, a more sophisticated algorithm is required to obtain efficient estimates of parameters. Second, different aneuploid loci responsible for cancer traits may be associated in the duplicated population and interact in a coordinated manner. Modeling of multilocus associations and multilocus epistasis deserves further investigation because these pieces of information can explain the genetic variation of cancer better than single loci. Third, other factors, such as sex, race, and lifestyle, also contribute to cancer. It is crucial to incorporate these factors and study the effects of each of them and their interactions with genes in tumorigenesis.
Conclusion
We have derived a new statistical model for identifying genetic loci that control quantitative phenotypes of aneuploidy-related cancer through a genome-wide association study. We integrate quantitative genetic principles into the model, allowing the estimation of different types of genetic effects. The new model can generate a series of hypothesis tests about the genetic control mechanisms of cancer through aneuploid loci. Although our model was explored merely from a theoretical perspective, specific experiments could readily be launched to collect data according to the genetic design suggested. By analyzing such data, the new model should be able to uncover unique results, facilitating our understanding of how aneuploid processes are linked with cancer through genetic mediation.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
YL derived the equations and performed simulation studies. AB led and performed simulation studies. LW participated in simulation studies. ZW participated in simulation studies. GC interpreted the biological relevance of the model. RW conceived of the model and coordinated its design and simulation test. All authors read and approved the final manuscript.
Figure 1. Diagram for chromosome duplication and the resulting changes of genotypes at an aneuploid gene A. Alleles in different colors denote their parent-specific origins separated by the vertical lines.
Acknowledgements
We thank Dr. Justo Lorenzo Bermejo, Dr. George Heinze, Dr. Marek Kimmel, and Dr. Elizabeth Petty for their constructive comments, which helped to improve the manuscript. This work is supported by joint grant DMS/NIGMS-0540745 and a Penn State Cancer Institute Seed Grant.
References

1. Clarke R, Ressom HW, Wang A, Xuan J, Liu MC, Gehan EA, Wang Y: The properties of high-dimensional data spaces: implications for exploring gene and protein expression data.
Nature Reviews Cancer 2008, 8:184-194.

2. Stephens M, Balding DJ: Bayesian statistical methods for genetic association studies.
Nature Reviews Genetics 2009, 10:681-690.

3. Cookson W, Liang L, Abecasis G, Moffatt M, Lathrop M: Mapping complex disease traits with global gene expression.
Nature Reviews Genetics 2009, 10:184-194.

4. Mackay TFC, Stone EA, Ayroles JF: The genetics of quantitative traits: challenges and prospects.
Nature Reviews Genetics 2009, 10:565-577.

5. Williams BR, Amon A: Aneuploidy: Cancer's fatal flaw?
Cancer Research 2009, 69(13):5289-5291.

6. Hede K: Which came first? Studies clarify role of aneuploidy in cancer.
Journal of the National Cancer Institute 2005, 97(2):87-89.

7. Ganem N, Storchova Z, Pellman D: Tetraploidy, aneuploidy and cancer.
Current Opinion in Genetics and Development 2007, 17(2):157-162.

8. Lengauer C, Kinzler KW, Vogelstein B: Genetic instabilities in human cancers.
Nature 1998, 396(6712):643-649.

9. Duesberg P, Rasnick D, Li R, Winters L, Rausch C, Hehlmann R: How aneuploidy may cause cancer and genetic instability.
Anticancer Research 1999, 19(6A):4887-4906.

10. Hanks S, Rahman N: Aneuploidy-cancer predisposition syndromes: A new link between the mitotic spindle checkpoint and cancer.
Cell Cycle 2005, 4(2):225-227.

11. Stock RP, Bialy H: The sigmoidal curve of cancer.
Nature Biotechnology 2003, 21:13-14.

12. Weaver BA, Cleveland DW: Does aneuploidy cause cancer?
Current Opinion in Cell Biology 2006, 18(6):658-667.

13. Duesberg P, Li R, Fabarius A, Hehlmann R: The chromosomal basis of cancer.
Cellular Oncology 2005, 27(5):293-318.

14. Duesberg P: Chromosomal chaos and cancer.
Scientific American 2007, 296(5):52-59.

15. Kops GJ, Weaver BA, Cleveland DW: On the road to cancer: aneuploidy and the mitotic checkpoint.
Nature Reviews Cancer 2005, 5(10):773-785.

16. Suijkerbuijk SJ, Kops GJ: Preventing aneuploidy: The contribution of mitotic checkpoint proteins.
BBA - Reviews on Cancer 2008, 1786:24-31.

17. Maderspacher F: Theodor Boveri and the natural experiment.
Current Biology 2008, 18(7):279-286.

18. Pulford DJ, Falls JG, Killian JK, Jirtle RL: Polymorphisms, genomic imprinting and cancer susceptibility.
Mutation Research 1999, 436:59-67.

19. Jirtle RL: Genomic imprinting and cancer.
Experimental Cell Research 1999, 248:18-24.