Marked variation in predicted and observed variability of tandem repeat loci across the human genome
1 Molecular and Cellular Therapeutics, Royal College of Surgeons in Ireland, Dublin 2, Ireland
2 UCD Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Belfield, Dublin 4, Ireland
BMC Genomics 2008, 9:175 doi:10.1186/1471-2164-9-175
Published: 16 April 2008
Tandem repeat (TR) variants in the human genome play key roles in a number of diseases. However, current models predicting variability are based on limited training sets. We conducted a systematic analysis of TRs of unit lengths 2–12 nucleotides in Whole Genome Shotgun (WGS) sequences to define the extent of variation of 209,214 unique repeat loci throughout the genome.
We applied a multivariate statistical model to predict TR variability. Predicted heterozygosity correlated with heterozygosity in the CEPH polymorphism database (ρ = 0.29, p < 0.0005) more strongly than the WGS data did (ρ = 0.17), presumably because the model smooths noise arising from small sample sizes. A multivariate logistic model of 8 parameters accounted for 36% of the variation in the WGS data. Validation studies of 70 experimentally investigated TRs revealed high concordance with the model's predictions (p < 0.0001).
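To make the two quantities compared above concrete, the following is a minimal sketch of how per-locus heterozygosity and a rank correlation between two heterozygosity estimates might be computed. It rests on assumptions not stated in the abstract: heterozygosity is taken as the standard expected heterozygosity H = 1 − Σ pᵢ², and ρ is treated as a Spearman rank correlation. The function names and the toy allele lengths are hypothetical and are not drawn from the study's data.

```python
from collections import Counter


def heterozygosity(allele_lengths):
    """Expected heterozygosity H = 1 - sum(p_i^2) for one locus.

    `allele_lengths` is a list of observed repeat counts, one per sampled
    chromosome (an assumed input format, for illustration only).
    """
    n = len(allele_lengths)
    if n == 0:
        raise ValueError("need at least one observed allele")
    counts = Counter(allele_lengths)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Toy example: three hypothetical loci, each with four sampled chromosomes.
observed = [heterozygosity(a) for a in ([10, 10, 10, 10], [10, 11, 10, 11], [9, 10, 11, 12])]
predicted = [0.05, 0.40, 0.70]  # hypothetical model outputs
rho = spearman_rho(predicted, observed)
```

With only a handful of chromosomes sampled per locus, the observed heterozygosity can take just a few discrete values, which is one way to see why a smoothed model prediction could track an independent database more closely than the raw counts do.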
Variability among 2–12-mer TRs in the genome can be modeled by a few parameters, which do not markedly differ according to unit length, consistent with a common mechanism for the generation of variability among such TRs. Analysis of the distributions of observed and predicted variants across the genome showed a general concordance, indicating that the repeat variation dataset does not exhibit strong regional ascertainment biases. The same analysis revealed a deficit of variant repeats on chromosomes 19 and Y, likely reflecting a reduction in 2-mer repeats in the former and a reduced level of recombination in the latter, and excesses on chromosomes 6, 13, 20 and 21.