Genome-wide maps of linkage disequilibrium (LD) and haplotypes have been created for different populations. Substantial sharing of the boundaries and haplotypes among populations was observed, but haplotype variations have also been reported across populations. Conflicting observations on the extent and distribution of haplotypes require careful examination. The mechanisms that shape haplotypes have not been fully explored, although the effect of sample size has been implicated. We present a close examination of the effect of sample size on haplotype blocks using an original computational simulation.
A region spanning 19.31 Mb on chromosome 20q was genotyped for 1,147 SNPs in 725 Japanese subjects. One region of 445 kb exhibiting a single strong LD value (average |D'|; 0.94) was selected for the analysis of sample size effect on haplotype structure. Three different block definitions (recombination-based, LD-based, and diversity-based) were exploited to create simulations for block identification with θ value from real genotyping data. As a result, it was quite difficult to estimate a haplotype block for data with less than 200 samples. Attainment of a reliable haplotype structure with 50 samples was not possible, although the simulation was repeated 10,000 times.
These analyses underscored the difficulties of estimating haplotype blocks. To acquire a reliable result, it would be necessary to increase sample size more than 725 and to repeat the simulation 3,000 times. Even in one genomic region showing a high LD value, the haplotype block might be fragile. We emphasize the importance of applying careful confidence measures when using the estimated haplotype structure in biomedical research.
There is a great interest in using genetic association studies to identify the disease-susceptibility variants related to the common complex diseases. To design these studies appropriately, it is important to understand the feature of linkage disequilibrium (LD) in candidate genes or genomic regions of interest . Several studies have shown that the human genome contains regions of high LD value with low haplotype diversity by a small number of SNPs [2-4]. These regions are called haplotype block, each of which reflects the descent from a single ancient ancestral chromosome. The construction of a haplotype block is one way to reduce the complexity of the problem of association mapping of the common complex diseases.
Haplotype blocks are defined computationally by various algorithms. In general, they are classified into the following categories: recombination-based, LD-based, and diversity-based methods. These block definitions are consistent with a block-covered sequence, which is considered a block as a part of the genomic sequence. However, the haplotype block border is not usually stable, and blocks can fall into sub-blocks within the border .
The general properties of haplotype block construction in the human genome are not well understood. Thus, the International Haplotype Map (HapMap) project has genotyped a huge number of SNPs in samples from subjects of Caucasian, African, and Asian descent to better understand the human haplotype structure . Several studies (including HapMap) have shown differences in LD and haplotype block patterns in populations and chromosomes. In addition, it was revealed that SNPs ascertainment, selection and spacing could explain the observed block length [5,7], and that SNPs density has a crucial influence on the length of method-defined blocks . Despite the extensive empirical studies on haplotype blocks, there is no definitive answer as to how sample size impacts the assessment of block structure. For example, a study on chromosome 21 examined 20 independent subjects from diverse populations . Even for a relatively large data set, it contained only 275 individual samples, leading to 400 independent chromosomes . Thus, it was possible that the detected block structure was dependent on the small number of samples, and this seemed a preliminary finding. It remains unclear how many individuals are needed to acquire reliable features of haplotype block [5,9].
In this study, we developed a simulation with random re-sampling from real genotyping data in 725 Japanese and introduced an original measurement of θ value for the identification of haplotype block defined by three algorithms. With the original measurement, we focused on haplotype block structure, especially within a high LD region. We further assessed the robustness of haplotype blocks estimated under the different sample size conditions.
Selection of the analyzed region
One region was selected on chromosome 20q11.22, in which a single strong |D'| block was observed without substantial recombination in the current population (Fig. 1A). However, the region was broken into small blocks by r2. The average values of |D'| and r2 were 0.94 and 0.59, respectively. Based on NCBI human build 35, the total length was about 445 kb from 32,311,428 to 32,756,554 bp. This region was composed of 37 SNPs, and the average distance between SNPs was 12.4 kb. SNPs with MAF greater than 0.1 (average MAF 0.36) were used for subsequent study. Detailed information on selected SNPs is shown in Table 1.
Figure 1. (A) Map of linkage disequilibrium (LD) coefficient |D'| (lower part) and r2 (upper part) within the analyzed region on chromosome 20q11.22 (445 kb). A single strong |D'| block (average 0.94) is observed, but broken into small blocks with r2. (B) Structure of reference haplotype blocks with three different algorithms (Kamatani's method, Gabriel's method, and four-gamete test) using 725 samples. Each haplotype block is shown as blue (Kamatani's), green (Gabriel's), or red (four-gamete test) horizontal bars. Two gaps of (a) and (b) were observed in the analyzed region ((a) in the four-gamete test, and (b) in Gabriel's method). (C) θ profiles with the maximum number of samples (x = 725), and 3,000 repeats (N = 3,000). θ profiles show values between 0 and 1 based on its definition. The scale of θ profiles is shown on the vertical axis. The physical distance on chromosome 20q is shown on the horizontal axis. SNPs positions are presented as vertical gray bar through (A) to (C). Based on the complete identification of θ profiles, two SNPs intervals of (I) and (II) were selected for further analysis. We also selected four SNPs intervals from (i) to (iv), showing the variations of θ profiles, for further analysis. (A), (B), and (C) are illustrated at the same physical scale.
Table 1. List and information of 37 SNPs in the analyzed region on chromosome 20q11.22
Use of a novel measurement: θ value
An original measurement (θ value) was used for block identification. The θ value represents the probability of whether a SNPs interval resides within or outside a haplotype block. The word "interval" as used here refers to the region between two adjacent SNPs (See Methods). Using the θ value with more than two block definitions allows an estimation of the suitable structure of the haplotype block in the region of interest.
In the analyzed region, different reference haplotype blocks were identified with three separate algorithms, despite a single strong |D'| value (Fig. 1B). To evaluate the discrepancy in block identification, θ profiles were calculated for all pairwise SNPs with a flow chart in Fig. 2.
Figure 2. A flow chart of computational simulation. This flow chart outlines the basic algorithms including re-sampling, haplotype block definition, and block concordance (See Methods). x is the sample size in the re-sampling process. This parameter is increased from 50 to 700 in increments of 50, in addition to 10, 25, and 725. n is the number of repeat times of simulation under the maximum repeat time N. The θ value is an original measurement of the estimation of haplotype block, and is the ratio of times to N times when simulating the transition zone of block (See Materials and Methods). HWE denotes Hardy-Weinberg equilibrium, and MAF denotes minor allele frequencies.
The influence of the mixture of cases with control samples was examined first. θ profiles calculated with only controls (x = 358 and N = 3,000) were compared to the mixed data of controls and cases (x = 358 and N = 3,000). The mean square errors (MSE) between the two data sets was calculated, where MSE was the sum of the variance and squared bias of the estimates . The observed MSE values were under 0.005 (0.5%) ranged from 0.00384 to 0.00305 (Additional file 1). Similarly, when θ profiles were compared with mixed data from 725 samples (N = 3,000), the MSE values were under 0.005, leading to a reasonable accuracy for the combination of the two groups (Additional file 1). This calculation suggested that the profound bias of the mixture of cases was rather low. In addition, there were no significant differences (Chi-square P < 0.01) in allele frequencies with all 37 SNPs in a case-control association study and all SNPs satisfied with HWE test (P < 0.05). As a result, the genotyping data from 367 cases and 358 controls were merged for a total of 725 samples of Japanese genotyping data.
With the maximum sample size of 725 subjects and 3,000 repeat times, θ profiles were plotted against the physical position of the SNPs (Fig. 1C). The complete concordance of haplotype blocks was observed in some intervals (e.g., (I) and (II)), but not in others (e.g., (i), (ii), (iii), and (iv)). It may be difficult to clearly identify the haplotype block even for the θ value. For subsequent analyses, two intervals, (I) and (II), were selected as the complete concordance regions of the block, and four intervals from (i) to (iv) were defined as incomplete concordance regions.
Reference haplotype blocks and θ value
Although there was a difference among reference haplotype blocks across the analyzed region (Fig. 1B), θ profiles generated by the three algorithms were comparatively similar (Fig. 1C). We compared θ profiles with reference haplotype blocks defined by 725 samples.
In the interval (a), a reference haplotype block was disrupted by the four-gamete test, but it was continuous with the other two methods (Fig. 1B). This finding, supported by the intermediate r2 value, identified interval (a) as the fragile region of haplotype block. In contrast, θ profiles showed low values (less than 0.2) with all three algorithms in the corresponding interval (interval (ii) in Fig. 1C). Namely, interval (a) was identified as the transition zone in 20% of 3,000 simulations of the θ values. Interpretation of the θ profiles reveals that interval (a) might be included in the transition zone of the block, and that LD might be disrupted against the estimation of Gabriel's and Kamatani's methods.
All reference haplotype blocks were continuous in the interval (i) (Fig. 1B). However, four-gamete test showed a θ value of 0.30 after 3,000 simulations and a divided interval (Fig. 1C). In addition, the θ value was not zero with the other two algorithms. Similar discrepancies were observed between the θ value and reference blocks defined by a single algorithm with two other intervals (iii) and (iv). It was quite difficult to clearly identify the block structure in these SNPs intervals. These results suggest that haplotype blocks might be fragile with one block algorithm, even if the region showed a single strong LD value defined by |D'|.
Data simulation by θ value
To evaluate the effect of sample size on the identification of haplotype blocks, θ profiles were calculated with a variable number of sample sizes (x). Fig. 3 left side panels show the results obtained with intervals (I) and (II). Within these two intervals, complete concordance was observed between the θ value and reference block definitions. The θ value could increase or decrease to the maximum or minimum value (1 or 0) dependent on the sample size, and it converged when using more than 200 samples. This result suggests that sample size could have an effect on the identification of haplotype block, and it was not possible to obtain a reliable block identification for data with less than 200 samples.
Figure 3. The effect of sample size on haplotype block characteristics under different block definitions: θ profiles versus sample number (x). The number of repeat times (n) was adjusted to 3,000 (N = 3,000). The labels of analyzed SNPs intervals are identical to those in Figure 1. Sample size (x) is shown on the horizontal axis. θ profiles (ratio of times to N times when simulating the transition zone of haplotype block) are shown on the vertical axis. The three labels represent different haplotype block definitions: blue square, Kamatani's method; green triangle, Gabriel's method; red diamond, four-gamete test.
There was also a difficulty with haplotype block definitions within the other four intervals ((i), (ii), (iii), and (iv)) in spite of the estimation by the θ value. The θ values in these intervals were more strongly influenced by sample size than those in intervals (I) and (II). In particular, the θ value did not converge even for calculations with more than 600 samples. This might imply that the precise identification of haplotype block was difficult for data with 725 samples. However, the exact number of samples required to reach the plateau value for precise evaluation of haplotype block remains unclear.
θ profiles were also generated after changing the number of simulation times (n) up to the maximum repeat times of 10,000 (Fig. 4). The θ value converged and reached the plateau value in all algorithms when simulations were repeated 2,500 times. Thus, the θ value with 3,000 repeat times was reliable for block identification with the plateau value in the simulation. In the simulation of 50 samples (dashed lines), the θ value was not equal to the result in 725 samples (solid lines), even if the repeat times increased to the maximum (10,000). As a whole, these observations indicate that a sample size greater than 725 with a computational simulation of 3,000 times is required to obtain a converged θ value.
Figure 4. The effect of the number of simulation repeats on haplotype block characteristics under different block definitions: θ profiles versus the number of repeat times (n). The sample size (x) was adjusted by fixing it at 725 (solid lines) or 50 (dashed lines). The labels of analyzed SNPs intervals are identical to those in Figure 1. The number of repeat times (n) is shown on the horizontal axis. The simulation was repeated 10,000 times (N = 10,000) from 1, 100, 500, 1,000, 2,000, 3,000, 4,000, 5,000, and 7,500 times. θ profiles (the ratio of times to N times when simulating the transition zone of haplotype block) are shown on the vertical axis. The three labels represent different haplotype block definitions: blue square, Kamatani's method; green triangle, Gabriel's method; red diamond, four-gamete test.