A reciprocal recurrent selection program has been under way for the Coffea canephora coffee tree for approximately thirty years in the Ivory Coast. Association genetics would help to speed up this program by more rapidly selecting zones of interest in the genome. However, prior to any such studies, the linkage disequilibrium (LD) needs to be assessed between the markers on the genome. These data are essential for guiding association studies.
This article describes the first results of an LD assessment in a coffee tree species. Guinean and Congolese breeding populations of C. canephora have been used for this work, with the goal of identifying ways of using these populations in association genetics. We identified changes in the LD along the genome within the different C. canephora diversity groups. In the different diversity groups studied, the LD was variable. Some diversity groups displayed disequilibria over long distances (up to 25 cM), whereas others had disequilibria not exceeding 1 cM. We also discovered a fine structure within the Guinean group.
Given these results, association studies can be used within the species C. canephora. The coffee recurrent selection scheme being implemented in the Ivory Coast can thus be optimized. Lastly, our results could be used to improve C. arabica because one of its parents is closely related to C. canephora.
Keywords:Africa; Association studies; Coffea canephora; Genetic diversity; Linkage disequilibrium; Microsatellite
Plant breeding is making increasing use of molecular markers to speed up selection cycles and thereby more rapidly introgress interesting zones of the genome in varieties of agronomic interest. In addition to QTL analyses, which only focus on a single progeny, association genetics can be used to focus on a set of populations to identify zones of interest of the genome in breeding. However, for association studies to be successful it is necessary to know the degree of linkage disequilibrium (LD) beforehand, along with its extent and its distribution within the species studied. This disequilibrium, which measures the intensity of linkage between markers, is used to determine the molecular marker density of the genome, which is necessary for association studies to be effective [1-3].
Association studies are particularly relevant in perennial species, as they can be carried out on pre-existing populations, in collections or in selection trials. They do not require the creation of specific populations by controlled crossing, as is the case with conventional genetic mapping approaches . Association studies have proven effective in a large number of plant species, including maize  and grapevine . A prior study of the LD in a species guides the choice of one of the two association study possibilities: genome-wide scan or candidate gene approach.
No large-scale LD study has been undertaken yet on the coffee tree. Coffea canephora Pierre ex A. Froehner (2n = 2X = 22) is a strictly outcrossing diploid species with a genetic self-incompatibility system [7,8]. It provides 41.3% of the world coffee production . One of the most ambitious genetic improvement programs for this species has been conducted in the Ivory Coast and is based on a reciprocal recurrent selection scheme (RRS) that was launched in the 1990s [10,11]. This selection program uses the genetic diversity of the species by creating hybrids between the genotypes of two genetic groups: Congolese from central Africa and Guinean from western Africa. The combined use of the RRS and association genetics could help to more effectively guide crosses and speed up the introgression of specific alleles identified as being of interest through the early selection of genotypes derived from crosses.
However, natural populations of C. canephora are relatively small in size. The strictly outcrossing reproduction system of this species and the different levels of existing kinship create complex genetic structures at the population scale. This structure in populations is superimposed on the larger diversity group (DG) scale .
The expected results were as follows: i) knowledge of the LD dynamics at the genome level for a certain number of coffee diversity groups or populations, ii) enhanced knowledge of the genetic structure of the Guinean diversity group, and iii) identification of populations that can be used in association genetics.
DG G, Guinean diversity group: a mixture of different wild or cultivated genotypes collected in the Ivory Coast and Guinea. It includes the populations Mouniandougou, Ira 1, Ira 2, Fourougbankoro, Piné, and cultivated Guinean clones
DG GP, Pélézi diversity group: a natural Guinean population related to the Guinean diversity group, isolated from a forest of the Ivory Coast
DG C, Nana diversity group: a population collected in the Central African Republic (CAR)
DG SG2: genotypes cultivated in the Ivory Coast and Uganda or in collections in Brazil, most likely originating from the Congo basin
DG SG1: genotypes cultivated in Togo and Benin (Niaouli population) or from surveys in the Luki collection (Democratic Republic of Congo, Luki population), originating from the Atlantic seaboard, from Gabon to the Democratic Republic of Congo
Table 1. Characteristics and origins of theC.canephoradiversity groups
Choice of microsatellite markers and genotyping
All individuals were genotyped using 108 microsatellite markers mapped to eleven linkage groups (LG, called A to K) on a C. canephora linkage map that spanned a length of 1320 cM (Figure 1). The markers are described in Additional file 1: Table S1.
Format: XLSX Size: 30KB Download file
Figure 1. Location on the genetic map of the 108 microsatellite markers used for the linkage disequilibrium study. In black, markers from genome banks; in red, markers derived from EST or gene sequences; in green, markers developed on sequences of BAC-ends; and in purple, markers developed on sequences of BAC 111O18. Markers not present on the genetic map published by Leroy et al. (2011) are underlined.
All LGs were studied with a larger number of markers on LGs A, B, D, F, G and H. The average distance between markers was 13 cM, ranging from 0 cM for the closest marker pairs to 243.3 cM for the most distant marker pairs.
The genotyping was performed according to the protocol described in . Size controls were replicated on the different gels to ensure the uniformity of genotyping data. Data were imported into the Microsoft EXCEL® spreadsheet from SAGA GT® (LI-COR Biosciences, Lincoln, Nebraska, USA) and were formatted for the different data analysis software used.
Genetic structure validation of the sample
A model-based Bayesian analysis implemented in STRUCTURE [17,18] was performed to validate the DG structure of our sample. We ran a correlated-allele model with 10,000 iterations and 10,000 burn-ins; ten runs were made for each assumed K (putative number of populations), with K varying from one to ten. The ad-hoc statistics Δ(K) proposed by  were used to assess the number of populations. The Hardy-Weinberg Equilibrium (HWE) and summary statistics, including the number of alleles and the expected and observed heterozygosity for each locus, were computed for the entire sample and for each DG using ARLEQUIN software v. 18.104.22.168 . We also computed the AMOVA and F-statistics for diverse levels of population structure using ARLEQUIN.
Genetic structure of diversity group G
The Guinean diversity group is composed of a large number of natural populations from the forests of the Ivory Coast, including the Pélézi population (DG GP, see above), which exhibits some original characteristics. The wide variety of these populations, as well as the existence of Guinean genotypes taken from smallholder plantations in the living collections in the Ivory Coast, further justify an in-depth study of this diversity group.
Earlier diversity studies did not offer sufficient resolution to analyze the genetic structure of this group in a satisfactory manner (lack of markers). The genetic structure of this group was analyzed using DARwin software based on the calculation of genetic dissimilarities between individuals followed by a factorial analysis from a dissimilarity matrix (FADM) and the construction of a neighbor-joining tree (NJ) [21,22]. For comparison purposes, we ran a STRUCTURE analysis within the DG G using the same parameters and procedure as above.
Statistical data analysis and LD analysis
Reconstruction of haplotypes
The species C. canephora is a highly heterozygous species. Therefore, it is difficult to distinguish the allele phase of the double heterozygotes Aa/Bb, i.e., whether A is associated with B or with b at the haplotype level . However, the most common and powerful measurements of the LD (D, D’, r2) rely on an estimation of the haplotypic or gametic linkage disequilibrium , i.e., using the allele phase information at the gamete level. Consequently, to estimate these measurements, it is necessary to have access to the haplotypes. Therefore, we used PHASE software [25-27] to reconstruct the haplotypes. This software estimates the most probable haplotypes for each genotype based on an EM algorithm (Expectation-Maximization) that incorporates a coalescence hypothesis in a maximum likelihood model. The haplotypes are reconstructed following a certain number of strong hypotheses using parameters such as allele frequencies, the possibilities of recombination between markers and simulated allele pedigrees.
We used the entire set of 356 genotypes, as the algorithm functions better with genetically structured data . Nevertheless, the dataset was partitioned by the LG of the genetic map (11 matrices in all), as the markers situated on different LG could not be in phase. This approach has been shown to be effective, and the gain in power has been proven, as in the case of grapevine [28,29].
To enable the use of PHASE while maintaining the Stepwise Mutation hypothesis for the microsatellite markers, our data that were expressed in allele sizes were converted into repeat numbers using CREATE software .
Five algorithm repeats based on 1000 iterations, 100 thinning intervals and 1000 burn-ins were performed, and the repeat displaying the greatest maximum likelihood was used for the remaining analyses.
The resulting tables adopted for each LG were then merged and formatted for incorporation into PowerMarker to analyze the LD, declaring genotype data of the known phase as the data type.
The LD was analyzed for the five DGs, together and separately. We calculated the D’ and r2 values for the set of possible combinations of markers, two-by-two, using PowerMarker software . These measurements were initially developed for bi-allelic loci. Nevertheless, an estimation of these measurements for multi-allelic loci can be performed by establishing a weighted mean for the set of disequilibria between allele pairs [32,33].
Exact Fisher tests were carried out for all the possible combinations to test whether the haplotype frequencies between two loci were the product of the allele frequency corresponding to the two loci. Allele counting was organized in a contingency table, and permutations, following an algorithm using a Markov-Monte-Carlo chain, were used to calculate the unbiased p-values associated with the test . We corrected the significance limit of the p-values associated with the exact test using Bonferroni’s correction to more effectively overcome the effect of the very large number of tests performed. We graphically represented the r2 values as a function of distance for each of the groups studied.
Genetic structure of the entire sample
The ad-hoc statistics Δ(K) indicate an uppermost level of structure in five populations. The Δ(K) and bar-plot from the STRUCTURE run showing the highest lnP(D) for K=5 are presented in Figure 2. The expected and observed heterozygosity and Hardy-Weinberg exact test associated p-values for each locus are provided in Additional file 2: Table S2. At the level of the entire population, all HWE departure tests are significant, indicating a high level of structure. Even at the population level (GP), we still detect a significant departure from HWE for a high proportion (56%) of the polymorphic loci. The AMOVA and F-statistics also indicate a high structure in our sample, with more than 38% of the variance due to the DG and with all the Fst between DG significant at a 5% level.
Additional file 2: Table S2. Number of alleles, expected heterozigosity(He), observed heterozigosity and Hardy-Weinberg equilibrium exact test associated p-values (P-value )and standard deviation (s.d.) for each locus and for the whole population and the Diversity Groups SG2, SG1, C, G and GP. NA stands for not available due to monomorphic locus.
Format: XLSX Size: 35KB Download file
Figure 2. (a) Ad-hoc statistics Δ(K) based on STRUCTURE lnP(D) summarized over 10 reps for each K (assumed number of populations) exhibiting a signal at K=5 populations and (b) bar-plot of the STRUCTURE run exhibiting the highest lnP(D) for K=5. STRUCTURE analysis was performed over the 356 genotypes.
Genetic structure of diversity group G
The ad-hoc statistics Δ(K) based on the STRUCTURE analysis gave an uppermost level of structure within DG G in three populations. The Δ(K) and bar-plot from the STRUCTURE run showing the highest lnP(D) for K=3 are presented in Figure 3. The factorial analysis based on dissimilarity index and the NJ tree are shown in Figure 4. Both analyses revealed a marginal structure in three clusters. On the tree, these three clusters corresponded to the Mouniandougou population plus a few individuals from the Piné and cultivated Guinean populations (Gsub1); to the majority of the Ira2 population (Gsub2); and to the Fourougbankoro and Ira1 populations plus a few cultivated Guinean and Piné individuals (Gsub3). For the rest of our work, we considered DG G to be a composite group of different populations, and we also tried to analyze separately the three identified subgroups. The expected and observed heterozygosity and Hardy-Weinberg exact test associated p-values for each locus are provided in Additional file 3: Table S3. The AMOVA analysis and F-statistics of the entire sample, including Gsub1, Gsub2 and Gsub3, are presented in Additional file 4: Table S4.
Format: XLSX Size: 30KB Download file
Format: XLSX Size: 11KB Download file
Figure 3. (a) Ad-hoc statistics Δ(K) based on STRUCTURE lnP(D) summarized over 10 reps for each K (assumed number of populations) exhibiting a signal at K=3 populations and (b) bar-plot of the STRUCTURE run exhibiting the highest lnP(D) for K=3. STRUCTURE analysis was performed over the DG G only.
Figure 4. (a) Genetic structure of diversity group G: first two axes of the associated factorial analysis from a dissimilarity matrix and (b) corresponding NJ tree. The percentages of total variability are given for each axis. Some coherent subgroups have been identified. For the NJ tree, bootstrap values obtained from 5000 iterations of bootstrap procedure are indicated. For clarity, only values greater than 60 are printed. Unit colors are based on STRUCTURE assignment for each genotype. Probabilities of ancestry greater than 0.6 were considered.
LD analysis at the pan-genomic level
LD estimates and their within- and between-linkage group significance by exact Fisher tests and Bonferroni’s correction
Table 2 presents a summary of the exact tests carried out on the different clusters considered. These results demonstrate the importance of the significant associations generated by the genetic structure. In fact, for the set of genotypes, it can be seen that 98% of the marker pairs displayed a significant disequilibrium, regardless of whether they were linked. In addition, DG C and DG GP and the three subgroups of DG G were the only ones to show a within-group: between-group ratio over 1. When trying to consider a finer structure, a certain number of significant between-group associations were no longer detected. For DG G, we found a considerable correction of genomic LD for the set of subgroups (Gsub1, Gsub2 and Gsub3) in relation to DG composite G.
Table 2. Analyses of associations between markers per exact Fisher test for the different genetic groups considered
Decrease in LD with genetic distance for the set of genotypes
Figure 5 shows the r2 and D’ values (after Bonferroni’s correction) as a function of distance for the set of genotypes. In general, the D’ values were much higher than the r2 values. In addition, it seems that these values were much more stochastic with large proportions of very high LD that corresponded to non-significant associations. When the different DGs were considered (data not shown), D’ seemed to be very sensitive to the structure effects and was also sensitive to the variations in allele frequencies between the different markers. Consequently, although the values of the two measurements appeared to decrease with distance, it seems that D’ was less sensitive than r2 to that decrease, with values remaining high at very long distances.
Figure 5. Decrease in LD (measured by D’and r2)as a function of genetic distance in centimorgans for the set of genotypes. Only significant values after Bonferroni’s correction are shown. The logarithmic regression curves were calculated for all the data.
As these two parameters have different properties, the information provided is not redundant. Indeed, D’ measures only the recombination history, whereas r2 measures both the recombination and mutation . These different properties signify that D’ is globally higher than r2 and can potentially reveal more associations. The calculated values of D’ are lower when population sizes are small, which is why this parameter is preferred when studying the evolutionary history of large populations. However, with association studies, the most indicative value of potential power is the r2 measurement, as it provides an indication of the way the markers and phenotypic traits being studied will be correlated. We cite the D’ for a general comparison on the set of individuals before discussing in detail some differences in r2 between the different populations.
Decrease in LD with distance for the different diversity groups
The results of this analysis revealed a general decrease in the LD with distance. Nevertheless, we found as many different cases as there were DGs studied.
Diversity groups SG1, SG2 and G
For DG SG1 (Figure 6), we found moderate to high r2 values compared to the other populations, with a large proportion of between-LG LD. The major genetic structure in two populations (Niaouli and Luki) helps to explain such results. In light of these results, it is difficult to choose an r2 limit that can be used in association genetics for this DG and that will avoid the risks of false detections.
Figure 6. Decrease in LD (measured by r2)as a function of genetic distance for groups SG1 (a) and SG2 (b). The regressions were calculated on significant values only (blue values).
In DG SG2 (Figure 6), a large share of the significant r2 values was found between unlinked markers. The r2 values were extremely low, even for some very close markers. This observation may be explained by the origin of this DG, which must have undergone greater genetic mixing than the other origins. Indeed, this group originated from a major center of Coffea canephora diversity and is constituted mainly by genotypes resulting from the cultivation and selection processes. It can also be linked to the high number of alleles found in this DG, resulting in a more effective recombination process and diluting the LD signal. For this DG, it is possible to hypothesize that obtaining comparable values between linked and unlinked markers is not due to a structural phenomenon, but rather to very low or non-existent LD in this population. Indeed, structure analyses indicated that this diversity group is not structured in sub-populations and that most of the genotypes within this group are of a cultivated origin. Consequently, more than one microsatellite per cM would be needed in this DG to have any hope of satisfactorily covering the entire genome.
For DG G (Figure 7), higher r2 values between linked markers than between unlinked markers were observed, with a clear decrease in those values with distance. Although some values over 0.1 could be seen for the DG as a whole, this was not the case for subgroups Gsub1 and Gsub2. Indeed, in these two subgroups, we found significant and high r2 values between close markers. By choosing an empirical r2 limit of 0.2, we arrived at an LD covering approximately 5 cM for subgroup Gsub1 and 20 cM for subgroup Gsub2 (Figure 7); these values suggest that it would be possible to cover the entire genome with a reasonable number of markers, i.e., approximately 290 and 70 markers, respectively.
Figure 7. Decrease in LD (measured by r2)as a function of genetic distance for groups G (a) and subgroup Gsub2 (b). The regressions were calculated on significant values only (blue values).
A very low LD was found for subgroup Gsub3, as well as for DG SG2. The LD remained very low for this group, even at low distances. Therefore, more than 1500 markers would most likely be required to cover the entire genome in this population.
Lastly, DG C and DG GP (Figure 8) exhibited high r2 values between close markers, displaying a strong decrease with distance, with a faster decrease in C than in GP. When an r2 value limit of 0.2 was chosen for GP and 0.1 was chosen for C, the LD was significant over a distance of approximately 23 cM for GP and 5 cM for C. These distances suggest that association studies can be conducted on these two populations with approximately sixty-five and 290 markers for GP and C, respectively.
Figure 8. r2as a function of genetic distance for the Pélézi (a) and C (b) groups.
Knowledge of the global LD thus provides a general idea of the variations in the LD over the entire genome. However, as the LD is highly variable depending on the region, a separate analysis of the various LGs for DGs GP, C, SG2 and G was carried out to compare the LD depending on the genome region.
Comparison of LD patterns between linkage groups
The graphs for r2 as a function of distance for LGs A, B, D, F, G and H are given in Figure 9 for DGs GP, C, SG2 and G.
Figure 9. Decrease in r2 as a function of distance for 6 linkage groups for a few diversity groups.
The analysis of these results shows that the LD varied depending on the LGs and the DGs that were considered. We confirmed the virtual absence of any usable r2 values for DGs SG2 and G. In contrast, for DG GP, we found moderate to high values for the set of LGs, although there were some large differences between the LGs. For DG C, only LGs A, F and G seemed to display r2 values over 0.1. However, these results need to be examined further, particularly with regards to LG G, which had a much greater marker density than the other LGs.
When observing the LD matrices per DG (data not shown), significant p-values were preferentially located close to the diagonal, and hence, between linked markers. This preferential localization was particularly apparent for DGs GP and C. In contrast, DGs SG2 and G displayed a large share of significant p-values outside the LGs. DG C only had a few r2 values over 0.1, reflecting the short distance at which the LD can be detected. However, DG GP had a larger number of values over 0.1 within the different LGs with a LD that seemed to be organized in blocks.
This LD study is the first of its kind in the genus Coffea. We identified numerous different cases of LD evolution along the genome within the different DGs. We were also able to identify a cryptic structure within DG G. This very fine genetic structure had not been discovered in earlier studies either due to a lack of resolution or because too few markers were used.
What is the point of a genome-wide LD study?
As in all species, we found a decrease in the LD with distance. For a population in equilibrium between mutation and genetic drift, the LD (measured by r2) is expected to depend on both the effective size of the population and the recombination rate between the loci considered [23,24]. The closer any two markers are, the longer it will take for the LD to dissipate. Therefore, one expects to find some LD values greater between close markers than between distant markers at a given moment in the evolution of a population.
The approach we adopted, with one marker every 13 cM on average but with highly densified portions of the genome, seemed to be the best strategy for constructing an initial pan-genomic view of the LD properties in the species studied. Moreover, our approach enabled a comparison of the LD behavior over different LGs. This comparison should lead to a clearer understanding of the possibilities for association studies within the studied populations.
Is LD in Coffea canephora an insurmountable problem for association studies?
Our results demonstrate that it is possible to perform association studies by working specifically on each population, but not on the global diversity level, in the species. Depending on the populations, the needed marker density varies, but the prospects of using association studies to support breeding programs for our species are quite interesting.
Indeed, the graph showing the set of r2 and D’ values on the scale of the 356 genotypes, combined with the importance of the genetic structure found in the entire sample, clearly illustrates the importance of the structure effect on the detection of associations between unlinked markers. The Hardy-Weinberg disequilibrium could not be reduced from the whole sample scale to the GD level. However, disequilibrium still exists within natural populations of C. canephora, as shown by [12,35]. This disequilibrium thus prevents the implementation of association studies for an entire set of genotypes using simple correlation models, which do not take into account structure and kinship effects. This result was confirmed by the large number of significant correlations between markers located on the different LGs for the 356 genotypes compared to those found on the DGs. Therefore, it seems necessary to work at the population level to more effectively study the LD dynamics in C. canephora, as the analyses showed that the most valuable results were obtained on DG GP, DG C and two Guinean subgroups (Gsub1 et Gsub2) corresponding more or less to natural populations.
We were thus able to reveal a high variability in LD within the different DGs, with a large share of residual between-linkage group disequilibrium in DG SG2 and DG SG1. These results may potentially lead to the detection of false positives in association studies, even at low levels of genetic structure. The importance of this “genomic” LD (as opposed to local LD) was variable depending on the groups, and by taking into account structure and kinship in association studies, it will be possible to overcome this variability. The residual genomic LD values for the less-structured DGs may be explained by different kinship levels within the natural populations of our species. Nevertheless, for DG SG2, the low r2 values obtained suggest that the LD is significant at very short distances. Indeed, natural populations of coffee trees are usually small, isolated populations with a small number of mother trees and a few juveniles, involving major relations of kinship, despite the strict outcrossing of the studied species.
We used Bonferroni’s correction to consider only truly significant values. Nevertheless, this correction is very conservative and may lead to a substantial loss of power in association studies. Many other corrections have been proposed in the literature in recent years, but none seems to be satisfactory. Moreover, we have shown that, in our case, this correction mainly made it possible to eliminate a certain number of disequilibrium values between unlinked or very weak markers. Normally, in association studies, such a correction will not be necessary because the main source of error (genetic structure) will be controlled. These questions should be given due consideration along with updates in the proposed models. Models that take into account structure and kinship in association studies appear to be a major advance in these approaches, helping to increase both the power and the resolution of such studies .
Genetic structure of DG G
Our study enabled us to more effectively determine a fine genetic structure for DG G. Structure seems to exist in these populations, but there are indications of major gene exchanges between them. We found a structure in three subgroups (Gsub1, Gsub2 and Gsub3) with both model-based and distance-based analyses. This very fine genetic structure can only be studied with a large number of markers. DG G was initially described by Berthaud  using isozyme markers. Berthaud concluded at the time that there was an absence of genetic structure within this group. However, Cubry et al. [12,13] showed the existence of a Guinean population that was different from the others (GD GP). It will also be important to study kinships within natural populations, along with the gene flow existing between them, to understand the dynamics of those populations on the forest scale in Guinea and the Ivory Coast.
What models can be used for association studies in Coffea canephora?
Our results show, particularly for the Guineans, that a large number of “control” markers (i.e., control markers that can be used to estimate structure and kinship independently from the association study) are needed to separate the fine structure into populations. Therefore, our case seems to be quite similar to the case of maize, where a set of eighty-nine microsatellite markers was used by Flint-Garcia et al.  to study structure and kinship on 302 lines.
After correction of the p-values by the Bonferroni method, some large and significant values of the two LD measurements (D’ and r2) were found both between unlinked and linked markers, preventing any distinction between associations based on a physical link between markers and those created by the structure. Therefore, the genomic control approach (adaptation of the significance limit to the number of associations detected between unlinked markers) appears to be less efficient and may lead to a large number of false negatives. This observation is one of the greatest criticisms of this model advanced by Yu et al. . Moreover, this approach estimates that structure has the same effect at any point of the genome .
The structured association approach proposed by Pritchard et al.  seems to be more efficient than the genomic control. Nevertheless, the degree of kinship in the populations studied, as shown by the diversity trees obtained (particularly for DG GP), indicates that a share of the confounding effect of genetic structuring is not taken into account in this model. Consequently, it seems that the model best adapted to the species and populations in our study is the mixed model proposed by Yu et al. . This approach has shown its power and its superior control of false positives when compared to other methods using simulated data.
These association study models are becoming increasingly efficient, and we seem to be arriving at a critical point in the development of these approaches. Even so, particular attention must be given to the choice of traits studied and their distribution within the sample on which association studies are performed. Indeed, by correcting the structure effect, there is a risk of not being able to detect traits that would have a distribution superimposed on the population structure .
Which target populations should be used for association studies?
The purpose of our work was to make an initial assessment of the LD at the pan-genomic level in C. canephora. We thus discovered considerable variation in the LD between populations. The DGs comprising natural populations, such as GP or C, appear to have a moderate to high LD, at approximately 5 to 25 cM. In these DGs, it seems feasible to carry out genome-wide scan type studies. Nevertheless, given the stochasticity of the LD between LGs and its sensitivity to low allele frequencies, we have certain reservations regarding this type of approach, notably when using highly polymorphic multi-allelic microsatellite markers.
The DGs comprising improved populations, such as SG2, seemed to have undergone substantial genetic mixing with greater diversity and a virtually undetectable LD on the scale at which we worked. Consequently, this type of population seems more suited to regional or candidate gene type approaches.
To conclude, the current association study models can be used to consider structure and kinship effects, enabling this type of approach for use even with composite and structured samples.
Which approach for association studies in Coffea canephora?
For most of the considered DGs, a high density of markers is required to perform association studies. Using SNPs in additions to the SSR should be of great value.
Considering ongoing work on SNP discovery and genotyping by sequencing in coffee, a high number of markers will likely be obtained in the short term. Then, genome-wide association studies (GWAS) can be easily applied to populations used in breeding processes, such as populations involved in the RRS in the Ivory Coast. In countries such as Uganda, GWAS should be applied to the entire germplasm used for breeding for tolerance to Coffee wilt disease (CWD). Therefore, the candidate gene approach should only be used in very low LD populations or for specific purposes.
We were able to demonstrate a cryptic structure within the Guinean diversity group (DG G). This very fine genetic structure was not detected in earlier studies either due to a lack of resolution or because too few markers were used. Nevertheless, complex dynamics seem to exist within the coffee populations of the Guinean region that are substantially different from those found in the forests of Uganda, as described by Musoli et al. . A more in-depth study of these populations will be undertaken to establish the relationships and gene flow existing between them.
The LD study, the first of its kind in the genus Coffea, has enabled us to consider association studies in the species C. canephora. In the different DGs, we identified numerous different cases of LD evolution along the genome. This study provided us with a basis for carrying out association studies and thereby optimizing the reciprocal recurrent selection scheme for the genetic improvement of C. canephora. This selection scheme is based on the complementary traits of diversity groups and the agronomic value of intergroup hybrids. Association studies within groups will allow us to significantly improve each of the complementary populations for their traits of interest. An initial application of this approach can be employed in Guinean populations (exhibiting high LD) with a limited number of markers throughout the genome. A selection process should be efficient and quick using association studies for the early selection of heritable traits (such as vigor or bean size) in intergroup hybrids.
The major genetic structure of our species, which may have limited the feasibility of association studies a few years ago, can now be taken into account in the latest models. Nevertheless, it will be important to carefully consider all the models and, in particular, to avoid over-correcting the structure effect, as this action may lead to numerous false negatives and a major loss of detection power.
Lastly, the results of our study on C. canephora could be extrapolated to C. arabica. Although C. arabica has a very restricted genetic base, as shown by Lashermes et al.  and Anthony et al. , one of the species participating in its creation is very close to C. canephora.
LD: Linkage Disequilibrium; DG: Diversity Group; LG: Linkage Group; cM: Centimorgan.
The authors declare that they have no competing interests.
PC, MD, DP, KA, SB and TL designed the study. HL provided the plant material. PC, FDB and KA carried out the genotyping experiments. PC and DP analyzed the data. PC, FDB and TL wrote the manuscript. All authors read and approved the final manuscript.
We thank the Grand Plateau Technique Régional de Génotypage (GPTR-GENOTYPAGE) and Ronan Rivallan for help in carrying out the genotyping experiments. We also thank Dr. Colin Kelleher from the DBN Plant Molecular Laboratory, National Botanic Gardens of Ireland, and Dr. Jean-Pierre Labouisse, CIRAD Montpellier, for correction of the manuscript. The authors are grateful to Nestlé Tours and Dr. Dominique Crouzillat for providing some genomic and EST-derived microsatellites markers. PC was supported by a French Grant from the research ministry for his PhD.
The Plant Genome Journal 2008, 1:5. Publisher Full Text
Fournier-Level A, Le Cunff L, Gomez C, Doligez A, Ageorges A, Roux C, Bertrand Y, Souquet J-M, Cheynier V, This P: Quantitative genetic bases of anthocyanin variation in grape (Vitis vinifera L. ssp. Sativa) berry: a quantitative trait locus to quantitative trait nucleotide integrated study.
Leroy T, Montagnon C, Charrier A, Eskes AB: Reciprocal Recurrent Selection applied to Coffea canephora Pierre. I. Characterization and evaluation of breeding populations and values of intergroup hybrids.
Euphytica 1993, 67:113-125. Publisher Full Text
Cubry P, De Bellis F, Pot D, Musoli P, Leroy T: Global analysis of Coffea canephora Pierre ex Froehner (Rubiaceae) from the Guineo-Congolese region reveals impacts from climatic refuges and migration effects.
Genet Resour Crop Evol 2012, 1-19. Publisher Full Text
Structuration de la diversité génétique et analyse des patrons de déséquilibre de liaison de l’espèce Coffea canephora Pierre ex. Froehner. 2008.
Dussert S, Lashermes P, Anthony F, Montagnon C, Trouslot P, Combes MC, Berthaud J, Noirot M, Hamon S: Le caféier, Coffea canephora. In Diversité génétique des plantes tropicales cultivées. CIRAD, Montpellier; 1999:175-194.
Leroy T, Bellis F, Legnate H, Kananura E, Gonzales G, Pereira LF, Andrade AC, Charmetant P, Montagnon C, Cubry P, Marraccini P, Pot D, Kochko A: Improving the quality of African robustas: QTLs for yield- and quality-related traits in Coffea canephora.
DARwin software. CIRAD. 2006.
Liu T, Todhunter RJ, Lu Q, Schoettinger L, Li H, Littell RC, Burton-Wurster N, Acland GM, Lust G, Wu R: Modeling extent and distribution of zygotic disequilibrium: implications for a multigenerational canine pedigree.
Musoli P, Cubry P, Aluka P, Billot C, Dufour M, De Bellis F, Pot D, Bieysse D, Charrier A, Leroy T: Genetic differentiation of wild and cultivated populations: diversity of Coffea canephora Pierre in Uganda.
Flint-Garcia SA, Thuillet AC, Yu J, Pressoir G, Romero SM, Mitchell SE, Doebley J, Kresovich S, Goodman MM, Buckler ES: Maize association population: a high-resolution platform for quantitative trait locus dissection.
Yu J, Pressoir G, Briggs WH, Bi IV, Yamasaki M, Doebley JF, McMullen MD, Gaut BS, Nielsen DM, Holland JB, Kresovich S, Buckler ES: A unified mixed-model method for association mapping that accounts for multiple levels of relatedness.
Camus-Kulandaivelu L, Veyrieras J-B, Madur D, Combes V, Fourmann M, Barraud S, Dubreuil P, Gouesnard B, Manicacci D, Charcosset A: Maize adaptation to temperate climate: relationship between population structure and polymorphism in the Dwarf8 gene.