Baetis harrisoni Barnard is a mayfly frequently encountered in river studies across Africa, but the external morphological features used for identifying nymphs have been observed to vary subtly between different geographic locations. It has been associated with a wide range of ecological conditions, including pH extremes of pH 2.9–10.0 in polluted waters. We present a molecular study of the genetic variation within B. harrisoni across 21 rivers in its distribution range in southern Africa.
Four gene regions were examined, two mitochondrial (cytochrome c oxidase subunit I [COI] and small subunit ribosomal 16S rDNA [16S]) and two nuclear (elongation factor 1 alpha [EF1α] and phosphoenolpyruvate carboxykinase [PEPCK]). Bayesian and parsimony approaches to phylogeny reconstruction resulted in five well-supported major lineages, which were confirmed using a general mixed Yule-coalescent (GMYC) model. Results from the EF1α gene were significantly incongruent with both mitochondrial and nuclear (PEPCK) results, possibly due to incomplete lineage sorting of the EF1α gene. Mean between-clade distance estimated using the COI and PEPCK data was found to be an order of magnitude greater than the within-clade distance and comparable to that previously reported for other recognised Baetis species. Analysis of the Isolation by Distance (IBD) between all samples showed a small but significant effect of IBD. Within each lineage the contribution of IBD was minimal. Tentative dating analyses using an uncorrelated log-normal relaxed clock and two published estimates of COI mutation rates suggest that diversification within the group occurred throughout the Pliocene and mid-Miocene (~2.4–11.5 mya).
The distinct lineages of B. harrisoni correspond to categorical environmental variation, with two lineages comprising samples from streams that flow through acidic Table Mountain Sandstone and three lineages with samples from neutral-to-alkaline streams found within eastern South Africa, Malawi and Zambia. The results of this study suggest that B. harrisoni as it is currently recognised is not a single species with a wide geographic range and pH-tolerance, but may comprise up to five species under the phylogenetic species concept, each with limited pH-tolerances, and that the B. harrisoni species group is thus in need of taxonomic review.
It is axiomatic that successful applied biology relies on sound taxonomy. For instance, the success of biomonitoring depends in part on the correct identification of the indicator organisms involved . Mayflies are known to be important indicator species in aquatic biomonitoring (e.g. [2-5]). Amongst these, two species of mayflies in the genus Baetis Leach stand out in importance: B. rhodani Pictet in Europe (e.g. [6-8]), and B. harrisoni Barnard in southern Africa (e.g. [2,9-11]). Baetis rhodani has long been recognised as a complex of cryptic species (see  for a review). The research on B. harrisoni presented here investigates whether this species may be similar to B. rhodani in terms of its widespread distribution and range of ecological tolerances, and whether it also represents a number of cryptic species.
Baetis harrisoni is a physically robust species, with nymphs found in slow- to fast-flowing (0.1–1.0 ms−1) streams and rivers throughout sub-Saharan Africa . It has also been recorded in polluted waters ranging in pH from about 2.9 to 10.0 [9,14]. Such a broad geographical distribution and variation in environmental tolerance in an organism with dubious dispersal ability suggests that there may be cryptic species associated with the name. A third line of evidence that B. harrisoni may constitute more than one species is that the morphological features used to identify its nymphs (e.g. mouthpart structure, abdominal pigmentation and relative size of gills) have been observed to vary subtly between populations in different geographic locations. Baetis harrisoni is therefore worth studying as a species of practical significance but uncertain taxonomy.
The use of molecular (DNA) data in resolving cryptic lineages has at least four advantages. Firstly, DNA data facilitate the detection of cryptic species because they provide more direct evidence of population genetic processes like interbreeding than do morphological data, and often with greater precision. For example, recent studies using the mitochondrial cytochrome c oxidase I (COI) gene showed that the name B. rhodani was applied to at least nine morphologically cryptic haplogroups  and that in Finland the species B. macani Kimmins and B. vernus Curtis are both paraphyletic in terms of their COI gene phylogenies . Secondly, molecular markers provide a valuable benchmark against which morphological characters for each lineage can be sought, and facilitate the interpretation of morphological variability. A re-analysis of morphological characters of the nymphs of B. macani (following a molecular study) provided diagnostic markers for identification of some lineages identified by DNA haplotypes . Thirdly, DNA data facilitate the association of conspecific immature and adult stages in life-cycles that feature radical metamorphosis and this has been done successfully in some mayflies . Finally, by helping to resolve questions of relationship amongst specimens from different places with greater precision, molecular markers can provide information about historical and geographical processes that underlie diversity.
Mitochondrial DNA sequences (especially of the COI region) provide useful data for addressing the types of challenges raised here [15,16,18]. However, numerous critiques of the taxonomic use of this gene [19-23] emphasise the need to cross-validate phylogenies with data from additional nuclear genes. This need arises from the possibility of incomplete lineage sorting, horizontal gene transfer, introgression and population fragmentation effects . For example, two specimens identified by their morphology as members of B. macani and B. liebenauae Keffermüller, respectively, were identified as B. vernus using COI-based phylogenetic analysis, suggesting mitochondrial introgression . Previous molecular taxonomic studies of Baetis species [15,16] have used only mitochondrial sequence data.
Our study has two aims. First, it provides a molecular study of the genetic variation within B. harrisoni across its distribution range in southern Africa using four genes (COI, 16S, PEPCK and EF1α). Second, it explores two nuclear genes (PEPCK and EF1α) as additional nuclear DNA markers to cross-validate the results of COI data, given recent critiques of using this region in isolation.
Our sampling strategy was driven by seeking cryptic variation where one is most likely to find it. Baetis harrisoni occurs throughout the south and east of South Africa (Albany Museum records: Figure 1), in biomes and climates that vary from Mediterranean (with winter rainfall) in the southwest Cape Floristic Region to temperate (with aseasonal rainfall) in the south-eastern Albany Thicket, to subtropical (with summer rainfall) in the north-eastern Savanna. In addition the western and southern Cape rivers are acidic, and those of the east neutral to alkaline . This environmental heterogeneity is probably greater than would be found in the species’ distribution in the rest of Africa, and if B. harrisoni did contain independent cryptic taxa, they are most likely to occur here. To contextualise the rest of Africa, specimens of B. harrisoni from Zambia and Malawi were included in the study.
Figure 1. Map of recorded and sampled localities. Recorded localities of B. harrisoni and the location of sampled sites corresponding to the major lineages of B. harrisoni in Southern Africa. Data obtained from the Albany Museum database.
At least ten nymphs of B. harrisoni were collected from each of 19 rivers spread across South Africa and one river in each of Zambia and Malawi (Figure 1, see Additional file 1: Table S1) and preserved in 96% ethanol. Two or three specimens from each river were used for molecular analysis, while the remaining specimens were kept for future morphological study. Nymphs of B. rhodani were provided by Dr Jean-Luc Gattolliat (Museum of Zoology, Switzerland) to serve as an outgroup. Vouchers of samples extracted for DNA (heads and the first pair of legs, or whole exoskeletons) are housed at the Albany Museum, Grahamstown (AMGS).
Additonal file 1. Table S1. GenBank Accession numbers, samples and localities. Clade labels correspond to Figures 2 and 3. (‘-’ indicates failure to amplify; * = sequence obtained from GenBank; EC = Eastern Cape; KZN = KwaZulu-Natal; LIM = Limpopo; MPU = Mpumalanga; NW = North West; RSA = South Africa; WC = Western Cape).
Format: PDF Size: 63KB Download file
This file can be viewed with: Adobe Acrobat Reader
DNA was extracted either from the entire abdomen using the Chelex® 100 extraction protocol  or by internal body digestion using the Invisorb extraction kit (Invitek).
Partial segments of four gene regions were amplified and sequenced, two mitochondrial (cytochrome c oxidase subunit ICOI and small subunit ribosomal 16S rDNA16S) and two nuclear (elongation factor 1 alphaEF1α and phosphoenol pyruvate carboxykinasePEPCK). The primers used for both the PCR and sequencing reactions were as follows: COI was amplified using C1-N-2568  and C1-J-1718 ; 16S was amplified using 16Sar and 16Sbr ; EF1α was amplified using DV-EF-F1 (5′- CAGGAYGTATACAAAATTGGTGG -3′) (Vanderpool, unpublished) and DALL (5′- CTACACACATTGGTTTGCTGGG -3′) designed for this study; PEPCK was amplified using pepFb12 (5′- GGAACTTCAAACAGCACCAAT -3′) and pepRb45 (5′- ACCTTGTGTTCTGCAGCT -3′) (modified by Vuataz, from ). All PCR reactions were performed in a 50 μl volume using the following thermal cycling profile: initial denaturation of 94°C for 5 min, followed by 35–40 cycles of denaturation at 94°C for 30s, primer annealing at 48°C (COI and 16S), 53°C (EF1α) or 52°C (PEPCK) for 30s, elongation at 72°C for 1 min 30s; followed by a final extension period of 72°C for 10 min. PCR products were cleaned using the Wizard® SV (Promega Corp.) and Invisorb PCRapace (Invitek) quick purification kits.
Cycle sequencing of the cleaned PCR product was carried out in both directions for each gene region using the flanking PCR primers and the ABI Big Dye Sequencing kit v.3.1, according to the manufacturer’s instructions. Sequence trace files were generated using an ABI 3100 genetic analyser situated at Rhodes University. Trace files were checked and edited using GeneStudio v.2 (GeneStudio, Inc). Sequences were initially aligned using the ClustalW algorithm in MEGA v.4  using the default parameters and then refined manually. Gaps in the 16S alignment were treated as missing data and there were no introns in the PEPCK or EF1α alignments. Sequences are available through GenBank [GenBank: HM636930-HM637101] (see Additional file 1: Table S1).
To test whether the data from each dataset (corresponding to each gene region) could be combined into a single data set, pairwise Incongruence Length Difference (partition homogeneity) tests  were conducted in PAUP* v.4.0b10 , using 1000 replicates. Each gene was tested for substitution saturation using plots of transitions and transversions against F84 distance in DAMBE v4.5 . Analyses were then conducted using Maximum Parsimony (MP) and Bayesian Inference (BI) for each dataset independently and the combined molecular data.
Each gene dataset was tested for the most appropriate model of sequence evolution using the AIC test  as implemented in MrModeltest v.2.2  using MrMTgui (available from http://genedrift.org/software/mrmtgui.html webcite).
Bayesian Inference analyses were conducted using MrBayes v.3.1.2  for each of the datasets. Each BI analysis comprised two independent runs each of ten million generations. Random starting trees with four chains (one cold, three hot) were used with trees sampled every 1000 generations. All model parameters except branch length and topology were unlinked across partitions and among-partition rate variation was accommodated following Marshall et al. . Stationarity was assessed using the Potential Scale Reduction Factor data and plots of likelihood scores, tree length and average standard deviation of split frequencies against the number of generations. The first 1000 trees sampled (10%) were discarded from each run as burn-in. The majority rule consensus Bayesian topology and posterior probability values were then computed from the remaining sampled trees. For the combined analysis the effect of partitioning the data by gene was estimated using pairwise comparison of Bayes Factors (sensu) [39,40] with 2lnBFA–B ≥ 10 indicative of strong support for partitioning strategy A vs. B, following Kass and Raftery . Although popular, the harmonic mean (HM) estimate of the marginal likelihood as reported by MrBayes is a biased estimator of the marginal likelihood  with a large and unpredictable variance . These properties render the HM inappropriate for Bayes Factor comparisons, instead the more recent stepping stone (SS) method is recommended [42,44]. Following analysis in MrBayes the marginal likelihood of the trees resulting from the two partitioning strategies were re-estimated using the generalized stepping stone (SS) method  as implemented in Phycas 1.2.0 http://www.phycas.org webcite using 30 β-values with 1000 cycles per β.
Parsimony analyses were performed using the heuristic search option in PAUP* version 4.0b10 . A simple search with TBR (Tree Bisection and Reconnection) branch-swapping was used to find the approximate length of the shortest trees. This was followed by a random input analysis using 1000 repetitions and TBR branch-swapping, keeping all trees equal to or shorter than the shortest tree found in the simple search. This process was repeated until no shorter trees were found. All of the shortest trees were retained and used to compute the strict consensus tree. Nodal support was investigated using 100 bootstrap pseudoreplicates  with MAXTREES set to 10 000, TBR branch-swapping and simple stepwise addition.
To estimate the contribution of each dataset to the overall support of each node, partitioned Bremer support (PBS) values [46-49] were calculated using TreeRot v.3  and PAUP* using only the parsimony informative characters for each dataset and 1000 random addition replicates. As only the major lineages were of interest, the PBS analysis used a reduced dataset comprising only individuals from each clade represented with the complete molecular dataset.
Both mean within- and between-clade distances were estimated from the COI and PEPCK data using the Maximum Composite Likelihood model in MEGA 4 . In addition, the contribution of Isolation by Distance (IBD) was estimated using the most informative gene data set (COI) in GenAlEx 6.1  with 999 permutations using the Mantel test in GenAlEx 6.1, initially for the group as a whole and subsequently for each clade individually, excluding the Zambian and Malawian clades that originated from one site each, preventing analysis.
As no fossils were available, a tentative dating analysis used minimum (1.5% per MY; ) and maximum (3.5% per MY; ) reported values for COI substitution rates of insects calibrated within the last 10 MY. Divergence times were estimated using BEAST v1.6.1  as follows. Data were partitioned by gene (PEPCK, 16S, COI) and the model parameters selected by MrModeltest for each partition (Table 1) were incorporated into each analysis. Each analysis comprised four independent runs of 50 million generations, with a UPGMA (Unweighted Pair Group Method with Arithmetic Mean) starting tree under the assumptions of an uncorrelated log-normal (UCLN) relaxed clock. The PEPCK and 16S substitution rates were estimated relative to the COI rate (COI rate set at 0.0075 or 0.0177 substitutions per site per MY) in BEAST. Each analysis was run under either a Coalescent (Constant Size), Yule, or Birth-Death prior to ascertain the affect of these tree priors on the resulting divergence estimates. Run statistics including the effective sample size for each parameter were examined in Tracer v1.5  and convergent independent runs for each analyses were then combined in Log Combiner 1.6.1 (part of the BEAST package) with a final burn-in (10%) of the trees. Node ages and the corresponding confidence intervals were then summarised using a Maximum Clade Credibility (MCC) tree in TreeAnnotator v1.6.1 (part of the BEAST package). These trees and corresponding node ages were viewed using FigTree v1.3.1 (part of the BEAST package).
Table 1. Data characteristics and analysis summaries
Species boundaries were explored using changes in branching rates following Pons et al . Both single and multiple thresholds, general mixed Yule-coalescent (GMYC) models were applied to each gene in isolation and to the combined data. After removing the outgroup specimens and all duplicate haplotypes, each chronogram was estimated using BEAST under the appropriate gene-specific model parameters (Table 1), with a relaxed uncorrelated log-normal (UCLN) clock, the mean substitution rate fixed at 1 and branch lengths estimated using a Coalescent (Constant Size) prior. Analyses were run for 50 million generations. Following the BEAST analyses the Maximum Clade Credibility chronograms were analysed using the SPLITS (available from http://r-forge.r-project.org/projects/splits webcite) and APE  packages in the R statistical environment (R Development Core Team, 2011). A log-likelihood ratio test was then used to assess the significance of the estimated shift in branching rates for both the single and multiple threshold models in each analysis.
The combined molecular data consisted of 67 specimens and 1935 nucleotides (16S = 502 bp; PEPCK = 357 bp; COI = 618 bp; EF1α = 458 bp) obtained from specimens from 22 rivers (see Additional file 1: Table S1; Figure 1) including outgroups. Certain sample sequences were successfully amplified for some genes and not others and are therefore absent in certain gene analyses (see Additional file 1: Table S1). Of the four genes, COI showed the most variation and EF1α the least (Table 1). Individual plots of transitions and transversions against F84 distance for the ingroup samples showed no signs of saturated substitution within the four datasets (data not shown).
The results of the Parsimony and Bayesian analyses are summarised in Table 1. Pairwise ILD tests showed that three of the four datasets were not significantly incongruent, with the exception of EF1α which was significantly incongruent from the other three genes (Table 2). Comparison of the two partitioning strategies in the combined Bayesian analyses showed no change in topology but an improvement in the support of a single node (Figure 2: the posterior probability of node 3 improved from 0.62 [single partition] to 0.97 [three partitions]). Comparison of the marginal likelihoods estimated using the generalized stepping stone (SS) method suggested that partitioning data by gene provided a much better ln-likelihood value (2lnBF1–3 = 818.04) than not partitioning the data for each gene (Table 1), and thus the results of the Bayesian analysis partitioning the data by gene (three partitions) are shown (Figure 2).
Table 2. Pairwise Incongruence Length Difference (ILD)
Figure 2. Combined Bayesian Inference with GMYC. Bayesian Inference phylogram of combined molecular data (excluding EF1α data) with support for major nodes shown above (Bayesian posterior probability / Parsimony bootstrap / Bremer support) and below (partitioned Bremer support: PEPCK / 16S / COI) each branch. The columns to the right of the tree indicate species boundaries identified by the GMYC likelihood analyses for the COI (clades: C1–C6), 16S (clades: S1–S5) and PEPCK (clades: P1–P7) partitions respectively. The fourth (dark grey) column shows five well supported monophyletic consensus groups identified by the GMYC likelihood analyses of the combined data (PEPCK + 16S + COI), representing putative phylogenetic species further discussed in this study. Node numbers (circled) correspond to date estimates reported in Tables 3 &4 and Additional file 2: Figure S1.
Five clades were found, consisting of samples collected from the Cederberg mountains and the Breede River district, which is a winter rainfall region of South Africa hereafter termed “SA Ced”; the entire winter rainfall region of South Africa, Cederberg inclusive, hereafter termed “SA West”; and the eastern summer rainfall region of southern Africa, corresponding to samples from South Africa “SA East”; Malawi, “MAL”; and Zambia, “ZAM”. The arrangement of the various clades differed depending on the data analysed.
Analysis of the 16S data resulted in a basal polytomy (Figure 3a). Two clades (SA Ced and SA West) were recovered as monophyletic; samples from eastern rivers were recovered as two paraphyletic clades (SA East + MAL) and (ZAM). Three clades (SA Ced, SA West and (SA East + MAL)) received good support (MP: ≥ 82; BI: ≥ 0.94) while the clade from Zambia (ZAM) received moderate support (MP: 82; BI: 0.81). Analysis of the PEPCK data (Figure 3b) resulted in three clades (SA Ced, SA West and (SA East + MAL)), with two (SA Ced and SA West) receiving good support (MP: ≥ 90; BI: 1.00). Within the eastern lineage the SA East samples formed a well-supported clade (MP: ≥ 90; BI: ≥ 0.98), whereas the samples from Malawi (MAL) were monophyletic but only moderately supported (MP: 63; BI: 0.87) and the samples from Zambia (ZAM) did not amplify for this gene. Analysis of the COI data (Figure 3c) indicated five clades (SA Ced, SA West, SA East, MAL and ZAM), each with good support (MP: ≥ 99; BI: 1.00). The trees recovered from analyses of the EF1α data (Figure 3d) were poorly supported and incongruent with the trees produced from the other genes, both when analysed separately (Figure 3a-c) and when combined in a total evidence analysis (Figure 2). Specimens that were represented in distinct geographic clades within the 16S, PEPCK and COI analyses were indistinguishable in the EF1α analysis (Figure 3d).
Figure 3. Bayesian Inference Phylograms. Bayesian Inference phylograms of each of the four different datasets corresponding to each individual gene: (a) 16S (b) PEPCK (c) COI (d) EF1α. Posterior probability support (p > 0.95) is indicated using an asterisk above the corresponding branch.
Considering the limited phylogenetic information provided by the EF1α data (Table 1) and their significant incongruence with all other sampled datasets (Table 2), the EF1α gene was excluded from the combined analysis (Figure 2). The MP and BI analyses of the combined remaining molecular data (comprising 16S, PEPCK and COI datasets) recovered five monophyletic lineages (SA Ced, SA West, SA East, MAL and ZAM). Three lineages (SA Ced, SA West and a combined “eastern” clade comprising (SA East + MAL + ZAM)) received good support (MP: ≥ 73; BI: 1.00). Within the eastern clade the Zambian (ZAM) lineage was recovered as sister to the combined Malawian (MAL) and South African (SA East) lineages, which received moderate support (MP: 56; BI: 0.97). In addition the sister relationship of ((SA East + MAL + ZAM) and SA West) received moderate support (MP: 66; BI: 0.99). The PBS analysis indicated little conflict between the three (16S, PEPCK and COI) datasets, with very good support for each clade (combined PBS ≥ 8) but only moderate support for the between-clade relationships (combined PBS ≤ 6).
The within-clade distance estimated using the Maximum Composite Likelihood model was comparatively low (COI: mean = 1.8%, range = 0.5–3.0%; PEPCK: mean = 0.7%, range = 0.2–0.9%), whereas the between-clade distance was an order of magnitude greater (COI: mean = 17.6%, range = 11.6–22.5%; PEPCK: mean = 4.4%, range = 2.4–5.7%) and comparable to the distance between the ingroup and outgroup taxa (COI: mean = 22.4%; range = 21.1–23.6%; PEPCK: mean = 7.5%, range = 6.7–9.6%).
Isolation by Distance (IBD) analyses of the samples as a whole showed a small but significant effect of IBD (R2 = 0.084; p = 0.001). The contribution of IBD was minimal within each individual clade (Table 3).
The topology of the various BEAST dating analyses (data not shown) agreed in general with the MrBayes analysis (Figure 2), with two exceptions: the analysis under a Yule prior and a COI substitution rate of 1.5% per MY (Table 4) and the analysis under a Birth-Death prior and a COI substitution rate of 3.5% per MY (Table 5), both of which did not recover a monophyletic clade comprising (SA East + MAL + ZAM + SA West: Figure 2, node 3) due to a weakly supported clade comprising (SA West + SA Ced). Unsurprisingly divergence estimates were affected by both the tree prior and the substitution rate used (Tables 3 & 4; Additional file 2: Figure S1). The fast substitution rate (3.5% per MY) resulted in divergence estimates which were approximately half those estimated using the slow substitution rate (1.5% per MY) across the whole range of estimates. The effect of tree prior was relatively minor for the within clade estimates (Tables 3 & 4; Additional file 2: Figure S1) but the between clade divergence estimates were strongly affected by the tree prior with a Yule prior favouring divergence estimates that were approximately half of those estimated using the Coalescent and Birth-Death priors (Tables 3 & 4; Additional file 2: Figure S1). Comparisons of the individual BEAST runs using the Bayes Factor calculator in Tracer with 1000 bootstrap replicates showed that a Yule prior (3.5% Ln = −5294.406; 1.5% Ln = −5295.392) was favoured over both the Coalescent prior (3.5% Ln = −5295.698; 1.5% Ln = −5296.326) and Birth-Death prior (3.5% Ln = −5296.363; 1.5% Ln = −5295.95), although this support was only marginal (2lnBFA-B ≤ 4 in all pairwise comparisons).
Within each of the lineages the mean time to most recent common ancestor (TMRCA) ranged from the youngest: 0.5–0.7 mya (MAL) to the oldest: 1.5–1.7 mya (SA East) using the faster rate (Table 5) or 1.1–1.6 mya (MAL) to 3.4–3.9 (SA East) using the slower rate (Table 4). The ingroup taxa are estimated to share a common ancestor in the mid- to late Miocene at least 5.2 mya (fast rate, Yule prior)–24.6 mya (slow rate, Birth-Death prior). The most recent cladogenesis between major lineages occurred at least 2.4 mya (fast rate, Yule prior)–8.9 mya (slow rate, Birth-Death prior) between the SA East and MAL lineages (Table 4).
Table 3. Isolation by Distance (IBD)
Table 4. TMRCA with substitution rate at 1.5% per My
Table 5. TMRCA with substitution rate at 3.5% per My
The results of the individual GMYC analyses are summarised in Table 6. In all cases the multiple-threshold model was not significantly different from the single model and was rejected in favour of the more conservative single threshold model. The COI data recovered six ML entities (Figure 2, C1–C6) the 16S data recovered five ML entities (Figure 2, S1–S5) and the PEPCK data recovered seven ML entities (Figure 2, P1–P6), but none of these outcomes were significantly different from the null model of a single species. However, the combined COI + 16S + PEPCK data recovered six ML entities, which was significantly different from the null model of a single species (p = 0.04). Five of these entities correspond to monophyletic, well supported clades (Figure 2).
Table 6. Summary of GMYC