Department of Applied Mathematics and Computer Science, Faculty of Sciences, Ghent University, Krijgslaan 281, S9, 9000 Gent, Belgium

Department of Botany, Faculty of Sciences, Palacký University, Šlechtitelů 11, 783 71 Olomouc, Czech Republic

Bayer CropScience NV, Seeds, Technologiepark 38, 9052 Zwijnaarde, Belgium

Abstract

Background

Sampling core subsets from genetic resources while maintaining as much as possible of the genetic diversity of the original collection is an important but computationally complex task for gene bank managers. The Core Hunter computer program was developed as a tool to generate such subsets based on multiple genetic measures, including both distance measures and allelic diversity indices. First, we investigate the effect of minimum (instead of the default mean) distance measures on the performance of Core Hunter. Second, we try to gain more insight into the performance of the original Core Hunter search algorithm by comparing it with several other heuristics on several realistic datasets of varying size and allelic composition. Finally, we propose a new algorithm (Mixed Replica search) for Core Hunter II with the aim of improving both the diversity of the constructed core sets and the time needed to generate them.

Results

Our results show that the introduction of minimum distance measures leads to core sets in which all accessions are sufficiently distant from each other, which was not always the case when optimizing mean distance alone. Comparison of the original Core Hunter algorithm, Replica Exchange Monte Carlo (REMC), with simpler heuristics shows that the simpler algorithms often give very good results with lower runtimes than REMC. However, the simpler algorithms perform slightly worse than REMC at lower sampling intensities, and some heuristics clearly struggle with minimum distance measures. In comparison, the new Mixed Replica search algorithm (MixRep), which uses heterogeneous replicas, was able to sample core sets with equal or higher diversity scores than REMC and the simpler heuristics, often using less computation time than REMC.

Conclusion

The REMC search algorithm used in the original Core Hunter computer program performs well, sometimes leading to slightly better results than some of the simpler methods, although it does not always give the best results. Switching to the new Mixed Replica algorithm can significantly improve both overall results and runtimes. Finally, we recommend including minimum distance measures in the objective function when looking for core sets in which all accessions are sufficiently distant from each other. Core Hunter II is freely available as an open source project at

Background

The concept of a core collection was first introduced in

To be able to generate diverse core sets we need evaluation measures that express the diversity of a given collection. These measures are based on a variety of criteria including phenotypic traits or genetic marker data

The first two of these allocation methods were proposed by Brown in

Another allocation method, the M-method, was proposed by Schoen and Brown in

It should be noted that the objectives of the D-method and MSTRAT differ. While MSTRAT aims at including rare and localized alleles by maximizing allelic richness, the goal of the D-method is a high representation of the original genetic diversity in the core by including widely adapted accessions that are genetically distant from each other. The former approach is favored by taxonomists and geneticists while the latter corresponds more to the breeder’s preference.

Other non-stratified methods include genetic distance sampling and least distance stepwise sampling. Genetic distance sampling

All of the previously mentioned methods assume that the desired core size (or distance threshold, in case of genetic distance sampling) is known in advance and given as input to the sampling strategy, and then try to create a good core set of the desired size according to the specific objective used. However, a related problem is that of finding the smallest possible core set that retains all unique alleles from the original collection. The PowerCore algorithm was presented in

Core Hunter was developed as a new, very flexible framework for selecting core collections

In this paper we present Core Hunter II as an extension to the original Core Hunter framework. First we investigate if minimum distance measures, in addition to the available mean distances, ensure that accessions in the core will be sufficiently distant from each other. A second objective is to gain more insight into the performance of the REMC search engine by comparing it with several other heuristic optimization methods implemented in the same flexible Core Hunter framework. In the original Core Hunter article

Methods

We use the same formal definition of the core subset selection problem and multi-objective pseudo-index as in

The core subset selection problem

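Given some collection $A$ of accessions, an evaluation measure $f$ and a desired core size $n$ ^{a}, find a core $S^{*} \subseteq A$ with $|S^{*}| = n$ such that $f(S^{*}) = \max\{ f(S) : S \subseteq A,\ |S| = n \}$.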

The multi-objective pseudo-index

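Given any evaluation measures $f_i$ and weights $0 \le w_i \le 1$, $i = 1, \dots, k$, the corresponding pseudo-index of a core $S$ is defined as

$$f(S) = \sum_{i=1}^{k} w_i \, f_i(S).$$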

This pseudo-index has no biological meaning; it merely serves as a means to optimize several measures at once according to their importance (weight).
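As a minimal sketch, and assuming the pseudo-index is a plain weighted sum of the individual measures, the combination could be written as follows. The function name `pseudo_index` and the constant stand-in measures are hypothetical and are not part of the Core Hunter code. The two constants are the minimum and mean MR distances reported for LR(2,1) on the pea data set in the results section.

```python
from typing import Callable, Sequence

# A measure maps a core (here: a sequence of accession indices) to a score.
Measure = Callable[[Sequence[int]], float]


def pseudo_index(measures: Sequence[Measure], weights: Sequence[float]) -> Measure:
    """Combine several evaluation measures into one weighted objective."""
    if len(measures) != len(weights):
        raise ValueError("need exactly one weight per measure")
    if any(not 0.0 <= w <= 1.0 for w in weights):
        raise ValueError("weights must lie in [0, 1]")

    def combined(core: Sequence[int]) -> float:
        # Weighted sum of the individual scores of this core.
        return sum(w * f(core) for f, w in zip(measures, weights))

    return combined


# Stand-in measures (assumption): they ignore the core and simply return
# the reported minimum and mean MR distances, 0.324 and 0.583. A real
# measure would compute its score from the marker data of the core.
min_mr = lambda core: 0.324
mean_mr = lambda core: 0.583

mixed_mr = pseudo_index([min_mr, mean_mr], [0.5, 0.5])
print(mixed_mr([]))  # close to 0.4535
```

With equal weights of 0.5 the combined value is about 0.4535, which agrees with the reported mixed MR score of 0.454 up to rounding.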

Evaluation measures

The original version of Core Hunter only supports genetic marker data (also called molecular marker data) and offers seven diversity measures, including two genetic distance measures, three allelic diversity indices and two auxiliary measures. We briefly discuss these here and refer to

Genetic distance measures are defined on pairs of accessions and express their similarity. The higher the distance between two accessions, the more genetically different they are and conversely highly similar accessions can be identified as those being very close to each other. To assess the diversity of an entire collection using a genetic distance measure, it is customary to report the mean distance between all pairs of accessions contained in this collection. These measures are especially useful for breeders who want to ensure that each accession in the selected core set is sufficiently different from the others. Core Hunter offers the Modified Rogers (MR)

The allelic diversity indices are directly computed on the entire core set and are particularly useful for preserving rare alleles, which makes them very well suited for applications aimed at genetic conservation, such as sampling core collections from germplasm resources. Three such diversity indices are available in Core Hunter: (1) Shannon’s diversity index (SH)

Finally, two auxiliary measures are also available, expressing the extent to which the original alleles from the entire collection are still present in the core. The allele coverage (CV)

Minimum distance measures

When expressing the diversity of a collection using the mean distance between all pairs of accessions, it is not clear whether optimizing this mean value will in fact lead to cores in which all accessions are sufficiently distant from each other. High mean distance does not

Performance of REMC

To gain more insight into the performance of the REMC search algorithm that is implemented in the Core Hunter software, we compared its results with those of several simpler methods, implemented in the same flexible Core Hunter framework. All of these are well known basic heuristic search methods:

1. Standard Local Search (LS) starts with a random solution and then iteratively samples random neighbor solutions, accepting them as the new solution if and only if they are better than the current solution. For our application, we use a so called

2. MSTRAT Steepest Descent: as mentioned before, Core Hunter was previously compared with the external MSTRAT program

3. LR Greedy Search is a deterministic algorithm that does not take any randomized decision ^{b} that always performs

For the first two methods and REMC, the stop criteria that decide when to terminate the search^{c} are: (1) maximum runtime – 60 seconds by default, (2) minimum progression, and (3) maximum time without improvement (stuck time). The LR method does not accept such stop criteria as it simply terminates when the desired core size has been reached. For our experiments we set a maximum runtime limit for each randomized algorithm and record both the diversity of the resulting core sets and the corresponding convergence time, which is defined as the point in time from which no more improvement was observed when all figures were rounded to 3 decimal places.
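The following minimal sketch illustrates standard local search with a stuck-based stop criterion. It rests on several assumptions: a neighbor is obtained by swapping one selected accession for one unselected accession, the stop criteria are counted in steps rather than seconds so that the example is reproducible, and the objective is a toy mean pairwise distance on one-dimensional positions. All names (`local_search`, `mean_pairwise_distance`, `max_stuck`) are hypothetical.

```python
import random
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


def mean_pairwise_distance(core: Sequence[int], positions: Sequence[float]) -> float:
    """Toy objective: mean absolute difference between 1-D positions."""
    pairs = list(combinations(core, 2))
    return sum(abs(positions[a] - positions[b]) for a, b in pairs) / len(pairs)


def local_search(n_items: int, core_size: int,
                 evaluate: Callable[[Sequence[int]], float],
                 max_steps: int = 10_000, max_stuck: int = 1_000,
                 seed: int = 0) -> Tuple[List[int], float]:
    """Accept a random swap if and only if it improves the current score."""
    if not 2 <= core_size < n_items:
        raise ValueError("core size must be at least 2 and below n_items")
    rng = random.Random(seed)
    selected = rng.sample(range(n_items), core_size)
    chosen = set(selected)
    unselected = [i for i in range(n_items) if i not in chosen]
    score = evaluate(selected)
    stuck = 0  # consecutive steps without improvement
    for _ in range(max_steps):
        if stuck >= max_stuck:
            break
        i = rng.randrange(core_size)
        j = rng.randrange(len(unselected))
        # Tentatively swap one selected and one unselected accession.
        selected[i], unselected[j] = unselected[j], selected[i]
        new_score = evaluate(selected)
        if new_score > score:
            score, stuck = new_score, 0
        else:
            # Undo the swap: the neighbor was not better.
            selected[i], unselected[j] = unselected[j], selected[i]
            stuck += 1
    return sorted(selected), score


positions = list(range(10))
core, score = local_search(10, 2, lambda c: mean_pairwise_distance(c, positions))
print(core, score)
```

On this toy problem every non-optimal pair has an improving swap, so the search ends at the two extreme positions. Realistic objectives can have local optima where this simple acceptance rule gets stuck.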

The goal of the comparison is to find out when simple methods break down and where it would be better to turn to more advanced methods such as REMC. These results will give us more insight into the specific characteristics of those problems on which simple methods do or do not fail.

Mixed Replica Search

In addition to our comparison of REMC with these simpler methods, we also present a new advanced search engine that is inspired by the replicated approach of REMC but uses heterogeneous instead of homogeneous replicas, each implementing a different search technique. We experimented with several simple methods, including those explained in the previous section, and several more advanced heuristics to assess whether we could improve on the results of REMC. We observed that different methods outperformed REMC in several experiments with respect to either runtime or diversity score, but each method had some drawbacks. We therefore designed one robust Mixed Replica Search engine (MixRep) that combines the strengths of several search techniques, so that different problems can be tackled with different techniques without having to determine in advance which technique is best suited to a specific problem.

The MixRep algorithm is based on four different types of replicas, consisting of two simple and two more advanced search techniques:

1. LR Semi Replica (LR): a modified version of the deterministic LR(2,1) search avoiding the overhead of exhaustively sampling the first pair of accessions. This replica starts with two randomly chosen accessions and thus introduces a small random effect making the resulting technique no longer purely deterministic.

2. Local Search Replica (LS): this replica performs standard local search using the previously described single perturbation neighborhood.

3. Tabu Search Replica (Tabu): a more advanced technique, based on steepest descent, which always continues with the best neighbor of the current solution even if it is worse than the current solution, but skipping those neighbors that have been declared tabu. In theory all previously visited solutions should be declared tabu. This technique prevents the search from continuously revisiting previous solutions and from traversing cycles within the search space. Our implementation of tabu search uses the steepest descent technique from MSTRAT to construct neighbors.

4. Simple Monte Carlo Replica (MC): these replicas are exactly the same as those used in the REMC search algorithm ^{d}.
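To make the first replica type more concrete, here is a minimal sketch of an LR(2,1)-style search that starts from a random pair. It assumes the usual "plus two, take away one" reading of LR(2,1): in each round the two individually best accessions are added one after the other, and then the accession whose removal hurts the score least is dropped. The function name `lr21_semi` and the toy objective are hypothetical.

```python
import random
from itertools import combinations
from typing import Callable, Iterable, Sequence, Set


def mean_pairwise_distance(core: Iterable[int], positions: Sequence[float]) -> float:
    """Toy objective: mean absolute difference between 1-D positions."""
    pairs = list(combinations(core, 2))
    return sum(abs(positions[a] - positions[b]) for a, b in pairs) / len(pairs)


def lr21_semi(n_items: int, core_size: int,
              evaluate: Callable[[Set[int]], float], seed: int = 0) -> Set[int]:
    """Greedy add-2/remove-1 search seeded with a random pair of accessions."""
    if not 2 <= core_size < n_items:
        raise ValueError("core size must be at least 2 and below n_items")
    rng = random.Random(seed)
    # Random first pair instead of exhaustively scoring all pairs.
    core = set(rng.sample(range(n_items), 2))
    while len(core) < core_size:
        for _ in range(2):
            # Add the accession that gives the best score when included.
            candidates = [i for i in range(n_items) if i not in core]
            best = max(candidates, key=lambda i: evaluate(core | {i}))
            core.add(best)
        # Remove the accession whose removal leaves the best score.
        worst = max(sorted(core), key=lambda i: evaluate(core - {i}))
        core.remove(worst)
    return core


positions = list(range(10))
objective = lambda c: mean_pairwise_distance(c, positions)
core = lr21_semi(10, 4, objective)
print(sorted(core), objective(core))
```

Each round grows the core by exactly one accession, so the number of rounds is fixed by the desired core size. Apart from the random first pair, every decision is deterministic.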

The algorithm uses only one single LR replica as this method is deterministic (apart from the random selection of the first pair of accessions) so its results show little to no variation. The other three replicas are used repeatedly with different initial solutions. The search process can be described as follows:

1. One LR replica is created, initialized with a random pair of accessions and activated to run in the background until its search process is complete.

2. Several LS replicas are created and randomly initialized with core sets of the desired size.

3. Until some stop criterion is met, consecutive search rounds are performed containing the following steps:

(a) All replicas perform some search steps, independently of the other replicas.

(b) The best solution over all replicas is tracked and improvements are reported.

(c) Regularly, new advanced replicas are created (Tabu, MC) and initialized with new cores, obtained by merging promising solutions from the current replicas.

(d) Replicas which did not improve on their current solution during their last couple of search steps are considered to be stuck and subsequently removed.

(e) If the global improvement drops below a certain threshold or if there was no improvement at all for some time, the search is boosted by adding several new randomly initialized LS replicas to provide new variation.

Note that in step (3a) replicas perform their search steps independently from each other, which is in fact also the case in the replicated REMC algorithm

For specific details concerning the MC replicas we refer to ^{e}. When the current solution has been perturbed into one of its neighbors by changing the accession at index

In summary, the Mixed Replica algorithm starts with some simple, fast methods to perform an initial exploration of the search space. Afterwards, more advanced methods take over, starting from the areas that the simple methods have reached. These areas of the search space generally contain better solutions, which makes further improvement on the current solution more difficult. As soon as little or no improvement is being made, the search is boosted by introducing new simple, fast methods that start from new random points in the search space to supply new variation. The best solution over all replicas is tracked at all times and reported when some stop criterion is met, by default after a maximum total runtime of 60 seconds.
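The round structure can be sketched in a heavily simplified form. The sketch below is not the Core Hunter II implementation: it uses only two replica types (local search and a Metropolis-style Monte Carlo replica with an assumed fixed temperature), omits the LR and Tabu replicas, merges the two best solutions in every round, and counts rounds instead of seconds. All class names, function names and parameters are hypothetical.

```python
import math
import random
from itertools import combinations
from typing import Callable, FrozenSet, List

Objective = Callable[[FrozenSet[int]], float]


class Replica:
    """Swap-based search replica; subclasses decide which moves to accept."""

    def __init__(self, core, n_items: int, evaluate: Objective, rng: random.Random):
        self.core, self.n, self.evaluate, self.rng = frozenset(core), n_items, evaluate, rng
        self.score = evaluate(self.core)
        self.best_core, self.best_score, self.stuck = self.core, self.score, 0

    def accept(self, new_score: float) -> bool:
        raise NotImplementedError

    def step(self) -> None:
        removed = self.rng.choice(sorted(self.core))
        added = self.rng.choice(sorted(set(range(self.n)) - self.core))
        candidate = (self.core - {removed}) | {added}
        new_score = self.evaluate(candidate)
        if self.accept(new_score):
            self.core, self.score = candidate, new_score
        if self.score > self.best_score:
            self.best_core, self.best_score, self.stuck = self.core, self.score, 0
        else:
            self.stuck += 1  # steps since this replica last improved


class LSReplica(Replica):
    def accept(self, new_score: float) -> bool:
        return new_score > self.score


class MCReplica(Replica):
    temperature = 0.05  # illustrative value, an assumption

    def accept(self, new_score: float) -> bool:
        if new_score >= self.score:
            return True
        return self.rng.random() < math.exp((new_score - self.score) / self.temperature)


def mix_rep(n_items: int, core_size: int, evaluate: Objective, rounds: int = 50,
            steps_per_round: int = 20, n_ls: int = 4, max_stuck: int = 30,
            max_replicas: int = 20, seed: int = 0):
    rng = random.Random(seed)

    def new_ls() -> LSReplica:
        return LSReplica(rng.sample(range(n_items), core_size), n_items, evaluate, rng)

    replicas: List[Replica] = [new_ls() for _ in range(n_ls)]
    first = max(replicas, key=lambda r: r.best_score)
    best_core, best_score = first.best_core, first.best_score
    for _ in range(rounds):
        for replica in replicas:  # (a) independent search steps
            for _ in range(steps_per_round):
                replica.step()
        ranked = sorted(replicas, key=lambda r: r.best_score, reverse=True)
        improved = ranked[0].best_score > best_score  # (b) track global best
        if improved:
            best_core, best_score = ranked[0].best_core, ranked[0].best_score
        if len(ranked) >= 2 and len(replicas) < max_replicas:  # (c) merge
            pool = sorted(ranked[0].best_core | ranked[1].best_core)
            replicas.append(MCReplica(rng.sample(pool, core_size), n_items, evaluate, rng))
        replicas = [r for r in replicas if r.stuck < max_stuck]  # (d) drop stuck
        if (not improved or not replicas) and len(replicas) < max_replicas:  # (e) boost
            replicas.extend(new_ls() for _ in range(n_ls))
    return best_core, best_score


positions = list(range(10))


def toy_objective(core: FrozenSet[int]) -> float:
    pairs = list(combinations(core, 2))
    return sum(abs(positions[a] - positions[b]) for a, b in pairs) / len(pairs)


core, score = mix_rep(10, 2, toy_objective)
print(sorted(core), score)
```

The point of the sketch is the control flow: replicas of different types share one loop, stuck replicas are discarded, and fresh local search replicas are injected whenever the global best stops improving.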

Datasets

We performed intensive experiments using five different realistic datasets, including the larger two datasets used in the original Core Hunter article

● ‘bulk maize data set’

– 275 samples, genotyped at 24 SSR loci with 186 total alleles

– obtained by fingerprinting 275 bulks of maize landrace populations, each containing multiple maize individuals from the Americas and Europe using 24 multi-allelic SSR markers

● ‘accession maize data set’

– 521 samples, genotyped at 26 SSR loci with 209 total alleles

– obtained by fingerprinting 521 maize individuals from 25 different populations using 26 multi-allelic SSR markers

● ‘flax data set’

– 708 samples, genotyped at 141 IRAP loci with 282 total ‘alleles’

– obtained by fingerprinting 708 bulks of 10 flax individuals each using 141 IRAP markers (similar to AFLP); only two possible states occur for each bulk at each marker locus: (i) presence of allele and (ii) absence or marker failure, where it is not possible to distinguish between these last two states

● ‘pea data set’

– 1283 samples, genotyped at 19 RBIP loci with 38 total ‘alleles’

– obtained by fingerprinting 1283 bulks of 10 pea individuals each using 19 RBIP markers, with 4 different possible states for each bulk at each marker locus: (i) presence of allele in each individual, (ii) absence in each individual, (iii) mixed state having both individuals with presence and absence in the same bulk, and finally (iv) the zero state which means no data is available

● ‘large pea data set’

– 4429 samples, genotyped at 17 RBIP loci with 34 total ‘alleles’

– obtained in the same setting as the previous dataset, but containing significantly more samples (again bulks of 10 individuals)

Implementation and hardware

Extensions to the original Core Hunter software were implemented in Java (version 1.6), starting from the original code, which was kindly provided by the authors. All of our main experiments were performed on a 2.53 GHz Intel Core i5 dual core MacBook Pro with 4 GB of RAM and 256 KB of CPU cache per core. Some additional experiments were run on the UGent ‘helios’ computing server, a machine running Debian Linux with two 6-core 3.07 GHz Intel Xeon X5675 processors, 48 GB of RAM and 12 MB of cache per CPU. We explicitly note which experiments were run on this helios server.

The statistical R software was used to produce all visualizations of datasets and sampled cores. Principal component analyses were performed using the built-in R command

Results and discussion

First we present the results of a comparison of REMC with the simpler methods described in the previous section, using the original Core Hunter evaluation measures. Then we illustrate a possible problem regarding minimum distances when only mean distances are optimized, using some generated toy example datasets of low dimension^{f}. Next, we discuss the impact of including the newly introduced minimum distances in the objective function when sampling from the realistic datasets.

Based on these results, we will give further motivation of the specific composition of our new Mixed Replica algorithm and then we will discuss this method’s performance regarding both the diversity of the constructed core sets and the runtimes until convergence. To investigate the impact of the sampling intensity on the performance of our algorithms, all experiments have been repeated for two different sampling intensities (int): 20% (fairly large) and 5% (rather low), both within the range of sampling intensities proposed in previous research

Performance of REMC using original measures

Table ^{g}. Only for the large pea dataset was the runtime limit set to 10 minutes due to its large size. No runtime limit holds for LR search. Both the diversity scores of the constructed core sets and their corresponding convergence times (smaller figures) are presented. The latter is defined as the point in time from which no more improvement was observed^{h}. In cases where several methods gave different results in terms of the reported core diversity, the highest score is shown in bold. For each dataset the bottom line shows the corresponding diversity scores of the entire collection to allow comparison with the scores of the selected cores. Single measure optimizations were performed for each of the available (mean) distance measures (MR, CE) and diversity indices (SH, HE, NE), but not for the auxiliary measures (PN, CV). In practice PN and CV are not generally used as a single objective but as additional constraints when the main goal is optimization of one or more of the other measures. The mixed pseudo-index does contain all seven measures with equal weights, including these auxiliary measures.

| **Algorithm**^{*} | **MR** | **(t)** | **CE** | **(t)** | **SH** | **(t)** | **HE** | **(t)** | **NE** | **(t)** | **Mixed**^{**} | **(t)** |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Bulk maize data set (275)** | | | | | | | | | | | | |
| Local S. | 0.572 | 0.45s | 0.641 | 0.55s | 4.531 | 0.35s | 0.667 | 0.25s | 3.446 | 0.65s | **10.680** | 15.0s |
| MSTRAT | 0.572 | 0.31s | 0.641 | 0.32s | 4.531 | 0.38s | 0.667 | 0.43s | 3.446 | 0.43s | 10.678 | 1.5s |
| LR(2,1) | 0.572 | 0.61s | 0.641 | 0.64s | 4.531 | 1.1s | 0.667 | 1.0s | 3.446 | 1.0s | **10.680** | 4.2s |
| REMC | 0.572 | 1.0s | 0.641 | 2.0s | 4.531 | 2.0s | 0.667 | 1.0s | 3.446 | 3.0s | **10.680** | 15.0s |
| Original | 0.440 | | 0.521 | | 4.399 | | 0.620 | | 2.937 | | | |
| **Accession maize data set (521)** | | | | | | | | | | | | |
| Local S. | **0.695** | 2.0s | 0.752 | 1.0s | 4.670 | 1.0s | 0.676 | 0.45s | 3.501 | 2.0s | 11.086 | 15.0s |
| MSTRAT | **0.695** | 1.7s | 0.752 | 1.7s | 4.670 | 1.6s | 0.676 | 1.5s | 3.501 | 1.5s | 11.083 | 8.2s |
| LR(2,1) | **0.695** | 2.9s | 0.752 | 2.9s | 4.670 | 4.2s | 0.676 | 3.9s | 3.502 | 3.9s | **11.087** | 17.5s |
| REMC | 0.694 | 4.0s | 0.752 | 4.0s | 4.670 | 5.0s | 0.676 | 3.0s | 3.502 | 20.0s | 11.086 | 50.1s |
| Original | 0.630 | | 0.696 | | 4.467 | | 0.591 | | 2.742 | | | |
| **Flax data set (708)** | | | | | | | | | | | | |
| Local S. | **0.512** | 2.1s | **0.512** | 2.1s | 5.340 | 0.58s | **0.263** | 0.58s | 1.469 | 1.1s | **8.878** | 12.7s |
| MSTRAT | **0.512** | 5.1s | **0.512** | 5.1s | 5.340 | 3.7s | **0.263** | 3.8s | 1.469 | 3.8s | 8.877 | 25.1s |
| LR(2,1) | **0.512** | 7.4s | **0.512** | 7.4s | 5.340 | 13.3s | **0.263** | 12.9s | 1.469 | 12.8s | **8.878** | 50.4s |
| REMC | 0.511 | 5.0s | 0.511 | 4.0s | 5.340 | 30.0s | 0.262 | 4.0s | 1.469 | 30.0s | 8.874 | 60.4s |
| Original | 0.468 | | 0.468 | | 5.285 | | 0.222 | | 1.377 | | | |
| **Pea data set (1283)** | | | | | | | | | | | | |
| Local S. | **0.593** | 3.0s | **0.597** | 2.7s | **3.556** | 1.1s | **0.440** | 1.0s | **1.867** | 6.3s | **7.946** | 53.6s |
| MSTRAT | **0.593** | 28.8s | **0.597** | 28.5s | **3.556** | 17.5s | **0.440** | 18.3s | **1.867** | 18.2s | 7.851 | 60.6s |
| LR(2,1) | **0.593** | 34.1s | **0.597** | 34.3s | **3.556** | 24.5s | **0.440** | 28.3s | **1.867** | 27.9s | **7.946** | 03m03s |
| REMC | 0.591 | 50.0s | 0.595 | 30.0s | 3.553 | 7.0s | 0.437 | 15.0s | 1.865 | 15.0s | 7.876 | 61.2s |
| Original | 0.509 | | 0.515 | | 3.482 | | 0.381 | | 1.713 | | | |
| **Large pea data set ^{▾} (4429)** | | | | | | | | | | | | |
| Local S. | **0.594** | 49.4s | **0.596** | 38.1s | **3.486** | 18.3s | **0.465** | 16.9s | **1.886** | 23.8s | **7.947** | 07m43s |
| MSTRAT | 0.555 | 10m03s | 0.558 | 10m03s | 3.478 | 10m03s | 0.458 | 10m03s | 1.866 | 10m02s | 7.396 | 10m07s |
| LR(2,1) | **0.594** | 42m56s | **0.596** | 42m35s | **3.486** | 21m18s | **0.465** | 21m24s | **1.886** | 21m23s | **7.947** | 04h08m |
| REMC | 0.577 | 03m41s | 0.580 | 08m49s | 3.470 | 08m37s | 0.448 | 04m29s | 1.875 | 05m22s | 7.621 | 10m03s |
| Original | 0.464 | | 0.466 | | 3.348 | | 0.352 | | 1.609 | | | |

^{*}For each combination of algorithm, dataset and evaluation measure, 20 independent runs were performed from which averaged results are reported. By default runs were limited by a runtime of 60 seconds, except for the large pea dataset where a runtime limit of 10 minutes was applied. Furthermore the LR method does not accept a runtime limit but continues search until the desired core size has been reached.

^{**}Results shown are those of a pseudo-index containing all seven measures with equal weights.

^{▾}These results were computed on the helios server.

As we can see, results are very similar for each of the four algorithms. The advanced REMC algorithm never outperforms all of the simple methods when comparing the diversity scores of the constructed core sets for a specific dataset and evaluation measure. More precisely, REMC never outperforms LR search and only occasionally presents slightly better results than Local Search and/or MSTRAT. Except for the smallest (bulk) maize dataset, some or even all simple methods often slightly outperform REMC. However, differences in diversity are never significant. The largest difference is observed when optimizing the mixed objective function for the large pea dataset, where both Local Search and LR outperform REMC with a relative improvement of about 4% (7.947 versus 7.621). It should be noted that in this case LR takes much more time than the runtime limit imposed on the other methods.

For the large pea dataset in general, both REMC and MSTRAT result in somewhat worse scores than Local Search and LR. Furthermore, simple Local Search is much faster than any other method including REMC, with convergence times below one minute for each single measure and just under 8 minutes (07m43s) for the mixed objective. Although LR reaches the same or very similar scores as Local Search for this large dataset, it is a lot slower, with runtimes of up to several hours. This longer runtime is due to the fact that LR starts with an empty solution and has to perform a fixed number of steps determined by the core size, which depends on the original dataset size and the given sampling intensity. For large datasets and intensities this process becomes slower, and for the evaluation measures used it clearly does not offer any gain in diversity compared with the very fast Local Search. A similar speed issue applies to MSTRAT, as this method evaluates many neighbors in each step, again in relation to the dataset size. Furthermore, MSTRAT sometimes results in lower scores than Local Search, for example for the large pea dataset.

For the smaller datasets, runtimes of Local Search are also often significantly lower than those of the advanced REMC method, which is not surprising since REMC performs computations for several search replicas. It is mainly due to this reduced runtime that Local Search is sometimes able to construct slightly more diverse core sets than REMC within the imposed time limit. Some informal experiments with higher runtime limits showed that in most cases REMC also finds these results when given more time.

Table

| **Algorithm**^{*} | **MR** | **(t)** | **CE** | **(t)** | **SH** | **(t)** | **HE** | **(t)** | **NE** | **(t)** | **Mixed**^{**} | **(t)** |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Bulk maize data set (275)** | | | | | | | | | | | | |
| Local S. | 0.643 | 0.25s | **0.700** | 0.25s | 4.567 | 0.25s | 0.685 | 0.25s | 3.625 | 0.35s | 10.781 | 0.9s |
| MSTRAT | 0.643 | 0.14s | 0.699 | 0.14s | 4.567 | 0.15s | 0.685 | 0.15s | 3.616 | 0.15s | 10.772 | 0.46s |
| LR(2,1) | 0.643 | 0.34s | **0.700** | 0.37s | 4.565 | 0.68s | 0.685 | 0.62s | 3.605 | 0.59s | **10.790** | 2.2s |
| REMC | 0.643 | 0.35s | **0.700** | 0.35s | **4.568** | 0.65s | 0.685 | 0.55s | **3.631** | 3.0s | **10.790** | 7.0s |
| Original | 0.440 | | 0.521 | | 4.399 | | 0.620 | | 2.937 | | | |
| **Accession maize data set (521)** | | | | | | | | | | | | |
| Local S. | **0.723** | 0.45s | 0.781 | 0.35s | **4.724** | 0.95s | 0.701 | 0.35s | 3.880 | 2.0s | 11.210 | 4.0s |
| MSTRAT | 0.722 | 0.26s | 0.781 | 0.27s | 4.723 | 0.34s | 0.701 | 0.32s | 3.874 | 0.32s | 11.200 | 1.1s |
| LR(2,1) | **0.723** | 1.2s | 0.781 | 1.2s | **4.724** | 2.1s | 0.701 | 2.1s | 3.861 | 2.1s | 11.206 | 6.8s |
| REMC | **0.723** | 0.75s | **0.782** | 2.0s | **4.724** | 2.0s | **0.702** | 5.0s | **3.886** | 6.0s | **11.216** | 50.0s |
| Original | 0.630 | | 0.696 | | 4.467 | | 0.591 | | 2.742 | | | |
| **Flax data set (708)** | | | | | | | | | | | | |
| Local S. | 0.533 | 1.0s | 0.533 | 1.0s | 5.358 | 0.60s | 0.278 | 0.69s | 1.504 | 1.2s | **8.965** | 12.8s |
| MSTRAT | 0.533 | 0.66s | 0.533 | 0.67s | 5.358 | 0.73s | 0.278 | 1.1s | 1.504 | 1.1s | 8.960 | 3.5s |
| LR(2,1) | 0.533 | 3.3s | 0.533 | 3.3s | 5.358 | 7.3s | 0.278 | 7.0s | 1.504 | 7.0s | 8.962 | 22.0s |
| REMC | 0.533 | 3.0s | 0.533 | 3.0s | 5.358 | 2.0s | 0.278 | 3.0s | 1.504 | 4.0s | **8.965** | 30.0s |
| Original | 0.468 | | 0.468 | | 5.285 | | 0.222 | | 1.377 | | | |
| **Pea data set (1283)** | | | | | | | | | | | | |
| Local S. | 0.626 | 1.4s | 0.629 | 1.2s | 3.578 | 0.50s | 0.451 | 0.40s | 1.898 | 0.70s | **8.084** | 33.1s |
| MSTRAT | 0.626 | 1.5s | 0.629 | 1.6s | 3.578 | 1.1s | 0.451 | 1.0s | 1.898 | 1.0s | 8.083 | 6.9s |
| LR(2,1) | 0.626 | 3.5s | 0.629 | 3.5s | 3.578 | 4.8s | 0.451 | 4.8s | 1.898 | 4.8s | 8.083 | 18.3s |
| REMC | 0.626 | 10.0s | 0.629 | 7.0s | 3.578 | 2.0s | 0.451 | 3.0s | 1.898 | 2.0s | 8.083 | 40.0s |
| Original | 0.509 | | 0.515 | | 3.482 | | 0.381 | | 1.713 | | | |
| **Large pea data set ^{▾} (4429)** | | | | | | | | | | | | |
| Local S. | **0.635** | 09.4s | **0.637** | 23.9s | **3.518** | 02.4s | **0.496** | 15.1s | **1.983** | 27.5s | **8.158** | 04m17s |
| MSTRAT | **0.635** | 49.6s | **0.637** | 49.6s | **3.518** | 24.5s | **0.496** | 25.4s | **1.983** | 25.5s | **8.158** | 03m44s |
| LR(2,1) | **0.635** | 01m18s | **0.637** | 01m19s | **3.518** | 01m05s | 0.495 | 01m18s | 1.981 | 01m09s | **8.158** | 07m42s |
| REMC | 0.633 | 06m05s | 0.634 | 36.8s | 3.515 | 27.6s | 0.492 | 13.9s | 1.982 | 06m53s | 8.147 | 09m36s |
| Original | 0.464 | | 0.466 | | 3.348 | | 0.352 | | 1.609 | | | |

^{*}For each combination of algorithm, dataset and evaluation measure, 20 independent runs were performed from which averaged results are reported. By default runs were limited by a runtime of 60 seconds, except for the large pea dataset where a runtime limit of 10 minutes was applied. Furthermore the LR method does not accept a runtime limit but continues search until the desired core size has been reached.

^{**}Results shown are those of a pseudo-index containing all seven measures with equal weights.

^{▾}These results were computed on the helios server.

We conclude that for the original Core Hunter evaluation measures, simple methods perform very well. In most of our experiments at least one and often all simple methods were able to construct equally or slightly more diverse core sets than REMC, while their runtimes are often significantly lower, especially those of simple Local Search. Runtimes of both LR and MSTRAT are very sensitive to the dataset and core size, so these methods become slower for large datasets and intensities. At fairly low sampling intensities it is harder for the simple methods to create good core sets, and in this case they are sometimes outperformed by REMC, but the differences are never significant. Simple Local Search is clearly the fastest of all considered methods and often also gives the best results. It should be noted, however, that when comparing the results of the simple methods, it is not always the same method that outperforms all the others.

Minimum distance measures

Figure


**3D Toy example datasets, optimizing mean versus minimum distance.** Core collections sampled from two generated three-dimensional toy example datasets, respectively of size 500 and 1000, the former being completely random, the latter having a very strongly clustered structure. Both datasets contain only one single marker with 3 corresponding alleles. Core selection was performed using the REMC algorithm, optimizing mean (top) and minimum (bottom) MR distances. For the random dataset, the sampling intensity is set to 0.2, while an intensity of 0.05 is used for the larger, clustered set. **(a)** random dataset, mean Modified Rogers’ distance (sampling intensity = 0.2), **(b)** clustered dataset, mean Modified Rogers’ distance (sampling intensity = 0.05), **(c)** random dataset, minimum Modified Rogers’ distance (sampling intensity = 0.2), **(d)** clustered dataset, minimum Modified Rogers’ distance (sampling intensity = 0.05).

When sampling core collections from these datasets with the objective of optimizing mean MR distance (MR), REMC constructed the core sets displayed in Figure

Figures

Performance of REMC using minimum distances

Now we will present results of using these new minimum distance measures when sampling from realistic datasets. Table

| **Optimized →** | **MR** | | | **MRmin** | | | **MixedMR**^{**} | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| **Algorithm**^{*} | **MR** | **(t)** | **MRmin**^{∙} | **MRmin** | **(t)** | **MR**^{∙} | **MixedMR** | **(t)** | **MRmin**^{∘} | **MR**^{∘} |
| **Bulk maize data set (275)** | | | | | | | | | | |
| Local S. | 0.572 | 0.45s | 0.258 | 0.392 | 4.9s | 0.548 | 0.471 | 2.6s | 0.380 | **0.561** |
| MSTRAT | 0.572 | 0.31s | 0.258 | 0.386 | 1.8s | 0.543 | 0.470 | 1.2s | 0.380 | 0.560 |
| LR(2,1) | 0.572 | 0.61s | 0.258 | 0.393 | 1.2s | 0.549 | 0.473 | 1.5s | 0.393 | 0.553 |
| REMC | 0.572 | 1.0s | 0.258 | **0.397** | 35.6s | 0.549 | **0.476** | 23.6s | **0.395** | 0.557 |
| Original | 0.440 | | 0.116 | 0.116 | | 0.440 | | | 0.116 | 0.440 |
| **Accession maize data set (521)** | | | | | | | | | | |
| Local S. | **0.695** | 2.0s | 0.392 | 0.404 | 0.40s | 0.630 | 0.582 | 4.3s | 0.471 | **0.694** |
| MSTRAT | **0.695** | 1.7s | 0.392 | 0.403 | 0.32s | 0.631 | 0.583 | 4.1s | 0.473 | **0.694** |
| LR(2,1) | **0.695** | 2.9s | 0.392 | **0.555** | 4.3s | 0.670 | **0.618** | 5.9s | **0.555** | 0.681 |
| REMC | 0.694 | 4.0s | 0.392 | 0.497 | 56.7s | 0.646 | 0.608 | 51.0s | 0.526 | 0.690 |
| Original | 0.630 | | 0.294 | 0.294 | | 0.630 | | | 0.294 | 0.630 |
| **Flax data set (708)** | | | | | | | | | | |
| Local S. | **0.512** | 2.1s | 0.223 | 0.226 | 0.60s | 0.468 | 0.406 | 6.4s | 0.300 | **0.512** |
| MSTRAT | **0.512** | 5.1s | 0.223 | 0.226 | 1.2s | 0.469 | 0.404 | 12.5s | 0.296 | **0.512** |
| LR(2,1) | **0.512** | 7.4s | 0.223 | **0.377** | 10.6s | 0.494 | **0.443** | 15.7s | **0.386** | 0.499 |
| REMC | 0.511 | 5.0s | 0.213 | 0.315 | 30.9s | 0.475 | 0.422 | 39.5s | 0.337 | 0.508 |
| Original | 0.468 | | 0.000 | 0.000 | | 0.468 | | | 0.000 | 0.468 |
| **Pea data set (1283)** | | | | | | | | | | |
| Local S. | **0.593** | 3.0s | 0.000 | 0.000 | 0.10s | 0.509 | 0.302 | 4.6s | 0.011 | **0.593** |
| MSTRAT | **0.593** | 28.8s | 0.000 | 0.000 | 0.63s | 0.510 | 0.299 | 60.7s | 0.006 | 0.592 |
| LR(2,1) | **0.593** | 34.1s | 0.000 | **0.324** | 50.2s | 0.569 | **0.454** | 01m15s | **0.324** | 0.583 |
| REMC | 0.591 | 50.0s | 0.000 | 0.006 | 36.6s | 0.510 | 0.375 | 60.4s | 0.166 | 0.583 |
| Original | 0.509 | | 0.000 | 0.000 | | 0.509 | | | 0.000 | 0.509 |
| **Large pea data set ^{▾} (4429)** | | | | | | | | | | |
| LR(2,1) | **0.594** | 42m56s | 0.000 | **0.243** | 52m46s | 0.554 | **0.411** | 01h35m | **0.243** | **0.579** |
| REMC | 0.577 | 03m41s | 0.000 | 0.000 | 0.19s | 0.463 | 0.273 | 09m08s | 0.000 | 0.546 |
| Original | 0.464 | | 0.000 | 0.000 | | 0.464 | | | 0.000 | 0.464 |

^{*}For each combination of algorithm, dataset and evaluation measure, 20 independent runs were performed from which averaged results are reported. By default runs were limited by a runtime of 60 seconds, except for the large pea dataset where a runtime limit of 10 minutes was applied. Furthermore the LR method does not accept a runtime limit but continues search until the desired core size has been reached.

^{**}Results shown are those of a pseudo-index containing both minimum and mean MR distance, with equal weight = 0.5.

^{∙}Not used during optimization, but computed afterwards on the constructed core sets.

^{∘}Components of mixed MR measure.

^{▾}These results were computed on the helios server.

The results show that the differences between the diversity scores reported by the different algorithms are generally much larger here than for the original Core Hunter measures discussed before. Local Search and MSTRAT perform worse than LR and REMC in all experiments using either MRmin or the mixed MR objective. The differences are more pronounced for the larger datasets, where it is more difficult to obtain high minimum distances because more accessions have to be selected. In these cases both Local Search and MSTRAT break down when minimum distances are included in the objective. As this effect was already clearly visible in the results for the first four datasets, Local Search and MSTRAT were not used for the large pea dataset experiments. Interestingly, it is not the advanced REMC but LR that leads to the highest minimum distances for the larger datasets. Only for the smallest dataset (bulk maize) does REMC outperform LR, and the differences between their results increase for larger datasets.

When using mean MR alone, sampling from both pea sets leads to a minimum distance of zero for all algorithms, which means that identical^{i} accessions have been selected. In fact, these datasets suffer from the same problem as the toy examples presented before: they contain many accessions but only few alleles in total. The size of the selected core (> 250 for the pea set and > 800 for the large pea set) is much larger than the dimension of the dataset (< 40). Optimizing mean distance alone is not enough to guarantee a high minimum distance, and for these sets some identical accessions were selected in the core. Yet, with the mixed MR objective, the LR method is able to sample cores from the smaller pea dataset with a fairly high minimum MR distance of 0.324 while retaining a mean MR of 0.583, less than 2% lower than the value of 0.593 obtained when optimizing mean MR alone. Even for the large pea dataset, LR reports a fairly large minimum distance of 0.243 together with a decrease of less than 3% in mean distance score. Note that REMC, even when using the mixed objective, reports much lower minimum distances for these pea sets. For the large pea dataset REMC still samples cores with zero minimum distance. These results suggest that much higher minimum distances can be reached, while retaining similar mean distances, by including both measures in the objective function and using a suitable algorithm.
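To make the difference between the objectives concrete, the following minimal Python sketch computes the mean MR, minimum MR and mixed MR score of a small core. It assumes the usual definition of the Modified Rogers distance (Euclidean distance between allele-frequency vectors, scaled by 1/√(2L) for L marker loci) and takes the mixed score as the equally weighted sum of both components, which agrees with the tabulated values (for example 0.5 · 0.324 + 0.5 · 0.583 ≈ 0.454 for LR on the pea dataset). The function names and the toy accessions are illustrative assumptions and are not part of Core Hunter.

```python
from itertools import combinations
from math import sqrt


def modified_rogers(x, y, num_loci):
    """Modified Rogers distance between two flat allele-frequency vectors.

    Assumed definition: sqrt(sum of squared frequency differences / (2 * L)).
    """
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * num_loci))


def mean_and_min_distance(core, num_loci):
    """Mean and minimum pairwise MR distance over all pairs in the core."""
    dists = [modified_rogers(x, y, num_loci) for x, y in combinations(core, 2)]
    return sum(dists) / len(dists), min(dists)


def mixed_mr(core, num_loci, weight=0.5):
    """Pseudo-index: weighted sum of mean and minimum MR (equal weights by default)."""
    mean_d, min_d = mean_and_min_distance(core, num_loci)
    return weight * mean_d + (1 - weight) * min_d


# Toy data (hypothetical): 2 loci with 2 alleles each, one frequency per allele.
acc_a = [1.0, 0.0, 1.0, 0.0]
acc_a_copy = [1.0, 0.0, 1.0, 0.0]   # indistinguishable from acc_a
acc_c = [0.0, 1.0, 0.0, 1.0]
acc_d = [0.5, 0.5, 0.5, 0.5]        # intermediate accession

core_with_duplicate = [acc_a, acc_a_copy, acc_c]
core_spread = [acc_a, acc_d, acc_c]
```

In this toy example both cores have the same mean MR distance, but the core containing the duplicate has a minimum distance of zero, so only the mixed score separates the two.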

For the smaller datasets, the same conclusion holds. Although these datasets do not suffer from the dimensionality problem and already reach acceptable minimum distances with mean MR alone, using the mixed MR objective still results in higher minimum distances while retaining most of the mean score, compared to mean MR alone. Across all experiments, gains in minimum distance range from 20.03% to 73.21% (and in fact infinite relative improvement for the pea datasets), while losses in mean are always smaller than 5.36%. For these datasets it may therefore also be useful to include minimum distances in the objective.
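Relative gains and losses of this kind follow directly from the table entries. The short Python sketch below shows the arithmetic for the LR(2,1) results on the two pea datasets quoted in the text; the helper name `relative_change` is hypothetical, and a zero baseline is reported as an infinite relative improvement.

```python
import math


def relative_change(baseline, value):
    """Change of `value` relative to `baseline`, in percent."""
    if baseline == 0:
        # No finite percentage exists when the baseline is zero.
        return math.inf if value > 0 else 0.0
    return 100.0 * (value - baseline) / baseline


# LR(2,1), pea data set (1283): mean MR alone vs. mixed MR objective.
pea_mean_loss = -relative_change(0.593, 0.583)       # loss in mean MR
pea_min_gain = relative_change(0.000, 0.324)         # gain in minimum MR

# LR(2,1), large pea data set (4429).
large_pea_mean_loss = -relative_change(0.594, 0.579)
```

Both mean losses stay below the 2% and 3% bounds stated above, while the gain in minimum distance is unbounded because the baseline minimum distance is zero.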

Optimizing minimum distance alone, however, would not be a good idea, because it raises two problems that both originate from the fact that many sets have exactly the same minimum MR distance. First, some of these sets may well have higher mean values than others, and we want to favor these. Second, having many possible cores with equal score makes it more difficult for optimization algorithms to find a good solution. This effect can be seen in our results, as the obtained minimum distance values are often higher when optimizing the mixed MR objective than when optimizing minimum distance alone. Minimum distances should therefore be used as additional constraints by including them in the objective without leaving out the original mean distance measures.

Finally, it should again be noted that, although LR seems very well suited to the optimization of minimum distances, it becomes slow for large datasets. This problem is most obvious for large datasets combined with high sampling intensities, which lead to large core sizes, because LR is deterministic and builds the core starting from an empty solution. For the large pea dataset LR requires much more time than the runtime limit applied to REMC. Because of this large difference in runtimes, we experimented with higher runtime limits for REMC, but even with a limit of 2 hours instead of 10 minutes the results of REMC barely improve compared to the results shown in Table

To investigate the results in greater detail we performed a principal component analysis (PCA) of several core sets selected from the large pea dataset and compared the distribution of the distances between these selected accessions and those of the entire collection. Figure


**PCA plots and distance histograms of cores sampled from large pea dataset.** This figure shows both PCA plots and distance histograms of core collections sampled from the large pea dataset, once obtained by optimizing mean MR alone and once by optimizing the mixed MR objective, which includes both mean and minimum MR distance with equal weight. The sampling intensity was set to 0.2 and cores were constructed using the LR method. **(a)** optimizing mean Modified Rogers’ distance – core structure, **(b)** optimizing mixed Modified Rogers’ distance – core structure, **(c)** optimizing mean Modified Rogers’ distance – pairwise distance distribution, **(d)** optimizing mixed Modified Rogers’ distance – pairwise distance distribution.

These plots indicate that the selected core accessions and their corresponding distances are quite similar for the two objective functions, with two important differences. First, the core plots clearly show that several identical accessions are selected when optimizing mean MR alone (green rectangles), while none of these are selected using the mixed distance measure. Second, using mixed MR leads to the selection of more intermediate accessions, while mean MR leaves more of a gap near the center of the space. Similar differences can be observed in the corresponding distance distributions. The histogram in Figure

Table


| **Optimized →** | **MR** | | | **MRmin** | | | **MixedMR**^{**} | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| **Algorithm**^{*} | **MR** | **(t)** | **MRmin**^{∙} | **MRmin** | **(t)** | **MR**^{∙} | **MixedMR** | **(t)** | **MRmin**^{∘} | **MR**^{∘} |
| **Bulk maize data set (275)** | | | | | | | | | | |
| Local S. | 0.643 | 0.25s | 0.438 | 0.529 | 0.60s | 0.613 | 0.578 | 0.50s | 0.516 | 0.641 |
| MSTRAT | 0.643 | 0.14s | 0.353 | 0.522 | 0.42s | 0.608 | 0.578 | 0.31s | 0.513 | **0.643** |
| LR(2,1) | 0.643 | 0.34s | 0.513 | 0.534 | 0.52s | 0.622 | 0.576 | 0.72s | 0.523 | 0.628 |
| REMC | 0.643 | 0.35s | 0.513 | **0.539** | 1.8s | 0.615 | **0.582** | 4.7s | **0.534** | 0.629 |
| Original | 0.440 | | 0.116 | 0.116 | | 0.440 | | | 0.116 | 0.440 |
| **Accession maize data set (521)** | | | | | | | | | | |
| Local S. | **0.723** | 0.45s | 0.511 | 0.495 | 0.10s | 0.634 | 0.663 | 1.5s | 0.607 | 0.719 |
| MSTRAT | 0.722 | 0.26s | 0.476 | 0.490 | 0.17s | 0.635 | 0.651 | 0.59s | 0.581 | **0.721** |
| LR(2,1) | **0.723** | 1.2s | 0.510 | 0.620 | 1.8s | 0.699 | 0.674 | 2.3s | 0.635 | 0.712 |
| REMC | **0.723** | 0.75s | 0.519 | **0.630** | 58.7s | 0.700 | **0.678** | 56.0s | **0.638** | 0.717 |
| Original | 0.630 | | 0.294 | 0.294 | | 0.630 | | | 0.294 | 0.630 |
| **Flax data set (708)** | | | | | | | | | | |
| Local S. | 0.533 | 1.0s | 0.337 | 0.309 | 0.20s | 0.470 | 0.475 | 2.6s | 0.418 | **0.532** |
| MSTRAT | 0.533 | 0.66s | 0.341 | 0.320 | 0.36s | 0.470 | 0.468 | 1.6s | 0.405 | **0.532** |
| LR(2,1) | 0.533 | 3.3s | 0.357 | **0.446** | 4.1s | 0.515 | 0.481 | 6.2s | **0.446** | 0.517 |
| REMC | 0.533 | 3.0s | 0.337 | 0.429 | 44.3s | 0.505 | **0.487** | 20.3s | **0.446** | 0.529 |
| Original | 0.468 | | 0.000 | 0.000 | | 0.468 | | | 0.000 | 0.468 |
| **Pea data set (1283)** | | | | | | | | | | |
| Local S. | 0.626 | 1.4s | 0.200 | 0.122 | 0.10s | 0.510 | 0.481 | 2.7s | 0.338 | 0.624 |
| MSTRAT | 0.626 | 1.5s | 0.209 | 0.104 | 0.25s | 0.510 | 0.454 | 4.3s | 0.282 | **0.625** |
| LR(2,1) | 0.626 | 3.5s | 0.229 | **0.429** | 5.3s | 0.595 | **0.520** | 7.7s | **0.429** | 0.611 |
| REMC | 0.626 | 10.0s | 0.246 | 0.328 | 60.5s | 0.552 | 0.510 | 26.1s | 0.397 | 0.622 |
| Original | 0.509 | | 0.000 | 0.000 | | 0.509 | | | 0.000 | 0.509 |
| **Large pea data set ^{▾} (4429)** | | | | | | | | | | |
| LR(2,1) | **0.635** | 01m18s | 0.000 | **0.343** | 01m55s | 0.597 | **0.488** | 03m04s | **0.364** | 0.611 |
| REMC | 0.633 | 06m05s | 0.000 | 0.000 | 0.13s | 0.462 | 0.313 | 02m03s | 0.000 | **0.626** |
| Original | 0.464 | | 0.000 | 0.000 | | 0.464 | | | 0.000 | 0.464 |

^{*}For each combination of algorithm, dataset and evaluation measure, 20 independent runs were performed from which averaged results are reported. By default runs were limited by a runtime of 60 seconds, except for the large pea dataset where a runtime limit of 10 minutes was applied. Furthermore the LR method does not accept a runtime limit but continues search until the desired core size has been reached.

^{**}Results shown are those of a pseudo-index containing both minimum and mean MR distance, with equal weight = 0.5.

^{∙}Not used during optimization, but computed afterwards on the constructed core sets.

^{∘}Components of mixed MR measure.

^{▾}These results were computed on the helios server.

We conclude that Local Search is no longer the most promising method when minimum distances are included in the objective. As minimum distances are much more sensitive to the exact composition of the core than mean distances, better methods are needed in this case. The LR method proved to be very well suited to these more difficult problems and was often much better than the advanced REMC method. LR can become quite slow, especially for large datasets and high sampling intensities, but depending on the application the substantial increase in minimum distance may be worth the extra runtime. However, in cases where a high minimum distance is not required, the improvement of LR over Local Search does not warrant the extra runtime required by LR.
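The LR(2,1) scheme referred to above can be sketched as follows. This is an illustration under stated assumptions rather than the Core Hunter implementation: it follows the general LR idea of repeatedly adding the two best accessions and then removing the worst one, starts from the exhaustively selected best pair as described in endnote b, and uses a toy one-dimensional dataset with an equal-weight mixed score. The names `lr21_search` and `mixed_score` are hypothetical.

```python
from itertools import combinations


def mixed_score(core, weight=0.5):
    """Equal-weight mix of mean and minimum pairwise distance (toy 1-D data)."""
    dists = [abs(a - b) for a, b in combinations(core, 2)]
    return weight * sum(dists) / len(dists) + (1 - weight) * min(dists)


def lr21_search(items, core_size, score):
    """Greedy LR(2,1) selection: add the 2 best items, then drop the worst one."""
    if not 2 <= core_size < len(items):
        raise ValueError("need 2 <= core_size < len(items)")

    def pick(indices):
        return [items[i] for i in indices]

    # Exhaustive search for the best starting pair (distances need two items).
    selected = list(max(combinations(range(len(items)), 2),
                        key=lambda pair: score(pick(pair))))

    while len(selected) < core_size:
        # L = 2: add, one at a time, the item that yields the best score.
        for _ in range(2):
            rest = [i for i in range(len(items)) if i not in selected]
            selected.append(max(rest, key=lambda i: score(pick(selected + [i]))))
        # R = 1: remove the item whose removal leaves the best-scoring core.
        worst = max(selected,
                    key=lambda i: score(pick([j for j in selected if j != i])))
        selected.remove(worst)
    return pick(selected)


# Toy dataset (hypothetical): positions on a line stand in for accessions.
toy_items = [0, 1, 5, 9, 10]
toy_core = lr21_search(toy_items, 3, mixed_score)
```

Every addition and removal evaluates all remaining candidates against the current core, which gives an intuition for why the runtime of LR grows quickly with dataset size and core size.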

Mixed Replica Search motivation

Based on the results from the previous subsections we can now give further motivation for the specific composition of our new Mixed Replica search (MixRep) algorithm. We showed that the simple methods performed very well in many experiments, but it was not always the same method that was the most promising, and each of the simple methods has its drawbacks. Local Search and MSTRAT clearly cannot cope with minimum distance measures, and both MSTRAT and LR Search become slower when run on relatively large datasets. Including several methods in one robust algorithm avoids the need to select the most suitable method. As Local Search is the fastest method and LR is better when minimum distances are included in the objective, we decided to use both of these methods in the initial search phase. However, the results showed that in some cases the advanced REMC slightly outperformed the other methods in terms of diversity scores. To benefit from the advantages of both the simple methods and REMC we used a Mixed Replica approach, which contains LR and Local Search replicas at the start. Additional advanced search engines (MC & Tabu) are then included in later stages of the search to find better scores that are not obtainable by the simpler methods, if such scores are possible in the dataset. In this way our method is able to tackle different problems efficiently, with fast computation on simple problems and yet very good results in more difficult settings, if additional runtime is available.
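The idea of heterogeneous replicas can be illustrated with a deliberately simplified Python sketch. It is not the MixRep algorithm itself: it contains only a Local Search replica at the start and adds a single Monte Carlo (MC) replica in a later round, omitting the LR and Tabu replicas and any replica management. All names, the round at which the MC replica is added, and the temperature range are assumptions chosen for toy data; the MC temperature is drawn uniformly between a minimum and a maximum value, which is one possible reading of endnote d.

```python
import math
import random
from itertools import combinations


def mixed_score(values, weight=0.5):
    """Equal-weight mix of mean and minimum pairwise distance (toy 1-D data)."""
    dists = [abs(a - b) for a, b in combinations(values, 2)]
    return weight * sum(dists) / len(dists) + (1 - weight) * min(dists)


def random_swap(core, n, rng):
    """Neighbour of `core`: swap one selected index for one unselected index."""
    removed = rng.choice(sorted(core))
    added = rng.choice([i for i in range(n) if i not in core])
    return (core - {removed}) | {added}


def local_step(core, score, n, rng, temp):
    """Local Search replica: accept a swap only if it improves the score."""
    cand = random_swap(core, n, rng)
    return cand if score(cand) > score(core) else core


def mc_step(core, score, n, rng, temp):
    """MC replica: also accept worse swaps with Metropolis probability."""
    cand = random_swap(core, n, rng)
    delta = score(cand) - score(core)
    if delta >= 0 or rng.random() < math.exp(delta / temp):
        return cand
    return core


def mixed_replica_search(n, k, score, rounds=30, steps=20, mc_from_round=10,
                         t_min=0.05, t_max=0.5, seed=0):
    """Run heterogeneous replicas in rounds and return the best core found."""
    rng = random.Random(seed)
    best = frozenset(rng.sample(range(n), k))
    replicas = [(local_step, None, best)]  # (step function, temperature, core)
    for rnd in range(rounds):
        if rnd == mc_from_round:
            # Later stage: add an MC replica that starts from the best core.
            replicas.append((mc_step, rng.uniform(t_min, t_max), best))
        for idx, (step, temp, core) in enumerate(replicas):
            for _ in range(steps):
                core = step(core, score, n, rng, temp)
            replicas[idx] = (step, temp, core)
            if score(core) > score(best):
                best = core
    return best
```

Because the best core is only replaced by a strictly better one, the returned score can never fall below that of the random starting core, regardless of which replica found the improvement.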

Performance of Mixed Replica Search

Now we will present results for our new robust Mixed Replica search and compare these with the results of the original REMC Core Hunter algorithm. Table

| **Algorithm**^{*} | **MR** | **(t)** | **CE** | **(t)** | **SH** | **(t)** | **HE** | **(t)** | **NE** | **(t)** | **Mixed**^{**} | **(t)** |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Bulk maize data set (275)** | | | | | | | | | | | | |
| REMC | 0.572 | 1.0s | 0.641 | 2.0s | 4.531 | 2.0s | 0.667 | 1.0s | 3.446 | 3.0s | 10.680 | 15.0s |
| MixRep | 0.572 | 0.45s | 0.641 | 0.46s | 4.531 | 0.49s | 0.667 | 0.50s | 3.446 | 0.59s | 10.680 | 2.2s |
| Original | 0.440 | | 0.521 | | 4.399 | | 0.620 | | 2.937 | | | |
| **Accession maize data set (521)** | | | | | | | | | | | | |
| REMC | 0.694 | 4.0s | 0.752 | 4.0s | 4.670 | 5.0s | 0.676 | 3.0s | 3.502 | 20.0s | 11.086 | 50.1s |
| MixRep | **0.695** | 1.1s | 0.752 | 0.68s | 4.670 | 1.1s | 0.676 | 0.67s | 3.502 | 3.8s | **11.087** | 17.1s |
| Original | 0.630 | | 0.696 | | 4.467 | | 0.591 | | 2.742 | | | |
| **Flax data set (708)** | | | | | | | | | | | | |
| REMC | 0.511 | 5.0s | 0.511 | 4.0s | 5.340 | 30.0s | 0.262 | 4.0s | 1.469 | 30.0s | 8.874 | 60.4s |
| MixRep | **0.512** | 1.6s | **0.512** | 1.7s | 5.340 | 0.80s | **0.263** | 0.83s | 1.469 | 1.6s | **8.878** | 13.0s |
| Original | 0.468 | | 0.468 | | 5.285 | | 0.222 | | 1.377 | | | |
| **Pea data set (1283)** | | | | | | | | | | | | |
| REMC | 0.591 | 50.0s | 0.595 | 30.0s | 3.553 | 7.0s | 0.437 | 15.0s | 1.865 | 15.0s | 7.876 | 61.2s |
| MixRep | **0.593** | 3.2s | **0.597** | 3.1s | **3.556** | 1.5s | **0.440** | 1.4s | **1.867** | 7.7s | **7.945** | 37.6s |
| Original | 0.509 | | 0.515 | | 3.482 | | 0.381 | | 1.713 | | | |
| **Large pea data set ^{▾} (4429)** | | | | | | | | | | | | |
| REMC | 0.577 | 03m41s | 0.580 | 08m49s | 3.470 | 08m37s | 0.448 | 04m29s | 1.875 | 05m22s | 7.621 | 10m03s |
| MixRep | **0.594** | 01m18s | **0.596** | 53.5s | **3.486** | 39.6s | **0.465** | 36.3s | **1.886** | 47.3s | **7.811** | 10m21s |
| Original | 0.464 | | 0.466 | | 3.348 | | 0.352 | | 1.609 | | | |

^{*}For each combination of algorithm, dataset and evaluation measure, 20 independent runs were performed from which averaged results are reported. By default runs were limited by a runtime of 60 seconds, except for the large pea dataset where a runtime limit of 10 minutes was applied.

^{**}Results shown are those of a pseudo-index containing all seven measures with equal weights.

^{▾}These results were computed on the helios server.

Results for the same experiments, but now with a sampling intensity of only 5%, are reported in Table

| **Algorithm**^{*} | **MR** | **(t)** | **CE** | **(t)** | **SH** | **(t)** | **HE** | **(t)** | **NE** | **(t)** | **Mixed**^{**} | **(t)** |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Bulk maize data set (275)** | | | | | | | | | | | | |
| REMC | 0.643 | 0.35s | 0.700 | 0.35s | 4.568 | 0.65s | 0.685 | 0.55s | 3.631 | 3.0s | 10.790 | 7.0s |
| MixRep | 0.643 | 0.19s | 0.700 | 0.29s | 4.568 | 0.50s | 0.685 | 0.42s | 3.631 | 2.0s | 10.790 | 2.0s |
| Original | 0.440 | | 0.521 | | 4.399 | | 0.620 | | 2.937 | | | |
| **Accession maize data set (521)** | | | | | | | | | | | | |
| REMC | 0.723 | 0.75s | 0.782 | 2.0s | 4.724 | 2.0s | 0.702 | 5.0s | 3.886 | 6.0s | 11.216 | 50.0s |
| MixRep | 0.723 | 0.38s | 0.782 | 0.68s | 4.724 | 0.83s | 0.702 | 1.7s | **3.887** | 3.5s | 11.216 | 10.2s |
| Original | 0.630 | | 0.696 | | 4.467 | | 0.591 | | 2.742 | | | |
| **Flax data set (708)** | | | | | | | | | | | | |
| REMC | 0.533 | 3.0s | 0.533 | 3.0s | 5.358 | 2.0s | 0.278 | 3.0s | 1.504 | 4.0s | 8.965 | 30.0s |
| MixRep | 0.533 | 0.81s | 0.533 | 0.82s | 5.358 | 0.76s | 0.278 | 0.97s | 1.504 | 1.5s | 8.965 | 11.2s |
| Original | 0.468 | | 0.468 | | 5.285 | | 0.222 | | 1.377 | | | |
| **Pea data set (1283)** | | | | | | | | | | | | |
| REMC | 0.626 | 10.0s | 0.629 | 7.0s | 3.578 | 2.0s | 0.451 | 3.0s | 1.898 | 2.0s | 8.083 | 40.0s |
| MixRep | 0.626 | 1.4s | 0.629 | 1.2s | 3.578 | 0.62s | 0.451 | 0.55s | 1.898 | 0.95s | **8.084** | 11.0s |
| Original | 0.509 | | 0.515 | | 3.482 | | 0.381 | | 1.713 | | | |
| **Large pea data set ^{▾} (4429)** | | | | | | | | | | | | |
| REMC | 0.633 | 06m05s | 0.634 | 36.8s | 3.515 | 27.6s | 0.492 | 13.9s | 1.982 | 06m53s | 8.147 | 09m36s |
| MixRep | **0.635** | 7.5s | **0.637** | 13.0s | **3.518** | 2.1s | **0.496** | 8.9s | **1.983** | 9.6s | **8.158** | 02m15s |
| Original | 0.464 | | 0.466 | | 3.348 | | 0.352 | | 1.609 | | | |

^{*}For each combination of algorithm, dataset and evaluation measure, 20 independent runs were performed from which averaged results are reported. By default runs were limited by a runtime of 60 seconds, except for the large pea dataset where a runtime limit of 10 minutes was applied.

^{**}Results shown are those of a pseudo-index containing all seven measures with equal weights.

^{▾}These results were computed on the helios server.

Our previous results showed that including the new minimum distance measures in the objective function, together with the original mean distances, can lead to cores with higher minimum distance while retaining similar means. Table

| **Optimized →** | **MixedMR**^{**} **(int=0.2)** | | | | **MixedMR**^{**} **(int=0.05)** | | | |
|---|---|---|---|---|---|---|---|---|
| **Algorithm**^{*} | **MixedMR** | **(t)** | **MRmin**^{∘} | **MR**^{∘} | **MixedMR** | **(t)** | **MRmin**^{∘} | **MR**^{∘} |
| **Bulk maize data set (275)** | | | | | | | | |
| REMC | **0.476** | 23.6s | 0.395 | 0.557 | 0.582 | 4.7s | 0.534 | 0.629 |
| MixRep | 0.475 | 23.9s | 0.393 | 0.557 | 0.582 | 5.8s | 0.534 | 0.630 |
| Original | | | 0.116 | 0.440 | | | 0.116 | 0.440 |
| **Accession maize data set (521)** | | | | | | | | |
| REMC | 0.608 | 51.0s | 0.526 | 0.690 | **0.678** | 56.0s | 0.638 | 0.717 |
| MixRep | **0.618** | 6.9s | 0.555 | 0.682 | 0.676 | 31.5s | 0.635 | 0.717 |
| Original | | | 0.294 | 0.630 | | | 0.294 | 0.630 |
| **Flax data set (708)** | | | | | | | | |
| REMC | 0.422 | 39.5s | 0.337 | 0.508 | **0.487** | 20.3s | 0.446 | 0.529 |
| MixRep | **0.440** | 19.0s | 0.378 | 0.502 | 0.486 | 44.5s | 0.445 | 0.527 |
| Original | | | 0.000 | 0.468 | | | 0.000 | 0.468 |
| **Pea data set (1283)** | | | | | | | | |
| REMC | 0.396 | 02m27s | 0.209 | 0.583 | 0.510 | 26.1s | 0.397 | 0.622 |
| MixRep | **0.454** | 02m01s | 0.324 | 0.583 | **0.520** | 6.6s | 0.429 | 0.612 |
| Original | | | 0.000 | 0.509 | | | 0.000 | 0.509 |
| **Large pea data set ^{▾} (4429)** | | | | | | | | |
| REMC | 0.278 | 40m29s | 0.000 | 0.556 | 0.313 | 02m03s | 0.000 | 0.626 |
| MixRep | **0.405** | 01h40m | 0.230 | 0.580 | **0.487** | 02m48s | 0.361 | 0.612 |
| Original | | | 0.000 | 0.464 | | | 0.000 | 0.464 |

^{*}For each combination of algorithm, dataset and evaluation measure, 20 independent runs were performed from which averaged results are reported. By default runs were limited by a runtime of 60 seconds, with some exceptions. For the small pea dataset with an intensity of 20%, a runtime limit of 150 seconds was applied. For the large pea dataset runtime limits were set to 10 minutes for the 5% intensity and 2 hours for the 20% intensity.

^{**}Results shown are those of a pseudo-index containing both minimum and mean MR distance, with equal weight = 0.5.

^{∘}Components of mixed MR measure.

^{▾}These results were computed on the helios server.

In case of a lower sampling intensity of only 5%, results of both methods are similar in most cases. Only for the pea datasets does MixRep lead to higher scores than REMC with a significant relative improvement in mixed score (> 55%) for the large pea set, caused by a large increase of the minimum component while again retaining most of the mean component. It is interesting to note that for this large pea dataset, REMC samples cores with zero minimum distance in all experiments, including some identical accessions in the core, while MixRep always leads to non-zero minimum distances. Similar results for a mixed measure containing both minimum and mean CE are reported in (see Additional file

We conclude that, when aiming at high minimum distances, the results of MixRep are very similar to those of the LR method and therefore often significantly better than those of all other methods (not only REMC, but also MSTRAT and Local Search, as shown before). The new MixRep algorithm is often able to sample cores with much higher minimum distances than REMC, especially for those datasets where it is more difficult to reach high minimum values, e.g. datasets with large size and/or low dimension. For these larger datasets MixRep is slower than REMC, but the gains in minimum distance are very high.

Conclusions

Our results show that, when aiming at core subsets in which all accessions are sufficiently distant from each other, including minimum distance measures in the objective function, in combination with the original mean distance measures, improves performance. This additional measure often leads to cores with significantly higher minimum distances while retaining very similar mean distance scores compared to optimizing mean distance alone.

With Core Hunter II we have introduced a new advanced search algorithm, Mixed Replica search (MixRep), which uses heterogeneous replicas, an approach inspired by the results of a comparison of several algorithms. We showed that this new method improves on the results of the original REMC algorithm in two different ways. First, when optimizing the original Core Hunter evaluation measures (MR, CE, SH, HE, NE or a mixed measure), the new MixRep algorithm samples cores with equal or slightly higher diversity scores than REMC, while being much faster. Second, when minimum distances (MRmin or CEmin) are included in the objective to avoid selection of identical or very similar accessions inside the core, using MixRep instead of the original REMC often leads to significantly higher minimum distance scores. This effect is most obvious for large collections with relatively low dimension when sampling with fairly large intensities. For these large datasets it does take more time to reach high minimum distances, so it is important to apply higher runtime limits to achieve this goal. However, an advantage of the MixRep algorithm is that, when minimum distances are not important, one can simply apply lower runtime limits and the same algorithm will quickly sample very good cores in terms of the remaining evaluation measures.

Future work on new versions of Core Hunter includes adding support for phenotypic variables, as only genetic marker data are supported for now. Furthermore, Core Hunter is currently freely available, but only as a command-line tool, so development of a rich graphical user interface has already begun to provide user-friendly access to this core selection tool. Finally, it might also be interesting to try to further improve results by plugging new search engines into the MixRep replicas. For example, the current LR replica is quite slow for large core sizes, although including this replica does lead to significantly better results in terms of minimum distance scores. It may be useful to look for faster search replicas that share this property, to speed up the MixRep algorithm when aiming at high minimum distances.

Endnotes

^{a}Optimal only in theory for large datasets, where evaluating all possible subsets is computationally infeasible. In practice we turn to heuristic methods that cannot guarantee an optimal solution, to keep the search process feasible.

^{b}Because our distance measures cannot be computed on sets containing fewer than two accessions, we have slightly modified this approach by exhaustively selecting the best first pair and then proceeding with the LR scheme. Selecting two accessions by exhaustive search is still computationally feasible.

^{c}These same stop criteria are available for all randomized heuristics introduced in this paper.

^{d}Temperatures of newly created MC replicas are chosen randomly between a given minimum and maximum temperature. By default these are set to 50.0 and 100.0 respectively, and if desired the user can specify other minimum and maximum values using the advanced search options.

^{e}Modifications of this kind are frequently applied when using tabu search in practice, to avoid excessive memory usage and computation time.

^{f}As noted before, distance measures treat each allele as one dimension, so the dimension of a dataset is defined as the total number of alleles over all marker loci.

^{g}The actual runtimes might slightly exceed this limit, as the elapsed runtime is only checked after each search round and some algorithms implement quite intensive search rounds performing several search steps, possibly for several replicas.

^{h}Note that these convergence times are bounded by the runtime limit and it is always possible that further improvements would have been made beyond this limit.

^{i}By identical we mean that the accessions cannot be distinguished from one another using the available data. The accessions have the same alleles for all available markers used within the dataset.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

HDB proposed the algorithm, implemented it, and performed all new experiments under the supervision of VF. PS provided data, advice on experiments and biological interpretation of the results. GD provided advice on algorithm development. HDB wrote the initial manuscript with all authors contributing to the final version. All authors read and approved the final manuscript.

Acknowledgements

PS acknowledges funding from Bioversity International AEGIS LOA10/048 and the Ministry of Education, Youth and Sports of the Czech Republic (projects MSM 6198959215 and 2678424601). Data contributing to the pea dataset was provided by Noel Ellis from Aberystwyth University, UK and R. Redden from ATFC, Australia. Chris Thachuk, University of British Columbia, Canada kindly provided the source code for the original Core Hunter software.