<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2156-9-35</ui>
   <ji>1471-2156</ji>
   <fm>
      <dochead>Research article</dochead>
      <bibl>
         <title>
            <p>Precision-mapping and statistical validation of quantitative trait loci by machine learning</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Bedo</snm>
               <fnm>Justin</fnm>
               <insr iid="I1"/>
               <insr iid="I3"/>
               <email>bedo@ieee.org</email>
            </au>
            <au id="A2">
               <snm>Wenzl</snm>
               <fnm>Peter</fnm>
               <insr iid="I2"/>
               <email>peter@DiversityArrays.com</email>
            </au>
            <au id="A3">
               <snm>Kowalczyk</snm>
               <fnm>Adam</fnm>
               <insr iid="I1"/>
               <email>adam.kowalczyk@nicta.com.au</email>
            </au>
            <au id="A4" ca="yes">
               <snm>Kilian</snm>
               <fnm>Andrzej</fnm>
               <insr iid="I2"/>
               <email>andrzej@DiversityArrays.com</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Life Sciences, NICTA and Department of Electrical and Electronic Engineering, The University of Melbourne, Parkville, Victoria 3010, Australia</p>
            </ins>
            <ins id="I2">
               <p>Diversity Arrays P/L, 1 Wilf Crane Cr. (Yarralumla), Canberra, ACT 2600, Australia</p>
            </ins>
            <ins id="I3">
               <p>The Research School of Information Sciences and Engineering, The Australian National University, Canberra, Australia</p>
            </ins>
         </insg>
         <source>BMC Genetics</source>
         <issn>1471-2156</issn>
         <pubdate>2008</pubdate>
         <volume>9</volume>
         <issue>1</issue>
         <fpage>35</fpage>
         <url>http://www.biomedcentral.com/1471-2156/9/35</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">18452626</pubid>
               <pubid idtype="doi">10.1186/1471-2156-9-35</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>21</day>
               <month>10</month>
               <year>2007</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>02</day>
               <month>5</month>
               <year>2008</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>02</day>
               <month>5</month>
               <year>2008</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2008</year>
         <collab>Bedo et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>We introduce a QTL-mapping algorithm based on Statistical Machine Learning (SML) that is conceptually quite different to existing methods as there is a strong focus on generalisation ability. Our approach combines ridge regression, recursive feature elimination, and estimation of generalisation performance and marker effects using bootstrap resampling. Model performance and marker effects are determined using independent testing samples (individuals), thus providing better estimates. We compare the performance of SML against Composite Interval Mapping (CIM), Bayesian Interval Mapping (BIM) and single Marker Regression (MR) on synthetic datasets and a multi-trait and multi-environment dataset of the progeny for a cross between two barley cultivars.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>In an analysis of the synthetic datasets, SML accurately predicted the number of QTL underlying a trait while BIM tended to underestimate the number of QTL. The QTL identified by SML for the barley dataset broadly coincided with known QTL locations. SML reported approximately half of the QTL reported by either CIM or MR, which is not unexpected given that neither CIM nor MR incorporates independent testing. The lack of independent testing makes these two methods susceptible to producing overly optimistic estimates of QTL effects, as we demonstrate for MR. The QTL resolution (peak definition) afforded by SML was consistently superior to MR, CIM and BIM, with QTL detection power similar to BIM. The precision of SML was underscored by repeatedly identifying, at &#8804; 1-cM precision, three QTL for four partially related traits (heading date, plant height, lodging and yield). The sets of QTL obtained using a 'raw' and a 'curated' version of the same genotypic dataset were more similar to each other for SML than for CIM or MR.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>The SML algorithm produces better estimates of QTL effects because it eliminates the optimistic bias in the predictive performance of other QTL methods. It produces narrower peaks than other methods (except BIM) and hence identifies QTL with greater precision. It is more robust to genotyping and linkage mapping errors, and identifies markers linked to QTL in the absence of a genetic map.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>The notion that DNA polymorphism explains the phenotypic diversity of living organisms has been the driving force behind the Human Genome Project and widespread investment in plant and animal genomics. Over the last 30 years, many examples of causal effects on phenotypes arising from DNA sequence variation have been reported. Finding associations between DNA variation and phenotypes is straightforward for 'simple' traits that are inherited in a Mendelian fashion as monogenic characters. Yet, most of the economically important phenotypic variation (e.g. crop yield and its components) is inherited through a number of Quantitative Trait Loci (QTL) with different magnitudes of effect and complex interactions among themselves and with the environment <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>.</p>
         <p>QTL can be identified through their genetic linkage with molecular markers. In a typical experiment, the progeny of an experimental population are simultaneously analysed for their genetic makeup (molecular markers) and one or more phenotypic traits of interest. The marker data are used to build a genetic map, which is a pre-requisite for the majority of QTL-detection methods <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr></abbrgrp>. The simplest method to identify markers linked to QTL is single Marker Regression (MR), which fits a linear model to each marker using the trait data. Simple Interval Mapping (SIM) disentangles QTL effects from the confounding effect of linkage distance between markers and QTL by regressing phenotypic data on the genotypic information for marker intervals rather than the markers themselves <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. QTL are detected by 'stepping' through the whole genome to generate a profile of the proportion of phenotypic variance explained or the logarithm-of-odds ratio (LOD score) in favour of a QTL.</p>
         <p>The Composite Interval Mapping (CIM) approach refines the SIM algorithm by incorporating background markers as cofactors into a multiple regression model <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. In this way, variation due to other QTL can be partly accounted for. The CIM approach was further extended by using multiple marker intervals to fit multi-QTL models to the trait data and selecting the 'best' model with a stepwise forward and backward selection procedure (Multiple Interval Mapping; MIM) <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. Other methods such as Bayesian Interval Mapping (BIM) <abbrgrp><abbr bid="B7">7</abbr></abbrgrp> apply Bayesian inference over the whole genome using priors designed to produce sparse models.</p>
         <p>Here we explore a conceptually quite different QTL-mapping approach that focuses on generalisation ability. The approach is based on Statistical Machine Learning (SML) and differs from other methods in that it estimates the generalisation performance of a QTL model by splitting the data into independent training and testing subsets that are used for model induction and evaluation, respectively (Figure <figr fid="F1">1</figr>). Resampling data into training and testing subsets is quite common in microarray analyses, particularly in the context of cancer genomics <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>.</p>
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>System dataflow diagram</p>
            </caption>
            <text>
               <p><b>System dataflow diagram</b>. Dataflow diagram (DFD) depicting the QTL analysis. Rectangles with round corners indicate processes, other rectangles indicate data stores, and lines indicate data flow. The left figure shows the top-level DFD, the right shows further detail of the 'SML analysis' process.</p>
            </text>
            <graphic file="1471-2156-9-35-1"/>
         </fig>
         <p>Our QTL detection method determines the contribution of each marker to the model performance during the recursive feature elimination (RFE) procedure. First, a linear model containing every marker is fitted to the phenotype. The model is then reduced in size by recursively eliminating the least useful markers and refitting the model until only a single marker is left, a procedure similar to recursive feature elimination with support vector machines <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr></abbrgrp>. We assign the <it>change in variance explained</it> after each elimination (measured on the test set) to the marker that was removed. The entire process is then repeated numerous times to derive an unbiased bootstrap estimate of the predictive power of each marker. To generate a QTL profile across the genome, the contributions of genetically linked markers within a sliding map window are summed.</p>
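         <p>The procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this study: the elimination criterion (smallest absolute ridge weight), the ridge penalty <it>lam</it>, the crediting of the last remaining marker with the residual variance explained, and all function names are hypothetical choices made for the example. Markers are assumed to be coded numerically and the data to be roughly centred.</p>
         <p><![CDATA[
```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge weights (no intercept; data assumed roughly centred)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def variance_explained(y_true, y_pred):
    """Fraction of variance explained: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rfe_contributions(X_train, y_train, X_test, y_test, lam=1.0):
    """One RFE pass: credit each eliminated marker with the change in
    test-set variance explained caused by its removal."""
    active = list(range(X_train.shape[1]))
    contrib = np.zeros(X_train.shape[1])
    w = ridge_fit(X_train[:, active], y_train, lam)
    score = variance_explained(y_test, X_test[:, active] @ w)
    while len(active) > 1:
        # Assumption: "least useful" means smallest absolute weight.
        marker = active.pop(int(np.argmin(np.abs(w))))
        w = ridge_fit(X_train[:, active], y_train, lam)
        new_score = variance_explained(y_test, X_test[:, active] @ w)
        contrib[marker] = score - new_score
        score = new_score
    # Assumption: the last marker is credited with what remains.
    last = active[0]
    contrib[last] = score
    return contrib

def bootstrap_contributions(X, y, n_boot=50, lam=1.0, seed=0):
    """Average RFE contributions over bootstrap resamples, testing each
    model on the individuals left out of the resample."""
    rng = np.random.default_rng(seed)
    n = len(y)
    total = np.zeros(X.shape[1])
    for _ in range(n_boot):
        train = rng.integers(0, n, size=n)        # sample with replacement
        test = np.setdiff1d(np.arange(n), train)  # withheld individuals
        total += rfe_contributions(X[train], y[train], X[test], y[test], lam)
    return total / n_boot
```
]]></p>
         <p>In this sketch the per-marker contributions from a single pass sum to the test-set variance explained by the whole-genome model, which is one way to make the contributions additive within a sliding map window.</p>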
         <p>We compare the performance of the SML algorithm with the performance of two conventional QTL-mapping methods (MR, CIM) and the more recently developed BIM. For this purpose, we re-analyse a well-known multi-trait and multi-environment dataset for a population of doubled haploid (DH) lines derived from the F<sub>1</sub> of a cross between cultivars Steptoe and Morex, and study some synthetic datasets.</p>
      </sec>
      <sec>
         <st>
            <p>Results and Discussion</p>
         </st>
         <sec>
            <st>
               <p>Treatment of multi-environment data</p>
            </st>
            <p>In QTL mapping, we are primarily interested in quantifying the influence of genotypic variation on phenotypes. In practice, this is confounded by environmental variation to differing extents depending on the trait. In this paper, we limit our approach to mapping the genotypic component of the traits. The interaction between QTL and environments (QTL &#215; E), an important element influencing phenotypic variation of many quantitative characters, will be addressed in a separate paper.</p>
            <p>In order to precisely measure the genotypic component we use data collected on genetically identical Steptoe/Morex DH lines grown in multiple environments. We standardise the phenotypes within each environment to a mean of 0 and a standard deviation of 1, and then calculate the mean (per phenotype and genotype) across all environments. The scaling within environments aligns the distributions, and the averaging provides an estimate of the common underlying signal. The resulting increase in QTL detection power for a whole-genome SML model based on 548 markers is demonstrated in Figure <figr fid="F2">2</figr>; incorporating information from multiple environments provides an increase in the variance explained for all traits.</p>
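            <p>The pre-processing step described above amounts to a per-environment z-score followed by a per-line average. A minimal sketch follows; the array layout (lines in rows, environments in columns), the use of NaN for missing observations, and the sample standard deviation (<it>ddof</it> = 1) are assumptions made for the example.</p>
            <p><![CDATA[
```python
import numpy as np

def average_scaled_phenotypes(pheno):
    """pheno: array of shape (n_lines, n_environments); NaN marks a
    missing observation (assumption). Returns one value per line."""
    mean = np.nanmean(pheno, axis=0)       # per-environment mean
    sd = np.nanstd(pheno, axis=0, ddof=1)  # per-environment standard deviation
    z = (pheno - mean) / sd                # mean 0, SD 1 within each environment
    return np.nanmean(z, axis=1)           # average across environments
```
]]></p>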
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Multiple environments</p>
               </caption>
               <text>
                  <p><b>Multiple environments</b>. Effect of including phenotypic data from multiple environments before modelling. Along the <it>x</it>-axis is the number of environments used in the pre-processing of phenotypic data, and the <it>y</it>-axis is the fraction of variance explained. For each number of environments, all possible permutations of the available environments were tested. Each permutation was evaluated by a 50-permutation bootstrap of a whole-genome model fitted using ridge regression. Dotted lines are 95% confidence intervals for the mean derived using the t-test.</p>
               </text>
               <graphic file="1471-2156-9-35-2"/>
            </fig>
            <p>The benefit from increasing the number of environments differs between traits. This is not surprising: more environments provide a better estimate of the genotypic variation, so traits that are heavily influenced by the environment are expected to benefit more from the inclusion of additional environments. This is seen clearly for lodging, <it>&#945;</it>-amylase, and plant height, where the inclusion of more environments produces a substantial increase in performance over a single environment. We can therefore use the degree of increase in variance explained as a crude measure of environmental "susceptibility" or, conversely, heritability of the trait. For example, heading date appeared to be less influenced by environmental factors (2-fold increase in variance explained) than plant height (3.5-fold increase) and the degree of lodging (5.5-fold increase). The performance improvement due to the inclusion of multiple environments is, of course, accompanied by a decrease in the fraction of the total (multi-environment) variance that remains after averaging the scaled phenotypes across environments (Table <tblr tid="T1">1</tblr>), and thus the latter can also be used as an estimate of environmental susceptibility.</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Percentage of total phenotypic variance remaining after averaging scaled phenotypes across environments.<sup>a</sup></p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="left">
                        <p>
                           <b>Trait</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Variance (%)</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p><it>&#945;</it>-Amylase</p>
                     </c>
                     <c ca="center">
                        <p>52.2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Diastatic power</p>
                     </c>
                     <c ca="center">
                        <p>74.1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Malt extract</p>
                     </c>
                     <c ca="center">
                        <p>54.0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Heading date</p>
                     </c>
                     <c ca="center">
                        <p>70.4</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Plant height</p>
                     </c>
                     <c ca="center">
                        <p>64.2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Lodging</p>
                     </c>
                     <c ca="center">
                        <p>40.3</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Grain protein content</p>
                     </c>
                     <c ca="center">
                        <p>45.6</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Yield</p>
                     </c>
                     <c ca="center">
                        <p>22.4</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p><sup>a </sup>The number of environments was 6 for lodging; 9 for <it>&#945;</it>-amylase, diastatic power, malt extract, and grain protein; and 16 for the remaining traits.</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Model size and genetic complexity of traits</p>
            </st>
            <p>The SML algorithm combines Recursive Feature (marker) Elimination (RFE) with ridge regression and bootstrapping (see <it>Methods</it>). It starts with a whole-genome model and progressively eliminates individual markers from the model. When the algorithm starts removing markers with predictive value, the predictive variance explained starts dropping. The number of markers in the smallest model that explains a close-to-maximum fraction of the variance (the 'optimal model') can therefore be used as an indicator of the genetic complexity of a trait.</p>
            <p>Figure <figr fid="F3">3</figr> displays the performance of models of varying size obtained through recursive feature elimination. The size of the 'optimal model' varied considerably among different traits. For pubescent leaves, it is evident that the optimal model contains one marker only &#8211; indeed the locus determining the character (m<it>Pub</it>). All additional markers actually decrease performance as they only add noise rather than information. This effect was also observed for other traits such as yield (not shown). Plant height is an example of a trait that can be accurately modelled with a small number of markers, thus suggesting a relatively low genetic complexity. Diastatic power and <it>&#945;</it>-amylase, by contrast, are traits that appear to be genetically quite complex. For example, 100 markers are required to accurately model diastatic power, while 400 markers are required for <it>&#945;</it>-amylase. These large numbers suggest that the genetic signal is spread throughout the genome, and that many markers influence the phenotypic outcome, each with a small individual effect.</p>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Reduction of model size</p>
               </caption>
               <text>
                  <p><b>Reduction of model size</b>. Performance of models of varying size (number of markers) for four traits: pubescent leaves, plant height, diastatic power, and <it>&#945;</it>-amylase. The <it>x</it>-axis is the number of features (markers), and the <it>y</it>-axis is the fraction of variance explained, estimated using the zero bootstrap. Vertical grey lines indicate the optimal operating points. Dotted lines are 95% confidence intervals derived using the t-test.</p>
               </text>
               <graphic file="1471-2156-9-35-3"/>
            </fig>
            <p>To verify the accuracy of estimating the number of QTL, we performed simulation experiments using a group of 100 artificial datasets. These datasets were also analysed by Bayesian Interval Mapping (BIM) <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp> to benchmark our method. Each dataset contained 1-10 QTL positioned randomly at markers evenly spaced at 1 cM intervals across ten chromosomes of 100 cM length. As shown in Figure <figr fid="F4">4</figr>, the median difference between the true and estimated number of QTL for SML is zero, with a low variance. This result demonstrates that the genetic complexity of traits can be estimated very precisely from the performance curves given by the SML method. By contrast, BIM tends to underestimate the number of QTL.</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Accuracy of genetic complexity estimates</p>
               </caption>
               <text>
                  <p><b>Accuracy of genetic complexity estimates</b>. Comparison of an analysis of 100 synthetic datasets with BIM and SML. The <it>y</it>-axis shows the difference between the true and estimated number of QTL.</p>
               </text>
               <graphic file="1471-2156-9-35-4"/>
            </fig>
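            <p>A synthetic dataset with the marker layout described above (ten chromosomes of 100 cM, markers every 1 cM, QTL placed at randomly chosen markers) could be generated along the following lines. The sketch is illustrative only: the population size, the doubled-haploid coding of genotypes as -1 and +1, the Haldane map function, equal additive QTL effects, Gaussian noise, and the inclusion of both chromosome ends (101 markers per chromosome) are all assumptions, as the simulation settings are not specified in this section.</p>
            <p><![CDATA[
```python
import numpy as np

def simulate_dh_dataset(n_lines=150, n_chrom=10, chrom_len_cm=100.0,
                        spacing_cm=1.0, n_qtl=3, effect=1.0,
                        noise_sd=1.0, seed=0):
    """Simulate marker genotypes and a trait for a hypothetical DH population."""
    rng = np.random.default_rng(seed)
    per_chrom = int(chrom_len_cm / spacing_cm) + 1
    # Haldane map function: recombination fraction between adjacent markers.
    r = 0.5 * (1.0 - np.exp(-2.0 * spacing_cm / 100.0))
    chroms = []
    for _ in range(n_chrom):
        start = rng.choice([-1, 1], size=(n_lines, 1))
        # A crossover between adjacent markers flips the parental allele.
        flips = np.where(rng.random((n_lines, per_chrom - 1)) < r, -1, 1)
        chroms.append(np.hstack([start, start * np.cumprod(flips, axis=1)]))
    X = np.hstack(chroms).astype(float)
    # QTL sit exactly at randomly chosen markers, with equal additive effects.
    qtl = np.sort(rng.choice(X.shape[1], size=n_qtl, replace=False))
    y = effect * X[:, qtl].sum(axis=1) + rng.normal(0.0, noise_sd, size=n_lines)
    return X, y, qtl
```
]]></p>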
         </sec>
         <sec>
            <st>
               <p>Statistical validation of QTL through bootstrapping</p>
            </st>
            <p>An important estimation technique used in our method is bootstrap resampling. Bootstrap resampling involves creating a subset of the data for training, and using the remainder for testing (see <it>Methods</it>). In this way, independent data are reserved for testing the model derived from the training data. This approach produces less biased estimates of the generalisation error (the predictive performance of a model on data unseen during training), and hence a better estimate of the true effect of a putative QTL <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>.</p>
            <p>Figure <figr fid="F5">5</figr> illustrates the bias that can occur when not using independent DH lines for testing the predictive power of a QTL model. We used MR to detect the top QTL and estimate its predictive performance, both using bootstrap resampling and resubstitution (i.e. deriving an estimate based on the whole dataset). For the bootstrap analysis, 200 iterations were used. Each iteration involved detecting the top QTL using MR and training a single-QTL linear model on the training data, then estimating the variance explained on the independent test data (the withheld DH lines). In the figure, the red crosses and box plots show the results obtained with resubstitution and bootstrap resampling, respectively. For each trait except pubescent leaves, the resubstitution estimate is overly optimistic, sitting outside the upper quartile of the bootstrap estimate.</p>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Whole-dataset bias</p>
               </caption>
               <text>
                  <p><b>Whole-dataset bias</b>. Demonstration of the optimistic bias that arises when measuring predictive performance on training data. For each trait, the optimal marker was selected using MR, either on the entire dataset (red crosses) or within a 200-permutation zero bootstrap environment (box plots).</p>
               </text>
               <graphic file="1471-2156-9-35-5"/>
            </fig>
            <p>This result illustrates that resubstitution estimates of QTL effects are inherently biased upward. As a consequence, bootstrap resampling reduces the detection of spurious QTL; QTL deemed important on the training set by chance will not reflect the same importance when measured on the test data. Other authors have explored resampling techniques such as cross-validation in the context of QTL detection and evaluation <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>, and the biases that arise when not using resampling methods have been well demonstrated. Hence the use of bootstrap resampling in the SML procedure should facilitate more robust QTL detection.</p>
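            <p>The comparison between resubstitution and bootstrap estimates can be reproduced in outline with the following sketch. It is a simplified illustration rather than the analysis code used here: markers are assumed to be coded numerically, the top marker is taken to be the one with the largest squared correlation with the trait, and the function names are hypothetical.</p>
            <p><![CDATA[
```python
import numpy as np

def top_marker(X, y):
    """Single marker regression: index of the marker with the largest
    squared correlation with the trait."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = (Xc.T @ yc) ** 2
    denom = np.sum(Xc ** 2, axis=0) * np.sum(yc ** 2)
    safe = np.where(denom > 0, denom, 1.0)  # guard monomorphic markers
    return int(np.argmax(np.where(denom > 0, num / safe, 0.0)))

def variance_explained(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def resubstitution_estimate(X, y):
    """Select, fit and evaluate on the same (whole) dataset."""
    j = top_marker(X, y)
    slope, intercept = np.polyfit(X[:, j], y, 1)
    return variance_explained(y, slope * X[:, j] + intercept)

def bootstrap_estimates(X, y, n_boot=200, seed=0):
    """Select and fit on a bootstrap resample; evaluate on withheld lines."""
    rng = np.random.default_rng(seed)
    n = len(y)
    scores = []
    for _ in range(n_boot):
        train = rng.integers(0, n, size=n)
        test = np.setdiff1d(np.arange(n), train)
        j = top_marker(X[train], y[train])
        slope, intercept = np.polyfit(X[train, j], y[train], 1)
        scores.append(variance_explained(y[test], slope * X[test, j] + intercept))
    return np.array(scores)
```
]]></p>
            <p>Applied to a trait that is pure noise, the resubstitution estimate from this sketch is still positive because the best of many markers is chosen and scored on the same lines, whereas the estimates on withheld lines scatter around or below zero.</p>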
         </sec>
         <sec>
            <st>
               <p>QTL identified compared to other methods</p>
            </st>
            <sec>
               <st>
                  <p>Real data</p>
               </st>
               <p>To further benchmark SML against other QTL mapping methods, we identified QTL for nine traits using SML, single Marker Regression (MR), Composite Interval Mapping (CIM) and BIM. In the case of CIM we used 20 markers at > 10 cM distance from the investigated interval to adjust for the genome background. For BIM, the default values specified in the R/qtlbim package were used for the priors and sampling parameters. Table <tblr tid="T2">2</tblr> shows the average degree of correlation of the genome profiles of variance explained (the QTL effects) among the various methods. SML and CIM produced the most correlated results (Pearson's correlation coefficient <it>r </it>= 0.80). This is despite the fact that SML uses marker information only, while CIM requires the additional information of a genetic map. The BIM profiles were less correlated with the profiles generated by other methods on average.</p>
               <tbl id="T2">
                  <title>
                     <p>Table 2</p>
                  </title>
                  <caption>
                     <p>Correlation between genome profiles of variance explained obtained with different QTL-mapping methods.<sup>a</sup></p>
                  </caption>
                  <tblbdy cols="5">
                     <r>
                        <c ca="left">
                           <p>
                              <b>Method<sup>b</sup></b>
                           </p>
                        </c>
                        <c ca="left">
                           <p>
                              <b>SML</b>
                           </p>
                        </c>
                        <c ca="center">
                           <p>
                              <b>MR</b>
                           </p>
                        </c>
                        <c ca="center">
                           <p>
                              <b>CIM</b>
                           </p>
                        </c>
                        <c ca="center">
                           <p>
                              <b>BIM</b>
                           </p>
                        </c>
                     </r>
                     <r>
                        <c cspan="5">
                           <hr/>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>
                              <b>MR</b>
                           </p>
                        </c>
                        <c ca="left">
                           <p>0.65 &#177; 0.04</p>
                        </c>
                        <c ca="center">
                           <p>-</p>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>
                              <b>CIM</b>
                           </p>
                        </c>
                        <c ca="left">
                           <p>0.80 &#177; 0.09</p>
                        </c>
                        <c ca="center">
                           <p>0.72 &#177; 0.15</p>
                        </c>
                        <c ca="center">
                           <p>-</p>
                        </c>
                        <c>
                           <p/>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>
                              <b>BIM</b>
                           </p>
                        </c>
                        <c ca="left">
                           <p>0.48 &#177; 0.13</p>
                        </c>
                        <c ca="center">
                           <p>0.46 &#177; 0.15</p>
                        </c>
                        <c ca="center">
                           <p>0.44 &#177; 0.14</p>
                        </c>
                        <c ca="center">
                           <p>-</p>
                        </c>
                     </r>
                  </tblbdy>
                  <tblfn>
                     <p><sup>a </sup>The values given are means &#177; SD across the nine traits investigated in this study.</p>
                     <p><sup>b </sup>QTL-detection methods were: SML, Statistical Machine Learning; MR, single Marker Regression; CIM, Composite Interval Mapping with 20 background markers at > 10 cM distance from the tested interval, and BIM, Bayesian Interval Mapping.</p>
                  </tblfn>
               </tbl>
               <p>We next counted and compared the QTL reported by SML, MR and CIM at a significance level of <it>p</it> &lt; 0.05 (Figure <figr fid="F6">6</figr>). BIM was not included in this detailed comparison as it is difficult to match the frequentist null-hypothesis rejection thresholds with the Bayes factors used with BIM. SML reported slightly fewer than half as many QTL as MR and CIM, presumably because the bootstrap-validation step eliminated spurious QTL (see previous section); MR, for example, reported five spurious peaks for pubescent leaves, a trait known to be controlled by a single Mendelian locus (Additional File <supplr sid="S1">1</supplr>). Perhaps not surprisingly, about half of the QTL detected by either MR or CIM could not be cross-validated by a second method. By contrast, 95% of the QTL identified by SML were also detected by MR and/or CIM (Figure <figr fid="F6">6</figr>). These results suggest that QTL detected by SML are more robust and hence more likely to be 'biologically significant'.</p>
               <suppl id="S1">
                  <title>
                     <p>Additional file 1</p>
                  </title>
                  <text>
                     <p><b>QTL detected with different algorithms (<it>p </it>&lt; 0.05)</b>. PDF file containing a list of QTL identified for each combination of QTL-detection method (SML, MR, and CIM) and trait (<it>&#945;</it>-amylase, diastatic power, heading date, plant height, lodging, malt extract, pubescent leaves, grain protein content, and yield).</p>
                  </text>
                  <file name="1471-2156-9-35-S1.pdf">
                     <p>Click here for file</p>
                  </file>
               </suppl>
               <fig id="F6">
                  <title>
                     <p>Figure 6</p>
                  </title>
                  <caption>
                     <p>Cross-validation of QTL</p>
                  </caption>
                  <text>
                     <p><b>Cross-validation of QTL</b>. Overlaps among QTL detected by SML, MR and CIM at a <it>p </it>&lt; 0.05 level. QTL in common between each pair of methods were identified as described in the section entitled 'Comparisons between QTL-detection methods and map versions' in <it>Methods</it>. The reported numbers are the sums across all nine traits investigated in this study.</p>
                  </text>
                  <graphic file="1471-2156-9-35-6"/>
               </fig>
               <p>There was a large overlap between QTL identified in this study and previous studies of the same DH population <abbrgrp><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr></abbrgrp>. SML identified well-known major QTL for <it>&#945;</it>-amylase (chromosomes 2 H, 7 H), diastatic power (1 H, 4 H, 7 H), grain protein content (2 H, 4 H, 5 H), malt extract (2 H, 4 H, 7 H), heading date (2 H), height (2 H, 3 H), lodging (2 H, 3 H, 4 H) and yield (3 H) (Additional File <supplr sid="S1">1</supplr>) <abbrgrp><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr></abbrgrp>.</p>
               <p>Figure <figr fid="F7">7</figr> displays the profiles generated using several methods on the heading date, height, lodging and yield traits. The yield QTL on chromosome 3 H at a cumulative map position of 431 cM indeed coincided closely with the main lodging QTL (431 cM) and one of the plant-height QTL (432 cM). Lodging is expected to affect yield, yet the yield QTL profile produced by SML was identical, irrespective of whether or not environments where lodging was reported were included in the analysis (data not shown).</p>
               <fig id="F7">
                  <title>
                     <p>Figure 7</p>
                  </title>
                  <caption>
                     <p>Comparison of different QTL methods</p>
                  </caption>
                  <text>
                     <p><b>Comparison of different QTL methods</b>. Genome-wide QTL profiles for four traits generated by SML, MR, CIM and BIM. A 5 cM averaging window was applied to the BIM profile for plotting. Horizontal dotted lines are <it>p </it>&lt; 0.05 thresholds for SML. The plots are based on the allele calls and genotypes underlying the 'raw' version of the linkage map (see section entitled 'Genetic-map construction' in <it>Methods</it>).</p>
                  </text>
                  <graphic file="1471-2156-9-35-7"/>
               </fig>
               <p>Hayes and colleagues suggested that the positive allele for the yield QTL on chromosome 3 H coincided with low lodging and height-QTL alleles from the opposite parent <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. These previous observations are clearly reinforced by our results and appear to point to a locus influencing plant height that has independent pleiotropic effects on both lodging and yield, as opposed to a causal chain (tall plants &#8594; lodging &#8594; reduced yield). Plant height also appeared to affect lodging via another QTL on chromosome 2 H (241 cM), which coincided for the two traits. Plant height, in turn, appeared to be partly associated with heading date because the main QTL for these two traits coincided precisely (chromosome 2 H; 269 cM). We conclude that the SML-QTL algorithm confirms and extends previously hypothesised relationships among these traits. Clearly, the resolution of the QTL profiles generated by SML facilitates the genetic dissection of traits into physiological or phenological components.</p>
            </sec>
            <sec>
               <st>
                  <p>Synthetic data</p>
               </st>
               <p>We also compared the genome profiles of variance explained (the QTL effects) derived from the 100 synthetic datasets discussed earlier, in order to benchmark SML against BIM and MR. These methods were selected to represent the two extremes of algorithmic complexity of existing QTL mapping methods. To summarise these profiles and give an idea of overall performance of each method, we considered each dataset to be a binary classification problem &#8211; for each marker, classify it as a QTL or not a QTL. Such a binary classification can be accomplished by choosing a threshold and classifying markers exceeding this threshold as linked to QTL. However, as the threshold affects the trade-off between type-I and type-II errors, we used the Area under the Receiver Operating Characteristic (AROC) <abbrgrp><abbr bid="B19">19</abbr></abbrgrp> to measure the performance. The AROC is an order statistic equal to the probability of correctly ordering pairs from different classes (see "QTL classification performance" section in <it>Methods</it>).</p>
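               <p>The pairwise-ordering interpretation of the AROC can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function name <it>aroc</it>, the six-marker profile and the QTL labels are hypothetical, and ties are assumed to count as half a correctly ordered pair.</p>
               <p><![CDATA[
```python
from itertools import product


def aroc(scores, is_qtl):
    """Area under the ROC curve computed as a pairwise order statistic.

    Returns the fraction of (QTL marker, non-QTL marker) pairs in which the
    QTL marker has the higher score. Ties count as half a correct ordering
    (an assumption of this sketch).
    """
    pos = [s for s, q in zip(scores, is_qtl) if q]
    neg = [s for s, q in zip(scores, is_qtl) if not q]
    if not pos or not neg:
        raise ValueError("need at least one marker of each class")
    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical profile of variance explained for six markers.
profile = [0.30, 0.02, 0.10, 0.01, 0.05, 0.00]
# Hypothetical ground truth: the first and fifth markers are linked to QTL.
truth = [True, False, False, False, True, False]

# 7 of the 8 (QTL, non-QTL) pairs are ordered correctly, so this prints 0.875.
print(aroc(profile, truth))
```
]]></p>
               <p>Because only the ordering of the scores enters the calculation, no significance threshold has to be chosen, which is the property that motivates the use of the AROC here.</p>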
               <p>Figure <figr fid="F8">8</figr> summarises this experiment in the form of a box plot. The results demonstrate that MR performs worse than BIM and SML &#8211; as expected &#8211; with a lower median and large variance. BIM achieved a high median performance, but had a larger variance than SML. Though the BIM median was higher, the difference between the means of SML and BIM was not significant (<it>p </it>= 0.499). We conclude that both methods are similar with respect to locating QTL.</p>
               <fig id="F8">
                  <title>
                     <p>Figure 8</p>
                  </title>
                  <caption>
                     <p>QTL profile accuracy on simulated data</p>
                  </caption>
                  <text>
                     <p><b>QTL profile accuracy on simulated data</b>. Accuracy of BIM, SML and MR in classifying individual markers as linked to synthetic QTL, based on the genome profiles obtained for 100 simulated datasets. The y-axis is the Area under the Receiver Operating Characteristic (AROC). The 0.5 level indicates random performance and 1 indicates perfect performance.</p>
                  </text>
                  <graphic file="1471-2156-9-35-8"/>
               </fig>
               <p>Finally, we examined a single synthetic dataset comprising a 2,000 cM-long 'chromosome' that contained 20 randomly positioned QTL of random strength. Figure <figr fid="F9">9</figr> shows the smoothed profiles (5 cM averaging window for BIM and 5 cM summing window for SML) of variance explained obtained using BIM and SML (see Additional File <supplr sid="S2">2</supplr>). Here it is clear that SML provides better estimates of QTL strength &#8211; non-QTL markers are assigned low variance explained and the estimates at QTL markers are not overly optimistic. The lack of a bootstrapping step during which experimental units (plants) are resampled presumably accounts for the upward bias of BIM (see also section entitled "Statistical validation of QTL through bootstrapping"). One may claim that SML underestimates the variance; however, the estimates improve after the suggested 5 cM summing window is applied.</p>
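               <p>The difference between an averaging and a summing window can be sketched in Python as follows. The text does not state how the 5 cM windows were positioned, so this sketch assumes a window centred on each marker that includes all markers within half the window width; the function name <it>smooth</it>, the 1 cM marker spacing and the example values are likewise illustrative assumptions.</p>
               <p><![CDATA[
```python
def smooth(positions_cm, values, width_cm=5.0, mode="mean"):
    """Smooth a per-marker profile with a window centred on each marker.

    mode="mean" averages the values inside the window (averaging window);
    mode="sum" adds them up (summing window).
    """
    if mode not in ("mean", "sum"):
        raise ValueError("mode must be 'mean' or 'sum'")
    half = width_cm / 2.0
    smoothed = []
    for centre in positions_cm:
        # Collect the values of all markers within half a window of the centre.
        inside = [v for x, v in zip(positions_cm, values)
                  if abs(x - centre) <= half]
        total = sum(inside)
        smoothed.append(total / len(inside) if mode == "mean" else total)
    return smoothed


# Hypothetical markers spaced 1 cM apart; the variance explained by one QTL
# is split across three neighbouring markers.
positions = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
raw = [0.0, 0.0, 0.04, 0.06, 0.05, 0.0, 0.0]

summed = smooth(positions, raw, mode="sum")
averaged = smooth(positions, raw, mode="mean")
print(summed[3], averaged[3])
```
]]></p>
               <p>In this toy profile the summing window gathers the variance that is spread over the three linked markers into a single estimate of about 0.15 at the central marker, whereas the averaging window dilutes it to about 0.03. This may be why summing is the more natural choice when variance is apportioned among individual markers.</p>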
               <fig id="F9">
                  <title>
                     <p>Figure 9</p>
                  </title>
                  <caption>
                     <p>SML and BIM genome profiles on synthetic data</p>
                  </caption>
                  <text>
                     <p><b>SML and BIM genome profiles on synthetic data</b>. Estimated QTL effects using BIM and SML for a single synthetic 'chromosome' of 2,000 cM length with 20 simulated QTL. QTL were positioned randomly with random strength. Red lines indicate true QTL locations, with height denoting strength. BIM profile smoothed using a 5 cM averaging window, and SML profile smoothed using a 5 cM summing window.</p>
                  </text>
                  <graphic file="1471-2156-9-35-9"/>
               </fig>
               <suppl id="S2">
                  <title>
                     <p>Additional file 2</p>
                  </title>
                  <text>
                     <p><b>Unsmoothed results obtained in the analysis of a synthetic 'chromosome'</b>. PowerPoint file with two plots containing the unsmoothed results from which the plots in Figure <figr fid="F9">9</figr> were generated.</p>
                  </text>
                  <file name="1471-2156-9-35-S2.ppt">
                     <p>Click here for file</p>
                  </file>
               </suppl>
               <p>It is important to emphasize that the <it>amount </it>of variance explained <it>supportable by the data </it>will be less than the theoretical variance explained shown in red due to small sample size (100 samples with 2001 features) and noise. Measuring the AROC on both variance explained profiles gives 0.83 for SML and 0.78 for BIM, indicating the SML peaks are better aligned with QTL and more distinct than the BIM peaks.</p>
            </sec>
         </sec>
         <sec>
            <st>
               <p>QTL resolution</p>
            </st>
            <p>The precision with which a QTL can be mapped is important in the context of marker-assisted selection and gene cloning in particular. Narrow QTL peaks are also important for distinguishing closely linked QTL (or genes) affecting the trait. Figures <figr fid="F7">7</figr> and <figr fid="F9">9</figr> demonstrate that SML consistently generated narrower and better defined QTL signals than MR, CIM and BIM. It should be noted that we used quite aggressive settings for CIM to produce narrow QTL peaks (background markers at > 10 cM) <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. To evaluate the precision of SML, we investigated the centromeric region on chromosome 7 H flanked by markers <it>Amy2 </it>(64 cM) and <it>Brz </it>(95.2 cM) (Additional File <supplr sid="S3">3</supplr>). This region contains several overlapping QTL for malting-quality traits, including malt extract, <it>&#945;</it>-amylase and diastatic power <abbrgrp><abbr bid="B15">15</abbr><abbr bid="B18">18</abbr></abbrgrp>.</p>
            <suppl id="S3">
               <title>
                  <p>Additional file 3</p>
               </title>
               <text>
                  <p><b>Genotypic data used for QTL analysis</b>. Excel file containing 0/1 allele calls and A/B genotypes (segregation data) for both the 'raw' and the 'curated' Steptoe/Morex genetic map.</p>
               </text>
               <file name="1471-2156-9-35-S3.xls">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <p>It had been speculated that one of the two <it>&#945;</it>-amylase QTL could be attributed to <it>Amy2</it>, a structural gene encoding low-<it>p</it>I <it>&#945;</it>-amylase <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. The resolution afforded by conventional QTL-mapping methods, however, was insufficient to settle this issue. The CIM analysis in this study also reported a broad peak on chromosome 7 H. The QTL profile generated by SML, by contrast, showed two distinct peaks (Figure <figr fid="F10">10</figr>; Additional File <supplr sid="S1">1</supplr>). One of the two peaks was at 4.6-cM distance from the <it>Amy2 </it>locus (the other was further away). Given that various partially related traits mapped to identical QTL with less than 1-cM precision (Figure <figr fid="F7">7</figr>), a 4.6-cM distance would suggest the structural gene and the QTL are not identical. This result is indeed consistent with a fine-mapping study of this region that identified recombinants between <it>Amy2 </it>and the QTL <abbrgrp><abbr bid="B18">18</abbr></abbrgrp> and hence underscores the high resolution afforded by SML.</p>
            <fig id="F10">
               <title>
                  <p>Figure 10</p>
               </title>
               <caption>
                  <p>QTL for <it>&#945;</it>-amylase on chromosome 7 H</p>
               </caption>
               <text>
                  <p><b>QTL for <it>&#945;</it>-amylase on chromosome 7 H</b>. QTL profiles produced with SML, CIM and BIM. The positions of the structural &#945;-amylase gene (<it>Amy2</it>) and the maximum of the SML QTL peak are indicated by vertical dotted lines. A 5 cM averaging window was applied to the BIM profile for plotting. 'Significant peaks' (<it>p </it>&lt; 0.05 for SML and CIM; 2log BF > 2.2 for BIM) are highlighted by asterisks. The plot is based on the allele calls and genotypes underlying the 'raw' version of the linkage map (see section entitled 'Genetic-map construction' in <it>Methods</it>).</p>
               </text>
               <graphic file="1471-2156-9-35-10"/>
            </fig>
            <p>Conventional methods map QTL with limited precision, particularly if the fraction of the variance explained by a QTL is low <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. In CIM, the width of QTL peaks can be reduced by using more closely linked markers for genetic-background adjustment. This approach, however, decreases the statistical power of the method <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> and relies on an <it>ad-hoc </it>cM-distance threshold. BIM provides a degree of resolution similar to that of SML but appears to overestimate QTL effects to an even larger extent than CIM, and reports QTL peaks not supported by the other methods (Figure <figr fid="F10">10</figr>).</p>
            <p>By contrast, SML generates unbiased QTL models and increases QTL definition by shrinking the size of the models through recursive marker elimination and apportioning variance to individual markers based on nested models. Individual markers are evaluated in the context of other markers; so if multiple markers contain a similar level of information then the (largely) superfluous markers will be removed. The remaining marker(s) will still explain most of the variance, while the variance attributed to the superfluous markers will be small, thus resulting in well-defined QTL peaks.</p>
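            <p>The idea of recursive marker elimination with variance apportioned from nested models can be sketched in pure Python. This is a simplified illustration under stated assumptions, not the SML algorithm itself: the weakest marker is assumed to be the one with the smallest absolute ridge weight, the variance share of a marker is assumed to be the drop in variance explained when it is removed, the regularisation constant <it>lam</it> is arbitrary, and variance explained is measured in-sample rather than on bootstrap samples, so the estimates are optimistic. All function names and data are hypothetical.</p>
            <p><![CDATA[
```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def centre(v):
    mu = sum(v) / len(v)
    return [x - mu for x in v]


def ridge_fit(cols, y, lam):
    """Ridge weights for centred marker columns and a centred trait."""
    k = len(cols)
    gram = [[sum(u * v for u, v in zip(cols[i], cols[j]))
             + (lam if i == j else 0.0) for j in range(k)] for i in range(k)]
    rhs = [sum(u * v for u, v in zip(cols[i], y)) for i in range(k)]
    return solve(gram, rhs)


def r_squared(cols, y, w):
    """In-sample fraction of variance explained by the fitted model."""
    pred = [sum(wi * col[t] for wi, col in zip(w, cols))
            for t in range(len(y))]
    sse = sum((a - b) ** 2 for a, b in zip(y, pred))
    return 1.0 - sse / sum(a ** 2 for a in y)


def eliminate(markers, trait, lam=1.0):
    """Drop the weakest marker repeatedly; return marker -> variance share."""
    y = centre(trait)
    active = {name: centre(col) for name, col in markers.items()}
    share = {}
    while active:
        names = list(active)
        cols = [active[n] for n in names]
        w = ridge_fit(cols, y, lam)
        r2 = r_squared(cols, y, w)
        weakest = min(range(len(names)), key=lambda i: abs(w[i]))
        del active[names[weakest]]
        rest = list(active.values())
        r2_rest = r_squared(rest, y, ridge_fit(rest, y, lam)) if rest else 0.0
        # Share of the removed marker: loss in variance explained (assumption).
        share[names[weakest]] = r2 - r2_rest
    return share


# Hypothetical data: m1 and m2 are closely linked, m3 is unlinked.
markers = {
    "m1": [0, 0, 1, 1, 0, 1, 1, 0],
    "m2": [0, 0, 1, 1, 0, 1, 0, 0],
    "m3": [1, 0, 1, 0, 1, 0, 1, 0],
}
trait = [1.1, 0.9, 3.2, 2.8, 1.0, 3.1, 2.9, 1.2]
share = eliminate(markers, trait)
print(share)
```
]]></p>
            <p>By construction the shares add up to the variance explained by the full model, so variance that two linked markers could both account for is credited only once, to whichever marker survives longer. The shares are not guaranteed to be non-negative in this simplified form.</p>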
         </sec>
         <sec>
            <st>
               <p>Robustness to genotyping and linkage-mapping errors</p>
            </st>
            <p>Genotyping errors affect the accuracy of the marker order on a genetic map and hence the performance of QTL-detection methods that require a linkage map. We compared the QTL profiles produced with SML, CIM and MR using two different genotypic datasets: the dataset underlying a 'raw' version of the Steptoe/Morex map (0.4% potential genotyping errors; 97.0% call rate) and the dataset corresponding to a 'curated', re-optimised version of the map (potential genotyping errors removed; 99.6% call rate). Table <tblr tid="T3">3</tblr> presents an overview of this comparison. The QTL profiles were highly correlated for MR, less correlated for SML and the least correlated for CIM. Despite the highly correlated QTL profiles, only 67% of more than 80 QTL identified with MR were consistent between the two map versions. The between-map consistency of the QTL detected with CIM (approximately 80) was even lower (64%).</p>
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Consistency between QTL detected with 'raw' and 'curated' genotypic data.</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="3" ca="center">
                        <p>
                           <b>Total number of QTL detected (<it>p </it>&lt; 0.05)</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Method<sup>a</sup></b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Correlation between QTL profiles<sup>b</sup></b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Raw dataset</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Curated dataset</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Overlap<sup>c</sup></b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>SML</p>
                     </c>
                     <c ca="left">
                        <p>0.895 &#177; 0.085</p>
                     </c>
                     <c ca="center">
                        <p>38</p>
                     </c>
                     <c ca="center">
                        <p>34</p>
                     </c>
                     <c ca="left">
                        <p>29 (81%)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>MR</p>
                     </c>
                     <c ca="left">
                        <p>0.998 &#177; 0.002</p>
                     </c>
                     <c ca="center">
                        <p>86</p>
                     </c>
                     <c ca="center">
                        <p>84</p>
                     </c>
                     <c ca="left">
                        <p>57 (67%)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>CIM</p>
                     </c>
                     <c ca="left">
                        <p>0.887 &#177; 0.128</p>
                     </c>
                     <c ca="center">
                        <p>83</p>
                     </c>
                     <c ca="center">
                        <p>86</p>
                     </c>
                     <c ca="left">
                        <p>54 (64%)</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p><sup>a </sup>SML, Statistical Machine Learning; MR, single Marker Regression; CIM, Composite Interval Mapping with 20 background markers at > 10 cM distance from the tested interval.</p>
                  <p><sup>b </sup>QTL profiles are whole-genome plots of the fraction of variance explained vs. genome position similar to those displayed in Figures 6-8. The values reported are means &#177; SD across the nine traits investigated in this study.</p>
                  <p><sup>c</sup>The percentage overlap was computed by division by the average number of QTL detected with the two datasets.</p>
               </tblfn>
            </tbl>
            <p>As a result of the bootstrap-validation step, SML reported fewer than half of the QTL identified by other methods (see section entitled <it>Statistical validation of QTL through bootstrapping </it>above). However, 81% of these QTL were consistent between map versions. In contrast to CIM, the SML method can function independently of a genetic map. We only used the map for smoothing and conveniently plotting the results. An erroneous marker order in a linkage map, therefore, affects SML only marginally during the final smoothing/plotting step.</p>
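            <p>The overlap percentages in Table 3 follow from footnote c: the number of QTL shared between the two datasets is divided by the average number of QTL detected with them. A short Python check for the SML and MR rows is given below; the function name <it>overlap_percent</it> is introduced here only for illustration.</p>
            <p><![CDATA[
```python
def overlap_percent(n_raw, n_curated, n_shared):
    """Shared QTL as a percentage of the average number of QTL detected."""
    average = (n_raw + n_curated) / 2.0
    return 100.0 * n_shared / average


# Counts taken from Table 3 (raw dataset, curated dataset, overlap).
print(round(overlap_percent(38, 34, 29)))  # SML row: 81
print(round(overlap_percent(86, 84, 57)))  # MR row: 67
```
]]></p>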
            <p>Map curation not only affected QTL detection but also the estimation of QTL effects. Figure <figr fid="F11">11</figr> displays a between-map comparison for diastatic power, one of the genetically more complex traits. In the case of SML, the variance explained by QTL was consistent between the two datasets. CIM was less consistent. For example, map curation reduced the explanatory power of the most important CIM QTL on chromosome 7 H from 25% to 10% of variance explained (Figure <figr fid="F11">11</figr>). We conclude from these results that SML is more robust to genotyping and linkage-mapping errors than both MR and CIM.</p>
            <fig id="F11">
               <title>
                  <p>Figure 11</p>
               </title>
               <caption>
                  <p>Robustness to genotyping and linkage-mapping errors</p>
               </caption>
               <text>
                  <p><b>Robustness to genotyping and linkage-mapping errors</b>. Effect of map curation on QTL for diastatic power detected by SML and CIM. In the case of CIM, 20 markers at > 10 cM distance from the tested interval were used to adjust for the genetic background. Statistically significant peaks (<it>p </it>&lt; 0.05) are labelled with asterisks.</p>
               </text>
               <graphic file="1471-2156-9-35-11"/>
            </fig>
            <p>Interestingly, the quality of the 'raw' genotyping dataset used in the analysis reported here is lower than that of a typical dataset produced by a standard DArT assay (see the 'Genotypic data' section in <it>Methods</it>) but arguably higher than that of a typical dataset generated with (semi)manually scored markers (AFLP or SSR). From this it follows that:</p>
            <p>1. 'Standard' QTL mapping approaches (like CIM), when performed on genotyping datasets obtained with gel-based marker technologies, may produce inconsistent marker/trait associations; and</p>
            <p>2. The SML approach is likely to perform well in detecting and estimating QTL effects when using marker data with a quality similar to that of a standard DArT assay, with negligible improvement afforded by either replicating DArT assays or employing technically more complex and costly SNP genotyping platform(s).</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>The QTL identified with SML are broadly consistent with those detected by other methods. Yet the SML algorithm offers some advantages over QTL methods such as MR, CIM and BIM. SML produces narrower peaks than MR and CIM and hence identifies QTL with greater precision. BIM generates peaks as narrow as those of SML but, unlike SML, seems to underestimate the genetic complexity of traits and overestimate the QTL effects on synthetic data. Because of the use of bootstrap resampling, SML avoids the optimistic bias in predictive performance (% variance explained), which is an inherent feature of other methods. Consequently, SML provides better estimates of the QTL effects supportable by the data, thus reducing the false-discovery rate.</p>
         <p>Finally, unlike several other QTL algorithms, SML does not require a genetic map. It is therefore applicable to any species or population. Because of this feature, SML is a potentially attractive alternative for association-mapping experiments, an idea that will be explored in a future paper.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Barley population</p>
            </st>
            <p>Our study is based on existing data for 94 F<sub>1</sub>-derived DH plants from a cross between barley cultivars Steptoe and Morex <abbrgrp><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr></abbrgrp>. This population has been the subject of extensive phenotyping across a range of environments <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>.</p>
         </sec>
         <sec>
            <st>
               <p>Genotypic data</p>
            </st>
            <sec>
               <st>
                  <p>Data source</p>
               </st>
               <p>We used part of the segregation data from a high-quality Steptoe/Morex map with more than 1,000 markers. This map was built from RFLP, DArT and SSR markers <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>, and had approximately 0.2% potential genotyping errors. To create a dataset more 'typical' of plant QTL studies reported in the literature (with fewer markers and a higher error rate), we selected a random subset of 464 markers and added 84 markers with more genotyping errors. The majority of these added markers were previously rejected DArT markers with low marker-quality scores <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. DArT genotypes ('A' for homozygote maternal, 'B' for homozygote paternal) were translated into the original presence/absence allele calls (0/1) by comparison against the parental alleles. RFLP genotypes were converted into presence/absence allele calls by arbitrarily assigning '1' to the maternal allele.</p>
               <p>Allele calls (0/1) were used to identify QTL using SML and MR. Missing allele calls were imputed with 0.5 because the ridge regression algorithm underlying our method works on continuous input values (see section entitled QTL <it>machine-learning algorithm </it>below). Genotypes (A/B) were used to identify QTL using the map-based CIM approach. Missing genotypes were replaced with expected genotypes derived from flanking markers after genetic-map construction.</p>
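               <p>The imputation of missing allele calls can be illustrated with a minimal Python sketch. Representing a missing call as <it>None</it> and the function name <it>impute_allele_calls</it> are assumptions made for this example only.</p>
               <p><![CDATA[
```python
def impute_allele_calls(calls, fill=0.5):
    """Return allele calls as floats, replacing missing calls by `fill`.

    `calls` holds 0, 1 or None (None marks a missing call in this sketch).
    """
    imputed = []
    for call in calls:
        if call is None:
            imputed.append(fill)
        elif call in (0, 1):
            imputed.append(float(call))
        else:
            raise ValueError("allele calls must be 0, 1 or None")
    return imputed


# Hypothetical calls for one marker scored in five plants.
print(impute_allele_calls([0, 1, None, 1, 0]))  # [0.0, 1.0, 0.5, 1.0, 0.0]
```
]]></p>
               <p>The value 0.5 lies midway between the two allele codes, so an imputed call does not favour either parental allele, and a regression method that accepts continuous inputs can use it directly.</p>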
            </sec>
            <sec>
               <st>
                  <p>Genetic map construction</p>
               </st>
               <p>For the purpose of displaying SML results and identifying QTL by CIM, we built a genetic map for the dataset of 548 selected markers (351 DArT, 197 RFLP). The marker order was established with RECORD software, and the cM distances between markers were estimated using a multipoint regression algorithm <abbrgrp><abbr bid="B25">25</abbr><abbr bid="B26">26</abbr></abbrgrp>. The resulting 'raw' map had a call rate of 97.0% and contained 0.4% potential genotypic errors (Additional File <supplr sid="S3">3</supplr>). For comparison, we also generated a 'curated' version of the map. Map curation comprised imputing missing genotypes from neighbouring markers, substituting potential genotyping errors (LOD<sub>error </sub>> 4) <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> with missing data, re-optimising the marker order and collapsing co-segregating markers into 'bins'. The resulting refined map had 367 bins and a call rate of 99.6% (Additional File <supplr sid="S3">3</supplr>). We used both the 'raw' and the 'curated' allele calls and genotypes to identify QTL.</p>
            </sec>
         </sec>
         <sec>
            <st>
               <p>Phenotypic data</p>
            </st>
            <sec>
               <st>
                  <p>Data source</p>
               </st>
               <p>The phenotypic data for nine traits, measured in up to 16 different environments, were downloaded from the GrainGenes website <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> (Additional File <supplr sid="S4">4</supplr>).</p>
               <suppl id="S4">
                  <title>
                     <p>Additional file 4</p>
                  </title>
                  <text>
                     <p><b>Phenotypic data used for QTL analysis</b>. Excel file containing phenotypic data for the nine traits investigated in this study (<it>&#945;</it>-amylase, diastatic power, heading date, plant height, lodging, malt extract, pubescent leaves, grain protein content, and yield). The data is from up to 16 different environments and includes averages across standardised environments (see section entitled 'Pre-processing of phenotypic data' in <it>Methods</it>).</p>
                  </text>
                  <file name="1471-2156-9-35-S4.xls">
                     <p>Click here for file</p>
                  </file>
               </suppl>
            </sec>
            <sec>
               <st>
                  <p>Pre-processing of phenotypic data</p>
               </st>
               <p>We introduce a method strongly related to principal component analysis. Let <it>p</it><sub><it>ij </it></sub>be the phenotype measurement for plant <it>i </it>in environment <it>j</it>, and let <it>n</it><sub>env</sub>, <it>n</it><sub>mrk</sub>, and <it>n</it><sub><it>p </it></sub>be the numbers of environments, markers, and plants, respectively. Then the mean and standard deviation of phenotypes within each environment are given by</p>
               <p>
                  <display-formula>
                     <m:math name="1471-2156-9-35-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
                        <m:semantics>
                           <m:mtable columnalign="left">
                              <m:mtr>
                                 <m:mtd>
                                    <m:msub>
                                       <m:mover accent="true">
                                          <m:mi>p</m:mi>
                                          <m:mo>&#175;</m:mo>
                                       </m:mover>
                                       <m:mi>j</m:mi>
                                    </m:msub>
                                    <m:mo>=</m:mo>
                                    <m:msubsup>
                                       <m:mi>n</m:mi>
                                       <m:mi>p</m:mi>
                                       <m:mrow>
                                          <m:mo>&#8722;</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                    </m:msubsup>
                                    <m:mstyle displaystyle="true">
                                       <m:munder>
                                          <m:mo>&#8721;</m:mo>
                                          <m:mi>i</m:mi>
                                       </m:munder>
                                       <m:mrow>
                                          <m:msub>
                                             <m:mi>p</m:mi>
                                             <m:mrow>
                                                <m:mi>i</m:mi>
                                                <m:mi>j</m:mi>
                                             </m:mrow>
                                          </m:msub>
                                       </m:mrow>
                                    </m:mstyle>
                                 </m:mtd>
                              </m:mtr>
                              <m:mtr>
                                 <m:mtd>
                                    <m:msub>
                                       <m:mi>s</m:mi>
                                       <m:mi>j</m:mi>
                                    </m:msub>
                                    <m:mo>=</m:mo>
                                    <m:msqrt>
                                       <m:mrow>
                                          <m:msup>
                                             <m:mrow>
                                                <m:mo stretchy="false">(</m:mo>
                                                <m:msub>
                                                   <m:mi>n</m:mi>
                                                   <m:mi>p</m:mi>
                                                </m:msub>
                                                <m:mo>&#8722;</m:mo>
                                                <m:mn>1</m:mn>
                                                <m:mo stretchy="false">)</m:mo>
                                             </m:mrow>
                                             <m:mrow>
                                                <m:mo>&#8722;</m:mo>
                                                <m:mn>1</m:mn>
                                             </m:mrow>
                                          </m:msup>
                                          <m:mstyle displaystyle="true">
                                             <m:munder>
                                                <m:mo>&#8721;</m:mo>
                                                <m:mi>i</m:mi>
                                             </m:munder>
                                             <m:mrow>
                                                <m:msup>
                                                   <m:mrow>
                                                      <m:mo stretchy="false">(</m:mo>
                                                      <m:msub>
                                                         <m:mi>p</m:mi>
                                                         <m:mrow>
                                                            <m:mi>i</m:mi>
                                                            <m:mi>j</m:mi>
                                                         </m:mrow>
                                                      </m:msub>
                                                      <m:mo>&#8722;</m:mo>
                                                      <m:msub>
                                                         <m:mover accent="true">
                                                            <m:mi>p</m:mi>
                                                            <m:mo>&#175;</m:mo>
                                                         </m:mover>
                                                         <m:mi>j</m:mi>
                                                      </m:msub>
                                                      <m:mo stretchy="false">)</m:mo>
                                                   </m:mrow>
                                                   <m:mn>2</m:mn>
                                                </m:msup>
                                             </m:mrow>
                                          </m:mstyle>
                                       </m:mrow>
                                    </m:msqrt>
                                 </m:mtd>
                              </m:mtr>
                           </m:mtable>
                           <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGceaqabeaacuWGWbaCgaqeamaaBaaaleaacqWGQbGAaeqaaOGaeyypa0JaemOBa42aa0baaSqaaiabdchaWbqaaiabgkHiTiabigdaXaaakmaaqafabaGaemiCaa3aaSbaaSqaaiabdMgaPjabdQgaQbqabaaabaGaemyAaKgabeqdcqGHris5aaGcbaGaem4Cam3aaSbaaSqaaiabdQgaQbqabaGccqGH9aqpdaGcaaqaaiabcIcaOiabd6gaUnaaBaaaleaacqWGWbaCaeqaaOGaeyOeI0IaeGymaeJaeiykaKYaaWbaaSqabeaacqGHsislcqaIXaqmaaGcdaaeqbqaaiabcIcaOiabdchaWnaaBaaaleaacqWGPbqAcqWGQbGAaeqaaOGaeyOeI0IafmiCaaNbaebadaWgaaWcbaGaemOAaOgabeaakiabcMcaPaWcbaGaemyAaKgabeqdcqGHris5aaWcbeaaaaaa@5742@</m:annotation>
                        </m:semantics>
                     </m:math>
                  </display-formula>
               </p>
               <p>where <it>s</it><sub><it>j </it></sub>and <inline-formula><m:math name="1471-2156-9-35-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mover accent="true"><m:mi>p</m:mi><m:mo>&#175;</m:mo></m:mover><m:mi>j</m:mi></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmiCaaNbaebadaWgaaWcbaGaemOAaOgabeaaaaa@2EDF@</m:annotation></m:semantics></m:math></inline-formula> are the sample standard deviation and mean of environment <it>j </it>calculated across all plants <it>i </it>&#8712; 1..<it>n</it><sub><it>p</it></sub>. The scaled phenotypes are then given by</p>
               <p>
                  <display-formula>
                     <m:math name="1471-2156-9-35-i3" xmlns:m="http://www.w3.org/1998/Math/MathML">
                        <m:semantics>
                           <m:mrow>
                              <m:msub>
                                 <m:mover accent="true">
                                    <m:mi>p</m:mi>
                                    <m:mo>^</m:mo>
                                 </m:mover>
                                 <m:mrow>
                                    <m:mi>i</m:mi>
                                    <m:mi>j</m:mi>
                                 </m:mrow>
                              </m:msub>
                              <m:mo>=</m:mo>
                              <m:mfrac>
                                 <m:mrow>
                                    <m:msub>
                                       <m:mi>p</m:mi>
                                       <m:mrow>
                                          <m:mi>i</m:mi>
                                          <m:mi>j</m:mi>
                                       </m:mrow>
                                    </m:msub>
                                    <m:mo>&#8722;</m:mo>
                                    <m:msub>
                                       <m:mover accent="true">
                                          <m:mi>p</m:mi>
                                          <m:mo>&#175;</m:mo>
                                       </m:mover>
                                       <m:mi>j</m:mi>
                                    </m:msub>
                                 </m:mrow>
                                 <m:mrow>
                                    <m:msub>
                                       <m:mi>s</m:mi>
                                       <m:mi>j</m:mi>
                                    </m:msub>
                                 </m:mrow>
                              </m:mfrac>
                           </m:mrow>
                           <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmiCaaNbaKaadaWgaaWcbaGaemyAaKMaemOAaOgabeaakiabg2da9KqbaoaalaaabaGaemiCaa3aaSbaaeaacqWGPbqAcqWGQbGAaeqaaiabgkHiTiqbdchaWzaaraWaaSbaaeaacqWGQbGAaeqaaaqaaiabdohaZnaaBaaabaGaemOAaOgabeaaaaaaaa@3D49@</m:annotation>
                        </m:semantics>
                     </m:math>
                  </display-formula>
               </p>
               <p>Finally, we combine the scaled estimates into a single, more robust value by taking the mean across all environments</p>
               <p>
                  <display-formula>
                     <m:math name="1471-2156-9-35-i4" xmlns:m="http://www.w3.org/1998/Math/MathML">
                        <m:semantics>
                           <m:mrow>
                              <m:msub>
                                 <m:mi>y</m:mi>
                                 <m:mi>i</m:mi>
                              </m:msub>
                              <m:mo>=</m:mo>
                              <m:msubsup>
                                 <m:mi>n</m:mi>
                                 <m:mrow>
                                    <m:mi>e</m:mi>
                                    <m:mi>n</m:mi>
                                    <m:mi>v</m:mi>
                                 </m:mrow>
                                 <m:mrow>
                                    <m:mo>&#8722;</m:mo>
                                    <m:mn>1</m:mn>
                                 </m:mrow>
                              </m:msubsup>
                              <m:mstyle displaystyle="true">
                                 <m:munder>
                                    <m:mo>&#8721;</m:mo>
                                    <m:mi>j</m:mi>
                                 </m:munder>
                                 <m:mrow>
                                    <m:msub>
                                       <m:mover accent="true">
                                          <m:mi>p</m:mi>
                                          <m:mo>^</m:mo>
                                       </m:mover>
                                       <m:mrow>
                                          <m:mi>i</m:mi>
                                          <m:mi>j</m:mi>
                                       </m:mrow>
                                    </m:msub>
                                 </m:mrow>
                              </m:mstyle>
                           </m:mrow>
                           <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyEaK3aaSbaaSqaaiabdMgaPbqabaGccqGH9aqpcqWGUbGBdaqhaaWcbaGaemyzauMaemOBa4MaemODayhabaGaeyOeI0IaeGymaedaaOWaaabuaeaacuWGWbaCgaqcamaaBaaaleaacqWGPbqAcqWGQbGAaeqaaaqaaiabdQgaQbqab0GaeyyeIuoaaaa@3FAD@</m:annotation>
                        </m:semantics>
                     </m:math>
                  </display-formula>
               </p>
               <p>Note that missing values can be handled during the calculation of <it>s</it><sub><it>j </it></sub>and <inline-formula><m:math name="1471-2156-9-35-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mover accent="true"><m:mi>p</m:mi><m:mo>&#175;</m:mo></m:mover><m:mi>j</m:mi></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmiCaaNbaebadaWgaaWcbaGaemOAaOgabeaaaaa@2EDF@</m:annotation></m:semantics></m:math></inline-formula> by calculating the mean and standard deviation over available measurements only.</p>
               <p>These final values <it>y</it><sub><it>i </it></sub>are very similar to the results obtained by projecting the phenotypes onto their first principal component, because the <it>y</it><sub><it>i </it></sub>provide a good linear approximation to the full set <it>p</it><sub><it>ij</it></sub>. We verified this on the barley dataset by calculating the principal component projection and measuring its correlation with the values obtained by the above method. The mean correlation coefficient across all traits was 0.99.</p>
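               <p>As an illustrative sketch only (not the code used in this study), the scaling and averaging steps above could be written as follows. The function name <it>combine_environments</it>, the use of None for a missing measurement, and the row/column layout (plants by environments) are assumptions made for this example; it also assumes that every environment has at least two available measurements with non-zero spread.</p>

```python
from statistics import mean, stdev
from typing import List, Optional

# Rows are plants i, columns are environments j; None marks a missing value.
Matrix = List[List[Optional[float]]]


def combine_environments(p: Matrix) -> List[Optional[float]]:
    """Scale each environment to zero mean and unit sample standard deviation,
    then average the scaled values of each plant over its available environments."""
    n_env = len(p[0])
    scaled_columns = []
    for j in range(n_env):
        # Mean and standard deviation over available measurements only.
        observed = [row[j] for row in p if row[j] is not None]
        p_bar, s = mean(observed), stdev(observed)  # stdev divides by n - 1
        scaled_columns.append(
            [None if row[j] is None else (row[j] - p_bar) / s for row in p]
        )
    y: List[Optional[float]] = []
    for i in range(len(p)):
        values = [col[i] for col in scaled_columns if col[i] is not None]
        y.append(mean(values) if values else None)
    return y


if __name__ == "__main__":
    # Three plants, two environments, one missing measurement.
    print(combine_environments([[1.0, 10.0], [2.0, 20.0], [3.0, None]]))
```

               <p>In this sketch a plant with no measurement in any environment receives None rather than a phenotype value; how such plants should be treated is a design choice left open here.</p>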
            </sec>
         </sec>
         <sec>
            <st>
               <p>Synthetic datasets</p>
            </st>
            <p>Synthetic datasets were created using the <it>R/qtl </it>package <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. All datasets were simulated backcrosses of 100 individuals, with an additive model for the phenotype. Markers were positioned uniformly across the entire genome, with no missing values or genotyping errors. The Haldane mapping function was used to convert genetic distances to recombination fractions. QTL were placed at marker positions chosen at random with uniform probability. QTL strength (the difference between the homozygous and heterozygous genotypes) was drawn uniformly from the interval [&#8722;5, 5]. Normally distributed noise with mean 0 and variance 1 was added to the phenotype.</p>
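            <p>The datasets themselves were generated with <it>R/qtl</it>; the following Python sketch is only meant to make the simulation recipe concrete and is not a reimplementation of that package. It simulates a single chromosome, and the function names, the marker count, the marker spacing and the number of QTL are hypothetical choices for illustration. Genotypes are coded 0 (homozygous) and 1 (heterozygous), so that a QTL effect is the difference between the two genotypes.</p>

```python
import math
import random
from typing import Dict, List, Tuple


def haldane(d_cm: float) -> float:
    """Haldane map function: recombination fraction for a distance in cM."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))


def simulate_backcross(
    n_ind: int, n_markers: int, spacing_cm: float, n_qtl: int, rng: random.Random
) -> Tuple[List[List[int]], List[float], Dict[int, float]]:
    """Simulate one chromosome of a backcross with additive QTL effects."""
    r = haldane(spacing_cm)  # recombination fraction between adjacent markers
    # QTL sit at randomly chosen markers; effects are uniform on [-5, 5].
    effects = {k: rng.uniform(-5.0, 5.0) for k in rng.sample(range(n_markers), n_qtl)}
    genotypes: List[List[int]] = []
    phenotypes: List[float] = []
    for _ind in range(n_ind):
        g = [rng.randint(0, 1)]  # 0 = homozygous, 1 = heterozygous
        for _step in range(n_markers - 1):
            recombined = r > rng.random()  # true with probability r
            g.append(1 - g[-1] if recombined else g[-1])
        noise = rng.gauss(0.0, 1.0)  # mean 0, variance 1
        phenotypes.append(sum(e * g[k] for k, e in effects.items()) + noise)
        genotypes.append(g)
    return genotypes, phenotypes, effects


if __name__ == "__main__":
    geno, pheno, qtl = simulate_backcross(100, 20, 10.0, 3, random.Random(1))
    print(len(geno), len(pheno), sorted(qtl))
```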
         </sec>
         <sec>
            <st>
               <p>QTL machine-learning algorithm</p>
            </st>
            <p>The QTL detection algorithm is based on a few key concepts: a linear predictive model, recursive feature elimination, bootstrap resampling for estimation of model performance and marker effects, and generation of QTL profiles by local summation. Figure <figr fid="F1">1</figr> (left panel) shows a high-level overview of the data flow and processing steps involved in generating the QTL profiles. We now detail each concept.</p>
            <sec>
               <st>
                  <p>Linear predictive model</p>
               </st>
               <p>Underlying our whole technique is the assumption of linear dependence. We assume that contributions from markers are additive. Let <it>x</it><sub><it>ij </it></sub>be the genotype of plant <it>i </it>at marker <it>j</it>, and <inline-formula><m:math name="1471-2156-9-35-i5" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mover accent="true"><m:mi>x</m:mi><m:mo>&#8594;</m:mo></m:mover><m:mi>i</m:mi></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmiEaGNbaSaadaWgaaWcbaGaemyAaKgabeaaaaa@2EE7@</m:annotation></m:semantics></m:math></inline-formula> be the vector consisting of all markers from plant <it>i</it>. Under the linear assumption, the estimate of <it>y</it><sub><it>i </it></sub>for plant <it>i </it>is</p>
               <p>
                  <display-formula>
                     <m:math name="1471-2156-9-35-i6" xmlns:m="http://www.w3.org/1998/Math/MathML">
                        <m:semantics>
                           <m:mrow>
                              <m:mi>f</m:mi>
                              <m:mo stretchy="false">(</m:mo>
                              <m:msub>
                                 <m:mover accent="true">
                                    <m:mi>x</m:mi>
                                    <m:mo>&#8594;</m:mo>
                                 </m:mover>
                                 <m:mi>i</m:mi>
                              </m:msub>
                              <m:mo>;</m:mo>
                              <m:mover accent="true">
                                 <m:mi>&#946;</m:mi>
                                 <m:mo>&#8594;</m:mo>
                              </m:mover>
                              <m:mo>,</m:mo>
                              <m:mi>b</m:mi>
                              <m:mo stretchy="false">)</m:mo>
                              <m:mo>=</m:mo>
                              <m:mstyle displaystyle="true">
                                 <m:munder>
                                    <m:mo>&#8721;</m:mo>
                                    <m:mrow>
                                       <m:mi>k</m:mi>
                                       <m:mo>&#8712;</m:mo>
                                       <m:mi>K</m:mi>
                                    </m:mrow>
                                 </m:munder>
                                 <m:mrow>
                                    <m:msub>
                                       <m:mi>x</m:mi>
                                       <m:mrow>
                                          <m:mi>i</m:mi>
                                          <m:mi>k</m:mi>
                                       </m:mrow>
                                    </m:msub>
                                 </m:mrow>
                              </m:mstyle>
                              <m:msub>
                                 <m:mi>&#946;</m:mi>
                                 <m:mi>k</m:mi>
                              </m:msub>
                              <m:mo>+</m:mo>
                              <m:mi>b</m:mi>
                           </m:mrow>
                           <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemOzayMaeiikaGIafmiEaGNbaSaadaWgaaWcbaGaemyAaKgabeaakiabcUda7iqbek7aIzaalaGaeiilaWIaemOyaiMaeiykaKIaeyypa0ZaaabuaeaacqWG4baEdaWgaaWcbaGaemyAaKMaem4AaSgabeaaaeaacqWGRbWAcqGHiiIZcqWGlbWsaeqaniabggHiLdGccqaHYoGydaWgaaWcbaGaem4AaSgabeaakiabgUcaRiabdkgaIbaa@4812@</m:annotation>
                        </m:semantics>
                     </m:math>
                  </display-formula>
               </p>
               <p>where <it>K </it>is a set of markers, <it>x</it><sub><it>ik </it></sub>is the genotype of marker <it>k </it>for plant <it>i</it>, <inline-formula><m:math name="1471-2156-9-35-i7" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>&#946;</m:mi><m:mo>&#8594;</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafqOSdiMbaSaaaaa@2D88@</m:annotation></m:semantics></m:math></inline-formula> is the associated weight vector, and <it>b </it>is the bias parameter.</p>
               <p>The parameters <inline-formula><m:math name="1471-2156-9-35-i7" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>&#946;</m:mi><m:mo>&#8594;</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafqOSdiMbaSaaaaa@2D88@</m:annotation></m:semantics></m:math></inline-formula> and <it>b </it>are estimated from the training data using the well-known ridge regression algorithm <abbrgrp><abbr bid="B29">29</abbr><abbr bid="B30">30</abbr></abbrgrp>. In brief, ridge regression solves the least squares problem</p>
               <p>
                  <display-formula>
                     <m:math name="1471-2156-9-35-i8" xmlns:m="http://www.w3.org/1998/Math/MathML">
                        <m:semantics>
                           <m:mrow>
                              <m:mi>min</m:mi>
                              <m:mo>&#8289;</m:mo>
                              <m:mstyle displaystyle="true">
                                 <m:munder>
                                    <m:mo>&#8721;</m:mo>
                                    <m:mi>i</m:mi>
                                 </m:munder>
                                 <m:mrow>
                                    <m:msup>
                                       <m:mrow>
                                          <m:mo stretchy="false">(</m:mo>
                                          <m:msub>
                                             <m:mi>y</m:mi>
                                             <m:mi>i</m:mi>
                                          </m:msub>
                                          <m:mo>&#8722;</m:mo>
                                          <m:mi>f</m:mi>
                                          <m:mo stretchy="false">(</m:mo>
                                          <m:msub>
                                             <m:mover accent="true">
                                                <m:mi>x</m:mi>
                                                <m:mo>&#8594;</m:mo>
                                             </m:mover>
                                             <m:mi>i</m:mi>
                                          </m:msub>
                                          <m:mo>;</m:mo>
                                          <m:mover accent="true">
                                             <m:mi>&#946;</m:mi>
                                             <m:mo>&#8594;</m:mo>
                                          </m:mover>
                                          <m:mo>,</m:mo>
                                          <m:mi>b</m:mi>
                                          <m:mo stretchy="false">)</m:mo>
                                          <m:mo stretchy="false">)</m:mo>
                                       </m:mrow>
                                       <m:mn>2</m:mn>
                                    </m:msup>
                                    <m:mo>+</m:mo>
                                    <m:mi>&#955;</m:mi>
                                    <m:mstyle displaystyle="true">
                                       <m:munder>
                                          <m:mo>&#8721;</m:mo>
                                          <m:mi>k</m:mi>
                                       </m:munder>
                                       <m:mrow>
                                          <m:msubsup>
                                             <m:mi>&#946;</m:mi>
                                             <m:mi>k</m:mi>
                                             <m:mn>2</m:mn>
                                          </m:msubsup>
                                       </m:mrow>
                                    </m:mstyle>
                                 </m:mrow>
                              </m:mstyle>
                           </m:mrow>
                           <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGagiyBa0MaeiyAaKMaeiOBa42aaabuaeaacqGGOaakcqWG5bqEdaWgaaWcbaGaemyAaKgabeaakiabgkHiTiabdAgaMjabcIcaOiqbdIha4zaalaWaaSbaaSqaaiabdMgaPbqabaGccqGG7aWocuaHYoGygaWcaiabcYcaSiabdkgaIjabcMcaPiabcMcaPmaaCaaaleqabaGaeGOmaidaaOGaey4kaSIaeq4UdW2aaabuaeaacqaHYoGydaqhaaWcbaGaem4AaSgabaGaeGOmaidaaaqaaiabdUgaRbqab0GaeyyeIuoaaSqaaiabdMgaPbqab0GaeyyeIuoaaaa@4FC4@</m:annotation>
                        </m:semantics>
                     </m:math>
                  </display-formula>
               </p>
               <p>where the first term is the sum of squared residuals, the second term is the regulariser, and <it>&#955; </it>&gt; 0 is a tuning parameter that controls the amount of regularisation. The regulariser encodes a preference for smoother functions by shrinking the weights towards 0 (and towards each other); it gives a unique solution to the otherwise ill-posed minimisation problem and increases robustness against noise. For our QTL analyses, we set <it>&#955; </it>= 1.</p>
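               <p>To make the shrinkage effect concrete, consider the special case of a single marker. Because the bias <it>b </it>does not appear in the regulariser, it can be eliminated by centring, and the minimiser of the objective above is then the centred cross-product divided by the centred sum of squares plus <it>&#955;</it>. The following sketch (the function name <it>ridge_single_marker</it> is hypothetical and not part of our implementation) computes this closed form.</p>

```python
from typing import List, Tuple


def ridge_single_marker(
    x: List[float], y: List[float], lam: float = 1.0
) -> Tuple[float, float]:
    """Minimise sum_i (y_i - beta * x_i - b)^2 + lam * beta^2 for one marker.

    The bias b is not penalised, so it is removed by centring x and y.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta = s_xy / (s_xx + lam)  # lam pulls beta towards 0
    b = y_bar - beta * x_bar
    return beta, b


if __name__ == "__main__":
    x = [0.0, 1.0, 0.0, 1.0]
    y = [0.0, 2.0, 0.0, 2.0]
    print(ridge_single_marker(x, y, lam=0.0))  # ordinary least squares
    print(ridge_single_marker(x, y, lam=1.0))  # shrunken weight
```

               <p>With the toy data in the usage example, the unregularised weight is 2, whereas <it>&#955; </it>= 1 halves it to 1, since the centred sum of squares is also 1.</p>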
            </sec>
            <sec>
               <st>
                  <p>Recursive feature elimination</p>
               </st>
               <p>While a model over the entire set of markers is useful for predicting the phenotypic outcome, we wish to determine the key markers contributing to the genetic variation of traits. In other words, we seek a model with <it>K </it>of low cardinality (i.e. with a low number of elements in the set) that is sufficient for accurate phenotype prediction. This feature (marker) selection is performed by using Recursive Feature Elimination (RFE) to train and evaluate linear models ranging in size from all features to one feature.</p>
               <p>RFE commences with the full model using all features and then discards the least important feature. This process is recursively applied until a model of desired size is reached (we created models down to one marker). In coupling RFE with ridge regression (RFE-RIDGE), the importance can be estimated from the weights <inline-formula><m:math name="1471-2156-9-35-i7" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>&#946;</m:mi><m:mo>&#8594;</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafqOSdiMbaSaaaaa@2D88@</m:annotation></m:semantics></m:math></inline-formula> = (<it>&#946;</it><sub><it>k</it></sub>). As the model is linear and all markers have the same range, the absolute value |<it>&#946;</it><sub><it>k</it></sub>| is an estimate of the importance of the marker <it>k</it>. The <it>k</it><sup>th </sup>marker with minimal |<it>&#946;</it><sub><it>k</it></sub>| is deemed the least important and is discarded. Note that re-optimisation of <inline-formula><m:math name="1471-2156-9-35-i7" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>&#946;</m:mi><m:mo>&#8594;</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafqOSdiMbaSaaaaa@2D88@</m:annotation></m:semantics></m:math></inline-formula> after each discard is required as the exclusion of a feature will result in a redistribution of weights.</p>
               <p>More precisely, let <inline-formula><m:math name="1471-2156-9-35-i9" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msup><m:mover accent="true"><m:mi>&#946;</m:mi><m:mo>&#8594;</m:mo></m:mover><m:mrow><m:mo stretchy="false">(</m:mo><m:mi>t</m:mi><m:mo stretchy="false">)</m:mo></m:mrow></m:msup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafqOSdiMbaSaadaahaaWcbeqaaiabcIcaOiabdsha0jabcMcaPaaaaaa@30D8@</m:annotation></m:semantics></m:math></inline-formula> be the model obtained at time step <it>t </it>from applying ridge regression with the set of markers <it>M</it><sub><it>t</it></sub>. The initial model at time step <it>t = 1 </it>is fitted with all markers <it>M</it><sub>1 </sub>= {1,2,..., <it>m</it>}. At each time step, determine the least important feature as <inline-formula><m:math name="1471-2156-9-35-i10" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mi>&#958;</m:mi><m:mi>t</m:mi></m:msub><m:mo>=</m:mo><m:mi>a</m:mi><m:mi>r</m:mi><m:mi>g</m:mi><m:munder><m:mrow><m:mi>min</m:mi><m:mo>&#8289;</m:mo></m:mrow><m:mi>k</m:mi></m:munder><m:mrow><m:mo>|</m:mo><m:mrow><m:msubsup><m:mi>&#946;</m:mi><m:mi>k</m:mi><m:mrow><m:mo stretchy="false">(</m:mo><m:mi>t</m:mi><m:mo stretchy="false">)</m:mo></m:mrow></m:msubsup></m:mrow><m:mo>|</m:mo></m:mrow></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeqOVdG3aaSbaaSqaaiabdsha0bqabaGccqGH9aqpcqGGHbqycqGGYbGCcqGGNbWzdaWfqaqaaiGbc2gaTjabcMgaPjabc6gaUbWcbaGaem4AaSgabeaakmaaemaabaGaeqOSdi2aa0baaSqaaiabdUgaRbqaaiabcIcaOiabdsha0jabcMcaPaaaaOGaay5bSlaawIa7aaaa@4391@</m:annotation></m:semantics></m:math></inline-formula>. The new set of markers for the next time step is then <it>M</it><sub><it>t</it>+1 </sub>= <it>M</it><sub><it>t</it></sub>\{<it>&#958;</it><sub><it>t</it></sub>}.</p>
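               <p>The elimination loop can be sketched as follows. This is a minimal illustration under stated assumptions, not our implementation: the helper names (<it>solve</it>, <it>ridge_weights</it>, <it>rfe_ridge</it>) are hypothetical, the ridge weights are obtained from the regularised normal equations on centred data (which leaves the bias unpenalised), and the function returns only the order in which markers are discarded rather than the fitted model of each size.</p>

```python
from typing import List


def solve(a: List[List[float]], rhs: List[float]) -> List[float]:
    """Solve a z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [v] for row, v in zip(a, rhs)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(c + 1, n):
            factor = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= factor * aug[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def ridge_weights(
    x: List[List[float]], y: List[float], markers: List[int], lam: float
) -> List[float]:
    """Ridge weights restricted to the given markers (bias handled by centring)."""
    n, d = len(y), len(markers)
    y_bar = sum(y) / n
    means = [sum(row[k] for row in x) / n for k in markers]
    xc = [[row[k] - m for k, m in zip(markers, means)] for row in x]
    yc = [v - y_bar for v in y]
    gram = [
        [sum(r[a] * r[b] for r in xc) + (lam if a == b else 0.0) for b in range(d)]
        for a in range(d)
    ]
    rhs = [sum(r[a] * v for r, v in zip(xc, yc)) for a in range(d)]
    return solve(gram, rhs)


def rfe_ridge(x: List[List[float]], y: List[float], lam: float = 1.0) -> List[int]:
    """Return marker indices in order of elimination (least important first)."""
    markers = list(range(len(x[0])))  # M_1 contains all markers
    eliminated: List[int] = []
    while len(markers) > 1:
        beta = ridge_weights(x, y, markers, lam)  # refit after every discard
        worst = min(range(len(markers)), key=lambda a: abs(beta[a]))
        eliminated.append(markers.pop(worst))  # M_{t+1} = M_t without that marker
    eliminated.append(markers[0])
    return eliminated


if __name__ == "__main__":
    x = [[float(a), float(b), float(c)] for a in (0, 1) for b in (0, 1) for c in (0, 1)]
    y = [0.5 * a + 0.1 * b + 3.0 * c for a, b, c in x]
    print(rfe_ridge(x, y))  # markers ordered from least to most important
```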
            </sec>
            <sec>
               <st>
                  <p>Bootstrap resampling</p>
               </st>
               <p>To estimate the performance of models, we used the <it>&#949;</it>-0 bootstrap method <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>. As mentioned previously, this method samples the original dataset with replacement to create a training set and uses all remaining un-sampled instances as an independent test set (Figure <figr fid="F1">1</figr>, right panel). Models are built on the training set, and the test set is reserved for evaluating model performance. This process was repeated 50 times.</p>
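               <p>A single round of this resampling scheme can be sketched as below. The function names are hypothetical, and the sketch assumes that each training set contains as many draws as there are individuals, which is the usual bootstrap convention but is an assumption here.</p>

```python
import random
from typing import List, Tuple


def bootstrap_split(n: int, rng: random.Random) -> Tuple[List[int], List[int]]:
    """Draw n training indices with replacement; un-sampled indices form the test set."""
    train = [rng.randrange(n) for _ in range(n)]
    drawn = set(train)
    test = [i for i in range(n) if i not in drawn]
    return train, test


def bootstrap_splits(
    n: int, n_rounds: int = 50, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Repeat the split n_rounds times with a reproducible random stream."""
    rng = random.Random(seed)
    return [bootstrap_split(n, rng) for _ in range(n_rounds)]


if __name__ == "__main__":
    for train, test in bootstrap_splits(100, n_rounds=3):
        print(len(set(train)), "distinct training individuals,", len(test), "test individuals")
```

               <p>Under this convention, roughly 37% of the individuals are expected to be left out of each training set and therefore serve as test instances in that round.</p>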
            </sec>
            <sec>
               <st>
                  <p>Evaluation of models and estimation of marker contributions</p>
               </st>
               <p>To evaluate the performance of a model we used the fraction of variance explained as a criterion. Suppose we have a model (<inline-formula><m:math name="1471-2156-9-35-i7" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>&#946;</m:mi><m:mo>&#8594;</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafqOSdiMbaSaaaaa@2D88@</m:annotation></m:semantics></m:math></inline-formula>, <it>b</it>) and we wish to evaluate the variance explained on some test set <it>T</it>. Then, the variance explained is defined as</p>
               <p>
                  <display-formula>
                     <m:math name="1471-2156-9-35-i11" xmlns:m="http://www.w3.org/1998/Math/MathML">
                        <m:semantics>
                           <m:mrow>
                              <m:msup>
                                 <m:mi>r</m:mi>
                                 <m:mn>2</m:mn>
                              </m:msup>
                              <m:mo stretchy="false">(</m:mo>
                              <m:mover accent="true">
                                 <m:mi>&#946;</m:mi>
                                 <m:mo>&#8594;</m:mo>
                              </m:mover>
                              <m:mo>,</m:mo>
                              <m:mi>b</m:mi>
                              <m:mo stretchy="false">)</m:mo>
                              <m:mo>=</m:mo>
                              <m:mn>1</m:mn>
                              <m:mo>&#8722;</m:mo>
                              <m:mi>min</m:mi>
                              <m:mo>&#8289;</m:mo>
                              <m:mrow>
                                 <m:mo>(</m:mo>
                                 <m:mrow>
                                    <m:mfrac>
                                       <m:mrow>
                                          <m:mstyle displaystyle="true">
                                             <m:munder>
                                                <m:mo>&#8721;</m:mo>
                                                <m:mrow>
                                                   <m:mi>i</m:mi>
                                                   <m:mo>&#8712;</m:mo>
                                                   <m:mi>T</m:mi>
                                                </m:mrow>
                                             </m:munder>
                                             <m:mrow>
                                                <m:msup>
                                                   <m:mrow>
                                                      <m:mo stretchy="false">(</m:mo>
                                                      <m:msub>
                                                         <m:mi>y</m:mi>
                                                         <m:mi>i</m:mi>
                                                      </m:msub>
                                                      <m:mo>&#8722;</m:mo>
                                                      <m:mi>f</m:mi>
                                                      <m:mo stretchy="false">(</m:mo>
                                                      <m:msub>
                                                         <m:mover accent="true">
                                                            <m:mi>x</m:mi>
                                                            <m:mo>&#8594;</m:mo>
                                                         </m:mover>
                                                         <m:mi>i</m:mi>
                                                      </m:msub>
                                                      <m:mo>;</m:mo>
                                                      <m:mover accent="true">
                                                         <m:mi>&#946;</m:mi>
                                                         <m:mo>&#8594;</m:mo>
                                                      </m:mover>
                                                      <m:mo>,</m:mo>
                                                      <m:mi>b</m:mi>
                                                      <m:mo stretchy="false">)</m:mo>
                                                      <m:mo stretchy="false">)</m:mo>
                                                   </m:mrow>
                                                   <m:mn>2</m:mn>
                                                </m:msup>
                                             </m:mrow>
                                          </m:mstyle>
                                       </m:mrow>
                                       <m:mrow>
                                          <m:mstyle displaystyle="true">
                                             <m:munder>
                                                <m:mo>&#8721;</m:mo>
                                                <m:mrow>
                                                   <m:mi>i</m:mi>
                                                   <m:mo>&#8712;</m:mo>
                                                   <m:mi>T</m:mi>
                                                </m:mrow>
                                             </m:munder>
                                             <m:mrow>
                                                <m:msup>
                                                   <m:mrow>
                                                      <m:mo stretchy="false">(</m:mo>
                                                      <m:msub>
                                                         <m:mi>y</m:mi>
                                                         <m:mi>i</m:mi>
                                                      </m:msub>
                                                      <m:mo>&#8722;</m:mo>
                                                      <m:mover accent="true">
                                                         <m:mi>y</m:mi>
                                                         <m:mo>&#175;</m:mo>
                                                      </m:mover>
                                                      <m:mo stretchy="false">)</m:mo>
                                                   </m:mrow>
                                                   <m:mn>2</m:mn>
                                                </m:msup>
                                             </m:mrow>
                                          </m:mstyle>
                                       </m:mrow>
                                    </m:mfrac>
                                    <m:mo>,</m:mo>
                                    <m:mn>1</m:mn>
                                 </m:mrow>
                                 <m:mo>)</m:mo>
                              </m:mrow>
                           </m:mrow>
                           <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemOCai3aaWbaaSqabeaacqaIYaGmaaGccqGGOaakcuaHYoGygaWcaiabcYcaSiabdkgaIjabcMcaPiabg2da9iabigdaXiabgkHiTiGbc2gaTjabcMgaPjabc6gaUnaabmaabaqcfa4aaSaaaeaadaaeqbqaaiabcIcaOiabdMha5naaBaaabaGaemyAaKgabeaacqGHsislcqWGMbGzcqGGOaakcuWG4baEgaWcamaaBaaabaGaemyAaKgabeaacqGG7aWocuaHYoGygaWcaiabcYcaSiabdkgaIjabcMcaPiabcMcaPmaaCaaabeqaaiabikdaYaaaaeaacqWGPbqAcqGHiiIZcqWGubavaeqacqGHris5aaqaamaaqafabaGaeiikaGIaemyEaK3aaSbaaeaacqWGPbqAaeqaaiabgkHiTiqbdMha5zaaraGaeiykaKYaaWbaaeqabaGaeGOmaidaaaqaaiabdMgaPjabgIGiolabdsfaubqabiabggHiLdaaaOGaeiilaWIaeGymaedacaGLOaGaayzkaaaaaa@655A@</m:annotation>
                        </m:semantics>
                     </m:math>
                  </display-formula>
               </p>
               <p>where <inline-formula><m:math name="1471-2156-9-35-i12" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mover accent="true"><m:mi>y</m:mi><m:mo>&#175;</m:mo></m:mover><m:mo>=</m:mo><m:mfrac><m:mn>1</m:mn><m:mrow><m:mrow><m:mo>|</m:mo><m:mi>T</m:mi><m:mo>|</m:mo></m:mrow></m:mrow></m:mfrac><m:mstyle displaystyle="true"><m:munder><m:mo>&#8721;</m:mo><m:mrow><m:mi>i</m:mi><m:mo>&#8712;</m:mo><m:mi>T</m:mi></m:mrow></m:munder><m:mrow><m:msub><m:mi>y</m:mi><m:mi>i</m:mi></m:msub></m:mrow></m:mstyle></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmyEaKNbaebacqGH9aqpjuaGdaWcaaqaaiabigdaXaqaamaaemaabaGaemivaqfacaGLhWUaayjcSdaaaOWaaabuaeaacqWG5bqEdaWgaaWcbaGaemyAaKgabeaaaeaacqWGPbqAcqGHiiIZcqWGubavaeqaniabggHiLdaaaa@3DD1@</m:annotation></m:semantics></m:math></inline-formula>. Because the ratio is capped at 1, <it>r</it><sup>2 </sup>lies between 0 and 1, and a model that predicts no better than the mean phenotype scores 0. This measure provides an overall estimate of the predictive performance of a given model.</p>
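               <p>The clipped <it>r</it><sup>2 </sup>defined above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this study; the function name clipped_r2 and its list-based inputs are assumptions introduced here, and the two lists are taken to hold the observed and predicted phenotypes for the individuals in <it>T</it>.</p>

```python
def clipped_r2(y_true, y_pred):
    """Return 1 - min(SS_res / SS_tot, 1), so the score lies in [0, 1]."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    y_bar = sum(y_true) / len(y_true)  # mean phenotype over the evaluation set T
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))  # numerator
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)              # denominator
    if ss_tot == 0:
        raise ValueError("phenotype has zero variance on the evaluation set")
    # Capping the ratio at 1 keeps the score from going negative.
    return 1.0 - min(ss_res / ss_tot, 1.0)
```

               <p>In this sketch a perfect prediction yields 1, whereas any prediction whose squared error is at least the total sum of squares yields 0.</p>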
               <p>In addition to evaluating the model, a measure of the contribution of individual markers is needed to locate putative QTL. This contribution can be quantified by recasting the task as a novelty-detection problem: we wish to quantify the additional predictive power provided by each marker given an already selected set of markers. We measure this degree of novelty using the models built with RFE-RIDGE. As RFE-RIDGE produces nested subsets of selected markers, we can attribute the change in variance explained to the marker that was removed between two consecutive models. More precisely, let <inline-formula><m:math name="1471-2156-9-35-i13" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mi>m</m:mi><m:mi>l</m:mi></m:msub><m:mo>=</m:mo><m:mo stretchy="false">(</m:mo><m:msub><m:mover accent="true"><m:mi>&#946;</m:mi><m:mo>&#8594;</m:mo></m:mover><m:mi>l</m:mi></m:msub><m:mo>,</m:mo><m:msub><m:mi>b</m:mi><m:mi>l</m:mi></m:msub><m:mo stretchy="false">)</m:mo><m:mo>=</m:mo><m:mo stretchy="false">(</m:mo><m:mo stretchy="false">(</m:mo><m:msub><m:mi>&#946;</m:mi><m:mrow><m:mi>k</m:mi><m:mi>l</m:mi></m:mrow></m:msub><m:mo stretchy="false">)</m:mo><m:mo>,</m:mo><m:msub><m:mi>b</m:mi><m:mi>l</m:mi></m:msub><m:mo stretchy="false">)</m:mo><m:mo>&#8712;</m:mo><m:msup><m:mtext>R</m:mtext><m:mrow><m:msub><m:mi>n</m:mi><m:mrow><m:mi>m</m:mi><m:mi>r</m:mi><m:mi>k</m:mi></m:mrow></m:msub></m:mrow></m:msup><m:mo>&#215;</m:mo><m:mtext>R</m:mtext></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyBa02aaSbaaSqaaiabdYgaSbqabaGccqGH9aqpcqGGOaakcuaHYoGygaWcamaaBaaaleaacqWGSbaBaeqaaOGaeiilaWIaemOyai2aaSbaaSqaaiabdYgaSbqabaGccqGGPaqkcqGH9aqpcqGGOaakcqGGOaakcqaHYoGydaWgaaWcbaGaem4AaSMaemiBaWgabeaakiabcMcaPiabcYcaSiabdkgaInaaBaaaleaacqWGSbaBaeqaaOGaeiykaKIaeyicI4SaeeOuai1aaWbaaSqabeaacqWGUbGBdaWgaaadbaGaemyBa0MaemOCaiNaem4AaSgabeaaaaGccqGHxdaTcqqGsbGuaaa@5143@</m:annotation></m:semantics></m:math></inline-formula> be the sequence of models of decreasing size, i.e. #{<it>k </it>| <it>&#946;</it><sub><it>kl </it></sub>&#8800; 0} > #{<it>k </it>| <it>&#946;</it><sub><it>k</it>(<it>l</it>+1) </sub>&#8800; 0}, and <it>d</it><sub><it>l </it></sub>be the marker eliminated between <it>m</it><sub><it>l </it></sub>and <it>m</it><sub><it>l</it>+1</sub>. Then</p>
               <p>
                  <display-formula>&#916;<it>r</it><sup>2 </sup>(<it>d</it><sub><it>l</it></sub>) = <it>r</it><sup>2 </sup>(<it>m</it><sub><it>l</it></sub>) - <it>r</it><sup>2 </sup>(<it>m</it><sub><it>l</it>+1</sub>)</display-formula>
               </p>
               <p>is a measure of the novelty of a marker with respect to all the remaining markers in the model. We expect that a key QTL marker will be novel in this sense and result in a large change of variance explained when dropped from the model. The average over the bootstrap iterations provides a robust estimate of the importance of each marker to trait prediction. This estimate is referred to as <inline-formula><m:math name="1471-2156-9-35-i14" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mover accent="true"><m:mrow><m:mi>&#916;</m:mi><m:msup><m:mi>r</m:mi><m:mn>2</m:mn></m:msup><m:mo stretchy="false">(</m:mo><m:msub><m:mi>d</m:mi><m:mi>l</m:mi></m:msub><m:mo stretchy="false">)</m:mo></m:mrow><m:mo stretchy="true">&#175;</m:mo></m:mover></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWaa0aaaeaacqqHuoarcqWGYbGCdaahaaWcbeqaaiabikdaYaaakiabcIcaOiabdsgaKnaaBaaaleaacqWGSbaBaeqaaOGaeiykaKcaaaaa@347C@</m:annotation></m:semantics></m:math></inline-formula>.</p>
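               <p>The attribution of &#916;<it>r</it><sup>2 </sup>to eliminated markers, and its averaging over bootstrap iterations, can be illustrated as follows. This is a sketch under stated assumptions rather than the code used in this study: the names delta_r2 and mean_delta_r2 are hypothetical, each bootstrap run is assumed to supply the <it>r</it><sup>2 </sup>values of its nested models together with the markers eliminated between them, and a marker that is never eliminated in a given run is assumed to contribute zero for that run.</p>

```python
from collections import defaultdict


def delta_r2(r2_path, eliminated):
    """Attribute r2(m_l) - r2(m_{l+1}) to the marker d_l dropped between them.

    r2_path: r2 of the nested models m_0, m_1, ... in order of decreasing size.
    eliminated: marker removed at each step, so it is one item shorter.
    """
    if len(r2_path) != len(eliminated) + 1:
        raise ValueError("need exactly one more r2 value than eliminated markers")
    return {d: r2_path[l] - r2_path[l + 1] for l, d in enumerate(eliminated)}


def mean_delta_r2(bootstrap_runs):
    """Average delta r2 per marker over all bootstrap iterations.

    Assumption: a marker not eliminated in a run adds nothing for that run.
    """
    totals = defaultdict(float)
    for r2_path, eliminated in bootstrap_runs:
        for marker, delta in delta_r2(r2_path, eliminated).items():
            totals[marker] += delta
    n_runs = len(bootstrap_runs)
    return {marker: total / n_runs for marker, total in totals.items()}
```

               <p>A marker whose removal consistently lowers <it>r</it><sup>2 </sup>accumulates a large mean value, whereas a marker that is redundant given the remaining markers stays close to zero.</p>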
            </sec>
            <sec>
               <st>
                  <p>Generation of QTL profiles</p>
               </st>
               <p>The information provided by &#916;<it>r</it><sup>2 </sup>(<it>d</it><sub><it>l</it></sub>) is immediately useful; we can examine which markers are found to have significant contributions. If a linkage map is available, we can use it to create graphs similar to conventional QTL profiles by simply plotting <inline-formula><m:math name="1471-2156-9-35-i14" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mover accent="true"><m:mrow><m:mi>&#916;</m:mi><m:msup><m:mi>r</m:mi><m:mn>2</m:mn></m:msup><m:mo stretchy="false">(</m:mo><m:msub><m:mi>d</m:mi><m:mi>l</m:mi></m:msub><m:mo stretchy="false">)</m:mo></m:mrow><m:mo stretchy="true">&#175;</m:mo></m:mover></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWaa0aaaeaacqqHuoarcqWGYbGCdaahaaWcbeqaaiabikdaYaaakiabcIcaOiabdsgaKnaaBaaaleaacqWGSbaBaeqaaOGaeiykaKcaaaaa@347C@</m:annotation></m:semantics></m:math></inline-formula> vs. the marker positions. However, the <inline-formula><m:math name="1471-2156-9-35-i14" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mover accent="true"><m:mrow><m:mi>&#916;</m:mi><m:msup><m:mi>r</m:mi><m:mn>2</m:mn></m:msup><m:mo stretchy="false">(</m:mo><m:msub><m:mi>d</m:mi><m:mi>l</m:mi></m:msub><m:mo stretchy="false">)</m:mo></m:mrow><m:mo stretchy="true">&#175;</m:mo></m:mover></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWaa0aaaeaacqqHuoarcqWGYbGCdaahaaWcbeqaaiabikdaYaaakiabcIcaOiabdsgaKnaaBaaaleaacqWGSbaBaeqaaOGaeiykaKcaaaaa@347C@</m:annotation></m:semantics></m:math></inline-formula> value of a particular genetic location is sometimes 'spread out' among a few highly correlated (genetically close) markers, due to the linkage disequilibrium between the markers and the QTL. This effect can be reduced by smoothing the results based on the positions of markers on a genetic map; for the experiments on barley we smoothed the curves by applying a summing window of 5 cM to collect the contributions of genetically close markers. The 5 cM size was chosen because it provides a good balance between resolution and smoothness.</p>
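               <p>The summing window can be sketched as below. The exact window definition is not spelled out in the text, so this illustration assumes a window of total width 5 cM centred on each marker and restricted to a single chromosome; the function name smooth_profile is hypothetical.</p>

```python
def smooth_profile(positions_cm, scores, window_cm=5.0):
    """Sum the scores of all markers inside a centred window around each marker.

    positions_cm: marker positions (cM) on one chromosome.
    scores: mean delta r2 of each marker, in the same order.
    Assumption: the window spans window_cm / 2 on either side of the marker.
    """
    if len(positions_cm) != len(scores):
        raise ValueError("positions_cm and scores must have the same length")
    half = window_cm / 2.0
    smoothed = []
    for centre in positions_cm:
        # Collect the contributions of genetically close markers.
        total = sum(s for p, s in zip(positions_cm, scores) if half >= abs(p - centre))
        smoothed.append(total)
    return smoothed
```

               <p>With this choice, a signal split across several tightly linked markers is pooled into a single peak, while markers further apart than half the window width do not influence each other.</p>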
               <p>Finally, there are two methods for determining a 95% significance threshold. In the first, we assumed that the smoothed <inline-formula><m:math name="1471-2156-9-35-i14" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mover accent="true"><m:mrow><m:mi>&#916;</m:mi><m:msup><m:mi>r</m:mi><m:mn>2</m:mn></m:msup><m:mo stretchy="false">(</m:mo><m:msub><m:mi>d</m:mi><m:mi>l</m:mi></m:msub><m:mo stretchy="false">)</m:mo></m:mrow><m:mo stretchy="true">&#175;</m:mo></m:mover></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWaa0aaaeaacqqHuoarcqWGYbGCdaahaaWcbeqaaiabikdaYaaakiabcIcaOiabdsgaKnaaBaaaleaacqWGSbaBaeqaaOGaeiykaKcaaaaa@347C@</m:annotation></m:semantics></m:math></inline-formula> values were gamma distributed. This assumption is supported by previous literature showing that QTL effects are gamma distributed <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>, and the 95% threshold can then be determined by fitting a gamma distribution. In the second, which applies when no smoothing is used, p-values are estimated empirically from the bootstrap replicates using a standard one-sample t-test.</p>
            </sec>
         </sec>
         <sec>
            <st>
               <p>QTL classification performance</p>
            </st>
            <p>The Area under the Receiver Operating Characteristic (AROC) <abbrgrp><abbr bid="B19">19</abbr></abbrgrp> is a general measure of classification performance. We used it to evaluate QTL profiles for simulated data where the QTL positions are known. Let <it>s</it><sub><it>i </it></sub>be a score (for example, the apportioned variance explained produced by SML) for each marker <it>i</it>, <it>Q </it>be the set of indices of 'QTL markers' and <it>N </it>be the set of indices of 'non-QTL markers'. The AROC is then given by</p>
            <p>
               <display-formula>
                  <m:math name="1471-2156-9-35-i15" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>P</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:msub>
                              <m:mi>s</m:mi>
                              <m:mi>i</m:mi>
                           </m:msub>
                           <m:mo>></m:mo>
                           <m:msub>
                              <m:mi>s</m:mi>
                              <m:mi>j</m:mi>
                           </m:msub>
                           <m:mo>|</m:mo>
                           <m:mi>i</m:mi>
                           <m:mo>&#8712;</m:mo>
                           <m:mi>Q</m:mi>
                           <m:mo>,</m:mo>
                           <m:mi>j</m:mi>
                           <m:mo>&#8712;</m:mo>
                           <m:mi>N</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>+</m:mo>
                           <m:mfrac>
                              <m:mn>1</m:mn>
                              <m:mn>2</m:mn>
                           </m:mfrac>
                           <m:mi>P</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:msub>
                              <m:mi>s</m:mi>
                              <m:mi>i</m:mi>
                           </m:msub>
                           <m:mo>=</m:mo>
                           <m:msub>
                              <m:mi>s</m:mi>
                              <m:mi>j</m:mi>
                           </m:msub>
                           <m:mo>|</m:mo>
                           <m:mi>i</m:mi>
                           <m:mo>&#8712;</m:mo>
                           <m:mi>Q</m:mi>
                           <m:mo>,</m:mo>
                           <m:mi>j</m:mi>
                           <m:mo>&#8712;</m:mo>
                           <m:mi>N</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemiuaaLaeiikaGIaem4Cam3aaSbaaSqaaiabdMgaPbqabaGccqGH+aGpcqWGZbWCdaWgaaWcbaGaemOAaOgabeaakiabcYha8jabdMgaPjabgIGiolabdgfarjabcYcaSiabdQgaQjabgIGiolabd6eaojabcMcaPiabgUcaRKqbaoaalaaabaGaeGymaedabaGaeGOmaidaaOGaemiuaaLaeiikaGIaem4Cam3aaSbaaSqaaiabdMgaPbqabaGccqGH9aqpcqWGZbWCdaWgaaWcbaGaemOAaOgabeaakiabcYha8jabdMgaPjabgIGiolabdgfarjabcYcaSiabdQgaQjabgIGiolabd6eaojabcMcaPaaa@5837@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>Given a finite set of scores the AROC can simply be estimated by counting:</p>
            <p>
               <display-formula>
                  <m:math name="1471-2156-9-35-i16" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mfrac>
                              <m:mn>1</m:mn>
                              <m:mrow>
                                 <m:mrow>
                                    <m:mo>|</m:mo>
                                    <m:mi>Q</m:mi>
                                    <m:mo>|</m:mo>
                                 </m:mrow>
                                 <m:mrow>
                                    <m:mo>|</m:mo>
                                    <m:mi>N</m:mi>
                                    <m:mo>|</m:mo>
                                 </m:mrow>
                              </m:mrow>
                           </m:mfrac>
                           <m:mstyle displaystyle="true">
                              <m:munder>
                                 <m:mo>&#8721;</m:mo>
                                 <m:mrow>
                                    <m:mi>i</m:mi>
                                    <m:mo>&#8712;</m:mo>
                                    <m:mi>Q</m:mi>
                                    <m:mo>,</m:mo>
                                    <m:mi>j</m:mi>
                                    <m:mo>&#8712;</m:mo>
                                    <m:mi>N</m:mi>
                                 </m:mrow>
                              </m:munder>
                              <m:mrow>
                                 <m:mrow>
                                    <m:mo>{</m:mo>
                                    <m:mrow>
                                       <m:mtable>
                                          <m:mtr>
                                             <m:mtd>
                                                <m:mn>1</m:mn>
                                             </m:mtd>
                                             <m:mtd>
                                                <m:mrow>
                                                   <m:mtext>if&#160;</m:mtext>
                                                   <m:msub>
                                                      <m:mi>s</m:mi>
                                                      <m:mi>i</m:mi>
                                                   </m:msub>
                                                   <m:mo>></m:mo>
                                                   <m:msub>
                                                      <m:mi>s</m:mi>
                                                      <m:mi>j</m:mi>
                                                   </m:msub>
                                                </m:mrow>
                                             </m:mtd>
                                          </m:mtr>
                                          <m:mtr>
                                             <m:mtd>
                                                <m:mrow>
                                                   <m:mn>0.5</m:mn>
                                                </m:mrow>
                                             </m:mtd>
                                             <m:mtd>
                                                <m:mrow>
                                                   <m:mtext>if&#160;</m:mtext>
                                                   <m:msub>
                                                      <m:mi>s</m:mi>
                                                      <m:mi>i</m:mi>
                                                   </m:msub>
                                                   <m:mo>=</m:mo>
                                                   <m:msub>
                                                      <m:mi>s</m:mi>
                                                      <m:mi>j</m:mi>
                                                   </m:msub>
                                                </m:mrow>
                                             </m:mtd>
                                          </m:mtr>
                                          <m:mtr>
                                             <m:mtd>
                                                <m:mn>0</m:mn>
                                             </m:mtd>
                                             <m:mtd>
                                                <m:mrow>
                                                   <m:mtext>otherwise</m:mtext>
                                                </m:mrow>
                                             </m:mtd>
                                          </m:mtr>
                                       </m:mtable>
                                    </m:mrow>
                                 </m:mrow>
                              </m:mrow>
                           </m:mstyle>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqcfa4aaSaaaeaacqaIXaqmaeaadaabdaqaaiabdgfarbGaay5bSlaawIa7amaaemaabaGaemOta4eacaGLhWUaayjcSdaaaOWaaabuaeaadaGabaqaauaabeqadiaaaeaacqaIXaqmaeaacqqGPbqAcqqGMbGzcqqGGaaicqWGZbWCdaWgaaWcbaGaemyAaKgabeaakiabg6da+iabdohaZnaaBaaaleaacqWGQbGAaeqaaaGcbaGaeGimaaJaeiOla4IaeGynaudabaGaeeyAaKMaeeOzayMaeeiiaaIaem4Cam3aaSbaaSqaaiabdMgaPbqabaGccqGH9aqpcqWGZbWCdaWgaaWcbaGaemOAaOgabeaaaOqaaiabicdaWaqaaiabb+gaVjabbsha0jabbIgaOjabbwgaLjabbkhaYjabbEha3jabbMgaPjabbohaZjabbwgaLbaaaiaawUhaaaWcbaGaemyAaKMaeyicI4SaemyuaeLaeiilaWIaemOAaOMaeyicI4SaemOta4eabeqdcqGHris5aaaa@68A2@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
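            <p>The counting estimate above translates directly into code. The following is a minimal sketch; the function name aroc and the convention that every index not listed as a QTL marker belongs to <it>N </it>are assumptions made for illustration.</p>

```python
def aroc(scores, qtl_indices):
    """Estimate the AROC by counting over all (QTL, non-QTL) marker pairs.

    scores: one score s_i per marker.
    qtl_indices: indices of the QTL markers (set Q); all others form N.
    """
    qtl = set(qtl_indices)
    non_qtl = [j for j in range(len(scores)) if j not in qtl]
    if not qtl or not non_qtl:
        raise ValueError("need at least one QTL and one non-QTL marker")
    total = 0.0
    for i in qtl:
        for j in non_qtl:
            if scores[i] > scores[j]:
                total += 1.0   # QTL marker ranked above the non-QTL marker
            elif scores[i] == scores[j]:
                total += 0.5   # ties count as one half
    return total / (len(qtl) * len(non_qtl))
```

            <p>A value of 1 means that every QTL marker outscores every non-QTL marker, and a value of 0.5 corresponds to a ranking no better than chance.</p>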
         </sec>
         <sec>
            <st>
               <p>Other QTL-mapping methods</p>
            </st>
            <sec>
               <st>
                  <p>Single Marker Regression (MR)</p>
               </st>
               <p>To obtain the fraction of variance explained for individual markers, the Pearson correlation coefficient between the marker and the phenotype was squared. A phenotype permutation test of 1,000 iterations was used to derive empirical 95% significance thresholds for genome profiles of variance explained <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>.</p>
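               <p>Single marker regression with a permutation threshold can be sketched as follows. This is an illustration under stated assumptions, not the procedure as implemented for this study: the function names are hypothetical, markers are assumed to be numerically coded vectors with no missing values, and the threshold is taken as the 95th percentile of the genome-wide maximum obtained under permutation, which is one common reading of an empirical genome-profile threshold.</p>

```python
import random


def pearson_r2(x, y):
    """Squared Pearson correlation between a marker vector and the phenotype."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant marker or phenotype explains no variance
    return sxy * sxy / (sxx * syy)


def permutation_threshold(markers, phenotype, n_perm=1000, level=0.95, seed=0):
    """Empirical threshold from shuffling phenotypes against genotypes.

    markers: list of marker vectors, each with one value per individual.
    Assumption: the statistic recorded per permutation is the genome-wide
    maximum r2, and the threshold is its `level` quantile.
    """
    rng = random.Random(seed)
    shuffled = list(phenotype)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the genotype-phenotype association
        maxima.append(max(pearson_r2(m, shuffled) for m in markers))
    maxima.sort()
    index = min(int(level * n_perm), n_perm - 1)
    return maxima[index]
```

               <p>Markers whose observed squared correlation exceeds the returned value would be declared significant at the chosen level in this sketch.</p>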
            </sec>
            <sec>
               <st>
                  <p>Composite Interval Mapping (CIM)</p>
               </st>
               <p>QTL were also identified by CIM using Cartographer 2.5 software <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B35">35</abbr><abbr bid="B36">36</abbr></abbrgrp>. The program settings were adjusted to scan the genome at a walk speed of 1 cM. The 20 most important markers, selected by forward stepwise regression outside a 10-cM window on either side of the markers flanking the test site, were used to adjust for the genetic background <abbrgrp><abbr bid="B36">36</abbr></abbrgrp>. Experiment-wise 95% significance thresholds for likelihood-ratio genome profiles were estimated using a permutation test based on shuffling genotypes against phenotypes <abbrgrp><abbr bid="B33">33</abbr><abbr bid="B37">37</abbr></abbrgrp>.</p>
            </sec>
            <sec>
               <st>
                  <p>Bayesian Interval Mapping (BIM)</p>
               </st>
               <p>Finally, SML was also benchmarked against BIM <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> using the R package <it>qtlbim </it><abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. The algorithm was restricted to analysis at marker positions only and not within intervals. Two types of genome profiles were used in experiments &#8211; Bayes Factor (BF) profiles for QTL detection, and 'heritability profiles' (i.e. variance explained) for estimating QTL effects. The number of QTL was also estimated using Bayes factors.</p>
            </sec>
         </sec>
         <sec>
            <st>
               <p>Comparisons of QTL profiles</p>
            </st>
            <p>The QTL profiles generated by different methods were compared by computing the Pearson correlation coefficient between the genome profiles of variance explained. For the comparison between different map versions (comprising unequal numbers of markers or bins), the genome scans were first approximated by loess curves based on 1,000 evenly spaced loci.</p>
            <p>Statistically significant QTL were identified for each method by recording the cM positions of peak maxima in genome-wide plots of variance explained (<it>p </it>&lt; 0.05). Each contiguous stretch of above-threshold markers was considered to belong to a single QTL peak. Small clusters of above-threshold markers at less than 5 cM distance from such a stretch of markers (if present) were considered to be part of the shoulder of the same QTL peak. The overlap between the sets of QTL identified using different methods (or map versions) was quantified by counting the instances in which they detected significant QTL within 10 cM of each other.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>List of abbreviations</p>
         </st>
         <p>BF, Bayes Factor; BIM, Bayesian Interval Mapping; CIM, composite interval mapping; DArT, diversity arrays technology; DH, doubled haploid; LOD score, logarithm-of-odds ratio in favour of a QTL; LOD<sub>error</sub>, logarithm of odds value in favour of genotyping error; MIM, multiple interval mapping; MR, single marker regression; QTL, quantitative trait locus/loci; RFE, recursive feature elimination; RFE-RIDGE, recursive feature elimination &#8211; ridge regression; RFLP, restriction fragment length polymorphism; SIM, simple interval mapping; SML, statistical machine-learning; SSR, simple sequence repeat.</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>JB developed and tested the SML algorithm and phenotype pre-processing procedure, performed the BIM analyses and drafted part of the manuscript. PW provided intellectual input during the development and testing of SML algorithms, built the Steptoe/Morex map, performed the CIM analysis, compared the results of the various QTL methods and drafted part of the manuscript. AKo supervised the development of the SML algorithm and co-edited the manuscript. AKi provided intellectual input during the development and testing of SML algorithms and designed and drafted part of the manuscript. All authors read and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>AKo and JB acknowledge permission of NICTA to publish this paper. NICTA is funded by the Australian Government's Department of Communications, Information Technology and the Arts and the Australian Research Council through Backing Australia's Ability and the ICT Centre of Excellence program. Diversity Arrays Technology Pty Ltd acknowledges financial contribution to this work from the Grains Research and Development Corporation (GRDC).</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Mapping quantitative trait loci in plants: uses and caveats for evolutionary biology</p>
            </title>
            <aug>
               <au>
                  <snm>Mauricio</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Nat Rev Genetics</source>
            <pubdate>2001</pubdate>
            <volume>2</volume>
            <fpage>370</fpage>
            <lpage>381</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1038/35072085</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Present and future of quantitative trait locus analysis in plant breeding</p>
            </title>
            <aug>
               <au>
                  <snm>As&#237;ns</snm>
                  <fnm>MJ</fnm>
               </au>
            </aug>
            <source>Plant Breed</source>
            <pubdate>2002</pubdate>
            <volume>121</volume>
            <fpage>281</fpage>
            <lpage>291</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1046/j.1439-0523.2002.730285.x</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Mapping and analysis of quantitative trait loci in experimental populations</p>
            </title>
            <aug>
               <au>
                  <snm>Doerge</snm>
                  <fnm>RW</fnm>
               </au>
            </aug>
            <source>Nat Rev Genetics</source>
            <pubdate>2002</pubdate>
            <volume>3</volume>
            <fpage>43</fpage>
            <lpage>52</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1038/nrg703</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps</p>
            </title>
            <aug>
               <au>
                  <snm>Lander</snm>
                  <fnm>ES</fnm>
               </au>
               <au>
                  <snm>Botstein</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Genetics</source>
            <pubdate>1989</pubdate>
            <volume>121</volume>
            <fpage>185</fpage>
            <lpage>199</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1203601</pubid>
                  <pubid idtype="pmpid" link="fulltext">2563713</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Precision mapping of quantitative trait loci</p>
            </title>
            <aug>
               <au>
                  <snm>Zeng</snm>
                  <fnm>ZB</fnm>
               </au>
            </aug>
            <source>Genetics</source>
            <pubdate>1994</pubdate>
            <volume>136</volume>
            <fpage>1457</fpage>
            <lpage>1468</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1205924</pubid>
                  <pubid idtype="pmpid" link="fulltext">8013918</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Multiple interval mapping for quantitative trait loci</p>
            </title>
            <aug>
               <au>
                  <snm>Kao</snm>
                  <fnm>CH</fnm>
               </au>
               <au>
                  <snm>Zeng</snm>
                  <fnm>ZB</fnm>
               </au>
               <au>
                  <snm>Teasdale</snm>
                  <fnm>RD</fnm>
               </au>
            </aug>
            <source>Genetics</source>
            <pubdate>1999</pubdate>
            <volume>152</volume>
            <fpage>1203</fpage>
            <lpage>1216</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1460657</pubid>
                  <pubid idtype="pmpid" link="fulltext">10388834</pubid>
               </pubidlist>
            </xre