<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1758-2946-1-11</ui>
   <ji>1758-2946</ji>
   <fm>
		<dochead>Research article</dochead>
		<bibl>
			<title>
				<p>DPRESS: Localizing estimates of predictive uncertainty</p>
			</title>
			<aug>
				<au ca="yes" id="A1">
					<snm>Clark</snm>
					<mi>D</mi>
					<fnm>Robert</fnm>
					<insr iid="I1"/>
					<insr iid="I2"/>
					<email>bclark@bcmetrics.com</email>
				</au>
			</aug>
			<insg>
				<ins id="I1">
					<p>Biochemical Infometrics, 827 Renee Lane, Creve Coeur MO 63141, USA</p>
				</ins>
				<ins id="I2">
					<p>School of Informatics, Indiana University, 901 E 10th St, Bloomington IN 47408, USA</p>
				</ins>
			</insg>
			<source>Journal of Cheminformatics</source>
			<issn>1758-2946</issn>
			<pubdate>2009</pubdate>
			<volume>1</volume>
			<issue>1</issue>
			<fpage>11</fpage>
			<url>http://www.jcheminf.com/content/1/1/11</url>
			<xrefbib>
				
			<pubidlist><pubid idtype="pmpid">20298517</pubid><pubid idtype="doi">10.1186/1758-2946-1-11</pubid></pubidlist></xrefbib>
		</bibl>
		<history>
			<rec>
				<date>
					<day>04</day>
					<month>3</month>
					<year>2009</year>
				</date>
			</rec>
			<acc>
				<date>
					<day>14</day>
					<month>7</month>
					<year>2009</year>
				</date>
			</acc>
			<pub>
				<date>
					<day>14</day>
					<month>7</month>
					<year>2009</year>
				</date>
			</pub>
		</history>
		<cpyrt>
			<year>2009</year>
			<collab>Clark; licensee BioMed Central Ltd.</collab>
			<note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
		</cpyrt>
		<abs>
			<sec>
				<st>
					<p>Abstract</p>
				</st>
				<sec>
					<st>
						<p>Background</p>
					</st>
					<p>The need to have a quantitative estimate of the uncertainty of prediction for QSAR models is steadily increasing, in part because such predictions are being widely distributed as tabulated values disconnected from the models used to generate them. Classical statistical theory assumes that the error in the population being modeled is independent and identically distributed (IID), but this is often not actually the case. Such inhomogeneous error (heteroskedasticity) can be addressed by providing an individualized estimate of predictive uncertainty for each particular new object <it>u</it>: the standard error of prediction <it>s</it>
						<sub>u </sub>can be estimated as the non-cross-validated error <it>s</it>
						<sub>t* </sub>for the closest object <it>t</it>* in the training set adjusted for its separation <it>d </it>from <it>u </it>in the descriptor space relative to the size of the training set.</p>
					<p>
						<display-formula>
							<graphic file="1758-2946-1-11-i1.gif"/>
						</display-formula>
					</p>
					<p>The predictive uncertainty factor <it>&#947;</it>
						<sub>t* </sub>is obtained by distributing the internal predictive error sum of squares across objects in the training set based on the distances between them, hence the acronym: <it>D</it>istributed <it>PR</it>edictive <it>E</it>rror <it>S</it>um of <it>S</it>quares (DPRESS). Note that <it>s</it>
						<sub>t* </sub>and <it>&#947;</it>
						<sub>t* </sub>are characteristic of each training set compound contributing to the model of interest.</p>
				</sec>
				<sec>
					<st>
						<p>Results</p>
					</st>
					<p>The method was applied to partial least-squares models built from 2D (molecular hologram) or 3D (molecular field) descriptors using mid-sized training sets (<it>N </it>= 75) drawn from a large (<it>N </it>= 304), well-characterized pool of cyclooxygenase inhibitors. The observed variation in predictive error for the 229-compound external test sets was compared with the uncertainty estimates from DPRESS. Good qualitative and quantitative agreement was seen between the distributions of predictive error observed and those predicted using DPRESS. Inclusion of the distance-dependent term was essential to getting good agreement between the estimated uncertainties and the observed distributions of predictive error. The uncertainty estimates derived by DPRESS were conservative even when the training set was biased, but not excessively so.</p>
				</sec>
				<sec>
					<st>
						<p>Conclusion</p>
					</st>
					<p>DPRESS is a straightforward and powerful way to reliably estimate individual predictive uncertainties for compounds outside the training set based on their distance to the training set and the internal predictive uncertainty associated with its nearest neighbor in that set. It represents a sample-based, <it>a posteriori </it>approach to defining applicability domains in terms of localized uncertainty.</p>
				</sec>
			</sec>
		</abs>
	</fm>
   <bdy>
		<sec>
			<st>
				<p>Background</p>
			</st>
			<p>Early work on quantitative structure-activity relationships (QSAR) was primarily concerned with relating select physical properties to <it>in vivo </it>biological activity <abbrgrp>
					<abbr bid="B1">1</abbr>
					<abbr bid="B2">2</abbr>
				</abbrgrp>. Ordinary least squares regression (multiple linear regression) was the analytical tool of choice, and the statistical questions addressed focused on whether a particular descriptor was significant or not. QSAR methods soon evolved, however, into ways of identifying optimal physical properties rather than simply trends, a shift accomplished by fitting to quadratic and bilinear equations. This development was spurred in no small part by the desire to identify optimal octanol/water partition coefficients (logP), generally in pursuit of optimal <it>in vivo </it>activity.</p>
			<p>The focus for pharmaceutical drug discovery subsequently shifted from <it>in vivo </it>testing to <it>in vitro </it>evaluation of interactions between candidate ligands and isolated enzymes or receptors. This change brought with it a shift of descriptors from measurable properties of compounds to computationally estimated properties of molecules, with the calculations in question often being based on (sub)structural descriptors. The next step was to take into account descriptors that were based on molecular structure but were not themselves measurable physical properties. Often these were more or less local in nature, and the purpose of the analysis shifted from identifying significant underlying relationships with the descriptors to identifying optimal substituents or substitution patterns. Interest in artificial neural networks (ANNs) <abbrgrp>
					<abbr bid="B3">3</abbr>
				</abbrgrp> and partial least squares with projection onto latent structures (PLS) <abbrgrp>
					<abbr bid="B4">4</abbr>
				</abbrgrp> as analytical tools increased at the same time. Questions related to validity of the model as a whole took center stage as the number of descriptors available proliferated <abbrgrp>
					<abbr bid="B5">5</abbr>
					<abbr bid="B6">6</abbr>
				</abbrgrp>, followed closely by a strong interest in predictivity and how best to establish applicability domains <abbrgrp>
					<abbr bid="B7">7</abbr>
					<abbr bid="B8">8</abbr>
					<abbr bid="B9">9</abbr>
					<abbr bid="B10">10</abbr>
					<abbr bid="B11">11</abbr>
					<abbr bid="B12">12</abbr>
					<abbr bid="B13">13</abbr>
					<abbr bid="B14">14</abbr>
					<abbr bid="B15">15</abbr>
				</abbrgrp>.</p>
			<p>Today, however, the overall statistical properties of a particular QSAR are less relevant to medicinal chemists or environmental regulatory agencies. Recent pressure to simultaneously reduce clinical failures, ensure the safety of bulk chemicals <abbrgrp>
					<abbr bid="B16">16</abbr>
					<abbr bid="B17">17</abbr>
					<abbr bid="B18">18</abbr>
				</abbrgrp> and reduce testing on animals has led to an increasing reliance on models for predicting off-target biological effects and toxicity. This use of QSAR models entails applications to more structurally diverse compounds, but it also changes the relative importance of different kinds of mistakes. If a structure is predicted to have a much higher affinity for the target than it actually does, the cost to a lead optimization program is limited to the synthetic resources wasted on that particular structure. Even that cost is mitigated if something useful was learned about the underlying structure-activity relationship (SAR) in the process. Such a false positive error in predictive toxicology, however, may mean that a life-saving (and profitable) drug never gets commercialized. Compounds mistakenly predicted to be inactive &#8211; false negatives &#8211; represent a missed opportunity in the context of lead optimization, but they have the potential to be downright catastrophic (and ruinous) in the context of predictive toxicology.</p>
			<p>Such considerations put a premium on being able to make a quantitative estimate of how reliable an <it>individual </it>prediction obtained from a given model is. What is more, answers to the question, "How reliable are the predictions about this <it>particular </it>molecule that I am considering for synthesis, clinical evaluation or registration?" are often most relevant for extrapolations to structures near the "outside" edges of the descriptor space defined by the training set. Hence, to be of practical use, constraints on applicability domains need to be "soft" &#8211; i.e., increase with distance from the descriptor space covered by the training set &#8211; but "hard" enough to indicate just how far outside the training set one can safely expect to go. They also need to provide a robust quantitative estimate of predictive reliability that is sensitive to local variations in the descriptor space. This paper presents a novel methodology for doing exactly that based on how close a new compound is to those in the training set and the distribution of internal predictive error across compounds in that set.</p>
			<sec>
				<st>
					<p>Classical statistical theory</p>
				</st>
				<p>The underlying model for linear regression on a vector <b>X </b>of <it>p </it>independent variables is reflected in Eq. 1, wherein <it>Y </it>is the response variable of interest, <it>&#956;</it>
					<sub>Y </sub>is the population mean of <it>Y</it>, <b>&#946; </b>is a vector representing the sensitivities of <it>Y </it>to changes in <b>X</b>, and <b>x </b>is a vector of deviations in <b>X </b>from the population centroid <b>&#956;</b>
					<sub>X</sub>.</p>
				<p>
					<display-formula id="M1">
						<graphic file="1758-2946-1-11-i2.gif"/>
					</display-formula>
				</p>
				<p>As indicated in Eq. 1, the error <it>&#949; </it>is assumed to be normally distributed with mean 0 and a standard deviation <it>&#963;</it>
					<sub>X</sub>. Best linear unbiased estimators (BLUEs) for the various parameters in Eq. 1 can be calculated from a sample <it>T</it>
					<sub>0 </sub>of <it>n </it>observations (in QSAR, compounds) drawn from the full population, provided several preconditions are met <abbrgrp>
						<abbr bid="B19">19</abbr>
					</abbrgrp>:</p>
				<p indent="1">
					<b>1</b>. the strict linear dependence of <it>Y </it>on <b>X </b>set out in Eq. 1 applies across the population;</p>
				<p indent="1">
					<b>2</b>. the sample is random and unbiased;</p>
				<p indent="1">
					<b>3</b>. the descriptors contributing to <b>X </b>are mutually independent in a statistical sense; and</p>
				<p indent="1">
					<b>4</b>. the error distribution <it>&#949; </it>is <it>homoskedastic </it>and independent of <b>X </b>and <it>Y </it>&#8211; i.e., its standard deviation is the same everywhere in the descriptor space, so <it>&#963;</it>
					<sub>X </sub>= <it>&#963; </it>for all <b>X</b>.</p>
				<p>The corresponding regression estimators for each individual observation <it>i </it>and the overall standard error of regression <it>s</it>
					<sub>FIT </sub>are then given as shown in Eqs. 2 and 3.</p>
				<p>
					<display-formula id="M2">
						<graphic file="1758-2946-1-11-i3.gif"/>
					</display-formula>
				</p>
				<p>
					<display-formula id="M3">
						<graphic file="1758-2946-1-11-i4.gif"/>
					</display-formula>
				</p>
				<p>where <inline-formula>
						<graphic file="1758-2946-1-11-i5.gif"/>
					</inline-formula> is the mean value of <it>Y </it>for the sample; <b>x</b>
					<sub>i </sub>= <b>X</b>
					<sub>i </sub>- <b>X</b>
					<sub>0</sub>, with <b>X</b>
					<sub>0 </sub>being the sample centroid for <b>X</b>; and <inline-formula>
						<graphic file="1758-2946-1-11-i6.gif"/>
					</inline-formula> is the predicted value of <it>Y </it>at <b>X</b>
					<sub>i </sub>
					<abbrgrp>
						<abbr bid="B19">19</abbr>
					</abbrgrp>. Note that <it>s</it>
					<sub>FIT </sub>is greater than the root mean square error (RMSE); this is because the means <inline-formula>
						<graphic file="1758-2946-1-11-i5.gif"/>
					</inline-formula> and <b>X</b>
					<sub>0 </sub>and the calculated coefficient vector <b>b </b>are themselves estimates that are subject to sampling error, with 1 and <it>p </it>degrees of freedom, respectively.</p>
				<p>Under these assumptions, the potential error in estimating <it>Y </it>increases as one moves away from the centroid <b>X</b>
					<sub>0</sub>. As a result, the uncertainty <it>s</it>
					<sub>u </sub>in predicting the value of <it>Y </it>at some new ("unknown") value <b>X</b>
					<sub>u </sub>is generally greater than <it>s</it>
					<sub>FIT</sub>. In fact, under the assumptions given above <abbrgrp>
						<abbr bid="B19">19</abbr>
					</abbrgrp>:</p>
				<p>
					<display-formula id="M4">
						<graphic file="1758-2946-1-11-i7.gif"/>
					</display-formula>
				</p>
				<p>where <it>s</it>
					<sub>u </sub>is the expected standard error of prediction (uncertainty) for the new observation <it>u </it>and <it>n </it>is the number of training set observations <it>t </it>used to build the model. The Mahalanobis distances <it>d</it>
					<sub>0, u </sub>and <it>d</it>
					<sub>0, t </sub>are measured in the model space defined by <b>b</b>, i.e., they are weighted Euclidean distances between the centroid <b>X</b>
					<sub>0 </sub>of the descriptor matrix for the training set and the vectors <b>X</b>
					<sub>u </sub>and <b>X</b>
					<sub>t</sub>, respectively.</p>
				<p>The rationale behind the "extra" terms in Eq. 4 is straightforward. For any random sample, the error involved in using <inline-formula>
						<graphic file="1758-2946-1-11-i5.gif"/>
					</inline-formula> as an estimate of <it>&#956;</it>
					<sub>Y </sub>is inversely proportional to <it>n </it>&#8211; hence the 1/<it>n </it>term in Eq. 4. In addition, the error with which <b>&#946; </b>is estimated by <b>b </b>is inversely proportional to how thoroughly <b>X </b>is sampled by the training set, but how much difference that makes to the predictive error is directly proportional to the distance <it>d</it>
					<sub>0, u </sub>between <b>X</b>
					<sub>u </sub>and <b>X</b>
					<sub>0 </sub>in the model space. Together these countervailing effects of variation in <b>X </b>account for the second term within the outer brackets.</p>
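				<p>The behavior described by Eq. 4 can be illustrated with a minimal sketch for the special case of a single descriptor (<it>p </it>= 1), where the distance term reduces to the squared deviation from the sample centroid divided by the summed squared deviations of the training set. The function names and the toy data are hypothetical, and the expression used is the standard textbook form of the prediction error for simple linear regression, which is assumed here to correspond to Eq. 4.</p>
				<p><![CDATA[
```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = y0 + b*(x - x0) for one descriptor."""
    n = len(xs)
    x0 = sum(xs) / n                      # sample centroid of the descriptor
    y0 = sum(ys) / n                      # sample mean of the response
    sxx = sum((x - x0) ** 2 for x in xs)  # summed squared distances to the centroid
    b = sum((x - x0) * (y - y0) for x, y in zip(xs, ys)) / sxx
    return x0, y0, sxx, b


def classical_prediction_error(xs, ys, x_new):
    """Return (s_fit, s_u) for a new descriptor value x_new (assumed form of Eq. 4)."""
    n, p = len(xs), 1
    x0, y0, sxx, b = fit_line(xs, ys)
    sse = sum((y - (y0 + b * (x - x0))) ** 2 for x, y in zip(xs, ys))
    s_fit = math.sqrt(sse / (n - p - 1))
    # 1 (new observation) + 1/n (error in the mean) + distance term (error in b)
    s_u = s_fit * math.sqrt(1.0 + 1.0 / n + (x_new - x0) ** 2 / sxx)
    return s_fit, s_u


# Toy data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]
s_fit, s_center = classical_prediction_error(xs, ys, 3.5)   # at the centroid
_, s_far = classical_prediction_error(xs, ys, 12.0)         # far from the centroid
print(s_fit, s_center, s_far)
```
]]></p>
				<p>Even at the centroid the predicted uncertainty exceeds <it>s</it><sub>FIT </sub>because of the 1/<it>n </it>term, and it grows as the new point moves away from the training set.</p>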
			</sec>
			<sec>
				<st>
					<p>Dealing with violated assumptions</p>
				</st>
				<p>The value of <it>s</it>
					<sub>u </sub>produced by Eq. 4 is a best linear unbiased estimator of <it>&#963;</it>
					<sub>u </sub>&#8211; <it>provided the assumptions underlying its derivation hold</it>. Unfortunately, one or more of those assumptions are violated in most QSAR applications. In particular:</p>
				<p indent="1">
					<b>1</b>. the dependence of <it>Y </it>on <b>X </b>rarely fits the prescribed function perfectly, linear or otherwise;</p>
				<p indent="1">
					<b>2</b>. the training set used is usually a non-random sample, its selection biased by accident and convenience in ways that reflect the historical trajectory of the synthesis program that motivated the analysis;</p>
				<p indent="1">
					<b>3</b>. the descriptors contributing to <b>X </b>are often correlated to a greater or lesser degree and hence are not independent variables in the statistical sense (correlation implies lack of independence, but the inverse is not true: lack of correlation does not imply statistical independence); and</p>
				<p indent="1">
					<b>4</b>. <it>&#949; </it>is usually heteroskedastic &#8211; its standard deviation <it>&#963;</it>
					<sub>X </sub>is often different in different regions of the descriptor space.</p>
				<p>Most or all of the assumptions are, in fact, explicitly violated when ANNs, PLS, variable selection, quadratic regression, or bilinear regression techniques are applied, with the result that <it>s</it>
					<sub>FIT </sub>and the estimator given by Eq. 4 underestimate the actual uncertainty of prediction, often drastically.</p>
				<p>Several groups have derived theoretical variations of Eq. 4 for use with PLS and principal component analysis (PCA) that seek to address departures from ideality <abbrgrp>
						<abbr bid="B20">20</abbr>
						<abbr bid="B21">21</abbr>
						<abbr bid="B22">22</abbr>
					</abbrgrp>. Unfortunately, subsequent work has demonstrated that these methods are often not robust when applied in realistic situations <abbrgrp>
						<abbr bid="B23">23</abbr>
					</abbrgrp>.</p>
				<p>An alternative, completely empirical approach to assessing aggregate predictive uncertainty is cross-validation, in which each compound in the training set is held back in turn <abbrgrp>
						<abbr bid="B24">24</abbr>
					</abbrgrp>. The value of <it>Y </it>for the held-back compound is then predicted using a model built from the other <it>n </it>- 1 compounds in the training subset <it>T</it>
					<sub>u </sub>= <it>T</it>
					<sub>0 </sub>- {<it>u</it>}. In parallel to Eq. 3, the standard error of cross-validation <it>s</it>
					<sub>CV </sub>is calculated from the predictive error sum of squares (PRESS) according to Eq. 5:</p>
				<p>
					<display-formula id="M5">
						<graphic file="1758-2946-1-11-i8.gif"/>
					</display-formula>
				</p>
				<p>where <inline-formula>
						<graphic file="1758-2946-1-11-i9.gif"/>
					</inline-formula> is the value of <it>Y </it>predicted by applying the reduced model built from the <it>n </it>- 1 compounds in training subset <it>T</it>
					<sub>u </sub>to <b>X</b>
					<sub>u </sub>and <inline-formula>
						<graphic file="1758-2946-1-11-i10.gif"/>
					</inline-formula> is the corresponding predictive error. The summation is indexed across <it>u </it>to emphasize that prediction is external to the training subset used in each case. Here <it>p </it>represents the number of PLS components included in the model rather than the number of descriptors.</p>
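				<p>The leave-one-out procedure can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used here: a one-descriptor ordinary least squares fit stands in for PLS, the helper names and data are hypothetical, and the divisor <it>n </it>- <it>p </it>- 1 is assumed by analogy with the standard error of regression.</p>
				<p><![CDATA[
```python
import math


def predict_without(xs, ys, u):
    """Predict ys[u] from a one-descriptor OLS model built without compound u."""
    tx = [x for i, x in enumerate(xs) if i != u]
    ty = [y for i, y in enumerate(ys) if i != u]
    x0 = sum(tx) / len(tx)
    y0 = sum(ty) / len(ty)
    sxx = sum((x - x0) ** 2 for x in tx)
    b = sum((x - x0) * (y - y0) for x, y in zip(tx, ty)) / sxx
    return y0 + b * (xs[u] - x0)


def loo_cross_validation(xs, ys, p=1):
    """Return (errors, press, s_cv) from leave-one-out cross-validation."""
    n = len(xs)
    # Each error comes from a model that never saw the compound being predicted.
    errors = [ys[u] - predict_without(xs, ys, u) for u in range(n)]
    press = sum(e * e for e in errors)
    # Assumed divisor n - p - 1, by analogy with the standard error of regression.
    s_cv = math.sqrt(press / (n - p - 1))
    return errors, press, s_cv


# Toy data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]
errors, press, s_cv = loo_cross_validation(xs, ys)
print(press, s_cv)
```
]]></p>
				<p>The individual squared errors are retained alongside their sum because it is these per-compound terms, not the aggregate PRESS, that a localized estimate of uncertainty has to work from.</p>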
				<p>Cross-validation statistics were originally employed in PLS solely as a way to determine an optimal model complexity, a role for which the classical goodness-of-fit measure <it>r</it>
					<sup>2 </sup>used in ordinary least squares is unsuited <abbrgrp>
						<abbr bid="B24">24</abbr>
					</abbrgrp>. It has since come to be widely used to assess predictivity, however. This use is unfortunate, in that a poorly predictive model will have a high <it>s</it>
					<sub>CV </sub>and a low <it>q</it>
					<sup>2</sup>, but the converse may or may not be true: good cross-validation statistics may be due to redundancies in the training set rather than truly robust predictive performance <abbrgrp>
						<abbr bid="B25">25</abbr>
						<abbr bid="B26">26</abbr>
						<abbr bid="B27">27</abbr>
						<abbr bid="B28">28</abbr>
					</abbrgrp>. Some workers prefer to use "leave-some-out" cross-validation &#8211; in which several compounds are held back together &#8211; to address this problem. Nonetheless, the leave-one-out (LOO) standard error is the best estimate of the full model's predictivity for each individual compound in the training set <abbrgrp>
						<abbr bid="B29">29</abbr>
					</abbrgrp>, which makes it a reasonable starting point for estimating a model's predictive reliability for structures occupying nearby points in the descriptor space.</p>
				<p>Violation <b>4 </b>&#8211; that error is not identically and independently distributed across compounds &#8211; is especially problematic for QSAR analyses. In one recently described case in point, the variation in predictive error was clearly correlated with one of the two descriptors being used <abbrgrp>
						<abbr bid="B7">7</abbr>
					</abbrgrp>. If that is true when many descriptors are involved (as is the case for PLS), the <it>overall </it>variability in predictive error should be similar across the full range of <it>Y</it>. Such a distribution of error is, in fact, often seen in place of the quadratically increasing spread implied by Eq. 4 <abbrgrp>
						<abbr bid="B30">30</abbr>
					</abbrgrp>. This makes it all too easy to leap to the unjustified conclusion that the aggregate predictive uncertainty &#8211; typically <it>s</it>
					<sub>CV </sub>or the root mean square error of prediction for an external test set (RMSEP or <it>s</it>
					<sub>PRED</sub>) &#8211; is a reliable indicator of the level of uncertainty associated with <it>individual </it>predictions: independence from <it>Y </it>does not imply independence from <b>X</b>.</p>
			</sec>
			<sec>
				<st>
					<p>Partitioning the PRESS</p>
				</st>
				<p>The increasing reliance of drug developers on tabulations of predicted properties makes getting accurate estimates of the uncertainty <it>&#963;</it>
					<sub>u </sub>for individual predictions critically important. Unfortunately, it is rarely if ever possible to construct a unified global model for the dependence of <it>&#963;</it>
					<sub>u </sub>on <b>X</b>. It is neither necessary nor even desirable to do so, however. A better approach is to shift from the classical, descriptor-based view of regression to a sample-based formalism such as that used in the SAMPLS algorithm <abbrgrp>
						<abbr bid="B31">31</abbr>
					</abbrgrp>. This algorithm exploits the fact that Eq. 2 can be recast as Eq. 6 without loss of generality:</p>
				<p>
					<display-formula id="M6">
						<graphic file="1758-2946-1-11-i11.gif"/>
					</display-formula>
				</p>
				<p>where <b>c</b>
					<sub>t, i </sub>= [<it>x</it>
					<sub>t1</sub>
					<it>x</it>
					<sub>i1 </sub>
					<it>x</it>
					<sub>t2</sub>
					<it>x</it>
					<sub>i2 </sub>... <it>x</it>
					<sub>tp</sub>
					<it>x</it>
					<sub>ip</sub>] is the covariance between <b>x</b>
					<sub>t </sub>and <b>x</b>
					<sub>i </sub>and <b>v</b>
					<sub>t </sub>is a weight vector that is specific to compound <it>t</it>. Basically, Eq. 6 says that activity can be expressed as a linear function of the similarities of each compound to each of the other compounds in the training set. This suggests that the observed predictive error <it>e</it>
					<sub>u </sub>can be cast as a sum of contributions from the compounds in the training set, each weighted by its similarity to <it>u</it>, which is consistent with the observation that predictive error tends to increase with distance from &#8211; i.e., tends to decrease with increasing similarity to &#8211; compounds in the training set <abbrgrp>
						<abbr bid="B11">11</abbr>
						<abbr bid="B12">12</abbr>
						<abbr bid="B15">15</abbr>
					</abbrgrp>. If <it>t</it>* is the closest (i.e., the most similar) such compound, its standard error (<it>s</it>
					<sub>t*</sub>) is a reasonable first approximation to the predictive error <it>s</it>
					<sub>u </sub>for a new compound. In most QSAR applications, a single response value <it>Y</it>
					<sub>t </sub>is assigned to each compound in the training set, so the best estimate of <it>s</it>
					<sub>t </sub>is simply |<it>e</it>
					<sub>t</sub>|, where <it>e</it>
					<sub>t </sub>is the deviation seen for <it>t </it>in the full, non-cross-validated model, i.e., the residual error of fitting.</p>
				<p>Though the "true" dependence of predictive uncertainty on the Euclidean distance <it>d</it>
					<sub>t*, u </sub>from <it>t</it>* is unknown, it can likely be approximated by a Taylor expansion in which all but the first, linear term in <it>d </it>are dropped. Taken together, these considerations yield the estimator defined by Eq. 7:</p>
				<p>
					<display-formula id="M7">
						<graphic file="1758-2946-1-11-i12.gif"/>
					</display-formula>
				</p>
				<p>where <it>d</it>
					<sub>00 </sub>is the length of the vector <b>x</b>
					<sub>00 </sub>defined by the standard deviations of the descriptors; <it>d</it>
					<sub>00 </sub>= 1 when descriptors have been centered and autoscaled, as was the case here.</p>
				<p>The problem then becomes one of estimating the predictive error <it>&#947;</it>
					<sub>t </sub>associated with each compound <it>t </it>in the training set. PLS tends to overfit, so this term is likely to be greater than <it>s</it>
					<sub>t*</sub>; otherwise Eq. 7 would parallel Eq. 4 exactly, except for the loss of the 1/<it>n </it>aggregation term within the brackets. Instead, one can turn to the squared predictive errors collected during cross-validation. In the calculation of the aggregate predictive uncertainty <it>s</it>
					<sub>CV </sub>(Eq. 5), these are lumped into a single sum &#8211; the PRESS. If, however, one assumes that contributions from nearby training set compounds dominate the predictive error and, further, that the value of <it>&#947; </it>will be comparable for the training subset compounds closest to each individual compound <it>u</it>, the contribution <inline-formula>
						<graphic file="1758-2946-1-11-i13.gif"/>
					</inline-formula> that cross-validation of the <it>i</it>
					<sup>th </sup>compound makes to the PRESS can more appropriately be distributed across the training subset in inverse proportion to the distances between <b>X</b>
					<sub>i </sub>and the <it>n </it>- 1 compounds used to predict <it>Y</it>
					<sub>i </sub>(Eq. 8 and Fig. <figr fid="F1">1</figr>). A similar approach is taken to distributing response variance across the various sources of deviation from the mean in classical analysis of variance (ANOVA).</p>
				<fig id="F1">
					<title>
						<p>Figure 1</p>
					</title>
					<caption>
						<p>Schematic representation of predictive error distribution in DPRESS</p>
					</caption>
					<text>
						<p>
							<b>Schematic representation of predictive error distribution in DPRESS</b>. The arrow weights indicate how much of the error made in predicting the response for the held-out compound (open symbol) is distributed among the compounds in the training set (solid symbols) when calculating the scaling factors <it>&#947;</it>
							<sub>t</sub>. The data set comprises 13 observations in a two-dimensional descriptor space. Each panel represents one of the 13 separate analyses that make up the full leave-one-out (LOO) cross-validation run; only four of the 13 are shown.</p>
					</text>
					<graphic file="1758-2946-1-11-1"/>
				</fig>
				<p>
					<display-formula id="M8">
						<graphic file="1758-2946-1-11-i14.gif"/>
					</display-formula>
				</p>
				<p>The normalization factor <it>&#945;</it>
					<sub>i </sub>in Eq. 9 is necessary to ensure that the distribution is a partition &#8211; i.e., that the contributions from the cross-validation step in which compound <it>i </it>was set aside sum to the observed cross-validation error in prediction <inline-formula>
						<graphic file="1758-2946-1-11-i13.gif"/>
					</inline-formula>.</p>
				<p>
					<display-formula id="M9">
						<graphic file="1758-2946-1-11-i15.gif"/>
					</display-formula>
				</p>
				<p>A small constant (1/<it>n</it>) needs to be included to prevent the reciprocal from "exploding" at small distances. Basically, it dictates the distance at which error is expected to distribute evenly. The choice of this particular value is somewhat arbitrary, but 1/<it>n </it>works well and nicely accommodates the tendency of data points to get closer together as the training set gets larger. Taken together, Eqs. 7&#8211;9 define the <it>D</it>istributed <it>PR</it>edictive <it>E</it>rror <it>S</it>um of <it>S</it>quares (DPRESS) approach to estimating predictive uncertainty.</p>
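				<p>The partitioning step can be sketched as below. This is a hedged illustration of the idea rather than a transcription of Eqs. 8 and 9: each squared cross-validation error is spread over the other <it>n </it>- 1 compounds with weights assumed to be the reciprocal of the Euclidean distance plus 1/<it>n</it>, and a per-compound normalization factor makes those weights sum to one. The function name, the toy coordinates and the toy errors are hypothetical, and the final conversion of the accumulated shares into the factors <it>&#947;</it><sub>t </sub>is deliberately left out.</p>
				<p><![CDATA[
```python
import math


def distribute_press(points, loo_errors):
    """Spread each squared LOO error over the other training compounds.

    points     -- descriptor vectors, one per training compound
    loo_errors -- cross-validated prediction error for each compound
    Returns the squared error accumulated by each compound.
    """
    n = len(points)
    floor = 1.0 / n                       # small constant that caps the reciprocal
    shares = [0.0] * n
    for i in range(n):
        # Inverse-distance weights to the n - 1 compounds that predicted compound i.
        weights = {
            t: 1.0 / (math.dist(points[i], points[t]) + floor)
            for t in range(n) if t != i
        }
        alpha = 1.0 / sum(weights.values())   # normalization: scaled weights sum to 1
        for t, w in weights.items():
            shares[t] += alpha * w * loo_errors[i] ** 2
    return shares


# Toy data, invented for illustration: two clusters in a 2D descriptor space.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (2.0, 2.0), (2.1, 1.9)]
loo_errors = [0.3, -0.2, 0.1, 0.8, -0.6]
shares = distribute_press(points, loo_errors)
press = sum(e * e for e in loo_errors)
print(shares, press)
```
]]></p>
				<p>Because the scaled weights for each held-out compound sum to one, the shares add up to the PRESS, so the scheme is a partition in the sense described above. In this toy example most of the large error made for the fourth compound lands on its close neighbor, the fifth.</p>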
			</sec>
		</sec>
		<sec>
			<st>
				<p>Results</p>
			</st>
			<p>The suitability of DPRESS or any other quantitative model of predictive uncertainty is best evaluated by applying it to experimental QSAR data sets. Here, DPRESS is tested against PLS models obtained using a 3D descriptor (comparative molecular field analysis, or CoMFA <abbrgrp>
					<abbr bid="B32">32</abbr>
					<abbr bid="B33">33</abbr>
					<abbr bid="B34">34</abbr>
				</abbrgrp>) and a 2D descriptor (hologram QSAR, or HQSAR <abbrgrp>
					<abbr bid="B35">35</abbr>
					<abbr bid="B36">36</abbr>
					<abbr bid="B37">37</abbr>
				</abbrgrp>). A large data set (<it>N </it>= 304) was used to ensure that the number of compounds held back to evaluate external predictivity was much greater than the number needed to train a reasonably robust model.</p>
			<sec>
				<st>
					<p>The data set</p>
				</st>
				<p>The set of structurally diverse cyclooxygenase inhibitors examined here was originally compiled by Chavatte et al. <abbrgrp>
						<abbr bid="B38">38</abbr>
					</abbrgrp>. It includes data on five major and three minor structural classes (Fig. <figr fid="F2">2</figr>) of inhibitors of the inducible form of the enzyme (COX-2). This data set is attractive because the target has been a major focus of research on anti-inflammatory drugs and because it combines substantial structural variation with a few key shared elements such as the distal sulfonyl (SO<sub>2</sub>CH<sub>3</sub>) or sulfamoyl (SO<sub>2</sub>NH<sub>2</sub>) group. In addition, regression models based on this data set are well-characterized in terms of predictive robustness <abbrgrp>
						<abbr bid="B25">25</abbr>
						<abbr bid="B28">28</abbr>
					</abbrgrp> and with respect to variations in how training subsets are selected <abbrgrp>
						<abbr bid="B39">39</abbr>
					</abbrgrp>. Finally, the uneven representation of the different core structures reflects a sampling bias that is typical of the data sets used to build QSARs.</p>
				<fig id="F2">
					<title>
						<p>Figure 2</p>
					</title>
					<caption>
						<p>Representative examples from the five major and three minor structural classes included in the COX-2 data set</p>
					</caption>
					<text>
						<p>
							<b>Representative examples from the five major and three minor structural classes included in the COX-2 data set</b>. The number of members in each class is indicated in parentheses. Each of the five major classes includes both sulfonyl and sulfamoyl analogs.</p>
					</text>
					<graphic file="1758-2946-1-11-2"/>
				</fig>
				<p>Fig. <figr fid="F3">3A</figr> shows how activity is distributed across the various structural classes when the compounds in the data set are projected into two dimensions using embedded non-linear mapping <abbrgrp>
						<abbr bid="B40">40</abbr>
						<abbr bid="B41">41</abbr>
					</abbrgrp> based on the similarity in their molecular fields: symbols are colored by structural class and sized by activity. Clearly, no one structural class has a monopoly on high activity. Fig. <figr fid="F3">3B</figr> shows the distribution of activity across the descriptor space defined by the compounds' molecular holograms. Molecular fields are 3D descriptors, which are more generalized than holograms &#8211; 2D descriptors derived from substructure counts. The more literal character of holograms leads to smaller distances between inhibitors within classes relative to the distances between classes, which accounts for the greater between-class resolution in Fig. <figr fid="F3">3B</figr>. It also accounts for the fact that the sulfonyl and sulfamoyl subclasses are cleanly separated in the hologram space (Fig. <figr fid="F3">3B</figr>) but not in the space defined by the corresponding molecular fields (Fig. <figr fid="F3">3A</figr>).</p>
				<fig id="F3">
					<title>
						<p>Figure 3</p>
					</title>
					<caption>
						<p>The distribution of activity across descriptor spaces for compounds in the COX-2 data set</p>
					</caption>
					<text>
						<p>
							<b>The distribution of activity across descriptor spaces for compounds in the COX-2 data set</b>. Symbols are color-coded by structural class and symbol sizes are proportional to the negative common logarithm of the potency (pIC50). Compounds falling into the three minor classes (cyclopentadienes, isoxazoles and thiophene DuP-697) are indicated in gray. Points in the vertical "hedges" at the top left and top right of the plots represent singletons that are too dissimilar to any other compound to be placed meaningfully within the eNLM. (<b>A</b>) Projection obtained by applying embedded non-linear mapping (eNLM) to the Euclidean distance matrix calculated from steric and electrostatic fields. Points representing compounds from the minor classes are circled. (<b>B</b>) Projection obtained by applying eNLM to the Euclidean distance matrix calculated from molecular holograms hashed to a length of 353. See the <b>Methods </b>section for details.</p>
					</text>
					<graphic file="1758-2946-1-11-3"/>
				</fig>
				<p>The main goal of the work reported here was to see how well local estimates of predictive error obtained by DPRESS reflect the actual distribution of predictive error across the descriptor space. Simple random sampling produces a biased training set because, as in most such data sets, the major structural classes are not evenly represented (Fig. <figr fid="F2">2</figr>). Therefore diverse but representative ("boosted" <abbrgrp>
						<abbr bid="B39">39</abbr>
					</abbrgrp>) training sets were generated by independently drawing five training (sub)sets of 75 compounds from the full set using optimizable <it>k</it>-dissimilarity (OptiSim) selection <abbrgrp>
						<abbr bid="B39">39</abbr>
						<abbr bid="B42">42</abbr>
						<abbr bid="B43">43</abbr>
					</abbrgrp>. Models based on each of those training sets were then used to predict the activities of the 229 inhibitors not used to construct it. Three additional training sets were drawn at random. Representation in the full data set is biased, so such simple random subsets are biased as well. Only one of the three gave acceptable internal cross-validation statistics, and the results obtained using that training set (set <it>R</it>) are included here to illustrate the effect of sampling bias due to structural redundancy <abbrgrp>
						<abbr bid="B39">39</abbr>
						<abbr bid="B44">44</abbr>
						<abbr bid="B45">45</abbr>
					</abbrgrp>.</p>
			</sec>
			<sec>
				<st>
					<p>CoMFA models</p>
				</st>
				<p>The optimal number of components <it>p</it>* for the CoMFA models obtained for the boosted training sets ranged from three to seven. It is not appropriate to compare models that differ in complexity directly, however, so a consensus complexity of <it>p </it>= 6 was used in all cases. The corresponding leave-one-out (LOO) cross-validated standard errors (<it>s</it>
					<sub>CV</sub>) ranged from 0.681 to 0.762, corresponding to internal predictivities (<it>q</it>
					<sup>2</sup>) of 0.537 to 0.337. The non-cross-validated models exhibited standard errors of regression (<it>s</it>
					<sub>FIT</sub>) ranging from 0.279 to 0.398, corresponding to <it>r</it>
					<sup>2 </sup>values between 0.901 and 0.827. Calculating the root mean square error for external predictions yielded <it>s</it>
					<sub>PRED </sub>= 0.633 to 0.655 &#8211; i.e., the internal cross-validated error underestimated the overall accuracy of external prediction somewhat.</p>
				<p>In contrast, the biased training set <it>R </it>yielded a cross-validated standard error (<it>s</it>
					<sub>CV</sub>) of 0.489, corresponding to a <it>q</it>
					<sup>2 </sup>of 0.696. The overall goodness-of-fit statistics for the non-cross-validated model were <it>s</it>
					<sub>FIT </sub>= 0.279 and <it>r</it>
					<sup>2 </sup>= 0.901. As expected, however, the predictive performance on those compounds not in <it>R </it>was substantially worse than that of the boosted training sets, with <it>s</it>
					<sub>PRED </sub>= 0.744.</p>
				<p>Fig. <figr fid="F4">4A</figr> shows the same projection as Fig. <figr fid="F3">3A</figr>, but here symbol sizes are based on the error in predicted pIC50 rather than on pIC50 itself. The top panels in Fig. <figr fid="F4">4(A&#8211;C)</figr> show the distributions of the individual observed errors in predicted activity (|<it>e</it>|) across the descriptor space, whereas the bottom panels (D&#8211;F) show distributions of the corresponding predictive uncertainties (<inline-formula>
						<graphic file="1758-2946-1-11-i16.gif"/>
					</inline-formula>) estimated using DPRESS. The leftmost panels (4A and 4D) were obtained for the model based on the boosted training set (set <it>A</it>) that had the lowest aggregate <it>external </it>predictive standard error (<it>s</it>
					<sub>PRED</sub>), whereas the middle panels (4B and 4E) are results for the boosted training set (set <it>B</it>) that had the lowest aggregate <it>internal </it>(cross-validated) predictive standard error (<it>s</it>
					<sub>CV</sub>) overall. The right-most panels (4C and 4F) display the results for the biased training set <it>R</it>.</p>
				<fig id="F4">
					<title>
						<p>Figure 4</p>
					</title>
					<caption>
						<p>Distribution of observed absolute errors and uncertainties predicted by DPRESS for three different CoMFA models</p>
					</caption>
					<text>
						<p>
							<b>Distribution of observed absolute errors and uncertainties predicted by DPRESS for three different CoMFA models</b>. Projection parameters and color coding are the same as in Fig. 3A except that the horizontal dimension has been compressed somewhat. Symbol size is proportional to the magnitude of the observed error or predicted uncertainty. Compounds from the respective training sets are represented by stars. (<b>A</b>) Observed absolute errors for boosted training set <it>A</it>, which had the best external predictive performance (<it>s</it>
							<sub>PRED </sub>= 0.633; <it>s</it>
							<sub>CV </sub>= 0.762). (<b>B</b>) Observed absolute errors for boosted training set <it>B</it>, which had the best internal predictive performance (<it>s</it>
							<sub>CV </sub>= 0.681; <it>s</it>
							<sub>PRED </sub>= 0.637). (<b>C</b>) Observed absolute errors for the biased training set (<it>s</it>
							<sub>CV </sub>= 0.489; <it>s</it>
							<sub>PRED </sub>= 0.744). (<b>D</b>) Predicted uncertainties for boosted training set <it>A</it>. (<b>E</b>) Predicted uncertainties for boosted training set <it>B</it>. (<b>F</b>) Predicted uncertainties for the biased training set <it>R</it>.</p>
					</text>
					<graphic file="1758-2946-1-11-4"/>
				</fig>
				<p>Several conclusions can be drawn by comparing the distributions of errors to one another and to the distribution of activities. Firstly, though the distributions of observed predictive errors for the three models differ from one another (Fig. <figr fid="F4">4A</figr> vs <figr fid="F4">4B</figr> vs <figr fid="F4">4C</figr>), they resemble each other more than they resemble the distribution of activity itself (Fig. <figr fid="F3">3A</figr>). Secondly, the larger observed errors are not particularly concentrated among the singletons or at the edges of the descriptor space, as would be expected from the ordinary least squares distribution implied by Eq. 6 and from most published approaches to establishing applicability domains. Thirdly, the distributions of predictive uncertainty seen for the boosted training sets are in good overall agreement with the observed errors with respect to the regions of descriptor space where the observed error is relatively high or low (Fig. <figr fid="F4">4D</figr> vs <figr fid="F4">4A</figr> and <figr fid="F4">4E</figr> vs <figr fid="F4">4B</figr>). Though somewhat less evident, the same is true for the model constructed using the biased training set <it>R </it>(Fig. <figr fid="F4">4F</figr> vs <figr fid="F4">4C</figr>). Finally, the smaller errors predicted by the boosted training set with the better internal predictivity (Fig. <figr fid="F4">4E</figr> vs <figr fid="F4">4D</figr>) do seem to be realized in the localized errors actually observed (Fig. <figr fid="F4">4B</figr> vs <figr fid="F4">4A</figr>), even though this was not obvious in the aggregate statistics (<it>s</it>
					<sub>PRED </sub>= 0.637 and <it>s</it>
					<sub>PRED </sub>= 0.633, respectively).</p>
				<p>Interpretation of the plots shown in Fig. <figr fid="F4">4</figr> is complicated because the uncertainty <it>s</it>
					<sub>u </sub>is a measure of the <it>spread </it>in predictive error at <b>X</b>
					<sub>u</sub>; the expected value of the error is still 0. If <inline-formula>
						<graphic file="1758-2946-1-11-i16.gif"/>
					</inline-formula> is an accurate prediction of uncertainty, the magnitude of the observed error (|<it>e</it>
					<sub>u</sub>|) can be expected to be less than <inline-formula>
						<graphic file="1758-2946-1-11-i16.gif"/>
					</inline-formula> about 68% of the time and to almost always (about 95% of the time) be less than 2<inline-formula>
						<graphic file="1758-2946-1-11-i16.gif"/>
					</inline-formula>. The plots in Fig. <figr fid="F5">5</figr> &#8211; in which the predicted uncertainty (which is always positive) is shown as a function of the observed error (which can be positive or negative) &#8211; represent a more quantitative way to see how well the predicted uncertainties track the spreads in error actually observed outside the training set.</p>
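				<p>The coverage check described above can be sketched in a few lines of Python. This is a minimal illustration only, assuming NumPy is available; the function name <it>coverage</it> and the numerical values are hypothetical and are not taken from the COX-2 data set.</p>

```python
import numpy as np

def coverage(errors, uncertainties):
    """Return the fractions of |e_u| lying within 1x and 2x the predicted uncertainty."""
    abs_err = np.abs(np.asarray(errors, dtype=float))
    s_hat = np.asarray(uncertainties, dtype=float)
    within_1 = float(np.mean(s_hat >= abs_err))        # compare with about 0.68
    within_2 = float(np.mean(2.0 * s_hat >= abs_err))  # compare with about 0.95
    return within_1, within_2

# Hypothetical signed errors and predicted uncertainties, for illustration only.
errors = [0.10, -0.50, 0.90, -1.30, 0.20]
uncertainties = [0.40, 0.40, 0.60, 0.60, 0.30]
print(coverage(errors, uncertainties))
```

				<p>Fractions well below 0.68 and 0.95 would suggest that the predicted uncertainties are too optimistic, whereas fractions well above those values would suggest that they are more conservative than necessary.</p>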
				<fig id="F5">
					<title>
						<p>Figure 5</p>
					</title>
					<caption>
						<p>Predictive uncertainty <inline-formula>
								<graphic file="1758-2946-1-11-i16.gif"/>
							</inline-formula> as a function of the observed error for the CoMFA models</p>
					</caption>
					<text>
						<p>
							<b>Predictive uncertainty <inline-formula>
									<graphic file="1758-2946-1-11-i16.gif"/>
								</inline-formula> as a function of the observed error for the CoMFA models</b>. Filled stars represent members of the training set and define the lines for <inline-formula>
								<graphic file="1758-2946-1-11-i17.gif"/>
							</inline-formula> = |<it>e</it>
							<sub>t</sub>|. Dashed lines correspond to <inline-formula>
								<graphic file="1758-2946-1-11-i16.gif"/>
							</inline-formula> = 2|<it>e</it>
							<sub>u</sub>|. (<b>A</b>) Results of setting <it>&#947;</it>
							<sub>u </sub>= 0 for all compounds. (<b>B</b>) Results for the model constructed from the biased training set <it>R</it>. (<b>C</b>) Results from boosted training set <it>A</it>. (<b>D</b>) Results for boosted training set <it>B</it>.</p>
					</text>
					<graphic file="1758-2946-1-11-5"/>
				</fig>
				<p>Eq. 7 implies that <inline-formula>
						<graphic file="1758-2946-1-11-i17.gif"/>
					</inline-formula> = |<it>e</it>
					<sub>t</sub>| for each member <it>t </it>in the training set. The corresponding points are represented by filled stars in each panel in Fig. <figr fid="F5">5</figr>, which therefore define the lines <inline-formula>
						<graphic file="1758-2946-1-11-i16.gif"/>
					</inline-formula> = |<it>e</it>
					<sub>u</sub>|. Unbiased and normally distributed error should only fall outside these lines about 32% of the time and should fall outside the dotted lines corresponding to <inline-formula>
						<graphic file="1758-2946-1-11-i16.gif"/>
					</inline-formula> = 2|<it>e</it>
					<sub>u</sub>| less than 5% of the time. This is clearly not the case when the cross-validated error for the most similar compound <it>t</it>* in the training set is taken as a direct estimate of <inline-formula>
						<graphic file="1758-2946-1-11-i16.gif"/>
					</inline-formula>, i.e., when <it>&#947;</it>
					<sub>t </sub>is set equal to 0 for all <it>t </it>(Fig. <figr fid="F5">5A</figr>). There are fewer unduly low predicted uncertainties for the biased training set <it>R</it>, but still more than would be expected by chance (Fig. <figr fid="F5">5B</figr>). Note that the bias evident in the model constructed from <it>R </it>comes mostly in the form of negative residuals, i.e., predicted activities that are larger than the observed activities. Such false positives account for most of the "extra" out-of-bounds errors seen in Fig. <figr fid="F5">5B</figr>. The distributions of errors for the boosted training sets are much better behaved; indeed, the predicted uncertainties are slightly more conservative than necessary for large errors in prediction (Fig. <figr fid="F5">5C</figr> and <figr fid="F5">5D</figr>).</p>
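				<p>For reference, the nearest-neighbor baseline shown in Fig. 5A, in which each new compound simply inherits the cross-validated error of the most similar training compound <it>t</it>*, might be sketched as follows. This is not the DPRESS calculation itself; the function name and the input values are hypothetical, and similarity is assumed to be expressed as a distance matrix.</p>

```python
import numpy as np

def nearest_neighbor_uncertainty(dist_to_training, cv_errors):
    """Baseline with gamma = 0: each new compound inherits |e_t*| from its closest training compound.

    dist_to_training: (n_new, n_train) distances in descriptor space.
    cv_errors: (n_train,) signed leave-one-out residuals for the training set.
    """
    d = np.asarray(dist_to_training, dtype=float)
    abs_cv = np.abs(np.asarray(cv_errors, dtype=float))
    nearest = np.argmin(d, axis=1)  # index of t* for each new compound
    return abs_cv[nearest]

# Hypothetical distances (2 new compounds x 3 training compounds) and LOO residuals.
dist = [[0.2, 1.5, 3.0],
        [2.0, 0.7, 0.9]]
cv_err = [0.05, -0.80, 0.30]
print(nearest_neighbor_uncertainty(dist, cv_err))
```

				<p>Because a training compound that happens to have a small cross-validated error passes that small value on to every new compound nearest to it, regardless of how far away they are, this baseline tends to produce the unduly low uncertainties noted above.</p>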
			</sec>
			<sec>
				<st>
					<p>HQSAR</p>
				</st>
				<p>HQSAR analyses were carried out as a complement to the results obtained in the CoMFA studies described above. The 2D molecular holograms used were built up from the number of each kind of substructure comprised of between four and seven heavy atoms, the counts being mapped down into count vectors of various lengths by hashing <abbrgrp>
						<abbr bid="B37">37</abbr>
					</abbrgrp>. HQSAR models were then constructed by applying PLS analysis to holograms of length 97, 151, 199, 257, 307 and 353. The optimal complexity for the full model (<it>N </it>= 304) was six components for all hash lengths. The <it>s</it>
					<sub>CV </sub>values obtained ranged from 0.609 to 0.640; the median and average were both 0.622. The value of <it>q</it>
					<sup>2 </sup>ranged from 0.547 to 0.582, with a median of 0.564 and an average of 0.563. Based on these results, a hash length of 353 (<it>s</it>
					<sub>CV </sub>= 0.609 and <it>q</it>
					<sup>2 </sup>= 0.582) was chosen for evaluating the behavior of the various training sets. The corresponding non-cross-validated analysis gave <it>s</it>
					<sub>FIT </sub>= 0.527 and <it>r</it>
					<sup>2 </sup>= 0.687.</p>
				<p>The consensus optimal complexity across the boosted training subsets was five components, one fewer than for the full data set, in keeping with the full data set's having roughly four times as many compounds (304 versus 75) and, therefore, containing substantially more information. The <it>s</it>
					<sub>CV </sub>values obtained ranged from 0.691 to 0.776 versus a value of 0.540 for the biased training set <it>R</it>; the respective <it>q</it>
					<sup>2 </sup>values were 0.386 to 0.514 and 0.623. The <it>s</it>
					<sub>PRED </sub>for the boosted subsets ranged from 0.619 to 0.669 and the corresponding value for the biased subset was 0.735. Hence HQSAR performance followed the trend seen for CoMFA: cross-validation under-estimated the predictive error substantially for the biased subset (i.e., was overly optimistic about the extensibility of the model) and over-estimated the predictive error slightly for the boosted training sets. It differed in that it was the boosted training set <it>B </it>which gave the better external predictive performance.</p>
				<p>The distributions of observed predictive errors and predicted uncertainties across the hologram descriptor space are shown in Fig. <figr fid="F6">6</figr> for the model based on boosted training set <it>B</it>, and the corresponding plots of <inline-formula>
						<graphic file="1758-2946-1-11-i16.gif"/>
					</inline-formula> as a function of <it>e</it>
					<sub>u </sub>are shown in Fig. <figr fid="F7">7</figr>. Note that the predicted uncertainties for the boosted HQSAR models were more conservative than those for the CoMFA models discussed above, with the result that the magnitudes of nearly all errors above 0.75 log units were less than the corresponding <inline-formula>
						<graphic file="1758-2946-1-11-i16.gif"/>
					</inline-formula>. This is probably a consequence of the exaggerated separation between classes seen in the hologram space (Fig. <figr fid="F3">3B</figr>).</p>
				<fig id="F6">
					<title>
						<p>Figure 6</p>
					</title>
					<caption>
						<p>Distribution of predictive error and uncertainty across the hologram descriptor space for training set <it>A</it>
						</p>
					</caption>
					<text>
						<p>
							<b>Distribution of predictive error and uncertainty across the hologram descriptor space for training set <it>A</it>
							</b>. Stars correspond to compounds from the training set. Projection parameters and color coding by class are as indicated for Fig. 3B. Symbol sizes are proportional to magnitude. (<b>A</b>) Observed absolute predictive error. (<b>B</b>) Predicted uncertainty.</p>
					</text>
					<graphic file="1758-2946-1-11-6"/>
				</fig>
				<fig id="F7">
					<title>
						<p>Figure 7</p>
					</title>
					<caption>
						<p>Predicted uncertainty <inline-formula>
								<graphic file="1758-2946-1-11-i16.gif"/>
							</inline-formula> as a function of the observed predictive error <it>e</it>
						</p>
					</caption>
					<text>
						<p>
							<b>Predicted uncertainty <inline-formula>
									<graphic file="1758-2946-1-11-i16.gif"/>
								</inline-formula> as a function of the observed predictive error <it>e</it>
							</b>. Filled stars correspond to compounds included in the training set, whereas open circles represent compounds in the test set. Dashed lines correspond to |e| = 2<inline-formula>
								<graphic file="1758-2946-1-11-i16.gif"/>
							</inline-formula>. (<b>A</b>) Results for the HQSAR model constructed from the biased training set <it>R</it>. (<b>B</b>) Results from boosted training set <it>B</it>.</p>
					</text>
					<graphic file="1758-2946-1-11-7"/>
				</fig>
			</sec>
		</sec>
		<sec>
			<st>
				<p>Discussion</p>
			</st>
			<p>The extent to which any QSAR can be extended to compounds outside of the training set used to construct it is necessarily limited by the structural diversity of that training set. Some extensibility is necessary, however, if the QSAR is to be of use for something beyond mere rationalization of known activities. When only a few descriptors are being considered, it may be possible to restrict the applicability domain to "internal" regions in the descriptor space, but as the number of descriptors increases, distinguishing compounds that lie "outside" the space defined by the training set from those that are "inside" becomes progressively less meaningful. Regardless of the complexity of the system, it is clear that one will often need to extend the applicability domain beyond the training set somehow. It is equally clear that this must be done cautiously, however, and that it would be desirable for the degree of caution to reflect the idiosyncrasies of the QSAR being examined. It would be particularly desirable to take local variations in the uncertainty of predictions into account, rather than trying to find a single acceptable distance to the model that is applicable across the entire descriptor space <abbrgrp>
					<abbr bid="B12">12</abbr>
					<abbr bid="B14">14</abbr>
					<abbr bid="B15">15</abbr>
				</abbrgrp>.</p>
			<p>DPRESS was formulated to address these needs. It is based on two simple assumptions: that the uncertainty in prediction for new objects (e.g., molecular structures) is likely to be dominated by the error in prediction for objects near them in the descriptor space; and that this influence is, to a first approximation, inversely related to the distance between them. The "true" dependence may well be more complex in some cases, but the size of the training set needed to characterize that dependence will almost always be impractically large. In any event, such dependence is likely to reduce to a linear relationship over the relatively short ranges of QSAR extrapolations that have any chance of being relevant.</p>
			<p>The fact that the predictive uncertainties derived from DPRESS analysis are sometimes more conservative than necessary for large errors is of some concern, though that is certainly preferable to the alternative of their being overly optimistic; further work in this area is a matter of ongoing investigation. Nonetheless, the method is intrinsically less constraining than the classical quadratic relationship based on distance from the mean (Eq. 4). Given how much predictive error varies across the model space (e.g., Fig. <figr fid="F4">4</figr>), any approach based on the overall <it>s</it>
				<sub>PRED </sub>seems bound to be overly optimistic regarding the reliability of predicted potencies for some compounds.</p>
			<p>The underlying QSARs examined here &#8211; CoMFA and HQSAR &#8211; both rely on (nominally <abbrgrp>
					<abbr bid="B46">46</abbr>
				</abbrgrp>) linear PLS, but there is no intrinsic reason that the method cannot be more broadly applied. The key point is that the error being distributed must be predictive &#8211; i.e., it needs to reflect predictions made for objects not included in the training set. LOO cross-validation yields the most information for any given dataset, but a leave-some-out approach should be a viable alternative. The predictive errors obtained from the validation sets often used in ANN analysis could be used as well, since there is no intrinsic reason that a linear model for local error distribution should be incompatible with a QSAR that is non-linear on a global scale.</p>
			<p>The usual reasons for preferring leave-some-out (LSO) over LOO cross-validation are unlikely to be relevant to DPRESS calculations, however. LOO can indeed be distorted when the training set is biased due to redundancy, but DPRESS based on LOO turns out to be conservative in such a situation (see above). The reduction in <it>s</it>
				<sub>CV </sub>that occurs when the sampling density in one particular area of descriptor space is high is reflected in a reduction in the error that each <it>individual </it>prediction contributes to the PRESS. But <inline-formula>
					<graphic file="1758-2946-1-11-i16.gif"/>
				</inline-formula> is not a root mean square, so the effect on its value is offset by the fact that biased sampling necessarily: increases the total number of errors; decreases their spread (<it>d</it>
				<sub>00</sub>); increases the distance between the training set and most new observations; or effects some combination thereof.</p>
			<p>Diverse training sets representative of the full structural space produce more reliable local uncertainty estimates than do biased training sets, indicating that taking care to avoid undue sampling bias (redundancy) in the training set is worth the effort. Even the biased training set <it>R</it>, however, did better than setting the uncertainty of prediction for a new object equal to the observed error for the closest object in the training set (Fig. <figr fid="F5">5</figr> and <figr fid="F7">7</figr>). Moreover, the errors falling outside the range expected based on the calculated <inline-formula>
					<graphic file="1758-2946-1-11-i13.gif"/>
				</inline-formula> for <it>R </it>were false positives, the least serious type of error to make when trying to predict toxicity.</p>
			<p>There are two fundamental differences between the estimate of predictive uncertainty derived from classical theory (Eq. 4) and the DPRESS model represented by Eqs. 7&#8211;9. The first difference is that Eq. 4 is a sum of squares, whereas Eq. 7 is a sum of linear terms. Using a sum of squares formulation was considered for DPRESS, but was found to consistently overestimate the uncertainty of prediction (details not shown). The second difference is that Mahalanobis distances <it>d </it>measured in the model space are used in the classical model, whereas Euclidean distances measured in the descriptor space are used in DPRESS. The less parametric approach is followed for DPRESS because the variation in one or more variables in a particular training set may not be large enough to reveal the influence that variable might exert if examined across a greater range. The small coefficient assigned to such a variable in that event means that substantial deviations in its value will have a negligible effect on distances in the model space. Sticking with distances in the "raw" descriptor space rather than using the descriptor weights from <b>b </b>to calculate a Mahalanobis distance is more conservative &#8211; it assumes that variation in things that have yet to be explored is likely to make predictions less reliable.</p>
		</sec>
		<sec>
			<st>
				<p>Conclusion</p>
			</st>
			<p>Examination of the distribution of predictive errors across the descriptor space makes it clear that errors are consistently larger in some regions than in others &#8211; i.e., the predictive error is heteroskedastic (Fig. <figr fid="F4">4</figr>). Given that a major use of QSAR predictions is in chemoinformatic tabulations used by medicinal chemists and other third parties, it would be good practice to routinely attach some estimate of uncertainty to each prediction. Doing so based on some analytical estimator would be preferable, but is impractical in most real-world situations because it requires detailed <it>a priori </it>knowledge of the global dependence of error on the descriptors. In the absence of such knowledge, a locally linear estimator of predictive reliability that is embedded in the sample space represents a reasonable alternative. Partition of predictive error sum of squares (DPRESS) provides just such an estimator in a form &#8211; that of a standard error &#8211; that is widely understood by those likely to use it. The calculations involved are straightforward and the estimator produced is a qualitatively (Fig. <figr fid="F4">4</figr> and <figr fid="F6">6</figr>) and quantitatively (Fig. <figr fid="F5">5</figr> and <figr fid="F7">7</figr>) reliable estimate of how much confidence one should place in the associated prediction. Moreover, though the particular applications studied here involved PLS models built using 2D and 3D descriptors, the technique is likely applicable to any regression method that can be reformulated in kernel-based terms <abbrgrp>
					<abbr bid="B12">12</abbr>
					<abbr bid="B47">47</abbr>
				</abbrgrp>.</p>
			<p>It is also important when constructing the model in the first place to examine the distribution of predictive error in the descriptor space. If uncertainty is homoskedastic, a classical or uniform distribution model may provide a somewhat more precise estimate of predictive uncertainty. Should (e)NLM or principal components analysis (PCA) indicate heteroskedasticity, however, a DPRESS calculation should be carried out before applying the model &#8211; e.g., for prioritizing compounds for synthesis, acquisition or detailed testing. DPRESS may also serve to highlight regions of structural space from which more data needs to be gathered.</p>
		</sec>
		<sec>
			<st>
				<p>Experimental</p>
			</st>
			<p>Ordinary multiple linear regression is not suitable when the number of descriptors in a data set exceeds the number of observations. PLS <abbrgrp>
					<abbr bid="B4">4</abbr>
				</abbrgrp> was used instead, with the appropriate number of latent variables (components) to include (i.e., the model complexity) being the number corresponding to the first minimum in the "leave-one-out" cross-validated standard error (<it>s</it>
				<sub>CV</sub>). This measure of internal consistency is obtained by setting aside each of the <it>n </it>compounds in the training subset in turn and trying to predict its activity using the other <it>n </it>- 1 compounds in the training set. The external error of prediction (<it>s</it>
				<sub>PRED</sub>) was calculated as the root mean square error for the <it>N </it>- <it>n </it>compounds left out of the model calculation altogether.</p>
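			<p>The statistics described above can be illustrated with the following Python sketch. It rests on several assumptions that should be stated plainly: ordinary least squares on a handful of descriptors is used purely as a stand-in for PLS, the function names are hypothetical, and the degrees-of-freedom correction used for the cross-validated standard error (<it>n </it>- <it>p </it>- 1) is a common convention that may differ from the one applied by the software used in this work. Only the root mean square definition of the external error is taken directly from the text.</p>

```python
import numpy as np

def loo_residuals(X, y):
    """Leave-one-out residuals, with ordinary least squares standing in for PLS."""
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    residuals = np.empty(len(y))
    for i in range(len(y)):
        keep = np.arange(len(y)) != i                 # set compound i aside
        coef, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        residuals[i] = y[i] - X[i] @ coef             # predict it from the other n - 1
    return residuals

def s_cv(residuals, n_components):
    """sqrt(PRESS / (n - p - 1)); the degrees-of-freedom convention is an assumption."""
    residuals = np.asarray(residuals, dtype=float)
    press = np.sum(residuals ** 2)
    return float(np.sqrt(press / (len(residuals) - n_components - 1)))

def q2(residuals, y):
    """Cross-validated q2 = 1 - PRESS / SS, with SS taken about the mean response."""
    residuals = np.asarray(residuals, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2))

def s_pred(y_obs, y_hat):
    """Root mean square error over the compounds left out of the model altogether."""
    diff = np.asarray(y_obs, dtype=float) - np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 2))
y_demo = X_demo @ np.array([1.0, -0.5]) + rng.normal(scale=0.1, size=20)
res_demo = loo_residuals(X_demo, y_demo)
print(round(s_cv(res_demo, 2), 3), round(q2(res_demo, y_demo), 3))
```

			<p>Choosing the model complexity then amounts to repeating the cross-validation for increasing numbers of components and stopping at the first minimum in the cross-validated standard error.</p>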
			<sec>
				<st>
					<p>Training set selection</p>
				</st>
				<p>Boosted training sets were obtained by applying OptiSim selection to the full data set. OptiSim selection entails drawing a series of random subsamples of size <it>k </it>from the data set of interest. For each subsample in the series, the individual that is most different from those selected from previous subsamples is extracted and added to the selection set <it>S</it>. This procedure results in a representative but diverse selection set that samples the full data set space both efficiently and effectively <abbrgrp>
						<abbr bid="B42">42</abbr>
					</abbrgrp>. Here the structural space was defined in terms of the Tanimoto similarity <it>T</it>(<it>a</it>, <it>b</it>) between the corresponding UNITY substructural fingerprints <abbrgrp>
						<abbr bid="B48">48</abbr>
					</abbrgrp>. The individual <it>a </it>in the <it>i</it>
					<sup>th </sup>subsample for which max(<it>T</it>(<it>a</it>, <it>b</it>): <it>b </it>&#8712; <it>S</it>) is smallest was added to <it>S</it>. Candidates with a Tanimoto similarity greater than 0.8 to any compound already in the selection set were deemed redundant and were excluded from subsamples.</p>
				<p>The selection process was repeated five times with <it>n </it>= 75 and <it>k </it>= 4, using a different random number seed each time. Five inhibitors appeared in every boosted training set, including the thiophene, cyclopentadiene and isoxazole analogs that fall outside the five major classes. A total of 113 inhibitors were not selected for any of the boosted training sets, whereas 191 were selected for at least one of them.</p>
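				<p>A simplified version of the selection procedure described above might look like the following Python sketch. It is an illustration under stated assumptions rather than the implementation used here: fingerprints are represented as sets of "on" bits, the first compound is chosen at random, members of a subsample that are not selected simply remain available for later subsamples, and the names <it>tanimoto</it> and <it>optisim</it> are hypothetical. Published OptiSim variants may handle these details differently.</p>

```python
import random

def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a.union(b))
    return len(a.intersection(b)) / union if union else 1.0

def optisim(fingerprints, n_select, k, exclusion=0.8, seed=0):
    """Simplified OptiSim selection; returns the indices of the selected items."""
    rng = random.Random(seed)
    candidates = list(range(len(fingerprints)))
    # Assumption: the selection is seeded with one randomly chosen compound.
    selected = [candidates.pop(rng.randrange(len(candidates)))]

    def max_sim(c):
        # Similarity of candidate c to its most similar already-selected compound.
        return max(tanimoto(fingerprints[c], fingerprints[s]) for s in selected)

    while candidates and n_select > len(selected):
        # Candidates too similar to anything already selected are redundant.
        candidates = [c for c in candidates if exclusion >= max_sim(c)]
        if not candidates:
            break
        subsample = rng.sample(candidates, min(k, len(candidates)))
        best = min(subsample, key=max_sim)  # most different from the selection set
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy fingerprints, for illustration only.
fps = [{1, 2, 3}, {1, 2, 3, 4}, {7, 8, 9}, {10, 11}, {1, 2, 3}]
print(optisim(fps, n_select=3, k=2, seed=1))
```

				<p>With <it>k </it>= 1 this reduces to random selection subject to the exclusion threshold, whereas a subsample size equal to the number of remaining candidates gives maximum-dissimilarity selection; intermediate values such as the <it>k </it>= 4 used here trade off representativeness against diversity.</p>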
			</sec>
			<sec>
				<st>
					<p>Molecular fields</p>
				</st>
				<p>CoMFA involves using PLS to identify correlations of biological activity with variations in steric and electrostatic molecular fields, which requires that the molecules under consideration be put into similar conformations and into a common frame of reference as a key part of the process. Here, conformations were set and molecular structures aligned based on the homologous atoms in their central and peripheral rings, as has been described in detail elsewhere <abbrgrp>
						<abbr bid="B39">39</abbr>
					</abbrgrp>. Charges were calculated using the method of Marsili and Gasteiger <abbrgrp>
						<abbr bid="B49">49</abbr>
					</abbrgrp>, as extended in SYBYL <abbrgrp>
						<abbr bid="B50">50</abbr>
					</abbrgrp> to take the distribution of <it>&#960; </it>electrons into account ("Gasteiger-H&#252;ckel charges"). Coulombic and Lennard-Jones interaction energies were calculated on a 2 &#197; rectilinear grid extending 4 &#197; beyond the edge of any molecule in the full data set. The probe atom used to calculate the fields was an <it>sp</it>
					<sup>3</sup>-hybridized carbon monocation. Interaction energies were truncated at nominal values above 30 kcal/mol, and electrostatics were ignored within the steric envelope of each inhibitor.</p>
			</sec>
			<sec>
				<st>
					<p>Molecular holograms</p>
				</st>
				<p>The first step in constructing a molecular hologram is to identify all substructures in a molecule that fall within a specified size range &#8211; here, all fragments made up of four to seven atoms, with hydrogens ignored and bond types taken into account. Each fragment is then mapped into a compressed count vector of specified length using a hashing function, so that the elements of that count vector can be used as descriptors in subsequent PLS analyses <abbrgrp>
						<abbr bid="B36">36</abbr>
					</abbrgrp>. The hashing means that different fragments may map to the same position in the final count vector. The fragments overlap, however, so each substructure contributes to many fragment counts. The result is that the noise introduced by "collisions" for a few subfragments constitutes a relatively minor perturbation that is, on average, self-limiting. Overfit PLS models are characteristically unstable to such perturbations, however, so surveying a range of hash lengths and picking one with good but representative statistical properties is a good way to avoid picking a length whose cross-validation statistics are overly optimistic. This is a non-parametric perturbation analysis analogous to looking at the effect of small perturbations in response to assess model stability <abbrgrp>
						<abbr bid="B28">28</abbr>
					</abbrgrp>.</p>
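				<p>The folding of fragments into a fixed-length count vector can be sketched as follows. This is a minimal illustration only, not the HQSAR implementation: the fragment strings are hypothetical placeholders, CRC32 is an assumed stand-in for the hashing function, and the vector lengths are arbitrary.</p>

```python
import zlib
from collections import Counter


def hologram(fragments, length):
    """Fold fragment identifiers into a count vector of the given length."""
    counts = [0] * length
    for frag in fragments:
        # CRC32 is an assumed stand-in hash, chosen because it is
        # reproducible across runs; it is not the function used by HQSAR.
        bin_index = zlib.crc32(frag.encode("utf-8")) % length
        counts[bin_index] += 1
    return counts


def count_collisions(fragments, length):
    """Count distinct fragments that share a bin with another fragment."""
    bins = Counter(zlib.crc32(f.encode("utf-8")) % length
                   for f in set(fragments))
    return sum(n for n in bins.values() if n != 1)


# Hypothetical fragment strings; a real hologram enumerates every
# substructure in the chosen size range for each molecule.
frags = ["C-C-C-O", "C-C-C-O", "C-C-O-C", "C=C-C-N", "C-C-C-C-N"]

# Survey several vector lengths, as in the hash-length survey in the text.
for length in (53, 97, 199):
    vec = hologram(frags, length)
    print(length, sum(vec), count_collisions(frags, length))
```

				<p>Every fragment occurrence lands in exactly one bin, so the total count is the same at every length; only the pattern of collisions changes, which is the perturbation that a survey of hash lengths exploits.</p>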
			</sec>
			<sec>
				<st>
					<p>Visualization</p>
				</st>
				<p>2D depictions of the relationship between different compounds were obtained using the embedded non-linear mapping (eNLM) facility <abbrgrp>
						<abbr bid="B40">40</abbr>
					</abbrgrp> in Benchware DataMiner <abbrgrp>
						<abbr bid="B51">51</abbr>
					</abbrgrp>. "Ordinary" NLM can be thought of as placing springs between all pairs of points in the original descriptor space, then compressing the ensemble into two dimensions in such a way that the residual tension in those springs is minimized. Embedded NLM differs in that parts of springs longer than some specified threshold length (horizon) are treated as elastic to extension, i.e., they do not contribute to the overall stress in the system. Here, spring "tensions" were based on the block-wise autoscaled Euclidean distances ("CoMFA Standard scaling" <abbrgrp>
						<abbr bid="B32">32</abbr>
					</abbrgrp>) between the molecular fields or between the molecular holograms of different compounds.</p>
			</sec>
			<sec>
				<st>
					<p>DPRESS</p>
				</st>
				<p>CoMFA and HQSAR analyses were carried out in SYBYL. The distances <it>d</it>
					<sub>t, i </sub>used to partition the PRESS were taken from the SAMPLS.dist file generated by the SYBYL interface as input to the SAMPLS program <abbrgrp>
						<abbr bid="B52">52</abbr>
					</abbrgrp> and represent inter-observation distances in the descriptor space after autoscaling has been applied. The descriptors used here are either fully commensurate (HQSAR) or piecewise commensurate (within the steric and electrostatic fields but not between them, for CoMFA), so "CoMFA standard" (block) autoscaling was used <abbrgrp>
						<abbr bid="B33">33</abbr>
					</abbrgrp>. Observed and predicted responses were taken from the SAMPLS.out file generated by the SAMPLS program.</p>
				<p>Localized predictivity estimates were calculated by combining scripts written in the SYBYL programming language (SPL) with spreadsheet manipulations carried out in Excel. For each compound <it>t </it>in the training set, the scaling factor <it>&#947;</it>
					<sub>t </sub>was calculated based on the observed predictive variance (squared cross-validation error of prediction, <it>&#948;</it>
					<sub>i </sub>
					<sup>2</sup>) for every <it>other </it>compound in the training set (<it>i </it>&#8800; <it>t</it>) weighted inversely by the square of the Euclidean distance between the two (<it>d</it>
					<sub>t, i</sub>) in the descriptor space (Eq. 8). A normalization factor <it>&#945;</it>
					<sub>i </sub>for each compound <it>i </it>was calculated as the sum of squared distances to all other compounds in the training set (Eq. 9). A limiting proximity term of 1/<it>n </it>was included to ensure reasonable behavior for closely spaced compounds where <it>d</it>
					<sub>t, i </sub>approaches 0; this works well when the descriptors have been autoscaled in some way before use.</p>
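				<p>The effect of inverse-squared-distance weighting with a limiting proximity term can be illustrated with a generic weighted mean. The sketch below is not Eq. 8 or Eq. 9: it omits the normalization factor, and both the function name and the way the 1/n term enters (added to the squared distance) are assumptions made for illustration.</p>

```python
def local_cv_variance(t, sq_errors, dist):
    """Inverse-squared-distance weighted mean of squared CV errors around t.

    sq_errors[i] is the squared leave-one-out error for compound i, and
    dist[t][i] is the distance between compounds t and i in the autoscaled
    descriptor space. Generic illustration only, not the published formula.
    """
    n = len(sq_errors)
    floor = 1.0 / n  # assumed placement of the limiting proximity term
    num = 0.0
    den = 0.0
    for i in range(n):
        if i == t:
            continue  # a compound does not contribute to its own estimate
        weight = 1.0 / (dist[t][i] ** 2 + floor)
        num += weight * sq_errors[i]
        den += weight
    return num / den


# Hypothetical data: compounds 0 and 1 coincide; compound 2 is far away.
sq_errors = [0.0, 1.0, 9.0]
dist = [[0.0, 0.0, 10.0],
        [0.0, 0.0, 10.0],
        [10.0, 10.0, 0.0]]
print(local_cv_variance(0, sq_errors, dist))
```

				<p>With the 1/n term in place, the coincident pair yields a finite weight rather than a division by zero, and the estimate for compound 0 is dominated by its near neighbor rather than by the distant compound.</p>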
				<p>The individual scaling factors <it>&#947;</it>
					<sub>t </sub>obtained from the <it>n </it>LOO cross-validation errors for the training set were used to calculate an estimate <inline-formula>
						<graphic file="1758-2946-1-11-i16.gif"/>
					</inline-formula> for the predictive uncertainty associated with each new structure <it>u </it>based on the observed cross-validation error of the training set compound (<it>t</it>*) lying closest to it in the descriptor space, its distance from <it>t</it>*, and the scaling factor <it>&#947;</it>
					<sub>t* </sub>derived from the model cross-validation analyses (Eqs. 7&#8211;9).</p>
			</sec>
		</sec>
		<sec>
			<st>
				<p>Abbreviations</p>
			</st>
			<p>ANN: artificial neural network; BLUE: best linear unbiased estimator; CoMFA: comparative molecular field analysis; CV: cross-validation; <it>d</it>: distance; <it>&#948;</it>: predictive error for a compound outside the training set; <it>e</it>: residual error for a compound in the training set; eNLM: embedded non-linear mapping; HQSAR: hologram QSAR; LOO: leave-one-out; PCA: principal components analysis; DPRESS: distributed predictive error sum of squares; PLS: partial least squares with projection to latent structures; PRESS: predictive error sum of squares; QSAR: quantitative structure/activity relationship; <it>s</it>: standard error for a sample; SPL: SYBYL programming language.</p>
		</sec>
		<sec>
			<st>
				<p>Competing interests</p>
			</st>
			<p>The author was formerly an employee of Tripos International, which holds exclusive rights to the CoMFA and HQSAR technologies used here to illustrate the use of DPRESS. Tripos provided the SYBYL program to Biochemical Infometrics but did not provide funding for the work described herein.</p>
		</sec>
	</bdy>
   <bm>
		<ack>
			<sec>
				<st>
					<p>Acknowledgements</p>
				</st>
				<p>Chris Williams of the Chemical Computing Group provided helpful input on the organization of the manuscript, as did the three anonymous reviewers recruited by the Journal. The paper is substantially clearer as a result of their input, and the author appreciates it.</p>
			</sec>
		</ack>
		<refgrp>
			<bibl id="B1">
				<title>
					<p>Correlation of biological activity of phenoxyacetic acids with Hammett substituent constants and partition coefficients</p>
				</title>
				<aug>
					<au>
						<snm>Hansch</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Maloney</snm>
						<fnm>PP</fnm>
					</au>
					<au>
						<snm>Fujita</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Muir</snm>
						<fnm>RM</fnm>
					</au>
				</aug>
				<source>Nature</source>
				<pubdate>1962</pubdate>
				<volume>194</volume>
				<fpage>178</fpage>
				<lpage>180</lpage>
			</bibl>
			<bibl id="B2">
				<title>
					<p>p-<it>&#963;</it>-<it>&#960; </it>Analysis. A method for the correlation of biological activity and chemical structure</p>
				</title>
				<aug>
					<au>
						<snm>Hansch</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Fujita</snm>
						<fnm>T</fnm>
					</au>
				</aug>
				<source>J Am Chem Soc</source>
				<pubdate>1964</pubdate>
				<volume>86</volume>
				<fpage>1616</fpage>
				<lpage>1626</lpage>
			</bibl>
			<bibl id="B3">
				<title>
					<p>Computer-assisted structure-activity studies of chemical carcinogens. A heterogeneous data set</p>
				</title>
				<aug>
					<au>
						<snm>Jurs</snm>
						<fnm>PC</fnm>
					</au>
					<au>
						<snm>Chou</snm>
						<fnm>JT</fnm>
					</au>
					<au>
						<snm>Yuan</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>J Med Chem</source>
				<pubdate>1979</pubdate>
				<volume>22</volume>
				<fpage>476</fpage>
				<lpage>483</lpage>
				<xrefbib>
					<pubid idtype="pmpid">458798</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B4">
				<title>
					<p>The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses</p>
				</title>
				<aug>
					<au>
						<snm>Wold</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Ruhe</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Wold</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Dunn</snm>
						<fnm>WJ</fnm>
						<suf>III</suf>
					</au>
				</aug>
				<source>SIAM J Sci Stat Comput</source>
				<pubdate>1984</pubdate>
				<volume>5</volume>
				<fpage>735</fpage>
				<lpage>743</lpage>
			</bibl>
			<bibl id="B5">
				<title>
					<p>A systematic evaluation of the benefits and hazards of variable selection in latent variable regression. Part I. Search algorithm, theory and simulations</p>
				</title>
				<aug>
					<au>
						<snm>Baumann</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Albert</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>von Korff</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>J Chemometrics</source>
				<pubdate>2002</pubdate>
				<volume>16</volume>
				<fpage>339</fpage>
				<lpage>350</lpage>
			</bibl>
			<bibl id="B6">
				<title>
					<p>A systematic evaluation of the benefits and hazards of variable selection in latent variable regression. Part II. Practical applications</p>
				</title>
				<aug>
					<au>
						<snm>Baumann</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>von Korff</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Albert</snm>
						<fnm>H</fnm>
					</au>
				</aug>
				<source>J Chemometrics</source>
				<pubdate>2002</pubdate>
				<volume>16</volume>
				<fpage>351</fpage>
				<lpage>360</lpage>
			</bibl>
			<bibl id="B7">
				<title>
					<p>Assessing applicability domains of toxicological QSARs: definition, confidence in predicted values, and the role of mechanisms of action</p>
				</title>
				<aug>
					<au>
						<snm>Schultz</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Hewitt</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Netzeva</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Cronin</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>QSAR Comb Sci</source>
				<pubdate>2007</pubdate>
				<volume>26</volume>
				<fpage>238</fpage>
				<lpage>254</lpage>
			</bibl>
			<bibl id="B8">
				<title>
					<p>Modeling without boundary conditions: an issue in QSAR validation</p>
				</title>
				<aug>
					<au>
						<snm>Giuliani</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Benigni</snm>
						<fnm>R</fnm>
					</au>
				</aug>
				<source>Computer-Assisted Lead Finding and Optimization</source>
				<publisher>Weinheim: Wiley-VCH</publisher>
				<editor>van de Waterbeemd H, Testa B, Folkers G</editor>
				<pubdate>1997</pubdate>
				<fpage>51</fpage>
				<lpage>63</lpage>
			</bibl>
			<bibl id="B9">
				<title>
					<p>Similarity to molecules in the training set is a good discriminator for prediction accuracy in QSAR</p>
				</title>
				<aug>
					<au>
						<snm>Sheridan</snm>
						<fnm>RP</fnm>
					</au>
					<au>
						<snm>Feuston</snm>
						<fnm>BP</fnm>
					</au>
					<au>
						<snm>Maiorov</snm>
						<fnm>VN</fnm>
					</au>
					<au>
						<snm>Kearsley</snm>
						<fnm>SK</fnm>
					</au>
				</aug>
				<source>J Chem Inf Comput Sci</source>
				<pubdate>2004</pubdate>
				<volume>44</volume>
				<fpage>1912</fpage>
				<lpage>1928</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15554660</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B10">
				<title>
					<p>Determining the validity of a QSAR model &#8211; a classification approach</p>
				</title>
				<aug>
					<au>
						<snm>Guha</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Jurs</snm>
						<fnm>PC</fnm>
					</au>
				</aug>
				<source>J Chem Inf Model</source>
				<pubdate>2005</pubdate>
				<volume>45</volume>
				<fpage>65</fpage>
				<lpage>73</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15667130</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B11">
				<title>
					<p>Assessing the reliability of a QSAR model's predictions</p>
				</title>
				<aug>
					<au>
						<snm>He</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Jurs</snm>
						<fnm>PC</fnm>
					</au>
				</aug>
				<source>J Mol Graph Model</source>
				<pubdate>2005</pubdate>
				<volume>23</volume>
				<fpage>503</fpage>
				<lpage>523</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15896992</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B12">
				<title>
					<p>Estimating the domain of applicability for machine learning QSAR models: a study on aqueous solubility of drug discovery molecules</p>
				</title>
				<aug>
					<au>
						<snm>Schroeter</snm>
						<fnm>TS</fnm>
					</au>
					<au>
						<snm>Schwaighofer</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Mika</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Ter Laak</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Suelzle</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Ganzer</snm>
						<fnm>U</fnm>
					</au>
					<au>
						<snm>Heinrich</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>M&#252;ller</snm>
						<fnm>K-R</fnm>
					</au>
				</aug>
				<source>J Comput-Aided Mol Des</source>
				<pubdate>2007</pubdate>
				<volume>21</volume>
				<fpage>651</fpage>
				<lpage>664</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">18060505</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B13">
				<title>
					<p>Predictivity of QSAR</p>
				</title>
				<aug>
					<au>
						<snm>Benigni</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Bossa</snm>
						<fnm>C</fnm>
					</au>
				</aug>
				<source>J Chem Inf Model</source>
				<pubdate>2008</pubdate>
				<volume>48</volume>
				<fpage>971</fpage>
				<lpage>980</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">18426198</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B14">
				<title>
					<p>Critical assessment of QSAR models of environmental toxicity against <it>Tetrahymena pyriformis</it>: focusing on applicability domain and overfitting by variable selection</p>
				</title>
				<aug>
					<au>
						<snm>Tetko</snm>
						<fnm>IV</fnm>
					</au>
					<au>
						<snm>Sushko</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Pandey</snm>
						<fnm>AK</fnm>
					</au>
					<au>
						<snm>Zhu</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Tropsha</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Papa</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>&#214;berg</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Todeschini</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Fourches</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Varnek</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>J Chem Inf Model</source>
				<pubdate>2008</pubdate>
				<volume>48</volume>
				<fpage>1733</fpage>
				<lpage>1746</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">18729318</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B15">
				<title>
					<p>The importance of the domain of applicability in QSAR modeling</p>
				</title>
				<aug>
					<au>
						<snm>Weaver</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Gleeson</snm>
						<fnm>MP</fnm>
					</au>
				</aug>
				<source>J Mol Graph Model</source>
				<pubdate>2008</pubdate>
				<volume>26</volume>
				<fpage>1315</fpage>
				<lpage>1326</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">18328754</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B16">
				<title>
					<p>Predicting human safety: screening and computational approaches</p>
				</title>
				<aug>
					<au>
						<snm>Johnson</snm>
						<fnm>DE</fnm>
					</au>
					<au>
						<snm>Wolfgang</snm>
						<fnm>GI</fnm>
					</au>
				</aug>
				<source>Drug Discov Today</source>
				<pubdate>2000</pubdate>
				<volume>5</volume>
				<fpage>445</fpage>
				<lpage>454</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">11018595</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B17">
				<title>
					<p>The integrated use of models for the properties and effects of chemicals by means of a structured workflow</p>
				</title>
				<aug>
					<au>
						<snm>Bassan</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Worth</snm>
						<fnm>AP</fnm>
					</au>
				</aug>
				<source>QSAR Comb Sci</source>
				<pubdate>2008</pubdate>
				<volume>27</volume>
				<fpage>6</fpage>
				<lpage>20</lpage>
			</bibl>
			<bibl id="B18">
				<title>
					<p>Improving opportunities for regulatory acceptance of QSARs: the importance of model domain, uncertainty, validity and predictability</p>
				</title>
				<aug>
					<au>
						<snm>Walker</snm>
						<fnm>JD</fnm>
					</au>
					<au>
						<snm>Carlsen</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Jaworska</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Quant Struct-Act Rel</source>
				<pubdate>2003</pubdate>
				<volume>22</volume>
				<fpage>6</fpage>
				<lpage>20</lpage>
			</bibl>
			<bibl id="B19">
				<aug>
					<au>
						<snm>Snedecor</snm>
						<fnm>GW</fnm>
					</au>
					<au>
						<snm>Cochran</snm>
						<fnm>WG</fnm>
					</au>
				</aug>
				<source>Statistical Methods</source>
				<publisher>Iowa State Press, Ames, IA</publisher>
				<edition>8</edition>
				<pubdate>1989</pubdate>
			</bibl>
			<bibl id="B20">
				<title>
					<p>Error estimation in PLS latent variable structure</p>
				</title>
				<aug>
					<au>
						<snm>Kleinknecht</snm>
						<fnm>RE</fnm>
					</au>
				</aug>
				<source>J Chemometrics</source>
				<pubdate>1996</pubdate>
				<volume>10</volume>
				<fpage>687</fpage>
				<lpage>695</lpage>
			</bibl>
			<bibl id="B21">
				<title>
					<p>Prediction intervals in partial least squares</p>
				</title>
				<aug>
					<au>
						<snm>Denham</snm>
						<fnm>MC</fnm>
					</au>
				</aug>
				<source>J Chemometrics</source>
				<pubdate>1997</pubdate>
				<volume>11</volume>
				<fpage>39</fpage>
				<lpage>52</lpage>
			</bibl>
			<bibl id="B22">
				<title>
					<p>Propagation of measurement errors for the validation of predictions obtained by principal component regression and partial least squares</p>
				</title>
				<aug>
					<au>
						<snm>Faber</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Kowalski</snm>
						<fnm>BR</fnm>
					</au>
				</aug>
				<source>J Chemometrics</source>
				<pubdate>1997</pubdate>
				<volume>11</volume>
				<fpage>181</fpage>
				<lpage>238</lpage>
			</bibl>
			<bibl id="B23">
				<title>
					<p>Comments on construction of confidence intervals in connection with partial least squares</p>
				</title>
				<aug>
					<au>
						<snm>Morsing</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Ekman</snm>
						<fnm>C</fnm>
					</au>
				</aug>
				<source>J Chemometrics</source>
				<pubdate>1998</pubdate>
				<volume>12</volume>
				<fpage>295</fpage>
				<lpage>299</lpage>
			</bibl>
			<bibl id="B24">
				<title>
					<p>Validation of QSARs</p>
				</title>
				<aug>
					<au>
						<snm>Wold</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Quant Struct-Act Rel</source>
				<pubdate>1991</pubdate>
				<volume>10</volume>
				<fpage>191</fpage>
				<lpage>193</lpage>
			</bibl>
			<bibl id="B25">
				<title>
					<p>Validating models based on large data sets</p>
				</title>
				<aug>
					<au>
						<snm>Clark</snm>
						<fnm>RD</fnm>
					</au>
					<au>
						<snm>Sprous</snm>
						<fnm>DG</fnm>
					</au>
					<au>
						<snm>Leonard</snm>
						<fnm>JM</fnm>
					</au>
				</aug>
				<source>Rational Approaches to Drug Design</source>
				<publisher>Barcelona: Prous Science</publisher>
				<editor>H&#246;ltje H-D, Sippl W</editor>
				<pubdate>2001</pubdate>
				<fpage>475</fpage>
				<lpage>485</lpage>
			</bibl>
			<bibl id="B26">
				<title>
					<p>Beware of q2!</p>
				</title>
				<aug>
					<au>
						<snm>Golbraikh</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Tropsha</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>J Mol Graph Model</source>
				<pubdate>2002</pubdate>
				<volume>20</volume>
				<fpage>269</fpage>
				<lpage>276</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">11858635</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B27">
				<title>
					<p>Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection</p>
				</title>
				<aug>
					<au>
						<snm>Golbraikh</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Tropsha</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>J Comput-Aided Mol Des</source>
				<pubdate>2002</pubdate>
				<volume>16</volume>
				<fpage>357</fpage>
				<lpage>369</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">12489684</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B28">
				<title>
					<p>Statistical variation in progressive scrambling</p>
				</title>
				<aug>
					<au>
						<snm>Clark</snm>
						<fnm>RD</fnm>
					</au>
					<au>
						<snm>Fox</snm>
						<fnm>PC</fnm>
					</au>
				</aug>
				<source>J Comput-Aided Mol Des</source>
				<pubdate>2004</pubdate>
				<volume>18</volume>
				<issue>7-9</issue>
				<fpage>563</fpage>
				<lpage>576</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15729855</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B29">
				<title>
					<p>Assessing model fit by cross-validation</p>
				</title>
				<aug>
					<au>
						<snm>Hawkins</snm>
						<fnm>DM</fnm>
					</au>
					<au>
						<snm>Basak</snm>
						<fnm>SC</fnm>
					</au>
					<au>
						<snm>Mills</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>J Chem Inf Comput Sci</source>
				<pubdate>2003</pubdate>
				<volume>43</volume>
				<fpage>579</fpage>
				<lpage>586</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">12653524</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B30">
				<title>
					<p>A Comparison of methods for modeling quantitative structure-activity relationships</p>
				</title>
				<aug>
					<au>
						<snm>Sutherland</snm>
						<fnm>JJ</fnm>
					</au>
					<au>
						<snm>O'Brien</snm>
						<fnm>LA</fnm>
					</au>
					<au>
						<snm>Weaver</snm>
						<fnm>DF</fnm>
					</au>
				</aug>
				<source>J Med Chem</source>
				<pubdate>2004</pubdate>
				<volume>47</volume>
				<fpage>3777</fpage>
				<lpage>3787</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15239656</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B31">
				<title>
					<p>Sample-distance partial least squares: PLS optimized for many variables, with application to CoMFA</p>
				</title>
				<aug>
					<au>
						<snm>Bush</snm>
						<fnm>BL</fnm>
					</au>
					<au>
						<snm>Nachbar</snm>
						<fnm>RB</fnm>
						<suf>Jr</suf>
					</au>
				</aug>
				<source>J Comput-Aided Mol Des</source>
				<pubdate>1993</pubdate>
				<volume>7</volume>
				<fpage>587</fpage>
				<lpage>619</lpage>
				<xrefbib>
					<pubid idtype="pmpid">8294948</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B32">
				<title>
					<p>Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins</p>
				</title>
				<aug>
					<au>
						<snm>Cramer</snm>
						<fnm>RD</fnm>
						<suf>III</suf>
					</au>
					<au>
						<snm>Patterson</snm>
						<fnm>DE</fnm>
					</au>
					<au>
						<snm>Bunce</snm>
						<fnm>JD</fnm>
					</au>
				</aug>
				<source>J Am Chem Soc</source>
				<pubdate>1988</pubdate>
				<volume>110</volume>
				<fpage>5959</fpage>
				<lpage>5967</lpage>
			</bibl>
			<bibl id="B33">
				<title>
					<p>The developing practice of comparative molecular field analysis</p>
				</title>
				<aug>
					<au>
						<snm>Cramer</snm>
						<fnm>RD</fnm>
						<suf>III</suf>
					</au>
					<au>
						<snm>DePriest</snm>
						<fnm>SA</fnm>
					</au>
					<au>
						<snm>Patterson</snm>
						<fnm>DE</fnm>
					</au>
					<au>
						<snm>Hecht</snm>
						<fnm>P</fnm>
					</au>
				</aug>
				<source>3D QSAR in Drug Design: Theory, Methods and Applications</source>
				<publisher>Leiden: ESCOM</publisher>
				<editor>Kubinyi H</editor>
				<pubdate>1993</pubdate>
				<fpage>443</fpage>
				<lpage>485</lpage>
			</bibl>
			<bibl id="B34">
				<title>
					<p>Improving the predictive quality of models</p>
				</title>
				<aug>
					<au>
						<snm>Kroemer</snm>
						<fnm>RT</fnm>
					</au>
					<au>
						<snm>Hecht</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Guessregen</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Liedl</snm>
						<fnm>KR</fnm>
					</au>
				</aug>
				<source>3D QSAR in Drug Design</source>
				<publisher>Dordrecht: Kluwer/ESCOM</publisher>
				<editor>Kubinyi H, Folkers G, Martin YC</editor>
				<pubdate>1998</pubdate>
				<volume>3</volume>
				<fpage>41</fpage>
				<lpage>56</lpage>
			</bibl>
			<bibl id="B35">
				<title>
					<p>Molecular Hologram QSAR</p>
				</title>
				<aug>
					<au>
						<snm>Heritage</snm>
						<fnm>TW</fnm>
					</au>
					<au>
						<snm>Lowis</snm>
						<fnm>DR</fnm>
					</au>
				</aug>
				<source>Rational Drug Design: Novel Methodology and Practical Applications, ACS Symposium Series 719</source>
				<publisher>Washington DC: American Chemical Society</publisher>
				<editor>Parrill AL, Reddy MR</editor>
				<pubdate>1999</pubdate>
				<fpage>212</fpage>
				<lpage>225</lpage>
			</bibl>
			<bibl id="B36">
				<title>
					<p>Evaluation of quantitative structure-activity relationship methods for large-scale prediction of chemicals binding to the estrogen receptor</p>
				</title>
				<aug>
					<au>
						<snm>Tong</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Lowis</snm>
						<fnm>DR</fnm>
					</au>
					<au>
						<snm>Perkins</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Chen</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Welsh</snm>
						<fnm>WJ</fnm>
					</au>
					<au>
						<snm>Goddette</snm>
						<fnm>DW</fnm>
					</au>
					<au>
						<snm>Heritage</snm>
						<fnm>TW</fnm>
					</au>
					<au>
						<snm>Sheehan</snm>
						<fnm>DM</fnm>
					</au>
				</aug>
				<source>J Chem Inf Comput Sci</source>
				<pubdate>1998</pubdate>
				<volume>38</volume>
				<fpage>669</fpage>
				<lpage>677</lpage>
				<xrefbib>
					<pubid idtype="pmpid">9722424</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B37">
				<title>
					<p>Effect of parameter variations on the effectiveness of HQSAR analyses</p>
				</title>
				<aug>
					<au>
						<snm>Seel</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Turner</snm>
						<fnm>DB</fnm>
					</au>
					<au>
						<snm>Willett</snm>
						<fnm>P</fnm>
					</au>
				</aug>
				<source>Quant Struct-Act Rel</source>
				<pubdate>1999</pubdate>
				<volume>18</volume>
				<fpage>245</fpage>
				<lpage>252</lpage>
			</bibl>
			<bibl id="B38">
				<title>
					<p>Three-dimensional quantitative structure-activity relationships of cyclooxygenase-2 (COX-2) inhibitors: a comparative molecular field analysis</p>
				</title>
				<aug>
					<au>
						<snm>Chavatte</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Yous</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Marot</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Baurin</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Lesieur</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>J Med Chem</source>
				<pubdate>2001</pubdate>
				<volume>44</volume>
				<fpage>3223</fpage>
				<lpage>3230</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">11563921</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B39">
				<title>
					<p>Boosted leave-many-out cross-validation: the effect of training set and test set diversity on PLS statistics</p>
				</title>
				<aug>
					<au>
						<snm>Clark</snm>
						<fnm>RD</fnm>
					</au>
				</aug>
				<source>J Comput-Aided Mol Des</source>
				<pubdate>2003</pubdate>
				<volume>17</volume>
				<fpage>265</fpage>
				<lpage>275</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">13677492</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B40">
				<title>
					<p>Visualizing substructural fingerprints</p>
				</title>
				<aug>
					<au>
						<snm>Clark</snm>
						<fnm>RD</fnm>
					</au>
					<au>
						<snm>Patterson</snm>
						<fnm>DE</fnm>
					</au>
					<au>
						<snm>Soltanshahi</snm>
						<fnm>F</fnm>
					</au>
					<au>
						<snm>Blake</snm>
						<fnm>JF</fnm>
					</au>
					<au>
						<snm>Matthew</snm>
						<fnm>JB</fnm>
					</au>
				</aug>
				<source>J Mol Graph Model</source>
				<pubdate>2000</pubdate>
				<volume>18</volume>
				<fpage>404</fpage>
				<lpage>411</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">11143558</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B41">
				<title>
					<p>Stochastic Proximity Embedding</p>
				</title>
				<aug>
					<au>
						<snm>Agrafiotis</snm>
						<fnm>DK</fnm>
					</au>
				</aug>
				<source>J Comput Chem</source>
				<pubdate>2003</pubdate>
				<volume>24</volume>
				<fpage>1215</fpage>
				<lpage>1221</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">12820129</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B42">
				<title>
					<p>OptiSim: an extended dissimilarity selection method for finding diverse representative subsets</p>
				</title>
				<aug>
					<au>
						<snm>Clark</snm>
						<fnm>RD</fnm>
					</au>
				</aug>
				<source>J Chem Inf Comput Sci</source>
				<pubdate>1997</pubdate>
				<volume>37</volume>
				<fpage>1181</fpage>
				<lpage>1188</lpage>
			</bibl>
			<bibl id="B43">
				<title>
					<p>Balancing representativeness against diversity using optimizable <it>K</it>-dissimilarity and hierarchical clustering</p>
				</title>
				<aug>
					<au>
						<snm>Clark</snm>
						<fnm>RD</fnm>
					</au>
					<au>
						<snm>Langton</snm>
						<fnm>WJ</fnm>
					</au>
				</aug>
				<source>J Chem Inf Comput Sci</source>
				<pubdate>1998</pubdate>
				<volume>38</volume>
				<fpage>1079</fpage>
				<lpage>1086</lpage>
			</bibl>
			<bibl id="B44">
				<title>
					<p>Getting past diversity in assessing virtual library designs</p>
				</title>
				<aug>
					<au>
						<snm>Clark</snm>
						<fnm>RD</fnm>
					</au>
				</aug>
				<source>J Brazil Chem Soc</source>
				<pubdate>2002</pubdate>
				<volume>13</volume>
				<fpage>788</fpage>
				<lpage>794</lpage>
			</bibl>
			<bibl id="B45">
				<title>
					<p>The effect of structural redundancy in validation sets on virtual screening performance</p>
				</title>
				<aug>
					<au>
						<snm>Clark</snm>
						<fnm>RD</fnm>
					</au>
					<au>
						<snm>Shepphird</snm>
						<fnm>JK</fnm>
					</au>
					<au>
						<snm>Holliday</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>J Chemometrics</source>
			</bibl>
			<bibl id="B46">
				<title>
					<p>Nonlinear dependence in comparative molecular field analysis</p>
				</title>
				<aug>
					<au>
						<snm>Kim</snm>
						<fnm>KK</fnm>
					</au>
				</aug>
				<source>3D QSAR in Drug Design: Theory, Methods and Applications</source>
				<publisher>Leiden: ESCOM</publisher>
				<editor>Kubinyi H</editor>
				<pubdate>1993</pubdate>
				<fpage>71</fpage>
				<lpage>82</lpage>
			</bibl>
			<bibl id="B47">
				<title>
					<p>Introduction to scientific data mining: Direct kernel methods and applications</p>
				</title>
				<aug>
					<au>
						<snm>Embrechts</snm>
						<fnm>MJ</fnm>
					</au>
					<au>
						<snm>Szymanski</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Sternickel</snm>
						<fnm>K</fnm>
					</au>
				</aug>
				<source>Computationally Intelligent Hybrid Systems: The Fusion of Soft Computing and Hard Computing</source>
				<publisher>New York: Wiley</publisher>
				<editor>Ovaska SJ</editor>
				<pubdate>2005</pubdate>
				<fpage>317</fpage>
				<lpage>362</lpage>
			</bibl>
			<bibl id="B48">
				<title>
					<p>Comparison of similarity coefficients for clustering and compound selection</p>
				</title>
				<aug>
					<au>
						<snm>Haranczyk</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Holliday</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>J Chem Inf Model</source>
				<pubdate>2008</pubdate>
				<volume>48</volume>
				<fpage>498</fpage>
				<lpage>509</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">18293953</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B49">
				<title>
					<p>Iterative partial equalization of orbital electronegativity &#8211; a rapid access to atomic charges</p>
				</title>
				<aug>
					<au>
						<snm>Gasteiger</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Marsili</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Tetrahedron</source>
				<pubdate>1980</pubdate>
				<volume>36</volume>
				<fpage>3219</fpage>
				<lpage>3228</lpage>
			</bibl>
			<bibl id="B50">
				<source>SYBYL, v 8.0</source>
				<publisher>Tripos International: St. Louis, MO</publisher>
				<pubdate>2008</pubdate>
			</bibl>
			<bibl id="B51">
				<source>Benchware DataMiner, v. 1.6</source>
				<publisher>Tripos International: St. Louis, MO</publisher>
				<pubdate>2007</pubdate>
			</bibl>
			<bibl id="B52">
				<aug>
					<au>
						<snm>Bush</snm>
						<fnm>B</fnm>
					</au>
				</aug>
				<source>SAMPLS: SAMple-driven Partial Least Squares</source>
				<publisher>Merck &amp; Co., Inc.: Rahway, NJ</publisher>
				<pubdate>1993</pubdate>
			</bibl>
		</refgrp>
	</bm>
</art>
