Open Access Methodology article

Determination of sample size for a multi-class classifier based on single-nucleotide polymorphisms: a volume under the surface approach

Xinyu Liu1, Yupeng Wang2 and TN Sriram1*

Author Affiliations

1 Department of Statistics, University of Georgia, Athens, GA 30602, USA

2 Computational Biology Service Unit, Cornell University, Ithaca, NY 14853, USA

BMC Bioinformatics 2014, 15:190  doi:10.1186/1471-2105-15-190


The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/15/190


Received: 25 June 2013
Accepted: 4 June 2014
Published: 14 June 2014

© 2014 Liu et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.

Abstract

Background

Data on single-nucleotide polymorphisms (SNPs) have been found to be useful in predicting phenotypes ranging from an individual’s class membership to his/her risk of developing a disease. In multi-class classification scenarios, clinical samples are often limited due to cost constraints, making it necessary to determine the sample size needed to build an accurate classifier based on SNPs. The performance of such classifiers can be assessed using the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) for two classes and the Volume Under the ROC hyper-Surface (VUS) for three or more classes. Sample size determination based on AUC or VUS would not only guarantee an overall correct classification rate, but also make studies more cost-effective.

Results

For coded SNP data from D(≥2) classes, we derive an optimal Bayes classifier and a linear classifier, and obtain a normal approximation to the probability of correct classification for each classifier. These approximations are then used to evaluate the associated AUCs or VUSs, whose accuracies are validated using Monte Carlo simulations. We give a sample size determination method, which ensures that the difference between the two approximate AUCs (or VUSs) is below a pre-specified threshold. The performance of our sample size determination method is then illustrated via simulations. For the HapMap data with three and four populations, a linear classifier is built using 92 independent SNPs and the required total sample sizes are determined for a continuum of threshold values. In all, four different sample size determination studies are conducted with the HapMap data, covering cases involving well-separated populations to poorly-separated ones.

Conclusion

For multi-class problems, we have developed a sample size determination methodology and illustrated its usefulness in obtaining a required sample size from the estimated learning curve. For classification scenarios, this methodology will help scientists determine whether the sample at hand is adequate or whether more samples are required to achieve a pre-specified accuracy. A PDF manual for the R package “SampleSizeSNP” is given in Additional file 1, and a ZIP file of the R package “SampleSizeSNP” is given in Additional file 2.

Keywords:
Area under the receiver operating characteristic curve; Classification; HapMap data; Heterogeneous stock mice data; Probability of correct classification; Receiver operating characteristic; Sample size determination

Background

Data on single-nucleotide polymorphisms (SNPs) have been found to be useful in predicting an individual’s class membership or his/her response to a drug, susceptibility to environmental factors such as toxins, and the risk of developing a particular disease, among others [1-5]. The classification literature provides a variety of classifiers (e.g., Support Vector Machine, genetic programming, Neural Networks and Logistic Regression) and sample size determination methods [6-10], but most of these are only applicable to continuous data.

Recently Liu et al. [11] developed an optimal Bayes classifier and a linear classifier for coded SNP data from two classes, and obtained a normal approximation to the probability of correct classification (PCC) for each classifier. They also proposed a sample size determination methodology to determine an adequate sample size, which ensures that the difference between the two approximate PCCs is below a pre-specified threshold value. Using Monte Carlo simulations, Liu et al. [11] assessed the validity of their approximations. Furthermore, they illustrated the performance of their sample size determination method via simulations and a real data analysis using the HapMap data on two populations—Chinese and Japanese.

While Liu et al. [11] showed that their sample size determination method is competitive, they also pointed out that an additional maximization step is required to determine the discrimination values for each of their classifiers; see Remark 1 in their article for details. When there are three or more classes, however, determining such discrimination values is not only more difficult but also increases the overall computational burden. In a two-class scenario, a well-known way to overcome this difficulty is to consider the Receiver Operating Characteristic (ROC) curve, which plots the True Positive Rate against the False Positive Rate at various discrimination values [12,13]. The ROC curve allows the discrimination value to vary and simultaneously explores all possible combinations of the correct classification rates [14]. The Area Under the ROC curve (AUC) is commonly used as a scalar performance measure, which allows classifiers to be compared independently of the discrimination values. Unfortunately, the AUC measure is only applicable to a two-class scenario. A popular extension of the AUC measure, known as the Volume Under the ROC hyper-Surface (VUS), is often used in a multi-class scenario (see, e.g., Landgrebe and Duin [14] and Landgrebe and Paclik [15]).

This article revisits the problem of sample size determination in classification scenarios involving coded SNP data, but uses the AUC and the VUS as performance measures for two-class and multi-class scenarios, respectively. More specifically, for coded SNP data from D(≥2) classes, we derive an optimal Bayes classifier and obtain a normal approximation to its probability of correct classification, which is denoted by PCC(). We also derive a linear classifier and obtain a normal approximation to its probability of correct classification, which is denoted by <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M1">View MathML</a>. For an overall assessment of each of the classifiers, we define the scalar measures AUC (for two-class) and VUS (for multi-class), and correspondingly define the quantities <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M2">View MathML</a> for each classification scenario. For the two-class scenario, we propose to determine the sample size n for which <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M3">View MathML</a>, where γ∈(0,1) is a pre-specified threshold value. Whereas, for the multi-class scenario, we propose to determine the sample size n for which <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M4">View MathML</a>. A computational method to determine the total sample size for various values of γ is described. 
Monte Carlo simulations are carried out to corroborate our theoretical approximations, and the performance of our sample size determination method is assessed via simulations and analysis of the HapMap data consisting of 3 and 4 populations, respectively. In all, four different sample size determination studies are conducted with the HapMap data, covering cases involving well-separated populations to poorly-separated ones. Details are given in the data analysis section.

R software was used to carry out all the computations. A PDF manual for R package “SampleSizeSNP” is given in Additional file 1, and a ZIP file of the R package “SampleSizeSNP” is given in Additional file 2.

Additional file 1. Manual of R package “SampleSizeSNP”.

Additional file 2. R package “SampleSizeSNP” in ZIP file.

Methods

Assumptions

Suppose there are D(≥2) distinct classes denoted by C1,…,CD, consisting of n1,…,nD subjects, respectively. For each subject, we observe a p-dimensional SNP vector, <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M5">View MathML</a>, where typically p is much larger (>>) than <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M6">View MathML</a>, and the jth SNP is coded in such a way xj=0,1,2, which denotes the number of minor alleles in the genotype “aa”, “Aa” and “AA”, respectively. It is possible that some of the SNPs are highly correlated, leading us to choose one SNP to represent a set of highly correlated ones. For classification and sample size determination, we make the following assumptions:

1. For an m such that <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M7">View MathML</a>, the data vector <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M8">View MathML</a> consists only of m SNPs, which are statistically independent. That is, the remaining (p−m) correlated SNPs are not used for classification.

2. For each k=1,…,D and j=1,…,m, we postulate Hardy-Weinberg equilibrium, according to which the probability mass function of the coded SNP (Xj) belonging to class k is given by

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M9">View MathML</a>

where θk,j is the minor allele frequency at locus j in class k, and by definition θk,j∈(0.01,0.5). Here, θk,j<0.5 because it is the minor allele frequency, and θk,j>0.01 ensures that the polymorphism is not a mutation. For each k=1,…,D, let <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M10">View MathML</a> denote the parameter vector corresponding to the class Ck.

3. A proportion ρ of the m SNPs have marginal effects on any two classes; let l=⌊ρm⌋ denote the number of SNPs with marginal effects.
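
As an illustration of Assumption 2, the following minimal Python sketch evaluates and samples a coded SNP under Hardy-Weinberg equilibrium. It assumes the standard form in which the minor allele count follows a Binomial(2, θ) distribution; the displayed mass function is not reproduced in this copy of the article, so that form is an assumption, and the function names are hypothetical (the authors' own software is the R package “SampleSizeSNP”).

```python
import random

def hwe_pmf(x, theta):
    """P(X = x) for x = 0, 1, 2 minor alleles, assuming X ~ Binomial(2, theta)."""
    if x not in (0, 1, 2):
        raise ValueError("coded SNP must be 0, 1 or 2")
    binom_coeff = (1, 2, 1)[x]  # C(2, x)
    return binom_coeff * theta ** x * (1 - theta) ** (2 - x)

def sample_snp_vector(thetas, rng):
    """Draw one vector of independent coded SNPs, one minor allele frequency per locus."""
    # Each genotype is the sum of two independent Bernoulli(theta) allele draws.
    return [(rng.random() < t) + (rng.random() < t) for t in thetas]

rng = random.Random(0)
x = sample_snp_vector([0.10, 0.30, 0.45], rng)
```

Independence across loci (Assumption 1) is what allows each component to be drawn separately.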

The optimal classifier and its PCC

By the assumptions above, the conditional mass function of <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M11">View MathML</a> given the class Ck, k=1,…,D, is

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M12">View MathML</a>

Suppose <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M13">View MathML</a> and we denote the marginal mass function <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M14">View MathML</a>, then for each 1≤k≤D, the posterior mass function of the class Ck given <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M15">View MathML</a> is

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M16">View MathML</a>

For any fixed k=1,…,D, the Bayes classification rule then classifies <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M17">View MathML</a> to the class Ck if

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M18">View MathML</a>

(1)

for all k′≠k. This leads to the optimal Bayes classifier, which classifies <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M19">View MathML</a> to Ck if

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M20">View MathML</a>

(2)

for all k′≠k, where

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M21">View MathML</a>

(3)

Then, the PCC of the optimal Bayes classifier is defined as

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M22">View MathML</a>

In Additional file 3: Appendix 1, we derive a normal approximation for PCC(), as l→∞. That is, for large l, we show that

Additional file 3. Appendix 1–5.

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M23">View MathML</a>

(4)

where ϕ is the (D−1)-dimensional multivariate normal density, <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M24">View MathML</a> is a multiple integral, <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M25','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M25">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M26">View MathML</a> are (D−1)×1 vectors, and Σl,k is a (D−1)×(D−1) matrix. All these quantities are defined in Additional file 3: Appendix 1.

In Additional file 3: Appendix 4, we give an expression for (4) for the case D=3.
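
The Bayes rule in (1) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes known class priors, known minor allele frequencies, and a Binomial(2, θ) mass function for each independent SNP, and it estimates an overall correct classification rate by direct Monte Carlo rather than by the normal approximation in (4). All function names are hypothetical, and the paper's PCC may be defined per class rather than overall.

```python
import math
import random

def log_hwe_pmf(x, theta):
    """Log mass of a coded SNP x in {0, 1, 2}, assuming X ~ Binomial(2, theta)."""
    return math.log((1, 2, 1)[x]) + x * math.log(theta) + (2 - x) * math.log(1 - theta)

def bayes_classify(x, thetas_by_class, priors):
    """Return the index of the class with the largest posterior probability."""
    scores = [
        math.log(prior) + sum(log_hwe_pmf(xj, t) for xj, t in zip(x, thetas))
        for prior, thetas in zip(priors, thetas_by_class)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

def monte_carlo_pcc(thetas_by_class, priors, n_draws, rng):
    """Estimate the overall correct classification rate by simulation."""
    classes = range(len(priors))
    correct = 0
    for _ in range(n_draws):
        k = rng.choices(classes, weights=priors)[0]  # true class
        x = [(rng.random() < t) + (rng.random() < t) for t in thetas_by_class[k]]
        correct += bayes_classify(x, thetas_by_class, priors) == k
    return correct / n_draws

# Three well-separated hypothetical classes with 50 independent SNPs each.
thetas = [[0.05] * 50, [0.25] * 50, [0.45] * 50]
priors = [1 / 3] * 3
pcc = monte_carlo_pcc(thetas, priors, 2000, random.Random(1))
```

Working on the log scale avoids numerical underflow when the product runs over many SNPs.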

A linear classifier and its PCC

Motivated by the form of the optimal Bayes classifier in (2), we consider the following linear classifier that classifies <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M27','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M27">View MathML</a> to the class Ck if

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M28','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M28">View MathML</a>

(5)

for all k′≠k, where <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M29','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M29">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M30">View MathML</a> are the maximum likelihood estimators of <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M31">View MathML</a>, respectively. Also, the values of the weights wj,n(k,k′) in (5) are determined in the following way: For each j=1,…,m and k′≠k, suppose we test the hypothesis <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M32','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M32">View MathML</a> versus <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M33','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M33">View MathML</a>. Then wj,n(k,k′)=1 if <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M34">View MathML</a> is rejected; else wj,n(k,k′)=0. 
In Additional file 3: Appendix 2, we use the large sample theory to derive a Wald test of level α to test <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M35','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M35">View MathML</a> versus <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M36','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M36">View MathML</a>, and an expression for the power, <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M37','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M37">View MathML</a>, of this test, when <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M38">View MathML</a>.

In Additional file 3: Appendix 3, we derive a normal approximation for the PCC of the linear classifier, denoted by <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M39">View MathML</a>. That is, for large l, we show that

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M40','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M40">View MathML</a>

(6)

Note that <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M41','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M41">View MathML</a> depends on <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M42','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M42">View MathML</a> through <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M43','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M43">View MathML</a>; see Additional file 3: Appendix 3 for details. In Additional file 3: Appendix 4, we give an expression for (6) for the case D=3.
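
The 0/1 weights of the linear classifier can be illustrated with a short Python sketch. The Wald test actually used is derived in Additional file 3: Appendix 2 and is not reproduced here, so the code below assumes a textbook two-sample Wald statistic for comparing minor allele frequencies, with the maximum likelihood estimator under a Binomial(2, θ) model (total minor allele count divided by 2n). The function names and the handling of the zero-variance case are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def mle_maf(genotypes):
    """MLE of the minor allele frequency from coded SNPs (0, 1, 2), assuming Binomial(2, theta)."""
    return sum(genotypes) / (2 * len(genotypes))

def wald_weight(geno_a, geno_b, alpha=0.05):
    """Return 1 if equality of the two allele frequencies is rejected at level alpha, else 0."""
    theta_a, theta_b = mle_maf(geno_a), mle_maf(geno_b)
    # Estimated variance of the difference of the two independent estimators.
    var = (theta_a * (1 - theta_a) / (2 * len(geno_a))
           + theta_b * (1 - theta_b) / (2 * len(geno_b)))
    if var == 0:
        # Degenerate samples (assumed convention): reject only if the estimates differ.
        return int(theta_a != theta_b)
    z = (theta_a - theta_b) / math.sqrt(var)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    return int(abs(z) > z_crit)

# Hypothetical samples for one SNP in two classes.
class_a = [0] * 90 + [1] * 10             # estimated frequency 0.05
class_b = [0] * 20 + [1] * 60 + [2] * 20  # estimated frequency 0.50
w = wald_weight(class_a, class_b)
```

A weight of 1 keeps the SNP in the comparison between the two classes; a weight of 0 drops it.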

AUC and VUS for the optimal and linear classifiers

For any (k,k′), define

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M44','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M44">View MathML</a>

Then, for the optimal Bayes classifier in (2) we have from (4) that

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M45','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M45">View MathML</a>

(7)

and similarly, for the linear classifier in (5), we have from (6) that

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M46','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M46">View MathML</a>

(8)

for k=1,…,D. When D=2, for the optimal Bayes classifier, the ROC() for two classes is the curve ξ2,2 vs. (1−ξ1,1). Then, the AUC() is

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M47','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M47">View MathML</a>

However, when the number of classes D≥3, we need to consider the volume under the ROC hypersurface. Following the work of Landgrebe and Duin [14], the VUS is defined as

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M48','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M48">View MathML</a>

(9)

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M49','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M49">View MathML</a>

By replacing ξk,k by <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M50','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M50">View MathML</a> [see (8)] in the above definitions of ROC, AUC and the VUS, we obtain corresponding expressions for the linear classifier in (5). We denote the resulting ones as <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M51','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M51">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M52','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M52">View MathML</a>. In Additional file 3: Appendix 4, we derive these expressions for the case D=3.
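
For the two-class case, the area under a ROC curve known only at a finite set of discrimination values can be approximated numerically. The Python sketch below uses the trapezoidal rule on points of the form (1−ξ1,1, ξ2,2); this is a generic numerical approximation offered for illustration, not the closed-form expression used in the article, and the function name is hypothetical.

```python
def auc_trapezoid(roc_points):
    """Trapezoidal area under ROC points given as (1 - xi_11, xi_22) pairs."""
    pts = sorted(roc_points)  # order by the horizontal coordinate
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(pts, pts[1:])
    )

# A random classifier lies on the diagonal; a perfect one passes through (0, 1).
auc_random = auc_trapezoid([(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)])
auc_perfect = auc_trapezoid([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)])
```

The diagonal gives an area of 1/2, which matches the value 1/D! for a random classifier when D=2.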

Computation of VUS

As is evident from (9), the computation of VUS involves high dimensional integration. Given below is a brief description of the steps involved in the computation of VUS. For ease of exposition, we will denote ξk=ξk,k, k=1,…,D. First, we randomly generate the thresholds <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M53','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M53">View MathML</a> (see (9)) and compute the corresponding <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M54','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M54">View MathML</a> satisfying (7). Note that the <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M55','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M55">View MathML</a> contributes to the integration in VUS only if all the ξk’s are positive.

To find as many <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M56','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M56">View MathML</a> values that contribute to the integration as possible, we use the ant colony optimization algorithm, where only the <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M57','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M57">View MathML</a> values corresponding to the <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M58','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M58">View MathML</a> values that contribute to the integration are retained. The retained values are then perturbed by a small amount of noise, and the resulting <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M59','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M59">View MathML</a> values are used as seeds for the next iteration. Then, we use the genetic algorithm to obtain another <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M60','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M60">View MathML</a> value located in a different region within (0,1)k, which also contributes to the integration. We use the ant colony algorithm and the genetic algorithm alternately to eventually generate a dense set of <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M61','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M61">View MathML</a> values that contribute to the integration. 
Note that the process is such that the newly generated <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M62','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M62">View MathML</a> values are appended to all the previously generated <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M63','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M63">View MathML</a> values.

Now, to compute the volume, VUS(), we use the convhulln function in the qhull R package. Note that the convhulln function is designed to determine the convex hull of a set of D-dimensional points and thus compute the volume of the hull. In view of this, in order to compute the volume, VUS(), a base of <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M64','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M64">View MathML</a> (this is the same as the <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M65','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M65">View MathML</a> vector, except that one of its components, e.g. the first component, is set to 0) is appended to the original <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M66','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M66">View MathML</a>. Since in each iteration the new <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M67','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M67">View MathML</a> values are appended to the old <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M68','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M68">View MathML</a> values from the previous iterations, and the VUS is concave, the computed VUS is expected to increase in value with each iteration. We stop appending the new <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M69','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M69">View MathML</a> values when |VUSold − VUSnew| < 0.001. When this criterion is satisfied, we obtain the value of VUS(). 
Similarly, the values of <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M70','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M70">View MathML</a> are calculated.
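
The hull-based volume computation and its stopping rule can be illustrated in two dimensions with the Python sketch below. It is a simplified analogue, not the authors' R code: the convex hull is computed with Andrew's monotone chain instead of convhulln, new points are drawn uniformly at random on a hypothetical concave curve (a quarter circle) instead of being proposed by the ant colony and genetic algorithms, and the function names are assumptions. As in the text, a base obtained by setting the first component to 0 is appended, points accumulate across iterations, and iteration stops once the change in area falls below 0.001.

```python
import math
import random

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(vertices):
    """Shoelace formula for a simple polygon."""
    n = len(vertices)
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2

def area_with_base(xi_points):
    """Hull area after appending the base (first component set to 0)."""
    base = [(0.0, xi2) for _, xi2 in xi_points]
    return polygon_area(convex_hull(list(xi_points) + base))

def iterate_area(sample_batch, seed_points, tol=0.001, max_iter=100):
    """Append new points until the hull area changes by less than tol."""
    points = list(seed_points)
    old = area_with_base(points)
    for _ in range(max_iter):
        points.extend(sample_batch())
        new = area_with_base(points)
        if abs(old - new) < tol:
            return new
        old = new
    return old

rng = random.Random(0)

def quarter_circle_batch(size=20):
    """Hypothetical concave curve: random points on the unit quarter circle."""
    angles = [rng.uniform(0, math.pi / 2) for _ in range(size)]
    return [(math.cos(t), math.sin(t)) for t in angles]

area = iterate_area(quarter_circle_batch, [(1.0, 0.0), (0.0, 1.0)])
```

Because points are only ever appended, the hull area cannot decrease between iterations, and for this curve it approaches π/4 from below.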

Sample size determination using VUS or AUC

Given a threshold γ, we determine the sample size n satisfying the following condition:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M71','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M71">View MathML</a>

(10)

For the case D=2, we determine the sample size n satisfying the condition: <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M72','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M72">View MathML</a>. A simulation study for the case D=2 is carried out in Additional file 3: Appendix 5 to assess the performance of our sample size determination algorithm.
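
The search for the required sample size can be sketched as a scan over n. The Python example below assumes, following the description in the Abstract, that the criterion is that the difference between the approximate VUS of the optimal classifier and that of the linear classifier falls below γ. The learning curve passed in is a hypothetical stand-in for the approximation of the linear classifier's VUS, and the function name and default search range are assumptions.

```python
def smallest_sample_size(vus_optimal, vus_linear, gamma, n_min=2, n_max=10000):
    """Smallest n in [n_min, n_max] with vus_optimal - vus_linear(n) < gamma, else None."""
    for n in range(n_min, n_max + 1):
        if vus_optimal - vus_linear(n) < gamma:
            return n
    return None

# Hypothetical learning curve: the gap to the optimal value shrinks like 0.5 / n.
def toy_curve(n):
    return 0.9 - 0.5 / n

n_required = smallest_sample_size(0.9, toy_curve, gamma=0.03)
```

Repeating the scan for a grid of γ values traces out the required sample size as a function of the threshold.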

Results

Monte Carlo simulations

Before we illustrate the performance of our sample size determination method based on AUC or VUS, we present results from an extensive Monte Carlo simulation study conducted to verify the accuracy of the approximations for <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M73','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M73">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M74">View MathML</a>, respectively, and study their behavior as a function of n and other parameters. Here, we present the numerical assessments based on the VUS for the cases D=3 and 4, respectively. However, as mentioned above, the assessments based on the AUC for the case D=2 are given in Additional file 3: Appendix 5. Henceforth, we will set nk=n for all k=1,…,D, and we will use n instead of <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M75','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M75">View MathML</a> to simplify notations.

When D=3, we consider the following simulation setup: For <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M76','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M76">View MathML</a>, let θ1,j∼U(0.4,0.49), j=1,…,m; for a specified scalar value h, let <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M77','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M77">View MathML</a> be such that their components hi,j∼U(h−0.002,h+0.002), i=1,2; j=1,…,m; and let <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M78','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M78">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M79','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M79">View MathML</a>. First, we generated a <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M80','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M80">View MathML</a> according to the above setup, and then generated the data vector <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M81','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M81">View MathML</a> for each class. 
We then computed <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M82','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M82">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M83','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M83">View MathML</a> following the computational methodology described earlier. For this <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M84','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M84">View MathML</a>, we then drew twenty <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M85','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M85">View MathML</a> data sets and calculated a Monte Carlo estimate, denoted by <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M86','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M86">View MathML</a>. This process was repeated 20 times and an average value of <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M87','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M87">View MathML</a> was computed. These are given in Table 1. It is evident from Table 1 that the <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M88','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M88">View MathML</a> is negligible in most cases, which validates the use of our approximation for VUS(n). Table 1 also gives similar results for the case D=4. Note that VUS()=1/D! for a random classifier, which is the lower bound of VUS() for any classifier.

Table 1. Performance of optimal and linear classifiers

Next, we determine the smallest n such that <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M100','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M100">View MathML</a>, for a pre-specified γ value. We use the following algorithm to determine such an n: (i) Select a small nS and a large nL such that f(nS)>0 and f(nL)<0, and set nM=[(nS+nL)/2]; (ii) If f(nM)f(nS)<0, then reset nL=nM; otherwise, reset nS=nM. In either case, return to step (i) and recompute nM, unless nL−nS≤1, in which case the smallest sample size is n=nL; (iii) Use the smallest (total) sample of size D×nL, with n=nL from each class, C1,…,CD. We implemented this algorithm for each value of h, m and significance level α for the Wald test; see discussion below (5). For the cases D=3 and D=4, respectively, Table 2 displays the determined sample sizes for γ=0.01 and each combination of parameter values. From Table 2, it is evident that the required sample size decreases as h increases, as expected; that is, f(n)<0 is attained at smaller sample sizes. However, the effect of m on the determined sample sizes is less clear. When h is large, say h≥0.1, the required sample size decreases as m becomes large, whereas when h is small, say h=0.05, the reverse is true.
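The bisection search in steps (i) to (iii) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: f is assumed to be non-increasing in n, the function names are hypothetical, and toy_gap is a made-up stand-in for f(n), not the VUS-based criterion of the paper. The update tests the sign of f(nM) directly, which coincides with the f(nM)f(nS)<0 rule whenever f(nS)>0 and also behaves sensibly if f happens to equal zero at some n.

```python
from typing import Callable


def smallest_sample_size(f: Callable[[int], float], n_small: int, n_large: int) -> int:
    """Return the smallest integer n in (n_small, n_large] with f(n) < 0.

    Assumes f is non-increasing in n, with f(n_small) > 0 and f(n_large) < 0.
    """
    if not (f(n_small) > 0 and f(n_large) < 0):
        raise ValueError("need f(n_small) > 0 and f(n_large) < 0 to bracket the answer")
    while n_large - n_small > 1:
        n_mid = (n_small + n_large) // 2  # integer part of the midpoint
        if f(n_mid) < 0:
            n_large = n_mid  # criterion already met at n_mid: move the upper end down
        else:
            n_small = n_mid  # criterion not met at n_mid: move the lower end up
    return n_large


def toy_gap(n: int, gamma: float = 0.03) -> float:
    """Hypothetical stand-in for f(n): a gap decaying like 4/n, minus the threshold."""
    return 4.0 / n - gamma


n_per_class = smallest_sample_size(toy_gap, 1, 10_000)
total = 3 * n_per_class  # D = 3 classes, with n observations from each class
print(n_per_class, total)
```

Each iteration halves the bracket [nS, nL], so the number of evaluations of f grows only logarithmically with the width of the initial bracket.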

Table 2. Sample size determination: here, D = 3 and 4, and n is the sample size for each class satisfying: <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M101','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M101">View MathML</a>

Application to the HapMap data

The aim of the International HapMap Project is to develop a haplotype map of the human genome, which will describe the common patterns of human DNA sequence variation.

The HapMap data (Phase III) consists of eleven populations with about p=1.2×10^6 SNPs. Here, we consider the following nine populations in order to illustrate our sample size determination algorithm: ASW: African ancestry in Southwest USA with 87 subjects; CEU: Utah residents with Northern and Western European ancestry from the CEPH collection with 167 subjects; CHB: Han Chinese individuals from Beijing with 137 subjects; CHD: Chinese in Metropolitan Denver, Colorado with 109 subjects; GIH: Gujarati Indians in Houston, Texas with 101 subjects; JPT: Japanese individuals from Tokyo with 113 subjects; MEX: Mexican ancestry in Los Angeles, California with 86 subjects; TSI: Toscani in Italia with 102 subjects; and YRI: Yoruba in Ibadan, Nigeria with 203 subjects. With these, we created four sample size determination studies, of which the first three involve three populations (D=3), and the last study involves four populations (D=4). More specifically, we conducted our sample size determination studies with the following population groupings: (I) (CEU, GIH, MEX); (II) (ASW, TSI, YRI); (III) (CHB, JPT, CHD); and (IV) (CHB, JPT, CHD, GIH).

Based on all the available subjects, we extracted pair-wise independent SNPs using the following steps. Suppose L is a set of SNPs, then: (I) form a set S with one SNP from L and update S after the next step; (II) from the remaining SNPs in L, choose one SNP that is independent of every SNP in S using Kendall’s τ coefficient as a test statistic to test pair-wise independence, and then add this new SNP to S. Here, we concluded independence if the Kendall’s τ-value <0.05; (III) Repeat (II) until each remaining SNP in L is correlated with at least one SNP in S. This procedure yielded a set S with m=92 pair-wise independent SNPs, and with these we built our linear classifier.
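The greedy selection in steps (I) to (III) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the toy genotype vectors are hypothetical, Kendall's τ is computed in its tie-corrected τ-b form (an assumption, chosen because coded SNP data contain many ties), and the 0.05 cut-off is applied to the absolute value of τ, which is one literal reading of the rule above; a p-value based test would need a different criterion.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences (O(n^2) pair scan)."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # pairs tied in both variables do not enter tau-b
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + tied_x_only)
                 * (concordant + discordant + tied_y_only))
    return 0.0 if denom == 0 else (concordant - discordant) / denom


def select_independent_snps(snps, threshold=0.05):
    """Greedily keep each SNP whose |tau| with every kept SNP is below threshold."""
    selected = []
    for name, genotypes in snps.items():
        if all(abs(kendall_tau_b(genotypes, snps[kept])) < threshold for kept in selected):
            selected.append(name)
    return selected


# Hypothetical coded genotypes (0, 1, 2) for six subjects at three SNPs.
toy_snps = {
    "snp_a": [0, 1, 2, 0, 1, 2],
    "snp_b": [0, 1, 2, 0, 1, 2],  # identical to snp_a, so it is rejected
    "snp_c": [0, 0, 0, 1, 1, 1],  # tau with snp_a is exactly 0, so it is kept
}
print(select_independent_snps(toy_snps))
```

A single pass suffices because the set S only grows: any SNP rejected at some point remains correlated with at least one member of the final S, which is the stopping condition in step (III). The result depends on the order in which SNPs are visited.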

Next, we set ρ=1 so that m=l=92; see Assumption 3 under the Methods section. Recall that <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M105','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M105">View MathML</a> for k=1,…,D. For the cases D=3 and D=4 considered in studies (I) to (IV) above, we estimated <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M106','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M106">View MathML</a> using the maximum likelihood (ML) estimates obtained based on all the available subjects belonging to the respective populations. We then substituted these ML estimates into the corresponding expressions for <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M107','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M107">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M108','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M108">View MathML</a>, respectively. Figures 1, 2 and 3 show plots of required sample sizes for a continuum of threshold values γ for the case D=3 considered in studies (I) to (III), respectively, and Figure 4 plots the same for D=4 considered in study (IV). From these figures, the required total sample size can be determined approximately for each pre-specified γ value.

Figure 1. Total sample sizes needed for classification into the well-separated HapMap populations CEU, GIH, and MEX. For the linear classifier based on the SNP data from the three populations, the estimated learning curve gives the required total sample size for different values of the threshold, γ, satisfying <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M109','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M109">View MathML</a>. Here, ρ = 1, α = 0.1, m = 92, and <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M110','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M110">View MathML</a>.

Figure 2. Total sample sizes needed for classification into the moderately-separated HapMap populations ASW, TSI, and YRI. For the linear classifier based on the SNP data from the three populations, the estimated learning curve gives the required total sample size for different values of the threshold, γ, satisfying <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M111','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M111">View MathML</a>. Here, ρ = 1, α = 0.1, m = 92, and <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M112','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M112">View MathML</a>.

Figure 3. Total sample sizes needed for classification into the poorly-separated HapMap populations CHB, JPT, and CHD. For the linear classifier based on the SNP data from the three populations, the estimated learning curve gives the required total sample size for different values of the threshold, γ, satisfying <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M113','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M113">View MathML</a>. Here, ρ = 1, α = 0.1, m = 92, and <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M114','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M114">View MathML</a>.

Figure 4. Total sample sizes needed for classification into the mostly poorly-separated HapMap populations CHB, JPT, CHD and GIH. For the linear classifier based on the SNP data from the four populations, the estimated learning curve gives the required total sample size for different values of the threshold, γ, satisfying <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M115','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M115">View MathML</a>. Here, ρ = 1, α = 0.1, m = 92, and <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M116','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M116">View MathML</a>.

For example, if we set γ=0.10 (i.e., <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M117','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M117">View MathML</a>), then in the three-population (CEU, GIH, MEX) case, the <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M118','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M118">View MathML</a> and about 62 observations are required for each class, for a total sample size of 186, whereas in the three-population (ASW, TSI, YRI) case, the <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M119','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M119">View MathML</a> and about 150 observations are required for each class, for a total sample size of 450. Note that, for γ=0.10, in study (I) the required sample size for each population is less than what is currently available, whereas in study (II) we would need 63 and 48 more observations for the populations ASW and TSI, respectively. For the three-population (CHB, JPT, CHD) case, if we set γ=0.10 then the <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M120','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M120">View MathML</a> and about 244 observations are required for each class, for a total sample size of 732. Clearly, for study (III) at least 100 more observations are needed for each population (CHB, JPT and CHD) when γ=0.10.
Finally, for the four-population (CHB, JPT, CHD, GIH) case, setting γ=0.10 yields that the <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M121','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M121">View MathML</a> and about 279 observations are required for each class, for a total sample size of 1,116. Once again, many more observations (roughly 140 to 180 per population, given the subject counts listed above) are needed for each of the four populations when γ=0.10.
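The arithmetic behind these statements can be checked with a short script. It uses only the subject counts and the approximate per-class requirements at γ=0.10 quoted in the text; the variable and function names are hypothetical, and the per-class figures are the rounded values read off the learning curves, so the shortfalls are approximate as well.

```python
# Subjects currently available in each HapMap population (from the text).
available = {
    "ASW": 87, "CEU": 167, "CHB": 137, "CHD": 109, "GIH": 101,
    "JPT": 113, "MEX": 86, "TSI": 102, "YRI": 203,
}

# Study label -> (populations, approximate per-class sample size at gamma = 0.10).
studies = {
    "I": (("CEU", "GIH", "MEX"), 62),
    "II": (("ASW", "TSI", "YRI"), 150),
    "III": (("CHB", "JPT", "CHD"), 244),
    "IV": (("CHB", "JPT", "CHD", "GIH"), 279),
}


def total_sample_size(populations, n_per_class):
    """Total sample size D x n, with n observations from each of the D classes."""
    return len(populations) * n_per_class


def shortfall(populations, n_per_class):
    """Additional subjects needed per population (0 if enough are available)."""
    return {p: max(0, n_per_class - available[p]) for p in populations}


for label, (populations, n_per_class) in studies.items():
    print(label, total_sample_size(populations, n_per_class),
          shortfall(populations, n_per_class))
```

For study (II), for instance, this reproduces the 63 and 48 additional observations needed for ASW and TSI, while YRI already has enough subjects.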

The results from the four HapMap studies suggest that the <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M122','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M122">View MathML</a> value is large and the required total sample size is small when the populations are well-separated [as in study (I)]. When the populations are moderately-separated [as in study (II), where the populations ASW and YRI may be similar], the <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M123','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M123">View MathML</a> value decreases and the required total sample size increases moderately. When the populations are poorly-separated [as in study (III), where all three populations may be similar], the <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M124','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M124">View MathML</a> value decreases even further and there is a substantial increase in the required total sample size. Finally, in the four-population study, where three of the populations are poorly-separated, we once again see a further reduction in the <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M125','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M125">View MathML</a> value and a corresponding increase in the required total sample size. Although not reported here, we also considered other well-, moderately- and poorly-separated cases with the HapMap data and observed results similar to those reported here.

It is well known in the classification literature that the performance of a classifier depends on how well separated the classes are. Similarly, the studies above involving the HapMap data show that the performance of our sample size determination methodology also depends on the extent of separation between populations. While our methodology provides a formal way of determining an approximate total sample size for each specified value of γ, it is clear from the HapMap data analysis that it is not possible to propose a universal γ value. Nevertheless, if the classes are well-separated or moderately-separated, then we believe that γ=0.10 may be a good choice for many frequently encountered data sets in classification problems.

Discussion

We have built an optimal Bayes classifier and a linear classifier based on coded SNP data from two or more classes. For these classifiers, we have considered the two commonly used scalar performance measures, the Area Under the ROC curve (AUC) and the Volume Under the ROC hyper-Surface (VUS), which allow classifiers to be compared independent of discrimination values. We have illustrated the performance of a sample size determination methodology, which selects the smallest total sample size n such that the criterion <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M126','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M126">View MathML</a> is satisfied. While the approximations to the VUS (or AUC) obtained here provide the necessary theoretical justification, the simulations and the HapMap data analysis presented here illustrate the practical value of our sample size determination method.

The fact that the HapMap contains data on multiple populations belonging to similar or dissimilar geographical locations enabled us to test the performance of our sample size determination method on three different multi-class scenarios involving well-separated, moderately-separated, and poorly-separated populations. We have shown that the extent of separation between the populations and the choice of threshold value affect the total sample size required to satisfy the criterion. With regard to the choice of the threshold value γ in other practical contexts, we recommend that the user take into consideration the cost of obtaining more samples and choose an appropriate value of γ that gives an acceptable precision. In other words, if the cost of sampling is affordable then the user may want to sample more to achieve a higher precision (lower γ value) using our classifier; otherwise, the user has to settle for a higher γ value that makes use of all the available samples. We also infer from our HapMap data analysis that a value of <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M127','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M127">View MathML</a> may indicate the extent of separation between the classes. Thus, the value of <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/190/mathml/M128','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/190/mathml/M128">View MathML</a> could also give some prior guidance on the choice of γ values, especially in instances where the cost of sampling is a serious concern.

Conclusion

In summary, for multiple classes, we have developed an asymptotic methodology based on AUC or VUS to estimate the learning curve of SNP classifiers. We have shown that the required total sample size can be obtained from the estimated learning curve for each pre-specified threshold value. In classification problems, sample size determination is important due to cost considerations. This methodology will help scientists determine whether a sample at hand is adequate or whether more observations are necessary to achieve a pre-specified accuracy, and thus help users strike an optimal balance between precision and cost.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

XL developed and implemented the proposed model, performed the simulations and the application, and drafted the manuscript. TNS participated in model development and helped with manuscript preparation. YW participated in the HapMap data analysis. All authors read and approved the final manuscript.

Acknowledgements

We would like to thank the Editor and the two reviewers for their careful reading and insightful suggestions, which greatly improved the content and the presentation of the article. T.N.S. was supported by a grant from the National Security Agency [H98230-11-1-0188] and the National Science Foundation [#1309665].

References

  1. Guzzetta G, Jurman G, Furlanello C: A machine learning pipeline for quantitative phenotype prediction from genotype data.

    BMC Bioinformatics 2010, 11(Suppl 8):S3.

  2. Lee SH, van der Werf JHJ, Hayes BJ, Goddard ME, Visscher PM: Predicting unobserved phenotypes for complex traits from whole-genome SNP data.

    Plos Genet 2008, 4:e1000231.

  3. Nunkesser R, Bernholt T, Schwender H, Ickstadt K, Wegener I: Detecting high-order interactions of single nucleotide polymorphisms using genetic programming.

    Bioinformatics 2007, 23:3280-3288.

  4. Wray NR, Goddard ME, Visscher PM: Prediction of individual genetic risk to disease from genome-wide association studies.

    Genome Res 2007, 17:1520-1528.

  5. Zhou N, Wang L: Effective selection of informative SNPs and classification on the HapMap genotype data.

    BMC Bioinformatics 2007, 8:484-492.

  6. De Valpine P, Bitter HM, Brown MPS, Heller J: A simulation-approximation approach to sample size planning for high-dimensional classification studies.

    Biostatistics 2009, 10:424-435.

  7. Dobbin KK, Simon RM: Sample size determination in microarray experiments for class comparison and prognostic classification.

    Biostatistics 2005, 6:27-38.

  8. Dobbin KK, Simon RM: Sample size planning for developing classifiers using high-dimensional DNA microarray data.

    Biostatistics 2007, 8:101-117.

  9. Dobbin KK, Zhao Y, Simon RM: How large a training set is needed to develop a classifier for microarray data.

    Clin Cancer Res 2008, 14:108-114.

  10. Mukherjee S, Tamayo P, Rogers S, Rifkin R, Engle A, Campbell C, Golub TR, Mesirov JP: Estimating dataset size requirements for classifying DNA microarray data.

    J Comput Biol 2003, 10:119-142.

  11. Liu X, Wang Y, Rekhaya R, Sriram TN: Sample size determination for classifiers based on single-nucleotide polymorphisms.

    Biostatistics 2012, 13:217-227.

  12. Metz C: Basic principles of ROC analysis.

    Seminars Nucl Med 1978, 3:283-298.

  13. Fawcett T: An introduction to ROC analysis.

    Pattern Recogn Lett 2005, 27:861-874.

  14. Landgrebe T, Duin RPW: Approximating the multiclass ROC by pairwise analysis.

    Pattern Recogn Lett 2007, 28:1747-1758.

  15. Landgrebe T, Paclik P: The ROC skeleton for multiclass ROC estimation.

    Pattern Recogn Lett 2010, 31:949-958.