<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
	<ui>1471-2164-9-S1-S6</ui>
	<ji>1471-2164</ji>
	<fm>
		<dochead>Research</dochead>
		<bibl>
			<title>
				<p>Supervised learning-based tagSNP selection for genome-wide disease classifications</p>
			</title>
			<aug>
				<au id="A1">
					<snm>Liu</snm>
					<fnm>Qingzhong</fnm>
					<insr iid="I1"/>
					<insr iid="I2"/>
					<email>liu@cs.nmt.edu</email>
				</au>
				<au id="A2">
					<snm>Yang</snm>
					<fnm>Jack</fnm>
					<insr iid="I3"/>
					<email>jyang@bwh.harvard.edu</email>
				</au>
				<au id="A3">
					<snm>Chen</snm>
					<fnm>Zhongxue</fnm>
					<insr iid="I4"/>
					<email>zhongxuechen@gmail.com</email>
				</au>
				<au id="A4">
					<snm>Yang</snm>
					<fnm>Mary Qu</fnm>
					<insr iid="I5"/>
					<insr iid="I6"/>
					<email>yangma@mail.nih.gov</email>
				</au>
				<au id="A5" ca="yes">
					<snm>Sung</snm>
					<mi>H</mi>
					<fnm>Andrew</fnm>
					<insr iid="I1"/>
					<insr iid="I2"/>
					<email>sung@cs.nmt.edu</email>
				</au>
				<au id="A6" ca="yes">
					<snm>Huang</snm>
					<fnm>Xudong</fnm>
					<insr iid="I3"/>
					<email>xhuang3@partners.org</email>
				</au>
			</aug>
			<insg>
				<ins id="I1">
					<p>Department of Computer Science, New Mexico Institute of Mining and Technology, Socorro, NM 87801, USA</p>
				</ins>
				<ins id="I2">
					<p>Institute for Complex Additive Systems Analysis, New Mexico Institute of Mining and Technology, Socorro, NM 87801, USA</p>
				</ins>
				<ins id="I3">
					<p>Department of Radiology, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02120, USA</p>
				</ins>
				<ins id="I4">
					<p>The Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA</p>
				</ins>
				<ins id="I5">
					<p>National Human Genome Research Institute, National Institutes of Health (NIH), U.S. Department of Health and Human Services, USA</p>
				</ins>
				<ins id="I6">
					<p>Oak Ridge Institute for Science and Education, Oak Ridge National Laboratory, U.S. Department of Energy, USA</p>
				</ins>
			</insg>
			<source>BMC Genomics</source>
			<supplement>
				<title>
					<p>The 2007 International Conference on Bioinformatics &amp; Computational Biology (BIOCOMP'07)</p>
				</title>
				<editor>Jack Y Yang, Mary Qu Yang, Mengxia (Michelle) Zhu, Youping Deng and Hamid R Arabnia</editor>
				<note>Research</note>
			</supplement>
			<conference>
				<title>
					<p>The 2007 International Conference on Bioinformatics &amp; Computational Biology (BIOCOMP'07)</p>
				</title>
				<location>Las Vegas, NV, USA</location>
				<date-range>25-28 June 2007</date-range>
				<url>http://www.world-academy-of-science.org</url>
			</conference>
			<issn>1471-2164</issn>
			<pubdate>2008</pubdate>
			<volume>9</volume>
			<issue>Suppl 1</issue>
			<fpage>S6</fpage>
			<url>http://www.biomedcentral.com/1471-2164/9/S1/S6</url>
			<xrefbib>
				<pubidlist><pubid idtype="pmpid">18366619</pubid><pubid idtype="doi">10.1186/1471-2164-9-S1-S6</pubid>
				</pubidlist></xrefbib>
		</bibl>
		<history>
			<pub>
				<date>
					<day>20</day>
					<month>03</month>
					<year>2008</year>
				</date>
			</pub>
		</history>
		<cpyrt>
			<year>2008</year>
			<collab>Liu et al.; licensee BioMed Central Ltd.</collab>
			<note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
		</cpyrt>
		<abs>
			<sec>
				<st>
					<p>Abstract</p>
				</st>
				<sec>
					<st>
						<p>Background</p>
					</st>
					<p>Comprehensive evaluation of common genetic variations through association of single nucleotide polymorphisms (SNPs) with complex human diseases on the genome-wide scale is an active area in human genome research. One of the fundamental questions in a SNP-disease association study is how to find an optimal subset of SNPs with predictive power for disease status. To find that subset while reducing the time and cost of a study, one can potentially reconcile information redundancy from associations between SNP markers.</p>
				</sec>
				<sec>
					<st>
						<p>Results</p>
					</st>
					<p>We have developed a feature selection method named Supervised Recursive Feature Addition (SRFA). This method combines supervised learning and statistical measures for the chosen candidate features/SNPs to reconcile the redundancy information and, in doing so, improve the classification performance in association studies. Additionally, we have proposed a Support Vector based Recursive Feature Addition (SVRFA) scheme in SNP-disease association analysis.</p>
				</sec>
				<sec>
					<st>
						<p>Conclusions</p>
					</st>
					<p>We have proposed SRFA with different statistical learning classifiers and SVRFA for both SNP selection and disease classification, and applied them to two complex disease data sets. In general, our approaches outperform the well-known feature selection method of Support Vector Machine Recursive Feature Elimination and logic regression-based SNP selection for disease classification in genetic association studies. Our study further indicates that both genetic and environmental variables should be taken into account when doing disease predictions and classifications for the most complex human diseases that have gene-environment interactions.</p>
				</sec>
			</sec>
		</abs>
	</fm>
	<bdy>
		<sec>
			<st>
				<p>Background</p>
			</st>
			<p>Correlating DNA sequence variations with phenotypic differences has challenged the biomedical research community for decades. Substantial efforts have been made to identify all common genetic variations in humans, including single nucleotide polymorphisms (SNPs), deletions and insertions <abbrgrp>
					<abbr bid="B1">1</abbr>
				</abbrgrp>. The International HapMap Project has collected genotypes of millions of SNPs from populations with ancestry from Africa, Asia and Europe and made this information freely available in the public domain <abbrgrp>
					<abbr bid="B2">2</abbr>
					<abbr bid="B3">3</abbr>
					<abbr bid="B4">4</abbr>
				</abbrgrp>. Millions of SNPs have been identified so far, yet how best to use this information is not always clear. Because each SNP has relatively low power amid the huge total number of SNPs, most researchers are unable to perform a whole genome-wide association study directly based on the genotypes or allele frequencies of individual markers. Nonetheless, a great need exists to develop, both conceptually and computationally, robust algorithms and advanced analytical methods for characterizing genetic variations that are non-redundant. Through this characterization, one can then identify the target SNPs that are most likely to affect the phenotypes and ultimately contribute to disease pathogenesis.</p>
			<p>To date, the search for an optimal set of SNPs has not been efficient. To address this, we propose reconciling information redundancy from associations between SNP markers. This method not only identifies an approximately optimal set of SNPs but also potentially reduces the burden of genetic association studies, such as time and cost <abbrgrp>
					<abbr bid="B5">5</abbr>
				</abbrgrp>.</p>
			<p>One primary cause for the lack of success in searching for optimal sets of SNPs is that the high dimensionality and highly correlated features of SNPs hinder the power to identify small to moderate genetic effects linked to complex diseases. The need to incorporate covariates of other environmental risk factors as effect modifiers or confounders further worsens the &#8220;curse of dimensionality&#8221; problem in mapping genes associated with complex diseases <abbrgrp>
					<abbr bid="B6">6</abbr>
				</abbrgrp>.</p>
			<p>How do we evaluate the search for optimal SNPs? Before searching, one must determine how many SNPs are needed to provide enough predictive power for disease status. This is not a new question; it arises from recent statistical and computational efforts that focus on feature selection from massive, high-dimensional genomic data. Specifically, in genome-wide disease association studies, various models and algorithms have been proposed for selecting an optimal subset of SNPs <abbrgrp>
					<abbr bid="B7">7</abbr>
					<abbr bid="B8">8</abbr>
					<abbr bid="B9">9</abbr>
					<abbr bid="B10">10</abbr>
					<abbr bid="B11">11</abbr>
					<abbr bid="B12">12</abbr>
					<abbr bid="B13">13</abbr>
				</abbrgrp>. Linkage Disequilibrium-based methods for selecting a maximally informative set of SNPs for disease association analyses were developed first <abbrgrp>
					<abbr bid="B14">14</abbr>
					<abbr bid="B15">15</abbr>
					<abbr bid="B16">16</abbr>
					<abbr bid="B17">17</abbr>
					<abbr bid="B18">18</abbr>
				</abbrgrp>. Zhang and Jin <abbrgrp>
					<abbr bid="B19">19</abbr>
				</abbrgrp> introduced a tagSNPs criterion based on pair-wise Linkage Disequilibrium (LD) and haplotype <it>r</it>
				<sup>2</sup> measure for case-control association studies. Anderson and Novembre <abbrgrp>
					<abbr bid="B20">20</abbr>
				</abbrgrp> and Mannila <it>et al</it>. <abbrgrp>
					<abbr bid="B21">21</abbr>
				</abbrgrp> proposed finding haplotype block boundaries using minimum description length. The method presented by Beckmann <it>et al</it>. <abbrgrp>
					<abbr bid="B22">22</abbr>
				</abbrgrp> showcases the flexibility of Mantel statistics using haplotype sharing. This method was employed to correlate temporal and spatial distributions of cancer in a generalized regression approach for SNP selections and disease gene mapping. He and Zelikovsky <abbrgrp>
					<abbr bid="B23">23</abbr>
				</abbrgrp> proposed tagSNPs for unphased genotypes based on multiple linear regressions. Other test statistic approaches such as scan statistics by Levin <it>et al</it>. <abbrgrp>
					<abbr bid="B24">24</abbr>
				</abbrgrp>, score statistics by Schaid <it>et al</it>. <abbrgrp>
					<abbr bid="B25">25</abbr>
				</abbrgrp>, and weighted-average statistics <abbrgrp>
					<abbr bid="B26">26</abbr>
				</abbrgrp> were proposed for disease gene mapping in case-control studies and for SNP selection in genetic association studies. By using spliced expressed sequence tags, Yang <it>et al</it>. investigated the connection between &#8220;bidirectional gene pairs&#8221; and cancer <abbrgrp>
					<abbr bid="B35">35</abbr>
				</abbrgrp>.</p>
			<p>Recently, Schwender and Ickstadt <abbrgrp>
					<abbr bid="B27">27</abbr>
				</abbrgrp> demonstrated logic regression <abbrgrp>
					<abbr bid="B28">28</abbr>
				</abbrgrp> based identification of SNP interactions for the disease status in a case-control study and proposed two measures for quantifying the importance of feature interactions and classifications. In comparison with some well-known classification methods such as CART <abbrgrp>
					<abbr bid="B29">29</abbr>
				</abbrgrp>, Random Forests <abbrgrp>
					<abbr bid="B30">30</abbr>
				</abbrgrp> and other regression procedures <abbrgrp>
					<abbr bid="B17">17</abbr>
				</abbrgrp>, logic regression has been claimed to perform better when applied to SNP data <abbrgrp>
					<abbr bid="B27">27</abbr>
				</abbrgrp>.</p>
			<p>In this paper, we developed a feature selection method named Supervised Recursive Feature Addition (SRFA). It not only allows for the selection of genomic information but also helps to identify the optimal subset of SNPs necessary for finding the variations associated with disease. This method combines supervised learning and statistical measures for the chosen candidate SNPs and/or environmental variables to reconcile redundancy information and thereby improve classification and prediction performance. We implemented SRFA with different statistical learning classifiers for both SNP selection and disease classification and then compared their performances with popular classification models, such as logic regression and Support Vector Machine Recursive Feature Elimination (SVMRFE). Additionally, we proposed a Support Vector based Recursive Feature Addition (SVRFA) scheme for SNP-disease association analysis. To evaluate and demonstrate the proposed methods, we applied them to two complex SNP-disease data sets, the Myocardial Infarction Case &amp; Control (MICC) data set and a subset of The North American Rheumatoid Arthritis Consortium (NARAC) data, for both SNP selection and disease classification.</p>
		</sec>
		<sec>
			<st>
				<p>Results</p>
			</st>
			<p>Fig. <figr fid="F1">1</figr> displays the testing accuracies of NBC, NMSC, SVM, and UDC in the analysis of the MICC data set. The legend marks the different feature selection methods. Fig. <figr fid="F1">1</figr> shows that, with the use of the four learning classifiers, both SRFA and SVRFA (including MSW-MSC, MMW-MSC, NBC-MSC, NMSC-MSC, and DENFIS-MSC) outperform the well-known feature selection method SVMRFE. The SRFA methods NBC-MSC and NMSC-MSC are better than the others, including SVRFA. The advantage of SRFA is especially noticeable at low feature dimensions. Regarding the classification performances of different learning classifiers, on average, NBC, NMSC, and SVM performed better than UDC.</p>
			<fig id="F1">
				<title>
					<p>Figure 1</p>
				</title>
				<caption>
					<p>Testing accuracies of NBC, NMSC, SVM, and UDC for the MICC data set</p>
				</caption>
				<text>
					<p>Testing accuracies of NBC, NMSC, SVM, and UDC for the MICC data set. The legend marks the different feature selection methods.</p>
				</text>
				<graphic file="1471-2164-9-S1-S6-1"/>
			</fig>
			<p>Fig. <figr fid="F2">2</figr> shows the average testing accuracies on the NARAC CHR18SNP case/control data from feature dimension 1 to 200, obtained by applying the learning classifiers NBC and NMSC to the following feature selection methods: MSW-MSC, MMW-MSC, NBC-MSC, NMSC-MSC, SVMRFE, TTEST, and nonparametric RANKSUM. Regarding the testing accuracy, SRFA, SVRFA, and SVMRFE outperform TTEST and RANKSUM. In addition to feature selection, the learning classifier is important to the testing performance. MSC combined with RFA helps to improve the classification accuracy. The best testing accuracy is obtained by applying NMSC to the SRFA feature selection, NMSC-MSC. In our view, the weakness of TTEST and RANKSUM is that their selection ignores the redundancy and interaction among the SNPs. By contrast, the other approaches may detect epistatic effects (gene-gene interactions). Detecting higher-order epistatic effects would require even more complex models.</p>
			<fig id="F2">
				<title>
					<p>Figure 2</p>
				</title>
				<caption>
					<p>Testing accuracies of NBC and NMSC for the NARAC CHR18SNP data set</p>
				</caption>
				<text>
					<p>Testing accuracies of NBC and NMSC for the NARAC CHR18SNP data set. The legend marks the different feature selection methods.</p>
				</text>
				<graphic file="1471-2164-9-S1-S6-2"/>
			</fig>
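			<p>The point that per-feature tests such as TTEST and RANKSUM ignore redundancy among SNPs can be illustrated with a minimal Python sketch. It is not the SRFA or SVRFA procedure: it contrasts a plain univariate ranking with a simple greedy forward addition, and everything in it (the toy data, the majority-vote lookup classifier, the use of resubstitution accuracy, and the function names) is an assumption made for illustration only. In the toy data, the second feature is an exact copy of the first, and the third carries complementary information.</p>

```python
from collections import Counter, defaultdict

# Hypothetical toy data: columns are (a, a_copy, b); a_copy duplicates a.
rows = [(0, 0, 0)] * 4 + [(0, 0, 1)] * 1 + [(1, 1, 0)] * 2 + [(1, 1, 1)] * 3
labels = [0] * 4 + [1] * 1 + [1] * 2 + [1] * 3


def accuracy(rows, labels, subset):
    """Resubstitution accuracy of a majority-vote lookup table on `subset`."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        key = tuple(row[j] for j in subset)
        groups[key][label] += 1
    # Each distinct value pattern predicts its most frequent label.
    correct = sum(max(counts.values()) for counts in groups.values())
    return correct / len(labels)


def univariate_top_k(rows, labels, k):
    """Rank features one at a time and keep the k best (ignores redundancy)."""
    n_features = len(rows[0])
    ranked = sorted(range(n_features), key=lambda j: -accuracy(rows, labels, [j]))
    return ranked[:k]


def forward_addition(rows, labels, k):
    """Greedily add the feature whose inclusion gives the highest joint accuracy."""
    n_features = len(rows[0])
    chosen = []
    while len(chosen) != k:
        remaining = [j for j in range(n_features) if j not in chosen]
        best = max(remaining, key=lambda j: accuracy(rows, labels, chosen + [j]))
        chosen.append(best)
    return chosen


top2 = univariate_top_k(rows, labels, 2)
greedy2 = forward_addition(rows, labels, 2)
print(top2, accuracy(rows, labels, top2))
print(greedy2, accuracy(rows, labels, greedy2))
```

			<p>Under these assumptions, the univariate ranking keeps the two duplicated features (columns 0 and 1) and stays at 90% accuracy on the toy data, whereas forward addition keeps columns 0 and 2 and classifies all ten toy samples correctly. Real SNP data would call for held-out testing rather than resubstitution accuracy.</p>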
			<p>Overall, regarding the testing accuracies, the feature selection method NMSC-MSC performed the best, followed by NBC-MSC, MMW-MSC, MSW-MSC and SVMRFE; TTEST and RANKSUM performed the worst. Comparing NBC to NMSC, the performance of NMSC is, on average, superior to that of NBC. Figs. <figr fid="F1">1</figr> and <figr fid="F2">2</figr> also show that classification techniques are strictly paired with feature selection methods. The performance of NMSC-MSC was not improved by the use of NBC, but with the use of NMSC, the feature selection method NMSC-MSC performed the best.</p>
			<p>Tables <tblr tid="T1">1</tblr> and <tblr tid="T2">2</tblr> list the testing accuracies and the standard errors associated with the highest training accuracies for given classifiers (NMSC, NBC, SVM, UDC) under different feature selections (two SVRFA: MSW-MSC, MMW-MSC; three SRFA: NBC-MSC, NMSC-MSC, DENFIS-MSC; three popular approaches: SVMRFE, Logistic-Wald-t, LOGICFS) for the MICC data set and NARAC CHR18SNP, respectively. In Table <tblr tid="T1">1</tblr>, the testing accuracies of LOGICFS were obtained from the 31 SNPs in the MICC data set without environmental variables. Although the MICC data set integrates SNPs with environmental variables, the number of features is limited, and the differences in testing accuracy among the feature selection methods were not noticeable; one SVRFA method (MMW-MSC) achieved the best result with the use of NMSC. Table <tblr tid="T2">2</tblr> shows that the supervised learning-based feature selection NMSC-MSC with the use of NMSC outperforms other combinations, followed by NBC-MSC with the use of NMSC. Support vector based feature selections are superior to LOGICFS, and LOGICFS is better than parametric and non-parametric based feature selections. Regarding support vector based feature selection, on average, MMW-MSC outperformed MSW-MSC and SVMRFE.</p>
			<tbl id="T1" hint_layout="single">
				<title>
					<p>Table 1</p>
				</title>
				<caption>
					<p>Testing accuracies associated with the highest training accuracies under different feature selections for the MICC data set.</p>
				</caption>
				<tblbdy cols="5">
					<r>
						<c rspan="2">
							<p>Feature Selection</p>
						</c>
						<c cspan="4">
							<p>Testing accuracy (mean value &#177; standard deviation, %)</p>
						</c>
					</r>
					<r>
						<c>
							<p>NMSC</p>
						</c>
						<c>
							<p>NBC</p>
						</c>
						<c>
							<p>SVM</p>
						</c>
						<c>
							<p>UDC</p>
						</c>
					</r>
					<r>
						<c cspan="5">
							<hr/>
						</c>
					</r>
					<r>
						<c>
							<p>MSW-MSC</p>
						</c>
						<c>
							<p>76.0 &#177; 3.4</p>
						</c>
						<c>
							<p>75.1 &#177; 3.0</p>
						</c>
						<c>
							<p>73.1 &#177; 4.5</p>
						</c>
						<c>
							<p>73.6 &#177; 2.9</p>
						</c>
					</r>
					<r>
						<c>
							<p>
								<b>MMW-MSC</b>
							</p>
						</c>
						<c>
							<p>
								<b>77.4 &#177; 2.9</b>
							</p>
						</c>
						<c>
							<p>75.9 &#177; 3.0</p>
						</c>
						<c>
							<p>74.4 &#177; 2.3</p>
						</c>
						<c>
							<p>74.8 &#177; 4.6</p>
						</c>
					</r>
					<r>
						<c>
							<p>NBC-MSC</p>
						</c>
						<c>
							<p>75.1 &#177; 3.1</p>
						</c>
						<c>
							<p>73.2 &#177; 2.4</p>
						</c>
						<c>
							<p>74.2 &#177; 4.1</p>
						</c>
						<c>
							<p>75.2 &#177; 2.6</p>
						</c>
					</r>
					<r>
						<c>
							<p>NMSC-MSC</p>
						</c>
						<c>
							<p>75.0 &#177; 4.5</p>
						</c>
						<c>
							<p>75.0 &#177; 2.9</p>
						</c>
						<c>
							<p>74.0 &#177; 3.7</p>
						</c>
						<c>
							<p>72.7 &#177; 3.9</p>
						</c>
					</r>
					<r>
						<c>
							<p>
								<b>DENFIS-MSC</b>
							</p>
						</c>
						<c>
							<p>76.9 &#177; 3.2</p>
						</c>
						<c>
							<p>74.2 &#177; 3.4</p>
						</c>
						<c>
							<p>
								<b>74.9 &#177; 4.4</b>
							</p>
						</c>
						<c>
							<p>75.6 &#177; 2.8</p>
						</c>
					</r>
					<r>
						<c>
							<p>SVMRFE</p>
						</c>
						<c>
							<p>77.0 &#177; 4.2</p>
						</c>
						<c>
							<p>73.9 &#177; 2.7</p>
						</c>
						<c>
							<p>73.1 &#177; 4.0</p>
						</c>
						<c>
							<p>74.4 &#177; 3.2</p>
						</c>
					</r>
					<r>
						<c>
							<p>T-TEST</p>
						</c>
						<c>
							<p>75.6 &#177; 2.6</p>
						</c>
						<c>
							<p>
								<b>76.4 &#177; 3.0</b>
							</p>
						</c>
						<c>
							<p>74.5 &#177; 3.1</p>
						</c>
						<c>
							<p>75.9 &#177; 3.6</p>
						</c>
					</r>
					<r>
						<c>
							<p>LOGICFS</p>
						</c>
						<c ca="center" cspan="4">
							<p>54.4 &#177; 1.5</p>
						</c>
					</r>
				</tblbdy>
			</tbl>
			<tbl id="T2" hint_layout="single">
				<title>
					<p>Table 2</p>
				</title>
				<caption>
					<p>Testing accuracies associated with the highest training accuracies under different feature selections for the NARAC CHR18SNP data set.</p>
				</caption>
				<tblbdy cols="3">
					<r>
						<c rspan="2">
							<p>Feature Selection</p>
						</c>
						<c cspan="2">
							<p>Testing accuracy (mean value &#177; standard deviation, %)</p>
						</c>
					</r>
					<r>
						<c>
							<p>
								<b>NMSC</b>
							</p>
						</c>
						<c>
							<p>NBC</p>
						</c>
					</r>
					<r>
						<c cspan="3">
							<hr/>
						</c>
					</r>
					<r>
						<c>
							<p>MSW-MSC</p>
						</c>
						<c>
							<p>71.3 &#177; 0.7</p>
						</c>
						<c>
							<p>68.5 &#177; 0.7</p>
						</c>
					</r>
					<r>
						<c>
							<p>
								<b>MMW-MSC</b>
							</p>
						</c>
						<c>
							<p>71.4 &#177; 0.4</p>
						</c>
						<c>
							<p>
								<b>69.3 &#177; 0.3</b>
							</p>
						</c>
					</r>
					<r>
						<c>
							<p>NBC-MSC</p>
						</c>
						<c>
							<p>74.3 &#177; 0.6</p>
						</c>
						<c>
							<p>68.3 &#177; 0.7</p>
						</c>
					</r>
					<r>
						<c>
							<p>
								<b>NMSC-MSC</b>
							</p>
						</c>
						<c>
							<p>
								<b>77.7 &#177; 0.7</b>
							</p>
						</c>
						<c>
							<p>67.7 &#177; 0.3</p>
						</c>
					</r>
					<r>
						<c>
							<p>SVMRFE</p>
						</c>
						<c>
							<p>67.8 &#177; 0.8</p>
						</c>
						<c>
							<p>68.3 &#177; 0.8</p>
						</c>
					</r>
					<r>
						<c>
							<p>T-TEST</p>
						</c>
						<c>
							<p>65.4 &#177; 0.5</p>
						</c>
						<c>
							<p>66.1 &#177; 0.8</p>
						</c>
					</r>
					<r>
						<c>
							<p>LOGICFS</p>
						</c>
						<c ca="center" cspan="2">
							<p>67.1 &#177; 2.1</p>
						</c>
					</r>
				</tblbdy>
			</tbl>
		</sec>
		<sec>
			<st>
				<p>Discussion</p>
			</st>
			<p>Since it is still too expensive to genotype all available SNPs across the human genome, we need advanced approaches to identify the smallest set of SNPs with the highest prediction accuracy for complex diseases. Our method of exploiting information redundancy from associations among SNP markers provides an efficient and relatively inexpensive way of searching for the optimal or approximately optimal subset of SNPs in genetic association studies. In this paper we specifically propose supervised learning-based strategies, SRFA and SVRFA, to reconcile the redundancy in the highly correlated SNP data and identify the subset of SNPs that enables the most efficient classification of individuals at risk for disease. We evaluated SRFA and SVRFA against some popular methods for SNP-disease association studies and demonstrated the improvement made by our proposed methods.</p>
			<p>Compared with the well-known feature selection methods SVMRFE and LOGICFS, our methods achieved a higher testing accuracy. When SRFA is associated with two learning classifiers, we have two feature selection methods, NMSC-MSC and NBC-MSC. On average, NMSC-MSC performed better. Among the support vector based feature selection methods, MSW-MSC, MMW-MSC, and SVMRFE, MMW-MSC is in general the best performer. Comparing SRFA with SVRFA, SRFA performed better. Our study shows that supervised-learning based MSC feature selection not only reduces the redundancy, but also improves the classification accuracy.</p>
			<p>An important factor in the evaluation of testing accuracy that deserves further comment is the training model. In our experiments, training with DENFIS and other neural network classifiers always achieved high training accuracy, but the testing accuracy was comparatively poor and over-fitting often occurred. Since complex evolutionary learning and classification models, such as DENFIS, almost always require large sample sizes to be effective, the over-fitting problem is probably related to the relatively small sample sizes. As the complexity of the model increases to achieve higher training accuracy, the requirement for more training samples also increases. If the sample size is not large enough, the relations and model mined from the training samples are not suitable for testing and, as a result, over-fitting happens. This is the reason that complex models fit training samples very well but not necessarily testing samples.</p>
			<p>Another point worth mentioning is that the learning classifier and feature selection are strictly paired. For instance, NMSC-MSC with the use of NMSC performed the best in the experiments on NARAC CHR18SNP, but NMSC-MSC with the use of NBC was not as good.</p>
			<p>The issue of environmental variables also requires discussion. For the MICC data set, the inclusion of environmental variables greatly improved prediction and classification performances. Without the environmental variables, LOGICFS only achieved a 54.4 &#177; 1.5% correct classification rate. Also, SRFA provided a low (&lt;60%) correct classification rate on the testing data when only using the SNPs, but a higher (&gt;73%) correct classification rate after including the environmental variables as well. These results are consistent with the view that complex diseases usually involve both genetic factors and environmental cues. Therefore, both genetic and environmental variables should be taken into account when doing disease predictions and classifications for the most complex human diseases that have gene-environment interactions.</p>
			<p>In our experiments, when SVM was applied to the feature sets extracted from the NARAC CHR18SNP genotype data, the classification performance was poor. However, SVM worked well on the feature sets extracted from the MICC data. In our view, the difference might have two causes. First, NARAC CHR18SNP consists of categorical SNP data only, while the MICC data set includes many environmental variables, most of which follow continuous distributions and have a major impact on the classification. Second, the parameters of the SVM may not have been optimized adequately when testing NARAC CHR18SNP.</p>
			<p>Our study shows that, with the use of our methods, even a small number of SNPs and/or environmental variables can provide good predictive capacity. In the analyses of the MICC data, our methods achieved up to 75% classification accuracy with 3-5 variables (Fig. <figr fid="F1">1</figr>). In contrast, SVMRFE needed 20-30 variables to achieve similar accuracy. In analyses of the NARAC CHR18SNP data set, the advantage of our method is also noticeable (Fig. <figr fid="F2">2</figr>). Experimental results imply that the classification accuracy can be improved and the cost of genotyping can be reduced with the use of our algorithms.</p>
		</sec>
		<sec>
			<st>
				<p>Conclusions</p>
			</st>
			<p>We proposed SRFA with different statistical learning classifiers and SVRFA for both SNP selection and disease classification, and then applied them to two complex disease data sets. In general, our approaches outperform the well-known feature selection method of Support Vector Machine Recursive Feature Elimination and logic regression-based SNP selection for disease classification in genetic association studies. Our study further indicates that both genetic and environmental variables should be taken into account when doing disease predictions and classifications for the most complex human diseases that have gene-environment interactions.</p>
		</sec>
		<sec>
			<st>
				<p>Materials and methods</p>
			</st>
			<sec>
				<st>
					<p>Materials</p>
				</st>
				<p>
					<it>Application 1</it>: Genes and the environment link two important health conditions: Periodontal Disease (PD) and Cardiovascular Disease (CVD). Cardiovascular disease is the number one cause of death and disability in the Western world. Almost 1 million Americans die of CVD each year, which accounts for 42% of all US deaths. Numerous clinical and epidemiological studies have shown a consistent association between PD and CVD <abbrgrp>
						<abbr bid="B36">36</abbr>
					</abbrgrp>, and the link between these two diseases may be the result of common environmental exposures and potential genes that may regulate the individual response to these exposures. The identification of SNPs that influence the risk of diseases through interactions with other SNPs and environmental factors remains a statistical and computational challenge.</p>
				<p>The Myocardial Infarction Case &amp; Control (MICC) data set studied here is the result of a population-based study. The sample included residents of Erie and Niagara counties in New York State, all aged 35 to 69 years. There were 614 white male patients with Myocardial Infarction (MI) matched with 614 control males (without CVD) by age (&#177; 5 years) and smoking habits, and 206 white pre- and post-menopausal females with MI matched with 412 control females (without CVD) by age (&#177; 5 years), menopausal status, years since menopause (&#177; 2 years), and smoking habits. Diabetics were excluded. The features in the data set included 29 environmental variables, such as two protein variables (ACHMN and CALMEA), which were known to be related to periodontal disease, and smoking status, menopausal status, blood pressure, blood cholesterol, body mass index, drinking status, <it>etc</it>. Selection of genetic variables was based on the well-known Seattle web site (<url>http://pga.mbt.washington.edu/</url>) by using the candidate approach that included 31 SNPs. This study evaluates the SNP-environment and variable-disease associations, especially the effects of SNPs and environmental variables on disease. The original MICC data set contained some missing values. In our experiments, we filtered out the observations with missing values.</p>
				<p>
					<it>Application 2:</it> Rheumatoid arthritis (RA) is an autoimmune disease that causes chronic inflammation of joints, tissues around joints, or other organs in the body. RA affects more than two million people in the United States. Women account for 70% of patients with RA. While women are two to three times more likely to get RA, men who have RA tend to have more severe symptoms. It afflicts people of all races equally. Onset usually occurs between 30 and 50 years of age. Data for this analysis were provided as part of Genetic Analysis Workshop (GAW) 15. The North American Rheumatoid Arthritis Consortium (NARAC), led by Peter Gregersen, has provided microsatellite and SNP scans, quantitative phenotypes, and clinical measures, with additional genotype data provided by Robert Plenge and Ann Begovich. We studied the SNP case-control data named &#8220;CHR18SNP.dat&#8221; offered by NARAC. In the data file, a dense panel of 2300 SNPs was genotyped by Illumina for an approximately 10 kb region of chromosome 18q. These markers were individually genotyped on 460 cases and 460 controls. Controls were recruited from a New York City population. The objective of this study is to identify SNPs of chromosome 18 that are significantly associated with rheumatoid arthritis. The significant SNPs identified here could be used as a starting point for biologists developing genetic tests that indicate increased risk of developing rheumatoid arthritis.</p>
			</sec>
			<sec>
				<st>
					<p>Methods</p>
				</st>
				<sec>
					<st>
						<p>Supervised recursive feature addition (SRFA) algorithm for SNP selection</p>
					</st>
					<p>SRFA combines supervised learning with a statistical similarity measure between the already chosen features and the candidate features. It proceeds as follows:</p>
					<p>Step 1: Using a supervised learning classifier, each individual feature is ranked from the highest to the lowest classification accuracy.</p>
					<p>Step 2: The feature with the highest classification accuracy is chosen as the first feature. If several features achieve the same highest classification accuracy, the one with the lowest <it>p</it>-value according to the score test statistic is chosen. At this point the chosen feature set, <it>G<sub>1</sub></it>, consists of the first feature, <it>g<sub>1</sub></it>, and has feature dimension one.</p>
					<p>Step 3: The (<it>N+1</it>)-dimensional feature set, <it>G<sub>N+1</sub></it> = {<it>g<sub>1</sub></it>, <it>g<sub>2</sub></it>, &#8230;, <it>g<sub>N</sub></it>, <it>g<sub>N+1</sub></it>}, is produced by adding <it>g<sub>N+1</sub></it> to the current <it>N</it>-dimensional feature set <it>G<sub>N</sub></it> = {<it>g<sub>1</sub></it>, <it>g<sub>2</sub></it>, &#8230;, <it>g<sub>N</sub></it>}. <it>g<sub>N+1</sub></it> is chosen as follows. Each feature <it>g<sub>i</sub></it> (<it>i &#8800; 1, 2, &#8230;, N</it>) outside of <it>G<sub>N</sub></it> is temporarily added to <it>G<sub>N</sub></it>, and the classification accuracy of the feature set <it>G<sub>N</sub></it> + {<it>g<sub>i</sub></it>} is recorded. Every feature <it>g<sub>c</sub></it> that yields the highest classification accuracy is placed in the set of candidates, <it>C</it>. In general <it>C</it> contains many features, but only one, the feature that is least statistically similar to the already chosen features, is selected as <it>g<sub>N+1</sub></it> to form the next feature set <it>G<sub>N+1</sub></it>. We call this step Candidate Feature Addition. Its goal is to obtain the most informative and least redundant feature set. The statistical similarity measure is based on the Spearman correlation coefficient (for categorical features/SNPs) or the Pearson correlation coefficient (for continuous environmental variables) between each chosen feature <it>g<sub>n</sub></it> (<it>g<sub>n</sub></it> &#8712; <it>G<sub>N</sub></it>, <it>n</it> = 1, 2, &#8230;, <it>N</it>) and each candidate <it>g<sub>c</sub></it> (<it>g<sub>c</sub></it> &#8712; <it>C</it>, <it>c</it> = 1, 2, &#8230;, <it>m</it>, where <it>m</it> is the number of elements in <it>C</it>). The sum of the squared correlations (SC) measures the similarity and is defined as follows:</p>
					<p>
						<display-formula>
							<m:math name="1471-2164-9-S1-S6-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:mi>S</m:mi>
										<m:mi>C</m:mi>
										<m:mo stretchy="false">(</m:mo>
										<m:msub>
											<m:mi>g</m:mi>
											<m:mi>c</m:mi>
										</m:msub>
										<m:mo stretchy="false">)</m:mo>
										<m:mtext>&#160;</m:mtext>
										<m:mo>=</m:mo>
										<m:mstyle displaystyle="true">
											<m:munderover>
												<m:mo>&#8721;</m:mo>
												<m:mrow>
													<m:mi>n</m:mi>
													<m:mo>=</m:mo>
													<m:mn>1</m:mn>
												</m:mrow>
												<m:mi>N</m:mi>
											</m:munderover>
											<m:mrow>
												<m:mtext>&#160;</m:mtext>
												<m:mi>c</m:mi>
												<m:mi>o</m:mi>
												<m:msup>
													<m:mi>r</m:mi>
													<m:mn>2</m:mn>
												</m:msup>
												<m:mo stretchy="false">(</m:mo>
												<m:msub>
													<m:mi>g</m:mi>
													<m:mrow>
														<m:mi>c</m:mi>
														<m:mo>,</m:mo>
													</m:mrow>
												</m:msub>
												<m:mtext>&#160;</m:mtext>
												<m:msub>
													<m:mi>g</m:mi>
													<m:mi>n</m:mi>
												</m:msub>
												<m:mo stretchy="false">)</m:mo>
												<m:mo>,</m:mo>
												<m:mtext>&#160;&#160;</m:mtext>
												<m:mi>n</m:mi>
												<m:mtext>&#160;</m:mtext>
												<m:mo>=</m:mo>
												<m:mtext>&#160;</m:mtext>
												<m:mn>1</m:mn>
												<m:mo>,</m:mo>
												<m:mi/>
												<m:mn>2</m:mn>
												<m:mo>&#8230;</m:mo>
												<m:mtext>&#160;</m:mtext>
												<m:mi>N</m:mi>
												<m:mo>.</m:mo>
												<m:mtext>&#160;</m:mtext>
											</m:mrow>
										</m:mstyle>
										<m:mtext>&#160;&#160;&#160;&#160;&#160;&#160;</m:mtext>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXafv3ySLgzGmvETj2BSbqeeuuDJXwAKbsr4rNCHbGeaGqipu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbba9pwe9Q8fs0=yqaqpepae9pg0FirpepeKkFr0xfr=xfr=xb9adbaqaaeGaciGaaiaabeqaaeaabaWaaaGcbaqcLbuacqWGtbWucqWGdbWqcqGGOaakcqWGNbWzkmaaBaaaleaacqWGJbWyaeqaaKqzafGaeiykaKIaeeiiaaIaeyypa0tcfa4aaabCaOqaaKqzafGaeeiiaaIaem4yamMaem4Ba8MaemOCaiNcdaahaaWcbeqaaiabikdaYaaajugqbiabcIcaOiabdEgaNPWaaSbaaSqaaiabdogaJjabcYcaSaqabaqcLbuacqqGGaaicqWGNbWzkmaaBaaaleaacqWGUbGBaeqaaKqzafGaeiykaKIaeiilaWIaeeiiaaIaeeiiaaIaemOBa4MaeeiiaaIaeyypa0JaeeiiaaIaeGymaeJaeiilaWYexLMBbXgBcf2CPn2qVrwzqf2zLnharyWrL9MCNLwyaGabciaa=bcacqaIYaGmcqGHMacVcqqGGaaicqWGobGtcqGGUaGlcqqGGaaiaSqaaiabd6gaUjabg2da9iabigdaXaqaaiabd6eaobqcLbuacqGHris5aiabbccaGiabbccaGiabbccaGiabbccaGiabbccaGiabbccaGaaa@720A@</m:annotation>
								</m:semantics>
							</m:math>
						</display-formula>
					</p>
					<p>where <it>g<sub>c</sub>
						</it> &#8712; <it>C</it>, <it>g<sub>n</sub>
						</it> &#8712; <it>G<sub>N</sub>
						</it>.</p>
					<p><it>g<sub>N+1</sub></it> is selected as the candidate whose SC value in (1) is the minimum:</p>
					<p>
						<display-formula>
							<m:math name="1471-2164-9-S1-S6-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:mo>{</m:mo>
										<m:msub>
											<m:mstyle mathsize="140%" displaystyle="true">
												<m:mi>g</m:mi>
											</m:mstyle>
											<m:mrow>
												<m:mi>N</m:mi>
												<m:mo>+</m:mo>
												<m:mn>1</m:mn>
											</m:mrow>
										</m:msub>
										<m:mo>|</m:mo>
										<m:mtext>&#8201;</m:mtext>
										<m:msub>
											<m:mstyle mathsize="140%" displaystyle="true">
												<m:mi>g</m:mi>
											</m:mstyle>
											<m:mrow>
												<m:mi>N</m:mi>
												<m:mo>+</m:mo>
												<m:mn>1</m:mn>
											</m:mrow>
										</m:msub>
										<m:mo>&#8712;</m:mo>
										<m:mtext>&#160;</m:mtext>
										<m:mi>C</m:mi>
										<m:mo>&#8745;</m:mo>
										<m:mtext>&#160;</m:mtext>
										<m:mi>S</m:mi>
										<m:mi>C</m:mi>
										<m:mo stretchy="false">(</m:mo>
										<m:msub>
											<m:mstyle mathsize="140%" displaystyle="true">
												<m:mi>g</m:mi>
											</m:mstyle>
											<m:mrow>
												<m:mi>N</m:mi>
												<m:mo>+</m:mo>
												<m:mn>1</m:mn>
											</m:mrow>
										</m:msub>
										<m:mo stretchy="false">)</m:mo>
										<m:mo>=</m:mo>
										<m:mi>min</m:mi>
										<m:mo>&#8289;</m:mo>
										<m:mo stretchy="false">(</m:mo>
										<m:mi>S</m:mi>
										<m:mi>C</m:mi>
										<m:mo stretchy="false">(</m:mo>
										<m:msub>
											<m:mstyle mathsize="140%" displaystyle="true">
												<m:mi>g</m:mi>
											</m:mstyle>
											<m:mi>c</m:mi>
										</m:msub>
										<m:mo stretchy="false">)</m:mo>
										<m:mo stretchy="false">)</m:mo>
										<m:mo>,</m:mo>
										<m:msub>
											<m:mstyle mathsize="140%" displaystyle="true">
												<m:mi>g</m:mi>
											</m:mstyle>
											<m:mi>c</m:mi>
										</m:msub>
										<m:mo>&#8712;</m:mo>
										<m:mtext>&#160;</m:mtext>
										<m:mi>C</m:mi>
										<m:mo>}</m:mo>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXafv3ySLgzGmvETj2BSbqeeuuDJXwAKbsr4rNCHbGeaGqipu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbba9pwe9Q8fs0=yqaqpepae9pg0FirpepeKkFr0xfr=xfr=xb9adbaqaaeGaciGaaiaabeqaaeaabaWaaaGcbaqcLbuacqGG7bWEkmaavababeWcbaGaemOta4Kaey4kaSIaeGymaedabeqdbaGaem4zaCgaaKqzafGaeiiFaWNaaGjbVRWaaubeaeqaleaacqWGobGtcqGHRaWkcqaIXaqmaeqaneaacqWGNbWzaaqcLbuacqGHiiIZcqqGGaaicqWGdbWqcqGHPiYXcqqGGaaicqWGtbWucqWGdbWqcqGGOaakkmaavababeWcbaGaemOta4Kaey4kaSIaeGymaedabeqdbaGaem4zaCgaaKqzafGaeiykaKIaeyypa0JagiyBa0MaeiyAaKMaeiOBa4MaeiikaGIaem4uamLaem4qamKaeiikaGIcdaqfqaqabSqaaiabdogaJbqab0qaaiabdEgaNbaajugqbiabcMcaPiabcMcaPiabcYcaSOWaaubeaeqaleaacqWGJbWyaeqaneaacqWGNbWzaaqcLbuacqGHiiIZcqqGGaaicqWGdbWqcqGG9bqFaaa@669E@</m:annotation>
								</m:semantics>
							</m:math>
						</display-formula>
					</p>
					<p>This strategy is called Minimum SC (MSC).</p>
					<p>Step 4: Features are added recursively to the chosen feature set as in Steps 1-3, using supervised learning and the similarity measure, until the classification accuracy stops increasing.</p>
					<p>SRFA combined with MSC is denoted classifier-MSC. For example, if the classifier is the Naive Bayes Classifier (NBC), we call the feature selection method NBC-MSC.</p>
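					<p>As an illustration, Steps 1-4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in our experiments. The names <it>srfa</it>, <it>sum_sq_corr</it>, <it>pearson</it> and <it>accuracy_fn</it> are hypothetical; <it>accuracy_fn</it> stands for any routine that returns the classification accuracy of a given feature subset. For brevity the sketch uses the Pearson coefficient for every feature (for SNPs the Spearman coefficient would take its place) and breaks ties in Step 2 by feature index rather than by the score-test <it>p</it>-value.</p>
					<p><![CDATA[
```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant column counts as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def sum_sq_corr(columns, candidate, chosen):
    """SC of a candidate: sum of squared correlations with the chosen features."""
    return sum(pearson(columns[candidate], columns[n]) ** 2 for n in chosen)


def srfa(columns, accuracy_fn, tol=1e-12):
    """Recursive feature addition with the Minimum SC tie-break.

    columns     -- list of feature columns (one numeric list per feature)
    accuracy_fn -- hypothetical callback: list of feature indices -> accuracy
    """
    remaining = set(range(len(columns)))

    # Steps 1-2: rank single features and keep the most accurate one.
    singles = {i: accuracy_fn([i]) for i in remaining}
    best_acc = max(singles.values())
    first = min(i for i, a in singles.items() if abs(a - best_acc) <= tol)
    chosen = [first]
    remaining.discard(first)

    while remaining:
        # Step 3: try every outside feature and collect the top candidates.
        trial = {i: accuracy_fn(chosen + [i]) for i in remaining}
        top = max(trial.values())
        if top <= best_acc + tol:
            break  # Step 4: accuracy no longer increases
        candidates = sorted(i for i, a in trial.items() if abs(a - top) <= tol)
        # Minimum SC: keep the candidate least correlated with the chosen set.
        nxt = min(candidates, key=lambda c: sum_sq_corr(columns, c, chosen))
        chosen.append(nxt)
        remaining.discard(nxt)
        best_acc = top

    return chosen, best_acc
```
]]></p>
					<p>In this sketch, two candidates that reach the same accuracy are separated only by their SC values, so a feature that duplicates an already chosen one is passed over in favour of an uncorrelated feature.</p>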
				</sec>
				<sec>
					<st>
						<p>Support vector based recursive feature addition (SVRFA) algorithms</p>
					</st>
					<p>Support Vector Machines (SVMs) <abbrgrp>
							<abbr bid="B14">14</abbr>
							<abbr bid="B15">15</abbr>
							<abbr bid="B16">16</abbr>
						</abbrgrp> have been widely applied to pattern classification and non-linear regression problems. The basic idea of the SVM algorithm is to find an optimal hyper-plane that maximizes the margin between two groups. The vectors closest to the optimal hyper-plane are called support vectors. Guyon <it>et al</it>. <abbrgrp>
							<abbr bid="B31">31</abbr>
						</abbrgrp> proposed a gene selection method, Support Vector Machine Recursive Feature Elimination (SVMRFE). In addition to gene selection, SVMRFE has been successfully applied to other feature selection and pattern classification problems <abbrgrp>
							<abbr bid="B37">37</abbr>
						</abbrgrp>. Building on SVMRFE and the SRFA algorithm described above, we propose a feature addition scheme that combines the lowest support vector weight (or maximum margin width) with the lowest correlation. We call it Support Vector based Recursive Feature Addition (SVRFA), and it proceeds as follows:</p>
					<p>1. Train an SVM on each individual feature in the data set to obtain an SVM with a weight vector <inline-formula>
							<m:math name="1471-2164-9-S1-S6-i3" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:mover accent="true">
											<m:mi>w</m:mi>
											<m:mo stretchy="true">&#8594;</m:mo>
										</m:mover>
										<m:mo>=</m:mo>
										<m:mstyle displaystyle="true">
											<m:msub>
												<m:mo>&#8721;</m:mo>
												<m:mi>k</m:mi>
											</m:msub>
											<m:mrow>
												<m:msub>
													<m:mi>&#945;</m:mi>
													<m:mi>k</m:mi>
												</m:msub>
												<m:msub>
													<m:mi>y</m:mi>
													<m:mi>k</m:mi>
												</m:msub>
												<m:mover accent="true">
													<m:mrow>
														<m:msub>
															<m:mi>x</m:mi>
															<m:mi>k</m:mi>
														</m:msub>
													</m:mrow>
													<m:mo stretchy="true">&#8594;</m:mo>
												</m:mover>
											</m:mrow>
										</m:mstyle>
										<m:mo>.</m:mo>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegm0B1jxALjhiov2DaeHbuLwBLnhiov2DGi1BTfMBaebbnrfifHhDYfgasaacH8qrps0lbbf9q8WrFfeuY=Hhbbf9v8qqaqFr0xc9pk0xbba9q8WqFfea0=yr0RYxir=Jbba9q8aq0=yq=He9q8qqQ8frFve9Fve9Ff0dmeaabaqaciGacaGaaeqabaWaaqaafaaakeaajuaGdaWhcaqcaawaaKqzafGaem4DaChajaaycaGLxdcajugqbiabg2da9KqbaoaaqabajaaybaqcLbuacqaHXoqyjuaGdaWgaaqcbawaaKqzafGaem4AaSgajeaybeaajugqbiabdMha5LqbaoaaBaaajeaybaqcLbuacqWGRbWAaKqaGfqaaKqbaoaaFiaajaaybaqcLbuacqWG4baEjuaGdaWgaaqcbawaaKqzafGaem4AaSgajeaybeaaaKaaGjaawEniaaqcbawaaKqzafGaem4AaSgajeaybeqcLbuacqGHris5aiabc6caUaaa@5FC8@</m:annotation>
								</m:semantics>
							</m:math>
						</inline-formula></p>
					<p>2. Rank features according to criterion <it>c</it> for feature <it>i</it>: <it>c<sub>i</sub> =</it> (<it>w<sub>i</sub>
						</it>)<sup>2</sup>. The features corresponding to the lowest <it>c</it> are selected as candidates. The candidate with the highest statistical significance is the first element of the feature set. At this point the chosen feature set, <it>G<sub>1</sub>
						</it>, consists of the first feature, <it>g<sub>1</sub>
						</it>, which corresponds to feature dimension one.</p>
					<p>3. The (<it>N+1</it>)<sup>
							<it>st</it>
						</sup> dimensional feature set, <it>G<sub>N+1</sub> =</it> {<it>g<sub>1</sub>
						</it>, <it>g<sub>2</sub>
						</it>, &#8230;, <it>g<sub>N</sub>
						</it>, <it>g<sub>N+1</sub>
						</it>} is produced by adding <it>g<sub>N+1</sub>
						</it> to the <it>N</it> dimensional feature set <it>G<sub>N</sub>
						</it> = {<it>g<sub>1</sub>
						</it>, <it>g<sub>2</sub>
						</it>,&#8230;,<it>g<sub>N</sub>
						</it>}. The choice of <it>g<sub>N+1</sub>
						</it> is described as follows:</p>
					<p>Temporarily add each feature <it>g<sub>i</sub>
						</it> (<it>i &#8800; 1, 2, &#8230;, N</it>) outside of <it>G<sub>N</sub>
						</it> to <it>G<sub>N</sub>
						</it>, train an SVM on feature set <it>G<sub>N</sub>
						</it> + {<it>g<sub>i</sub>
						</it>}, update <it>c</it>, and calculate the measures after introducing <it>g<sub>i</sub>
						</it> as follows:</p>
					<p>
						<display-formula>
							<m:math name="1471-2164-9-S1-S6-i4" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:mi>S</m:mi>
										<m:mi>W</m:mi>
										<m:mtext>(</m:mtext>
										<m:msub>
											<m:mi>g</m:mi>
											<m:mi>i</m:mi>
										</m:msub>
										<m:mtext>)&#160;=</m:mtext>
										<m:mstyle displaystyle="true">
											<m:msubsup>
												<m:mo>&#8721;</m:mo>
												<m:mrow>
													<m:mi>k</m:mi>
													<m:mo>=</m:mo>
													<m:mn>1</m:mn>
												</m:mrow>
												<m:mrow>
													<m:mi>N</m:mi>
													<m:mo>+</m:mo>
													<m:mn>1</m:mn>
												</m:mrow>
											</m:msubsup>
											<m:mrow>
												<m:msub>
													<m:mi>c</m:mi>
													<m:mi>k</m:mi>
												</m:msub>
											</m:mrow>
										</m:mstyle>
										<m:mo>=</m:mo>
										<m:mstyle displaystyle="true">
											<m:msubsup>
												<m:mo>&#8721;</m:mo>
												<m:mrow>
													<m:mi>k</m:mi>
													<m:mo>=</m:mo>
													<m:mn>1</m:mn>
												</m:mrow>
												<m:mrow>
													<m:mi>N</m:mi>
													<m:mo>+</m:mo>
													<m:mn>1</m:mn>
												</m:mrow>
											</m:msubsup>
											<m:mrow>
												<m:msup>
													<m:msub>
														<m:mi>w</m:mi>
														<m:mi>k</m:mi>
													</m:msub>
													<m:mn>2</m:mn>
												</m:msup>
											</m:mrow>
										</m:mstyle>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegm0B1jxALjhiov2DaeHbuLwBLnhiov2DGi1BTfMBaebbnrfifHhDYfgasaacH8qrps0lbbf9q8WrFfeuY=Hhbbf9v8qqaqFr0xc9pk0xbba9q8WqFfea0=yr0RYxir=Jbba9q8aq0=yq=He9q8qqQ8frFve9Fve9Ff0dmeaabaqaciGacaGaaeqabaWaaqaafaaakeaajugqbiabdofatjabdEfaxjabbIcaOiabdEgaNLqbaoaaBaaaleaajugqbiabdMgaPbWcbeaajugqbiabbMcaPiabbccaGiabb2da9KqbaoaaqadajaaybaqcLbuacqWGJbWyjuaGdaWgaaqcbawaaKqzafGaem4AaSgajeaybeaaaeaajugqbiabdUgaRjabg2da9iabigdaXaqcbawaaKqzafGaemOta4Kaey4kaSIaeGymaedacqGHris5aiabg2da9KqbaoaaqadajaaybaqcLbuacqWG3bWDkmaaBaaaleaacqWGRbWAaeqaaaqcbawaaKqzafGaem4AaSMaeyypa0JaeGymaedajeaybaqcLbuacqWGobGtcqGHRaWkcqaIXaqmaiabggHiLdqcfa4aaWbaaeqabaGaeGOmaidaaaaa@6AE3@</m:annotation>
								</m:semantics>
							</m:math>
						</display-formula>
					</p>
					<p>
						<display-formula>
							<m:math name="1471-2164-9-S1-S6-i5" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:mi>M</m:mi>
										<m:mi>W</m:mi>
										<m:mo stretchy="false">(</m:mo>
										<m:msub>
											<m:mi>g</m:mi>
											<m:mi>i</m:mi>
										</m:msub>
										<m:mo stretchy="false">)</m:mo>
										<m:mtext>&#160;=</m:mtext>
										<m:mi>max</m:mi>
										<m:mo>&#8289;</m:mo>
										<m:mo stretchy="false">(</m:mo>
										<m:msub>
											<m:mi>c</m:mi>
											<m:mi>k</m:mi>
										</m:msub>
										<m:mo stretchy="false">)</m:mo>
										<m:mo>=</m:mo>
										<m:mi>max</m:mi>
										<m:mo>&#8289;</m:mo>
										<m:mo stretchy="false">(</m:mo>
										<m:msup>
											<m:mrow>
												<m:msub>
													<m:mstyle mathsize="140%" displaystyle="true">
														<m:mi>w</m:mi>
													</m:mstyle>
													<m:mi>k</m:mi>
												</m:msub>
											</m:mrow>
											<m:mn>2</m:mn>
										</m:msup>
										<m:mo stretchy="false">)</m:mo>
										<m:mo>,</m:mo>
										<m:mi>k</m:mi>
										<m:mo>=</m:mo>
										<m:mn>1</m:mn>
										<m:mo>,</m:mo>
										<m:mn>2</m:mn>
										<m:mo>,</m:mo>
										<m:mo>&#8230;</m:mo>
										<m:mo>,</m:mo>
										<m:mi>N</m:mi>
										<m:mo>+</m:mo>
										<m:mn>1</m:mn>
										<m:mo>.</m:mo>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegm0B1jxALjhiov2DaeHbuLwBLnhiov2DGi1BTfMBaebbnrfifHhDYfgasaacH8qrps0lbbf9q8WrFfeuY=Hhbbf9v8qqaqFr0xc9pk0xbba9q8WqFfea0=yr0RYxir=Jbba9q8aq0=yq=He9q8qqQ8frFve9Fve9Ff0dmeaabaqaciGacaGaaeqabaWaaqaafaaakeaajugqbiabd2eanjabdEfaxjabcIcaOiabdEgaNLqbaoaaBaaaleaajugqbiabdMgaPbWcbeaajugqbiabcMcaPiabbccaGiabb2da9iGbc2gaTjabcggaHjabcIha4jabcIcaOiabdogaJLqbaoaaBaaajeaybaqcLbuacqWGRbWAaKqaGfqaaKqzafGaeiykaKIaeyypa0JagiyBa0MaeiyyaeMaeiiEaGNaeiikaGIcdaqfqaqabSqaaiabdUgaRbqab0qaaiabdEha3baakmaaCaaaleqabaGaeGOmaidaaOGaeiykaKscLbuacqGGSaalcqWGRbWAcqGH9aqpcqaIXaqmcqGGSaalcqaIYaGmcqGGUaGlcqGGUaGlcqGGUaGlcqWGobGtcqGHRaWkcqaIXaqmcqGGUaGlaaa@6C6E@</m:annotation>
								</m:semantics>
							</m:math>
						</display-formula>
					</p>
					<p>We have two strategies for selecting candidates for <it>g<sub>N+1</sub></it>, corresponding to the measures <it>SW</it> and <it>MW</it>, respectively. The candidate set is denoted <it>C</it>. The first strategy places the feature with the minimum <it>SW</it> in <it>C</it>; the second places the feature with the minimum <it>MW</it> in <it>C</it>.</p>
					<p>
						<display-formula>
							<m:math name="1471-2164-9-S1-S6-i6" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:msub>
											<m:mi>g</m:mi>
											<m:mi>j</m:mi>
										</m:msub>
										<m:mo>&#8712;</m:mo>
										<m:mi>C</m:mi>
										<m:mtext>&#160;|&#160;</m:mtext>
										<m:mi>S</m:mi>
										<m:mi>W</m:mi>
										<m:mo stretchy="false">(</m:mo>
										<m:msub>
											<m:mi>g</m:mi>
											<m:mi>j</m:mi>
										</m:msub>
										<m:mo stretchy="false">)</m:mo>
										<m:mtext>&#160;=&#160;min</m:mtext>
										<m:mo stretchy="false">(</m:mo>
										<m:mi>S</m:mi>
										<m:mi>W</m:mi>
										<m:mo stretchy="false">)</m:mo>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegm0B1jxALjhiov2DaeHbuLwBLnhiov2DGi1BTfMBaebbnrfifHhDYfgasaacH8qrps0lbbf9q8WrFfeuY=Hhbbf9v8qqaqFr0xc9pk0xbba9q8WqFfea0=yr0RYxir=Jbba9q8aq0=yq=He9q8qqQ8frFve9Fve9Ff0dmeaabaqaciGacaGaaeqabaWaaqaafaaakeaajugqbiabdEgaNLqbaoaaBaaaleaajugqbiabdQgaQbWcbeaajugqbiabgIGiolabdoeadjabbccaGiabbYha8jabbccaGiabdofatjabdEfaxjabcIcaOiabdEgaNLqbaoaaBaaabaGaemOAaOgabeaajugqbiabcMcaPiabbccaGiabb2da9iabbccaGiabb2gaTjabbMgaPjabb6gaUjabcIcaOiabdofatjabdEfaxjabcMcaPaaa@5B4F@</m:annotation>
								</m:semantics>
							</m:math>
						</display-formula>
					</p>
					<p>
						<display-formula>
							<m:math name="1471-2164-9-S1-S6-i7" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:msub>
											<m:mi>g</m:mi>
											<m:mi>j</m:mi>
										</m:msub>
										<m:mo>&#8712;</m:mo>
										<m:mi>C</m:mi>
										<m:mtext>&#160;|&#160;</m:mtext>
										<m:mi>M</m:mi>
										<m:mi>W</m:mi>
										<m:mo stretchy="false">(</m:mo>
										<m:msub>
											<m:mi>g</m:mi>
											<m:mi>j</m:mi>
										</m:msub>
										<m:mo stretchy="false">)</m:mo>
										<m:mtext>&#160;=&#160;min</m:mtext>
										<m:mo stretchy="false">(</m:mo>
										<m:mi>M</m:mi>
										<m:mi>W</m:mi>
										<m:mo stretchy="false">)</m:mo>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegm0B1jxALjhiov2DaeHbuLwBLnhiov2DGi1BTfMBaebbnrfifHhDYfgasaacH8qrps0lbbf9q8WrFfeuY=Hhbbf9v8qqaqFr0xc9pk0xbba9q8WqFfea0=yr0RYxir=Jbba9q8aq0=yq=He9q8qqQ8frFve9Fve9Ff0dmeaabaqaciGacaGaaeqabaWaaqaafaaakeaajugqbiabdEgaNLqbaoaaBaaaleaajugqbiabdQgaQbWcbeaajugqbiabgIGiolabdoeadjabbccaGiabbYha8jabbccaGiabd2eanjabdEfaxjabcIcaOiabdEgaNLqbaoaaBaaabaqcLbuacqWGQbGAaKqbagqaaKqzafGaeiykaKIaeeiiaaIaeeypa0JaeeiiaaIaeeyBa0MaeeyAaKMaeeOBa4MaeiikaGIaemyta0Kaem4vaCLaeiykaKcaaa@5C74@</m:annotation>
								</m:semantics>
							</m:math>
						</display-formula>
					</p>
					<p>Only one feature is chosen as <it>g<sub>N+1</sub></it>, regardless of whether <it>C</it> contains a single candidate or several. We choose <it>g<sub>N+1</sub></it> from <it>C</it> by calculating SC(<it>g<sub>j</sub></it>), shown in (1), and applying the Minimum SC (MSC) criterion in (2).</p>
					<p>We refer to the combination of the support vector based Minimum <it>SW</it> criterion in (5) with the Minimum SC criterion in (2) as MSW-MSC. Similarly, we refer to the combination of the support vector based Minimum <it>MW</it> criterion in (6) with the Minimum SC criterion in (2) as MMW-MSC. Both MSW-MSC and MMW-MSC are Support Vector based Recursive Feature Addition (SVRFA) algorithms.</p>
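					<p>The selection of <it>g<sub>N+1</sub></it> in SVRFA can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the implementation used in our experiments. The names <it>sw_measure</it>, <it>mw_measure</it>, <it>next_feature</it>, <it>train_weights</it> and <it>sc_fn</it> are hypothetical: <it>train_weights</it> stands for any routine that trains a linear SVM on a feature subset and returns one weight per feature, and <it>sc_fn</it> stands for the SC measure in (1).</p>
					<p><![CDATA[
```python
def sw_measure(weights):
    """SW: sum of the squared SVM weights of the trial feature set."""
    return sum(w * w for w in weights)


def mw_measure(weights):
    """MW: largest squared SVM weight of the trial feature set."""
    return max(w * w for w in weights)


def next_feature(chosen, remaining, train_weights, sc_fn,
                 measure=sw_measure, tol=1e-12):
    """Pick g_(N+1) for one SVRFA iteration.

    chosen        -- indices of the features already selected
    remaining     -- indices of the features not yet selected
    train_weights -- hypothetical callback: list of indices -> list of weights
    sc_fn         -- hypothetical callback: (candidate, chosen) -> SC value
    measure       -- sw_measure for MSW-MSC, mw_measure for MMW-MSC
    """
    # Temporarily add each outside feature, retrain, and score the weights.
    scores = {i: measure(train_weights(chosen + [i])) for i in remaining}
    lowest = min(scores.values())
    # Candidate set C: every feature that attains the minimum measure.
    candidates = sorted(i for i, s in scores.items() if s - lowest <= tol)
    # Minimum SC: the candidate least correlated with the chosen features.
    return min(candidates, key=lambda c: sc_fn(c, chosen))
```
]]></p>
					<p>Passing <it>sw_measure</it> corresponds to MSW-MSC and passing <it>mw_measure</it> corresponds to MMW-MSC; the two measures can favour different features for the same weights.</p>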
				</sec>
				<sec>
					<st>
						<p>Implementations and comparison studies</p>
					</st>
					<p>We implemented the SRFA algorithm described above with several statistical learning classifiers of differing complexity. The learning classifiers used for feature selection were the Naive Bayes Classifier (NBC) <abbrgrp>
							<abbr bid="B32">32</abbr>
						</abbrgrp>, the Nearest Mean Scaled Classifier (NMSC) <abbrgrp>
							<abbr bid="B33">33</abbr>
						</abbrgrp> and the Dynamic Evolving Neuro-Fuzzy Inference System (DENFIS) <abbrgrp>
							<abbr bid="B34">34</abbr>
						</abbrgrp>. We denote the resulting methods NBC-MSC, NMSC-MSC and DENFIS-MSC. Several classifiers, including NBC, NMSC, SVM and the uncorrelated normal based quadratic Bayes classifier (UDC) <abbrgrp>
							<abbr bid="B33">33</abbr>
						</abbrgrp>, were applied to the feature sets selected by these SRFA methods in order to compare their performance. Our goals are (i) to evaluate feature selection procedures and find the number of features required for the best classification accuracy; (ii) to evaluate various learning approaches; and (iii) to investigate the redundancy issues in SNP data in order to improve classification performance.</p>
					<p>We also implemented and tested the SVRFA methods (MSW-MSC and MMW-MSC) described above. For comparison, we included other popular methods for SNP selection and disease classification, namely Support Vector Machine Recursive Feature Elimination (SVMRFE), the logistic regression based Wald t-test and Logic regression (LOGICFS). In addition, we applied SVM and traditional neural network classifiers, such as a Levenberg-Marquardt trained feed-forward neural network classifier and a back-propagation trained feed-forward neural network classifier <abbrgrp>
							<abbr bid="B33">33</abbr>
						</abbrgrp>, to the different feature selections on the two real data sets. These neural network classifiers did not perform well, so we do not report their experimental results here.</p>
					<p>Cross-validation has been widely used for selecting tuning parameters and optimizing the number of selected genes when building classifiers, in order to avoid over-fitting. In each cross-validation run, we split the data into training and testing samples, built the model on the training samples only, and evaluated its performance on the testing samples. We performed 20 such runs and recorded the testing accuracy of each.</p>
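					<p>The evaluation protocol can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the names <it>repeated_holdout</it> and <it>fit_and_score</it> are hypothetical, <it>fit_and_score</it> stands for any routine that selects features and trains a classifier on the training indices only and returns the accuracy on the testing indices, and the testing fraction of 0.3 and the random seed are illustrative values that are not taken from our experiments.</p>
					<p><![CDATA[
```python
import random


def repeated_holdout(n_samples, fit_and_score, runs=20,
                     test_fraction=0.3, seed=0):
    """Average testing accuracy over repeated random train/test splits.

    fit_and_score -- hypothetical callback: (train_idx, test_idx) -> accuracy;
                     it must use the training indices only for model building.
    test_fraction -- illustrative value, an assumption of this sketch
    """
    rng = random.Random(seed)
    n_test = max(1, int(round(n_samples * test_fraction)))
    accuracies = []
    for _ in range(runs):
        order = list(range(n_samples))
        rng.shuffle(order)
        # Disjoint split: the first n_test shuffled samples form the test set.
        test_idx, train_idx = order[:n_test], order[n_test:]
        accuracies.append(fit_and_score(train_idx, test_idx))
    return sum(accuracies) / runs, accuracies
```
]]></p>
					<p>Keeping feature selection inside <it>fit_and_score</it> ensures that the testing samples never influence which features are chosen.</p>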
				</sec>
			</sec>
		</sec>
		<sec>
			<st>
				<p>Competing interests</p>
			</st>
			<p>The authors declare that they have no competing interests.</p>
		</sec>
		<sec>
			<st>
				<p>Authors' contributions</p>
			</st>
			<p>QL performed the study and drafted the manuscript; JY conceived the project and designed the experiments; ZC assisted with the study and the manuscript preparation; MQY designed the project and helped to design the algorithms; AHS and XH supervised the study and obtained the support. All authors have read and approved the final manuscript.</p>
		</sec>
	</bdy>
	<bm>
		<ack>
			<sec>
				<st>
					<p>Acknowledgements</p>
				</st>
				<p>Research support received from ICASA (Institute for Complex Additive Systems Analysis, a division of New Mexico Tech) and the Radiology Department of Brigham and Women's Hospital (BWH) is gratefully acknowledged. The authors greatly appreciate the invaluable help and insightful discussion of Dr. Liang at SUNY-Buffalo during this study, and the manuscript editing and very constructive comments of Ms. Kim Lawson at the BWH Radiology Department.</p>
				<p>This article has been published as part of <it>BMC Genomics</it> Volume 9 Supplement 1, 2008: The 2007 International Conference on Bioinformatics &amp; Computational Biology (BIOCOMP'07). The full contents of the supplement are available online at <url>http://www.biomedcentral.com/1471-2164/9?issue=S1</url>.</p>
			</sec>
		</ack>
		<refgrp>
			<bibl id="B1">
				<title>
					<p>Review: The essence of SNPs</p>
				</title>
				<aug>
					<au>
						<snm>Brookes</snm>
						<fnm>AJ</fnm>
					</au>
				</aug>
				<source>Gene</source>
				<pubdate>1999</pubdate>
				<volume>234</volume>
				<fpage>177</fpage>
				<lpage>186</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">10395891</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B2">
				<title>
					<p>The International HapMap Project</p>
				</title>
				<insg>
					<ins>
						<p>The International HapMap Consortium</p>
					</ins>
				</insg>
				<source>Nature</source>
				<pubdate>2003</pubdate>
				<volume>426</volume>
				<fpage>789</fpage>
				<lpage>796</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">14685227</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B3">
				<title>
					<p>Integrating ethics and science in the International HapMap Project</p>
				</title>
				<insg>
					<ins>
						<p>The International HapMap Consortium</p>
					</ins>
				</insg>
				<source>Nat Rev Genet</source>
				<pubdate>2004</pubdate>
				<volume>5</volume>
				<fpage>467</fpage>
				<lpage>475</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15153999</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B4">
				<title>
					<p>A haplotype map of the human genome</p>
				</title>
				<insg>
					<ins>
						<p>The International HapMap Consortium</p>
					</ins>
				</insg>
				<source>Nature</source>
				<pubdate>2005</pubdate>
				<volume>437</volume>
				<fpage>1299</fpage>
				<lpage>1320</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1880871</pubid>
						<pubid idtype="pmpid" link="fulltext">16255080</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B5">
				<title>
					<p>Searching for genetic determinants in the new millennium</p>
				</title>
				<aug>
					<au>
						<snm>Risch</snm>
						<fnm>NJ</fnm>
					</au>
				</aug>
				<source>Nature</source>
				<pubdate>2000</pubdate>
				<volume>405</volume>
				<fpage>847</fpage>
				<lpage>856</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">10866211</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B6">
				<title>
					<p>Association study designs for complex diseases</p>
				</title>
				<aug>
					<au>
						<snm>Cardon</snm>
						<fnm>LR</fnm>
					</au>
					<au>
						<snm>Bell</snm>
						<fnm>JI</fnm>
					</au>
				</aug>
				<source>Nat Rev Genet</source>
				<pubdate>2001</pubdate>
				<volume>2</volume>
				<fpage>91</fpage>
				<lpage>99</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">11253062</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B7">
				<title>
					<p>Entropy-based SNP selection for genetic association studies</p>
				</title>
				<aug>
					<au>
						<snm>Hampe</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Schreiber</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Krawczak</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Hum Genet</source>
				<pubdate>2003</pubdate>
				<volume>114</volume>
				<fpage>36</fpage>
				<lpage>43</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">14505034</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B8">
				<title>
					<p>Minimal haplotype tagging</p>
				</title>
				<aug>
					<au>
						<snm>Sebastiani</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Lazarus</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Weiss</snm>
						<fnm>ST</fnm>
					</au>
					<au>
						<snm>Kunkel</snm>
						<fnm>LM</fnm>
					</au>
					<au>
						<snm>Kohane</snm>
						<fnm>IS</fnm>
					</au>
					<au>
						<snm>Ramoni</snm>
						<fnm>MF</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci</source>
				<pubdate>2003</pubdate>
				<volume>100</volume>
				<fpage>9900</fpage>
				<lpage>9905</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">187880</pubid>
						<pubid idtype="pmpid" link="fulltext">12900503</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B9">
				<title>
					<p>Choosing haplotype-tagging SNPs based on unphased genotype data using preliminary sample of unrelated subjects with an example from the multiethnic cohort study</p>
				</title>
				<aug>
					<au>
						<snm>Stram</snm>
						<fnm>DO</fnm>
					</au>
					<au>
						<snm>Haiman</snm>
						<fnm>CA</fnm>
					</au>
					<au>
						<snm>Hirschhorn</snm>
						<fnm>JN</fnm>
					</au>
					<au>
						<snm>Altshuler</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Kolonel</snm>
						<fnm>LN</fnm>
					</au>
					<au>
						<snm>Henderson</snm>
						<fnm>BE</fnm>
					</au>
					<au>
						<snm>Pike</snm>
						<fnm>MC</fnm>
					</au>
				</aug>
				<source>Hum Hered</source>
				<pubdate>2003</pubdate>
				<volume>55</volume>
				<fpage>27</fpage>
				<lpage>36</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">12890923</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B10">
				<title>
					<p>Selecting a maximally informative set of single-nucleotide polymorphisms for association analysis using linkage disequilibrium</p>
				</title>
				<aug>
					<au>
						<snm>Carlson</snm>
						<fnm>CS</fnm>
					</au>
					<au>
						<snm>Eberle</snm>
						<fnm>MA</fnm>
					</au>
					<au>
						<snm>Rieder</snm>
						<fnm>MJ</fnm>
					</au>
					<au>
						<snm>Yi</snm>
						<fnm>Q</fnm>
					</au>
					<au>
						<snm>Kruglyak</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Nickerson</snm>
						<fnm>DA</fnm>
					</au>
				</aug>
				<source>Am J Hum Genet</source>
				<pubdate>2004</pubdate>
				<volume>74</volume>
				<fpage>106</fpage>
				<lpage>120</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1181897</pubid>
						<pubid idtype="pmpid" link="fulltext">14681826</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B11">
				<title>
					<p>Optimal haplotype block-free selection of tagging SNPs for genomewide association studies</p>
				</title>
				<aug>
					<au>
						<snm>Halldorsson</snm>
						<fnm>BV</fnm>
					</au>
					<au>
						<snm>Bafna</snm>
						<fnm>V</fnm>
					</au>
					<au>
						<snm>Lippert</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Schwartz</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>De La Vega</snm>
						<fnm>FM</fnm>
					</au>
					<au>
						<snm>Clark</snm>
						<fnm>AG</fnm>
					</au>
					<au>
						<snm>Istrail</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Genome Res</source>
				<pubdate>2004</pubdate>
				<volume>14</volume>
				<fpage>1633</fpage>
				<lpage>1640</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">509273</pubid>
						<pubid idtype="pmpid" link="fulltext">15289481</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B12">
				<title>
					<p>Finding haplotype tagging SNPs by use of principal components analysis</p>
				</title>
				<aug>
					<au>
						<snm>Lin</snm>
						<fnm>Z</fnm>
					</au>
					<au>
						<snm>Altman</snm>
						<fnm>RB</fnm>
					</au>
				</aug>
				<source>Am J Hum Genet</source>
				<pubdate>2004</pubdate>
				<volume>75</volume>
				<fpage>850</fpage>
				<lpage>861</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1182114</pubid>
						<pubid idtype="pmpid" link="fulltext">15389393</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B13">
				<title>
					<p>TagSNP Selection Based on Pairwise LD Criterion and Power Analysis in Association Studies</p>
				</title>
				<aug>
					<au>
						<snm>Gopalakrishnan</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Qin</snm>
						<fnm>ZS</fnm>
					</au>
				</aug>
				<source>Pac Symp Biocomput</source>
				<pubdate>2006</pubdate>
				<volume>11</volume>
				<fpage>511</fpage>
				<lpage>522</lpage>
			</bibl>
			<bibl id="B14">
				<title>
					<p>Support Vector Networks</p>
				</title>
				<aug>
					<au>
						<snm>Cortes</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Vapnik</snm>
						<fnm>VN</fnm>
					</au>
				</aug>
				<source>Machine Learning</source>
				<pubdate>1995</pubdate>
				<volume>20</volume>
				<fpage>273</fpage>
				<lpage>297</lpage>
			</bibl>
			<bibl id="B15">
				<title>
					<p>The Nature of Statistical Learning Theory</p>
				</title>
				<aug>
					<au>
						<snm>Vapnik</snm>
						<fnm>VN</fnm>
					</au>
				</aug>
				<publisher>Springer-Verlag, New York</publisher>
				<pubdate>1995</pubdate>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">8555380</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B16">
				<title>
					<p>Statistical Learning Theory</p>
				</title>
				<aug>
					<au>
						<snm>Vapnik</snm>
						<fnm>VN</fnm>
					</au>
				</aug>
				<publisher>Wiley, New York</publisher>
				<pubdate>1998</pubdate>
			</bibl>
			<bibl id="B17">
				<title>
					<p>Introduction: Analysis of Sequence Data and Population Structure</p>
				</title>
				<aug>
					<au>
						<snm>Witte</snm>
						<fnm>JS</fnm>
					</au>
					<au>
						<snm>Fijal</snm>
						<fnm>BA</fnm>
					</au>
				</aug>
				<source>Genet Epidemiol</source>
				<pubdate>2001</pubdate>
				<volume>21</volume>
				<issue>Suppl 1</issue>
				<fpage>S600</fpage>
				<lpage>S601</lpage>
				<xrefbib>
					<pubid idtype="pmpid">11793745</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B18">
				<title>
					<p>Introduction to Data Mining</p>
				</title>
				<aug>
					<au>
						<snm>Tan</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Steinbach</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Kumar</snm>
						<fnm>V</fnm>
					</au>
				</aug>
				<publisher>Addison-Wesley</publisher>
				<pubdate>2005</pubdate>
				<fpage>76</fpage>
				<lpage>79</lpage>
			</bibl>
			<bibl id="B19">
				<title>
					<p>HaploBlockFinder: Haplotype block analysis</p>
				</title>
				<aug>
					<au>
						<snm>Zhang</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Jin</snm>
						<fnm>L</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2003</pubdate>
				<volume>19</volume>
				<fpage>1300</fpage>
				<lpage>1301</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">12835279</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B20">
				<title>
					<p>Finding haplotype block boundaries by using the minimum-description-length principle</p>
				</title>
				<aug>
					<au>
						<snm>Anderson</snm>
						<fnm>EC</fnm>
					</au>
					<au>
						<snm>Novembre</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Am J Hum Genet</source>
				<pubdate>2003</pubdate>
				<volume>73</volume>
				<fpage>336</fpage>
				<lpage>354</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1182137</pubid>
						<pubid idtype="pmpid" link="fulltext">12858289</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B21">
				<title>
					<p>Minimum description length block finder, a method to identify haplotype blocks and to compare the strength of block boundaries</p>
				</title>
				<aug>
					<au>
						<snm>Mannila</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Koivisto</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Perola</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Varilo</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Hennah</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Ekelund</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Lukk</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Peltonen</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Ukkonen</snm>
						<fnm>E</fnm>
					</au>
				</aug>
				<source>Am J Hum Genet</source>
				<pubdate>2003</pubdate>
				<volume>73</volume>
				<fpage>86</fpage>
				<lpage>94</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1180593</pubid>
						<pubid idtype="pmpid" link="fulltext">12761696</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B22">
				<title>
					<p>Haplotype sharing analysis using Mantel statistics</p>
				</title>
				<aug>
					<au>
						<snm>Beckmann</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Thomas</snm>
						<fnm>DC</fnm>
					</au>
					<au>
						<snm>Fischer</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Chang-Claude</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Hum Hered</source>
				<pubdate>2005</pubdate>
				<volume>59</volume>
				<fpage>67</fpage>
				<lpage>78</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15838176</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B23">
				<title>
					<p>MLR-tagging: informative SNP selection for unphased genotypes based on multiple linear regression</p>
				</title>
				<aug>
					<au>
						<snm>He</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Zelikovsky</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2006</pubdate>
				<volume>22</volume>
				<issue>20</issue>
				<fpage>2558</fpage>
				<lpage>2561</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">16895924</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B24">
				<title>
					<p>A model-based scan statistic for identifying extreme chromosomal regions of gene expression in human tumors</p>
				</title>
				<aug>
					<au>
						<snm>Levin</snm>
						<fnm>AM</fnm>
					</au>
					<au>
						<snm>Ghosh</snm>
						<fnm>D</fnm>
					</au>
					<etal/>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2005</pubdate>
				<volume>21</volume>
				<fpage>2867</fpage>
				<lpage>2874</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15814559</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B25">
				<title>
					<p>Score tests for association between traits and haplotypes when linkage phase is ambiguous</p>
				</title>
				<aug>
					<au>
						<snm>Schaid</snm>
						<fnm>DJ</fnm>
					</au>
					<au>
						<snm>Rowland</snm>
						<fnm>CM</fnm>
					</au>
					<au>
						<snm>Tines</snm>
						<fnm>DE</fnm>
					</au>
					<au>
						<snm>Jacobson</snm>
						<fnm>RM</fnm>
					</au>
					<au>
						<snm>Poland</snm>
						<fnm>GA</fnm>
					</au>
				</aug>
				<source>Am J Hum Genet</source>
				<pubdate>2002</pubdate>
				<volume>70</volume>
				<fpage>425</fpage>
				<lpage>443</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">384917</pubid>
						<pubid idtype="pmpid" link="fulltext">11791212</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B26">
				<title>
					<p>A powerful method of combining measures of association and Hardy-Weinberg disequilibrium for fine-mapping in case-control studies</p>
				</title>
				<aug>
					<au>
						<snm>Song</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Elston</snm>
						<fnm>RC</fnm>
					</au>
				</aug>
				<source>Stat Med</source>
				<pubdate>2006</pubdate>
				<volume>25</volume>
				<fpage>105</fpage>
				<lpage>126</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">16220513</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B27">
				<aug>
					<au>
						<snm>Schwender</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Ickstadt</snm>
						<fnm>K</fnm>
					</au>
				</aug>
				<pubdate>2006</pubdate>
				<note>Identification of SNP Interactions Using Logic Regression, <url>http://www.sfb475.uni-dortmund.de/berichte/tr31-06.pdf</url>, accessed 31 October 2006</note>
			</bibl>
			<bibl id="B28">
				<title>
					<p>Sequence Analysis Using Logic Regression</p>
				</title>
				<aug>
					<au>
						<snm>Kooperberg</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Ruczinski</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>LeBlanc</snm>
						<fnm>ML</fnm>
					</au>
					<au>
						<snm>Hsu</snm>
						<fnm>L</fnm>
					</au>
				</aug>
				<source>Genet Epidemiol</source>
				<pubdate>2001</pubdate>
				<volume>21</volume>
				<issue>Suppl 1</issue>
				<fpage>S626</fpage>
				<lpage>S631</lpage>
				<xrefbib>
					<pubid idtype="pmpid">11793751</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B29">
				<title>
					<p>Classification and Regression Trees</p>
				</title>
				<aug>
					<au>
						<snm>Breiman</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Friedman</snm>
						<fnm>JH</fnm>
					</au>
					<au>
						<snm>Olshen</snm>
						<fnm>RA</fnm>
					</au>
					<au>
						<snm>Stone</snm>
						<fnm>CJ</fnm>
					</au>
				</aug>
				<publisher>Wadsworth, Belmont</publisher>
				<pubdate>1984</pubdate>
			</bibl>
			<bibl id="B30">
				<title>
					<p>Random Forests</p>
				</title>
				<aug>
					<au>
						<snm>Breiman</snm>
						<fnm>L</fnm>
					</au>
				</aug>
				<source>Machine Learning</source>
				<pubdate>2001</pubdate>
				<volume>45</volume>
				<fpage>5</fpage>
				<lpage>32</lpage>
			</bibl>
			<bibl id="B31">
				<title>
					<p>Gene Selection for Cancer Classification using Support Vector Machines</p>
				</title>
				<aug>
					<au>
						<snm>Guyon</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Weston</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Barnhill</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Vapnik</snm>
						<fnm>VN</fnm>
					</au>
				</aug>
				<source>Machine Learning</source>
				<pubdate>2002</pubdate>
				<volume>46</volume>
				<issue>1-3</issue>
				<fpage>389</fpage>
				<lpage>422</lpage>
			</bibl>
			<bibl id="B32">
				<title>
					<p>On the optimality of the simple Bayesian classifier under zero-one loss</p>
				</title>
				<aug>
					<au>
						<snm>Domingos</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Pazzani</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Machine Learning</source>
				<pubdate>1997</pubdate>
				<volume>29</volume>
				<fpage>103</fpage>
				<lpage>137</lpage>
			</bibl>
			<bibl id="B33">
				<title>
					<p>Classification, Parameter Estimation and State Estimation</p>
				</title>
				<aug>
					<au>
						<snm>Heijden</snm>
						<fnm>F</fnm>
					</au>
					<au>
						<snm>Duin</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Ridder</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Tax</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<publisher>John Wiley</publisher>
				<pubdate>2004</pubdate>
			</bibl>
			<bibl id="B34">
				<title>
					<p>DENFIS: Dynamic Evolving Neural-Fuzzy Inference System and Its Application for Time-Series Prediction</p>
				</title>
				<aug>
					<au>
						<snm>Kasabov</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Song</snm>
						<fnm>Q</fnm>
					</au>
				</aug>
				<source>IEEE Trans Fuzzy Systems</source>
				<pubdate>2002</pubdate>
				<volume>10</volume>
				<issue>2</issue>
				<fpage>144</fpage>
				<lpage>154</lpage>
			</bibl>
			<bibl id="B35">
				<title>
					<p>Comprehensive Annotation of Bidirectional Promoters Identifies Co-Regulation among Breast and Ovarian Cancer Genes</p>
				</title>
				<aug>
					<au>
						<snm>Yang</snm>
						<fnm>MQ</fnm>
					</au>
					<au>
						<snm>Koehly</snm>
						<fnm>LM</fnm>
					</au>
					<au>
						<snm>Elnitski</snm>
						<fnm>LL</fnm>
					</au>
				</aug>
				<source>PLoS Comput Biol</source>
				<pubdate>2007</pubdate>
				<volume>3</volume>
				<issue>4</issue>
				<fpage>e72</fpage>
				<note>doi:10.1371/journal.pcbi.0030072</note>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1853124</pubid>
						<pubid idtype="pmpid" link="fulltext">17447839</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B36">
				<title>
					<p>Periodontal disease and risk of myocardial infarction: the role of gender and smoking</p>
				</title>
				<aug>
					<au>
						<snm>Andriankaja</snm>
						<fnm>OM</fnm>
					</au>
					<au>
						<snm>Genco</snm>
						<fnm>RJ</fnm>
					</au>
					<au>
						<snm>Dorn</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Dmochowski</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Hovey</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Falkner</snm>
						<fnm>KL</fnm>
					</au>
					<au>
						<snm>Trevisan</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Eur J Epidemiol</source>
				<pubdate>2007</pubdate>
				<volume>22</volume>
				<issue>10</issue>
				<fpage>699</fpage>
				<lpage>705</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">17828467</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B37">
				<title>
					<p>Feature mining and pattern classification for steganalysis of LSB matching steganography in grayscale images</p>
				</title>
				<aug>
					<au>
						<snm>Liu</snm>
						<fnm>Q</fnm>
					</au>
					<au>
						<snm>Sung</snm>
						<fnm>AH</fnm>
					</au>
					<au>
						<snm>Chen</snm>
						<fnm>Z</fnm>
					</au>
					<au>
						<snm>Xu</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Pattern Recognition</source>
				<pubdate>2008</pubdate>
				<volume>41</volume>
				<issue>1</issue>
				<fpage>56</fpage>
				<lpage>66</lpage>
				<note>doi:10.1016/j.patcog.2007.06.005</note>
			</bibl>
		</refgrp>
	</bm>
</art>
