Abstract
Background
Since single nucleotide polymorphisms (SNPs) are genetic variations which determine the difference between any two unrelated individuals, SNPs can be used to identify the correct source population of an individual. For efficient population identification with the HapMap genotype data, as few informative SNPs as possible are required from the original 4 million SNPs. Recently, Park et al. (2006) adopted the nearest shrunken centroid method to classify three populations, i.e., Utah residents with ancestry from Northern and Western Europe (CEU), Yoruba in Ibadan, Nigeria in West Africa (YRI), and Han Chinese in Beijing together with Japanese in Tokyo (CHB+JPT), from which 100,736 SNPs were obtained and the top 82 SNPs could completely classify the three populations.
Results
In this paper, we propose to first rank each feature (SNP) with a ranking measure, i.e., a modified t-test or F-statistics. Then, from the ranking list, we form different feature subsets by sequentially choosing different numbers of top-ranked features (e.g., 1, 2, 3, ..., 100), train and test them with a classifier, e.g., the support vector machine (SVM), and thereby find the subset with the highest classification accuracy. Compared to the classification method of Park et al., we obtain a better result, i.e., good classification of the 3 populations using on average 64 SNPs.
Conclusion
Experimental results show that both the modified t-test and the F-statistics method are very effective in ranking SNPs according to their classification capabilities. Combined with the SVM classifier, a desirable feature subset (with the minimum size and the most informativeness) can be quickly found in a greedy manner after ranking all SNPs. Our method is able to identify a very small number of important SNPs that can determine the populations of individuals.
Background
When any one single nucleotide among A, T, C and G in the genome sequence is replaced by any one of the other 3 nucleotides, e.g., the C in AAATC replaced by T to give AAATT, the resulting variation is called a single nucleotide polymorphism (SNP) [1].
In association studies [2-5], informative SNPs were usually selected based on certain correlation measures and could therefore represent other SNPs in close proximity. For example, Bafna et al. [2] and Halldorsson et al. [3] proposed to select a subset of tag SNPs with the minimum size and the highest informativeness value calculated from a self-defined informativeness measure, which evaluates how well a single SNP or a set of SNPs predicts another single SNP or another set of SNPs within the neighborhood. Halperin et al. [4] proposed to select the informative SNPs with the maximum prediction accuracy, which is obtained from a prediction accuracy measure evaluating how well the value of an SNP is predicted by the values of only the two closest tag SNPs. Phuong et al. [5] proposed to select informative SNPs by removing redundant features, where redundancy was measured by the feature similarity between two features, i.e., the linkage disequilibrium (LD) measure γ^{2} [5].
In population studies, the selection of informative SNPs should be based on their population classification capability. Related research, such as selecting genetic markers with the highest informativeness for inference of individual ancestry [8], selecting informative marker panels for population assignment [6] and detecting ethnically variant SNPs [7], has already been explored. Rosenberg et al. [8] proposed to use the informativeness for assignment (I_{n}) to measure the ability of each genetic locus or marker (feature) to infer individuals' ancestry, which was proved to be similar to the F-statistics measure [8]. In [6], Rosenberg et al. proposed the univariate, greedy, and maximum algorithms to select marker panels. The three algorithms were realized through a given performance function, e.g., the optimal rate of correct assignment (ORCA) [8], which measures the probability of correctly assigning an individual to the population from which the individual's genotype has originated with the greatest probability. The application of the algorithms to eight species was effective. Very recently, Park et al. developed a systematic approach based on the nearest shrunken centroid method (NSCM) [9] to identify ethnically variant SNPs. Following [9], they calculated a shrunken value for each SNP of each class, and compared each SNP's shrunken values across classes to determine the SNP's classification capability. The smaller the difference among an SNP's shrunken values for different classes, the less important the SNP for classifying the three ethnic groups (classes) [10], i.e., CEU, YRI and JPT+CHB. In total, 100,736 SNPs were obtained and the top 82 SNPs were able to completely classify the three populations.
In this paper, we propose to first rank SNPs according to a feature importance ranking measure, i.e., a modified t-test or F-statistics, where the higher the ranking value, the stronger the corresponding classification power. Then, from the ranking list, we sequentially choose different numbers of top-ranked SNPs, e.g., 1, 2, 3, ..., 20 and so on, test them with a classifier, e.g., the support vector machine (SVM) [11,12], and determine the SNP subset which has the highest classification accuracy. This process is repeated 30 times. Finally, we locate those important SNPs which always have top ranking values according to the SNP subsets obtained from the 30 simulations.
Results and discussion
The International HapMap Project provides many kinds of data for researchers [10], such as the HapMap genotype data and the phased haplotype data. The phased haplotype data describes SNP alleles on a chromosome inherited from either the father or the mother, while the genotype data describes SNP alleles on both chromosomes inherited from the parents [13]. We give an example (see Fig. 1(a)) to describe the relationship between the haplotype and the genotype. Besides, the genotype data has missing values for some loci (SNPs), while the phased haplotype data (also known as the HapMap Phase II haplotypes data) has its missing values filled by the well-known genotype phasing tool PHASE [14,15]. Therefore, we download the phased haplotype data from the directory of (Index of/downloads/phasing/2006–07 phaseII/phased). The legend data and sample data in the directory are also necessary to describe locus places (feature IDs), locus names (feature names), and sample names (individual IDs). The HapMap data includes four populations: CEU, YRI, JPT and CHB, where CEU represents Utah residents with ancestry from northern and western Europe; YRI represents Yoruba individuals from Ibadan, Nigeria; JPT represents Japanese individuals from Tokyo; and CHB represents Han Chinese individuals from Beijing. CEU and YRI each have 90 related samples, i.e., 30 father-mother-offspring trios. After removing the offspring, 60 unrelated samples are obtained for CEU and likewise for YRI. The JPT and CHB populations each have 45 unrelated samples. Therefore, we obtain 210 unrelated samples for the experiment. Since the HapMap Project provides the 4 separate populations as well as 3 populations, we will do the classification on the 3-population and 4-population problems, respectively. The 3-population problem is the same as in [7].
Figure 1. Example of haplotypes and genotypes. (a) The haplotype and genotype formats of one individual; (b) different nominal values (genotype format) of one SNP for different individuals; (c) numerical values of the SNP in (b) for different individuals, where the first transformation is for the F-statistics algorithm and the second transformation, in vector format, is for the modified t-test algorithm.
Combining all the features from the 23 chromosomes, i.e., Chromosomes 1, 2, ..., 22 and X (the phased data of Chromosome Y is not available), we have nearly 4 million SNPs involved in the experiment. For most features (loci), the SNP types (feature values) are biallelic, i.e., consisting of two single alleles from the 4 nucleotides ATCG. For example, if one SNP consists of the two alleles A and G (see Fig. 1(b)), all the possible feature values for this locus will be AA, AG and GG, in which AA and GG are called homozygous, and AG is called heterozygous. Since the phased haplotype data has two rows of haplotypes describing one individual, we transform the haplotype data into the genotype format (see Fig. 1(a)) for computational convenience. When transforming data from the haplotype format into the genotype format, we adopt two kinds of transformations considering the different requirements of the two algorithms (see Fig. 1(c)). For the modified t-test ranking measure, if we simply transformed nominal values to ordinary numeric values and did the calculation according to Equation (4), the meaning of the different SNP types could be lost. We propose to use vectors to represent different SNP types and rank them by the modified t-test ranking measure (Equation (5)). For example (see Fig. 1(c)), according to the description of the modified t-test ranking measure, "11" (i.e., AA in Fig. 1(c)) is represented by {0, 0, 1}, "00" (GG) is represented by {0, 1, 0}, and "10" (AG) is represented by {1, 0, 0}. As to the F-statistics ranking measure, since it involves the calculation of the two single alleles of each biallelic SNP, we use 1 and 0 to represent the two different alleles, respectively. For example, given the same SNP reference type A/G as for the modified t-test ranking measure, we use 1 to represent A and 0 to represent G.
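As a concrete illustration of the two transformations just described, the sketch below encodes a SNP with reference type A/G. The function names and the particular assignment of genotypes to vector positions follow Fig. 1(c) but are otherwise illustrative assumptions, not the authors' code.

```python
# Two encodings of a biallelic SNP with reference type A/G, as described above.
# Function names and vector ordering are assumptions for illustration.

def encode_genotype_as_vector(genotype):
    """Unit-vector encoding of a nominal genotype, for the modified t-test."""
    mapping = {"AA": (0, 0, 1), "GG": (0, 1, 0), "AG": (1, 0, 0)}
    return mapping[genotype]

def encode_allele_as_binary(allele):
    """0/1 encoding of a single allele, for the F-statistics measure."""
    return 1 if allele == "A" else 0

print(encode_genotype_as_vector("AG"))  # (1, 0, 0)
print(encode_allele_as_binary("G"))     # 0
```

A SNP with a different reference type (say C/T) would use its own mapping; only the distinctness of the vectors matters for the ranking calculation.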
Then, in each population we can calculate each allele's frequency and variation, as well as those values over all the populations. Each SNP's F-statistics value is calculated from Equation (7). At the same time, we notice some special conditions. For example, if one locus with reference SNP type A/G only has the value AA for all individuals, the frequency of the SNP allele A will be 100% and the frequency of allele G will be zero. Referring to Equation (7), either p̄ or q̄ in the denominator will be equal to zero. In fact, such a feature has no classification capability for any population. Therefore, we set the F_{st} value of that feature to zero. In summary, the greater the numerator and the smaller the denominator in Equation (7), the greater the value F_{st} and the more important the corresponding feature for classification.
We have 4 simulations to conduct, i.e., 4 different combinations of the two ranking measures (F-statistics and modified t-test) and the two classification problems (on 3 populations and 4 populations, respectively). From the 210 samples, we randomly choose 40 samples each from YRI and CEU, and 30 samples each from JPT and CHB, as the training set. The 70 remaining samples are used as the testing set. Each simulation is repeated 30 times.
We first rank the SNPs of the 23 chromosomes separately. Then we choose each chromosome's top 100 SNPs, combine these 2300 features, and rank them again. In this way, the number of features involved in the experiment is greatly reduced without loss of important information; on the contrary, the efficiency of the experiment is improved.
In each of the 30 simulations, we select the top 100 SNPs from the ranking list and form 100 different SNP subsets. The first subset consists of only the first top SNP. The second subset is formed by adding the second top SNP to the previous subset, the third by adding the third top SNP, and so on. Subsequently, we evaluate all subsets through the SVM classifier in terms of classification accuracy. Due to space limitations, we provide classification results of only 11 feature subsets, i.e., the subsets consisting of 1, 10, 20, 30, 40, 50, 60, 70, 80, 90 and 100 SNPs, respectively. Table 1 is for the F-statistics ranking measure and Table 2 is for the modified t-test ranking measure. Intuitively, the classification on 3 populations produces higher accuracies than on 4 populations for both ranking measures. This can be explained by the fact that the JPT and CHB populations have very similar DNA sequences, and it is hence hard to discriminate between these two populations. When comparing the classification results of the F-statistics ranking measure with those of the modified t-test ranking measure on 3 populations, we can see that the mean accuracies produced by the latter are higher than those of the former for most SNP subsets. The advantage of the modified t-test measure over the F-statistics measure is more obvious for 4 populations than for 3 populations. In addition, we provide the minimal and maximal accuracies for each of those 11 SNP subsets in the 30 simulations for the two ranking measures in Tables 1 and 2, respectively. According to the results in Tables 1 and 2, complete classification on 3 populations is possible for the modified t-test ranking measure with appropriate SNP subsets, whereas there is always a small error (i.e., 1/70) for the F-statistics measure with those 100 SNP subsets.
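The nested-subset search just described can be sketched as follows. `best_topk_subset` and `make_clf` are assumed names, and any classifier object with `fit`/`score` methods (e.g. scikit-learn's `SVC`, playing the role of the SVM here) can be plugged in; this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def best_topk_subset(X_tr, y_tr, X_te, y_te, ranking, make_clf, max_k=100):
    """Evaluate nested subsets of the top-1, top-2, ..., top-max_k ranked
    features and return (best_k, best_accuracy). make_clf() must return a
    fresh classifier with fit(X, y) and score(X, y) methods."""
    order = np.argsort(ranking)[::-1]          # feature indices, best ranking first
    best_k, best_acc = 0, 0.0
    for k in range(1, min(max_k, len(ranking)) + 1):
        cols = order[:k]                       # the k top-ranked features
        clf = make_clf()
        clf.fit(X_tr[:, cols], y_tr)
        acc = clf.score(X_te[:, cols], y_te)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc

# With scikit-learn (assumed available), the SVM variant would be e.g.:
#   from sklearn.svm import SVC
#   k, acc = best_topk_subset(X_tr, y_tr, X_te, y_te, scores, lambda: SVC())
```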
Table 1. Classification accuracy results obtained by the F-statistics measure for different feature subsets with different numbers of top ranked features (SNPs) in 30 simulations, on 3 and 4 populations, respectively
Table 2. Classification accuracy results obtained by the modified t-test measure for different feature subsets with different numbers of top ranked features (SNPs) in 30 simulations, on 3 and 4 populations, respectively
Next, we find the subset which leads to the maximal classification accuracy among the 100 SNP subsets (see Table 3). We list the maximal classification accuracy in each of the 30 simulations, the number of SNPs that the relevant SNP subset includes, and the mean values (± standard deviations) over the 30 simulations. Although the average number of SNPs that leads to the best classification is similar for both ranking measures (see the 5th column of Table 3), the mean classification accuracies produced by the two ranking measures are different (see the 2nd column of Table 3). The modified t-test ranking measure produces a 97.09% mean accuracy, which is 1.04% higher than the 96.05% accuracy produced by the F-statistics measure, for 3 populations. The mean accuracy produced by the modified t-test on 4 populations, i.e., 83.86%, is much higher than that produced by the F-statistics measure, i.e., 77.34%.
Table 3. The maximum classification accuracy in each of the 30 simulations together with the mean accuracy (standard deviation), and the relevant numbers of features leading to the maximal accuracy together with the mean number (standard deviation), for 3 populations and 4 populations, respectively
After determining the subset leading to the maximal classification accuracy in each of the 30 simulations, we further determine which SNPs these are and which chromosomes they are located on. From the results in Table 3, we know that on average 64 SNPs are obtained for the desirable feature subset. Because of space limitations, we will not list all those SNPs. In Table 4, we list the 22 SNPs whose appearance frequencies are greater than 83.33% (i.e., appearing more than 25 times in the 30 simulations), together with their mean ranking values and locations, using the F-statistics ranking measure on 3 populations. Similarly, we present the results obtained by the modified t-test ranking measure in Table 5, in which the 24 top ranked SNPs whose appearance frequencies are greater than 83.33% in the 30 simulations are presented. For both ranking methods, most of the SNPs come from chromosome 11 (chr11), except rs35397 from chromosome 5 (chr5), rs2296224 from chromosome 1 (chr1) and rs199138 from chromosome 15 (chr15). Among the 22 SNPs in Table 4, rs1604797 and rs7946015 appear 30 times in the 30 simulations. Among the 24 SNPs in Table 5, rs1604797, rs7946015 and rs10832001 appear 30 times in the 30 simulations. Interestingly, the mean ranking values of these SNPs with the highest appearance frequencies are not the highest.
Table 4. Top ranked features whose appearance frequencies are greater than 83.33% (25/30) in 30 simulations, and their mean ranking values by the F-statistics ranking measure for 3 populations
Table 5. Top ranked features whose appearance frequencies are greater than 83.33% (25/30) in 30 simulations, and their mean ranking values by the modified t-test ranking measure for 3 populations
All experiments are executed using Matlab 7.1 on a personal computer with the Windows XP operating system, a Pentium 4 CPU (3.4 GHz) and 1 GB RAM. We also record the running time of the two ranking measures together with the training and testing time. The mean time using the F-statistics to rank all SNPs of the 3 populations is 5342.9 seconds, versus 5728.7 seconds on average for the modified t-test ranking measure. This may be because calculating the median value S_{0} makes the modified t-test ranking measure take more time than the F-statistics measure. Both algorithms take more time on 4 populations than on 3 populations. The total training and testing time is 6915.1 seconds. In terms of classification accuracy, the modified t-test ranking measure is superior to the F-statistics measure. Besides, the modified t-test ranking measure is able to deal with vector features and thus provides a way of ranking nominal features.
Since a feature's ranking only indicates the relevance of that feature, features with the same or close ranking values may be highly correlated with each other, i.e., redundant. Therefore, it is possible to further reduce the number of SNPs in our future work.
Conclusion
In this paper, we propose to use two feature importance ranking measures, i.e., the modified t-test and F-statistics, to rank a large number of SNPs, and then use a greedy manner together with a classifier to determine a desirable feature subset, which has the minimum size but leads to the highest classification accuracy. The final results show that both ranking methods are effective in determining the importance of the SNPs. Although the two ranking measures find nearly the same number of SNPs, the modified t-test ranking measure tends to be better than the F-statistics measure in terms of classification accuracy. Compared to the classification method of Park et al. [7], we obtain a better result, i.e., good classification of the 3 populations using fewer SNPs, i.e., 64 on average.
Methods
In classification on large data sets, feature selection is necessary and shows many advantages, such as saving computational time, reducing the computational burden and improving efficiency. Feature ranking, a usual step in many feature selection methods [16,17], is adopted in our experiment to determine the features' classification power. In this paper, we present two feature importance ranking measures, a modified t-test from [9,18,19] and F-statistics [20], and compare their ranking abilities so as to evaluate the modified t-test ranking measure.
Modified t-test
The original t-test, i.e., the Student t-test [18], can be used to evaluate whether the means of two classes are statistically different from each other by calculating a ratio between the difference of the two class means and the variability of the two classes. It has been adopted to rank features (genes) for microarray data [21,22] and for mass spectrometry data [23,24]. We notice that the original t-test applies only to 2-class problems. For multiclass problems, Tibshirani et al. [9] developed the nearest shrunken centroid method, calculating a t-statistic value (Equation (1)) for each gene of each class. This t-statistic value measures the difference between the mean of one class and the mean of all the classes, standardized by the within-class standard deviation.
Here t_{ic} indicates the t-statistic value for the ith feature of the cth class, x̄_{ic} indicates the ith feature's mean value in the cth class, and x̄_{i} indicates the ith feature's mean value over all classes. x_{ij} represents the ith feature of the jth sample. N is the total number of samples over all the C classes and n_{c} is the number of samples in the cth class. S_{i} is the within-class standard deviation and S_{0} is set to be the median value of the S_{i} over all the features. This t-statistic value of Tibshirani et al. [9] measures the deviation between each class and the mean of all classes and was used to construct a classifier; the authors did not use the per-class t-statistic to rank features over all the classes. In [19], Wang et al. extended the t-statistic algorithm to rank features for all the classes. That is, the t-score (t-statistic value) of feature i is calculated as the greatest t-score over all classes:
Due to a characteristic of the SNP data [10], i.e., the nominal values of each feature (e.g., AA, AT and TT), Equation (4) cannot be used directly for our problem. We propose a modified t-test ranking method, in which different nominal values are represented by different vectors to make the calculation possible. In the following, we generalize the t-score of each feature in 3 steps:
1. Suppose the feature set is F = (f_{1}, ..., f_{i}, ..., f_{g}), and feature i has m_{i} different nominal values.
2. Transform each nominal feature value into a vector of dimension m_{i}, i.e., represent the kth nominal value by the unit vector whose kth component is 1 and whose other components are 0 (cf. Fig. 1(c)).
3. Replace all the numerical features in Equations (1) and (2) with those vectors (see Equations (5) and (6)).
The ranking rule is: the greater the t-score, the more relevant the feature.
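Since Equations (1)-(6) are not reproduced in the text, the sketch below gives one plausible reading of the three steps for a single SNP, under two stated assumptions: the scalar class-mean difference of Equation (1) is replaced by the Euclidean norm of the vector difference, and m_c = sqrt(1/n_c - 1/N) is the usual standard-error factor for comparing a class mean with the overall mean.

```python
import numpy as np

def t_score_vector(X, y, s0):
    """Modified t-score of ONE SNP whose genotypes are unit-vector encoded.
    X: (N, m_i) array of encoded samples; y: (N,) class labels; s0: the
    median within-class standard deviation over all SNPs (passed in).
    Assumptions (Equations (1)-(6) not reproduced in the text): the scalar
    difference becomes the Euclidean norm of the vector difference, and
    m_c = sqrt(1/n_c - 1/N)."""
    classes = np.unique(y)
    N, C = len(y), len(classes)
    xbar = X.mean(axis=0)                       # overall mean vector
    # pooled within-class standard deviation (analogue of Equation (2))
    ss = sum(((X[y == c] - X[y == c].mean(axis=0)) ** 2).sum() for c in classes)
    s_i = np.sqrt(ss / (N - C))
    scores = []
    for c in classes:
        n_c = int((y == c).sum())
        m_c = np.sqrt(1.0 / n_c - 1.0 / N)
        diff = np.linalg.norm(X[y == c].mean(axis=0) - xbar)
        scores.append(diff / (m_c * (s_i + s0)))
    return max(scores)                          # Equation (4): max over classes
```

A SNP whose encoded class means coincide with the overall mean scores 0; a SNP whose genotypes separate the classes perfectly scores high, with s0 preventing division by zero.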
F-statistics
In our experiment, we use another ranking measure, F-statistics, for comparison with the modified t-test. The version of F-statistics used in our experiment is based on the definition in [25], which was originally developed by [20] and is used in population genetics to describe the level of heterozygosity in a population.
Given SNP genotype data with C subpopulations and each feature expressed as a biallelic SNP (i.e., consisting of any two different nucleotides from the four nucleotides ATCG), the F-statistics (F_{st}) value is calculated as:
where p and q correspond to the two alleles' frequencies, respectively, in one population; p̄ and q̄ refer to the two alleles' mean frequencies over all the population classes; and Var_{p} represents the variance (see Equation (8)) of one allele's frequency.
Here, p_{c} designates the frequency of one allele in the cth population, and the mean frequency p̄ is easily obtained from:
The ranking rule is the same as for the modified t-test, i.e., the larger the F_{st} value, the more significant the SNP for population classification.
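The F_{st} computation, including the zero-denominator rule discussed above for alleles fixed in all populations, can be sketched as follows. It assumes F_st = Var_p / (p_bar * q_bar) with Var_p the population variance of the allele frequency over the C classes; since Equations (7)-(9) are not reproduced in the text, this normalization is a stated assumption.

```python
def f_st(freqs):
    """F_st of one biallelic SNP from the frequency p_c of one allele in
    each of the C populations. Sketch assuming F_st = Var_p / (p_bar*q_bar),
    with Var_p the population variance of p_c over the C classes; F_st is
    set to 0 when an allele is fixed in all populations, as discussed above."""
    C = len(freqs)
    p_bar = sum(freqs) / C            # mean frequency of the first allele
    q_bar = 1.0 - p_bar               # mean frequency of the second allele
    if p_bar == 0.0 or q_bar == 0.0:
        return 0.0                    # fixed allele: no classification capability
    var_p = sum((p - p_bar) ** 2 for p in freqs) / C
    return var_p / (p_bar * q_bar)

# Example: frequencies of allele A in 3 populations
print(round(f_st([0.9, 0.5, 0.1]), 4))  # 0.4267
print(f_st([1.0, 1.0, 1.0]))            # 0.0
```

The first example shows the intended behavior: frequencies spread widely across populations give a large F_{st}, while identical frequencies give 0.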
The Classifier
Although many classifiers, such as classical neural networks and the naive Bayes classifier, could be applied to our classification, we choose the support vector machine (SVM) [11,12] in our experiment because of some of its attractive features, such as effectively avoiding overfitting, accommodating large feature spaces and fast speed. It is used not only in the final classification, but also in the feature selection to test different feature subsets and determine the one with the highest classification accuracy. During the classification process, we determine the kernel parameter γ and the penalty parameter ν through the double cross-validation method.
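The double cross-validation used to pick γ and ν can be sketched generically as a nested loop: the inner folds select the parameter pair with the best mean accuracy, and the outer folds estimate the accuracy of that choice. `fit_score` (e.g. wrapping a ν-SVM train-and-test run) and the fold counts are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def double_cv(X, y, fit_score, grid, n_outer=5, n_inner=4, seed=0):
    """Nested (double) cross-validation sketch. The inner loop picks the
    parameter setting (e.g. a (gamma, nu) pair) with the best mean inner-fold
    accuracy; the outer loop estimates the accuracy of that choice.
    fit_score(params, X_tr, y_tr, X_te, y_te) -> accuracy is an assumed
    callback; grid is a list of parameter dicts. Returns (mean outer
    accuracy, parameters picked in the last outer fold)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_outer)
    accs, best = [], None
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        inner = np.array_split(train_idx, n_inner)

        def inner_score(p):
            s = []
            for k, val in enumerate(inner):
                tr = np.concatenate([g for l, g in enumerate(inner) if l != k])
                s.append(fit_score(p, X[tr], y[tr], X[val], y[val]))
            return float(np.mean(s))

        best = max(grid, key=inner_score)          # inner CV picks parameters
        accs.append(fit_score(best, X[train_idx], y[train_idx],
                              X[test_idx], y[test_idx]))
    return float(np.mean(accs)), best
```

Because the parameters are chosen on inner folds only, the outer accuracy is an unbiased estimate of the tuned classifier's performance, which is the point of doubling the cross-validation.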
Authors' contributions
LW proposed to use statistical ranking methods to select informative SNPs from the HapMap genotype data for population classification. NZ conducted the algorithm implementations and drafted an early version of the manuscript. LW revised the draft.
Acknowledgements
We thank Dr Maria Kowalczuk (the Assistant Editor), the BioMed Central Editorial Production Team and the anonymous reviewer who provided comments and suggestions that helped to significantly improve the paper. The experimental data was provided by the international HapMap Project website [10].
References

1. Duerinck KF: [http://www.duerinck.com/snp.html]
2. Bafna V, Halldorsson BV, Schwartz R, Clark AG, Istrail S: Haplotypes and informative SNP selection: Don't block out information.
3. Halldorsson BV, Bafna V, Lippert R, Schwartz R, de la Vega FM, Clark AG, Istrail S: Optimal haplotype block-free selection of tagging SNPs for genome-wide association studies. Genome Research 2004, 14:1633-1640.
4. Halperin E, Kimmel G, Shamir R: Tag SNP selection in genotype data for maximizing SNP prediction accuracy. Bioinformatics 2005, 21(Suppl 1):i195-i203.
5. Phuong TM, Lin Z, Altman RB: Choosing SNPs using feature selection. Proc IEEE Comput Syst Bioinform Conf 2005, 301-309.
6. Rosenberg NA: Algorithms for selecting informative marker panels for population assignment. Journal of Computational Biology 2005, 12:1183-1201.
7. Park J, Hwang S, Lee YS, Kim SC, Lee D: SNP@Ethnos: a database of ethnically variant single-nucleotide polymorphisms. Nucleic Acids Res 2007, 35(Database issue):D711-D715.
8. Rosenberg NA, Li LM, Ward R, Pritchard JK: Informativeness of genetic markers for inference of ancestry. Am J Hum Genet 2003, 73:1402-1422.
9. Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA 2002, 99:6567-6572.
10. The International HapMap Consortium: The International HapMap Project. Nature 2003, 426:789-796. [http://www.hapmap.org/genotypes]
11. Vapnik V: Statistical Learning Theory. New York: Wiley; 1998.
12. Wang LP: Support Vector Machines: Theory and Applications. Berlin: Springer; 2005.
13. Mandoiu I: Algorithms for SNP data collection and analysis. [http://www.engr.uconn.edu/~ion/FILES/ppt/uri_11_10_06.ppt]
14. Stephens M, Smith NJ, Donnelly P: A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics 2001, 68:978-989.
15. Gusev A, Mandoiu I, Pasaniuc B: Highly scalable genotype phasing by entropy minimization. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 2007.
16. Guyon I, Elisseeff A: An introduction to variable and feature selection. Journal of Machine Learning Research 2003, 3:1157-1182.
17. Wang LP, Fu XJ: Data Mining with Computational Intelligence. Berlin: Springer-Verlag; 2005. ISBN 3540245227
18. Devore J, Peck R: Statistics: The Exploration and Analysis of Data. 3rd edition. CA: Duxbury Press; 1997.
19. Wang L, Chu F, Xie W: Accurate cancer classification using expressions of very few genes. IEEE/ACM Trans Comput Biol Bioinform 2007, 4(1):40-53.
20. Wright S: The interpretation of population structure by F-statistics with special regard to systems of mating. Evolution 1965, 19:395-420.
21. Jaeger J, Sengupta R, Ruzzo WL: Improved gene selection for classification of microarrays. Pac Symp Biocomput 2003, 53-64.
22. Su Y, Murali TM, Pavlovic V, Schaffer M, Kasif S: RankGene: identification of diagnostic genes based on expression data. Bioinformatics 2003, 19:1578-1579.
23. Wu B, Abbott T, Fishman D, McMurray W, Mor G, Stone K, Ward D, Williams K, Zhao H: Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics 2003, 19:1636-1643.
24. Levner I: Feature selection and nearest centroid classification for protein mass spectrometry. BMC Bioinformatics 2005, 6:68.
25. Trochim WM: The Research Methods Knowledge Base. 2nd edition. Atomic Dog Publishing; 2004. [http://www.socialresearchmethods.net/kb/]