Description of Method for Scoring Functions 1–5. In this example, a study consists of 200 cases and 200 controls and a 10-fold cross-validation is performed. Only two SNPs are examined: A (with alleles A and a) and B (with alleles B and b) in this example. The order of samples is scrambled before training. In (a) training samples (180 cases and 180 controls) are assigned to the 9 × 2 genotype-phenotype table (classification). The genotype-phenotype table is the distribution of phenotypes (i.e. case vs. control) for all possible genotype combinations for the SNPs examined. The genotype-phenotype table is used for classification of SNPs. In this example, PIA v. 2.0 designates AABB and AABb as case-genotypes, aaBB as an undetermined-genotype, and the remaining six genotypes as control-genotypes. If the training data is selected to contribute to scoring (if the Jackknife analysis, LOO is selected), a contingency data is generated using the training data (b). The contingency table compares the observed genotype-phenotype distribution to the expected based on the genotype assignments in (a). The testing data is placed into the appropriate cells of the genotype-phenotype table (c). The contingency table for testing data (d) is generated using genotype assignments from the training data (a). Since the AABB genotype represents a case-phenotype (based on training data), the seven case samples are added to the number of true positives (NTP) and the three control samples are added to the number of false positives (NFP) in the contingency table (d). Conversely, AaBB is a control-phenotype, so the five controls are added to the number of true negatives (NTN) and the three cases are added to the number of false negatives (NFN). If a testing sample is assigned to an undetermined-phenotype (aaBB), PIA counts the assignment as half-right and half-wrong. Therefore, the three cases cause NTP and NFP to be increased by 1.5; the two controls increase NTN and NFN by 1.0. After processing all testing samples, the corresponding contingency table is shown in (d). The process is then repeated for the remaining 9 sets of testing and training samples, and all contingency tables arising from the testing samples are summed.
Mechanic et al. BMC Bioinformatics 2008 9:146 doi:10.1186/1471-2105-9-146