<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
	<ui>1471-2105-7-S4-S8</ui>
	<ji>1471-2105</ji>
	<fm>
		<dochead>Research</dochead>
		<bibl>
			<title>
				<p>The impact of sample imbalance on identifying differentially expressed genes</p>
			</title>
			<aug>
				<au id="A1">
					<snm>Yang</snm>
					<fnm>Kun</fnm>
					<insr iid="I1"/>
					<email>kunyang@hit.edu.cn</email>
				</au>
				<au id="A2" ca="yes">
					<snm>Li</snm>
					<fnm>Jianzhong</fnm>
					<insr iid="I1"/>
					<email>lijzh@hit.edu.cn</email>
				</au>
				<au id="A3">
					<snm>Gao</snm>
					<fnm>Hong</fnm>
					<insr iid="I1"/>
					<email>honggao@hit.edu.cn</email>
				</au>
			</aug>
			<insg>
				<ins id="I1">
					<p>Department of Computer Science and Engineering, Harbin Institute of Technology, Harbin, 150001, China</p>
				</ins>
			</insg>
			<source>BMC Bioinformatics</source>
			<supplement>
				<title>
					<p>Symposium of Computations in Bioinformatics and Bioscience (SCBB06)</p>
				</title>
				<editor>Youping Deng, Jun Ni</editor>
				<note>Research</note>
				<url>http://www.biomedcentral.com/content/pdf/1471-2105-7-S4-info.pdf</url>
			</supplement>
			<conference>
				<title>
					<p>Symposium of Computations in Bioinformatics and Bioscience (SCBB06) in conjunction with the International Multi-Symposiums on Computer and Computational Sciences 2006 (IMSCCS|06)</p>
				</title>
				<location>Hangzhou, China</location>
				<date-range>June 20&#8211;24, 2006</date-range>
				<url>http://mfgn.usm.edu/ebl/SCBB06</url>
			</conference>
			<issn>1471-2105</issn>
			<pubdate>2006</pubdate>
			<volume>7</volume>
			<issue>Suppl 4</issue>
			<fpage>S8</fpage>
			<xrefbib>
				<pubidlist><pubid idtype="pmpid">17217526</pubid><pubid idtype="doi">10.1186/1471-2105-7-S4-S8</pubid>
				</pubidlist></xrefbib>
		</bibl>
		<history>
			<pub>
				<date>
					<day>12</day>
					<month>12</month>
					<year>2006</year>
				</date>
			</pub>
		</history>
		<cpyrt>
			<year>2006</year>
			<collab>Yang et al; licensee BioMed Central Ltd</collab>
			<note>This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
		</cpyrt>
		<abs>
			<sec>
				<st>
					<p>Abstract</p>
				</st>
				<sec>
					<st>
						<p>Background</p>
					</st>
					<p>Several statistical methods have recently been proposed to identify genes that are differentially expressed between two conditions. However, very few studies consider the problem of sample imbalance, and none has investigated its impact on the identification of differentially expressed genes. In addition, it is not clear which method is most suitable for unbalanced data.</p>
				</sec>
				<sec>
					<st>
						<p>Results</p>
					</st>
					<p>Based on random sampling, two evaluation models are proposed to investigate the impact of sample imbalance on identifying differentially expressed genes. Using these models, the performance of six widely used methods is compared on unbalanced data. The experimental results indicate that sample imbalance has a great influence on the selection of differentially expressed genes. Furthermore, the methods perform very differently on unbalanced data. Among the six methods, the Welch t-test appears to perform best when the group with the larger variance also has the larger number of samples, while the Regularized t-test and SAM outperform the others on unbalanced data in the remaining cases.</p>
				</sec>
				<sec>
					<st>
						<p>Conclusion</p>
					</st>
					<p>The two proposed evaluation models are effective, and sample imbalance should be taken into account in microarray experiment design and gene expression data analysis. The results and the two evaluation models can help in selecting a suitable method for processing unbalanced data.</p>
				</sec>
			</sec>
		</abs>
	</fm>
	<bdy>
		<sec>
			<st>
				<p>Background</p>
			</st>
			<p>Microarrays make it possible to monitor the expression of thousands of genes simultaneously and generate enormous amounts of data. With such techniques, biology can be explored at the molecular level and fundamental biological processes can be studied, ranging from gene function to development and cancer <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr></abbrgrp>. In a typical microarray experiment, the expression levels of several thousand candidate genes are monitored under two contrasting conditions, such as treatment versus control, where each condition is represented by several samples. Most monitored genes are unrelated to the conditions, and their expression levels either do not change or change only by chance, while the remaining genes are strongly related to the conditions and truly change their expression levels between them. These differentially expressed genes are very useful in subsequent research and clinical applications <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr></abbrgrp>. Therefore, one of the important tasks in microarray data analysis is to compare the expression levels of genes in samples drawn from two different conditions and to select the genes that are differentially expressed between them. Specifically, given a microarray dataset, we are interested in identifying which of the several thousand candidate genes have had their expression levels changed by the condition.</p>
			<p>One simple approach used in the literature to detect differentially expressed genes is the "fold change" method, in which a gene is declared differentially expressed if its average expression level differs by more than a given constant between the two conditions. However, the fold change method has been shown to be unreliable and inefficient because it does not account for statistical variability <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. Many more sophisticated statistical approaches have since been proposed <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>. These approaches can be roughly classified into two categories. The first category consists of parametric methods based on a statistical model, and includes various versions of the two-sample t-test <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr></abbrgrp>. Because gene expression data are often noisy and not normally distributed <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>, the strong assumptions of parametric methods can be violated in practice. The second category consists of nonparametric statistical methods, including the Wilcoxon rank-sum test <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>, the Significance Analysis of Microarrays (SAM) method <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>, the Empirical Bayes (EB) method <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>, the mixture model method <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> and other modified nonparametric methods <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp>. For recent reviews, see <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>.</p>
			<p>However, very few studies consider the problem of sample imbalance in detecting differentially expressed genes, and no study or quantitative method has investigated the effect of sample imbalance on the selection of differentially expressed genes. Sample imbalance means that the number of samples in one group is very different from that in the other group. In fact, sample imbalance is common in gene expression data, especially in data on tumor samples. For example, the datasets in <abbrgrp><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr></abbrgrp> are all unbalanced. Many factors cause sample imbalance, such as the limited availability of tumor samples, budgetary constraints, and the deliberate reduction of the number of samples in the control group. Combined with the small sample sizes typical of gene expression data, the problem of sample imbalance may become even more serious. Consequently, biologists may naturally ask two important questions: How does sample imbalance affect the methods for identifying differentially expressed genes? Which method is most suitable for unbalanced data? In addition, previous studies <abbrgrp><abbr bid="B24">24</abbr><abbr bid="B25">25</abbr></abbrgrp> have found that the variability of gene expression may be related to the average expression, which suggests that the two-sample t-test should be based on unequal variances. An immediate and reasonable question is whether this suggestion still holds for unbalanced data.</p>
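			<p>The distinction between the equal-variance and unequal-variance forms of the two-sample t-test can be made concrete with a short sketch. The Python code below is only an illustration using the textbook formulas; the function names and the example values are hypothetical and are not taken from the experiments reported in this paper.</p>

```python
import math
import statistics


def pooled_t(a, b):
    """Two-sample t statistic assuming equal variances (pooled variance)."""
    n1, n2 = len(a), len(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    diff = statistics.mean(a) - statistics.mean(b)
    return diff / math.sqrt(pooled * (1 / n1 + 1 / n2))


def welch_t(a, b):
    """Two-sample t statistic allowing unequal variances (Welch)."""
    n1, n2 = len(a), len(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)
    diff = statistics.mean(a) - statistics.mean(b)
    return diff / math.sqrt(v1 / n1 + v2 / n2)


# Hypothetical values for one gene: a large, tight group and a small, noisy group.
group1 = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1]
group2 = [2.0, 0.5, 3.1]

t_pooled = pooled_t(group1, group2)
t_welch = welch_t(group1, group2)

# With equal group sizes the two statistics coincide.
t_balanced_pooled = pooled_t(group1[:3], group2)
t_balanced_welch = welch_t(group1[:3], group2)
```

			<p>In this hypothetical unbalanced example, the smaller group has the larger variance and the pooled statistic is larger in magnitude than the Welch statistic, whereas with equal group sizes the two statistics are identical.</p>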
			<p>In this paper, we investigate the impact of sample imbalance on identifying differentially expressed genes, a problem that has not been studied before. Two evaluation models based on random sampling are proposed, and six widely used methods are compared on both real and simulated data. Under each evaluation model, random sampling is used to estimate the expected performance of each method on unbalanced data with a specific sample ratio between the two groups. The variation in performance across sample ratios is then used to illustrate the effect of sample imbalance on the selection of differentially expressed genes and on the choice of method.</p>
		</sec>
		<sec>
			<st>
				<p>Results</p>
			</st>
			<p>In this section, six methods are systematically compared on real and simulated data according to the two evaluation models: the two-sample t-test with equal variances (equivalent to the F-test) <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>, the two-sample t-test with unequal variances (i.e. the Welch t-test) <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B7">7</abbr></abbrgrp>, the Wilcoxon rank-sum test <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>, SAM <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>, the Regularized t-test <abbrgrp><abbr bid="B8">8</abbr></abbrgrp> and the permutation-based method of Pan <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. All experiments are conducted in Matlab on a Pentium PC with a 3.20 GHz CPU and 512 MB of RAM. The procedure is as follows. For every fixed pair of parameters <it>n</it><sub>1 </sub>and <it>n</it><sub>2 </sub>(the numbers of samples in class one <it>C</it><sub>1 </sub>and class two <it>C</it><sub>2</sub>) in each experiment under the two evaluation models, we first randomly create a set of <it>x </it>independent artificial or simulated datasets and apply all six methods to each of them. For a given method, each of the <it>x </it>random datasets yields exactly one value of each measure, for example the Overlap Rate, Precision Rate or Recall Rate. These <it>x </it>values are then treated as a random sample of size <it>x </it>for the fixed parameters <it>n</it><sub>1 </sub>and <it>n</it><sub>2</sub>. Finally, the expected performance of each method and its 0.95 confidence interval are calculated from this random sample.</p>
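			<p>The last step of this procedure can be sketched as follows. The text does not state how the 0.95 confidence interval is computed, so the Python code below assumes a normal approximation (z = 1.96) around the sample mean; the function name and the synthetic values standing in for the measured rates are hypothetical.</p>

```python
import math
import random
import statistics


def expected_performance(values, z=1.96):
    """Return the sample mean and a normal-approximation 0.95 confidence interval."""
    x = len(values)
    mean = statistics.mean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(x)
    return mean, (mean - half_width, mean + half_width)


# Hypothetical stand-in for the x values of one measure (e.g. the Overlap Rate)
# obtained by one method on x = 100 random datasets with fixed n1 and n2.
rng = random.Random(0)
rates = [min(1.0, max(0.0, rng.gauss(0.6, 0.05))) for _ in range(100)]

mean_rate, (low, high) = expected_performance(rates)
print(f"expected rate = {mean_rate:.4f}, 0.95 CI = ({low:.4f}, {high:.4f})")
```

			<p>The half-width of the interval plays the role of the error limit plotted around each expected value.</p>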
			<sec>
				<st>
					<p>Datasets</p>
				</st>
				<p>The two real datasets are the liver dataset <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> and the prostate dataset <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. Following a preprocessing protocol similar to that of Dudoit et al <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>, we screen out genes with missing data in more than 5% of the arrays, impute the remaining missing data with 0, and then apply a base 2 logarithmic transformation. Each experiment is standardized to zero median across the genes. After preprocessing, the prostate data consist of gene expression profiles of 62 primary prostate tumours and 41 normal specimens, with expression values for 7931 genes. The liver data consist of gene expression profiles of 105 primary HCC samples, 76 non-tumor liver tissues, 7 benign liver tumor samples, 10 metastatic cancers and 10 HCC cell lines on 11763 genes. We use the two largest classes of the liver dataset in our experiments.</p>
				<p>The simulated data are created according to the protocol in <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>, where each gene expression value is a normally distributed random value plus noise drawn from the uniform distribution <it>U</it>(-0.1, 0.1), which closely resembles real data. Each simulated dataset contains 1000 genes (the first 50 differentially expressed and the remaining 950 not differentially expressed) and two classes <it>C</it><sub>1 </sub>and <it>C</it><sub>2 </sub>(with <it>n</it><sub>1 </sub>and <it>n</it><sub>2 </sub>samples, respectively). For any non-differentially expressed gene <it>j </it>(i.e. 51 &#8804; <it>j </it>&#8804; 1000), its expression value <it>a</it><sub><it>ij </it></sub>on each sample <it>i </it>is randomly generated from <it>N</it>(<it>&#956;</it>, 0.5) and <it>U</it>(-0.1, 0.1), where <it>&#956; </it>~<it>N</it>(0, 0.25). For a gene <it>j </it>&#8804; 50, the value on any sample in class <it>C</it><sub>1 </sub>is generated from <it>N</it>(<it>&#956;</it><sub>1</sub>, <it>&#963;</it><sub>1</sub>) and <it>U</it>(-0.1, 0.1), while that on any sample in class <it>C</it><sub>2 </sub>is generated from <it>N</it>(<it>&#956;</it><sub>2</sub>, <it>&#963;</it><sub>2</sub>) and <it>U</it>(-0.1, 0.1), where <it>&#956;</it><sub>1</sub>, <it>&#956;</it><sub>2</sub>~<it>N</it>(0, 0.5). Identifying differentially expressed genes involves multiple testing, and a Bonferroni correction of the significance level <it>&#945; </it>can be used to reduce the type I error. However, a very small <it>&#945; </it>makes it difficult to compare the performance of the methods. In this paper, a relatively small significance level <it>&#945; </it>is therefore used to control the type I error: <it>&#945; </it>is set to 0.0001 on the real data and to 0.01 on the simulated data.</p>
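				<p>A minimal sketch of this simulation protocol is given below. The experiments reported here were run in Matlab; the Python code is only an illustration. It assumes that the second parameter of each normal distribution is a standard deviation and that the uniform noise is added to the normal value, neither of which is stated explicitly above. The function name <it>simulate_dataset</it> and its arguments are hypothetical.</p>

```python
import random


def simulate_dataset(n1, n2, sigma1=0.5, sigma2=0.5, n_genes=1000, n_diff=50, seed=None):
    """Return (data, labels); data[j] holds the values of gene j on n1 + n2 samples."""
    rng = random.Random(seed)
    labels = [1] * n1 + [2] * n2
    data = []
    for j in range(n_genes):
        if j >= n_diff:
            # Non-differential gene: one mean shared by both classes, spread 0.5.
            mu = rng.gauss(0.0, 0.25)
            mus, sigmas = (mu, mu), (0.5, 0.5)
        else:
            # Differential gene (first n_diff rows): class-specific mean and spread.
            mus = (rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5))
            sigmas = (sigma1, sigma2)
        # Normal value plus uniform noise from U(-0.1, 0.1) for every sample.
        row = [rng.gauss(mus[c - 1], sigmas[c - 1]) + rng.uniform(-0.1, 0.1) for c in labels]
        data.append(row)
    return data, labels


# One unbalanced dataset with equal variances: n1 = 60 and n2 = 20.
data, labels = simulate_dataset(n1=60, n2=20, seed=1)
```

				<p>Passing different values for the hypothetical arguments <it>sigma1</it> and <it>sigma2</it> would give the unequal-variance case.</p>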
			</sec>
			<sec>
				<st>
					<p>Results on real data</p>
				</st>
				<p>In the experiments under evaluation model 1, the number of samples in class <it>C</it><sub>1 </sub>of the artificial data, which are created from the liver data or the prostate data, is always fixed at 60. The results under evaluation model 1 are presented in figure <figr fid="F1">1</figr>. Because of the limited sample size of the real data, in the experiments under evaluation model 2 the value of <it>n</it><sub>1</sub>+ <it>n</it><sub>2 </sub>is fixed at 120 for the artificial data created from the liver data and at 60 for those created from the prostate data. The results under evaluation model 2 on real data are presented in figure <figr fid="F2">2</figr>. The expected Overlap Rate of each method and its 0.95 confidence interval (or Error Limit) at each specific SR are obtained from 100 randomly generated artificial datasets. Furthermore, to test whether the average Overlap Rate at <it>SR </it>&#8800; 1 (denoted as <graphic file="1471-2105-7-S4-S8-i1.gif"/><sub><it>i</it>(<it>i </it>&#8800; 1)</sub>) is significantly different from that at <it>SR </it>= 1 (denoted as <graphic file="1471-2105-7-S4-S8-i1.gif"/><sub>1</sub>), we perform a two-sample t-test in which the observations are the 100 Overlap Rates calculated from 100 random artificial datasets with <it>SR </it>= 1 and the 100 Overlap Rates calculated from 100 random artificial datasets with <it>SR </it>&#8800; 1. The null hypothesis states that <graphic file="1471-2105-7-S4-S8-i1.gif"/><sub><it>i</it>(<it>i </it>&#8800; 1)</sub> = <graphic file="1471-2105-7-S4-S8-i1.gif"/><sub>1</sub>, while the alternative hypothesis states that <graphic file="1471-2105-7-S4-S8-i1.gif"/><sub><it>i</it>(<it>i </it>&#8800; 1) </sub>&#8800; <graphic file="1471-2105-7-S4-S8-i1.gif"/><sub>1</sub>. The p-values associated with the t-statistic under evaluation models 1 and 2 are summarized in tables <tblr tid="T1">1</tblr> and <tblr tid="T2">2</tblr>, respectively. 
The experiments on real data indicate that sample imbalance has a great influence on the performance of all six methods. As can be seen in figures <figr fid="F1">1</figr> and <figr fid="F2">2</figr>, on both real datasets the Overlap Rates of all methods decrease gradually as the sample ratio increases. For example, in figure <figr fid="F2">2(a)</figr>, the differences between the average Overlap Rates at SR = 1 and those at SR = 3 for the six methods (F, welch-t, wilcoxon, SAM, Regularized-t and Pan) are 0.2249, 0.1842, 0.2429, 0.2255, 0.2378 and 0.1932, respectively. According to the p-values shown in Table <tblr tid="T2">2</tblr>, we can conclude that on the real data the difference in performance of each method between <it>SR </it>= 1 and <it>SR </it>&#8800; 1 is highly statistically significant. Additionally, the Overlap Rates also differ among the methods. It can be seen from figures <figr fid="F1">1</figr> and <figr fid="F2">2</figr> that the Welch t-test and the method of Pan achieve higher Overlap Rates on the unbalanced liver data than the other four methods, while the Wilcoxon test shows a lower Overlap Rate than the other five methods on the unbalanced prostate data. However, because the true set of differentially expressed genes is unknown for real data, we cannot determine directly and rigorously which of the six methods performs best on real data.</p>
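				<p>The group sizes implied by the two evaluation models can be illustrated with a short sketch. It assumes that SR denotes the ratio <it>n</it><sub>1</sub>/<it>n</it><sub>2</sub> of the larger to the smaller group, which is not defined in this section; the function names are hypothetical, and the SR values are those listed in the column headings of Tables 1 and 2.</p>

```python
from fractions import Fraction


def model1_sizes(n1, sample_ratios):
    """Evaluation model 1 (assumed form): n1 is fixed and n2 = n1 / SR."""
    return {sr: (Fraction(n1), Fraction(n1) / Fraction(str(sr))) for sr in sample_ratios}


def model2_sizes(total, sample_ratios):
    """Evaluation model 2 (assumed form): n1 + n2 is fixed and n1 = SR * n2."""
    sizes = {}
    for sr in sample_ratios:
        n2 = Fraction(total) / (Fraction(str(sr)) + 1)
        sizes[sr] = (Fraction(total) - n2, n2)
    return sizes


def show(title, sizes):
    """Print each pair and mark ratios that do not give whole sample counts."""
    print(title)
    for sr, (n1, n2) in sizes.items():
        note = "" if n2.denominator == 1 else "  (not a whole number of samples)"
        print(f"  SR = {sr}: n1 = {float(n1):g}, n2 = {float(n2):g}{note}")


model1 = model1_sizes(60, [2, 3, 4, 5, 6, 7.5])
liver = model2_sizes(120, [2, 3, 4, 5, 7])
prostate = model2_sizes(60, [2, 3, 4, 5, 7])

show("Model 1, n1 = 60", model1)
show("Model 2, liver, n1 + n2 = 120", liver)
show("Model 2, prostate, n1 + n2 = 60", prostate)
```

				<p>Under this assumed reading, SR = 7 with 60 samples in total would require 7.5 samples in the smaller group, which would be consistent with the empty SR = 7 cells for the prostate data in Table 2.</p>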
				<fig id="F1">
					<title>
						<p>Figure 1</p>
					</title>
					<caption>
						<p>The results on prostate and liver datasets under the evaluation model 1</p>
					</caption>
					<text>
						<p><b>The results on prostate and liver datasets under the evaluation model 1</b>. The expected Overlap Rates of the six methods, together with their error limits, on the prostate and liver datasets under evaluation model 1, where the number of samples in class <it>C</it><sub>1 </sub>of the artificial data created from the liver data and from the prostate data is fixed at 60.</p>
					</text>
					<graphic file="1471-2105-7-S4-S8-1"/>
				</fig>
				<fig id="F2">
					<title>
						<p>Figure 2</p>
					</title>
					<caption>
						<p>The results on prostate and liver datasets under the evaluation model 2</p>
					</caption>
					<text>
						<p><b>The results on prostate and liver datasets under the evaluation model 2</b>. The expected Overlap Rates of the six methods, together with their error limits, on the prostate and liver datasets under evaluation model 2, where the total number of samples is fixed at 120 for the artificial data created from the liver data and at 60 for those created from the prostate data.</p>
					</text>
					<graphic file="1471-2105-7-S4-S8-2"/>
				</fig>
				<tbl id="T1">
					<title>
						<p>Table 1</p>
					</title>
					<caption>
						<p>The p-values of the t-statistic under evaluation model 1 on the two real datasets.</p>
					</caption>
					<tblbdy cols="7">
						<r>
							<c ca="left">
								<p>SR</p>
							</c>
							<c ca="center">
								<p>2</p>
							</c>
							<c ca="center">
								<p>3</p>
							</c>
							<c ca="center">
								<p>4</p>
							</c>
							<c ca="center">
								<p>5</p>
							</c>
							<c ca="center">
								<p>6</p>
							</c>
							<c ca="center">
								<p>7.5</p>
							</c>
						</r>
						<r>
							<c cspan="7">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Prostate</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>F</p>
							</c>
							<c ca="center">
								<p>1.4e-115</p>
							</c>
							<c ca="center">
								<p>1.0e-162</p>
							</c>
							<c ca="center">
								<p>9.9e-174</p>
							</c>
							<c ca="center">
								<p>1.8e-188</p>
							</c>
							<c ca="center">
								<p>1.6e-201</p>
							</c>
							<c ca="center">
								<p>6.2e-232</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>welch-t</p>
							</c>
							<c ca="center">
								<p>2.3e-112</p>
							</c>
							<c ca="center">
								<p>3.3e-157</p>
							</c>
							<c ca="center">
								<p>5.0e-177</p>
							</c>
							<c ca="center">
								<p>4.4e-189</p>
							</c>
							<c ca="center">
								<p>3.4e-213</p>
							</c>
							<c ca="center">
								<p>3.2e-249</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>sam</p>
							</c>
							<c ca="center">
								<p>4.8e-090</p>
							</c>
							<c ca="center">
								<p>1.3e-146</p>
							</c>
							<c ca="center">
								<p>3.0e-164</p>
							</c>
							<c ca="center">
								<p>3.1e-186</p>
							</c>
							<c ca="center">
								<p>7.9e-203</p>
							</c>
							<c ca="center">
								<p>4.8e-230</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>wilcoxon</p>
							</c>
							<c ca="center">
								<p>1.3e-112</p>
							</c>
							<c ca="center">
								<p>1.4e-156</p>
							</c>
							<c ca="center">
								<p>6.3e-182</p>
							</c>
							<c ca="center">
								<p>9.5e-202</p>
							</c>
							<c ca="center">
								<p>1.6e-230</p>
							</c>
							<c ca="center">
								<p>1.2e-279</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Reg-t</p>
							</c>
							<c ca="center">
								<p>2.8e-108</p>
							</c>
							<c ca="center">
								<p>4.9e-159</p>
							</c>
							<c ca="center">
								<p>3.9e-170</p>
							</c>
							<c ca="center">
								<p>1.3e-183</p>
							</c>
							<c ca="center">
								<p>9.7e-199</p>
							</c>
							<c ca="center">
								<p>2.8e-229</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Pan</p>
							</c>
							<c ca="center">
								<p>1.7e-084</p>
							</c>
							<c ca="center">
								<p>2.8e-135</p>
							</c>
							<c ca="center">
								<p>1.3e-154</p>
							</c>
							<c ca="center">
								<p>5.9e-176</p>
							</c>
							<c ca="center">
								<p>2.0e-192</p>
							</c>
							<c ca="center">
								<p>9.3e-227</p>
							</c>
						</r>
						<r>
							<c cspan="7">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Liver</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>F</p>
							</c>
							<c ca="center">
								<p>3.1e-118</p>
							</c>
							<c ca="center">
								<p>5.3e-151</p>
							</c>
							<c ca="center">
								<p>4.6e-171</p>
							</c>
							<c ca="center">
								<p>4.7e-188</p>
							</c>
							<c ca="center">
								<p>7.3e-198</p>
							</c>
							<c ca="center">
								<p>1.5e-204</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>welch-t</p>
							</c>
							<c ca="center">
								<p>9.8e-086</p>
							</c>
							<c ca="center">
								<p>1.9e-122</p>
							</c>
							<c ca="center">
								<p>1.8e-144</p>
							</c>
							<c ca="center">
								<p>1.7e-156</p>
							</c>
							<c ca="center">
								<p>6.6e-173</p>
							</c>
							<c ca="center">
								<p>7.8e-185</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>sam</p>
							</c>
							<c ca="center">
								<p>6.1e-106</p>
							</c>
							<c ca="center">
								<p>3.5e-139</p>
							</c>
							<c ca="center">
								<p>7.4e-166</p>
							</c>
							<c ca="center">
								<p>2.0e-178</p>
							</c>
							<c ca="center">
								<p>2.8e-187</p>
							</c>
							<c ca="center">
								<p>3.1e-195</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>wilcoxon</p>
							</c>
							<c ca="center">
								<p>2.6e-107</p>
							</c>
							<c ca="center">
								<p>1.3e-148</p>
							</c>
							<c ca="center">
								<p>1.6e-173</p>
							</c>
							<c ca="center">
								<p>1.7e-189</p>
							</c>
							<c ca="center">
								<p>1.2e-200</p>
							</c>
							<c ca="center">
								<p>2.9e-211</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Reg-t</p>
							</c>
							<c ca="center">
								<p>3.4e-119</p>
							</c>
							<c ca="center">
								<p>4.7e-153</p>
							</c>
							<c ca="center">
								<p>8.9e-177</p>
							</c>
							<c ca="center">
								<p>1.2e-188</p>
							</c>
							<c ca="center">
								<p>8.6e-198</p>
							</c>
							<c ca="center">
								<p>8.8e-205</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Pan</p>
							</c>
							<c ca="center">
								<p>5.1e-073</p>
							</c>
							<c ca="center">
								<p>1.1e-111</p>
							</c>
							<c ca="center">
								<p>7.0e-135</p>
							</c>
							<c ca="center">
								<p>1.8e-144</p>
							</c>
							<c ca="center">
								<p>5.9e-165</p>
							</c>
							<c ca="center">
								<p>1.8e-177</p>
							</c>
						</r>
					</tblbdy>
				</tbl>
				<tbl id="T2">
					<title>
						<p>Table 2</p>
					</title>
					<caption>
						<p>The p-values of the t-statistic under evaluation model 2 on the two real datasets.</p>
					</caption>
					<tblbdy cols="6">
						<r>
							<c ca="left">
								<p>SR</p>
							</c>
							<c ca="center">
								<p>2</p>
							</c>
							<c ca="center">
								<p>3</p>
							</c>
							<c ca="center">
								<p>4</p>
							</c>
							<c ca="center">
								<p>5</p>
							</c>
							<c ca="center">
								<p>7</p>
							</c>
						</r>
						<r>
							<c cspan="6">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Prostate</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>F</p>
							</c>
							<c ca="center">
								<p>7.2e-030</p>
							</c>
							<c ca="center">
								<p>9.0e-068</p>
							</c>
							<c ca="center">
								<p>8.8e-089</p>
							</c>
							<c ca="center">
								<p>7.1e-107</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>welch-t</p>
							</c>
							<c ca="center">
								<p>1.8e-016</p>
							</c>
							<c ca="center">
								<p>6.9e-056</p>
							</c>
							<c ca="center">
								<p>5.6e-087</p>
							</c>
							<c ca="center">
								<p>1.6e-110</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>sam</p>
							</c>
							<c ca="center">
								<p>9.2e-034</p>
							</c>
							<c ca="center">
								<p>4.7e-071</p>
							</c>
							<c ca="center">
								<p>1.6e-099</p>
							</c>
							<c ca="center">
								<p>9.0e-121</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>wilcoxon</p>
							</c>
							<c ca="center">
								<p>1.6e-028</p>
							</c>
							<c ca="center">
								<p>3.7e-070</p>
							</c>
							<c ca="center">
								<p>5.2e-099</p>
							</c>
							<c ca="center">
								<p>2.6e-123</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Reg-t</p>
							</c>
							<c ca="center">
								<p>1.2e-033</p>
							</c>
							<c ca="center">
								<p>1.1e-073</p>
							</c>
							<c ca="center">
								<p>1.3e-094</p>
							</c>
							<c ca="center">
								<p>1.0e-113</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Pan</p>
							</c>
							<c ca="center">
								<p>2.1e-017</p>
							</c>
							<c ca="center">
								<p>2.0e-054</p>
							</c>
							<c ca="center">
								<p>5.5e-083</p>
							</c>
							<c ca="center">
								<p>2.8e-112</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c cspan="6">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Liver</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>F</p>
							</c>
							<c ca="center">
								<p>1.6e-060</p>
							</c>
							<c ca="center">
								<p>8.9e-111</p>
							</c>
							<c ca="center">
								<p>1.2e-137</p>
							</c>
							<c ca="center">
								<p>6.5e-147</p>
							</c>
							<c ca="center">
								<p>1.6e-177</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>welch-t</p>
							</c>
							<c ca="center">
								<p>3.2e-008</p>
							</c>
							<c ca="center">
								<p>4.5e-049</p>
							</c>
							<c ca="center">
								<p>1.8e-085</p>
							</c>
							<c ca="center">
								<p>5.9e-106</p>
							</c>
							<c ca="center">
								<p>1.6e-137</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>sam</p>
							</c>
							<c ca="center">
								<p>1.3e-050</p>
							</c>
							<c ca="center">
								<p>5.5e-099</p>
							</c>
							<c ca="center">
								<p>1.1e-129</p>
							</c>
							<c ca="center">
								<p>9.5e-142</p>
							</c>
							<c ca="center">
								<p>2.0e-170</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>wilcoxon</p>
							</c>
							<c ca="center">
								<p>3.4e-036</p>
							</c>
							<c ca="center">
								<p>3.1e-089</p>
							</c>
							<c ca="center">
								<p>3.1e-121</p>
							</c>
							<c ca="center">
								<p>5.0e-136</p>
							</c>
							<c ca="center">
								<p>8.9e-171</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Reg-t</p>
							</c>
							<c ca="center">
								<p>6.3e-062</p>
							</c>
							<c ca="center">
								<p>1.4e-112</p>
							</c>
							<c ca="center">
								<p>6.5e-138</p>
							</c>
							<c ca="center">
								<p>8.6e-148</p>
							</c>
							<c ca="center">
								<p>2.1e-178</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Pan</p>
							</c>
							<c ca="center">
								<p>3.1e-005</p>
							</c>
							<c ca="center">
								<p>1.7e-039</p>
							</c>
							<c ca="center">
								<p>1.3e-069</p>
							</c>
							<c ca="center">
								<p>1.2e-094</p>
							</c>
							<c ca="center">
								<p>6.5e-124</p>
							</c>
						</r>
					</tblbdy>
				</tbl>
			</sec>
			<sec>
				<st>
					<p>Results on simulated data</p>
				</st>
				<p>In this section, under the two proposed evaluation models, we generate two kinds of simulated data to compare the performance of the methods on unbalanced data. In the first kind, the differentially expressed genes have equal variances in sample classes <it>C</it><sub>1 </sub>and <it>C</it><sub>2 </sub>(i.e. <it>&#963;</it><sub>1 </sub>= <it>&#963;</it><sub>2</sub>); in the second kind, they have unequal variances (i.e. <it>&#963;</it><sub>1 </sub>&#8800; <it>&#963;</it><sub>2</sub>). Each reported result on simulated data is the average over 1000 random datasets generated with a specific sample ratio.</p>
				<sec>
					<st>
						<p>Equal variances</p>
					</st>
					<p>Figure <figr fid="F3">3</figr> shows the results on the simulated data with equal variances (<it>&#963;</it><sub>1 </sub>= <it>&#963;</it><sub>2 </sub>= 0.5), where the number of samples in class <it>C</it><sub>1 </sub>is fixed at 60 in evaluation model 1 and the total number of samples is fixed at 60 in evaluation model 2. The corresponding p-values of the t-statistic on the simulated data with equal variances under evaluation models 1 and 2 are summarized in Tables <tblr tid="T3">3</tblr> and <tblr tid="T4">4</tblr>, respectively. These experiments lead to the following observations.</p>
					<fig id="F3">
						<title>
							<p>Figure 3</p>
						</title>
						<caption>
							<p>The expected performances of six methods on the simulated data with equal variances, i.e. <it>&#963;</it><sub>1 </sub>= <it>&#963;</it><sub>2</sub>= 0.5</p>
						</caption>
						<text>
							<p><b>The expected performances of six methods on the simulated data with equal variances, i.e. <it>&#963;</it><sub>1 </sub>= <it>&#963;</it><sub>2</sub>= 0.5</b>. The expected Precision Rates and Recall Rates of six methods as well as their error limits on the simulated data with equal variances (<it>&#963;</it><sub>1 </sub>= <it>&#963;</it><sub>2 </sub>= 0.5), where the number of samples of class <it>C</it><sub>1 </sub>is fixed at 60 in the evaluation model 1 and the number of overall samples is fixed at 60 in the evaluation model 2.</p>
						</text>
						<graphic file="1471-2105-7-S4-S8-3"/>
					</fig>
					<tbl id="T3">
						<title>
							<p>Table 3</p>
						</title>
						<caption>
							<p>The p-values of the t-statistic on the simulated data with <it>&#963;</it><sub>1 </sub>= <it>&#963;</it><sub>2 </sub>= 0.5, under evaluation model 1 (<it>n</it><sub>1 </sub>&#8801; 60).</p>
						</caption>
						<tblbdy cols="7">
							<r>
								<c ca="left">
									<p>SR</p>
								</c>
								<c ca="center">
									<p>2</p>
								</c>
								<c ca="center">
									<p>3</p>
								</c>
								<c ca="center">
									<p>4</p>
								</c>
								<c ca="center">
									<p>5</p>
								</c>
								<c ca="center">
									<p>6</p>
								</c>
								<c ca="center">
									<p>7.5</p>
								</c>
							</r>
							<r>
								<c cspan="7">
									<hr/>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Precision</p>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>F</p>
								</c>
								<c ca="center">
									<p>2.1e-10</p>
								</c>
								<c ca="center">
									<p>5.8e-20</p>
								</c>
								<c ca="center">
									<p>3.5e-50</p>
								</c>
								<c ca="center">
									<p>3.0e-085</p>
								</c>
								<c ca="center">
									<p>2.1e-096</p>
								</c>
								<c ca="center">
									<p>7.6e-150</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>welch-t</p>
								</c>
								<c ca="center">
									<p>2.3e-12</p>
								</c>
								<c ca="center">
									<p>1.9e-36</p>
								</c>
								<c ca="center">
									<p>9.3e-85</p>
								</c>
								<c ca="center">
									<p>3.8e-164</p>
								</c>
								<c ca="center">
									<p>4.2e-231</p>
								</c>
								<c ca="center">
									<p>0.0</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>sam</p>
								</c>
								<c ca="center">
									<p>5.9e-08</p>
								</c>
								<c ca="center">
									<p>4.5e-24</p>
								</c>
								<c ca="center">
									<p>2.6e-47</p>
								</c>
								<c ca="center">
									<p>3.1e-079</p>
								</c>
								<c ca="center">
									<p>3.3e-092</p>
								</c>
								<c ca="center">
									<p>1.7e-144</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>wilcoxon</p>
								</c>
								<c ca="center">
									<p>1.5e-06</p>
								</c>
								<c ca="center">
									<p>1.7e-14</p>
								</c>
								<c ca="center">
									<p>2.3e-26</p>
								</c>
								<c ca="center">
									<p>1.4e-044</p>
								</c>
								<c ca="center">
									<p>9.5e-045</p>
								</c>
								<c ca="center">
									<p>3.3e-064</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Reg-t</p>
								</c>
								<c ca="center">
									<p>1.7e-06</p>
								</c>
								<c ca="center">
									<p>2.7e-15</p>
								</c>
								<c ca="center">
									<p>3.2e-36</p>
								</c>
								<c ca="center">
									<p>1.6e-063</p>
								</c>
								<c ca="center">
									<p>3.6e-073</p>
								</c>
								<c ca="center">
									<p>4.2e-123</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Pan</p>
								</c>
								<c ca="center">
									<p>6.7e-12</p>
								</c>
								<c ca="center">
									<p>5.3e-29</p>
								</c>
								<c ca="center">
									<p>1.9e-61</p>
								</c>
								<c ca="center">
									<p>6.8e-117</p>
								</c>
								<c ca="center">
									<p>5.5e-152</p>
								</c>
								<c ca="center">
									<p>1.0e-228</p>
								</c>
							</r>
							<r>
								<c cspan="7">
									<hr/>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Recall</p>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>F</p>
								</c>
								<c ca="center">
									<p>6.6e-76</p>
								</c>
								<c ca="center">
									<p>2.0e-211</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>welch-t</p>
								</c>
								<c ca="center">
									<p>2.1e-81</p>
								</c>
								<c ca="center">
									<p>4.1e-247</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>sam</p>
								</c>
								<c ca="center">
									<p>3.7e-74</p>
								</c>
								<c ca="center">
									<p>8.5e-204</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>wilcoxon</p>
								</c>
								<c ca="center">
									<p>9.0e-80</p>
								</c>
								<c ca="center">
									<p>1.8e-228</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Reg-t</p>
								</c>
								<c ca="center">
									<p>3.4e-79</p>
								</c>
								<c ca="center">
									<p>5.9e-215</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Pan</p>
								</c>
								<c ca="center">
									<p>3.1e-82</p>
								</c>
								<c ca="center">
									<p>8.3e-246</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
							</r>
						</tblbdy>
					</tbl>
					<tbl id="T4">
						<title>
							<p>Table 4</p>
						</title>
						<caption>
							<p>The p-values of the t-statistic on the simulated data with <it>&#963;</it><sub>1 </sub>= <it>&#963;</it><sub>2 </sub>= 0.5, under evaluation model 2 (<it>n</it><sub>1 </sub>+ <it>n</it><sub>2 </sub>&#8801; 60).</p>
						</caption>
						<tblbdy cols="6">
							<r>
								<c ca="left">
									<p>SR</p>
								</c>
								<c ca="center">
									<p>2</p>
								</c>
								<c ca="center">
									<p>3</p>
								</c>
								<c ca="center">
									<p>4</p>
								</c>
								<c ca="center">
									<p>5</p>
								</c>
								<c ca="center">
									<p>6.5</p>
								</c>
							</r>
							<r>
								<c cspan="6">
									<hr/>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Precision</p>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>F</p>
								</c>
								<c ca="center">
									<p>6.6e-2</p>
								</c>
								<c ca="center">
									<p>5.3e-06</p>
								</c>
								<c ca="center">
									<p>5.9e-17</p>
								</c>
								<c ca="center">
									<p>4.2e-029</p>
								</c>
								<c ca="center">
									<p>3.4e-055</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>welch-t</p>
								</c>
								<c ca="center">
									<p>2.5e-4</p>
								</c>
								<c ca="center">
									<p>3.7e-23</p>
								</c>
								<c ca="center">
									<p>1.6e-44</p>
								</c>
								<c ca="center">
									<p>7.1e-128</p>
								</c>
								<c ca="center">
									<p>2.3e-254</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>sam</p>
								</c>
								<c ca="center">
									<p>3.1e-2</p>
								</c>
								<c ca="center">
									<p>4.0e-08</p>
								</c>
								<c ca="center">
									<p>1.6e-19</p>
								</c>
								<c ca="center">
									<p>1.0e-030</p>
								</c>
								<c ca="center">
									<p>7.3e-061</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>wilcoxon</p>
								</c>
								<c ca="center">
									<p>2.3e-1</p>
								</c>
								<c ca="center">
									<p>1.1e-02</p>
								</c>
								<c ca="center">
									<p>1.2e-08</p>
								</c>
								<c ca="center">
									<p>1.2e-018</p>
								</c>
								<c ca="center">
									<p>6.3e-022</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Reg-t</p>
								</c>
								<c ca="center">
									<p>4.6e-2</p>
								</c>
								<c ca="center">
									<p>6.6e-07</p>
								</c>
								<c ca="center">
									<p>8.7e-17</p>
								</c>
								<c ca="center">
									<p>4.0e-029</p>
								</c>
								<c ca="center">
									<p>1.4e-056</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Pan</p>
								</c>
								<c ca="center">
									<p>2.6e-2</p>
								</c>
								<c ca="center">
									<p>1.1e-09</p>
								</c>
								<c ca="center">
									<p>1.5e-29</p>
								</c>
								<c ca="center">
									<p>1.8e-056</p>
								</c>
								<c ca="center">
									<p>1.4e-112</p>
								</c>
							</r>
							<r>
								<c cspan="6">
									<hr/>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Recall</p>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>F</p>
								</c>
								<c ca="center">
									<p>9.2e-6</p>
								</c>
								<c ca="center">
									<p>2.3e-47</p>
								</c>
								<c ca="center">
									<p>6.6e-102</p>
								</c>
								<c ca="center">
									<p>1.0e-180</p>
								</c>
								<c ca="center">
									<p>1.2e-288</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>welch-t</p>
								</c>
								<c ca="center">
									<p>2.0e-9</p>
								</c>
								<c ca="center">
									<p>6.2e-77</p>
								</c>
								<c ca="center">
									<p>1.4e-169</p>
								</c>
								<c ca="center">
									<p>1.9e-309</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>sam</p>
								</c>
								<c ca="center">
									<p>3.7e-6</p>
								</c>
								<c ca="center">
									<p>1.0e-48</p>
								</c>
								<c ca="center">
									<p>7.9e-103</p>
								</c>
								<c ca="center">
									<p>3.7e-184</p>
								</c>
								<c ca="center">
									<p>1.9e-290</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>wilcoxon</p>
								</c>
								<c ca="center">
									<p>7.2e-7</p>
								</c>
								<c ca="center">
									<p>5.2e-56</p>
								</c>
								<c ca="center">
									<p>3.5e-122</p>
								</c>
								<c ca="center">
									<p>2.3e-213</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Reg-t</p>
								</c>
								<c ca="center">
									<p>2.0e-8</p>
								</c>
								<c ca="center">
									<p>4.7e-57</p>
								</c>
								<c ca="center">
									<p>2.6e-115</p>
								</c>
								<c ca="center">
									<p>3.1e-199</p>
								</c>
								<c ca="center">
									<p>2.9e-310</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Pan</p>
								</c>
								<c ca="center">
									<p>4.2e-9</p>
								</c>
								<c ca="center">
									<p>1.7e-80</p>
								</c>
								<c ca="center">
									<p>4.6e-172</p>
								</c>
								<c ca="center">
									<p>8.1e-315</p>
								</c>
								<c ca="center">
									<p>0</p>
								</c>
							</r>
						</tblbdy>
					</tbl>
					<p>The results on the simulated data with equal variances indicate that the performance of every method is strongly affected by sample imbalance. Both performance metrics (Precision Rate and Recall Rate) decline steadily as the sample ratio increases. This result is consistent with the earlier experiments on the real data.</p>
					<p>Furthermore, the Recall Rate declines more steeply than the Precision Rate as the sample ratio increases. In other words, the Recall Rate (reflecting false negatives) of a method for selecting differentially expressed genes is more sensitive to sample imbalance than the Precision Rate (reflecting false positives), although both are affected.</p>
					<p>Sample imbalance also affects the methods differently, and the differences between methods grow as the degree of imbalance increases. In particular, the Precision Rates of the Wilcoxon rank-sum test and the Regularized t-test are higher than those of the other methods; that is, these two tests have the lowest false positive rate (Type I error). The Recall Rate of SAM, in contrast, is superior to that of the other methods, i.e. SAM has the lowest false negative rate (Type II error). The Welch t-test shows the worst performance.</p>
				</sec>
				<sec>
					<st>
						<p>Unequal variances</p>
					</st>
					<p>In this section, under the two evaluation models, the simulated data are generated in two cases. The first case satisfies <it>&#963;</it><sub>1 </sub>&#8804; <it>&#963;</it><sub>2 </sub>and <it>n</it><sub>1 </sub>&#8805; <it>n</it><sub>2</sub>, for example, <it>&#963;</it><sub>1 </sub>= 0.5, <it>&#963;</it><sub>2 </sub>= 1 and <it>SR </it>= 1, 2, 3. The second case satisfies <it>&#963;</it><sub>1 </sub>&#8804; <it>&#963;</it><sub>2 </sub>and <it>n</it><sub>1 </sub>&#8804; <it>n</it><sub>2</sub>, for example, <it>&#963;</it><sub>1 </sub>= 0.5, <it>&#963;</it><sub>2 </sub>= 1 and <it>SR </it>= 1, <graphic file="1471-2105-7-S4-S8-i2.gif"/>, <graphic file="1471-2105-7-S4-S8-i3.gif"/>. The results of evaluation model 1 on the two cases of simulated data with unequal variances <it>&#963;</it><sub>1 </sub>= 0.5, <it>&#963;</it><sub>2 </sub>= 1 are shown in Figure <figr fid="F4">4</figr>. Figure <figr fid="F5">5</figr> plots the results of evaluation model 2 with <it>n</it><sub>1 </sub>+ <it>n</it><sub>2 </sub>= 60 on the two cases of simulated data with unequal variances <it>&#963;</it><sub>1 </sub>= 0.5, <it>&#963;</it><sub>2 </sub>= 1.</p>
					<fig id="F4">
						<title>
							<p>Figure 4</p>
						</title>
						<caption>
							<p>The expected performances of six methods under the evaluation model 1 on the simulated data with unequal variances, where <it>&#963;</it><sub>1</sub>= 0.5, <it>&#963;</it><sub>2 </sub>= 1</p>
						</caption>
						<text>
							<p><b>The expected performances of six methods under the evaluation model 1 on the simulated data with unequal variances, where <it>&#963;</it><sub>1 </sub>= 0.5, <it>&#963;</it><sub>2 </sub>= 1</b>. The expected Precision Rates and Recall Rates of six methods as well as their error limits on the simulated data with unequal variances (<it>&#963;</it><sub>1 </sub>= 0.5, <it>&#963;</it><sub>2 </sub>= 1) in the evaluation model 1, where the numbers of samples of class <it>C</it><sub>1 </sub>and class <it>C</it><sub>2 </sub>are fixed at 60, respectively.</p>
						</text>
						<graphic file="1471-2105-7-S4-S8-4"/>
					</fig>
					<fig id="F5">
						<title>
							<p>Figure 5</p>
						</title>
						<caption>
							<p>The expected performances of six methods under the evaluation model 2 on the simulated data with unequal variances, where <it>&#963;</it><sub>1 </sub>= 0.5, <it>&#963;</it><sub>2 </sub>= 1 and <it>n</it><sub>1 </sub>+ <it>n</it><sub>2 </sub>&#8801; 60</p>
						</caption>
						<text>
							<p><b>The expected performances of six methods under the evaluation model 2 on the simulated data with unequal variances, where <it>&#963;</it><sub>1 </sub>= 0.5, <it>&#963;</it><sub>2 </sub>= 1 and <it>n</it><sub>1 </sub>+ <it>n</it><sub>2 </sub>&#8801; 60</b>. The expected Precision Rates and Recall Rates of six methods as well as their error limits on the simulated data with unequal variances (<it>&#963;</it><sub>1 </sub>= 0.5,<it>&#963;</it><sub>2 </sub>= 1) in the evaluation model 2, where the number of overall samples is fixed at 60.</p>
						</text>
						<graphic file="1471-2105-7-S4-S8-5"/>
					</fig>
					<p>As observed in Figures <figr fid="F4">4</figr> and <figr fid="F5">5</figr>, the performance of each of the six methods degrades as the degree of sample imbalance increases, and on the same unbalanced data the performances of the six methods vary greatly. These features are the same as those observed on the simulated data with equal variances. Furthermore, the performances of all methods compared in this paper differ surprisingly between the two types of unbalanced data with unequal variances. In the case of <it>&#963;</it><sub>1 </sub>= 0.5, <it>&#963;</it><sub>2 </sub>= 1 and <it>n</it><sub>1 </sub>&#8805; <it>n</it><sub>2</sub>, the Regularized t-test shows the highest Precision Rate and Recall Rate while the Welch t-test performs worst. In contrast, the Regularized t-test has intermediate performance and the Welch t-test performs best when <it>&#963;</it><sub>1 </sub>= 0.5, <it>&#963;</it><sub>2 </sub>= 1 and <it>n</it><sub>1 </sub>&#8804; <graphic file="1471-2105-7-S4-S8-i3.gif"/><it>n</it><sub>2</sub>. This observation can be explained by Figure <figr fid="F5">5</figr>: under evaluation model 2, the performance curve of each method is a function of the sample ratio that attains its maximum at a specific sample ratio. These results imply that one should select a suitable method for detecting differentially expressed genes on a specific unbalanced data set. Selecting a more suitable method for the unbalanced data can greatly improve the result of the analysis.</p>
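					<p>One way to see why performance under a fixed total sample size can peak at a particular sample ratio is to look at the variance of the difference between the two class means, which for independent samples is the sum of each class variance divided by its class size. The sketch below is only an illustration under that simplifying assumption, not the procedure used in the experiments; the function names and parameters are hypothetical. It searches all splits of a fixed total and returns the split with the smallest variance.</p>

```python
def mean_difference_variance(n1, n2, sigma1, sigma2):
    """Variance of (mean of class 1 - mean of class 2) for independent samples."""
    return sigma1 ** 2 / n1 + sigma2 ** 2 / n2


def best_allocation(total, sigma1, sigma2):
    """Return (n1, n2) with n1 + n2 == total that minimizes the variance above.

    Assumption for illustration: both classes must contain at least one sample.
    """
    candidates = [(n1, total - n1) for n1 in range(1, total)]
    return min(
        candidates,
        key=lambda pair: mean_difference_variance(pair[0], pair[1], sigma1, sigma2),
    )


if __name__ == "__main__":
    # Setting taken from the text: 60 samples overall, sigma1 = 0.5, sigma2 = 1.
    n1, n2 = best_allocation(60, 0.5, 1.0)
    print(n1, n2, n1 / n2)
```

					<p>For 60 samples overall with standard deviations 0.5 and 1, this simplified criterion places more samples in the high-variance class. The six methods compared here need not peak at exactly the same ratio, since their statistics differ.</p>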
					<p>In order to investigate the combined influence of sample ratio and variance on method performance, the Regularized t-test and the Welch t-test are selected as examples to show how the difference between methods depends on the variances and sample ratios. Figure <figr fid="F6">6</figr> shows the difference between the Regularized t-test and the Welch t-test against varied variance at different sample ratios. When <it>&#963;</it><sub>1 </sub>&#8804; <it>&#963;</it><sub>2</sub>, the Regularized t-test is always superior to the Welch t-test on unbalanced data satisfying <it>n</it><sub>1 </sub>&#8805; <it>n</it><sub>2</sub>. When <it>&#963;</it><sub>1 </sub>&#8804; <it>&#963;</it><sub>2 </sub>and <it>n</it><sub>1 </sub>&#8804; <it>n</it><sub>2</sub>, the results are more complex. In plot b of Figure <figr fid="F6">6</figr>, several curves cross the zero line, which implies that each of the two tests is superior in some region. But when <it>&#963;</it><sub>1 </sub>&#8804; <graphic file="1471-2105-7-S4-S8-i2.gif"/><it>&#963;</it><sub>2 </sub>and <it>n</it><sub>1 </sub>&#8804; <graphic file="1471-2105-7-S4-S8-i3.gif"/><it>n</it><sub>2</sub>, the Welch t-test clearly dominates. In addition, the larger the difference between the variances <it>&#963;</it><sub>1 </sub>and <it>&#963;</it><sub>2 </sub>in unbalanced data, the more the methods differ in performance.</p>
					<fig id="F6">
						<title>
							<p>Figure 6</p>
						</title>
						<caption>
							<p>The average performance of the Regularized t-test minus the corresponding performance of the Welch t-test on the simulated data with varied variance <it>&#963;</it><sub>2</sub>, where <it>n</it><sub>1 </sub>+ <it>n</it><sub>2 </sub>&#8801; 60 and <it>&#963;</it><sub>1 </sub>&#8801; 0.5</p>
						</caption>
						<text>
							<p><b>The average performance of the Regularized t-test minus the corresponding performance of the Welch t-test on the simulated data with varied variance <it>&#963;</it><sub>2</sub>, where <it>n</it><sub>1 </sub>+ <it>n</it><sub>2 </sub>&#8801; 60 and <it>&#963;</it><sub>1 </sub>&#8801; 0.5</b>. The average Precision Rate and Recall Rate of the Regularized t-test minus those of the Welch t-test on the simulated data with varied variance <it>&#963;</it><sub>2</sub>, where <it>&#963;</it><sub>1 </sub>&#8801; 0.5 and <it>n</it><sub>1 </sub>+ <it>n</it><sub>2 </sub>&#8801; 60.</p>
						</text>
						<graphic file="1471-2105-7-S4-S8-6"/>
					</fig>
				</sec>
			</sec>
		</sec>
		<sec>
			<st>
				<p>Discussion</p>
			</st>
			<p>This study shows that sample imbalance strongly affects the performance of methods for selecting differentially expressed genes. For many practical reasons, gene expression data usually suffer from small sample sizes. As shown in the previous section, unbalanced data, coupled with small sample sizes, make detecting differentially expressed genes more difficult. Sample imbalance is an important and often unavoidable problem in gene expression data analysis. Hence, one needs to consider sample imbalance both in the design of microarray experiments and in the subsequent data analysis.</p>
			<p>Careful experimental design is necessary to improve the result of data analysis and to reduce the cost of the experiment at the same time. Comparing plots a and b in Figure <figr fid="F3">3</figr>, we find that the expected Recall Rates and the expected Precision Rates at SR = 1 in plot b are higher than those at SR = 6 in plot a. In other words, because of the influence of sample imbalance, a balanced gene expression data set of size 60 (30 + 30 samples) can yield better results than a similar unbalanced data set of size 70 (60 + 10 samples). This finding is noteworthy.</p>
			<p>Furthermore, our results indicate that on unbalanced data the performances of different methods differ greatly, especially on data with heterogeneous variances. Some previous studies <abbrgrp><abbr bid="B24">24</abbr><abbr bid="B25">25</abbr></abbrgrp> have found that the variance <graphic file="1471-2105-7-S4-S8-i4.gif"/> (for i = 1, 2) of expression values for gene <it>j </it>may depend on the mean expression value <it>&#956;</it><sub><it>i</it></sub>. Hence, selecting a more suitable method to process the unbalanced data can be very helpful to the result of the analysis. For example, given unbalanced data with unequal variances, one can improve the result of the analysis by selecting a suitable method from the six methods. However, it is quite possible that none of the six methods is suitable for a given unbalanced data set, in which case new methods better suited to unbalanced data are required.</p>
			<p>It should be noted that this paper does not consider the problem of determining the sample size for detecting differentially expressed genes in microarray data. An interesting topic is how to allocate samples between the two groups in order to maximize a method's performance under the constraint of a given total number of samples <it>n</it><sub>1 </sub>+ <it>n</it><sub>2</sub>.</p>
			<p>The results of this paper are based on six popular and typical methods for identifying differentially expressed genes, including both parametric and nonparametric methods. The similar effect of sample imbalance on both kinds of methods leads us to believe that the findings in this paper should hold, at least qualitatively, more generally. The two proposed evaluation models can also be used to compare and evaluate other methods.</p>
		</sec>
		<sec>
			<st>
				<p>Conclusion</p>
			</st>
			<p>The experiments in this paper demonstrate that sample imbalance has a great effect on identifying differentially expressed genes and that the two proposed models are effective in quantifying this effect. Moreover, different methods perform differently on unbalanced data, and in our experiments no single method was suitable for all unbalanced data. Among the six methods, the Welch t-test appears to perform best when the group with the larger variance also has more samples than the group with the smaller variance, while the Regularized t-test and SAM outperform the others on unbalanced data in the other cases. In conclusion, the two proposed evaluation models and the results provide some help in selecting a suitable method to process unbalanced data.</p>
			<p>In future work, we will apply the evaluation models to more methods, for example methods based on the False Discovery Rate. Furthermore, we will investigate the problem of determining the sample size that maximizes the performance of a given method for selecting differentially expressed genes.</p>
		</sec>
		<sec>
			<st>
				<p>Methods</p>
			</st>
			<p>First, we introduce some notation used in this paper. We assume there are <it>n </it>samples in the gene expression data and that these <it>n </it>samples fall into two nonoverlapping categories named class one (<it>C</it><sub>1</sub>) and class two (<it>C</it><sub>2</sub>). In each sample, the expression values of <it>p </it>genes have been measured. The gene expression data may then be represented by an <it>n </it>&#215; <it>p </it>matrix</p>
			<p><it>A</it><sub><it>n </it>&#215; <it>p </it></sub>= (<it>a</it><sub><it>ij</it></sub>)<sub><it>n </it>&#215; <it>p</it></sub>,</p>
			<p>where the element <it>a</it><sub><it>ij </it></sub>is the expression value of gene <it>j </it>in sample <it>i</it>. The rows of <it>A </it>correspond to samples, and the <it>i-th </it>row vector of <it>A </it>is called the expression profile of the <it>i-th </it>sample. We assume that <it>n</it><sub><it>k</it></sub>, <graphic file="1471-2105-7-S4-S8-i5.gif"/><sub><it>k</it></sub>(<it>j</it>) and <it>S</it><sub><it>k</it></sub>(<it>j</it>) are the number of samples, the sample mean and the sample variance of gene <it>j </it>in the class <it>C</it><sub><it>k</it></sub>, respectively, where <it>k </it>= 1, 2.</p>
			<sec>
				<st>
					<p>Basic concepts</p>
				</st>
				<sec>
					<st>
						<p>Definition 1 (Sample Ratio)</p>
					</st>
					<p><it>Given a gene expression data set, let n</it><sub><it>k</it></sub><it> denote the number of samples in class C</it><sub><it>k</it></sub><it>, k = 1, 2. Then the Sample Ratio, denoted by SR, is defined as n</it><sub>1</sub><it>/n</it><sub>2</sub><it>, i.e. SR = n</it><sub>1</sub><it>/n</it><sub>2</sub>.</p>
					<p>We use the Sample Ratio (SR) to measure the degree of sample imbalance between two groups. As revealed by definition 1, the further the value of SR departs from 1, the more serious the degree of sample imbalance is.</p>
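					<p>As an illustration of Definition 1, the following minimal sketch computes SR and recovers the class sizes for a given SR under the two settings used in the experiments: the size of class one held fixed (evaluation model 1), or the total number of samples held fixed (evaluation model 2). The function names, the argument layout and the default value of 60 are assumptions made for this example.</p>

```python
from fractions import Fraction


def sample_ratio(n1, n2):
    """Definition 1: SR = n1 / n2, kept as an exact fraction."""
    if not n2 > 0:
        raise ValueError("n2 must be positive")
    return Fraction(n1, n2)


def group_sizes(sr, model, fixed=60):
    """Return (n1, n2) for a target SR.

    model 1: n1 is held at `fixed`; model 2: n1 + n2 is held at `fixed`.
    Raises ValueError when the requested SR does not give whole class sizes.
    """
    sr = Fraction(str(sr))  # exact value, e.g. "7.5" becomes 15/2
    if model == 1:
        n1 = Fraction(fixed)
        n2 = n1 / sr
    elif model == 2:
        n2 = Fraction(fixed) / (sr + 1)
        n1 = Fraction(fixed) - n2
    else:
        raise ValueError("model must be 1 or 2")
    if n1.denominator != 1 or n2.denominator != 1:
        raise ValueError("SR does not yield integer class sizes")
    return int(n1), int(n2)


if __name__ == "__main__":
    print(group_sizes(6, model=1))  # class one fixed at 60
    print(group_sizes(5, model=2))  # 60 samples overall
```

					<p>Exact fractions are used so that non-integer ratios such as 7.5 or 6.5 are handled without rounding error.</p>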
					<p>A key question in this paper is how to evaluate the performance of a method for identifying differentially expressed genes, that is, how to evaluate the solution produced by the method. For the thousands of genes in a real gene expression data set, it is generally unknown which ones are differentially expressed. This makes it difficult to assess a method directly and rigorously. In contrast, the true solution is known for simulated data, so simulated data are introduced in order to assess method performance directly. Furthermore, several measures are introduced to quantify the quality of a method's solution. Different measures are applicable in different situations, depending on whether a true solution is known or not.</p>
					<p>First, we present a metric to assess the performance of a method for selecting differentially expressed genes on real gene expression data.</p>
					<p>Given real data, the whole data set is treated as the <b>original data (OD)</b>, and the <b>artificial data (AD)</b>, which satisfies the given parameters <it>n</it><sub>1 </sub>and <it>n</it><sub>2</sub>, is generated by randomly drawing samples from the original data. The <it>Overlap Rate</it>, denoted by OR, is then calculated according to the following definition.</p>
				</sec>
				<sec>
					<st>
						<p>Definition 2 (Overlap Rate)</p>
					</st>
					<p><it>Let DEG</it><sub><it>OD</it></sub><it> and DEG</it><sub><it>AD</it></sub><it> be the sets of <ul>D</ul>ifferentially <ul>E</ul>xpressed <ul>G</ul>enes identified by some method on the original data (OD) and the artificial data (AD), respectively, then the Overlap Rate (OR) is defined as OR </it>= |<it>DEG</it><sub><it>OD </it></sub>&#8745; <it>DEG</it><sub><it>AD</it></sub>|/|<it>DEG</it><sub><it>OD</it></sub>|.</p>
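					<p>Definition 2 can be sketched in a few lines. The example below assumes that genes and samples are identified by hashable labels and that the artificial data are drawn without replacement within each class; both are assumptions of this illustration, and the helper names are hypothetical.</p>

```python
import random


def draw_artificial_data(class1_ids, class2_ids, n1, n2, seed=None):
    """Randomly draw n1 and n2 sample labels (without replacement) from each class."""
    rng = random.Random(seed)
    return rng.sample(list(class1_ids), n1), rng.sample(list(class2_ids), n2)


def overlap_rate(deg_od, deg_ad):
    """Definition 2: OR = |DEG_OD intersect DEG_AD| / |DEG_OD|."""
    deg_od, deg_ad = set(deg_od), set(deg_ad)
    if not deg_od:
        raise ValueError("DEG_OD must not be empty")
    return len(deg_od.intersection(deg_ad)) / len(deg_od)


if __name__ == "__main__":
    # Toy gene labels: two of the four genes found on OD are also found on AD.
    print(overlap_rate({"g1", "g2", "g3", "g4"}, {"g2", "g4", "g9"}))
```

					<p>Note that OR is normalized by the number of genes found on the original data only, so genes selected exclusively on the artificial data do not lower it.</p>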
					<p>To assess method performance on the simulated data, we compare the true solution with the suggested solution as follows. Given simulated data with <it>p </it>genes, any solution can be represented by a binary 1 &#215; <it>p </it>vector whose <it>i</it>-th entry is 1 if and only if the <it>i</it>-th gene is a differentially expressed gene (a positive gene). Let <it>T </it>and <it>S </it>be the true solution and the solution suggested by a method, respectively, and let <it>n</it><sub><it>xy </it></sub>denote the number of genes <it>i </it>for which <it>T</it>(<it>i</it>) = <it>x </it>and <it>S</it>(<it>i</it>) = <it>y</it>, where <it>x, y </it>= 0 or 1. Thus, <it>n</it><sub>11 </sub>is the number of true positive genes, <it>n</it><sub>01 </sub>is the number of false positive genes, <it>n</it><sub>00 </sub>is the number of true negative genes, and <it>n</it><sub>10 </sub>is the number of false negative genes. Two metrics, <it>Recall Rate </it>and <it>Precision Rate</it>, are then introduced to measure the performance of a method.</p>
				</sec>
				<sec>
					<st>
						<p>Definition 3 (Recall Rate)</p>
					</st>
					<p><it>Let S and T be the solution suggested by a differentially expressed gene selection method and the true solution, respectively. Then the Recall Rate (RR) is defined as RR </it>= <it>n<sub>11</sub>/</it>(<it>n</it><sub>10 </sub>+ <it>n</it><sub>11</sub>).</p>
				</sec>
				<sec>
					<st>
						<p>Definition 4 (Precision Rate)</p>
					</st>
					<p><it>Let S and T be the solution suggested by a differentially expressed gene selection method and the true solution, respectively. Then the Precision Rate (PR) is defined as PR = n</it><sub>11</sub>/(<it>n</it><sub>01</sub>+<it>n</it><sub>11</sub>).</p>
					<p>From Definitions 3 and 4, we can see that the Recall Rate focuses on false negatives while the Precision Rate focuses on false positives. False negatives and false positives are two distinct concerns in selecting differentially expressed genes, and the two generally cannot be minimized at the same time. For a particular problem specification, therefore, one can choose either Recall Rate or Precision Rate as the main focus.</p>
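					<p>The counts <it>n</it><sub><it>xy</it></sub> and the two rates of Definitions 3 and 4 can be sketched as follows. The function names and the example vectors are hypothetical; the sketch assumes that both solutions are given as 0/1 lists of equal length and that at least one gene is truly positive and at least one gene is selected, since otherwise a denominator is zero.</p>

```python
def confusion_counts(t, s):
    """Return n[(x, y)]: the number of genes i with T(i) = x and S(i) = y."""
    if len(t) != len(s):
        raise ValueError("T and S must have the same length p")
    n = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for ti, si in zip(t, s):
        n[(ti, si)] += 1
    return n


def recall_rate(t, s):
    """RR = n11 / (n10 + n11): fraction of truly positive genes that were selected."""
    n = confusion_counts(t, s)
    return n[(1, 1)] / (n[(1, 0)] + n[(1, 1)])


def precision_rate(t, s):
    """PR = n11 / (n01 + n11): fraction of selected genes that are truly positive."""
    n = confusion_counts(t, s)
    return n[(1, 1)] / (n[(0, 1)] + n[(1, 1)])


# Hypothetical true solution T and suggested solution S for p = 8 genes.
T = [1, 1, 1, 0, 0, 0, 0, 1]
S = [1, 0, 1, 1, 1, 0, 0, 1]
print(recall_rate(T, S))     # n11 = 3, n10 = 1
print(precision_rate(T, S))  # n11 = 3, n01 = 2
```

					<p>In this example one truly positive gene is missed, which lowers RR, and two negative genes are selected, which lowers PR.</p>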
				</sec>
			</sec>
			<sec>
				<st>
					<p>Random sampling</p>
				</st>
				<p>For a specific method of selecting differentially expressed genes and a given data set with <it>n</it><sub>1 </sub>samples in class <it>C</it><sub>1 </sub>and <it>n</it><sub>2 </sub>samples in class <it>C</it><sub>2</sub>, we obtain exactly one value of each of the metrics OR, RR and PR. Hence, given a method and a set of <it>m</it> gene expression data sets, the method yields a set of <it>m</it> ORs (or RRs, or PRs). The ideal way to evaluate a method is to run it on every gene expression data set that satisfies the given parameters <it>n</it><sub>1 </sub>and <it>n</it><sub>2 </sub>and to calculate the average value of each metric. However, the number of data sets with parameters <it>n</it><sub>1 </sub>and <it>n</it><sub>2 </sub>may be very large or infinite. For example, if a real microarray data set has 50 and 30 samples in classes <it>C</it><sub>1 </sub>and <it>C</it><sub>2</sub>, respectively, then the number of different artificial data sets with parameters <it>n</it><sub>1 </sub>= 40 and <it>n</it><sub>2 </sub>= 20 is <graphic file="1471-2105-7-S4-S8-i6.gif"/> &gt; 3 &#215; 10<sup>17</sup>. To reduce the computational cost and avoid the problem of infinity, a feasible alternative is to estimate the expected value of each metric and its approximate confidence interval (or Error Limit) from a random sample of such data sets.</p>
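				<p>The size of this count can be checked with a short calculation. The sketch below assumes that the artificial data are drawn without replacement and independently within each class, so that the count is the product of two binomial coefficients; the formula in the original graphic is not reproduced here, and the function name is hypothetical.</p>

```python
from math import comb


def count_artificial_datasets(big_n1, big_n2, n1, n2):
    """Number of ways to choose n1 of big_n1 samples in C1 and n2 of big_n2 samples in C2."""
    return comb(big_n1, n1) * comb(big_n2, n2)


# The example from the text: 50 and 30 available samples, n1 = 40 and n2 = 20.
total = count_artificial_datasets(50, 30, 40, 20)
print(total > 3 * 10**17)  # True
```

				<p>Even for this modest data set, running a method on every possible artificial data set is clearly infeasible, which motivates the sampling approach.</p>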
				<sec>
					<st>
						<p>Lemma 1</p>
					</st>
					<p><abbrgrp><abbr bid="B28">28</abbr></abbrgrp> <it>Suppose that population X has mean &#956; and finite variance &#963;</it><sup>2</sup>, <it>and that X</it><sub>1</sub>, <it>X</it><sub>2</sub><it>, ..., X</it><sub><it>n </it></sub><it>is an independent random sample of size n from the population X. Then the sample mean </it><graphic file="1471-2105-7-S4-S8-i7.gif"/><it> is an unbiased estimate of &#956; and the sample variance S</it><sup>2</sup><it> is an unbiased estimate of &#963;</it><sup>2</sup>. <it>Moreover, the variance of </it><graphic file="1471-2105-7-S4-S8-i7.gif"/>, <it>denoted by D</it>(<graphic file="1471-2105-7-S4-S8-i7.gif"/>), <it>satisfies D</it>(<graphic file="1471-2105-7-S4-S8-i7.gif"/>) = <it>&#963;</it><sup>2</sup>/<it>n</it>, <it>where</it></p>
					<p>
						<graphic file="1471-2105-7-S4-S8-i8.gif"/>
					</p>
					<p>According to Lemma 1, we can use the sample mean <graphic file="1471-2105-7-S4-S8-i7.gif"/> to estimate the population mean <it>&#956; </it>and calculate its approximate confidence interval. In survey sampling, the exact distribution of the estimate (i.e. <graphic file="1471-2105-7-S4-S8-i7.gif"/>) is unknown. However, according to the central limit theorem, we can expect the sampling distribution of <graphic file="1471-2105-7-S4-S8-i7.gif"/> to be approximately normal with mean <it>E</it>(<graphic file="1471-2105-7-S4-S8-i7.gif"/>) = <it>&#956; </it>and variance <it>D</it>(<graphic file="1471-2105-7-S4-S8-i7.gif"/>). That is,</p>
					<p>
						<graphic file="1471-2105-7-S4-S8-i9.gif"/>
					</p>
					<p>As a result, a (1-<it>&#945;</it>)100% approximate confidence interval for the estimate of <it>&#956; </it>is <graphic file="1471-2105-7-S4-S8-i10.gif"/>. In practice, the standard deviation <it>&#963; </it>of the sampled population is typically unknown. Replacing <it>&#963; </it>by <it>S </it>leads to the corresponding estimate <graphic file="1471-2105-7-S4-S8-i11.gif"/>, which is referred to as the standard error (SE) of <graphic file="1471-2105-7-S4-S8-i7.gif"/>. Therefore, a feasible confidence interval for <it>&#956; </it>at significance level <it>&#945; </it>is [<graphic file="1471-2105-7-S4-S8-i7.gif"/> - <graphic file="1471-2105-7-S4-S8-i11.gif"/><it>z<sub>&#945;</sub></it>, <graphic file="1471-2105-7-S4-S8-i7.gif"/> + <graphic file="1471-2105-7-S4-S8-i11.gif"/><it>z<sub>&#945;</sub></it>] and the approximate Error Limit (EL) is <graphic file="1471-2105-7-S4-S8-i11.gif"/><it>z<sub>&#945;</sub></it>. For samples of size <it>n </it>&#8805; 30, sampling theory gives good results for most populations, regardless of their shape <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>.</p>
					<p>When the population is finite, the only change is that the variance <it>D</it>(<graphic file="1471-2105-7-S4-S8-i7.gif"/>) is multiplied by the factor 1 - <it>f</it>, where <it>f </it>= <it>n/N </it>is the sampling fraction and <it>N </it>is the size of the population. The factor 1 - <it>f </it>is called the finite population correction (fpc), and the confidence interval becomes <graphic file="1471-2105-7-S4-S8-i12.gif"/>. In practice, the fpc can be ignored whenever the sampling fraction does not exceed 5% <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>.</p>
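					<p>The estimation procedure of this subsection can be sketched with the Python standard library. The sketch rests on two assumptions that the text does not fix: the critical value is taken as the two-sided normal quantile at level 1 - &#945;/2, and the metric values in the example are made up. The function name and its parameters are hypothetical.</p>

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(values, alpha=0.05, population_size=None):
    """Return (sample mean, (lower, upper), error limit) for a list of metric values."""
    n = len(values)
    x_bar = mean(values)
    # stdev uses the n - 1 denominator, so its square is the unbiased S^2 of Lemma 1.
    se = stdev(values) / sqrt(n)
    if population_size is not None:
        # Finite population correction: the variance is multiplied by 1 - f, f = n / N.
        f = n / population_size
        se *= sqrt(1 - f)
    # Assumption: two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    error_limit = z * se
    return x_bar, (x_bar - error_limit, x_bar + error_limit), error_limit


# Made-up OR values from repeated random draws of artificial data.
or_values = [0.62, 0.58, 0.66, 0.61, 0.59, 0.64, 0.60, 0.63]
estimate, interval, el = mean_confidence_interval(or_values)
print(estimate, interval, el)
```

					<p>Passing population_size applies the finite population correction; for very large populations, such as the count in the example above, the correction is negligible and can be omitted.</p>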
				</sec>
			</sec>
			<sec>
				<st>
					<p>Evaluation models</p>
				</st>
				<p>To investigate the effect of sample imbalance on the selection of differentially expressed genes, one simply needs to consider how the performance of a method changes in response to different sample ratios (SRs), because the sample ratio measures the degree of sample imbalance between two groups. We therefore propose the following two evaluation models.</p>
				<sec>
					<st>
						<p>Evaluation model 1</p>
					</st>
					<p><it>Fix the number of samples in one class at a constant C, for instance n</it><sub>1</sub><it> = C, and randomly create the artificial data (or the simulated data) with different Sample Ratios. Then compare the results of the method on the data across the various Sample Ratios</it>.</p>
				</sec>
				<sec>
					<st>
						<p>Evaluation model 2</p>
					</st>
					<p><it>Fix the total number of samples in the artificial data (or the simulated data) at a constant C, i.e. n</it><sub>1</sub>+<it>n</it><sub>2 </sub>&#8801;<it> C, and randomly create the artificial data (or the simulated data) with different Sample Ratios. Then evaluate the method on these random data for each particular value of SR</it>.</p>
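					<p>The two evaluation models differ only in how the class sizes are chosen for a target Sample Ratio. The following sketch enumerates the class sizes for each model and draws one artificial data set by random sampling. It assumes, purely for illustration, that SR is the ratio <it>n</it><sub>1</sub>/<it>n</it><sub>2</sub>; Definition 1 should be consulted for the exact form. All function names are hypothetical.</p>

```python
import random


def model1_sizes(c, ratios):
    """Evaluation model 1: n1 = C is fixed and n2 is chosen so that n1 / n2 is near each ratio."""
    return [(c, max(1, round(c / r))) for r in ratios]


def model2_sizes(c, ratios):
    """Evaluation model 2: n1 + n2 = C, split so that n1 / n2 is near each ratio."""
    sizes = []
    for r in ratios:
        n1 = min(c - 1, max(1, round(c * r / (1 + r))))
        sizes.append((n1, c - n1))
    return sizes


def draw_artificial_data(class1_ids, class2_ids, n1, n2, rng):
    """Randomly draw n1 and n2 sample identifiers, without replacement, from the two classes."""
    return rng.sample(class1_ids, n1), rng.sample(class2_ids, n2)


print(model1_sizes(20, [1, 2, 4]))
print(model2_sizes(60, [1, 2, 3]))

# One artificial data set drawn from hypothetical original data with 50 and 30 samples.
rng = random.Random(0)
chosen1, chosen2 = draw_artificial_data(list(range(50)), list(range(30)), 40, 20, rng)
print(len(chosen1), len(chosen2))
```

					<p>Under model 1 the total number of samples changes with the ratio, whereas under model 2 it stays constant, so the two models separate the effect of imbalance from the effect of total sample size in different ways.</p>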
				</sec>
			</sec>
			<sec>
				<st>
					<p>Calculating cutoff point</p>
				</st>
				<p>For a parametric method, the cutoff point at significance level <it>&#945; </it>is calculated from the assumed distribution. For a nonparametric method and a given significance level <it>&#945;</it>, following the spirit of SAM, we find the 100(1 - <it>&#945;</it>)% quantile of the null distribution, denoted by <graphic file="1471-2105-7-S4-S8-i13.gif"/>, using the following formula</p>
				<p>
					<graphic file="1471-2105-7-S4-S8-i14.gif"/>
				</p>
				<p>where <it>B</it> is the number of permutations and <graphic file="1471-2105-7-S4-S8-i15.gif"/> is the value of the statistic for the <it>i</it>-th gene in the <it>b</it>-th permutation. The quantile value <graphic file="1471-2105-7-S4-S8-i13.gif"/> is then used as the cutoff point for that statistic to select differentially expressed genes.</p>
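				<p>One plausible reading of this permutation procedure is sketched below: the class labels are permuted <it>B</it> times, the statistic is recomputed for every gene in every permutation, the resulting values are pooled into an empirical null distribution, and its 100(1 - &#945;)% quantile is returned. The exact formula is given in the graphic above and is not reproduced here, so the pooling and the simple empirical quantile are assumptions. The statistic mean_difference is only a placeholder for whichever statistic the method uses, and all names are hypothetical.</p>

```python
import random
from statistics import mean


def mean_difference(values, labels):
    """Placeholder statistic: absolute difference of the two class means for one gene."""
    g1 = [v for v, lab in zip(values, labels) if lab == 1]
    g2 = [v for v, lab in zip(values, labels) if lab == 2]
    return abs(mean(g1) - mean(g2))


def permutation_cutoff(data, labels, alpha=0.05, n_perm=100,
                       statistic=mean_difference, seed=0):
    """Empirical 100(1 - alpha)% quantile of the statistic over all genes and permutations."""
    rng = random.Random(seed)
    shuffled = list(labels)
    null_values = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute the class labels across samples
        null_values.extend(statistic(row, shuffled) for row in data)
    null_values.sort()
    k = min(len(null_values) - 1, int((1 - alpha) * len(null_values)))
    return null_values[k]


# Tiny made-up expression matrix: 3 genes (rows) by 6 samples (columns).
data = [
    [1.0, 1.2, 0.9, 3.1, 2.9, 3.3],
    [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
    [0.5, 0.7, 0.6, 0.4, 0.8, 0.5],
]
labels = [1, 1, 1, 2, 2, 2]
cutoff = permutation_cutoff(data, labels, alpha=0.05, n_perm=200)
selected = [i for i, row in enumerate(data) if mean_difference(row, labels) > cutoff]
print(cutoff, selected)
```

				<p>Genes whose observed statistic exceeds the cutoff are reported as differentially expressed; a smaller <it>&#945;</it> gives a cutoff that is at least as large and therefore a list that is no longer.</p>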
			</sec>
		</sec>
		<sec>
			<st>
				<p>Authors' contributions</p>
			</st>
			<p>K.Y. conceived the study, performed the implementations and drafted the manuscript. H.G. critically read and revised the final manuscript. J.L. supervised the whole work and finalized the manuscript. All authors read and approved the final manuscript.</p>
		</sec>
	</bdy>
	<bm>
		<ack>
			<sec>
				<st>
					<p>Acknowledgements</p>
				</st>
				<p>We would like to thank Jing Xu, Chaokun Wang, Shenfei Shi, and George for thoughtful comments and discussions. This work was supported partly by the 863 Research Plan of China under Grant No. 2004AA231071 and the NSF of China under Grant No. 60533110.</p>
				<p>This article has been published as part of <it>BMC Bioinformatics</it> Volume 7, Supplement 4, 2006: Symposium of Computations in Bioinformatics and Bioscience (SCBB06). The full contents of the supplement are available online at <url>http://www.biomedcentral.com/1471-2105/7?issue=S4</url>.</p>
			</sec>
		</ack>
		<refgrp>
			<bibl id="B1">
				<title>
					<p>Quantitative monitoring of gene expression patterns with a complementary DNA microarray</p>
				</title>
				<aug>
					<au>
						<snm>Schena</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Shalon</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Davis</snm>
						<fnm>RW</fnm>
					</au>
					<au>
						<snm>Brown</snm>
						<fnm>PO</fnm>
					</au>
				</aug>
				<source>Science</source>
				<pubdate>1995</pubdate>
				<volume>270</volume>
				<fpage>467</fpage>
				<lpage>470</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmpid" link="fulltext">7569999</pubid>
						<pubid idtype="doi">10.1126/science.270.5235.467</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B2">
				<title>
					<p>Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring</p>
				</title>
				<aug>
					<au>
						<snm>Golub</snm>
						<fnm>TR</fnm>
					</au>
					<au>
						<snm>Slonim</snm>
						<fnm>DK</fnm>
					</au>
					<au>
						<snm>Tamayo</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Huard</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Gaasenbeek</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Mesirov</snm>
						<fnm>JP</fnm>
					</au>
					<au>
						<snm>Coller</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Loh</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Downing</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Caligiuri</snm>
						<fnm>MA</fnm>
					</au>
					<au>
						<snm>Bloomfield</snm>
						<fnm>CD</fnm>
					</au>
					<au>
						<snm>Lander</snm>
						<fnm>ES</fnm>
					</au>
				</aug>
				<source>Science</source>
				<pubdate>1999</pubdate>
				<volume>286</volume>
				<issue>5439</issue>
				<fpage>531</fpage>
				<lpage>537</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1126/science.286.5439.531</pubid>
						<pubid idtype="pmpid" link="fulltext">10521349</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B3">
				<title>
					<p>Medical applications of microarray technologies: a regulatory science perspective</p>
				</title>
				<aug>
					<au>
						<snm>Petricoin</snm>
						<fnm>EF</fnm>
						<suf>III</suf>
					</au>
					<au>
						<snm>Hackett</snm>
						<fnm>JL</fnm>
					</au>
					<au>
						<snm>Lesko</snm>
						<fnm>LJ</fnm>
					</au>
					<au>
						<snm>Puri</snm>
						<fnm>RK</fnm>
					</au>
					<au>
						<snm>Gutman</snm>
						<fnm>SI</fnm>
					</au>
					<au>
						<snm>Chumakov</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Woodcock</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Feigal</snm>
						<fnm>DW</fnm>
					</au>
					<au>
						<snm>Zoon</snm>
						<fnm>KG</fnm>
					</au>
					<au>
						<snm>Sistare</snm>
						<fnm>FD</fnm>
					</au>
				</aug>
				<source>Nature Genetics</source>
				<pubdate>2002</pubdate>
				<volume>32</volume>
				<issue>supplement</issue>
				<fpage>474</fpage>
				<lpage>479</lpage>
				<xrefbib>
					<pubid idtype="doi">10.1038/ng1029</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B4">
				<title>
					<p>Ratio-based decisions and the quantitative analysis of cDNA microarray images</p>
				</title>
				<aug>
					<au>
						<snm>Chen</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Dougherty</snm>
						<fnm>ER</fnm>
					</au>
					<au>
						<snm>Bittner</snm>
						<fnm>ML</fnm>
					</au>
				</aug>
				<source>J Biomed Optics</source>
				<pubdate>1997</pubdate>
				<volume>2</volume>
				<fpage>364</fpage>
				<lpage>367</lpage>
				<xrefbib>
					<pubid idtype="doi">10.1117/12.281504</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B5">
				<title>
					<p>A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments</p>
				</title>
				<aug>
					<au>
						<snm>Pan</snm>
						<fnm>W</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2002</pubdate>
				<volume>18</volume>
				<fpage>546</fpage>
				<lpage>554</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/18.4.546</pubid>
						<pubid idtype="pmpid" link="fulltext">12016052</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B6">
				<title>
					<p>Statistical tests for differential expression in cDNA microarray experiments</p>
				</title>
				<aug>
					<au>
						<snm>Cui</snm>
						<fnm>X</fnm>
					</au>
					<au>
						<snm>Churchill</snm>
						<fnm>GA</fnm>
					</au>
				</aug>
				<source>Genome Biol</source>
				<pubdate>2003</pubdate>
				<volume>4</volume>
				<fpage>210</fpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">154570</pubid>
						<pubid idtype="pmpid" link="fulltext">12702200</pubid>
						<pubid idtype="doi">10.1186/gb-2003-4-4-210</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B7">
				<title>
					<p>Microarray expression profiling identifies genes with altered expression in HDL-deficient mice</p>
				</title>
				<aug>
					<au>
						<snm>Callow</snm>
						<fnm>MJ</fnm>
					</au>
					<au>
						<snm>Dudoit</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Gong</snm>
						<fnm>EL</fnm>
					</au>
					<au>
						<snm>Speed</snm>
						<fnm>TP</fnm>
					</au>
					<au>
						<snm>Rubin</snm>
						<fnm>EM</fnm>
					</au>
				</aug>
				<source>Genome Research</source>
				<pubdate>2000</pubdate>
				<volume>10</volume>
				<fpage>2022</fpage>
				<lpage>2029</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">313086</pubid>
						<pubid idtype="pmpid" link="fulltext">11116096</pubid>
						<pubid idtype="doi">10.1101/gr.10.12.2022</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B8">
				<title>
					<p>A Bayesian Framework for the Analysis of Microarray Expression Data: Regularized t-test and Statistical Inferences of Gene Changes</p>
				</title>
				<aug>
					<au>
						<snm>Baldi</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Long</snm>
						<fnm>AD</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2001</pubdate>
				<volume>17</volume>
				<fpage>509</fpage>
				<lpage>519</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/17.6.509</pubid>
						<pubid idtype="pmpid" link="fulltext">11395427</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B9">
				<title>
					<p>GEST: a gene expression search tool based on a novel Bayesian similarity metric</p>
				</title>
				<aug>
					<au>
						<snm>Hunter</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Taylor</snm>
						<fnm>RC</fnm>
					</au>
					<au>
						<snm>Leach</snm>
						<fnm>SM</fnm>
					</au>
					<au>
						<snm>Simon</snm>
						<fnm>R</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2001</pubdate>
				<volume>17</volume>
				<issue>Suppl 1</issue>
				<fpage>S115</fpage>
				<lpage>S122</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmpid" link="fulltext">11473000</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B10">
				<title>
					<p>Nonparametric methods for identifying differentially expressed genes in microarray data</p>
				</title>
				<aug>
					<au>
						<snm>Troyanskaya</snm>
						<fnm>OG</fnm>
					</au>
					<au>
						<snm>Garber</snm>
						<fnm>ME</fnm>
					</au>
					<au>
						<snm>Brown</snm>
						<fnm>PO</fnm>
					</au>
					<au>
						<snm>Botstein</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Altman</snm>
						<fnm>RB</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2002</pubdate>
				<volume>18</volume>
				<issue>11</issue>
				<fpage>1454</fpage>
				<lpage>1461</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/18.11.1454</pubid>
						<pubid idtype="pmpid" link="fulltext">12424116</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B11">
				<title>
					<p>Significance analysis of microarrays applied to the ionizing radiation response</p>
				</title>
				<aug>
					<au>
						<snm>Tusher</snm>
						<fnm>VG</fnm>
					</au>
					<au>
						<snm>Tibshirani</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Chu</snm>
						<fnm>G</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci USA</source>
				<pubdate>2001</pubdate>
				<volume>98</volume>
				<issue>9</issue>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">33173</pubid>
						<pubid idtype="pmpid" link="fulltext">11309499</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B12">
				<title>
					<p>Empirical Bayes analysis of a microarray experiment</p>
				</title>
				<aug>
					<au>
						<snm>Efron</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Tibshirani</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Storey</snm>
						<fnm>JD</fnm>
					</au>
					<au>
						<snm>Tusher</snm>
						<fnm>V</fnm>
					</au>
				</aug>
				<source>Journal of the American Statistical Association</source>
				<pubdate>2001</pubdate>
				<volume>96</volume>
				<fpage>1151</fpage>
				<lpage>1160</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1198/016214501753382129</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B13">
				<title>
					<p>A mixture model approach to detecting differentially expressed genes with microarray data</p>
				</title>
				<aug>
					<au>
						<snm>Pan</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Lin</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Le</snm>
						<fnm>CT</fnm>
					</au>
				</aug>
				<source>Funct Integr Genomics</source>
				<pubdate>2003</pubdate>
				<volume>3</volume>
				<fpage>117</fpage>
				<lpage>124</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1007/s10142-003-0085-7</pubid>
						<pubid idtype="pmpid" link="fulltext">12844246</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B14">
				<title>
					<p>Modified nonparametric approaches to detecting differentially expressed genes in replicated microarray experiments</p>
				</title>
				<aug>
					<au>
						<snm>Zhao</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Pan</snm>
						<fnm>W</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2003</pubdate>
				<volume>19</volume>
				<issue>9</issue>
				<fpage>1046</fpage>
				<lpage>1054</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/btf879</pubid>
						<pubid idtype="pmpid" link="fulltext">12801864</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B15">
				<title>
					<p>On the use of permutation in and the performance of a class of nonparametric methods to detect differential gene expression</p>
				</title>
				<aug>
					<au>
						<snm>Pan</snm>
						<fnm>W</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2003</pubdate>
				<volume>19</volume>
				<fpage>1333</fpage>
				<lpage>1340</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/btg167</pubid>
						<pubid idtype="pmpid" link="fulltext">12874044</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B16">
				<title>
					<p>Systematic Variation in Gene Expression Patterns in Human Cancer Cell Lines</p>
				</title>
				<aug>
					<au>
						<snm>Ross</snm>
						<fnm>DT</fnm>
					</au>
					<au>
						<snm>Scherf</snm>
						<fnm>U</fnm>
					</au>
					<au>
						<snm>Eisen</snm>
						<fnm>MB</fnm>
					</au>
					<au>
						<snm>Perou</snm>
						<fnm>CM</fnm>
					</au>
					<au>
						<snm>Rees</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Spellman</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Iyer</snm>
						<fnm>V</fnm>
					</au>
					<au>
						<snm>Jeffrey</snm>
						<fnm>SS</fnm>
					</au>
					<au>
						<snm>Van de Rijn</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Waltham</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Pergamenschikov</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Lee</snm>
						<fnm>JC</fnm>
					</au>
					<au>
						<snm>Lashkari</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Shalon</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Myers</snm>
						<fnm>TG</fnm>
					</au>
					<au>
						<snm>Weinstein</snm>
						<fnm>JN</fnm>
					</au>
					<au>
						<snm>Botstein</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Brown</snm>
						<fnm>PO</fnm>
					</au>
				</aug>
				<source>Nature Genetics</source>
				<pubdate>2000</pubdate>
				<volume>24</volume>
				<fpage>227</fpage>
				<lpage>234</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1038/73432</pubid>
						<pubid idtype="pmpid" link="fulltext">10700174</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B17">
				<title>
					<p>Different Types of Diffuse Large B-Cell Lymphoma Identified by Gene Expression Profiling</p>
				</title>
				<aug>
					<au>
						<snm>Alizadeh</snm>
						<fnm>AA</fnm>
					</au>
					<au>
						<snm>Eisen</snm>
						<fnm>MB</fnm>
					</au>
					<au>
						<snm>Davis</snm>
						<fnm>RE</fnm>
					</au>
					<au>
						<snm>Ma</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Lossos</snm>
						<fnm>IS</fnm>
					</au>
					<au>
						<snm>Rosenwald</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Boldrick</snm>
						<fnm>JC</fnm>
					</au>
					<au>
						<snm>Sabet</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Tran</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Yu</snm>
						<fnm>X</fnm>
					</au>
					<au>
						<snm>Powell</snm>
						<fnm>JI</fnm>
					</au>
					<au>
						<snm>Yang</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Marti</snm>
						<fnm>GE</fnm>
					</au>
					<au>
						<snm>Moore</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Hudson</snm>
						<fnm>J Jr</fnm>
					</au>
					<au>
						<snm>Lu</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Lewis</snm>
						<fnm>DB</fnm>
					</au>
					<au>
						<snm>Tibshirani</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Sherlock</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Chan</snm>
						<fnm>WC</fnm>
					</au>
					<au>
						<snm>Greiner</snm>
						<fnm>TC</fnm>
					</au>
					<au>
						<snm>Weisenburger</snm>
						<fnm>DD</fnm>
					</au>
					<au>
						<snm>Armitage</snm>
						<fnm>JO</fnm>
					</au>
					<au>
						<snm>Warnke</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Levy</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Wilson</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Grever</snm>
						<fnm>MR</fnm>
					</au>
					<au>
						<snm>Byrd</snm>
						<fnm>JC</fnm>
					</au>
					<au>
						<snm>Botstein</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Brown</snm>
						<fnm>PO</fnm>
					</au>
					<au>
						<snm>Staudt</snm>
						<fnm>LM</fnm>
					</au>
				</aug>
				<source>Nature</source>
				<pubdate>2000</pubdate>
				<volume>403</volume>
				<fpage>503</fpage>
				<lpage>511</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1038/35000501</pubid>
						<pubid idtype="pmpid" link="fulltext">10676951</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B18">
				<title>
					<p>Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses</p>
				</title>
				<aug>
					<au>
						<snm>Bhattacharjee</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Richards</snm>
						<fnm>WG</fnm>
					</au>
					<au>
						<snm>Staunton</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Li</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Monti</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Vasa</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Ladd</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Beheshti</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Bueno</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Gillette</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Loda</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Weber</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Mark</snm>
						<fnm>EJ</fnm>
					</au>
					<au>
						<snm>Lander</snm>
						<fnm>ES</fnm>
					</au>
					<au>
						<snm>Wong</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Johnson</snm>
						<fnm>BE</fnm>
					</au>
					<au>
						<snm>Golub</snm>
						<fnm>TR</fnm>
					</au>
					<au>
						<snm>Sugarbaker</snm>
						<fnm>DJ</fnm>
					</au>
					<au>
						<snm>Meyerson</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci USA</source>
				<pubdate>2001</pubdate>
				<volume>98</volume>
				<fpage>13790</fpage>
				<lpage>13795</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1073/pnas.191502998</pubid>
						<pubid idtype="pmpid" link="fulltext">11707567</pubid>
						<pubid idtype="pmcid">61120</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B19">
				<title>
					<p>Molecular Classification of Human Carcinomas by Use of Gene Expression Signatures</p>
				</title>
				<aug>
					<au>
						<snm>Su</snm>
						<fnm>AI</fnm>
					</au>
					<au>
						<snm>Welsh</snm>
						<fnm>JB</fnm>
					</au>
					<au>
						<snm>Sapinoso</snm>
						<fnm>LM</fnm>
					</au>
					<au>
						<snm>Kern</snm>
						<fnm>SG</fnm>
					</au>
					<au>
						<snm>Dimitrov</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Lapp</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Schultz</snm>
						<fnm>PG</fnm>
					</au>
					<au>
						<snm>Powell</snm>
						<fnm>SM</fnm>
					</au>
					<au>
						<snm>Moskaluk</snm>
						<fnm>CA</fnm>
					</au>
					<au>
						<snm>Frierson</snm>
						<fnm>HF Jr</fnm>
					</au>
					<au>
						<snm>Hampton</snm>
						<fnm>GM</fnm>
					</au>
				</aug>
				<source>Cancer Research</source>
				<pubdate>2001</pubdate>
				<volume>61</volume>
				<fpage>7388</fpage>
				<lpage>7393</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmpid" link="fulltext">11606367</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B20">
				<title>
					<p>Multiclass cancer diagnosis using tumor gene expression signatures</p>
				</title>
				<aug>
					<au>
						<snm>Ramaswamy</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Tamayo</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Rifkin</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Mukherjee</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Yeang</snm>
						<fnm>CH</fnm>
					</au>
					<au>
						<snm>Angelo</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Ladd</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Reich</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Latulippe</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Mesirov</snm>
						<fnm>JP</fnm>
					</au>
					<au>
						<snm>Poggio</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Gerald</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Loda</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Lander</snm>
						<fnm>ES</fnm>
					</au>
					<au>
						<snm>Golub</snm>
						<fnm>TR</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci USA</source>
				<pubdate>2001</pubdate>
				<volume>98</volume>
				<fpage>15149</fpage>
				<lpage>15154</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1073/pnas.211566398</pubid>
						<pubid idtype="pmpid" link="fulltext">11742071</pubid>
						<pubid idtype="pmcid">64998</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B21">
				<title>
					<p>Gene expression patterns in human liver cancers</p>
				</title>
				<aug>
					<au>
						<snm>Chen</snm>
						<fnm>X</fnm>
					</au>
					<au>
						<snm>Cheung</snm>
						<fnm>ST</fnm>
					</au>
					<au>
						<snm>So</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Fan</snm>
						<fnm>ST</fnm>
					</au>
					<au>
						<snm>Barry</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Higgins</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Lai</snm>
						<fnm>KM</fnm>
					</au>
					<au>
						<snm>Ji</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Dudoit</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Ng</snm>
						<fnm>IO</fnm>
					</au>
					<au>
						<snm>van de Rijn</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Botstein</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Brown</snm>
						<fnm>PO</fnm>
					</au>
					<etal/>
				</aug>
				<source>Molecular Biology of the Cell</source>
				<pubdate>2002</pubdate>
				<volume>13</volume>
				<fpage>1929</fpage>
				<lpage>1939</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">117615</pubid>
						<pubid idtype="pmpid" link="fulltext">12058060</pubid>
						<pubid idtype="doi">10.1091/mbc.02-02-0023</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B22">
				<title>
					<p>Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning</p>
				</title>
				<aug>
					<au>
						<snm>Shipp</snm>
						<fnm>MA</fnm>
					</au>
					<au>
						<snm>Ross</snm>
						<fnm>KN</fnm>
					</au>
					<au>
						<snm>Tamayo</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Weng</snm>
						<fnm>AP</fnm>
					</au>
					<au>
						<snm>Kutok</snm>
						<fnm>JL</fnm>
					</au>
					<au>
						<snm>Aguiar</snm>
						<fnm>RC</fnm>
					</au>
					<au>
						<snm>Gaasenbeek</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Angelo</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Reich</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Pinkus</snm>
						<fnm>GS</fnm>
					</au>
					<au>
						<snm>Ray</snm>
						<fnm>TS</fnm>
					</au>
					<au>
						<snm>Koval</snm>
						<fnm>MA</fnm>
					</au>
					<au>
						<snm>Last</snm>
						<fnm>KW</fnm>
					</au>
					<au>
						<snm>Norton</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Lister</snm>
						<fnm>TA</fnm>
					</au>
					<au>
						<snm>Mesirov</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Neuberg</snm>
						<fnm>DS</fnm>
					</au>
					<au>
						<snm>Lander</snm>
						<fnm>ES</fnm>
					</au>
					<au>
						<snm>Aster</snm>
						<fnm>JC</fnm>
					</au>
					<au>
						<snm>Golub</snm>
						<fnm>TR</fnm>
					</au>
				</aug>
				<source>Nature Medicine</source>
				<pubdate>2002</pubdate>
				<volume>8</volume>
				<fpage>68</fpage>
				<lpage>74</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1038/nm0102-68</pubid>
						<pubid idtype="pmpid" link="fulltext">11786909</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B23">
				<title>
					<p>Prediction of central nervous system embryonal tumor outcome based on gene expression</p>
				</title>
				<aug>
					<au>
						<snm>Pomeroy</snm>
						<fnm>SL</fnm>
					</au>
					<au>
						<snm>Tamayo</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Gaasenbeek</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Sturla</snm>
						<fnm>LM</fnm>
					</au>
					<au>
						<snm>Angelo</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>McLaughlin</snm>
						<fnm>ME</fnm>
					</au>
					<au>
						<snm>Kim</snm>
						<fnm>JY</fnm>
					</au>
					<au>
						<snm>Goumnerova</snm>
						<fnm>LC</fnm>
					</au>
					<au>
						<snm>Black</snm>
						<fnm>PM</fnm>
					</au>
					<au>
						<snm>Lau</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Allen</snm>
						<fnm>JC</fnm>
					</au>
					<au>
						<snm>Zagzag</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Olson</snm>
						<fnm>JM</fnm>
					</au>
					<au>
						<snm>Curran</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Wetmore</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Biegel</snm>
						<fnm>JA</fnm>
					</au>
					<au>
						<snm>Poggio</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Mukherjee</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Rifkin</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Califano</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Stolovitzky</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Louis</snm>
						<fnm>DN</fnm>
					</au>
					<au>
						<snm>Mesirov</snm>
						<fnm>JP</fnm>
					</au>
					<au>
						<snm>Lander</snm>
						<fnm>ES</fnm>
					</au>
					<au>
						<snm>Golub</snm>
						<fnm>TR</fnm>
					</au>
				</aug>
				<source>Nature</source>
				<pubdate>2002</pubdate>
				<volume>415</volume>
				<fpage>436</fpage>
				<lpage>442</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1038/415436a</pubid>
						<pubid idtype="pmpid" link="fulltext">11807556</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B24">
				<title>
					<p>On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data</p>
				</title>
				<aug>
					<au>
						<snm>Newton</snm>
						<fnm>MA</fnm>
					</au>
					<au>
						<snm>Kendziorski</snm>
						<fnm>CM</fnm>
					</au>
					<au>
						<snm>Richmond</snm>
						<fnm>CS</fnm>
					</au>
					<au>
						<snm>Blattner</snm>
						<fnm>FR</fnm>
					</au>
					<au>
						<snm>Tsui</snm>
						<fnm>KW</fnm>
					</au>
				</aug>
				<source>Journal of Computational Biology</source>
				<pubdate>2001</pubdate>
				<volume>8</volume>
				<fpage>37</fpage>
				<lpage>52</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1089/106652701300099074</pubid>
						<pubid idtype="pmpid" link="fulltext">11339905</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B25">
				<title>
					<p>Testing for differentially-expressed genes by maximum likelihood analysis of microarray data</p>
				</title>
				<aug>
					<au>
						<snm>Ideker</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Thorsson</snm>
						<fnm>V</fnm>
					</au>
					<au>
						<snm>Siegel</snm>
						<fnm>AF</fnm>
					</au>
					<au>
						<snm>Hood</snm>
						<fnm>LE</fnm>
					</au>
				</aug>
				<source>Journal of Computational Biology</source>
				<pubdate>2000</pubdate>
				<volume>7</volume>
				<fpage>805</fpage>
				<lpage>817</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1089/10665270050514945</pubid>
						<pubid idtype="pmpid" link="fulltext">11382363</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B26">
				<title>
					<p>Gene expression profiling identifies clinically relevant subtypes of prostate cancer</p>
				</title>
				<aug>
					<au>
						<snm>Lapointe</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Li</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Higgins</snm>
						<fnm>JP</fnm>
					</au>
					<au>
						<snm>van de Rijn</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Bair</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Montgomery</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Ferrari</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Egevad</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Rayford</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Bergerheim</snm>
						<fnm>U</fnm>
					</au>
					<au>
						<snm>Ekman</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>DeMarzo</snm>
						<fnm>AM</fnm>
					</au>
					<au>
						<snm>Tibshirani</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Botstein</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Brown</snm>
						<fnm>PO</fnm>
					</au>
					<au>
						<snm>Brooks</snm>
						<fnm>JD</fnm>
					</au>
					<au>
						<snm>Pollack</snm>
						<fnm>JR</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci USA</source>
				<pubdate>2004</pubdate>
				<volume>101</volume>
				<issue>3</issue>
				<fpage>811</fpage>
				<lpage>816</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">321763</pubid>
						<pubid idtype="pmpid" link="fulltext">14711987</pubid>
						<pubid idtype="doi">10.1073/pnas.0304146101</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B27">
				<title>
					<p>Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data</p>
				</title>
				<aug>
					<au>
						<snm>Dudoit</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Fridlyand</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Speed</snm>
						<fnm>TP</fnm>
					</au>
				</aug>
				<source>Journal of the American Statistical Association</source>
				<pubdate>2002</pubdate>
				<volume>97</volume>
				<issue>457</issue>
				<fpage>77</fpage>
				<lpage>87</lpage>
				<xrefbib>
					<pubid idtype="doi">10.1198/016214502753479248</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B28">
				<aug>
					<au>
						<snm>Walpole</snm>
						<fnm>RE</fnm>
					</au>
					<au>
						<snm>Myers</snm>
						<fnm>RH</fnm>
					</au>
				</aug>
				<source>Probability and Statistics for Engineers and Scientists</source>
				<publisher>Macmillan Publishing</publisher>
				<edition>5</edition>
				<pubdate>1993</pubdate>
			</bibl>
			<bibl id="B29">
				<aug>
					<au>
						<snm>Cochran</snm>
						<fnm>WG</fnm>
					</au>
				</aug>
				<source>Sampling Techniques</source>
				<publisher>John Wiley</publisher>
				<edition>3</edition>
				<pubdate>1977</pubdate>
			</bibl>
		</refgrp>
	</bm>
</art>

