Abstract
Background
Microarray techniques have revolutionized genomic research by making it possible to monitor the expression of thousands of genes in parallel. As the amount of microarray data being produced is increasing at an exponential rate, there is a great demand for efficient and effective expression data analysis tools. Comparing the gene expression profiles of patients against those of healthy controls will enhance our understanding of a disease and identify leads for therapeutic intervention.
Results
In this paper, we propose an innovative approach, the fuzzy membership test (FM-test), based on fuzzy set theory, to identify disease-associated genes from microarray gene expression profiles. A new concept, the FM d-value, is defined to quantify the divergence of two sets of values. We further analyze the asymptotic property of FM-test and establish the relationship between the FM d-value and the p-value. We applied FM-test to a diabetes expression dataset and a lung cancer expression dataset. Of the 10 significant genes identified in the diabetes dataset, six have been confirmed in the literature to be associated with diabetes and one has been suggested by other researchers. Of the 10 significantly overexpressed genes identified in the lung cancer data, eight have been confirmed in the literature to be related to lung cancer.
Conclusion
Our experiments on synthetic datasets show that FM-test is effective and robust. The results on the diabetes and lung cancer datasets validated the effectiveness of FM-test. FM-test is implemented as a Web-based application and is available for free at http://database.cs.wayne.edu/bioinformatics.
Background
Microarray techniques have revolutionized genomic research by making it possible to monitor the expression of thousands of genes in parallel. As the amount of microarray data being produced is increasing at an exponential rate, there is a great demand for efficient and effective expression data analysis tools. The gene expression profile of a cell determines its phenotype and its responses to the environment, including responses to environmental factors, drugs and therapies. Gene expression patterns can be determined by measuring the quantity of the end product, protein, or of the mRNA template used to synthesize the protein. Comparing the gene expression profiles of patients against those of healthy controls will enhance our understanding of a disease and identify leads for therapeutic intervention. Several important breakthroughs in the gene expression profiling of diseases have been made [1-5]. More interestingly, researchers have identified many genes that play important roles in the onset, development, and progression of various diseases. Identification of these disease genes offers a route to a better understanding of the molecular mechanisms underlying pathogenesis, a necessary prerequisite for the rational development of improved preventative and therapeutic methods.
One effective approach to identifying genes that are associated with a disease is to measure the divergence of two sets of gene expression values. A motivating example is shown in Table 1, which records the microarray gene expression values of five genes for two groups of people in a diabetes study [6]: five insulin-sensitive (IS) humans and five insulin-resistant (IR) humans. To identify the genes that are associated with diabetes, one needs to determine for each gene whether or not the two sets of expression values are significantly different from each other. The two most popular methods to measure the divergence of two sets of values are the t-test [7] and the Wilcoxon rank sum test [7]. The t-test assesses whether the means of two groups are statistically different from each other. Given two sets S_{1} and S_{2}, the t-value is calculated as

$$t = \frac{\mu_{S_1} - \mu_{S_2}}{\sqrt{\sigma_{S_1}^2/|S_1| + \sigma_{S_2}^2/|S_2|}}$$
Table 1. The gene expression values for five genes under two conditions.
where μ_{S }and σ_{S }are the sample mean and standard deviation of S, respectively.
The limitation of the t-test is that it cannot distinguish two sets with close means even when the two sets are significantly different from each other. Another limitation of the t-test is that it is very sensitive to extreme values.
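The sensitivity to extreme values can be illustrated with a minimal sketch. It assumes the unequal-variance form of the t-value with sample variances (n - 1 denominator); the function name `t_value` and the toy numbers are hypothetical and are not taken from Table 1.

```python
import math
import statistics


def t_value(s1, s2):
    """Two-sample t-value using sample means and sample variances.

    Assumption: unequal-variance form, variances computed with n - 1.
    """
    m1, m2 = statistics.mean(s1), statistics.mean(s2)
    v1, v2 = statistics.variance(s1), statistics.variance(s2)
    return (m1 - m2) / math.sqrt(v1 / len(s1) + v2 / len(s2))


# Hypothetical toy data: two clearly separated groups.
group_a = [10, 11, 12, 13, 14]
group_b = [20, 21, 22, 23, 24]
# Same second group, but its largest value replaced by an extreme value.
group_b_extreme = [20, 21, 22, 23, 124]

t_clean = t_value(group_a, group_b)
t_extreme = t_value(group_a, group_b_extreme)
```

In this toy case the extreme value moves the second group further away from the first, yet the magnitude of the t-value shrinks because the variance estimate is inflated.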
Another popular statistical method is the Wilcoxon rank sum test, which can be used to test the null hypothesis that two sets S_{1} and S_{2} have the same distribution. We first merge the data from these two sets and rank the values from the lowest to the highest, with all sequences of ties being assigned an average rank. The Wilcoxon test statistic W is the sum of the ranks from set S_{1}. Assuming that the two sets have the same continuous distribution (and no ties occur), W has a mean and standard deviation given by

$$\mu = \frac{m(m+n+1)}{2}, \qquad \sigma = \sqrt{\frac{mn(m+n+1)}{12}}$$

where m = |S_{1}| and n = |S_{2}|.
We test the null hypothesis H_{0}: no difference in distributions. A one-sided alternative is H_{a}: S_{1} yields lower measurements. We use this alternative if we expect or see that W is unusually lower than its expected value μ. In this case, the p-value is given by a normal approximation. We let N ~ N(μ, σ) and compute the left tail Pr(N ≤ W) (using a continuity correction if W is an integer).
If we expect or see that W is much higher than its expected value, then we should use the alternative H_{a}: S_{1} yields higher measurements. In this case, the p-value is given by the right tail Pr(N ≥ W). If the two sums of ranks from each set are close, then we could use a two-sided alternative H_{a}: there is a difference in distributions. In this case, the p-value is given by twice the smaller tail value: 2*Pr(N ≤ W) if W < μ, or 2*Pr(N ≥ W) if W > μ.
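The rank sum procedure described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used in the paper: the function name `rank_sum_test` is hypothetical, ties receive average ranks, and the continuity correction is applied as a shift of 0.5 in both tails.

```python
import math
from statistics import NormalDist


def rank_sum_test(s1, s2):
    """Return (W, mu, sigma, p_left, p_right, p_two_sided) via the normal approximation."""
    # Pool the values, remembering which set each one came from (0 = S1, 1 = S2).
    pooled = sorted([(v, 0) for v in s1] + [(v, 1) for v in s2])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    # W is the sum of the ranks of the values that came from S1.
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    m, n = len(s1), len(s2)
    mu = m * (m + n + 1) / 2.0
    sigma = math.sqrt(m * n * (m + n + 1) / 12.0)
    normal = NormalDist(mu, sigma)
    p_left = normal.cdf(w + 0.5)           # Pr(N <= W) with continuity correction
    p_right = 1.0 - normal.cdf(w - 0.5)    # Pr(N >= W) with continuity correction
    p_two_sided = min(1.0, 2.0 * min(p_left, p_right))
    return w, mu, sigma, p_left, p_right, p_two_sided


# Hypothetical example: every value in the first set is below every value in the second.
w, mu, sigma, p_left, p_right, p_two = rank_sum_test([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
```

For sets as small as five values the normal approximation is rough; exact tables or an exact enumeration would usually be preferred in practice.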
Although the rank sum test overcomes the t-test's sensitivity to extreme values, it is not sensitive to absolute values. This might be advantageous in some applications but not in others.
Results
To validate our approach, first, we investigated the distribution of the FM d-value on a set of synthetic datasets. Second, we conducted experiments on a synthetic dataset to study the relationship between the FM-test d-value and its empirical p-value. Third, on another synthetic dataset, we studied the relationship between the FM d-value and the mean difference of distributions.
The probability distribution of FM d-value
Suppose two sets S_{1} and S_{2} are randomly drawn from the same normal distribution. What is the probability distribution of the FM d-value? To answer this question, we conducted the following simulation:
1. We generated N = 64000 pairs of sets of values, with each set containing 5 values. As shown in Figure 1(a), each value in the two data sets is randomly generated from the same normal distribution N(0,1).
Figure 1. Random generation of d-value from normal distribution. (a) shows the random generation of two sets of values from the same normal distribution and the calculation of the FM d-value of these two sets. (b) shows the random generation of two sets of values from two different normal distributions and the calculation of the FM d-value of these two sets.
2. We calculated the d-value for each pair of sets.
3. We then estimated the probability density value at each d as

$$f(d) \approx \frac{\left|\{\, i : d_i \in [d-\delta,\ d+\delta] \,\}\right|}{2\delta N}$$

where δ = 0.005 and d_i is the FM d-value of the i-th pair. The value is essentially the fraction of the FM d-values falling in the region [d - δ, d + δ] divided by the region length 2δ. The probability density function of the d-distribution is drawn in Figure 2.
Figure 2. The probability density function of FM d-value. The probability density function of the FM d-value shows that most d-values fall into the middle region and only 5% of d-values are greater than 0.6058; these d-values are considered significant.
4. Finally, in order to understand the effect of the number of pairs used for simulation, i.e., the size of the dataset, on the approximation error of the d-distribution, we generated datasets with different data sizes. For each data size, we generated 10 datasets, and thus derived 10 probability density functions. The maximum standard deviation over all d-values is recorded as the error rate for that data size. As shown in Figure 3, the error rate decreases as the size of the dataset increases, as expected.
Figure 3. The impact of dataset size on error rate of PDF of FM d-value. We show the error rate for different data sizes from 500 to 32000. For each data size, we generated 10 datasets, and thus derived 10 probability density functions. The maximum standard deviation over all d-values is recorded as the error rate for that data size. The error rate decreases as the size of the dataset increases.
From Figure 2, we can see that most FM d-values fall into the range from 0.2 to 0.5, and very few are greater than 0.6 or less than 0.2. In particular, when d ≥ 0.6056, p-value ≤ 0.05. This is reflected in the red-shaded area in Figure 2, with ∫ f(x)dx = 0.05 over that area. Therefore, given two sets S_{1} and S_{2} drawn from the same unit normal distribution, the chance that the pair has an FM d-value equal to or greater than 0.6056 is very low. Conversely, if we observe that two sets have a d-value equal to or greater than 0.6056, then this is strong evidence that the two sets are drawn from two different distributions, and they should be considered significantly divergent.
Figure 3 shows the effect of data size on the error rate of the derived probability density function. As the data size increases, the error rate decreases. We can see from Figure 3 that, once the number of pairs of sets in a dataset exceeds 8000, the error rate becomes stable. Thus, to obtain a reliable empirical p-value for FM-test, the data size should be greater than 8000.
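The simulation above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: the d-value follows the definition given in the Methods section, the standard deviation is computed with an n - 1 denominator, and the names `fm_d_value`, `simulate_null_d_values`, `density_estimate` and `upper_cutoff` are hypothetical. The run uses far fewer pairs than the 64000 in the paper so that it finishes quickly.

```python
import math
import random
import statistics


def fm_d_value(s1, s2):
    """FM d-value: 1 minus the average cross-membership of the two sets."""
    mu1, sd1 = statistics.mean(s1), statistics.stdev(s1)  # assumption: n - 1 denominator
    mu2, sd2 = statistics.mean(s2), statistics.stdev(s2)
    bonds = [math.exp(-((x - mu2) ** 2) / (2 * sd2 ** 2)) for x in s1]
    bonds += [math.exp(-((y - mu1) ** 2) / (2 * sd1 ** 2)) for y in s2]
    return 1.0 - sum(bonds) / len(bonds)


def simulate_null_d_values(n_pairs, set_size=5, seed=0):
    """d-values of pairs of sets drawn from the same N(0, 1) distribution."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_pairs):
        s1 = [rng.gauss(0.0, 1.0) for _ in range(set_size)]
        s2 = [rng.gauss(0.0, 1.0) for _ in range(set_size)]
        values.append(fm_d_value(s1, s2))
    return values


def density_estimate(d_values, d, delta=0.005):
    """Fraction of d-values in [d - delta, d + delta], divided by the width 2 * delta."""
    inside = sum(1 for v in d_values if d - delta <= v <= d + delta)
    return inside / (len(d_values) * 2 * delta)


def upper_cutoff(d_values, alpha=0.05):
    """Smallest observed d-value such that at most a share alpha of values lie above it."""
    ordered = sorted(d_values)
    return ordered[math.ceil((1 - alpha) * len(ordered)) - 1]


null_d = simulate_null_d_values(2000)
cutoff = upper_cutoff(null_d)
```

The cutoff obtained this way depends on the number of pairs, the seed, and the choice of standard deviation estimator, so it should not be expected to reproduce the published threshold exactly.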
Relationship between FM d-value and its empirical p-value
Suppose two sets S_{1} and S_{2} are drawn from the same normal distribution. What is the probability that they have an FM d-value equal to or greater than a particular D? If D increases, will this probability decrease? To answer these questions, we studied the relationship between the FM d-value and the empirical p-value as follows:
1. Based on the above experimental result, we know that we need at least 8000 pairs of sets to obtain a reliable empirical p-value. Therefore, in this experiment, we generated 10000 pairs of sets of values, with each set containing 5 values. Each value is randomly generated from the unit normal distribution N(0,1).
2. We calculated the d-value for each pair of sets.
3. For each pair of sets S_{1} and S_{2} with d-value D, we calculated its empirical p-value as (n+1)/10001, where n is the number of pairs among these 10000 pairs that have a d-value equal to or greater than D.
4. We plotted the relationship between d-value and empirical p-value in Figure 4.
Figure 4. The relationship between FM d-value and its empirical p-value. As the d-value increases, the p-value decreases. In particular, when d ≥ 0.6056, we have p-value ≤ 0.05.
From Figure 4, we can see that as the d-value increases, the p-value decreases. In particular, when d ≥ 0.6056, we have p-value ≤ 0.05.
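The empirical p-value in step 3 is a simple counting rule. A minimal sketch follows; the function name `empirical_p_value` and the four toy d-values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def empirical_p_value(null_d_values, observed_d):
    """Empirical p-value (n + 1) / (N + 1).

    N is the number of simulated d-values and n is the number of them
    that are equal to or greater than the observed d-value.
    """
    n = sum(1 for v in null_d_values if v >= observed_d)
    return (n + 1) / (len(null_d_values) + 1)


# Hypothetical toy null distribution with N = 4 simulated d-values.
toy_null = [0.1, 0.2, 0.3, 0.4]
p_mid = empirical_p_value(toy_null, 0.35)   # one value (0.4) is >= 0.35
p_high = empirical_p_value(toy_null, 0.90)  # no value is >= 0.90
```

Because of the added 1 in the numerator, the empirical p-value can never be exactly zero; its smallest possible value is 1/(N+1).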
Relationship between FM d-value and the mean difference of distributions
Suppose two sets S_{1} and S_{2} are drawn from two different distributions. A good divergence measurement should then satisfy the following property: the less overlap between the two distributions, the greater the d-value. We validated that our FM-test has this property as follows:
1. As shown in Figure 1(b), two data sets are generated from two distributions. Let N(0,1) and N(x, 1) be two normal distributions, where x is the mean difference between these two distributions. In this experiment, we consider x = 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, respectively.
2. We generated 1000 pairs of sets of values, with the first set containing 5 values that are randomly generated from N(0,1), and the second set containing 5 values that are randomly generated from N(x, 1).
3. We calculated the d-value for each pair. Let the average of these 1000 d-values be d. We then plotted (x, d) in Figure 5.
Figure 5. Relationship between the mean difference of distributions and d-value. Two datasets are generated from two distributions. Let N(0,1) and N(x, 1) be two normal distributions, where x is the mean difference between these two distributions. In this experiment, we consider x = 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, respectively. The d-value between two sets increases when the mean difference of the two datasets increases.
4. We repeated steps 2 and 3 for each x. Finally, the curve was drawn in Figure 5.
Figure 5 confirmed the desirable property of FM-test: the larger the mean difference between the two distributions, the greater the d-value.
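The experiment above can be reproduced in outline with the following sketch. It assumes the d-value definition from the Methods section with an n - 1 standard deviation; the names `fm_d_value` and `mean_d_value` are hypothetical, and fewer pairs are used than in the paper to keep the run short.

```python
import math
import random
import statistics


def fm_d_value(s1, s2):
    """FM d-value: 1 minus the average cross-membership of the two sets."""
    mu1, sd1 = statistics.mean(s1), statistics.stdev(s1)  # assumption: n - 1 denominator
    mu2, sd2 = statistics.mean(s2), statistics.stdev(s2)
    bonds = [math.exp(-((x - mu2) ** 2) / (2 * sd2 ** 2)) for x in s1]
    bonds += [math.exp(-((y - mu1) ** 2) / (2 * sd1 ** 2)) for y in s2]
    return 1.0 - sum(bonds) / len(bonds)


def mean_d_value(shift, n_pairs=300, set_size=5, seed=0):
    """Average d-value over pairs drawn from N(0, 1) and N(shift, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        s1 = [rng.gauss(0.0, 1.0) for _ in range(set_size)]
        s2 = [rng.gauss(shift, 1.0) for _ in range(set_size)]
        total += fm_d_value(s1, s2)
    return total / n_pairs


# The mean differences considered in the experiment.
shifts = [0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5]
curve = [(x, mean_d_value(x)) for x in shifts]
```

Plotting `curve` gives a rough counterpart of Figure 5; the exact values depend on the seed and on the number of pairs.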
Discussion
Analyzing diabetes data with FM-test
A diabetes dataset of microarray gene expression for a total of 10831 genes, downloadable from [6], is used for analysis. For each gene, there are ten expression values, five from a group of insulin-sensitive (IS) people and five from a group of insulin-resistant (IR) people. Only the genes that have no null expression values are included in this analysis. We also require that, for a gene to be included, at least five of its ten expression values are greater than 100. This eliminates the genes whose expression values are noisy and unreliable.
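The two inclusion criteria can be expressed as a small filter. This is a sketch only: the function name `keep_gene` is hypothetical, null expression values are assumed to be represented as `None`, and the example rows are made up rather than taken from the dataset in [6].

```python
def keep_gene(values, threshold=100.0, min_above=5):
    """Return True if a gene passes both inclusion criteria.

    values: the ten expression values of one gene (five IS followed by five IR).
    Assumption: a null expression value is represented as None.
    """
    if any(v is None for v in values):
        return False  # genes with null expression values are excluded
    above = sum(1 for v in values if v > threshold)
    return above >= min_above  # at least five of the ten values exceed 100


# Hypothetical example rows.
passes = keep_gene([120, 150, 90, 300, 80, 210, 95, 60, 130, 70])
too_low = keep_gene([120, 150, 90, 30, 80, 21, 95, 60, 130, 70])
has_null = keep_gene([120, 150, None, 300, 180, 210, 195, 160, 130, 170])
```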
The results of FM-test are compared with the results of the t-test and the rank sum test. As can be seen in Table 2, although the rank orders differ between methods, all three methods identify the best-ranked genes as significantly differentially expressed between the IS and IR groups. Furthermore, the 10 worst-ranked genes in FM-test shown in Table 2 are also consistent with the results of the other two methods. However, gene U49835 is identified by FM-test as the 21st-ranked significant gene, with a p-value of 0.0258. Neither the t-test (p-value 0.0768) nor the rank sum test (p-value 0.1522) identifies this gene as significant.
Table 2. Ten best-ranked and worst-ranked genes of diabetes identified by FM-test.
To study the relevance of the genes to insulin metabolism and diabetes, the 10 best-ranked differentially regulated genes shown in Table 2 were further searched in the published literature. Human phosphatidylinositol (4,5) bisphosphate 5-phosphatase homolog (gene U45973) was found to be differentially expressed in insulin resistance cases. Overexpression of inositol polyphosphate 5-phosphatase-2 (SHIP2) has been shown to inhibit insulin-stimulated phosphoinositide 3-kinase (PI3K) dependent signaling events. Analysis of diabetic human subjects has revealed an association between SHIP2 gene polymorphism and type 2 diabetes mellitus. Knockout mouse studies have also shown that SHIP2 is a significant therapeutic target for the treatment of type 2 diabetes as well as obesity [8]. Csermely et al. reported that insulin mediates phosphorylation/dephosphorylation of the nucleolar protein nucleolin (gene M60858) by stimulating casein kinase II, and that this may play a role in the simultaneous enhancement of RNA efflux from isolated, intact cell nuclei [9]. c-myc is an oncogene that codes for the transcription factor Myc, which, along with binding partners such as MAX, plays an important and widely studied role in various physiological processes, including tumor growth in different cancers. Myc modulates the expression of hepatic genes and counteracts the obesity and insulin resistance induced by a high-fat diet in transgenic mice overexpressing c-myc in liver [10].
Max interactor protein, MXI1 (gene L07648), competes for MAX and thus negatively regulates MYC function, and may play a role in insulin resistance. In the presence of glucose, or glucose and insulin, leucine is utilized more efficiently as a precursor for lipid biosynthesis by adipose tissue. It has been shown that during the differentiation of 3T3-L1 fibroblasts to adipocytes, the rate of lipid biosynthesis from leucine increases at least 30-fold, and the specific activity of 3-hydroxy-3-methylglutaryl-CoA lyase (gene L07033), the mitochondrial enzyme catalyzing the terminal reaction in the leucine degradation pathway, increases 4-fold [11]. Schottelndreier et al. [12] have described a regulatory role of integrin alpha 6 (gene X53586) in Ca2+ signaling, which is known to have a significant role in insulin resistance [13].
HCGV gene product (gene X81003) is known to inhibit the activity of protein phosphatase-1, which is involved in diverse signalling pathways including insulin signalling [14]. Human ribosomal protein L7 (gene X57959) plays a regulatory role in the eukaryotic translation apparatus. It has been shown to be an autoantigen in patients with systemic autoimmune diseases, such as systemic lupus erythematosus [15]. Identification of this gene both in our analysis and by [6] suggests a possible role of this gene in insulin resistance. Published reports on these genes indicate their roles in insulin signalling and warrant further investigation of their functions in insulin resistance cases. We further recommend genes D85181, M95610 and U06452 as candidate genes for future research in this area.
To compare the fold change of expression levels between the IS and IR groups with the statistical significance (p-values), we present all the genes in the diabetes dataset in a volcano plot, shown in Figure 6. The volcano plot arranges the genes along dimensions of biological and statistical significance. The X axis is the fold change between the two groups on a log scale, that is, log_{2} of the ratio between the mean of the expressions in the IS group and the mean of the expressions in the IR group. In this way, up- and down-regulation appear symmetric. The Y axis represents the p-value of our FM-test on a negative log scale, -log_{10}(p-value), so that smaller p-values appear higher up. The X axis indicates the biological impact of the change; the Y axis indicates the statistical evidence, or reliability, of the change.
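The coordinates of a gene in such a volcano plot follow directly from the two group means and the p-value. The sketch below is illustrative: the function name `volcano_point` is hypothetical, and the direction of the ratio (IS mean divided by IR mean) is an assumption, since the original notation did not survive in the text.

```python
import math


def volcano_point(is_values, ir_values, p_value):
    """Return (x, y) for one gene in a volcano plot.

    x: log2 fold change, assumed here to be log2(mean_IS / mean_IR).
    y: -log10 of the p-value, so that smaller p-values appear higher up.
    """
    mean_is = sum(is_values) / len(is_values)
    mean_ir = sum(ir_values) / len(ir_values)
    x = math.log2(mean_is / mean_ir)
    y = -math.log10(p_value)
    return x, y


# Hypothetical gene: mean expression twice as high in the IS group, p-value 0.01.
x, y = volcano_point([200, 200, 200, 200, 200], [100, 100, 100, 100, 100], 0.01)
# The reversed comparison gives the mirror-image x coordinate.
x_rev, _ = volcano_point([100, 100, 100, 100, 100], [200, 200, 200, 200, 200], 0.01)
```

The symmetry of `x` and `x_rev` around zero is what makes up- and down-regulation appear symmetric on the log scale.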
Figure 6. The volcano plot for the diabetes dataset. We compare the fold change of expression levels between the IS and IR groups with the statistical significance (p-values) in a volcano plot. The volcano plot arranges the genes along dimensions of biological and statistical significance. The X axis is the fold change between the two groups on a log scale, that is, log_{2} of the ratio between the mean of the expressions in the IS group and the mean of the expressions in the IR group. The few genes that show a significant difference can be seen in the plot.
As shown in Figure 6, gene U45973 is identified by FM-test as the most statistically significant gene and it is overexpressed in the IR group; gene X53586 is identified by FM-test as the 7th most statistically significant gene and it is overexpressed in the IS group. Although genes M60858, D85181, M95610, L07648, L07033, and X81003 have been identified by FM-test among the top ten significant genes, they are not overexpressed in either group. Finally, gene U41515 is identified by FM-test as the 11th most significant gene and it is overexpressed in the IS group.
In summary, for 6 of the top 10 genes identified by FM-test, we found literature supporting their association with insulin metabolism and diabetes. Among the remaining four genes, gene X57959 has been recommended by [6] as a candidate gene for diabetes, and we recommend genes D85181, M95610 and U06452 as candidate genes for future research in this area.
Analyzing lung cancer data with FM-test
To study the relevance of significant genes in lung cancer, a dataset of microarray gene expression for a total of 22283 genes, downloadable from [16], is used for analysis, and the top-ranked genes were further searched in the published literature. Most of the genes we found have a validated role in tumor progression. As shown in Table 3, we discuss a few of the genes that were ranked best by our method. Multiple identifiers of keratins were ranked significant in the dataset. Cytokeratins are a polygenic family of insoluble proteins and have been proposed as potentially useful markers of differentiation in various malignancies, including lung cancers [17]. Dystonin (DST/BPAG1) is a member of the plakin protein family of adhesion junction plaque proteins. A recent study showed the expression of BPAG1 in epithelial tumor cells [18]. Maspin (SERPINB5) has been shown to be involved in both tumor growth and metastasis, including cell invasion, angiogenesis, and, more recently, apoptosis [19]. Tumor protein p73-like (TP73L/P63) is implicated in the activation of cell survival and anti-apoptotic genes [20] and has been used as a marker for lung cancer. It has been suggested that p63 genomic amplification has an early role in lung tumorigenesis [21]. CLCA2 belongs to the calcium-sensitive chloride conductance protein family and has been used in a multigene detection assay for Non-Small Cell Lung Cancer (NSCLC) [22]. Plakophilins (PKPs) are members of the armadillo multigene family that function in cell adhesion and signal transduction, and also play a central role in tumorigenesis [23]. Desmoplakin (DSP) is a desmosome protein that anchors intermediate filaments to desmosomal plaques. Microscopic analysis with fluorescence-labeled antibodies for DSP revealed high expression of membrane DSP in Squamous Cell Carcinomas (SCC) [24]. The data analysis also identified cell cycle regulatory proteins such as CDC20 and Cyclin B1.
Overexpression of CDC20 has been shown to be associated with premature anaphase promotion, resulting in mitotic abnormalities in oral SCC cell lines [25]. Mini chromosome maintenance 2 (MCM2) protein is involved in the initiation of DNA replication and is a marker for proliferating cells [26]. Our analysis also identified GPR87 (NM_023915) and UGT1A9 (NM_019093). The role of G protein-coupled receptors is well documented in lung cancer, and GPR87 could be an important gene in cancer progression. Among the overexpressed genes, we suggest NM_023915 and NM_019093 as potential candidates for biological investigation.
Table 3. Ten best-ranked (overexpressed) cancer genes identified by FM-test.
Conclusion
We proposed an innovative approach based on fuzzy set theory, FM-test, that quantifies the divergence of two sets directly. We have validated FM-test on synthetic datasets and shown that it is effective and robust. We also applied FM-test to a real diabetes dataset and a lung cancer dataset. For each dataset, we identified 10 significant genes. Of the 10 significant genes in the diabetes dataset, six have been confirmed in the literature to be associated with insulin signalling and/or diabetes, one has been recommended by others, and the remaining three genes, D85181, M95610 and U06452, are suggested as potential diabetes genes involved in insulin resistance for further biological investigation. Of the 10 significantly overexpressed genes identified in the lung cancer data, eight are confirmed by the literature to be related to lung cancer. The remaining two genes, NM_023915 and NM_019093, are potential candidates for further biological investigation. In addition, we analyzed the asymptotic properties of the distribution of the FM d-value and derived the equation to calculate its p-value. The analysis is presented in the appendix. FM-test is implemented as a Web-based application and can be accessed for free at http://database.cs.wayne.edu/bioinformatics.
Methods
In this section, based on fuzzy set theory [27], we present our innovative approach, the fuzzy-set-theory-based method test (FM-test), to quantify the divergence of two sets of values directly and robustly. In addition, in the appendix, we show the asymptotic property of FM-test and then establish the relationship between the FM d-value and the p-value.
Let S_{1 }and S_{2 }be two sets of values of a particular feature for two groups of samples under two different conditions. The basic idea is to consider the two sets of values as samples from two different fuzzy sets. We examine the membership value of each element with respect to the other fuzzy set. By calculating the average of membership values, we measure the divergence of the original two sets. In particular, we perform the following steps:
1. Compute the sample mean and standard deviation of S_{1 }and of S_{2 }respectively.
2. Characterize S_{1} and S_{2} as two fuzzy sets FS_{1} and FS_{2} whose fuzzy membership functions, f_{FS_{1}}(x) and f_{FS_{2}}(x), are defined with the sample means and standard deviations. The fuzzy membership function f_{FS_{i}}(x) (i = 1,2) maps each value x to a fuzzy membership value that reflects the degree to which x belongs to FS_{i} (i = 1,2).
3. Using the two fuzzy membership functions, f_{FS_{1}}(x) and f_{FS_{2}}(x), quantify the convergence degree of the two sets.
4. Define the divergence degree (FM d-value) between the two sets based on the convergence degree.
Fuzzy Sets and Membership Functions
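The sample mean μ_{1} of S_{1} is calculated as

$$\mu_1 = \frac{1}{n_1}\sum_{x \in S_1} x$$

where n_{1} is the number of elements in S_{1}, and the sample standard deviation σ_{1} of S_{1} is calculated as

$$\sigma_1 = \sqrt{\frac{1}{n_1 - 1}\sum_{x \in S_1} (x - \mu_1)^2}$$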
For gene 5 in Table 1, we have μ_{1} = 461.8, σ_{1} = 210.59, μ_{2} = 266.2, and σ_{2} = 45.29. We then characterize set S_{1} by a fuzzy set FS_{1} whose fuzzy membership function is defined as

$$f_{FS_1}(x) = e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}}$$

The function f_{FS_{1}}(x) maps each value x in S_{1} to a fuzzy membership value that quantifies the degree to which x belongs to FS_{1}. A value equal to the mean has a membership value of 1 and belongs to fuzzy set FS_{1} to a full degree; a value that deviates from the mean has a smaller membership value and belongs to FS_{1} to a smaller degree. The further the value deviates from the mean, the smaller the fuzzy membership value. Similarly, the fuzzy membership function for S_{2} is defined as

$$f_{FS_2}(x) = e^{-\frac{(x-\mu_2)^2}{2\sigma_2^2}}$$
where μ_{2 }and σ_{2 }are the mean and standard deviation of S_{2 }respectively.
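For gene 5 in Table 1, we have f_{FS_{1}}(x) = e^{-(x - 461.8)^{2}/(2 × 210.59^{2})} and f_{FS_{2}}(x) = e^{-(x - 266.2)^{2}/(2 × 45.29^{2})}. With these two fuzzy membership functions, the fuzzy membership values of each element with respect to the two sets can be calculated. For example, f_{FS_{1}}(598) = 0.81 and f_{FS_{2}}(598) = 2.2E-12.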
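The two membership values quoted for the expression value 598 can be checked numerically. The sketch below assumes the Gaussian-shaped membership function exp(-(x - μ)^2 / (2σ^2)), which is the form consistent with the reported values 0.81 and 2.2E-12; the function name `membership` is hypothetical.

```python
import math


def membership(x, mu, sigma):
    """Fuzzy membership of x in a fuzzy set characterized by mean mu and SD sigma."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


# Means and standard deviations reported for gene 5.
mu1, sigma1 = 461.8, 210.59
mu2, sigma2 = 266.2, 45.29

in_fs1 = membership(598, mu1, sigma1)  # membership of 598 in FS1
in_fs2 = membership(598, mu2, sigma2)  # membership of 598 in FS2
```

A value equal to the mean gives a membership of exactly 1, and the membership decays symmetrically on both sides of the mean.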
Our Proposed Method: FM-test
Since the fuzzy membership functions can overlap, one element can belong to more than one fuzzy set, with a respective degree for each. For an element in S_{1}, we measure the degree to which it belongs to FS_{1} by applying its value to f_{FS_{1}}. Similarly, we can apply its value to f_{FS_{2}} to measure the degree to which it belongs to FS_{2}. The idea of FM-test is to consider the membership value of an element in S_{1} with respect to S_{2} as a bond between S_{1} and S_{2}, and vice versa; the aggregation of all these bonds then reflects the overall bond between the two sets. The weaker this overall bond is, the more divergent the two sets are. The strength of the overall bond between two sets is quantified by their c-value, which aggregates the mutual membership values of elements in S_{1} and S_{2} and is defined as follows.
Definition 1 (FM c-value): Given two sets S_{1} and S_{2}, the convergence degree between S_{1} and S_{2} in FM-test is defined as

$$c(S_1,S_2) = \frac{\sum_{x \in S_1} f_{FS_2}(x) + \sum_{y \in S_2} f_{FS_1}(y)}{|S_1| + |S_2|}$$

Now we define the divergence value in FM-test (FM d-value) as follows.
Definition 2 (FM d-value): Given two sets S_{1} and S_{2}, the FM d-value between S_{1} and S_{2} is defined as

$$d(S_1,S_2) = 1 - c(S_1,S_2)$$
For gene 5 in Table 1, c(S_{1}, S_{2}) = 0.326, thus the divergence value is 1 - c(S_{1}, S_{2}) = 0.674. We calculated the p-values of the five genes in Table 1 for the three methods. One interesting observation is that, while both the t-test and the Wilcoxon rank sum test fail to recognize gene 5 as a significant gene, since their p-values are greater than 0.05, our FM-test identifies gene 5 as a significant gene with a p-value of 0.025. The t-test and the Wilcoxon rank sum test fail because of their sensitivity to the extreme value 141 in the first set of the gene.
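Definitions 1 and 2 translate into a short function. The sketch below is illustrative and rests on stated assumptions: the c-value is the average of all cross-membership values, the standard deviation uses an n - 1 denominator, and the names `fm_c_value` and `fm_d_value` as well as the toy sets are hypothetical (the values of Table 1 are not reproduced here).

```python
import math
import statistics


def fm_c_value(s1, s2):
    """Convergence degree: average membership of S1's elements in FS2 and of S2's in FS1."""
    mu1, sd1 = statistics.mean(s1), statistics.stdev(s1)  # assumption: n - 1 denominator
    mu2, sd2 = statistics.mean(s2), statistics.stdev(s2)
    bonds = [math.exp(-((x - mu2) ** 2) / (2 * sd2 ** 2)) for x in s1]
    bonds += [math.exp(-((y - mu1) ** 2) / (2 * sd1 ** 2)) for y in s2]
    return sum(bonds) / (len(s1) + len(s2))


def fm_d_value(s1, s2):
    """Divergence degree: 1 minus the convergence degree."""
    return 1.0 - fm_c_value(s1, s2)


# Hypothetical toy sets.
overlapping = fm_d_value([1, 2, 3, 4, 5], [1.5, 2.5, 3.5, 4.5, 5.5])
separated = fm_d_value([1, 2, 3], [101, 102, 103])
```

Overlapping sets give a d-value well below 1, whereas widely separated sets give a d-value that is numerically indistinguishable from 1.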
Given a calculated FM d-value D for two sets S_{1} and S_{2}, to interpret D as "significantly divergent" or not, we need to know the cutoff value δ of D, so that when D ≥ δ, the two sets are interpreted as significantly divergent. In the context of FM-test, we test the following null hypothesis H_{0}: S_{1} and S_{2} originate from the same distribution. The p-value is then defined as the probability Pr(d(S_{1}, S_{2}) ≥ D | S_{1} and S_{2} were randomly sampled from the same distribution). By the convention of statistical analysis, if p-value ≤ 0.05, then this is strong evidence to reject the null hypothesis and accept that the two sets are significantly divergent, with the p-value reflecting the significance. It is very common to use Monte Carlo procedures to calculate an empirical p-value, which approximates the exact p-value without relying on asymptotic distributional theory or on exhaustive enumeration. Davison and Hinkley [28] present the formula for obtaining an empirical p-value as (n+1)/(N+1), where N is the number of samples in the data set and n is the number of those samples that produce a statistic greater than or equal to the specified value.
We perform the following steps to calculate the p-value of two sets S_{1} and S_{2} with FM d-value D: (1) Assume that S_{1} and S_{2} are drawn from a normal distribution N(μ,σ), where μ and σ are estimated using the sample mean and standard deviation of S_{1} ∪ S_{2}; (2) Randomly draw N pairs of sets from N(μ,σ), then calculate the FM d-value for each pair; (3) Calculate the empirical p-value as (n+1)/(N+1), where n is the number of pairs whose FM d-values are equal to or greater than D.
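The three steps above can be sketched as a Monte Carlo routine. This is an illustration, not the implementation behind the Web application: the names `fm_d_value` and `fm_p_value` are hypothetical, the standard deviation uses an n - 1 denominator, simulated sets are given the same sizes as the observed sets, and the toy inputs are made up.

```python
import math
import random
import statistics


def fm_d_value(s1, s2):
    """FM d-value: 1 minus the average cross-membership of the two sets."""
    mu1, sd1 = statistics.mean(s1), statistics.stdev(s1)  # assumption: n - 1 denominator
    mu2, sd2 = statistics.mean(s2), statistics.stdev(s2)
    bonds = [math.exp(-((x - mu2) ** 2) / (2 * sd2 ** 2)) for x in s1]
    bonds += [math.exp(-((y - mu1) ** 2) / (2 * sd1 ** 2)) for y in s2]
    return 1.0 - sum(bonds) / len(bonds)


def fm_p_value(s1, s2, n_pairs=10000, seed=0):
    """Empirical p-value of the observed FM d-value under a common normal distribution."""
    observed = fm_d_value(s1, s2)
    # Step 1: estimate the common normal distribution from the pooled values.
    pooled = list(s1) + list(s2)
    mu, sigma = statistics.mean(pooled), statistics.stdev(pooled)
    # Step 2: draw pairs of sets from N(mu, sigma) and count d-values >= observed.
    rng = random.Random(seed)
    n = 0
    for _ in range(n_pairs):
        r1 = [rng.gauss(mu, sigma) for _ in s1]
        r2 = [rng.gauss(mu, sigma) for _ in s2]
        if fm_d_value(r1, r2) >= observed:
            n += 1
    # Step 3: empirical p-value.
    return (n + 1) / (n_pairs + 1)


# Hypothetical toy inputs; fewer pairs than recommended, to keep the run short.
p_separated = fm_p_value([1.0, 1.1, 0.9, 1.05, 0.95], [5.0, 5.1, 4.9, 5.05, 4.95], n_pairs=2000)
p_overlapping = fm_p_value([1, 2, 3, 4, 5], [1.5, 2.5, 3.5, 4.5, 5.5], n_pairs=2000)
```

In line with the simulation results reported earlier, more than 8000 pairs would be used when a reliable p-value is needed.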
Authors' contributions
LRL and SL designed the algorithm and coordinated the project. XW proved the asymptotic property of FM-test and wrote part of the manuscript. YL carried out the study and drafted the manuscript. VM implemented the Web-based application of FM-test. DP and DK analyzed gene functional data and wrote part of the manuscript.
APPENDIX
Asymptotic Characteristics of the FM d-value
The FM d-value is defined in the Methods section as follows:

$$d(S_1,S_2) = 1 - \frac{\sum_{x \in S_1} f_{FS_2}(x) + \sum_{y \in S_2} f_{FS_1}(y)}{|S_1| + |S_2|}$$

Here we establish the asymptotic characteristics of the FM d-value by estimating its mean and variance. To this end, formula (10) is rewritten by defining an indicator variable I_{S_{i}}(·) as follows:

$$d(S_1,S_2) = 1 - \frac{1}{n_1+n_2}\sum_{i=1}^{n_1+n_2}\left[ I_{S_1}(x_i)\, f_{FS_2}(x_i) + I_{S_2}(x_i)\, f_{FS_1}(x_i) \right]$$

where S = S_{1} ∪ S_{2} = {x_{i}, i = 1,..., n_{1} + n_{2}}, n_{1} = |S_{1}|, n_{2} = |S_{2}|, and I_{S_{i}}(x) = 1 if x ∈ S_{i} and 0 otherwise, for i = 1,2.
Let Δ(X) = I_{S_{1}}(X) f_{FS_{2}}(X) + I_{S_{2}}(X) f_{FS_{1}}(X) with respect to a random variable X over the sample space S, with a probability p of choosing a sample x from S_{1}. The d-value for a given sample is therefore d(S_{1},S_{2}) = 1 - (1/(n_{1}+n_{2})) Σ_{i} Δ(x_{i}) = 1 - \bar{Δ}, where \bar{Δ} is the sample mean of Δ(X). Next, the mean and the variance of Δ(X) are calculated in preparation for establishing the asymptotic distribution of the d-value.
(1). Calculation of the mean of Δ(X)
The mean of Δ(X) is given by
Similarly,
By (12)–(14), the mean of Δ(X) when p = 0.5 is
(2). Calculation of the variance of Δ(X)
Since S_{1 }and S_{2 }are independent, the variance of Δ(X) is given by
Similarly,
Therefore, when p = 0.5
As illustrated at the beginning, the d-value is a function of the sample mean \bar{Δ} of Δ(X), given by d(S_{1},S_{2}) = 1 - \bar{Δ}. From the mean and the variance of Δ(X) in formulas (Δ1) and (Δ2), the mean and the variance of the d-value are derived straightforwardly as follows:
For a large sample, by the central limit theorem, the distribution of the d-value approximately follows a truncated normal distribution: d(S_{1},S_{2}) → N(E(d),Var(d)) on a restrained domain of [0, 1].
For the purpose of further illustration, several special cases of the distribution of the d-value under application-specific constraints are demonstrated.
i. Balanced study: p = 0.5, n_{1} = n_{2} = n/2
ii. Balanced study with equal mean: p = 0.5, n_{1} = n_{2} = n/2, μ_{1} = μ_{2}
iii. Balanced study with equal variance: p = 0.5, n_{1} = n_{2}, σ_{1}^{2} = σ_{2}^{2} = σ^{2}
iv. Balanced study with equal variance of 1 and large samples: σ^{2} = 1, n_{1} = n_{2} ≥ 25
v. Balanced study with equal variance of 1 and equal mean for large samples: σ^{2} = 1, μ_{1} = μ_{2}, n_{1} = n_{2} ≥ 25
$$E(d(S_1,S_2)) = 1 - \frac{1}{\sqrt{2}} \approx 0.293, \qquad \mathrm{Var}(d(S_1,S_2)) \approx \frac{0.327}{n}$$

d(S_{1},S_{2}) → N(0.293, 0.327/n) with a restrained domain of [0, 1].
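The constant 0.293 in case (v) can be checked with a standard Gaussian integral. The following is a sketch under simplifying assumptions, not the full derivation: X is taken to be normally distributed with mean μ_{1} and variance σ_{1}^{2}, the membership function is assumed to use the population parameters μ_{2} and σ_{2} rather than their sample estimates, and the notation f_{FS_{2}} for the membership function is assumed.

```latex
\begin{align*}
E\left[f_{FS_2}(X)\right]
  &= \int_{-\infty}^{\infty}
     e^{-\frac{(x-\mu_2)^2}{2\sigma_2^2}}\,
     \frac{1}{\sqrt{2\pi}\,\sigma_1}\,
     e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}}\,dx
   = \frac{\sigma_2}{\sqrt{\sigma_1^2+\sigma_2^2}}\,
     e^{-\frac{(\mu_1-\mu_2)^2}{2(\sigma_1^2+\sigma_2^2)}} \\[4pt]
\sigma_1=\sigma_2=1,\ \mu_1=\mu_2:\qquad
E\left[f_{FS_2}(X)\right] &= \frac{1}{\sqrt{2}}
  \;\Longrightarrow\;
  E(d) = 1-\frac{1}{\sqrt{2}} \approx 0.293
\end{align*}
```

By symmetry the same expectation holds for the membership of an element of S_{2} in FS_{1}, so the average of all cross-memberships has expectation 1/√2 in this special case.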
Figure 7 shows the density function of the d-value for this special case when n = 50, with mean 0.293 and standard deviation 0.08 (variance 0.327/50 ≈ 0.0065).
Figure 7. Asymptotic density function of the d-value for a balanced study with equal variance of one.
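For readers who wish to reproduce a curve of this kind, the following minimal sketch evaluates a normal density restricted to [0, 1] and renormalised, using the case (v) values for n = 50. It is an illustration under stated assumptions rather than the code used to draw the figure; the function names are hypothetical, and only the Python standard library is used.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def truncated_normal_pdf(x, mean, sd, lo=0.0, hi=1.0):
    """Density of N(mean, sd^2) restricted to [lo, hi] and renormalised."""
    if x < lo or x > hi:
        return 0.0
    # Probability mass of the untruncated normal that falls inside [lo, hi].
    mass = normal_cdf((hi - mean) / sd) - normal_cdf((lo - mean) / sd)
    return normal_pdf((x - mean) / sd) / (sd * mass)

n = 50
mean_d = 0.293               # case (v) mean
var_d = 0.327 / n            # case (v) variance
sd_d = math.sqrt(var_d)      # about 0.08

# Density evaluated on a coarse grid over [0, 1].
grid = [k / 20 for k in range(21)]
curve = [(x, truncated_normal_pdf(x, mean_d, sd_d)) for x in grid]
```

Because almost all of the normal mass already lies inside [0, 1] for these values, the truncation changes the density only slightly.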
Calculation of p-value
The p-value, also called the observed level of significance, is commonly used to report the smallest α-level at which the observed test result is significant. In this section, we derive a parametric calculation of the p-value for the FM-test based on the asymptotic distribution obtained in section I.
The null hypothesis of the test is H_{0}: μ_{1} = μ_{2}, where μ_{1} and μ_{2} are the mean gene expression levels of the two studied groups. According to the asymptotic distribution of the d-value, following its special case (ii) (balanced study with equal mean), a test statistic under the null hypothesis for a large sample size (n ≥ 25) is given by
Where and .
Suppose d_{obs} is an observed d-value for a given study based on two independent samples S_{1} = {x_{i}, i = 1,...,n_{1}} and S_{2} = {y_{i}, i = 1,...,n_{2}}. The population variances σ_{1}^{2} and σ_{2}^{2} are estimated by the corresponding sample variances. Thus the mean and variance of the d-value are estimated by
and
The p-value is therefore derived as follows:
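The general shape of such a calculation can be sketched as follows. This is a hedged illustration, not a transcription of formula (Δ3): it standardises an observed d-value with an estimated mean and variance and reads off a tail probability from the standard normal distribution. The use of the upper tail (large d-values counted as evidence against H_{0}) is an assumption, as are the function name `fm_pvalue` and the example inputs, which borrow the case (v) constants for n = 50 and a made-up d_{obs} of 0.5.

```python
import math

def fm_pvalue(d_obs, mean_d, var_d):
    """Return (z, p) for an observed d-value under a normal approximation.

    mean_d and var_d are the estimated mean and variance of the d-value
    under the null hypothesis. ASSUMPTION: the p-value is the upper tail.
    """
    if var_d <= 0.0:
        raise ValueError("var_d must be positive")
    z = (d_obs - mean_d) / math.sqrt(var_d)
    # Upper-tail probability P(Z >= z); erfc stays accurate for large z.
    p = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p

# Hypothetical example: case (v) constants with n = 50 and d_obs = 0.5.
z_stat, p_value = fm_pvalue(0.5, 0.293, 0.327 / 50)
```

In practice the estimated mean and variance would come from the sample variances of the two groups, as described above, rather than from the unit-variance constants used in this example.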
Application in Gene Expression Analysis
Table 4 shows the calculated p-values for the study example. The p-values calculated by (Δ3) are consistent with the empirical p-values listed in Table 1, except for Gene 5, which is above 0.05. Note that when formula (Δ3) is applied to calculate p-values, a large sample size (n ≥ 25) is desirable for robust estimation, because the derivation relies on the central limit theorem.
Table 4. P-values given by FM-test for five genes from the study example.
Acknowledgements
We would like to thank the anonymous reviewers for their helpful comments. This work was supported by the Agricultural Experiment Station at the University of the District of Columbia (Project No.: DC0LIANG; Accession No.: 0203877).
This article has been published as part of BMC Bioinformatics Volume 7, Supplement 4, 2006: Symposium of Computations in Bioinformatics and Bioscience (SCBB06). The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/7?issue=S4.
References

Rome S, Clement K, Rabasa-Lhoret R, Loizon E, Poitou C, Barsh GS, Riou JP, Laville M, Vidal H: Microarray profiling of human skeletal muscle reveals that insulin regulates approximately 800 genes during a hyperinsulinemic clamp.
J Biol Chem 2003, 278(20):18063–18068.

Shalev A, Pise-Masison CA, Radonovich M, Hoffmann SC, Hirshberg B, Brady JN, Harlan DM: Oligonucleotide microarray analysis of intact human pancreatic islets: identification of glucose-responsive genes and a highly regulated TGF-beta signaling pathway.
Endocrinology 2002, 143(9):3695–3698.

Sreekumar R, Halvatsiotis P, Schimke JC, Nair KS: Gene expression profile in skeletal muscle of type 2 diabetes and the effect of insulin treatment.
Diabetes 2002, 51(6):1913–1920.

Eckenrode SE, Ruan QG, Collins CD, Yang P, McIndoe RA, Muir A, She JX: Molecular pathways altered by insulin b9-23 immunization.
Ann N Y Acad Sci 2004, 1037:175–185.

Voisine P, Ruel M, Khan TA, Bianchi C, Xu SH, Kohane I, Libermann TA, Otu H, Saltiel AR, Sellke FW: Differences in gene expression profiles of diabetic and nondiabetic patients undergoing cardiopulmonary bypass and cardioplegic arrest.
Circulation 2004, 110(11 Suppl 1):II280–286.

Yang X, Pratley RE, Tokraks S, Bogardus C, Permana PA: Microarray profiling of skeletal muscle tissues from equally obese, nondiabetic insulin-sensitive and insulin-resistant Pima Indians.
Diabetologia 2002, 45(11):1584–1593.

Rosner B: Fundamentals of Biostatistics. 5th edition. Pacific Grove, CA: Duxbury Press; 2000.

Dyson JM, Kong AM, Wiradjaja F, Astle MV, Gurung R, Mitchell CA: The SH2 domain containing inositol polyphosphate 5-phosphatase-2: SHIP2.
Int J Biochem Cell Biol 2005, 37(11):2260–2265.

Csermely P, Schnaider T, Cheatham B, Olson MO, Kahn CR: Insulin induces the phosphorylation of nucleolin. A possible mechanism of insulin-induced RNA efflux from nuclei.
J Biol Chem 1993, 268(13):9747–9752.

Riu E, Ferre T, Hidalgo A, Mas A, Franckhauser S, Otaegui P, Bosch F: Overexpression of c-myc in the liver prevents obesity and insulin resistance.
Faseb J 2003, 17(12):1715–1717.

Frerman FE, Sabran JL, Taylor JL, Grossberg SE: Leucine catabolism during the differentiation of 3T3-L1 cells. Expression of a mitochondrial enzyme system.
J Biol Chem 1983, 258(11):7087–7093.

Schottelndreier H, Potter BV, Mayr GW, Guse AH: Mechanisms involved in alpha6beta1-integrin-mediated Ca(2+) signalling.
Cell Signal 2001, 13(12):895–899.

Kulkarni RN, Roper MG, Dahlgren G, Shih DQ, Kauri LM, Peters JL, Stoffel M, Kennedy RT: Islet secretory defect in insulin receptor substrate 1 null mice is linked with reduced calcium signaling and expression of sarco(endo)plasmic reticulum Ca2+-ATPase (SERCA)2b and 3.
Diabetes 2004, 53(6):1517–1525.

Zhang J, Zhang L, Zhao S, Lee EY: Identification and characterization of the human HCG V gene product as a novel inhibitor of protein phosphatase-1.
Biochemistry 1998, 37(47):16728–16734.

von Mikecz A, Hemmerich P, Peter HH, Krawinkel U: Characterization of eukaryotic protein L7 as a novel autoantigen which frequently elicits an immune response in patients suffering from systemic autoimmune disease.
Immunobiology 1994, 192(1–2):137–154.

Wachi S, Yoneda K, Wu R: Interactome-transcriptome analysis reveals the high centrality of genes differentially expressed in lung cancer tissues.
Bioinformatics 2005, 21(23):4205–4208.

Camilo R, Capelozzi VL, Siqueira SA, Del Carlo Bernardi F: Expression of p63, keratin 5/6, keratin 7, and surfactant-A in non-small cell lung carcinomas.
Hum Pathol 2006, 37(5):542–546.

Schuetz CS, Bonin M, Clare SE, Nieselt K, Sotlar K, Walter M, Fehm T, Solomayer E, Riess O, Wallwiener D, et al.: Progression-specific genes identified by expression profiling of matched ductal carcinomas in situ and invasive breast tumors, combining laser capture microdissection and oligonucleotide microarray analysis.
Cancer Res 2006, 66(10):5278–5286.

Chen EI, Yates JR: Maspin and tumor metastasis.
IUBMB Life 2006, 58(1):25–29.

Sbisa E, Mastropasqua G, Lefkimmiatis K, Caratozzolo MF, D'Erchia AM, Tullo A: Connecting p63 to cellular proliferation: the example of the adenosine deaminase target gene.
Cell Cycle 2006, 5(2):205–212.

Massion PP, Taflan PM, Jamshedur Rahman SM, Yildiz P, Shyr Y, Edgerton ME, Westfall MD, Roberts JR, Pietenpol JA, Carbone DP, et al.: Significance of p63 amplification and overexpression in lung cancer development and prognosis.
Cancer Res 2003, 63(21):7113–7121.

Hayes DC, Secrist H, Bangur CS, Wang T, Zhang X, Harlan D, Goodman GE, Houghton RL, Persing DH, Zehentner BK: Multigene real-time PCR detection of circulating tumor cells in peripheral blood of lung cancer patients.
Anticancer Res 2006, 26(2B):1567–1575.

Schwarz J, Ayim A, Schmidt A, Jager S, Koch S, Baumann R, Dunne AA, Moll R: Differential expression of desmosomal plakophilins in various types of carcinomas: correlation with cell type and differentiation.
Hum Pathol 2006, 37(5):613–622.

Young GD, Winokur TS, Cerfolio RJ, Van Tine BA, Chow LT, Okoh V, Garver RI Jr: Differential expression and biodistribution of cytokeratin 18 and desmoplakins in non-small cell lung carcinoma subtypes.
Lung Cancer 2002, 36(2):133–141.

Mondal G, Sengupta S, Panda CK, Gollin SM, Saunders WS, Roychoudhury S: Overexpression of Cdc20 leads to impairment of the spindle assembly checkpoint and aneuploidization in oral cancer.
Carcinogenesis 2006.

Chatrath P, Scott IS, Morris LS, Davies RJ, Rushbrook SM, Bird K, Vowler SL, Grant JW, Saeed IT, Howard D, et al.: Aberrant expression of minichromosome maintenance protein-2 and Ki-67 in laryngeal squamous epithelial lesions.
Br J Cancer 2003, 89(6):1048–1054.

Klir GJ, Yuan B: Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice-Hall; 1995.

Davison A, Hinkley D: Bootstrap methods and their application. Cambridge: Cambridge University Press; 1997.