Abstract
Background
Gene pathway can be defined as a group of genes that interact with each other to perform some biological processes. Along with the efforts to identify the individual genes that play vital roles in a particular disease, there is a growing interest in identifying the roles of gene pathways in such diseases.
Results
This paper proposes an innovative fuzzysettheorybased approach, Multidimensional Cluster Misclassification test (MCMtest), to measure the significance of gene pathways in a particular disease. Experiments have been conducted on both synthetic data and real world data. Results on published diabetes gene expression dataset and a list of predefined pathways from KEGG identified OXPHOS pathway involved in oxidative phosphorylation in mitochondria and other mitochondrial related pathways to be deregulated in diabetes patients. Our results support the previously supported notion that mitochondrial dysfunction is an important event in insulin resistance and type2 diabetes.
Conclusion
Our experiments results suggest that MCMtest can be successfully used in pathway level differential analysis of gene expression datasets. This approach also provides a new solution to the general problem of measuring the difference between two groups of data, which is one of the most essential problems in most areas of research.
Background
Current microarray technologies conduct simultaneous studies of gene expression data of thousands of genes under different conditions. In most of these studies, expression data are analyzed using various standard statistical methods to identify a list of genes responsible for a particular condition. However, in these approaches, interplay among genes is not taken into account. Since organisms behave as complex systems, functioning through chemical networks and interaction of various molecules (also known as pathways), this interplay of genes may provide invaluable insight to the understanding of various diseases. Thus, along with the efforts to identify the individual genes that play vital roles in a particular disease, there is a growing interest in identifying the roles of different pathways in such diseases.
Biological pathway is a group of related genes coding for proteins that interact with each other to perform some biological processes. According to the biological processes they are involved with, pathways can be divided into several categories, such as metabolic pathways and regulatory pathways. Metabolic pathways are series of chemical reactions occurring within a cell, catalyzed by enzymes, resulting in either the formation of a metabolic product to be used or stored by the cell, or the initiation of another metabolic pathway. Regulatory pathways represent proteinprotein interactions.
During the past few years, many signaling and metabolic pathways have been discovered experimentally and have been integrated into pathway databases, such as KEGG [1] and Biocarta [2]. Various statistical techniques have been developed to analyze microarray expression data for the relevance of predefined pathways to a particular disease. These techniques include gene set enrichment analysis [3,4], pathway level analysis of gene expression using singular value decomposition by Tomfohr et al. [5], and hypothesis testing [6] by Tian et al. These approaches are reviewed in detail in the related works section.
Generally speaking, these approaches can be divided into two categories:
• Conduct statistical differential analysis at the individual gene level, and integrate the result statistics of the genes in the same pathway;
• Obtain activity level indices of each pathway for different sample groups and conduct differential analysis of these indices.
For the first category, when the statistics at individual gene level miss significant genes, the effectiveness of the pathway analysis will be affected. An example is given in the later part of this section. For the second approach, extracting pathway activity level indices from gene expression data may cause loss of information.
Diabetes is a group of diseases characterized by high levels of blood glucose resulting from defects in insulin production, insulin action, or both. It is one of the most common diseases, affecting 18.2 million people in the United States, or 6.3% of the population [7]. Hence, identifying active pathways in diabetes is a critical task for understanding this disease. Several pathway analysis works have been proposed in this area [3,5,6].
In gene set enrichment analysis (GSEA) [3], a differential statistic is calculated first for each gene from their expression data of two different groups of samples. Then the genes are ordered according to the statistic values. A running sum of weights is calculated from the ordered list for a particular pathway. The maximum value of this running sum is called the enrichment score of that pathway. To measure the significance of this score, a null distribution of enrichment scores is generated by permuting the sample labels. This approach falls into the first category stated previously, i.e., statistical analysis at individual gene level is performed followed by an integration of these statistics of genes in the same pathway.
In [5], a hypothesis testing framework for pathway differential analysis is proposed. Ttest and Wilcoxon rank test are recommended to measure the difference of expressions of a single gene between two groups of samples. Then this statistic is accumulated over each gene in a particular pathway and standardized by the total number of genes in this pathway. The significance of the result is then interpreted by rejecting two null hypotheses, each with a null population generated by permuting sample labels or gene indices. This approach also belongs to the first category above. Statistical analysis at individual gene level is still required for the pathway analysis in this approach.
In [6], singular value decomposition is used to obtain pathway activity levels from the gene expression matrix. Ttest is applied to the pathway activity levels of the two different sample groups to measure the difference. Significance of the measurement is also obtained by permuting the sample labels. In this approach, no differential analysis at individual gene level is required. However, an extraction of pathway activity level prior to the differential analysis is required. During this extraction process, since only the first eigenvector of singular value decomposition is used, some information of expressions is lost. This approach belongs to the second category stated above.
As discussed above, either ttest or rank sum test is used as a core step by [3,6] to identify individual genes which are expressed differently from two different sample groups. Thus these methods inevitably inherit the disadvantage of ttest and rank sum test. While the ttest is very sensitive to extreme values and cannot distinguish two sets with close means even though the two sets are significantly different from each others, the rank sum test is not sensitive to absolute values. In turn, those pathways contain genes which can not be identified by ttest or rank sum test but actually are significantly differently expressed in two different sample groups will be affected. For example, as showed in Table 1, the expressions of Gene 3 are significantly different under two conditions. However this gene was not identified by ttest. Thus, a pathway involving this gene is less likely to be identified by the first category of analysis that uses ttest at the gene level.
Table 1. An example of five gene pathway
In this paper, we propose an innovative fuzzysettheorybased approach for differential analysis of gene pathways and apply it on identifying significant pathways for diabetes. In our proposed MCMtest, instead of identifying individual genes first, the differential analysis is done directly at the pathway level without individual gene differential statistic. All expression values of genes which belong to a pathway of a particular patient are treated as a vector. The intuition behind this is based on the fact that genes for each patient interplay with each other. MCMtest does not extract activity level of pathways either. This allows keeping the maximum amount of information for the pathway differential analysis. Moreover, the fuzzy concept makes the approach more tolerant to individual data item noise.
Results
To investigate our approach, we conducted experiments on both synthetic data and real world data. We first conducted a series of experiments on synthetic datasets to find the characteristics of MCM dvalue. We then used the MCMtest on the real world diabetes dataset analyzed by Tomfohr et al. [5] and GSEA [3]. Results on real world diabetes data identified several pathways that were deregulated in diabetes patients. The top three pathways identified were related to mitochondrial functions in accordance with previous diabetes studies. Mitochondrial dysfunction is known to be related to insulin resistance and type2 diabetes. Our data suggests that the method can be successfully used in pathway level differential analysis of gene expression datasets.
Relationship between MCM dvalue and mean difference of the distributions
Suppose two sets S_{1 }and S_{2 }are drawn from two different distributions, then a good divergence value will satisfy the following property: the less the overlap, the higher the dvalue. To validate that our MCMtest has this property, we performed the following steps:
1. generated 17 values from Gaussian distribution N (μ, σ), where μ is the mean and σ is the variance, to use as gene expression data. The number 17 was chosen to mimic the real world diabetes dataset used for the analysis in this paper.
2. repeated Step 1 for 100 times to get expression data of 100 genes
3. generated 17 values from Gaussian distribution N (μ + x, σ), with x = 0 at this time.
4. repeated Step 3 for 100 times
5. analyzed these 100 pairs of sets of values with MCMtest and obtained the dvalue.
6. repeated Step 1 to Step 5 for 1000 times and averaged the dvalues over all the iterations.
7. repeated Step 1 to Step 6 for each x: x = 0, 20, 40, 60, 80, 100, 120, 140, 160, and 180.
Figure 1 shows the average dvalue verses mean difference. We can see that the MCMtest has the desired property: the larger the mean difference between two sets, the larger the divergence dvalue.
Figure 1. Relationship between dvalue and mean difference. Two datasets are generated from two distributions N (μ, σ) and N (μ + x, σ). As the mean difference, x, increases, the dvalue also increases.
Impact of population size on standard deviation of MCM dvalue
At this point, a natural question is, what the standard deviation of MCM dvalues look like and how population size of dvalue influences it. To answer these questions, we generated sample expression datasets and calculate dvalues following a process similar to step 1 to 5 in the previous section. Again, we fixed the pathway length to 100. We repeat the process and obtained 500 dvalues and calculated the standard deviation of the dvalues. This is then repeated for 10 times and the 10 standard deviations are averaged and recorded as error rate of MCM dvalue for that population size 500. Similarly we obtained the error rate for various dvalue population sizes. As shown in Figure 2, the error rate decreases as the dataset size increases. We also note that the error rate becomes stable after the size of the population becomes greater than 8000.
Figure 2. Impact of number of permutation on the error rate of PDF of MCM dvalue. We show the error rate for various numbers of permutations ranging from 500 to 32000. The error rate decreases as the number of permutations increases.
Relationship between MCM dvalue and empirical pvalue
Suppose two vectors S_{1 }and S_{2 }are drawn from same Normal distribution. What is the probability that the MCM dvalue of these vectors is greater than a particular D? Does the probability increases with the increase of D? To answer these questions, we studied the relationship between MCM dvalue and empirical pvalue as follows:
1. We generated 15000 pairs of sets, each set with 15 values from standard normal distribution.
2. From these 15000 pairs of sets, we randomly selected 100 pairs of sets to simulate expression data of a pathway with 100 genes under two conditions. We calculated dvalue for this pathway. Since we know that the data size required to obtain stable standard deviation of dvalue is 8000 from the previous experiment, this process is repeated 10000 times.
3. For each pathway generated above with dvalue D, we calculated the empirical pvalue as n+1/10001, where n is the number of dvalues generated above that are equal to or greater than D. The relationship between the dvalue and pvalue is shown in Figure 3.
Figure 3. Relationship between MCM dvalue and its empirical pvalue. As the dvalue increases, the corresponding empirical pvalue decreases.
In Figure 3 we can see that as the dvalue increases, the pvalue decreases. In particular, when dvalue is greater than 0.809, we have pvalue ≤ 0.05.
Impact of number of samples on error rate of MCMtest dvalue
In order to understand the effect of the number of samples on error rate of MCM dvalue, we generated datasets with different sample sizes. For each sample size, we generated 10000 datasets and calculated the corresponding 10000 dvalues. The standard deviation of these dvalues was calculated. This process is repeated 10 times and the average of the standard deviations is recorded as the error rate. The same is done for the other sample sizes. The relationship between number of samples and the error rate is shown in Figure 4. As expected, the error rate decreases as the number of samples increases.
Figure 4. Impact of number of samples on error rate of PDF of MCMtest dvalue. As the number of samples in a pathway increases, the error rate of PDF of MCM dvalue also decreases.
Analyzing the diabetes dataset with MCMtest
The diabetes dataset contains the transcriptional profiles of smooth muscle biopsies of diabetic and normal individuals. In the expression dataset, for each gene, there are 17 expression values from subjects with type 2 diabetes (DM2), 17 expression values from subjects with normal glucose tolerance (NGT) and 10 expression values from subjects with impaired glucose tolerance (IGT). For our analysis, we only used the 34 expression values from subjects with type 2 diabetes and subjects with normal glucose tolerance. Furthermore, we used about 150 pathways obtained from KEGG (Kyoto Encyclopedia of Genes and Genomes) [1].
The expression values in the dataset which are too small, i.e., less than 100 are considered to be the result of noise. So, to minimize the effect of these low values, we only included the genes which have at least one of the expression values greater than 100. Out of the 22,283 genes in the dataset, 10,983 met the filtering criteria. The dvalue for each pathway was calculated as described in the methodology section before. The pvalue for the pathway was calculated using permutation test. We permuted the genes 1000 times, each time selecting the same number of genes as that of the pathway under consideration. We then calculated the dvalue of each pathway obtained thus and the pvalue for the pathway was the fraction of times the dvalues of the pathways obtained by 1000 permutation equaled or exceeded the original dvalue.
The pathways are ordered in the ascending order of their pvalues. The significant pathways, i.e., the pathways with pvalue less than 0.05, are then ordered according to the percentage of the genes in the pathway which were represented in the dataset. Table 2 shows the result after sorting.
Table 2. The results from MCMtest on diabetes dataset
Using our method, we identified the deregulation of mitochondrial pathways in the dataset which is in accordance with previous studies. The first cluster of genes involved was from the mitochondrial OXPHOS pathway. The OXPHOS pathway was well represented in the data with 93% of genes (106 out of 114) present in the dataset. Oxidative phosphorylation in mitochondria provides energy in the form of ATP generation. In muscle cells, mitochondrial dysfunction has been linked to insulin resistance and type2 diabetes [810]. The involvement of genes coded by mitochondria in insulin resistance is also well known. The depletion of cellular mitochondrial DNA has been shown to cause insulin resistance in experimental model [11]. Reduced mitochondrial oxidative phosphorylation leads to the accumulation of intracellular triglycerides resulting in insulin resistance. The next 2 clusters, c20_U133 which is a manually curated cluster of genes coregulated with OXPHOS [3] and the mitochondrial gene cluster human_mitoDB_6_2002 reinforce that muscle mitochondrial dysfunction is linked to type2 diabetes.
Conclusion
In this paper, we propose an innovative fuzzysettheorybased approach for differential analysis of gene pathways and apply it on identifying significant pathways for diabetes. Experiments have been conducted on both synthetic datasets and real world dataset. Results on real world diabetes data identified several number of gene pathways. Of note our top significant pathways were related to mitochondrial function which is well known to be involved in insulin resistance and type2 diabetes. This approach can be used not only for pathway analysis of other diseases but also for other domains. As measuring the difference of two groups of data are essential to most of researches, our approach provides a solution to this general and most critical problem.
Methods
In [1214], we proposed two fuzzysettheory based methods, CMtest and FMtest, to identify the individual genes that expressed significant differences under two conditions. In this paper, we extended the cluster misclassification concept to a multidimensional space and propose a new approach for pathway analysis, Multidimensional Cluster Misclassification test (MCMtest). Comparing with CMtest and FMtest, MCMtest looks for a group of genes significant under two conditions instead of identifying significant individual genes under two conditions. In this approach, the expression values of a group of Q genes for a particular sample under a particular condition are considered as a Qdimension vector. The differential analysis is done at the vector level, without individual gene differential statistic.
In this section, we first introduce the concept of fuzzy membership function of vectors, then the details of MCMtest.
Fuzzy membership Function of Vectors
In fuzzy set theory, the degree for one variable to belong to a fuzzy set is defined by a function. For a vector which has two dimensions, the degree that it belongs to a set of vectors can be defined by a threedimensional function, with the third dimension being the measurement of the membership. Figure 5 shows a sample fuzzy membership function for a vector (x, y).
Figure 5. A sample fuzzy membership function of vector (x, y).
For vectors with n dimensions, their fuzzy membership function will be n+1dimensional, with one dimension measuring the fuzzy membership.
Our approach
Consider a pathway that consists of Q genes, the problem now is to determine how these Q genes are expressed differently under two conditions. To answer this question, we consider the expression values of the Q genes for a particular sample under a particular condition as a Qdimension vector. Then the expression values of a pathway under one condition j can be modeled as set S_{j }(j = 1, 2) of points in a Qdimension space. The idea is to consider the two sets of points S_{1 }and S_{2 }as samples from two different fuzzy sets. We then examine the membership value of each element with respect to these two fuzzy sets and determine the dvalue between the two sets of samples.
The mean of the expression values of set S_{j }is:
where,
N_{j }is the number of samples in S_{j}, is vector made by the expression values of the nth sample under condition j.
We then characterize set S_{j }(j = 1, 2) by a fuzzy set FS_{j }(j = 1, 2) whose fuzzy membership function is defined as:
where,
Given an element in S_{1}, we calculate its element misclassification degree with respect to FS_{2 }as
We denote the misclassified elements in S_{1 }with respect to FS_{2 }as M_{FS2}(S_{1}) = { ∈ S_{1 }∦ m (, FS_{2}) > 0}. Similarly, we denote the misclassified elements in S_{2 }with respect to FS_{1 }as M_{FS1 }(S_{2}) = {∈ S_{2 }∦ m (, FS_{1}) > 0}. We denote the number of misclassified elements in S_{1 }and S_{2 }with respect to each other as # M (S_{1}, S_{2 }= M_{FS2 }(S_{1}) + M_{FS1 }(S_{2}). We then define the convergence degree (cvalue) of S_{1 }and S_{2 }as a linear interpolation of the number of misclassified elements and the mutual misclassification degrees as follows.
where,
and
Then, the divergence between S_{1 }and S_{2 }can be calculated using the following:
In our method, to negate the effect of outliers, we used αtrimmed mean instead of normal mean. The αtrimmed mean is calculated by ordering the sample under consideration and taking away the smallest and largest α values from the ordered sample. The mean of the remaining values in the sample is αtrimmed mean of the sample. For instance, if we have a sample of (3, 17, 25, 29, 23, 53, 22, 31, 45, 81, 90, 1), the 2trimmed mean is calculated by removing the smallest two values (1, 3), and largest two values (81, 90) from the sample set. The mean of the remaining values (30.625) becomes the 2trimmed mean of the sample.
For computational simplicity, an Epanechnikov function shown as following can be used instead of the Gaussian function of equation (2):
where,
Analysis of method
One dimension: a special case
In this section we analyze MCMtest for it theoretical justification. For the sake of clarity, we start with one dimension, the simplest and special case of multidimension. The one dimensional MCMtest corresponds to differential analysis of a single gene.
In Figure 6, two distributions, D_{1 }and D_{2 }are displayed in blue and red respectively, with mean μ_{1 }= 600 and standard deviation σ_{1 }= 50 for D_{1 }and μ_{2 }= 700, σ_{2 }= 100 for D_{2}. The visualization tells us that they are different as they cover different areas and have different shapes. The CMtest, which can be considered as a special case of the MCMtest differentiate them by measuring the differences on the Y axis, which is a combined result of the location difference together with the difference of the variances.
Figure 6. One dimension Gaussian distributions, μ1 = 600, μ2 = 700, σ1 = 50, σ2 = 100.
MCMtest uses the probability distribution functions of these two distributions as their fuzzy membership functions respectively, and takes advantage of the membership differences of "misclassified" samples. As shown in Figure 6, a sample x_{1 }of D_{2 }has a higher degree of belonging to D_{1}, thus is "misclassified" in MCMtest. This misclassification degree is aggregated with all the other "misclassified" samples of D_{2 }that are misclassified. Similarly, x_{2 }of D_{1 }has a higher degree for D_{2}, thus is also misclassified. This misclassification degree is also aggregated with all the other misclassified samples of D_{1}.
MCMtest collects all the misclassified degrees and the number of misclassified samples and form them into an index to measure the divergence of these two distributes. With the mean difference between these two distributions increases, the number of misclassified samples, as well as the aggregated misclassification degree decreases. Thus the MCM dvalue will decrease.
Two and higher dimensions
Figure 7(a) illustrates samples of two distributions, each of which is a 2D Gaussian function. In pathway analysis, the X and Y axis can be the expression data of two individual genes respectively. Figure 7(b) shows the probability density functions of these two distributions, which can be used as their fuzzy membership functions after multiplying a constant.
Figure 7. (a) 1000 samples of 2D Gaussian distribution μ_{1}x = 600, μ_{1}y = 900, σ_{1}x = σ_{1}y = 50 and 1000 samples of 2D Gaussian distribution μ_{2}x = 700, μ_{2}y = 1000, σ_{2}x = σ_{2}y = 100. (b) Probability density functions of the two distributions.
Distributions of higher dimensions are hard to visualize. But the idea of the misclassification test stays the same. In multidimension space, each sample is a vector. And their misclassification degrees are used to measure the divergence of their distributions.
List of abbreviations
MCMtest: multidimensional Cluster Misclassification test
CMtest: cluster misclassification test
FMtest: fuzzy membership test
GSEA: gene set enrichment analysis
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
LRL designed the algorithm, coordinated the project and wrote part of the manuscript. VM implemented the algorithm, conducted experiments and wrote part of the manuscript. YL designed experiments and wrote part of the manuscript. DK located gene expression and pathway data for experiments, analyzed the results and wrote part of the manuscript.
Acknowledgements
We would like to thank Ying Wang and Togba Liberty for generating some of the figures used in this paper. This work was supported by the Agriculture Experiment Station at the University of the District of Columbia (Project No.: DC0LIANG; Accession No.: 0203877).
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 6, 2008: Symposium of Computations in Bioinformatics and Bioscience (SCBB07). The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/9?issue=S6.
References

Kyoto Encyclopedia of Genes and Genomes [http://www.genome.jp/kegg/] webcite

Biocarta [http://www.biocarta.com/] webcite

Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC: Pgc1alpharesponsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes.
Nature Genet 2003, 34:267273. PubMed Abstract  Publisher Full Text

Subramanian AS, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: A knowledgebased approach for interpreting genomewide expression profiles.
PNAS 2005, 102:1554515550. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Tomfohr J, Lu J, Kepler TB: Pathway level analysis of gene expression using singular value decomposition.
BMC Bioinformatics 2005, 6:225. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ: Discovering statistically significant pathways in expression profiling studies.
PNAS 2005, 102:1354413549. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

American Diabetes Association Website [http://www.diabetes.org/] webcite

Morino K, Petersen KF, Shulman GI: Molecular mechanisms of insulin resistance in humans and their potential links with mitochondrial dysfunction.
Diabetes 2006, 55(Suppl 2):S9S15. PubMed Abstract  Publisher Full Text

Takamura T, Honda M, Sakai Y, Ando H, Shimizu A, Ota T, Sakurai M, Misu H, Kurita S, MatsuzawaNagata N, Uchikata M, Nakamura S, Matoba R, Tanino M, Matsubara K, Kaneko S: Gene expression profiles in peripheral blood mononuclear cells reflect the pathophysiology of type 2 diabetes.
Biochem Biophys Res Commun 2007, 361(2):37984. PubMed Abstract  Publisher Full Text

Skov V, Glintborg D, Knudsen S, Jensen T, Kruse TA, Tan Q, Brusgaard K, BeckNielsen H, Højlund K: Reduced expression of nuclearencoded genes involved in mitochondrial oxidative metabolism in skeletal muscle of insulinresistant women with polycystic ovary syndrome.
Diabetes 2007, 56(9):234955. PubMed Abstract  Publisher Full Text

Park SY, Lee W: The depletion of cellular mitochondrial DNA causes insulin resistance through the alteration of insulin receptor substrate1 in rat myocytes.
Diabetes Res Clin Pract 2007, 77(Suppl 1):S165S171.
17462778
PubMed Abstract  Publisher Full Text 
Liang LR, Lu S, Lu Y, Dhawan P, Kumar D: CMtest: An innovative divergence measurement and its application in diabetes gene expression data analysis.
Proceedings of the IEEE International Conference on Granular Computing: 262–268 May 2006; Atlanta, Georgia, USA

Liang LR, Lu S, Wang X, Lu Y, Mandal V, Patacsil D, Kumar D: FMtest: A fuzzysettheorybased approach to differential gene expression data analysis.
BMC Bioinformatics 2006, 7(Suppl 4):S7. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Lu Y, Lu S, Liang LR, Kumar D: FMtest: A fuzzysettheory based approach for discovering diabetes genes.
Proceedings of the IEEE International Symposium of Computations in Bioinformatics and Bioscience: 48–55 June 2006; Hangzhou, Zhejiang, P.R. China