Abstract
Background
Many different microarray experiments are publicly available today. It is natural to ask whether different experiments for the same phenotypic conditions can be combined using meta-analysis, in order to increase the overall sample size. However, some genes are not measured in all experiments, so they either cannot be included or their statistical significance cannot be appropriately estimated in traditional meta-analysis. Nonetheless, these genes, which we refer to as incomplete genes, may also be informative and useful.
Results
We propose a meta-analysis framework, called "Incomplete Gene Meta-analysis", which can include incomplete genes by imputing the significance of missing replicates and computing a meta-score for every gene across all datasets. We demonstrate in two groups of experiments that the incomplete genes are worth including and that our method is able to appropriately estimate their significance. We first apply Incomplete Gene Meta-analysis and several comparable methods to five breast cancer datasets with an identical set of probes. We simulate incomplete genes by randomly removing a subset of probes from each dataset and demonstrate that our method consistently outperforms two other methods in terms of false discovery rate. We also apply the methods to three gastric cancer datasets for the purpose of discriminating diffuse and intestinal subtypes.
Conclusions
Meta-analysis is an effective approach that identifies more robust sets of differentially expressed genes from multiple studies. The incomplete genes, which mainly arise from the use of different platforms, may also have statistical and biological importance but are ignored or not appropriately handled by previous studies. Our Incomplete Gene Meta-analysis is able to incorporate the incomplete genes by estimating their significance. The results on both breast and gastric cancer datasets suggest that the highly ranked genes and associated GO terms produced by our method are more significant and biologically meaningful according to the previous literature.
Background
Gene expression microarrays are a high-throughput technique for measuring the expression levels of thousands of genes simultaneously, and have been widely used in the study of cancer genomics. An important application of gene expression microarrays is detecting differentially expressed genes by statistical analysis. For example, the classical t-test can be used to assess the statistical significance of genes in terms of their ability to discriminate samples from two phenotypes.
While many microarray experiments from different laboratories have been performed with the same research aim, the results of these experiments may differ from each other in many aspects, e.g., the platform, the probe sets or the characteristics of the samples. Consequently, the significant genes identified by the same statistical analysis from different experiments may be inconsistent.
To overcome these inconsistencies, the evidence from multiple studies needs to be combined. Several papers [1-3] directly integrated gene expression data by aligning genes/probes and concatenating samples. Meta-analysis [4] is another way of generating more robust and consistent statistical results: it integrates multiple datasets and outputs an overall score, which we refer to as a meta-score, for each gene/probe across all studies. For example, [5] integrated the p-values from the t-test, [6-8] integrated the effect size based on the model of [4], [9] integrated the ranks of genes, and [10] integrated the test statistics based on a mixture model of the normal distribution by considering the concordance between two datasets.
In addition, some papers used meta-analysis techniques to discover significant gene functions. For example, [11] applied meta-analysis directly to the functional categories associated with each individual dataset, rather than the expression data, in order to identify more significant pathways; [12] used meta-analysis to predict unknown functions of genes.
The integration of datasets from different platforms can generate more statistically significant results by reducing biases caused by specific platforms or experimental conditions. The study in [13] first highlighted the importance of the alignment between different platforms as an issue for the meta-analysis of gene expression microarrays. More recently, the studies in [1,2] applied meta-analysis to multiple platforms, and demonstrated that more robust gene signatures could be generated from multiple platforms.
A challenge for meta-analysis in this context is that microarray datasets from different platforms do not usually possess an identical set of probes. Consequently, it is critical to fix a single set of probes as the candidates for statistical analysis. It is common to encounter incomplete alignment of genes among different microarray experiments, especially among those microarrays from different platforms. For example, the study in [14] compared three microarray platforms: one with short oligonucleotides, one with long oligonucleotides, and a cDNA platform. The three platforms have 6430 genes in common, but many more genes are shared by only a pair of platforms or are present on a single platform, as shown in Figure 1. Similarly, many other meta-analysis studies have used datasets from different platforms, e.g. [5-7]. The overlap of genes among the three gastric cancer datasets [15-17] used in our experiments, which were independently generated by research groups from Australia, Hong Kong and Japan on different platforms, is also shown in Figure 1. In addition, other reasons can also cause missing replicates in microarrays.
However, to the best of our knowledge, all existing methods of gene expression meta-analysis either consider only those features that are assayed in all datasets (which we refer to as complete genes) and discard the rest, or simply ignore the missing replicates of the genes that are not measured in all datasets. We refer to the genes that are not measured in all datasets as incomplete genes.
The incomplete genes may nevertheless be significant and should be considered as candidates, even though their significance is not tested in all studies. In this paper, we focus on developing a novel meta-analysis method that takes complete and incomplete genes into account simultaneously.
We propose a meta-analysis framework, called Incomplete Gene Meta-analysis (IGM), which is able to incorporate incomplete genes caused by cross-platform integration or any other reasons for missing replicates. IGM comprises three major steps: (1) Compute a statistic for every replicate (each probe in each dataset) using the Hedges' g effect size [4]; (2) Impute the significance of missing replicates, where the incomplete genes are not measured in particular datasets, using the model of a conditional probability distribution over the datasets; (3) Generate an overall significance score (meta-score) for each probe across all datasets using a variant of an earlier linear model [4,6,18]. As a basis for comparison, we also implemented other variants of this framework by replacing its key steps, including a traditional approach that does not consider the incomplete genes and a method that simply ignores the missing replicates in the incomplete genes.
We first tested IGM and the comparable approaches on five breast cancer datasets with an identical set of probes, for the purpose of distinguishing the binary label of a given number of years to metastasis. We simulated the incomplete genes by randomly removing a subset of probes from each dataset. A gene ranking was generated using each method and the false discovery rate (FDR, [19]) was estimated using a permutation test [6,20]. Our method consistently achieved the closest FDR to that of the gene ranking produced on the original datasets without incomplete genes, which was considered as the gold standard. We also conducted experiments on three gastric cancer datasets, which were generated independently by research institutions in Australia [15], Hong Kong [16] and Japan [17], for the purpose of discriminating diffuse and intestinal subtypes of gastric cancer [21]. Using an enrichment test for Gene Ontology terms in both groups of cancer datasets, IGM identified more significant terms that were closely related to a particular subtype of gastric cancer than when only complete genes were used. The above results show that the highly ranked genes produced by IGM were statistically and biologically more significant than those produced by the other methods.
In the Methods section, we describe the IGM framework, the comparable methods and our evaluation metrics. In the Results section, we present the experimental results on the breast cancer and gastric cancer datasets. We then discuss the biological relevance of the results on the gastric cancer datasets, and finally conclude the paper.
Methods
In this section, we describe our framework called Incomplete Gene Meta-analysis (IGM), which incorporates both complete genes and incomplete genes simultaneously by including the key step of imputing the significance of missing replicates. We also propose several other variants of this framework as a basis for comparison using three types of evaluation metrics.
Notation
Before presenting our framework, we first introduce several concepts and notations that are used in the following sections. We are given k (k ≥ 2) gene expression datasets GE_{j} = (G_{j}, S_{j}), j = 1, ···, k, where the dataset GE_{j} comprises the gene set G_{j} and the sample set S_{j}. Let G_{I} and G_{U} denote the intersection
$$G_I = \bigcap_{j=1}^{k} G_j \tag{1}$$
and union
$$G_U = \bigcup_{j=1}^{k} G_j \tag{2}$$
of all gene sets, respectively. If the gene g_{i} ∈ G_{U} is not measured in the dataset GE_{j}, j ∈ {1, ···, k}, we call this a missing replicate. A gene that has no missing replicates is called a complete gene. Otherwise, it is called an incomplete gene.
Note that the features are aligned by their gene symbols between datasets. While there are other strategies to align probes between studies, they are not the focus of this paper. More details about the alignment can be found in [22].
If multiple probes in one dataset correspond to a single gene, the median expression level of these probes is computed for each sample.
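The notation above can be illustrated with a minimal Python sketch. The function names and the toy gene symbols are hypothetical and serve only as an example; the sketch assumes that each dataset is represented by the set of gene symbols it measures, and that multiple probes of one gene are given as equal-length lists of per-sample expression values.

```python
from statistics import median


def classify_genes(gene_sets):
    """Return G_U, G_I, the incomplete genes, and the missing replicates per gene."""
    union = set().union(*gene_sets)                      # G_U
    intersection = set.intersection(*map(set, gene_sets))  # G_I (complete genes)
    # For every gene in G_U, list the datasets (0-based) in which it is not measured.
    missing = {g: [j for j, genes in enumerate(gene_sets) if g not in genes]
               for g in union}
    incomplete = {g for g, datasets in missing.items() if datasets}
    return union, intersection, incomplete, missing


def collapse_probes(probe_rows):
    """Collapse several probes of one gene to the per-sample median expression."""
    return [median(column) for column in zip(*probe_rows)]


# Toy example with three datasets (hypothetical gene lists).
datasets = [{"TP53", "BRCA1", "MYC"}, {"TP53", "MYC", "EGFR"}, {"TP53", "BRCA1", "EGFR"}]
union, complete, incomplete, missing = classify_genes(datasets)
```

In this toy example only TP53 is a complete gene; every other gene has exactly one missing replicate.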
Incomplete Gene Meta-analysis Framework
Our Incomplete Gene Meta-analysis framework computes an overall score, called a meta-score, for each gene across all datasets, by imputing the significance of missing replicates and integrating the statistical results from individual datasets. The major steps are as follows (see Figure 2).
Figure 2. Incomplete Gene Meta-analysis. The process of Incomplete Gene Meta-analysis.
1. Input: We are given k ≥ 2 gene expression microarray datasets GE_{j} = (G_{j}, S_{j}), j = 1, ···, k. In each dataset, the samples are labeled with different phenotypes or clinical annotations, with respect to which the differentially expressed genes can be detected.
2. Candidate gene set: We have to select a candidate gene set G_{0} ⊆ G_{U} if the gene sets differ between datasets. Previous methods (e.g., [6,9,10]) only select complete genes (G_{0} = G_{I}), but we select G_{0} = G_{U}, so that all genes are considered as candidates. Let n = |G_{0}| denote the total number of candidate genes.
3. Individual scores: We apply a statistical test to each replicate g_{i} in dataset j, so that a score x_{ij}, which could be the test statistic or p-value, is used to measure the significance of the replicate. We let
$$X = [x_{ij}]_{n \times k} \tag{3}$$
denote the score matrix for all n genes in k datasets. The corresponding value of any missing replicate is initially undefined.
4. Imputation: For each missing replicate, we impute a value for x_{ij} so that it has a valid score. We estimate the scores of the missing replicates using a probability distribution that is conditional on the observable replicates, and also calculate the estimation error for the imputed scores.
5. Meta-scores: We compute a meta-score x_{M}(i) for every gene g_{i}, characterising its overall significance across all datasets.
In the following three subsections, we discuss the details of steps 3 to 5.
Individual Scores
Many statistical tests could be used for measuring the significance of the differential expression of genes. In the case of two phenotypes, we employ the Hedges' g effect size [4], which is defined as the standardized difference in the means between two populations. We first briefly describe the general case of estimating the Hedges' g effect size from the two groups of samples for one gene in one dataset. A biased estimator is given by:
$$g = \frac{\bar{e}_1 - \bar{e}_2}{s^*} \tag{4}$$
where ē_{1} and ē_{2} are the mean values of the samples in groups 1 and 2, respectively, and s* is the pooled standard deviation of the samples. Using g in Equation (4), an unbiased estimator g* of the population effect size is given by:
$$g^* = g - \frac{3g}{4(n_1 + n_2) - 9} \tag{5}$$
In addition, the variance of g* can be estimated using:
$$\hat{\sigma}^2_{g^*} = \frac{1}{n_1} + \frac{1}{n_2} + \frac{(g^*)^2}{2(n_1 + n_2)} \tag{6}$$
where n_{1} and n_{2} are the numbers of samples in groups 1 and 2, respectively.
In our algorithm, we compute g* as the individual score (step 3 of the framework) for each observable replicate g_{i} in dataset j:
$$x_{ij} = g^*_{ij} \tag{7}$$
while the score for each missing replicate is initially undefined.
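As an illustration of the individual score, the following Python sketch computes the bias-corrected effect size and its estimated variance for one gene in one dataset. It assumes the standard Hedges and Olkin small-sample correction and variance approximation; the function name and the toy expression values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def hedges_g(group1, group2):
    """Return (g_star, var_g_star) for two lists of expression values."""
    n1, n2 = len(group1), len(group2)
    # Pooled standard deviation s* from the two sample variances.
    pooled_sd = sqrt(((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2))
                     / (n1 + n2 - 2))
    g = (mean(group1) - mean(group2)) / pooled_sd    # biased estimator
    g_star = g - 3 * g / (4 * (n1 + n2) - 9)         # small-sample bias correction
    var_g_star = 1 / n1 + 1 / n2 + g_star ** 2 / (2 * (n1 + n2))
    return g_star, var_g_star


# Toy example: three samples per phenotype for a single gene.
score, score_var = hedges_g([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

For missing replicates no such score can be computed, which is why those entries of the score matrix remain undefined until the imputation step.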
A Variant of the Linear Model for Meta-scores
Our problem corresponds to the general problem of estimating the population effect size from a given set of measurements. We first recall an existing method for estimating a population parameter used by [4,6,18]. The observed statistic x_{ij} in Equation (7) for the replicate g_{i} in dataset j is hierarchically modeled as follows:
$$x_{ij} = \mu_{ij} + \beta_{ij}, \quad \beta_{ij} \sim N(0, \sigma_{ij}^2) \tag{8}$$
$$\mu_{ij} = \mu_i + \alpha_{ij}, \quad \alpha_{ij} \sim N(0, \tau_i^2) \tag{9}$$
In this model, μ_{i} is the unknown population effect size to be estimated for gene i. A key challenge in this estimation problem is how to account for the variation within each study (modeled by β_{ij}) as well as the variation between studies (modeled by α_{ij}). We now consider each of these terms.
First, many factors, such as different microarray platforms or samples of different ages and regions, may affect the measurements and result in variations of the population effect size between studies. This is modeled by the error term α_{ij} in Equation (9), which follows a normal distribution with zero mean and variance $\tau_i^2$. The term μ_{ij} is the study-specific population effect size.
Second, the other error term β_{ij} in Equation (8) represents the variation in measuring μ_{ij} due to the finite number of samples in each study. This term's variance is estimated by Equation (6).
An unbiased estimator of μ_{i} is given by the semi-weighted mean [4,6,18]:
$$\hat{\mu}_i = \frac{\sum_{j=1}^{k} w_{ij}\, x_{ij}}{\sum_{j=1}^{k} w_{ij}}, \quad w_{ij} = \frac{1}{\hat{\sigma}_{ij}^2 + \hat{\tau}_i^2} \tag{10}$$
where $\hat{\tau}_i^2$ and $\hat{\sigma}_{ij}^2$ are estimates of the population parameters $\tau_i^2$ in Equation (9) and $\sigma_{ij}^2$ in Equation (8), respectively.
When there is no variation between studies, which indicates $\tau_i^2 = 0$, every study has the identical population effect size μ_{ij} = μ_{i}. In this case, the model is called a Fixed-Effects Model (FEM). Otherwise, the model is called a Random-Effects Model (REM), in which $\tau_i^2 > 0$. The test for FEM or REM and the estimate of $\tau_i^2$ in Equation (9) can be found in [4,6,18,23].
Thus, when incomplete genes are absent, we can directly use this estimate as the meta-score:
$$x_M(i) = \hat{\mu}_i$$
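The semi-weighted mean for a complete gene can be sketched in Python as follows. The text defers the estimation of the between-study variance to the cited references; the DerSimonian-Laird moment estimator used below is one common choice and is an assumption here, not necessarily the estimator used by the authors. All function names are hypothetical.

```python
def between_study_variance(scores, sampling_vars):
    """DerSimonian-Laird moment estimate of the between-study variance.

    Assumption: the source only cites references for this estimate; this
    estimator is a widely used option, shown for illustration.
    """
    k = len(scores)
    w = [1.0 / v for v in sampling_vars]             # fixed-effects weights
    sum_w = sum(w)
    fixed_mean = sum(wi * xi for wi, xi in zip(w, scores)) / sum_w
    q = sum(wi * (xi - fixed_mean) ** 2 for wi, xi in zip(w, scores))
    denom = sum_w - sum(wi ** 2 for wi in w) / sum_w
    return max(0.0, (q - (k - 1)) / denom)           # truncated at zero (FEM case)


def semiweighted_mean(scores, sampling_vars, tau2):
    """Weighted mean with weights 1 / (sampling variance + between-study variance)."""
    w = [1.0 / (v + tau2) for v in sampling_vars]
    return sum(wi * xi for wi, xi in zip(w, scores)) / sum(w)


# Toy example: one gene measured in two studies with equal sampling variance.
tau2 = between_study_variance([0.0, 2.0], [1.0, 1.0])
meta_score = semiweighted_mean([0.0, 2.0], [1.0, 1.0], tau2)
```

When the estimated between-study variance is zero, the weights reduce to inverse sampling variances, which corresponds to the fixed-effects case described above.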
To incorporate the imputation step described in the section "Imputation using Conditional Probability", we propose a variant of the above model. In our case, some of the x_{ij} are unobservable. As a consequence, the imputation of the scores for these missing replicates (denoted $\hat{x}_{ij}$) will lead to an additional source of variation, which can be accounted for by introducing a new error term e_{ij} in the model:
$$\hat{x}_{ij} = x_{ij} + e_{ij} = \mu_{ij} + \beta_{ij} + e_{ij} \tag{13}$$
For all observable replicates, e_{ij} = 0, so the new error term is only introduced for the missing replicates. We extend the semi-weighted mean in Equation (10) to a form involving e_{ij} as follows:
$$\hat{\mu}'_i = \frac{\sum_{j=1}^{k} w'_{ij}\, \hat{x}_{ij}}{\sum_{j=1}^{k} w'_{ij}} \tag{14}$$
$$w'_{ij} = \frac{1}{\hat{\sigma}_{ij}^2 + \hat{\tau}_i^2 + \hat{\varepsilon}_{ij}^2} \tag{15}$$
where $\hat{\varepsilon}_{ij}^2$ is the estimated variance of e_{ij}, which is determined by the chosen method of imputation. Thus, when incomplete genes are present, we use this estimate of the semi-weighted mean as the meta-score:
$$x_M(i) = \hat{\mu}'_i$$
We can explain the impact of the error term e_{ij} as follows. First, if the expectation of the error term is E(e_{ij}) = 0, the estimate of the score of a missing replicate in Equation (13) is unbiased:
$$E(\hat{x}_{ij}) = x_{ij}$$
and the estimate in Equation (14) is again an unbiased estimate of μ_{i}. Otherwise, $\hat{\mu}'_i$ could overestimate or underestimate μ_{i}, depending on the method of imputation. Second, intuitively, the imputed scores will have a smaller weight in Equation (15), due to the inclusion of the estimated variance of the new error term.
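The down-weighting of imputed scores can be illustrated with a short Python sketch. It assumes that the variance of the imputation error simply adds to the other two variance terms in the weight, as described above; the function name and the numbers are hypothetical, and the sampling variance assigned to the imputed replicate is a placeholder, since this section does not specify how it is set.

```python
def semiweighted_mean_with_imputation(scores, sampling_vars, tau2, imputation_vars):
    """Return (meta_score, weights); imputation_vars is 0 for observed replicates."""
    weights = [1.0 / (s + tau2 + e)
               for s, e in zip(sampling_vars, imputation_vars)]
    meta_score = sum(w * x for w, x in zip(weights, scores)) / sum(weights)
    return meta_score, weights


# Toy example: the third replicate is imputed (score 4.0, imputation variance 1.5).
meta, weights = semiweighted_mean_with_imputation(
    scores=[1.0, 1.0, 4.0],
    sampling_vars=[0.5, 0.5, 0.5],   # placeholder value for the imputed replicate
    tau2=0.0,
    imputation_vars=[0.0, 0.0, 1.5],
)
```

In this example the imputed replicate receives a quarter of the weight of an observed one, so the meta-score stays closer to the observed scores than an unweighted average would.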
Imputation using Conditional Probability
The imputation step enables the incomplete genes, which are usually neglected in previous studies, to be included in the meta-analysis.
We use a conditional probability distribution (CPD) for imputation. When detecting differentially expressed genes in multiple datasets with respect to the same type of sample labels (e.g., tumor vs. normal), the scores between datasets are usually positively correlated, which reflects the consistency between datasets in terms of significant genes. Otherwise, the meta-analysis is pointless. Intuitively, a gene that is observed to be differentially expressed in most studies is also expected to be significant in the studies where the gene is missing. Based on this, we can estimate the unobservable scores conditioned on the observable scores of the same gene in other studies.
1. Distribution model
For the score matrix X = [x_{ij}]_{n × k }in Equation (3), we denote x_{i}., i = 1,···, n, as the vector of the ith row (feature), and x._{j}, j = 1,···, k, as the vector of the jth column (dataset).
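In our model, the row vector x_{i}. follows a k-dimensional normal distribution across k datasets:
$$x_{i\cdot} \sim N_k(\mu, \Sigma) \tag{18}$$
where the dimensions (columns x._{j}) are usually positively correlated.
We denote p_{i} ⊂ {1, ···, k} as the set of indices of the unobservable dimensions (missing replicates), and q_{i} ⊂ {1, ···, k} as the set of the observable dimensions, so that |p_{i}| + |q_{i}| = k and p_{i} ∩ q_{i} = Ø. For gene i, the distribution of the unobservable subvector $x_{i,p_i}$ conditional on the observable subvector $x_{i,q_i}$ is given by:
$$x_{i,p_i} \mid x_{i,q_i} \sim N(\bar{\mu}_i, \bar{\Sigma}_i) \tag{19}$$
where
$$\bar{\mu}_i = \mu_{p_i} + \Sigma_{p_i q_i} \Sigma_{q_i q_i}^{-1} (x_{i,q_i} - \mu_{q_i}) \tag{20}$$
$$\bar{\Sigma}_i = \Sigma_{p_i p_i} - \Sigma_{p_i q_i} \Sigma_{q_i q_i}^{-1} \Sigma_{q_i p_i} \tag{21}$$
and $\mu_{p_i}$, $\mu_{q_i}$, $\Sigma_{p_i p_i}$, $\Sigma_{p_i q_i}$, $\Sigma_{q_i p_i}$ and $\Sigma_{q_i q_i}$ denote the corresponding subvectors of μ and blocks of Σ.
More details of the conditional multivariate normal distribution can be found in [24]. Note that the approximate normality of the real datasets used in our experiments is shown in Additional File 1.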
Additional file 1. Supplement. The supplement contains an analysis of the normality in the five breast cancer datasets, a correlation analysis of the significant genes identified in the five breast cancer datasets, significant Gene Ontology terms in the three gastric cancer datasets and the accuracy of classification in both breast and gastric cancer datasets.
2. Parameter estimation
The above parameters μ and Σ are computed from all complete genes using maximum likelihood estimation. Consequently, we can obtain the conditional probability distribution in Equation (19).
3. Imputation
Given the CPD in Equation (19), the most likely score for the missing replicates is given by the mean of the distribution. Thus, the score for missing replicate g_{ij} in Equation (13) is imputed as an element of the conditional mean subvector:
$$\hat{x}_{ij} = (\bar{\mu}_i)_j, \quad j \in p_i \tag{23}$$
where the conditional mean $\bar{\mu}_i$ is computed in Equation (20).
However, the CPD allows other possible values for the scores of missing replicates, which leads to variation in the imputation. The variance of this estimate, which is modeled by the error term e_{ij} in Equation (13), is given by the diagonal elements of the covariance matrix in Equation (21) of the CPD:
$$\hat{\varepsilon}_{ij}^2 = (\bar{\Sigma}_i)_{jj}, \quad j \in p_i \tag{24}$$
where the conditional covariance $\bar{\Sigma}_i$ is computed in Equation (21).
Consequently, the imputed scores for missing replicates in Equations (13) and (14) and the estimated variance of imputation in Equations (13) and (15) can be obtained using our strategy, and are used to compute the meta-scores.
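The three imputation steps can be sketched in plain Python as follows. The sketch fits the mean vector and covariance matrix on the complete genes with the maximum likelihood (divide-by-n) estimator and then applies the standard conditional multivariate normal formulas to one incomplete gene. All function names are hypothetical, missing replicates are represented by `None`, and the small linear solver is included only to keep the example free of external libraries.

```python
def fit_mvn(complete_rows):
    """Maximum likelihood mean vector and covariance matrix from complete genes."""
    n, k = len(complete_rows), len(complete_rows[0])
    mu = [sum(row[j] for row in complete_rows) / n for j in range(k)]
    sigma = [[sum((row[a] - mu[a]) * (row[b] - mu[b]) for row in complete_rows) / n
              for b in range(k)] for a in range(k)]
    return mu, sigma


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= factor * m[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][j] * x[j] for j in range(r + 1, n))) / m[r][r]
    return x


def impute_missing(row, mu, sigma):
    """Return {dataset index: (imputed score, imputation variance)} for one gene."""
    observed = [j for j, v in enumerate(row) if v is not None]   # index set q_i
    missing = [j for j, v in enumerate(row) if v is None]        # index set p_i
    sigma_qq = [[sigma[a][b] for b in observed] for a in observed]
    residual = [row[j] - mu[j] for j in observed]
    result = {}
    for j in missing:
        sigma_jq = [sigma[j][b] for b in observed]
        coef = solve(sigma_qq, sigma_jq)      # Sigma_qq^{-1} Sigma_qj
        cond_mean = mu[j] + sum(c * r for c, r in zip(coef, residual))
        cond_var = sigma[j][j] - sum(c * s for c, s in zip(coef, sigma_jq))
        result[j] = (cond_mean, cond_var)
    return result


# Toy example: three datasets with pairwise correlation 0.5; dataset 3 is missing.
mu = [0.0, 0.0, 0.0]
sigma = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
imputed = impute_missing([2.0, 2.0, None], mu, sigma)
```

Only the diagonal of the conditional covariance is computed, because only the per-replicate variances enter the weights of the meta-score.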
In summary, the intuition of the CPD strategy is to impute the scores of missing replicates based on the positive correlation between datasets, which is also the basis of meta-analysis. We discuss the reasons why we employ such an imputation strategy here.
1. Choice of distribution: Assuming a multivariate normal distribution for data is a typical way to estimate missing values in incomplete data, even if the real distribution is not exactly normal [25]. The multivariate normal assumption enables the use of a tractable conditional probability model and captures the correlation between datasets, which is usually present and positive when we apply statistical tests to multiple datasets with respect to the same type of clinical annotation.
2. Unbiased estimation: Under the proposed model, the imputation provides an unbiased estimate of the scores for missing replicates (Equation (23)), which is desirable for an accurate estimate of the population effect size (E(e_{ij}) = 0 in the linear model above).
3. Variation of imputation: A critical aspect of imputation is how to model the instability of estimating missing values, which is reflected as the variance of imputation (Equation (24)). In the survey of [26], two types of imputation, "model-based imputation" [25,27] and "multiple imputation" [28], dealt with this problem by using the EM algorithm and estimating multiple values for missing entries, respectively. However, since our model itself provides an estimate of the imputation variance based on the CPD, this variance can be directly used in the linear model in Equation (13). This strategy, which includes the variance of imputation as part of the model, avoids the iterative procedure in the EM algorithm, which can be costly for large-scale studies. Moreover, it also avoids repeatedly applying the downstream analysis to the multiple versions of imputed datasets that would arise in multiple imputation. Overall, our imputation is considered to be a "composite method" comprising "model-based imputation" and "cold deck imputation" [26] with a strategy of embedding the variance in the meta-analysis model.
However, the CPD model has a potential limitation due to the assumption of the multivariate normal distribution in Equation (18). Under this assumption, the effect sizes of all genes follow a multivariate normal distribution with the same mean (μ). This assumption may not always hold because the effect sizes of differentially and non-differentially expressed genes may come from different distributions. On one hand, the number of differentially expressed genes is relatively small in practice, and we demonstrate the validity of the assumption for imputing incomplete genes in Section 3. On the other hand, this issue has been considered in [10], where a mixture model was proposed for differentially and non-differentially expressed genes. Thus, the integration of a mixture model for refining the imputation stage will be investigated in our future work.
Another potential limitation of this imputation method is the lack of modeling of the dependence between studies when estimating the true effect size in Equation (14). Although this model assigns a smaller weight to the imputed effect sizes in order to compensate for the variability of imputation, the dependence caused by the CPD in Equation (19) has not been taken into account. A topic for future research is to establish a model that incorporates this inter-study dependence.
Comparable Methods
In addition to the algorithm described above, we have also implemented several other methods to evaluate the importance of including incomplete genes and properly imputing their significance. The Hedges' g effect size [4] is used in all methods to compute the individual scores, and the linear model described above is used to compute the meta-scores. The comparable methods that we have implemented are as follows.
1. INTERSECTION: All incomplete genes are discarded as in earlier meta-analysis methods. Thus, the candidate gene set G_{0} is the intersection of the gene sets in all datasets (G_{I}). The imputation step is not necessary. In this case, IGM is equivalent to the method of [6].
2. IGNORE: Both complete genes and incomplete genes are taken into account, by simply ignoring the missing replicates in the incomplete genes. Meta-scores are computed based only on the observable replicates in the incomplete genes. A typical example of this type of method can be found in [29].
These comparable methods are designed for different purposes. By comparing with the INTERSECTION method, we can show the importance of including incomplete genes. The IGNORE method is also considered because it is the simplest way of incorporating incomplete genes.
Evaluation Metrics
In order to evaluate the statistical significance of the differential expression of genes, we use the false discovery rate estimated by the permutation test [6,20] as our metric. We also use the Gene Ontology [30] to assess the significance of the biological processes that are enriched in the significant genes identified by our methods. In the Additional File 1 we also consider the effect of incomplete genes on classification accuracy.
False Discovery Rate
The false discovery rate [19] is defined as the ratio of the number of false positives to the number of features declared significant according to a specific ranking of features. However, when the gold standard for the true positives is not available, the FDR is usually estimated from the data. In our experiments, we employed the permutation test used by [20] and [6] to estimate the FDR.
The idea behind this method [6,20] is to estimate the number of false positives at a given significance level by randomly permuting the labels of samples. We assume that we need to estimate the FDR at the significance level of x_{M}(i), which is the meta-score of g_{i} and is ranked R_{i} from the most to least significant. In the b-th permutation, the labels of samples are independently permuted in every dataset, ensuring that the number of samples in each class is unchanged. We then repeat the process of meta-analysis, and produce a vector of meta-scores in this permutation. For the unpermuted meta-score x_{M}(i) associated with g_{i}, the number of false positives is estimated as the number of permuted meta-scores greater than or equal to x_{M}(i) in this permutation. After a total number of B permutations, the Expected number of False Positives (EFP) is computed as the average number of false positives across all permutations. Consequently, the FDR at x_{M}(i) is the ratio of EFP to the number of genes declared significant at the threshold of x_{M}(i), which is the rank R_{i}.
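The FDR estimate described above can be sketched in Python as follows. The sketch assumes that a larger meta-score means a more significant gene and that the meta-scores of the B label-permuted analyses have already been computed; the function name and the toy numbers are hypothetical.

```python
def permutation_fdr(observed, permuted):
    """Estimate the FDR at the meta-score of every gene.

    observed: list of meta-scores from the unpermuted data.
    permuted: list of B lists of meta-scores, one per label permutation.
    Returns a dict mapping gene index to the estimated FDR at its meta-score.
    """
    n_perm = len(permuted)
    # Rank genes from the most to the least significant meta-score.
    order = sorted(range(len(observed)), key=lambda i: -observed[i])
    fdr = {}
    for rank, i in enumerate(order, start=1):
        # False positives: permuted meta-scores at least as large as x_M(i).
        false_positives = sum(sum(1 for s in scores if s >= observed[i])
                              for scores in permuted)
        expected_fp = false_positives / n_perm       # EFP, averaged over permutations
        fdr[i] = expected_fp / rank                  # rank = genes declared significant
    return fdr


# Toy example: three genes and B = 2 permutations.
fdr = permutation_fdr([3.0, 1.0, 2.0], [[0.5, 1.5, 2.5], [1.0, 0.2, 3.5]])
```

Ties between observed meta-scores are not treated specially in this sketch.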
Gene Ontology Significance
To assess the ability to identify significantly overrepresented GO terms, we compute the significance of GO terms associated with each subset of significant genes ranked by our methods. A p-value is computed for each GO term using Fisher's exact test, where a small p-value implies that this term is significantly overrepresented. In our experiments, we only consider the Gene Ontology branch "Biological Process."
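For a single GO term, the enrichment p-value can be sketched in Python as the upper tail of the hypergeometric distribution, which is the one-sided form of Fisher's exact test. The one-sided choice is an assumption motivated by the focus on overrepresentation; the function name and the toy counts are hypothetical.

```python
from math import comb


def enrichment_pvalue(n_total, n_term, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn from n_total genes,
    of which n_term are annotated with the GO term."""
    largest = min(n_term, n_selected)
    favourable = sum(comb(n_term, x) * comb(n_total - n_term, n_selected - x)
                     for x in range(n_overlap, largest + 1))
    return favourable / comb(n_total, n_selected)


# Toy example: 10 candidate genes, 3 annotated with the term,
# 3 declared significant, 2 of which carry the annotation.
p_value = enrichment_pvalue(n_total=10, n_term=3, n_selected=3, n_overlap=2)
```

In practice the test is repeated for every GO term associated with the selected genes.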
Results
In this section, we first summarise the IGM algorithm, whose details are described in the Methods section. We then apply the IGM algorithm as well as the other approaches described under "Comparable Methods" to three separate sets of gene expression microarrays: five breast cancer datasets generated on the same platform, three gastric cancer datasets from different platforms and eleven different types of cancer datasets from the same platform. By comparing their performance in terms of the false discovery rate and the Gene Ontology terms, we show that, compared with the other approaches, IGM is better able to identify significant genes and GO terms that have been shown to be closely related to these cancers by the previous literature.
While our aim is to support meta-analysis across different microarray platforms, we first need to test the accuracy of our approach under controlled conditions. We achieve this in the section "Controlled Evaluation of Accuracy in Breast Cancer Datasets" by analysing five breast cancer datasets from the same platform, where we can simulate incomplete genes by randomly removing genes from each dataset. In this way, we can validate the accuracy of our method by comparing the results of meta-analysis with and without the incomplete genes. Having evaluated the accuracy of our approach under controlled conditions, we then evaluate its performance on three gastric cancer datasets that were generated on different platforms. Finally, we test our method on a larger scale of 11 cancer datasets.
IGM Algorithm
We summarise the key steps of the IGM algorithm as follows.
1. Input: k (k ≥ 2) gene expression microarray datasets GE_{j} = (G_{j}, S_{j}), j = 1, ···, k.
2. Alignment: Calculate the union set of features in all studies, $G_U = \bigcup_{j=1}^{k} G_j$, with n = |G_{U}|.
3. Effect sizes: Compute the effect size x_{ij} of each feature i in study j for all features in G_{U}.
4. Imputation: Impute the statistic of the missing replicates in the above score matrix X using the CPD method in the section "Imputation using Conditional Probability". The score matrix with imputed significance is denoted as:
$$X' = [x'_{ij}]_{n \times k}$$
where x'_{ij} is the observed score x_{ij} for an observable replicate and the imputed score for a missing replicate.
5. Meta-score: Compute the meta-scores x_{M}(i) for all features based on the score matrix X' using the variant of the linear model described in the Methods section.
In our implementation, we have also provided an option to filter out the features with only a small proportion (e.g., 30%) of observable replicates in order to avoid unstable imputation.
In addition, we also implemented the INTERSECTION and IGNORE methods described under "Comparable Methods" by specifying different options in the framework. These two methods are the basis of comparison with our method in the evaluation. The main IGM program was implemented in Matlab and the source code is provided in Additional File 2.
Additional file 2. Source Code. This additional file contains the source code of the program of our IGM framework, which was implemented using Matlab. In addition, a brief description is included to explain how to use this Matlab program.
Controlled Evaluation of Accuracy in Breast Cancer Datasets
As a first step, we need to evaluate the accuracy of IGM. However, this raises the question of how to measure accuracy in the absence of any ground truth of the significance of each gene, especially for incomplete genes. In order to generate such a ground truth for a controlled evaluation, we have simulated missing replicates in five breast cancer datasets from the same platform. In this way, we can assess the accuracy of the meta-scores generated for each gene with simulated missing replicate(s) by comparing them with the meta-scores generated where all replicates are present in the original datasets. The meta-scores from the original datasets with no missing replicates thus become a "gold standard" for our evaluation, since using more samples leads to more reliable results. The results of our evaluation are presented in the following subsections.
Breast Cancer Datasets
We used five public breast cancer datasets from NCBI GEO [31]: GSE2034 [32], GSE4922 [33], GSE6532 [34,35], GSE7390 [36], and GSE11121 [37], all on the Affymetrix HG-U133A platform. The phenotype was a binary label: fewer than 5 years versus 5 or more years to metastasis.
Simulating Missing Replicates
Assuming that the probes are missing in each dataset independently, we randomly removed a proportion of probes (30% in the following experiments) from each dataset to simulate missing replicates. We then tested each meta-analysis approach on these datasets with simulated missing replicates. Subsequently, by comparing the results with the gold standard (the gene ranking generated on the original datasets), we can evaluate the ability of the approach to estimate the significance of incomplete genes.
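The simulation of missing replicates can be sketched in Python as follows. The sketch assumes that each dataset is represented by its list of probe identifiers and that removal is performed independently per dataset with a seeded random number generator; the function name, the seed and the toy probe names are hypothetical.

```python
import random


def simulate_missing(probe_sets, proportion=0.3, seed=0):
    """Remove a fixed proportion of probes from each dataset independently."""
    rng = random.Random(seed)                    # seeded for reproducibility
    reduced = []
    for probes in probe_sets:
        probes = sorted(probes)                  # fixed order before sampling
        n_remove = round(proportion * len(probes))
        removed = set(rng.sample(probes, n_remove))
        reduced.append([p for p in probes if p not in removed])
    return reduced


# Toy example: five datasets sharing the same ten probes.
original = [[f"probe_{i}" for i in range(10)] for _ in range(5)]
simulated = simulate_missing(original, proportion=0.3, seed=1)
```

Repeating the call with different seeds yields independent groups of datasets with simulated missing replicates.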
FDR Comparison
In this section, by comparing the FDR between different methods, we demonstrate that IGM is able to better estimate the significance of incomplete genes than the INTERSECTION and IGNORE methods. We first applied the framework in Section to the original five datasets without any missing replicates (hence, imputation is not necessary) to generate a gene ranking, and computed the FDR using the permutation method in Section as the gold standard for comparison. In this case, IGM is equivalent to the method in [6]. We then generated 100 groups of datasets with simulated missing replicates using the approach described in Section. For each group of datasets, we generated a ranking of all probes using IGM as well as the other methods, and computed the average FDR across the 100 groups of datasets for each method. The resulting FDR (in log scale) versus the number of probes declared significant for the "gold standard", IGM and the comparable methods, are shown in Figure 3. In addition, the 5% and 95% quantiles of the FDR across all 100 simulations are shown at several positions to demonstrate the significance of the differences between these methods.
Figure 3. FDR evaluation on breast cancer datasets. The average FDR of different metaanalysis methods in the breast cancer datasets. Except for the FDR computed on the original datasets, which is used as the gold standard (labeled "Gold"), the FDR of each method was averaged across 100 groups of datasets with simulated missing replicates. The error bars at 200, 400, 600 and 800 features give the 5% and 95% quantiles of the FDR across the 100 simulations.
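The averaging and the quantile bands reported for the simulations can be sketched as follows. This is only an illustration under stated assumptions: each simulation is assumed to yield one FDR value per cutoff (number of probes declared significant), the quantiles use simple linear interpolation, and the names `quantile` and `summarize_fdr` are hypothetical.

```python
import math
import statistics

def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_values) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac

def summarize_fdr(fdr_runs, lower=0.05, upper=0.95):
    """Average FDR curves across simulations and report quantile bands.

    `fdr_runs[r][k]` is the FDR of simulation r when the top k+1 probes
    are declared significant.
    """
    summary = []
    for k in range(len(fdr_runs[0])):
        values = sorted(run[k] for run in fdr_runs)
        summary.append({
            "n_significant": k + 1,
            "mean_fdr": statistics.fmean(values),
            "q_low": quantile(values, lower),
            "q_high": quantile(values, upper),
        })
    return summary


# Toy input (hypothetical values): three simulations, two cutoffs.
runs = [[0.01, 0.02], [0.03, 0.04], [0.05, 0.06]]
summary = summarize_fdr(runs)
```

Plotting `mean_fdr` on a log scale against `n_significant`, with `q_low` and `q_high` as error bars, would give a figure of the same form as the one described here.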
In our comparison, we consider that the probe ranking generated on the original datasets without any missing replicates, where most information is available, is most reliable, and we refer to this as our "gold standard". Note that the FDR for the gold standard is nonzero because some genes in the original dataset are significant just by chance.
All methods when applied to the datasets with simulated missing replicates produce the same results for complete genes; the difference between these methods is reflected in their ability to estimate the significance of incomplete genes.
We analyse the cause of the overestimation of the FDR as follows. If a particular method often assigns some incomplete genes less significant scores than the significance level that they have in the gold standard, these genes have a greater chance of being counted as false positives (see the FDR computation described earlier for details). In this case, the FDR is likely to be overestimated due to the increased number of false positives. For example, in Figure 3, since the INTERSECTION method discarded all incomplete genes, which is equivalent to assigning them the least significant score (e.g., p-value = 1), the FDR is overestimated compared to the gold standard. In the IGNORE method, the estimated significance of incomplete genes is determined solely by the observable replicates, and the interstudy correlation is neglected. Thus, the estimated significance is likely to be distorted by those observable values, and so the estimated FDR deviates from the "gold standard".
Thus, we aim to develop a metaanalysis method that generates an FDR as close as possible to the FDR generated by the gold standard, indicating that this method is able to precisely estimate the significance of probes even though some replicates are missing. In this regard, our approach outperforms the others, since it is closest to the gold standard, and the significance of this difference in the FDR distributions is demonstrated by Figure 3.
Gene Ontology Terms
To further compare the ability of each method to find a more significant set of genes, we have also evaluated the GO terms found in the five breast cancer datasets.
In this experiment, we used the probe rankings produced by the gold standard, INTERSECTION and IGM, which are identical to those used in the FDR comparison. Based on these rankings, a subset of significant probes (FDR ≤ 0.01) was selected for each method and each simulation of missing replicates. To assess the significance of enrichment, we used Fisher's exact test to compute the p-values of GO enrichment in these significant subsets. The Biological Process branch was used. Subsequently, for the INTERSECTION and IGM methods, we computed the geometric mean of the p-values of the GO terms from all 100 simulations, in order to generate a single integrated list of GO terms as a basis for comparison with the terms produced by the gold standard. In Table 1 the top four GO terms for each method are listed.
Table 1. Top GO Terms in breast cancer datasets
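The enrichment test and the aggregation across simulations can be sketched as follows. The sketch assumes a one-sided (over-representation) form of Fisher's exact test, computed as a hypergeometric upper tail; the text does not state which alternative was used, so this choice, the function names and the `floor` guard are assumptions made for illustration.

```python
from math import comb, exp, log

def fisher_enrichment_pvalue(n_universe, n_term, n_selected, n_overlap):
    """One-sided Fisher's exact test for over-representation.

    Probability of seeing at least `n_overlap` annotated probes among
    `n_selected` significant probes, when `n_term` of the `n_universe`
    probes carry the GO term (hypergeometric upper tail).
    """
    max_overlap = min(n_term, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_term, k) * comb(n_universe - n_term, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total

def geometric_mean(pvalues, floor=1e-300):
    """Geometric mean computed in log space; `floor` guards against log(0)."""
    return exp(sum(log(max(p, floor)) for p in pvalues) / len(pvalues))


# Hypothetical counts: 1000 probes, 40 annotated with a term,
# 50 declared significant, 8 of which carry the term.
p_single = fisher_enrichment_pvalue(1000, 40, 50, 8)

# Combine the p-values of one GO term over several simulations
# (three hypothetical values stand in for the 100 runs).
p_combined = geometric_mean([0.01, 0.0001, 0.001])
```

Averaging in log space keeps the combined value well defined even when individual p-values are very small, which is the usual reason to prefer the geometric over the arithmetic mean for this kind of summary.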
As with the FDR evaluation, a good metaanalysis method is expected to reproduce the order of GO terms generated by the gold standard as much as possible when missing replicates are present. Before comparing the INTERSECTION and IGM with the gold standard, we first show that the gold standard has effectively identified the important GO terms associated with the time to metastasis of breast cancer.
A short time to metastasis (less than five years) has been linked to upregulation of the genes related to cell cycle, cell proliferation, and cell invasion [32,38]. The significant GO terms generated by the gold standard confirm that the upregulation of the biological processes related to cell cycle, such as mitotic chromosome condensation, spindle organization, DNA replication and DNA repair [32,38-40], the processes related to signal transduction, such as phosphoinositide-mediated signaling [32,38], and cell proliferation [40] are most strongly associated with the short time to metastasis.
In order to show the advantages of IGM statistically, we compared the precision and recall of the INTERSECTION and IGM methods in identifying the significant GO terms found by the gold standard. First, in order to establish a gold standard for comparing GO terms, we selected the true significant GO terms from the gold standard method by setting a threshold α on the p-values. For example, given α = 0.01, we find the set of GO terms in the gold standard with a p-value ≤ α, and denote this set as G. Second, we ordered all GO terms in each of the other methods (IGM, INTERSECTION and IGNORE) separately according to their p-values. Third, for each method (IGM or INTERSECTION), we scanned the ordered GO terms from the most significant to the least significant, and declared the top k GO terms as significant (where k ranges from 1 to the total number of GO terms). Finally, for each k, we compared the terms declared significant with the true significant terms in the set G, and computed the precision and recall for this k. Thus, we can generate a vector of precision-recall pairs for different values of k, shown as a curve in Figure 4. This procedure is similar to the generation of a ROC curve.
Figure 4. Precision-recall of GO terms. Precision-recall curves of GO terms in the breast cancer datasets. Left: the true significant terms are annotated from the gold standard under the threshold 0.001. Right: the true significant terms are annotated from the gold standard under the threshold 0.01.
Figure 4 shows the precision-recall curves across the ranked terms in each method, generated under the thresholds α = 0.001 and α = 0.01. The higher precision and recall of IGM demonstrate that IGM better reproduced the order of GO terms in the gold standard than the INTERSECTION method.
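The top-k scan used to build the precision-recall pairs can be written compactly. The following is a minimal sketch: the term identifiers and p-values are made-up placeholders, and `precision_recall_curve` is a hypothetical helper name rather than part of any published code.

```python
def precision_recall_curve(ranked_terms, true_terms):
    """Precision and recall after declaring the top k terms significant.

    `ranked_terms` is ordered from most to least significant; `true_terms`
    is the set G selected from the gold standard at threshold alpha.
    Returns (k, precision, recall) for k = 1..len(ranked_terms).
    """
    true_terms = set(true_terms)
    curve = []
    hits = 0
    for k, term in enumerate(ranked_terms, start=1):
        if term in true_terms:
            hits += 1
        precision = hits / k
        recall = hits / len(true_terms) if true_terms else 0.0
        curve.append((k, precision, recall))
    return curve


# Hypothetical gold-standard p-values and the set G at alpha = 0.01.
gold_pvalues = {"GO:A": 0.0005, "GO:B": 0.004, "GO:C": 0.2, "GO:D": 0.6}
alpha = 0.01
G = {term for term, p in gold_pvalues.items() if p <= alpha}

# Hypothetical p-values from one method, ranked from most to least significant.
method_pvalues = {"GO:A": 0.001, "GO:C": 0.003, "GO:B": 0.02, "GO:D": 0.5}
ranked = sorted(method_pvalues, key=method_pvalues.get)
curve = precision_recall_curve(ranked, G)
```

Each tuple in `curve` corresponds to one point of a precision-recall plot; a method whose ranking agrees with the gold standard keeps precision high while recall rises.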
Similarly, ROC curves of GO terms can also be generated for different thresholds. We show the comparison of ROC curves between the IGM, INTERSECTION and IGNORE methods in Figure 5. The results confirm that IGM was closest to the gold standard in terms of reproducing the significant GO terms.
Figure 5. ROC of GO terms. ROC curves of GO terms in the breast cancer datasets. Left: the true significant terms are annotated from the gold standard under the threshold 0.01. Right: the true significant terms are annotated from the gold standard under the threshold 0.1.
We computed the correlation coefficients of the GO terms between each of the IGM, INTERSECTION and IGNORE methods and the gold standard; the result is shown as a scatter plot in Figure 6. The left panel shows the scatter plot of all GO terms between the three methods and the gold standard. IGM reproduced the GO terms and their significance from the gold standard better than the other two methods, because it achieved the largest agreement with the gold standard (closest to the ideal diagonal line and the highest correlation coefficient). In addition, the right panel shows the agreement of the GO terms between the IGM, INTERSECTION and IGNORE methods. The full list of these ranked GO terms for all methods is provided in Additional File 3.
Figure 6. Agreement of GO terms. Scatter plot of all GO terms between different methods in the breast cancer datasets. Left: the agreement between the IGM, INTERSECTION, IGNORE methods and the gold standard. Right: the agreement between the IGM, INTERSECTION, IGNORE methods.
Additional file 3. Ranked GO List. This additional file contains the full lists of GO terms which are ranked according to their significance in the breast cancer datasets. The GO terms for the gold standard, IGM, INTERSECTION and IGNORE methods are all included in this table.
Real Missing Replicates in Gastric Cancer Datasets
Gastric Cancer Datasets
We tested our IGM algorithm on three gastric cancer datasets, which we refer to as the Australian dataset [15] (6957 genes), the Hong Kong dataset [16] (13,258 genes) and the Japanese dataset [17] (4974 genes). These three datasets were generated on different spotted cDNA platforms and do not possess an identical set of probes. We aligned the features by their gene symbols. Since we focused on the signatures discriminating two well-known subtypes of gastric cancer, diffuse and intestinal, according to Lauren's classification [21], only the tumor samples were retained. The Australian dataset has 35 diffuse samples and 22 intestinal samples, the Hong Kong dataset has 13 diffuse samples and 68 intestinal samples, and the Japanese dataset has 5 diffuse samples and 17 intestinal samples.
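Aligning features by gene symbol, and separating complete from incomplete genes, can be sketched as below. The per-study values and the assignment of genes to studies are invented for illustration only (they do not reflect which genes are actually present on each platform), and `align_by_gene_symbol` is a hypothetical helper name.

```python
def align_by_gene_symbol(datasets):
    """Build a gene-by-study table keyed by gene symbol.

    `datasets` maps a study name to {gene_symbol: value}. A gene absent
    from a study gets None for that study, marking a missing replicate.
    """
    studies = list(datasets)
    all_genes = sorted(set().union(*(d.keys() for d in datasets.values())))
    table = {g: [datasets[s].get(g) for s in studies] for g in all_genes}
    complete = [g for g, row in table.items() if None not in row]
    incomplete = [g for g, row in table.items() if None in row]
    return studies, table, complete, incomplete


# Hypothetical per-gene statistics for three studies.
toy = {
    "Australian": {"CALD1": 1.2, "ERBB3": -0.8, "TNXB": 0.9},
    "HongKong": {"CALD1": 1.0, "ERBB3": -1.1, "TTK": -0.7},
    "Japanese": {"CALD1": 0.6, "TNXB": 0.4},
}
studies, table, complete, incomplete = align_by_gene_symbol(toy)
```

In this toy input only CALD1 is measured in all three studies; an intersection-based analysis would keep that gene alone, whereas the other three genes are the incomplete genes whose missing replicates IGM is designed to impute.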
Gene Ontology Terms
We evaluated the significance of GO terms enriched in the top ranked genes in the gastric cancer datasets. We applied all methods to all three gastric cancer datasets, and set a threshold of FDR ≤ 0.01 to produce a subset of significant genes based on the resulting ranks using each method. The FDR was estimated using the approach described earlier. We used GOstat [41] to detect the enriched GO terms for each subset of significant genes and to generate the corresponding FDR-corrected p-values [19]. In Table 2 we show the top GO terms overrepresented in the groups of significant genes. Note that the significant genes were divided into two groups, which are prominently overexpressed in the diffuse and intestinal subtypes, respectively.
Table 2. Top GO terms in gastric cancer datasets
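For readers unfamiliar with FDR-corrected p-values, the step-up procedure of Benjamini and Hochberg [19] can be sketched in a few lines. This is a generic illustration of that procedure, not a description of how GOstat implements its correction internally; the input p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical enrichment p-values for four GO terms.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
significant = [i for i, p in enumerate(adj) if p <= 0.03]
```

Terms whose adjusted p-value falls below the chosen threshold (0.01 in the experiment described here) are the ones reported as significantly overrepresented.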
A few incomplete genes were included in the significant set and participated in biological processes closely associated with a particular subtype of gastric cancer, such as "biological adhesion", which is enriched in the diffuse subtype (Table 2). As a result, the genes identified by IGM yielded more overrepresented terms that have been validated as related to these subtypes in the previous literature (see the Discussion) than the INTERSECTION method. Under a threshold of corrected p-value ≤ 0.01, IGM resulted in 73 significant terms while the INTERSECTION method resulted in only 20 significant terms. This result is consistent with what we observed in the breast cancer datasets.
A Validation on 11 Cancer Datasets
In order to validate the empirical performance on a larger number of studies, we applied our method and the INTERSECTION and IGNORE methods to a group of 11 datasets covering different types of cancer, with the purpose of discriminating normal and cancer samples. A similar application can also be found in [2]. These datasets are all publicly available in GEO [31] (GEO series numbers are GSE781, GSE2719, GSE3868, GSE7670, GSE9476, GSE9750, GSE14359, GSE15852, GSE19147, GSE22529 and GSE23400).
All 11 datasets were selected from the Affymetrix HGU133A platform in order to conduct the same evaluation as for the breast cancer datasets. We used the same settings as in the experiments on the five breast cancer datasets, except that the proportion of missing values in each dataset was set to 10% instead of 30% in order to retain enough features for the INTERSECTION method. The FDR comparison for all the methods is shown in Figure 7.
Figure 7. FDR evaluation on 11 cancer datasets. The average FDR of different metaanalysis methods in the 11 cancer datasets. The same experimental settings were used as for the five breast cancer datasets.
As shown in Figure 7, our IGM method still performs better than the INTERSECTION and IGNORE methods in terms of FDR, since it is closest to the gold standard over the entire range. However, the performance of IGM is closer to that of the IGNORE method than it was for the breast cancer datasets. Note that the left panel of Figure 7 shows the FDR for the top 10,000 features, while Figure 3 shows the FDR for the top 1000 features only; this is because the difference between the methods is too small to be seen when only a small number of features is selected.
Due to the noise and inconsistency when the number of studies increases, the interstudy correlation may decrease. As a result, the imputation based on the interstudy correlation may not be as effective as the situation where a significant positive interstudy correlation exists (as with the breast cancer datasets).
This might be a reason for the reduced difference between our IGM method and the IGNORE method. A previous study [10] considered the interstudy concordance in order to assess whether the studies are worth integrating. Thus, as future work, we may incorporate the interstudy concordance into the imputation step of our algorithm in order to improve the performance in large-scale studies.
Discussion
Here we discuss the biological relevance of the genes and GO terms that are overexpressed in the diffuse and intestinal subtypes separately.
Compared to intestinal gastric cancer, the most significant feature of the diffuse subtype is the poor differentiation caused by the invasion of tumor cells into the stroma [15,21,42].
The term "extracellular structure organization and biogenesis" and its descendant term, "extracellular matrix organization and biogenesis", which are associated with an important component of tumor invasion and metastasis, the extracellular matrix (ECM) [43,44], were overrepresented in our experiment. In these terms, aside from the genes COL4A6, COL6A2 and COL14A1 belonging to the collagen family, Tenascin-X (TNXB), which was described as a metastasis signature in breast cancer [45], was also upregulated in our experiment but has not previously been reported for gastric cancer. This is a potentially new discovery and provides a focus for further investigation.
Another feature of the diffuse subtype, active cell mobility, e.g., overexpression of Caldesmon 1 (CALD1), stimulates the invasion and metastasis of tumor cells [17,44]. This was reflected by the overrepresentation of the term "cell mobility" and its parent "localization of cell" in our experiment.
A few genes, such as the receptor tyrosine-protein kinase erbB-3 (ERBB3), which is related to growth factors [17], and dual specificity protein kinase (TTK) [46], which is related to cell proliferation, were found to be upregulated in the intestinal gastric cancer samples. The overexpression of these genes was reflected by the overrepresentation of several terms related to "cell cycle", such as "mitotic cell cycle" and "M phase of mitotic cell cycle".
By analysing the statistically significant terms and their biological relevance, we observe that the gene sets identified by IGM result in more significant GO terms, which are closely associated with particular subtypes of gastric cancer according to the previous literature. This demonstrates both the value of including incomplete genes and the ability of IGM to better reproduce the cancer related genes and the corresponding GO terms that have been validated by the previous literature.
Conclusion
Metaanalysis has been widely used for identifying a more robust set of differentially expressed genes by integrating multiple microarray datasets. However, some genes with missing replicates, which we refer to as incomplete genes, were neglected in previous studies. These genes may also be biologically significant even though their statistical significance is not confirmed by all studies. In this paper, we developed Incomplete Gene Metaanalysis for incorporating incomplete genes into the metaanalysis. We have shown that the gene rankings generated by IGM identified more statistically significant genes among the incomplete genes in terms of FDR, indicating the benefit of including the incomplete genes. We also applied our algorithm and the traditional methods to three gastric cancer datasets. The overrepresented GO terms in each set of significant genes implied that the subsets generated by IGM contained more genes that were associated with the important GO terms relevant to particular clinical annotations in both the breast cancer and gastric cancer datasets. Taken together, these results indicate the benefit of analysing the incomplete genes in addition to complete genes, and demonstrate that IGM is able to appropriately estimate the significance of incomplete genes.
Authors' contributions
FS, under the supervision of CL and AK, developed the major part of the methodology and ran the major part of the experiments. GA contributed to the Gene Ontology evaluation of the results in the manuscript and the classification evaluation in Additional File 1. IH contributed to the biological analysis of the results. All authors contributed to the writing and revision of the manuscript.
Acknowledgements
This work was supported by the Australian Research Council, and by the NICTA Victorian Research Laboratory. NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Center of Excellence program.
References

Warnat P, Eils R, Brors B: Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes.
BMC Bioinformatics 2005, 6:265+.

Xu L, Geman D, Winslow R: Large-scale integration of cancer microarray data identifies a robust common cancer signature.
BMC Bioinformatics 2007, 8:275+.

Xu L, Tan AC, Winslow RL, Geman D: Merging microarray data from separate breast cancer studies provides a robust prognostic test.
BMC Bioinformatics 2008, 9:125+.

Hedges LV, Olkin I: Statistical Methods for Meta-Analysis. San Diego, CA, USA: Academic Press; 1985.

Rhodes DR, Barrette TR, Rubin MA, Ghosh D, Chinnaiyan AM: Meta-Analysis of Microarrays: Interstudy Validation of Gene Expression Profiles Reveals Pathway Dysregulation in Prostate Cancer.
Cancer Research 2002, 62(15):4427-4433.

Choi JK, Yu U, Kim S, Yoo OJ: Combining multiple microarray studies and modeling interstudy variation.
Bioinformatics 2003, 19(Suppl 1).

Borozan I, Chen L, Paeper B, Heathcote JE, Edwards AM, Katze M, Zhang ZL, Mcgilvray ID: MAID: An effect size based model for microarray data integration across laboratories and platforms.
BMC Bioinformatics 2008, 9:305+.

Marot G, Foulley J, Mayer C, Jaffrezic F: Moderated effect size and P-value combinations for microarray meta-analyses.
Bioinformatics 2009, 25(20):2692-2699.

Breitling R, Armengaud P, Amtmann A, Herzyk P: Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments.
FEBS Letters 2004, 573(1-3):83-92.

Lai Y, Eckenrode SE, She JX: A statistical framework for integrating two microarray data sets in differential expression analysis.
BMC Bioinformatics 2009, 10(Suppl 1).

Shen K, Tseng GC: Meta-analysis for pathway enrichment analysis when combining multiple genomic studies.
Bioinformatics 2010, 26(10):1316-1323.

Wren JD: A global meta-analysis of microarray expression data to predict unknown gene functions and estimate the literature-data divide.
Bioinformatics 2009, 25(13):1694-1701.

Ghosh D, Barette TR, Rhodes D, Chinnaiyan AM: Statistical issues and methods for meta-analysis of microarray data: a case study in prostate cancer.

Petersen D, Chandramouli G, Geoghegan J, Hilburn J, Paarlberg J, Kim C, Munroe D, Gangi L, Han J, Puri R, Staudt L, Weinstein J, Barrett JC, Green J, Kawasaki E: Three microarray platforms: an analysis of their concordance in profiling gene expression.
BMC Genomics 2005, 6:63.

Boussioutas A: Distinctive Patterns of Gene Expression in Premalignant Gastric Mucosa and Gastric Cancer.
Cancer Research 2003, 63:2569-2577.

Ji JF, Chen X, Leung SY, Chi JA, Chu KM, Yuen ST, Li R, Chan AS, Li JY, Dunphy N, So S: Comprehensive analysis of the gene expression profiles in human gastric cancer cell lines.
Oncogene 2002, 21:6549-6556.

Hippo Y, Taniguchi H, Tsutsumi S, Machida N, Chong J, Fukayama M, Kodama T, Aburatani H: Global Gene Expression Analysis of Gastric Cancer by Oligonucleotide Microarrays.
Cancer Research 2002, 62:233-240.

Cochran WG: The Combination of Estimates from Different Experiments.
Biometrics 1954, 10:101-129.

Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.
Journal of the Royal Statistical Society. Series B (Methodological) 1995, 57:289-300.

Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response.
Proceedings of the National Academy of Sciences of the United States of America 2001, 98(9):5116-5121.

Lauren P: The two histological main types of gastric carcinoma: diffuse and so-called intestinal-type carcinoma.
Acta Path Microbiol Scand 1965, 64:31-49.

Ramasamy A, Mondry A, Holmes CC, Altman DG: Key issues in conducting a meta-analysis of gene expression microarray datasets.
PLoS Medicine 2008, 5(9):e184+.

DerSimonian R, Laird N: Meta-analysis in clinical trials.
Controlled Clinical Trials 1986, 7(3):177-188.

Arnold SF: The theory of linear models and multivariate analysis. New York: Wiley; 1981.

Schafer JL: Analysis of Incomplete Multivariate Data. London: Chapman & Hall; 1997.

Aittokallio T: Dealing with missing values in large-scale studies: microarray data imputation and beyond.
Brief Bioinformatics 2010, 11(2):253-264.

Dempster AP, Laird NM, Rubin DB: Maximum Likelihood from Incomplete Data via the EM Algorithm.
Journal of the Royal Statistical Society. Series B (Methodological) 1977, 39:1-38.

Rubin DB: Multiple Imputation for Nonresponse in Surveys. New York: J. Wiley & Sons; 1987.

Stevens JR, Nicholas G: Metahdep: meta-analysis of hierarchically dependent gene expression studies.
Bioinformatics 2009, 25(19):2619-2620.

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene Ontology: tool for the unification of biology.
Nature Genetics 2000, 25:25-29.

Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository.
Nucleic Acids Research 2002, 30:207-210.

Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J, Jatkoe T, Berns EM, Atkins D, Foekens JA: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer.

Ivshina AV, George J, Senko O, Mow B, Putti TC, Smeds J, Lindahl T, Pawitan Y, Hall P, Nordgren H, Wong JE, Liu ET, Bergh J, Kuznetsov VA, Miller LD: Genetic reclassification of histologic grade delineates new clinical subtypes of breast cancer.
Cancer Research 2006, 66:10292-10301.

Loi S, Haibe-Kains B, Desmedt C, Lallemand F, Tutt AM, Gillet C, Ellis P, Harris A, Bergh J, Foekens JA, Klijn JGM, Larsimont D, Buyse M, Bontempi G, Delorenzi M, Piccart MJ, Sotiriou C: Definition of Clinically Distinct Molecular Subtypes in Estrogen Receptor-Positive Breast Carcinomas Through Genomic Grade.
Journal of Clinical Oncology 2007, 25:1239-1246.

Loi S, Haibe-Kains B, Desmedt C, Wirapati P, Lallemand F, Tutt AM, Gillet C, Ellis P, Ryder K, Reid JF, Daidone MG, Pierotti MA, Berns EM, Jansen MP, Foekens JA, Delorenzi M, Bontempi G, Piccart MJ, Sotiriou C: Predicting prognosis using molecular profiling in estrogen receptor-positive breast cancer treated with tamoxifen.
BMC Genomics 2008, 9.

Desmedt C, Piette F, Loi S, Wang Y, Lallemand F, Haibe-Kains B, Viale G, Delorenzi M, Zhang Y, d'Assignies MS, Bergh J, Lidereau R, Ellis P, Harris AL, Klijn JG, Foekens JA, Cardoso F, Piccart MJ, Buyse M, Sotiriou C, Consortium T: Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series.
Clinical Cancer Research 2007, 13:3207-3214.

Schmidt M, Böhm D, von Törne C, Steiner E, Puhl A, Pilch H, Lehr HA, Hengstler JG, Kölbl J, Gehrmann M: The Humoral Immune System Has a Key Prognostic Impact in Node-Negative Breast Cancer.
Cancer Research 2008, 68:5405-5413.

van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer.
Nature 2002, 415:530-536.

Mosley J, Keri R: Cell cycle correlated genes dictate the prognostic power of breast cancer gene lists.
BMC Medical Genomics 2008, 1:11+.

Dai HY, van't Veer L, Lamb J, He YD, Mao M, Fine BM, Bernards R, van de Vijver M, Deutsch P, Sachs A, Stoughton R, Friend S: A cell proliferation signature is a marker of extremely poor outcome in a subpopulation of breast cancer patients.
Cancer Research 2005, 65(10):4059-4066.

Beissbarth T, Speed TP: GOstat: find statistically overrepresented Gene Ontologies within a group of genes.
Bioinformatics 2004, 20(9):1464-1465.

Tahara E: Molecular biology of gastric cancer.
World Journal of Surgery 1995, 19(4):484-488.

Yonemura Y, Endo Y, Fujita H, Fushida S, Ninomiya I, Bandou E, Taniguchi K, Miwa K, Ohoyama S, Sugiyama K, Sasaki T: Role of Vascular Endothelial Growth Factor C Expression in the Development of Lymph Node Metastasis in Gastric Cancer.
Clinical Cancer Research 1999, 5(7):1823-1829.

Stetler-Stevenson WG, Aznavoorian S, Liotta LA: Tumor Cell Interactions with the Extracellular Matrix During Invasion and Metastasis.
Annual Review of Cell Biology 1993, 9:541-573.

Crawford N, Walker R, Lukes L, Officewala J, Williams R, Hunter K: The Diasporin Pathway: a tumor progression-related transcriptional network that predicts breast cancer survival.
Clinical and Experimental Metastasis 2008, 25(4):357-369.

Ahn CH, Kim YR, Kim SS, Yoo NJ, Lee SH: Mutational Analysis of TTK Gene in Gastric and Colorectal Cancers with Microsatellite Instability.
Cancer Treatment and Research 2009, 41(4):224-228.