Prediction of heterogeneous differential genes by detecting outliers to a Gaussian tight cluster

Yang, Zihua; Yang, Zhengrong

doi:10.1186/1471-2105-14-81

Methodology article
Open access
Published: 05 March 2013


BMC Bioinformatics volume 14, Article number: 81 (2013)


Abstract

Background

Heterogeneously and differentially expressed genes (hDEG) are a common phenomenon due to biological diversity. A hDEG is often observed in gene expression experiments (with two experimental conditions) where it is highly expressed in a few experimental samples, or in drug trial experiments for cancer studies with drug resistance heterogeneity among the disease group. These highly expressed samples are called outliers. Accurate detection of outliers among hDEGs is then desirable for disease diagnosis and effective drug design. The standard approach for detecting hDEGs is to choose the appropriate subset of outliers to represent the experimental group. However, existing methods typically overlook hDEGs with very few outliers.

Results

We present in this paper a simple algorithm for detecting hDEGs by sequentially testing for potential outliers with respect to a tight cluster of non-outliers, among an ordered subset of the experimental samples. This avoids making any restrictive assumptions about how the outliers are distributed. We use simulated and real data to illustrate that the proposed algorithm achieves a good separation between the tight cluster of low expressions and the outliers for hDEGs.

Conclusions

The proposed algorithm assesses each potential outlier in relation to the tight cluster of non-outliers without making explicit assumptions about the outlier distribution. Simulated examples and breast cancer data sets are used to illustrate the suitability of the proposed algorithm for identifying hDEGs with small numbers of outliers.

Background

A heterogeneously and differentially expressed gene (hDEG) is a gene which has an inconsistent expression pattern across its experimental samples. Typically, a large proportion of the experimental samples and the control samples form a tight cluster in low expressions. The remaining small proportion of experimental samples, namely the outliers, are observed to significantly deviate from the tight cluster towards high expressions. We use the word ‘tight’ to describe the cluster of null (or low) expressions of a hDEG as the null variance is typically small compared to the null-outlier distance. In situations where the few highly expressed outliers of a non-differential gene are caused by measurement error, it is also useful to distinguish such genes with hDEG characteristics. The existence of hDEGs has been established in various experiments ([1-8]). Suppose we have the expressions of m genes. The standard t statistic under-estimates the significance of the difference between the control and experimental samples of a hDEG. COPA (cancer outlier profile analysis) [9] modifies the Student t statistic to be the ratio of the distance between the r th (default 90th) percentile of the experimental samples and the median of all samples to the median absolute deviation (about the median of all samples), i.e.,

$$t_{i}^{COPA} = \frac{q_{r}(y_{i}) - \lambda_{i}}{\sigma_{i}}, \quad i = 1, \dots, m \qquad (1)$$

where $\sigma_{i} = 1.4826 \times \mathrm{med}(|x_{i} - \lambda_{i}|, |y_{i} - \lambda_{i}|)$, $x_i$ and $y_i$ represent the control samples and experimental samples of the i th gene respectively, $q_r(y_i)$ is the r th percentile of $y_i$ and $\lambda_i$ is the median of the pooled samples $x_i$ and $y_i$. The quantile-median difference in (1) summarises the null-outlier distance using a single value of $y_i$. To make outlier detection more efficient, the outlier-sum (OS) statistic [10] sums over outliers, $t_{i}^{OS} = \sum_{j} (y_{ij} - \lambda_{i}) \sigma_{i}^{-1}$, where the sum runs over the outliers, defined as $\{y \in y_{i} : y > q_{75}(x_{i}, y_{i}) + IQR(x_{i}, y_{i})\}$. The outlier robust t statistic (ORT) uses the same statistic but defines the outliers in relation to the control samples only, $\{y \in y_{i} : y > q_{75}(x_{i}) + IQR(x_{i})\}$ [11]. The maximum ordered subset t statistic (MOST) defines the outliers to be the top k experimental samples and chooses k by optimising a normalised t statistic [12]. The least sum of ordered subset square t statistic (LSOSS) [13] also compares the controls with a subset of the top k experimental samples, $t_{i}^{LSOSS} = k (\bar{y}_{i}^{(k)} - \bar{x}_{i}) S_{i}^{-1}$, where $\bar{x}_{i}$ is the mean of the control samples, $\bar{y}_{i}^{(k)}$ is the mean of the top k experimental samples and $S_i$ is the pooled standard deviation of the set of control samples plus non-outlier experimental samples and the set of outlier experimental samples. k is optimised iteratively to minimise the within-cluster variance.

We propose a new algorithm for detecting hDEGs with a small number of outliers by detecting outliers via gap (DOG) maximisation. What makes this approach different from the existing methods is that we assess each potential outlier in relation to a tight cluster of non-outliers. This avoids modelling the highly expressed outliers explicitly, which is especially important when the number of outliers is small. The proposed algorithm classifies each gene as a hDEG or non-hDEG by locating potential outliers and summarises it using the average of the standardised outlier expressions. We use simulated examples and a breast cancer dataset to illustrate the effectiveness of the proposed algorithm in detecting hDEGs with few outliers. We also examine how the algorithms perform as the simulation conditions are varied.
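As a rough illustration of the statistics reviewed above, the following Python sketch computes the COPA statistic and the outlier sum for a single gene. The helper names (`percentile`, `centre_and_scale`, `copa`, `outlier_sum`) and the linear-interpolation percentile rule are assumptions made for this example. The `controls_only` option follows the description of ORT given here (same statistic, outlier cutoff taken from the controls only) rather than any reference implementation.

```python
from statistics import median

def percentile(data, r):
    """r-th percentile by linear interpolation (an assumed convention)."""
    s = sorted(data)
    pos = (len(s) - 1) * r / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def centre_and_scale(x, y):
    """Pooled median lambda_i and scaled median absolute deviation sigma_i."""
    pooled = list(x) + list(y)
    lam = median(pooled)
    sigma = 1.4826 * median([abs(v - lam) for v in pooled])
    return lam, sigma

def copa(x, y, r=90):
    """Equation (1): (r-th percentile of y minus pooled median) / sigma."""
    lam, sigma = centre_and_scale(x, y)
    return (percentile(y, r) - lam) / sigma

def outlier_sum(x, y, controls_only=False):
    """OS statistic; controls_only=True uses the ORT-style outlier cutoff."""
    lam, sigma = centre_and_scale(x, y)
    ref = list(x) if controls_only else list(x) + list(y)
    q25, q75 = percentile(ref, 25), percentile(ref, 75)
    cutoff = q75 + (q75 - q25)  # q75 + IQR
    return sum((v - lam) / sigma for v in y if v > cutoff)

# Toy data: five controls and five experimental samples with one outlier.
x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 4, 20]
print(copa(x, y), outlier_sum(x, y), outlier_sum(x, y, controls_only=True))
```

The sketch does not guard against a zero scale estimate, which can occur when more than half of the pooled values are identical.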

Results and discussion

Simulated examples

Scenario 1 - identification of a single hDEG

The algorithms are compared for the detection of a single hDEG with the number of outliers varied from one to nine. The results are summarised in Table 1. For a small number of outliers, COPA, MOST and LSOSS demonstrated relatively poor performances while DOG consistently gave significant p-values.

Table 1 Scenario 1


Scenario 2 - identification of multiple hDEGs (100 genes with 50 hDEGs)

Over a critical p-value range from 0 to 0.01, DOG demonstrated the highest average cumulative Matthews correlation coefficient (cMCC, see Methods for more detail) across five sets of simulations with one to five outliers - Figure 1. Table 2 shows that DOG had very high classification rates compared with the other five algorithms. When the number of outliers exceeded two, OS, ORT and LSOSS gave more reasonable classification rates. COPA and MOST gave poor predictions overall.

Table 2 Scenario 2


Figure 2 shows the ROC curves for the one-outlier simulations. DOG had a superior ROC curve with a partial AUC value of 1. Figure 3 illustrates the same ROC curves over the complete range of false positive rates, where COPA and LSOSS remained poor. We also found that as the number of outliers increased to five, most algorithms worked well with the exception of COPA.

Further simulated examples

We look at the sensitivity of DOG with respect to changes in certain assumptions and parameters.

Variable marginal null-outlier distance

We revisit the single-hDEG simulation but vary the marginal null-outlier distance (defined in Experimental design of Methods) from 0.5 to 2 with increments of 0.1 - Table 3. DOG’s p-values increased for a reduced marginal null-outlier distance but retained the most significant mean p-values for larger marginal null-outlier distances. MOST and LSOSS failed to detect the hDEG. DOG gave accurate estimates of the outlier number when the null-outlier distance was greater than one.

Table 3 Distance effect


Non-Gaussian tight cluster

We simulated a Gaussian-mixture tight cluster ($0.5\,N(9, 1) + 0.5\,N(10, 1)$) to examine how DOG is affected by non-Gaussianity in the tight cluster. All other parameters were kept the same as those used in the single-hDEG simulation. The results were very similar to those seen previously - Table 4. In particular, the performances of COPA, OS and ORT improved for the simulated non-Gaussian tight cluster.

Table 4 Non-Gaussian tight cluster


Control samples containing outliers

DOG can be modified to enable the detection of hDEGs when control samples contain outliers (see ‘Allowing control samples to contain outliers’ in Methods). We illustrate this using the single-hDEG example with one outlier added to the control samples - Table 5. It can be seen that DOG accurately detected the outliers from both control and experimental samples. MOST and LSOSS failed to detect the hDEG.

Table 5 Control samples containing outlier


Breast cancer data

Figure 4 illustrates the ordered expressions of the top four hDEGs as detected by COPA, OS, ORT, MOST, LSOSS and DOG respectively (with annotations of rankings). The rankings of the genes were based on the order of the test statistics. The defining feature of DOG’s top four hDEGs, PEX6, TFP12, UGT2B4 and SLC4A2 (last row of Figure 4), is that they contain a few highly expressed outliers. Figure 5 shows the top 25 predictions of hDEGs using DOG for this data set. Existing literature has established these genes to be of biological relevance to the progression and treatment of breast cancer ([14-23]).

Most other algorithms chose genes with a reasonably large pool of differentially expressed experimental samples expressed at a more moderate level. LSOSS also generally favoured ordinary DEGs. MOST chose a set of top four genes with only one or two moderately expressed outliers. Table 6 shows how the top 100 predictions of these algorithms overlap - COPA and OS are most similar in their rankings whilst DOG has a maximum of 15% overlap with OS. Using the ordered log2 expressions of each algorithm’s unique top 100 genes, Figure 6 illustrates the median expressions minus the minimum expressions for each experimental sample index. The unique top 100 genes for DOG and COPA showed the largest change across their experimental samples. The difference between the two is that COPA favoured hDEGs with a larger number of outliers whilst DOG picked out hDEGs with small numbers of outliers.

Using the significance analysis approach discussed in ‘Significance analysis for real data’ in Methods, we estimated p values by resampling the replicates, which gives an alternative, p value based ranking of the genes. We found the top four predictions ranked using the p values of DOG to be near identical to those ranked using its t statistics, though there were discrepancies in the rankings of the lower ranking genes. Similar results were observed for the remaining five algorithms.

Conclusions

The difficulty in identifying hDEGs arises from the fact that only a small number of experimental samples are expressed at a much higher level than the non-outliers. As a result, various modified t tests target the subset of potential outliers which are then tested against the control group. In practice, for hDEGs with very few outliers, we found that these algorithms often identify hDEGs with insignificant deviations between the outliers and the tight cluster of non-outliers. Based on this observation, the proposed algorithm assesses each potential outlier in relation to the Gaussian tight cluster without making an explicit assumption about the outlier distribution. At each step, we update the posterior mean and variance of the tight cluster, which are then used to evaluate the probability of an outlier being a random sample of the tight cluster. Examples of simulated and breast cancer data sets verify the suitability of the proposed algorithm in identifying hDEGs with small numbers of outliers. An extension of the algorithm which fully takes into account gene correlations will be presented in future work. For the breast cancer data, we found negligible correlations across the top ranking genes and very low correlations among the less significant genes.

Table 6 Ranking accordance


Methods

The proposed algorithm can be briefly summarised as follows. We first take the list of candidate outliers to be those experimental samples whose expressions are larger than the maximum expression of the control samples. For the situation when control samples also contain outliers, see section ‘Allowing control samples to contain outliers’ for a description of the necessary extension. The samples in the candidate list are sorted in ascending order. The algorithm then updates the tight cluster of non-outliers by testing sequentially the samples in the updated candidate list of outliers. The test is terminated when a significant deviation between a candidate sample and the tight cluster is detected. We now give the steps in more statistical detail. First, let us introduce some notation. Let x denote the control samples and y the experimental samples of a gene or a probe set (we drop the gene subscript i for simplicity). The proposed DOG algorithm has the following steps:

1. Candidate outlier: Given the union of x and y, $z \equiv x \cup y$, we divide z into the candidate outlier set $z^{+} = \Uparrow \{z_{j}^{+} \in z \mid z_{j}^{+} > \max(x)\}$ and the non-outlier set $z^{-} = \{z_{j}^{-} \in z \mid z_{j}^{-} \leq \max(x)\}$, where $\Uparrow$ sorts the elements of a set in ascending order.

2. Detection: Let α be a critical tail probability and $t_{\alpha}$ the corresponding threshold [24]. The first element in $z^{+}$, $z_{1}^{+}$, is classified as the first outlier if

$$t = \frac{z_{1}^{+} - \mu}{\sigma} > t_{\alpha}$$

in which case the algorithm terminates and $z^{+}$ is the set of outliers. We use a default value of α=0.05. The parameters μ and σ² are the posterior mean and posterior variance of the tight cluster. Details of estimating μ and σ are given below.

3. Absorption: On the other hand, if $t \leq t_{\alpha}$, we move $z_{1}^{+}$ to the tight cluster of non-outliers, $z^{-} \leftarrow z^{-} \cup \{z_{1}^{+}\}$ and $z^{+} \leftarrow z^{+} \setminus \{z_{1}^{+}\}$.

4. Estimating the parameters of the tight cluster: The parameters μ and $\beta = \sigma^{-2}$ are updated using iterative Bayesian learning, i.e., by maximising the posterior probability [24]. Given $z \sim N(\mu, 1/\beta)$ with conjugate priors $\mu \sim N(\mu_{0}, \sigma_{0}^{2})$ and $\sigma^{2} = 1/\beta \sim IG(a, b)$, the log-posterior is

$$\log P(\theta \mid z^{-}, \alpha) \propto \log \mathcal{L}(z^{-} \mid \mu, \sigma^{2}) + \log IG(\sigma^{2} \mid a, b) + \log N(\mu \mid \mu_{0}, \sigma_{0}^{2}) \qquad (2)$$

where

$$\begin{array}{l} \log \mathcal{L}(z^{-} \mid \mu, \sigma^{2}) \propto \frac{n}{2} \log \beta - \sum_{z_{j} \in z^{-}} \beta (z_{j} - \mu)^{2} / 2 \\ \log IG(\sigma^{2} \mid a, b) \propto a \log b + (a + 1) \log \beta - b\beta \\ \log N(\mu \mid \mu_{0}, \sigma_{0}^{2}) \propto - \beta_{0} (\mu - \mu_{0})^{2} / 2 \end{array}$$

with $\theta = (\mu, \beta)$ and the hyperparameters $\alpha = (\mu_{0}, \sigma_{0}^{2}, a, b)$ (not to be confused with the critical tail probability α of step 2). Here n is the number of expressions in the tight cluster for the current iteration and, to simplify the notation, $\beta_{0} = \sigma_{0}^{-2}$ denotes the prior precision of μ. For simplicity, we set $\mu_{0} = \mathrm{med}(z^{-})$ and a=1, and b is set to be the maximum variance of expressions calculated gene by gene. $\beta_{0}$ is updated recursively but we set its initial value to be $\beta_{0}^{(1)} = 0.1$. The maximum a posteriori probability procedure then gives the updates

$$\mu = \frac{\beta \sum_{j} z_{j} + \beta_{0} \mu_{0}}{\beta n + \beta_{0}}; \quad \frac{1}{\beta} = \frac{\sum_{j} (z_{j} - \mu)^{2} + 2b}{n + 2a + 2}; \quad \frac{1}{\beta_{0}} = \frac{(\mu - \mu_{0})^{2}/2 + b}{a + 1}, \quad z_{j} \in z^{-}.$$

Repeat steps 3 and 4, re-applying the test of step 2 to the new first candidate, until the first outlier (with the lowest expression) is detected or until all candidate outliers have been classified as non-outliers.

5. Classification: A gene for which the set $z^{+}$ is non-empty is classified as a hDEG.

The summary statistic for a gene is taken to be the average of the outlier statistics $\sum_{j \in z^{+}} t_{j} / | z^{+} |$ . We use the average as opposed to the sum of outlier contributions as we prioritise the detection of hDEGs with few outliers.
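The steps above can be sketched in Python as follows. This is only an illustrative reading of the algorithm, not the authors' implementation. It assumes that $t_{\alpha}$ is the upper α quantile of the standard normal distribution, that the three MAP updates are iterated to a fixed point with β₀ restarted at 0.1 each time the tight cluster changes, and that a gene without outliers receives a summary statistic of zero. The function names `map_tight_cluster` and `dog` and the iteration count are hypothetical.

```python
from statistics import NormalDist, median

def map_tight_cluster(z_minus, b, a=1.0, beta0=0.1, n_iter=50):
    """Iterate the MAP updates for (mu, beta, beta0); return (mu, sigma)."""
    n = len(z_minus)
    mu0 = median(z_minus)          # mu_0 = med(z^-)
    mu, beta = mu0, 1.0            # assumed starting values
    for _ in range(n_iter):
        mu = (beta * sum(z_minus) + beta0 * mu0) / (beta * n + beta0)
        ss = sum((z - mu) ** 2 for z in z_minus)
        beta = (n + 2 * a + 2) / (ss + 2 * b)
        beta0 = (a + 1) / ((mu - mu0) ** 2 / 2 + b)
    return mu, beta ** -0.5

def dog(x, y, b, alpha=0.05):
    """Return (list of outliers, summary statistic) for one gene; b must be > 0."""
    t_alpha = NormalDist().inv_cdf(1 - alpha)   # assumed form of the threshold
    cutoff = max(x)
    z_minus = list(x) + [v for v in y if v <= cutoff]   # step 1: tight cluster
    z_plus = sorted(v for v in y if v > cutoff)         # step 1: candidates
    while z_plus:
        mu, sigma = map_tight_cluster(z_minus, b)       # step 4
        if (z_plus[0] - mu) / sigma > t_alpha:          # step 2: detection
            t = [(v - mu) / sigma for v in z_plus]
            return z_plus, sum(t) / len(t)              # average outlier statistic
        z_minus.append(z_plus.pop(0))                   # step 3: absorption
    return [], 0.0                                      # step 5: not a hDEG

x = [9.5, 10.0, 10.5, 9.8, 10.2]
print(dog(x, [9.9, 10.1, 10.3, 9.7, 15.0], b=1.0))   # one clear outlier
print(dog(x, [9.9, 10.1, 10.3, 9.7, 10.6], b=1.0))   # candidate is absorbed
```

In the paper b is the maximum gene-wise variance over the whole data set, so in practice it would be computed once across all genes and passed to every call.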

Remark 1

We allow the hyperparameter μ₀ to be evaluated directly from the dataset. We set $\beta_{0}^{(1)}$ to be 0.1; β₀ is then updated iteratively in the algorithm. We desire the tight cluster variance prior to be densely distributed around small values, thus we choose a=1 and b to be the maximum gene sample variance. In practice, we found that a large b and a small a≤1 optimise detection rates.

Remark 2

It is clear that for a finite replicate number, the differences in the mean and variance of the tight cluster at two sequential steps are bounded. Asymptotically, as the sample size increases at each iteration, these differences converge toward zero since the posterior mean and variance converge toward the sample mean and variance and the tight cluster only absorbs probable null samples. This then guarantees asymptotic algorithmic convergence. Convergence of the parameters in step 4 for each iteration follows from standard Bayesian results [25].

Cumulative Matthews correlation coefficient

We compare DOG with COPA, OS, ORT, MOST and LSOSS using the cumulative Matthews correlation coefficient (cMCC), which is the area under the Matthews correlation coefficient (MCC, [26, 27]) curve over the interval $[0, p^{*}]$:

$$\bar{\rho} = \int_{0}^{p^{*}} \rho_{p} \, dp, \qquad (3)$$

where the MCC $\rho_p$ is defined as:

$$\rho_{p} = \frac{TP_{p} \times TN_{p} - FP_{p} \times FN_{p}}{\sqrt{(TP_{p} + FP_{p}) (TP_{p} + FN_{p}) (TN_{p} + FP_{p}) (TN_{p} + FN_{p})}}$$

Here, $TP_p$, $TN_p$, $FP_p$ and $FN_p$ represent the numbers of true positives (true hDEGs), true negatives (true non-hDEGs), false positives and false negatives respectively. These four quantities are determined based on a pre-defined critical p-value, i.e. $p \in (0, p^{*}]$.
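A minimal Python sketch of the MCC and its cumulative version is given below. It assumes that a gene is called a hDEG when its p-value is at most the critical value, that the integral in (3) is approximated with the trapezoidal rule on an even grid, and that the MCC is set to zero when its denominator vanishes. None of these choices is stated in the text, and the function names are hypothetical.

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 if the denominator is zero (assumed)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def confusion(p_values, is_hdeg, p_crit):
    """Counts (TP, TN, FP, FN) when genes with p <= p_crit are called hDEGs."""
    tp = tn = fp = fn = 0
    for p, truth in zip(p_values, is_hdeg):
        called = p <= p_crit
        if called and truth:
            tp += 1
        elif called:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn

def cumulative_mcc(p_values, is_hdeg, p_star=0.01, steps=100):
    """Trapezoidal approximation of the area under the MCC curve on [0, p_star]."""
    grid = [p_star * k / steps for k in range(steps + 1)]
    rho = [mcc(*confusion(p_values, is_hdeg, p)) for p in grid]
    h = p_star / steps
    return h * (sum(rho) - 0.5 * (rho[0] + rho[-1]))

# A perfectly separated toy example: three hDEGs and three non-hDEGs.
p_vals = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
truth = [True, True, True, False, False, False]
print(cumulative_mcc(p_vals, truth))
```

With an MCC of one over the whole interval, the unnormalised area equals $p^{*}$, so values from different choices of $p^{*}$ are not directly comparable unless they are rescaled.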

Total classification accuracy

The total classification accuracy is defined as

$$\frac{TN_{p} + TP_{p}}{TN_{p} + FN_{p} + TP_{p} + FP_{p}} \qquad (4)$$

where T P_p, T N_p, F P_p and F N_p have been defined above.

Receiver operating characteristic (ROC) analysis

Receiver Operating Characteristic (ROC) [28] analysis has been used widely in outlier detection [11-13] for evaluating a classification model when varying the classification threshold, thus it is a useful tool for analysing the robustness of a classifier. As the threshold varies, the sensitivity $\left(\frac{TP_{p}}{TP_{p} + FN_{p}}\right)$ and the false positive rate $\left(1 - \frac{TN_{p}}{TN_{p} + FP_{p}}\right)$ change accordingly. The ROC curve is then generated by linking all the pairs of false positive rates and sensitivities corresponding to a set of thresholds. The ROC curve of a desirable classifier is close to the top-left corner. In particular, we limit the false positive rate to at most 5%, as rates above this correspond to critical p values that are too large to be of practical relevance. We also calculate the area under the ROC curve (AUC) for quantitative evaluation. An AUC value close to 1 indicates a good classifier. As we truncate the false positive rate at an upper limit of 5%, we scale the AUC by this limit so that the best possible partial AUC value is one.
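The truncated and rescaled AUC described above can be sketched as follows. The code assumes that each gene has a score for which larger values indicate stronger evidence of being a hDEG (for example, the negative p-value), that labels are coded as 1 for hDEGs and 0 otherwise, and that the curve is interpolated linearly at the 5% cut. The function names are hypothetical.

```python
def roc_points(scores, labels):
    """(FPR, sensitivity) pairs as the threshold moves from strict to lenient."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    pts, tp, fp = [(0.0, 0.0)], 0, 0
    for i, (s, lab) in enumerate(ranked):
        tp += lab
        fp += 1 - lab
        # Emit a point only after the last member of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != s:
            pts.append((fp / neg, tp / pos))
    return pts

def partial_auc(pts, max_fpr=0.05):
    """Area under the ROC curve for FPR <= max_fpr, divided by max_fpr."""
    area = 0.0
    for (f0, t0), (f1, t1) in zip(pts, pts[1:]):
        if f0 >= max_fpr:
            break
        if f1 > max_fpr:  # clip the final segment by linear interpolation
            t1 = t0 + (t1 - t0) * (max_fpr - f0) / (f1 - f0)
            f1 = max_fpr
        area += (f1 - f0) * (t0 + t1) / 2
    return area / max_fpr

scores = [0.9, 0.8, 0.2, 0.1]
labels = [1, 1, 0, 0]
print(partial_auc(roc_points(scores, labels)))
```

Dividing by `max_fpr` is what makes one the best attainable value, matching the scaling described in the text.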

Allowing control samples to contain outliers

In order for DOG to detect hDEGs when outliers are present in control samples, we can modify it slightly. Rather than using $z^{-} = \{z_{j}^{-} \in z \mid z_{j}^{-} \leq \max(x)\}$ in the first step of the algorithm, we can instead use the r th (default 90th) percentile of the control samples as the separation between samples belonging to the tight cluster and candidate outliers. Denoting this percentile of the control samples by ς, the selection of $z^{-}$ now follows $z^{-} = \{z_{j}^{-} \in z \mid z_{j}^{-} \leq \varsigma\}$. In practice, the r th percentile can be specified subjectively by the modeller.

Significance analysis for real data

Existing literature on algorithms such as COPA, OS and ORT typically omits statistical significance when analysing real data. Here we propose a simple method for significance analysis. We assume that control samples contain no outliers. For each algorithm, we create new control and experimental replicates of a gene under the null hypothesis by sampling with replacement from only the control expressions of that gene. This is repeated 100 times to augment the set of null control and experimental samples. The null t statistics are then calculated for all genes. The p value for each gene is then calculated as the proportion of null statistics across all genes that exceed its observed t statistic.
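One possible reading of this resampling scheme is sketched below in Python. The statistic `stat` stands for whichever algorithm is being assessed and is supplied by the caller. The function names, the use of one new control set and one new experimental set per repetition, and the fixed random seed are assumptions made for illustration.

```python
import bisect
import random

def null_statistics(controls_by_gene, n_exp, stat, n_rep=100, seed=0):
    """Null statistics from resampling each gene's controls with replacement."""
    rng = random.Random(seed)
    null = []
    for x in controls_by_gene:
        for _ in range(n_rep):
            new_x = rng.choices(x, k=len(x))   # null control replicates
            new_y = rng.choices(x, k=n_exp)    # null experimental replicates
            null.append(stat(new_x, new_y))
    return null

def p_values(observed, null):
    """Proportion of pooled null statistics exceeding each observed statistic."""
    null_sorted = sorted(null)
    n = len(null_sorted)
    return [(n - bisect.bisect_right(null_sorted, t)) / n for t in observed]

# Toy usage with a placeholder statistic (difference of maxima).
controls = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
null = null_statistics(controls, n_exp=4, stat=lambda x, y: max(y) - max(x), n_rep=10)
print(len(null), p_values([0.5, 10.0], [0, 1, 2, 3]))
```

Because the null statistics are pooled across genes, the smallest attainable non-zero p-value is one over the number of genes times the number of repetitions.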

Experimental design

We first look at two simulated scenarios for comparing the algorithms. For both scenarios, the tight cluster of control samples and non-outlier experimental samples are drawn randomly from a Gaussian distribution with a mean of ten and a standard deviation of one. Both control and experimental categories have 30 replicates. The outliers are generated by adding distances to the maximum expression of the tight cluster. The distances are called marginal null-outlier distances in that such a distance measures the gap between the tight cluster and the first outlier, which is the one closest to the tight cluster. The marginal null-outlier distances are sampled from a Gaussian distribution centered at two and with a standard deviation of 0.2. Similar to examples seen in [10], we generate 10,000 non-DEGs, which give us 10,000 null t statistics and the corresponding p-values for the hDEGs. This approach is applied to each algorithm. All simulations are repeated 100 times.

In the first scenario, we evaluate the algorithms for a single hDEG. In addition, we vary the number of outliers from one to nine. In the second scenario, we generate 50 non-DEGs and 50 hDEGs and vary the number of outliers from one to five. We also look at extensions of the single-hDEG experiment for testing DOG with regard to deviations from the model assumptions.

We then apply the algorithms to the histological breast cancer dataset (GDS3139 - [29]) which was downloaded from the gene expression omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo). It contains 22,283 genes for 14 breast cancer patients and 15 non-cancer women. The age of the non-cancer women was matched with that of the cancer patients.

For evaluation and comparison of algorithms, we use the cumulative Matthews correlation coefficient (cMCC) and the total classification accuracy (with a critical p-value threshold of 0.01). We also carry out receiver operating characteristic (ROC) analysis [28] for variable critical p-value thresholds. Details of cMCC and ROC analyses have been given above.
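The simulation of a single gene can be sketched in Python as follows. The text does not say how several outliers are placed relative to one another, so this sketch assumes that each outlier is the maximum of the tight cluster plus an independent draw of the marginal null-outlier distance, and that outliers replace experimental samples so that each category keeps 30 replicates. The function name `simulate_gene` is hypothetical.

```python
import random

def simulate_gene(n_outliers, n_rep=30, rng=random):
    """Return (controls, experimental samples) for one simulated gene."""
    x = [rng.gauss(10, 1) for _ in range(n_rep)]
    y = [rng.gauss(10, 1) for _ in range(n_rep - n_outliers)]
    top = max(x + y)   # maximum expression of the tight cluster
    # Each outlier sits a marginal null-outlier distance ~ N(2, 0.2) above it.
    y += [top + rng.gauss(2, 0.2) for _ in range(n_outliers)]
    return x, y

rng = random.Random(1)
x, y = simulate_gene(3, rng=rng)     # a hDEG with three outliers
x0, y0 = simulate_gene(0, rng=rng)   # a non-DEG
print(len(x), len(y), len(y0))
```

Calling `simulate_gene(0)` repeatedly would give the non-DEGs from which the null statistics are computed.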

References

1. Ebina M, Martínez A, Birrer M, Linnoila R: In situ detection of unexpected patterns of mutant p53 gene expression in non-small cell lung cancers. Oncogene 2001, 20: 2579-2586. 10.1038/sj.onc.1204351
2. Ezzat S, Smyth H, Ramyar L, Asa S: Heterogenous in vivo and in vitro expression of basic fibroblast growth factor by human pituitary adenomas. J Clin Endocrinol Metab 1995, 80: 878-884. 10.1210/jc.80.3.878
3. Hess G, Rose P, Gamm H, Papadileris S, Huber C, Seliger B: Molecular analysis of the erythropoietin receptor system in patients with polycythaemia vera. Br J Haematol 1994, 88: 794-802. 10.1111/j.1365-2141.1994.tb05119.x
4. Knaust E, Porwit-MacDonald A, Gruber A, Xu D, Peterson C: Heterogeneity of isolated mononuclear cells from patients with acute myeloid leukemia affects cellular accumulation and efflux of daunorubicin. Haematologica 2000, 85(2): 124-132.
5. Miyachi H, Takemura Y, Yonekura S, Komatsuda M, Nagao T, Arimori S, Ando Y, et al.: MDR1 (multidrug resistance) gene expression in adult acute leukemia: correlations with blast phenotype. Int J Hematol 1993, 57: 31-37.
6. Nakayama T, Watanabe M, Suzuki H, Toyota M, Sekita N, Hirokawa Y, Mizokami A, Ito H, Yatani R, Shiraishi T: Epigenetic regulation of androgen receptor gene expression in human prostate cancers. Lab Invest 2000, 80: 1789-1796. 10.1038/labinvest.3780190
7. Suzuki M, Hurd Y, Sokoloff P, Schwartz J, Sedvall G: D3 dopamine receptor mRNA is widely expressed in the human brain. Brain Res 1998, 779: 58-74. 10.1016/S0006-8993(97)01078-0
8. Wani G, Wani A, MD’Ambrosio S, et al.: Cell type-specific expression of the O6-alkylguanine-DNA alkyltransferase gene in normal human liver tissues as revealed by in situ hybridization. Carcinogenesis 1993, 14: 737-741. 10.1093/carcin/14.4.737
9. Tomlins S, Rhodes D, Perner S, Dhanasekaran S, Mehra R, Sun X, Varambally S, Cao X, Tchinda J, Kuefer R, et al.: Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 2005, 310: 644-648. 10.1126/science.1117679
10. Tibshirani R, Hastie T: Outlier sums for differential gene expression analysis. Biostatistics 2007, 8: 2-8. 10.1093/biostatistics/kxl005
11. Wu B: Cancer outlier differential gene expression detection. Biostatistics 2007, 8: 566-575.
12. Lian H: MOST: detecting cancer differential gene expression. Biostatistics 2008, 9: 411-418.
13. Wang Y, Rekaya R: LSOSS: detection of cancer outlier differential gene expression. Biomarker Insights 2010, 5: 69-78.
14. Boverhof D, Burgoon L, Williams K, Zacharewski T: Inhibition of estrogen-mediated uterine gene expression responses by dioxin. Mol Pharmacol 2008, 73: 82-93.
15. Cattaneo M, Lotti L, Martino S, Cardano M, Orlandi R, Mariani-Costantini R, Biunno I: Functional characterization of two secreted SEL1L isoforms capable of exporting unassembled substrate. J Biol Chem 2009, 284: 11405-11415.
16. Hensen E, De Herdt M, Goeman J, Oosting J, Smit V, Cornelisse C, De Jong R: Gene-expression of metastasized versus non-metastasized primary head and neck squamous cell carcinomas: a pathway-based analysis. BMC Cancer 2008, 8: 168. 10.1186/1471-2407-8-168
17. Hoque M, Kim M, Ostrow K, Liu J, Wisman G, Park H, Poeta M, Jeronimo C, Henrique R, Lendvai Á, et al.: Genome-wide promoter analysis uncovers portions of the cancer methylome. Cancer Res 2008, 68: 2661-2670. 10.1158/0008-5472.CAN-07-5913
18. Iwao-Koizumi K, Matoba R, Ueno N, Kim S, Ando A, Miyoshi Y, Maeda E, Noguchi S, Kato K: Prediction of docetaxel response in human breast cancer by gene expression profiling. J Clin Oncol 2005, 23: 422-431.
19. Missiaglia E, Blaveri E, Terris B, Wang Y, Costello E, Neoptolemos J, Crnogorac-Jurcevic T, Lemoine N: Analysis of gene expression in cancer cell lines identifies candidate markers for pancreatic tumorigenesis and metastasis. Int J Cancer 2004, 112: 100-112. 10.1002/ijc.20376
20. Smeets A, Daemen A, Vanden Bempt I, Gevaert O, Claes B, Wildiers H, Drijkoningen R, Van Hummelen P, Lambrechts D, De Moor B, et al.: Prediction of lymph node involvement in breast cancer from primary tumor tissue using gene expression profiling and miRNAs. Breast Cancer Res Treat 2011, 129: 767-776. 10.1007/s10549-010-1265-5
21. Smid M, Wang Y, Klijn J, Sieuwerts A, Zhang Y, Atkins D, Martens J, Foekens J: Genes associated with breast cancer metastatic to bone. J Clin Oncol 2006, 24: 2261-2267. 10.1200/JCO.2005.03.8802
22. Sun P, Gao L, Han S: Prediction of human disease-related gene clusters by clustering analysis. Int J Biol Sci 2011, 7: 61-73.
23. Sun C, Huo D, Southard C, Nemesure B, Hennis A, Cristina Leske M, Wu S, Witonsky D, Di Rienzo A, Olopade O: A signature of balancing selection in the region upstream to the human UGT2B4 gene and implications for breast cancer risk. Human Genet 2011, 130: 767-75. 10.1007/s00439-011-1025-6
24. Bernardo J, Smith A, Berliner M: Bayesian Theory. New York: Wiley; 1994.
25. Bishop C: Pattern Recognition and Machine Learning. New York: Springer; 2006.
26. Matthews B, et al.: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta 1975, 405: 442-451. 10.1016/0005-2795(75)90109-9
27. Baldi P, Brunak S, Chauvin Y, Andersen C, Nielsen H: Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 2000, 16: 412-424. 10.1093/bioinformatics/16.5.412
28. McNeil H, Barbara J: The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve. Radiology 1982, 143: 29-36.
29. Tripathi A, King C, de la Morenas A, Perry V, Burke B, Antoine G, Hirsch E, Kavanah M, Mendez J, Stone M, et al.: Gene expression abnormalities in histologically normal breast epithelium of breast cancer patients. Int J Cancer 2008, 122: 1557-1566.


Author information

Authors and Affiliations

Wolfson Institute for Preventive Medicine, Queen Mary University of London, Charterhouse Square, London, EC1M 6BQ, UK
Zihua Yang
College of Life and Environmental Sciences, Exeter University, Stocker Road, Exeter, EX4 4QD, UK
Zhengrong Yang


Corresponding author

Correspondence to Zihua Yang.

Additional information

Competing interests

Both authors declare that they have no competing interests.

Authors’ contributions

ZRY and ZHY designed the algorithm. ZRY implemented the algorithm. ZHY analysed the algorithm on the conceived simulated examples. ZRY acquired the dataset from GEO and analysed the algorithm on the real dataset. ZRY and ZHY wrote the paper. Both authors read and approved the final manuscript.


Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Yang, Z., Yang, Z. Prediction of heterogeneous differential genes by detecting outliers to a Gaussian tight cluster. BMC Bioinformatics 14, 81 (2013). https://doi.org/10.1186/1471-2105-14-81


Received: 12 April 2012
Accepted: 14 February 2013
Published: 05 March 2013
DOI: https://doi.org/10.1186/1471-2105-14-81
