Abstract
Background
Heterogeneously and differentially expressed genes (hDEG) are a common phenomenon due to biological diversity. A hDEG is often observed in gene expression experiments (with two experimental conditions) where it is highly expressed in a few experimental samples, or in drug trial experiments for cancer studies with drug resistance heterogeneity among the disease group. These highly expressed samples are called outliers. Accurate detection of outliers among hDEGs is then desirable for dis ease diagnosis and effective drug design. The standard approach for detecting hDEGs is to choose the appropriate subset of outliers to represent the experimental group. However, existing methods typically overlook hDEGs with very few outliers.
Results
We present in this paper a simple algorithm for detecting hDEGs by sequentially testing for potential outliers with respect to a tight cluster of non outliers, among an ordered subset of the experimental samples. This avoids making any restrictive assumptions about how the outliers are distributed. We use simulated and real data to illustrate that the proposed algorithm achieves a good separation between the tight cluster of low expressions and the outliers for hDEGs.
Conclusions
The proposed algorithm assesses each potential outlier in relation to the cluster of potential outliers without making explicit assumptions about the outlier distribution. Simulated examples and and breast cancer data sets are used to illustrate the suitability of the proposed algorithm for identifying hDEGs with small numbers of outliers.
Keywords:
Cancer; Outlier; Differentially expressed genes; MicroarrayBackground
A heterogeneously and differentially expressed gene (hDEG) is a gene which has an inconsistent expression pattern across its experimental samples. Typically, a large proportion of the experimental samples and the control samples form a tight cluster in low expressions. The remaining small proportion of experimental samples, namely the outliers, are observed to significantly deviate from the tight cluster towards high expressions. We use the word ‘tight’ to describe the cluster of null (or low) expressions of a hDEG as the null variance is typically small compared to the nulloutlier distance. In situations where the few highly expressed outliers of a nondifferential gene are caused by measurement error, it is also useful to distinguish such genes with hDEG characteristics. The existence of hDEGs has been established in various experiments ( [18]). Suppose we have the expressions of m genes. The standard t statistic underestimates the significance in testing the difference across the control and experimental samples of a hDEG. COPA (cancer profile outlier analysis) [9] proposed modifying the Student t statistic to be a ratio of the distance between the rth (default 9th) percentile of experimental samples and the median of all samples over the median absolute distance (deviated from the whole sample median), i.e.,
where , x_{i} and y_{i} represent control samples and experimental samples of the ith gene respectively, q_{r}(y_{i}) is the rth percentile of y_{i} and λ_{i} is the median of both x_{i} and y_{i}. The quantilemedian difference in (1) summarises the nulloutlier distance using a single value of y_{i}. To make outlier detection more efficient, the outliersum (OS) statistic [10] sums over outliers, where the outliers are defined as . Outlier robust t statistic (ORT) uses the same statistic but defines the outliers in relation to the control samples only [11]. Maximum ordered subset t statistic (MOST) defines the outliers to be the top k experimental samples and chooses k by optimising a normalised t statistic [12]. The least sum of ordered subset square t statistic (LSOSS) [13] also compares the controls with a subset of the top k experimental samples, where is the mean of control samples, is the mean of top k experimental samples and S_{i} is the pooled standard deviation of the set of control samples plus nonoutlier experimental samples and the set of outlier experimental samples. k is optimised iteratively to minimise the withincluster variance. We propose a new algorithm for detecting hDEGs with a small number of outliers by detecting outliers via gap (DOG) maximisation. What makes this approach different from the existing methods is that we assess each potential outlier in relation to a tight cluster of nonoutliers. This avoids modelling the highly expressed outliers explicitly. This is especially important when the number of outliers is small. The proposed algorithm classifies each gene as a hDEG or nonhDEG by locating potential outliers and summarises it using the average of the standardised outlier expressions. We will use simulated examples and a breast cancer dataset to illustrate the effectiveness of the proposed algorithm in detecting hDEGs with few outliers. We will also show how effective test algorithms are when varying conditions.
Results and discussion
Simulated examples
Scenario 1  identification of a single hDEG
The algorithms are compared for the detection of a single hDEG with the number of outliers varied from one to nine. The results are summarised in Table 1. For a small number of outliers, COPA, MOST and LSOSS demonstrated relatively poor performances while DOG consistently gave significant pvalues.
Table 1. Scenario 1
Scenario 2  identification of multiple hDEGs (100 genes with 50 hDEGs)
Over a critical pvalue range from 0 to 0.01, DOG demonstrated the highest average cumulative Matthews correlation coefficient (cMCC, see Methods for more detail) across five sets of simulations with one to five outliers  Figure 1. Table 2 shows that DOG had very high classification rates compared with the other five algorithms. When the number of outliers exceeded two, OS, ORT and LSOSS gave more reasonable classification rates. COPA and MOST gave poor predictions overall.
Figure 1. cMCC. Scenario 2: average cMCC of the six algorithms over (0, 0.01) for 15 numbers of outliers.
Table 2. Scenario 2
Figure 2 shows the ROC curves for the oneoutlier simulations, it can be seen that DOG had a superior ROC curve with an partial AUC value of 1. Figure 3 illustrates the same ROC curves oover the complete range of false positive rate, COPA and LSOSS remained poor. We also found that as the number of outliers increased to five, most algorithms worked well with the exception of COPA.
Further simulated examples
We look at the sensitivitiy of DOG with respect to changes in certain assumptions and parameters.
Figure 2. ROC  one outlier. Scenario 2: ROC curves of the six algorithms in detecting single outlierhDEGs (in close up for low false positive rates).
Figure 3. ROC  one outlier. Scenario 2: Full ROC curves of the six algorithms in detecting single outlier hDEGs.
Variable marginal nulloutlier distance
We revisit the singlehDEG simulation but vary the marginal nulloutlier distance (defined in Experimental design of Methods) from 0.5 to 2 with increments of 0.1  Table 3. DOG’s pvalues increased for a reduced marginal nulloutlier distance but retained the most significant mean pvalues for larger marginal nulloutlier distances. MOST and LSOSS failed to detect the hDEG. DOG gave accurate estimates of the outlier number when the nulloutlier distance was greater than one.
Table 3. Distance effect
NonGaussian tight cluster
We simulated a Gaussianmixture tight cluster (0.5 ) to examine how DOG is affected by nonGaussianity in the tight cluster. All other parameters were kept the same as those used in the singlehDEG simulation. The results were very similar to those seen previously  Table 4. In particular, the performances of COPA, OS and ORT have improved for the simulated nonGaussian tight cluster.
Table 4. NonGaussian tight cluster
Control samples containing outliers
DOG can be modified to enable the detection of hDEGs when control samples contain outliers (see ‘’Allowing control samples to contain outliers of Methods. We illustrate this using the singlehDEG example with one outlier added to the control samples  Table 5. It can be seen that DOG accurately detected the outliers from both control and experimental samples. MOST and LSOSS failed to detect the hDEG.
Table 5. Control samples containing outlier
Breast cancer data
Figure 4 illustrates the ordered expressions of the top four hDEGs as detected by the COPA, OS, ORT, MOST, LSOSS and DOG respectively (with annotations of rankings). The rankings of the genes were based on the order of the test statistics. The defining feature of DOG’s top four hDEGs, PEX6, TFP12, UGT2B4 and SLC4A2 (last row of Figure 4), is that they contain a few highly expressed outliers. Figure 5 shows the top 25 predictions of hDEGs using DOG for this data set. Existing literature have established these genes to be of biological relevance to the progression and treatment of breast cancer ( [1423]).
Figure 4. COPA, OS, ORT, MOST, LSOSS, DOG. Breast cancer data: log2 expressions of the top four hDEGs detected using COPA, OS, ORT, MOST, LSOSS, DOG. The vertical line indicates the separation of expressions in the tight cluster (left) and outliers (right).
Figure 5. DOG. Breast cancer data: log2 expressions of the top 25 hDEGs detected using DOG.
Most other algorithms chose genes with a reasonably large pool of differentially expressed experimental samples expressed at a more moderate level. LSOSS also generally favoured ordinary DEGs. MOST chose a set of top four genes with only one or two moderately expressed outliers. Table 6 shows how the top 100 predictions of these algorithms overlap  COPA and OS are most similar in their rankings whilst DOG has a maximum of 15% overlap with OS. Using the ordered log2 expressions of each algorithm’s unique top 100 genes, Figure 6 illustrates the median expressions minus the minimum expressions for each experimental sample index. The unique top 100 genes for DOG and COPA showed the largest change across their experimental samples, their difference being that COPA favoured hDEGs with a larger number of outliers whilst DOG picked out hDEGs with small numbers of outliers.
Using the significance analysis approach discussed in ‘’Significance analysis for real data of Methods, we estimated p values from sampling the replicates which then give us alternative p values based rankings of the genes. We also found the top four predictions ranked using the p values of DOG to be near identical to those ranked using its t statistics, though there were discrepancies in rankings for the lower ranking genes. Similar results were observed for the remainingfive algorithms.
Conclusions
The difficulty in identifying hDEGs arises from the fact that only a small number of experimental samples are highly expressed at a much higher level than the nonoutliers. As a result, various modified t tests target the subset of potential outliers which are then tested against the control group. In practice, for hDEGs with very few outliers, we found that these algorithms often identify hDEGs with insignificant deviations between the outliers and the tight cluster of nonoutliers. Based on this observation, the proposed algorithm assesses each potential outlier in relation to the Gaussian tight cluster without making an explicit assumption about the outlier distribution. At each step, we update the posterior mean and variance of the tight cluster which are then used to evaluate the probability of an outlier being a random sample of the tight cluster. Examples of simulated and breast cancer data sets verify the suitability of the proposed algorithm in identifying hDEGs with small numbers of outliers. An extension of the algorithm which fully takes into account gene correlations will be presented in future work. For the breast cancer data, we found negligible correlations across the top ranking genes and very low correlations among the less significant genes.
Methods
The proposed algorithm can be briefly summarised as follows. We first take the list of candidate outliers to be those experimental samples whose expressions are larger than the maximum expression of control samples. For the situation when control samples also contain outliers, see section ‘’Allowing control samples to contain outliers for a description of the necessary extension. The samples in the candidate list are sorted in an ascending order. The algorithm then updates the tight cluster of nonoutliers by testing sequentially the samples in the updated candidate list of outliers. The test is terminated when a significant deviation between a candidate sample and the tight cluster is detected. We now give the steps in more statistical detail. First, let us introduce some notation. Let x denote the control samples and y the experimental samples of a gene or a probe set (we drop the gene subscript i for simplicity). The proposed DOG algorithm has the following steps:
1. Candidate outlier: Given the union of x and y, z≡x∪y, we divide z into the candidate outlier set z^{+}=⇑{zj+∈zzj+> max(x)} and the nonoutlier set where ⇑ sorts the elements of a set in an ascending order.
2. Detection: Given a critical tail probability α and the corresponding threshold t_{α}[24]. The first element in z^{+}, , is classified as the first outlier if
in which case the algorithm terminates and z^{+} is the set of outliers. We use a default value of α=0.05. The parameters μ and σ^{2} are posterior mean and posterior variance derived of the tight cluster. Details of estimating μ and σ are given below.
3. Absorption: On the other hand if t≤t_{α}, we move z1+ to the tight cluster of nonoutliers, z^{−}←z^{−}∪z1+ and z^{+}←z^{+}∖z1+.
4. Estimating the parameters of the tight cluster: The parameters μ and β=σ^{−2} are updated using iterative Bayesian learning, i.e., by maximising the posterior probability [24]. Given with conjugate priors and σ^{2}=1/β∼IG(a,b), the logposterior is
where
and θ=(μ,β) and . Suppose n is the number of expressions in the tight cluster for the current iteration. For simplicity, we set μ_{0}=med(z^{−}), a=1, b is set to be the maximum variance of expressions calculated gene by gene. To simplify the notation, we let . β_{0} is updated recursively but we set its initial value to be . The maximum a posteriori probability procedure then gives the updates
Repeat 3 and 4 until the first outlier (with the lowest expression) is detected or until all candidate outliers have been classified as nonoutliers.
5. Classification: A gene for which the set z^{+} is nonempty is classified as a hDEG.
The summary statistic for a gene is taken to be the average of the outlier statistics . We use the average as opposed to the sum of outlier contributions as we prioritise the detection of hDEGs with few outliers.
Remark 1
We allow the hyperparameters μ_{0} to be evaluated directly from the dataset. We set to be 0.1, β_{0} is then updated iteratively in the algorithm. We desire the tight cluster variance prior to be densely distributed around the small values, thus we choose a=1 and b to be the maximum gene sample variance. In practice, we found that a large b and a small a≤1 optimise detection rates.
Remark 2
It is clear that for a finite replicate number, the difference in mean and variance of the tight cluster at two sequential steps are bounded. Asymptotically, as the sample size increases at each iteration, these differences converge toward zero since the posterior mean and variance converge toward the sample mean and variance and the tight cluster only absorbs probable null samples. This then guarantees asymptotic algorithmic convergence. Convergence of parameters in step 4 for each iteration follow from standard Bayesian results [25].
Cumulative Matthews correlation coefficient
We compare COPA, OS, ORT, MOST and LSOSS using the cumulative Matthews correlation coefficient (cMCC) which is the area under Matthews correlation coefficient (MCC, [26,27]) in the interval :
the MCC ρ_{p} is defined as:
Here, TP_{p}, TN_{p}, FP_{p} and FN_{p} represent the numbers of true positives (true hDEGs), true negatives (true nonhDEGs), false positives and false negatives respectively. These four quantities are determined based on a predefined critical pvalue, i.e. p∈(0,p^{⋆}].
Total classification accuracy
The total classification accuracy is defined as
where TP_{p}, TN_{p}, FP_{p} and FN_{p} have been defined above.
Receiver operating characteristic (ROC) analysis
Receiver Operating Characteristic (ROC) [28] analysis has been used widely in outlier detection [1113] for evaluating a classification model when varying the classification threshold, thus it is a useful tool for analysing the robustness of a classifier. As the threshold varies, the sensitivity and the false positive rate change accordingly. The ROC curve is then generated by linking all the pairs of false positive rates and sensitivities corresponding to a set of thresholds. The ROC curve of a desirable classifier is close to the topleft corner. In particular, we limit the false positive rate to less and equal to 5% as rates above this correspond to critical p values that are too large to be of practical relevance. We also calculate the area under a ROC curve (AUC) for quantitative evaluation. A large AUC value of close to 1 indicates a good classifier. As we truncate the false positive rate at an upper limit of 5%, we scale the AUC by this limit so that the best possible partial AUC value is one.
Allowing control samples to contain outliers
In order for DOG to detect hDEGs when outliers are present in control samples, we can modify it slightly. Rather than using in the first step of the algorithm, we can use instead the r^{th} (default is 90^{th}) percentile of the control samples as the separation between samples belonging to the tight cluster and candidate outliers. Suppose the 90^{th} percentile of the control samples is denoted by ς, the selection of zj− now follows . In practice, the rth percentile can be specified subjectively by the modeller.
Significance analysis for real data
Existing literature on algorithms such as COPA, OS and ORT typically omits statistical significance when analysing real data. Here we propose a simple method for significance analysis. We assume that control samples contain no outliers. For each algorithm, we create new control and experimental replicates of a gene under the null hypothesis by sampling with replacement from only the control expressions of that gene. This is repeated 100 times to augment the set of null control and experimental samples. The null t statistics are then calculated for all genes. The p value for each gene is then calculated as the proportion of null statistics across all genes that exceed its observed t statistic.
Experimental design
We first look at two simulated scenarios for comparing the algorithms. For both scenarios, the tight cluster of control samples and nonoutlier experimental samples are drawn randomly from a Gaussian distribution with a mean of ten and a standard deviation of one. Both control and experimental categories have 30 replicates. The outliers are generated by adding distances to the maximum expression of the tight cluster. The distances are called marginal nulloutlier distances in that such a distance measures the gap between the tight cluster and the first outlier which is closest to the tight cluster. The marginal oulloutlier distances are sampled from a Gaussian distribution centered at two and with a standard deviation 0.2. Similar to examples seen in [10], we generate 10,000 nonDEGs which gives us 10,000 null t statistics and corresponding pvalues for the hDEGs. This approach is applied to each algorithm. All simulations are repeated 100 times. In the first scenario, we evaluate the algorithms for a single hDEG. In addition, we vary the number of outliers from one to nine. In the second scenario, we generate 50 nonDEGs and 50 hDEGs and vary the number of outliers from one to five. We also look at extensions of the singlehDEG experiment for testing DOG with regard to deviations from the model assumptions. We then apply the algorithms to the histological breast cancer dataset (GDS3139  [29]) which was downloaded from the gene expression omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo webcite). It contains 22,283 genes for 14 breast cancer patients and 15 noncancer women. The age of noncancer women was matched with that of cancer patients. For evaluation and comparison of algorithms, we use the cumulative Matthews correlation coefficient (cMCC) and the total classification accuracy (with a critical pvalue threshold of 0.01). We also carry out receiver operating characteristic (ROC) analysis [28] for variable critical pvalue thresholds. Details of cMCC and ROC analyses have been given above.
Competing interests
Both authors declare that they have no competing interests.
Authors’ contributions
ZRY and ZHY designed the algorithm. ZRY implemented the algorithm. ZHY analysed the algorithm on the conceived simulated examples. ZRY acquired the dataset from GEO and analysed the algorithm on the real dataset. ZRY and ZHY wrote the paper. Both authors read and approved the final manuscript.
References

Ebina M, Martínez A, Birrer M, Linnoila R: In situ detection of unexpected patterns of mutant p53 gene expression in nonsmall cell lung cancers.
Oncogene 2001, 20:25792586. PubMed Abstract  Publisher Full Text

Ezzat S, Smyth H, Ramyar L, Asa S: Heterogenous in vivo and in vitro expression of basic fibroblast growth factor by human pituitary adenomas.
J Clin Endocrinol Metab 1995, 80:878884. PubMed Abstract  Publisher Full Text

Hess G, Rose P, Gamm H, Papadileris S, Huber C, Seliger B: Molecular analysis of the erythropoietin receptor system in patients with polycythaemia vera.
Br J Haematol 1994, 88:794802. PubMed Abstract  Publisher Full Text

Knaust E, PorwitMacDonald A, Gruber A, Xu D, Peterson C: Heterogeneity of isolated mononuclear cells from patients with acute myeloid leukemia affects cellular accumulation and efflux of daunorubicin.
Haematologica 2000, 85(2):124132. PubMed Abstract  Publisher Full Text

Miyachi H, Takemura Y, Yonekura S, Komatsuda M, Nagao T, Arimori S, Ando Y, et al.: MDR1 (multidrug resistance) gene expression in adult acute leukemia: correlations with blast phenotype.
Int J Hematol 1993, 57:3137. PubMed Abstract

Nakayama T, Watanabe M, Suzuki H, Toyota M, Sekita N, Hirokawa Y, Mizokami A, Ito H, Yatani R, Shiraishi T: Epigenetic regulation of androgen receptor gene expression in human prostate cancers.
Lab Invest 2000, 80:17891796. PubMed Abstract  Publisher Full Text

Suzuki M, Hurd Y, Sokoloff P, Schwartz J, Sedvall G: D3 dopamine receptor mRNA is widely expressed in the human brain.
Brain Res 1998, 779:5874. PubMed Abstract  Publisher Full Text

Wani G, Wani A, MD’Ambrosio S, et al.: Cell typespecific expression of the O6alkylguanineDNA alkyltransferase gene in normal human liver tissues as revealed by in situ hybridization.
Carcinogenesis 1993, 14:737741. PubMed Abstract  Publisher Full Text

Tomlins S, Rhodes D, Perner S, Dhanasekaran S, Mehra R, Sun X, Varambally S, Cao X, Tchinda J, Kuefer R, et al.: Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer.
Science 2005, 310:644648. PubMed Abstract  Publisher Full Text

Tibshirani R, Hastie T: Outlier sums for differential gene expression analysis.
Biostatistics 2007, 8:28. PubMed Abstract  Publisher Full Text

Wu B: Cancer outlier differential gene expression detection.
Biostatistics 2007, 8:566575. PubMed Abstract  Publisher Full Text

Lian H: MOST: detecting cancer differential gene expression.
Biostatistics 2008, 9:411418. PubMed Abstract  Publisher Full Text

Wang Y, Rekaya R: LSOSS: detection of cancer outlier differential gene expression.
Biomarker Insights 2010, 5:6978. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Boverhof D, Burgoon L, Williams K, Zacharewski T: Inhibition of estrogenmediated uterine gene expression responses by dioxin.
Mol Pharmacol 2008, 73:8293. PubMed Abstract  Publisher Full Text

Cattaneo M, Lotti L, Martino S, Cardano M, Orlandi R, MarianiCostantini R, Biunno I: Functional characterization of two secreted SEL1L isoforms capable of exporting unassembled substrate.
J Biol Chem 2009, 284:1140511415. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hensen E, De Herdt M, Goeman J, Oosting J, Smit V, Cornelisse C, De Jong R: Geneexpression of metastasized versus nonmetastasized primary head and neck squamous cell carcinomas: a pathwaybased analysis.
BMC Cancer 2008, 8:168. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Hoque M, Kim M, Ostrow K, Liu J, Wisman G, Park H, Poeta M, Jeronimo C, Henrique R, Lendvai Á, et al.: Genomewide promoter analysis uncovers portions of the cancer methylome.
Cancer Res 2008, 68:26612670. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

IwaoKoizumi K, Matoba R, Ueno N, Kim S, Ando A, Miyoshi Y, Maeda E, Noguchi S, Kato K: Prediction of docetaxel response in human breast cancer by gene expression profiling.
J Clin Oncol 2005, 23:422431. PubMed Abstract  Publisher Full Text

Missiaglia E, Blaveri E, Terris B, Wang Y, Costello E, Neoptolemos J, CrnogoracJurcevic T, Lemoine N: Analysis of gene expression in cancer cell lines identifies candidate markers for pancreatic tumorigenesis and metastasis.
Int J Cancer 2004, 112:100112. PubMed Abstract  Publisher Full Text

Smeets A, Daemen A, Vanden Bempt I, Gevaert O, Claes B, Wildiers H, Drijkoningen R, Van Hummelen P, Lambrechts D, De Moor B, et al.: Prediction of lymph node involvement in breast cancer from primary tumor tissue using gene expression profiling and miRNAs.
Breast Cancer Res Treat 2011, 129:767776. PubMed Abstract  Publisher Full Text

Smid M, Wang Y, Klijn J, Sieuwerts A, Zhang Y, Atkins D, Martens J, Foekens J: Genes associated with breast cancer metastatic to bone.
J Clin Oncol 2006, 24:22612267. PubMed Abstract  Publisher Full Text

Sun P, Gao L, Han S: Prediction of human diseaserelated gene clusters by clustering analysis.
Int J Biol Sci 2011, 7:6173. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sun C, Huo D, Southard C, Nemesure B, Hennis A, Cristina Leske M, Wu S, Witonsky D, Di Rienzo A, Olopade O: A signature of balancing selection in the region upstream to the human UGT2B4 gene and implications for breast cancer risk.
Human Genet 2011, 130:76775. Publisher Full Text

Bernardo J, Smith A, Berliner M: Bayesian Theory. New York: Wiley; 1994.

Bishop C: Pattern Recognition and Machine Learning. New York: Springer; 2006.

Matthews B, et al.: Comparison of the predicted and observed secondary structure of T4 phage lysozyme.
Biochimica et Biophysica Acta 1975, 405:442451. PubMed Abstract  Publisher Full Text

Baldi P, Brunak S, Chauvin Y, Andersen C, Nielsen H: Assessing the accuracy of prediction algorithms for classification: an overview.
Bioinformatics 2000, 16:412424. PubMed Abstract  Publisher Full Text

McNeil H, Barbara J: The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve.
Radiology 1982, 143:2936. PubMed Abstract  Publisher Full Text

Tripathi A, King C, de la Morenas A, Perry V, Burke B, Antoine G, Hirsch E, Kavanah M, Mendez J, Stone M, et al.: Gene expression abnormalities in histologically normal breast epithelium of breast cancer patients.
Int J Cancer 2008, 122:15571566. PubMed Abstract  Publisher Full Text