Natural and Exact Sciences Department, Higher School of Health Technology of Lisbon of Polytechnic Institute of Lisbon and Center of Statistics and Applications of University of Lisbon, Lisbon, Portugal

Department of Statistics and Operational Research, Faculty of Sciences of University of Lisbon, and Center of Statistics and Applications of University of Lisbon, Lisbon, Portugal

Abstract

Background

A common task in analyzing microarray data is to determine which genes are differentially expressed across two (or more) kinds of tissue samples or samples subjected to different experimental conditions. Several statistical methods have been proposed to accomplish this goal, generally based on measures of distance between classes. It is well known that biological samples are heterogeneous because of factors such as molecular subtypes or genetic background that are often unknown to the experimenter. For instance, in experiments which involve molecular classification of tumors it is important to identify significant subtypes of cancer. Bimodal or multimodal distributions often reflect the presence of mixtures of subsamples. Consequently, genes that are differentially expressed in sample subgroups can be missed by the usual statistical approaches. In this paper we propose a new graphical tool which identifies not only up- and down-regulated genes, but also genes with differential expression in different subclasses, which current statistical methods usually miss. This tool is based on two measures of distance between samples, namely the overlapping coefficient (OVL) between two densities and the area under the receiver operating characteristic (ROC) curve. The proposed methodology was implemented in the open-source R software.

Results

This method was applied to a publicly available dataset, as well as to a simulated dataset. We compared our results with the ones obtained using some of the standard methods for detecting differentially expressed genes, namely Welch t-statistic, fold change (FC), rank products (RP), average difference (AD), weighted average difference (WAD), moderated t-statistic (modT), intensity-based moderated t-statistic (ibmT), significance analysis of microarrays (samT) and area under the ROC curve (AUC). On both datasets, the differentially expressed genes with bimodal or multimodal distributions were missed by the standard selection procedures. We also compared our results with (i) the area between the ROC curve and the rising diagonal (ABCR) and (ii) the test for not proper ROC curves (TNRC). We found our methodology more comprehensive, because it detects both bimodal and multimodal distributions and allows different variances in the two samples. Another advantage of our method is that the behavior of different kinds of differentially expressed genes can be analyzed graphically.

Conclusion

Our results indicate that the

Background

Genome-wide expression analysis is an increasingly important tool for identifying gene function, disease-related genes and transcriptional patterns related to drug treatments. Microarrays enable the simultaneous measurement of the expression levels of tens of thousands of genes and have found widespread application in biological and biomedical research. Increasing numbers of multi-class microarray studies are performed, but the vast majority continues to be two class (binary) studies, for example when both control and a treatment are examined

Genes with a bimodal or a multimodal distribution within a class (considering a binary study) may indicate the presence of unknown subclasses with different expression values

The particular application that motivated our work concerns the development of a methodology that can simultaneously identify up- and down-regulated genes and differentially expressed genes with bimodal or multimodal distributions and similar means in both groups. For convenience, the latter are referred to as special genes.

Different statistical tests have been proposed to select differentially expressed genes

A ROC curve displays the relationship between the proportion of true positive (sensitivity) and false positive (1-specificity) classifications resulting from each possible decision threshold value in a two class classification task

**Relationship between densities and ROC curves considering equal variances on both groups.** Probability density functions of gene expression values of two groups and their corresponding empirical ROC curves (panels A to E).

Genes can be ranked using the area under the ROC curve (AUC)

Nevertheless, different scenarios can lead to NPROC curves, for instance, when the means of the two groups are similar and one of the groups has a bimodal distribution (Figure

**Relationship between densities and ROC curves, considering different variances and similar means on both groups.** Probability density functions of gene expression values of two groups and their corresponding empirical ROC curves (panels A and B).

Proper binormal model

Since it is not possible to decide beforehand the direction of the classification rule, we considered the same classification rule for all of the genes, i.e., values of expression levels above the threshold correspond to up-regulation. In that sense, AUC values near 1 will correspond to up-regulated genes, AUC values near 0 will correspond to down-regulated genes, and special genes (Figure

We used the overlapping coefficient (OVL) to further separate these different situations which produce values of AUC near 0.5. Bradley

We propose using AUC and OVL simultaneously to select different types of differentially expressed genes. Plotting OVL against AUC gives a picture which we named the arrow plot.

If we consider that groups have different variances, special genes can be mixed with genes which are not differentially expressed as illustrated on Figure

Nonparametric techniques are used to estimate AUC and OVL. To estimate AUC, we used the Mann-Whitney U statistic
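As a minimal sketch of the Mann-Whitney estimate of AUC, the Python function below counts, over all case/control pairs, how often the case value exceeds the control value, with ties counted as one half. The R implementation used in this work is not reproduced here; the function name `auc_mann_whitney` and the toy expression values are illustrative assumptions.

```python
def auc_mann_whitney(control, case):
    """Estimate AUC as P(case > control) + 0.5 * P(case == control)."""
    wins = 0.0
    for y in case:
        for x in control:
            if y > x:
                wins += 1.0      # case value ranks above control value
            elif y == x:
                wins += 0.5      # ties contribute one half
    return wins / (len(case) * len(control))


# Hypothetical expression values for one gene.
control = [1.0, 2.0, 3.0, 4.0]
case = [3.5, 4.5, 5.0, 6.0]

print(auc_mann_whitney(control, case))  # 0.9375
print(auc_mann_whitney(case, control))  # 0.0625
```

Under the common classification rule (values above the threshold are called positive), estimates near 1 point to up-regulation and estimates near 0 to down-regulation.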

We first describe the algorithm and then evaluate the performance of our method by comparing the gene expression profiles in two different classes using data from a publicly available dataset

All analyses were performed using the open-source R software.

Results and Discussion

Algorithm description

For illustrative purposes, we divided the algorithm into two parts (algorithm 1 and algorithm 2). The first part describes the OVL estimation (Figure

**Algorithm 1.** Pseudo code to estimate OVL based on kernel density estimates.

**Algorithm 2.** Pseudo code to select differentially expressed genes based on AUC and OVL estimates.

The OVL estimation was based on a non-parametric form with densities estimated using kernel functions. Figure

Symbols are listed in order of appearance in the Algorithm.

| **Symbol** | **Definition** |
|---|---|
| ^{A}, ^{B} | Kernel density coordinates of samples A and B |
|  | pairs of coordinates of samples A and B that will be used to estimate OVL |
|  | index a pair of coordinates of ^{A}, where ^{B} and |
|  | total number of pairs of coordinates |
|  | ordered list of points resulting from the union of |
|  | union of G with new pairs of coordinates, which correspond to jump points between densities |
|  | indexes a pair of coordinates of G |
| _{new} | new abscissa |
| _{new} | new ordinate |
|  | final list of pairs of coordinates to estimate OVL |
|  | overlapping coefficient between two kernel densities |

Functions are listed in order of appearance.

| **Function** | **Definition** |
|---|---|
|  | if there is more than one equal abscissa on the list, returns the pair of coordinates corresponding to the one which has the minimum ordinate |
|  | returns the ordinate of a pair of coordinates |
|  | returns the pair of coordinates immediately preceding the abscissa in the list |
|  | returns the pair of coordinates immediately after the abscissa in the list |
|  | joins lists |
|  | orders a list in increasing order of abscissas |
|  | returns the abscissas of a pair of coordinates |
|  | trapezoidal rule for area estimation |

The selection of differentially expressed genes is based on simultaneous analysis of OVL and AUC. The

Symbols are listed in order of appearance.

| **Symbol** | **Definition** |
|---|---|
|  | representing arrays and rows representing genes |
|  | representing arrays and rows representing genes |
| UP | up-regulated genes list |
| DOWN | down-regulated genes list |
|  | indexes a gene (row of the matrix) |
|  | arbitrary thresholds |
| ^{A}, ^{B} | kernel density coordinates of subsamples of genes from samples A and B |
| ^{A[j]} | indexes a gene of the subsample S from sample A |
|  | indexes an ordinate of a gene j of the subsample S from sample A |
| _{X}[ | indexes a gene with bimodal or multimodal kernel density from sample A |
|  | indexes a gene with bimodal or multimodal kernel density |
| SPECIAL | special genes list |

Functions are listed in order of appearance.

| **Function** | **Definition** |
|---|---|
|  | Area above the ROC curve estimated by the trapezoidal rule |
|  | overlapping coefficient estimated by Algorithm 1 |
|  | kernel density estimation |
|  | returns the ranks of a list |

Selection of differentially expressed genes with positive regulation (Figure

Bimodality (or multimodality) is analyzed based on the behavior of the ordinates of the kernel based estimated densities of both groups, considering only the gene list that is selected in the first step mentioned above (Figure
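One simple way to inspect the ordinates of a kernel density estimate for bimodality is to count their local maxima on an increasing grid of abscissas. The sketch below is only an assumed reading of this step, not the rule used in Algorithm 2; the function name `count_modes`, the optional `min_height` filter and the example ordinates are hypothetical.

```python
def count_modes(ordinates, min_height=0.0):
    """Count local maxima among density ordinates given on an increasing grid.

    A flat plateau is counted once. Peaks not higher than `min_height`
    are ignored, which can help discard small bumps in the tails.
    """
    modes = 0
    for i in range(1, len(ordinates) - 1):
        rises = ordinates[i - 1] < ordinates[i]
        falls_or_flat = ordinates[i] >= ordinates[i + 1]
        if rises and falls_or_flat and ordinates[i] > min_height:
            modes += 1
    return modes


unimodal = [0.0, 0.1, 0.3, 0.1, 0.0]
bimodal = [0.0, 0.2, 0.05, 0.25, 0.0]

print(count_modes(unimodal))  # 1
print(count_modes(bimodal))   # 2
```

A gene would then be flagged as bimodal or multimodal in a group when the count for that group's density exceeds one.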

Performance and implementation

On a dataset with 10000 genes, the algorithm runs in less than 60 minutes on a 533 MHz Pentium.

**R code for implementation of Algorithms 1 and 2.**

Lymphoma data

From a total of 4026 genes, our method selected 178 differentially expressed genes, of which 68 were up-regulated, 90 were down-regulated and 20 were special genes. We used AUC≥0.9 and OVL<0.5 to select up-regulated genes, AUC≤0.1 and OVL<0.5 to select down-regulated genes and OVL<0.5 and 0.4<AUC<0.6 to select special genes.
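The threshold rules used for the lymphoma data can be sketched as a small classifier on the pair (AUC, OVL). This is an illustrative Python sketch rather than the R code of Algorithm 2: the function name `classify_gene` and its keyword arguments are hypothetical, and a gene labelled "special candidate" would still need the bimodality check before being reported as special.

```python
def classify_gene(auc, ovl, auc_hi=0.9, auc_lo=0.1, ovl_max=0.5, band=(0.4, 0.6)):
    """Label one gene from its AUC and OVL estimates using arrow-plot thresholds."""
    if ovl >= ovl_max:
        return "not selected"          # densities overlap too much
    if auc >= auc_hi:
        return "up"                    # up-regulated
    if auc <= auc_lo:
        return "down"                  # down-regulated
    if band[0] < auc < band[1]:
        return "special candidate"     # low overlap but AUC near 0.5
    return "not selected"


print(classify_gene(0.95, 0.20))    # up
print(classify_gene(0.05, 0.30))    # down
print(classify_gene(0.477, 0.389))  # special candidate
```

The thresholds are arbitrary and can be tightened or relaxed through the keyword arguments.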

**Arrow plot of lymphoma data.** Up-regulated genes (red dots) were selected with AUC ≥ 0.9 and OVL < 0.5. Down-regulated genes (blue dots) were selected with AUC ≤ 0.1 and OVL < 0.5. Special genes were selected with OVL < 0.5 and 0.4 < AUC < 0.6. Orange dots correspond to a bimodal or multimodal density in the experimental group, cyan dots to a bimodal or multimodal density in the control group and green dots to bimodal or multimodal densities in both groups.

Table

Special genes were selected using OVL<0.5 and 0.4<AUC<0.6.

| **Gene ID** | **Gene name** | **OVL** | **AUC** | **Group** |
|---|---|---|---|---|
| GENE3323X |  | 0.389 | 0.477 | B |
| GENE3473X |  | 0.399 | 0.407 | B |
| GENE1877X |  | 0.428 | 0.421 | B |
| GENE3388X |  | 0.432 | 0.529 | B |
| GENE1141X |  | 0.443 | 0.571 | E |
| GENE3521X |  | 0.446 | 0.593 | B |
| GENE3407X |  | 0.453 | 0.543 | B |
| GENE75X |  | 0.457 | 0.546 | C |
| GENE2519X |  | 0.461 | 0.529 | E |
| GENE3343X |  | 0.461 | 0.543 | B |
| GENE1817X |  | 0.472 | 0.586 | B |
| GENE3389X |  | 0.476 | 0.475 | E |
| GENE3909X |  | 0.492 | 0.463 | C |
| GENE2887X |  | 0.492 | 0.486 | E |
| GENE3547X |  | 0.493 | 0.413 | B |
| GENE1004X |  | 0.494 | 0.511 | B |
| GENE2547X |  | 0.495 | 0.500 | B |
| GENE2778X |  | 0.496 | 0.536 | B |
| GENE3322X |  | 0.498 | 0.532 | E |
| GENE463X |  | 0.499 | 0.461 | B |

**Kernel density plots and empirical ROC plots.** Kernel density estimates of the expression values of the 20 selected special genes, where red densities represent the experimental sample and black densities represent the control sample. The

Among the 20 special genes selected list (Table

**Biological description of the 20 special genes selected in the Lymphoma data.**

We compared our results with those obtained by Parodi et al.

Nine feature selection methods were applied to the full dataset, namely Welch t-statistic, fold change (FC), rank products (RP)

Simulated data

We simulated ten thousand genes (see Methods for details), among which 9500 were non-differentially expressed, 225 were up-regulated, 225 were down-regulated and 50 were special genes. Analyzing the

**Arrow plot of simulated data.** Orange dots correspond to truly non-differentially expressed genes, red dots to truly up-regulated genes, blue dots to truly down-regulated genes and green dots to truly special genes. Up-regulated genes were selected with AUC ≥ 0.9 and OVL < 0.5, down-regulated genes with AUC ≤ 0.1 and OVL < 0.5, and differentially expressed genes with bimodal or multimodal densities with OVL < 0.5 and 0.4 < AUC < 0.6.

We can conclude that our algorithm for detection of bimodality achieved 100% accuracy on that list.

ROC analysis was conducted to evaluate and compare the performance of the above methods. We analyzed the performance of these methods regarding the discrimination between differentially expressed genes and non-differentially expressed genes considering two scenarios. First we studied the performance of the methods concerning the capacity to differentiate among up-regulated, down-regulated and special genes; secondly we studied the performance concerning only the capacity to identify special genes.

The construction of the ROC curves was based on the absolute values of the following statistics: FC, AD, WAD, RP, Welch-

The empirical ROC curves, under the first scenario are represented in Figure

**Empirical ROC curves.** Comparison of ROC curves in experiments where the goal is to select up- and down-regulated genes and special genes.

Comparison of AUC values where the goal is to select up- and down-regulated genes and special genes. The AUC values are sorted in decreasing order.

| **Method** | **AUC** |
|---|---|
| OVL | 0.998 |
| RP | 0.969 |
| WAD | 0.959 |
| FC | 0.953 |
| AD | 0.939 |
| AUC | 0.937 |
| SAM | 0.930 |
| ibmT | 0.924 |
| modT | 0.924 |
| Welch-t | 0.924 |
| ShrinkT | 0.924 |
| SAMROC | 0.921 |

OVL, with an estimated AUC close to 1, performed best, followed by the rank products method. SAMROC had the lowest performance, although all methods performed well.

Considering the scenario where the goal is to select only special genes, the empirical ROC curves (Figure

**Empirical ROC curves.** Comparison of ROC curves in experiments where the goal is to select special genes.

Comparison of AUC values where the goal is to select special genes. The AUC values are sorted in decreasing order.

| **Method** | **AUC** |
|---|---|
| OVL | 0.9459 |
| FC | 0.7786 |
| SAMROC | 0.7608 |
| Welch-t | 0.7604 |
| ibmT | 0.7555 |
| modT | 0.7545 |
| SAM | 0.6934 |
| RP | 0.6733 |
| AUC | 0.6288 |
| AD | 0.6140 |
| WAD | 0.5793 |
| shrinkT | 0.5793 |

Conclusions

We have presented a graphical and computational method for microarray experiments which allows the identification of genes that are expressed differently under two conditions even when their average behavior is similar. The main objective of this work was to select genes that are differentially expressed due to the presence of different subclasses, which could give important information about their inherent biological functions and which are usually missed by standard methods.

AUC and OVL statistics were used to achieve this goal. Both statistics are invariant when a suitable common transformation is made on variables

The approach used by the

Non-parametric techniques were used because they eliminate the need to specify parametric models. The non-parametric kernel density method makes few assumptions about the form of the distributions. This is attractive because it can be applied to thousands of genes in an automatic way. The disadvantage of non-parametric techniques is that they result in a loss of efficiency. Yet this loss is balanced by a reduced risk of misinterpreting the results through an incorrectly specified parametric form for the distribution.

The proposed algorithm is particularly useful in situations where bimodality exists in the gene expression data. The proposed methodology outperforms other well known methods for detecting different kinds of differentially expressed genes. Future work includes further evaluation of this methodology on other real datasets.

We recognize that selecting DE genes through an

Methods

Data sets

Lymphoma data

We used microarray data provided by the study of Alizadeh et al. (2000) [6] which are publicly available at the website

Simulated data

We conducted a simulation study in order to evaluate the performance of the proposed method.

Most studies of microarray data assume normality. However, there is relatively little literature on evaluating the normality of this type of data. Part of the problem is that most microarray datasets include large amounts of biological variability and/or small sample sizes. Biological variability makes it difficult to determine the source of the non-normality (non-normal datasets could simply be mixtures of normal datasets). Small samples do not have enough power to support claims about the distribution of the data.

It is well known that raw microarray data (across all platforms) are highly skewed (usually right-skewed) with many extreme values. Simulated datasets were therefore generated by drawing case and control samples from lognormal distributions, and a log transformation was applied afterwards to offset the skewness. Consider $X \sim LN(\mu_x, \sigma_x)$ and $Y \sim LN(\mu_y, \sigma_y)$.

For case and control samples we simulated $n_1 = n_2 = 30$ microarrays and a total of 10000 genes. This sampling was performed independently, despite the fact that individual gene expression levels are far from independent. In a typical microarray experiment, we expect to see a combination of non-differentially and differentially expressed genes (approximately 5% to 10% of the data). Hence, we simulated 500 differentially expressed genes and 9500 non-differentially expressed genes. Of the 500 differentially expressed genes, 225 were up-regulated, 225 were down-regulated and 50 were special genes.

Four characteristics of the data were considered in this simulation: mean ($\mu$), variance ($\sigma^2$), the magnitude of the difference between control and case samples, and bimodality of the distributions. Hence, several combinations of these parameters were considered.

While simulating values for expression levels of genes not differentially expressed, we considered that the difference between the mean of the control and case arrays ranged between -0.9 and 0.9. To provide several patterns of density distributions we considered variances with differences ranging from 0 to 12.25. The effect of changing

Genes with up-regulation and down-regulation were generated considering the difference between the mean of the case and control arrays ranging from 3.5 to 13.5 for up-regulation, and -13.5 to -3.5 for down-regulation and the differences between the variances for both situations ranged from 0 to 12.25.

The gene expression distribution of a special gene was considered as a mixture of two lognormal distributions in one of the groups. If $X \sim p\,LN(\mu_0, \sigma_0) + (1 - p)\,LN(\mu_1, \sigma_1)$, we kept $\mu_0 = 3.5$ unchanged and gradually increased $\mu_1$ from 7 to 17, and left $\sigma_0 = \sigma_1 = 1.2$ unchanged. For the other group we considered a lognormal density with location parameter approximately equal to $p\,\mu_0 + (1 - p)\,\mu_1$. We considered

Finally we took the logarithms of the 10000 expression levels on both groups to offset the skewness.
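The generation of one special gene can be sketched as follows, using only the Python standard library. This is an illustration under assumptions, not the simulation code used here: the mixing weight `weight=0.5`, the seed and the function name `simulate_special_gene` are hypothetical, while `mu0`, `mu1` and `sigma` take values quoted in the text.

```python
import math
import random


def simulate_special_gene(n=30, mu0=3.5, mu1=7.0, sigma=1.2, weight=0.5, seed=0):
    """Simulate log expression levels of one special gene in two groups.

    One group follows a two-component lognormal mixture; the other follows a
    single lognormal whose location is the weighted mean of the two locations.
    """
    rng = random.Random(seed)
    mixture = [
        # pick the mixture component first, then draw from that lognormal
        rng.lognormvariate(mu0 if rng.random() < weight else mu1, sigma)
        for _ in range(n)
    ]
    location = weight * mu0 + (1.0 - weight) * mu1
    single = [rng.lognormvariate(location, sigma) for _ in range(n)]
    # log transformation to offset the skewness of the raw values
    return [math.log(v) for v in mixture], [math.log(v) for v in single]


bimodal_group, unimodal_group = simulate_special_gene()
print(len(bimodal_group), len(unimodal_group))  # 30 30
```

After the log transformation the two groups have similar means, but the first group is a mixture of two normal components centred near `mu0` and `mu1`.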

Non-parametric OVL

The overlapping coefficient refers to the area under two density functions simultaneously:

$$\mathrm{OVL} = \int \min\{f_X(x), f_Y(x)\}\,dx$$

where $f_X$ and $f_Y$ are the density functions of the random variables $X$ and $Y$.

The estimation of OVL was based on a non-parametric procedure with densities estimated using kernel functions. A kernel function

where $(x_1, \ldots, x_n)$ is the sampling vector.

For the purpose of this work, we chose as kernel function a standard normal distribution
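A minimal sketch of a kernel-based OVL estimate with a standard normal kernel is given below. It evaluates both density estimates on a common grid and integrates their pointwise minimum with the trapezoidal rule, which is a simplification and not the coordinate-merging procedure of Algorithm 1. The normal-reference bandwidth $1.06\,s\,n^{-1/5}$, the grid size and all function names are assumptions made for illustration.

```python
import math
import statistics


def gaussian_kde(sample, h):
    """Return a function x -> kernel density estimate with a standard normal kernel."""
    norm = 1.0 / (len(sample) * h * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)

    return density


def rule_of_thumb_bandwidth(sample):
    """Normal-reference bandwidth 1.06 * sd * n^(-1/5) (an assumed choice)."""
    return 1.06 * statistics.stdev(sample) * len(sample) ** (-0.2)


def ovl(sample_a, sample_b, grid_size=512):
    """Trapezoidal integral of min(f_a, f_b) over a grid covering both samples."""
    h_a = rule_of_thumb_bandwidth(sample_a)
    h_b = rule_of_thumb_bandwidth(sample_b)
    f_a = gaussian_kde(sample_a, h_a)
    f_b = gaussian_kde(sample_b, h_b)
    lo = min(min(sample_a) - 4 * h_a, min(sample_b) - 4 * h_b)
    hi = max(max(sample_a) + 4 * h_a, max(sample_b) + 4 * h_b)
    step = (hi - lo) / (grid_size - 1)
    heights = [min(f_a(lo + i * step), f_b(lo + i * step)) for i in range(grid_size)]
    return step * (sum(heights) - 0.5 * (heights[0] + heights[-1]))


same = [1.0, 2.0, 3.0, 4.0, 5.0]
far = [20.0, 21.0, 22.0, 23.0, 24.0]
print(round(ovl(same, same), 2))  # close to 1: identical samples overlap fully
print(round(ovl(same, far), 2))   # close to 0: well separated samples
```

Because the result depends on the bandwidth, estimates obtained with a different bandwidth rule will differ slightly from those of this sketch.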

More than the choice of the kernel function, the choice of the bandwidth,

where

However this choice of

The function

Non-parametric AUC

The ROC curve assesses the effectiveness of a continuous diagnostic marker in distinguishing between two independent populations. In a standard situation a case is assessed positive if the corresponding marker value is greater than a given threshold value. Associated with any threshold value is the probability of a true positive (sensitivity) and the probability of a true negative (specificity). Let _{
Y
}(_{
X
}(_{
X
}(

The simplest non-parametric estimation method for the ROC curve involves using empirical cumulative distribution functions. The empirical cumulative distribution function is defined for any given value
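The empirical ROC curve can be sketched by sweeping the threshold over the observed values and recording the proportions of case and control values at or above it. The Python code below is an illustration with hypothetical function names and toy data; it is not the R code used in this work.

```python
def empirical_roc(control, case):
    """Return the (false positive rate, true positive rate) points of the empirical ROC."""
    thresholds = sorted(set(control) | set(case), reverse=True)
    points = [(0.0, 0.0)]
    for c in thresholds:
        # proportions of each group classified as positive at cut-off c
        tpr = sum(1 for y in case if y >= c) / len(case)
        fpr = sum(1 for x in control if x >= c) / len(control)
        points.append((fpr, tpr))
    return points


def trapezoid_area(points):
    """Area under a piecewise linear curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


control = [1.0, 2.0, 3.0, 4.0]
case = [3.5, 4.5, 5.0, 6.0]
roc = empirical_roc(control, case)
print(roc[0], roc[-1])       # (0.0, 0.0) (1.0, 1.0)
print(trapezoid_area(roc))   # 0.9375
```

The trapezoidal area under these points agrees with the Mann-Whitney estimate of AUC for the same data.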

This method was performed using functions from the

Arrow plot

Plotting OVL against AUC gives rise to a graph which we called

TNRC and ABCR statistics

Parodi et al.

The ABCR statistic is obtained using the empirical ROC curve, where ties are not considered. In that sense, if _{0} is the number of individuals observed with _{1} the number of individuals observed with _{0} + _{1} will be the total of individuals observed and _{0} ≤

They first rank the genes according to ABCR (5).

where _{
k
} is the partial area under a ROC curve between the consecutive abscissa points for _{0} computed according to the standard trapezoidal rule, and

TNRC statistic is used to test for not proper ROC curves:

where AUC is the area under the empirical ROC curve. Not proper ROC curves are identified by high values of the TNRC statistic.
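A minimal sketch of these two statistics for tie-free samples is given below, under stated assumptions: ABCR is taken as the sum, over the strips between consecutive false positive rates, of the absolute difference between the partial area under the empirical ROC curve and the partial area under the rising diagonal, and TNRC is taken as ABCR divided by |AUC − 0.5|. Both formulas are an assumed reading of Parodi et al. and should be checked against that paper; the function names and data are hypothetical.

```python
def roc_step_heights(control, case):
    """Height of the empirical ROC over each false-positive strip (no ties assumed)."""
    n1 = len(case)
    # passing the k-th largest control value moves the curve one strip to the right
    return [sum(1 for y in case if y > x) / n1 for x in sorted(control, reverse=True)]


def abcr_tnrc(control, case):
    """Return (ABCR, AUC, TNRC) under the assumed definitions described above."""
    n0 = len(control)
    partial = [h / n0 for h in roc_step_heights(control, case)]
    diagonal = [(2 * k - 1) / (2.0 * n0 * n0) for k in range(1, n0 + 1)]
    abcr = sum(abs(p - d) for p, d in zip(partial, diagonal))
    auc = sum(partial)
    tnrc = float("inf") if auc == 0.5 else abcr / abs(auc - 0.5)
    return abcr, auc, tnrc


# A proper ROC curve (shifted groups) and a not proper one (bimodal case group).
print(abcr_tnrc([1.0, 2.0, 3.0, 4.0], [3.5, 4.5, 5.0, 6.0]))   # (0.4375, 0.9375, 1.0)
print(abcr_tnrc([4.0, 5.0, 6.0, 7.0], [1.0, 9.0, 10.0, 11.0]))  # (0.3125, 0.75, 1.25)
```

With these definitions a curve that stays on one side of the diagonal gives a ratio of 1, and a curve that crosses the diagonal gives a ratio above 1.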

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

CSF, a PhD student, developed and implemented the method under the guidance of her advisors MAAT and LS. All authors read and approved the final manuscript.

Acknowledgements

Research partially sponsored by national funds through the Fundação Nacional para a Ciência e Tecnologia, Portugal – FCT under the projects PEst-OE/MAT/UI0006/2011 and PTDC/MAT/118335/2010 and by the FCT PhD scholarship SFRH/BD/45938/2008.

The authors thank Stefano Parodi from G. Gaslini Children’s Hospital, Italy, for the help given with the Lymphoma dataset.