|
| Tests for differential gene expression using weights in oligonucleotide microarray experiments1 Program in Genetics and Genomic Biology, The Hospital for Sick Children Research Institute, 15-706 TMDT, 101 College Street, Toronto, ON, M5G 1L7, Canada 2 Department of Public Health Sciences, University of Toronto, Health Sciences Building, 155 College St, Toronto, ON, M5T 3M7, Canada 3 Program in Population Health Sciences, The Hospital for Sick Children Research Institute, 555 University Ave, Toronto, ON, M5G 1X8, Canada
BMC Genomics 2006, 7:33doi:10.1186/1471-2164-7-33 The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2164/7/33
©
2006 Hu et al; licensee BioMed Central Ltd. AbstractBackgroundMicroarray data analysts commonly filter out genes based on a number of ad hoc criteria prior to any high-level statistical analysis. Such ad hoc approaches could lead to conflicting conclusions with no clear guidance as to which method is most likely to be reproducible. Furthermore, the number of tests performed with concomitant inflation in type I error also plagues the statistical analysis of microarray data, since the number of tested quantities in a study significantly affects the family-wise error rate. It would, therefore, be very useful to develop and adopt strategies that allow quantification of the quality of each probeset, to filter out or give little credence to low-quality or unexpressed probesets, and to incorporate these strategies into gene selection within a multiple testing framework. ResultsWe have proposed a unified scheme for filtering and gene selection. For Affymetrix gene expression microarrays, we developed new methods for measuring the reliability of a particular probeset in a single array, and we used these to develop measures for a set of arrays. These measures are then used as weights in standard t-statistic calculations, and are incorporated into the multiple testing procedures. We demonstrated the advantages of our methods using simulated data, publicly available spiked-in data as well as data comparing normal muscle to muscle from patients with Duchenne muscular dystrophy (DMD), in which a set of truly differentially expressed genes is known. ConclusionOur quality measures provide convenient ways to search for individual genes of high quality. The quality weighting strategies we proposed for testing differential gene expression have demonstrable improvement on the traditional filtering methods, the standard t-statistic and a regularized t-statistic in Affymetrix data analysis. BackgroundAffymetrix GeneChip™ microarrays are used to measure gene expression for thousands of transcripts simultaneously. Each transcript is measured by 11–20 probesets, where a probeset consists of two almost identical sequences of length 25 bp. One member of the pair is the perfect match (PM) probe, where the sequence is the exact complement of a section of the mRNA. The other member of the pair, the mismatch probe (MM), is identical to the PM probe except for the nucleotide in 13th position and is intended to measure and control for non-specific binding signals. The accuracy and sensitivity of the measurement for any one gene or EST depends on the uniqueness and the binding properties of the probes. It is well known that most probesets perform consistently and reliably, in that similar estimates of expression are obtained from two replicates of an experiment. However, in many cases, the signals from a probeset can be hard to interpret. There may be substantial variability across the probesets in the estimated level of expression, or in the PM-MM differences. PM signals can be smaller than MM signals suggesting high levels of non-specific binding. It is also well known that a large proportion of the genes are expressed in only a few tissues or at a particular developmental stage, and hence many of the genes are not expected to have a measurable transcribed product. Su et al. [1] have estimated that in most cases, only 30–40% of the genome will be expressed in any one tissue. In such situations, the probesets give measurements for PM and MM that fluctuate near the lower detectable limit. The latest release for the Affymetrix GeneChip Expression Analysis Platform, GeneChip Human Genome U133 Plus 2.0 Array [2], provides comprehensive coverage of the transcribed human genome on a single array. The array includes more than 54,000 probesets with 38,500 well-characterized human genes. When analyzing such a large number of genes, the adjustment of significance levels through multiple testing procedures such as the Bonferroni method [3], or Benjamini and Hochberg [4], may be dramatic enough to make it very difficult to identify differentially expressed genes. However, if we knew which transcripts were either not expressed in the tissue under study, or were measured unreliably due to poor probe specificity, we could exclude these transcripts from the analysis and pay a smaller penalty for multiple testing. In 2001, Affymetrix produced a new algorithm for summarizing the results of a probeset, and the algorithm includes a detection p-value, which represents the probability that the probeset (gene) expression is above zero (i.e., turned on), and measured reliably and consistently across the probe. In addition, "Present/Marginal/Absent" (P/M/A) calls for each transcript are based on the detection p-value together with thresholds that can be altered by the user [5]. The calls are often used as filters to keep genes whose transcripts are detectable in a particular experiment [6-8], and this filtering concept has been integrated into some software such as dChip [9,10]. For example, Blalock, et al. [7] removed a probeset from further statistical analysis if there were > = 60% absent calls, for that probeset, in at least one treatment group. However, the stringency of the filtering procedure can strongly affect downstream analyses and the final results. Too much filtering may exclude some important genes, whereas too little filtering will reduce power by increasing the number of tests performed. Moreover, it is possible that the statistical error introduced by imperfect filtering criteria makes the overall error worse. Pounds and Cheng [11] recently argued that it is better to define gene filters based on the detection p-value than on the calls, so that more of the available information about probeset reliability can be used. They developed two pooled filters for each probeset based on detection p-values. An alternative strategy to filtering is to treat probe-specific variability as a quality index to measure the reliability of each probeset. Although many summaries of Affymetrix GeneChip expression have been proposed (e.g. MAS5 [5], dChip [9,10], RMA [12], PLIER [13] and gcRMA [14]), there have been few studies on quality measures for Affymetrix GeneChip expression data. In previous studies, typically, once a filtering decision was made, the estimated intensities of each probeset were considered to be equally well measured, so that probesets with highly variable signals were considered as reliable as probesets with inconsistent signals. However, probesets that detect transcripts expressed at a very high level would be expected to show a more robust signal with greater quality than probesets that are performing poorly or detecting very low level transcripts. Seo et al. [15] used the detection p-values to develop weighted Pearson correlations between expression measurements from two different arrays. The weight was defined to be a function of the two detection p-values for each probeset. They incorporated these weighted correlations into unsupervised clustering analysis, with the goal of choosing the best algorithm for summarizing across the probesets. In this study, we focus on the problem of testing for differential expression between groups of arrays, while adjusting for multiple testing in a weighted framework. We redefine some of the widely used filtering methods [6-9] and also propose some new methods for measuring the quality of a particular probeset in a single array, in an experimental group of arrays, and in an entire study. These measures are then used as weights in t-statistic calculations, and are incorporated into the multiple testing procedures. We applied these methods to a dataset of expression in muscle tissue [16], Choe's spiked-in data [17] and simulated data. One of our quality measures for an experimental group is based on the measure developed very recently by Pounds and Cheng [11]; however, we go further in combining measures across groups, and in discussing the usefulness of these measures when testing for differential gene expression. ResultsTest statistics and quality measuresWe propose a weighting paradigm for including quality measures into analysis when testing for differential expression. Suppose that an unweighted test statistic for gene g is represented by tg, and that the quality measure is called Qg. We propose to evaluate the significance of gene g using the weighted test statistic Measure Qg must represent a summary across treatment groups w ∈ {1 ...W}. We propose where We examine the performance of several choices for the group-specific quality measure Instead of using the Affymetrix present/absent calls, more sensitivity can be gained by basing Although, under the null hypothesis of no signal, the detection p-values can be expected to follow a uniform distribution (so that -log(q) is exponential with mean 1.0), when there is a signal, the p-value distribution is better described by the two-parameter Beta distribution. However, detection p-values are based on rank tests and there is often little variability in the p-values across arrays for highly expressed genes. This leads to difficulties in estimating two parameters. Pounds and Cheng [11] proposed a one-parameter Beta distribution for detection p-values, so that For comparison, two very different test statistics were also used: (1) Weights based on the detection p-values were incorporated directly in the calculation of group-specific means and standard deviations – we refer to this test statistic as Table 1. Test statistics and quality measures. For the quality-based tests, the summarized quality measure, across treatment groups, is defined by Qg = maxw∈{1...W} Analysis of simulated dataThe performance of our proposed methods was evaluated using simulated probe level data generated from a model incorporating probe level effects, optical noise, and non-specific binding, as well as true signals [14]. We have run five simulation models following the simulation procedures described in Materials and Methods. Treatment effects on the signal were varied between 0.5 (very small) and 2.5 (very large) in the five models. To compare performance, we used Summarized Receiver Operating Characteristic (SROC) curves, where the test sensitivities and specificities (true positive and true negative proportions) for a range of p-value cutoffs (or FDR cutoffs for results with multiple testing adjustments) were averaged over 100 simulated datasets. Figures 1 and 2 show SROC curves of models 2 and 4, where large and small treatment effect sizes, respectively, were chosen in the generated models.
The SROC curve's overall behavior can be measured by the Area Under the Curve (AUC) [19]. Table 2 shows AUCs for five simulation models under different weighting and multiple testing strategies. As seen from the table and figures, weighted t-statistics based on quality measure Table 2. Area under the curves (AUCs) for five simulation models * Duchenne muscular dystrophy vs. normal muscleHaslett et al. [16] compared gene expression between 12 quadriceps biopsies from Duchenne muscular dystrophy (DMD) patients and 12 quadriceps biopsies of normal skeletal muscle. Using a method called geometric fold change (GFC), they identified 133 differentially expressed genes (139 probesets) with a permutation-based false discovery rate of 2.3 × 10-3; Of these, 12 genes (13 probesets) were confirmed by RT-PCR. This set will be referred to as the "RT-PCR" probesets. Tables 3 and 4 compare the agreement between probesets selected by our test statistics and Haslett et al. [16]. In Table 3, for each test statistic, we selected the top T = 30, 50, 100 or 139 significantly expressed probesets, ranked based on our adjusted false discovery rates, and we counted the number of these probesets that were also in the top T probesets identified by GFC and ranked based on their absolute fold changes. Table 4 shows a similar comparison for the RT-PCR set. Table 3. Agreement in probeset selections between our methods and Haslett et al. [16]: Given a chosen number of selected probesets, how many of the probesets selected by GFC were also selected by the methods in this paper (corresponding false discovery rate WBH-FDR) Table 4. Comparison of our methods and Haslett et al. [16]. Identification of differentially-expressed probesets validated by RT-PCR: given a chosen number of probesets selected by GFC, and the number of probesets validated by RT-PCR within this set, how many of the RT-PCR probesets were selected by the methods in this paper. The measure Analysis of Choe's spiked-in dataThe availability of data from spiked-in experiments (Choe et al. [17]) provides an excellent opportunity to examine the performance of our weighted statistics on real data where the answers are known. Selected transcripts were added at a range of known concentrations; some were chosen to have differential expression between two groups of samples (true positive changes in expression); others were spiked-in at the same concentration in the two groups (true negative changes in expression). Figures 3 and 4 show the ROC curves for the comparison of the two groups for MAS5 and RMA, respectively. Table 5 shows the AUC under different weighting and multiple testing strategies. For both the RMA and MAS5 strategies, it is still clear that whether p-values were unadjusted or adjusted by the WBH method, Table 5. Area under the curves (AUCs) of Choe's spiked-in data (ν = 0.05)
DiscussionIn traditional gene selection methods, pre-filtering and selection are two separate procedures, and commonly used pre-filtering methods are based on Affymetrix calls [6-9]. We unified gene filtering and gene selection procedures together, where the importance of a gene is measured by its quality score, defined across all arrays and experimental groups, rather than by a given cutoff call value. The methods, therefore, overcome the shortcomings of the traditional methods to filter out genes before any high-level data analysis, such as gene selection, is carried out. Our measure The best choice of weighting statistic in the presence of adjustments for multiple testing appears to depend on the size of the treatment effects. The exponential model measure, For quality measures Table 6. Sensitivity and specificity for detecting differentially expressed genes, as a function of the sensitivity parameter ν It is now a common practice to use procedures for controlling multiple testing when identifying differentially expressed genes. Various procedures, such as the Bonferroni correction, the Benjamini and Hochberg false discovery rate [4], or the Benjamini and Yekutieli false discovery rate [20], have been widely used. Due to the quality adjustment in our proposed t-statistics, we can no longer assume that the p-values ptg have a uniform distribution under the null hypothesis, and we can not assume that the test statistics have identical distributions. Therefore, we developed weighted multiple testing procedures. Our findings showed that the proposed weighted Benjamini and Hochberg (WBH) adjustment procedure is better than the weighted Bonferroni (WB) and weighted Benjamini and Yekutieli (WBY) adjustment procedures (results not shown). However, we also observed that the proposed t-statistics using WB and WBY had poor sensitivity, regardless of the type of quality measure score (data not shown). We are planning to further extend weighted multiple testing methods in order to redefine Storey and Tibshirani's positive false discovery rate (pFDR) [21], where the test statistics are assumed to be identically distributed. Several other modified t-statistics, such as SAM [22], penalized t-statistics [23], or the local pooled error method used here [18], have been developed for microarray data analysis. All of these methods focus on overcoming the shortcoming of the ordinary t-statistics for ranking genes, due to unstable variance estimates that may arise when sample size is small. Each method used a slightly different strategy to estimate a penalty parameter for smoothing unstable variance estimates, using information from all genes rather than relying solely on variance estimates from an individual gene. However, as we discussed in the "Background" Section, the quality of this information is not the same across genes. Therefore, the estimate of the penalty parameter may not be reliable if we assume that all gene-specific information has the same quality. Here, we take another strategy to improve the performance of ordinary t statistics by putting a high weight for the genes with high quality and a low weight for those with low quality in the t-statistic calculations. The initial comparison of our strategy with the LPE test demonstrated that our weighted tests based on It should be noted that the probe-level data analysis (such as background correction, normalization and summarization methods) may influence the results of the test procedures we discussed. Our initial analysis of real data and spiked-in data show that the weighted test statistics based on the MAS5 strategy may have slightly better performance than those based on the RMA strategy. In future investigations, we plan to explore how different probe-level data analysis methods, such as MAS5, dCHIP, RMA, PLIER and gcRMA, and different detection p-value calculation methods [17], may influence our weighting strategy in detail. Our results showed that some true differences are missed with any filtering method, as was noted by others [11]. It is known that the mismatch probes measure some true signal [12], and therefore is possible that, especially for genes expressed at a low level, quality scores for real data are lower than in our simulated data. Defining an alternative to the detection p-value that does not depend on the mismatch data may lead to better sensitivity. However, if mismatch probes do contain signal, and a study wishes to identify small changes in gene expression, the decision to use any filter at all must be carefully considered, given that some true effects are likely to be excluded. In addition, when effects are small, it must be realized that using multiple testing corrections will also lead to the exclusion of real effects. The proposed quality measures may also be useful in other applications of microarray experiments. For example, in microarray meta-analysis we could weight based on not just the quality of a study [24], but the quality of each measurement [25,26]. MethodsQuality measuresThe quality of a probeset may depend on many quantities such as spatial arrangement on arrays, upper and lower threshold effects, etc. Here we focus on measuring reliability and consistency of a probeset's expression using the probeset's detection p-value. The distribution of expression within a probeset, leading to a detection p-value, is influenced by all stages of the microarray process including scanner brightness, background, RNA quality, chip design, etc. [5]. Therefore, it can be thought as a synthesis index to represent the probeset's quality. In this paper, we defined quality measures hierarchically at 3 levels (Table 1): (a) gene and array level, (b) gene and group level, summarizing across arrays within each group, (c) gene level, summarizing across arrays as well as groups. Suppose there are i = 1, 2, ..., I arrays in total, each containing g = 1, 2, ..., G genes. Further, assume that there are W treatment groups, each consisting of nw arrays, for w = 1, 2, ..., W. Therefore, for the first aspect (a), we measured the quality of the measure of expression for one transcript based on the detection p-value or the Affymetrix Present/Absent/Marginal call [5]. Two measures were defined, which we denote by qgi, the detection p-value, and Approaches for summarizing across a set of arrays take two forms. Firstly, the quality of a gene's measurement across a particular group of arrays can be defined; we call this To develop a model-based quality measure, we argued as follows: if a gene is not expressed or cannot be measured, then the detection p-values (qgi) are expected to follow a uniform distribution. Equivalently, if a gene's expression cannot be detected, then we can assume a common distribution for this gene for all arrays in group w, such that -log(qgi) ~ Exp(1). That is, the negative log(qgi) follows an exponential distribution with a rate parameter value of 1 [25]. We therefore made the assumption that the qgi for gene g and array i follows the one-parameter exponential distribution with a group specific mean λgw, w = 1, ..., W, so that -log(qgi) ~ Exp(λgw). The maximum likelihood estimate of λgw is the inverse of the group-specific sample mean. Define ν, a sensitivity parameter, to be a desired threshold for the detection p-values, representing 1 minus the probability that any probeset in a particular treatment group shows a detectable signal. Then P(qgiw ≤ ν) leads to Theoretically, p-values are expected to follow a two-parameter Beta distribution. However, estimating these two parameters is occasionally impossible due to the fact that the detection p-values are derived from rank tests, and all arrays may have exactly the same p-value when genes are highly expressed. Pounds and Cheng [11] recently proposed a one-parameter Beta distribution to model detection p-values within each experimental group, Tests of differential expression with quality weightsIn traditional meta-analysis, quality measures are often used when combining results from different studies [28]. Without loss of generality, we can assume that we are comparing two groups of microarrays (with NA arrays in group A and NB in group B), and testing for differentially-expressed genes with the two-sample Welch t-statistic, not assuming equal variances. For gene g, the test statistic is therefore We converted this modified t-test statistic to a p-value by reference to a standard t-distribution with degrees of freedom based on Satterthwaite's approximation [29], assuming unequal variances between groups A and B. Of course, it is clear that This simple approach of weighting the t-test statistics can be contrasted with the approach of weighting the expression intensities within the test statistic. Using standard formula for weighting based on quality, we constructed Multiple hypotheses testing with weightsSeveral authors have considered the problem of including weights in multiple hypothesis testing situations [31-33]. Often, weights have been introduced when some of the hypotheses Hi are deemed more important than others. For example, Holm [31] first introduced weights into his sequentially rejective multiple hypothesis testing procedure, where weight was used to indicate the importance of the hypotheses. In that spirit, testing for differences in genes that are not expressed or non-specific in such genes should be considered less important than testing for differences among genes that are specific and well-expressed. To give another perspective, our quality-weighted statistics t* can also be considered as an implementation of a filtering method. For example, Assume that quality measures Q1, Q2, ..., QG and significance values p1, p2, ..., pG have been calculated for all G genes. Let We applied WBH to the t-statistics modified by the quality measures Simulation methodAffymetrix probe level data were generated based on a unified model proposed by Wu et al. [14]. When treatment effects are considered for different conditions, such as cancer and normal tissues, gene expression across these conditions can be modeled as: where Ygij and Wgij are the PM and MM intensities for the probe j in probeset g on array i respectively; O denotes optical noise. N represents non-specific binding (NSB) noise. S is a quantity proportional to RNA expression and the coefficient 0 < Φ < 1 accounts for the fact that for some probe-pairs the MM detects signal. Following Wu et al. [14], we simulated Design of simulation studyThe simulation design is shown in Table 7. We assumed two groups, A and B, with NA and NB arrays, respectively. G genes were generated, of them the proportion of expressed genes is k, and the proportion of differentially expressed genes is d of the G*k expressed genes. We set the number of up-and -down regulated genes to be the same in this study. Therefore, the G genes are divided into four groups: a non-expressed gene group where genes are not expressed across all arrays in group A and group B; a non-differentially expressed gene group where genes are expressed but not differentially between groups A and B; an up-regulated gene group where the mean gene expression of gene g in group B is larger by δg than that in group A; and a down-regulated gene group where the mean gene expression of gene g in group A is larger by δg than that in group B. Table 7. Simulation structure of Affymetrix microarray data We ran five simulation models following the above design. The specific parameters used in the five models are: number of genes: 1000; sample size: 25 arrays in groups A and B, respectively (for a total of 50 arrays); number of probes within each probeset: 16; proportion of expressed genes: 0.5; proportion of differentially expressed genes: 0.1; signal detection variance: 0.25; multiplicative error variance: 0.05 and sensitivity parameter ν: 0.05. Duchenne muscular dystrophy dataHaslett et al. [16] hybridized the total RNA to HG-U95Av2 GeneChips. They used MAS5 to obtain signal intensities and normalized with a linear regression. Tests of differential expression were based on geometric fold change (GFC) [16]. The differential expression of 12 genes (13 probesets) was confirmed by quantitative RT-PCR analysis of seven DMD biopsies and four unaffected biopsies. We used only 23 arrays (only 11 DMD arrays) for our re-analysis since one file was truncated. Raw data were converted to signal estimates using both MAS5 [5] and RMA [12], both were implemented using the affy package in Bioconductor [34]. Spiked-in data of Choe et al. [17]The 'spiked-in' experiment (Choe et al. [17]) for Affymetrix Genechips provides a controlled dataset of 3,860 RNA species with known sequence and known concentration. Two different samples were prepared and hybridized in triplicate to Affymetric GeneChips; these are called the 'constant' (C) and 'spike' (S) samples. Out of 3,860 RNA species, 2,551 of them were created to have the same concentrations in both samples while the rest (1,309) were spiked in with different concentrations between the S and C samples. Ten fold-change levels, ranging from 1.2 to 4-fold, were assigned to the spiked-in RNAs. Basically, all the RNAs with positive log fold changes can be thought of as differentially expressed genes. In this study, we considered the top 1000 probesets as differentially expressed genes (as did Choe et al. [17]). Authors' contributionsCG initiated, designed and managed the study. CG also proposed the methods in this manuscript. JB participated in designing and managing the study. PH conducted data analysis and drafted the manuscript. CG and JB revised the manuscript. All authors read and approved the final manuscript. AcknowledgementsWe acknowledge helpful suggestions from two anonymous reviewers that greatly improved the quality of the manuscript. This work was supported by the Ontario Genomics Institute and Genome Canada. References
Have something to say? Post a comment on this article! |




on Google Scholar







author email
corresponding author email












Figure 1.
Figure 2.
Figure 3.
Figure 4.




























