Robust computational reconstitution – a new method for the comparative analysis of gene expression in tissues and isolated cell fractions1 Leibniz Institute for Natural Products Research and Infection Biology – Hans Knöll Institute, Beutenbergstr. 11a, Jena, Germany 2 Experimental Rheumatology Unit, Department of Orthopedics, Friedrich Schiller University Jena, Jena, Germany 3 Institute of Immunology, University of Rostock, Rostock, Germany 4 Department of Pharmacy and Molecular Biotechnology, Ruprecht Karls University Heidelberg, Heidelberg, Germany
BMC Bioinformatics 2006, 7:369doi:10.1186/1471-2105-7-369 The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/7/369
©
2006 Hoffmann et al; licensee BioMed Central Ltd. AbstractBackgroundBiological tissues consist of various cell types that differentially contribute to physiological and pathophysiological processes. Determining and analyzing cell type-specific gene expression under diverse conditions is therefore a central aim of biomedical research. The present study compares gene expression profiles in whole tissues and isolated cell fractions purified from these tissues in patients with rheumatoid arthritis and osteoarthritis. ResultsThe expression profiles of the whole tissues were compared to computationally reconstituted expression profiles that combine the expression profiles of the isolated cell fractions (macrophages, fibroblasts, and non-adherent cells) according to their relative mRNA proportions in the tissue. The mRNA proportions were determined by trimmed robust regression using only the most robustly-expressed genes (1/3 to 1/2 of all measured genes), i.e. those showing the most similar expression in tissue and isolated cell fractions. The relative mRNA proportions were determined using several different chip evaluation methods, among which the MAS 5.0 signal algorithm appeared to be most robust. The computed mRNA proportions agreed well with the cell proportions determined by immunohistochemistry except for a minor number of outliers. Genes that were either regulated (i.e. differentially-expressed in tissue and isolated cell fractions) or robustly-expressed in all patients were identified using different test statistics. ConclusionRobust Computational Reconstitution uses an intermediate number of robustly-expressed genes to estimate the relative mRNA proportions. This avoids both the exclusive dependence on the robust expression of individual, highly cell type-specific marker genes and the bias towards an equal distribution upon inclusion of all genes for computation. BackgroundThe comparative analysis of gene expression in diseased tissues and its isolated cell fractions can be used to identify genes with potential pathophysiological relevance, including those involved in interactions among different cell types. In the present study 'isolated cell fractions' are defined as cultivated cell populations of individual cell types purified from the respective tissue samples. A direct approach to the gene expression of specific cell types in the tissue is their microdissection from the tissue. Isolation and amplification of mRNA from microdissected single cells or pure cell type subpopulations has recently been established and described [1,2]. However, this method is just emerging, still having technical problems with reliable cell type markers, exact dissection, and representative mRNA extraction and amplification [3,4]. Therefore, instead of comparing gene expression profiles of individual cell types between tissue and isolated cell fractions, the present study compared the gene expression profiles of whole tissues and computationally reconstituted expression profiles that combine the expression profiles of the isolated cell fractions according to their relative mRNA proportions in the tissue. These relative mRNA proportions were determined using trimmed robust regression. Methods for the reconstruction of cell type-specific expression profiles and relative proportions have already been proposed in the literature. The marker gene approach [5,6] determines the relative mRNA proportions from the expression of highly cell type-specific marker genes. A drawback of this method is its dependence on the robust expression of single genes. Venet et al. [7], Stuart et al. [8], and Lähdesmaeki et al. [9] identified cell type-specific expression profiles from tissue samples differing in their cell type composition. Venet et al. [7] and Lähdesmaeki et al. [9] computed the cell type-specific expression profiles and their corresponding relative proportions simultaneously (matrix factorization of the tissue gene expression matrix), whereas Stuart et al. [8] determined the cell proportions experimentally and then calculated the respective expression values (gene-wise regression). The method of Lu et al. [10] and the present study are different from the three previous approaches in that they use actually measured, cell type-specific expression profiles and determine the relative mRNA proportions computationally (tissue-wise regression). Whereas Lu et al. [10] compared desynchronized yeast cell 'tissues' and five isolated cell fractions consisting of synchronized yeast cells in the G1, S, G2, M, and M/G1 cell cycle phases, the present study compares synovial tissues with the isolated cell fractions of adherent macrophages, adherent fibroblasts. and non-adherent cells. The study of Lu et al. [10], however, did not address the relative importance of regulated gene expression resulting from the induction of cell cycle arrest. In contrast, the present study demonstrates for the first time that in order to avoid a bias towards an equal distribution of the computed mRNA proportions, as many regulated genes as possible must be excluded from the analysis. Here, 'regulated genes' are defined as showing a differential expression between tissue and isolated cell fractions, whereas 'robustly-expressed genes' show a similar expression under both conditions. Differential gene expression was also investigated in the study of Ghosh [11], in which the gene expression of tumor tissue samples was compared to that of normal tissue controls (probabilistic mixture model). This study accounted for the varying relative proportions of stromal tissue within the tumor samples and assumed the gene expression in tumor and normal tissue to be perfectly robust, i.e. independent of their relative proportions in the sample (as is also the case in the studies of Venet et al. [7]. Stuart et al. [8], Lähdesmaeki et al. [9], and Lu et al. [10]). Thus, differential expression in the study of Ghosh [11] refers to expression differences between tumor and normal tissue, whereas regulated expression in the present study refers to expression differences between two conditions (tissue versus isolated cell fractions). Synovial membranes (inner aspect of the joint capsule) consist of three main cell types: macrophages and fibroblasts, both adhering to the culture vessel, and mixed non-adherent cells. These cell types are expected to contribute differentially to the pathogenesis of rheumatic diseases by expressing pro-inflammatory and pro-destructive genes. Therefore, gene expression profiles of the whole synovial tissue and the respective isolated cell fractions of patients with rheumatoid arthritis (RA) and osteoarthritis (OA) were analyzed using Affymetrix GeneChip technology. The performance of the newly developed reconstitution algorithm was validated using two mRNA mixing experiments performed by the authors and the mixing part of the GeneLogic dilution study [12]. The computed mRNA proportions were compared to the cell proportions determined by immunohistochemistry. Regulated and robustly-expressed genes selected according to statistical evidence and pathophysiological relevance are listed in the supplementary material. ResultsThe relative mRNA proportions of macrophages (pM), fibroblasts (pF), and non-adherent cells (pN) were determined for each tissue sample by matching the measured synovial tissue expression profile S and the computationally reconstituted expression profile S* = pM M + pF F + pN N, which in turn combines the measured expression profiles of the isolated cell fractions of macrophages (M), fibroblasts (F), and non-adherent cells (N) according to their relative mRNA proportions (Figure 1).
The matching of S and S* as a function of the relative mRNA proportions was performed using trimmed robust regression (Methods section, subsection Mathematical model). Trimmed regression only uses part of the data in order to exclude outliers that otherwise would bias the result. Here, the result is given by the set of relative mRNA proportions that minimize the differences between S and S* as quantified by the regression objective function. Solutions to trimmed regression problems are in general not unique due to the data subset choice, however, the solutions generally become more alike with increasing subset size. In the present study, the trimmed regression problem was solved for an increasing number of included genes. For each such number the optimization routine was initialized from several different starting values. The standard deviation of the results obtained from these random initializations was used to assess the similarity of the solutions. The relative mRNA proportions were determined from the respective ensemble means at the minimum number of included genes, for which the ensemble standard deviations approached zero, i.e. for which the solution was almost unique. This 'educated guess' (heuristic) approach accounts for the fact that the solution is completely indeterminate towards zero included genes and increasingly biased towards an equal distribution upon inclusion of all genes (Methods section, subsection Mathematical model). The general performance of this methodology is demonstrated in this first subsection. In order to assess the effect of data preprocessing on the present results the influence of different chip evaluation methods was investigated (Different chip evaluation methods subsection). The preliminary knowledge of the true relative mRNA proportions in mRNA mixing experiments was used to validate the present methodology as well as the chip evaluation methods (Mixing experiments subsection). In addition, the computed mRNA proportions were compared to the cell proportions determined by immunohistochemistry and those obtained using the marker gene approach (Immunohistochemistry and marker genes subsection). Finally, genes that were regulated and robustly-expressed in all patients were identified using different test statistics (Regulated and robustly-expressed genes subsection). General performanceThe proposed method for the determination of the relative mRNA proportions is demonstrated using the data of patient 2. Figure 2 shows how the means and standard deviations of the computed relative mRNA proportions of macrophages (pM), fibroblasts (pF), and non-adherent cells (pN) depend on the number k of genes included in the trimmed regression approach (Methods section, subsection Mathematical model). The mean is variable for small k. settles to a constant value for intermediate k and shows a slow and incomplete convergence towards an equal distribution (pM = pF = pN = 1/3) for large k. The respective standard deviations approach zero at an intermediate number of included genes. This number was used to determine the mRNA proportions from the respective mean values. The solid curves correspond to data, for which an additional local regression normalization was performed for the measured and reconstituted tissue profiles in each algorithmic step (Methods section, subsection Data preparation). The decrease of the standard deviation is faster for additional local regression normalization in this case, but the resulting mRNA proportions are similar.
The general curve progression in Figure 2 is similar to that obtained from a probabilistic model described in Supplementary Section A (Supplementary Figure A.2, see 1). However, the best estimate of the 'true' tissue mRNA proportions assumed by the model was obtained for the smallest number of included genes (in contrast to an intermediate number in Figure 2). This difference can be attributed to the striking roughness and overall flatness of the empirical objective function (3) used for trimmed regression (Methods section, subsection Mathematical model; Supplementary Figure A.3, see 1), if only a small number of genes is included. Hence, the means of the computed relative proportions converge to an equal distribution towards both ends. For increasing k, this is due to the inclusion of more and more regulated genes with expression levels non-correlated or negatively correlated between tissue and isolated cell fractions (Supplementary Section A.2, see 1). For decreasing k, this is due to the increasing roughness (increasing number of local minima) and overall flatness of the objective function (Methods section, subsection Mathematical model; Supplementary Section A.3, see 1). The relative mRNA proportions of the tissue can be estimated from the respective means of the computed mRNA proportions, as soon as the respective standard deviations become sufficiently small. The distribution of the individual mRNA proportions can then be assumed to be symmetric, implying the equality of mean and global minimum [13]. The drop of the standard deviation was indeed the most important criterion, but the curve progression and the agreement between the curves for the two different chip normalization methods (with and without local regression) were also taken into account. Additional File 1. Supplementary Material. Self-contained pdf-file providing supplementary information. Supplementary sections A – H. Format: PDF Size: 3.7MB Download file This file can be viewed with: Adobe Acrobat Reader Different chip evaluation methodsThere is an ongoing discussion about the question which probe set summary and which chip normalization method proves optimal for evaluating oligonucleotide microarrays. For assessing the effect of different chip evaluation methods on the results obtained by Robust Computational Reconstitution, four different probe set summaries were applied: MAS-S, MAS-C [14,15], RMA [16], and MBEI [17] (Methods section, subsection Data preparation). In addition, four different normalization methods were tested for the MAS-S and MAS-C summaries: trimmed mean only (t) [14,15], trimmed mean plus cyclic local regression (clr) [18], quantile normalization (q) [18,19], or centralization (c) [20] (Methods section, subsection Data preparation). The results are summarized in Table 1. Table 1. Computed relative mRNA proportions. Computed relative mRNA proportions pM (macrophages) and pF (fibroblasts) for patients 1–6 calculated for different chip evaluation methods. The proportion of non-adherent cells is pN = 1 - pM - pF. The primed 1 in 'patient 1' (RA)' indicates the use of the second tissue chip replicate for patient 1. Additional stepwise local regression normalization is indicated by lr. The proportions are given in percent. Probe set summaries: MAS-S: Microarray Suite 5.0 signal algorithm, RMA: Robust Multiarray Analysis, MBEI: Model Based Expression Index, GP: GenePublisher (not available for HG-U95A chips, patient 1), the R implementation of MBEI was used with the same settings as in GP. Normalization methods: t: trimmed mean, clr: cyclic local regression, q: quantile normalization, c: centralization. The computed mRNA fractions vary considerably among different probe set summaries and, for MAS-S and MAS-C, among different normalization methods. The methodological scatter differed among patients. It was quantified in terms of the pooled Mean Absolute Deviation (MAD) of the relative proportions across methods calculated with regard to the respective mean values, i.e. MAD = Table 2. MAD of the computed mRNA proportions. Pooled mean absolute deviation (MAD) of the relative mRNA proportions across different chip evaluation methods, calculated with regard to the respective mean values. 'MAD all' refers to all seven methods: the four different normalizations of MAS-S, as well as RMA, MBEI, and MBEI-GP, whereas 'MAD MAS' only refers to the four different normalization methods of MAS-S. The additional application of stepwise local regression normalization is indicated by lr. The MAD refers to the proportions of Table 1 (given in percent). The MAS-S summaries were preferred to the MAS-C, RMA, and MBEI summaries in this study because the agreement between the curves with and without stepwise local regression normalization (lr) was generally better for the MAS-S summaries (Figure 2 and Supplementary Figures C.1 and C.2, see 1). In addition. RMA, MBEI, and MAS-C tended to more extreme proportions close to 100% for patients 2, 5, and 6 (Table 1 and Supplementary Table C.1, see 1). In view of the relative cell proportions determined by immunohistochemistry (see subsection Immunohistochemistry and marker genes), the less extreme values obtained from MAS-S are more plausible. Moreover, the use of MAS-C resulted in irregular curves for patient 1 (original tissue chip, Supplementary Figure C.2, see 1). The results obtained by excluding weakly-expressed genes (either < 100 or < 200 signal intensity) were within the range of the values listed in Table 1. This was also true for data filtering by Affymetrix present calls and transformation of the data according to the variance stabilization method of Huber et al. [21,22] (this is presumably due to the fact that the data are already corrected by MAS 5.0 in order to avoid negative intensities). Masking of outliers as performed by the MAS 5.0 software also did not alter the results. This suggests that measurement errors associated with weakly-expressed genes, outliers, or saturation of certain probe sets play a minor role for the estimation of the relative mRNA proportions using trimmed robust regression. The effect of the experimental chip quality was assessed by correlating the MAD of the computed mRNA fractions (Table 2) with several relevant chip operating figures provided by the Affymetrix 'Expression Report' (Supplementary Section D, Supplementary Figures D.1 and D.2, see 1). Generally, the MAD was positively correlated with the noise and background level of the chips. However, due to the small number of patients the results strongly depended on individual patients (inhomogeneity correlation), in particular on patient 1 (original tissue chip). Mixing experimentsIn mixing experiments, mRNA samples from different sources are mixed together in defined proportions and subsequently hybridized to chips. These experiments were used to validate the present computational approach. Two mixing experiments were performed by the authors with material from patient 2. The other experimental data were derived from the mixing part of the GeneLogic dilution study [12]. The target values for the mRNA proportions of the first two mixing experiments were set according to the range of values computed for patient 2 (Table 1). Figure 3 shows the results for the second mixing experiment using trimmed mean normalized MAS-S probe set summaries. The proportions are virtually fixed from the beginning and the standard deviations drop at only a few thousand included genes. This picture was typical for all mixing experiments, probe set summaries, and chip normalization methods of the present study. The number of included genes, at which the computed mRNA proportions were determined, was between 2000 and 4000 in all cases.
Table 3 summarizes the results in terms of the pooled MAD of the relative mRNA proportions, calculated with regard to the respective target values. All methods gave better results for the second mixing experiment. This difference cannot be explained by the experimental quality, because both chips had very similar operating figures. MAS-C tended to perform somewhat better than MAS-S, and both MAS-S and MAS-C gave better results than RMA, MBEI, and MBEI-GP. In contrast to MAS-S and MAS-C, the results for RMA, MBEI, and MBEI-GP were improved (with one exception) by additional stepwise local regression normalization (lr). Table 3. MAD for mixing experiments 1 and 2. Pooled MAD of the computed relative mRNA proportions in mixing experiments 1 and 2 for different chip evaluation methods. The MAD is calculated with respect to the target values pM = 0.75, pF = 0.11, pN = 0.14 (mix 1) and pM = 0.86, pF = 0.11, pN = 0.03 (mix 2). 'MAS-S mean' and 'MAS-C mean' denote the means with respect to the four preceeding normalization methods. Notation according to Table 1. The GeneLogic dilution study contains 25 chips, in which mRNA samples from liver and SNB19 cells were mixed for SNB-proportions of 0. 0.25, 0.50. 0.75. and 1. and subsequently hybridized to HG-U95Av2 GeneChips. Replicas of these 5 mixes were processed by 5 different scanners resulting in a total of 25 experiments. Figure 4 shows mean, standard deviation, and MAD of the computed mRNA proportions for SNB-proportions of 0.25, 0.5, and 0.75. The MADs were calculated with respect to these nominal target values. The standard deviations drop very early and the proportions were generally determined at 2000 included genes, with very few exceptions. The MAD of mixes 25 and 50 is quite high (roughly 4–10%) for all chip evaluation methods possibly indicating some imprecision in sample weighing. The results summarized in Table 4 show that, as in the mixing experiments of the authors, MAS-C tended to perform somewhat better than MAS-S (except for some cases in mix 75). The results for RMA and MBEI are similar to those for the MAS summaries.
Table 4. MAD for the GeneLogic Dilution Study. Pooled MAD of the computed relative mRNA proportions in mixing experiments 25, 50, and 75 for different chip evaluation methods. The respective MAD is calculated with regard to the target values pS = 0.25, 0.50, and 0.75. Notation according to Table 3. Immunohistochemistry and marker genesImmunohistochemical staining was used to assess the cellular composition in the synovial tissue samples. The obtained cell type proportions of macrophages ( Table 5. Cell proportions. Relative cell proportions of adherent macrophages M, adherent fibroblasts F, and non-adherent cells N for peitients 1–6 obtained from immunohistochemistry by analyzing tissue samples in close vicinity to the sample from which the mRNA was isolated. The fraction of non-adherent cells consists of T-cells (T), B-cells (B), plasma cells (P), and endothelial cells (E). If standard deviations are given, three samples were analyzed. The proportions are given in percent. The model effectively contains two independent parameters that were fitted by matching the mRNA proportions of the model
Cell type-specific marker genes allow to determine the mRNA proportions from the ratio of the expression in the synovial tissue (si) and the respective isolated cell fraction (e.g., pM = si/mi for macrophages) provided the marker gene (i) is very specific (no expression in fibroblasts and non-adherent cells) and robust (same expression in tissue and isolated cell fractions). In the present study, the relative mRNA proportions calculated using Robust Computational Reconstitution and marker genes showed a markedly higher than random consistency only for fibroblasts (Supplementary Table F.1, see 1). Obviously, the two premises required by the marker gene approach, specificity and robustness, are not fulfilled simultaneously, at least not for macrophages and non-adherent cells (the former being known to be quite responsive to stimulation and the latter being subject to a confounding heterogeneity of their cell type composition). Regulated and robustly-expressed genesRegulated and robustly-expressed genes were identified using six different test statistics and log-transformed expression values of the measured (S) and reconstituted (S*) tissues of patients 1 to 6. The first four methods are well established techniques in the post-genomic era: Significance Analysis of Microarrays (SAM) [24], one-sample paired t-test (t-test1), homoscedastic two-sample t-test (t-test2) (being equivalent to one-way ANOVA for two groups), and VERAandSAM (V&S) [25]. In addition, two new criteria, μ-test and MAD-test, were introduced in the present study owing to the fact that SAM, t-test1, and t-test2 tended to select robustly-expressed genes with low expression strength (Supplementary Figure C.2, see 1). This bias is a result of the fact that the standard error appears in the denominator of these test statistics and that weakly-expressed genes generally show a higher standard error than strongly-expressed genes (log-transformed data). As a consequence, the newly-defined μ- and MAD-test statistics, tμ = μ1 - μ2 = mean(s - s*) and tMAD = mean(|s - s*|), respectively. (s = log S, s* = log S*), do not contain the standard error. The μ-test, and also V&S, appeared to be almost unbiased with respect to expression strength. The MAD-test (most strongly related to the regression objective function (3) in the Methods section, subsection Mathematical model) showed a clear preference for highly-expressed genes when applied to identify robustly-expressed genes (Supplementary Figure G.2, see 1). This may reflect the fact that in general small differences in gene expression are observed for highly-expressed genes (well known from so called M-A-plots for log-transformed expression data). Also, highly-expressed genes serving fundamental roles in the cell are commonly used as housekeeping genes [23]. The p-values of t-test1 and t-test2 were calculated using both the Student distribution and the permutation method described in Storey and Tibshirani [26]. For the paired t-test1 nP = 1000 permutations were used. For the unpaired t-test2 all nP = (6 out of 12)/2 = 462 possibilities for obtaining different absolute values of the test statistic were enumerated. The Pearson correlation coefficient between the respective p-values was c1 = 0.999979 for t-test1 and c2 = 0.999986 for t-test2 suggesting that the gene expression data of the present study are roughly log-normally distributed. Consequently, the Student distribution derived p-values were used for t-test1 and t-test2. The p-values of SAM and MAD-test were calculated using the permutation method and np = 1000 permutations, those of μ-test by enumerating all np = 462 possibilities. The λ-values of V&S are based on an expression level-dependent error model [25]. The pairwise correlations between the six statistical methods are shown in Supplementary Figure G.1 (see 1). The best correspondence was observed among SAM, t-test1, t-test2, and μ-test (Pearson correlation between 0.844 and 0.977). The MAD-test differed most from all other methods (Pearson correlation between 0.240 and 0.677). Also, this test statistic was the only one showing a two-modal p-value distribution in contrast to the uni-modal distributions assumed by the method of Storey and Tibshirani [26] for calculating False Discovery Rate (FDR)-related q-values. The other methods allowed to estimate the proportion of null-hypothesis genes quite consistently (between 0.5 and 0.6). Regulated and robustly-expressed genes were identified by 1.) applying each of the six individual test statistics independently and 2.) selecting top-ranking genes with respect to pathophysiological relevance (Supplementary Section G, Supplementary Tables G.2 and G.3, see 1). A total of 62 regulated and 48 robustly-expressed genes were selected. Each method contributed some selected genes and the approach of combining different complementary test statistics for gene selection, as proposed in the present study, may prove useful in future investigations. DiscussionDetermining and analyzing the gene expression profiles of different tissue cell types under diverse physiological and pathophysiological conditions is a central aim of biomedical research. However, microdissection of single cells or pure cell types, as well as mRNA isolation and amplification [1,2] still bears basic technical problems [3,4]. Therefore, gene expression profiles of whole tissue samples and purified cell types were compared in the present study. Using Robust Computational Reconstitution, the required mRNA proportions and the set of robustly-expressed genes that allow for their determination were simultaneously identified. The present results suggest that MAS-S provided the most stable probe set summaries. This is in contrast to the study of Irizarry et al. [16], but is in line with the results of Choe et al. [27]. This discrepancy may be due to the data sets used. Whereas in the study of Irizarry et al. [16] only few spiked-in genes were present, Choe et al. [27] spiked-in more than 1300 genes. Real world applications like the present study possibly contain more biological and technical noise and the subtraction of the mismatch probes from the perfect-match probes (MAS 5.0) may be appropriate in this case. Recent detailed sequence level studies of Binder and Preibisch [28,29] resulted in similar conclusions. However, when Shedden et al. [30] applied their FDR-based pairwise comparison method to ovary and colon tumor data, MAS 5.0 performed inferior to most other methods. Chip-wise (non-linear) averaging of the probes before comparison to other chips (MAS-S) may also be less susceptible to the effects of outliers than the initial comparison of probes between different chips and subsequent averaging of the differences (MAS-C). There were only few differences in the present study among the different chip normalization methods applied to the MAS-S and MAS-C summaries. However, the application of quantile normalization and also centralization resulted in more outliers compared to the remaining normalization methods. Quantile normalization also tended to perform suboptimally without additional local regression normalization in the study of Choe et al. [27]. The accuracy of Robust Computational Reconstitution depends on the fraction of robustly-expressed genes, the quality of experimentation, and the appropriateness of the chip evaluation method. The variance is estimated to be in the range of 5–10% in absolute value based on the variability of the computed mRNA proportions among methods (Table 2) and the expected bias introduced by the imperfect robustness of even the most robustly-expressed genes (Supplementary Figures A.1 and A.2, see 1). A sensitivity analysis (resampling of relative proportions) showed that the rank order of differentially-expressed genes is only moderately affected by this variance. However, the maximum shift in rank order was always large indicating that some differentially-expressed genes may be lost even from a moderately extended list of top-ranking genes. This maximum rank shift was close to that of random permutations but the third quartile of the rank shift distribution was already much smaller than that observed for random permutations (rank shift of 60 compared to 9300 for the 200 top-ranking genes and a resampling standard deviation of 5%). The results for patient 1 largely differed from all other patients. On the one hand this may be due to the fact that HG-U95A chips were used instead of HG-U95Av2 chips. On the other hand patient 1 is the only type I RA patient (patients 2 and 3 are type II RA) [31,32]. This was confirmed by immunohistochemistry and comparison with synovial tissue gene expression profiles of 12 other classified RA patients (using the distance to the respective expression means and Affymetrix' best-matches probe set list for matching HG-U133A and HG-U95Av2 chips). Hence, the differences between RA patient 1 and RA patients 2 and 3 are substantial. Remarkably, the computed mRNA proportions considerably differ from the cell proportions determined by immunohistochemistry (Figure 5). Most striking is the prevalence of macrophage mRNA over fibroblast mRNA in contrast to the dominance of fibroblasts over macrophages in the cell composition, implying that macrophages produce 8 to 16 times as much mRNA as do fibroblasts. The results for the mRNA proportions may be biased because the isolated fractions of macrophages and non-adherent cells were not perfectly pure with respect to fibroblasts (resulting in apparently higher mRNA proportions for macrophages). Also, the mRNA yield of different cell types may be severalfold different due to experimental methodology, e.g. mRNA extraction and amplification [3,23]. Furthermore, the cell proportions as determined by immunohistochemistry may be somewhat imprecise due to the limited specificity of this semi-quantitative method. Nevertheless, the augmented mRNA production of macrophages compared to fibroblasts is expected to be still valid under idealized experimental conditions. The central aim addressed by the present study was the cell type-specific gene expression in synovial tissue and its dependence on cell type composition (purified cell fractions representing an extreme case of composition). Cell type-specific gene expression was also addressed by Venet et al. [7], who assumed the individual expression profiles to be largely uncorrelated. According to our own-experience (the correlation coefficients among the isolated cell fractions showed values as high as 0.7 and 0.9), this is an arguable assumption, as also admitted by Venet et al. [7] themselves. Lähdesmaeki et al. [9] used different constraints in their least squares approach. Similar to Venet et al. [7] and Lu et al. [10] and in contrast to the present study, they included all genes in the sum of squares and showed this to be appropriate in the case of mixing experiments (as also demonstrated in the present study). However, their method, as well as the methods of Venet et al. and Lu et al. (which is otherwise equivalent to the present approach), will give biased results in the presence of regulated (i.e. composition-dependent) gene expression. Using an intermediate number of robustly-expressed genes (present study) avoids, on the one hand, the exclusive dependence on the robust expression of individual highly cell type-specific marker genes and, on the other hand, the bias towards an equal distribution when including all genes in matrix factorization or regression. This statement remains valid despite the inability of all current in silico microdissection methods (including our own; Methods section, subsection Mathematical model and Supplementary Section H, see 1) to directly assign composition-dependent changes in gene expression to specific cell types. ConclusionThe proposed method of Robust Computational Reconstitution is applicable if there is a sufficient number of robustly-expressed genes whose expression is highly correlated between tissue and isolated cell fractions. The existence of such housekeeping genes is plausible and well-known [23] since many genes code for basic cell functions that must be maintained across different environmental conditions. The present approach improves previously published methods for the determination of the relative mRNA proportions of different cell types by the exclusion of regulated genes that bias the result towards an equal distribution. In addition, it can identify robustly-expressed genes (representing basic metabolism or persistent pathological changes) and responsive genes that change their expression between tissue and isolated cell fractions (possibly reflecting physiological or pathological cell communication processes). Both are of biomedical interest and can be further screened for pathophysiological relevance. Evidently, microdissection of single cells or pure cell types [1,2] is a promising experimental technique. However, it is still under steady development and, for the time being, the preferential use of macrodissection in combination with in silico microdissection techniques has been recently recommended [3]. The method of the present study can thus be readily used as an effective research tool and a supporting reference method for further studies. Note added in proofDuring the review process of the present manuscript Wang et al. [33] also published a related paper dealing with the cellular composition of complex tissues and the relative contribution of the individual cell types to tissue gene expression. MethodsMathematical modelThe proposed method for determining the cell type-specific mRNA composition of synovial tissue samples (applicable to any other tissue) is based on a comparison between the measured expression profile S of the whole synovial tissue and the computationally reconstituted tissue profile S* = pM M + pF F + pN N, (2) which is composed of the measured expression profiles of the isolated cell fractions, i.e. isolated adherent macrophages M, adherent fibroblasts F, and non-adherent cells N, according to their respective computational mRNA proportions pM, pF, and pN (Figure 1). The proposed computational approach aims at the simultaneous identification of the cell type-specific mRNA proportions and the particular set of robustly-expressed genes, which allows to reliably determine these proportions. This is achieved by minimizing the trimmed sum of absolute differences between the gene expression in the measured and the reconstituted tissue with respect to pM and pF, treating pN = 1 - pM - pF as a dependent variable (equivalently, an optimization routine able of handling constraints could be used with all three variables). In Equation (3) si = log Si and Table 6. Algorithmic Protocol The results obtained from a statistical model for the expectation value The differences si - ExperimentsSynovial tissue samples were obtained from 3 RA [35] and 3 OA patients [36] and prepared as previously described [31]. During the primary cell culture, non-adherent cells were removed by medium exchange on day 1. After 7 days, synovial fibroblasts were purified by removal of macrophages using anti-CD 14 magneto-beads. Total RNA was isolated and labeled according to the supplier's (Qiagen) instructions. Gene expression was analyzed using Affymetrix HG-U95A (patient 1) and HG-U95Av2 (patient 2–6) chips. The synovial tissue chip of patient 1 showed about twice as much noise as the other chips. Therefore, the hybridization was repeated twice (using HG-U95Av2 chips). Finally, the first repeat was selected for analysis. Two mixing experiments were performed, in which mRNA preparations of the isolated cell fractions of patient 2 were mixed according to two representative sets of relative mRNA proportions and subsequently hybridized to HG-U95Av2 chips. Immunohistochemical analysis was performed on cryostat sections of synovial membranes using marker antibodies for specific cell types. For each patient, the cellular composition was assessed for one to three different synovial tissue sections from samples adjacent to those, from which the mRNA was prepared (Table 5). Further details of the experimental materials and methods are given in the Supplementary Sections B and E (see 1). Data preparationMicroarrays were evaluated using four different probe set summaries: 1) single chip (MAS-S); and 2) chip comparison (MAS-C) algorithm of Affymetrix' Microarray Suite 5.0 [14,15]; 3) Robust Multi-Array Analysis (RMA) developed by Irizarry et al. [16]; and 4) Model Based Expression Index (MBEI) of Li and Wong [17] as implemented in R and the GenePublisher web-interface [37]. In addition, the MAS-S and MAS-C summaries were normalized using four different globalization methods: i) 2%-symmetric trimmed mean (t) [14,15]; ii) cyclic local regression (clr) [18]; iii) quantile normalization (q) [18,19]; and iv) centralization (c) [20]. The computationally reconstituted tissue profile S* was either not further normalized or normalized by local regression. This method is based on so called M-A-plots and normalizes two expression profiles, in this case S and S*, at the same time. The normalization has to be done in each algorithmic step, i.e. for every pair of relative proportions that is explored by the optimization algorithm. It was chosen to be a 200-genes symmetric moving average instead of the widely used loess procedure [18,38] in order to achieve reasonable computation times. The symmetric moving average was also used for the cyclic local regression normalization (clr). Centralization (c) requires the selection of a set of robustly-expressed genes that is used for normalization. This set was chosen to consist of those 1000 genes, for which at least one pairwise log-ratio was among the best conserved (i.e. showed the lowest mean absolute deviation) across the four chips per patient. The additive nature of Equation (2) requires absolute expression values. However, the chip comparison algorithm (MAS-C) computes log-ratios that compare a baseline experiment to different other experiments in a pairwise manner. The absolute expression values were reconstructed from the baseline experiment (S) and the respective log-ratios. The number of random starting values for the relative mRNA proportions (cf. algorithmic protocol. Table 6) was l = 25 throughout this study. The missing N-profile of patient 3 was substituted from patient 2 (both Type II RA) [32]. When HG-U95A and HG-U95Av2 chips were investigated simultaneously, the analysis was restricted to the 12533 probe sets common to both chips (excluding controls). For the RMA and MBEI summaries of patient 1 (tissue repeats) the mixture CDF environment provided by Bolstad [39] was used. Authors' contributionsMH developed mathematical methods, performed computations, and wrote the manuscript; DP conducted the experimental work and contributed to the manuscript; DK hybridized chips and contributed to the discussion, HJT and SW contributed to the design of the study and the discussion; RWK contributed to experimental work, discussion, and writing of the manuscript. All authors read and approved the final manuscript. AcknowledgementsThe authors cordially thank Reinhard Guthke for valuable discussion. Ernst Günter Schukat-Talamazzini is acknowledged for his proof-reading of the mathematical methods. Torsten Kroll contributed to the early stage of the project and Andreas Beyer contributed by critical discussion. Rafael A. Irizarry and James W. MacDonald are acknowledged for their comments on the Genelogic dilution study. The two anonymous referees contributed valuable comments and suggestions that considerably helped to improve the manuscript. This work was supported by the German Federal Ministry of Education and Research (BMBF) through the Jena Centre for Bioinformatics (JCB) (FKZ 0312704D to R. Guthke, which supported MH, and 0312704B to RWK, which supported DP), the Interdisciplinary Center for Clinical Research (IZKF) (FKZ 01ZZ0105 and 01ZZ0405 to RWK), and the National Genome Research Network 2 (NGFN-2) (FKZ 01GS0413 to RWK). References
Have something to say? Post a comment on this article! |




on Google Scholar







author email
corresponding author email
Figure 1.


Figure 2.

Figure 3.
Figure 4.




Figure 5.



