Department of Pharmacology and Experimental Therapeutics, Thomas Jefferson University, Philadelphia, PA 19107, USA

Respiratory Research, Teva Branded Pharmaceutical Products R&D, Inc., Horsham, PA 19044, USA

Department of Experimental Medicine, Procter and Gamble, Cincinnati, OH 45241, USA

Abstract

Background

Normalization in real-time qRT-PCR is necessary to compensate for experimental variation. A popular normalization strategy employs reference gene(s), which may introduce additional variability into normalized expression levels due to innate variation (between tissues, individuals, etc). To minimize this innate variability, multiple reference genes are used. Current methods of selecting reference genes make an assumption of independence in their innate variation. This assumption is not always justified, which may lead to selecting a suboptimal set of reference genes.

Results

We propose a robust approach for selecting optimal subset(s) of reference genes with the smallest variance of the corresponding normalizing factors. The normalizing factor variance estimates are based on the estimated unstructured covariance matrix of all available candidate reference genes, adjusting for all possible correlations. Robustness is achieved through bootstrapping all candidate reference gene data and obtaining the bootstrap upper confidence limits for the variances of the log-transformed normalizing factors. The selection of the reference gene subset is optimized with respect to one of the following criteria: (A) to minimize the variability of the normalizing factor; (B) to minimize the number of reference genes with acceptable upper limit on variability of the normalizing factor, (C) to minimize the average rank of the variance of the normalizing factor. The proposed approach evaluates all gene subsets of various sizes rather than ranking individual reference genes by their stability, as in the previous work. In two publicly available data sets and one new data set, our approach identified subset(s) of reference genes with smaller empirical variance of the normalizing factor than in subsets identified using previously published methods. A small simulation study indicated an advantage of the proposed approach in terms of sensitivity to identify the true optimal reference subset in the presence of even modest, especially negative correlation among the candidate reference genes.

Conclusions

The proposed approach performs comprehensive and robust evaluation of the variability of normalizing factors based on all possible subsets of candidate reference genes. The results of this evaluation provide flexibility to choose from important criteria for selecting the optimal subset(s) of reference genes, unless one subset meets all the criteria. This approach identifies gene subset(s) with smaller variability of normalizing factors than current standard approaches, particularly if there is some nontrivial innate correlation among the candidate genes.

Background

Normalization is important in real-time qRT-PCR analysis because of the need to compensate for intra- and inter-kinetic RT-PCR variations

The variability of a reference gene has two major sources, experimental variability associated with the technology and the innate or natural variability of the reference gene (between tissues, individuals, etc). The original approach to normalization was to find a single reference gene with the most stable (in the sense of the smallest variability) expression across tissues and individuals. Starting with the work of Vandesompele et al

It is well documented that optimal reference genes vary according to tissues and treatments

Vandesompele et al

A more comprehensive approach to selection of the optimal subset of reference genes is to fit a common model that would allow simultaneous quantification and comparison of variability in all candidate genes. This is the approach taken, for example, in

The crucial assumption underlying all these methods is independence in innate variation of the candidate reference genes. The corresponding statistical models assume that correlation between expressions of different genes in the same sample comes exclusively from the experimental variation in the sample. In contrast, we have observed that even after subtracting the random (or fixed) effects of sample, residuals may exhibit non-trivial correlation between some candidate reference genes (see Results). Therefore, estimates of the standard deviation of the log geometric mean may change substantially when correlation is properly estimated and incorporated. This, in turn, can change the ranking of a subset of candidate reference genes with respect to optimality for inclusion into normalization factors.

We developed a robust approach for directly selecting optimal subset(s) of reference genes rather than addressing stability of individual candidate genes. Our approach is based on estimating the unstructured covariance matrix of all available candidate reference genes and using this covariance matrix to estimate the variances of the log normalizing factors (geometric means of the expression of multiple genes) corresponding to all possible subsets of reference genes. Robustness is achieved through bootstrapping candidate reference gene samples and obtaining the bootstrap upper confidence limits for the variances of the log transformed normalizing factors and average ranks of reference gene subsets with respect to the variance of their geometric mean in all bootstrap samples. A bootstrap procedure was proposed earlier

Two publicly available data sets and one new data set from the validation study of five candidate reference genes for normalization of guanylyl cyclase C (GUCY2C) mRNA expression in blood are used to illustrate the proposed method and compare to earlier published results. In addition, a small simulation study was conducted to evaluate the performance of the proposed approach under known correlation structures assuming varying degrees of innate correlation among candidate reference genes.

Methods

Model for the log-transformed expression levels of candidate reference genes

To incorporate all correlations among candidate reference genes, we simultaneously model their log-transformed expression levels or threshold cycle (Ct) numbers in a multivariate linear mixed effects model with unstructured covariance matrix. The normality assumption is usually appropriate for log-transformed expression levels or Ct numbers in homogeneous populations of samples.

Let $Y_{jik}$ be the log-transformed expression level of candidate reference gene $j$ in replicate $k$ of sample $i$, and let $\mathbf{Y}_{ik} = [Y_{1ik}, \ldots, Y_{Jik}]^{T}$ be the vector of log-transformed expression levels for all $J$ candidate reference genes in replicate $k$ of sample $i$. Then $\mathbf{Y}_{ik}$ may be modeled as

$$\mathbf{Y}_{ik} = \mathbf{g} + \mathbf{s}_{i} + \mathbf{r}_{i} + \mathbf{e}_{ik}, \qquad (1)$$

where vector $\mathbf{g} = [g_{1}, \ldots, g_{J}]^{T}$ and $g_{j}$ is the average log-transformed expression level for candidate reference gene $j$; $\mathbf{s}_{i} = [s_{i}, \ldots, s_{i}]^{T}$ is the random effect of the $i$-th sample, which reflects the experimental variation and is the same for all genes, so that $\mathbf{s}_{i} = s_{i}[1, \ldots, 1]^{T}$; $\mathbf{r}_{i} = [r_{i1}, \ldots, r_{iJ}]^{T}$ is the vector of random gene effects in sample $i$; and $\mathbf{e}_{ik} = [e_{ik1}, \ldots, e_{ikJ}]^{T}$ is the vector of error terms in replicate $k$.

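It is assumed that the sample random effects $s_{i}$, the random gene effect vectors $\mathbf{r}_{i}$, and the error term vectors $\mathbf{e}_{ik}$ are all independent; that $s_{i}$ are identically normally distributed as $N(0, \sigma^{2})$; that vectors $\mathbf{r}_{i}$ are identically distributed as $\mathrm{MVN}_{J}(\mathbf{0}, \mathbf{R})$; and that $\mathbf{e}_{ik}$ are identically distributed as $\mathrm{MVN}_{J}(\mathbf{0}, \mathbf{D})$ with $\mathbf{D} = \mathrm{diag}(\delta_{1}^{2}, \ldots, \delta_{J}^{2})$. For each gene $j$, $Y_{jik} \sim N(g_{j},\ \sigma^{2} + R_{jj} + \delta_{j}^{2})$, where $R_{jj}$ is the $j$-th diagonal element of $\mathbf{R}$, and vectors $\mathbf{Y}_{ik}$ have a multivariate normal distribution

$$\mathbf{Y}_{ik} \sim \mathrm{MVN}_{J}(\mathbf{g}, \mathbf{V}), \qquad (2)$$

where $\mathbf{V} = \sigma^{2}\mathbf{1}_{J \times J} + \mathbf{R} + \mathbf{D}$ and $\mathbf{1}_{J \times J}$ is the $J \times J$ matrix of ones.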

Our model (1) generalizes models 4 and 5 in previous work by allowing an unstructured $\mathbf{R}$ rather than imposing a simple uncorrelated structure with $\mathbf{R} = \mathrm{diag}(\tau_{1}^{2}, \ldots, \tau_{J}^{2})$. Sunberg et al

For multiple tissues and possible covariates affecting $\mathbf{Y}_{ik}$, the mean vector $\mathbf{g}$ would have to be replaced by some linear mean model. Since the proposed methodology utilizes only the covariance parameter estimates, it is straightforward to extend the developments to the case with a linear mean model instead of the mean vector $\mathbf{g}$. The standard way to write a general linear mean model is $\mathbf{A}\boldsymbol{\beta}$, where $\mathbf{A}$ is some design matrix and $\boldsymbol{\beta}$ is the vector of unknown parameters. For model (1), $\mathbf{A}$ is just the identity matrix and $\boldsymbol{\beta} = \mathbf{g}$. In the general case, the model is written as

$$\mathbf{Y}_{ik} = \mathbf{A}\boldsymbol{\beta} + \mathbf{s}_{i} + \mathbf{r}_{i} + \mathbf{e}_{ik}.$$

Notably, such an extension has no effect on the assumed covariance structure of the data. For example, with just multiple tissues, $t = 1, \ldots, T$, one can use model (1) with the mean vector $\mathbf{g}$ replaced by $\mathbf{g} + \mathbf{g}_{t}$, where vector $\mathbf{g} = [g_{1}, \ldots, g_{J}]^{T}$ now represents the across-tissue average log-transformed expression levels for all candidate reference genes and $\mathbf{g}_{t}$ represents the mean differences in expression attributed to tissue $t$.

In most analyses of qRT-PCR data, the Ct numbers for the replicates of the same reaction are averaged, and the majority of methods for selecting optimal subsets of reference genes also operate with averaged replicates, which is appropriate if averaged replicates are to be used for normalizing the target gene. For this reason, and to simplify notation, in further development we do not use multiple replicates of the same reaction. With averaged replicates, vectors $\mathbf{Y}_{ik}$ and $\mathbf{e}_{ik}$ in model (1) no longer depend on index $k$, and model (1) is simplified to

$$\mathbf{Y}_{i} = \mathbf{g} + \mathbf{s}_{i} + \mathbf{r}_{i}, \qquad (4)$$

where vectors $\mathbf{r}_{i}$ effectively incorporate both the random gene effects and the errors of gene expression measures. The multivariate formulation (2) still applies to model (4) with $\mathbf{V} = \sigma^{2}\mathbf{1}_{J \times J} + \mathbf{R}$. If we consider a specific case of model (4) with $s_{i}$ being fixed rather than random effects (so that $\mathbf{V} = \mathbf{R}$) and $\mathbf{R} = \mathrm{diag}(\tau_{1}^{2}, \ldots, \tau_{J}^{2})$, then we obtain model 1a in

In general, the variance components σ^{2}1_{J×J}, **R**, and **D** in models (1) and (4) are not identifiable unless one imposes additional constraints on the structure of **R** and **D**. In previous work, **R** was constrained to be diagonal, which corresponds to independent random effects of the reference genes. Our approach is to estimate **V** as an unstructured covariance matrix without separating the variance components, and then use **V** to compute the variance of the log geometric mean of any possible subset of reference genes. An unstructured J×J matrix **V** has J(J + 1)/2 unknown parameters, for a total of J(J + 1)/2 + J = J(J + 3)/2 unknown parameters in model (2). Since N samples provide NJ observations, one needs a sample size of N > (J + 3)/2 to estimate model (2). With a moderate number of samples available, the estimates of **V** may not be reliable. To overcome this, we propose to utilize bootstrap re-sampling and compute the upper confidence bounds for the variances of the geometric means. Such upper confidence bounds would properly reflect uncertainty in the estimation of the variances.
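The parameter count and the resulting sample-size requirement can be checked with a few lines of arithmetic. This is only an illustrative sketch; the function names `n_parameters` and `min_samples` are hypothetical and are not part of the authors' SAS macros.

```python
def n_parameters(J: int) -> int:
    """J mean parameters plus J(J+1)/2 parameters of an unstructured J x J covariance."""
    return J + J * (J + 1) // 2


def min_samples(J: int) -> int:
    """Smallest integer N satisfying N > (J + 3) / 2."""
    return (J + 3) // 2 + 1


for J in (5, 6, 10):
    # J = 10 gives 65 parameters and requires at least 7 samples
    print(J, n_parameters(J), min_samples(J))
```

This bound is only the minimum needed for the model to be estimable; as the text notes, estimates of **V** from a sample that small would not be reliable.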

Variability of geometric means of multiple genes

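Further, we focus on single or averaged multiple replicates of a gene in the sample and assume model (4) with $\mathbf{V} = \sigma^{2}\mathbf{1}_{J \times J} + \mathbf{R}$. The log geometric mean expression of a subset of $L$ genes $j_{1}, j_{2}, \ldots, j_{L}$ in sample $i$ is

$$F_{i}(j_{1}, \ldots, j_{L}) = \frac{1}{L}\sum_{l=1}^{L} Y_{j_{l} i}. \qquad (5)$$

In a vector form, (5) may be written as

$$F_{i}(j_{1}, \ldots, j_{L}) = \frac{1}{L}\,\mathbf{C}_{j_{1},\ldots,j_{L}}^{T}\mathbf{Y}_{i},$$

where the $J$-vector $\mathbf{C}_{j_{1},\ldots,j_{L}}$ has elements equal to 1 in positions $j = j_{1}, j_{2}, \ldots, j_{L}$ and elements equal to 0 otherwise. Since $\mathbf{Y}_{i} \sim \mathrm{MVN}_{J}(\mathbf{g}, \mathbf{V})$, the variance of $F_{i}(j_{1}, \ldots, j_{L})$ is

$$\mathrm{Var}\left[F_{i}(j_{1}, \ldots, j_{L})\right] = \frac{1}{L^{2}}\,\mathbf{C}_{j_{1},\ldots,j_{L}}^{T}\,\mathbf{V}\,\mathbf{C}_{j_{1},\ldots,j_{L}}. \qquad (6)$$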

Thus, the total variance of the log geometric mean of any subset $j_{1}, j_{2}, \ldots, j_{L}$ of reference genes may be estimated using (6) with the corresponding vector $\mathbf{C}_{j_{1},\ldots,j_{L}}$ and matrix $\mathbf{V}$, which is estimated by fitting model $\mathbf{Y}_{i} \sim \mathrm{MVN}_{J}(\mathbf{g}, \mathbf{V})$. Representation (6) allows computing the variance of all possible $F_{i}(j_{1}, \ldots, j_{L})$ through $J$ nested loops that exhaust all possibilities for vectors $\mathbf{C}_{j_{1},\ldots,j_{L}}$.
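As an illustration of expression (6), the quadratic form reduces to summing the entries of **V** over the selected genes and dividing by L². The sketch below is in Python rather than the SAS used by the authors, replaces the nested loops with `itertools.combinations`, and uses a made-up 3-gene covariance matrix; the function names are hypothetical.

```python
from itertools import combinations


def subset_variance(V, subset):
    """Variance of the log geometric mean for gene indices in `subset`: C'VC / L^2."""
    L = len(subset)
    return sum(V[a][b] for a in subset for b in subset) / (L * L)


def all_subset_variances(V):
    """Map every non-empty subset (tuple of gene indices) to its variance."""
    J = len(V)
    return {
        subset: subset_variance(V, subset)
        for L in range(1, J + 1)
        for subset in combinations(range(J), L)
    }


# Hypothetical covariance matrix: genes 0 and 1 are negatively correlated.
V = [
    [0.25, -0.10, 0.05],
    [-0.10, 0.30, 0.05],
    [0.05, 0.05, 0.40],
]
variances = all_subset_variances(V)
best = min(variances, key=variances.get)
print(len(variances), best, variances[best])
```

In this toy example the pair of negatively correlated genes has a smaller variance than either gene alone, which is the kind of effect that a diagonal covariance assumption cannot capture.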

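When $\mathbf{V} = \sigma^{2}\mathbf{1}_{J \times J} + \mathbf{R}$, then (6) implies

$$\mathrm{Var}\left[F_{i}(j_{1}, \ldots, j_{L})\right] = \sigma^{2} + \frac{1}{L^{2}}\,\mathbf{C}_{j_{1},\ldots,j_{L}}^{T}\,\mathbf{R}\,\mathbf{C}_{j_{1},\ldots,j_{L}}. \qquad (7)$$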

Hence, the log geometric mean of any subset of reference genes includes the same variance component σ^{2 }corresponding to the experimental error present in all gene expressions for the same sample. Therefore, minimizing the total variability of the log geometric mean is equivalent to minimizing the variability described by **R**.
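The reason the sample variance component enters every subset identically can be spelled out in one step. The following short derivation uses only the definitions already given and writes $\mathbf{C}$ for the indicator vector of a subset of $L$ genes:

```latex
\begin{align*}
\mathbf{C}^{T}\mathbf{V}\,\mathbf{C}
  &= \sigma^{2}\,\mathbf{C}^{T}\mathbf{1}_{J\times J}\,\mathbf{C}
     + \mathbf{C}^{T}\mathbf{R}\,\mathbf{C} \\
  &= \sigma^{2}\Bigl(\sum_{j=1}^{J} C_{j}\Bigr)^{2}
     + \mathbf{C}^{T}\mathbf{R}\,\mathbf{C}
   = \sigma^{2}L^{2} + \mathbf{C}^{T}\mathbf{R}\,\mathbf{C},
\end{align*}
```

because $\mathbf{C}$ contains exactly $L$ ones. Dividing by $L^{2}$ leaves the constant $\sigma^{2}$ plus a term that depends on the subset only through $\mathbf{R}$.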

Selection of the optimal subset of reference genes

Using model (4) and expression (6), we propose a robust approach for selecting optimal subset(s) of reference genes with the smallest variance of the corresponding normalizing factors. Robustness is achieved through bootstrapping candidate reference genes data to obtain the bootstrap upper confidence limits for the variances of the (log) normalizing factors (geometric means) for all possible gene subsets as well as the distribution of ranks of these variances. The bootstrapping also alleviates the uncertainty in estimation of potentially large number of parameters in unstructured covariance matrix **V**.

Specifically, for each bootstrap sample, the following analyses are performed:

(i) The unstructured covariance matrix **V** of all available candidate reference genes is estimated from model (2). In this work, the estimates of **V** were computed in SAS PROC MIXED (SAS 9.2, SAS Institute, Cary, NC), but any other software capable of fitting linear mixed effects or MANOVA models may be used as well.

(ii) Vectors C_{j1,...,jL }for all possible subsets of reference genes are generated and expression (6) is used to compute the variance of the log geometric mean for each possible subset of reference genes. There is a finite, although rather large number, 2^{J}-1, of possible subsets of J reference genes, and the absolute minimum is always attained. In practical qRT-PCR validation studies, the number of candidate reference genes J would not be expected to be much larger than 10.

(iii) All possible subsets of reference genes are ranked from the smallest to the largest variance of the corresponding log geometric mean.

Based on results for all bootstrap samples, we compute the bootstrapped upper 95% confidence limit for the variance of the log geometric mean and the average rank of this variance for all possible subsets of the reference genes. Then the optimal subset of the reference genes may be selected using one of the following criteria:

**(A) **to minimize the upper 95% confidence limit on variability of the log geometric mean regardless of the number of reference genes required;

**(B) **to minimize the number of reference genes given that the upper 95% confidence limit on variability is under some acceptable level;

**(C) **to minimize the average rank of the variance of the log geometric mean.
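Steps (i)-(iii) and the three criteria can be sketched as follows. This is a minimal illustration, not the authors' SAS implementation: it assumes the plain sample covariance matrix as a stand-in for the mixed-model estimate of **V**, a simple empirical percentile for the upper limit, and synthetic data; all function names are hypothetical.

```python
import math
import random
from itertools import combinations


def sample_covariance(rows):
    """Unbiased sample covariance matrix of a list of equal-length vectors."""
    n, J = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(J)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(J)] for a in range(J)]


def subset_variance(V, subset):
    """Expression (6): C'VC / L^2 for the genes in `subset`."""
    L = len(subset)
    return sum(V[a][b] for a in subset for b in subset) / (L * L)


def bootstrap_select(data, n_boot=1000, level=0.95, seed=0):
    """Return {subset: (upper limit of the variance, mean rank)}."""
    rng = random.Random(seed)
    J = len(data[0])
    subsets = [s for L in range(1, J + 1) for s in combinations(range(J), L)]
    variances = {s: [] for s in subsets}
    rank_sums = {s: 0 for s in subsets}
    for _ in range(n_boot):
        boot = [rng.choice(data) for _ in data]            # resample with replacement
        V = sample_covariance(boot)                         # step (i), stand-in estimator
        var = {s: subset_variance(V, s) for s in subsets}   # step (ii)
        ranked = sorted(subsets, key=var.get)               # step (iii)
        for rank, s in enumerate(ranked, start=1):
            rank_sums[s] += rank
            variances[s].append(var[s])
    k = min(n_boot - 1, math.ceil(level * n_boot) - 1)      # index of empirical percentile
    return {s: (sorted(variances[s])[k], rank_sums[s] / n_boot) for s in subsets}


def criterion_a(result):
    """(A) smallest upper confidence limit, regardless of subset size."""
    return min(result, key=lambda s: result[s][0])


def criterion_b(result, max_ucl):
    """(B) fewest genes with upper limit no larger than max_ucl (None if no subset qualifies)."""
    ok = [s for s in result if result[s][0] <= max_ucl]
    return min(ok, key=lambda s: (len(s), result[s][0])) if ok else None


def criterion_c(result):
    """(C) smallest mean rank."""
    return min(result, key=lambda s: result[s][1])


# Synthetic data: 40 samples, 4 genes sharing a sample effect (illustration only).
gen = random.Random(42)
data = []
for _ in range(40):
    s = gen.gauss(0, 0.4)
    data.append([s + gen.gauss(0, sd) for sd in (0.3, 0.5, 0.8, 1.0)])

result = bootstrap_select(data, n_boot=200)
print(criterion_a(result), criterion_b(result, max_ucl=0.5), criterion_c(result))
```

Because every one of the 2^{J}-1 subsets is evaluated in every bootstrap sample, the cost grows exponentially in J, which is acceptable for the roughly ten or fewer candidate genes typical of validation studies.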

The last criterion is similar in spirit to the bootstrap ranking procedure proposed earlier. Because we rank all (2^{J}-1) possible subsets instead of just J reference genes, we use the mean rank (average over all bootstrap samples) as the measure of optimality in criterion (C). In the absence of a desired limit on variability in criterion (B), one ideally would want to find a reference gene subset that satisfies both criteria (A) and (C). To address criteria (A) and (C) simultaneously, we plot the upper 95% confidence limits vs. the average rank by the size of gene subset (Figures


**Breast tumor data: 95% UCL vs. the average overall rank of the normalizing factors**. Each point represents one of the possible 63 = 2^{6}-1 gene subsets. Different colors are used for the subsets with different numbers of genes included. The x-coordinate is the average overall rank of the corresponding normalizing factor variance. The y-coordinate is the upper 95% confidence limit (95% UCL) for the standard deviation of the log normalizing factor. The red dot, which is closest to the lower left corner, represents the optimal (in the sense of criteria A and C) combination of two genes, ACTB and SF3A1.


**Neuroblastoma data: 95% UCL vs. the average overall rank of the normalizing factors**. Each point represents one of the possible 1023 = 2^{10}-1 gene subsets. Different colors are used for the subsets with different numbers of genes included. The x-coordinate is the average overall rank of the corresponding normalizing factor variance. The y-coordinate is the upper 95% confidence limit (95% UCL) for the standard deviation of the log normalizing factor. Only sets with average rank less than 200 are shown on the plot.

**Blood data: 95% UCL vs. the average overall rank of the normalizing factors (Ct numbers)**. Each point represents one of the possible 31 = 2^{5}-1 gene subsets. Different colors are used for the subsets with different numbers of genes included. The x-coordinate is the average overall rank of the corresponding normalizing factor variance. The y-coordinate is the upper 95% confidence limit (95% UCL) for the standard deviation of the log normalizing factor.

A simple direct comparison of our method vs. previously proposed methods was performed by computing the log geometric mean and its variance for the optimal subsets (for each set size) selected by different procedures. The advantage is direct evaluation of the log geometric mean of interest while ignoring the rest of the genes, which mimics the prospective use of the selected reference genes (the other candidate reference genes would not be available).
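The direct comparison amounts to computing, for each sample, the mean of the log expression levels of the selected genes and then taking the empirical variance across samples. A minimal sketch with made-up numbers and hypothetical function names:

```python
from statistics import variance


def log_geometric_mean(sample, subset):
    """Mean of the log-transformed levels of the selected genes in one sample."""
    return sum(sample[j] for j in subset) / len(subset)


def empirical_variance(data, subset):
    """Sample variance of the log geometric mean across all samples."""
    return variance([log_geometric_mean(row, subset) for row in data])


# Hypothetical log expression levels: 3 samples (rows) by 3 genes (columns).
data = [
    [1.0, 2.0, 3.0],
    [2.0, 2.5, 1.0],
    [3.0, 1.5, 2.0],
]
print(empirical_variance(data, (0, 1)))
```

Only the genes in the chosen subset enter the calculation, which mirrors the prospective situation in which the other candidate genes are not measured.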

The macros implementing the proposed methodology were developed in SAS 9.2 (SAS Institute, Cary, NC). The corresponding SAS code is included as Additional file

**SAS program file with the code implementing the proposed algorithm.**


Results

Data Sets

The first dataset includes relative expression levels of 6 reference genes (ACTB, GAPDH, MRPL19, PSMC4, PUM1, and SF3A1) quantified in 80 breast tumor samples. These data are described in detail in

The third dataset comes from a validation study of five candidate reference genes for normalization of guanylyl cyclase C (GUCY2C) mRNA expression in blood. The RT-PCR assay to quantify GUCY2C mRNA in tissues and blood employing external calibration standards of RNA complementary to GUCY2C (cRNA) is described in

Here, the log-transformed expression levels were computed from the threshold cycle (Ct) numbers as in the MS Excel add-on software geNorm, which implements the method described in

Results for the breast tumor data

In the breast tumor data, we first investigated innate correlation among 6 reference genes using the residuals from model 1a in

$$Y_{ji} = g_{j} + s_{i} + r_{ij}, \qquad (8)$$

where $s_{i}$ are assumed to be fixed rather than random effects and each gene is assumed to have a different variance, $Y_{ji} - s_{i} \sim N(g_{j}, \tau_{j}^{2})$. The residuals were computed as

$$\hat{r}_{ij} = Y_{ji} - \hat{g}_{j} - \hat{s}_{i},$$

where $\hat{s}_{i}$ represents the experimental variability in $Y_{ji}$ common for all reference genes in sample $i$.
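A residual correlation check of this kind can be sketched as below. The sketch makes a simplifying assumption that differs from model 1a: gene means and sample effects are estimated by ordinary (equal-weight) least squares, i.e. by double centering, rather than with gene-specific variances. The data and function names are hypothetical.

```python
from math import sqrt


def additive_residuals(Y):
    """Y[i][j]: log expression of gene j in sample i.

    Removes gene means and sample effects by double centering (equal weights).
    """
    n, J = len(Y), len(Y[0])
    grand = sum(map(sum, Y)) / (n * J)
    gene_mean = [sum(Y[i][j] for i in range(n)) / n for j in range(J)]
    sample_mean = [sum(row) / J for row in Y]
    return [[Y[i][j] - gene_mean[j] - sample_mean[i] + grand for j in range(J)]
            for i in range(n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def residual_correlations(Y):
    """Pairwise Pearson correlations between the gene columns of the residuals."""
    res = additive_residuals(Y)
    J = len(res[0])
    cols = [[row[j] for row in res] for j in range(J)]
    return {(a, b): pearson(cols[a], cols[b]) for a in range(J) for b in range(a)}


# Hypothetical data: 4 samples (rows) by 3 genes (columns).
Y = [
    [1.0, 2.0, 3.5],
    [1.5, 2.2, 3.0],
    [0.8, 2.9, 3.1],
    [1.2, 2.4, 3.9],
]
print(residual_correlations(Y))
```

With only a few genes, double centering by itself pushes residual correlations toward negative values, so correlations obtained this way are best read alongside their p-values.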

Pearson correlation matrix of the residuals from model (8) fitted to the data from 80 breast tumor samples

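| | | **ACTB** | **GAPDH** | **MRPL19** | **PSMC4** | **PUM1** |
|---|---|---|---|---|---|---|
| GAPDH | Coeff.^{1} | -0.112 | | | | |
| | p-value | 0.324 | | | | |
| MRPL19 | Coeff.^{1} | **-0.476** | 0.021 | | | |
| | p-value | <.0001 | 0.851 | | | |
| PSMC4 | Coeff.^{1} | **-0.246** | -0.108 | 0.014 | | |
| | p-value | 0.028 | 0.340 | 0.903 | | |
| PUM1 | Coeff.^{1} | 0.077 | **-0.352** | **-0.432** | **-0.510** | |
| | p-value | 0.496 | 0.001 | <.0001 | <.0001 | |
| SF3A1 | Coeff.^{1} | 0.147 | -0.086 | **-0.567** | **-0.313** | 0.160 |
| | p-value | 0.194 | 0.447 | <.0001 | 0.005 | 0.156 |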

^{1}Pearson correlation coefficient with p-value testing that it is zero

The proposed algorithm was applied to the breast tumor data with 1000 bootstrapped (sampled with replacement) data sets of size 80 from 80 samples. For each possible gene subset size (1-6), Table

Breast tumor data: Top ranked by set size bootstrap 95% upper confidence limit (UCL) for the variance and standard deviation of the log geometric mean (GM).

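| **Set Size(*)** | **ACTB** | **GAPDH** | **MRPL19** | **PSMC4** | **PUM1** | **SF3A1** | **95% UCL Var(GM)** | **95% UCL StdDev(GM)** |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0.407 | 0.638 |
| **2** | **1** | **0** | **0** | **0** | **0** | **1** | **0.349** | **0.591** |
| 3 | 1 | 0 | 0 | 0 | 1 | 1 | 0.356 | 0.596 |
| 4 | 1 | 1 | 0 | 0 | 1 | 1 | 0.398 | 0.631 |
| 5 | 1 | 1 | 0 | 1 | 1 | 1 | 0.429 | 0.655 |
| 6 | 1 | 1 | 1 | 1 | 1 | 1 | 0.465 | 0.682 |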

(*) in the column with gene name, 1 indicates that the corresponding gene is included in the subset and 0 that it is not included.

Breast tumor data: Ten gene subsets with the smallest mean overall ranks of the variance of the log geometric mean (GM).

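| **Set Size(*)** | **ACTB** | **GAPDH** | **MRPL19** | **PSMC4** | **PUM1** | **SF3A1** | **Mean rank of Var(GM)** |
|---|---|---|---|---|---|---|---|
| **2** | **1** | **0** | **0** | **0** | **0** | **1** | **1.2** |
| 3 | 1 | 0 | 0 | 0 | 1 | 1 | 2.0 |
| 2 | 1 | 0 | 0 | 0 | 1 | 0 | 3.2 |
| 2 | 0 | 0 | 0 | 0 | 1 | 1 | 4.7 |
| 4 | 1 | 0 | 0 | 1 | 1 | 1 | 6.1 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 6.4 |
| 3 | 1 | 0 | 0 | 1 | 0 | 1 | 8.1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 8.8 |
| 4 | 1 | 0 | 1 | 0 | 1 | 1 | 9.2 |
| 4 | 1 | 1 | 0 | 0 | 1 | 1 | 10.3 |
| 3 | 1 | 0 | 1 | 0 | 0 | 1 | 10.6 |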

(*) in the column with gene name, 1 indicates that the corresponding gene is included in the subset and 0 that it is not included.

In contrast, using the model in

For direct comparison of results, the empirical variances of the geometric means of selected gene subsets were computed for the actual log geometric means based on the optimal subsets identified by the proposed selection method and methods

Breast tumor data: Variability of log geometric means based on optimal gene subsets identified by various methods

| **Set Size** | **Method** | **Optimal set** | **Variance logGM** | **Std Dev logGM** |
|---|---|---|---|---|
| 2 | Szabo et al | MRPL19, PUM1 | 0.517 | 0.719 |
| 2 | Vandesompele et al | MRPL19, PSMC4 | 0.629 | 0.793 |
| 2 | New | ACTB, SF3A1 | **0.321** | **0.567** |
| 3 | Szabo et al | MRPL19, PUM1, PSMC4 | 0.531 | 0.729 |
| 3 | Vandesompele et al | MRPL19, PUM1, PSMC4 | 0.531 | 0.729 |
| 3 | New | ACTB, SF3A1, PUM1 | 0.327 | 0.572 |
| 4 | Szabo et al^{1} | MRPL19, PUM1, PSMC4, SF3A1 | 0.464 | 0.681 |
| 4 | New | ACTB, SF3A1, PUM1, GAPDH | 0.369 | 0.607 |

^{1}Same results using either the method of Vandesompele et al

Results for neuroblastoma data

For 34 neuroblastoma samples, the proposed new algorithm yielded the smallest upper bound for the variance of the geometric mean of six genes, ACTB, B2M, GAPDH, HPRT1, TBP, and YWHAZ. However, the subset of four genes, ACTB, B2M, GAPDH, and TBP, has a negligibly higher upper bound (0.303 vs. 0.298, Table

Neuroblastoma data: Top ranked by set size bootstrap 95% upper confidence limit (UCL) for the variance and standard deviation of the log geometric mean (GM).

| **Set Size(*)** | **ACTB** | **B2M** | **GAPDH** | **HMBS** | **HPRT1** | **RPL13A** | **SDHA** | **TBP** | **UBC** | **YWHAZ** | **95% UCL Var(GM)** | **95% UCL StdDev(GM)** |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.458 | 0.677 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.340 | 0.583 |
| 3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.340 | 0.583 |
| 4 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0.303 | 0.550 |
| 5 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0.299 | 0.547 |
| 6 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0.298 | 0.546 |
| 7 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0.303 | 0.550 |
| 8 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0.317 | 0.563 |
| 9 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0.334 | 0.578 |
| 10 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.353 | 0.594 |

(*) in the column with gene name, 1 indicates that the corresponding gene is included in the subset and 0 that it is not included.

Neuroblastoma data: Ten gene subsets with the smallest mean overall ranks of the variance of the log geometric mean (GM).

| **Set Size(*)** | **ACTB** | **B2M** | **GAPDH** | **HMBS** | **HPRT1** | **RPL13A** | **SDHA** | **TBP** | **UBC** | **YWHAZ** | **Mean rank of Var(GM)** |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 52.1 |
| 7 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 59.9 |
| 5 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 63.8 |
| 6 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 76.7 |
| 6 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 87.5 |
| 5 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 92.2 |
| 6 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 93.3 |
| 7 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 95.9 |
| 7 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 96.4 |
| 7 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 103.0 |

(*) in the column with gene name, 1 indicates that the corresponding gene is included in the subset and 0 that it is not included.

Figure

Table

Neuroblastoma data: Variability of log geometric means based on optimal gene subsets identified by various methods

| **Set Size** | **Method** | **Optimal set** | **Variance logGM** | **Std Dev logGM** |
|---|---|---|---|---|
| 2 | Vandesompele et al | GAPDH, HPRT | 0.327 | 0.572 |
| 2 | Szabo et al | GAPDH, SDHA | 0.374 | 0.612 |
| 2 | New | GAPDH, YWHAZ | 0.250 | 0.500 |
| 3 | Old^{1} | GAPDH, HPRT, SDHA | 0.348 | 0.590 |
| 3 | New | ACTB, GAPDH, YWHAZ | 0.255 | 0.505 |
| 4 | Old^{1} | GAPDH, HPRT, SDHA, UBC | 0.361 | 0.601 |
| 4 | New | ACTB, B2M, GAPDH, TBP | 0.231 | **0.480** |
| 5 | Old^{1} | GAPDH, HPRT, SDHA, UBC, HMBS | 0.358 | 0.598 |
| 5 | New | ACTB, B2M, GAPDH, HPRT1, RPL13A | 0.224 | **0.473** |
| 6 | Old^{1} | GAPDH, HPRT, SDHA, UBC, HMBS, YWHAZ | 0.319 | 0.565 |
| 6 | New | ACTB, GAPDH, B2M, HPRT1, TBP, YWHAZ | 0.227 | **0.477** |

^{1}Same results using either the method of Vandesompele et al

Results for five reference genes for GUCY2C in blood

For five candidate reference genes for GUCY2C (ACTB, GAPDH, HPRT, PPIB, and TFRC), the new approach was applied to the log transformed relative expression levels for direct comparison with previously proposed methods and to the threshold cycle (Ct) numbers because Ct numbers are actually used for efficiency adjusted relative quantification

Tables

Blood data: Top ranked by set size bootstrap 95% upper confidence limit (UCL) for the variance and standard deviation of the log geometric mean (GM) based on log transformed relative expression levels.

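| **Set Size(*)** | **ACTB** | **GAPDH** | **HPRT1** | **PPIB** | **TFRC** | **95% UCL Var(GM)** | **95% UCL StdDev(GM)** |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 0 | 0 | **1.19** | 1.09 |
| 2 | 0 | 1 | 0 | 0 | 1 | 1.25 | 1.12 |
| 3 | 0 | 1 | 1 | 0 | 1 | 1.57 | 1.25 |
| 4 | 0 | 1 | 1 | 1 | 1 | 1.77 | 1.33 |
| 5 | 1 | 1 | 1 | 1 | 1 | 2.06 | 1.43 |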

(*) in the column with gene name, 1 indicates that the corresponding gene is included in the subset and 0 that it is not included.

Blood data: Top ranked by set size bootstrap 95% upper confidence limit (UCL) for the variance and standard deviation of the log geometric mean (GM) based on Ct numbers.

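| **Set Size(*)** | **ACTB** | **GAPDH** | **HPRT1** | **PPIB** | **TFRC** | **95% UCL Var(GM)** | **95% UCL StdDev(GM)** |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 0 | 0 | 6.22 | 2.49 |
| 2 | 0 | 1 | 0 | 0 | 1 | 6.06 | 2.46 |
| 3 | 0 | 1 | 1 | 0 | 1 | 6.66 | 2.58 |
| 4 | 1 | 1 | 1 | 0 | 1 | 7.29 | 2.70 |
| 5 | 1 | 1 | 1 | 1 | 1 | 7.91 | 2.81 |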

Blood data: Ten gene subsets with the smallest mean overall ranks of the variance of the log geometric mean (GM) based on log transformed relative expression levels.

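| **Set Size(*)** | **ACTB** | **GAPDH** | **HPRT1** | **PPIB** | **TFRC** | **Mean rank of Var(GM)** |
|---|---|---|---|---|---|---|
| 2 | 0 | 1 | 0 | 0 | 1 | 1.5 |
| 1 | 0 | 1 | 0 | 0 | 0 | 2.8 |
| 3 | 0 | 1 | 1 | 0 | 1 | 3.7 |
| 3 | 1 | 1 | 0 | 0 | 1 | 5.3 |
| 2 | 0 | 1 | 1 | 0 | 0 | 5.7 |
| 1 | 0 | 0 | 0 | 0 | 1 | 7.0 |
| 4 | 1 | 1 | 1 | 0 | 1 | 8.1 |
| 3 | 0 | 1 | 0 | 1 | 1 | 8.2 |
| 2 | 1 | 1 | 0 | 0 | 0 | 8.8 |
| 4 | 0 | 1 | 1 | 1 | 1 | 10.7 |

(*) in the column with gene name, 1 indicates that the corresponding gene is included in the subset and 0 that it is not included.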

Blood data: Ten gene subsets with the smallest mean overall ranks of the variance of the log geometric mean (GM) based on Ct numbers.

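| **Set Size(*)** | **ACTB** | **GAPDH** | **HPRT1** | **PPIB** | **TFRC** | **Mean rank of Var(GM)** |
|---|---|---|---|---|---|---|
| 2 | 0 | 1 | 0 | 0 | 1 | 1.7 |
| 1 | 0 | 1 | 0 | 0 | 0 | 2.1 |
| 3 | 0 | 1 | 1 | 0 | 1 | 3.5 |
| 2 | 0 | 1 | 1 | 0 | 0 | 4.4 |
| 1 | 0 | 0 | 0 | 0 | 1 | 5.3 |
| 3 | 0 | 1 | 0 | 1 | 1 | 5.6 |
| 4 | 0 | 1 | 1 | 1 | 1 | 7.3 |
| 2 | 0 | 1 | 0 | 1 | 0 | 8.3 |
| 3 | 0 | 1 | 1 | 1 | 0 | 9.6 |
| 2 | 0 | 0 | 1 | 0 | 1 | 10.4 |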

**Blood data: 95% UCL vs. the average overall rank of the normalizing factors (expression levels)**. Each point represents one of the possible 31 = 2^{5}-1 gene subsets. Different colors are used for the subsets with different numbers of genes included. The x-coordinate is the average overall rank of the corresponding normalizing factor variance. The y-coordinate is the upper 95% confidence limit (95% UCL) for the standard deviation of the log normalizing factor.

For comparison, model 1a in

Blood data: Variability of log geometric means based on optimal gene subsets identified by various methods

| **Set Size** | **Method** | **Optimal set** | **Variance logGM** | **Std Dev logGM** |
|---|---|---|---|---|
| 2 | Szabo et al | TFRC, GAPDH | 0.98 | 0.99 |
| 2 | Vandesompele et al | TFRC, HPRT | 1.47 | 1.21 |
| 2 | New | TFRC, GAPDH | **0.98** | 0.99 |
| 3 | Szabo et al | TFRC, GAPDH, PPIB | 1.26 | 1.12 |
| 3 | Vandesompele et al | TFRC, GAPDH, PPIB | 1.62 | 1.27 |
| 3 | New | TFRC, GAPDH, HPRT | 1.16 | 1.08 |
| 4 | All methods | GAPDH, PPIB, TFRC, HPRT | 1.34 | 1.16 |

Simulation study

A small simulation study was conducted to evaluate the performance of the proposed approach assuming varying degrees of innate correlation among reference genes, independent of the variance component corresponding to the sample random effect. Samples of size 25, 40, or 80 of 5-dimensional vectors, representing log-transformed expression levels, were generated from the 5-variate normal distribution according to model (4). Since the mean part of the model does not affect either the new or the previously proposed methods, without loss of generality, it was assumed that the mean vector had all components equal to zero (**g** = **0**). The covariance matrix **V** of the simulated 5-variate normal samples had the structure **V** = σ^{2}1_{J×J} + **R**, where σ^{2} is the variance component for the sample random effect, **R** is the covariance matrix of the random effects of genes, and J = 5. The design table below shows the matrices **V** used in five different simulation scenarios. The values of σ^{2}, the variance of the sample random effect, ranged from 0.02 to 0.16, while the correlation coefficients corresponding to the **R** matrices were 0, ±0.2, or ±0.4, representing zero, weak, and strong correlation, respectively. The **R** matrices were defined by five standard deviations, corresponding to the innate variances of each gene, and by the correlation matrix shown. The values of the standard deviations were chosen so that the resulting elements of matrices **V** were similar in magnitude to the estimates from the real data examples. The design table also shows, for each subset size, the minimum variance of the mean computed from the true **V**. The values in bold correspond to the absolute minimum variance of the mean for any possible subset size and the size of that optimal subset. The last column shows the minimum variance of the mean computed by treating **V** as a diagonal matrix and assuming the sample effect to be fixed rather than random. The corresponding absolute minimum variance of the mean for any possible subset size is shown in bold italic. Since the results using the method of Szabo et al
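The construction of **V** and the "true" and "uncorrelated" minimum variances can be reproduced directly from the design. The sketch below uses the standard deviations, identity correlation matrix, and sample effect variance of the first scenario (uncorrelated **R**, variance 0.16); the function names are hypothetical and the code is an illustration, not the authors' simulation program.

```python
from itertools import combinations


def build_V(std_devs, corr, sigma2):
    """V = sigma2 * (matrix of ones) + R, with R built from std devs and correlations."""
    J = len(std_devs)
    return [[sigma2 + corr[a][b] * std_devs[a] * std_devs[b] for b in range(J)]
            for a in range(J)]


def min_variance_by_size(V):
    """For each subset size L, the smallest C'VC / L^2 over all subsets of that size."""
    J = len(V)
    return {
        L: min(sum(V[a][b] for a in s for b in s) / (L * L)
               for s in combinations(range(J), L))
        for L in range(1, J + 1)
    }


# First scenario of the design table.
std_devs = [0.30, 0.35, 0.80, 0.90, 1.00]
identity = [[1.0 if a == b else 0.0 for b in range(5)] for a in range(5)]
V = build_V(std_devs, identity, sigma2=0.16)

true_min = min_variance_by_size(V)

# "Uncorrelated" calculation: keep only the diagonal of V.
diag_only = [[V[a][b] if a == b else 0.0 for b in range(5)] for a in range(5)]
uncorr_min = min_variance_by_size(diag_only)

for L in range(1, 6):
    print(L, round(true_min[L], 3), round(uncorr_min[L], 3))
```

For subset size 2 this gives approximately 0.213 from the full **V** and 0.133 from the diagonal-only calculation, in agreement with the corresponding row of the design table.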

Design of the simulation study

| **Scenario** | **Std Dev** | **Correlation Matrix of R** | **Total Covariance Matrix V** | **No Genes** | **Min Var of NF^{1}: True** | **Min Var of NF^{1}: Uncorr^{2}** |
|---|---|---|---|---|---|---|
| Uncorrelated R; Sample Random Effect Var = 0.16 | 0.30 | 1 0 0 0 0 | 0.25 0.16 0.16 0.16 0.16 | 1 | 0.250 | 0.250 |
| | 0.35 | 0 1 0 0 0 | 0.16 0.28 0.16 0.16 0.16 | **2** | **0.213** | ***0.133*** |
| | 0.80 | 0 0 1 0 0 | 0.16 0.16 0.80 0.16 0.16 | 3 | 0.255 | 0.148 |
| | 0.90 | 0 0 0 1 0 | 0.16 0.16 0.16 0.97 0.16 | 4 | 0.264 | 0.144 |
| | 1.00 | 0 0 0 0 1 | 0.16 0.16 0.16 0.16 1.16 | 5 | 0.267 | 0.139 |
| Corr Coef = 0.2; Sample Random Effect Var = 0.02 | 0.60 | 1 0.2 0.2 0.2 0.2 | 0.38 0.10 0.11 0.15 0.16 | 1 | 0.380 | 0.380 |
| | 0.70 | 0.2 1 0.2 0.2 0.2 | 0.10 0.51 0.13 0.17 0.19 | 2 | 0.275 | 0.223 |
| | 0.75 | 0.2 0.2 1 0.2 0.2 | 0.11 0.13 0.58 0.19 0.20 | **3** | **0.239** | ***0.164*** |
| | 1.10 | 0.2 0.2 0.2 1 0.2 | 0.15 0.17 0.19 1.23 0.28 | 4 | 0.275 | 0.169 |
| | 1.20 | 0.2 0.2 0.2 0.2 1 | 0.16 0.19 0.20 0.28 1.46 | 5 | 0.301 | 0.167 |
| Corr Coef = ±0.2; Sample Random Effect Var = 0.1 | 0.42 | 1 -0.2 -0.2 0.2 0.2 | 0.28 0.06 0.06 0.15 0.15 | 1 | 0.276 | 0.276 |
| | 0.45 | -0.2 1 -0.2 0.2 0.2 | 0.06 0.30 0.06 0.15 0.15 | 2 | 0.176 | 0.145 |
| | 0.48 | -0.2 -0.2 1 0.2 0.2 | 0.06 0.06 0.33 0.16 0.16 | **3** | **0.141** | 0.101 |
| | 0.60 | 0.2 0.2 0.2 1 0.2 | 0.15 0.15 0.16 0.46 0.17 | 4 | 0.166 | 0.086 |
| | 0.60 | 0.2 0.2 0.2 0.2 1 | 0.15 0.15 0.16 0.17 0.46 | 5 | 0.175 | ***0.073*** |
| Corr Coef = ±0.4; Sample Random Effect Var = 0.16 | 0.30 | 1 -0.4 0.0 0.0 0.0 | 0.25 0.11 0.16 0.16 0.16 | 1 | 0.250 | 0.090 |
| | 0.40 | -0.4 1 0.0 0.0 0.0 | 0.11 0.32 0.16 0.16 0.16 | **2** | **0.199** | ***0.063*** |
| | 0.60 | 0.0 0.0 1 0.0 0.0 | 0.16 0.16 0.52 0.16 0.16 | 3 | 0.217 | 0.068 |
| | 0.70 | 0.0 0.0 0.0 1 0.4 | 0.16 0.16 0.16 0.65 0.38 | 4 | 0.223 | 0.069 |
| | 0.80 | 0.0 0.0 0.0 0.4 1 | 0.16 0.16 0.16 0.38 0.80 | 5 | 0.244 | 0.070 |
| Corr Coef = 0.4; Sample Random Effect Var = 0.1 | 0.40 | 1 0.4 0.4 0.4 0.4 | 0.26 0.18 0.21 0.23 0.24 | 1 | 0.260 | 0.260 |
| | 0.50 | 0.4 1 0.4 0.4 0.4 | 0.18 0.35 0.24 0.26 0.28 | **2** | **0.243** | 0.153 |
| | 0.70 | 0.4 0.4 1 0.4 0.4 | 0.21 0.24 0.59 0.32 0.35 | 3 | 0.274 | 0.133 |
| | 0.80 | 0.4 0.4 0.4 1 0.4 | 0.23 0.26 0.32 0.74 0.39 | 4 | 0.302 | 0.121 |
| | 0.90 | 0.4 0.4 0.4 0.4 1 | 0.24 0.28 0.35 0.39 0.91 | 5 | 0.331 | ***0.114*** |

^{1}NF - normalizing factor

^{2}Assuming Szabo et al
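The design above can be illustrated with a minimal Python sketch. It builds **V** = σ^{2}1_{J×J} + **R** from the standard deviations and correlation matrix of the "Corr Coef = ±0.4" scenario, then enumerates all gene subsets to find the smallest variance of the mean for each subset size. The function names (`total_covariance`, `var_of_mean`, `min_var_by_size`) are hypothetical, and the variance of the mean is assumed to take the usual form, the sum of the elements of the corresponding sub-matrix of **V** divided by the squared subset size; the random seed is arbitrary.

```python
from itertools import combinations

import numpy as np


def total_covariance(std_devs, corr, sample_var):
    """V = sample_var * 1_{JxJ} + R, with R = D @ corr @ D and D = diag(std_devs)."""
    sd = np.asarray(std_devs, dtype=float)
    R = np.outer(sd, sd) * np.asarray(corr, dtype=float)
    return sample_var * np.ones((sd.size, sd.size)) + R


def var_of_mean(V, subset):
    """Variance of the average log expression of the genes in `subset` (assumed form)."""
    idx = np.asarray(subset)
    return V[np.ix_(idx, idx)].sum() / idx.size ** 2


def min_var_by_size(V):
    """For each subset size k, the smallest variance of the mean and the subset achieving it."""
    J = V.shape[0]
    return {
        k: min((var_of_mean(V, s), s) for s in combinations(range(J), k))
        for k in range(1, J + 1)
    }


# Design values of the "Corr Coef = ±0.4" scenario, taken from the design table.
std_devs = [0.30, 0.40, 0.60, 0.70, 0.80]
corr = np.eye(5)
corr[0, 1] = corr[1, 0] = -0.4
corr[3, 4] = corr[4, 3] = 0.4
V = total_covariance(std_devs, corr, sample_var=0.16)

best = min_var_by_size(V)
for k, (variance, subset) in best.items():
    print(k, round(variance, 3), subset)

# One simulated data set: 40 samples of 5 log expression levels with zero mean.
rng = np.random.default_rng(0)
sample = rng.multivariate_normal(np.zeros(5), V, size=40)
```

Under these assumptions, the per-size minima should agree with the "True" column of the design table for this scenario up to rounding, with the overall minimum at the two-gene subset formed by the negatively correlated pair.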

The results of the simulation study are summarized in terms of sensitivity in identifying the optimal subset with the absolute minimum variance of the mean. Table **A** (UCL), proposed criterion **C** (Rank), and method in

Results of the simulation study

| **Scenario** | **No of samples** | **Sensitivity to optimal subset**: **UCL**^{1} | **Sensitivity to optimal subset**: **Rank**^{2} | **Sensitivity to optimal subset**: **Szabo**^{3} |
|---|---|---|---|---|
| Uncorrelated R | 25 | 43.00 | 41.75 | 60.25 |
| Sample Random | 40 | 53.50 | 53.25 | 73.50 |
| Effect Var = 0.16 | 80 | 81.25 | 81.50 | 86.25 |
| All Corr Coef = 0.2 | 25 | 34.25 | 36.50 | 38.75 |
| Sample Random | 40 | 53.25 | 55.25 | 46.25 |
| Effect Var = 0.02 | 80 | 75.75 | 74.50 | 57.25 |
| Corr Coef = ±0.2 | 25 | 48.50 | 55.00 | 0.00 |
| Sample Random | 40 | 68.00 | 72.75 | 0.25 |
| Effect Var = 0.1 | 80 | 91.50 | 93.50 | 0.00 |
| Corr Coef = ±0.4 | 25 | 36.50 | 31.50 | 8.50 |
| Sample Random | 40 | 49.25 | 42.75 | 7.25 |
| Effect Var = 0.16 | 80 | 63.50 | 60.25 | 3.75 |
| All Corr Coef = 0.4 | 25 | 37.5 | 40.8 | 23.3 |
| Sample Random | 40 | 49.8 | 51.3 | 21.0 |
| Effect Var = 0.1 | 80 | 68.0 | 71.5 | 21.3 |

^{1}Criterion (**A**) (minimum 95% upper confidence limit for standard deviation of the normalizing factor)

^{2}Criterion (**C**) (minimum average rank of the normalizing factor variance)

^{3}Minimum standard deviation of the normalizing factor variance as in Szabo et al
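A sensitivity figure of the kind reported in the table above can be approximated with a short Monte Carlo loop. The Python sketch below is an illustration only: it uses the "Corr Coef = ±0.2" design values, and it selects the subset with the smallest plug-in variance estimate in each replicate, which is a simplification and not the bootstrap-based criteria (**A**) or (**C**). The function names, the number of replicates, and the seed are assumptions.

```python
from itertools import combinations

import numpy as np


def all_subsets(J):
    """Every non-empty subset of J genes, as lists of column indices."""
    return [list(s) for k in range(1, J + 1) for s in combinations(range(J), k)]


def subset_variances(V, subsets):
    """Variance of the mean log expression for each subset (sub-matrix sum / size^2)."""
    return np.array([V[np.ix_(s, s)].sum() / len(s) ** 2 for s in subsets])


def sensitivity(V_true, n_samples, n_reps=200, seed=0):
    """Fraction of simulated data sets whose selected subset equals the true optimum."""
    J = V_true.shape[0]
    subsets = all_subsets(J)
    true_best = int(np.argmin(subset_variances(V_true, subsets)))
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_reps):
        data = rng.multivariate_normal(np.zeros(J), V_true, size=n_samples)
        V_hat = np.cov(data, rowvar=False)  # unstructured covariance estimate
        # Simplified selection rule: smallest plug-in variance estimate.
        if int(np.argmin(subset_variances(V_hat, subsets))) == true_best:
            hits += 1
    return subsets[true_best], hits / n_reps


# "Corr Coef = ±0.2" scenario: genes 1-3 mutually at -0.2, all other pairs at 0.2.
sd = np.array([0.42, 0.45, 0.48, 0.60, 0.60])
corr = np.full((5, 5), 0.2)
corr[:3, :3] = -0.2
np.fill_diagonal(corr, 1.0)
V = 0.1 + np.outer(sd, sd) * corr  # sample random effect variance 0.1

optimal, sens = sensitivity(V, n_samples=40)
print(optimal, sens)
```

Because the selection rule here is a plug-in estimate, the resulting fraction is not expected to match the tabulated values exactly; the sketch only shows how such a sensitivity is computed.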

The results of our simulation study suggest that for truly uncorrelated candidate reference genes, the proposed approach may have lower power/sensitivity than the method of Szabo et al **V **has the structure as assumed in **V**. For equally weakly positively correlated candidate reference genes, performance of our and approach in

Discussion

In this work, we developed an approach for selecting an optimal set of reference genes for normalization in RT-PCR. The key difference from previously proposed methods is that the assumption of independence among candidate reference genes is relaxed and, instead, the estimated correlation among the genes is incorporated into the estimates of variability of the prospective normalizing factors. The proposed approach does not estimate the correlation among the genes explicitly; the correlation enters implicitly through the estimate of the total covariance matrix **V**. The variance of a log-transformed prospective normalizing factor is then estimated by substituting the estimated **V** into (6).

To overcome the uncertainty in estimating a large number of covariance parameters from what are usually small data sets, we employ the bootstrap to obtain robust upper confidence bounds for the variance of the log geometric means of multiple genes. These bounds allow various gene subsets to be compared as prospective normalizing factors, and may also be used in sample size calculations when designing an RT-PCR study. Our approach also allows some flexibility in choosing a criterion for selecting the optimal subset(s) of reference genes when no single subset meets all the criteria.
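The bootstrap step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' program: whole samples (rows) are resampled with replacement, the covariance matrix is re-estimated without structure in each resample, a percentile upper limit stands in for the upper confidence bound, and the "average rank" is taken to be the rank of each subset's variance within a resample, averaged over resamples. The function names and the simulated input data are hypothetical.

```python
from itertools import combinations

import numpy as np


def bootstrap_subset_variances(data, n_boot=1000, seed=0):
    """Bootstrap the variance of the mean log expression for every gene subset.

    data: (n_samples, n_genes) array of log-transformed expression levels.
    Returns the list of subsets and an (n_boot, n_subsets) array of variances.
    """
    n, J = data.shape
    subsets = [list(s) for k in range(1, J + 1) for s in combinations(range(J), k)]
    rng = np.random.default_rng(seed)
    out = np.empty((n_boot, len(subsets)))
    for b in range(n_boot):
        resample = data[rng.integers(0, n, size=n)]  # resample whole rows
        V_hat = np.cov(resample, rowvar=False)       # unstructured covariance estimate
        out[b] = [V_hat[np.ix_(s, s)].sum() / len(s) ** 2 for s in subsets]
    return subsets, out


def select_subsets(subsets, boot_vars, level=0.95):
    """Smallest percentile upper limit (A-like) and smallest average rank (C-like)."""
    ucl = np.quantile(boot_vars, level, axis=0)
    ranks = boot_vars.argsort(axis=1).argsort(axis=1) + 1  # rank within each resample
    avg_rank = ranks.mean(axis=0)
    return subsets[int(np.argmin(ucl))], subsets[int(np.argmin(avg_rank))], ucl


# Illustrative input: 40 simulated samples from an uncorrelated-R covariance matrix.
rng = np.random.default_rng(1)
V = 0.16 + np.diag(np.square([0.30, 0.35, 0.80, 0.90, 1.00]))
data = rng.multivariate_normal(np.zeros(5), V, size=40)

subsets, boot_vars = bootstrap_subset_variances(data, n_boot=200)
best_a, best_c, ucl = select_subsets(subsets, boot_vars)
print("smallest upper limit:", best_a, "smallest average rank:", best_c)
```

The two rules need not pick the same subset, which is where the choice of criterion matters.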

Here, our primary focus was on selecting reference genes for normalizing target gene expressions from one tissue as motivated by the study of guanylyl cyclase C (GUCY2C) mRNA expression in blood. Our methodology is easily extendable to multiple tissues or inter-species comparisons by incorporating fixed effects for between-tissue or between-species differences into the mean sub-model **Aβ **in (3), as long as one can assume that variances and correlation among the genes do not change between tissues or between species. If they do change between tissues or between species, then selecting the same reference genes for different tissues or different species may not be appropriate, or careful consideration may be required to set appropriate criteria of optimal properties of the reference genes that may behave differently in different tissues or species.
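One way to picture this extension is to remove a separate mean per gene and per tissue before estimating the covariance matrix, which yields a pooled within-tissue estimate. The Python sketch below assumes, as the text requires, that the variances and correlations among genes are common to all tissues; the function name `pooled_within_covariance` and the simulated data are hypothetical, and this is only one possible way of fitting such fixed effects.

```python
import numpy as np


def pooled_within_covariance(data, groups):
    """Covariance of log expression after removing a fixed mean per group (e.g. tissue)."""
    data = np.asarray(data, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    resid = data.copy()
    for g in labels:
        mask = groups == g
        resid[mask] -= data[mask].mean(axis=0)  # subtract group-specific gene means
    dof = data.shape[0] - labels.size           # one mean per gene is fitted per group
    return resid.T @ resid / dof


# Illustrative data: two tissues sharing one covariance matrix but differing in mean.
rng = np.random.default_rng(2)
V = 0.1 + np.diag([0.2, 0.3, 0.4])
tissue = np.repeat([0, 1], 30)
data = rng.multivariate_normal(np.zeros(3), V, size=60)
data[tissue == 1] += np.array([1.0, -0.5, 2.0])  # between-tissue shift in the means

V_hat = pooled_within_covariance(data, tissue)
print(np.round(V_hat, 3))
```

The estimate is unaffected by the size of the between-tissue shift, so the resulting matrix can be passed to the same subset evaluation as in the single-tissue case.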

In the considered data examples, the use of the proposed methodology yielded generally smaller optimal subsets of the reference genes with smaller variability of the normalizing factors. In direct comparisons, the normalizing factor variances (based on the genes from the selected subset only) were reduced by 27-32% when using the proposed selection approach instead of the methods

Conclusions

The proposed approach performs a comprehensive and robust evaluation of the variability of normalizing factors based on all possible subsets of candidate reference genes, rather than addressing the stability of individual reference genes. The results of this evaluation provide the flexibility to choose the most important criterion for selecting the optimal subset(s) of reference genes when no single subset meets all the criteria. This new approach identifies gene subset(s) with smaller variability of normalizing factors than current standard approaches when there is some nontrivial innate correlation among the candidate genes.

Authors' contributions

SAW and SS initiated the biological problem. CW, IC, SS, SAW, and TH designed the validation study of five candidate reference genes for normalization of guanylyl cyclase C (GUCY2C). IC and TH designed the statistical methods and analyses. CW conducted the RT-PCR experiments. IC and YL conducted the analysis, devised algorithms and wrote the computer programs. SC carried out the simulation study and prepared the figures. All authors have read and approved the final manuscript.

Acknowledgements

These studies were supported by NIH grants CA075123, CA79663, CA95026 and CA112147. CW was enrolled in the NIH-supported institutional K30 Training Program in Human Investigation (K30 HL004522) and was supported by NIH institutional award T32 GM08562 for Postdoctoral Training in Clinical Pharmacology. SAW is the Samuel M.V. Hamilton Professor of Medicine of Thomas Jefferson University.