Department of Plant Sciences, University of Tennessee, Knoxville, TN 37996, USA

University of Tennessee Institute of Agriculture Genomics Hub, University of Tennessee, Knoxville, TN 37996, USA

Statistical Consulting Center, University of Tennessee, Knoxville, TN 37996, USA

Abstract

Background

Even though real-time PCR has been broadly applied in biomedical sciences, data processing procedures for the analysis of quantitative real-time PCR are still lacking, particularly in the realm of appropriate statistical treatment. Confidence interval and statistical significance considerations are not explicit in many of the current data analysis approaches. Based on the standard curve method and other useful data analysis methods, we present and compare four statistical approaches and models for the analysis of real-time PCR data.

Results

In the first approach, a multiple regression analysis model was developed to derive ΔΔCt from the estimated interaction of gene and treatment effects. In the second approach, an ANCOVA (analysis of covariance) model was proposed, and the ΔΔCt can be derived from analysis of the effects of variables. The other two models involve calculation of ΔCt followed by a two-group t-test and its non-parametric analogue, the Wilcoxon two-group test.

Conclusion

Practical statistical solutions with SAS programs were developed for real-time PCR data and a sample dataset was analyzed with the SAS programs. The analysis using the various models and programs yielded similar results. Data quality control and analysis procedures presented here provide statistical elements for the estimation of the relative expression of genes using real-time PCR.

Background

Real-time PCR is one of the most sensitive and reliably quantitative methods for gene expression analysis. It has been broadly applied to microarray verification, pathogen quantification, cancer quantification, transgenic copy number determination and drug therapy studies.

**Real-time PCR**. (A) A theoretical plot of PCR cycle number against PCR product amount. Three phases can be observed for PCRs: the exponential phase, the linear phase and the plateau phase. (B) A theoretical plot of PCR cycle number against the logarithm of PCR product amount. (C) The output of a serial dilution experiment from an ABI 7000 real-time PCR instrument.

Both genomic DNA and reverse transcribed cDNA can be used as templates for real-time PCR. The dynamics of PCR are typically observed through DNA binding dyes like SYBR green or DNA hybridization probes such as molecular beacons (Stratagene) or Taqman probes (Applied Biosystems).

Real-time PCR data can be quantified absolutely or relatively. Absolute quantification employs an internal or external calibration curve to derive the input template copy number. Absolute quantification is important when the exact transcript copy number needs to be determined; however, relative quantification is sufficient for most physiological and pathological studies. Relative quantification relies on the comparison between the expression of a target gene and that of a reference gene, and between the expression of the same gene in the target sample and in reference samples.

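Since relative quantification is the goal of most real-time PCR experiments, several data analysis procedures have been developed. Two mathematical models are very widely applied: the efficiency-calibrated model and the ΔΔCt model. In the efficiency-calibrated model (Equation 1), the ratio of target gene expression in treatment versus control is the target gene efficiency (E_{target}) raised to the power of the target ΔCt (ΔCt_{target}), divided by the reference gene efficiency (E_{reference}) raised to the power of the reference ΔCt (ΔCt_{reference}). The ΔΔCt model (Equation 2) can be derived from the efficiency-calibrated model if both target and reference genes reach their highest PCR amplification efficiency. In this circumstance, both the target efficiency (E_{target}) and the control efficiency (E_{control}) equal 2, indicating amplicon doubling during each cycle, and the same expression ratio is obtained from 2^{-ΔΔCt}.

$$\mathrm{Ratio} = \frac{(E_{target})^{\Delta Ct_{target}}}{(E_{reference})^{\Delta Ct_{reference}}} \qquad \text{(Equation 1)}$$

where ΔCt_{target} = Ct_{control} - Ct_{treatment} and ΔCt_{reference} = Ct_{control} - Ct_{treatment}, each computed for the gene named in the subscript.

$$\mathrm{Ratio} = 2^{-\Delta\Delta Ct} \qquad \text{(Equation 2)}$$

where ΔΔCt = ΔCt_{reference} - ΔCt_{target}.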
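The relationship between the two models can be checked numerically. The following is a minimal Python sketch (Python is used here for illustration only; the programs described in this article are written in SAS). The function names and the ΔCt values are hypothetical and are not taken from the sample data set.

```python
def ratio_efficiency_calibrated(e_target, dct_target, e_reference, dct_reference):
    """Equation 1: E_target**dCt_target / E_reference**dCt_reference."""
    return (e_target ** dct_target) / (e_reference ** dct_reference)


def ratio_ddct(dct_target, dct_reference):
    """Equation 2: 2**(-ddCt), with ddCt = dCt_reference - dCt_target."""
    ddct = dct_reference - dct_target
    return 2.0 ** (-ddct)


# Hypothetical dCt values (control Ct minus treatment Ct), for illustration only.
dct_target, dct_reference = 2.0, 1.0

# With both efficiencies equal to 2, the two models give the same ratio.
r_calibrated = ratio_efficiency_calibrated(2.0, dct_target, 2.0, dct_reference)
r_ddct = ratio_ddct(dct_target, dct_reference)
print(r_calibrated, r_ddct)
```

When either efficiency is set below 2, the two functions no longer agree, which is the situation the efficiency-calibrated model is meant to handle.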

Even though both the efficiency-calibrated and ΔΔCt models are widely applied in gene expression studies, few papers thoroughly discuss the statistical considerations involved in analyzing the effect of each experimental factor or in significance testing. One of the few studies that employed substantial statistical analysis used the REST^{®} program.

Results and discussion

Data quality control

From the two mathematical models for relative quantification of real-time PCR data, we observe disparities between data quality standards. For efficiency-calibrated method, the author who described this procedure

Data quality could be examined through a correlation model. Even though examining the correlation between Ct number and concentration can provide an effective quality control, a better approach might be to examine the correlation between Ct and the logarithm (base 2) transformed concentration of template, which should yield a significant simple linear relationship for each gene and sample combination. For example, for a target gene in the control sample, the Ct number should correlate with the logarithm transformed concentration following the simple linear regression model in Equation 3. In the equation, X_{lcon} represents the logarithm transformed concentration, β_{0} represents the intercept of the regression line, and β_{con} represents the slope of the regression line.

The sample real-time PCR data for analysis. In this data set, there are two types of samples (treatment and control), two genes (reference and target), and four concentrations for each combination of gene and sample. For data quality control and ANCOVA analysis, the sample data set can be grouped into four groups according to the combination of sample and gene: Control-Target is group 1, Treatment-Target group 2, Control-Reference group 3 and Treatment-Reference group 4.

| Replicate | Sample | Gene | Concentration | Ct | Group (Class) |
|---|---|---|---|---|---|
| 1 | Control | Target | 10 | 23.1102 | 1 |
| 2 | Control | Target | 10 | 22.9003 | 1 |
| 3 | Control | Target | 10 | 22.8972 | 1 |
| 1 | Control | Target | 2 | 26.5801 | 1 |
| 2 | Control | Target | 2 | 26.2139 | 1 |
| 3 | Control | Target | 2 | 26.0606 | 1 |
| 1 | Control | Target | 0.4 | 28.1125 | 1 |
| 2 | Control | Target | 0.4 | 28.1899 | 1 |
| 3 | Control | Target | 0.4 | 27.5949 | 1 |
| 1 | Control | Target | 0.08 | 30.2772 | 1 |
| 2 | Control | Target | 0.08 | 30.4667 | 1 |
| 3 | Control | Target | 0.08 | 30.7571 | 1 |
| 1 | Treatment | Target | 10 | 21.7813 | 2 |
| 2 | Treatment | Target | 10 | 21.7564 | 2 |
| 3 | Treatment | Target | 10 | 21.641 | 2 |
| 1 | Treatment | Target | 2 | 23.7965 | 2 |
| 2 | Treatment | Target | 2 | 23.7571 | 2 |
| 3 | Treatment | Target | 2 | 23.724 | 2 |
| 1 | Treatment | Target | 0.4 | 26.3794 | 2 |
| 2 | Treatment | Target | 0.4 | 26.2542 | 2 |
| 3 | Treatment | Target | 0.4 | 25.9621 | 2 |
| 1 | Treatment | Target | 0.08 | 28.5479 | 2 |
| 2 | Treatment | Target | 0.08 | 28.3894 | 2 |
| 3 | Treatment | Target | 0.08 | 28.3416 | 2 |
| 1 | Control | Reference | 10 | 19.7415 | 3 |
| 2 | Control | Reference | 10 | 19.494 | 3 |
| 3 | Control | Reference | 10 | 19.3906 | 3 |
| 1 | Control | Reference | 2 | 21.9838 | 3 |
| 2 | Control | Reference | 2 | 22.4435 | 3 |
| 3 | Control | Reference | 2 | 22.57 | 3 |
| 1 | Control | Reference | 0.4 | 24.8109 | 3 |
| 2 | Control | Reference | 0.4 | 24.4327 | 3 |
| 3 | Control | Reference | 0.4 | 24.2342 | 3 |
| 1 | Control | Reference | 0.08 | 26.7319 | 3 |
| 2 | Control | Reference | 0.08 | 26.8206 | 3 |
| 3 | Control | Reference | 0.08 | 26.822 | 3 |
| 1 | Treatment | Reference | 10 | 18.4468 | 4 |
| 2 | Treatment | Reference | 10 | 18.8227 | 4 |
| 3 | Treatment | Reference | 10 | 18.3061 | 4 |
| 1 | Treatment | Reference | 2 | 21.2568 | 4 |
| 2 | Treatment | Reference | 2 | 21.0956 | 4 |
| 3 | Treatment | Reference | 2 | 20.8473 | 4 |
| 1 | Treatment | Reference | 0.4 | 23.2322 | 4 |
| 2 | Treatment | Reference | 0.4 | 22.9577 | 4 |
| 3 | Treatment | Reference | 0.4 | 23.2415 | 4 |
| 1 | Treatment | Reference | 0.08 | 25.4817 | 4 |
| 2 | Treatment | Reference | 0.08 | 25.608 | 4 |
| 3 | Treatment | Reference | 0.08 | 25.5675 | 4 |

The SAS program implements the data quality control processes.

$$Ct = \beta_{0} + \beta_{con} X_{lcon} + \varepsilon \qquad \text{(Equation 3)}$$

The input data is grouped as shown in Table

The data provides the input data for Program1_QC.sas and Program3_ANCOVA.sas. The data has grouped the Ct values according to the different combination of sample and gene.

The abbreviated SAS output for all the analyses.

**Data quality control**. The four classes represent the four combinations of sample and gene: reference gene in control sample, target gene in control sample, reference gene in treatment sample, and target gene in treatment sample. Each class should yield a linear correlation between Ct and the logarithm-transformed concentration of PCR product with a slope of -1.
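The quality control idea can be sketched outside SAS as a simple least-squares fit per group. The Python example below is an illustration only and is not the Program1_QC.sas program: it regresses Ct on log2(concentration) for each of the four groups of the sample data set and forms a t statistic for the hypothesis that the slope equals -1. The helper `fit_line` and the hard-coded critical value are assumptions introduced for this sketch.

```python
import math

# Template concentrations and Ct replicates copied from the sample data table.
CONCENTRATIONS = [10, 2, 0.4, 0.08]
CT_BY_GROUP = {
    1: [[23.1102, 22.9003, 22.8972], [26.5801, 26.2139, 26.0606],
        [28.1125, 28.1899, 27.5949], [30.2772, 30.4667, 30.7571]],  # Control-Target
    2: [[21.7813, 21.7564, 21.641], [23.7965, 23.7571, 23.724],
        [26.3794, 26.2542, 25.9621], [28.5479, 28.3894, 28.3416]],  # Treatment-Target
    3: [[19.7415, 19.494, 19.3906], [21.9838, 22.4435, 22.57],
        [24.8109, 24.4327, 24.2342], [26.7319, 26.8206, 26.822]],   # Control-Reference
    4: [[18.4468, 18.8227, 18.3061], [21.2568, 21.0956, 20.8473],
        [23.2322, 22.9577, 23.2415], [25.4817, 25.608, 25.5675]],   # Treatment-Reference
}
T_CRIT_DF10 = 2.228  # two-sided 5% critical value of Student's t with 10 df


def fit_line(xs, ys):
    """Ordinary least squares fit of y = b0 + b1*x; returns (b0, b1, se_b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return intercept, slope, se_slope


slopes = {}
for group, replicates in CT_BY_GROUP.items():
    # One log2(concentration) value per replicate, in the same order as the Ct values.
    xs = [math.log2(c) for c, cts in zip(CONCENTRATIONS, replicates) for _ in cts]
    ys = [ct for cts in replicates for ct in cts]
    intercept, slope, se_slope = fit_line(xs, ys)
    t_stat = (slope - (-1.0)) / se_slope  # H0: slope = -1
    slopes[group] = slope
    verdict = "differs from -1" if abs(t_stat) > T_CRIT_DF10 else "consistent with -1"
    print(f"group {group}: slope = {slope:.4f}, t = {t_stat:.2f} ({verdict})")
```

A slope close to -1 on the base-2 scale corresponds to an amplification efficiency close to 2, which is the condition the ΔΔCt method relies on.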

Multiple regression model

Several effects need to be taken into consideration in the ΔΔCt method, namely the effects of treatment, gene, concentration, and replicates. If we consider these effects as quantitative variables and relate the Ct number to these multiple effects and their interactions, we can develop a multiple regression model as in Equation 4.

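$$Ct = \beta_{0} + \beta_{con} X_{icon} + \beta_{treat} X_{itreat} + \beta_{gene} X_{igene} + \beta_{contreat} X_{icon} X_{itreat} + \beta_{congene} X_{icon} X_{igene} + \beta_{genetreat} X_{igene} X_{itreat} + \beta_{congenetreat} X_{icon} X_{itreat} X_{igene} + \varepsilon \qquad \text{(Equation 4)}$$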

In this model, Ct is the dependent variable, β_{0} is the intercept, the β_{x}s are the regression coefficients for the corresponding X (independent) terms, and ε is the error term. The ΔΔCt is estimated from the interaction of the gene and treatment effects, β_{genetreat}. Using the four groups of the sample data set, we can test β_{genetreat} and estimate the ΔΔCt from it. As shown in the ΔΔCt formula in Equation 2, if ΔΔCt is equal to 0, the ratio will be 1, which indicates no change in gene expression between control and treatment.
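One way to see why the gene-by-treatment coefficient carries the ΔΔCt is to write out the four cell means. The sketch below assumes 0/1 indicator coding (X_{treat} = 1 for treatment, X_{gene} = 1 for the target gene) and evaluates the model at X_{con} = 0; this coding is an assumption made for illustration and is not specified in the text. μ_{1} to μ_{4} denote the expected Ct of groups 1 to 4 of the sample data set.

$$
\begin{aligned}
\mu_{3} &= \beta_{0} && \text{(control, reference)}\\
\mu_{4} &= \beta_{0} + \beta_{treat} && \text{(treatment, reference)}\\
\mu_{1} &= \beta_{0} + \beta_{gene} && \text{(control, target)}\\
\mu_{2} &= \beta_{0} + \beta_{treat} + \beta_{gene} + \beta_{genetreat} && \text{(treatment, target)}\\
(\mu_{2}-\mu_{1}) - (\mu_{4}-\mu_{3}) &= \beta_{genetreat} = \Delta\Delta Ct
\end{aligned}
$$

Under a different coding scheme, such as -1/+1 effect coding, the interaction coefficient would be a rescaled version of the same contrast.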

A SAS program for multiple regression model

SAS procedure PROC GLM was used for ΔΔCt estimation in Program2_MR.sas in

The SAS program implements the multiple regression model and derives ΔΔCt.

The input data for Program2_MR.sas, which contains sample, gene, concentration and Ct number.

The SAS output gives a very comprehensive analysis of the data. We are interested in two aspects of the analysis. First, we want to test whether the ΔΔCt value is significantly different from 0 at

Analysis of covariance and SAS code

Another way to approach the real-time PCR data analysis is by using an analysis of covariance (ANCOVA). A simplified model can be derived from transforming the data into a grouped data as shown in Table

$$Ct = \beta_{0} + \beta_{con} X_{icon} + \beta_{group} X_{igroup} + \beta_{groupcon} X_{igroup} X_{icon} + \varepsilon \qquad \text{(Equation 5)}$$

We are interested in two questions here. First, are the covariance-adjusted averages among the four groups equal? Second, what is the difference in target gene Ct between treatment and control samples after correction by the reference gene? In this case, the null hypothesis will be (μ2-μ1)-(μ4-μ3) = 0, and the test will yield a parameter estimate of ΔΔCt as shown in the Program3_ANCOVA.sas (

The SAS program implements the ANCOVA model and derives ΔΔCt.

The SAS code implementing the ANCOVA model is similar to that of the multiple regression model. Either of the SAS procedures PROC GLM or PROC MIXED can be employed to implement the ANCOVA model; we used PROC MIXED here. The CLASS statement defines which variables will be grouped for significance testing. In this case, the variables are concentration and group, and ANCOVA assumes that these are co-varying in nature. The CONTRAST and ESTIMATE statements were used to contrast the group effect, which yields the ΔΔCt (-0.6848) as well as its standard error (0.1185) and 95% confidence interval (-0.9262, -0.4435). The SAS output with both confidence level and
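The group contrast (μ2-μ1)-(μ4-μ3) can also be worked through by hand. The Python sketch below is not the Program3_ANCOVA.sas program; it assumes a cell-means reading of the model in which both group and concentration are treated as class variables (16 cells of three replicates), pools the within-cell variance, and applies the contrast to the four group means. The critical value for 32 residual degrees of freedom is hard-coded as an assumption of the sketch.

```python
import math

# Ct replicates from the sample data table: group -> one list per concentration.
CT_BY_GROUP = {
    1: [[23.1102, 22.9003, 22.8972], [26.5801, 26.2139, 26.0606],
        [28.1125, 28.1899, 27.5949], [30.2772, 30.4667, 30.7571]],  # Control-Target
    2: [[21.7813, 21.7564, 21.641], [23.7965, 23.7571, 23.724],
        [26.3794, 26.2542, 25.9621], [28.5479, 28.3894, 28.3416]],  # Treatment-Target
    3: [[19.7415, 19.494, 19.3906], [21.9838, 22.4435, 22.57],
        [24.8109, 24.4327, 24.2342], [26.7319, 26.8206, 26.822]],   # Control-Reference
    4: [[18.4468, 18.8227, 18.3061], [21.2568, 21.0956, 20.8473],
        [23.2322, 22.9577, 23.2415], [25.4817, 25.608, 25.5675]],   # Treatment-Reference
}
T_CRIT_DF32 = 2.0369  # two-sided 5% critical value of Student's t with 32 df


def mean(values):
    return sum(values) / len(values)


# Pooled within-cell variance (mean squared error of the cell-means model).
cells = [cell for groups in CT_BY_GROUP.values() for cell in groups]
sse = sum((ct - mean(cell)) ** 2 for cell in cells for ct in cell)
df_resid = sum(len(cell) for cell in cells) - len(cells)  # 48 - 16 = 32
mse = sse / df_resid

# Group means over all concentrations and replicates (balanced design).
group_mean = {g: mean([ct for cell in reps for ct in cell])
              for g, reps in CT_BY_GROUP.items()}
n_per_group = 12

# Contrast (mu2 - mu1) - (mu4 - mu3) with coefficients -1, +1, +1, -1.
ddct = (group_mean[2] - group_mean[1]) - (group_mean[4] - group_mean[3])
se = math.sqrt(mse * 4 / n_per_group)
lcl, hcl = ddct - T_CRIT_DF32 * se, ddct + T_CRIT_DF32 * se
print(f"ddCt = {ddct:.4f}, SE = {se:.4f}, 95% CI = ({lcl:.4f}, {hcl:.4f})")
```

Under these assumptions the sketch gives values that agree, to rounding, with the estimate, standard error and confidence interval quoted above.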

Simplified alternatives: t-test and Wilcoxon two-group test

Simpler alternatives can be used to analyze real-time PCR data with biological replicates for each experiment. The primary assumption of this approach is that the additive effects of concentration, gene, and replicate can be adjusted for by subtracting the Ct number of the reference gene from that of the target gene, which will provide ΔCt as shown in Table

ΔCt calculation. The table presents the calculation of ΔCt, which is derived from subtracting Ct number of reference gene from that of the target gene. Con stands for concentration.

| Sample | Gene | Con | Ct | Sample | Gene | Con | Ct | ΔCt |
|---|---|---|---|---|---|---|---|---|
| Control | Target | 10 | 23.1102 | Control | Reference | 10 | 19.7415 | 3.3687 |
| Control | Target | 10 | 22.9003 | Control | Reference | 10 | 19.494 | 3.4063 |
| Control | Target | 10 | 22.8972 | Control | Reference | 10 | 19.3906 | 3.5066 |
| Control | Target | 2 | 26.5801 | Control | Reference | 2 | 21.9838 | 4.5963 |
| Control | Target | 2 | 26.2139 | Control | Reference | 2 | 22.4435 | 3.7704 |
| Control | Target | 2 | 26.0606 | Control | Reference | 2 | 22.57 | 3.4906 |
| Control | Target | 0.4 | 28.1125 | Control | Reference | 0.4 | 24.8109 | 3.3016 |
| Control | Target | 0.4 | 28.1899 | Control | Reference | 0.4 | 24.4327 | 3.7572 |
| Control | Target | 0.4 | 27.5949 | Control | Reference | 0.4 | 24.2342 | 3.3607 |
| Control | Target | 0.08 | 30.2772 | Control | Reference | 0.08 | 26.7319 | 3.5453 |
| Control | Target | 0.08 | 30.4667 | Control | Reference | 0.08 | 26.8206 | 3.6461 |
| Control | Target | 0.08 | 30.7571 | Control | Reference | 0.08 | 26.822 | 3.9351 |
| Treatment | Target | 10 | 21.7813 | Treatment | Reference | 10 | 18.4468 | 3.3345 |
| Treatment | Target | 10 | 21.7564 | Treatment | Reference | 10 | 18.8227 | 2.9337 |
| Treatment | Target | 10 | 21.641 | Treatment | Reference | 10 | 18.3061 | 3.3349 |
| Treatment | Target | 2 | 23.7965 | Treatment | Reference | 2 | 21.2568 | 2.5397 |
| Treatment | Target | 2 | 23.7571 | Treatment | Reference | 2 | 21.0956 | 2.6615 |
| Treatment | Target | 2 | 23.724 | Treatment | Reference | 2 | 20.8473 | 2.8767 |
| Treatment | Target | 0.4 | 26.3794 | Treatment | Reference | 0.4 | 23.2322 | 3.1472 |
| Treatment | Target | 0.4 | 26.2542 | Treatment | Reference | 0.4 | 22.9577 | 3.2965 |
| Treatment | Target | 0.4 | 25.9621 | Treatment | Reference | 0.4 | 23.2415 | 2.7206 |
| Treatment | Target | 0.08 | 28.5479 | Treatment | Reference | 0.08 | 25.4817 | 3.0662 |
| Treatment | Target | 0.08 | 28.3894 | Treatment | Reference | 0.08 | 25.608 | 2.7814 |
| Treatment | Target | 0.08 | 28.3416 | Treatment | Reference | 0.08 | 25.5675 | 2.7741 |
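The two-group comparison of the ΔCt values in the table can be illustrated with a pooled-variance t-test. The Python sketch below is not the Program4_TW.sas program; it is a hand calculation on the 24 ΔCt values above, with the critical value for 22 degrees of freedom hard-coded as an assumption.

```python
import math

# dCt values (target Ct minus reference Ct) copied from the table above.
CONTROL = [3.3687, 3.4063, 3.5066, 4.5963, 3.7704, 3.4906,
           3.3016, 3.7572, 3.3607, 3.5453, 3.6461, 3.9351]
TREATMENT = [3.3345, 2.9337, 3.3349, 2.5397, 2.6615, 2.8767,
             3.1472, 3.2965, 2.7206, 3.0662, 2.7814, 2.7741]
T_CRIT_DF22 = 2.0739  # two-sided 5% critical value of Student's t with 22 df


def mean(values):
    return sum(values) / len(values)


def sum_sq(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


n_c, n_t = len(CONTROL), len(TREATMENT)

# ddCt is the difference between the mean dCt of treatment and control.
ddct = mean(TREATMENT) - mean(CONTROL)

# Pooled variance and standard error of the difference between means.
pooled_var = (sum_sq(CONTROL) + sum_sq(TREATMENT)) / (n_c + n_t - 2)
se = math.sqrt(pooled_var * (1 / n_c + 1 / n_t))

t_stat = ddct / se
lcl, hcl = ddct - T_CRIT_DF22 * se, ddct + T_CRIT_DF22 * se
print(f"ddCt = {ddct:.4f}, SE = {se:.4f}, t = {t_stat:.2f}, "
      f"95% CI = ({lcl:.4f}, {hcl:.4f})")
```

The equal-variance form of the test is used here; an unequal-variance (Welch) version would give the same point estimate with a slightly different standard error and degrees of freedom.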

As a non-parametric alternative to the

A SAS program has been developed for both

The SAS program performs both Student's t-test and the Wilcoxon two-group test on the ΔCt to derive ΔΔCt.

The SAS program is a macro that derives confidence interval for Wilcoxon two group tests.

The input data for Program4_TW.sas. The data contains only sample name and ΔCt.

Comparison of four approaches and data presentation

A comparison of the four approaches is presented in Table

The comparison of four approaches. The table lists the ΔΔCt, standard error, P-value, and confidence interval derived from each approach.

| Model | ΔΔCt | Standard Error | P-value | Confidence Interval |
|---|---|---|---|---|
| Multiple Regression | -0.6848 | 0.1185 | < 0.0001 | (-0.9262, -0.4435) |
| ANCOVA | -0.6848 | 0.1185 | < 0.0001 | (-0.9262, -0.4435) |
| t-test | -0.6848 | 0.1303 | < 0.0001 | (-0.955, -0.4147) |
| Wilcoxon Test | -0.6354 | | < 0.0001 | (-0.8805, -0.4227) |
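To illustrate how a rank-based point estimate can differ from the difference in means, the Python sketch below computes the Mann-Whitney count and the Hodges-Lehmann shift estimate (the median of all pairwise treatment minus control differences) for the ΔCt values. The choice of the Hodges-Lehmann estimator is an assumption of this sketch; the text does not state which estimator the SAS macro uses, so the result is only meant for comparison with the Wilcoxon row above.

```python
import statistics

# dCt values (target Ct minus reference Ct) from the dCt calculation table.
CONTROL = [3.3687, 3.4063, 3.5066, 4.5963, 3.7704, 3.4906,
           3.3016, 3.7572, 3.3607, 3.5453, 3.6461, 3.9351]
TREATMENT = [3.3345, 2.9337, 3.3349, 2.5397, 2.6615, 2.8767,
             3.1472, 3.2965, 2.7206, 3.0662, 2.7814, 2.7741]

# All pairwise differences between a treatment value and a control value.
pairwise_diffs = [t - c for t in TREATMENT for c in CONTROL]

# Mann-Whitney count: number of pairs in which treatment exceeds control.
u_treatment_above = sum(1 for d in pairwise_diffs if d > 0)

# Hodges-Lehmann estimate of the location shift (a rank-based ddCt).
hodges_lehmann = statistics.median(pairwise_diffs)

print(f"pairs = {len(pairwise_diffs)}, treatment > control in "
      f"{u_treatment_above} pairs, shift estimate = {hodges_lehmann:.4f}")
```

Because the estimate is a median of differences, a single large ΔCt value (such as 4.5963 in the control group) influences it less than it influences the difference in means.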

Data quality control

Many current real-time PCR experiments do not include a standard curve design, nor do they use a method to estimate the amplification efficiency. We argue here that real-time PCR data without proper quality controls are not reliable, since the efficiency of real-time PCR can have a significant impact on the ratio estimation and dynamic range. For example, if a PCR has a percentage amplification efficiency (PE) of 0.8 (i.e. the PCR product increases 2^{0.8} times instead of two times per cycle), a ΔCt value of 3 corresponds to only a 5.28-fold difference in ratio (2^{0.8 × 3} = 2^{2.4}) instead of an 8-fold difference. This problem is amplified when the ΔΔCt or ΔCt values are larger and the amplification efficiency is lower, which could lead to severely skewed interpretations.
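The arithmetic of this example can be reproduced in a few lines. The Python sketch below is illustrative; the function name `fold_change` is hypothetical, and it simply evaluates 2 raised to the power PE × ΔCt.

```python
def fold_change(delta_ct, pe=1.0):
    """Fold difference implied by a Ct difference when the product grows
    2**pe times per cycle (pe = 1.0 means perfect doubling)."""
    return 2.0 ** (pe * delta_ct)


ideal = fold_change(3)              # perfect doubling
reduced = fold_change(3, pe=0.8)    # percentage efficiency of 0.8
print(f"PE = 1.0: {ideal:.2f}-fold, PE = 0.8: {reduced:.2f}-fold")

# The gap widens as the Ct difference grows.
for delta_ct in (1, 3, 6):
    print(delta_ct, round(fold_change(delta_ct) / fold_change(delta_ct, 0.8), 2))
```

The loop shows the overestimation factor incurred by assuming perfect doubling, which grows with the Ct difference.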

We therefore propose two standards for real-time PCR data quality control according to the model using the SAS programs presented in this paper. First, experiments with a serial dilution of template need to be included in order to estimate the amplification efficiency of each gene with each sample. Some researchers assume that the amplification efficiency for each gene is the same in different samples because the same primer pair and amplification conditions are used. However, we found that the sample does have an effect on the amplification efficiency. In other words, the amplification efficiency could be different for the same gene when amplified from different cDNA template samples. We therefore consider an experimental design with a standard curve for each gene and sample combination to be optimal. Second, under optimal conditions, a plot of the Ct number against the base-2 logarithm of template amount should yield a slope not significantly different from -1, which indicates an amplification efficiency of nearly 2. Even though both the efficiency-calibrated model and the modified ΔΔCt model tolerate amplification efficiencies lower than 2, it is most reliable to have all reactions with an amplification efficiency approximating 2 by optimizing primer choices, amplicon lengths and experimental conditions. In our experience, maintaining all amplification efficiencies near 2 is the best way to reach equal amplification efficiency among the samples and thus to ensure high quality data. We also observed that an amplification efficiency near 2 can help to expand the dynamic range of ratio estimation.

The

Some publications present a standard deviation of the ratio as a meaningful metric. However, we argue here that the standard deviation of the ratio should be derived from the standard deviation of ΔΔCt, and the confidence interval of the ratio should be derived from the confidence interval of ΔΔCt. In other words, the point estimate of the ratio should be 2^{-ΔΔCt} and the confidence interval for the ratio should be (2^{-ΔΔCtHCL}, 2^{-ΔΔCtLCL}), where ΔΔCtLCL and ΔΔCtHCL are the lower and upper confidence limits of ΔΔCt. Since Ct is the observed value from experimental procedures, it should be the subject of statistical analysis. Performing statistical analysis on the ratio directly is not appropriate. The presentation of data needs to refer to the ΔΔCt, and subsequently to the ratio and confidence intervals derived from 2^{-ΔΔCt}.
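As an illustration of this back-transformation, the Python sketch below converts the ΔΔCt and confidence limits reported for the ANCOVA model into a ratio with its confidence interval. The function name `ratio_with_ci` is hypothetical. Note that the limits swap roles: the upper limit of ΔΔCt gives the lower limit of the ratio.

```python
def ratio_with_ci(ddct, ddct_lcl, ddct_hcl):
    """Back-transform ddCt and its confidence limits to the ratio scale.

    Returns (ratio, ratio_lower, ratio_upper). Because 2**(-x) is a
    decreasing function, the upper ddCt limit maps to the lower ratio limit.
    """
    ratio = 2.0 ** (-ddct)
    ratio_lower = 2.0 ** (-ddct_hcl)
    ratio_upper = 2.0 ** (-ddct_lcl)
    return ratio, ratio_lower, ratio_upper


# ddCt estimate and 95% confidence limits reported for the ANCOVA model.
ratio, ratio_lower, ratio_upper = ratio_with_ci(-0.6848, -0.9262, -0.4435)
print(f"ratio = {ratio:.2f}, 95% CI = ({ratio_lower:.2f}, {ratio_upper:.2f})")
```

The resulting interval is not symmetric around the ratio, which is expected when a symmetric interval on the Ct scale is exponentiated.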

Statistical analysis for real-time PCR data with amplification efficiency less than 2

As stated before, the PCR amplification efficiency can be optimized to be approximately 2 with proper amplification primers, RNA quality, and cDNA synthesis protocol. Recent advancements in real-time PCR primer design have allowed easier experimental optimization

Several approaches have been developed to calculate the amplification efficiency in low quality data. One such approach is the so-called 'dynamic data analysis', in which the fluorescence history of a PCR reaction is employed to calculate the amplification efficiency. The efficiency can also be calculated from the slope of the standard curve: if X_{lcon} represents the base-10 logarithm-transformed concentration, the amplification efficiency (E) is 10^{-(1/slope)}; if, as in our approach, X_{lcon} represents the base-2 logarithm-transformed concentration, the amplification efficiency (E) is 2^{-(1/slope)}, or 2^{-(1/β_{con})}.
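The slope-to-efficiency conversion can be sketched as follows. This Python example is illustrative and the function names are hypothetical; the two slopes are the pooled slopes reported later in this section for the low quality data set.

```python
import math


def amplification_efficiency(slope, log_base=2.0):
    """E = base**(-1/slope) for a standard curve of Ct against
    log_base(concentration); E = 2 means perfect doubling."""
    return log_base ** (-1.0 / slope)


def percentage_efficiency(slope_log2):
    """PE = -(1/slope) for a curve of Ct against log2(concentration),
    so that E = 2**PE."""
    return -1.0 / slope_log2


# A slope of -1 on the log2 scale is perfect doubling; the equivalent
# slope on the log10 scale is -1/log10(2), about -3.32.
print(amplification_efficiency(-1.0))
print(amplification_efficiency(-1.0 / math.log10(2.0), log_base=10.0))

# Pooled slopes reported for the two genes in the low quality data set.
for slope in (-1.0813, -1.0137):
    pe = percentage_efficiency(slope)
    print(f"slope = {slope}: PE = {pe:.4f}, E = {amplification_efficiency(slope):.4f}")
```

The percentage efficiencies obtained this way are close to the 0.925 and 0.987 quoted in the text; small differences in the last digit can arise because the slopes are given to four decimal places.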

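In the first scenario discussed above, all PCR amplifications have the same efficiency, but the percentage efficiency is not equal to 1. The ratio of gene expression can then be represented by the following equation.

$$\mathrm{Ratio} = E^{-\Delta\Delta Ct} = 2^{-PE \times \Delta\Delta Ct} = 2^{-\Delta\Delta Ct_{adjust}} \qquad \text{(Equation 6)}$$

where PE = -(1/β_{con}), and ΔΔCt_{adjust} = PE × ΔΔCt.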

In the Equation 6, β_{con }is the pooled slope of the plot with Ct against logarithm 2 based concentration. The β_{con }can be calculated with a correlation function in SAS as shown in Program5_LowQualityData.sas in _{0 }is the pooled slope of the plot of Ct against log_{2 }(concentration) for each gene.

The SAS program performs test for equal slope, grouped slope and adjusted ΔΔCt.

where PE_{target} = -(1/β_{conTarget}), PE_{control} = -(1/β_{conControl}), and ΔΔCt_{adjust} = PE_{target} × ΔCt_{target} - PE_{control} × ΔCt_{control}

In Equation 7, β_{conTarget} and β_{conControl} are the pooled slopes for the plot of Ct against the base-2 logarithm of concentration for the target gene and the reference gene respectively. The slopes can be calculated by Program5_LowQualityData.sas, and ΔΔCt_{adjust} can be calculated with the same program. Theoretically, an equation can also be derived for the third scenario, in which PCR amplification efficiency differs both by gene and by sample. However, in actual application, we do not consider the data in the third scenario acceptable, due to the significant variation of the amplification efficiency.

Program5_LowQualityData.sas performs the analysis of the low quality data and calculates ΔΔCt_{adjust}. The first step is to perform the data quality control test as shown in Methods. From the SAS output, we can conclude that the LowQualityData dataset does not meet the requirements for the 2^{-ΔΔCt} method, since one group of PCRs has a percentage amplification efficiency significantly different from 1, as shown in the data quality control for LowQualityData dataset part of SASOutput.doc.

The input data for Program5_LowQualityData.sas.

The second step is to test the equal PCR efficiency (or slope) by observing the Type III sums of squares for lcon and class interaction. A low _{adjust }calculation. In this set of data, the Type III sums of squares has a

The next step is to calculate the pooled slope (β_{con}) for each gene to derive the percentage amplification efficiency (PE = -(1/β_{con})) for each gene. The pooled slopes are derived from the correlation between Ct and the base-2 logarithm of concentration. The β_{con}s for the two genes are -1.0813 and -1.0137 respectively, as shown in SASOutput.doc. From β_{con}, -(1/β_{con}), or PE, can be calculated for each gene as 0.925 and 0.987 respectively. The ΔΔCt_{adjust} can then be computed with the PEs substituted for the 1s for each gene in the 'estimate' and 'contrast' statements. The SAS program is as follows.

```sas
TITLE2 'Calculate the deltadeltaCt with adjusted efficiency';

/* Class is the gene-sample group; Con is the template concentration */
PROC MIXED data=TR2 Order=Data;
  CLASS Class Con;
  MODEL Ct = Con Class Con*Class / SOLUTION NOINT;
  /* Group coefficients use the PE of each gene (0.925, 0.987) in place of 1 */
  CONTRAST 'Intercepts' Class 0.925 -0.925 -0.987 0.987;
  ESTIMATE 'Intercepts' Class 0.925 -0.925 -0.987 0.987 / cl;
RUN;
```

The SAS output for the analysis is in SASOutput.doc. The ΔΔCt_{adjust} is therefore -1.0901 and the change is significant since

Overall, in less optimized PCR reactions, statistical analysis is not only more complicated but also compromised in precision and efficiency. Caution should therefore be exercised when performing statistical analysis on low quality real-time PCR data, since the efficiency adjustment may easily introduce error.

Conclusion

In this report, we presented four models for the statistical analysis of real-time PCR data and one procedure for data quality control. SAS programs were developed for all the applications and a sample set of data was analyzed. The analyses with the different models and programs yielded the same or very similar estimates of ΔΔCt and similar confidence intervals. The data quality control and analysis procedures will help to establish robust systems for studying relative gene expression with real-time PCR.

Methods

Plant material, RNA extraction, real-time PCR and sample data set

The sample data set (Table

Real-time PCR experimental design, data output, transformation, and programming

A main limitation of the efficiency-calibrated method and the ΔΔCt method is that only one set of cDNA samples is employed to determine the amplification efficiency. It is assumed that the same amplification efficiency can be applied to other cDNA samples as long as the primers and amplification conditions are the same. However, amplification efficiency not only depends on the primer characteristics, but also varies among different cDNA samples. Using a standard curve for only one set of tested samples to derive the amplification efficiency might overlook the error introduced by sample differences. In our experimental design, we performed standard curve experiments with four concentrations and three replicates for all samples and genes involved. The ΔΔCt is derived from the standard curves only, and the data quality is examined for each gene and sample combination. The analysis of two samples is presented in the paper as an example. A minimum of two replicates at each of three concentrations is required for each sample. Even though more effort is required, the data are more reliable because of the stringent data quality control and the data analysis based on statistical models.

The output dataset included Ct number, gene name, sample name, concentration and replicate. We used Microsoft^{® }Excel to open the exported Ct file from an ABI 7000 sequence analysis system and then to transform data into a tab delimited text file for SAS processing. The sample data set is shown in Table

All programs were developed with SAS 9.1 (SAS Institute).

Authors' contributions

JSY carried out the real-time PCR experiments, developed the statistical model and SAS programs for analysis, and drafted the article. AR provided assistance in SAS programming and data modeling. FC provided assistance in real-time PCR experiments. CNS provided oversight of the work, conceptualized non-parametric elements, and finalized the draft.