Center for Integrated Bioinformatics, School of Biomedical Engineering, Science and Health Systems, Drexel University, 3120 Market Street, Philadelphia, PA 19104, USA

Abstract

Background

MicroRNAs (miRNAs) are short non-coding RNA molecules that regulate mRNA transcript levels and translation. Deregulation of microRNAs has been implicated in a number of diseases, and microRNAs are seen as promising targets for biomarker identification and drug development. miRNA expression is commonly measured by microarray or real-time polymerase chain reaction (RT-PCR). The findings drawn from RT-PCR data are highly dependent on the normalization techniques used during preprocessing of the Cycle Threshold readings. Some of the commonly used endogenous controls have themselves been found to be differentially expressed in various conditions such as cancer, making them inappropriate internal controls.

Methods

We demonstrate that RT-PCR data contains a systematic bias resulting in large variations in the Cycle Threshold (CT) values of the low-abundant miRNA samples. We propose a new data normalization method that considers all available microRNAs as endogenous controls. A weighted normalization approach is utilized to allow contribution from all microRNAs, weighted by their empirical stability.

Results

The systematic bias in RT-PCR data is illustrated on a microRNA dataset obtained from primary cutaneous melanocytic neoplasms. We show that, through a single control parameter, this method is able to emulate other commonly used normalization methods and thus provides a more general approach. We explore the consistency of RT-PCR expression data with microarray expression by utilizing a dataset in which both RT-PCR and microarray profiling data are available for the same miRNA samples.

Conclusions

A weighted normalization method allows all of the miRNAs to contribute, whether they are highly abundant or have low expression levels. Our findings further suggest that the normalization of a particular miRNA should rely only on miRNAs that have comparable expression levels.

Background

MicroRNAs (miRNAs) are short non-coding RNA sequences that average 22 nucleotides in length

miRNAs have been discovered to play a role in many diseases and pathologies

There are two main tools used to quantify the expression of miRNAs: microarrays and real-time polymerase chain reaction (RT-PCR). RT-PCR returns the number of cycles that the samples underwent before they were detected, reported as a value known as the Cycle Threshold (CT). The CT values vary logarithmically with expression levels. There are several methods of normalizing the data and calculating the fold-change of each gene between samples. For convenience, the terms "miRNA" and "gene" are used interchangeably here in the context of RT-PCR. ΔCT values are calculated by subtracting the CT value of the endogenous control for a given sample (or the mean of the CT values of the endogenous controls if more than one exists) from the CT value of the gene for the given sample. In the calculation of ΔCT values we refer to the number subtracted from the raw CT values of each gene as CT_{0}.
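The ΔCT calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name `delta_ct`, the dictionary layout, and the CT values are assumptions made for the example, not data or code from the study.

```python
from statistics import mean

def delta_ct(ct_by_gene, control_genes):
    """Return the ΔCT of every gene in one sample.

    ct_by_gene: dict mapping gene name -> raw CT value in this sample.
    control_genes: names of the endogenous controls (illustrative input).
    """
    # CT_0 is the control CT, or the mean of the control CTs if several exist.
    ct0 = mean(ct_by_gene[g] for g in control_genes)
    # Subtract CT_0 from the raw CT of every gene.
    return {gene: ct - ct0 for gene, ct in ct_by_gene.items()}

# Made-up CT values for a single sample.
sample = {"miR-A": 24.0, "miR-B": 30.5, "RNU44": 20.0, "RNU48": 22.0}
dct = delta_ct(sample, ["RNU44", "RNU48"])
```

Because CT varies logarithmically with expression, a larger ΔCT corresponds to lower expression relative to the controls.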

Theoretically, endogenous controls are selected because they have low variance in their expression levels across samples. In the case of miRNAs, the endogenous controls are typically recommended by the manufacturer of the miRNA kit used in the PCR. Some of the most commonly used endogenous controls are RNU44, RNU48, and U6

Directly applying this method can lead to misleading results if the CT values in the data are not normalized. There are several commonly used methods for miRNA normalization, including quantile normalization, median normalization, and cyclic loess. Quantile normalization involves sorting the expression values of the genes in a given sample from least to greatest. This is done for each sample in the study. The vectors of sorted CT values for each sample are combined into a matrix, and the mean of each row of the matrix is calculated. The CT value in each element of a row is then replaced with the mean of that row. In the case of median quantile normalization, the median of the row is used instead of the mean. The CT values in each sample are then rearranged back into their original order. This causes the distribution of CT values across all samples to assume the same shape, which will minimize the variance except for that resulting from the experimental condition being studied
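The quantile normalization steps listed above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name and the toy CT values are assumptions, and ties are not given any special treatment.

```python
from statistics import mean, median

def quantile_normalize(samples, use_median=False):
    """Quantile-normalize CT values.

    samples: list of equal-length lists, one list of CT values per sample,
             with genes in the same order in every sample.
    use_median: if True, use the row median (median quantile normalization).
    """
    n_genes = len(samples[0])
    center = median if use_median else mean
    # For each sample, the gene indices ordered from least to greatest CT.
    orders = [sorted(range(n_genes), key=s.__getitem__) for s in samples]
    # Row r of the sorted matrix holds the r-th smallest CT of every sample;
    # its mean (or median) becomes the reference value for rank r.
    reference = [
        center(s[o[r]] for s, o in zip(samples, orders)) for r in range(n_genes)
    ]
    normalized = []
    for order in orders:
        out = [0.0] * n_genes
        # Put each reference value back at the gene's original position.
        for rank, gene_idx in enumerate(order):
            out[gene_idx] = reference[rank]
        normalized.append(out)
    return normalized

# Two made-up samples with three genes each.
qn = quantile_normalize([[20.0, 30.0, 25.0], [22.0, 26.0, 32.0]])
```

After normalization every sample contains the same set of values, only assigned to different genes, which is why the per-sample distributions take the same shape.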

Median normalization shifts the CT values in each sample such that the median CT value of each sample is the same. The median of each plate should be determined, and the medians of all plates should be arranged in a vector and sorted to determine the median of the medians. In each plate the difference between the median of the sample and the overall median should be subtracted from the CT value of each gene
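As a rough sketch of the median normalization just described, assuming one sample per plate and made-up CT values (the function name is hypothetical):

```python
from statistics import median

def median_normalize(plates):
    """Shift each plate so that all plates share the same median CT.

    plates: list of lists of CT values, one list per plate.
    """
    plate_medians = [median(p) for p in plates]
    # The median of the per-plate medians is the common target.
    overall = median(plate_medians)
    # Subtract (plate median - overall median) from every CT on the plate.
    return [
        [ct - (m - overall) for ct in plate]
        for plate, m in zip(plates, plate_medians)
    ]

# Made-up CT values for three plates.
shifted = median_normalize(
    [[20.0, 24.0, 28.0], [22.0, 26.0, 30.0], [21.0, 25.0, 29.0]]
)
```

Only a constant is subtracted per plate, so the spread of CT values within each plate is left unchanged.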

In cyclic loess normalization, pairs of plates are considered. For all pairs of plates the difference of the log of the CT for each gene is represented by

A number of normalization methods developed for microarrays have been applied to RT-PCR experiments. These methods assume that all miRNAs present in the organism are being profiled in the experiment. While microarrays can profile all miRNAs encoded in a genome, this assumption does not hold for RT-PCR experiments which typically only profile a few hundred miRNAs at a given time

One of the main problems with RT-PCR that remains unaddressed by current normalization methods is the systematic bias present within the data. We observe that the standard deviation increases as CT values increase. We believe the most likely cause is that the assumption of an exact doubling of the expression level at each PCR cycle is inaccurate. There seems to be an accumulation of an expression-level-specific rate-limiting effect. As a result, a small difference in the size of the initial sample being amplified causes larger variations in the CT values of the less abundant microRNA molecules. Consequently, using endogenous controls, which are usually chosen from highly expressed microRNAs, becomes inappropriate for normalizing the less abundant microRNAs. Even quantile normalization has been observed to produce more variance at high CT values than was present in the original raw data

Methods

The primary dataset used in this study was obtained from a recently deposited microRNA RT-PCR dataset in the Gene Expression Omnibus (GEO)

We have investigated several normalization methods, including quantile, mean, and median normalization methods, and endogenous controls identified using various stability criteria. In mean and median normalization, the mean and median of all of the genes in a given sample are used, respectively, as the value for CT_{0}.

A new weighted mean metric is proposed using the standard deviations of the microRNAs as weights. For a given gene, the weighted average is calculated using the following equation:

where CT_{0}
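A minimal sketch of a stability-weighted mean is given below. The exact weighting is an assumption made for illustration: each miRNA is weighted by the inverse of its standard deviation across samples raised to a power p, so that p = 0 reduces to the plain mean and a large p is dominated by the most stable miRNA. The function name and the CT values are likewise hypothetical.

```python
from statistics import stdev

def weighted_mean_ct0(ct, power):
    """Return one CT_0 value per sample as a stability-weighted mean.

    ct: dict mapping gene name -> list of CT values, one per sample.
    power: the weighted mean power p.
    ASSUMPTION: weight_i = 1 / stdev_i ** p; not confirmed by the text.
    """
    # A gene with zero standard deviation would need special handling.
    weights = {g: 1.0 / stdev(values) ** power for g, values in ct.items()}
    total = sum(weights.values())
    n_samples = len(next(iter(ct.values())))
    return [
        sum(weights[g] * ct[g][j] for g in ct) / total for j in range(n_samples)
    ]

# Made-up data: a stable gene (stdev 1) and a less stable gene (stdev 3).
toy = {"stable": [20.0, 21.0, 22.0], "noisy": [30.0, 33.0, 36.0]}
plain_mean = weighted_mean_ct0(toy, 0)   # equal weights
moderate = weighted_mean_ct0(toy, 1)     # noisy gene counts one third as much
near_single = weighted_mean_ct0(toy, 50) # effectively the stable gene alone
```

Under this assumed weighting, a single parameter moves the result between mean normalization and normalization by the single most stable miRNA.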

We also examined the reproducibility of miRNA expression experiments between RT-PCR and microarray. To explore this topic we utilized data from Chen et al.

Results and Discussion

In order to test the hypothesis that increasing CT values magnifies the natural variation between the initial amounts of samples loaded in each well during RT-PCR, we examined the standard deviation of the genes against their mean CT values, as shown in Figure

**Dependence of variability on expression level**. Each point represents the standard deviation versus the mean of the CT values for a particular microRNA across all samples.

As expected, the CT values of most genes are well correlated with the mean expression of all the genes. This is illustrated in Figure

**miRNAs most correlated with the mean expression value**. The CT values of 20 miRNAs where the change between samples is most correlated with the change in the mean expression value from sample to sample.

The correlation with the mean expression level extends to low-abundant miRNAs. We demonstrate this in Figure

**Correlation with mean vs. mean CT value**. Each point represents the correlation of a particular miRNA with the mean expression of all miRNAs across multiple samples vs. the mean of the CT value of that particular microRNA. The x-values are the mean CT of a miRNA, and the y-values are the correlation of the vector of sample differences from the mean for each miRNA with the vector of the sample mean differences from the overall mean.
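The quantity plotted in this figure can be sketched as below. The text does not state which correlation coefficient was used, so Pearson correlation is assumed here; the function names and the CT values are illustrative only.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def correlation_with_mean(ct):
    """Map each gene to (mean CT, correlation with the sample means).

    ct: dict mapping gene name -> list of CT values, one per sample.
    """
    genes = list(ct)
    n_samples = len(ct[genes[0]])
    # Mean CT of all genes in each sample, and its overall mean.
    sample_means = [mean(ct[g][j] for g in genes) for j in range(n_samples)]
    overall = mean(sample_means)
    mean_diffs = [m - overall for m in sample_means]
    result = {}
    for g in genes:
        gene_mean = mean(ct[g])
        # Differences of this gene's CT from its own mean, per sample.
        gene_diffs = [v - gene_mean for v in ct[g]]
        result[g] = (gene_mean, pearson(gene_diffs, mean_diffs))
    return result

# Made-up CT values for three genes across three samples.
toy = {"a": [20.0, 21.0, 22.0], "b": [30.0, 32.0, 34.0], "c": [28.0, 27.0, 26.0]}
points = correlation_with_mean(toy)
```

Each returned pair corresponds to one point in the figure: x is the gene's mean CT and y is its correlation with the mean.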

In order to quantify the sensitivity of the microRNA expression levels to the initial loaded sample size, a regression line is fitted to the fluctuation of each miRNA against the fluctuation of mean expression. Fluctuation is determined by subtracting the value of the expression of a miRNA in a given sample from the mean expression of that miRNA for all samples; the overall mean of the expression of all miRNAs in all samples can be subtracted from the sample means to determine the fluctuation of a sample mean. In Figure

**Correlation of miRNA fluctuation with mean fluctuation**. Each point represents a single sample in the study. The x-value of the point is the difference of the sample mean from the overall mean and the y-value is the difference between the expression of an miRNA in that sample and the mean expression of that miRNA across all samples. The slope of the line quantifies the sensitivity of fluctuations in the miRNA's value to fluctuations in the overall mean. This plot is presented as an example for a single microRNA; all miRNAs were plotted in this fashion.
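A least-squares fit of this kind can be sketched as follows. The function name and the CT values are assumptions for illustration; fluctuations are taken as value minus mean for both axes, so the sign convention does not affect the slope.

```python
from statistics import mean

def fluctuation_slopes(ct):
    """Slope of each gene's fluctuation regressed on the mean fluctuation.

    ct: dict mapping gene name -> list of CT values, one per sample.
    """
    genes = list(ct)
    n_samples = len(ct[genes[0]])
    sample_means = [mean(ct[g][j] for g in genes) for j in range(n_samples)]
    overall = mean(sample_means)
    # x: fluctuation of each sample mean around the overall mean.
    x = [m - overall for m in sample_means]
    sxx = sum(v * v for v in x)
    slopes = {}
    for g in genes:
        gene_mean = mean(ct[g])
        # y: fluctuation of the gene in each sample around its own mean.
        y = [v - gene_mean for v in ct[g]]
        # x and y are both mean-centered, so the least-squares slope
        # (with intercept) reduces to sum(x*y) / sum(x*x).
        slopes[g] = sum(a * b for a, b in zip(x, y)) / sxx
    return slopes

# Made-up data: the higher-CT gene swings twice as much between samples.
toy = {"low_ct": [20.0, 21.0, 22.0], "high_ct": [30.0, 32.0, 34.0]}
slopes = fluctuation_slopes(toy)
```

A slope above 1 means the miRNA fluctuates more than the sample mean does; a slope below 1 means it fluctuates less.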

**A plot of fluctuation response vs. expression level**. Each point represents the slope of a particular miRNA as shown by example in Figure 4. All miRNAs' slopes are plotted against their mean CT to show that as CT increases the response to sample fluctuations also increases.

**Difference ratio vs. expression level**. Each point represents the ratio of the differences in expression level of a microRNA and the mean of all microRNAs against the mean of the CT values for that particular microRNA across all samples. The difference ratio is calculated by dividing the difference of a miRNA's expression in a particular sample by the difference between that sample's mean and the overall mean. For each miRNA a vector of difference ratios is calculated with one value for each sample. On the figure the y-axis represents the mean difference ratio for a particular miRNA. The ratio of this difference increases with increasing CT, demonstrating that lowly expressed miRNAs are more sensitive to fluctuations in the mean.
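The mean difference ratio described in this caption might be computed as in the sketch below. Two points are assumptions: a miRNA's difference is taken relative to its own mean across samples, and samples whose mean equals the overall mean are skipped to avoid division by zero. The function name and data are hypothetical.

```python
from statistics import mean

def mean_difference_ratios(ct, eps=1e-9):
    """Map each gene to (mean CT, mean difference ratio).

    ct: dict mapping gene name -> list of CT values, one per sample.
    eps: samples whose mean differs from the overall mean by less than
         this are skipped (an assumption, to avoid dividing by zero).
    """
    genes = list(ct)
    n_samples = len(ct[genes[0]])
    sample_means = [mean(ct[g][j] for g in genes) for j in range(n_samples)]
    overall = mean(sample_means)
    result = {}
    for g in genes:
        gene_mean = mean(ct[g])
        # One ratio per usable sample: gene difference / sample-mean difference.
        ratios = [
            (ct[g][j] - gene_mean) / (sample_means[j] - overall)
            for j in range(n_samples)
            if abs(sample_means[j] - overall) > eps
        ]
        result[g] = (gene_mean, mean(ratios))
    return result

# Made-up CT values for two genes across three samples.
toy = {"low_ct": [20.0, 21.0, 22.0], "high_ct": [30.0, 32.0, 34.0]}
ratios = mean_difference_ratios(toy)
```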

In conclusion, the fluctuations of the low-abundant miRNAs are not random. The changes in their expression levels are correlated well with the overall changes in all miRNAs, which is assumed to be due to different starting sample sizes for the PCR reactions. We see that there is a systematic bias in the CT values that causes the expression levels of the low-abundant miRNAs to be more sensitive to the initial sample sizes.

We then investigated the suitability of our weighted mean metric. In Figure _{0 }_{0 }_{0 }_{0 }as using the mean and almost exactly the same standard deviations and geNorm stability values (data not shown), thus using the geometric mean had no advantage over using the mean.

**Comparison of the CT_{0}**.

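Mean Normalization

| **top-k** | **mean CT** | **stdev** | **geNorm** |
|---|---|---|---|
| 1 | 20.92 | 0.69 | 0.35 |
| 2 | 23.2 | 0.64 | 0.21 |
| 3 | 23.8 | 0.64 | 0.19 |
| 4 | 22.13 | 0.63 | 0.2 |
| 5 | 22.11 | 0.61 | 0.17 |
| 6 | 22.79 | 0.6 | 0.18 |
| 7 | 22.91 | 0.61 | 0.16 |
| 8 | 23.66 | **0.59** | **0.15** |
| 9 | 23.67 | 0.6 | 0.16 |
| 10 | 23.61 | 0.61 | 0.16 |
| ∞ | 25.59 | 0.71 | 0.23 |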

Mean _{0 }_{0 }

Weighted Mean Normalization

| **power** | **mean CT** | **stdev** | **geNorm** |
|---|---|---|---|
| **1** | 25.34 | 0.69 | 0.21 |
| **3** | 24.82 | 0.67 | 0.18 |
| **5** | 24.35 | 0.65 | 0.15 |
| **7** | 23.96 | 0.64 | 0.14 |
| **9** | 23.65 | 0.63 | 0.13 |
| **11** | 23.41 | 0.62 | 0.12 |
| **13** | 23.21 | 0.62 | 0.12 |
| **15** | 23.04 | 0.62 | 0.12 |
| **17** | 22.89 | 0.61 | 0.12 |
| **19** | 22.76 | 0.61 | 0.13 |

Mean _{0 }

Using the top 10 miRNAs as endogenous controls

| **miRNA** | **mean CT** | **stdev** | **geNorm** |
|---|---|---|---|
| **191** | 20.92 | 0.69 | 1.14 |
| **744** | 25.49 | 0.72 | 1.17 |
| **152** | 25 | 0.73 | 1.12 |
| **MammU6** | 17.12 | 0.75 | 1.22 |
| **92a** | 22.03 | 0.75 | 1.24 |
| **29c** | 26.15 | 0.78 | 1.26 |
| **186** | 23.69 | 0.78 | 1.17 |
| **671-3p** | 28.89 | 0.8 | 1.29 |
| **26b** | 23.75 | 0.8 | 1.19 |
| **let-7d** | 23.07 | 0.8 | 1.16 |

Mean _{0 }

The proposed weighted mean normalization method could not be compared to quantile normalization in the same fashion as the other methods, because quantile normalization has no value analogous to CT_{0} that could be evaluated for stability. However, the normalized data resulting from each method could be visualized and compared as boxplots. Figure

**Boxplots of raw and normalized data**. Boxplots of the raw data are shown in the first row, followed by boxplots of the normalized data in each subsequent row. The raw CT values are compared to the quantile normalized CT values and to the ΔCT values produced by the endogenous controls and by weighted mean normalization with a weighted mean power of 13. The left column contains adult melanoma samples, the middle column contains pediatric melanoma samples, and the right column contains adult and pediatric nevus samples.

Having explored the problems of RT-PCR normalization, we next examined the consistency of miRNA expression experiments between RT-PCR and microarray technologies. To explore this issue we used a dataset from Chen et al.

**Distribution of miRNA CT values on card A**. Each line represents the CT value of a particular miRNA across each of the four samples.

**RT-PCR expression vs. log microarray expression**. On the left each point represents the base 2 logarithm of the microarray expression vs. the ΔCT value for a particular miRNA for card A. On the right is the same plot for card B.

Figure

**Spearman correlation of each RT-PCR sample with microarray expression**. Here we show the Spearman correlation of the ΔCT values with the base 2 logarithm of microarray expression for each RT-PCR sample. The left four samples come from card A and consist of two different reverse transcription reactions each performed on one of two different days. The right four samples come from card B and consist of two pre-amplified and two non-amplified RT-PCR samples.

**Correlation of card A by range of CT values**. Here we divided the miRNAs detected on both card A and the microarray into ranges of CT values and calculated the Spearman correlation of the miRNAs' ΔCT values with the base 2 logarithm of their microarray expression.

**Correlation of card B by range of CT values**. Here we divided the miRNAs detected on both card B and the microarray into ranges of CT values and calculated the Spearman correlation of the miRNAs' ΔCT values with the base 2 logarithm of their microarray expression.
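The per-range Spearman analysis used for both cards can be sketched as follows. The bin edges, the minimum bin size, the function names, and all values are assumptions chosen for illustration; the CT ranges actually used are not given in the text.

```python
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / (sxx * syy) ** 0.5

def spearman_by_ct_range(ct, delta_ct, log2_array, edges, min_size=3):
    """Correlate ΔCT with log2 microarray expression within each CT range.

    ct, delta_ct, log2_array: parallel lists, one entry per miRNA.
    edges: bin boundaries; bin i covers edges[i] <= CT < edges[i + 1].
    """
    out = {}
    for lo, hi in zip(edges, edges[1:]):
        idx = [i for i, c in enumerate(ct) if lo <= c < hi]
        if len(idx) >= min_size:  # skip bins too small to correlate
            out[(lo, hi)] = spearman(
                [delta_ct[i] for i in idx], [log2_array[i] for i in idx]
            )
    return out

# Made-up values for six miRNAs.
by_range = spearman_by_ct_range(
    ct=[16.0, 17.0, 18.0, 21.0, 22.0, 23.0],
    delta_ct=[1.0, 2.0, 3.0, 6.0, 7.0, 8.0],
    log2_array=[12.0, 11.0, 10.0, 5.0, 7.0, 6.0],
    edges=[15, 20, 25],
)
```

Because a higher ΔCT indicates lower expression, perfect agreement between the two platforms would appear as a correlation of -1 rather than +1.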

Conclusions

We explored the phenomenon whereby differences in the initial sample size of miRNA in an RT-PCR experiment were magnified with increasing CT levels. This was illustrated by the strong correlation of the CT values of the individual miRNAs with the average CT values of all miRNAs and by the increased sensitivity in the CT values of the low-abundant miRNAs to the average CT values. We conclude that a systematic bias in RT-PCR exists in which the fluctuations in the CT are dependent on the expression levels of the particular miRNAs. We further proposed a novel data-driven method of addressing this bias by using the weighted mean instead of an endogenous control in the calculation of ΔCT. We demonstrated that the new normalization method produces lower standard deviations and is more stable than other methods.

Note that, while the power parameter in the weighted mean normalization method provides a convenient way of adjusting how much one wishes to let the less stable microRNAs influence the normalization of other microRNAs, its optimization currently requires enumerating different values and using the one with the best overall stability. Several CT_{0} values can be calculated for different values of the weighted mean power; the power that produces the lowest standard deviation, or that is determined to be the most stable by geNorm, can then be used for normalization. The standard deviation and geNorm stability calculations are two ways to quantitatively determine the ideal weighted mean power. Other criteria, such as the significance of the differentially expressed microRNAs, can also be utilized in this optimization. Furthermore, a different custom CT_{0}
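The enumeration over candidate powers can be sketched as below, using the standard deviation of CT_{0} across samples as the selection criterion. The weighting (inverse standard deviation raised to the power), the function names, and the data are assumptions made for illustration; a geNorm-based criterion could be substituted for `stdev` in the same loop.

```python
from statistics import stdev

def weighted_ct0(ct, power):
    """CT_0 per sample. ASSUMPTION: weight_i = 1 / stdev_i ** power."""
    weights = {g: 1.0 / stdev(values) ** power for g, values in ct.items()}
    total = sum(weights.values())
    n_samples = len(next(iter(ct.values())))
    return [
        sum(weights[g] * ct[g][j] for g in ct) / total for j in range(n_samples)
    ]

def best_power(ct, candidate_powers):
    """Return (power with the lowest stdev of CT_0, all scores by power)."""
    scores = {p: stdev(weighted_ct0(ct, p)) for p in candidate_powers}
    return min(scores, key=scores.get), scores

# Made-up data in which two genes fluctuate in opposite directions.
toy = {"a": [20.0, 21.0, 22.0], "b": [31.5, 30.0, 28.5]}
power, scores = best_power(toy, [0, 1, 2, 3])
```

In this toy example an intermediate power gives the most stable CT_{0}, which illustrates why the candidates are enumerated instead of simply choosing the largest power.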

We further examined the reproducibility of miRNA expression experiments across two different platforms by comparing RT-PCR and microarray results. We explored the relationship between the CT value and the consistency of the expression of a miRNA between RT-PCR and microarray. We leave as future work the comparison of the ability of different normalization methods to detect differentially expressed genes.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

AS conceived of the study and coordinated the project. RQ implemented the method and performed the experiments. RQ contributed to the design and testing of the method. All authors participated in the analysis of the results. RQ and AS contributed to the writing of the manuscript. All authors read and approved the final draft.