Department of Biostatistics and Computational Biology, University of Rochester, New York 14642, USA

Functional Genomics Center, University of Rochester, 601 Elmwood Avenue, Rochester, New York 14642, USA

Department of Probability and Statistics, Charles University, Sokolovska 83, Praha-8, CZ-18675, Czech Republic

Abstract

Background

Stochastic dependence between gene expression levels in microarray data is of critical importance for the methods of statistical inference that resort to pooling test-statistics across genes. It is frequently assumed that dependence between genes (or tests) is sufficiently weak to justify the proposed methods of testing for differentially expressed genes. A potential impact of between-gene correlations on the performance of such methods has yet to be explored.

Results

The paper presents a systematic study of correlation between the

Conclusion

A long-range correlation in microarray data manifests itself in thousands of genes that are heavily correlated with a given gene in terms of the associated

Background

There are two major methodological problems that deal with the issue of stochastic dependence between gene expression signals in microarray data. The first arises naturally when adjustments for multiplicity of tests are made by

In all such approaches, the stochastic dependence between gene expression values or test statistics is a nuisance that hinders their application. The independence assumption is frequently invoked when building a theoretical foundation for a particular method of statistical inference. Some authors (e.g.,

The stochastic dependence between expression levels, and thus between the associated test statistics, is a serious problem. It may inflate the variability of statistical estimators and even compromise their consistency. To obtain theoretical results it is frequently assumed that weak or almost sure convergence holds for an empirical distribution function constructed from the data pooled across genes (see, e.g.,

Storey

"I hypothesize that the most likely form of dependence between the genes encountered in DNA microarrays is weak dependence, and more specifically, "clumpy dependence"; that is, the measurements on the genes are dependent in small groups, each group being independent of the others. There are two reasons that make clumpy dependence likely. The first is that genes tend to work in pathways, that is, small groups of genes interact to produce some overall process. This can involve just a few to 50 or more genes. This would lead to a clumpy dependence in the pathway-specific noise in the data. The second reason is that there tends to be cross-hybridization in DNA microarrays. In other words, the signals between two genes can cross because of molecular similarity at the sequence level. Cross-hybridization would only occur in small groups, and each group would be independent of the others."

This hypothesis does not seem plausible from a biological standpoint because of the pleiotropic character of gene function: one gene participates in multiple molecular pathways. However, the possibility that it may approximately be true for all practical purposes cannot be ruled out. There are two key words in the above quotation: "small groups" and "weak dependence". Whether or not such groups are small and stochastic dependence is sufficiently weak can be deciphered only from real-world data. To the best of our knowledge, no attempt has been made so far to systematically study dependence structures in microarray data using large data sets. In this connection we would like to continue quoting from

The second research area where the dependence between gene expression levels plays a crucial role is the discovery (reverse engineering) of molecular pathways and networks from microarray data

The present paper is focused on the correlations between test-statistics associated with expression signals produced by each gene and the effects of normalization procedures on these correlations. We limit our consideration to the

Results

The design of our study is presented in the Methods section. This design allows us to compute the

Using these tools we attempt to answer the following questions:

• What is the (pairwise) correlation structure of the

• What is the impact of normalization procedures on this structure?

• What is the impact of normalization procedures on the number of highly correlated pairs formed by a given gene?

Figure

The histogram of correlation coefficients for

The histogram of correlation coefficients for

The effects of three normalization procedures (

The effect of normalization with non-overlapping pairs of genes;

The effect of normalization on the between-gene correlations observed in the simulated data SIMU2N and SIMU2 is stronger than that in the case of biological data (the SJCRH leukemia data set). This can be seen in Figures

The effect of normalization procedures on the correlation structure of simulated data;

The effect of the normalization procedure

The effect of the normalization procedure

The behavior of the standard deviation of the sample mean as a function of the number of involved genes. 1. Raw biological data; 2. Quantile normalization; 3. Independent simulations (SIMU1).

The effect of the quantile normalization for the SIMU3N, shown in Figure

Another way of studying such effects is to look at the number of pairs characterized by a relatively high correlation with a pre-selected gene. Tables

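Long-range correlation analysis for the SIMU2N data.

| Gene Label | GEO | QUANT | RANK | SIMU2N |
|---|---|---|---|---|
| 1 | 743 | 746 | 741 | 12558 |
| 2 | 754 | 750 | 756 | 12558 |
| 3 | 723 | 723 | 721 | 12558 |
| 4 | 705 | 698 | 718 | 12558 |
| 5 | 736 | 734 | 754 | 12558 |
| 6 | 751 | 763 | 765 | 12558 |
| 7 | 702 | 695 | 709 | 12558 |
| 8 | 667 | 665 | 679 | 12558 |
| 9 | 747 | 747 | 759 | 12558 |
| 10 | 728 | 730 | 736 | 12558 |
| 11 | 713 | 717 | 713 | 12558 |
| 12 | 696 | 699 | 685 | 12558 |
| 13 | 743 | 750 | 762 | 12558 |
| 14 | 725 | 721 | 733 | 12558 |
| 15 | 691 | 691 | 740 | 12558 |
| 16 | 789 | 789 | 799 | 12558 |
| 17 | 724 | 725 | 669 | 12558 |
| 18 | 716 | 712 | 722 | 12558 |
| 19 | 762 | 762 | 720 | 12558 |
| 20 | 676 | 673 | 708 | 12558 |
| Mean | 724.6 | 724.5 | 729.5 | 12558 |
| STD | 30.1 | 31.8 | 31.9 | 0 |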

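Long-range correlation analysis for the SIMU3N data.

| Gene Label | GEO | QUANT | RANK | SIMU3N |
|---|---|---|---|---|
| 1 | 483 | 520 | 512 | 12297 |
| 2 | 471 | 582 | 591 | 10656 |
| 3 | 436 | 523 | 614 | 12506 |
| 4 | 644 | 643 | 744 | 11031 |
| 5 | 677 | 739 | 765 | 11320 |
| 6 | 610 | 543 | 570 | 12413 |
| 7 | 612 | 863 | 788 | 12429 |
| 8 | 802 | 727 | 711 | 12077 |
| 9 | 1743 | 1406 | 1077 | 11898 |
| 10 | 975 | 895 | 920 | 12001 |
| 11 | 1352 | 1330 | 1543 | 12453 |
| 12 | 670 | 707 | 686 | 12480 |
| 13 | 1874 | 1849 | 1890 | 6913 |
| 14 | 1858 | 1765 | 1808 | 9371 |
| 15 | 1925 | 1790 | 1974 | 12469 |
| 16 | 1792 | 1718 | 1796 | 12520 |
| 17 | 1764 | 1526 | 1679 | 12499 |
| 18 | 1769 | 1684 | 1821 | 12509 |
| 19 | 1476 | 1300 | 1569 | 12514 |
| 20 | 2223 | 2307 | 2148 | 12507 |
| Mean | 1207.8 | 1170.9 | 1210.3 | 11743.2 |
| STD | 617.3 | 557.5 | 576.5 | 1402 |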

Long-range correlation analysis for the SJCRH data.

| Gene Label | GEO | QUANT | RANK | raw data |
|---|---|---|---|---|
| 1 | 5644 | 462 | 494 | 12481 |
| 2 | 7330 | 3175 | 1431 | 12486 |
| 3 | 4189 | 1480 | 2062 | 12496 |
| 4 | 5218 | 2728 | 1548 | 12493 |
| 5 | 8169 | 1888 | 1064 | 12451 |
| 6 | 8140 | 956 | 1162 | 12482 |
| 7 | 323 | 1169 | 839 | 12480 |
| 8 | 6774 | 1479 | 839 | 12497 |
| 9 | 7676 | 1832 | 2140 | 12390 |
| 10 | 8234 | 794 | 1440 | 12384 |
| 11 | 7652 | 930 | 466 | 12498 |
| 12 | 8266 | 1329 | 708 | 12476 |
| 13 | 8197 | 1343 | 2045 | 12391 |
| 14 | 7422 | 2118 | 2513 | 12501 |
| 15 | 1588 | 1467 | 1011 | 12494 |
| 16 | 7861 | 1931 | 1133 | 12429 |
| 17 | 1292 | 1477 | 1445 | 12489 |
| 18 | 6389 | 2949 | 1456 | 12481 |
| 19 | 7359 | 490 | 514 | 12469 |
| 20 | 4384 | 970 | 787 | 12488 |
| Mean | 6105.4 | 1548.4 | 1254.9 | 12467.8 |
| STD | 2545 2512 | 756 | 589.5 | 38.2 |
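The counts tabulated above can be illustrated with a minimal sketch. It assumes the per-gene statistics are available as a matrix with one row per gene and one column per replicate (for example, one column per pair of subsamples), and that "highly correlated" means a Pearson correlation above some cutoff. The function name `count_correlated_partners`, the argument names, and the default `threshold` value are hypothetical; the cutoff used in the study is not stated in this section.

```python
import numpy as np

def count_correlated_partners(stats, initiator, threshold=0.5):
    """Count genes whose statistics correlate with the initiator gene above `threshold`.

    stats     : array of shape (n_genes, n_replicates) with per-gene statistics
    initiator : row index of the pre-selected (initiator) gene
    threshold : hypothetical cutoff defining a "highly correlated" pair
    """
    stats = np.asarray(stats, dtype=float)
    # Center each gene's statistics across replicates.
    centered = stats - stats.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1)
    # Pearson correlation of every gene with the initiator gene.
    corr = centered @ centered[initiator] / (norms * norms[initiator])
    partners = corr > threshold
    partners[initiator] = False  # do not count the gene paired with itself
    return int(partners.sum())
```

Genes whose statistics are constant across replicates have zero norm and would yield undefined correlations; a real analysis would need to filter them out first.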

Consider first the results obtained with simulated data. Each of the twenty initiator genes selected from SIMU2N forms exactly 12,558 highly correlated pairs. When applied to the SIMU2N data, the normalization procedures

The results for the SIMU3N data are different (see Table

We then selected 20 initiator genes in the SJCRH data set representing real biological data. The number of highly correlated pairs formed by these genes before normalization ranges from 12,384 to 12,501, which is a very narrow range indeed. As is seen in Table

Another interesting finding in Table

The effect of the normalization

The effect of the quantile normalization on the distribution of the

The results shown in this section are obtained with a single initial random split of the pooled set of arrays into two groups. We have conducted several such splits in this study. All the above-described effects are highly reproducible, and reporting the results for other splits in the paper is not warranted.

Discussion

It follows from our observations that normalization procedures are capable of destroying a significant part of correlations between gene expression signals and associated test-statistics. In doing so, they affect both the spurious correlation induced by the noise and the true correlation that reflects gene interactions. The clumpy structure (involving relatively large clumps of genes) of the SIMU3N data set is more resistant to this effect than the SIMU2N data. This is even more so for real biological data. The weaker effect of normalization seen in the SJCRH data indicates that the actual noise structure may be more complicated than assumed in the simulation studies (multiplicative array-specific random effect model). A clumpy structure of gene expression signals may also play a role in this phenomenon. This observation explains why it is so difficult to remove correlations from the data.

The destructive effect of normalization procedures on pairwise correlations in microarray data is good news for the methods of statistical inference that resort to "pooling across genes". However, it remains unclear whether or not the remaining correlation may still be substantial enough to invalidate such methods by affecting important properties of statistical estimators and tests. The problem invites further investigation. However, we would like to present an experiment specially designed to address the consistency question mentioned in the Background section.

To this end, we applied the following algorithm to the SJCRH data:

1. Select randomly 100 genes and compute the arithmetic (sample) mean of the

2. Compute the standard deviation of the sample mean across the 15 pairs of subsamples.

3. Randomly select 100 more genes from the remaining genes and compute the arithmetic mean over the 200 selected genes for each pair of subsamples.

4. Compute the standard deviation of the sample means resulting from the previous step.

5. Continue until the set of all genes is exhausted.

6. Plot the estimated standard deviation of the sample mean as a function of the number of genes involved in each step of the algorithm.

7. Repeat the procedure
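The steps above can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes the per-gene statistics have already been computed and stored as a matrix with one row per gene and one column per pair of subsamples (15 columns in the experiment described here). The names `sd_of_pooled_mean`, `stats`, and `step` are hypothetical.

```python
import numpy as np

def sd_of_pooled_mean(stats, step=100, rng=None):
    """Standard deviation of the pooled sample mean as genes are added in blocks.

    stats : array of shape (n_genes, n_pairs); one statistic per gene and subsample pair
    step  : number of genes added at each step (100 in the description above)
    Returns (gene_counts, sds), suitable for plotting sds against gene_counts.
    """
    rng = np.random.default_rng() if rng is None else rng
    stats = np.asarray(stats, dtype=float)
    n_genes = stats.shape[0]
    # A random permutation gives random selection without replacement.
    order = rng.permutation(n_genes)
    # Running sums over the selected genes, separately for each subsample pair.
    running = np.cumsum(stats[order], axis=0)
    counts = np.arange(step, n_genes + 1, step)
    if counts.size == 0 or counts[-1] != n_genes:
        counts = np.append(counts, n_genes)  # last step exhausts the gene set
    means = running[counts - 1] / counts[:, None]
    # Spread of the sample mean across the subsample pairs at each step.
    sds = means.std(axis=1, ddof=1)
    return counts, sds
```

Under independence between genes the returned standard deviations would be expected to decay roughly like the inverse square root of the number of genes, whereas for perfectly correlated genes they would stay constant.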

The results of one such experiment are given in Figure

The observed effect of normalization procedures is definitely bad news for the associative network reconstruction from gene expression data. Unless further technological advancements result in a significant reduction of the noise in microarray data, this kind of analysis will continue producing unreliable inferences. To normalize, or not to normalize: that is the question to which no scientifically sound answer is currently known as far as this kind of reverse engineering is concerned. Although limited to cell cultures, the causal inference from gene perturbation (disruption and over-expression) experiments seems to be the only solid alternative. From this standpoint the observations reported in the present paper add to the concerns expressed by several investigators regarding how much confidence to place in the thousands of papers already published using microarray technology

Conclusion

The present paper provides quantitative insight into correlation between the

• There is a long-range correlation in microarray data manifesting itself in a huge number of genes that are heavily correlated with a given gene in terms of the associated

• Using normalization of microarray data it is possible to significantly reduce correlation between the

• Normalization procedures affect both the true correlation, stemming from gene interactions, and the spurious correlation induced by random noise.

• It is likely that some noise effects represent non-monotone transformations of the underlying gene expression signals because even the rank normalization does not make the

• Even the most efficient normalization procedures are unable to completely remove correlation between the

Methods

Study design and biological data

There are 335 arrays (Affymetrix, Santa Clara, CA) in the SJCRH data set, each array representing

Then 15 pairs of the array samples were arranged and the corresponding 15

In a separate experiment, we formed

Simulated data

We simulated several sets of data to gain a better insight into the effects of normalization. All of them included the same numbers of arrays and genes as in the biological data described in the previous section. Specific characteristics of these data sets are given below.

1. SIMU1: Every element of the data matrix is generated from the standard normal distribution. This implies that the original expression signals are modeled as log-normally distributed random variables, but we used their logarithms in our computations. This data set was used to illustrate the correlation analysis under independence of gene expression levels.

2. SIMU2 is a 12,558 × 335 random matrix that models an exchangeable correlation structure. The entries in this matrix are normal random variables with mean zero and unit variance. The entries from different columns are independent, while the correlation coefficient between any two elements of the same column is equal to 0.8.

3. SIMU2N is a data set based on SIMU2. First we generate a 335-dimensional random vector _{ij }of _{ij }to be _{ij }+ _{j}, where _{ij }is the _{ij}} represents the data SIMU2N

4. SIMU3 is a 12,550 × 335 matrix. The 12,550 rows (genes) are divided into ten groups of genes, each containing 1,255 rows. If two genes are both from the

5. SIMU3N is the same as the SIMU3 data set but with an added noise. An array-specific multiplicative and uniformly distributed noise is modeled exactly as in the SIMU2N data.
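As an illustration of the exchangeable structure described for SIMU2, the sketch below uses a standard one-factor construction: each array (column) receives one shared normal variable, and each entry mixes that shared variable with its own independent normal variable. This is one way to obtain unit variances, a common within-column correlation, and independence between columns; it is an assumption that the study used this construction. The function and argument names are hypothetical, and the array-specific noise of SIMU2N and SIMU3N is not modeled here.

```python
import numpy as np

def simulate_exchangeable(n_genes, n_arrays, rho, rng=None):
    """Simulate a matrix whose entries within a column share correlation `rho`.

    Entries have mean zero and unit variance; different columns are independent.
    Setting rho = 0 gives fully independent standard normal entries.
    """
    rng = np.random.default_rng() if rng is None else rng
    shared = rng.standard_normal((1, n_arrays))      # one common factor per array
    own = rng.standard_normal((n_genes, n_arrays))   # gene-specific component
    # Var = rho + (1 - rho) = 1, and Cov within a column = rho.
    return np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * own
```

For example, `simulate_exchangeable(12558, 335, 0.8)` would produce a matrix with the dimensions and within-column correlation stated for SIMU2.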

Normalization methods

Suppose there are

1. Geometric mean normalization GEO

If the array-specific random noise is multiplicative then a reasonable way to remove it from the expression values is to divide each element of the data matrix by the geometric mean over all gene expression signals on the array to which this element belongs. Szabo

2. Rank normalization RANK

This method was proposed by Tsodikov et al. A sorted array is formed by arranging all gene expression signals for the same array in increasing order. Next, every entry in the original array is replaced by its position (rank) in the sorted array, counted from the smallest value. The idea behind this method is that ranks are invariant to any monotone transformation, implying a much more general model for the technological noise than the multiplicative array-specific random effect model.

3. Quantile normalization QUANT

As discussed in
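The three normalization methods can be sketched as follows, assuming the data are stored as a matrix of positive expression signals with genes in rows and arrays in columns. The GEO and RANK functions follow the descriptions given above. The QUANT function implements the commonly used form of quantile normalization (each value is replaced by the mean, across arrays, of the values holding the same rank); whether this matches the exact variant used in the study is an assumption. All function names are hypothetical.

```python
import numpy as np

def geo_normalize(x):
    """Divide every signal by the geometric mean of all signals on its array."""
    x = np.asarray(x, dtype=float)
    geo_mean = np.exp(np.log(x).mean(axis=0, keepdims=True))
    return x / geo_mean

def rank_normalize(x):
    """Replace every signal by its rank within its array (1 = smallest)."""
    x = np.asarray(x, dtype=float)
    # order[k, j] is the row index of the k-th smallest signal on array j.
    order = np.argsort(x, axis=0, kind="stable")
    ranks = np.empty_like(order)
    positions = np.arange(1, x.shape[0] + 1)[:, None]
    np.put_along_axis(ranks, order, positions, axis=0)
    return ranks

def quantile_normalize(x):
    """Give every array the same empirical distribution of signals."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0, kind="stable")
    # Reference distribution: mean of the k-th smallest values across arrays.
    reference = np.sort(x, axis=0).mean(axis=1)
    out = np.empty_like(x)
    np.put_along_axis(out, order, reference[:, None], axis=0)
    return out
```

In this sketch ties are broken by order of appearance; a real analysis might prefer average ranks for tied signals.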

Authors' contributions

This work represents a truly collaborative endeavor. All members of the research team contributed equally to the design of this study, discussion of its technical issues and formulation of the net results. XQ was responsible for the computational component of the study. AB brought his biological expertise to the project.

Acknowledgements

We are grateful to anonymous reviewers whose comments have helped us improve the manuscript. We thank our colleague Cristine Brower for technical assistance. The research is supported in part by NIH Grant GM075299 and Czech Ministry of Education Grant MSM 113200008.