An integrative genomics approach, in which data from different microarray experiments are merged to study regulatory networks, has been adopted in several recent research studies. However, we propose that blind use of this approach can be misleading. Our hypothesis is that as microarray data from different experiments are merged, local patterns of activity, for example the cell cycle, can be masked by more global and dominant patterns such as stress reactions. We have carried out a systematic study in which data of increasing heterogeneity are clustered to determine groups of functionally related genes. These clusters are then tested for similarity to each other.
To validate our hypothesis, the primary requirement is to obtain the regulatory modules from the various datasets and their mixtures and then measure their similarity to each other. A decreasing trend in similarity as more and more heterogeneous data are mixed in should support our hypothesis. A number of researchers have worked on the problem of finding regulatory networks; among the most important are [2,3], which incorporate prior knowledge in the form of known transcription factors or DNA binding data to guide the clustering process. The results of these works show that the resulting clusters of regulated gene modules are biologically meaningful. We used the Module Networks algorithm, which is a well-established approach and has had success in finding biologically relevant modules. To measure the similarity among the sets of regulated gene clusters produced by this algorithm, we chose the modified Rand Index, which has been shown to be a very stable index of partition similarity.
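As an illustration of the similarity measure, the sketch below computes a chance-corrected Rand index between two clusterings of the same genes. It assumes that the modified Rand Index mentioned above is the adjusted form of Hubert and Arabie; the function name and the toy module labels are hypothetical and are not taken from the study.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected Rand index between two partitions of the same items.

    labels_a[i] and labels_b[i] are the cluster (module) labels assigned to
    gene i by two different clustering runs. Assumed form: Hubert-Arabie.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same genes")

    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two genes: partitions are trivially equal

    # Pairs of genes placed together in both partitions (contingency cells).
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each partition separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every gene in a single cluster
    return (together_both - expected) / (maximum - expected)


# Toy example: module assignments for four genes from two hypothetical runs.
run_1 = [0, 0, 1, 1]
run_2 = [1, 1, 0, 0]   # same grouping, different label names
run_3 = [0, 1, 0, 1]   # grouping that cuts across run_1
print(adjusted_rand_index(run_1, run_2))
print(adjusted_rand_index(run_1, run_3))
```

The index depends only on which genes are grouped together, not on the label names, so two runs that produce the same modules under different numbering still score 1. Values near zero indicate agreement no better than chance.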
Materials and methods
To validate our hypothesis we chose to work with two very different categories of data from the Stanford Microarray Database (SMD). One comes from experiments in which yeast is exposed to stress conditions, while the other comes from cell-cycle studies. Changes in gene expression under stress conditions are much more drastic (for both repressed and induced genes) than in cell-cycle experiments, where optimal conditions are created for growth. We started by analysing data from individual researchers' stress experiments, referred to in this paper as DS-STRESS1 (76 microarrays), DS-STRESS2 (49 microarrays) and DS-STRESS3 (41 microarrays). In the next stage we merged all the stress microarrays to create the dataset we call DS-STRESS. To compare these clusterings against an entirely different category, we took 93 microarrays from cell-cycle experiments, referred to in this article as DS-CCYCLE. A further mixture of both stress and cell-cycle data was named DS-STRESS-CCYCLE. Finally, we extracted all available data for yeast (1082 microarrays, not only stress and cell-cycle), named DS-ALL, and compared the earlier results against it. To give our results a statistical baseline, we also generated a random microarray dataset for all the genes by drawing random numbers from a Gaussian distribution with zero mean and unit standard deviation. This dataset was named DS-RANDOM.
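A random baseline of the kind described for DS-RANDOM could be generated along the following lines. This is a minimal sketch: the function name, the gene identifiers, the number of arrays and the seed are all assumptions made for illustration, since the text does not state how many arrays the random dataset contained.

```python
import random


def make_random_dataset(gene_ids, n_arrays, seed=None):
    """Return {gene_id: [expression values]} drawn from N(0, 1).

    Each gene gets one value per simulated array, sampled independently
    from a Gaussian with zero mean and unit standard deviation.
    """
    rng = random.Random(seed)  # local generator so results are reproducible
    return {
        gene: [rng.gauss(0.0, 1.0) for _ in range(n_arrays)]
        for gene in gene_ids
    }


# Hypothetical usage: placeholder gene names and an arbitrary array count.
genes = [f"GENE{i:04d}" for i in range(200)]
ds_random = make_random_dataset(genes, n_arrays=50, seed=0)
print(len(ds_random), len(ds_random["GENE0000"]))
```

Because the values carry no shared structure across genes, any clustering of such data should agree with clusterings of real data only at chance level.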
For normalization, we use the assumption that the average log R/G ratio on each array should be zero. We then filter the genes, retaining only those whose log (base 2) R/G ratio is greater than 2 for at least one experiment. A list of 145 transcription factors (TFs), used as prior knowledge, was taken from the Yeastract website (http://yeastract.com/). We analysed all of these data using the software package Genomica, which is provided by the authors of the Module Networks algorithm.
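The normalization and filtering steps can be sketched as follows. This is an illustrative reading of the description above, not the Genomica implementation: it assumes complete data with no missing values, and because the text does not say whether the threshold applies to the signed or the absolute log ratio, that choice is exposed as a hypothetical `two_sided` flag. All names and values are made up for the example.

```python
def center_arrays(log_ratios):
    """Shift each array (column) so that its mean log2(R/G) is zero.

    log_ratios maps gene id -> list of log2(R/G) values, one per array.
    """
    genes = list(log_ratios)
    n_arrays = len(log_ratios[genes[0]])
    # Mean log ratio of each array across all genes.
    column_means = [
        sum(log_ratios[g][j] for g in genes) / len(genes)
        for j in range(n_arrays)
    ]
    return {
        g: [value - column_means[j] for j, value in enumerate(log_ratios[g])]
        for g in genes
    }


def filter_genes(log_ratios, threshold=2.0, two_sided=False):
    """Keep genes whose log2 ratio exceeds the threshold in at least one array."""
    def passes(values):
        if two_sided:  # assumption: count strongly repressed genes as well
            return any(abs(v) > threshold for v in values)
        return any(v > threshold for v in values)

    return {g: values for g, values in log_ratios.items() if passes(values)}


# Toy data: three hypothetical genes measured on two arrays.
raw = {"g1": [4.0, 1.0], "g2": [1.0, -1.0], "g3": [-2.0, 0.0]}
centered = center_arrays(raw)
print(sorted(filter_genes(centered)))
print(sorted(filter_genes(centered, two_sided=True)))
```

Centering is applied before filtering so that the threshold is compared against values on a common scale across arrays.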
We compared each of the stress datasets against DS-STRESS-CCYCLE, DS-ALL and DS-CCYCLE, which are increasingly distant from the stress datasets, as described earlier. As a reference, we also compared them against the two extremes of similarity: DS-STRESS, which is a mixture of all the stress datasets, and DS-RANDOM, which is a random dataset. As the results in Table 1 show, the individual datasets differ in their similarity even to the DS-STRESS dataset. This suggests that DS-STRESS1 and DS-STRESS3 are more similar to each other than to DS-STRESS2; we think the reason is that they came from experiments related to common research. The similarity of all the stress datasets to DS-CCYCLE is very low, as we expected given the very different nature of expression in these diverse experiments. As expected, the similarity values for the random dataset are minuscule in all cases.
Table 1. Comparison of individual stress versus progressively mixed datasets
The visible trend of similarity values gradually falling from left to right indicates that mixing with similar data keeps the similarity among clusters higher, while mixing with dissimilar data brings it down. We also carried out a comparison at the level of combined datasets, rather than individual datasets as before. Here we compared the cell-cycle and stress datasets with each other and with DS-STRESS-CCYCLE, DS-ALL and DS-RANDOM. The results in Table 2 generalise and substantiate our earlier observations, as the same trends are even more robust here.
Table 2. Comparison of stress and cell-cycle (mixed) versus progressively mixed datasets
Spellman P, Sherlock G, Zhang M, Iyer V, Anders K, Eisen M, Brown P, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization.