Department of Mathematics and Statistics, South Dakota State University, Box 2220, Brookings, SD, 57007, USA

Abstract

Background

Many plant genes have been identified through whole genome and deep transcriptome sequencing and other methods; yet our knowledge of the function of many of these genes remains limited. The integration and analysis of large gene-expression datasets gives researchers the ability to formulate hypotheses concerning the function of, and interactions between, different groups of correlated genes.

Results

We applied the non-negative matrix factorization (NMF) algorithm to the AtGenExpress dataset which consists of 783 microarray samples (29 separate experimental series) conducted on the model plant

Conclusions

This study identifies a network of correlated metagenes composed of

Background

Previous Gene Co-expression Studies

In recent years, we have witnessed a deluge of new results coming from genome-wide microarray experiments, and the torrent of data seems likely to increase in the future. In particular, thousands of microarray data sets from experiments on the organism

In general, gene co-expression studies fall into two broad categories: condition dependent and condition independent

In an elegant condition dependent study by Bassel

The idea of ‘guilt by association’, forming hypotheses concerning the biological role of genes based upon similar patterns of gene expression, plays an important role in co-expression analysis

In a condition independent study by Atias

The Model Organism

The NMF Algorithm

The NMF algorithm was first introduced in 1999 by Lee and Seung

NMF has been applied with considerable success to gene expression datasets other than

Results

The AtGenExpress Dataset

AtGenExpress is a large global research project whose purpose is to discover the transcriptome of the model organism

| **GSE Accession Number** | **Number of Samples** | **Experiment Description** | **Sampled Tissue** |
|---|---|---|---|
| GSE5615 | 42 | Response to bacterial-(LPS, HrpZ, Flg22) and oomycete-(NPP1) derived elicitors | Leaf |
| GSE5616 | 18 | Response to Phytophthora infestans | Leaf |
| GSE5617 | 48 | Light treatments | Shoot |
| GSE5620 | 36 | Stress Treatments (Control plants) | Root and shoot |
| GSE5621 | 24 | Stress Treatments (Cold stress) | Root and shoot |
| GSE5622 | 24 | Stress Treatments (Osmotic stress) | Root and shoot |
| GSE5623 | 24 | Stress Treatments (Salt stress) | Root and shoot |
| GSE5624 | 28 | Stress Treatments (Drought stress) | Root and shoot |
| GSE5625 | 24 | Stress Treatments (Genotoxic stress) | Root and shoot |
| GSE5626 | 28 | Stress Treatments (UV-B stress) | Root and shoot |
| GSE5627 | 28 | Stress Treatments (Wounding stress) | Root and shoot |
| GSE5628 | 32 | Stress Treatments (Heat stress) | Root and shoot |
| GSE5629 | 24 | Developmental series (seedlings and whole plants) | Shoot and whole plant |
| GSE5630 | 60 | Developmental series (leaves) | Leaf (different stages) |
| GSE5631 | 21 | Developmental series (roots) | Root (different stages) |
| GSE5632 | 66 | Developmental series (flowers and pollen) | Flower (different stages) |
| GSE5633 | 42 | Developmental series (shoots and stems) | Shoots and stems (different stages) |
| GSE5634 | 24 | Developmental series (siliques and seeds) | Siliques and seeds (different stages) |
| GSE5684 | 12 | Pathogen Series: Response to Botrytis cinerea infection | Mature leaf |
| GSE5685 | 32 | Pathogen Series: Pseudomonas half leaf injection | Stage 10–11 rosette leaf |
| GSE5686 | 48 | Pathogen Series: Response to Erysiphe orontii infection | Mature leaf |
| GSE5687 | 4 | Different temperature treatment of seeds | Seed |
| GSE5688 | 22 | Response to sulfate limitation | Root |
| GSE5696 | 26 | Effect of brassinosteroids in seedlings | Whole plant |
| GSE5697 | 8 | Comparison of plant hormone-related mutants | Whole plant |
| GSE5698 | 12 | Cytokinin treatment of seedlings | Whole plant |
| GSE5699 | 6 | ARR21C overexpression | Whole plant |
| GSE700 | 8 | Effect of ABA during seed imbibition | Seed |
| GSE701 | 12 | Basic hormone treatment of seeds | Seed |

Selecting the Dimensionality Reduction Parameter for the NMF Algorithm

An important part of the NMF algorithm is that it reduces the dimensionality of the original data space to a much smaller dimension

**Cophenetic Correlation Coefficient for the determination of optimal number of metagenes k.** Peaks in this plot represent stable

The Metagenes and Encoding Coefficients

Analysis of the dataset involved applying the NMF algorithm to reduce the dimensionality of the data to a set of metagenes, and associated encoding coefficients. Each metagene represents a collection of genes behaving in a functionally correlated fashion within the genome. The encoding coefficients express the degree of activation of each metagene on each sample within the dataset. See the Methods section for details.

Each gene within a metagene has an activation level which represents the degree to which that gene is expressed. The metagenes were sorted in descending order according to the activation levels of their genes, and the genes within each metagene were also sorted in descending order. Metagene 1 represents the most highly expressed metagene within the dataset, and Metagene 15 the least highly expressed. Some of the genes in the metagenes have been well-studied, while others have not been annotated at all. Table

**metagenes.** Excel spreadsheet containing the gene lists for each metagene as well as the gene ranks returned by NMF.


| **TAIR ID** | **GenBank ID** | **Description** |
|---|---|---|
| At1g80840 | BAA87058 | Transcription factor, putative similar to WRKY transcription factor |
| At1g05575 | --- | Expressed protein |
| At1g19180 | --- | Unknown protein |
| At4g29780 | | Hypothetical protein |
| At1g27730 | CAA64820 | Salt-tolerance zinc finger protein identical to salt-tolerance zinc finger |
| At4g34410 | --- | Putative protein ethylene-responsive element binding protein homolog, Stylosanthes hamata, U91857 |
| At1g76650 | CAA56517 | Putative calmodulin similar to calmodulin |
| At2g34600 | --- | Hypothetical protein predicted by genscan |
| At2g26530 | D88743 | AR781, similar to yeast pheromone receptor |
| At1g19020 | --- | Expressed protein; supported by full-length cDNA: Ceres: 31015 |
| At4g24570 | --- | Putative mitochondrial uncoupling protein mitochondrial uncoupling protein |
| At4g17490 | --- | Ethylene responsive element binding factor-like protein (AtERF6) |
| At3g55980 | --- | Putative protein zinc finger transcription factor (PEI1) |
| At3g01830 | CAB42906 | Hypothetical protein similar to calmodulin-like protein |
| At5g42380 | --- | Putative protein contains similarity to calmodulin |
| At1g72520 | CAB56692 | Putative lipoxygenase similar to lipoxygenase |
| At2g32210 | --- | Unknown protein |
| At3g25780 | --- | Unknown protein; supported by full-length cDNA: Ceres:3457 |
| At1g61340 | --- | Late embryogenesis abundant protein, putative similar to late embryogenesis abundant protein |
| At4g30280 | --- | xyloglucan endo-1,4-beta-D-glucanase-like protein |

Metagene Involvement in Experimental Series

The encoding coefficients returned by the NMF algorithm measure the degree to which each metagene is active in each of the 783 samples within the dataset. Multiple samples comprise an experiment, and through an application of the Kruskal-Wallis test


**Metagene Activity in Experimental Series.** This heat map shows the

One striking feature in Figure

| **Biological Process** | **P-Value** | **FDR** | **Metagene** |
|---|---|---|---|
| Response to chitin | 4.08E-13 | 5.13E-10 | 1 |
| Response to chemical stimulus | 2.82E-12 | 3.55E-09 | 1 |
| Response to stimulus | 2.39E-11 | 3.00E-08 | 1 |
| Response to carbohydrate stimulus | 2.40E-11 | 3.02E-08 | 1 |
| Response to stress | 8.84E-11 | 1.11E-07 | 1 |
| Photosynthesis | 1.71E-74 | 2.62E-71 | 3 |
| Photosynthesis, light reaction | 2.53E-39 | 3.88E-36 | 3 |
| Generation of precursor metabolites | 3.01E-31 | 4.60E-28 | 3 |
| Photosynthesis, light harvesting | 1.49E-20 | 2.29E-17 | 3 |
| Photosynthetic electron transport in photosystem I | 1.49E-16 | 1.67E-13 | 3 |
| Response to stimulus | 1.83E-12 | 2.78E-09 | 5 |
| Response to stress | 9.95E-09 | 1.52E-05 | 5 |
| Response to chemical stimulus | 2.74E-08 | 4.17E-05 | 5 |
| Photosynthesis | 7.35E-07 | 1.12E-03 | 5 |
| Response to abiotic stimulus | 9.79E-07 | 1.49E-03 | 5 |

Functional Characterization of Metagenes Using Gene Set Enrichment Analysis

GSEA

Metagene involvement in three gene ontologies (molecular functions, cellular components, and biological processes) was examined and the results for the biological processes ontology are summarized in Figure


**Metagene GSEA Enrichment in Biological Processes.** The NES score plotted in this heat map is a measure of metagene enrichment within a specific gene ontology involved with biological processes. Bright red cells indicate high enrichment.

The NES (Normalized Enrichment Score) is a measure of the enrichment of a metagene within a gene ontology, and is the value plotted in Figure

**GOBP.** Excel spreadsheet containing GSEA Enrichment results for Biological Processes.


**GOCC.** Excel spreadsheet containing GSEA Enrichment results for Cellular Components.


**GOMF.** Excel spreadsheet containing GSEA Enrichment results for Molecular Functions.


**Figure S1.** A plot showing the average correlation for 1000 randomly selected genes before and after the Empirical Bayes method was applied to adjust for batch effects.

**Figure S2.** A plot showing heat maps of the consensus matrix for k = {5,10,15,20}.

**Figure S3.** A plot showing heat maps of the consensus matrix for k = {30,35,40,45}.

**Figure S4.** A histogram of metagene coefficients, showing the δ=0.2 cut-off. Genes with coefficients greater than δ were included in the metagenes, and those with coefficients less than this were excluded.

**Figure S5.** Metagene correlation network for the gene ontology: cellular components. Each node in this network represents a metagene. The size of each node is proportional to the activity of the metagene within the dataset. The width of lines between a pair of nodes is proportional to the strength of the correlation between them. Positive correlations are denoted by red lines, and negative correlations by green lines. Only Spearman correlations with a p-value less than 10^{-12} are visible. The pie slices within each node represent the amount of enrichment for specific gene ontologies (the NES score).

**Figure S6.** Metagene correlation network for the gene ontology: molecular functions. Each node in this network represents a metagene. The size of each node is proportional to the activity of the metagene within the dataset. The width of lines between a pair of nodes is proportional to the strength of the correlation between them. Positive correlations are denoted by red lines, and negative correlations by green lines. Only Spearman correlations with a p-value less than 10^{-12} are visible. The pie slices within each node represent the amount of enrichment for specific gene ontologies (the NES score).

**Figure S7.** This heat map shows the z-values for all metagenes for the pathogen series in the dataset. Red indicates a metagene is more active in an experimental series, and green indicates it is suppressed.

**Figure S8.** Intersection p-values between clusters in the pathogen network from Atias, and the metagenes active in the pathogen series of the AtGenExpress dataset. Bright green cells represent significant statistical overlap. The intensity of the cells represents a log-10-transformed p-value returned by the hypergeometric test.

**Figure S9.** The NES score plotted in this heat map is a measure of metagene enrichment within a specific gene ontology involved with cellular components. Bright red cells indicate high enrichment.

**Figure S10.** The NES score plotted in this heat map is a measure of metagene enrichment within a specific gene ontology involved with molecular functions. Bright red cells indicate high enrichment.


In Figure

Metagenes 1, 3 and 5

We observed in Figure

In Figure

**Metagenes 1, 3 and 5 Activity in Biological Processes.** This heat map shows the biological process NES score for metagenes 1, 3 and 5. Bright red cells indicate high enrichment within an ontology. Metagene 3 is highly enriched with respect to different responses related to chemical and mechanical stimulus. Metagene 5 is enriched with respect to defense responses such as light and bacterial infection. Metagene 1 is also enriched in catabolic processes related to toxin removal, which one would expect for a metagene active under the stress series of experiments.

Metagene Correlation Network

To determine how the metagenes interact with each other, a Spearman correlation matrix

1. Each network node represents a metagene.

2. The size of each node is proportional to the metagene activity within the dataset.

3. A line is drawn between a pair of nodes if the p-value of their Spearman correlation is less than 10^{-12}.

a. Positive correlations are denoted by red lines

b. Negative correlations are denoted by green lines

c. The width of the line is proportional to the strength of the correlation.

4. The pie slices within each node represent the amount of enrichment for specific gene ontologies as determined by GSEA.

In the correlation network of Figure


**Metagene Correlation Network for Biological Processes.** Each node in this network represents a metagene. The size of each node is proportional to the activity of the metagene within the dataset. The width of lines between a pair of nodes is proportional to the strength of the correlation between them. Positive correlations are denoted by red lines, and negative correlations by green lines. Only Spearman correlations with a p-value less than 10^{-12} are visible. The pie slices within each node represent the amount of enrichment for specific gene ontologies (the NES score).

A Comparison of Metagenes with Atias

In the study by Atias

Atias created three different gene correlation networks based on the scoring threshold. For the 0.3-threshold network, gene pairs with a score less than 0.3 were filtered out of the analysis. Using graph theoretic methods encapsulated within the MCODE plug-in for Cytoscape

A hypergeometric test was used to determine the degree of intersection between the metagenes in this study and the clusters identified in the 0.3-threshold and pathogen networks of Atias. See the Methods section for details. A matrix of p-values for each of the networks was calculated. Hierarchical clustering of both the metagenes and the clusters from Atias was then performed. The p-values in the corresponding heat maps are log_{10} transformed, and the bright green squares correspond to p-values less than or equal to 10^{-10}.

**Heat map of intersection p-values for the 0.3-threshold network.** The intersection between clusters in the Atias study for the 0.3-threshold network and the metagenes in this study is visualized as a heat map, with cells representing log-10-transformed p-values from a hypergeometric test. Bright green values indicate significant statistical overlap (very low p-values).


**Heat map of intersection p-values for the pathogen network.** Intersection p-values between clusters in the pathogen network from Atias, and the metagenes in this study. Bright green cells represent significant statistical overlap. The intensity of the cells represents a log-10-transformed p-value returned by the hypergeometric test.

The Atias network with a score threshold of 0.3 contains 1372 genes. Of these, 1118 genes are also contained within the metagenes discovered in this analysis. Significant statistical overlaps were observed (see Figure

In both the study by Atias and our study, portions of each dataset consist of microarray samples from pathogen experiments. In this study, metagenes 6, 7, 8, and 13 are actively expressed in at least two out of the three pathogen series (with metagene 6 being involved in all three). See Additional file

Examining the p-values of intersection for these metagenes with the clusters from the pathogen network in the Atias study reveals many significant overlaps. Not surprisingly, metagene 6 intersects with the most clusters. See Additional file

Discussion

The three primary steps of the analysis conducted in this paper are:

1. Filtering genes of interest based upon their expression activity on the samples.

2. Finding the optimal number of metagenes intrinsic to the dataset and applying NMF. Using the results of NMF to construct a metagene correlation network.

3. Determining the functionality of the metagenes by applying GSEA, and using NMF results to provide a ranking for the genes within each metagene.

In the study by Lee and colleagues

A benefit of using the NMF algorithm instead of other algorithms such as hierarchical clustering, PCA, or VQ methods is that it does not constrain genes to belong exclusively to one cluster, so the method realistically models the way in which many genes perform different functional roles within the genome. There is also a way to approximate an optimum number of metagenes (see

In Figure

Conclusions

NMF analysis of the AtGenExpress dataset revealed 15 metagenes representing collections of genes with correlated expression patterns (both globally with respect to all of the experiments in the data series, and locally with respect to a smaller subset of the experiments). By combining the NMF results with

Methods

Pre-processing

The AtGenExpress data files were downloaded from the NCBI Gene Expression Omnibus (

Filtering

Filtering of the dataset was conducted in two stages. For the first stage, the Wilcoxon sign-based present/absence detection algorithm available in Bioconductor

The second level of filtering was applied by ranking all of the genes using the L_{2} norm. Let g = (g_{1}, ..., g_{P}) be the vector of expression values of a gene across the P samples. The L_{2} norm of g is

$$\|g\|_{2} = \sqrt{\sum_{j=1}^{P} g_{j}^{2}}$$

This measure assigns a value to each gene based upon its overall degree of gene expression with respect to all of the samples in the dataset. The top 7000 genes ranked according to their L_{2} norm values were selected for further analysis.
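The ranking step described above can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the input layout (a dictionary from gene ID to a list of expression values) and the function names `l2_norm` and `top_genes_by_l2` are hypothetical.

```python
import math


def l2_norm(values):
    """Euclidean (L2) norm of one gene's expression vector."""
    return math.sqrt(sum(v * v for v in values))


def top_genes_by_l2(expression, n_keep):
    """Return the n_keep gene IDs with the largest L2 norm.

    `expression` maps a gene ID to its expression values, one per sample
    (a hypothetical layout chosen for this sketch).
    """
    ranked = sorted(expression, key=lambda g: l2_norm(expression[g]), reverse=True)
    return ranked[:n_keep]


# Toy data: three genes measured on three samples.
toy = {
    "geneA": [3.0, 4.0, 0.0],  # norm 5
    "geneB": [1.0, 1.0, 1.0],  # norm sqrt(3)
    "geneC": [6.0, 8.0, 0.0],  # norm 10
}
selected = top_genes_by_l2(toy, n_keep=2)  # geneC first, then geneA
```

In the study the same selection would be made with `n_keep=7000` over all genes that passed the first filtering stage.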

Adjusting for Batch Effects

After normalizing the dataset, it was necessary to adjust for the problem of batch effects. An Empirical Bayes (EB) method

To adjust for batch effects, the batches must first be defined. Samples were grouped into the same batch if they were from the same experiment, resulting in 29 different batches.

The influence of batch effects on the data can be inferred by selecting two rows of the dataset at random. Each row represents a gene and its vector of expression values across all samples in the dataset. In the absence of batch effects, it is expected that the Pearson correlation coefficient between two randomly selected genes will be very close to zero.

To assess the effectiveness of the Empirical Bayes method, 1000 genes (rows) were selected at random from the dataset, and the standard deviation was plotted against the average Pearson correlation between each gene and every other gene in the random sample. This plot is available as Additional file
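The diagnostic described here (the average Pearson correlation between each randomly chosen gene and every other gene in the sample) can be sketched as follows. The function names and the fixed random seed are assumptions made for illustration; the Empirical Bayes adjustment itself is not reproduced.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (neither may be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mean_correlation_per_gene(rows):
    """For each gene (row), average its correlation with every other row."""
    means = []
    for i, gene_i in enumerate(rows):
        others = [pearson(gene_i, gene_j) for j, gene_j in enumerate(rows) if j != i]
        means.append(sum(others) / len(others))
    return means


def sampled_mean_correlations(matrix, n_genes, seed=0):
    """Draw n_genes rows at random and return their mean correlations."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return mean_correlation_per_gene(rng.sample(matrix, n_genes))


# Toy matrix: gene 1 is a multiple of gene 0, gene 2 runs in reverse.
toy_rows = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]]
toy_means = mean_correlation_per_gene(toy_rows)
```

Running this before and after batch adjustment, with `n_genes=1000`, would give the two sets of values compared in the text; values clustered near zero indicate that little spurious correlation remains.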

Non-negative Matrix Factorization

After standard normalization, adjusting for batch effects, and filtering, the dataset consists of gene expression values for 7000 genes in 783 samples. It is represented by a matrix V of dimension N x P, where

Non-negative matrix factorization factors

In this study, the
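As a rough illustration of how a non-negative factorization of an N x P matrix V into an N x k factor and a k x P factor can be computed, the sketch below uses the multiplicative update rules for the squared-error objective introduced by Lee and Seung. The choice of that objective, the random initialization, the iteration count, and the names `nmf`, `w` and `h` are assumptions for this sketch; the exact variant used in the study may differ.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Sum of squared differences between V and the product W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, k, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative V (N x P) into W (N x k) and H (k x P)."""
    rng = random.Random(seed)
    n, p = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(p)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(p)] for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]
    return w, h


# Toy matrix with 4 "genes" and 3 "samples".
v_toy = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 1.0, 5.0], [1.0, 1.0, 3.0]]
w_toy, h_toy = nmf(v_toy, k=2)
```

Because the updates only multiply non-negative quantities, both factors stay non-negative throughout, which is what lets each column of the first factor be read as an additive group of genes.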

Finding the Optimal Number of Metagenes

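Although the factor W (the metagene matrix) defines the metagenes, the encoding coefficient matrix H can be used to cluster the samples: sample j is assigned to cluster i if H_{ij} is the largest value in column j of H.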

The NMF algorithm does not always converge to the same solution each time it is run. However, if we have chosen a stable value for

For each run, the sample clustering can be represented by a connectivity matrix

The stability of a clustering for a given value of k is measured by the cophenetic correlation coefficient ρ_{k}, which is calculated by computing the Pearson correlation coefficient between two distance matrices. The first distance matrix represents the distance between samples of

Consensus matrices were computed for
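The bookkeeping behind a consensus matrix can be sketched as below: each run assigns every sample to the metagene with the largest encoding coefficient, the assignment is turned into a 0/1 connectivity matrix, and the connectivity matrices are averaged over runs. The function names are hypothetical, and the cophenetic correlation step (which needs a hierarchical clustering of the consensus matrix) is deliberately left out of this sketch.

```python
def cluster_assignments(h):
    """Assign each sample (column of H) to the row with the largest coefficient."""
    k, p = len(h), len(h[0])
    return [max(range(k), key=lambda i: h[i][j]) for j in range(p)]


def connectivity_matrix(labels):
    """Entry (a, b) is 1 if samples a and b share a cluster, else 0."""
    p = len(labels)
    return [[1 if labels[a] == labels[b] else 0 for b in range(p)] for a in range(p)]


def consensus_matrix(label_runs):
    """Average the connectivity matrices of several clustering runs."""
    p = len(label_runs[0])
    total = [[0.0] * p for _ in range(p)]
    for labels in label_runs:
        conn = connectivity_matrix(labels)
        for a in range(p):
            for b in range(p):
                total[a][b] += conn[a][b]
    n_runs = len(label_runs)
    return [[x / n_runs for x in row] for row in total]


# One toy encoding matrix (2 metagenes x 3 samples) and three toy runs.
labels_toy = cluster_assignments([[0.9, 0.8, 0.1], [0.1, 0.2, 0.7]])
consensus_toy = consensus_matrix([[0, 0, 1], [1, 1, 0], [0, 1, 1]])
```

Entries of the consensus matrix close to 0 or 1 mean that a pair of samples is separated or grouped consistently across runs, which is the behaviour expected at a stable value of k. Note that cluster labels need not match between runs, since only co-membership is counted.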

The Metagenes

Each column of the metagene matrix W represents a metagene, and the entry W_{ij} represents the coefficient of gene i in metagene j.

To compare the activity of the metagenes, the column sums of the coefficient values were calculated, and the columns sorted in descending order according to their column sums. So the first column of

A cut-off threshold of
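The ordering and thresholding of the metagene coefficients can be sketched as follows. The value δ = 0.2 is taken from the caption of the coefficient histogram in the additional files; whether the coefficients are rescaled before the cut-off is applied is not stated here, so the sketch thresholds the raw values, and the function names are hypothetical.

```python
def sort_metagenes(w):
    """Reorder the columns of W by descending column sum."""
    k = len(w[0])
    sums = [sum(row[j] for row in w) for j in range(k)]
    order = sorted(range(k), key=lambda j: sums[j], reverse=True)
    return [[row[j] for j in order] for row in w], order


def metagene_members(w, gene_ids, delta):
    """For each column, list the genes with coefficient > delta, largest first."""
    members = []
    for j in range(len(w[0])):
        kept = [(w[i][j], gene_ids[i]) for i in range(len(w)) if w[i][j] > delta]
        kept.sort(reverse=True)
        members.append([gene for _, gene in kept])
    return members


# Toy coefficient matrix: 3 genes x 2 metagenes.
w_raw = [[0.1, 0.9], [0.5, 0.3], [0.05, 0.6]]
w_sorted, column_order = sort_metagenes(w_raw)
members = metagene_members(w_sorted, ["g1", "g2", "g3"], delta=0.2)
```

In the toy example gene `g2` passes the cut-off in both columns, which illustrates that a gene is not forced to belong to a single metagene.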

Comparison with the result of Atias

The hypergeometric p-value returned by testing for statistical overlap between a gene cluster in the Atias study and a metagene in this one was calculated using Equation 2 below:

where:

All calculations were carried out in R
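For readers who want to experiment with the overlap test, the sketch below computes a standard upper-tail hypergeometric p-value in Python (used here purely for illustration; it is not the authors' R code). The parameter names and the choice of background population are assumptions, and the sketch is not claimed to match Equation 2 term by term.

```python
from math import comb


def hypergeom_overlap_pvalue(population, cluster_size, metagene_size, overlap):
    """P(X >= overlap) for X ~ Hypergeometric.

    `metagene_size` genes are drawn without replacement from `population`
    genes, of which `cluster_size` belong to the cluster being compared.
    """
    upper = min(cluster_size, metagene_size)
    total = comb(population, metagene_size)
    tail = sum(
        comb(cluster_size, x) * comb(population - cluster_size, metagene_size - x)
        for x in range(overlap, upper + 1)
    )
    return tail / total


# Toy case: 10 genes in total, a cluster of 4, a metagene of 3, 2 shared genes.
p_toy = hypergeom_overlap_pvalue(10, 4, 3, 2)  # (6*6 + 4*1) / 120 = 1/3
```

Small p-values indicate that a metagene and a cluster share more genes than expected by chance; in a heat map these would typically be shown on a log-10 scale.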

Creating the Metagene Correlation Network

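To determine how the metagenes interact with each other, a correlation matrix was computed whose entry (i, j) is the Spearman correlation between metagenes i and j. A p-value threshold of 10^{-12} was set, and a connection was drawn between two nodes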

The Spearman p-value was approximated using a two-sided Student's t-test. Specifically, if

This statistic approximately follows a Student-t distribution with 781 degrees of freedom under the null hypothesis.
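A minimal sketch of this approximation is given below, assuming the usual form of the statistic, t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom (781 when n = 783 samples). The function names are hypothetical, and the final conversion of t to a p-value is omitted because it needs a Student's t distribution function that the Python standard library does not provide.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for m in range(i, j + 1):
            ranks[order[m]] = avg
        i = j + 1
    return ranks


def spearman_t(x, y):
    """Return (rho, t, degrees of freedom); undefined when |rho| == 1."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    scale = math.sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))
    rho = cov / scale
    t = rho * math.sqrt((n - 2) / (1 - rho * rho))
    return rho, t, n - 2


# Toy example with five paired observations and no ties.
rho_toy, t_toy, df_toy = spearman_t([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```

An edge would then be drawn between two metagenes only when the two-sided p-value of `t` falls below the chosen threshold.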

The pie chart representation of the network nodes, showing enrichment in gene ontologies, was created using the software library

Metagene Involvement in Experimental Series

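The rows of the encoding coefficient matrix H correspond to the metagenes: the entry H_{ij} measures the degree to which metagene i is active in sample j.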

Through an application of the Kruskal-Wallis test we can determine on which series of experiments each metagene is active. The Kruskal-Wallis test is a non-parametric method for testing the equality of population medians between groups

Data measurements are converted to ranks before the test is applied. Given one row of

Given a row of

where:

The
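The rank-based computation can be sketched as follows, where each group holds the encoding coefficients of one metagene on the samples of one experimental series. The per-group z-value used here, (mean rank of the group minus (N + 1)/2) divided by sqrt((N + 1)(N/n_i - 1)/12), is a commonly reported form and is an assumption about how the z-values in this study were defined; no tie correction is applied to H, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for m in range(i, j + 1):
            ranks[order[m]] = avg
        i = j + 1
    return ranks


def kruskal_wallis(groups):
    """Return (H statistic, list of per-group z-values)."""
    pooled = [v for g in groups for v in g]
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    overall_mean_rank = (n_total + 1) / 2
    h_stat, z_values, start = 0.0, [], 0
    for g in groups:
        n_g = len(g)
        mean_rank = sum(ranks[start:start + n_g]) / n_g
        start += n_g
        h_stat += n_g * (mean_rank - overall_mean_rank) ** 2
        se = math.sqrt((n_total + 1) * (n_total / n_g - 1) / 12)
        z_values.append((mean_rank - overall_mean_rank) / se)
    h_stat *= 12 / (n_total * (n_total + 1))
    return h_stat, z_values


# Toy example: three series of three samples each.
h_toy, z_toy = kruskal_wallis([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

A large positive z-value marks a series in which the metagene's coefficients rank above the overall average, and a large negative value marks a series in which they rank below it.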

**kruskal_wallis.** Excel spreadsheet containing z-values returned by the Kruskal-Wallis test on the encoding coefficients.


Gene Set Enrichment Analysis

The GSEA software package available from The Broad Institute (

The GSEA algorithm takes as inputs two sets of gene lists:

1. A pre-defined set of genes S (such as genes sharing the same metabolic pathways)

2. A list of genes L ordered according to some ranking criteria of importance to the researcher.
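To make the roles of S and L concrete, the sketch below computes an unweighted running-sum enrichment score: walking down L, the score rises by 1/|hits| at each gene in S and falls by 1/|misses| otherwise, and the score is the largest deviation from zero. This is a simplified illustration and not the Broad Institute implementation, which additionally weights hits by the ranking metric and normalizes the score against permutations; the function name is hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score of gene_set within ranked_genes."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene_set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the largest deviation from zero
    return best


ranked = ["g1", "g2", "g3", "g4"]                 # ordered list L
es_top = enrichment_score(ranked, {"g1", "g2"})   # set members at the top of L
es_bottom = enrichment_score(ranked, {"g3", "g4"})  # set members at the bottom
```

A score near +1 means the gene set is concentrated at the top of the ranking, and a score near -1 means it is concentrated at the bottom.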

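In this study, we used the columns of the metagene matrix W to define the ranked lists, ordering the genes in each metagene by their coefficients W_{ij}.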

Three databases of

Competing Interests

The authors declare that they have no competing interests.

Authors’ contributions

TJW and SXG conceived the project. TJW did the analysis and wrote the paper. YB and LL created the GSEA knowledge base used in the enrichment analysis. All authors have read and approved the final manuscript.

Acknowledgements

This work was supported in part by National Institutes of Health [GM083226 to XG]. The content is solely the responsibility of the authors and does not necessarily represent the official views of NIGMS or NIH. Partial support has also been received from Susan G. Komen for the Cure Foundation (BCTR0600447).