School of Mathematics & Statistics, Newcastle University, Newcastle upon Tyne, NE1 7RU, UK

Centre for Integrated Systems Biology of Ageing and Nutrition (CISBAN), Newcastle University, UK

Institute for Ageing and Health, Newcastle University, Campus for Ageing and Vitality, Newcastle upon Tyne, NE4 5PL, UK

Abstract

Background

Large-scale microarray experiments are becoming increasingly routine, particularly those which track a number of different cell lines through time. This time-course information provides valuable insight into the dynamic mechanisms underlying the biological processes being observed. However, proper statistical analysis of time-course data requires the use of more sophisticated tools and complex statistical models.

Findings

Using the open-source CRAN and Bioconductor repositories for R, we provide an example analysis and protocol that illustrate a variety of methods for analysing time-course microarray data. In particular, we highlight how to construct appropriate contrasts to detect differentially expressed genes and how to generate plausible pathways from the data. A maintained version of the R commands can be found at

Conclusions

CRAN and Bioconductor are stable repositories that provide a wide variety of appropriate statistical tools to analyse time-course microarray data.

Introduction

As experimental costs decrease, large-scale microarray experiments are becoming increasingly routine, particularly those which track a number of different cell lines through time. This is because time-course information provides valuable insight into the dynamic mechanisms underlying the biological processes being observed. However, a proper statistical analysis of time-course data requires the use of more sophisticated tools and complex statistical models. For example, problems due to multiple comparisons are increased by catering for changing effects over time. In this case study, we demonstrate how to analyse time-course microarray data by investigating a data set on yeast. We discuss issues related to normalisation, extraction of probesets for specific species, chip quality and differential expression. We also discuss network inference in the Additional file.

**Additional R commands and analysis**. 1. R commands for extracting S. cerevisiae ids, removing unwanted probesets and converting probesets to genes. 2. R commands for genetic regulatory network inference. 3. A list of R packages used in this manuscript. 4. Additional figures.

Description of the data

The data were collected according to the experimental protocol described in

Loading microarray data into Bioconductor

Installing Bioconductor and associated packages

Assuming that R is already installed, Bioconductor is fairly straightforward to obtain using its installation script, viz:

> **url**='

> **source**(**url**)

> biocLite()

This installs a number of base packages, including

> # Additional Bioconductor packages used in this analysis

> biocLite(**c**('ArrayExpress', 'Mfuzz', 'timecourse', 'yeast2.db', 'yeast2probe'))

> # Additional CRAN packages

> **install.packages**(**c**('GeneNet', 'gplots'))

Bioconductor packages are updated regularly on the web and so users can easily update their currently installed packages by starting a new R session and then using

> **update.packages**(repos = biocinstallRepos())

See

A list of packages used in this paper is given in the Additional file

Entering data into Bioconductor

The data used in this paper can be downloaded from ArrayExpress into R using the commands

> **library**(ArrayExpress)

> yeast.raw = ArrayExpress('E-MEXP-1551')

Unfortunately due to changes in the ArrayExpress website, the

A brief description of the

AffyBatch object

size of arrays = 496 × 496 features (3163 kb)

cdf = Yeast_2 (10928 affyids)

number of samples = 30

number of genes = 10928

annotation = yeast2

If the Affymetrix microarray data sets have been downloaded into a single directory, then the

Also available from ArrayExpress are the experimental conditions. However, some preprocessing is necessary:

> ph = yeast.raw@phenoData

> exp_fac = **data.frame**(data_order = **seq **(1, 30),

+ strain = ph@data**$**Factor.Value.GENOTYPE.,

+ replicates = ph@data**$**Factor.Value.INDIVIDUAL.,

+ tps = ph@data**$**Factor.Value.TIME.)

> **levels**(exp_fac**$**strain) = **c**('m', 'w')

> exp_fac = **with**(exp_fac, exp_fac[**order**(strain, replicates, tps), ])

> exp_fac**$**replicate = **rep**(**c**(1, 2, 3), each = 5, times = 2)

The data frame

Note that there are two yeast species on this chip,

Pre-processing

Extraction of

As these microarrays contain probesets for both

We obtain a data frame containing lists of

>

> s_cer = **read.table**('s_cerevisiae.msk', skip = 2, stringsAsFactors = FALSE)

> probe_filter = s_cer[[1]]

> **source**('ExtractIDs.R')

> c_df = ExtractIDs(probe_filter)

We also need to restrict the view of

>

> **library**(affy)

> **library**(yeast2probe)

> **source**('RemoveProbes.R')

> cleancdf = cleancdfname(yeast.raw@cdfName)

> RemoveProbes(probe_filter, cleancdf, 'yeast2probe')

Note that the commands in

AffyBatch object

size of arrays = 496 × 496 features (3167 kb)

cdf = Yeast_2 (5900 affyids)

number of samples = 30

number of genes = 5900

annotation = yeast2

and the number of genes (actually probesets here) is 5,900 now that the

Data Quality Assessment

Before any formal statistical analysis, it is important to check for data quality. Initially, we might examine the perfect match and mismatch probe-level data to detect anomalies. Images of the first five arrays can be obtained using

> op = **par**(mfrow = **c**(3, 2))

> **for**(i in 1:5) {

+ plot_title = **paste **('Strain:', exp_fac**$**strain [i], 'Time:', exp_fac**$**tps [i])

+ d = exp_fac**$**data_order [i]

+ **image**(yeast.raw [, d], main = plot_title)

+ }

These commands produce the image shown in the Additional file

Another useful quality assessment tool is to examine density plots of the probe intensities. The command

> d = exp_fac**$**data_order[1:5]

> **hist**(yeast.raw[, d], lwd = 2, ylab = 'Density', xlab = 'Log (base 2) intensities')

produces the image shown in the Additional file

Other exploratory data analysis techniques that should be carried out include MA plots, in which two microarrays are compared and their log intensity differences for each probe on each gene are plotted against their averages. It is also of interest to examine RNA degradation (see
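As a minimal sketch of what an MA plot displays, the following base R commands compute the log intensity difference (M) and the average log intensity (A) for two arrays. The intensities here are simulated toy values for two hypothetical arrays, not the yeast data, and the object names are illustrative only.

```r
# Toy intensities for two hypothetical arrays (not the yeast data)
set.seed(1)
array1 <- 2^rnorm(200, mean = 8, sd = 1.5)
array2 <- array1 * 2^rnorm(200, mean = 0, sd = 0.3)

# M: log (base 2) intensity difference; A: average log intensity
M <- log2(array1) - log2(array2)
A <- (log2(array1) + log2(array2)) / 2

plot(A, M, pch = 20, xlab = 'A (average log intensity)',
     ylab = 'M (log intensity difference)')
abline(h = 0, lty = 2)  # arrays that agree scatter around M = 0
```

A systematic drift of the point cloud away from the horizontal line at zero would suggest an intensity-dependent difference between the two arrays.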

Normalising Microarray Data

There are a number of methods for normalising microarray data. Two of the most popular methods are GeneChip RMA (GCRMA) and Robust Multiple-array Average (RMA); see

Since we have thirty microarray data sets and believe that the levels of transcriptional activity are similar across strains, we will use the RMA normalisation method. This technique normalises across the set of hybridizations at the probe level. The data can be normalised via

> yeast.rma = rma(yeast.raw)

> yeast.matrix = exprs(yeast.rma)[, exp_fac**$**data_order]

> cnames = **paste**(exp_fac**$**strain, exp_fac**$**tps, sep = ' ')

> **colnames**(yeast.matrix) = cnames

> exp_fac**$**data_order = 1:30

The normalisation procedure consists of three steps: model-based background correction, quantile normalisation and robust averaging. The aim of the quantile normalisation is to make the distribution of probe intensities for each array in a set of arrays the same. We illustrate its effect by studying boxplots of the raw

> **library**(affyPLM)

> **par**(mfrow = **c**(1, 2))

> # Probe intensities before normalisation

> **boxplot**(yeast.raw, **col** = 'red', main = '')

> # Expression values after RMA normalisation

> **boxplot**(yeast.rma, **col** = 'blue')
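The quantile normalisation step described above can be sketched in base R on a toy matrix. This is only an illustration of the general idea (replace each value by the mean of the values sharing its rank across arrays); it is not the code used inside the RMA implementation, and the function name `quantile_normalise` and the toy matrix are assumptions made for the example.

```r
# Toy matrix: rows are probes, columns are three hypothetical arrays
x <- cbind(a1 = c(5, 2, 3, 4),
           a2 = c(4, 1, 4.5, 2),
           a3 = c(3, 4, 6, 8))

quantile_normalise <- function(x) {
  # Rank of each probe within its own array (ties broken by position)
  ranks <- apply(x, 2, rank, ties.method = 'first')
  # Reference distribution: mean of the sorted columns, row by row
  reference <- rowMeans(apply(x, 2, sort))
  # Replace each value by the reference value of the same rank
  normalised <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalised) <- dimnames(x)
  normalised
}

x_norm <- quantile_normalise(x)
# After normalisation every array contains the same set of values
apply(x_norm, 2, sort)
```

The ordering of probes within each array is preserved; only the values attached to each rank change, which is why the boxplots of the normalised arrays line up.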

Principal Component Analysis

Principal component analysis (PCA) is useful in exploratory data analysis as it can reduce the number of variables to consider whilst still retaining much of the variability in the data. In particular, PCA is useful for identifying patterns in the data. Essentially, principal components partition the data into orthogonal linear components which explain different contributions to the variability in the data. The first component explains the largest contribution to variability in the original dataset, that is, retains most information, with the second component explaining the next largest contribution to variability, and so on. The following commands calculate the principal components

> yeast.PC = **prcomp**(**t**(yeast.matrix))

> yeast.scores = **predict**(yeast.PC)

which we can then plot using

>

> **plot**(yeast.scores [, 1], yeast.scores [, 2],

+ xlab = 'PC 1', ylab = 'PC 2',

+ pch = **rep**(**seq **(1, 5), 6),

+ **col **= **as.numeric**(exp_fac**$**strain))

> **legend**(-20, -4, pch = 1:5, cex = 0.6, **c**('t 0', 't 60', 't 120', 't 180', 't 240'))

Figure

**A plot of the first two principal components**. The red symbols correspond to the wild-type strain.
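To make the notion of "contribution to variability" concrete, the following sketch computes the proportion of variance explained by each principal component and checks that the component scores are uncorrelated. It uses a simulated toy matrix with two hypothetical conditions rather than the yeast data; all object names are illustrative.

```r
# Toy expression matrix: 100 hypothetical genes (rows) by 6 arrays (columns)
set.seed(42)
group <- rep(c(0, 1), each = 3)        # two hypothetical conditions
toy <- matrix(rnorm(600), nrow = 100)
toy[1:20, ] <- toy[1:20, ] + outer(rep(3, 20), group)  # 20 genes shift with condition

toy.PC <- prcomp(t(toy))               # arrays as observations, as in the text
var_explained <- toy.PC$sdev^2 / sum(toy.PC$sdev^2)
round(var_explained, 3)

# Components are orthogonal, so their scores are uncorrelated
scores <- predict(toy.PC)
round(cor(scores[, 1], scores[, 2]), 10)
```

The same two lines applied to `yeast.PC` would show how much of the variability in the yeast data is captured by the two components plotted above.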

Identifying differentially expressed genes

In this experiment, interest lies in differences in gene expression over time between the wild-type and mutant yeast strains. It is expected that the wild-type expression level is independent of time. Also we anticipate that the mutant expressions at time

There are currently two main packages available to detect differentially expressed genes using this kind of data: the

Using the timecourse package

This package assesses treatment differences by comparing time-course mean profiles allowing for variability both within and between time points. It uses the multivariate empirical Bayes model proposed by

Further details of the

> **library**(timecourse)

> size = **matrix**(3, **nrow **= 5900, **ncol **= 2)

To extract a list of differentially expressed genes, we calculate the Hotelling statistic:

> c.grp = **as.character**(exp_fac**$**strain)

> t.grp = **as.numeric**(exp_fac**$**tps)

> r.grp = **as.character**(exp_fac**$**replicate)

> MB.2D = mb.long(yeast.matrix, times = 5, method = '2', reps = size,

+ condition.grp = c.grp, time.grp = t.grp, rep.grp = r.grp)

The top (say) one hundred genes can be extracted via

> gene_positions = MB.2D**$**pos.HotellingT2 [1:100]

> gnames = **rownames**(yeast.matrix)

> gene_probes = gnames[gene_positions]

The expression profiles can also be easily obtained. The profile for the top-ranked gene is found using

> plotProfile(MB.2D, ranking = 1, gnames = **rownames**(yeast.matrix))

and is shown in the Additional file

Using the limma package

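The linear model is $E[y_g] = X\alpha_g$, where $y_g = (y_{g,1}, \ldots, y_{g,n})^T$ contains the expression values for gene $g$ across the $n$ arrays, $X$ is the design matrix and $\alpha_g$ is the coefficient vector for gene $g$. We will label these ten coefficients as ('m0', 'm60', 'm120', 'm180', 'm240', 'w0', 'w60', 'w120', 'w180', 'w240'), where the first five coefficients represent the levels of the mutant strain at time points 0, 60, 120, 180 and 240, and the last five the corresponding levels of the wild-type strain. The design matrix $X$ is constructed using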

> **library**(limma)

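> # Keep the factor levels in column order so that the design columns match the labels assigned below

> expt_structure = **factor**(**colnames**(yeast.matrix), **levels** = **unique**(**colnames**(yeast.matrix)))

> X = **model.matrix**(~0 + expt_structure)

> **colnames**(X) = **c**('m0', 'm60', 'm120', 'm180', 'm240', 'w0', 'w60', 'w120', 'w180', 'w240')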

and then the coefficient vector $\alpha_g$ is estimated via the command

> **lm.fit **= lmFit(yeast.matrix, X)

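Determining the differentially expressed genes amounts to studying contrasts of the various strain × time levels, as described by a contrast matrix $C$ through $\beta_g = C^T\alpha_g$ for gene $g$. The contrasts below compare the mutant and wild-type strains at each of the later time points: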

> mc = makeContrasts('m60-w60', 'm120-w120', 'm180-w180', 'm240-w240', **levels **= X)

> c.fit = contrasts.fit(**lm.fit**, mc)

> eb = eBayes(c.fit)

The final command uses an empirical Bayes approach to assess whether the contrasts $\beta_{gj}$ are plausibly zero, corresponding to no significant evidence of a difference between strains at the time point of contrast $j$.
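The relationship between the design matrix, the coefficient vector and a contrast can be illustrated for a single gene in base R. This is a sketch on made-up expression values for one hypothetical gene with two strains and two time points; it covers only the least-squares part of the fit and does not reproduce the empirical Bayes moderation that limma applies afterwards.

```r
# Toy data for one hypothetical gene: 2 strains x 2 time points, 3 replicates each
groups <- factor(rep(c('m0', 'm60', 'w0', 'w60'), each = 3),
                 levels = c('m0', 'm60', 'w0', 'w60'))
y <- c(7.1, 6.9, 7.0,   9.2, 8.8, 9.0,   7.0, 7.2, 6.8,   7.1, 6.9, 7.0)

X <- model.matrix(~0 + groups)          # one column per strain x time level
colnames(X) <- levels(groups)

# Least-squares estimate of the coefficient vector (the group means here)
alpha_hat <- solve(crossprod(X), crossprod(X, y))

# Contrast matrix with a single column: mutant minus wild-type at time 60
C <- matrix(c(0, 1, 0, -1), ncol = 1,
            dimnames = list(colnames(X), 'm60-w60'))
beta_hat <- crossprod(C, alpha_hat)     # t(C) %*% alpha_hat
beta_hat
```

With this design each coefficient is simply the mean of its three replicates, so the contrast is the difference between the mutant and wild-type means at time 60.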

Ranking differentially expressed genes

There are a number of ways to rank the differentially expressed genes. For example, they can be ranked according to their log-fold change

>

> toptable(eb, sort.by = 'logFC')

or by using

> topTableF(eb)

The advantage of using

Our analysis is based on a large number of statistical tests, and so we must correct for this multiple testing. In our example we use the (very) conservative Bonferroni correction since we have a large number of differentially expressed genes and the resulting corrected list is still long. Another common method of correcting for multiple testing is to use the false discovery rate (fdr) (use the command

> modFpvalue = eb**$**F.p.value

>

> indx = **p.adjust**(modFpvalue, method = 'bonferroni') < 0.05

> sig = modFpvalue[indx]

>

> nsiggenes = **length**(sig)

> results = decideTests(eb, method = 'nestedF')

> modF = eb**$F**

> modFordered = **order**(modF, decreasing = TRUE)

>

> c_rank_probe = c_df**$**probe [modFordered [1:nsiggenes]]

> c_rank_genename = c_df**$**genename [modFordered [1: nsiggenes]]

>

> updown = results[modFordered [1:nsiggenes],]

> **write.table**(**cbind**(c_rank_probe, c_rank_genename, updown),

+ **file **= 'updown.csv', sep = ',', **row.names **= FALSE, **col.names **= FALSE)
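The difference between the Bonferroni correction and a false discovery rate correction can be seen on simulated p-values using the base R function `p.adjust`. The p-values below are toy values for 1000 hypothetical genes and are not derived from the yeast data; the Benjamini-Hochberg method is used here as one common false discovery rate procedure.

```r
# Toy p-values for 1000 hypothetical genes: 50 with real effects, 950 null
set.seed(123)
p_null   <- runif(950)
p_effect <- rbeta(50, shape1 = 0.5, shape2 = 200)  # concentrated near zero
pvals <- c(p_effect, p_null)

bonf <- p.adjust(pvals, method = 'bonferroni')
bh   <- p.adjust(pvals, method = 'BH')             # Benjamini-Hochberg FDR

# Bonferroni multiplies each p-value by the number of tests (capped at 1)
c(bonferroni = sum(bonf < 0.05), fdr = sum(bh < 0.05))
```

The Benjamini-Hochberg adjusted p-values are never larger than the Bonferroni ones, so the false discovery rate list is always at least as long as the Bonferroni list.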

The following code (adapted from lecture material found at

**Time course expression levels for the top 9 differentially expressed genes, ranked by their F-statistic**.

>

> **par**(mfrow = **c **(3, 3), ask = TRUE, cex = 0.5)

> **for **(i in 0:99){

+ indx = **rank**(modF) == **nrow**(yeast.matrix) -i

+

+ id = c_df**$**probe [indx]

+ name = c_df**$**genename [indx]

+ genetitle = **paste**(**sprintf**('%.30s', id), **sprintf**('%.30s', name), 'Rank =', i + 1)

+

+ exprs.row = yeast.matrix[indx, ]

+

+ **plot**(0, pch = NA, xlim = **range**(0, 240), ylim = **range**(exprs.row), ylab = 'Expression',

+ xlab = 'Time', main = genetitle)

+

+ **for **(j in 1:6){

+ pch_value = **as.character**(exp_fac**$**strain [5 * j])

+ **points**(**c **(0, 60, 120, 180, 240), exprs.row[(5 * j-4):(5 * j)], type = 'b', pch = pch_value)

+ }

+ }

When interpreting rank orderings based on statistical significance, it is important to bear in mind that a statistically significant differential expression is not always biologically meaningful. For example, Figure

Comparison of the

Both packages have different strengths. One advantage of the

> N = 100

> gene_positions = MB.2D**$**pos.HotellingT2[1:N]

> tc_top_probes = gnames[gene_positions]

> lm_top_probes = c_df**$**probe[modFordered[1:N]]

> **length**(**intersect**(tc_top_probes, lm_top_probes))

The result is a moderately large overlap of fifty-three probesets. We note that changing the ranking method in the

Two fold-change list

When looking for "interesting" genes it can be helpful to restrict attention to those differentially expressed genes that are both statistically significant and of biological interest. This objective can be achieved by considering only significant genes which show, say, at least a two-fold change in their expression level. This gene list is obtained using the following code (adapted from

>

> maxfoldchange = **function**(foldchange)

+ foldchange[**which.max**(**abs**(foldchange))]

> difference = **apply**(eb**$**coefficients, 1, maxfoldchange)

> pvalue = eb**$**F.p.value

> lodd = -log10(pvalue)

>

> nd = (**abs**(difference) > **log **(2, 2))

> ordered_hfc = **order**(**abs**(difference), decreasing = TRUE)

> hfc = ordered_hfc[1: **length**(difference[nd])]

> np = **p.adjust**(pvalue, method = 'bonferroni') < 0.05

>

> ordered_lpv = **order**(**abs**(pvalue), decreasing = FALSE)

> lpv = ordered_lpv[1: **length **(pvalue[np])]

> oo = **union**(lpv, hfc)

> ii = **intersect**(lpv, hfc)
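The logic of combining a fold-change filter with a significance filter can be checked on a small toy example. Because the expression values are on the log (base 2) scale, a two-fold change corresponds to an absolute log fold change greater than one. The fold changes and p-values below are invented for five hypothetical genes and four contrasts.

```r
# Toy log (base 2) fold changes: 5 hypothetical genes by 4 contrasts
lfc <- rbind(g1 = c( 0.2, -0.1,  0.3,  0.1),
             g2 = c( 0.5,  1.4,  0.9,  0.2),
             g3 = c(-0.3, -2.2, -1.0, -0.4),
             g4 = c( 0.1,  0.2, -0.2,  0.0),
             g5 = c( 1.1,  0.8,  0.4,  0.3))
pvalue <- c(g1 = 0.20, g2 = 0.001, g3 = 0.0001, g4 = 0.50, g5 = 0.04)

# Signed fold change of largest magnitude across the contrasts
max_fc <- apply(lfc, 1, function(fc) fc[which.max(abs(fc))])

big_change  <- abs(max_fc) > log2(2)    # at least a two-fold change (|log2 FC| > 1)
significant <- p.adjust(pvalue, method = 'bonferroni') < 0.05

names(max_fc)[big_change & significant]
```

Here gene g5 shows a two-fold change but does not survive the Bonferroni correction, so only g2 and g3 satisfy both criteria.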

Figure

**Volcano plot showing the Bonferroni cut-off and the two-fold change**.

>

> **par**(cex = 0.5)

> **plot**(difference[-oo], lodd[-oo], xlim = **range**(difference), ylim = **range**(lodd))

> **points**(difference[hfc], lodd[hfc], pch = 18)

> **points**(difference[lpv], lodd[lpv], pch = 1)

>

> **abline**(v = **log **(2, 2), **col **= 5); **abline**(v = - log (2, 2), **col **= 5)

> **abline **(h = -log10 (0.05**/**5900), **col **= 5)

> **text**(**min**(difference) + 1, -log10 (0.05**/**5900) + 0.2, 'Bonferroni cut off')

> **text**(1, **max**(lodd) - 1, **paste**(**length**(ii), 'intersects'))

Cluster Analysis

Biological insight can be gained by determining groups of differentially expressed genes, that is, groups of genes which increase or decrease simultaneously. This can be achieved by using cluster analysis.

Traditional cluster analysis

In this section, we separate the top fifty differentially expressed genes into groups of similar pattern (clusters). Clearly different genes will have different overall levels of expression and so we first standardise their measurements by taking the expression level of the mutant strain (at each time point) relative to the wild-type at time

> c_probe_data = yeast.matrix [ii,]

>

> wt_means = **apply**(c_probe_data [, 16:30], 1, **mean**)

> m = **matrix**(**nrow** = **nrow**(c_probe_data), **ncol** = 5)

> **for **(i in 1:5) {

+ mut_rep = **c**(i, i+5, i +10)

+ m [, i] = **apply**(c_probe_data [, mut_rep], 1, **mean**) - wt_means

+ }

> **colnames**(m) = **sort**(**unique**(exp_fac**$**tps))

The heatmap in Figure

**Clustering of the top fifty differentially expressed genes**. Red and green correspond to up- and down-regulation respectively.

> **library**(gplots)

>

> heatmap.2 (m [1:50,], dendrogram = 'row', Colv = FALSE, **col **= greenred (75),

+ key = FALSE, keysize = 1.0, symkey = FALSE, density.info = 'none',

+ **trace** = 'none', colsep = **rep**(1:10), sepcolor = 'white', sepwidth = 0.05,

+ hclustfun = **function**(d) {**hclust**(d, method = 'average')},

+ labRow = NA, cexCol = 1)
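The row dendrogram in the heatmap is built by hierarchical clustering with average linkage. A minimal base R sketch of that step, applied to simulated profiles for six hypothetical genes, might look as follows; the profiles and object names are assumptions for illustration and the Euclidean distance is simply the default of `dist`.

```r
# Toy profiles: 6 hypothetical genes over 5 time points, two obvious patterns
set.seed(7)
up   <- t(replicate(3, c(0, 1, 2, 3, 4) + rnorm(5, sd = 0.2)))
down <- t(replicate(3, c(0, -1, -2, -3, -4) + rnorm(5, sd = 0.2)))
profiles <- rbind(up, down)
rownames(profiles) <- paste0('gene', 1:6)

d  <- dist(profiles)                      # Euclidean distance between profiles
hc <- hclust(d, method = 'average')       # average linkage, as in the heatmap call
clusters <- cutree(hc, k = 2)             # cut the dendrogram into two groups
clusters
```

Cutting the tree with `cutree` assigns each gene to exactly one group, which is the "hard" assignment that soft clustering relaxes.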

Figure

Soft clustering

Soft clustering methods have the advantage that a probe can be assigned to more than one cluster. Furthermore, it is possible to grade cluster membership within particular groupings. Soft clustering is considered more robust when dealing with noisy data; for more details see

> **library**(Mfuzz)

> tmp_expr = **new**('ExpressionSet', exprs = m)

> cl = mfuzz(tmp_expr, **c **= 8, m = 1.25)

> mfuzz.plot(tmp_expr, cl = cl, mfrow = **c**(2, 4), new.window = FALSE)

Of course, it is usually not clear how many clusters there are (or should be) within a dataset and so the sensitivity of conclusions to the choice of number of clusters (

> cluster = 1

> cl [[4]][,cluster]
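To illustrate what graded cluster membership means, the following sketch evaluates the standard fuzzy c-means membership formula for a few toy profiles and two fixed cluster centres. It is not the code used inside the Mfuzz package, which also updates the centres iteratively; the function `fuzzy_membership`, the profiles and the centres are all hypothetical, and the fuzzifier `m = 1.25` simply mirrors the value used above.

```r
# Toy profiles for 4 hypothetical genes and 2 fixed cluster centres
profiles <- rbind(g1 = c(-1.0, 0.0, 1.0),
                  g2 = c(-0.8, 0.1, 0.9),
                  g3 = c( 1.0, 0.0, -1.0),
                  g4 = c( 0.1, 0.0, -0.1))
centres <- rbind(rising = c(-1, 0, 1), falling = c(1, 0, -1))

fuzzy_membership <- function(x, centres, m = 1.25) {
  # Squared Euclidean distance from every profile to every centre
  d2 <- sapply(seq_len(nrow(centres)), function(k)
    rowSums(sweep(x, 2, centres[k, ])^2))
  d2 <- pmax(d2, 1e-12)                  # guard against division by zero
  w <- d2^(-1 / (m - 1))                 # closer centres get larger weights
  u <- w / rowSums(w)                    # memberships sum to one per gene
  colnames(u) <- rownames(centres)
  u
}

u <- fuzzy_membership(profiles, centres)
round(u, 3)
```

Genes lying close to a centre receive a membership near one for that cluster, whereas a flat profile such as g4 is shared between the two clusters. Increasing `m` makes the memberships more evenly spread.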

**Eight clusters obtained using the Mfuzz package**.

Conclusion

The response to telomere uncapping in

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

AG conducted the microarray experiments. All authors participated in the analysis of the data and in the writing of the manuscript.

Acknowledgements

We wish to thank Dan Swan (Newcastle University Bioinformatics Support Unit) and David Lydall for helpful discussions. The authors are affiliated with the Centre for Integrated Systems Biology of Ageing and Nutrition (CISBAN) at Newcastle University, which is supported jointly by the Biotechnology and Biological Sciences Research Council (BBSRC) and the Engineering and Physical Sciences Research Council (EPSRC).