School of Information Technologies, The University of Sydney, NSW 2006, Australia

Muscle Research Unit, Bosch Institute, Discipline of Anatomy and Histology, The University of Sydney, NSW 2006, Australia

Sydney Bioinformatics and Centre for Mathematical Biology, The University of Sydney, NSW 2006, Australia

NICTA, Australian Technology Park, Eveleigh, NSW 2015, Australia

Abstract

Background

It has been a long-standing biological challenge to understand the molecular regulatory mechanisms behind mammalian ageing. Harnessing the availability of many ageing microarray datasets, a number of studies have shown that it is possible to identify genes that have age-dependent differential expression (DE) or differential variability (DV) patterns. The majority of the studies identify "interesting" genes using a linear regression approach, which is known to perform poorly in the presence of outliers or if the underlying age-dependent pattern is non-linear. Clearly a more robust and flexible approach is needed to identify genes with various age-dependent gene expression patterns.

Results

Here we present a novel model selection approach to discover genes with linear or non-linear age-dependent gene expression patterns from microarray data. To identify DE genes, our method fits three quantile regression models (constant, linear and piecewise linear models) to the expression profile of each gene, and selects the least complex model that best fits the available data. Similarly, DV genes are identified by fitting and comparing two quantile regression models (non-DV and the DV models) to the expression profile of each gene. We show that our approach is much more robust than the standard linear regression approach in discovering age-dependent patterns. We also applied our approach to analyze two human brain ageing datasets and found many biologically interesting gene expression patterns, including some very interesting DV patterns, that have been overlooked in the original studies. Furthermore, we propose that our model selection approach can be extended to discover DE and DV genes from microarray datasets with discrete class labels, by considering different quantile regression models.

Conclusion

In this paper, we present a novel application of quantile regression models to identify genes that have interesting linear or non-linear age-dependent expression patterns. One important contribution of this paper is to introduce a model selection approach to DE and DV gene identification, which is most commonly tackled by null hypothesis testing approaches. We show that our approach is robust in analyzing real and simulated datasets. We believe that our approach is applicable in many ageing or time-series data analysis tasks.

Background

Age-dependent gene expression patterns discovery in microarray datasets

Ageing is an important risk factor for many diseases, but the molecular basis of this complex process is still poorly understood.

The identification of genes with age-dependent DE patterns is the central microarray analysis task of many ageing studies. For instance, linear regression is the principal tool for identifying genes with strong (linear) age-dependent expression trends in two recent large meta-analyses of ageing microarray studies.

The estimated linear function

We have previously introduced the concept of differential variability analysis (DVA) and showed that changes in gene expression variability are biologically relevant in understanding human diseases

Despite the wealth of microarray time-series analysis procedures devised to date (such as **Results** section.

Introduction to quantile regression

The standard linear regression approach aims to estimate a conditional mean function of

The quantile regression technique was first developed by Koenker and colleagues in 1978

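Similar to the formulation of linear regression, the aim of quantile regression is to estimate the parameter vector, $\theta$, of a quantile function $f(x, \theta)$ given the observations $x_i$ and $y_i$, $i = 1, \ldots, n$, by solving

$$\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{n} \rho_{\tau}\left(y_i - f(x_i, \theta)\right) \qquad (2)$$

where $\rho_{\tau}(\cdot)$ is the check function for the quantile $\tau \in (0, 1)$.

The check function is defined as:

$$\rho_{\tau}(u) = \begin{cases} \tau u, & u \geq 0 \\ (\tau - 1)\,u, & u < 0 \end{cases}$$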

We can obtain various linear and non-linear quantile regression lines by using different parametric models for the quantile function. In this paper, we refer to such a parametric model as a _{c}(_{c}) = _{c }= {_{l}(_{l}) = _{l }= {

where _{0 }is the location of the change point and _{pl }= {_{1}, _{2}, _{0}}. We note that our piecewise linear model specifies a continuous piecewise linear function with one change-point at (_{0}, _{1}_{0}). These three models form the basis of our approach for identifying various age-dependent gene expression patterns.
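
To make these ingredients concrete, the following minimal Python sketch writes out the check function and three candidate quantile functions. The function names, and the explicit intercept in the linear and piecewise linear models, are illustrative assumptions and may differ from the exact parameterization used in this paper.

```python
def check_loss(u, tau):
    """Check (pinball) function: tau * u for u >= 0, (tau - 1) * u otherwise."""
    return tau * u if u >= 0 else (tau - 1.0) * u


def constant_model(x, theta):
    """Constant quantile function; theta = (level,)."""
    (level,) = theta
    return level


def linear_model(x, theta):
    """Linear quantile function; theta = (intercept, slope)."""
    intercept, slope = theta
    return intercept + slope * x


def piecewise_linear_model(x, theta):
    """Continuous piecewise linear quantile function with one change-point.

    theta = (intercept, slope_before, slope_after, change_point); the two
    segments meet at x = change_point, so the function has no jump there.
    """
    intercept, slope_before, slope_after, change_point = theta
    if x <= change_point:
        return intercept + slope_before * x
    return (intercept + slope_before * change_point
            + slope_after * (x - change_point))


def objective(model, theta, xs, ys, tau=0.5):
    """Sum of check losses of the residuals y_i - f(x_i, theta)."""
    return sum(check_loss(y - model(x, theta), tau) for x, y in zip(xs, ys))
```

With equal slopes before and after the change-point, `piecewise_linear_model` returns the same values as `linear_model`, which is the sense in which the simpler model is a special case of the more complex one.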

Results

Our approach

Here we describe our novel method to discover various age-dependent gene expression patterns based on a model selection strategy. An important observation is that the goodness-of-fit of a quantile regression model to a given data series can be assessed by the

In other words, RSAD is the optimal value of the objective function after solving the minimization of Equation 2. The smaller the RSAD, the better a model fits the data. It is also known that a model with more parameters tends to give a lower RSAD than a model with fewer parameters. We say that a model M_{1} is more complex than a model M_{2} if M_{1} has more parameters than does M_{2}. Therefore, we can order our three quantile regression models from the least complex to the most complex as: constant, linear, and piecewise linear. We note that the three models are nested: a piecewise linear model whose two slopes are equal is identical to a linear model regardless of the choice of the change-point location, and a linear model can be reduced to a constant model by restricting its slope to zero.

To determine if a gene exhibits a DE pattern, we separately fit to the expression profile of that gene three quantile regression models at **Background** section. The model that best describes the available data is said to be the target model of the gene (see Figure
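
As an illustration of the fitting step, the sketch below computes the RSAD of the constant and linear models at a chosen quantile. It does not use the numerical optimizer described in this paper. Instead it relies on a known property of quantile regression, namely that an optimal fit with p parameters passes through at least p observations, and simply enumerates the candidates. The function names are hypothetical and the piecewise linear model is omitted for brevity.

```python
from itertools import combinations


def check_loss(u, tau):
    """Check function: tau * u for u >= 0, (tau - 1) * u otherwise."""
    return tau * u if u >= 0 else (tau - 1.0) * u


def rsad_constant(ys, tau=0.5):
    """Smallest total check loss over constant fits.

    An optimal constant always coincides with one of the observations,
    so it is enough to try each observed value.
    """
    return min(sum(check_loss(y - c, tau) for y in ys) for c in ys)


def rsad_linear(xs, ys, tau=0.5):
    """Smallest total check loss over straight-line fits.

    An optimal line interpolates two observations with distinct x values,
    so every such pair is tried (cubic time, acceptable for small n).
    Returns infinity if all x values are identical.
    """
    best = float("inf")
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # a vertical line is not a valid fit
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        intercept = ys[i] - slope * xs[i]
        total = sum(check_loss(y - (intercept + slope * x), tau)
                    for x, y in zip(xs, ys))
        best = min(best, total)
    return best


ages = [20, 30, 40, 50, 60, 70, 80]
expr = [1.0, 2.1, 2.9, 4.2, 5.0, 5.9, 30.0]  # the last value is an outlier
print(rsad_constant(expr), rsad_linear(ages, expr))
```

Because the median fit is driven by the ordering of residuals rather than their squared size, the single outlier in this toy profile has little influence on the fitted line.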

**Illustration of our model selection approach to identifying both age-dependent DE and DV patterns**. The four plots in this figure illustrate the core idea of our model selection approach to identifying genes with age-dependent change in expression (DE) or variability (DV). (A) A gene with an artificially simulated expression profile is fitted with three quantile regression models: the constant model, the linear model and the piecewise linear model. The estimated quantile regression lines are superimposed onto the expression profile. The simplest model that fits the data reasonably well is selected to be the target model. If the linear model or piecewise linear model is selected, this gene is said to be DE. (B) The distribution of

The constant 2 in (1 - 2_{1 }and _{2 }and change-point parameter _{0}. The DV model consists of two piecewise linear quantile regression functions that have independent slope parameters but the same change-point parameter _{0}. In both non-DV and DV models, we fit the upper quantile and lower quantile trend model at _{upper }= 0.85 and _{lower }= 0.15 respectively. We observe that choosing other reasonable values of _{upper }and _{lower}) does not make a substantial difference in practice. The parameters of both non-DV and DV models are estimated by solving a joint optimization problem which can be formulated as follows:

where _{upper }∪ _{lower}. Analogously, the RSAD of both models is the optimal value of the objective function after solving the minimization problem in Equation 6. Using the RSADs of the fitted non-DV and DV models, denoted

We use the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method implemented in R's
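
The joint objective for the non-DV and DV models can be sketched as follows. This is a sketch under stated assumptions, not the implementation used in this work: each trend is given an explicit intercept, all function and parameter names are hypothetical, and no optimizer is included. Any general-purpose minimizer could be applied to either objective function.

```python
TAU_UPPER, TAU_LOWER = 0.85, 0.15  # quantiles quoted in the text


def check_loss(u, tau):
    """Check function: tau * u for u >= 0, (tau - 1) * u otherwise."""
    return tau * u if u >= 0 else (tau - 1.0) * u


def piecewise(x, intercept, slope_before, slope_after, change_point):
    """Continuous piecewise linear trend with one change-point."""
    if x <= change_point:
        return intercept + slope_before * x
    return (intercept + slope_before * change_point
            + slope_after * (x - change_point))


def joint_objective(upper, lower, change_point, xs, ys):
    """Total check loss of an upper and a lower quantile trend.

    upper and lower are (intercept, slope_before, slope_after) tuples;
    both trends share the same change-point.
    """
    total = 0.0
    for x, y in zip(xs, ys):
        total += check_loss(y - piecewise(x, *upper, change_point), TAU_UPPER)
        total += check_loss(y - piecewise(x, *lower, change_point), TAU_LOWER)
    return total


def non_dv_objective(theta, xs, ys):
    """Non-DV model: the two trends share both slopes (parallel curves)."""
    (upper_intercept, lower_intercept,
     slope_before, slope_after, change_point) = theta
    return joint_objective((upper_intercept, slope_before, slope_after),
                           (lower_intercept, slope_before, slope_after),
                           change_point, xs, ys)


def dv_objective(theta, xs, ys):
    """DV model: each trend has its own pair of slopes."""
    (upper_intercept, upper_before, upper_after,
     lower_intercept, lower_before, lower_after, change_point) = theta
    return joint_objective((upper_intercept, upper_before, upper_after),
                           (lower_intercept, lower_before, lower_after),
                           change_point, xs, ys)
```

The DV objective has two more free parameters than the non-DV objective, so its minimum can never be larger. The comparison of the two minima is therefore only informative relative to a threshold.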

Simulation results

We performed an extensive simulation study to empirically establish the sensitivity and specificity of our quantile regression based methods compared with the linear regression based methods (see **Methods**).

The basic experimental design is to simulate datasets with different noise characteristics, and calculate the true positive (TP), true negative (TN), false positive (FP) and false negative (FN) rates in each simulated dataset at different **Methods** section. The trade-off between the true positive rates and the false positive rates of a method at different values of

To test the ability of our method to identify age-dependent DE genes, we simulated five 3000-gene datasets, each containing a different degree and type of noise. We investigated two types of noise: systematic noise (a consistent amount of noise that affects all the samples regardless of age), and non-systematic outliers (noise that is only present in some data points). Each simulated dataset consists of three equal proportions of non-DE genes, DE genes with linear age-dependency, and DE genes with non-linear age-dependency. As a baseline, we compared our method with a method based on second order linear regression.

To test the ability of our method to identify age-dependent DV genes, we simulated two 3000-gene datasets, each containing a different type of noise — one dataset without outliers and one dataset with outliers. Each simulated dataset comprised three equal proportions of non-DV genes, DV genes with linear age-dependency, and DV genes with non-linear age-dependency. We compared the performance of our method with a variant of the linear regression based approach of

Comparison of false discovery rate (FDR) of our quantile regression methods and linear regression methods using simulation data.

| FDR | DE2 | DE5 | DE5 + outliers | DE9 | DE9 + outliers | DV | DV + outliers |
|---|---|---|---|---|---|---|---|
| Quantile Regression | 0.021 | 0.040 | 0.049 | 0.082 | 0.151 | 0.017 | 0.023 |
| Linear Regression | 0.061 | 0.160 | 0.204 | 0.230 | 0.38 | 0.083 | 0.262 |
| FDR_{QR}/FDR_{LR} | 0.340 | 0.247 | 0.237 | 0.357 | 0.396 | 0.214 | 0.087 |

The FDRs obtained by applying our quantile regression method to seven simulated datasets are compared to the corresponding FDRs obtained by applying linear regression based methods to identify DE and DV genes at a predefined threshold (0.05 for linear regression). At this commonly accepted threshold, our quantile regression method yields FDRs that are consistently lower than those of the linear regression approach: the ratio FDR_{QR}/FDR_{LR} ranges from 0.087 to 0.396 across the seven datasets.

**Comparison of our quantile regression method to a linear regression based method using the ROC curves**. This figure shows the ROC curves generated by analyzing seven simulated datasets using our quantile regression method, and a linear regression method. Each simulated dataset has a different type and level of noise. The ROC curves show that our approach consistently out-performs the linear regression method studied in this work in terms of both sensitivity and specificity in all seven simulated datasets.

**The relationship of the model selection threshold, α, with the various performance measures**. This figure shows how four different performance measures vary with different values of model selection threshold,

Analysis of two human brain ageing datasets

We applied our method to analyze two real microarray datasets that study human brain ageing in non-diseased individuals. The Colantuoni dataset

The false discovery rates of discovering genes with DE and DV patterns at various **Methods** section. The results are shown in Figure

**Estimated FDR at various α values for applying our method to the two real datasets**. The means and standard deviations of the estimated FDR of applying our method to the two real datasets. These plots enable us to determine a reasonable

The Colantuoni dataset

Among the 31 genes surveyed in the Colantuoni dataset, we identified ten "interesting" genes, which include seven genes with strong evidence for the presence of a linear DE pattern (PRODH, DARPP32, GRM3, CHRNA7, MUTED, RGS4 and NTRK1), and three genes with moderate support for a non-linear DE pattern (NTRK3, ERBB3, and ERBB4). A plot showing the

**Some age-dependent DE genes discovered in the Colantuoni dataset**. The plot in the centre of the figure shows the distribution of the

**Some age-dependent DV genes discovered in the Colantuoni dataset**. Based on the

First, by fitting a piecewise linear regression function (with one change-point) to all genes Colantuoni

Second, MUTED

Third, ERBB2 is found to have the strongest support for exhibiting an age-dependent DV pattern (Figure

The Lu dataset

By applying our quantile regression approach to analyze the Lu dataset, we found 984 genes with a linear DE pattern, 12 genes with a non-linear DE pattern, and 120 genes with a DV pattern. Since most of the genes we found with strong evidence of DE were also found and analyzed in the original study, we mainly focused on analyzing the genes that show strong evidence of DV. The expression profiles of 12 selected genes with strong evidence of DV, i.e., having a low ^{2+}-mediated and Ca^{2+}-CaM-mediated signalling pathways. NGRN mRNA and protein expression has been shown to decrease with age ^{2+} ATPase pump 2 (SERCA2), which is highly expressed in various parts of the brain, including the hippocampus and cortex ^{2+} homeostasis. NSF (N-ethylmaleimide sensitive factor) is a key protein associated with a myriad of processes in the central nervous system including trafficking of synaptic vesicles and regulation of neuronal glutamatergic, GABAergic, adrenergic and muscarinic membrane receptors

**Some age-dependent DV genes discovered in the Lu dataset**. This figure shows 12 genes with strong evidence of DV (by having small

Since Lu

Discussion

Some remarks on our approach

Our approach is based on selecting the least complex model that is reasonably strongly supported by the data. There are two important ingredients that need to be defined in a model selection approach: (1) A small set of models which we believe are able to explain the data, and (2) A set of criteria that enables us to compare these models. In this work, we show how the task of identifying age-dependent gene expression patterns from ageing microarray datasets can be formulated as a model selection problem and solved accordingly.

This model selection approach to scientific data analysis is strongly advocated by Burnham and Anderson

Similar to a null hypothesis approach, our approach also requires a threshold, which we refer to as α. This threshold is applied to the relative reduction in RSAD, (RSAD_{S} - RSAD_{C})/RSAD_{S}, that must be achieved in order for a more complex model (M_{C}) to be selected over a less complex model (M_{S}). In general, we believe such a model selection threshold is intuitive and easily extended to analyzing more complex models since no null distribution has to be defined. Further, we note that the terms "significant" and "significance" were not used to describe a gene we identified as having strong support for a particular pattern, as such wording tends to be misleading. Moreover, we note that our approach is similar to the likelihood ratio test if we treat RSAD as inversely related to the likelihood of a fitted model. A further research direction is to investigate how a model selection strategy based on information-theoretic criteria, such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC), compares with our approach.
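
One plausible reading of this selection rule can be sketched in a few lines of Python. The function name and the stepwise comparison against the currently selected model are assumptions made for illustration; the exact rule used for each pair of models in this work may differ.

```python
def select_model(rsads, alpha):
    """Pick the least complex model that is adequately supported.

    rsads: list of (name, rsad) pairs ordered from least to most complex.
    A more complex model replaces the current choice only if it lowers the
    RSAD by at least a fraction alpha:
        (RSAD_S - RSAD_C) / RSAD_S >= alpha
    where S is the currently selected model and C the more complex one.
    """
    chosen_name, chosen_rsad = rsads[0]
    for name, rsad in rsads[1:]:
        if chosen_rsad > 0 and (chosen_rsad - rsad) / chosen_rsad >= alpha:
            chosen_name, chosen_rsad = name, rsad
    return chosen_name


fits = [("constant", 10.0), ("linear", 4.0), ("piecewise linear", 3.9)]
print(select_model(fits, alpha=0.2))
```

In this toy example the linear model reduces the RSAD of the constant model by 60%, whereas the piecewise linear model improves on the linear model by only 2.5%, so a threshold of 0.2 retains the linear model.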

Our application of quantile regression has three advantages over the standard linear regression method in analyzing microarray time-series data: robustness against noise, ease of visualizing DV patterns, and the ability to model various parts of a data distribution. All three are clearly exemplified in our analysis of the simulated and real datasets. In particular, we stress the importance of obtaining a regression trend at various quantiles, rather than a regression trend through the mean of a distribution. It has been argued that a biologically important limiting factor in ecological studies may not affect the average behaviour of the measured variable, but may strongly affect the behaviour at the extreme quantiles.

Another contribution of our paper is the application of a piecewise linear quantile regression model to identify genes with age-dependent DE and DV patterns. The application of piecewise linear regression for biological responses has been studied by

From a methodological point of view, our work still has a few limitations. First, although we have empirically validated the superior performance of our approach in analyzing noisy microarray data, we did not give any theoretical justification for why this is the case. Without further investigation it is very difficult to discern how much of this improvement is due to the model selection strategy, and how much is due to the robustness of the quantile regression method. This should therefore be further investigated. Second, we only used a generic non-linear optimization algorithm to solve the optimization problems (as in Equations 5 and 6) associated with estimating the parameters of a quantile regression model. Although the BFGS method works well in practice given good initial parameter estimates, there is no guarantee that the result is indeed the global optimum. This is an even larger problem for models that have many parameters, as they are more likely to have complicated (e.g., non-convex) solution surfaces. One research direction is to re-frame the optimization problem as a linear programming problem and solve it with the Simplex method.

Biological significance of differential gene expression variability in ageing

A number of recent studies showed that the changes in expression variability may be associated with mammalian ageing

Differential variability analysis is often ignored in many gene expression studies because the main aim of these studies is to identify genes that have "significant" changes in mean expression across the study population. However, it is clear that such responses are not sufficient to capture the information in the data. It is important to acknowledge that expression of a gene varies across the population, and this expression variability can change depending on factors such as age and disease. Our previous work showed that genes with decreased variability also tend to have decreased gene-to-gene coexpression in human diseases, which implies that loss of gene expression variability is associated with a loss in gene regulation

Our quantile regression approach is a very powerful tool to assess DV in time-series microarray data, thus opening up the opportunity for a large scale meta-analysis of many microarray datasets to assess the prevalence of DV in human and other organisms, in ageing and diseases. We believe a good understanding of population based gene expression variability is a crucial step towards developing personalized medicine strategies

Extension to analysis of microarray datasets with multiple discrete class labels

While preparing this manuscript, we realized that our quantile regression approach can be extended to identify genes with DE and DV patterns in microarray datasets with discrete class labels. The general concept of fitting and comparing a small number of competing models to a dataset (such as a non-DE model vs. a DE model) can be readily applied to identifying genes with interesting patterns, where these patterns are predefined using biological knowledge and are encapsulated in the model formulation. Here we propose a simple approach to identify genes that have class-dependent DE and DV patterns.

For identifying DE genes, we propose to fit and compare the goodness-of-fit of two models, where one model specifies that the median expression values are the same across multiple classes (the non-DE model), and the second model specifies that the median expression values can differ across multiple classes (the DE model). The non-DE model only requires fitting one parameter — the median value of the data, while the DE model requires fitting

**A proposed approach to identify DE and DV genes in a multi-class microarray dataset using quantile regression**. Genes that are DE across multiple classes of samples can be identified by checking whether the RSAD for fitting a DE model, which specifies one median value per class is much smaller than the RSAD for a simpler model (a non-DE model) which specifies only one median value for all

For identifying genes with DV, one can similarly fit and compare two competing models — the non-DV model and the DV model. The non-DV model specifies that the lower quantile for each class is estimated independently while sharing the same inter-quantile range (the absolute difference between the upper and lower quantiles) among the

Similar to linear regression, quantile regression techniques are most commonly used in finding 'interesting' trends in time-series data, such as various econometric, social and ecological data
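
The proposed multi-class DE comparison can be illustrated with a short sketch. It assumes that samples are supplied as a dictionary from class label to expression values, and it uses the plain sum of absolute deviations from the median, which is twice the check loss at the median and so leaves the relative reduction unchanged. The function names are hypothetical.

```python
from statistics import median


def rsad_non_de(values_by_class):
    """Sum of absolute deviations when one common median describes every class."""
    pooled = [v for values in values_by_class.values() for v in values]
    centre = median(pooled)
    return sum(abs(v - centre) for v in pooled)


def rsad_de(values_by_class):
    """Sum of absolute deviations when each class has its own median."""
    return sum(abs(v - median(values))
               for values in values_by_class.values()
               for v in values)


def relative_reduction(values_by_class):
    """Relative drop in RSAD when moving from the non-DE to the DE model."""
    simple = rsad_non_de(values_by_class)
    complex_ = rsad_de(values_by_class)
    return (simple - complex_) / simple if simple > 0 else 0.0


gene = {"control": [1, 2, 3], "disease": [11, 12, 13]}
print(relative_reduction(gene))
```

A gene would be called DE when this relative reduction exceeds the chosen model selection threshold, in the same spirit as the age-dependent analysis.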

Conclusion

The main objective of this paper is to present and evaluate a novel approach to discovering genes with various age-dependent expression patterns. Through an extensive simulation study, we show that our quantile regression approach is superior to linear regression based methods in terms of sensitivity and specificity of identifying linear and non-linear DE and DV patterns. We applied our method to two human brain ageing microarray datasets and show that biologically interesting patterns can be discovered.

Further, we propose that our model selection approach to pattern identification can be extended to handle DE and DV discovery tasks in microarray datasets with multiple discrete class labels. Therefore we believe that our approach is an important tool in our quest to understand the nature of gene expression regulation.

Methods

Simulation of artificial microarray data

We simulated seven datasets with different noise properties to evaluate the performance of our quantile regression method compared to other linear regression based methods (described in the next subsection) in terms of discovering DE and DV patterns. All simulated datasets contained 3000 genes and 80 samples, and the samples were grouped into 8 groups of 10 samples. Among the 3000 genes, 1000 of them had no age-dependent expression patterns (C), 1000 of them had a linear age-dependent trend (L), and the remaining 1000 genes had a non-linear age-dependent trend (NL). All data values in each artificial dataset were drawn from a normal distribution with mean

Parameters for simulating the seven artificial datasets.

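Columns 1-10 to 71-80 give the sample indices (in ascending order of age).

| Dataset | Pattern | 1-10 | 11-20 | 21-30 | 31-40 | 41-50 | 51-60 | 61-70 | 71-80 |
|---|---|---|---|---|---|---|---|---|---|
| DE, 2 | C | 4, 2 | 4, 2 | 4, 2 | 4, 2 | 4, 2 | 4, 2 | 4, 2 | 4, 2 |
| DE, 2 | L | 1, 2 | 2, 2 | 3, 2 | 4, 2 | 5, 2 | 6, 2 | 7, 2 | 8, 2 |
| DE, 2 | NL | 1, 2 | 2, 2 | 3, 2 | 4, 2 | 5, 2 | 4, 2 | 3, 2 | 1, 2 |
| DE, 5 | C | 4, 5 | 4, 5 | 4, 5 | 4, 5 | 4, 5 | 4, 5 | 4, 5 | 4, 5 |
| DE, 5 | L | 1, 5 | 2, 5 | 3, 5 | 4, 5 | 5, 5 | 6, 5 | 7, 5 | 8, 5 |
| DE, 5 | NL | 1, 5 | 2, 5 | 3, 5 | 4, 5 | 5, 5 | 4, 5 | 3, 5 | 1, 5 |
| DE, 5+outliers | C | 4, 5 | 4, 5 | 4, 5 | 4, 5 | 4, 5 | 4, 5 | 4, 5 | 4, 5 |
| DE, 5+outliers | L | 1, 5 | 2, 5 | 3, 5 | 4, 5 | 5, 5 | 6, 5 | 7, 5 | 8, 5 |
| DE, 5+outliers | NL | 1, 5 | 2, 5 | 3, 5 | 4, 5 | 5, 5 | 4, 5 | 3, 5 | 1, 5 |
| DE, 9 | C | 4, 9 | 4, 9 | 4, 9 | 4, 9 | 4, 9 | 4, 9 | 4, 9 | 4, 9 |
| DE, 9 | L | 1, 9 | 2, 9 | 3, 9 | 4, 9 | 5, 9 | 6, 9 | 7, 9 | 8, 9 |
| DE, 9 | NL | 1, 9 | 2, 9 | 3, 9 | 4, 9 | 5, 9 | 4, 9 | 3, 9 | 1, 9 |
| DE, 9+outliers | C | 4, 9 | 4, 9 | 4, 9 | 4, 9 | 4, 9 | 4, 9 | 4, 9 | 4, 9 |
| DE, 9+outliers | L | 1, 9 | 2, 9 | 3, 9 | 4, 9 | 5, 9 | 6, 9 | 7, 9 | 8, 9 |
| DE, 9+outliers | NL | 1, 9 | 2, 9 | 3, 9 | 4, 9 | 5, 9 | 4, 9 | 3, 9 | 1, 9 |
| DV | C | 4, 3 | 4, 3 | 4, 3 | 4, 3 | 4, 3 | 4, 3 | 4, 3 | 4, 3 |
| DV | L | 4, 1 | 4, 2 | 4, 3 | 4, 4 | 4, 5 | 4, 6 | 4, 7 | 4, 8 |
| DV | NL | 4, 1 | 4, 2 | 4, 3 | 4, 4 | 4, 5 | 4, 3 | 4, 2 | 4, 1 |
| DV+outliers | C | 4, 3 | 4, 3 | 4, 3 | 4, 3 | 4, 3 | 4, 3 | 4, 3 | 4, 3 |
| DV+outliers | L | 4, 1 | 4, 2 | 4, 3 | 4, 4 | 4, 5 | 4, 6 | 4, 7 | 4, 8 |
| DV+outliers | NL | 4, 1 | 4, 2 | 4, 3 | 4, 4 | 4, 5 | 4, 3 | 4, 2 | 4, 1 |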

Gene expression values can be simulated by drawing values from a normal distribution with mean
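
A minimal sketch of how one expression profile could be drawn from such a parameter table is given below. It assumes that the second number in each cell is a standard deviation, which the surviving text does not confirm, and it does not attempt to reproduce the outlier mechanism. The function name is hypothetical.

```python
import random


def simulate_gene(group_params, samples_per_group=10, rng=random):
    """Draw one expression profile, one block of samples per age group.

    group_params: list of (mean, spread) pairs, one per age group, where
    spread is passed to the normal generator as a standard deviation
    (an assumption; the table may list a different spread measure).
    """
    profile = []
    for mean, spread in group_params:
        profile.extend(rng.gauss(mean, spread)
                       for _ in range(samples_per_group))
    return profile


# Linear (L) pattern of the "DE, 2" dataset: means 1 to 8, second parameter 2.
linear_de = [(m, 2) for m in range(1, 9)]
profile = simulate_gene(linear_de, rng=random.Random(0))
print(len(profile))
```

Passing a seeded `random.Random` instance keeps a simulated dataset reproducible across runs.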

Comparison with linear regression based methods

To provide a baseline for comparison, we also analyzed our simulated datasets with two linear regression based methods for DE and DV pattern discovery. For identifying genes with either linear or non-linear DE patterns, we used a second order linear regression model of the form $y = \beta_0 + \beta_1 x + \beta_2 x^2$. We then independently tested whether $\beta_1 = 0$ and $\beta_2 = 0$, obtaining $p$-values $p_1$ and $p_2$ respectively for parameters $\beta_1$ and $\beta_2$. Given a predefined significance level $\alpha_l$, a gene is classified to have one of the three patterns using the following set of rules:

A second order linear regression model is more commonly known as the quadratic regression model, but we deliberately avoid this terminology since it may be easily confused with the term quantile regression, also abbreviated as QR. We note that the above linear regression based method is a variant of the quadratic regression method of Liu

To identify genes with linear or non-linear DV patterns using a linear regression based method, we use the following two step scheme: (1) Fit the data with a third-order linear regression model (commonly known as a cubic regression model) of the form $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$, and obtain the residuals as $r_i = y_i - \hat{y}(x_i)$, then (2) Fit a second order linear model to the absolute residuals, i.e., $|r| = \gamma_0 + \gamma_1 x + \gamma_2 x^2$, which enables us to calculate a $p$-value for each of $\gamma_1$ and $\gamma_2$ ($p_1$ and $p_2$ respectively). A gene is then classified using $\alpha_l$:

The linear regression model fitting and

Construction and interpretation of the ROC curves

A Receiver Operating Characteristic (ROC) curve is a two dimensional plot of two important performance measures of a pattern discovery method: the true positive rate (TPR; or sensitivity) and the false positive rate (FPR; or 1-specificity). A desirable pattern discovery method should achieve a high TPR while maintaining a low FPR. If TPR = FPR for all threshold values, the pattern discovery method performs no better than a random binary classifier that assigns an object to one of the two classes by chance. Given the true positive (TP), true negative (TN), false positive (FP) and false negative (FN) rates at a given
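
The construction of ROC points from the four counts can be sketched as follows, using the standard definitions TPR = TP/(TP + FN) and FPR = FP/(FP + TN). The scoring convention (a gene is called when its score reaches the threshold) and the function names are assumptions made for illustration.

```python
def rates(tp, fp, tn, fn):
    """Return (TPR, FPR) from confusion-matrix counts."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def roc_points(scores, is_positive, thresholds):
    """One (FPR, TPR) point per threshold.

    scores: one number per gene, larger meaning stronger support.
    is_positive: one boolean per gene, True if the gene truly has a pattern.
    """
    points = []
    for t in thresholds:
        calls = [s >= t for s in scores]
        tp = sum(c and p for c, p in zip(calls, is_positive))
        fp = sum(c and not p for c, p in zip(calls, is_positive))
        fn = sum((not c) and p for c, p in zip(calls, is_positive))
        tn = sum((not c) and (not p) for c, p in zip(calls, is_positive))
        tpr, fpr = rates(tp, fp, tn, fn)
        points.append((fpr, tpr))
    return points
```

Sweeping the threshold from high to low traces the curve from the bottom-left corner (nothing called) towards the top-right corner (everything called).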

Analysis of real datasets

The Colantuoni dataset

We estimated the false discovery rate (FDR) of our procedure in discovering age-dependent patterns in the two real datasets using a randomization procedure. Using the concepts and notation developed by Storey and Tibshirani

The last approximation can be shown to be valid if the number of genes tested is large
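
A generic randomization estimate of the FDR, in the spirit of the approximation above, can be sketched as the average number of discoveries on age-shuffled data divided by the number of discoveries on the real data. This is an illustrative sketch only: the function `count_discoveries` is a hypothetical stand-in for the full pattern discovery pipeline, and the number of permutations is an arbitrary choice.

```python
import random


def estimate_fdr(count_discoveries, expression, ages,
                 n_permutations=100, seed=0):
    """Permutation estimate of the FDR.

    count_discoveries(expression, ages) must return how many genes are
    called at the chosen threshold. The estimate is the mean count on
    shuffled ages divided by the count on the real ages, capped at 1.
    """
    rng = random.Random(seed)
    observed = count_discoveries(expression, ages)
    if observed == 0:
        return 0.0
    null_counts = []
    for _ in range(n_permutations):
        shuffled = list(ages)
        rng.shuffle(shuffled)  # break any real age dependence
        null_counts.append(count_discoveries(expression, shuffled))
    return min(1.0, sum(null_counts) / (n_permutations * observed))
```

Repeating the estimate over a range of threshold values gives a curve from which a threshold with an acceptable FDR can be read off.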

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JWKH conceived, designed and performed the research, analyzed data and wrote the manuscript. MS and JWKH conceived the idea of differential variability in human ageing. MS and CGdR contributed to the biological interpretation of the results. MAC supervised the study, helped design the experiments and critically revised the manuscript. All authors read and approved the final version of the manuscript.

Note

Other papers from the meeting have been published as part of

Acknowledgements

This work is supported by an Australian Postgraduate Award and a NICTA Research Project Award. We thank Novi Quadrianto (NICTA) for introducing the basic idea of quantile regression to the first author.

This article has been published as part of