A computational procedure for functional characterization of potential marker genes from molecular data: Alzheimer's as a case study

Squillario, Margherita; Barla, Annalisa

doi:10.1186/1755-8794-4-55

Research article
Open access
Published: 05 July 2011


BMC Medical Genomics volume 4, Article number: 55 (2011)


Abstract

Background

A molecular characterization of Alzheimer's Disease (AD) is the key to the identification of altered gene sets that lead to AD progression. We rely on the assumption that candidate marker genes for a given disease belong to specific pathogenic pathways, and we aim at unveiling those pathways stable across tissues, treatments and measurement systems. In this context, we analyzed three heterogeneous datasets, two microarray gene expression sets and one protein abundance set, applying a recently proposed feature selection method based on regularization.

Results

For each dataset we identified a signature that was subsequently evaluated from both the computational and the functional characterization viewpoints, by estimating the classification error and by retrieving the most relevant biological knowledge from different repositories. Each signature includes genes already known to be related to AD and genes that are likely to be involved in the pathogenesis or in the disease progression. The integrated analysis revealed a meaningful overlap at the functional level.

Conclusions

The identification of three gene signatures showing a relevant overlap of pathways and ontologies increases the likelihood of finding potential marker genes for AD.


Background

Alzheimer's Disease (AD) is a common progressive brain disease, generally diagnosed in individuals over 65 years of age, and it is mostly characterized by cognitive deterioration that causes dementia [1]. It usually leads to death within 3 to 9 years after diagnosis.

From the molecular point of view, AD is characterized by many different lesions: the most evident are deposits of beta amyloid and tangles of hyperphosphorylated tau proteins, together with a marked loss of neurons in the neocortex and hippocampus [2, 3]. In the early stages, the most common symptom is memory loss, followed by mood swings, difficulty in speech, long-term memory loss and confusion. Several characteristics of AD are common to normal aging or to other neurological diseases, making its diagnosis very difficult. Usually, psychological tests are used to indicate the presence of the disease, but only a post-mortem exam can confirm it. The diagnostic process is time-consuming and, by the time AD is detected, the disease has been progressing for many years, causing increasing brain damage along with the deterioration of cognitive capacities. For these reasons, AD patients need constant care from their relatives or from specialized structures. Clearly, this has a relevant economic impact on national health systems.

Although many scientific papers are published every year, AD is still a very open research topic and its etiology is still unknown. In this context, the mainstream focus is to understand the underlying molecular mechanisms with the ultimate goal of identifying potential biomarkers to be used in the clinical practice.

The basis of our work is the assumption that candidate marker genes for a given disease belong to specific pathogenic pathways. Our aim was to uncover molecular pathways that are stable across tissues, treatments and measurement systems. The identification of these pathways or functional groups across different datasets is fundamental to unveil those that really feed the progression of the disease and that might harbor relevant genes.

We considered AD as a case study and obtained results from the supervised analysis of three publicly available datasets: one that collects the abundance of 120 signaling proteins [4] and two, retrieved from the Gene Expression Omnibus (GEO) database, that store gene expression data from DNA microarray experiments: GSE1297 [5] and GSE5281 [6, 7]. The rationale behind [4] is very convincing and motivated our choice: since the brain controls many body functions through the release of signaling proteins into the bloodstream, a brain disease like AD could induce unique changes of these proteins in the blood. We chose GSE1297 because it is homogeneous with the protein dataset for the Mini-Mental State Exam (MMSE) parameter (t-test, p-value < 0.01); the MMSE is a 30-point questionnaire that is commonly used to screen for cognitive impairment. Unfortunately, for GSE5281 the MMSE parameter was not available, but we used this dataset anyway because its platform, i.e. Affymetrix HG-U133 Plus2.0, provides a more accurate coverage of the human genome and completely includes the probesets measured with Affymetrix HG-U133A (GSE1297).

Supervised analysis of high-throughput data allows for the identification of lists of genes with good prediction ability. In the remainder of the paper we refer to such lists as signatures. Gene signature analysis is fundamental to discover the most relevant functional classes or biological pathways involved in the progression of disease.

In this work, we adopted a supervised analysis schema: ℓ₁ℓ₂ regularization with a double optimization framework, set in a nested cross-validation structure (ℓ₁ℓ₂FS). This method is inspired by [8] and was recently shown to be robust and very effective for high-throughput data analysis [9]. The statistical accuracy of the system was measured by its prediction error, which quantifies the ability to predict the outcome on future data (see Materials and Methods) [10].
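The nested cross-validation structure mentioned above can be illustrated with a minimal sketch. It is not the ℓ₁ℓ₂FS implementation: the one-dimensional threshold classifier, the `offset` parameter and the toy samples are placeholders assumed here only to show how an inner loop selects a parameter on training data while an outer loop estimates the prediction error on held-out data.

```python
from statistics import mean

def k_folds(n_samples, k):
    """Split indices 0..n_samples-1 into k interleaved folds."""
    return [list(range(i, n_samples, k)) for i in range(k)]

def fit_threshold(samples, offset):
    """Placeholder model: midpoint of the two class means, shifted by `offset`.
    Assumes both classes are present in `samples`."""
    pos = [value for value, label in samples if label == 1]
    neg = [value for value, label in samples if label == 0]
    return (mean(pos) + mean(neg)) / 2 + offset

def error_rate(threshold, samples):
    """Fraction of samples misclassified by the rule 'value > threshold'."""
    wrong = sum((value > threshold) != (label == 1) for value, label in samples)
    return wrong / len(samples)

def cv_error(samples, offset, k):
    """Inner loop: average validation error of one candidate parameter."""
    errors = []
    for fold in k_folds(len(samples), k):
        held_out = set(fold)
        train = [s for i, s in enumerate(samples) if i not in held_out]
        valid = [samples[i] for i in fold]
        errors.append(error_rate(fit_threshold(train, offset), valid))
    return mean(errors)

def nested_cv_error(samples, offsets, outer_k=3, inner_k=2):
    """Outer loop: the parameter is chosen on training data only,
    then the refitted model is scored on the untouched outer fold."""
    outer_errors = []
    for fold in k_folds(len(samples), outer_k):
        held_out = set(fold)
        train = [s for i, s in enumerate(samples) if i not in held_out]
        test = [samples[i] for i in fold]
        best = min(offsets, key=lambda o: cv_error(train, o, inner_k))
        outer_errors.append(error_rate(fit_threshold(train, best), test))
    return mean(outer_errors)

# Invented, perfectly separable toy data: (measurement, label).
TOY_SAMPLES = [
    (0.10, 0), (0.90, 1), (0.20, 0), (0.80, 1), (0.30, 0), (0.70, 1),
    (0.15, 0), (0.85, 1), (0.25, 0), (0.75, 1), (0.05, 0), (0.95, 1),
]
TOY_ERROR = nested_cv_error(TOY_SAMPLES, offsets=[-0.1, 0.0, 0.1])
```

Keeping parameter selection inside the outer training split is what prevents the reported error from being optimistically biased by the tuning step.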

By separately applying ℓ₁ℓ₂FS to each dataset, we obtained three AD signatures, all showing high prediction performance. The small overlap between the two microarray signatures confirmed the need to consider more data from the same measuring technique, as well as different kinds of data, in order to incorporate all the genes that are significantly modulated by the disease.

The analysis was completed by a functional characterization of each signature in the Medical Literature Analysis and Retrieval System Online (Medline) [11], Gene Ontology (GO) [12] and the Kyoto Encyclopedia of Genes and Genomes (KEGG) [13]. This final step identified a functional overlap of ontologies and pathways. Even if the majority of the discriminant genes were different, they were frequently involved in the same KEGG pathways and/or shared similar GO ontologies. Moreover, the presence in each signature of some genes already known to be involved in the disease confirmed the reliability of the method in selecting relevant genes and also increased the likelihood that the remaining selected genes could be involved in the development of AD.

Results

The first purpose of this work was to define significant signatures, that is, gene or protein lists able to distinguish, with a certain degree of reliability, diseased from control subjects. The second purpose was to test the biological soundness of the genes selected by the adopted statistical method. The third and main goal was to characterize AD at a functional level, identifying those pathways and functions that are stable across heterogeneous data sources.

ℓ₁ℓ₂FS is a rather novel method for feature selection and classification, but it has recently been applied with success to the analysis of data from high-throughput techniques [14–17]. We are convinced that the ability to detect correlated features is the most relevant property of ℓ₁ℓ₂FS, since correlation is a peculiar and important property characterizing genes. Note that, in this context, the correlation parameter μ in ℓ₁ℓ₂FS is not a threshold value but a regularization parameter within the naïve elastic net functional (see Materials and Methods section). It allows for detecting correlated genes that contribute to the final outcome in a multivariate fashion.
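For orientation, the naïve elastic net functional is commonly written in the form below. Only μ, the correlation parameter, is named in the text above; the data matrix X, the label vector Y, the weight vector β, the sparsity parameter τ and the 1/n scaling are notation assumed here, and the Materials and Methods section remains the authoritative statement of the functional actually used.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\;
  \frac{1}{n}\,\lVert X\beta - Y \rVert_2^2
  \;+\; \tau\,\lVert \beta \rVert_1
  \;+\; \mu\,\lVert \beta \rVert_2^2
```

In this form the ℓ₁ term promotes sparsity, while the ℓ₂ term weighted by μ tends to keep groups of correlated variables together in the solution, which is consistent with the role of μ described above.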

Our analysis was based on one protein dataset [4] and two microarray datasets [5, 6]. The obtained results are presented in this order: the classification error estimated by ℓ₁ℓ₂FS, a bibliographic (Medline) characterization of the relevant variables (proteins and probesets) in the signature, the results of the WebGestalt enrichment analysis performed in KEGG and GO and the analysis of the significant gene groups identified by the k-means clustering technique. The Medline bibliographic content we considered relevant concerns: the potential role in AD, in other brain diseases, in pathways already known to be related with AD, or the specific expression in some brain regions. Additional data on the enrichment analysis is available in the Additional files section.

Protein data analysis

Results of ℓ₁ℓ₂FS

The analysis of the protein dataset consisted of two main phases. As shown in Figure 1, we first trained ℓ₁ℓ₂FS on the Training Set, learning a predictive statistical model and evaluating a cross-validation error. We then assessed the generalization ability of the results on independent datasets (Test Set AD and Test Set MCI, where MCI stands for Mild Cognitive Impairment). The algorithm distinguished AD and control samples with a 10-fold cross-validation error of 19%. The presented signature corresponds to the highest value of the correlation parameter μ and is composed of 21 genes, reported in Table 1. The frequency score associated with each gene indicates its stability (presence) across the lists produced by ℓ₁ℓ₂FS in the cross-validation procedure.
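One plausible reading of the frequency score is the share of cross-validation lists in which a variable appears. The sketch below follows that reading; expressing the score as a percentage is an assumption, and the four fold lists are invented for illustration and are not results from the paper.

```python
from collections import Counter

def frequency_scores(fold_selections):
    """Percentage of cross-validation lists that contain each variable."""
    # set() guards against a variable being counted twice within one list.
    counts = Counter(name for selected in fold_selections for name in set(selected))
    n_lists = len(fold_selections)
    return {name: 100.0 * count / n_lists for name, count in counts.items()}

# Hypothetical selections from four folds (illustrative only).
EXAMPLE_LISTS = [
    ["IL6", "EGF", "CSF1"],
    ["IL6", "EGF"],
    ["IL6", "CSF1"],
    ["IL6", "EGF", "PDGFB"],
]
EXAMPLE_SCORES = frequency_scores(EXAMPLE_LISTS)
```

Under this definition a variable selected in every fold scores 100, and ranking by the score orders the signature from most to least stable.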

Table 1 Table of protein signature


Following Ray et al. [4] and Ravetti and Moscato [18], we used the Test Set AD and the Test Set MCI to verify the predicting ability of our signature.

Ray et al. adopted a shrunken centroid algorithm and identified 18 predictors characterizing AD status. Similarly, Ravetti and Moscato considered the dataset and applied more than 20 different classifiers to achieve a highly predictive 5-protein signature.

After the feature selection step, the test phase for each test set consisted of extracting the sub-matrix corresponding to the 21 relevant variables identified in the training phase and applying the learned model to it.
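The test phase described above can be sketched as a column selection followed by the application of a fitted model. The linear decision rule, the function names and the toy matrix are assumptions made for illustration; the source does not specify the form of the learned model at this point.

```python
def select_columns(matrix, column_names, selected):
    """Extract the sub-matrix restricted to the selected variables."""
    indices = [column_names.index(name) for name in selected]
    return [[row[i] for i in indices] for row in matrix]

def predict(sub_matrix, weights, bias=0.0):
    """Assumed linear rule: label +1 if w.x + b > 0, otherwise -1."""
    return [
        1 if sum(w * x for w, x in zip(weights, row)) + bias > 0 else -1
        for row in sub_matrix
    ]

# Hypothetical test matrix: 2 samples x 4 variables, 2 of them selected.
COLUMN_NAMES = ["var_a", "var_b", "var_c", "var_d"]
TEST_MATRIX = [
    [1.0, 5.0, 0.2, 9.0],
    [0.1, 5.0, 1.5, 9.0],
]
SELECTED = ["var_a", "var_c"]
WEIGHTS = [1.0, -1.0]  # hypothetical weights learned in the training phase

SUB_MATRIX = select_columns(TEST_MATRIX, COLUMN_NAMES, SELECTED)
PREDICTIONS = predict(SUB_MATRIX, WEIGHTS)
```

The point of the sketch is that the test data never influence which variables are kept: selection and weights both come from the training phase.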

The Test Set AD is composed of samples affected by either AD or other dementia and of controls. In this case, our model scored a 7/92 error (see Figure 1), while Ray and co-authors obtained a 10/92 error and Ravetti and Moscato an error of ~6/92, averaged over all the methods they applied.

The Test Set MCI is composed of 47 samples corresponding to subjects with MCI, as illustrated in Figure 1. In this case, we used the statistical model as a predictor of outcome, considering the conversion to AD as benchmark status (follow-up: 2-6 years from MCI diagnosis). The statistical model scored a 10/47 error, while Ray and co-authors obtained a 9/47 overall error and Ravetti and Moscato achieved an average error of ~16/47.
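To make the comparison on the two test sets easier to read, the reported error counts can be converted to percentages. The snippet below uses only the exact counts quoted in the text; the approximate averages reported for Ravetti and Moscato are left out because they are not exact counts.

```python
# Misclassified / total, as quoted in the text.
REPORTED_ERRORS = {
    "Test Set AD": {"this work": (7, 92), "Ray et al.": (10, 92)},
    "Test Set MCI": {"this work": (10, 47), "Ray et al.": (9, 47)},
}

def error_percent(misclassified, total):
    """Error rate as a percentage, rounded to one decimal place."""
    return round(100.0 * misclassified / total, 1)

ERROR_TABLE = {
    test_set: {method: error_percent(*counts) for method, counts in methods.items()}
    for test_set, methods in REPORTED_ERRORS.items()
}
```

This gives roughly 7.6% against 10.9% on the Test Set AD and 21.3% against 19.1% on the Test Set MCI.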

Literature characterization

Table 1 reports the 21 relevant genes identified by ℓ₁ℓ₂FS, ranked according to their stability in terms of the frequency score. Thirteen genes are meaningfully associated with AD, with other brain diseases or with brain-related processes. The signature completely includes the one of Ravetti and Moscato [18] and almost completely the one presented in [4]. Some genes belong uniquely to our signature: ADIPOQ, MST1, TNFRSF10C, ANGPT1, AGRP and IL6; with the exception of IL6, these proteins have never been associated with AD. ADIPOQ encodes the adiponectin protein, which circulates in the plasma and is involved in metabolic and hormonal processes. This protein is unable to cross the blood-brain barrier but it is able to modify cytokine expression in the brain endothelial cells [19]; cytokines are known to be involved in AD. ADIPOQ also characterizes the pathogenesis of insulin resistance [20], which is a common trait of AD patients.

AGRP encodes a protein homologous to agouti, a murine protein that regulates the hypothalamic control of feeding behavior via melanocortin receptor and/or intracellular calcium regulation. It therefore influences weight homeostasis. Kim et al. [21] note that AGRP stimulates insulin secretion through calcium release in pancreatic beta cells. Imbalances in insulin and calcium are well-known risk factors for AD.

ANGPT1 and ANGPT2 contribute to the glucose metabolism by interacting with VEGF [22] and they are both indicated to have a prognostic value in adult forms of malignant brain tumors [23].

TNFRSF10C encodes a member of the tumor necrosis factor receptor superfamily that protects cells from TRAIL-induced apoptosis. It is regulated by p53 and it is inducible by DNA damage. TRAIL is not constitutively expressed in the human brain, but its apoptosis-mediating and apoptosis-blocking receptors are found on neurons, astrocytes and oligodendrocytes [24].

The same holds for TNFRSF10D, which is involved in the same KEGG pathways as TNFRSF10C (i.e. apoptosis, cytokine-cytokine receptor interaction, natural killer cell mediated cytotoxicity).

MST1 encodes the macrophage stimulating 1 factor. By interacting with FOXO1, MST1 induces its accumulation in the nucleus, leading to cell death upon withdrawal of growth factors and neuronal activity [25].

Functional analysis of the signature

Table 2 shows the results of gene set enrichment analysis of the signature using KEGG [26].

Table 2 Table of the functional analysis made in KEGG for protein signature
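Enrichment analyses of this kind are often based on an over-representation test. The sketch below uses a one-sided hypergeometric test, which is a common choice assumed here rather than a detail taken from the source; the function name, the reference set size and the pathway sizes in the example are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_reference, n_pathway, n_signature, n_overlap):
    """One-sided p-value P(X >= n_overlap) for pathway over-representation.

    n_reference: size of the reference gene set
    n_pathway:   reference genes annotated to the pathway
    n_signature: size of the signature drawn from the reference set
    n_overlap:   signature genes annotated to the pathway
    """
    upper = min(n_pathway, n_signature)
    total = comb(n_reference, n_signature)
    tail = sum(
        comb(n_pathway, k) * comb(n_reference - n_pathway, n_signature - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical example: 21-variable signature from 120 measured proteins,
# 5 of which fall in a pathway annotated to 10 of the 120.
EXAMPLE_P = hypergeom_enrichment_p(120, 10, 21, 5)
```

A small p-value indicates that the signature contains more members of the pathway than expected from a random draw of the same size; in practice such values are usually corrected for the number of pathways tested.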


The selected proteins are especially involved in the Signaling Molecules and Interaction and Immune System categories, but also in processes related to the cell (Cell Growth and Death, Cell Communication, Signal Transduction). These results underline the role of the selected genes within pathways already linked to AD: cytokine-cytokine receptor interaction [2, 3, 27], hematopoietic cell lineage [2, 3], apoptosis [2, 3, 27], pathways involved in the immune and inflammatory response [2] and pathways related to Metabolic Diseases [28]. We also identified pathways not previously associated with AD: adipocytokine, PPAR signaling pathway, glioma and pancreatic cancer. We extended the functional analysis of our signature by applying the gene set enrichment procedure on GO. The results are presented in Additional file 1.

The heatmap plot in Figure 2 visualizes the structured signature obtained by ℓ₁ℓ₂FS and postprocessed by k-means clustering. Such a structured representation confirmed that the genes belonging to the same cluster, having highly correlated abundance profiles, are indeed grouped in the same ontologies or biological pathways, or are known to interact. For instance, gene set enrichment in GO showed two gene pairs: GCSF/IL3 in the positive biological processes and IL8/TNFRSF10D in the negative biological processes. The heatmap plot in Figure 2 shows them in two different clusters. The enrichment in KEGG provided additional examples. For instance, EGF and PDGFB were clustered together and they are both involved in several pathways: cytokine-cytokine receptor interaction, MAPK signaling pathway, gap junction, focal adhesion, glioma and regulation of actin cytoskeleton. TNF-α and CSF1 show similar abundance profiles and they are included in the hematopoietic cell lineage and cytokine-cytokine signaling pathways. These proteins are also known to interact. Similar examples are: IL3/IL-1α, CSF3/IL-1α, ADIPOQ/AGRP, TNFRSF10C/TNFRSF10D.
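The k-means postprocessing of the signature can be illustrated with a minimal version of Lloyd's algorithm applied to abundance profiles. This is a sketch under stated assumptions: the deterministic initialization from the first k profiles, the Euclidean distance and the four invented profiles are illustrative choices, not the settings used to produce Figure 2.

```python
from math import dist
from statistics import mean

def kmeans(profiles, k, n_iter=20):
    """Plain Lloyd's k-means; centroids start at the first k profiles."""
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(n_iter):
        # Assignment step: each profile goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in profiles
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster empties
                centroids[c] = [mean(column) for column in zip(*members)]
    return labels

# Invented abundance profiles (one row per variable, one column per sample).
PROFILE_NAMES = ["var_a", "var_b", "var_c", "var_d"]
PROFILES = [
    [1.0, 1.1, 0.9],
    [5.0, 5.2, 4.9],
    [1.2, 1.0, 1.0],
    [4.8, 5.0, 5.1],
]
CLUSTER_LABELS = kmeans(PROFILES, k=2)
CLUSTERS = dict(zip(PROFILE_NAMES, CLUSTER_LABELS))
```

Variables that end up with the same label have similar profiles across samples, which is the property the heatmap in Figure 2 is meant to expose; a production analysis would typically use several random restarts instead of a fixed initialization.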