Univ. Bordeaux, ISPED, centre INSERM U-897-Epidémiologie-Biostatistique, Bordeaux, F-33000, FRANCE

INSERM, ISPED, centre INSERM U-897-Epidémiologie-Biostatistique, Bordeaux, F-33000, FRANCE

Queensland Facility for Advanced Bioinformatics and the Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD 4072, Australia

INSERM U955 Eq 16, UPEC Université, Créteil, FRANCE

Vaccine Research Institute ANRS, Paris, France

Abstract

Background

High throughput ’omics’ experiments are usually designed to compare changes observed between different conditions (or interventions) and to identify biomarkers capable of characterizing each condition. We consider the complex structure of repeated measurements from different assays where different conditions are applied on the same subjects.

Results

We propose a two-step analysis combining a multilevel approach and a multivariate approach to separate the effects of conditions within subjects from the biological variation between subjects. The approach is extended to two-factor designs and to the integration of two matched data sets. It allows internal variable selection to highlight genes able to discriminate the net condition effect within subjects. A simulation study was performed to assess the performance of the multilevel multivariate approach compared to a classical multivariate method. The multilevel multivariate approach outperformed the classical multivariate approach with respect to the classification error rate and the selection of relevant genes. The approach was applied to an HIV-vaccine trial evaluating the response with gene expression and cytokine secretion. The discriminant multilevel analysis selected a relevant subset of genes while the integrative multilevel analysis highlighted clusters of genes and cytokines that were highly correlated across the samples.

Conclusions

Our combined multilevel multivariate approach may help in finding signatures of vaccine effect and allows for a better understanding of immunological mechanisms activated by the intervention. The integrative analysis revealed clusters of genes that were associated with cytokine secretion. These clusters can be seen as gene signatures to predict future cytokine response. The approach is implemented in the ^{a}.

Background

Recent advances in high throughput ‘omics’ technologies enable quantitative measurements of expression or abundance of biological molecules in a whole biological system. Various popular omics platforms in systems biology include transcriptomics, proteomics, cytomics and metabolomics. These experiments are usually designed to compare changes observed between different conditions or groups and are often used to identify biomarkers capable of characterising pathological states or response to treatment.

The decreasing costs of these high-throughput platforms now enable repeated measures experiments on the same individuals or biological samples. Such experiments allow a substantial gain in information. For instance, longitudinal designs are more powerful as they reduce the noise due to inter-individual variability, as long as the correlation between repeated observations is taken into account. There exists an abundant literature on the analysis of repeated measurements of omics data

The mixed model approach has been used for the analysis of one single data type (e.g. gene expression). However, a growing number of high-throughput data are generated in standard clinical trials. For example, the evaluation of HIV vaccine in phase I/II trials incorporates measurements of counts of numerous types of cell, of the production of intra and extracellular cytokines and of gene expression

In recent years, several multivariate approaches have been proposed to combine two omics data, often in an unsupervised framework. In contrast to univariate repeated measures analysis, these linear multivariate approaches take into account the dependency between genes, are able to handle large and noisy data sets and do not face computational issues in the high dimensional case as matrix inversions are avoided. Most importantly in the context of this study, they enable the integration of data coming from different platforms and provide interpretable visualisation tools. These approaches aim at selecting correlated biological entities from two

The flexibility and versatility of PLS also enable a supervised framework through PLS-Discriminant Analysis (PLS-DA

In this paper, we consider a two-step approach to model the correlation between repeated measurements while taking advantage of the multivariate approaches. We first propose to extract the within-sample variation

Starting from the classical mixed model, we present the principle of a multilevel analysis to extract the within-sample deviation of the data and we extend the approach to a two-factor analysis. The within data set is then analysed with either sPLS-DA to select discriminative genes between the groups of subjects on a single data set, or with sPLS to select subsets of correlated variables from two data sets. A simulation study is performed which demonstrates the good performance of multilevel sPLS-DA compared to a classical sPLS-DA. The approach is then illustrated on an HIV vaccination study, where the effect of a lipopeptide based vaccine was explored by measuring before and after vaccination various components of the immune response, including gene expression and cytokine secretion. These repeated measurements were made in several

Methods

Notations

Let **X**(

Multilevel approach

We first present the mixed-effect model as a pedagogical tool and then introduce the concept of the multilevel approach based on the “split-up” variation. Despite the fact that some similarities exist between the mixed-effect model and the multilevel “split-up” variation approach, we emphasize that the latter is performed completely independently from the estimation of the mixed-effect model. Moreover, the mixed model relies on certain assumptions (such as Gaussian distribution of random effects) that the split-up variation approach does not require.

Mixed-effect model

Let

where for a given gene _{s}) may differ. However, for each subject

This model is also known as the one-way unbalanced random-effects ANOVA. A simple approach for identifying differentially expressed (DE) genes in this model is to test the stimulation effect for each gene and apply a multiple testing correction (FDR from
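The multiple testing step can be illustrated with a short sketch. It assumes that the FDR correction is the Benjamini-Hochberg adjustment, a common choice for per-gene tests; the function name `bh_adjust` and the example p-values are hypothetical and are not taken from the study.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order.

    Assumption: BH is used as the FDR procedure; one p-value per gene.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes.
pvals = [0.01, 0.04, 0.03, 0.5]
adj = bh_adjust(pvals)
# Genes declared DE at a 10% FDR threshold.
de_genes = [i for i, q in enumerate(adj) if q <= 0.10]
```

With thousands of genes and few subjects, this gene-by-gene testing ignores the dependency between genes, which is one motivation for the multivariate alternative discussed in the following sections.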

Split-up variation

As suggested by Westerhuis et al.

where

where _{j} the number of subjects undergoing stimulation

Let **X** be the (

The matrix **X**

Similarly to the Analysis of Variance, it is easy to show that the sum of squares can be separated into three parts:

where ||**X**||

The mixed model described earlier can provide an analysis for repeated measurements data in an unbalanced design. It can be viewed as an extension of a paired t-test to test the differences between paired observations. However, to tackle some of the previously mentioned limitations of the approach, we propose to combine a multilevel approach and a multivariate approach as an alternative. The multilevel step splits the different parts of the variation while taking into account the repeated measurements on each subject. Since the stimulation effect within each subject can be separated from the between-subject deviation (variation), it is possible to examine the differences in stimulation effect within the subjects much more easily than without separating the different sources of variation **X**
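The split-up variation can be sketched in a few lines. This is a minimal illustration, not the implementation used in the study: it assumes that the offset term is the mean over all samples and that the between-subject part is the subject mean minus that offset. The names `split_variation` and `sum_of_squares`, and the toy data, are hypothetical.

```python
def split_variation(X, subjects):
    """Split X (one row per sample) into offset, between and within parts.

    Assumption: the offset is the column mean over all samples, so the
    three parts are mutually orthogonal even in an unbalanced design.
    """
    n, p = len(X), len(X[0])
    overall = [sum(row[k] for row in X) / n for k in range(p)]
    # Group the sample indices of each subject.
    rows_by_subject = {}
    for i, s in enumerate(subjects):
        rows_by_subject.setdefault(s, []).append(i)
    subject_mean = {
        s: [sum(X[i][k] for i in idx) / len(idx) for k in range(p)]
        for s, idx in rows_by_subject.items()
    }
    X_m = [overall[:] for _ in range(n)]
    X_b = [[subject_mean[s][k] - overall[k] for k in range(p)]
           for s in subjects]
    X_w = [[X[i][k] - subject_mean[s][k] for k in range(p)]
           for i, s in enumerate(subjects)]
    return X_m, X_b, X_w


def sum_of_squares(M):
    """Squared Frobenius norm of a matrix stored as a list of rows."""
    return sum(v * v for row in M for v in row)


# Toy data: 5 samples, 2 genes; subject s1 has 2 samples and s2 has 3.
X = [[1.0, 2.0], [3.0, 6.0], [2.0, 0.0], [4.0, 4.0], [6.0, 2.0]]
subjects = ["s1", "s1", "s2", "s2", "s2"]
X_m, X_b, X_w = split_variation(X, subjects)
total = sum_of_squares(X)
parts = sum_of_squares(X_m) + sum_of_squares(X_b) + sum_of_squares(X_w)
```

In this sketch `total` and `parts` agree up to rounding, which mirrors the three-part sum of squares decomposition, and each subject's rows of `X_w` sum to zero, so the within matrix carries only the variation between conditions inside each subject.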

Extended method for two factors

We propose to extend this approach for data with two factors: the time (‘before’ and ‘after’ vaccination), in addition to the stimulation factor. Let

where for a given gene

According to the mixed model, we have:

where the within-subject deviation can be further decomposed as:

The matrix representation gives:

Similar to the one-factor decomposition, the multivariate approach will be applied on the within matrix
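The two-factor case can be illustrated for one subject and one gene. The exact formula is not reproduced in this sketch; it assumes the usual two-way split of the within-subject deviation into a stimulation part, a time part and an interaction part, and it assumes that the subject was measured under every stimulation at every time point. The function name `two_factor_within` and the values are hypothetical.

```python
def two_factor_within(cells):
    """Decompose one subject's values for one gene.

    `cells` maps (stimulation, time) to an expression value.
    Assumption: complete design for this subject and a standard
    two-way (main effects plus interaction) split of the deviation.
    """
    stims = sorted({j for j, _ in cells})
    times = sorted({k for _, k in cells})

    def mean(values):
        return sum(values) / len(values)

    subject_mean = mean(list(cells.values()))
    stim_mean = {j: mean([cells[(j, k)] for k in times]) for j in stims}
    time_mean = {k: mean([cells[(j, k)] for j in stims]) for k in times}
    parts = {}
    for (j, k), x in cells.items():
        stim_dev = stim_mean[j] - subject_mean
        time_dev = time_mean[k] - subject_mean
        interaction = x - stim_mean[j] - time_mean[k] + subject_mean
        parts[(j, k)] = (stim_dev, time_dev, interaction)
    return subject_mean, parts


# Hypothetical values for two stimulations at two time points.
cells = {
    ("LIPO5", "W0"): 1.0, ("LIPO5", "W14"): 3.0,
    ("NS", "W0"): 2.0, ("NS", "W14"): 2.0,
}
subject_mean, parts = two_factor_within(cells)
```

Adding the subject mean to the three parts of a cell recovers the observed value, so the sum of the three parts is the within-subject deviation on which the multivariate analysis operates.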

Discriminant analysis of one data set

Once the multilevel approach has been applied to split up the variation in the data, a variant of the multivariate approach PLS Discriminant Analysis (called sparse PLS-DA) is applied on the within matrix **X**

Sparse PLS-DA

Linear Discriminant Analysis (LDA) and Partial Least Squares Discriminant Analysis (PLS-DA, **X**(

where we denote by **ξ**=

The sparse version proposed by Lê Cao et al. _{1} constraint on **u** in order to ensure that some
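The effect of a lasso-type constraint on the loading vector **u** can be illustrated with soft-thresholding, which is the usual way such a penalty is solved. This is a sketch under that assumption and not the exact algorithm of the sparse method; the threshold `lam` and the loading values are hypothetical.

```python
def soft_threshold(u, lam):
    """Shrink each loading towards zero by lam.

    Loadings whose absolute value is at most lam become exactly zero,
    which is what removes the corresponding genes from the component.
    """
    shrunk = []
    for value in u:
        magnitude = abs(value) - lam
        if magnitude > 0:
            shrunk.append(magnitude if value > 0 else -magnitude)
        else:
            shrunk.append(0.0)
    return shrunk


# Hypothetical loading vector for five genes.
u = [0.9, -0.05, 0.3, -0.6, 0.02]
sparse_u = soft_threshold(u, 0.1)
# Indices of the genes kept on this component.
selected = [i for i, value in enumerate(sparse_u) if value != 0.0]
```

A larger threshold keeps fewer genes, so choosing the threshold is equivalent to choosing the number of genes selected on each component.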

Parameter tuning

Two parameters need to be tuned in sPLS-DA: the number of discriminant vectors

**β** is the regression coefficient matrix from sPLS-DA (see

**Supplementary results regarding the VAC18 study experiments from two assays.**


**Y**

Integrative analysis of two data sets

Similarly to the PLS-DA analysis, a more general PLS multivariate approach can be applied on the matching within matrices **X**

Sparse PLS

Partial Least Squares regression (PLS, **X**(

PLS relates both matrices by maximising the covariance between each pair of scores (**ξ**

This PLS form is often referred to as “PLS2 mode A” in the literature _{1} penalizations on both **u**
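The covariance criterion can be made concrete for the first pair of components. For column-centred matrices, the unit loading vectors that maximise the covariance between the two scores are the leading singular vectors of the cross-product matrix, which the sketch below finds by alternating power iterations. It omits the sparsity penalties and the deflation step, and the function names and toy matrices are hypothetical.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def normalise(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def first_pls_pair(X, Y, n_iter=200):
    """Unit loadings (u, v) maximising the covariance of X u and Y v.

    Assumption: the columns of X and Y are already centred.
    """
    Xt, Yt = transpose(X), transpose(Y)
    v = normalise([1.0] * len(Y[0]))
    for _ in range(n_iter):
        u = normalise(matvec(Xt, matvec(Y, v)))  # u proportional to X'Y v
        v = normalise(matvec(Yt, matvec(X, u)))  # v proportional to Y'X u
    return u, v


# Toy centred data: 4 samples, 3 variables in X and 2 in Y.
X = [[-1.5, 1.0, 0.5], [-0.5, -1.0, 0.5],
     [0.5, 1.0, -0.5], [1.5, -1.0, -0.5]]
Y = [[-3.0, 1.0], [-1.0, -1.0], [1.0, 1.0], [3.0, -1.0]]
u, v = first_pls_pair(X, Y)
score_x, score_y = matvec(X, u), matvec(Y, v)
# Unscaled covariance between the two score vectors.
criterion = sum(a * b for a, b in zip(score_x, score_y))
```

In the sparse version, a thresholding step would be applied to `u` and `v` inside the loop, so that only a subset of variables from each data set contributes to the scores.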

Parameter tuning

As an extension to the tuning criterion 2 from the previous section, and similar to what was proposed by Waaijenborg et al. **X**

Results and discussion

We first present the results of a short simulation study to show the importance of using a multilevel approach in comparison to a standard sparse partial least squares analysis on the original data. We then apply the proposed multilevel approach to an HIV-vaccination study.

Simulation study

Simulated model

A simulation study based on the following mixed effects model was performed:

with _{1} and _{2} in the same cluster, a pairwise correlation for _{π} is a (100×100) matrix with _{ε} is a (100×100) matrix with

To mimic the application, clusters of genes discriminating 4 conditions were generated (the 4 stimulations denoted LIPO5, GAG+, GAG- and NS), where the mean effect of each stimulation is specified by

• 2 gene clusters discriminate (LIPO5, GAG+) versus (GAG-, NS) with **μ**

• 2 gene clusters discriminate LIPO5 versus GAG+, while GAG- and NS have the same effect: **μ**

• 2 gene clusters discriminate GAG- versus NS, while LIPO5 and GAG+ have the same effect: **μ**

• the 4 remaining clusters represent noisy signal (no stimulation effect): **μ**

The intra cluster correlation was either set to

Numerical results

From the simulated data, the within matrix was computed and analysed with multilevel sPLS-DA. Figure


**Simulation study.** Sample representation from multilevel sPLS-DA. Samples were projected onto a subspace spanned by the first 3 sPLS-DA components, based on the 200 genes selected on each of the 3 components.

Firstly, in order to highlight the benefit of the multilevel approach in comparison to the multivariate approach without the split-up variation step, a prespecified number of genes was selected on each dimension in order to assess the ability of each approach to select the true relevant genes. As expected, 3 components (linear combinations of 200 genes) were sufficient to discriminate the effect of the 4 stimulations. Multilevel sPLS-DA (applied on the within matrix) selected 92% of the true simulated discriminative genes as compared to 75% of the true discriminative genes for classical sPLS-DA (applied on the original matrix), see Table


**Simulation study.** Hierarchical clustering (Euclidean distance and Ward method aggregation) of the genes selected with multilevel sPLS-DA. Samples are represented in columns and genes in rows.

Percentage of the true discriminative genes selected by classical sPLS-DA or multilevel sPLS-DA on each component or dimension (averaged over 100 simulation runs); 200 genes were selected on each component.

| Method | Component 1 | Component 2 | Component 3 | All |
|---|---|---|---|---|
| classical sPLS-DA | 58.0 | 75.0 | 87.2 | 78.2 |
| multilevel sPLS-DA | 82.8 | 95.6 | 93.1 | 92.0 |

Secondly, leave-one-out cross-validation was performed on each simulation run to evaluate the error rate of classification of classical sPLS-DA or multilevel sPLS-DA (Table

Classification error rate estimation using leave-one-out cross-validation for classical sPLS-DA and multilevel sPLS-DA, with respect to the number of genes selected on each component (averaged over 100 simulation runs).

| Number of genes | Original matrix: 1 component | Original matrix: 2 components | Original matrix: 3 components | Within matrix: 1 component | Within matrix: 2 components | Within matrix: 3 components |
|---|---|---|---|---|---|---|
| 25 | 0.535 | 0.369 | 0.312 | 0.500 | 0.271 | 0.024 |
| 50 | 0.530 | 0.364 | 0.311 | 0.500 | 0.265 | 0.016 |
| 75 | 0.527 | 0.360 | 0.306 | 0.500 | 0.261 | 0.013 |
| 100 | 0.524 | 0.354 | 0.300 | 0.500 | 0.258 | 0.011 |
| 125 | 0.522 | 0.351 | 0.296 | 0.500 | 0.257 | 0.009 |
| 150 | 0.520 | 0.343 | 0.285 | 0.500 | 0.250 | 0.008 |
| 175 | 0.518 | 0.335 | 0.281 | 0.500 | 0.243 | 0.009 |
| 200 | 0.516 | 0.327 | **0.268** | 0.500 | 0.234 | **0.009** |
| 225 | 0.514 | 0.323 | 0.269 | 0.500 | 0.227 | 0.009 |
| 250 | 0.512 | 0.316 | 0.267 | 0.500 | 0.220 | 0.008 |
| 275 | 0.510 | 0.314 | 0.266 | 0.500 | 0.207 | 0.007 |
| 300 | 0.510 | 0.306 | 0.262 | 0.500 | 0.196 | 0.007 |
| 325 | 0.509 | 0.299 | 0.260 | 0.500 | 0.182 | 0.007 |
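The leave-one-out estimate of the classification error rate reported above can be illustrated with a small sketch. A nearest-centroid rule is used here purely as a stand-in classifier, since the sPLS-DA prediction step is not reproduced; the function names and the toy samples are hypothetical.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Predict the class whose training centroid is closest to x.

    Stand-in classifier used only for illustration.
    """
    groups = {}
    for row, label in zip(train_X, train_y):
        groups.setdefault(label, []).append(row)
    best_label, best_dist = None, float("inf")
    for label, rows in sorted(groups.items()):
        centre = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centre))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_error_rate(X, y):
    """Leave each sample out in turn, refit, and count the errors."""
    errors = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_X, train_y, X[i]) != y[i]:
            errors += 1
    return errors / len(X)


# Toy data: six samples from two classes; the last one lies close to
# the samples of the other class.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
     [1.0, 1.1], [0.9, 1.0], [0.2, 0.1]]
y = ["A", "A", "A", "B", "B", "B"]
rate = loo_error_rate(X, y)
```

In this toy example only the last sample is misclassified, giving an error rate of 1/6. With repeated measurements, leaving out all samples of one subject at a time may be preferable; that variant is not shown here.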

Application to HIV vaccine evaluation

Description of the study

The data come from a trial evaluating a vaccine based on HIV-1 lipopeptides in HIV-negative volunteers

Preprocessing

Background correction, log_{2} transformation and quantile normalisation were applied on the gene expression data using the **R**

The statistical analysis was performed on the probe expression, but the results were biologically interpreted at the gene level.
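The log transformation and quantile normalisation steps mentioned above can be sketched as follows. This is a simplified illustration that ignores ties and background correction, and it is not the implementation of the software used in the study; the function name and the raw intensities are hypothetical.

```python
import math


def quantile_normalise(columns):
    """Give every array (column) the same empirical distribution.

    The reference distribution is the mean of the sorted columns;
    ties are not treated specially in this sketch.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(sc[r] for sc in sorted_cols) / len(columns)
                 for r in range(n)]
    normalised = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])
        new_col = [0.0] * n
        # The r-th smallest value of this array gets the r-th reference value.
        for r, i in enumerate(order):
            new_col[i] = reference[r]
        normalised.append(new_col)
    return normalised


# Hypothetical raw intensities: two arrays, three probes each.
raw = [[4.0, 16.0, 8.0], [64.0, 2.0, 32.0]]
logged = [[math.log2(v) for v in col] for col in raw]
norm = quantile_normalise(logged)
```

After this step both arrays contain the same set of values (1.5, 4.0 and 5.0 in this example), assigned according to the rank of each probe within its array.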

Discriminant analysis on the transcriptomics data

First we present results obtained using a mixed model and discuss some potential limitations of this method in the context of small sample size. Then we present the results obtained using multilevel sPLS-DA for one- and two-factor analyses. To shorten the paper, some results have been moved to Additional file

**R code used for the analysis of the VAC18 study.**


Mixed model

The one-level mixed model was applied to the W14 transcriptomics data. We used the

The univariate mixed model approach is commonly used to analyse data with repeated measurements in an unbalanced design. However, several reasons favor the use of a multilevel approach in this high dimensional setting. Apart from the already mentioned problem of numerous independent tests and the requirement to apply a multiple testing correction

Multilevel approach with one factor

A multilevel sPLS-DA analysis was performed on the W14 transcriptomics data, with

Given the expression of these 290 selected genes, Figures


**Multilevel sPLS-DA analysis on the transcriptomics data with one factor (W14).** **(a)** Unsupervised clustering analysis with Euclidean distance and Ward method of the 290 genes selected by sPLS-DA. Samples are represented in columns and genes in rows. **(b)** and **(c)** sPLS-DA sample representation for dimensions 1-2 **(b)** or 1-3 **(c)**.

Figure

Several clusters of genes whose expression seemed related to each type of stimulation could be identified. Cluster 1 included a subset of genes downregulated in GAG-, in cluster 2 the genes were overexpressed in GAG-, while cluster 3 included a subset of genes overexpressed in LIPO5 and GAG+, and cluster 4 was composed of a subset of genes mainly overexpressed in GAG+. The advantage of sPLS-DA is its ability to select genes related to a specific stimulation group on each component. For instance, clusters 1 and 2 included 126 out of the 137 probes selected on the third dimension which separated GAG- from the other stimulation groups (Figure

Note that the same analysis was also performed on W0 but identified far fewer discriminative genes (30 genes in total), indicating that there was a change in expression level after vaccination (see Additional file

Multilevel approach with two factors

A multilevel sPLS-DA analysis was performed on the within matrix

The hierarchical clustering of the 220 selected genes indicated a very satisfying separation of both time and stimulation factors (Figure


**Multilevel sPLS-DA analysis on the transcriptomics data with two factors stimulation and time.** **(a)** Unsupervised clustering analysis with Euclidean distance and Ward method of the 220 genes selected by sPLS-DA. sPLS-DA sample representations for dimensions 1-2 **(b)** or 1-3 **(c)**.

Integrative analysis

Multilevel sPLS enables the integration of data measured using different assays. This approach differs from multilevel sPLS-DA as the aim is to select subsets of genes and cytokines which are highly correlated (positively or negatively) across the samples. While the paired structure of the data is still taken into account in the analysis via the decomposition of the within matrices

Multilevel approach

Multilevel sPLS was applied on the within matrices of the gene and cytokine data sets after vaccination. Given the very small number of cytokines, all cytokines were selected in the model, and the tuning of the number of variables to select was only performed on the gene expression data set. A selection of 50, 1 and 60 genes was performed on the first, second and third sPLS dimensions respectively, corresponding to correlations of 0.86, 0.62 and 0.84. A drop in the correlations for the subsequent dimensions guided the choice of 3 components in the model.

Although unexpected and indicated by the tuned correlation value of 0.62, the selection of one single gene on the second dimension was not surprising given the sample representation that was obtained (see Additional file


**Integrative analysis of gene expression and cytokine secretion for W14.** Clustered Image Maps (CIM) obtained from multilevel sPLS. Selected genes are represented in columns and cytokines in rows.

Conclusion

In this paper, we have proposed a two-step analysis combining a multilevel approach and a multivariate approach to analyze repeated measures of gene expression. The multilevel approach first extracts the within-sample variation while the multivariate approach applied on the within matrix takes into account the dependency between the variables. The multilevel approach was presented for one-factor analyses and extended to two-factor analyses.

Two multilevel variants were proposed with either sPLS-DA or sPLS. The multilevel sPLS-DA approach selects genes separating the groups of subjects on a single data set. The simulation study comparing multilevel sPLS-DA and the sPLS-DA applied on the original data demonstrated the good performance of the model. The multilevel sPLS approach integrates two experiments made on different platforms but on the same subjects, and selects subsets of correlated variables from both sets.

The application of both types of approaches on the HIV-1 vaccine trial showed their ability to highlight the stimulation groups and to select biologically relevant genes related to immune response. Hence, our combined multilevel approach may help in finding signatures of vaccine effect and allows for a better understanding of immunological mechanisms activated by the intervention. Future work will include a thorough analysis of the gene/probe annotations to fully understand the mechanistic link between differential gene expression and cytokine secretion according to the various stimulations.

Endnote

^{a}

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

BL and KL developed the methodology and the R code, performed the simulation and the analysis of the data set, and wrote the manuscript. RT developed the methodology, interpreted the data set and wrote the manuscript. HH collected the data and wrote the Application section. All authors read and approved the final manuscript.

Acknowledgements

This work was supported, in part, by the Wound Management Innovation CRC (established and supported under the Australian Government’s Cooperative Research Centres Program) for K-A.LC. Financial support of the VAC18 trial was provided by the French National Agency for Research on AIDS and Hepatitis (ANRS); Sanofi Pasteur provided the HIV-LIPO-5 vaccine. The authors would like to thank the VAC18 study group for providing the data.