University of Toulouse, IRIT/INP-ENSEEIHT, 2 rue Camichel, 31071 Toulouse cedex 7, BP 7122, France

Department of Medicine and Institute for Genome Sciences and Policy, Duke University, Durham, North Carolina, USA

Center for Computational Biology and Bioinformatics and EECS Department, University of Michigan, 1301 Beal Avenue, Ann Arbor, MI, 48109-2122, USA

Abstract

Background

This paper introduces a new constrained model and the corresponding algorithm, called unsupervised Bayesian linear unmixing (uBLU), to identify biological signatures from high dimensional assays like gene expression microarrays. The basis for uBLU is a Bayesian model for the data samples which are represented as an additive mixture of random positive gene signatures, called factors, with random positive mixing coefficients, called factor scores, that specify the relative contribution of each signature to a given sample.

Results

Firstly, the proposed uBLU method is applied to several simulated datasets with known ground truth and compared with previous factor decomposition methods, such as principal component analysis (PCA), non-negative matrix factorization (NMF), Bayesian factor regression modeling (BFRM), and the gradient-based algorithm for general matrix factorization (GB-GMF). Secondly, we illustrate the application of uBLU on a real time-evolving gene expression dataset from a recent viral challenge study in which individuals were inoculated with influenza A/H3N2/Wisconsin. We show that the uBLU method significantly outperforms the other methods on the simulated and real data sets considered here.

Conclusions

The results obtained on synthetic and real data illustrate the accuracy of the proposed uBLU method when compared to other factor decomposition methods from the literature (PCA, NMF, BFRM, and GB-GMF). The uBLU method identifies an inflammatory component closely associated with clinical symptom scores collected during the study. Using a constrained model allows recovery of all the inflammatory genes in a single factor.

Background

Factor analysis methods such as principal component analysis (PCA) have been widely studied and can be used for discovering the patterns of differential expression in time course and/or multiple treatment biological experiments using gene or protein microarray samples. These methods aim at finding a decomposition of the observation matrix **Y** into a factor loading matrix **M** and a factor score matrix **A**,

**Y** = **MA** + **N**,    (1)

where **N** is a residual noise matrix. This decomposition expresses each of the N observed samples as a linear combination of R elementary factors:

**y**_{i} = **Ma**_{i} + **n**_{i},

where **y**_{i}, the i-th column of **Y**, is a vector of G gene expression levels, **m**_{r} is the r-th column of **M**, a_{r,i} denotes the (r,i)-th element of **A**, and **a**_{i} and **n**_{i} are the i-th columns of **A** and **N**, respectively. The number of factors R is typically much smaller than G and N.

The model (1) is identical to the standard factor analysis model: the columns of **M** are called factor loadings (or factors), and the columns of **A** are referred to as the factor scores.

This paper presents a new Bayesian factor analysis method, called unsupervised Bayesian linear unmixing (uBLU), that estimates the number of factors and incorporates non-negativity constraints on the factors and factor scores, as well as a sum-to-one constraint for the factor scores. The uBLU method presented here differs from the BLU method developed in previous work, which requires the number of factors to be fixed in advance.

A similar approach, developed for NMR spectral imaging, fits the product **MA** to the data **Y** using repeated cold restarts to escape local minima. In contrast, the proposed uBLU algorithm uses a judicious model to reduce sensitivity to local minima rather than using cold restarts. The novelty of the uBLU model is that it consists of: (1) a birth-death process to infer the number of factors; (2) a positivity constraint on the loading and score matrices **M**, **A** to restrict the space of solutions; (3) a sum-to-one constraint on the columns of **A** to further restrict the solution space. The uBLU model is justified for non-negative data problems like gene expression analysis and produces an estimate of the non-negative factors in addition to their proportional representation in each sample.

Bayesian linear unmixing, traditionally used for hyperspectral image analysis, is adapted here to the decomposition of gene expression data.

In this paper we provide comparative studies that establish quantitative performance advantages of the proposed constrained model and its corresponding uBLU algorithm with respect to PCA, NMF, BFRM and GB-GMF for time-varying gene expression analysis, using synthetic data with known ground truth. We also illustrate the application of uBLU to the analysis of a real gene expression dataset from a recent viral challenge study.

Methods

Mathematical constrained model

Let **y**_{i} represent a gene microarray vector of G gene expression levels measured for the i-th sample. The components of **y**_{i} have units of hybridization abundance levels with non-negative values. In the context of gene expression data, the starting point for Bayesian linear unmixing is the linear mixing model (LMM)

where **m**_{r} = [m_{1,r}, …, m_{G,r}]^{T} is the r-th factor signature, m_{g,r} ≥ 0 is the strength of the g-th gene in the r-th factor, a_{r,i} is the relative contribution of the r-th factor to the sample **y**_{i}, with a_{r,i} ∈ [0, 1], and **n**_{i} denotes the residual error of the LMM representation. For a matrix of N samples the model can be written compactly as **Y** = **MA** + **N**,

where **M**, **A** satisfy positivity and sum-to-one constraints defined by

m_{g,r} ≥ 0,  a_{r,i} ≥ 0,  Σ_{r=1}^{R} a_{r,i} = 1,    (5)

where m_{g,r} denotes the (g,r)-th element of **M**. The constraints (5) arise naturally when dealing with positive data for which one is seeking the relative contribution of positive factors that have the same numerical characteristics as the data, i.e., the signature **m**_{r} is itself interpretable as a vector of hybridization abundances.
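The constrained model (2)–(5) can be sketched numerically as follows; the matrix sizes and the Gamma/Dirichlet generators are illustrative choices, not those used in the paper:

```python
import numpy as np

# Sketch of the linear mixing model (LMM): each sample y_i is a convex
# combination of R non-negative factor signatures m_r plus Gaussian noise.
rng = np.random.default_rng(0)
G, R, N = 50, 3, 20                        # genes, factors, samples (toy sizes)

M = rng.gamma(2.0, 1.0, size=(G, R))       # non-negative factor signatures
A = rng.dirichlet(np.ones(R), size=N).T    # factor scores: columns on the simplex
noise = 0.01 * rng.standard_normal((G, N))
Y = M @ A + noise                          # observation matrix

assert (M >= 0).all()                      # positivity of the factors
assert np.allclose(A.sum(axis=0), 1.0)     # sum-to-one of the scores
```

The Dirichlet draw is simply a convenient way to produce score vectors satisfying both constraints in (5) at once.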

The objective of linear unmixing is to simultaneously estimate the factor matrix **M** and the factor score matrix **A** from the available samples **Y**. In many geometric unmixing algorithms, **M** is estimated first, followed by estimation of **A**. A common (but restrictive) assumption in these algorithms is that some samples in the dataset are “pure” in the sense that the linear combination of (2) contains a unique factor, say **m**_{r}, with factor score a_{r,i} = 1. Recently, this assumption has been relaxed by applying a hierarchical Bayesian approach, called Bayesian linear unmixing (BLU). The resulting algorithm requires the number of factors R to be known; given R, it jointly estimates the factor matrix **M** and the factor scores **A**. The uBLU model is described in the next subsection and the Gibbs sampling algorithm is given in the Appendix. In the Results and discussion section below we demonstrate the performance advantages of uBLU as a factor analysis model for simulated and real gene expression data.

Unsupervised Bayesian linear unmixing algorithm

The BLU algorithm studied in previous work estimates **M** and **A** given the number of factors R. The residual errors **n**_{i} in (2) are assumed to be independent identically distributed (i.i.d.) according to zero-mean Gaussian distributions, **n**_{i} ∼ N(**0**, σ^{2}**I**_{G}), where **I**_{G} denotes the identity matrix of dimension G × G.

The number of factors R is unknown and is assigned a discrete uniform prior on [1, R_{max}],

where R_{max} is the maximal number of factors present in the mixture.

Because of the constraints in (5), the data samples **y**_{i} (i = 1, …, N) live in a subspace of dimension at most R_{max} − 1 ≤ G. This subspace can be identified by a principal component decomposition of **Y**, with **P** the associated (R_{max} − 1) × G projection matrix. This dimension reduction procedure allows us to work in a lower-dimensional subspace without loss of information, and reduces significantly the computational complexity of the Bayes estimator of the factor loadings. A multivariate Gaussian distribution (MGD) truncated on a suitable subset is chosen as prior distribution for the projected factors **t**_{r}. The subset is chosen to ensure positivity of the reconstructed factors **m**_{r}.
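The projection step can be sketched with an SVD-based principal component decomposition; the sizes and the use of noise-free data are illustrative assumptions:

```python
import numpy as np

# Sketch of the dimension reduction: noise-free LMM data of rank < R_max
# are projected onto the first R_max - 1 principal components and then
# reconstructed exactly, i.e., without loss of information.
rng = np.random.default_rng(1)
G, N, R_max = 40, 25, 4
M = rng.gamma(2.0, 1.0, (G, R_max - 1))
A = rng.dirichlet(np.ones(R_max - 1), N).T
Y = M @ A                                  # noise-free observations

ybar = Y.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(Y - ybar, full_matrices=False)
P = U[:, : R_max - 1].T                    # (R_max - 1) x G projection matrix
T = P @ (Y - ybar)                         # projected data
Y_rec = P.T @ T + ybar                     # back-projection

assert np.allclose(Y, Y_rec)               # no information lost
```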

More precisely, the prior is restricted to vectors **t**_{r} such that all components of the corresponding factor **m**_{r} are non-negative. To choose the means **e**_{r} of these truncated MGDs, one can use a standard endmember extraction algorithm (EEA) common in hyperspectral imaging, e.g., N-FINDR. The resulting prior for **t**_{r} is

where **e**_{r} is the mean and s^{2}**I** the covariance matrix of the r-th truncated MGD. Assuming the vectors **t**_{r} (r = 1, …, R) are a priori independent, the joint prior for **T** = [**t**_{1}, …, **t**_{R}] is

f(**T** | **E**, s^{2}) ∝ exp( −Σ_{r=1}^{R} ∥**t**_{r} − **e**_{r}∥^{2} / (2s^{2}) ) 1_{C}(**T**),

where ∝ stands for “proportional to”, ∥·∥ is the standard ℓ_{2}-norm, **E** = [**e**_{1}, …, **e**_{R}], and 1_{C}(·) denotes the indicator function of the set enforcing the positivity constraints.

The sum-to-one constraint on the factor scores **a**_{i} of each observed sample allows one element of **a**_{i} to be rewritten as a function of the others, i.e., a_{R,i} = 1 − Σ_{r=1}^{R−1} a_{r,i}; the last element a_{R,i} has been chosen here for notational simplicity. To ensure the positivity constraint, the subvectors **a**_{1:R−1, i} must belong to the simplex

S = { **a** : ∥**a**∥_{1} ≤ 1 and **a** ≽ **0** },

where ∥·∥_{1} is the ℓ_{1} norm and **a**_{i} ≽ **0** stands for the set of inequalities {a_{r,i} ≥ 0}_{r = 1, …, R}. Following the model used in previous work on linear unmixing, uniform distributions on this simplex are chosen as priors for the subvectors **a**_{1:R−1, i} (i = 1, …, N).
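A minimal sketch of this reparameterization, with illustrative sizes, draws the free subvectors uniformly on the simplex and deduces the last score:

```python
import numpy as np

# Sketch of the sum-to-one reparameterization: only the first R-1 scores
# of each sample are free; the last one is deduced from the constraint.
rng = np.random.default_rng(2)
R, N = 4, 10

full = rng.dirichlet(np.ones(R), size=N)   # uniform draws on the full simplex
A_part = full[:, : R - 1].T                # free subvectors a_{1:R-1,i}
a_R = 1.0 - A_part.sum(axis=0)             # deduced last scores

A = np.vstack([A_part, a_R])
assert (A >= -1e-12).all()                 # positivity preserved
assert np.allclose(A.sum(axis=0), 1.0)     # sum-to-one satisfied
```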

For the prior distribution on the variance σ^{2} of the residual errors we chose a conjugate inverse-Gamma distribution with parameters ν/2 and γ/2. The shape parameter ν is fixed, while a non-informative prior is assigned to the scale hyperparameter γ.

The resulting hierarchical structure of the proposed uBLU model is summarized in the directed acyclic graph (DAG) presented in Additional file

**Supplementary materials on algorithm details and performance validation.** Directed acyclic graph (DAG) of the model and flowchart of the proposed algorithm are provided in this additional file. More results on synthetic datasets are also presented to validate the proposed Bayesian algorithm, including a convergence diagnosis.


The model defined in (1) and the Gaussian assumption for the noise vectors **n**_{1}, …, **n**_{N} allow the likelihood of the observations **y**_{1}, …, **y**_{N} to be determined.

Multiplying this likelihood by the parameter priors defined in (10), (13), (14) and (6), and integrating out the nuisance parameters, yields the posterior distribution of **Θ** = {**M**, **A**, σ^{2}, R}.

Considering the parameters to be a priori independent, the joint posterior factorizes into the product of the likelihood and the prior distributions f(**A**), f(**T** | **E**, **s**^{2}) and f(σ^{2}), i.e., the priors of the factor scores **A**, the projected factor matrix **T** and the noise variance σ^{2} previously defined.

Due to the constraints enforced on the data, the posterior distribution f(**M**, **A**, σ^{2}, R | **Y**) obtained from the proposed hierarchical structure is too complex to derive analytical expressions of the Bayesian estimators, e.g., the minimum mean square error (MMSE) and maximum a posteriori (MAP) estimators. In such a case, it is natural to use Markov chain Monte Carlo (MCMC) methods to generate samples **M**^{(t)}, **A**^{(t)} and σ^{2(t)} asymptotically distributed according to f(**M**, **A**, σ^{2} | **Y**). However, the dimensions of the factor loading matrix **M** and the factor score matrix **A** depend on the unknown number of factors R, so sampling from the posterior requires exploring parameter spaces of different dimensions. To solve this dimension matching problem, we include a birth/death process within the MCMC procedure. Specifically, a birth, death or switch move is chosen at each iteration of the algorithm (see the Appendix). The factor matrix **M**, the factor score matrix **A** and the noise variance σ^{2} are then updated, conditionally upon the number of factors R.

After a sufficient number of iterations (N_{mc} iterations, including a burn-in period of N_{bi} iterations), the traditional Bayesian estimators (e.g., MMSE and MAP) can be approximated using the generated samples **M**^{(t)}, **A**^{(t)} and σ^{2(t)}. First, the generated samples are used to approximate the MAP estimator of the number of factors as the most frequent value in the chain, R̂ = argmax_{k} n_{k}, where n_{k} is the number of generated samples such that R^{(t)} = k.
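The MAP approximation of the number of factors amounts to taking the mode of the post-burn-in draws; the toy chain below is invented for illustration:

```python
import numpy as np

# Sketch: approximate the MAP estimate of the number of factors R by the
# most frequent value among the post-burn-in MCMC draws R^(t).
R_draws = np.array([3, 4, 4, 3, 4, 4, 4, 5, 4, 4])   # toy chain

values, counts = np.unique(R_draws, return_counts=True)
R_map = int(values[np.argmax(counts)])               # mode of the draws
assert R_map == 4
```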

Results and discussion

The proposed method consists of estimating simultaneously the matrices **M** and **A** defined in (1), under the positivity and sum-to-one constraints mentioned previously, in a fully unsupervised framework, i.e., with the number of factors R estimated directly from the data.

Simulations on synthetic data

To illustrate the performance of the proposed Bayesian factor decomposition, we first present simulations conducted on synthetic data. More extensive simulation results are reported in the Additional file

Simulation scenario

Several synthetic datasets were generated according to the model (1), using four types of ground-truth factors:

• dataset 1: peaky factors;

• dataset 2: realistic factors;

• dataset 3: orthogonal factors;

• dataset 4: orthogonal and positive factors.

In each case, the signal-to-noise ratio was fixed to SNR_{i} = 20 dB for each sample.

Proposed method (uBLU)

The first step of the algorithm consists of estimating the number of factors R jointly with **M** and **A**, using the samples of (**M**, **A**, σ^{2}) generated from their joint posterior given the data.

The burn-in period and number of Gibbs samples were determined using quantitative methods described in the Additional file

Comparison to other methods

The performance of the proposed uBLU algorithm is compared with other existing factor decomposition methods including PCA, NMF, BFRM and GB-GMF by using the following criteria, which are common measures used to compare factor analysis algorithms:

• the factor mean square errors (MSE), MSE_{r} = ∥**m̂**_{r} − **m**_{r}∥^{2}, where **m̂**_{r} is the estimate of the r-th factor **m**_{r};

• the global MSE of factor scores, GMSE_{r} = Σ_{i=1}^{N} (â_{r,i} − a_{r,i})^{2}, where â_{r,i} is the estimate of the factor score a_{r,i};

• the reconstruction error (RE), RE = (1/N) Σ_{i=1}^{N} ∥**y**_{i} − **ŷ**_{i}∥^{2}, where **ŷ**_{i} = **M̂â**_{i} is the reconstruction of the i-th sample **y**_{i};

• the spectral angle distance (SAD) between **m**_{r} and its estimate, SAD_{r} = arccos( ⟨**m̂**_{r}, **m**_{r}⟩ / (∥**m̂**_{r}∥ ∥**m**_{r}∥) ), where arccos(·) is the inverse cosine function;

• the global spectral angle distance (GSAD) between the samples **y**_{i} and their reconstructions **ŷ**_{i};

• the computational time.
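These criteria can be sketched as follows for a toy estimate; the data sizes and the perturbation magnitude are illustrative:

```python
import numpy as np

# Sketch of the comparison criteria: per-factor MSE, per-factor SAD, and
# the global reconstruction error, following the definitions above.
def sad(m_hat, m):
    """Spectral angle distance between a factor and its estimate."""
    c = np.dot(m_hat, m) / (np.linalg.norm(m_hat) * np.linalg.norm(m))
    return np.arccos(np.clip(c, -1.0, 1.0))

rng = np.random.default_rng(3)
G, R, N = 30, 3, 12
M = rng.gamma(2.0, 1.0, (G, R))
A = rng.dirichlet(np.ones(R), N).T
Y = M @ A
M_hat = M + 0.01 * rng.standard_normal((G, R))       # toy estimate of M

mse = np.sum((M_hat - M) ** 2, axis=0)               # MSE_r, r = 1, ..., R
sads = np.array([sad(M_hat[:, r], M[:, r]) for r in range(R)])
re = np.mean(np.sum((Y - M_hat @ A) ** 2, axis=0))   # reconstruction error

assert mse.shape == (R,) and (mse >= 0).all()
assert (sads >= 0).all() and (sads < 0.1).all()      # near-perfect estimate
```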

The proposed uBLU algorithm, the PCA, NMF and GB-GMF methods were implemented in Matlab 7.8.0 (R2009a). The BFRM software (version 2.0) was downloaded from

Simulation results are reported in the tables below. Note that if {**M**, **A**} is an admissible solution, {**MB**, **B**^{T}**A**} is also admissible for any scaling and permutation matrix **B**. Hence a re-scaling and re-ordering step is required to identify appropriate permutations before computing MSEs and GMSEs. Moreover, when the PCA, NMF, BFRM and GB-GMF methods are run with different numbers of factors, entries that could not be matched to a ground-truth factor are marked N/A in the tables.
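One simple way to resolve the permutation ambiguity before computing the MSEs is a greedy matching on cosine similarity; this is an illustrative sketch, not necessarily the matching rule used in the experiments:

```python
import numpy as np

# Sketch: re-align estimated factors, defined only up to permutation,
# with the ground-truth factors via greedy cosine-similarity matching.
rng = np.random.default_rng(4)
G, R = 30, 3
M = rng.gamma(2.0, 1.0, (G, R))            # ground-truth factors
perm = np.array([2, 0, 1])
M_hat = M[:, perm]                         # estimate = permuted truth

Mn = M / np.linalg.norm(M, axis=0)
Mhn = M_hat / np.linalg.norm(M_hat, axis=0)
sim = Mn.T @ Mhn                           # R x R cosine similarities

match = np.full(R, -1)
for r in np.argsort(-sim.max(axis=1)):     # most confident rows first
    free = [c for c in range(R) if c not in match]
    match[r] = free[int(np.argmax(sim[r, free]))]

assert np.allclose(M, M_hat[:, match])     # permutation recovered
```

For larger R, an optimal assignment (e.g., the Hungarian algorithm) would be preferable to the greedy pass.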

**(a)** R

MSEs, GMSEs, SADs, GSADs, REs and computational times for the proposed uBLU algorithm and the PCA, NMF, BFRM and GB-GMF methods.

| | **uBLU** | **PCA** | **NMF** | **BFRM** | **GB-GMF** |
|---|---|---|---|---|---|
| MSE (factor 1) | **0.39** | N/A | N/A | 205.99 | 267.42 |
| MSE (factor 2) | **0.60** | 6.04 | 61.12 | N/A | N/A |
| MSE (factor 3) | **0.54** | 0.97 | 9.78 | 325.58 | 67.14 |
| GMSE (factor 1) | **0.04** | N/A | N/A | 64.39 | 226.58 |
| GMSE (factor 2) | **0.04** | 2.00 | 2.00 | N/A | N/A |
| GMSE (factor 3) | **0.05** | 0.30 | 0.28 | 75.87 | 41.33 |
| SAD (factor 1) | **0.46** | N/A | N/A | 21.69 | 12.48 |
| SAD (factor 2) | **0.29** | 3.49 | 3.50 | N/A | N/A |
| SAD (factor 3) | **0.28** | 1.49 | 1.50 | 23.24 | 27.43 |
| GSAD (×10^{−2}) | **3.39** | 20.38 | 20.38 | 24.04 | 37.35 |
| RE | **0.18** | 9.12 | 9.12 | 1.94 | 9.16 |
| Time | 1.24×10^{3} | **0.03** | 0.71 | 47.15 | 0.39×10^{3} |

**(b)** R

| | **uBLU** | **PCA** | **NMF** | **BFRM** | **GB-GMF** |
|---|---|---|---|---|---|
| MSE (factor 1) | **0.39** | 6.01 | 0.48 | 212.30 | 40.27 |
| MSE (factor 2) | 0.60 | 6.53 | **0.45** | 681.42 | 147.74 |
| MSE (factor 3) | 0.54 | 5.86 | **0.28** | 137.22 | 94.90 |
| GMSE (factor 1) | **0.04** | 6.62 | 0.19 | 76.09 | 45.29 |
| GMSE (factor 2) | 0.04 | 2.40 | **0.01** | 142.72 | 17.37 |
| GMSE (factor 3) | **0.05** | 0.84 | 0.05 | 76.22 | 33.78 |
| SAD (factor 1) | **0.46** | 1.86 | 0.53 | 10.68 | 11.86 |
| SAD (factor 2) | **0.29** | 1.18 | 0.31 | 15.18 | 12.50 |
| SAD (factor 3) | 0.28 | 1.36 | **0.26** | 5.33 | 13.96 |
| GSAD (×10^{−2}) | **3.37** | 3.39 | 3.38 | 24.23 | 33.38 |
| RE | **0.18** | **0.18** | 0.18 | 1.84 | 0.18 |
| Time | 1.24×10^{3} | **0.10** | 0.95 | 53.60 | 0.56×10^{3} |

**(c)** R

| | **uBLU** | **PCA** | **NMF** | **BFRM** | **GB-GMF** |
|---|---|---|---|---|---|
| MSE (factor 1) | **0.39** | 6.02 | 87.78 | 205.66 | 195.89 |
| MSE (factor 2) | 0.60 | 6.53 | **0.45** | 247.96 | 101.34 |
| MSE (factor 3) | 0.54 | 8.03 | **0.26** | 330.01 | 68.69 |
| GMSE (factor 1) | **0.04** | 23.82 | 26.56 | 64.59 | 57.58 |
| GMSE (factor 2) | **0.04** | 11.70 | 0.23 | 114.02 | 3.10 |
| GMSE (factor 3) | **0.05** | 6.37 | 18.04 | 75.47 | 27.72 |
| SAD (factor 1) | **0.46** | 1.86 | 6.14 | 9.74 | 8.84 |
| SAD (factor 2) | **0.29** | 1.18 | 0.31 | 22.15 | 26.80 |
| SAD (factor 3) | 0.28 | 1.36 | **0.26** | 8.17 | 27.32 |
| GSAD (×10^{−2}) | 3.39 | **3.34** | 3.36 | 28.62 | 29.23 |
| RE | **0.18** | **0.18** | 0.18 | 2.08 | 0.18 |
| Time | 1.24×10^{3} | **0.11** | 0.96 | 63.88 | 0.70×10^{3} |

**(a)** R

MSEs, GMSEs, SADs, GSADs, REs and computational times for the proposed uBLU algorithm and the PCA, NMF, BFRM and GB-GMF methods.

| | **uBLU** | **PCA** | **NMF** | **BFRM** | **GB-GMF** |
|---|---|---|---|---|---|
| MSE (factor 1) | **0.09** | 1.97 | N/A | N/A | N/A |
| MSE (factor 2) | **0.14** | N/A | 1.06 | 37.67 | 58.75 |
| MSE (factor 3) | 0.14 | **0.12** | 26.68 | 52.09 | 150.09 |
| GMSE (factor 1) | 0.34 | **0.01** | N/A | N/A | N/A |
| GMSE (factor 2) | **0.15** | N/A | 1.12 | 1.17 | 22.37 |
| GMSE (factor 3) | **0.09** | 0.94 | 6.24 | 0.62 | 1.18 |
| SAD (factor 1) | **0.39** | 0.44 | N/A | N/A | N/A |
| SAD (factor 2) | **0.48** | N/A | 1.32 | 16.53 | 13.34 |
| SAD (factor 3) | 0.47 | **0.44** | 3.72 | 15.21 | 18.14 |
| GSAD (×10^{−2}) | 1.51 | **1.02** | 1.53 | 37.99 | 129.40 |
| RE (×10^{−2}) | **0.64** | 1.62 | 1.65 | 0.65 | 5.47 |
| Time | 22.06×10^{3} | **0.29** | 32.02 | 4.07×10^{3} | 9.24×10^{3} |

**(b)** R

| | **uBLU** | **PCA** | **NMF** | **BFRM** | **GB-GMF** |
|---|---|---|---|---|---|
| MSE (factor 1) | **0.09** | 1.97 | 14.87 | 24.41 | 61.00 |
| MSE (factor 2) | 0.14 | **0.01** | 20.53 | 50.59 | 58.31 |
| MSE (factor 3) | 0.14 | **0.09** | 14.02 | 35.89 | 65.11 |
| GMSE (factor 1) | 0.34 | **0.03** | 0.34 | 1.41 | 4.80 |
| GMSE (factor 2) | 0.15 | **0.02** | 2.44 | 0.65 | 9.40 |
| GMSE (factor 3) | 0.09 | **0.05** | 0.92 | 1.19 | 5.40 |
| SAD (factor 1) | **0.39** | 0.44 | 2.84 | 14.35 | 13.72 |
| SAD (factor 2) | 0.48 | **0.12** | 4.75 | 15.47 | 13.62 |
| SAD (factor 3) | 0.47 | **0.37** | 4.00 | 17.50 | 15.82 |
| GSAD (×10^{−2}) | **1.02** | 1.02 | 1.49 | 29.29 | 129.29 |
| RE (×10^{−2}) | 0.64 | **0.63** | 1.55 | 0.75 | 1.62 |
| Time | 22.06×10^{3} | **0.28** | 45.91 | 5.37×10^{3} | 16.59×10^{3} |

**(c)** R

| | **uBLU** | **PCA** | **NMF** | **BFRM** | **GB-GMF** |
|---|---|---|---|---|---|
| MSE (factor 1) | **0.09** | 1.97 | 13.13 | 24.25 | 64.90 |
| MSE (factor 2) | 0.14 | **0.01** | 20.53 | 50.52 | 64.09 |
| MSE (factor 3) | 0.14 | **0.09** | 14.02 | 28.32 | 69.99 |
| GMSE (factor 1) | 0.34 | **0.09** | 0.20 | 1.42 | 15.12 |
| GMSE (factor 2) | **0.15** | 0.48 | 1.00 | 0.65 | 9.55 |
| GMSE (factor 3) | 0.09 | **0.05** | 0.44 | 1.31 | 7.73 |
| SAD (factor 1) | **0.39** | 0.44 | 2.54 | 14.74 | 14.53 |
| SAD (factor 2) | 0.48 | **0.13** | 5.52 | 15.45 | 14.55 |
| SAD (factor 3) | 0.47 | **0.37** | 4.79 | 16.45 | 16.17 |
| GSAD (×10^{−2}) | 1.02 | **1.01** | 1.06 | 40.36 | 129.29 |
| RE (×10^{−2}) | 0.64 | **0.63** | 0.69 | 0.86 | 1.50 |
| Time | 22.06×10^{3} | **0.54** | 55.86 | 5.59×10^{3} | 16.59×10^{3} |

**(a)** R

MSEs, GMSEs, SADs, GSADs, REs and computational times for the proposed uBLU algorithm and the PCA, NMF, BFRM and GB-GMF methods.

| | **uBLU** | **PCA** | **NMF** | **BFRM** | **GB-GMF** |
|---|---|---|---|---|---|
| MSE (factor 1) | **0.01** | 0.83 | 0.82 | N/A | 1.14 |
| MSE (factor 2) | 0.85 | **0.80** | 0.92 | 1.34 | 2.30 |
| MSE (factor 3) | **1.15** | N/A | N/A | 1.36 | N/A |
| GMSE (factor 1) | 7.75 | **7.29** | 7.72 | N/A | 8.94 |
| GMSE (factor 2) | 7.76 | **0.47** | 0.48 | 12.30 | 11.86 |
| GMSE (factor 3) | **9.84** | N/A | N/A | 11.05 | N/A |
| SAD (factor 1) | **0.59** | 7.09 | 7.04 | N/A | 15.55 |
| SAD (factor 2) | 7.13 | **6.71** | 7.19 | 8.41 | 16.43 |
| SAD (factor 3) | 8.71 | N/A | N/A | **8.54** | N/A |
| GSAD (×10^{−1}) | 3.23 | **2.58** | 2.59 | 6.59 | 15.26 |
| RE (×10^{−4}) | 3.11 | 0.70 | 0.70 | **0.47** | 2.50 |
| Time | 1.59×10^{3} | **0.01** | 0.70 | 42.02 | 0.40×10^{3} |

**(b)** R

| | **uBLU** | **PCA** | **NMF** | **BFRM** | **GB-GMF** |
|---|---|---|---|---|---|
| MSE (factor 1) | **0.01** | 0.15 | 0.15 | 1.74 | 1.20 |
| MSE (factor 2) | 0.85 | 1.02 | **0.76** | 1.76 | 2.26 |
| MSE (factor 3) | 1.15 | 1.57 | **1.03** | 1.55 | 2.40 |
| GMSE (factor 1) | 7.75 | 14.89 | **2.80** | 11.40 | 14.09 |
| GMSE (factor 2) | 7.76 | **0.11** | 0.40 | 12.11 | 12.33 |
| GMSE (factor 3) | 9.84 | **0.11** | 0.30 | 10.94 | 12.76 |
| SAD (factor 1) | **0.59** | 2.60 | 2.47 | 11.34 | 15.76 |
| SAD (factor 2) | 7.13 | 7.16 | **6.59** | 9.45 | 16.40 |
| SAD (factor 3) | 8.71 | 8.80 | **7.67** | 9.06 | 15.66 |
| GSAD (×10^{−1}) | 3.23 | **2.58** | 1.71 | 6.88 | 15.20 |
| RE (×10^{−4}) | 3.11 | **0.27** | 0.29 | 0.49 | 2.44 |
| Time | 1.59×10^{3} | **0.10** | 1.24 | 59.72 | 0.54×10^{3} |

**(c)** R

| | **uBLU** | **PCA** | **NMF** | **BFRM** | **GB-GMF** |
|---|---|---|---|---|---|
| MSE (factor 1) | **0.01** | 0.02 | 1.43 | 1.43 | 1.19 |
| MSE (factor 2) | **0.85** | 1.48 | 5.49 | 3.92 | 2.06 |
| MSE (factor 3) | 1.15 | 1.68 | **0.90** | 1.88 | 2.33 |
| GMSE (factor 1) | **7.75** | 13.78 | 20.56 | 16.66 | 13.15 |
| GMSE (factor 2) | 7.76 | **4.35** | 12.36 | 15.34 | 11.75 |
| GMSE (factor 3) | 9.84 | 3.99 | **2.67** | 11.25 | 13.29 |
| SAD (factor 1) | **0.59** | 0.97 | 10.27 | 10.24 | 15.97 |
| SAD (factor 2) | **7.13** | 7.93 | 15.78 | 16.45 | 14.92 |
| SAD (factor 3) | 8.71 | 8.66 | **6.93** | 10.98 | 15.89 |
| GSAD (×10^{−1}) | 3.23 | **1.17** | 1.20 | 5.51 | 15.98 |
| RE (×10^{−4}) | 3.11 | **0.16** | 0.16 | 0.41 | 2.45 |
| Time | 1.59×10^{3} | **0.13** | 1.15 | 67.71 | 0.69×10^{3} |

**(a)** R

| | **uBLU** | **PCA** | **NMF** | **BFRM** | **GB-GMF** |
|---|---|---|---|---|---|
| MSE (factor 1) | **0.02** | N/A | 5.12 | N/A | N/A |
| MSE (factor 2) | 1.61 | **0.01** | 3.59 | 15.35 | 18.69 |
| MSE (factor 3) | **0.05** | 0.44 | N/A | 14.42 | 19.20 |
| GMSE (factor 1) | **0.28** | N/A | 3.23 | N/A | N/A |
| GMSE (factor 2) | 0.87 | **0.02** | 2.65 | 0.33 | 1.62 |
| GMSE (factor 3) | 0.69 | 0.76 | N/A | **0.50** | 1.30 |
| SAD (factor 1) | **0.34** | N/A | 4.25 | N/A | N/A |
| SAD (factor 2) | 3.08 | **0.17** | 3.71 | 14.90 | 14.89 |
| SAD (factor 3) | **0.51** | 0.68 | N/A | 15.59 | 15.70 |
| GSAD (×10^{−2}) | **4.97** | 5.24 | 5.25 | 157.09 | 156.19 |
| RE (×10^{−4}) | **4.49** | 4.88 | 4.89 | 19.34 | 8.48 |
| Time | 1.61×10^{3} | **0.02** | 1.36 | 35.29 | 0.40×10^{3} |

**(b)** R

| | **uBLU** | **PCA** | **NMF** | **BFRM** | **GB-GMF** |
|---|---|---|---|---|---|
| MSE (factor 1) | 0.02 | **0.01** | 6.18 | 18.38 | 21.63 |
| MSE (factor 2) | 1.61 | **0.01** | 4.79 | 16.10 | 19.55 |
| MSE (factor 3) | **0.05** | 0.09 | 4.21 | 15.04 | 19.85 |
| GMSE (factor 1) | 0.28 | **0.05** | 1.67 | 1.44 | 1.29 |
| GMSE (factor 2) | 0.87 | **0.05** | 1.01 | 0.37 | 1.75 |
| GMSE (factor 3) | 0.69 | **0.05** | 0.94 | 0.26 | 1.17 |
| SAD (factor 1) | 0.34 | **0.27** | 4.12 | 15.21 | 15.65 |
| SAD (factor 2) | 3.08 | **0.17** | 4.09 | 15.26 | 15.90 |
| SAD (factor 3) | 0.51 | **0.32** | 4.16 | 16.07 | 15.36 |
| GSAD (×10^{−2}) | 4.97 | **4.95** | 4.99 | 157.08 | 154.80 |
| RE (×10^{−4}) | 4.49 | **4.34** | 4.36 | 25.00 | 8.48 |
| Time | 1.61×10^{3} | **0.10** | 1.78 | 41.05 | 0.55×10^{3} |

**(c)** R

| | **uBLU** | **PCA** | **NMF** | **BFRM** | **GB-GMF** |
|---|---|---|---|---|---|
| MSE (factor 1) | 0.02 | **0.01** | 6.98 | 17.51 | 21.60 |
| MSE (factor 2) | 1.61 | **0.01** | 7.30 | 15.07 | 19.03 |
| MSE (factor 3) | **0.05** | 0.07 | 4.27 | 14.55 | 19.14 |
| GMSE (factor 1) | 0.28 | **0.22** | 0.65 | 0.75 | 1.29 |
| GMSE (factor 2) | 0.87 | **0.51** | 0.91 | 0.77 | 1.18 |
| GMSE (factor 3) | 0.69 | **0.05** | 0.56 | 0.56 | 1.33 |
| SAD (factor 1) | 0.34 | **0.27** | 4.41 | 15.61 | 15.51 |
| SAD (factor 2) | 3.08 | **0.19** | 4.81 | 16.31 | 14.77 |
| SAD (factor 3) | 0.51 | **0.33** | 4.00 | 15.84 | 15.26 |
| GSAD (×10^{−2}) | 4.97 | **4.91** | 4.94 | 156.76 | 162.63 |
| RE (×10^{−4}) | 4.49 | **4.30** | 4.33 | 13.48 | 8.29 |
| Time | 1.61×10^{3} | **0.16** | 1.56 | 48.22 | 0.70×10^{3} |

These results show that the uBLU method is more flexible than the competing methods, since it provides better unmixing performance across all of the considered synthetic datasets.

Evaluation on gene expression data

Here the proposed algorithm is illustrated on a real time-evolving gene expression dataset from a recent viral challenge study on influenza A/H3N2/Wisconsin. The data are available at GEO, accession number GSE30550.

Details on data collection

We briefly describe the dataset; for more details the reader is referred to the original challenge study. Healthy volunteers were inoculated with a dose of 10^{6} TCID_{50} of influenza A, manufactured and processed under current good manufacturing practices (cGMP) by Baxter BioScience. Peripheral blood microarray analysis was performed at multiple time instants: at baseline (24 hours prior to inoculation with virus), then at 8-hour intervals for the initial 120 hours, and then every 24 hours for two further days. Each sample comprised genome-wide gene expression measurements.

**Experimental results on the H3N2 viral challenge dataset of gene expression profiles.** (**a**) Estimated posterior distribution of the number of factors R. (**b**) Factor loadings ranked by decreasing dominance. (**c**) Heatmap of the factor scores of the inflammatory component clearly separates symptomatic subjects (bottom 9 rows) and the time course of their molecular inflammatory response. The five black colored pixels indicate samples that were not assayed.

Application of the proposed uBLU algorithm

The uBLU algorithm was run with N_{mc} = 10000 Monte Carlo iterations, including a burn-in period of N_{bi} = 1000 iterations. uBLU estimates the posterior distribution of the number of factors R directly from the data.

The reconstruction error RE^{(t)} was monitored as a function of the number of iterations, computed from the estimates of **M** and **A** at each iteration; this diagnostic suggests that N_{bi} = 1000 and N_{mc} = 10000 are sufficient.

**Reconstruction error and estimated number of factors as a function of the number of iterations (H3N2 challenge data).** Top: reconstruction error RE^{(t)} computed from the observation matrix **Y** and the estimated matrices **M**^{(t)} and **A**^{(t)} as a function of the iteration index t. Bottom: estimated number of factors R^{(t)} as a function of the iteration number t.

The different factors are depicted in Figure

The factor scores corresponding to this inflammatory component are shown in Figure

Furthermore, this inflammatory factor identified by the proposed uBLU algorithm is most highly represented in those samples associated with acute flu symptoms, as measured by modified Jackson scores (see

NCI-curated pathway associations of group of genes contributing to the uBLU inflammatory component, whose factor scores are shown in the corresponding figure.

| **Pathway name** | **Genes** | **P-value** |
|---|---|---|
| IFN-gamma pathway | CASP1, CEBPB, IL1B, IRF1, IRF9, PRKCD, SOCS1, STAT1, STAT3 | 1.34e-09 |
| PDGFR-beta signaling pathway | DOCK4, EIF2AK2, FYN, HCK, LYN, PRKCD, SLA, SRC, STAT1, STAT3, STAT5A, STAT5B | 3.26e-08 |
| IL23-mediated signaling events | CCL2, CXCL1, CXCL9, IL1B, STAT1, STAT3, STAT5A | 2.18e-07 |
| Signaling events mediated by TCPTP | EIF2AK2, SRC, STAT1, STAT3, STAT5A, STAT5B, STAT6 | 6.38e-07 |
| Signaling events mediated by PTP1B | FYN, HCK, LYN, SRC, STAT3, STAT5A, STAT5B | 2.40e-06 |
| GMCSF-mediated signaling events | CCL2, LYN, STAT1, STAT3, STAT5A, STAT5B | 3.70e-06 |
| IL12-mediated signaling events | HLA-A, IL1B, SOCS1, STAT1, STAT3, STAT5A, STAT6 | 1.32e-05 |
| IL6-mediated signaling events | CEBPB, HCK, IRF1, PRKCD, STAT1, STAT3 | 1.80e-05 |

For comparison we applied a supervised version of the proposed uBLU algorithm to the H3N2 dataset. This was implemented by fixing the number of factors in advance and estimating only **M** and **A**. The inflammatory component found by the supervised algorithm was virtually identical to the one found by the proposed algorithm (uBLU) that automatically selects the number of factors R.

Comparison to other methods

The uBLU algorithm is compared with four matrix factorization algorithms, i.e. PCA, NMF, BFRM and GB-GMF methods.

Figure

**Factor loadings ranked by decreasing dominance for H3N2 challenge data.** uBLU shows a particularly strong component (Figure

**Heatmaps of the factor scores of the inflammatory component for H3N2 challenge data.** The inflammatory factor determined by the proposed uBLU method (**a**) shows higher contrast between symptomatic and asymptomatic subjects than the other methods. The five black colored pixels of the heatmaps indicate samples that were not assayed.

The factor scores of the five matrix factorization methods corresponding to the inflammatory component are depicted in the heatmaps referenced above. Denote by (μ_{pos}, σ^{2}_{pos}) the empirical mean and variance of the scores associated with the N_{pos} samples in the acute symptomatic state (bright colored samples in the lower right rectangle of the heatmaps), and by (μ_{neg}, σ^{2}_{neg}) the corresponding quantities for the remaining samples. The contrast between the two groups is then measured by the Fisher linear discriminant criterion (μ_{pos} − μ_{neg})^{2} / (σ^{2}_{pos} + σ^{2}_{neg}).
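The Fisher criterion can be sketched as follows; the score values are invented for illustration and are not taken from the study:

```python
import numpy as np

# Sketch of the Fisher linear discriminant used to quantify the contrast
# between symptomatic ("pos") and asymptomatic ("neg") factor scores.
scores_pos = np.array([0.80, 0.90, 0.85, 0.70, 0.95])  # toy symptomatic scores
scores_neg = np.array([0.10, 0.05, 0.20, 0.15, 0.10])  # toy remaining scores

def fisher(pos, neg):
    """(mu_pos - mu_neg)^2 / (var_pos + var_neg)."""
    return (pos.mean() - neg.mean()) ** 2 / (pos.var() + neg.var())

f = fisher(scores_pos, scores_neg)
assert f > 1.0            # well-separated groups give a large criterion
```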

Fisher linear discriminant measure (22), reconstruction error, computational time, and number of iterations for the five matrix factorization methods on the H3N2 dataset.

| | **uBLU** | **PCA** | **NMF** | **BFRM** | **GB-GMF** |
|---|---|---|---|---|---|
| Fisher criterion (×10^{−2}) (22) | **6.20** | 2.03 | 6.17 | 4.68 | 2.30 |
| RE | **6.48×10^{−2}** | 4.89 | 7.31×10^{−2} | 4.82 | 9.51×10^{−2} |
| Time | ≈ 12 | **1.5 s** | 116 | ≈ 47 | ≈ 10 |
| Number of iterations | 10 000 | N/A | 5 000 | 10 000 | 500 |

To compare the biological relevance of the inflammatory genes found by uBLU to those found by the other methods we performed gene enrichment analysis (GEA). Here we only report GEA comparisons between uBLU and NMF. Tables

NCI-curated pathway associations of group of genes contributing to the NMF inflammatory component, whose factor scores are shown in the corresponding figure.

| **Pathway name** | **Genes** | **P-value** |
|---|---|---|
| IL23-mediated signaling events | CCL2, CXCL1, CXCL9, IL1B, JAK2, STAT1, STAT5A | 2.18e-07 |
| IL12-mediated signaling events | GADD45B, IL1B, JAK2, MAP2K6, SOCS1, STAT1, STAT5A, STAT6 | 1.10e-06 |
| IFN-gamma pathway | CASP1, IL1B, IRF9, JAK2, SOCS1, STAT1 | 1.07e-05 |
| Signaling events mediated by TCPTP | EIF2AK2, PIK3R2, STAT1, STAT5A, STAT5B, STAT6 | 1.07e-05 |
| IL27-mediated signaling events | IL1B, JAK2, STAT1, STAT2, STAT5A | 1.22e-05 |
| CXCR3-mediated signaling events | CXCL10, CXCL11, CXCL13, CXCL9, MAP2K6, PIK3R2 | 1.23e-05 |
| GMCSF-mediated signaling events | CCL2, JAK2, STAT1, STAT5A, STAT5B | 6.24e-05 |
| PDGFR-beta signaling pathway | EIF2AK2, JAK2, PIK3R2, ARAP1, DOCK4, STAT1, STAT5A, STAT5B | 1.38e-04 |

Figure

**Chip clouds after demixing for H3N2 challenge data.** These figures show the scatter of the four dimensional factor score vectors (projected onto the plane using MDS) for each algorithm that was compared to uBLU. uBLU, NMF and BFRM obtain a clean separation of samples of symptomatic (red points) and asymptomatic (blue points) subjects whereas the separation is less clear with PCA. In these scatter plots the size of a point is proportional to the time at which the sample was taken during challenge study.

One can conclude from these comparisons that, when applied on the H3N2 dataset, the proposed uBLU algorithm outperforms PCA, NMF, BFRM, and GB-GMF algorithms in terms of finding genes with higher pathway enrichment and achieving higher contrast of the acute symptom states.

The computational times required by the five considered matrix factorization methods, including the proposed uBLU algorithm, when applied to this real dataset, are reported in Table

Conclusions

This paper proposes a new Bayesian unmixing algorithm for discovering signatures in high dimensional biological data, and specifically for gene expression microarrays. An interesting property of the proposed algorithm is that it provides positive factor loadings to ensure positivity as well as sum-to-one constraints for the factor scores. The advantages of these constraints are that they lead to better discrimination between sick and healthy individuals, and they recover the inflammatory genes in a unique factor, the inflammatory component. The proposed algorithm is fully unsupervised in the sense that it does not depend on any labeling of the samples and that it can infer the number of factors directly from the observation data matrix. Finally, as any Bayesian algorithm, the Monte Carlo-based procedure investigated in this study provides point estimates as well as confidence intervals for the unknown parameters, contrary to many existing factor decomposition methods such as PCA or NMF.

Simulation results performed on synthetic and real data demonstrated significant improvements. Indeed, when applied to real time-evolving gene expression datasets, the uBLU algorithm revealed an inflammatory factor with higher contrast between subjects who would become symptomatic from those who would remain asymptomatic (as determined by comparing to ground truth clinical labels).

In this study, the time samples were modeled as independent. Future works include extensions of the proposed model to account for time dependency between samples.

Appendix A: Gibbs sampler

This appendix provides more details about the Gibbs sampler strategy used to generate samples distributed according to the posterior f(**Θ** | **Y**) defined in (18), where the dimensions of the matrices **M**, **T**, and **A** depend on the unknown number of factors R.

The different steps of the Gibbs sampler are detailed below.

Inference of the number of factors

The proposed unsupervised algorithm includes a birth/death process for inferring the number of factors R jointly with **M** and **A**. More precisely, at iteration t a birth, death or switch move is proposed to change the current state {R^{(t)}, **M**^{(t)}, **A**^{(t)}, σ^{2(t)}} to the new state {R^{⋆}, **M**^{⋆}, **A**^{⋆}, σ^{2⋆}}. The birth, death and switch moves are defined as follows, similar to those used in related work:

• **Birth**: a new factor is proposed and the number of factors R is increased by one;

• **Death**: a randomly selected factor is removed and the number of factors R is decreased by one;

• **Switch**: the number of factors R is unchanged and a current factor is exchanged for a new candidate.

Each move is then accepted or rejected according to an empirical acceptance probability given by the likelihood ratio between the proposed new state and the current state. The factor matrix **M**, the factor score matrix **A** and the noise variance σ^{2} are then updated, conditionally upon the number of factors R.
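The bookkeeping of these moves can be caricatured with a one-dimensional toy example; the penalized "likelihood" below is invented and stands in for the full likelihood ratio of the sampler:

```python
import numpy as np

# Toy sketch of birth/death/switch moves on the number of factors R:
# propose a move, then accept it with a probability driven by the
# (illustrative) likelihood ratio between proposed and current states.
rng = np.random.default_rng(5)

def log_lik(R):
    # Hypothetical penalized fit: improves up to R = 4, then degrades.
    return -((R - 4) ** 2)

R, R_max = 2, 8
for _ in range(200):
    move = rng.choice(["birth", "death", "switch"])
    R_new = {"birth": R + 1, "death": R - 1, "switch": R}[move]
    if not (1 <= R_new <= R_max):
        continue                                   # move leaves [1, R_max]
    ratio = np.exp(log_lik(R_new) - log_lik(R))    # likelihood ratio
    if rng.uniform() < min(1.0, ratio):
        R = R_new                                  # accept the move

assert 1 <= R <= R_max
```

In the full sampler the switch move also redraws the content of the exchanged factor, and the acceptance ratio involves the complete posterior rather than this toy score.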

Generation of samples according to f(**T** | **A**, σ^{2}, **Y**)

Sampling from the joint conditional f(**T** | **A**, σ^{2}, **Y**) is achieved by updating each column of **T** using Gibbs moves. Let **T**_{∖r} denote the matrix **T** whose r-th column has been removed. The conditional distribution of **t**_{r} given **T**_{∖r} is a truncated multivariate Gaussian distribution (MGD) whose mean and covariance follow from standard Gaussian conjugacy.

For more details on how we generate realizations from this truncated distribution, see

Generation of samples according to f(**a**_{1:R − 1, i} | **T**, σ^{2}, **Y**)

Straightforward computations lead to the posterior distribution of each element of **a**_{1:R − 1, i}: a Gaussian distribution truncated to the interval imposed by the constraints, where **M**_{∖R} denotes the factor loading matrix **M** whose R-th column has been removed.

Generation of samples according to f(σ^{2} | **T**, **A**, **Y**)

Using (14) and (16), one can show that the conditional distribution f(σ^{2} | **M**, **A**, **Y**) is the following inverse-Gamma distribution:

f(σ^{2} | **M**, **A**, **Y**) = IG( (ν + GN)/2, (γ + Σ_{i=1}^{N} ∥**y**_{i} − **Ma**_{i}∥^{2})/2 ).
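Under this conjugate structure, the σ^{2} step reduces to a single inverse-Gamma draw; the hyperparameter values and problem sizes below are illustrative:

```python
import numpy as np

# Sketch of the sigma^2 Gibbs step: with an inverse-Gamma prior
# IG(nu/2, gamma/2), the conditional posterior is inverse-Gamma with the
# shape increased by GN/2 and the scale by half the residual sum of squares.
rng = np.random.default_rng(6)
G, R, N = 30, 3, 15
M = rng.gamma(2.0, 1.0, (G, R))
A = rng.dirichlet(np.ones(R), N).T
sigma2_true = 0.05
Y = M @ A + np.sqrt(sigma2_true) * rng.standard_normal((G, N))

nu, gam = 2.0, 1e-3                              # toy prior hyperparameters
rss = np.sum((Y - M @ A) ** 2)                   # residual sum of squares
shape = (nu + G * N) / 2.0
scale = (gam + rss) / 2.0
sigma2_draw = scale / rng.gamma(shape)           # one draw from IG(shape, scale)

assert 0.01 < sigma2_draw < 0.15                 # concentrates near sigma2_true
```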

Appendix B: Contribution of each of uBLU’s constraints

To illustrate the advantage of enforcing non-negativity and sum-to-one constraints on the factors and on the factor scores, as detailed in the Methods section, we evaluated the effect of successively stripping out these constraints from uBLU. In particular we implemented uBLU under the following conditions: i) without any constraints, ii) with only the positivity constraints on the factors and the scores, iii) with only the sum-to-one constraint on the scores, and iv) with both positivity and sum-to-one constraint on factors and scores as proposed in (5).

Figures

**Contribution of each constraint on the scores of the inflammatory factor (H3N2 challenge data).** The five black colored pixels of the heatmaps indicate samples that were not assayed. Note that when only the sum-to-one constraint is applied, non-negativity is not guaranteed. However, for this dataset the sum-to-one factor scores turn out to take on non-negative values for the inflammatory factor (but not for the other factors).

Benefit of constraints in uBLU in terms of gene enrichment in the NCI-curated IFN-gamma and IL23-mediated pathways. As in the preceding tables, the most significant p-value in each row is in bold (here well below 10^{−6}).

| | **Without constraints** | **Positivity** | **Sum-to-one** | **Positivity and sum-to-one** |
|---|---|---|---|---|
| P-value of the “IFN-gamma pathway” | 6.00×10^{−2} | 2.05×10^{−2} | 2.17×10^{−1} | **1.34×10^{−9}** |
| P-value of the “IL23-mediated signaling events” | 2.60×10^{−1} | 8.37×10^{−2} | 2.28×10^{−2} | **2.18×10^{−7}** |

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

CB, ND, JYT and AH performed the statistical analysis. GG and AZ designed the Flu challenge experiment that generated the data used to compare the methods. All authors contributed to the manuscript and approved the final version.

Acknowledgements

This work was supported in part by DARPA under the PHD program (DARPA-N66001-09-C-2082). The views, opinions, and findings contained in this article are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the Department of Defense. Approved for Public Release, Distribution Unlimited.