Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA USA

Departments of Human Oncology and Human Genetics, Drexel University School of Medicine, Pittsburgh, PA USA

Abstract

Background

While in principle a seemingly infinite variety of combinations of mutations could result in tumor development, in practice it appears that most human cancers fall into a relatively small number of "sub-types," each characterized by a roughly equivalent sequence of mutations by which it progresses in different patients. There is currently great interest in identifying the common sub-types and applying them to the development of diagnostics or therapeutics. Phylogenetic methods have shown great promise for inferring common patterns of tumor progression, but suffer from limits of the technologies available for assaying differences between and within tumors. One approach to tumor phylogenetics uses differences between single cells within tumors, gaining valuable information about intra-tumor heterogeneity but allowing only a few markers per cell. An alternative approach uses tissue-wide measures of whole tumors to provide a detailed picture of averaged tumor state but at the cost of losing information about intra-tumor heterogeneity.

Results

The present work applies "unmixing" methods, which separate complex data sets into combinations of simpler components, to attempt to gain advantages of both tissue-wide and single-cell approaches to cancer phylogenetics. We develop an unmixing method to infer recurring cell states from microarray measurements of tumor populations and use the inferred mixtures of states in individual tumors to identify possible evolutionary relationships among tumor cells. Validation on simulated data shows the method can accurately separate small numbers of cell states and infer phylogenetic relationships among them. Application to a lung cancer dataset shows that the method can identify cell states corresponding to common lung tumor types and suggest possible evolutionary relationships among them that show good correspondence with our current understanding of lung tumor development.

Conclusions

Unmixing methods provide a way to make use of both intra-tumor heterogeneity and large probe sets for tumor phylogeny inference, establishing a new avenue towards the construction of detailed, accurate portraits of common tumor sub-types and the mechanisms by which they develop. These reconstructions are likely to have future value in discovering and diagnosing novel cancer sub-types and in identifying targets for therapeutic development.

Background

One of the great contributions of genomic studies to human health has been to dramatically improve our understanding of the biology of tumor formation and the means by which it can be treated. Our understanding of cancer biology has been radically transformed by new technologies for probing the genome and gene and protein expression profiles of tumors, which have made it possible to identify important sub-types of tumors that may be clinically indistinguishable yet have very different prognoses and responses to treatments.

More sophisticated computational models of tumor evolution, drawn from the field of phylogenetics, have provided an important tool for identifying and characterizing novel cancer sub-types.

An alternative approach to tumor phylogenetics, developed by Pennington et al., works cell-by-cell, using differences between single cells within tumors to gain information about intra-tumor heterogeneity at the cost of assaying only a few markers per cell.

Each of these two approaches to cancer phylogenetics has advantages, but also significant limitations. The tumor-by-tumor approach has the advantage of allowing assays of many distinct probes per tumor, potentially surveying expression of the complete transcriptome or copy number changes over the complete genome. It does not, however, give one access to the information provided by knowledge of intratumor heterogeneity, such as the existence of transitory cell populations and the patterns by which they co-occur within tumors, that allow for a more detailed and accurate picture of the progression process. The cell-by-cell approach gives one access to this heterogeneity information, but at the cost of allowing only a small number of probes per cell. It thus allows for only relatively crude measures of state using small sets of previously identified markers of progression.

One potential avenue for bridging the gap between these two methodologies is the use of computational methods for mixture type separation, or "unmixing," to infer sample heterogeneity from tissue-wide measurements. In an unmixing problem, one is presented with a set of data points that are each presumed to be a mixture of unknown fractions of several fundamental components; the goal is to infer the set of components and the fraction of each component in each observed data point.

The use of similar unmixing methods for tumor samples was pioneered by Billheimer and colleagues.

In the present work, we develop a new approach using unmixing of tumor samples to assist in phylogenetic inference of cancer progression pathways. Our unmixing method adapts the geometric approach of Ehrlich and Full.

Results

Algorithms

Model and definitions

We assume that the input to our methods consists primarily of a set of gene expression values describing the activity of a set of genes across a set of tumor samples, encoded in a matrix M; we define M_{it} to be element (i, t) of M, the expression level of gene i in tumor sample t.

The output of the unmixing step is assumed to consist of a set of mixture components, representing the inferred cell types from the microarray data, and a set of mixture fractions, describing the amount of each observed tumor sample attributed to each mixture component. Mixture components, then, represent the presumed expression signatures of the fundamental cell types of which the tumors are composed. Mixture fractions represent the amount of each cell type inferred to be present in each sample. The degree to which different components co-occur in common tumors according to these mixture fractions provides the data we will subsequently use to infer phylogenetic relationships between the components. The mixture components are encoded in a matrix C, and the mixture fractions in a matrix F, defining F_{tj} to be the fraction of component j attributed to tumor sample t. We require that F_{tj} ≥ 0 for all t and j and that ∑_j F_{tj} = 1 for all t.

The unmixing problem is illustrated in Fig. 1 with two hypothetical observed samples, m_1 and m_2, meant to represent primary tumor samples derived from three mixture components, c_1, c_2, and c_3. For this example, we assume data are assayed on just two genes, g_1 and g_2. The matrix M encodes the observed samples m_1 and m_2 in terms of the gene expression levels of g_1 and g_2. We assume here that m_1 and m_2 are mixtures of the three components, c_1, c_2, and c_3, meaning that they will lie in the triangular simplex that has the components as its vertices. The matrix C encodes the expression signatures of the three components. The matrix F encodes the mixture fractions by which m_1 and m_2 are generated from the components. The first row of F specifies that m_1 is a mixture of equal parts of c_1 and c_2, and thus appears at the midpoint of the line between those two components. The second row of F specifies that m_2 is a mixture of 80% c_3 with 10% each of c_1 and c_2, thus appearing internal to the simplex but close to c_3. In the real problem, we get to observe only M and must infer C and F.

Illustration of the geometric mixture model used in the present work

**Illustration of the geometric mixture model used in the present work**. The image shows a hypothetical set of three mixture components (c_1, c_2, and c_3) and two mixed samples (m_1 and m_2) produced from different mixtures of those components. The triangular simplex enclosed by the mixture components is shown with dashed lines. To the right are the matrices M, C, and F corresponding to the example.
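The linear mixture model in this example can be sketched concretely in code. The paper's implementation is in Matlab; the following is a dependency-free Python sketch in which the component coordinates and mixture fractions are hypothetical values chosen to mirror the figure:

```python
# Linear mixture model sketch: observed samples M are convex combinations
# of component expression signatures C with mixture fractions F.
# Coordinates are hypothetical, chosen to mirror the Fig. 1 example.

C = [(1.0, 1.0),   # c1: expression of genes g1, g2 in component 1
     (3.0, 1.0),   # c2
     (2.0, 3.0)]   # c3

F = [(0.5, 0.5, 0.0),   # m1: equal parts c1 and c2
     (0.1, 0.1, 0.8)]   # m2: 10% c1, 10% c2, 80% c3

def mix(F, C):
    """Generate observed samples as convex combinations of the components."""
    n_genes = len(C[0])
    return [tuple(sum(f * comp[g] for f, comp in zip(row, C))
                  for g in range(n_genes))
            for row in F]

M = mix(F, C)
# m1 falls at the midpoint of c1 and c2; m2 lies inside the simplex near c3.
```

In the real problem only M is observed, and the task is to recover C and F.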

The output of the phylogeny step is presumed to be a tree whose nodes correspond to the mixture components inferred in the unmixing step. The tree is intended to describe likely ancestry relationships among the components and thus to represent a hypothesis about how cell lineages within the tumors collectively progress between the inferred cell states. We assume for the purposes of this model that the evidence from which we will infer a tree is the sharing of cell states in individual tumors, as in prior combinatorial models of the oncogenetic tree problem. Suppose, for example, that we have inferred components c_1, c_2, and c_3 from a sample of tumors and, further, have inferred that one tumor is composed of component c_1 alone, another of components c_1 and c_2, and another of components c_1 and c_3. Then we could infer that c_1 is the parent state of c_2 and c_3 based on the fact that the presence of c_2 or c_3 implies that of c_1 but not vice-versa. This purely logical model of the problem cannot be used directly on unmixed data because imprecision in the mixture assignments will lead to every tumor being assigned some non-zero fraction of every component. We therefore need to optimize over possible ancestry assignments using a probability model that captures this general intuition but allows for noisy assignments of components. This model is described in detail under the subsection "Phylogeny" below.
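The containment logic described above can be stated as a short sketch (Python rather than the paper's Matlab; the three tumors are the hypothetical example from the text):

```python
# Containment-based ancestry: infer c_i ancestral to c_j when every tumor
# containing c_j also contains c_i, but not vice-versa.
# The tumors below are the hypothetical example from the text.
tumors = [{"c1"}, {"c1", "c2"}, {"c1", "c3"}]

def is_ancestor(a, b, tumors):
    """True if the presence of b implies the presence of a, but not conversely."""
    b_implies_a = all(a in t for t in tumors if b in t)
    a_implies_b = all(b in t for t in tumors if a in t)
    return b_implies_a and not a_implies_b
```

This yields c1 as the parent state of both c2 and c3. With noisy mixture fractions every containment test would trivially pass, which is why a probabilistic treatment is needed instead.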

Cell type identification by unmixing

We perform cell type identification by seeking the most tightly fitting bounding simplex enclosing the observed point set, assuming that this minimum-volume bounding simplex provides the most plausible explanation of the observed data as convex combinations of mixture components. Our method is inspired by that of Ehrlich and Full.

We adopt a similar high-level approach of sampling candidate simplices and iteratively expanding boundaries to generate possible component sets. There are, however, some important complications raised by gene expression data, especially with regard to its relatively high dimension, that lead to substantial changes in the details of how our method works. While the raw data has a high literal dimension, the hypothesis behind our method is that the data has a low intrinsic dimension, essentially equivalent to the number of distinct cell states well represented in the tumor samples. To allow us to adapt the geometric approach to unmixing to these assumed data characteristics, our overall method proceeds in three phases: an initial dimensionality reduction step, the identification of components through simplex-fitting as in Ehrlich and Full, and assignment of likely mixture fractions in individual samples using the inferred simplex.

For ease of computation, we begin our calculations by transforming the data into dimension k - 1, where k is the desired number of mixture components, using principal components analysis (PCA). We first mean-center the data, subtracting from each element M_{it} the corresponding element of a mean matrix whose every element in row i contains the mean expression level of gene i across all samples, and then project the centered data onto its k - 1 leading principal components.
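The reduction step can be sketched as follows. The paper uses Matlab's PCA machinery; this self-contained Python sketch substitutes power iteration for a full eigendecomposition and reduces a hypothetical toy data set to one dimension:

```python
import math

def mean_center(X):
    """Subtract per-gene means; returns centered data and the mean vector."""
    n, d = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - mu[j] for j in range(d)] for row in X], mu

def leading_pc(Xc, iters=200):
    """Leading principal component of mean-centered data via power iteration."""
    n, d = len(Xc), len(Xc[0])
    # sample covariance matrix (d x d)
    cov = [[sum(row[i] * row[j] for row in Xc) / n for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy data whose variance is concentrated along one direction.
X = [[x, 0.1 * x] for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
Xc, mu = mean_center(X)
v = leading_pc(Xc)
# Projection onto v gives the 1-D reduced representation of each sample.
reduced = [sum(row[j] * v[j] for j in range(2)) for row in Xc]
```

In the method proper, the same projection (and the stored means) are retained so that inferred components can later be mapped back into gene-expression space.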

Note that although PCA is itself a form of unmixing method, it would not by itself be an effective method for identifying cell states. We would not in general expect cell types to yield approximately orthogonal vectors since distinct cell types are likely to share many modules of co-regulated genes, and thus similar expression vectors, particularly along a single evolutionary lineage. Furthermore, the limits of expression along each principal component are not sufficient information to identify the cell type mixture components, each of which would be expected to take on some portion of the expression signature of several components. For the same reasons, we would not be able to solve the present problem by any of the other common dimension-reduction methods similar to PCA, such as independent components analysis (ICA).

Once we have transformed the input matrix into the reduced-dimension space, we proceed to the simplex-fitting step to identify the mixture components.

Component inference begins by choosing a candidate point set that will represent an initial guess as to the vertices of the polytope. We select these candidate points from within the set of observed data points in the reduced-dimension space. The method first samples a pair of points, weighted by the distance between them raised to the k-th power: ||p_1 - p_2||^k. It then successively adds additional points to a growing set of candidate vertices. Sampling of each successive point is again weighted by the volume of the simplex defined by the new candidate point and the previously selected vertices raised to the k-th power. Simplex volume is determined using a built-in Matlab routine.
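A sketch of this volume-weighted candidate sampling, specialized for brevity to two dimensions and three components (Python rather than Matlab; the point set is illustrative):

```python
import random

def triangle_area(verts):
    """Volume of a 2-D simplex (a triangle) from its three vertices."""
    (x0, y0), (x1, y1), (x2, y2) = verts
    return abs((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)) / 2.0

def sample_weighted(points, weight):
    """Sample one point with probability proportional to weight(p)."""
    ws = [weight(p) for p in points]
    r = random.uniform(0.0, sum(ws))
    for p, w in zip(points, ws):
        r -= w
        if r <= 0:
            return p
    return points[-1]

def pick_candidate_vertices(points, k=3):
    """Greedily sample k candidate vertices: the first pair weighted by
    pairwise distance^k, later picks weighted by the volume of the simplex
    they form with the earlier picks, raised to the k-th power."""
    p0 = random.choice(points)
    dist_w = lambda p: ((p[0] - p0[0]) ** 2 + (p[1] - p0[1]) ** 2) ** (k / 2)
    verts = [p0, sample_weighted([p for p in points if p != p0], dist_w)]
    while len(verts) < k:
        verts.append(sample_weighted(
            [p for p in points if p not in verts],
            lambda p: triangle_area(verts + [p]) ** k))
    return verts
```

In higher dimensions, `triangle_area` would be replaced by a general simplex-volume computation such as a determinant formula or a convex-hull routine.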

The next step of the algorithm uses an approach based on that of Ehrlich and Full to grow the candidate simplex until it encloses all observed points. For each candidate vertex v_i, define the face F_i as the face defined by the remaining vertices {v_j : j ≠ i}. The method computes the distance of each observed point from each face. This distance is assigned a sign based on whether the observed point is on the same side of the face as the missing candidate vertex (negative sign) or the opposite side of the face (positive sign). The method then identifies the largest positive distance from among all faces F_i and observed points m_j, which we will call d_ij. d_ij represents the distance of the point farthest from the simplex. We then transform the simplex by translating all vertices in {v_j : j ≠ i} by distance d_ij along the tangent to F_i, creating a larger simplex that encloses m_j. This process of simplex expansion repeats until all observed points are within the simplex defined by the candidate vertices. We retain the minimum-volume simplex found over all random restarts, S_min, as the output of the component inference algorithm.
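The signed point-to-face distance driving the expansion test can be sketched for the two-dimensional case, where each face is an edge of the triangle (Python; the coordinates are hypothetical):

```python
def signed_distance(face, opposite_vertex, p):
    """Signed distance from point p to the line through a 2-D face (an edge).
    Negative when p lies on the same side as the simplex's remaining vertex
    (inside), positive when p lies on the opposite side (outside)."""
    (ax, ay), (bx, by) = face
    nx, ny = by - ay, ax - bx                   # a normal to the edge
    norm = (nx * nx + ny * ny) ** 0.5
    def side(q):
        return ((q[0] - ax) * nx + (q[1] - ay) * ny) / norm
    # Orient so that the remaining vertex's side counts as negative.
    return side(p) if side(opposite_vertex) < 0 else -side(p)

# Edge from (0,0) to (1,0), with the simplex's remaining vertex above it:
outside = signed_distance(((0.0, 0.0), (1.0, 0.0)), (0.5, 1.0), (0.5, -2.0))
inside = signed_distance(((0.0, 0.0), (1.0, 0.0)), (0.5, 1.0), (0.5, 0.5))
```

The expansion step would then translate the face's vertices by the largest such positive distance, repeating until no observed point remains outside.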

Once we have selected S_min, we must explain all observed samples as convex combinations of the vertices of S_min. We can find the best-fit matrix of mixture fractions F by requiring, for each tumor sample t, that ∑_j F_{tj} C_{ij} = M_{it} for each gene i.

We also require that the mixture fractions sum to one for each tumor sample:

∑_j F_{tj} = 1 ∀ t

Since there are generally many more genes than tumor samples, the resulting system of equations will usually be overdetermined, although solvable assuming exact arithmetic. We find a least-squares solution to the system, however, to control for any arithmetic errors that would render the system unsolvable. The F_{tj} values optimally satisfying the constraints then define the mixture fraction matrix F.
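When the number of components exceeds the reduced dimension by exactly one, the expression constraints together with the sum-to-one constraint form a square system, and the mixture fractions can be computed exactly as barycentric coordinates; in the overdetermined case described above, a least-squares solver would replace this closed form. A Python sketch for the two-gene, three-component setting (coordinates hypothetical):

```python
def mixture_fractions(C, m):
    """Mixture fractions of sample m given three 2-D components C: the
    square case, where the expression constraints plus the sum-to-one
    constraint determine the fractions exactly (these are the barycentric
    coordinates of m in the triangle with vertices C)."""
    (x1, y1), (x2, y2), (x3, y3) = C
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    f1 = ((y2 - y3) * (m[0] - x3) + (x3 - x2) * (m[1] - y3)) / det
    f2 = ((y3 - y1) * (m[0] - x3) + (x1 - x3) * (m[1] - y3)) / det
    return (f1, f2, 1.0 - f1 - f2)

# A Fig. 1-style example: a point lying mostly toward the third component.
fracs = mixture_fractions([(1.0, 1.0), (3.0, 1.0), (2.0, 3.0)], (2.0, 2.6))
```

Points outside the fitted simplex would yield negative fractions, which is one symptom of an imprecise simplex fit.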

We must also transform our set of components defining S_min back from the reduced dimension into the space of gene expressions. We can perform that transformation using the matrices computed in the PCA step, inverting the projection onto the principal components and restoring the per-gene means.

The resulting mixture components and mixture fractions constitute the output of the unmixing stage. The full procedure is summarized by the following pseudocode.

Given tumor samples M and a desired number of mixture components k:

1. Define S_min to be an arbitrary simplex of infinite volume

2. Apply PCA to yield the reduced-dimension data matrix M'

3. For each random restart:

a. Sample two points from M', weighted by the distance between them raised to the k-th power: ||p_1 - p_2||^k

b. For each additional vertex v = 3, ..., k:

i. Sample a point from M', weighted by the volume of the simplex it defines with the previously selected vertices raised to the k-th power

c. While there exists some point m_j in M' outside the candidate simplex S:

i. Identify the m_j farthest from the simplex defined by the candidate vertices

ii. Identify the face F_i violated by m_j

iii. Move the vertices of F_i along the tangent to F_i until they enclose m_j

d. If volume(S) < volume(S_min) then S_min ← S

4. For each tumor sample t:

i. Solve for the elements F_{tj} of the mixture fraction matrix F satisfying

∑_j F_{tj} C_{ij} = M'_{it} ∀ i

∑_j F_{tj} = 1 ∀ t

5. Find the component matrix C by transforming the vertices of S_min back into gene expression space

6. Return (C, F)

Phylogeny inference

Once we have inferred cell states and their mixture fractions in each tumor sample, we can use those inferences to construct a phylogeny suggesting how the states are evolutionarily related. The sharing of states within individual tumors provides clues as to which cell types are likely to occur on common progression pathways. Imprecision in the mixture fraction assignments, however, will tend to create a spurious appearance of cell-type sharing due to tumors being assigned some non-zero fraction of each cell type whether or not they truly contain that type. To overcome the confounding effects of this noise in the mixture fractions, we pose phylogeny inference as the problem of finding a tree that maximizes cell-type sharing across tree edges and thus implicitly minimizes the assignment of edges to cell-type pairs that appear to co-occur due to noisy mixture fraction assignments or more distant evolutionary relationships. We define a measure of sharing of any two cell types i and j as

S_{ij} = log [ (∑_t F_{ti} F_{tj}) / ((∑_t F_{ti}) (∑_t F_{tj})) ]

where F_{ti} denotes the mixture fraction of component i in tumor sample t and the sums run over all tumor samples.

One can conceive of this measure as a log likelihood model, in which we are interested in explaining the frequency with which any given pair of states would be sampled by picking two independent cells from a given tumor. The numerator describes the hypothesis that a given pair of states are sampled from correlated densities, with the frequency of the pair derived by summing over the product of the two types' frequencies in individual tumors. The denominator describes the hypothesis that the states are independent of one another and thus sampled independently from some background noise distributions, with the two independent frequencies estimated by summing each cell type's frequency individually over all tumors. Seeking a tree that maximizes the log sum of this measure across all tree edges is then equivalent to seeking a maximum likelihood Bayesian model in which each child is presumed to have frequency directly dependent on its parent and independent of all other tree nodes. Intuitively, this distance function will tend to assign high sharing to cell types that generally have high frequencies in common tumors and low sharing to cell types that generally occur in disjoint tumors. The set of S_{ij} values thus provides a similarity matrix for a phylogeny inference.
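The similarity computation, together with a tree built by maximizing total edge similarity, can be sketched as follows (Python; Prim's maximum spanning tree is used here as one concrete reading of the optimization described above, and the mixture fractions are illustrative):

```python
import math

def similarity(F):
    """S[i][j] = log( sum_t F[t][i]*F[t][j] / (sum_t F[t][i] * sum_t F[t][j]) ).
    Diagonal entries are left at zero; they are never used as tree edges."""
    k = len(F[0])
    tot = [sum(row[j] for row in F) for j in range(k)]
    S = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            if i != j:
                co = sum(row[i] * row[j] for row in F)
                S[i][j] = math.log(co / (tot[i] * tot[j]))
    return S

def max_similarity_tree(S):
    """Prim's algorithm for a spanning tree maximizing total edge similarity."""
    k = len(S)
    in_tree, edges = {0}, []
    while len(in_tree) < k:
        _, i, j = max((S[a][b], a, b)
                      for a in in_tree for b in range(k) if b not in in_tree)
        edges.append((i, j))
        in_tree.add(j)
    return edges

# Three tumors: one nearly pure component 1, one sharing components 1 and 2,
# one sharing components 1 and 3 (small fractions model assignment noise).
F = [[0.9, 0.05, 0.05],
     [0.5, 0.45, 0.05],
     [0.5, 0.05, 0.45]]
edges = max_similarity_tree(similarity(F))
```

Here component 1, which co-occurs with both of the others, becomes the hub of the tree, matching the intuition that it represents the shared ancestral state.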

The model makes several assumptions about the available data. We assume that we have inferred all states present in the data and that our states therefore represent both internal and leaf nodes of the phylogeny. This assumption follows from the evidence that tumor samples maintain remnant populations of their earlier progression states.

Testing

Validation on simulated data

We first validated the method using two protocols for simulated data generation. Simulated data is essential for validation because the ground truth components and their representation in particular tumors are not known for real tumor data sets. In addition, it allows us to explore how performance of the method varies with assumptions about the data set. We began by applying a simple simulation protocol for generating uniformly sampled mixtures, in which each component is simulated as an independent vector of unit normal random variables and each observed tumor passed as input to the data set is simulated as a uniformly random mixture of this common set of components (see Methods). We developed a second simulation protocol meant to better mimic the substructure expected from true tumor samples due to the evolutionary relationships among sub-types. In this protocol, we assume that mixture components correspond to nodes in a binary tree and that each observed tumor represents a mixture of components along a random path in that tree (see Methods). In both protocols, we add log normal noise to all simulated expression measurements.

Fig.

Examples of mixture components inferred from simulated data sets

**Examples of mixture components inferred from simulated data sets**. Green circles show the true mixture components, red points the simulated data points that serve as the input to the algorithms, and blue X's the inferred mixture components. (a) A uniform mixture of three independent components with no noise. Each data point is a mixture of all three components. Inferred mixture fractions for the three components, averaged over all points, are (0.295 0.367 0.339). (b) A tree-embedded mixture of three components with noise equal to signal. Each data point is a mixture of a root component (top, labeled 1) and one of two leaf components (bottom, labeled 2 and 3). The inset shows the phylogenetic tree in which the labeled components are embedded. Inferred mixture fractions averaged over points in the two branches of the simplex are (0.410 0.567 0.025) and (0.410 0.020 0.535). (c) A tree-embedded mixture of five components with 10% noise. Each data point contains a portion of the root component (bottom, labeled 1), a subset contains portions of one of two internal components (far left, labeled 2, and far right, labeled 4), and subsets of these contain portions of one of two leaf components (center left, labeled 3, and center right, labeled 5). The inset shows the phylogenetic tree in which the labeled components are embedded. Inferred mixture fractions averaged over points in the two branches of the simplex are (0.356 0.462 0.141 0.006 0.005) and (0.387 0.072 0.008 0.187 0.378).

Fig.

Accuracy of methods in inferring simulated mixture components and assigning mixture fractions to data points

**Accuracy of methods in inferring simulated mixture components and assigning mixture fractions to data points**. (a) Root mean square error in inferred mixture components as a function of noise level for uniform mixtures of

Figs.

Fig.

Accuracy of tree inference on simulated tree-embedded data

**Accuracy of tree inference on simulated tree-embedded data**. The plot shows the fraction of true tree edges accurately inferred for

Application to real data

In order to demonstrate the applicability of the methods to real tumor data, we next examined a dataset of lung tumor expression measurements from Jones et al.

Fig.

**Marker genes for the lung cancer four-component inference**. This supplementary table provides relative expression levels inferred for each annotated microarray probe for the Jones et al. data set.


Visualization of four-component unmixing results from the lung cancer data of Jones et al.

**Visualization of four-component unmixing results from the lung cancer data of Jones et al.**

Fig.

Table

Mixture fractions averaged by tumor type for a four component inference.

| | **Comp. 1** | **Comp. 2** | **Comp. 3** | **Comp. 4** |
|---|---|---|---|---|
| Normal | 0.5387 | 0.1165 | 0.0451 | 0.2998 |
| Adenocarcinoma | 0.3943 | 0.1196 | 0.1708 | 0.3153 |
| Small cell (cell lines) | 0.4074 | 0.1701 | 0.3493 | 0.0732 |
| Small cell (primary) | 0.5015 | 0.0952 | 0.2665 | 0.1368 |
| Carcinoid | 0.3431 | 0.4198 | 0.1304 | 0.1066 |
| Large cell neuroendocrine | 0.4674 | 0.1227 | 0.2093 | 0.2007 |
| Large cell carcinoma | 0.4010 | 0.0836 | 0.1861 | 0.3293 |
| Combined SCLC/AD | 0.0650 | 0.1586 | 0.4120 | 0.3643 |

Columns correspond to mixture components from a four-component inference. Rows correspond to distinct tumor types, as annotated by Jones et al.

We next examined performance of the method with six components, which we label components 1 through 6.

**Marker genes for the lung cancer six-component inference**. This supplementary table provides relative expression levels inferred for each annotated microarray probe for the Jones et al. data set.


Mixture fractions averaged by tumor type for a six component inference.

| | **Comp. 1** | **Comp. 2** | **Comp. 3** | **Comp. 4** | **Comp. 5** | **Comp. 6** |
|---|---|---|---|---|---|---|
| Normal | 0.4017 | 0.1134 | 0.0993 | 0.0563 | 0.2338 | 0.0955 |
| Adenocarcinoma | 0.2625 | 0.1748 | 0.0960 | 0.0627 | 0.2556 | 0.1483 |
| Small cell (cell lines) | 0.1585 | 0.0705 | 0.2818 | 0.0645 | 0.3151 | 0.1095 |
| Small cell (primary) | 0.2539 | 0.0677 | 0.1913 | 0.0301 | 0.3252 | 0.1317 |
| Carcinoid | 0.1670 | 0.0674 | 0.1681 | 0.2868 | 0.2358 | 0.0747 |
| Large cell neuroendocrine | 0.2591 | 0.1206 | 0.1955 | 0.0434 | 0.2977 | 0.0838 |
| Large cell carcinoma | 0.2461 | 0.2110 | 0.1264 | 0.0257 | 0.2852 | 0.1056 |
| Combined SCLC/AD | 0.1270 | 0.2000 | 0.1425 | 0.0695 | 0.0938 | 0.3672 |

Columns correspond to mixture components from a six-component inference. Rows correspond to distinct tumor types, as annotated by Jones

Comparing gene expression vectors from the Additional files allows us to relate components of the four- and six-component inferences; the resulting correlations are summarized in the following table.

Correlations between four- and six-component inferences by inferred gene expression vectors.

| | **Comp. 1** | **Comp. 2** | **Comp. 3** | **Comp. 4** | **Comp. 5** | **Comp. 6** |
|---|---|---|---|---|---|---|
| Comp. 1 | 0.3348 | -0.5376 | -0.1510 | -0.2595 | 0.3040 | -0.3127 |
| Comp. 2 | -0.1271 | -0.1618 | 0.2200 | 0.9450 | -0.1983 | -0.1663 |
| Comp. 3 | -0.6752 | -0.2433 | 0.4800 | -0.3663 | 0.2977 | 0.5466 |
| Comp. 4 | 0.3515 | 0.8621 | -0.4429 | -0.2310 | -0.3799 | -0.0187 |

Entries show the Pearson correlation coefficients between inferred relative gene expression levels for components inferred from a 4-component versus a 6-component inference. Rows correspond to mixture components from a four component inference and columns from a six component inference.

Fig.

Phylogenies inferred on components derived from Jones et al.

**Phylogenies inferred on components derived from Jones et al.**

The application to real lung cancer data provides a further opportunity to evaluate the approach, in addition to suggesting some novel hypotheses about the molecular evolution of lung cancers. Both 4- and 6-component inferences suggest that the tumor types examined here evolve into two major groups early in their progression: one consisting of large cell neuroendocrine, small cell, and carcinoid tumors and the other consisting of adenocarcinoma and large cell carcinoma. This subdivision is supported by many lines of evidence on specific genetic abnormalities frequently found in the different sub-types, which suggest that small cell and large cell neuroendocrine carcinomas arise from common progenitors.

The 6-component tree appears to be an elaboration on the 4-component tree, supporting a common model of the two major pathways but adding some additional features beyond that. One intriguing feature is the insertion of the

Implementation

All code for this project was written in Matlab and executed with Matlab v.7 on a Linux PC. Matlab was also used for visualization of component inferences. The core unmixing and phylogeny inference routines are available as Matlab ".m" files. Other custom code used in data analysis and generation of simulated data sets will be provided upon request.

Discussion

We have developed a novel approach to tumor phylogenetics combining unmixing methods with a cell-by-cell strategy for phylogeny inference. The method is an attempt to gain the advantages of both intratumor heterogeneity information available to cell-by-cell methods and the large probe sets available to tumor-by-tumor methods. The application to simulated data sets suggests that the method is effective at making component and mixture fraction inferences from large, noisy datasets for limited numbers of components. The method does, however, degrade in performance quickly with increasing numbers of mixture components. Phylogeny inference, which depends on the quality of the mixture fraction inference, similarly shows high tolerance for noise, although with a loss of quality with increasing numbers of components. Nonetheless, phylogeny inferences show good reconstruction accuracy for as many as seven components on simulated tree-embedded samples. The methods are generally more effective at component inference from these tree-embedded samples, suggesting that they can effectively exploit some features of the geometric substructure we would expect an evolutionary process to produce. Application to a real lung cancer data set shows the method to be effective at inferring components consistent with known lung cancer sub-types and at grouping these components into phylogenies generally consistent with the prior literature on the evolution of lung tumors. The method does, however, make some apparent mistakes and provide low confidence to some seemingly correct predictions, suggesting room for improvement.

The approach, at least as presently realized, does make a number of assumptions about the data and tumor progression in general that one might reasonably question. The primary purpose of this paper is to lay out the general concept of how unmixing methods can inform tumor phylogenetics and this concept in itself implies certain assumptions. Some assumptions concern the biology of tumor development: that there are reproducible cancer sub-types that co-occur in the population, that tumors accumulate remnant cell populations as they progress, that these remnant progression states are themselves reproducible across patients, and that these remnant states persist at high enough levels to measurably influence overall expression. There is considerable literature supporting all of these points, as discussed in the Introduction, although they might reasonably be debated. A further assumption is that the states we wish to observe differ sufficiently in expression profile as to be separable by unmixing methods. While there is strong evidence in the literature that distinct sub-types can be separated by their expression profiles, there is no direct experimental basis from which to argue that individual cell states along a single progression pathway are similarly separable. It remains to be seen how precisely one can sub-divide a progression pathway and which mutations or combinations of mutations will or will not lead to discernible changes in expression profile.

The specific implementation of the model in the present work adds additional assumptions that underlie the results here but might conceivably be relaxed in future work. The present model assumes that individual components of a mixture contribute linearly to the mixture, i.e., that the expression level of each gene in a tumor sample is derived from the weighted sum of the expression level of the gene in each component of the sample. This assumption might break down due to limitations of the microarray technology or for biological reasons, e.g., if intercellular communication leads to radically different expression in mixtures of cell types than in the cell types independently. We would expect the dimensionality reduction step to partially correct for such problems by extracting a linear subset of the full expression space. Non-linear dimensionality reduction methods might provide a means of relaxing the linearity assumption in future work.

There are also several potentially debatable assumptions underlying our current phylogeny methods. The most obvious assumption is that all of the cell types we detect are in fact evolutionarily related. The ultimate product of our method will be a phylogeny connecting all of the inferred cell types, which will not in general be a meaningful outcome if the cell types are not related. At some level, of course, all cells in a given individual are related. Distinct tumors may, however, arise from different populations of healthy cells. Furthermore, stromal contamination will in general lead to tumor samples containing mixtures of healthy cells that may not be ancestral to any of the tumor types. We chose to accept this assumption in the model, even knowing it to be imperfect, in the belief that it is better to have the predictions of the method about ancestry available and account for the assumptions after the fact in interpreting the meaning of the inferred ancestry relationships. One might alternatively seek to explicitly separate healthy from cancerous cells prior to phylogenetic inference, using the mixture model to remove stromal contaminants as in Etzioni et al.

A related question raised by these assumptions is whether we can tell, either in advance or post-hoc, whether the assumptions of the model are in fact satisfied by a given data set. One can in principle assess how well the raw input data is explained by a linear mixture of a small number of components, for example by examining the rate at which singular values of the matrix decay. Such a validation does not tell us whether information not accounted for by the linear model reflects genuinely non-linear contributions, the need for a more complex linear model, or noise in the true expression data or the experimental measurements. Neither does it tell us whether the linear model extracted carries sufficient information to characterize the cell states and their relationships. One can also apply a generic post-hoc validation based on how reproducible the results are to sub-sampling, as in our bootstrapping for the lung phylogenies. If the assumptions of the model are violated, then we would expect the data to only weakly support a defined tree topology. The lung cancer example suggests mixed success, with some features of the inferred trees strongly supported across subsamples but others nearly arbitrary. We can also, after the fact, examine the degree to which the inferred mixture fractions conform to the phylogeny. Healthy cells, when available, should be assigned minimal fractions of non-healthy components and tumors on a given progression pathway should exhibit minimal contamination with components characteristic of other pathways. While the lung data does show a clear partitioning of mixture components by sub-type, it nonetheless shows frequent contamination suggestive of poor fitting of the simplex. All of these measures suggest room for improvement in the model and algorithms and will provide a basis for determining whether any alternative methods do in fact improve on those proposed here.
Perhaps the ultimate test of the method is how well it recapitulates what we already know about a given real data set. Our comparison of the lung cancer results to prior knowledge again suggests that the method works sufficiently well to recapitulate our prior understanding about the classification and origin of the major lung tumor classes, but with less precision than we might like and with some apparent mistakes. Further post-hoc validation might also be conducted directly on the derived components by testing whether they meet the expression signatures of known tumor sub-types or healthy cell populations.

Conclusions

We have presented a novel method for identifying likely pathways of tumor progression using computational unmixing methods to interpret expression measurements from tumor samples as mixtures of fundamental components. Validation on simulated data demonstrates good effectiveness at inferring mixture components, assigning mixture fractions to samples, and inferring phylogenies, provided noise levels and numbers of components are sufficiently small. The prototype methods presented here do appear to suffer from insufficiently precise fits of polytopes to the data, especially as the number of components increases, which can in turn result in spurious identification of components in samples that lack them and inaccurate phylogeny inferences. Nonetheless, the effectiveness of the models across a variety of scenarios and in the presence of relatively high levels of noise suggests that the approach has good promise for improving our ability to identify phylogenetic relationships among tumor cells. Likewise, application of the method to real lung cancer microarray data shows the method to be effective at identifying components corresponding to known clinical sub-types and at inferring progression pathways largely consistent with current knowledge about the molecular evolution of lung tumors. These inferences also suggest some novel hypotheses about the genesis of lung cancers. The methods developed here thus represent a promising new computational model for phylogenetic studies of tumors that can provide many of the competing advantages of the two major paradigms for tumor phylogeny inference: tissue-wide and cell-by-cell. This new model is likely to benefit in the future from further methodological insights from both the unmixing and the phylogenetics fields.

Methods

Validation on simulated data

Our first simulated data protocol was designed to model uniformly sampled mixtures of components. In this protocol, we specify a dimension n (the number of genes), a number of mixture components k, and a number of samples m. We generate a component matrix C^{(true)} in which each column is a mixture component and each element of that column is the expression of one hypothetical gene, sampled from a unit normal distribution. We then sample a vector of mixture fractions f_{i} = (f_{i1}, ..., f_{ik}) uniformly at random from the unit simplex for each data point, collecting these vectors as the columns of the true fraction matrix F^{(true)}. Entries of the input data matrix M are then generated as

M_{ij} = (Σ_{l} C^{(true)}_{il} F^{(true)}_{lj}) · 2^{σε_{ij}}

where ε_{ij} is a unit normal random variable and σ sets the noise level, implementing a log normal noise model. The resulting matrix M is then provided as input to the unmixing method to produce inferred matrices C^{(inferred)} and F^{(inferred)}. To assess the quality of the assignment, we first match inferred mixture components to true mixture components by performing a maximum weighted bipartite matching of columns between C^{(true)} and C^{(inferred)}, weighted by negative Euclidean distance. We then assess the quality of the mixture component identification by the root mean square distance over all entries of all components between the matched columns of the two matrices:

RMSD_{C} = [(1/(nk)) Σ_{i=1}^{n} Σ_{j=1}^{k} (C^{(true)}_{ij} − C^{(inferred)}_{iπ(j)})^{2}]^{1/2}

where π(j) is the inferred component matched to true component j.

We similarly assess the quality of the mixture fractions by the root mean square distance between F^{(true)} and F^{(inferred)} over all samples and components:

RMSD_{F} = [(1/(km)) Σ_{i=1}^{k} Σ_{j=1}^{m} (F^{(true)}_{ij} − F^{(inferred)}_{π(i)j})^{2}]^{1/2}

where π(i) is the inferred component matched to true component i.

This process was performed for each tested combination of simulation parameters.
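The generation, matching, and scoring steps of this protocol can be sketched as follows (assuming NumPy and SciPy; the unmixing step itself is replaced here by a perturbed copy of the truth, since the point is the matching and RMSD machinery, and all sizes and noise levels are invented for illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(1)
n, k, m = 500, 4, 40  # genes, components, samples (illustrative sizes)

# Generate true components and uniformly sampled mixture fractions.
C_true = rng.standard_normal((n, k))
F_true = rng.dirichlet(np.ones(k), size=m).T              # k x m, columns on the simplex
M = (C_true @ F_true) * 2.0 ** (0.1 * rng.standard_normal((n, m)))  # log-normal noise

# Stand-in for the unmixing step: a slightly perturbed copy of the truth,
# so that the scoring machinery below has something to match against.
C_inf = C_true + 0.05 * rng.standard_normal((n, k))
F_inf = F_true + 0.01 * rng.standard_normal((k, m))

# Match inferred to true components by maximum-weight bipartite matching
# on negative Euclidean distance (equivalently, minimum-cost matching).
cost = np.linalg.norm(C_true[:, :, None] - C_inf[:, None, :], axis=0)
row, col = linear_sum_assignment(cost)

# Root-mean-square error over all entries of the matched components,
# and over all entries of the matched fraction matrices.
rmsd_C = np.sqrt(np.mean((C_true[:, row] - C_inf[:, col]) ** 2))
rmsd_F = np.sqrt(np.mean((F_true[row, :] - F_inf[col, :]) ** 2))
print(rmsd_C, rmsd_F)
```

The Hungarian-algorithm matching (`linear_sum_assignment`) is one standard way to realize the maximum weighted bipartite matching described above.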

Our second protocol was meant to model the assumption that each observed sample encodes a subset of an evolutionary tree. The protocol is parameterized by the dimension n, the number of components k, and the number of samples m; generation of the component matrix C^{(true)} likewise proceeds exactly as with uniform samples. We assume, however, that each mixture component corresponds to one node in a binary evolutionary tree and that each observed sample corresponds to a path from the root to some arbitrary node in that tree. To generate the mixture fractions for a given simulated tumor sample, we select a tree node uniformly at random and then uniformly sample a set of mixture fractions over the chosen node and all of its ancestors in the tree, setting the mixture fraction to zero for all other components. After the generation of C^{(true)} and F^{(true)}, the application of the inference methods and the evaluation of C^{(inferred)} and F^{(inferred)} proceeds identically to that described for uniform mixture fractions in the preceding paragraph.
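The tree-constrained fraction sampling can be sketched as follows (assuming NumPy; the array-based binary-tree indexing is one convenient representation chosen for illustration, not necessarily the one used in our implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
k = 7  # components = nodes of a complete binary tree, indexed 0..6

def ancestors_inclusive(node):
    """Return the node and all its ancestors under array layout:
    the parent of node i is (i - 1) // 2, and node 0 is the root."""
    path = [node]
    while node > 0:
        node = (node - 1) // 2
        path.append(node)
    return path

def sample_tree_fractions(k, rng):
    # Pick a tree node uniformly at random, then spread mixture
    # fractions uniformly (via a flat Dirichlet) over the root-to-node
    # path; all other components receive fraction zero.
    node = rng.integers(k)
    path = ancestors_inclusive(node)
    f = np.zeros(k)
    f[path] = rng.dirichlet(np.ones(len(path)))
    return f

f = sample_tree_fractions(k, rng)
print(np.round(f, 3), f.sum())
```

By construction every sampled vector sums to one and its non-zero entries form a root-to-node path, matching the protocol described above.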

We additionally tested the accuracy of phylogeny inference on the tree-embedded samples. Following component inference and mixture fraction assignment, we applied the phylogeny inference algorithm described in "Phylogeny inference" above to infer trees on the components. For each test, we matched inferred to true components using maximum matching as above and used these assignments to determine the correspondence between edges present in the true trees and those present in the inferred trees. We scored the accuracy of each tree assignment as the fraction of true tree edges found in the inferred tree. For each condition, we recorded the average accuracy across repetitions.
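The edge-recovery score described above amounts to a set intersection over undirected edges; a minimal sketch with an invented five-component toy example:

```python
# Score a phylogeny inference as the fraction of true tree edges that
# also appear in the inferred tree, after components have been matched.
def edge_accuracy(true_edges, inferred_edges):
    # Edges are undirected pairs of component labels.
    canon = lambda edges: {frozenset(e) for e in edges}
    t, i = canon(true_edges), canon(inferred_edges)
    return len(t & i) / len(t)

# Toy example: the inferred tree recovers 3 of the 4 true edges.
true_edges = [(0, 1), (0, 2), (1, 3), (1, 4)]
inferred_edges = [(0, 1), (0, 2), (1, 3), (2, 4)]
print(edge_accuracy(true_edges, inferred_edges))  # 0.75
```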

Application to real data

We retrieved the Jones et al. lung cancer dataset and converted each reported log expression value x to a linear expression level 2^{x}. Missing values were arbitrarily assigned linear expression level 1. In order to minimize effects of outlier data points that are likely to be due to assay failures, we further restricted the lower and upper ranges of the data values to 2^{-5} and 2^{5}, respectively, setting values outside that range to the closer limit. This step was necessary because of the linearity assumption of the model, under which even a few extremely large values would otherwise dominate the calculations.
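In NumPy terms this preprocessing is a conversion, an imputation, and a clip (the small matrix here is invented for illustration):

```python
import numpy as np

# Hypothetical log2-ratio microarray matrix with missing values (NaN).
log_ratios = np.array([[1.2, -0.4, np.nan],
                       [7.0, -9.0,  0.0]])

linear = 2.0 ** log_ratios                       # log ratios -> linear scale
linear = np.nan_to_num(linear, nan=1.0)          # missing values -> level 1
linear = np.clip(linear, 2.0 ** -5, 2.0 ** 5)    # cap outliers at 2^(+/-5)
print(linear)
```

The extreme entries 2^{7} and 2^{-9} are pulled back to the limits 2^{5} and 2^{-5}, preventing a handful of assay failures from dominating the linear fit.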

After processing the dataset, we then applied the unmixing methods as described in "Cell type identification by unmixing." We performed the analysis for two different numbers of desired components, four and six. We chose four primarily to allow visual analysis of the solutions, as the resulting three-dimensional simplex is the largest we can directly visualize. We performed a second analysis with six components in order to explore how the methods perform with a richer component set. While we did perform additional inferences for higher component numbers, we have no empirical basis for evaluating them once they begin to sub-divide the known tumor classes. In addition, the simulated results provide little ground for confidence in predictions involving larger numbers of components. In the interests of space, we therefore do not report results beyond six components.

Finally, we performed phylogeny inference for both sets of inferences using the algorithm of "Phylogeny inference." For these data, we performed bootstrap replicates of the phylogeny inference stage of the analysis to assess confidence in particular edges. We repeated the phylogeny inference 10,000 times, with each data point independently retained with probability 0.9 in each replicate. We recorded the fraction of replicates in which each edge appeared to establish confidence values for the predictions.
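The bootstrap loop can be sketched as follows (assuming NumPy and SciPy; a minimum spanning tree over inter-component distances stands in for the paper's actual phylogeny-inference step, and the component matrix and replicate count are invented for illustration):

```python
import numpy as np
from collections import Counter
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(3)

# Hypothetical component matrix: rows are genes, columns are components.
C = rng.standard_normal((300, 4))

def infer_tree(C):
    # Stand-in for the phylogeny step: a minimum spanning tree over
    # Euclidean distances between components, returned as edge sets.
    D = squareform(pdist(C.T))
    mst = minimum_spanning_tree(D).tocoo()
    return {frozenset((int(i), int(j))) for i, j in zip(mst.row, mst.col)}

# Bootstrap: keep each gene with probability 0.9 and re-infer the tree;
# edge confidence is the fraction of replicates containing that edge.
counts = Counter()
reps = 200  # illustrative; the analysis above uses 10,000
for _ in range(reps):
    mask = rng.random(C.shape[0]) < 0.9
    counts.update(infer_tree(C[mask]))

for edge, c in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(sorted(edge), c / reps)
```

Edges that survive nearly all replicates earn high confidence, while edges whose support fraction is low correspond to the "nearly arbitrary" tree features discussed earlier.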

The human subjects work described in this section was considered by the Carnegie Mellon University Institutional Review Board and ruled exempt from human subjects requirements in accordance with 45 CFR 46.101(b)(4) due to its exclusive use of publicly available, anonymized patient data.

Authors' contributions

RS and SS both participated in conceiving the approach used in this work, conceiving and designing validation experiments, and interpreting the results. RS carried out the algorithm development, coding, and execution of the experiments in this work. Both authors contributed to writing the manuscript.

Acknowledgements

We are grateful to Yongjin Park for providing earlier analyses and preprocessing of the lung cancer data used in this work. We also thank Gary Miller and David Tolliver for helpful suggestions and insights into the problem and avenues for future work. We thank anonymous reviewers of this manuscript for many helpful suggestions on the methods and their description. RS was supported in part by the Eberly Career Development Professorship at Carnegie Mellon University. RS and SS were supported by U.S. NIH Award #1R01CA140214. The organizations providing funding for this work had no role in the study design; in the collection, analysis, or interpretation of data; in the writing of the manuscript; or in the decision to submit the manuscript for publication.