Department of Chemical and Petroleum Engineering, University of Pittsburgh, 1249 Benedum Hall, 3700 O’Hara Street, Pittsburgh, PA 15261, USA

Department of Bioengineering, University of Pittsburgh, 360B CNBIO, 300 Technology Drive, Pittsburgh, PA 15219, USA

School of Mathematics and Statistics, Central China Normal University, Wuhan, China

Department of Pathology, University of Pittsburgh, Pittsburgh, PA, USA

McGowan Institute for Regenerative Medicine, University of Pittsburgh, Pittsburgh, PA, USA

Thomas E. Starzl Transplantation Institute, University of Pittsburgh, Pittsburgh, PA, USA

Center for Innovative Regenerative Therapies, Department of Surgery, Transplantation Section of Children's Hospital of Pittsburgh, Pittsburgh, PA, USA

Abstract

Background

Lineage-specific differentiation of human embryonic stem cells (hESCs) is largely mediated by specific growth factors and extracellular matrix molecules. Growth factors initiate a cascade of signals which control gene transcription and cell fate specification. There is considerable interest in inducing hESCs to an endoderm fate, which serves as a pathway towards more functional cell types such as pancreatic cells. Research over the past decade has established several robust pathways for deriving endoderm from hESCs, with the capability of further maturation. However, in our experience, the functional maturity of these endoderm derivatives, specifically to pancreatic lineage, largely depends on the specific pathway of endoderm induction. Hence it is of interest to understand the underlying mechanism mediating such induction and how it translates to further maturation. In this work we analyze the regulatory interactions mediating different pathways of endoderm induction by identifying co-regulated transcription factors.

Results

hESCs were induced towards endoderm using activin A and 4 different growth factors (FGF2 (F), BMP4 (B), PI3KI (P), and WNT3A (W)) and combinations thereof, resulting in **15** total experimental conditions. At the end of differentiation each condition was analyzed by qRT-PCR for **12** relevant endoderm-related transcription factors (TFs). As a first approach, we used hierarchical clustering to identify which growth factor combinations favor up-regulation of different genes. In the next step we identified sets of co-regulated transcription factors using a biclustering algorithm. The high variability of experimental data was addressed by integrating the biclustering formulation with bootstrap re-sampling to identify robust networks of co-regulated transcription factors. Our results show that the transition from early to late endoderm is favored by FGF2 as well as WNT3A treatments under high activin. However, induction of late endoderm markers is relatively favored by WNT3A under high activin.

Conclusions

Use of FGF2, WNT3A or PI3K inhibition with high activin A may serve well in definitive endoderm induction followed by WNT3A specific signaling to direct the definitive endoderm into late endodermal lineages. Other combinations, though still feasible for endoderm induction, appear less promising for pancreatic endoderm specification in our experiments.

Background

Embryonic stem cells have been shown to have tremendous impact in the field of regenerative medicine because of their potential to differentiate into multiple cell types of interest. Efficient harvesting of this potential requires careful development of protocols to evolve the cells through specific signaling pathways which will induce desired lineages and properties in the differentiated phenotypes. Our primary interest lies in differentiation of human embryonic stem cells (hESCs) to insulin-producing β-cells of the pancreas as a cellular transplantation strategy for diabetes mellitus. The first and perhaps the most important step in differentiation to endodermal organs like pancreas and liver is the commitment to definitive endoderm (DE).

Differentiation of hESCs to DE

Activin A (henceforth denoted as activin) has been shown to be effective in inducing DE from hESCs and is a key induction factor used in many protocols

Hierarchical clustering

Hierarchical clustering (HC) is a useful technique for analyzing and interpreting multivariate data. Each data point here is represented as a vector, and the distances between these data points are measured using a suitable distance measure

Biclustering to identify co-regulated genes across different conditions

While HC homogenizes the entire dataset, techniques like biclustering are useful in preserving the second dimension in clustering, which in our case comprises all the endoderm induction conditions. We are interested in identifying specific sets of genes exhibiting similar expression patterns across various subsets of experimental conditions, which can be achieved by biclustering. Likewise, many TFs are known to have multiple functions, and hence participate in multiple regulatory networks, which can also be captured by overlapping biclusters

Handling data variability

The gene expression data obtained for cell culture systems are subject to noise because of the heterogeneity and stochasticity associated with the system. Differences among the biological replicates may therefore arise from the inherent heterogeneity of the ES cell population as well as from experimental noise

Essentially, bootstrapping generates a pseudo-dataset from the small number of experimental replicates by sampling with replacement. The advantage of bootstrap lies in estimating statistically significant parameters from a limited number of experimental replicates

Results

The focus of this work is to understand the mechanism of endoderm induction using different growth factors, acting alone and in combination, from an integrated experimental and computational approach (summarized in Figure


**Work-flow for the entire analysis from data collection to identification of robust biclusters**. In short, we start with the qRT-PCR data and perform bootstrap with re-sampling to obtain 1000 pseudo-datasets. Each of these datasets is subjected to biclustering analysis to obtain the most coherent pattern in each dataset. The resulting biclusters are then analyzed for the most repeated subsets of biclusters.

Experimental analysis of endoderm differentiation using combinations of major pathways


**Fold change data for the 12 transcriptional markers across 15 experimental conditions.** (**a**) The fold change calculated from the mean expression data from qRT-PCR on day 4 of the differentiation process is plotted from the expression matrix. (**b**) Variation observed in the 12 transcriptional markers with changes in the signaling pathways presented as mean ± SE. All the major DE markers

Hierarchical clustering of the mean expression data identifies differences in the endoderm induced by BMP4 in the presence and absence of exogenous FGF2

The mean experimental data matrix was first analyzed using hierarchical clustering which clusters the TFs and conditions separately, as shown in Figure

**Principal Component Analysis.docx**



**Hierarchical clustering on the mean expression data.** The conditions cluster into two major groups, one containing BMP4 in the absence of exogenous FGF2 and the other containing all the other treatments and BMP4 in combination with exogenous FGF2. Activin A is common among all the treatments. The TFs cluster into two groups, the late and early endoderm markers.

The clusters identified by the hierarchical algorithm reflect our biological understanding of the induction conditions from previous studies. A major difference between the two clusters of conditions was the context-dependent function of BMP4. In the presence of FGF2 and high activin, BMP4 was found to favor the endodermal lineage, as seen in several recent studies

Identification of co-regulated transcription factors by biclustering

While hierarchical clustering enables a fast and simplistic analysis of the experimental data sets, it does not provide information on which subsets of TFs are co-regulated across subsets of conditions. Identifying such co-clusters will be beneficial, since the governing signaling pathways change with the induction condition and the same TFs may not be co-regulated. The technique of biclustering serves to mine subgroups of such TFs exhibiting similar trends in their expression level under subsets of conditions. Hence TFs appearing in the same bicluster can be inferred to be co-regulated and constituents of a similar network architecture. The experimental data matrix,

**Selection of biclustering parameters.docx**


The developed optimization based bicluster identification algorithm was applied to the mean expression data with the above mentioned parameters, which resulted in a 3-gene 5-condition bicluster as illustrated in Figure


**Biclusters obtained from the normalized mean expression data.** (**a**) **Optimal bicluster** containing 3 genes across 5 conditions. (**b**) **Subsequent bicluster** containing 3 genes and 7 conditions. The bicluster parameters selected were

Recently, a new method was proposed by Banka

Robust biclusters identify WNT3A treatment to favor both early and late endoderm

The biclusters identified above were for the mean dataset, and hence do not explicitly take into account the experimental variations. In general, biological datasets are known for their noise and uncertainty, and stem cells in particular have inherent heterogeneity and stochasticity. In order to increase confidence in the identified bicluster, we undertook bootstrap analysis on the experimental data to generate 1000 pseudo-datasets. Each of these datasets was treated as an experimental repeat and subjected to the entire biclustering analysis. In order to identify somewhat overlapping biclusters, we ran the biclustering algorithm five times at each data point, each time penalizing the previously identified biclusters.

The next task was to determine a robust bicluster from this array of alternate biclusters. We hypothesized that a robust bicluster would not be significantly affected by the experimental noise, and hence would appear a large number of times in the bootstrapped-bicluster data set. However, a thorough search of the entire array of alternate biclusters for frequency of repeats did not find a single bicluster that was significantly repeated in its entirety across the data set. Instead, subsets of the genes and conditions of the bicluster were repeated with very high frequency. Hence, we focused on identifying such subsets from the family of bootstrap + bicluster solutions. Setting a minimum threshold of 50% repeats across the bootstrap samples, we identified 6 such subsets. The first five of these contained different combinations of the same two markers and four conditions, so we collected them into a single group. The profiles of the repeated subsets are presented in Figure


**Robust subsets identified from the 1000 bootstrap datasets.** Robust biclusters are the most repeated subsets (>500). The bicluster parameters selected were


**Robust subsets of co-regulated genes presented as a bipartite graph.** We have identified high activin along with PI3K inhibition, or activin in combination with WNT3A, to work best to co-regulate early endoderm marker

Discussion

The differentiation of hESCs into the endoderm lineages is carried out by the activation of different signaling pathways mimicking

The DE signature differs under exogenous activation of different signaling pathways participating in endoderm commitment

Our experiments with different DE inducing conditions show that the DE potential of the differentiating hESCs is highly dependent on the method of DE induction. The major DE markers (

All the pathways studied here have been known to be important at the earlier stages of

Among the DE markers,

The response to the BMP4 pathway, however, was highly dependent on the context, namely the presence or absence of FGF2, which was a striking feature of the hierarchical clustering on the 15 conditions. BMP4 is typically known as an activin antagonist, and high concentrations of BMP4 in culture with high activin result in mesoderm fate

WNT3A/β-catenin signaling has been shown to be important both for maintenance of pluripotency as well as induction of differentiation

Robust biclusters identify the necessary pathways for efficient endoderm differentiation to the pancreatic lineage

The robust biclusters identified by the biclustering + bootstrap analysis show the most important trends preserved under experimental variations. Supportively,

Alternatively, the markers


**Figure summarizing the functional dependence of the co-regulated genes on the active signaling pathways of endoderm induction.**

Conclusion

The focus of the current work was to achieve insights into the

Methods

Experimental methods

Cell culture and treatment

hESC maintenance

H1 hESCs were placed on hESC certified matrigel coated wells and maintained with mTeSR1 with media change every day. Cells were passaged every 5 to 7 days by incubating in 1 mg/ml dispase for 5 minutes followed by mechanically breaking the colonies and splitting at a 1:3–1:5 dilution. Cells were examined under the microscope every day and colonies with observable differentiation were picked and removed before the media changes.

hESC differentiation to DE

H1 hESCs were allowed to grow to 60-70% confluency before the experiments were started. Once confluency was reached, differentiation was performed by adding DE induction media for 4 days with media change every day. Several induction conditions were chosen according to previously published studies

Measurement of Transcription Factor (TF) expression

After 4 days of DE induction, cells were lysed and RNA extracted using Nucleospin RNA II kit (Macherey Nagel) according to the manufacturer’s instructions. The sample absorbance at 280 nm and 260 nm was measured using a BioRad Smart Spec spectrophotometer to obtain RNA concentration and quality. Reverse transcription was performed using ImProm II Promega reverse transcription kit following the manufacturer’s recommendation. qRT-PCR analysis was performed for endoderm and pancreatic markers using the primers listed in Additional file

**Transcription factors and primers list.docx**


A total of 12 transcription factors were studied which included pluripotency marker $C_T$, after normalization with respect to the control sample and housekeeping gene, $\Delta\Delta C_T = \left[(C_{T,\mathrm{target}} - C_{T,\mathrm{GAPDH}})_{\mathrm{sample}} - (C_{T,\mathrm{target}} - C_{T,\mathrm{GAPDH}})_{\mathrm{undiff\ cells}}\right]$. The control sample was chosen to be undifferentiated cells at day 0.
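As a minimal illustration of the normalization above, the following Python sketch computes $\Delta\Delta C_T$ for one target gene relative to GAPDH and the undifferentiated control. The function names and the $C_T$ values are hypothetical, and the conversion to fold change assumes the commonly used $2^{-\Delta\Delta C_T}$ relation (ideal amplification efficiency); it is not taken from the original analysis scripts.

```python
def delta_delta_ct(ct_target_sample, ct_gapdh_sample,
                   ct_target_undiff, ct_gapdh_undiff):
    """Return ddCT: the sample dCT minus the dCT of undifferentiated (day 0) cells."""
    d_ct_sample = ct_target_sample - ct_gapdh_sample
    d_ct_undiff = ct_target_undiff - ct_gapdh_undiff
    return d_ct_sample - d_ct_undiff


def fold_change(dd_ct):
    """Assumed 2^(-ddCT) conversion, i.e. perfect doubling per PCR cycle."""
    return 2.0 ** (-dd_ct)


# Hypothetical C_T values, for illustration only.
dd_ct = delta_delta_ct(24.0, 18.0, 29.0, 18.0)
print(dd_ct, fold_change(dd_ct))
```

A negative $\Delta\Delta C_T$ indicates that the target transcript crosses the threshold earlier in the treated sample than in undifferentiated cells, i.e. up-regulation.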

TF expression profiles

The TF expression profiles can be grouped together to form an expression matrix with the rows corresponding to the measurements of interest (such as the relative mRNA concentrations) and the columns corresponding to the experimental conditions or samples. Thus, each element in the matrix refers to the intensity of the particular measurement in a given sample

Mathematical analysis

Hierarchical clustering

Hierarchical clustering partitions the data into clusters through an iterative process, in which the similarity or dissimilarity between every pair of variables in the data matrix is calculated using an appropriate distance measure, followed by grouping the variables in close proximity using a linkage function. We used built-in MATLAB functions to perform the analysis with various distance measures (e.g. Euclidean, city block) on the mean-centered and variance-scaled expression matrix. The results were represented as a clustergram, i.e. the linkage tree and the corresponding heat map. We tested the trees generated using different linkage measures after normalization of the mean expression matrix and found all the trees to be very similar, with cophenetic correlation coefficients greater than 0.9.
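The procedure can be sketched in Python with SciPy as a stand-in for the MATLAB functions used in this work. The expression matrix below is synthetic (three hypothetical "early" and three "late" markers across five conditions), and the choice of average linkage is an assumption made only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, fcluster, linkage
from scipy.spatial.distance import pdist
from scipy.stats import zscore

# Synthetic expression matrix: 6 TFs (rows) x 5 conditions (columns).
rng = np.random.default_rng(0)
pattern_early = np.array([4.0, 4.0, 1.0, 0.0, 0.0])
pattern_late = np.array([0.0, 0.0, 1.0, 4.0, 4.0])
expr = np.vstack(
    [pattern_early + rng.normal(scale=0.2, size=5) for _ in range(3)]
    + [pattern_late + rng.normal(scale=0.2, size=5) for _ in range(3)]
)

# Mean-center and variance-scale each TF profile.
scaled = zscore(expr, axis=1)

# Pairwise distances between TFs; metric="cityblock" gives the city block distance.
dist = pdist(scaled, metric="euclidean")

# Build the linkage tree (average linkage is an illustrative choice).
tree = linkage(dist, method="average")

# Cophenetic correlation: how faithfully the tree preserves the original distances.
coph_corr, _ = cophenet(tree, dist)

# Cut the tree into two clusters of TFs.
labels = fcluster(tree, t=2, criterion="maxclust")
print(coph_corr, labels)
```

Repeating the `linkage` call with other `method` values (for example `"single"` or `"complete"`) and comparing the cophenetic correlations mirrors the comparison of linkage measures described above.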

Biclustering algorithm

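Biclustering can be described as two-dimensional clustering, in which a subset of genes exhibiting a similar trend across a subset of conditions is identified. Such subsets can be considered to participate in a similar regulatory mechanism, hence constituting a regulatory network. In order to identify sets of TFs expressing coherent trends under specific sets of conditions, we analyzed our TF-condition matrix, $e_{ij}$, for biclusters, where $I$ and $J$ denote the sets of genes and conditions in a bicluster. The residue $r_{ij}$ of each element in the bicluster is defined as: $r_{ij} = e_{ij} - e_{iJ} - e_{Ij} + e_{IJ}$. The gene base is defined as $e_{iJ} = \frac{1}{|J|}\sum_{j \in J} e_{ij}$, the condition base as $e_{Ij} = \frac{1}{|I|}\sum_{i \in I} e_{ij}$, and the bicluster base as $e_{IJ} = \frac{1}{|I||J|}\sum_{i \in I,\, j \in J} e_{ij}$. The mean squared residue, $r_{IJ}$, of a bicluster is defined as,

$$r_{IJ} = \frac{1}{|I||J|}\sum_{i \in I,\, j \in J} r_{ij}^{2}$$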

Thus, our final goal is to find biclusters of maximum size, with mean squared residue lower than a given threshold (

In this function, $w_p$ is defined as

where $|Cov(e_{ij})|$ is the number of previous biclusters containing $e_{ij}$. The use of the penalty term biases the search against members that have already appeared in previous biclusters, thus reducing the overlap among the biclusters.
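To make the quantities above concrete, the following Python sketch computes the mean squared residue of a candidate bicluster and counts how many previously identified biclusters cover each matrix element. It assumes the standard Cheng and Church form of the residue, in which the bicluster base is added back; the function names and the toy matrix are hypothetical, and the exact weighting used in the penalty term is not reproduced here.

```python
import numpy as np


def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the sub-matrix selected by rows and cols."""
    sub = matrix[np.ix_(rows, cols)]
    gene_base = sub.mean(axis=1, keepdims=True)   # row means
    cond_base = sub.mean(axis=0, keepdims=True)   # column means
    bicluster_base = sub.mean()                   # overall mean
    residue = sub - gene_base - cond_base + bicluster_base
    return float((residue ** 2).mean())


def coverage_counts(shape, previous_biclusters):
    """Number of previous biclusters that contain each matrix element."""
    counts = np.zeros(shape, dtype=int)
    for rows, cols in previous_biclusters:
        counts[np.ix_(rows, cols)] += 1
    return counts


# Toy matrix with a perfectly additive pattern (row effect + column effect).
additive = np.add.outer(np.array([0.0, 1.0, 3.0]), np.array([0.0, 2.0, 5.0, 6.0]))
print(mean_squared_residue(additive, [0, 1, 2], [0, 1, 3]))

# Perturbing one element breaks the coherence and raises the residue.
perturbed = additive.copy()
perturbed[1, 1] += 3.0
print(mean_squared_residue(perturbed, [0, 1, 2], [0, 1, 3]))

# Elements covered by earlier biclusters would attract a larger penalty.
print(coverage_counts((3, 4), [([0, 1], [0, 1]), ([1, 2], [1, 3])]))
```

A perfectly additive bicluster has a residue of zero, which is why a low mean squared residue is used as the measure of coherence.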

$w_d$ is defined as

Solution procedure

The current optimization formulation has been identified to be NP-hard and has been shown to be effectively handled by evolutionary techniques like Genetic Algorithm (GA)

Each chromosome has an associated metric called the fitness, which we wish to maximize. The GA is initiated by randomly initializing a population of chromosomes (i.e. biclusters). The population is evolved in every generation by the operators reproduction, crossover and mutation. At the end of every generation, individuals for the next one are selected on the basis of their fitness values. This cycle of evolution continues until a predetermined termination criterion is reached. For the present case, we continued the simulations, up to a maximum number of generations, until no further change in the population was observed. The biclustering formulation was coded in FORTRAN R90 with the Genetic Algorithm driver (version 1.7a) obtained from David Carroll, CU Aerospace, Urbana, IL. Computations were performed on an Intel(R) Core(TM) 2 Quad CPU (Q8400 @ 2.66 GHz).
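The evolutionary cycle described above can be sketched in Python as follows. This is not the FORTRAN implementation used in this work: the binary encoding (one bit per gene and per condition), the tournament selection, the single-point crossover, the elitism step and, in particular, the simplified fitness (bicluster size when the mean squared residue is below a threshold `delta`, negative residue otherwise) are all assumptions chosen to keep the example short.

```python
import random

import numpy as np


def msr(data, rows, cols):
    """Mean squared residue of the selected sub-matrix."""
    sub = data[np.ix_(rows, cols)]
    res = (sub - sub.mean(axis=1, keepdims=True)
           - sub.mean(axis=0, keepdims=True) + sub.mean())
    return float((res ** 2).mean())


def fitness(chrom, data, delta):
    """Hypothetical fitness: size if coherent enough, else negative residue."""
    n_rows = data.shape[0]
    rows = [i for i in range(n_rows) if chrom[i]]
    cols = [j for j in range(data.shape[1]) if chrom[n_rows + j]]
    if len(rows) < 2 or len(cols) < 2:
        return float("-inf")
    score = msr(data, rows, cols)
    return float(len(rows) * len(cols)) if score <= delta else -score


def evolve(data, delta, pop_size=40, max_gen=200, patience=30, p_mut=0.05, seed=0):
    rng = random.Random(seed)
    length = sum(data.shape)
    # Random initial population of bit-string chromosomes.
    pop = [[rng.random() < 0.5 for _ in range(length)] for _ in range(pop_size)]
    best, best_fit, stale, history = None, float("-inf"), 0, []
    for _ in range(max_gen):
        top = max(pop, key=lambda c: fitness(c, data, delta))
        top_fit = fitness(top, data, delta)
        if best is None or top_fit > best_fit:
            best, best_fit, stale = top[:], top_fit, 0
        else:
            stale += 1
        history.append(best_fit)
        if stale >= patience:          # no further improvement observed
            break
        children = [best[:]]           # elitism: carry the best individual over
        while len(children) < pop_size:
            # Reproduction by tournament selection of two parents.
            p1 = max(rng.sample(pop, 3), key=lambda c: fitness(c, data, delta))
            p2 = max(rng.sample(pop, 3), key=lambda c: fitness(c, data, delta))
            cut = rng.randrange(1, length)                 # single-point crossover
            child = p1[:cut] + p2[cut:]
            child = [(not g) if rng.random() < p_mut else g for g in child]  # mutation
            children.append(child)
        pop = children
    return best, best_fit, history


# Synthetic 8 x 6 matrix with a planted additive 4 x 3 block.
data = np.random.default_rng(1).normal(scale=2.0, size=(8, 6))
data[:4, :3] = np.add.outer([0.0, 1.0, 2.0, 3.0], [0.0, 0.5, 1.0])
best, best_fit, history = evolve(data, delta=0.05)
print(best_fit, len(history))
```

Because the best individual is always carried over, the best fitness recorded in `history` never decreases from one generation to the next; the search stops when it has not improved for `patience` generations or when `max_gen` is reached.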

Determination of robust biclusters

The inherent noise in biological systems makes it difficult to draw meaningful conclusions from a deterministic analysis. The formulation proposed above is based on the mean gene expression data, which possibly reduces confidence in the identified bicluster. Here we have adopted the bootstrap technique to obtain robust biclusters from noisy experimental data. Bootstrap is a statistical technique that generates a large data set from a small number of experimental replicates by sampling with replacement. The present formulation systematically re-samples the original experimental data set using a Monte Carlo algorithm to generate the artificial data set. The optimization formulation of the biclustering problem is then solved at each of the bootstrap data points to generate a family of alternate biclusters. The final goal is to identify the most repeated biclusters in the entire array, on the justification that such a bicluster is relatively insensitive to experimental noise and hence robust. To this end, the number of repeats of a particular gene-condition combination is analyzed using the quicksort algorithm (O(N log N) on average). Our analysis showed that the complete bicluster was typically not repeated significantly; instead, only subsets of the biclusters were repeated a sufficient number of times. For identification of robust biclusters, we set the threshold frequency of repeats at 500 out of every 1000 alternate biclusters. The most repeated subsets are thereby concluded to be robust under experimental noise. The work flow for the entire analysis is depicted in Figure
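The two steps of this procedure, re-sampling the replicates and counting repeated subsets, can be sketched in Python as follows. The sketch makes several assumptions that are not stated in the text: replicates are re-sampled independently for each gene-condition pair, repeats are tallied with a hash-based counter rather than by sorting, and a subset is called robust when it appears in at least half of the solutions. All names and the toy bicluster solutions are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def bootstrap_mean_matrix(replicates, rng):
    """Resample replicates with replacement and return the matrix of means.

    replicates[g][c] is the list of replicate measurements for gene g
    under condition c.
    """
    return [
        [sum(rng.choices(values, k=len(values))) / len(values) for values in row]
        for row in replicates
    ]


def robust_subsets(biclusters, n_genes, n_conds, min_fraction=0.5):
    """Count gene/condition subsets of a given size contained in the solutions."""
    counts = Counter()
    for genes, conds in biclusters:
        for g_sub in combinations(sorted(genes), n_genes):
            for c_sub in combinations(sorted(conds), n_conds):
                counts[(g_sub, c_sub)] += 1
    cutoff = min_fraction * len(biclusters)
    return {key: n for key, n in counts.items() if n >= cutoff}


# One pseudo-dataset from hypothetical triplicate measurements (2 genes x 2 conditions).
rng = random.Random(0)
replicates = [[[1.0, 1.2, 0.9], [3.1, 2.8, 3.0]],
              [[0.2, 0.1, 0.3], [2.0, 2.2, 1.9]]]
print(bootstrap_mean_matrix(replicates, rng))

# Toy bicluster solutions, standing in for the output of the biclustering step.
solutions = [
    ({"TF1", "TF2", "TF3"}, {"c1", "c2", "c3"}),
    ({"TF1", "TF2"}, {"c1", "c2", "c4"}),
    ({"TF1", "TF2", "TF4"}, {"c1", "c2"}),
    ({"TF3", "TF4"}, {"c3", "c4"}),
]
print(robust_subsets(solutions, n_genes=2, n_conds=2))
```

In the toy example no complete bicluster is repeated, yet one two-gene, two-condition subset is shared by three of the four solutions, which is the situation described above for the bootstrap solutions.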

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

Conceived and designed experiments: IB MJ ASG SM. Performed the experiments: MJ. Conducted mathematical analysis: SM, XZ, LZ. Contributed materials/analysis tools: IB. Drafted the manuscript: SM IB. All authors read and approved the final manuscript.

Acknowledgements

We would like to thank Dr. Ira Fox from the University of Pittsburgh for his generous gift of H1 hESCs.