Department of Obstetrics & Gynecology, University of Texas Medical Branch, Galveston, Texas, USA

Northwest Medical Specialties, PLLC, Tacoma, Washington, USA

Systems Biology and Translational Medicine, Texas A&M University Health Science Center, College Station, Texas, USA

Abstract

Background

We introduce Approximate Entropy as a mathematical method of analysis for microarray data. Approximate entropy is applied here as a method to classify the complex gene expression patterns resultant of a clinical sample set. Since Entropy is a measure of disorder in a system, we believe that by choosing genes which display minimum entropy in normal controls and maximum entropy in the cancerous sample set we will be able to distinguish those genes which display the greatest variability in the cancerous set. Here we describe a method of utilizing Approximate Sample Entropy (ApSE) analysis to identify genes of interest with the highest probability of producing an accurate, predictive, classification model from our data set.

Results

In the development of a diagnostic gene-expression profile for cervical intraepithelial neoplasia (CIN) and squamous cell carcinoma of the cervix, we identified 208 genes which are unchanging in all normal tissue samples, yet exhibit a random pattern indicative of the genetic instability and heterogeneity of malignant cells. This may be measured in terms of the ApSE when compared to normal tissue. We have validated 10 of these genes on 10 Normal and 20 cancer and CIN3 samples. We report that the predictive value of the sample entropy calculation for these 10 genes of interest is promising (75% sensitivity, 80% specificity for prediction of cervical cancer over CIN3).

Conclusion

The success of the Approximate Sample Entropy approach in discerning alterations in complexity from biological system with such relatively small sample set, and extracting biologically relevant genes of interest hold great promise.

Background

Genomic heterogeneity is a characteristic feature of nearly all solid tumors, appearing early in tumor progression. Over time, early genomic instability evolutionarily leads to a molecularly heterogeneous population of cells naturally selected for their abilities to proliferate and invade, while simultaneously evading host defenses. The selection process, influenced by the unique to the tumor host environment, results in a further diverse intrapatient tumor population which is responsible for the clinical heterogeneity of the disease.

Concepts of tumorigenesis include the stochastic model, disputing the existence of genomic destabilization in tumor cells, instead arguing that mutations occurring during differentiation, development, and proliferation combine to create a cell with the perquisites for malignancy; according to this model, the cell then simply proliferates to create the cancer.

The evidence for genomic heterogeneity has been found in the extensive, progressive, and diverse genomic damage observed within the tumor, pre-neoplastic lesion,

While a fortuitous random process is created by genomic destabilization and natural selection, most are searching for a coherent pattern within this extremely diverse event. It is likely that the observed clinical heterogeneity of cervical cancer is a reflection of the dissimilar genomic events between the patients. Since the advent of microarray technology, data generated from these experiments have most often been analyzed with the aim to identify specific patterns in the expression levels, such as periodicity or monotonous increases or decreases.

We introduce Approximate Sample Entropy as a mathematical method of analysis for microarray data. Approximate entropy is traditional utilized in analysis of temporal patterns; such as heart rate variability.

Cervical cancer is an excellent model for evaluation of this analysis for multiple reasons. Aside from being the second leading cancer among women worldwide; with 470,000 new cases occurring annually; cervical cancer uniquely has an easily accessible, well defined, distinct pre-neoplastic lesion.

Methods

Sample handling and RNA isolation

Cervical biopsy samples were obtained from radical hysterectomy specimens of patients with Stage 1B1 squamous cell carcinoma of the cervix. Lesional and perilesional samples were obtained from 4 individual patients. Control biopsy samples were obtained from hysterectomy specimens of patients with documented normal cervix. All biopsies were histologically confirmed. Perilesional biopsies were normal cervical epithelium adjacent to the tumors.

Cervical biopsy samples from an additional 20 patients were utilized for real-time pcr validation. These biopsies were obtained from patients with either the preneoplasitc lesion (CIN 3), or with Stage 1A-1B squamous cell carcinoma of the cervix. As for the training set, normal cervical samples obtained from hysterectomy specimens of patients were utilized as controls. State of disease was histologically confirmed on an adjacent biopsy.

Total RNA was extracted from each sample using the RNAqueous RNA Isolation Kit (Ambion, Austin, Tx) following manufacturer's instructions. The extracted RNA is DNase treated prior to quantification. The RNA quality was assessed by detecting the 28S/18S peaks with an Agilent Bioanalyzer 2100 (Agilent Technologies, Walbronn, Germany). Only RNA with the highest fidelity (A260/A280 ratio between 1.9 and 2.1, and a high quality electropherogram) is used for microarray analysis

Microarray analysis

Twenty μg of DNA-free RNA from each biopsy was applied to the GeneChip Human HG_U133 A & B GeneChip (Affymetrix). Microarray gene chip analysis was performed on all samples using the Affymetrix HU 133 A & B human 33,000-GeneChip (Affymetrix,) for identification of disease specific expression patterns. Labeling of cRNA, hybridization, and scanning of the microarrays were done according to the manufactures protocols (Affymetrix, Santa Clara, Calif.).

Quantitative reverse transcription-PCR

Quantitative reverse transcription-PCR (qRT-PCR) was performed to validate differential expression of genes in an independent set of 20 cervical tissue biopsies. QRT-PCR was done with a 7500 Fast Real-time PCR system (Applied Biosystems) using gene-specific primer/probe mix (Assays-by-Design: Applied Biosystems, Foster City, Ca. The comparative threshold cycle (Ct) method was used in comparing the level of mRNA expression of the genes of interest to that of the internal control gene, GAPDH. Each reaction was set up in triplicate, with final expression levels in cancer samples compared to non-diseased, normal controls.

Results

Cluster Identification

Initially a normal baseline was established, 966 genes were found to be unchanging between patients with normal cervical biopsies when limited to a variance of less then 0.5% of the log2 expression values, a variance below the technical noise level of the Affymetrix array.

Establishment of normal baseline for sample entropy calculations

**Establishment of normal baseline for sample entropy calculations**. Initially a normal baseline was established, 966 genes were found to be unchanging between the normal samples when limited to a variance of less then 0.5% of the log2 expression values, a variance below the technical noise level of the Affymetrix array.

Half of the normal patient's biopsies were used in the reference sample set while the other half was treated as experimental samples similar to the cancer and perilesional biopsy samples. Changes in gene expression were calculated for each of the 966 genes as

_{2 }(_{2 }(

where

with superscripts 1, 2, and 3 representing normal, perilesional and cancer sample groups, respectively.

The 3 by 966 array of mean Δ

_{k }∈ ω_{i }if _{i}(Δ E_{k}) <_{j}(ΔE_{k}) ∀ _{j }≠

where the distances

_{i }are the estimated mean and covariance matrix for cluster _{i }and _{i}) is the probability of membership in cluster _{i }and |_{i }| is the determinant of the covariance matrix.

On the first pass through the clustering algorithm, all points were considered to be members of a single cluster located at a center given by the mean of each group over all 966 genes with an _{i}) of 1. On subsequent iterations, this quantity was based on the fraction of genes in each potential cluster.

The number of subsets was determined using a process of blind clustering which estimated the χ^{2 }statistic of hypothetical clusters to determine if the data set represents a single or multiple Gaussian distributions. If the computed chi^{2 }value for any cluster exceeded a set threshold (1.0 for 1 to 10 clusters; 2.0 for > 10 clusters) and if the product of the number of members of a class and the amount by which the chi-squared test exceeded the threshold was greater than a second criterion, the data were considered for further subsets. Subdivision of existing clusters continued until these conditions were no longer met. By varying the second criterion, it is possible to stabilize at a greater or lesser total number of clusters.

If an existing cluster was considered for further subdivision, two new cluster centers were located symmetrically along the principal axis (eigenvector) by a distance determined by the magnitude of the principal component (eigenvalue) computed from that cluster's covariance matrix. Bayesian distances were then estimated relative to the new clusters and the process was repeated. (Figure

Bayesian classification algorithm partitioned the genes into 10 subsets based on expression change patterns

**Bayesian classification algorithm partitioned the genes into 10 subsets based on expression change patterns**. Scatter plot of the change in log2 expression of Cancer from Normal, versus Perilesional from Normal, depicts the 10 possible subsets of genes differentiating the two groups. Subsets 1 and 2 displayed the best predictive value during validation.(A) Line plot of log2 expression of the 10 subsets, for the average change of Log2 expression of from Normal; for Normal, Perilesional and Cancer for each subset.(B).

Apparent Entropy Calculation

After dividing the 966 genes into 10 groups based on cluster membership, _{n }| represents the absolute value of Δ_{R }denote the set of genes whose Δ

ApSE as a function of R

**ApSE as a function of R**.

_{R }= {_{k})_{n }<

where

We can then define _{R}. If

Classification Trees

Classification consisted of two stages: 1) determination of decision rules based on the

Decision rules were determined separately for each subset of genes. ApSE values from the training set were input as independent/explanatory variables into a regression tree (Insightful Miner V3.01, Insightful Corp., Seattle WA). (Although Insightful Miner uses the term 'regression tree', the same process can be used to predict categorical or continuous dependent/response variables. The result is typically termed a regression tree when the response variable is continuous and a classification tree when the response variable is categorical).

The tree-fitting algorithm used in Insightful Miner is based on the recursive partitioning code called RPART.

Unlike traditional regression techniques which use a weighted sum of independent values to estimate the dependent variable, regression trees output a hierarchical series of rules or logical if-then conditions that can be used to classify the cases based on the values of the predictor variables.(Figure

Example Classification Tree based on

**Example Classification Tree based on ApSE analysis for a training set**. For this particular gene cluster, key decisions were based on the values of

Quantitative real-time PCR validation of gene subsets

Ten genes were chosen for initial quantitative real-time validation of the gene array, for the two gene subsets (subset 1 and 2) which best represent our progressive model of disease. RNA was extracted from fresh frozen biopsies of normal, cervical intraepithelial neoplasia grade 3 (CIN 3) and stage IB-IIB squamous cell carcinoma (SCC) for Quantitative Real-Time PCR (qRT-PCR) using the RNAqueous RNA Isolation Kit (Ambion, Austin, Texas). Using the 7500 FAST Real-Time System (Applied Biosystems), cDNA from the samples was amplified in a single-plex reaction, for the following genes of interest (Adaptor related protein complex 2, sigma 1 subunit (AP2S1), Collagen typeVI alpha 3(Col6A3), Fibronectin 1 (FN1), Growth arrest specific 6 (GAS6), Human leukocyte antigen (HLA-C), High-mobility group nucleosome binding domain 1 (HMGN1), Heat shock 60 kDa protein 1 (HSPD1), Interferon gamma-inducible protein 16 (IFI16), Matrix metallopeptidase 2 (MMP2), Proteasome subunit beta-2 PSMB2, Tenascin C (TNC))

Discussion

We believe that the observed clinical heterogeneity of most cancers is a reflection of the gross genomic heterogeneity and subsequently the molecular heterogeneity also found within the tumors, also that the more heterogeneous the cell population, the more likely the development of a metastatic, aggressive phenotype. Genomic destabilization, evolution, and selection for invasive, proliferating populations of cells encompass the fundamental nature of cancer. Cancer cells represent a diversely complex evolutionary life within the host. The resultant heterogeneous tumor is linked to its genetic state and represented in its gene expression profile. We have shown here that encompassing this heterogeneity in the data analysis aids in capturing the diversity within the cancer subgroups and in the identification of subgroup specific gene expression profiles.

Since the advent of microarray technology, data generated from these experiments have most often been analyzed with the aim to identify specific patterns in the expression levels, such as periodicity or monotonous increases or decreases.

Scientists have struggled with theoretical means to detect alterations in system complexities. Entropy, a measure of disorder or randomness has been used to classify complex systems externally viewed as chaotic. A limited number of studies have utilized the concept of entropy in gene expression analysis; all to enhance clustering techniques by minimizing entropy and thus the amount of disorder in the groups.

Approximate sample entropy analysis of microarray data has proven to be a reliable method of differentiating subgroups of patient samples. With the ability to classify the perilesional biopsy sample from normal with 75% accuracy, and to classify cancer with 80% accuracy with an 81 gene set (subset 1). (Table

Assessment of evaluation metrics for subset 1 and 2 from the logistic regression

**Training Data**

**Gene Subset 1**

**Positive Category**

**Recall**

**Precision**

**F-Measure**

**Perilesional**

100.00%

100.00%

100.00%

**Cancer**

100.00%

100.00%

100.00%

**Validation Data**

**Positive Category**

**Recall**

**Precision**

**F-Measure**

**Perilesional**

75.00%

75.00%

75.00%

**Cancer**

80.00%

80.00%

80.00%

**Training Data**

**Gene Subset 2**

**Positive Category**

**Recall**

**Precision**

**F-Measure**

**Perilesional**

100.00%

85.71%

92.31%

**Cancer**

100.00%

100.00%

100.00%

**Validation Data**

**Positive Category**

**Recall**

**Precision**

**F-Measure**

**Perilesional**

75.00%

75.00%

75.00%

**Cancer**

100.00%

83.33%

90.91%

Squamous cell carcinoma tumor and perilesional display distinctly different scatter plots from normal tissue

**Squamous cell carcinoma tumor and perilesional display distinctly different scatter plots from normal tissue**. Representative scatter plots of average normal versus perilesional (A) and cancer (B) samples from patient 1, for gene subset 1. R is depicted by the dashed line around the normal baseline line (slope = 1). The value for R is optimized for differentiation between groups for each subset of genes.

Quantitative real-time validation has been performed with a limited gene set from the sample entropy analysis of the Affymetrix microarrays. Approximate sample entropy analysis of normalized quantitative real-time PCR data was also found to be reliable at differentiating subgroups of cervical neoplasia. Unlike the original gene array experimental set, in the validation study expanded the abnormal cervical sample groups to include CIN3 (the premalignant lesion), and stages I and II squamous cell cancer. We find in Table

Probability distribution parameters and classification probabilities for the realtime validation of gene subsets 1 and 2

**Subset 1**

**Subset 2**

**Cin3**

**SCC**

**Cin3**

**SCC**

**Mean**, m

0.424

1.122

0.539

1.87

**Standard Deviation**, s

0.176

0.455

0.304

0.699

**Probability of classifying Normal**

0.059757

0.016329

0.100342

0.006934

**Probability of classifying Cin3**

**0.850268**

0.138632

**0.704364**

0.05598

**Probability of classifying SCC**

0.089975

**0.845039**

0.195294

**0.937085**

**p value (T-test)**

0.0005

0.0001

Conclusion

The sample entropy analysis has the ability to detect abnormality with a predictive probability of 88%, with region R set at the maximum value for difference in

Probability distribution of approximate sample entropy approach for gene sets 1 and 2 applied to an independent group of patients

**Probability distribution of approximate sample entropy approach for gene sets 1 and 2 applied to an independent group of patients**. Plot of R versus average

Authors' contributions

SKB conceived of the study, participated in design, supervised the molecular studies and carried out the analysis. JBT assisted in development of analysis and draft of the manuscript. SAS carried out the molecular studies. MFB assisted in patient recruitment, sample acquisition and molecular studies. CDA participated in design and supervised all aspects of the study. All authors contributed to the writing of this manuscript. All authors read and approved the final manuscript.

Acknowledgements

This work was supported in part by NCI K22 Grant (5K22CA108482-03). We wish to thank Nilufar Khatoon and Deepa Patel for their technical support.