Bioinformatics Institute, 30 Biopolis Str. #07-01, 138672, Singapore

Abstract

Background

A sense-antisense gene pair (SAGP) is a gene pair where two oppositely transcribed genes share a common nucleotide sequence region. In eukaryotic genomes, SAGPs can be organized in complex sense-antisense architectures (CSAGAs) in which at least one sense gene shares loci with two or more antisense partners. As shown in several case studies, SAGPs may be involved in cancers, neurological diseases and complex syndromes. However, CSAGAs have not yet been characterized in the context of human disease or cancer.

Results

We characterize five genes (

Conclusion

We have identified a novel

Background

A

Studies have shown that changes in the transcription of SAGPs could be implicated in pathological processes such as some cancers and neurological diseases

In mammalian genomes, SAGPs can be organized in more complex sense-antisense gene architectures (CSAGAs) in which at least one gene shares loci with two or more antisense partners

There are many oncogenes on chromosome 17, although the localization of these genes is not uniform. For example, according to Cancer Genetics Web

CSAGAs and their association with human cancers in the regions outside of the

Methods

Patients, tumor specimens, cell lines and microarray data

Clinical characteristics of breast cancer patients and tumor samples from two independent cohorts (Uppsala and Stockholm) have been published previously. The Stockholm cohort comprised N_{s} = 159 patients with breast cancer who were operated on in the Karolinska Hospital from 1 January 1994 to 31 December 1996 and identified in the Stockholm-Gotland Breast Cancer registry. The Uppsala cohort comprised N_{u} = 251 patients representing approximately 60% of all breast cancer resections in Uppsala County, Sweden, from 1 January 1987 to 31 December 1989. Information on patients' disease-free survival (DFS) times/events and the expression patterns of approximately 30,000 gene transcripts (representing

Correlation analysis

Our primary goal is to identify whether the set of genes composing the

**P-values calculated by Kolmogorov-Smirnov test of Normality (α = 1%)**. Description: file contains two tables with P-values of Normality for the five genes of

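Then, we derive a correlation matrix of the form:

$$R_{p \times p} = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{pmatrix}$$

where r_{1p} denotes the Pearson correlation coefficient between Affymetrix probesets 1 and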

To test the significance of the correlations, we test the null hypothesis H_{0}: R_{p×p} = I_{p×p}, where R_{p×p} is the p × p correlation matrix and I_{p×p} is the corresponding p × p identity matrix. Under the null hypothesis, there is no significant correlation among these probes, whereas rejection of H_{0} at

where ^{2 }with 1/2 ^{b}, for each of the B = 5,000 draws. The corresponding bootstrap P value is estimated as:

where ^{b }denotes the bootstrap test statistic of the b^{th }draw. Similar bootstrap approaches have been discussed in
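The bootstrap test described above can be illustrated with a minimal sketch. It rests on several assumptions that are not confirmed by the text: the test statistic is taken to be the standard Bartlett sphericity statistic, -(n - 1 - (2p + 5)/6) ln|R|; the null distribution is generated by resampling each probeset independently with replacement (which breaks the correlations between probesets); and the P value is the fraction of bootstrap statistics that reach the observed one. All function names are hypothetical, and the published analysis was run in R, not Python.

```python
import math
import random


def pearson_matrix(cols):
    """Pearson correlation matrix of a list of equally long columns."""
    n = len(cols[0])
    centred = []
    for c in cols:
        mean = sum(c) / n
        centred.append([v - mean for v in c])
    norms = [math.sqrt(sum(v * v for v in c)) for c in centred]
    p = len(cols)
    return [[sum(a * b for a, b in zip(centred[i], centred[j])) / (norms[i] * norms[j])
             for j in range(p)] for i in range(p)]


def log_det(matrix):
    """Natural log of the determinant of a positive-definite matrix
    (Gaussian elimination without pivoting)."""
    a = [row[:] for row in matrix]
    total = 0.0
    for k in range(len(a)):
        total += math.log(a[k][k])
        for i in range(k + 1, len(a)):
            f = a[i][k] / a[k][k]
            for j in range(k, len(a)):
                a[i][j] -= f * a[k][j]
    return total


def bartlett_statistic(cols):
    """Standard Bartlett sphericity statistic for H0: R = I (an assumption here)."""
    n, p = len(cols[0]), len(cols)
    return -(n - 1 - (2 * p + 5) / 6) * log_det(pearson_matrix(cols))


def bootstrap_p_value(cols, draws=5000, seed=0):
    """Return (observed statistic, bootstrap P value).

    Assumed null scheme: every column is resampled independently with replacement."""
    rng = random.Random(seed)
    observed = bartlett_statistic(cols)
    n = len(cols[0])
    exceed = 0
    for _ in range(draws):
        resampled = [[rng.choice(c) for _ in range(n)] for c in cols]
        if bartlett_statistic(resampled) >= observed:
            exceed += 1
    return observed, exceed / draws
```

The sketch expects one list of expression values per probeset, with more patients than probesets so that the correlation matrix is non-singular. The default of 5,000 draws mirrors B = 5,000 in the text.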

Comparison of correlation matrices

We would like to show that the genes in the _{0}: _{p×p }= _{p×p }is as before and _{a,b }(_{1 }and _{2 }and used to estimate Box M statistic as:

|_{1}| is the determinant of the variance-covariance matrix of our prospective gene cluster (corresponding to the _{p×p }correlation matrix), |_{2}| is the determinant of the variance-covariance matrix of the neighbouring group of genes (corresponding to the _{pool}| is the pooled sample variance/covariance matrix estimated as:

Box ^{2 }and F approximations for the distribution of M (an exact test does not exist). Notice that in our case _{1 }= _{2 }but the dimensions of _{p×p }using Box M test. Then we average over the estimated P values. It is possible that our approach introduces some bias in the comparison. However, as we will see later, the difference between the two compared matrices is large enough to safely conclude for their statistical difference.
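For orientation, the textbook two-group form of Box's M statistic can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: it assumes both groups contain the same number of genes (the text notes that the dimensions differ and that P values are averaged, and that step is not reproduced here), and it uses the usual chi-square scaling factor. The function names are hypothetical.

```python
import math


def covariance(rows):
    """Sample variance-covariance matrix (rows = patients, columns = genes)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]


def log_det(matrix):
    """Natural log of the determinant of a positive-definite matrix."""
    a = [row[:] for row in matrix]
    total = 0.0
    for k in range(len(a)):
        total += math.log(a[k][k])
        for i in range(k + 1, len(a)):
            f = a[i][k] / a[k][k]
            for j in range(k, len(a)):
                a[i][j] -= f * a[k][j]
    return total


def box_m(group1, group2):
    """Return (M, chi-square approximation of M, degrees of freedom) for two groups."""
    n1, n2, p = len(group1), len(group2), len(group1[0])
    s1, s2 = covariance(group1), covariance(group2)
    dof = n1 + n2 - 2
    # Pooled variance-covariance matrix, weighted by (n_i - 1).
    pooled = [[((n1 - 1) * s1[i][j] + (n2 - 1) * s2[i][j]) / dof
               for j in range(p)] for i in range(p)]
    m = dof * log_det(pooled) - (n1 - 1) * log_det(s1) - (n2 - 1) * log_det(s2)
    # Standard scaling factor for the chi-square approximation (two groups).
    c = (2 * p * p + 3 * p - 1) / (6 * (p + 1)) * (1 / (n1 - 1) + 1 / (n2 - 1) - 1 / dof)
    return m, (1 - c) * m, p * (p + 1) // 2
```

M is zero when the two sample covariance matrices coincide and grows as they diverge; the scaled value would then be referred to a chi-square distribution with the returned degrees of freedom.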

Survival Analysis Based on Genes and Gene Pair Expression Patterns

This analysis involves testing whether the prospective gene cluster contains

We assume a microarray experiment with _{k}, defined as the time interval from surgery until the first recurrence (local, regional, distant) or the last date of follow-up), and a nominal (yes/no) clinical event _{k }(e.g., occurrence of tumor metastasis at time _{k}). Each patient is assigned to low- or high- risk groups according to:

where ^{i }denotes the cut-off of the

where ^{i}_{k }is the hazard function and _{i}(_{k}^{i}_{0}(_{k}_{k }is patient survival time. To assess the ability of each gene to discriminate the patients into two distinct genetic classes, the Wald P value of the _{i }coefficient of the Cox proportional hazard regression model

where _{k}) = {_{j }≥ _{k}} is the risk set at time _{k }and _{k }is the clinical event at time _{k}. The actual fitting of the Cox model is conducted by the _{i }Wald P values are assumed to have better group discrimination ability and are thus called

The proposed dichotomization of the patients into two groups and the subsequent fit on the Cox proportional hazards model is a strategy that has been followed in the past (for example see
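The idea of a data-driven cut-off can be sketched in a few lines. The sketch below swaps in a two-group log-rank test for the Wald test of the Cox model used in the study; the two are closely related, since the Cox score test for a single binary covariate reduces to the log-rank test when there are no tied event times. The rule "expression at or above the cut-off forms group 1", the minimum group size and all function names are assumptions made for illustration.

```python
import math


def logrank_p(times, events, groups):
    """Two-group log-rank P value (chi-square approximation, 1 d.f.).

    times: follow-up times; events: 1 if the event occurred, 0 if censored;
    groups: 0/1 group label per patient."""
    obs = exp = var = 0.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        n = sum(1 for x in times if x >= t)                      # patients at risk
        n1 = sum(1 for x, g in zip(times, groups) if x >= t and g == 1)
        d = sum(1 for x, e in zip(times, events) if x == t and e)
        d1 = sum(1 for x, e, g in zip(times, events, groups)
                 if x == t and e and g == 1)
        obs += d1
        exp += d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if var == 0:
        return 1.0
    chi2 = (obs - exp) ** 2 / var
    return math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 d.f.


def best_cutoff(expression, times, events, min_group=2):
    """Scan every observed expression value as a cut-off and return
    (cut-off, P value) for the split with the smallest P value."""
    best = None
    for c in sorted(set(expression)):
        groups = [1 if y >= c else 0 for y in expression]
        if min(sum(groups), len(groups) - sum(groups)) < min_group:
            continue  # skip splits that leave one group nearly empty
        p = logrank_p(times, events, groups)
        if best is None or p < best[1]:
            best = (c, p)
    return best
```

Because the cut-off is chosen to minimize the P value, the selected P value is optimistic, which is one reason for validating the grouping in an independent cohort.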

A similar approach is applied to identify synergistic survival-significant gene pairs using the two-dimensional data-driven grouping method of Motakis et al. ^{i }and ^{j},

Figure _{i,k}, _{j,k}] are plotted; note that "A", "B", "C" and "D" are defined by the conditions A: _{i,k }<^{i }and _{j,k}<^{j}; B: _{i,k }≥ ^{i }and _{j,k }<^{j}; C: _{i,k}<^{i }and _{j,k }≥ ^{j}; D: _{i,k }≥ ^{i }and _{j,k }≥ ^{j}. For each

**Grouping of a synergetic gene pair (genes 1 and 2 with respective cutoffs c^{1} and c^{2}) and all possible two-group designs (Designs 1-7)**.

for each design and estimate the seven Wald
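The quadrant conditions A to D and the seven two-group designs can be enumerated with a short sketch. Four quadrants can be split into two non-empty groups in exactly seven ways, which matches "Designs 1-7"; the order in which the sketch lists them is arbitrary and is not claimed to match the numbering of the original figure. The symbols for expression values and the function names are assumptions.

```python
from itertools import combinations


def quadrant(y_i, y_j, c_i, c_j):
    """Quadrant label for one patient, given expression values of genes i and j
    and their cut-offs (A: both below, B: only gene i at/above,
    C: only gene j at/above, D: both at/above)."""
    if y_i < c_i:
        return "A" if y_j < c_j else "C"
    return "B" if y_j < c_j else "D"


def two_group_designs():
    """All splits of the quadrants A, B, C, D into two non-empty groups."""
    labels = "ABCD"
    designs = []
    for size in (1, 2):
        for subset in combinations(labels, size):
            if size == 2 and "A" not in subset:
                continue  # the complementary split was already listed
            rest = tuple(q for q in labels if q not in subset)
            designs.append((subset, rest))
    return designs


def group_patients(expr_i, expr_j, c_i, c_j, design):
    """0/1 group label per patient: 1 if the patient's quadrant is in design[0]."""
    first = set(design[0])
    return [1 if quadrant(a, b, c_i, c_j) in first else 0
            for a, b in zip(expr_i, expr_j)]
```

Each of the seven label vectors produced by `group_patients` would then be passed to the survival model, and the design with the smallest P value retained.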

The correlation and survival analyses were conducted in R

Results

Identification of the co-expressed

Using the high-confidence Affymetrix Chip U133 A&B probesets presented in the APMA database

**TNFAIP1/POLDIP2 complex sense-antisense architecture mapped onto the genome (UCSC genomic browser)**.

Identification of TNFAIP1/POLDIP2 Structural-Functional Gene Module

We identified two SAGPs (

Next, we produced correlation matrices of the

**Members of the TNFAIP1/POLDIP2 CSAGA are mutually co-regulated in breast cancer and form a structural--functional gene module**. Correlation matrices visually demonstrate the presence of a characteristic co-regulatory pattern: the co-regulatory area is formed by enrichment of significant Pearson correlation coefficients (

The structural backbone of this

Based on its structural and expressional integrity, we have termed the

Next, using Bartlett's test

P values obtained by pair-wise comparisons of matrices for five genes in the SFGM group and six 'neighbouring' genes

| **Breast cancer grade** | **SFGM matrix^{1}, U** | **SFGM matrix^{1}, S** | **NG matrix^{1}, U** | **NG matrix^{1}, S** | **SFGM matrix/NG matrix^{2}, U** | **SFGM matrix/NG matrix^{2}, S** |
|---|---|---|---|---|---|---|
| G3 | 1.5E-12 | 2.7E-10 | 7.6E-01 | 5.9E-01 | 1.0E-16 | 1.0E-16 |
| G3-like | 3.4E-11 | 8.9E-03 | 2.1E-01 | 8.8E-01 | 1.0E-16 | 1.1E-01 |
| G1-like | 2.1E-14 | 9.9E-02 | 4.5E-01 | 6.7E-01 | 1.0E-16 | 1.0E-16 |
| G1 | 1.1E-04 | 1.6E-14 | 9.1E-01 | 7.7E-01 | 1.0E-16 | 1.0E-16 |
| Total group | 1.2E-16 | 1.3E-15 | 8.3E-01 | 6.1E-01 | 1.0E-16 | 1.0E-15 |

^{1 }-- P values were calculated using Bartlett's bootstrap test. ^{2 }-- averaged P values were calculated using Box's M test (see description of procedures in Materials and methods section). U -- Uppsala cohort; S -- Stockholm cohort.

Bartlett's test in the Uppsala cohort showed that the tested correlation matrices were highly significant in each of the four grade groups, as well as for all patients combined, at significance level

Next, we applied Box's M test to the comparison of two correlation matrices at

We suggest three possible mechanisms for the observed co-regulatory pattern of the

Survival analysis of SFGM genes and their closest neighbours in breast cancer patients

We applied our survival analysis algorithm to the genes of the SFGM and NG matrices. Four members (unique genes) of the TNFAIP1/POLDIP2 SFGM are significant at

Individual genes selected among the

| **Affymetrix U133 (A&B) probeset** | **Gene Symbol** | **Uppsala cohort (individual): Wald statistic P value** | **Uppsala cohort: FDR corrected (P value = 4.2E-03)** | **Stockholm cohort (individual): Wald statistic P value** | **Stockholm cohort: FDR corrected (P value = 5.1E-03)** |
|---|---|---|---|---|---|
| B.222425_s_at | POLDIP2 | 1.5E-05 | Significant | 2.4E-02 | Not significant |
| A.210312_s_at | IFT20 | 4.1E-04 | Significant | 2.4E-02 | Not significant |
| A.212282_at | TMEM97 | 1.1E-03 | Significant | 3.0E-03 | Significant |
| B.225375_at | TMEM199 | 2.2E-02 | Not significant | 7.2E-04 | Significant |

The Uppsala cohort P value correction

Selected non-redundant survival-significant gene pairs identified in both cohorts of breast cancer patients

| **Affymetrix U133 (A&B) probeset** | **Gene Symbol** | **P value (individual), U** | **P value (individual), S** | **Affyprobeset*** | **GS*** | **P value (individual), U** | **P value (individual), S** | **P value (gene pair), U** | **P value (gene pair), S** |
|---|---|---|---|---|---|---|---|---|---|
| ***222425_s_at*** | ***POLDIP2*** | ***1.50E-05*** | ***0.024*** | ***A.201207_at*** | ***TNFAIP1*** | ***0.00011*** | ***0.081*** | ***3.10E-07*** | ***0.00046*** |
| ***201208_s_at*** | ***TNFAIP1*** | ***0.022*** | ***0.11*** | ***A.214283_at*** | ***TMEM97*** | ***0.074*** | ***0.081*** | ***0.00022*** | ***0.0029*** |
| ***213259_s_at*** | ***SARM1*** | ***0.011*** | ***0.074*** | ***B.225375_at*** | ***TMEM199*** | ***0.022*** | ***0.00072*** | ***0.00085*** | ***2.90E-05*** |
| 204534_at | VTN | 0.024 | 0.11 | A.210312_s_at | IFT20 | 0.00041 | 0.024 | 0.00021 | 0.00052 |
| 204534_at | VTN | 0.024 | 0.11 | A.212279_at | TMEM97 | 0.0042 | 0.0035 | 1.00E-04 | 0.00062 |
| 210312_s_at | IFT20 | 0.00041 | 0.024 | A.212281_s_at | TMEM97 | 0.0028 | 0.0051 | 1.60E-05 | 0.0036 |
| 213259_s_at | SARM1 | 0.011 | 0.074 | A.212281_s_at | TMEM97 | 0.0028 | 0.0051 | 0.00039 | 0.004 |
| 217806_s_at | POLDIP2 | 4.30E-05 | 0.12 | A.212281_s_at | TMEM97 | 0.0028 | 0.0051 | 2.60E-05 | 0.0036 |
| 225375_at | TMEM199 | 0.022 | 0.00072 | A.201207_at | TNFAIP1 | 0.00011 | 0.081 | 9.30E-05 | 0.00021 |
| 225375_at | TMEM199 | 0.022 | 0.00072 | A.212279_at | TMEM97 | 0.0042 | 0.0035 | 2.00E-04 | 0.00032 |

Bold italics indicate gene pairs for which the gene-pair P value is at least ten times lower than the P value of either individual gene of the pair, in both the Uppsala and Stockholm cohorts. U, Uppsala cohort; S, Stockholm cohort.
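The ten-fold criterion in the table footnote can be checked mechanically. The sketch below copies the P values from the table above and keeps the pairs whose gene-pair P value is at least ten times lower than both individual P values in both cohorts; the data layout and the function name are illustrative choices only.

```python
# (gene 1, gene 2, individual P (U, S) of gene 1, individual P (U, S) of gene 2,
#  gene-pair P (U, S)); values copied from the table above.
PAIRS = [
    ("POLDIP2", "TNFAIP1", (1.50e-05, 0.024), (0.00011, 0.081), (3.10e-07, 0.00046)),
    ("TNFAIP1", "TMEM97", (0.022, 0.11), (0.074, 0.081), (0.00022, 0.0029)),
    ("SARM1", "TMEM199", (0.011, 0.074), (0.022, 0.00072), (0.00085, 2.90e-05)),
    ("VTN", "IFT20", (0.024, 0.11), (0.00041, 0.024), (0.00021, 0.00052)),
    ("VTN", "TMEM97", (0.024, 0.11), (0.0042, 0.0035), (1.00e-04, 0.00062)),
    ("IFT20", "TMEM97", (0.00041, 0.024), (0.0028, 0.0051), (1.60e-05, 0.0036)),
    ("SARM1", "TMEM97", (0.011, 0.074), (0.0028, 0.0051), (0.00039, 0.004)),
    ("POLDIP2", "TMEM97", (4.30e-05, 0.12), (0.0028, 0.0051), (2.60e-05, 0.0036)),
    ("TMEM199", "TNFAIP1", (0.022, 0.00072), (0.00011, 0.081), (9.30e-05, 0.00021)),
    ("TMEM199", "TMEM97", (0.022, 0.00072), (0.0042, 0.0035), (2.00e-04, 0.00032)),
]


def tenfold_pairs(pairs, fold=10):
    """Pairs whose gene-pair P value is at least `fold` times lower than both
    individual P values, in both cohorts (index 0 = Uppsala, 1 = Stockholm)."""
    selected = []
    for gene1, gene2, p1, p2, p_pair in pairs:
        if all(p_pair[k] * fold <= min(p1[k], p2[k]) for k in (0, 1)):
            selected.append((gene1, gene2))
    return selected


print(tenfold_pairs(PAIRS))
# [('POLDIP2', 'TNFAIP1'), ('TNFAIP1', 'TMEM97'), ('SARM1', 'TMEM199')]
```

The three pairs returned are the three rows set in bold italics.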

Among the seven unique genes that compose the eleven significant gene pairs (Table

**Survival analysis for the TMEM97/TNFAIP1 (Figure S1) and TMEM199/SARM1 (Figure S2) gene pairs**. Description: file contains patients grouping and Kaplan-Meier survival curves for the

**Survival analysis for the TNFAIP1/POLDIP2 gene pair in breast cancer patients**. **A**, **B** and **C** -- plots and histogram for the Stockholm cohort; **D**, **E** and **F** -- plots and histogram for the Uppsala cohort. Black indicates the low-risk prognosis group, red indicates the high-risk prognosis group. **A** and **D** -- correlation of gene expression and optimal partition of expression domains and patients grouping. The horizontal lines are the cut-offs of 2D data-driven grouping. **B** and **E** -- Kaplan-Meier survival curves for **C** and **F** -- separation of breast cancer patients based on expression data of the

The gene pair

Expression of gene members of the

Previous studies of HER2-amplified tumors have demonstrated that the smallest region of amplification (SRA) involving HER2 spans 280 kb and contains a number of genes in addition to HER2 that have elevated levels of expression

In order to elucidate whether the mRNA expression levels of members of the

**Expression data for 38 breast cancer cell lines either non-normalized or normalized to DNA copy number**. File contains the original expression values for 38 breast cancer cell lines extracted from Hu

**Correlation matrices analysis of TNFAIP1/POLDIP2 SFGM in 38 breast cancer cell lines (see materials and methods)**. Due to the small sample size (38 cell lines) Kendall-Tau correlation coefficients were calculated. Only significant correlation coefficients (

For the analysis of the

Correlation analysis of DNA copy number and microarray expression data for the

Column groups: **SNPs for 17q11.2 SRA** and **SNPs for 17q12 SRA**. SNP columns, in source order: **rs4239211**, **rs10512429**, **rs7207976**, **rs10512430**, **rs602688**, **rs632202**, **rs620686**, **rs10491129**, **rs10491128**, **rs2517956**, **rs9303277**.

| **Affy probesets*** | **Gene Symbol** | **Correlation coefficients (in source order)** |
|---|---|---|
| 220654_at | PPY2 | |
| **212282_at** | **TMEM97** | **0.29** |
| **210312_s_at** | **IFT20** | **0.41**, **0.39**, **0.39**, **0.39** |
| **201207_at** | **TNFAIP1** | **0.40**, **0.33**, **0.33**, **0.33** |
| **222425_s_at** | **POLDIP2** | **0.45**, **0.39**, **0.39**, **0.39**, 0.29, 0.34, 0.34 |
| **225375_at** | **TMEM199** | **0.48**, **0.41**, **0.41**, **0.41** |
| 234256_at | SEBOX | |
| 204534_at | VTN | |
| 213259_s_at | SARM1 | |
| 207567_at | SLC13A2 | |
| 207683_at | FOXN1 | |
| **200029_at** | **RPL19** | **0.43**, **0.43**, **0.43**, **0.42**, **0.35** |
| 228888_at | STAC2 | 0.35, 0.35 |
| **239224_at** | **FBXL20** | **0.54**, **0.54**, **0.54**, **0.50**, **0.48**, **0.39**, **0.39** |
| **203497_at** | **PPARBP** | **0.58**, **0.58**, **0.58**, **0.63**, **0.68**, **0.54**, **0.54** |
| **213557_at** | **CRKRS** | **0.45**, **0.45**, **0.45**, **0.50**, **0.57**, **0.48**, **0.48** |
| 210271_at | NEUROD2 | |
| **225165_at** | **PPP1R1B**** | |
| **202991_at** | **STARD3** | **0.32**, **0.39**, **0.50**, **0.50** |
| **205766_at** | **TCAP**** | |
| **206793_at** | **PNMT** | |
| **221811_at** | **PERLD1** | **0.32**, **0.39**, **0.39** |
| **216836_s_at** | **ERBB2** | **0.42**, **0.42** |
| **224447_s_at** | **C17orf37** | **0.30**, **0.50**, **0.50** |
| **210761_s_at** | **GRB7** | **0.32**, **0.49**, **0.49** |
| 221092_at | IKZF3 | |
| 231442_at | ZPBP2 | |
| **219233_s_at** | **GSDML** | **0.39**, **0.39**, **0.39**, **0.36**, **0.42**, **0.41**, **0.41** |

Only Kendall-Tau correlation coefficients with

Expression of gene-members of the

The

In our analysis we found that the genes composing the

Previous studies on the ERBB2 amplicon have utilized several different approaches. One of these was based on detection of a correlation between DNA copy number and mRNA expression

In our study, we performed DNA copy number analysis of the

Another approach originally applied to budding yeast

In our correlation analysis we applied a similar idea as in

We produced correlation matrices that included 6 validated genes of the

**Correlation matrix for ERBB2 amplicon (Uppsala cohort)**.

**Correlation matrix for ERBB2 amplicon (Stockholm cohort)**.

Independently, for the 38 breast cancer cell lines for which both expression and DNA copy number data were available

**Correlation matrix of the genes involved in the ERBB2 CR in breast cancer cell lines**. File represents correlation matrix analysis of the genes involved in the ERBB2 CR as well as 11 'neighbouring' genes in a sample of 38 breast cancer cell lines (Kendall-Tau,

Previously, Kauraniemi

Our correlation analysis of the

Due to the previously documented fact of co-amplification of broad genomic regions of the 17q11.2 and 17q12 SRAs

**Correlation tables between the genes of the TNFAIP1/POLDIP2 SFGM and its 'neighbours' and the ERBB2 CR and its 'neighbours' in breast cancer patients**. The central selected area of the matrix represents significant correlations (Pearson,

Similarly, in 38 breast cancer cell lines (Additional file

**Correlation analysis between the genes of the TNFAIP1/POLDIP2 SFGM and the ERBB2 CR in breast cancer cell lines**. File represents correlation analysis between the genes of the

Therefore, the expression profiles of the genes of the

Genes of the

Figure

Additional custom tracks in the UCSC browser for STAT1 binding in HeLa S3 cells

Results of our additional experiment are presented in Figure

Therefore, we have clearly demonstrated that not only recurrent amplification, but also chromatin remodeling and/or transcription activation is important for the establishment and maintenance of the co-regulatory pattern of the

Discussion

A method for the statistical identification of co-regulated genes organized in complex genome architectures

In the present study, we have developed a new computational method for the statistical identification of co-regulated genes organized in complex genome architectures including more than one SAGP. Our approach is based on: (i) concordant analysis and selection of expressed SA genes; (ii) identification of the boundaries of a genomic region encompassing genes with similar co-expression patterns; (iii) validation of the expression pattern using independent patient cohorts; (iv) evaluation of the clinical significance of expressed genes that belong to the identified genome region; and (v) identification of the synergy of the genes in the context of disease aggressiveness and disease relapse.

We analyzed the

Concordant regulation in the

We did not observe any significant negative correlations (discordant regulation) in the

Correlation analysis of the

Protein interaction sub-network

Our analysis of the literature on the members of the

Liu

Two interesting recent publications support the idea about the involvement of the

Interesting pleiotropic effects of

Co-regulatory pattern of the

It is important to note that the

It is well established that overexpression of the

Therefore, we suggest that the

Taken together, our analysis suggests that the

Conclusion

We conclude that the methods of computational identification of

Due to concordant regulation of the genes in such modules, one could target just the antisense transcript(s), resulting in reduction of sense mRNA transcripts, or also the adjacent genes of the module, thereby achieving additive and even synergistic reduction of expression of a specific group of neighboring genes

List of abbreviations used

SA: sense-antisense; SAGP: Sense-antisense gene pair; CSAGA: complex sense-antisense gene architecture; GEO: Gene Expression Omnibus;

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

V.A.K. initiated the study, developed the general concept, provided interpretation of the results and led the project. O.V.G. designed and implemented the study framework in order to apply bioinformatics tools and statistical approaches for biological interpretation of the obtained results. E.M. provided statistical analysis, computer simulations, and programming. All the authors were actively involved in writing the draft and preparing the final version of the manuscript.

Acknowledgements

This work was supported by the Biomedical Research Council of A*STAR (Agency for Science, Technology and Research), Singapore.

This article has been published as part of