School of Computer, Wuhan University, Wuhan 430072, PR China

Molecular and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA

Abstract

Background

Alternative splicing is a ubiquitous gene regulatory mechanism that dramatically increases the complexity of the proteome. However, the mechanism for regulating alternative splicing is poorly understood, and the study of coordinated splicing regulation has been limited to individual cases. To study genome-wide splicing regulation, we integrate many human RNA-seq datasets to identify splicing modules, each defined as a set of cassette exons co-regulated by the same splicing factors.

Results

We have designed a tensor-based approach to identify co-splicing clusters that appear frequently across multiple conditions and are thus very likely to represent splicing modules, the units of the splicing regulatory network. In particular, we model each RNA-seq dataset as a co-splicing network, where the nodes represent exons and the edges are weighted by the correlations between exon inclusion rate profiles. We apply our tensor-based method to the 38 co-splicing networks derived from human RNA-seq datasets and identify an atlas of frequent co-splicing clusters. We demonstrate that these clusters represent potential splicing modules by validating them against four biological knowledge databases. The likelihood that a frequent co-splicing cluster is biologically meaningful increases with its recurrence across multiple datasets, highlighting the importance of the integrative approach.

Conclusions

Co-splicing clusters reveal novel functional groups that cannot be identified by co-expression clusters; in particular, they can grant new insights into functions associated with post-transcriptional regulation. The same exons can dynamically participate in different pathways, depending on the conditions and on the other exons with which they are co-spliced. We propose that identifying splicing modules, the units of the splicing regulatory network, can serve as an important step toward deciphering the splicing code.

Background

Alternative splicing provides an important means for generating proteomic diversity. Recent estimates indicate that nearly 95% of human multi-exon genes are alternatively spliced

A central concept in transcription regulation is the

The recent development of RNA-seq technology provides a revolutionary tool to study alternative splicing. From each RNA-seq dataset, we can derive not only the expression levels of genes, but also those of exons and transcripts (i.e., splicing isoforms). Given an RNA-seq dataset containing a set of samples, we can calculate the inclusion rate of each exon in every sample as the ratio between its expression level and that of the host gene. (In this study we consider only cassette exons, which are common in alternative splicing events; henceforth, the term "exon" always means "cassette exon".) A recent study provided a nice example of studying splicing regulatory relationships using a network of exon-exon, exon-gene, and gene-gene links

A heavy subgraph in a weighted co-splicing network represents a set of exons that are highly correlated in their inclusion rate profiles; i.e., they are co-spliced. A set of exons which

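In this paper, we adopt our recently developed tensor-based approach to find heavy subgraphs that frequently occur in multiple weighted networks. A co-splicing network of $n$ nodes (exons) can be represented as an $n \times n$ adjacency matrix, where element $a_{ij}$ is the weight of the edge between nodes $i$ and $j$. Given $m$ co-splicing networks on the same $n$ nodes, the whole collection can be represented as a 3rd-order tensor (or 3-dimensional array) of size $n \times n \times m$. An element $a_{ijk}$ of the tensor is the weight of the edge between nodes $i$ and $j$ in the $k$th network, as illustrated in the accompanying figure.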


**Illustration of the 3rd-order tensor representation of a collection of networks**. A collection of co-splicing networks can be "stacked" into a third-order tensor such that each slice represents the adjacency matrix of one network. The weights of edges in the co-splicing networks and their corresponding entries in the tensor are color-coded according to the scale to the right of the figure. After reordering the tensor by the exon and network membership vectors, a frequent co-splicing cluster (colored in red) emerges in the top-left corner. It is composed of exons

We applied our tensor algorithm to 38 weighted exon co-splicing networks derived from human RNA-seq datasets. We identified an atlas of frequent co-splicing clusters and validated them against four biological knowledge bases: Gene Ontology annotations, RNA-binding motif database, 191 ENCODE genome-wide ChIP-seq profiles, and protein complex database. We demonstrate that the likelihood for an exon cluster to be biologically meaningful increases with its recurrence across multiple datasets, highlighting the benefit of the integrative approach. Moreover, we show that co-splicing clusters can reveal novel functional groups that cannot be identified by co-expression clusters. Finally, we show that the same exons can dynamically participate in different pathways, depending on different conditions and different other exons that are co-spliced.

Results

We identified 38 human RNA-seq datasets from the NCBI Sequence Read Archive (^{th }percentile across at least 6 samples. This criterion resulted in inclusion rate profiles for 16,024 exons covering 9,532 genes. Based on these profiles, we constructed an exon co-splicing network from each RNA-seq dataset by using Pearson's correlation between exons' inclusion rate profiles. For details of data processing, refer to the additional file.
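The two steps described here, computing inclusion rates and correlating them, can be sketched as follows. This is only an illustrative sketch: the function names, the `eps` guard against division by zero, and the choice to zero the diagonal are assumptions, and any filtering or transformation of the correlations described in the additional file is not reproduced.

```python
import numpy as np

def inclusion_rates(exon_expr, host_gene_expr, eps=1e-12):
    """Inclusion rate per exon and sample: exon expression / host-gene expression.

    Both inputs have shape (n_exons, n_samples); row i of host_gene_expr is the
    expression of the gene that hosts exon i. `eps` is a hypothetical guard
    against division by zero.
    """
    exon_expr = np.asarray(exon_expr, dtype=float)
    host_gene_expr = np.asarray(host_gene_expr, dtype=float)
    return exon_expr / np.maximum(host_gene_expr, eps)

def cosplicing_network(rates):
    """Weighted adjacency matrix: Pearson correlation between inclusion rate profiles."""
    corr = np.corrcoef(rates)      # rows are exons, columns are samples
    np.fill_diagonal(corr, 0.0)    # assumption: no self-edges
    return corr

if __name__ == "__main__":
    exon_expr = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
    gene_expr = [[10, 10, 10, 10]] * 3
    network = cosplicing_network(inclusion_rates(exon_expr, gene_expr))
    print(np.round(network, 3))
```

Note that `np.corrcoef` yields NaN for an exon whose inclusion rate is constant across samples, so such exons would need to be filtered out beforehand.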

**Supplementary material**. The additional file provides supplementary material giving details of data processing and methods.


We applied our method to 38 RNA-seq datasets generated under various experimental conditions. Adopting the empirical criteria of "heaviness" ≥ 0.4 and cluster size ≥5 exons, we identified 7,194/3,104/1,422/594 co-splicing clusters with recurrences ≥3/4/5/6.
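To make the selection criteria concrete, the sketch below counts in how many networks a candidate exon set is "heavy" and applies the size and recurrence cutoffs. It assumes that heaviness means the average edge weight among the cluster's exons within one network, which is an interpretation rather than a definition given in this section; all function names are hypothetical.

```python
import numpy as np

def heaviness(adjacency, exons):
    """Average edge weight among `exons` in one network (assumed definition)."""
    idx = np.asarray(exons)
    k = idx.size
    if k < 2:
        return 0.0
    sub = adjacency[np.ix_(idx, idx)]
    # Off-diagonal sum counts every undirected edge twice, as does k * (k - 1).
    return float((sub.sum() - np.trace(sub)) / (k * (k - 1)))

def recurrence(tensor, exons, min_heaviness=0.4):
    """Number of networks (slices along the last axis) where the cluster is heavy."""
    return int(sum(
        heaviness(tensor[:, :, k], exons) >= min_heaviness
        for k in range(tensor.shape[2])
    ))

def is_frequent_cluster(tensor, exons, min_heaviness=0.4, min_size=5, min_recurrence=3):
    """Apply the size, heaviness, and recurrence cutoffs to one candidate cluster."""
    if len(exons) < min_size:
        return False
    return recurrence(tensor, exons, min_heaviness) >= min_recurrence

if __name__ == "__main__":
    tensor = np.full((6, 6, 4), 0.1)
    tensor[:5, :5, :3] = 0.8               # exons 0-4 are tightly linked in networks 0-2
    for k in range(tensor.shape[2]):
        np.fill_diagonal(tensor[:, :, k], 0.0)
    print(recurrence(tensor, [0, 1, 2, 3, 4]))
    print(is_frequent_cluster(tensor, [0, 1, 2, 3, 4]))
```

Raising `min_recurrence` from 3 to 6 corresponds to the progressively stricter cluster sets counted above.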

Frequent co-splicing clusters are likely to represent functional modules, splicing modules, transcriptional modules, and protein complexes

To assess the biological significance of the identified patterns, we evaluated the extent to which these exon clusters represent functional modules, splicing modules, transcriptional regulatory modules, and protein complexes. Because the number of background "genes" differs among these analyses, we set a different p-value threshold for each significance test.

Functional analysis

We evaluated the functional homogeneity of the host genes in an exon cluster using Gene Ontology (GO) annotations. To ensure the specificity of GO terms, we filtered out general GO terms associated with


**Evaluation of the functional, splicing, transcriptional, and protein complex homogeneity of co-splicing clusters with different recurrences**. Four types of databases are used: **(A)** Gene Ontology for functional enrichment, **(B)** SpliceAid2 database for splicing enrichment, **(C)** ENCODE database for transcriptional and epigenetic enrichment, and **(D)** CORUM database for protein complex enrichment. The

Splicing regulatory analysis

By construction, the exons in our identified co-splicing clusters have highly correlated inclusion rate profiles across different experimental conditions. Clusters meeting this criterion are likely to consist of exons co-regulated by the same splicing factors. It has been shown that splicing factors can affect alternative splicing by interacting with cis-regulatory elements in a position-dependent manner

We found that some splicing factors tend to co-bind to the cis-regulatory regions of exons in a co-splicing cluster, suggesting the combinatorial regulation of those splicing factors. Trans-acting

Transcriptional and epigenomic analysis

To evaluate how co-splicing is affected by transcriptional regulation, we used 191 ChIP-seq profiles generated by the Encyclopedia of DNA Elements (ENCODE) consortium

Protein complex analysis

We evaluated the extent to which the host genes of our identified exon clusters correspond to protein complexes by using the Comprehensive Resource of Mammalian protein complexes database (CORUM, September 2009 version)

Co-splicing clusters reveal novel functions that are not identified by co-expression clusters

Studies have shown that genes that are co-regulated transcriptionally do not necessarily overlap with those that are co-spliced

For example, one co-splicing cluster has seven host genes:

Exons can dynamically participate in different pathways upon different co-splicing mechanisms

Alternatively skipping or including a cassette exon can change the functions of a protein by deleting or inserting a protein domain. In other words, protein isoforms alternatively spliced from the same gene may participate in different pathways. In our results, we observed that 70.3%/52.3%/38.3%/27.1% of exons are members of at least two clusters (recurrence≥3/4/5/6) with different functions. For example, exon8 of the gene

Conclusions

The splicing code is determined by a combination of many factors, such as cis-regulatory elements and trans-acting factors. If some exons share the same splicing code, they may form a splicing module: a unit in the splicing regulatory network. Therefore, identifying co-splicing clusters first and then investigating their cis-regulatory elements and associated trans-acting factors can serve as an important step toward deciphering the splicing code. Our tensor-based approach can identify co-spliced exon clusters that frequently appear in multiple RNA-seq datasets. The exons in a frequent co-splicing cluster can belong to different genes, but are very likely to be co-regulated by the same splicing factors, thus forming a splicing module. We demonstrated that the identified clusters represent meaningful biological modules, i.e., functional modules, splicing modules, transcriptional modules, and protein complexes, by validating them against four biological knowledge databases. In all four types of enrichment results, the likelihood that a co-splicing cluster is biologically meaningful increases with its recurrence. This consistent behavior highlights the importance of the integrative approach. We also showed that co-splicing clusters can reveal novel functionally related genes that cannot be identified by co-expression clusters, and that the same exons can dynamically participate in different pathways, depending on the conditions and on the other exons with which they are co-spliced. The

Methods

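Given an RNA-seq dataset, we construct a co-splicing network where nodes represent exons and edges are weighted by the correlation between two exon inclusion rate profiles. Given $m$ co-splicing networks with the same $n$ nodes but different edge weights, we represent the whole system as a 3rd-order tensor $\mathcal{A} = (a_{ijk})_{n \times n \times m}$. A frequent co-splicing cluster (FSC) is defined by two membership vectors: (i) the exon membership vector $\mathbf{x} = (x_1, \ldots, x_n)^T$, where $x_i = 1$ if exon $i$ belongs to the cluster and $x_i = 0$ otherwise; and (ii) the network membership vector $\mathbf{y} = (y_1, \ldots, y_m)^T$, where $y_k = 1$ if the exons of the cluster are heavily interconnected in network $k$ and $y_k = 0$ otherwise. The summed weight of all edges in the FSC is

$$H_{\mathcal{A}}(\mathbf{x}, \mathbf{y}) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{m} a_{ijk}\, x_i\, x_j\, y_k \tag{1}$$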

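Note that only the weights of edges $a_{ijk}$ with $x_i = x_j = y_k = 1$ are counted in the summed weight $H_{\mathcal{A}}(\mathbf{x}, \mathbf{y})$ of the cluster defined by $\mathbf{x}$ and $\mathbf{y}$. The problem of discovering a frequent co-splicing cluster can be formulated as a discrete combinatorial optimization problem: among all patterns of a fixed size ($K_1$ member exons and $K_2$ member networks), find the binary membership vectors $\mathbf{x}$ and $\mathbf{y}$ that jointly maximize $H_{\mathcal{A}}(\mathbf{x}, \mathbf{y})$.

This discrete formulation has two drawbacks. The first is that the size parameters $K_1$ and $K_2$ are hard for users to provide and control. The second is computational cost: searching for the heaviest subgraph of a fixed size is NP-hard. We therefore relax the binary constraints and look for non-negative real vectors $\mathbf{x}$ and $\mathbf{y}$ that solve

$$\max_{\mathbf{x} \in \mathbb{R}_{+}^{n},\ \mathbf{y} \in \mathbb{R}_{+}^{m}} H_{\mathcal{A}}(\mathbf{x}, \mathbf{y}) \quad \text{subject to} \quad f(\mathbf{x}) = 1,\ g(\mathbf{y}) = 1 \tag{2}$$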

where $\mathbb{R}_{+}$ is the non-negative real space, and $f(\mathbf{x})$ and $g(\mathbf{y})$ are vector norms. After solving Eq. (2), users can easily identify the top-ranking networks (after sorting the tensor by $\mathbf{y}$) and top-ranking exons (after sorting each network by $\mathbf{x}$) contributing to the objective function. After rearranging the networks in this manner, the FSC with the largest heaviness occupies a corner of the 3D tensor. We can then mask all edges in the heaviest FSC with zeros, and optimize Eq. (2) again to search for the next FSC.
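The bookkeeping around the optimization (stacking networks into a tensor, scoring a pair of membership vectors, ranking, and masking) can be sketched as below. The optimizer itself is not shown; `x` and `y` are taken as given. The sketch assumes the objective is the summed edge weight, half the sum of $a_{ijk} x_i x_j y_k$ over all indices, and the cutoffs `n_exons` and `n_networks` as well as all function names are hypothetical.

```python
import numpy as np

def stack_networks(networks):
    """Stack m adjacency matrices of shape (n, n) into an (n, n, m) tensor."""
    return np.stack(networks, axis=2)

def summed_weight(tensor, x, y):
    """Assumed objective: 0.5 * sum_ijk a_ijk * x_i * x_j * y_k."""
    return 0.5 * float(np.einsum("ijk,i,j,k->", tensor, x, x, y))

def top_members(x, y, n_exons, n_networks):
    """Indices of the highest-ranked exons and networks by membership value."""
    exons = np.sort(np.argsort(-np.asarray(x))[:n_exons])
    networks = np.sort(np.argsort(-np.asarray(y))[:n_networks])
    return exons, networks

def mask_cluster(tensor, exons, networks):
    """Return a copy of the tensor with the cluster's edges set to zero."""
    masked = tensor.copy()
    masked[np.ix_(exons, exons, networks)] = 0.0
    return masked

if __name__ == "__main__":
    block = np.zeros((4, 4))
    block[:3, :3] = 1.0
    np.fill_diagonal(block, 0.0)           # triangle on exons 0, 1, 2
    tensor = stack_networks([block, block, np.zeros((4, 4))])

    x = np.array([0.9, 0.8, 0.7, 0.1])     # stand-in for an optimized exon vector
    y = np.array([0.9, 0.8, 0.05])         # stand-in for an optimized network vector
    exons, networks = top_members(x, y, n_exons=3, n_networks=2)
    print(exons, networks)
    print(summed_weight(mask_cluster(tensor, exons, networks), x, y))
```

Masking on a copy keeps the original tensor available for evaluating each extracted cluster afterwards.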

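The choice of vector norms in Eq. (2) has a significant impact on the outcome of the optimization. A vector norm defined as $\|\mathbf{x}\|_p = \left(\sum_{i} |x_i|^p\right)^{1/p}$ ($p > 0$) is also called an "$L_p$-vector norm". In general, the closer $p$ is to zero, the sparser the solution favored by the $L_p$-norm; that is, fewer components of the optimized vectors are significantly different from zero. As $p$ increases, the solution favored by the $L_p$-norm grows smoother; in the extreme case $p \to \infty$, the norm depends only on the largest component. For the exon membership vector we consider the mixed norm $L_{0,\infty}(\mathbf{x}) = \alpha\|\mathbf{x}\|_0 + (1 - \alpha)\|\mathbf{x}\|_\infty$ ($0 < \alpha < 1$) for $f(\mathbf{x})$. The norm $L_0$ favors sparsity while the norm $L_\infty$ encourages smoothness in the non-zero components of $\mathbf{x}$. In practice, we approximate $L_{0,\infty}(\mathbf{x})$ with another mixed norm: $L_{p,2}(\mathbf{x}) = \alpha\|\mathbf{x}\|_p + (1 - \alpha)\|\mathbf{x}\|_2$, where $0 < p < 1$. For the network membership vector, $g(\mathbf{y})$ is the $L_\infty$ norm. In practice, we approximate $L_\infty$ with $L_q(\mathbf{y})$, where $q > 1$. Therefore, the vector norms $f(\mathbf{x})$ and $g(\mathbf{y})$ are fully specified as follows,

$$f(\mathbf{x}) = \alpha\|\mathbf{x}\|_p + (1 - \alpha)\|\mathbf{x}\|_2, \qquad g(\mathbf{y}) = \|\mathbf{y}\|_q \tag{3}$$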

We performed simulations to determine suitable values for the parameters

Since the vector norm $f(\mathbf{x})$ is non-convex, our tensor method requires an optimization protocol that can deal with non-convex constraints. The quality of the optimum discovered for a non-convex problem depends heavily on the numerical procedure. Standard numerical techniques such as gradient descent converge to a local optimum of the solution space, and different procedures often find different local optima. Thus, it is important to find a theoretically justified numerical procedure. We use an advanced framework known as multi-stage convex relaxation, which has good numerical properties for non-convex optimization problems. In this framework, $f(\mathbf{x})$ is replaced by a convex function built from a specific convex function of $\mathbf{x}$ (the square, $\mathbf{x}^{2}$) and a vector $\mathbf{v}$ that contains coefficients that will be automatically generated during the optimization process. After each optimization, the new coefficient vector $\mathbf{v}$ yields a new convex approximation of $f(\mathbf{x})$. Details of our tensor-based optimization method can be found in the additional file.

Once the membership vectors (i.e., the solution of Eq. (2)) have been found by optimization, the frequent co-splicing clusters can be intuitively obtained by including those exons and networks with large membership values. However, any given solution can result in multiple overlapping patterns whose "heaviness" is greater than a specified threshold. Here,

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

XJZ conceived the project; DC and WL performed the research; DC, WL, and XJZ wrote the paper; JL provided input and suggestions. All authors read and approved the final manuscript.

Acknowledgements

The work presented in this paper was supported by National Institutes of Health Grants R01GM074163 and NSF Grant 0747475 to XJZ, and National Science Foundation of China 60970063, Program for New Century Excellent Talents in University NCET-10-0644, the Ph.D. Programs Foundation of Ministry of Education of China 20090141110026 and the Fundamental Research Funds for the Central Universities 6081007 to JL.

This article has been published as part of