Abstract
Background
RNA metabolism, through 'combinatorial splicing', can generate enormous structural diversity in the proteome. Alternative domains may interact, however, with unpredictable phenotypic consequences, necessitating integrated RNAlevel regulation of molecular composition. Splicing correlations within transcripts of single genes provide valuable clues to functional relationships among molecular domains as well as genomic targets for higherorder splicing regulation.
Results
We present tools to visualize complex splicing patterns in fulllength cDNA libraries. Developmental changes in pairwise correlations are presented vectorially in 'clock plots' and linkage grids. Higherorder correlations are assessed statistically through Monte Carlo analysis of a loglinear model with an empiricalBayes estimate of the true probabilities of observed and unobserved splice forms. Loglinear coefficients are visualized in a 'spliceprint,' a signature of splice correlations in the transcriptome. We present two novel metrics: the linkage change index, which measures the directional change in pairwise correlation with tissue differentiation, and the accuracy index, a very simple goodnessoffit metric that is more sensitive than the integrated squared error when applied to sparsely populated tables, and unlike chisquare, does not diverge at low variance. Considerable attention is given to sparse contingency tables, which are inherent to singlegene libraries.
Conclusion
Patterns of splicing correlations are revealed, which span a broad range of interaction order and change in development. The methods have a broad scope of applicability, beyond the single gene – including, for example, multiple gene interactions in the complete transcriptome.
Background
Through alternative splicing at multiple sites, a single transcriptional unit may give rise to a complex array of isoforms – a 'mini transcriptome,' or singlegene transcriptome (SGT). Considerable effort is being invested to assemble a genomewide compendium of sites of transcript and peptide variations [16], the guiding principle being that combinatorial splicing may profoundly expand the proteome, and consequently the phenotypic repertoire, without increasing the number of structural genes [7].
Such phenotypic elaboration may arise through simple combinations of individual 'modular' domains, or through cooperative effects from multiple variable domains that interact functionally [8]. The latter complication imposes significant regulatory demands – for the efficient selection of combinations of viable combinations – which may in part underly the expansion of the noncoding genome [9,10]. To understand gene function more fully, we must determine how all the possible alternative domains are actually combined by the RNA splicing process into working molecules. This requires structural analysis of large numbers of fulllength transcripts expressed from each gene, approaching numbers currently available in wholegenome fulllength cDNA libraries derived from wholegenomes [1116] to ensure adequate representation of the less abundant variants.
EST, microarray, and proteomics methodologies for largescale alternative splicing surveys share the limitation that by sampling fragments of macromolecules they cannot capture most intramolecular linkage information. That is, they primarily yield marginal splicing frequencies but not correlations. The latter are invaluable for discerning the tissue and cellspecific functional differentiation of splice variants [4,17]. Fulllength cDNAs, by contrast, provide diaries of intramolecular splicing choices. We have developed robust methods for production of fulllength, nonrecombinant, statistically representative singlegene libraries (SGLs) [18]. An SGL is a 'vertical' sample of the transcriptome, representing its basic building block, the SGT.
Initial analyses of moderatescale SGLs [8,18,19] reveal fascinating developmental changes in both the extent and pattern of splicing linkage. These represent developmental changes in the regulatory programs that establish the selection rules for combining variable domains into functional ensembles, thus establishing the molecular phenotype. Recent developments in production of largescale fulllength cDNA libraries and highthroughput sequencing techniques [20,21] promise to provide this information at high resolution and on a meaningful scale.
As largescale SGT data become available, reliable statistical methods will be required to discern potentially complex splicing interactions. Two features of such data have direct relevance to the present work: (1) highorder splicing correlations, which reflect highorder functional interactions within single molecules, and (2) very sparse representation of the complete configuration space, which presents a fundamental challenge to statistical analysis. Data sparseness results from a They typical situation where several alternative splice loci have lowfrequency variations, rendering a substantial fraction of splice combinations as simply improbable.
Here we present tools for statistical analysis of high order splicing correlations in fulllength SGLs
A thorough statistical analysis of alternative splicing (1) begins with fulllength cDNA libraries that are unbiased, representative samples of the mRNA populations, (2) presents a complete discussion of marginal splicing frequencies as well as correlations, including tissuetotissue variations in these parameters, and (3) employs a stochastic splicing model where the observed data is a likely outcome. We present tools for analysis of splicing correlations in fulllength SGLs, with careful attention to handling of sparse contingency tables. We provide simple, novel graphical visualizations that accentuate coordinated changes within groups of variable loci. The data – human low voltageactivated calcium channel gene (CACNA1G) SGLs from fetal and adult whole brain – derive from recent work [8], where we discuss in detail marginal splicing frequencies and their developmental changes, as well as splicing correlations, effects on ion channel function of alternative splicing at individual alternative domains, and highorder, nonadditive functional interactions among multiple alternative domains. The present work builds on that material, presenting a detailed exposition of the statistical model, which was beyond the scope of that publication, a thorough analysis of splicing correlations, and visual tools that will prove generally useful in studies of splicing linkages. Since our focus is correlations, we limit our discussion of marginal frequencies to the extent to which it informs our understanding of correlations. Though we employ a particular dataset and splicing model for the sake of illustration, these analytical tools are not modelspecific and are generally applicable to other types of multivariate data.
Methods
CACNA1G Gene and SGL Data
Table S1 presents the data for the present statistical analyses, listing the frequencies of fulllength splice variants in wholebrain SGLs from two developmental stages of expression of the human calcium channel gene CACNA1G, described in reference 8. Marginal frequencies at the separate alternative loci calculated from this data are given in table S3, and are plotted in Figure S5 with confidence intervals. Figure 1 is a schematized CACNA1G gene structure illustrating the nine alternatively spliced exon segments (white blocks) and a splicing graph depicting the ways in exons may be joined by alternative splicing. There are 512 distinct structures, most of which have not been observed, and in fact may not be expressed.
Figure 1. CACNA1G Alternative Splicing Graph. The CACNA1G gene comprises 38 known exons and 37 introns. All known exons or exon segments are shown. Lengths are not to scale. Introns are omitted. Black boxes depict constitutive (invariate) exons or exon segments, identified by the upper set of labels. Thin black segments on the ends are noncoding termini, corresponding to the alternative splice loci. White blocks depict segments corresponding to the alternative splice loci, labeled below the figure. Line segments joining the boxes, above and below, indicate potential alternative splices.
Definitions
An alternative splice locus is a categorical variable representing a region of the primary transcript that is subject to alternative splicing (we may omit the qualifier 'alternative' when the meaning is clear). We consider alternative splicing a stochastic process, so that every alternative sequence configuration C_{S}, of any set of loci S, is a random variable with splicing probability p_{S}(C_{S}). We say that a locus j has a strong splicing bias if any p_{j}(C_{j}) approaches 1. For a single locus j, the configuration C_{j }may be represented by an integer between 0 and g_{j } 1, where g_{j }is the number of sequence alternatives, or multiplicity, of j. For a cassette exon, spliced in or out as a unit, it is often convenient to assign 1 to the insertion and 0 to the deletion, although the reverse may occasionally be more convenient – if the insertion is rare, for example. What defines a locus may also be flexible, depending on the purpose. An isolated cassette exon unambiguously defines a single binary locus, with multiplicity 2. Two adjacent alternatively spliced cassette exons, however, may be considered two loci with g = 2 or a single locus with g = 4. If the two exons are mutually exclusive then the best representation may be as a single locus with g = 3. Figure 1 illustrates the nine alternative loci in the CANCA1G gene.
If splicing at separate loci is independent [17,19,22], the expected frequency of each splice variant v is equal to its independent stochastic expectation
where f_{j}(v) is the marginal frequency of C_{j}(v), the configuration of locus j in splice variant v. The number of possible fulllength transcript variants is
Computations and illustrations were made in the R programming language [23].
Mutual information methods
We quantify splicing linkage between a pair of loci i and j with the mutual information I(i, j), which measures the reduction in uncertainty about the configuration of one locus when that of the other is specified [24]: I(i, j) = H(i) + H(j)  H(i, j), where H(i) + H(j) is the expected entropy of C_{ij }given independent splicing at the observed marginal frequencies, and H(i, j) = ∑_{ij}p(C_{ij})·log p(C_{ij}) is the observed entropy.
While mutual information is nonnegative, it is useful to define a directed, or 'configurationspecific' mutual information, which may be negative. The sign gives the direction of correlation between a specific pair of configurations, called the reference configurations. For example, if we define the reference configuration at a pair of binary loci as the insertion at both loci, then a negative value means that insertion at one locus correlates with deletion at the other. The choice of reference configuration is arbitrary, and reversing the reference configuration for one locus simply reverses the sign of the configurationspecific mutual information.
The dependency, D(ij) = I(i, j)/H(i), is the mutual information normalized to its maximum possible value, the total entropy of the 'dependent' locus, i. It measures the degree to which the independent variable is a predictor of the dependent variable: the one with lower marginal entropy has a higher dependency on the other.
The Linkage change index
To quantify developmental changes in linkage we introduce the linkage change index, S_{D }(Figure 2A–C). For a given pair of loci, we define the linkage vector D = (x, y), with components x – the splicing linkage in the fetal population, and y – the adult linkage. The difference y  x is a simple measure of the developmental change in linkage: it is zero when linkage is the same (x = y) at both stages and maximal (positive or negative) for a complete reversal of linkage (x = y). Scaling to the length of D gives
Figure 2. Two metric indices introduced in this work. A, The Linkage Vector, D, compares splicing correlation between two loci in two populations; B, The Linkage Change Index S_{D }measures displacement of D from the zerochange axis (x'). C, S_{D }is a sinusoidal function of the phase φ between D and x', the axis of zero linkage change. Positive values indicate an increasing positive or decreasing negative linkage change, and magnitude S_{D} > 1 indicates a sign reversal (c.f. panel B). D, The Accuracy Index,
= a/b, measures displacement of a point estimator x_{0 }from the center of mass of a distribution p(x).
where θ is the angle between D and the x axis (Figure 2A) and φ = θ  π/4 is the angle between D and the transformed x axis, x', representing unaltered linkage. In polar coordinates, D ranges in magnitude between about 1 and 1.4 times the larger of x and y, and has phase φ, the relative developmental change in linkage. Note that sin φ = y'/D, so S_{D }is proportional to y', the perpendicular displacement of D from the x' axis.
S_{D }is a more straightforward index of linkage change than either φ or sin φ (c.f. Figure 2B): S_{D }is positive when splicing correlation becomes more positive or less negative with development (0 <φ <π, magenta lines in Figure 2B and 2C), and if the correlation actually reverses direction from negative to positive, then S_{D }exceeds 1 (π/4 <φ < 3 π/4, quadrant II). Likewise, S_{D }< 0 reflects increasing negative (red lines), or decreasing positive correlation, and S_{D }< 1 means the linkage reverses, from positive to negative (quadrant IV). Figure 2C plots S_{D }as a function of the phase φ, annotated corresponding to Figure 2B.
Assessing higherorder linkages with loglinear models
Given a table of frequencies for N_{T }splice variants (Table S1), it is natural to arrange the data in a kdimensional contingency table with k variables (j = 1,..., k) of g_{j }categories (C_{j }= 0,..., g_{j } 1) each. This simplifies calculation of marginal frequencies. We fit the complete contingency table to a loglinear model [25], giving the logfrequency of each splice variant as a sum of coefficients u_{S}(C_{S}), which measure the extent of mutual correlation among a set of loci S with configuration C_{S}. The most complete, or saturated, loglinear model is:
where subscripts refer to alternative splice loci. Each fulllength splice variant C_{123...k }has a unique equation (3), giving N_{T }equations in all. The saturated model has a term for every possible subset of loci plus an intercept, u, with terms in all. With all binary loci, the saturated model has 2^{k }equations of 2^{k }terms each. An unsaturated model is hierarchical if the presence of a uterm for any group of loci S implies a uterm for every subset of S.
For any locus j in a set of loci S, the sum of u_{S}(C_{S}) over all configurations of j is constrained to zero. Thus, for any C_{S }and differing only in the configuration of a binary locus j, u_{S}(C_{S}) = u_{S}(). If all loci in S are binary, then all terms u_{S}(C_{S}) have the same magnitude, u_{S}. Working with all binary loci thus simplifies the analysis, but does not otherwise alter the capabilities of the method. We use normalized frequencies, rather than total counts, to allow direct comparison of populations of different sizes without rescaling
Accuracy index
The 'accuracy,' , measures the extent to which a point, x_{0}, is centered within a distribution, p(x). This has the advantage of extreme simplicity: it is the ratio of areas a and b in Figure 2D, where a is always the smaller of the two areas ∑{p(x): x ≤ x_{0}} and ∑{p(x): x ≥ x_{0}}. For a continuous pdf the two areas are and . Note that both a and b include p(x_{0}). This is by design, as it yields a meaningful result for any x_{0 }and distribution p(x). In the extreme lowvariance limit, for example, if the p(x) is an impulse δ(c), then (x) = 0 for all x except x = c, where (x) = 1. The accuracy is thus always defined, and ranges from 0, when x_{0 }lies completely outside the distribution, to 1, when x_{0 }is the median of the distribution.
Supplementary Figures, Tables, and expressions
Figures, tables and expressions with labeled beginning with 'S' are in 1.
Additional File 1. A PDF file with a table listing the data used for these analyses (Table S1), a graphical scheme illustrating how modular and interactive splicing relate to splice configuration (Figure S1), and a detailed exposition of the empirical Bayesian approach, with an extensive analysis of sensitivity to the choice of prior distribution.
Format: PDF Size: 1.1MB Download file
This file can be viewed with: Adobe Acrobat Reader
Results
Some interesting developmental splicing changes are evident simply form inspection of table S1. For example, the adult brain expresses a complement of structures that is very distinct from the fetalbrain population. A large portion of the difference is attributable to a developmental switch from insertion of segment 26 to 25C between constitutive exons 25B and 27. A broad array of other changes occur as well, most due to changes in splicing correlations. For example, the fetal brain expresses a significantly more diverse population of transcripts (from a much less diverse population of expressing cell types) than the adult brain, even though the marginal entropies of the alternative loci are highly similar and predict populations of equivalent diversity. See reference 8 for a detailed discussion of these matters.
Clock plots
Figure 3A plots splicing linkage between a pair of loci at one developmental stage versus another to display the developmental change in linkage. We call this a 'clock plot:' Each point is a vector whose magnitude measures the overall splicing linkage between the two loci and whose direction (displacement from the diagonal) indicates developmental regulation of linkage. Splicing may be developmentally regulated at both loci, but if they are regulated independently the plot point will fall on the origin, no matter how great the changes in splicing. If two loci are linked, but their linkage does not change with development, the point will lie away from the origin but on the diagonal. Thus, one pair of loci (1 and 2) shows a slight positive correlation between the reference configurations at the early stage, but this correlation increases greatly with development. The second pair also undergoes a developmental change in linkage, but in this case the loci become less correlated at the later stage. Linkage in this case is configurationspecific mutual information, which may be positive or negative. Plotting the configuration dependence allows us to see a reversal in the direction of correlation – manifest as a reflection about one axis – that may occur even in the absence of a change in mutual information.
Figure 3. Clock plots. A, Illustrative example. Each circle represents a pair of alternative loci (e.g. loci '1' and '2'). The gray vectors depict splicing linkage that is present but does not change with development. B, Mutual information clock plot for all 36 pairs of the nine loci in the fetal and adult cDNA populations. Pairs of loci are identified in the plot symbols by colors defined in the key. The reference configuration is the insertion for every locus. C, Dependency clock plot for the single pair of loci (25C, 26). orange: D(25C26); blue: D(2625C). D, Mutual information clock plot for the eight pairs involving locus 14: I(x, 14), ∀x ≠ 14. E, Dependencies, D(x14), of the same pairs.
Figure 3B is a clock plot displaying all 36 pairs of the nine alternative splice loci in the CACNA1G gene in fetal and adult human brain (data are in Table S1). The points are dispersed primarily along the adult axis, indicating a general developmental increase in splicing linkage among most pairs of loci, an interesting exception being those that involve locus 38B (violet). Splicing at one pair of loci in particular, 25C and 26, is highly linked, with insertion at one locus favoring deletion at the other in both stages of development, but much more strongly so in the adult than in the fetal brain. Several loci show considerable pairwise splicing linkages with multiple other loci. We note that domains that correlate structurally in this way are good candidates for some kind of functional relationship, and multiple pairwise splicing linkages to a single locus, as seen here, may reflect either simple pairwise functional interactions or a higherorder interrelationship. We explore the latter possibility in the next section.
Figure 3C, plots the (configurationspecific) dependency of 25C on 26 (orange) and that of 26 on 25C (blue). The dependency measures the extent to which splicing at one locus predicts splicing at the other. Unlike mutual information, the dependency is an asymmetric function of the two loci, and may reveal relationships that are less apparent with mutual information. In this case the strong developmental change in linkage is manifest more as a change in dependency of 26 on 25C rather than the other way around, as indicated by the much greater displacement of the blue point from the diagonal. This reflects a greater discrepancy in entropy of those two loci in the fetal population than in the adult (Table S3).
The statistical uncertainty, indicated by the dispersion around each data point in Figure 3C–E, for example, is obtained from 1000 simulated populations for each tissue, sampled by Monte Carlo from the empirical Bayes estimate (described below) of the distribution of splice variants in each tissue. For every pair of loci in each tissue, the mutual information (or dependency) values were binned into a histogram. Because splicing in the two tissues may safely be considered independent, the twodimensional joint distribution for a pair of loci is the Cartesian cross product of the resulting bincounts vectors from the two tissues. The error ring encloses the 95% mostprobable values in this case.
Figures 3D and 3E allow comparison of mutual information and dependency for the same pairs of loci, in this case all pairs involving locus 14. The data represent 1000 populations sampled from the empirical Bayes estimate, and error rings enclose the 95% mostproabable values for each pair of loci.
Pairwise linkage grids
Figure 4A displays linkage grids, showing the splicing dependency of all pairs of loci in a population that are statistically significant at the level of 99%. The rows give the dependent variables and columns the independent variables. In this way we may compare in adjacent grids the extent of splicing linkage in two populations for all pairs of loci exceeding a desired significance level. It is readily apparent in these plots that brain maturation entails the appearance or strengthening of a considerable number of pairwise correlations.
Figure 4. Linkage grids. A, B. Dependency, D(ij), of splicing at one locus i on a second locus j is plotted for all pairs of loci in the fetal and adult cDNA populations. Independent loci j are on the abscissa. The layout is the same for all grids. Only dependencies at or above 99% significance are displayed. Statistical significance was determined for the two tissue samples either separately (A) of jointly (B). For a given tissue, the same values are plotted in A and B when the linkage is significant in both. Note that a linkage may be significant in one tissue but not both or vice versa. C. The linkage change index, S_{D }(left), measures the extent of change in linkage at a pair of loci in a single parameter. Only changes significant at or above 99% are shown. The right panel plots the magnitude of the linkage vector, D, which gives an indication of the overall level of linkge at each pair of loci. The color scale ranges from 0/15 to +0.15 on the top four grids and from 1.4 to +1.4 on the bottom left. Some cell values fall outside the indicated range in order to improve the dynamic range overall. Plus signs (+) indicate these saturated cells.
Figure 4B plots the dependency values for those loci showing a statistically significant developmental change in linkage. Thus, whereas in Figure 4A the significance was determined separately for each tissue, in Figure 4B it was determined for both jointly. Note that while loci 25C and 26 show a high degree of negative dependency in both tissues (c.f. the red cells in the small whitebordered box within each grid of Figure 4A), the dependency of 25C on 26 does not change significantly with development, whereas the reverse dependency increases somewhat. This reflects the different positions of the two points in Figure 3C, where the orange ring touches the diagonal.
We have introduced the linkage change index, S_{D}, to quantify these changes (c.f. methods). Figure 4C plots S_{D }and D for those pairs of loci with nonzero S_{D }at 99% significance or greater. The left grid shows the dynamic range of directional changes, while the right shows the overall magnitudes of the linkages involved. An interesting point of comparison is the values within the whitebordered box representing loci 25C and 26. These cells are quite dim within the S_{D }grid, whereas they are bright in the other grids. This shows that, while there is strong splicing dependency between these two loci at both stages, and there is a statistically significant change in linkage with development, the extent of change is actually not very great in comparison to that at other pairs of loci. As we noted above, though, the dependency of 25C on 26 undergoes a greater developmental change than the reverse dependency; this is more apparent in the plot of S_{D }than in the other plots. Other significant changes are much more obvious here as well.
Spliceprints
Up to this point we have considered splicing linkage between pairs of loci. It is possible for splicing to involve correlations of higher order. For example a segment may be deleted at locus 1 only if segments are inserted at both loci 2 and 3. This situation necessarily entails pairwise correlations between loci 1 and 2 as well as 1 and 3, but a threeway linkage is more intricate than a collection of disjoint pairwise linkages.
Statistical interactions between splicing events at separate sites may be expected to relate directly to functional interactions between the alternatively spliced domains. It is only through this mechanism that alternative splicing may exert control over the function of the expressed proteins. The nature of physicochemical interaction between domains in a protein is extremely complex, generally unpredictable, and not limited to "nearestneighbor" effects. This means that functional interactions may be expected between arbitrarily many domains, and that functional effects of a configuration change may be highly nonlinear, depending on the complement of variable domains at other sites in the same protein. We have demonstrated highly nonlinear (nonadditive) functional effects, exactly of this nature, in the CACNA1G ion channel electrophysiology [8]. Thus we should not be surprised to find higherorder splicing interactions, reflecting control of domain combinations at multiple loci, and we do not expect this phenomenon to be limited to the present system.
We model multisite alternative splicing with a loglinear model [25] to quantify higherorder linkages. Figure 5 displays the amplitudes of the loglinear coefficients from two developmental stages, for all terms of order 1 or higher in the saturated model. The contributing loci are not identified here, but terms for subsets of the same cardinality are grouped together between vertical rules. In each plot, values above the zero line are coefficients derived from the Empirical Bayes estimate of the experimental population. For comparison, traces below the midline plot mean values from 1000 Monte Carlo populations. Because only the magnitudes are plotted, the ordinate values increase with distance from zero both upward and downward. Differential splicing regulation in the fetal and adult brain appears as different patterns in the upper and lower boxes. Notice that independent splicing gives coefficients that lie on the zero line, and that the more compressed fetal pattern indicates a lower level of splicing correlations, especially at higher orders. Although we focus only on the magnitudes of the loglinear coefficients in the present work, a wealth of information is present in their signs, which would admit a multidimensional extension of the clock plot analysis. Also, of course, the spliceprint is not limited to loglinear coefficients, and the coefficients may be presented in any desired order on the abscissa.
Figure 5. Spliceprints. Loglinear coefficient magnitudes are plotted for all subsets of more than one locus. The vertical scale is the same for all traces. Within a cardinality, k, the sequence of coefficients is determined by listing the 9 loci from left to right as in Figure 4 (abscissa), and choosing groups of loci from leftmost to rightmost as follows: for loci A = 25A, B = 14, etc., cardinality3 coefficients occur in the order ABC, ABD, ACD, BCD, ABE,..., PHI, GHI.
Minimallinkage models
We may wish to identify the smallest set of interactions that can account for the data, within bounds of statistical significance. It may be surprising, for example, that some genes display nearly independent splicing at multiple sites, even in rather complicated tissues [8,17,19]. In such cases one or two pairwise interactions may account for any deviations from independence. We may identify those by first ranking the pairs in order of decreasing mutual information, then define a hierarchical model with only the most highly correlated firstorder interaction terms. A leastsquares fit to this model gives coefficients for only those interactions. We then ask whether this represents a possible parent distribution for the observed population.
A simple way to do this is to sample Monte Carlo populations from the fitted distribution, with the same number of transcripts, N, as the experimental population. For each MC population we then calculate its likelihood of arising from a reference distribution, ρ, for example:
This is a multinomial distribution of N_{T }splice variant classes, v, each with expected probability p_{v }and abundance n_{v }in the sample: ∑n_{v }= N. The reference distribution is in fact arbitrary: we are not interested in the exact likelihood of our data; rather, we wish to find a model that generates populations of likelihood similar to the data with a given reference distribution. The fraction of MC likelihoods not exceeding the observed value measures the evidence against the model. This approach is an approximation to the method of posterior predictive assessment of model fit of Gelman et al. [26]. Expression (4) constitutes their statistic T. We seek a model with sufficient departure from independence to be consistent with the observed data.
Figure 6 illustrates this method with the adult data. It shows loglikelihood histograms for six Monte Carlo ensembles of 1000 populations each. The reference distribution is the independentsplicing expectation [equation (1)] of the experimental population. The horizontal rule at 143 shows the loglikelihood of the observed population. The first ensemble (leftmost histogram) was sampled from the reference distribution, and shows that independent splicing is inconsistent with the experimental data. The reference distribution gives the 'cardinality1' loglinear model, log p(C_{123...k}) = u + ∑_{j}u_{j}(C_{j}), with all independence terms and no interactions. The next histogram is obtained by adding a single interaction to this model: u_{34}(C_{34}), the pairwise interaction between loci 25C and 26, identified by mutual information as the most highly correlated segments. Each subsequent ensemble is obtained from the previous by adding the next mostcorrelated pair. A minimum of five pairwise linkages is thus required to account for the observed splicing correlations by a stochastic mechanism within bounds of 95% confidence.
Figure 6. Posterior predictive assessment of minimal model fit. A simple model may account for the observed distribution within admissible error limits, but it may cause one to overlook important effects in a network of interactions. Histograms of loglikelihood (assuming independent splicing) are presented, each for 1000 Monte Carlo populations, sampled from a series of models incorporating an increasing number of pairwise correlations. See text for details.
Saturated models
Peptide domains that interact functionally are likely to exhibit statistical correlations reflecting enrichment of productive interactions or suppression of detrimental ones. Figures 1 and 2 show that many individual loci participate in pairwise interactions with multiple other loci. Where these reflect functional interactions, we anticipate two important consequences, due to the fact that they occur within a closepacked, folded protein: (i) they are likely to be transitive in nature – e.g., if loci A and B interact and loci B and C do, we expect that A and C will as well. Furthermore, we should expect that splicing at C will have effects that depend on A and B together, i. e. (ii) a set of loci may influence protein function as an integrated ensemble, rather than a collection of functionally separable modules. It then follows that their splicing regulation will exhibit mutual, higherorder interdependencies. Unsaturated models cannot capture these correlations accurately: A loworder model, as in Figure 6 for example, is from this perspective an oversimplification, emphasizing a few loworder interactions at the expense of a wealth of information in the higher orders. Though we correctly identify the most highly correlated domain pairs, this gives no clue how pairs may link up within a molecule, i.e., evidence for ensembles of functionally interacting protein domains.
Figure 7 shows the distribution of amplitudes among the first 4 orders of interaction terms (cardinality k = 2,...,5) from fits of four hierarchical models to the same data (adult population). Each panel plots coefficients from a model that includes all terms of cardinality k and lower, but none higher. When excluded from the model, highorder interactions are 'absorbed' into lowerorder terms. Notice, for example, the redistribution of relative amplitudes within the cardinality3 coefficients as higherorder terms are included in the model. This occurs as weight from triplets in higherorder linkage groups is shifted to their higherorder coefficients when they are made available in the model. Since we wish to compare coefficients estimated under identical models from parallel data sets, we use a saturated model to avoid confounding the loworder terms with higherorder effects. A nonzero coefficient for a set of loci then indicates a mutual splicing dependency among all loci in the set, in excess of any lowerorder interactions that may be present among component subsets, and larger magnitudes reflect stronger correlations.
Figure 7. Spliceprints of successively higherorder hierarchical models fit to the same data. Excluding highorder effects from the model changes the relative magnitudes of the remaining coefficients, thus misrepresenting the lowerorder interactions. The vertical scale is the same for all plots.
Empirical Bayes methodology
We are not interested in the precise values of the coefficients as much as the relative amplitudes of the same coefficients from two different populations, as compared sidetoside in Figure 5, for example. From this we may discern statistically significant developmental shifts in splicing linkages. This requires that we estimate the variance of the loglinear coefficients. These may be obtained from a saturated model fit to Monte Carlo populations sampled from an estimate of the parent distribution.
The simplest such estimate is the observed distribution itself (the bootstrap). While this makes no assumptions about the underlying mechanism, it assigns zero probability to the unobserved splice forms, which is obviously unreasonable. Increasing the experimental sample size, even by an order of magnitude, may not make the bootstrap applicable if the number of alternatively spliced loci is even moderately large (Figure S2): the probability space expands geometrically with the number of variables, so unless the sample is vastly larger than the number of classes the table of observed frequencies will contain a large number of zeros (empty cells). This is exaggerated when splicing at any locus is strongly biased, as is common (e.g. Figure S5). Transcripts that combine rare splice configurations at multiple loci thus have a very low expectation, though we cannot assume that any empty cells would persist if we continued data collection indefinitely.
The empirical Bayes approach [27] enables an estimate of the parent distribution with plausible nonzero probabilities for the unobserved classes. This estimate (the posterior distribution) incorporates the observed distribution (the likelihood) with our current understanding of the underlying process (the prior distribution). We have found the 'pseudoBayes' estimator of Bishop et al. [25], a linear shrinkage estimator chosen for its simplicity, to be entirely adequate. Improvements may be made to the estimator – with nonlinear shrinkage, for example, but typically at the expense of added complexity. Our models do exhibit sensitivity to the choice of prior distribution, however, because of the sparse representation of splice forms in the experimental data. We present a thorough examination of different priors in the Supplementary Information. Because splicing at separate loci is approximately independent (Figure S4), equation (1) provides an excellent prior: the tissuespecific independent marginals prior. In this work we primarily use the averagedmarginals variant of this prior, obtained by averaging the fetal and adult expected frequencies.
The empirical Bayes methodology is open to the criticism that including experimental results in the prior may lead to duplicate use of evidence and subsequent underestimation of uncertainty. Purportedly 'uninformative' priors inadvertently introduce their own errors, however, mainly by forcing untenable splice correlations into the estimator (c.f. Supplementary Information; also Figure 9). By making judicious use of the observed marginal frequencies in the prior we minimize this effect, and keep the focus of inference on the interactions. The present method has a major advantage in providing a simple comparison of two developmental stages by linkage order, with low computational burden.
Figure 8. Cardinalityaveraged loglinear coefficients. All coefficients of the same degree are presented as a single average magnitude. 1000 Monte Carlo populations were sampled from the empirical Bayes posterior obtained with the 'averagedmarginals' prior and fetal (gray) or adult (black) likelihood. Error bars delimit the 2.5 – 97.5% interquantile range for each distribution.
Figure 9. Frequencyoffrequency plots. Monte Carlo populations of 10^{8 }transcripts each were sampled from the empirical Bayes estimate obtained with the fetal or adult likelihood and either the tissuespecific independent marginals (dots), averagedmarginals (squares), or uniform (circles) prior. The uniform prior gives an idiosyncratic Lshaped profile with an abrupt lower copynumber limit (open circles). This reflects the implicit exchangeability of the unobserved classes: all have the same low probability, but because the majority of forms are not observed, their cumulative probability in the estimator is large. This is one example of how the uniform prior (or any prior obtained by a small constant correction to the observed frequencies), although 'uninformative', is overly simplistic, and leads to artifacts.
Developmental changes in higherorder linkages
Figure 8 presents a statistical summary of Figure 5, obtained with the averagedmarginals prior. The adult population displays increased higherorder correlations compared to the fetal for groups of up to at least 6 loci. This agrees with the mutual information results (e.g. Figure 3B), with an interesting additional feature: the fetal and adult profiles are most divergent at cardinalities 4 and 5 with the gap closing toward cardinality 2. This shows that much of the difference in mutual information between the two tissues derives from extensive splicing correlations involving sets of considerably more than two loci, whereas the extent of isolated pairwise interactions is more nearly comparable in the two tissues. Elsewhere we demonstrate that these higherorder splicing correlations correspond to nonadditive functional interactions among multiple domains in the expressed ion channel protein [8]. The spliceprint provides compelling evidence that entire ensembles of domain configurations are selected in concert during processing of the RNA transcript, and that this process becomes more restrictive in the course of development. These loci are spread out over a large portion of a large transcript, separated by ten or more introns in some cases. Fededa et al [28] present one mechanism of longrange coordination of alternative splicing, and others most likely will be found.
The SGT is a transcriptome
Though we have used actual SGT data and a detailed multivariate stochastic splicing model to present analyses of splicing correlations, our focus is on the methods themselves, which apply regardless of the dataset or details of the splicing model. Only with a realistic splicing model, of course, can we make reliable statistical inferences. We present a detailed rationale for our model as a supplement [see 1]. This model may be further refined, for example through computational methods of variable selection (Dahinden, Parmigiani, Emerick, and Bühlmann, manuscript in preparation) or collection of additional data from different sources. A realistic splicing model allows us to investigate the singlegene transcriptome with established methods of transcriptome analysis. Figure 9 presents one such approach.
The transcriptome is a highly complex assortment of gene products, but it exhibits a remarkably stable expression pattern. Only a few genes are expressed at a high level, while most genes are represented by only a few copies. It is not clear that this pattern should persist at the singlegene level. Different physiological inputs would affect the profile at different levels, so those aspects of genenetwork topology that conspire to shape the aggregate gene expression profile, for example, may or may not be relevant to the selection of splice isoforms from a single gene in separate cells or tissues. Nonetheless, the basic characteristics of the transcriptome profile are also present in its elemental building block, the SGT. Figure 9 shows 'frequencyoffrequencies' plots for simulated SGTs sampled from the tissuespecific independentmarginals distribution (filled circles). The reverseJ pattern, like those obtained in genomewide expression profiles assayed by SAGE [29], reflects the complexity of both the transcript inventory and the tissue physiology in which this gene is expressed. The identical profile was obtained with the averagedmarginals estimator (squares), which places a lower reliance on the observed marginal splicing frequencies in either tissue. (A uniform prior distribution (open circles), which we have shown to be inappropriate for a wide range of reasons yields an idiosyncratic Lshaped profile with an abrupt lower copynumber limit).
This analysis shows that the SGT exhibits a complex expression profile similar to that of the complete transcriptome, but that this is true whether or not splicing correlations are present at observed levels. Splicing correlations may represent a level of organization unique to the singlegene level. Alternatively, of course, analysis of correlated gene expression in the wholegenome transcriptome may reveal comparably interesting behavior.
Discussion
The human genome supports in the neighborhood of 23,000 protein coding genes [30], very similar to the number found in genomes of vastly simpler organisms, such as C. elegans and Drosophila [31,32]. To account for the increase in human phenotypic richness, therefore, the number of structural genes is not as important a factor as the way in which genes are used. Variations in gene expression levels, changes in the timing of expression, evolutionary adaptations that rearrange gene interactions as well as evolution of the coding sequence, and increased posttranscriptional modification of primary transcripts to diversify the products of single genes all play a role [9,33].
Here we present tools to evaluate and visualize complex patterns of transcriptome variation, illustrated on populations of fulllength cDNA splice variants from CACNA1G, the gene encoding the human Ca_{v}3.1 Ttype calciumchannel α_{1 }subunit. In the course of brain maturation the transcriptome of this gene undergoes a transformation that would be largely invisible to a study of gene expression levels or a microarray or ESTbased splicing survey. The changes appear only in the complete structures of fulllength transcripts, as alterations in splicing correlations at separate loci within the same molecule. A standard analysis of pairwise correlations, while illuminating, is incomplete in an important way. Compared to the fetal transcriptome, the adult displays a marked increase in mutual information between many pairs of loci (Figure 4). The multivariate analysis, however, reveals two components of this increase: a modest elevation in disjoint pairwise linkages and a substantial increase in higherorder correlations that include linked pairs as a subset. Overall, splicing in the adult is far more restrictive than fetal splicing. This occurs at the same time as the range of cell types in which this gene is being expressed is diversifying, not constricting. This is consistent with the notion that splicing may need to be more stringently specified in the more intricate 'wiring' of the mature brain [8].
It is the grounding principle of this work, therefore, that splicing correlations will generally reflect functional interactions, and that these are likely to involve multiple domains. Splicing of physically linked domains should be coregulated to inhibit detrimental interactions as well as to enhance beneficial ones. This relates directly to the complexity of the processes that regulate selection of alternative domains, the most important factor being whether the domains are modular or functionally interactive.
Modular domains may be shuttled in or out with predictable effects, independent of splicing at other loci. They may be used to conjoin functional activities – postsynaptic targeting with fast activation gating in an ion channel, for example. Interactive domains, in contrast, express a shared functional effect that exists only in the context of the ensemble. A specific effect cannot be independently defined for a single interactive domain: reconfiguring one such domain 'reinterprets' the functional influences of the others.
That is, the molecular phenotype may be expressed as a linear combination of the effects of modular domains, but not so for interactive ones. As an example, deleting segment 38B of the Ca_{v}3.1 calcium channel decreases the window current magnitude when 25C is present, but increases it when 26 is present, and has no effect when both are absent; furthermore, it does not effect gating rates, except when 14, 25C and 26 were all absent, whereupon it speeds inactivation [8]. Whether domains interact functionally depends on the domains, and modular and interactive qualities are not mutually exclusive.
The number of alternative molecular phenotypes is the same whether the loci are modular or interactive. In the former case, however, any given state is decomposable into identifiable subsets of phenotypes, whereas in the latter it is not. Functional interactions admit the possibility of introducing completely unpredictable, qualitatively novel behavior simply by reconfiguring an existing set of domains. In the course of evolution, the simple addition of a new variable domain may reinterpret the phenotypes of existing splicing patterns, enabling a rapid expansion of functional alternatives from the ancestral gene.
Nonadditive functional interactions may have various causes. Inserting one domain may simply block access to a binding site for a second domain, for example. Another possibility is an 'allosteric' type of interaction where electrical or conformational changes communicate through the protein interior. The consequences of such interactions may become even more complex when other genes are alternatively spliced in multiple ways. Current estimates of the number of alternatively spliced genes in humans range to ~76% of known genes [34], with an average of 3.9 splicing isoforms per gene [1]. Furthermore, the frequency of alternative splicing is elevated in genes that mediate cell signaling and metabolic networks [34], increasing the likelihood of nonlinear, and largely unpredictable interactions between genes with alternatively spliced products that communicate through such networks. Strong intergenic interactions are of course normal where proteins contact physically, as subunits of a multienzyme complex, or in a multisubunit ion channel. There are 22–25 such genes for voltagedependent calcium channels, all of which may be alternatively spliced. These may assemble in up to 840 stoichiometric complexes, excompassing as many as 20 variable sites each. Physiological channels may arise from as many as ~9 × 10^{8 }transcript combinations. This is an enormous space of possibilities, just for calcium channels, that can be exploited in the refinement of neuronal networks.
We may expect splicing correlations to cross gene boundaries in such cases, though direct physical contact may not even be necessary in general [35]. Splicing linkage analyses in highthroughput transcriptomics may provide a valuable compliment to direct peptide interaction studies, such as yeast twohybrid, to reveal functional interactions that do not require strong physical contacts. It is interesting, in this light, that the notion of a reconfigurable 'interactome' [36] extends to variable domains within the protein interior.
The unpredictable consequences of functional interactions are amplified through ambiguity in the determinants of alternative splicing, which are not fully specified in the gene sequence. Complex mammalian genes support an intrinsic uncertainty in the structure of the expressed protein which is reduced epigenetically, through information residing outside the gene, within networks of trans regulatory factors, for example [3739]. Thus, very rare splice configurations may be produced under most conditions. Though any particular one may have a low probability, there is always a chance that a new form may arise, producing a protein that functions, albeit in an unusual way. A low level of such 'noise' may in fact be useful to a cell in a complex, unpredictable local environment. This certainly describes the mammalian brain, where humans have far outpaced other primates in the evolutionary divergence of phenotype. In keeping with this, the brain expresses a disproportionate diversity of alternative splicing, compared to other tissues [40].
In the context of expanding complexity in alternative splicing, interactions between variable domains therefore present a challenge to the regulatory processes that select them. A set of k modular domains may be configured through k sequential yes/no choices. Functional interactions, however, force a single, non decomposable, selection from 2^{k }alternatives. The regulatory complexity thus increases exponentially with the number of loci if they interact, but only linearly if they do not. This is the cost of diversifying the proteome through combinatorial splicing. It returns a significant payoff, however, because the exponential expansion of the regulatory load is compensated by an expansion of phenotypic potential on the same scale. We have noted [8] that this regulatory burden could be escaped if some mechanism were available for somatic selection during development, based on feedback from the expressed transcriptome.
Conclusion
As fulllength cDNA datasets become available the methods presented here will assist in defining the interaction landscape, revealing domain configurations that are selected in concert and providing insights into how domains within proteins interact functionally. Additionally, however, they should adapt well to studies of clustering in the transcription of genes (and parts of genes). A compelling application is to the class of RNA transcripts that do not encode proteins [41]. Though largely of unknown function, these ncRNAs comprise a large proportion of the transcriptome, representing roughly 50% of transcriptional units and covering 30 times more of the genome than the proteincoding mRNA, and they are elaborately processed (capped, polyadenylated and spliced – constitutively, alternatively and transspliced) [20]. They likely represent an important component of the intricate web of RNA factors involved in the regulation of gene expression [42], including regulated alternative splicing.
Authors' contributions
ME conceived the study, collected the data, designed the algorithms, code, graphics, and novel indices, with input from WSA on the linkage change index, performed the analyses and wrote the paper, with significant supporting contributions from GP. GP suggested the Bayes/loglinear approach, and provided invaluable guidance, being especially receptive to "interesting" ideas. WSA provided funding and valuable consultation, and initiated work on the linkage change index. All three authors read and approve of the final manuscript.
Acknowledgements
This work was supported by NIH award HL52307 and a Holden Targeted Research Award from the Epilepsy Foundation to W.S.A. We thank two anonymous referees for suggesting Figure 1 and drawing our attention to an overlooked reference.
References

Stamm S, Riethoven JJ, Le Texier V, Gopalakrishnan C, Kumanduri V, Tang Y, BarbosaMorais NL, Thanaraj TA: ASD: a bioinformatics resource on alternative splicing.
Nucleic Acids Res 2006, (34 Database):465. Publisher Full Text

Clark TA, Sugnet CW, Ares M: Genomewide analysis of mRNA processing in yeast using splicingspeciflc microarrays.
Science 2002, 296(5569):907910. PubMed Abstract  Publisher Full Text

Castle J, GarrettEngele P, Armour CD, Duenwald SJ, Loerch PM, Meyer MR, Schadt EE, Stoughton R, Parrish ML, Shoemaker DD, Johnson JM: Optimization of oligonucleotide arrays and RNA amplification protocols for analysis of transcript structure and alternative splicing.
Genome Biol 2003, 4(10):R66. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Le K, Mitsouras K, Roy M, Wang Q, Xu Q, Nelson SF, Lee C: Detecting tissuespecific regulation of alternative splicing as a qualitative change in microarray data.
Nucleic Acids Res 2004, 32(22):e180. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kapranov P, Drenkow J, Cheng J, Long J, Helt G, Dike S, Gingeras TR: Examples of the complex architecture of the human transcriptome revealed by RACE and highdensity tiling arrays.
Genome Res 2005, 15(7):987997. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Aebersold R, Mann M: Mass spectrometrybased proteomics.
Nature 2003, 422(6928):198207. PubMed Abstract  Publisher Full Text

Roberts GC, Smith CWJ: Alternative splicing: combinatorial output from the genome.
Curr Opin Chem Biol 2002, 6:375383. PubMed Abstract  Publisher Full Text

Emerick MC, Stein R, Kunze R, McNulty MM, Regan MR, Hanck DA, Agnew WS: Profiling the array of Ca_{v }3.1 variants from the human Ttype calcium channel gene CACNA1G: Alternative structures, developmental expression, and biophysical variations.
Proteins 2006.
Published online, May 2, 2006, 10.1002/prot.20877
PubMed Abstract  Publisher Full Text 
Mattick JS, Makunin IV: Small regulatory RNAs in mammals.
Hum Mol Genet 2005, (14 Spec No 1):121132. Publisher Full Text

RIKEN Genome Exploration Research Group, Genome Science Group [Genome Network Project Core Group] and the FANTOM Consortium: Antisense transcription in the mammalian transcriptome.
Science 2005, 309(5740):15641566. PubMed Abstract  Publisher Full Text

Yudate HT, Suwa M, Irie R, Matsui H, Nishikawa T, Nakamura Y, Yamaguchi D, Peng ZZ, Yamamoto T, Nagai K, Hayashi K, Otsuki T, Sugiyama T, Ota T, Suzukil Y, Sugano S, Isogai T, Masuho Y: Variation in sequence and organization of splicing regulatory elements in vertebrate genes.
Nucleic Acids Res 2001, 29:185188. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Okazaki Y, Furuno M, Kasukawa T, Adachi J, Bono H, Kondo S, Nikaido I, Osato N, Saito R, Suzuki H, Yamanaka I, Kiyosawa H, Yagi K, Tomaru Y, Hasegawa Y, Nogami A, Schönbach C, Gojobori T, Baldarelli R, Hill DP, Bult C, Hume DA, Quackenbush J, Schriml LM, Kanapin A, Matsuda H, Batalov S, Beisel KW, Blake JA, Bradt D, Brusic p, et al.: Analysis of the mouse transcriptome based on functional annotation of 60,770 fulllength cDNAs.
Nature 2002, 420(6915):563573. PubMed Abstract  Publisher Full Text

Stapleton M, Carlson J, Brokstein P, Yu C, Champe M, George R, Guarin H, Kronmiller B, Pacleb J, Park S, Wan K, Rubin GM, Celniker SE: A Drosophila fulllength cDNA resource.
Genome Biol 2002, 3(12):research0080.10080.8. BioMed Central Full Text

Strausberg RL, Feingold EA, Grouse LH, Derge JG, Klausner RD, Collins FS, Wagner L, Shenmen CM, Schuler GD, Altschul SF, Zeeberg B, Buetow KH, Schaefer CF, Bhat NK, Hopkins RF, Jordan H, Moore T, Max SI, Wang J, Hsieh F, Diatchenko L, Marusina K, Farmer AA, Rubin GM, Hong L, Stapleton M, Soares MB, Bonaldo MF, Casavant TL, Scheetz TE, et al.: Generation and initial analysis of more than 15,000 fulllength human and mouse cDNA sequences.
Proc Natl Acad Sci USA 2002, 99(26):1689916903. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kikuchi S, Satoh K, Nagata T, Kawagashira N, Doi K, Kishimoto N, Yazaki J, Ishikawa M, Yamada H, Ooka H, Hotta I, Kojima K, Namiki T, Ohneda E, Yahagi W, Suzuki K, Li CJ, Ohtsuki K, Shishiki T, Otomo Y, Murakami K, lida Y, Sugano S, Fujimura T, Suzuki Y, Tsunoda Y, Kurosaki T, Kodama T, Masuda H, Kobayashi M, et al.: Collection, mapping, and annotation of over 28,000 cDNA clones from japonica rice.
Science 2003, 301(5631):376379. PubMed Abstract  Publisher Full Text

Sakurai T, Satou M, Akiyama K, lida K, Seki M, Kuromori T, Ito T, Konagaya A, Toyoda T, Shinozaki K: RARGE: a largescale database of RIKEN Arabidopsis resources ranging from transcriptome to phenome.

Neves G, Zucker J, Daly M, Chess A: Stochastic yet biased expression of multiple Dscam splice variants by individual cells.
Nat Genet 2004, 36(3):240246. PubMed Abstract  Publisher Full Text

Regan MR, Emerick MC, Agnew WS: Fulllength singlegene cDNA libraries: Applications in splice variant analysis.
Anal Biochem 2000, 286:265276. PubMed Abstract  Publisher Full Text

Regan MR, Lin DD, Emerick MC, Agnew WS: The effect of higher order RNA processes on changing patterns of protein domain selection: a developmentally regulated transcriptome of type 1 inositol 1,4,5trisphosphate receptors.
Proteins 2005, 59(2):312331. PubMed Abstract  Publisher Full Text

FANTOM Consortium, RIKEN Genome Exploration Research Group, and the Genome Science Group [Genome Network Project Core Group]: The transcriptional landscape of the mammalian genome.
Science 2005, 309(5740):15591563. PubMed Abstract  Publisher Full Text

Zhu J, Shendure J, Mitra RD, Church GM: Single molecule profiling of alternative premRNA splicing.
Science 2003, 301(5634):836838. PubMed Abstract  Publisher Full Text

Raymond CK, Castle J, GarrettEngele P, Armour CD, Kan Z, Tsinoremas N, Johnson JM: Expression of alternatively spliced sodium channel alphasubunit genes. Unique splicing patterns are observed in dorsal root ganglia.
J Biol Chem 2004, 279(44):4623446241. PubMed Abstract  Publisher Full Text

R Development Core Team: [http://www.Rproject.org] webcite
A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2005.
[ISBN 3900051070]

Cover TM, Thomas JA: Elements of information theory. 2nd edition. Hoboken, N.J.: WileyInterscience; 2005.

Bishop YMM, Fienberg SF, Holland PW: Discrete multivariate analysis: Theory and practice. Cambridge, MA: MIT Press; 1977.

Gelman A, Meng XL, Stern H: Posterior predictive assessment of model fitness via realized discrepancies.

Gelman A: Bayesian Data Analysis. 2nd edition. Boca Raton: Chapman and Hall/CRG; 2004.

Fededa JP, Petrillo E, Gelfand MS, Neverov AD, Kadener S, Nogus G, Pelisch F, Baralle FE, Muro AF, Kornblihtt AR: A Polar Mechanism Coordinates Different Regions of Alternative Splicing within a Single Gene.
Mol Cell 2007, 19(3):393404. Publisher Full Text

Kuznetsov VA: Distribution associated with stochastic processes of gene expression in a single eukaryotic cell.
EURASIP J App Sig Proc 2001, 4:285296. Publisher Full Text

International Human Genome Sequencing Consortium: Finishing the euchromatic sequence of the human genome.
Nature 2004, 431(7011):931945. PubMed Abstract  Publisher Full Text

C elegans Sequencing Consortium: Genome sequence of the nematode C. elegans: a platform for investigating biology.
Science 1998, 282(5396):20122018. PubMed Abstract  Publisher Full Text

Misra S, Crosby MA, Mungall CJ, Matthews BB, Campbell KS, Hradecky P, Huang Y, Kaminker JS, Millburn GH, Prochnik SE, Smith CD, Tupy JL, Whitfied EJ, Bayraktaroglu L, Berman BP, Bettencourt BR, Celniker SE, de Grey AD, Drysdale RA, Harris NL, Richter J, Russo S, Schroeder AJ, Shu SQ, Stapleton M, Yamada C, Ashburner M, Gelbart WM, Rubin GM, Lewis SE: Annotation of the Drosophila melanogaster euchromatic genome: a systematic review.
Genome Biol 2002, 3(12):research0083.1research0083.22. BioMed Central Full Text

Claverie JM: Gene number. What if there are only 30,000 human genes?
Science 2001, 291(5507):12551257. PubMed Abstract  Publisher Full Text

Johnson JM, Castle J, GarrettEngele P, Kan Z, Loerch PM, Armour CD, Santos R, Schadt EE, Stoughton R, Shoemaker DD: Genomewide survey of human alternative premRNA splicing with exon junction microarrays.
Science 2003, 302(5653):21412144. PubMed Abstract  Publisher Full Text

Völker J, Breslauer KJ: Communication between noncontacting macromolecules.
Annu Rev Biophys Biomol Struct 2005, 34:2142. PubMed Abstract  Publisher Full Text

Resch A, Xing Y, Modrek B, Gorlick M, Riley R, Lee C: Assessing the impact of alternative splicing on domain interactions in the human proteome.
J Proteome Res 2004, 3:7683. PubMed Abstract  Publisher Full Text

Ladd AN, Cooper TA: Finding signals that regulate alternative splicing in the postgenomic era.
Genome Biol 2002, 3(11):116. BioMed Central Full Text

Faustino NA, Cooper TA: PremRNA splicing and human disease.
Genes Dev 2003, 17(4):419437. PubMed Abstract  Publisher Full Text

Yeo G, Hoon S, Venkatesh B, Burge CB: Variation in sequence and organization of splicing regulatory elements in vertebrate genes.
Proc Natl Acad Sci USA 2004, 101(44):1570015705. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Grabowski PJ, Black DL: Alternative RNA splicing in the nervous system.
Prog Neurobiol 2001, 65(3):289308. PubMed Abstract  Publisher Full Text

Claverie JM: Fewer genes, more noncoding RNA.
Science 2005, 309(5740):15291530. PubMed Abstract  Publisher Full Text

Mattick JS: RNA regulation: a new genetics?
Nat Rev Genet 2004, 5(4):316323. PubMed Abstract  Publisher Full Text