Paucity of chimeric gene-transposable element transcripts in the Drosophila melanogaster genome

Lipatov, Mikhail; Lenkov, Kapa; Petrov, Dmitri A; Bergman, Casey M

doi:10.1186/1741-7007-3-24

Research article
Open access
Published: 12 November 2005

BMC Biology volume 3, Article number: 24 (2005)

Abstract

Background

Recent analysis of the human and mouse genomes has shown that a substantial proportion of protein coding genes and cis-regulatory elements contain transposable element (TE) sequences, implicating TE domestication as a mechanism for the origin of genetic novelty. To understand the general role of TE domestication in eukaryotic genome evolution, it is important to assess the acquisition of functional TE sequences by host genomes in a variety of different species, and to understand in greater depth the population dynamics of these mutational events.

Results

Using an in silico screen for host genes that contain TE sequences, we identified a set of 63 mature "chimeric" transcripts supported by expressed sequence tag (EST) evidence in the Drosophila melanogaster genome. We found a paucity of chimeric TEs relative to expectations derived from non-chimeric TEs, indicating that the majority (~80%) of TEs that generate chimeric transcripts are deleterious and are not observed in the genome sequence. Using a pooled-PCR strategy to assay the presence of gene-TE chimeras in wild strains, we found that over half of the observed chimeric TE insertions are restricted to the sequenced strain, and ~15% are found at high frequencies in North American D. melanogaster populations. Estimated population frequencies of chimeric TEs did not differ significantly from non-chimeric TEs, suggesting that the distribution of fitness effects for the observed subset of chimeric TEs is indistinguishable from the general set of TEs in the genome sequence.

Conclusion

In contrast to mammalian genomes, we found that fewer than 1% of Drosophila genes produce mRNAs that include bona fide TE sequences. This observation can be explained by the results of our population genomic analysis, which indicates that most potential chimeric TEs in D. melanogaster are deleterious but that a small proportion may contribute to the evolution of novel gene sequences such as nested or intercalated gene structures. Our results highlight the need to establish the fixity of putative cases of TE domestication identified using genome sequences in order to demonstrate their functional importance, and reveal that the contribution of TE domestication to genome evolution may vary drastically among animal taxa.

Background

The origin of genetic novelty is of great interest in evolutionary biology. As mutation is the ultimate source of all genetic variation, understanding the mutational processes that lead to novel genomic features such as new genes, expression patterns or system interactions is paramount. The most commonly invoked mutational source of genetic novelty (after point substitution) is either segmental or whole genome duplication [1, 2]. More recently, duplicative transposition (the copying and pasting of particular DNA sequences from one part of the genome to another) has been shown to play an important role in the evolution of new genes (e.g. [3]). Evidence from the human and mouse genomes indicates that, in addition to providing the source of the transpositional machinery, transposable elements (or TEs) [4] can also provide the template DNA for new genes or regulatory sequences [5–11]. However, to understand the general role of TE domestication in eukaryotic genome evolution, it is important to assess the acquisition of functional TE sequences by host genomes in a variety of different species, and to understand in greater depth the population dynamics of these mutational events.

Here we have investigated the incorporation of TEs into mature transcripts in the fruitfly Drosophila melanogaster, a species about which much is known in terms of the sequence and function of genic and intergenic regions. To do so, we searched for potentially domesticated "chimeric" transcripts (i.e. transcripts containing both TE and host gene sequences) backed by experimental support in the form of expressed sequence tag (EST) evidence (cf. [10, 11]). The focus of this study is gene-TE associations contained within mRNA transcripts (i.e. within exons or untranslated regions, UTRs), so here we do not consider TEs that are either wholly contained in introns or located in the immediate vicinities of genes. An advantage of our approach is that the gene-TE chimeras identified are supported by experimental evidence rather than just by coordinate overlaps or mere proximity (cf. [12, 13]); the approach thus enriches for a subset of TE insertions that may contribute to functional gene evolution in the host.

In addition, we have assessed the presence in wild populations of gene-TE chimeras identified using the genome sequence, to gain insight into the evolutionary forces acting on these mutations in nature. Using a pooled-PCR strategy, we estimated population frequencies for a sample of chimeric TE insertions in North American strains of D. melanogaster. By comparing population frequencies of chimeric TEs to those of non-chimeric TEs of the same family from similar genomic contexts, we evaluated whether chimeric TEs generally segregate either at unusually high frequencies (indicating the action of adaptive selection) or at unusually low frequencies (indicating the action of purifying selection). These results also revealed which of the gene-TE chimeras detected in the genome sequence are likely to be constitutive components of the D. melanogaster transcriptome.

By comparing our set of gene-TE chimeras to the entire set of annotated genes and TEs in the D. melanogaster Release 3 euchromatin, we show that a chimeric TE insertion has a much lower probability than a non-chimeric TE insertion of existing in the sequenced strain. This extreme paucity of chimeric TEs can be explained by the simple fact that TE insertions generating chimeric transcripts are likely to be strongly deleterious for the host. However, we find that the population frequencies of observed chimeric TEs are generally indistinguishable from similarly paired non-chimeric TE insertions, and we find that some chimeric TE insertions can be found at high frequency in North American populations. This pattern indicates that chimeric TE insertions observed in the genome sequence do not differ substantially from non-chimeric TEs in their selective effects, and that the D. melanogaster transcriptome permits a low-level flux of chimeric transcripts that may contribute to the formation of new gene sequences. Finally, we discuss the possibility that chimeric transcripts explain the curious phenomenon of regulated somatic expression of TE transcripts in the developing Drosophila embryo.

Results

Identification of chimeric gene-TE transcripts in the D. melanogaster genome

In order to study the functional integration of TE sequences into host genes, we identified TE insertions present in mature transcripts of the D. melanogaster euchromatic Release 3 genome sequence. We call such transcripts "chimeric" as each of them has one component from a host gene and one from a TE insertion. In addition to using the standard methods in the field for directly finding genes and TEs that share overlapping coordinates or querying annotated transcripts directly for TE sequences [8, 10, 11], we also sought evidence for chimeric transcripts using a novel three-step process based on expressed sequence tags (ESTs) (see Materials and Methods). This indirect method of identifying gene-TE chimeras was necessary to avoid annotation biases resulting from the fact that "coding exons were not annotated in sequences with homology to transposable elements" [14] in the D. melanogaster genome.

In total, we found 63 protein-coding genes that produce chimeric transcripts supported by EST evidence (Table 1; for more information [see Additional file 1]). These chimeric transcripts involve 63 different TE insertions, but the relationship is not simply one-to-one: in two cases, TE insertions (FBti0019107 and FBti0020178, Table 1) occur in overlapping 3'UTRs of convergently transcribed neighboring genes producing two separate chimeric transcripts each (see Figure 1A); and in one case, three TE insertions are found in a chimeric transcript for a single gene (CG32021) on the 4th chromosome. In addition, we found one noncoding transcript, the αγ-element [15], which is generated by two TE insertions within a larger nest of TEs situated between the Hsp70 Ba and Bbb genes. Our screen appears to have high sensitivity as evidenced by the fact that we identified four of the five exonic TE insertions previously reported in [12] (we found no supporting EST evidence for the fifth gene CG7900); the single exonic jockey insertion in the gene CG6191 reported in [16]; and the chimeric transcript generated by a Doc insertion into the gene CHKov1 (CG10618) reported in [17, 18]. We did not identify the Bari-1 insertion in cyp12a4 recently reported in [19], which is supported by EST evidence, since the region of overlap (18 bp) does not pass our length threshold.

Table 1 Chimeric TE insertions supported by EST evidence in the D. melanogaster Release 3 genome sequence. The leftmost column gives the gene(s) that generate(s) the chimeric transcript. FlyBase ID refers to the TE accession number in the Release 3.2 annotation. Rec. rate refers to the estimated recombination rate in the vicinity of the gene. Rightmost column gives the number of wild strain pools (out of six) where the TE insertion is present. An asterisk in the last column indicates that independent population frequency estimates are available for these TE insertions in [12, 17, 27-29].

We note that six of the 65 chimeric TE insertions identified by BLAST-based methods do not have corresponding TEs in the Release 3.2 annotation. However, unannotated TEs of the correct family can be found in the genome sequence for these chimeric TE insertions (Table 1). This result indicates that an unknown proportion of real TE insertions has not been annotated in the Release 3 genome sequence (see below). To be able to analyze aspects of chimeric TEs in the context of the genome annotation, we excluded these six TE insertions from the "annotated set" of 59 TE insertions, although we do consider them to be bona fide members of the "total set" of 65 potential gene-TE chimeras in the D. melanogaster genome.

Properties of chimeric gene-TE transcripts

Most of the 63 genes generating the total set of chimeric transcripts are of unknown function, but we did identify chimeric transcripts in 23 characterized protein-coding genes including brown (bw), a gene that appears to be a hot-spot for natural TE insertions [20] and is known to carry a viable mutation (bw¹) in the sequenced strain [14]. Our in silico screen also identified a chimeric TE insertion generated by the serine protease encoding gene Tequila that has recently been shown to impair the transcription of this gene, but with no apparent phenotypic consequences [21]. A general analysis of the molecular function and cellular localization of the total set of genes with chimeric transcripts, however, did not indicate a significant enrichment of any particular Gene Ontology (GO) category (data not shown).

Relative to other non-chimeric TEs inserted in transcribed regions (i.e. intronic TE insertions), the annotated set of TEs present in chimeric transcripts is significantly enriched for LTR insertions (Figure 2A). This observation largely accounts for the fact that the annotated set is also enriched in long TEs (Figure 2B), since LTR insertions tend to be longer than other classes of TE insertion in the genome [14]. Furthermore, chimeric TEs have a greater tendency to be present in high-recombination areas of the genome than non-chimeric, intronic TE insertions (Figure 2C). However, the overabundance of chimeric TEs in regions of high recombination is not caused simply by the fact that chimeric transcripts are preferentially formed by LTR insertions, since high-recombination TE insertions are over-represented among the chimeric non-LTR (i.e. LINE-like, TIR and FB) elements even more strongly than among the chimeric LTRs (data not shown).

TE sequences are found in UTRs in most of the chimeric transcripts they generate: 38 of the 63 TE insertions are found in 3'UTRs, 23 in the 5'UTRs and 4 in coding exons. We note that these numbers total more than 63 because two TE insertions (chimeras 47 and 61 [see Additional File 1]) fall into multiple categories. The higher incidence of TEs in UTRs and specifically in 3'UTRs parallels findings in the human and mouse genomes [10, 11]. The increased prevalence of TE insertions in 3'UTRs may be attributed to the increased average length of 3'UTRs (442 bp) relative to 5'UTRs (265 bp) in Drosophila [22] (as has been suggested previously to explain such patterns in the human genome [10]), or to the lower density of functional signals in 3' regions relative to 5' regions of genes. This pattern does not appear to result from biases in the EST libraries, since over 10 times more 5' ESTs were analyzed than 3' ESTs [23].

Surprisingly, the genes involved in chimeric transcripts are not always those nearest to the sites of the corresponding TE insertions. Four chimeric transcripts skip one or more genes between the gene and TE components of the transcript (chimeras 12, 18, 23 and 50; Table 1, Figure 1B and 1C), thereby creating nested or intercalated gene arrangements. The process of gene- or exon-skipping in chimeric transcript formation suggests a novel mutational mechanism to explain the surprisingly large proportion of nested genes in the D. melanogaster genome (many of which bear no hallmark of retroposition) [22, 24], as well as the evolution of complex intercalated gene structures that cannot arise via simple mechanisms of gene duplication.

Paucity of TEs in mature transcripts indicates that chimeric TE insertions are generally strongly deleterious

Of the 1,566 valid TEs in the Release 3.2 annotation of the D. melanogaster genome sequence, we estimate that 59 are chimeric TE insertions with some component co-transcribed in an exon, 414 are transcribed but entirely contained within spliced intronic sequences, and 1,093 are entirely contained within intergenic sequences not currently annotated as transcribed. A similar rank order pattern of TE abundance in different functional compartments has been observed in the Arabidopsis thaliana genome [25]. These numbers of TE insertions deviate significantly from their expected proportions based on the genome annotation of the 116.8 Mb Release 3 sequence (p < 1 × 10^-15) (Table 2). This deviation from expectations is the result of two factors: there are fewer TEs in transcribed regions than in intergenic regions (p < 1 × 10^-15) [14], and there is a further reduction in exonic regions relative to intronic regions (p < 1 × 10^-15). The reduction in transcribed regions, however, is not solely caused by under-representation in exonic sequences, since the number of intronic TE insertions is reduced relative to the number in intergenic regions (p < 1 × 10^-15). Together, these results indicate that there is a paucity of chimeric TE insertions in the genome, and that the causes of this paucity go above and beyond the effects of simply being transcribed.
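The goodness-of-fit comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' analysis: the exonic (28.2 Mb) and intronic (37.7 Mb) lengths are the approximate values quoted elsewhere in this article, and the intergenic length is assumed here to be the remainder of the 116.8 Mb sequence, which may not match the partition used for Table 2. The function name chi_square_gof is hypothetical.

```python
import math

# Observed TE counts by compartment, as reported in the text.
observed = {"exonic": 59, "intronic": 414, "intergenic": 1093}

# Compartment lengths in Mb. Exonic and intronic values are the approximate
# figures quoted in the article; the intergenic value is an ASSUMPTION
# obtained by subtraction from the 116.8 Mb total.
genome_mb = 116.8
length_mb = {"exonic": 28.2, "intronic": 37.7}
length_mb["intergenic"] = genome_mb - length_mb["exonic"] - length_mb["intronic"]


def chi_square_gof(observed, length_mb):
    """Chi-square statistic for counts versus length-proportional expectations."""
    total_te = sum(observed.values())
    total_len = sum(length_mb.values())
    stat = 0.0
    for name, count in observed.items():
        expected = total_te * length_mb[name] / total_len
        stat += (count - expected) ** 2 / expected
    return stat


chi2 = chi_square_gof(observed, length_mb)
# With three compartments there are 2 degrees of freedom, for which the
# chi-square survival function has the closed form exp(-x / 2).
p_value = math.exp(-chi2 / 2)
print(f"chi2 = {chi2:.1f}, p = {p_value:.3g}")
```

Under these assumed lengths the p-value falls far below the 1 × 10^-15 threshold quoted in the text, but the exact statistic depends on the compartment partition and should not be read as a reproduction of Table 2.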

Table 2 Distribution of TEs by genomic compartment. Using the Release 3.2 annotation, the 116.8 Mbp D. melanogaster genome sequence was partitioned into exonic, intronic and intergenic DNA with exons taking precedence over introns, and introns over intergenic regions for genes with alternative splicing or promoter usage. χ² values (degrees of freedom) are for tests of the number of TE insertions observed relative to expected proportions based on the total length of corresponding genomic compartment. P-values of all χ² tests were <1 × 10^-15.

To estimate the extent to which the number of exonic TE insertions is reduced while controlling for the effect of transcription per se on the distribution of TEs, we use the number of intronic TEs and the length of the intronic compartment of the genome to estimate the proportion of unobserved chimeric TE insertions. The total length of intronic regions in the D. melanogaster genome is approximately 37.7 Mb and the total length of exonic regions is 28.2 Mb [22, 26]. If the selective pressures on exonic TEs were similar in magnitude to those on intronic TEs we would expect to find approximately 414*(28.2/37.7) = 310 TE insertions in the predicted exonic (coding plus untranslated) regions of the genome. The fact that we detect only 59 chimeric TEs out of an expected 310 (or 19%) indicates that a chimeric TE insertion is much more likely to be highly deleterious to the organism than a non-chimeric TE insertion that is spliced out of a mature transcript. These results are consistent with previous findings in the human genome, that the proportion of TE-derived sequence increases with increasing distance upstream from the start of transcription [10].

These calculations are based on a comparison of the annotated set of chimeric TE insertions relative to the total set of annotated TE insertions. As noted above, however, our results reveal that an unknown proportion of TEs in the Release 3 sequence were not annotated in [14]. If we assume that the frequency of unannotated TEs in intronic regions is proportional to that of the unannotated TE insertions in our sample (~10%), the expected number of TE insertions in exonic regions would increase to 310*1.10 = 341. Thus, using the total set under this proportionality assumption, the percentage of chimeric TE insertions detected relative to expectation is little changed (65 out of 341, 19%). To the extent that the number of unannotated TE insertions in introns is proportionally higher than in our sample, the percentage of observed chimeric TE insertions decreases even further, strengthening the claim for a paucity of chimeric TE insertions relative to expectation.
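The two expectations above follow from simple proportional scaling, which the short sketch below reproduces. All input numbers come from the text; the variable names are illustrative only, and the 10% inflation factor is the proportionality assumption stated in the preceding paragraph.

```python
# Inputs quoted in the text.
intronic_te = 414          # annotated TEs wholly within introns
intronic_mb = 37.7         # approximate total intron length (Mb)
exonic_mb = 28.2           # approximate total exon length (Mb)
annotated_chimeric = 59    # annotated set of chimeric TEs
total_chimeric = 65        # total set, including six unannotated TEs

# Expected exonic TEs if exons tolerated insertions as well as introns do.
expected_annotated = round(intronic_te * exonic_mb / intronic_mb)

# Inflate by ~10% to allow for unannotated intronic TEs (stated assumption).
expected_total = round(expected_annotated * 1.10)

pct_annotated = 100 * annotated_chimeric / expected_annotated
pct_total = 100 * total_chimeric / expected_total

print(expected_annotated, expected_total)
print(f"{pct_annotated:.0f}% and {pct_total:.0f}% of expectation")
```

Both ratios round to 19%, which is why the paucity estimate is described as little changed when the unannotated insertions are included.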

Observed chimeric TEs are not under unusual selective pressures

We estimated that ~80% of the TEs that have been inserted into mature genic transcripts are immediately purged from the genome by strong purifying selection, and therefore are not observed in the sequenced strain. What about the remaining ~20% of chimeric TE insertions that we do detect? We can envisage three scenarios to explain the existence of these chimeric TE insertions: 1) they are under strong purifying selection, like the TE insertions we do not observe; 2) they are adaptive, contributing useful sequences to the host genome; or 3) they are neither particularly deleterious nor particularly advantageous in comparison to the observed non-chimeric TE insertions in the genome.

In order to evaluate these possibilities, we surveyed the frequencies of chimeric TE insertions in wild D. melanogaster populations. The presence of each TE was tested in six pools of 8–12 North American strains (Table 1, rightmost column) using a PCR procedure custom-designed for each chimeric TE (see Methods for details). Pool frequencies were used to estimate confidence bounds on population frequencies using a maximum likelihood procedure (Table 3, [see Additional file 2]; see Methods for details).
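One way to see how pool counts translate into a population frequency is the sketch below. It is not the procedure described in the Methods: it assumes, purely for illustration, that every pool contains the same number of strains (10, a hypothetical value within the stated 8–12 range), that strains are sampled independently, and that bounds are taken where the log-likelihood drops 1.92 units below its maximum. The function names are hypothetical.

```python
import math


def pool_log_likelihood(f, k, pools=6, strains_per_pool=10):
    """Log-likelihood of frequency f given k positive pools (constant omitted)."""
    q = 1.0 - (1.0 - f) ** strains_per_pool  # P(a pool contains the TE)
    ll = 0.0
    if k > 0:
        if q <= 0.0:
            return -math.inf
        ll += k * math.log(q)
    if pools - k > 0:
        if q >= 1.0:
            return -math.inf
        ll += (pools - k) * math.log(1.0 - q)
    return ll


def ml_frequency(k, pools=6, strains_per_pool=10):
    """Closed-form ML estimate: solve 1 - (1 - f)^n = k / pools for f."""
    return 1.0 - (1.0 - k / pools) ** (1.0 / strains_per_pool)


def likelihood_bounds(k, pools=6, strains_per_pool=10, drop=1.92, steps=10000):
    """Grid search for frequencies whose log-likelihood is within `drop` of the maximum."""
    best = pool_log_likelihood(ml_frequency(k, pools, strains_per_pool),
                               k, pools, strains_per_pool)
    inside = [i / steps for i in range(steps + 1)
              if pool_log_likelihood(i / steps, k, pools, strains_per_pool) >= best - drop]
    return min(inside), max(inside)


for k in range(7):
    low, high = likelihood_bounds(k)
    print(k, round(ml_frequency(k), 3), round(low, 3), round(high, 3))
```

The key point the sketch makes is that a TE present in only a few pools implies a low per-strain frequency, because each pool aggregates several strains; the values it prints are not a substitute for Table 3.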

Table 3 Maximum likelihood (ML) estimates and bounds on TE insertion frequencies in the North American D. melanogaster population, given the number of pools that contain the TE insertion.

We were able to generate population data for 48 of the 59 annotated chimeric TE insertions. Twenty-seven chimeric TE insertions were found only in the sequenced strain, seven were found in all six strain pools and 14 had intermediate pool frequencies. These proportions of absent (56%) and polymorphic (44%) chimeric TEs are very similar to a combined, non-random sample of 92 non-chimeric TE insertions with previously reported population frequency data that map to annotated Release 3 TEs: absent (58%) and polymorphic (42%) [12, 17, 27–29]. The negative effects of intronic TE insertions on transcription do not strongly affect this non-chimeric sample, since similar proportions of absent and polymorphic TE insertions are observed in intronic (60% absent, 40% polymorphic; n = 30) and intergenic (56% absent, 44% polymorphic; n = 62) regions.

To determine whether the chimeric TE insertions are, on aggregate, subject to unusual selective constraints, we compared each of their pool frequencies to those of similar, non-chimeric TE insertions (Table 4). By "similar," we mean that these TE insertions came from the same family as their chimeric counterparts, that they had similar lengths, and were inserted in areas with similar recombination rates (see Methods for details). Since the selective constraint on a TE insertion is expected to increase with its length and the recombination rate of its genomic neighborhood [17, 30], we tried to bracket each chimeric TE with a pair of similar non-chimeric family members: one with slightly higher, and one with slightly lower, length and recombination rate (columns 4 and 6 of Table 4, respectively). Our null hypothesis was that the chimeric TE insertions are neither particularly deleterious nor particularly advantageous in comparison with their non-chimeric counterparts. If this null hypothesis is true, we expect the pool frequencies of non-chimeric TE insertions in column 5 of Table 4 to be no higher, and the pool frequencies in column 7 to be no lower, than those of the chimeric TE insertions in column 3.

Table 4 For each chimeric TE (column 2), we give the number of strain pools in which the TE is present (column 3), the same for a similar TE with greater length in an area of higher recombination (columns 4 and 5), and for a similar TE with lower length inserted in an area with lower recombination (columns 6 and 7). For the first type of similar TE insertion, we expect slightly higher selective constraints, and thus slightly lower population frequency. The converse is true for the second type of similar TE insertion.

For the set of 48 TE insertions for which we have population data, we cannot reject the null hypothesis of no difference in pool frequencies between chimeric and non-chimeric TE insertions. Neither the Wilcoxon one-sided test nor the Kruskal-Wallis test rejects the null hypothesis in favor of the alternative that pool frequencies of chimeric TEs are significantly higher than those of their counterparts with greater lengths and recombination rates (p = 0.38 and p = 0.75, respectively; tests performed on the n = 34 TEs in Table 4 that have the appropriate counterparts). This indicates that, in general, the fact that a TE insertion is chimeric does not increase the likelihood that it is at higher population frequency and is therefore potentially adaptive. Similarly, we find no evidence that chimeric TEs in general have pool frequencies lower than those with shorter lengths and lower recombination rates (p = 0.15 for the one-sided Wilcoxon rank sum test, p = 0.30 for the Kruskal-Wallis rank sum test; n = 46). Thus, the fact that an observed TE insertion is chimeric does not increase the likelihood that it is deleterious.
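For readers unfamiliar with the one-sided rank sum comparison used above, the following sketch shows its general form. It is an assumption-laden illustration rather than the authors' code: it uses a normal approximation with a tie correction and no continuity correction, and the pool counts in the example are made-up placeholders, not values from Table 4. The function names are hypothetical.

```python
import math
from collections import Counter


def midranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = average
        i = j + 1
    return ranks


def rank_sum_greater(x, y):
    """Return (U, p) for the one-sided alternative that x tends to exceed y."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = list(x) + list(y)
    ranks = midranks(combined)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    tie_term = sum(t ** 3 - t for t in Counter(combined).values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 0.5  # every value tied: no evidence in either direction
    z = (u - mean_u) / math.sqrt(var_u)
    return u, 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical pool counts (0-6 pools positive), for illustration only.
chimeric_pools = [0, 0, 1, 3, 6, 0, 2]
counterpart_pools = [0, 1, 0, 4, 6, 0, 1]
u_stat, p_one_sided = rank_sum_greater(chimeric_pools, counterpart_pools)
print(u_stat, round(p_one_sided, 3))
```

Because pool counts take only seven possible values, ties are frequent, which is why the tie correction matters here; an exact or permutation-based test would be a reasonable alternative for small samples.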

While we do not provide evidence for unusual selection pressures acting on chimeric TE insertions overall, we do find a few exceptions to this general rule when TE insertions are analyzed on an individual basis. As shown in Figure 3, by comparing pool frequencies of chimeric TEs to those of the two types of non-chimeric counterparts, we detect evidence for two exceptional chimeric TE insertions. One, a Doc insertion (FBti0019430), which creates a truncated version of the putative choline transferase gene CHKov1 (CG10618), has a significantly elevated population frequency (chimera 44, Figure 3A) and has been reported previously to be a putatively adaptive TE insertion [17, 18]. The second, a pogo (FBti0019206) insertion into the fructose-bisphosphatase encoding gene fbp, has a significantly decreased population frequency (chimera 21, Figure 3B) and is likely to be more deleterious than similar non-chimeric pogo insertions.

Discussion

We conducted a thorough search for TE insertions in the mature transcripts of genes in the sequenced D. melanogaster genome. To do so we used three different computational methods, including a novel, indirect EST-based approach (see Materials and Methods). As with all EST-based bioinformatics methods, this new approach to finding gene-TE chimeras is subject to biases in EST library composition. Such an approach was necessitated by annotation biases in the Drosophila genome that would have caused any direct analysis of annotated transcripts to underestimate the number of putative chimeric transcripts in the genome. Despite these conflicting biases, most of the 63 genes generating chimeric transcripts were identified by more than one method [see Additional file 1], although each method revealed unique chimeric TE insertions. Thus, multiple complementary approaches should be used in genome-wide studies of TE domestication to overcome both annotation and methodological biases.

Even using multiple methods for detecting chimeric transcripts, we estimate that only 0.46% of protein coding genes in Drosophila generate chimeric transcripts. Clearly the number of chimeric genes would be expected to increase somewhat with better annotation and/or increased EST coverage. Nevertheless, the number of chimeric transcripts in the Drosophila genome is likely to be more than an order of magnitude less than in the human and mouse genomes, where an estimated 27% and 18% of genes, respectively, contain TE sequences [11]. These results together also suggest a rank order relationship between the proportion of chimeric genes and the amount of TE DNA in a genome (human, 46.36%; mouse, 38.55%; fly, 5.3%) [31–33]; however, further studies are needed to evaluate the strength and generality of this trend. Even a low number of gene-TE chimeras, such as presently observed in the D. melanogaster genome, may in the long term contribute to the evolution of new transcripts and help explain unusual aspects of genomic organization such as nested or intercalated genes.

The low number of chimeric transcripts observed is not just the result of random effects of sparse TE insertion or the deleterious effects of TEs on transcription in the D. melanogaster genome. In fact, we found far fewer chimeric TE insertions in the genome than expected, relative to the number of non-chimeric TE insertions found in introns. This result indicates that the majority of TE insertions that occur in mature gene transcripts have a much higher probability of being deleterious than non-chimeric, intronic ones. The paucity of chimeric TE insertions in exons relative to introns demonstrates that the deleterious effects of chimeric TE insertions must exceed the cost of simply being transcribed, and probably results from improper translation or disruption of other functions of the mRNA such as localization or stability. Many of these unobserved events may contribute to the genome-wide load of deleterious mutations found in natural populations of D. melanogaster [34, 35].

Population frequencies of the chimeric TE insertions observed in the genome sequence of the isogenized y; cn, bw, sp strain on the whole do not differ significantly from those of their non-chimeric counterparts. This does not imply that chimeric TE insertions found in the sequenced strain have no effects on fitness; rather that the distribution of their fitness effects is not substantially different from that of the non-chimeric TE insertions located elsewhere in the genome. At worst the observed chimeric TE insertions may be weakly deleterious and counter-selected, in contrast to the unobserved chimeric TE insertions, which are presumed to be strongly deleterious and purged rapidly from the population.

There is, however, some indirect evidence that chimeric TE insertions may in fact be less deleterious on average than non-chimeric TE insertions. If TE insertions are weakly deleterious, we expect a skew towards genomic regions of lower recombination where natural selection is less effective due to increased linkage between alleles of opposing selective effects [36]. This effect can be observed in the distribution of non-chimeric, intronic TE insertions, but is not observed in the distribution of chimeric TE insertions (Figure 2C). Thus, a typical observed chimeric TE insertion may in fact have a smaller negative effect on fitness than a typical non-chimeric TE insertion. This conclusion is supported by a lack of detectable fitness effects in direct experimental challenges on flies carrying the chimeric TE insertion detected in the Tequila (graal) gene [21].

The one TE insertion we did identify as putatively adaptive (chimera 44; Figure 3A) was previously identified in a randomly chosen set of ~60 TEs [17, 18]. We conclude that, in a search for adaptive TE insertions, selecting chimeric TE insertions is no better than selecting TEs from the Drosophila genome at random. This is perhaps not surprising, considering our finding that there is nothing unusual about the fitness effects of observed chimeric TE insertions. It is possible, however, that our inability to detect a significant difference in selection pressures resulted from the relatively small sample of both chimeric and control TE insertions studied here. Consideration of a larger number of strain pools will provide us with more statistical power and might show effects of chimerism on TE fitness that were not detected in this study.

Regardless of the forces that may have governed their history, we did identify seven chimeric TE insertions that appear to be at high frequency or possibly even fixed in North American populations of D. melanogaster. The existence of high frequency or fixed chimeric transcripts in the genome may provide a possible explanation for the curious observation of complex patterns of somatic gene expression exhibited by many LTR retrotransposons in D. melanogaster [37–40]. These largely-unexplored patterns of transcription are typically explained either by the existence of regulatory elements internal to the TE (internal enhancer model) or by the co-option of external cellular regulatory elements in the vicinity of a TE insertion (enhancer trap model) [39, 41]. The presence of chimeric transcripts in the D. melanogaster genome demonstrated here suggests a third possible mechanism for the observed pattern of somatic TE expression: read-through transcription of a host gene into a TE and cross-hybridization to a TE specific probe. Under this model, regulated expression of a host gene that produces a chimeric transcript could be (mis)interpreted as regulated expression of the TE included in the chimeric transcript.

We sought evidence for the possibility of read-through transcription as an explanation for regulated TE expression by querying the second release of the BDGP in situ database [42, 43] for embryonic expression patterns of the TEs and genes involved in chimeric transcripts detected in this study. Remarkably, as shown in Figure 4, we found that the embryonic expression pattern for developmental stages 11–16 of the gene CG12094 is almost identical to the expression pattern determined directly for the 412 element that is involved in the chimeric transcript generated by this gene.

Can read-through transcription from CG12094 explain the pattern of expression of the 412 element? We believe the answer to this question is no, for the simple reason that the probe used to determine the expression patterns of the 412 element (GM07634) shares no sequences for potential cross-hybridization with the chimeric CG12094 transcript (Figure 4). In addition, the TE insertion in CG12094 is not fixed, whereas the pattern of 412 element expression is similar among different strains (see [44]), suggesting that the presence of the 412 element insertion in CG12094 is not required for the embryonic expression pattern of the 412 element. (In fact, these data taken together are more consistent with the stage 11–16 expression pattern of CG12094 detected by the RE52190 probe being generated by spurious cross-hybridization to transcripts emanating from 412 elements located elsewhere in the genome.) Thus in the case of the 412 element, we conclude that the best candidate gene in the D. melanogaster genome cannot explain somatic TE expression by production of a read-through chimeric transcript. Clearly more data will be necessary to evaluate the generality of this conclusion, but the lack of a role for read-through transcription in this case is generally consistent with the paucity and low population frequencies of the chimeric TE insertions in the D. melanogaster genome (Table 2) and with growing evidence for internal enhancer elements controlling regulated TE transcription [45–47].

Conclusion

In contrast to mammalian genomes, we found that fewer than 1% of Drosophila genes produce mRNAs that include bona fide TE sequences, and that the vast majority of potential chimeric TE insertions are likely to be deleterious and therefore unobserved in the genome sequence. Of those chimeric TE insertions that have weak enough negative fitness effects to have been observed in the sequenced D. melanogaster genome, over half are restricted to the sequenced strain and fewer than ~15% are likely to be fixed and therefore contribute to the origin of new gene sequences in the D. melanogaster genome. The relatively low numbers of fixed chimeric TE insertions also argue against read-through transcription as a predominant mechanism for generating patterns of somatic TE transcription in Drosophila embryos. These results also highlight the need to establish the fixity of putative cases of TE domestication identified in other genome sequences in order to demonstrate their functional importance, and indicate that the process of TE domestication may vary drastically among animal taxa.

Methods

In silico screen for chimeric gene-TE transcripts

Chimeric gene-TE transcripts were identified by three independent methods (with the following number codes used in Additional file 1): 1) a genomic coordinate intersection analysis; 2) a TE-to-gene BLAST analysis; and 3) a TE-to-EST-to-gene BLAST analysis. Coordinate overlaps were evaluated using the UCSC D. melanogaster table browser [48] by finding the intersection between the "FlyBase genes" and "FlyBase noncoding genes" tables, with a subsequent filter for those TE-gene overlaps >25 bp supported by EST evidence. For the TE-to-gene BLAST analysis, we sought chimeric TEs directly by querying each canonical TE sequence in version 7.1 of the BDGP TE data set [49] that had a representative in the Release 3.1 euchromatic genome annotation against the Release 3.1 annotated transcripts. For this analysis we used the combined output of WU-BLASTN hits (parameters: B=10000 V=10000 X=3 M=3 -lcfilter -filter dust) of >50 bp and >85% identity [50] together with NCBI-BLAST2 [51] hits of E < 1 × 10^-10.
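The coordinate intersection step can be illustrated with a simple interval-overlap test. The following is a minimal sketch, not the table browser's implementation: the function names, the half-open coordinate convention, and the example coordinates are all assumptions made here for illustration, and the EST-support filter is not modeled.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals [start, end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def chimeric_candidates(exons, tes, min_overlap=25):
    """Return (exon_id, te_id, overlap) for same-chromosome overlaps > min_overlap bp."""
    hits = []
    for exon_id, exon_chrom, exon_start, exon_end in exons:
        for te_id, te_chrom, te_start, te_end in tes:
            if exon_chrom != te_chrom:
                continue
            n = overlap_bp(exon_start, exon_end, te_start, te_end)
            if n > min_overlap:
                hits.append((exon_id, te_id, n))
    return hits


# Hypothetical coordinates, for illustration only.
exons = [("geneA-exon1", "2L", 1000, 3000), ("geneB-exon1", "2L", 9000, 9500)]
tes = [("TE1", "2L", 2900, 3400), ("TE2", "2L", 9490, 9800), ("TE3", "3R", 1000, 3000)]
print(chimeric_candidates(exons, tes))
```

In this toy input only the first exon and TE1 share more than 25 bp; the 10 bp overlap with TE2 and the TE on a different chromosome arm are discarded.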

For the TE-to-EST-to-gene BLAST analysis, we developed a three-step process using WU-BLASTN with the following parameters: B=10000 V=10000 X=3 M=3 -lcfilter -filter dust. First, each TE in the BDGP TE data set was used to query the BDGP EST database (ca. Dec 2002) containing 281,297 ESTs and complete cDNAs [23, 52]. Second, ESTs with TE homology of >25 bp and >85% identity were aligned to the canonical TE sequence, and the non-TE component of the sequence was used to match the EST back to the corresponding host gene by querying transcripts in the Release 3.1 genome annotation [22]. Finally, the annotated host gene (±5000 bp) was used to query the TE database to ensure that a TE of the appropriate family is present in the genomic region, thereby filtering artifacts generated by EST library construction.
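The length and identity thresholds applied in the second step amount to a simple filter over alignment records. The sketch below assumes that hits have already been parsed into records holding the query, subject, percent identity and alignment length; the record layout, the names `Hit` and `filter_hits`, and the example identifiers are hypothetical, and the code does not attempt to parse WU-BLASTN output.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str        # TE family used as the query
    subject: str      # EST or cDNA identifier
    identity: float   # percent identity of the alignment
    length: int       # alignment length in bp


def filter_hits(hits, min_length=25, min_identity=85.0):
    """Keep hits strictly above both thresholds (>25 bp and >85% identity)."""
    return [h for h in hits if h.length > min_length and h.identity > min_identity]


# Hypothetical records, for illustration only.
hits = [
    Hit("412", "EST_0001", 97.5, 140),
    Hit("412", "EST_0002", 84.0, 300),   # fails the identity threshold
    Hit("roo", "EST_0003", 99.0, 25),    # fails the length threshold
]
kept = filter_hits(hits)
print([h.subject for h in kept])
```

The same function could be reused for the TE-to-gene analysis by passing `min_length=50`.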

Transcripts from heterochromatic regions of the Release 3 genome were excluded from this analysis, as were genes labeled as "pseudogene" or unnamed genes with "existence uncertain" status in FlyBase. We also note that as in [14], we excluded from this analysis the enigmatic INE-1 element [53] that can be found in many transcripts [54], since this repetitive sequence is structurally distinct from all other TEs in the Drosophila genome.

Composition of DNA pools

A population of 64 individual strains from North America was combined into a total of 6 pools of 8 or 12 strains. The final concentration of each pool was 2.5 ng DNA of each individual strain per PCR reaction. The composition of each pool was as follows: Wi pool: Wi1, Wi3, Wi15, Wi18, Wi41, Wi45, Wi68, Wi77, Wi83, Wi98, Wi137, Wi148 – these strains were collected at the Wolfskill Orchard, Davis, CA and have been subjected to over 30 generations of brother-sister matings (gift and personal communication by Sergey Nuzhdin); We1 pool: We4, We7, We10, We11, We25, We44, We47, We50, We57, We60, WE67, We80; We2 pool: We13, We17, We21, We28, We33, We37, We63, WE70, We75, We83, We88, We91 – the strains in the two We pools were collected in Raleigh, NC and have been subjected to 10 – 15 generations of brother-sister matings (gift and personal communication by Greg Gibson); NA pool: Broward13, Broward5, Lake5, Okee14, Okee5, Orange1, Orange2, Paho4, Paho6, Paho9, Sebring12, Sebring17 – these isofemale strains were collected at various locations throughout North America (gift and personal communication by Jeff Birdsley); NB pool: NB1, NB6, NB7, NB8, NB12, NB13, NB14, NB16 – these isofemale strains were collected in New Buffalo, Michigan (gift and personal communication by Bettina Harr); CSW pool: 3B, 6D, 11D, 20C, 23D, 25C, 29B, 36D – these isofemale strains were collected at Countryside Winery, Blountville, Tennessee (gift and personal communication by Lev Yampolsky).

We note that some of the isofemale strains above may be heterozygous for a given TE insertion. This would lead to a slight increase in the effective number of strains in any given pool. However, such an increase is unlikely to have an effect on the qualitative nature of our results, as the addition of several strains to a pool generally has no significant effect on the confidence limits of the population frequency of a TE. For instance, in the section on population frequency estimation (below, also [see Additional file 2]), we show the extent to which the population frequency estimate remains the same when we treat 8-strain and 12-strain pools as if they were equivalent to each other.

PCR assays

The presence/absence of TEs in all strain pools was determined using the polymerase chain reaction (PCR). All PCR primers were designed using Primer 3 [55] and were checked with Virtual PCR [56]. All primers have a melting temperature of 63°C (±0.2°C) and were synthesized by Operon Biotechnologies, Inc. in 96-well plates. The primers are intended to assay for the presence of the TE insertion and consist of a "Left" primer that lies within the TE sequence and a "Right" primer that lies in the flanking region to the right of the TE insertion. Primer sequences used in this study can be found in Additional file 3. The presence of the TE insertion should produce a band of approximately 500 bp and the absence of the TE insertion should result in the absence of any band. Each plate includes 3 internal controls that should always produce a single band of predetermined size, designed to control for the quality of the PCR.

We also verified that the DNA concentrations were sufficient to detect the presence of a TE in a single strain out of the 12 or 8 strains tested in the pool. Each plate of primers was assayed with a control pool comprising one of three North American pools (Wi, We1, or We2) with the addition of y; cn, bw, sp (sequenced strain) to control for primer design problems. The addition of y; cn, bw, sp should give a result indicating the presence of the TE insertion being assayed in all cases where primers were designed correctly. To be conservative, the concentration of the DNA from the y; cn, bw, sp strain was somewhat lower than that from the assayed strains. The PCR reaction mix was made using Redtaq Readymix from Sigma Aldrich (#R2523) and primers at a final concentration of 1 μmol/μl. The PCR conditions were: 94°C for 5 s, 27 cycles of: 94°C for 30 s, 62°C for 30 s and 72°C for 1 min. We note that for 83 TEs the positive control PCR did not fail in any case, showing the presence of the TE, whereas PCR with the same pool DNA lacking the TE showed its absence.

Estimation of TE population frequencies from pool frequencies

Given that a TE insertion is present in some of the North American strain pools and absent from others (i.e. given its pool frequency), we wished to calculate the likeliest frequency of this insertion in the entire North American population, as well as suitable confidence bounds around such a frequency estimate.

Let x₁ (a number between 0 and 2) and x₂ (a number between 0 and 4) be the respective numbers of 8-strain and 12-strain pools in which a particular element is present. Let y be the theoretical frequency of this element in the North American D. melanogaster population. The likelihood, L, of any particular value of y given the observed values of x₁ and x₂ is proportional to the probability of obtaining such x₁ and x₂ if y has that value. That is,

$$L(y \mid x_1, x_2) \propto \Pr(x_1 \mid y) \times \Pr(x_2 \mid y) \tag{1}$$

where Pr(x₁|y) is the probability that x₁ out of two 8-strain pools contain the element and Pr(x₂|y) is the probability that x₂ out of four 12-strain pools contain the element, given that its overall frequency in the population is y.

The first term on the right hand side of equation (1) is equal to:

$$\Pr(x_1 \mid y) = \binom{2}{x_1} \left[1-(1-y)^{8}\right]^{x_1} \left[(1-y)^{8}\right]^{2-x_1} \tag{2}$$

where (1-y)⁸ is the probability that an element is not found in a given 8-strain pool, 1-(1-y)⁸ is the probability that it is, and the first term on the right hand side is the appropriate binomial coefficient. Similarly, the second term of equation (1) is equal to:

$$\Pr(x_2 \mid y) = \binom{4}{x_2} \left[1-(1-y)^{12}\right]^{x_2} \left[(1-y)^{12}\right]^{4-x_2} \tag{3}$$

Substituting (2) and (3) into (1) and simplifying, we find that

$$L(y \mid x_1, x_2) = k \left[1-(1-y)^{8}\right]^{x_1} \left[1-(1-y)^{12}\right]^{x_2} (1-y)^{64-8x_1-12x_2} \tag{4}$$

where k is an arbitrary multiplicative constant that absorbs the binomial coefficients in (2) and (3), since they are independent of the parameter y. In accordance with common practice, we make use of the log-likelihood function ln(L), which entails an arbitrary additive constant ln(k).
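Written out under the binomial model described above (two 8-strain pools and four 12-strain pools), the log-likelihood takes the following form. This is a restatement for illustration rather than an expression quoted from the original analysis:

$$\ln L(y \mid x_1, x_2) = \ln k + x_1 \ln\left[1-(1-y)^{8}\right] + x_2 \ln\left[1-(1-y)^{12}\right] + \left(64 - 8x_1 - 12x_2\right)\ln(1-y)$$

As a worked instance, take x₁ = 0 and x₂ = 4. Setting the derivative with respect to y to zero gives (1-y)¹² = 1/4, so the maximum lies at y = 1 - 4^(-1/12) ≈ 0.109.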

Additional file 2 provides three examples of the resulting log-likelihood functions. These functions correspond to the three possible combinations of x's that yield a total of four pools with detected element presence (i.e. for (x₁, x₂) equal to (0, 4), (1, 3) and (2, 2)). This file demonstrates that, given that the element is present in four out of six pools, estimation of population frequencies is relatively insensitive to the number of pools that contain eight or 12 strains. Therefore, to simplify the analysis, we combined all combinations of x₁ and x₂ under a common category such that x₁ + x₂ = 4.
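The comparison among the three combinations can be explored numerically. The sketch below is an illustration under the binomial pool model described in this section, not the code used for Additional file 2; the function names and the grid-search approach are assumptions introduced here.

```python
import math


def log_likelihood(y, x1, x2, n8=2, n12=4):
    """ln L(y | x1, x2) up to the additive constant ln(k); valid for 0 < y < 1."""
    q8 = (1.0 - y) ** 8     # probability that an 8-strain pool lacks the element
    q12 = (1.0 - y) ** 12   # probability that a 12-strain pool lacks the element
    return (x1 * math.log(1.0 - q8) + (n8 - x1) * math.log(q8)
            + x2 * math.log(1.0 - q12) + (n12 - x2) * math.log(q12))


def mle(x1, x2, steps=9999):
    """Grid-search maximum of the log-likelihood over y in the open interval (0, 1)."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(grid, key=lambda y: log_likelihood(y, x1, x2))


# The three ways of observing the element in four of the six pools.
for x1, x2 in [(0, 4), (1, 3), (2, 2)]:
    print((x1, x2), round(mle(x1, x2), 3))
```

A grid search is used instead of a numerical optimizer to keep the example dependency-free; a finer grid or a root-finder on the derivative would give more precise estimates.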

For each log-likelihood function, the maximum likelihood estimate of the population frequency is the value of y at which the function reaches its maximum (middle column of Table 3). The confidence limits are the values of y at which the function drops to its maximum minus two (rightmost column of Table 3), as determined by a likelihood ratio test. The test statistic is the likelihood ratio of the 0-parameter model where y is fixed at the value of its maximum likelihood estimate to the 1-parameter model where y is allowed to vary. This statistic follows a χ² distribution with one degree of freedom. When the difference in log-likelihoods increases above 2, the likelihood ratio increases above e² = 7.39, where e is the base of the natural logarithm. This value is the 99.3% quantile of the χ² distribution (corresponding to p = 0.007, 1 d.f.). These confidence limits were used to set the error bars in Figures 3A and 3B. Note that in situations with more than one possible combination of x₁ and x₂, the two rightmost columns of Table 3 list values that are averaged over all possible combinations (see explanation for x₁ + x₂ = 4 above).
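The estimate and its limits can be located together by scanning the log-likelihood. The following is a minimal sketch assuming the pool design described above (two 8-strain and four 12-strain pools); the function names, the grid resolution, and the `drop` parameter are choices made for this illustration, and the code does not reproduce the averaging over combinations used for Table 3.

```python
import math


def log_likelihood(y, x1, x2):
    """Log-likelihood (up to a constant) for two 8-strain and four 12-strain pools."""
    q8, q12 = (1.0 - y) ** 8, (1.0 - y) ** 12
    return (x1 * math.log(1.0 - q8) + (2 - x1) * math.log(q8)
            + x2 * math.log(1.0 - q12) + (4 - x2) * math.log(q12))


def estimate_with_limits(x1, x2, drop=2.0, steps=9999):
    """Return (lower, mle, upper): grid points within `drop` log units of the maximum."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    values = [log_likelihood(y, x1, x2) for y in grid]
    best = max(values)
    inside = [y for y, v in zip(grid, values) if v >= best - drop]
    return min(inside), grid[values.index(best)], max(inside)


lower, y_hat, upper = estimate_with_limits(1, 3)
print(f"y_hat = {y_hat:.3f}, limits = ({lower:.3f}, {upper:.3f})")
```

The limits are asymmetric around the estimate because the log-likelihood falls off more steeply toward y = 0 than toward higher frequencies when most pools contain the element.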

Estimation of genomic recombination rate in the neighborhood of each TE insertion

We estimated the recombination rate at each TE insertion site using a method previously developed for the D. melanogaster genome [54]. This method combines the known physical and genetic distances between D. melanogaster genes to estimate the recombination rate profile of each chromosome as a second-degree polynomial function. An explanation of the method, and a tool that demonstrates its use, can be found on the world-wide web [57].

In Figure 2C, we classify chromosomal sites where the polynomial functions in [54] drop below zero as areas with "zero" recombination. We find that for the TE insertions in non-zero recombination areas, the median recombination rate is 2.75 cM / Mbp. Accordingly, we classify chromosomal sites with recombination rates above 0 and below 2.75 as areas with "low" recombination rates. The remaining chromosomal regions are labeled as areas of "high" recombination.
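The three-way classification can be expressed compactly. The sketch below is an illustration only: the polynomial coefficients are hypothetical and are not those of [54], and the treatment of rates exactly equal to 0 or 2.75 cM/Mbp is an assumption, since the text does not specify these boundary cases.

```python
def recombination_rate(position_mbp, coeffs):
    """Evaluate a second-degree polynomial a*x**2 + b*x + c at a chromosomal position."""
    a, b, c = coeffs
    return a * position_mbp ** 2 + b * position_mbp + c


def classify(rate, threshold=2.75):
    """Map a rate in cM/Mbp to 'zero', 'low' or 'high' recombination."""
    if rate <= 0:          # polynomial at or below zero: treated as no recombination
        return "zero"
    if rate < threshold:   # above 0 and below the median of non-zero sites
        return "low"
    return "high"          # all remaining sites


# Hypothetical coefficients for one chromosome arm, not those of [54].
coeffs = (-0.05, 1.0, -1.0)
for pos in (0.5, 3.0, 10.0):
    print(pos, classify(recombination_rate(pos, coeffs)))
```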

Abbreviations

bp: base pairs
BDGP: Berkeley Drosophila Genome Project
BLAST: Basic Local Alignment Search Tool
EST: Expressed Sequence Tag
GO: Gene Ontology
LINE: Long Interspersed Nuclear Element
LTR: Long Terminal Repeat
Mbp: megabase pairs
PCR: Polymerase Chain Reaction
TE: Transposable Element
TIR: Terminal Inverted Repeat
UTR: Untranslated Region

References

Lynch M, Conery JS: The evolutionary fate and consequences of duplicate genes. Science. 2000, 290 (5494): 1151-1155. 10.1126/science.290.5494.1151.
Ohno S: Evolution by gene duplication. 1970, London: George Allen and Unwin
Betran E, Thornton K, Long M: Retroposed new genes out of the X in Drosophila. Genome Res. 2002, 12 (12): 1854-1859. 10.1101/gr.6049.
McClintock B: Controlling elements and the gene. Cold Spring Harb Symp Quant Biol. 1956, 21: 197-216.
Brandt J, Schrauth S, Veith AM, Froschauer A, Haneke T, Schultheis C, Gessler M, Leimeister C, Volff JN: Transposable elements as a source of genetic innovation: expression and evolution of a family of retrotransposon-derived neogenes in mammals. Gene. 2005, 345 (1): 101-111. 10.1016/j.gene.2004.11.022.
Britten RJ: Coding sequences of functioning human genes derived entirely from mobile element sequences. Proc Natl Acad Sci U S A . 2004, 101 (48): 16825-16830. 10.1073/pnas.0406985101.
Sorek R, Ast G, Graur D: Alu-containing exons are alternatively spliced. Genome Res. 2002, 12 (7): 1060-1067. 10.1101/gr.229302.
Nekrutenko A, Li WH: Transposable elements are found in a large number of human protein-coding genes. Trends Genet. 2001, 17 (11): 619-621. 10.1016/S0168-9525(01)02445-3.
Brosius J: Genomes were forged by massive bombardments with retroelements and retrosequences. Genetica. 1999, 107 (1–3): 209-238. 10.1023/A:1004018519722.
Jordan IK, Rogozin IB, Glazko GV, Koonin EV: Origin of a substantial fraction of human regulatory sequences from transposable elements. Trends Genet. 2003, 19 (2): 68-72. 10.1016/S0168-9525(02)00006-9.
van de Lagemaat LN, Landry JR, Mager DL, Medstrand P: Transposable elements in mammals promote regulatory variation and diversification of genes with specialized functions. Trends Genet . 2003, 19 (10): 530-536. 10.1016/j.tig.2003.08.004.
Franchini LF, Ganko EW, McDonald JF: Retrotransposon-gene associations are widespread among D. melanogaster populations. Mol Biol Evol. 2004, 21 (7): 1323-1331. 10.1093/molbev/msh116.
Ganko EW, Bhattacharjee V, Schliekelman P, McDonald JF: Evidence for the contribution of LTR retrotransposons to C. elegans gene evolution. Mol Biol Evol. 2003, 20 (11): 1925-1931. 10.1093/molbev/msg200.
Kaminker JS, Bergman CM, Kronmiller B, Carlson J, Svirskas R, Patel S, Frise E, Wheeler DA, Lewis SE, Rubin GM, Ashburner M, Celniker SE: The transposable elements of the Drosophila melanogaster euchromatin: a genomics perspective. Genome Biol. 2002, 3 (12): RESEARCH0084-10.1186/gb-2002-3-12-research0084.
Lis JT, Prestidge L, Hogness DS: A novel arrangement of tandemly repeated genes at a major heat shock site in D. melanogaster. Cell. 1978, 14 (4): 901-919. 10.1016/0092-8674(78)90345-8.
Bartolome C, Maside X, Charlesworth B: On the abundance and distribution of transposable elements in the genome of Drosophila melanogaster. Mol Biol Evol. 2002, 19: 926-937.
Petrov DA, Aminetzach YT, Davis JC, Bensasson D, Hirsh AE: Size matters: non-LTR retrotransposable elements and ectopic recombination in Drosophila. Mol Biol Evol. 2003, 20 (6): 880-892. 10.1093/molbev/msg102.
Aminetzach YT, Macpherson JM, Petrov DA: Pesticide resistance via transposition-mediated adaptive gene truncation in Drosophila. Science. 2005, 309 (5735): 764-767. 10.1126/science.1112699.
Marsano RM, Caizzi R, Moschetti R, Junakovic N: Evidence for a functional interaction between the Bari1 transposable element and the cytochrome P450 cyp12a4 gene in Drosophila melanogaster. Gene. 2005
Yang HP, Tanikawa AY, Kondrashov AS: Molecular nature of 11 spontaneous de novo mutations in Drosophila melanogaster. Genetics. 2001, 157 (3): 1285-1292.
Munier AI, Medzhitov R, Janeway CA, Doucet D, Capovilla M, Lagueux M: graal : a Drosophila gene coding for several mosaic serine proteases. Insect Biochem Mol Biol. 2004, 34 (10): 1025-1035. 10.1016/j.ibmb.2003.09.009.
Misra S, Crosby MA, Mungall CJ, Matthews BB, Campbell KS, Hradecky P, Huang Y, Kaminker JS, Millburn GH, Prochnik SE, Smith CD, Tupy JL, Whitfied EJ, Bayraktaroglu L, Berman BP, Bettencourt BR, Celniker SE, de Grey AD, Drysdale RA, Harris NL, Richter J, Russo S, Schroeder AJ, Shu SQ, Stapleton M, Yamada C, Ashburner M, Gelbart WM, Rubin GM, Lewis SE: Annotation of the Drosophila melanogaster euchromatic genome: a systematic review. Genome Biol. 2002, 3 (12): RESEARCH0083-10.1186/gb-2002-3-12-research0083.
Stapleton M, Carlson J, Brokstein P, Yu C, Champe M, George R, Guarin H, Kronmiller B, Pacleb J, Park S, Wan K, Rubin GM, Celniker SE: A Drosophila full-length cDNA resource. Genome Biol. 2002, 3 (12): RESEARCH0080-10.1186/gb-2002-3-12-research0080.
Ashburner M, Misra S, Roote J, Lewis SE, Blazej R, Davis T, Doyle C, Galle R, George R, Harris N, Hartzell G, Harvey D, Hong L, Houston K, Hoskins R, Johnson G, Martin C, Moshrefi A, Palazzolo M, Reese MG, Spradling A, Tsang G, Wan K, Whitelaw K, Celniker S, Rubin GM: An exploration of the sequence of a 2.9-Mb region of the genome of Drosophila melanogaster: the Adh region. Genetics. 1999, 153: 179-219.
Wright SI, Agrawal N, Bureau TE: Effects of recombination rate and gene density on transposable element distributions in Arabidopsis thaliana. Genome Res. 2003, 13 (8): 1897-1903.
Celniker SE, Wheeler DA, Kronmiller B, Carlson JW, Halpern A, Patel S, Adams M, Champe M, Dugan SP, Frise E, Hodgson A, George RA, Hoskins RA, Laverty T, Muzny DM, Nelson CR, Pacleb JM, Park S, Pfeiffer BD, Richards S, Svirskas R, Tabor PE, Wan K, Scherer SE, Stapleton M, Sutton GG, Venter C, Weinstock G, Myers EW, Gibbs RA, Rubin GM: Finishing a whole genome shotgun sequence assembly: release 3 of the Drosophila euchromatic genome sequence. Genome Biology. 2002, 3: RESEARCH0079-10.1186/gb-2002-3-12-research0079.
Maside X, Bartolome C, Charlesworth B: S-element insertions are associated with the evolution of the Hsp70 genes in Drosophila melanogaster. Curr Biol. 2002, 12: 1686-10.1016/S0960-9822(02)01181-8.
McCollum AM, Ganko EW, Barrass PA, Rodriguez JM, McDonald JF: Evidence for the adaptive significance of an LTR retrotransposon sequence in a Drosophila heterochromatic gene. BMC Evol Biol. 2002, 2 (1): 5-10.1186/1471-2148-2-5.
Bartolome C, Maside X: The lack of recombination drives the fixation of transposable elements on the fourth chromosome of Drosophila melanogaster . Genet Res. 2004, 83 (2): 91-100. 10.1017/S0016672304006755.
Nuzhdin SV: Sure facts, speculations, and open questions about the evolution of transposable element copy number. Genetica. 1999, 107 (1–3): 129-137. 10.1023/A:1003957323876.
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blocker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, Bailey JA, 
Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowski J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ: Initial sequencing and analysis of the human genome. Nature. 2001, 409 (6822): 860-921. 10.1038/35057062.
Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, Antonarakis SE, Attwood J, Baertsch R, Bailey J, Barlow K, Beck S, Berry E, Birren B, Bloom T, Bork P, Botcherby M, Bray N, Brent MR, Brown DG, Brown SD, Bult C, Burton J, Butler J, Campbell RD, Carninci P, Cawley S, Chiaromonte F, Chinwalla AT, Church DM, Clamp M, Clee C, Collins FS, Cook LL, Copley RR, Coulson A, Couronne O, Cuff J, Curwen V, Cutts T, Daly M, David R, Davies J, Delehaunty KD, Deri J, Dermitzakis ET, Dewey C, Dickens NJ, Diekhans M, Dodge S, Dubchak I, Dunn DM, Eddy SR, Elnitski L, Emes RD, Eswara P, Eyras E, Felsenfeld A, Fewell GA, Flicek P, Foley K, Frankel WN, Fulton LA, Fulton RS, Furey TS, Gage D, Gibbs RA, Glusman G, Gnerre S, Goldman N, Goodstadt L, Grafham D, Graves TA, Green ED, Gregory S, Guigo R, Guyer M, Hardison RC, Haussler D, Hayashizaki Y, Hillier LW, Hinrichs A, Hlavina W, Holzer T, Hsu F, Hua A, Hubbard T, Hunt A, Jackson I, Jaffe DB, Johnson LS, Jones M, Jones TA, Joy A, Kamal M, Karlsson EK, Karolchik D, Kasprzyk A, Kawai J, Keibler E, Kells C, Kent WJ, Kirby A, Kolbe DL, Korf I, Kucherlapati RS, Kulbokas EJ, Kulp D, Landers T, Leger JP, Leonard S, Letunic I, Levine R, Li J, Li M, Lloyd C, Lucas S, Ma B, Maglott DR, Mardis ER, Matthews L, Mauceli E, Mayer JH, McCarthy M, McCombie WR, McLaren S, McLay K, McPherson JD, Meldrim J, Meredith B, Mesirov JP, Miller W, Miner TL, Mongin E, Montgomery KT, Morgan M, Mott R, Mullikin JC, Muzny DM, Nash WE, Nelson JO, Nhan MN, Nicol R, Ning Z, Nusbaum C, O'Connor MJ, Okazaki Y, Oliver K, Overton-Larty E, Pachter L, Parra G, Pepin KH, Peterson J, Pevzner P, Plumb R, Pohl CS, Poliakov A, Ponce TC, Ponting CP, Potter S, Quail M, Reymond A, Roe BA, Roskin KM, Rubin EM, Rust AG, Santos R, Sapojnikov V, Schultz B, Schultz J, Schwartz MS, Schwartz S, Scott C, Seaman S, Searle S, Sharpe T, Sheridan A, Shownkeen R, Sims S, Singer JB, Slater G, Smit A, Smith DR, Spencer B, Stabenau 
A, Stange-Thomann N, Sugnet C, Suyama M, Tesler G, Thompson J, Torrents D, Trevaskis E, Tromp J, Ucla C, Ureta-Vidal A, Vinson JP, Von Niederhausern AC, Wade CM, Wall M, Weber RJ, Weiss RB, Wendl MC, West AP, Wetterstrand K, Wheeler R, Whelan S, Wierzbowski J, Willey D, Williams S, Wilson RK, Winter E, Worley KC, Wyman D, Yang S, Yang SP, Zdobnov EM, Zody MC, Lander ES: Initial sequencing and comparative analysis of the mouse genome. Nature . 2002, 420 (6915): 520-562. 10.1038/nature01262.
Quesneville H, Bergman CM, Andrieu O, Autard D, Nouaud D, Ashburner M, Anxolabehere D: Combined evidence annotation of transposable elements in genome sequences. PLoS Comput Biol. 2005, 1 (2): e22-10.1371/journal.pcbi.0010022.
Mukai T, Yamaguchi O: The genetic structure of natural populations of Drosophila melanogaster. XI. Genetic variability in a local population. Genetics. 1974, 76 (2): 339-366.
Watanabe TK, Yamaguchi O, Mukai T: The genetic variability of third chromosomes in a local population of Drosophila melanogaster. Genetics. 1976, 82 (1): 63-82.
Hill WG, Robertson A: The effect of linkage on limits to artificial selection. Genet Res. 1966, 8 (3): 269-294.
Flavell AJ, Ruby SW, Toole JJ, Roberts BE, Rubin GM: Translation and developmental regulation of RNA encoded by the eukaryotic transposable element copia. Proc Natl Acad Sci U S A. 1980, 77 (12): 7107-7111.
Parkhurst SM, Corces VG: Developmental expression of Drosophila melanogaster retrovirus-like transposable elements. EMBO J. 1987, 6: 419-424.
Ding D, Lipshitz HD: Spatially regulated expression of retrovirus-like transposons during Drosophila melanogaster embryogenesis. Genet Res. 1994, 64: 167-181.
Kearney JB, Wheeler SR, Estes P, Parente B, Crews ST: Gene expression profiling of the developing Drosophila CNS midline cells. Dev Biol. 2004, 275 (2): 473-492. 10.1016/j.ydbio.2004.08.047.
Arkhipova IR, Lyubomirskaya NV, Ilyin YV: Drosophila Retrotransposons. 1995, Austin, TX: R.G. Landes Co
BDGP Embryonic Expression Pattern Project. [http://www.fruitfly.org/cgi-bin/ex/insitu.pl]
Tomancak P, Beaton A, Weiszmann R, Kwan E, Shu S, Lewis SE, Richards S, Ashburner M, Hartenstein V, Celniker SE, Rubin GM: Systematic determination of patterns of gene expression during Drosophila embryogenesis. Genome Biol . 2002, 3 (12): RESEARCH0088-0088. 10.1186/gb-2002-3-12-research0088.
Brookman JJ, Toosy AT, Shashidhara LS, White RA: The 412 retrotransposon and the development of gonadal mesoderm in Drosophila. Development . 1992, 116: 1185-1192.
Mozer BA, Benzer S: Ingrowth by photoreceptor axons induces transcription of a retrotransposon in the developing Drosophila brain. Development. 1994, 120 (5): 1049-1058.
Bronner G, Taubert H, Jackle H: Mesoderm-specific B104 expression in the Drosophila embryo is mediated by internal cis-acting elements of the transposon. Chromosoma. 1995, 103 (10): 669-675.
Meignin C, Dastugue B, Vaury C: Intercellular communication between germ line and somatic line is utilized to control the transcription of ZAM, an endogenous retrovirus from Drosophila melanogaster. Nucleic Acids Res. 2004, 32 (13): 3799-3806. 10.1093/nar/gkh708.
UCSC D. melanogaster Genome Browser Gateway. [http://genome.ucsc.edu/cgi-bin/hgGateway?clade=insect&org=D.+melanogaster&db=dm1]
BDGP Natural Transposable Element Project. [http://www.fruitfly.org/p_disrupt/TE.html]
Washington University BLAST Archives. [http://blast.wustl.edu/]
NCBI BLAST. [ftp://ftp.ncbi.nlm.nih.gov/blast/executables/]
Stapleton M, Liao G, Brokstein P, Hong L, Carninci P, Shiraki T, Hayashizaki Y, Champe M, Pacleb J, Wan K, Yu C, Carlson J, George R, Celniker S, Rubin GM: The Drosophila Gene Collection: Identification of putative full-length cDNAs for 70% of D. melanogaster genes. Genome Res. 2002, 12: 1294-1300. 10.1101/gr.269102.
Locke J, Howard LT, Aippersbach N, Podemski L, Hodgetts RB: The characterization of DINE-1, a short, interspersed repetitive element present on chromosome and in the centric heterochromatin of Drosophila melanogaster. Chromosoma. 1999, 108 (6): 356-366. 10.1007/s004120050387.
Singh ND, Arndt PF, Petrov DA: Genomic heterogeneity of background substitutional patterns in Drosophila melanogaster. Genetics. 2005, 169 (2): 709-722. 10.1534/genetics.104.032250.
Rozen S, Skaletsky HJ: Primer3 on the WWW for general users and for biologist programmers. Bioinformatics Methods and Protocols: Methods in Molecular Biology. Edited by: Krawetz S, Misener S. 2000, Totowa, NJ: Humana Press, 365-386.
Lexa M, Horak J, Brzobohaty B: Virtual PCR. Bioinformatics. 2001, 17 (2): 192-193. 10.1093/bioinformatics/17.2.192.
Drosophila melanogaster recombination rate calculator. [http://cgi.stanford.edu/~lipatov/recombination/recombination-rates.txt]


Acknowledgements

We thank Jeff Birdsley, Greg Gibson, Bettina Harr, Sergey Nuzhdin and Lev Yampolsky for the gifts of Drosophila strains and members of the DAP lab for helpful discussions. We thank Douda Bensasson and three anonymous reviewers for helpful comments on the manuscript. This work was funded by the Achievement Rewards for College Scientists Foundation through the Stanford Graduate Fellowship program (to ML); an NSF grant #0317171 (PI: DAP) and the Sloan and Hellman Fellowships (to DAP); an NIH training fellowship T32 HL07279 (PI: E. Rubin) and a USA Research Fellowship from the Royal Society (to CMB).

Author information

Authors and Affiliations

Department of Biological Sciences, Stanford University, Stanford, CA, 94305, USA
Mikhail Lipatov, Kapa Lenkov & Dmitri A Petrov
Faculty of Life Sciences, University of Manchester, Manchester, M13 9PT, UK
Casey M Bergman


Corresponding author

Correspondence to Casey M Bergman.

Additional information

Authors' contributions

ML developed and carried out the population genomic analyses and drafted the manuscript; KEL participated in the design of the population genetic study and gathered all molecular population genetic data; DAP helped conceive the study, participated in the design and coordination of the population genetic and population genomic components of the study, and helped draft the manuscript; CMB conceived the study, conducted the bioinformatics analyses, participated in the analysis of the data and drafted the manuscript.

Electronic supplementary material

12915_2005_52_MOESM1_ESM.xls

Additional file 1: Table of chimeric TE insertions in D. melanogaster Release 3 genome sequence with methods used for detection, location of TE in chimeric transcript, and supporting ESTs. (XLS 27 KB)

12915_2005_52_MOESM2_ESM.pdf

Additional file 2: Example of log-likelihood function for estimating population frequencies from pool frequencies (see methods for details). (PDF 122 KB)

Additional file 3: Table of PCR primers used in this study. (XLS 22 KB)

Rights and permissions

Open Access. This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Lipatov, M., Lenkov, K., Petrov, D.A. et al. Paucity of chimeric gene-transposable element transcripts in the Drosophila melanogaster genome. BMC Biol 3, 24 (2005). https://doi.org/10.1186/1741-7007-3-24

Received: 06 July 2005
Accepted: 12 November 2005
Published: 12 November 2005
DOI: https://doi.org/10.1186/1741-7007-3-24
