Fosmid library end sequencing reveals a rarely known genome structure of marine shrimp Penaeus monodon

Huang, Shiao-Wei; Lin, You-Yu; You, En-Min; Liu, Tze-Tze; Shu, Hung-Yu; Wu, Keh-Ming; Tsai, Shih-Feng; Lo, Chu-Fang; Kou, Guang-Hsiung; Ma, Gwo-Chin; Chen, Ming; Wu, Dongying; Aoki, Takashi; Hirono, Ikuo; Yu, Hon-Tsen

doi:10.1186/1471-2164-12-242

Research article
Open access
Published: 17 May 2011

BMC Genomics volume 12, Article number: 242 (2011)

Abstract

Background

The black tiger shrimp (Penaeus monodon) is one of the most important aquaculture species in the world, representing the crustacean lineage which possesses the greatest species diversity among marine invertebrates. Yet we know very little about its genomic structure. To understand the organization and evolution of the P. monodon genome, a fosmid library consisting of 288,000 colonies was constructed, equivalent to 5.3-fold coverage of the 2.17 Gb genome. Approximately 11.1 Mb of fosmid end sequences (FESs) from 20,926 non-redundant reads representing 0.45% of the P. monodon genome were obtained for repetitive and protein-coding sequence analyses.

Results

We found that microsatellite sequences were highly abundant in the P. monodon genome, comprising 8.3% of the total length. The density and the average length of microsatellites were evidently higher than those of other taxa. AT-rich microsatellite motifs, especially poly (AT) and poly (AAT), were the most abundant. A high abundance of microsatellite sequences was also found in the transcribed regions. Furthermore, via self-BlastN analysis we identified 103 novel repetitive element families which were categorized into four groups, i.e., 33 WSSV-like repeats, 14 retrotransposons, 5 gene-like repeats, and 51 unannotated repeats. Overall, various types of repeats comprise 51.18% of the P. monodon genome in length. Approximately 7.4% of the FESs contained protein-coding sequences, and the Inhibitor of Apoptosis Protein (IAP) gene and the Innexin 3 gene homologues appear to be present in high abundance in the P. monodon genome.

Conclusions

The redundancy of various repeat types in the P. monodon genome illustrates its highly repetitive nature. In particular, long and dense microsatellite sequences as well as abundant WSSV-like sequences distinguish the genome organization of penaeid shrimp from that of other taxa. These results substantially improve our current knowledge not only of shrimp but also of marine crustaceans with large genomes.

Background

Crustaceans (lobster, shrimp, crab, etc.), a remarkable group of organisms filling up all types of habitats in the ocean with a wide array of adaptations, possess the greatest species diversity among marine animals. They are not only abundant in number, but also are among the most commercially exploited food species for human consumption [1]. Given their primarily aquatic habitats, however, they are not as well studied as insects, their terrestrial arthropod relatives.

The tiger shrimp (Penaeus monodon) has been one of the most important captured and cultured marine crustaceans in the world, especially in the Indo-Pacific region [1, 2]. However, the tiger shrimp industry has been plagued by viral diseases [3–5], resulting in substantial economic losses. Developments in shrimp genomics have been limited although a reasonably good EST database is available (Penaeus Genome Database; http://sysbio.iis.sinica.edu.tw/page/) [6]. A genomic analysis for the tiger shrimp will make a key contribution to deciphering the evolutionary history representing the crustacean lineages, especially those living in the ocean. The information contained in the genomic sequences will also benefit the shrimp industry by offering genomic tools to fend off the viral diseases and to improve the breeding program.

The genome size of the penaeid shrimp is estimated to be 2/3 of the human genome [7] and thus an order of magnitude larger than those of the model invertebrates Caenorhabditis elegans and Drosophila melanogaster. Given that this genome is larger than those of other invertebrates, we are most interested in knowing what makes up the genomic DNA of the tiger shrimp. Our initial attempt to sequence a few fosmid clones was hindered by an unusually high percentage of failed sequencing reactions and by difficulties in assembling contigs, raising the suspicion that the shrimp genome is extraordinarily repetitive in nature. Consequently, we set out to get a glimpse of the genomic structure by sequencing the ends of fosmid clones. The results would offer insights into appropriate and effective strategies for whole genome sequencing. To achieve this aim, we constructed a P. monodon fosmid library from a female shrimp and made an initial analysis of 20,926 high-quality end sequences, a total of 11,114,786 bp representing 0.45% of the whole genome. The results substantially improve our current knowledge not only of shrimp but also of the genomic structure of invertebrates with large genomes.

Results

Estimation of the P. monodon genome size

The genome size of P. monodon has never been determined experimentally; we therefore measured the DNA content of P. monodon hemocytes with flow cytometry, using human lymphocytes as a standardized control. In addition, the white shrimp (P. vannamei) genome, whose size is known, was used as a reference. The 1C nuclear DNA content of P. monodon was estimated to be ~72.2% of the human genome, i.e., ~2.53 pg DNA per nucleus or 2.17×10⁹ bp per haploid genome. We also found the DNA content of P. vannamei to be 71.5% of the human genome (Additional file 1), which is consistent with the value previously reported by Chow et al. [7]. The 1C value of P. monodon is close to those previously reported for four other penaeid shrimp species (2.37-2.51 pg for P. aztecus, P. duorarum, P. vannamei, and P. setiferus; see Chow et al. [7]).

Construction and characterization of the fosmid library

The constructed P. monodon fosmid library consists of a total of 288,000 clones arrayed in 750 384-well microtiter plates. To evaluate the average insert size, 111 clones were randomly selected from the fosmid library and analyzed with NotI. The average insert size (40.8 ± 3.6 kb) was close to the expected 40 kb (Additional file 2). Therefore, the P. monodon fosmid library covers 5.3× haploid genome equivalents based on an estimate of 2.17×10⁹ bp per haploid genome.
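The coverage figure follows from simple arithmetic on the numbers quoted above. The sketch below is only an illustration: the helper name `library_coverage` is hypothetical, and using the expected 40 kb insert size (rather than the measured 40.8 kb) is an assumption that happens to reproduce the reported 5.3×.

```python
# Hypothetical helper; every input is a figure quoted in the text.
def library_coverage(n_clones: int, insert_bp: float, genome_bp: float) -> float:
    """Genome equivalents = total cloned bases / haploid genome size."""
    return n_clones * insert_bp / genome_bp

N_CLONES = 750 * 384   # 750 plates of 384 wells each
GENOME_BP = 2.17e9     # haploid genome size estimated by flow cytometry

# Assumption: the expected insert size (40 kb) is the one behind "5.3x".
coverage_expected = library_coverage(N_CLONES, 40_000, GENOME_BP)
# For comparison, the measured mean insert size (40.8 kb).
coverage_measured = library_coverage(N_CLONES, 40_800, GENOME_BP)

print(f"clones: {N_CLONES}")
print(f"coverage with 40.0 kb inserts: {coverage_expected:.2f}x")
print(f"coverage with 40.8 kb inserts: {coverage_measured:.2f}x")
```

With the measured insert size the estimate is slightly higher, so 5.3× can be read as a mildly conservative value.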

Fosmid-end-sequence (FES) analysis

A total of 20,926 high-quality FESs (GenBank accession numbers JJ726384-JJ747309) with read lengths of ≥100 bp (Additional file 3) were obtained from 11,850 fosmid clones. Of the 11,850 fosmid clones, 9,072 had both end sequences present in our FESs. The length of the FESs ranged from 100 bp to 861 bp, with an average read length of 531 bp. A total of 11,114,786 bp of genomic sequence was generated in this study, representing approximately 0.45% of the P. monodon genome. The P. monodon genome appeared to be AT-rich, with a GC content of 45.88%. This is the first estimate of GC content in a marine shrimp.
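Summary statistics of this kind (mean read length, GC content) are straightforward to compute from a set of reads. The following is a minimal sketch, not the pipeline used in the study; `read_stats` is a hypothetical helper and the two toy reads are invented purely for illustration.

```python
# Hypothetical helper: mean read length and pooled GC fraction of a read set.
def read_stats(reads):
    total_bp = sum(len(r) for r in reads)
    gc_bp = sum(r.upper().count("G") + r.upper().count("C") for r in reads)
    return total_bp / len(reads), gc_bp / total_bp

# Toy reads (made up); real input would be the 20,926 FESs.
toy_reads = ["ATGC", "AATT"]
toy_mean_len, toy_gc = read_stats(toy_reads)
print(toy_mean_len, toy_gc)

# Cross-check of the reported average using the totals quoted in the text.
reported_mean = 11_114_786 / 20_926
print(f"reported mean read length: {reported_mean:.0f} bp")
```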

Repetitive sequence analysis

Repetitive sequences comprise an important part of eukaryotic genomes, and each species has its own characteristic repetitive sequences. The overall constitution of repetitive elements in the P. monodon genome was assessed by RepeatMasker. Of the 20,926 fosmid end reads, 49.82% (10,425/20,926) contained repeats (against A. gambiae repeat database). In terms of lengths, 15.49% and 15.44% of base pairs were repeatmasked against D. melanogaster and A. gambiae repeat database, respectively (Table 1).

Table 1 Characterization of repeat types by RepeatMasker*


Although the two databases (D. melanogaster and A. gambiae) masked similar proportions of repetitive sequence, the lengths assigned to the major repeat types differed (Table 1). The length of transposable elements (both retrotransposons and DNA transposons) masked with the D. melanogaster database (54,154 bp) was much less than with the A. gambiae database (109,658 bp), while the length of simple repeats masked with the D. melanogaster database (858,898 bp) was larger than with the A. gambiae database (807,927 bp).

Of all repeat types, simple repeats were the most abundant, identified in approximately 7.5% (7.73% against D. melanogaster and 7.27% against A. gambiae) of the total 11,114,786 bp of FESs and accounting for nearly half of the repetitive sequences. Low complexity repeats (3.48% average over two databases) and small RNAs (3.77% average over two databases) were two other abundant repeat types (Table 1). Interspersed repeats (mainly retrotransposons and DNA transposons) were the least abundant (0.74% average over two databases), accounting for only a small fraction (4.76%) of repetitive sequences in length. Among transposable elements, long terminal repeat (LTR) retrotransposons were the most abundant, followed by non-LTR retrotransposons and DNA transposons. Among the LTR retrotransposons, the gypsy type ranked first and was the only LTR element identified by RepeatMasker using the A. gambiae repeat database.
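The percentages above can be recovered from the masked lengths quoted in the text. The sketch below is a simple cross-check under that assumption; the dictionary layout and variable names are hypothetical.

```python
TOTAL_FES_BP = 11_114_786

# Masked lengths (bp) quoted in the text, keyed by repeat database.
masked_bp = {
    "D. melanogaster": {"simple_repeats": 858_898, "transposable_elements": 54_154},
    "A. gambiae": {"simple_repeats": 807_927, "transposable_elements": 109_658},
}

# Convert each masked length to a percentage of all FES bases.
pct = {
    db: {kind: 100 * bp / TOTAL_FES_BP for kind, bp in kinds.items()}
    for db, kinds in masked_bp.items()
}

# Average the transposable element percentage over the two databases.
te_mean_pct = sum(p["transposable_elements"] for p in pct.values()) / len(pct)

for db, p in pct.items():
    print(f"{db}: simple repeats {p['simple_repeats']:.2f}%, "
          f"transposable elements {p['transposable_elements']:.2f}%")
print(f"transposable elements, mean of both databases: {te_mean_pct:.2f}%")
```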

Frequency and relative abundance of microsatellites in P. monodon

In analyzing the components of repetitive sequences, we noticed that simple repeats, which include microsatellites, comprise a significant proportion (~7.5%) of the tiger shrimp genome and account for nearly half of the repetitive sequences. To further characterize the distribution and constitution of microsatellites in the P. monodon genome, the 11,114,786 bp of fosmid ends were analyzed by Tandem Repeats Finder. Nearly one-third (32.4%) of the end sequence reads contained microsatellites, and a total of 8,441 microsatellite loci comprising 8.3% of the 11,114,786 bp of fosmid ends were identified. The microsatellite loci are AT-biased, with an A/T content of 61.7%.

Of all microsatellite classes, di- (44.3%) and tri-nucleotide repeats (31.0%) comprise more than 70% of the total microsatellite length. In decreasing order, the 20 most frequently occurring microsatellites are AG, AC, AAT, AT, ATC, AGG, AAG, ACT, AGC, A, AGCC, AACCT, AAC, ACC, AGGG, AAAT, CG, AAAG, ACGG, and ACAT (Table 2, Figure 1). These include all 4 dinucleotide motifs and 8 of the 10 trinucleotide motifs (all except ACG and CCG), and together constitute 85.8% of the microsatellite motifs identified.
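The counts of "4 dinucleotide motifs" and "10 trinucleotide motifs" arise because motifs that differ only by rotation or by strand are treated as one class (for example, TC and AG). A minimal sketch of that grouping is given below, assuming each class is represented by the alphabetically smallest rotation over both strands; the function names are hypothetical, and the convention actually used by the authors' software may differ.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical_motif(motif: str) -> str:
    """Smallest rotation of the motif or of its reverse complement."""
    motif = motif.upper()
    revcomp = motif.translate(COMPLEMENT)[::-1]
    rotations = [s[i:] + s[:i] for s in (motif, revcomp) for i in range(len(s))]
    return min(rotations)

def is_primitive(motif: str) -> bool:
    """False if the motif is itself a repeat of a shorter unit (e.g. AA, ATAT)."""
    n = len(motif)
    return not any(n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n))

def motif_classes(k: int) -> list:
    """All distinct microsatellite classes with unit length k."""
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return sorted({canonical_motif(m) for m in kmers if is_primitive(m)})

print(motif_classes(2))
print(len(motif_classes(3)))
print(canonical_motif("TC"), canonical_motif("TTAGG"))
```

Under this convention TC is reported as AG and the TTAGG unit as AACCT, matching the motif names used in the text.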

Table 2 Characterization of microsatellites in the P. monodon genome^a


In terms of repeat motif length, among all microsatellite classes, dinucleotide repeats have the highest relative frequency and relative abundance (42.4%, 44.3%), followed by trinucleotide (33.0%, 31.0%), tetranucleotide (11.9%, 11.3%), hexanucleotide (7.1%, 6.2%), pentanucleotide (3.7%, 6.3%) and mononucleotide repeats (1.9%, 0.9%).

Among dinucleotide repeats, AG repeats are the most abundant, with a relative frequency of 16.9% and a relative abundance of 21.5%, followed by AC (13.7%, 12.8%) and AT repeats (10.8%, 9.5%). CG repeats are present at low relative frequency (1.1%) and relative abundance (0.4%), as observed in other invertebrates, mammals and plants, probably because of the structural constraints they impose on DNA conformation [8]. Among trinucleotide repeats, AAT repeats, with a relative frequency of 10.8% and a relative abundance of 14.2%, comprised nearly one half (46%) of the total trinucleotide repeat length and were the most abundant in this class, far ahead of ATC repeats, which had the second highest relative frequency (6.5%) and abundance (5.4%).

It is noteworthy that one pentanucleotide repeat, AACCT, was particularly abundant compared to all other penta-, tetra-, or even hexa-nucleotide repeats (Table 2). With a relative frequency of 1.6% and a relative abundance of 4.5%, AACCT repeats were the 12th most frequent microsatellite type. Their mean length per locus (316.1 bp) was the highest among the 20 most frequently occurring microsatellite classes. In particular, ~70% of the AACCT repeats are perfect or nearly perfect repeats. The (AACCT/TTAGG) repeat turns out to be the telomere motif in arthropods, which is only 1 base pair different from the ancestral telomere repeat motif (AACCCT/TTAGGG) found in vertebrates and in all other basal metazoan groups [9].

Microsatellite abundance depends on both the frequency of loci and the sizes of the repeats. Compared with other taxa, microsatellite sequences in the P. monodon genome occur at a higher density (Table 3) than in vertebrates. Approximately one microsatellite was present every 1.32 kb, which is 4.6 times more frequent than the one per 6 kb estimated for humans [10]. The frequency of microsatellite sequences in this species was even higher than that in the Fugu genome, which has the highest microsatellite density (1 per 1.88 kb) known so far [11].

Table 3 Survey of microsatellite distribution and mean lengths in various genomes


As to the sizes of the repeats, the mean length of individual microsatellite loci in P. monodon was unusually long, averaging 109.2 bp, which is roughly 4 times the 25.6 bp in the Fugu genome [12] (Table 3). Of all 8,441 microsatellite loci identified, 84.1% had lengths over 40 bp, and 36.9% had lengths over 100 bp. A total of 135 microsatellite loci had lengths over 500 bp, mostly belonging to AACCT (31 hits; maximal length: 705 bp), AAT (27 hits; 729 bp), AG (22 hits; 763 bp), AC (12 hits; 783 bp), and AAG repeats (8 hits; 720 bp) (Table 2). The longest uninterrupted microsatellite array was a (TC/AG) repeat, spanning 440 bp with 220 repeat units. Very long stretches of microsatellites in a single read, containing up to six microsatellite loci, were commonly observed. The tendency toward long stretches was also evident in the length distribution of the 20 most frequently occurring microsatellite classes (Figure 2). Almost every class, except A repeats, had over one half of its loci with lengths exceeding 40 bp, and most of them had over 20% of their loci with lengths exceeding 100 bp. The high frequency and the long lengths of microsatellites in the P. monodon genome led to a decreased sequencing success rate (down to ~70% from a typical rate of ~90%).
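The density and length figures quoted for P. monodon are mutually consistent, as the following short cross-check suggests. It uses only numbers stated in the text; the variable names are hypothetical, and the 6 kb human spacing is the value cited from [10].

```python
TOTAL_FES_BP = 11_114_786   # total bases surveyed
N_LOCI = 8_441              # microsatellite loci identified
MEAN_LOCUS_BP = 109.2       # mean length per locus
HUMAN_SPACING_KB = 6.0      # one microsatellite per 6 kb in humans [10]
FUGU_MEAN_LOCUS_BP = 25.6   # mean locus length in Fugu [12]

spacing_kb = TOTAL_FES_BP / N_LOCI / 1000          # kb per locus
fold_vs_human = HUMAN_SPACING_KB / spacing_kb      # relative frequency
length_fraction = N_LOCI * MEAN_LOCUS_BP / TOTAL_FES_BP
fold_vs_fugu_length = MEAN_LOCUS_BP / FUGU_MEAN_LOCUS_BP

print(f"one locus per {spacing_kb:.2f} kb")
print(f"{fold_vs_human:.1f} times more frequent than in humans")
print(f"microsatellites cover {100 * length_fraction:.1f}% of the sequence")
print(f"mean locus is {fold_vs_fugu_length:.1f} times longer than in Fugu")
```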

With a few exceptions, the length distribution patterns within each of these microsatellite classes are generally consistent (Figure 2). In particular, AACCT repeats had a notably high proportion of long stretches, with 61.8% of the loci having lengths over 201 bp. In general, microsatellites with a higher GC% tend to be shorter. For example, CG repeats had a very narrow length range of around 42-46 bp (94.5% at L3 type), and 98.9% of ACGG repeat loci had lengths of 21-60 bp (the L2 to L3 types). The only exception is AGCC repeats; despite having the same GC% (75%) as ACGG repeats, most of the loci identified (73.9%) were longer than 120 bp.

A high abundance of microsatellite sequences was also found in the transcribed regions. By examining the amount and distribution of microsatellites in one P. monodon EST dataset (PmTwN), repeat motifs were found in 8.1% of the uniquely expressed sequences, covering 1.12% of the EST lengths (11,161 bp per Mb). In comparison with other taxa that have been surveyed, such as primates (1,515 bp per Mb) and rodents (2,488 bp per Mb) [13], the fraction of microsatellites in the expressed sequences of P. monodon is clearly higher. Of all microsatellite classes present in the expressed sequences, dinucleotide and trinucleotide repeats were predominant. AT-rich microsatellite types, especially poly (AT) and poly (AAT), were the most abundant, consistent with the result obtained by Maneeruttanarungroj et al. [14].

The frequency distribution of microsatellites with AT-rich motifs [e.g., (AT)_n and (AAT)_n] in transcribed regions differed markedly from the genome average (Figure 3), suggesting that different selective and/or mutational pressures operate on coding regions and on other genomic regions. In addition, many EST contigs (Additional file 4a) and a number of known genes (Table 4) contained one or multiple microsatellites with notably long strings of perfect repeats, most of which were dinucleotide repeats. For example, the P. monodon Anti-Virus (PmAV) gene (GB# DQ641258) [15] is known to contain a 280-bp compound imperfect microsatellite repeat [(GT)₄₆] within its 5'-promoter region (Table 4). Moreover, at least some of the long microsatellites located within genes showed copy number variation, as demonstrated in several sets of ESTs apparently transcribed from the same gene (Additional file 4b).

Table 4 Examples of shrimp genes known to contain a very long stretch of microsatellites


Novel repetitive elements

Our data above indicate an apparently lower fraction of transposable elements (less than 1%; Table 1) in the tiger shrimp than the 45% in the human genome [16] and the 16% in the A. gambiae genome [17]. We therefore suspected that a large number of species-specific repeat types could not be detected using the existing repeat databases. To uncover novel repetitive elements in P. monodon, we used the RECON program [18] to perform an all-versus-all BlastN search of the 20,926 repeat-masked FESs. After filtering and collapsing families (see Methods), we identified 103 penaeid repetitive element (PRE) families (Table 5), with a total length of 4,867,916 bp comprising 43.8% of the P. monodon genome.
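The 43.8% figure is the PRE length expressed as a share of the surveyed FES bases, and it agrees (within rounding) with the sum of the four group estimates reported for the PRE categories in this section. The sketch below is a cross-check only; the dictionary and its keys are hypothetical labels for those four groups.

```python
TOTAL_FES_BP = 11_114_786
PRE_TOTAL_BP = 4_867_916

# (number of families, estimated % of genome) for the four PRE groups
# described in this section.
pre_groups = {
    "WSSV-like": (33, 21.6),
    "retrotransposon": (14, 9.8),
    "gene-like": (5, 0.9),
    "unannotated": (51, 11.6),
}

pre_pct = 100 * PRE_TOTAL_BP / TOTAL_FES_BP
n_families = sum(n for n, _ in pre_groups.values())
group_pct_sum = sum(p for _, p in pre_groups.values())

print(f"PREs cover {pre_pct:.1f}% of the surveyed sequence")
print(f"{n_families} families; group percentages sum to {group_pct_sum:.1f}%")
```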

Table 5 Summary of the 103 novel repetitive elements in the P. monodon genome


The 103 PREs can be categorized into four groups according to their similarity to sequences in the public databases (Table 5). The first group, comprising an estimated 21.6% of the P. monodon genome and containing 33 PREs, showed only moderate similarity (19%-55% identity) to a number of white spot syndrome virus (WSSV) genes, suggesting that these WSSV-like sequences are part of the shrimp genome rather than derived from complete virions. WSSV is one of the most deadly viruses that have plagued the shrimp farming industry. These WSSV-like sequences are probably the proviral remnants of ancestral germ-line infections by active WSSVs, which have degenerated to the extent that they have lost their functional potential as a virus. Most of the very long repeats belonged to this group, and some of them contained more than one WSSV-related sequence. The PRE with the longest consensus was FAM31&207 (24.08 kb). This PRE contained not only one wsv343-related segment, but also two other non-overlapping regions similar to the Inhibitor of Apoptosis Protein (IAP) gene and the Innexin-3 gene. The PRE with the highest count was FAM9_15-44 (470 hits in the 20,926 FESs), which contained at least five WSSV-related sequences.

The second group, comprising an estimated 9.8% of the P. monodon genome and containing 14 PREs, showed similarities to transposon-related sequences such as pol, gag, and reverse transcriptase genes. These 14 PREs, which are not well represented in Repbase (a repetitive DNA database), are older or more divergent members of transposable element families and could only be annotated by protein-based RepeatMasker. All of the 14 novel transposons were retrotransposons. Seven of them belong to the non-LTR retrotransposons, also known as LINEs (Long Interspersed Nuclear Elements), representing 3 (RTE, I, and Jockey) of the 15 previously described clades. A further five belong to a unique class of retrotransposons called Penelope, and only two were LTR retrotransposons, both belonging to the gypsy clade. The PRE with the longest consensus was FAM309 (7.61 kb), a Penelope element. The PRE with the highest count was FAM9_1-14 (392 hits in the 20,926 FESs), belonging to the LINE/I clade. The PRE with the highest combined length was FAM185 (280 kb), also belonging to the LINE/I clade. It is noteworthy that a 60-bp sex-linked AFLP marker (E03M60M72.8), previously demonstrated to be a female-associated W allele by linkage analysis and further confirmed on a population scale [19], apparently was derived from a member of FAM185. In addition, FAM498 is notable for carrying a PmAV-like sequence (nucleotide similarity = 95% [1658/1761]; E value: 0), indicating that a PmAV gene/sequence was transposed by a non-LTR retrotransposon. Some previously identified retrotransposons were verified and included in this group. For instance, a retrotransposon (GB# DQ228358) which was originally discovered due to its proximity to an IHHNV-related sequence in the African and Australian P. monodon genomes [20] falls in the repetitive element family FAM9_45-54. In addition, several retrotransposons showing differential expression in response to a range of environmental stressors [21, 22] were shown to be members of 4 PREs, i.e., FAM9_1-14, FAM185, FAM75_17-25,35-36,39-40, and FAM9_45-54.

The third group, comprising 0.9% of the P. monodon genome and containing 5 PREs, matched known genes with a minimum length of 100 bp (equivalent to 34 amino acids) and a minimum identity of 30% (30%-75%). This group might include large gene families with a great number of duplicated genes and pseudogenes, such as the Heat Shock Protein 70 gene (FAM327 and FAM142) and the Inhibitor of Apoptosis Protein gene (FAM31&207 and FAM46).

The fourth group, comprising an estimated 11.6% of the P. monodon genome and containing 51 PREs, did not match any known sequences. Some of them, e.g., FAM165, FAM816, and FAM696, contained minisatellites. Twenty PREs had a consensus shorter than 1.5 kb, four of which had a high GC content (> 60%). The PRE with the longest consensus was FAM198 (7.61 kb). The PRE with the highest count was FAM42 (419 hits in the 20,926 FESs), resulting in an extraordinary total length of ~169 kb despite a repeat unit of only 611 bp.

To determine whether the repetitive element families identified are transcriptionally active, their consensus sequences were searched for similarity in the Penaeus Genome Database (http://sysbio.iis.sinica.edu.tw/page/) [6], which includes over 200,000 Expressed Sequence Tags (ESTs) from four penaeid shrimps. Of the 103 repetitive element families, thirty-six had significant hits to P. monodon ESTs, implying that portions of some members of these families are expressed as transcripts (Table 5, Additional file 5). These 36 transcriptionally active PREs included 14 WSSV-derived PREs (FAM31&207 was categorized into this group), 10 retrotransposons, 3 gene family-like PREs, and 9 unannotated PREs. Moreover, we found evidence indicating that some of the WSSV-like and the reverse transcriptase-like segments are active as transcribed RNAs. In 9 of the 14 transcriptionally active WSSV-derived PREs, transcripts derived from the WSSV-related region were found. In 7 of the 10 transcriptionally active retrotransposon-derived PREs, the reverse transcriptase- or pol-like sequences seem to be expressed. However, among the 3 PREs assumed to be gene families, only one (FAM575) seems to be expressed from exactly its putative protein-coding region.

In summary, the fact that a variety of repeats comprise a significant fraction of the P. monodon genome highlights its highly repetitive nature. If all repetitive elements identified by both RepeatMasker and RECON are included, up to 79.29% (16,592/20,926) of end sequence reads contain interspersed and/or tandem repeats. In terms of length, repetitive sequences comprise 51.18% of the P. monodon genome (Table 6). Retrotransposon-derived sequences, WSSV-related sequences, and unknown/unannotated sequences comprised 10.75%, 21.59%, and 11.57% of the genome in length, respectively. These estimates are conservative because more repetitive elements are expected to be identified if the criteria for defining a repetitive element family are made less stringent. Moreover, sequences homologous to 7 of the 103 PREs (Table 5) were found in one kuruma shrimp BAC clone (Mj024A04), known to be highly repeated in the Marsupenaeus japonicus genome [23]. This suggests that extreme repetitiveness might be a common feature of penaeid genomes.

Table 6 Summary of repetitive sequences in the P. monodon genome


Protein-coding sequences

To identify putative protein-coding sequences and to estimate gene density in the P. monodon genome, the 20,926 fosmid end sequences were subjected to sequence similarity searches using BlastN against the Penaeus Genome Database. Approximately 28.2% (5,910 FESs) of the 20,926 FESs exhibited 80-100% identity (cut-off value: E-10) to 3,983 shrimp ESTs; however, 59.0% (3,471/5,910) of them were found to be derived from 98 PREs and were excluded. Most of the remaining FESs (57.5% = 1,403/2,439) were related to rRNA genes; only 590 (2.8%) FESs might contain protein-coding genes after excluding those derived from mitochondria and transposons. Among them, 399 matched ESTs of P. monodon itself, and 191 matched ESTs of other penaeid species. An alternative approach to identifying protein-coding sequences within the FESs is to perform a BlastX search against the NCBI nr protein database. Approximately 4.8% (994/20,926) of the 20,926 FESs showed sequence similarity to 586 nuclear genes, although most of the genes are unannotated or of unknown function. Notably, two gene homologues, IAPs (41/994) and Innexins (11/994), were present in high copy numbers in the FESs, suggesting that they are present in high abundance in the P. monodon genome. Overall, by combining the BlastX search against the nr database and the BlastN search against the P. monodon EST database, we identified 1,541 (7.4%) FESs containing protein-coding genes. To estimate the gene count in the P. monodon genome more conservatively, we used the result of BlastN against the EST database only. Provided that the average gene size ranges from 7 kb (estimated from one P. monodon fosmid clone; unpublished data) to 10 kb (as in humans), the gene count in the P. monodon genome was estimated to be 10,400-14,880 [= 4.8% × (2.17×10⁶ kb)/gene size], with a gene density of one gene per 145-208 kb. These FESs containing significant hits to coding genes will be important for gene localization and will provide information for defining gene structure.
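The bracketed formula for the gene count can be evaluated directly. The sketch below simply plugs in the values as printed in the text (a 4.8% coding fraction, a 2.17×10⁶ kb genome, and gene sizes of 7 and 10 kb); the function name `estimate_gene_count` is hypothetical.

```python
GENOME_KB = 2.17e6        # haploid genome size in kb
CODING_FRACTION = 0.048   # fraction used in the printed formula

def estimate_gene_count(gene_size_kb: float) -> float:
    """Gene count = coding fraction x genome size / average gene size."""
    return CODING_FRACTION * GENOME_KB / gene_size_kb

# One (count, kb-per-gene) pair for each assumed average gene size.
estimates = {
    size_kb: (estimate_gene_count(size_kb), GENOME_KB / estimate_gene_count(size_kb))
    for size_kb in (10.0, 7.0)
}

for size_kb, (count, kb_per_gene) in estimates.items():
    print(f"gene size {size_kb:.0f} kb: ~{count:,.0f} genes, "
          f"one gene per {kb_per_gene:.0f} kb")
```

Because the density is just the gene size divided by the coding fraction, the range of one gene per 145-208 kb follows directly from the 7-10 kb assumption.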

Discussion

High abundance of microsatellites in the P. monodon genome

This is the first large-scale survey of repeats in the tiger shrimp genome. We found that the P. monodon genome contains a significant proportion (8.3%) of microsatellite sequences, greater than those of other arthropods such as Drosophila species (0.54%) [13] and the silk moth Bombyx mori (0.31%) [24] (Table 3). It is also much higher than the proportion of ~1% in many vertebrates including primates [25], human and rat [10], pig and chicken [26], rabbit [27] and Fugu (1.3%) [11]. The presence of large quantities of microsatellite sequences seems to be a distinct characteristic of penaeid genomes, as also seen in F. chinensis [28] and P. vannamei [29]. The mechanism that determines and maintains the abundance of tandem repeats is not well understood, but apparently reflects the response of the whole genome to overall selective and mutational pressures [30]. It is also plausible that transposable elements contribute to the formation and spread of highly repetitive satellite DNAs by means of unequal crossing over [31].

Abundant microsatellites were also found in the transcribed regions. Similar results were obtained by Maneeruttanarungroj et al. [14], who found that 9.9% of the P. monodon ESTs (997/10,100) contained microsatellites. In addition, by reviewing the literature (Table 4) and by examining the P. monodon EST dataset in the Penaeus Genome Database (Additional file 4a), we found that many shrimp genes/ESTs contain long stretches of microsatellites. As longer repeats generally have higher mutation rates, the abundance and long stretches of microsatellites in transcribed regions are unusual, raising the possibility that they have functional roles. In addition, most of these microsatellites were dinucleotide repeats (Table 4, Additional file 4), implying that they act as regulatory elements within the 5'- or 3'-untranslated regions (UTRs) rather than as coding sequences of genes; otherwise, their copy number variation would result in frame-shift mutations. For example, the PmAV gene, an antiviral gene which is up-regulated upon viral infection, is known to contain a dinucleotide repeat [(GT)₄₆] in the promoter region that acts as a negative regulatory element for PmAV expression [15].

Another example is the prophenoloxidase (proPO) gene in P. vannamei (Table 4). Two forms of proPO gene were found, both having a microsatellite near the 3' end of the open reading frame: proPO-a (GB# EU373096) has a perfect microsatellite [(CT)₂₀] [32], while proPO-b (GB# EF115296) has a compound imperfect microsatellite [(CT)₃₈(CA)₈(AA)(CA)₃(TA)(CA)₁₄] [33]. Their 3' end cDNA sequences following this (CT)_n repeat are different. It has been observed that proPO-b expression was down-regulated in the white shrimps challenged with WSSV, but whether this is in any relation to the (CT)_n repeat remains to be determined.

Microsatellites have been hypothesized to be an important source of quantitative genetic variation and evolutionary adaptation [34–36]. Their high mutation rate suggests that microsatellites can act like adjustable tuning knobs through which specific genes are able to rapidly adjust the norm of reaction in response to minor or major shifts in evolutionary demands [37]. In this study, by examining the EST database, we observed that some microsatellites contained in genes showed copy number variation, probably representing different alleles (Additional file 4b). One example was a C-type lectin-like gene. C-type lectin is known to play an important role in the innate immunity of invertebrates. Intriguingly, this gene, together with 3 other genes known to have a very long stretch of microsatellites (PmAV, proPO, and the Heat shock cognate 70 gene) (Table 4), are all involved in the immune/stress response and possibly undergo frequent regulation of gene expression. This is in agreement with the hypothesis that microsatellites could have a role in adaptive evolution.

Transposable elements in the P. monodon genome

Transposable elements have been shown to occupy a large portion of some eukaryotic genomes, and may have a significant influence on genome evolution [38–40]. They may affect the expression of nearby genes, serve as homologous sites for recombination, and contribute to novel exons [41]. In this study, we identified 14 novel retrotransposons among the 103 PREs. Together with DNA transposons, transposable elements occupy at least 10% of the P. monodon genome. Over one half of the transposable elements in length belong to non-LTR retrotransposons. Five non-LTR retrotransposon clades (CR1, R1, RTE, I, and Jockey) were identified (Table 5). Of them, the I clade was apparently the most represented, contributing 73% (470,424/643,931 bp) of the non-LTR portion of the P. monodon genome. One PRE (FAM185) of the I clade was found to include a sex-linked AFLP marker (E03M60M72.8) [19], suggesting that at least one introgression site of non-LTR retrotransposons exists on the sex chromosome, most likely the W chromosome of the ZW sex determination system.

The R1 clade is a less represented non-LTR retrotransposon in the P. monodon genome. Unlike most other non-LTRs inserting throughout the host genome, however, the R1 clade is known for its distinct target specificity. For example, the R1 clade families RT and R7 have been known to specifically insert in the 28S and 18S ribosomal RNA (rRNA) genes, respectively [42–45]; the Mino elements insert into AC repeats [45]. All of these R1 clade families were found in the P. monodon genome, i.e., the RT (95 hits), R7 (2 hits), and Mino (2 hits) elements. Consistent with target specificity, a significant portion (4.11%) of the P. monodon genome was found to contain highly repetitive short sequences similar to 18S or 28S ribosomal RNA genes, some of which may reflect the remnants of the target-specific retrotransposition of the R1 clade.

Penelope elements are a unique but relatively little studied class of retrotransposons. This type of retrotransposon is known to insert randomly throughout the genome, preferring AT-rich targets [46]. Penelope elements are also known for their patchy distribution in various taxonomic groups, e.g., they are present in only D. virilis and D. willistoni among a dozen sequenced Drosophila genomes, suggesting that they are frequently lost from closely related species [46, 47]. In addition to one Penelope element previously identified, we found 5 further PREs representing Penelope elements, comprising a significant fraction (32.1% = 378.442/1179.814 kb) of the retrotransposon portion of the P. monodon genome.

As mentioned above, five previously established non-LTR retrotransposon clades (CR1, R1, RTE, I, and Jockey) have been identified in the P. monodon genome (Table 5). Among them, four clades (CR1, R1, I, and Jockey) are commonly found in most of the major arthropod lineages, e.g., insects [48], crustaceans, and chelicerates [49], suggesting that they were derived from the common ancestor of arthropods.

WSSV-related sequences and their implication in virus-host coevolution

So far, no viral integration other than that of the infectious hypodermal and hematopoietic necrosis virus (IHHNV) has been reported in the shrimp genome [20]. Our study is the first to demonstrate the prevalence of WSSV-like sequences in the P. monodon genome. Some of the WSSV-related PREs even reach a copy number in excess of 80,000 (= 400/0.45%) elements per genome. WSSV-related sequences have also been found in the genome of another penaeid species, M. japonicus [23]. Interestingly, although a number of shrimp viruses are prevalent in the wild, e.g., monodon baculovirus, hepatopancreatic parvovirus (HPV), and Taura Syndrome Virus (TSV) [50, 51], WSSV seems to be the only virus whose integrated sequences heavily occupy the shrimp genome. Additionally, the WSSV-related sequences accumulated within the P. monodon genome appear to be restricted to only a number of WSSV genes, e.g., wsv514 (putative DNA polymerase III catalytic subunit), wsv447, wsv360 (structure protein, capsid), wsv332 (structure protein), wsv306 (structure protein, tegument), wsv289 (putative serine/threonine protein kinase), wsv209 (structure protein, envelope), and wsv037 (structure protein, capsid). Moreover, segments similar to some of these WSSV genes can be found in 2 or even 3 PREs. These WSSV-like sequences are thought to accumulate continuously within the shrimp genomes, perhaps by reinfection and/or by intracellular transposition.
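The copy-number figure quoted above is a simple extrapolation: hits observed in the sampled FESs are divided by the fraction of the genome that the FESs represent (0.45%). The following Python sketch reproduces that arithmetic. The function name is hypothetical, and the calculation assumes that the FESs are an unbiased, uniform sample of the genome.

```python
def estimate_genome_copies(hits_in_sample: int, sampled_fraction: float) -> float:
    """Extrapolate copies per genome from hits seen in a random genome sample.

    Assumption (not from the paper): the sample is a uniform, unbiased
    draw from the genome, so hits scale linearly with the sampled fraction.
    """
    if not 0 < sampled_fraction <= 1:
        raise ValueError("sampled_fraction must be in (0, 1]")
    return hits_in_sample / sampled_fraction


# 400 hits in FESs covering 0.45% of the genome, as quoted in the text.
copies = estimate_genome_copies(400, 0.0045)
print(round(copies))
```

The result is consistent with the "in excess of 80,000" statement; any cloning or sequencing bias in the library would shift the estimate.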

One highly repeated WSSV-related PRE, FAM31&207, containing segments similar to wsv343 as well as IAP and Innexin 3, is of particular interest. IAPs, with the hallmark of 1-3 copies of a zinc-binding baculoviral IAP repeat (BIR) domain in their 5'-portion, are a conserved group of proteins that regulate apoptosis in both vertebrates and invertebrates [52]. In addition to survival, IAPs are thought to be important regulators of differentiation, innate immune response and cell motility [52]. The IAP-like sequence within FAM31&207 shared 61% identity with the 5'-portion of the P. monodon IAP gene containing three BIR domains. Innexins, originally characterized as the structural proteins of gap junctions in fly and worm, are also members of an evolutionarily conserved large gene family [53]. The Innexin-like sequence within FAM31&207 revealed 40% identity with the Innexin 3 gene of pea aphid (Acyrthosiphon pisum). IAP and WSSV-like sequences were also found in high redundancy in the genome of kuruma shrimp, M. japonicus [23]. Therefore, the hyper-expansion of IAP- and WSSV-like sequences, which might have arisen from segmental duplication events, is likely a common feature of penaeid genomes.

Despite their large quantity, the function of these WSSV-like sequences in the P. monodon genome is unclear. WSSV, as the sole species of a new virus family, Nimaviridae, is a large dsDNA virus (~300 kb) with many unique characteristics in its genome and morphology [54]. It displays a remarkably broad host range among crustaceans, but is highly pathogenic and virulent only in penaeid shrimps [54]. Complete WSSV genome analyses revealed that most of the WSSV-encoded proteins show no homology to known proteins, and the small number of genes with identifiable features (mainly involved in nucleotide metabolism and DNA replication) are more similar to eukaryotic than to viral genes [54]. Whether these WSSV-like sequences are remnants of WSSVs integrating into the host genome, or instead belong to portions of the P. monodon genome subsequently acquired by the virus, remains unknown. These two possibilities may not be mutually exclusive. In the first scenario, the WSSV-like sequences present in the P. monodon genome resulted from WSSV integration. These WSSV-like sequences can exist as junk DNA of no particular consequence, or may affect the fitness of the host. Their multiplicity, which may facilitate nonhomologous recombination, implies that these WSSV-like segments play important roles in genome structure. In addition, some of them were shown to be transcriptionally active, indicating that they might be functional. As mentioned above, a few WSSV genes accumulated many more copies than others in the P. monodon genome. One possible explanation is selection for these specific WSSV genes to provide protection against infection by related exogenous pathogenic WSSV, e.g., by interfering with their replication cycles, as demonstrated for the endogenous retroviruses (ERVs) in vertebrates [55] and for the endogenous rice tungro bacilliform virus (RTBV)-like sequences (ERTBVs) in rice [56].

In the second scenario, the WSSV-like sequences present in the P. monodon genome correspond to original parts of the host genome, which were subsequently gained by WSSV through horizontal transfers. For large DNA viruses that replicate in the nucleus of the host cell, such as herpesvirus and baculovirus, the uptake of cellular genes into the viral genome may be of significant advantage [57]. In certain mammalian dsDNA viruses such as herpesvirus, the viral homologues of cellular genes help the virus escape detection and destruction by the host immune system by imitating the structure and function of host genes [58]. The same might also hold for WSSV. A deeper investigation of the distribution and the fraction of the WSSV-like sequences in the genomes of other crustaceans with different susceptibility to WSSV is clearly needed, and will shed light on the role and evolution of WSSV-related sequences in the P. monodon genome.

Conclusions

The high abundance of simple sequence repeats, novel transposable elements, and WSSV-like sequences illustrates the highly repetitive nature of the P. monodon genome. In particular, the WSSV-like sequences, comprising over 20% of the P. monodon genome in length, highlight how the genomic organization of penaeid shrimps differs from that of other arthropod lineages. Such a highly repetitive nature and the large genome size have posed major obstacles to working with shrimp genomes. The fosmid end sequences, along with the fosmid clone library, have provided the first glimpse into the sequence composition of an unsequenced crustacean genome, and will serve as a valuable resource for future physical mapping, whole genome sequencing and other genomics-related studies.

Methods

Estimation of the genome size of P. monodon

The genome size of P. monodon was measured by flow cytometry of hemocytes. Samples were prepared according to the protocol of Chow et al. [7] with some modifications. Hemolymph was collected from the heart using a syringe containing 1 ml of phosphate-buffered saline-ribonuclease (PBS-RNase) solution (1% NaCl, 0.06% KCl, 0.0146% Na₂HPO₄, 0.004% KH₂PO₄, 1% sodium citrate, 2% sucrose, and 50 μg/ml RNase A). The hemolymph samples (approximately 1.1-1.5 ml) were transferred to Eppendorf tubes, held for 30 min at room temperature, and centrifuged at 600 ×g for 5 min. The pellet was resuspended in 1 ml of PBS-RNase solution, centrifuged, and resuspended in 0.3 ml of PBS. For fixation, 0.7 ml of ice-cold ethanol was gradually added to the cell suspension with gentle shaking. To remove the cytoplasmic membrane, 0.05 ml of 1% (v/v) NP-40 solution was added, and the sample was vortexed 3 times (2 sec per vortex). The sample was examined under a microscope to confirm the release of nuclei and then centrifuged at 600 ×g for 5 min. The pellet was resuspended in 1 ml of ice-cold 70% ethanol and filtered through a 40-μm BD Falcon cell strainer (BD Biosciences) to remove debris and cell aggregates. The nuclei were stained by adding 0.01 ml of 0.1% (w/v) propidium iodide (PI) solution per 1 ml of sample. The fluorescence of 3,000-10,000 nuclei per sample was then determined by flow cytometry (FC 500 System, Beckman Coulter) with an excitation wavelength of 488 nm and an emission wavelength of 615 nm. The DNA distribution curves were analyzed with the WinMDI 2.8 software program (written by Joseph Trotter, Scripps Research Institute). DNA values were calculated by comparison with human lymphocytes as a standard (3.50 pg DNA per nucleus), which were prepared by the same procedure except that biological saline (pH 7.4) was used as the PBS.
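The comparison with a standard described above amounts to scaling the known DNA content of the standard by the ratio of fluorescence peaks. The Python sketch below illustrates one way to do this. The function names and the peak values are hypothetical and are not data from this study, the conversion factor of 978 Mb per pg is the commonly used value rather than one stated in the text, and the sketch ignores any ploidy difference between sample and standard.

```python
STANDARD_PG = 3.50   # pg DNA per nucleus of the human lymphocyte standard (from the text)
MB_PER_PG = 978.0    # commonly used conversion: 1 pg of DNA is about 978 Mb (assumption)


def dna_content_pg(sample_peak: float, standard_peak: float,
                   standard_pg: float = STANDARD_PG) -> float:
    """Scale the standard's DNA content by the ratio of PI fluorescence peaks."""
    if sample_peak <= 0 or standard_peak <= 0:
        raise ValueError("fluorescence peaks must be positive")
    return standard_pg * sample_peak / standard_peak


def pg_to_mb(pg: float) -> float:
    """Convert a DNA mass in picograms to megabase pairs."""
    return pg * MB_PER_PG


# Hypothetical peak positions (arbitrary fluorescence units), for illustration only.
content = dna_content_pg(sample_peak=100.0, standard_peak=200.0)
print(f"{content:.2f} pg = {pg_to_mb(content):.1f} Mb")
```

In practice the peak positions would be read from the DNA distribution curves of sample and standard measured under identical instrument settings.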

Fosmid library construction

A wild female tiger shrimp caught in the coastal waters of Taiwan was used as the DNA source. High-molecular-weight DNA was extracted from the muscle by a standard phenol-chloroform procedure. After this treatment, most of the isolated DNA was blunt-ended and sheared to a size range of 40 to 50 kb. The DNA was end-repaired and ligated into the fosmid vector pCC1FOS according to the manufacturer's protocols (Copy Control™ Fosmid Production Kit; Epicentre Technologies). Fosmid clones were packaged using MaxPlax Lambda Packaging Extract. Packaged fosmid clones were stored at 4°C over chloroform in 1 ml of Phage Dilution Buffer (10 mM Tris-HCl at pH 8.3, 100 mM NaCl, 10 mM MgCl₂). Well-separated colonies were picked and transferred into individual wells of 384-well microtiter plates containing 60 μl/well LB supplemented with 10% glycerol and 12.5 μg/ml of chloramphenicol. The plates were incubated overnight at 37°C and then stored at -80°C.

Size estimation of fosmid clones

To evaluate the average insert size in the library, 111 clones were randomly selected from the fosmid library. Fosmid clone DNA was isolated by a standard alkaline lysis method. The DNA was then completely digested using Not I (New England Biolabs) and subjected to pulsed-field gel electrophoresis (PFGE) (Rotaphor Typ V, Biometra) on a 0.75% agarose gel in 0.3× Loening buffer (0.01 M Tris-HCl, 0.01 M NaH₂PO₄, 1 mM EDTA, pH 7.5). The gel-run parameters were as follows: initial voltage, 130 V; final voltage, 90 V; ramping, logarithmic; initial angle, 130°; final angle, 110°; ramping, linear; switch time, 2 sec; run time, 14 h; temperature, 10°C.
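The average insert size feeds directly into the genome coverage of the library, which is the number of clones multiplied by the mean insert size and divided by the genome size. The Python sketch below shows that calculation using the 288,000 colonies and the 2.17 Gb genome reported in the abstract. The mean insert size of 40 kb is an assumed value taken from the lower end of the 40 to 50 kb shearing range, not the measured PFGE result, and the function name is hypothetical.

```python
def library_coverage(n_clones: int, mean_insert_bp: float, genome_size_bp: float) -> float:
    """Fold coverage of a clone library: total cloned bases / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return n_clones * mean_insert_bp / genome_size_bp


coverage = library_coverage(
    n_clones=288_000,        # colonies in the library (from the abstract)
    mean_insert_bp=40_000,   # ASSUMED mean insert size, not the measured value
    genome_size_bp=2.17e9,   # genome size (from the abstract)
)
print(f"{coverage:.1f}-fold coverage")
```

Under this assumed insert size the result agrees with the 5.3-fold coverage stated in the abstract.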

Fosmid end sequencing

Fosmid DNA was isolated using the Montage Plasmid MiniprepHTS kit (Millipore) according to the manufacturer's guidelines, and sequenced from both ends with ABI BigDye Terminator v3.1 (Applied Biosystems) on an ABI 3730xl DNA sequencer (Applied Biosystems). The forward sequencing primer sequence was 5'-GGATGTGCTGCAAGGCGATTAAGTTGG-3', and the reverse sequencing primer sequence was 5'-CTCGTATGTTGTGTGGAATTGTGAGC-3'. Base calling of chromatograms and trimming of fosmid-end sequences (FESs) were performed with PHRED software [59, 60]. Vector sequence was masked with CROSS_MATCH (http://www.genome.washington.edu) and trimmed. Reads shorter than 50 bp or with a phred score below 20 were eliminated from our internal end-sequence database.
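The length and quality criteria above can be illustrated with a short Python sketch. This is only one possible reading of the filter and is not PHRED's own trimming algorithm: it trims bases with a quality below 20 from both ends of a read and then discards reads left shorter than 50 bp. The function name and the example reads are hypothetical.

```python
from typing import List, Optional, Tuple

MIN_LENGTH = 50  # bp threshold stated in the text
MIN_PHRED = 20   # quality threshold stated in the text


def trim_and_filter(seq: str, quals: List[int]) -> Optional[Tuple[str, List[int]]]:
    """Trim low-quality ends, then drop reads that end up too short.

    Simplified stand-in for quality trimming (an assumption, not the
    authors' exact procedure). Returns None when the read is rejected.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists must have equal length")
    start, end = 0, len(seq)
    # Advance past leading bases below the quality threshold.
    while start < end and quals[start] < MIN_PHRED:
        start += 1
    # Step back over trailing bases below the quality threshold.
    while end > start and quals[end - 1] < MIN_PHRED:
        end -= 1
    if end - start < MIN_LENGTH:
        return None
    return seq[start:end], quals[start:end]


# Hypothetical reads: 52 good bases survive, 45 good bases do not.
kept = trim_and_filter("A" * 60, [10] * 5 + [30] * 52 + [10] * 3)
dropped = trim_and_filter("A" * 60, [10] * 10 + [30] * 45 + [10] * 5)
print(len(kept[0]) if kept else None, dropped)
```

A production pipeline would normally rely on the trimming built into the base caller or a dedicated tool rather than a hand-written filter.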