
XenDB: Full length cDNA prediction and cross species mapping in Xenopus laevis

Abstract

Background

Research using the model system Xenopus laevis has provided critical insights into the mechanisms of early vertebrate development and cell biology. Large scale sequencing efforts have provided an increasingly important resource for researchers. To take full advantage of the available sequence, we have analyzed 350,468 Xenopus laevis Expressed Sequence Tags (ESTs) both to identify full length protein encoding sequences and to develop a unique database system to support comparative approaches between X. laevis and other model systems.

Description

Using a suffix array based clustering approach, we have identified 25,971 clusters and 40,877 singleton sequences. Generation of a consensus sequence for each cluster resulted in 31,353 tentative contig and 4,801 singleton sequences. Using both BLASTX and FASTY comparison to five model organisms and the NR protein database, more than 15,000 sequences are predicted to encode full length proteins and these have been matched to publicly available IMAGE clones when available. Each sequence has been compared to the KOG database and ~67% of the sequences have been assigned a putative functional category. Based on sequence homology to mouse and human, putative GO annotations have been determined.

Conclusion

The results of the analysis have been stored in a publicly available database XenDB http://bibiserv.techfak.uni-bielefeld.de/xendb/. A unique capability of the database is the ability to batch upload cross species queries to identify potential Xenopus homologues and their associated full length clones. Examples are provided including mapping of microarray results and application of 'in silico' analysis. The ability to quickly translate the results of various species into 'Xenopus-centric' information should greatly enhance comparative embryological approaches.

Supplementary material can be found at http://bibiserv.techfak.uni-bielefeld.de/xendb/.

Background

Following the publication of the first automated cDNA sequencing study in 1991 demonstrating the utility of large scale random clone cDNA sequencing approaches [1], there has been a rapid and accelerating growth of such Expressed Sequence Tags (ESTs). The initial study of 600 partial human sequences has grown to more than 20.0 × 10^6 while more than 30 organisms have more than 100,000 sequences. To make sense of the resulting sequence, a variety of bioinformatic approaches have been developed to identify protein coding sequences and domains [2–4] and generate 'unigene' sets based on agglomerative clustering methods [5, 6]. Clustering EST sequences is a widely used method for analyzing the transcriptome of a genome. Especially for organisms whose genome is not (yet) sequenced, the EST data is a valuable source of information. While enormously useful, most current analysis tools result in the loss of significant biological information such as alternatively spliced transcripts and polymorphisms [7–18]. Alternative splicing in particular plays important roles during both development and in the mature organism [7–15]. Moreover, most EST based approaches appear to overestimate the number of unique sequences compared to gene predictions based on whole genome sequencing efforts [19–22].

There are different approaches for EST clustering; the most commonly used being (1) each cluster represents a distinct gene, alternative transcripts of the same gene are grouped together into the same cluster. UniGene is one approach that uses this gene-based strategy [23–27]. (2) Alternative transcripts are represented by distinct clusters. Using genome assembly tools like CAP3 [28] or Phrap [29, 30] results in such a clustering, as these tools cannot (and are not designed to) handle the kinds of differences in the EST sequences. (3) STACK [6] groups ESTs based on their tissue source first, and clusters are then generated for each tissue separately. Our approach first generates gene-oriented clusters and then attempts to generate separate contigs which potentially correspond to alternative transcripts.

The underlying principle for each of these approaches is a pairwise comparison of all sequences to identify common subsequences of a given length and identity that is subsequently used to group sequences into clusters. The types of pairwise comparisons result in a runtime that is quadratic in the number of sequences to be compared. To achieve better running times, most tools try to identify promising pairs of sequences by applying word-based algorithms, which consider the frequency of common words in each pair of sequences [31]. In any case these approaches have to compare all possible pairs of sequences, resulting in a running time that grows quadratically with the number of sequences. We have implemented a pipeline for rapid processing and clustering of EST data, based on enhanced suffix arrays [32–34]. Compared to other methods it reduces the running time tremendously. While we focus on generating gene-based clusters, we also assembled each cluster separately using CAP3 to generate consensus sequences for further analyses. Liang et al. evaluated Phrap, CAP3, TA-EST and TIGR Assembler and found in their analysis that CAP3 consistently out-performed the other programs [35]. We therefore chose CAP3 for cluster assembly.

All sequence and clustering information obtained with our approach was stored in a relational database system. To allow for extensive queries, GenBank annotations were incorporated including the library source, tissue type, cell type and developmental stage. Results of all sequence analyses performed on the consensus sequences were stored in the database. This way, comparative queries could be answered to identify e.g. full length clones, sequences unique to X. laevis, or shared between Xenopus and another organism. The comparative query also allows the identification of the set of Xenopus sequences most related to a set from another organism. Thus, the XenDB database is designed to address a critical issue facing many researchers: the comparison of genomic studies in one organism and their application to studies in another model organism. This task is faced by many laboratories attempting to extract the information gained in human, mouse, fly and worm microarray and library sequencing studies which often consist of large tables of genes.

While other databases such as UniGene [36] or TIGR Gene Indices [37] also provide collections of clustered ESTs, the unique batch functionality of mapping results from other organisms to Xenopus laevis and retrieving their potential full length clones was not available before. Moreover, our implementation is specifically designed and focused on relating Xenopus sequence data to the major model organisms. Thus, one can search for the Xenopus homologue directly using the human or mouse protein.

Construction and Content

Sequence sources and cleanup

A total of 350,468 sequences were downloaded from GenBank release 138 and stored in a relational database using the open source ORDBMS PostgreSQL. The following divisions were included: Vertebrate Sequences (VRT, 5,506 sequences), EST (344,747 sequences) and High Throughput cDNA (HTC, 215 sequences). 228,496 sequences were annotated as 5' ESTs and 116,122 as 3' ESTs. 245,415 different cDNA clones were represented in the data set, out of which 92,463 had both 5' and 3' sequences. Entries annotated as being genomic sequences were excluded from the analysis. To enhance the usability and search capabilities of the database, complete GenBank entries were incorporated. Annotations including but not limited to library source, tissue type, cell type and developmental stage were extracted directly from GenBank entries (feature: source, qualifiers: clone_lib, tissue_type, cell_type and dev_stage). Unfortunately, the sequences are not very well annotated in GenBank: 34% of the sequences do not have a tissue type assigned and 36% have no developmental stage information. Distributions of tissue types, developmental stages and clone libraries are shown in supplemental files [see additional files 2, 3 and 4 respectively].

197,888 ESTs (57.4% of the EST sequences) had information about high quality start or end of sequencing reads. This information was used to trim sequences according to high quality regions to ensure best sequence quality. Vector sequence was downloaded from GenBank and VectorDB [38] and the sequence masked using the program Vmatch [39] developed by Stefan Kurtz. Vmatch is based on a novel sequence index (enhanced suffix arrays, [32–34]), allowing for the rapid identification of similarities in large sequence sets. ESTs were trimmed to eliminate vector sequence located at either the 5' or 3' end (6,678 ESTs, 1.9% of total sequence set). In some cases, additional non-vector sequence preceded or followed known vector sequence. If such non-vector sequence was less than 20 bases long, it was trimmed from the EST together with the vector sequence. ESTs that had vector sequences left after trimming were discarded completely. Repetitive elements were obtained from Repbase [40] and GenBank and masked using RepeatMasker [41]. In addition, if hits against ribosomal RNA and mitochondrial sequences were found in the downloaded sequence set, the corresponding sequences were removed. The availability of complete mitochondrial genomic and ribosomal sequences makes the inclusion of these sequences unnecessary while masking was performed to minimize possible clustering errors arising from these common sequences. Sequences that had less than 100 consecutive bases left after cleanup were discarded completely (21,039 sequences, 6.0%). The resulting sequence set consisted of 317,242 sequences (90.5%) with an average length of 536 bases (see Table 1).
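As an illustration only, the final length filter of the cleanup step can be sketched in Python. The 100-base threshold is taken from the text; the masking characters (N and X) and the function names are assumptions made for this sketch and are not taken from the pipeline itself.

```python
import re

MIN_CLEAN_RUN = 100  # threshold stated in the text


def longest_unmasked_run(seq, mask_chars="NX"):
    """Length of the longest stretch of bases containing no masked position.

    Assumption: masked positions are written as 'N' or 'X'.
    """
    runs = re.split("[" + re.escape(mask_chars) + "]+", seq.upper())
    return max((len(run) for run in runs), default=0)


def keep_sequence(seq, min_run=MIN_CLEAN_RUN):
    """True if at least `min_run` consecutive unmasked bases remain."""
    return longest_unmasked_run(seq) >= min_run
```

A sequence with two 99-base stretches separated by a masked repeat would be discarded under this rule, because neither stretch reaches 100 consecutive bases.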

Table 1 Summary of Xenopus EST cleanup and clustering.

Clustering and assembly of tentative contig sequences

The cleaned X. laevis EST sequence set was grouped into gene specific clusters using Vmatch. Vmatch preprocesses the EST sequences into an index structure: an enhanced suffix array. This data structure has been shown to be as powerful as suffix trees, with the advantage of a reduced space requirement and reduced processing time. Furthermore, enhanced suffix arrays have been shown to be superior to other matching tools for a variety of applications [33, 42, 43]. For a detailed introduction of enhanced suffix arrays see Abouelhoda et al. [34]. Briefly, the index efficiently represents all substrings of the sequences and allows the solution of matching tasks in time independent of the size of the index (unlike BLAST). Vmatch was chosen for the following reasons: (1) There was no clustering tool available which could handle large data sets efficiently, and which was documented well enough to allow a detailed replication and evaluation of existing clusters. (2) Vmatch identifies similarities between sequences rapidly, and it provides additional options to cluster a set of sequences based on these matches. Furthermore, the Vmatch output provides information about how the clusters were derived. Due to the efficiency of Vmatch, we were able to perform the clustering for a wide variety of parameters on the complete sequence set (see below). This allowed us to study the effect of the parameter choice on the clustering. Moreover, in the future, the efficiency will allow us to more frequently update the data set. A longer term goal of the project is to generate a data set that maintains the different alleles in this pseudotetraploid animal as separate entries. The clustering approach has been integrated into an analysis pipeline which can be applied to other organisms that often receive less attention from the bioinformatics community.

The database sequences were clustered according to the matches found in a self comparison of the index. Initially each database sequence is put into its own cluster. Then all pairs of matches are generated and each pair is evaluated to possibly form single linkage clusters. To identify matching sequences, Vmatch first computes all maximal exact matches of a given minimal length (seeds) between all sequences. These seeds are extended in both directions allowing for matches, mismatches, insertions, and deletions using the X-Drop alignment strategy as described previously. This greedy alignment strategy was developed for comparing highly similar DNA sequences that differ only by sequencing errors, or by equivalent errors from other sources [44].
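The single linkage step can be illustrated with a small union-find sketch. This is not the Vmatch implementation: it assumes the pairwise matches have already been computed and are supplied as hypothetical tuples (id_a, id_b, match_length, percent_identity). The default thresholds mirror the values reported for the final data set (150 nucleotides, 98% identity).

```python
def single_linkage_clusters(seq_ids, matches, min_len=150, min_identity=98.0):
    """Group sequences into single linkage clusters.

    Every sequence starts in its own cluster; any accepted pairwise
    match merges the two clusters that contain its sequences.
    """
    parent = {s: s for s in seq_ids}

    def find(x):
        # Follow parent links to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, length, identity in matches:
        if length >= min_len and identity >= min_identity:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for s in seq_ids:
        clusters.setdefault(find(s), []).append(s)
    return list(clusters.values())
```

Because a single accepted match is enough to join two clusters, one chimeric or repeat-containing sequence can chain unrelated sequences together, which is why such sequences have a disproportionate effect on single linkage clustering.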

In an attempt to objectively define appropriate clustering criteria, we took advantage of the speed of the Vmatch clustering approach to systematically vary the relevant parameters (overlap length, % identity, seedlength and X-drop value). It was hypothesized that the 'correct' parameters would be revealed as an abrupt change in the curve on the resulting graph. An example of such an analysis showing the effect of varying the overlap length and % identity is presented in supplemental materials [see additional file 1]. Here a number of conclusions become apparent. First, at this level of resolution (~30 independent clusterings), a distinct point indicating the 'correct' parameter does not become readily apparent. Second, the collapse of the cluster set to few clusters containing ever larger numbers of individual sequences serves as a reminder that all sequences (regardless of species) can be considered part of a single cluster. Finally, as the length overlap decreased, we observed the formation of 'superclusters' containing >10,000 sequences clearly derived from multiple gene families. This problem of 'superclusters' diminished at an overlap length of ~135 (data not shown, and not apparent in additional file 1). These clusters appear to be due to the presence of undefined repetitive elements, chimeric sequences and possibly transposed elements. Studies on the nature of the clustered sequences and the effects of parameter variation are ongoing.

For the current data set, we tried to select parameters which mimic the parameters that were probably used for generating the UniGene clusters. Unfortunately, the algorithm used for constructing the UniGene clusters is not sufficiently documented to allow complete reproduction. We selected parameters designed to produce a stringent clustering of the available sequences. For the described data set, sequences were clustered when a pairwise match of at least 150 nucleotides and 98% identity was found (seedlength = 33, X-Drop = 3). The construction of the enhanced suffix array took 33 minutes on a SUN UltraSparc III (900 MHz) CPU. Clustering took another 17 minutes. This resulted in 25,971 clusters containing 276,365 sequences (87.11% of the input set) and 40,877 singletons (12.89%). The average cluster size was 10.6 (std. dev 51.8) sequences. The distribution of cluster sizes is shown in Table 1. 22,834 clusters were composed of ESTs only, 61 clusters of mRNA sequences (VRT and HTC divisions) only and 3,076 clusters of both mRNAs and ESTs. Among the singletons are 4262 sequences which contain less than 150 nt (after sequence cleanup described above) and would therefore be incapable of being joined in a cluster. Less than 25% of these sequences have a significant match against NR database and less than 2% of the sequences match full length cDNA criteria described below.

Next, a consensus sequence was generated for each cluster using CAP3 [28]. The aim of this approach was to both refine the number of clusters and to improve the overall sequence quality. This latter aim simplifies the design of oligonucleotide probes. The 25,971 clusters produced 31,353 tentative contig (TC) sequences (avg. length: 1,045 bp, std. dev: 729 bp) and 4,801 singlets (avg. length: 664 bp, std. dev: 424 bp). The longest TC was 13,130 bp (DNA-dependent protein kinase catalytic subunit, accession: [Genbank:AB016434]), while the smallest TC was 154 bases long. Here, it became obvious that CAP3 is a genome assembly program not designed to assemble EST clusters containing potential splice variants: CAP3 assembly subsequently split a fraction of the clusters into separate contigs and singletons. On average, a cluster was split into 1.2 (std. dev 3.0) TCs and 1.8 (std. dev 11.3) singlets by CAP3. As illustrated in Table 1, the average length of the sequences increased from 536 bp (average for input ESTs) to 1,045 bp (average for CAP3 contig sequences) which was lower than the average length for previously characterized Xenopus full length sequences (sequences selected as full length by XGC had an average length of 2,115 bp).

There are many genes whose transcript is significantly longer than 2× the current state of the art sequencing run of ~1000 bp. This means that 5' and 3' sequences derived from a >2 kb transcript cannot be joined without sequence from incomplete cDNA clones which provide a source of nested deletions. Sequences from both ends can be linked by annotation, and this has been done by a variety of clustering approaches including NCBI UniGene which uses a double linkage rule. Non-overlapping 5' and 3' ESTs are assigned to the same cluster if clone IDs are found that link at least two 5' ends from one cluster with at least two 3' ends from another cluster and the two clusters are merged. We have examined the effect of double linkage joining using the clone annotation. In this analysis, 17,588 clusters were stable and the total number of clusters was reduced from 25,971 to 21,249. Most of the joined clusters (3,122) were created from two clusters while three clusters were combined 456 times. While the number of clusters is decreased by this joining, our overall analysis is not affected. Potential full length clones selected as part of the P5P group (see below) are also unaffected by annotation linkage. We provide the identity of clusters 'linked by annotation' as part of the XenDB output.
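The double linkage rule can be sketched as follows. This is one reading of the rule as described above, not the UniGene or XenDB code: two clusters are proposed for joining when at least two distinct clones have a 5' EST in the first cluster and a 3' EST in the second. The record layout (clone_id, end, cluster_id), with end given as "5" or "3", is a hypothetical input format chosen for the example.

```python
from collections import defaultdict


def double_linkage_pairs(est_records, min_links=2):
    """Return (5' cluster, 3' cluster) pairs supported by >= min_links clones."""
    five = defaultdict(set)   # clone_id -> clusters holding its 5' ESTs
    three = defaultdict(set)  # clone_id -> clusters holding its 3' ESTs
    for clone_id, end, cluster_id in est_records:
        (five if end == "5" else three)[clone_id].add(cluster_id)

    links = defaultdict(set)  # (5' cluster, 3' cluster) -> supporting clones
    for clone_id in five.keys() & three.keys():
        for c5 in five[clone_id]:
            for c3 in three[clone_id]:
                if c5 != c3:
                    links[(c5, c3)].add(clone_id)

    return {pair for pair, clones in links.items() if len(clones) >= min_links}
```

Requiring two independent clones rather than one guards against a single mislabelled or chimeric clone joining two unrelated clusters.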

Sequence analysis

We have performed a variety of sequence comparisons at the protein level including translation analysis. The sequences of cluster TCs and all singletons were subject to extensive BLASTX [45] and FASTY [46] homology searches vs. the non-redundant protein database (NR) from NCBI and the proteomes of five major model organisms using the high throughput analysis pipeline of the Genlight system [47]. Proteome sets for H. sapiens, M. musculus and R. norvegicus were obtained from the International Protein Index [48, 49]. The IPI provides a top-level guide to the main databases: Swiss-Prot, TrEMBL, RefSeq and Ensembl. It curates minimally redundant yet maximally complete sets of the indexed organisms. C. elegans and D. melanogaster protein sequences were retrieved from the UniProt database [50]. UniProt proteome sets are solely derived from Swiss-Prot and TrEMBL entries. Additionally, all available protein sequences for X. laevis and X. tropicalis were extracted from GenBank. Additional file 5 provides an overview of the downloaded data sets. Performing separate comparisons allows a search for matching sequences based on the identity of any gene known from each species as well as queries for genes which have matches in some but not all databases. We believe that this will aid in the discovery and analysis of conserved and unique genes. In addition to these databases, we have included BLASTX searches in the KOG database and have used the results to functionally classify the Xenopus sequences. All sequences resulting from the clustering and assembly processes were compared to these protein sets using BLASTX with an E-value cutoff of 1.0e-6. ESTs are often of low sequence quality, and sequencing errors can still exist in the assembled TC sequences. Therefore, all analyses against the protein databases were also done using FASTY (E-value cutoff: 1.0e-6), a version of FASTA that compares a DNA sequence to a protein sequence database, translates the DNA sequence in three forward (or reverse) frames and allows (in contrast to BLASTX) for frame shifts, maximizing the length of the resulting alignments.

Identification of chimeric sequences

A significant issue in EST clustering methods is the presence of chimeric sequence which inappropriately joins unrelated genes into a single cluster. While the number of chimeric sequences is estimated at less than 1% [51, 52], their presence has disproportionate effects on the clustering outcome. To identify potential chimeric sequences, we analyzed the FASTY hits in the protein NR database and applied the following simple procedure: Matches of at least 100 bp in length were mapped back to the TC sequences to identify the regions that are covered by a match. If two matches overlap, the region will be extended accordingly. If after the mapping two clearly separated regions remain, the TC is flagged as potential chimera (see Figure 3).
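The mapping procedure amounts to merging overlapping intervals and checking whether more than one region remains. A minimal sketch follows, assuming matches are given as 1-based inclusive (start, end) coordinates on the TC. The text does not quantify 'clearly separated', so the `min_gap` parameter is a hypothetical stand-in for that judgement.

```python
def covered_regions(matches, min_len=100):
    """Merge overlapping match intervals that are at least `min_len` bp long."""
    spans = sorted((s, e) for s, e in matches if e - s + 1 >= min_len)
    regions = []
    for s, e in spans:
        if regions and s <= regions[-1][1]:
            # Overlaps the previous region: extend it.
            regions[-1][1] = max(regions[-1][1], e)
        else:
            regions.append([s, e])
    return [tuple(r) for r in regions]


def is_potential_chimera(matches, min_len=100, min_gap=1):
    """Flag a TC when two covered regions are separated by >= min_gap bases."""
    regions = covered_regions(matches, min_len)
    return any(b[0] - a[1] - 1 >= min_gap
               for a, b in zip(regions, regions[1:]))
```

For example, matches at 1-300, 250-600 and 900-1300 collapse into two regions (1-600 and 900-1300), so the TC would be flagged, whereas the first two matches alone form a single region and would not.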

Figure 3

Identification of chimeric TCs: Matches of at least 100 bp in length were mapped back to the TC sequences to identify the regions that are covered by a match (yellow boxes). If two matches overlap, the region will be extended accordingly. If after the mapping two clearly separated regions remain as shown here, the TC is flagged as potential chimera.

Examination of the identified chimeric sequences reveals three major classes. In the first, two distinct FASTY hits can be identified which do not overlap and are in opposite orientation. In the second, the second identified FASTY hit matches retroviral or transposable element related sequences. This suggests the possibility that these may reflect real transcripts in which a mobile element has been inserted into the genome. A close evaluation of such sequences may provide some insights into the evolutionary history of various populations of Xenopus. The final class of potential chimeric sequences identified contains short predicted or hypothetical proteins. This class may in fact not be chimeric at all but may reflect errors in protein coding prediction methods.

The described procedure identified 113 potential chimeric TCs (0.3% of the 33,034 sequences with matches against the protein NR database), which are flagged in the database as such. We do not eliminate these potential chimeras, as they don't significantly affect the results of the sequence analyses done later on, which are mainly based on the best hit only. In fact, the analysis underestimates the number of full length sequences, as some chimeras cover two full length protein matches. A complete identification of chimeric sequences is practically impossible without a comparison to the underlying genome sequence. And even then, polycistronic transcripts which may exist cannot be separated from chimeras perfectly [53].

Definitions

In the subsequent analyses we were interested in three kinds of information: (1) Full Length Orf containing COntigs (FLOCOs), (2) Full Length Insert containing CLones (FLICLs), and (3) Predicted 5' (P5P) sequences. The result of the clustering and CAP3 analysis generates a set of tentative contig sequences (TC). FLOCOs are defined as TC sequences that have an (almost) full length hit against a known protein. These sequences are especially useful for gene identification. Full length insert containing clones, FLICLs, were predicted. Such clones are distinguished by sequence homologies corresponding to the amino terminal part of a protein but are not restricted at the carboxy-terminus. These sequences are derived from clones which are predicted to carry a full length insert (see below), though the full length sequence has not been determined, usually because of single pass EST sequencing from the 5' end. Finally, we identified sequences that we call P5P for which sequence similarity did not extend through the amino-terminal end of the protein but whose length was sufficient to include a full length coding sequence of a similarly sized protein.

Identification of Full Length Orf containing COntigs (FLOCOs)

We were especially interested in full length hits of the TC sequences vs. known proteins. For this purpose, BLASTX and FASTY hits were categorized into four classes, representing the quality of the full length matches (see Figure 1): (1) Matches covering 100% of the sequence of a known protein. Additionally, the matched protein sequence has to begin with the conserved methionine and has to end at a conserved STOP codon. (2) Matches covering 100% of the sequence of a known protein. Additionally, the matched protein sequence has to include the initial methionine. (3) Matches capable of covering 100% of the matched protein sequence with no additional constraints. (4) Matches that cover the protein over almost its full length, allowing the match to start or end at most ten amino acids after/before the start or end of the protein.
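The four categories can be expressed as a small decision function. The sketch below is an illustration under assumed inputs: the `Hit` record and its field names are hypothetical, with subject coordinates given in amino acids and two flags recording whether the alignment starts at an ATG/Met pair and ends at a STOP codon.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    subj_start: int        # first aligned residue of the protein (1-based)
    subj_end: int          # last aligned residue of the protein
    subj_len: int          # protein length in amino acids
    starts_with_met: bool  # ATG in the TC aligned to the initial M
    ends_at_stop: bool     # alignment ends at a STOP codon in the TC


def full_length_class(hit: Hit, slack: int = 10) -> Optional[int]:
    """Return the most stringent class (1-4) the hit satisfies, else None."""
    full = hit.subj_start == 1 and hit.subj_end == hit.subj_len
    if full and hit.starts_with_met and hit.ends_at_stop:
        return 1
    if full and hit.starts_with_met:
        return 2
    if full:
        return 3
    if hit.subj_start <= 1 + slack and hit.subj_end >= hit.subj_len - slack:
        return 4
    return None
```

Returning only the most stringent class keeps the function simple; cumulative counts, in which lower quality categories include the sequences of the more stringent ones, can be obtained by counting every hit whose class number is less than or equal to the category of interest.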

Figure 1

Full length clone selection (top) and TC categories (bottom). ESTs derived from different clones were clustered and assembled. The CAP3 contig was compared to protein databases using BLASTX and FASTY and hits categorized in 4 categories. Class 1 hits had to match the whole protein sequence and start with an ATG in the TC and M in the protein and the hit had to end at a STOP codon. Class 2 hits had to match the whole protein sequence, start with an ATG in the TC and M in the protein. Class 3 had to match the full protein sequence (without further restrictions), class 4 had to cover the protein over almost its full length, allowing the match to start or end at most ten amino acids after/before the start or end of the protein. Predicted 5' TCs (P5P) had to have enough sequence to fill up the missing 5' end of the protein sequence. Clone selection: Clone A and B were discarded because of missing IMAGE id. Clone 54321 does not span 5' end of protein match. Clone 21345 was selected as most 5' clone fulfilling the requirements.

Table 2 shows the number of identified FLOCOs using BLASTX. 3,942 TCs were Class 1 hits in the non-redundant protein database. As the stringency of the full length definition was relaxed, the number of TCs characterized as full length increases to 5,050 (Class 2), 7,792 (Class 3) and 12,389 (Class 4) TCs respectively. As EST sequences have many sequencing errors, and even the assembly of clusters can not correct all of these, FASTY comparisons were done for the same data set (Table 3). This way, the length of the resulting alignments could be maximized. A comparison of Table 2 and Table 3 shows the effect of frame shift corrections obtained by FASTY. The number of TCs having Class 1 hits could be increased to 5,139 while the less stringent categories increased similarly by an average of 20%. The effect of frameshift correction can clearly be seen in Figure 2. Table 4 and Table 5 show the average lengths of TCs for each of the four categories. Here, the average length of the TCs is 2,210 bp for Class 1 TCs having FASTY matches against X. laevis, corresponding very well to already known Xenopus proteins. Overall, the average length decreases with lower quality categories as expected, especially for Class 4, where the alignment can miss 20 amino acids on both ends of the matching protein. The only exceptions are Drosophila and C. elegans, where the average length increases for Class 4 sequences.

Table 2 Number of X. laevis TCs with full length BLASTX hits in the non-redundant protein database (NCBI), five model organisms, and available X. laevis and X. tropicalis proteins, determined by BLASTX. Lower quality categories include sequences from higher, more stringent categories.
Table 3 Number of X. laevis TCs with full length FASTY hits in the non-redundant protein database (NCBI), five model organisms, and available X. laevis and X. tropicalis proteins, determined by FASTY. Lower quality categories include sequences from higher, more stringent categories.
Figure 2

Comparison of a BLASTX alignment with corresponding full length FASTY alignment, as generated by the Genlight system. Blue boxes in (a) indicate open reading frames, green boxes start and red boxes stop codons, respectively. The assembled TC sequence has a frameshift at position 1150 from frame 1 to 3, generating two distinct HSPs in the BLASTX alignment (b). FASTY clearly corrects this frameshift and generates a full length alignment (c).

Table 4 Average length of X. laevis TCs for different BLASTX full length TC categories.
Table 5 Average length of X. laevis TCs for different FASTY full length TC categories.

Comparing the numbers of full length sequences in Table 2 and Table 3, the matches in human, mouse, rat and X. laevis are in general agreement (2,619 full length sequences for Class 1 on average). What is striking is the deviation of both the number of full length TCs and the average length of TCs having matches against Drosophila and C. elegans: only 268 and 190 full length sequences with average lengths of 1,659 and 1,575 bp for Drosophila and C. elegans in Class 1, respectively. Only within the Class 4 category are there 2,249 and 1,918 TCs with average lengths of 1,611 bp and 1,563 bp, respectively. A possible explanation for this difference is the divergence of the vertebrate species from these invertebrate model systems.

Selection of putative Full Length Insert containing CLones (FLICLs)

Often, biologists are interested in identifying a full length clone for further study and this desire has been met by the establishment of a number of the Gene Collections (the Mammalian Gene Collection [54], the Xenopus Gene Collection [55] and the Zebrafish Gene Collection [56]). We have extended our analysis described above to select potential full length insert containing clones (FLICLs) that are available through the IMAGE consortium and provide a simple yet powerful search tool to rapidly match homologous genes of interest to their Xenopus counterparts. The Gene Collections are an NIH initiative that supports the production of cDNA libraries, clones and 5'/3' sequences to provide a set of full-length (ORF) sequences and cDNA clones of expressed genes for a variety of model systems.

Since the average length of the characterized full length vertebrate protein is 1,400 bases and the average sequence length of a TC is 1,045 bases, many sequences which are full length will not be detected by the previous approach and will contain sequence gaps of approximately 350 bases. To identify additional clones that potentially carry a full length insert, we queried the database for sequence matches which were sufficiently long to include the start methionine but which did not have sufficient homology to be detected by the previous methods. Thus, a sequence with a query start position (Startq) which is greater than the subject start site (Starts) is potentially a full length open reading frame (hereafter referred to as P5P, predicted 5 prime). Clearly, the value of such a prediction decreases as the value of Startq increases and the predictive value increases with lower values of Starts. Full length clones predicted by this method are subject to 3' truncations due to mispriming in poly(A) rich regions rather than at the polyA tail. Such regions would be characterized by the presence of the amino acid lysine (codons AAA, AAG) or asparagine (codons AAU, AAC).
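The P5P criterion can be written as a one-line test. The first function below implements the comparison exactly as stated in the text (Startq greater than Starts). The second is an assumed, stricter variant that is not described in the source: it treats Startq as a nucleotide position and Starts as an amino acid position and requires three nucleotides upstream of the alignment for every unmatched amino-terminal residue. Both function names are hypothetical.

```python
def is_p5p_candidate(start_q, start_s):
    """Criterion as stated in the text: query start exceeds subject start."""
    return start_q > start_s


def has_room_for_n_terminus(start_q, start_s):
    """Assumed unit-aware variant (not from the source).

    start_q: 1-based nucleotide position where the alignment begins in the TC.
    start_s: 1-based residue position where the alignment begins in the protein.
    Requires 3 nt of upstream TC sequence per unmatched N-terminal residue.
    """
    return (start_q - 1) >= 3 * (start_s - 1)
```

A hit starting at nucleotide 120 of the TC and residue 30 of the protein passes both tests (119 upstream nucleotides against 87 required), whereas a hit starting at nucleotide 50 and residue 30 passes only the literal comparison.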

Best FASTY hits were extracted for TCs from all four full length categories as well as the P5P categories as described above. For TCs matching these categories, the most 5' EST contributing to the CAP3 contig sequence was selected. In addition, the selected clone had to span the amino-terminal end of the FASTY protein match. Finally, to ensure the ready availability of the clones and therefore the utility of the analysis, the selected clone had to be available through the IMAGE consortium. See Figure 1 for an illustration of 5' clone selection. The P5P criteria selected 15,651 potential full length insert containing clones out of which 10,500 are distinct IMAGE clones, which represents an additional 1,557 sequences compared to Class 4. Two examples of such predicted protein coding sequences are presented in Figure 4. We have mapped these clones to 7,782 distinct clusters. To assess the quality of the FL prediction method, we compared our set to the IMAGE clone set selected by the Xenopus Gene Collection (XGC, [55]) for full length sequencing. As of April 2004 the XGC had selected 10,482 IMAGE clones for sequencing. Our analysis selected 3,152 IMAGE clones that were identical to clones selected by the XGC. Of the remaining 7,348 clones from our set, 4,866 selected IMAGE clones were found in the same cluster as 4,465 XGC selected clones (note that some of these clones are in the same cluster). In addition, 1,154 XGC clones did not have sequence available to be included in our analysis. The remaining 1,711 IMAGE clones selected for sequencing by XGC are not found in our predicted set while 2,482 clones were unique to our set. In an effort to examine why the 1,711 sequences selected for sequencing were not identified as full length, we compared the Startq and Starts values as described above. Using the P5P prediction criteria described above, we identify 107 XGC selected IMAGE clones that we predict are not full length but have an alternative clone which we predict is full length. Though final confirmation of the results requires additional sequencing, our method appears to be successful at identifying full length sequences and distinguishing non-full length sequences identified by an independent method. The FL clones are labeled in the XenDB web interface (see below), allowing a rapid identification of potential FL clones for a gene of interest.
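The clone counts in the comparison with the XGC set can be checked with simple arithmetic. The short script below uses only the figures quoted in the paragraph above; the variable names are ours.

```python
ours_total = 10_500         # distinct IMAGE clones predicted in this analysis
xgc_total = 10_482          # IMAGE clones selected by the XGC (April 2004)
identical = 3_152           # clones present in both sets

ours_same_cluster = 4_866   # our clones sharing a cluster with an XGC clone
xgc_same_cluster = 4_465    # XGC clones sharing a cluster with one of ours
xgc_no_sequence = 1_154     # XGC clones with no sequence in the analysis

# Clones in our set that are not identical to an XGC clone.
ours_remaining = ours_total - identical
# Clones in our set with no XGC counterpart, not even in the same cluster.
ours_unique = ours_remaining - ours_same_cluster
# XGC clones that have sequence but are absent from our predicted set.
xgc_not_found = xgc_total - identical - xgc_same_cluster - xgc_no_sequence

print(ours_remaining, ours_unique, xgc_not_found)  # prints: 7348 2482 1711
```

All three derived values agree with the numbers reported in the text, so the two clone sets partition consistently.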

Figure 4

Two examples of TCs derived from clones predicted to have a full length insert (P5P). The start positions in the hit suggest that the unmatched amino-terminal protein sequence is not well conserved between X. laevis and the matched organisms, here rabbit (top) and human (bottom), but the open reading frames (blue boxes) indicate that the clones the sequences were derived from do actually contain a full length insert. (Screenshots of the results were generated by the Genlight system.)

Due to the large number of sequences, we are unable to examine each sequence individually. Since the analysis depends on the overall degree of conservation among the sequences, such an approach will not be as successful on weakly conserved genes. In general, it seems likely that decreasing e-values correspond to higher quality predictions. On a global basis, the results need to be carefully considered, as an independent assessment of the distribution of conservation among the ensemble of sequences is not available.

Gene Ontology prediction and Functional Classification

The Gene Ontology (GO) project [57] is an ongoing international collaborative effort to generate consistent descriptions of gene products using a set of three controlled vocabularies or ontologies: biological processes, cellular components, and molecular functions. The GO vocabulary allows consistent searching of databases using uniform queries. The availability of such vocabularies can be critical to the interpretation of high throughput approaches such as microarrays. Based on FASTY homologies with both mouse and human sequence, we have mapped GO annotations to the Xenopus sequences. Of the 30,683 TCs with matches to mouse (29,971) or human IPI sequences (29,963), 19,721 TCs have been assigned putative GO annotations. Among the 10,500 potential full length ORF containing IMAGE clones, 6,886 have been assigned GO annotations.

The non-redundant X. laevis data set was then classified based on its homology to known proteins from the KOG [58] database (BLASTX, E-value cutoff 1.0e-5, best hit selection). KOGs are euKaryotic clusters of Orthologous Groups. KOG includes proteins from 7 eukaryotic genomes: C. elegans, D. melanogaster, H. sapiens, A. thaliana, S. cerevisiae, S. pombe and E. cuniculi. In total, 17,624 sequences (67.3%) had a hit against the KOG database and could be assigned a functional category.
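Best hit selection under an E-value cutoff, as used for the KOG classification, might look like the following sketch. The tuple layout and function name are hypothetical, and treating a hit exactly at the cutoff as passing is an assumption; only the 1.0e-5 threshold comes from the text.

```python
from typing import Dict, Iterable, Tuple

E_VALUE_CUTOFF = 1.0e-5  # BLASTX cutoff quoted in the text


def best_kog_hits(
    hits: Iterable[Tuple[str, str, float]]
) -> Dict[str, Tuple[str, float]]:
    """Keep, for each query, the lowest E-value hit that passes the cutoff.

    hits: (query_id, kog_id, e_value) tuples (hypothetical layout).
    Queries with no hit at or below the cutoff are left unclassified.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for query, kog, e_value in hits:
        if e_value > E_VALUE_CUTOFF:
            continue  # hit too weak to assign a functional category
        if query not in best or e_value < best[query][1]:
            best[query] = (kog, e_value)
    return best
```

The fraction of classified sequences is then simply the number of keys in the result divided by the number of input sequences.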

Identification of conserved genes not found in major model organisms

To identify additional genes within the dataset that are not found by comparison to protein sets of the major model organisms and to assess the extent of diverged or non conserved sequences, open reading frames of 600 nucleotides or longer were selected from the clustered data set for analysis. 219 sequences that did not have any hit in the previous analyses were identified (188 TCs representing 178 clusters and 31 singlets). We further restricted the number of sequences by re-running the BLASTX and FASTY analysis with E-value cutoffs of 0.01. 111 sequences (91 TCs representing 87 clusters consisting of an average of 6 ESTs per cluster and 19 singlets) without any significant similarity in protein databases could be identified and these were examined by TBLASTN against the human, mouse and 'others' EST databases (22.7 million sequences total). Signal peptides were identified by SignalP [59] as well as transmembrane domains by TMHMM [60, 61]. Results are presented in Table 6. The analysis identified 46 sequences with similarity to other organisms (E<0.01) with 11 sequences matching chicken (Gallus gallus), 10 sequences matching zebrafish (Danio rerio) and 6 sequences matching the rainbow trout (Oncorhynchus mykiss). Three of the sequences matched human sequences with less significance than the cutoff used above (i.e. 1.0e-6). Among the sequences with highly significant BLAST hits were two matches to the eastern tiger salamander (Ambystoma tigrinum tigrinum) and one to the rainbow trout (Oncorhynchus mykiss). A surprising match was to barley (Hordeum vulgare, E = 9.0e-35) which was the only plant represented among these hits. The remaining 65 sequences did not have significant homology to existing public database sequences. For 7 sequences both signal peptide cleavage sites and transmembrane domains could be identified. Another 15 sequences had either a signal peptide cleavage site or a transmembrane domain. These 22 sequences are potentially novel membrane proteins.
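The selection of long open reading frames can be illustrated with a minimal ORF scan. The text does not define how ORFs were delimited, so this sketch assumes an ORF runs from an ATG to the next in-frame stop codon, counts the stop codon in the length, and scans only the three forward frames; only the 600 nt threshold is taken from the text.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_ORF_NT = 600  # length threshold quoted in the text


def longest_orf(seq: str) -> int:
    """Length in nt of the longest forward-strand ORF (ATG through stop)."""
    seq = seq.upper()
    longest = 0
    for frame in range(3):
        start = None  # index of the first ATG of the ORF currently open
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                longest = max(longest, i + 3 - start)
                start = None
    return longest


def has_long_orf(seq: str, min_nt: int = MIN_ORF_NT) -> bool:
    """True if the sequence contains an ORF of at least min_nt nucleotides."""
    return longest_orf(seq) >= min_nt
```

ORFs that run off the end of the sequence without a stop codon are ignored here; a scan of EST-derived contigs may reasonably want to keep them, and would also scan the reverse complement.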

Table 6 Xenopus Long Open Reading Frames (>= 600 nt) without homology to major model organism protein sequences. ORF sequences were compared to all available EST data using TBLASTN. The 46 sequences shown here have homologies to ESTs from other organisms (E < 0.01). For each TC, the number of ESTs in the TC and the accession, SignalP and TMHMM results, and description and E-value of the best hit is shown. Additionally (not shown here), both signal peptides and transmembrane domains could be predicted in: clSignal peptides only in: cl4857_sin8, cl11312_sin2, cl11866_ctg2, cl14117_ctg1, cl16548_ctg1, cl19372_ctg2; Transmembrane domains only in: cl3994_ctg1, vimsin144578, cl18799_ctg1, cl18978_ctg1, cl18978_ctg2, cl25690_ctg1, cl23256_ctg1.

Utility

User interface

The results of the analyses described above have been incorporated into an SQL database amenable to complex queries. The database can be accessed through a user friendly web based interface (XenDB). XenDB allows individual and batch queries using Xenopus accession, GI, and XenDB, UniGene and TIGR cluster IDs. In addition, the user can query the Xenopus sequence hits using any protein accession/GI number both singly and in batch mode. This allows a rapid identification of Xenopus TCs and their corresponding clones with hits to given protein sequences. The output of various queries displays the matching Xenopus cluster(s) and links to a web page as presented in Figure 5. For each cluster, links to the best hit for a number of model organisms are provided as well as links to the assembly result, consensus sequence generated by CAP3, and visual alignments of all FASTY results. GenBank accession numbers for each EST in the cluster and whether the corresponding clone has been identified as full length are provided. Additionally, for each TC the COG and KOG classification, as well as the GO terms are available.
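To make the idea of a batch query concrete, the sketch below runs a batch lookup of protein accessions against a small in-memory SQLite table. The XenDB schema is not described in the text, so the table name, column names and all identifiers in the demo data are invented placeholders, and SQLite stands in for whatever database system XenDB actually uses.

```python
import sqlite3


def batch_lookup(conn, accessions):
    """Return (protein_acc, tc_id, e_value, fl_clone) rows for a batch.

    Rows are ordered by accession and then by ascending E-value, so the
    best hit for each protein comes first.
    """
    accessions = list(accessions)
    if not accessions:
        return []
    placeholders = ",".join("?" for _ in accessions)
    sql = (
        "SELECT protein_acc, tc_id, e_value, fl_clone FROM tc_hits "
        f"WHERE protein_acc IN ({placeholders}) "
        "ORDER BY protein_acc, e_value"
    )
    return conn.execute(sql, accessions).fetchall()


def demo_connection():
    """Build a tiny in-memory table with made-up rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tc_hits "
        "(protein_acc TEXT, tc_id TEXT, e_value REAL, fl_clone TEXT)"
    )
    conn.executemany(
        "INSERT INTO tc_hits VALUES (?, ?, ?, ?)",
        [
            ("P_EXAMPLE_1", "cl0001_ctg1", 1e-50, "IMAGE:0000001"),
            ("P_EXAMPLE_1", "cl0002_ctg1", 1e-12, None),
            ("P_EXAMPLE_2", "cl0003_ctg1", 1e-30, None),
        ],
    )
    return conn
```

Calling `batch_lookup(demo_connection(), ["P_EXAMPLE_1", "P_EXAMPLE_2"])` returns the matching TCs together with any full length clone recorded for them, which mirrors the kind of result table the web interface presents.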

Figure 5

Cluster view of the XenDB Web interface. Best FASTY hits to NR protein database, five model organisms and Xenopus proteins are shown on top. Gene Ontologies (GO) are based on best human and mouse IPI hits, functional categories on hits to COG and KOG databases. Below, additional information for each EST in the cluster is shown, such as accession, UniGene and TGI id, clone, cell and tissue type. Clones predicted not to be full length are colored red. Links to CAP3 assembly and TC sequence are provided.

The analysis and database system provides a very powerful tool which will enable the Xenopus community to take advantage of a number of technical and experimental advances. We have selected a couple of examples to illustrate possible types of queries. In considering the results, it is important to bear in mind that these examples can be combined to further refine the sequence set. In the first example, we sought to identify all the genes of a known type or class. In the second example, we wished to identify the set of Xenopus sequences which best matched a set of genes from another species identified using the CGAP database administered by the National Cancer Institute (NCI) [62, 63]. A final example demonstrates the ability of the system to translate results identified by microarray technologies, or other related high throughput technologies, to identify likely Xenopus homologues.

Homeobox gene identification

Homeobox containing proteins are a very important group of transcriptional regulators that play key roles in developmental processes. They can be divided into a 'complex' and a 'dispersed' super class representing the homeotic genes and the large number of homeodomain containing proteins dispersed (and diverged) within the genome [64]. The homeotic (Hox) genes play key roles in the anterior-posterior patterning of both vertebrate and invertebrate embryos and in Xenopus are often used as markers of anterior-posterior development [65–67]. The vertebrate homeotic genes are organized into four clusters arranged in the same order in which they are expressed in the anterior-posterior axis [64]. Of the 39 vertebrate Hox genes, we have identified 28 homologs in Xenopus laevis, while 19 are present in the protein database (Table 7). For those sequences not identified, we sought to determine whether they had been identified in the genome of Xenopus tropicalis. To do so, we used TBLASTX, provided as a tool on the Xenopus tropicalis website [68], to search for the missing sequences. Strong matches were identified for all of the remaining Hox genes except HoxD12. Using the BLASTN tool on the genome site, we confirmed that the gene order was conserved within each scaffold (data not shown). Interestingly, we were unable to identify HoxD12 within the predicted region though both HoxD11 and HoxD13 were recognized.

Table 7 Homeobox genes in X. laevis: for each HOX gene the corresponding cluster and TC is shown, as well as the most 5' clone in the assembly and the protein accession number, if available. When X. laevis genes were not identified, an identifier for the corresponding X. tropicalis sequence is provided.

Homologue identification from the Cancer Genome Anatomy Project (CGAP)

A second example takes advantage of the CGAP database [69] administered by the National Cancer Institute (NCI). This database and resource incorporates a large number of interconnected modules aimed at gene expression in cancer. Among the modules is a Serial Analysis of Gene Expression (SAGE) database [70, 71]. The SAGE approach counts polyadenylated transcripts by sequencing a short 14 bp tag at the gene's 3' end and is a quantitative method to examine gene expression [70]. Another module is the Digital Gene Expression Displayer (DGED), which distinguishes statistical differences in gene expression between two pools of libraries [72]. Each method generates tables of genes based on a wide variety of selection criteria. As would be expected, the vast majority of the available data comes from either human or mouse, thus demanding a tool to cross match the results in Xenopus.

For this particular example, we selected a tissue based query (DGED) derived from SAGE data in which we sought a set of genes that might include potential markers for glial or astrocyte fates. For this query, we selected all brain, cortex, cerebellum and spinal cord libraries excluding any libraries derived from cell lines. This yielded 58 potential libraries. From this we selected any library labeled as a glioblastoma for pool A and libraries labeled astrocytoma for pool B while excluding the remaining libraries (which included medulloblastomas, ependymomas, etc.). We did not distinguish between cancer grades. This limited the total number of libraries to six glioblastoma and nine astrocytoma libraries containing 487,197 and 863,610 SAGE tags each, respectively. Submission of the query resulted in the identification of 395 tags with a 2× expression factor and a 0.05 significance factor (default CGAP query values). These 395 tags represented 308 different sequences (180 were >2 fold higher in glioblastoma and 128 were >2 fold higher in astrocytoma) which corresponded to 278 proteins in the public database (115 glioblastoma, 163 astrocytoma) and were matched using the batch GenBank accession module available online in XenDB to 100 and 142 Xenopus sequences, respectively. (In the interests of space we have not included the extended table but provide the saved DGED query [see additional file 6] and the two text files [see additional files 7 and 8] that can be uploaded to the XenDB database). The results table includes links to the matching cluster and TC, the e-value and rank and whether a full length clone has been identified. The contig web link leads to additional information including the consensus analysis, the top FASTY hits to five model organisms and links to the Xenopus EST sequences in the TC (Figure 5). 
Among the genes identified are vimentin (15×, P = 0.01) and sox10 (7.6×, P = 0.03), genes previously established as markers of glial and oligodendrocyte fate respectively [73–75] as well as genes downstream of the Notch signalling pathway, known to be important for glia formation [76]. Thus the system developed and presented here allows 'in silico' based tools established for the study and analysis of other organisms, particularly human and mouse, to be easily and rapidly applied to the Xenopus model system.
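Because the two pools differ in size, a fold difference between them is only meaningful after normalizing tag counts by pool size. The sketch below shows that normalization with the pool sizes quoted above. It is not the CGAP implementation: it covers only the 2× expression factor and omits the significance test, and the function names and the inclusive threshold are assumptions.

```python
POOL_A_TAGS = 487_197  # glioblastoma pool size quoted in the text
POOL_B_TAGS = 863_610  # astrocytoma pool size quoted in the text
MIN_FOLD = 2.0         # expression factor quoted in the text


def normalized_fold(count_a, count_b, total_a=POOL_A_TAGS, total_b=POOL_B_TAGS):
    """Ratio of tag frequencies, pool A over pool B.

    Returns infinity when the tag is seen only in pool A, and 1.0 when it
    is absent from both pools.
    """
    if count_a < 0 or count_b < 0:
        raise ValueError("tag counts must be non-negative")
    if count_b == 0:
        return float("inf") if count_a > 0 else 1.0
    return (count_a * total_b) / (count_b * total_a)


def enriched_pool(count_a, count_b, total_a=POOL_A_TAGS, total_b=POOL_B_TAGS):
    """Return 'A' or 'B' if the tag is at least MIN_FOLD higher there, else None."""
    fold = normalized_fold(count_a, count_b, total_a, total_b)
    if fold >= MIN_FOLD:
        return "A"
    if fold <= 1.0 / MIN_FOLD:
        return "B"
    return None
```

With the default pool sizes, a tag counted 10 times in each pool is about 1.8 fold more frequent in the smaller glioblastoma pool, which falls short of the 2× threshold even though the raw counts are equal.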

Homologues of Drosophila eye development genes

In the final example, we take advantage of the database to perform a comparative analysis of microarray expression data. In many instances, the outcome of an array type experiment is a variety of tables listing regulated genes and the associated expression changes. Currently, there are few published Xenopus array studies available [77–85] while there exist extensive databases of expression for a variety of model organisms. The NCBI maintains a common database, the Gene Expression Omnibus [86], which contains data from over 15,000 samples including 337 human, 92 mouse and 12 Drosophila experiments (average 25 samples/experiment). Based on an ongoing interest in eye development, we selected a recent paper by Michaut and co-workers in the Gehring lab which examined gene expression changes induced by ectopic expression of the eyeless gene (ey/Pax-6) in Drosophila imaginal disks [87]. The development of the eye is evolutionarily conserved among both vertebrates and invertebrates [88, 89]. Many important insights into eye development have come from studies in Drosophila, which have defined a genetic cascade of evolutionarily conserved regulatory factors [90]. One such factor is Pax-6/eyeless, which is capable of inducing ectopic eyes in both flies [91] and vertebrates [92]. In the Michaut study, 371 eye-induced genes are detected using two different oligonucleotide based array platforms (Affymetrix and Hoffmann-LaRoche) and 73 are discussed in detail within the text (Michaut et al., Table 1, 2). To identify likely homologues of these genes in Xenopus, GenBank accession numbers were obtained from the NCBI Gene Expression Omnibus ([93], accession # GSE271) and used to query the XenDB database. This identified 47 potential homologues of the Drosophila Pax6/ey regulated genes, including 32 predicted full length sequences (Table 8). As these sequences are available from commercial sources, they can be readily obtained and tested using the various experimental approaches available to Xenopus such as gain of function studies by microinjection.

Table 8 Xenopus matches to Pax6/ey Regulated Genes identified by Michaut et al.

Discussion

Comparative approaches to important biological problems have resulted in enormous progress in the past decades. The advent of genomic and proteomic approaches has led to a torrent of data in many organisms and has demanded increasingly sophisticated bioinformatic approaches to organize and manage the information. We have developed an integrated information resource with a user-friendly interface powered by an automated clustering pipeline which will allow researchers to take advantage of the wealth of knowledge available in the public domain.

Comparison to human and mouse

Human and mouse are the best studied vertebrate organisms at the molecular level. In addition to the well publicized genome projects, both have extensive EST collections. This has led to the prediction and characterization of 44,775+ human sequences and 36,182 mouse sequences [94]. As vertebrate development is well conserved, it is important to assess the extent to which the Xenopus EST project has identified the known vertebrate genes. At the same time, one would like to identify any genes that are unique to Xenopus. Most gene prediction programs rely on homology thus eliminating this approach to unique gene identification. Sequences without significant homology could arise from incomplete sequencing that does not extend into the coding region. Results of the human genome project suggest that this would not be the case for a majority of the sequences analyzed in this report. The average 5' UTR in humans is 240 bp and the 3' UTR is 400 bp [95]. Sequencing reactions with current technologies yield readable sequence of 700 bases on average. Therefore, at least some subset of sequences would yield their protein sequence to analysis. An alternative origin of non-homologous sequences would be unspliced or improperly spliced transcripts. This possibility is also minimized by the utilization of polyA tails for RNA selection and reverse transcription priming using oligo(dT). A final, obvious and expensive approach is to select non-homologous sequences for full length double stranded sequencing. Sequence without errors more easily yields the desired open reading frame in even the simplest bioinformatic programs.

Sequences without hits

A class of sequences includes those without significant BLAST hits. In our analysis we have used a cutoff e-value of 1.0e-6. This of course is necessarily arbitrary, since as mentioned above it is not known what the exact level of similarity is between any given sequence pair. Based on this value, we remain with 43,753 sequences that have neither a BLASTX nor a FASTY hit to a known model organism sequence. The lack of similarity could be due to significant divergence of the sequence, the lack of an appropriate homologue in the public dataset, sequencing errors inherent in EST data or the presence of non-coding, presumably regulatory, sequences in the EST clone set. These unmatched sequences mirror the situation in the UniGene set for both mouse and human, with greater than 3 × 10^6 and 4 × 10^6 EST sequences in 76,000 and 106,000 clusters respectively, while fewer than 25,000 coding sequences have been recognized [21, 94, 96]. The source of these discrepancies is currently unclear, but they may arise from non-coding RNA (ncRNA) [97], micro RNA precursors [98], or incompletely spliced or unspliced transcripts [99]. In particular, ncRNAs are a likely source for a large fraction of the discrepancy based on estimates of a 10-fold greater number of non-coding transcription units than protein coding genes [100]. It has been estimated that >95% of transcription is non-coding [101]. Much of the analysis and identification of ncRNA relies on the availability of genomic sequence which is currently unavailable for X. laevis and incomplete for X. tropicalis, the highly homologous diploid species.

Completeness of Xenopus EST set

We have compared all the Xenopus sequences to the human and mouse protein sets to identify conserved proteins. An obvious question is how complete is the Xenopus EST set and what percentage of genes have been identified assuming that the vast majority of protein coding sequences have been evolutionarily conserved. Of the ~40,000 sequences in the IPI databases, 9,225 human and 7,664 mouse sequences do not have a strong match (E < 1.0e-6). Thus, there is a considerable effort remaining to develop a complete Xenopus protein coding set. In the course of our analysis we note the high degree of similarity between the allotetraploid laevis and diploid tropicalis Xenopus species which depended on the length of the matching sequence. For sequences covering >= 95% of the query, there was an average of 94% identity while the average identity dropped to 91% and 88% as the coverage dropped to 90 and 80% respectively. This conservation may allow sequences from both species to be combined to generate a more complete set.

It is well known that the outcome of clustering methods on a large scale depends on the variety of involved parameters. A systematic comparison between UniGene or TIGR Gene Indices and our results turns out to be extremely difficult, mainly because the underlying sequence sets differ as well due to different sequence cleanup and masking approaches. To maximize the utility and usability of our analysis, we have incorporated UniGene and TGI information into our dataset and provide simple tools for identifying the related UniGene and TGI identifier.

Future prospects

Both the clustering and consensus generation approaches are very rapid: 50 minutes for clustering on a single 900 MHz SPARC-CPU and a few hours for assembly on a cluster of 20 heterogeneous SPARC-based machines with 450 to 900 MHz. We therefore have achieved the design goal of being able to frequently update this aspect of the analysis. The subsequent comparative sequence analysis requires significantly greater resources and time (several weeks on the same cluster of heterogeneous workstations). The analysis described above is performed by various PERL based scripts developed during the course of our analysis which will allow updates and application to other model systems. We are currently working on a tool to compare clusters over time which will allow the sequence analysis described below to be performed on the restricted set of modified/new clusters rather than on the entire ensemble. The effect of CAP3 consensus generation is that a given cluster can be split into several separate TC sequences, usually due to low sequence quality or differences in the UTR regions of the sequences. The UTR end splitting is likely due to the differences between the in-paralogs in this allotetraploid species. We believe that such information will be of value to those researchers interested in a variety of evolutionary questions, examples of which will be discussed below. The difference in ploidy makes Xenopus laevis distinct from all of the other organisms for which similar analyses have been performed.
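One simple way to restrict a re-analysis to modified or new clusters is to compare cluster membership between two clustering runs. The sketch below is our own illustration of that idea and not the tool mentioned above; it assumes each run is available as a mapping from cluster identifier to EST accessions, and it compares clusters by membership because identifiers may not be stable between runs.

```python
def changed_clusters(old, new):
    """Return ids of clusters in `new` whose membership is not found in `old`.

    old, new: dict mapping cluster id -> iterable of EST accessions
    (hypothetical layout). A cluster counts as unchanged only if an old
    cluster contains exactly the same set of ESTs.
    """
    old_memberships = {frozenset(members) for members in old.values()}
    return {
        cluster_id
        for cluster_id, members in new.items()
        if frozenset(members) not in old_memberships
    }
```

Only the clusters returned here would need to go through consensus generation and the comparative sequence analysis again; all others could reuse the stored results.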

As with all ongoing high throughput sequencing efforts, certain aspects of the results change in proportion to the total number of sequences. As noted above, a complete gene set for Xenopus will require additional sequencing. The generation of tetra, octo and dodecaploid species of Xenopus between 80 and 10 million years ago [102] offers opportunities in the field of evolutionary biology. For example, comparisons of 3' UTR regions between in-paralogs of Xenopus laevis and their counterpart diploid tropical species may improve statistical models of molecular evolution. At the genome level, the potential availability of genome data from the polyploid species may provide insight into questions of chromosome segregation and silencing. The selection of Xenopus as a model organism by the NIH http://www.nih.gov/science/models/ and the establishment of the Trans-NIH Xenopus Initiative [103] have directly led to the support of EST and genome sequencing efforts. Among the priorities identified is the establishment and funding of a Xenopus Database [104] which will integrate sequence, expression and other Xenopus data. We hope to be able to update the results described here on a regular basis and contribute to the community effort.

Conclusion

One of the primary goals of the effort was to provide a resource of gene-oriented EST clusters and transcript oriented TCs, enriched with various information from heterogeneous sources, that would be of value to the biology community and the Xenopus community in particular. Using the XenDB system, the biologist can identify sequences of interest using simple gene name queries, accessions, or gene ontologies. The identified sequences have been mapped to public resources like NCBI's UniGene and TIGR Gene Indices and a consensus sequence prepared. In addition, we have identified publicly available IMAGE clones that maximize the 5' sequence to provide a full length construct when possible. These clones are available from IMAGE consortium providers.

Availability and requirements

Sequence availability, XenDB database and results display

The database and associated files are freely accessible through the XenDB website: http://bibiserv.techfak.uni-bielefeld.de/xendb/. The GenBank accession numbers and FASTA formatted files of the masked and clipped input sequences, as well as the TC sequences and results of the example applications (see below) can be downloaded. Additionally, the list of full length clones is available to researchers interested in performing genome-wide studies. Programs, scripts and database dumps are available from the authors upon request. The XenDB database should be cited with the present publication as a reference.

Abbreviations

EST:

Expressed Sequence Tag

ORDBMS:

Object Relational Database Management System

TC:

tentative contig sequence

KOG:

clusters of euKaryotic Orthologous Groups

GO:

Gene Ontology

VRT:

Vertebrate Sequences

HTC:

High Throughput cDNA

XGC:

Xenopus Gene Collection

MGC:

Mammalian Gene Collection

ZGC:

Zebrafish Gene Collection

FL:

Full Length

IPI:

International Protein Index

CGAP:

Cancer Genome Anatomy Project

DGED:

Digital Gene Expression Displayer

SAGE:

Serial Analysis of Gene Expression

ncRNA:

non-coding RNA

TGI:

TIGR Gene Index

References

  1. Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H, Merril CR, Wu A, Olde B, Moreno RF, et al.: Complementary DNA sequencing: expressed sequence tags and human genome project. Science. 1991, 252: 1651-1656.

  2. Zhang MQ: Computational prediction of eukaryotic protein-coding genes. Nat Rev Genet. 2002, 3: 698-709. 10.1038/nrg890.

  3. Henderson J, Salzberg S, Fasman KH: Finding genes in DNA with a Hidden Markov Model. J Comput Biol. 1997, 4: 127-141.

  4. Besemer J, Borodovsky M: Heuristic approach to deriving models for gene finding. Nucleic Acids Res. 1999, 27: 3911-3920. 10.1093/nar/27.19.3911.

  5. Pontius JU, Wagner L, Schuler GD: UniGene: a unified view of the transcriptome. The NCBI Handbook. 2003, Bethesda, MD, National Center for Biotechnology Information, 21-1-21-12.

  6. Christoffels A, van Gelder A, Greyling G, Miller R, Hide T, Hide W: STACK: Sequence Tag Alignment and Consensus Knowledgebase. Nucleic Acids Res. 2001, 29: 234-238. 10.1093/nar/29.1.234.

  7. Mironov AA, Fickett JW, Gelfand MS: Frequent alternative splicing of human genes. Genome Res. 1999, 9: 1288-1293. 10.1101/gr.9.12.1288.

  8. Ladd AN, Cooper TA: Finding signals that regulate alternative splicing in the post-genomic era. Genome Biol. 2002, 3: reviews0008-10.1186/gb-2002-3-11-reviews0008.

  9. Lipscombe D, Pan JQ, Gray AC: Functional diversity in neuronal voltage-gated calcium channels by alternative splicing of Ca(v)alpha1. Mol Neurobiol. 2002, 26: 21-44. 10.1385/MN:26:1:021.

  10. Stamm S: Signals and their transduction pathways regulating alternative splicing: a new dimension of the human genome. Hum Mol Genet. 2002, 11: 2409-2416. 10.1093/hmg/11.20.2409.

  11. Venables JP: Alternative splicing in the testes. Curr Opin Genet Dev. 2002, 12: 615-619. 10.1016/S0959-437X(02)00347-7.

  12. Roberts GC, Smith CW: Alternative splicing: combinatorial output from the genome. Curr Opin Chem Biol. 2002, 6: 375-383. 10.1016/S1367-5931(02)00320-4.

  13. Oklu R, Hesketh R: The latent transforming growth factor beta binding protein (LTBP) family. Biochem J. 2000, 352 Pt 3: 601-610. 10.1042/0264-6021:3520601.

  14. Tarone G, Hirsch E, Brancaccio M, De Acetis M, Barberis L, Balzac F, Retta SF, Botta C, Altruda F, Silengo L, Retta F: Integrin function and regulation in development. Int J Dev Biol. 2000, 44: 725-731.

  15. Klint P, Claesson-Welsh L: Signal transduction by fibroblast growth factor receptors. Front Biosci. 1999, 4: D165-D177.

  16. Chevreux B, Pfisterer T, Drescher B, Driesel AJ, Muller WE, Wetter T, Suhai S: Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Res. 2004, 14: 1147-1159. 10.1101/gr.1917404.

  17. Kota R, Rudd S, Facius A, Kolesov G, Thiel T, Zhang H, Stein N, Mayer K, Graner A: Snipping polymorphisms from large EST collections in barley (Hordeum vulgare L.). Mol Genet Genomics. 2003, 270: 24-33. 10.1007/s00438-003-0891-6.

  18. Useche FJ, Gao G, Harafey M, Rafalski A: High-throughput identification, database storage and analysis of SNPs in EST sequences. Genome Inform Ser Workshop Genome Inform. 2001, 12: 194-203.

  19. Nekrutenko A: Reconciling the numbers: ESTs versus protein-coding genes. Mol Biol Evol. 2004, 21: 1278-1282. 10.1093/molbev/msh125.

  20. Wang JP, Lindsay BG, Leebens-Mack J, Cui L, Wall K, Miller WC, DePamphilis CW: EST clustering error evaluation and correction. Bioinformatics. 2004, 20: 2973-84. 10.1093/bioinformatics/bth342.

  21. Genome-Consortium: Finishing the euchromatic sequence of the human genome. Nature. 2004, 431: 931-945. 10.1038/nature03001.

  22. Ewing B, Green P: Analysis of expressed sequence tags indicates 35,000 human genes. Nat Genet. 2000, 25: 232-234. 10.1038/76115.

  23. Wheeler DL, Church DM, Edgar R, Federhen S, Helmberg W, Madden TL, Pontius JU, Schuler GD, Schriml LM, Sequeira E, Suzek TO, Tatusova TA, Wagner L: Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Res. 2004, 32 (Database issue): D35-D40. 10.1093/nar/gkh073.

  24. Wheeler DL, Church DM, Federhen S, Lash AE, Madden TL, Pontius JU, Schuler GD, Schriml LM, Sequeira E, Tatusova TA, Wagner L: Database resources of the National Center for Biotechnology. Nucleic Acids Res. 2003, 31: 28-33. 10.1093/nar/gkg033.

  25. Schuler GD: Pieces of the puzzle: expressed sequence tags and the catalog of human genes. J Mol Med. 1997, 75: 694-698. 10.1007/s001090050155.

  26. Schuler GD, Boguski MS, Stewart EA, Stein LD, Gyapay G, Rice K, White RE, Rodriguez-Tome P, Aggarwal A, Bajorek E, Bentolila S, Birren BB, Butler A, Castle AB, Chiannilkulchai N, Chu A, Clee C, Cowles S, Day PJ, Dibling T, Drouot N, Dunham I, Duprat S, East C, Hudson TJ, et al.: A gene map of the human genome. Science. 1996, 274: 540-546. 10.1126/science.274.5287.540.

  27. Boguski MS, Schuler GD: ESTablishing a human transcript map. Nat Genet. 1995, 10: 369-371. 10.1038/ng0895-369.

  28. Huang X, Madan A: CAP3: A DNA sequence assembly program. Genome Res. 1999, 9: 868-877. 10.1101/gr.9.9.868.

  29. Ewing B, Green P: Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 1998, 8: 186-194.

  30. Phrap sequence assembler website. 2005, Laboratory of Phil Green, HHMI Genome Sciences Department, University of Washington, [http://www.phrap.org/]

  31. Burke J, Davison D, Hide W: d2_cluster: a validated method for clustering EST and full-length cDNA sequences. Genome Res. 1999, 9: 1135-1142. 10.1101/gr.9.11.1135.

  32. Abouelhoda MI, Ohlebusch E, Kurtz S: Optimal exact string matching based on suffix arrays. Proceedings of the Ninth International Symposium on String Processing and Information Retrieval. Lecture Notes in Computer Science. 2002, 2476: 31-43. Springer Verlag.

  33. Abouelhoda MI, Kurtz S, Ohlebusch E: The Enhanced Suffix Array and its Applications to Genome Analysis. Proceedings of the Second Workshop on Algorithms in Bioinformatics. Lecture Notes in Computer Science. 2002, 2452: 449-463. Springer Verlag.

  34. Abouelhoda MI, Kurtz S, Ohlebusch E: Replacing Suffix Trees with Enhanced Suffix Arrays. Journal of Discrete Algorithms. 2004, 2: 53-86. 10.1016/S1570-8667(03)00065-0.

  35. Liang F, Holt I, Pertea G, Karamycheva S, Salzberg SL, Quackenbush J: An optimized protocol for analysis of EST sequences. Nucleic Acids Res. 2000, 28: 3657-3665. 10.1093/nar/28.18.3657.

  36. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Church DM, DiCuccio M, Edgar R, Federhen S, Helmberg W, Kenton DL, Khovayko O, Lipman DJ, Madden TL, Maglott DR, Ostell J, Pontius JU, Pruitt KD, Schuler GD, Schriml LM, Sequeira E, Sherry ST, Sirotkin K, Starchenko G, Suzek TO, Tatusov R, Tatusova TA, Wagner L, Yaschenko E: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2005, 33: D39-D45. 10.1093/nar/gki062.

  37. Quackenbush J, Cho J, Lee D, Liang F, Holt I, Karamycheva S, Parvizi B, Pertea G, Sultana R, White J: The TIGR Gene Indices: analysis of gene transcript sequences in highly sampled eukaryotic species. Nucleic Acids Res. 2001, 29: 159-164. 10.1093/nar/29.1.159.

  38. Vector Database Website. 2005, [http://seq.yeastgenome.org/vectordb/]

  39. The Vmatch large scale sequence analysis software website. 2005, [http://www.vmatch.de/]

  40. Jurka J: Repbase update: a database and an electronic journal of repetitive elements. Trends Genet. 2000, 16: 418-420. 10.1016/S0168-9525(00)02093-X.

  41. Smit A, Green P: Repeat Masker Website and Server. 2005, [http://www.repeatmasker.org/]

  42. Beckstette M, Strothmann D, Homann R, Giegerich R, Kurtz S: PoSSuMsearch: Fast and Sensitive Matching of Position Specific Scoring Matrices Using Enhanced Suffix Arrays. In Proceedings of the German Conference on Bioinformatics (GCB 2004), GI Lecture Notes in Informatics, 53:53-64

  43. Kruger J, Sczyrba A, Kurtz S, Giegerich R: e2g: an interactive web-based server for efficiently mapping large EST and cDNA sets to genomic sequences. Nucleic Acids Res. 2004, 32: W301-W304. 10.1093/nar/gkh586.

  44. Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning DNA sequences. J Comput Biol. 2000, 7: 203-214. 10.1089/10665270050081478.

  45. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.

  46. Pearson WR, Wood T, Zhang Z, Miller W: Comparison of DNA sequences with protein sequences. Genomics. 1997, 46: 24-36. 10.1006/geno.1997.4995.

  47. Beckstette M, Mailänder JT, Marhöfer RJ, Sczyrba A, Ohlebusch E, Giegerich R, Selzer PM: Genlight: Interactive high-throughput sequence analysis and comparative genomics. Journal of Integrative Bioinformatics, Yearbook Bioinformatics 2004. Edited by: Hofestädt R. 2004, Magdeburg, IMBio, Informationsmanagement in der Biotechnologie e.V., 8: 79-94.

  48. European Bioinformatics Institute International Protein Index Website. 2005, [http://www.ebi.ac.uk/IPI]

  49. Kersey PJ, Duarte J, Williams A, Karavidopoulou Y, Birney E, Apweiler R: The International Protein Index: an integrated database for proteomics experiments. Proteomics. 2004, 4: 1985-1988. 10.1002/pmic.200300721.

  50. Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS: UniProt: the Universal Protein knowledgebase. Nucleic Acids Res. 2004, 32 (Database issue): D115-D119. 10.1093/nar/gkh131.

  51. Aaronson JS, Eckman B, Blevins RA, Borkowski JA, Myerson J, Imran S, Elliston KO: Toward the development of a gene index to the human genome: an assessment of the nature of high-throughput EST sequence data. Genome Res. 1996, 6: 829-845.

  52. Hillier LD, Lennon G, Becker M, Bonaldo MF, Chiapelli B, Chissoe S, Dietrich N, DuBuque T, Favello A, Gish W, Hawkins M, Hultman M, Kucaba T, Lacy M, Le M, Le N, Mardis E, Moore B, Morris M, Parsons J, Prange C, Rifkin L, Rohlfing T, Schellenberg K, Marra M, et al.: Generation and analysis of 280,000 human expressed sequence tags. Genome Res. 1996, 6: 807-828.

  53. Komar AA, Hatzoglou M: Internal ribosome entry sites in cellular mRNAs: The mystery of their existence. J Biol Chem. 2005

  54. The Mammalian Gene Collection. 2005, [http://mgc.nci.nih.gov/]

  55. The Xenopus Gene Collection. 2005, [http://xgc.nci.nih.gov/]

  56. The Zebrafish Gene Collection. 2005, [http://zgc.nci.nih.gov/]

  57. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556.

  58. Koonin EV, Fedorova ND, Jackson JD, Jacobs AR, Krylov DM, Makarova KS, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Rogozin IB, Smirnov S, Sorokin AV, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA: A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol. 2004, 5: R7. 10.1186/gb-2004-5-2-r7.

  59. Bendtsen JD, Nielsen H, von Heijne G, Brunak S: Improved prediction of signal peptides: SignalP 3.0. J Mol Biol. 2004, 340: 783-795. 10.1016/j.jmb.2004.05.028.

  60. Krogh A, Larsson B, von Heijne G, Sonnhammer EL: Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001, 305: 567-580. 10.1006/jmbi.2000.4315.

  61. Sonnhammer EL, von Heijne G, Krogh A: A hidden Markov model for predicting transmembrane helices in protein sequences. Proc Int Conf Intell Syst Mol Biol. 1998, 6: 175-182.

  62. Lash AE, Tolstoshev CM, Wagner L, Schuler GD, Strausberg RL, Riggins GJ, Altschul SF: SAGEmap: a public gene expression resource. Genome Res. 2000, 10: 1051-1060. 10.1101/gr.10.7.1051.

  63. Strausberg RL, Buetow KH, Greenhut SF, Grouse LH, Schaefer CF: The cancer genome anatomy project: online resources to reveal the molecular signatures of cancer. Cancer Invest. 2002, 20: 1038-1050. 10.1081/CNV-120005922.

  64. Gehring WJ, Affolter M, Burglin T: Homeodomain proteins. Annu Rev Biochem. 1994, 63: 487-526. 10.1146/annurev.bi.63.070194.002415.

  65. Cox WG, Hemmati-Brivanlou A: Caudalization of neural fate by tissue recombination and bFGF. Development. 1995, 121: 4349-4358.

  66. Wright CV, Morita EA, Wilkin DJ, De Robertis EM: The Xenopus XIHbox 6 homeo protein, a marker of posterior neural induction, is expressed in proliferating neurons. Development. 1990, 109: 225-234.

  67. Isaacs HV, Pownall ME, Slack JM: Regulation of Hox gene expression and posterior development by the Xenopus caudal homologue Xcad3. EMBO J. 1998, 17: 3413-3427. 10.1093/emboj/17.12.3413.

  68. JGI Xenopus tropicalis Web Site. 2005, [http://genome.jgi-psf.org/Xentr3/Xentr3.home.html]

  69. Cancer Genome Anatomy Project. 2005, [http://cgap.nci.nih.gov/]

  70. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW: Serial analysis of gene expression. Science. 1995, 270: 484-487.

  71. Boon K, Osorio EC, Greenhut SF, Schaefer CF, Shoemaker J, Polyak K, Morin PJ, Buetow KH, Strausberg RL, De Souza SJ, Riggins GJ: An anatomy of normal and malignant gene expression. Proc Natl Acad Sci U S A. 2002, 99: 11287-11292. 10.1073/pnas.152324199.

  72. Lal A, Lash AE, Altschul SF, Velculescu V, Zhang L, McLendon RE, Marra MA, Prange C, Morin PJ, Polyak K, Papadopoulos N, Vogelstein B, Kinzler KW, Strausberg RL, Riggins GJ: A public database for gene expression in human cancers. Cancer Res. 1999, 59: 5403-5407.

  73. Kuhlbrodt K, Herbarth B, Sock E, Hermans-Borgmeyer I, Wegner M: Sox10, a novel transcriptional modulator in glial cells. J Neurosci. 1998, 18: 237-250.

  74. Yoshida M: Intermediate filament proteins define different glial subpopulations. J Neurosci Res. 2001, 63: 284-289. 10.1002/1097-4547(20010201)63:3<284::AID-JNR1022>3.0.CO;2-6.

  75. Yoshida M, Colman DR: Glial-defined rhombomere boundaries in developing Xenopus hindbrain. J Comp Neurol. 2000, 424: 47-57. 10.1002/1096-9861(20000814)424:1<47::AID-CNE4>3.0.CO;2-5.

  76. Gaiano N, Fishell G: The role of notch in promoting glial and neural stem cell fates. Annu Rev Neurosci. 2002, 25: 471-490. 10.1146/annurev.neuro.25.030702.130823.

  77. Konig R, Baldessari D, Pollet N, Niehrs C, Eils R: Reliability of gene expression ratios for cDNA microarrays in multiconditional experiments with a reference design. Nucleic Acids Res. 2004, 32: e29. 10.1093/nar/gnh027.

  78. Crump D, Werry K, Veldhoen N, Van Aggelen G, Helbing CC: Exposure to the herbicide acetochlor alters thyroid hormone-dependent gene expression and metamorphosis in Xenopus laevis. Environ Health Perspect. 2002, 110: 1199-1205.

  79. Munoz-Sanjuan I, Bell E, Altmann CR, Vonica A, Brivanlou AH: Gene profiling during neural induction in Xenopus laevis: regulation of BMP signaling by post-transcriptional mechanisms and TAB3, a novel TAK1-binding protein. Development. 2002, 129: 5529-5540. 10.1242/dev.00097.

  80. Tran PH, Peiffer DA, Shin Y, Meek LM, Brody JP, Cho KW: Microarray optimizations: increasing spot accuracy and automated identification of true microarray signals. Nucleic Acids Res. 2002, 30: e54. 10.1093/nar/gnf053.

  81. Altmann CR, Bell E, Sczyrba A, Pun J, Bekiranov S, Gaasterland T, Brivanlou AH: Microarray-based analysis of early development in Xenopus laevis. Dev Biol. 2001, 236: 64-75. 10.1006/dbio.2001.0298.

  82. Arima K, Shiotsugu J, Niu R, Khandpur R, Martinez M, Shin Y, Koide T, Cho KW, Kitayama A, Ueno N, Chandraratna RA, Blumberg B: Global analysis of RAR-responsive genes in the Xenopus neurula using cDNA microarrays. Dev Dyn. 2005, 232: 414-431. 10.1002/dvdy.20231.

  83. Peiffer DA, von Bubnoff A, Shin Y, Kitayama A, Mochii M, Ueno N, Cho KW: A Xenopus DNA microarray approach to identify novel direct BMP target genes involved in early embryonic development. Dev Dyn. 2005, 232: 445-456. 10.1002/dvdy.20230.

  84. Shin Y, Kitayama A, Koide T, Peiffer DA, Mochii M, Liao A, Ueno N, Cho KW: Identification of neural genes using Xenopus DNA microarrays. Dev Dyn. 2005, 232: 432-444. 10.1002/dvdy.20229.

  85. Chung HA, Hyodo-Miura J, Kitayama A, Terasaka C, Nagamune T, Ueno N: Screening of FGF target genes in Xenopus by microarray: temporal dissection of the signalling pathway using a chemical inhibitor. Genes Cells. 2004, 9: 749-761. 10.1111/j.1356-9597.2004.00761.x.

  86. Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002, 30: 207-210. 10.1093/nar/30.1.207.

  87. Michaut L, Flister S, Neeb M, White KP, Certa U, Gehring WJ: Analysis of the eye developmental pathway in Drosophila using DNA microarrays. Proc Natl Acad Sci U S A. 2003, 100: 4024-4029. 10.1073/pnas.0630561100.

  88. Glaser T, Walton DS, Maas RL: Genomic structure, evolutionary conservation and aniridia mutations in the human PAX6 gene. Nat Genet. 1992, 2: 232-239. 10.1038/ng1192-232.

  89. Gehring WJ, Ikeo K: Pax 6: mastering eye morphogenesis and eye evolution. Trends Genet. 1999, 15: 371-377. 10.1016/S0168-9525(99)01776-X.

  90. Gehring WJ: The genetic control of eye development and its implications for the evolution of the various eye-types. Int J Dev Biol. 2002, 46: 65-73.

  91. Halder G, Callaerts P, Gehring WJ: Induction of ectopic eyes by targeted expression of the eyeless gene in Drosophila [see comments]. Science. 1995, 267: 1788-1792.

  92. Chow RL, Altmann CR, Lang RA, Hemmati-Brivanlou A: Pax6 induces ectopic eyes in a vertebrate. Development. 1999, 126: 4213-4222.

  93. The NCBI Gene Expression Omnibus. 2005, [http://www.ncbi.nlm.nih.gov/geo/]

  94. Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch JB: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A. 2004, 101: 6062-6067. 10.1073/pnas.0400782101.

  95. Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, Antonarakis SE, Attwood J, Baertsch R, Bailey J, Barlow K, Beck S, Berry E, Birren B, Bloom T, Bork P, Botcherby M, Bray N, Brent MR, Brown DG, Brown SD, Bult C, Burton J, Butler J, Campbell RD, Carninci P, Cawley S, Chiaromonte F, Chinwalla AT, Church DM, Clamp M, Clee C, Collins FS, Cook LL, Copley RR, Coulson A, Couronne O, Cuff J, Curwen V, Cutts T, Daly M, David R, Davies J, Delehaunty KD, Deri J, Dermitzakis ET, Dewey C, Dickens NJ, Diekhans M, Dodge S, Dubchak I, Dunn DM, Eddy SR, Elnitski L, Emes RD, Eswara P, Eyras E, Felsenfeld A, Fewell GA, Flicek P, Foley K, Frankel WN, Fulton LA, Fulton RS, Furey TS, Gage D, Gibbs RA, Glusman G, Gnerre S, Goldman N, Goodstadt L, Grafham D, Graves TA, Green ED, Gregory S, Guigo R, Guyer M, Hardison RC, Haussler D, Hayashizaki Y, Hillier LW, Hinrichs A, Hlavina W, Holzer T, Hsu F, Hua A, Hubbard T, Hunt A, Jackson I, Jaffe DB, Johnson LS, Jones M, Jones TA, Joy A, Kamal M, Karlsson EK, Karolchik D, Kasprzyk A, Kawai J, Keibler E, Kells C, Kent WJ, Kirby A, Kolbe DL, Korf I, Kucherlapati RS, Kulbokas EJ, Kulp D, Landers T, Leger JP, Leonard S, Letunic I, Levine R, Li J, Li M, Lloyd C, Lucas S, Ma B, Maglott DR, Mardis ER, Matthews L, Mauceli E, Mayer JH, McCarthy M, McCombie WR, McLaren S, McLay K, McPherson JD, Meldrim J, Meredith B, Mesirov JP, Miller W, Miner TL, Mongin E, Montgomery KT, Morgan M, Mott R, Mullikin JC, Muzny DM, Nash WE, Nelson JO, Nhan MN, Nicol R, Ning Z, Nusbaum C, O'Connor MJ, Okazaki Y, Oliver K, Overton-Larty E, Pachter L, Parra G, Pepin KH, Peterson J, Pevzner P, Plumb R, Pohl CS, Poliakov A, Ponce TC, Ponting CP, Potter S, Quail M, Reymond A, Roe BA, Roskin KM, Rubin EM, Rust AG, Santos R, Sapojnikov V, Schultz B, Schultz J, Schwartz MS, Schwartz S, Scott C, Seaman S, Searle S, Sharpe T, Sheridan A, Shownkeen R, Sims S, Singer JB, Slater G, Smit A, Smith DR, Spencer B, 
Stabenau A, Stange-Thomann N, Sugnet C, Suyama M, Tesler G, Thompson J, Torrents D, Trevaskis E, Tromp J, Ucla C, Ureta-Vidal A, Vinson JP, Von Niederhausern AC, Wade CM, Wall M, Weber RJ, Weiss RB, Wendl MC, West AP, Wetterstrand K, Wheeler R, Whelan S, Wierzbowski J, Willey D, Williams S, Wilson RK, Winter E, Worley KC, Wyman D, Yang S, Yang SP, Zdobnov EM, Zody MC, Lander ES: Initial sequencing and comparative analysis of the mouse genome. Nature. 2002, 420: 520-562. 10.1038/nature01262.

  96. Morey C, Avner P: Employment opportunities for non-coding RNAs. FEBS Lett. 2004, 567: 27-34. 10.1016/j.febslet.2004.03.117.

  97. Bartel DP: MicroRNAs: genomics, biogenesis, mechanism, and function. Cell. 2004, 116: 281-297. 10.1016/S0092-8674(04)00045-5.

  98. Gupta S, Zink D, Korn B, Vingron M, Haas SA: Genome wide identification and classification of alternative splicing based on EST data. Bioinformatics. 2004, 20: 2579-2585. 10.1093/bioinformatics/bth288.

  99. Yelin R, Dahary D, Sorek R, Levanon EY, Goldstein O, Shoshan A, Diber A, Biton S, Tamir Y, Khosravi R, Nemzer S, Pinner E, Walach S, Bernstein J, Savitsky K, Rotman G: Widespread occurrence of antisense transcription in the human genome. Nat Biotechnol. 2003, 21: 379-386. 10.1038/nbt808.

  100. Mattick JS: Non-coding RNAs: the architects of eukaryotic complexity. EMBO Rep. 2001, 2: 986-991. 10.1093/embo-reports/kve230.

  101. Sammut B, Marcuz A, Pasquier LD: The fate of duplicated major histocompatibility complex class Ia genes in a dodecaploid amphibian, Xenopus ruwenzoriensis. Eur J Immunol. 2002, 32: 1593-1604. 10.1002/1521-4141(200206)32:6<1593::AID-IMMU1593>3.0.CO;2-6.

  102. Trans-NIH Xenopus Initiative Website. 2005, [http://www.nih.gov/science/models/Xenopus/]

  103. Xenbase Xenopus Web Resource Website. 2005, [http://xenbase.org]

Acknowledgements

The authors thank Jan Reinkensmeier for his help in setting up the XenDB Web pages, and Alin Vonica, Trent Clarke and Stefan Kurtz for comments on the manuscript. The FSU School of Computational Science and Information Technology and the FSU Supercomputing Facility provided computing resources. CRA was supported by an FSU Research Foundation Program Enhancement Grant.

Author information

Corresponding author

Correspondence to Curtis R Altmann.

Additional information

Authors' contributions

A.S. developed and implemented the Vmatch-based clustering pipeline. M.B. contributed his high-throughput sequence analysis system, Genlight. A.S. and M.B. developed the XenDB database schema, performed the post-clustering data analyses and contributed to the manuscript. A.H.B. provided supervision and guidance on the development of the project design goals and on the interpretation of analysis output with regard to biological significance. R.G. provided supervision and guidance on the development of the clustering pipeline and provided essential infrastructure. C.R.A. provided advice and guidance on the development of the clustering pipeline and on the incorporation of the analysis into the database, performed and interpreted the various queries presented, and wrote a significant portion of the manuscript.

Alexander Sczyrba and Michael Beckstette contributed equally to this work.

Electronic supplementary material

12864_2005_324_MOESM1_ESM.pdf

Additional File 1: Figure S1, Effect of Parameter Variation on EST Clustering: Masked and trimmed EST sequences were clustered with Vmatch using different minimal overlap lengths and percentage identity values. The total number of clusters (blue) and the number of singletons (red) are plotted against the minimal overlap length, with one series per percentage identity (squares 98%, stars 96%, circles 94%). (PDF 4 KB)

12864_2005_324_MOESM2_ESM.doc

Additional File 2: Table S1, Distribution of EST sequences in the analysis based on the annotated tissue source for the preparation of the library. (NOTE: annotations are imported directly from GenBank entries and are dependent on the original annotation.) (DOC 36 KB)

12864_2005_324_MOESM3_ESM.doc

Additional File 3: Table S2: The 20 most abundant developmental stage annotations in the X. laevis data set as annotated in GenBank: Distribution of EST sequences in the analysis based on the annotated developmental stage of the source library. (NOTE: annotations are imported directly from GenBank entries and are dependent on the original annotation.) (DOC 30 KB)

12864_2005_324_MOESM4_ESM.doc

Additional File 4: Table S3: The 30 most abundant Clone Libraries in the X. laevis data set as determined by the GenBank annotation. (NOTE: annotations are imported directly from GenBank entries and are dependent on the original annotation.) (DOC 55 KB)

Additional File 5: Table S4: Sizes of protein sets used for sequence analysis of clustered sequences. (DOC 31 KB)

12864_2005_324_MOESM6_ESM.htm

Additional File 6: File containing the SAGE database query used in the glioblastoma and astrocytoma analysis. (HTM 24 KB)

12864_2005_324_MOESM7_ESM.txt

Additional File 7: File containing protein accession numbers of SAGE glioblastoma genes for upload to the XenDB system. (TXT 1 KB)

12864_2005_324_MOESM8_ESM.txt

Additional File 8: File containing protein accession numbers of SAGE astrocytoma genes for upload to the XenDB system. (TXT 2 KB)

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Sczyrba, A., Beckstette, M., Brivanlou, A.H. et al. XenDB: Full length cDNA prediction and cross species mapping in Xenopus laevis. BMC Genomics 6, 123 (2005). https://doi.org/10.1186/1471-2164-6-123

  • DOI: https://doi.org/10.1186/1471-2164-6-123

Keywords