The Brassica species, related to Arabidopsis thaliana, include an important group of crops and represent an excellent system for studying the evolutionary consequences of polyploidy. Previous studies have led to a proposed structure for an ancestral karyotype and models for the evolution of the B. rapa genome by triplication and segmental rearrangement, but these have not been validated at the sequence level.
We developed computational tools to analyse the public collection of B. rapa BAC end sequence, in order to identify candidates for representing collinearity discontinuities between the genomes of B. rapa and A. thaliana. For each putative discontinuity, one of the BACs was sequenced and analysed for collinearity with the genome of A. thaliana. Additional BAC clones were identified and sequenced as part of ongoing efforts to sequence four chromosomes of B. rapa. Strikingly few of the 19 inter-chromosomal rearrangements corresponded to the set of collinearity discontinuities anticipated on the basis of previous studies. Our analyses revealed numerous instances of newly detected collinearity blocks. For B. rapa linkage group A8, we were able to develop a model for the derivation of the chromosome from the ancestral karyotype. We were also able to identify a rearrangement event in the ancestor of B. rapa that was not shared with the ancestor of A. thaliana, and is represented in triplicate in the B. rapa genome. In addition to inter-chromosomal rearrangements, we identified and analysed 32 BACs containing the end points of segmental inversion events.
Our results show that previous studies of segmental collinearity between the A. thaliana, Brassica and ancestral karyotype genomes, although very useful, represent over-simplifications of their true relationships. The presence of numerous cryptic collinear genome segments and the frequent occurrence of segmental inversions mean that inference of the positions of genes in B. rapa based on the locations of orthologues in A. thaliana can be misleading. Our results will be of relevance to a wide range of plants that have polyploid genomes, many of which are being considered according to a paradigm of comprising conserved synteny blocks with respect to sequenced, related genomes.
The cultivated Brassica species, like Arabidopsis thaliana, are members of the Brassicaceae family . Brassica rapa (n = 10) contains the Brassica A genome, which is the smallest, at ca. 500 Mb . A genome sequencing project is underway http://brassica.bbsrc.ac.uk/ webcite. A number of genome analysis studies have shown that the Brassica genomes contain extensive triplication, consistent with their having evolved from a hexaploid ancestor [3-5]. Two sequence-level studies, one in B. oleracea  and one in B. rapa  have provided further support for the hypothesis of hexaploid ancestry for the Brassica species. Recent cytogenetic studies have shown that a distinctive feature of the Brassiceae tribe, of which the Brassica species are members but A. thaliana is not, is that they contain extensively triplicated genomes .
An elegant study using sequenced RFLP markers demonstrated that 21 segments of the genome of A. thaliana, representing almost its entirety, could be replicated and rearranged to generate a structure approximating that of the B. napus genome . In a similarly ground-breaking study, an ancestral karyotype (AK) of n = 8 was proposed for the Brassicaceae, which has been related to the A. thaliana genome sequence and the structure of the B. rapa genome derived by linkage mapping . Thus the genome sequence of A. thaliana is being used, either directly via the B. napus comparative analysis or indirectly via the AK inferred genome, to inform studies in the Brassica species. A complication in such comparative studies is that there are typically multiple orthologues in Brassica species for each gene represented in A. thaliana, although interspersed gene loss has reduced the number that might be expected in paleohexaploids such as the Brassica species .
Brassica species have been used to study the early responses of genomes to the induction of polyploidy, via resynthesis of B. napus by hybridization of B. rapa with B. oleracea. Such lines display genome instability, which can persist for many generations . Although this is hypothesised to involve homoeologous non-reciprocal translocations, such evolutionary events have not been studied at the sequence level. Indeed, sequence-level studies in Brassica to date have focussed on regions that show collinearity between the Brassica genome studied and that of A. thaliana. Similarly, in comparative studies in grass genomes, which are considered very much in terms of rearranged collinear blocks , little attention has been paid to the regions of collinearity breakdown.
We aimed to test the veracity of our present understanding of the evolution of the Brassica and Arabidopsis genomes from the AK genome by identifying and sequencing BAC clones containing genomic DNA of B. rapa that represent a sample of collinearity discontinuities (CDs) relative to the A. thaliana genome. This involved the development of bioinformatics tools and accessing data arising from ongoing activities to sequence the first four of the ten chromosomes of B. rapa.
Identification of BAC clones putatively containing collinearity discontinuities
We developed a method by which candidate B. rapa BAC clones spanning CDs with the Arabidopsis genome could be identified and selected for sequencing. Our starting point was the set of BAC end sequence (BES) data available for the combined libraries from the 'Chiifu' cultivar, which is the subject of the multinational genome sequencing project. Using a strategy opposite to that employed for selection of seed BACs for that programme, we analysed the mate-pairs within the BES data primarily for inferred disruptions in short- to medium-range synteny (up to five-fold of an average BAC insert, i.e. 500 kb). We first conducted a BLASTN similarity search against the Arabidopsis genome sequence with all 200,031 individual BES available from 106,144 B. rapa BAC clones, these sequences comprising 93,887 mate-pairs and 12,257 singletons. For each BES we recorded the pseudochromosome coordinates of the most significant alignment above a threshold E-value of 10-30. Of the clones with both mate-pair BES available, 26,574 (28%) gave mappings with each E-value above this threshold and were therefore amenable to further analysis.
We loaded these pseudochromosome mappings into our own copy of the ATIDB Arabidopsis genome database  to enable a programmatic analysis. A Perl script was developed to interrogate the database and to identify associations between non-contiguous regions of the Arabidopsis genome that are linked by a number of disjoined mate-pair mappings and thus produce a list of cognate B. rapa BAC clones that might contain discontinuities. The algorithm we used is described in more detail in Methods. Our initial approach took into account several factors; filtering out instances of clone duplications and discounting mate-pair mappings whose DNA strand dispositions differed from the majority. We experimented with a threshold number, over the range of 2-5, of independent mate-pair mappings linking any given pair of bins required to signal an association.
BAC clones potentially representing CDs with the A. thaliana genome were thus selected, one from each association identified by three or more BAC clones. These BAC clones were sequenced and annotated, inter alia, for similarity to A. thaliana gene models and to B. rapa BES using BLASTN. Of the 68 sequenced BACs, 38 were found not to contain CDs. In the majority of these (25), the BACs show alignment of multiple gene models from two regions of the A. thaliana genome. These pairs of regions of the A. thaliana genome are related to each other, representing paralogous segments. The sequences at one end of each B. rapa BAC shows the highest similarity to the corresponding gene model from one of the A. thaliana genome segments, whereas the sequences at the other end of the B. rapa BAC shows the highest similarity to the corresponding gene model from the other A. thaliana genome segment. We termed this paralogue conflation. In the remaining cases, there appears to be at one end of the clone a small stretch of inverted sequence or a single gene (or gene fragment) with similarity elsewhere in the A. thaliana genome. The remaining 30 B. rapa BAC clones contain similarity to two or more collinear runs of multiple A. thaliana gene models.
These findings enabled us to substantially improve the algorithm and to reduce the false positive rate, using the sequenced BACs as a training set. Graphical representations of the results are shown in Figure 1 and a summary is given in Table 1. A majority proportion of the CDs we were seeking would be ancestral to the divergence of the B. rapa A genome and B. oleracea C genome lineages and so we experimented with adding supplementary data obtained from a B. oleracea BAC library. Using identical methods we analysed 85,317 BES derived from 43,691 B. oleracea clones, these sequences comprising 41,626 mate-pairs and 2,065 singletons. Of the clones with both mate-pair BES available, 6,623 clones (15.9%) yielded significant mappings at the same E-value threshold as before. This proportion was almost half that obtained from the B. rapa dataset, probably due to the higher repeat content in B. oleracea reducing the probability of each BES in a mate-pair containing conserved (genic) sequence. When these B. oleracea mappings were added and the database re-analysed, the number of associations revealed was increased significantly at each threshold value (Table 1).
Figure 1. Representation of candidate collinearity discontinuities within the A. thaliana genome with respect to B. rapa. The plot is divided into cells with dimensions proportional to the physical lengths of the five Arabidopsis pseudochromosomes (numbered on axes). A blue square within a cell signifies an association between 500 kb bins requiring (a.) 2; (b.) 3 or (c.) 4 distinct instances of disjoined B. rapa BES mate-pair mappings linking them.
Table 1. Results of database interrogation for collinearity discontinuities
Sequence validation of putative collinearity discontinuities
A genome sequencing project is underway for B. rapa, with sequences derived using a BAC-by-BAC strategy and with annotation being made publicly available http://brassica.bbsrc.ac.uk/ webcite. During the course of this effort, we sequenced a number of BAC clones (and sets of overlapping clones) containing CDs relative to the A. thaliana genome. Several of these had not had both mate-pair BES available previously and hence would have been undetectable by our computational analysis. These clones, along with those identified from the genome-wide analysis, were then assessed rigorously in order to exclude any that might not be representative of the genome (e.g. chimaeric ligation or the result of rearrangement in E. coli). This was done by using the sequence annotation to assess whether there are additional B. rapa BAC clones overlapping and confirming the discontinuity. This process involved the identification, based on congruous BAC end sequence alignments, of additional BAC clones that confirm the CD, and is illustrated in Figure 2. This resulted in the confirmation of a further 20 discontinuities represented in sequenced BAC clones, making a total of 50, as listed in Table 2. Relative to the A. thaliana chromosome structure, 19 of these represent inter-chromosomal rearrangements, with the remaining 31 representing the end points of intra-chromosomal rearrangements (segmental inversions).
Figure 2. Validation of collinearity discontinuities. Annotation of the sequence of BAC KBrB026E16 showing aligned BAC end sequences and homologous A. thaliana gene models. Sequence similarities with A. thaliana gene models are shown with blue arrows. Sequence similarity with end sequences derived from other B. rapa BAC clones are colour-coded: red for the end sequences of fully sequenced BACs, white for the end sequences of BACs for which both can be found within the annotated BAC, orange for end sequences which may bridge to another fully sequenced BAC, and green for other cases. The two inferred collinearity blocks are indicated by grey arrows. The collinearity discontinuity (CD) is confirmed by congruous end sequence alignments of KBrB085N10, KBrB067E30, KBrB038E01 and KBrB021K15. Both end sequences of KBrB085N10 align to KBrB026E16, with the expected orientation of the reads (one in each collinearity block). For each of the remaining three, one end aligns to the A. thaliana chromosome 4-related region and the other to regions of KBrB053L06 (shown in red) that extend beyond KBrB026E16, as do one end of each of KBrB068D06 and KBrB116K14 (from which the other ends align to the chromosome 1-related region. Thus multiple BACs (for which only end sequence is available) confirm the CD identified in the sequenced clone, in addition to the fully-contained BAC (KBrB085N10).
Table 2. Characterization of B. rapa BAC clones containing discontinuities in collinearity with the A. thaliana genome
Fifteen of the 19 inter-chromosomal CDs were genetically mapped in the B. rapa genome, either by direct linkage mapping or by sequence overlap with a BAC mapped by linkage mapping described elsewhere , as summarised in Table 2. These could be related to the position in the Brassica A genome of CDs previously inferred by linkage mapping-, defined relative to A. thaliana chromosomes  and subsequently to the AK . We will use the nomenclature At(chromosome number, letter) to refer to the previously described A. thaliana chromosome blocks (e.g. At1A refers to A. thaliana chromosome 1, block A) as described in  and AK(letter) to refer to the ancestral karyotype blocks (e.g. AKA refers to ancestral karyotype block A), as described in .
One sequenced CD (represented by BAC KBrH131P10) mapped to linkage group A1 and is consistent with the position of the inferred CD between blocks At3A and At4B (AKF and AKT), as illustrated in Figure 3. However, the transition revealed by the BAC sequence is between the expected part of At3A (AK block F) but with At1D (AKD). The linkage mapping study  had identified markers on A1 with similarity to this region of the A. thaliana genome, but there was insufficient evidence to call the block. Only one copy of AKD had been identified previously , so our study has identified the position of one of the "missing" two blocks in the B. rapa genome that would be expected from its paleohexaploid ancestry.
Figure 3. Collinearity discontinuities mapped to B. rapa linkage group A1. The sequenced BAC clones are illustrated in relation to the CDs with which their mapping is consistent, as represented in . Orthologues delineating the boundaries of recombination events and the end-most aligned gene models within the BACs are designated by the name of the A. thaliana gene model suffixed "*".
Five sequenced CDs (represented by BACs KBrH110M01, KBrH010M06, KBrH034P16, KBrB055E21 and KBrH004I22) mapped to linkage group A3, as illustrated in Figure 4. The position of KBrH110M01 is consistent with the position of the inferred CD between At5E and At2B/C (AKW and AK J). However, the transition revealed by the BAC sequence shows an additional small segment in between these, with collinearity to a more distant (internal) part of At2B/C (AKJ). This structure may be the result of rearrangements within this collinearity block.
Figure 4. Collinearity discontinuities mapped to B. rapa linkage group A3. The sequenced BAC clones are illustrated in relation to the CDs with which their mapping is consistent, as represented in . Orthologues delineating the boundaries of recombination events and the end-most aligned gene models within the BACs are designated by the name of the A. thaliana gene model suffixed "*".
The sequences within KBrH010M06 represent the end of collinearity block At3C (AKM) and sequences internal to collinearity blocks At5F (AKX), but neither had been identified previously on linkage group A3.
The sequences within KBrH034P16 are internal to collinearity blocks At3D and At4B (AKN and AKT). Although At4B (AKT) had been identified previously on B. rapa linkage group A3, At3D (AKN) had not. Therefore the transition previously inferred on this linkage group between collinearity blocks At4B (AKT) and At3A (AKF) [9,10] may be more complex than anticipated. This is supported by the results of analysis of the structure of the Brassica A genome as represented in B. juncea, in which AKN and AKG-H were identified between AKT and AKF .
The sequences within KBrB055E21 represent the ends of collinearity blocks At3A and At4A (AKF and AKO). This is consistent with the transition inferred on the basis of linkage mapping , but is not consistent with the inferred interpolation of AKP between AKF and AKO that has been proposed .
The sequences within KBrH004I22 correspond to the end of At3A (AKF) and sequences internal to At2A (or at the end of AKK). Although At3A (AKF) had been identified previously on linkage group A3, At2A (AKK) had not. Only two copies of AKK had been identified previously , so our study may have identified the position of the "missing" third block in the B. rapa genome that would be expected from its paleohexaploid ancestry. The linkage mapping study  had identified markers on A3 that have similarity to this region of the A. thaliana genome, but there was insufficient evidence to call the block.
Three sequenced CDs (represented by BACs KBrH109L07, KBrH004M24 and KBrH001J23) mapped to linkage group A6, as illustrated in Figure 5. The sequences in KBrH109L07 represent the end of collinearity block At3C (AKM) and sequences internal to collinearity blocks At5F (AKX). Block At5F (AKX) had been position on linkage group A6 previously, but no copy of At3C (AKM) had been positioned previously on this linkage group. Thus this BAC represents the position of an additional copy of this segment (along with that identified in BAC KBrH010M06). The linkage mapping study  had identified markers on A6 that have similarity to this region of the A. thaliana genome, but there was insufficient evidence to call the block.
Figure 5. Collinearity discontinuities mapped to B. rapa linkage group A6. The sequenced BAC clones are illustrated in relation to the CDs with which their mapping is consistent, as represented in . Orthologues delineating the boundaries of recombination events and the end-most aligned gene models within the BACs are designated by the name of the A. thaliana gene model suffixed "*".
The sequences within KBrH004M24 correspond to the end of At2A (AKK) and sequences internal to At5D (or at the end of AKV) and confirm one of the CDs on linkage group A6 previously inferred .
The sequences within KBrH001J23 correspond to the end of collinearity block At3C (AKM) and sequences internal to collinearity blocks At1B (AKB). Block At1B (AKB) had been position on linkage group A6 previously, but no copy of At3C (AKM) had been positioned previously on this linkage group. Thus this BAC represents the position of the third copy of this segment (along with those identified in BACs KBrH010M06 and KBrH109L07).
Two sequenced CDs (represented by BACs KBrB089J13 and KBrB026E16) mapped to linkage group A8, as illustrated in Figure 6. The sequences in both correspond to those close to the ends of collinearity blocks At4B (AKT) and At1B (AKB). Thus they are both candidates for representing the CDs on linkage group A8 previously inferred [9,10]. However, the transition revealed by the sequence of KBrB089J13 shows an additional small segment in between these, with collinearity to a more distant (internal) part of At4B (AKT). This structure may be the result of rearrangements within this collinearity block.
Figure 6. Collinearity discontinuities mapped to B. rapa linkage group A8. The sequenced BAC clones are illustrated in relation to the CDs with which their mapping is consistent, as represented in . Orthologues delineating the boundaries of recombination events and the end-most aligned gene models within the BACs are designated by the name of the A. thaliana gene model suffixed "*".
Four sequenced CDs (represented by BACs KBrH106N09, KBrB028F11, KBrB026A12 and KBrH006I08) mapped to linkage group A9, as illustrated in Figure 7. The sequences in KBrH106N09 represent sequences within collinearity block At4A (AKO) and within At3B (AKL). Although At4A (AKO) had been identified previously on linkage group A9, At3B (AKL) had not. Only two copies of AKL had been identified previously , so our study has identified the position of the "missing" third block in the B. rapa genome that would be expected from its paleohexaploid ancestry. There is an additional small collinear segment between AKO and AKL, corresponding to sequences anticipated to have been positioned between AKO and AKP, suggesting that the previously defined boundary of AKO (corresponding to an orthologue of At4g04995)  may have been incorrect, with this block extending to an orthologue of A4g05460.
Figure 7. Collinearity discontinuities mapped to B. rapalinkage group A9. The sequenced BAC clones are illustrated in relation to the CDs with which their mapping is consistent, as represented in . Orthologues delineating the boundaries of recombination events and the end-most aligned gene models within the BACs are designated by the name of the A. thaliana gene model suffixed "*".
The sequences in KBrB028F11 represent sequences at the end of collinearity blocks At2A (AKH) and within At1D (AKD). Although At1D (AKD) had been identified previously on B. rapa linkage group A9, At2A (AKH) had not. Therefore the transitions previously inferred on this linkage group bordering block At1D (AKD) [9,10] may be more complex than anticipated. This is supported by the results of analysis of the structure of the Brassica A genome as represented in B. juncea, in which AKH was identified as being adjacent to AKD . The linkage mapping study  had identified markers on A9 that have similarity to this region of the A. thaliana genome, but there was insufficient evidence to call the block. The transition revealed by the BAC sequence shows an additional small segment in between At2A (AKH) and At1D (AKD), with collinearity to the end of At3A (AKF).
The sequences in KBrB026A12 represent sequences at the end collinearity block At2A (AKK) and within At5D (AKV). Although At5D (AKV) had been identified previously on linkage group A9, the part of At2A corresponding to AKK had not. Therefore the transitions previously inferred on this linkage group bordering block At5D (AKV) [9,10] may be more complex than anticipated. Only two copies of AKK had been identified previously , so our study has identified the position of the "missing" third block in the B. rapa genome that would be expected from its paleohexaploid ancestry. The linkage mapping study  had identified markers on A9 that have similarity to this region of the A. thaliana genome, but there was insufficient evidence to call the block.
The sequences in KBrH006I08 represent sequences at the end of collinearity blocks At3D (AKN) and within At2B (AKI). They confirm one of the CDs on linkage group A9 previously inferred [9,10], but indicate that At2B (AKI), as represented on linkage group A9, may be truncated.
Twenty four of the 31 CDs representing the end points of intra-chromosomal rearrangements (segmental inversions) were mapped in the B. rapa genome, either by direct linkage mapping or by overlap with a BAC mapped by linkage mapping. Their occurrence appears genome-wide, as summarised in Table 2. Few such rearrangements had been inferred previously, and these had not been clearly defined. The positions on linkage group A8 of BAC clones KBrB022J01 and KBrH064I20, and on linkage group A1 of BAC clones KBrH001C16, KBrH027B04, KBrB008F10 and KBrB090F01 are consistent with those expected for the inversions noted in At4B segments on these chromosomes .
Some of the CDs classified as segmental inversions on the basis of the relationships of genes represented in the BACs to their orthologues in A. thaliana may have actually represented inter-chromosomal rearrangements, but segments of those ancestral chromosomes have subsequently come together in the A. thaliana genome. This is most notably the case for A. thaliana chromosome 5, for which we have three "segmental inversion" CDs spanning relatively large regions of the chromosome. These three are represented by KBrH066L21, KBrB129C20 and KBrB055O17, and are illustrated in Figure 8. The sequences in KBrH066L21, which has been mapped to linkage group A9, represent sequences near the end of collinearity block At5D (AKV) and between At5B and At5C (AKQ and AKS). Both At5D (AKV) and At5B (AKQ) had been identified on this linkage group previously, but not adjacent to each other. This suggests that the transitions previously inferred on this linkage group bordering these collinearity blocks [9,10] may be more complex than anticipated. The identification of sequences corresponding to a region between AKQ and AKS suggests that the previously defined boundary of AKQ (corresponding to an orthologue of At5g28897)  may have been incorrect, with this block extending to an orthologue of A5g30510. The distal part of this extended collinearity block AKQ contains a small segmental inversion (relative to the A. thaliana genome) that is wholly contained within the BAC. The sequences in KBrB129C20 represents a CD between the ends of collinearity blocks At5E and At5A (AKW and AKR) and is consistent with the position of a previously described CD on linkage group A3 [9,10], to which the clone maps. The sequences within KBrB055O17 represents a CD between At5F and At5B (AKX and AKQ), respectively, but we were unable to determine its position in the genome.
Figure 8. Collinearity discontinuities involving segmental inversions. Orthologues delineating the boundaries of recombination events and the end-most aligned gene models within the BACs are designated by the name of the A. thaliana gene model suffixed "*". The linkage group to which each BAC maps, where determined, is shown next to the BAC name.
In addition to the small segmental inversion (relative to the A. thaliana genome) contained within KBrH066L21, we found further examples of secondary rearrangements at the points of CDs in KBrB022J01 and KBrH108B16, and wholly contained inversions within KBrB011D06 and KBrH064I20, as illustrated in Figure 8. We identified one example of a CD apparently representing intra-chromosomal rearrangements that were separated by sequences from elsewhere in the genome. The sequences in KBrH026A01 represent a small segment from the end of At4A (AKO) at one end of a segmental inversion within At2B (AKI), as illustrated in Figure 8.
Analysis of collinearity discontinuity sequences
None of the BAC clones containing the CDs was found to contain B. rapa satellite repeat sequences characteristic of centromeres [16,17], nor was any found with tandem tracts of TTTAGGG repeats that are associated with telomeres , although clone KBrH108B16 did have a cluster of 31 such repeats interspersed over 926 bp, but some 17 kb from the CD.
Minimal sequences defining the confirmed CDs were then selected from these BACs by manual inspection, aided by annotated ab initio gene predictions and Brassica EST alignments. These CD regions were then compared with the complete set of annotated B. rapa genome sequence currently available (approximately 120 Mb of gene space). The results are given in Table 3. The CD sequences appear to have nucleotide compositions intermediate to those of genic and intergenic regions in terms of microsatellite (simple sequence repeat) density. Although the CDs show densities of apparent gene features typical of the averaged genome sequence, detailed analysis reveals that these features are often gene fragments with homology to Arabidopsis gene models from regions outside those brought together at the discontinuities. Furthermore, they are relatively transcriptionally inactive, with the proportion of models with EST support (obtained from alignments with transcript assemblies derived from all available EST sequences) being about half that of the average for predicted genes. The CD regions also appear to be relatively transcriptionally silent when analysed with the very sensitive but targeted method of alignment with Brassica A genome Solexa reads derived from the B. napus leaf transcriptome.
Table 3. Summary of features over identified collinearity discontinuities and all annotated B. rapa sequence
Computational methods were used to successfully identify BAC clones representing verified CDs between the genomes of B. rapa and A. thaliana, and relative to an ancestral karyotype. Along with CDs identified during the ongoing chromosome sequencing project, these represent a substantial (but incomplete) sampling of the CDs in the genome of B. rapa. Previous studies had defined a segmental structure for the paleohexaploid Brassica genome based largely on genetic linkage of markers with similarity to sequences in the A. thaliana genome [9,10]. Whereas the seminal study in this area  compared the arrangements of the B. napus genome with that of A. thaliana, and ours compared the arrangement of the B. rapa genome with that of A. thaliana, we anticipate that the results should be directly comparable as there seems to be little difference in the organization of the A genome in these two Brassica species . Remarkably few of our CDs correspond to those expected from this structure: 3 of the 18 representing inter-chromosomal rearrangements and 6 of the 32 representing intra-chromosomal rearrangements. The relatively high "noise" inherent to comparative genomics studies in Brassica species, which is a consequence of the widespread occurrence of apparently transduplicated fragments of genes , means that multiple instance of collinear alignments are required to correctly identify collinear genome segments. This requirement limits the ability to identify relatively small segments using, for example, comparative linkage mapping based on RFLP markers. Although the paleopolyploid ancestry of Brassica species is now widely accepted, the lack of discernable triplication throughout the genome has not been fully explained. The hypothesised segments that have not been identified had been assumed to have been deleted. However, we have found evidence for the existence of numerous additional copies of genome segments, bringing the count of many of these to (or closer to) the predicted three. In one case (At3C/AKM) we identified and mapped onto the B. rapa genome three copies where none had been identified previously in that species, with only one copy having been identified in the Brassica A genome as represented in B. juncea .
Our analyses of CDs enable us to hypothesise how parts of the genomes of A. thaliana and B. rapa have evolved from the AK. For B. rapa linkage group A8, which contains one copy of AK blocks T-U, our data enable the development of a model for the derivation of the chromosome from the AK, as shown in Figure 9. The detection of sequences from AKT at both ends of an AKU collinearity block indicates that there may have been a circular intermediate derived from linkage group AK7 which was integrated into AK1 to form the progenitor of B. rapa A8. Some rearrangements of the AK seem to have taken place in the Brassica lineage before genome triplication, but not in the A. thaliana lineage. For example, as shown in Figure 10, A. thaliana chromosome 5 contains the pairs of AK blocks AKQ-AKR and AKW-AKX (derived from linkage groups AK6 and AK8, respectively). Whereas these AK blocks appear to have been recombined in the ancestor of B. rapa. This recombination is represented in B. rapa three times , on linkage groups A2, A3 and A10 for AKW-AKR, and at least twice for the reciprocal outcome of the recombination, AKQ-AKX (on A2 and A6).
Figure 9. Hypothetical derivation of B. rapa linkage group A8. Ancestral karyotype segments are labelled (A, B, C, S, T, U) and oriented by arrows. Orthologues delineating the boundaries of recombination events are designated by the name of the A. thaliana gene model suffixed "*".
Figure 10. Hypothetical derivation of A. thaliana and B. rapa linkage groups from the ancestral karyotype. Ancestral karyotype segments are labelled (Q, R, W, X) and oriented by arrows. Orthologues delineating the boundaries of recombination events are designated by the name of the A. thaliana gene model suffixed "*".
Our results show that previous studies of segmental collinearity between A. thaliana, Brassica and AK genomes, although very useful, represent over-simplifications of the true inter-relationships of the genomes. In addition to the occurrence of individual genes in non-collinear regions of the genomes previously noted , the presence of numerous cryptic collinear genome segments and the frequent occurrence of segmental inversions mean that inference of the positions of genes based on the locations of orthologues in A. thaliana can be misleading. Indeed, excessive reliance on collinearity with the genome of A. thaliana may prove problematic for the ongoing efforts to sequence the B. rapa genome. Polyploidy is common in plants, and there is no reason to conclude that the greater complexity of segmental rearrangement and evolution that we have observed is unusual. Therefore, our results will be of relevance to studies in a wide range of polyploid plant genomes, many of which are being considered as having blocks of conserved synteny with respect to the genomes of model species, and studies relating to evolutionary breakpoints and their relation to genome organisation .
The 200,031 publicly available BAC end sequence reads (BES) from the combined KBrH, KBrB and KBrS Brassica rapa ssp. pekinensis cv. Chiifu libraries, which were provided by the Korea Brassica Genome Resource Bank, were used in a WU-BLASTN search  versus the TAIR v6 Arabidopsis pseudomolecule sequences, using 1E-30 as the E-value cutoff. Supplementary BLAST parameters used were application of the Dust simple sequence filter and setting hspsepsmax = 1000, appropriate for use against very large subject sequences. Coordinates and scores for individual HSPs from the significant hits were then parsed into GFF format and loaded as features into a local copy of the ATIDB genome database  which is built on the GBrowse platform using a MySQL adaptor . An identical exercise was performed with a set of 85,317 B. oleracea BES obtained from line TO 1434 and these data added incrementally.
A Perl CGI script was developed to interrogate the ATIDB database using the Bio::DB::GFF applications programming interface and methods. The five Arabidopsis reference chromosome sequences (pseudomolecules) were divided into bins of a selectable size (250 kb - 1 Mb) and the B. rapa (and B. oleracea) BES features mapping within each were extracted and loaded into a hashed array structure, keyed by chromosome and bin. Each bin was then systematically compared with every other bin, with the algorithm exploiting mirror symmetry for efficiency. Text string comparisons of feature object names (e.g. KBrH088K13_F and KBrH088K13_R) were used to identify mate-pairs amongst the BES mappings linking any given pair of bins. The raw mate-pair associations between bins identified from this initial process are inherently noisy and so the algorithm goes on to filter them on a combination of theoretical and empirical criteria. These can be summarised as follows: (1) any mate-pair mappings between neighbouring bins on the same chromosome were discounted if their physical separation in Arabidopsis pseudomolecule space was less than a set threshold of 500 kb, reflecting our estimate of the conserved microsynteny range; (2) duplicate instances of the mate-pair mappings, indicating either simple duplications of clones within libraries or multiple cloning events of the same DNA fragment during library construction, were eliminated; (3) DNA strand dispositions of mate-pair BES mappings (e.g. "Chr3:Bin20:plus" vs. "Chr5:Bin10:minus") were analysed to eliminate minority variants as it was reasoned that independent physical correlates of CDs should reflect a consistent pattern (unlike chimaeric clones generated by in vitro recombination events), and finally, in a development of the algorithm prompted by analysis of false positives; (4) the raw BES mappings in pseudomolecule space were used to locate the nearest annotated gene models and mate-pair mappings were eliminated if these conflicted at either end with the results of direct BLASTN query of the BES against annotated Arabidopsis genes - countering what we termed paralogue conflation;. We imposed an arbitrary threshold for the number of independent mate-pair mappings to annotated gene regions required to trigger a significant association. We varied this threshold between 2 and 5 in order to experiment with the signal to noise ratio in the dataset.
The final output of the script was directed through two routes, a graphical dot-plot style representation of the mate-pair associations using an interface to the GridMap Java applet  and also a spreadsheet format summary of the details underlying each significant association (clone identifiers, HSP coordinates and gene models, strand dispositions). The implementation is available from additional file 1 and additional file 2.
Additional file 1. Simple HTML page used to select parameters and options for launching accompanying Perl CGI script (additional file 2).
Format: HTML Size: 3KB Download file
Additional file 2. Perl CGI script used to identify putative collinearity discontinuities (CDs) from BAC end mapping data. Output is directed to both graphical (requires GridMap applet ) and spreadsheet streams. There are numerous code dependencies, mostly commented in the script. The prerequisite is a Bio::DB::GFF database handle to a MySQL (or other) database storing the end mappings defined with respect to the reference sequence. The first author (MT) can assist with technical questions on implementation.
Format: CGI Size: 27KB Download file
Sequence annotation and analysis
Our automated annotation pipeline  was used to analyse the sequences at the CDs. All annotated B. rapa BAC sequences were stored in a GBrowse MySQL database . Minimal regions for each CD were manually selected by identifying the sequence flanked by the Arabidopsis gene models listed in Table 2, supplemented (where informative) either by annotated Brassica gene predictions from SNAP  post-processed with PASA  using raw EST data or by BLAT alignments  of Brassica transcript assemblies . The CD sequences were analysed for the presence of characterised telomeric or centromeric repeats and then scanned for various annotated features with a Perl script using Bio::DB::GFF methods. This was repeated for extracted subsets of the entire annotated sequence defined as genic or intergenic by the gene predictions. Solexa leaf transcriptome reads obtained from B. napus  were aligned with MAQ  onto the B. rapa BAC sequences.
The authors declare that they have no competing interests.
IB conceived of the study, participated in its design and coordination, and helped to draft the manuscript. MT conceived and implemented the CD identification tool and helped to draft the manuscript. ND helped to implement the BAC annotation pipeline. SJK conceived BAC selection, sequencing and identification for CDs of A09 and A03. FF, ES, TJY, JHM and BSP identified and validated the BACs for sequencing. SRC, ZW, SYL and YPL anchored sequenced BACs to the B. rapa linkage map. CDT, AHP and JCP contributed to the conception of aspects of the study and contributed for analysis the end sequences of B. oleracea BAC clones prior to public release. All authors read and approved the final manuscript.
We would like to thank the Beijing Genomics Institute for BAC sequencing. This work was funded by the UK Biotechnology and Biological Sciences Research Council (BB/E017363) and Rural Development Administration (BioGreen 21 Program 20050301034438, NAAS project No. 2007139062200001502 and 200901FHT020710397), and the Technology Development Program for Agriculture and Forestry, Ministry for Food, Agriculture, Forestry and Fisheries (Project No. 607003-05), Korea. CDT, AHP and JCP were supported by the U.S. National Science Foundation (DBI-0638536).
Theor Appl Genet 1991, 82:81-92. Publisher Full Text
Plant Mol Biol Report 1991, 9:208-218. Publisher Full Text
O'Neill CM, Bancroft I: Comparative physical mapping of segments of the genome of Brassica oleracea var alboglabra that are homoeologous to sequenced regions of the chromosomes 4 and 5 of Arabidopsis thaliana.
Park JY, Koo DH, Hong CP, Lee SJ, Jeon JW, Lee SH, Yun PY, Park BS, Kim HR, Bang JW, Plaha P, Bancroft I, Lim YP: Physical mapping and microsynteny of Brassica rapa ssp. pekinensis genome corresponding to a 222 kb gene-rich region of Arabidopsis chromosome 4 and partially duplicated on chromosome 5.
Mol Gen Genomics 2005, 274:579-588. Publisher Full Text
Town CD, Cheung F, Maiti R, Crabtree J, Haas BJ, Wortman JR, Hine EE, Althoff R, Arbogast TS, Tallon LJ, Vigouroux M, Trick M, Bancroft I: Comparative genomics of Brassica oleracea and Arabidopsis thaliana reveals gene loss, fragmentation and dispersal following polyploidy.
Yang TJ, Kim JS, Kwon SJ, Lim KB, Choi BS, Kim JA, Jin M, Park JY, Lim MH, Kim HI, Lee MC, Lim YP, Kang JJ, Hong JH, Kim CB, Bhak J, Bancroft I, Park BS: Sequence-level analysis of the diploidization process in the triplicated FLC region of Brassica rapa.
Trends in Plant Sci 2006, 11:535-542. Publisher Full Text
Kim H, Choi SR, Bae J, Hong CP, Lee SY, Hossain MJ, Van Nguyen D, Jin M, Park B-S, Bang J-W, Bancroft I, Lim Y-P: Sequenced BAC anchored reference genetic map that reconciles the ten individual chromosomes of Brassica rapa.
Panjabi P, Jagannath A, Bisht NC, Padmaja KL, Sharma S, Gupta V, Pradhan AK, Pental D: Comparative mapping of Brassica juncea and Arabidopsis thaliana using Intron Polymorphism (IP) markers: homoeologous relationships, diversification and evolution of the A, B and C Brassica genomes.
Theor Appl Genet 1995, 90:157-165. Publisher Full Text
Lim KB, Yang TJ, Hwang YJ, Kim JS, Park JY, Kwon SJ, Kim JA, Choi BS, Lim MH, Jin M, Kim HI, de Jong H, Bancroft I, Lim Y, Park BS: Characterization of the centromere and peri-centromere retrotransposons in Brassica rapa and their distribution in related Brassica species.
Cheung F, Trick M, Drou N, Lim Y-P, Park J-Y, Kwon S-J, Kim J-A, Scott R, Pires JC, Paterson AH, Town C, Bancroft I: Comparative analysis between homoeologous genome segments of Brassica napus and its progenitor species reveals extensive sequence-level divergence.
2002. PubMed Abstract
Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK Jr, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD, Salzberg SL, White O: Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies.
Trick M, Cheung F, Drou N, Fraser F, Lobenhofer EK, Hurban P, Magusin A, Town CD, Bancroft I: A newly-developed community microarray resource for transcriptome profiling in Brassica species enables the confirmation of Brassica-specific expressed sequences.
Plant Biotech J 2009, 7:334-346. Publisher Full Text