Center for Bioinformatics, University of Hamburg, Bundesstraße 43, 20146 Hamburg, Germany

Cold Spring Harbor Lab, 1 Bungtown Rd, Williams #5, Cold Spring Harbor, NY 11724, USA

Sackler Institute for Comparative Genomics, American Museum of Natural History, New York, NY, USA

Abstract

Background

The challenges of accurate gene prediction and enumeration are further aggravated in large genomes that contain highly repetitive transposable elements (TEs). Yet TEs play a substantial role in genome evolution and are themselves an important subject of study. Repeat annotation, based on counting occurrences of

Results

Here we introduce the Tallymer software, a flexible and memory-efficient collection of programs for ^{9 }bp.). We analyzed _{0}

Conclusion

The Tallymer software was effective in a variety of applications to aid genome annotation in maize, despite limitations imposed by the relatively low coverage of sequence available. For more information on the software, see

Background

Repetitive elements abound in the genomes of higher organisms. Tandem repeats, simple sequence repeats, long terminal repeats (LTRs), segmental duplications, and transposable elements (TEs) are among those types commonly found in eukaryotic species. The biological role of these entities in genome evolution has been documented

For anticipated plant genome projects and those currently underway, effective and rapid annotation of their many repeats has acquired a new urgency. For example, the maize genome is estimated to be 60–70% repetitive, mainly in the form of retrotransposons that proliferated in the last 2 to 6 million years

Repeat identification strategies fall under two broad categories: de novo detection and similarity-based detection. RepeatMasker

De novo methods address many of these concerns. RECON

Other more recent programs like ReAS

Like other

Methods

Sequence data sets

This study used publicly available datasets, including expressed sequence tags (ESTs), gene-enriched genome fractions, representative whole genome sequences, and repeat libraries, as summarized in Table

The different sequence sets used in the validation experiment.

| class | fraction | abbrev. | size (Mbp) | source |
|---|---|---|---|---|
| WGS | Pilot Bacterial Artificial Chromosomes | BAC | 14.8 | |
| WGS | BAC End Sequences | BES | 8.0 | AGI |
| Repeat | Transposable Elements | RepI | 8.0 | MIPS |
| Repeat | DNA transposons | RepII | 0.1 | MIPS |
| GE | Expressed Sequence Clusters | EST | 56.3 | PlantGDB |
| GE | AZM4 High-C_{0}t | AZM4 HC | 188.8 | TIGR |
| GE | AZM4 Methyl-filtered | AZM4 MF | 156.8 | TIGR |
| GA | TIGR4 assembly of rice ( | Osj:TIGR4 | 420.0 | TIGR |
| GA | TAIR7 assembly of | At:TAIR7 | 115.4 | TAIR |
| GA | JGI1.1 assembly of poplar ( | Pt:JGI1.1 | 410.0 | |
| GA | Genoscope1 assembly of grapevine ( | Vc:GEN1 | 487.0 | |

WGS stands for survey of 'Whole Genome Sequences'. GE stands for 'gene enrichment'. GA stands for 'genome assemblies'. The sequence sizes are given in million base pairs.

Maize WGS data set

Whole-genome shotgun reads for Maize (

Gene annotation

^{-5}. Transposable elements were screened out by matching their hits against a manually curated list of 2852 transposable element genes.

Annotation of transposable elements

Transposable elements were identified using RepeatMasker

Receiver operating characteristic (ROC) curves were compared by the method of ^{® }statistical software package.

Basic notions for sequence processing

We consider sequences over the DNA alphabet {A, C, G, T, N}, where N denotes an undetermined base (usually represented by one of the IUPAC characters _{k}(_{k}(_{k}(_{v∈S }mer_{k}(_{v∈S }_{k}(

Occurrence ratios

For integers

That is, _{S, k}(_{S, k}(1, 1) is the

Note that the denominator in this fraction equals the number of all positions in _{S, k}(
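As a minimal illustration of an occurrence ratio, the following Python sketch counts k-mers naively and reports the share of k-mer positions whose k-mer occurs between q and q' times. The position-weighted numerator is our reading of the definition, guided by the note that the denominator equals the number of positions at which a k-mer occurs; the function names, the toy reads, and the rule that windows containing N are skipped are assumptions made for this example. The hash-based counting is for illustration only and is not the suffix-array method described later.

```python
from collections import Counter

def kmer_counts(sequences, k):
    """Count every k-mer over A, C, G, T; windows containing N (or any other
    character) are skipped. This skipping rule is an assumption of the sketch."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts

def occurrence_ratio(counts, q, q_prime):
    """Share of k-mer positions whose k-mer occurs between q and q_prime times.
    The denominator is the number of all positions at which a k-mer occurs."""
    total_positions = sum(counts.values())
    if total_positions == 0:
        return 0.0
    in_range = sum(c for c in counts.values() if q <= c <= q_prime)
    return in_range / total_positions

# Toy input (hypothetical reads, not data from this study).
reads = ["ACGTACGT", "ACGNACGA"]
counts = kmer_counts(reads, 3)
uniqueness = occurrence_ratio(counts, 1, 1)   # the (1, 1)-occurrence ratio
print(f"3-mer uniqueness ratio: {uniqueness:.3f}")
```

Under this reading, the ratios of disjoint frequency classes that together cover all counts sum to one, which is what makes the classes directly comparable across genomes.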

Average k-mer frequencies

The average frequency of

where

That is, _{10}(

Distribution ratios

Now consider a set

and

be the minimum and maximum possible _{min}, _{max}] into equal sized non-overlapping subintervals at some suitable distance Δ. We obtain the intervals

such that _{min }= -_{max }<

where _{k, M }is the fraction of sequences in _{k, M }is called

Efficient computation of occurrence ratios

Note that the occurrence ratios only depend on the _{S, k }be a table such that for all _{S, k}[_{S, k}

According to these equations we can efficiently compute _{S, k}(_{S, k}. To compute this, one needs to enumerate each _{S, k}[

Traditionally, occurrence counts for ^{k }of possible _{S, k}[

To explain our approach we begin with the concept of enhanced suffix arrays, as introduced in

Enhanced suffix arrays

Suppose _{1},...,$_{r-1 }between the

Suppose that _{h }=

The key to our method is to lexicographically sort the suffixes of _{1 }< ... <$_{r}. This character order induces an order on all nonempty suffixes of

The lcp-table lcp is an array of integers in the range 0 to _{r }is the largest suffix in the lexicographic order, _{r}. Hence we always have lcp[
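To make the tables suf and lcp concrete, here is a small Python sketch for a single sequence. It rests on several assumptions that may differ from the conventions used in this paper: the sentinel is written as "~" so that it sorts after all bases, lcp[i] is taken as the length of the longest common prefix of the suffixes at suf[i-1] and suf[i] with lcp[0] = 0, and the construction is a naive sort rather than a linear-time algorithm.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text in lexicographic order.
    Naive construction by sorting the suffixes; for illustration only."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def build_lcp_table(text, suf):
    """lcp[i] is the length of the longest common prefix of the suffixes
    starting at suf[i-1] and suf[i]; lcp[0] is set to 0 (assumed convention)."""
    def common_prefix(a, b):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        return n
    return [0] + [common_prefix(suf[i - 1], suf[i]) for i in range(1, len(suf))]

# "~" plays the role of the sentinel and compares larger than A, C, G, T.
text = "ACACA" + "~"
suf = build_suffix_array(text)
lcp = build_lcp_table(text, suf)
for rank, (start, l) in enumerate(zip(suf, lcp)):
    print(rank, start, l, text[start:])
```

Neighbouring suffixes that share a long prefix end up adjacent in suf, and the lcp values record how long that shared prefix is. Counting k-mers exploits exactly this property.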

The notion of lcp-intervals, introduced in

1. lcp[

2. lcp[

3.

4.

We will also use the shorthand ℓ-interval for an lcp-interval [

An interval [

Enumerating k-mers and their occurrence counts

The parent-child relationship of the intervals constitutes a conceptual (or virtual) tree which we call the lcp-interval tree of the suffix array. The leaves of the tree are the singleton intervals and the internal nodes of the tree are the lcp-intervals. In particular, the root of this tree is the 0-interval [0..

(1) The nodes of the lcp-interval tree are enumerated in bottom-up order, i.e. a node, say

(2) The children with the same parent node are enumerated according to the lexicographic order of the strings they represent.

(3) Whenever we process the children of a node, we have access to the lcp-value of the parent node.

(4) The values in tables suf and lcp are accessed in sequential order from left to right.

Due to property (2), the _{S, k }for a range of values _{min }and _{max}.

Suppose that all values incremented below are initialized to zero.

• We process a singleton interval [

1.

2. suf[

3.

As a consequence, for all _{min}} ≤ _{max}}, we increment _{S, k}

• We process an ℓ-interval [_{min}} ≤ _{max}}, we increment _{S, k}[
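The following Python sketch shows, for one fixed value of k, how occurrence counts fall out of a single left-to-right pass over the sorted suffixes. It is a deliberately simplified stand-in, not the bottom-up traversal of the lcp-interval tree described above: a run of consecutive suffixes whose common prefix has length at least k corresponds to one k-mer, and the run length is its occurrence count. The separator "~", the naive suffix sorting, and the function name are assumptions of this example.

```python
def enumerate_kmers(sequences, k, separator="~"):
    """Return (k-mer, count) pairs in lexicographic order of the k-mers.
    Sequences are assumed to be uppercase strings over A, C, G, T, N."""
    text = separator.join(sequences) + separator
    # Naive suffix sorting; a real implementation would read suf and lcp
    # sequentially from a precomputed enhanced suffix array.
    suf = sorted(range(len(text)), key=lambda i: text[i:])

    def common_prefix(a, b):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        return n

    pairs = []
    previous = None
    for start in suf:
        kmer = text[start:start + k]
        # Skip prefixes that are too short or contain N or a separator.
        valid = len(kmer) == k and all(base in "ACGT" for base in kmer)
        if valid:
            if previous is not None and common_prefix(previous, start) >= k:
                pairs[-1][1] += 1          # same k-mer as the previous suffix
            else:
                pairs.append([kmer, 1])    # first suffix of a new run
        previous = start
    return [tuple(p) for p in pairs]

pairs = enumerate_kmers(["ACGT", "ACGA"], 3)
print(pairs)
```

Because the suffixes are visited in sorted order, the pairs are emitted in lexicographic order without any additional sorting step.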

Analysis of time and space requirement

The suffix array can be computed in linear time and space (cf.

The algorithm to enumerate the lcp-intervals and singleton intervals, given the enhanced suffix array, runs in linear time, see _{2}

Due to property (4), the enhanced suffix array does not need to be represented in main memory. At any time, we only need to store two consecutive entries of table suf and lcp. Hence the space requirement is dominated by the stack needed for the bottom-up traversal of the (virtual) lcp-interval tree. Our specific application only requires to store nodes representing strings of length shorter than _{max }occurring more than once as substrings in _{max}, which results in a space requirement proportional to _{max}.

Besides random access to the sequence, we also need random access to a data structure for accumulating the occurrence counts. Let _{S, k}[_{min}, _{max}]. Then this data structure (e.g. hash table) requires space proportional to

The overall space requirement of the algorithm is proportional to _{max }+ _{2}_{max}, the space requirement is linearly dependent on _{max}. Since _{max }and ^{k }time and space proportional to 4^{k }for some fixed value of

Dividing and merging the datasets

The analysis above shows that the running time and space requirement of our algorithm for computing _{2 }^{32 }- 1 bytes) of main memory, _{2 }^{32 }- 1. That is, the sequence length is limited to 1 gigabyte. Since we want to process considerably larger sequences, we developed a divide-and-conquer approach. This cuts the sequence

Processing each section by the algorithm described above results in occurrence counts for each section. More precisely, for each section we enumerate pairs (
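The merging step can be sketched as follows, assuming that each section yields (k-mer, count) pairs in lexicographic order of the k-mers. The function name and the toy section lists are hypothetical; how k-mers that span a section boundary are handled is not covered by this sketch.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_section_counts(*sorted_sections):
    """Merge per-section streams of (k-mer, count) pairs, each sorted by
    k-mer, into one sorted stream in which counts of equal k-mers are added."""
    merged = heapq.merge(*sorted_sections, key=itemgetter(0))
    for kmer, group in groupby(merged, key=itemgetter(0)):
        yield kmer, sum(count for _, count in group)

# Toy per-section results (hypothetical values).
section_a = [("ACG", 2), ("CGT", 1)]
section_b = [("ACG", 1), ("GTA", 4)]
merged = list(merge_section_counts(section_a, section_b))
print(merged)
```

Since `heapq.merge` consumes its inputs lazily, only one pending pair per section has to be held in memory at a time, which fits the goal of processing sequence sets larger than main memory.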

Efficient computation of distribution ratios

In contrast to the occurrence ratios, the average frequency _{min}, _{max}) storing all _{min }and _{max }times in _{min }and _{max }are user defined positive numbers. Constraining the indexed _{min }and _{max }is relevant in applications where we are interested in _{max}) or frequently (large value of _{min}). Given the index _{min}, _{max}), we want to solve the following tasks:

(1) For each possible sequence

(2) For each

In the previous section we have shown how to compute occurrence ratios by enumerating _{S, k}, the same enumeration process can determine the _{min }≤ _{max}. If this is satisfied, the _{min}, _{max}) is simply a sequence of lexico-graphically ordered

To implement _{min}, _{max}), we directly store each _{2 }4 = 2
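A minimal sketch of such an index is given below, assuming two bits per base with the code A=0, C=1, G=2, T=3, so that for a fixed k the integer order of the encoded k-mers equals their lexicographic order. The class name, the stored counts, and the lookup by binary search are assumptions made for illustration and are not taken from the Tallymer implementation.

```python
from bisect import bisect_left

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # log2(4) = 2 bits per base

def encode(kmer):
    """Pack a k-mer over A, C, G, T into an integer of 2k bits."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value

class KmerIndex:
    """Sorted array of encoded k-mers occurring between q_min and q_max times,
    with their counts. Positions of the occurrences are not stored."""

    def __init__(self, counted_kmers, q_min, q_max):
        kept = sorted((encode(kmer), count)
                      for kmer, count in counted_kmers
                      if q_min <= count <= q_max)
        self.codes = [code for code, _ in kept]
        self.counts = [count for _, count in kept]

    def lookup(self, kmer):
        """Stored count of kmer, or 0 if it is not in the index."""
        code = encode(kmer)
        i = bisect_left(self.codes, code)
        if i < len(self.codes) and self.codes[i] == code:
            return self.counts[i]
        return 0

# Toy input (hypothetical counts); "CGA" is filtered out because 1 < q_min.
index = KmerIndex([("ACG", 2), ("CGA", 1), ("CGT", 7)], q_min=2, q_max=10)
print(index.lookup("CGT"), index.lookup("CGA"))
```

Note that a lookup result of 0 is ambiguous in this sketch: the k-mer may be absent from the sequence set, or its count may lie outside the indexed range.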

**K-mer uniqueness ratio for the 0.45 × maize WGS data set for varying values of k**

Let _{2 }_{2 }_{2 }

To put it together, our index differs in several aspects from other indexing approaches employed in sequence analysis (e.g. suffix trees

• First, we do not store information about where the _{min }and _{max}. For large sequence sets and larger values of _{min }and _{max }should be chosen carefully. If these values are not restrictive enough, then there may be too many

• Second, we directly store each

The Tallymer software

The program

Results and discussion

Selection of k-mer size for use in maize

Because our method allows us to compute

Validation of the 20-mer frequency index for the WGS set

The use of _{k, M }for

The public maize sequence sets fall into three classes: (A) maize whole genome sequences, (B) maize repeats, and (C) maize gene enrichment sequences. The seven sequence sets are known to have differing degrees of repetitiveness and should therefore provide a means to verify our method. For example, we expect gene-enriched sequences to be less repetitive than RepII sequences (DNA transposons), and RepII sequences to be less repetitive than RepI sequences (TEs).

The results of this analysis are shown in Figure ^{-57})). Though the λ-distribution for RepII also appears to be bimodal, we were unable to find significantly different repeat populations among the DNA transposon derivatives and superfamilies.

**λ-distribution ratios for different classes of maize sequences.** The X-axis shows the _{min }and _{min}. The Y-axis shows the values for Ω_{k, M}, where

Two repeat populations in the RepI sequence set. Repeat elements of the RepI set are found in different relative proportions depending on the repeat level tested (corresponding to peaks 1 and 2 in Figure 2B).

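| class | peak 1 total | peak 1 percentage | peak 2 total | peak 2 percentage |
|---|---|---|---|---|
| unclassified | 66 | 11.0 | 1 | 0.3 |
| LINE | 8 | 1.3 | 0 | 0 |
| Ty1/copia | 142 | 23.6 | 292 | 75.5 |
| Ty3/gypsy | 386 | 64.1 | 94 | 24.3 |
| ∑ | 602 | 100.0 | 387 | 100.0 |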

The three sequence sets delivered by gene enrichment methods shown in Figure _{0}_{0}

Genome annotation using k-mer frequencies

We used a previously published set of 100 maize BAC sequences that had been chosen at random to be representative of the whole genome

**Comparison of masking using either k-mer frequencies or alignment-based repeat masking.** (A) Percent of nucleotides masked in 100 BAC sequences (total length 14.3 Mb) as a function of absolute frequency threshold (logarithmic scale). Values are given for the sum of all sequences, and for the most and least repetitive BACs within the set. (B) Overlap between regions masked using the

We compared these results to masking based on the curated MIPS REcat repeat library. This library includes repeats from many sources, including annotated TEs from the BACs used in this analysis. Thus application of this library can be regarded as a 'gold standard' for detection of TEs within these BACs. Indeed, masking using the MIPS REcat library resulted in a total repeat coverage of 80.8% (range 36.8–100.0% on individual clones). This exceeds the masking rate of 67% originally reported for this set

In contrast, 81% of the MIPS REcat-masked positions are also masked by our method. This was achieved at the absolute frequency threshold of 0.3 (Figure

We next compared the two repeat detection methods for their ability to discriminate TEs at the level of predicted genes. The set of 100 BACs was annotated using FGENESH, and the resulting predictions were classified as presumptive genes or as TEs using a similarity-based search (see Section "Methods"). Of the 2504 predicted genes, 359 (14%) were screened out as showing no evidence of homology to NCBI GenPept peptides. Of the remaining 2145, 1842 (86%) were classified as TE while the other 303 (14%) were classified as presumptive genes. For these latter two classes the percent of coding sequence masked was calculated based on either RepeatMasker data or on constituent 20-mer frequencies at various thresholds. For each method of masking, receiver operating characteristic (ROC) curves were used to define a threshold of masking that best discriminated TEs from presumptive genes. Area under the curve (AUC), sensitivity, and specificity were used to compare efficacies
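The sensitivity and specificity at a given masking criterion can be computed as in the following sketch, in which a gene model is called a TE when its percent of masked coding sequence exceeds the cutoff. The function name and the six toy gene models are hypothetical and do not reproduce the data or the statistical package used in this study.

```python
def sensitivity_specificity(percent_masked, is_te, cutoff):
    """Classify a model as TE when its percent masked exceeds cutoff and
    return (sensitivity, specificity) with respect to the is_te labels."""
    tp = fn = tn = fp = 0
    for pct, te in zip(percent_masked, is_te):
        predicted_te = pct > cutoff
        if te and predicted_te:
            tp += 1
        elif te:
            fn += 1
        elif predicted_te:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity

# Toy gene models (invented values for illustration only).
percent_masked = [90.0, 55.0, 10.0, 5.0, 60.0, 30.0]
is_te = [True, True, True, False, False, False]
sens, spec = sensitivity_specificity(percent_masked, is_te, cutoff=41.56)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Evaluating this function over a range of cutoffs gives the points of a ROC curve, from which the criterion with the best trade-off can be chosen.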

Because percent masking using

**ROC plots showing sensitivity and specificity of TE detection among 2145 FGENESH models (1842 TE and 303 presumed genes) based on the percent of coding sequence masked using two methods.** In one method BAC sequences were masked using an absolute frequency threshold of 0.8. In the other, masking was performed using RepeatMasker with the MIPS REcat library. ROC plot comparison of the maximum area under the curve resulting from the two plots showed that they are not significantly different (see main text for details).

Table

Discrimination of maize TE-encoded genes based on percent of coding sequence masked using either RepeatMasker (MIPS REcat library) or constituent 20-mer frequencies (WGS index with a threshold log repeat level of 0.8).

| method | criterion | sensitivity | 95% CI | specificity | 95% CI |
|---|---|---|---|---|---|
| REcat | > 41.56 | 96.69 | 95.8–97.5 | 92.41 | 88.8–95.1 |
| | > 17.00 | 92.62 | 91.3–93.8 | 92.08 | 88.4–94.9 |

While both RepeatMasker and our method masked the majority of RepI retroelements, some low-copy TEs escaped masking under our method based on average frequencies. As shown in our analysis of the 100 pilot BACs, low-copy DNA transposons may be annotated as such by curated repeat databases but missed by the counting approach used here. In the context of directed sequence finishing, low-copy repeats are often as much in need of characterization as protein-coding genes, so leaving them unmasked in maize is actually in the best interests of the project. But the average frequency threshold must be chosen carefully: more permissive thresholds will lead to finishing TE-like elements, and stricter thresholds may mask high-copy gene families. To use this method optimally, a balance must be struck with respect to the genome in question, and the thresholds need to be adjusted according to the annotation requirements.

The validated WGS index can be used to annotate any portion of the genome with respect to its component

**Visualization of k-mer frequencies in a 453 kbp assembly of four BAC sequences derived from maize chromosome 8.** A 100 kbp segment (range 70,001–170,000 nt) is shown. In the first two tracks transposable elements are shown in red while genes are shown in blue (exon/intron structure not shown). The third track, global

If the scope of the experiment is narrowed, however, and the

Comparative genomics

Beyond employing _{S, k}(1, 10), 100·_{S, k}(11, 100), 100·_{S, k}(101, 1000), 100·_{S, k}(1001, 10000), 100·_{S, k}(10001, ∞) (Figure _{S, k}(

**Occurrence ratios in comparative genomics.** Maize, sorghum and rice whole genome shotgun reads were randomly selected to generate 0.45 × coverage with respect to each genome's size. The total number of 20-mers in each logarithmic frequency class (A) is contrasted with the number of different 20-mers in each frequency class (B). Maize is the most repetitive of the three grasses analyzed here, but a corresponding increase in genome complexity is not observed.

For example, in the case of maize, there are 1,041,350,089 positions at which a 20-mer occurs. There are 456,445,768 different 20-mers, of which 378,556,535 are found only once, while the most highly represented sequence occurs 47,933 times.
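The maize figures quoted here can be turned into ratios with a few lines of arithmetic. The sketch below uses only the three numbers given in the text; the variable names are ours, and taking the number of positions as the denominator of the uniqueness ratio follows the note in the section "Occurrence ratios".

```python
# Figures for the 0.45 x maize WGS set, as quoted in the text.
total_positions = 1_041_350_089   # positions at which a 20-mer occurs
distinct_20mers = 456_445_768     # different 20-mers
unique_20mers = 378_556_535       # 20-mers found only once

# A unique 20-mer occupies exactly one position, so the same numerator
# serves both ratios.
share_of_positions = unique_20mers / total_positions
share_of_distinct = unique_20mers / distinct_20mers
mean_occurrences = total_positions / distinct_20mers

print(f"unique 20-mers / all positions:      {share_of_positions:.1%}")
print(f"unique 20-mers / different 20-mers:  {share_of_distinct:.1%}")
print(f"mean occurrences per different 20-mer: {mean_occurrences:.2f}")
```

The gap between the two shares illustrates the contrast between counting positions and counting different 20-mers: most different 20-mers are unique, yet they account for well under half of all positions.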

The multiple occurrence ratios represented in Figure

Read lengths in whole genome shotgun sequencing projects limit this sort of analysis. Since Sanger reads average around 700 base pairs in length, most repetitive elements will be truncated at the 5' and 3' ends, making experiments with

**The k-mer uniqueness ratio for some assembled plant genomes as a function of k.** The uniqueness ratio is the ratio of

We performed a number of experiments demonstrating that

**K-mer frequencies across orthologous regions of three maize cultivars.** The B73-based WGS index was used to annotate the Bronze-1 locus and surrounding regions in cultivars B73, McC and Mo17 (Genbank accession numbers

Conclusion

We have described a method based on

In designing the

To apply our methods to large sequence sets, we have developed fast and memory efficient algorithms to compute occurrence ratios, to index

Authors' contributions

SK developed the algorithms and implemented the Tallymer software. AN analyzed sequence data sets with respect to Tallymer output. JS performed annotation of maize BACs, their visualization, and ROC analysis. DW conceived this study. All authors contributed to writing the article.

Acknowledgements

The authors wish to thank Dick McCombie, Melissa Kramer, and Lidia Nascimento of the DNA Sequencing Facility at Cold Spring Harbor Laboratory (CSHL) for sequencing, assembly, and GenBank submission of the four newly sequenced maize BACs described here. We thank the Maize Genome Sequencing Consortium for useful discussion, preliminary analysis and beta-testing of the software. We thank Shiran Pasternak (CSHL) and Ute Willhoeft (Center for Bioinformatics) for helpful discussions and critical reading of previous versions of the manuscript. We also thank Peter VanBuren (CSHL) for system management. The authors were supported by the following grants: JS, NSF DBI-0738000; AN, NSF DBI-0527192; DW, USDA ARS. We thank the anonymous reviewers for suggestions that improved previous versions of the manuscript.