Keith Bradnam on Assemblathon 2: putting genome assembly tools to the test

Posted by Biome on 8th August 2013


Next generation sequencing has led to the production of vast quantities of raw sequencing data at relatively low cost, high speed and with a reasonable level of accuracy. However, piecing together these stretches of sequence into complete, high quality genome assemblies is still proving to be a challenge. Numerous genome assembly tools have been developed to meet this challenge, with varying degrees of success. Recent Assemblathon competitions aim to put these tools to the test. Assemblathon 2, the results of which were recently published in GigaScience by Keith Bradnam from the Genome Center at the University of California, Davis, USA, and colleagues, assessed genome assemblies compiled from sequence data of three vertebrate species. Here Bradnam explains more about the Assemblathon, the results and their impact.

 

How did the Assemblathon come about?

The Assemblathon developed as an offshoot of the Genome 10K project, which is aiming to sequence the genomes of 10,000 vertebrates. There is almost no point in attempting to sequence so many genomes if we are not reasonably sure that we can accurately assemble those genomes from the initial sequencing read data. As there are so many different tools out there to perform de novo assembly, it seemed to make sense to try to benchmark some of them.

 

Why is it important to assess genome assembly?

There are many areas of genomics where a researcher can find a plethora of bioinformatics tools that all try to solve the same problem. However, different software tools often produce very different answers, and even the same tool can generate very different answers if you explore all of its parameters and configuration options. It is not always obvious how the results from different tools stack up against each other, or which tools we should trust the most (if any).

If you wanted to know who made the best chili in your local area, then you could organize a ‘chili cook off’. As well as deciding an overall winner, you could also award prizes for specific categories of chili (best vegetarian chili, best low-fat recipe etc.). What we can do for chili we can also do for genome assemblers.

Contests like the Assemblathon can help reveal the differences in how different genome assemblers perform, and can also pinpoint the specific areas where one program might outperform another. However, just as tasting chili can be a very subjective experience, there can be similar issues when evaluating genome assemblers. One of the objectives for the Assemblathon was to try to get a handle on what ‘best’ means in the context of genome assembly.

 

What was the rationale for using a different approach from Assemblathon 1, where you used synthetic data?

It helps to put a jigsaw together when you have the picture on the box to help you! Most of the time when people perform genome assembly, they don’t have that picture. So it can be hard to know whether you even have all of the genome present (let alone whether it is accurately put together). In Assemblathon 1, we wanted to know what the answer was going to be before people started assembling data, so we used an artificial genome that was created in a way that tried to preserve some properties of a real genome.

However, for Assemblathon 2 there was a lot of interest in working with real world data. The genome assembly community wants to be solving problems that can actually help others with their research.

 

Why did you pick the three species that you did (budgie, boa constrictor and cichlid fish)?

A major factor was simply the availability of suitable sequencing data. But it also seemed a good idea to use some fairly diverse species, the genomes of which might pose different types of challenges for genome assemblers (different repeat content, heterozygosity etc.).

 

What were the main findings of Assemblathon 2 and were you surprised by the results?

To paraphrase Abraham Lincoln: you can get an assembler to perform well across all metrics in some species, across some metrics in all species, but you can’t get an assembler to perform well across all metrics in all species.

Personally speaking, I was expecting to see variation in performance between assemblers but I thought that some of the bigger ‘brand name’ assemblers might have shown more consistency when assessed across species.

 

The results from each of the teams on each of the genomes were very mixed, but would you say there were some entries that were better overall?

We were very diplomatic in how we described the outcome of Assemblathon 1, where some assemblers performed consistently better than others. But in Assemblathon 2 there was so much variation that it seemed hard to say that any single team should be declared a winner. At best we can say that, for a given species and a specific genome assembly metric, there were winners: some assemblers will give you longer contigs for the fish genome, others might capture more of the genes in the snake genome, and others will give better coverage of the bird genome. These may all be different assemblers.
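To make this concrete, here is a minimal Python sketch of how two of the simpler assembly statistics (total assembly size and the widely used N50 contig length) can be computed from a set of contigs. The FASTA file name and the bare-bones parsing are illustrative assumptions only; the actual Assemblathon 2 evaluation drew on a much larger battery of metrics, including gene capture and coverage measures.

```python
# Minimal sketch: total assembly size and N50 from a FASTA file of contigs.
# The file name and parsing are illustrative assumptions, not part of the
# Assemblathon 2 evaluation pipeline.

def contig_lengths(fasta_path):
    """Yield the length of each contig in a FASTA file."""
    length = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                if length:
                    yield length
                length = 0
            else:
                length += len(line.strip())
    if length:
        yield length

def n50(lengths):
    """Smallest contig length L such that contigs of length >= L
    account for at least half of the total assembly size."""
    lengths = sorted(lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for contig_len in lengths:
        running += contig_len
        if running >= half:
            return contig_len
    return 0

if __name__ == "__main__":
    # Hypothetical input file of assembled contigs.
    lengths = list(contig_lengths("bird_assembly_contigs.fa"))
    print(f"contigs:    {len(lengths)}")
    print(f"total size: {sum(lengths)} bp")
    print(f"N50:        {n50(lengths)} bp")
```

The point of the sketch is that each such statistic captures only one aspect of assembly quality, which is why an assembler can rank first on one metric or species and poorly on another.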

 

What is the appeal of making your work freely available in arXiv, and the open peer-review that GigaScience carries out?

Personally, I am a strong advocate that results from taxpayer-funded research should become publicly available as soon as possible. I think the volume of blog posts that have now discussed the Assemblathon 2 project demonstrates that the resulting conversation about genome assembly has been a useful one. I really hope that more journals adopt open peer review and encourage the use of pre-print servers.

 

Do you think there will be an Assemblathon 3? If so, what would you do differently next time?

I’m in the process of writing a blog post on this very topic (to be posted here). There are many things that could, and perhaps should, be done differently. For starters, if the community embraces the FASTG format, it would make sense to use this as the format of choice for Assemblathon 3. Perhaps more importantly, we should find a different lead author!

 

For more on Assemblathon 2 and the unusual ‘meta’ peer review process that accompanied publication of the results in GigaScience, please see Biome’s interview with the Editor-in-Chief of GigaScience, Laurie Goodman, here.

 

More about the author(s)

Keith Bradnam, project scientist, Genome Center, University of California, Davis, USA.

Keith Bradnam began his scientific career in ecology before completing a Masters in bioinformatics and a PhD in eukaryotic genome evolution. After a brief stint working on an Arabidopsis genome database, he moved on to work on the model organism database WormBase at the Wellcome Trust Sanger Institute, UK. Bradnam later joined the lab of Ian Korf at the University of California, Davis, USA, where he currently investigates intron-based signals in gene expression, helps run the Assemblathon and teaches bioinformatics on the Unix & Perl for biologists course.

Research

Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species

Bradnam KR, Fass JN, Alexandrov A, Baranay P, Bechner M, Birol I, Boisvert S, Chapman JA et al.
GigaScience 2013, 2:10
