Langley, Salzberg, Neale and Wegrzyn on sequencing the loblolly pine genome

Posted by Biome on 31st March 2014


Conifers are known to have large and highly complex genomes, in the range of 20 to 40 Gbp. One conifer, the loblolly pine (Pinus taeda), is the second most common tree species in the USA, making it vital to American forestry, and is also a feedstock for the generation of biofuels. With over 1.5 billion loblolly pine seeds planted each year, a large majority of which have been genetically bred for improvement, this pine tree was an ideal candidate for the generation of a reference genome for conifers. In a recent study in Genome Biology, Charles Langley and David Neale from the University of California, Davis, USA, Jill Wegrzyn from the University of Connecticut, USA, Steven Salzberg from Johns Hopkins University, USA, and colleagues describe how they sequenced and assembled the first full-length genome of the loblolly pine, making this the longest genome sequenced to date at 22.18 Gbp. Here, Langley, Salzberg, Neale and Wegrzyn discuss how they overcame the challenges associated with sequencing such a large genome.

 

Schematic of loblolly reproductive pathway. Image source: Neale et al, Genome Biology, 2014, 15:R59

Why is loblolly pine an important species to study and what led you to sequence its genome?

SS: Loblolly pine is the number one commercial tree species in the USA, used for a wide range of products, especially paper and construction timber.

DN: Loblolly pine has been used extensively in genetic studies because of the availability of multi-generation pedigrees developed by the breeding cooperatives. Thus, all kinds of useful genetic resources were available in loblolly pine that would not be found in other pine/conifer species.

CL: Like a number of other reference genome sequences the loblolly genome serves as a solid and fertile foundation for investigations at many levels, from pathogen resistance and efficient breeding to the comparative genomics of terrestrial plants. From a technical perspective this sequencing project moves the scale and integration of technologies involved in next-generation whole genome sequencing (NG-WGS) up a level. Also noteworthy is the fact that this genome sequence was created in a collaboration with a few modest laboratories rather than a large sequencing center.

My own motivation for contributing to this project derives from its value in the study of population genomics. Natural populations of loblolly pine are large and well-studied for many interesting traits. This makes them ideal for testing population genetics theories. Studies to understand the origin, maintenance and divergence of the underlying genomic variation depend on this high quality reference sequence.

 

What challenges did you encounter when sequencing and assembling the loblolly pine genome, and what strategies did you take to overcome these challenges?

CL: While the increasing cost efficiency of present-day next-generation sequencing (NGS) made the direct cost of sequencing such a large genome manageable, the complexity and heterozygosity of the available DNA made the assembly daunting. By choosing to conduct most of the sequencing in the haploid genome of a single gamete (pine nut) of the target tree, and by very effectively error-correcting and pre-assembling the mountain of reads, we were able to present the state-of-the-art assembler with a manageable scale of input data.

As mentioned above this project was conducted in several small labs. Creative and effective planning, open exchange and strong, focused collaborative commitment were each necessary but not always easy to achieve among fiercely independent scientists.

DN: This is a very key point and the credit goes to Chuck Langley for understanding the importance of open and constant dialogue among team members. This led to a very creative process that would not have been achieved otherwise.

SS: The enormous size of the genome was the main challenge. At the time we started, no existing software could assemble a genome of this size – it would simply exceed the memory capacity of any available computer and then crash. The assembly team, at the University of Maryland and Johns Hopkins University, USA, developed a new algorithm that could reduce most of the data by about 100-fold, which was critical to getting the genome put together.
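The interview does not describe how the data-reduction algorithm works. As a purely illustrative toy, and not the team's method, the Python sketch below shows one generic way a read set can shrink: each read is extended in both directions for as long as exactly one k-mer continues it, so that many overlapping reads collapse into the same longer sequence. The function names, the value of k, the simulated genome and the error-free, single-strand reads are all assumptions made for the example.

```python
import random

def kmers(seq, k):
    """Yield every k-mer (substring of length k) in seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def extend(read, kmer_set, k, max_steps=10_000):
    """Extend a read left and right while exactly one known k-mer continues it.

    Toy assumption: reads are error-free and come from one strand only.
    max_steps guards against endless extension around a cycle.
    """
    seq = read
    for _ in range(max_steps):  # extend to the right
        suffix = seq[-(k - 1):]
        nxt = [b for b in "ACGT" if suffix + b in kmer_set]
        if len(nxt) != 1:       # dead end or ambiguous branch: stop
            break
        seq += nxt[0]
    for _ in range(max_steps):  # extend to the left
        prefix = seq[:k - 1]
        prv = [b for b in "ACGT" if b + prefix in kmer_set]
        if len(prv) != 1:
            break
        seq = prv[0] + seq
    return seq

# Simulated data (hypothetical): a 300-base random "genome" tiled by 50-base reads.
random.seed(0)
K = 21
genome = "".join(random.choice("ACGT") for _ in range(300))
reads = [genome[i:i + 50] for i in range(0, len(genome) - 50 + 1, 10)]

kmer_set = {km for r in reads for km in kmers(r, K)}
supers = {extend(r, kmer_set, K) for r in reads}

print(f"{len(reads)} reads collapsed into {len(supers)} extended sequence(s)")
```

In this repeat-free toy every read extends to the same sequence, so the input collapses completely; in a real genome, repeats and sequencing errors create branches that stop the extension much earlier.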

We also began developing a new method to use fosmids – small genomic chunks about 38 kilobases in length – as an aid to assembly. We found that we can pool together as many as 5000 fosmids and then disentangle them computationally. This approach is still in development, but we’ve already used it for part of the loblolly assembly.
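A rough back-of-the-envelope calculation suggests why pools of this size might be separable. The sketch below uses the figures quoted in this article (about 38 kilobases per fosmid, up to 5000 fosmids per pool, a 22.18 Gbp assembly) and assumes, purely for illustration, that fosmids start at independent, uniformly random positions in the genome; the function names are hypothetical.

```python
from math import comb

GENOME_BP = 22.18e9   # assembled genome size quoted in the article
FOSMID_BP = 38_000    # approximate fosmid length quoted above
POOL_SIZE = 5_000     # largest pool size quoted above

def pool_span_bp(n=POOL_SIZE, length=FOSMID_BP):
    """Total bases covered by one pool if no two fosmids overlap."""
    return n * length

def expected_overlapping_pairs(n=POOL_SIZE, length=FOSMID_BP, genome=GENOME_BP):
    """Expected number of fosmid pairs in a pool that overlap each other.

    Assumption: start positions are independent and uniform, so two fosmids
    overlap when their starts differ by less than one fosmid length,
    which happens with probability of roughly 2 * length / genome.
    """
    return comb(n, 2) * 2 * length / genome

span = pool_span_bp()
print(f"Pool span: {span / 1e6:.0f} Mbp ({span / GENOME_BP:.2%} of the genome)")
print(f"Expected overlapping pairs: {expected_overlapping_pairs():.1f} "
      f"of {comb(POOL_SIZE, 2):,} possible pairs")
```

Under these assumptions a full pool spans 190 Mbp, under 1% of the genome, and only about 43 of the roughly 12.5 million possible fosmid pairs are expected to overlap, so most fosmids in a pool come from distinct regions.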

The use of a haploid genome was also key: it’s rare to be able to get haploid (rather than diploid) DNA for a multi-cellular organism. The biology of the pine tree helped us out here: a pine nut contains a significant quantity of haploid DNA.

 

What is the importance of generating a high quality genome assembly, and how does the quality of the loblolly pine genome assembly compare with other sequenced plant species?

CL: It is widely recognized that a full high quality reference sequence can drive rapid advances. It is less well appreciated that an incomplete reference genome rife with errors can waste precious talent and effort, ultimately slowing and diverting science.

While the present loblolly pine sequence is incompletely assembled, it is a solid foundation. The error rate is low. But version 2.0 is ‘baking in the oven’.

SS: A high quality assembly provides the basis for a great variety of downstream research. Once we have the assembly in hand, we can identify all the genes and then begin to link genes to phenotype, as we have been doing for more than a decade now with the human genome. It all starts with the genome itself.

DN: The quality and open access approach used with the loblolly genome means that it will serve as the reference for about 400 conifer genomes that will be sequenced in the years ahead.

 

How did the high quality of the loblolly pine genome assembly affect gene annotation and the insights gained into gene family evolution?

JW: The combination of a high quality genome assembly with long scaffolds and a comprehensive transcriptome generated from multiple tissue types provided evidence to describe over 50,000 genes. Several conifer genes have long introns that exceed 20 Kb in length and these would have been difficult to identify with shorter scaffolds. The full length genes allowed us to perform comparisons with protein sequences from several fully sequenced plant genomes and further investigate those specific to pine.

 

How were you able to utilize the genome assembly to identify genes underpinning important traits, such as disease resistance?

JW: John Davis at the University of Florida, USA, and his colleagues identified a single nucleotide polymorphism (SNP) associated with fusiform rust resistance in loblolly pine. This genetically mapped SNP was originally identified in a partial expressed sequence tag (EST). Availability of the genome and transcriptome positively identified the partial EST as a Toll-Interleukin Receptor / Nucleotide Binding / Leucine-Rich Repeat (TNL) gene. Analysis of orthologous proteins from several plant species indicated that this gene belongs to a class of TNLs that have expanded in conifers.

 

How do you think the availability of the loblolly genome sequence and assembly will aid future research?

CL: It will enable functional genomics in conifers and genomic selection (modern breeding). It will be an essential component of plant comparative genomics and will also serve as the essential reagent in population genomics investigations and genome wide association studies.

DN: It will provide a genetic resource for ecological genomics research that will facilitate better management of forests under changing climate conditions.

 

More about the author(s)

Charles Langley, Professor of Genetics, University of California, Davis, USA.

Charles Langley is Professor of Genetics in the Department of Evolution and Ecology at the University of California, Davis (UCD), USA. He obtained his PhD in zoology at the University of Texas at Austin, USA, and undertook his postdoctoral training at the University of Wisconsin-Madison, USA. He then joined the National Institute of Environmental Health Sciences, before joining UCD. The research interests of the Langley lab include population genetics and molecular evolution, more specifically addressing the forces that shape genetic variation within and between species, applying both computational and experimental approaches to these investigations. Langley is also a fellow of the American Academy of Arts and Sciences.

Steven Salzberg, Professor of Medicine, Biostatistics, and Computer Science, Johns Hopkins University, USA.

Steven Salzberg is a Professor of Medicine, Biostatistics, and Computer Science and the Director of the Center for Computational Biology at the McKusick-Nathans Institute of Genetic Medicine at Johns Hopkins University, USA. He obtained his PhD in Computer Science from Harvard University, USA, and went on to join the Computer Science Department at Johns Hopkins University, USA, as an Assistant Professor. Current research in the Salzberg lab focuses on next-generation sequence alignment and large-scale genome assembly, which has resulted in several pioneering, highly efficient systems for the alignment of next-generation sequencing reads, including the Bowtie, TopHat, and Cufflinks systems. All of the group’s software is free and open source. In addition to his software research, Salzberg has contributed analyses to many genome projects, including the human genome and multiple plant and animal genomes.

David Neale, Professor, University of California, Davis, USA.

David Neale is a forest geneticist and Professor at the University of California, Davis, USA. He obtained his PhD in Forest Genetics at Oregon State University, USA. Research in the Neale lab is centered on discovering and understanding the function of genes in forest trees, especially those controlling complex traits, with a focus on single nucleotide polymorphism (SNP) discovery within candidate genes and association mapping to identify alleles useful in tree breeding.

Jill Wegrzyn, Assistant Research Professor, University of Connecticut, USA.

Jill Wegrzyn is an Assistant Research Professor in the Department of Ecology and Evolutionary Biology and a scientist in the Bioinformatics Facility at the University of Connecticut, USA.  Her graduate education was completed at the University of California, San Diego (UCSD) and the Claremont Graduate University, USA, in the areas of bioinformatics and information systems. Subsequent research appointments focused on bioinformatics as applied to proteomics under Vivian Hook at UCSD and forest tree genomics with David Neale at the University of California, Davis, USA. Wegrzyn has developed and curated the forest tree genomics repository, TreeGenes, and related applications for over ten years. Current research in the Wegrzyn lab is focused on computational genomics in non-model plant species, with active projects in genome annotation, transcriptomics, and scientific databasing.

Research

Decoding the massive genome of loblolly pine using haploid DNA and novel assembly strategies

Neale DB, Wegrzyn JL, Stevens KA, Zimin AV, Puiu D, Crepeau MW, Cardeno C, Koriabine M et al.
Genome Biology 2014, 15:R59
