Extending the BEAGLE library to a multi-FPGA platform
- Equal contributors
Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA
BMC Bioinformatics 2013, 14:25 doi:10.1186/1471-2105-14-25Published: 19 January 2013
Maximum Likelihood (ML)-based phylogenetic inference using Felsenstein’s pruning algorithm is a standard method for estimating the evolutionary relationships amongst a set of species based on DNA sequence data, and is used in popular applications such as RAxML, PHYLIP, GARLI, BEAST, and MrBayes. The Phylogenetic Likelihood Function (PLF) and its associated scaling and normalization steps comprise the computational kernel for these tools. These computations are data intensive but contain fine grain parallelism that can be exploited by coprocessor architectures such as FPGAs and GPUs. A general purpose API called BEAGLE has recently been developed that includes optimized implementations of Felsenstein’s pruning algorithm for various data parallel architectures. In this paper, we extend the BEAGLE API to a multiple Field Programmable Gate Array (FPGA)-based platform called the Convey HC-1.
The core calculation of our implementation, which includes both the phylogenetic likelihood function (PLF) and the tree likelihood calculation, has an arithmetic intensity of 130 floating-point operations per 64 bytes of I/O, or 2.03 ops/byte. Its performance can thus be calculated as a function of the host platform’s peak memory bandwidth and the implementation’s memory efficiency, as 2.03 × peak bandwidth × memory efficiency. Our FPGA-based platform has a peak bandwidth of 76.8 GB/s and our implementation achieves a memory efficiency of approximately 50%, which gives an average throughput of 78 Gflops. This represents a ~40X speedup when compared with BEAGLE’s CPU implementation on a dual Xeon 5520 and 3X speedup versus BEAGLE’s GPU implementation on a Tesla T10 GPU for very large data sizes. The power consumption is 92 W, yielding a power efficiency of 1.7 Gflops per Watt.
The use of data parallel architectures to achieve high performance for likelihood-based phylogenetic inference requires high memory bandwidth and a design methodology that emphasizes high memory efficiency. To achieve this objective, we integrated 32 pipelined processing elements (PEs) across four FPGAs. For the design of each PE, we developed a specialized synthesis tool to generate a floating-point pipeline with resource and throughput constraints to match the target platform. We have found that using low-latency floating-point operators can significantly reduce FPGA area and still meet timing requirement on the target platform. We found that this design methodology can achieve performance that exceeds that of a GPU-based coprocessor.