Abstract
Background
Maximum Likelihood (ML)based phylogenetic inference using Felsenstein’s pruning algorithm is a standard method for estimating the evolutionary relationships amongst a set of species based on DNA sequence data, and is used in popular applications such as RAxML, PHYLIP, GARLI, BEAST, and MrBayes. The Phylogenetic Likelihood Function (PLF) and its associated scaling and normalization steps comprise the computational kernel for these tools. These computations are data intensive but contain fine grain parallelism that can be exploited by coprocessor architectures such as FPGAs and GPUs. A general purpose API called BEAGLE has recently been developed that includes optimized implementations of Felsenstein’s pruning algorithm for various data parallel architectures. In this paper, we extend the BEAGLE API to a multiple Field Programmable Gate Array (FPGA)based platform called the Convey HC1.
Results
The core calculation of our implementation, which includes both the phylogenetic likelihood function (PLF) and the tree likelihood calculation, has an arithmetic intensity of 130 floatingpoint operations per 64 bytes of I/O, or 2.03 ops/byte. Its performance can thus be calculated as a function of the host platform’s peak memory bandwidth and the implementation’s memory efficiency, as 2.03 × peak bandwidth × memory efficiency. Our FPGAbased platform has a peak bandwidth of 76.8 GB/s and our implementation achieves a memory efficiency of approximately 50%, which gives an average throughput of 78 Gflops. This represents a ~40X speedup when compared with BEAGLE’s CPU implementation on a dual Xeon 5520 and 3X speedup versus BEAGLE’s GPU implementation on a Tesla T10 GPU for very large data sizes. The power consumption is 92 W, yielding a power efficiency of 1.7 Gflops per Watt.
Conclusions
The use of data parallel architectures to achieve high performance for likelihoodbased phylogenetic inference requires high memory bandwidth and a design methodology that emphasizes high memory efficiency. To achieve this objective, we integrated 32 pipelined processing elements (PEs) across four FPGAs. For the design of each PE, we developed a specialized synthesis tool to generate a floatingpoint pipeline with resource and throughput constraints to match the target platform. We have found that using lowlatency floatingpoint operators can significantly reduce FPGA area and still meet timing requirement on the target platform. We found that this design methodology can achieve performance that exceeds that of a GPUbased coprocessor.
Background
Different Bayesian and likelihoodbased phylogenetic inference tools use various methods for generating a sequence of candidate trees, but in general these tools use the Phylogenetic Likelihood Function (PLF) to evaluate the likelihood of a proposed tree [1]. Equation 1 shows the PLF.
The computational components described in this paper target the nucleotide model of evolution, but the design methodology can be generalized to all discretecharacter models. The PLF computes the conditional likelihood of each of the four bases being at position i in an ancestral sequence as a function of the conditional likelihoods of each of the bases at the same position in the left and right descendent nodes. Once all conditional likelihoods are computed for a candidate tree, the tree likelihood can be computed as a function of the conditional likelihoods at the root node, as shown in Equation 2. Though scaling in the equation is not part of the mathematical algorithm of the PLF, it is part of a computational algorithm which implements the PLF as a means to cope with limited numerical precision and large trees.
Computing the PLF and tree likelihood for candidate trees comprises the computational kernel of these tools. Since multiple tools use a common likelihood computation, Ayres et al., implemented a finely tuned implementation of the likelihood computation as a generalpurpose library called BEAGLE [2]. BEAGLE supports CPU and GPUbased architectures, but does not yet support Field Programmable Gate Array (FPGA)based architectures such as the Convey HC1 [3]. In this paper, we describe our effort to add FPGA support to BEAGLE as well as the resultant performance.
Related work
There has been previous work in accelerating the PLF to FPGAbased coprocessor architectures. Mak and Lam were perhaps the first team to implement likelihoodbased phylogeny inference on an FPGA [4]. They used specialpurpose logic in the FPGA fabric to perform the PLF using fixedpoint arithmetic. Alachiotis et al. also implemented the PLF in special purpose logic and achieved an average speedup of four relative to software on a sixteen core processor [5,6].
There has also been recent work in using Graphical Processor Units (GPUs) as coprocessors for MLbased phylogenetic inference. In recent work, Suchard et al. used the NVIDIA CUDA framework [7] to implement single and double precision versions of the PLF [8]. Using three NVIDIA GTX280 GPUs, they achieved a speedup of 20 for the single precision nucleotide model as compared to singlethreaded software. Zhou et al. developed a GPU implementation and evaluated it on a CPU running four processes and two NVIDIA GTX 480 GPUs [9]. They achieved a speedup of 42.3 relative to MrBayes running on a CPU with a single thread, and 13.9 when compared to an optimized CPU version with 4 threads.
Descriptions of PLF kernel
The kernel function of BEAGLE depends on which options and evolutionary models are used for the analysis. When using the 4state nucleotide model, nearly all the execution time is spent evaluating the loglikelihood score. This evaluation consists of evaluating the PLF, normalizing the conditional likelihood tables to avoid numerical underflow, and updating two logscaler arrays.
The kernel algorithm is described in Cstyle pseudocode below. Note that although our implementation produces consistent results with that of BEAGLE, the pseudocode describes the authors’ implementation and is not necessarily descriptive of the corresponding kernel included in BEAGLE.
Pseudocode 1: BEAGLE Kernel
// for each nucleotide in the sequence perform the PLF to complete the fourcolumn
// conditional likelihood table
h = 0;
for (k = 0; k < nsites; k++) {
State is {AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT};
Base is {A, C, G, T};
State = State.first;
for (i = 0; i < 4; i++) {
Base = Base.first;
sopL = sopR = 0;
for (j = 0; j < 4; j++) {
sopL = sopL + tipL[State] * clL[Base];
sopR = sopR + tipR[State] * clR[Base];
State = State.next;
Base = Base.next;
}
clP[h + i] = sopL * sopR;
}
// find the maximum of the previously computed values (scaler value)
scaler = 0;
for (i = 0; i < 4; i++)
if (clP[h + i] > scaler) scaler = clP[h + i];
// normalize the previously computed values
for (i = 0; i < 4; i++)
clP[h + i] = clP[h + i] / scaler;
// store the log of the scaler value and store in scP array
scP[k] = log(scaler);
// update the lnScaler array
lnScaler[k] = scP[k] * lnScaler[k];
// accumulate the loglikelihood value
condLike = 0;
for (i = 0; i < 4; i++)
condLike = condLike + bs[i] * clP[h + i];
lnL = lnL + numSites[k] * (lnScale[k] + log(condLike));
// increment counters
h = h + 4; clL = clL + 4; clR = clR + 4;
}
double log (double x) {
// initialize binary search
log_comp = −16;
coeff_set = 0;
coeff_incr = 8;
log_comp_incr = 8;
// perform a logarithmic binary search
for (i = 0;i < 4;i++) {
if (x < 10^log_comp) {
log_comp = log_comp  log_comp_incr;
} else {
log_comp = log_comp + log_comp_incr;
coeff_set = coeff_set + coeff_incr;
}
coeff_incr = coeff_incr / 2;
log_comp_incr = log_comp_incr / 2;
}
// compute the polynomial approximation
return_val = 0;
pow_x = 1.0;
for (i = 0; i < 5; i++) {
return_val + = return_val + coeff[coeff_set][i] * pow_x;
pow_x = pow_x * x;
}
return return_val;
}
The natural log approximation is implemented as an order5 polynomial where the coefficients are divided into 16 segments whose range scales logarithmically. The coefficients are computed using a Chebyshev approximation [10]. In all of our experiments, we verify that the results computed from our design, including the log approximation, are accurate to within 1% of the results delivered by BEAGLE.
Implementation
Our objective is to implement the PLF and tree likelihood computations on a Convey HC1 heterogeneous platform as a hardware/software codesign. This application is a good match for the HC1, which provides high memory bandwidth and its FPGAs are wellsuited for computing data intensive loops containing no loopcarried dependences except for the accumulate required for the final likelihood value. We first briefly describe the platform then discuss our implementation in detail.
Platform
Figure 1 shows the design of our target platform, the Convey HC1. The Convey HC1 is a reconfigurable computer containing an FPGAbased coprocessor attached to a host motherboard through a socketbased frontside bus interface. Unlike socketbased coprocessors from Nallatech [11], DRC [12], and XtremeData [13], which are confined to a footprint matching the size of the socket, Convey uses a mezzanine connector to bring the front side bus (FSB) interface to a large coprocessor board (roughly the size of an ATX motherboard).
Figure 1. The HC1 coprocessor board. Four application engines connect to eight memory controllers through a full crossbar.
The coprocessor board contains four userprogrammable Virtex5 LX 330 FPGAs called “application engines (AEs)”. The coprocessor board also contains eight memory controllers, each of which is implemented on its own Virtex5 FPGA. Each of the AEs is connected to each of the eight memory controllers through a 4x4 crossbar switch on each memory controller.
Each memory controller allows up to two independent 64bit memory transactions (read or write) per cycle on a 150 MHz clock, giving a peak theoretical bandwidth of 16 bytes * 150 MHz = 2.4 GB/s per memory controller, or 8 * 2.4 GB/s = 19.2 GB/s per AE, or 4 * 19.2 GB/s = 76.8 GB/s of aggregate bandwidth for the coprocessor board.
Although the HC1’s coprocessor shares the same memory space as the host CPU, the coprocessor’s memory is partitioned into eight disjoint regions corresponding to each of the eight memory controllers. In other words, each memory region is only accessible by one of the eight memory interfaces.
Each of the eight memory controllers are connected to two Conveydesigned scatter–gather DIMM (SGDIMM) modules. Each of these two DIMMs contains eight DRAMs, each with eight decoupled DRAM banks that can be addressed independently. Each of the memory controllers attempts to maximize bandwidth by scheduling incoming memory requests to each of the sixteen banks, routing requests to nonbusy banks and grouping reads and writes into bursts to minimize bus turns (state change between read and write). A reorder module, developed by Convey, rearranges the loaded data to match the request order before delivering it back to the user logic.
The effectiveness of their scheduler depends on the memory access pattern issued by the user logic on the AEs. Contention for crossbar ports and bank request FIFOs cause the HC1’s memory controller to throttle the AE by asserting a “stall” feedback signal. These stalls reduce the effective memory bandwidth.
For memorybound kernels the user logic should attempt to access memory during every clock cycle of operation. Any cycle where the logic does not access memory is either due to inefficiency in the AE’s memory interface or due to a stall request from Convey’s memory controller. In order to calculate actual memory bandwidth, we define memory efficiency as:
Actual memory bandwidth can therefore be computed as:
Hardware implementation of the kernel
On the coprocessor, each AE has eight processing units (PEs). To achieve maximum memory efficiency, each should make full use of the 128bit per cycle memory channel. Each PE implements the kernel described in Pseudocode 1 and thus each consumed 10 inputs per loop iteration (for each character in the input sequence), performing 130 single precision floatingpoint operations, and producing six outputs. Since the memory interface can only deliver four floating values per cycle, each PE requires three cycles to read its inputs and one and a half cycles to write its results for each input character.
Resource and throughput constrained synthesis
The data introduction interval (DII) is the number of cycles required for the pipeline
to read all of its inputs from the available input ports, i.e.
In conventional scheduling it is sufficient to provide at least one functional unit
of each required functional unit type to ensure that a schedule exists. However, in
order to achieve maximum throughput, the minimum number of functional units of an
operation type is
We use the “As Soon As Possible (ASAP)” scheduling technique for data path synthesis. ASAP repeatedly schedules the ready operations to the time slot in a manner of firstcomefirstserved. The start time of each operation is the minimum allowed by the dependencies of operations.
Our pipeline synthesis tool takes, as input, the number of input ports of the target platform (physical input ports), the number of input variables derived from the target expression (logical input ports), and a data flow graph (DFG) representing the target expression. Given these parameters, the tool converts the DFG into a pipeline described in Verilog hardware description language with minimum number of floatingpoint functional units.
As an example, consider the high level code below, which computes one column of the conditional probability table.
clP[0] = (tipL[AA]*clL[A] + tipL[AC]*clL[C] + tipL[AG]*clL[G] + tipL[AT]*clL[T]) * (tipR[AA]*clR[A] + tipR[AC]*clR[C] + tipR[AG]*clR[G] + tipR[AT]*clR[T]);
The two 4x4 transition likelihood tables tipL and tipR are invariant across all nodes and are thus treated as constants. As such, this expression has eight inputs and one output.
The DFG for this expression is shown in Figure 2. If this expression is synthesized onto a platform that can only provide one input value per clock, it can be synthesized as shown in Figure 3 with only one adder and two multipliers. The multiplexers select different input data to feed the functional units. The darkcolored registers save the intermediate results of the functional units.
Figure 2. An arithmetic operation as a DFG.
The DFG of the full PLF kernel is shown in Figure 4. The number of floatingpoint operations in DFG is 38 additions, 55 multiplications, 4 divisions and 11 comparisons. Our tool’s resource sharing feature maps these arithmetic operations into 13, 19, 2 and 4 functional operators respectively. However, we find that it is infeasible to fit all the singleprecision floatingpoint operators on one PE without adjusting the latency of each operator to reduce resource usage for the target device. By carefully evaluating the latency and resource usage of each operator in Xilinx CORE Generator System’s resource estimation [14], we used lowlatency functional units for adder, multiplier and comparators. Using trial and error, we determined the latency of the floatingpoint divider needs to be around 11 to satisfy the 150 MHz timing constraint. Using lowlatency operators we were able to reduce significantly FPGA’s slice registers for each functional unit and the entire pipeline, meeting the demanding resource requirement for each PE. Table 1 lists the low and maxlatency and the number of slice registers of each floatingpoint operator described in Xilinx Floatingpoint Operator Version 5.0 [15].
Figure 4. Full data flow graph of PLF and tree likelihood calculation.
Table 1. Xilinx IEEE754 singleprecision floatingpoint operator’s latency and slice register usage
Since each PE has two independent 64bit physical ports, we synthesized the PLF kernel to a deeppipelined singleprecision floatingpoint circuit with four input ports that receive four 32bit data per cycle. The latency of the entire pipeline is 125 cycles.
Hardware architecture of PLF kernel
The top level of the system design is shown in Figure 5. It is composed of a read/write memory interface and an accelerator. The kernel pipeline has four inputs and six outputs. Two 64bit 512entry input operand FIFOs op1 and op2 receive input data from the memory controller and feed four 32bit floatingpoint input values per cycle to the pipeline. Six results from the pipeline are written into twelve FIFOs and from which the results are stored back to memory through a multiplexer. We chose 2048 and 512 for the output FIFO depth N and M respectively. The choice is determined by the constraint of the number of Xilinx BRAMs in a single FPGA [16] and BRAM usage of other modules and components.
Figure 5. Hardware architecture of PLF accelerator.
In the write portion of the memory interface there are four FIFOs for clP, scP and lnScaler pipeline outputs. Four clP outputs (clP0, clP1, clP2, and clP3) correspond to four consecutive locations of clP array. By writing data into four FIFOs of scP and lnScaler in a roundrobin manner, the memory interface can store 128 bits per cycle into the memory, increasing the output throughput.
Figure 6 shows a finite state machine (FSM) that coordinates memory read and write requests with Convey’s memory controller interface. In the LD1 state, the controller requests two 64bit words from the clL array, which comprises four 32bit input values from the clL array. It then transitions to state LD2 where it requests two 64bit words from the clR array, which comprises the four 32bit input values from the clL array. Next it transitions to the LD3 where its requests a 32bit word from the lnScaler array and the numSites array.
Figure 6. Memory access FSM.
Until the output FIFOs have not reached full state, the controller repeats this sequence. When the output FIFOs become almost full, the current load request is interrupted and the machine state jumps to its store result state. The output results in the twelve FIFOs are then stored back to memory. When stores are finished, the interrupted load request will be resumed. Note that the pipeline can continue consuming input data if the input and output FIFOs are not full. After all load requests are satisfied the machine goes to the store state to store the remaining data in the FIFOs. When it is complete, the machine returns to the initial state IDLE.
Steps of calculating root log likelihood
BEAGLE provides a set of interface functions needed for the user to describe a candidate tree and request that its likelihood be computed. Specifically, these include functions that:
•initialize input arrays and scalars, including the transition probability tables and leaf node (tip) conditional likelihood tables, and all other necessary scalars and arrays,
•describe the topology of the proposed tree,
•request that the BEAGLE runtime library traverse the tree and update the conditional likelihood tables of all internal nodes, and
•request that BEAGLE compute the likelihood of the tree.
Figure 7 and Table 2 show how our driver program initializes all input data using a sequence of BEAGLE function calls.
Figure 7. Mapping between BEAGLE API and coprocessor communication.
Table 2. Descriptions of BEAGLE API implementation
Data organization
In order to utilize eight processing elements (PEs) on each FPGA we distribute the input data among four FPGAs evenly. We copy the host data arrays to the coprocessor’s memory space based on the organization and partition of coprocessor’s memory.
Convey’s memory mapping scheme requires that memory addressing of each PE be aligned to the MC controller number. Because of this, the data is arranged using a stride of 64 bytes for each PE in an AE and 512 bytes for the same PE space between two consecutive AEs. (e.g. AE0 and AE1).
Figure 8 shows the memory mapping of the host array clL and numSites (nS). Each PE in an AE addresses 16 elements of clL array (64 bytes), which correspond to input data of four consecutive sites. Since there is only one input data of numSites for each site, we need to pack the next four numSites input data into PE0’s addressing space in AE0. This is shown in Figure 6 with an arrow from host array indexed at byte address 2048 to PE0 at byte address 0. The mapping of host array clR is the same as that of clL. We initialize the elements of lnScaler with zero.
Figure 8. Memory allocation.
Results and discussion
We implemented the kernel on a multiFPGA platform Convey HC1. The platform has four Xilinx Virtex5 LX330 FPGAs. The design is described using our pipeline synthesis tool which generates deep floatingpoint pipeline in Verilog HDL. Xilinx ISE 13.2 [17] is used to synthesize, map, place and route the design, and generate the final bitstream. A single bitstream file is used to configure all the FPGAs. Each FPGA works on different input data based on the FPGA numbers and processing elements in each FPGA.
Performance results of our implementations
In the software implementation using BEAGLE APIs, we timed the elapsed time of function beagleUpdatePartials(), beagleAccumulateScaleFactors() and beagleCalculateRootLogLike() since they comprise the computation kernel. In the hardware implementation, we timed the kernel execution on multiFPGA. We assume the number of sites is a multiple of 128 to allow each PE to process the same amount of input data.
The performance results are shown in Table 3. When the number of sites is small, CPU implementation performs faster than both FPGAbased and GPUbased implementations. When the number of sites is larger than 512, both FPGA and GPU implementations outperform the CPU. For a large number of sites, we obtain around a 40X speedup compared to a singlethreaded CPU implementation (Xeon E5520) and around 3X speedup compared to a manycore GPU implementation (Tesla S1070/T10). Note the GPU results are not available when the number of sites exceeds 260 K, since the BEAGLE GPU implementation is not capable of processing this data size.
Table 3. Performance results of our design
We achieved a memory efficiency of around 50% for large data sizes. This is competitive with similar implementations in the literature. For example, Cong et al. reported their implementation of a bandwidthbounded application that has 32 PEs and utilizes all the memory access ports on Convey HC1 [18] having 30% efficiency. In general, the factors that contribute to the efficiency are external memory access order, memory buffer size, and frequency of memory bus turns (alternating between read and write operations).
Convey’s unique interleaved memory addressing attempts to maximize memory system utilization by distributing memory accesses across all memory banks. Memory stalls occur when the number of pending memory load requests reaches the size of memory request queue in the memory controller. In order to avoid this, the size of the custom memory buffer in the user design must be close to the size of the request queue. A smaller memory buffer cannot overcome the long latency of DDR2 memory access while a larger one increases memory stalls.
Frequently alternating between memory read and memory write will reduce the effectiveness of the Convey memory scheduler and reduce memory bandwidth. The use of deep output FIFOs and writing the entire contents of the output FIFOs when they fill will reduce the frequency of readwrite transitions and improve bandwidth.
FPGA resource usage
The resource utilization of each FPGA is listed in Table 4. In each PE all floatingpoint multipliers are implemented using DSP48E modules [19] while other floatingpoint operators are implemented using LUTs. Due to the large number of floatingpoint operators and deep pipeline circuit in each PE we utilize nearly all the slices in a single FPGA. The design runs at 150 MHz and the memory controller at 300 MHz.
Table 4. Area results of our design
Power consumption
The Xilinx power analyzer xpa reports that each FPGA design consumes about 23 W, or 92 W for all four FPGAs. The thermal design power of the Tesla GPU card is around 200 W. The FPGA implementation delivers a better performance while consuming less than half of the power of the Tesla GPU.
Discussion
Design motivation
This work is based on the “MrBayes accelerator” design, previously developed in the author’s lab [8]. The original design performed the same basic computations as described in this paper, but its pipeline was designed by hand and did not incorporate any functional unit reuse. As such, the original design instanced one functional unit for each operator in the DFG. The resulting design was large, allowing only one PE to fit on a single FPGA. In order to automate the design process and improve the resource efficiency, the authors developed a highlevel synthesis tool that generates a pipeline from a dataflow graph description of the kernel, and exploits functional unit reuse in such a way as to achieve the maximum throughput as bounded by the available memory bandwidth on the target platform. This synthesis tool was developed specifically for this application, but can be also used for any dataintensive kernel that has no loopcarried dependencies.
In addition, we implemented the new version of the design on the Convey HC1 reconfigurable computer, which has 13.2 times the amount of FPGA resources, 28.4 times the memory bandwidth, and over 1000 times the memory capacity of the original platform, an Annapolis Micro Systems WildStar II Pro. In order to make the design more general purpose, we integrated our design with the BEAGLE library instead of integrating it only into MrBayes 3 as in the original work.
Performance results
Our design achieves the highest possible level of performance as allowed by the memory system of the HC1. Memory efficiency was relatively low, and can be potentially improved by rearranging the order in which inputs are requested from the memory. Specifically, this can be performed by buffering a set of consecutive values from each input array before streaming the values into the pipeline in the order implied by the outermost loop in Pseudocode 1. This would require that the input values be read from memory in a different order than read by the pipeline. While we expect this to improve memory efficiency, the buffer would need to be designed in a way that doesn’t itself contain inefficiencies that negate its benefit. This is part of the authors’ future work.
Conclusions
In this paper we described an FPGAbased implementation of the core computations in the BEAGLE library that perform the phylogenetic likelihood function and tree likelihood computations. With this design we achieve 3X the performance of BEAGLE’s GPUbased implementation for large datasets.
The kernel implemented in this work is characterized by having a relatively low arithmetic intensity, making its performance dependent on the effective memory bandwidth achievable by the target platform. In order to achieve high performance under this condition, we developed a design automation tool that synthesizes the kernel’s data flow graph in a way that matches the pipeline’s throughput to the platform’s memory bandwidth while minimizing hardware requirements.
Availability and requirements
Project name: BEAGLE_HC1
Project home page:http://www.cse.sc.edu/~jbakos/software.shtml webcite
Operating system(s): Linux
Programming language: C and Verilog
Other requirements: Must be run on a Convey HC1
License: GNU GPL v4
Any restrictions to use by nonacademics: None
Competing interests
Both authors declared that they have no competing interests.
Authors’ contributions
JZ implemented the high level synthesis tool used to synthesize the BEAGLE kernel onto the Convey HC1 platform. JZ verified the design and performed performance testing. This work was performed under the direction of JB (JZ’s research advisor), whose original design, implemented on another FPGA platform [8], was the motivation and starting point for this improved design and design methodology. The manuscript was written jointly by JZ and JB. Both authors read and approved the final manuscript.
Acknowledgements
The authors wish to thank Glen Edwards of Convey Computer Corporation for his assistance in this work. The authors would also like to thank the anonymous reviewers for their insightful comments that allowed us to improve the quality of this paper.
This material is based upon work supported by the National Science Foundation under Grant No. 0844951.
References

Felsenstein J: Inferring Phylogenies. Sunderland, MA: Sinarer Associates, Inc. Publishers; 2004.

Ayres DL, Aaron D, Zwick DJ, Peter B, Holder MT, Lewis PO, Huelsenbeck JP, Fredrik R, Swofford DL, Cummings MP, Andrew R, Suchard MA: BEAGLE: an Application Programming Interface and HighPerformance Computing Library for Statistical Phylogenetics.
Syst Biol 2012, 61(1):170173. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Convey HC1 family.
http://www.conveycomputer.com webcite

Mak TST, Lam KP: Embedded Computation of MaximumLikelihood Phylogeny Inference Using Platform FPGA.
Proc. IEEE Computational Systems Bioinformatics Conference table of contents 2004, 512514.

Alachiotis N, Sotiriades E, Dollas A, Stamatakis A: Exploring FPGAs for accelerating the Phylogenetic Likelihood Function.
Proc. Eighth IEEE International Workshop on High Performance Computational Biology 2009, 18.

Alachiotis N, Sotiriades E, Dollas A, Stamatakis A: A Reconfigurable Architecture for the Phylogenetic Likelihood Function.
Proc. International Conference on Field Programmable Logic and Applications 2009, 674678.

Nvidia CUDA Framework.
http://www.nvidia.com/object/cuda_home_new.html webcite

Suchard MA, Rambaut A: ManyCore Algorithms for Statistical Phylogenetics.
Bioinformatics 2009, 25(11):13701376. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zhou J, Liu X, Stones DS, Xie Q, Wang G: MrBayes on a Graphics Processing Unit.
Bioinformatics 2011, 27(No. 9):12551261. PubMed Abstract  Publisher Full Text

Stephanie Z, Bakos JD: FPGA acceleration of the phylogenetic likelihood function for Bayesian MCMC inference methods.
BMC Bioinforma 2010, 11:184. BioMed Central Full Text

Nallatech, Intel Xeon FSB FPGA Accelerator Module.
http://www.nallatech.com/IntelXeonFSBSocketFillers/fsbdevelopmentsystems.html webcite

DRC Computer, DRC Reconfigurable Processor Units (RPU).
http://www.drccomputer.com/drc/modules.html webcite

XtremeData Inc., XD2000i™ FPGA InSocket Accelerator for Intel FSB.
http://www.xtremedata.com/products/accelerators/insocketaccelerator/xd2000i webcite
PubMed Abstract  Publisher Full Text  PubMed Central Full Text 
Xilinx Core Generator.
http://www.xilinx.com/ise/products/coregen_overview.pdf webcite

Xilinx Floating Operator Data Sheet. 2009.
http://www.xilinx.com/support/documentation/ip_documentation/floating_point_ds335.pdf webcite

Virtex5 FPGA User Guide.
http://www.xilinx.com/support/documentation/user_guides/ug190.pdf webcite

Xilinx ISE Design Suite 13.
http://www.xilinx.com/support/documentation/dt_ise132.htm webcite

Cong J, Huang M, Zou Y: 3D Recursive Gaussian IIR on GPU and FPGAs, A Case Study for Accelerating BandwidthBounded Applications. San Diego, CA; 2011:7073. [Proceedings of the 9th IEEE Symposium on Application Specific Processors (SASP 2011)]

Xilinx UG193 Virtex5 XtreameDSP Design Considerations.
http://www.xilinx.com/support/documentation/user_guides/ug193.pdf webcite