Genomics Research with OpenCL™ and FPGAs

With the rapid decrease in gene sequencing costs due to the emergence of second-generation sequencing equipment, the availability of genome sequence data is increasing dramatically. The ability to correlate the variations among genomes is enabling advances in a wide range of medical research and personalized care. Because each human genome comprises more than three billion base pairs, whole genomic sequencing requires significant processing power, storage capacity, and network bandwidth. In particular, variant calling is extremely computationally intensive. The Genome Analysis Toolkit (GATK) is a software package developed at the Broad Institute to analyze high-throughput sequencing data. This paper describes the acceleration of the GATK’s HaplotypeCaller algorithm using Intel’s field-programmable gate array (FPGA) devices, programmed using the Intel® FPGA SDK for OpenCL™ software technology.

The Intel® FPGA SDK for OpenCL™ software technology enabled simple, effective implementation and testing of the Pair HMM algorithm for the GATK from the Broad Institute. The Altera FPGA shows significant performance acceleration relative to other technologies. Comparing the peak performance with IBM POWER8* and Xilinx platforms, the Intel® Arria® 10 device recorded speedups of up to 55x and 25x, respectively. Upon integration with the GATK Best Practices pipeline, the overall pipeline speed-up was 1.2x compared to the Intel® AVX technology implementation. Possible future work includes the following:

• Incorporate the accelerated algorithms into the complete GATK.

• Implement compression algorithms in the FPGA to enable more effective storage and transportation of genome data along with acceleration of analysis engines such as the GATK.

• Port to the recently announced Intel® Stratix® 10 FPGA to potentially achieve further performance scaling.

Further optimization of the OpenCL™ code could be done as well. For instance, in the current design, the result adder chain in Figure 4 is not fully utilized every cycle. An improvement would be to mux one of the adders from the HMM calculation so that the hardware could be shared. This type of optimization, called resource folding, would allow chip area to be reduced so that freed DSP resources could be used to add more computation units, increasing performance.
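
For readers unfamiliar with where a result adder fits into a Pair HMM, the following is a minimal Python sketch of the textbook three-state (match, insertion, deletion) forward recurrence. It is not the paper's OpenCL™ kernel. The function name `pair_hmm_forward` and the constant parameters `base_err`, `gap_open`, and `gap_ext` are hypothetical choices for illustration; production pipelines typically derive such probabilities per base from quality scores. We assume, without access to Figure 4, that the result adder chain corresponds to the final summation over the last row of the matrices.

```python
def pair_hmm_forward(read, hap, base_err=0.01, gap_open=0.001, gap_ext=0.1):
    """Likelihood of `read` given haplotype `hap` under a simple Pair HMM.

    All probabilities are constants here (an assumption for illustration).
    """
    R, H = len(read), len(hap)

    # Transition probabilities between the match (M), insertion (I),
    # and deletion (D) states.
    t_mm = 1.0 - 2.0 * gap_open    # match -> match
    t_mi = t_md = gap_open         # match -> insertion / deletion
    t_ii = t_dd = gap_ext          # gap extension
    t_im = t_dm = 1.0 - gap_ext    # gap -> match

    # (R+1) x (H+1) dynamic-programming matrices, zero-initialized.
    M = [[0.0] * (H + 1) for _ in range(R + 1)]
    I = [[0.0] * (H + 1) for _ in range(R + 1)]
    D = [[0.0] * (H + 1) for _ in range(R + 1)]

    # Let the read start at any haplotype position with equal probability.
    for j in range(H + 1):
        D[0][j] = 1.0 / H

    for i in range(1, R + 1):
        for j in range(1, H + 1):
            # Emission: high probability if bases agree, otherwise the
            # error probability spread over the three other bases.
            match = read[i - 1] == hap[j - 1]
            emit = (1.0 - base_err) if match else base_err / 3.0

            # The per-cell multiply-add work (the "HMM calculation").
            M[i][j] = emit * (t_mm * M[i - 1][j - 1]
                              + t_im * I[i - 1][j - 1]
                              + t_dm * D[i - 1][j - 1])
            I[i][j] = t_mi * M[i - 1][j] + t_ii * I[i - 1][j]
            D[i][j] = t_md * M[i][j - 1] + t_dd * D[i][j - 1]

    # Final reduction over the last row: needed once per read/haplotype
    # pair, not once per cell.
    return sum(M[R][j] + I[R][j] for j in range(1, H + 1))


if __name__ == "__main__":
    print(pair_hmm_forward("ACGT", "ACGT"))
    print(pair_hmm_forward("ACGT", "TTTT"))
```

In this sketch the per-cell recurrence runs R × H times, while the final summation contributes only on the last row. If the hardware follows the same structure, that would be consistent with a dedicated result adder sitting idle on many cycles, which is the inefficiency that resource folding aims to recover.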