Halvade*

Halvade* is a MapReduce implementation of the best-practice DNA sequencing pipeline recommended by the Broad Institute.

Parallel Efficiency Reaches 91 Percent¹

Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine.

Halvade* is a framework that enables sequencing pipelines to be executed in parallel, and highly efficiently, on multi-node and/or multi-core compute infrastructure. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK* Best Practices recommendations, supporting both whole genome and whole exome sequencing. Halvade is implemented in Java and uses the Apache Hadoop* MapReduce 2.0 API. Supported platforms include the Cloudera Hadoop* Distribution and Amazon EMR*.
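To make the MapReduce pattern concrete, here is a toy sketch in plain Python (not Hadoop, and not Halvade's actual Java code). It assumes that read mapping is the map step and that variant calling is the reduce step, grouped by genomic region; the text above does not spell out this split. The function names, `REGION_SIZE`, and the sample reads are all hypothetical.

```python
from collections import defaultdict

REGION_SIZE = 1000  # hypothetical region width in base pairs


def map_read(read):
    """Map step: stand-in for read alignment; emits (region key, aligned read)."""
    chrom, pos, seq = read  # toy input that already carries its position
    region = (chrom, pos // REGION_SIZE)
    return region, (pos, seq)


def shuffle(pairs):
    """Group mapper output by key, as a MapReduce framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_region(region, aligned):
    """Reduce step: stand-in for variant calling; here it only sorts and counts reads."""
    ordered = sorted(aligned)  # reads ordered by position within the region
    return region, len(ordered)


# Made-up reads: (chromosome, position, sequence)
reads = [
    ("chr1", 150, "ACGT"),
    ("chr1", 1200, "TTGA"),
    ("chr1", 900, "GGCA"),
    ("chr2", 10, "CATG"),
]

groups = shuffle(map_read(r) for r in reads)
results = dict(reduce_region(key, values) for key, values in groups.items())
print(results)
```

Because each region is reduced independently, the reduce calls could run as separate tasks on different nodes, which is the property that lets a pipeline of this shape scale out.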

Performance Results

Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50x coverage) in less than 3 hours with high parallel efficiency¹.

The speed-up curve shows almost linear scaling: performance improves as more Hadoop tasks are added. Here, each task uses six physical Intel® Xeon® CPU cores, which amounts to 12 hardware threads per Hadoop task. The efficiency curve shows the same result: with 360 cores in total, parallel efficiency is 91.1 percent, indicating that the available resources are used effectively.
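As a worked illustration, parallel efficiency is conventionally defined as the speed-up divided by the number of parallel workers. The sketch below applies that standard definition to the figures above. It assumes the baseline is a single six-core task, which the text does not state explicitly, and the function names are illustrative.

```python
def parallel_efficiency(speedup, workers):
    """Efficiency = achieved speed-up / ideal (linear) speed-up."""
    return speedup / workers


def implied_speedup(efficiency, workers):
    """Invert the definition: speed-up = efficiency * workers."""
    return efficiency * workers


TOTAL_CORES = 360      # from the text
CORES_PER_TASK = 6     # from the text
EFFICIENCY = 0.911     # 91.1 percent, from the text

tasks = TOTAL_CORES // CORES_PER_TASK           # number of parallel Hadoop tasks
speedup = implied_speedup(EFFICIENCY, tasks)    # relative to one task (assumed baseline)

print(f"{tasks} tasks, implied speed-up of about {speedup:.1f}x over a single task")
```

Under these assumptions, 60 tasks at 91.1 percent efficiency correspond to a speed-up of roughly 54.7x, compared with 60x for perfectly linear scaling.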

Without Halvade, the same pipeline would run for an estimated 288 hours (ca. 12 days) on a single node. Even with multithreading enabled within the tools that support it, a runtime of 120 hours (ca. 5 days) was measured. With Halvade, the runtime is reduced to about 3 hours on a 15-node Intel® Xeon® CPU cluster running Cloudera Hadoop* Distribution. Even when Halvade uses only a single node, the whole pipeline runs in 48 hours (ca. 2 days).
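The runtimes quoted above can be turned into approximate speed-up factors. The sketch below uses only the rounded hour values from the text, so the ratios are indicative rather than exact. The dictionary labels are illustrative, and reading the 48-hour figure as a single-node Halvade run is an interpretation of the text.

```python
# Rounded runtimes in hours, as quoted in the text.
RUNTIME_HOURS = {
    "no Halvade, one node (estimated)": 288,
    "no Halvade, multithreaded tools, one node (measured)": 120,
    "Halvade, one node (interpretation of the text)": 48,
    "Halvade, 15 nodes": 3,
}


def speedup_over(baseline, target="Halvade, 15 nodes"):
    """How many times faster the target run is than the baseline run."""
    return RUNTIME_HOURS[baseline] / RUNTIME_HOURS[target]


for label in RUNTIME_HOURS:
    print(f"{label}: 15-node Halvade run is {speedup_over(label):.0f}x faster")
```

With these rounded inputs the single-node to 15-node ratio comes out at 16x, slightly above the node count. That is an artifact of rounding ("ca. 2 days", "less than 3 hours") and should not be read as a measured superlinear speed-up.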

Publications

Dries Decap, Joke Reumers, Charlotte Herzeel, Pascal Costanza, and Jan Fostier. “Halvade: scalable sequence analysis with MapReduce.” Bioinformatics 31 (15): 2482-2488 (2015). First published online March 26, 2015.

Configuration Table

System Overview

Nodes: 15 nodes, with 64 GB RAM each

Processor: 30 Intel® Xeon® E5-2695 v2 CPUs in total, each @ 2.40 GHz

Cores: 360 physical cores in total (720 threads)

RAM: 960 GB in total

Apache Hadoop* Distribution: Cloudera version 5.0.1b

Tasks per Node: 4 tasks per node, each task using 6 physical cores (12 threads)
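The totals in the configuration table follow from the per-node values. The short check below re-derives them; the variable names are illustrative, and the 12 cores per CPU and 2 hardware threads per core are inferred from the table's own totals (360 cores across 30 CPUs, 720 threads across 360 cores).

```python
# Values taken from the configuration table.
NODES = 15
RAM_PER_NODE_GB = 64
CPUS_TOTAL = 30
TASKS_PER_NODE = 4
CORES_PER_TASK = 6

# Inferred from the table's totals (360 cores / 30 CPUs, 720 threads / 360 cores).
CORES_PER_CPU = 12
THREADS_PER_CORE = 2

total_ram_gb = NODES * RAM_PER_NODE_GB            # 960 GB
total_cores = CPUS_TOTAL * CORES_PER_CPU          # 360 physical cores
total_threads = total_cores * THREADS_PER_CORE    # 720 hardware threads
total_tasks = NODES * TASKS_PER_NODE              # 60 Hadoop tasks
cores_in_use = total_tasks * CORES_PER_TASK       # cores claimed by all tasks

print(total_ram_gb, total_cores, total_threads, total_tasks, cores_in_use)
```

With 4 tasks of 6 cores on each node, all 24 physical cores per node are assigned, so the 60 tasks together occupy exactly the 360 cores listed in the table.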

Disclaimers

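¹ Benchmark results were obtained before the application of recent software patches and firmware updates intended to address the vulnerabilities known as "Spectre" and "Meltdown." Devices and systems to which these patches and updates have been applied may not produce the same results.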

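Software and workloads used in performance tests may have been optimized for performance on Intel® microprocessors. Performance tests such as SYSmark* and MobileMark* are measured using specific computer systems, components, software, operations, and functions. Results vary depending on these factors. When considering a purchase, you should consult other information and performance tests, including the performance of the product when combined with other products, to evaluate performance fully. For more information, see http://www.intel.com/benchmarks.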