Pairwise alignment of biological sequences is a core component of many bioinformatics tools. In genomics, it is an essential building block for read mapping, variant detection, de novo genome assembly, and many other methods. In its most common formulation, the pairwise alignment problem is solved by algorithms based on Dynamic Programming (DP). As a result, they require quadratic time and memory on the length of the sequences.
In the context of the DRAC project, researchers from BSC and UAB have developed the wavefront alignment algorithm (WFA) for exact pairwise alignment of sequences. This method exploits the similarities between the sequences to deliver exact alignments while outperforming in time and memory other state-of-the-art algorithms. To this end, the WFA computes alignments of increasing score using a succinct diagonal-transition representation (Figure 1). As a result, the WFA algorithm runs in O(ns) time, proportional to the sequence length (n) and the error between them (s).
Due to its simplicity, the WFA algorithm can be easily vectorized using SIMD instructions, as opposed to traditional DP-based algorithms. Moreover, data dependencies can be automatically understood by modern compilers in order to transparently issue SIMD instructions for any supported vectorial architecture. Furthermore, the WFA allows using a succint encoding to further enhance SIMD performance and reduce the memory footprint by 4x. As a result, the WFA runs 20-300x faster than other state-of-the-art methods aligning short Illumina-like sequences, and 10-100x faster using long noisy reads like those produced by Oxford Nanopore Technologies.
This work has been funded by the European Union Regional Development Fund within the framework of the ERDF Operational Program of Catalonia 2014-2020, with a grant of 50% of total cost eligible under the DRAC project, with number 001-P-001723.