Aligners for long reads produced by third-generation sequencers

By Nucleati Team

The fast-paced development of next-generation sequencing (NGS) has made portable genomic research possible. However, everything comes at a cost. Third-generation sequencing technologies have rendered some read alignment tools obsolete while laying the groundwork for new tools better suited to modern sequencing data. In general, these advances have made read alignment considerably more challenging: the generated datasets are large, and long reads have features that short reads do not, so alignment often requires more sophisticated analysis tools. Long reads are making the assembly of complex genomes a reality, but at the expense of coverage. Steady gains in accuracy and cost-effectiveness have made long-read sequencing the first choice for a broad range of genomics applications in both model and non-model organisms (Burgess, 2018). In contrast, short reads produce fragmented assemblies but are cost-effective and supported by a wide range of bioinformatics tools and pipelines.

Long-read sequencing technologies such as Pacific Biosciences (PacBio) single-molecule real-time (SMRT) sequencing and Oxford Nanopore Technologies (ONT) nanopore sequencing produce reads of 10 kb and longer (Pollard et al., 2018). Aligning reads to a reference genome is a crucial step in the majority of genomic analysis pipelines. Generally, long-read alignment algorithms adopt the same three-step approach as short-read aligners: indexing the reference, finding seed matches, and verifying candidate locations by alignment. Some long-read mappers divide each long read into short segments (e.g., 250 bp), align the segments individually, and then infer the mapping location of the long read from the adjacent alignment locations of its segments (Lin et al., 2018). Some mappers still use hash-based indexing, such as MOSAIK (Lee et al., 2018), Minimap2 (Li, 2018), and AGILE (Misra et al., 2011). Accuracy and sensitivity decrease as the sequencing error rate of long reads increases. Long-read aligners heuristically extract fewer seeds per unit of read length than short-read mappers do. They use hash-based minimizer schemes to increase the alignment rate compared with conventional seeding or BWT-FM indexing approaches.
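To make the minimizer idea concrete, here is a minimal sketch in Python. It is a toy illustration, not the implementation of any aligner named in this post: the function names (minimizers, build_index, candidate_position), the tiny k and w values, and the use of lexicographic order instead of a hash are all assumptions chosen for readability. It also ignores the reverse strand and sequencing errors.

```python
from collections import Counter


def minimizers(seq, k, w):
    """Return the sorted (position, k-mer) minimizers of seq.

    For every window of w consecutive k-mers, keep the lexicographically
    smallest k-mer (leftmost on ties). Real tools usually order k-mers
    by a hash value instead; plain string order is a simplification.
    """
    picked = set()
    n_kmers = len(seq) - k + 1
    for start in range(n_kmers - w + 1):
        window = [(seq[i:i + k], i) for i in range(start, start + w)]
        kmer, pos = min(window)
        picked.add((pos, kmer))
    return sorted(picked)


def build_index(reference, k, w):
    """Hash table from minimizer k-mer to its positions in the reference."""
    index = {}
    for pos, kmer in minimizers(reference, k, w):
        index.setdefault(kmer, []).append(pos)
    return index


def candidate_position(read, index, k, w):
    """Guess where the read starts by voting on (ref_pos - read_pos) offsets."""
    votes = Counter()
    for read_pos, kmer in minimizers(read, k, w):
        for ref_pos in index.get(kmer, []):
            votes[ref_pos - read_pos] += 1
    if not votes:
        return None  # no shared minimizer, so no candidate location
    offset, _ = votes.most_common(1)[0]
    return offset


# Hypothetical toy data: an error-free "read" cut from positions 10..29.
reference = "ACGTTGCATGCCGATAGGCTTAACGGATCCTAGGACTTGA"
read = reference[10:30]
index = build_index(reference, k=5, w=4)
print(candidate_position(read, index, k=5, w=4))  # prints 10
```

Only a handful of k-mers per read are stored and looked up, which is why minimizer sampling keeps the index small; a real aligner would follow this seeding step with chaining and base-level alignment.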

Here, we summarize some long-read alignment software, the underlying mapping algorithms, and references to the original articles. For a fuller comparison, we refer readers to the table published by Alser et al. (2021).

  • diBELLA is a distributed-memory overlapper and aligner designed for long, noisy, error-prone reads, with parallel scalability. Instead of performing all-to-all alignment, it looks for error-free k-mers and uses them to identify overlapping reads (Ellis et al., 2019).
  • Minimap2 is a fast long-read alignment program that can also find overlaps between long noisy reads. It uses minimizer-based indexing and a seeding algorithm with improved chaining, and it can produce CIGAR strings through fast extension alignment (Li, 2018).
  • GraphMap is a mapper built around a fast graph traversal algorithm for aligning long reads with high error rates. It is reported to increase mapping sensitivity by 10-80% while keeping precision high (>95%) (Sović et al., 2016).
  • NGMLR is a fast aligner that uses a convex gap-cost scoring model to align long reads across structural variant (SV) breakpoints, with high reported sensitivity and precision for variant calling (Sedlazeck et al., 2018).
  • MOSAIK is a stable aligner that couples a hash clustering algorithm with the Smith-Waterman algorithm to map reads generated by Illumina, Applied Biosystems SOLiD, Roche 454, Ion Torrent, and Pacific Biosciences SMRT platforms (Lee et al., 2018).
  • AGILE is a fast hash table-based aligner for long 454 reads. It uses diagonal multiple-seed match criteria and a dynamic incremental search algorithm to optimize every step of the mapping process (Misra et al., 2011).
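The convex gap-cost idea mentioned for NGMLR can be illustrated with a small Python sketch. The logarithmic formula and the parameter values below are assumptions picked for illustration only; they are not NGMLR's actual scoring function or defaults. The point is simply that, under a convex model, each extra gap base costs less than the previous one, so a single long gap (as expected at an SV breakpoint) is penalized far less than under an affine model.

```python
import math


def affine_gap_cost(length, gap_open=5.0, gap_extend=1.0):
    """Classic affine penalty: every extra gap base costs the same."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length


def convex_gap_cost(length, gap_open=5.0, gap_extend=1.0):
    """Illustrative convex penalty: the per-base cost shrinks as the gap grows.

    The log1p form is a hypothetical choice for this sketch, not a formula
    taken from any published aligner.
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * math.log1p(length)


# Compare the two models for short and long gaps.
for length in (1, 10, 100, 1000):
    print(length, affine_gap_cost(length), round(convex_gap_cost(length), 2))
```

With these toy parameters, a 1000 bp gap costs about 12 under the convex model versus 1005 under the affine one, so the aligner is less tempted to break a large insertion or deletion into many small mismatches and gaps.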

Long-read aligners are significantly faster at indexing, but their accuracy degrades on short reads derived from long reads. Most long-read aligners have memory requirements too high for everyday computing devices, but they work well on modern HPC architectures. Overall, the future is bright for third-generation sequencing technologies. Current read aligners need to strike a good balance between speed and memory usage. In addition, modern read aligners need to handle a range of technological challenges and be flexible enough to accommodate changes in read characteristics and error rates.

References