Tools to evaluate the quality of third-generation/long-read sequencing data

By Nucleati Team

Introduction

Third-generation sequencing, also known as long-read sequencing, uses technologies developed by Pacific Biosciences (PacBio) and Oxford Nanopore Technologies to elucidate the sequence of a DNA molecule. These technologies produce much longer reads, which are especially useful for resolving the repetitive regions of the human genome. Because these reads are significantly longer and have different error rates than those from traditional next-generation sequencers such as Illumina, they require specialized tools for downstream analysis. The first step in analyzing the usability of long-read sequencing data is determining the quality of the sequencing runs.

Tools to assess the quality of a long-read sequencing run

Several publicly available software tools evaluate the quality of sequencing results from third-generation sequencers. For example, NanoPack consists of Python 3 scripts that help visualize and process third-generation sequencing data from PacBio and Oxford Nanopore.

Within NanoPack are tools such as NanoPlot, NanoQC, and NanoStat, which, respectively, produce plots comparing read length and quality, evaluate nucleotide composition and quality distribution, and generate a statistical summary of the reads. Specifically, NanoPlot constructs read length histograms and violin plots to visualize read length and quality. Its bivariate plots compare read lengths, Phred quality scores, read mapping quality, and reference identity.
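To make the idea of a statistical summary concrete, here is a minimal Python sketch that computes a few common read metrics (read count, total bases, mean length, N50, and mean quality) from FASTQ text. It is not NanoStat's implementation. The function names are hypothetical, the parser assumes simple four-line FASTQ records with Phred+33 encoding, and averaging quality in error-probability space is one common convention rather than the only one.

```python
import math

def parse_fastq(lines):
    """Yield (sequence, quality_string) pairs from four-line FASTQ records."""
    lines = [line.rstrip("\n") for line in lines]
    # Assumption: each record is exactly header, sequence, '+', quality.
    for i in range(0, len(lines) - 3, 4):
        yield lines[i + 1], lines[i + 3]

def mean_phred(quality, offset=33):
    """Mean read quality, averaged over per-base error probabilities."""
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality]
    return -10 * math.log10(sum(probs) / len(probs))

def n50(lengths):
    """Length L such that reads of length >= L contain half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

def summarize(records):
    """Return a small dictionary of summary statistics for the reads."""
    records = list(records)
    lengths = [len(seq) for seq, _ in records]
    quals = [mean_phred(qual) for _, qual in records]
    return {
        "reads": len(records),
        "bases": sum(lengths),
        "mean_length": sum(lengths) / len(lengths),
        "n50": n50(lengths),
        "mean_quality": sum(quals) / len(quals),
    }

# Three made-up reads with uniform per-base quality (Q40, Q20, Q10).
example = """@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
ACGT
+
5555
@read3
ACGTAC
+
++++++
""".splitlines()

stats = summarize(parse_fastq(example))
print(stats)
```

For real data, the example lines would be replaced by an open file handle. Real FASTQ files can be compressed or very large, which is where dedicated tools earn their keep.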

Filtlong filters long reads by both quality and length [2]. By filtering out low-quality, low-identity, and short reads based on a user-defined number of bases, Filtlong can produce a smaller and better subset of reads.

Compared to the previously discussed methods for evaluating long-read sequencing data, MinIONQC is faster because it uses the sequencing summary files created by Oxford Nanopore's Albacore or Guppy base callers instead of slowly extracting data from the FASTQ or FAST5 files. The plots generated by MinIONQC show read lengths and their associated Q scores, and reads with a Q score above the default cutoff of 7 are considered good quality. Similarly, RabbitQC provides high-speed quality control, attaining speedups of up to two orders of magnitude compared to other quality control tools.

QUAST (QUality ASsessment Tool) evaluates genome assemblies by computing metrics such as the number of misassemblies, the number of mismatches, and the duplication ratio. QUAST works without a reference genome, but it provides more information when a reference genome is supplied as input.
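The length and quality filtering described above can be illustrated with a short Python sketch. This is a simplified threshold filter, not Filtlong's or MinIONQC's actual algorithm. The function names and the 1000-base minimum length are assumptions chosen for illustration, while the Q score cutoff of 7 mirrors the MinIONQC default mentioned above. A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q7 means roughly a 20% chance that a base call is wrong.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)

def mean_read_quality(quality, offset=33):
    """Mean quality of a read, averaged over per-base error probabilities."""
    probs = [phred_to_error_prob(ord(ch) - offset) for ch in quality]
    return -10 * math.log10(sum(probs) / len(probs))

def filter_reads(records, min_length=1000, min_quality=7.0):
    """Keep reads that pass both the length and the mean-quality threshold.

    min_length is a hypothetical value; min_quality follows the Q7 cutoff.
    """
    kept = []
    for name, seq, qual in records:
        if len(seq) >= min_length and mean_read_quality(qual) >= min_quality:
            kept.append((name, seq, qual))
    return kept

# Made-up reads: '5' encodes Q20 and '&' encodes Q5 in Phred+33.
reads = [
    ("long_good", "A" * 1500, "5" * 1500),   # long, high quality
    ("long_poor", "A" * 1500, "&" * 1500),   # long, below the Q7 cutoff
    ("short_good", "A" * 200, "5" * 200),    # high quality, too short
]

passed = filter_reads(reads)
print([name for name, _, _ in passed])  # ['long_good']
```

Dedicated tools go further than fixed thresholds. Filtlong, for example, ranks reads by a combined score, so a sketch like this is best treated as a way to reason about the cutoffs rather than as a replacement.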

Conclusion

Third-generation sequencers have revolutionized the fields of genome sequencing, biology, and medicine. Many tools, such as NanoPack, Filtlong, MinIONQC, RabbitQC, and QUAST, are available to assess the quality of the reads these sequencers produce, and they therefore play an essential role in making long-read data useful.

References