Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome
Aaron M. Wenger, Paul Peluso, […] Michael W. Hunkapiller
Nature Biotechnology, volume 37, pages 1155–1162
Abstract
The DNA sequencing technologies in use today produce either highly accurate short reads or less-accurate long reads. We report the optimization of circular consensus sequencing (CCS) to improve the accuracy of single-molecule real-time (SMRT) sequencing (PacBio) and generate highly accurate (99.8%) long high-fidelity (HiFi) reads with an average length of 13.5 kilobases (kb). We applied our approach to sequence the well-characterized human HG002/NA24385 genome and obtained precision and recall rates of at least 99.91% for single-nucleotide variants (SNVs), 95.98% for insertions and deletions <50 bp (indels) and 95.99% for structural variants. Our CCS method matches or exceeds the ability of short-read sequencing to detect small variants and structural variants. We estimate that 2,434 discordances are correctable mistakes in the ‘genome in a bottle’ (GIAB) benchmark set. Nearly all (99.64%) variants can be phased into haplotypes, further improving variant detection. De novo genome assembly using CCS reads alone produced a contiguous and accurate genome with a contig N50 of >15 megabases (Mb) and concordance of 99.997%, substantially outperforming assembly with less-accurate long reads.
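Two of the figures quoted in the abstract, the contig N50 and the percentage accuracies, are easier to interpret with a small worked example. The sketch below is not taken from the paper. It assumes the conventional definition of N50 (the contig length at which the cumulative length of contigs, sorted longest first, reaches half of the total assembly) and the standard Phred scale, Q = -10 log10(error rate). The function names and the toy contig lengths are hypothetical and chosen only for illustration.

```python
import math


def contig_n50(lengths):
    """Return the N50 of a list of contig lengths.

    Assumed definition: sort contigs from longest to shortest and return
    the length of the contig at which the running total first reaches
    at least half of the summed length of all contigs.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def phred_quality(accuracy):
    """Convert a fractional accuracy (e.g. 0.998) to a Phred-scaled quality."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10 * math.log10(1 - accuracy)


if __name__ == "__main__":
    # Toy contig lengths (hypothetical, not data from the paper).
    print(contig_n50([2, 3, 4, 5, 6]))        # 5
    # Read accuracy and assembly concordance as quoted in the abstract.
    print(f"{phred_quality(0.998):.1f}")      # 27.0
    print(f"{phred_quality(0.99997):.1f}")    # 45.2
```

On this scale, the 99.8% read accuracy corresponds to roughly Q27 and the 99.997% assembly concordance to roughly Q45, which makes the gap between per-read and consensus assembly accuracy easier to compare.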