
Evaluation of hybrid and non-hybrid methods for de novo assembly of nanopore reads

Date: 2019-11-10 07:39:33



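Motivation:

The advent of nanopore sequencing technology poses a challenge to existing assembly methods. In this work, we evaluate the performance of existing hybrid and non-hybrid de novo assembly methods on long and error-prone nanopore reads.

Results:

We benchmarked five non-hybrid (in terms of both error correction and scaffolding) assembly pipelines, as well as two hybrid assemblers that use third generation sequencing data to scaffold Illumina assemblies. Tests were performed on several publicly available MinION and Illumina datasets of Escherichia coli K-12, using several sequencing coverages of nanopore data (20x, 30x, 40x and 50x). We attempted to assess the assembly quality at each of these coverages, in order to estimate the requirements for closed bacterial genome assembly. For the purpose of the benchmark, an extensible genome assembly benchmarking framework was developed. The results show that hybrid methods depend strongly on the quality of the NGS data and much less on the quality and coverage of the nanopore data, and that they perform relatively well at lower nanopore coverages. All non-hybrid methods correctly assemble the E. coli genome when coverage is above 40x, even the non-hybrid method tailored for Pacific Biosciences reads. Although that method requires higher coverage than a method designed specifically for nanopore reads, its running time is significantly lower.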

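Over the past decade, next-generation sequencing (NGS) devices have dominated the genome sequencing market. Compared with the previously used Sanger sequencing, NGS is cheaper, less time-consuming and less labor-intensive. However, when it comes to de novo assembly of longer genomes, many researchers are skeptical of using NGS reads. These devices produce reads a few hundred base pairs long, which is too short to unambiguously resolve repetitive regions, even in relatively small microbial genomes (Nagarajan and Pop, ). Although the use of paired-end and mate-pair technologies has improved the accuracy and completeness of assembled genomes, NGS sequencing still produces highly fragmented assemblies because of long repetitive regions. These incomplete genomes have to be finished using more laborious approaches, including Sanger sequencing and specially tailored assembly methods. Owing to NGS, many efficient algorithms have been developed to optimize running time and memory footprint in sequence assembly, alignment and downstream analysis steps.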

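The need for technologies that produce longer reads, which could resolve the problem of repetitive regions, led to the emergence of new sequencing approaches, the so-called "third generation sequencing technologies".

The first among them was a single-molecule sequencing technology developed by Pacific Biosciences (PacBio).

Although PacBio sequencers produce much longer reads (up to several tens of thousands of base pairs), their reads have significantly higher error rates (10% to 15%) than NGS reads (1%) (Schirmer et al., ).

Existing assembly and alignment algorithms were not capable of handling such high error rates.

This led to the development of read error correction methods. At first, hybrid correction was performed using complementary NGS (Illumina) data (Koren et al., ).

Later, self-correction of PacBio-only reads was developed (Chin et al., ), which requires higher coverage (>50x).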

This required the development of new, more sensitive aligners (i.e. ( and )) and the optimization of existing ones (Li, ).

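Oxford Nanopore Technologies (ONT) then presented their tiny MinION sequencer, about the size of a harmonica.

The MinION can produce reads up to a few hundred thousand base pairs long. 1D reads from the MinION sequencer (with the latest R7.3 chemistry) have a raw base accuracy of less than 75%, while higher-quality 2D reads (80–88% accuracy) make up only a fraction of all 2D reads (Ip et al., ; Laver et al., ).

This again spurred the need for even more sensitive mapping and realignment algorithms, such as GraphMap (Sovic et al., ) and marginAlign (Jain et al., ).

Loman et al. demonstrated that assembling a bacterial genome (E. coli K-12) using solely ONT reads is possible, even with high error rates (Loman et al., ).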

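Thanks to the extremely long reads, affordability and availability of nanopore sequencing technology, these results may bring about a revolution in de novo sequence analysis in the near future.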

The majority of algorithms for de novo assembly follow either the de Bruijn graph (DBG) or the Overlap-Layout-Consensus (OLC) paradigm (Pop, ). OLC assemblers predate the DBG and were widely used in the Sanger sequencing era. A major representative of the OLC class is Celera, which was developed and maintained until very recently. The DBG approach attempted to solve the problem of ever-growing sequencing throughput brought on by the NGS technologies. Unlike OLC, in which overlaps between reads have to be calculated explicitly, DBG splits the reads into k-mers and constructs the overlap graph implicitly, e.g. through a hash table lookup. While assembly in the OLC paradigm attempts to find a Hamiltonian path through an overlap graph, the DBG attempts to solve a virtually simpler problem of finding an Eulerian path through a de Bruijn graph. It was later shown that both de Bruijn and overlap graphs can be transformed into string graph form, in which, similar to the DBG, an Eulerian path also needs to be found to obtain the assembly (Myers, ). The major differences lie in the implementation specifics of the two approaches. Although the DBG approach is faster, OLC-based algorithms perform better for longer reads (Pop, ). Additionally, DBG assemblers depend on finding exact-matching k-mers between reads (typically 21 to 127 bases long (Bankevich and Pevzner, )). Given the error rates in third generation sequencing data, this presents a serious limitation. The OLC approach, on the other hand, should be able to cope with higher error rates given a sensitive enough overlapper, but, contrary to the DBG, a time-consuming all-to-all pairwise comparison between input reads needs to be performed.
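
The DBG idea described above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not how any of the benchmarked assemblers is implemented: reads are assumed error-free, taken from one strand, and the genome is assumed to contain no repeated (k-1)-mer, so that a single Eulerian path exists. All function names (`build_de_bruijn`, `eulerian_path`, `spell`, `error_free_kmer_fraction`) are hypothetical and introduced only for illustration. The last helper additionally assumes that sequencing errors are independent and uniformly distributed, which real reads do not strictly satisfy.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are distinct k-mers."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):  # sorted only to make the result deterministic
        graph[kmer[:-1]].append(kmer[1:])  # edge: prefix -> suffix
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm. Assumes (does not check) that an Eulerian path exists."""
    in_deg = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            in_deg[node] += 1
    # Start at the node with one more outgoing than incoming edge, if there is one.
    start = next(
        (n for n, succ in graph.items() if len(succ) - in_deg[n] == 1),
        next(iter(graph)),
    )
    remaining = {node: list(succ) for node, succ in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    return path[::-1]


def spell(path):
    """Spell the sequence encoded by a path of overlapping (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


def error_free_kmer_fraction(k, error_rate):
    """Expected fraction of k-mers with no error, assuming independent per-base errors."""
    return (1.0 - error_rate) ** k


if __name__ == "__main__":
    reads = ["ATGCGT", "GCGTAC", "GTACCA"]  # toy, error-free reads
    print(spell(eulerian_path(build_de_bruijn(reads, k=4))))
    for rate in (0.01, 0.15):
        print(rate, round(error_free_kmer_fraction(21, rate), 3))
```

On the toy reads the sketch reconstructs ATGCGTACCA. Under the independence assumption, roughly 81% of 21-mers are error-free at a 1% error rate, but only about 3% at a 15% error rate, which illustrates why exact k-mer matching becomes a limitation for third generation reads.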
