
Joining the Party: Meituan YOLOv6 C++ Inference Deployment with ORT/MNN/TNN/NCNN

Date: 2019-12-25 03:44:10


Author: DefTruth@Zhihu (reposted with the author's permission)

Source: /p/533643238

Editor: 极市平台 (Jishi Platform)

Overview

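This article walks through the C++ engineering of YOLOv6, which Meituan open-sourced on June 24, across several inference engines: inference with ORT, MNN, NCNN and TNN, plus the issues to watch for when converting the model for each framework.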

0. Preface

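Yesterday Meituan open-sourced YOLOv6, another new member of the YOLO family, almost exactly one year after YOLOX was released. I have previously put together inference examples for many YOLO models, such as YOLOv3, YOLOv4, YOLOv5, YOLOX, YOLOR and YOLOP. Although I no longer work on detection, as a long-time YOLO fan I can always join the party. So this time I again provide examples for several inference engines, namely ONNXRuntime, MNN, NCNN and TNN, and briefly record the model conversion process. Overall, C++ inference for YOLOv6 is mostly repetitive work without much difficulty, so I knocked it out over the weekend. This article does not go into great detail; it only covers a few key points.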

1. ONNX and TNN model conversion

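In my tests, ONNX and TNN models converted directly from the checkpoints infer correctly. There is no need to modify YOLOv6's Detect source; the official deploy/ONNX/export_onnx.py converts them as-is. NCNN and MNN, however, both require modifying the Detect source before they infer correctly. So ONNX and TNN are covered in this section, and model conversion for MNN and NCNN in the next.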

First, clone the source:

git clone --depth=1 /meituan/YOLOv6.git

Then make a small change to export_onnx.py. The original script does not run onnxsim, so, as standard practice, we add it. The modified script is as follows:

```python
#!/usr/bin/env python3
# -*- coding:utf-8 -*-
import argparse
import time
import sys
import os
import torch
import torch.nn as nn
import onnx
import onnxsim
import onnxruntime as ort

ROOT = os.getcwd()
if str(ROOT) not in sys.path:
    sys.path.append(str(ROOT))

from yolov6.models.yolo import *
from yolov6.models.effidehead import Detect
from yolov6.layers.common import *  # RepVGGBlock, Conv, SiLU
from yolov6.utils.events import LOGGER
from yolov6.utils.checkpoint import load_checkpoint

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--weights', type=str, default='./yolov6s.pt', help='weights path')
    parser.add_argument('--img-size', nargs='+', type=int, default=[640, 640], help='image size')  # height, width
    parser.add_argument('--batch-size', type=int, default=1, help='batch size')
    parser.add_argument('--half', action='store_true', help='FP16 half-precision export')
    parser.add_argument('--inplace', action='store_true', help='set Detect() inplace=True')
    parser.add_argument('--device', default='0', help='cuda device, i.e. 0 or 0,1,2,3 or cpu')
    args = parser.parse_args()
    args.img_size *= 2 if len(args.img_size) == 1 else 1  # expand
    print(args)
    t = time.time()
    apply_simplify = True  # added: run onnxsim after export

    # Check device
    cuda = args.device != 'cpu' and torch.cuda.is_available()
    device = torch.device('cuda:0' if cuda else 'cpu')
    assert not (device.type == 'cpu' and args.half), '--half only compatible with GPU export, i.e. use --device 0'
    # Load PyTorch model
    model = load_checkpoint(args.weights, map_location=device, inplace=True, fuse=True)  # load FP32 model
    for layer in model.modules():
        if isinstance(layer, RepVGGBlock):
            layer.switch_to_deploy()

    # Input: dummy tensor of shape (batch, 3, h, w)
    img = torch.zeros(args.batch_size, 3, *args.img_size).to(device)

    # Update model
    if args.half:
        img, model = img.half(), model.half()  # to FP16
    model.eval()
    for k, m in model.named_modules():
        if isinstance(m, Conv):  # assign export-friendly activations
            if isinstance(m.act, nn.SiLU):
                m.act = SiLU()
        elif isinstance(m, Detect):
            m.inplace = args.inplace

    y = model(img)  # dry run

    # ONNX export
    h, w = args.img_size
    export_file = args.weights.replace('.pt', f'-{h}x{w}.onnx')  # added: input size in the file name
    try:
        LOGGER.info('\nStarting to export ONNX...')
        torch.onnx.export(model, img, export_file, verbose=False, opset_version=12,
                          training=torch.onnx.TrainingMode.EVAL,
                          do_constant_folding=True,
                          input_names=['image_arrays'],
                          output_names=['outputs'])
        # Checks
        onnx_model = onnx.load(export_file)  # load onnx model
        onnx.checker.check_model(onnx_model)  # check onnx model
        LOGGER.info(f'ONNX export success, saved as {export_file}')
    except Exception as e:
        LOGGER.info(f'ONNX export failure: {e}')

    if apply_simplify:  # added: onnxsim step
        print(f'{export_file} simplifying with onnx-simplifier {onnxsim.__version__}...')
        try:
            onnx_model = onnx.load(export_file)  # load onnx model
            onnx_model, check = onnxsim.simplify(onnx_model, check_n=3)
            assert check, 'simplifying check failed'
            onnx.save(onnx_model, export_file)
        except Exception as e:
            print(f'{export_file} simplifier failure: {e}')

    # Added: load the exported model with ONNXRuntime as a sanity check.
    # An explicit provider list is required by GPU builds of onnxruntime >= 1.9.
    sess = ort.InferenceSession(export_file, providers=['CPUExecutionProvider'])
    print(f"ORT Loaded {export_file}!")
    for _ in sess.get_inputs():
        print(f"Input: {_}")
    for _ in sess.get_outputs():
        print(f"Output: {_}")
    print("ORT Check Done!")

    # Finish
    LOGGER.info('\nExport complete (%.2fs)' % (time.time() - t))
```

The pretrained .pt files can be downloaded from the official links. Put them in the YOLOv6 root directory and convert them directly:

```shell
PYTHONPATH=. python3 ./deploy/ONNX/export_onnx.py --weights yolov6n.pt --img 640 --batch 1
PYTHONPATH=. python3 ./deploy/ONNX/export_onnx.py --weights yolov6n.pt --img 320 --batch 1
PYTHONPATH=. python3 ./deploy/ONNX/export_onnx.py --weights yolov6s.pt --img 640 --batch 1
PYTHONPATH=. python3 ./deploy/ONNX/export_onnx.py --weights yolov6s.pt --img 320 --batch 1
PYTHONPATH=. python3 ./deploy/ONNX/export_onnx.py --weights yolov6t.pt --img 640 --batch 1
```

This step went smoothly and I have not hit any pitfalls so far. Next, convert to TNN model files with the following commands:

```shell
convert2tnn# python3 ./converter.py onnx2tnn ./tnn_models/yolov6/yolov6t-640x640.onnx -o ./tnn_models/yolov6/ -v v1.0 -optimize -align
convert2tnn# python3 ./converter.py onnx2tnn ./tnn_models/yolov6/yolov6s-640x640.onnx -o ./tnn_models/yolov6/ -v v1.0 -optimize -align
convert2tnn# python3 ./converter.py onnx2tnn ./tnn_models/yolov6/yolov6n-640x640.onnx -o ./tnn_models/yolov6/ -v v1.0 -optimize -align
convert2tnn# python3 ./converter.py onnx2tnn ./tnn_models/yolov6/yolov6s-320x320.onnx -o ./tnn_models/yolov6/ -v v1.0 -optimize -align
convert2tnn# python3 ./converter.py onnx2tnn ./tnn_models/yolov6/yolov6n-320x320.onnx -o ./tnn_models/yolov6/ -v v1.0 -optimize -align
```

Converting to TNN requires tnn-convert. I won't go into how to use it here; interested readers can refer to an earlier article of mine:

Notes on setting up tnn-convert: converting YOLOP to TNN: /p/431418709

2. MNN and NCNN model conversion

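Both NCNN and MNN require modifying the Detect source before they infer correctly. MNN can in fact be converted directly, but although the resulting model runs, the results after decoding look wrong, so I ended up handling MNN the same way as NCNN. Many deployment articles on the YOLO family point out that the main problem when deploying these models is how to handle the decode logic in the Detect head. In YOLOv6 this code looks like this: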

```python
def forward(self, x):
    z = []
    for i in range(self.nl):
        # ...
        if self.training:
            ...  # unchanged
        else:
            y = torch.cat([reg_output, obj_output.sigmoid(), cls_output.sigmoid()], 1)
            bs, _, ny, nx = y.shape
            y = y.view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()
            if self.grid[i].shape[2:4] != y.shape[2:4]:
                d = self.stride.device
                yv, xv = torch.meshgrid([torch.arange(ny).to(d), torch.arange(nx).to(d)])
                self.grid[i] = torch.stack((xv, yv), 2).view(1, self.na, ny, nx, 2).float()
            if self.inplace:
                y[..., 0:2] = (y[..., 0:2] + self.grid[i]) * self.stride[i]  # xy
                y[..., 2:4] = torch.exp(y[..., 2:4]) * self.stride[i]  # wh
            else:
                xy = (y[..., 0:2] + self.grid[i]) * self.stride[i]  # xy
                wh = torch.exp(y[..., 2:4]) * self.stride[i]  # wh
                y = torch.cat((xy, wh, y[..., 4:]), -1)
            z.append(y.view(bs, -1, self.no))
    return x if self.training else torch.cat(z, 1)
```

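Some frameworks can export this part directly, but it produces a lot of glue ops. One option is therefore to export only the raw part before decode and do the decode on the C++ side. Note also that the decode contains a 5-dimensional operation: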

```python
y = y.view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()
```

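As far as I can tell, NCNN does not support this (as I understand ncnn::Mat, it has three dimensions c, h, w and assumes b = 1, so it can handle tensors of at most 4 dimensions), and it cannot be exported directly either. So this 5-D operation has to be modified as well. MNN can actually convert this decode logic directly, but the results I got at inference time were wrong, so I handled it the same way as NCNN: export only the part before decode and do the decode in C++. I later verified that this gives correct inference results.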

So how exactly should the Detect head logic be changed? Long story short, here is my modified code:

models/effidehead.py after the modification:

```python
class Detect(nn.Module):
    # ...
    def forward(self, x):
        z = []
        for i in range(self.nl):
            # ...
            if self.training:
                ...  # unchanged
            else:
                # After the modification. Note: obj_output and cls_output are kept
                # as raw logits (no sigmoid), so the C++ side must apply sigmoid.
                x[i] = torch.cat([reg_output, obj_output, cls_output], 1)
                bs, _, ny, nx = x[i].shape  # (bs, 85, ny, nx) since na = 1, no = 4 + 1 + 80
                # original 5-D version:
                # x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()
                x[i] = x[i].view(bs, self.na, 85, -1).permute(0, 1, 3, 2).contiguous()  # (b, self.na, 20x20, 85) for NCNN
                x[i] = x[i].view(bs, -1, 85)
        return torch.cat(x, dim=1)  # (b, ?, 85)
```

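Here we turn the original 5-D operation into a 4-D one. At export time, the actual values of no (num_outputs = 85) and na (num_anchors = 1) are known in advance; the configuration files in YOLOv6's configs folder confirm that na is always 1. ny and nx are the height and width of each feature map, and for a fixed input shape they are fixed too. To get a 4-D operation, we flatten the last two dimensions (ny, nx) into one; viewing memory as linear and row-major, this is valid. Hence: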

```python
x[i] = x[i].view(bs, self.na, 85, -1).permute(0, 1, 3, 2).contiguous()  # (b, self.na, 20x20, 85) for NCNN
x[i] = x[i].view(bs, -1, 85)
```
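To see why flattening (ny, nx) into one axis is safe, here is a minimal C++ sketch (not from the original post) of what `view(bs, 1, 85, -1).permute(0, 1, 3, 2).contiguous()` does to a contiguous buffer when bs = 1 and na = 1: the channel-major layout [no][ny*nx] becomes one row per grid cell, [ny*nx][no], and cell (y, x) lands in row y * nx + x. The helper names `chw_to_hwc_rows` and `value_at` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: emulates x.view(1, 1, no, -1).permute(0, 1, 3, 2).contiguous()
// on a contiguous (1, no, ny, nx) buffer, i.e. [no][ny*nx] -> [ny*nx][no].
std::vector<float> chw_to_hwc_rows(const std::vector<float>& in,
                                   std::size_t no, std::size_t ny, std::size_t nx) {
  const std::size_t hw = ny * nx;  // (ny, nx) flattened in row-major order
  assert(in.size() == no * hw);
  std::vector<float> out(in.size());
  for (std::size_t c = 0; c < no; ++c) {
    for (std::size_t p = 0; p < hw; ++p) {
      out[p * no + c] = in[c * hw + p];  // element (c, p) becomes element (p, c)
    }
  }
  return out;
}

// Hypothetical helper: value k of grid cell (y, x), which lives in row y * nx + x.
float value_at(const std::vector<float>& rows, std::size_t no, std::size_t nx,
               std::size_t y, std::size_t x, std::size_t k) {
  return rows[(y * nx + x) * no + k];
}
```

Because no 5-D view is needed, each row can be decoded independently once the grid cell (y, x) is known from its row index.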

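Finally, before returning, we concatenate the outputs of all levels. The C++ decoder then does not need to process each feature map separately; it only has to decode one large output.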

```python
return torch.cat(x, dim=1)  # (b, ?, 85)
```
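After concatenation, the rows of the single output are the levels' grids stacked in order. A small sketch for computing where each level starts and how many rows the output has in total, assuming three levels with strides 8, 16 and 32 (an assumption about the YOLOv6 models, not stated in the post) and an input size divisible by every stride. The function name is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: start row of each feature level inside the
// concatenated (1, n, 85) output; the last entry is the total row count n.
std::vector<int> level_row_offsets(int target_height, int target_width,
                                   const std::vector<int>& strides) {
  std::vector<int> offsets{0};
  for (int stride : strides) {
    assert(stride > 0 && target_height % stride == 0 && target_width % stride == 0);
    const int cells = (target_height / stride) * (target_width / stride);
    offsets.push_back(offsets.back() + cells);  // this level's rows follow the previous ones
  }
  return offsets;
}
```

Under these assumptions, a 640x640 input gives offsets 0, 6400 and 8000 with n = 8400 rows, and a 320x320 input gives n = 2100.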

With this change, export_onnx.py stays as it is and the original command exports the model directly (I added the suffix -for-ncnn to the file names to tell them apart).

As before, export to ONNX first.

The resulting model files:

```text
ls -lh | grep yolov6 | grep for-ncnn | grep onnx
-rw-r--r--  1 yanjunqiu  staff    16M Jun 25 12:52 yolov6n-320x320-for-ncnn.onnx
-rw-r--r--  1 yanjunqiu  staff    16M Jun 25 12:51 yolov6n-640x640-for-ncnn.onnx
-rw-r--r--  1 yanjunqiu  staff    66M Jun 25 12:53 yolov6s-320x320-for-ncnn.onnx
-rw-r--r--  1 yanjunqiu  staff    66M Jun 25 12:52 yolov6s-640x640-for-ncnn.onnx
-rw-r--r--  1 yanjunqiu  staff    57M Jun 25 12:53 yolov6t-640x640-for-ncnn.onnx
```

Opening one of them in Netron shows that the decode part is gone:

For comparison, the ONNX model with decode looks like this in Netron:

The difference is obvious: the graph without decode is much simpler at the output.

Next, convert the ONNX models to NCNN and MNN following the usual workflow. The NCNN commands are:

```shell
onnx2ncnn yolov6/yolov6s-320x320-for-ncnn.onnx yolov6/yolov6s-320x320-for-ncnn.param yolov6/yolov6s-320x320-for-ncnn.bin
ncnnoptimize yolov6/yolov6s-320x320-for-ncnn.param yolov6/yolov6s-320x320-for-ncnn.bin yolov6/yolov6s-320x320-for-ncnn.opt.param yolov6/yolov6s-320x320-for-ncnn.opt.bin 0
```

Everything works. The command to convert to MNN is:

```text
MNNConvert -f ONNX --modelFile yolov6n-640x640-for-ncnn.onnx --MNNModel yolov6n-640x640.mnn --bizCode MNN
Start to Convert Other Model Format To MNN Model...
[15:19:09] /Users/yanjunqiu/Desktop/third_party/library/MNN/tools/converter/source/onnx/onnxConverter.cpp:30: ONNX Model ir version: 6
Start to Optimize the MNN Net...
inputTensors : [ image_arrays, ]
outputTensors: [ outputs, ]
Converted Success!
```

Again, everything works.

3. C++ inference with the ONNX and TNN models

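The ONNX and TNN models include the decode, so post-processing is simpler and no anchors need to be generated. The model output has shape (1, n, 85), where n is the total number of anchors across all feature maps, and the 85 values are: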

85 = 5 + 80 = cx, cy (2) + w, h (2) + obj_conf (1) + cls_conf (80)
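A minimal sketch of turning one 85-value row of the ONNX/TNN output into a class label and score. It assumes the usual YOLOX/YOLOv5-style convention that the final score is obj_conf multiplied by the best cls_conf; the post does not show this step, and `RowScore` and `score_row` are hypothetical names. Since these exported graphs still contain the decode, obj_conf and cls_conf have already been passed through sigmoid.

```cpp
#include <cassert>
#include <cstddef>

struct RowScore {  // hypothetical result type
  int label;       // index of the best class, 0..num_classes-1
  float score;     // obj_conf * cls_conf[label]
};

// row points at 85 floats: cx, cy, w, h, obj_conf, cls_conf[0..79]
RowScore score_row(const float* row, std::size_t num_classes = 80) {
  assert(row != nullptr && num_classes > 0);
  const float obj_conf = row[4];
  const float* cls_conf = row + 5;
  int best = 0;
  for (std::size_t k = 1; k < num_classes; ++k) {
    if (cls_conf[k] > cls_conf[best]) best = static_cast<int>(k);  // keep the first max on ties
  }
  return {best, obj_conf * cls_conf[best]};
}
```

Rows whose score falls below a threshold would typically be dropped before the box conversion and NMS.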

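Since the outputs are already decoded cx, cy, w, h in pixel coordinates of the resized and padded network input, post-processing is simple: convert them to x1, y1, x2, y2, then remove the padding (dw_, dh_) and divide by the resize ratio r_ to map back to the original image. The logic is roughly as follows: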

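```cpp
// one row of the (1, n, 85) output; the box is already decoded by the model
float cx = offset_obj_cls_ptr[0];
float cy = offset_obj_cls_ptr[1];
float w = offset_obj_cls_ptr[2];
float h = offset_obj_cls_ptr[3];
// (cx, cy, w, h) -> (x1, y1, x2, y2), then subtract the padding (dw_, dh_)
// and divide by the resize ratio r_ to get original-image coordinates
float x1 = ((cx - w / 2.f) - (float) dw_) / r_;
float y1 = ((cy - h / 2.f) - (float) dh_) / r_;
float x2 = ((cx + w / 2.f) - (float) dw_) / r_;
float y2 = ((cy + h / 2.f) - (float) dh_) / r_;
```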
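The snippet above relies on dw_, dh_ and r_ from preprocessing, which the post does not show. Here is a sketch assuming a common centered letterbox: scale the image by r = min(target / source) to fit, then pad equally on both sides. The struct, the function names and the truncating rounding are hypothetical, so check them against the actual lite.ai.toolkit source before reusing.

```cpp
#include <algorithm>
#include <cassert>

struct Letterbox {  // hypothetical: plays the role of r_, dw_, dh_
  float r;          // resize ratio applied to the source image
  int dw;           // left padding in pixels
  int dh;           // top padding in pixels
};

// Assumes the image is scaled to fit target_w x target_h and centered.
Letterbox make_letterbox(int src_w, int src_h, int target_w, int target_h) {
  assert(src_w > 0 && src_h > 0);
  const float r = std::min(static_cast<float>(target_w) / src_w,
                           static_cast<float>(target_h) / src_h);
  const int new_w = static_cast<int>(src_w * r);
  const int new_h = static_cast<int>(src_h * r);
  return {r, (target_w - new_w) / 2, (target_h - new_h) / 2};
}

// Maps a (cx, cy, w, h) box in network-input pixels back to the source image.
void undo_letterbox(const Letterbox& lb, float cx, float cy, float w, float h,
                    float& x1, float& y1, float& x2, float& y2) {
  x1 = ((cx - w / 2.f) - static_cast<float>(lb.dw)) / lb.r;
  y1 = ((cy - h / 2.f) - static_cast<float>(lb.dh)) / lb.r;
  x2 = ((cx + w / 2.f) - static_cast<float>(lb.dw)) / lb.r;
  y2 = ((cy + h / 2.f) - static_cast<float>(lb.dh)) / lb.r;
}
```

For example, a 1280x720 image fed into a 640x640 input gives r = 0.5, dw = 0 and dh = 140 under this scheme.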

I won't go through the full inference code here; it is given at the end of the article.

4. C++ inference with the NCNN and MNN models

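The NCNN and MNN models do not include the decode, so post-processing is a bit more involved. It has two parts: first, generate the anchors; second, decode the coordinates from the anchors and the raw outputs. YOLOv6's anchor generation is essentially the same as YOLOX's: one anchor per grid point on each feature map. Anyone who has worked on detection should find this easy to follow, so I won't belabor it. Here is the code.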

Core logic of generate_anchors:

```cpp
void NCNNYOLOv6::generate_anchors(const int target_height,
                                  const int target_width,
                                  std::vector<int> &strides,
                                  std::vector<YOLOv6Anchor> &anchors)
{
  for (auto stride : strides)
  {
    int num_grid_w = target_width / stride;
    int num_grid_h = target_height / stride;
    for (int g1 = 0; g1 < num_grid_h; ++g1)
    {
      for (int g0 = 0; g0 < num_grid_w; ++g0)
      {
        YOLOv6Anchor anchor;
        anchor.grid0 = g0;
        anchor.grid1 = g1;
        anchor.stride = stride;
        anchors.push_back(anchor);
      }
    }
  }
}
```
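For readers who want to run the anchor logic on its own, here is a self-contained sketch of the same loop with a minimal `YOLOv6Anchor` struct (its exact definition in lite.ai.toolkit is an assumption here), using strides 8, 16 and 32 as an assumed configuration. The loop order, stride outermost, then rows, then columns, has to match the order in which the modified Detect concatenates levels and flattens (ny, nx); otherwise anchor i would be paired with the wrong output row.

```cpp
#include <cassert>
#include <vector>

struct YOLOv6Anchor {  // assumed minimal layout
  int grid0;   // column index (x) on the feature map
  int grid1;   // row index (y) on the feature map
  int stride;  // downsampling factor of this feature map
};

void generate_anchors(int target_height, int target_width,
                      const std::vector<int>& strides,
                      std::vector<YOLOv6Anchor>& anchors) {
  anchors.clear();
  for (int stride : strides) {  // levels in the same order as torch.cat(x, dim=1)
    assert(stride > 0);
    const int num_grid_w = target_width / stride;
    const int num_grid_h = target_height / stride;
    for (int g1 = 0; g1 < num_grid_h; ++g1) {    // rows first ...
      for (int g0 = 0; g0 < num_grid_w; ++g0) {  // ... then columns (row-major)
        anchors.push_back({g0, g1, stride});
      }
    }
  }
}
```

With a 640x640 input this yields 8400 anchors, one per row of the (1, n, 85) output.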

Core logic of coordinate decoding:

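```cpp
const int grid0 = anchors.at(i).grid0;
const int grid1 = anchors.at(i).grid1;
const int stride = anchors.at(i).stride;

// raw regression outputs (dw, dh here are box logits, not the padding dw_, dh_)
float dx = offset_obj_cls_ptr[0];
float dy = offset_obj_cls_ptr[1];
float dw = offset_obj_cls_ptr[2];
float dh = offset_obj_cls_ptr[3];

// same decode as the removed Python code: xy = (d + grid) * stride, wh = exp(d) * stride
float cx = (dx + (float) grid0) * (float) stride;
float cy = (dy + (float) grid1) * (float) stride;
float w = std::exp(dw) * (float) stride;
float h = std::exp(dh) * (float) stride;

// back to original-image coordinates
float x1 = ((cx - w / 2.f) - (float) dw_) / r_;
float y1 = ((cy - h / 2.f) - (float) dh_) / r_;
float x2 = ((cx + w / 2.f) - (float) dw_) / r_;
float y2 = ((cy + h / 2.f) - (float) dh_) / r_;
```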
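Because the modified Detect concatenates reg_output, obj_output and cls_output without applying sigmoid, the NCNN/MNN rows carry raw logits in their objectness and class entries, so a sigmoid has to be applied on the C++ side before scoring. A minimal sketch of decoding one raw row, combining the box formula above with that step; the names are hypothetical, and the score convention (objectness times best class probability) is assumed, following common YOLOX-style post-processing.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

struct DecodedBox {  // hypothetical result, in network-input pixels
  float cx, cy, w, h;
  float score;
  int label;
};

inline float sigmoid(float v) { return 1.f / (1.f + std::exp(-v)); }

// row: 85 raw values (dx, dy, dw, dh, obj_logit, cls_logits[80]) for one anchor
DecodedBox decode_raw_row(const float* row, int grid0, int grid1, int stride,
                          std::size_t num_classes = 80) {
  assert(row != nullptr && stride > 0 && num_classes > 0);
  DecodedBox b{};
  b.cx = (row[0] + static_cast<float>(grid0)) * static_cast<float>(stride);
  b.cy = (row[1] + static_cast<float>(grid1)) * static_cast<float>(stride);
  b.w = std::exp(row[2]) * static_cast<float>(stride);
  b.h = std::exp(row[3]) * static_cast<float>(stride);
  // sigmoid is monotonic, so the arg-max over logits equals the arg-max over probabilities
  int best = 0;
  for (std::size_t k = 1; k < num_classes; ++k) {
    if (row[5 + k] > row[5 + best]) best = static_cast<int>(k);
  }
  b.label = best;
  b.score = sigmoid(row[4]) * sigmoid(row[5 + best]);  // logits -> probabilities
  return b;
}
```

Taking the arg-max on logits first means only two sigmoids are evaluated per row instead of 81.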

I won't go through the full inference code here; it is given at the end of the article.

5. YOLOv6 C++ inference usage examples

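First, here is the C++ source for YOLOv6 on the four inference engines; I imagine what everyone cares about most is whether it can be used for free. All the code is integrated into the lite.ai.toolkit toolbox (/DefTruth/lite.ai.toolkit) and is free to use.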

YOLOv6 ONNXRuntime C++ source: /DefTruth/lite.ai.toolkit/blob/main/lite/ort/cv/yolov6.cpp

YOLOv6 MNN C++ source: /DefTruth/lite.ai.toolkit/blob/main/lite/mnn/cv/mnn_yolov6.cpp

YOLOv6 NCNN C++ source: /DefTruth/lite.ai.toolkit/blob/main/lite/ncnn/cv/ncnn_yolov6.cpp

YOLOv6 TNN C++ source: /DefTruth/lite.ai.toolkit/blob/main/lite/tnn/cv/tnn_yolov6.cpp

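Readers interested in the source can read the version for whichever inference engine they care about. Below are a few examples of calling YOLOv6 through the lite.ai.toolkit toolbox.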

ONNXRuntime version

```cpp
#include "lite/lite.h"

static void test_default()
{
  std::string onnx_path = "../../../hub/onnx/cv/yolov6s-640x640.onnx";
  std::string test_img_path = "../../../examples/lite/resources/test_lite_yolov6_1.jpg";
  std::string save_img_path = "../../../logs/test_lite_yolov6_1.jpg";

  // 1. Test Default Engine ONNXRuntime
  auto *yolov6 = new lite::cv::detection::YOLOv6(onnx_path); // default

  std::vector<lite::types::Boxf> detected_boxes;
  cv::Mat img_bgr = cv::imread(test_img_path);
  yolov6->detect(img_bgr, detected_boxes);

  lite::utils::draw_boxes_inplace(img_bgr, detected_boxes);
  cv::imwrite(save_img_path, img_bgr);

  std::cout << "Default Version Detected Boxes Num: " << detected_boxes.size() << std::endl;

  delete yolov6;
}
```

MNN version

```cpp
static void test_mnn()
{
#ifdef ENABLE_MNN
  std::string mnn_path = "../../../hub/mnn/cv/yolov6s-640x640.mnn";
  std::string test_img_path = "../../../examples/lite/resources/test_lite_yolov6_2.jpg";
  std::string save_img_path = "../../../logs/test_lite_yolov6_mnn_2.jpg";

  // 3. Test Specific Engine MNN
  auto *yolov6 = new lite::mnn::cv::detection::YOLOv6(mnn_path);

  std::vector<lite::types::Boxf> detected_boxes;
  cv::Mat img_bgr = cv::imread(test_img_path);
  yolov6->detect(img_bgr, detected_boxes);

  lite::utils::draw_boxes_inplace(img_bgr, detected_boxes);
  cv::imwrite(save_img_path, img_bgr);

  std::cout << "MNN Version Detected Boxes Num: " << detected_boxes.size() << std::endl;

  delete yolov6;
#endif
}
```

NCNN version

```cpp
static void test_ncnn()
{
#ifdef ENABLE_NCNN
  std::string param_path = "../../../hub/ncnn/cv/yolov6s-640x640-for-ncnn.opt.param";
  std::string bin_path = "../../../hub/ncnn/cv/yolov6s-640x640-for-ncnn.opt.bin";
  std::string test_img_path = "../../../examples/lite/resources/test_lite_yolov6_2.jpg";
  std::string save_img_path = "../../../logs/test_lite_yolov6_ncnn_2.jpg";

  // 4. Test Specific Engine NCNN
  auto *yolov6 = new lite::ncnn::cv::detection::YOLOv6(param_path, bin_path);

  std::vector<lite::types::Boxf> detected_boxes;
  cv::Mat img_bgr = cv::imread(test_img_path);
  yolov6->detect(img_bgr, detected_boxes);

  lite::utils::draw_boxes_inplace(img_bgr, detected_boxes);
  cv::imwrite(save_img_path, img_bgr);

  std::cout << "NCNN Version Detected Boxes Num: " << detected_boxes.size() << std::endl;

  delete yolov6;
#endif
}
```

TNN version

```cpp
static void test_tnn()
{
#ifdef ENABLE_TNN
  std::string proto_path = "../../../hub/tnn/cv/yolov6s-640x640.opt.tnnproto";
  std::string model_path = "../../../hub/tnn/cv/yolov6s-640x640.opt.tnnmodel";
  std::string test_img_path = "../../../examples/lite/resources/test_lite_yolov6_2.jpg";
  std::string save_img_path = "../../../logs/test_lite_yolov6_tnn_2.jpg";

  // 5. Test Specific Engine TNN
  auto *yolov6 = new lite::tnn::cv::detection::YOLOv6(proto_path, model_path);

  std::vector<lite::types::Boxf> detected_boxes;
  cv::Mat img_bgr = cv::imread(test_img_path);
  yolov6->detect(img_bgr, detected_boxes);

  lite::utils::draw_boxes_inplace(img_bgr, detected_boxes);
  cv::imwrite(save_img_path, img_bgr);

  std::cout << "TNN Version Detected Boxes Num: " << detected_boxes.size() << std::endl;

  delete yolov6;
#endif
}
```

The output results are as follows: