A first look at converting PyTorch to TensorRT via ONNX (Part 1)

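Converting a PyTorch model to TensorRT goes through ONNX as an intermediate format, so the first step is to export the PyTorch model to ONNX. ONNX essentially stores the network's computation graph in a framework-independent format.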

1.0 Installing onnx

pip install onnx
pip install onnxruntime

1.1 Converting a PyTorch model to ONNX

We use resnet18 as the example.

pytorch: 1.2.0
onnx: 1.7.0
cuda: 10.1

A simple PyTorch-to-ONNX example:

# -*- coding: utf-8 -*-
import onnx
import torch
import torchvision
import netron

net = torchvision.models.resnet18(pretrained=True).cuda()
# net.eval()

export_onnx_file = "./resnet18.onnx"
torch.onnx.export(net,  # model (and its parameters) to export
                  torch.randn(1, 3, 224, 224, device='cuda'),  # dummy input; fixes the input size and the size of every node in the traced graph
                  export_onnx_file,  # output file name
                  verbose=False,  # whether to print the graph as text
                  input_names=["input"] + ["params_%d" % i for i in range(120)],  # input node names; the extra names are assigned to the learnable parameters of each layer, which makes them easy to look up later
                  output_names=["output"],  # output node names
                  opset_version=10,  # ONNX operator set version; probably tied to the PyTorch version, 10 is the highest supported in my setup
                  do_constant_folding=True,  # whether to fold constants
                  dynamic_axes={"input": {0: "batch_size", 2: "h"}, "output": {0: "batch_size"}},  # dynamic dimensions: axes 0 and 2 of input may vary and are named batch_size and h
                  )

# Note: onnx is imported at the top of the file, before torch; importing it after torch caused a segmentation fault
onnx_model = onnx.load("./resnet18.onnx")  # load the ONNX graph
onnx.checker.check_model(onnx_model)  # check that the model is well formed
print(onnx.helper.printable_graph(onnx_model.graph))  # print the ONNX graph

import onnxruntime
import numpy as np

netron.start("./resnet18.onnx")

session = onnxruntime.InferenceSession("./resnet18.onnx")  # create a session, similar to TensorFlow
out_r = session.run(None, {"input": np.random.rand(16, 3, 256, 224).astype('float32')})  # run the model; the input must be a numpy array
print(len(out_r))
print(out_r[0].shape)

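A few things to note:

In export you can pass dynamic_axes, the dimensions of the input that are allowed to vary. Here the dummy input is 1x3x224x224 and dimensions 0 and 2 of input are declared dynamic, so at run time an input of size 16x3x256x224 is accepted.
Passing the dynamic_axes keyword triggers a warning: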

UserWarning: Provided key input for dynamic axes is not a valid input/output name
  warnings.warn("Provided key {} for dynamic axes is not a valid input/output name".format(key))

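This can be ignored; it is probably an issue with the current onnx version. See https://github.com/pytorch/pytorch/issues/25681

It is best to import onnx before torch, otherwise the call to onnx.checker runs into problems.
Visualizing the generated ONNX model with netron shows the names assigned to the parameters.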

[Figure: netron view of the exported ONNX model, showing the parameter names]
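The dynamic_axes rule from the notes above can be illustrated with a small pure-Python check. This is only a sketch of the idea: shape_is_allowed is a hypothetical helper, not part of onnx or onnxruntime, and it assumes that every axis not listed in dynamic_axes must keep the size of the dummy input.

```python
def shape_is_allowed(dummy_shape, dynamic_dims, actual_shape):
    """Return True if actual_shape differs from dummy_shape only on dynamic axes."""
    if len(dummy_shape) != len(actual_shape):
        return False
    return all(
        actual == traced or axis in dynamic_dims
        for axis, (traced, actual) in enumerate(zip(dummy_shape, actual_shape))
    )


dummy = (1, 3, 224, 224)             # shape of the dummy input passed to torch.onnx.export
dynamic = {0: "batch_size", 2: "h"}  # same mapping as dynamic_axes["input"]

print(shape_is_allowed(dummy, dynamic, (16, 3, 256, 224)))  # True: only batch and height changed
print(shape_is_allowed(dummy, dynamic, (16, 3, 256, 256)))  # False: width is not dynamic
```

The first call corresponds to the 16x3x256x224 input used with onnxruntime in the example; the second shows a shape that the exported model would not be expected to accept.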

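1.2 Installing the visualization tool netron

For installation and for use from Python, follow the official documentation.
This section covers how to view, on a local machine, a graph served from a remote server.
netron.start("...onnx") serves the graph for display in a browser, so the remote server's port has to be forwarded to the local machine. This can be done when opening the ssh connection, for example: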

ssh -L 127.0.0.1:8080:127.0.0.1:8080 zwzhou@172.18.32.238

This forwards port 8080 on the remote server to port 8080 on the local machine; 8080 is netron's default port.
The graph can then be opened in a local browser at

127.0.0.1:8080
or
localhost:8080
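If the page does not load, it can help to check whether anything is listening on the forwarded port at all. The sketch below uses only the Python standard library; port_is_open is a hypothetical helper name, and the host and port defaults simply mirror the ssh command above.

```python
import socket


def port_is_open(host="127.0.0.1", port=8080, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


print("netron reachable on 127.0.0.1:8080:", port_is_open())
```

A result of False on the local machine usually means that either the ssh tunnel is not up or netron is not running on the remote side.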

2 Converting ONNX to TensorRT

2.1 Installing TensorRT

First download the matching TensorRT release from the NVIDIA website. Here I used the TensorRT 7.0 deb package for CUDA 10.0:

dpkg -i nv-tensorrt-repo-ubuntu1604-cuda10.0-trt7.0.0.11-ga-20191216_1-1_amd64.deb
sudo apt-get update
sudo apt-get install tensorrt

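This failed because some dependencies were not installed:

[Screenshot: apt error listing the missing dependencies]

So I installed the dependencies: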

sudo apt-get install libnvinfer7 libnvinfer-plugin7

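CUDA dependencies were still missing:

[Screenshot: apt error listing the missing CUDA dependencies]

After some searching, the following link explained the cause:
https://www.cnblogs.com/wujianming-110117/p/12983439.html
I also found that the CUDA version reported by the driver on the server is 10.1 while nvcc -V reports 9.2, so I downloaded the TensorRT package built for CUDA 9.0 instead, and finally got it installed by following this blog post:
https://blog.csdn.net/zong596568821xp/article/details/86077553
That post suggests verifying the installation with dpkg -l | grep TensorRT, but in my case this printed nothing.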

Importing tensorrt in a Python session works:

import tensorrt
tensorrt.__version__  # outputs `7.0.0.11`
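Since dpkg gave no output here, checking from Python is the more direct test. The following sketch reports the versions of the packages used in this post without crashing when one of them is missing; installed_version is a hypothetical helper, and onnx is listed before torch to respect the import-order issue mentioned in section 1.1.

```python
import importlib
import importlib.util


def installed_version(module_name):
    """Return the module's __version__, "unknown" if it has none, or None if it cannot be imported."""
    if importlib.util.find_spec(module_name) is None:
        return None
    try:
        module = importlib.import_module(module_name)
    except (ImportError, OSError):
        # The package is present but a dependency or shared library failed to load.
        return None
    return getattr(module, "__version__", "unknown")


for name in ("onnx", "torch", "onnxruntime", "tensorrt"):
    print(name, installed_version(name))
```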

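2.2 The basic TensorRT workflow

TensorRT is still largely a mystery to me at this point, so let's start with its basic workflow.

[Figure: TensorRT workflow diagram]

1. Create a builder with the trt Logger as its argument, and use the builder to create the computation graph, an INetworkDefinition.
2. Use a parser to populate the graph from an ONNX (or other framework) model; the graph can also be built directly with the TensorRT API.
3. Build the CUDA engine from the graph.
4. Inference itself is run by an ExecutionContext created from the engine: engine.create_execution_context()

The Python example in section 2.3 walks through this flow.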

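The basic steps for running inference with an IExecutionContext:

1. Allocate memory, including the input and output buffers on the GPU: cuda.mem_alloc
2. Get the device addresses of the input and output buffers; int(input_mem) or similar returns them.
3. Copy the input data from CPU to GPU: cuda.memcpy_htod_async
4. Run inference with the engine: context.execute_async
5. Copy the output from GPU back to CPU: cuda.memcpy_dtoh_async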

2.3 Calling TensorRT on an ONNX model from Python

With the engine-building flow and the inference steps in mind, the following example is easy to follow.

#--*-- coding:utf-8 --*--
import pycuda.autoinit 
import pycuda.driver as cuda
import tensorrt as trt
import torch 
import time 
from PIL import Image
import cv2,os
import torchvision 
import numpy as np

filename = '/home/zwzhou/Code/img.png'
max_batch_size = 1
onnx_model_path = "./resnet18.onnx"

TRT_LOGGER = trt.Logger()

def get_img_np_nchw(filename):
    image = cv2.imread(filename)
    image_cv = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image_cv = cv2.resize(image_cv, (224, 224))
    miu = np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1)
    std = np.array([0.229, 0.224, 0.225]).reshape(3, 1, 1)
    img_np = np.array(image_cv, dtype=np.float64)/255.  # np.float was removed in recent NumPy releases
    img_np = img_np.transpose((2, 0, 1))
    img_np -= miu
    img_np /= std
    img_np_nchw = img_np[np.newaxis]
    img_np_nchw = np.tile(img_np_nchw,(max_batch_size, 1, 1, 1))
    return img_np_nchw

class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        """
        host_mem: cpu memory
        device_mem: gpu memory
        """
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host)+"\nDevice:\n"+str(self.device)

    def __repr__(self):
        return self.__str__()

def allocate_buffers(engine):
    inputs, outputs, bindings = [], [], []
    stream = cuda.Stream()
    for binding in engine:
        # print(binding) # name of the bound input/output
        # print(engine.get_binding_shape(binding)) # get_binding_shape returns the shape of the binding
        size = trt.volume(engine.get_binding_shape(binding))*engine.max_batch_size
        # volume returns the number of elements for an iterable shape
        # size = trt.volume(engine.get_binding_shape(binding)) # use this line instead if the ONNX model has a fixed batch size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # get_binding_dtype returns the data type of the binding
        # nptype converts it to the equivalent numpy dtype
        # allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)  # page-locked host memory
        device_mem = cuda.mem_alloc(host_mem.nbytes)    # device (GPU) memory
        # print(int(device_mem)) # device address of the binding's buffer
        bindings.append(int(device_mem))
        #append to the appropriate list
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream

def get_engine(max_batch_size=1, onnx_file_path="", engine_file_path="", fp16_mode=False, save_engine=False):
    """
    params max_batch_size:      fixed in advance so that GPU memory can be allocated
    params onnx_file_path:      path to the ONNX file
    params engine_file_path:    path of the serialized engine file to read or write
    params fp16_mode:           whether to use FP16
    params save_engine:         whether to save the engine
    returns:                    ICudaEngine
    """
    # If a serialized engine already exists, deserialize it into an ICudaEngine
    if os.path.exists(engine_file_path):
        print("Reading engine from file: {}".format(engine_file_path))
        with open(engine_file_path, 'rb') as f, \
            trt.Runtime(TRT_LOGGER) as runtime:
            return runtime.deserialize_cuda_engine(f.read())  # deserialize
    else:  # build the ICudaEngine from the ONNX file

        # Create a builder from the logger
        # The builder creates the computation graph, an INetworkDefinition
        explicit_batch = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
        # In TensorRT 7.0, the ONNX parser only supports full-dimensions mode, meaning that your network definition must be created with the explicitBatch flag set. For more information, see Working With Dynamic Shapes.

        with trt.Builder(TRT_LOGGER) as builder, \
            builder.create_network(explicit_batch) as network,  \
            trt.OnnxParser(network, TRT_LOGGER) as parser: # bind the ONNX parser to the graph; parsing fills it in later
            builder.max_workspace_size = 1<<30  # workspace size, i.e. the maximum scratch GPU memory the engine may use
            builder.max_batch_size = max_batch_size # largest batch size usable at run time
            builder.fp16_mode = fp16_mode

            # Parse the ONNX file to populate the graph
            if not os.path.exists(onnx_file_path):
                quit("ONNX file {} not found!".format(onnx_file_path))
            print('loading onnx file from path {} ...'.format(onnx_file_path))
            with open(onnx_file_path, 'rb') as model: # serialized network structure and parameters
                print("Beginning onnx file parsing")
                parser.parse(model.read())  # parse the ONNX file
            #parser.parse_from_file(onnx_file_path) # the parser can also parse directly from a file path

            print("Completed parsing of onnx file")
            # Once the graph is populated, use the builder to create the ICudaEngine from it
            print("Building an engine from file {}; this may take a while...".format(onnx_file_path))

            #################
            print(network.get_layer(network.num_layers-1).get_output(0).shape)
            # network.mark_output(network.get_layer(network.num_layers -1).get_output(0))
            engine=builder.build_cuda_engine(network)  # note: network is the INetworkDefinition, i.e. the populated graph
            print("Completed creating Engine")
            if save_engine:  # save the engine so it can be deserialized directly next time
                with open(engine_file_path, 'wb') as f:
                    f.write(engine.serialize())  # serialize
            return engine


def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer data from CPU to the GPU (htod: host to device).
    for inp in inputs:
        cuda.memcpy_htod_async(inp.device, inp.host, stream)
    # Run inference.
    # Use execute_async_v2 when the network was created with explicit batch, otherwise execute_async.
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU (dtoh: device to host).
    for out in outputs:
        cuda.memcpy_dtoh_async(out.host, out.device, stream)
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]

def postprocess_the_outputs(h_outputs, shape_of_output):
    h_outputs = h_outputs.reshape(*shape_of_output)
    return h_outputs

img_np_nchw = get_img_np_nchw(filename).astype(np.float32)
# Whether these modes are available depends on the hardware
fp16_mode = False
trt_engine_path = "./model_fp16_{}.trt".format(fp16_mode)
# Build a CUDA engine
engine = get_engine(max_batch_size, onnx_model_path, trt_engine_path, fp16_mode)
# Once the engine exists, an execution context has to be created for it on the target GPU
context = engine.create_execution_context()
inputs, outputs, bindings, stream = allocate_buffers(engine) # inputs, outputs: host/device buffer pairs; bindings: device addresses

# Do inference
shape_of_output = (max_batch_size, 1000)
# Load data to the buffer
inputs[0].host = img_np_nchw.reshape(-1)

# inputs[1].host = ... for multiple input
t1 = time.time()
trt_outputs = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream) # numpy data
t2 = time.time()
feat = postprocess_the_outputs(trt_outputs[0], shape_of_output)

print('TensorRT ok')

model = torchvision.models.resnet18(pretrained=True).cuda()
resnet_model = model.eval()
input_for_torch = torch.from_numpy(img_np_nchw).cuda()
t3 = time.time()
feat_2= resnet_model(input_for_torch)
t4 = time.time()
feat_2 = feat_2.cpu().data.numpy()
print('Pytorch ok!')

mse = np.mean((feat - feat_2)**2)
print("Inference time with the TensorRT engine: {}".format(t2-t1))
print("Inference time with the PyTorch model: {}".format(t4-t3))
print('MSE Error = {}'.format(mse))

print('All completed!')

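Where:

get_img_np_nchw and postprocess_the_outputs are the pre-processing and post-processing steps
get_engine implements the engine-creation flow described above
do_inference implements the inference steps
HostDeviceMem holds the CPU and GPU buffers of an input or output node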
Problems encountered:

At first I created the network without the explicit-batch flag, which produced the following error:

[TensorRT] ERROR: Network must have at least one output
[TensorRT] ERROR: Network validation failed.
Completed creating Engine
Traceback (most recent call last):
  File "/home/zwzhou/Code/test_tensorrt/onnx2tensorrt.py", line 149, in <module>
    context = engine.create_execution_context()
AttributeError: 'NoneType' object has no attribute 'create_execution_context'

After some searching I found:
In TensorRT 7.0, the ONNX parser only supports full-dimensions mode, meaning that your network definition must be created with the explicitBatch flag set. For more information, see Working With Dynamic Shapes. So the EXPLICIT_BATCH flag is required when converting ONNX to TensorRT.
https://github.com/NVIDIA/TensorRT/issues/183
https://github.com/NVIDIA/TensorRT/issues/183#issuecomment-657673563
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#import_onnx_python
https://github.com/onnx/onnx-tensorrt/issues/318
https://github.com/triton-inference-server/server/issues/76
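In the traceback above the script carries on after the parser has failed and only crashes later, on a None engine. OnnxParser.parse returns a boolean and the parser keeps a list of errors, so the failure can be surfaced at the point where it happens. A minimal sketch, assuming parser is a trt.OnnxParser (parse_onnx_or_raise is a hypothetical helper name):

```python
def parse_onnx_or_raise(parser, model_bytes):
    """Feed serialized ONNX bytes to a TensorRT OnnxParser and fail loudly on error."""
    if parser.parse(model_bytes):
        return
    # Collect every message the parser recorded before giving up.
    messages = [str(parser.get_error(i)) for i in range(parser.num_errors)]
    raise RuntimeError("ONNX parsing failed:\n" + "\n".join(messages))
```

In get_engine this would stand in for the bare parser.parse(model.read()) call, so that a parsing problem stops the script with the parser's own message instead of a later AttributeError.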

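When trying inference with bs>1, I first fixed the batch size in the ONNX model, but kept getting size mismatches. The reason is that, when buffers are allocated, get_binding_shape returns the full shape declared in the ONNX model (batch dimension included), so multiplying it by max_batch_size again is clearly wrong.
There are two ways around this:

Export the ONNX model with batch_size=1 (recommended)
Export the ONNX model with the target bs and do not multiply by max_batch_size when allocating buffers.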
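The arithmetic behind this can be checked without TensorRT. The sketch below uses a plain-Python volume function that, like trt.volume, multiplies the dimensions of a shape; the shapes are the resnet18 input shapes used in this post, and max_batch_size = 4 is just an illustrative value.

```python
from functools import reduce
from operator import mul


def volume(shape):
    """Number of elements in a tensor of the given shape (product of its dimensions)."""
    return reduce(mul, shape, 1)


max_batch_size = 4

# ONNX exported with batch size 1: the binding shape has a batch of 1,
# so multiplying by max_batch_size leaves room for 4 images.
print(volume((1, 3, 224, 224)) * max_batch_size)  # 602112

# ONNX exported with batch size 4: the batch is already part of the shape,
# so the extra factor makes the buffer 4 times too large.
print(volume((4, 3, 224, 224)) * max_batch_size)  # 2408448
print(volume((4, 3, 224, 224)))                   # 602112

# A dynamic axis shows up as -1, which makes the product negative and unusable as a size.
print(volume((-1, 3, 224, 224)))                  # -150528
```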

To select a GPU, os.environ['CUDA_VISIBLE_DEVICES'] = '2' must be set before pycuda.autoinit is imported.
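The ordering matters because importing pycuda.autoinit creates the CUDA context straight away. A small guard makes the mistake visible; select_gpu is a hypothetical helper, and the index 2 is the one used in the note above.

```python
import os
import sys


def select_gpu(index):
    """Restrict CUDA to one physical GPU; call this before any CUDA context is created."""
    if "pycuda.autoinit" in sys.modules:
        raise RuntimeError("pycuda.autoinit is already imported; set CUDA_VISIBLE_DEVICES first")
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)


select_gpu(2)
# Only after this point import the module that initializes CUDA, e.g.:
# import pycuda.autoinit
```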

Time in ms for different batch sizes (average over 10 runs)

bs	1	2	4	8	16	32	64
tensorrt	2.0	2.3	2.6	5.6	8.8	13.0	24.9
pytorch (after warm-up)	7.6	3.2	2.7	4.2	4.3	3.4	5.1

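Conclusion so far: when ONNX is converted to TensorRT through the Python API, bs=1 gets a speedup, but for larger bs TensorRT is actually slower than the original PyTorch model.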
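These numbers depend on how the clock is read. CUDA work is queued asynchronously, so a timestamp taken right after a PyTorch forward pass can be recorded before the GPU has finished. The script in section 2.3 reads time.time() around the PyTorch call without synchronizing, while do_inference does call stream.synchronize(); whether this affected the table is not known. A minimal sketch of a timing helper that waits for the GPU before stopping the clock (average_ms is a hypothetical name):

```python
import time


def average_ms(fn, sync=None, warmup=3, repeats=10):
    """Average wall-clock time of fn() in milliseconds.

    sync is an optional callable that blocks until all queued GPU work is done.
    """
    for _ in range(warmup):
        fn()
    if sync:
        sync()
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync:
            sync()  # wait for the GPU before reading the clock again
        total += time.perf_counter() - start
    return total * 1000.0 / repeats
```

For the PyTorch model this could be called as average_ms(lambda: resnet_model(input_for_torch), sync=torch.cuda.synchronize); for TensorRT, do_inference already synchronizes its stream, so sync can be left as None.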

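Note: possibly because of the TensorRT version, with the 7.0 Python API only the first sample of a batch produces a correct result; the outputs for all other samples are 0. For ways to address this, see: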
https://forums.developer.nvidia.com/t/output-of-batch-inference/72129
https://stackoverflow.com/questions/62584385/how-to-do-tensorrt-7-0-inference-for-batch-inputs-with-python-api
https://github.com/dusty-nv/jetson-inference/issues/280
https://github.com/dusty-nv/jetson-inference/issues/320
As a stopgap, I generate an ONNX model with a fixed batch size and then parse it with TensorRT. The results are then correct.
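A possible alternative, not tested on the setup in this post: in explicit-batch mode the batch size is part of the tensor shape, so a variable batch is normally handled by exporting only axis 0 as dynamic, registering an optimization profile at build time, and setting the actual input shape on the execution context before each run. The sketch below assumes builder is a trt.Builder, config comes from builder.create_builder_config(), and context is the engine's execution context; both function names are hypothetical.

```python
def add_batch_profile(builder, config, input_name, chw, min_bs, opt_bs, max_bs):
    """Register the allowed batch range for one input whose batch axis is dynamic."""
    chw = tuple(chw)
    profile = builder.create_optimization_profile()
    profile.set_shape(input_name,
                      (min_bs,) + chw,   # smallest shape the engine must accept
                      (opt_bs,) + chw,   # shape the engine is tuned for
                      (max_bs,) + chw)   # largest shape the engine must accept
    config.add_optimization_profile(profile)
    return profile


def set_runtime_batch(context, binding_index, batch_size, chw):
    """Tell the execution context the actual input shape before running inference."""
    if not context.set_binding_shape(binding_index, (batch_size,) + tuple(chw)):
        raise ValueError("shape rejected; is it inside the profile's min/max range?")


# Intended use (assumes the ONNX input is named "input" and is binding 0):
# config = builder.create_builder_config()
# add_batch_profile(builder, config, "input", (3, 224, 224), 1, 8, 16)
# engine = builder.build_engine(network, config)
# context = engine.create_execution_context()
# set_runtime_batch(context, 0, 8, (3, 224, 224))
```

With this approach the host and device buffers would be sized for the largest batch in the profile rather than from get_binding_shape, which reports -1 for the dynamic axis.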

References:
https://blog.csdn.net/quantum7/article/details/83380935
https://zhuanlan.zhihu.com/p/88318324

Next post
We will try converting ONNX to TensorRT with the C++ API.