How to correctly measure inference time in deep learning

I. Introduction

Network latency is an important factor to consider when deploying a deep learning network in a real application.
This article is structured as follows:

(1) The main processes that make GPU execution unique, including asynchronous execution and GPU warm-up
(2) Code samples for measuring time correctly on a GPU
(3) Some of the common mistakes people make when quantifying inference time on GPUs

II. Asynchronous execution

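Let us first discuss how the GPU executes work.
In multithreaded or multi-device programming, two independent blocks of code can be executed in parallel; that is, the second block may finish before the first. This is called asynchronous execution.
In deep learning we rely on asynchronous execution all the time, because GPU operations are asynchronous by default.
More specifically, when we call a function that uses the GPU, the operations are enqueued on that specific device, but not necessarily on other devices. This lets us run computations in parallel on the CPU or on another GPU.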

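In the figure below, the left side shows synchronous execution: process A waits for a response from process B before continuing. The right side shows asynchronous execution: process A continues without waiting for process B to finish.

[Figure: synchronous (left) versus asynchronous (right) execution]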

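Asynchronous execution is convenient for deep learning. When running inference over several batches, the second batch can be preprocessed on the CPU while the first batch is being fed through the network on the GPU.

When we measure time with Python's time library, the measurement runs on the CPU. Because the GPU works asynchronously, the line of code that stops the timer is executed before the GPU has finished its work, so the reported time is too short. Later sections describe how to measure time correctly under asynchronous execution.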
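The effect can be reproduced without a GPU. The sketch below is only an analogy, not PyTorch code: a worker thread stands in for the GPU, `slow_kernel` is a hypothetical stand-in for a GPU operation, and `future.result()` plays the role of a synchronization call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_kernel(duration_s: float) -> str:
    """Hypothetical stand-in for work that runs on another device."""
    time.sleep(duration_s)
    return "done"


def time_launch_only(duration_s: float) -> float:
    """Stop the clock right after submitting the work (the mistake)."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        start = time.perf_counter()
        future = pool.submit(slow_kernel, duration_s)
        elapsed = time.perf_counter() - start  # the work is still running here
        future.result()  # wait only so the pool can shut down cleanly
    return elapsed


def time_with_wait(duration_s: float) -> float:
    """Stop the clock only after the submitted work has finished."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        start = time.perf_counter()
        future = pool.submit(slow_kernel, duration_s)
        future.result()  # analogous to synchronizing with the device
        return time.perf_counter() - start


launch_only = time_launch_only(0.3)
with_wait = time_with_wait(0.3)
print(f"launch only: {launch_only * 1000:.1f} ms")
print(f"with wait:   {with_wait * 1000:.1f} ms")
```

The first number reflects only the cost of handing the work over, while the second includes the work itself. The same gap appears when a CPU timer is stopped before the GPU has finished.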

III. GPU warm-up

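A modern GPU device can be in one of several power states. When the GPU is not in use and persistence mode is not enabled, the GPU automatically reduces its power state to a very low level, sometimes even shutting down completely. In a low power state, the GPU shuts down different pieces of hardware, including memory subsystems, internal subsystems, or even compute cores and caches.

Any program that attempts to interact with the GPU causes the driver to load or initialize the GPU. This driver load behavior is worth noting: applications that trigger GPU initialization can incur up to 3 seconds of latency, due to the scrubbing behavior of the error-correcting code (ECC).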

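For example, suppose we time a network that takes 10 milliseconds per example. Running 1000 examples takes about 10 seconds of compute, so an initialization delay of up to 3 seconds can inflate the measured total by as much as 30%, and the fewer examples we run, the more the initialization dominates. A measurement taken this way is inaccurate.

Nor does it reflect a production environment, where the GPU is usually already initialized or running in persistence mode.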
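To make the arithmetic concrete, the short calculation below assumes the worst-case 3-second initialization and the 10 ms per-example cost from the example above. It shows what the mean per-example time looks like if the one-off initialization is included in the total. The function name and constants are illustrative only.

```python
INIT_SECONDS = 3.0          # worst-case initialization delay quoted above
PER_SAMPLE_SECONDS = 0.010  # 10 ms per example, as in the example above


def measured_mean_ms(num_samples: int) -> float:
    """Mean per-example time (ms) when the one-off init cost is included."""
    total_seconds = INIT_SECONDS + num_samples * PER_SAMPLE_SECONDS
    return total_seconds / num_samples * 1000.0


for n in (1, 100, 1000):
    print(f"{n:5d} examples -> measured mean {measured_mean_ms(n):.1f} ms")
```

Under these assumptions the measured mean is about 40 ms for 100 examples and about 13 ms for 1000 examples, against a true cost of 10 ms per example.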
So how do we deal with GPU initialization time when measuring?

IV. Measuring inference time correctly

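The PyTorch code below shows how to measure inference time correctly.
The network used here is EfficientNet-B0. The code handles the two issues described above (GPU warm-up and asynchronous execution).
Before taking any measurements, we run a few dummy examples through the network as a GPU warm-up. This initializes the GPU and keeps it from dropping back into a power-saving mode while we measure.
Next, we use torch.cuda.Event to measure time on the GPU.
Note that torch.cuda.synchronize() is essential here. It synchronizes the host and the device (i.e., the CPU and the GPU), so the elapsed time is read only after the work running on the GPU has finished. This takes care of the asynchronous execution problem.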

import numpy as np
import torch
from efficientnet_pytorch import EfficientNet  # assumes the efficientnet_pytorch package

model = EfficientNet.from_pretrained('efficientnet-b0')
device = torch.device("cuda")
model.to(device)
model.eval()  # inference behavior for dropout and batch norm layers
dummy_input = torch.randn(1, 3, 224, 224, dtype=torch.float).to(device)
starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
repetitions = 300
timings = np.zeros((repetitions, 1))
with torch.no_grad():
    # GPU warm-up: run a few dummy examples before timing anything
    for _ in range(10):
        _ = model(dummy_input)
    # Measure performance
    for rep in range(repetitions):
        starter.record()
        _ = model(dummy_input)
        ender.record()
        # Wait for the GPU to finish before reading the elapsed time
        torch.cuda.synchronize()
        curr_time = starter.elapsed_time(ender)  # milliseconds
        timings[rep] = curr_time
mean_syn = np.sum(timings) / repetitions
std_syn = np.std(timings)
print(mean_syn, std_syn)

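In this code, the record() method of the two torch.cuda.Event(enable_timing=True) objects marks the start and end points, torch.cuda.synchronize() waits for the GPU to finish its work, and starter.elapsed_time(ender) returns the time elapsed between the two events in milliseconds.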
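The same pattern (warm up first, then time many repetitions and synchronize before reading the clock) can be wrapped in a small helper. The sketch below is an alternative under stated assumptions, not the method used above: it uses a host-side timer (`time.perf_counter`) instead of CUDA events, and it takes the synchronization function as a parameter so that it runs on any machine. The names `benchmark`, `fn`, and `sync` are hypothetical; on a GPU one would presumably pass `torch.cuda.synchronize` as `sync`.

```python
import statistics
import time
from typing import Callable, Tuple


def benchmark(
    fn: Callable[[], object],
    sync: Callable[[], object] = lambda: None,
    warmup: int = 10,
    repetitions: int = 300,
) -> Tuple[float, float]:
    """Return (mean, population std) of the run time of fn in milliseconds."""
    # Warm-up runs are executed but never timed.
    for _ in range(warmup):
        fn()
    sync()  # make sure no warm-up work is still pending

    timings_ms = []
    for _ in range(repetitions):
        start = time.perf_counter()
        fn()
        sync()  # wait for the device before stopping the clock
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings_ms), statistics.pstdev(timings_ms)


# CPU-only demonstration with a trivial workload and the default no-op sync.
mean_ms, std_ms = benchmark(lambda: sum(range(10_000)), warmup=3, repetitions=50)
print(f"mean {mean_ms:.3f} ms, std {std_ms:.3f} ms")
```

Because the timer runs on the host, the result includes kernel launch overhead as well as GPU execution time, so it may differ slightly from event-based timings.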

V. Common mistakes when measuring time

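When we measure the latency of a network, our goal is to measure only the time consumed by the forward pass (feed-forward) of the network. The following are mistakes that are easy to make when measuring, along with their consequences: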

1. Transferring data between the host and the device (i.e., the CPU and the GPU)

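This mistake is usually made unintentionally. For example, a tensor is created on the CPU and inference is then performed on the GPU.
The resulting memory allocation and host-to-device copy take a considerable amount of time, which inflates the measured inference time.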

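This mistake affects both the mean and the variance of the measurements, as the figure below shows; the horizontal axis is the timing method and the vertical axis is time in milliseconds:

[Figure: measured time in milliseconds for each timing method]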

2. Not using GPU warm-up

As mentioned above, the first run on the GPU triggers its initialization. GPU initialization can take up to 3 seconds, which makes a huge difference when the timings are on the order of milliseconds.

3. Using standard CPU timing

The most common mistake is to measure time without synchronization. Even experienced programmers have been known to use the following piece of code.

s = time.time()
_ = model(dummy_input)
curr_time = (time.time() - s) * 1000  # milliseconds, but the GPU may still be running

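This code ignores asynchronous execution entirely and therefore reports incorrect times. The effect of this mistake on the mean and variance of the measurements is shown below; the horizontal axis is the timing method and the vertical axis is time in milliseconds:

[Figure: measured time in milliseconds for each timing method]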

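4. Using only one example to measure time

The forward pass of a neural network has a (small) stochastic component. The variance of the run time can be significant, especially when measuring a low-latency network.
It is therefore important to run the network over several examples and average the results (300 examples can be a good number).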

VI. Measuring throughput

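The throughput of a neural network is defined as the maximal number of input instances the network can process in a unit of time (e.g., one second).
Unlike latency, which concerns the processing of a single instance, maximal throughput requires processing as many instances as possible in parallel. The effective parallelism is data-, model-, and device-dependent.
To measure throughput correctly, we therefore perform the following two steps:
(1) Estimate the optimal batch size that allows maximal parallelism.
(2) Given this optimal batch size, measure the number of instances the network can process in one second.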

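How do we find the optimal batch size? A good rule of thumb is to use the batch size that reaches the memory limit of the GPU for the given data type. This batch size naturally depends on the hardware type and the size of the network. The quickest way to find the maximal batch size is a binary search.
When the time spent searching is not a concern, a simple sequential search is sufficient: use a for loop that increases the batch size by one on each iteration until a RuntimeError is raised. The last batch size that ran without an error is the largest one the GPU can process for our network and input data.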
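The binary search mentioned above can be sketched as follows. This is a minimal illustration, not a definitive implementation: `find_max_batch_size`, `fits`, and `upper_limit` are hypothetical names, and the search assumes that if a batch size fits in memory, every smaller one fits too. In practice `fits` would presumably run a forward pass at the given batch size and return False when a RuntimeError (out of memory) is raised; here a stand-in predicate is used so the sketch runs without a GPU.

```python
from typing import Callable


def find_max_batch_size(fits: Callable[[int], bool], upper_limit: int = 4096) -> int:
    """Return the largest batch size in [1, upper_limit] for which fits() is True.

    Returns 0 if even a batch size of 1 does not fit.
    Assumes fits() is monotonic: once it fails, it fails for all larger sizes.
    """
    low, high = 0, upper_limit  # the answer always lies in [low, high]
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if fits(mid):
            low = mid        # mid works, so the answer is at least mid
        else:
            high = mid - 1   # mid fails, so the answer is below mid
    return low


# Stand-in predicate: pretend that batches of up to 37 examples fit in memory.
def fake_fits(batch_size: int) -> bool:
    return batch_size <= 37


print(find_max_batch_size(fake_fits))
```

With an upper limit of 4096 the search needs about 12 probes, whereas the sequential search needs one probe per batch size tried.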
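Once we have found the maximal batch size, we compute the actual throughput. We run many batches (100 batches are enough) and use the following formula to compute the number of samples the network can process in one second:
(number of batches × batch size) / (total time in seconds)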

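import torch
from efficientnet_pytorch import EfficientNet  # assumes the efficientnet_pytorch package

model = EfficientNet.from_pretrained('efficientnet-b0')
device = torch.device("cuda")
model.to(device)
model.eval()  # inference behavior for dropout and batch norm layers
optimal_batch_size = 64  # placeholder value: replace with the batch size found by the search above
dummy_input = torch.randn(optimal_batch_size, 3, 224, 224, dtype=torch.float).to(device)
repetitions = 100
total_time = 0
with torch.no_grad():
    # GPU warm-up (not timed)
    for _ in range(10):
        _ = model(dummy_input)
    for rep in range(repetitions):
        starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
        starter.record()
        _ = model(dummy_input)
        ender.record()
        torch.cuda.synchronize()
        curr_time = starter.elapsed_time(ender) / 1000  # elapsed_time returns milliseconds
        total_time += curr_time
throughput = (repetitions * optimal_batch_size) / total_time
print('Final Throughput:', throughput)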