Playing with YOLO on iOS

This post describes how to implement YOLOv2 on an iOS device.
Paper: https://arxiv.org/abs/1612.08242
Github: https://github.com/pjreddie/darknet
Website: https://pjreddie.com/darknet/yolo

A Brief Introduction to YOLOv2

YOLOv2 takes a 416x416 input and passes it through a series of convolution, BN, and pooling operations, ending with a 13x13x125 feature map. The 13x13 grid corresponds to dividing the original image into 13x13 cells, as shown in the figure below.


[Figure: the input image divided into a 13x13 grid of cells]

The 125 channels come from 5 x (5 + 20): each cell predicts 5 bounding boxes (one for each of the 5 anchors), and each bounding box has x, y, w, h, a confidence score (the probability that the box contains an object), and 20 class probabilities (the PASCAL VOC dataset has 20 classes).
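To make the 125-channel layout concrete, here is a minimal Swift sketch of the constants involved. The box-major channel ordering (channel = box index * 25 + field) is an assumption that matches the decode code shown later in this post.

// Layout of the 13x13x125 output, assuming channels are grouped per box:
// channel = b * 25 + field, where fields 0..4 = x, y, w, h, confidence
// and fields 5..24 = the 20 class scores.
let gridSize   = 13                        // 13x13 grid of cells
let numBoxes   = 5                         // bounding boxes (anchors) per cell
let numClasses = 20                        // PASCAL VOC classes
let boxStride  = 5 + numClasses            // x, y, w, h, confidence + 20 classes
let outputChannels = numBoxes * boxStride  // 5 * 25 = 125

// Hypothetical helper: channel index of `field` for box `b` within a cell.
func channelIndex(box b: Int, field: Int) -> Int {
    return b * boxStride + field
}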
The anchor values are obtained by running k-means clustering on the boxes in the training set. The distance used by k-means here is not the Euclidean distance but the following IOU-based score:

d(box, centroid) = 1 - IOU(box, centroid)
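
For illustration, here is a hedged Swift sketch of how anchors could be clustered with k-means under the 1 - IOU distance. This is not darknet's implementation; it simplifies things by assuming every box is centered at the origin (so IOU depends only on width and height) and uses a naive initialization.

// Simplified anchor clustering: k-means on (w, h) pairs with d = 1 - IOU.
struct Box { var w: Float; var h: Float }

// IOU of two boxes that share the same center.
func iou(_ a: Box, _ b: Box) -> Float {
    let intersection = min(a.w, b.w) * min(a.h, b.h)
    let union = a.w * a.h + b.w * b.h - intersection
    return intersection / union
}

func kMeansAnchors(boxes: [Box], k: Int, iterations: Int = 100) -> [Box] {
    // Naive initialization: use the first k boxes as centroids (requires boxes.count >= k).
    var centroids = Array(boxes.prefix(k))
    for _ in 0..<iterations {
        // Assign each box to the centroid with the smallest 1 - IOU distance.
        var clusters = [[Box]](repeating: [], count: k)
        for box in boxes {
            let nearest = (0..<k).min { 1 - iou(box, centroids[$0]) < 1 - iou(box, centroids[$1]) }!
            clusters[nearest].append(box)
        }
        // Move each centroid to the mean width/height of its cluster.
        for i in 0..<k where !clusters[i].isEmpty {
            let n = Float(clusters[i].count)
            centroids[i] = Box(w: clusters[i].map { $0.w }.reduce(0, +) / n,
                               h: clusters[i].map { $0.h }.reduce(0, +) / n)
        }
    }
    return centroids
}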

How do we implement it on iOS?

Since model size and speed have to be considered when running on a phone, we choose Tiny YOLO. Its network structure is as follows:

Layer         kernel  stride  output shape
---------------------------------------------
Input                          (416, 416, 3)
Convolution    3×3      1      (416, 416, 16)
MaxPooling     2×2      2      (208, 208, 16)
Convolution    3×3      1      (208, 208, 32)
MaxPooling     2×2      2      (104, 104, 32)
Convolution    3×3      1      (104, 104, 64)
MaxPooling     2×2      2      (52, 52, 64)
Convolution    3×3      1      (52, 52, 128)
MaxPooling     2×2      2      (26, 26, 128)
Convolution    3×3      1      (26, 26, 256)
MaxPooling     2×2      2      (13, 13, 256)
Convolution    3×3      1      (13, 13, 512)
MaxPooling     2×2      1      (13, 13, 512)
Convolution    3×3      1      (13, 13, 1024)
Convolution    3×3      1      (13, 13, 1024)
Convolution    1×1      1      (13, 13, 125)
---------------------------------------------

The whole network has only nine convolution layers. Note that the BN layers have already been removed from this inference network. Also, the last max-pooling layer does not change the feature map size, so its stride is 1.
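
For context, removing BN at inference time is typically done by folding each BN layer's parameters into the preceding convolution's weights and biases. The post does not show that step; below is a minimal sketch of the arithmetic, assuming one gamma/beta/mean/variance value per output channel, weights laid out with the output channel as the outermost dimension, and an illustrative epsilon.

// Fold a BatchNorm layer into the preceding convolution (per output channel):
//   scale = gamma / sqrt(variance + eps)
//   w'    = w * scale
//   b'    = beta + (b - mean) * scale
func foldBatchNorm(weights: [Float], biases: [Float],
                   gamma: [Float], beta: [Float],
                   mean: [Float], variance: [Float],
                   epsilon: Float = 1e-5) -> (weights: [Float], biases: [Float]) {
    let outChannels = biases.count
    let weightsPerChannel = weights.count / outChannels
    var newWeights = weights
    var newBiases = biases
    for c in 0..<outChannels {
        let scale = gamma[c] / (variance[c] + epsilon).squareRoot()
        for i in 0..<weightsPerChannel {
            newWeights[c * weightsPerChannel + i] *= scale
        }
        newBiases[c] = beta[c] + (biases[c] - mean[c]) * scale
    }
    return (newWeights, newBiases)
}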
The code that builds the YOLO network with Metal is as follows:

    public init(device: MTLDevice) {
        print("Setting up neural network...")
        let startTime = CACurrentMediaTime()
        
        self.device = device
        commandQueue = device.makeCommandQueue()
        
        conv9_img = MPSImage(device: device, imageDescriptor: conv9_id) // holds the final 13x13x125 output so it can be read back after the GPU finishes
        
        lanczos = MPSImageLanczosScale(device: device) // resizes the input texture to 416x416
        
        let relu = MPSCNNNeuronReLU(device: device, a: 0.1) // leaky ReLU with slope 0.1
        
        
        conv1 = SlimMPSCNNConvolution(kernelWidth: 3,
                                         kernelHeight: 3,
                                         inputFeatureChannels: 3,
                                         outputFeatureChannels: 16,
                                         neuronFilter: relu,
                                         device: device,
                                         kernelParamsBinaryName: "conv1",
                                         padding: true,
                                         strideXY: (1,1))
        maxpooling1 = MPSCNNPoolingMax(device: device, kernelWidth: 2, kernelHeight: 2, strideInPixelsX: 2, strideInPixelsY: 2)
        conv2 = SlimMPSCNNConvolution(kernelWidth: 3,
                                      kernelHeight: 3,
                                      inputFeatureChannels: 16,
                                      outputFeatureChannels: 32,
                                      neuronFilter: relu,
                                      device: device,
                                      kernelParamsBinaryName: "conv2",
                                      padding: true,
                                      strideXY: (1,1))
        maxpooling2 = MPSCNNPoolingMax(device: device, kernelWidth: 2, kernelHeight: 2, strideInPixelsX: 2, strideInPixelsY: 2)
        conv3 = SlimMPSCNNConvolution(kernelWidth: 3,
                                      kernelHeight: 3,
                                      inputFeatureChannels: 32,
                                      outputFeatureChannels: 64,
                                      neuronFilter: relu,
                                      device: device,
                                      kernelParamsBinaryName: "conv3",
                                      padding: true,
                                      strideXY: (1,1))
        maxpooling3 = MPSCNNPoolingMax(device: device, kernelWidth: 2, kernelHeight: 2, strideInPixelsX: 2, strideInPixelsY: 2)
        conv4 = SlimMPSCNNConvolution(kernelWidth: 3,
                                      kernelHeight: 3,
                                      inputFeatureChannels: 64,
                                      outputFeatureChannels: 128,
                                      neuronFilter: relu,
                                      device: device,
                                      kernelParamsBinaryName: "conv4",
                                      padding: true,
                                      strideXY: (1,1))
        maxpooling4 = MPSCNNPoolingMax(device: device, kernelWidth: 2, kernelHeight: 2, strideInPixelsX: 2, strideInPixelsY: 2)
        conv5 = SlimMPSCNNConvolution(kernelWidth: 3,
                                      kernelHeight: 3,
                                      inputFeatureChannels: 128,
                                      outputFeatureChannels: 256,
                                      neuronFilter: relu,
                                      device: device,
                                      kernelParamsBinaryName: "conv5",
                                      padding: true,
                                      strideXY: (1,1))
        maxpooling5 = MPSCNNPoolingMax(device: device, kernelWidth: 2, kernelHeight: 2, strideInPixelsX: 2, strideInPixelsY: 2)
        conv6 = SlimMPSCNNConvolution(kernelWidth: 3,
                                      kernelHeight: 3,
                                      inputFeatureChannels: 256,
                                      outputFeatureChannels: 512,
                                      neuronFilter: relu,
                                      device: device,
                                      kernelParamsBinaryName: "conv6",
                                      padding: true,
                                      strideXY: (1,1))
        maxpooling6 = MPSCNNPoolingMax(device: device, kernelWidth: 2, kernelHeight: 2, strideInPixelsX: 1, strideInPixelsY: 1)
        //offset setting is necessary to make sure 13x13->13x13 after pooling
        maxpooling6.offset = MPSOffset(x: 2, y: 2, z: 0)
        maxpooling6.edgeMode = MPSImageEdgeMode.clamp
        conv7 = SlimMPSCNNConvolution(kernelWidth: 3,
                                      kernelHeight: 3,
                                      inputFeatureChannels: 512,
                                      outputFeatureChannels: 1024,
                                      neuronFilter: relu,
                                      device: device,
                                      kernelParamsBinaryName: "conv7",
                                      padding: true,
                                      strideXY: (1,1))
        conv8 = SlimMPSCNNConvolution(kernelWidth: 3,
                                      kernelHeight: 3,
                                      inputFeatureChannels: 1024,
                                      outputFeatureChannels: 1024,
                                      neuronFilter: relu,
                                      device: device,
                                      kernelParamsBinaryName: "conv8",
                                      padding: true,
                                      strideXY: (1,1))
        conv9 = SlimMPSCNNConvolution(kernelWidth: 1,
                                      kernelHeight: 1,
                                      inputFeatureChannels: 1024,
                                      outputFeatureChannels: 125,
                                      neuronFilter: nil,
                                      device: device,
                                      kernelParamsBinaryName: "conv9",
                                      padding: false,
                                      strideXY: (1,1))
        
        let endTime = CACurrentMediaTime()
        print("Elapsed time: \(endTime - startTime) sec")
    }
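
The init above only builds the layers; the forward pass is not shown in the post. Below is a hedged sketch of how the encode step could look, assuming SlimMPSCNNConvolution uses the standard MPSCNNKernel encode(commandBuffer:sourceImage:destinationImage:) signature and reusing the layer properties created above. The method name predict and the local helper are illustrative, not from the original project.

// Sketch of a possible forward pass. Intermediate results live in
// MPSTemporaryImages; sizes follow the layer table earlier in the post.
public func predict(inputTexture: MTLTexture) {
    guard let commandBuffer = commandQueue.makeCommandBuffer() else { return }

    // Helper: a float16 temporary image of the given size on this command buffer.
    func temp(_ w: Int, _ h: Int, _ ch: Int) -> MPSTemporaryImage {
        let d = MPSImageDescriptor(channelFormat: .float16, width: w, height: h, featureChannels: ch)
        return MPSTemporaryImage(commandBuffer: commandBuffer, imageDescriptor: d)
    }

    // Resize the camera texture to the 416x416 network input.
    let input_img = temp(416, 416, 3)
    lanczos.encode(commandBuffer: commandBuffer, sourceTexture: inputTexture,
                   destinationTexture: input_img.texture)

    // conv1..conv6, each followed by max-pooling (shapes as in the table above).
    let c1 = temp(416, 416, 16); conv1.encode(commandBuffer: commandBuffer, sourceImage: input_img, destinationImage: c1)
    let p1 = temp(208, 208, 16); maxpooling1.encode(commandBuffer: commandBuffer, sourceImage: c1, destinationImage: p1)
    let c2 = temp(208, 208, 32); conv2.encode(commandBuffer: commandBuffer, sourceImage: p1, destinationImage: c2)
    let p2 = temp(104, 104, 32); maxpooling2.encode(commandBuffer: commandBuffer, sourceImage: c2, destinationImage: p2)
    let c3 = temp(104, 104, 64); conv3.encode(commandBuffer: commandBuffer, sourceImage: p2, destinationImage: c3)
    let p3 = temp(52, 52, 64);   maxpooling3.encode(commandBuffer: commandBuffer, sourceImage: c3, destinationImage: p3)
    let c4 = temp(52, 52, 128);  conv4.encode(commandBuffer: commandBuffer, sourceImage: p3, destinationImage: c4)
    let p4 = temp(26, 26, 128);  maxpooling4.encode(commandBuffer: commandBuffer, sourceImage: c4, destinationImage: p4)
    let c5 = temp(26, 26, 256);  conv5.encode(commandBuffer: commandBuffer, sourceImage: p4, destinationImage: c5)
    let p5 = temp(13, 13, 256);  maxpooling5.encode(commandBuffer: commandBuffer, sourceImage: c5, destinationImage: p5)
    let c6 = temp(13, 13, 512);  conv6.encode(commandBuffer: commandBuffer, sourceImage: p5, destinationImage: c6)
    let p6 = temp(13, 13, 512);  maxpooling6.encode(commandBuffer: commandBuffer, sourceImage: c6, destinationImage: p6)

    // conv7, conv8, then the final 1x1 conv9 writes into the persistent conv9_img
    // so the 13x13x125 result can be read back on the CPU afterwards.
    let c7 = temp(13, 13, 1024); conv7.encode(commandBuffer: commandBuffer, sourceImage: p6, destinationImage: c7)
    let c8 = temp(13, 13, 1024); conv8.encode(commandBuffer: commandBuffer, sourceImage: c7, destinationImage: c8)
    conv9.encode(commandBuffer: commandBuffer, sourceImage: c8, destinationImage: conv9_img)

    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()   // in a real app, prefer a completion handler
}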

After the network produces the 13x13x125 feature map, it has to be converted into the corresponding 5 bounding boxes per cell. The conversion works as follows:


[Figure: box decoding: x = (cx + sigmoid(tx)) * 32, y = (cy + sigmoid(ty)) * 32, w = anchor_w * exp(tw) * 32, h = anchor_h * exp(th) * 32, confidence = sigmoid(tc)]

The conversion is implemented in code as follows:

                // The predicted tx and ty coordinates are relative to the location
                // of the grid cell; we use the logistic sigmoid to constrain these
                // coordinates to the range 0 - 1. Then we add the cell coordinates
                // (0-12) and multiply by the number of pixels per grid cell (32).
                // Now x and y represent center of the bounding box in the original
                // 416x416 image space.
                let x = (Float(cx) + Math.sigmoid(tx)) * blockSize
                let y = (Float(cy) + Math.sigmoid(ty)) * blockSize
                
                // The size of the bounding box, tw and th, is predicted relative to
                // the size of an "anchor" box. Here we also transform the width and
                // height into the original 416x416 image space.
                let w = exp(tw) * anchors[2*b    ] * blockSize
                let h = exp(th) * anchors[2*b + 1] * blockSize
                
                // The confidence value for the bounding box is given by tc. We use
                // the logistic sigmoid to turn this into a percentage.
                let confidence = Math.sigmoid(tc)
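
For context, the decode snippet above runs inside a loop over the grid cells and the 5 boxes per cell. Here is a hedged sketch of that surrounding loop, assuming the 13x13x125 output has already been read back into a plain [Float] array laid out as [cy][cx][channel]; the actual read-back from the MPSImage is more involved and device-specific.

// Loop structure around the decode above; the flat [cy][cx][channel] layout of
// `features` is an assumption for illustration.
let blockSize: Float = 32   // pixels per grid cell, used by the decode above
let gridHeight = 13, gridWidth = 13, boxesPerCell = 5, numClasses = 20
let features = [Float](repeating: 0, count: gridHeight * gridWidth * 125)  // placeholder

for cy in 0..<gridHeight {
    for cx in 0..<gridWidth {
        for b in 0..<boxesPerCell {
            let offset = (cy * gridWidth + cx) * 125 + b * (numClasses + 5)
            let tx = features[offset    ]
            let ty = features[offset + 1]
            let tw = features[offset + 2]
            let th = features[offset + 3]
            let tc = features[offset + 4]
            // ... decode tx, ty, tw, th, tc as shown above, apply a softmax to the
            // 20 class scores at features[offset+5 ..< offset+25], and keep the box
            // if bestClassScore * confidence exceeds the threshold ...
            _ = (tx, ty, tw, th, tc)   // silences unused-variable warnings in this sketch
        }
    }
}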

After the conversion there are 13x13x5 boxes in total, and most of them are wrong, so we first filter out useless boxes with the condition bestClassScore * confidence > 0.3. Even after this filtering many boxes remain, so non-maximum suppression (NMS) is applied at the end to remove the redundant ones.
The NMS implementation is shown below:

/**
 Removes bounding boxes that overlap too much with other boxes that have
 a higher score.
 
 Based on code from https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/non_max_suppression_op.cc
 
 - Parameters:
 - boxes: an array of bounding boxes and their scores
 - limit: the maximum number of boxes that will be selected
 - threshold: used to decide whether boxes overlap too much
 */
func nonMaxSuppression(boxes: [YOLO.Prediction], limit: Int, threshold: Float) -> [YOLO.Prediction] {
    
    // Do an argsort on the confidence scores, from high to low.
    let sortedIndices = boxes.indices.sorted { boxes[$0].score > boxes[$1].score }
    
    var selected: [YOLO.Prediction] = []
    var active = [Bool](repeating: true, count: boxes.count)
    var numActive = active.count
    
    // The algorithm is simple: Start with the box that has the highest score.
    // Remove any remaining boxes that overlap it more than the given threshold
    // amount. If there are any boxes left (i.e. these did not overlap with any
    // previous boxes), then repeat this procedure, until no more boxes remain
    // or the limit has been reached.
    outer: for i in 0..<boxes.count {
        if active[i] {
            let boxA = boxes[sortedIndices[i]]
            selected.append(boxA)
            if selected.count >= limit { break }
            
            for j in i+1..<boxes.count {
                if active[j] {
                    let boxB = boxes[sortedIndices[j]]
                    if IOU(a: boxA.rect, b: boxB.rect) > threshold {
                        active[j] = false
                        numActive -= 1
                        if numActive <= 0 { break outer }
                    }
                }
            }
        }
    }
    return selected
}
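
A usage sketch: after thresholding the decoded boxes with bestClassScore * confidence > 0.3, the survivors go through NMS. The limit and IOU threshold below are illustrative values, and `predictions` stands for the array of decoded YOLO.Prediction values from the previous step.

// Illustrative call: keep at most 10 boxes and treat boxes overlapping by more
// than 0.5 IOU as duplicates of a higher-scoring box.
let kept = predictions.filter { $0.score > 0.3 }   // score = bestClassScore * confidence
let finalBoxes = nonMaxSuppression(boxes: kept, limit: 10, threshold: 0.5)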

The complete implementation: https://github.com/Revo-Future/YOLO-iOS

Reference:
http://blog.csdn.net/hrsstudy/article/details/71173305?utm_source=itdadao&utm_medium=referral
