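I. Main contributions

Using RetinaNet and FCOS as examples, the authors analyze where the performance gap between anchor-based and anchor-free detectors comes from:

1. The number of anchors per location differs. RetinaNet tiles several anchor boxes at each location, whereas FCOS uses a single anchor point per location.
2. Positive and negative samples are defined differently. RetinaNet uses two IoU thresholds; FCOS uses spatial and scale constraints.
3. The regression starting state differs. RetinaNet refines a preset anchor box; FCOS regresses from an anchor point.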
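
To make the second difference concrete, here is a minimal pure-Python sketch of the two labelling rules. It is an illustration only, not code from either project: the function names are hypothetical, the 0.5 / 0.4 IoU thresholds are assumed to be RetinaNet's usual defaults, and `scale_range` stands in for the regression range that FCOS attaches to one pyramid level.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def retinanet_style_label(anchor, gt, pos_thr=0.5, neg_thr=0.4):
    """Two IoU thresholds: positive above pos_thr, negative below neg_thr,
    ignored in between (threshold values are assumed defaults)."""
    value = iou(anchor, gt)
    if value >= pos_thr:
        return "positive"
    if value < neg_thr:
        return "negative"
    return "ignore"


def fcos_style_label(point, gt, scale_range):
    """Spatial constraint (point inside gt) plus scale constraint
    (largest side distance must fall in this level's range)."""
    x, y = point
    left, top = x - gt[0], y - gt[1]
    right, bottom = gt[2] - x, gt[3] - y
    if min(left, top, right, bottom) <= 0:
        return "negative"  # point lies outside the ground-truth box
    low, high = scale_range
    largest = max(left, top, right, bottom)
    return "positive" if low <= largest <= high else "negative"
```

Note that the first rule depends on the shape of the anchor box, while the second depends only on where the point lies and on the object size.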

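The main contributions of the ATSS paper:

1. It shows that the essential difference between anchor-based and anchor-free detectors lies in how positive and negative training samples are defined.
2. It proposes Adaptive Training Sample Selection (ATSS), which assigns positive and negative samples adaptively during training based on statistical characteristics of each object.
3. It demonstrates that tiling multiple anchors at one location to detect objects is an inefficient practice.
4. It achieves state-of-the-art results on COCO without introducing any extra overhead.

This raises a core question in object detection, namely label assignment: how should positive and negative samples be assigned?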

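II. Analyzing the gap between anchor-free and anchor-based methods

To compare the two fairly, the authors train both with the same settings and tricks, and set RetinaNet to one anchor per location. A gap of 0.8% AP still remains.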

The authors then examine the two remaining candidate causes of the gap:

1. How positive and negative samples are defined.

2. The regression starting state, i.e. whether regression starts from an anchor box or from a center point.

The experiments lead to one conclusion: the definition of positive and negative samples is the core cause.

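III. Adaptive Training Sample Selection

During training, positive and negative samples are divided automatically according to statistical characteristics of each object. The procedure:

1. For each ground-truth box $g$, on every pyramid level select the $k$ anchors whose centers are closest to the center of $g$ by $L2$ distance. With $\mathcal L$ feature pyramid levels this gives $k \times \mathcal L$ candidate positive samples.
2. Compute the IoU between each candidate and $g$, then the mean $m_g$ and standard deviation $v_g$ of these IoUs.
3. Combine the two statistics into the IoU threshold $t_g = m_g + v_g$.
4. Mark a candidate as a positive sample if its IoU is at least $t_g$ and its center lies inside the ground-truth box. All other anchors are negatives.
5. If an anchor box is assigned to multiple ground-truth boxes, keep only the assignment with the highest IoU.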
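
Steps 1 to 4 can be sketched for a single ground-truth box in plain Python. This is a simplified illustration under stated assumptions, not the authors' code: the function names are hypothetical, boxes are `(x1, y1, x2, y2)` tuples, the standard deviation is the sample (n-1) estimate, and at least two candidates are needed for it to be defined. Step 5 only matters when several ground-truth boxes compete and is left out here.

```python
import math
import statistics


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def center(box):
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def atss_for_one_gt(gt, anchors_per_level, k):
    """Return (positives, threshold); positives are (level, index) pairs."""
    gx, gy = center(gt)

    # Step 1: on every level keep the k anchors closest to the gt center.
    candidates = []
    for level, anchors in enumerate(anchors_per_level):
        def dist(i, anchors=anchors):
            cx, cy = center(anchors[i])
            return math.hypot(cx - gx, cy - gy)
        nearest = sorted(range(len(anchors)), key=dist)[:k]
        candidates.extend((level, i) for i in nearest)

    # Step 2: IoU of every candidate with the ground truth.
    ious = [iou(anchors_per_level[lvl][i], gt) for lvl, i in candidates]

    # Step 3: threshold = mean + standard deviation (sample std assumed).
    threshold = statistics.mean(ious) + statistics.stdev(ious)

    # Step 4: keep candidates above the threshold whose center is inside gt.
    positives = []
    for (lvl, i), value in zip(candidates, ious):
        cx, cy = center(anchors_per_level[lvl][i])
        inside = gt[0] < cx < gt[2] and gt[1] < cy < gt[3]
        if value >= threshold and inside:
            positives.append((lvl, i))
    return positives, threshold


gt = (0, 0, 100, 100)
levels = [
    [(0, 0, 100, 100), (10, 10, 110, 110), (200, 200, 300, 300)],
    [(-50, -50, 150, 150), (300, 300, 500, 500)],
]
positives, threshold = atss_for_one_gt(gt, levels, k=2)
```

In this toy input only the anchor that coincides with the ground truth clears the threshold, because the other candidates pull the mean down while keeping the spread large.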

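1. Why select candidates by the Euclidean distance between centers?
For both RetinaNet and FCOS, predictions get better the closer an anchor (or anchor point) is to the center of the ground-truth object.
2. Why use the mean and standard deviation as the IoU threshold?
Together they adapt the positive/negative threshold to each object automatically. A high standard deviation usually means that one FPN level produces clearly higher IoUs than the others, so that level suits the object well and the final positives all come from it. A low standard deviation means that several FPN levels suit the object, so positives are selected from multiple levels.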
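
The two regimes can be checked with made-up numbers. The sketch below is only an illustration: the IoU values are invented, the helper name is hypothetical, the sample standard deviation is assumed, and the center-inside-box test is ignored so that only the threshold effect is visible.

```python
import statistics


def positive_levels(ious_by_level):
    """Return (threshold, levels that contribute at least one positive)."""
    all_ious = [v for ious in ious_by_level.values() for v in ious]
    threshold = statistics.mean(all_ious) + statistics.stdev(all_ious)
    levels = {lvl for lvl, ious in ious_by_level.items()
              if any(v >= threshold for v in ious)}
    return threshold, levels


# High spread: only pyramid level 4 matches the object well.
one_level_fits = {
    3: [0.05, 0.06, 0.04],
    4: [0.75, 0.70, 0.65],
    5: [0.10, 0.12, 0.08],
}
# Low spread: all three levels match about equally well.
several_levels_fit = {
    3: [0.34, 0.30, 0.26],
    4: [0.35, 0.30, 0.27],
    5: [0.33, 0.29, 0.26],
}

thr_high, levels_high = positive_levels(one_level_fits)
thr_low, levels_low = positive_levels(several_levels_fit)
```

With these inputs the high-spread case yields a high threshold and positives from level 4 only, while the low-spread case yields a threshold just above the mean and positives from levels 3 and 4.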

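3. Why require the anchor box center to lie inside the ground-truth box?
Anchor boxes whose centers fall outside the ground-truth box are usually poor candidates, because they would predict the object from features that lie outside it.
4. Is this label assignment an effective way to divide positives and negatives?
If the candidate IoUs were normally distributed, about 16% of them would lie above $m_g + v_g$. The distribution is not exactly normal, but in practice each ground-truth box still receives roughly $0.2 \times k \times \mathcal L$ positive samples regardless of its scale, aspect ratio, or location. By contrast, the RetinaNet and FCOS strategies give larger objects many more positives, which is not fair across objects.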
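
The 16% figure is the upper tail of a normal distribution beyond one standard deviation, which is easy to verify numerically. In the sketch below the helper names are hypothetical, and `k = 9` with 5 pyramid levels are illustrative values chosen for the arithmetic.

```python
import math


def normal_upper_tail(z):
    """P(X > mu + z * sigma) for a normally distributed X."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def expected_positives(k, num_levels, fraction=0.2):
    """Approximate positives per ground truth: fraction * k * L."""
    return fraction * k * num_levels


tail = normal_upper_tail(1.0)            # roughly 0.16
per_object = expected_positives(9, 5)    # roughly 9 positives per object
```

The count depends only on $k$ and $\mathcal L$, not on the object size, which is the fairness property described above.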
5. How should the hyperparameter $k$ be chosen?
The results are not sensitive to the choice of $k$.

IV. Experimental validation

1. With ATSS, there is no significant gap between RetinaNet and FCOS.

2. Results are robust to anchor boxes of different scales and aspect ratios.

3. With ATSS, the number of anchors per location has no significant effect on the results.

4. Overall performance of ATSS.

V. Source code

The code below follows the mmdetection implementation:

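# Imports as used inside the mmdetection 2.x package
# (module mmdet/core/bbox/assigners/atss_assigner.py).
import torch

from ..builder import BBOX_ASSIGNERS
from ..iou_calculators import build_iou_calculator
from .assign_result import AssignResult
from .base_assigner import BaseAssigner


@BBOX_ASSIGNERS.register_module()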
class ATSSAssigner(BaseAssigner):
    """Assign a corresponding gt bbox or background to each bbox.

    Each proposal will be assigned `0` or a positive integer
    indicating the ground truth index.

    - 0: negative sample, no assigned gt
    - positive integer: positive sample, index (1-based) of assigned gt

    Args:
        topk (int): number of bboxes selected on each level
    """

    def __init__(self,
                 topk,
                 iou_calculator=dict(type='BboxOverlaps2D'),
                 ignore_iof_thr=-1):
        self.topk = topk
        self.iou_calculator = build_iou_calculator(iou_calculator)
        self.ignore_iof_thr = ignore_iof_thr

    # https://github.com/sfzhang15/ATSS/blob/master/atss_core/modeling/rpn/atss/loss.py

    def assign(self,
               bboxes,
               num_level_bboxes,
               gt_bboxes,
               gt_bboxes_ignore=None,
               gt_labels=None):
        """Assign gt to bboxes.

        The assignment is done in following steps

        1. compute iou between all bbox (bbox of all pyramid levels) and gt
        2. compute center distance between all bbox and gt
        3. on each pyramid level, for each gt, select k bbox whose center
           are closest to the gt center, so we total select k*l bbox as
           candidates for each gt
        4. get the corresponding iou for these candidates, and compute the
           mean and std, set mean + std as the iou threshold
        5. select the candidates whose iou is greater than or equal to
           the threshold as positive
        6. limit the positive sample's center in gt


        Args:
            bboxes (Tensor): Bounding boxes to be assigned, shape(n, 4).
            num_level_bboxes (List): num of bboxes in each level
            gt_bboxes (Tensor): Groundtruth boxes, shape (k, 4).
            gt_bboxes_ignore (Tensor, optional): Ground truth bboxes that are
                labelled as `ignored`, e.g., crowd boxes in COCO.
            gt_labels (Tensor, optional): Label of gt_bboxes, shape (k, ).

        Returns:
            :obj:`AssignResult`: The assign result.
        """
        INF = 100000000
        bboxes = bboxes[:, :4]
        num_gt, num_bboxes = gt_bboxes.size(0), bboxes.size(0)

        # compute iou between all bbox and gt
        overlaps = self.iou_calculator(bboxes, gt_bboxes)

        # assign 0 by default
        assigned_gt_inds = overlaps.new_full((num_bboxes, ),
                                             0,
                                             dtype=torch.long)

        if num_gt == 0 or num_bboxes == 0:
            # No ground truth or boxes, return empty assignment
            max_overlaps = overlaps.new_zeros((num_bboxes, ))
            if num_gt == 0:
                # No truth, assign everything to background
                assigned_gt_inds[:] = 0
            if gt_labels is None:
                assigned_labels = None
            else:
                assigned_labels = overlaps.new_full((num_bboxes, ),
                                                    -1,
                                                    dtype=torch.long)
            return AssignResult(
                num_gt, assigned_gt_inds, max_overlaps, labels=assigned_labels)

        # compute center distance between all bbox and gt
        gt_cx = (gt_bboxes[:, 0] + gt_bboxes[:, 2]) / 2.0
        gt_cy = (gt_bboxes[:, 1] + gt_bboxes[:, 3]) / 2.0
        gt_points = torch.stack((gt_cx, gt_cy), dim=1)

        bboxes_cx = (bboxes[:, 0] + bboxes[:, 2]) / 2.0
        bboxes_cy = (bboxes[:, 1] + bboxes[:, 3]) / 2.0
        bboxes_points = torch.stack((bboxes_cx, bboxes_cy), dim=1)

        distances = (bboxes_points[:, None, :] -
                     gt_points[None, :, :]).pow(2).sum(-1).sqrt()

        if (self.ignore_iof_thr > 0 and gt_bboxes_ignore is not None
                and gt_bboxes_ignore.numel() > 0 and bboxes.numel() > 0):
            ignore_overlaps = self.iou_calculator(
                bboxes, gt_bboxes_ignore, mode='iof')
            ignore_max_overlaps, _ = ignore_overlaps.max(dim=1)
            ignore_idxs = ignore_max_overlaps > self.ignore_iof_thr
            distances[ignore_idxs, :] = INF
            assigned_gt_inds[ignore_idxs] = -1

        # Selecting candidates based on the center distance
        candidate_idxs = []
        start_idx = 0
        for level, bboxes_per_level in enumerate(num_level_bboxes):
            # on each pyramid level, for each gt,
            # select k bbox whose center are closest to the gt center
            end_idx = start_idx + bboxes_per_level
            distances_per_level = distances[start_idx:end_idx, :]
            selectable_k = min(self.topk, bboxes_per_level)
            _, topk_idxs_per_level = distances_per_level.topk(
                selectable_k, dim=0, largest=False)
            candidate_idxs.append(topk_idxs_per_level + start_idx)
            start_idx = end_idx
        candidate_idxs = torch.cat(candidate_idxs, dim=0)

        # get the corresponding iou for these candidates, and compute the
        # mean and std, set mean + std as the iou threshold
        candidate_overlaps = overlaps[candidate_idxs, torch.arange(num_gt)]
        overlaps_mean_per_gt = candidate_overlaps.mean(0)
        overlaps_std_per_gt = candidate_overlaps.std(0)
        overlaps_thr_per_gt = overlaps_mean_per_gt + overlaps_std_per_gt

        is_pos = candidate_overlaps >= overlaps_thr_per_gt[None, :]

        # limit the positive sample's center in gt
        for gt_idx in range(num_gt):
            candidate_idxs[:, gt_idx] += gt_idx * num_bboxes
        ep_bboxes_cx = bboxes_cx.view(1, -1).expand(
            num_gt, num_bboxes).contiguous().view(-1)
        ep_bboxes_cy = bboxes_cy.view(1, -1).expand(
            num_gt, num_bboxes).contiguous().view(-1)
        candidate_idxs = candidate_idxs.view(-1)

        # calculate the left, top, right, bottom distance between positive
        # bbox center and gt side
        l_ = ep_bboxes_cx[candidate_idxs].view(-1, num_gt) - gt_bboxes[:, 0]
        t_ = ep_bboxes_cy[candidate_idxs].view(-1, num_gt) - gt_bboxes[:, 1]
        r_ = gt_bboxes[:, 2] - ep_bboxes_cx[candidate_idxs].view(-1, num_gt)
        b_ = gt_bboxes[:, 3] - ep_bboxes_cy[candidate_idxs].view(-1, num_gt)
        is_in_gts = torch.stack([l_, t_, r_, b_], dim=1).min(dim=1)[0] > 0.01
        is_pos = is_pos & is_in_gts

        # if an anchor box is assigned to multiple gts,
        # the one with the highest IoU will be selected.
        overlaps_inf = torch.full_like(overlaps,
                                       -INF).t().contiguous().view(-1)
        index = candidate_idxs.view(-1)[is_pos.view(-1)]
        overlaps_inf[index] = overlaps.t().contiguous().view(-1)[index]
        overlaps_inf = overlaps_inf.view(num_gt, -1).t()

        max_overlaps, argmax_overlaps = overlaps_inf.max(dim=1)
        assigned_gt_inds[
            max_overlaps != -INF] = argmax_overlaps[max_overlaps != -INF] + 1

        if gt_labels is not None:
            assigned_labels = assigned_gt_inds.new_full((num_bboxes, ), -1)
            pos_inds = torch.nonzero(
                assigned_gt_inds > 0, as_tuple=False).squeeze()
            if pos_inds.numel() > 0:
                assigned_labels[pos_inds] = gt_labels[
                    assigned_gt_inds[pos_inds] - 1]
        else:
            assigned_labels = None
        return AssignResult(
            num_gt, assigned_gt_inds, max_overlaps, labels=assigned_labels)
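
The final stage of `assign` above, where an anchor claimed by several ground truths goes to the one with the highest IoU, can be illustrated without tensors. The sketch below is a hypothetical pure-Python equivalent for small inputs, not part of mmdetection. It keeps the same output convention: `0` for background and a 1-based ground-truth index for positives.

```python
NEG_INF = float("-inf")


def resolve_conflicts(overlaps, is_pos):
    """overlaps[a][g]: IoU of anchor a with ground truth g.
    is_pos[a][g]: True if anchor a passed the ATSS tests for g.
    Returns one entry per anchor: 0 (background) or 1-based gt index."""
    assigned = []
    for iou_row, pos_row in zip(overlaps, is_pos):
        best_gt, best_iou = 0, NEG_INF
        for gt_idx, (value, positive) in enumerate(zip(iou_row, pos_row)):
            # Only ground truths that selected this anchor may compete.
            if positive and value > best_iou:
                best_gt, best_iou = gt_idx + 1, value
        assigned.append(best_gt)
    return assigned


overlaps = [[0.6, 0.7], [0.5, 0.1], [0.2, 0.3]]
is_pos = [[True, True], [True, False], [False, False]]
assigned_gt_inds = resolve_conflicts(overlaps, is_pos)
```

The first anchor is positive for both ground truths and goes to the second one (IoU 0.7), the second anchor goes to the first ground truth, and the third stays background.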
