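Preface
- Venue: ECCV 2022
- Model name: Detic
- Summary: Like the earlier OVD-Net, this paper tackles the open-vocabulary problem from the pretraining angle, but its approach is simpler and more brute-force: it adds ImageNet categories directly to training. Strictly speaking this is not true open-vocabulary detection, but it can cover 2000 categories.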
1. Introduction:
Object detection (OD) has two subtasks: 1) finding boxes (localization); 2) naming the boxes (classification).
Previous works couple these two subtasks.
However, detection benchmarks are much smaller than classification benchmarks:
as the figure shows, LVIS (OD) has far fewer images and far fewer categories than ImageNet (CLS).
This paper:
proposes Detic (a detector with image classes), which uses image-level supervision in addition to detection supervision;
decouples the localization and classification sub-problems;
uses image-level labels to train the classifier and broaden the vocabulary of the detector.
illustration:
standard OD: needs ground-truth boxes and labels;
weakly supervised OD: assigns image-level labels to predicted boxes [error-prone];
this paper: assigns image-level labels to the max-size proposal.
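To make the contrast concrete, here is a minimal sketch of the two assignment rules on toy data. The "max-score" rule is only an illustrative stand-in for prediction-based assignment, and the boxes, scores, and function names are all hypothetical, not taken from the paper's code.

```python
# Toy comparison of two ways to pick the proposal that receives an image-level label.
# Boxes are (x1, y1, x2, y2); scores are hypothetical classifier confidences.

def box_area(box):
    """Area of an axis-aligned box; degenerate boxes get area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def pick_max_score(scores):
    """Prediction-based assignment: the choice depends on the (possibly wrong) classifier."""
    return max(range(len(scores)), key=lambda i: scores[i])

def pick_max_size(boxes):
    """Max-size assignment: the choice depends only on box geometry."""
    return max(range(len(boxes)), key=lambda i: box_area(boxes[i]))

proposals = [
    (10, 10, 50, 50),    # small box, area 1600
    (0, 0, 200, 150),    # large box, area 30000
    (60, 40, 120, 100),  # medium box, area 3600
]
# Hypothetical scores for the image-level class from an under-trained classifier.
scores_for_label = [0.70, 0.20, 0.10]

score_choice = pick_max_score(scores_for_label)  # index 0, follows the classifier
size_choice = pick_max_size(proposals)           # index 1, ignores the classifier
```

The max-size choice stays the same however poor the classifier's scores are, which is one way to read the "error-prone" remark about prediction-based assignment.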
2 Method
2.1 preliminary
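detection dataset $D_{det}$, with class set $C_{det}$
image classification dataset $D_{cls}$, with class set $C_{cls}$
testing dataset with class set $C_{test}$.
$C_{det}$, $C_{cls}$, and $C_{test}$ may or may not overlap.
traditional OD: $C_{test} = C_{det}$ and $D_{cls} = \emptyset$
OVD: allows $C_{test} \neq C_{det}$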
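The relation between the class sets can be sketched with plain Python sets. The vocabularies below are invented for illustration, and the sketch assumes that the traditional setting means the test vocabulary equals the detection vocabulary with no image-labelled data, while the open-vocabulary setting lets the test vocabulary contain classes that have no box annotations.

```python
# Hypothetical vocabularies, chosen only for illustration.
c_det = {"person", "car", "dog"}    # classes with box annotations
c_cls = {"dog", "lion", "teapot"}   # classes with image-level labels only
c_test = {"person", "dog", "lion"}  # classes evaluated at test time

def is_traditional_setting(test_classes, det_classes, has_cls_data):
    """Assumed definition: test vocabulary equals detection vocabulary, no image-level data."""
    return test_classes == det_classes and not has_cls_data

# Classes the detector must name at test time without ever seeing a box for them.
novel = c_test - c_det

# Novel classes that image-level supervision can still teach the classifier.
covered_by_image_labels = novel & c_cls
```

Here `novel` contains only `"lion"`, and it is also in `covered_by_image_labels`, which is the case that image-level labels are meant to help with.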
2.2 Detic
the whole idea is quite simple.
- use both the detection dataset and the classification dataset to train the detection model.
sample a batch from both the detection dataset $D_{det}$ and the classification dataset $D_{cls}$.
if an image belongs to $D_{det}$, then loss = the typical OD loss: RPN loss + box regression loss + classification loss.
if an image belongs to $D_{cls}$, then loss = the max-size loss: the proposal with the largest size is taken as the region for the image-level label and is used to calculate the classification loss.
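The per-image loss dispatch described above can be sketched in plain Python. This is a simplified illustration under several assumptions, not the paper's implementation: the classification loss is taken to be a per-class binary cross-entropy on raw logits, the detection loss terms are passed in as precomputed placeholder numbers, and the dictionary layout and all function names are hypothetical.

```python
import math

def box_area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over classes, computed stably from raw logits."""
    total = 0.0
    for z, t in zip(logits, targets):
        # Equivalent to -[t*log(sigmoid(z)) + (1-t)*log(1-sigmoid(z))].
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

def max_size_loss(proposals, proposal_logits, image_labels, num_classes):
    """Apply the image-level labels to the largest proposal only."""
    biggest = max(range(len(proposals)), key=lambda i: box_area(proposals[i]))
    targets = [1.0 if c in image_labels else 0.0 for c in range(num_classes)]
    return bce_with_logits(proposal_logits[biggest], targets)

def image_loss(sample, num_classes):
    """Dispatch on the image's source dataset (assumed keys: 'det' or 'cls')."""
    if sample["source"] == "det":
        # Placeholder values standing in for the usual detector losses.
        return sample["rpn_loss"] + sample["reg_loss"] + sample["cls_loss"]
    return max_size_loss(
        sample["proposals"], sample["logits"], sample["labels"], num_classes
    )

# Toy mixed batch with two classes (indices 0 and 1).
det_sample = {"source": "det", "rpn_loss": 0.5, "reg_loss": 0.25, "cls_loss": 0.25}
cls_sample = {
    "source": "cls",
    "proposals": [(0, 0, 10, 10), (0, 0, 100, 100)],
    "logits": [[5.0, -5.0], [0.0, 0.0]],  # one row of class logits per proposal
    "labels": {1},                        # image-level label: class 1
}
batch_losses = [image_loss(s, num_classes=2) for s in (det_sample, cls_sample)]
```

For the classification image only the second, larger proposal contributes, so the confident but wrong logits of the small proposal are ignored and no box regression or RPN term is computed.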