1. Preface
https://arxiv.org/abs/2111.09452
The main challenge in going from classic OD to OVD is that existing OD datasets cover only a limited set of categories (e.g., the most common one, COCO, has only 80 classes), so recognizing novel categories is difficult.
Based on this observation, the idea of this paper is to automatically generate pseudo object-detection annotations from large-scale image-caption pairs and use them to train the model.
1. Introduction
OD is limited to a fixed set of objects (e.g., 80 objects for COCO)
to reduce the human labor needed for annotation: zero-shot OD (ZS-OD) & OVD
ZS-OD: transfer from base to novel by exploiting the correlations between base and novel categories;
OVD: transfer from base to novel with the help of image captions (personal note: this is not entirely accurate, and OVD does not necessarily have to rely on captions)
both ZS-OD and OVD are limited by the small size of base category set
This paper: automatically generates bounding-box annotations for objects at scale using existing resources.
Existing vision-language models imply strong localization ability.
this paper: improve OVD using pseudo-bounding box annotations generated from large-scale image caption pairs.
(paper figure) left: human-labeled annotations; right: image-caption pairs + VL model --> pseudo annotations
2. Method
two components:
a pseudo bounding-box label generator;
an open vocabulary object detector.
2.1 pseudo bounding-box label generator
predefine objects of interest
input: {image, caption} pairs
image --> image encoder --> visual embedding per region
caption --> text encoder --> text embedding per token
{visual embeddings, text embeddings} --> multi-modal encoder --> multi-modal features via image-text cross-attention.
for each object in the predefined set, e.g., racket, use Grad-CAM to visualize its activation map.
apply a proposal generator to obtain multiple candidate boxes.
the box with the largest overlap with the activation map is taken as the pseudo box for that object, as sketched below.
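A minimal sketch of this selection step, assuming we already have a Grad-CAM activation map for one object word (a 2-D array over the image) and a set of proposal boxes. The overlap score used here (activation mass inside the box, normalized by the total activation) is an assumption for illustration; the paper's exact score may differ.

```python
import numpy as np

def select_pseudo_box(activation_map: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    """Pick the proposal that best covers the Grad-CAM activation.

    activation_map: (H, W) non-negative Grad-CAM map for one object word.
    boxes: (N, 4) proposals as (x1, y1, x2, y2) in pixel coordinates.

    Overlap score = activation mass inside the box / total activation
    (an assumed definition, not necessarily the paper's exact formula).
    """
    total = activation_map.sum() + 1e-8
    scores = []
    for x1, y1, x2, y2 in boxes.astype(int):
        inside = activation_map[y1:y2, x1:x2].sum()
        scores.append(inside / total)
    return boxes[int(np.argmax(scores))]

# usage: one pseudo box per (image, object word) pair
# pseudo_box = select_pseudo_box(gradcam_map, proposal_boxes)
```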
2.2 open vocabulary object detector with pseudo bounding-box labels
because label generation and detector training are decoupled into two steps, any OVD model can be trained on the pseudo labels.
this paper therefore adopts a typical OVD framework.
- input: image, large-scale object vocabulary set
- image --> feature extractor --> object proposals --> RoI --> region-based visual embedding
- category texts in large-scale object vocabulary set + "background" --> text encoder --> text embeddings
- training: encourage the paired {region-based visual embedding, text embedding} to be similar, i.e., each region is pulled toward the text embedding of its matched category; a sketch of such an objective follows.
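A sketch of this matching objective, assuming region embeddings from the detector, text embeddings for the vocabulary, and a learnable embedding standing in for "background": each region is classified by its similarity to every category text, and cross-entropy pulls it toward its paired category. The temperature value and the learnable background vector are assumptions for illustration, not necessarily the paper's exact choices.

```python
import torch
import torch.nn.functional as F

def region_text_loss(region_emb, text_emb, bg_emb, targets, tau=0.07):
    """Similarity-based classification of regions over the open vocabulary.

    region_emb: (R, D) region-based visual embeddings (RoI features projected to D).
    text_emb:   (C, D) text embeddings of the category names in the vocabulary.
    bg_emb:     (1, D) learnable embedding for the "background" class.
    targets:    (R,) index of the matched category for each region; C = background.
    """
    # concatenate vocabulary and background, then L2-normalize everything
    all_text = F.normalize(torch.cat([text_emb, bg_emb], dim=0), dim=-1)  # (C+1, D)
    regions = F.normalize(region_emb, dim=-1)                             # (R, D)

    # cosine similarity scaled by a temperature acts as classification logits
    logits = regions @ all_text.t() / tau                                  # (R, C+1)

    # cross-entropy encourages each region to match its paired text embedding
    return F.cross_entropy(logits, targets)
```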