1 introduction
1.1 main story
OVD: treats the training category label as a text embedding rather than a discrete ID.
base set of categories (annotated during training) vs. novel set of categories (only seen at test time);
OVD: trains on the base categories, and tests on the union of base and novel categories;
previous OD: the set of object categories is the same between train and test;
OVD: to handle additional categories at test time, the common practice is to replace the conventional fixed-size fully-connected classifier layer with the text embeddings of the base categories.
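The practice above can be sketched as follows: the "classifier weights" are simply the rows of a text-embedding matrix, so adding a category means appending a row. This is a minimal numpy sketch with hypothetical shapes (4 regions, 3 base categories, embedding dim 8), not the paper's implementation.

```python
import numpy as np

def normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
region_emb = normalize(rng.standard_normal((4, 8)))   # region visual embeddings
text_emb = normalize(rng.standard_normal((3, 8)))     # one row per base-category name

# The classifier is just a matrix product with the text embeddings:
logits = region_emb @ text_emb.T                      # shape (4, 3)

# Adding a novel category at test time = appending a row; no FC layer retraining.
novel_emb = normalize(rng.standard_normal((1, 8)))
expanded_logits = region_emb @ np.vstack([text_emb, novel_emb]).T  # shape (4, 4)
```

Note how the logits for the base categories are unchanged after expansion; only a new column appears.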
For details of the OVD task itself, see the first paper on it: https://www.jianshu.com/p/b23de6b4476b
Existing methods: leverage image-text pretraining via knowledge distillation, weak supervision, self-training, or frozen models, but all built on CNNs;
they assume a pretrained VLM is given, and develop adaptation or finetuning recipes to bridge the gap between image-level pretraining and object-level finetuning (pretraining is image-level, but the downstream task is object-level)
This paper:
- explores OVD with ViTs
- proposes RO-ViT: pretraining the ViT in a region-aware manner for OVD
1.2 related works on OVD:
1) learning alignment between region visual representations and category word embeddings;
2) hallucinating visual features with a generative model
3) image-text pretraining [this paper falls into this category]
existing works are based on CNNs, assume the image-text pretrained model is given, and focus on finetuning or adaptation.
this paper: focuses on improving the upstream image-text pretraining itself, with a ViT.
2 method
2.1 common pipeline
- proposals not matched to any base-category annotation are labeled as "background".
- training: for each region i, calculate the detection score as the cosine similarity between the region visual embedding and the category text embedding, followed by a softmax;
- testing: expand the text embeddings from the base categories to the union of base and novel categories.
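The training-time scoring step above can be sketched in a few lines: L2-normalize both embeddings, take the cosine similarity matrix, scale by a temperature (the value 0.01 here is a hypothetical choice, not from the paper), and softmax over categories per region.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def detection_scores(region_emb, text_emb, tau=0.01):
    """Per-region softmax over temperature-scaled cosine similarities
    between region visual embeddings and category text embeddings."""
    r = region_emb / np.linalg.norm(region_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    return softmax((r @ t.T) / tau, axis=-1)

rng = np.random.default_rng(0)
# hypothetical sizes: 5 region proposals, 3 categories, dim 16
scores = detection_scores(rng.standard_normal((5, 16)), rng.standard_normal((3, 16)))
```

At test time the same function works unchanged after stacking the novel-category text embeddings under the base ones.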
2.2 Region-Aware Image-Text Pretraining
existing methods: align the whole image with the text;
this paper: proposes novel cropped positional embeddings (CPE) to make the image representation region-aware.
cropped positional embeddings (CPE):
positional embeddings are key to transformers, providing information on where each element comes from.
The overall framework has three parts:
- Left: apart from the CPE part, this is the familiar contrastive learning on {image, caption} pairs; but the loss is changed from the usual softmax cross-entropy to a focal loss;
- Middle: the CPE part, i.e. how this paper achieves region-awareness at the pretraining stage. The original full-image positional embedding is modified as follows: 1) first upsample it to a common OD image size, e.g. the original 224x224xD embedding is upsampled to 1024x1024xD, a typical detection resolution; 2) then randomly crop and resize from the 1024x1024xD embedding back to a positional embedding of the original image size. This makes the model treat the current image as one region of some larger image.
- Right: when transferring downstream, replace the global average pooling (GAP) used in pretraining with a detector head.
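The upsample / random-crop / resize-back steps of CPE can be sketched on a positional-embedding grid. This is a minimal numpy sketch: the 14x14 grid (224 with patch size 16), the 64-cell "detection-size" grid, and nearest-neighbour resizing are all illustrative assumptions; the actual paper uses learned interpolation details not reproduced here.

```python
import numpy as np

def resize_nn(pe, out_h, out_w):
    """Nearest-neighbour resize of an (H, W, D) grid; stand-in for interpolation."""
    h, w, _ = pe.shape
    rows = (np.arange(out_h) * h / out_h).astype(int)
    cols = (np.arange(out_w) * w / out_w).astype(int)
    return pe[rows][:, cols]

def cropped_positional_embedding(pe, big=64, rng=None):
    """CPE sketch: 1) upsample the pretraining PE grid to a detection-sized grid,
    2) randomly crop a region, 3) resize the crop back to the original grid size,
    so the model sees its input as one region of a larger image."""
    rng = rng or np.random.default_rng()
    h, w, _ = pe.shape
    up = resize_nn(pe, big, big)           # 1) upsample to detection size
    ch = rng.integers(h, big + 1)          # random crop height/width (hypothetical range)
    cw = rng.integers(w, big + 1)
    y = rng.integers(0, big - ch + 1)
    x = rng.integers(0, big - cw + 1)
    crop = up[y:y + ch, x:x + cw]          # 2) random crop
    return resize_nn(crop, h, w)           # 3) resize back to the original grid

pe = np.random.default_rng(0).standard_normal((14, 14, 8))  # e.g. 224/16 = 14 patches/side
cpe = cropped_positional_embedding(pe, rng=np.random.default_rng(1))
```

Each pretraining step draws a fresh random crop, so over training the model's positional embeddings become consistent with arbitrary sub-regions rather than only the whole image.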
(In my opinion, the core of the paper is establishing the relation between regions and text at the pretraining stage, implemented by the random transformation of the positional embeddings in the CPE module. Other parts of the paper, such as the loss modification, are not covered here; interested readers can follow the original paper.)