1 basic
- github.com/alirezazareian/ovr-cnn
- the first paper to propose the task of "open-vocabulary object detection" (OVD)
2 introduction
conventional object detection (OD): each category needs thousands of annotated bounding boxes;
stage 1: use {image, caption} pairs to learn a visual semantic space;
stage 2: use annotated boxes for several classes to train object detection;
stage 3: inference which can detect objects beyond the base classes;
to summarize, we train a model that takes an image and detects any object within a given target vocabulary $V_T$.
Task Definition:
- test on a target vocabulary $V_T$;
- train on an image-caption dataset whose caption vocabulary is $V_C$;
- train on an annotated object detection dataset whose (base) class vocabulary is $V_B$;
- $V_T$ is not known during training and can be any subset of the entire vocabulary (in practice, the vocabulary of the pretrained word embeddings).
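A minimal sketch of how the three vocabularies relate; the class names below are hypothetical examples, not the paper's actual splits:

```python
# Hypothetical illustration of the OVD vocabulary setup; class names are made up.
V_B = {"person", "car", "dog"}                    # base classes: come with bounding-box annotations
V_C = V_B | {"umbrella", "grass", "skateboard"}   # caption vocabulary: words appearing in captions (huge in practice)
V_T = {"umbrella", "elephant"}                    # target classes: only revealed at test time; anything with a
                                                  # pretrained word embedding is fair game, even if never captioned
```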
**compare with ZSD and WSD:**
- ZSD: no image-caption data ($V_C$);
- WSD: no bounding-box annotations; $V_T$ must be known before training;
- OVD is a generalization of ZSD and WSD.
outcome:
- significantly outperforms the ZSD and WSD methods;
3 Method
OVD framework:
- meaning of open: the words in the captions are not limited, but in practice it is not literally "open", as it is limited to the pretrained word embeddings. (However, word embeddings are typically trained on very large text corpora such as Wikipedia that cover nearly every word.)
3.1 Learning visual semantic space
- resembles PixelBERT
- use ResNet-50 (RN50) as the visual encoder and BERT as the text encoder;
- design a V2L (vision to language) module (mapping the vectors of vision patches to text vectors)
- use the grounding (main) task to train the RN50 & V2L module.
specifically,
- input image --> RN50 --> features of patches
- each patch feature (vision) --> V2L --> patch feature (language) $e^I_i$
- caption --> Embedding --> BERT --> features of words $e^C_j$
- patch features (language), word features --> multimodal transformer --> new features for patches and words, $m^I_i$ and $m^C_j$.
- task: perform weakly supervised grounding using $\{e^I_i, e^C_j\}$, treating the paired {img, caption} as the positive and the unpaired {img, caption} as negatives; the similarity between {img, caption} is computed as the average over all local products $\langle e^I_i, e^C_j \rangle$.
the grounding objective results in a learned visual backbone and V2L layer that can map regions in the image to the words that best describe them (see the sketch below).
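A minimal PyTorch sketch of this weakly supervised grounding objective, following the plain average of local dot products described above (the paper's exact score may weight regions per word; function names here are mine):

```python
import torch
import torch.nn.functional as F

def grounding_scores(e_img, e_cap):
    """Global image-caption score = average of all local region-word dot products.

    e_img: (B, N_regions, D)  region features after the V2L layer (language space)
    e_cap: (B, N_words, D)    word features of the caption
    Returns a (B, B) matrix of scores between every image and every caption in the batch.
    """
    # local scores for every (image b, caption c, region i, word j)
    local = torch.einsum("bid,cjd->bcij", e_img, e_cap)   # (B, B, N_regions, N_words)
    return local.mean(dim=(2, 3))                         # plain average, as in the notes

def grounding_loss(e_img, e_cap):
    """Contrastive loss: paired {img, caption} are positives, all other pairs in the batch are negatives."""
    scores = grounding_scores(e_img, e_cap)               # (B, B)
    targets = torch.arange(scores.size(0), device=scores.device)
    # symmetric objective: match each image to its caption and each caption to its image
    return 0.5 * (F.cross_entropy(scores, targets) + F.cross_entropy(scores.t(), targets))
```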
besides, to teach the model to 1) extract all objects that might be described in captions and 2) determine which word completes the caption best, the paper further introduces the image-text matching (ITM) and masked language modeling (MLM) subtasks.
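A rough sketch of these two auxiliary heads on top of the multimodal transformer outputs; the pooling choice and layer names are my assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxHeads(nn.Module):
    """Image-text matching (ITM) and masked language modeling (MLM) heads (illustrative only)."""

    def __init__(self, dim, vocab_size):
        super().__init__()
        self.itm_head = nn.Linear(dim, 2)           # binary: does this caption belong to this image?
        self.mlm_head = nn.Linear(dim, vocab_size)  # predict the id of each masked caption token

    def itm_loss(self, m_img, m_cap, is_matched):
        # m_img: (B, N_regions, D), m_cap: (B, N_words, D), is_matched: (B,) with 1 = true pair, 0 = shuffled pair
        pooled = torch.cat([m_img, m_cap], dim=1).mean(dim=1)   # naive mean pooling over all tokens (assumption)
        return F.cross_entropy(self.itm_head(pooled), is_matched)

    def mlm_loss(self, m_cap, token_ids, masked):
        # token_ids: (B, N_words) original ids; masked: (B, N_words) bool marking the masked positions
        return F.cross_entropy(self.mlm_head(m_cap[masked]), token_ids[masked])
```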
3.2 Learning open-vocabulary detection
- use Faster R-CNN
- ResNet blocks 1-3 to extract features
- RPN --> predict objectness & bounding box coordinates;
- non-max suppression (NMS)
- region-of-interest (ROI) pooling to get a feature map for each potential object, which in the standard supervised setting would be classified by a learned, closed-set classification head;
However, in the open-vocabulary (zero-shot) setting, the closed-set classifier is replaced: each pooled box feature is passed through the pretrained V2L layer into the language space and classified by comparing it with the word embeddings of the base-class names plus a learned "background" embedding; keeping the V2L layer and the word embeddings fixed means the same head can later score classes outside the base set (see the sketch below).
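A simplified sketch of this language-space classification head; module and variable names are mine, and only the overall idea (fixed V2L, fixed class-name embeddings, learned background) follows the paper:

```python
import torch
import torch.nn as nn

class V2LBoxClassifier(nn.Module):
    """Classify ROI-pooled box features by similarity to class-name embeddings in the language space."""

    def __init__(self, v2l, base_class_embeddings):
        super().__init__()
        self.v2l = v2l                                   # vision-to-language layer pretrained in stage 1
        for p in self.v2l.parameters():                  # keep it fixed so novel classes stay reachable
            p.requires_grad_(False)
        # word embeddings of the base-class names, shape (num_base_classes, D); also kept fixed
        self.register_buffer("class_emb", base_class_embeddings)
        # learned embedding for the "background" (no object) category
        self.bg_emb = nn.Parameter(torch.zeros(1, base_class_embeddings.size(1)))

    def forward(self, roi_features):
        e = self.v2l(roi_features)                       # (num_boxes, D), language-space box features
        classes = torch.cat([self.bg_emb, self.class_emb], dim=0)
        return e @ classes.t()                           # (num_boxes, 1 + num_base_classes) logits

# training on base classes only: loss = torch.nn.functional.cross_entropy(head(roi_feats), labels)
```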
3.3 Testing
basically the same as training, except that in the last step the box features (after V2L) are compared to the embeddings of the target classes $V_T$.
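As a usage note for the head sketched in 3.2: at test time the same head is simply queried with a new class list; `embed_class_names` is a hypothetical helper that looks up the frozen word embeddings of the target-class names.

```python
# inference: swap the base-class embeddings for the target-class embeddings, reuse everything else
target_emb = embed_class_names(V_T)                      # hypothetical helper -> (num_target_classes, D)
classes = torch.cat([head.bg_emb, target_emb], dim=0)    # 'head' is the V2LBoxClassifier sketched earlier
scores = head.v2l(roi_features) @ classes.t()            # (num_boxes, 1 + num_target_classes)
labels = scores.argmax(dim=1)                            # 0 = background, i > 0 = i-th target class
```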