1 basic info
OVTrack: Open-Vocabulary Multiple Object Tracking
2 introduction
open vocabulary MOT: tracking beyond predefined training categories.
- the classes of interest are only specified at test time
- Detection: similar to open-vocabulary detection (OVD), use CLIP to align image features with text embeddings.
- Association: CLIP feature distillation helps in learning better appearance representations.
- In addition, denoising diffusion probabilistic models (DDPMs) are used to form an effective data hallucination strategy.
OVTrack sets a new SOTA on the TAO benchmark using only static images as training data.
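The CLIP-alignment idea in the Detection bullet can be sketched as follows: region embeddings from the detector are compared against CLIP text embeddings of the test-time class prompts, so the class vocabulary is a free input rather than a fixed classifier. This is a minimal numpy sketch; the function names, the temperature value, and the use of plain cosine similarity are illustrative assumptions, not the paper's exact head design.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def open_vocab_classify(region_embeds, text_embeds, temperature=0.01):
    """Assign each region to the class whose text embedding is closest.

    region_embeds: (N, D) embeddings from the detector's image head.
    text_embeds:   (C, D) CLIP text embeddings of the class prompts,
                   supplied at test time -- this is what makes the
                   classifier open-vocabulary.
    """
    logits = cosine_sim(region_embeds, text_embeds) / temperature
    # Softmax over classes (numerically stabilized).
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs
```

Because the class set is just a list of text embeddings, adding a novel category at test time only means appending one more row to `text_embeds`.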
3 open-vocabulary MOT
the task formulation is basically the same as OVD: train on base classes, evaluate on both base and novel classes.
the evaluation benchmark builds on the TAO benchmark.
4 OVTrack
framework:
OVTrack performs three functions: localization, classification, and association.
- localization: train Faster R-CNN's localization branch in a class-agnostic manner
- classification: first replace the original classifier in Faster R-CNN with a text head, and add an image head that generates embeddings. Then use the CLIP text and image encoders to supervise these two heads, yielding the corresponding text and image distillation losses, respectively.
- Association: train with contrastive learning on paired objects across the two images of a pair (an image and its hallucinated counterpart).
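The association step above can be sketched as a contrastive loss over appearance embeddings: each detection in the key image should match its counterpart in the paired image and repel all others. This is a minimal unidirectional numpy sketch; the function name, temperature, and exact loss form are simplifying assumptions rather than the paper's full formulation.

```python
import numpy as np

def contrastive_assoc_loss(key_embeds, ref_embeds, matches, temperature=0.1):
    """Contrastive association loss over a pair of images.

    key_embeds: (N, D) appearance embeddings of detections in the key image.
    ref_embeds: (M, D) embeddings of detections in the paired image.
    matches:    length-N index array; matches[i] is the index in ref_embeds
                of the same object instance as key_embeds[i].
    """
    key = key_embeds / np.linalg.norm(key_embeds, axis=1, keepdims=True)
    ref = ref_embeds / np.linalg.norm(ref_embeds, axis=1, keepdims=True)
    logits = key @ ref.T / temperature            # (N, M) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Maximize the probability of the true match for each key detection.
    return -log_probs[np.arange(len(matches)), matches].mean()
```

At test time the same embeddings drive association directly: detections are matched across frames by nearest-neighbor similarity, so a lower training loss means more discriminative appearance features.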
Learning to track without video data.
use the large-scale, diverse image dataset LVIS to train OVTrack.
propose a data hallucination method (DDPM-based) to synthesize a paired reference view for each static image.