Meeting ID: 986 2945 7476
Passcode: 073217
Speaker: Paul Voigtlaender, M.Sc.
Chair of Computer Science 13 (Lehrstuhl Informatik 13)
Topic: Video Object Segmentation and Tracking
Abstract:
Video Object Segmentation (VOS) is the computer vision task of segmenting
generic objects in a video given their ground-truth segmentation masks
in the first frame. Closely related are the tasks of single-object
tracking (SOT) and multi-object tracking (MOT), where one or multiple
objects need to be tracked at the bounding-box level. These tasks
have important applications such as autonomous driving and video
editing, yet all of them remain very challenging to this day.
In this talk, we present our work on VOS, MOT, and SOT.
First, we present a VOS method, FEELVOS, which follows the feature
embedding-learning paradigm. FEELVOS is one of the first VOS methods
to use a feature embedding as internal guidance of a convolutional
network and to learn the embedding end-to-end with a segmentation loss.
Following this approach, FEELVOS achieves strong results while being
fast and not requiring test-time fine-tuning. This feature embedding-learning
paradigm, together with end-to-end learning, has by now become the
dominant approach for VOS.
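
As a rough illustration of how a feature embedding can guide segmentation,
the sketch below computes, for every pixel of the current frame, the distance
to its nearest first-frame object pixel in embedding space. This is a minimal
sketch under assumptions, not the FEELVOS implementation: the function name,
the array shapes, and the plain squared Euclidean distance are all chosen
here for illustration only.

```python
import numpy as np


def nearest_neighbor_distance_map(query_emb, ref_emb, ref_mask):
    """Distance from each query pixel to the closest reference object pixel.

    Hypothetical helper for illustration only.
    query_emb: (H, W, C) embedding of the current frame.
    ref_emb:   (H, W, C) embedding of the first frame.
    ref_mask:  (H, W) boolean first-frame mask of the object.
    Returns an (H, W) map; small values suggest "looks like the object".
    """
    h, w, c = query_emb.shape
    obj = ref_emb[ref_mask]  # (N, C) embeddings of object pixels
    if obj.shape[0] == 0:
        # No object pixels in the reference frame: nothing to match against.
        return np.full((h, w), np.inf)
    q = query_emb.reshape(-1, c)  # (H*W, C)
    # Pairwise squared Euclidean distances, shape (H*W, N).
    # Assumption: the actual method may use a different distance.
    d2 = ((q[:, None, :] - obj[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).reshape(h, w)
```

In an embedding-guided network, a map like this could be passed to the
segmentation head as an additional input channel, so that the embedding is
trained through the segmentation loss rather than with a separate objective.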
We further extend the popular MOT task to Multi-Object Tracking and
Segmentation (MOTS) by requiring methods to also produce segmentation
masks. We propose a semi-automatic labeling method and use it to annotate
two existing MOT datasets with masks. We release the resulting KITTI MOTS
and MOTSChallenge benchmarks together with new evaluation measures and
a baseline method. Additionally, we promote the new MOTS task by hosting a
workshop challenge. MOTS is a step towards bringing the communities of VOS
and MOT together to facilitate further exchange of ideas.
Finally, we present Siam R-CNN, a Siamese re-detection architecture
based on Faster R-CNN, to tackle the task of long-term single-object
tracking. In contrast to most previous long-term tracking approaches,
Siam R-CNN performs re-detection on the whole image instead of a local
window, allowing it to recover after losing the object of interest.
Additionally, we propose a Tracklet Dynamic Programming Algorithm (TDPA)
to incorporate spatio-temporal context into Siam R-CNN. Siam R-CNN
produces strong results for SOT and VOS, and performs especially well
for long-term tracking.
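
To give an idea of how dynamic programming can bring spatio-temporal context
into a re-detection tracker, the sketch below picks one detection per frame
so that the sum of detection scores and box overlaps between consecutive
frames is maximal. It is a simplified, frame-level stand-in and not the TDPA
itself, which operates on tracklets. The function names, the IoU-based
consistency term, and the weight parameter are assumptions made for this
example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_track(frames, consistency_weight=1.0):
    """Choose one detection index per frame with a Viterbi-style recursion.

    Hypothetical helper for illustration only.
    frames: list over time; each entry is a non-empty list of (score, box).
    Returns the list of chosen detection indices, one per frame.
    """
    if not frames:
        return []
    # best[j]: best total score of any path ending in detection j.
    best = [score for score, _ in frames[0]]
    backpointers = []
    for t in range(1, len(frames)):
        prev = frames[t - 1]
        new_best, ptr = [], []
        for score, box in frames[t]:
            # Score of reaching this detection from each previous detection.
            cands = [
                best[j] + consistency_weight * iou(prev[j][1], box)
                for j in range(len(prev))
            ]
            j_star = max(range(len(cands)), key=cands.__getitem__)
            new_best.append(cands[j_star] + score)
            ptr.append(j_star)
        best = new_best
        backpointers.append(ptr)
    # Trace the best path back from the last frame.
    idx = max(range(len(best)), key=best.__getitem__)
    path = [idx]
    for ptr in reversed(backpointers):
        idx = ptr[idx]
        path.append(idx)
    path.reverse()
    return path
```

Because every detection in the whole image is a candidate in each frame, a
path can jump back to the object after a gap, whereas the consistency term
discourages switching to a higher-scoring distractor elsewhere in the image.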