**********************************************************************
*
*                           Invitation
*
*                     Informatik-Oberseminar
*
**********************************************************************
Time:     Tuesday, March 5, 2024, 10:00 am
Location: Room 025, Mies-van-der-Rohe Str. 15 (UMIC building)
Speaker:  Ali Athar, M.Sc.
          Chair of Computer Science 13 (Informatik 13)
Topic:    Segmenting and Tracking Objects in Video
Abstract:
Research on methods that accurately localize and track objects in video
has been ongoing for decades. Such methods are highly sought after for a
variety of applications, including autonomous robots, self-driving
vehicles, sports analytics, and video editing. Despite significant
progress in recent years, the task is far from solved, in particular for
challenging scenarios involving occlusions, motion blur, and camera
ego-motion. In this thesis, we present a series of works that advance the
state of research in this domain in various ways, as outlined below.
Our first work, STEm-Seg, is an end-to-end trainable method for instance
segmentation that models the input video as a single 3D space-time volume
and relies on clustering per-pixel embeddings to segment and track objects.
This differs from existing approaches, which largely follow the
tracking-by-detection paradigm. Our novel formulation of these embeddings
enables efficient, end-to-end learned clustering. The second work, HODOR,
aims to reduce the need for densely annotated data when training video
tracking methods. Specifically, it tackles the task of Video Object
Segmentation (VOS) in a weakly supervised manner, so that it can be trained
from static images or sparsely annotated video. To this end, we adopt a
novel approach that encodes objects into concise descriptors. This
contrasts with existing approaches, which predominantly learn space-time
correspondences and are therefore difficult to train in such a setting.
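The embedding-clustering idea above can be illustrated with a toy sketch: pixels whose embedding vectors lie close together are grouped into the same instance. The greedy nearest-neighbor grouping below is only a stand-in for illustration (STEm-Seg itself learns per-instance centers and variances end-to-end); all names, dimensions, and the bandwidth value are illustrative assumptions.

```python
import numpy as np

def cluster_pixel_embeddings(embeddings, bandwidth=1.0):
    """Greedy clustering of per-pixel embeddings: seed a new instance at an
    unassigned pixel and absorb all unassigned pixels within `bandwidth`.
    A toy stand-in for a learned clustering, not the actual method."""
    n = embeddings.shape[0]
    labels = -np.ones(n, dtype=int)   # -1 means "not yet assigned"
    next_id = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        dists = np.linalg.norm(embeddings - embeddings[i], axis=1)
        members = (labels == -1) & (dists < bandwidth)
        labels[members] = next_id
        next_id += 1
    return labels

# two well-separated groups of "pixel" embeddings -> two instances
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(cluster_pixel_embeddings(emb))  # [0 0 1 1]
```

The key property the sketch shares with the real formulation is that segmentation and tracking fall out of a single clustering step over the space-time volume, rather than per-frame detection followed by association.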
Whereas the two aforementioned works propose network architectures, our
third project proposes a dataset and benchmark called BURST that aims to
unify the current, fragmented landscape of datasets in video segmentation
research. BURST includes a benchmark suite that evaluates multiple tasks
related to object segmentation in video with shared data and consistent
evaluation metrics. The idea behind this is to facilitate knowledge
exchange between the research sub-communities tackling these tasks and also
to encourage the development of methods with multi-task capability.
Finally, our fourth work, TarViS, is a logical continuation of the above:
a single method that can tackle multiple video segmentation tasks. To
achieve this, we decouple the task definition from the core network
architecture and use a set of dynamic query inputs to specify the
task-specific segmentation targets. This formulation enables us to train a
single model jointly on a collection of datasets spanning multiple tasks
(Video Instance/Object/Panoptic Segmentation). During inference, the model
switches between tasks by simply hot-swapping the input queries.
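The query hot-swapping idea can be sketched as follows: a shared feature extractor and decoder stay fixed, and only the set of task-specific queries changes at inference time. The shapes, query counts, and dot-product "decoding" below are illustrative assumptions, not the actual TarViS architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# shared video features from a common backbone (illustrative shapes)
video_features = rng.normal(size=(100, 32))  # 100 "pixels", 32-dim features

# task-specific query sets; swapping them switches the task at inference
queries = {
    "instance": rng.normal(size=(10, 32)),   # one query per object instance
    "panoptic": rng.normal(size=(19, 32)),   # one query per semantic class
}

def segment(task):
    """Toy query-based decoding: each query scores every pixel, and each
    pixel is assigned to its highest-scoring query."""
    q = queries[task]
    scores = video_features @ q.T            # (pixels, num_queries)
    return scores.argmax(axis=1)             # per-pixel assignment

masks_inst = segment("instance")  # 10-way assignment
masks_pan = segment("panoptic")   # 19-way assignment over the same features
```

The design point the sketch captures is that the network body is task-agnostic: the same features support instance, object, and panoptic segmentation, and the queries alone carry the task definition.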
Invitation extended by: the lecturers of the Department of Computer Science