+**********************************************************************
*
*
*                          Invitation
*
*
*
*               Advanced Seminar in Computer Science
*
*
*
+**********************************************************************

Time:  Tuesday, March 5, 2024, 10:00 a.m.
Place: Room 025, Mies-van-der-Rohe Str. 15 (UMIC building)

Speaker: Ali Athar, M.Sc.
                Chair of Computer Science 13 (Lehrstuhl Informatik 13)

Topic: Segmenting and Tracking Objects in Video

Abstract:

Research on methods that can accurately localize and track objects in video has been ongoing for decades. Such methods are highly sought after for a variety of applications, including autonomous robots, self-driving vehicles, sports analytics, and video editing. Despite significant recent progress, the task is far from solved, particularly in challenging scenarios involving occlusions, motion blur, and camera ego-motion. In this thesis, we present a series of works that advance the state of research in this domain in various ways, as outlined below.

Our first work, STEm-Seg, is an end-to-end trainable method for instance segmentation that models the input video as a single 3D space-time volume and relies on clustering per-pixel embeddings to segment and track objects. This differs from existing approaches, which largely follow the tracking-by-detection paradigm. Our novel formulation for these embeddings enables us to cluster them efficiently and in an end-to-end learned fashion. The second work, HODOR, aims to reduce the need for densely annotated data when training video tracking methods. Specifically, it tackles the task of Video Object Segmentation (VOS) in a weakly supervised manner: it can be trained using static images or sparsely annotated video. To this end, we adopt a novel approach that encodes objects into concise descriptors. Existing approaches, in contrast, predominantly learn space-time correspondences, which makes them challenging to train in such a setting.

Whereas the two aforementioned works propose network architectures, our third project proposes a dataset and benchmark called BURST that aims to unify the current, fragmented landscape of datasets in video segmentation research. BURST includes a benchmark suite that evaluates multiple tasks related to object segmentation in video with shared data and consistent evaluation metrics. The aim is to facilitate knowledge exchange between the research sub-communities tackling these tasks and to encourage the development of methods with multi-task capability. Finally, our fourth work, TarViS, can be seen as a logical continuation of the above: it is a single method that can tackle multiple video segmentation tasks. To achieve this, we decouple the task definition from the core network architecture and use a set of dynamic query inputs to specify the task-specific segmentation targets. This formulation enables us to train a single model jointly on a collection of datasets spanning multiple tasks (Video Instance/Object/Panoptic Segmentation). During inference, the model can switch between tasks simply by hot-swapping the input queries.

Hosts: the lecturers of the Department of Computer Science