**********************************************************************
*
*
*                          Invitation
*
*
*
*                     Informatik-Oberseminar
*
*
*
**********************************************************************

Time:      Friday, 24 May 2024, 10:30 a.m.
Location:  Room 025, Mies-van-der-Rohe Str. 15 (UMIC building)

The talk will be held in hybrid format.
Zoom:  https://rwth.zoom-x.de/j/64536789621?pwd=d2I5Y1VDM2xscVVDbTF0ZnJVWUUvZz09
Meeting ID: 645 3678 9621, Passcode: 998171

Speaker:  Sabarinath Mahadevan, M.Sc.
          Lehrstuhl Informatik 13

Topic:  The Many Facets of Object Segmentation in Images and Videos

Abstract:

Segmentation is an important task in computer vision in which the pixels belonging to a region of interest, which often share similar characteristics, have to be separated out and assigned a unique label. This task is relevant in both the image and video domains, and finds direct applications in a wide range of fields, including but not limited to autonomous driving, robotics, image editing, and surveillance. For both images and videos, segmentation comes in various flavours that are often highly related and equally challenging. With the advent of deep learning, modern computer vision algorithms are able to leverage large amounts of available data to achieve impressive performance on many of these segmentation problems. However, such algorithms are often tailored either towards segmenting a pre-defined set of object categories or towards specific sub-domains, and need to be adapted to out-of-domain tasks by training them on additional domain-specific data, which is expensive to annotate. As a result, we need algorithms that generalise well to data from unseen domains and to new task settings, in addition to methods that can annotate data efficiently.
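As a minimal illustration of the mask representation mentioned above (array shapes and label values are chosen arbitrarily here, not taken from any of the discussed works), an instance segmentation result can be viewed as an integer label per pixel, with 0 reserved for the background:

    # Minimal sketch: a label map with one integer per pixel.
    import numpy as np

    h, w = 4, 6
    mask = np.zeros((h, w), dtype=np.int32)   # background everywhere
    mask[1:3, 1:3] = 1                        # pixels of object instance 1
    mask[2:4, 4:6] = 2                        # pixels of object instance 2

    # Each instance can be recovered as a binary mask:
    instance_1 = (mask == 1)
    print(instance_1.sum(), "pixels belong to instance 1")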

The first part of this thesis focuses on annotating objects efficiently using Interactive Segmentation, where the goal is to segment and refine objects in an image from user clicks. Here, I'll present two of our works that advance the state of the art in interactive segmentation. The first of these, ITIS, introduces a novel iterative training strategy in which clicks are added iteratively during training based on the error regions in the network predictions. This aligns the simulated user-click patterns used during training with the actual click patterns the network encounters at test time. While this strategy is effective in reducing the number of clicks required to annotate objects, ITIS can only segment one object at a time due to the limitations of its network architecture. We address this problem in our subsequent work, DynaMITe, where we formulate user clicks as a spatio-temporal sequence and develop a novel Transformer-based formulation that processes this sequence and encodes it into object or region descriptors. These descriptors are then used to generate the corresponding instance segmentation masks. Unlike previous methods, our architecture can process clicks for multiple objects at once and predicts non-overlapping segmentation masks without any post-processing.
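As a rough sketch of the iterative click-simulation idea described above (the function name and the distance-transform heuristic are illustrative, not the exact ITIS implementation), the next simulated click can be placed inside the largest error region between the current prediction and the ground truth:

    # Illustrative sketch, not the ITIS code: sample a corrective click
    # from the largest connected error region of the current prediction.
    import numpy as np
    from scipy import ndimage

    def next_click(pred_mask, gt_mask):
        """Return (y, x) of a simulated corrective click, or None if no error."""
        error = pred_mask != gt_mask                       # mislabelled pixels
        if not error.any():
            return None
        labels, n = ndimage.label(error)                   # connected error regions
        sizes = ndimage.sum(error, labels, range(1, n + 1))
        largest = labels == (np.argmax(sizes) + 1)         # biggest error region
        # Click near the interior of the region (farthest from its border).
        dist = ndimage.distance_transform_edt(largest)
        return np.unravel_index(np.argmax(dist), dist.shape)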

In the second part of this talk, I'll focus on end-to-end video segmentation networks based on 3D convolutions and present our work STEm-Seg. STEm-Seg is a bottom-up, end-to-end approach for instance segmentation in videos, which uses a partially 3D network to learn spatio-temporal embeddings that can be clustered into instance tubes based on predicted clustering parameters. The method is generic and can be applied to a wide range of video instance segmentation tasks.
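A minimal sketch of the bottom-up embedding-clustering idea, assuming per-pixel embeddings, a plain Gaussian kernel, and illustrative names rather than the exact STEm-Seg formulation: pixels across the clip are assigned to an instance if their embeddings fall close to that instance's predicted centre:

    # Illustrative sketch: soft assignment of video pixels to one instance
    # based on the distance of their embeddings to the instance centre.
    import numpy as np

    def assign_instance(embeddings, center, sigma, threshold=0.5):
        """embeddings: (T, H, W, D) per-pixel embeddings for a video clip.
        center: (D,) embedding of one instance; sigma: predicted bandwidth.
        Returns a boolean spatio-temporal tube for that instance."""
        d2 = ((embeddings - center) ** 2).sum(axis=-1)     # squared distances
        score = np.exp(-d2 / (2.0 * sigma ** 2))           # Gaussian soft assignment
        return score > threshold                           # instance tube mask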

Finally, I'll present our latest work, Point-VOS, where we show that video segmentation models can learn from spatio-temporally sparse point annotations instead of dense per-object mask annotations. We also present an efficient point-wise annotation scheme and use it to annotate two large-scale video datasets with associated language expressions. We further introduce a new Point-VOS benchmark together with corresponding baselines, and show that our point annotations can be used to achieve results close to those of state-of-the-art models that use dense mask supervision. Additionally, we evaluate models that connect vision and language on the Video Narrative Grounding (VNG) task and show that our data helps improve their performance.
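As a hedged sketch of point-level supervision (names and shapes are illustrative; this is not the actual Point-VOS training code), a segmentation loss can be evaluated only at the sparse annotated points instead of over a dense per-pixel mask:

    # Illustrative sketch: supervise a segmentation network only at the
    # sparse annotated points rather than with a dense ground-truth mask.
    import torch
    import torch.nn.functional as F

    def point_supervised_loss(logits, point_coords, point_labels):
        """logits: (H, W) predicted foreground logits for one frame.
        point_coords: (N, 2) integer (y, x) positions of annotated points.
        point_labels: (N,) 1 for foreground points, 0 for background points."""
        ys, xs = point_coords[:, 0], point_coords[:, 1]
        sampled = logits[ys, xs]                           # logits at the points only
        return F.binary_cross_entropy_with_logits(sampled, point_labels.float())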