+**********************************************************************
*
*
*                          Invitation
*
*
*
*      Computer Science Advanced Seminar (Informatik-Oberseminar)
*
*
*
+**********************************************************************

Time:  Tuesday, March 6, 2024, 13:00
Place: Room 025, Mies-van-der-Rohe Str. 15 (UMIC building)

Speaker: Jonathon Luiten, M.Sc.
                Chair of Computer Science 13 (Lehrstuhl Informatik 13)

Topic: Dynamic 3D Representations and Robust Evaluation for Visual Tracking

Abstract:

Visual tracking is a core task within computer vision, one that involves understanding the motion and persistence of the dynamic world when observed in video. Building performant tracking algorithms is critical for many applications such as robotics; self-driving vehicles; virtual and augmented reality; scene analysis for sporting, retail and construction scenarios; and content creation and editing. While other areas of computer vision such as recognition and detection have recently reached outstanding performance due to deep learning and extremely large training datasets, tracking has remained an incredibly difficult task where such approaches have not been able to achieve similar success. We argue that tracking is inherently different from these tasks and that simply scaling up compute and data is not going to be enough. In this thesis we develop what we believe to be the missing piece holding tracking back from similar success: the use of dynamic 3D representations that can be used to model the underlying scene. Furthermore, we find that a second factor holding back the field of visual tracking is the lack of adequate evaluation metrics and benchmark settings. We address these limitations by introducing novel metrics and benchmarks, which are crucial for measuring the performance of algorithms and guiding the field toward meaningful progress.

The first half of this thesis deals with approaches to lift representations for tracking to 3D, both at the level of whole objects (MOTSFusion) and at the level of infinitesimal 3D scene elements (Dynamic 3D Gaussians). Traditionally, tracking involves finding correspondences between static 2D representations in each timestep, such as pixels or bounding boxes. Instead, we represent the world as a set of dynamic 3D representations that move over time, so that each one consistently represents the same physical location in space as it moves.
We reformulate tracking from a correspondence estimation problem to an analysis-by-synthesis problem of fitting an underlying dynamic 3D model whose motion explains changes in image content across timesteps. By using 3D representations we can better model appearance changes due to the 3D motion of the scene and the motion of the camera through the world, while also making use of intuitive physics knowledge about how objects move through the 3D world. This enables us to obtain better tracking results while also producing consistent dynamic 3D representations that are directly useful for many downstream tasks.

The second half of this thesis deals with building robust metrics and benchmarks for evaluating the performance of visual tracking algorithms. For the task of Multi-Object Tracking (MOT), previous evaluation metrics have been sorely lacking, focusing only on particular aspects of tracking performance (e.g. detection or association), but not being able to holistically measure improvements in tracking performance. Furthermore, tracking evaluation has been limited to settings where only a small number of fixed object classes were evaluated. We address both of these evaluation limitations by proposing the HOTA Metrics for evaluating tracking performance in a fair and holistic way, and by introducing the task of Open-World Tracking, which extends tracking evaluation to an open-world setting where a potentially unlimited set of object classes needs to be tracked, even if they were not previously seen during training. Together, these mark a step-change in how tracking methods are evaluated and benchmarked, and allow the tracking community to make meaningful progress towards more performant and useful tracking algorithms.

Overall, by developing both dynamic 3D representations for tracking and a novel set of evaluation metrics and benchmarks, this thesis provides a number of crucial missing pieces that are needed to move towards truly useful and performant tracking algorithms, and thus toward the success of the multitude of applications for which tracking is a core component.

You are invited by: the lecturers of Computer Science