+**********************************************************************
*
*
* Invitation
*
*
*
* Informatik-Oberseminar
*
*
*
+**********************************************************************
Time: Friday, July 12, 2019, 10:00
Location: Informatikzentrum, E3, Room 9222
Speaker: Dipl.-Inform. Malte Nuhn
Topic: Unsupervised Training with Applications in Natural Language
Processing
Abstract:
The state-of-the-art algorithms for various natural language processing
tasks require large amounts of labeled training data. At the same time,
obtaining labeled data of high quality is often the most costly step in
setting up natural language processing systems. In contrast,
unlabeled data is much cheaper to obtain and available in larger
amounts. Currently, only a few training algorithms make use of unlabeled
data. In practice, training with only unlabeled data is not performed at
all. In this thesis, we study how unlabeled data can be used to train a
variety of models used in natural language processing. In particular, we
study models applicable to solving substitution ciphers, spelling
correction, and machine translation. This thesis lays the groundwork for
unsupervised training by presenting and analyzing the corresponding
models and unsupervised training problems in a consistent manner. We show
that the unsupervised training problem that occurs when breaking
one-to-one substitution ciphers is equivalent to the quadratic
assignment problem (QAP) if a bigram language model is incorporated and
therefore NP-hard. Based on this analysis, we present an effective
algorithm for unsupervised training for deterministic substitutions. In
the case of English one-to-one substitution ciphers, we show that our
novel algorithm achieves results close to human performance, as
presented in [Shannon 49].
Also, with this algorithm, we present, to the best of our knowledge, the
first automatic decipherment of the second part of the Beale
ciphers. Further, for the task of spelling correction, we work out the
details of the EM algorithm [Dempster & Laird + 77] and experimentally
show that the error rates achieved using purely unsupervised training
reach those of supervised training. For handling large vocabularies, we
introduce a novel model initialization as well as multiple training
procedures that significantly speed up training without substantially
hurting the performance of the resulting models. By incorporating an
alignment model, we further extend this model such that it can be
applied to the task of machine translation. We show that the true
lexical and alignment model parameters can be learned without any
labeled data: We experimentally show that the corresponding likelihood
function attains its maximum for the true model parameters if a
sufficient amount of unlabeled data is available. Further, for the
problem of spelling correction with symbol substitutions and local
swaps, we also show experimentally that the performance achieved with
purely unsupervised EM training reaches that of supervised training.
Finally, using the methods developed in this thesis, we present results
on an unsupervised training task for machine translation with a ten
times larger vocabulary than that of tasks investigated in previous work.
Hosted by: the lecturers of Computer Science
_______________________________________________
--
Stephanie Jansen
Faculty of Mathematics, Computer Science and Natural Sciences
HLTPR - Human Language Technology and Pattern Recognition
RWTH Aachen University
Ahornstraße 55
D-52074 Aachen
Tel. Ms. Jansen: +49 241 80-216 06
Tel. Ms. Andersen: +49 241 80-216 01
Fax: +49 241 80-22219
sek(a)i6.informatik.rwth-aachen.de
www.hltpr.rwth-aachen.de
+**********************************************************************
*
*
* Invitation
*
*
*
* Informatik-Oberseminar
*
*
*
+**********************************************************************
Time: Friday, April 8, 2022, 12:15
Zoom URL: https://rwth.zoom.us/j/97644054920
Speaker: Krishna Subramanian, M.Sc.
Chair of Computer Science 10
Topic: Lowering the Barriers to Hypothesis-Driven Data Science
Abstract:
Data science is a frequent task in academia and industry. One common use of data science is to validate hypotheses, in which the analyst uses significance-based hypothesis testing to draw insights about a population distribution based on experimental data. Apart from data scientists, who are professionally trained in data science and are highly skilled, many non-professional analysts also carry out data analysis. These non-professionals, whom we refer to as data workers, are domain experts who lack expertise in data science, such as academic researchers, project managers, and sales managers.
Through interviews, observations, online surveys, and content analyses, we aim to understand data workers' workflows across important tasks in hypothesis testing: learning theoretical and practical statistics, selecting statistical procedures, using data science programming IDEs to experiment with ideas in source code, refining and refactoring source code, and disseminating findings from an analysis.
We present our findings grouped into two steps when performing data science tasks:
1. Preparing to perform data science tasks: We discuss our findings about the impact of formal training on real-world statistical practice; trade-offs among information sources used for selecting statistical procedures; perceived complexity and uncertainty about statistical procedure selection; and reluctance among data workers to adopt alternative methods of analysis. Based on the above findings, we present design recommendations and two artifacts to improve data workers' workflows. Our artifacts include Statsplorer, a web-based tool to help data workers kickstart analysis and learn about common issues in statistical practice, such as over-testing, overlooking assumptions, and selecting the appropriate test; and StatPlayground, an interactive simulation tool that can be used to self-learn or teach statistical concepts and statistical procedure selection.
2. Performing data science tasks: Our findings include an overview of data workers' workflows when performing hypothesis testing using programming IDEs, which follows an exploratory programming workflow; and a comparison of existing interfaces for data science programming, namely computational notebooks, scripts, and consoles, and a discussion of how well they support various steps in hypothesis testing. To improve data workers' workflows when performing data science tasks, we contribute design recommendations and two artifacts. Our artifacts include StatWire, an experimental hybrid-programming interface that encourages data workers to write high-quality source code; and Tractus, an interactive visualization that can lower the cost of working with experimental source code.
Based on our work, we present four takeaways that can be used by researchers, software developers, and educators to lower the barriers to hypothesis testing.
---
Hosted by: the lecturers of Computer Science
+**********************************************************************
*
*
* Invitation
*
*
*
* Informatik-Oberseminar
*
*
*
+**********************************************************************
Time: Tuesday, March 29, 2022, 15:00
Location:
https://rwth.zoom.us/j/95327979988?pwd=VU8rT1oyVGhiZENvQ2NuVVB2UVVndz09
Speaker: Janis Born, M.Sc.
Chair of Computer Science 8
Topic: Topological Aspects of Maps Between Surfaces
Abstract:
We consider the generation of high-quality maps between 3D surfaces in
the form of discrete homeomorphisms. Specifically, we address the
topological issues underlying the construction of such maps, which have
so far received comparatively little attention in geometry processing
research. We approach this task from two different angles: First, we
propose a robust method for the construction of maps from sparse
landmark correspondences, based on compatible layout embeddings. Our
robust embedding strategy systematically searches for short, natural
embeddings and therefore reliably avoids a range of sporadic topological
initialization errors which can occur with previous heuristic
approaches. Second, we introduce a novel algorithm to extract
topological map descriptions from approximate, non-homeomorphic input
maps. Such a purely abstract description of map topology may then be
used to guide the construction of a proper homeomorphism. As our
inference method is highly robust to a wide range of map defects and
imperfect map representations, this effectively allows delegating the
difficult task of finding a natural map topology to specialized shape
matching methods, which have grown increasingly capable. These
advancements promote the further automation of map generation techniques
in two regards: They vastly reduce the need for human supervision, and
make the results of automatic shape matching methods accessible for
topological initialization.
Hosted by: the lecturers of Computer Science