[Informatik-Vortraege] Einladung Informatik-Oberseminar Malte Nuhn

5 Jul 2019

      +**********************************************************************
*
*
*                          Einladung
*
*
*
*                     Informatik-Oberseminar
*
*
*
+**********************************************************************

Zeit:  Freitag, 12. Juli 2019, 10.00 Uhr
Ort:   Informatikzentrum, E3, Raum 9222

Referent: Dipl.-Inform. Malte Nuhn

Thema: Unsupervised Training with Applications in Natural Language 
Processing//

Abstract:

The state-of-the-art algorithms for various natural language processing 
tasks require large amounts of labeled training data. At the same time, 
obtaining labeled data of high quality is often the most costly step in 
setting up natural language processing systems.Opposed to this, 
unlabeled data is much cheaper to obtain and available in larger 
amounts.Currently, only few training algorithms make use of unlabeled 
data. In practice, training with only unlabeled data is not performed at 
all. In this thesis, we study how unlabeled data can be used to train a 
variety of models used in natural language processing. In particular, we 
study models applicable to solving substitution ciphers, spelling 
correction, and machine translation. This thesis lays the groundwork for 
unsupervised training by presenting and analyzing the corresponding 
models and unsupervised training problems in a consistent manner.We show 
that the unsupervised training problem that occurs when breaking 
one-to-one substitution ciphers is equivalent to the quadratic 
assignment problem (QAP) if a bigram language model is incorporated and 
therefore NP-hard. Based on this analysis, we present an effective 
algorithm for unsupervised training for deterministic substitutions. In 
the case of English one-to-one substitution ciphers, we show that our 
novel algorithm achieves results close to human performance, as 
presented in [Shannon 49].

Also, with this algorithm, we present, to the best of our knowledge, the 
first automatic decipherment of the second part of the Beale 
ciphers.Further, for the task of spelling correction, we work out the 
details of the EM algorithm [Dempster & Laird + 77] and experimentally 
show that the error rates achieved using purely unsupervised training 
reach those of supervised training.For handling large vocabularies, we 
introduce a novel model initialization as well as multiple training 
procedures that significantly speed up training without hurting the 
performance of the resulting models significantly.By incorporating an 
alignment model, we further extend this model such that it can be 
applied to the task of machine translation. We show that the true 
lexical and alignment model parameters can be learned without any 
labeled data: We experimentally show that the corresponding likelihood 
function attains its maximum for the true model parameters if a 
sufficient amount of unlabeled data is available. Further, for the 
problem of spelling correction with symbol substitutions and local 
swaps, we also show experimentally that the performance achieved with 
purely unsupervised EM training reaches that of supervised training. 
Finally, using the methods developed in this thesis, we present results 
on an unsupervised training task for machine translation with a ten 
times larger vocabulary than that of tasks investigated in previous work.

Es laden ein: die Dozentinnen und Dozenten der Informatik
_______________________________________________

-- 
--
Stephanie Jansen

Faculty of Mathematics, Computer Science and Natural Sciences
HLTPR - Human Language Technology and Pattern Recognition
RWTH Aachen University
Ahornstraße 55
D-52074 Aachen
Tel. Frau Jansen:  +49 241 80-216 06
Tel. Frau Andersen: +49 241 80-216 01
Fax: +49 241 80-22219
sek@i6.informatik.rwth-aachen.de
www.hltpr.rwth-aachen.de
Tel: +49 241 80-216 01/06
Fax: +49 241 80-22219
sek@i6.informatik.rwth-aachen.de
www.hltpr.rwth-aachen.de

[Informatik-Vortraege] Einladung Informatik-Oberseminar Malte Nuhn

Sekretariat I6