Dear subscribers of the colloquium newsletter,

We are pleased to announce the next session of our Communication Technology Colloquium.

Friday, September 25, 2020
Speaker: Liuhui Deng
Time: 11:00 a.m.
Location:
https://rwth.zoom.us/j/97904157921?pwd=SWpsbDl0MWhrWjY1ZkZaeFRoYmErZz09

        Meeting ID: 979 0415 7921
        Password: 481650

Master's thesis presentation: Speech Inpainting Using Image Processing Techniques

Speech inpainting is the task of reconstructing speech from damaged speech signals, where the corruption can result from improper storage, packet loss in communication networks, and similar causes. In recent years, neural networks have become an active research topic in audio inpainting, which includes speech inpainting and music inpainting. To reconstruct audio, the networks can be fed either raw audio waveforms or other feature representations such as the Short-Time Fourier Transform (STFT) or Mel Frequency Cepstral Coefficients (MFCC).

In this thesis, advanced architectures from image inpainting that are based on Convolutional Neural Networks (CNNs) are adapted to the task of speech inpainting. The motivation is twofold: neural techniques for image inpainting are well investigated and have proven powerful, and speech inpainting can be interpreted as image inpainting when the speech spectrogram is treated as a 2-dimensional image. The networks considered are mainly the context encoder, the context encoder with Generative Adversarial Networks (GANs), EdgeConnect without GANs, and EdgeConnect with GANs.
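
To illustrate the spectrogram-as-image view, here is a minimal sketch in plain Python. It is not the thesis implementation: the function names, frame length, hop size, toy signal, and the choice to zero out whole time frames (as a stand-in for packet loss) are all assumptions made for this example.

```python
import cmath
import math


def stft_magnitude(signal, frame_len=16, hop=8):
    """Return |STFT| as a 2-D list: rows are frequency bins, columns are time frames."""
    # Periodic Hann window (hypothetical choice for this sketch).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        # Naive DFT of one windowed frame, non-negative frequencies only.
        frames.append([
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len)))
            for k in range(n_bins)
        ])
    # Transpose: one row per frequency bin, one column per time frame.
    return [list(row) for row in zip(*frames)]


def mask_time_frames(spec, lost_frames):
    """Zero out whole time frames; return (corrupted spectrogram, binary mask)."""
    n_frames = len(spec[0])
    # Mask convention assumed here: 1.0 = intact, 0.0 = corrupted.
    mask = [[0.0 if t in lost_frames else 1.0 for t in range(n_frames)] for _ in spec]
    corrupted = [[v * m for v, m in zip(row, mrow)] for row, mrow in zip(spec, mask)]
    return corrupted, mask


# Toy input: a sinusoid that falls exactly on DFT bin 4 of a 16-point frame.
signal = [math.sin(2 * math.pi * 4 * n / 16) for n in range(128)]
spec = stft_magnitude(signal)
corrupted, mask = mask_time_frames(spec, lost_frames={5, 6})
```

The resulting `corrupted` array and `mask` have the same 2-D shape as an image with a missing region, which is the form of input an image inpainting network would typically expect.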

In this work, the context encoder is an encoder-decoder architecture that takes STFT magnitudes (and the ground-truth corruption mask) as input, while EdgeConnect is additionally fed an edge map of the spectrogram in order to alleviate the blurriness observed in image inpainting. EdgeConnect without GANs is composed of two sub-models, each of which is a context encoder. One sub-model, the edge completion model, reconstructs the edge map from a corrupted edge map; the other, the inpainting model, reconstructs the spectrogram from the corrupted spectrogram and the edge map. The GANs applied to these models are likewise intended to mitigate blurriness: an adversarial loss is added to the loss functions of the context encoder, the edge completion model, and the inpainting model. Experiments indicate that the context encoder (with or without GANs) outperforms CNNs that simply stack a few convolutional layers. EdgeConnect (with or without GANs) performs better still, mainly thanks to the additional information in the edge map of the spectrogram. The best model is EdgeConnect with GANs: its reconstructed speech achieves a PESQ score of 3.03, a 71.2% improvement over the corrupted input speech. In addition, analyses of edge map quality in EdgeConnect (with or without GANs) reveal that a low-quality edge map heavily degrades inpainting performance. A well-performing edge completion model is therefore of great importance and a promising direction for future work.
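
As a rough reading aid for the reported numbers, the sketch below works backwards from the PESQ score of 3.03 and the 71.2% improvement. It assumes the improvement is relative to the PESQ score of the corrupted input, which the abstract does not state explicitly; the variable names are chosen for this example only.

```python
# Figures quoted in the abstract.
pesq_reconstructed = 3.03
relative_improvement = 0.712  # 71.2%

# Assumption: improvement = (reconstructed - corrupted) / corrupted.
pesq_corrupted = pesq_reconstructed / (1.0 + relative_improvement)
absolute_gain = pesq_reconstructed - pesq_corrupted

print(f"implied PESQ of corrupted input: {pesq_corrupted:.2f}")
print(f"implied absolute gain: {absolute_gain:.2f}")
```

Under that assumption, the corrupted input would score about 1.77, so the reconstruction gains roughly 1.26 PESQ points.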


All interested parties are cordially invited; registration is not required.

General information on the colloquium and the current schedule of the Communication Technology Colloquium can be found at:
http://www.iks.rwth-aachen.de/aktuelles/kolloquium

-- 
Irina Ronkartz
Institute of Communication Systems (IKS)
RWTH Aachen University
Muffeter Weg 3a, 52074 Aachen, Germany
+49 241 80 26958 (phone)
ronkartz@iks.rwth-aachen.de
http://www.iks.rwth-aachen.de/