Journal article, IEEE Transactions on Signal Processing, 2021

Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement

Abstract

In this paper, we are interested in unsupervised (unknown noise) speech enhancement using latent variable generative models. We propose to learn a generative model for clean speech spectrograms based on a variational autoencoder (VAE), where a mixture of audio and visual networks is used to infer the posterior of the latent variables. This is motivated by the fact that visual data, i.e., lip images of the speaker, provide helpful and complementary information about speech. As such, they can help train a richer inference network in which the audio and visual information are fused. Moreover, during speech enhancement, visual data are used to initialize the latent variables, thus providing a more robust initialization than the noisy speech spectrogram. A variational inference approach is derived to train the proposed VAE. Thanks to the novel inference procedure and the robust initialization, the proposed audio-visual VAE outperforms the standard audio-only counterpart on speech enhancement.
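To make the idea of a "mixture of audio and visual networks" concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes each encoder is a single linear map that outputs the mean and log-variance of a diagonal Gaussian, and that each frame picks the audio encoder with a fixed probability `pi_audio`. All function names, parameter names, and dimensions are hypothetical. The last helper illustrates the initialization idea from the abstract: starting the latent variables from the visual encoder's mean rather than from the noisy audio.

```python
import numpy as np

def gaussian_encoder(x, W_mu, W_logvar):
    """Toy linear encoder: returns mean and log-variance of a diagonal Gaussian."""
    return x @ W_mu, x @ W_logvar

def sample_mixture_posterior(audio, visual, params, pi_audio, rng):
    """Draw one latent vector per frame from a two-component mixture.

    Assumption (not from the source): each frame selects the audio-based
    component with probability pi_audio, otherwise the visual-based one.
    """
    mu_a, logvar_a = gaussian_encoder(audio, params["Wa_mu"], params["Wa_logvar"])
    mu_v, logvar_v = gaussian_encoder(visual, params["Wv_mu"], params["Wv_logvar"])

    # Per-frame component indicator: True selects the audio encoder.
    use_audio = rng.random(audio.shape[0]) < pi_audio
    mu = np.where(use_audio[:, None], mu_a, mu_v)
    logvar = np.where(use_audio[:, None], logvar_a, logvar_v)

    # Reparameterized sample: z = mu + sigma * eps.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps, use_audio

def visual_initialization(visual, params):
    """Initialize latents from the visual encoder mean (no noisy audio involved)."""
    mu_v, _ = gaussian_encoder(visual, params["Wv_mu"], params["Wv_logvar"])
    return mu_v

# Hypothetical sizes: 5 frames, 8 audio features, 6 visual features, 3 latent dims.
rng = np.random.default_rng(0)
n_frames, d_audio, d_visual, d_latent = 5, 8, 6, 3
params = {
    "Wa_mu": 0.1 * rng.standard_normal((d_audio, d_latent)),
    "Wa_logvar": 0.1 * rng.standard_normal((d_audio, d_latent)),
    "Wv_mu": 0.1 * rng.standard_normal((d_visual, d_latent)),
    "Wv_logvar": 0.1 * rng.standard_normal((d_visual, d_latent)),
}
audio = rng.standard_normal((n_frames, d_audio))
visual = rng.standard_normal((n_frames, d_visual))

z, use_audio = sample_mixture_posterior(audio, visual, params, 0.5, rng)
z_init = visual_initialization(visual, params)
```

In a trained model the encoders would be neural networks and the mixing weights would be inferred rather than fixed; the sketch only shows how the two inference paths can share a single latent space.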
Origin: Files produced by the author(s)

Dates and versions

hal-02926172, version 1 (09-03-2021)
hal-02926172, version 2 (26-01-2022)

Cite

Mostafa Sadeghi, Xavier Alameda-Pineda. Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement. IEEE Transactions on Signal Processing, 2021, 69, pp.1899-1909. ⟨10.1109/TSP.2021.3066038⟩. ⟨hal-02926172v2⟩