Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement

Mostafa Sadeghi; Xavier Alameda-Pineda

doi:10.1109/TSP.2021.3066038

Journal article, IEEE Transactions on Signal Processing, 2021

Mostafa Sadeghi

Function : Author
PersonId : 752828
IdHAL : msadeghi
ORCID : 0000-0002-0272-8017

Interpretation and Modelling of Images and Videos

Speech Modeling for Facilitating Oral-Based Communication

Xavier Alameda-Pineda

Function : Author
PersonId : 16186
IdHAL : xavier-alameda-pineda
ORCID : 0000-0002-5354-1084
IdRef : 18450919X

Interpretation and Modelling of Images and Videos

Towards socially intelligent robots through learning, perception and control

Abstract

This paper addresses unsupervised (unknown-noise) speech enhancement using latent-variable generative models. We propose to learn a generative model of clean-speech spectrograms based on a variational autoencoder (VAE) in which a mixture of audio and visual networks infers the posterior of the latent variables. The motivation is that visual data, i.e. lip images of the speaker, provide helpful information about speech that is complementary to the audio. They can therefore help train a richer inference network in which audio and visual information are fused. Moreover, during speech enhancement, the visual data are used to initialize the latent variables, which gives a more robust initialization than the noisy speech spectrogram. A variational inference approach is derived to train the proposed VAE. Thanks to the novel inference procedure and the robust initialization, the proposed audio-visual VAE outperforms the standard audio-only counterpart on speech enhancement.
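
As a rough illustration of the idea of a mixture of inference networks, the sketch below treats the approximate posterior as a two-component mixture of diagonal Gaussians, one produced from audio and one from visual features. It is not the authors' implementation: the mixing weight `alpha`, the function names, and the choice of the visual mean as the initial latent value are all assumptions made for illustration, and the `(mean, var)` pairs stand in for the outputs of encoder networks that are not modelled here.

```python
import math
import random


def gaussian_log_pdf(z, mean, var):
    """Log density of a diagonal Gaussian evaluated at z."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(z, mean, var)
    )


def sample_mixture_posterior(audio_params, visual_params, alpha, rng):
    """Draw z from alpha * q_a(z | audio) + (1 - alpha) * q_v(z | visual).

    audio_params and visual_params are hypothetical (mean, var) pairs,
    standing in for the outputs of the audio and visual encoders.
    Returns the sample and the name of the component that produced it.
    """
    use_audio = rng.random() < alpha
    mean, var = audio_params if use_audio else visual_params
    z = [rng.gauss(m, math.sqrt(v)) for m, v in zip(mean, var)]
    return z, ("audio" if use_audio else "visual")


def mixture_log_density(z, audio_params, visual_params, alpha):
    """Log of the mixture density at z (requires 0 < alpha < 1)."""
    log_a = math.log(alpha) + gaussian_log_pdf(z, *audio_params)
    log_v = math.log(1.0 - alpha) + gaussian_log_pdf(z, *visual_params)
    top = max(log_a, log_v)  # log-sum-exp for numerical stability
    return top + math.log(math.exp(log_a - top) + math.exp(log_v - top))


def visual_initialization(visual_params):
    """Assumed initialization at enhancement time: the visual branch mean."""
    mean, _ = visual_params
    return list(mean)
```

For example, `visual_initialization(([0.3, -1.2], [0.5, 0.5]))` returns the visual mean as a starting point that does not depend on the noisy audio, which is the intuition behind the robust initialization mentioned in the abstract.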

Keywords

Audio-visual speech enhancement; generative models; variational auto-encoder; mixture model

Domains

Computer Vision and Pattern Recognition [cs.CV]; Signal and Image Processing; Machine Learning [cs.LG]; Sound [cs.SD]

Main file

main.pdf (4.4 MB)

Origin : Files produced by the author(s)

Contributor : Xavier Alameda-Pineda

https://inria.hal.science/hal-02926172

Submitted on : Wednesday, January 26, 2022, 11:41:26 AM

Last modification on : Saturday, April 27, 2024, 3:09:38 AM

Dates and versions

hal-02926172 , version 1 (09-03-2021)

hal-02926172 , version 2 (26-01-2022)

Identifiers

HAL Id : hal-02926172 , version 2
ARXIV : 1912.10647
DOI : 10.1109/TSP.2021.3066038

Cite

Mostafa Sadeghi, Xavier Alameda-Pineda. Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement. IEEE Transactions on Signal Processing, 2021, 69, pp.1899-1909. ⟨10.1109/TSP.2021.3066038⟩. ⟨hal-02926172v2⟩

Collections

UNIV-RENNES1 UGA CNRS INRIA IRISA INSMI LJK LJK_GI LJK_GI_PERCEPTION UNIV-LORRAINE INRIA2 LORIA LORIA-NLPKD UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES MIAI ANR UR1-MATH-NUM LJK-GI-ROBOTLEARN
