Multitask learning in Audio Captioning: a sentence embedding regression loss acts as a regularizer

Etienne Labbé; Julien Pinquier; Thomas Pellegrini

doi:10.48550/arXiv.2305.01482

Conference Papers Year : 2023

Etienne Labbé

Function : Author
PersonId : 1186443
IdHAL : etienne-labbe
ORCID : 0000-0002-7219-5463

Équipe Structuration, Analyse et MOdélisation de documents Vidéo et Audio

Julien Pinquier

Function : Author
PersonId : 21789
IdHAL : julien-pinquier
ORCID : 0000-0003-1556-1284
IdRef : 086752839

Équipe Structuration, Analyse et MOdélisation de documents Vidéo et Audio

Thomas Pellegrini

Function : Author
PersonId : 741962
IdHAL : thomas-pellegrini
ORCID : 0000-0001-8984-1399
IdRef : 127577955

Équipe Structuration, Analyse et MOdélisation de documents Vidéo et Audio

Abstract

In this work, we study the performance of a model trained with an additional sentence embedding regression loss component for the Automated Audio Captioning task. This task aims to build systems that describe audio content with a single sentence written in natural language. Most systems are trained with the standard Cross-Entropy loss, which does not take into account the semantic closeness between sentences. We found that adding a sentence embedding loss term not only reduces overfitting but also increases SPIDEr from 0.397 to 0.418 in our first setting on the AudioCaps corpus. When we increased the weight decay value, our model came much closer to the current state of the art, with a SPIDEr score of up to 0.444 compared to 0.475. Moreover, this model uses eight times fewer trainable parameters than the current state-of-the-art method, Multi-TTA. In this training setting, however, the sentence embedding loss no longer has any impact on model performance.
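The multitask objective described in the abstract can be illustrated with a minimal sketch that adds a sentence embedding regression term to the token-level Cross-Entropy loss. The abstract does not state the exact form of the regression term or how it is weighted, so the mean squared error and the `weight` parameter used here are assumptions for illustration, and all function names are hypothetical rather than taken from the authors' code.

```python
import math


def cross_entropy(logits, target_index):
    """Cross-entropy for one token: -log softmax(logits)[target_index]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]


def caption_cross_entropy(step_logits, target_tokens):
    """Mean token-level cross-entropy over one caption."""
    losses = [cross_entropy(l, t) for l, t in zip(step_logits, target_tokens)]
    return sum(losses) / len(losses)


def embedding_regression_loss(predicted, reference):
    """Mean squared error between two sentence embeddings.

    Assumption: the abstract does not specify the regression loss,
    so MSE is used here only as a placeholder.
    """
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return sum(squared) / len(squared)


def multitask_loss(step_logits, target_tokens, predicted_emb, reference_emb,
                   weight=1.0):
    """Cross-entropy plus a weighted sentence embedding regression term.

    `weight` is a hypothetical hyperparameter balancing the two terms.
    """
    ce = caption_cross_entropy(step_logits, target_tokens)
    reg = embedding_regression_loss(predicted_emb, reference_emb)
    return ce + weight * reg


# Toy example: a 2-token caption over a 3-word vocabulary and
# 2-dimensional sentence embeddings (all values are made up).
toy_logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
toy_targets = [0, 1]
toy_predicted_emb = [0.9, 0.1]
toy_reference_emb = [1.0, 0.0]
print(multitask_loss(toy_logits, toy_targets,
                     toy_predicted_emb, toy_reference_emb, weight=0.5))
```

Setting `weight=0.0` recovers plain Cross-Entropy training, which is the baseline the abstract compares against.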

Keywords

sound event description; multitask learning; audio language task; overfitting; sentence embedding regression loss; semantic loss

Domains

Sound [cs.SD]

Main file

Multitask_learning_in_Audio_Captioning__a_sentence_embedding_regression_loss_acts_as_a_regularizer.pdf (371.81 KB)

img/SBERT-Loss-V2a.pdf (45.47 KB)

img/SBERT-Loss-V2b.pdf (41.19 KB)

img/val_loss_over_epoch-optim_AdamW-wd_1e-06-smooth_0.0.pdf (29.96 KB)

img/val_loss_over_epoch-optim_AdamW-wd_2.0-smooth_0.0.pdf (29.91 KB)

img/val_sbert.sim_over_epoch-optim_AdamW-wd_1e-06-smooth_0.75.pdf (29.89 KB)

img/val_sbert.sim_over_epoch-optim_AdamW-wd_2.0-smooth_0.75.pdf (29.87 KB)

Origin : Files produced by the author(s)

https://hal.science/hal-04207519

Submitted on : Thursday, September 14, 2023, 3:52:43 PM

Last modification on : Tuesday, January 16, 2024, 4:26:57 PM

Dates and versions

hal-04207519 , version 1 (14-09-2023)

Licence

Attribution

Identifiers

HAL Id : hal-04207519 , version 1
ARXIV : 2305.01482
DOI : 10.48550/arXiv.2305.01482

Cite

Etienne Labbé, Julien Pinquier, Thomas Pellegrini. Multitask learning in Audio Captioning: a sentence embedding regression loss acts as a regularizer. 31st European Signal Processing Conference (EUSIPCO 2023), Sep 2023, Helsinki, Finland. ⟨10.48550/arXiv.2305.01482⟩. ⟨hal-04207519⟩

Collections

UNIV-TLSE2 CNRS UT1-CAPITOLE GENCI IRIT IRIT-SAMOVA ANR IRIT-SI TOULOUSE-INP UNIV-UT3 UT3-TOULOUSEINP
