Exploitation de transcriptions bruitées pour la reconnaissance automatique de la parole

Adrien Dufraux 1, 2 
1 MULTISPEECH - Speech Modeling for Facilitating Oral-Based Communication
Inria Nancy - Grand Est, LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract : Usual methods to design automatic speech recognition systems require speech datasets with high-quality transcriptions. These datasets are composed of the acoustic signals uttered by speakers and the corresponding word-level transcripts representing what is being said. It takes several thousand hours of transcribed speech to build a good speech recognition model. The dataset must include a variety of speakers recorded in different situations in order to cover the wide variability of speech and language. To create such a dataset, human annotators are asked to listen to audio tracks and to write down the corresponding text. This process is costly and can lead to errors, since what is being said in realistic settings is not always easy to understand. Poorly transcribed signals cause a drop in the performance of the acoustic model. To improve the quality of the transcripts, the same utterances may be transcribed by several people, but this makes the process even more expensive.

This thesis takes the opposite view. We design algorithms which can exploit datasets with "noisy" transcriptions, i.e., transcriptions which contain errors. The main goal of this thesis is to reduce the cost of building an automatic speech recognition system by limiting the performance drop induced by these errors.

We first introduce the Lead2Gold algorithm. Lead2Gold is based on a cost function that is tolerant to datasets with noisy transcriptions. We model transcription errors at the letter level with a noise model. For each transcript in the dataset, the algorithm searches for a set of likely better transcripts, relying on a beam search in a graph. This technique is not usually used to design cost functions. We show that it is possible to explicitly add new elements (here a noise model) to design complex cost functions.

We then express the Lead2Gold loss in the weighted finite-state transducer (wFST) formalism. wFSTs are graphs whose edges are weighted and represent symbols. To build flexible cost functions, we can compose several graphs. With our proposal, it becomes easier to add new elements, such as a lexicon, to better characterize good transcriptions. We show that using wFSTs is a good alternative to using Lead2Gold's explicit beam search. The modular formulation allows us to design a new variety of cost functions that model transcription errors.

Finally, we conduct a data collection experiment in real conditions. We observe different types of annotator profiles. Annotators do not have the same perception of acoustic signals and hence can produce different types of errors. The explicit goal of this experiment is to collect transcripts with errors and to prove the usefulness of modeling these errors.
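
As an illustration of the idea of a cost function that tolerates noisy transcripts, here is a minimal Python sketch. It is not the Lead2Gold implementation: it assumes a substitution-only letter noise model (no insertions or deletions), a fixed hypothetical probability `p_correct` that a letter is transcribed correctly, and a small hand-written dictionary of candidate clean transcripts with acoustic log-probabilities standing in for the output of a beam search. All function and parameter names are assumptions made for this example.

```python
import math

# Assumed letter inventory for this sketch: lowercase letters plus space.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def noise_log_prob(noisy, clean, p_correct=0.9):
    """log P(noisy | clean) under a substitution-only letter noise model.

    Each letter is kept with probability p_correct, otherwise replaced
    uniformly by one of the other letters. Length mismatches get
    probability zero because insertions/deletions are not modeled here.
    """
    if len(noisy) != len(clean):
        return float("-inf")
    p_sub = (1.0 - p_correct) / (len(ALPHABET) - 1)
    return sum(
        math.log(p_correct if n == c else p_sub) for n, c in zip(noisy, clean)
    )


def tolerant_loss(noisy, candidates, p_correct=0.9):
    """-log sum_c P_acoustic(c) * P(noisy | c) over the candidate set.

    `candidates` maps each candidate clean transcript to its acoustic
    log-probability (a stand-in for beam search hypotheses).
    """
    scores = [
        lp + noise_log_prob(noisy, c, p_correct) for c, lp in candidates.items()
    ]
    scores = [s for s in scores if s != float("-inf")]
    if not scores:
        return float("inf")
    m = max(scores)  # log-sum-exp for numerical stability
    return -(m + math.log(sum(math.exp(s - m) for s in scores)))


def most_likely_clean(noisy, candidates, p_correct=0.9):
    """Candidate maximizing acoustic score plus noise-model score."""
    return max(
        candidates,
        key=lambda c: candidates[c] + noise_log_prob(noisy, c, p_correct),
    )


if __name__ == "__main__":
    noisy = "the cap"  # transcript as written by the annotator
    candidates = {"the cat": math.log(0.999), "the cap": math.log(0.001)}
    print(most_likely_clean(noisy, candidates))
    print(tolerant_loss(noisy, candidates))
```

In this toy setting, a candidate that differs from the annotator's transcript can still win when the acoustic score favors it strongly enough, which is the behavior a noise-tolerant loss is meant to allow.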
Contributor : Thèses UL
Submitted on : Tuesday, May 17, 2022 - 9:27:17 AM
Last modification on : Wednesday, May 18, 2022 - 3:40:18 AM


  • HAL Id : tel-03669875, version 1


Adrien Dufraux. Exploitation de transcriptions bruitées pour la reconnaissance automatique de la parole. Computer Science [cs]. Université de Lorraine, 2022. French. ⟨NNT : 2022LORR0032⟩. ⟨tel-03669875⟩


