Towards the Automatic Processing of Language Registers: Semi-supervisedly Built Corpus and Classifier for French
Conference paper. Year: 2023

Abstract

Language registers are a strongly perceptible characteristic of texts and speech, yet they remain poorly studied in natural language processing. In this paper, we present a semi-supervised approach that jointly builds a corpus of register-labeled texts and an associated classifier. The approach relies on a small initial seed of expert data. After retrieving web pages at large scale, it iteratively alternates between training an intermediate classifier and annotating new texts to augment the labeled corpus. Applied to the casual, neutral, and formal registers, the approach yields a 750M-word corpus and a final neural classifier with acceptable performance.
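
The iterative loop described in the abstract (train an intermediate classifier, annotate new texts, augment the corpus, repeat) can be illustrated with a minimal self-training sketch. This is not the authors' implementation: the nearest-centroid classifier, the two-dimensional numeric feature vectors, the confidence threshold of 0.6, and all function names are assumptions chosen to keep the example short, whereas the paper reports a neural classifier as its final model.

```python
import math
from collections import defaultdict


def train_centroids(labeled):
    """Toy intermediate classifier (assumption): one mean vector per register."""
    sums, counts = {}, defaultdict(int)
    for features, label in labeled:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in vec] for lab, vec in sums.items()}


def predict(centroids, features):
    """Return (label, confidence); confidence is a softmax over negative distances."""
    weights = {
        lab: math.exp(-math.dist(features, c)) for lab, c in centroids.items()
    }
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


def self_train(seed, pool, threshold=0.6, max_rounds=10):
    """Alternate training and annotation until no new text is confidently labeled."""
    labeled, remaining = list(seed), list(pool)
    for _ in range(max_rounds):
        centroids = train_centroids(labeled)
        newly, still = [], []
        for features in remaining:
            label, confidence = predict(centroids, features)
            if confidence >= threshold:
                newly.append((features, label))  # accept the automatic annotation
            else:
                still.append(features)  # keep for a later round
        if not newly:
            break
        labeled.extend(newly)
        remaining = still
    return labeled, train_centroids(labeled), remaining


# Hypothetical seed: one expert-labeled example per register.
seed = [([0.0, 0.0], "casual"), ([5.0, 5.0], "neutral"), ([10.0, 10.0], "formal")]
# Hypothetical unlabeled pool standing in for features of retrieved web pages.
pool = [[0.5, 0.2], [9.5, 10.0], [5.2, 4.9], [2.5, 2.5]]

corpus, model, unlabeled = self_train(seed, pool)
```

In this toy run, the point halfway between the casual and neutral seeds never reaches the threshold and stays unlabeled, which shows the role of the confidence cutoff: only texts the intermediate classifier is sure about enter the corpus.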
Main file
paper_58.pdf (335.98 KB)
58-poster.pdf (232.25 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-02064694 , version 1 (09-04-2019)

Cite

Gwénolé Lecorvé, Hugo Ayats, Benoît Fournier, Jade Mekki, Jonathan Chevelu, et al. Towards the Automatic Processing of Language Registers: Semi-supervisedly Built Corpus and Classifier for French. International Conference on Computational Linguistics and Intelligent Text Processing (CICLing), Apr 2019, La Rochelle, France. pp.480-492, ⟨10.1007/978-3-031-24337-0_34⟩. ⟨hal-02064694⟩
358 views
462 downloads
