LINA - Equipe Traitement Automatique du Langage Naturel
Conference paper, 2016

Efficient Data Selection for Bilingual Terminology Extraction from Comparable Corpora

Abstract

Comparable corpora are the main alternative to parallel corpora for extracting bilingual lexicons. Although comparable corpora are easier to build, specialized comparable corpora are often of modest size compared with general-domain corpora. Consequently, the observations of word co-occurrences, which are the basis of context-based methods, are unreliable. We propose in this article to improve the word co-occurrences of specialized comparable corpora, and thus their context representation, by using general-domain data. This idea, which has already been used in machine translation for more than a decade, is not straightforward for the task of bilingual lexicon extraction from specific-domain comparable corpora. We go against the mainstream of this task, where many studies support the idea that adding out-of-domain documents decreases the quality of lexicons. Our empirical evaluation shows the advantages of this approach, which yields a significant gain in the accuracy of the extracted lexicons.
Main file
C16-1321.pdf (135.48 KB)
Origin : Files produced by the author(s)

Dates and versions

hal-02001789 , version 1 (09-01-2020)

Identifiers

  • HAL Id : hal-02001789 , version 1

Cite

Amir Hazem, Emmanuel Morin. Efficient Data Selection for Bilingual Terminology Extraction from Comparable Corpora. 26th International Conference on Computational Linguistics (COLING), Dec 2016, Osaka, Japan. pp.3401-3411. ⟨hal-02001789⟩