Contextual Concept Discovery Algorithm

Abstract : In this paper, we focus on the extraction and evaluation of ontological concepts from HTML documents. To improve this process, we propose an unsupervised hierarchical clustering algorithm, “Contextual Concept Discovery” (CCD), which applies the K-means partitioning algorithm incrementally and is guided by a structural context. This context exploits the HTML structure and the location of words to select the semantically closest co-occurrents of each word and to improve word weighting. Guided by this context definition, we perform an incremental clustering that refines the context of each word cluster in order to extract semantically meaningful concepts. The CCD algorithm can run either automatically or interactively with the user. Its last function is to support the evaluation task: this functionality relies on a large collection of web documents and on several context definitions derived from it through linguistic and documentary analysis. We evaluate our algorithm on HTML documents from the tourism domain. Our results show how running our context-based algorithm improves the conceptual quality and relevance of the extracted ontological concepts, and how our credibility degree criterion assists domain experts and facilitates the evaluation task.
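
The abstract does not give the details of CCD, so the following is only a minimal sketch of the general idea it describes: weight word co-occurrences by the HTML element they appear in, then apply K-means repeatedly to split clusters that are still too large. The tag weights, the `max_size` stopping rule, the two-way split, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import random
from collections import defaultdict

# Hypothetical weights: words sharing a prominent HTML element count more.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "p": 1.0}


def cooccurrence_vectors(blocks):
    """Build a dense co-occurrence vector for every word that co-occurs.

    blocks is a list of (tag, words) pairs; two words co-occur when they
    appear in the same block, weighted by the block's tag.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for tag, words in blocks:
        weight = TAG_WEIGHTS.get(tag, 1.0)
        for a in words:
            for b in words:
                if a != b:
                    counts[a][b] += weight
    vocab = sorted(counts)
    return {w: [counts[w].get(v, 0.0) for v in vocab] for w in vocab}


def dist2(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, iters=20, seed=0):
    """Plain K-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return labels


def incremental_clusters(vectors, max_size=4):
    """Split clusters in two with K-means until each has <= max_size words.

    Assumes max_size >= 1. A cluster that K-means cannot split is kept as is.
    """
    pending = [sorted(vectors)]
    done = []
    while pending:
        words = pending.pop()
        if len(words) <= max_size:
            done.append(words)
            continue
        labels = kmeans([vectors[w] for w in words], 2)
        parts = [[w for w, lab in zip(words, labels) if lab == c]
                 for c in range(2)]
        parts = [p for p in parts if p]
        if len(parts) < 2:
            done.append(words)
        else:
            pending.extend(parts)
    return done


# Toy tourism-like input: (HTML tag, words found inside that element).
blocks = [
    ("title", ["hotel", "room", "price"]),
    ("p", ["beach", "sea", "sand"]),
    ("p", ["hotel", "room", "booking"]),
    ("h1", ["beach", "sea", "sun"]),
]
for cluster in incremental_clusters(cooccurrence_vectors(blocks)):
    print(cluster)
```

In this sketch the structural context enters only through the tag weights; a fuller treatment would presumably also refine each cluster's context between passes, which the abstract mentions but does not specify.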

https://hal-supelec.archives-ouvertes.fr/hal-00218204
Contributor : Evelyne Faivre
Submitted on : Friday, January 25, 2008 - 3:51:42 PM
Last modification on : Wednesday, June 20, 2018 - 2:32:02 PM

Identifiers

  • HAL Id : hal-00218204, version 1

Citation

Lobna Karoui, Marie-Aude Aufaure, Nacéra Bennacer Seghouani. Contextual Concept Discovery Algorithm. FLAIRS 2007, May 2007, United States. pp.460-465. ⟨hal-00218204⟩
