A Web Page Topic Segmentation Algorithm Based on Visual Criteria and Content Layout

Abstract : This paper presents experiments with a web page topic segmentation algorithm that show a significant improvement in document retrieval. Instead of processing the whole document, the algorithm segments a web page into semantic blocks according to visual criteria (such as horizontal lines and colors) and structural tags (such as headings and paragraphs). Several segmentation solutions were evaluated, and we show that combining visual and content layout criteria gives the best result for increasing precision: the ranking of a page is calculated as the sum of the scores of the relevant segments produced by the segmentation algorithm.
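
The ranking idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the boundary tags, the term-count scoring function, and the names `BlockSegmenter`, `segment`, `segment_score`, and `page_score` are all assumptions made for illustration. Here `<hr>` stands in for a visual criterion, headings and paragraphs for structural ones, and a segment is treated as relevant when its score is non-zero.

```python
from html.parser import HTMLParser

# Assumed boundary tags: <hr> approximates a visual criterion (horizontal line),
# headings and paragraphs approximate structural criteria.
BOUNDARY_TAGS = {"hr", "h1", "h2", "h3", "h4", "h5", "h6", "p"}


class BlockSegmenter(HTMLParser):
    """Split a page into text blocks whenever a boundary tag opens."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._current = []

    def _flush(self):
        # Normalize whitespace and keep only non-empty blocks.
        text = " ".join(" ".join(self._current).split())
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in BOUNDARY_TAGS:
            self._flush()

    def handle_data(self, data):
        self._current.append(data)

    def close(self):
        # Let the parser emit any buffered text before the final flush.
        super().close()
        self._flush()


def segment(html):
    """Return the list of text blocks found in the page."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    return parser.blocks


def segment_score(block, query_terms):
    """Hypothetical relevance score: occurrences of query terms in the block."""
    words = block.lower().split()
    return sum(words.count(term.lower()) for term in query_terms)


def page_score(html, query_terms):
    """Sum the scores of the relevant segments (non-zero score)."""
    scores = [segment_score(block, query_terms) for block in segment(html)]
    return sum(score for score in scores if score > 0)
```

For a page such as `<h1>Jazz festival</h1><p>The jazz concert starts at noon.</p><hr><p>Weather: sunny.</p>` and the query `["jazz"]`, this sketch yields three blocks, two of which contribute to the page score. A real system would replace the term count with a proper retrieval score and would also use rendering cues such as color, which plain tag inspection cannot see.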

https://hal-supelec.archives-ouvertes.fr/hal-00232588
Contributor : Evelyne Faivre
Submitted on : Friday, February 1, 2008 - 11:05:29 AM
Last modification on : Thursday, March 29, 2018 - 11:06:03 AM

Identifiers

  • HAL Id : hal-00232588, version 1

Citation

Idir Chibane, Bich-Liên Doan. A Web Page Topic Segmentation Algorithm Based on Visual Criteria and Content Layout. SIGIR'07, Jul 2007, Amsterdam, Netherlands. pp.817-818. ⟨hal-00232588⟩
