Robust Web content extraction

Kowalkiewicz, M., Orlowska, M. E., Kaczmarek, T., & Abramowicz, W. (2006) Robust Web content extraction. In 15th International Conference on World Wide Web, May 22-26, 2006, Edinburgh, Scotland, UK.



We present an empirical evaluation and comparison of two methods for extracting content from HTML documents: absolute XPath expressions and relative XPath expressions. We argue that relative XPath expressions, although not widely used, are preferable to absolute ones for extracting content from human-created Web documents. The robustness evaluation covers four thousand queries executed on several hundred webpages. We show that, for referencing parts of real-world dynamic HTML documents, relative XPath expressions are on average significantly more robust than absolute ones.
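
The abstract does not define the two expression types, so the following is only an illustrative sketch and not the authors' method or test data. It assumes that an "absolute" expression walks a fixed positional path from the document root, while a "relative" expression is anchored on a nearby stable element (here, a label cell). The two page versions, the path strings, and the extract helper are all hypothetical, and Python's standard-library ElementTree is used with well-formed markup to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, before and after an unrelated layout change.
PAGE_V1 = """<html><body>
<div><h1>Shop</h1></div>
<div><table>
<tr><td>Price:</td><td>42 EUR</td></tr>
</table></div>
</body></html>"""

PAGE_V2 = """<html><body>
<div><p>Special offer!</p></div>
<div><h1>Shop</h1></div>
<div><table>
<tr><td>Stock:</td><td>7</td></tr>
<tr><td>Price:</td><td>42 EUR</td></tr>
</table></div>
</body></html>"""

# Positional path from the root element; stands in for the absolute
# expression /html/body/div[2]/table/tr[1]/td[2].
ABSOLUTE = "./body/div[2]/table/tr[1]/td[2]"

# Anchored on the row whose label cell reads "Price:" (assumed reading
# of a "relative" expression).
RELATIVE = ".//tr[td='Price:']/td[2]"


def extract(page: str, path: str):
    """Return the text of the first node matching path, or None."""
    root = ET.fromstring(page)
    node = root.find(path)
    return None if node is None else node.text


for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
    print(name, "absolute:", extract(page, ABSOLUTE))
    print(name, "relative:", extract(page, RELATIVE))
```

In this toy case the positional path stops matching once a banner and an extra table row are added, while the anchored path still finds the price cell. This is the kind of breakage the robustness comparison is concerned with, though real pages are rarely well-formed and would need an HTML-tolerant parser.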

Impact and interest:

4 citations in Scopus


Full-text downloads:

12 since deposited on 26 Aug 2015
4 in the past twelve months


ID Code: 86019
Item Type: Conference Paper
Refereed: No
Keywords: Content extraction, Evaluation, Robustness, Wrappers, Content based retrieval, Electronic document exchange, HTML, Robust control, Robustness (control systems), Websites, Markup languages, XPath expressions, Web services, World Wide Web, Empirical evaluations, HTML documents, Web content, Web document, Web page
DOI: 10.1145/1135777.1135928
ISBN: 1595933239
Divisions: Current > QUT Faculties and Divisions > Science & Engineering Faculty
Copyright Owner: The authors
Deposited On: 26 Aug 2015 06:22
Last Modified: 03 Sep 2015 05:29
