The influence of pre-processing on the estimation of readability of web documents

Palotti, João, Zuccon, Guido, & Hanbury, Allan (2015) The influence of pre-processing on the estimation of readability of web documents. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, ACM, Melbourne, VIC, pp. 1763-1766.


This paper investigates the effect that text pre-processing approaches have on the estimation of the readability of web pages. Previous work has highlighted readability as an important aspect of web search result personalisation. The most widely used text readability measures rely on surface-level characteristics of text, such as the length of words and sentences. We demonstrate that different tools for extracting text from web pages lead to very different estimations of readability. This has an important implication for search engines, because search result personalisation strategies that consider users' reading ability may fail if incorrect text readability estimations are computed.
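To illustrate why extraction choices matter for surface-level measures, the following minimal sketch scores the same page content under two pre-processing choices. It is not the paper's method: the abstract does not name the measures or extraction tools used, so the Automated Readability Index (ARI) stands in here as one well-known formula based on word and sentence length, and the `blocks` list is a hypothetical example of text fragments pulled from a page (menu items followed by body text).

```python
import re


def automated_readability_index(text: str) -> float:
    """ARI = 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43."""
    # Naive sentence split on terminal punctuation (an assumption of this sketch).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


# Hypothetical fragments extracted from one web page: three menu items, then body text.
blocks = [
    "Home",
    "About",
    "Contact",
    "Asthma is a chronic disease of the airways",
    "It causes wheezing and coughing",
]

# Pre-processing choice A: join fragments with spaces, losing all boundaries.
plain_text = " ".join(blocks)

# Pre-processing choice B: end every fragment with a period.
period_text = ". ".join(blocks) + "."

ari_plain = automated_readability_index(plain_text)
ari_period = automated_readability_index(period_text)

print(f"ARI without forced periods: {ari_plain:.2f}")
print(f"ARI with forced periods:    {ari_period:.2f}")
```

Both strings contain exactly the same words, so the characters-per-word term is identical. Only the sentence count changes (one sentence versus five), which shifts the words-per-sentence term and therefore the score. In this sketch the version without boundaries is rated as harder to read, even though the underlying page is the same.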

ID Code: 91421
Item Type: Conference Paper
Refereed: Yes
Keywords: Readability, Text pre-processing
DOI: 10.1145/2806416.2806613
ISBN: 9781450337946
Divisions: Past > QUT Faculties & Divisions > Faculty of Science and Technology
Current > Schools > School of Information Systems
Copyright Owner: Copyright 2015 ACM
Deposited On: 21 Dec 2015 00:41
Last Modified: 09 Jan 2016 08:07
