Indexing without spam

Zuccon, Guido, Nguyen, Anthony, Leelanupab, Teerapong, & Azzopardi, Leif (2011) Indexing without spam. In Cunningham, Sally Jo, Scholer, Falk, & Thomas, Paul (Eds.) Proceedings of the 16th Australasian Document Computing Symposium, RMIT University, Australian National University, Canberra, pp. 6-13.

The presence of spam in a document ranking is a major issue for Web search engines. Common approaches to coping with spam remove pages that are likely to contain spam from the document rankings. These approaches are implemented as post-retrieval processes that filter out spam pages only after documents have been retrieved in response to a user’s query. In this paper we propose removing spam pages at indexing time instead, thereby obtaining a pruned index that is virtually “spam-free”. We investigate the benefits of this approach from three points of view: indexing time, index size, and retrieval performance. Not surprisingly, we found that the strategy decreases both the time required by the indexing process and the space required for storing the index. More surprisingly, we found that retrieval over a spam-pruned version of a collection’s index shows no difference in retrieval performance compared with traditional post-retrieval spam filtering.
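The contrast between the two strategies described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy documents, the `spam_score` values, the `threshold`, and all function names are hypothetical, and the sketch assumes that a higher score means a page is less likely to be spam. Retrieval here is plain Boolean AND matching, so the two strategies return the same documents by construction; a ranked retrieval model depends on collection statistics that pruning changes, which is why the equivalence reported in the abstract is an empirical finding rather than a given.

```python
from collections import defaultdict


def build_index(docs, spam_score=None, threshold=None):
    """Build a toy inverted index mapping each term to a set of doc ids.

    When spam_score and threshold are both given, documents scoring below
    the threshold are skipped, which is the index-time pruning strategy.
    """
    index = defaultdict(set)
    prune = spam_score is not None and threshold is not None
    for doc_id, text in docs.items():
        if prune and spam_score[doc_id] < threshold:
            continue  # likely spam: never enters the index
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


def post_retrieval_filter(results, spam_score, threshold):
    """Drop likely-spam documents after retrieval (the traditional approach)."""
    return [doc_id for doc_id in results if spam_score[doc_id] >= threshold]


def postings_count(index):
    """Total number of (term, document) postings, a rough proxy for index size."""
    return sum(len(doc_ids) for doc_ids in index.values())


# Hypothetical collection and spam scores (assumed: higher = less spammy).
docs = {
    "d1": "cheap pills buy now",
    "d2": "index pruning for web search",
    "d3": "web search spam filtering",
    "d4": "buy web search traffic now",
}
spam_score = {"d1": 5, "d2": 90, "d3": 80, "d4": 10}
threshold = 70

full_index = build_index(docs)
pruned_index = build_index(docs, spam_score, threshold)

# Strategy 1: retrieve from the full index, then filter out spam.
filtered = post_retrieval_filter(search(full_index, "web search"), spam_score, threshold)

# Strategy 2: retrieve directly from the spam-pruned index.
pruned = search(pruned_index, "web search")

print(filtered, pruned)
print(postings_count(full_index), postings_count(pruned_index))
```

In this sketch the pruned index holds fewer postings than the full one, which mirrors the reduction in index size and indexing work reported in the abstract, while the query returns the same non-spam documents under either strategy.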

ID Code: 69285
Item Type: Conference Paper
Refereed: Yes
Keywords: Information Retrieval, Index Pruning, Spam, Web Search, Efficiency
Divisions: Current > Institutes > Institute for Future Environments
Current > QUT Faculties and Divisions > Science & Engineering Faculty
Copyright Owner: Copyright 2011 Author(s)
Deposited On: 02 Jun 2014 03:49
Last Modified: 24 Jun 2017 14:43