Clustering and labeling a web scale document collection using Wikipedia clusters

Nayak, Richi, Mills, Rachel, De-Vries, Christopher, & Geva, Shlomo (2014) Clustering and labeling a web scale document collection using Wikipedia clusters. In Yi, Zeng, Kotoulas, Spyros, & Huang, Zhisheng (Eds.) Web-KR '14 Proceedings of the 5th International Workshop on Web-scale Knowledge Representation Retrieval & Reasoning, ACM New York, NY, USA, Shanghai, China, pp. 23-30.


Clustering is an important technique for organising and categorising web scale documents. The main challenges in clustering the billions of documents available on the web are the processing power required and the sheer size of the datasets. More importantly, it is nigh impossible to generate labels for a general web document collection containing billions of documents and a vast taxonomy of topics, yet document clusters are most commonly evaluated by comparison to a ground truth set of document labels. This paper presents a clustering and labeling solution in which Wikipedia is clustered and hundreds of millions of web documents in ClueWeb12 are mapped onto those clusters. The solution is based on the assumption that Wikipedia contains such a wide range of diverse topics that it represents a small scale web. We found that it was possible to perform the web scale document clustering and labeling process on one desktop computer in under a couple of days for a Wikipedia clustering solution containing about 1000 clusters. Solutions with finer granularity, such as 10,000 or 50,000 clusters, take longer to execute. These results were evaluated using a set of external data.
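
The two-stage idea in the abstract (cluster a small, topically broad reference collection, then map a much larger collection onto the resulting clusters) can be illustrated with a minimal sketch. This is not the authors' implementation: the term-frequency vectors, cosine similarity, plain k-means loop, fixed seeds, and toy documents below are all assumptions chosen for readability, and every function name is hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a sparse term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(vector, centroids):
    """Index of the centroid most similar to the vector."""
    return max(range(len(centroids)), key=lambda c: cosine(vector, centroids[c]))


def kmeans(vectors, seeds, iterations=10):
    """Plain k-means under cosine similarity; seeds are indices of initial centroids."""
    centroids = [vectors[i] for i in seeds]
    assignment = []
    for _ in range(iterations):
        assignment = [nearest(v, centroids) for v in vectors]
        for c in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                # A summed vector points in the same direction as the mean,
                # which is all that cosine similarity depends on.
                total = Counter()
                for m in members:
                    total.update(m)
                centroids[c] = total
    return centroids, assignment


# Stage 1: cluster the small reference collection (toy stand-in for Wikipedia).
reference_docs = [
    "football match goal team league",
    "python code compiler program software",
    "team player goal football stadium",
    "software program bug code debugger",
]
reference_vectors = [vectorize(d) for d in reference_docs]
centroids, reference_assignment = kmeans(reference_vectors, seeds=[0, 1])

# Stage 2: map the larger collection (toy stand-in for ClueWeb12) onto those clusters.
web_docs = [
    "the football team scored a late goal",
    "new compiler release fixes a software bug",
]
web_assignment = [nearest(vectorize(d), centroids) for d in web_docs]

# A crude label for each cluster: its most frequent terms.
labels = [[term for term, _ in c.most_common(3)] for c in centroids]

print(reference_assignment, web_assignment, labels)
```

In this arrangement only the reference collection goes through the iterative clustering step. Each document in the larger collection needs a single pass against the cluster centroids, which is one plausible reason a mapping of this kind can scale to very large collections.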

Impact and interest:

3 citations in Scopus

ID Code: 80070
Item Type: Conference Paper
Refereed: Yes
Keywords: Document clustering, Big data, Wikipedia, ClueWeb, Document signature
DOI: 10.1145/2663792.2663803
ISBN: 9781450316064
Divisions: Current > QUT Faculties and Divisions > Science & Engineering Faculty
Deposited On: 15 Jan 2015 00:22
Last Modified: 23 Jun 2017 17:03
