Clustering and labeling a web scale document collection using Wikipedia clusters
Nayak, Richi, Mills, Rachel, De-Vries, Christopher, & Geva, Shlomo (2014) Clustering and labeling a web scale document collection using Wikipedia clusters. In Yi, Zeng, Kotoulas, Spyros, & Huang, Zhisheng (Eds.) Web-KR '14 Proceedings of the 5th International Workshop on Web-scale Knowledge Representation Retrieval & Reasoning, ACM New York, NY, USA, Shanghai, China, pp. 23-30.
Clustering is an important technique for organising and categorising web scale documents. The main challenges in clustering the billions of documents available on the web are the processing power required and the sheer size of the datasets. More importantly, it is nigh impossible to generate labels for a general web document collection containing billions of documents and a vast taxonomy of topics, yet document clusters are most commonly evaluated by comparison to a ground truth set of document labels. This paper presents a clustering and labeling solution in which Wikipedia is clustered and hundreds of millions of web documents in ClueWeb12 are mapped onto those clusters. The solution is based on the assumption that Wikipedia contains such a wide range of diverse topics that it represents a small scale web. We found that it was possible to perform the web scale document clustering and labeling process on one desktop computer in under a couple of days for the Wikipedia clustering solution containing about 1000 clusters. Solutions with finer granularity, such as 10,000 or 50,000 clusters, take longer to execute. These results were evaluated using a set of external data.
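The mapping step described in the abstract, in which web documents are assigned to clusters built from a reference collection and inherit their labels, can be illustrated with a toy sketch. This is not the method used in the paper: it assumes plain term-frequency vectors and cosine similarity to a cluster centroid, and the cluster names, example texts, and the helper functions `tf_vector`, `cosine`, `centroid`, and `assign_label` are all hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Lower-case whitespace tokenisation into a term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(texts):
    """Sum the term-frequency vectors of all texts in one cluster."""
    total = Counter()
    for text in texts:
        total.update(tf_vector(text))
    return total


def assign_label(doc, centroids):
    """Return the label of the centroid most similar to the document."""
    vec = tf_vector(doc)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical, hand-made stand-ins for labeled reference clusters.
reference_clusters = {
    "astronomy": ["the planet orbits a star",
                  "a telescope observes a distant galaxy"],
    "cooking": ["bake the bread in a hot oven",
                "simmer the soup with salt and pepper"],
}
centroids = {label: centroid(texts)
             for label, texts in reference_clusters.items()}

# Hypothetical stand-ins for unlabeled web documents.
web_docs = ["new telescope images of a galaxy", "how to bake bread at home"]
labels = [assign_label(doc, centroids) for doc in web_docs]
print(labels)  # ['astronomy', 'cooking']
```

Because each web document is only compared against the cluster centroids, the cost of the mapping step in this sketch grows with the number of clusters, which is consistent with the abstract's remark that finer-grained solutions take longer to execute.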
|Item Type:||Conference Paper|
|Keywords:||Document clustering, Big data, Wikipedia, ClueWeb, Document signature|
|Divisions:||Current > Schools > School of Electrical Engineering & Computer Science; Current > QUT Faculties and Divisions > Science & Engineering Faculty|
|Deposited On:||15 Jan 2015 00:22|
|Last Modified:||15 Jan 2015 23:27|