QUT ePrints

Document clustering evaluation : Divergence from a random baseline

De Vries, Christopher M., Geva, Shlomo, & Trotman, Andrew (2012) Document clustering evaluation : Divergence from a random baseline. In Workshop "Information Retrieval 2012" (IR-2012), 12-14 September, 2012, Technical University of Dortmund, Dortmund, Germany.

Abstract

Divergence from a random baseline is a technique for the evaluation of document clustering. It ensures cluster quality measures are performing work that prevents ineffective clusterings from giving high scores to clusterings that provide no useful result. These concepts are defined and analysed using intrinsic and extrinsic approaches to the evaluation of document cluster quality. This includes the classical clusters to categories approach and a novel approach that uses ad hoc information retrieval. The divergence from a random baseline approach is able to differentiate ineffective clusterings encountered in the INEX XML Mining track. It also appears to perform a normalisation similar to the Normalised Mutual Information (NMI) measure but it can be applied to any measure of cluster quality. When it is applied to the intrinsic measure of distortion as measured by RMSE, subtraction from a random baseline provides a clear optimum that is not apparent otherwise. This approach can be applied to any clustering evaluation. This paper describes its use in the context of document clustering evaluation.

Impact and interest:

Citation countsare sourced monthly from Scopus and Web of Science® citation databases.

These databases contain citations from different subsets of available publications and different time periods and thus the citation count from each is usually different. Some works are not in either database and no count is displayed. Scopus includes citations from articles published in 1996 onwards, and Web of Science® generally from 1980 onwards.

Citations counts from the Google Scholar™ indexing service can be viewed at the linked Google Scholar™ search.

Full-text downloads:

119 since deposited on 29 Aug 2012
52 in the past twelve months

Full-text downloadsdisplays the total number of times this work’s files (e.g., a PDF) have been downloaded from QUT ePrints as well as the number of downloads in the previous 365 days. The count includes downloads for all files if a work has more than one.

ID Code: 53371
Item Type: Conference Paper
Additional URLs:
Keywords: document clustering, evaluation, collection selection, XML mining, clustering
Subjects: Australian and New Zealand Standard Research Classification > INFORMATION AND COMPUTING SCIENCES (080000) > ARTIFICIAL INTELLIGENCE AND IMAGE PROCESSING (080100) > Pattern Recognition and Data Mining (080109)
Australian and New Zealand Standard Research Classification > INFORMATION AND COMPUTING SCIENCES (080000) > LIBRARY AND INFORMATION STUDIES (080700) > Information Retrieval and Web Search (080704)
Divisions: Current > Schools > School of Electrical Engineering & Computer Science
Current > QUT Faculties and Divisions > Science & Engineering Faculty
Copyright Owner: Copyright 2012 Please consult the author.
Deposited On: 29 Aug 2012 11:12
Last Modified: 20 Mar 2013 07:26

Export: EndNote | Dublin Core | BibTeX

Repository Staff Only: item control page