Clustering XML documents using frequent subtrees

Kutty, Sangeetha, Tran, Tien, Nayak, Richi, & Li, Yuefeng (2009) Clustering XML documents using frequent subtrees. In Advances in Focused Retrieval, Springer, Dagstuhl Castle, Germany, pp. 436-445.



This paper presents an experimental study on the INEX 2008 Document Mining Challenge corpus that uses both the structure and the content of XML documents to cluster them. Concise common substructures, known as closed frequent subtrees, are generated from the structural information of the XML documents. These closed frequent subtrees are then used to extract constrained content from the documents. The extracted constrained content is used to build a matrix containing the term distribution of the documents in the dataset. A k-way clustering algorithm is applied to this matrix to obtain the required clusters. Despite the large number of documents in the INEX 2008 Wikipedia dataset, the proposed frequent subtree-based clustering approach clustered the documents successfully. The approach significantly reduces the dimensionality of the terms used for clustering without much loss in accuracy.
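
The steps described in the abstract (mine common structure, use it to constrain which content is kept, then build a term matrix) can be illustrated with a small sketch. This is not the authors' implementation: as a loudly simplified stand-in for closed frequent subtree mining, it treats root-to-node tag paths that occur in at least `min_support` documents as the "frequent structure". All function names, the sample documents, and the `min_support` parameter are hypothetical, and the k-way clustering step is left out.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(root):
    """Yield (path, element) pairs; path is the tuple of tags from the root."""
    stack = [((root.tag,), root)]
    while stack:
        path, elem = stack.pop()
        yield path, elem
        for child in elem:
            stack.append((path + (child.tag,), child))


def frequent_paths(roots, min_support):
    """Return tag paths present in at least min_support documents.

    Assumption: frequent paths are a simplified stand-in for the
    closed frequent subtrees used in the paper.
    """
    support = Counter()
    for root in roots:
        # A set, so each path counts once per document.
        support.update({path for path, _ in tag_paths(root)})
    return {path for path, count in support.items() if count >= min_support}


def constrained_terms(root, keep):
    """Count terms in the direct text of elements whose path is in keep."""
    counts = Counter()
    for path, elem in tag_paths(root):
        if path in keep and elem.text:
            counts.update(elem.text.lower().split())
    return counts


def term_matrix(docs, min_support):
    """Build a document-by-term count matrix from the constrained content."""
    roots = [ET.fromstring(doc) for doc in docs]
    keep = frequent_paths(roots, min_support)
    per_doc = [constrained_terms(root, keep) for root in roots]
    vocab = sorted(set().union(*per_doc))
    matrix = [[counts[term] for term in vocab] for counts in per_doc]
    return vocab, matrix


# Hypothetical toy corpus: <note> appears in only one document.
docs = [
    "<article><title>xml clustering</title><body>frequent subtrees</body></article>",
    "<article><title>tree mining</title><body>closed subtrees</body></article>",
    "<article><title>xml mining</title><note>rare element text</note></article>",
]
vocab, matrix = term_matrix(docs, min_support=2)
```

In this toy corpus the `note` path falls below the support threshold, so its terms never enter the vocabulary. That is the dimensionality reduction the abstract refers to, shown here on a very small scale. The resulting matrix is what would be handed to a clustering algorithm.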

Impact and interest:

6 citations in Scopus
1 citation in Web of Science®


Full-text downloads:

441 since deposited on 14 Jan 2010
6 in the past twelve months


ID Code: 18216
Item Type: Conference Paper
Refereed: Yes
Keywords: clustering, Frequent Mining, Frequent subtrees, INEX, Structural mining, Wikipedia, XML document mining
DOI: 10.1007/978-3-642-03761-0_45
ISBN: 9783642037603
Subjects: Australian and New Zealand Standard Research Classification > INFORMATION AND COMPUTING SCIENCES (080000) > ARTIFICIAL INTELLIGENCE AND IMAGE PROCESSING (080100) > Pattern Recognition and Data Mining (080109)
Divisions: Past > QUT Faculties & Divisions > Faculty of Science and Technology
Copyright Owner: Copyright 2009 Springer
Copyright Statement: Conference proceedings published by Springer Verlag will be available via SpringerLink.
Deposited On: 14 Jan 2010 01:52
Last Modified: 06 Jul 2017 10:01
