Clustering XML documents using frequent subtrees
This paper presents an experimental study on the INEX 2008 Document Mining Challenge corpus that uses both the structure and the content of XML documents to cluster them. Concise common substructures, known as closed frequent subtrees, are generated from the structural information of the XML documents. The closed frequent subtrees are then used to extract constrained content from the documents. A matrix containing the term distribution of the documents in the dataset is built from the extracted constrained content. The k-way clustering algorithm is applied to this matrix to obtain the required clusters. Despite the large number of documents in the INEX 2008 Wikipedia dataset, the proposed frequent subtree-based approach clustered the documents successfully. The approach significantly reduces the dimensionality of the terms used for clustering without much loss in accuracy.
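The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: root-to-node tag paths that occur in at least `min_support` documents stand in for closed frequent subtrees, which is a considerable simplification, and all function names, the `min_support` parameter, and the sample documents are hypothetical. The clustering step itself is omitted; in the described approach, a k-way clustering algorithm would be applied to the rows of the resulting matrix.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(elem, prefix=""):
    """Yield (path, element) for every node; path is the root-to-node tag path."""
    path = f"{prefix}/{elem.tag}"
    yield path, elem
    for child in elem:
        yield from tag_paths(child, path)


def frequent_paths(docs, min_support):
    """Return the tag paths that occur in at least min_support documents.

    Assumption: frequent tag paths approximate the closed frequent subtrees.
    """
    support = Counter()
    for xml_text in docs:
        root = ET.fromstring(xml_text)
        # Use a set so each path counts once per document.
        support.update({path for path, _ in tag_paths(root)})
    return {path for path, count in support.items() if count >= min_support}


def constrained_terms(xml_text, keep_paths):
    """Count terms from text directly inside nodes whose path is frequent."""
    terms = Counter()
    for path, elem in tag_paths(ET.fromstring(xml_text)):
        if path in keep_paths and elem.text:
            terms.update(re.findall(r"[a-z]+", elem.text.lower()))
    return terms


def term_matrix(docs, min_support):
    """Build a document-by-term count matrix from the constrained content."""
    keep = frequent_paths(docs, min_support)
    counts = [constrained_terms(d, keep) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[t] for t in vocab] for c in counts]
    return vocab, matrix


# Hypothetical sample documents; <note> appears in only one of them.
docs = [
    "<article><title>xml mining</title><body>frequent subtrees</body></article>",
    "<article><title>clustering</title><body>xml clustering</body>"
    "<note>draft</note></article>",
    "<article><title>wikipedia</title><body>document clustering</body></article>",
]
vocab, matrix = term_matrix(docs, min_support=2)
```

Because the `note` path appears in only one document, its text never reaches the vocabulary. This is the sense in which restricting content to frequent structure shrinks the term space before clustering.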
|Item Type:||Conference Paper|
|Keywords:||clustering, Frequent Mining, Frequent subtrees, INEX, Structural mining, Wikipedia, XML document mining|
|Subjects:||Australian and New Zealand Standard Research Classification > INFORMATION AND COMPUTING SCIENCES (080000) > ARTIFICIAL INTELLIGENCE AND IMAGE PROCESSING (080100) > Pattern Recognition and Data Mining (080109)|
|Divisions:||Past > QUT Faculties & Divisions > Faculty of Science and Technology; Past > Schools > School of Information Technology|
|Copyright Owner:||Copyright 2009 Springer|
|Copyright Statement:||Conference proceedings published by Springer Verlag will be available via SpringerLink. http://www.springer.de/comp/lncs/|
|Deposited On:||14 Jan 2010 01:52|
|Last Modified:||18 Jul 2014 02:30|