Clustering and classification of maintenance logs using text data mining

Edwards, Brett, Zatorsky, Michael, & Nayak, Richi (2008) Clustering and classification of maintenance logs using text data mining. In Australasian Data Mining Conference 2008, November 2008, Adelaide, Australia.



Spreadsheet applications allow data to be stored with low development overheads, but also with low data quality. Reporting on data from such sources is difficult using traditional techniques. This case study uses text data mining techniques to analyse 12 years of data from dam pump station maintenance logs stored as free text in a spreadsheet application. The goal was to classify each log entry as either a scheduled maintenance job or an unscheduled repair job. The data preparation steps required to transform the data into a format appropriate for text data mining are discussed. The data is then mined by calculating term weights, to which clustering techniques are applied. Clustering identified some groups that contained relatively homogeneous types of jobs. Training a classification model to learn the cluster groups allowed those jobs to be identified in unseen data. However, clustering did not provide a clear overall distinction between scheduled and unscheduled jobs. After some manual analysis to code a target variable for a subset of the data, classification models were trained to predict the target variable from text features. This was achieved with a moderate level of accuracy.
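
The term-weighting and clustering steps described in the abstract can be illustrated with a minimal sketch. The abstract does not name the weighting scheme or the clustering algorithm the authors used, so TF-IDF weights and a small cosine-similarity k-means are assumptions made here for illustration only. The four log entries, the seed indices, and all function names are hypothetical and are not taken from the paper's data.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and keep purely alphabetic tokens (a stand-in for data preparation).
    return [t for t in text.lower().split() if t.isalpha()]

def tfidf(docs):
    # Assumed weighting: term frequency * log(N / document frequency), L2-normalised.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        vec = {t: c * math.log(n / df[t]) for t, c in Counter(toks).items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def centroid(vectors):
    # Sum the member vectors and renormalise to unit length.
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}

def kmeans(vectors, seeds, iterations=10):
    # Deterministic k-means: initial centroids are the documents at the seed indices.
    centroids = [vectors[i] for i in seeds]
    labels = []
    for _ in range(iterations):
        labels = [max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
                  for v in vectors]
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = centroid(members)
    return labels

# Hypothetical free-text log entries, invented for this example.
logs = [
    "monthly service pump oil check",
    "monthly service pump filter check",
    "replaced failed motor bearing",
    "replaced failed pump seal",
]
vectors = tfidf(logs)
labels = kmeans(vectors, seeds=(0, 2))
print(labels)
```

On this toy input the two service entries fall into one cluster and the two repair entries into the other. Real maintenance logs are far noisier, which is consistent with the abstract's finding that clustering alone did not cleanly separate scheduled from unscheduled jobs.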



ID Code: 18423
Item Type: Conference Paper
Refereed: Yes
Additional Information: The contents of this proceeding can be freely accessed online via the organiser's web page (see hypertext link).
Keywords: data mining, clustering, text mining, log analysis
ISBN: 978-1-920682-68-2
Divisions: Past > QUT Faculties & Divisions > Faculty of Science and Technology
Copyright Owner: Copyright 2008 Australian Computer Society, Inc.
Copyright Statement: Reproduced in accordance with the copyright policy of the publisher
Deposited On: 03 Mar 2009 02:11
Last Modified: 29 Feb 2012 13:49
