Relevance feature discovery for text mining

Li, Yuefeng, Algarni, Abdulmohsen, Albthan, Mubarak, Shen, Yan, & Bijaksana, Moch Arif (2014) Relevance feature discovery for text mining. IEEE Transactions on Knowledge and Data Engineering, 27(6), 1656-1669.

Abstract

Guaranteeing the quality of relevance features discovered in text documents to describe user preferences is a major challenge because of the large number of terms and data patterns involved. Most popular text mining and classification methods have adopted term-based approaches. However, they all suffer from the problems of polysemy and synonymy. Over the years, it has often been hypothesized that pattern-based methods should perform better than term-based ones in describing user preferences; yet how to use large-scale patterns effectively remains a hard problem in text mining. To address this issue, this paper presents an innovative model for relevance feature discovery. It discovers both positive and negative patterns in text documents as higher-level features and deploys them over low-level features (terms). It also classifies terms into categories and updates term weights based on their specificity and their distributions in patterns. Substantial experiments using this model on RCV1, TREC topics and Reuters-21578 show that the proposed model significantly outperforms both the state-of-the-art term-based methods and the pattern-based methods.
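
The two steps named in the abstract, deploying patterns over terms and classifying terms by specificity, can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the even split of pattern support across terms, the specificity score (share of positive documents containing a term minus the share of negative documents containing it), the thresholds `hi` and `lo`, and all function names are assumptions made for illustration only.

```python
from collections import defaultdict


def deploy_patterns(patterns):
    """Spread each pattern's support evenly over its terms (assumed scheme).

    patterns: list of (set_of_terms, support) pairs.
    Returns a dict mapping each term to its accumulated weight.
    """
    weights = defaultdict(float)
    for terms, support in patterns:
        for term in terms:
            weights[term] += support / len(terms)
    return dict(weights)


def specificity(term, pos_docs, neg_docs):
    """Assumed score: fraction of positive docs containing the term
    minus the fraction of negative docs containing it."""
    pos = sum(term in doc for doc in pos_docs) / len(pos_docs)
    neg = sum(term in doc for doc in neg_docs) / len(neg_docs) if neg_docs else 0.0
    return pos - neg


def classify_terms(terms, pos_docs, neg_docs, hi=0.5, lo=0.0):
    """Label each term 'positive', 'general' or 'negative'.

    hi and lo are hypothetical thresholds, not values from the paper.
    """
    labels = {}
    for term in terms:
        score = specificity(term, pos_docs, neg_docs)
        if score > hi:
            labels[term] = "positive"
        elif score < lo:
            labels[term] = "negative"
        else:
            labels[term] = "general"
    return labels


# Toy data: documents are sets of terms, patterns are (terms, support) pairs.
pos_docs = [{"text", "mining"}, {"mining", "pattern"}]
neg_docs = [{"text", "sport"}, {"sport"}]
patterns = [({"text", "mining"}, 1.0), ({"mining"}, 0.5)]

weights = deploy_patterns(patterns)
labels = classify_terms({"text", "mining", "pattern", "sport"}, pos_docs, neg_docs)
```

In this toy run, "mining" appears only in positive documents and is labeled positive, "sport" appears only in negative documents and is labeled negative, and "text" appears equally often in both and is labeled general. A fuller model would then revise the deployed weights according to these categories, which the abstract mentions but does not specify.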

ID Code: 80085
Item Type: Journal Article
Refereed: Yes
Keywords: Text mining, Text feature extraction, Text classification
DOI: 10.1109/TKDE.2014.2373357
ISSN: 1041-4347
Divisions: Current > Schools > School of Electrical Engineering & Computer Science
Current > QUT Faculties and Divisions > Science & Engineering Faculty
Copyright Owner: Copyright 2014 by IEEE
Deposited On: 15 Jan 2015 01:19
Last Modified: 20 May 2015 09:07
