Investigation of the quality of topic models for noisy data sources

, , & (2018) Investigation of the quality of topic models for noisy data sources. In Tao, X, Pasi, G, & Weber, R (Eds.) Proceedings of the 2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI). Institute of Electrical and Electronics Engineers Inc., United States of America, pp. 488-493.

Description

Latent Dirichlet Allocation (LDA) has become the most stable and widely used topic model for deriving topics from document collections, with varying levels of success across input domains. It is therefore vital to evaluate LDA against the quality of its input, since noise and uncertainty in the content negatively affect the topic model. The major contribution of this investigation is a critical evaluation of LDA based on the quality of input sources and human perception. The empirical study shows the relationship between the quality of the input and the accuracy of the output generated by LDA. Perplexity and coherence were evaluated on three data sets (RCV1, a conference data set, and tweets) whose contents differ in complexity and uncertainty. Topics generated by LDA were compared with human-defined topics. The results demonstrate a strong relationship between the quality of the input and the generated topics: highly relevant topics were generated from formally written content, while noisy and messy content led to meaningless topics. A considerable gap was observed between human-defined topics and LDA-generated topics. Finally, a concept-based topic modeling technique is proposed to improve topic quality by capturing the meaning of the content and eliminating irrelevant and meaningless topics.
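
For readers unfamiliar with the two evaluation measures named in the description, the following is a minimal sketch of how perplexity and a topic coherence score can be computed. It is illustrative only: the description does not say which coherence variant the paper uses, so the UMass formulation here is an assumption, and the function names `perplexity` and `umass_coherence` and the toy documents are hypothetical rather than taken from the paper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-mean natural-log likelihood per token). Lower is better."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def umass_coherence(top_words, documents):
    """UMass-style coherence (assumed variant, not confirmed by the source).

    Sums log((D(w_i, w_j) + 1) / D(w_j)) over word pairs with j < i, where
    D counts the documents containing the given word(s). Higher is better.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            denom = doc_freq(top_words[j])
            if denom == 0:
                continue  # skip words that never occur in the corpus
            co_occurrences = doc_freq(top_words[i], top_words[j])
            score += math.log((co_occurrences + 1) / denom)
    return score


# Toy corpus (hypothetical): two formal documents and one noisy one.
docs = [
    ["lda", "topic", "model"],
    ["lda", "topic"],
    ["noise", "tweet"],
]
coherent = umass_coherence(["lda", "topic"], docs)
incoherent = umass_coherence(["lda", "tweet"], docs)
```

In this toy corpus, the word pair that co-occurs in documents scores higher than the pair that never appears together, which is the intuition behind using coherence to flag meaningless topics.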

Impact and interest:

1 citation in Scopus
2 citations in Web of Science®
Search Google Scholar™

Citation counts are sourced monthly from Scopus and Web of Science® citation databases.

These databases contain citations from different subsets of available publications and different time periods and thus the citation count from each is usually different. Some works are not in either database and no count is displayed. Scopus includes citations from articles published in 1996 onwards, and Web of Science® generally from 1980 onwards.

Citation counts from the Google Scholar™ indexing service can be viewed at the linked Google Scholar™ search.

Full-text downloads:

325 since deposited on 29 Apr 2019
56 in the past twelve months

Full-text downloads displays the total number of times this work’s files (e.g., a PDF) have been downloaded from QUT ePrints as well as the number of downloads in the previous 365 days. The count includes downloads for all files if a work has more than one.

ID Code: 128777
Item Type: Chapter in Book, Report or Conference volume (Conference contribution)
ORCID iD:
Xu, Yue: orcid.org/0000-0002-1137-0272
Li, Yuefeng: orcid.org/0000-0002-3594-8980
Measurements or Duration: 6 pages
Keywords: LDA, content quality, topic modeling
DOI: 10.1109/WI.2018.00-48
ISBN: 978-1-5386-7325-6
Pure ID: 33314045
Divisions: Past > Institutes > Institute for Future Environments
Past > QUT Faculties & Divisions > Science & Engineering Faculty
Copyright Owner: Consult author(s) regarding copyright matters
Copyright Statement: This work is covered by copyright. Unless the document is being made available under a Creative Commons Licence, you must assume that re-use is limited to personal use and that permission from the copyright owner must be obtained for all other uses. If the document is available under a Creative Commons License (or other specified license) then refer to the Licence for details of permitted re-use. It is a condition of access that users recognise and abide by the legal requirements associated with these rights. If you believe that this work infringes copyright please provide details by email to qut.copyright@qut.edu.au
Deposited On: 29 Apr 2019 23:44
Last Modified: 07 Aug 2024 04:14