Trading spaces : on the lore and limitations of latent semantic analysis
Hoenkamp, Eduard (2011) Trading spaces : on the lore and limitations of latent semantic analysis. In Amati, Giambattista & Crestani, Fabio (Eds.) Advances in Information Retrieval Theory : Third International Conference, Springer-Verlag, University Residential Centre of Bertinoro, Bertinoro, Italy, pp. 40-51.
Two decades after its inception, Latent Semantic Analysis (LSA) has become part and parcel of every modern introduction to Information Retrieval. For any tool that matures so quickly, it is important to check its lore and limitations, or else stagnation will set in. We focus here on the three main aspects of LSA that are well accepted, and the gist of which can be summarized as follows:
(1) that LSA recovers latent semantic factors underlying the document space,
(2) that such can be accomplished through lossy compression of the document space by eliminating lexical noise, and
(3) that the latter can best be achieved by Singular Value Decomposition.
For each aspect we performed experiments analogous to those reported in the LSA literature and compared the evidence brought to bear in each case. On the negative side, we show that the above claims about LSA are much more limited than commonly believed. Even a simple example may show that LSA does not recover the optimal semantic factors as intended in the pedagogical example used in many LSA publications. Additionally, and remarkably deviating from LSA lore, LSA does not scale up well: the larger the document space, the less likely it is that LSA recovers an optimal set of semantic factors. On the positive side, we describe new algorithms to replace LSA (and more recent alternatives such as pLSA, LDA, and kernel methods) by trading its l2 space for an l1 space, thereby guaranteeing an optimal set of semantic factors. These algorithms seem to salvage the spirit of LSA as we think it was initially conceived.
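For readers unfamiliar with the mechanics behind claims (1) to (3), the following is a minimal sketch of conventional LSA: a term-document matrix is replaced by its best rank-k approximation in the l2 (Frobenius) sense, obtained from a truncated SVD. It is not taken from the paper and does not reproduce its experiments or its l1 algorithms. The toy vocabulary, the counts, and all function names are assumptions made for illustration, and the SVD is computed by plain power iteration with deflation only to keep the example dependency-free.

```python
import math

# Toy term-document count matrix (rows: terms, columns: documents).
# Terms and counts are invented for illustration only.
TERMS = ["car", "automobile", "engine", "flower", "petal"]
A = [
    [1.0, 0.0, 1.0, 0.0],  # car
    [0.0, 1.0, 1.0, 0.0],  # automobile
    [1.0, 1.0, 0.0, 0.0],  # engine
    [0.0, 0.0, 0.0, 1.0],  # flower
    [0.0, 0.0, 0.0, 1.0],  # petal
]


def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def frobenius(M):
    """Frobenius (entrywise l2) norm of a matrix."""
    return math.sqrt(sum(x * x for row in M for x in row))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_singular_triplet(M, iters=200):
    """Largest singular value and vectors of M via power iteration on M^T M."""
    Mt = transpose(M)
    v = [1.0] * len(M[0])
    for _ in range(iters):
        w = matvec(Mt, matvec(M, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Mv = matvec(M, v)
    sigma = math.sqrt(sum(x * x for x in Mv))
    u = [x / sigma for x in Mv]
    return sigma, u, v


def truncated_svd(M, k):
    """Rank-k approximation of M by deflation; returns (approximation, residual)."""
    residual = [row[:] for row in M]
    approx = [[0.0] * len(M[0]) for _ in M]
    for _ in range(k):
        sigma, u, v = top_singular_triplet(residual)
        for i, ui in enumerate(u):
            for j, vj in enumerate(v):
                approx[i][j] += sigma * ui * vj
                residual[i][j] -= sigma * ui * vj
    return approx, residual


# Keep k = 2 factors and compare term similarity before and after.
A2, R = truncated_svd(A, k=2)
cos_before = cosine(A[0], A[1])   # "car" vs "automobile" in the raw space
cos_after = cosine(A2[0], A2[1])  # same pair in the rank-2 space
err = frobenius(R)                # l2 reconstruction error of the compression

print("cosine(car, automobile) raw:   ", round(cos_before, 3))
print("cosine(car, automobile) rank-2:", round(cos_after, 3))
print("Frobenius error of rank-2 fit: ", round(err, 3))
```

On this toy matrix the rank-2 approximation pulls "car" and "automobile" together, which is the kind of pedagogical outcome the LSA literature presents. Note that the truncation is optimal only with respect to l2 reconstruction error; whether the retained factors are semantically optimal is precisely what the paper calls into question.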
|Item Type:||Conference Paper|
|Keywords:||information retrieval, latent semantic analysis, dimension reduction, l1-norm, compressive sensing|
|Subjects:||Australian and New Zealand Standard Research Classification > MATHEMATICAL SCIENCES (010000) > APPLIED MATHEMATICS (010200); Australian and New Zealand Standard Research Classification > INFORMATION AND COMPUTING SCIENCES (080000) > INFORMATION SYSTEMS (080600) > Information Engineering and Theory (080607)|
|Divisions:||Past > QUT Faculties & Divisions > Faculty of Science and Technology; Past > Schools > Information Systems|
|Copyright Owner:||Copyright 2011 Springer-Verlag|
|Copyright Statement:||This is the author-version of the work. Conference proceedings published, by Springer Verlag, will be available via SpringerLink. http://www.springerlink.com|
|Deposited On:||25 Nov 2011 08:12|
|Last Modified:||25 Nov 2011 08:12|