QUT ePrints

TOPSIG : Topology Preserving Document Signatures

Geva, Shlomo & De Vries, Christopher M. (2011) TOPSIG : Topology Preserving Document Signatures. In Conference on Information and Knowledge Management 2011, 24-28 October 2011, Glasgow, Scotland. (In Press)

Abstract

Performance comparisons between file signatures and inverted files for text retrieval have previously shown several significant shortcomings of file signatures relative to inverted files. The inverted file approach underpins most state-of-the-art search engine algorithms, such as Language and Probabilistic models. It has been widely accepted that traditional file signatures are inferior alternatives to inverted files. This paper describes TopSig, a new approach to the construction of file signatures. Many advances in semantic hashing and dimensionality reduction have been made in recent times, but these have not so far been linked to general-purpose, signature-file-based search engines. This paper introduces a different signature file approach that builds upon and extends these recent advances. We are able to demonstrate significant improvements in the performance of signature-file-based indexing and retrieval, performance that is comparable to that of state-of-the-art inverted-file-based systems, including Language models and BM25. These findings suggest that file signatures offer a viable alternative to inverted files in suitable settings and position the file signatures model in the class of Vector Space retrieval models.

Impact and interest:

2 citations in Scopus

Full-text downloads:

255 since deposited on 21 Jul 2011
121 in the past twelve months

ID Code: 43451
Item Type: Conference Paper
Keywords: Signature Files, Random Indexing, Topology, Quantisation, Vector Space IR, Search Engines, Document Clustering, Document
Subjects: Australian and New Zealand Standard Research Classification > INFORMATION AND COMPUTING SCIENCES (080000) > ARTIFICIAL INTELLIGENCE AND IMAGE PROCESSING (080100) > Pattern Recognition and Data Mining (080109)
Australian and New Zealand Standard Research Classification > INFORMATION AND COMPUTING SCIENCES (080000) > LIBRARY AND INFORMATION STUDIES (080700) > Information Retrieval and Web Search (080704)
Divisions: Past > Schools > Computer Science
Past > QUT Faculties & Divisions > Faculty of Science and Technology
Copyright Owner: Copyright 2011 Please consult the authors.
Deposited On: 21 Jul 2011 11:01
Last Modified: 05 May 2012 22:00
