Deep Web Collection Selection

King, John Douglas (2004) Deep Web Collection Selection. Masters by Research thesis, Queensland University of Technology.


The deep web contains a massive number of collections that are mostly invisible to search engines. These collections often contain high-quality, structured information that cannot be crawled using traditional methods.

An important problem is selecting which of these collections to search. Automatic collection selection methods try to solve this problem by suggesting the best subset of deep web collections to search based on a query.

A few methods for deep web collection selection have been proposed, such as those used in the Collection Retrieval Inference Network system and the Glossary of Servers Server system.

The drawback of these methods is that they require communication between the search broker and the collections, and they need metadata about each collection.

This thesis compares three different sampling methods that do not require communication with the broker or metadata about each collection. It also adapts some traditional information retrieval techniques to this area. In addition, the thesis tests these techniques on the INEX collection, using a total of 18 collections (comprising 12232 XML documents) and 36 queries.

The experiments show that the performance of the sample-based techniques is satisfactory on average.
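
To make the idea of sample-based collection selection concrete, the following is a minimal Python sketch. It is not the method evaluated in the thesis: it assumes a handful of documents has already been sampled from each collection, and it ranks collections with a simple term-frequency score weighted by inverse collection frequency. The function names (`tokenize`, `rank_collections`), the scoring formula, and the toy data are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately simple tokenizer)."""
    return text.lower().split()


def rank_collections(samples, query):
    """Rank collections for a query using only sampled documents.

    samples: dict mapping a collection name to a list of sampled document strings.
    Returns a list of (name, score) pairs, best first; ties are broken by name.
    """
    # One term-frequency profile per collection, built from its samples alone,
    # so no cooperation or metadata from the collection is needed.
    profiles = {
        name: Counter(term for doc in docs for term in tokenize(doc))
        for name, docs in samples.items()
    }
    n_collections = len(profiles)

    scores = {}
    for name, tf in profiles.items():
        total_terms = sum(tf.values()) or 1  # avoid division by zero for empty samples
        score = 0.0
        for term in tokenize(query):
            # Number of collections whose sample contains the term.
            cf = sum(1 for profile in profiles.values() if term in profile)
            if cf == 0:
                continue  # term unseen in every sample: contributes nothing
            # Assumed weighting: terms found in fewer collections count more.
            icf = math.log(1 + n_collections / cf)
            score += (tf[term] / total_terms) * icf
        scores[name] = score

    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy samples, two documents per collection.
samples = {
    "medicine": ["heart disease treatment", "clinical trial of heart drug"],
    "computing": ["xml retrieval engine", "query processing in xml databases"],
    "sport": ["football match report", "tennis final result"],
}

ranked = rank_collections(samples, "xml query")
print(ranked[0][0])
```

With these toy samples, only the "computing" sample contains the query terms, so it is ranked first and the other two collections score zero. A broker would then forward the query to the top few collections in the ranking rather than to all of them.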

ID Code: 15992
Item Type: QUT Thesis (Masters by Research)
Supervisor: Li, Yuefeng & Geva, Shlomo
Keywords: information retrieval, deep web, collection selection, singular value decomposition, latent semantic analysis, sampling, query focused, probabilistic
Divisions: Past > QUT Faculties & Divisions > Faculty of Science and Technology
Past > Schools > School of Software Engineering & Data Communications
Department: Faculty of Information Technology
Institution: Queensland University of Technology
Copyright Owner: Copyright John Douglas King
Deposited On: 03 Dec 2008 03:54
Last Modified: 17 Oct 2013 22:56
