Analysis of large data logs: an application of Poisson sampling on excite web queries

Ozmutlu, H. Cenk, Ozmutlu, Seda, & Spink, Amanda H. (2002) Analysis of large data logs: an application of Poisson sampling on excite web queries. Information Processing and Management, 38(4), pp. 473-490.


Search engines are the gateway for users to retrieve information from the Web. There is a crucial need for tools that allow effective analysis of search engine queries to provide a greater understanding of Web users' information-seeking behavior. The objective of the study is to develop an effective strategy for selecting samples from large-scale data sets. Millions of queries are submitted to Web search engines daily, and new sampling techniques are required to reduce these databases to a manageable size while preserving the statistically representative characteristics of the entire data set. This paper reports results from a study using data logs from the Excite Web search engine. We use Poisson sampling to develop a sampling strategy, and show that sample sets selected by Poisson sampling effectively represent the statistical characteristics of the entire data set. In addition, this paper discusses the use of Poisson sampling in continuous monitoring of stochastic processes, such as Web site dynamics.
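
As a rough illustration of the idea in the abstract, the sketch below draws a sample from a query log by spacing the selected records with exponentially distributed gaps, so that the selection points form a Poisson process along the log. This is one common reading of "Poisson sampling" and is an assumption here; the abstract does not spell out the exact procedure. The function name `poisson_sample_indices` and the parameters `n_records` and `mean_gap` are hypothetical and do not come from the paper.

```python
import random


def poisson_sample_indices(n_records, mean_gap, seed=None):
    """Return increasing record indices with exponentially distributed gaps.

    Assumption: records are sampled at the points of a Poisson process
    laid over the record positions 0 .. n_records - 1, with an average
    spacing of `mean_gap` records between consecutive samples.
    """
    rng = random.Random(seed)
    rate = 1.0 / mean_gap  # expected samples per record
    indices = []
    position = rng.expovariate(rate)
    while position < n_records:
        index = int(position)
        # Two very close points can fall on the same record; keep it once.
        if not indices or index != indices[-1]:
            indices.append(index)
        position += rng.expovariate(rate)
    return indices


if __name__ == "__main__":
    # Hypothetical log of 1,000,000 queries, sampling about 1 in 200.
    picked = poisson_sample_indices(1_000_000, mean_gap=200, seed=42)
    print(len(picked), picked[:5])
```

The resulting sample size is itself random, with an expected value of roughly `n_records / mean_gap`. Whether a sample drawn this way is representative would still need to be checked against the full log, for example by comparing the distributions of query length or terms per query.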

Impact and interest:

33 citations in Scopus
27 citations in Web of Science®

Full-text downloads:

447 since deposited on 28 Nov 2006
7 in the past twelve months


ID Code: 5683
Item Type: Journal Article
Refereed: Yes
Keywords: Poisson sampling, Large-scale in-depth data analysis, Web user modeling, Search engine queries, Data mining
DOI: 10.1016/S0306-4573(01)00043-7
ISSN: 0306-4573
Subjects: Australian and New Zealand Standard Research Classification > INFORMATION AND COMPUTING SCIENCES (080000) > LIBRARY AND INFORMATION STUDIES (080700) > Information Retrieval and Web Search (080704)
Divisions: Past > QUT Faculties & Divisions > Faculty of Science and Technology
Copyright Owner: Copyright 2002 Elsevier
Copyright Statement: Reproduced in accordance with the copyright policy of the publisher.
Deposited On: 28 Nov 2006 00:00
Last Modified: 10 Aug 2011 16:41
