Parallel streaming signature EM-tree: A clustering algorithm for web scale applications

De Vries, Christopher, , , & (2015) Parallel streaming signature EM-tree: A clustering algorithm for web scale applications. In Zhai, C & Gummadi, K (Eds.) Proceedings of the 24th International World Wide Web Conference (WWW 2015). Association for Computing Machinery (ACM), United States of America, pp. 216-226.

Description

The proliferation of the web presents an unsolved problem of automatically analyzing billions of pages of natural language. We introduce a scalable algorithm that clusters hundreds of millions of web pages into hundreds of thousands of clusters. It does this on a single mid-range machine using efficient algorithms and compressed document representations. It is applied to two web-scale crawls covering tens of terabytes: ClueWeb09 and ClueWeb12, which contain 500 million and 733 million web pages respectively and were clustered into 500,000 to 700,000 clusters. To the best of our knowledge, such fine-grained clustering has not been previously demonstrated. Previous approaches clustered a sample, which limits the maximum number of discoverable clusters. The proposed EM-tree algorithm uses the entire collection in clustering and produces several orders of magnitude more clusters than existing algorithms. Fine-grained clustering is necessary for meaningful clustering in massive collections where the number of distinct topics grows linearly with collection size. These fine-grained clusters show improved cluster quality when assessed with two novel evaluations that use ad hoc search relevance judgments and spam classifications for external validation. These evaluations solve the problem of assessing the quality of clusters where categorical labeling is unavailable and infeasible.

Impact and interest:

9 citations in Scopus
6 citations in Web of Science®

Full-text downloads:

381 since deposited on 21 May 2015
22 in the past twelve months

ID Code: 84386
Item Type: Chapter in Book, Report or Conference volume (Conference contribution)
ORCID iD:
Geva, Shlomo: orcid.org/0000-0003-1340-2802
Nayak, Richi: orcid.org/0000-0002-9954-0159
Measurements or Duration: 11 pages
DOI: 10.1145/2736277.2741111
ISBN: 978-1-4503-3469-3
Pure ID: 32792626
Divisions: Past > Institutes > Institute for Future Environments
Past > QUT Faculties & Divisions > Science & Engineering Faculty
Copyright Owner: Copyright 2015 International World Wide Web Conference
Copyright Statement: This work is covered by copyright. Unless the document is being made available under a Creative Commons Licence, you must assume that re-use is limited to personal use and that permission from the copyright owner must be obtained for all other uses. If the document is available under a Creative Commons License (or other specified license) then refer to the Licence for details of permitted re-use. It is a condition of access that users recognise and abide by the legal requirements associated with these rights. If you believe that this work infringes copyright please provide details by email to qut.copyright@qut.edu.au
Deposited On: 21 May 2015 03:27
Last Modified: 09 Feb 2025 14:58