A principled experimental design approach to Big Data analysis

Drovandi, Christopher C., Holmes, Christopher, McGree, James, Mengersen, Kerrie, Richardson, Sylvia, & Ryan, Elizabeth (2015) A principled experimental design approach to Big Data analysis. [Working Paper] (Unpublished)

Abstract

Big Datasets are endemic, but they are often notoriously difficult to analyse because of their size, heterogeneity, history and quality. The purpose of this paper is to open a discourse on the use of modern experimental design methods to analyse Big Data in order to answer particular questions of interest. By appealing to a range of examples, it is suggested that this perspective on Big Data modelling and analysis has wide generality and advantageous inferential and computational properties. In particular, the principled experimental design approach is shown to provide a flexible framework for analysis that, for certain classes of objectives and utility functions, delivers near equivalent answers compared with analyses of the full dataset under a controlled error rate. It can also provide a formalised method for iterative parameter estimation, model checking, identification of data gaps and evaluation of data quality. Finally, it has the potential to add value to other Big Data sampling algorithms, in particular divide-and-conquer strategies, by determining efficient sub-samples.
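
As a rough illustration of the sub-sampling idea in the abstract, the sketch below picks a small designed subset of a large dataset and compares it with a random subset of the same size. It is not the authors' method: the model (a straight line y = b0 + b1*x), the D-optimality utility det(X'X), the greedy search, and the names design_det and greedy_d_optimal are all assumptions chosen for this example.

```python
import random


def design_det(xs, idx):
    """det(X'X) for the model y = b0 + b1*x, using only the rows in idx."""
    n = len(idx)
    sx = sum(xs[i] for i in idx)
    sxx = sum(xs[i] * xs[i] for i in idx)
    return n * sxx - sx * sx


def greedy_d_optimal(xs, k):
    """Greedily choose k distinct indices that maximise det(X'X).

    Hypothetical helper: a simple greedy search, not a guaranteed optimum.
    """
    remaining = list(range(len(xs)))
    chosen = []
    n, sx, sxx = 0, 0.0, 0.0  # running sums over the chosen points

    for _ in range(k):
        def det_after(i):
            # Determinant the design would have if point i were added.
            s = sx + xs[i]
            return (n + 1) * (sxx + xs[i] * xs[i]) - s * s

        best = max(remaining, key=det_after)
        chosen.append(best)
        remaining.remove(best)
        n, sx, sxx = n + 1, sx + xs[best], sxx + xs[best] * xs[best]
    return chosen


# Simulated "big" covariate column (an assumption; no real data is used).
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(5000)]
k = 20

designed = greedy_d_optimal(xs, k)
at_random = rng.sample(range(len(xs)), k)

print("designed subset det(X'X):", design_det(xs, designed))
print("random subset det(X'X):  ", design_det(xs, at_random))
```

Under this model the variance of the slope estimate is proportional to n / det(X'X), so a subset with a larger determinant should estimate the slope more precisely from the same number of rows. The greedy search tends to select points near the extremes of x, which is the classical D-optimal pattern for a straight-line model; other utilities or models would lead to different subsets.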

ID Code: 87946
Item Type: Working Paper
Refereed: No
Keywords: Big Data, Sub-sampling, Experimental design, Active learning, Dimension reduction, Subset
Subjects: Australian and New Zealand Standard Research Classification > MATHEMATICAL SCIENCES (010000) > STATISTICS (010400)
Divisions: Current > QUT Faculties and Divisions > Science & Engineering Faculty
Copyright Owner: Copyright 2015 The Author(s)
Deposited On: 01 Oct 2015 01:55
Last Modified: 21 Oct 2015 17:54