Using decision trees to understand structure in missing data

Tierney, Nicholas J., Harden, Fiona A., Harden, Maurice J., & Mengersen, Kerrie L. (2015) Using decision trees to understand structure in missing data. BMJ Open, 5(6), e007450.


Abstract

Objectives

Demonstrate the application of decision trees – classification and regression trees (CARTs), and their cousins, boosted regression trees (BRTs) – to understand structure in missing data.

Setting

Data taken from employees at three different industry sites in Australia.

Participants

7915 observations were included.

Materials and Methods

The approach was evaluated using an occupational health dataset comprising results of questionnaires, medical tests, and environmental monitoring. Statistical methods included standard statistical tests and the ‘rpart’ and ‘gbm’ packages for CART and BRT analyses, respectively, from the statistical software ‘R’. A simulation study was conducted to explore the capability of decision tree models in describing data with missingness artificially introduced.
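
As a rough illustration of the idea behind the simulation study, the sketch below introduces missingness artificially and then searches for the single CART-style split that best separates missing from observed records. It is not the authors' code: the study used the ‘rpart’ and ‘gbm’ packages in R, whereas this is a plain-Python stand-in for one tree split, and the variable names (exposure, age), the sample size, and the cut-off of 70 are all hypothetical.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def best_split(rows, is_missing):
    """Find the variable and threshold whose split most reduces the Gini
    impurity of the missingness indicator (a single CART-style split)."""
    n = len(rows)
    parent = gini(is_missing)
    best = (None, None, 0.0)  # (variable, threshold, impurity reduction)
    for var in rows[0]:
        values = sorted(set(r[var] for r in rows))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [m for r, m in zip(rows, is_missing) if r[var] <= thr]
            right = [m for r, m in zip(rows, is_missing) if r[var] > thr]
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            gain = parent - child
            if gain > best[2]:
                best = (var, thr, gain)
    return best

# Simulated predictors (hypothetical names and ranges).
rng = random.Random(1)
rows = [{"exposure": rng.uniform(0, 100), "age": rng.uniform(20, 65)}
        for _ in range(300)]

# Structured missingness, introduced artificially (an assumption made for
# this sketch): the outcome goes unrecorded whenever exposure exceeds 70.
is_missing = [1 if r["exposure"] > 70 else 0 for r in rows]

var, thr, gain = best_split(rows, is_missing)
print(var, round(thr, 2), round(gain, 3))
```

Because the missingness here depends on a single variable, the search should pick out that variable with a threshold close to the cut-off used to generate it. If the indicator were assigned at random instead, no split would be expected to reduce impurity by much.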

Results

CART and BRT models were effective in highlighting a missingness structure in the data, related to the type of data (medical or environmental), the site at which it was collected, the number of visits, and the presence of extreme values. The simulation study revealed that CART models were able to identify variables and values responsible for inducing missingness. There was greater variation in variable importance for unstructured than for structured missingness.

Discussion

Both CART and BRT models were effective in describing structural missingness in data. CART models may be preferred over BRT models for exploratory analysis of missing data and for selecting variables important for predicting missingness. BRT models can show how values of other variables influence missingness, which may prove useful for researchers.

Conclusion

Researchers are encouraged to use CART and BRT models to explore and understand missing data.


ID Code: 85099
Item Type: Journal Article
Refereed: Yes
Keywords: Epidemiology, Health Services Research, Occupational and Environmental Medicine, Public Health, Research Methods
DOI: 10.1136/bmjopen-2014-007450
ISSN: 2044-6055
Divisions: Current > Research Centres > ARC Centre of Excellence for Mathematical & Statistical Frontiers (ACEMS)
Current > Schools > School of Clinical Sciences
Current > QUT Faculties and Divisions > Faculty of Health
Current > Institutes > Institute for Future Environments
Current > Institutes > Institute of Health and Biomedical Innovation
Current > Schools > School of Mathematical Sciences
Current > QUT Faculties and Divisions > Science & Engineering Faculty
Copyright Owner: Copyright 2015 Tierney NJ, et al
Copyright Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/
Deposited On: 02 Jul 2015 22:55
Last Modified: 05 Aug 2015 10:22
