Application of data mining techniques in the prediction of coronary artery disease : use of anaesthesia timeseries and patient risk factor data
Pitt, Ellen Alexandra (2009) Application of data mining techniques in the prediction of coronary artery disease : use of anaesthesia timeseries and patient risk factor data. Masters by Research thesis, Queensland University of Technology.

Abstract
The high morbidity and mortality associated with atherosclerotic coronary vascular disease (CVD) and its complications are being lessened by the increased knowledge of risk factors, effective preventative measures and proven therapeutic interventions. However, significant CVD morbidity remains and sudden cardiac death continues to be a presenting feature for some subsequently diagnosed with CVD. Coronary vascular disease is also the leading cause of anaesthesia related complications. Stress electrocardiography/exercise testing is predictive of 10 year risk of CVD events and the cardiovascular variables used to score this test are monitored perioperatively. Similar physiological timeseries datasets are being subjected to data mining methods for the prediction of medical diagnoses and outcomes. This study aims to find predictors of CVD using anaesthesia timeseries data and patient risk factor data. Several preprocessing and predictive data mining methods are applied to this data. Physiological timeseries data related to anaesthetic procedures are subjected to preprocessing methods for removal of outliers, calculation of moving averages as well as data summarisation and data abstraction methods. Feature selection methods of both wrapper and filter types are applied to derived physiological timeseries variable sets alone and to the same variables combined with risk factor variables. The ability of these methods to identify subsets of highly correlated but nonredundant variables is assessed. The major dataset is derived from the entire anaesthesia population and subsets of this population are considered to be at increased anaesthesia risk based on their need for more intensive monitoring (invasive haemodynamic monitoring and additional ECG leads). 
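As a rough illustration of the outlier removal and moving average steps mentioned above, the following minimal sketch filters a short series and then smooths it. The abstract does not state how outliers were defined or which window length was used, so the plausible-range rule, the bounds, the window of 3 and the heart rate readings are all assumptions made up for this example.

```python
from statistics import mean


def remove_outliers(values, low, high):
    """Drop readings outside an assumed plausible range [low, high]."""
    return [v for v in values if low <= v <= high]


def moving_average(values, window):
    """Simple moving average; yields len(values) - window + 1 points."""
    if window < 1 or window > len(values):
        return []
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


# Made-up heart rate readings (beats per minute) with two obvious artefacts.
heart_rate = [72, 74, 0, 75, 73, 250, 76, 78]

# Assumed bounds; the thesis may have used a different outlier definition.
cleaned = remove_outliers(heart_rate, low=30, high=220)
smoothed = moving_average(cleaned, window=3)
```

Summary variables such as means or counts of removed points could then be derived from `cleaned` and `smoothed`, in the spirit of the data summarisation step described in the abstract.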
Because of the unbalanced class distribution in the data, majority class undersampling and the Kappa statistic, together with misclassification rate and area under the ROC curve (AUC), are used for evaluation of models generated using different prediction algorithms. The performance of models derived from feature reduced datasets reveals the filter method, Cfs subset evaluation, to be most consistently effective, although Consistency derived subsets tended to give slightly increased accuracy but markedly increased complexity. The use of misclassification rate (MR) for model performance evaluation is influenced by class distribution. This influence could be eliminated by consideration of the AUC or Kappa statistic as well as by evaluation of subsets with an undersampled majority class. The noise and outlier removal preprocessing methods produced models with MR ranging from 10.69 to 12.62, with the lowest value being for data from which both outliers and noise were removed (MR 10.69). For the raw timeseries dataset, MR is 12.34. Feature selection results in a reduction in MR to between 9.8 and 10.16, with time segmented summary data (dataset F) MR being 9.8 and raw timeseries summary data (dataset A) being 9.92. However, for all datasets based on timeseries alone, the complexity is high. For most preprocessing methods, Cfs could identify a subset of correlated and nonredundant variables from the timeseries alone datasets, but models derived from these subsets are of one leaf only. MR values are consistent with class distribution in the subset folds evaluated in the n-fold cross validation method. For models based on Cfs selected timeseries derived and risk factor (RF) variables, the MR ranges from 8.83 to 10.36, with dataset RF_A (raw timeseries data and RF) being 8.85 and dataset RF_F (time segmented timeseries variables and RF) being 9.09.
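The dependence of MR on class distribution can be seen in a small worked example. The sketch below computes MR and Cohen's Kappa from a binary confusion matrix. The counts are invented for illustration (a 90/10 class split) and are not taken from the thesis. A one-leaf model that always predicts the majority class reaches an MR of 0.10 purely because of the class split, while its Kappa is zero.

```python
def misclassification_rate(tp, fp, fn, tn):
    """Fraction of all cases that were classified incorrectly."""
    return (fp + fn) / (tp + fp + fn + tn)


def cohen_kappa(tp, fp, fn, tn):
    """Cohen's Kappa: agreement beyond what the class marginals give by chance."""
    total = tp + fp + fn + tn
    observed = (tp + tn) / total
    # Chance agreement from predicted and actual marginals for each class.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    if expected == 1:
        return 0.0
    return (observed - expected) / (1 - expected)


# Hypothetical counts: 10 diseased and 90 non-diseased cases.
one_leaf = dict(tp=0, fp=0, fn=10, tn=90)    # always predicts "no disease"
informative = dict(tp=5, fp=5, fn=5, tn=85)  # same MR, but finds some cases

mr_one_leaf = misclassification_rate(**one_leaf)
kappa_one_leaf = cohen_kappa(**one_leaf)
mr_informative = misclassification_rate(**informative)
kappa_informative = cohen_kappa(**informative)
```

Both hypothetical models have the same MR, yet only the second has a Kappa above zero, which is why a chance-corrected measure is a useful complement to MR on unbalanced data.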
The models based on counts of outliers and counts of data points outside normal range (Dataset RF_E) and derived variables based on time series transformed using Symbolic Aggregate Approximation (SAX) with associated timeseries pattern cluster membership (Dataset RF_G) perform the least well, with MR of 10.25 and 10.36 respectively. For coronary vascular disease prediction, nearest neighbour (NNge) and the support vector machine based method, SMO, have the highest MR of 10.1 and 10.28, while logistic regression (LR) and the decision tree (DT) method, J48, have MR of 8.85 and 9.0 respectively. DT rules are most comprehensible and clinically relevant. The predictive accuracy increase achieved by addition of risk factor variables to timeseries variable based models is significant. The addition of timeseries derived variables to models based on risk factor variables alone is associated with a trend to improved performance. Data mining of feature reduced anaesthesia timeseries variables together with risk factor variables can produce compact and moderately accurate models able to predict coronary vascular disease. Decision tree analysis of timeseries data combined with risk factor variables yields rules which are more accurate than models based on timeseries data alone. The limited additional value provided by electrocardiographic variables when compared to use of risk factors alone is similar to recent suggestions that exercise electrocardiography (exECG) under standardised conditions has limited additional diagnostic value over risk factor analysis and symptom pattern. The preprocessing used in this study had limited effect when timeseries variables and risk factor variables are used as model input. In the absence of risk factor input, the use of timeseries variables after outlier removal and of timeseries variables based on physiological variable values being outside the accepted normal range is associated with some improvement in model performance.
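For readers unfamiliar with Symbolic Aggregate Approximation, the following is a minimal sketch of the general technique: z-normalise the series, average it over equal-length segments (piecewise aggregate approximation), then map each segment mean to a letter using breakpoints that divide the standard normal distribution into equiprobable regions. The segment count, the four-letter alphabet and the sample values are assumptions chosen for the example; the abstract does not give the parameters used in the thesis.

```python
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet="abcd"):
    """Return the SAX word for a series whose length divides evenly into segments."""
    if len(series) % n_segments != 0:
        raise ValueError("series length must be a multiple of n_segments")

    # Step 1: z-normalise (a constant series maps to all zeros).
    mu, sigma = mean(series), pstdev(series)
    z = [0.0 if sigma == 0 else (x - mu) / sigma for x in series]

    # Step 2: piecewise aggregate approximation, one mean per segment.
    seg_len = len(z) // n_segments
    paa = [mean(z[i * seg_len:(i + 1) * seg_len]) for i in range(n_segments)]

    # Step 3: breakpoints splitting N(0, 1) into len(alphabet) equiprobable bins.
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(j / k) for j in range(1, k)]

    # Step 4: each segment mean becomes the letter of the bin it falls into.
    return "".join(alphabet[sum(v > b for b in breakpoints)] for v in paa)


# Made-up series that steps from low values to high values.
word = sax([1, 1, 2, 2, 8, 8, 9, 9], n_segments=4)
```

Words produced this way can be compared or clustered as strings, which is one route to the pattern cluster membership variables mentioned for Dataset RF_G.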
ID Code:  34427 

Item Type:  QUT Thesis (Masters by Research) 
Supervisor:  Nayak, Richi, Tickle, Alan, & Cumpston, Philip 
Keywords:  anaesthesia, physiological data, timeseries, clustering, feature selection, predictors of outcome, anaesthesia complications, cardiac risk factors, data mining 
Divisions:  Past > QUT Faculties & Divisions > Faculty of Science and Technology 
Institution:  Queensland University of Technology 
Deposited On:  09 Sep 2010 05:35 
Last Modified:  28 Oct 2011 19:57 