A comparison of methods for classifying clinical samples based on proteomics data : a case study for statistical and machine learning approaches

Sampson, Dayle, Parker, Anthony, Upton, Zee, & Hurst, Cameron (2011) A comparison of methods for classifying clinical samples based on proteomics data : a case study for statistical and machine learning approaches. PLoS ONE, 6(9), pp. 1-11.

View at publisher (open access)


The discovery of protein variation is an important strategy in disease diagnosis within the biological sciences. The current benchmark for elucidating information from multiple biological variables is the so called “omics” disciplines of the biological sciences. Such variability is uncovered by implementation of multivariable data mining techniques which come under two primary categories, machine learning strategies and statistical based approaches. Typically proteomic studies can produce hundreds or thousands of variables, p, per observation, n, depending on the analytical platform or method employed to generate the data. Many classification methods are limited by an n≪p constraint, and as such, require pre-treatment to reduce the dimensionality prior to classification. Recently machine learning techniques have gained popularity in the field for their ability to successfully classify unknown samples. One limitation of such methods is the lack of a functional model allowing meaningful interpretation of results in terms of the features used for classification. This is a problem that might be solved using a statistical model-based approach where not only is the importance of the individual protein explicit, they are combined into a readily interpretable classification rule without relying on a black box approach. Here we incorporate statistical dimension reduction techniques Partial Least Squares (PLS) and Principal Components Analysis (PCA) followed by both statistical and machine learning classification methods, and compared them to a popular machine learning technique, Support Vector Machines (SVM). Both PLS and SVM demonstrate strong utility for proteomic classification problems.

Impact and interest:

18 citations in Scopus
12 citations in Web of Science®
Search Google Scholar™

Citation counts are sourced monthly from Scopus and Web of Science® citation databases.

These databases contain citations from different subsets of available publications and different time periods and thus the citation count from each is usually different. Some works are not in either database and no count is displayed. Scopus includes citations from articles published in 1996 onwards, and Web of Science® generally from 1980 onwards.

Citations counts from the Google Scholar™ indexing service can be viewed at the linked Google Scholar™ search.

Full-text downloads:

74 since deposited on 19 Jul 2012
14 in the past twelve months

Full-text downloads displays the total number of times this work’s files (e.g., a PDF) have been downloaded from QUT ePrints as well as the number of downloads in the previous 365 days. The count includes downloads for all files if a work has more than one.

ID Code: 52508
Item Type: Journal Article
Refereed: Yes
DOI: 10.1371/journal.pone.0024973
ISSN: 1932-6203
Divisions: Current > QUT Faculties and Divisions > Faculty of Health
Current > Institutes > Institute of Health and Biomedical Innovation
Copyright Owner: Copyright 2011 the authors
Copyright Statement: This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Deposited On: 19 Jul 2012 06:31
Last Modified: 30 Oct 2013 06:04

Export: EndNote | Dublin Core | BibTeX

Repository Staff Only: item control page