The impact of OCR accuracy on automated cancer classification of pathology reports

Zuccon, Guido, Nguyen, Anthony, Bergheim, Anton, Wickman, Sandra, & Grayson, Narelle (2012) The impact of OCR accuracy on automated cancer classification of pathology reports. In Studies in Health Technology and Informatics : Health Informatics: Building a Healthcare Future Through Trusted Information, IOS Press, Sydney, NSW, pp. 250-256.


This study evaluates the effects of Optical Character Recognition (OCR) on the automatic cancer classification of pathology reports.


Scanned images of pathology reports were converted to electronic free text using a commercial OCR system. A state-of-the-art cancer classification system, the Medical Text Extraction (MEDTEX) system, was then used to automatically classify the OCR reports. The classifications MEDTEX produced on the OCR versions of the reports were compared with the classifications obtained from human-amended versions of the OCR reports.


The OCR system recognised scanned pathology reports with up to 99.12% character accuracy and up to 98.95% word accuracy. OCR errors had minimal impact on the automatic classification of scanned pathology reports into notifiable groups. However, their impact is not negligible for the extraction of cancer notification items, such as primary site and histological type.
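
The abstract does not state how character and word accuracy were computed. As a rough illustration only, the sketch below assumes a common definition: one minus the edit distance between the OCR output and a reference transcription, divided by the reference length, measured over characters or over whitespace-separated words. The function names are hypothetical and are not taken from the paper or from MEDTEX.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete an element of ref
                            curr[j - 1] + 1,      # insert an element of hyp
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def accuracy(ref, hyp):
    """Assumed definition: 1 - edit_distance / len(ref), floored at 0."""
    if len(ref) == 0:
        return 1.0 if len(hyp) == 0 else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def character_accuracy(reference, ocr_output):
    """Accuracy measured over individual characters."""
    return accuracy(reference, ocr_output)


def word_accuracy(reference, ocr_output):
    """Accuracy measured over whitespace-separated words."""
    return accuracy(reference.split(), ocr_output.split())
```

Under this assumed definition, a single misread character in "invasive ductal carcinoma" (for example "carc1noma") gives a character accuracy of 0.96 but a word accuracy of about 0.67, which shows why word accuracy is typically the lower of the two figures.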


MEDTEX, the automatic cancer classification system used in this work, proved robust to the errors introduced when free-text pathology reports are acquired from scanned images through OCR software. However, issues emerge when extracting cancer notification items.

Impact and interest:

3 citations in Scopus

ID Code: 69291
Item Type: Conference Paper
Refereed: Yes
DOI: 10.3233/978-1-61499-078-9-250
ISBN: 9781614990772
Divisions: Current > Schools > School of Information Systems
Current > QUT Faculties and Divisions > Science & Engineering Faculty
Copyright Owner: Copyright 2012 IOS Press
Deposited On: 17 Jun 2014 22:59
Last Modified: 21 Jun 2014 16:34
