Automatic de-identification of electronic health records : an Australian perspective

Zuccon, G., Strachan, M., Nguyen, A., Bergheim, A., & Grayson, N. (2013) Automatic de-identification of electronic health records : an Australian perspective. In NICTA - Louhi 2013, 11-12 February 2013, Sydney, NSW.


We present an approach to automatically de-identify health records. In our approach, personal health information is identified using a Conditional Random Fields machine learning classifier, a large set of linguistic and lexical features, and pattern matching techniques. Identified personal information is then removed from the reports. The de-identification of personal health information is fundamental for the sharing and secondary use of electronic health records, for example for data mining and disease monitoring. The effectiveness of our approach is first evaluated on the 2007 i2b2 Shared Task dataset, a widely adopted dataset for evaluating de-identification techniques. Subsequently, we investigate the robustness of the approach to limited training data, and we study its effectiveness on data of a different type and quality by evaluating the approach on scanned pathology reports from an Australian institution. This data contains optical character recognition errors, as well as linguistic conventions that differ from those in the i2b2 dataset, for example different date formats. The findings suggest that our approach is comparable to the best approach from the 2007 i2b2 Shared Task; in addition, the approach is found to be robust to variations in training size, data type and quality in the presence of sufficient training data.
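
As an illustration of the pattern matching component mentioned in the abstract, the following is a minimal sketch, assuming a single regular expression for numeric dates (which also covers day-first formats such as 07/05/2014, common in Australian reports). The pattern, the function name `redact_dates`, and the `[DATE]` placeholder are hypothetical and are not taken from the paper; the Conditional Random Fields classifier and its features are not shown.

```python
import re

# Hypothetical pattern, not the rule set used in the paper:
# one or two digits, a separator, one or two digits, a separator,
# then a two- or four-digit year (e.g. 07/05/2014, 8-5-14, 7.5.2014).
DATE_PATTERN = re.compile(r"\b\d{1,2}[/.-]\d{1,2}[/.-](?:\d{4}|\d{2})\b")


def redact_dates(text: str, placeholder: str = "[DATE]") -> str:
    """Replace every numeric date matched by DATE_PATTERN with a placeholder."""
    return DATE_PATTERN.sub(placeholder, text)


if __name__ == "__main__":
    report = "Specimen received 07/05/2014, reported 8-5-14."
    print(redact_dates(report))
```

A pattern like this only covers well-formed numeric dates. Text with optical character recognition errors (for example a letter `O` in place of a zero) would not match, which is one reason a rule of this kind would be combined with a learned classifier rather than used alone.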

ID Code: 69301
Item Type: Conference Paper
Refereed: No
Additional Information: Paper presented in The 4th International Workshop on Health Document Text Mining and Information Analysis with the Focus of Cross-Language Evaluation
Divisions: Current > Institutes > Institute for Future Environments
Current > Schools > School of Information Systems
Current > QUT Faculties and Divisions > Science & Engineering Faculty
Copyright Owner: Copyright 2013 [please consult the author]
Deposited On: 07 May 2014 00:36
Last Modified: 08 May 2014 13:55
