De-Identification of health records using Anonym: Effectiveness and robustness across datasets

Zuccon, Guido, Kotzur, Daniel, Nguyen, Anthony, & Bergheim, Anton (2014) De-Identification of health records using Anonym: Effectiveness and robustness across datasets. Artificial Intelligence in Medicine, 61(3), pp. 145-151.


Abstract

Objective

Evaluate the effectiveness and robustness of Anonym, a tool for de-identifying free-text health records based on conditional random fields classifiers informed by linguistic and lexical features, as well as features extracted by pattern matching techniques. De-identification of personal health information in electronic health records is essential for the sharing and secondary usage of clinical data. De-identification tools that adapt to different sources of clinical data are attractive as they would require minimal intervention to guarantee high effectiveness.
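As a rough illustration of the kind of input a conditional random fields classifier might consume, the sketch below builds one feature dictionary per token, mixing lexical features with flags from pattern matching. The patterns, feature names, and the `token_features` and `sentence_features` helpers are assumptions made for this example; the abstract does not list Anonym's actual features or rules.

```python
import re

# Hypothetical patterns for illustration only; not Anonym's actual rules.
PATTERNS = {
    "date": re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$"),
    "phone": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
}


def token_features(tokens, i):
    """Return a feature dict for tokens[i] using the token and its neighbours."""
    tok = tokens[i]
    feats = {
        # Lexical features of the token itself.
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "is_upper": tok.isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "suffix3": tok[-3:].lower(),
        # Context features from the neighbouring tokens.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }
    # Pattern-matching features: one boolean flag per pattern.
    for name, pattern in PATTERNS.items():
        feats["match_" + name] = bool(pattern.match(tok))
    return feats


def sentence_features(tokens):
    """Return the list of feature dicts for a tokenised sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

A sequence labeller would then be trained on such feature lists paired with per-token labels (for example, identifier type or "not an identifier").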

Methods and Materials

The effectiveness and robustness of Anonym are evaluated across multiple datasets, including the widely adopted Informatics for Integrating Biology and the Bedside (i2b2) dataset, which was used for evaluation in a de-identification challenge. The datasets used here vary in type of health record, source of data, and quality, with one of the datasets containing optical character recognition errors.

Results

Anonym identifies and removes up to 96.6% of personal health identifiers (recall) with a precision of up to 98.2% on the i2b2 dataset, outperforming the best system proposed in the i2b2 challenge. The effectiveness of Anonym across datasets is found to depend on the amount of information available for training.
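For readers unfamiliar with the two measures, the following minimal sketch shows how recall and precision can be computed from gold-standard and predicted identifier annotations. It assumes exact-match scoring over `(start, end, type)` spans; the abstract does not state whether the reported figures were scored per token or per span, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(gold, predicted):
    """Compute (precision, recall) over two collections of annotations.

    Each annotation is a hashable item such as (start, end, type).
    Assumption: a prediction counts as correct only on an exact match.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    # Precision: fraction of predicted identifiers that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of gold identifiers that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

Under this definition, recall reflects how many identifiers were found and removed, while precision reflects how much of the removed text was in fact an identifier.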

Conclusion

Findings show that Anonym is comparable to the best approach from the 2006 i2b2 shared task. Anonym is easy to retrain with new datasets; once retrained, and given sufficient training data, the system is robust to variations in training size, data type, and data quality.

Impact and interest:

2 citations in Scopus
2 citations in Web of Science®


Full-text downloads:

18 since deposited on 14 Apr 2014
7 in the past twelve months


ID Code: 70127
Item Type: Journal Article
Refereed: Yes
Keywords: Conditional Random Fields, Pattern Matching, De-identification, Health records
DOI: 10.1016/j.artmed.2014.03.006
ISSN: 0933-3657
Subjects: Australian and New Zealand Standard Research Classification > INFORMATION AND COMPUTING SCIENCES (080000) > COMPUTER SOFTWARE (080300) > Computer System Security (080303)
Australian and New Zealand Standard Research Classification > INFORMATION AND COMPUTING SCIENCES (080000) > LIBRARY AND INFORMATION STUDIES (080700) > Health Informatics (080702)
Australian and New Zealand Standard Research Classification > PSYCHOLOGY AND COGNITIVE SCIENCES (170000) > COGNITIVE SCIENCE (170200) > Knowledge Representation and Machine Learning (170203)
Divisions: Past > QUT Faculties & Divisions > Faculty of Science and Technology
Current > Schools > School of Information Systems
Copyright Owner: Copyright 2014 Elsevier
Copyright Statement: Licensed under the Creative Commons Attribution; Non-Commercial; No-Derivatives 4.0 International. DOI: 10.1016/j.artmed.2014.03.006
Deposited On: 14 Apr 2014 02:09
Last Modified: 15 Jul 2015 17:50
