De-Identification of health records using Anonym: Effectiveness and robustness across datasets
Zuccon, Guido, Kotzur, Daniel, Nguyen, Anthony, & Bergheim, Anton (2014) De-Identification of health records using Anonym: Effectiveness and robustness across datasets. Artificial Intelligence in Medicine, 61(3), pp. 145-151.
Objective
Evaluate the effectiveness and robustness of Anonym, a tool for de-identifying free-text health records based on conditional random fields classifiers informed by linguistic and lexical features, as well as features extracted by pattern matching techniques.
Background
De-identification of personal health information in electronic health records is essential for the sharing and secondary usage of clinical data. De-identification tools that adapt to different sources of clinical data are attractive as they would require minimal intervention to guarantee high effectiveness.
Methods and Materials
The effectiveness and robustness of Anonym are evaluated across multiple datasets, including the widely adopted Informatics for Integrating Biology and the Bedside (i2b2) dataset, used for evaluation in a de-identification challenge. The datasets used here vary in the type of health record, the source of the data, and data quality, with one of the datasets containing optical character recognition errors.
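To make the feature description concrete, the following is a minimal sketch of token-level lexical and pattern-matching features of the kind a conditional random fields tagger could consume. It is not Anonym's feature set: the regular expressions, the `TITLES` list, and the function names `token_features` and `sentence_features` are all assumptions made for illustration.

```python
import re

# Hypothetical patterns chosen for illustration only; not taken from Anonym.
DATE_RE = re.compile(r"^\d{1,2}[/-]\d{1,2}[/-]\d{2,4}$")
PHONE_RE = re.compile(r"^\(?\d{2,4}\)?[\s-]?\d{3,4}[\s-]?\d{3,4}$")
TITLES = {"dr", "mr", "mrs", "ms", "prof"}


def token_features(tokens, i):
    """Build a feature dict for tokens[i] using the token and its neighbours."""
    tok = tokens[i]
    prev_tok = tokens[i - 1] if i > 0 else "<BOS>"
    next_tok = tokens[i + 1] if i < len(tokens) - 1 else "<EOS>"
    return {
        # Lexical features
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        # Context features
        "prev_lower": prev_tok.lower(),
        "next_lower": next_tok.lower(),
        "prev_is_title": prev_tok.lower().rstrip(".") in TITLES,
        # Pattern-matching features
        "matches_date": bool(DATE_RE.match(tok)),
        "matches_phone": bool(PHONE_RE.match(tok)),
    }


def sentence_features(tokens):
    """Return one feature dict per token in the sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

For example, `sentence_features(["Dr.", "Smith", "seen", "on", "12/03/2006"])` flags the second token as following a title and the last token as matching the date pattern. A sequence classifier would then be trained on such feature dicts paired with per-token labels.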
Results
Anonym identifies and removes up to 96.6% of personal health identifiers (recall) with a precision of up to 98.2% on the i2b2 dataset, outperforming the best system proposed in the i2b2 challenge. The effectiveness of Anonym across datasets is found to depend on the amount of information available for training.
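As a reminder of how the two reported measures are defined, the sketch below computes precision and recall from sets of annotated identifiers. The `(position, label)` representation and the function name `precision_recall` are assumptions for illustration; the evaluation protocol actually used in the paper may differ, for instance in how partial matches are scored.

```python
def precision_recall(gold, predicted):
    """Compute precision and recall over exact-match identifier annotations.

    gold and predicted are iterables of hashable items, here assumed to be
    (token_position, label) pairs.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    # Precision: fraction of predicted identifiers that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of gold identifiers that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data, invented for this example.
gold = {(0, "NAME"), (4, "DATE"), (9, "PHONE"), (12, "NAME")}
predicted = {(0, "NAME"), (4, "DATE"), (7, "NAME")}
precision, recall = precision_recall(gold, predicted)
```

In this toy case two of the three predictions are correct and two of the four gold identifiers are found. For de-identification, recall is usually the more critical figure, since a missed identifier remains in the released record.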
Conclusion
Findings show that Anonym is comparable to the best approach from the 2006 i2b2 shared task. Anonym is easy to retrain with new datasets; once retrained, the system is robust to variations in training size, data type, and data quality, provided sufficient training data is available.
|Item Type:||Journal Article|
|Keywords:||Conditional Random Fields, Pattern Matching, De-identification, Health records|
|Subjects:||Australian and New Zealand Standard Research Classification > INFORMATION AND COMPUTING SCIENCES (080000) > COMPUTER SOFTWARE (080300) > Computer System Security (080303); Australian and New Zealand Standard Research Classification > INFORMATION AND COMPUTING SCIENCES (080000) > LIBRARY AND INFORMATION STUDIES (080700) > Health Informatics (080702); Australian and New Zealand Standard Research Classification > PSYCHOLOGY AND COGNITIVE SCIENCES (170000) > COGNITIVE SCIENCE (170200) > Knowledge Representation and Machine Learning (170203)|
|Divisions:||Past > QUT Faculties & Divisions > Faculty of Science and Technology; Current > Schools > School of Information Systems|
|Copyright Owner:||Copyright 2014 Elsevier|
|Copyright Statement:||Licensed under the Creative Commons Attribution; Non-Commercial; No-Derivatives 4.0 International. DOI: 10.1016/j.artmed.2014.03.006|
|Deposited On:||14 Apr 2014 02:09|
|Last Modified:||15 Jul 2015 17:50|