ILA
Clinical Data Anonymization

Overview
AI system for medical data de-identification using fine-tuned language models. Enables privacy-compliant processing of clinical documents.
Context
The number of unstructured medical documents grows each year, presenting an opportunity to extract valuable insights that could significantly improve healthcare. However, to realize this potential, these documents must be de-identified before they are used for research, both to protect patient privacy and to comply with GDPR/HIPAA regulations.
The Challenge
Traditional approaches to clinical de-identification require large, expensive annotated datasets and don't adapt well to new document types. The challenge was twofold: (1) explore which deep learning strategies have the greatest impact on NER performance for PHI detection, and (2) create a tool that can quickly achieve robust performance without requiring massive annotated datasets.
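To make the task concrete, here is a minimal sketch of what happens once an NER model has detected PHI: each detected span is replaced with a placeholder. The span format (character offsets plus a label), the label names, and the `redact` helper are assumptions for illustration and are not taken from the thesis or from ILA.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end, label) as character offsets.
Span = Tuple[int, int, str]


def redact(text: str, spans: List[Span]) -> str:
    """Replace each detected PHI span with a bracketed label placeholder."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Skip a span that overlaps one already redacted.
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Invented example note; the spans stand in for the output of an NER model.
note = "John Smith was admitted on 2023-04-12 in Brussels."
spans = [(0, 10, "NAME"), (27, 37, "DATE"), (41, 49, "LOCATION")]
redacted = redact(note, spans)
print(redacted)
```

The quality of the result depends entirely on the spans: a PHI mention the model misses stays in the text, which is why recall on PHI detection matters so much for this task.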
The Solution
The thesis has two parts. Part 1: A comprehensive evaluation of deep learning NER architectures (BERT, BiLSTM-CRF, fine-tuned transformers) to identify which methods work best for clinical de-identification. Part 2: Development of ILA (Incremental Learning Annotator), an innovative open-source tool that uses active learning and incremental training to rapidly build accurate models with minimal manual annotation effort.
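One common way to implement the selection step of active learning is uncertainty sampling: the documents the current model is least sure about are sent for annotation first, and the model is retrained on the newly annotated data. The sketch below illustrates only that selection step. The `least_confident` function, the scoring rule, and the sample probabilities are assumptions for illustration; the source does not specify which strategy ILA uses.

```python
from typing import Dict, List


def least_confident(token_probs: Dict[str, List[float]], k: int) -> List[str]:
    """Return the ids of the k documents with the least certain prediction.

    token_probs maps a document id to the model's top-label probability
    for each token. A document is scored by its weakest token prediction.
    """
    scores = {
        doc_id: min(probs)
        for doc_id, probs in token_probs.items()
        if probs  # ignore empty documents
    }
    # Lowest score first: these are the most uncertain documents.
    return sorted(scores, key=scores.get)[:k]


# Invented probabilities standing in for model output on an unlabeled pool.
pool = {
    "doc_a": [0.99, 0.98, 0.97],
    "doc_b": [0.95, 0.51, 0.90],
    "doc_c": [0.80, 0.85, 0.99],
}
to_annotate = least_confident(pool, k=2)
print(to_annotate)
```

In an incremental setup, the selected documents would be annotated by a user, added to the training set, and used to update the model before the next selection round.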
Results & Impact
- 95%+ F1 score on PHI detection across multiple entity types
- Reduced annotation effort by 70% compared to traditional approaches
- Published and defended at UCLouvain (2024)
- Open-source ILA tool available for healthcare research
- Enables GDPR-compliant clinical document processing
- Model improves automatically as users annotate edge cases
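For readers unfamiliar with the metric, the sketch below shows one way an F1 score for entity detection can be computed: strict matching, where a predicted entity counts only if its boundaries and label both match the annotation exactly. This is an assumption for illustration; the source does not state whether the reported score is token-level or entity-level, or strict or relaxed.

```python
from typing import Set, Tuple

# Hypothetical entity format: (start, end, label) as character offsets.
Entity = Tuple[int, int, str]


def entity_f1(gold: Set[Entity], pred: Set[Entity]) -> float:
    """Strict entity-level F1: exact match on boundaries and label."""
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented annotations: two of three predictions match the gold entities.
gold = {(0, 10, "NAME"), (27, 37, "DATE"), (41, 49, "LOCATION")}
pred = {(0, 10, "NAME"), (27, 37, "DATE"), (60, 65, "NAME")}
score = entity_f1(gold, pred)
print(round(score, 3))
```

For de-identification, recall is often examined alongside F1, because a missed entity leaves identifying information in the document.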
Deep Learning for de-identification of clinical documents
Abstract
The number of unstructured medical documents grows each year, presenting an opportunity to extract valuable insights that could significantly improve healthcare. However, to take advantage of this potential, these documents must first be de-identified, both to protect patient privacy and to make them usable for research. This study explores deep learning solutions for the de-identification of clinical documents. The first part examines current strategies for recognizing specific words in documents to understand which methods have the greatest impact on performance. This evaluation helps identify the strengths and weaknesses of traditional deep learning approaches. The second part introduces an innovative open-source tool: the Incremental Learning Annotator (ILA). This tool makes it possible to quickly obtain a robust model that performs well, which addresses the need for a large, well-annotated dataset to obtain a robust deep learning model.
Key Contributions
- Comprehensive evaluation of NER architectures for clinical de-identification
- Novel active learning approach for medical document annotation
- Development of ILA: Incremental Learning Annotator tool
- Proven methodology for rapid model deployment with minimal data
Project Details
- Year: 2024
- Duration: 6 months (Master's Thesis)
- Team: Solo + Academic Supervisor
- My Role: Researcher & Developer
- Client / Sector: UCLouvain
Key Impact
Privacy-compliant healthcare data processing