Healthcare AI

ILA

Clinical Data Anonymization

Archived

Overview

AI system for medical data de-identification using fine-tuned language models. Enables privacy-compliant processing of clinical documents.
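As a rough illustration of what de-identification means in practice, the sketch below replaces detected PHI spans with placeholder tags. The `redact` helper, the character-offset span format, and the label names are all hypothetical and chosen for illustration; they are not taken from the thesis or from ILA, and the sketch assumes the spans do not overlap.

```python
from typing import List, Tuple

# Hypothetical span format: (start offset, end offset, PHI label)
Span = Tuple[int, int, str]


def redact(text: str, spans: List[Span]) -> str:
    """Replace each PHI span with a bracketed label placeholder.

    Assumes spans do not overlap. Spans are processed from the end of
    the text so that earlier character offsets stay valid.
    """
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


note = "Patient John Doe was admitted on 2024-03-01."
# In a real pipeline these spans would come from the NER model's predictions.
spans = [(8, 16, "NAME"), (33, 43, "DATE")]
redacted = redact(note, spans)
print(redacted)
```

Working from the last span backwards avoids recomputing offsets after each replacement changes the length of the text.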

Context

The Challenge

Traditional approaches to clinical de-identification require large, expensive annotated datasets and don't adapt well to new document types. The challenge was twofold: (1) explore which deep learning strategies have the greatest impact on NER performance for PHI detection, and (2) create a tool that can quickly achieve robust performance without requiring massive annotated datasets.

The Solution

The thesis has two parts. Part 1: A comprehensive evaluation of deep learning NER architectures (BERT, BiLSTM-CRF, fine-tuned transformers) to identify which methods work best for clinical de-identification. Part 2: Development of ILA (Incremental Learning Annotator), an innovative open-source tool that uses active learning and incremental training to rapidly build accurate models with minimal manual annotation effort.
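The active learning idea can be sketched as follows: rank unlabeled documents by how uncertain the current model is, and send the most uncertain ones to the annotator first. This is a minimal sketch that assumes least-confidence sampling over per-token label probabilities; the source does not state which selection strategy ILA uses, and the function names and toy data here are hypothetical.

```python
from typing import Dict, List, Sequence


def least_confidence(token_probs: Sequence[Sequence[float]]) -> float:
    """Document uncertainty: 1 minus the lowest top-label probability over its tokens."""
    if not token_probs:
        return 0.0
    return 1.0 - min(max(probs) for probs in token_probs)


def select_for_annotation(
    pool: Dict[str, Sequence[Sequence[float]]], batch_size: int
) -> List[str]:
    """Return the ids of the `batch_size` most uncertain documents in the pool."""
    ranked = sorted(pool, key=lambda doc_id: least_confidence(pool[doc_id]), reverse=True)
    return ranked[:batch_size]


# Toy pool: each document is a list of per-token probability distributions
# (assumed to come from the current model; two labels for brevity).
pool = {
    "note_a": [[0.98, 0.02], [0.95, 0.05]],
    "note_b": [[0.55, 0.45], [0.90, 0.10]],
    "note_c": [[0.70, 0.30], [0.99, 0.01]],
}
selected = select_for_annotation(pool, batch_size=2)
print(selected)
```

In an incremental setup, the newly annotated documents would be added to the training set, the model updated, and the remaining pool re-scored before the next batch is selected.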

Results & Impact

95%+ F1 score on PHI detection across multiple entity types
Reduced annotation effort by 70% compared to traditional approaches
Published and defended at UCLouvain (2024)
Open-source ILA tool available for healthcare research
Enables GDPR-compliant clinical document processing
Model improves automatically as users annotate edge cases
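For readers unfamiliar with the F1 metric quoted above, the sketch below computes it from precision and recall over predicted PHI spans. It assumes strict entity-level matching (a prediction counts only if offsets and label both match exactly); the thesis may use a different matching scheme, and the span format and example values are hypothetical.

```python
from typing import Set, Tuple

# Hypothetical span format: (start offset, end offset, PHI label)
Span = Tuple[int, int, str]


def span_f1(gold: Set[Span], predicted: Set[Span]) -> float:
    """Entity-level F1 with exact-match spans (an assumed matching scheme)."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = {(0, 8, "NAME"), (20, 30, "DATE"), (45, 52, "ID")}
predicted = {(0, 8, "NAME"), (20, 30, "DATE"), (60, 66, "LOCATION")}
score = span_f1(gold, predicted)
print(round(score, 3))
```

Here two of three predictions are correct and two of three gold spans are found, so precision and recall are both 2/3 and so is F1.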

📄 Master's Thesis, UCLouvain

Deep Learning for de-identification of clinical documents

Abstract

The volume of unstructured medical documents increases each year, presenting an opportunity to extract valuable insights that could significantly improve healthcare. To take advantage of this potential, however, these documents must be de-identified, both to protect patient privacy and to make them usable for research. This study explores deep learning solutions for the de-identification of clinical documents. The first part examines current strategies for recognizing specific words in documents, in order to understand which methods have the greatest impact on performance. This evaluation helps to identify the strengths and weaknesses of traditional deep learning approaches. The second part introduces an innovative open-source tool: the Incremental Learning Annotator (ILA). This tool makes it possible to quickly obtain a robust model that achieves good performance, which addresses the need for a large, well-annotated dataset to train a robust deep learning model.

Key Contributions

✓ Comprehensive evaluation of NER architectures for clinical de-identification
✓ Novel active learning approach for medical document annotation
✓ Development of ILA: Incremental Learning Annotator tool
✓ Proven methodology for rapid model deployment with minimal data

Project Details

Year: 2024
Duration: 6 months (Master's Thesis)
Team: Solo + Academic Supervisor
My Role: Researcher & Developer
Client / Sector: UCLouvain

Technologies

Fine-tuned LLMs, NER, Healthcare, Privacy

Key Impact

Privacy-compliant healthcare data processing
