
Towards LLM-Guided Healthcare Dataset Harmonization

Christos Smailis, Carlos Ordonez, Ioannis A. Kakadiaris

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Electronic health record (EHR) datasets come in various schemas and can contain a range of data types, measurement units, and variables with duplicate semantic content. The process of bringing such datasets into a common schema with consistent values, so that they can be queried uniformly, is known as harmonization. However, performing this process manually can be both time-consuming and error-prone. In this work, we present a web-based platform that semi-automates the harmonization and linking of EHR datasets through a human-in-the-loop framework, guiding users with large language models (LLMs). Our solution is a two-stage harmonization pipeline that keeps schema metadata processing online while handling patient-level data locally, to align with HIPAA data privacy principles. In the first stage, users harmonize and link only non-identifiable schema information. In the second stage, sensitive value-level harmonization occurs entirely on the user's system, so no private or protected health information ever leaves their environment. Throughout both stages, we expect LLM-powered suggestions to speed up the harmonization and linking processes.
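
The two-stage split described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: every name here (`ColumnSchema`, `extract_schema`, `apply_mapping`, the `wt_lb` and `weight_kg` columns) is hypothetical, no LLM call is shown, and the mapping stands in for a suggestion that a user has already reviewed and approved. The sketch only shows the idea that stage one sees column-level metadata, while stage two applies the approved mapping to patient-level values locally.

```python
from dataclasses import dataclass

LB_TO_KG = 0.45359237  # pounds to kilograms


@dataclass(frozen=True)
class ColumnSchema:
    """Column-level metadata only; holds no cell values (hypothetical type)."""
    name: str
    dtype: str
    unit: str


def extract_schema(rows, units):
    """Stage 1 input: derive column names, type names, and units.

    Only this metadata would be shared with an online service in this sketch;
    the cell values are inspected for their type but never copied out.
    """
    first = rows[0]
    return [
        ColumnSchema(name, type(value).__name__, units.get(name, ""))
        for name, value in first.items()
    ]


def apply_mapping(rows, mapping):
    """Stage 2: apply a user-approved mapping to the values, entirely locally.

    `mapping` maps a source column to (target column, scale factor).
    Columns without an entry are passed through unchanged.
    """
    harmonized = []
    for row in rows:
        new_row = {}
        for column, value in row.items():
            target, factor = mapping.get(column, (column, None))
            new_row[target] = value * factor if factor is not None else value
        harmonized.append(new_row)
    return harmonized


# Hypothetical local dataset and an approved mapping for one column.
rows = [{"patient_id": "P-001", "wt_lb": 100.0}]
schema = extract_schema(rows, {"wt_lb": "lb"})
approved = {"wt_lb": ("weight_kg", LB_TO_KG)}
harmonized = apply_mapping(rows, approved)
```

In this sketch `schema` is the only object that would leave the user's machine, and it carries no patient identifiers or measurements; `harmonized` is produced and kept locally.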

Original language: English (US)
Title of host publication: 2025 IEEE 12th International Conference on Data Science and Advanced Analytics, DSAA 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798331511791
DOIs
State: Published - 2025
Event: 12th IEEE International Conference on Data Science and Advanced Analytics, DSAA 2025 - Birmingham, United Kingdom
Duration: Oct 9 2025 to Oct 12 2025

Publication series

Name: 2025 IEEE 12th International Conference on Data Science and Advanced Analytics, DSAA 2025

Conference

Conference: 12th IEEE International Conference on Data Science and Advanced Analytics, DSAA 2025
Country/Territory: United Kingdom
City: Birmingham
Period: 10/9/25 to 10/12/25

Keywords

  • harmonization
  • large language models
  • linking

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Information Systems and Management
