TY - GEN
T1 - Towards LLM-Guided Healthcare Dataset Harmonization
AU - Smailis, Christos
AU - Ordonez, Carlos
AU - Kakadiaris, Ioannis A.
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Electronic health record (EHR) datasets come in various schemas and can contain a range of data types, measurement units, and variables that share duplicate semantic content. The process of bringing such datasets into a common schema with consistent values, so that it is possible to perform queries uniformly, is known as harmonization. However, performing this process manually can be both time-consuming and prone to errors. In this work, we present a web-based platform that semi-automates the harmonization and linking of EHR datasets through a human-in-the-loop framework, guiding users with the use of large language models (LLMs). Our solution is a two-stage harmonization pipeline that keeps schema metadata processing online while handling patient-level data locally, to align with HIPAA data privacy principles. In the first stage, users harmonize and link only non-identifiable schema information. In the second stage, sensitive value-level harmonization occurs entirely on the user's system, so no private and protected health information ever leaves their environment. Throughout both stages, we expect that LLM-powered suggestions could potentially speed up the harmonization and linking processes.
AB - Electronic health record (EHR) datasets come in various schemas and can contain a range of data types, measurement units, and variables that share duplicate semantic content. The process of bringing such datasets into a common schema with consistent values, so that it is possible to perform queries uniformly, is known as harmonization. However, performing this process manually can be both time-consuming and prone to errors. In this work, we present a web-based platform that semi-automates the harmonization and linking of EHR datasets through a human-in-the-loop framework, guiding users with the use of large language models (LLMs). Our solution is a two-stage harmonization pipeline that keeps schema metadata processing online while handling patient-level data locally, to align with HIPAA data privacy principles. In the first stage, users harmonize and link only non-identifiable schema information. In the second stage, sensitive value-level harmonization occurs entirely on the user's system, so no private and protected health information ever leaves their environment. Throughout both stages, we expect that LLM-powered suggestions could potentially speed up the harmonization and linking processes.
KW - harmonization
KW - large language models
KW - linking
UR - https://www.scopus.com/pages/publications/105029897391
UR - https://www.scopus.com/inward/citedby.url?scp=105029897391&partnerID=8YFLogxK
U2 - 10.1109/DSAA65442.2025.11247965
DO - 10.1109/DSAA65442.2025.11247965
M3 - Conference contribution
AN - SCOPUS:105029897391
T3 - 2025 IEEE 12th International Conference on Data Science and Advanced Analytics, DSAA 2025
BT - 2025 IEEE 12th International Conference on Data Science and Advanced Analytics, DSAA 2025
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 12th IEEE International Conference on Data Science and Advanced Analytics, DSAA 2025
Y2 - 9 October 2025 through 12 October 2025
ER -