TY - JOUR
T1 - Methodological considerations for optimal variable selection in machine learning for health services research
AU - Dong, Weichuan
AU - Lal, Trisha
AU - Liu, Fangzhou
AU - Pronovost, Peter
AU - Bora, Samudragupta
AU - Hoehn, Richard S.
N1 - Publisher Copyright: © The Author(s) 2025.
PY - 2025
Y1 - 2025
AB - Effective variable selection is central to the success of health services research, where large, complex datasets often include numerous variables with varying degrees of relevance. This paper presents a structured approach to variable selection, highlighting the importance of combining domain expertise with advanced analytical techniques to ensure the inclusion of only the most pertinent variables. We explore several methods, including manual selection, correlation matrices, random forests, and stepwise regression, each with its strengths and limitations in managing multicollinearity, dimensionality, and interpretability. By carefully preprocessing variables—removing redundant, irrelevant, or missing data—and applying feature selection tools like decision tree-based algorithms, researchers can streamline their models to focus on the most impactful predictors. This approach not only improves the reliability and precision of findings but also enhances the interpretability of complex models, particularly when working with social determinants of health (SDOH). Through a case study using the LexisNexis SDOH dataset, we illustrate how these methods can be tailored to identify patients at highest risk for adverse health outcomes. The proposed framework fosters more accurate, actionable insights and supports targeted interventions that aim to reduce health inequities.
UR - https://www.scopus.com/pages/publications/105007238778
UR - https://www.scopus.com/inward/citedby.url?scp=105007238778&partnerID=8YFLogxK
U2 - 10.1007/s10742-025-00347-8
DO - 10.1007/s10742-025-00347-8
M3 - Article
AN - SCOPUS:105007238778
SN - 1387-3741
VL - 25
SP - 474
EP - 486
JO - Health Services and Outcomes Research Methodology
JF - Health Services and Outcomes Research Methodology
IS - 4
ER -