Methodological considerations for optimal variable selection in machine learning for health services research

Weichuan Dong, Trisha Lal, Fangzhou Liu, Peter Pronovost, Samudragupta Bora, Richard S. Hoehn

Research output: Contribution to journal › Article › peer-review

Abstract

Effective variable selection is central to the success of health services research, where large, complex datasets often include numerous variables with varying degrees of relevance. This paper presents a structured approach to variable selection, highlighting the importance of combining domain expertise with advanced analytical techniques to ensure the inclusion of only the most pertinent variables. We explore several methods, including manual selection, correlation matrices, random forests, and stepwise regression, each with its strengths and limitations in managing multicollinearity, dimensionality, and interpretability. By carefully preprocessing variables—removing redundant, irrelevant, or missing data—and applying feature selection tools like decision tree-based algorithms, researchers can streamline their models to focus on the most impactful predictors. This approach not only improves the reliability and precision of findings but also enhances the interpretability of complex models, particularly when working with social determinants of health (SDOH). Through a case study using the LexisNexis SDOH dataset, we illustrate how these methods can be tailored to identify patients at highest risk for adverse health outcomes. The proposed framework fosters more accurate, actionable insights and supports targeted interventions that aim to reduce health inequities.
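
As a rough illustration of the correlation-matrix step mentioned in the abstract, the sketch below drops variables that are nearly collinear with one already kept. It is not the authors' implementation: the greedy keep-first rule, the 0.9 threshold, the helper names `pearson` and `drop_correlated`, and the synthetic variable names are all assumptions made for this example, and no data from the LexisNexis SDOH dataset is used.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Greedy filter (assumed rule): keep a column only if its absolute
    correlation with every previously kept column is below the threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Synthetic toy data with hypothetical variable names.
rng = random.Random(0)
income = [rng.gauss(50, 10) for _ in range(200)]
data = {
    "income": income,
    "income_scaled": [2 * v + rng.gauss(0, 0.5) for v in income],  # near-duplicate
    "distance_to_clinic": [rng.gauss(5, 2) for _ in range(200)],   # independent
}

selected = drop_correlated(data)
print(selected)
```

In this toy setup the near-duplicate `income_scaled` is filtered out while the independent variable is retained. A filter like this only addresses pairwise redundancy; which variable of a correlated pair to keep is a judgment that, as the abstract stresses, benefits from domain expertise.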

Original language: English (US)
Pages (from-to): 474-486
Number of pages: 13
Journal: Health Services and Outcomes Research Methodology
Volume: 25
Issue number: 4
State: Accepted/In press - 2025

ASJC Scopus subject areas

  • Health Policy
  • Public Health, Environmental and Occupational Health
