TY - JOUR
T1 - Interpretability in healthcare
T2 - A comparative study of local machine learning interpretability techniques
AU - ElShawi, Radwa
AU - Sherif, Youssef
AU - Al-Mallah, Mouaz
AU - Sakr, Sherif
N1 - Funding Information:
The work of Radwa Elshawi is funded by the European Regional Development Funds via the Mobilitas Plus programme (MOBJD341). The work of Sherif Sakr and Youssef Sherif is funded by the European Regional Development Funds via the Mobilitas Plus programme (grant MOBTT75).
Publisher Copyright:
© 2020 Wiley Periodicals LLC.
PY - 2021/11
Y1 - 2021/11
N2 - Although complex machine learning models (e.g., random forests, neural networks) commonly outperform traditional, simpler interpretable models (e.g., linear regression, decision trees), clinicians in the healthcare domain find it hard to understand and trust these complex models because their predictions lack intuition and explanation. With the new General Data Protection Regulation (GDPR), the plausibility and verifiability of predictions made by machine learning models have become essential. Hence, interpretability techniques for machine learning models have become an active area of research. In general, the main aim of these interpretability techniques is to shed light on the prediction process of machine learning models and to explain how their predictions are generated. A major problem in this context is that both the quality of interpretability techniques and trust in the predictions of machine learning models are challenging to measure. In this article, we propose four fundamental quantitative measures for assessing the quality of interpretability techniques: similarity, bias detection, execution time, and trust. We present a comprehensive experimental evaluation of six recent and popular local model-agnostic interpretability techniques, namely LIME, SHAP, Anchors, LORE, ILIME, and MAPLE, on different types of real-world healthcare data. Building on previous work, our experimental evaluation covers different aspects of the comparison, including identity, stability, separability, similarity, execution time, bias detection, and trust. The results of our experiments show that MAPLE achieves the highest performance on the identity metric across all data sets included in this study, while LIME achieves the lowest. LIME achieves the highest performance on the separability metric across all data sets. On average, SHAP requires the least time to output an explanation across all data sets included in this study. For bias detection, SHAP and MAPLE best enable participants to detect bias. For the trust metric, Anchors achieves the highest performance across all data sets included in this work.
AB - Although complex machine learning models (e.g., random forests, neural networks) commonly outperform traditional, simpler interpretable models (e.g., linear regression, decision trees), clinicians in the healthcare domain find it hard to understand and trust these complex models because their predictions lack intuition and explanation. With the new General Data Protection Regulation (GDPR), the plausibility and verifiability of predictions made by machine learning models have become essential. Hence, interpretability techniques for machine learning models have become an active area of research. In general, the main aim of these interpretability techniques is to shed light on the prediction process of machine learning models and to explain how their predictions are generated. A major problem in this context is that both the quality of interpretability techniques and trust in the predictions of machine learning models are challenging to measure. In this article, we propose four fundamental quantitative measures for assessing the quality of interpretability techniques: similarity, bias detection, execution time, and trust. We present a comprehensive experimental evaluation of six recent and popular local model-agnostic interpretability techniques, namely LIME, SHAP, Anchors, LORE, ILIME, and MAPLE, on different types of real-world healthcare data. Building on previous work, our experimental evaluation covers different aspects of the comparison, including identity, stability, separability, similarity, execution time, bias detection, and trust. The results of our experiments show that MAPLE achieves the highest performance on the identity metric across all data sets included in this study, while LIME achieves the lowest. LIME achieves the highest performance on the separability metric across all data sets. On average, SHAP requires the least time to output an explanation across all data sets included in this study. For bias detection, SHAP and MAPLE best enable participants to detect bias. For the trust metric, Anchors achieves the highest performance across all data sets included in this work.
KW - big data
KW - data science
KW - interpretability
KW - machine learning
UR - http://www.scopus.com/inward/record.url?scp=85096722982&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85096722982&partnerID=8YFLogxK
U2 - 10.1111/coin.12410
DO - 10.1111/coin.12410
M3 - Article
AN - SCOPUS:85096722982
SN - 0824-7935
VL - 37
SP - 1633
EP - 1650
JO - Computational Intelligence
JF - Computational Intelligence
IS - 4
ER -