Explainable Fusion Model for Non-Alcoholic Fatty Liver Disease Risk Prediction

Abstract—Non-Alcoholic Fatty Liver Disease (NAFLD) is a growing global health problem, particularly in low- and middle-income countries such as Bangladesh. Early diagnosis is essential to prevent severe liver-related complications, yet current screening procedures are invasive, costly, and unsuitable for large-scale use. This paper proposes an interpretable machine learning model for NAFLD risk prediction, built on a custom real-world clinical dataset compiled from hospital records representative of the Bangladeshi population, combined with a unified collection of heterogeneous publicly available datasets. Because the sources did not share the same features, the combined dataset contained a substantial amount of missing data, which was imputed by low-rank matrix completion using SoftImpute. After matrix completion, SMOTE oversampling was applied to reduce class imbalance. Logistic Regression, Random Forest, and XGBoost models, together with a stacking ensemble, were trained on the final balanced dataset. Experimental results show that Logistic Regression achieved a test accuracy of 97.99%, whereas Random Forest, XGBoost, and the stacked ensemble achieved test accuracies of 99.28%, 99.31%, and 99.40%, respectively. The ensemble also attained the highest F1 score on both the test and validation sets. SHAP-based explainability was incorporated to provide both global feature importance and patient-level explanations.
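To make the oversampling step concrete, the following is a minimal illustrative sketch of SMOTE-style interpolation in plain Python. It is not the implementation used in this work, which the abstract does not specify. The function name `smote`, its parameters, and the toy data points are hypothetical and chosen only for illustration. As a simplification, base samples are drawn at random rather than visited in turn as in the original SMOTE formulation.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority-class neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Interpolate: gap = 0 reproduces the base, gap near 1 the neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy, made-up minority-class points (two features each); not study data.
minority_points = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote(minority_points, n_new=6, k=2)
```

Because every synthetic sample is a convex combination of two existing minority samples, the new points stay within the range already spanned by the minority class; they add density, not new information.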