Multi-Modal Fusion Tuberculosis Severity Classification: Vision Transformer for Chest X-Ray and Tabular Transformer for Clinical Data

1) A novel multimodal fusion framework for TB severity
classification that integrates chest X-ray images via a
pre-trained Vision Transformer (ViT) and clinical data via
a hyperparameter-tuned Tabular Transformer, with early
feature concatenation for joint learning.
2) Extensive hyperparameter tuning and regularization
(e.g., dropout, learning rate scheduling) to enhance generalization.
3) An accuracy of 94.01% on the MIMIC-CXR dataset,
using pneumonia as a proxy for TB, outperforming unimodal
baselines and state-of-the-art methods.
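
To make the fusion step in the first contribution concrete, the following is a minimal sketch of feature concatenation followed by a joint classification head, written in plain Python so that it runs without a deep learning library. It is not the implementation used in this work: the two encoders are replaced by random placeholder vectors, and the names `concat_fusion` and `LinearHead`, the feature sizes, and the number of severity classes are all assumptions made for illustration.

```python
import math
import random


def concat_fusion(image_feat, tabular_feat):
    """Fuse two per-sample feature vectors by simple concatenation."""
    return list(image_feat) + list(tabular_feat)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LinearHead:
    """Hypothetical linear layer mapping fused features to class logits."""

    def __init__(self, in_dim, num_classes, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(in_dim)
        # One weight row per class, initialized uniformly in [-scale, scale].
        self.weights = [
            [rng.uniform(-scale, scale) for _ in range(in_dim)]
            for _ in range(num_classes)
        ]
        self.bias = [0.0] * num_classes

    def __call__(self, fused):
        if len(fused) != len(self.weights[0]):
            raise ValueError("fused feature size does not match head input size")
        return [
            sum(w * x for w, x in zip(row, fused)) + b
            for row, b in zip(self.weights, self.bias)
        ]


# Assumed sizes, chosen small for readability; not taken from the paper.
IMAGE_DIM, TABULAR_DIM, NUM_CLASSES = 8, 4, 3

# Placeholder vectors standing in for the ViT image embedding and the
# Tabular Transformer embedding of one patient.
rng = random.Random(42)
image_feat = [rng.gauss(0, 1) for _ in range(IMAGE_DIM)]
tabular_feat = [rng.gauss(0, 1) for _ in range(TABULAR_DIM)]

fused = concat_fusion(image_feat, tabular_feat)
head = LinearHead(in_dim=IMAGE_DIM + TABULAR_DIM, num_classes=NUM_CLASSES)
probs = softmax(head(fused))  # one probability per assumed severity class
```

Because the head receives both feature vectors as a single input, its weights are learned over image and clinical features jointly, which is what distinguishes this scheme from training one classifier per modality and averaging their predictions.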