ViT-Sign: An Explainable Vision Transformer for Sign Language Recognition

Md. Fakrul Islam Rafsan
Jun, Tue, 2026

More than 70 million deaf people worldwide use sign language, yet automated sign language recognition systems remain limited by the challenges of real-time implementation and interpretation. Earlier CNN-based techniques often miss the subtle features that are crucial for distinguishing between similar gestures. This research examines the recognition of 37 distinct sign language classes: the letters A to Z, the digits 0 to 9, and a space. We use a dataset of 55,500 images. Our proposed method, ViT-Sign, uses the Tiny variant of the Vision Transformer, which has only 5.5 million parameters. We evaluate its performance using accuracy, precision, recall, and F1-score; the model achieves 99.42\% accuracy, 99.44\% precision, 99.42\% recall, and a 99.42\% F1-score. We also use Grad-CAM visualizations to show that the model concentrates on the relevant regions of the hands rather than on unrelated background content. The experiments show that Vision Transformers can effectively capture spatial relationships and learn discriminative representations of gestures. The proposed method can serve as a basis for effective communication systems that make it easier for hearing-impaired people to communicate.
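
To make the four reported metrics concrete, the following is a minimal sketch of how accuracy, precision, recall, and F1-score can be derived from a multi-class confusion matrix. It assumes macro averaging (an unweighted mean over classes); the text above does not state which averaging scheme was used, so this is an assumption. The function name `macro_metrics` and the toy 3-class matrix are hypothetical and are not taken from the paper's experiments.

```python
def macro_metrics(confusion):
    """Return (accuracy, precision, recall, f1), macro-averaged over classes.

    confusion[i][j] counts samples whose true class is i and predicted class is j.
    Macro averaging is an assumption; the source does not specify the scheme.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n))

    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    return (
        correct / total,
        sum(precisions) / n,
        sum(recalls) / n,
        sum(f1s) / n,
    )


# Toy 3-class confusion matrix with 10 samples per class (made-up counts).
toy = [
    [9, 1, 0],
    [0, 10, 0],
    [1, 0, 9],
]
accuracy, precision, recall, f1 = macro_metrics(toy)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

In this toy case every class has the same number of samples, so macro-averaged recall coincides with accuracy while precision differs slightly. If the 55,500 images are evenly distributed over the 37 classes (1,500 per class, which is an assumption), the same effect would be consistent with the identical accuracy and recall values reported above.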