SSViT-4.0: A Self-Supervised Hybrid CNN-Transformer Framework for Industrial Visual Anomaly Detection

Detecting industrial visual anomalies remains a critical challenge in smart manufacturing due to scarce labeled defect samples and highly variable texture patterns. Existing anomaly detection (AD) and active learning (AL) approaches often struggle in these challenging settings, since training typically relies only on normal, unlabeled data. This research proposes SSViT-4.0, a self-supervised hybrid framework that combines Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) for reliable, label-efficient anomaly detection in industrial images. A cross-hierarchical fusion module integrates global ViT self-attention with local CNN feature extraction, overcoming the limitations of conventional designs that process the two feature streams independently or depend on supervised fine-tuning. Self-supervised pretraining via image reconstruction enables efficient representation learning without manual annotations, while patch-based embeddings and a lightweight anomaly scoring head provide accurate anomaly localization at inference. Experimental results on benchmark datasets (e.g., MVTec AD) demonstrate that SSViT-4.0 surpasses CNN-only and Transformer-only baselines in both detection accuracy and localization quality while maintaining real-time inference. The framework offers a scalable and efficient solution for automated visual inspection in Industry 4.0 environments.
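To make the described architecture concrete, the following is a minimal PyTorch sketch of the components named in the abstract: a local CNN stream, a global ViT stream over patch embeddings, a cross-hierarchical fusion module, a reconstruction head for self-supervised pretraining, and a lightweight per-patch anomaly scoring head. All layer sizes, the cross-attention fusion mechanism, and the class names (`CrossHierarchicalFusion`, `SSViTSketch`) are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of the SSViT-4.0 design described in the abstract.
# Hyperparameters and the fusion mechanism are assumptions for clarity.
import torch
import torch.nn as nn


class CrossHierarchicalFusion(nn.Module):
    """Fuse local CNN features with global ViT tokens via cross-attention."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cnn_tokens, vit_tokens):
        # CNN patch tokens query the ViT tokens, so each local feature
        # is refined with globally attended context.
        fused, _ = self.attn(cnn_tokens, vit_tokens, vit_tokens)
        return self.norm(cnn_tokens + fused)


class SSViTSketch(nn.Module):
    def __init__(self, dim: int = 128, patch: int = 16, img: int = 224):
        super().__init__()
        n_patches = (img // patch) ** 2  # 14 x 14 = 196 for these defaults
        # Local stream: a small CNN whose total stride matches the patch grid.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, kernel_size=3, stride=4, padding=1),
        )
        # Global stream: patch embedding + a shallow Transformer encoder.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.vit = nn.TransformerEncoder(layer, num_layers=2)
        self.fusion = CrossHierarchicalFusion(dim)
        # Reconstruction head for self-supervised pretraining:
        # decode each fused token back to its pixel patch.
        self.decoder = nn.Linear(dim, patch * patch * 3)
        # Lightweight scoring head: one anomaly score per patch token.
        self.score = nn.Linear(dim, 1)

    def forward(self, x):
        cnn_tokens = self.cnn(x).flatten(2).transpose(1, 2)   # (B, N, C)
        vit_tokens = self.vit(
            self.patch_embed(x).flatten(2).transpose(1, 2) + self.pos
        )
        fused = self.fusion(cnn_tokens, vit_tokens)
        recon = self.decoder(fused)             # pretraining target
        scores = self.score(fused).squeeze(-1)  # per-patch anomaly scores
        return recon, scores


model = SSViTSketch()
recon, scores = model(torch.randn(2, 3, 224, 224))
print(recon.shape, scores.shape)  # (2, 196, 768), (2, 196)
```

Under this sketch's assumptions, pretraining would minimize a reconstruction loss (e.g., MSE between `recon` and the ground-truth pixel patches) on normal images only, and at inference the per-patch `scores` would be reshaped to the patch grid and upsampled to produce a pixel-level localization map.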