A Novel Hybrid CNN-RNN Architecture for Emotion Recognition from Speech

Understanding and interpreting human emotions is crucial in Human-Computer Interaction (HCI), and Speech Emotion Recognition (SER) is central to this effort. Traditional machine-learning methods have served SER for years, but recent advances in Deep Learning (DL) deliver markedly better results. To this end, this research introduces a novel hybrid architecture that combines Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to improve SER accuracy. The model is trained on a diverse dataset drawn from four sources and covering seven emotional categories, and it achieves a testing accuracy of 93.40%. The study shows that the proposed model performs consistently across gender-differentiated emotion classes, with per-class accuracies ranging from 88% to 99%: it recognizes “Female surprise” best, at 99%, while “Male disgust” is lowest, at 88%. These results highlight the model’s robustness and its ability to generalize across emotions and demographic groups. Beyond setting a new benchmark in SER, this work advances the development of emotionally intelligent systems, with applications in interactive voice response systems, mental health monitoring, and personalized digital assistants.
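
The abstract does not specify the layer configuration, so the following is only a minimal PyTorch sketch of one plausible CNN-RNN hybrid of the kind described: a convolutional front end over a log-mel spectrogram followed by a recurrent back end and a softmax classifier. The layer sizes, the choice of an LSTM as the RNN, the log-mel input representation, and the seven-way output are illustrative assumptions, not the paper's exact design (the reported per-class results are gender-specific, so the actual label set may split each emotion by gender).

    # Minimal sketch of a CNN-RNN hybrid for SER; all sizes are assumptions.
    import torch
    import torch.nn as nn

    class CNNRNNEmotionNet(nn.Module):
        def __init__(self, n_mels: int = 64, n_classes: int = 7, hidden: int = 128):
            super().__init__()
            # CNN front end: learns local time-frequency patterns.
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1),
                nn.BatchNorm2d(32),
                nn.ReLU(),
                nn.MaxPool2d(2),  # halves both the time and mel axes
                nn.Conv2d(32, 64, kernel_size=3, padding=1),
                nn.BatchNorm2d(64),
                nn.ReLU(),
                nn.MaxPool2d(2),
            )
            # RNN back end: models how the CNN features evolve over time.
            self.rnn = nn.LSTM(input_size=64 * (n_mels // 4), hidden_size=hidden,
                               batch_first=True, bidirectional=True)
            self.classifier = nn.Linear(2 * hidden, n_classes)

        def forward(self, x):
            # x: (batch, 1, n_mels, time) log-mel spectrogram
            f = self.cnn(x)                       # (batch, 64, n_mels//4, time//4)
            f = f.permute(0, 3, 1, 2).flatten(2)  # (batch, time//4, 64 * n_mels//4)
            out, _ = self.rnn(f)                  # (batch, time//4, 2 * hidden)
            return self.classifier(out[:, -1])    # classify from the final time step

    # Usage: a batch of 8 spectrograms, 64 mel bands x 128 frames -> (8, 7) logits.
    logits = CNNRNNEmotionNet()(torch.randn(8, 1, 64, 128))

The division of labor is the point of the hybrid: the CNN compresses each short time-frequency neighborhood into a feature vector, and the RNN aggregates those vectors across the utterance before classification.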