Hybrid Multi-Representation CNN–LSTM Framework with Adaptive Fusion for Speech Emotion Recognition
S. GURU PRASAD
M.Tech, Department of Computer Science and Engineering,
Vemu Institute of Technology,
P. Kothakota, Chittoor District, Andhra Pradesh-517112, India
Email Id: sguruprrasad@gmail.com
Ms. M. SREEVANI
Assistant Professor, M.Tech, Dept. of CSE,
Vemu Institute of Technology, P. Kothakota.
Email Id: vani.cse183@gmail.com
Abstract - Speech emotion recognition (SER) is a central problem in affective computing, enabling the automatic identification of human affective states from the voice. Reliable emotion classification nevertheless remains challenging, owing to variability in speaker identity, linguistic content, recording conditions, and prosody. To address these challenges, this study presents a hybrid, multi-representation deep-learning framework that combines complementary information from raw temporal waveforms and spectro-temporal acoustic descriptors. The proposed architecture is a dual-branch network. In the first branch, a one-dimensional convolutional neural network (1D-CNN) operates directly on raw speech waveforms, capturing their inherent temporal dynamics. In parallel, a second branch applies a two-dimensional convolutional neural network (2D-CNN) to log-Mel spectrograms together with MFCC-delta features, learning salient spectral patterns. The two branches are combined by an adaptive asymmetric fusion gate that dynamically balances the contribution of each representation. The fused feature representation is then processed by a bidirectional long short-term memory (BiLSTM) module with multi-head self-attention, a configuration designed to model long-range dependencies in the speech signal. Comparative experiments on the RAVDESS, EMO-DB, CREMA-D, and IEMOCAP datasets show improvements over baseline methods, with weighted accuracies of 93.7%, 92.1%, 88.4%, and 74.6%, respectively. These findings support the robustness and generality of the proposed hybrid framework across diverse affective-speech recognition tasks.
Keywords - Speech emotion recognition; CNN–LSTM; multi-representation learning; self-attention; affective computing.
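To make the dual-branch design described in the abstract concrete, the following is a minimal PyTorch sketch of the overall data flow. All layer sizes, kernel widths, the temporal length T=100, the 2-channel log-Mel/MFCC-delta stacking, and the sigmoid gating form are illustrative assumptions not specified in the abstract; in particular, the sigmoid-gated convex mix below is one plausible reading of the "adaptive asymmetric fusion gate", not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class HybridSER(nn.Module):
    """Illustrative dual-branch SER model: a 1D-CNN on raw waveforms,
    a 2D-CNN on stacked log-Mel/MFCC-delta maps, a learned fusion gate,
    and a BiLSTM followed by multi-head self-attention."""
    def __init__(self, n_classes=8, d=128):
        super().__init__()
        # Branch 1: 1D-CNN over the raw waveform, input (B, 1, samples)
        self.wave_cnn = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=80, stride=16), nn.ReLU(),
            nn.Conv1d(32, d, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(100),          # fixed temporal length T=100 (assumed)
        )
        # Branch 2: 2D-CNN over a 2-channel feature map (log-Mel + MFCC-delta),
        # input (B, 2, n_mels, frames); the channel stacking is an assumption
        self.spec_cnn = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, d, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 100)),     # collapse frequency axis, keep T=100
        )
        # Fusion gate: per-time-step sigmoid weights computed from both branches
        self.gate = nn.Sequential(nn.Linear(2 * d, d), nn.Sigmoid())
        self.lstm = nn.LSTM(d, d, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * d, num_heads=4, batch_first=True)
        self.head = nn.Linear(2 * d, n_classes)

    def forward(self, wave, spec):
        w = self.wave_cnn(wave).transpose(1, 2)              # (B, T, d)
        s = self.spec_cnn(spec).squeeze(2).transpose(1, 2)   # (B, T, d)
        g = self.gate(torch.cat([w, s], dim=-1))             # (B, T, d) gate values
        fused = g * w + (1 - g) * s                          # asymmetric gated mix
        h, _ = self.lstm(fused)                              # (B, T, 2d)
        a, _ = self.attn(h, h, h)                            # self-attention over time
        return self.head(a.mean(dim=1))                      # utterance-level logits

# Shape check with dummy inputs: 1 s of 16 kHz audio and a 64x200 feature map
model = HybridSER()
logits = model(torch.randn(4, 1, 16000), torch.randn(4, 2, 64, 200))
print(logits.shape)  # torch.Size([4, 8])
```

Because the gate weights one branch by g and the other by (1 - g) at every time step, the two representations are not forced to contribute equally, which is the intuition behind an asymmetric fusion scheme.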