Hybrid Multi-Representation CNN–LSTM Framework with Adaptive Fusion for Speech Emotion Recognition
S. GURU PRASAD
M.Tech, Department of Computer Science and Engineering,
Vemu Institute of Technology,
P. Kothakota, Chittoor District, Andhra Pradesh-517112, India
Email Id: sguruprrasad@gmail.com
Ms. M. SREEVANI
Assistant Professor, M.Tech, Dept. of CSE,
Vemu Institute of Technology, P. Kothakota.
Email Id: vani.cse183@gmail.com
Abstract - Speech emotion recognition (SER) is a central problem in affective computing, enabling the automatic identification of human affective states from the voice. Reliable emotion classification nevertheless remains challenging, owing to variability in speaker identity, linguistic content, recording conditions, and prosody. To address these challenges, this study presents a hybrid, multi-representation deep-learning framework that combines complementary information from raw temporal waveforms and spectro-temporal acoustic descriptors. The proposed architecture is a dual-branch network. In the first branch, a one-dimensional convolutional neural network (1D-CNN) operates directly on raw speech waveforms, capturing their inherent temporal dynamics. In parallel, a second branch applies a two-dimensional convolutional neural network (2D-CNN) to log-Mel spectrograms together with MFCC-delta features, learning salient spectral patterns. The two branches are combined by an adaptive asymmetric fusion gate that dynamically balances the contribution of each representation. The fused feature representation is then processed by a bidirectional long short-term memory (BiLSTM) module with multi-head self-attention, a configuration designed to model long-range dependencies in the speech signal. Comparative experiments on the RAVDESS, EMO-DB, CREMA-D, and IEMOCAP datasets show improvements over baseline methods, with weighted accuracies of 93.7%, 92.1%, 88.4%, and 74.6%, respectively. These findings support the robustness and generality of the proposed hybrid framework across diverse affective-speech recognition tasks.
Keywords - Speech emotion recognition; CNN–LSTM; multi-representation learning; self-attention; affective computing.
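To make the dual-branch design described in the abstract concrete, the following is a minimal PyTorch sketch of the overall data flow. All layer sizes, kernel widths, the temporal length T=100, the 2-channel log-Mel/MFCC-delta stacking, and the sigmoid gating form are illustrative assumptions not specified in the abstract; in particular, the sigmoid-gated convex mix below is one plausible reading of the "adaptive asymmetric fusion gate", not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class HybridSER(nn.Module):
    """Illustrative dual-branch SER model: a 1D-CNN on raw waveforms,
    a 2D-CNN on stacked log-Mel/MFCC-delta maps, a learned fusion gate,
    and a BiLSTM followed by multi-head self-attention."""
    def __init__(self, n_classes=8, d=128):
        super().__init__()
        # Branch 1: 1D-CNN over the raw waveform, input (B, 1, samples)
        self.wave_cnn = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=80, stride=16), nn.ReLU(),
            nn.Conv1d(32, d, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(100),          # fixed temporal length T=100 (assumed)
        )
        # Branch 2: 2D-CNN over a 2-channel feature map (log-Mel + MFCC-delta),
        # input (B, 2, n_mels, frames); the channel stacking is an assumption
        self.spec_cnn = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, d, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 100)),     # collapse frequency axis, keep T=100
        )
        # Fusion gate: per-time-step sigmoid weights computed from both branches
        self.gate = nn.Sequential(nn.Linear(2 * d, d), nn.Sigmoid())
        self.lstm = nn.LSTM(d, d, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * d, num_heads=4, batch_first=True)
        self.head = nn.Linear(2 * d, n_classes)

    def forward(self, wave, spec):
        w = self.wave_cnn(wave).transpose(1, 2)              # (B, T, d)
        s = self.spec_cnn(spec).squeeze(2).transpose(1, 2)   # (B, T, d)
        g = self.gate(torch.cat([w, s], dim=-1))             # (B, T, d) gate values
        fused = g * w + (1 - g) * s                          # asymmetric gated mix
        h, _ = self.lstm(fused)                              # (B, T, 2d)
        a, _ = self.attn(h, h, h)                            # self-attention over time
        return self.head(a.mean(dim=1))                      # utterance-level logits

# Shape check with dummy inputs: 1 s of 16 kHz audio and a 64x200 feature map
model = HybridSER()
logits = model(torch.randn(4, 1, 16000), torch.randn(4, 2, 64, 200))
print(logits.shape)  # torch.Size([4, 8])
```

Because the gate weights one branch by g and the other by (1 - g) at every time step, the two representations are not forced to contribute equally, which is the intuition behind an asymmetric fusion scheme.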