Abstract

Speech Emotion Recognition for multiclass classification using Hybrid CNN-LSTM


Emotions are biological states of the human nervous system that can be recorded in different signal forms, such as audio or electroencephalogram (EEG) signals. In this paper, cross-corpus emotion recognition is carried out on voice data, and a hybrid CNN–LSTM (Convolutional Neural Network–Long Short-Term Memory) model is proposed for gender-dependent emotion recognition. Three established corpora are considered: SAVEE, RAVDESS and TESS. For the cross-corpus experiments, three new corpora, referred to as mix corpora, are constructed by combining them: two gender-specific (male and female) and one gender-independent. Seven emotions (happiness, sadness, anger, fear, neutral, disgust and surprise) are identified in all the corpora. Data augmentation, in the form of added noise and pitch variation, is applied to the signals to reduce over-fitting and increase the robustness of the deep neural networks. Mel-Frequency Cepstral Coefficients (MFCCs) are extracted as features before the hybrid network is applied to each database. The experimental results show that the female corpus gives better accuracy than the male corpus.
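
The noise and pitch augmentation mentioned above can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names (add_noise, shift_pitch, augment), the 20 dB signal-to-noise ratio, the 2-semitone shift and the synthetic 220 Hz test tone are all assumptions chosen for illustration, and the pitch shift is a naive resampling that also changes the clip duration.

```python
import math
import random


def add_noise(signal, snr_db, rng):
    """Return a copy of `signal` with white Gaussian noise at roughly `snr_db` dB SNR."""
    power = sum(x * x for x in signal) / len(signal)
    noise_std = math.sqrt(power / (10 ** (snr_db / 10)))
    return [x + rng.gauss(0.0, noise_std) for x in signal]


def shift_pitch(signal, semitones):
    """Naive pitch shift by linear-interpolation resampling.

    Positive `semitones` raise the pitch but also shorten the clip;
    practical pipelines usually add time stretching to keep the duration.
    """
    rate = 2 ** (semitones / 12)
    last = len(signal) - 1
    out = []
    for i in range(int(last / rate) + 1):
        pos = i * rate
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1 - frac) * signal[lo] + frac * signal[hi])
    return out


def augment(signal, snr_db=20.0, semitones=2.0, seed=0):
    """Return the original clip plus a noisy copy and a pitch-shifted copy."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [signal, add_noise(signal, snr_db, rng), shift_pitch(signal, semitones)]


# Hypothetical input: 0.1 s of a 220 Hz tone sampled at 16 kHz.
SAMPLE_RATE = 16000
tone = [math.sin(2 * math.pi * 220 * n / SAMPLE_RATE) for n in range(1600)]
variants = augment(tone)
```

Each element of variants would then go through MFCC extraction in the same way as an original recording, so every utterance contributes three training examples under these assumptions.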




Keywords


Speech emotion recognition; CNN; LSTM; MFCC; Cross-corpus; Deep Learning