dc.description.abstract |
Emotion recognition refers to the ability of a system to identify and understand human emotions from input sources such as speech, text, motion, facial expressions, or physiological signals. In this domain, self-supervised learning (SSL) techniques can be used to train models, which is particularly useful when labeled emotion data are insufficient or costly to acquire. Recently, SSL techniques have been applied to audio-based emotion recognition using monolingually pre-trained models. These models have demonstrated encouraging results in comprehending and classifying emotions in speech within a single language. However, research on multilingual audio emotion recognition is noticeably lacking. Although multilingually pre-trained models such as wav2vec2-large-xlsr-53 and XLS-R-128 have shown promising results in multilingual automatic speech recognition (ASR), their application to speech classification tasks has received little attention. Applying self-supervised learning to multilingual audio emotion recognition could yield insightful results, as SSL has demonstrated efficacy in a range of natural language processing tasks. In this work, we propose two end-to-end emotion recognition models that can recognize emotions from speech signals across multiple languages without relying on huge amounts of annotated data. A multilingual dataset was formed for training and testing by combining five datasets spanning four languages, namely IEMOCAP, EMODB, EMOVO, BanglaSER, and RAVDESS. SSL models pre-trained on large multilingual corpora proved beneficial for this task, as they generate high-level speech representations for multilingual emotion recognition. Both proposed multilingual emotion recognition models outperformed all baseline SSL models. 
The proposed wav2vec2-large-xlsr-53-based model achieved an unweighted accuracy (UA) of 79.25% and a weighted accuracy (WA) of 77.82%, while the proposed XLS-R-128-based model achieved a UA of 75.74% and a WA of 74.17%. The wav2vec2-large-xlsr-53-based model improved UA by nearly 8.37% and WA by nearly 9.46% over the best baseline, a HuBERT Large-based multilingual emotion recognition model. It also classified the various emotion classes accurately, with high precision, recall, and F1 scores. Therefore, the application of multilingually pre-trained models shows promise for multilingual speech emotion recognition. |
en_US |