DSpace Repository

Multilingual audio based emotion recognition using self-supervised learning

dc.contributor.advisor Haque, Dr. Mohammad Ariful
dc.contributor.author Das, Auditi
dc.date.accessioned 2024-09-18T04:43:23Z
dc.date.available 2024-09-18T04:43:23Z
dc.date.issued 2024-03-05
dc.identifier.uri http://lib.buet.ac.bd:8080/xmlui/handle/123456789/6847
dc.description.abstract Emotion recognition refers to the ability of a system to identify and understand human emotions from input sources such as speech, text, motion, facial expressions, or physiological signals. In this domain, self-supervised learning (SSL) techniques can be used to train models, which is particularly useful when labeled emotion data are insufficient or costly to acquire. Recently, SSL techniques have been applied to audio-based emotion recognition using monolingually pre-trained models. These models have shown encouraging results in understanding and classifying emotions in speech within a single language. However, research on multilingual audio emotion recognition is noticeably lacking. Although multilingually pre-trained models such as wav2vec2-large-xlsr-53 and XLS-R-128 have shown promising results in multilingual automatic speech recognition (ASR), their application to speech classification tasks has received little attention. Applying self-supervised learning to multilingual audio emotion recognition could therefore yield insightful results, as SSL has demonstrated efficacy across a range of natural language processing tasks. In this work, we propose two end-to-end emotion recognition models that can recognize emotions from speech signals across multiple languages without relying on large amounts of annotated data. A multilingual dataset covering four languages was formed for training and testing by combining five corpora: IEMOCAP, EMODB, EMOVO, BanglaSER, and RAVDESS. SSL models pre-trained on large multilingual corpora proved beneficial for this task, as they generate high-level speech representations for multilingual emotion recognition. Both proposed multilingual emotion recognition models outperformed all the baseline SSL models. The proposed wav2vec2-large-xlsr-53 based model achieved an unweighted accuracy (UA) of 79.25% and a weighted accuracy (WA) of 77.82%, while the proposed XLS-R-128 based model achieved an unweighted accuracy of 75.74% and a weighted accuracy of 74.17%. The wav2vec2-large-xlsr-53 based model improved UA by 8.37% and WA by 9.46% over the best baseline, a HuBERT Large based multilingual emotion recognition model, and it also classified the various emotion classes accurately, with high precision, recall, and F1 scores. Therefore, the application of multilingually pre-trained models shows clear promise for multilingual audio emotion recognition. en_US
dc.language.iso en en_US
dc.publisher Department of Electrical and Electronic Engineering (EEE), BUET en_US
dc.subject Wavelets en_US
dc.title Multilingual audio based emotion recognition using self-supervised learning en_US
dc.type Thesis-MSc en_US
dc.contributor.id 0419062270 en_US
dc.identifier.accessionNumber 119728
dc.contributor.callno 623.82/DAS/2024 en_US


Files in this item

There are no files associated with this item.
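
The record lists no implementation files, so as an illustration of the approach the abstract describes, here is a minimal, hypothetical sketch in Python. It assumes the Hugging Face transformers and scikit-learn libraries; facebook/wav2vec2-large-xlsr-53 is the public release of the checkpoint the abstract cites, but the label set and all other details below are placeholder assumptions, not the thesis's actual configuration.

import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical emotion classes; the thesis's actual label set is not given here.
labels = ["angry", "happy", "neutral", "sad", "fear"]

# Load the multilingually pre-trained SSL encoder; a randomly initialized
# classification head is attached on top of its speech representations and
# would be fine-tuned on the combined multilingual emotion corpora.
ckpt = "facebook/wav2vec2-large-xlsr-53"
extractor = AutoFeatureExtractor.from_pretrained(ckpt)
model = Wav2Vec2ForSequenceClassification.from_pretrained(ckpt, num_labels=len(labels))

def predict(waveform_16k):
    """Classify one 16 kHz mono waveform (1-D float array or list)."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))

# The abstract's two metrics: unweighted accuracy (UA) is the macro-averaged
# per-class recall, so every emotion class counts equally regardless of size;
# weighted accuracy (WA) is plain overall accuracy, dominated by large classes.
def ua_wa(y_true, y_pred):
    ua = recall_score(y_true, y_pred, average="macro")
    wa = accuracy_score(y_true, y_pred)
    return ua, wa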
