DSpace Repository

Multilingual audio based emotion recognition using self-supervised learning

dc.contributor.advisor Haque, Dr. Mohammad Ariful
dc.contributor.author Das, Auditi
dc.date.accessioned 2024-09-18T04:43:23Z
dc.date.available 2024-09-18T04:43:23Z
dc.date.issued 2024-03-05
dc.identifier.uri http://lib.buet.ac.bd:8080/xmlui/handle/123456789/6847
dc.description.abstract Emotion recognition refers to the ability of a system to identify and understand human emotions from input sources such as speech, text, motion, facial expressions, or physiological signals. In this domain, self-supervised learning (SSL) techniques can be used to train models, which is particularly useful when labeled emotion data are insufficient or costly to acquire. Recently, SSL techniques have been applied to audio-based emotion recognition using monolingually pre-trained models. These models have shown encouraging results in understanding and classifying emotions in speech within a single language. However, research on multilingual audio emotion recognition is noticeably lacking. Although multilingually pre-trained models such as wav2vec2-large-xlsr-53 and XLS-R-128 have shown promising results in multilingual automatic speech recognition (ASR), their application to speech classification tasks has received little attention. Applying self-supervised learning to multilingual audio emotion recognition could therefore yield insightful results, as SSL has demonstrated efficacy across a range of natural language processing tasks. In this work, we propose two end-to-end emotion recognition models that can recognize emotions from speech signals across multiple languages without relying on large amounts of annotated data. A multilingual dataset covering four languages was formed for training and testing by combining five corpora: IEMOCAP, EMODB, EMOVO, BanglaSER, and RAVDESS. SSL models pre-trained on large multilingual corpora proved beneficial for this task, as they generate high-level speech representations for multilingual emotion recognition. Both proposed multilingual emotion recognition models outperformed all the baseline SSL models. The proposed wav2vec2-large-xlsr-53 based model achieved an unweighted accuracy (UA) of 79.25% and a weighted accuracy (WA) of 77.82%, while the proposed XLS-R-128 based model achieved an unweighted accuracy of 75.74% and a weighted accuracy of 74.17%. The wav2vec2-large-xlsr-53 based model improved UA by 8.37% and WA by 9.46% over the best baseline, a HuBERT Large based multilingual emotion recognition model, and it also classified the various emotion classes accurately, with high precision, recall, and F1 scores. Therefore, the application of multilingually pre-trained models shows clear promise for multilingual audio emotion recognition. en_US
dc.language.iso en en_US
dc.publisher Department of Electrical and Electronic Engineering (EEE), BUET en_US
dc.subject Wavelets en_US
dc.title Multilingual audio based emotion recognition using self-supervised learning en_US
dc.type Thesis-MSc en_US
dc.contributor.id 0419062270 en_US
dc.identifier.accessionNumber 119728
dc.contributor.callno 623.82/DAS/2024 en_US


Files in this item

There are no files associated with this item.
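
The record lists no implementation files, so as an illustration of the approach the abstract describes, here is a minimal, hypothetical sketch in Python. It assumes the Hugging Face transformers and scikit-learn libraries; facebook/wav2vec2-large-xlsr-53 is the public release of the checkpoint the abstract cites, but the label set and all other details below are placeholder assumptions, not the thesis's actual configuration.

import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical emotion classes; the thesis's actual label set is not given here.
labels = ["angry", "happy", "neutral", "sad", "fear"]

# Load the multilingually pre-trained SSL encoder; a randomly initialized
# classification head is attached on top of its speech representations and
# would be fine-tuned on the combined multilingual emotion corpora.
ckpt = "facebook/wav2vec2-large-xlsr-53"
extractor = AutoFeatureExtractor.from_pretrained(ckpt)
model = Wav2Vec2ForSequenceClassification.from_pretrained(ckpt, num_labels=len(labels))

def predict(waveform_16k):
    """Classify one 16 kHz mono waveform (1-D float array or list)."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))

# The abstract's two metrics: unweighted accuracy (UA) is the macro-averaged
# per-class recall, so every emotion class counts equally regardless of size;
# weighted accuracy (WA) is plain overall accuracy, dominated by large classes.
def ua_wa(y_true, y_pred):
    ua = recall_score(y_true, y_pred, average="macro")
    wa = accuracy_score(y_true, y_pred)
    return ua, wa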
