Abstract:
Anger detection from conversations has many real-life applications, including improving interpersonal communication, providing better customer service, and enhancing workplace performance. Despite these applications across a variety of domains, anger remains one of the least studied basic human emotions. Existing work on anger detection mostly deals with audio-only data, even though text transcriptions can be obtained directly from spoken conversations. In this thesis, we propose novel deep learning-based approaches for offline and online anger detection from audio-textual data obtained from real-life conversations. Offline anger detection identifies anger in a pre-collected audio-textual conversation, whereas online anger detection predicts anger in the subsequent utterances of a conversation from the previous ones.
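To make the distinction between the two settings concrete, the following minimal Python sketch contrasts their input-output formulations. All function and type names here are hypothetical placeholders for illustration, not the thesis implementation.

```python
# Minimal sketch contrasting the offline and online task settings.
# All names are hypothetical illustrations, not the thesis models.
from typing import List, Tuple

# One conversational turn: (text transcript, acoustic feature vector).
Utterance = Tuple[str, List[float]]

def classify_utterance(utt: Utterance) -> bool:
    """Placeholder per-utterance anger classifier (a keyword heuristic
    standing in for the actual audio-textual model)."""
    text, _features = utt
    return "angry" in text.lower()

def detect_anger_offline(conversation: List[Utterance]) -> List[bool]:
    """Offline: the whole pre-collected conversation is available,
    and every utterance receives an anger label."""
    return [classify_utterance(u) for u in conversation]

def predict_anger_online(history: List[Utterance], n_future: int) -> List[bool]:
    """Online: only utterances seen so far are available; predict whether
    each of the next n_future utterances will be angry. A trivial
    stand-in here propagates the last observed label forward."""
    last = classify_utterance(history[-1]) if history else False
    return [last] * n_future
```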
For offline anger detection, we introduce an ensemble approach that combines handcrafted acoustic features, SincNet-based raw waveform features, and BERT-based textual features in a mid-level fusion scheme within an attention-based CNN architecture. The model also includes a gender classifier to incorporate gender information into offline anger detection. For online anger detection, we propose a transformer-based technique that combines audio and textual features in a mid-level fusion scheme and employs an ensemble-based downstream classifier. We demonstrate the efficacy of our proposed approaches on two data sets: the Bengali call-center data set and the IEMOCAP data set. Experimental results show that our approaches outperform state-of-the-art baselines by a significant margin. For offline anger detection, our model achieves an F1 score of 85.5% on the Bengali call-center data set and 91.4% on the IEMOCAP data set. For online anger detection, our model yields an F1 score of 66.9% on the Bengali call-center data set and 67.7% on the IEMOCAP data set. Additionally, we vary utterance parameters, such as the numbers of input and output utterances, and study their effect on anger detection performance.
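The following PyTorch sketch illustrates the mid-level fusion idea shared by both approaches: modality-specific encoders produce intermediate embeddings that are fused via attention before classification. The layer sizes, class name, and attention pooling shown are assumptions made for illustration, not the architecture described in the thesis.

```python
# Minimal sketch of mid-level fusion of audio and text embeddings.
# Dimensions and names are hypothetical, not the thesis architecture.
import torch
import torch.nn as nn

class MidLevelFusionClassifier(nn.Module):
    def __init__(self, acoustic_dim=88, text_dim=768, hidden=128):
        super().__init__()
        # Project each modality into a shared intermediate space.
        self.acoustic_proj = nn.Sequential(nn.Linear(acoustic_dim, hidden), nn.ReLU())
        self.text_proj = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        # Scalar attention score per modality embedding.
        self.attn = nn.Linear(hidden, 1)
        self.head = nn.Linear(hidden, 2)  # angry vs. not angry

    def forward(self, acoustic_feats, text_feats):
        # Stack the projected modality embeddings: (batch, 2, hidden).
        mods = torch.stack(
            [self.acoustic_proj(acoustic_feats), self.text_proj(text_feats)], dim=1
        )
        # Attention-weighted pooling across modalities (mid-level fusion).
        weights = torch.softmax(self.attn(mods), dim=1)  # (batch, 2, 1)
        fused = (weights * mods).sum(dim=1)              # (batch, hidden)
        return self.head(fused)

# Usage with dummy per-utterance inputs, e.g. handcrafted acoustic
# features and a BERT sentence embedding:
model = MidLevelFusionClassifier()
logits = model(torch.randn(4, 88), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 2])
```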