Bangla text sentiment analysis based on extended lexicon dictionary using supervised machine learning and deep learning algorithms

dc.contributor.advisor	Mondal, Dr. Md. Rubaiyat Hossain
dc.contributor.author	Bhowmik, Nitish Ranjan
dc.date.accessioned	2024-01-20T08:33:52Z
dc.date.available	2024-01-20T08:33:52Z
dc.date.issued	2022-02-19
dc.identifier.uri	http://lib.buet.ac.bd:8080/xmlui/handle/123456789/6561
dc.description.abstract	With the proliferation of social digital content on the Internet, sentiment analysis (SA) has gained wide research interest in natural language processing (NLP). Little significant research has been done in the Bangla language domain because of the intricate grammatical structures in the text. This paper focuses on SA in the context of the Bangla language. Firstly, a specific domain-based categorical weighted lexicon data dictionary (LDD) is developed to analyze Bangla text sentiments. This LDD is developed by applying the concepts of normalization, tokenization, and stemming to two Bangla datasets available in the GitHub repository. Secondly, a novel rule-based algorithm termed Bangla Text Sentiment Score (BTSC) is developed to detect sentence polarity. This algorithm considers part-of-speech tagged words and special characters to generate a word score and extract polarity from a sentence and a blog. The BTSC algorithm, with the help of the LDD, is applied to extract sentiments by generating scores for the two Bangla datasets. Thirdly, two feature matrices are developed by applying the term frequency-inverse document frequency (tf-idf) to the two datasets and the corresponding BTSC scores. Next, supervised machine learning classifiers are applied to the feature matrices. In the deep learning part, these polarities are then fed into the hybrid neural network, together with the preprocessed text, as training samples. The preprocessed texts are vectorized, with each word mapped to a unique number from pre-trained word embedding models. A Word2Vec matrix with the highest-probability words is applied to the embedding layer as a weight matrix to fit the deep learning (DL) models.
This paper also presents a detailed analysis of selected DL models with fine-tuning. The fine-tuning includes the use of dropout, optimizer regularization, learning rate, multiple layers, filters, attention mechanism, capsule layers, and transformer with progressive training, along with validation and testing accuracy, precision, recall and F1-score. Experimental results indicate that the proposed new long short-term memory (LSTM) models are highly accurate in performing SA tasks. Experimental results corroborate our theoretical claim and show the efficiency of our proposed approach in both machine learning and deep learning. Results show that for the case of the BiGram feature, support vector machine (SVM) achieves the best classification accuracy of 82.21%. For our proposed hierarchical attention-based LSTM (HAN-LSTM), dynamic routing based capsule neural network with Bi-LSTM (D-CAPSNET-Bi-LSTM) and bidirectional encoder representations from Transformers (BERT) with LSTM (BERT-LSTM) models, we achieved accuracy values of 78.52%, 80.82% and 84.18%, respectively.	en_US
dc.language.iso	en	en_US
dc.publisher	Institute of Information and Communication Technology (IICT)	en_US
dc.subject	Natural language processing (Computer science)	en_US
dc.title	Bangla text sentiment analysis based on extended lexicon dictionary using supervised machine learning and deep learning algorithms	en_US
dc.type	Thesis-MSc	en_US
dc.contributor.id	1017312022	en_US
dc.identifier.accessionNumber	119243
dc.contributor.callno	005.45/BHO/2022	en_US
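
Illustrative note: the abstract mentions tf-idf feature matrices and a BiGram feature for the supervised classifiers. The sketch below shows one way such a matrix could be built over word bigrams, using only the Python standard library. It is not the thesis implementation: the function names `bigrams` and `tfidf_matrix`, the whitespace tokenization, the smoothed idf formula, and the toy English sentences are all assumptions made for illustration, and the BTSC score column described in the abstract is left out.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Return adjacent token pairs, each joined by a single space."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def tfidf_matrix(docs):
    """Build a dense tf-idf matrix over word bigrams.

    tf is the raw bigram count in a document. idf uses the smoothed
    form log((1 + n) / (1 + df)) + 1, one common variant; this choice
    is an assumption, since the abstract does not state the formula.
    """
    # Whitespace tokenization is a placeholder for the normalization,
    # tokenization, and stemming steps named in the abstract.
    grams = [Counter(bigrams(doc.split())) for doc in docs]
    vocab = sorted(set().union(*grams))
    n = len(docs)
    # Document frequency: number of documents containing each bigram.
    df = {g: sum(1 for c in grams if g in c) for g in vocab}
    idf = {g: math.log((1 + n) / (1 + df[g])) + 1 for g in vocab}
    rows = [[c[g] * idf[g] for g in vocab] for c in grams]
    return vocab, rows


# Toy placeholder sentences (hypothetical, not taken from the thesis datasets).
vocab, rows = tfidf_matrix(["good movie indeed", "bad movie indeed"])
for gram, weight in zip(vocab, rows[0]):
    print(f"{gram}: {weight:.3f}")
```

Each row of `rows` is one document's feature vector; a bigram shared by every document receives the minimum idf of 1, while rarer bigrams are weighted higher. A classifier such as an SVM would then be trained on these rows together with the sentiment labels.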