Abstract:
With the proliferation of social digital content on the Internet, sentiment analysis (SA) has gained wide research interest in natural language processing (NLP). Little significant research has been done in the Bangla language domain because of the intricate grammatical structures in its text. This paper focuses on SA in the context of the Bangla language. Firstly, a domain-specific, categorical, weighted lexicon data dictionary (LDD) is developed to analyze Bangla text sentiments. This LDD is developed by applying normalization, tokenization, and stemming to two Bangla datasets available in a GitHub repository. Secondly, a novel rule-based algorithm termed Bangla Text Sentiment Score (BTSC) is developed to detect sentence polarity. This algorithm considers part-of-speech-tagged words and special characters to generate a word score and extract polarity from a sentence and a blog. The BTSC algorithm, with the help of the LDD, is applied to extract sentiments by generating scores for the two Bangla datasets. Thirdly, two feature matrices are developed by applying term frequency-inverse document frequency (tf-idf) to the two datasets and the corresponding BTSC scores. Next, supervised machine learning classifiers are applied to the feature matrices. In the deep learning (DL) part, these polarities, together with the preprocessed text, are fed into hybrid neural networks as training samples. The preprocessed texts are vectorized by mapping each word to its unique index in a pre-trained word embedding model. A Word2Vec matrix of the highest-probability words is applied to the embedding layer as a weight matrix to fit the DL models. This paper also presents a detailed analysis of selected DL models with fine-tuning. The fine-tuning includes the use of dropout, optimizer regularization, learning rate, multiple layers, filters, attention mechanisms, capsule layers, and transformers
with progressive training, along with validation and testing accuracy, precision, recall, and F1-score. Experimental results indicate that the proposed long short-term memory (LSTM) models are highly accurate in performing SA tasks. The results corroborate our theoretical claims and show the efficiency of the proposed approach in both the machine learning and deep learning settings. For the bigram feature, the support vector machine (SVM) achieves the best classification accuracy of 82.21%. For our proposed hierarchical attention-based LSTM (HAN-LSTM), dynamic-routing-based capsule neural network with Bi-LSTM (D-CAPSNET-Bi-LSTM), and bidirectional encoder representations from Transformers (BERT) with LSTM (BERT-LSTM) models, we achieved accuracy values of 78.52%, 80.82%, and 84.18%, respectively.
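
To make the lexicon-scoring and tf-idf steps described above more concrete, the following is a minimal Python sketch under stated assumptions. It is not the BTSC algorithm or the LDD from this work: the toy lexicon, its weights, the romanized placeholder tokens, and the function names (`lexicon_score`, `polarity`, `bigrams`, `tfidf`) are all hypothetical, and the part-of-speech and special-character rules are omitted. It only illustrates the general idea of summing weighted lexicon entries to obtain a polarity and of weighting terms by tf-idf.

```python
import math
from collections import Counter

# Hypothetical toy lexicon (romanized placeholders, NOT the paper's LDD).
# Weights are illustrative assumptions only.
TOY_LEXICON = {
    "bhalo": 1.0,     # "good"
    "sundor": 0.8,    # "beautiful"
    "kharap": -1.0,   # "bad"
    "baje": -0.8,     # "awful"
}


def tokenize(text):
    """Split on whitespace after replacing basic punctuation with spaces."""
    for mark in "।,!?.":
        text = text.replace(mark, " ")
    return text.split()


def lexicon_score(text, lexicon=TOY_LEXICON):
    """Sum the lexicon weights of all tokens; unknown tokens contribute 0."""
    return sum(lexicon.get(tok, 0.0) for tok in tokenize(text))


def polarity(score):
    """Map a numeric score to a polarity label by its sign."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def bigrams(tokens):
    """Return adjacent token pairs joined by a space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses the plain textbook form tf * log(N / df); library implementations
    typically add smoothing and normalization on top of this.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return rows
```

For example, `polarity(lexicon_score("khabar bhalo"))` yields a positive label with this toy lexicon, and `tfidf([bigrams(tokenize(d)) for d in corpus])` would give bigram-level weights for a hypothetical list of strings `corpus`. A real pipeline would append the sentiment score to each feature row before passing the matrix to a classifier such as an SVM.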