Abstract:
Though there has been a large body of recent work on language modeling for high-resource languages such as English and Chinese, the area remains largely unexplored for low-resource languages like Bangla and Hindi. We propose an end-to-end trainable, memory-efficient convolutional neural network (CNN) architecture, CoCNN, to handle specific characteristics of Bangla and Hindi such as high inflection, morphological richness, flexible word order, and phonetic spelling errors. In particular, we introduce two learnable convolutional sub-models at the word and sentence levels. We show that state-of-the-art Transformer models do not necessarily yield the best performance for Bangla and Hindi. CoCNN outperforms pretrained BERT with 16X fewer parameters and 10X less training time, while achieving much better performance than state-of-the-art long short-term memory (LSTM) models on multiple real-world datasets. The word-level CNN sub-model of CoCNN, SemanticNet, has shown its potential as an effective Bangla spell checker. We explore this potential and develop a state-of-the-art Bangla spell checker. Bangla typing is mostly performed using an English keyboard and can be highly erroneous due to the presence of compound and similarly pronounced letters. Correcting a misspelled word requires understanding both the word's typing pattern and the context in which the word is used. We propose a specialized BERT model, BSpell, targeted at word-for-word correction at the sentence level. BSpell incorporates the SemanticNet CNN sub-model, motivated by CoCNN, along with a specialized auxiliary loss. This allows BSpell to specialize in the highly inflected Bangla vocabulary in the presence of spelling errors. We further propose a hybrid pretraining scheme for BSpell that combines word-level and character-level masking. Utilizing this pretraining scheme, BSpell achieves 91.5% accuracy on a real-life Bangla spelling-correction validation set.
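The hybrid pretraining scheme combines word-level masking (hiding an entire word) with character-level masking (corrupting a single character, mimicking a typing error). A minimal sketch of such a masking step is shown below; the probabilities, mask tokens, and the `hybrid_mask` helper are illustrative assumptions, not details taken from the paper.

```python
import random

def hybrid_mask(words, word_mask_p=0.10, char_mask_p=0.10,
                word_mask_token="[MASK]", char_mask_token="#", rng=None):
    """Apply hybrid word/character masking to a tokenized sentence.

    Each word is independently either replaced wholesale by word_mask_token
    (word-level masking), has one random character replaced by
    char_mask_token (character-level masking), or left unchanged.
    """
    rng = rng or random.Random()
    masked = []
    for w in words:
        r = rng.random()
        if r < word_mask_p:
            # Word-level mask: the model must predict the word from context.
            masked.append(word_mask_token)
        elif r < word_mask_p + char_mask_p and w:
            # Character-level mask: the model must recover the word from a
            # corrupted surface form, as in spelling correction.
            i = rng.randrange(len(w))
            masked.append(w[:i] + char_mask_token + w[i + 1:])
        else:
            masked.append(w)
    return masked
```

The pretraining target in both cases would be the original word, so character-level masking trains the model on the same noisy-input-to-clean-word mapping that spelling correction requires.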