dc.description.abstract |
Recent advances in deep learning have aided the development of neural language models that have achieved state-of-the-art results on many natural language processing (NLP) tasks. Conditional text generation, a major subfield of NLP, has particularly benefited from neural sequence-to-sequence (seq2seq) models, which generate an output text sequence conditioned on a given input text sequence. These seq2seq models, however, come with a major caveat: they are heavily data-driven, i.e., a large number of training samples must be fed into them to train them effectively, and the absence of such data can degrade their performance substantially. This has limited the applicability of these models to languages for which large datasets are available, i.e., high-resource languages. As a result, low-resource languages (e.g., Bengali) often fail to reap the benefits of these models and trail significantly in performance compared to high-resource ones. Even in multilingual language models, which are trained on hundreds of languages, low-resource languages remain underrepresented, as they are rarely the primary focus of these models. These effects have cascaded and kept major NLG applications (e.g., machine translation, text summarization) from reaching under-served low-resource communities. In this work, we explore two major conditional text generation problems, machine translation and abstractive text summarization, from a low-resource and multilingual perspective. We improve the sentence segmentation algorithm for Bengali and propose two novel alignment techniques with effective algorithms for parallel corpus creation for machine translation under low-resource scenarios. Alongside, we create a large parallel training corpus and establish reliable evaluation benchmarks for Bengali-English machine translation as a representative low-resource language pair. 
Furthermore, for the first time, we introduce a set of novel automatic annotation techniques and curate a large-scale multilingual dataset for abstractive text summarization. We benchmark the multilingual summarization task using a new multilingual metric for evaluating model-generated summaries, and show the superiority of multilingual training over back-translation-based and monolingual summarization. |
en_US |