Abstract:
Voice command recognition commonly involves an Automatic Speech Recognition (ASR) system with context-specific optimization. Developing an ASR system requires building corpus resources such as a phoneme list, text corpus, word dictionary, phonetic dictionary, and speech corpus, which are then used to train speech recognition models. The performance of a speech recognition system can be further improved by exploiting user- and device-specific context. For a smartphone user, this context includes contact names, installed apps, songs, media files, location, recent search history, the content of the screen the user is looking at, etc. Because context information changes frequently, the contextual model should be updated on the fly on the device. Traditional speech recognition systems usually consist of several individual components, such as an acoustic model, a language model, and a pronunciation dictionary, so context-specific optimization can be achieved by tuning a particular component such as the language model. Recently, end-to-end speech recognition architectures have proven very effective in many speech recognition tasks. Incorporating context-specific optimization into these end-to-end architectures requires a different approach. In this work, we focus on Bangla voice command recognition. We develop an ASR system for voice command recognition and further improve its performance using context-specific optimization. We develop each linguistic resource to account for the language-specific characteristics of Bangla, and we enrich our speech corpus with both domain-specific and domain-independent speech data. We also experiment with both traditional and end-to-end speech recognition architectures. We propose a novel approach for context-specific optimization of voice commands.
We also explore several other approaches for improving ASR performance, such as synthetic speech corpus development and semi-supervised speech recognition.