Abstract:
This thesis presents a high-quality, end-to-end multi-speaker text-to-speech (TTS) system for Bangla, a language spoken by millions yet lacking open-source, high-quality speech resources. TTS systems have broad applications, including virtual assistants, audiobooks, dubbing, and accessibility tools. Despite Bangla's large speaker base, its representation in modern open-source speech synthesis remains limited. Motivated by this gap, and by the lack of accessible tooling for building contemporary TTS systems in Bangla, this work presents a curated speech dataset, tentatively named Bani, compiled from publicly available corpora and community-driven projects. A key contribution is a remastering pipeline that applies deep learning-based denoising and enhancement, substantially improving audio quality for TTS training.
Bani served as the foundation for training single-speaker and multi-speaker TTS models. The architecture is based on the Variational Inference with Adversarial Learning for end-to-end Text-to-Speech (VITS) model, with two core modifications introduced in this work: explicit duration modeling and the integration of a pretrained speaker embedding model jointly trained with the system. These changes aimed to improve convergence, speaker similarity, and the naturalness of synthesized speech.
Evaluation combined objective metrics, namely Mel-Cepstral Distortion (MCD), transcription error rates, and speaker similarity, with subjective Mean Opinion Score (MOS) tests conducted with native Bangla speakers (1 = poor, 5 = excellent). The modified multi-speaker model achieved a MOS of 3.64 ± 0.48, surpassing both the baseline (3.46 ± 0.50) and single-speaker (3.10 ± 0.61) models. Objective scores showed a 10% reduction in MCD and a 9.5% improvement in speaker similarity. For comparison, a commercial Google Bangla TTS system scored 4.12 ± 0.33. These results show that both audio remastering and the architectural changes significantly enhance perceived and measured synthesis quality, while explicit duration modeling improves training efficiency without sacrificing fidelity.
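For reference, the MCD metric reported above is commonly computed from mel-cepstral coefficients of time-aligned reference and synthesized frames; one standard convention (among several used in the literature) is:

$$\mathrm{MCD} = \frac{10\sqrt{2}}{\ln 10} \cdot \frac{1}{T} \sum_{t=1}^{T} \sqrt{\sum_{d=1}^{D} \left(c_{t,d} - \hat{c}_{t,d}\right)^{2}}$$

where $c_{t,d}$ and $\hat{c}_{t,d}$ are the $d$-th mel-cepstral coefficients of frame $t$ for the reference and synthesized speech respectively, $T$ is the number of aligned frames, and the 0th (energy) coefficient is typically excluded; lower values indicate closer spectral match.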