DSpace Repository

Multi-speaker end-to-end text to speech synthesis for low resource languages


dc.contributor.advisor Ariful Haque, Dr. Mohammad
dc.contributor.author Shahruk Hossain
dc.date.accessioned 2025-12-09T06:59:44Z
dc.date.available 2025-12-09T06:59:44Z
dc.date.issued 2025-04-15
dc.identifier.uri http://lib.buet.ac.bd:8080/xmlui/handle/123456789/7223
dc.description.abstract This thesis presents a high-quality, end-to-end multi-speaker text-to-speech (TTS) system for Bangla - a language spoken by millions yet lacking open-source, high-quality speech resources. TTS systems have broad applications, including virtual assistants, audiobooks, dubbing, and accessibility tools. Despite Bangla’s large speaker base, its representation in modern open-source speech synthesis remains limited. Motivated by this gap, and by the lack of accessible tooling for building contemporary TTS systems in Bangla, this work presents a curated speech dataset, tentatively named Bani, compiled from publicly available corpora and community-driven projects. A key contribution is a remastering pipeline using deep learning-based denoising and enhancement, which substantially improves audio quality for TTS training. Bani served as the foundation for training single-speaker and multi-speaker TTS models. The architecture is based on the Variational Inference with Adversarial Learning for end-to-end TTS (VITS) model, with two core modifications introduced in this work: explicit duration modeling and the integration of a pretrained speaker embedding model jointly trained with the system. These changes aimed to improve convergence, speaker similarity, and the naturalness of synthesized speech. Evaluation combined objective metrics - Mel-Cepstral Distortion (MCD), transcription error rates, and speaker similarity - with subjective Mean Opinion Score (MOS) tests from native Bangla speakers (1 = poor, 5 = excellent). The modified multi-speaker model achieved a MOS of 3.64 ± 0.48, surpassing the baseline (3.46 ± 0.50) and single-speaker (3.10 ± 0.61) models. Objective scores showed a 10% drop in MCD and a 9.5% boost in speaker similarity. A commercial Google Bangla TTS system scored 4.12 ± 0.33.
These results show that both audio remastering and architectural changes significantly enhance perceived and measured synthesis quality, while explicit duration modeling improves training efficiency without sacrificing fidelity. en_US
dc.language.iso en en_US
dc.publisher Department of Electrical and Electronic Engineering (EEE), BUET en_US
dc.subject Speech synthesis en_US
dc.title Multi-speaker end-to-end text to speech synthesis for low resource languages en_US
dc.type Thesis-MSc en_US
dc.contributor.id 1018062228 en_US
dc.identifier.accessionNumber 120130
dc.contributor.callno 623.99/SHA/2025 en_US


Files in this item


There are no files associated with this item.

