Speaker Diarization from Bangla Conversation
Abstract
Speaker diarization is a fundamental task in speech processing that aims to identify
and segment different speakers within an audio recording. It involves determining
"who spoke when" in a given conversation or speech. Speaker diarization has various
applications, such as meeting transcription, speaker tracking in broadcast news, audio
indexing, and speaker profiling in forensics. It is particularly challenging for languages
with diverse phonetic characteristics, such as Bangla. In this study, we investigate
speaker diarization techniques tailored specifically for Bangla conversations. We
explore three feature extraction methods—Gammatonegram, Constant-Q Transform
(CQT), and Mel-Frequency Cepstral Coefficients (MFCC)—combined with Gaussian
Mixture Models (GMM) for clustering. We evaluate the systems using the Diarization
Error Rate (DER) and other metrics, and the results are promising. DER is a widely
used metric in the speaker diarization community for measuring the overall
performance of a diarization system. It takes into account missed speaker
errors, false alarm speaker errors, and speaker confusion errors. A lower DER
indicates better diarization performance, with a DER of 0% representing a perfect
diarization system. Among the approaches studied, the ANN+MFCC+GMM method
performs strongly, achieving a DER of 0.193 (19.3%) and an accuracy of
0.807. This indicates its effectiveness in accurately identifying speakers in Bangla
conversations. These findings underscore the potential of the proposed methods for
Bangla speaker diarization. Future research aims to refine techniques and address
Bangla-specific challenges, ultimately enhancing the accuracy and robustness of
speaker diarization systems for Bangla conversations.
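
The DER described above combines three error durations into a single ratio. The following is a minimal sketch of that calculation under the standard definition (the three error durations summed, divided by the total reference speech time); it is not taken from the thesis. The function name, argument names, and the example durations are illustrative assumptions, not reported results.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Return DER as a fraction, using the standard definition.

    All arguments are durations in seconds (hypothetical names):
      missed       - reference speech that the system labeled as non-speech
      false_alarm  - non-speech that the system labeled as speech
      confusion    - speech attributed to the wrong speaker
      total_speech - total duration of speech in the reference annotation
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Illustrative durations only; these are NOT figures from the study.
example_der = diarization_error_rate(
    missed=6.0, false_alarm=4.0, confusion=10.0, total_speech=100.0
)
print(f"DER = {example_der:.3f} ({example_der:.1%})")
```

Because false alarm time is counted against the reference speech duration, a DER computed this way can exceed 1 (100%) for a very poor system, while 0 corresponds to the perfect system mentioned in the abstract.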