Conference paper. Year: 2023

Can we use Common Voice to train a Multi-Speaker TTS system?

Abstract

Training of multi-speaker text-to-speech (TTS) systems relies on curated datasets based on high-quality recordings or audiobooks. Such datasets often lack speaker diversity and are expensive to collect. As an alternative, recent studies have leveraged the availability of large, crowdsourced automatic speech recognition (ASR) datasets. A major problem with such datasets is the presence of noisy and/or distorted samples, which degrade TTS quality. In this paper, we propose to automatically select high-quality training samples using a non-intrusive mean opinion score (MOS) estimator, WV-MOS. We show the viability of this approach for training a multi-speaker GlowTTS model on the Common Voice English dataset. Our approach improves the overall quality of generated utterances by 1.26 MOS points with respect to training on all the samples and by 0.35 MOS points with respect to training on the LibriTTS dataset. This opens the door to automatic TTS dataset curation for a wider range of languages.
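The selection step the abstract describes, keeping only samples whose estimated MOS clears a quality threshold, can be sketched as follows. This is a minimal illustration, not the authors' code: `scores` stands in for real WV-MOS predictions, and the 4.0 cutoff is an assumed illustrative value, not necessarily the threshold used in the paper.

```python
from typing import Dict, List

def select_samples(utterance_scores: Dict[str, float],
                   threshold: float = 4.0) -> List[str]:
    """Keep only utterances whose estimated MOS meets the threshold.

    `threshold` is an illustrative cutoff; the paper's actual value may differ.
    """
    return [utt for utt, mos in utterance_scores.items() if mos >= threshold]

# Dummy predicted scores standing in for real WV-MOS outputs.
scores = {"clip_a.wav": 4.3, "clip_b.wav": 2.1, "clip_c.wav": 4.8}
print(select_samples(scores))  # clips below the threshold are dropped
```

In practice the filtered list of file paths would then be used to build the TTS training manifest, replacing the full, noisy dataset.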
Main file

Can_we_use_Mozilla_Common_Voice_for_TTS_CC (1).pdf (213.02 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-03812715 , version 1 (13-10-2022)

Identifiers

  • HAL Id : hal-03812715 , version 1

Cite

Sewade Ogun, Vincent Colotte, Emmanuel Vincent. Can we use Common Voice to train a Multi-Speaker TTS system?. The 2022 IEEE Spoken Language Technology Workshop (SLT 2022), Jan 2023, Doha, Qatar. ⟨hal-03812715⟩
137 views
270 downloads
