Multi-latency look-ahead for streaming speaker segmentation

Bilal Rahou; Hervé Bredin

doi:10.21437/Interspeech.2024-923

Conference paper, Year: 2024

Bilal Rahou

Role: Author

Hervé Bredin

Role: Author
PersonId : 15856
IdHAL : hbredin
ORCID : 0000-0002-3739-925X
IdRef : 121165779

Équipe Structuration, Analyse et MOdélisation de documents Vidéo et Audio

Institut de recherche en informatique de Toulouse

Centre National de la Recherche Scientifique

Abstract

We address the task of streaming speaker diarization and propose several contributions to achieve a better trade-off between latency and accuracy. First, computational latency is reduced to its bare minimum by switching to a causal frame-wise speaker segmentation architecture. Then, a multi-latency look-ahead mechanism is used during training to support adaptive latency during inference at no additional computational cost. Finally, we detail the method used during inference to achieve the final frame-wise segmentation. We evaluate the impact of these contributions on the AMI meeting dataset with a focus on the speaker segmentation step, seen through the prism of voice activity detection, overlapped speech detection and speaker change detection.
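The abstract does not detail how the multi-latency training targets are built. As a purely illustrative sketch, and not the authors' implementation, one plausible reading is that a causal model emits, at frame t, a decision about frame t - k for each of several look-ahead latencies k, so that each latency gets its own delayed copy of the frame-wise labels. The function name `delayed_targets`, the `latencies` parameter, and the masking convention below are all assumptions introduced for this example.

```python
import numpy as np


def delayed_targets(labels, latencies):
    """Build one delayed copy of frame-wise labels per look-ahead latency.

    labels:    array of shape (num_frames, num_speakers), 0/1 speaker activity.
    latencies: look-ahead values in frames (hypothetical parameter).

    Returns (targets, mask) where targets[i, t] == labels[t - latencies[i]]
    and mask[i, t] is False for frames with no valid target (t < latencies[i]).
    """
    labels = np.asarray(labels)
    num_frames = labels.shape[0]
    targets = np.zeros((len(latencies),) + labels.shape, dtype=labels.dtype)
    mask = np.zeros((len(latencies), num_frames), dtype=bool)
    for i, k in enumerate(latencies):
        if k < 0:
            raise ValueError("latencies must be non-negative")
        k = min(k, num_frames)
        # The output at frame t is trained against the label of frame t - k.
        targets[i, k:] = labels[: num_frames - k]
        mask[i, k:] = True
    return targets, mask


# Toy example: 5 frames, 2 speakers, latencies of 0 and 2 frames.
labels = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1]])
targets, mask = delayed_targets(labels, [0, 2])
```

Under this assumption, a training loss would be computed per latency on the unmasked frames only, and at inference time the output matching the desired latency would be read from the same forward pass.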

Keywords

speaker diarization; speaker segmentation; low latency; lookahead

Domains

Signal and Image Processing [eess.SP]

Main file

rahou24_interspeech.pdf (402.31 KB)

Origin: Publisher files allowed on an open archive

https://hal.science/hal-04734819

Submitted on: Monday, October 14, 2024, 11:01:07

Last modified on: Tuesday, October 15, 2024, 03:21:52

Dates and versions

hal-04734819 , version 1 (14-10-2024)

Identifiers

HAL Id : hal-04734819 , version 1
DOI : 10.21437/Interspeech.2024-923

Cite

Bilal Rahou, Hervé Bredin. Multi-latency look-ahead for streaming speaker segmentation. Interspeech 2024, Sep 2024, Kos, Greece. pp.1610-1614, ⟨10.21437/Interspeech.2024-923⟩. ⟨hal-04734819⟩

Collections

UNIV-TLSE2 CNRS UT1-CAPITOLE GENCI IRIT IRIT-SAMOVA IRIT-SI TOULOUSE-INP UNIV-UT3 UT3-TOULOUSEINP

102 views

33 downloads
