IMPLEMENTATION OF THE SPEAKER IDENTIFICATION SYSTEM BASED ON THE PYANNOTE AUDIO PROCESSING LIBRARY

Authors

DOI:

https://doi.org/10.32782/IT/2022-2-1

Keywords:

diarization system, PyAnnote library, machine learning, clustering, audio analysis, speaker identification

Abstract

In the field of machine learning, one of the main areas is speech processing and recognition. One of the important tasks in working with audio data is diarization. Diarization determines the time boundaries in an audio recording that belong to individual speakers, that is, figuratively speaking, it solves the problem of «who speaks when?». However, known commercial and open-source diarization tools use segment clustering and do not answer the question «who exactly is speaking now?». There are systems that identify the speaker, but they assume that there is only one speaker in the audio recording. Therefore, a relevant task is to create a diarization system that allows the identification of multiple speakers who alternate arbitrarily within a recording. In this study, we propose two architectures of diarization-based speaker identification systems, which work on a per-segment and a per-cluster basis, respectively. To implement the system, we used the open-source PyAnnote library. The speaker identification system was evaluated on the open AMI Corpus audio database, which contains 100 hours of annotated and transcribed audio and video data. Various metrics for assessing diarization accuracy are considered and, taking into account the specifics of the developed system, the use of the identification F-measure is justified. The research methodology, which included three experiments, is described. The first experiment studies the architecture of the identification system based on per-segment analysis, the second studies the architecture that uses per-cluster analysis, and the third determines the optimal training sample duration for the classifiers of the identification system.
The experiments showed that the per-cluster approach yields better identification results than the per-segment approach. It was also found that the optimal duration of audio data for training the classifier for each individual speaker is 20 seconds.
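The difference between the two architectures can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the cluster labels, speaker names, and segment data are invented for illustration. In the per-segment architecture, each diarization segment keeps its own classifier prediction; in the per-cluster architecture, all segments of one diarization cluster are pooled and the majority identity is assigned to the whole cluster, which lets correctly classified segments outvote occasional errors.

```python
from collections import Counter

# Hypothetical diarization output: (cluster_label, per-segment classifier
# prediction) for each segment, in temporal order. All values are invented.
segments = [
    ("SPK_0", "alice"), ("SPK_0", "alice"), ("SPK_0", "bob"),  # one misclassified segment
    ("SPK_1", "bob"),   ("SPK_1", "bob"),
]

def per_segment_identities(segments):
    """Per-segment architecture: keep each segment's own prediction."""
    return [pred for _, pred in segments]

def per_cluster_identities(segments):
    """Per-cluster architecture: majority vote over each cluster's segments."""
    votes = {}
    for cluster, pred in segments:
        votes.setdefault(cluster, Counter())[pred] += 1
    winner = {c: v.most_common(1)[0][0] for c, v in votes.items()}
    return [winner[cluster] for cluster, _ in segments]

print(per_segment_identities(segments))  # the stray "bob" in SPK_0 survives
print(per_cluster_identities(segments))  # the majority vote corrects it
```

This aggregation step is one plausible reason the per-cluster approach scored higher in the experiments: a single misclassified segment no longer produces a wrong identity on its own.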
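The identification F-measure used for evaluation can be sketched in plain Python as the harmonic mean of time-weighted precision and recall, where speech time counts as correct when the hypothesized speaker identity matches the reference annotation. The interval format and the example values below are assumptions for illustration; in practice, the pyannote.metrics toolkit provides ready-made identification metrics.

```python
def overlap(a, b):
    """Length of the intersection of two (start, end) intervals, in seconds."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def identification_fmeasure(reference, hypothesis):
    """reference, hypothesis: lists of (start, end, speaker_id) tuples in seconds."""
    # Time where the hypothesized identity agrees with the reference.
    correct = sum(
        overlap((rs, re), (hs, he))
        for rs, re, r_spk in reference
        for hs, he, h_spk in hypothesis
        if r_spk == h_spk
    )
    ref_total = sum(re - rs for rs, re, _ in reference)
    hyp_total = sum(he - hs for hs, he, _ in hypothesis)
    precision = correct / hyp_total if hyp_total else 0.0
    recall = correct / ref_total if ref_total else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented example: the speaker boundary is placed 2 s late in the hypothesis.
ref = [(0.0, 10.0, "alice"), (10.0, 20.0, "bob")]
hyp = [(0.0, 12.0, "alice"), (12.0, 20.0, "bob")]
print(identification_fmeasure(ref, hyp))  # → 0.9
```

Unlike the diarization error rate, this metric is computed against named identities rather than anonymous cluster labels, which is why it fits an identification system better than pure diarization metrics.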


Published

2022-12-29