A x-vector based Speaker Recognition in Persian

Shahbakhti, Fatemeh; Moradi-Shabestari, Maryam; Ghasemi-Naraghi, Zeinab

doi:10.48301/jear.2024.471677.1034

Document Type : Original Article

Authors

Fatemeh Shahbakhti ¹

Maryam Moradi-Shabestari ²

Zeinab Ghasemi-Naraghi ³

¹ Department of Electrical and Computer Engineering, Faculty of Shariaty, Skill National University (NUS), Tehran, Iran

² Electrical and Computer Engineering Department, Tehran University, Tehran, Iran

³ Computer Engineering Department, AmirKabir University of Technology, Tehran, Iran

Abstract

In this paper, a text-independent speaker recognition system for Persian is implemented with deep neural networks. The x-vector technique, based on a Time Delay Neural Network (TDNN), is used to extract embeddings from speech signals. This method has attracted researchers' attention because of its noise robustness and high performance. Data augmentation and noise addition are used to improve system performance, and a PLDA classifier is used to recognize the speaker. Previous research on speaker recognition in Persian is limited. In this work, the network is trained on the Persian part of the CommonVoice dataset. According to the error analysis, the non-speech parts of an utterance decrease the accuracy of speaker recognition, so they are removed by a Convolutional Recurrent Deep Neural Network (CRDNN). The accuracy of speaker recognition and verification on CommonVoice is 95.24% and 95.56%, respectively. The Equal Error Rate (EER) of the speaker verification system is 4.72%. An attendance monitoring system was developed as one application of the speaker recognition system. For 12 and 15 seconds of collected data (from 16 women and 12 men), system accuracy is 98.92% and 100%, respectively.
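The abstract reports an Equal Error Rate but does not say how it was computed. As an illustration only, the following minimal Python sketch shows one common way to estimate the EER from verification scores: sweep a decision threshold over the observed scores and take the point where the false acceptance rate and the false rejection rate are closest. The function name `equal_error_rate` and the toy scores are assumptions made for this example and are not taken from the paper.

```python
from typing import Sequence, Tuple


def equal_error_rate(
    target_scores: Sequence[float],
    nontarget_scores: Sequence[float],
) -> Tuple[float, float]:
    """Estimate the EER and the threshold at which it occurs.

    target_scores: scores of trials where both utterances share a speaker.
    nontarget_scores: scores of trials with different speakers.
    A trial is accepted when its score is >= the threshold.
    (Hypothetical helper for illustration; not the authors' code.)
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")

    # Candidate thresholds: every distinct observed score, in ascending order.
    thresholds = sorted(set(target_scores) | set(nontarget_scores))

    best_gap = float("inf")
    best_eer = 1.0
    best_threshold = thresholds[0]
    for threshold in thresholds:
        # False acceptance rate: impostor trials that pass the threshold.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        # False rejection rate: genuine trials that fall below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap = gap
            best_eer = (far + frr) / 2.0
            best_threshold = threshold
    return best_eer, best_threshold


if __name__ == "__main__":
    # Toy scores, invented for the example.
    genuine = [0.9, 0.8, 0.7, 0.4]
    impostor = [0.6, 0.3, 0.2, 0.1]
    eer, thr = equal_error_rate(genuine, impostor)
    print(f"EER = {eer:.2%} at threshold {thr}")
```

With a small trial list the two error rates may never coincide exactly, so the sketch reports their average at the closest threshold; evaluation toolkits often interpolate between thresholds instead.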

Keywords

Deep neural networks

speaker recognition

x-vector

Persian Language

Subjects

Software Engineering
