COMPARATIVE ANALYSIS OF NEURAL NETWORK MODELS FOR THE PROBLEM OF SPEAKER RECOGNITION

Authors

DOI:

https://doi.org/10.30837/ITSSI.2023.24.172

Keywords:

comparative analysis; neural network; intellectual models; model; machine learning; speaker identification; speaker recognition

Abstract

The subject matter of the article is neural network models designed or adapted for voice analysis in the context of speaker identification and verification tasks. The goal of this work is to perform a comparative analysis of relevant neural network models in order to determine the model(s) that best meet the formulated criteria: model type, programming language of the model's implementation, parallelization potential, binary or multiclass classification, accuracy, and computational complexity. Some of these criteria, such as accuracy and computational complexity, were chosen for their universal importance regardless of the particular application; others were chosen due to the architecture and challenges of the scientific communication system described in the work, which performs the tasks of speaker identification and verification. The relevance of the paper lies in the prevalence of audio as a communication medium, which results in a wide range of practical applications of audio intelligence in various fields of human activity (business, law, the military), as well as in the necessity of enabling and encouraging an efficient environment for inward-facing audio-based scientific communication among young scientists, helping them accelerate their research and acquire scientific communication skills. To achieve the goal, the following tasks were solved: criteria for judging the models were formulated based on the needs and challenges of the proposed system; the models designed for speaker identification and verification were reviewed against the formulated criteria, with the results compiled into a comprehensive table; and the optimal models were determined in accordance with the formulated criteria. The following neural network based models have been reviewed: SincNet, VGGVox, Jasper, TitaNet, SpeakerNet, ECAPA-TDNN. Conclusions.
For future research and a practical solution to the problem of speaker authentication, it is reasonable to use a convolutional neural network implemented in the Python programming language, as it offers a wide variety of development tools and libraries.
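To illustrate the verification task the reviewed models address, the following minimal Python sketch (not taken from the paper; the embedding values and the 0.7 threshold are hypothetical) shows the scoring step that typically follows any of the reviewed networks: the model maps an utterance to a fixed-size speaker embedding, and verification reduces to comparing the cosine similarity of two embeddings against a decision threshold.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two speaker-embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(enrolled, test, threshold=0.7):
    """Accept the claimed identity if the two embeddings are close enough.

    In a real system both embeddings would be produced by a trained
    network (e.g. one of the reviewed CNN-based models), and the
    threshold would be tuned on validation data to balance false
    accepts against false rejects.
    """
    return cosine_similarity(enrolled, test) >= threshold

# Toy 4-dimensional "embeddings"; real models output hundreds of dimensions.
enrolled = [0.8, 0.2, 0.3, 0.1]
same_speaker = [0.9, 0.1, 0.3, 0.2]   # points in a similar direction
impostor = [-0.5, 0.9, 0.1, 0.4]      # points in a different direction

print(verify(enrolled, same_speaker))  # True  (accepted)
print(verify(enrolled, impostor))      # False (rejected)
```

In practice the threshold is selected on a held-out trial set, for example at the equal error rate point, rather than fixed in advance.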

Author Biographies

Vladyslav Kholiev, Kharkiv National University of Radio Electronics

Postgraduate at the Department of Electronic Computers

Olesia Barkovska, Kharkiv National University of Radio Electronics

PhD (Engineering Sciences), Associate Professor, Associate Professor at the Department of Electronic Computers

References

Barkovska, O. (2022), "Research into speech-to-text transformation module in the proposed model of a speaker’s automatic speech annotation", Innovative Technologies and Scientific Solutions for Industries, No. 4 (22), P. 5–13. DOI: https://doi.org/10.30837/ITSSI.2022.22.005

Yashina, E., Artiukh, R., Pan, N., Zelensky, A. (2019), "Information technology for recognition of road signs using a neural network", Innovative Technologies and Scientific Solutions for Industries, No. 2 (8), P. 130–141. DOI: https://doi.org/10.30837/2522-9818.2019.8.130

Kholiev, V., Barkovska, O. (2023), "Analysis of the training and test data distribution for audio series classification", Information and control systems at railway transport, No. 1, P. 38–43. DOI: https://doi.org/10.18664/ikszt.v28i1.276343

Illingworth, S., Allen, G. (2020), "Introduction", Effective science communication: a practical guide to surviving as a scientist (2nd ed.), Bristol, UK; Philadelphia: IOP Publishing, P. 1–5. DOI: https://doi.org/10.1088/978-0-7503-2520-2ch1

Côté, I., Darling, E. (2018), "Scientists on Twitter: Preaching to the choir or singing from the rooftops?", FACETS, Vol. 3, P. 682–694. DOI: https://doi.org/10.1139/facets-2018-0002

Klin, B., Podpora, M., Beniak, R., Gardecki, A., Rut, J. (2023), "Smart Beamforming in Verbal Human-machine Interaction for Humanoid Robots", IEEE Robotics and Automation Letters, P. 4689–4696. DOI: https://doi.org/10.1109/LRA.2023.3288381

Jin, R., Ablimit, M., Hamdulla, A. (2023), "Speaker Verification based on Single Channel Speech Separation", IEEE Access, available at: https://ieeexplore.ieee.org/iel7/6287639/6514899/10156847.pdf

Froiz-Míguez, I., Fraga-Lamas, P., Fernández-Caramés, T. M. (2023), "Design, Implementation and Practical Evaluation of a Voice Recognition Based IoT Home Automation System for Low-Resource Languages and Resource-Constrained Edge IoT Devices: a System for Galician and Mobile Opportunistic Scenarios", IEEE Access, available at: https://www.researchgate.net/profile/Tiago-Fernandez-Carames

Tesema, F. B., Gu, J., Song, W., Wu, H., Zhu, S., Lin, Z. (2023), "Efficient Audiovisual Fusion for Active Speaker Detection", IEEE Access, Vol. 11, P. 45140–45153. DOI: https://doi.org/10.1109/ACCESS.2023.3267668

Hu, Z., LingHu, K., Liao, C., Yu, H. (2023), "Speech Emotion Recognition Based on Attention MCNN Combined With Gender Information", IEEE Access, Vol. 11, P. 50285–50294. DOI: https://doi.org/10.1109/ACCESS.2023.3278106

Barkovska, O., Kholiev, V., Pyvovarova, D., Ivaschenko, G., Rosinskiy, D. (2021), "International system of knowledge exchange for young scientists", Advanced Information Systems, No. 5 (1), P. 69–74. DOI: https://doi.org/10.20998/2522-9052.2021.1.09

Ravanelli, M., Bengio, Y. (2018), "Speaker Recognition from Raw Waveform with SincNet", 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, P. 1021–1028. DOI: https://doi.org/10.1109/SLT.2018.8639585

Nagrani, A., Chung, J. S., Zisserman, A. (2017), "VoxCeleb: A Large-Scale Speaker Identification Dataset", Proc. Interspeech 2017, P. 2616–2620. DOI: https://doi.org/10.21437/Interspeech.2017-950

Chung, J. S., Nagrani, A., Zisserman, A. (2018), "VoxCeleb2: Deep Speaker Recognition", Proc. Interspeech 2018, P. 1086–1090. DOI: https://doi.org/10.21437/Interspeech.2018-1929

Koluguri, N. R., Park, T., Ginsburg, B. (2021), "TitaNet: Neural Model for Speaker Representation with 1D Depth-Wise Separable Convolutions and Global Context", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), P. 8102–8106. DOI: https://doi.org/10.48550/arXiv.2110.04410

Koluguri, N. R., Li, J., Lavrukhin, V., Ginsburg, B. (2020), "SpeakerNet: 1D Depth-wise Separable Convolutional Network for Text-Independent Speaker Recognition and Verification", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). DOI: https://doi.org/10.48550/arXiv.2010.12653

Dawalatabad, N., Ravanelli, M., Grondin, F., Thienpondt, J., Desplanques, B., Na, H. (2021), "ECAPA-TDNN Embeddings for Speaker Diarization", Proc. Interspeech 2021, P. 3560–3564. DOI: https://doi.org/10.21437/interspeech.2021-941

Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J., Nguyen, H., Gadde, R. (2019), "Jasper: An End-to-End Convolutional Neural Acoustic Model", Proc. Interspeech 2019, P. 71–75. DOI: https://doi.org/10.21437/Interspeech.2019-1819

Published

2023-11-13

How to Cite

Kholiev, V., & Barkovska, O. (2023). COMPARATIVE ANALYSIS OF NEURAL NETWORK MODELS FOR THE PROBLEM OF SPEAKER RECOGNITION. INNOVATIVE TECHNOLOGIES AND SCIENTIFIC SOLUTIONS FOR INDUSTRIES, 2(24), 172–178. https://doi.org/10.30837/ITSSI.2023.24.172