COMPARATIVE ANALYSIS OF NEURAL NETWORK MODELS FOR THE PROBLEM OF SPEAKER RECOGNITION
DOI: https://doi.org/10.30837/ITSSI.2023.24.172
Keywords: comparative analysis; neural network; intelligent models; model; machine learning; speaker identification; speaker recognition
Abstract
The subject matter of the article is neural network models designed or adapted for voice analysis in the context of speaker identification and verification tasks. The goal of this work is to perform a comparative analysis of relevant neural network models in order to determine the model(s) that best satisfy the formulated criteria: model type, programming language of the model's implementation, parallelization potential, binary or multiclass classification, accuracy, and computational complexity. Some of these criteria, such as accuracy and computational complexity, were chosen for their universal importance regardless of the particular application. Others were chosen because of the architecture and challenges of the scientific communication system described in the work, which performs the tasks of speaker identification and verification. The relevance of the paper lies in the prevalence of audio as a communication medium, which results in a wide range of practical applications of audio intelligence in various fields of human activity (business, law, the military), as well as in the need to provide and encourage an efficient environment for inward-facing, audio-based scientific communication among young scientists, helping them accelerate their research and acquire scientific communication skills. To achieve the goal, the following tasks were solved: criteria for judging the models were formulated based on the needs and challenges of the proposed system; the models designed for the problems of speaker identification and verification were reviewed against the formulated criteria, with the results compiled into a comprehensive table; and the optimal models were determined in accordance with those criteria. The following neural network based models were reviewed: SincNet, VGGVox, Jasper, TitaNet, SpeakerNet, and ECAPA-TDNN. Conclusions.
For future research and for a practical solution to the problem of speaker authentication, it is reasonable to use a convolutional neural network implemented in the Python programming language, as Python offers a wide variety of development tools and libraries.
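As a minimal, hypothetical illustration of the kind of 1D convolutional processing such speaker models apply to raw audio (this sketch is not taken from the article; the filter bank size, kernel length, and random inputs are arbitrary assumptions), a waveform can be convolved with a bank of learned filters and pooled into a fixed-size speaker embedding:

```python
import numpy as np

def conv1d(signal, kernels, stride=1):
    """Valid 1D convolution of a mono signal with a bank of kernels."""
    k = kernels.shape[1]
    n_out = (len(signal) - k) // stride + 1
    out = np.empty((kernels.shape[0], n_out))
    for i in range(n_out):
        window = signal[i * stride : i * stride + k]
        out[:, i] = kernels @ window  # one response per kernel
    return out

def speaker_embedding(waveform, kernels):
    """ReLU feature maps followed by global average pooling."""
    features = np.maximum(conv1d(waveform, kernels), 0.0)  # ReLU
    return features.mean(axis=1)  # fixed-size vector: one value per kernel

rng = np.random.default_rng(0)
waveform = rng.standard_normal(16000)    # ~1 s of audio at 16 kHz (synthetic)
kernels = rng.standard_normal((8, 251))  # 8 hypothetical filters of length 251
emb = speaker_embedding(waveform, kernels)
print(emb.shape)  # (8,)
```

In a trained system, the kernels would be learned (as in SincNet's parameterized filters) and the embedding compared across utterances to identify or verify a speaker; the pooling step is what makes the embedding length independent of the utterance duration.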
References
Barkovska, O. (2022), "Research into speech-to-text transformation module in the proposed model of a speaker's automatic speech annotation", Innovative Technologies and Scientific Solutions for Industries, No. 4 (22), P. 5–13. DOI: https://doi.org/10.30837/ITSSI.2022.22.005
Yashina, E., Artiukh, R., Pan, N., Zelensky, A. (2019), "Information technology for recognition of road signs using a neural network", Innovative Technologies and Scientific Solutions for Industries, No. 2 (8), P. 130–141. DOI: https://doi.org/10.30837/2522-9818.2019.8.130
Kholiev, V., Barkovska, O. (2023), "Analysis of the training and test data distribution for audio series classification", Information and control systems at railway transport, No. 1, P. 38–43. DOI: https://doi.org/10.18664/ikszt.v28i1.276343
Illingworth, S., Allen, G. (2020), "Introduction", Effective science communication: a practical guide to surviving as a scientist (2nd ed.), Bristol, UK; Philadelphia: IOP Publishing, P. 1–5. DOI: https://doi.org/10.1088/978-0-7503-2520-2ch1
Côté, I., Darling, E. (2018), "Scientists on Twitter: Preaching to the choir or singing from the rooftops?", FACETS, Vol. 3, P. 682–694. DOI: https://doi.org/10.1139/facets-2018-0002
Klin, B., Podpora, M., Beniak, R., Gardecki, A., Rut, J. (2023), "Smart Beamforming in Verbal Human-machine Interaction for Humanoid Robots", IEEE Robotics and Automation Letters, P. 4689–4696. DOI: https://doi.org/10.1109/LRA.2023.3288381
Jin, R., Ablimit, M., Hamdulla, A. (2023), "Speaker Verification based on Single Channel Speech Separation", IEEE Access, available at: https://ieeexplore.ieee.org/iel7/6287639/6514899/10156847.pdf
Froiz-Míguez, I., Fraga-Lamas, P., Fernández-Caramés, T. M. (2023), "Design, Implementation and Practical Evaluation of a Voice Recognition Based IoT Home Automation System for Low-Resource Languages and Resource-Constrained Edge IoT Devices: a System for Galician and Mobile Opportunistic Scenarios", IEEE Access, available at: https://www.researchgate.net/profile/Tiago-Fernandez-Carames
Tesema, F. B., Gu, J., Song, W., Wu, H., Zhu, S., Lin, Z. (2023), "Efficient Audiovisual Fusion for Active Speaker Detection", IEEE Access, Vol. 11, P. 45140–45153. DOI: https://doi.org/10.1109/ACCESS.2023.3267668
Hu, Z., LingHu, K., Liao, C., Yu, H. (2023), "Speech Emotion Recognition Based on Attention MCNN Combined With Gender Information", IEEE Access, Vol. 11, P. 50285–50294. DOI: https://doi.org/10.1109/ACCESS.2023.3278106
Barkovska, O., Kholiev, V., Pyvovarova, D., Ivaschenko, G., Rosinskiy, D. (2021), "International system of knowledge exchange for young scientists", Advanced Information Systems, No. 5 (1), P. 69–74. DOI: https://doi.org/10.20998/2522-9052.2021.1.09
Ravanelli, M., Bengio, Y. (2018), "Speaker Recognition from Raw Waveform with SincNet", 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, P. 1021–1028. DOI: https://doi.org/10.1109/SLT.2018.8639585
Nagrani, A., Chung, J. S., Zisserman, A. (2017), "VoxCeleb: A Large-Scale Speaker Identification Dataset", Proc. Interspeech 2017, P. 2616–2620. DOI: https://doi.org/10.21437/Interspeech.2017-950
Chung, J. S., Nagrani, A., Zisserman, A. (2018), "VoxCeleb2: Deep Speaker Recognition", Proc. Interspeech 2018, P. 1086–1090. DOI: https://doi.org/10.21437/Interspeech.2018-1929
Koluguri, N. R., Park, T., Ginsburg, B. (2021), "TitaNet: Neural Model for Speaker Representation with 1D Depth-Wise Separable Convolutions and Global Context", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), P. 8102–8106. DOI: https://doi.org/10.48550/arXiv.2110.04410
Koluguri, N. R., Li, J., Lavrukhin, V., Ginsburg, B. (2020), "SpeakerNet: 1D Depth-wise Separable Convolutional Network for Text-Independent Speaker Recognition and Verification", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). DOI: https://doi.org/10.48550/arXiv.2010.12653
Dawalatabad, N., Ravanelli, M., Grondin, F., Thienpondt, J., Desplanques, B., Na, H. (2021), "ECAPA-TDNN Embeddings for Speaker Diarization", Proc. Interspeech 2021, P. 3560–3564. DOI: https://doi.org/10.21437/interspeech.2021-941
Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J., Nguyen, H., Gadde, R. (2019), "Jasper: An End-to-End Convolutional Neural Acoustic Model", Proc. Interspeech 2019, P. 71–75. DOI: https://doi.org/10.21437/Interspeech.2019-1819
License
Copyright (c) 2023 Vladyslav Kholiev, Olesia Barkovska
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Our journal abides by the Creative Commons copyright rights and permissions for open access journals.
Authors who publish with this journal agree to the following terms:
Authors hold the copyright without restrictions and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0) that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
Authors are able to enter into separate, additional contractual arrangements for the non-commercial and non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
Authors are permitted and encouraged to post their published work online (e.g., in institutional repositories or on their website) as it can lead to productive exchanges, as well as earlier and greater citation of published work.