Speaker recognition by ultrashort utterances

Authors

Bekbolat Medetov, Aigul Nurlankyzy, Timur Namazbayev, Ainur Akhmediyarova, Kairatbek Zhetpisbayev, Ainur Zhetpisbayeva, Aliya Kargulova

DOI:

https://doi.org/10.15587/1729-4061.2025.327907

Keywords:

speaker recognition, ultra-short utterances, phoneme-by-phoneme recognition, ECAPA-TDNN, phonemes of the Kazakh language

Abstract

The object of this study is the accuracy of speaker identification based on short utterances.

To address the task of speaker identification from ultra-short speech utterances, this study proposes a phoneme-by-phoneme approach to constructing voice models. The approach rests on the observation that short utterances usually contain only a limited number of phonemes. Accordingly, a hypothesis was put forward that the accuracy of speaker identification from short utterances can be increased by analyzing how specific phonemes are pronounced by different speakers.

The experiments used speech recordings of monosyllabic words containing the corresponding phonemes, from which speaker voice models were constructed using the ECAPA-TDNN neural network architecture. The experimental studies showed that voice models built on the sounds of only one phoneme provide higher speaker identification accuracy than generalized models built on all speech sounds.

It was also found that different phonemes yield different speaker identification accuracy. For example, at a speech signal duration of 2–3 seconds, the identification accuracy of the generalized model was 75 %, whereas a model built on the single phoneme "E" achieved 85 % on the same input data, which is 10 percentage points higher than the generalized model.
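The identification step described above reduces to comparing an embedding extracted from a probe phoneme segment against enrolled per-speaker embeddings. The following is a minimal sketch of that scoring stage only, using cosine similarity and hypothetical hand-made embedding vectors (the speaker names, vectors, and function names are illustrative assumptions, not the authors' code); in the actual study the embeddings would come from the ECAPA-TDNN encoder of the SpeechBrain toolkit cited in the references.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def identify(probe, enrolled):
    """Return the enrolled speaker whose embedding best matches the probe."""
    return max(enrolled, key=lambda spk: cosine(probe, enrolled[spk]))

# Hypothetical per-phoneme enrollment: one embedding per speaker,
# built only from segments of the phoneme "E".
enrolled_E = {
    "speaker_A": [0.9, 0.1, 0.3],
    "speaker_B": [0.2, 0.8, 0.5],
}

# Embedding of an "E" segment cut from an unknown short utterance.
probe = [0.85, 0.15, 0.25]

print(identify(probe, enrolled_E))  # speaker_A
```

Keeping a separate enrollment dictionary per phoneme (here only "E") is what distinguishes the phoneme-by-phoneme scheme from a single generalized model trained on all speech sounds.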

Author Biographies

Bekbolat Medetov, L.N. Gumilyov Eurasian National University

PhD, Associate Professor

Department of Radio Engineering, Electronics and Telecommunications

Aigul Nurlankyzy, Satbayev University; Non-Profit Joint Stock Company "Almaty University of Power Engineering and Telecommunications named after Gumarbek Daukeyev"

PhD Student

Department of Electronics, Telecommunications and Space Technologies

Department of Space Engineering

Timur Namazbayev, Al-Farabi Kazakh National University

Master, Senior Lecturer

Department of Solid State Physics and Nonlinear Physics

Ainur Akhmediyarova, Satbayev University

PhD

Department of Software Engineering

Kairatbek Zhetpisbayev, Turan University

PhD

Department of Radio Engineering, Electronics and Telecommunications

Ainur Zhetpisbayeva, L.N. Gumilyov Eurasian National University

PhD, Associate Professor

Department of Radio Engineering, Electronics and Telecommunications

Aliya Kargulova, S. Seifullin Kazakh Agrotechnical Research University

Senior Lecturer (Adviser)

Department of Electric Power Supply

References

  1. Sharif-Noughabi, M., Razavi, S. M., Mohamadzadeh, S. (2025). Improving the Performance of Speaker Recognition System Using Optimized VGG Convolutional Neural Network and Data Augmentation. International Journal of Engineering, 38 (10), 2414–2425. https://doi.org/10.5829/ije.2025.38.10a.17
  2. Tomar, S., Koolagudi, S. G. (2025). Blended-emotional speech for Speaker Recognition by using the fusion of Mel-CQT spectrograms feature extraction. Expert Systems with Applications, 276, 127184. https://doi.org/10.1016/j.eswa.2025.127184
  3. Chauhan, N., Isshiki, T., Li, D. (2024). Enhancing Speaker Recognition Models with Noise-Resilient Feature Optimization Strategies. Acoustics, 6 (2), 439–469. https://doi.org/10.3390/acoustics6020024
  4. Kohler, O., Imtiaz, M. (2025). Investigation of Text-Independent Speaker Verification by Support Vector Machine-Based Machine Learning Approaches. Electronics, 14 (5), 963. https://doi.org/10.3390/electronics14050963
  5. Missaoui, I., Lachiri, Z. (2025). Stationary wavelet Filtering Cepstral coefficients (SWFCC) for robust speaker identification. Applied Acoustics, 231, 110435. https://doi.org/10.1016/j.apacoust.2024.110435
  6. Zhang, X., Tang, J., Cao, H., Wang, C., Shen, C., Liu, J. (2025). A Self-Supervised Method for Speaker Recognition in Real Sound Fields with Low SNR and Strong Reverberation. Applied Sciences, 15 (6), 2924. https://doi.org/10.3390/app15062924
  7. Li, P., Hoi, L. M., Wang, Y., Yang, X., Im, S. K. (2025). Enhancing Speaker Recognition with CRET Model: a fusion of CONV2D, RESNET and ECAPA-TDNN. EURASIP Journal on Audio, Speech, and Music Processing, 2025 (1). https://doi.org/10.1186/s13636-025-00396-4
  8. Ohi, A. Q., Mridha, M. F., Hamid, M. A., Monowar, M. M., Lee, D., Kim, J. (2020). A Lightweight Speaker Recognition System Using Timbre Properties. arXiv. https://doi.org/10.48550/arXiv.2010.05502
  9. Kye, S. M., Jung, Y., Lee, H. B., Hwang, S. J., Kim, H. (2020). Meta-Learning for Short Utterance Speaker Recognition with Imbalance Length Pairs. Interspeech 2020. https://doi.org/10.21437/interspeech.2020-1283
  10. Wang, W., Zhao, H., Yang, Y., Chang, Y., You, H. (2023). Few-shot short utterance speaker verification using meta-learning. PeerJ Computer Science, 9, e1276. https://doi.org/10.7717/peerj-cs.1276
  11. Chen, Y., Zheng, S., Wang, H., Cheng, L., Zhu, T., Huang, R. et al. (2025). 3D-Speaker-Toolkit: An Open-Source Toolkit for Multimodal Speaker Verification and Diarization. ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5. https://doi.org/10.1109/icassp49660.2025.10888389
  12. Desplanques, B., Thienpondt, J., Demuynck, K. (2020). ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. Interspeech 2020. https://doi.org/10.21437/interspeech.2020-2650
  13. Ravanelli, M., Parcollet, T., Moumen, A., de Langen, S., Subakan, C., Plantinga, P. et al. (2024). Open-Source Conversational AI with SpeechBrain 1.0. arXiv. https://arxiv.org/abs/2407.00463
  14. Ravanelli, M., Parcollet, T., Plantinga, P., Rouhe, A., Cornell, S., Lugosch, L. et al. (2021). SpeechBrain: A General-Purpose Speech Toolkit. arXiv. https://arxiv.org/abs/2106.04624
Published

2025-04-29

How to Cite

Medetov, B., Nurlankyzy, A., Namazbayev, T., Akhmediyarova, A., Zhetpisbayev, K., Zhetpisbayeva, A., & Kargulova, A. (2025). Speaker recognition by ultrashort utterances. Eastern-European Journal of Enterprise Technologies, 2 (9 (134)), 62–69. https://doi.org/10.15587/1729-4061.2025.327907

Issue

Section

Information and controlling system