Two-factor authentication based on keyword spotting and speaker verification

Authors

Olesia Barkovska, Kharkiv National University of Radio Electronics
DOI:

https://doi.org/10.30837/2522-9818.2025.3.005

Abstract

The subject matter of the article is the development and evaluation of a two-factor speaker authentication method based on voiceprint identification and keyword spotting (KWS), designed for secure voice-based access in human-machine interfaces, especially for users with limited mobility. The goal of the work is to create a speaker authentication method based on convolutional neural networks (CNNs) and to compare the efficiency of two widely used spectral feature extraction techniques – Mel-Frequency Cepstral Coefficients (MFCC) and Short-Time Fourier Transform (STFT) spectrograms. The following tasks were solved in the article: a model of a two-factor authentication method is proposed, which includes speaker identification and voice password recognition; the quality of MFCC and STFT spectrogram features is compared; the influence of the number of epochs, CNN architecture and training parameters on system accuracy is evaluated; the effect of the sampling rate on model performance is investigated. The following methods are used: deep learning with CNN architectures, fine-tuning, MFCC and STFT feature extraction, mathematical and statistical analysis of training efficiency, and system performance metrics. The following results were obtained: the method achieved 97.95% accuracy in speaker identification using MFCC features after 60 training epochs and 99.82% accuracy in voice password verification using the same CNN structure after 20 epochs. The average accuracy of the entire authentication process was 98.75%. Moreover, using MFCC features reduced training time by a factor of 23 and memory consumption by a factor of 7 compared to STFT spectrograms. Conclusions: a two-factor voice authentication method that combines speaker identification based on acoustic voice characteristics with voice password verification was implemented, and its effectiveness was studied. Further research directions include studying the impact of alternative spectral features (in particular, CQCC, GFCC, and prosodic parameters) on accuracy and resistance to spoofing. Special attention will be paid to optimizing the model for energy-efficient use on portable devices.
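
To make the described pipeline concrete, the sketch below illustrates the two-factor decision flow in Python: MFCC features (with an STFT spectrogram front end shown for comparison) are extracted from an utterance and passed to two separately trained CNN classifiers, one for speaker identification and one for voice-password (keyword) verification, and access is granted only when both factors agree. The librosa/TensorFlow front end, the 16 kHz sampling rate, the 40 MFCC coefficients, the 0.9 confidence threshold, and the function and model names are illustrative assumptions, not the architecture or parameters reported in the article.

```python
# Minimal sketch of the two-factor voice authentication flow (assumed parameters,
# not the ones reported in the article): MFCC features feed a speaker-ID CNN and
# a keyword-spotting CNN; both factors must pass for access to be granted.
import numpy as np
import librosa
import tensorflow as tf

SAMPLE_RATE = 16_000   # assumed sampling rate
N_MFCC = 40            # assumed number of MFCC coefficients

def mfcc_features(wav_path: str) -> np.ndarray:
    """Load an utterance and return an (n_mfcc, frames, 1) MFCC 'image' for a CNN."""
    signal, _ = librosa.load(wav_path, sr=SAMPLE_RATE, mono=True)
    mfcc = librosa.feature.mfcc(y=signal, sr=SAMPLE_RATE, n_mfcc=N_MFCC)
    return mfcc[..., np.newaxis]

def stft_features(wav_path: str) -> np.ndarray:
    """Alternative front end: log-magnitude STFT spectrogram (larger than MFCC)."""
    signal, _ = librosa.load(wav_path, sr=SAMPLE_RATE, mono=True)
    spec = np.abs(librosa.stft(signal, n_fft=512, hop_length=160))
    return librosa.amplitude_to_db(spec, ref=np.max)[..., np.newaxis]

def authenticate(wav_path: str,
                 speaker_cnn: tf.keras.Model,
                 keyword_cnn: tf.keras.Model,
                 expected_speaker: int,
                 expected_keyword: int,
                 threshold: float = 0.9) -> bool:
    """Grant access only if the voiceprint matches the enrolled speaker AND the
    spoken password is recognised as the expected keyword."""
    x = mfcc_features(wav_path)[np.newaxis, ...]        # add batch dimension
    spk_probs = speaker_cnn.predict(x, verbose=0)[0]    # speaker-ID factor
    kws_probs = keyword_cnn.predict(x, verbose=0)[0]    # voice-password factor
    speaker_ok = (np.argmax(spk_probs) == expected_speaker
                  and spk_probs[expected_speaker] >= threshold)
    keyword_ok = (np.argmax(kws_probs) == expected_keyword
                  and kws_probs[expected_keyword] >= threshold)
    return speaker_ok and keyword_ok
```

Both classifiers here consume the same MFCC input, mirroring the article's reuse of one CNN structure for the two factors; fixed-length padding or cropping of the MFCC matrix, which a real CNN input layer would require, is omitted for brevity.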

Author Biography

Olesia Barkovska, Kharkiv National University of Radio Electronics

Ph.D. (Engineering Sciences), Associate Professor, Associate Professor of the Department of Electronic Computers

References

Mourtzis, D., Angelopoulos, J., Panopoulos, N. (2023), "The Future of the Human–Machine Interface (HMI) in Society 5.0", Future Internet, № 15, 162 p. DOI: https://doi.org/10.3390/fi15050162

Grobelna, I., Mailland, D., Horwat, M. (2025), "Design of Automotive HMI: New Challenges in Enhancing User Experience, Safety, and Security", Appl. Sci., № 15, 5572 p. DOI: https://doi.org/10.3390/app15105572

Esquivel, P. et al. (2024), "Voice Assistant Utilization among the Disability Community for Independent Living: A Rapid Review of Recent Evidence", Human Behavior and Emerging Technologies, Vol. 2024, № 1, 6494944 p. DOI: https://doi.org/10.1155/2024/6494944

Semary, H. E., Al-Karawi, K. A., Abdelwahab, M. M. (2024), "Using voice technologies to support disabled people", Journal of Disability Research, Vol. 3, № 1. DOI: https://doi.org/10.57197/jdr-2023-0063

Lawrence, I. D., Pavitra, A. R. R. (2024), "Voice-controlled drones for smart city applications", Sustainable Innovation for Industry 6.0, P. 162–177. DOI: 10.1109/ICUFN.2017.7993759

Ryu, R., Yeom, S., Kim, S. H., Herbert, D. (2021), "Continuous multimodal biometric authentication schemes: a systematic review", IEEE Access, Vol. 9, P. 34541–34557. DOI: 10.1109/ACCESS.2021.3061589

Barkovska, O., Liapin, Y., Muzyka, T., Ryndyk, I., Botnar, P. (2024), "Gaze direction monitoring model in computer system for academic performance assessment. Civil law aspect", Information Technologies and Learning Tools, Vol. 99, № 1, P. 63–75. DOI: 10.33407/itlt.v99i1.5503

Shaheed, K., Mao, A., Qureshi, I. et al. (2021), "A Systematic Review on Physiological-Based Biometric Recognition Systems: Current and Future Trends", Arch. Computat. Methods Eng., № 28, P. 4917–4960. DOI: https://doi.org/10.1007/s11831-021-09560-3

Sasongko, S. M. A., Tsaury, S., Ariessaputra, S., Ch, S. (2023), "Mel Frequency Cepstral Coefficients (MFCC) Method and Multiple Adaline Neural Network Model for Speaker Identification", International Journal on Informatics Visualization, № 7(4), P. 2306–2312. DOI: https://doi.org/10.30630/joiv.7.4.1376

Desplanques, B., Thienpondt, J., Demuynck, K. (2020), "ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification", Interspeech 2020, P. 3830–3834. DOI: https://doi.org/10.21437/Interspeech.2020-2650

Jahangir, R., Alreshoodi, M., Alarfaj, F. K. (2025), "Spectrogram Features-Based Automatic Speaker Identification for Smart Services", Applied Artificial Intelligence, № 39(1). DOI: https://doi.org/10.1080/08839514.2025.2459476

Tirumala, S. S., Shahamiri, S. R., Garhwal, A. S., Wang, R. (2017), "Speaker Identification Features Extraction Methods: A Systematic Review", Expert Systems with Applications, № 90, P. 250–271. DOI: https://doi.org/10.1016/j.eswa.2017.08.015

Iliev, Y., Ilieva, G. (2023), "A Framework for Smart Home System with Voice Control Using NLP Methods", Electronics, № 12, 116 p. DOI: https://doi.org/10.3390/electronics12010116

Kim, Y., Hyon, Y., Lee, S., Woo, S. D., Ha, T., Chung, C. (2022), "The coming era of a new auscultation system for analyzing respiratory sounds", BMC Pulmonary Medicine, Vol. 22, № 1, 119 p. DOI: 10.1186/s12890-022-01896-1

Barkovska, O., Havrashenko, A. (2024), "Research of the impact of noise reduction methods on the quality of audio signal recovery", Information and Control Systems at Railway Transport, Vol. 29, № 3, P. 57–65. DOI: https://doi.org/10.18664/ikszt.v29i3.313606

Zaman, K., Sah, M., Direkoglu, C., Unoki, M. (2023), "A Survey of Audio Classification Using Deep Learning", IEEE Access, Vol. 11, P. 106620–106649. DOI: 10.1109/ACCESS.2023.3318015

Xie, X., Cai, H., Li, C., Wu, Y., Ding, F. (2023), "A Voice Disease Detection Method Based on MFCCs and Shallow CNN", Journal of Voice, Oct. 2023. DOI: https://doi.org/10.1016/j.jvoice.2023.09.024

Tu, Y., Lin, W., Mak, M. W. (2022), "A survey on text-dependent and text-independent speaker verification", IEEE Access, Vol. 10, P. 99038–99049. DOI: 10.1109/ACCESS.2022.3206541

Luitel, S., Anwar, M. (2022), "Audio Sentiment Analysis Using Spectrogram and Bag-of-Visual-Words", IEEE 23rd International Conference on Information Reuse and Integration for Data Science (IRI), IEEE, P. 200–205. DOI: https://doi.org/10.1109/IRI54793.2022.00052

Singh, V. K., Sharma, K., Sur, S. N. (2023), "A survey on preprocessing and classification techniques for acoustic scene", Expert Systems with Applications, Vol. 229, 120520 p. DOI: https://doi.org/10.1016/j.eswa.2023.120520

Labied, M., Belangour, A., Banane, M., Erraissi, A. (2022), "An overview of Automatic Speech Recognition Preprocessing Techniques", 2022 International Conference on Decision Aid Sciences and Applications (DASA), Chiangrai, Thailand, P. 804–809. DOI: 10.1109/DASA54658.2022.9765043

Published

2025-09-30

How to Cite

Barkovska, O. (2025). Two-factor authentication based on keyword spotting and speaker verification. INNOVATIVE TECHNOLOGIES AND SCIENTIFIC SOLUTIONS FOR INDUSTRIES, (3(33)), 5–18. https://doi.org/10.30837/2522-9818.2025.3.005