Predicting Risks of Cardiovascular Disease on Small Datasets using Feature Engineering

Authors

DOI:

https://doi.org/10.30837/2522-9818.2026.1.055

Keywords:

classification; risk prediction; feature importance; machine learning; ECG signal; Random Forest

Abstract

Relevance. Cardiovascular diseases remain a leading cause of mortality worldwide, creating high demand for automated diagnostic systems. However, reliable machine learning models for electrocardiogram (ECG) analysis are difficult to develop when only small, imbalanced datasets are available, which limits the effectiveness of deep learning approaches.

The object of research is the automated processing and classification of electrocardiographic signals for diagnostic purposes. The subject of research comprises methods of beat-centric feature extraction, patient-level aggregation strategies, and machine learning algorithms for cardiovascular risk prediction. The purpose of this paper is to develop and evaluate a reliable classification framework, optimized for small datasets, that increases prediction accuracy by leveraging patient-level feature aggregation and explainable machine learning models.

To achieve this goal, the following tasks were solved: 1) implementation of a robust preprocessing pipeline using a refined Pan-Tompkins algorithm for precise beat-centric segmentation; 2) development of a statistical feature aggregation strategy to mitigate local signal variability; and 3) optimization and validation of a Random Forest classifier.

The methodology includes digital signal processing (Butterworth filtering), feature engineering (HRV and wavelet analysis), and stratified 10-fold cross-validation to assess generalization on limited data.

Research results. The study proposes a pipeline that begins with standard signal preprocessing, followed by precise R-peak detection and beat-centric segmentation. Physiological features (HRV, wavelet, morphological) are then extracted from individual segments and statistically aggregated at the patient level.
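The per-segment extraction and patient-level aggregation described above can be illustrated with a minimal sketch. It uses the standard definitions of SDNN (standard deviation of R-R intervals) and RMSSD (root mean square of successive R-R differences). The function names, the R-peak times, and the choice of mean and standard deviation as aggregation statistics are assumptions made for illustration; the abstract does not specify which statistics the authors used.

```python
from math import sqrt
from statistics import mean, stdev


def rr_intervals_ms(r_peak_times_s):
    """Successive R-R intervals in milliseconds from R-peak times in seconds."""
    return [(b - a) * 1000.0 for a, b in zip(r_peak_times_s, r_peak_times_s[1:])]


def sdnn(rr_ms):
    """Sample standard deviation of the R-R intervals."""
    return stdev(rr_ms)


def rmssd(rr_ms):
    """Root mean square of successive R-R interval differences."""
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return sqrt(mean(d * d for d in diffs))


def aggregate_patient(segment_features):
    """Collapse per-segment feature dicts into one patient-level row.

    Assumption: each feature is summarized by its mean and sample SD
    across the patient's segments.
    """
    row = {}
    for name in segment_features[0]:
        values = [seg[name] for seg in segment_features]
        row[f"{name}_mean"] = mean(values)
        row[f"{name}_std"] = stdev(values) if len(values) > 1 else 0.0
    return row


# Hypothetical R-peak times (seconds) for two segments of a single patient.
segments = [
    [0.0, 0.8, 1.6, 2.5, 3.3],
    [0.0, 0.9, 1.7, 2.6, 3.4],
]

per_segment = []
for peaks in segments:
    rr = rr_intervals_ms(peaks)
    per_segment.append({"sdnn": sdnn(rr), "rmssd": rmssd(rr)})

# One row per patient: this is what a classifier would be trained on.
patient_row = aggregate_patient(per_segment)
```

Aggregating to one row per patient also means that each subject contributes a single sample, so segments from the same person cannot end up on both sides of a train/test split.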
Experiments on a dataset of 164 subjects demonstrated that the proposed patient-level aggregation strategy significantly outperformed traditional segment-based analysis. The final Random Forest model achieved an ROC-AUC score of 0.84. Feature importance analysis confirmed the critical role of Heart Rate Variability (HRV) metrics, particularly SDNN and RMSSD, in differentiating between healthy and high-risk subjects.
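The evaluation protocol named in the abstract (a Random Forest scored by ROC-AUC under stratified 10-fold cross-validation, followed by feature importance analysis) might be set up as in the sketch below, assuming scikit-learn and NumPy are available. The data are synthetic, and the feature count, the hyperparameters, and the use of impurity-based importances are assumptions; none of them come from the paper, and the resulting score has no relation to the reported 0.84.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# 164 matches the number of subjects in the abstract; 12 features is arbitrary.
n_patients, n_features = 164, 12
y = rng.integers(0, 2, size=n_patients)          # synthetic binary risk labels
X = rng.normal(size=(n_patients, n_features))    # synthetic patient-level features
X[:, 0] += 1.0 * y                               # make one feature weakly informative

# Hyperparameters are placeholders, not the authors' tuned values.
model = RandomForestClassifier(n_estimators=200, random_state=0)

# Stratification keeps the class ratio roughly equal in every fold,
# which matters for small, imbalanced datasets.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
mean_auc = auc_scores.mean()

# Impurity-based importances from a model fitted on all data (an assumption;
# the abstract does not say which importance measure was used).
model.fit(X, y)
importances = model.feature_importances_
```

Reporting the spread of `auc_scores` alongside `mean_auc` is useful at this sample size, because each fold holds only about 16 subjects and per-fold estimates can vary widely.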

Author Biographies

Alexander Krajči, University of Žilina

Postgraduate student, Department of Informatics, Faculty of Management Science and Informatics

Ludmila Sidorenko, State University of Medicine and Pharmacy “Nicolae Testemițanu”

Associate Professor of the Department of Molecular Biology and Human Genetics

Olesia Barkovska, Kharkiv National University of Radio Electronics

Ph.D. (Engineering Sciences), Associate Professor, Department of Electronic Computers

Published

2026-03-30

How to Cite

Krajči, A., Sidorenko, L., & Barkovska, O. (2026). Predicting Risks of Cardiovascular Disease on Small Datasets using Feature Engineering. INNOVATIVE TECHNOLOGIES AND SCIENTIFIC SOLUTIONS FOR INDUSTRIES, (1(35)), 55–64. https://doi.org/10.30837/2522-9818.2026.1.055