Predicting Risks of Cardiovascular Disease on Small Datasets using Feature Engineering
DOI:
https://doi.org/10.30837/2522-9818.2026.1.055
Keywords:
classification; risk prediction; feature importance; machine learning; ECG signal; Random Forest
Abstract
Relevance. Cardiovascular diseases remain a leading cause of mortality globally, creating a high demand for automated diagnostic systems. However, developing reliable machine learning models for electrocardiogram (ECG) analysis is often hindered by the availability of only small and imbalanced datasets, which limits the effectiveness of deep learning approaches.
The object of research is the automated processing and classification of electrocardiographic signals for diagnostic purposes. The subject of the research comprises methods of beat-centric feature extraction, patient-level aggregation strategies, and machine learning algorithms for cardiovascular risk prediction.
The purpose of this paper is to develop and evaluate a reliable classification framework, optimized for small datasets, that increases prediction accuracy by leveraging patient-level feature aggregation and explainable machine learning models. To achieve this goal, the following tasks were solved: 1) implementation of a robust preprocessing pipeline using a refined Pan-Tompkins algorithm for precise beat-centric segmentation; 2) development of a statistical feature aggregation strategy to mitigate local signal variability; and 3) optimization and validation of a Random Forest classifier.
The methodology includes digital signal processing (Butterworth filtering), feature engineering (HRV and wavelet analysis), and 10-fold stratified cross-validation to estimate generalization on limited data.
Research results. The study proposes a pipeline that begins with standard signal preprocessing, followed by precise R-peak detection and beat-centric segmentation. Physiological features (HRV, wavelet, morphological) are then extracted from individual segments and statistically aggregated at the patient level.
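The HRV and aggregation steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 250 Hz sampling rate, the R-peak positions, the per-beat feature names (qrs_ms, r_amp), and the choice of mean, standard deviation, minimum, and maximum as aggregation statistics are all assumptions made for illustration. SDNN is computed here as the sample standard deviation of RR intervals and RMSSD as the root mean square of successive RR differences, which are the common definitions.

```python
import math
import statistics


def rr_intervals_ms(r_peaks, fs):
    """Convert R-peak sample indices to RR intervals in milliseconds."""
    return [(b - a) * 1000.0 / fs for a, b in zip(r_peaks, r_peaks[1:])]


def sdnn(rr):
    """Sample standard deviation of the RR intervals (assumes ddof = 1)."""
    return statistics.stdev(rr)


def rmssd(rr):
    """Root mean square of successive RR-interval differences."""
    diffs = [b - a for a, b in zip(rr, rr[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def aggregate_beats(beat_features):
    """Collapse per-beat feature dicts into one patient-level feature dict.

    The statistics used here (mean, std, min, max) are an assumption;
    the source only states that features are statistically aggregated.
    """
    patient = {}
    for name in beat_features[0]:
        values = [beat[name] for beat in beat_features]
        patient[f"{name}_mean"] = statistics.fmean(values)
        patient[f"{name}_std"] = statistics.stdev(values) if len(values) > 1 else 0.0
        patient[f"{name}_min"] = min(values)
        patient[f"{name}_max"] = max(values)
    return patient


# Hypothetical R-peak positions (sample indices) at an assumed 250 Hz.
r_peaks = [0, 200, 410, 600, 810]
rr = rr_intervals_ms(r_peaks, fs=250)  # [800.0, 840.0, 760.0, 840.0]

# Hypothetical per-beat morphological features for one patient.
beats = [
    {"qrs_ms": 90.0, "r_amp": 1.0},
    {"qrs_ms": 110.0, "r_amp": 1.2},
]
patient_vector = aggregate_beats(beats)
patient_vector["sdnn"] = sdnn(rr)
patient_vector["rmssd"] = rmssd(rr)
```

Each patient then contributes a single row to the classifier's input, regardless of how many beats were segmented from the recording.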
Experiments on a dataset of 164 subjects demonstrated that the proposed patient-level aggregation strategy significantly outperformed traditional segment-based analysis. The final Random Forest model achieved an ROC-AUC score of 0.84. Feature importance analysis confirmed the critical role of Heart Rate Variability (HRV) metrics, particularly SDNN and RMSSD, in differentiating between healthy and high-risk subjects.
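The evaluation protocol can be sketched with scikit-learn as follows. This is an illustrative example under stated assumptions, not the study's code: the feature matrix is synthetic (generated with make_classification, one row per subject), and the class ratio, number of features, number of trees, class weighting, and random seeds are hypothetical choices. The scores it produces say nothing about the reported ROC-AUC of 0.84.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the patient-level feature matrix:
# 164 rows (one per subject), imbalanced classes (assumed 70/30 split).
X, y = make_classification(
    n_samples=164,
    n_features=20,
    n_informative=5,
    weights=[0.7, 0.3],
    random_state=0,
)

# Hypothetical hyperparameters; class_weight="balanced" is one common
# way to compensate for class imbalance.
clf = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)

# 10-fold stratified cross-validation keeps the class ratio in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print(f"ROC-AUC: {scores.mean():.3f} +/- {scores.std():.3f}")

# Impurity-based feature importances from a model fitted on all data.
clf.fit(X, y)
ranking = np.argsort(clf.feature_importances_)[::-1]
top5 = ranking[:5]
print("Top 5 feature indices:", top5.tolist())
```

With real data, the columns of X would be named patient-level features (for example aggregated SDNN and RMSSD), so the importance ranking could be read directly in physiological terms.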
License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.