Machine learning model for predicting substance properties based on its physicochemical properties

Authors

Oleksandr Kyrsanov, Stanislav Kryvenko

DOI:

https://doi.org/10.30837/2522-9818.2025.1.151

Keywords:

ML model; cloud; Confusion Matrix Analysis; predicted quality; actual quality; ROC-AUC; data; parameters; averaging.

Abstract

Subject matter. The article extends previous binary classification results to multi-class classification, using an ML model to analyze the properties of a substance from its physicochemical characteristics.

Goal. The primary objective is to develop a new ML model and metrics for comparing the analysis quality of different models, in particular when predicting wine quality from its composition.

Tasks. Data preparation, model development, training, tuning, evaluation, deployment, and monitoring.

Methods. The study uses AWS SageMaker for data preparation, model development, training, tuning, evaluation, deployment, and monitoring, with data processed in Jupyter notebooks using pandas.

Results. Data Analysis: The analysis includes descriptive statistics, correlation matrices, and visualizations such as histograms and scatter plots to understand data relationships and quality. Model Training and Evaluation: The models were trained using XGBoost, with the data split into training, validation, and testing sets, and were evaluated using confusion matrices and ROC-AUC metrics. Confusion Matrix Analysis: The confusion matrices of the two models showed mixed results, which highlights how difficult it is to compare model performance and the need for further research on unbalanced classes. Hyperparameter Tuning: Amazon SageMaker's automatic hyperparameter tuning, which employs Bayesian optimization and Gaussian process regression, was used to optimize model performance. ROC-AUC Metrics: Model performance was evaluated with ROC-AUC metrics; the micro-averaging and macro-averaging approaches gave different AUC values for the two models. Key Findings: The second model performed slightly better according to the AUC metrics, but the confusion matrix analysis suggested a need for models tailored to unbalanced classes.

Conclusions. The research developed a new ML model for multi-class classification, demonstrated its potential for improving wine quality prediction, and suggested directions for future research.
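The difference between micro-averaged and macro-averaged ROC-AUC mentioned in the Results can be illustrated with a minimal sketch. It is not the authors' SageMaker pipeline: the three class labels, the six samples, and the predicted probabilities are invented toy values, and the helper functions (confusion_matrix, binary_auc, macro_auc, micro_auc) are hypothetical names written in plain Python for illustration only.

```python
# Toy illustration (assumed data, not results from the article).
labels = ["low", "medium", "high"]
actual = ["medium", "medium", "medium", "medium", "low", "high"]
# One row per sample: assumed class probabilities in the order of `labels`.
probabilities = [
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.5, 0.2],
    [0.5, 0.4, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.6, 0.3],
]


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def binary_auc(targets, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(targets, scores) if t == 1]
    neg = [s for t, s in zip(targets, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(actual, probabilities, labels):
    """Unweighted mean of the one-vs-rest AUC of each class."""
    per_class = []
    for k, label in enumerate(labels):
        targets = [1 if a == label else 0 for a in actual]
        scores = [row[k] for row in probabilities]
        per_class.append(binary_auc(targets, scores))
    return sum(per_class) / len(per_class)


def micro_auc(actual, probabilities, labels):
    """Pool every (sample, class) pair into one binary problem, then compute AUC."""
    targets, scores = [], []
    for a, row in zip(actual, probabilities):
        for k, label in enumerate(labels):
            targets.append(1 if a == label else 0)
            scores.append(row[k])
    return binary_auc(targets, scores)


# Predicted class = class with the highest probability.
predicted = [labels[max(range(len(labels)), key=row.__getitem__)] for row in probabilities]
cm = confusion_matrix(actual, predicted, labels)
macro = macro_auc(actual, probabilities, labels)
micro = micro_auc(actual, probabilities, labels)

print("confusion matrix:", cm)
print("macro-averaged AUC:", round(macro, 3))
print("micro-averaged AUC:", round(micro, 3))
```

On this toy data the two averages differ, because macro-averaging gives each class equal weight while micro-averaging pools all sample and class pairs, so the majority class dominates. This is one reason why a single AUC value can disagree with what the confusion matrix shows for unbalanced classes.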

Author Biographies

Oleksandr Kyrsanov, Kharkiv National University of Radio Electronics

Postgraduate Student at the Department of Information and Network Engineering

Stanislav Kryvenko, Kharkiv National University of Radio Electronics

PhD (Engineering Sciences), Associate Professor, Associate Professor at the Department of Information and Network Engineering

References

Kyrsanov, O., Kryvenko, S. (2024), "Feature construction for machine learning application in clinical data processing" ["Konstrujuvannja oznak dlja zastosuvannja navchannja mashyn pry obrobci klinichnyx danyx"], ICTEE.2024, Vol. 4, No. 2, P. 162–171. DOI: https://doi.org/10.23939/ictee2024.02.162

Dua, D., Graff, C. (2019), "UCI Machine Learning Repository", University of California, School of Information and Computer Science, Irvine, CA, available at: http://archive.ics.uci.edu/ml (last accessed 07.11.2024)

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., Reis, J. (2009), "Modeling wine preferences by data mining from physicochemical properties", Decision Support Systems, Vol. 47, No. 4, P. 547–553. DOI: 10.1016/j.dss.2009.05.016

Bhatlawande, S., Srivastava, S., Shilaskar, S. (2023), "Information extraction from digital receipts and bank transactions using machine learning", 2023 International Conference on Integration of Computational Intelligent System (ICICIS), Pune, India, P. 1–5. DOI: 10.1109/ICICIS56802.2023.10430289

Wu, Z., Pulido, V. (2022), "Statistical analyses for fantasy sports", 2022 IEEE Integrated STEM Education Conference (ISEC), Princeton, NJ, USA, P. 269. DOI: 10.1109/ISEC54952.2022.10025111

Fu, W., Widagdo, T. E. (2022), "SQL interface development for spatial data retrieval on Cassandra", 2022 International Conference on Data and Software Engineering (ICoDSE), Denpasar, Indonesia, P. 138–143. DOI: 10.1109/ICoDSE56892.2022.9971942

Chatziantoniou, D., Kantere, V., Antoniou, N., Gantzia, A. (2022), "Data virtual machines: simplifying data sharing, exploration & querying in big data environments", 2022 IEEE International Conference on Big Data (Big Data), Osaka, Japan, P. 373–380. DOI: 10.1109/BigData55660.2022.10020508

Geetha, V., Sujatha, N. (2024), "An overview of descriptive analytics and data visualization", 2024 5th International Conference on Smart Electronics and Communication (ICOSEC), Trichy, India, P. 1158–1163. DOI: 10.1109/ICOSEC61587.2024.10722273

Du, B., Deng, F. (2022), "The method of network intrusion detection based on descriptive statistics model and Logistic model", 2022 International Conference on Machine Learning and Knowledge Engineering (MLKE), Guilin, China, P. 160–163. DOI: 10.1109/MLKE55170.2022.00037

Cuartero, A., Paoletti, M. E., García-Rodríguez, P., Haut, J. M. (2022), "PyCircularStats: a Python-based tool for remote sensing circular statistics and graphical analysis", IGARSS 2022 – 2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, P. 2876–2879. DOI: 10.1109/IGARSS46834.2022.9884758

Camacho, J., Wasielewska, K., Bro, R., Kotz, D. (2024), "Interpretable feature learning in multivariate big data analysis for network monitoring", IEEE Transactions on Network and Service Management, Vol. 21, No. 3, P. 2926–2943. DOI: 10.1109/TNSM.2024.3368501

Wang, Q., Mazor, T., Harbig, T. A., Cerami, E., Gehlenborg, N. (2022), "ThreadStates: state-based visual analysis of disease progression", IEEE Transactions on Visualization and Computer Graphics, Vol. 28, No. 1, P. 238–247. DOI: 10.1109/TVCG.2021.3114840

Baes, M., Herrera, C., Neufeld, A., Ruyssen, P. (2023), "Low-rank plus sparse decomposition of covariance matrices using neural network parametrization", IEEE Transactions on Neural Networks and Learning Systems, Vol. 34, No. 1, P. 171–185. DOI: 10.1109/TNNLS.2021.3091598

Burgueño-Romero, A. M., Benítez-Hidalgo, A., Barba-González, C., Aldana-Montes, J. F. (2024), "Towards an open-source MLOps architecture", IEEE Software, DOI: 10.1109/MS.2024.3421675

Muralikrishna, B. S. R. (2022), "Clinical diagnosis of Alzheimer’s disease employing support vector machine", 2022 IEEE International Conference on Distributed Computing and Electrical Circuits and Electronics (ICDCECE), Ballari, India, P. 1–5. DOI: 10.1109/ICDCECE53908.2022.9792897

Ghorbani, R., Ghousi, R., Makui, A., Atashi, A. (2020), "A new hybrid predictive model to predict the early mortality risk in intensive care units on a highly imbalanced dataset", IEEE Access, Vol. 8, P. 141066–141079. DOI: 10.1109/ACCESS.2020.3013320

Charitha, C., Devi Chaitrasree, A., Varma, P. C., Lakshmi, C. (2022), "Type-II diabetes prediction using machine learning algorithms", 2022 International Conference on Computer Communication and Informatics (ICCCI), Coimbatore, India, P. 1–5. DOI: 10.1109/ICCCI54379.2022.9740844

Bezruk, V. M., Krivenko, S. A., Kyrsanov, O. O., Kryvenko, S. S., Kryvenko, L. S. (2023), "Training the machine learning model for clinical IoT data and device interoperability", 2023 12th Mediterranean Conference on Embedded Computing (MECO), Budva, Montenegro, P. 1–6. DOI: 10.1109/MECO58584.2023.10154963

Smith, M. L., Kwembe, T. A. (2023), "Application of machine learning classifiers interfacing Google Colab and Sklearn to intrusion detection CSE-CIC-IDS2018 dataset", 2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE), Las Vegas, NV, USA, P. 1884–1890. DOI: 10.1109/CSCE60160.2023.00311

McClish, D. K. (1989), "Analyzing a portion of the ROC curve", Medical Decision Making, Vol. 9, No. 3, P. 190–195. DOI: 10.1177/0272989X8900900307

Published

2025-03-31

How to Cite

Kyrsanov, O., & Kryvenko, S. (2025). Machine learning model for predicting substance properties based on its physicochemical properties. INNOVATIVE TECHNOLOGIES AND SCIENTIFIC SOLUTIONS FOR INDUSTRIES, 1(31), 151–165. https://doi.org/10.30837/2522-9818.2025.1.151

Issue

Section

ELECTRONICS, TELECOMMUNICATION SYSTEMS & COMPUTER NETWORKS