Comparative analysis of methods for prediction continuous numerical features on big datasets

Eduard Kinshakov; Yuliia Parfenenko; Vira Shendryk

doi:10.15587/2706-5448.2021.244003

Authors

Eduard Kinshakov, Sumy State University, Ukraine, https://orcid.org/0000-0001-7116-7244
Yuliia Parfenenko, Sumy State University, Ukraine, https://orcid.org/0000-0003-4377-5132
Vira Shendryk, Sumy State University, Ukraine, https://orcid.org/0000-0001-8325-3115

Keywords:

machine learning, data analysis, big data, linear regression, decision tree, random forest

Abstract

The object of research is the process of choosing a method for predicting continuous numerical features on big datasets. The study is motivated by the need, in many subject areas, to predict performance indicators from data that are collected from different sources and presented in different formats, which is a big data analysis task. To solve this problem, three statistical analysis methods were considered: multiple linear regression, decision trees, and random forest. A big dataset was built without reference to a specific subject area, then preprocessed and analyzed to establish the correlation between its features. The dataset was processed with parallel computing, using the Dask library for Python. Working with big data normally requires significant computing resources; this approach makes it possible to process the data without powerful hardware. Prediction models were built using multiple linear regression, decision trees, and random forest; the prediction results were visualized and the reliability of the constructed models was analyzed. Based on the calculated prediction error, the random forest method gave the highest prediction accuracy among the methods considered. With this method, the prediction accuracy for a dataset of numerical features was approximately 97 %, which indicates high reliability of the constructed model. It can therefore be concluded that the random forest method is suitable for prediction problems on large datasets: it can be used for datasets with a large number of features and is not sensitive to data scaling. The developed Python software application can be used to predict numerical features from different subject areas; the prediction results are written to a text file.
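The abstract mentions processing the dataset in parallel and establishing the correlation between features. The sketch below illustrates the general partition-and-combine idea behind that kind of processing, using only the Python standard library. It is not the authors' code and does not use the Dask API; all function names (`partial_sums`, `combine`, `chunked_pearson`) are hypothetical and chosen for illustration. Each partition is reduced to a small summary, the summaries are added together, and Pearson's correlation coefficient is computed from the totals, so the full dataset never has to be held in memory at once.

```python
import math
from typing import Iterator, List, Sequence, Tuple

Row = Tuple[float, float]  # one (feature_x, feature_y) observation


def partial_sums(chunk: Sequence[Row]) -> Tuple[float, ...]:
    """Reduce one partition to the count and sums needed for Pearson's r."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    syy = sum(y * y for _, y in chunk)
    sxy = sum(x * y for x, y in chunk)
    return n, sx, sy, sxx, syy, sxy


def combine(parts: List[Tuple[float, ...]]) -> Tuple[float, ...]:
    """Add partition summaries element-wise; the order of partitions is irrelevant."""
    return tuple(sum(values) for values in zip(*parts))


def pearson_from_sums(n, sx, sy, sxx, syy, sxy) -> float:
    """Pearson's r from aggregated sums (undefined if either feature is constant)."""
    cov = n * sxy - sx * sy
    var_x = n * sxx - sx * sx
    var_y = n * syy - sy * sy
    return cov / math.sqrt(var_x * var_y)


def chunked(rows: Sequence[Row], size: int) -> Iterator[Sequence[Row]]:
    """Yield consecutive partitions of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def chunked_pearson(rows: Sequence[Row], chunk_size: int) -> float:
    """Correlation computed partition by partition, then combined."""
    parts = [partial_sums(chunk) for chunk in chunked(rows, chunk_size)]
    return pearson_from_sums(*combine(parts))


if __name__ == "__main__":
    data = [(float(i), 3.0 * i + (i % 5)) for i in range(1000)]
    print(round(chunked_pearson(data, chunk_size=64), 6))
```

Because the partition summaries are independent of one another, they could be computed by separate workers and combined afterwards. The sum-of-products formula used here is simple but can lose precision when values are very large relative to their spread; a production implementation would more likely rely on a library routine.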

Author Biographies

Eduard Kinshakov, Sumy State University

Postgraduate Student

Department of Information Technology

Yuliia Parfenenko, Sumy State University

PhD, Associate Professor

Department of Information Technology

Vira Shendryk, Sumy State University

PhD, Associate Professor

Department of Information Technology
