A method to enhance Apache Spark performance based on data segmentation and configuration parameters settings

Authors

Serhii Minukhin, Nikita Koptilov

DOI:

https://doi.org/10.30837/ITSSI.2024.27.128

Keywords:

framework; input file; segmentation; test data; data generator; execution time; configuration parameters; Spark; Hadoop; MapReduce.

Abstract

Modern big data frameworks expose many configuration parameters, and setting them effectively is a key problem in raising framework productivity. The object of the research is the computational processes of big data processing with high-performance frameworks. The subject is methods and approaches for effectively setting framework configuration parameters under the constraints of virtualization environments and local resources. The purpose of the study is to improve the performance of Apache Spark and Apache Hadoop deployment modes through a combined approach that includes preprocessing segmentation of the input data and the setting of basic and additional configuration parameters that take into account the limitations of the virtual environment and local resources. Achieving this goal involves the following tasks: create a synthesized WordCount test data set to which input data segmentation methods can be applied; determine the general and framework-specific Apache Spark and Apache Hadoop configuration parameters that most affect performance in the Spark Standalone and Hadoop YARN (FIFO) deployment modes; justify changes to the default values of these parameters by setting the level of parallelism, the number of partitions of the input file according to the number of processor cores, and the number of tasks assigned to each core and to the system executor; conduct experiments to substantiate the theoretical results and demonstrate their use in practice.

Methods. The research used the following methods: statistical analysis; a method of generating test data of arbitrary volume from defined segmentation characteristics; a systematic approach to the comprehensive evaluation and analysis of framework performance based on selected configuration parameters.

Results. Using the developed system of parameters for evaluating the performance of the studied frameworks, experiments were carried out that include: applying input data segmentation, in which the input file is divided into paragraphs (lines) for different ranges of the number of words per line and the number of letters per word; and setting the main and framework-specific parameters, in particular partitioning and parallelism, taking into account the characteristics of the virtual environment and the local resource. The results were analyzed in detail, yielding recommendations for choosing the optimal values of the data segmentation parameters and the configuration parameters.

Conclusions. The experimental results indicate that the proposed methods of setting the Spark and Hadoop configuration parameters increase processing productivity on average by up to 25–30% for small files (0.5–1 GB) and by up to 10–20% for large files (1.5–2.5 GB). At the same time, the average execution time of a single task decreased by 10–15% for files of different sizes and with different numbers of words per line.
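The segmentation of test data by words per line and letters per word can be illustrated with a small generator. This is a minimal sketch, not the authors' generator: the function names, the lowercase ASCII alphabet, the uniform sampling and the seeding are all assumptions made for illustration, since the abstract does not specify them.

```python
import random
import string


def generate_line(rng, words_range, letters_range):
    """Build one line with a random number of words, each of random length.

    words_range and letters_range are inclusive (min, max) tuples.
    Uniform sampling is an assumption, not a detail from the paper.
    """
    n_words = rng.randint(*words_range)
    words = (
        "".join(rng.choices(string.ascii_lowercase, k=rng.randint(*letters_range)))
        for _ in range(n_words)
    )
    return " ".join(words)


def generate_lines(target_bytes, words_range, letters_range, seed=0):
    """Yield lines until at least target_bytes of text has been produced."""
    rng = random.Random(seed)  # seeded so a data set can be regenerated
    produced = 0
    while produced < target_bytes:
        line = generate_line(rng, words_range, letters_range)
        produced += len(line) + 1  # +1 for the newline; ASCII, so 1 char = 1 byte
        yield line


def write_dataset(path, target_bytes, words_range, letters_range, seed=0):
    """Write a synthetic WordCount input file of roughly target_bytes."""
    with open(path, "w", encoding="ascii", newline="\n") as out:
        for line in generate_lines(target_bytes, words_range, letters_range, seed):
            out.write(line + "\n")
```

For example, `write_dataset("wordcount_input.txt", 512 * 1024 * 1024, (5, 10), (3, 8))` would write roughly 0.5 GB of lines with 5 to 10 words of 3 to 8 letters each; the file name and ranges here are hypothetical.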
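Tying the level of parallelism and the number of partitions to the number of processor cores and the tasks assigned to each core can be sketched as a small calculation. The helper names and the default of two tasks per core are assumptions for illustration, not values from the paper; the property names `spark.cores.max`, `spark.executor.cores` and `spark.default.parallelism` are standard Spark configuration keys.

```python
def parallelism_settings(total_cores, tasks_per_core=2, cores_per_executor=None):
    """Derive Spark properties from the available cores.

    tasks_per_core=2 is an illustrative default, not a value from the paper.
    If cores_per_executor is omitted, a single executor uses all cores.
    """
    if total_cores < 1 or tasks_per_core < 1:
        raise ValueError("total_cores and tasks_per_core must be positive")
    if cores_per_executor is None:
        cores_per_executor = total_cores
    if not 1 <= cores_per_executor <= total_cores:
        raise ValueError("cores_per_executor must be between 1 and total_cores")

    # One partition per task slot: cores multiplied by tasks per core.
    partitions = total_cores * tasks_per_core
    return {
        "spark.cores.max": str(total_cores),
        "spark.executor.cores": str(cores_per_executor),
        "spark.default.parallelism": str(partitions),
    }


def as_submit_flags(settings):
    """Render the settings as --conf flags for spark-submit."""
    return " ".join(f"--conf {key}={value}" for key, value in settings.items())


if __name__ == "__main__":
    # Hypothetical virtual machine with 4 cores and 3 tasks per core.
    print(as_submit_flags(parallelism_settings(4, tasks_per_core=3)))
```

The same partition count could also be passed as `minPartitions` when reading the input file with `SparkContext.textFile`, so that the number of input partitions matches the configured parallelism.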

Author Biographies

Serhii Minukhin, Simon Kuznets Kharkiv National University of Economics

Doctor of Sciences (Engineering), Professor, Professor at the Department of Information Systems

Nikita Koptilov, Simon Kuznets Kharkiv National University of Economics

Master's Degree Student, Department of Information Systems

Published

2024-03-30

How to Cite

Minukhin, S., & Koptilov, N. (2024). A method to enhance Apache Spark performance based on data segmentation and configuration parameters settings. INNOVATIVE TECHNOLOGIES AND SCIENTIFIC SOLUTIONS FOR INDUSTRIES, (1(27)), 128–139. https://doi.org/10.30837/ITSSI.2024.27.128