A method to enhance Apache Spark performance based on data segmentation and configuration parameters settings
DOI:
https://doi.org/10.30837/ITSSI.2024.27.128
Keywords:
framework; input file; segmentation; test data; data generator; execution time; configuration parameters; Spark; Hadoop; MapReduce.
Abstract
Modern big data processing frameworks pose the problem of raising processing performance through effective tuning of their many configuration parameters. The object of the research is the computational processes of big data processing with high-performance frameworks. The subject is the methods and approaches for effectively setting framework configuration parameters under the constraints of virtualization environments and local resources. The purpose of the study is to improve the performance of the Apache Spark and Apache Hadoop deployment modes through a combined approach that includes preprocessing segmentation of the input data and the setting of basic and additional configuration parameters that take into account the limitations of the virtual environment and local resources.
Achieving this goal involves the following tasks: create a synthesized WordCount test data set for applying input data segmentation methods; determine the set of general and specific Apache Spark and Apache Hadoop configuration parameters that most affect framework performance in the Spark Standalone and Hadoop YARN (FIFO) deployment modes; justify changes to the default values of the configuration parameters by setting the level of parallelism, the number of partitions of the input file according to the number of processor cores, and the number of tasks assigned to each core and to the system executor; conduct experimental research to substantiate the theoretical results and demonstrate their use in practice.
Methods. The research used the following methods: statistical analysis; a method of generating test data of arbitrary volume based on defined segmentation characteristics; a systematic approach to the comprehensive evaluation and analysis of framework performance based on the selected configuration parameters.
Results. Using the developed system of parameters for evaluating the performance of the studied frameworks, experiments were carried out that include: applying the input data segmentation method, which divides the input file into paragraphs (lines), for different ranges of the number of words per line and the number of letters per word; setting the main and the specific parameters, in particular partitioning and parallelism, taking into account the characteristics of the virtual environment and the local resource. Based on the obtained results, the proposed methods for improving the performance of the studied frameworks were analyzed in detail, and recommendations were made for choosing the optimal values of the data segmentation parameters and the configuration parameters.
Conclusions. The experimental results allow us to conclude that the proposed methods of setting the Spark and Hadoop configuration parameters increase processing performance by up to 25–30% on average for small files (0.5–1 GB) and by up to 10–20% on average for large files (1.5–2.5 GB). At the same time, the average execution time of one task decreased by 10–15% for files of different sizes and with different numbers of words per line.
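The two ideas named in the abstract, generating segmented WordCount test data and deriving the level of parallelism from the number of processor cores, can be illustrated with a minimal Python sketch. It is not the authors' generator or their parameter values: the function names, the default of 2 tasks per core, and the `min_partitions` key are assumptions made here for illustration. Only `spark.default.parallelism` is an actual Spark property name.

```python
import random
import string


def generate_lines(num_lines, words_per_line, letters_per_word, seed=0):
    """Yield synthetic text lines for a WordCount-style test.

    words_per_line and letters_per_word are inclusive (min, max) ranges;
    both minimums are assumed to be at least 1.
    """
    rng = random.Random(seed)  # fixed seed keeps the data set reproducible
    for _ in range(num_lines):
        n_words = rng.randint(*words_per_line)
        words = (
            "".join(
                rng.choices(
                    string.ascii_lowercase,
                    k=rng.randint(*letters_per_word),
                )
            )
            for _ in range(n_words)
        )
        yield " ".join(words)


def suggest_parallelism(total_cores, tasks_per_core=2):
    """Derive starting values from the core count (illustrative only).

    tasks_per_core=2 is an assumed default, not a value from the paper.
    """
    if total_cores < 1 or tasks_per_core < 1:
        raise ValueError("total_cores and tasks_per_core must be >= 1")
    partitions = total_cores * tasks_per_core
    return {
        # Real Spark property; Spark expects string values.
        "spark.default.parallelism": str(partitions),
        # Hypothetical key: a partition count for the input file.
        "min_partitions": partitions,
    }


if __name__ == "__main__":
    for line in generate_lines(3, (3, 7), (2, 5), seed=1):
        print(line)
    print(suggest_parallelism(total_cores=4, tasks_per_core=3))
```

In a PySpark job, the first value could be set through `SparkConf.set` and the second passed as the `minPartitions` argument of `SparkContext.textFile`. Suitable values depend on the virtual environment and the file size, so they are best treated as a starting point for measurement rather than as recommended settings.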
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.