Enhancing the performance of distributed big data processing systems using Hadoop and Polybase

Authors

Sergii Minukhin, Victor Fedko, Yurii Gnusov

DOI:

https://doi.org/10.15587/1729-4061.2018.139630

Keywords:

Hadoop, MapReduce, HDFS, PolyBase SQL Server, T-SQL, grid computing, scaling, scalable group PolyBase, external objects, Hortonworks Data Platform

Abstract

This paper considers an approach to improving the performance of distributed information systems by jointly using a Hadoop cluster and PolyBase, a component of SQL Server. The problem is relevant because business projects pose diverse tasks that require processing Big Data in different representations. Methods and technologies for building hybrid data warehouses that combine SQL and NoSQL data were analyzed. The analysis showed that the Hadoop distributed computing environment is currently the most common technology for Big Data processing. Existing technologies that use connectors to organize and access data in a Hadoop cluster from SQL-like DBMSs were analyzed. Comparative quantitative estimates of the Hive and Sqoop connectors when exporting data to the Hadoop warehouse were presented. The specific features of Big Data processing in the architecture of Hadoop-based distributed cluster computing were analyzed. The features of PolyBase as a SQL Server component that bridges SQL Server and Hadoop data of the SQL and NoSQL types were described. The composition of the model computing installation, built on a virtual machine, for the joint configuration of PolyBase and Hadoop to solve test tasks was described. A methodological toolset for installing and configuring the Hadoop and PolyBase SQL Server software was developed, taking into account constraints on computing capacity. Queries that use PolyBase and the Hadoop data warehouse to process Big Data were considered. Absolute and relative metrics were proposed to assess system performance.
For a large volume of test data, the experimental results were presented and analyzed. They show an increase in the performance of the distributed information system in terms of query execution time and the amount of memory occupied by the temporary tables created during query execution. A comparative analysis of the studied technology against existing connectors to a Hadoop cluster showed the advantage of PolyBase over the Sqoop and Hive connectors. The results of the research could be used in scientific and training experiments by organizations implementing modern IT technologies.

Author Biographies

Sergii Minukhin, Simon Kuznets Kharkiv National University of Economics, Nauky ave., 9-А, Kharkiv, Ukraine, 61166

Doctor of Technical Sciences, Professor

Department of Information Systems

Victor Fedko, Simon Kuznets Kharkiv National University of Economics, Nauky ave., 9-А, Kharkiv, Ukraine, 61166

PhD, Associate Professor

Department of Information Systems

Yurii Gnusov, Kharkiv National University of Internal Affairs, L. Landau ave., 27, Kharkiv, Ukraine, 61080

PhD, Head of Department

Department of Cybersecurity

Published

2018-07-27

How to Cite

Minukhin, S., Fedko, V., & Gnusov, Y. (2018). Enhancing the performance of distributed big data processing systems using Hadoop and Polybase. Eastern-European Journal of Enterprise Technologies, 4(2 (94)), 16–28. https://doi.org/10.15587/1729-4061.2018.139630