Enhancing the performance of distributed big data processing systems using Hadoop and Polybase
DOI:
https://doi.org/10.15587/1729-4061.2018.139630
Keywords:
Hadoop, MapReduce, HDFS, PolyBase SQL Server, T-SQL, grid computing, scaling, scalable group PolyBase, external objects, Hortonworks Data Platform
Abstract
An approach to improving the performance of distributed information systems by combining a Hadoop cluster with PolyBase, a component of SQL Server, was considered. The problem is relevant because business projects need to process Big Data held in different representations in order to solve diverse tasks. Methods and technologies for building hybrid data warehouses that combine SQL and NoSQL data were analyzed. The analysis showed that the Hadoop distributed computing environment is currently the most common technology for Big Data processing. Existing technologies that use connectors to organize and access Hadoop cluster data from SQL-like DBMSs were analyzed, and comparative quantitative estimates were presented for the Hive and Sqoop connectors when exporting data to the Hadoop warehouse. The specific features of Big Data processing in the architecture of Hadoop-based distributed cluster computing were analyzed. PolyBase was described as a SQL Server component that acts as a bridge between SQL Server and Hadoop for data of both SQL and NoSQL types. The composition of a model computing setup, based on a virtual machine and used to configure PolyBase and Hadoop together for solving test tasks, was described. A methodological toolset for installing and configuring Hadoop and PolyBase SQL Server software was developed, taking into account constraints on computing capacity. Queries that use PolyBase together with the Hadoop data warehouse for processing Big Data were considered. Absolute and relative metrics were proposed for assessing system performance.
For a large volume of test data, the experimental results were presented and analyzed. They showed an improvement in the performance of the distributed information system in terms of query execution time and the memory occupied by the temporary tables created during execution. A comparison of the studied technology with existing Hadoop cluster connectors showed the advantage of PolyBase over the Sqoop and Hive connectors. The results of the research could be used in the scientific and training experiments of organizations that are adopting modern IT technologies.
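The abstract names absolute and relative performance metrics without defining them. One plausible reading, given here only as a minimal sketch, is that the absolute metric is the difference between a baseline measurement (for example, query time through a Sqoop or Hive connector) and the PolyBase measurement, and the relative metric is that difference as a fraction of the baseline. The function names and the sample timings are hypothetical and are not taken from the study.

```python
def absolute_gain(baseline: float, candidate: float) -> float:
    """Assumed absolute metric: how much lower the candidate measurement is."""
    return baseline - candidate


def relative_gain(baseline: float, candidate: float) -> float:
    """Assumed relative metric: the absolute gain as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - candidate) / baseline


if __name__ == "__main__":
    # Hypothetical timings in seconds, for illustration only (not study results).
    connector_time = 120.0
    polybase_time = 90.0
    print(absolute_gain(connector_time, polybase_time))
    print(relative_gain(connector_time, polybase_time))
```

The same two functions apply unchanged to the other quantity mentioned in the abstract, the memory occupied by temporary tables, as long as both measurements use the same unit.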
License
Copyright (c) 2018 Sergii Minukhin, Victor Fedko, Yurii Gnusov
This work is licensed under a Creative Commons Attribution 4.0 International License.