Preliminary data classification by multilevel segmentation of histograms for clustering of hypercubes

Authors

DOI:

https://doi.org/10.15587/2706-5448.2020.220428

Keywords:

cumulative histogram, multilevel segmentation, piecewise linear approximation, hierarchical clustering, decomposition of data space.

Abstract

The object of research is an algorithm for the classification of large data based on the hierarchical clustering algorithm. The nonlinear complexity of the clustering algorithm does not allow for data samples of 5–10 thousand and above. To classify data, it is necessary to pre-group them. Therefore, the subject of research is the data segmentation algorithm based on piecewise linear approximation.

In the course of the study, let’s use hierarchical clustering algorithms, the method of piecewise linear approximation of the cumulative histogram, calculated by a special procedure, and the procedure for searching for segmentation thresholds.

The computational complexity of the classical hierarchical algorithm reaches the value of O(N3), and certain steps to limit the search can achieve the value of O(N2), which is confirmed by experiments to study the dependence of the hierarchical tree on the initial sample. An approximate approach to key clustering with partitioning of a set of basic keys is implemented. To reduce further the complexity of the hierarchical clustering algorithm, a decomposition approach based on splitting the initial sample of large data into a number of subsets is proposed. In this article to use the hierarchical clustering algorithm for big data classification the preliminary decomposition method is proposed. It is based on multilevel segmentation of cumulative or ordinary histograms obtained for every feature coordinate characterizing object of data. Thresholds of multilevel segmentation are obtained by piecewise linear approximation of histogram functions. Build hypercubes of data are being accepted as objects for three stages clustering algorithm.

Powerful tool for data classification is obtained, the use of which allows carrying out many experiments with data of various types to identify patterns among the data features. Its application is intended for the processing of patient data, molecular structures, economic problems for making optimal treatment decisions, diagnostics and modeling. Thanks to this approach, data classification can be performed online to obtain the results of direct analysis when data is received, for example, from spacecraft.

Author Biographies

Roman Melnyk, Lviv Polytechnic National University, 12, Bandera str., Lviv, Ukraine, 79013

Doctor of Technical Sciences, Professor

Department of Software

Ruslan Tushnytskyy, Lviv Polytechnic National University, 12, Bandera str., Lviv, Ukraine, 79013

PhD, Associate Professor

Department of Software

Roman Kvit, Lviv Polytechnic National University, 12, Bandera str., Lviv, Ukraine, 79013

PhD, Associate Professor

Department of Higher Mathematics

Tetyana Salo, Lviv Polytechnic National University, 12, Bandera str., Lviv, Ukraine, 79013

PhD, Associate Professor

Department of Higher Mathematics

References

  1. Yip, A. M., Ding, C., Chan, T. F. (2006). Dynamic cluster formation using level set methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28 (6), 877–889. doi: http://doi.org/10.1109/tpami.2006.117
  2. Viattchenin, D. (2009). Developments in fuzzy clustering. The collection of papers. Minsk: Vever, 216.
  3. Sandeep, V. (2010). Effective level sets and shape detection: an application to natural images. Gulbarga: Electronics and Communications Engineering, 132.
  4. Grady, L., Schwartz, E. L. (2006). Isoperimetric graph partitioning for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28 (3), 469–475. doi: http://doi.org/10.1109/tpami.2006.57
  5. Pavan, M., Pelillo, M. (2007). Dominant Sets and Pairwise Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29 (1), 167–172. doi: http://doi.org/10.1109/tpami.2007.250608
  6. Foggia, P., Percannella, G., Vento, M. (2014). Graph matching and learning in pattern recognition in the last 10 years. International Journal of Pattern Recognition and Artificial Intelligence, 28 (1), 1450001. doi: http://doi.org/10.1142/s0218001414500013
  7. Dong, X., Shen, J., Shao, L., Van Gool, L. (2016). Sub-Markov Random Walk for Image Segmentation. IEEE Transactions on Image Processing, 25 (2), 516–527. doi: http://doi.org/10.1109/tip.2015.2505184
  8. Buluç, A., Meyerhenke, H., Safro, I., Sanders, P., Schulz, C. (2016). Recent Advances in Graph Partitioning. Lecture Notes in Computer Science, 117–158. doi: http://doi.org/10.1007/978-3-319-49487-6_4
  9. Wang, N., Gao, X., Tao, D., Yang, H., Li, X. (2018). Facial feature point detection: A comprehensive survey. Neurocomputing, 275 (31), 50–65. doi: http://doi.org/10.1016/j.neucom.2017.05.013
  10. Ding, C., He, X. (2005). Cluster aggregate inequality and multilevel hierarchical clustering. Proc. 9th European Conf. Principles of Data Mining and Knowledge Discovery, 71–83. doi: http://doi.org/10.1007/11564126_12
  11. Melnyk, R., Tushnytskyy, R. (2009). Algorithm accuracy and complexity optimization by inequality merging for data clustering. Proc. of the 10-th Intern. Conf. The Experience of Designing and Application of CAD Systems in Microelectronics (CADSM), 453–455.
  12. Melnyk, R., Aleekseev, O. (2006). Clustering of pattern keys base on decomposition of their set. Editing and processing of information, 24 (100), 110−114.
  13. Melnyk, R., Tushnytskyy, R. (2008). Pattern keys clustering using large-scale dataset cascading decomposition. Computer Science and Information Technology, 604, 249−254.
  14. Shenkoya, T., Kim, E. (2019). A case study of the daedeok innopolis innovation cluster and its implications for Nigeria. World Technopolis Review, 8 (2), 104−119. doi: http://dx.doi.org/10.7165/wtr19a1218.21
  15. Ramer, U. (1972). An iterative procedure for the polygonal approximation of plane curves. Computer Graphics and Image Processing, 1 (3), 244–256. doi: http://doi.org/10.1016/s0146-664x(72)80017-0
  16. Douglas, D., Peucker, T. (1973). Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica: The International Journal for Geographic Information and Geovisualization, 10 (2), 112–122.

Downloads

Published

2020-12-30

How to Cite

Melnyk, R., Tushnytskyy, R., Kvit, R., & Salo, T. (2020). Preliminary data classification by multilevel segmentation of histograms for clustering of hypercubes. Technology Audit and Production Reserves, 6(2(56), 47–55. https://doi.org/10.15587/2706-5448.2020.220428

Issue

Section

Reports on research projects