Message clusterization method based on archive transformation

Authors

  • Олексій Олександрович Сірий, Kyiv National University of Ukraine «Kyiv Polytechnic Institute», 37 Peremogy Ave., Kyiv, Ukraine, 03056

DOI:

https://doi.org/10.15587/2313-8416.2015.44364

Keywords:

archiving, entropy, text recognition, spam, phishing, LZ77, Huffman algorithm

Abstract

This article presents a method for identifying the parameters of a text and classifying texts by means of archiving (data compression). The method relies on the direct relationship between entropy and the size of a text compressed with the LZ77 and Huffman algorithms. The characteristics identified in this way help to determine a text's language, style, and authorship, and to cluster data files by topical relevance.
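The idea can be illustrated with a minimal sketch. This is not the author's implementation. It assumes Python's standard `zlib` module, whose DEFLATE format combines LZ77 matching with Huffman coding, and it uses the growth in compressed size when a sample is appended to a reference text as a rough stand-in for cross-entropy, in the spirit of the zipping approach of Benedetto et al. (2002). The function names `compressed_size`, `cross_cost`, and `classify` are hypothetical and chosen only for this illustration.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after DEFLATE (LZ77 + Huffman) at the highest level."""
    return len(zlib.compress(data, 9))


def cross_cost(reference: bytes, sample: bytes) -> int:
    """Extra compressed bytes needed for sample when it follows reference.

    A small value means the sample reuses many substrings of the reference,
    which is taken here as a sign of similar language, style, or topic.
    """
    return compressed_size(reference + sample) - compressed_size(reference)


def classify(sample: str, references: Mapping[str, str]) -> str:
    """Return the label of the reference text with the smallest extra cost."""
    encoded = sample.encode("utf-8")
    return min(
        references,
        key=lambda label: cross_cost(references[label].encode("utf-8"), encoded),
    )
```

Given a small dictionary of labelled reference texts, for example one per language or one per topic, `classify(sample, references)` returns the label whose text lets the sample compress best. With very short texts the fixed overhead of the compressor dominates, so in practice the reference texts would need to be considerably longer than in a toy example.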

Author Biography

Олексій Олександрович Сірий, Kyiv National University of Ukraine «Kyiv Polytechnic Institute», 37 Peremogy Ave., Kyiv, Ukraine, 03056

Department of Information Security

Institute of Physics and Technology

References

Guzella, T. S., Caminhas, W. M. (2009). A review of machine learning approaches to Spam filtering. Expert Systems with Applications, 36 (7), 10206–10222. doi: 10.1016/j.eswa.2009.02.037

Schwartz, A. (2004). SpamAssassin. O’Reilly Media, 224.

Sahami, M., Dumais, S., Heckerman, D., Horvitz, E. (1998). A Bayesian approach to filtering junk email. AAAI Technical Report, WS-98-05.

Vatolin, D., Ratushnyak, A., Smirnov, M., Yoockin, V. (2002). Data compression methods. Structure of archivers, image and video compression. Moscow, Russia: Dialog-MIFI, 384.

Ziv, J., Lempel, A. (1977). A Universal Algorithm for Sequential Data Compression. IEEE Transactions on Information Theory, IT-23 (3), 337–343.

Benedetto, D., Caglioti, E., Loreto, V. (2002). Language Trees and Zipping. Physical Review Letters, 88 (4), 1–4. doi: 10.1103/physrevlett.88.048702

Algorithms, methods, source codes. Available at: http://algolist.manual.ru/compress/standard/huffman.php

Published

2015-06-21

Issue

Section

Technical Sciences