Development of a document classification method by using geodesic distance to calculate similarity of documents

Hung Vo-Trung

doi:10.15587/1729-4061.2020.203866

Development of a document classification method by using geodesic distance to calculate similarity of documents

Автор(и)

Hung Vo-Trung University of Technology and Education University of Danang 48 Cao Thang, Danang, Vietnam, 55000, В'єтнам https://orcid.org/0000-0002-4473-4458

DOI:

https://doi.org/10.15587/1729-4061.2020.203866

Ключові слова:

text classification, machine learning, geodesic distance, euclidian distance, SVM, NLP, kernel function

Анотація

Currently, the Internet has given people the opportunity to access to human knowledge quickly and conveniently through various channels such as Web pages, social networks, digital libraries, portals... However, with the process of exchanging and updating information quickly, the volume of information stored (in the form of digital documents) is increasing rapidly. Therefore, we are facing challenges in representing, storing, sorting and classifying documents.

In this paper, we present a new approach to text classification. This approach is based on semi-supervised machine learning and Support Vector Machine (SVM). The new point of the study is that instead of calculating the distance between the vectors by Euclidean distance, we use geodesic distance. To do this, the text must first be expressed as an n-dimensional vector. In the n-dimensional vector space, each vector is represented by one point; use geodesic distance to calculate the distance from a point to nearby points and connect into a graph. The classification is based on calculating the shortest path between vertices on the graph through a kernel function. We conducted experiments on articles taken from Reuters on 5 different topics. To evaluate the proposed method, we tested the SVM method with the traditional calculation based on Euclidean distance and the method we proposed based on geodesic distance. The experiment was performed on the same data set of 5 topics: Business, Markets, World, Politics, and Technology. The results showed that the correct classification rate is better than the traditional SVM method based on Euclidean distance (average of 3.2 %)

Біографія автора

Hung Vo-Trung, University of Technology and Education University of Danang 48 Cao Thang, Danang, Vietnam, 55000

Doctor of Computer Science (Grenoble INP, France)

Professor, Vice-Rector

Посилання

Hartmann, J., Huppertz, J., Schamp, C., Heitmann, M. (2019). Comparing automated text classification methods. International Journal of Research in Marketing, 36 (1), 20–38. doi: https://doi.org/10.1016/j.ijresmar.2018.09.009
Kadhim, A. I. (2019). Survey on supervised machine learning techniques for automatic text classification. Artificial Intelligence Review, 52 (1), 273–292. doi: https://doi.org/10.1007/s10462-018-09677-1
Vapnik, V. N. (2013). The nature of statistical learning theory. Springer. doi: https://doi.org/10.1007/978-1-4757-3264-1
Pratama, B. Y., Sarno, R. (2015). Personality classification based on Twitter text using Naive Bayes, KNN and SVM. 2015 International Conference on Data and Software Engineering (ICoDSE). doi: https://doi.org/10.1109/icodse.2015.7436992
Shah, F. P., Patel, V. (2016). A review on feature selection and feature extraction for text classification. 2016 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET). doi: https://doi.org/10.1109/wispnet.2016.7566545
Chatterjee, S., George Jose, P., Datta, D. (2019). Text Classification Using SVM Enhanced by Multithreading and CUDA. International Journal of Modern Education and Computer Science, 11 (1), 11–23. doi: https://doi.org/10.5815/ijmecs.2019.01.02
Sarkar, A., Chatterjee, S., Das, W., Datta, D. (2015). Text Classification using Support Vector Machine. International Journal of Engineering Science Invention, 4 (11), 33–37.
Cervantes, J., García Lamont, F., López-Chau, A., Rodríguez Mazahua, L., Sergio Ruíz, J. (2015). Data selection based on decision tree for SVM classification on large data sets. Applied Soft Computing, 37, 787–798. doi: https://doi.org/10.1016/j.asoc.2015.08.048
Cheng, F., Yang, K., Zhang, L. (2015). A Structural SVM Based Approach for Binary Classification under Class Imbalance. Mathematical Problems in Engineering, 2015, 1–10. doi: https://doi.org/10.1155/2015/269856
Dai, H. (2018). Research on SVM improved algorithm for large data classification. 2018 IEEE 3rd International Conference on Big Data Analysis (ICBDA). doi: https://doi.org/10.1109/icbda.2018.8367673
M. Ikonomakis, S. Kotsiantis, V. Tampakas (2005), Text Classification Using Machine Learning Techniques. WSEAS TRANSACTIONS on COMPUTERS, 8 (4), 966–974. Available at: http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=5F958D32F2F130AD5E61A077BA5D823A?doi=10.1.1.95.9153&rep=rep1&type=pdf
Joachims, T. (2012). Learning to Classify Text Using Support Vector Machines. Springer. https://doi.org/10.1007/978-1-4615-0907-3
Kowsari, K., Meimandi, K. J., Heidarysafa, M., Mendu, S., Barnes, L., Brown, D. (2019). Text Classification Algorithms: A Survey. Information, 10 (4), 150. doi: https://doi.org/10.3390/info10040150
Heylen, R., Scheunders, P. (2012). Calculation of Geodesic Distances in Nonlinear Mixing Models: Application to the Generalized Bilinear Model. IEEE Geoscience and Remote Sensing Letters, 9 (4), 644–648. doi: https://doi.org/10.1109/lgrs.2011.2177241
González-Castro, V., Fernández-Robles, L., García-Ordás, M. T., Alegre, E., García-Olalla, O. (2012). Adaptive pattern spectrum image description using Euclidean and Geodesic distance without training for texture classification. IET Computer Vision, 6 (6), 581–589. doi: https://doi.org/10.1049/iet-cvi.2012.0098
Smys, S., Bestak, R., Rocha, Á. (Eds.) (2020). Inventive Computation Technologies. Lecture Notes in Networks and Systems. doi: https://doi.org/10.1007/978-3-030-33846-6
Even, S. (2011). Graph Algorithms. Cambridge University Press. doi: https://doi.org/10.1017/cbo9781139015165
Aggarwal, A., Chandra, A. K., Snir, M. (1990). Communication complexity of PRAMs. Theoretical Computer Science, 71 (1), 3–28. doi: https://doi.org/10.1016/0304-3975(90)90188-n
Fazakis, N., Karlos, S., Kotsiantis, S., Sgarbas, K. (2016). Self-Trained LMT for Semisupervised Learning. Computational Intelligence and Neuroscience, 2016, 1–13. doi: https://doi.org/10.1155/2016/3057481
Feil, B., Abonyi, J. (2007). Geodesic Distance Based Fuzzy Clustering. Soft Computing in Industrial Applications, 50–59. doi: https://doi.org/10.1007/978-3-540-70706-6_5
Souvenir, R., Pless, R. (2005). Manifold clustering. Tenth IEEE International Conference on Computer Vision (ICCV’05) Vol. 1. doi: https://doi.org/10.1109/iccv.2005.149
Wu, Y., Chan, K. L. (2004). An extended Isomap algorithm for learning multi-class manifold. Proceedings of 2004 International Conference on Machine Learning and Cybernetics, 6, 3429–3433. doi: https://doi.org/10.1109/icmlc.2004.1380379
Yong, Q., Jie, Y. (2004). Modified Kernel Functions by Geodesic Distance. EURASIP Journal on Advances in Signal Processing, 2004 (16). doi: https://doi.org/10.1155/s111086570440314x

##submission.downloads##

PDF (English)

Опубліковано

2020-08-31

Як цитувати

Vo-Trung, H. (2020). Development of a document classification method by using geodesic distance to calculate similarity of documents. Eastern-European Journal of Enterprise Technologies, 4(2 (106), 25–32. https://doi.org/10.15587/1729-4061.2020.203866

Завантажити посилання

Номер

Том 4 № 2 (106) (2020): Інформаційні технології. Системи управління в промисловості

Розділ

Інформаційні технології

Ліцензія

Ця робота ліцензується відповідно до Creative Commons Attribution 4.0 International License.

Закріплення та умови передачі авторських прав (ідентифікація авторства) здійснюється у Ліцензійному договорі. Зокрема, автори залишають за собою право на авторство свого рукопису та передають журналу право першої публікації цієї роботи на умовах ліцензії Creative Commons CC BY. При цьому вони мають право укладати самостійно додаткові угоди, що стосуються неексклюзивного поширення роботи у тому вигляді, в якому вона була опублікована цим журналом, але за умови збереження посилання на першу публікацію статті в цьому журналі.

Ліцензійний договір – це документ, в якому автор гарантує, що володіє усіма авторськими правами на твір (рукопис, статтю, тощо).
Автори, підписуючи Ліцензійний договір з ПП «ТЕХНОЛОГІЧНИЙ ЦЕНТР», мають усі права на подальше використання свого твору за умови посилання на наше видання, в якому твір опублікований. Відповідно до умов Ліцензійного договору, Видавець ПП «ТЕХНОЛОГІЧНИЙ ЦЕНТР» не забирає ваші авторські права та отримує від авторів дозвіл на використання та розповсюдження публікації через світові наукові ресурси (власні електронні ресурси, наукометричні бази даних, репозитарії, бібліотеки тощо).
За відсутності підписаного Ліцензійного договору або за відсутністю вказаних в цьому договорі ідентифікаторів, що дають змогу ідентифікувати особу автора, редакція не має права працювати з рукописом.
Важливо пам’ятати, що існує і інший тип угоди між авторами та видавцями – коли авторські права передаються від авторів до видавця. В такому разі автори втрачають права власності на свій твір та не можуть його використовувати в будь-який спосіб.

Development of a document classification method by using geodesic distance to calculate similarity of documents

Автор(и)

DOI:

Ключові слова:

Анотація

Біографія автора

Hung Vo-Trung, University of Technology and Education University of Danang 48 Cao Thang, Danang, Vietnam, 55000

Посилання

##submission.downloads##

Опубліковано

Як цитувати

Номер

Розділ

Ліцензія

Мова

Інформація

Подати статтю

##plugins.block.developedBy.blockTitle##