Development of a document classification method by using geodesic distance to calculate similarity of documents
DOI:
https://doi.org/10.15587/1729-4061.2020.203866
Keywords:
text classification, machine learning, geodesic distance, Euclidean distance, SVM, NLP, kernel function
Abstract
The Internet now gives people quick and convenient access to human knowledge through channels such as web pages, social networks, digital libraries, and portals. However, as information is exchanged and updated ever faster, the volume of information stored in the form of digital documents is growing rapidly. As a result, we face challenges in representing, storing, sorting, and classifying documents.
In this paper, we present a new approach to text classification based on semi-supervised machine learning and the Support Vector Machine (SVM). The novelty of the study is that, instead of calculating the distance between vectors with Euclidean distance, we use geodesic distance. To do this, each text must first be expressed as an n-dimensional vector. In the n-dimensional vector space, each vector is represented by one point; geodesic distance is used to calculate the distance from each point to its nearby points, and the points are connected into a graph. Classification is then based on calculating the shortest path between vertices of the graph through a kernel function. To evaluate the proposed method, we compared the traditional SVM based on Euclidean distance with the proposed SVM based on geodesic distance. Both were tested on the same data set of Reuters articles covering 5 topics: Business, Markets, World, Politics, and Technology. The results showed that the proposed method achieves a higher correct classification rate than the traditional SVM method based on Euclidean distance (by 3.2 % on average).
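The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an Isomap-style construction in which each point is linked to its k nearest neighbours by Euclidean edge weights, the geodesic distance is approximated by the shortest path in that graph (Dijkstra), and the kernel is a Gaussian of the geodesic distance. The function names, the neighbourhood size `k`, and the bandwidth `sigma` are hypothetical choices made for illustration only.

```python
import heapq
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_graph(points, k):
    """Link every point to its k nearest neighbours (undirected, Euclidean weights)."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        dists = sorted((euclidean(points[i], points[j]), j) for j in range(n) if j != i)
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d  # keep the graph symmetric
    return graph


def geodesic_from(graph, source):
    """Dijkstra shortest-path lengths from source; unreachable vertices stay at inf."""
    dist = {v: math.inf for v in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry, a shorter path was already found
        for v, w in graph[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def geodesic_kernel(points, k=2, sigma=1.0):
    """Assumed kernel: K[i][j] = exp(-g(i, j)^2 / (2 * sigma^2)), g = graph geodesic."""
    graph = knn_graph(points, k)
    n = len(points)
    kernel = []
    for i in range(n):
        g = geodesic_from(graph, i)
        # Unreachable pairs have g = inf, which maps to a kernel value of 0.0.
        kernel.append([math.exp(-(g[j] ** 2) / (2 * sigma ** 2)) for j in range(n)])
    return kernel
```

A matrix produced this way could be passed to an SVM implementation that accepts precomputed kernels. Note that a Gaussian of a graph geodesic distance is not guaranteed to be positive semidefinite, so a practical implementation may need an additional correction step; the abstract does not say how the authors handle this.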
License
Copyright (c) 2020 Trung Hung Vo
This work is licensed under a Creative Commons Attribution 4.0 International License.
The terms and conditions for the transfer of copyright (identification of authorship) are set out in the License Agreement. In particular, the authors retain the right of authorship of their manuscript and grant the journal the right of first publication of the work under the terms of the Creative Commons CC BY license. At the same time, they have the right to conclude additional agreements on their own for the non-exclusive distribution of the work in the form in which it was published by this journal, provided that a link to the first publication of the article in this journal is preserved.
A license agreement is a document in which the author warrants that he/she owns all copyright for the work (manuscript, article, etc.).
The authors, signing the License Agreement with TECHNOLOGY CENTER PC, have all rights to the further use of their work, provided that they link to our edition in which the work was published.
According to the terms of the License Agreement, the Publisher TECHNOLOGY CENTER PC does not take away your copyrights and receives permission from the authors to use and disseminate the publication through the world's scientific resources (own electronic resources, scientometric databases, repositories, libraries, etc.).
In the absence of a signed License Agreement, or if the agreement lacks identifiers that establish the identity of the author, the editors have no right to work with the manuscript.
It is important to remember that there is another type of agreement between authors and publishers – when copyright is transferred from the authors to the publisher. In this case, the authors lose ownership of their work and may not use it in any way.