Development of a document classification method by using geodesic distance to calculate similarity of documents

Authors

DOI:

https://doi.org/10.15587/1729-4061.2020.203866

Keywords:

text classification, machine learning, geodesic distance, Euclidean distance, SVM, NLP, kernel function

Abstract

Currently, the Internet gives people the opportunity to access human knowledge quickly and conveniently through various channels such as Web pages, social networks, digital libraries, and portals. However, as information is exchanged and updated quickly, the volume of information stored (in the form of digital documents) is increasing rapidly. Therefore, we face challenges in representing, storing, sorting and classifying documents.

In this paper, we present a new approach to text classification based on semi-supervised machine learning and the Support Vector Machine (SVM). The novelty of the study is that, instead of measuring the distance between vectors with Euclidean distance, we use geodesic distance. To do this, each text is first expressed as an n-dimensional vector, so that it is represented by one point in the n-dimensional vector space. Geodesic distance is then used to calculate the distance from each point to its nearby points, and the points are connected into a graph. The classification is based on calculating the shortest path between vertices on the graph through a kernel function. To evaluate the proposed method, we compared the SVM method with the traditional calculation based on Euclidean distance against the proposed method based on geodesic distance. Both were tested on the same data set of articles taken from Reuters on 5 topics: Business, Markets, World, Politics, and Technology. The results showed that the correct classification rate of the proposed method is higher than that of the traditional SVM method based on Euclidean distance (by 3.2 % on average).
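The pipeline described in the abstract (document vectors, a neighbourhood graph, shortest paths, a kernel) can be illustrated with a minimal sketch. It is not the paper's implementation: the choice of k nearest neighbours, Euclidean edge weights, Dijkstra's algorithm, the Gaussian form of the kernel, and all function and parameter names (`knn_graph`, `geodesic_from`, `geodesic_kernel`, `k`, `sigma`) are assumptions made for illustration only.

```python
import heapq
import math


def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_graph(points, k):
    """Connect each point to its k nearest neighbours (assumed graph rule)."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        nearest = sorted(
            (euclidean(points[i], points[j]), j) for j in range(n) if j != i
        )[:k]
        for d, j in nearest:
            graph[i][j] = d
            graph[j][i] = d  # make the edge usable in both directions
    return graph


def geodesic_from(graph, source):
    """Dijkstra: shortest-path length from source to every vertex."""
    dist = {v: math.inf for v in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def geodesic_kernel(points, k=2, sigma=1.0):
    """Gaussian-style similarity built on graph distances (assumed form)."""
    graph = knn_graph(points, k)
    n = len(points)
    rows = [geodesic_from(graph, i) for i in range(n)]
    # Unreachable pairs have infinite distance and therefore similarity 0.
    return [
        [math.exp(-rows[i][j] ** 2 / (2 * sigma ** 2)) for j in range(n)]
        for i in range(n)
    ]


# Toy data: ten points along a U-shaped curve. The two ends are close in
# straight-line terms but far apart when measured along the curve.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1),
          (3, 2), (3, 3), (2, 3), (1, 3), (0, 3)]
end_to_end = geodesic_from(knn_graph(points, 2), 0)[9]
print(euclidean(points[0], points[9]), end_to_end)
```

On this toy curve the graph distance between the two ends is three times their Euclidean distance, which is the effect the geodesic approach relies on. A matrix produced by `geodesic_kernel` could be handed to an SVM implementation that accepts a precomputed kernel, with the caveat that a Gaussian function of graph distances is not guaranteed to be positive semidefinite.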

Author Biography

Hung Vo-Trung, University of Technology and Education, University of Danang, 48 Cao Thang, Danang, Vietnam, 55000

Doctor of Computer Science (Grenoble INP, France)

Professor, Vice-Rector

Published

2020-08-31

How to Cite

Vo-Trung, H. (2020). Development of a document classification method by using geodesic distance to calculate similarity of documents. Eastern-European Journal of Enterprise Technologies, 4(2 (106)), 25–32. https://doi.org/10.15587/1729-4061.2020.203866