Evaluation of the efficiency of large language models for extracting entities from unstructured documents
DOI:
https://doi.org/10.15587/2706-5448.2025.341926
Keywords:
legal unstructured document, structured document annotation, token processing cost, GPT-4.1-mini
Abstract
The object of research is collections of unstructured documents published on the public websites of rural and urban communities of Ukraine.
The study addresses the problem of choosing the large language model (LLM) best suited to applied named entity recognition (NER) during document processing. Researchers recognize that this choice depends significantly on the subject area and on the language in which the documents are written. However, studies of the feasibility of using LLMs for NER rarely take the operational characteristics of such models into account, and the evaluation of these characteristics remains largely unexplored.
A method for recognizing selected types of legal unstructured texts in Ukrainian is proposed. Unlike existing methods, it solves the NER problem for those documents that are subject to recognition/classification. Metrics for the cost of processing input and output tokens are proposed, and a methodology for evaluating the cost of using an LLM is developed. Based on these results, a comparative evaluation of common LLMs was conducted on the NER problem for Ukrainian texts that need to be recognized. The evaluation showed that: (I) GPT-4o is the best in terms of accuracy and quality of processing (Precision = 0.919; Recall = 0.954; F1 = 0.936); (II) GPT-4o-mini with discounts is the best in terms of average document processing cost (0.00045 USD per document); (III) GPT-4.1-mini with discounts is the best in terms of quality/cost ratio (the indicator value is 0.938). GPT-4.1-mini is recommended as the best LLM for applied use.
The evaluation results significantly simplify the choice of an LLM for building information systems and technologies that process unstructured documents written in Ukrainian.
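The reported F1 value and the idea of a per-document token cost can be illustrated with a minimal Python sketch. The F1 formula is the standard harmonic mean of precision and recall. The cost function is only an assumption about how a cost based on input and output tokens might be computed: the names `document_cost`, `input_price_per_million`, `output_price_per_million` and `discount` are hypothetical and are not taken from the article, and the numbers passed to it are placeholders rather than real provider prices. The article's quality/cost indicator is not reproduced here because its definition is not given in the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def document_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
    discount: float = 0.0,
) -> float:
    """Hypothetical per-document cost in USD.

    Assumes prices are quoted per one million tokens and that a
    discount (for example, for batch processing) is a fraction in [0, 1)
    applied to the whole document cost. This is an illustrative
    assumption, not the article's metric.
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    raw = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000
    return raw * (1.0 - discount)


if __name__ == "__main__":
    # Precision and recall reported for GPT-4o in the abstract.
    print(round(f1_score(0.919, 0.954), 3))

    # Placeholder token counts and prices, chosen only for illustration.
    print(document_cost(1000, 500, 1.0, 2.0, discount=0.5))
```

Rounded to three decimals, the F1 computed from the reported precision and recall agrees with the value of 0.936 stated for GPT-4o.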
License
Copyright (c) 2025 Oleksandr Shyshatskyi, Borys Moroz, Maksym Ievlanov, Ihor Levykin, Dmytro Moroz

This work is licensed under a Creative Commons Attribution 4.0 International License.
The assignment of copyright and the conditions for its transfer (identification of authorship) are set out in the License Agreement. In particular, the authors retain the right of authorship of their manuscript and grant the journal the right of first publication of this work under the terms of the Creative Commons CC BY license. They may also enter into separate additional agreements for the non-exclusive distribution of the work in the form in which it was published by this journal, provided that the link to the first publication of the article in this journal is preserved.



