Enhancing retrieval performance in social media with corpus-based query expansion using bidirectional encoder representations from transformers

Authors

DOI:

https://doi.org/10.15587/1729-4061.2025.340258

Keywords:

information retrieval, query expansion, corpus-based retrieval, BERT, social media, e-government

Abstract

This study focuses on a collection of tweets related to government services (e-government), which are preprocessed and transformed into a domain-specific corpus for query expansion. Conventional IR models struggle with unstructured and noisy content containing informal language and abbreviations, which reduces retrieval accuracy. To overcome these issues, this study proposes a hybrid query expansion (QE) model named ROCBERT-QE, which combines corpus-based retrieval (CBR) with bidirectional encoder representations from transformers (BERT). The model applies dual expansion, using corpus-based co-occurrence frequencies to capture lexical relationships and BERT embeddings to preserve semantic context. A domain-specific corpus consisting of 5,017 preprocessed tweets related to Indonesia’s National Health Insurance (BPJS) was constructed, encompassing 6,215 unique terms that represent linguistic variation and informality in public discourse. Experimental results demonstrate that ROCBERT-QE outperforms baseline retrieval methods such as TF-IDF, BM25, and standard BERT. For single-word queries, Recall reached 0.8574 and Precision 0.8807, while for sentence-level queries, Recall was 0.8932 and Precision 0.9175. The synergy of frequency-based and contextual expansion enables effective handling of lexical noise and semantic ambiguity. The results confirm the scientific potential of combining corpus-based and transformer-based approaches in IR tasks involving unstructured content. Practically, ROCBERT-QE can be applied for real-time analysis of citizen discourse in e-government contexts, such as service evaluation, policy feedback, and early detection of public issues. The framework is scalable and adaptable to other domains with informal or multilingual data characteristics

Author Biographies

Roberto Kaban, Universitas Sumatera Utara

Doctoral Program of Computer Science, Doctoral Student

Department of Computer Science

Poltak Sihombing, Universitas Sumatera Utara

Doctor of Philosophy (PhD) in Computer Science, Professor

Department of Computer Science

Syahril Efendi, Universitas Sumatera Utara

Doctor of Philosophy (PhD) in Mathematics, Professor

Department of Computer Science

Maya Silvi Lydia, Universitas Sumatera Utara

Doctor of Mathematics, Associate Professor

Department of Computer Science

References

  1. Ghanem, F. A., Padma, M. C., Alkhatib, R. (2023). Automatic Short Text Summarization Techniques in Social Media Platforms. Future Internet, 15 (9), 311. https://doi.org/10.3390/fi15090311
  2. Sharma, S., Panda, S. P. (2023). Efficient information retrieval model: overcoming challenges in search engines-an overview. Indonesian Journal of Electrical Engineering and Computer Science, 32 (2), 925. https://doi.org/10.11591/ijeecs.v32.i2.pp925-932
  3. Iamnitchi, A., Hall, L. O., Horawalavithana, S., Mubang, F., Ng, K. W., Skvoretz, J. (2023). Modeling information diffusion in social media: data-driven observations. Frontiers in Big Data, 6. https://doi.org/10.3389/fdata.2023.1135191
  4. Benamara, F., Inkpen, D., Taboada, M. (2018). Introduction to the Special Issue on Language in Social Media: Exploiting Discourse and Other Contextual Information. Computational Linguistics, 44 (4), 663–681. https://doi.org/10.1162/coli_a_00333
  5. Andreasen, T., Bordogna, G., Tré, G. D., Kacprzyk, J., Larsen, H. L., Zadrożny, S. (2024). The power and potentials of Flexible Query Answering Systems: A critical and comprehensive analysis. Data & Knowledge Engineering, 149, 102246. https://doi.org/10.1016/j.datak.2023.102246
  6. Killingback, J., Zamani, H. (2025). Benchmarking Information Retrieval Models on Complex Retrieval Tasks. arXiv. https://doi.org/10.48550/ARXIV.2509.07253
  7. Allahim, A., Cherif, A., Imine, A. (2025). Semantic approaches for query expansion: taxonomy, challenges, and future research directions. PeerJ Computer Science, 11, e2664. https://doi.org/10.7717/peerj-cs.2664
  8. Massai, L. (2024). Evaluation of semantic relations impact in query expansion-based retrieval systems. Knowledge-Based Systems, 283, 111183. https://doi.org/10.1016/j.knosys.2023.111183
  9. Alshanik, F., Apon, A., Du, Y., Herzog, A., Safro, I. (2022). Proactive Query Expansion for Streaming Data Using External Sources. 2022 IEEE International Conference on Big Data (Big Data), 701–708. https://doi.org/10.1109/bigdata55660.2022.10020577
  10. Silva, S., Seara Vieira, A., Celard, P., Iglesias, E. L., Borrajo, L. (2021). A Query Expansion Method Using Multinomial Naive Bayes. Applied Sciences, 11 (21), 10284. https://doi.org/10.3390/app112110284
  11. Chen, Z., Wang, J., Yang, X. (2022). A Concept Net-based semantic constraint method for query expansion. 2022 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), 906–913. https://doi.org/10.1109/wi-iat55865.2022.00147
  12. Xia, Y., Wu, J., Kim, S., Yu, T., Rossi, R. A., Wang, H., McAuley, J. (2025). Knowledge-Aware Query Expansion with Large Language Models for Textual and Relational Retrieval. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 4275–4286. https://doi.org/10.18653/v1/2025.naacl-long.216
  13. Fan, Y., Xie, X., Cai, Y., Chen, J., Ma, X., Li, X., Zhang, R., Guo, J. (2022). Pre-training Methods in Information Retrieval. IEEE Xplore. https://doi.org/10.1561/9781638280637
  14. Claveau, V. (2020). Query expansion with artificially generated texts. arXiv. https://arxiv.org/abs/2012.08787
  15. Kumar, R., Sharma, S. C. (2022). Hybrid optimization and ontology-based semantic model for efficient text-based information retrieval. The Journal of Supercomputing, 79 (2), 2251–2280. https://doi.org/10.1007/s11227-022-04708-9
  16. Moreno-Ortiz, A., García-Gámez, M. (2023). Strategies for the Analysis of Large Social Media Corpora: Sampling and Keyword Extraction Methods. Corpus Pragmatics, 7 (3), 241–265. https://doi.org/10.1007/s41701-023-00143-0
  17. Seo, W., An, H., Lee, S. (2025). A New Query Expansion Approach via Agent-Mediated Dialogic Inquiry. arXiv. https://doi.org/10.48550/arXiv.2502.08557
  18. Shyrokykh, K., Girnyk, M., Dellmuth, L. (2023). Short text classification with machine learning in the social sciences: The case of climate change on Twitter. PLOS ONE, 18 (9), e0290762. https://doi.org/10.1371/journal.pone.0290762
  19. Mutiarin, D., Wahyuni, H., Ismail, N. S. A., Kumorotomo, W. (2023). Social Media in Support of Indonesia’s One Data Interoperability Process for Implementing Data Governance Policies. E3S Web of Conferences, 440, 03022. https://doi.org/10.1051/e3sconf/202344003022
  20. Yulita, I. N., Al-Auza’i, A. F., Prabuwono, A. S., Sholahuddin, A., Ardiansyah, F., Sarathan, I., Djuyandi, Y. (2023). Bidirectional Long Short-Term Memory for Analysis of Public Opinion Sentiment on Government Policy During the COVID-19 Pandemic. International Journal of Advanced Computer Science and Applications, 14 (11). https://doi.org/10.14569/ijacsa.2023.0141189
  21. Wahyudi, W., Loilatu, M. J., Omrilio, O. (2024). Capturing Political Conversation: An Analysis of Public Response on Social Media during Indonesian Election. Open Journal of Social Sciences, 12 (09), 163–182. https://doi.org/10.4236/jss.2024.129009
  22. Abdelazim, H., Tharwat, M., Mohamed, A. (2023). Semantic Embeddings for Arabic Retrieval Augmented Generation (ARAG). International Journal of Advanced Computer Science and Applications, 14 (11). https://doi.org/10.14569/ijacsa.2023.01411135
  23. Faggioli, G. (2023). Modelling and Explaining IR System Performance Towards Predictive Evaluation. ACM SIGIR Forum, 57 (1), 1–2. https://doi.org/10.1145/3636341.3636361
  24. Wang, Y., Huang, H., Feng, C. (2017). Query Expansion Based on a Feedback Concept Model for Microblog Retrieval. Proceedings of the 26th International Conference on World Wide Web, 559–568. https://doi.org/10.1145/3038912.3052710
Enhancing retrieval performance in social media with corpus-based query expansion using bidirectional encoder representations from transformers

Downloads

Published

2025-10-31

How to Cite

Kaban, R., Sihombing, P., Efendi, S., & Lydia, M. S. (2025). Enhancing retrieval performance in social media with corpus-based query expansion using bidirectional encoder representations from transformers. Eastern-European Journal of Enterprise Technologies, 5(2 (137), 70–83. https://doi.org/10.15587/1729-4061.2025.340258