Comparative Analysis of RAG-Based Open-Source LLMs for Indonesian Banking Customer Service Optimization Using Simulated Data

Hendra Lijaya; Patricia Ho; Handri Santoso

doi:10.32736/sisfokom.v14i3.2383

Authors

Hendra Lijaya, Pradita University
Patricia Ho, Pradita University
Handri Santoso, Pradita University, https://orcid.org/0000-0002-3265-8179

Keywords:

Bank Customer Service, Large Language Model (LLM), LLM-as-a-Judge, Semantic Similarity, Retrieval-Augmented Generation (RAG)

Abstract

In the digital era, banks face challenges in delivering fast, accurate, and efficient customer service, especially for simple, frequently asked questions. This study evaluates the effectiveness of three open-source Large Language Models (LLMs), namely Gemma2-9B-Sahabat-AI, Qwen2.5-14B-Instruct, and Mistral-Nemo-Instruct, in supporting a Retrieval-Augmented Generation (RAG) question-answering system for the banking sector. The system indexed 12,000 synthetic billing documents with intfloat/multilingual-e5-large-instruct embeddings (1024 dimensions), and model performance was assessed via semantic similarity metrics, LLM-as-a-Judge scores (GPT-4o-mini and Gemini 2.0 Flash), and human validation. Gemma2-9B-Sahabat-AI achieved the highest semantic similarity score (0.9627), followed by Mistral (0.9614) and Qwen2.5 (0.9284). In the LLM-as-a-Judge evaluations, Qwen2.5 ranked highest under GPT-4o-mini (92.2), while Gemma2 led under Gemini 2.0 Flash (88.4). Human evaluators gave all models perfect scores on the factual questions (questions 1–10), but every model struggled with the arithmetic in question 13. Gemma2's average response time was 41 seconds, faster than Mistral's 48 seconds and Qwen2.5's 72 seconds, which indicates that Gemma2 offers the best balance of accuracy, speed, and computational efficiency among the three models. These findings underscore the potential of locally operated open-source LLMs for banking applications, where local deployment supports privacy and regulatory compliance. The study's limitations include reliance on synthetic data, a narrow question set, and a lack of user diversity. Future research should involve broader queries, testing with real users, and numeric reasoning modules to ensure robust and scalable deployment in real-world banking customer service environments.
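
The semantic similarity scores reported above are typically computed as the cosine similarity between the embedding of a generated answer and the embedding of a reference answer, averaged over all questions. The following is a minimal sketch of that calculation, not the authors' evaluation code. It assumes cosine similarity is the metric used, and the short toy vectors stand in for the 1024-dimensional embeddings; the function names cosine_similarity and mean_similarity are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat similarity as 0.
        return 0.0
    return dot / (norm_a * norm_b)


def mean_similarity(pairs):
    """Average cosine similarity over (answer, reference) embedding pairs."""
    if not pairs:
        raise ValueError("at least one pair is required")
    return sum(cosine_similarity(a, r) for a, r in pairs) / len(pairs)


# Toy 3-dimensional vectors (assumption: real embeddings would come from
# an embedding model and have 1024 dimensions).
example_pairs = [
    ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),  # identical direction
    ([1.0, 1.0, 0.0], [1.0, 0.0, 0.0]),  # 45 degrees apart
]
print(mean_similarity(example_pairs))
```

A score close to 1 means the generated answer and the reference answer point in nearly the same direction in embedding space, which is why the three models' scores above 0.92 are read as high semantic agreement.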
