An Evaluation of ChatGPT's Translation Accuracy Using BLEU Score
DOI:
https://doi.org/10.17507/tpls.1404.07
Keywords:
BLEU score evaluation, ChatGPT-4 translation, large language models, machine translation accuracy, translation quality assessment
Abstract
Traditional views have long held that machine translation cannot match the quality and accuracy of human translators, especially for challenging language pairs such as Persian and English. This study challenges that view by showing that ChatGPT-4, which draws on vast amounts of multilingual data and advanced large language model algorithms, significantly outperforms widely used open-source machine translation tools and approaches human translation quality. The research critically assesses the Persian-to-English translation accuracy of ChatGPT-4 against a traditional open-source machine translation tool, highlighting advances in artificial intelligence-driven translation technologies. Using Bilingual Evaluation Understudy (BLEU) scores, the study compares the translation outputs of ChatGPT-4 with those of MateCat, providing a quantitative basis for comparing their accuracy and quality. ChatGPT-4 achieves a BLEU score of 0.88 and an accuracy of 0.68, outperforming MateCat, which achieves a BLEU score of 0.82 and an accuracy of 0.49. The results indicate that the translations generated by ChatGPT-4 surpass those produced by MateCat and nearly mirror the quality of human translations. The evaluation demonstrates the effectiveness of OpenAI's large language model algorithms in improving translation accuracy.
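As an illustration of the metric named above, the following is a minimal sketch of a sentence-level BLEU computation on the 0 to 1 scale used in the abstract. It assumes the standard definition (clipped n-gram precision for n = 1 to 4 with uniform weights, multiplied by a brevity penalty) and applies no smoothing. The abstract does not say which implementation, tokenization, or settings the study used, so the function name `bleu` and the example sentences are hypothetical and are not taken from the study's data.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights.

    candidate: list of tokens; references: list of token lists.
    This is an illustrative sketch, not the study's own implementation.
    """
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU = 0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)


# Hypothetical example sentences, whitespace-tokenized.
reference_tokens = "the cat sat on the mat".split()
shorter_output = "the cat sat on the".split()

print(bleu(reference_tokens, [reference_tokens]))  # 1.0 for an exact match
print(bleu(shorter_output, [reference_tokens]))    # below 1.0: brevity penalty only
```

In this sketch an output that matches the reference exactly scores 1.0, while an output that is a correct but shorter prefix is penalized only through the brevity penalty. Scores reported in published work are usually aggregated over a whole test set rather than computed for a single sentence, so this function shows the mechanics of the metric and is not a way to reproduce the figures above.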