A BLEU-Based Evaluation of ChatGPT's Chinese-to-English Translation

Authors

  • Linli He, Universiti Sains Malaysia
  • Mozhgan Ghassemiazghandi, Universiti Sains Malaysia

DOI:

https://doi.org/10.17507/tpls.1512.21

Keywords:

BLEU, ChatGPT, machine translation, machine translation evaluation

Abstract

Political text translation presents unique challenges that extend beyond conventional linguistic accuracy: it demands precise ideological expression, cultural sensitivity, and terminological consistency. While ChatGPT demonstrates growing capabilities in machine translation, its performance on specialized political discourse remains underexplored. This study evaluates ChatGPT's Chinese-to-English translation quality on the 2023 Chinese Government Work Report, combining BLEU metrics with human assessment across three criteria: syntax and grammar, cultural and ideological accuracy, and fluency and coherence. Three experienced translators rated ChatGPT's translations on a 6-point scale, while BLEU scores provided automated evaluation. The results reveal a significant contradiction: BLEU scores remained low (0.31–0.37), yet human evaluation showed moderate performance with notable variation across criteria. ChatGPT scored highest in fluency and coherence (5.53 average) but struggled with cultural and ideological accuracy (4.43 average), particularly in preserving the precision and contextual appropriateness of political terminology. Critical issues include generic translations of politically specific terms and inadequate handling of culturally embedded expressions. The key finding is that BLEU alone is fundamentally insufficient for assessing political text translation quality, owing to its single-reference constraint and its inability to capture ideological nuance. These findings underscore the necessity of human evaluation for meaningful assessment of specialized-domain translation. The study contributes to understanding AI translation capabilities in political discourse and offers evidence-based recommendations for developing more appropriate evaluation frameworks for specialized translation domains.
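The BLEU scores discussed above measure clipped n-gram overlap against a single reference, combined via a geometric mean and a brevity penalty. The following is a minimal pure-Python sketch of that sentence-level computation, using the add-one smoothing of Chen and Cherry (2014); the tokenization and example sentences are illustrative, not the study's actual pipeline.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Single-reference sentence-level BLEU with add-one smoothing
    on the n-gram precisions, so short segments do not collapse to zero."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # Clipped overlap: each candidate n-gram counts at most as
        # often as it appears in the reference.
        overlap = sum(min(count, ref[g]) for g, count in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append((overlap + 1) / (total + 1))  # add-one smoothing
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / max(len(candidate), 1))
    # Geometric mean of the smoothed precisions.
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "the government will continue to deepen reform".split()
candidate = "the government will keep deepening reform".split()
print(round(bleu(candidate, reference), 3))  # prints 0.368
```

Note how a fluent, adequate paraphrase still scores in the 0.3–0.4 range against one reference: this single-reference penalty is precisely the constraint the study identifies as making BLEU alone insufficient for political texts.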

Author Biographies

Linli He, Universiti Sains Malaysia

School of Languages, Literacies and Translation

Mozhgan Ghassemiazghandi, Universiti Sains Malaysia

School of Languages, Literacies and Translation

References

AlAfnan, M. A. (2025). Large Language Models as Computational Linguistics Tools: A Comparative Analysis of ChatGPT and Google Machine Translations. Journal of Artificial Intelligence and Technology, 5, 20–32. https://doi.org/10.37965/jait.2024.0549

Alawida, M., Mejri, S., Mehmood, A., Chikhaoui, B., & Isaac Abiodun, O. (2023). A Comprehensive Study of ChatGPT: Advancements, Limitations, and Ethical Considerations in Natural Language Processing and Cybersecurity. Information, 14(8), Article 8. https://doi.org/10.3390/info14080462

Antaki, F., Touma, S., Milad, D., El-Khoury, J., & Duval, R. (2023). Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of Its Successes and Shortcomings. Ophthalmology Science, 3(4), 100324. https://doi.org/10.1016/j.xops.2023.100324

Araújo, S., & Aguiar, M. (2023). Comparing ChatGPT’s and Human Evaluation of Scientific Texts’ Translations from English to Portuguese Using Popular Automated Translators. Notebook for the SimpleText Lab at CLEF 2023. CEUR Workshop Proceedings.

Azaria, A. (2022). ChatGPT Usage and Limitations. https://hal.science/hal-03913837

Banerjee, S., & Lavie, A. (2005). METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In J. Goldstein, A. Lavie, C.-Y. Lin, & C. Voss (Eds.), Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (pp. 65–72). Association for Computational Linguistics. https://aclanthology.org/W05-0909

Chen, B., & Cherry, C. (2014). A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU. In O. Bojar, C. Buck, C. Federmann, B. Haddow, P. Koehn, C. Monz, M. Post, & L. Specia (Eds.), Proceedings of the Ninth Workshop on Statistical Machine Translation (pp. 362–367). Association for Computational Linguistics. https://doi.org/10.3115/v1/W14-3346

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. de O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., & Zaremba, W. (2021). Evaluating Large Language Models Trained on Code [Preprint]. arXiv. https://doi.org/10.48550/arXiv.2107.03374

Cheng, Y., Wang, R., Chen, J., Chao, Y., Maimaitili, A., & Zhang, H. (2023). Context-Based AI Translation From a Globalization Perspective: A Case Study of ChatGPT. Sino-US English Teaching, 20(9). https://doi.org/10.17265/1539-8072/2023.09.005

De Angelis, L., Baglivo, F., Arzilli, G., Privitera, G. P., Ferragina, P., Tozzi, A. E., & Rizzo, C. (2023). ChatGPT and the rise of large language models: The new AI-driven infodemic threat in public health. Frontiers in Public Health, 11. https://www.frontiersin.org/articles/10.3389/fpubh.2023.1166120

Doddington, G. (2002). Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Proceedings of the Second International Conference on Human Language Technology Research, 138–145.

Du, H. (2023). A Corpus-Based Study on the Linguistic Features of the English Translation of the Report on the Work of the Government. Modern Linguistics, 11, 2630. https://doi.org/10.12677/ML.2023.116356

Du, L. (2025). Chinese Political Discourse in Translation: A Corpus-based Critical Discourse Analysis. Routledge. https://doi.org/10.4324/9781003544456

Evans, D. (2023, January 27). I asked ChatGPT why human intervention is so necessary and why we shouldn’t be scared of progress. LinkedIn. Retrieved July 6, 2024, from https://www.linkedin.com/pulse/i-asked-chat-gpt-why-human-intervention-so-necessary-we-darren-evans/

Evtikhiev, M., Bogomolov, E., Sokolov, Y., & Bryksin, T. (2023). Out of the BLEU: How should we assess quality of the Code Generation models? Journal of Systems and Software, 203, 111741. https://doi.org/10.1016/j.jss.2023.111741

Feng, J. (2024). An Analysis of the Translation Output and Value Dissemination of ChatGPT. Lecture Notes in Education Psychology and Public Media, 35(1), 212–218. https://doi.org/10.54254/2753-7048/35/20232108

Floridi, L. (2023). AI as Agency Without Intelligence: On ChatGPT, Large Language Models, and Other Generative Models. Philosophy & Technology, 36(1), 15. https://doi.org/10.1007/s13347-023-00621-y

Han, C., & Lu, X. (2025). Beyond BLEU: Repurposing neural-based metrics to assess interlingual interpreting in tertiary-level language learning settings. Research Methods in Applied Linguistics, 4(1), 100184. https://doi.org/10.1016/j.rmal.2025.100184

Hendy, A., Abdelrehim, M., Sharaf, A., Raunak, V., Gabr, M., Matsushita, H., Kim, Y. J., Afify, M., & Awadalla, H. H. (2023). How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation [Preprint]. arXiv. http://arxiv.org/abs/2302.09210

Jiang, Z., Lv, Q., Zhang, Z., & Lei, L. (2024). Convergences and Divergences between Automatic Assessment and Human Evaluation: Insights from Comparing ChatGPT-Generated Translation and Neural Machine Translation [Preprint]. arXiv. https://arxiv.org/abs/2401.05176v3

Jiang, Z., & Zhang, Z. (2024). Can ChatGPT Rival Neural Machine Translation? A Comparative Study [Preprint]. arXiv. http://arxiv.org/abs/2401.05176

Jiao, W., Wang, W., Huang, J., Wang, X., & Tu, Z. (2023). Is ChatGPT a Good Translator? Yes With GPT-4 As The Engine [Preprint]. arXiv. https://doi.org/10.48550/arXiv.2301.08745

Kalla, D., & Smith, N. (2023). Study and Analysis of Chat GPT and its Impact on Different Fields of Study. International Journal of Innovative Science and Research Technology, 8(3). https://ssrn.com/abstract=4402499

Karabayeva, I., & Kalizhanova, A. (2024). Evaluating machine translation of literature through rhetorical analysis. Journal of Translation and Language Studies, 5(1), Article 1. https://doi.org/10.48185/jtls.v5i1.962

Khoshafah, F. (2023, April 17). ChatGPT for Arabic-English Translation: Evaluating the Accuracy [Preprint]. Research Square. https://doi.org/10.21203/rs.3.rs-2814154/v2

Larroyed, A. (2023). Redefining Patent Translation: The Influence of ChatGPT and the Urgency to Align Patent Language Regimes in Europe with Progress in Translation Technology. GRUR International, 72(11), 1009–1017. https://doi.org/10.1093/grurint/ikad099

Lavie, A. (2011, September 19). Evaluating the Output of Machine Translation Systems. Proceedings of Machine Translation Summit XIII: Tutorial Abstracts. https://aclanthology.org/2011.mtsummit-tutorials.3

Lee, S., Lee, J., Moon, H., Park, C., Seo, J., Eo, S., Koo, S., & Lim, H. (2023). A Survey on Evaluation Metrics for Machine Translation. Mathematics, 11(4), Article 4. https://doi.org/10.3390/math11041006

Liu, S., & Zhu, W. (2023). An Analysis of the Evaluation of the Translation Quality of Neural Machine Translation Application Systems. Applied Artificial Intelligence, 37(1), 2214460. https://doi.org/10.1080/08839514.2023.2214460

Marie, B., Fujita, A., & Rubino, R. (2021). Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers [Preprint]. arXiv. https://doi.org/10.48550/arXiv.2106.15195

Mattas, P. (2023). ChatGPT: A Study of AI Language Processing and its Implications. International Journal of Research Publication and Reviews, 4, 435–440. https://doi.org/10.55248/gengpi.2023.4218

OpenAI. (2022, November 30). Introducing ChatGPT. Retrieved June 5, 2025, from https://openai.com/blog/chatgpt

Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: A Method for Automatic Evaluation of Machine Translation. In P. Isabelle, E. Charniak, & D. Lin (Eds.), Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311–318). Association for Computational Linguistics. https://doi.org/10.3115/1073083.1073135

Park, S., & Eunsil, C. (2023). A Study of Translatability of Irony in ChatGPT. The Journal of Translation Studies, 24(2), 131–160. https://doi.org/10.15749/jts.2023.24.2.005

Peng, K., Ding, L., Zhong, Q., Shen, L., Liu, X., Zhang, M., Ouyang, Y., & Tao, D. (2023). Towards Making the Most of ChatGPT for Machine Translation [Preprint]. arXiv. https://doi.org/10.48550/arXiv.2303.13780

Puppel, M., & Borg, C. (2025). Evaluating ChatGPT’s Performance in Creative Text Translation for Communication: A Case Study from English into German. Media and Intercultural Communication: A Multidisciplinary Journal, 3(1), 1–27. https://doi.org/10.22034/mic.2024.480506.1023

Rafaeli, O., Abend, O., Choshen, L., & Nikolaev, D. (2021). Part of Speech and Universal Dependency effects on English Arabic Machine Translation [Preprint]. arXiv. https://doi.org/10.48550/arXiv.2106.00745

Ray, P. P. (2023). ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet of Things and Cyber-Physical Systems, 3, 121–154. https://doi.org/10.1016/j.iotcps.2023.04.003

Rizki, K. A. M., & Masykuroh, Q. (2024). Evaluating ChatGPT’s Translation of Harry Potter: A Qualitative Study of Translation Techniques, Accuracy, and Acceptability. JELITA, 6(1), Article 1. https://doi.org/10.56185/jelita.v6i1.902

Wang, Z., & Mao, C. (2023). ChatGPT yiwen zhiliang de pinggu yu tisheng — yi taoci lei wenben hanying fanyi wei li [Evaluation and Enhancement of ChatGPT Translation Quality: A Case Study of Chinese-English Translation in Ceramic Texts]. Shandong Ceramics, 46(4), 20–27. https://doi.org/10.3969/j.issn.1005-0639.2023.04.003

Wu, J. (2023). A Comparative Analysis of Chinese-English Translation Quality Based on ChatGPT: A Case Study of Chinese Characteristic Words. Journal of Social Science Humanities and Literature, 6(5), Article 5. https://doi.org/10.53469/jsshl.2023.06(05).08

Yang, Y., Liu, R., Qian, X., & Ni, J. (2023). Performance and perception: Machine translation post-editing in Chinese-English news translation by novice translators. Humanities and Social Sciences Communications, 10(1), Article 1. https://doi.org/10.1057/s41599-023-02285-7

Zhao, Y., Zhang, J., & Zong, C. (2023). Transformer: A General Framework from Machine Translation to Others. Machine Intelligence Research, 20(4), Article 4. https://doi.org/10.1007/s11633-022-1393-5

Zhu, G., & Wang, X. (2023). ChatGPT de yunxing moshi, guanjian jishu ji weilai tujing [ChatGPT: Operation Mode, Key Technology and Future Prospects]. Xinjiang Normal University Journal (Philosophy and Social Sciences Edition), 44(4), 113–122. https://doi.org/10.14100/j.cnki.65-1039/g4.20230217.001

Zhu, P. (2023). Translation of Personal Pronouns in Government Work Report from the Perspective of Explicitation. International Journal of Education and Humanities, 9(2), Article 2. https://doi.org/10.54097/ijeh.v9i2.9911

Published

2025-12-01

Issue

Section

Articles