A BLEU-Based Evaluation of ChatGPT's Chinese-to-English Translation
DOI: https://doi.org/10.17507/tpls.1512.21
Keywords: BLEU, ChatGPT, machine translation, machine translation evaluation
Abstract
Political text translation presents unique challenges: precise ideological expression, cultural sensitivity, and terminological consistency, aspects that extend beyond conventional linguistic accuracy. While ChatGPT demonstrates growing capability in machine translation, its performance on specialized political discourse remains underexplored. This study evaluates ChatGPT's Chinese-to-English translation quality on the 2023 Chinese Government Work Report, combining BLEU metrics with human assessment across three criteria: syntax and grammar, cultural and ideological accuracy, and fluency and coherence. Three experienced translators rated ChatGPT's translations on a 6-point scale, while BLEU provided automated scoring. Results reveal a marked discrepancy: BLEU scores remained low (0.31-0.37), whereas human evaluation indicated moderate performance with notable variation across criteria. ChatGPT scored highest on fluency and coherence (5.53 average) but struggled with cultural and ideological accuracy (4.43 average), particularly in preserving the precision and contextual appropriateness of political terminology. Critical issues include generic renderings of politically specific terms and inadequate handling of culturally embedded expressions. These findings demonstrate that BLEU alone is insufficient for assessing political text translation quality, owing to its single-reference constraint and its inability to capture ideological nuance, and they underscore the necessity of human evaluation for meaningful assessment of specialized domain translation. This research contributes to understanding AI translation capabilities in political discourse and offers evidence-based recommendations for developing more appropriate evaluation frameworks for specialized translation domains.
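The single-reference constraint noted in the abstract can be illustrated with a minimal sentence-level BLEU sketch. This is an illustrative re-implementation with simple add-k smoothing (in the spirit of Chen & Cherry, 2014), not the authors' actual evaluation pipeline; the example sentences are invented for demonstration.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference, hypothesis, max_n=4, k=1.0):
    """Sentence-level BLEU against a single reference, with add-k
    smoothing so one empty n-gram order does not zero the score."""
    if not hypothesis:
        return 0.0
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        hyp = ngram_counts(hypothesis, n)
        ref = ngram_counts(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in hyp.items())
        total = sum(hyp.values())
        log_prec_sum += math.log((overlap + k) / (total + k))
    # brevity penalty: penalise hypotheses shorter than the reference
    bp = 1.0 if len(hypothesis) >= len(reference) else math.exp(1 - len(reference) / len(hypothesis))
    return bp * math.exp(log_prec_sum / max_n)

# With only one reference, BLEU punishes legitimate paraphrase:
ref = "the government will deepen reform across the board".split()
hyp = "the administration plans comprehensive reform".split()
print(round(sentence_bleu(ref, hyp), 3))  # low score despite adequate meaning
```

Because "administration" and "comprehensive" never appear in the single reference, most n-gram orders find no overlap and the score collapses, which is exactly why human judges can rate such output as adequate while BLEU stays low.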
References
AlAfnan, M. A. (2025). Large Language Models as Computational Linguistics Tools: A Comparative Analysis of ChatGPT and Google Machine Translations. Journal of Artificial Intelligence and Technology, 5, 20–32. https://doi.org/10.37965/jait.2024.0549
Alawida, M., Mejri, S., Mehmood, A., Chikhaoui, B., & Isaac Abiodun, O. (2023). A Comprehensive Study of ChatGPT: Advancements, Limitations, and Ethical Considerations in Natural Language Processing and Cybersecurity. Information, 14(8), Article 8. https://doi.org/10.3390/info14080462
Antaki, F., Touma, S., Milad, D., El-Khoury, J., & Duval, R. (2023). Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of Its Successes and Shortcomings. Ophthalmology Science, 3(4), 100324. https://doi.org/10.1016/j.xops.2023.100324
Araújo, S., & Aguiar, M. (2023). Comparing ChatGPT’s and Human Evaluation of Scientific Texts’ Translations from English to Portuguese Using Popular Automated Translators. Notebook for the SimpleText Lab at CLEF 2023. CEUR Workshop Proceedings.
Azaria, A. (2022). ChatGPT Usage and Limitations. https://hal.science/hal-03913837
Banerjee, S., & Lavie, A. (2005). METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In J. Goldstein, A. Lavie, C.-Y. Lin, & C. Voss (Eds.), Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (pp. 65–72). Association for Computational Linguistics. https://aclanthology.org/W05-0909
Chen, B., & Cherry, C. (2014). A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU. In O. Bojar, C. Buck, C. Federmann, B. Haddow, P. Koehn, C. Monz, M. Post, & L. Specia (Eds.), Proceedings of the Ninth Workshop on Statistical Machine Translation (pp. 362–367). Association for Computational Linguistics. https://doi.org/10.3115/v1/W14-3346
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. de O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., & Zaremba, W. (2021). Evaluating Large Language Models Trained on Code [Preprint]. arXiv. https://doi.org/10.48550/arXiv.2107.03374
Cheng, Y., Wang, R., Chen, J., Chao, Y., Maimaitili, A., & Zhang, H. (2023). Context-Based AI Translation From a Globalization Perspective: A Case Study of ChatGPT. Sino-US English Teaching, 20(9). https://doi.org/10.17265/1539-8072/2023.09.005
De Angelis, L., Baglivo, F., Arzilli, G., Privitera, G. P., Ferragina, P., Tozzi, A. E., & Rizzo, C. (2023). ChatGPT and the rise of large language models: The new AI-driven infodemic threat in public health. Frontiers in Public Health, 11. https://www.frontiersin.org/articles/10.3389/fpubh.2023.1166120
Doddington, G. (2002). Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Proceedings of the Second International Conference on Human Language Technology Research, 138–145.
Du, H. (2023). A Corpus-Based Study on the Linguistic Features of the English Translation of the Report on the Work of the Government. Modern Linguistics, 11, 2630. https://doi.org/10.12677/ML.2023.116356
Du, L. (2025). Chinese Political Discourse in Translation: A Corpus-based Critical Discourse Analysis. Routledge. https://doi.org/10.4324/9781003544456
Evans, D. (2023, January 27). I asked ChatGPT why human intervention is so necessary and why we shouldn’t be scared of progress. LinkedIn. Retrieved July 6, 2024, from https://www.linkedin.com/pulse/i-asked-chat-gpt-why-human-intervention-so-necessary-we-darren-evans/
Evtikhiev, M., Bogomolov, E., Sokolov, Y., & Bryksin, T. (2023). Out of the BLEU: How should we assess quality of the Code Generation models? Journal of Systems and Software, 203, 111741. https://doi.org/10.1016/j.jss.2023.111741
Feng, J. (2024). An Analysis of the Translation Output and Value Dissemination of ChatGPT. Lecture Notes in Education Psychology and Public Media, 35(1), 212–218. https://doi.org/10.54254/2753-7048/35/20232108
Floridi, L. (2023). AI as Agency Without Intelligence: On ChatGPT, Large Language Models, and Other Generative Models. Philosophy & Technology, 36(1), 15. https://doi.org/10.1007/s13347-023-00621-y
Han, C., & Lu, X. (2025). Beyond BLEU: Repurposing neural-based metrics to assess interlingual interpreting in tertiary-level language learning settings. Research Methods in Applied Linguistics, 4(1), 100184. https://doi.org/10.1016/j.rmal.2025.100184
Hendy, A., Abdelrehim, M., Sharaf, A., Raunak, V., Gabr, M., Matsushita, H., Kim, Y. J., Afify, M., & Awadalla, H. H. (2023). How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation [Preprint]. arXiv. http://arxiv.org/abs/2302.09210
Jiang, Z., Lv, Q., Zhang, Z., & Lei, L. (2024). Convergences and Divergences between Automatic Assessment and Human Evaluation: Insights from Comparing ChatGPT-Generated Translation and Neural Machine Translation [Preprint]. arXiv. https://arxiv.org/abs/2401.05176v3
Jiang, Z., & Zhang, Z. (2024). Can ChatGPT Rival Neural Machine Translation? A Comparative Study [Preprint]. arXiv. http://arxiv.org/abs/2401.05176
Jiao, W., Wang, W., Huang, J., Wang, X., & Tu, Z. (2023). Is ChatGPT a Good Translator? Yes With GPT-4 As The Engine [Preprint]. arXiv. https://doi.org/10.48550/arXiv.2301.08745
Kalla, D., & Smith, N. (2023). Study and Analysis of Chat GPT and its Impact on Different Fields of Study. International Journal of Innovative Science and Research Technology, 8(3). https://ssrn.com/abstract=4402499
Karabayeva, I., & Kalizhanova, A. (2024). Evaluating machine translation of literature through rhetorical analysis. Journal of Translation and Language Studies, 5(1), Article 1. https://doi.org/10.48185/jtls.v5i1.962
Khoshafah, F. (2023, April 17). ChatGPT for Arabic-English Translation: Evaluating the Accuracy [Preprint]. Research Square. https://doi.org/10.21203/rs.3.rs-2814154/v2
Larroyed, A. (2023). Redefining Patent Translation: The Influence of ChatGPT and the Urgency to Align Patent Language Regimes in Europe with Progress in Translation Technology. GRUR International, 72(11), 1009–1017. https://doi.org/10.1093/grurint/ikad099
Lavie, A. (2011, September 19). Evaluating the Output of Machine Translation Systems. Proceedings of Machine Translation Summit XIII: Tutorial Abstracts. https://aclanthology.org/2011.mtsummit-tutorials.3
Lee, S., Lee, J., Moon, H., Park, C., Seo, J., Eo, S., Koo, S., & Lim, H. (2023). A Survey on Evaluation Metrics for Machine Translation. Mathematics, 11(4), Article 4. https://doi.org/10.3390/math11041006
Liu, S., & Zhu, W. (2023). An Analysis of the Evaluation of the Translation Quality of Neural Machine Translation Application Systems. Applied Artificial Intelligence, 37(1), 2214460. https://doi.org/10.1080/08839514.2023.2214460
Marie, B., Fujita, A., & Rubino, R. (2021). Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers [Preprint]. arXiv. https://doi.org/10.48550/arXiv.2106.15195
Mattas, P. (2023). ChatGPT: A Study of AI Language Processing and its Implications. International Journal of Research Publication and Reviews, 4, 435–440. https://doi.org/10.55248/gengpi.2023.4218
OpenAI. (2022, November 30). Introducing ChatGPT. Retrieved June 5, 2025, from https://openai.com/blog/chatgpt
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: A Method for Automatic Evaluation of Machine Translation. In P. Isabelle, E. Charniak, & D. Lin (Eds.), Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311–318). Association for Computational Linguistics. https://doi.org/10.3115/1073083.1073135
Park, S., & Eunsil, C. (2023). A Study of Translatability of Irony in ChatGPT. The Journal of Translation Studies, 24(2), 131–160. https://doi.org/10.15749/jts.2023.24.2.005
Peng, K., Ding, L., Zhong, Q., Shen, L., Liu, X., Zhang, M., Ouyang, Y., & Tao, D. (2023). Towards Making the Most of ChatGPT for Machine Translation [Preprint]. arXiv. https://doi.org/10.48550/arXiv.2303.13780
Puppel, M., & Borg, C. (2025). Evaluating ChatGPT’s Performance in Creative Text Translation for Communication: A Case Study from English into German. Media and Intercultural Communication: A Multidisciplinary Journal, 3(1), 1-27. https://doi.org/10.22034/mic.2024.480506.1023
Rafaeli, O., Abend, O., Choshen, L., & Nikolaev, D. (2021). Part of Speech and Universal Dependency effects on English Arabic Machine Translation [Preprint]. arXiv. https://doi.org/10.48550/arXiv.2106.00745
Ray, P. P. (2023). ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet of Things and Cyber-Physical Systems, 3, 121–154. https://doi.org/10.1016/j.iotcps.2023.04.003
Rizki, K. A. M., & Masykuroh, Q. (2024). Evaluating ChatGPT’s Translation of Harry Potter: A Qualitative Study of Translation Techniques, Accuracy, and Acceptability. JELITA, 6(1), Article 1. https://doi.org/10.56185/jelita.v6i1.902
Wang, Z., & Mao, C. (2023). ChatGPT yiwen zhiliang de pinggu yu tisheng — yi taoci lei wenben hanying fanyi wei li [Evaluation and Enhancement of ChatGPT Translation Quality: A Case Study of Chinese-English Translation in Ceramic Texts]. Shandong Ceramics, 46(4), 20–27. https://doi.org/10.3969/j.issn.1005-0639.2023.04.003
Wu, J. (2023). A Comparative Analysis of Chinese-English Translation Quality Based on ChatGPT: A Case Study of Chinese Characteristic Words. Journal of Social Science Humanities and Literature, 6(5), Article 5. https://doi.org/10.53469/jsshl.2023.06(05).08
Yang, Y., Liu, R., Qian, X., & Ni, J. (2023). Performance and perception: Machine translation post-editing in Chinese-English news translation by novice translators. Humanities and Social Sciences Communications, 10(1), Article 1. https://doi.org/10.1057/s41599-023-02285-7
Zhao, Y., Zhang, J., & Zong, C. (2023). Transformer: A General Framework from Machine Translation to Others. Machine Intelligence Research, 20(4), Article 4. https://doi.org/10.1007/s11633-022-1393-5
Zhu, G., & Wang, X. (2023). ChatGPT de yunxing moshi, guanjian jishu ji weilai tujing [ChatGPT: Operation Mode, Key Technology and Future Prospects]. Xinjiang Normal University Journal (Philosophy and Social Sciences Edition), 44(4), 113–122. https://doi.org/10.14100/j.cnki.65-1039/g4.20230217.001
Zhu, P. (2023). Translation of Personal Pronouns in Government Work Report from the Perspective of Explicitation. International Journal of Education and Humanities, 9(2), Article 2. https://doi.org/10.54097/ijeh.v9i2.9911