Evaluating Large Language Models and Neural Machine Translation Systems in Translating Liaozhaizhiyi: A Cross-Cultural Literary Study
DOI: https://doi.org/10.17507/tpls.1605.34

Keywords: ChatGPT, neural machine translation, Google Translate, Youdao Translate, Liaozhaizhiyi, Chinese classical literature, cross-cultural communication

Abstract
Neural machine translation has demonstrated strong performance in high-resource languages and commercial translation contexts. However, its effectiveness in translating classical Chinese literature remains insufficiently examined. This study conducts a comparative evaluation of three translation systems—ChatGPT, Google Translate, and Youdao Translate—using selected texts from Liaozhaizhiyi as the research corpus. The analysis focuses on four dimensions: semantic alignment, measured by BLEU scores; translation fluency; stylistic fidelity; and the detectability of machine-generated translation patterns. The results indicate that ChatGPT achieves superior performance in stylistic fidelity, particularly in preserving poetic tone, as well as in semantic alignment when translating idiomatic expressions and culturally embedded references. In addition, translations produced by ChatGPT exhibit fewer mechanical artefacts commonly associated with neural machine translation outputs. Further experiments demonstrate that structured prompt engineering strategies contribute to improved literary naturalness and greater cultural coherence in the translated texts. These findings suggest that large language models offer notable advantages in the translation of classical literary works and provide empirical insights into the role of artificial intelligence in facilitating cross-cultural interpretation and the international transmission of Chinese literary traditions.
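To make the evaluation procedure concrete, the following is a minimal sketch of the BLEU-based semantic-alignment scoring described above, using the sacreBLEU library (Post, 2018; see References). The excerpt and candidate translations are invented placeholders rather than items from the study's corpus, and the setup is an illustrative assumption, not the authors' exact pipeline.

import sacrebleu  # pip install sacrebleu

# One human reference translation of a Liaozhaizhiyi excerpt (invented placeholder).
references = [[
    "At dusk, a fox spirit appeared at the scholar's window.",
]]

# Candidate outputs from the three systems under comparison (also invented).
candidates = {
    "ChatGPT": ["At dusk, a fox spirit appeared at the scholar's window."],
    "Google Translate": ["In the evening, a fox fairy came to the scholar's window."],
    "Youdao Translate": ["A fox demon appeared at the window of the scholar at night."],
}

for system, hypothesis in candidates.items():
    # corpus_bleu takes a list of hypothesis strings and a list of reference
    # streams, each stream aligned one-to-one with the hypotheses.
    result = sacrebleu.corpus_bleu(hypothesis, references)
    print(f"{system}: BLEU = {result.score:.2f}")

Higher scores indicate closer n-gram overlap with the reference; as the abstract notes, BLEU captures only the semantic-alignment dimension, so fluency and stylistic fidelity were assessed separately.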
References
Aharoni, R., Koppel, M., & Goldberg, Y. (2014). Automatic detection of machine-translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 289–295). https://doi.org/10.3115/v1/P14-2048
Ahrenberg, L. (2017). Comparing machine translation and human translation: A case study. In Proceedings of the Workshop on Human-Informed Translation and Interpreting Technology (pp. 21–28). Association for Computational Linguistics. Retrieved December 29, 2025, from https://aclanthology.org/W17-7903/
Akabli, J., & Khaloufi, R. (2024). Translating identity in Leila Abouzeid’s Return to Childhood. AWEJ for Translation & Literary Studies, 8(2), 2–17. https://doi.org/10.24093/awejtls/vol8no2.1
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint, arXiv:1409.0473. https://arxiv.org/abs/1409.0473
Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278. https://doi.org/10.1016/j.jml.2012.11.001
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901. https://arxiv.org/abs/2005.14165
Castilho, S., Moorkens, J., Gaspari, F., Calixto, I., Tinsley, J., & Way, A. (2017). Is neural machine translation the new state of the art? Prague Bulletin of Mathematical Linguistics, 108, 109–120. https://doi.org/10.1515/pralin-2017-0013
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1724–1734). https://doi.org/10.3115/v1/D14-1179
Christiano, P. F., Leike, J., Brown, T. B., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30, 4299–4307. https://arxiv.org/abs/1706.03741
Dankers, V., Lucas, C. G., & Titov, I. (2022). Can transformers be too compositional? Analysing idiom processing in neural machine translation. arXiv preprint, arXiv:2205.15301. https://doi.org/10.48550/arXiv.2205.15301
Deng, X., & Yu, Z. (2022). A systematic review of machine-translation-assisted language learning for sustainable education. Sustainability, 14(13), 7598. https://doi.org/10.3390/su14137598
Drobot, I.-A. (2021). Translating literature using machine translation: Is it really possible? Scientific Bulletin of the Politehnica University of Timișoara: Transactions on Modern Languages, 20(1), 57–64. https://doi.org/10.59168/FUAP6124
España-Bonet, C., Costa-Jussà, M. R., Rapp, R., Lambert, P., Eberle, K., Banchs, R. E., & Babych, B. (2016). Hybrid machine translation overview. In Hybrid approaches to machine translation (pp. 1–24). Springer. https://doi.org/10.1007/978-3-319-21311-8
Gao, R., Lin, Y., Zhao, N., & Cai, Z. G. (2024). Machine translation of Chinese classical poetry: A comparison among ChatGPT, Google Translate, and DeepL Translator. Humanities and Social Sciences Communications, 11(1), Article 835. https://doi.org/10.1057/s41599-024-03363-0
Gozzi, M., & Di Maio, F. (2024). Comparative analysis of prompt strategies for large language models: Single-task vs. multitask prompts. Electronics, 13(23), 4712. https://doi.org/10.3390/electronics13234712
Guerberof-Arenas, A., & Toral, A. (2022). Creativity in translation: Machine translation as a constraint for literary texts. Translation Spaces, 11(2), 184–212. https://doi.org/10.1075/ts.21025.gue
Hadley, J., Popović, M., Afli, H., & Way, A. (Eds.). (2019). Proceedings of the Qualities of Literary Machine Translation. European Association for Machine Translation. Retrieved December 7, 2025, from https://aclanthology.org/W19-7300/
Jiao, W., Wang, W., Huang, J., Wang, X., Shi, S., & Tu, Z. (2023). Is ChatGPT a good translator? Yes with GPT-4 as the engine: A preliminary study. arXiv preprint, arXiv:2301.08745. https://doi.org/10.48550/arXiv.2301.08745
Jing, Y., Yang, Y., Feng, Z., Ye, J., Yu, Y., & Song, M. (2019). Neural style transfer: A review. IEEE Transactions on Visualization and Computer Graphics, 26(11), 3365–3385. https://doi.org/10.1109/TVCG.2019.2921336
Koehn, P. (2009). Statistical machine translation. Cambridge University Press.
Karpinska, M., & Iyyer, M. (2023). Large language models effectively leverage document-level context for literary translation, but critical errors persist. In Proceedings of the Eighth Conference on Machine Translation (WMT 2023), Volume 1: Research Papers (pp. 478–489). Retrieved December 7, 2025, from https://aclanthology.org/2023.wmt-1.41/
Kulkarni, A., Shivananda, A., Kulkarni, A., & Gudivada, D. (2023). The ChatGPT architecture: An in-depth exploration of OpenAI’s conversational language model. In Applied generative AI for beginners: Practical knowledge on diffusion models, ChatGPT, and other LLMs (pp. 55–77). Apress. https://doi.org/10.1007/978-1-4842-9994-4_4
Lau, J., Wang, Y., & Tang, G. (2024). Improving BERTScore for machine translation evaluation through contrastive learning. IEEE Access, 12, 77739–77749. https://doi.org/10.1109/ACCESS.2024.3406993
Matuschek, H., Kliegl, R., Vasishth, S., Baayen, H., & Bates, D. (2017). Balancing Type I error and power in linear mixed models. Journal of Memory and Language, 94, 305–315. https://doi.org/10.1016/j.jml.2017.01.001
Naveen, P., & Trojovský, P. (2024). Overview and challenges of machine translation for contextually appropriate translations. iScience, 27(10), 110878. https://doi.org/10.1016/j.isci.2024.110878
Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311–318). https://doi.org/10.3115/1073083.1073135
Post, M. (2018). A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation (WMT), Volume 1: Research Papers (pp. 186–191). https://doi.org/10.48550/arXiv.1804.08771
Qin, C., Zhang, A., Zhang, Z., Chen, J., Yasunaga, M., & Yang, D. (2023). Is ChatGPT a general-purpose natural language processing task solver? arXiv preprint, arXiv:2302.06476. https://doi.org/10.48550/arXiv.2302.06476
Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020). Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 4902–4912). Retrieved December 7, 2025, from https://aclanthology.org/2020.acl-main.442/
Si, S., Zhou, S., & Zhang, Y. (2024). Exploring the capabilities of ChatGPT in ancient Chinese translation and person name recognition. Corpus-Based Studies across Humanities, 2, 221–234. https://doi.org/10.1515/csh-2024-0017
Toral, A., & Way, A. (2018). What level of quality can neural machine translation attain on literary text? In J. Moorkens, S. Castilho, F. Gaspari, & S. Doherty (Eds.), Translation quality assessment: From principles to practice (pp. 263–287). Springer. https://doi.org/10.1007/978-3-319-91241-7_12
Weaver, W. (1955). Translation. In W. N. Locke & A. D. Booth (Eds.), Machine translation of languages: Fourteen essays (pp. 15–23). MIT Press.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837. https://doi.org/10.48550/arXiv.2201.11903
Wang, Q. (2025). Evaluating Uighur literary translation: A comparative study of ChatGPT, Google Translate, and Bing Translator. PLoS ONE, 20, e0335261. https://doi.org/10.1371/journal.pone.0335261
Zhang, B., Haddow, B., & Birch, A. (2023). Prompting large language model for machine translation: A case study. arXiv preprint, arXiv:2301.07069. https://doi.org/10.48550/arXiv.2301.07069
Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2019). BERTScore: Evaluating text generation with BERT. arXiv preprint, arXiv:1904.09675. https://doi.org/10.48550/arXiv.1904.09675
Zhou, P., & Cheng, J. (2025). Stylistic variation across English translations of Chinese science fiction: Ken Liu versus ChatGPT. Frontiers in Artificial Intelligence, 8, Article 1576750. https://doi.org/10.3389/frai.2025.1576750