AI-Assisted Corpus Linguistics: Integrating NLP Models Into Corpus Analysis
DOI:
https://doi.org/10.17507/tpls.1602.24
Keywords:
natural language processing, artificial intelligence, corpus analysis, corpus linguistics
Abstract
Integrating natural language processing (NLP) and artificial intelligence (AI) models into corpus linguistics has opened new avenues for linguistic analysis, yet their suitability for rigorous academic research remains debated because of issues such as opacity and limited interpretability. This systematic review examines how NLP models transform traditional corpus linguistics methodologies, focusing on their applications, benefits, and challenges. Following a PRISMA-guided approach, the study reviewed peer-reviewed literature published from 2013 to 2025 in databases such as Scopus and the ACL Anthology, using keywords such as “AI in corpus linguistics” and “NLP corpus analysis”. Inclusion criteria targeted studies applying NLP models (e.g., BERT, GPT) to linguistic tasks; screening 922 records yielded 12 studies. Study quality was assessed with the CASP checklist to ensure robustness, and the findings were then synthesized thematically. The results indicate that NLP models enhance corpus analysis by automating tasks such as keyword extraction and pragmatic annotation while offering scalability and semantic depth. Applications span discourse analysis, diachronic studies, and sociolinguistic variation, supported by tools such as CorpusChat and Hugging Face Transformers. Challenges include model bias, lack of transparency, and domain mismatch. The review concludes that AI-driven NLP models significantly advance corpus linguistics but that ethical, privacy, and reproducibility concerns must be addressed to ensure academic rigor. Future research should focus on developing domain-specific models and enhancing interpretability to fully harness AI’s potential in linguistic studies.