Establishing Inter-Rater Reliability in CEFR-Based Textbook Evaluation: Evidence From a Fleiss’ Kappa Study

Imed Sdiri; Manjet Kaur Mehar Singh

doi:10.17507/tpls.1607.36

Authors

Imed Sdiri, Universiti Sains Malaysia
Manjet Kaur Mehar Singh, Universiti Sains Malaysia

DOI:

https://doi.org/10.17507/tpls.1607.36

Keywords:

inter-rater reliability, Fleiss’ Kappa, CEFR alignment, textbook evaluation, bias

Abstract

Reliability is a central concern in research on textbook evaluation, particularly when judgments depend on the subjective interpretation of pedagogical features. In the context of English as a Foreign Language (EFL) education, the Common European Framework of Reference for Languages (CEFR) has become the dominant benchmark for assessing curricular alignment. Yet, while numerous studies claim that textbooks are evaluated against CEFR guidelines, few report systematically on the reliability of such evaluations. This paper addresses this gap by examining inter-rater reliability in the evaluation of an EFL textbook marketed by a renowned international publisher as aligned with the CEFR A2 level. Three expert raters independently applied a validated 22-item evaluation instrument to 20 reading comprehension lessons. Agreement was measured using Fleiss’ Kappa (κ), a statistic specifically designed for categorical data involving multiple raters. The results revealed a consistently high level of inter-rater reliability, with a pooled Fleiss’ Kappa of 0.89. These findings confirm the robustness of the evaluation instrument and demonstrate that with rigorous rater training and independent coding, subjective bias in EFL textbook evaluation can be effectively minimized. The study contributes a methodological model that enhances the rigor of EFL textbook evaluation studies by advocating for evidence-based validation procedures.
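For readers unfamiliar with the statistic, the following is a minimal sketch of how Fleiss’ Kappa can be computed for a fixed number of raters assigning each item to one category. It is not the authors’ analysis script. The function name `fleiss_kappa`, the category labels, and the four example ratings are hypothetical and are not drawn from the study’s data.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of subjects, each rated once by every rater.

    ratings    -- list of lists; ratings[i] holds one category label per rater
    categories -- all possible category labels
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every subject needs the same number (>= 2) of ratings")

    # How many raters chose each category, per subject.
    counts = [Counter(r) for r in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over subjects.
    pairs = n_raters * (n_raters - 1)
    p_bar = sum(
        (sum(c[k] ** 2 for k in categories) - n_raters) / pairs for c in counts
    ) / n_subjects

    # Chance agreement: sum of squared overall category proportions.
    total = n_subjects * n_raters
    p_e = sum((sum(c[k] for c in counts) / total) ** 2 for k in categories)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical data: three raters judging four checklist items (not study data).
example = [
    ["met", "met", "met"],
    ["met", "met", "not met"],
    ["not met", "not met", "not met"],
    ["not met", "not met", "not met"],
]
kappa = fleiss_kappa(example, ["met", "not met"])
print(round(kappa, 3))  # 0.657
```

In this hypothetical example, a single disagreement on one of four items lowers κ to about 0.66. This illustrates why a pooled value of 0.89 across 22 items and 20 lessons indicates very consistent coding.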

Author Biographies

Imed Sdiri, Universiti Sains Malaysia

School of Languages, Literacies, and Translation

Manjet Kaur Mehar Singh, Universiti Sains Malaysia

School of Languages, Literacies, and Translation

