Measures of interrater agreement when each target is evaluated by a different group of raters

Giuseppe Bove, Roma Tre University, Italy

This is a section of ASA 2022 Data-Driven Decision Making (DOI: 10.36253/979-12-215-0106-3) by Enrico di Bella, Luigi Fabbris, Corrado Lagazio. Firenze University Press, Firenze, 2023. https://doi.org/10.36253/979-12-215-0106-3.28

Available for academic research purposes

Open Access

Copyright Author(s)

Content licence CC BY 4.0

Metadata licence CC0 1.0

Most measures of interrater agreement are defined for ratings of a group of targets, each rated by the same group of raters (e.g., the agreement of raters who assess, on a rating scale, the language proficiency shown in a corpus of argumentative written texts). However, there are situations in which each target is evaluated by a different group of raters, for instance when the teachers in a school are evaluated through a questionnaire administered to all the pupils (students) in their classrooms. In these situations, a first approach is to evaluate the level of agreement for the whole group of targets with the one-way random-effects ANOVA model. A second approach is to apply subject-specific indices of interrater agreement such as rWG, which compares the observed variance of the ratings with the variance of a theoretical distribution representing no agreement (i.e., the null distribution). Neither approach is appropriate for ordinal or nominal scales. In this paper, an index is proposed to evaluate the agreement between raters for each single target (subject or object) on an ordinal scale, and also to obtain a global measure of interrater agreement for the whole group of cases evaluated. The index is not affected by a possible concentration of ratings on a very small number of levels of the scale, as happens for measures based on the ANOVA approach, and, unlike rWG, it does not depend on the definition of a null distribution. The main features of the proposal are illustrated in a study on the assessment of learning teacher behavior in the classroom, with data collected in research conducted in 2018 at Roma Tre University.
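To make the two existing approaches concrete, the following Python sketch computes a single-item rWG for each target and an ICC(1) from the one-way random-effects ANOVA with unequal group sizes. It illustrates the approaches the abstract contrasts with the proposal, not the index proposed in this paper. The uniform null distribution for rWG, the ICC(1) estimator with an adjusted average group size, the function names, and the rating data are all assumptions introduced here for illustration.

```python
from statistics import mean, variance


def rwg(ratings, n_levels):
    """Single-item r_WG for one target, assuming a uniform null distribution."""
    # Variance expected if raters picked among n_levels categories at random.
    null_variance = (n_levels ** 2 - 1) / 12
    # Observed sample variance (n - 1 denominator) of this target's ratings.
    observed_variance = variance(ratings)
    return 1 - observed_variance / null_variance


def icc1(groups):
    """ICC(1) from a one-way random-effects ANOVA with unequal group sizes."""
    n_targets = len(groups)
    sizes = [len(g) for g in groups]
    n_total = sum(sizes)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(n * (mean(g) - grand_mean) ** 2
                     for g, n in zip(groups, sizes))
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (n_targets - 1)
    ms_within = ss_within / (n_total - n_targets)
    # Average group size, adjusted for the unbalanced design.
    k0 = (n_total - sum(n * n for n in sizes) / n_total) / (n_targets - 1)
    return (ms_between - ms_within) / (ms_between + (k0 - 1) * ms_within)


# Hypothetical data: three teachers, each rated by a different class
# on a 4-level scale coded 1..4 (values invented for illustration).
ratings_by_teacher = {
    "teacher_A": [4, 4, 4, 3, 4],
    "teacher_B": [1, 2, 3, 4],
    "teacher_C": [2, 2, 3, 2, 2, 3],
}

if __name__ == "__main__":
    for teacher, ratings in ratings_by_teacher.items():
        print(teacher, round(rwg(ratings, n_levels=4), 3))
    print("ICC(1):", round(icc1(list(ratings_by_teacher.values())), 3))
```

Because each target has its own list of ratings, neither function requires the groups of raters to have the same size or composition. Both computations treat the scale levels as equally spaced numbers, which is the assumption that makes them questionable for ordinal or nominal scales.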

Keywords: Interrater agreement; Ordinal data; Teacher evaluation

It is available online at https://doi.org/10.36253/979-12-215-0106-3.28
