An experimental annotation task to investigate annotators’ subjectivity in a Misogyny dataset
