KODOLI is a novel KOrean Dataset for Offensive Language Identification.
Warning: it contains highly offensive expressions.
- KODOLI comprises more fine-grained offensiveness categories (i.e., not offensive, likely offensive, and offensive)
- Likely offensive language refers to texts with implicit offensiveness or abusive language without offensive intentions.
- In addition, we propose two auxiliary tasks to help identify offensive language: abusive language detection and sentiment analysis.
- You can also use the auxiliary tasks for toxicity detection. (Be careful with the raw expressions.)
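As a minimal sketch of how the three offensiveness categories might be handled in code, the snippet below counts examples per category. The field name `offensiveness`, the label strings, and the placeholder rows are assumptions made for illustration; check the released files for the actual column names and label encoding.

```python
from collections import Counter
from enum import Enum


class Offensiveness(Enum):
    # The three categories named in this README. The string values are
    # hypothetical and may differ from the encoding in the released files.
    NOT_OFFENSIVE = "not offensive"
    LIKELY_OFFENSIVE = "likely offensive"
    OFFENSIVE = "offensive"


def label_distribution(rows, label_field="offensiveness"):
    """Count examples per category; `label_field` is an assumed column name."""
    return Counter(Offensiveness(row[label_field]) for row in rows)


# Placeholder rows (no real dataset text) standing in for parsed records.
rows = [
    {"text": "<example 1>", "offensiveness": "not offensive"},
    {"text": "<example 2>", "offensiveness": "likely offensive"},
    {"text": "<example 3>", "offensiveness": "offensive"},
    {"text": "<example 4>", "offensiveness": "offensive"},
]

counts = label_distribution(rows)
for category in Offensiveness:
    print(category.value, counts[category])
```

Using an `Enum` means an unexpected label string raises a `ValueError` instead of being silently counted as a new class, which can help catch encoding mismatches early.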
You can download the KODOLI benchmark from this repository. Please follow the dataset's license.
- Texts are mainly collected and sampled from online communities and news articles.
[Guideline (KOR.)] Coming soon
- Apr 20, 2023: We release 3.6k examples for the offensive language identification task.
@inproceedings{park2023feel,
title={“Why do I feel offended?”-Korean Dataset for Offensive Language Identification},
author={Park, San-Hee and Kim, Kang-Min and Lee, O-joun and Kang, Youjin and Lee, Jaewon and Lee, Su-min and Lee, Sangkeun},
booktitle={Findings of the Association for Computational Linguistics: EACL 2023},
pages={1112--1123},
year={2023}
}
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.