fix: add italian HateSpeech dataset #385
Lovely to see you here @rbroc! Everything is looking good. There are a few comments, but generally it looks good!
thanks @KennethEnevoldsen! I've left a few comments open above -- especially a question about the multilingual nature of the dataset. Happy to double review points since I am explicitly re-requesting :)
No problem, let us just keep it at one.
last quick q @KennethEnevoldsen: are my points for this 2 + 4 (2 for new dataset + 4 point bonus) or 4 in total?
Looks good!
re. points it is 6 (2+4) in total.
workflows require approval, should be ready to merge after that.
Hei Roberta, just saw this comment. I actually would prefer that you do not do the split, but rather just shuffle and downsample from the training split. Then also switch the eval_split to "train". This is to avoid someone training on the "train" split assuming that we test on another split (which we don't).
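The shuffle-and-downsample suggestion above could look roughly like the sketch below. It uses only the Python standard library and toy data; the helper name `shuffle_and_downsample`, the sample size, and the seed are illustrative assumptions, not code from this PR or from mteb.

```python
import random


def shuffle_and_downsample(rows, n_samples, seed=42):
    """Return up to n_samples rows in shuffled order, reproducibly."""
    rng = random.Random(seed)  # local RNG, global random state is untouched
    shuffled = list(rows)      # copy, so the input list is not modified
    rng.shuffle(shuffled)
    return shuffled[:n_samples]


# Hypothetical toy rows standing in for a "train" split.
train_rows = [{"text": f"example {i}", "label": i % 2} for i in range(10)]

subset = shuffle_and_downsample(train_rows, n_samples=4)
eval_split = "train"  # evaluate on the split the subset was drawn from
```

Note that this sketch does not stratify by label, so class balance in the subset is not guaranteed; a stratified sampler may be preferable for small subsets.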
not sure I fully understand. the dataset has no training split, it only comes with a test split. could also have a chat about this IRL if you think that's more productive.
Merged commit 7241bfc into embeddings-benchmark:merge_385
Ahh then this is perfectly fine. Merged into a temporary branch to run linting and then merging.
* fix: add italian HateSpeech dataset (#385)
* add italian HateSpeech dataset
* add points
* update dialect, socioeconomic status, domains and points
* add PR review points
* add task_domain for constructed data + rerun models
* update points
* minor fix
* merge points from main
* add review points to main
---------
Co-authored-by: Kenneth Enevoldsen <[email protected]>
* run linting
---------
Co-authored-by: Roberta Rocca <[email protected]>
Checklist for adding MMTEB dataset
Reason for dataset addition: This is a highly curated hate speech detection dataset, with binary (hateful/non-hateful) labels, as well as fine-grained labels for 25+ hate speech categories (https://huggingface.co/datasets/Paul/hatecheck-italian).
I am currently only using the binary labels, but it would be cool to also test models on the more fine-grained taxonomy (could just subclass the current task and set the relevant column as the label).
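As a rough illustration of the subclassing idea, here is a self-contained sketch. The class names and the column names (`test_case`, `label_gold`, `functionality`) are assumptions made for the example and are not taken from this PR or from the mteb API.

```python
class BinaryHateSpeechTask:
    """Hypothetical stand-in for the binary classification task."""

    label_column = "label_gold"  # assumed name of the binary label column

    def to_examples(self, rows):
        # Pair each text with the value of whichever column is the label.
        return [(row["test_case"], row[self.label_column]) for row in rows]


class FineGrainedHateSpeechTask(BinaryHateSpeechTask):
    """Same task logic, but labelled with the fine-grained category."""

    label_column = "functionality"  # assumed name of the category column


# Toy rows with made-up values, only to show the label switch.
rows = [
    {"test_case": "text a", "label_gold": "hateful", "functionality": "category_1"},
    {"test_case": "text b", "label_gold": "non-hateful", "functionality": "category_2"},
]

binary_examples = BinaryHateSpeechTask().to_examples(rows)
fine_grained_examples = FineGrainedHateSpeechTask().to_examples(rows)
```

Only the class attribute changes in the subclass, so the fine-grained variant reuses all of the loading and evaluation logic of the binary task.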
There are similar datasets for 9 more languages, released as separate datasets on HF. If this looks like a suitable dataset, I can go on to submit the remaining ones (through separate PRs, I assume).
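Checklist
* Test that the dataset runs with the mteb package.
* Run the following models on the task using the mteb run -m {model_name} -t {task_name} command:
  * sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  * intfloat/multilingual-e5-small
* Run tests locally using make test.
* Run the formatter using make lint.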
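For convenience, the two model runs from the checklist can be generated from the command template. This is a small sketch that only builds the command strings; `TASK_NAME` is a placeholder and must be replaced with the actual task name, which is not stated here.

```python
import shlex

MODELS = [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]
TASK_NAME = "YourTaskName"  # placeholder, substitute the real task name


def build_commands(models, task_name):
    """Fill the `mteb run -m {model_name} -t {task_name}` template per model."""
    return [
        shlex.join(["mteb", "run", "-m", model, "-t", task_name])
        for model in models
    ]


commands = build_commands(MODELS, TASK_NAME)
for command in commands:
    print(command)
```

Each printed line can be pasted into a shell, or passed to `subprocess.run(shlex.split(command), check=True)` to execute it from Python.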
Questions
results/intfloat__multilingual-e5-small/model_meta.json
but can revert if needed