fix: add italian HateSpeech dataset #385

rbroc · 2024-04-16T17:17:10Z

Checklist for adding MMTEB dataset

Reason for dataset addition: This is a highly curated hate speech detection dataset, with binary (hateful/non-hateful) labels, as well as fine-grained labels for 25+ hate speech categories (https://huggingface.co/datasets/Paul/hatecheck-italian).
I am currently only using the binary labels, but it would be cool to also test models on the more fine-grained taxonomy (we could just subclass the current task and set the relevant column as the label).

There are similar datasets for 9 more languages, released as separate datasets on HF. If this looks like a suitable dataset, I can go on to submit the remaining ones (through separate PRs, I assume).

Checklist

I have tested that the dataset runs with the mteb package.
I have run the following models on the task (adding the results to the PR). These can be run using the mteb run -m {model_name} -t {task_name} command.
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- intfloat/multilingual-e5-small
I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
I have considered the size of the dataset and reduced it if it is too big (2048 examples is typically large enough for most tasks).
Run tests locally to make sure nothing is broken using make test.
Run the formatter to format the code using make lint.
I have added points for my submission to the POINTS.md file.

Questions

The dataset does not have a native training split (it only comes with a test split). I am currently splitting into equally sized (1845 sentences) train/test splits, stratifying by hate speech type, but we could consider alternative options.
I don't know what happened with results/intfloat__multilingual-e5-small/model_meta.json, but I can revert it if needed.
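For reference, the stratified 50/50 split described in the first question can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code in this PR: the helper `stratified_half_split`, the `functionality` column name, and the toy rows are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_half_split(rows, key, seed=42):
    """Split rows into two halves while keeping each stratum evenly represented.

    `rows` is a list of dicts and `key` is the column to stratify on
    (a hypothetical hate speech type column in this sketch).
    """
    rng = random.Random(seed)

    # Group the rows by the value of the stratification column.
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[row[key]].append(row)

    train, test = [], []
    for stratum in sorted(by_stratum):
        group = by_stratum[stratum]
        rng.shuffle(group)
        half = len(group) // 2
        # For odd-sized strata the extra example goes to the test half.
        train.extend(group[:half])
        test.extend(group[half:])
    return train, test


# Toy data: 100 rows spread evenly over 5 hypothetical hate speech types.
rows = [
    {"text": f"example {i}", "functionality": f"type_{i % 5}", "label": i % 2}
    for i in range(100)
]
train, test = stratified_half_split(rows, key="functionality")
```

With strata of even size, both halves contain the same number of examples per type, which is the property the split in this PR aims for.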

KennethEnevoldsen

Lovely to see you here @rbroc! There are a few comments, but generally it looks good!

docs/mmteb/points.md

mteb/tasks/Classification/ita/ItaHateClassification.py

rbroc · 2024-04-17T09:49:35Z

thanks @KennethEnevoldsen! I've left a few comments open above -- especially a question about the multilingual nature of the dataset. Happy to double review points since I am explicitly re-requesting :)

KennethEnevoldsen · 2024-04-17T09:54:31Z

> Happy to double review points since I am explicitly re-requesting :)

No problem, let us just keep it at one.

rbroc · 2024-04-17T10:21:40Z

last quick q @KennethEnevoldsen: are my points for this 2 + 4 (2 for new dataset + 4 point bonus) or 4 in total?
if it looks good, i'll go ahead, update points if needed, and merge and work on a separate PR for the multilingual expansion.

KennethEnevoldsen

Looks good!

re. points it is 6 (2+4) in total.

rbroc · 2024-04-17T15:45:45Z

workflows require approval, should be ready to merge after that.

KennethEnevoldsen · 2024-04-18T08:27:36Z

> The dataset does not have a native training split (it only comes with a test split). I am currently splitting into equally sized (1845 sentences) train/test splits, stratifying by hate speech type, but we could consider alternative options.

Hi Roberta, just saw this comment. I actually would prefer that you do not do the split, but rather just shuffle and downsample from the training split. Then also switch the eval_split to "train". This is to avoid someone training on the "train" split assuming that we test on another split (which we don't).
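The shuffle-and-downsample alternative suggested here could look roughly like the sketch below. It is plain Python under stated assumptions: the helper `shuffle_and_downsample` and the toy rows are hypothetical, the 2048 cap is taken from the checklist guideline, and 3690 is simply twice the 1845 sentences per split mentioned in the PR description.

```python
import random


def shuffle_and_downsample(rows, n_samples=2048, seed=42):
    """Return a reproducible random subset of at most n_samples rows."""
    rng = random.Random(seed)
    shuffled = list(rows)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled[:n_samples]


# Toy stand-in for a single-split dataset of 3690 sentences.
rows = [{"text": f"example {i}", "label": i % 2} for i in range(3690)]
eval_rows = shuffle_and_downsample(rows)
```

Keeping everything in a single evaluation split, rather than creating an artificial train/test pair, is what avoids the confusion about which split is used for testing.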

rbroc · 2024-04-18T09:04:36Z

> > The dataset does not have a native training split (it only comes with a test split). I am currently splitting into equally sized (1845 sentences) train/test splits, stratifying by hate speech type, but we could consider alternative options.
>
> Hi Roberta, just saw this comment. I actually would prefer that you do not do the split, but rather just shuffle and downsample from the training split. Then also switch the eval_split to "train". This is to avoid someone training on the "train" split assuming that we test on another split (which we don't).

not sure I fully understand. the dataset has no training split, it only comes with a test split. could also have a chat about this IRL if you think that's more productive.

KennethEnevoldsen · 2024-04-18T10:02:19Z

Ahh, then this is perfectly fine. Merged into a temporary branch to run linting before merging.

* fix: add italian HateSpeech dataset (#385)
  * add italian HateSpeech dataset
  * add points
  * update dialect, socioeconomic status, domains and points
  * add PR review points
  * add task_domain for constructed data + rerun models
  * update points
  * minor fix
  * merge points from main
  * add review points to main
  ---------
  Co-authored-by: Kenneth Enevoldsen <[email protected]>
* run linting
---------
Co-authored-by: Roberta Rocca <[email protected]>

rbroc added 2 commits April 16, 2024 19:05

add italian HateSpeech dataset

c28320f

add points

c57acf1

KennethEnevoldsen reviewed Apr 17, 2024

rbroc added 2 commits April 17, 2024 11:40

update dialect, socioeconomic status, domains and points

a90adfc

add PR review points

5e3931b

rbroc requested a review from KennethEnevoldsen April 17, 2024 09:48

rbroc mentioned this pull request Apr 17, 2024

scale Italian HateSpeech to multilingual #395

Closed

add task_domain for constructed data + rerun models

87b159f

KennethEnevoldsen approved these changes Apr 17, 2024

rbroc and others added 5 commits April 17, 2024 15:13

update points

3c35264

Merge branch 'main' into emo-it

6e64e36

minor fix

579545d

merge points from main

6e4ece6

add review points to main

da69321

KennethEnevoldsen changed the title ~~add italian HateSpeech dataset~~ fix: add italian HateSpeech dataset Apr 18, 2024

KennethEnevoldsen changed the base branch from main to merge_385 April 18, 2024 09:59

Merge branch 'merge_385' into emo-it

27431ac

KennethEnevoldsen merged commit 7241bfc into embeddings-benchmark:merge_385 Apr 18, 2024
4 checks passed
