fix: add italian HateSpeech dataset #385

Merged (11 commits) on Apr 18, 2024

rbroc commented Apr 16, 2024

Checklist for adding MMTEB dataset

Reason for dataset addition: This is a highly curated hate speech detection dataset, with binary (hateful/non-hateful) labels, as well as fine-grained labels for 25+ hate speech categories (https://huggingface.co/datasets/Paul/hatecheck-italian).
I am currently only using the binary labels, but it would be cool to also test models on the more fine-grained taxonomy (could just subclass the current task and set the relevant column as the label).
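A minimal sketch of the subclassing idea, assuming a task class that reads its label from a single configurable column. The class names and the column names (`text`, `is_hateful`, `hate_type`) are hypothetical stand-ins, not the mteb API or the dataset's actual schema:

```python
class HateSpeechBinaryTask:
    """Hypothetical stand-in for the task class in this PR (not the mteb API)."""

    text_column = "text"         # assumed column name
    label_column = "is_hateful"  # assumed binary label column

    def to_examples(self, rows):
        # Map raw rows to the (text, label) pairs a classifier would consume.
        return [(row[self.text_column], row[self.label_column]) for row in rows]


class HateSpeechFineGrainedTask(HateSpeechBinaryTask):
    """Same data and logic; only the label column changes."""

    label_column = "hate_type"   # assumed fine-grained category column


# Toy rows with placeholder text, for illustration only.
rows = [
    {"text": "placeholder sentence A", "is_hateful": 1, "hate_type": "category_1"},
    {"text": "placeholder sentence B", "is_hateful": 0, "hate_type": "category_2"},
]
binary_examples = HateSpeechBinaryTask().to_examples(rows)
fine_examples = HateSpeechFineGrainedTask().to_examples(rows)
```

The subclass overrides a single attribute, so the binary and fine-grained variants would share loading and preprocessing code.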

There are similar datasets for 9 more languages, released as separate datasets on HF. If this looks like a suitable dataset, I can go on to submit the remaining ones (through separate PRs, I assume).

Checklist

  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the pr). These can be run using the mteb run -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • I have considered the size of the dataset and reduced it if it is too big (2048 examples is typically large enough for most tasks)
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • I have added points for my submission to the POINTS.md file.

Questions

  • The dataset does not have a native training split (it only comes with a test split). I am currently splitting into equally sized (1845 sentences) train/test splits, stratifying by hate speech type, but we could consider alternative options.
  • Don't know what happened with results/intfloat__multilingual-e5-small/model_meta.json but can revert if needed
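
For reference, the equal-sized stratified split described in the first question could look roughly like the following. This is a standard-library sketch rather than the code in this PR; the `hate_type` column name and the toy rows are assumptions for illustration:

```python
import random
from collections import defaultdict


def stratified_half_split(rows, strata_key, seed=42):
    """Split rows into two halves, halving each stratum separately."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[row[strata_key]].append(row)

    train, test = [], []
    for stratum in sorted(by_stratum):  # sorted for a deterministic order
        group = by_stratum[stratum]
        rng.shuffle(group)
        half = len(group) // 2          # odd-sized groups put the extra row in test
        train.extend(group[:half])
        test.extend(group[half:])
    return train, test


# Toy data: 100 rows spread evenly over 5 hypothetical hate speech types.
rows = [{"text": f"example {i}", "hate_type": f"type_{i % 5}"} for i in range(100)]
train, test = stratified_half_split(rows, "hate_type")
```

Stratifying by hate speech type keeps every category represented in both halves, which a plain random split would not guarantee for rare categories.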

KennethEnevoldsen left a comment

Lovely to see you here @rbroc! Everything is looking good. There are a few comments, but generally it looks good!

docs/mmteb/points.md (outdated, resolved)
mteb/tasks/Classification/ita/ItaHateClassification.py (outdated, resolved; 3 review threads)


rbroc commented Apr 17, 2024

thanks @KennethEnevoldsen! I've left a few comments open above -- especially a question about the multilingual nature of the dataset. Happy to double review points since I am explicitly re-requesting :)

@KennethEnevoldsen

> Happy to double review points since I am explicitly re-requesting :)

No problem, let us just keep it at one.

rbroc commented Apr 17, 2024

last quick q @KennethEnevoldsen: are my points for this 2 + 4 (2 for new dataset + 4 point bonus) or 4 in total?
if it looks good, i'll go ahead, update points if needed, and merge and work on a separate PR for the multilingual expansion.

KennethEnevoldsen left a comment

Looks good!

re. points it is 6 (2+4) in total.

rbroc commented Apr 17, 2024

workflows require approval, should be ready to merge after that.

@KennethEnevoldsen

> The dataset does not have a native training split (it only comes with a test split). I am currently splitting into equally sized (1845 sentences) train/test splits, stratifying by hate speech type, but we could consider alternative option

Hei Roberta, just saw this comment. I actually would prefer that you do not do the split, but rather just shuffle and downsample from the training. Then also switch the eval_split to "train". This is to avoid someone training on the "train" split assuming that we test on another split (which we don't).
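
A rough sketch of the shuffle-and-downsample alternative, assuming the data lives in a single split. The 2048 target comes from the checklist above; the function name, the toy rows, and the `task_config` dictionary are hypothetical and are not the mteb API:

```python
import random


def shuffle_and_downsample(rows, n_samples=2048, seed=42):
    """Return a seeded random subset of at most n_samples rows."""
    rng = random.Random(seed)
    shuffled = list(rows)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled[:n_samples]


# Toy data sized like the full dataset implied above (2 x 1845 sentences).
rows = [{"text": f"placeholder {i}", "label": i % 2} for i in range(3690)]
eval_rows = shuffle_and_downsample(rows)

# Hypothetical config: evaluate on the only split that exists, so nobody
# trains on it expecting a separate held-out test split.
task_config = {"eval_split": "train"}
```

Fixing the seed keeps the evaluated subset identical across runs, so scores from different models stay comparable.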

KennethEnevoldsen changed the title from "add italian HateSpeech dataset" to "fix: add italian HateSpeech dataset" on Apr 18, 2024

rbroc commented Apr 18, 2024

> > The dataset does not have a native training split (it only comes with a test split). I am currently splitting into equally sized (1845 sentences) train/test splits, stratifying by hate speech type, but we could consider alternative option

> Hei Roberta, just saw this comment. I actually would prefer that you do not do the split, but rather just shuffle and downsample from the training. Then also switch the eval_split to "train". This is to avoid someone training on the "train" split assuming that we test on another split (which we don't).

not sure I fully understand. the dataset has no training split, it only comes with a test split. could also have a chat about this IRL if you think that's more productive.

KennethEnevoldsen changed the base branch from main to merge_385 on April 18, 2024 09:59
KennethEnevoldsen merged commit 7241bfc into embeddings-benchmark:merge_385 on Apr 18, 2024
4 checks passed
@KennethEnevoldsen

Ahh then this is perfectly fine. Merged into a temporary branch to run linting, and will then merge.

KennethEnevoldsen added a commit that referenced this pull request Apr 18, 2024
* fix: add italian HateSpeech dataset (#385)

* add italian HateSpeech dataset

* add points

* update dialect, socioeconomic status, domains and points

* add PR review points

* add task_domain for constructed data + rerun models

* update points

* minor fix

* merge points from main

* add review points to main

---------

Co-authored-by: Kenneth Enevoldsen <[email protected]>

* run linting

---------

Co-authored-by: Roberta Rocca <[email protected]>