Multilabel Brazilian Toxic Tweets Classification #773

Merged: 7 commits, May 24, 2024


dokato
Collaborator

@dokato dokato commented May 20, 2024

Checklist for adding MMTEB dataset

Reason for dataset addition: there's a shortage of datasets in Brazilian Portuguese, and we currently don't have a big enough variety of multilabel datasets. This one is well structured and described: https://paperswithcode.com/dataset/told-br

Furthermore, I corrected some minor issues with Maltese News Classification.

  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the pr). These can be run using the mteb -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).
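
For reference, the two model runs listed in the checklist can be scripted. The snippet below is only a sketch: it builds the `mteb -m {model_name} -t {task_name}` command lines quoted above without executing them, and the task name `BrazilianToxicTweetsClassification` is an assumption, since the thread does not state the registered task name.

```python
import shlex

# Models named in the checklist above.
MODELS = [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]

# ASSUMPTION: the registered task name is not given in this thread.
TASK_NAME = "BrazilianToxicTweetsClassification"


def build_commands(models, task_name):
    """Return one `mteb -m <model> -t <task>` command string per model."""
    return [shlex.join(["mteb", "-m", model, "-t", task_name]) for model in models]


commands = build_commands(MODELS, TASK_NAME)
for command in commands:
    # Printed only; paste into a shell (or pass to subprocess.run) to execute.
    print(command)
```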

@dokato dokato added the WIP Work In Progress label May 20, 2024
@dokato
Collaborator Author

dokato commented May 20, 2024

@x-tabdeveloping sorry to bother you again, but I need advice on how to handle this set. On HF it consists of a single train split with 21k examples.
Here again I can't use the built-in stratified train_test_split from datasets, as it complains about the column type, so I just use random samples. As a consequence, we can't guarantee that the labels seen in training will match the ones in test. WDYT? Should we sort out #760 or #694 first?
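
One way to see the concern with a purely random split is to check which labels occur in test but never in train, and to redraw the split when that happens. The sketch below is an illustration only, not the code used in this PR: the helper names and the toy label lists are hypothetical, and it uses plain Python rather than the datasets library.

```python
import random


def random_split(n_examples, test_size, seed):
    """Shuffle indices with a fixed seed and cut off the first `test_size` as test."""
    rng = random.Random(seed)
    indices = list(range(n_examples))
    rng.shuffle(indices)
    return sorted(indices[test_size:]), sorted(indices[:test_size])


def unseen_test_labels(labels, train_idx, test_idx):
    """Return the labels that occur in the test split but never in the train split."""
    train_labels = {label for i in train_idx for label in labels[i]}
    test_labels = {label for i in test_idx for label in labels[i]}
    return test_labels - train_labels


def split_with_label_coverage(labels, test_size, max_tries=100):
    """Try successive seeds until every test label is also present in train."""
    for seed in range(max_tries):
        train_idx, test_idx = random_split(len(labels), test_size, seed)
        if not unseen_test_labels(labels, train_idx, test_idx):
            return train_idx, test_idx, seed
    raise ValueError("no split with full label coverage found")


# Hypothetical toy data: one list of labels per example (empty list = no label).
toy_labels = [
    ["insult"], ["insult", "racism"], [], ["obscene"],
    ["insult"], ["obscene", "racism"], [], ["insult"],
]

train_idx, test_idx, seed = split_with_label_coverage(toy_labels, test_size=2)
print(f"seed={seed} train={train_idx} test={test_idx}")
```

Redrawing the split only guarantees label coverage, not matching label proportions; a properly stratified multilabel split would need a dedicated algorithm such as iterative stratification.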

Contributor

@Ruqyai Ruqyai left a comment

LGTM

@x-tabdeveloping
Collaborator

As far as I know we stopped dataset submissions on the 15th.
I think we should focus our efforts on speeding up the benchmark and finalising everything before running the models.

@x-tabdeveloping
Collaborator

x-tabdeveloping commented May 21, 2024

I see you have some changes in Maltese news. Can you elaborate on what you did, and why? If it is relevant to the benchmark we should consider putting it in another PR.

@x-tabdeveloping
Collaborator

@Ruqyai The checklist is incomplete and the name of the PR is set to "work in progress". I believe it would be a tad irresponsible to merge this, no? Seems a bit too early and undercooked to just LGTM it to me.

@dokato
Collaborator Author

dokato commented May 21, 2024

Your PR for Multilabel Classification was merged only last week, which left just a couple of days to submit before the 15th. While I appreciate that we are focusing on model submissions, and I'm eager and hands-on with that, I don't think we should give up on an interesting dataset because of an arbitrary deadline, especially given that the Brazilian dialect of Portuguese is underrepresented. While working on it I spotted some minor mistakes in Maltese News Classification: a) the wrong task type and b) a missing import.

@x-tabdeveloping
Collaborator

I see your point! Can I ask you to move the changes related to the Maltese News to another PR so we can discuss them separately? (The rest seems quite reasonable.)
@KennethEnevoldsen what is your take on this? Should we still consider adding this, or stick to the deadline we set for ourselves?

@dokato dokato mentioned this pull request May 23, 2024
4 tasks
Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment

> @KennethEnevoldsen what is your take on this? Should we still consider adding this, or stick to the deadline we set for ourselves?

This looks fine to merge for me. @dokato, will you fill out the checklist?

@dokato dokato changed the title from "Multilabel Brazilian Toxic Tweets Classification [WIP]" to "Multilabel Brazilian Toxic Tweets Classification" May 24, 2024
@dokato dokato removed the WIP Work In Progress label May 24, 2024
@dokato dokato merged commit 5f0cd32 into embeddings-benchmark:main May 24, 2024
7 checks passed
@dokato dokato deleted the multi-pr branch September 2, 2024 06:51

4 participants