Multilabel Brazilian Toxic Tweets Classification #773
@x-tabdeveloping sorry to bother you again, but I need advice on how to handle this dataset. On HF it consists of a single train split with 21k examples.
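For a dataset that ships with only a train split, one common option is to carve out a held-out test portion with a fixed seed so the split is reproducible. The snippet below is only a minimal sketch in plain Python, not what this PR or the mteb package does; the function name `split_single_train`, the 20% test fraction, and the toy records are assumptions made for illustration.

```python
import random


def split_single_train(examples, test_fraction=0.2, seed=42):
    """Hypothetical helper: split a lone train split into (train, test).

    A fixed seed keeps the split identical across runs.
    """
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    n_test = int(len(examples) * test_fraction)
    # The first n_test shuffled indices form the test set, the rest stay in train.
    test = [examples[i] for i in indices[:n_test]]
    train = [examples[i] for i in indices[n_test:]]
    return train, test


# Toy stand-in for the real data (the actual dataset has about 21k examples).
toy = [{"text": f"tweet {i}", "labels": [i % 2]} for i in range(10)]
train, test = split_single_train(toy)
```

With the toy data, 8 records stay in `train` and 2 go to `test`; calling the function again with the same seed returns the same split.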
LGTM
As far as I know we stopped dataset submissions on the 15th.
I see you have some changes in Maltese news. Can you elaborate on what you did, and why? If it is relevant to the benchmark we should consider putting it in another PR.
@Ruqyai The checklist is incomplete and the name of the PR is set to "work in progress". I believe it would be a tad irresponsible to merge this, no? It seems a bit too early and undercooked to me to just LGTM it.
Your PR for Multilabel Classification was merged only last week, which left just a couple of days to submit before the 15th. While I appreciate that we focus on model submissions, and I'm eager and hands-on with that, I don't think we should give up on an interesting dataset because of an arbitrary deadline, especially given that the Brazilian dialect of Portuguese is underrepresented. While working on it I spotted some minor mistakes in Maltese News Classification: a) the wrong task type, b) a missing import.
I see your point! Can I ask you to move the changes related to the Maltese News to another PR so we can discuss them separately? (seems quite reasonable otherwise)
@KennethEnevoldsen what is your take on this? Should we still consider adding this, or stick to the deadline we set for ourselves?
This looks fine to merge for me. @dokato, will you fill out the checklist?
mteb/tasks/MultiLabelClassification/por/BrazilianToxicTweetsClassification.py
…assification.py Co-authored-by: Kenneth Enevoldsen <[email protected]>
Checklist for adding MMTEB dataset
Reason for dataset addition: there's a shortage of datasets in Brazilian Portuguese, and currently we don't have a big enough variety of multilabel datasets. This one is well structured and described: https://paperswithcode.com/dataset/told-br
Furthermore, I corrected some minor issues with Maltese News Classification.
mteb package
mteb -m {model_name} -t {task_name} command
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
intfloat/multilingual-e5-small
self.stratified_subsampling() under dataset_transform()
make test
make lint
438.jsonl
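The checklist mentions stratified subsampling for large datasets. The following is a rough, self-contained sketch of the general idea in plain Python; it is not the mteb `stratified_subsampling()` implementation. The function name `stratified_subsample`, the sample size, and the use of the label tuple as the stratification key are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def stratified_subsample(examples, n_samples, key, seed=42):
    """Hypothetical helper: draw about n_samples examples, keeping group proportions.

    `key` maps an example to a hashable group id (here, its label tuple).
    """
    total = len(examples)
    if n_samples >= total:
        return list(examples)
    rng = random.Random(seed)
    groups = defaultdict(list)
    for ex in examples:
        groups[key(ex)].append(ex)
    sample = []
    for group in groups.values():
        # Allocate each group a share proportional to its size (at least one item).
        k = max(1, round(len(group) * n_samples / total))
        sample.extend(rng.sample(group, min(k, len(group))))
    rng.shuffle(sample)
    # Rounding can overshoot slightly, so trim to the requested size.
    return sample[:n_samples]


# Toy data: 75 examples with label 0 and 25 with label 1.
toy_data = [{"text": f"t{i}", "labels": [0 if i < 75 else 1]} for i in range(100)]
subset = stratified_subsample(toy_data, n_samples=20, key=lambda ex: tuple(ex["labels"]))
```

On the toy data the 3:1 label ratio is preserved in the 20-example subset. For real multilabel data, rare label combinations may need special handling, which this sketch does not attempt.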