-
Notifications
You must be signed in to change notification settings - Fork 274
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added multilabel stratification to AbsTaskMultilabelClassification #760
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comments. Let's continue discussion on this, I think we are going in the right direction but this still needs some thought and work.
hi @dokato and @x-tabdeveloping: I would probably just do the split when creating the dataset on HF. This avoids the extra dependency. |
Yeah but then we need to reupload a lot of datasets, and I'm wondering if it's too much of a hustle. I am really in favour of not introducing new dependencies if possible. |
@x-tabdeveloping I believe that is the best solution. We just need to add a reference |
a8104b8
to
e89f14b
Compare
Thanks for advise @x-tabdeveloping @KennethEnevoldsen. That's what I did, I moved their function (with acknowledgments) and slightly modified so it returns indices |
Sorry for the delay @dokato I'm just mid exam-season, I'll have a look at it right now :D |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Beautiful! Looks absolutely good to me. Can I ask you to implement it for one or more tasks and see if it works? Thanks for your awesome work and patience
a8104b8
to
053cbaa
Compare
Thanks for review @x-tabdeveloping! I rerun Brazilian dataset as all other had small enough "test" splits that didn't require that much of stratification. Hence, I added additionally Here's dataset card for
|
I've changed "Stratified subsampling of test set to 2000 examples." bit to more commonly used in classification 2048 |
b55919f
to
6988b89
Compare
@x-tabdeveloping struggling to understand what might be going on here. I tried to rebase but it had too many merge conflicts, so I forced push it. But honestly, have no idea why those tests are failing, esp. that they related to some other datasets... |
Well, since this PR is making changes to quite fundamental things in the library, I would prefer if everything was passing before we went on, and all conflicts were resolved. I think if nothing fixes it probably the best thing to do is just reapply your changes to the current main in a new PR or a new branch, it seems like the most painless option to me. I don't have the slightest cue either what might have gone wrong here unfortunately. |
@dokato pulling from main should resolve a lot of the dataset issues. |
Good shoutout guys, I rebased again. I guess it's ready? |
I think it looks alright! Thanks for the work and patience @dokato :D |
No worries, pleasure! hope you're exams went fine ;) |
This is continuation of the discussion from #698.
cc @x-tabdeveloping