
Added multilabel stratification to AbsTaskMultilabelClassification #760

Merged
1 commit merged into embeddings-benchmark:main on Jun 6, 2024

Conversation

dokato
Collaborator

@dokato dokato commented May 17, 2024

This is continuation of the discussion from #698.

cc @x-tabdeveloping

@dokato dokato added the WIP Work In Progress label May 17, 2024
@dokato dokato mentioned this pull request May 17, 2024
Collaborator

@x-tabdeveloping x-tabdeveloping left a comment


See my comments. Let's continue the discussion on this; I think we are going in the right direction, but it still needs some thought and work.

Review threads (outdated, resolved):
  • mteb/abstasks/AbsTaskMultilabelClassification.py (2 threads)
  • pyproject.toml
@KennethEnevoldsen
Contributor

Hi @dokato and @x-tabdeveloping: I would probably just do the split when creating the dataset on HF. This avoids the extra dependency.

@x-tabdeveloping
Collaborator

Yeah, but then we need to re-upload a lot of datasets, and I'm wondering if it's too much of a hassle. I am really in favour of not introducing new dependencies if possible.
I'm wondering if it would make sense to just copy the file over from scikit-multilearn, simplify it a bit, and call it a day.
It seems pretty self-contained to me at least. What do you think @dokato @KennethEnevoldsen ?

@KennethEnevoldsen
Contributor

@x-tabdeveloping I believe that is the best solution. We just need to add a reference.

@dokato
Collaborator Author

dokato commented May 24, 2024

Thanks for the advice @x-tabdeveloping @KennethEnevoldsen. That's what I did: I moved their function over (with acknowledgements) and slightly modified it so that it returns indices, as _iterative_train_test_split.
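For readers following along, the ported helper is based on the iterative stratification idea from scikit-multilearn (Sechidis et al., 2011): process labels rarest-first and greedily assign each sample to the split that still needs that label most. The sketch below is a minimal, self-contained illustration of that idea returning index arrays; the function name and details are illustrative, not the exact code merged in this PR.

```python
import numpy as np

def iterative_train_test_split(y, test_size=0.2, seed=0):
    """Greedy multilabel stratification sketch: return (train_idx, test_idx)
    for a binary label matrix y of shape (n_samples, n_labels)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=int)
    n = len(y)
    # desired number of positive examples per label in each split
    need = np.stack([(1 - test_size) * y.sum(axis=0),
                     test_size * y.sum(axis=0)]).astype(float)
    # desired overall size of each split (train, test)
    budget = np.array([(1 - test_size) * n, test_size * n])
    assigned = np.full(n, -1)
    while (assigned == -1).any():
        unassigned = np.flatnonzero(assigned == -1)
        pos = y[unassigned].sum(axis=0).astype(float)
        pos[pos == 0] = np.inf
        if np.isinf(pos).all():
            # leftovers with no remaining labels: fill by split budget
            for i in unassigned:
                s = int(np.argmax(budget))
                assigned[i] = s
                budget[s] -= 1
            break
        label = int(np.argmin(pos))  # handle the rarest label first
        for i in rng.permutation(unassigned[y[unassigned, label] == 1]):
            s = int(np.argmax(need[:, label]))  # split that needs it most
            if need[0, label] == need[1, label]:
                s = int(np.argmax(budget))      # tie-break on split size
            assigned[i] = s
            need[s] -= y[i]
            budget[s] -= 1
    return np.flatnonzero(assigned == 0), np.flatnonzero(assigned == 1)
```

Returning indices (rather than split data) is what lets the abstract task class slice any dataset representation without copying it.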

@dokato dokato changed the title [WIP] Added multilabel stratification to AbsTaskMultilabelClassification Added multilabel stratification to AbsTaskMultilabelClassification May 28, 2024
@x-tabdeveloping x-tabdeveloping self-requested a review May 30, 2024 08:30
@x-tabdeveloping
Collaborator

Sorry for the delay @dokato, I'm just mid exam season; I'll have a look at it right now :D

Collaborator

@x-tabdeveloping x-tabdeveloping left a comment


Beautiful! Looks absolutely good to me. Can I ask you to implement it for one or more tasks to see if it works? Thanks for your awesome work and patience!

@dokato
Collaborator Author

dokato commented May 31, 2024

Thanks for the review @x-tabdeveloping! I reran the Brazilian dataset, as all the others had "test" splits small enough not to need much stratification. I also added KorHateSpeechMLClassification (https://huggingface.co/datasets/jeanlee/kmhas_korean_hate_speech) to improve the representation of the MultiLabel Classification task.

Here's dataset card for KorHateSpeechMLClassification.

  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the pr). These can be run using the mteb -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform().
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).

@dokato
Collaborator Author

dokato commented May 31, 2024

I've changed the "Stratified subsampling of test set to 2000 examples." bit in AbsTaskMultilabelClassification.py to 2048, which is more commonly used in classification. But now, given that we have proper stratification, I wonder if we need that bit at all, or whether we should just assume that the sampling is done in the dataset_transform method, as in other tasks. Maybe a warning if the split is bigger than 2048 would suffice?
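The warning-only alternative floated here could look something like the sketch below. This is purely illustrative of the idea (warn instead of silently subsampling); the function name, constant, and message are assumptions, not code from this PR.

```python
import warnings

MAX_SPLIT_SIZE = 2048  # cap commonly used across mteb classification tasks


def warn_if_split_too_big(dataset, split="test", limit=MAX_SPLIT_SIZE):
    """Emit a warning for oversized splits instead of subsampling them,
    leaving the subsampling decision to each task's dataset_transform()."""
    n = len(dataset[split])
    if n > limit:
        warnings.warn(
            f"Split '{split}' has {n} examples (> {limit}); consider "
            f"stratified subsampling in dataset_transform()."
        )
    return dataset
```

The trade-off versus automatic subsampling is that a warning keeps behaviour explicit per task, at the cost of slower evaluation when a task author ignores it.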

@dokato dokato removed the WIP Work In Progress label May 31, 2024
@dokato dokato force-pushed the mulistart branch 3 times, most recently from b55919f to 6988b89 Compare June 4, 2024 12:24
@dokato
Collaborator Author

dokato commented Jun 5, 2024

@x-tabdeveloping I'm struggling to understand what might be going on here. I tried to rebase, but it had too many merge conflicts, so I force-pushed. Honestly, I have no idea why those tests are failing, especially since they relate to some other datasets...

@x-tabdeveloping
Collaborator

Well, since this PR changes quite fundamental things in the library, I would prefer everything to pass and all conflicts to be resolved before we go on. If nothing fixes it, the best thing is probably to reapply your changes onto the current main in a new PR or a new branch; that seems like the most painless option to me. Unfortunately, I don't have the slightest clue what might have gone wrong here either.

@KennethEnevoldsen
Contributor

@dokato pulling from main should resolve a lot of the dataset issues.

@dokato
Collaborator Author

dokato commented Jun 5, 2024

Good shout, guys, I rebased again. I guess it's ready?

@x-tabdeveloping
Collaborator

I think it looks alright! Thanks for the work and patience @dokato :D

@x-tabdeveloping x-tabdeveloping merged commit d7dc9a8 into embeddings-benchmark:main Jun 6, 2024
7 checks passed
@dokato
Collaborator Author

dokato commented Jun 6, 2024

No worries, it was a pleasure! Hope your exams went fine ;)
