Allow extending the load_dataset parameters in custom tasks inheriting AbsTask #299
This PR might also relate to the discussion in #301 about creating two dataset sources: the mteb org and the original dataset.
I'll address the comments some time tomorrow!
This seems like a very pragmatic change, with an extremely clean PR description. Nice job! Moving the
As a side-note: I've been looking into code-modification programming for cases like this, something like gritql. If you have any experience here, I'd love to hear about it in a discussion.
Very interesting, this is much fancier than the Vim macros I ended up using 🤣 We've addressed all comments and we should be ready to go! We've also added additional testing for the
cc. @gbmarc1
Perfect @gariepyalex, I have set the tests to run; assuming they pass, it will be merged. Thanks again for the contribution. If you do want to participate in the upcoming MMTEB, feel free to add your names and 2 points to the points sheet (if not, feel free to ignore this part).
…eriting AbsTask (#299) * Allow extending the load_dataset parameters * format * Fix test * remove duplicated logic from AbsTask, now handled in the metadata * add tests * remove comments, moved to PR * format * extend metadata dict from super class * Remove additional load_data * test: adding very high level test * Remove hf_hub_name and add test * Fix revision in output file --------- Co-authored-by: gbmarc1 <[email protected]>
Currently, it is only possible to pass a `path` and a `revision` to `load_dataset`. This is fairly limiting and forces users to implement their own `AbsTask.load_data` function, which relies on the internals of the library.
This PR allows specifying any parameter supported by `datasets.load_dataset` in custom tasks. Currently, the metadata specifies the dataset through dedicated keys. This is hard to extend, as we would need to add a new key to that Pydantic object for each `load_dataset` parameter we want to support.
This PR proposes instead to group these parameters under a single `dataset` dictionary in the metadata.
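As a rough sketch of the two shapes, the old flat form and the proposed nested form might look like the following. The concrete values are made up, the `hf_hub_name` key is inferred from the commit log rather than copied from the diff, and `fake_load_dataset` is a hypothetical stand-in for `datasets.load_dataset` so the snippet runs without network access.

```python
# Old style (assumed): dataset parameters live as individual metadata keys.
old_metadata = {
    "name": "MyCustomTask",              # hypothetical task name
    "hf_hub_name": "my-org/my-dataset",  # hypothetical dataset path
    "revision": "abc123",                # hypothetical revision
}

# Proposed style: everything destined for load_dataset sits under "dataset".
new_metadata = {
    "name": "MyCustomTask",
    "dataset": {
        "path": "my-org/my-dataset",
        "revision": "abc123",
        "name": "en",  # dataset config name, one of the extra keys now possible
    },
}


def fake_load_dataset(path, **kwargs):
    """Hypothetical stand-in for datasets.load_dataset; it only echoes its arguments."""
    return {"path": path, **kwargs}


# All key/value pairs of the "dataset" dictionary are forwarded unchanged.
loaded = fake_load_dataset(**new_metadata["dataset"])
```

With the nested form, supporting another `load_dataset` parameter only means adding a key to the dictionary; the metadata schema itself does not need to change.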
This allows users to add any key they want (dataset config name, token, etc.). All key/value pairs are passed to `load_dataset`.
Note that this is done in a backward-compatible manner: a Pydantic validator still accepts the old parameters and uses them to populate the `dataset` dictionary.
Migration
We've included in the PR a migration of all the built-in tasks to avoid logging a deprecation warning when running them. This is what causes most of the line changes.
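To illustrate the backward-compatible path and the deprecation warning mentioned above, here is a minimal sketch in plain Python. It is not the Pydantic validator from the PR: `populate_dataset` is a hypothetical helper, and the legacy key name `hf_hub_name` is an assumption based on the commit log.

```python
import warnings


def populate_dataset(metadata: dict) -> dict:
    """Return a copy of `metadata` with legacy keys folded into "dataset".

    Hypothetical helper standing in for the validator; assumes the legacy
    keys are "hf_hub_name" and "revision".
    """
    data = dict(metadata)
    if "dataset" in data:
        # Already in the new format: nothing to do, no warning.
        return data
    if "hf_hub_name" in data:
        warnings.warn(
            "hf_hub_name/revision are deprecated; use the 'dataset' dictionary instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        data["dataset"] = {
            "path": data.pop("hf_hub_name"),
            "revision": data.pop("revision", None),
        }
    return data


legacy = {"name": "MyCustomTask", "hf_hub_name": "my-org/my-dataset", "revision": "abc123"}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    migrated = populate_dataset(legacy)
```

Tasks already written in the new format pass through untouched and trigger no warning, which is why migrating the built-in tasks keeps their logs clean.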
Refactoring
While migrating the tasks' `metadata_dict` to use `dataset`, we noticed that `AbsTask.load_data` was frequently overridden for the sole purpose of inserting a `dataset_transform` call. This pattern was common enough to cause a lot of duplication. We propose here to call `dataset_transform` in `AbsTask`, where it defaults to a no-op.
Bug fixes
In a few places, the revision of the dataset was not passed to `load_dataset`, leading to a discrepancy between the revision in the metadata and the data that was actually loaded. These were fixed, and I indicated the location of these issues in the PR.
Testing
I see that the repo contains test suites for all abstract tasks. Please let me know if any additional testing is required from our end.
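For reference, a high-level check of the refactoring and the revision fix described above could look roughly like this. `SketchTask` is a hypothetical stand-in for `AbsTask`, not the library's class, and the loader is injected as a stub so that no dataset is actually downloaded.

```python
class SketchTask:
    """Hypothetical stand-in for AbsTask; not the library's class."""

    metadata_dict = {"dataset": {"path": "my-org/my-dataset", "revision": "abc123"}}

    def __init__(self, loader):
        self.loader = loader  # injected so no network access is needed
        self.dataset = None
        self.data_loaded = False

    def load_data(self):
        if self.data_loaded:
            return
        # Forward every key of the "dataset" dictionary, revision included.
        self.dataset = self.loader(**self.metadata_dict["dataset"])
        self.dataset_transform()
        self.data_loaded = True

    def dataset_transform(self):
        """Default hook: do nothing."""


class UppercaseTask(SketchTask):
    def dataset_transform(self):
        # Only the transform is customised; load_data is inherited as-is.
        self.dataset["rows"] = [row.upper() for row in self.dataset["rows"]]


calls = []


def stub_loader(**kwargs):
    """Record the arguments and return a tiny fake dataset."""
    calls.append(kwargs)
    return {"rows": ["a", "b"]}


plain = SketchTask(stub_loader)
plain.load_data()
custom = UppercaseTask(stub_loader)
custom.load_data()
```

Inspecting `calls` shows whether the revision from the metadata reached the loader, and comparing `plain.dataset` with `custom.dataset` shows that the default hook leaves the data unchanged while an override is applied.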
cc. @gbmarc1