Solution for transforming retrieval datasets into parquet #1090
Comments
It is solved atm by setting |
One dataset where we see this is "GermanDPR", which requires trust_remote_code=True. We would instead want it formatted like so |
I'm working on functions to reupload datasets in our format. Soon I'll add them in #1362 |
I believe we had this trust_remote_code issue a while ago when we wanted to turn files into parquet, and retrieval datasets weren't compatible. Just confirmed with @KennethEnevoldsen that this hasn't been solved.
I happened to find a solution here, where they turn corpus, queries and qrels separately into parquets. We can then call load_dataset(dataset_name, "qrels"), load_dataset(dataset_name, "query"), and load_dataset(dataset_name, "corpus").
I had a go at implementing i2t retrieval using this format here. Works smoothly. Will follow this solution when creating more image-text retrieval ones, and maybe for the main branch we can deal with it the same way!
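A minimal sketch of how the three configs could be consumed, assuming the dataset repo exposes "corpus", "query" and "qrels" configs as described above. The helper names (load_retrieval_parts, build_relevance) and the qrels column names ("query-id", "corpus-id", "score") are assumptions for illustration and are not taken from the linked solution; adjust them to the actual schema.

```python
from collections import defaultdict


def load_retrieval_parts(dataset_name):
    """Load the three configs separately (hypothetical helper).

    Requires the `datasets` package and network access, so it is
    defined here but not called in this sketch.
    """
    from datasets import load_dataset  # imported lazily on purpose

    return {
        "corpus": load_dataset(dataset_name, "corpus"),
        "query": load_dataset(dataset_name, "query"),
        "qrels": load_dataset(dataset_name, "qrels"),
    }


def build_relevance(qrels_rows):
    """Group qrels rows into {query_id: {corpus_id: score}}.

    Column names are assumed, not confirmed by the issue.
    """
    relevant = defaultdict(dict)
    for row in qrels_rows:
        relevant[row["query-id"]][row["corpus-id"]] = row["score"]
    return dict(relevant)


# In-memory stand-in for rows of the "qrels" config.
example_qrels = [
    {"query-id": "q1", "corpus-id": "d1", "score": 1},
    {"query-id": "q1", "corpus-id": "d7", "score": 1},
    {"query-id": "q2", "corpus-id": "d3", "score": 1},
]
relevance = build_relevance(example_qrels)
```

Keeping corpus, queries and qrels as separate configs means each one is a plain table that converts to parquet on its own, so no custom loading script (and hence no trust_remote_code) should be needed.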