Solution for transforming retrieval datasets into parquet #1090

Open
gowitheflow-1998 opened this issue Jul 15, 2024 · 3 comments

Comments

@gowitheflow-1998
Contributor

gowitheflow-1998 commented Jul 15, 2024

I believe we had this trust_remote_code issue a while ago when we wanted to turn files into parquet, and retrieval datasets weren't compatible. Just confirmed with @KennethEnevoldsen this hasn't been solved.

I happened to find a solution here, where corpus, queries and qrels are each converted to parquet separately. Each part can then be loaded as its own config: load_dataset(dataset_name, "qrels"), load_dataset(dataset_name, "query"), load_dataset(dataset_name, "corpus").
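A minimal sketch of the splitting step, assuming BEIR-style in-memory dicts as input. The function names, the column names (`_id`, `query-id`, `corpus-id`, `score`) and the one-folder-per-part layout are illustrative assumptions, not taken from the linked solution:

```python
from pathlib import Path


def split_retrieval_records(corpus, queries, qrels):
    """Flatten corpus, queries and qrels into three separate row lists.

    Assumed input shapes (hypothetical, BEIR-like):
      corpus:  {doc_id: {"title": str (optional), "text": str}}
      queries: {query_id: str}
      qrels:   {query_id: {doc_id: relevance_score}}
    """
    corpus_rows = [
        {"_id": doc_id, "title": doc.get("title", ""), "text": doc["text"]}
        for doc_id, doc in corpus.items()
    ]
    query_rows = [{"_id": qid, "text": text} for qid, text in queries.items()]
    qrels_rows = [
        {"query-id": qid, "corpus-id": doc_id, "score": int(score)}
        for qid, judged in qrels.items()
        for doc_id, score in judged.items()
    ]
    # Keys mirror the config names used in the load_dataset calls above.
    return {"corpus": corpus_rows, "query": query_rows, "qrels": qrels_rows}


def write_parquet_parts(rows_by_part, out_dir, split="test"):
    """Write each part to <out_dir>/<part>/<split>.parquet (assumed layout)."""
    from datasets import Dataset  # imported lazily; only needed for writing

    written = {}
    for part, rows in rows_by_part.items():
        target = Path(out_dir) / part / f"{split}.parquet"
        target.parent.mkdir(parents=True, exist_ok=True)
        Dataset.from_list(rows).to_parquet(str(target))
        written[part] = target
    return written
```

Each written file can be read back locally with `load_dataset("parquet", data_files=str(path))`. Exposing the three parts under config names on the Hub, as in the calls above, is a separate step that this sketch does not cover.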

I had a go at implementing i2t (image-to-text) retrieval using this format here, and it works smoothly. I will follow this approach when creating more image-text retrieval datasets, and maybe we can handle the main branch the same way!

@KennethEnevoldsen
Contributor

> I believe we had this trust_remote_code issue a while ago when we wanted to turn files into parquet, and retrieval datasets weren't compatible. Just confirmed with @KennethEnevoldsen this hasn't been solved.

It is currently solved by setting trust_remote_code=True where required, but future datasets should not use this (tests will fail). It would be great if someone fixed the older datasets as well, but that is not strictly required.
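For illustration only, a check of this kind could look roughly like the sketch below. It is not the project's actual test: the function name, the task names and the dataset paths are all hypothetical, and it assumes each task keeps its dataset loading arguments in a plain dict.

```python
def find_remote_code_users(task_datasets):
    """Return the names of tasks whose loading kwargs enable trust_remote_code.

    task_datasets maps a task name to the keyword arguments that would be
    passed to load_dataset for that task (an assumed structure).
    """
    return sorted(
        name
        for name, kwargs in task_datasets.items()
        if kwargs.get("trust_remote_code", False)
    )


# Hypothetical task registry: one legacy task still needs remote code,
# one newer task loads plain parquet files and does not.
EXAMPLE_TASKS = {
    "OldRetrievalTask": {
        "path": "example-org/old-retrieval",
        "trust_remote_code": True,
    },
    "NewRetrievalTask": {
        "path": "example-org/new-retrieval",
    },
}

offenders = find_remote_code_users(EXAMPLE_TASKS)
```

A test for new datasets could then assert that the task being added is not in `offenders`, while the older tasks stay on an explicit allow-list until they are reuploaded.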

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Nov 27, 2024

One dataset where we see this is "GermanDPR", which uses trust_remote_code=True. We would instead want it formatted like so

@Samoed
Collaborator

Samoed commented Nov 27, 2024

I'm working on functions to reupload datasets in our format. I'll add them soon in #1362.
