Share on Hugging Face ? #1

Open
lhoestq opened this issue Oct 2, 2023 · 8 comments

@lhoestq

lhoestq commented Oct 2, 2023

Hi ! I’m Quentin from HF :)

Thanks for sharing the dataset, I believe it will be used a lot to evaluate LLMs! Especially since factual correctness and attributions are imo at the heart of many challenges nowadays.

I was wondering if you planned to share the dataset on Hugging Face? This way researchers can load it in one line of Python, and there is also a nice dataset viewer on the website to visualize the data.

@chaitanyamalaviya
Owner

Hi Quentin, that's a good idea! We are on it, and will let you know once we've done this.

@lhoestq
Author

lhoestq commented Oct 2, 2023

Cool ! Let me know if you have questions or if I can help

@chaitanyamalaviya
Owner

Hi Quentin, I uploaded our dataset here and modified the yaml to display the different configs as described here. I was trying to show three different configs for the main data, the lfqa_random data and the lfqa_domain data. But the dataset viewer seems to not show these configs and their corresponding splits this way. Any chance you know what I could be missing? Thanks a lot!

@lhoestq
Author

lhoestq commented Oct 3, 2023

I just opened a PR to fix a small issue with the YAML :)
https://huggingface.co/datasets/cmalaviya/expertqa/discussions/1

@chaitanyamalaviya
Owner

Thanks, looks good now!! It would be nice if the main subset could also be previewed, I currently see an Error code: UnexpectedError. Let me know if I need to fix something.

@lhoestq
Author

lhoestq commented Oct 3, 2023

I'm getting this error somehow:

pyarrow.lib.ArrowInvalid: JSON parse error: Column(/answers/post_hoc_gs_gpt4/claims/[]/revised_evidence) changed from string to array in row 0

It looks like a field is sometimes a string and sometimes an array in the JSON data. However, the dataset viewer only supports a fixed type per field. Is this an error in the data file, or is it expected?
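
For anyone hitting the same error, here is a minimal sketch of how such mixed-type fields could be located before uploading. It only uses the standard library; the helper names `iter_types` and `find_mixed_types` and the sample records are hypothetical, and the nesting is inferred from the column path in the error message rather than from the actual data file.

```python
from collections import defaultdict


def iter_types(value, path=""):
    """Yield (path, type_name) for every non-null node, using [] for list items."""
    if value is None:
        return  # nulls are skipped here; they do not count as a type conflict
    if isinstance(value, dict):
        yield path, "object"
        for key, child in value.items():
            yield from iter_types(child, f"{path}/{key}")
    elif isinstance(value, list):
        yield path, "array"
        for child in value:
            yield from iter_types(child, f"{path}/[]")
    else:
        # Note: int and float are reported as different types by this check.
        yield path, type(value).__name__


def find_mixed_types(records):
    """Return {path: sorted type names} for paths seen with more than one type."""
    seen = defaultdict(set)
    for record in records:
        for path, type_name in iter_types(record):
            seen[path].add(type_name)
    return {path: sorted(types) for path, types in seen.items() if len(types) > 1}


# Hypothetical records that mimic the reported situation.
records = [
    {"answers": {"post_hoc_gs_gpt4": {"claims": [{"revised_evidence": "some text"}]}}},
    {"answers": {"post_hoc_gs_gpt4": {"claims": [{"revised_evidence": []}]}}},
]
mixed = find_mixed_types(records)
print(mixed)
```

With real data, `records` would come from parsing the JSON file (for example with `json.load`), and any path in the result is a candidate for the kind of error shown above.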

@chaitanyamalaviya
Owner

Ah, that's because when the revised_evidence field is empty, it was stored as an empty list, whereas it is otherwise always a string.
I fixed this in an updated file, but there is still an UnexpectedError. Let me know if the error is something different. Also, I wonder if I can test with the parquet converter myself. Thanks in any case!
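
The fix described here (storing an empty string instead of an empty list) could look roughly like the sketch below. This is not the script that was actually used: the function name `normalize_revised_evidence` is hypothetical, and the layout (an `answers` mapping whose values hold a `claims` list) is an assumption based on the column path in the error message.

```python
def normalize_revised_evidence(record):
    """Replace an empty-list revised_evidence with an empty string, in place."""
    # Assumed layout: record["answers"][<system name>]["claims"][i]["revised_evidence"]
    for answer in (record.get("answers") or {}).values():
        if not isinstance(answer, dict):
            continue
        for claim in answer.get("claims") or []:
            if isinstance(claim, dict) and claim.get("revised_evidence") == []:
                claim["revised_evidence"] = ""
    return record


# Hypothetical record with one empty and one populated evidence field.
example = {
    "answers": {
        "post_hoc_gs_gpt4": {
            "claims": [
                {"revised_evidence": []},
                {"revised_evidence": "kept as is"},
            ]
        }
    }
}
normalize_revised_evidence(example)
print(example["answers"]["post_hoc_gs_gpt4"]["claims"])
```

Only empty lists are rewritten; existing strings and missing fields are left untouched, so the field ends up with a single type wherever it is present.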

@lhoestq
Author

lhoestq commented Oct 9, 2023

It seems that some examples have the gpt4 field but others don't.
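
A quick way to confirm which examples lack a field might be the following sketch. The helper `missing_key_report` is hypothetical, and it assumes the gpt4 field sits under an `answers` mapping, which the thread does not state explicitly.

```python
def missing_key_report(records, parent_key="answers"):
    """Map record index to the sorted keys it lacks relative to all other records."""
    # Collect the union of keys seen under parent_key across every record.
    all_keys = set()
    for record in records:
        all_keys |= set(record.get(parent_key) or {})

    # Report, per record, which of those keys are absent.
    report = {}
    for index, record in enumerate(records):
        absent = all_keys - set(record.get(parent_key) or {})
        if absent:
            report[index] = sorted(absent)
    return report


# Hypothetical records: the second one has no "gpt4" entry.
records = [
    {"answers": {"gpt4": {}, "post_hoc_gs_gpt4": {}}},
    {"answers": {"post_hoc_gs_gpt4": {}}},
]
report = missing_key_report(records)
print(report)
```

An empty report would mean every example carries the same set of keys under `answers`.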

2 participants