DatasetDict containing Datasets with different features when pushed to hub gets remapped features #4211
Comments
Hi @pietrolesci, thanks for reporting. Please note that this is a design purpose: a To handle sub-datasets with different features, we use another approach: use different configurations instead of splits. However, for the moment
Hi @albertvillanova, Thanks a lot for your reply! I got it now. The strange thing for me was to have it correctly working (i.e., DatasetDict with different features in some datasets) locally and not on the Hub. It would be great to have configuration supported by Again, thanks for your time @albertvillanova! Best,
Hi! Yes, we should override @albertvillanova @lhoestq wdyt?
I would keep things simple and keep subclassing dict. Regarding the features check, I guess this can be done only for Note that later you will be able to push datasets with different features as different dataset configurations (similarly to the GLUE subsets for example). We will work on this soon
Hi @lhoestq, Returning to this thread to ask whether the possibility to create Best,
DatasetDict is likely to always require the datasets to have the same columns and types, while different configurations may have different columns and types. Why would you like to see that?
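To make the "same columns and types" constraint concrete, here is a minimal sketch in plain Python. It is not the `datasets` implementation: the schema representation (a mapping from column name to a type name), the helper names `group_splits_by_schema` and `check_same_schema`, and the example split names are all hypothetical. It also assumes that column order does not matter when comparing schemas.

```python
from collections import defaultdict
from typing import Dict, List, Mapping, Tuple

# Hypothetical schema representation: column name -> type name.
Schema = Mapping[str, str]


def group_splits_by_schema(splits: Mapping[str, Schema]) -> List[List[str]]:
    """Group split names whose schemas (columns and types) are identical."""
    groups: Dict[Tuple[Tuple[str, str], ...], List[str]] = defaultdict(list)
    for name, schema in splits.items():
        # Sort the items so that column order does not affect the comparison
        # (an assumption of this sketch).
        key = tuple(sorted(schema.items()))
        groups[key].append(name)
    return list(groups.values())


def check_same_schema(splits: Mapping[str, Schema]) -> None:
    """Raise ValueError if the splits do not all share one schema."""
    groups = group_splits_by_schema(splits)
    if len(groups) > 1:
        raise ValueError(f"Splits have different schemas: {groups}")


# Hypothetical example: two splits share a schema, one does not.
splits = {
    "train": {"text": "string", "label": "int64"},
    "test": {"text": "string", "label": "int64"},
    "unlabeled": {"text": "string"},
}
groups = group_splits_by_schema(splits)
```

Each group returned by `group_splits_by_schema` could, under this reading of the thread, correspond to one configuration, while `check_same_schema` is the kind of check that would reject mixing them as splits of a single configuration.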
Hi @lhoestq and @pietrolesci, I have been curious about this question as well. I don't have experience working with different configurations, but I can give a bit more detail on the workflow that I have been using with As @pietrolesci mentions, I have been using
I am curious why Best wishes,
+1
There's a possibility we may merge the Dataset and DatasetDict classes. The purpose of DatasetDict was to define a way to get the train/test splits of a dataset. See the discussions at #5189
There's a PR open to allow uploading a dataset with a certain configuration name. Then later you can reload this specific configuration using See the PR at #5213
Hi, regarding the following information:
Although this is often implied (such as how else would
I think this addition would benefit the docs, especially when you guide a newbie (such as me) through the process of creating a dataset. As I said, you somehow suspect that this is in fact the case, but without reading it in the docs you cannot be sure.
Hi there,
I am trying to load a dataset to the Hub. This dataset is a `DatasetDict` composed of various splits. Some splits have a different `Feature` mapping. Locally, the DatasetDict preserves the individual features but if I `push_to_hub` and then `load_dataset`, the features are all the same.
Dataset and code to reproduce available here.
In short:
I have 3 feature mappings
Then I create different datasets
I then push to hub
Finally, I load it from the hub
And I get that
since
gets remapped to