DatasetDict containing Datasets with different features when pushed to hub gets remapped features #4211

Closed
pietrolesci opened this issue Apr 25, 2022 · 10 comments · Fixed by #4372
Labels: bug

pietrolesci commented Apr 25, 2022

Hi there,

I am trying to load a dataset to the Hub. This dataset is a DatasetDict composed of various splits. Some splits have a different Feature mapping. Locally, the DatasetDict preserves the individual features but if I push_to_hub and then load_dataset, the features are all the same.

Dataset and code to reproduce available here.

In short:

I have 3 feature mappings:

from datasets import ClassLabel, Dataset, DatasetDict, Features, Value, load_dataset

Tri_features = Features(
    {
        "idx": Value(dtype="int64"),
        "premise": Value(dtype="string"),
        "hypothesis": Value(dtype="string"),
        "label": ClassLabel(num_classes=3, names=["entailment", "neutral", "contradiction"]),
    }
)

Ent_features = Features(
    {
        "idx": Value(dtype="int64"),
        "premise": Value(dtype="string"),
        "hypothesis": Value(dtype="string"),
        "label": ClassLabel(num_classes=2, names=["non-entailment", "entailment"]),
    }
)

Con_features = Features(
    {
        "idx": Value(dtype="int64"),
        "premise": Value(dtype="string"),
        "hypothesis": Value(dtype="string"),
        "label": ClassLabel(num_classes=2, names=["non-contradiction", "contradiction"]),
    }
)

Then I create the different datasets (Tri_dataset, Ent_bin_dataset, and Con_bin_dataset are collections of split names defined in the linked code):

dataset_splits = {}

for split in df["split"].unique():
    print(split)
    df_split = df.loc[df["split"] == split].copy()
    
    if split in Tri_dataset:
        df_split["label"] = df_split["label"].map({"entailment": 0, "neutral": 1, "contradiction": 2})
        ds = Dataset.from_pandas(df_split, features=Tri_features)
    
    elif split in Ent_bin_dataset:
        df_split["label"] = df_split["label"].map({"non-entailment": 0, "entailment": 1})
        ds = Dataset.from_pandas(df_split, features=Ent_features)
    
    elif split in Con_bin_dataset:
        df_split["label"] = df_split["label"].map({"non-contradiction": 0, "contradiction": 1})
        ds = Dataset.from_pandas(df_split, features=Con_features)

    else:
        print("ERROR:", split)
        continue  # skip unknown splits instead of silently reusing the previous ds

    dataset_splits[split] = ds
datasets = DatasetDict(dataset_splits)

I then push to hub

datasets.push_to_hub("pietrolesci/robust_nli", token="<token>")

Finally, I load it from the hub

datasets_loaded_from_hub = load_dataset("pietrolesci/robust_nli")

And I get that

datasets["LI_TS"].features != datasets_loaded_from_hub["LI_TS"].features

since

"label": ClassLabel(num_classes=2, names=["non-contradiction", "contradiction"])

gets remapped to

 "label": ClassLabel(num_classes=3, names=["entailment", "neutral", "contradiction"])
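This remapping is consistent with the raw integer codes surviving the round-trip while being decoded against the wrong ClassLabel name list. A minimal pure-Python illustration of that assumed mechanic (no datasets dependency; the lists mirror the features above):

```python
# Hypothetical illustration of the remap: the stored integer codes are
# unchanged, but they end up decoded against the other split's label names.
con_names = ["non-contradiction", "contradiction"]       # 2-class schema
tri_names = ["entailment", "neutral", "contradiction"]   # 3-class schema

codes = [0, 1, 1, 0]  # labels as encoded with con_names

as_intended = [con_names[c] for c in codes]
as_remapped = [tri_names[c] for c in codes]  # what the reloaded schema implies

print(as_intended)  # ['non-contradiction', 'contradiction', 'contradiction', 'non-contradiction']
print(as_remapped)  # ['entailment', 'neutral', 'neutral', 'entailment']
```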
pietrolesci added the bug label on Apr 25, 2022
albertvillanova (Member) commented Apr 26, 2022

Hi @pietrolesci, thanks for reporting.

Please note that this is by design: a DatasetDict has the same features for all its datasets. Normally, a DatasetDict is composed of several sub-datasets, each corresponding to a different split.

To handle sub-datasets with different features, we use another approach: use different configurations instead of splits.

However, for the moment push_to_hub does not support specifying different configurations. IMHO, we should implement this.

pietrolesci (Author) commented:

Hi @albertvillanova,

Thanks a lot for your reply! I get it now. What surprised me was that it worked correctly locally (i.e., a DatasetDict whose datasets have different features) but not on the Hub. It would be great to have configurations supported by push_to_hub: that function has let me iterate rather quickly on dataset curation.

Again, thanks for your time @albertvillanova!

Best,
Pietro

mariosasko (Collaborator) commented:
Hi! Yes, we should override DatasetDict.__setitem__ and throw an error if features dictionaries are different. DatasetDict is a subclass of dict, so DatasetDict.{update/setdefault} need to be overridden as well. We could avoid this by subclassing UserDict, but then we would get the name collision - DatasetDict.data vs. UserDict.data. This makes me think we should rename the data attribute of DatasetDict/Dataset for easier dict subclassing (would also simplify #3997) and to follow good Python practices. Another option is to have a custom UserDict class in py_utils, but it can be hard to keep this class consistent with the built-in UserDict.

@albertvillanova @lhoestq wdyt?
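The override proposed above could look roughly like this. A minimal pure-Python sketch (the class name is illustrative; a real implementation would live in datasets, compare actual Features objects, and cover setdefault/update as noted):

```python
class CheckedDatasetDict(dict):
    """Sketch: a dict subclass that rejects entries whose features differ.

    Any value exposing a comparable `.features` attribute (e.g. a
    datasets.Dataset) works here.
    """

    def __setitem__(self, key, dataset):
        for existing in self.values():
            if existing.features != dataset.features:
                raise ValueError(
                    "All datasets in `DatasetDict` should have the same features"
                )
        super().__setitem__(key, dataset)

    def update(self, *args, **kwargs):
        # Route dict.update through __setitem__ so the check also applies here.
        for key, value in dict(*args, **kwargs).items():
            self[key] = value

    def setdefault(self, key, default=None):
        if key not in self:
            self[key] = default
        return self[key]
```

This is the cost of subclassing dict directly: every mutating method that can add an entry must be overridden by hand, which is exactly the trade-off against UserDict discussed above.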

lhoestq (Member) commented Apr 26, 2022

I would keep things simple and keep subclassing dict. Regarding the features check, I guess this can be done only for push_to_hub, right? It is the only function right now that requires the underlying datasets to be splits (e.g. train/test) and have the same features.

Note that later you will be able to push datasets with different features as different dataset configurations (similarly to the GLUE subsets, for example). We will work on this soon.

pietrolesci (Author) commented:
Hi @lhoestq,

Returning to this thread to ask whether creating a DatasetDict with different configurations will be supported in the future.

Best,
Pietro

lhoestq (Member) commented Nov 12, 2022

DatasetDict is likely to always require the datasets to have the same columns and types, while different configurations may have different columns and types.

Why would you like to see that? If it's related to push_to_hub, we plan to allow pushing several configs, but not via DatasetDict.

jonathangomesselman commented:
Hi @lhoestq and @pietrolesci,

I have been curious about this question as well. I don't have experience working with different configurations, but I can give a bit more detail on the workflow that I have been using with DatasetDict.

As @pietrolesci mentions, I have been using push_to_hub to quickly iterate on dataset curation for different ML experiments: locally I create a set of dataset splits, e.g. train/val/test/inference, convert them to HF Datasets, and finally to a DatasetDict to push_to_hub. Where I have run into issues is when I want to include different metadata for different splits. For example, I have situations where I only have metadata for one of the splits (e.g. test), or where I am working with inference data that does not have labels. Currently I use a rather hacky workaround: I add "dummy" columns for the missing columns to avoid the error:

ValueError: All datasets in `DatasetDict` should have the same features

I am curious why DatasetDict will likely not support this functionality. I don't know much about working with different configurations, but allowing different columns between datasets/splits would be a very helpful use case for me. Are there any docs for using different configurations, or more info about incorporating them with push_to_hub?

Best wishes,
Jonathan
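The "dummy column" workaround described in the comment above can be sketched without the datasets library (plain dicts of equal-length lists stand in for the splits; the helper name is made up):

```python
def add_dummy_columns(splits, fill=None):
    """Pad every split with fill-valued columns so all splits end up with
    the same column set (the workaround described above)."""
    all_columns = set()
    for table in splits.values():
        all_columns |= set(table)
    for table in splits.values():
        n_rows = len(next(iter(table.values()), []))
        for missing in all_columns - set(table):
            table[missing] = [fill] * n_rows
    return splits


splits = {
    "train": {"text": ["a", "b"], "label": [0, 1]},
    "inference": {"text": ["c"]},  # no labels available for inference data
}
add_dummy_columns(splits)
print(sorted(splits["inference"]))  # ['label', 'text']
```

With real Datasets the same idea applies: add a column of None/placeholder values to each split that lacks it before building the DatasetDict, so the features match.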

nikita-galileo commented:
+1

lhoestq (Member) commented Nov 22, 2022

I am curious why DatasetDict will likely not support this functionality?

There's a possibility we may merge the Dataset and DatasetDict classes. The DatasetDict purpose was to define a way to get the train/test splits of a dataset.

see the discussions at #5189

Are there any docs for using different configuration OR a more info about incorporating it with push_to_hub.

There's a PR open to allow uploading a dataset under a certain configuration name. Then later you can reload that specific configuration using load_dataset(ds_name, config_name)

see the PR at #5213

dburian commented Apr 6, 2023

Hi, regarding the following information:

Please note that this is a design purpose: a DatasetDict has the same features for all its datasets. Normally, a DatasetDict is composed of several sub-datasets each corresponding to a different split.

To handle sub-datasets with different features, we use another approach: use different configurations instead of splits.

Although this is often implied (how else would DatasetDict be able to process multiple splits in the same way?), I would expect it to be written plainly somewhere in the docs, maybe even in bold. Also, I would expect to see it in multiple places, such as:

I think this addition would benefit the docs, especially when you guide a newbie (such as me) through the process of creating a dataset. As I said, you somehow suspect that this is in fact the case, but without reading it in the docs you cannot be sure.
