DatasetDict containing Datasets with different features when pushed to hub gets remapped features #4211

Closed
pietrolesci opened this issue Apr 25, 2022 · 10 comments · Fixed by #4372
Labels: bug

pietrolesci commented Apr 25, 2022

Hi there,

I am trying to load a dataset to the Hub. This dataset is a DatasetDict composed of various splits. Some splits have a different Feature mapping. Locally, the DatasetDict preserves the individual features but if I push_to_hub and then load_dataset, the features are all the same.

Dataset and code to reproduce available here.

In short:

I have 3 feature mappings:

from datasets import ClassLabel, Dataset, DatasetDict, Features, Value, load_dataset

Tri_features = Features(
    {
        "idx": Value(dtype="int64"),
        "premise": Value(dtype="string"),
        "hypothesis": Value(dtype="string"),
        "label": ClassLabel(num_classes=3, names=["entailment", "neutral", "contradiction"]),
    }
)

Ent_features = Features(
    {
        "idx": Value(dtype="int64"),
        "premise": Value(dtype="string"),
        "hypothesis": Value(dtype="string"),
        "label": ClassLabel(num_classes=2, names=["non-entailment", "entailment"]),
    }
)

Con_features = Features(
    {
        "idx": Value(dtype="int64"),
        "premise": Value(dtype="string"),
        "hypothesis": Value(dtype="string"),
        "label": ClassLabel(num_classes=2, names=["non-contradiction", "contradiction"]),
    }
)

Then I create the different datasets (Tri_dataset, Ent_bin_dataset, and Con_bin_dataset are collections of split names defined in the linked code):

dataset_splits = {}

for split in df["split"].unique():
    print(split)
    df_split = df.loc[df["split"] == split].copy()
    
    if split in Tri_dataset:
        df_split["label"] = df_split["label"].map({"entailment": 0, "neutral": 1, "contradiction": 2})
        ds = Dataset.from_pandas(df_split, features=Tri_features)
    
    elif split in Ent_bin_dataset:
        df_split["label"] = df_split["label"].map({"non-entailment": 0, "entailment": 1})
        ds = Dataset.from_pandas(df_split, features=Ent_features)
    
    elif split in Con_bin_dataset:
        df_split["label"] = df_split["label"].map({"non-contradiction": 0, "contradiction": 1})
        ds = Dataset.from_pandas(df_split, features=Con_features)

    else:
        print("ERROR:", split)
        continue  # skip unknown splits instead of silently reusing the previous ds

    dataset_splits[split] = ds
datasets = DatasetDict(dataset_splits)

I then push to hub

datasets.push_to_hub("pietrolesci/robust_nli", token="<token>")

Finally, I load it from the hub

datasets_loaded_from_hub = load_dataset("pietrolesci/robust_nli")

And I get that

datasets["LI_TS"].features != datasets_loaded_from_hub["LI_TS"].features

since

"label": ClassLabel(num_classes=2, names=["non-contradiction", "contradiction"])

gets remapped to

 "label": ClassLabel(num_classes=3, names=["entailment", "neutral", "contradiction"])
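This remapping is consistent with the raw integer codes surviving the round-trip while being decoded against the wrong ClassLabel name list. A minimal pure-Python illustration of that assumed mechanic (no datasets dependency; the lists mirror the features above):

```python
# Hypothetical illustration of the remap: the stored integer codes are
# unchanged, but they end up decoded against the other split's label names.
con_names = ["non-contradiction", "contradiction"]       # 2-class schema
tri_names = ["entailment", "neutral", "contradiction"]   # 3-class schema

codes = [0, 1, 1, 0]  # labels as encoded with con_names

as_intended = [con_names[c] for c in codes]
as_remapped = [tri_names[c] for c in codes]  # what the reloaded schema implies

print(as_intended)  # ['non-contradiction', 'contradiction', 'contradiction', 'non-contradiction']
print(as_remapped)  # ['entailment', 'neutral', 'neutral', 'entailment']
```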
pietrolesci added the bug label on Apr 25, 2022
albertvillanova (Member) commented Apr 26, 2022

Hi @pietrolesci, thanks for reporting.

Please note that this is by design: a DatasetDict has the same features for all its datasets. Normally, a DatasetDict is composed of several sub-datasets, each corresponding to a different split.

To handle sub-datasets with different features, we use another approach: use different configurations instead of splits.

However, for the moment push_to_hub does not support specifying different configurations. IMHO, we should implement this.

pietrolesci (Author) commented:

Hi @albertvillanova,

Thanks a lot for your reply! I get it now. What surprised me was that it worked correctly locally (i.e., a DatasetDict whose datasets have different features) but not on the Hub. It would be great to have configurations supported by push_to_hub: that function has let me iterate rather quickly on dataset curation.

Again, thanks for your time @albertvillanova!

Best,
Pietro

mariosasko (Collaborator) commented:
Hi! Yes, we should override DatasetDict.__setitem__ and throw an error if features dictionaries are different. DatasetDict is a subclass of dict, so DatasetDict.{update/setdefault} need to be overridden as well. We could avoid this by subclassing UserDict, but then we would get the name collision - DatasetDict.data vs. UserDict.data. This makes me think we should rename the data attribute of DatasetDict/Dataset for easier dict subclassing (would also simplify #3997) and to follow good Python practices. Another option is to have a custom UserDict class in py_utils, but it can be hard to keep this class consistent with the built-in UserDict.

@albertvillanova @lhoestq wdyt?
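The override proposed above could look roughly like this. A minimal pure-Python sketch (the class name is illustrative; a real implementation would live in datasets, compare actual Features objects, and cover setdefault/update as noted):

```python
class CheckedDatasetDict(dict):
    """Sketch: a dict subclass that rejects entries whose features differ.

    Any value exposing a comparable `.features` attribute (e.g. a
    datasets.Dataset) works here.
    """

    def __setitem__(self, key, dataset):
        for existing in self.values():
            if existing.features != dataset.features:
                raise ValueError(
                    "All datasets in `DatasetDict` should have the same features"
                )
        super().__setitem__(key, dataset)

    def update(self, *args, **kwargs):
        # Route dict.update through __setitem__ so the check also applies here.
        for key, value in dict(*args, **kwargs).items():
            self[key] = value

    def setdefault(self, key, default=None):
        if key not in self:
            self[key] = default
        return self[key]
```

This is the cost of subclassing dict directly: every mutating method that can add an entry must be overridden by hand, which is exactly the trade-off against UserDict discussed above.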

lhoestq (Member) commented Apr 26, 2022

I would keep things simple and keep subclassing dict. Regarding the features check, I guess this can be done only for push_to_hub, right? It is the only function right now that requires the underlying datasets to be splits (e.g. train/test) and have the same features.

Note that later you will be able to push datasets with different features as different dataset configurations (similarly to the GLUE subsets, for example). We will work on this soon.

pietrolesci (Author) commented:
Hi @lhoestq,

Returning to this thread to ask whether creating a DatasetDict with different configurations will be supported in the future.

Best,
Pietro

lhoestq (Member) commented Nov 12, 2022

DatasetDict is likely to always require the datasets to have the same columns and types, while different configurations may have different columns and types.

Why would you like to see that? If it's related to push_to_hub, we plan to allow pushing several configs, but not via DatasetDict.

jonathangomesselman commented:
Hi @lhoestq and @pietrolesci,

I have been curious about this question as well. I don't have experience working with different configurations, but I can give a bit more detail on the workflow that I have been using with DatasetDict.

As @pietrolesci mentions, I have been using push_to_hub to quickly iterate on dataset curation for different ML experiments: locally I create a set of dataset splits, e.g. train/val/test/inference, convert them to HF Datasets, and finally to a DatasetDict to push_to_hub. Where I have run into issues is when I want to include different metadata for different splits. For example, I have situations where I only have metadata for one of the splits (e.g. test), or where I am working with inference data that does not have labels. Currently I use a rather hacky workaround: I add "dummy" columns for the missing columns to avoid the error:

ValueError: All datasets in `DatasetDict` should have the same features

I am curious why DatasetDict will likely not support this functionality. I don't know much about working with different configurations, but allowing different columns between datasets/splits would be a very helpful use case for me. Are there any docs for using different configurations, or more info about incorporating them with push_to_hub?

Best wishes,
Jonathan
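The "dummy column" workaround described in the comment above can be sketched without the datasets library (plain dicts of equal-length lists stand in for the splits; the helper name is made up):

```python
def add_dummy_columns(splits, fill=None):
    """Pad every split with fill-valued columns so all splits end up with
    the same column set (the workaround described above)."""
    all_columns = set()
    for table in splits.values():
        all_columns |= set(table)
    for table in splits.values():
        n_rows = len(next(iter(table.values()), []))
        for missing in all_columns - set(table):
            table[missing] = [fill] * n_rows
    return splits


splits = {
    "train": {"text": ["a", "b"], "label": [0, 1]},
    "inference": {"text": ["c"]},  # no labels available for inference data
}
add_dummy_columns(splits)
print(sorted(splits["inference"]))  # ['label', 'text']
```

With real Datasets the same idea applies: add a column of None/placeholder values to each split that lacks it before building the DatasetDict, so the features match.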

nikita-galileo commented:
+1

lhoestq (Member) commented Nov 22, 2022

I am curious why DatasetDict will likely not support this functionality?

There's a possibility we may merge the Dataset and DatasetDict classes. The DatasetDict purpose was to define a way to get the train/test splits of a dataset.

see the discussions at #5189

Are there any docs for using different configuration OR a more info about incorporating it with push_to_hub.

There's a PR open to allow uploading a dataset under a certain configuration name. Then later you can reload that specific configuration using load_dataset(ds_name, config_name)

see the PR at #5213

dburian commented Apr 6, 2023

Hi, regarding the following information:

Please note that this is a design purpose: a DatasetDict has the same features for all its datasets. Normally, a DatasetDict is composed of several sub-datasets each corresponding to a different split.

To handle sub-datasets with different features, we use another approach: use different configurations instead of splits.

Although this is often implied (how else would DatasetDict be able to process multiple splits in the same way?), I would expect it to be written plainly somewhere in the docs, maybe even in bold. Also, I would expect to see it in multiple places, such as:

I think this addition would benefit the docs, especially when you guide a newbie (such as me) through the process of creating a dataset. As I said, you somehow suspect that this is in fact the case, but without reading it in the docs you cannot be sure.
