ENH: Integration with Hugging Face Hub #46000

Closed
lvwerra opened this issue Feb 15, 2022 · 5 comments · Fixed by #60608

@lvwerra

lvwerra commented Feb 15, 2022

Hi Pandas devs and Pandas community 🤗

I am reaching out to see whether you would be interested in an integration with the Hugging Face Hub. We have been hosting datasets on the Hub for a while and now have close to 3000 public datasets, not counting the private ones.

In both the models and datasets areas of the Hugging Face ecosystem we use the push_to_hub functionality to upload datasets and models to the Hub in one line. Similarly, these assets can be loaded from the Hub in a single line with the load_dataset and from_pretrained functions, respectively.

We wanted to ask whether you would be interested in adding the huggingface_hub dependency so that any DataFrame could be pushed to and pulled from the Hub.

Here are a few use-cases where such a functionality would add value:

  • Save and document raw as well as processed datasets on the hub (also as backup)
    • Datasets on the Hub have a preview (see an example here)
    • Datasets on the Hub can be documented with a Readme and linked to models trained on them
    • Datasets on the Hub are versioned (using git-lfs in the background)
  • Share datasets with students for lectures or group projects
  • Share datasets within an organization (publicly or privately)

Here is what such an integration could look like:

# upload a DataFrame to the Hub:
df.push_to_hub("my_dataset", org="my_org")

# load a DataFrame from the Hub:
df = DataFrame.from_hub("my_dataset", org="my_org")

Here is the documentation on publishing files on the Hugging Face Hub using the huggingface_hub library:
https://github.com/huggingface/huggingface_hub/tree/main/src/huggingface_hub#publish-files-to-the-hub

I am curious to hear what you think about this and please let me know if I can clarify anything!

cc @osanseviero @julien-c

lvwerra added the Enhancement and Needs Triage labels Feb 15, 2022
@jbrockmendel
Member

We wanted to ask you whether you would be interested to add the huggingface_hub

We're very wary of adding dependencies and extending an already-overstuffed API. Is something like your_module.push_to_hub(df, "my_dataset", org="my_org") not viable?
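
A minimal sketch of what such a standalone function could look like, living outside pandas. Everything here is an assumption for illustration: the function body, the CSV serialization, the file name data.csv, and the injectable `upload` parameter are hypothetical and not part of any existing package. When no uploader is passed, the sketch falls back to `HfApi.create_repo` and `HfApi.upload_file` from huggingface_hub, which needs that library installed and a logged-in account.

```python
from typing import Callable, Optional

import pandas as pd


def push_to_hub(
    df: pd.DataFrame,
    name: str,
    org: str,
    *,
    upload: Optional[Callable[..., object]] = None,
) -> str:
    """Serialize df to CSV bytes and hand them to an uploader.

    Hypothetical helper: the name mirrors the call suggested in this thread.
    """
    repo_id = f"{org}/{name}"
    payload = df.to_csv(index=False).encode("utf-8")

    if upload is None:
        # Imported lazily so that pandas-only users never need the dependency.
        from huggingface_hub import HfApi

        api = HfApi()
        api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
        upload = api.upload_file

    # Keyword names follow HfApi.upload_file; "data.csv" is an arbitrary choice.
    upload(
        path_or_fileobj=payload,
        path_in_repo="data.csv",
        repo_id=repo_id,
        repo_type="dataset",
    )
    return repo_id
```

Keeping the uploader injectable means the function can be exercised offline with a stand-in callable, and pandas itself gains neither a new method nor a new dependency.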

@mroeschke
Member

Agreed with the hesitancy about adding this directly to pandas.

For context, pandas-datareader (a public/private data-sourcing feature in a similar spirit) used to be packaged with pandas but was spun off into its own package: https://pandas-datareader.readthedocs.io/en/latest/

Given that, I think this would be best implemented as a third party package and included in the ecosystem docs.

@twoertwein
Member

twoertwein commented Feb 15, 2022

Pandas already supports many protocols thanks to fsspec (writing/loading to AWS, GCS, ...). If you manage to integrate the "Hugging Face Hub protocol" in fsspec, you get pandas support for free :)

edit: this would take care of the transmission from a user to your hub, but the format might not be what you want (unless you are fine with a csv/json/pickle/excel version of a dataframe).
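
As a rough illustration of the fsspec route described above, the sketch below round-trips a DataFrame through fsspec's built-in in-memory filesystem. The "memory://" URL stands in for any protocol fsspec knows about, and the path and column names are made up for the example. It assumes the fsspec package is installed alongside pandas.

```python
import pandas as pd

df = pd.DataFrame({"text": ["hello", "world"], "label": [0, 1]})

# pandas recognizes the "protocol://" prefix and delegates file access to
# fsspec; "memory://" is fsspec's built-in in-memory filesystem.
url = "memory://demo/my_dataset.csv"  # hypothetical path
df.to_csv(url, index=False)

# Reading uses the same URL; no protocol-specific code lives in pandas.
roundtrip = pd.read_csv(url)
```

The same two calls would work unchanged for any other registered protocol, which is why a Hub filesystem implemented for fsspec would need no changes on the pandas side. As noted in the edit above, what gets stored is whatever file format the chosen pandas writer produces.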

@julien-c

@twoertwein that's a pretty cool idea!

@lhoestq
Contributor

lhoestq commented Jun 27, 2024

You can now find some early documentation on hf:// + pandas here: https://huggingface.co/docs/hub/datasets-pandas :)

import pandas as pd

df = pd.read_parquet("hf://datasets/username/my_dataset/data.parquet")

And automatic code snippets on HF as well:

[Two screenshots of the automatic code snippets]
