Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added huggingface integration - DEPRECATED #880

Conversation

pranayasinghcsmpl
Copy link
Contributor

Fixes #727

Proposed Changes

  • Add a new option for huggingface in the cli
  • Add model upload functionality to a huggingface repo
  • Add model download functionality from a huggingface repo

Checklist

  • CONTRIBUTING guide has been followed.
  • PR is based on the current GaNDLF master .
  • Non-breaking change (does not break existing functionality): provide as many details as possible for any breaking change.
  • Function/class source code documentation added/updated (ensure typing is used to provide type hints, including and not limited to using Optional if a variable has a pre-defined value).
  • Code has been blacked for style consistency and linting.
  • If applicable, version information has been updated in GANDLF/version.py.
  • If adding a git submodule, add to list of exceptions for black styling in pyproject.toml file.
  • Usage documentation has been updated, if appropriate.
  • Tests added or modified to cover the changes; if coverage is reduced, please give explanation.
  • If customized dependency installation is required (i.e., a separate pip install step is needed for PR to be functional), please ensure it is reflected in all the files that control the CI, namely: python-test.yml, and all docker files [1,2,3].

@pranayasinghcsmpl pranayasinghcsmpl requested a review from a team as a code owner June 4, 2024 11:13
Copy link
Contributor

github-actions bot commented Jun 4, 2024

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@sarthakpati sarthakpati changed the base branch from master to new-apis_v0.1.0-dev June 4, 2024 15:06
@sarthakpati
Copy link
Collaborator

Thanks for the PR! I have changed the base to the "new-apis" branch (instead of master), and some conflicts need to be resolved. can you take a look?

image

@sarthakpati
Copy link
Collaborator

Waiting to hear from @NielsRogge about this PR. Also, is there any way to put unit tests for this?

from typing import List, Union


def push_to_model_hub(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So as I understand it you'll provide a CLI tool to push and pull models from the 🤗 hub. cc @Wauplin wondering if it's possible to programmatically add tags?

By default, this won't track any download numbers for the various models or add specific tags which ease model discoverability (like "image-classification" etc.). For that we can open a PR similar to this one: huggingface/huggingface.js#669, where you specify an extension which tracks downloads each time a file with the specified extension is downloaded.

Next to that, one can add code snippets when users click the "use in GaNDLF" button which you can add, as shown in this guide: https://huggingface.co/docs/hub/en/models-adding-libraries.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pranayasinghcsmpl @sarthakpati do you have an example of GaNDLF repo pushed on the Hub already? Agree with @NielsRogge, having it as an "officially compatible library" on the Hub would be awesome! It enables download counts + better discoverability + possible to have code snippets.

wondering if it's possible to programmatically add tags?

Of course it is! 🤗 Check out the ModelCard guide.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have uploaded a test model made using gandlf. You can find it here-https://huggingface.co/pranayasingh/test_upload1/tree/main

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the prompt responses @NielsRogge, @Wauplin and @pranayasinghcsmpl!

Copy link

@NielsRogge NielsRogge Jun 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I see this uploads various checkpoints to a single repository. Typically we recommend to create a single repository per model checkpoint, see a recent example of YOLOv10 where ONNX checkpoints are stored in separate repositories. However I think it makes sense to keep the initial and latest PyTorch checkpoints (along with the best) in a single repository.

For download stats to work, we could add a "gandlf" tag to the model cards, along with a PR like this one, where you can customize which files should contribute to incrementing the download counter. See also this guide: https://huggingface.co/docs/hub/en/models-download-stats.

This will allow you to track all models that people trained using GaNDLF (and see which ones are downloaded the most).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, from an organizational and requirements perspective, we want to enable anyone training models using GaNDLF to be able to upload to HF using their own org information, and not specifically under GaNDLF itself. For example, if organization A is training a model X-1, they should be able to upload this model (including the hashes of the code, and whatever else is needed to "deploy" the model) to huggingface.co/A/X-1 (or whatever else that makes sense).

The main reason behind this is that MLCommons does not want to be a curator of models, and only wants to enable other organizations to train their own models.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's exactly what the huggingface-cli/huggingface_hub are for! The suggestion from @NielsRogge is to add a gandlf tag to the model card metadata from generating it. This way all repos pushed to the Hub with GaNDLF will be tagged as such and searchable. It does not require to manually curate those models yourself. The download count will happen per repo, allowing you and your users to get an idea of the traction you get on the Hub.

Copy link
Contributor

@Wauplin Wauplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the ping @NielsRogge :) I'm maintainer of the huggingface_hub library and sometimes help with integrations 👋

Note that currently it seems that the 2 added commands are very similar to

huggingface-cli upload <repo-id> <path-to-local-folder>

and

huggingface-cli download <repo-id>

without adding new features to it. Those two commands offers more flexibility to users and are officially supported and maintained. On the long run, having 2 aliases in GaNDLF repo can become more of a maintenance burden than a real benefit. But I'm leaving it to the repo owners that knows the broader context :)

GANDLF/cli/huggingface_hub_handler.py Outdated Show resolved Hide resolved
GANDLF/entrypoints/subcommands.py Outdated Show resolved Hide resolved
)


def download_from_hub(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From an external pair of eyes, I'm not sure to understand why defining download_from_hub which end-up being an alias for snapshot_download?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Wauplin, i did this to stay consistent with the implementations of other cli subcommands.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine with it!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pranayasinghcsmpl - are any specific benefits to using the new commands you have introduces versus those currently offered through HF? Would it make more sense to use them directly, and have HF as a dependency for GaNDLF going forward?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sarthakpati, the HF CLI offers more features than this implementation. However, incorporating a CLI subcommand for HF Hub might encourage more users to fully utilize it. Maybe we could explore integrating the HF CLI into our CLI subcommands if we require all its features.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sarthakpati, I think you're right. Also using the HF CLI might resolve issues mentioned in ref.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Precisely!

I am going to mark this PR as draft until this gets ready, @pranayasinghcsmpl.

@NielsRogge & @Wauplin - would it be okay to tag you to take a look at this PR once @pranayasinghcsmpl is done from his end to provide additional feedback?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes sounds good!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course! 🤗

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! 👍🏽

@sarthakpati sarthakpati marked this pull request as draft June 6, 2024 19:36
@sarthakpati
Copy link
Collaborator

2 points from me:

  1. Tests need to be added. See the entrypoint tests here.
  2. I am guessing hf needs to be added as a dependency to setup.py.

@pranayasinghcsmpl
Copy link
Contributor Author

@sarthakpati, Sure I'll start working on the tests. Also I'll add the dependencies in the next commit.

@sarthakpati
Copy link
Collaborator

sarthakpati commented Jun 27, 2024

The mlcube-docker errors in actions are related to mlcommons/mlcube#360

Changing the pip version in the workflow should help - please put this change in a separate PR, @pranayasinghcsmpl

Edit: done in #887

@pranayasinghcsmpl pranayasinghcsmpl force-pushed the hf_cli2 branch 2 times, most recently from 58d5508 to 5ad21de Compare July 1, 2024 10:01
@sarthakpati
Copy link
Collaborator

@pranayasinghcsmpl - the latest changes from the base branch should make things easier. However, we are still missing updates to setup.py.

@sarthakpati sarthakpati mentioned this pull request Jul 2, 2024
10 tasks
@sarthakpati sarthakpati changed the title Added huggingface integration Added huggingface integration - DEPRECATED Jul 2, 2024
@sarthakpati sarthakpati deleted the branch mlcommons:new-apis_v0.1.0-dev July 31, 2024 14:21
@pranayasinghcsmpl pranayasinghcsmpl deleted the hf_cli2 branch August 14, 2024 06:47
@pranayasinghcsmpl pranayasinghcsmpl restored the hf_cli2 branch August 14, 2024 06:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Hugging Face Hub integration
5 participants