Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added the ability to track progress of individual file downloads in snapshot_download() through inner_tqdm_class addition #2718

Closed

Conversation

kevinMEH
Copy link

1: Fixed a bug regarding the type of the tqdm_class argument of snapshot_download().

2: Added the ability to track progress of individual file downloads in snapshot_download().

Users can now pass an inner_tqdm_class argument to snapshot_download(), which will be eventually passed onto http_get(), which will use that class if available to instantiate its instance of tqdm.

This enables the user to track the progress of individual file downloads, rather than overall progress (tracked through the regular tqdm_class argument passed into snapshot_download()).

Example helpful scenario:

In my Local-LLM project, I would like to relay the progress of model downloads to the user. Presently, there is no mechanism for such behavior, but through this addition, I can now pass in a custom tqdm class into inner_tqdm_class and receive progress updates through my custom class.

Note: The code below is still a work in progress, but should be sufficient to help illustrate the utility of the addition.

image

Feel free to edit.

Before the fix, tqdm_class accepts an instance of tqdm, not a
class of tqdm.
The inner_tqdm_class argument is passed onto the individual calls of
http_get inside file_download.py and gives the user the ability to
monitor the progress of individual file downloads.
This enables the user to use the regular ThreadPoolExecutor function to
execute _inner_hf_hub_download() in parallel rather than use tqdm's
thread_map(), which calls ThreadPoolExecutor.
@kevinMEH
Copy link
Author

I have made an additional change (commit b282938)

This change lets the user directly use ThreadPoolExecutor to execute file downloads in parallel instead of through tqdm's thread_map.

I propose this change because when using tqdm in a multithread / multiprocess environment, the following warning will pop up:

UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown: {'/mp-91jplfuc'}

One can imagine how seeing such a warning upon every run of their otherwise pristinely programmed application could easily drive an otherwise perfectly sane man to their limits.

Therefore, I propose the following change to skip at least the outer execution of tqdm.

PS: For anyone seeking knowledge on how to avoid the inner execution of tqdm (for the individual file downloads), you can do so by providing your own custom progress bar class: Only update(), enter(), exit() and init() are used by http_get().

image

Once you've replaced the inner_tqdm_class and skipped the outer tqdm execution, you will no longer get the semaphore leak warning.

@hanouticelina
Copy link
Contributor

hanouticelina commented Dec 31, 2024

Hi @kevinMEH, thanks a lot for your PR! The individual file progress tracking can actually be achieved without modifying snapshot_download you can do something like this :

from typing import Optional

from tqdm.auto import tqdm

from huggingface_hub import snapshot_download


class CustomProgress(tqdm):
    """Custom tqdm class for nested progress bars."""

    disable_outer = True

    def __init__(self, *args, **kwargs):
        if kwargs.get("desc", "").startswith("Fetching"):
            kwargs["disable"] = self.disable_outer
        super().__init__(*args, **kwargs)
        self.main_bar: Optional[tqdm] = None

    def __new__(cls, *args, **kwargs):
        """Create a new instance with nested progress bars."""
        instance = super().__new__(cls)
        instance.main_bar = tqdm(desc="Files", position=0, disable=True)
        instance.desc = kwargs.get("desc", "Downloading")
        instance.position = 0
        instance.leave = True
        return instance

    @classmethod
    def get_lock(cls):
        return tqdm.get_lock()


if __name__ == "__main__":
    # A flag to disable the outer progress bar
    NestedProgress.disable_outer = True
    snapshot_download(repo_id="answerdotai/ModernBERT-base", tqdm_class=NestedProgress, max_workers=1)

output:

.gitattributes: 100%|███████████████████████| 1.52k/1.52k [00:00<00:00, 11.3MB/s]
README.md: 100%|████████████████████████████| 8.53k/8.53k [00:00<00:00, 28.5MB/s]
config.json: 100%|██████████████████████████| 1.19k/1.19k [00:00<00:00, 13.0MB/s]
model.safetensors:  37%|████████              | 220M/599M [00:05<00:08, 43.3MB/s]

output with NestedProgress.disable_outer = False:

.gitattributes: 100%|███████████████████████| 1.52k/1.52k [00:00<00:00, 6.66MB/s]
README.md: 100%|████████████████████████████| 8.53k/8.53k [00:00<00:00, 23.4MB/s]
config.json: 100%|██████████████████████████| 1.19k/1.19k [00:00<00:00, 4.75MB/s]
Fetching 16 files:  19%|████▉                     | 3/16 [00:00<00:03,  3.99it/s]
model.safetensors:  54%|███████████▉          | 325M/599M [00:07<00:06, 43.1MB/s]

Let me know if that resolves your use case!

@kevinMEH
Copy link
Author

Hi @kevinMEH, thanks a lot for your PR! The individual file progress tracking can actually be achieved without modifying snapshot_download you can do something like this :

from typing import Optional

from tqdm.auto import tqdm

from huggingface_hub import snapshot_download


class NestedProgress(tqdm):
    """Custom tqdm class for nested progress bars."""

    disable_outer = True

    def __init__(self, *args, **kwargs):
        if kwargs.get("desc", "").startswith("Fetching"):
            kwargs["disable"] = self.disable_outer
        super().__init__(*args, **kwargs)
        self.main_bar: Optional[tqdm] = None

    def __new__(cls, *args, **kwargs):
        """Create a new instance with nested progress bars."""
        instance = super().__new__(cls)
        instance.main_bar = tqdm(desc="Files", position=0, disable=True)
        instance.desc = kwargs.get("desc", "Downloading")
        instance.position = 0
        instance.leave = True
        return instance

    @classmethod
    def get_lock(cls):
        return tqdm.get_lock()


if __name__ == "__main__":
    # A flag to disable the outer progress bar
    NestedProgress.disable_outer = True
    snapshot_download(repo_id="answerdotai/ModernBERT-base", tqdm_class=NestedProgress, max_workers=1)

output:

.gitattributes: 100%|███████████████████████| 1.52k/1.52k [00:00<00:00, 11.3MB/s]
README.md: 100%|████████████████████████████| 8.53k/8.53k [00:00<00:00, 28.5MB/s]
config.json: 100%|██████████████████████████| 1.19k/1.19k [00:00<00:00, 13.0MB/s]
model.safetensors:  37%|████████              | 220M/599M [00:05<00:08, 43.3MB/s]

output with NestedProgress.disable_outer = False:

.gitattributes: 100%|███████████████████████| 1.52k/1.52k [00:00<00:00, 6.66MB/s]
README.md: 100%|████████████████████████████| 8.53k/8.53k [00:00<00:00, 23.4MB/s]
config.json: 100%|██████████████████████████| 1.19k/1.19k [00:00<00:00, 4.75MB/s]
Fetching 16 files:  19%|████▉                     | 3/16 [00:00<00:03,  3.99it/s]
model.safetensors:  54%|███████████▉          | 325M/599M [00:07<00:06, 43.1MB/s]

Let me know if that resolves your use case!

Hello,

I'm afraid I don't quite understand how your NestedProgress class tracks individual file downloads.

I have slightly modified NestedProgress to print "MODIFIED: {desc}" every time it is called. Then, I ran snapshot_download with NestedProgress as tqdm_class.

image

If it truly tracks the individual file downloads, then there will be a print statement for each file download, saying, for example:

MODIFIED: README.md
MODIFIED: config.json

However, this does not appear to be the case:

image

As I expected, there is only a single print statement for the outer tqdm instance which tracks the progress of ALL file downloads:

MODIFIED: Fetching 10 files

Probably because the http_get function in file_download.py uses the tqdm imported from utils, and not from a user supplied tqdm class. Which my commits above fixes, by letting the user supply a custom "inner_tqdm" class:

image image

Perhaps I am wrong and your solution does indeed solve my problem. I don't often use this (what I would describe as a garbage) programming language. In fact, this is my first time seeing the __new__ function so maybe I am doing something wrong? In which case do you mind provide another example of your solution working?

@hanouticelina
Copy link
Contributor

@kevinMEH Ah yes sorry, I was not clear and precise in my answer before. The NestedProgress (I should have just named it CustomProgress) only allows you to customize the progress bar and disable the outer tqdm, it DOES NOT track the downloading of each individual file. those are tracked using tqdm in the http_get() function, so currently, individual file download progress is already available through the existing implementation.

If I understood correctly, you want to be able to pass a custom tqdm to http_get() when calling snapshot_download(), What kind of specific customization do you want to add to the current per-file progress bar?

@kevinMEH
Copy link
Author

Hello,

As mentioned above, tracking the progress of individual file downloads by passing in a custom tqdm class is useful for... tracking the progress of individual file downloads. It is not about adding any specific customization, but letting the developer add any customization they want.

Currently, the progress of individual file downloads is only logged to the console, but what if our user is not using a console interface? For example, the user might be requesting file downloads through a web application. In that case, we will need some other way to provide file download progress information to the user, not just by printing it out to the console. We can do this by allowing ourselves to pass in a custom tqdm class to http_get(). In this custom class, we can then choose to write progress information to somewhere other than the console, for example to an array, which we can then read from and send to the user through an event stream.

Please see this comment for an example:

#2718 (comment)

@hanouticelina
Copy link
Contributor

@kevinMEH got it, thanks for the context and explanation! I will check your PR more in details later and see how to implement this in the simplest way possible.

@kevinMEH
Copy link
Author

kevinMEH commented Jan 4, 2025

The simplest implementation has been provided in the 3 commits above, should not be too hard to review and verify its correctness (whenever you have the time to do so of course).

@kevinMEH
Copy link
Author

Bump

@hanouticelina
Copy link
Contributor

Hi @kevinMEH,
Sorry for the delayed response! thank you for your PR and bringing this use case to our attention.
after reviewing the proposed changes, I have some concerns about adding inner_tqdm_class and disable_outer_tqdm parameters to snapshot_download(). while these parameters would solve your specific use case, I believe this approach has some drawbacks:

  • snapshot_download() signature becomes more cluttered with parameters that aren't core to its main purpose. This might not justify the added API complexity given that the use case hasn't been frequently requested and the tqdm_class parameter (global progress bar) already covers the most common needs.
  • if other users need similar customization for other progress bars, e.g. in uploads, we'd need to add similar parameters across multiple functions, which would make things a bit harder to maintain.

One alternative would be to implement a global configuration system that allows customizing progress bars across the entire library. Of course this still needs to be discussed and prioritize.

from tqdm.auto import tqdm
from huggingface_hub.utils import configure_progress_bars

class CustomProgress(tqdm):
    def update(self, n=1):
        super().update(n)
        # Add custom logging here

# override huggingface_hub.utils.tqdm with CustomProgress
configure_progress_bars(tqdm_class=CustomProgress)
snapshot_download(...)

# or with a context manager
with configure_progress_bars(tqdm_class=CustomProgress):
    snapshot_download(...)

This is one alternative solution but there may be other better approaches as well!

@kevinMEH
Copy link
Author

Hello,

Thanks for your response, however, I am a bit confused about the points you've made.

First, you mentioned that the addition would cause snapshot_download() to have additional clutter and complexity, but snapshot_download() already accepts 21 parameters. Surely 2 more can be added without adding too much "complexity" to the API?

Especially since the two parameters are extremely easy to explain and for anyone to understand. inner_tqdm_class enables users to track per file download progress. skip_outer_tqdm enables users to skip the outer iteration of tqdm. If users are already willing to read documentation for 21 arguably much more complex parameter options, surely they are fine with letting their eyes skip over two additional sentences?

For the inner_tqdm_class parameter; you've already accepted a parameter for tqdm_class, which tracks the outer tqdm iteration. Why not add another class which tracks the inner tqdm iterations too? It is a logical next addition to the snapshot_download() option. If you don't want to add the skip_outer_tqdm parameter, that's fine, I will amend my commits, but there is no real reason to not add inner_tqdm_class.

As you know, there is a vast difference between tracking total file downloads and individual file downloads. Let's take the average repository: There is probably a README file and some other small json configuration files or whatever. And then there are the extremely large model parameter files. The current method of only tracking total file download progress gives no hint as to the current download progress of these larger files, so why not add an inner_tqdm_class which does let developers relay such download information to the user?

On your second point, let's not make the logical leap that adding an inner_tqdm_class in this case will result in a slippery slope situation which results in all other functions also needing an inner_tqdm_class parameter added. As you mentioned in your first point, "the use case hasn't been frequently requested", so probabilistically speaking, there is no real need to worry about anyone else asking for an "inner_tqdm_class" parameter to be added for any other functions. If does exist other people asking for such an addition, then it would surely mean that the use case is frequently requested, which would again justify such an addition.

The alternative solution you proposed for tracking individual file downloads is much more complicated than the current solution, which is simple (simply have http_get() instantiate its tqdm instance using a passed in class) and basically 0 cost (it simply adds two if statements). I don't want to make life hard for you or other maintainers (and for me), so I believe that the current solution is the best solution available.

I sincerely hope you will change your mind, as these feature additions are very useful to me.

@hanouticelina
Copy link
Contributor

Hello @kevinMEH,
Thank you for your detailed feedback.
While adding these parameters might seem straightforward, my point is that we aim to keep the API design of snapshot_download focused on its core purpose: a simple, high-level function to download an entire repository without necessarily having control of per-file downloads.

For users needing file-level granularity and progress tracking, here is what we recommend:

  • In your script, you can use HfApi().repo_info to get the list of files you want to download from a specific repository on the Hub.
  • Then you call hf_hub_download for each file - we're open to adding a progress bar parameter there for per-file download tracking.

I hope this solution works for you, let me know if you want to update your PR accordingly.

@kevinMEH
Copy link
Author

Hello,

Thanks for your response.

I am simply unable to understand the logic behind allowing the implementation of total file download progress tracking through tqdm_class (which again, does not reveal much information to the user about actual file download progress), but strong opposition against an OPTIONAL method of also tracking the individual download progress of each file (nobody is forcing anyone who uses snapshot_download() to use inner_tqdm_class)... Which is already present in the form of progress printed to the command line, but somehow allowing users to access the information by passing in an inner_tqdm_class parameter is too much?

We are not negotiating some kind of complicated international criminal law treaty here, this is simply a feature request that fleshes out already existing (but not thoroughly implemented) functionality in snapshot_download(). Even if there is a need to strictly conform to some kind of standard, it is simply illogical to not allow the addition of inner_tqdm_class tracking individual file download progress when tqdm_class already tracks total file download progress.

I apologize for any inconveniences, but would any other maintainer be available to lend a second opinion on the current subject matter?

@Wauplin
Copy link
Contributor

Wauplin commented Jan 14, 2025

I apologize for any inconveniences, but would any other maintainer be available to lend a second opinion on the current subject matter?

I 100% agree with @hanouticelina 's opinion explained in #2718 (comment).
Adding an argument in hf_hub_download seem far enough as a solution here. Happy to revisit if there is a bigger demand.

I am simply unable to understand the logic behind allowing the implementation of total file download progress tracking through tqdm_class (which again, does not reveal much information to the user about actual file download progress), but strong opposition against an OPTIONAL method of also tracking the individual download progress of each file (nobody is forcing anyone who uses snapshot_download() to use inner_tqdm_class)...

Adding more arguments means more maintenance cost, that's the main reason. No matter if it's a one-line change or a 100-lines change, adding arguments has a cost. In this case we've added a tqdm_class quite some time ago to allow for "some-sort" of progress tracking but tbh I'm not sure we should have done it either (it's a very corner case to do that). snapshot_download should remain a simple wrapper around hf_hub_download, focused on multithreading and ease of use but that's it.

@kevinMEH
Copy link
Author

Hello,

Thanks for your response. By your definition, any genuine feature contribution is going to result in some amount of "maintenance cost", and I am sadden by the fact that such contributions are not desired at the moment. I have received the memo; only contributions that fix slight grammar issues or slightly improve translations made by individuals who desperately need some kind of showcase on their resumes or Github profiles are desired at the moment.

I wish to bother no longer, and I hope everyone here has a wonderful afternoon/evening. Thank you.

@kevinMEH kevinMEH closed this Jan 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants