-
Notifications
You must be signed in to change notification settings - Fork 582
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added the ability to track progress of individual file downloads in snapshot_download() through inner_tqdm_class addition #2718
Conversation
Before the fix, tqdm_class accepts an instance of tqdm, not a class of tqdm.
The inner_tqdm_class argument is passed onto the individual calls of http_get inside file_download.py and gives the user the ability to monitor the progress of individual file downloads.
This enables the user to use the regular ThreadPoolExecutor function to execute _inner_hf_hub_download() in parallel rather than use tqdm's thread_map(), which calls ThreadPoolExecutor.
I have made an additional change (commit b282938) This change lets the user directly use ThreadPoolExecutor to execute file downloads in parallel instead of through tqdm's thread_map. I propose this change because when using tqdm in a multithread / multiprocess environment, the following warning will pop up:
One can imagine how seeing such a warning upon every run of their otherwise pristinely programmed application could easily drive an otherwise perfectly sane man to their limits. Therefore, I propose the following change to skip at least the outer execution of tqdm. PS: For anyone seeking knowledge on how to avoid the inner execution of tqdm (for the individual file downloads), you can do so by providing your own custom progress bar class: Only update(), enter(), exit() and init() are used by http_get(). Once you've replaced the inner_tqdm_class and skipped the outer tqdm execution, you will no longer get the semaphore leak warning. |
Hi @kevinMEH, thanks a lot for your PR! The individual file progress tracking can actually be achieved without modifying from typing import Optional
from tqdm.auto import tqdm
from huggingface_hub import snapshot_download
class CustomProgress(tqdm):
"""Custom tqdm class for nested progress bars."""
disable_outer = True
def __init__(self, *args, **kwargs):
if kwargs.get("desc", "").startswith("Fetching"):
kwargs["disable"] = self.disable_outer
super().__init__(*args, **kwargs)
self.main_bar: Optional[tqdm] = None
def __new__(cls, *args, **kwargs):
"""Create a new instance with nested progress bars."""
instance = super().__new__(cls)
instance.main_bar = tqdm(desc="Files", position=0, disable=True)
instance.desc = kwargs.get("desc", "Downloading")
instance.position = 0
instance.leave = True
return instance
@classmethod
def get_lock(cls):
return tqdm.get_lock()
if __name__ == "__main__":
# A flag to disable the outer progress bar
NestedProgress.disable_outer = True
snapshot_download(repo_id="answerdotai/ModernBERT-base", tqdm_class=NestedProgress, max_workers=1) output:
output with
Let me know if that resolves your use case! |
Hello, I'm afraid I don't quite understand how your NestedProgress class tracks individual file downloads. I have slightly modified NestedProgress to print "MODIFIED: {desc}" every time it is called. Then, I ran snapshot_download with NestedProgress as tqdm_class. If it truly tracks the individual file downloads, then there will be a print statement for each file download, saying, for example:
However, this does not appear to be the case: As I expected, there is only a single print statement for the outer tqdm instance which tracks the progress of ALL file downloads:
Probably because the http_get function in file_download.py uses the tqdm imported from utils, and not from a user supplied tqdm class. Which my commits above fixes, by letting the user supply a custom "inner_tqdm" class: Perhaps I am wrong and your solution does indeed solve my problem. I don't often use this (what I would describe as a garbage) programming language. In fact, this is my first time seeing the |
@kevinMEH Ah yes sorry, I was not clear and precise in my answer before. The If I understood correctly, you want to be able to pass a custom tqdm to |
Hello, As mentioned above, tracking the progress of individual file downloads by passing in a custom tqdm class is useful for... tracking the progress of individual file downloads. It is not about adding any specific customization, but letting the developer add any customization they want. Currently, the progress of individual file downloads is only logged to the console, but what if our user is not using a console interface? For example, the user might be requesting file downloads through a web application. In that case, we will need some other way to provide file download progress information to the user, not just by printing it out to the console. We can do this by allowing ourselves to pass in a custom tqdm class to http_get(). In this custom class, we can then choose to write progress information to somewhere other than the console, for example to an array, which we can then read from and send to the user through an event stream. Please see this comment for an example: |
@kevinMEH got it, thanks for the context and explanation! I will check your PR more in details later and see how to implement this in the simplest way possible. |
The simplest implementation has been provided in the 3 commits above, should not be too hard to review and verify its correctness (whenever you have the time to do so of course). |
Bump |
Hi @kevinMEH,
One alternative would be to implement a global configuration system that allows customizing progress bars across the entire library. Of course this still needs to be discussed and prioritize. from tqdm.auto import tqdm
from huggingface_hub.utils import configure_progress_bars
class CustomProgress(tqdm):
def update(self, n=1):
super().update(n)
# Add custom logging here
# override huggingface_hub.utils.tqdm with CustomProgress
configure_progress_bars(tqdm_class=CustomProgress)
snapshot_download(...)
# or with a context manager
with configure_progress_bars(tqdm_class=CustomProgress):
snapshot_download(...) This is one alternative solution but there may be other better approaches as well! |
Hello, Thanks for your response, however, I am a bit confused about the points you've made. First, you mentioned that the addition would cause snapshot_download() to have additional clutter and complexity, but snapshot_download() already accepts 21 parameters. Surely 2 more can be added without adding too much "complexity" to the API? Especially since the two parameters are extremely easy to explain and for anyone to understand. inner_tqdm_class enables users to track per file download progress. skip_outer_tqdm enables users to skip the outer iteration of tqdm. If users are already willing to read documentation for 21 arguably much more complex parameter options, surely they are fine with letting their eyes skip over two additional sentences? For the inner_tqdm_class parameter; you've already accepted a parameter for tqdm_class, which tracks the outer tqdm iteration. Why not add another class which tracks the inner tqdm iterations too? It is a logical next addition to the snapshot_download() option. If you don't want to add the skip_outer_tqdm parameter, that's fine, I will amend my commits, but there is no real reason to not add inner_tqdm_class. As you know, there is a vast difference between tracking total file downloads and individual file downloads. Let's take the average repository: There is probably a README file and some other small json configuration files or whatever. And then there are the extremely large model parameter files. The current method of only tracking total file download progress gives no hint as to the current download progress of these larger files, so why not add an inner_tqdm_class which does let developers relay such download information to the user? On your second point, let's not make the logical leap that adding an inner_tqdm_class in this case will result in a slippery slope situation which results in all other functions also needing an inner_tqdm_class parameter added. As you mentioned in your first point, "the use case hasn't been frequently requested", so probabilistically speaking, there is no real need to worry about anyone else asking for an "inner_tqdm_class" parameter to be added for any other functions. If does exist other people asking for such an addition, then it would surely mean that the use case is frequently requested, which would again justify such an addition. The alternative solution you proposed for tracking individual file downloads is much more complicated than the current solution, which is simple (simply have http_get() instantiate its tqdm instance using a passed in class) and basically 0 cost (it simply adds two if statements). I don't want to make life hard for you or other maintainers (and for me), so I believe that the current solution is the best solution available. I sincerely hope you will change your mind, as these feature additions are very useful to me. |
Hello @kevinMEH, For users needing file-level granularity and progress tracking, here is what we recommend:
I hope this solution works for you, let me know if you want to update your PR accordingly. |
Hello, Thanks for your response. I am simply unable to understand the logic behind allowing the implementation of total file download progress tracking through tqdm_class (which again, does not reveal much information to the user about actual file download progress), but strong opposition against an OPTIONAL method of also tracking the individual download progress of each file (nobody is forcing anyone who uses snapshot_download() to use inner_tqdm_class)... Which is already present in the form of progress printed to the command line, but somehow allowing users to access the information by passing in an inner_tqdm_class parameter is too much? We are not negotiating some kind of complicated international criminal law treaty here, this is simply a feature request that fleshes out already existing (but not thoroughly implemented) functionality in snapshot_download(). Even if there is a need to strictly conform to some kind of standard, it is simply illogical to not allow the addition of inner_tqdm_class tracking individual file download progress when tqdm_class already tracks total file download progress. I apologize for any inconveniences, but would any other maintainer be available to lend a second opinion on the current subject matter? |
I 100% agree with @hanouticelina 's opinion explained in #2718 (comment).
Adding more arguments means more maintenance cost, that's the main reason. No matter if it's a one-line change or a 100-lines change, adding arguments has a cost. In this case we've added a |
Hello, Thanks for your response. By your definition, any genuine feature contribution is going to result in some amount of "maintenance cost", and I am sadden by the fact that such contributions are not desired at the moment. I have received the memo; only contributions that fix slight grammar issues or slightly improve translations made by individuals who desperately need some kind of showcase on their resumes or Github profiles are desired at the moment. I wish to bother no longer, and I hope everyone here has a wonderful afternoon/evening. Thank you. |
1: Fixed a bug regarding the type of the tqdm_class argument of snapshot_download().
2: Added the ability to track progress of individual file downloads in snapshot_download().
Users can now pass an
inner_tqdm_class
argument to snapshot_download(), which will be eventually passed onto http_get(), which will use that class if available to instantiate its instance of tqdm.This enables the user to track the progress of individual file downloads, rather than overall progress (tracked through the regular
tqdm_class
argument passed into snapshot_download()).Example helpful scenario:
In my Local-LLM project, I would like to relay the progress of model downloads to the user. Presently, there is no mechanism for such behavior, but through this addition, I can now pass in a custom tqdm class into
inner_tqdm_class
and receive progress updates through my custom class.Note: The code below is still a work in progress, but should be sufficient to help illustrate the utility of the addition.
Feel free to edit.