Add machine readable download progress option #12084
I don't think the Windows CI failure is my fault. Not sure what's going on there.
I think using a common data format for the progress data would be good. Right now, devs would have to implement a custom parser. This is both tedious and error-prone. Using a common data format also means that we can more easily change the output in the future, because we don't have to worry about breaking any custom parsers. I would propose using JSON. Example: PROGRESS:{"file":"name.whl","current":1234,"total":87654321} This would allow devs to simply look for the "PROGRESS:" prefix and then use any old JSON parser to get the data. We would have to guarantee that our JSON doesn't span multiple lines (making it closer to JSONL), but I think that's okay. I also think that the Another idea would be to support arbitrary format strings, similar to format strings supported by git. Users could say While flexible, the issue with this approach is documentation. We would have to document all features and placeholders of the format syntax.
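To illustrate the consumer side of the proposal above, here is a minimal sketch. It assumes the `PROGRESS:` prefix and the `file`, `current`, and `total` keys from the example in the comment; the helper name `parse_progress_line` is hypothetical and is not part of pip.

```python
import json
from typing import Optional

# Prefix taken from the example in the comment above; it is a proposal,
# not a format that pip guarantees.
PREFIX = "PROGRESS:"


def parse_progress_line(line: str) -> Optional[dict]:
    """Return the decoded progress payload, or None for any other line."""
    line = line.strip()
    if not line.startswith(PREFIX):
        return None
    try:
        payload = json.loads(line[len(PREFIX):])
    except json.JSONDecodeError:
        return None  # malformed payload: treat it as ordinary output
    return payload if isinstance(payload, dict) else None


sample = 'PROGRESS:{"file":"name.whl","current":1234,"total":87654321}'
info = parse_progress_line(sample)
if info is not None:
    percent = 100 * info["current"] / info["total"]
    print(f'{info["file"]}: {percent:.3f}%')
```

Because each record sits on a single line, a caller can feed every stdout line through this function and ignore whatever returns `None`.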
Let's not over-engineer this. There's been (as far as I know) only one person (the OP here) who has ever requested this. It doesn't warrant a complex solution with all of the associated maintenance overheads. As I said here I still sort of feel this is something that we should look to rich to support - at the very least, it's worth asking them if it's something they would consider.
I've opened a discussion in the rich repo here: Textualize/rich#3000. I know I'm the only person in the world that wants this, but this is currently blocking me again, and it's really frustrating to be unable to do something that on the surface seems so simple.
True.
I agree and disagree. I agree in the sense that rich should have an option to force terminal mode. In fact, it does have that option, but it's only accessible to library users (here pip) but not to end users (there is no env variable we could set). See #11172 for that. However, even if rich supported this (via some mechanism), it wouldn't be with the explicit purpose of being machine-readable. AFAIK, rich doesn't guarantee a specific, strictly-defined text format for its progress output. So rich may choose to change the progress format at some point and thereby break our code. This is far from ideal. Frankly, I don't think it's rich's job to provide an API for progress.
Response on the rich discussion:
Sounds like, regardless, pip will have to do some implementation itself.
Alternatively, if you guys don't want to merge this, is there a recommended way I could import pip's internals to either manually invoke the dependency resolver/downloader and/or hook into it to do this same kind of thing just for myself? I know you don't recommend people use pip's internals and would rather people invoke it via CLI in a subprocess, but that's just out of the question for me without this change.
No, there isn't any sort of recommended way of using pip's internals. You're basically on your own - you can do it, but it's not supported and you'll have to work out how on your own. To be clear, I'm not against this - I'm mostly just rather indifferent to it. The code looks clean, and simple enough. It needs tests, and it needs documentation that describes exactly what the output format is (as it'll be a commitment we'll have to support from now on). It also probably shouldn't use
This really just comes down to large downloads seeming to do nothing. I don't want my users with slow internet connections seeing nothing happen for 5 hours while PyTorch's 3GB wheel downloads, then giving up because to them it's just frozen. Even with fast internet it still takes quite some time, and there's no way with just running pip in a subprocess to easily track this. You'd probably cringe if you saw the disaster that is my current workaround for this. I basically start a pip install, regex match the wheel or URL pip logs out that it's downloading, cancel the pip install, then use the PyPI API to grab a list of all the wheels, then I find the one that matches the one pip said it was downloading, then I download that myself (so I can track progress), then I tell pip to install that wheel. Separately I also attempted specifying my own download directory and using file watchers to track download progress, but that was less reliable. Right now I'm rewriting my code so that the backend handles the pip installs itself (because as-is the frontend can't invoke pip installs on remote backends) and now I'm either tasked with reimplementing that same workaround in Python or doing something better (which I'd much rather do). Honestly it's starting to look like the most painless solution is just for me to include a build of my forked pip with my program and just use that instead of trying to integrate this officially. I'm willing to do the documentation and whatnot required to get this PR in, but I'm just getting the feeling that it isn't worth it, especially since this will probably take time to be released anyway and I'm trying to do this rewrite right now.
Joey isn't exaggerating. In my hometown, downloading PyTorch takes 2 hours. Simply showing "Still downloading" with no progress indication and not even the total download size is not viable.
Also, as noted in the discussion, this needs tests and documentation.
For documentation, I mean a section in the user guide explaining the intended use of this feature, the format of the output data, and how to detect it. It also needs to discuss the fact that (apart from the output this feature adds) pip doesn't typically guarantee that its output is stable, and so any code parsing pip's output is inherently tied to particular pip versions.
Thanks - that sucks, it's easy for me with a fast connection to forget not everyone has that advantage. You've persuaded me. As long as none of the other pip committers object and we get the docs and tests sorted, I'm willing to support this PR. I've added a review.
I think you'd be better off trying to unit-test this, with monkeypatching to control whether the code detects the stdout as a console or not. I'm definitely not comfortable with adding a download of a large external package to the test suite. The tests are already way too slow, and adding a deliberately slow test just to trigger a feature that's in itself fast isn't really acceptable.
In regards to the rate limiting: While looking at the spinners.py file I noticed that it already implements rate limiting via a nice little helper class, so I decided to just reuse that here since it is already the way pip is intended to function. The default value for timing here is also reused.
I also decided to change the format slightly. Instead of being all caps, progress now gets displayed with As for unit tests, I still have to work on those.
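The rate limiting mentioned above can be sketched as follows. This is an illustrative stand-in written for this discussion, not the helper class from pip's spinners.py; the class name, the `emit_progress` function, and the injectable `clock` parameter are all assumptions made so the example is self-contained and testable.

```python
import time
from typing import Callable, Iterable, Optional


class RateLimiter:
    """Hypothetical stand-in (not pip's helper): one update per interval at most."""

    def __init__(
        self,
        min_interval_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._min_interval = min_interval_seconds
        self._clock = clock
        self._last_update: Optional[float] = None  # None: nothing emitted yet

    def ready(self) -> bool:
        """True if enough time has passed since the last emitted update."""
        if self._last_update is None:
            return True
        return self._clock() - self._last_update >= self._min_interval

    def reset(self) -> None:
        """Record that an update was just emitted."""
        self._last_update = self._clock()


def emit_progress(chunk_sizes: Iterable[int], limiter: RateLimiter, write=print) -> int:
    """Report the running byte count, skipping updates that arrive too quickly."""
    current = 0
    for size in chunk_sizes:
        current += size
        if limiter.ready():
            write(current)
            limiter.reset()
    return current
```

One design consequence of this sketch: the last chunk may be skipped by the limiter, so a caller that needs a final 100% record would have to emit it explicitly after the loop.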
I'm not sure I'm going to be able to get tests for this working. I got the test suite running locally and messed around with one a bunch that was supposed to capture the error state (warning the user that json mode is not for TTYs) but it appeared to only ever log the "Downloading" text part. Based on some logging I did, I'm pretty sure the reason is that it's triggering the machine readable progress code ( I'd appreciate any suggestions, as I'm not really sure what to do about this.
As regards tests, I'm honestly not sure. Yes, this will be hard to test, which is unfortunate, but that's true of a lot of pip features. I'd start with a unit test of pip._internal.network.download._prepare_download, that passed a dummy Response object that behaved like a big, slow, download, and then iterate over the returned chunks, capturing the data pip writes to stdout and confirming that it contains the data you expect. You can do this both for dummy responses that have a length, and ones that don't. For the "is this a TTY" check, unit test get_download_progress_renderer and monkeypatch sys.stdout.isatty to give the result you want. Then check you get the expected return value.
Once you have those two, you should have a good enough feel for how to test this that you can add whatever else feels necessary (if anything).
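The "is this a TTY" check described above could be exercised along these lines. This is a minimal sketch under stated assumptions: `choose_renderer` is a hypothetical stand-in for the selection logic (it does not call pip's get_download_progress_renderer), the "json" and "fallback" return values are invented for illustration, and instead of pytest's monkeypatch fixture it swaps in a fake stream with the standard library's `unittest.mock`.

```python
import sys
from unittest import mock


def choose_renderer(bar_type: str) -> str:
    """Hypothetical stand-in, NOT pip's code: pick a renderer name.

    Assumes machine-readable output is only used when stdout is not a TTY.
    """
    if bar_type == "json" and not sys.stdout.isatty():
        return "json"
    return "fallback"


class FakeStream:
    """Minimal stdout replacement whose isatty() result is fixed by the test."""

    def __init__(self, is_tty: bool) -> None:
        self._is_tty = is_tty

    def isatty(self) -> bool:
        return self._is_tty

    def write(self, text: str) -> int:
        return len(text)

    def flush(self) -> None:
        pass


def renderer_for(is_tty: bool) -> str:
    """Run the selection logic with stdout pretending to be (or not be) a TTY."""
    with mock.patch.object(sys, "stdout", FakeStream(is_tty)):
        return choose_renderer("json")


print(renderer_for(False), renderer_for(True))
```

Replacing the whole stream avoids depending on whether the real `sys.stdout` object allows its `isatty` attribute to be overwritten.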
Sorry for the delay. In the time since the last reply I actually worked on my planned implementation using a build of my pip fork. It actually didn't even need the Anyway, I'm working on the suggested changes now. The tests still might take me a while to get right, but I'll give them another go.
This would be useful to me also. Limiting to non-tty is an unnecessary complication. I can't think of a situation where the caller explicitly asks for
@arenasys Yeah, I agree with that. Unfortunately, I've given up on this PR. I just ended up bundling my pip fork with these changes with my project and have been happily using this feature myself for the past year. After some time, I now think this (machine parseable progress) is something that should really be done throughout all of pip and not just for the download progress bar. I don't want to personally do that though, as this codebase is massive and confusing. And I couldn't figure out how to write the tests for just this feature, so I don't think I'd be able to do it for the whole codebase.
Progress reports are the one thing completely missing from pip's output (through pipes), so it's very user unfriendly for cases like GUIs using pip and wanting to show the status to the user. As for testing, I would think it should have the same type of testing as the current spinners and progress bars (is there any?). I'll open another PR.
I don't think there are any tests for that, as the progress bars are all handled by "rich" (the progress bar package). And for what I was saying about the machine readable progress everywhere, I mainly just meant that I dislike the inconsistency. In my program, I had to have multiple different ways of parsing messages depending on whether it was a standard pip message or a download JSON progress output. And pip just doesn't have install progress at all either (like it just says "installing packages..." and that's it). I agree this is probably the most important thing though, since it straight up doesn't output anything otherwise. I just think as a general feature I'd prefer an entire overhaul of how pip logs to make parsing overall way simpler and more useful.
Closing as #12586 implements this better.
Closes #11508 and obsoletes #11172
TLDR: If you call pip in a subprocess, it is impossible to read progress output because rich disables progress bar outputs in subprocesses by default. I had previously added an option to force the progress bar, but it was suggested to add a machine-readable option instead. This PR adds a new progress_bar option that just prints the download progress to stdout and bypasses rich entirely, allowing a program that controls pip to read stdout and parse download progress.
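For context on the controlling-program side, here is a minimal sketch of reading a child process's stdout line by line as it is produced. The child command is a stand-in that prints three fake lines; a real caller would pass its own pip command instead. The helper name `stream_lines` is hypothetical, and no pip flags are assumed here.

```python
import subprocess
import sys
from typing import Iterator, List


def stream_lines(cmd: List[str]) -> Iterator[str]:
    """Yield the child's stdout one line at a time while it is still running."""
    with subprocess.Popen(
        cmd, stdout=subprocess.PIPE, text=True, bufsize=1
    ) as proc:
        assert proc.stdout is not None  # guaranteed by stdout=PIPE
        for line in proc.stdout:
            yield line.rstrip("\n")


# Stand-in child process: prints three lines and flushes after each one.
child = [sys.executable, "-c", "for i in (1, 2, 3): print('step', i, flush=True)"]

for line in stream_lines(child):
    print("got:", line)
```

Iterating over the pipe rather than waiting for the process to exit is what lets a GUI or backend update its own progress display during a long download.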
I'm open to suggestions on how this should be formatted. Here is how it currently looks: