Quite slow when dealing with large sets #22
First of all, thanks for the great tool.

One thing that may be worth looking into is adding parallelism or some other kind of batching. Right now, on a 100 Mbit link it takes a number of hours to download a ~10G set, and the link only becomes saturated in bursts. This seems to indicate that the work around downloading a photo using the API is non-trivial and may benefit from being done in parallel.

I'm not sure if I'll have time to submit a pull request soon, so I'm submitting this ticket in lieu of one for now, to see your thoughts on adding parallelism support to flickr-download. When I have more time I will get some more profiles and see if there is any low-hanging fruit as well.

Comments
I'm happy that you like the tool! I haven't thought about download speed at all to be honest, so I'm not that surprised. In theory, it shouldn't be that hard to add some parallelism to the downloads, but I'm using the …
Seems like the Flickr API is not the snappiest: most calls take more than 200 ms, and often it takes multiple seconds per call. That could certainly explain some of the slowness.
One challenge here is that even though it looks like you can add … So an optimization would be to add the following extras to the … call: … (in the Flickr API, extras can include direct image-URL fields such as url_o, so each photo could then be fetched without a per-photo API call). One could then also parallelize those downloads if one was really fancy :)
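For illustration only, a minimal sketch of what parallelized downloads could look like, assuming the photo list has already been fetched with extras that include a direct image URL. The url_o field, the shape of the photos list, and all names here are assumptions for the sketch, not flickr-download's actual code.

# Hedged sketch: download photos in parallel once direct URLs are known.
# Assumes each photo dict carries "url_o" (from the API "extras") and "title".
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

import requests


def download_one(url: str, dest: Path) -> Path:
    # Stream a single photo to disk, skipping files that already exist.
    if dest.exists():
        return dest
    resp = requests.get(url, stream=True, timeout=60)
    resp.raise_for_status()
    with dest.open("wb") as fh:
        for chunk in resp.iter_content(chunk_size=1 << 16):
            fh.write(chunk)
    return dest


def download_all(photos: list[dict], out_dir: Path, workers: int = 8) -> None:
    # Fan the downloads out over a small thread pool; this is I/O-bound,
    # so threads are a reasonable fit.
    out_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            pool.submit(download_one, p["url_o"], out_dir / (p["title"] + ".jpg")): p
            for p in photos
        }
        for fut in as_completed(futures):
            fut.result()  # re-raise any download error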
That sounds good. I think that would minimize the relevance/downtime of API errors. I am of course available to test whenever 😃
It's a bit of a major operation to tweak the API calls, which I've tracked in #64. I likely won't get to that anytime soon. But in the current master, I've just added a --metadata flag (used together with --cache in the runs below) that records downloaded photo IDs so already-downloaded photos can be skipped without re-checking each one.
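The thread doesn't show the implementation, but the idea behind skipping by ID is a small persisted set of already-downloaded photo IDs that can be checked without a per-photo API call. A minimal sketch; the JSON file format and the class name are illustrative assumptions, not the tool's actual code:

# Hedged sketch of a metadata skip-store: a persisted set of photo IDs.
import json
from pathlib import Path


class MetadataStore:
    def __init__(self, path: Path):
        self.path = path
        # Reload IDs recorded by earlier runs, if the file exists.
        self.seen: set[str] = set(json.loads(path.read_text())) if path.exists() else set()

    def is_downloaded(self, photo_id: str) -> bool:
        return photo_id in self.seen

    def mark_downloaded(self, photo_id: str) -> None:
        # Persist after every photo so an abrupt exit loses nothing.
        self.seen.add(photo_id)
        self.path.write_text(json.dumps(sorted(self.seen)))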
I did a fresh run in a new folder with this new master and got to 1451 folders before I got a program-exiting 500 error (like before).
This doesn't do anything about the 500 error, but if you use the new flags, it should zip through the already-downloaded files and get back to actually downloading immediately?
Will it? I must've misinterpreted your comment, thinking it wouldn't do that. I used the arguments … Do I use those again when restarting? When I just did so, it started from the beginning, skipping just as slowly. Should I have specified anything more for the cache file? I didn't see any file in the download directory, or in the flickr_download folder in %Appdata%.
Are you sure you are using the latest version here from GitHub? I just did a run with a fresh install below. As you can see, the first run takes 9s and the second takes 0.198s. The last run is without caching and metadata enabled (i.e. the old way of skipping) and it takes 7.946s.

$ pip install https://github.com/beaufour/flickr-download/archive/refs/heads/master.zip
[pip output removed]
$ time flickr_download --cache cache.file --metadata --download 72157641626097033
INFO:root:Caching is enabled
INFO:root:Downloading Fasnet 2014
INFO:root:Saving: Fasnet 2014/Clown.jpg (https://www.flickr.com/photos/hhoch/13295370503/)
INFO:root:Saving: Fasnet 2014/Hopfennarr.jpg (https://www.flickr.com/photos/hhoch/13197784625/)
INFO:root:Saving: Fasnet 2014/Faselhannes.jpg (https://www.flickr.com/photos/hhoch/13164706264/)
INFO:root:Saving: Fasnet 2014/Schorrenweible.jpg (https://www.flickr.com/photos/hhoch/13011302384/)
INFO:root:Saving: Fasnet 2014/Kügele.jpg (https://www.flickr.com/photos/hhoch/12994078515/)
INFO:root:Saving: Fasnet 2014/Funkenmariechen.jpg (https://www.flickr.com/photos/hhoch/12974696174/)
INFO:root:Saving: Fasnet 2014/Rätsch.jpg (https://www.flickr.com/photos/hhoch/12934610845/)
INFO:root:Saving: Fasnet 2014/spanische Fliege _ Spanish Fly.jpg (https://www.flickr.com/photos/hhoch/12874357023/)
INFO:root:Saving: Fasnet 2014/Don Quijote.jpg (https://www.flickr.com/photos/hhoch/12853478335/)
INFO:root:Saving: Fasnet 2014/Afro.jpg (https://www.flickr.com/photos/hhoch/12831632914/)
INFO:root:Saving: Fasnet 2014/Queens.jpg (https://www.flickr.com/photos/hhoch/12822261443/)
INFO:root:Saving: Fasnet 2014/Dämonen _ demons.jpg (https://www.flickr.com/photos/hhoch/13325972784/)
INFO:root:Saving: Fasnet 2014/Keltenwächter vom Dickenwald.jpg (https://www.flickr.com/photos/hhoch/13359674385/)
INFO:root:Saving: Fasnet 2014/Die rote Spinne _ the red spider.jpg (https://www.flickr.com/photos/hhoch/13566149214/)
INFO:root:Saving: Fasnet 2014/Dämonen _ demons #02.jpg (https://www.flickr.com/photos/hhoch/13610444904/)
INFO:root:Saving: Fasnet 2014/Keltenwächter vom Dickenwald - green version.jpg (https://www.flickr.com/photos/hhoch/13642612284/)
INFO:root:Saving: Fasnet 2014/Federle.jpg (https://www.flickr.com/photos/hhoch/13741930325/)
flickr_download --cache cache.file --metadata --download 72157641626097033 1.09s user 0.18s system 13% cpu 9.407 total
$ time flickr_download --cache cache.file --metadata --download 72157641626097033
INFO:root:Caching is enabled
INFO:root:Downloading Fasnet 2014
INFO:root:Skipping download of already downloaded photo with ID: 13295370503
INFO:root:Skipping download of already downloaded photo with ID: 13197784625
INFO:root:Skipping download of already downloaded photo with ID: 13164706264
INFO:root:Skipping download of already downloaded photo with ID: 13011302384
INFO:root:Skipping download of already downloaded photo with ID: 12994078515
INFO:root:Skipping download of already downloaded photo with ID: 12974696174
INFO:root:Skipping download of already downloaded photo with ID: 12934610845
INFO:root:Skipping download of already downloaded photo with ID: 12874357023
INFO:root:Skipping download of already downloaded photo with ID: 12853478335
INFO:root:Skipping download of already downloaded photo with ID: 12831632914
INFO:root:Skipping download of already downloaded photo with ID: 12822261443
INFO:root:Skipping download of already downloaded photo with ID: 13325972784
INFO:root:Skipping download of already downloaded photo with ID: 13359674385
INFO:root:Skipping download of already downloaded photo with ID: 13566149214
INFO:root:Skipping download of already downloaded photo with ID: 13610444904
INFO:root:Skipping download of already downloaded photo with ID: 13642612284
INFO:root:Skipping download of already downloaded photo with ID: 13741930325
flickr_download --cache cache.file --metadata --download 72157641626097033 0.17s user 0.02s system 97% cpu 0.198 total
$ time flickr_download --download 72157641626097033
INFO:root:Downloading Fasnet 2014
INFO:root:Skipping Fasnet 2014/Clown.jpg, as it exists already
INFO:root:Skipping Fasnet 2014/Hopfennarr.jpg, as it exists already
INFO:root:Skipping Fasnet 2014/Faselhannes.jpg, as it exists already
INFO:root:Skipping Fasnet 2014/Schorrenweible.jpg, as it exists already
INFO:root:Skipping Fasnet 2014/Kügele.jpg, as it exists already
INFO:root:Skipping Fasnet 2014/Funkenmariechen.jpg, as it exists already
INFO:root:Skipping Fasnet 2014/Rätsch.jpg, as it exists already
INFO:root:Skipping Fasnet 2014/spanische Fliege _ Spanish Fly.jpg, as it exists already
INFO:root:Skipping Fasnet 2014/Don Quijote.jpg, as it exists already
INFO:root:Skipping Fasnet 2014/Afro.jpg, as it exists already
INFO:root:Skipping Fasnet 2014/Queens.jpg, as it exists already
INFO:root:Skipping Fasnet 2014/Dämonen _ demons.jpg, as it exists already
INFO:root:Skipping Fasnet 2014/Keltenwächter vom Dickenwald.jpg, as it exists already
INFO:root:Skipping Fasnet 2014/Die rote Spinne _ the red spider.jpg, as it exists already
INFO:root:Skipping Fasnet 2014/Dämonen _ demons #02.jpg, as it exists already
INFO:root:Skipping Fasnet 2014/Keltenwächter vom Dickenwald - green version.jpg, as it exists already
INFO:root:Skipping Fasnet 2014/Federle.jpg, as it exists already
flickr_download --download 72157641626097033 0.79s user 0.10s system 11% cpu 7.946 total
So I did everything in your message above, including installing the master from pip, and it did indeed work.

Trouble is, when I did it previously, the "caching is enabled" text did appear - but I had only updated flickr_download from its master. Double-checking, I don't see its dependencies having updates, so it seems I did manually install it correctly.

Anyway, I'm restarting the download process. I am not using the --download argument, but -u.

But I think there's an issue with the implementation. The cache file is only produced upon successful/normal exit of the program - it's not created as you're downloading. That means in my case, with a program-exiting error in the middle of an incomplete download, I won't have a cache file I can use.

Evidence: I watched how the cache file was only generated with your example after the download had successfully completed. I then closed CMD in the middle of downloading your example photoset, and the cache file wasn't produced. I'm guessing the same will occur when I hit a 500 error. I will let you know when this happens.
Hmm. Yes, you might be right in that. I only tested the happy path here.
Let me fix that. Sorry about that!
I just looked through the code and did some tests, and the metadata store should be written fine on errors. The API cache was not, which I've fixed in master now.
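A sketch of the shape such a fix could take, with a plain pickled dict standing in for the real cache; all names here are placeholders, not flickr-download's actual code. Flushing in a finally block means the cache reaches disk on success, on a 500, and on Ctrl-C alike.

# Hedged sketch: persist the API cache however the run ends.
import pickle
from pathlib import Path

CACHE_PATH = Path("cache.file")


def load_cache() -> dict:
    # Reload whatever an earlier run managed to save, if anything.
    return pickle.loads(CACHE_PATH.read_bytes()) if CACHE_PATH.exists() else {}


def save_cache(cache: dict) -> None:
    CACHE_PATH.write_bytes(pickle.dumps(cache))


def run_downloads(cache: dict) -> None:
    ...  # placeholder for the real download loop; may raise on a Flickr 500


def main() -> None:
    cache = load_cache()
    try:
        run_downloads(cache)
    finally:
        save_cache(cache)  # runs on success, on exceptions, and on Ctrl-C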
Thanks - I triple-checked that I installed the right master and left the program on last night. However, I didn't hit any errors for hours, so I had to manually exit the program to sleep. No cache file was created - as expected - but could an argument be added to produce the cache file upon manual exit? That would be the final step for dealing with downloads spanning multiple days. As it is, when I come back today, it has to use the slower initial checking process.

OK - I then hit a program-exiting error, and indeed a 74.4MB cache file was produced. Yet when I restart the program with the same arguments, it doesn't use it; it just restarts the old process. I am wondering if it produces a cache it can't do anything with, or is programmed not to do anything with. Perhaps the program needs a download to have occurred in the session for the cache to apply, and so it can't work during the 'blind' initial checking process. My hope is that the cache from the initial checking process tells the program which files to zip through, so it can then get to the files it actually has to check.

My apologies for this back and forth - hoping these aren't too fundamental changes.
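A cache that survives a manual exit is really a request for incremental persistence rather than save-on-exit. A hedged sketch of that idea, reusing save_cache from the snippet above (fetch_and_save is a stand-in for the real per-photo download): flush every N photos, so even a killed process leaves a usable cache behind.

# Hedged sketch: flush the cache periodically, not just at exit.
FLUSH_EVERY = 50


def fetch_and_save(photo_id: str) -> str:
    # Stand-in for the real per-photo download; returns the saved filename.
    return photo_id + ".jpg"


def run_downloads(cache: dict, photo_ids: list[str]) -> None:
    for count, photo_id in enumerate(photo_ids, start=1):
        if photo_id in cache:
            continue  # already handled; zip past without an API call
        cache[photo_id] = fetch_and_save(photo_id)
        if count % FLUSH_EVERY == 0:
            # A hard kill now loses at most FLUSH_EVERY photos' worth of cache.
            save_cache(cache)
    save_cache(cache)  # final flush on clean completion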
Am I doing something wrong? I got it to fail during the download stage with a 500 error, and it produced a cache file. I even closed the cmd window, then re-opened it with the same arguments, and it doesn't use the cache.

The file is … Guessing that … separates out each photo to download.
I don't quite understand what is going on here. Could you try to download the latest master, then run the command I used above twice, and share the output here?
I tested that as working before, including using the cache file. I couldn't trigger a program-exiting error with that single photoset, so I'm not sure it applies to my use case. So I ran --verbose on the photo list + cache I described in my last post, and then on the one you just described. This should allow comparison between successful and unsuccessful use of the cache.

It doesn't seem to start the cache-using process. Here is the cache_file of alookback.net with my API key removed: https://drive.google.com/file/d/1q9MXZRWgWLn6E3cAeH3PklBconuJaClO/view?usp=sharing

I have no trouble finding photo_id 51212763705, 51212763510, etc. inside - so I have no idea why the program won't recognise it. pip install log: …
You are not running the latest version from master (from yesterday), as that would have printed the full path to the cache file. So please update to the latest version from master. You can either copy the flick_download.py file into the correct location, or pip install directly from GitHub (pip install https://github.com/beaufour/flickr-download/archive/refs/heads/master.zip, as above).

If you run that twice with the arguments I have above, I can see if it successfully uses the cache file at all. Then we can tackle the API errors as a second step if needed.
Edit: Yes, I see it must've been the old version in my previous message, because it didn't show the cache save location.

Here is what you asked for. I deleted the previously downloaded files/cache to be sure, and also deleted the API key after use. As far as I can see, it is printing the full path to the cache file: …
Thanks! That seems to run exactly as advertised, right? It both has cache hits and quickly skips the downloaded files. I also just tested it myself with a private photo set, and it works fine. I've artificially triggered API errors, and it works fine. I've also killed the script, and it works. So I'm really puzzled as to why it's not working for you. Were you just running an old version? Is it working now?
Sorry - the hard drive housing the files had a catastrophic failure, so that had to be dealt with first. I just restarted the download process with the same arguments, got a 500 error and a cache file, and when I restarted it - success! It used the cache file to resume almost instantaneously.

However... once I hit a second 500 error, the cache file did indeed get written again (larger this time), but when I ran the program again with the same arguments, it didn't use the cache file and restarted the slow, old download process. Is this a situation where the program isn't designed to handle more than one error being raised? I'm not sure.
Ugh, sorry to hear about your hard drive! I hope you got it back to working order. Ah, the cache times out after 3600 seconds... which means it's gone when you restart after a long download. But the metadata store should help (i.e. --metadata).
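If that 3600-second expiry comes from the python flickr_api library's response cache, a longer-lived cache may be a one-line configuration change. Treat SimpleCache, enable_cache, and their signatures below as assumptions about that library, not confirmed API:

# Hedged sketch: give the API response cache a week-long lifetime.
import flickr_api
from flickr_api.cache import SimpleCache

# timeout is in seconds; max_entries bounds memory use (both assumed names).
flickr_api.enable_cache(SimpleCache(timeout=7 * 24 * 3600, max_entries=20000))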