Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect processing of redirects using download_url() #3235

Closed
slipnitskaya opened this issue Jan 8, 2021 · 6 comments
Closed

Incorrect processing of redirects using download_url() #3235

slipnitskaya opened this issue Jan 8, 2021 · 6 comments

Comments

@slipnitskaya
Copy link
Contributor

slipnitskaya commented Jan 8, 2021

🐛 Bug

torchvision.datasets.utils.download_url() processes redirects incorrectly.
An attempt to download via URL that returns redirect headers fails and results in an empty file.

To Reproduce

The behavior can be reproduced using link to the CUB-200-2011 dataset. Here's the example execution:

url = 'http://www.vision.caltech.edu/visipedia-data/CUB-200-2011/CUB_200_2011.tgz'
root = 'data'
download_url(url, root)

Expected behavior

The file is expected to be downloaded correctly given URL.

Environment

  • PyTorch version: 1.7.1
  • OS: Ubuntu 20.04.1 LTS (x86_64)
  • Python version: 3.7 (64-bit runtime)
  • CUDA used to build PyTorch: 10.2
  • GPU models and configuration: GPU 0: GeForce GTX 1060

cc @pmeier

@datumbox
Copy link
Contributor

datumbox commented Jan 8, 2021

@slipnitskaya Thanks for reporting. I don't think simply allowing redirects will do it because the website now hosts the dataset on google drive.

@pmeier Thoughts?

@pmeier
Copy link
Collaborator

pmeier commented Jan 8, 2021

I never had to deal with redirects so take what I say with a grain of salt.


@slipnitskaya I agree, handling redirects is a good addition and I will review your PR later on. That being said I think @datumbox is right that redirecting won't help you with your problem since download_url does not work for files on Google Drive.

In order achieve what you want, we could include a simple regular expression in the beginning of download_url that checks if we want to download from google drive. If that is the case it could dispatch the call to download_file_from_google_drive.

@slipnitskaya
Copy link
Contributor Author

Thanks for reviewing!

@pmeier Would be great indeed to add support for dispatching of download requests to Google Drive inside download_url.
I could implement this feature, so it could be merged into the upstream sooner.

@pmeier
Copy link
Collaborator

pmeier commented Jan 11, 2021

Sounds good @slipnitskaya! I think we should split this into two separate PRs. You can simply ping me when it is ready for review.

@slipnitskaya
Copy link
Contributor Author

@pmeier Dispatching of requests to Google Drive has been added to download_url() (PR #3245)

datumbox added a commit that referenced this issue Jan 15, 2021
* Make download_url() follow redirects

Fix bug related to the incorrect processing of redirects.
Follow the redirect chain until the destination is reached or
the number of redirects exceeds the max allowed value (by default 10).

* Parametrize value of max allowed redirect number

Make max number of hops a function argument and assign its default value to 10

* Propagate the max number of hops to download_url()

Add the maximum number of redirect hops parameter to download_url()

* check file existence before redirect

* remove print

* remove recursion

* add tests

* Reducing max_redirect_hops

Co-authored-by: Vasilis Vryniotis <[email protected]>
Co-authored-by: Philip Meier <[email protected]>
Co-authored-by: Vasilis Vryniotis <[email protected]>
@pmeier
Copy link
Collaborator

pmeier commented Jan 19, 2021

Closed in #3236.

@pmeier pmeier closed this as completed Jan 19, 2021
facebook-github-bot pushed a commit that referenced this issue Jan 21, 2021
Summary:
* Make download_url() follow redirects

Fix bug related to the incorrect processing of redirects.
Follow the redirect chain until the destination is reached or
the number of redirects exceeds the max allowed value (by default 10).

* Parametrize value of max allowed redirect number

Make max number of hops a function argument and assign its default value to 10

* Propagate the max number of hops to download_url()

Add the maximum number of redirect hops parameter to download_url()

* check file existence before redirect

* remove print

* remove recursion

* add tests

* Reducing max_redirect_hops

Reviewed By: datumbox

Differential Revision: D25954556

fbshipit-source-id: 3b2c64592d5882b98e87acdb5efd95e9283d2862

Co-authored-by: Vasilis Vryniotis <[email protected]>
Co-authored-by: Philip Meier <[email protected]>
Co-authored-by: Vasilis Vryniotis <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants