Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Follow redirects during scraping #3875

Merged
merged 3 commits into from
Jul 10, 2024

Conversation

zeskeertwee
Copy link
Contributor

What type of PR is this?

  • bug

What this PR does / why we need it:

URL's from albert heijn such as https://www.ah.nl/r/1195504 are shared from the albert heijn app, and fail to scrape because they return a 301 moved permanently which eventually leads to https://www.ah.nl/allerhande/recept/R-R1195504/romige-linguine-met-brie-artisjok-en-parmaham, which scrapes perfectly fine. This is fixed by telling httpx to follow redirects.
This allows users to directly import recepies from the app instead of having to first follow the redirects in a browser.

Which issue(s) this PR fixes:

URL's like https://www.ah.nl/r/1195504 from the albert heijn app fail to scrape

Testing

The aformentioned URL now scrapes fine as the redirects are followed, as seen in the logs:

INFO     2024-07-10T14:23:26 - HTTP Request: GET https://www.ah.nl/r/1195504 "HTTP/1.1 301 Moved Permanently"
INFO     2024-07-10T14:23:27 - HTTP Request: GET https://www.ah.nl/allerhande/recept/R-R1195504 "HTTP/1.1 301 Moved Permanently"
INFO     2024-07-10T14:23:27 - HTTP Request: GET https://www.ah.nl/allerhande/recept/R-R1195504/romige-linguine-met-brie-artisjok-en-parmaham "HTTP/1.1 200 OK"

@michael-genson
Copy link
Collaborator

LGTM, other than the lint issue. Once that's fixed we can merge

@zeskeertwee
Copy link
Contributor Author

zeskeertwee commented Jul 10, 2024

Thanks, formatting has been fixed in 1cf28a6

Copy link
Collaborator

@michael-genson michael-genson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

More info about the dismissed security advisory, copy-pasted for visibility:

This is part of the recipe scraper code so it's expected that we fetch user-supplied URLs. We don't authenticate these requests, so there is no elevated permissions, and we validate the HTML before importing a recipe

@michael-genson michael-genson enabled auto-merge (squash) July 10, 2024 16:41
@michael-genson michael-genson merged commit fd2dc15 into mealie-recipes:mealie-next Jul 10, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants