-
-
Notifications
You must be signed in to change notification settings - Fork 795
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix: recipe scraper image cleaning #2139
Fix: recipe scraper image cleaning #2139
Conversation
enabled image cleaner added case for nested image dicts
case list(image): | ||
return image[0] | ||
case {"url": str(image)}: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it correct for this case to be completely removed though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just moved it slightly down so it's easier to read, it's not gone
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, it seems like i am a bit blind
case list(image): | ||
return image[0] | ||
case {"url": str(image)}: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, it seems like i am a bit blind
mealie/services/scraper/cleaner.py
Outdated
for image_data in image: | ||
match image_data: | ||
case str(image_data): | ||
return image_data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not a 100% sure about this, but as far as i remember this will ultimately get used by the recipe_data_service to retrieve the image. That service however also accepts a list of image-urls and will use that to determine the largest one, so it would make sense here to not just return the first hit but the full list of urls so the best one can decided on further down the chain
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fleshgolem I was also looking at this and confused on why this returns a string. We do have logic that fetches the largest image so I'm wondering if this should really just return a list to begin with?
@michael-genson I'm also not sure we need to nested match statement here - this is sort of what I was thinking
match image: # noqa - match statement not supported
case str(image):
return [image]
case [str(_), *_]:
return image
case [{"url": str(_)}, *_]:
return [x["url"] for x in image]
case {"url": str(image)}:
return [image]
case _:
raise TypeError(f"Unexpected type for image: {type(image)}, {image}")
That would catch all cases we're looking at now right? This returns the list, but could also just return the first resulting match as needed. I don't think we need recursion like I initially thought though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I'll update it and make sure it passes the original issue's scrape
Per discussion I refactored There's something to be said about the image tag being cleaned twice (@fleshgolem raised this in discord), but I wasn't comfortable dropping it in either place since we're mixing two different workflows (scraping and migrations) and I don't think both workflows use the cleaning functions the same way. |
What type of PR is this?
(REQUIRED)
What this PR does / why we need it:
(REQUIRED)
The recipe scraper wasn't accounting for nested dictionaries of image urls, as mentioned in #2087. I found that the scraper actually wasn't cleaning images at all, so I enabled that.
Which issue(s) this PR fixes:
(REQUIRED)
resolves #2087
Release Notes
(REQUIRED)