Fix: recipe scraper image cleaning #2139

michael-genson · 2023-02-16T17:36:41Z

What type of PR is this?

(REQUIRED)

bug

What this PR does / why we need it:

(REQUIRED)

The recipe scraper wasn't accounting for nested dictionaries of image urls, as mentioned in #2087. I found that the scraper actually wasn't cleaning images at all, so I enabled that.

Which issue(s) this PR fixes:

(REQUIRED)

resolves #2087

Release Notes

(REQUIRED)

recipe scraper has a higher chance of finding an image when scraping

enabled image cleaner added case for nested image dicts

fleshgolem · 2023-02-16T18:48:06Z

mealie/services/scraper/cleaner.py

        case list(image):
-            return image[0]
-        case {"url": str(image)}:


Is it correct for this case to be completely removed though?

I just moved it slightly down so it's easier to read, it's not gone

Right, it seems like I am a bit blind

fleshgolem · 2023-02-16T18:52:25Z

mealie/services/scraper/cleaner.py

        case list(image):
-            return image[0]
-        case {"url": str(image)}:


Right, it seems like I am a bit blind

fleshgolem · 2023-02-16T19:02:29Z

mealie/services/scraper/cleaner.py

+            for image_data in image:
+                match image_data:
+                    case str(image_data):
+                        return image_data


I am not 100% sure about this, but as far as I remember this will ultimately get used by the recipe_data_service to retrieve the image. That service, however, also accepts a list of image URLs and uses it to determine the largest one, so it would make sense here to return the full list of URLs rather than just the first hit, so the best one can be decided on further down the chain.

@fleshgolem I was also looking at this and was confused about why this returns a string. We do have logic that fetches the largest image, so I'm wondering if this should really just return a list to begin with?

@michael-genson I'm also not sure we need the nested match statement here. This is sort of what I was thinking:

    match image:  # noqa - match statement not supported
        case str(image):
            return [image]
        case [str(_), *_]:
            return image
        case [{"url": str(_)}, *_]:
            return [x["url"] for x in image]
        case {"url": str(image)}:
            return [image]
        case _:
            raise TypeError(f"Unexpected type for image: {type(image)}, {image}")

That would catch all the cases we're looking at now, right? This returns the list, but could also just return the first resulting match as needed. I don't think we need recursion like I initially thought, though.

LGTM. I'll update it and make sure it passes the original issue's scrape

michael-genson · 2023-02-16T20:33:11Z

Per discussion I refactored clean_image to return a list of strings, rather than a single string. At the final step (cleaning before loading data into the database) we select the first image in the list as the recipe image. Since clean_image always returns a non-empty list (using a default string), this will never produce an error.
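To illustrate why indexing the first element is safe, here is a simplified sketch of the "always non-empty" invariant. The names `image_urls_or_default` and `pick_recipe_image` and the `"no image"` placeholder are assumptions made for illustration; the real `clean_image` may handle unexpected input differently (for example by raising).

```python
from typing import Any

# Hypothetical placeholder value; the PR text only says "a default string".
DEFAULT_IMAGE = "no image"


def image_urls_or_default(image: Any, default: str = DEFAULT_IMAGE) -> list[str]:
    """Return a non-empty list of URL strings, falling back to the default."""
    if isinstance(image, str) and image:
        return [image]
    if isinstance(image, dict):
        image = [image]  # treat a single dict like a one-element list
    urls: list[str] = []
    if isinstance(image, list):
        for item in image:
            if isinstance(item, str) and item:
                urls.append(item)
            elif isinstance(item, dict) and isinstance(item.get("url"), str) and item["url"]:
                urls.append(item["url"])
    # The fallback guarantees the result is never empty.
    return urls or [default]


def pick_recipe_image(image: Any) -> str:
    """Final cleaning step: take the first URL; safe because the list is non-empty."""
    return image_urls_or_default(image)[0]
```

Because the fallback is applied before returning, `pick_recipe_image` never needs its own empty-list check.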

There's something to be said about the image tag being cleaned twice (@fleshgolem raised this on Discord), but I wasn't comfortable dropping it in either place since we're mixing two different workflows (scraping and migrations) and I don't think both workflows use the cleaning functions the same way.

updated image cleaner

b4a97af

enabled image cleaner added case for nested image dicts

fleshgolem reviewed Feb 16, 2023


refactored image cleaner to return a list of urls

896a50e

hay-kot merged commit 05e2566 into mealie-recipes:mealie-next Feb 20, 2023

michael-genson deleted the fix/scraper-image-cleaning branch February 20, 2023 01:53
