Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: recipe scraper image cleaning #2139

Merged

Conversation

michael-genson
Copy link
Collaborator

What type of PR is this?

(REQUIRED)

  • bug

What this PR does / why we need it:

(REQUIRED)

The recipe scraper wasn't accounting for nested dictionaries of image urls, as mentioned in #2087. I found that the scraper actually wasn't cleaning images at all, so I enabled that.

Which issue(s) this PR fixes:

(REQUIRED)

resolves #2087

Release Notes

(REQUIRED)

recipe scraper has a higher chance of finding an image when scraping

enabled image cleaner
added case for nested image dicts
case list(image):
return image[0]
case {"url": str(image)}:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it correct for this case to be completely removed though?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just moved it slightly down so it's easier to read, it's not gone

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it seems like i am a bit blind

case list(image):
return image[0]
case {"url": str(image)}:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it seems like i am a bit blind

for image_data in image:
match image_data:
case str(image_data):
return image_data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a 100% sure about this, but as far as i remember this will ultimately get used by the recipe_data_service to retrieve the image. That service however also accepts a list of image-urls and will use that to determine the largest one, so it would make sense here to not just return the first hit but the full list of urls so the best one can decided on further down the chain

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fleshgolem I was also looking at this and confused on why this returns a string. We do have logic that fetches the largest image so I'm wondering if this should really just return a list to begin with?

@michael-genson I'm also not sure we need to nested match statement here - this is sort of what I was thinking

match image:  # noqa - match statement not supported
    case str(image):
        return [image]
    case [str(_), *_]:
        return image
    case [{"url": str(_)}, *_]:
        return [x["url"] for x in image]
    case {"url": str(image)}:
        return [image]
    case _:
        raise TypeError(f"Unexpected type for image: {type(image)}, {image}")

That would catch all cases we're looking at now right? This returns the list, but could also just return the first resulting match as needed. I don't think we need recursion like I initially thought though.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I'll update it and make sure it passes the original issue's scrape

@michael-genson
Copy link
Collaborator Author

michael-genson commented Feb 16, 2023

Per discussion I refactored clean_image to return a list of strings, rather than a single string. At the final step (cleaning before loading data into the database) we select the first image in the list as the recipe image. Since clean_image always returns a non-empty list (using a default string), this will never produce an error.

There's something to be said about the image tag being cleaned twice (@fleshgolem raised this in discord), but I wasn't comfortable dropping it in either place since we're mixing two different workflows (scraping and migrations) and I don't think both workflows use the cleaning functions the same way.

@hay-kot hay-kot merged commit 05e2566 into mealie-recipes:mealie-next Feb 20, 2023
@michael-genson michael-genson deleted the fix/scraper-image-cleaning branch February 20, 2023 01:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[v1.0.0b5] - Issue Scraping from Delish
3 participants