Issues in image processing logic in content_scraping_strategy.py #345

Open
dmurat opened this issue Dec 13, 2024 · 1 comment

dmurat commented Dec 13, 2024

Issue 1: Index Out of Bounds for Relative Image URLs

When processing images with exclude_external_images=True, the code splits each image URL on '/' to compare domains, but this fails for relative URLs (e.g., 'assets/logo.svg'). The current implementation assumes every URL yields at least 3 segments when split by '/', so a relative URL with fewer segments raises an IndexError (list index out of range).

Current code: line 398 in content_scraping_strategy.py

src_url_base = src.split('/')[2]  # Fails for relative URLs like 'assets/logo.svg'
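
One possible way around this, as a minimal sketch rather than the project's actual fix: resolve the image src against the page URL with the standard library's urljoin, then compare hosts with urlparse. The helper name is_external_image and its page_url parameter are hypothetical and do not come from crawl4ai.

```python
from urllib.parse import urljoin, urlparse


def is_external_image(src: str, page_url: str) -> bool:
    """Hypothetical helper: True if src resolves to a different host than page_url."""
    # urljoin turns relative paths like 'assets/logo.svg' into absolute URLs,
    # and leaves already-absolute URLs unchanged.
    absolute = urljoin(page_url, src)
    src_host = urlparse(absolute).netloc
    page_host = urlparse(page_url).netloc
    # Assumption: URLs without a host (e.g. data: URIs) are not treated as external.
    return bool(src_host) and src_host != page_host
```

With this approach a relative URL is resolved to the page's own host and kept, while an absolute URL on another host is flagged as external. No positional indexing into the split URL is needed.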

Steps to reproduce:

  1. Use crawl4ai to scrape a page containing relative image URLs
  2. Set exclude_external_images=True
  3. Observe the error in logs: "Error processing element: exceptions must derive from BaseException"

For example:

    ...
    async with AsyncWebCrawler(verbose=True) as crawler:
        crawl_result = await crawler.arun(
            url="https://docs.astral.sh/uv/",
            exclude_external_links=True,
            exclude_external_images=True, 
            magic=True,
            cache_mode=CacheMode.BYPASS,
            verbose=True,
        )

Issue 2: Incorrect Exception Raising

The error handling code raises a string instead of an exception object, which Python 3 rejects with a TypeError because anything raised must derive from BaseException:

Line 418 in content_scraping_strategy.py

except Exception as e:
    raise "Error processing images"

This results in the log error message "[SCRAPE].. ◆ Error processing element: exceptions must derive from BaseException".
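
The behaviour can be reproduced in isolation. The sketch below is an assumption-laden illustration, not the crawl4ai code: the function names buggy and fixed are hypothetical, and RuntimeError is just one reasonable choice of exception class. It shows that raising a string produces the TypeError seen in the log, and that raising an exception object with "from e" keeps the original error attached.

```python
def buggy():
    try:
        # Same failure as in Issue 1: a relative URL has fewer than 3 segments.
        "assets/logo.svg".split("/")[2]
    except Exception:
        # Raising a str is not allowed in Python 3; this line itself
        # triggers TypeError: exceptions must derive from BaseException.
        raise "Error processing images"


def fixed():
    try:
        "assets/logo.svg".split("/")[2]
    except Exception as e:
        # Raise a real exception and chain the original IndexError as its cause.
        raise RuntimeError("Error processing images") from e
```

Calling buggy() surfaces only the TypeError, which hides what actually went wrong, whereas fixed() raises a RuntimeError whose __cause__ is the original IndexError.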

I'm a Python newbie, so my observations may be wrong. Please take this into account.
Thanks

unclecode (Owner) commented

@dmurat Thank you for your close attention to the codebase and for your analysis. I'll add this to the backlog, and by tomorrow we'll check it and see what's wrong. Thanks for sharing your code sample as well.

unclecode self-assigned this Dec 13, 2024
unclecode added the bug label Dec 13, 2024