Issue 1: Index Out of Bounds for Relative Image URLs
When processing images with exclude_external_images=True, the code splits each image URL on '/' to compare domains, but this fails for relative URLs (e.g., 'assets/logo.svg'). The current implementation assumes every URL has at least 3 segments when split by '/', so a relative URL with fewer segments causes an index out of bounds error (an IndexError in Python).
Current code: line 398 in content_scraping_strategy.py
src_url_base=src.split('/')[2] # Fails for relative URLs like 'assets/logo.svg'
Steps to reproduce:
Use crawl4ai to scrape a page containing relative image URLs
Set exclude_external_images=True
Observe the error in logs: "Error processing element: exceptions must derive from BaseException"
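One possible way to avoid the index error is to resolve the image URL against the page URL before comparing hosts. The sketch below is only an illustration, not the project's actual fix: the function name is_external_image and the page_url parameter are hypothetical, and it uses only urljoin and urlparse from the standard library.

```python
from urllib.parse import urljoin, urlparse


def is_external_image(src: str, page_url: str) -> bool:
    """Return True if src resolves to a different host than page_url.

    Hypothetical helper for illustration; not part of crawl4ai.
    """
    # urljoin resolves relative paths against the page URL, so a value
    # like 'assets/logo.svg' inherits the page's host instead of
    # triggering an IndexError from src.split('/')[2].
    absolute = urljoin(page_url, src)
    src_host = urlparse(absolute).netloc
    page_host = urlparse(page_url).netloc
    return src_host != page_host


# The failing pattern from the report, reproduced in isolation.
try:
    "assets/logo.svg".split("/")[2]
    split_failed = False
except IndexError:
    split_failed = True

relative_is_external = is_external_image("assets/logo.svg", "https://example.com/docs/")
absolute_is_external = is_external_image("https://cdn.other.org/a.png", "https://example.com/docs/")
```

With this approach a relative URL is treated as belonging to the page's own host, which seems to match the intent of exclude_external_images.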
@dmurat Thank you for your close attention to the code base and your speculation. I'm going to add that to the backlog, and by tomorrow, we'll definitely check it and see what's wrong. Thanks for sharing your code sample as well.
Issue 2: Incorrect Exception Raising
The error handling code raises a string instead of an Exception object, which is invalid I think:
Line 418 in content_scraping_strategy.py
This results in the log error message "[SCRAPE].. ◆ Error processing element: exceptions must derive from BaseException".
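The behaviour can be reproduced outside the library. The snippet below is a minimal sketch with made-up function names and a placeholder message (the actual code at line 418 is not shown here): raising a plain string makes Python raise a TypeError instead, while wrapping the message in an exception class works as intended.

```python
def raise_string():
    # Invalid in Python 3: only BaseException subclasses or instances
    # can be raised, so this line itself raises a TypeError.
    raise "Error processing element"


def raise_exception():
    # Valid: the message is wrapped in an exception object.
    raise ValueError("Error processing element")


try:
    raise_string()
except TypeError as exc:
    string_message = str(exc)

try:
    raise_exception()
except ValueError as exc:
    proper_message = str(exc)
```

Here string_message holds the same "exceptions must derive from BaseException" text seen in the log, and the original message is lost; proper_message keeps it.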
I'm a Python newbie, so my observations may be wrong. Please take this into account.
Thanks