feat: Add remove_invisible_texts method to AsyncPlaywrightCrawlerStr… #332

Open
wants to merge 1 commit into
base: main

nelzomal (Contributor) commented Dec 9, 2024

New Feature

  • Added functionality to detect and remove small or invisible text nodes.
  • Logic is gated behind a remove_invisible_texts flag for optional use.

High-Level Design

Currently, the layout-related logic is implemented within the async_crawler_strategy, as layout details are best retrieved during the crawling phase when the web driver renders the page. This allows for efficient detection since the page is already fully rendered.

However, I propose saving the layout information during the crawling phase and leveraging it to implement more advanced heuristics during the subsequent scrape phase.

I’m open to discussing this approach further before proceeding with the extended implementation.
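
To make the proposal concrete, here is a minimal sketch of how layout data saved during crawling might be consumed during scraping. Everything in it is hypothetical: `LayoutRecord`, `scrape_visible_texts`, the node-ID keys, and the 1px default are illustrative names and values, not part of crawl4ai or of this PR.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class LayoutRecord:
    """Hypothetical per-node layout snapshot saved by the crawl phase."""
    width: float
    height: float
    opacity: float


def scrape_visible_texts(
    texts: List[Tuple[str, str]],
    layout: Dict[str, LayoutRecord],
    min_size_px: float = 1.0,  # assumed threshold, for illustration only
) -> List[str]:
    """Scrape-phase heuristic that relies only on the saved layout data."""
    kept = []
    for node_id, text in texts:
        record = layout.get(node_id)
        # Nodes without a saved record are kept: there is no evidence against them.
        if record is None or (
            record.width >= min_size_px
            and record.height >= min_size_px
            and record.opacity > 0
        ):
            kept.append(text)
    return kept


# Data the crawl phase would have stored while the page was rendered.
saved_layout = {
    "n1": LayoutRecord(width=120.0, height=18.0, opacity=1.0),
    "n2": LayoutRecord(width=0.5, height=0.5, opacity=1.0),
}
visible = scrape_visible_texts(
    [("n1", "Welcome"), ("n2", "tracking text"), ("n3", "Footer")],
    saved_layout,
)
```

The point of the split is that the scrape phase no longer needs a live browser: richer heuristics can be tried against the stored records without re-rendering the page.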

Implementation Details:

  1. Utilizes document.createTreeWalker to traverse the DOM and process text nodes.
  2. Checks parent elements of text nodes for invisibility based on:
     • CSS properties (display, visibility, opacity).
     • Element dimensions (width, height < 1px).
  3. Includes specific handling for special cases: text within links (<a> tags) and tooltips (role="tooltip"):
     • Only removed if the parent element’s dimensions are too small (websites create very small links for accessibility purposes).
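
The decision rule described in the list above can be sketched in Python as a pure function. This is not the PR's code, which runs as JavaScript inside the rendered page; `ParentStyle` and `should_remove_text` are hypothetical names. The sketch also assumes that "hidden" means display: none, visibility: hidden, or opacity of 0, and that an element counts as too small when either dimension is under 1px.

```python
from dataclasses import dataclass
from typing import Optional

MIN_SIZE_PX = 1.0  # assumed from the "width, height < 1px" rule


@dataclass
class ParentStyle:
    """Hypothetical snapshot of a text node's parent element."""
    tag: str
    display: str = "block"
    visibility: str = "visible"
    opacity: float = 1.0
    width: float = 100.0
    height: float = 20.0
    role: Optional[str] = None


def should_remove_text(parent: ParentStyle) -> bool:
    """Return True if text under this parent would be treated as invisible."""
    # Assumption: either dimension below the threshold counts as too small.
    too_small = parent.width < MIN_SIZE_PX or parent.height < MIN_SIZE_PX

    # Special cases: links and tooltips are removed only when too small.
    if parent.tag.lower() == "a" or parent.role == "tooltip":
        return too_small

    # Assumed CSS values that mark an element as hidden.
    hidden_by_css = (
        parent.display == "none"
        or parent.visibility == "hidden"
        or parent.opacity == 0
    )
    return hidden_by_css or too_small
```

For example, `should_remove_text(ParentStyle(tag="span", display="none"))` flags the text for removal, while a transparent link such as `ParentStyle(tag="a", opacity=0.0)` is kept because only its size is considered.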

Test

I couldn't run the async tests locally, so I wrote a small test script.
I manually checked the 4 URLs: the feature works correctly, and the performance overhead is quite small.

script:

import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.cache_context import CacheMode

URLS = [
    "https://www.python.org",
    "https://www.stackoverflow.com",
    "https://developer.apple.com/documentation/photokit",
    "https://www.example.com",
]


async def main():
    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
        for j, url in enumerate(URLS):
            print(f"starting url: {url}")
            # Only the Apple documentation page (index 2) needs an explicit wait.
            wait_for = "css:.topictitle" if j == 2 else None
            # Crawl each URL twice: i == 0 with the flag on, i == 1 with it off.
            for i, remove_invisible_texts in enumerate((True, False)):
                print(f"remove_invisible_texts: {remove_invisible_texts}")
                result = await crawler.arun(
                    url=url,
                    remove_invisible_texts=remove_invisible_texts,
                    wait_for=wait_for,
                    cache_mode=CacheMode.BYPASS,
                )
                # Save both outputs so they can be compared manually.
                with open(f"result_{j}_{i}.md", "w") as f:
                    f.write(result.markdown)
            print("\n\n")


if __name__ == "__main__":
    asyncio.run(main())

…ategy

    - Introduced a new feature to remove small and invisible text elements from the crawled pages.
    - Enhanced the existing functionality by allowing users to specify the removal of small texts via the `remove_invisible_texts` parameter.