feat: Add remove_invisible_texts method to AsyncPlaywrightCrawlerStr… #332

nelzomal · 2024-12-09T09:29:27Z

New Feature

Added functionality to detect and remove small or invisible text nodes.
Logic is gated behind a remove_invisible_texts flag for optional use.

High-Level Design

Currently, the layout-related logic is implemented within the async_crawler_strategy, as layout details are best retrieved during the crawling phase when the web driver renders the page. This allows for efficient detection since the page is already fully rendered.

However, I propose saving the layout information during the crawling phase and leveraging this data to implement more advanced heuristics during the subsequent scrape phase.
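To make the proposal concrete, here is a minimal sketch of what saved layout data could look like. Everything in it is hypothetical and not part of this PR: the `LayoutRecord` fields, the helper names, and the 6px font threshold are assumptions chosen only for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class LayoutRecord:
    """Hypothetical per-element layout snapshot saved by the crawl phase."""
    selector: str
    width: float
    height: float
    font_size_px: float


def dump_layout(records: list[LayoutRecord]) -> str:
    # Crawl phase: serialize the snapshot so it can travel with the result.
    return json.dumps([asdict(r) for r in records])


def load_layout(payload: str) -> list[LayoutRecord]:
    # Scrape phase: restore the snapshot without needing a live browser.
    return [LayoutRecord(**item) for item in json.loads(payload)]


def tiny_text_selectors(
    records: list[LayoutRecord], min_font_px: float = 6.0
) -> list[str]:
    # Example heuristic (assumed threshold): flag elements with very small fonts.
    return [r.selector for r in records if r.font_size_px < min_font_px]
```

The point of the round trip through JSON is that the scrape phase can apply heuristics like `tiny_text_selectors` after the browser has closed.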

I’m open to discussing this approach further before proceeding with the extended implementation.

Implementation Details:

Utilizes document.createTreeWalker to traverse the DOM and process text nodes.
Checks parent elements of text nodes for invisibility based on:

CSS properties (display, visibility, opacity).
Element dimensions (width, height < 1px).

Includes specific handling for special cases: text within links (<a> tags) and tooltips (role="tooltip"):

Only removed if the parent element's dimensions are too small (websites create very small links for accessibility purposes).
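The decision rules above can be sketched in plain Python for readers who want to reason about them outside the browser. This is not the code in the PR (which runs as JavaScript in the page); `ParentInfo` and the function names are hypothetical. The exact CSS values treated as invisible (`display: none`, `visibility: hidden`, `opacity: 0`) and the choice that either dimension below 1px counts as too small are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ParentInfo:
    """Hypothetical snapshot of a text node's parent element."""
    display: str = "block"
    visibility: str = "visible"
    opacity: float = 1.0
    width: float = 100.0
    height: float = 20.0
    tag: str = "div"
    role: str = ""


MIN_SIZE_PX = 1.0  # threshold taken from the "< 1px" rule above


def is_too_small(info: ParentInfo) -> bool:
    # Assumption: either dimension below the threshold counts as too small.
    return info.width < MIN_SIZE_PX or info.height < MIN_SIZE_PX


def is_hidden_by_css(info: ParentInfo) -> bool:
    # Assumption: these specific property values mean "invisible".
    return (
        info.display == "none"
        or info.visibility == "hidden"
        or info.opacity == 0
    )


def should_remove_text(info: ParentInfo) -> bool:
    is_special = info.tag == "a" or info.role == "tooltip"
    if is_special:
        # Links and tooltips: only the dimension check applies.
        return is_too_small(info)
    return is_hidden_by_css(info) or is_too_small(info)
```

Under these assumptions, a link hidden with `display: none` is kept, while a link whose box is narrower than 1px is removed.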

Test

I couldn't run the async tests locally, so I wrote a small test script.
I manually checked the 4 URLs: the feature works correctly, and the performance overhead is quite small.

script:

import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.cache_context import CacheMode


async def main():
    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
        urls = [
            "https://www.python.org",
            "https://www.stackoverflow.com",
            "https://developer.apple.com/documentation/photokit",
            "https://www.example.com",
        ]

        for j, url in enumerate(urls):
            print(f"starting url: {url}")
            # Crawl each URL twice: flag on (i == 0), then flag off (i == 1).
            for i in range(2):
                # Only the Apple documentation page (index 2) uses a wait selector.
                wait_for = "css:.topictitle" if j == 2 else None
                remove_invisible_texts = (i == 0)
                print(f"remove_invisible_texts: {remove_invisible_texts}")
                result = await crawler.arun(
                    url=url,
                    remove_invisible_texts=remove_invisible_texts,
                    wait_for=wait_for,
                    cache_mode=CacheMode.BYPASS,
                )

                # Save both outputs so they can be diffed by hand.
                with open(f"result_{j}_{i}.md", "w", encoding="utf-8") as f:
                    f.write(result.markdown)
            print("\n\n")


if __name__ == "__main__":
    asyncio.run(main())

…ategy - Introduced a new feature to remove small and invisible text elements from the crawled pages. - Enhanced the existing functionality by allowing users to specify the removal of small texts via the `remove_invisible_texts` parameter.

nelzomal mentioned this pull request Dec 9, 2024

Feature Request: Filtering for Small and Invisible Text #274

Open
