feat: Add remove_invisible_texts method to AsyncPlaywrightCrawlerStr… #332
+87
−1
New Feature
High-Level Design
Currently, the layout-related logic is implemented within the async_crawler_strategy, as layout details are best retrieved during the crawling phase when the web driver renders the page. This allows for efficient detection since the page is already fully rendered.
However, I propose saving the layout information during the crawling phase and leveraging this data to implement more advanced heuristics during the subsequent scrape phase.
I’m open to discussing this approach further before proceeding with the extended implementation.
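A minimal sketch of how layout data saved during the crawl could be consumed in the scrape phase. This is only an illustration of the proposal, not the code in this PR: `LayoutInfo`, `is_invisible`, and `visible_selectors` are hypothetical names, and the chosen fields (computed `display`, `visibility`, `opacity`, and the bounding-box size) are an assumption about what would be worth recording.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class LayoutInfo:
    """Hypothetical per-element snapshot recorded while the page is rendered."""
    selector: str
    width: float
    height: float
    display: str = "block"
    visibility: str = "visible"
    opacity: float = 1.0


def is_invisible(info: LayoutInfo) -> bool:
    """Heuristic only: treat an element as invisible if any common hiding signal is set."""
    return (
        info.display == "none"
        or info.visibility in ("hidden", "collapse")
        or info.opacity == 0.0
        or info.width <= 0.0
        or info.height <= 0.0
    )


def visible_selectors(snapshot: Iterable[LayoutInfo]) -> List[str]:
    """Scrape-phase filter: keep selectors whose saved layout looks visible."""
    return [info.selector for info in snapshot if not is_invisible(info)]


# Example snapshot, as it might be saved by the crawler (values are made up).
snapshot = [
    LayoutInfo("p.intro", width=600.0, height=40.0),
    LayoutInfo("div.hidden-seo", width=600.0, height=40.0, display="none"),
    LayoutInfo("span.faded", width=80.0, height=16.0, opacity=0.0),
    LayoutInfo("i.zero-size", width=0.0, height=0.0),
]
kept = visible_selectors(snapshot)
```

Because the snapshot is plain data, the scrape phase could apply stricter or looser rules later without re-rendering the page.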
Implementation Details:
Test
I couldn't run the async tests locally, so I wrote a small test script.
I manually checked the 4 URLs: the method works correctly, and the performance overhead is quite small.
script: