Hi,
to me it looks like '--blockRules' blocks an entire page when the URL of a sub-element, such as an iframe's content, matches one of the passed regexes.
Is that correct?
Or what is the exact mechanism?
And if my assumption is true, would it be possible to add an option that excludes only the matching elements but still collects the rest of the page?
Many thanks
Hi @steph-nb, the block rules target requests from specific URLs, so if you have a page at example.com with an iframe loading content from othersite.com and add a block rule matching othersite.com, the overall page at example.com should still be captured but the iframe content from othersite.com should be blocked.
If you're seeing behavior that deviates from this, I'm happy to look into it further!
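The behavior described above can be illustrated with a minimal sketch. This is not browsertrix-crawler's actual implementation; it only assumes that each block rule is a regex tested against every request URL, and the function names and sample URLs are hypothetical.

```python
import re


def is_blocked(request_url, block_patterns):
    """Return True if any pattern matches anywhere in the request URL."""
    return any(re.search(p, request_url) for p in block_patterns)


def filter_requests(request_urls, block_patterns):
    """Split the requests made by one page into (allowed, blocked) lists."""
    allowed, blocked = [], []
    for url in request_urls:
        if is_blocked(url, block_patterns):
            blocked.append(url)
        else:
            allowed.append(url)
    return allowed, blocked


# Hypothetical requests made while loading the page at example.com.
requests_for_page = [
    "https://example.com/",
    "https://example.com/style.css",
    "https://othersite.com/embed/123",  # iframe content
]

allowed, blocked = filter_requests(requests_for_page, [r"othersite\.com"])
print("allowed:", allowed)
print("blocked:", blocked)
```

Under this assumption, only the iframe request lands in `blocked`, while the page itself and its other resources remain in `allowed`.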
Hi @tw4l, many thanks for your answer. I am not yet sure whether the behaviour of browsertrix-crawler is wrong or my syntax for crawler_extra_args in browsertrix is.
How would you enter multiple regexes for blockRules in crawler_extra_args in the values.yaml of browsertrix, so that all matching content is blocked on any page visited?
Hi @tw4l,
I tried several ways to configure this parameter via the values.yaml of browsertrix.
Here are some examples:
a)
crawler_extra_args: '--rolloverSize 100000000 --blockRules [".youtube.",".facebook.",".stats\.i-web\.ch.",".stats4\.i-web\.ch.",".onLogin.",".start_date.",".matomo."]'
b)
crawler_extra_args: '--rolloverSize 100000000 --blockRules ["youtube"]'
Both variants blocked much more than just the intended page elements.
See for instance:
Question 1:
How would you pass this parameter via values.yaml?
Question 2:
If my configuration is already correct, could you rework the functionality so that only the matching elements are excluded?
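One way to narrow down over-blocking is to check offline which URLs a set of patterns would match. The sketch below uses some of the patterns from example a) exactly as posted and assumes they are applied as unanchored regex searches against the full URL. That assumption, the helper name `matching_patterns`, and the sample URLs are illustrative only and are not taken from browsertrix-crawler.

```python
import re

# Patterns copied as posted in example a); an unescaped "." matches any character.
PATTERNS = [r".youtube.", r".facebook.", r".stats\.i-web\.ch.", r".matomo."]


def matching_patterns(url, patterns=PATTERNS):
    """Return the patterns that match somewhere in the given URL."""
    return [p for p in patterns if re.search(p, url)]


# Hypothetical URLs: an embedded video, an ordinary page whose path happens
# to contain a blocked word, a statistics script, and an unrelated page.
samples = [
    "https://www.youtube.com/embed/abc",
    "https://example.com/news/our-youtube-channel",
    "https://stats.i-web.ch/piwik.js",
    "https://example.com/about",
]

for url in samples:
    print(url, "->", matching_patterns(url))
```

If the rules were applied this way, any URL that merely contains a blocked word would match, including ordinary page URLs, which could account for more being blocked than intended.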