Incorrect behavior from ABP regex #48
Comments
Thanks.
Taking domains out of ABP was always going to be tricky. I don't know why we do it... but here we are. I am unfamiliar with the ABP format (only Santosh has worked on this code), so don't mind my asking: what would you expect the final entry to be when something like |
If I understand what this code is supposed to be doing correctly, I'd expect the output to be blank, because the regex shouldn't be matching this filter at all. This code is supposed to extract only cases where the entire domain is supposed to be blocked, but ABP filters support much more. So for example, the regex (correctly) does not match the following filter from the same list:
The soundcloud filter is supposed to block only ping requests, i.e. requests that originate from […]. An important corner case is what you should do with […]. I think I lean towards leaving them out and expecting users who want to block domains like this to use a more thorough, domain-oriented blocklist. |
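To make the distinction concrete, here is a minimal sketch of separating a filter's pattern from its `$` options, so that rules carrying options such as `ping` can be left out. The helper names `split_abp_filter` and `blocks_whole_domain` are hypothetical and not from this project, and splitting on the last `$` is an assumption that ignores regex-style filters, which may contain `$` themselves.

```python
def split_abp_filter(rule):
    """Split an ABP rule into (pattern, options) at the last '$'.

    Assumption: the rule is not a /regex/ filter, where '$' can be part
    of the pattern itself.
    """
    rule = rule.strip()
    pattern, sep, opts = rule.rpartition("$")
    if not sep:
        # No '$' at all: the whole rule is the pattern.
        return rule, []
    return pattern, [o for o in opts.split(",") if o]


def blocks_whole_domain(rule):
    """True only for rules of the form ||domain^ with no options, path or wildcard."""
    pattern, options = split_abp_filter(rule)
    return (
        not options
        and pattern.startswith("||")
        and pattern.endswith("^")
        and "/" not in pattern
        and "*" not in pattern
    )


print(split_abp_filter("||soundcloud.com^$ping"))
print(blocks_whole_domain("||soundcloud.com^$ping"))
print(blocks_whole_domain("||soundcloud.com^"))
```

With this approach, `||soundcloud.com^$ping` is skipped because it carries the `ping` option, while `||soundcloud.com^` is kept.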
@afontenot Is this a desirable fix?
import re
rexpr = re.compile(r"^(\|\||[a-zA-Z0-9])([a-zA-Z0-9][a-zA-Z0-9-_.]+)((\^[a-zA-Z0-9\-\|\$\.\*]*)|(\$[a-zA-Z0-9\-\|\.])*|(\\[a-zA-Z0-9\-\||\^\.]*))$", re.M)
# groups() for "||soundcloud.com^$ping": ('||', 'soundcloud.com', '^$ping', '^$ping', None, None)
# groups() for "||soundcloud.com^": ('||', 'soundcloud.com', '^', '^', None, None)
for txt in ["||soundcloud.com^$ping", "||soundcloud.com^"]:
    m = rexpr.search(txt)
    if m is None:
        # no matches
        continue
    g = m.groups()
    if g[3] is not None and len(g[3]) > 1:
        # is an abp url entry
        continue
    # there's just the domain name
    domain = g[1]
|
@ignoramous That does look like it would solve the problem, but if I understand the intent correctly, along with the limitations of domain-based blocking versus the ABP syntax, I don't see why you wouldn't just drastically simplify the regex like so:
import re
rexpr = re.compile(r"^\|\|[^/\n]+\^$", re.M)
txt = """||soundcloud.com^$ping
||soundcloud.com^
||domain1.com^
.soundcloud.com^
||soundcloud.com/path/^
||subdomain.domain2.com^"""
assert rexpr.findall(txt) == ['||soundcloud.com^', '||domain1.com^', '||subdomain.domain2.com^']
Seems to work okay. Maybe there are some corner cases I'm not considering? |
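As a rough sketch of how the simplified pattern could yield bare domains rather than whole rules, a capture group can be added around the hostname part. The names `DOMAIN_RULE` and `extract_domains` are hypothetical, and this is not necessarily how the project applies the regex.

```python
import re

# Same pattern as above, with a capture group around the part between
# the leading "||" and the trailing "^".
DOMAIN_RULE = re.compile(r"^\|\|([^/\n]+)\^$", re.M)


def extract_domains(filter_text):
    """Return the captured host part of every ||host^ rule in the text."""
    return [m.group(1) for m in DOMAIN_RULE.finditer(filter_text)]


sample = """||soundcloud.com^$ping
||soundcloud.com^
||domain1.com^
.soundcloud.com^
||soundcloud.com/path/^
||subdomain.domain2.com^"""

print(extract_domains(sample))
# One possible corner case: the character class still admits wildcards.
print(extract_domains("||ads.*^"))
```

One corner case worth noting: because `[^/\n]+` accepts any character other than `/` and a newline, a rule such as `||ads.*^` would still be captured as `ads.*`, so a hostname check after extraction may still be needed.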
LGTM. Thanks. Want to send a pull request? (: The current regex, I believe, also checks if the entries are valid |
Added: de0db78#diff-0754e2389b0a4982ce0106c260cf218699e6f2061105043cc24acc7398d1d38bR273 Hopefully, it works just fine. Goes without saying: Thanks a bunch for your inputs, appreciate it (: |
Reopening, since no matter the regex I try, some list or the other trips it up. In the interim, I've replaced one or two |
If you have an example of a broken filter, I could have a look at it. |
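One way to surface such examples might be to run both regexes from this thread over a list and print the lines where they disagree. This is only a sketch: `OLD`, `NEW` and `disagreements` are hypothetical names, and the two patterns are copied from the comments above rather than from the repository.

```python
import re

# The longer regex proposed earlier in this thread.
OLD = re.compile(
    r"^(\|\||[a-zA-Z0-9])([a-zA-Z0-9][a-zA-Z0-9-_.]+)"
    r"((\^[a-zA-Z0-9\-\|\$\.\*]*)|(\$[a-zA-Z0-9\-\|\.])*|(\\[a-zA-Z0-9\-\||\^\.]*))$",
    re.M,
)
# The simplified regex proposed earlier in this thread.
NEW = re.compile(r"^\|\|[^/\n]+\^$", re.M)


def disagreements(lines, first=OLD, second=NEW):
    """Return (line, first_matches, second_matches) where the two patterns differ."""
    out = []
    for line in lines:
        a = first.search(line) is not None
        b = second.search(line) is not None
        if a != b:
            out.append((line, a, b))
    return out


for item in disagreements(["||soundcloud.com^$ping", "||soundcloud.com^"]):
    print(item)
```

Each reported line is a candidate test case, since at least one of the two patterns must be handling it differently from what was intended.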
I think I am done with the blocklist rewrite now and can focus on tests.
There's a bunch that break. I tried iterating upon the |
A user writes:
Reproduce:
Expected:
Actual:
Presumed reason:
I assume the
Workaround:
I'm not familiar with block list syntax, but the related ping type option is described here: https://help.adblockplus.org/hc/en-us/articles/360062733293-How-to-write-filters#type-options
Suggested change:
|
Thanks for forwarding. I would like to add that I think using pure hostlist equivalents for blocklists (as you attempted already, @ignoramous), where they exist, would be preferable, as this avoids dealing with unsupported syntax altogether. I don't know whether these exist for the block lists currently provided, though. For the EasyPrivacy list, I expected to find it here, but I couldn't figure out which one it is, or whether there is one at all. Anyway, for future support of custom block lists (if planned), it would still be an advantage if only pure domains were imported. |
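A minimal sketch of what "only pure domains" could mean in practice: accept an entry only if every dot-separated label looks like a hostname label. The function name `is_plain_domain` is hypothetical, and the rules used here (at least two labels, letters, digits and hyphens only, no underscores) are assumptions, not the project's actual validation.

```python
import re

# One hostname label: 1 to 63 characters, letters, digits or hyphens,
# not starting or ending with a hyphen.
_LABEL = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?", re.IGNORECASE)


def is_plain_domain(entry):
    """True if entry looks like a bare domain name with no filter syntax."""
    host = entry.strip().rstrip(".")
    if not host or len(host) > 253:
        return False
    labels = host.split(".")
    if len(labels) < 2:
        # Assumption: single-label names such as "localhost" are rejected.
        return False
    return all(_LABEL.fullmatch(label) is not None for label in labels)


for candidate in ["soundcloud.com", "||soundcloud.com^$ping", "soundcloud.com/path"]:
    print(candidate, is_plain_domain(candidate))
```

A check like this rejects anything containing ABP syntax such as `||`, `^`, `$` or a path, though it does not distinguish IP-like strings from domain names.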
When using the EasyPrivacy list, the resulting domain set includes soundcloud.com, which is incorrect. This is the result of the rule:
||soundcloud.com^$ping
Test case:
Result: