Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect behavior from ABP regex #48

Open
afontenot opened this issue Sep 21, 2022 · 11 comments
Open

Incorrect behavior from ABP regex #48

afontenot opened this issue Sep 21, 2022 · 11 comments
Assignees

Comments

@afontenot
Copy link

When using the EasyPrivacy list, the resulting domain set includes soundcloud.com, which is incorrect.

This is the result of the rule: ||soundcloud.com^$ping

Test case:

import re

# https://github.com/serverless-dns/blocklists/blob/f57b1d/download.py#L206
rexpr = re.compile(r"^(\|\||[a-zA-Z0-9])([a-zA-Z0-9][a-zA-Z0-9-_.]+)((\^[a-zA-Z0-9\-\|\$\.\*]*)|(\$[a-zA-Z0-9\-\|\.])*|(\\[a-zA-Z0-9\-\||\^\.]*))$", re.M)

txt = "||soundcloud.com^$ping"

print(re.search(rexpr, txt).groups())

Result:

('||', 'soundcloud.com', '^$ping', '^$ping', None, None)
@ignoramous
Copy link
Contributor

Thanks.

This is the result of the rule: ||soundcloud.com^$ping

Taking domains out of ABP was always going to be tricky. I don't know why we do it... but here we are. I am unfamiliar with the ABP format (only Santosh has worked on this code), and so don't mind me asking... what'd you expect in the final entry be when something like ||soundcloud.com^$ping exists in the original list?

@afontenot
Copy link
Author

afontenot commented Sep 21, 2022

If I understand what this code is supposed to be doing correctly, I'd expect the output to be blank, because the regex shouldn't be matching this filter at all. This code is supposed to extract only cases where the entire domain is supposed to be blocked, but ABP filters support much more.

So for example, the regex (correctly) does not match the following filter from the same list:

||discovery.com^*/events

The soundcloud filter is supposed to block only ping requests, i.e. requests that originate from <a ping or Navigator.sendBeacon(). A domain-based blocklist can't block these requests, and so the domain should be left out of the list entirely.

An important corner case is what you should do with ^$third-party blocklist entries (which are also currently included because of the regex). Technically, these domains are not supposed to be blocked if the user visits them directly. But the overwhelming majority seem to be tracking domains that it's hard to imagine anyone ever visiting deliberately.

I think I lean towards leaving them out and anticipating users who want to block domains like this to use a more thorough, domain-oriented blocklist.

@ignoramous
Copy link
Contributor

ignoramous commented Nov 8, 2022

@afontenot Is this a desirable fix?

import re

rexpr = re.compile(r"^(\|\||[a-zA-Z0-9])([a-zA-Z0-9][a-zA-Z0-9-_.]+)((\^[a-zA-Z0-9\-\|\$\.\*]*)|(\$[a-zA-Z0-9\-\|\.])*|(\\[a-zA-Z0-9\-\||\^\.]*))$", re.M)

txt = "||soundcloud.com^$ping"
# or: ||soundcloud.com^

# ('||', 'soundcloud.com', '^$ping', '^$ping', None, None)
# or: ('||', 'soundcloud.com', '^', '^', None, None)
g = re.search(rexpr, txt).groups()

if g is not None and len(g) >= 2:
    if len(g) >= 3 and g[3] is not None and len(g[3]) > 1:
        # is an abp url entry
        continue
    else:
        # there's but just the domain name
        domain = g[1]
else:
    # no matches
    continue

@afontenot
Copy link
Author

@ignoramous That does look like it would solve the problem, but if I understand the intent correctly and the limitations of domain based blocking versus the ABP syntax, I don't see why you wouldn't just drastically simplify the regex like so:

import re

rexpr = re.compile(r"^\|\|[^/\n]+\^$", re.M)

txt = """||soundcloud.com^$ping
||soundcloud.com^
||domain1.com^
.soundcloud.com^
||soundcloud.com/path/^
||subdomain.domain2.com^"""

assert rexpr.findall(txt) == ['||soundcloud.com^', '||domain1.com^', '||subdomain.domain2.com^']

Seems to work okay. Maybe there are some corner cases I'm not considering?

@ignoramous
Copy link
Contributor

I don't see why you wouldn't just drastically simplify the regex like so

LGTM. Thanks. Want to send a pull request? (:

The current regex, I beleive, also checks if the entries are valid abp syntax. Though, I'm all for simplicity.

@ignoramous
Copy link
Contributor

ignoramous commented Nov 12, 2022

Added: de0db78#diff-0754e2389b0a4982ce0106c260cf218699e6f2061105043cc24acc7398d1d38bR273

Hopefully, it works just fine.

Goes without saying: Thanks a bunch for your inputs, appreciate it (:

@ignoramous
Copy link
Contributor

ignoramous commented Nov 15, 2022

Reopening, since no matter the regex I try, some list or the other trips it up. In the interim, I've replaced one or two abp lists with their hostfile equivalent.

@afontenot
Copy link
Author

If you have an example of a broken filter, I could have a look at it.

@ignoramous
Copy link
Contributor

I think I am done with the blocklist rewrite now and can focus on tests.

If you have an example of a broken filter, I could have a look at it.

There's a bunch that break. I tried iterating upon the abp regex, but my python (and regex) skills eventually failed me as the failures kept appearing for different abp files. I am going to automate a test for this. Give me a few days, please.

@ignoramous
Copy link
Contributor

A user writes,

Reproduce:

Expected:

  • Website loads, rain radar is shown
  • Tried this with Mull + uBlock Origin + the EasyPrivacy block list from above imported into uBlock (while the same block list was disabled in RethinkDNS) and the website fully loads

Actual:

  • Website doesn't load. In the log you can see several *.wo-cloud.com requests getting blocked.
  • To confirm: This happens once I enable the EasyPrivacy block list in RethinkDNS and does not if it is deactivated (even when using the exact same block list in uBlock Origin)

Presumed reason:
The block list includes only one related entry:

  • ||wo-cloud.com^$ping

I assume the ping option of this entry is misinterpreted and incorrectly blocks the entire domain.

Workaround:
Trusting the individual domains sometimes helps, but sometimes doesn't. Sometimes the request appears in the log as maybe blocked, which I don't know the meaning of.

I'm not familiar with block list syntax, but the related ping type option is described here: https://help.adblockplus.org/hc/en-us/articles/360062733293-How-to-write-filters#type-options

Suggested change:

  • Preferably correct block list's options interpretation (if at all possible, I'm not familiar with this)
  • Alternatively remove domains with such options from blocking to avoid overblocking

From: celzero/rethink-app#1713 (comment)

@boernsen-development
Copy link

Thanks for forwarding.

I would like to add that I think using pure hostlist equivalents for blocklists (as you attempted already, @ignoramous), if existing, would be preferable as this avoids dealing with unsupported syntax at all. I don't know whether these exist for the block lists currently provided though.

For the EasyPrivacy list, I expected to find it here, but I couldn't figure out which one it is or if there is any.

Anyway, for future support of custom block lists (if planned), it would still be an advantage if only pure domains would be imported.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants