
Wrong redirection on Amazon #30

Closed
sumyungguy opened this issue Jul 11, 2017 · 27 comments

@sumyungguy commented Jul 11, 2017

Hi, thanks for this addon!

I had a problem with Amazon, being unable to go to sellers' storefronts. Eventually I solved it by putting sellercentral.amazon.com on the blacklist of Skip Redirect. I guess you don't do fixes for specific websites, but I wonder then if it would be helpful to have an option to not skip redirects if they go to the same domain? Mostly I don't like click-tracking for privacy reasons, but within the same domain it doesn't matter so much, and it would reduce site breakages like this.

Go to for example this seller's page:
https://www.amazon.com/sp?seller=AEHA1AAWZIC3L

At the top of that page, click "Apevia storefront", which has the link:
https://www.amazon.com/shops/AEHA1AAWZIC3L?ref_=v_sp_storefront

That seems to go to here:
https://sellercentral.amazon.com/gp/sc-redirect/seller-page.html/ref=v_sp_storefront[blahblah-mysterious-information]&servername=www.amazon.com&type=shops

But Skip Redirect then just goes to the root page of www.amazon.com, instead of to the storefront.

@sblask (Collaborator) commented Jul 11, 2017

I can't reproduce your example, but I see what you mean and your proposal doesn't sound like a bad idea at all. I'll have to do a bit of thinking first though. Given your example, matching on everything until the first slash won't work. Extracting the domain can be tough due to top level domains (https://en.wikipedia.org/wiki/Second-level_domain) which I wouldn't want to hardcode...

@sblask (Collaborator) commented Jul 12, 2017

I found https://publicsuffix.org/ yesterday (@crssi also referenced it in #29), but I am not really a fan of introducing sync functionality or updating the list manually every month. It is also a very long list, and checking it for every request with a redirect could slow down the whole process. And the whole point for me is the speed-up (I don't care much about the privacy aspect here).

So I thought I could implement a simple heuristic: do not skip if the domains match or end with the same 10 (the number could be configurable) characters. It's not perfect, but it could cover 90% of the cases?
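Such a tail-match heuristic might look like this (a minimal sketch; the function name and default length are illustrative, not actual add-on code):

```javascript
// Illustrative sketch of the tail-match heuristic: treat two hosts as the
// "same site" if they are identical or end in the same N characters.
function sameByTailHeuristic(hostA, hostB, tailLength = 10) {
    if (hostA === hostB) {
        return true;
    }
    return hostA.slice(-tailLength) === hostB.slice(-tailLength);
}

// Both end in "amazon.com", so the heuristic keeps the redirect:
sameByTailHeuristic("www.amazon.com", "sellercentral.amazon.com"); // true
// But it also matches unrelated registrants with a similar tail:
sameByTailHeuristic("notamazon.com", "amazon.com");                // true
```

The second call shows the imperfection: notamazon.com and amazon.com share their last ten characters even though they are different domains.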

@crssi commented Jul 12, 2017

OK... two problems:

  1. Sync or manual: the database could be loaded from https://publicsuffix.org/list/effective_tld_names.dat once after installing the extension, then refreshed once per month (or at a user-configurable interval in days), or whenever the user hits a sync button.

  2. Speed-up: skip the regex comparison and assume the database from the URL above is processed into an indexed, SQL-like internal database; then lookups might not be too costly CPU-wise, at least not for simple one- or two-level TLDs.

@sblask (Collaborator) commented Jul 12, 2017

I made a quick test and searching through the names takes something between 3 and 15 ms, with a lot of variation. I'm not sure whether that's acceptable, but either way I don't think the list is right for the purpose of this ticket. What I'd need is a list of top-level plus second-level domains, so that by splitting on dots I could identify amazon.com or amazon.co.uk as the right domain. Basically I want com or co.uk plus one more level of subdomain. The public suffix list contains loads of stuff. I can't take the values as is, because that would exclude amazon.com for example, but I can't use them and add another subdomain if necessary either, because there are very specific ones in the list already...

@sumyungguy (Author)

I hadn't realized it would be so difficult to figure out if something is in the same domain! But of course it makes sense now that I think about it.

I think you'd need more than top-level plus second-level domains. It looks like three-level public suffixes are very common in Japan, as are things like blogspot.co.uk, and there are even five-level ones with amazonaws. I don't see any robust option other than using the public suffix list. But there must be some optimized search code already available that you could use for it? For example, for domains ending in .com there are relatively few entries to check. If you can't use the list, then I'd say just forget about it, because it won't work the way people expect.

The public suffix list contains loads of stuff. I can't take the values as is, because that would exclude amazon.com for example, but I can't use them and add another subdomain if necessary either, because there are very specific ones in the list already...

Not sure what you mean by that...

@crssi commented Jul 13, 2017

@sblask:
I don't think you should treat amazon.com and amazon.co.uk as the same domain.

@sblask (Collaborator) commented Jul 13, 2017

That's not what I was talking about. I just meant that I can't simply split on dots to figure out the right domain. Given this example:

https://sellercentral.amazon.com/gp/sc-redirect/seller-page.html/ref=v_sp_storefront[blahblah-mysterious-information]&servername=www.amazon.com&type=shops

To extract amazon.com, I could just split the domain on dots and use the last two fragments. But this would fail for amazon.co.uk, because there I'd need three fragments. My idea was to take the list, identify the equivalent of com or co.uk in the given domain, and then add one more fragment. But the list doesn't work like that.
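The failure of the naive split can be shown in a couple of lines (naiveBaseDomain is an illustrative name):

```javascript
// Naive approach: keep only the last two dot-separated fragments.
function naiveBaseDomain(host) {
    return host.split(".").slice(-2).join(".");
}

naiveBaseDomain("sellercentral.amazon.com"); // "amazon.com" - correct
naiveBaseDomain("www.amazon.co.uk");         // "co.uk"      - wrong, "amazon" is lost
```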

There is a library that seems to do the right thing: https://github.com/oncletom/tld.js but it's rather complex and there is no simple way to include it...

@crssi commented Jul 13, 2017

I am just thinking out loud here (brainstorming).
Let's say https://publicsuffix.org/list/effective_tld_names.dat is used and parsed into SQLite, with indexes to speed up the process.
The process is simplified for clarity's sake.

Let's start with amazon.co.uk as an example of the TLD-and-domain search process.

The first part, up to the first dot reversed (.uk), can be skipped, since the search will obviously always return true.

  1. A search for co.uk (up to the second dot, reversed) is invoked and is obviously true, so it is a possible TLD candidate.
  2. We go on until false: a search for amazon.co.uk is invoked and in this case returns false.
     So co.uk is the TLD, and the last search value is the domain: amazon.co.uk.

For amazon.com everything stops already at step 1, where the TLD is .com and the domain part is the last (in this case also the first) search value: amazon.com.

Let's go with a longer example, which does not happen very often: page.domain.act.edu.au.
We skip the search for .au.

  1. A search for edu.au returns true, so we go on.
  2. A search for act.edu.au returns true, so we go on.
  3. A search for domain.act.edu.au returns false.
     Now we know that domain.act.edu.au is the domain and act.edu.au is the TLD... but we don't bother with the TLD, since the domain is the value we are searching for.
     We also know that page is a host or subdomain of domain.act.edu.au, but we don't care, since we are only looking for the domain part.

I am not sure if I have explained it well, but at least I tried. :)

Cheers
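The bottom-up search described above can be sketched as follows, with a tiny in-memory Set standing in for the SQLite database (a real implementation would load the entries from effective_tld_names.dat):

```javascript
// Tiny sample of suffix entries standing in for the parsed database.
const suffixes = new Set(["com", "uk", "co.uk", "au", "edu.au", "act.edu.au"]);

function findDomainBottomUp(host) {
    const fragments = host.split(".");
    // Grow the candidate from the right, skipping the bare TLD (always a
    // hit); the first candidate that is NOT a known suffix is the domain.
    for (let i = fragments.length - 2; i >= 0; i--) {
        const candidate = fragments.slice(i).join(".");
        if (!suffixes.has(candidate)) {
            return candidate;
        }
    }
    return host;
}

findDomainBottomUp("www.amazon.co.uk");       // "amazon.co.uk"
findDomainBottomUp("page.domain.act.edu.au"); // "domain.act.edu.au"
```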

@sblask (Collaborator) commented Jul 13, 2017

There is no support for SQLite in webextensions, but I think I found a way to do this using sets - lookup should be instantaneous. I finally understood the format of the suffix list. I think this should work: create 3 sets:

  • exception entries (! removed)
  • wildcard entries (*. removed)
  • others

And then, similarly to what you suggest, split the domain and look it up bit by bit - just in the opposite direction. It goes like this:

  1. split the domain on the first dot into head and rest
  2. check exception entries - if rest in there the result is head.rest
  3. check wildcard entries - if rest in there the result is previous_head.head.rest
  4. check others - if it's in there, the result is head.rest
  5. nothing found? repeat from step one with rest as domain
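The steps above can be sketched with three Sets like this (a minimal sketch with tiny sample rule sets, not the actual add-on code; the exception branch is adjusted slightly to match public-suffix-list semantics, where an exception entry names a registrable domain rather than a public suffix):

```javascript
// Tiny sample rule sets; a real build would parse the full suffix list.
const exceptionRules = new Set(["city.kawasaki.jp"]); // from "!city.kawasaki.jp"
const wildcardRules = new Set(["kawasaki.jp"]);       // from "*.kawasaki.jp"
const plainRules = new Set(["com", "co.uk", "jp"]);

function getRegistrableDomain(host) {
    let previousHead = null;
    let current = host;
    while (current.includes(".")) {
        const dotIndex = current.indexOf(".");
        const head = current.slice(0, dotIndex);
        const rest = current.slice(dotIndex + 1);
        if (exceptionRules.has(rest)) {
            // An exception entry is itself a registrable domain.
            return rest;
        }
        if (wildcardRules.has(rest)) {
            // "*.rest": head.rest is a public suffix unless excepted.
            if (exceptionRules.has(head + "." + rest)) {
                return head + "." + rest;
            }
            return previousHead === null
                ? null
                : previousHead + "." + head + "." + rest;
        }
        if (plainRules.has(rest)) {
            return head + "." + rest;
        }
        previousHead = head;
        current = rest;
    }
    return null; // no matching rule
}

getRegistrableDomain("sellercentral.amazon.com"); // "amazon.com"
getRegistrableDomain("www.amazon.co.uk");         // "amazon.co.uk"
getRegistrableDomain("www.city.kawasaki.jp");     // "city.kawasaki.jp"
```

Each iteration peels one label off the left, so every check is a constant-time Set lookup rather than a scan over the whole list.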

@crssi commented Jul 14, 2017

I suggested the backward solution because I had a feeling it would need fewer steps and would be more optimal CPU-wise, and therefore faster.
But anyway, great thinking and good job.

Cheers

@sblask (Collaborator) commented Jul 14, 2017

It's the same thing, but I find it easier to wrap my head around it, especially with exceptions and wildcards. Might need more steps, but using sets that shouldn't be a problem. Now I just need to find the time to implement it...

@crssi commented Jul 14, 2017

You are a champ. :)
It was nice brainstorming... if you need a tester for the beta, let me know here and I will gladly do it. :)

Thank you

@sumyungguy (Author) commented Jul 14, 2017

There's also this:
https://github.com/wrangr/psl
which is used e.g. in the Smart Referer addon:
https://github.com/meh/smart-referer/tree/master/webextension/deps

It looks very simple to include as a submodule, and to use it there's just one psl.get call each for the "from" and "to" URLs, which gives you the base domain names to compare. From Smart Referer's main.js:

try {
    let toURI = new URL(request.url);
    let fromURI = new URL(referer.value);

    // Check if this request can be dismissed early by either being a perfect
    // source host / target host match OR by being whitelisted somewhere
    if (fromURI.host === toURI.host || policy.allows(fromURI.host, toURI.host)) {
        return;
    }

    // Attempt lax matching only if we, in fact, have hostnames and not IP addresses
    let isIPAddress = !isHostnameIPAddress(fromURI.host) && !isHostnameIPAddress(toURI.host);
    if (isIPAddress && !options.strict) {
        // Parse the domain names and look for each of their base domains
        let fromHostBase = psl.get(fromURI.host);
        let toHostBase = psl.get(toURI.host);

        // Allow if both base domain names match
        if (fromHostBase && toHostBase && fromHostBase === toHostBase) {
            return;
        }
    }
} catch (e) {
    console.exception(e);
}

Another optimization, besides the "perfect match" above, is a simple check whether the last two dot-separated fragments differ; if they do, the hosts cannot share a base domain. In either of those cases, which will be the vast majority, you don't need to do the full check. You could even do a fast pre-check for the most common two-level public suffixes, like .co.uk, before calling the full check.
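Those pre-checks could sit in front of the full lookup roughly like this (a sketch; sameSiteFast and fullCheck are illustrative names, not from the add-on):

```javascript
// Cheap pre-checks before the expensive public-suffix comparison.
function sameSiteFast(fromHost, toHost, fullCheck) {
    // Perfect match: trivially the same site.
    if (fromHost === toHost) {
        return true;
    }
    // If even the last two dot-separated fragments differ, the hosts cannot
    // share a registrable domain (which always spans at least two fragments).
    const lastTwo = host => host.split(".").slice(-2).join(".");
    if (lastTwo(fromHost) !== lastTwo(toHost)) {
        return false;
    }
    // Ambiguous cases (e.g. both hosts end in "co.uk") need the full check.
    return fullCheck(fromHost, toHost);
}
```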

@sblask (Collaborator) commented Jul 14, 2017

I had a look at the psl code. For every domain it goes through all entries of the public suffix list. I would have to check two domains, so if my performance check is right, that would be up to 30 ms, which is really rather long. Now that I've figured out the algorithm, it shouldn't be too hard to implement it myself.

I like the idea of the pre-check. I'll do some benchmarking, and if it's still rather slow I can add that too. I wouldn't want to judge what the most common domains are, though ;-)

@petteyg commented Jul 17, 2017

See @gorhill's implementation for publicsuffix.org: https://github.com/gorhill/publicsuffixlist.js

It is apparently much faster.

@sblask (Collaborator) commented Jul 17, 2017

Just finished my own implementation of getting the domain using the public suffix list: b17aca7

I made a new benchmark. It is not quite comparable to my previous one, because that was done in the browser console and this one on the command line. It is also not comparable to gorhill's, especially because he didn't list the actual hostnames, so I couldn't use the same ones. I also don't handle punycode - I don't know whether the webextension API uses punycode or unicode, and I didn't have good examples to test with.

Anyway a comparison of searching through all rules and using my set implementation looks like this:

ok 11 Search through rules x 1,530 ops/sec ±2.10% (87 runs sampled)
ok 12 Set implementation x 1,103,814 ops/sec ±0.57% (94 runs sampled)

(benchmark code is in test-psl)

The difference is quite significant (as expected), but even the search doesn't look quite as bad as my previous result.
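The general shape of such a comparison can be reproduced with a naive timing sketch (illustrative only: the rule strings and counts below are made up, and the actual benchmark code is in test-psl):

```javascript
// Naive timing sketch comparing a linear scan to a Set lookup over a
// synthetic rule list.
const rules = [];
for (let i = 0; i < 8000; i++) {
    rules.push("suffix" + i + ".example");
}
const ruleSet = new Set(rules);
const needle = "suffix7999.example"; // worst case for the linear scan

function time(label, fn, iterations) {
    const start = Date.now();
    for (let i = 0; i < iterations; i++) {
        fn();
    }
    console.log(label + ": " + (Date.now() - start) + " ms");
}

time("linear search", () => rules.includes(needle), 1000);
time("set lookup", () => ruleSet.has(needle), 1000);
```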

Next step: integrating this into the skipping logic and adding an option.

@sblask (Collaborator) commented Jul 22, 2017

Ok, it's done! But I could not upload a new version yet because the add-on page died on me. You can test with Firefox by cloning this repo and running npm install and npm run:stable in the directory you cloned into. For Chrome, just clone into a directory and then use "Load unpacked extension" from the extension settings.

@crssi commented Jul 23, 2017 via email

@sblask (Collaborator) commented Jul 24, 2017

New version is out, if you find any issues, please open a new issue.

sblask closed this as completed Jul 24, 2017

@sumyungguy (Author)

Seems to work well so far, thanks!

@crssi commented Jul 26, 2017

You are a true wizard. Thank you. :)
I have removed all blacklisted stuff (also the default entries) and so far so good, no breakage.

But I will know more when I return from vacation in about two weeks, and will let you know then. :)

@petteyg commented Jul 30, 2017

The "same public domain" check doesn't appear to be working properly. The login form on www.chase.com redirects to chaseonline.chase.com and then to secure05b.chase.com. All are obviously chase.com, but the add-on prevents logging in and pops up a message about skipping the chaseonline.chase.com redirect.

@sblask (Collaborator) commented Jul 30, 2017

@petteyg I can't reproduce your problem. Do I need an account? When I click on signin, I get stuff like https://secure01b.chase.com/web/auth/?fromOrigin=https://secure01b.chase.com in the browser console, but the redirect is not skipped. Are you sure you've got the checkbox unticked? Can you post the full URL from which the skip happens?

@petteyg commented Jul 30, 2017

There's my problem. I was misreading the checkbox and thought it should be on.

@crssi commented Jul 31, 2017

Nah, not your problem obviously; I made the same mistake.
It's really confusing what "skip redirect" means, what skipping Skip Redirect means, and what the whitelist and blacklist are... at first glance the logic seems reversed.

@sblask I will write you more about how to make it clearer when I am back in a week or so.
Anyway, the new "same domain" logic is awesome. 👍

@sumyungguy (Author)

I also had the same experience, and chalked it up to me being an idiot :-) But now I don't feel so bad... I'll open a new issue about it.

@crssi commented Aug 7, 2017

OK... reading it again, it makes sense.
And you have explained it nicely here: #32 (comment)
Thank you @sblask, now your extension is not just great, it is awesome :)

Cheers
