canonicalize_url isn't handling some crucial cases #107

Open
sibiryakov opened this issue Jun 22, 2018 · 5 comments

Comments

@sibiryakov
Contributor

  • removal of userinfo
  • dots and slashes in path and hostname
  • spaces succeeding and preceding the URL
  • common session id variables and their values
  • ip v6 canonicalization

Useful links:
https://developers.google.com/safe-browsing/v4/urls-hashing
https://github.com/iipc/urlcanon/blob/master/python/urlcanon/canon.py#L530
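To make the userinfo and whitespace items concrete, here is a minimal sketch using only the Python standard library. It is not w3lib's implementation, and the helper name `strip_userinfo` is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_userinfo(url: str) -> str:
    """Hypothetical helper: trim surrounding whitespace and drop `user:password@`."""
    parts = urlsplit(url.strip())  # strip() removes leading/trailing whitespace
    host = parts.hostname or ""    # hostname excludes userinfo and port
    if ":" in host:                # an IPv6 literal must be re-bracketed
        host = f"[{host}]"
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(strip_userinfo("  http://user:pass@example.com:8080/a?b=1#c \n"))
```

Note that `hostname` is returned lowercased by `urlsplit`, so this sketch also normalizes the case of the host as a side effect.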

@kmike
Member

kmike commented Jun 22, 2018

Nice links, thanks!

removal of userinfo

What does it mean? username/password?

dots and slashes in path and hostname

Could you please give an example? What's wrong with e.g. dots in hostname?

spaces succeeding and preceding the URL

Arguably this is an issue with link extraction, not with canonicalization: URLs shouldn't contain such whitespace. See also: https://github.com/scrapy/scrapy/issues/1614.

common session id variables and their values

This would be a very good feature to have, but we can't just blindly strip some known session_id parameter names and values by default. See also: scrapy/scrapy#1560.

ip v6 canonicalization

A good call.
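For reference, one possible way to canonicalize an IPv6 host is the standard library's `ipaddress` module. This is only a sketch: the function name `canonical_ipv6_host` is hypothetical and nothing like it is claimed to exist in w3lib.

```python
import ipaddress


def canonical_ipv6_host(host: str) -> str:
    """Hypothetical helper: return the compressed, lowercase, bracketed IPv6 literal."""
    # Accept the literal with or without the URL brackets.
    addr = ipaddress.IPv6Address(host.strip("[]"))
    # `compressed` drops leading zeros and collapses the longest zero run to `::`.
    return f"[{addr.compressed}]"


print(canonical_ipv6_host("[2001:0DB8:0000:0000:0000:0000:0000:0001]"))
```

With this, different spellings of the same address map to a single host string, which is what URL deduplication needs.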

@ghost

ghost commented Jun 22, 2018

Regarding the session variables, it may also be worth considering tracking URL params such as Google Analytics' "utm_source"; those attributes will be added by lots of websites and social media tools to all outbound URLs and could probably be safely stripped during canonicalisation.
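A rough sketch of what such stripping could look like with the standard library. The function name `strip_tracking_params` and the default `utm_` prefix are assumptions for illustration, not an existing w3lib option.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_tracking_params(url: str, prefixes: tuple = ("utm_",)) -> str:
    """Hypothetical helper: drop query parameters whose name starts with a prefix."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not name.lower().startswith(prefixes)
    ]
    # Rebuild the URL with only the parameters that were kept.
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_tracking_params("http://example.com/p?id=1&utm_source=tw&utm_medium=social"))
```

Passing the prefixes as an argument keeps the rule opt-in and adjustable, rather than hard-coding a list of parameter names.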

@sibiryakov
Contributor Author

yes, userinfo is username and password.

Could you please give an example? What's wrong with e.g. dots in hostname?

A trailing dot in the hostname, e.g. "google.com." instead of "google.com".

obviously these are task-dependent issues, but there is no mechanism to enable such behaviour.
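As an illustration of the dots-and-slashes case, here is a small sketch loosely modelled on the Safe Browsing rules linked in the issue (trim and collapse dots in the hostname, resolve `.` and `..` segments, collapse repeated slashes). The helper names `normalize_host` and `normalize_path` are hypothetical.

```python
import re


def normalize_host(host: str) -> str:
    """Lowercase, drop leading/trailing dots, collapse runs of dots."""
    return re.sub(r"\.{2,}", ".", host.lower().strip("."))


def normalize_path(path: str) -> str:
    """Resolve '.' and '..' segments and collapse repeated slashes."""
    segments = []
    for seg in path.split("/"):
        if seg == "..":
            if segments:
                segments.pop()      # step up one level, never above the root
        elif seg not in ("", "."):  # empty segments come from '//' or edges
            segments.append(seg)
    result = "/" + "/".join(segments)
    # Keep a trailing slash when the input pointed at a directory.
    if segments and path.endswith(("/", "/.", "/..")):
        result += "/"
    return result


print(normalize_host("google.com."), normalize_path("/a/./b/../c//d/"))
```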

@kmike
Member

kmike commented Jun 22, 2018

those attributes will be added by lots of websites and social media tools to all outbound URLs and could probably be safely stripped during canonicalisation.

I don't quite like doing this all by default without an option to turn it off. So the main question for now is how to make this behavior overridable, so that users can implement such rules themselves, without having to modify w3lib or scrapy. We can also provide something by default, but I think it should be a next step, and a separate task.
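One possible shape for such an override point, purely as a sketch: the caller passes an ordered list of URL-to-URL callables that are applied in sequence. The name `canonicalize_with_rules` and the `rules` parameter are hypothetical and do not exist in w3lib or Scrapy.

```python
from typing import Callable, Iterable

Rule = Callable[[str], str]


def canonicalize_with_rules(url: str, rules: Iterable[Rule] = ()) -> str:
    """Hypothetical hook: apply user-supplied rules in order; none by default."""
    for rule in rules:
        url = rule(url)
    return url


# Users opt in to the behaviour they need, e.g. whitespace trimming only.
print(canonicalize_with_rules("  http://example.com/ ", rules=[str.strip]))
```

With an empty default, existing behaviour stays unchanged, and any built-in rule set could be shipped later as a separate step.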

obviously these are task-dependent issues, but there is no mechanism to enable such behaviour.

Yep, it is discussed in scrapy/scrapy#1560. Changing canonicalize_url to do these actions is not enough: there should be a mechanism in Scrapy to enable it, and it should also be customizable. This part of the ticket is a duplicate of scrapy/scrapy#1560, so it is probably better to keep discussion of this feature there. The problem is known, but so far there has been no concrete proposal on how to fix it.

@kmike
Member

kmike commented Jun 22, 2018

yes, userinfo is username and password.

Is it a real issue in practice? I understand why it can help, but I can also see how it can break some of the use cases if scrapy/scrapy#1466 gets merged, if we do it by default.


3 participants