Adaptive URL filter to normalize URLs based on canonical tag #315
Comments
If a canonical metatag is found on a page, the |
blame the web page ;-)
@jnioche Haha, sorry for that ;) This wasn't a good example, I think. But we encounter the same problem if we crawl … This should be a better example :-)
From the Javadoc of java.net.URL.URL(URL context, String spec) throws MalformedURLException
The value found in the HTML is neither relative nor absolute, which is why we get that. Again, blame the site. You could write a URLFilter to rewrite such URLs so that the outlinks are fixed, but that won't help much with the sitemaps. Can you use StackOverflow if you have more questions?
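To illustrate the kind of resolution described above, here is a minimal sketch of how `java.net.URL(URL context, String spec)` treats a value that has no scheme and no leading slash. The page URL and the link values are made up for the example (the actual value from the site is not shown in this thread), and `SchemelessLinkDemo` is a hypothetical class name.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class SchemelessLinkDemo {

    // Resolves a link value against the page URL with java.net.URL(URL context, String spec).
    @SuppressWarnings("deprecation") // the URL constructors are deprecated in recent JDKs
    static String resolve(String pageUrl, String spec) {
        try {
            URL context = new URL(pageUrl);
            return new URL(context, spec).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical page URL, used only for illustration.
        String page = "http://example.com/dir/index.html";

        // No scheme and no leading slash: treated as a path relative to the page.
        System.out.println(resolve(page, "www.example.com/page.html"));
        // prints http://example.com/dir/www.example.com/page.html

        // A host-relative and an absolute value resolve as expected.
        System.out.println(resolve(page, "/page.html"));
        // prints http://example.com/page.html
        System.out.println(resolve(page, "http://www.example.com/page.html"));
        // prints http://www.example.com/page.html
    }
}
```

A URLFilter that repairs such outlinks would have to guess that the first path segment was meant to be a host name, which is a heuristic rather than something the URL specification provides.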
Thank you and sorry for disturbing you
You haven't disturbed me at all! I am glad you are using SC and thankful that you are taking the time to report potential issues.
Such a filter could compare the parameters of a URL with the canonical tag found in the page (if any) and determine after a while which parameters can be safely removed in order to normalise the URL.
The aim is similar to the clean-param extension of the robots protocol by Yandex where sites can specify how URLs can be normalised.
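One possible shape for such a filter is sketched below. It is plain Java rather than an implementation of the project's URLFilter interface, and every name in it (`AdaptiveParamFilter`, `observe`, `isRemovable`, `normalise`, `minObservations`) is hypothetical. The rule it uses, assumed here for simplicity, is deliberately conservative: a parameter is dropped only once it has been seen at least `minObservations` times and was absent from the canonical URL every time.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch only: assumes absolute http(s) URLs and a single shared statistics table.
public class AdaptiveParamFilter {

    private final int minObservations;
    // parameter name -> {times seen in a page URL, times absent from its canonical URL}
    private final Map<String, int[]> stats = new HashMap<>();

    public AdaptiveParamFilter(int minObservations) {
        this.minObservations = minObservations;
    }

    // Splits the raw query into "name=value" parts (empty list if there is no query).
    private static List<String> queryParts(URI uri) {
        List<String> parts = new ArrayList<>();
        String query = uri.getRawQuery();
        if (query == null) return parts;
        for (String part : query.split("&")) {
            if (!part.isEmpty()) parts.add(part);
        }
        return parts;
    }

    private static String name(String part) {
        int eq = part.indexOf('=');
        return eq < 0 ? part : part.substring(0, eq);
    }

    // Records which parameters of the page URL survive in its canonical URL.
    public void observe(String pageUrl, String canonicalUrl) {
        Set<String> kept = new HashSet<>();
        for (String part : queryParts(URI.create(canonicalUrl))) kept.add(name(part));
        for (String part : queryParts(URI.create(pageUrl))) {
            int[] counts = stats.computeIfAbsent(name(part), k -> new int[2]);
            counts[0]++;
            if (!kept.contains(name(part))) counts[1]++;
        }
    }

    // True once the parameter has been seen often enough and was always dropped.
    public boolean isRemovable(String param) {
        int[] counts = stats.get(param);
        return counts != null && counts[0] >= minObservations && counts[1] == counts[0];
    }

    // Rebuilds the URL without the parameters learned to be removable.
    public String normalise(String url) {
        URI uri = URI.create(url);
        List<String> keptParts = new ArrayList<>();
        for (String part : queryParts(uri)) {
            if (!isRemovable(name(part))) keptParts.add(part);
        }
        StringBuilder out = new StringBuilder();
        out.append(uri.getScheme()).append("://")
           .append(uri.getRawAuthority()).append(uri.getRawPath());
        if (!keptParts.isEmpty()) out.append('?').append(String.join("&", keptParts));
        if (uri.getRawFragment() != null) out.append('#').append(uri.getRawFragment());
        return out.toString();
    }
}
```

For example, after observing `http://example.com/p?id=1&sessionid=abc` and `http://example.com/p?id=2&sessionid=def` with canonicals that keep only `id`, a filter created with `minObservations` of 2 would normalise `http://example.com/p?id=3&sessionid=xyz` to `http://example.com/p?id=3`. A real filter would probably keep the statistics per host or per path and tolerate a small share of inconsistent canonical tags.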
TODO compare with [research.google.com/pubs/archive/35210.pdf]