Adaptative URL filter to normalize URLs based on canonical tag #315

Open
jnioche opened this issue Jul 11, 2016 · 6 comments

Comments

@jnioche
Contributor

jnioche commented Jul 11, 2016

Such a filter could compare the parameters of a URL with the canonical tag found in the page (if any) and determine after a while which parameters can be safely removed in order to normalise the URL.

The aim is similar to the clean-param extension of the robots protocol by Yandex where sites can specify how URLs can be normalised.

TODO compare with [research.google.com/pubs/archive/35210.pdf]
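
A rough sketch of how such a filter might learn which parameters are removable: for every page that declares a canonical URL, count which query parameters of the fetched URL are missing from the canonical one. The class name `CanonicalParamLearner`, the `minObservations` threshold and the "always absent" rule are assumptions made for illustration, not an existing StormCrawler API; a real filter would probably keep these statistics per host or domain.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper, not part of StormCrawler. */
final class CanonicalParamLearner {
    // Per parameter name: [times seen in a page URL, times missing from its canonical URL]
    private final Map<String, int[]> stats = new HashMap<>();
    private final int minObservations;

    CanonicalParamLearner(int minObservations) {
        this.minObservations = minObservations;
    }

    /** Extracts the parameter names from the query string of a URL. */
    static Set<String> paramNames(String url) {
        Set<String> names = new LinkedHashSet<>();
        String query = URI.create(url).getRawQuery();
        if (query == null) {
            return names;
        }
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            names.add(eq < 0 ? pair : pair.substring(0, eq));
        }
        return names;
    }

    /** Records one (page URL, canonical URL) observation. */
    void observe(String pageUrl, String canonicalUrl) {
        Set<String> kept = paramNames(canonicalUrl);
        for (String name : paramNames(pageUrl)) {
            int[] counts = stats.computeIfAbsent(name, k -> new int[2]);
            counts[0]++;
            if (!kept.contains(name)) {
                counts[1]++;
            }
        }
    }

    /** Parameters seen at least minObservations times and never present in the canonical URL. */
    Set<String> removableParams() {
        Set<String> removable = new TreeSet<>();
        for (Map.Entry<String, int[]> e : stats.entrySet()) {
            int[] c = e.getValue();
            if (c[0] >= minObservations && c[0] == c[1]) {
                removable.add(e.getKey());
            }
        }
        return removable;
    }
}
```

Once a parameter shows up in `removableParams()`, the filter could strip it from newly discovered URLs before they are added to the frontier.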

@sandrohoerler

If a canonical meta tag is found on a page, the AbstractIndexerBolt#valueForUrl method returns the targetUrl concatenated with the canonical URL found on the page. This can be reproduced by crawling https://www.geotest.ch/kompetenzen/naturgefahren/massnahmenplanung.html. Is this a known issue?

@jnioche
Contributor Author

jnioche commented Apr 26, 2019

hi @sandrohoerler

blame the web page ;-)

<link rel="canonical" href="http://www.geotest.ch/https://www.geotest.ch/kompetenzen/naturgefahren/massnahmenplanung.html">

@sandrohoerler

@jnioche Haha, sorry for that ;) That wasn't a good example, I think. But we encounter the same problem if we crawl
https://www.hoteljob.ch/. It leads to https://www.hoteljob.ch/https://www.hoteljob.ch/job-suchen. And here <link rel="canonical" href="www.hoteljob.ch/job-suchen"> should be properly set :-). Also, every persisted subsite URL is unusable, like https://www.hoteljob.ch/arbeitgeber/park-kafi-kreuzlingen/www.hoteljob.ch/arbeitgeber/park-kafi-kreuzlingen/567033 with its <link rel="canonical" href="www.hoteljob.ch/arbeitgeber/park-kafi-kreuzlingen/567033">

This should be a better example :-)

@jnioche
Contributor Author

jnioche commented Apr 26, 2019

From the Javadoc of java.net.URL.URL(URL context, String spec) throws MalformedURLException

Creates a URL by parsing the given spec within a specified context. The new URL is created from the given context URL and the spec argument as described in RFC2396 "Uniform Resource Identifiers : Generic Syntax" :

      <scheme>://<authority><path>?<query>#<fragment>

The reference is parsed into the scheme, authority, path, query and fragment parts. If the path component is empty and the scheme, authority, and query components are undefined, then the new URL is a reference to the current document. Otherwise, the fragment and query parts present in the spec are used in the new URL.
If the scheme component is defined in the given spec and does not match the scheme of the context, then the new URL is created as an absolute URL based on the spec alone. Otherwise the scheme component is inherited from the context URL.

If the authority component is present in the spec then the spec is treated as absolute and the spec authority and path will replace the context authority and path. If the authority component is absent in the spec then the authority of the new URL will be inherited from the context.

If the spec's path component begins with a slash character "/" then the path is treated as absolute and the spec path replaces the context path.

Otherwise, the path is treated as a relative path and is appended to the context path, as described in RFC2396. Also, in this case, the path is canonicalized through the removal of directory changes made by occurrences of ".." and ".".

For a more detailed description of URL parsing, refer to RFC2396.

The value found in the HTML has neither a scheme nor a leading slash, so it is treated as a relative path and appended to the context path, which is why we get that. Again, blame the site.
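
The resolution rules quoted above can be checked with a few lines of plain Java. This is only a sketch using the hoteljob.ch values reported in this thread; the class name `CanonicalResolutionDemo` and the `resolve` helper are made up for the example, and the two-argument `URL` constructor is marked deprecated in recent JDKs, hence the annotation.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical demo class, not part of StormCrawler. */
final class CanonicalResolutionDemo {

    /** Resolves href against pageUrl the way java.net.URL(URL context, String spec) does. */
    @SuppressWarnings("deprecation")
    static String resolve(String pageUrl, String href) {
        try {
            return new URL(new URL(pageUrl), href).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String page = "https://www.hoteljob.ch/arbeitgeber/park-kafi-kreuzlingen/567033";

        // No scheme, no leading slash: treated as a relative path and appended to the context path.
        // prints https://www.hoteljob.ch/arbeitgeber/park-kafi-kreuzlingen/www.hoteljob.ch/arbeitgeber/park-kafi-kreuzlingen/567033
        System.out.println(resolve(page, "www.hoteljob.ch/arbeitgeber/park-kafi-kreuzlingen/567033"));

        // Leading slash: the spec path replaces the context path.
        // prints https://www.hoteljob.ch/arbeitgeber/park-kafi-kreuzlingen/567033
        System.out.println(resolve(page, "/arbeitgeber/park-kafi-kreuzlingen/567033"));

        // Scheme and authority present: the spec is used as an absolute URL.
        // prints https://www.hoteljob.ch/arbeitgeber/park-kafi-kreuzlingen/567033
        System.out.println(resolve(page, "https://www.hoteljob.ch/arbeitgeber/park-kafi-kreuzlingen/567033"));
    }
}
```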

You could write a URLFilter to rewrite such URLs so that the outlinks are fixed but that won't help much with the sitemaps.
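
The rewriting logic such a filter would need could look roughly like the sketch below. It is plain Java and deliberately not wired into StormCrawler's `URLFilter` interface; the class `SchemelessHrefFixer`, its `fix` method and the "first path segment equals the page's host" heuristic are all assumptions for illustration. The heuristic would misfire on a site that really has a directory named after its own host.

```java
import java.net.URI;

/** Hypothetical helper, not part of StormCrawler. */
final class SchemelessHrefFixer {

    /**
     * Prepends the page's scheme when href starts with the page's host name
     * (e.g. "www.hoteljob.ch/job-suchen"); returns href unchanged otherwise.
     */
    static String fix(String pageUrl, String href) {
        URI page = URI.create(pageUrl);
        String host = page.getHost();
        // Leave root-relative, protocol-relative and absolute values alone.
        if (host == null || href.startsWith("/") || href.contains("://")) {
            return href;
        }
        int slash = href.indexOf('/');
        String firstSegment = slash < 0 ? href : href.substring(0, slash);
        if (firstSegment.equalsIgnoreCase(host)) {
            return page.getScheme() + "://" + href;
        }
        return href;
    }
}
```

For instance, `SchemelessHrefFixer.fix("https://www.hoteljob.ch/", "www.hoteljob.ch/job-suchen")` gives `https://www.hoteljob.ch/job-suchen`, while a genuinely relative value such as `job-suchen` is returned untouched.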

Can you use StackOverflow if you have more questions?

@sandrohoerler

Thank you and sorry for disturbing you

@jnioche
Contributor Author

jnioche commented Apr 26, 2019

You haven't disturbed me at all! I am glad you are using SC and thankful that you are taking the time to report potential issues
