Adaptative URL filter to normalize URLs based on canonical tag #315

jnioche · 2016-07-11T16:07:43Z

Such a filter could compare the parameters of a URL with the canonical tag found in the page (if any) and determine after a while which parameters can be safely removed in order to normalise the URL.

The aim is similar to the clean-param extension of the robots protocol by Yandex where sites can specify how URLs can be normalised.
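One possible reading of the idea, as a rough sketch only: count, per query parameter, how often it appears in a page URL but is missing from that page's canonical URL, and treat it as removable once enough pages agree. The class `CanonicalParamStats`, its methods and the thresholds are hypothetical names for illustration, not part of StormCrawler.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: learns which query parameters canonical tags tend to drop.
final class CanonicalParamStats {

    // parameter name -> {times dropped by the canonical URL, times seen in a page URL}
    private final Map<String, int[]> counts = new HashMap<>();

    // Extracts the parameter names from the query string of a URL.
    static Set<String> paramNames(String url) {
        Set<String> names = new HashSet<>();
        String query = URI.create(url).getRawQuery();
        if (query == null) {
            return names;
        }
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            names.add(eq < 0 ? pair : pair.substring(0, eq));
        }
        return names;
    }

    // Records one page: which of its parameters are absent from its canonical URL.
    void observe(String pageUrl, String canonicalUrl) {
        Set<String> canonical = paramNames(canonicalUrl);
        for (String name : paramNames(pageUrl)) {
            int[] c = counts.computeIfAbsent(name, k -> new int[2]);
            c[1]++;
            if (!canonical.contains(name)) {
                c[0]++;
            }
        }
    }

    // Parameters seen at least minObservations times and dropped in at least
    // minRatio of those observations (both thresholds are assumptions to tune).
    Set<String> removable(int minObservations, double minRatio) {
        Set<String> out = new HashSet<>();
        for (Map.Entry<String, int[]> e : counts.entrySet()) {
            int dropped = e.getValue()[0];
            int seen = e.getValue()[1];
            if (seen >= minObservations && (double) dropped / seen >= minRatio) {
                out.add(e.getKey());
            }
        }
        return out;
    }
}
```

This ignores whether the host and path of the two URLs match and keeps a single set of counters, whereas a real filter would probably keep them per host.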

TODO compare with [research.google.com/pubs/archive/35210.pdf]

sandrohoerler · 2019-04-26T09:00:25Z

If a canonical meta tag is found on a page, the AbstractIndexerBolt#valueForUrl method returns the targetUrl concatenated with the canonical URL found on the page. This issue can be reproduced by crawling https://www.geotest.ch/kompetenzen/naturgefahren/massnahmenplanung.html. Is this a known issue?

jnioche · 2019-04-26T10:47:27Z

hi @sandrohoerler

blame the web page ;-)

<link rel="canonical" href="http://www.geotest.ch/https://www.geotest.ch/kompetenzen/naturgefahren/massnahmenplanung.html">

sandrohoerler · 2019-04-26T11:06:46Z

@jnioche Haha, sorry for that ;) This wasn't a good example, I think. But we encounter the same problem if we crawl
https://www.hoteljob.ch/. It leads to https://www.hoteljob.ch/https://www.hoteljob.ch/job-suchen. And here <link rel="canonical" href="www.hoteljob.ch/job-suchen"> should be properly set :-). Also, every persisted subsite URL is unusable, like https://www.hoteljob.ch/arbeitgeber/park-kafi-kreuzlingen/www.hoteljob.ch/arbeitgeber/park-kafi-kreuzlingen/567033 and its <link rel="canonical" href="www.hoteljob.ch/arbeitgeber/park-kafi-kreuzlingen/567033">

This should be a better example :-)

jnioche · 2019-04-26T13:30:36Z

From the Javadoc of java.net.URL.URL(URL context, String spec) throws MalformedURLException

Creates a URL by parsing the given spec within a specified context. The new URL is created from the given context URL and the spec argument as described in RFC2396 "Uniform Resource Identifiers : Generic Syntax" :
      <scheme>://<authority><path>?<query>#<fragment>
The reference is parsed into the scheme, authority, path, query and fragment parts. If the path component is empty and the scheme, authority, and query components are undefined, then the new URL is a reference to the current document. Otherwise, the fragment and query parts present in the spec are used in the new URL.
If the scheme component is defined in the given spec and does not match the scheme of the context, then the new URL is created as an absolute URL based on the spec alone. Otherwise the scheme component is inherited from the context URL.

If the authority component is present in the spec then the spec is treated as absolute and the spec authority and path will replace the context authority and path. If the authority component is absent in the spec then the authority of the new URL will be inherited from the context.

If the spec's path component begins with a slash character "/" then the path is treated as absolute and the spec path replaces the context path.

Otherwise, the path is treated as a relative path and is appended to the context path, as described in RFC2396. Also, in this case, the path is canonicalized through the removal of directory changes made by occurrences of ".." and ".".

For a more detailed description of URL parsing, refer to RFC2396.

The value found in the HTML has neither a scheme nor a leading slash, so it is treated as a relative path and appended to the context path, which is why we get that. Again, blame the site.
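To illustrate the rules quoted above, here is a minimal sketch that resolves a few href values against the page URL from the report. The class and method names are made up for the example; only the java.net.URL(URL context, String spec) constructor comes from the Javadoc.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class, not part of StormCrawler.
final class CanonicalResolutionDemo {

    // Resolves spec against context with java.net.URL(URL context, String spec).
    // The constructor is deprecated in recent JDKs but still available.
    @SuppressWarnings("deprecation")
    static String resolve(String context, String spec) {
        try {
            return new URL(new URL(context), spec).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String page = "https://www.hoteljob.ch/arbeitgeber/park-kafi-kreuzlingen/567033";

        // No scheme, no "//", no leading "/": treated as a relative path, prints
        // https://www.hoteljob.ch/arbeitgeber/park-kafi-kreuzlingen/www.hoteljob.ch/arbeitgeber/park-kafi-kreuzlingen/567033
        System.out.println(resolve(page, "www.hoteljob.ch/arbeitgeber/park-kafi-kreuzlingen/567033"));

        // Leading "/": the spec path replaces the context path, prints
        // https://www.hoteljob.ch/job-suchen
        System.out.println(resolve(page, "/job-suchen"));

        // Leading "//": authority taken from the spec, scheme inherited, prints
        // https://www.hoteljob.ch/job-suchen
        System.out.println(resolve(page, "//www.hoteljob.ch/job-suchen"));
    }
}
```

The first call reproduces the doubled URL from the report, so the site would need an href starting with a scheme, "//" or "/" for the canonical URL to resolve as intended.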

You could write a URLFilter to rewrite such URLs so that the outlinks are fixed but that won't help much with the sitemaps.
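As a rough idea of what the rewriting logic inside such a filter might look like, the sketch below strips everything up to a repeated host name in the path. It is a standalone helper with hypothetical names (`CanonicalHrefRepair`, `repair`), deliberately not written against the StormCrawler URLFilter interface, and it assumes that a path segment equal to the host is always the result of a scheme-less href.

```java
import java.net.URI;

// Hypothetical helper: undoes the doubling caused by hrefs such as
// "www.example.com/path" that lack a scheme.
final class CanonicalHrefRepair {

    // Returns the URL with everything before the last "/<host>/" in the path
    // removed, or the input unchanged if the host does not occur in the path.
    // The match is case-sensitive and any fragment is dropped.
    static String repair(String url) {
        URI u;
        try {
            u = URI.create(url);
        } catch (IllegalArgumentException e) {
            return url; // not parseable, leave it alone
        }
        String host = u.getHost();
        String path = u.getRawPath();
        if (host == null || host.isEmpty() || path == null) {
            return url;
        }
        String marker = "/" + host + "/";
        int idx = path.lastIndexOf(marker);
        if (idx < 0) {
            return url;
        }
        // Keep the slash that follows the repeated host name.
        String tail = path.substring(idx + marker.length() - 1);
        String query = u.getRawQuery() == null ? "" : "?" + u.getRawQuery();
        return u.getScheme() + "://" + u.getRawAuthority() + tail + query;
    }
}
```

For example, `repair("https://www.hoteljob.ch/arbeitgeber/park-kafi-kreuzlingen/www.hoteljob.ch/arbeitgeber/park-kafi-kreuzlingen/567033")` would give back `https://www.hoteljob.ch/arbeitgeber/park-kafi-kreuzlingen/567033`, while URLs without the repeated host pass through untouched.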

Can you use StackOverflow if you have more questions?

sandrohoerler · 2019-04-26T17:21:07Z

Thank you and sorry for disturbing you

jnioche · 2019-04-26T20:13:58Z

You haven't disturbed me at all! I am glad you are using SC and thankful that you are taking the time to report potential issues.
