
Spaces removed from links (all URLs) #72

Closed
StephenBrown2 opened this issue Feb 24, 2020 · 6 comments · Fixed by #103

Comments

@StephenBrown2

Related to #44; however, now I've come across a site (published on Read the Docs) that has many spaces in its filenames. Though the IDs are fine, the spaces get removed from the path, which breaks the links and results in many false 404s.

Perhaps, instead of removing spaces, use the net/url Parse function alone? Or remove spaces only from the URL.Fragment?

muffet/scraper.go

Lines 48 to 54 in 4998c9b

s := normalizeURL(scrape.Attr(n, a))

if s == "" || sc.isURLExcluded(s) {
	continue
}

u, err := url.Parse(s)

muffet/scraper.go

Lines 82 to 90 in 4998c9b

func normalizeURL(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsSpace(r) {
			return -1
		}

		return r
	}, s)
}

https://play.golang.org/p/42kUw1Rg23m

@raviqqe
Owner

raviqqe commented Sep 22, 2020

https://tools.ietf.org/html/rfc3986

Actually, the URI spec says that those spaces should be ignored. Do those Read the Docs pages work well in browsers?

@StephenBrown2
Author

StephenBrown2 commented Sep 22, 2020

Yes, here are a few examples:
Running on https://draft-openaps-reorg.readthedocs.io/en/reapply-cleanup/ gives many 404's for URLs such as:

From what I can tell of the spec, the spaces and all URI components are to be URI-encoded: https://tools.ietf.org/html/rfc3986#section-2.1 and https://tools.ietf.org/html/rfc3986#section-2.4

Once produced, a URI is always in its percent-encoded form.

@raviqqe
Owner

raviqqe commented Sep 23, 2020

I see your point. I will change the behaviour of muffet then, because I think people expect muffet to work similarly to browsers.

But I still don't quite understand the specs of URLs and HTML. Both the foo and bar links (the decoded and encoded URLs) in the following HTML work in Chrome, while the spec says href attributes are always valid URLs after their leading and trailing spaces are stripped.

index.html:

<html>
  <body>
    <a href="/a b.html">foo</a>
    <a href="/a%20b.html">bar</a>
  </body>
</html>

a b.html:

<html>
  <body>
    <a href="/">back</a>
  </body>
</html>

Am I missing something???

Anyway, thanks for the report!

@StephenBrown2
Author

Ah, I think I know where the misunderstanding comes from:

the spec says the href attributes are always valid URLs after their leading and trailing spaces are stripped.

Leading and trailing spaces should most certainly be stripped; however, internal spaces should not be removed, but rather URI-encoded (percent-encoded). So strings.TrimSpace is the right way to go about it if you're working with strings alone, but why not use https://golang.org/pkg/net/url/#Parse directly?
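A minimal sketch of that suggestion (parseLink is a hypothetical name, not muffet's code): trim only the edges, then let net/url deal with the rest.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseLink strips only leading and trailing whitespace, as the HTML
// spec requires for href attributes, then parses whatever remains.
// Internal spaces survive in the parsed path.
func parseLink(s string) (*url.URL, error) {
	return url.Parse(strings.TrimSpace(s))
}

func main() {
	u, err := parseLink("  /a b.html  ")
	fmt.Println(u.Path, err) // "/a b.html" <nil>
}
```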

@raviqqe
Owner

raviqqe commented Sep 23, 2020

I went with TrimSpace because I wanted to strip spaces even on URL parse errors anyway. But it might be simpler to do url.Parse first. Thanks for the feedback!

@StephenBrown2
Author

No worries. It works well as is, just trying to be helpful as I learn more Go myself!
