
Spaces removed from links (all URLs) #72

Closed
StephenBrown2 opened this issue Feb 24, 2020 · 6 comments · Fixed by #103

Comments

@StephenBrown2

Related to #44; however, now I've come across a site (published on Read the Docs) that has many spaces in its filenames. Though the IDs are fine, the spaces get removed from the path, which breaks the links and results in many false 404s.

Perhaps, instead of removing spaces, use the net/url Parse function alone? Or remove spaces only from the URL.Fragment?

muffet/scraper.go

Lines 48 to 54 in 4998c9b

s := normalizeURL(scrape.Attr(n, a))

if s == "" || sc.isURLExcluded(s) {
	continue
}

u, err := url.Parse(s)

muffet/scraper.go

Lines 82 to 90 in 4998c9b

func normalizeURL(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsSpace(r) {
			return -1
		}

		return r
	}, s)
}

https://play.golang.org/p/42kUw1Rg23m

@raviqqe
Owner

raviqqe commented Sep 22, 2020

https://tools.ietf.org/html/rfc3986

Actually, the URI spec says that those spaces should be ignored. Do those Read the Docs pages work well in browsers?

@StephenBrown2
Author

StephenBrown2 commented Sep 22, 2020

Yes, here are a few examples:
Running on https://draft-openaps-reorg.readthedocs.io/en/reapply-cleanup/ gives many 404's for URLs such as:

From what I can tell of the spec, the spaces and all URI components are to be URI-encoded: https://tools.ietf.org/html/rfc3986#section-2.1 and https://tools.ietf.org/html/rfc3986#section-2.4

Once produced, a URI is always in its percent-encoded form.

@raviqqe
Owner

raviqqe commented Sep 23, 2020

I see your point. I will change the behaviour of muffet then, because I think people expect muffet to work similarly to browsers.

But I still don't quite understand the specs of URLs and HTML. Both the foo and bar links (the decoded and encoded URLs) in the following HTML work in Chrome, while the spec says href attributes are always valid URLs after their leading and trailing spaces are stripped.

index.html:

<html>
  <body>
    <a href="/a b.html">foo</a>
    <a href="/a%20b.html">bar</a>
  </body>
</html>

a b.html:

<html>
  <body>
    <a href="/">back</a>
  </body>
</html>

Am I missing something???

Anyway, thanks for the report!

@StephenBrown2
Author

Ah, I think I know where the misunderstanding comes from:

the spec says the href attributes are always valid URLs after their leading and trailing spaces are stripped.

Leading and trailing spaces should most certainly be stripped; however, internal spaces should not be removed, but rather URI-encoded (percent-encoded). So strings.TrimSpace is the right way to go about it if you're working with strings alone, but why not use https://golang.org/pkg/net/url/#Parse directly?
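A minimal sketch of that suggestion (parseLink is a hypothetical name, not muffet's code): trim only the edges, then let net/url deal with the rest.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseLink strips only leading and trailing whitespace, as the HTML
// spec requires for href attributes, then parses whatever remains.
// Internal spaces survive in the parsed path.
func parseLink(s string) (*url.URL, error) {
	return url.Parse(strings.TrimSpace(s))
}

func main() {
	u, err := parseLink("  /a b.html  ")
	fmt.Println(u.Path, err) // "/a b.html" <nil>
}
```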

@raviqqe
Owner

raviqqe commented Sep 23, 2020

I went with TrimSpace because I wanted to strip spaces even on URL parse errors anyway. But it might be simpler to do url.Parse first. Thanks for the feedback!

@StephenBrown2
Author

No worries. It works well as is, just trying to be helpful as I learn more Go myself!
