Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escape special HTML chars coming from SITEURL in the internal link replacer #2260

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

mosra
Copy link
Contributor

@mosra mosra commented Dec 10, 2017

When SITEURL (or any other URL such as PAGE_URL, ARTICLE_URL etc.) contains special HTML characters such as &, they were not properly escaped when going through the link replacer.

A convoluted example: when SITEURL = 'http://my.site/?app="blog"&path=' and there's the following input:

<a href="{filename}/article.rst">An article</a>

The following gets rendered:

<a href="http://my.site/?app="blog"&path=/article.rst">An article</a>

Which is wrong and should be instead

<a href="http://my.site/?app=&quot;blog&quot;&amp;path=/article.rst">An article</a>

While the Jinja2 template itself usually takes care of escaping SITEURL where it is used in the template, this link replacement is done after that and thus there's no way around it.

The patch takes care of properly unescaping the other URL parts and then escaping them again so they get preserved. There's a test case checking that as well.

Similarly to #2196, this is based on 1f30306, which just makes the internal link replacer function public. See #2164 for the original reason behind this change.

Thanks in advance! :)

@mosra
Copy link
Contributor Author

mosra commented Dec 10, 2017

Didn't find any utility that could do both HTML escape and unescape in Pelican codebase, so it's using either html.escape()/html.unescape() on Python 3 and a "polyfill" that mimics the same on Python 2.7. Not sure if that's ideal, suggestions welcome.

@justinmayer
Copy link
Member

My admittedly cursory scan of this pull request didn't yield any glaring problems. Any @getpelican/reviewers care to comment before this is merged?

@justinmayer
Copy link
Member

@avaris: Any thoughts on this one?

@avaris
Copy link
Member

avaris commented Feb 13, 2018

I'm a little confused here. Why wouldn't you set SITEURL as encoded in the first place?

SITEURL = 'http://my.site/?app=&quot;blog&quot;&amp;path='

@justinmayer
Copy link
Member

@mosra: Any response to the question from @avaris?

@mosra
Copy link
Contributor Author

mosra commented Mar 23, 2018

Oh, sorry for the late reply, the notification got somehow drowned among all others :/

Encoding the SITEURL would break the feed generator -- it expects a non-encoded URL and it would encode it twice, resulting in http://my.site/?app=&amp;quot;blog&amp;quot;&amp;amp;path= in the feed XML.

In general, for consistency, I think the settings should not require the user to pre-encode any of the values (unless it's already a HTML snippet or something). Pelican and the theme should transparently handle all encoding to avoid double encoding issues like this.

@avaris
Copy link
Member

avaris commented Mar 24, 2018

Well, there is FEED_DOMAIN for cases like that.

@mosra
Copy link
Contributor Author

mosra commented Mar 28, 2018

Okay, um, then the behavior is a matter of opinion, I guess:

  • Either expecting the users to pre-escape SITEURL (and potentially other variables) in settings.py and then hoping that all themes treat them the same way (at the moment some might be using the |e filter on this and other variables, some might not) and there aren't many corner cases like with this FEED_DOMAIN. Plugins, themes and output formats that expect SITEURL to be plain text (for example some sort of JSON output, ...) would need to unescape it back.
  • Or have a rule that all plain text variables in settings.py can be really a plain text and update Pelican and all themes to properly escape them only when needed. This way the users don't need to think twice about using special characters and everything is handled for them transparently.

I'm personally in favor of the second option, it looks cleaner and more consistent to me.

@avaris
Copy link
Member

avaris commented Mar 28, 2018

Well, then there is the inconsistency of SITEURL itself being unescaped but internal replacements escaped. This can potentially break some themes if they were using SITEURL verbatim in the templates and don't do escaping.

If we're to do escaping on SITEURL, we might as well escape then replace it while reading the settings and be done with it. Then it's escaped everywhere and consistent :).

@mosra
Copy link
Contributor Author

mosra commented Mar 28, 2018

Well, then there is the inconsistency of SITEURL itself being unescaped but internal replacements escaped.

But that's because the internal replacements work on the final HTML output, no? That's expected.

This can potentially break some themes if they were using SITEURL verbatim in the templates and don't do escaping.

I'm aware of that. OTOH, I don't think there are many sites out there with special characters in the URL that would break on this. It's more about the consistency.

I have a test like this for my theme, where basically everything coming from the settings is expected to be a plain text and the theme does proper escaping on everything that's not already HTML. I'm sure that, when this PR is in, a theme can be authored so that there are no issues with double escaping or missing escaping -- and none of the settings has to be explicitly pre-escaped. Doesn't that sound reasonable?

The other extreme is to require all the settings to be already escaped by the user (so, to be consistent, even SITENAME, right?). In that case I have no idea if that would "just work" -- expecting a lot of double-escaping issues with the feed generator -- it currently escapes SITENAME, so I would need some FEEDNAME override to provide a non-escaped version. Then there are plugins that use the settings to download files, optimize images on the filesystem etc. I also don't think these expect escaped names... I fear that expecting the settings to be escaped would get quite messy quite fast.

@avaris
Copy link
Member

avaris commented Mar 28, 2018

Now, why would you escape SITENAME? :) That's just text supplied to the themes to do whatever they want with it.

I'm not against this change. I'm just pointing out potential issues (like the possibility of backward incompatibility). Sure, you can author a theme that will work with this. You can also author themes that doesn't need this and expects pre-escaped settings. You can author plugins that will work on escaped SITEURL, or plain text ones. In short, you can handle any case as long as you know what to expect. Until now, SITEURL was used (and supplied) as is. That's changing.

In general I'm OK with this, but at the very least this also requires some documentation about how SITEURL is handled (like not giving escaped ones) and probably how themes should use it.

P.S.: I'm also somewhat puzzled about why you'd need such SITEURLs for a static site in the first place, but that's likely irrelevant :).

@mosra
Copy link
Contributor Author

mosra commented Mar 28, 2018

at the very least this also requires some documentation about how SITEURL is handled

I'll happily provide extensive documentation for this, but not sure what's the right place. Or does some such documentation exist already? Point me to where I can add it :)

I'm also somewhat puzzled about why you'd need such SITEURLs

Ha, yeah :) It's not really SITEURL itself, but rather the other related variables such as PAGE_URL, ARTICLE_URL, STATIC_URL etc., where it's far more likely that there will be some GET parameters separated with & characters in them (google analytics tracking stuff, for example, or in case of STATIC_URL, I can put in a "cache busting" query param placeholder that'll locally calculate a CRC of the file to force browsers to fetch an updated version).

@justinmayer
Copy link
Member

I'll happily provide extensive documentation for this, but not sure what's the right place. Or does some such documentation exist already? Point me to where I can add it :)

I imagine the following is the most appropriate place for SITEURL setting documentation: http://docs.getpelican.com/en/latest/settings.html#SITEURL

@mosra / @avaris: Aside from documentation, what else is left to do regarding this endeavor? Just wanted to bring it up since we're getting close to release time. We can, of course, always defer to the next release if needed. 😊

@mosra
Copy link
Contributor Author

mosra commented Nov 11, 2018

@justinmayer I don't think this is so important to have in for 3.8, so I'm okay with defering until later. At the moment I don't have any released functionality depending on this behavior, just some proofs of concept.

OTOH, #2439 is important for 3.8 ;)

@justinmayer justinmayer added this to the Near-term milestone Jul 5, 2019
@justinmayer justinmayer removed this from the Pelican 4.9 milestone Oct 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants