Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to identify namespaces / SVG + MathML #72

Closed
otherdaniel opened this issue Mar 22, 2021 · 18 comments
Closed

How to identify namespaces / SVG + MathML #72

otherdaniel opened this issue Mar 22, 2021 · 18 comments
Labels
sanitizer-api issues with the API

Comments

@otherdaniel
Copy link
Collaborator

The current spec uses a tag's local name, meaning it cannot distinguish between SVG, MathML, and HTML element names. Clearly, that won't do long-term.

@otherdaniel
Copy link
Collaborator Author

Honestly, I have no idea what a good answer is here, Would appreciate feedback. Several proposals below. IMHO, they are all viable, but with different aesthetics.

In an XML-centric world -- which is where namespaces were invented -- this is easy: QNames everywhere => problem solved. It's just that HTML didn't go the XML-centric route, so I'm not sure whether that should be our answer here.

I think there's two questions: Whether to consider fixed vs arbitrary namespaces, and syntax.

Should we consider arbitrary namespaces, or only the browser-supported ones? That is, should a Sanitizer config be able to specify <foobar> from the http://example.org/ns/bla namespace, or is it sufficient (and certainly easier) to map elements - for Sanitizer purposes - into the triple choice [HTML|SVG|MathML], or maybe a quadruple choice with other?

There'd be a number of ways to represent this (depending a bit on the previous question):

  • separate lists: { allow_elements: [...], allow_svg_elements: [...], ...}
  • unified lists with 'known' prefixes: { allow_elements: [ "div", "svg:bla", "math:bla" ] }
  • unified lists with qnames: { allow_elements: [ "http://www.w3.org/2000/svg:script" ] }
  • nested structure: { allow_elements: { "http://www.w3.org/2000/svg:script" : [ "foo" ] } }

@mozfreddyb
Copy link
Collaborator

A few ill-thought out ideas while I'm already stepping away from my desk.
Do we need the full namespace or can we shorten to html:, svg:, mathml:?

I'm hopeful we don't have to solve this for arbitrary namespaces but hardcode a couple of them? I don't expect lots of new namespaces with their own parsing rules to creep up any minute (I might be wrong!)

@otherdaniel otherdaniel added the sanitizer-api issues with the API label Mar 22, 2021
@otherdaniel
Copy link
Collaborator Author

Do we need the full namespace or can we shorten to html:, svg:, mathml:?

We can absolutely do that, and IMHO it'd be simpler.

I would like to be sure that we aren't overlooking a use case for arbitrary namespaces, though. If we don't strongly care about that - which I don't - then I think we have a pretty easy solution.

@securityMB
Copy link

Can we make the namespaces implicit? For instance: <textarea> exists only in HTML, so it should be only allowed within HTML namespace, while <mtext> is valid in MathML so it should be discarded in other namespaces?

@otherdaniel
Copy link
Collaborator Author

Can we make the namespaces implicit?

What about <a>, which can be either in HTML namespace (HTMLAnchorElement) or SVG namespace (SVGAElement)?
(<image> is svg only, but has weird parsing rules in HTML.)

@securityMB
Copy link

securityMB commented Apr 29, 2021

I'd assume that if element is valid for more than one namespace, it would be valid in all of them. So simple <a> would make it allowable in both SVG and HTML.

I'm aware of five tags that are valid in two namespaces: <title>, <style>, <font>, <a>, <script>.

[edit]: The reasoning for that is from my experience developers mainly are aware of tag names, not the namespaces shenanigans. So that should match their expectations.

@otherdaniel
Copy link
Collaborator Author

I'd assume that if element is valid for more than one namespace, it would be valid in all of them. So simple would make it allowable in both SVG and HTML.

I like this. It seems both simple and effective. Also, amusingly, it means that the spec "by accident" already mostly does the right thing, since this would mean we act on the tag names only, and largely ignore namespaces.

Disadvantages, for completeness sake:

  • In several places, the spec references elements by their interface (e.g HTMLAnchorElement) instead of their tag name. We should either rewrite that, or have be to extra careful that we're always covering the different variants, where they exists.
  • I find it a bit clumsy from a theoretical perspective, since this would e.g. make it impossible to allow HTML's <a>, but disallow SVG's <a>. That said, I note that simplicity is a major project goal, while "theoretical non-clumsiness" isn't. (Maybe one could have an additional top-level boolean for SVG and MathML on/off; but interpret the allow/block lists only based on the tag name.)

[edit]: The reasoning for that is from my experience developers mainly are aware of tag names, not the namespaces shenanigans. So that should match their expectations.

I very much suspect this is correct.

@securityMB
Copy link

securityMB commented Apr 29, 2021

  • I find it a bit clumsy from a theoretical perspective, since this would e.g. make it impossible to allow HTML's <a>, but disallow SVG's <a>. That said, I note that simplicity is a major project goal, while "theoretical non-clumsiness" isn't. (Maybe one could have an additional top-level boolean for SVG and MathML on/off; but interpret the allow/block lists only based on the tag name.)

Maybe we can have a bit of both worlds:

  • if developer specifies a in allowlist, then we assume a tags in all namespaces it's valid (SVG and HTML),
  • if developer specifies svg:a or html:a in allowlist then only a in SVG and HTML respectively would be allowed.

@zcorpan
Copy link

zcorpan commented Oct 7, 2021

The HTML parser knows about these namespaces for elements:

  • HTML
  • MathML
  • SVG

and these for attributes:

  • XML
  • XMLNS
  • XLink

How would we deal with allowedAttributes and namespaces? The least-surprise would probably be to use xml:lang, xmlns and xlink:href as syntax in the API to mean the appropriate namespaced attributes, and other attributes with no prefix always mean no namespace (i.e. allowing lang does not allow lang in the XML namespace).

For extra fun, the attribute with local name xml:lang in no namespace is allowed in HTML:

To ease migration to and from XML, authors may specify an attribute in no namespace with no prefix and with the literal localname "xml:lang" on HTML elements in HTML documents, but such attributes must only be specified if a lang attribute in no namespace is also specified, and both attributes must have the same value when compared in an ASCII case-insensitive manner.

https://html.spec.whatwg.org/multipage/dom.html#the-lang-and-xml:lang-attributes

Personally, I don't mind a sanitizer throwing away that useless attribute. For conformance authors are required to use the lang attribute with the same value, so no information is lost in conforming cases.

@mozfreddyb
Copy link
Collaborator

Oh my goodness. This is very sad and very useful, thanks for bringing these insights here @zcorpan! :)

It's on @otherdaniel to updates his proposal in #103, but what he and I discussed seems along the lines of what you suggested.

  • If you do not supply a namespace, we assume HTML.
  • If you do supply a foo:bar pair, we require foo is a namespace known to the HTML parser, i.e., either of html,mathml,svg . If it's not a known namespace, we will not accept it (I think we should throw an exception, but maybe there are upsides to silently discarding?).

I'm not entirely sure what you mean with "the appropriate namespaced attributes" for xml:lang, xmlns and xlink:href. Having a special, warty corner-case hardcoded check on "xml:lang" looks terrible, but OKish to me? If it's harmless in terms of XSS, I'd rather not throw it away. It's good to hear that it's non-conformant, but the more important question to me is: Is it also not widely used? :)

@zcorpan
Copy link

zcorpan commented Oct 7, 2021

I'm not entirely sure what you mean with "the appropriate namespaced attributes" for xml:lang, xmlns and xlink:href.

I mean

  • specifying an xml: prefix in allowedAttributes means the part after the : refers to the local name, and the namespace is http://www.w3.org/XML/1998/namespace
  • specifying exactly xmlns or xmlns: prefix means the namespace is http://www.w3.org/2000/xmlns/
  • specifying xlink: prefix means the namespace http://www.w3.org/1999/xlink

So to clarify, "xml:lang" can refer to two things

  • lang local name in the XML namespace (http://www.w3.org/XML/1998/namespace)
  • xml:lang local name in no namespace

The HTML parser can produce both!

On HTML elements, you get the no namespace variant. On SVG and MathML elements, you get the namespaced variant.

On HTML elements, the namespaced variant is not allowed, but the no-namespace one is allowed. (You can get it with DOM APIs, not with the HTML parser.)

And actually, the xmlns attribute is in a similar situation: on HTML elements, it has no namespace, but is allowed

In HTML documents, elements in the HTML namespace may have an xmlns attribute specified, if, and only if, it has the exact value "http://www.w3.org/1999/xhtml". This does not apply to XML documents.

Note: In HTML, the xmlns attribute has absolutely no effect. It is basically a talisman. It is allowed merely to make migration to and from XML mildly easier. When parsed by an HTML parser, the attribute ends up in no namespace, not the "http://www.w3.org/2000/xmlns/" namespace like namespace declaration attributes in XML do.

https://html.spec.whatwg.org/multipage/dom.html#global-attributes:html-namespace-2

@zcorpan
Copy link

zcorpan commented Oct 7, 2021

As far as I know, xml: attributes don't pose an XSS risk, so could be allowed by default. xmlns and xmlns:something (the only possible value of something here, in an HTML context, is xlink) also don't pose an XSS risk, but bogus values could trip up something if it goes through an XML pipeline (but then you'll likely have other problems too).

The only XSS-risky one I think is xlink:href (on SVG elements, at least a), which could execute script with the javascript: URL scheme.

@mozfreddyb mozfreddyb added the v1 label Mar 23, 2022
@otherdaniel
Copy link
Collaborator Author

It's been a while. The current status is:

  • elements support SVG + MathML, using fixed "svg:" and "math:" prefixes.
  • attributes supports a fixed set of namespaced attributes, taken directly from the HTML spec.

This purposely limits Sanitizer to support the exact same namespaced content as HTML does.

I think this resolves the issue; but I'll wait for a bit with closing it in case anyone disagrees.

@zcorpan
Copy link

zcorpan commented Mar 25, 2022

The spec isn't very clear about how the attribute match list namespace mapping works, imo. Is the namespace mapping always applied? Or only when the element is "svg:*" or "math:*"?

@otherdaniel
Copy link
Collaborator Author

The spec isn't very clear about how the attribute match list namespace mapping works, imo. Is the namespace mapping always applied? Or only when the element is "svg:" or "math:"?

Thanks for the feedback!

The intent is:

  • There is a finite list of (currently) 11 namespaced attributes, defined in HTML. We'll recognize exactly those, in exactly the syntax of the attribute table in HTML, and nothing else.
  • It doesn't matter what element they belong to.

I'll try to clarify the wording.

(My own problem when trying to spec that was that, knowing how namespaces work in XML, I found the HTML adoption of this concept fairly strange. But I believe our current solution faithfully matches what HTML does.)

@zcorpan
Copy link

zcorpan commented Mar 28, 2022

OK. Note that the HTML syntax only supports these namespaced attributes on foreign elements.

@otherdaniel
Copy link
Collaborator Author

OK. Note that the HTML syntax only supports these namespaced attributes on foreign elements.

Good point, I had overlooked that. This is presently mis-specified in Sanitizer.

@mozfreddyb
Copy link
Collaborator

#181

@mozfreddyb mozfreddyb closed this as not planned Won't fix, can't repro, duplicate, stale Aug 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sanitizer-api issues with the API
Projects
None yet
Development

No branches or pull requests

4 participants