Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Antisamy removes "margin" attribute when it's value is configured very small decimal number #101

Open
bijarniy opened this issue Aug 20, 2021 · 23 comments

Comments

@bijarniy
Copy link

Steps to reproduce the problem-

  1. Create HTML content having tag with styling attribute <p style="margin: 0.0001pt;" /> .
  2. Use filterHTML API to filter the above HTML content
  3. In the response, the "margin" attribute is getting removed with warning log [1]

Expected output-
With regexp configuration [2], should not remove the margin with any decimal number

[1] AntiSamy warning: The p tag had a style attribute, "margin", that could not be allowed for security reasons.
[2] <regexp name="length" value="((-|\+)?0|(-|\+)?([0-9]+(\.[0-9]*)?)(em|ex|px|in|cm|mm|pt|pc))"/>

@davewichers
Copy link
Collaborator

@spassarop - Can you look into this one too?

@spassarop
Copy link
Collaborator

spassarop commented Aug 20, 2021

You won't believe it. When you exceed the three significant figures, the value in CSS as a lexical unit is interpreted with an exponential value. What I mean is that 0.001 is parsed to 0.001, but 0.0001 is parsed to 1.0e-5. That value is later used as string for the regex so it fails.

This lexical unit stuff is how Batik CSS handles CSS parsing and its way beyond AntiSamy limits. What AntiSamy does is to concatenate the lexical unit properties from Batik CSS API, this is the case for margin, a float value:

lu.getFloatValue() + lu.getDimensionUnitText()

This is "1.0E-5" + "pt" and after that comes a lowercase which is the final value compared against the regex. Kinda tricky...
My proposal is to include exponential format in the regex, which is totally valid for browsers already. However, the policy regex is not entirely loyal to what is compared later because if I want to tailor the margin regex to be [+-]?\d*.?\d* and not accept any other formats, then I will have a mismatch as it's happening now. On the other hand, we cannot know deep inside the CssValidator which was the string format before transforming itself into a lexical unit with a float value, simple decimal or exponential?

The other problem is that even if the regex is changed anticipating that shady/hidden/implicit conversion and the margin keeps its value, it will have that new verified string, so instead of the original 0.00001pt it will be 1.0E-5pt. That's more inconsistent behavior. Probably double or BigDecimal would be better to handle this, but they would probably cause a conflict between exponential vs non-exponential anyway and that is up to Batik CSS.

So, there lies the reason and extended background of the issue. @davewichers what do you think about all that?

As a side note: this also happens on .NET, just can't fight with the underlying parser :'(

@davewichers
Copy link
Collaborator

davewichers commented Aug 21, 2021

@spassarop - Whatever you want to do here is fine with me. I trust you :-)

@spassarop
Copy link
Collaborator

spassarop commented Aug 21, 2021

In that case, the only thing I can offer is to label this issue as bug and update the example policies so margin can handle exponentials, but still having the format change inconsistency on the output that's out of our hands. @bijarniy, of course you can do that policy change in your project yourself, I just offer having it by default for future releases.

@davewichers
Copy link
Collaborator

Sounds fine with me. Do you want to implement/push those changes?

@spassarop
Copy link
Collaborator

I'm on it. Doing the test cases I found out that there are also problems on Batik CSS with exponential dimensions when they are the input. When using the margin 1.0E-4pt, the value is 1.0 and the dimension is E-4pt instead of just pt. Maybe I can add the test cases explaining the current behavior so the test code is self-explainatory about the issue, expecting them to fail or something like that.

@bijarniy
Copy link
Author

bijarniy commented Aug 23, 2021

In that case, the only thing I can offer is to label this issue as bug and update the example policies so margin can handle exponentials, but still having the format change inconsistency on the output that's out of our hands. @bijarniy, of course you can do that policy change in your project yourself, I just offer having it by default for future releases.

Thanks @davewichers for sharing RCA of this issue. As per your latest comment, are you going to provide this by default in the next release? If so, could you please share probable timeline for the same and based on that we would be asking our customer to change the policy to allow exponentials
cc : @spassarop

@bijarniy
Copy link
Author

bijarniy commented Aug 24, 2021

@davewichers / @spassarop I did close this issue by mistake so reopening it.

@bijarniy bijarniy reopened this Aug 24, 2021
@spassarop
Copy link
Collaborator

@bijarniy yes this will be on the next release if it gets merged before that.

spassarop added a commit that referenced this issue Aug 29, 2021
- Update length regexes on example policies.
- The tests have comments explaining the bug (#101) and a single case that passes but involves a latent bug.
@spassarop spassarop added the bug label Aug 29, 2021
@spassarop
Copy link
Collaborator

@davewichers even though a partial solution was provided including exponentials on regex there is no actual solution as we depend on Batick CSS. Should we leave this open potentially forever? bug is the only certain label I decided to add.

@kwwall
Copy link
Contributor

kwwall commented Aug 29, 2021

@davewichers and @spassarop - This really is sounding more like a bug in Batik CSS or possibly a misalignment in the understanding with your users than it is a bug per se in AntiSamy. I would recommend closing it with a 'wontfix' label. (You do have one of those, don't you? I haven't checked.)

@LiuXing-R
Copy link
Contributor

I verified some CSS attributes (width and length) and found that there was a problem of 10000000 -> 1.0e7, so I looked at the source code because Batik CSS used Float.parseFloat(String val).
https://github.com/apache/xmlgraphics-batik/blob/f79ce602fa6244d2a7ff45b57b6a0347e883bc37/batik-css/src/main/java/org/apache/batik/css/parser/Parser.java#L981-L983
->
https://github.com/apache/xmlgraphics-batik/blob/f79ce602fa6244d2a7ff45b57b6a0347e883bc37/batik-css/src/main/java/org/apache/batik/css/parser/Parser.java#L1551-L1560

Do we need to push the Batik CSS community to address this issue?

@davewichers
Copy link
Collaborator

@spassarop - Would you be willing to open up a discussion with Batik to see what they say?

@kwwall
Copy link
Contributor

kwwall commented Aug 30, 2021

@star-R I think the only logical approach at this point is to ask that Batik CSS community to fix it. Write up an example with a test case using Batik CSS directly and show them expected outcome to illustrate exactly what you are asking for / expecting. Because I think the AntiSamy team has already said they are not going to replace Batik CSS or try to write their own CSS parser. Just my $.02, but that was my interpretation based on earlier comments from the AntiSamy contributors. Batik CSS is still being actively supported so if you can show them a place that it not in compliance with W3C, I think you'll have a fair shot of convincing them. But since you are the one who wants this (I doubt anyone else uses such extreme margins), I think you should be the one to file an issue with Batik CSS.

@spassarop
Copy link
Collaborator

I agree with Kevin on the wontfix and on the W3C compliance argument. Just to avoid misunderstandings, @LiuXing-R was just helping investigate, @bijarniy was the one who created the issue.

I found the JIRA site where they add issues and signed up but there are so many fields to fill that I'll probably be kicked in the head by one of the team if I do it wrong. It would be better to notify somewhere to discuss and make them follow the right process. Apparently it must be done through the mailing list, I subscribed [email protected] to later see how you're supposed to report a bug. Of course anyone else can do it too, in my case if I end doing it I'll comment here. It will take time to elaborate a convincing and complete bug report.

@davewichers
Copy link
Collaborator

@spassarop - Any dialog going on with the Batik team? If not, can you push/nudge again to see if you can get them to work on this?

@spassarop
Copy link
Collaborator

Damn, I forgot to do this. I started to investigate how to report but somehow I didn't do it. Guess I'll do it in the following days, I'm setting a reminder.

@spassarop
Copy link
Collaborator

Created an issue just now: https://issues.apache.org/jira/browse/BATIK-1320
Hope it is the right path, could not find other way of telling them. Let's wait now.

@carlosame
Copy link

I found this issue while browsing Batik's JIRA, although I'm not related to the Batik project (nor the ASF) in any way. I do not think that there is anything that the Batik people (nor any other SAC implementor) could do. The SAC API provides the getFloatValue() that, as the name implies, provides a float. And you are serializing the result with the following:

lu.getFloatValue() + lu.getDimensionUnitText()

That is, the 1.0e-5 appears because you are serializing the value that way.

The SAC API does not provide any way to determine the exact input text that was used when parsing. Perhaps Antisamy could provide a different kind of check so numbers are checked as numbers and not as strings. Checking numbers with string methods is always going to be error prone.

@carlosame
Copy link

Incidentally, I see that you are checking against old CSS 2.1 units.

<regexp name="length" value="((-|\+)?0|(-|\+)?([0-9]+(\.[0-9]*)?)(em|ex|px|in|cm|mm|pt|pc))"/>

But today's websites use a wider range of values and units, see CSS Values and Units Module Level 4. So if you keep with the regexp approach (for example serializing the lu.getFloatValue() with a DecimalFormat that avoids exponents), I'd suggest adding the newer rem, rex, cap, etc.

@spassarop
Copy link
Collaborator

@carlosame: About the units, the regex was expanded in #104 based on this page. I don't know why some of the units were left out really. We could expand the list even more with absolute and relative units.

Regarding the float stuff the matter is complex. AntiSamy does "generic" validations by comparing to values defined in a policy XML. It can be a strict comparison if it's by value or a regexp-based comparison. I agree it's best to use abstract representations whenever it's possible instead of just using strings in general. That's why we use parsers and such utilities and work with objects. However, relying on parsers for this to be done eventually ends up in having a final granular value that we cannot go "deeper" with.

In general, for HTML and CSS we use strings as normally the values are just text or number+text. We also stop going deeper when evaluating attributes values or tags' text contents. With CSS we stop on those lexical units for CSS property values. A property value can be really complex as you know, with numbers, units, function calls, etc. That's why we also need the parser to be specific with what it provides so we can validate the values somehow. I'd love we had something more versatile than just getFloatValue() which covers the cases we need, but float is the most complete type we can use now.

I don't really know if an improvement from Batik's side can be made, that's what I want them to tell us. I'm not a Java guy so that's why I seek specific knowledge for this. On AntiSamy .NET I had the same issue and reached the parser creator who fixed it, so now it has the behavior I needed on most cases (there was a details with exponents which I cannot remember now). Don't have much to say.

@carlosame
Copy link

I don't know why some of the units were left out really. We could expand the list even more with absolute and relative units.

I haven't seen that updated regex, I'd just add vmin and vmax as the other units are only supported by only one or two user agents AFAIR.

A property value can be really complex as you know, with numbers, units, function calls, etc. That's why we also need the parser to be specific with what it provides so we can validate the values somehow. I'd love we had something more versatile than just getFloatValue() which covers the cases we need, but float is the most complete type we can use now.

The problem are the following lines:

return lu.getFloatValue() + lu.getDimensionUnitText();

and

return String.valueOf(lu.getFloatValue());

If you serialize those float values with a properly configured DecimalFormat instead of the current valueOf(), it does not matter whether the input was 0.0001pt or 1e-5pt (which is also a valid CSS value). Your regular expression would work.

You do not need to change anything at Batik, although I'd migrate to a more modern SAC parser like my Css4j 1.3.1 (in post-EOL state) or cssparser (less complete but still maintained, and the maintainer is generally very helpful).

I don't really know if an improvement from Batik's side can be made, that's what I want them to tell us.

The Batik project is not very active currently so you could have to wait for a long time, and the fix that I explained is easy to apply.

@spassarop
Copy link
Collaborator

Related to #293

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants