CDATA scanning in XML not behaving properly #531

Open
akshay-kr opened this issue Nov 4, 2024 · 4 comments

Comments

@akshay-kr

We are using the AntiSamy plugin to parse XML that has content inside a CDATA section. This worked until commit HtmlUnit/htmlunit-neko@49a31c0 was added to htmlunit-neko.

For example, given XML like this:

<xt:c-code xt:name="code" xt:version="1" xt:id="15ae0cc7-ded7-4a74-97b8-d66238d3c177"><xt:parameter xt:name="language">html</xt:parameter><xt:text-body><![CDATA[<div></div>]]></xt:text-body></xt:c-code>

Before this commit, the result of scanning the CDATA part was <![CDATA[<div></div>]]>, but after this commit the result is <![CDATA[<div]]>]]&gt;

We parse this XML, specifically the content inside the CDATA section, and then store it. Later, when viewing, we extract the content inside the CDATA section and render it on the web page.
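For reference, here is a minimal sketch of what a strict XML parser returns for the CDATA section in the example above. It uses the JDK's built-in DOM parser, not AntiSamy or htmlunit-neko, so it only illustrates the expected content, not the reported behaviour. The class and method names (`CdataExpectation`, `extractTextBody`) are hypothetical, and namespace awareness is left off on the assumption that the `xt:` prefix is not declared in the snippet.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class CdataExpectation {

    // Hypothetical helper: returns the text content of the first
    // <xt:text-body> element, which includes the CDATA payload.
    static String extractTextBody(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Assumption: the xt: prefix is undeclared in the snippet, so
            // namespace processing stays disabled (this is also the default).
            factory.setNamespaceAware(false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            Node body = doc.getElementsByTagName("xt:text-body").item(0);
            return body.getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml =
            "<xt:c-code xt:name=\"code\" xt:version=\"1\" "
            + "xt:id=\"15ae0cc7-ded7-4a74-97b8-d66238d3c177\">"
            + "<xt:parameter xt:name=\"language\">html</xt:parameter>"
            + "<xt:text-body><![CDATA[<div></div>]]></xt:text-body>"
            + "</xt:c-code>";
        // Prints the raw CDATA payload as a strict XML parser sees it.
        System.out.println(extractTextBody(xml));
    }
}
```

A strict XML parser ends the CDATA section only at the `]]>` terminator, so the `>` inside `<div>` does not close it.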

I also raised an issue for this on the htmlunit-neko repo:
HtmlUnit/htmlunit-neko#125

Is this the expected behaviour going forward? Is there a way to bring back the previous behaviour for those who may be relying on it for XML content parsing?

@rbri
Contributor

rbri commented Dec 1, 2024

I have added a new feature to neko, 'http://cyberneko.org/html/features/scanner/cdata-early-closing', in version 4.6.0. You have to set this if you are parsing XHTML code, because there we do not have to do this strange early closing.

Hopefully there is a way to do this from AntiSamy.

@davewichers
Collaborator

@spassarop - Can you research this? Neko-htmlunit v4.6.0 is included in the AntiSamy:1.7.7 we just released.

@spassarop
Collaborator

I was not able to reproduce that exact output. I tried adding the custom tags to the default policy and scanning the whole XML, and I also tried scanning just the CDATA section. I had to guess the policy and the input string, as neither was stated explicitly with a code example.

What I do get for the CDATA section in every scan is this kind of output: &lt;div]]&gt;, which is similar.

If I add the feature @rbri mentioned, the output is &lt;div]]&gt; when it is set to true and &lt;div&gt;&lt;/div&gt; when it is set to false. The second one seems to be the output expected in the issue description.

What we can do, if that matches the desired behavior, is add a new directive that allows setting that feature in the SAX and DOM parsers by policy. What I am not sure about is which default value to use. Probably the best option would be to set it to false by default, as that was the behavior before the change that came with upgrading Neko.

@akshay-kr, if this description and analysis seem accurate for your needs, let us know.

@rbri
Contributor

rbri commented Dec 16, 2024

Sorry for making things a bit more complicated, but I found another issue in neko regarding the validation of attribute names. The root cause is more or less the same as for this one: when parsing HTML, some things are really different (and more complicated) compared to parsing XHTML.
Currently I am thinking about making the parser a bit more clever, so that it automatically chooses the correct way of working instead of having these kinds of switches that are controlled from the outside.
