Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

' is rendered as ' with OptionOutputAsXml #456

Open
kliklamala opened this issue Nov 17, 2021 · 13 comments
Open

' is rendered as ' with OptionOutputAsXml #456

kliklamala opened this issue Nov 17, 2021 · 13 comments
Assignees

Comments

@kliklamala
Copy link

Description

When converting text containing XML character entities with OptionOutputAsXml=true most of them are - correctly - leaved untouched, but &apos is changed to '

May be HtmlEntitys method public static string Entitize(string text, bool useNames, bool entitizeQuotAmpAndLtGt) should also check for code == 39?

  • HAP version: 1.11.38
@JonathanMagnan JonathanMagnan self-assigned this Nov 17, 2021
@JonathanMagnan
Copy link
Member

Hello @MarcopL ,

I'm not sure to understand, that's pretty much what I'm expecting.

The library needs to convert some characters in XML such as &. Even some tools like this one https://miconv.com/convert-txt-to-xml/ work like us.

What would you expect?

Best Regards,

Jon


Sponsorship
Help us improve this library

Performance Libraries
context.BulkInsert(list, options => options.BatchSize = 1000);
Entity Framework ExtensionsBulk OperationsDapper Plus

Runtime Evaluation
Eval.Execute("x + y", new {x = 1, y = 2}); // return 3
C# Eval FunctionSQL Eval Function

@kliklamala
Copy link
Author

Hello Jon,

here is a Xunit test from a project, where I want to change incoming html to xml:

        [Fact]
        public void WhenHtmlContainsXmlCharacterEntitiesThenAgilityPackDoesNotChangeThem()
        {
            string htmlWithXmlEntityCharacters = "<body><p>&apos;&quot;<br>&amp;&lt;&gt;</p></body>";
            string expectedXhtml = "<body><p>&apos;&quot;<br />&amp;&lt;&gt;</p></body>";

            HtmlDocument htmlDoc = new HtmlDocument();
            htmlDoc.OptionDefaultStreamEncoding = Encoding.UTF8;
            htmlDoc.OptionOutputAsXml = true;
            htmlDoc.LoadHtml(htmlWithXmlEntityCharacters);

            var xhtml = htmlDoc.DocumentNode.ChildNodes[0].OuterHtml;

            // If this fails, check if mapping of &apos; in ContentSetMapper is still necessary
            Assert.Equal(expectedXhtml, xhtml);
        }

The test fails with output:

Assert.Equal() Failure
                     ↓ (pos 11)
Expected: <body><p>&apos;&quot;<br />&amp;&lt;&gt;</p></body>
Actual:   <body><p>&amp;apos;&quot;<br />&amp;&lt;&gt;</p></bo···
                     ↑ (pos 11)

My expectation is that &apos is not masked as it is the case for the other xml character entities.

Thanks for taking a look.

Best regards,
Marco

@JonathanMagnan
Copy link
Member

Hello @MarcopL ,

Unfortunately, that's not how work xml. You need to escape every & to &amp;. There are 5 special characters we need to escape: https://www.freeformatter.com/xml-escape.html

The original text was &apos;, on our side we need to escape the text if you want to OutputAsXml

If you unescape &apos; you will get ' (not the original text)
If you unescape &amp;apos; you will get &apos; (original text)

I hope that's more clear why we are currently doing this.

@kliklamala
Copy link
Author

Hi Jonathan,

as you can see from the Xunit output above only "'" ist escaped, but not """, but both character entities are on the xml-escape list. I wood expect all of those characters to be escaped when converting plain text to xml, but none of them when converting html to xml.

Best regards,
Marco

@JonathanMagnan
Copy link
Member

Hello @MarcopL ,

That's the expected behavior. There is 5 special character that we need to escape and that's all: https://www.freeformatter.com/xml-escape.html

@elgonzo
Copy link
Contributor

elgonzo commented Nov 23, 2021

EDIT: Rewrote my comment, because while i was entirely baffled by the behavior, i think i now know how to see it as expected behavior...

Hmm, it's not intutive behavior in my opinion, but i think i understand now why this might be seen as expected behavior.

Please correct me if my view is wrong.

When reading the source HTML, the entities are parsed and translated into the actual characters they represent according to some older HTML spec (or so), where &apos; is not one of the pre-defined named entities (if i am not mistaken, only with HTML 5 it became a named entity in HTML).

As &apos; is not part of the family of HTML entities as understood by HAP (so i tend to believe), it remains unparsed and unchanged as part of the text content.

When outputting to XML, all reserved characters will be translated to their respective XML entity representations (double quotes to &quot;, < / > to &lt; and &gt;, etc., ...). As there is no apostrophe character in the parsed text content, there will be no &apos; generated in the output. However, there is &,a,p,... in the parsed text content, and & here is a reserved character that will be replaced with &amp;.

I think the key here is that the input has been read as HTML -- thus the &apos; wasn't parsed as an entity but just as plain-text.

Assuming my view is not completely false, this would pose the question: Is there a way to enable HTML5 feature set (such as HTML5 entity parsing)?

@JonathanMagnan
Copy link
Member

Hello @elgonzo ,

To be honest, I'm a little bit lost with some part of your text.

All I know is you asked to output as XML and there are 5 reserved characters we need to escape when writing in XML.

What I understand you are asking will have a side impact when unescaped

The original text was &apos; but if we keep your logic, it will be unescaped to ' which is different.

@elgonzo
Copy link
Contributor

elgonzo commented Nov 24, 2021

@JonathanMagnan, if my view is wrong, my apologies for confusing you. But if my view is wrong, then i don't understand how the observed behavior could be expected behavior. Expected behavior implies some intention, logic and reason being behind the observed behavior.

Why is the & in &apos; being escaped/replaced with &amp;, and why is the & in &quot;, &amp;, &lt, &gt; NOT escaped/replaced with &amp;?
In other words, what exactly is the rationale and logic behind treating &apos; differently than &quot;, &amp;, &lt;, &gt;?

@JonathanMagnan
Copy link
Member

Hello @elgonzo ,

You are indeed right and I just find out that we have a weird behavior here.

Existing representations of those characters are not replaced: &amp; &lt;, &gt;, &quot; (Missing &apos; in the logic!) as we can find it here: https://github.com/zzzprojects/html-agility-pack/blob/master/src/HtmlAgilityPack.Shared/HtmlDocument.cs#L422

I guess the method was made to make it XML compatible only. This means that this method already doesn't work if someone unescapes the XML later as the value will not perfectly fit.

In this case, what we can do in this case is to create a list in which you will be able to add any string starting by & that you want they keep their current representations as &apos;.

Is this a solution that could work for you @MarcopL @elgonzo ?

@elgonzo
Copy link
Contributor

elgonzo commented Nov 25, 2021

@JonathanMagnan,

In this case, what we can do in this case is to create a list in which you will be able to add any string starting by & that you want they keep their current representations as '.

While i find this generally to be a neat idea, for &apos; i would prefer to see built-in support, since &apos; is already part of the original XML 1.0 spec from 1998.

Instead, i would suggest a different approach: Do not use the regexes in the method HtmlEncodeWithCompatibility for XML output as-is. Perhaps define a new separate method XmlEncodeWithCompatibility with a regex including _apos that is solely used for XML output. This should not cost too much, as there seem only to be two callsites for HtmlEncodeWithCompatibility when XML outout is selected.

This XmlEncodeWithCompatibility method would also accept a backwardCompatibility flag, which in this case governs whether the old behavior (of not matching &apos;) or the new, standard-conform behavior should be executed. Kinda like this (slightly refactored):

internal static string HtmlEncodeWithCompatibility(string html, bool backwardCompatibility = true)
{
    Regex rx = backwardCompatibility
        ? new Regex("&(?!(amp;)|(lt;)|(gt;)|(quot;))", RegexOptions.IgnoreCase)
        : new Regex("&(?!(amp;)|(lt;)|(gt;)|(quot;)|(nbsp;)|(reg;))", RegexOptions.IgnoreCase);

    return ReplaceWithEntities(html, rx);
}


internal static string XmlEncodeWithCompatibility(string html, bool backwardCompatibility)
{
    if (backwardCompatibility)
    {
        return HtmlEncodeWithCompatibility(html, backwardCompatibility);
    }

    Regex rx = new Regex("&(?!(amp;)|(lt;)|(gt;)|(quot;)|(apos;))", RegexOptions.IgnoreCase);
    return ReplaceWithEntities(html, rx);
}


private static string ReplaceWithEntities(string html, Regex regexAmpersands)
{
    if (html == null)
    {
        throw new ArgumentNullException("html");
    }

    // replace & by &amp; but only once!

    return regexAmpersands.Replace(html, "&amp;").Replace("<", "&lt;").Replace(">", "&gt;").Replace("\"", "&quot;");
}

Now, the two callsites calling HtmlEncodeWithCompatibility for XML output would just have to call XmlEncodeWithCompatibility instead. The two callsites are:

case HtmlNodeType.Text:
html = ((HtmlTextNode) this).Text;
outText.Write(_ownerdocument.OptionOutputAsXml ? HtmlDocument.HtmlEncodeWithCompatibility(html, _ownerdocument.BackwardCompatibility) : html);
break;

if (_ownerdocument.OptionOutputAsXml)
{
name = _ownerdocument.OptionOutputUpperCase ? att.XmlName.ToUpperInvariant(): att.XmlName;
if (_ownerdocument.OptionOutputOriginalCase)
name = att.OriginalName;
if (!isWithoutValue)
{
outText.Write(" " + name + "=" + quote + HtmlDocument.HtmlEncodeWithCompatibility(att.XmlValue, _ownerdocument.BackwardCompatibility) + quote);
}
else
{
outText.Write(" " + name);
}
}

This approach would utilize the HtmlDocument.BackwardCompatibility field to enable/disable recognition of &apos; when XML output is being selected, so

myHtmlDoc.BackwardCompatibility = true; // the default value is true, and wouldn't need to be set explictly
myHtmlDoc.OptionOutputAsXml = true;

would result in the old/current behavior where &apos; turns into &amp;apos; in the outputted XML.

On the other hand,

myHtmlDoc.BackwardCompatibility = false; // no backwards compatibility desired or needed
myHtmlDoc.OptionOutputAsXml = true;

would result in standard-conform behavior where &apos; remains &apos; in the XML output.


That said, i still like the idea of having lists/sets of known entity names.

Just not as a workaround for the &apos; issue. :-)

Organizing known entity names in lists/sets would make perfectly sense in case future HAP versions would aim for supporting different HTML versions, as almost each subsequent HTML version defines new known entity names. And it would also allow to define entirely custom entity name sets. At that point it would probably also make sense thinking about whether this could be consolidated with the functionality found in the HtmlEntity class, but i guess that would be a more costly and expansive undertaking...

@JonathanMagnan
Copy link
Member

Hello @elgonzo ,

I must admit that I'm a little bit lost as I didn't get the same result as you but perhaps I did a mistake.

Something I'm sure is I would not like that behavior even with the BackwardCompatibility = false. That's not what I'm expecting and I know, that will break some of my code so I'm sure it will break also a lot of people's existing code.

Here is what I propose:

  • A new static option named like UseHtmlEncoreWithEntityName
  • Add a if in the HtmlEncodeWithCompatibility method

And inside, you can add whatever you would like it to do. Since that's a new option, no one will get any unexpected behavior, and only the method HtmlEncodeWithCompatibility will have a slight change.

I updated HAP to the latest version, if you could provide a branch or only the new code for the method HtmlEncodeWithCompatibility using the option UseHtmlEncoreWithEntityName that would also be perfect. I just want to make sure that the new option will do exactly what you are expecting on your side.

@elgonzo
Copy link
Contributor

elgonzo commented Dec 2, 2021

A new option would be fine, too. But i would suggest a different option name. I find "UseHtmlEncodeWithEntityName" confusing.

This option name suggest it being related to HTML encoding in some regard, however the issue/behavior we are talking here is chiefly about XML output, not about HTML encode. The option name also suggests it enables/disables using of entity names as a whole, which also is an ill fit to the behavior we are talking about here: entity names are already used, just in an incomplete manner with regard to XML output. It's not about using or not using entity names, it's about using the complete set of XML entity names vs. using an incomplete set of XML entity names.

As such, i would like to suggest an alternative name for this option: UseXmlEntityNamesForXmlOutput.

The one question i would then still have on my mind is whether setting UseXmlEntityNamesForXmlOutput to true would effectively override BackwardCompatibility. In my mind, the default value of UseXmlEntityNamesForXmlOutput would be false, which would yield the behavior as currently is exhibited by HAP. If a developer is then setting both UseXmlEntityNamesForXmlOutput and OptionOutputAsXml to true, the new behavior using (only) the XML entity names as per XML specification will be used, regardless of the value of the BackwardCompatibility field.

If the developer only sets UseXmlEntityNamesForXmlOutput to true, but leaves OptionOutputAsXml at false, the behavior of HAP is as it currently is. (As the name of the option indicates that it is for XML output, it should be relatively easy for a developer to understand that it would have an effect only when XML output is actually being enabled...)

What do you think?

@JonathanMagnan
Copy link
Member

Hello @elgonzo ,

I 100% agree with your name suggestion UseXmlEntityNamesForXmlOutput and about being false by default. The important point as you said is making sure we are still backward compatible on this feature (no matter the BackwardCompatibility value).

I think that we think the same thing ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants