&apos; is rendered as &amp;apos; with OptionOutputAsXml #456
Comments
Hello @MarcopL , I'm not sure I understand; that's pretty much what I'm expecting. The library needs to convert some characters in XML such as What would you expect? Best Regards, Jon
Hello Jon, here is an xUnit test from a project where I want to convert incoming HTML to XML:
The test fails with output:
My expectation is that Thanks for taking a look. Best regards,
Hello @MarcopL , Unfortunately, that's not how work The original text was If you unescape I hope that makes it clearer why we are currently doing this.
Hi Jonathan, as you can see from the xUnit output above, only "&apos;" is escaped, but not "&quot;", although both character entities are on the XML escape list. I would expect all of those characters to be escaped when converting plain text to XML, but none of them when converting HTML to XML. Best regards,
Hello @MarcopL , That's the expected behavior. There are 5 special characters that we need to escape, and that's all: https://www.freeformatter.com/xml-escape.html
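For illustration, here is a minimal sketch of what escaping those five characters looks like in isolation. The class `XmlEscapeSketch` and the method `EscapeXml` are hypothetical names made up for this example; they are not part of HtmlAgilityPack and do not reproduce its implementation. The sketch also shows why text that already contains an entity reference gets escaped a second time when it is treated as plain text.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not HtmlAgilityPack code.
static class XmlEscapeSketch
{
    // Replaces the five characters that XML predefines entities for.
    public static string EscapeXml(string text)
    {
        if (text == null)
        {
            throw new ArgumentNullException(nameof(text));
        }

        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            switch (c)
            {
                case '&': sb.Append("&amp;"); break;
                case '<': sb.Append("&lt;"); break;
                case '>': sb.Append("&gt;"); break;
                case '"': sb.Append("&quot;"); break;
                case '\'': sb.Append("&apos;"); break;
                default: sb.Append(c); break;
            }
        }
        return sb.ToString();
    }
}
```

With this naive approach, `XmlEscapeSketch.EscapeXml("'")` yields `&apos;`, while `XmlEscapeSketch.EscapeXml("&apos;")` yields `&amp;apos;`, because every ampersand is escaped regardless of what follows it.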
EDIT: Rewrote my comment, because while i was entirely baffled by the behavior, i think i now know how to see it as expected behavior... Hmm, it's not intuitive behavior in my opinion, but i think i understand now why this might be seen as expected behavior. Please correct me if my view is wrong. When reading the source HTML, the entities are parsed and translated into the actual characters they represent according to some older HTML spec (or so), where As When outputting to XML, all reserved characters will be translated to their respective XML entity representations (double quotes to I think the key here is that the input has been read as HTML -- thus the Assuming my view is not completely false, this would pose the question: Is there a way to enable the HTML5 feature set (such as HTML5 entity parsing)?
Hello @elgonzo , To be honest, I'm a little bit lost with some parts of your text. All I know is you asked to output as What I understand you are asking will have a side impact when unescaped The original text was
@JonathanMagnan, if my view is wrong, my apologies for confusing you. But if my view is wrong, then i don't understand how the observed behavior could be expected behavior. Expected behavior implies some intention, logic and reason being behind the observed behavior. Why is the |
Hello @elgonzo , You are indeed right, and I just found out that we have a weird behavior here. Existing representations of those characters are not replaced: I guess the method was made to make it In this case, what we can do is create a list to which you will be able to add any string starting with Is this a solution that could work for you @MarcopL @elgonzo ?
While i find this generally to be a neat idea, for Instead, i would suggest a different approach: Do not use the regexes in the method HtmlEncodeWithCompatibility for XML output as-is. Perhaps define a new separate method XmlEncodeWithCompatibility with a regex including _apos that is solely used for XML output. This should not cost too much, as there seem only to be two callsites for HtmlEncodeWithCompatibility when XML output is selected. This XmlEncodeWithCompatibility method would also accept a backwardCompatibility flag, which in this case governs whether the old behavior (of not matching
internal static string HtmlEncodeWithCompatibility(string html, bool backwardCompatibility = true)
{
    Regex rx = backwardCompatibility
        ? new Regex("&(?!(amp;)|(lt;)|(gt;)|(quot;))", RegexOptions.IgnoreCase)
        : new Regex("&(?!(amp;)|(lt;)|(gt;)|(quot;)|(nbsp;)|(reg;))", RegexOptions.IgnoreCase);
    return ReplaceWithEntities(html, rx);
}
internal static string XmlEncodeWithCompatibility(string html, bool backwardCompatibility)
{
    if (backwardCompatibility)
    {
        return HtmlEncodeWithCompatibility(html, backwardCompatibility);
    }
    Regex rx = new Regex("&(?!(amp;)|(lt;)|(gt;)|(quot;)|(apos;))", RegexOptions.IgnoreCase);
    return ReplaceWithEntities(html, rx);
}
private static string ReplaceWithEntities(string html, Regex regexAmpersands)
{
    if (html == null)
    {
        throw new ArgumentNullException("html");
    }
    // replace & by &amp; but only once!
    return regexAmpersands.Replace(html, "&amp;").Replace("<", "&lt;").Replace(">", "&gt;").Replace("\"", "&quot;");
}
Now, the two callsites calling HtmlEncodeWithCompatibility for XML output would just have to call XmlEncodeWithCompatibility instead. The two callsites are:
html-agility-pack/src/HtmlAgilityPack.Shared/HtmlNode.cs Lines 1977 to 1980 in c41452a
html-agility-pack/src/HtmlAgilityPack.Shared/HtmlNode.cs Lines 2329 to 2343 in c41452a
This approach would utilize the
myHtmlDoc.BackwardCompatibility = true; // the default value is true, and wouldn't need to be set explicitly
myHtmlDoc.OptionOutputAsXml = true;
would result in the old/current behavior where On the other hand,
myHtmlDoc.BackwardCompatibility = false; // no backwards compatibility desired or needed
myHtmlDoc.OptionOutputAsXml = true;
would result in standard-conform behavior where That said, i still like the idea of having lists/sets of known entity names. Just not as a workaround for the Organizing known entity names in lists/sets would make perfect sense in case future HAP versions would aim for supporting different HTML versions, as almost each subsequent HTML version defines new known entity names. And it would also allow to define entirely custom entity name sets. At that point it would probably also make sense thinking about whether this could be consolidated with the functionality found in the HtmlEntity class, but i guess that would be a more costly and expansive undertaking...
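The effect of the two lookahead patterns from the comment above can be seen in isolation with a small sketch. The class `AmpersandLookaheadSketch` and its method `EscapeLoneAmpersands` are hypothetical names used only for this example; the two regular expressions are copied from the proposal, and only the ampersand step is shown (not the subsequent replacement of `<`, `>` and `"`).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demonstration class; not part of HtmlAgilityPack.
static class AmpersandLookaheadSketch
{
    // Pattern without apos; in the lookahead: "&apos;" is not recognized as an entity.
    static readonly Regex WithoutApos =
        new Regex("&(?!(amp;)|(lt;)|(gt;)|(quot;))", RegexOptions.IgnoreCase);

    // Pattern with apos; added to the lookahead, as suggested for XML output.
    static readonly Regex WithApos =
        new Regex("&(?!(amp;)|(lt;)|(gt;)|(quot;)|(apos;))", RegexOptions.IgnoreCase);

    // Escapes every '&' that does not start one of the entities listed in the lookahead.
    public static string EscapeLoneAmpersands(string text, bool knowsApos)
    {
        Regex rx = knowsApos ? WithApos : WithoutApos;
        return rx.Replace(text, "&amp;");
    }
}
```

For the input `a &apos; b &quot; c & d`, calling `EscapeLoneAmpersands(input, false)` gives `a &amp;apos; b &quot; c &amp; d`, whereas `EscapeLoneAmpersands(input, true)` gives `a &apos; b &quot; c &amp; d`. The first result is the double escaping reported in this issue.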
Hello @elgonzo , I must admit that I'm a little bit lost, as I didn't get the same result as you, but perhaps I made a mistake. Something I'm sure of is that I would not like that behavior even with the Here is what I propose:
And inside, you can add whatever you would like it to do. Since that's a new option, no one will get any unexpected behavior, and only the method I updated
A new option would be fine, too. But i would suggest a different option name. I find "UseHtmlEncodeWithEntityName" confusing. This option name suggests that it is related to HTML encoding in some regard; however, the issue/behavior we are talking about here is chiefly about XML output, not about HTML encoding. The option name also suggests it enables/disables the use of entity names as a whole, which is also an ill fit for the behavior we are talking about here: entity names are already used, just in an incomplete manner with regard to XML output. It's not about using or not using entity names, it's about using the complete set of XML entity names vs. using an incomplete set of XML entity names. As such, i would like to suggest an alternative name for this option: The one question i would then still have on my mind is whether setting If the developer only sets What do you think?
Hello @elgonzo , I 100% agree with your name suggestion I think that we think the same thing ;)
Description
When converting text containing XML character entities with OptionOutputAsXml=true, most of them are (correctly) left untouched, but &apos; is changed to &amp;apos;.
Maybe HtmlEntity's method public static string Entitize(string text, bool useNames, bool entitizeQuotAmpAndLtGt) should also check for code == 39?