Parsing HTML entities in XML using Nokogiri as adapter #154

Open
suleman-uzair opened this issue Nov 6, 2024 · 4 comments
Labels: enhancement (New feature or request), question (Further information is requested)

Comments

@suleman-uzair

The Nokogiri gem doesn’t handle HTML entities other than the five predefined XML entities (&amp;, &lt;, &gt;, &quot;, and &apos;). All other entities are ignored or replaced, even though they are valid input in MathML.

Issue faced while MathML parsing in plurimath/mml#2.

@ronaldtse @HassanAkbar should we consider Ox for this issue or is this implementable in Lutaml-Model?

@suleman-uzair added the enhancement and question labels on Nov 6, 2024
@ronaldtse
Contributor

I believe Nokogiri supports only formal XML entities. However, if MathML is built on XML, shouldn’t it use XML entities?

Why do we have to use any HTML entities when we can use the character codes?

@suleman-uzair
Author

Why do we have to use any HTML entities when we can use the character codes?

@ronaldtse, we do not need to use HTML entities, but MathML editors (MathJax, for example) do support them, and some examples also contain HTML entities (&sum; and &prod;, for example).
Also, &micro; is available in the prefixes.yaml file in UnitsDB for HTML reference, which is used for MathML conversion in Unitsml-Ruby.

@ronaldtse
Contributor

I see, so this is purely for supporting bad XML (bad MathML editors): MathML that contains HTML entities.

When Plurimath parses HTML or MathML, sure, it can accept HTML entities. But when it outputs MathML, there is no reason for it to output HTML entities, which are unsupported in XML.
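
As a rough illustration of the output side, here is a minimal sketch in plain Ruby. It assumes that numeric character references are the preferred output form (literal UTF-8 characters would be equally valid XML). The helper name `to_numeric_references` is hypothetical and is not part of Plurimath or Nokogiri.

```ruby
# Hypothetical helper, not part of Plurimath or Nokogiri.
# Rewrites every non-ASCII character in already-serialized markup as a
# hexadecimal numeric character reference, which is valid in any XML document.
# ASCII characters (including <, >, and &) are left untouched.
def to_numeric_references(markup)
  markup.gsub(/[^[:ascii:]]/) { |char| format("&#x%X;", char.ord) }
end

# "\u2211" is the N-ARY SUMMATION character that HTML calls &sum;.
to_numeric_references("<mo>\u2211</mo>") # => "<mo>&#x2211;</mo>"
```

With this kind of step on the way out, the serialized MathML never contains a named HTML entity, whatever the input used.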

I don’t know how we can make Nokogiri support them; as I recall, the Nokogiri HTML parser is needed.

@opoudjis
Contributor

opoudjis commented Nov 7, 2024

HTML entities have caused me issues in the past, because they will turn up in markup and they are not guaranteed to be supported by Nokogiri at all. I did indeed need to use the Nokogiri HTML parser in Metanorma, and when Nokogiri forced me to stop doing so, I instead converted all HTML entities in Metanorma Asciidoc to XML entities in preprocessing: metanorma/metanorma-iso#666
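
A preprocessing step of this kind could be sketched roughly as follows. This is a minimal illustration in plain Ruby, not the code from metanorma/metanorma-iso#666. It assumes that "XML entities" means numeric character references. The helper name `html_entities_to_numeric` and the three-entry `NAMED_ENTITIES` table are hypothetical; a real implementation would need the full HTML entity list.

```ruby
# Illustrative subset only (assumption): a real table would cover every
# named entity defined for HTML and MathML.
NAMED_ENTITIES = {
  "sum"   => 0x2211, # N-ARY SUMMATION
  "prod"  => 0x220F, # N-ARY PRODUCT
  "micro" => 0x00B5, # MICRO SIGN
}.freeze

# The five entities every XML parser already understands.
XML_PREDEFINED = %w[amp lt gt quot apos].freeze

# Hypothetical helper: rewrites known HTML named entities as hexadecimal
# numeric character references before the text is handed to an XML parser.
# Predefined XML entities and unknown names are returned unchanged.
def html_entities_to_numeric(xml)
  xml.gsub(/&([A-Za-z][A-Za-z0-9]*);/) do |match|
    name = Regexp.last_match(1)
    next match if XML_PREDEFINED.include?(name)

    codepoint = NAMED_ENTITIES[name]
    codepoint ? format("&#x%X;", codepoint) : match
  end
end

html_entities_to_numeric("<mo>&sum;</mo>") # => "<mo>&#x2211;</mo>"
```

Note that this sketch is a plain text substitution, so it would also rewrite entity-like text inside CDATA sections or comments. Whether that matters depends on the input.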

And HTML entities will turn up in markup. Declining to support them in reading documents is not an option.
