Double-escaped RegEx patterns in output JSON schema causing issues with some RegEx flavors #182

Closed
nikitawootten-nist opened this issue Feb 10, 2022 · 2 comments · Fixed by #191, #197 or #183
Labels
bug Something isn't working

Comments

@nikitawootten-nist
Contributor

nikitawootten-nist commented Feb 10, 2022

Describe the bug

Some RegEx flavors (such as Go's regexp package https://pkg.go.dev/regexp/syntax and PHP's PCRE) do not support Unicode character classes through the \u{code} syntax. Validation of certain datatypes, such as the token type, may improperly rely on this RegEx syntax.

Who is the bug affecting?

Tool developers who use the patterns from generated JSON schemas with certain RegEx flavors (like Go's regexp package or PHP's PCRE).

What is affected by this bug?

The regex present in some output JSON schema patterns is invalid for some RegEx flavors.

When does this occur?

Any time a Unicode escape sequence is placed within a RegEx pattern (such as for the token datatype).

Expected behavior (i.e. solution)

If the Unicode character pattern is single-escaped (\u... instead of \\u...), all the JSON parsers I've tested (Go, JS, and Python) decode it directly to the Unicode character. I have not tested how any of these regex flavors handle literal Unicode characters, but this could be a simple solution to the issue.
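
A minimal sketch of the difference, using Python's `json` module. The two schema fragments and the character range are hypothetical and are not the actual Metaschema token pattern. Note that Python's `re` happens to accept `\uXXXX` escapes itself, so this only illustrates what the JSON parser hands to the regex engine, not the failure seen in Go or PCRE.

```python
import json
import re

# Hypothetical schema fragments; the range U+00C0..U+00D6 is illustrative only.
# Raw strings are used so the text below is exactly what would sit in the JSON file.
double_escaped_json = r'{"pattern": "^[\\u00C0-\\u00D6]+$"}'
single_escaped_json = r'{"pattern": "^[\u00C0-\u00D6]+$"}'

# Double-escaped: the JSON parser yields a backslash followed by "u00C0",
# leaving the regex engine to interpret the \u escape (or reject it).
double_pattern = json.loads(double_escaped_json)["pattern"]

# Single-escaped: the JSON parser resolves the escape, so the regex engine
# receives the literal characters and never sees a backslash.
single_pattern = json.loads(single_escaped_json)["pattern"]

print(repr(double_pattern))
print(repr(single_pattern))

# The literal-character form matches as expected in Python's re module.
print(re.fullmatch(single_pattern, "ÀÇÖ") is not None)
```

Whether literal characters behave correctly in every target regex flavor would still need to be checked per flavor, as noted above.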

Other Comments

This bug is related to #181

@nikitawootten-nist nikitawootten-nist added the bug Something isn't working label Feb 10, 2022
@wendellpiez
Collaborator

Let's start with the simplest thing and see how far it works, or does not -- express the tests in the (Metaschema) backend source not in escaped form, but with literals (or XML character representations of literals). It would be interesting to see how these are handled by default by the JSON serializer built into Saxon, in particular whether and where it escapes the characters in question (presumably into a single-escaped form).

This issue is hard if we work against the Saxon serializer, easy if we work with it. If we have to work against it, an option is to start with this XSLT, linked from the XSLT 3.0 Rec, which presumably could be used to replace the Saxon serialization of JSON this pipeline presently relies on.

But trying the simple thing first could open a way forward to work with it.
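
As a rough analogy only, the sketch below shows how one JSON serializer treats a pattern written with literal characters. It uses Python's `json.dumps`, not Saxon, so it says nothing about what Saxon's serializer actually emits; the pattern is hypothetical.

```python
import json

# Hypothetical pattern written with literal characters instead of \u escapes.
literal_pattern = "^[À-Ö]+$"

# Default behaviour: non-ASCII characters are written as single-escaped \uXXXX.
ascii_output = json.dumps({"pattern": literal_pattern})

# With ensure_ascii=False the characters are written through unchanged.
utf8_output = json.dumps({"pattern": literal_pattern}, ensure_ascii=False)

print(ascii_output)
print(utf8_output)

# Either serialization parses back to the same pattern string.
print(json.loads(ascii_output) == json.loads(utf8_output))
```

In both cases the serializer never produces a double-escaped `\\u`, which is the outcome hoped for from working with the serializer rather than against it.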

wendellpiez added a commit to wendellpiez/metaschema that referenced this issue Feb 18, 2022
@david-waltermire david-waltermire added this to the Metaschema 0.9.0 milestone Mar 10, 2022
@david-waltermire david-waltermire linked a pull request Mar 10, 2022 that will close this issue
11 tasks
david-waltermire pushed a commit to wendellpiez/metaschema that referenced this issue Mar 20, 2022
@david-waltermire david-waltermire linked a pull request Mar 29, 2022 that will close this issue
8 tasks
@wendellpiez
Collaborator

We should look at this again after #183 is merged.

New regex patterns may make the issue just go away by themselves. If not, a solution can probably be patched over from #184.
