Double-escaped RegEx patterns in output JSON schema causing issues with some RegEx flavors #182
Comments
Let's start with the simplest thing and see how far it gets us: express the tests in the (Metaschema) backend source not in escaped form, but with literals (or XML character representations of literals). It would be interesting to see how these are handled by default by the JSON serializer built into Saxon, in particular whether and where it escapes the characters in question (presumably into a single-escaped form). This issue is hard if we work against the Saxon serializer and easy if we work with it. If we have to work against it, one option is to start with this XSLT, linked from the XSLT 3.0 Rec, which presumably could be used to replace the Saxon serialization of JSON that this pipeline presently relies on. But trying the simple thing first could open a way forward that works with the serializer.
…test Metaschema
Describe the bug
Some flavors of RegEx (such as Go's regexp package, https://pkg.go.dev/regexp/syntax, and PHP's PCRE) do not support Unicode character classes through the `\u{code}` syntax. The validation of certain datatypes, such as the `token` type, may improperly rely on this RegEx syntax.
Who is the bug affecting?
Tool developers who are trying to use the patterns in generated JSON schemas with some RegEx flavors (like Go's regexp package or PHP's PCRE).
What is affected by this bug?
The regex present in some output JSON schema patterns is invalid for some RegEx flavors.
When does this occur?
Anytime a Unicode code sequence is placed within a RegEx pattern (such as for the `token` datatype).
Expected behavior (i.e. solution)
A single-escaped Unicode character pattern (`\u...` instead of `\\u...`) is interpreted by all the JSON parsers I've tested (Go, JS, and Python) as the Unicode character directly. I have not tested how any of these regex flavors handle Unicode characters directly, but it could be a simple solution to this issue.
Other Comments
This bug is related to #181
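To make the difference between the two escaping levels concrete, here is a minimal Python sketch of what a JSON parser hands to the regex engine in each case. The schema fragments and the character range are hypothetical, chosen only for illustration; they are not the actual `token` pattern from the generated schema.

```python
import json
import re

# Hypothetical schema fragments; the character range is illustrative only,
# not the actual Metaschema `token` pattern.
double_escaped = r'{"pattern": "^[\\u00C0-\\u00D6]+$"}'
single_escaped = r'{"pattern": "^[\u00C0-\u00D6]+$"}'

double_pattern = json.loads(double_escaped)["pattern"]
single_pattern = json.loads(single_escaped)["pattern"]

# Double escaping: the JSON parser yields a literal backslash followed by
# "u00C0", so the regex engine itself must understand the \uXXXX escape.
print(repr(double_pattern))

# Single escaping: the JSON parser resolves the escape, so the regex engine
# only ever sees the Unicode characters themselves.
print(repr(single_pattern))

# Python's re module happens to accept \uXXXX in patterns, so both forms
# compile and match here. This only checks the JSON layer; it does not show
# how Go's regexp or PCRE treat either form.
for pattern in (double_pattern, single_pattern):
    print(bool(re.match(pattern, "\u00c4")))
```

Because Python's `re` accepts both forms, this only confirms the JSON-layer behavior described above. Whether a given engine accepts the literal characters still needs to be tested in that engine.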