Roll back `token` datatype to avoid problematic character references #181

wendellpiez · 2022-02-09T17:24:53Z

Describe the bug

As reported in the OSCAL repo (usnistgov/OSCAL#1127), a regular expression deployed in the schemas for the `token` datatype definition breaks in certain tools.

Testing suggests that it is the numeric character escaping (entity syntax) in the expression `[\i-[:𐀀-󯿿]][\c-[:𐀀-󯿿]]*` that the tool is not able to handle.
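To make "entity syntax" concrete: an XML parser resolves a numeric character reference in a schema attribute before any regex engine sees the pattern, so the engine receives literal characters from outside the Basic Multilingual Plane. The sketch below is a minimal illustration only. The `&#x10000;` and `&#xEFFFF;` spelling is an assumption reconstructed from the characters displayed above (the schema source is not quoted in this issue), and the `pattern` element and helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the schema's pattern facet. The reference
# spelling (&#x10000; and &#xEFFFF;) is an assumption, not quoted source.
SNIPPET = r'<pattern value="[\i-[:&#x10000;-&#xEFFFF;]][\c-[:&#x10000;-&#xEFFFF;]]*"/>'


def resolved_pattern(xml_text):
    """Return the pattern string as the XML parser hands it to a consumer."""
    return ET.fromstring(xml_text).get("value")


def supplementary_code_points(text):
    """List code points above U+FFFF (outside the Basic Multilingual Plane)."""
    return [f"U+{ord(ch):04X}" for ch in text if ord(ch) > 0xFFFF]


pattern = resolved_pattern(SNIPPET)
# The references are gone; only literal upper-plane characters remain.
print("&" in pattern)
print(supplementary_code_points(pattern))
```

This only shows what the parser delivers; it says nothing about why a particular regex engine rejects the resulting range.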

But further testing shows that these character exclusions have no apparent effect on the regex, which suggests we should roll them back.

If the XML-regex syntax here for NCName `[\i-[:]][\c-[:]]*` is too XMLy, an alternative could be `[\w-[:\d]]\w*`.
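For readers who want to try the `\w`-based alternative without an XSD regex engine, here is a rough hand-rolled check. Python's `re` module has no character-class subtraction, so the sketch tests characters directly. It assumes the XML Schema definitions of `\w` (any character except punctuation, separators and "other") and `\d` (decimal digits), approximated with Python's `unicodedata` tables. The function names are hypothetical, and results can differ from a real XSD processor built against a different Unicode version.

```python
import unicodedata


def is_xsd_word_char(ch):
    """Approximate XSD \\w: any character not in categories P*, Z* or C*."""
    return unicodedata.category(ch)[0] not in "PZC"


def is_xsd_digit(ch):
    """Approximate XSD \\d, which is equivalent to \\p{Nd}."""
    return unicodedata.category(ch) == "Nd"


def matches_alternative(value):
    """Check the whole string against [\\w-[:\\d]]\\w* (XSD patterns are anchored)."""
    if not value:
        return False
    first, rest = value[0], value[1:]
    # First character: a word character that is neither ':' nor a digit.
    if not is_xsd_word_char(first) or first == ":" or is_xsd_digit(first):
        return False
    # Remaining characters: any word characters.
    return all(is_xsd_word_char(ch) for ch in rest)


for sample in ["control", "ac1", "1ac", "a:b", "a_b"]:
    print(sample, matches_alternative(sample))
```

Under this reading, `_`, `-` and `.` are rejected because they are punctuation, so the alternative is not equivalent to the NCName-style expression; that trade-off would need to be weighed before adopting it.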

In any case, the correction must be made in the Metaschema back end, to propagate into the generated schemas.

This is not a backward-compatibility breaking change.

Who is the bug affecting?

Any user of tools that choke on the regex as given.

What is affected by this bug?

Can't validate an OSCAL instance with an appropriate XSD.

When does this occur?

Anytime with the tool in question (C# processor under .NET).

As it happens, the oXygen XML Editor's XML Schema Regular Expression builder also does not support entity syntax, and shows the same error.

Expected behavior (i.e. solution)

The token datatype should be validated appropriately (against NCName constraints).


aj-stein-nist · 2022-02-09T20:04:36Z

Thanks for adding this, @wendellpiez. Are you working on this moving forward, or should I learn the dark arts of these regex patterns? :-)

wendellpiez · 2022-02-09T22:16:23Z

I would like to hear from @david-waltermire-nist what he thinks the right choice is here, on balance, from among the remedies proposed so far, or some other.

Having discussed it, I would welcome help implementing a correction in the XSLT M4 pipeline.

However, it is also unclear to me where this should be done. The bug we are addressing is not actually in the XSD but in a single consuming implementation (that we know of), and a workaround is feasible for those cases. @david-waltermire-nist would necessarily have to be involved in any integration, since this is the Metaschema support infrastructure. (And it will have a direct impact on tooling he is developing!)

wendellpiez · 2022-02-11T17:56:58Z

@david-waltermire-nist with AJ's help we have managed to demonstrate:

* The tool in question does appear to have issues dealing with Unicode ranges, or certain Unicode ranges, that other tools are okay with (possibly a bug there).
* Nonetheless, the character range exclusion that causes the regex to break is superfluous and can be removed without effect, as the excluded characters (despite my earlier research) do not appear to be in the character sets to begin with.

(It turns out a test for this is easy: just try and make an XSD definition for an element named A𐀀.)

So I think the solution here (on the XML side) is to roll back the definition to `[\i-[:]][\c-[:]]*` (no-colon name without other exclusions).

Issue #182 might give us a reason to do otherwise, but it could also be orthogonal.

aj-stein-nist · 2022-02-11T20:38:00Z

Yeah I hope that helped. I mothballed the project for now but we can always revive and add more tests. I integrated .NET code with GitHub Actions on a Windows runner so we can further test issues with this particular XML engine in the .NET System.Xml namespace (since that is different from MSXML, we dodged that bullet 😰).

Let me know if I can be of more help.


wendellpiez · 2022-04-01T19:00:33Z

Now emitting functional XSDs with the new datatype mappings.

Pruning the result XSD so it does not contain unused simpleType definitions is a bit difficult when their definitions are chained, but we are managing some of it at least.

JSON Schema production also looks okay wrt datatypes in this branch.

* Redefined lexical constraint on 'token' datatype - #181 - removing references to upper-Unicode characters that break at least one processor - this is not actually relaxed, since the exclusion tests out as inoperative (the offending characters are not in the sets from which they are being excluded)
* Rolling back #165 addressing usnistgov/OSCAL#956. We now have no upper-Unicode characters in the schema to be concerned about.
* Adding schema-generation unit tests for 'token' datatype
* Adjusted JSON Schema generation to capture latest datatype definitions #195
* Debugging path in datatype integration
* Adding XSD datatype production logic - reads and rewrites JSON definitions in XSD syntax.
* Adjustment to JSON -> XSD rough casting
* XSD datatypes adjusted to align with JSON #195
* Adding new datatypes as aliases for old names #1186
* Touchups to inline documentation (for propagation to tools)
* Replaced with updated mapping plus adjustments to XSD production
* Now doing a better job excluding unneeded datatype (simpleType) definitions
* Reconciled datatype merge to emit functional JSON schemas
* Removing safety backups; small correction to Metaschema Schematron to avoid a runtime error.

wendellpiez added the bug label (Something isn't working) Feb 9, 2022

nikitawootten-nist mentioned this issue Feb 10, 2022

Double-escaped RegEx patterns in output JSON schema causing issues with some RegEx flavors #182

Closed

aj-stein-nist mentioned this issue Feb 10, 2022

Build Harness to Test Many Character Behaviors in XML Schema Regex Scenarios in C# aj-stein-nist/Issue1127Example#1

Closed

wendellpiez mentioned this issue Feb 18, 2022

Redefined lexical constraint on 'token' datatype #183

Merged

8 tasks

david-waltermire assigned wendellpiez and david-waltermire Mar 10, 2022

david-waltermire added this to the Metaschema 0.9.0 milestone Mar 10, 2022

david-waltermire linked a pull request Mar 10, 2022 that will close this issue

Relocate schema resources #191

Merged

11 tasks

david-waltermire linked a pull request Mar 29, 2022 that will close this issue

Metaschema / XSLT implementation alignment #197

Merged

8 tasks

david-waltermire linked a pull request Apr 15, 2022 that will close this issue

Redefined lexical constraint on 'token' datatype #183

Merged

8 tasks

david-waltermire closed this as completed in #183 Apr 15, 2022
