Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Roll back token datatype to avoid problematic character references #181

Closed
wendellpiez opened this issue Feb 9, 2022 · 5 comments · Fixed by #191, #197 or #183
Closed

Roll back token datatype to avoid problematic character references #181

wendellpiez opened this issue Feb 9, 2022 · 5 comments · Fixed by #191, #197 or #183
Assignees
Labels
bug Something isn't working

Comments

@wendellpiez
Copy link
Collaborator

Describe the bug

As reported in the OSCAL repo usnistgov/OSCAL#1127 a regular expression deployed in the schemas on the token datatype definition is breaking in certain tools.

Testing suggests that it is the numeric character escaping (entity syntax) in the expression [\i-[:𐀀-]][\c-[:𐀀-]]* that the tool is not able to handle.

But further testing shows that these character exclusions have no apparent effect on the regex - which suggests we should roll them back.

If the XML-regex syntax here for NCName [\i-[:]][\c-[:]]* is too XMLy, an alternative could be [\w-[:\d]]\w*.

In any case the correction must be made in Metaschema back end, to propagate to into generated schemas.

This is not a backward-compatibility breaking change.

Who is the bug affecting?

Any user of tools that choke on the regex as given.

What is affected by this bug?

Can't validate an OSCAL instance with an appropriate XSD.

When does this occur?

Anytime with the tool in question (C# processor under .NET).

As it happens, the oXygen XML Editor's XML Schema Regular Expression builder also does not support entity syntax, and shows the same error.

Expected behavior (i.e. solution)

The token datatype should be validated appropriately (against NCName constraints).

@wendellpiez wendellpiez added the bug Something isn't working label Feb 9, 2022
@aj-stein-nist
Copy link
Collaborator

Thanks for adding this @wendellpiez, so are you working on this moving forward or should I learn the dark arts of these regex patterns? :-)

@wendellpiez
Copy link
Collaborator Author

wendellpiez commented Feb 9, 2022

I would like to hear from @david-waltermire-nist what he thinks the right choice is here, on balance, from among the remedies proposed so far, or some other.

Having discussed it I would welcome the help implementing a correction in the XSLT M4 pipeline.

However, it is also unclear to me where this should be done. The bug we are addressing is not actually in the XSD but in a single consuming implemention (we know of); and a workaround is feasible for those cases. @david-waltermire-nist would necessarily have to be involved in any integration since this is the Metaschema support infrastructure. (And it will have a direct impact on tooling he is developing!)

@wendellpiez
Copy link
Collaborator Author

wendellpiez commented Feb 11, 2022

@david-waltermire-nist with AJ's help we have managed to demonstrate:

  • The tool in question does appear to have issues dealing with Unicode ranges, or certain Unicode ranges, that other tools are okay with - possible bug there
  • Nonetheless, the character range exclusion that causes the regex to break is superfluous, and can be removed without effect, as the excluded characters (despite my earlier research) do not appear to be in the character sets to begin with.

(It turns out a test for this is easy: just try and make an XSD definition for an element named A𐀀.)

So I think the solution here (on the XML side) is to roll back the definition to [\i-[:]][\c-[:]]* (no-colon name without other exclusions).

Issue #182 might give us a reason to do otherwise, but it could also be orthogonal.

@aj-stein-nist
Copy link
Collaborator

Yeah I hope that helped. I mothballed the project for now but we can always revive and add more tests. I integrated .NET code with GitHub Actions on a Windows runner so we can further test issues with this particular XML engine in the .NET System.Xml namespace (since that is different from MSXML, we dodged that bullet 😰).

Let me know if I can be of more help.

wendellpiez added a commit to wendellpiez/metaschema that referenced this issue Feb 18, 2022
…moving references to upper-Unicode characters that break at least one processor - this is not actually relaxed, since the exclusion tests out as inoperative (the offending characters are not in the sets from which they are being excluded)
@david-waltermire david-waltermire added this to the Metaschema 0.9.0 milestone Mar 10, 2022
@david-waltermire david-waltermire linked a pull request Mar 10, 2022 that will close this issue
11 tasks
wendellpiez added a commit to wendellpiez/metaschema that referenced this issue Mar 22, 2022
…moving references to upper-Unicode characters that break at least one processor - this is not actually relaxed, since the exclusion tests out as inoperative (the offending characters are not in the sets from which they are being excluded)
@david-waltermire david-waltermire linked a pull request Mar 29, 2022 that will close this issue
8 tasks
@wendellpiez
Copy link
Collaborator Author

Now emitting functional XSDs with the new datatype mappings.

Pruning the result XSD so it does not contain unused simpleType definitions is a bit difficult when their definitions are chained, but we are managing some of it at least.

JSON Schema production also looks okay wrt datatypes in this branch.

@david-waltermire david-waltermire linked a pull request Apr 15, 2022 that will close this issue
8 tasks
david-waltermire pushed a commit that referenced this issue Apr 15, 2022
* Redefined lexical constraint on 'token' datatype - #181 - removing references to upper-Unicode characters that break at least one processor - this is not actually relaxed, since the exclusion tests out as inoperative (the offending characters are not in the sets from which they are being excluded)
* Rolling back #165 addressing usnistgov/OSCAL#956. We now have no upper-Unicode characters in the schema to be concerned about.
* Adding schema-generation unit tests for 'token' datatype
* Adjusted JSON Schema generation to capture latest datatype definitions #195
* Debugging path in datatype integration
* Adding XSD datatype production logic - reads and rewrites JSON definitions in XSD syntax.
* Adjustment to JSON -> XSD rough casting
* XSD datatypes adjusted to align with JSON #195
* Adding new datatypes as aliases for old names #1186
* Touchups to inline documentation (for propagation to tools)
* Replaced with updated mapping plus adjustments to XSD production
* Now doing a better job excluding unneeded datatype (simpleType) definitions
* Reconciled datatype merge to emit functional JSON schemas
* Removing safety backups; small correction to Metaschema Schematron to avoid a runtime error.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment