Unicode UTF-8 Identifiers in USD Proposal #1
Conversation
Overall looks good! I'll give it a more thorough review tomorrow, but in the meantime it would be good to have a separate "out of scope" section to clarify that while AEC content often has whitespace, parentheses, etc., display names ARE the right approach there. In other words, as in C++ and Python, it does make sense to draw some reasonable limitations around identifier syntax in USD, even if that means not supporting everything that AECO DCCs support. Thanks!
- Aaron Luk
- Matthew Kuruc
- Edward Slavin
if you're comfortable with it, feel free to put yourself as primary author, @erslavin -- I feel you've done the most heavy lifting with this effort.
Agreed!
@erslavin -- sounds like in general NV is using the term AECO (architecture, engineering, construction, and operations) rather than AEC-- let's reflect that here as well. Thanks!
made the adjustment to AECO
I left some comments in line and have two main notes about topics that might be able to be streamlined.
I think the proposal does not need to spend a lot of time addressing whether or not prim names should be restricted to identifiers. It's definitely notable, but there are compelling reasons to keep them restricted too. Rather than expand the section to include that debate, it's probably better to keep it focused on what's needed for UTF-8 identifier support.
Secondly, I think the proposal should acknowledge collation / UCA but ultimately not spend a lot of time on it. The documentation you've provided on collation is really good and should be preserved perhaps in a sibling markdown file, but ultimately if the core is going to use optimistic ASCII for its internal ordering / sorting, it's secondary to the main proposal.
# Staged Rollout and Adoption
Since there are potentially large code bases in the current ecosystem that rely on USD, this proposal currently lands changes behind a `TfEnvSetting` (`TF_UTF8_IDENTIFIERS`) so that Pixar and other VFX/animation stakeholders are not forced to adopt UTF-8 identifier support within their code bases right away. This environment setting defaults to `false`, so the current ASCII validation rules are used in USD core as they are today. When the setting is `true`, access is provided to the new Unicode-based validation rules. This also allows us to measure the performance impact of different approaches to things such as normalization and collation. This proposal seeks to phase out the environment setting over time, similar to the deprecation of Ar 1.0, to give all USD citizens time to adapt their pipelines to support UTF-8 identifiers accordingly.
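The gating described above might be sketched as follows. This is a minimal Python illustration, not USD code: `TfEnvSetting` is a C++ mechanism, and `is_valid_identifier` here is a hypothetical stand-in for the actual validation functions; only the `TF_UTF8_IDENTIFIERS` name and its `false` default come from the proposal.

```python
import os

def utf8_identifiers_enabled():
    # Hypothetical mirror of the proposed TF_UTF8_IDENTIFIERS TfEnvSetting:
    # defaults to false, preserving today's ASCII-only rules.
    return os.environ.get("TF_UTF8_IDENTIFIERS", "false").lower() == "true"

def is_valid_identifier(name):
    if utf8_identifiers_enabled():
        # Python's isidentifier() follows the XID_Start / XID_Continue
        # character classes (PEP 3131), the same classes this proposal adopts.
        return name.isidentifier()
    # Today's rule: letter or underscore first, then only ASCII letters,
    # digits, and underscores.
    return (bool(name)
            and not name[0].isdigit()
            and all(c == "_" or (c.isascii() and c.isalnum()) for c in name))
```

With the setting off, `"café"` is rejected exactly as it is today; flipping the setting on admits it, which is the opt-in behavior the rollout plan relies on.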
It might be worth considering a `CMAKE` flag instead of the environment variable, given that you use the Ar rollout as an analogy. The environment variable might be harder to deprecate?
It's also worth noting that it's probably important to get the sorting algorithm right, since it's hard to change given the concerns about diff-ing files. To ensure that layers are easily interchanged and that UTF-8 is supported widely, I'd argue that the USD library should be opinionated about when the feature is considered a preview and when it's considered a first-class feature. Perhaps `TF_PREVIEW_UTF8_IDENTIFIERS` or (if you accept the `CMAKE` suggestion) `PXR_PREVIEW_UTF8_IDENTIFIERS`.
The `CMAKE` flag is an interesting option, and would alleviate the concerns about a runtime environment switch. I can put this as an option under this section and state that while we implemented it in the PR as a runtime switch, the `CMAKE` flag provides some additional flexibility and eliminates the runtime cost of evaluating the flag.
Added an additional note at the end of the section to discuss a `CMAKE` flag approach.
- UTF-8
- UTF-16 / UTF-32
In evaluating the options to support, raw code points rely largely on code page support and individual system locales. The UTF-x encodings are already popular for standardized interchange of Unicode text, and UTF-8 encoding in particular is widespread within the community. UTF-8 is a variable-length encoding that is backward compatible with existing ASCII encodings, meaning that current USD files, code, etc. that rely on ASCII strings should work without any changes. The reader is referred to [UTF-8 Everywhere](http://utf8everywhere.org/) as a summary of why UTF-8 was chosen as the preferred encoding over UTF-16 / UTF-32.
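The backward-compatibility claim is easy to demonstrate: ASCII text encodes byte-for-byte identically under UTF-8, while characters outside ASCII occupy multi-byte sequences. A small Python illustration:

```python
# ASCII identifiers are byte-for-byte unchanged when interpreted as UTF-8,
# which is why existing ASCII-only USD files keep working.
assert "Sphere".encode("utf-8") == b"Sphere"

# Non-ASCII characters use multi-byte sequences; each of these CJK
# characters occupies three bytes.
name = "日本語"
assert len(name) == 3                  # three code points
assert len(name.encode("utf-8")) == 9  # nine bytes on the wire

# Round-tripping through the encoding is lossless.
assert name.encode("utf-8").decode("utf-8") == name
```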
The UTF-8 Everywhere doc was a very compelling read, and I'd argue for its inclusion earlier in the document to seed context for the rest of the proposal.
Added a reference to this doc early on after the Unicode standard itself.
Though support may vary, both the C++ standard and PEP 3131 have common definitions for what denotes an identifier, and the recommendation is to follow this convention in USD, particularly since identifiers must conform to these rules to enable successful compilation of code-based schemas. One downside of this approach is that strings starting with digits (and, indeed, all-digit strings) could not be considered valid identifiers. However, this is no more restrictive than today's implementation, is a reasonable restriction given that USD supports code-based schemas, and still allows Unicode characters falling in those two classes (`XID_Start` / `XID_Continue`) to be supported.
Restricting prim names in this way, however, is largely unnecessary given that they are never used by compilers generating schema code. In fact, it is entirely possible that prim names could be any valid Unicode string (e.g., containing parentheses, spaces, emojis, etc.). While this proposal does not specify the possible characters that could make up a valid prim name, it does suggest a way to separate the validation of identifiers from the validation of prim names so that the latter could be changed in the future. For now, this proposal uses the same rule set for what makes a valid identifier and a valid prim name. If the rules for validation of prim names change in the future, a balance must be maintained between permissiveness and context-free parsability of a string.
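As a concrete illustration of the `XID_Start` / `XID_Continue` convention: Python's `str.isidentifier()` implements exactly the PEP 3131 rules referenced above, so it can be used to probe which strings the proposed identifier rules would accept.

```python
# Letters from non-Latin scripts are valid identifier characters
# under the XID_Start / XID_Continue classes...
assert "Δt".isidentifier()
assert "猫".isidentifier()

# ...but digit-leading strings, whitespace, and punctuation are not,
# which is why prim-name display strings need a separate discussion.
assert not "3dsMax".isidentifier()
assert not "my prim".isidentifier()
assert not "a(b)".isidentifier()
```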
It's interesting and important to note that prim names don't necessarily need to be restricted to identifiers in the same way that properties do. Mentioning emojis and other characters as possible prim identifiers might distract from the current proposal. I might argue for keeping this higher level and noting that there might be opportunities to extend the character set for prims in future proposals?
I can see your point. On the other hand, this gets people thinking about the possibilities of native character sets in the names. I think there is an argument to be made for both restricting and expanding the set of characters allowed. The emojis might be a little distracting though :)
Removed emojis and added a NOTE to the beginning to illustrate the discussion nature of the paragraph
- `xid_start_range_class`: A vector of pairs (`low`, `high`) of code points that fall in the `XID_Start` character class
- `xid_continue_range_class`: A vector of pairs (`low`, `high`) of code points that fall in the `XID_Continue` character class
This preprocessed file would then be compiled with the rest of the code and be available for fast lookup at runtime. Preprocessing is necessary because:
Spelling error. "because"
fixed
This preprocessed file would then be compiled with the rest of the code and be available for fast lookup at runtime. Preprocessing is necessary because:
- The Unicode database is not formatted in a way that provides fast run-time lookups of character classes in a memory-efficient manner
- The Unicode database is large, and not all of its information is required to perform the necessary processing
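A sketch of the runtime lookup such preprocessed ranges enable (the range data below is an illustrative subset for demonstration, not the generated tables):

```python
import bisect

# Illustrative subset of sorted, non-overlapping (low, high) code point
# ranges for XID_Start; the real tables would be generated from the
# Unicode database at build time.
XID_START_RANGES = [(0x0041, 0x005A), (0x005F, 0x005F),
                    (0x0061, 0x007A), (0x4E00, 0x9FFF)]

def in_ranges(code_point, ranges):
    # Binary search for the last range whose low bound is <= code_point,
    # then check that the code point falls inside it.
    i = bisect.bisect_right(ranges, (code_point, 0x10FFFF)) - 1
    return i >= 0 and ranges[i][0] <= code_point <= ranges[i][1]
```

Because the ranges are sorted and non-overlapping, each lookup is O(log n) over a compact table, which is the memory/speed trade-off the preprocessing step buys compared with shipping the raw database.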
Have we profiled a memory impact from the Unicode database?
We have not in any recordable sense, though we did take a look at the library size between the versions that did and did not have the compile time tables to support the lookups. The database itself is preprocessed, so the memory overhead is that used to store the various maps we generate and compile into the library.
I didn't add additional descriptive text here since we haven't really documented the increase in size
For an optimistic ASCII algorithm to be used in practice, the context-sensitive rules used by the existing dictionary ordering algorithm would have to be laid out explicitly. These rules would then have to be adjusted to define what happens when non-ASCII characters "break" the context: Are they ignored? Do they contribute a new context? Does the algorithm terminate at that point on the basis of a code point ordering? The answers to these questions are not straightforward, but they need to be explicitly stated so that users of USD can be assured that whatever ordering is used, it can be understood.
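To make the context-sensitive behavior concrete, here is a simplified dictionary-ordering key in Python: digit runs compare numerically rather than lexicographically, which is the "context" the existing algorithm maintains. How non-ASCII runs should enter this scheme is exactly the open question above; this sketch covers only the ASCII case.

```python
import re

def dictionary_key(name):
    # Split the name into alternating digit / non-digit runs so that
    # numeric runs compare by value: "prim2" sorts before "prim10".
    runs = re.findall(r"\d+|\D+", name)
    # Tag each run so numeric and textual runs compare consistently.
    return [(0, int(r), "") if r.isdigit() else (1, 0, r) for r in runs]
```

With this key, `sorted(["prim10", "prim2", "prim1"], key=dictionary_key)` yields the natural order `prim1, prim2, prim10` instead of the raw lexicographic `prim1, prim10, prim2`.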
## Unicode Collation Algorithm
This is a really good section, but given that "optimistic ASCII" is going to be the sorting algorithm, I wonder if it should be its own referenced file to capture the work and information while keeping the main proposal focused on the changes to USD Core.
Captured in the overall comment below.
Autodesk is one of the primary stakeholders currently asking for UTF-8 identifiers, as Revit and 3ds Max already support UTF-8 element names. While Maya does not currently support UTF-8, we anticipate that it eventually will also be in demand, and the VFX/animation community will then be pushed to follow suit in their internal pipelines. With the opt-in approach outlined above, this proposal aims to give VFX/animation stakeholders the necessary breathing room ahead of time to safely adopt UTF-8 sooner rather than later.
We share the concern over UTF-8 identifiers leaking into pipelines that do not yet support them, and stand ready to offer guidance and support to update those use cases. As most VFX/animation facilities source their content in-house and/or within non-UTF-8 ecosystems, we do not expect UTF-8 identifiers to appear unexpectedly in such pipelines unless intentionally ingested. In any case, the `TfEnvSetting` provides a mechanism through which VFX/animation pipelines can safeguard against accidental UTF-8 ingestion while they migrate over a grace period, after which UTF-8 identifiers become the norm in USD.
Should "IN any case" be "In any case"?
fixed
# An Overview of Identifiers in USD
Identifier syntax is currently enforced in USD by several different mechanisms. The primary mechanism is the inlined function `TfIsValidIdentifier`, which states that "an identifier is valid if it follows the C/Python identifier convention; that is, it must be at least one character long, must start with a letter or underscore, and must contain only letters, underscores, and numerals". However, several other code paths validate different flavors of identifiers, including namespaced and variant identifiers, using similar logic. In particular, this means that any change to the logic governing core identifier validation in `TfIsValidIdentifier` necessitates changes to several similar validation functions across the `Sdf` module (for example, `SdfPath::IsValidNamespacedIdentifier`).
Should charaacter be character?
fixed
This still seems to be there in the latest unless I'm missing an update.
hmm, you are right, must have been a Ctrl+Z malfunction, fixed
It is also necessary to define what we mean by *identifier*. The current implementation mixes two syntactically similar but semantically different concepts:
- **Prim names:** The name of a USD prim, differentiating one sibling prim from another and forming the hierarchical path from which a prim can be referenced
I think this is a good point and worth capturing, but for focus and given that the proposal isn't as this time recommending that prim names extend beyond identifiers, it might be worth condensing or moving to a sibling document.
Captured in the comment below.
- Aaron Luk
- Matthew Kuruc
- Edward Slavin
Agreed!
This content was from the original proposal, where we did actually modify the prim name rules slightly and spent a fair amount of time implementing UCA. I'd like to keep them in there, but it looks like I need to make a clearer statement somewhere that there are no changes to prim name rules planned (it's important, because we did do the work to separate out the validation; one just falls back to the other when there are no changes to the rules). As far as UCA, we never finalized that optimistic ASCII was the way to go (and I have some doubts in its current state). That said, I agree it likely isn't the final choice, so perhaps I preface the whole collation section with the fact that the subsections are different proposals, that we still need to decide which route to take (with tradeoffs mentioned in the different sections), and that because of the performance hit and/or effort needed to cache that data, we likely wouldn't go with UCA?
- Addressed review comments
- Removed transparency in diagram for dark backgrounds
- String concatenation - Unicode normalization forms are not closed under string concatenation; specifically, at least the area where the strings are joined needs to be considered for composition
- `TfStringJoin`
It may be interesting to consider the impact of C++20's `std::u8string` type for the methods listed above. Overloads of these methods that take a `std::u8string` parameter would provide very clear semantics to a developer about the treatment of string content. While the use of `std::u8string` would really only affect the compiler encoding of string literals, the burden is then placed on the developer to choose explicitly which of the methods to use.
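The concatenation caveat above is easy to reproduce; Python's `unicodedata` module exposes the same normalization forms:

```python
import unicodedata

a = "e"
b = "\u0301"  # COMBINING ACUTE ACCENT

# Each piece is individually in NFC form...
assert unicodedata.is_normalized("NFC", a)
assert unicodedata.is_normalized("NFC", b)

# ...but their concatenation is not: the characters at the join
# point compose, so the joined string must be re-normalized.
joined = a + b
assert not unicodedata.is_normalized("NFC", joined)
assert unicodedata.normalize("NFC", joined) == "\u00e9"  # "é", U+00E9
```

This is why a `TfStringJoin`-style operation over normalized inputs cannot simply concatenate: at minimum the region around each join must be re-examined for composition.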
copmiler => compiler
Proofreading is hard :/ fixed! (and thanks!)
For lexer-based validation, a new lexer and parser are introduced to specifically capture the new identifier rules for both the path lexer and the text file format lexer. Although it is possible to change the existing lexer / parser to support UTF-8 encodings and the new identifier rules, we thought doing so might introduce subtle errors across the ecosystem when using the modified lexer / parser under the current ASCII validation rules (e.g., the new lexer / parser accepting more strings during the lex phase that ultimately fail identifier validation logic).
The path lexer is used as the first step in validating prim path strings. The text file format lexer is used to read and validate the content of text-based USD files (i.e., `.usda`). Both require that a sequence of characters be recognized as a logical token (and parsed in context) before being validated. The rules that dictate which lexed sequences are acceptable for identifiers need to be updated accordingly for Unicode-based identifiers, and it is desirable that these rules conform to those updated in the code-based validation sequences. Unfortunately, it is difficult in flex rules to specify the entire set of byte strings that would encapsulate a character in, e.g., the `XID_Start` character class. To get around this, we propose that the lex rule be very permissive (basically any properly encoded UTF-8 character) and that specific validation (prim name or identifier) of the lexed string be done as part of the lexer logic that accepts the token prior to parsing. This does have the unfortunate overhead of performing some validations twice: once at tokenization using the state machine, and once when adding the token to the tree and checking for identifier validity.
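The two-phase idea (permissive token rule, then identifier validation on the lexed string) might be sketched like this. The delimiter set and function name are illustrative assumptions, not the actual flex rules:

```python
import re

# Phase 1: a deliberately permissive token rule -- any run of characters
# that are not whitespace or path/structural delimiters (illustrative set).
_TOKEN = re.compile(r"[^\s/.<>\[\]{}()]+")

def lex_identifier(text):
    # Hypothetical helper: lex one token from the front of `text`,
    # then validate it as an identifier before accepting it.
    match = _TOKEN.match(text)
    if match is None:
        return None
    token = match.group(0)
    # Phase 2: validate against the identifier rules
    # (XID_Start / XID_Continue, via str.isidentifier()).
    return token if token.isidentifier() else None
```

The regex happily consumes any well-formed Unicode run, and the stricter check rejects invalid identifiers afterward; this mirrors the double-validation overhead the paragraph above describes.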
staht => that
fixed
- [Core Identifier Validation](#core-identifier-validation)
- [The Path and Text File Format Lexer / Parser](#the-path-and-text-file-format-lexer--parser)
- [Case Mapping and other String Operations](#case-mapping-and-other-string-operations)
- [Identifier Collation](#identifier-collation)
Should this be Identifier Sorting?
fixed
In evaluating the options to support, raw code points rely largely on code page support and individual system locales. The UTF-x encodings are already popular for standardized interchange of Unicode text, and UTF-8 encoding in particular is widespread within the community. UTF-8 is a variable-length encoding that is backward compatible with existing ASCII encodings, meaning that current USD files, code, etc. that rely on ASCII strings should work without any changes. The reader is referred to [UTF-8 Everywhere](http://utf8everywhere.org/) as a summary of why UTF-8 was chosen as the preferred encoding over UTF-16 / UTF-32.
Building USD is already a complex endeavor. While there are existing libraries that provide APIs for working with Unicode character sequences, like [ICU](https://icu.unicode.org/), using these libraries would necessitate acquiring and distributing them with the core USD libraries. Efforts are already underway to eliminate dependencies (e.g., Boost), so we felt that a small enough subset of Unicode string processing could be implemented in USD core without the need for external libraries. In practice, the choice of collation algorithm is the primary driver of whether or not to use an external library, so that decision should be deferred until the collation algorithm to be used with Unicode identifiers is understood (see [Identifier Collation](#identifier-collation) for the different proposed options).
Should this be "the choice of sorting algorithm", "understanding the sorting algorithm", and a link to "Identifier Sorting"
fixed
…cted and which was not
- Updated proposal to reflect changed default setting to TF_UTF8_IDENTIFIERS runtime switch
- Updated to make clear the distinction (and validation rules used) between prim names and identifiers
Description of Proposal
Currently, identifiers in USD are limited to ASCII character sets. As such, Unicode based languages cannot use their native character sets to name prims and properties. This proposal discusses what it would take to add Unicode support to core USD identifiers.
Supporting Materials
PixarAnimationStudios/OpenUSD#2120