Unicode UTF-8 Identifiers in USD Proposal #3
Conversation
- Addressed review comments
- Removed transparency in diagram for dark backgrounds
It seems to me that an important alternative has not been discussed. You mention the limitations of "approximate approaches", but what about an ASCII-only exact encoding scheme? By simply allowing one new character in identifiers (I'll say "#" as a for-instance for the rest of this post), you can encode any string, and even allow currently illegal ASCII-only identifier names:
Some advantages:
Some disadvantages:
I'm certain I'm glossing over other problems, and maybe this idea was discussed and dismissed? In which case maybe an explanation in the proposal would be warranted?
Thanks for the suggestion, @marktucker! It's an interesting approach, but as stated in the proposal, approximate approaches do not allow native non-Latin languages to be used directly, and they push many of the issues described in the proposal to the UI layer, resulting in many different implementations across the ecosystem in different tools. Many of the perceived advantages you describe here would still require fundamental changes to the lexer / parser in the same ways as required for native UTF-8 encodings, with the disadvantage of using a non-standard encoding that is difficult to interchange across different ecosystems. While it may simplify the sorting issue, it ends up being almost identical to code point sorting (e.g., for the escaped characters you would be comparing code values for either ASCII or non-ASCII characters).

The advantages of using UTF-8 natively are enumerated in the UTF-8 Everywhere link, and I think using a new non-standard encoding in USD limits its ability to become standard across multiple ecosystems. This is especially true if you push issues to the UI layer, where many libraries already exist to display and otherwise deal with UTF-8 strings.

I'm not sure I see an easier implementation in this suggestion, given that the same fundamental parts still need to be modified (e.g., the lexer / parser, the rules on identifier validity, etc.) and it could potentially introduce unforeseen complications, especially in the lexer / parser, where it is already semi-ambiguous (at best, context-sensitive) to distinguish between comments (starting with #) and the usda header in the text file format parser (i.e., #usda 1.0) when talking about the '#' character specifically (though other characters may introduce different ambiguities). I would argue the same is true for many approximate approaches.
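To make the sorting point concrete, here is a minimal sketch (an illustration, not code from the proposal) of why byte-wise comparison of UTF-8 strings is effectively code point sorting: UTF-8 was designed so that comparing the raw bytes of well-formed strings orders them exactly as comparing their code point sequences would.

```cpp
#include <cassert>
#include <string>

int main() {
    // Two identifiers containing non-ASCII characters, stored as raw UTF-8 bytes.
    std::string a = "caf\xC3\xA9";   // "café", U+00E9 encoded as 0xC3 0xA9
    std::string b = "caf\xC4\x93";   // "cafē", U+0113 encoded as 0xC4 0x93

    // UTF-8 byte order preserves code point order, so plain byte-wise
    // comparison already gives the "code point sorting" described above.
    assert(a < b);                   // U+00E9 < U+0113, and the bytes agree
    return 0;
}
```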
Re-reading your first paragraph again, I'm wondering if I am misunderstanding what you're trying to say about the lexer/parser... Perhaps you could explain why the addition of one new "acceptable" character would require changes to the lexer/parser on par with the UTF-8 related changes in your PR? Also, I assume we're exclusively talking about the USDA parser here, right? Is there some other component within USD that I'm not thinking about? Also, conceding your point about …
I think we have to distinguish at least two things here - prim identifiers and schema type names / property identifiers. When we talk about the schema generator, we are primarily concerned with schema type names and property identifiers. In this case, anything we generate has to be compilable when the schema is codeful (and for C++ / Python, this means something that starts with …).

Prim identifiers are given by the user, and used to form ….

I'm trying to work through what a different encoding would offer over a native UTF-8 encoding, both in terms of the changes needed to USD to support it and in terms of how a user would work with it, and I'm not seeing any advantages this would offer over UTF-8 in either case.
But it's not UTF-8 encoded - and not directly interpretable by existing tools / APIs. The encoding representation is already solved by UTF-8 (which is already a standard for transporting Unicode data across ecosystems). Unicode strings can already be stored natively in both USD and USDA files using UTF-8, so proposing a new non-standard encoding as an alternative doesn't give you an additional advantage here. Both encodings require something that interprets the byte stream, but many applications exist today that can natively interpret UTF-8 encodings (as it is a standard), making UTF-8 an attractive choice that 1) allows you to use the files you have with no change (as the encoding for ASCII characters is the same as their native values) and 2) allows you to interchange with a rich set of tooling across ecosystems consisting of multiple domains and users (many new to USD).
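As a small illustration of the backward-compatibility point (a sketch under the assumption that identifiers are stored as plain byte strings, not code from USD itself): a purely ASCII identifier has exactly the same bytes whether it is read as ASCII or as UTF-8, while a non-ASCII identifier simply becomes a longer byte sequence.

```cpp
#include <cassert>
#include <string>

int main() {
    // A purely ASCII identifier: its bytes are identical under ASCII and
    // UTF-8, so existing files and tools see no difference at all.
    std::string asciiId = "Robot_01";
    assert(asciiId.size() == 8);      // one byte per character

    // A non-ASCII identifier: each character outside ASCII becomes a
    // multi-byte UTF-8 sequence, but it is still just a byte string to
    // anything that does not need to interpret the characters.
    std::string utf8Id =
        "\xE3\x83\xAD\xE3\x83\x9C\xE3\x83\x83\xE3\x83\x88"; // "ロボット"
    assert(utf8Id.size() == 12);      // 4 characters, 3 bytes each
    return 0;
}
```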
The lexer forms lexical tokens from sequences of characters. Any change to what you represent as an identifier needs to be reflected in the lexer rules for what an "identifier" token is. This requires a change to the state machine that accepts / rejects classes of tokens based on the characters it sees. In both the proposal laid out here and your own, changes would have to be made to the lexer to classify an identifier. Depending on the additional characters belonging to the class, it may not be feasible for the lexer to completely validate what it thinks is an identifier token, so additional acceptance rules may need to be added to fully determine this. Also, depending on the additional class of characters, you may run into ambiguity in which token a sequence could be classified as, and may have to make additional changes to the parser to resolve that ambiguity. In your proposal, at minimum the …
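To make the lexer point more concrete, here is a rough character-classification sketch. It is a deliberate simplification (the actual USD text-format lexer is flex-generated, and the helper below is hypothetical), but it shows the kind of rule change either approach implies: the code that decides whether a byte may continue an identifier token has to be taught about the new characters, whether those are UTF-8 bytes or an extra ASCII escape character such as '#'.

```cpp
#include <cstdint>

// Hypothetical helper in the spirit of the discussion above: decides whether
// a byte may appear inside an identifier token.
bool IsIdentifierContinueByte(unsigned char c, bool allowUtf8, bool allowHashEscape)
{
    // Classic ASCII identifier characters: [A-Za-z0-9_]
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
        (c >= '0' && c <= '9') || c == '_')
        return true;

    // Native UTF-8 identifiers: any byte with the high bit set belongs to a
    // multi-byte sequence (leader 11xxxxxx or continuation 10xxxxxx).
    // Full validation (well-formedness, allowed character classes, etc.)
    // still has to happen in a later acceptance pass.
    if (allowUtf8 && (c & 0x80) != 0)
        return true;

    // ASCII-escape scheme: '#' would have to be admitted here too, which is
    // exactly where it collides with '#'-introduced comments and the
    // "#usda 1.0" header in the text file format.
    if (allowHashEscape && c == '#')
        return true;

    return false;
}
```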
My thoughts on this are all listed in my first post. But you've obviously thought about this a lot more deeply than I have. We've reached a point where you're mostly talking above my head (or talking past me, which I'm happy to assume is because you're seeing things that I'm not). So I'll just trust that you're right and I'm wrong :) I just wanted to raise this alternative suggestion that I thought would be a whole lot simpler to implement and adopt. But as always the devil is in the details. Thanks for your time!
This kind of discourse is what submitting something like this is all about, so feedback is greatly appreciated!
Hi @marktucker. In your earlier post, you mention "USD applications can ignore if they don't want to deal with UTF8". Are you thinking that applications might want to adopt UTF-8 on different timelines, or are you thinking that some applications may never want to support UTF-8? To make it easy for tooling to identify UTF-8 content on a per-string / per-token level, I'm going to argue that UTF-8 already offers this through the 'leader' byte. Additional characters that can be represented in UTF-8 but not traditional ASCII can be identified through the 'leader' byte (a byte starting with …).
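For illustration, here is a minimal sketch of the 'leader byte' check being described (my own helper functions, not USD API): in UTF-8, plain ASCII bytes have the form 0xxxxxxx, leading bytes of multi-byte sequences have the form 11xxxxxx, and continuation bytes have the form 10xxxxxx, so a tool that wants to stay ASCII-only can detect non-ASCII content with a single pass over the bytes.

```cpp
#include <string>

// Returns true if the identifier contains any non-ASCII (i.e. UTF-8
// multi-byte) content, detectable purely from the byte values.
bool ContainsNonAscii(const std::string& identifier)
{
    for (unsigned char c : identifier) {
        if (c & 0x80)        // 0xxxxxxx = ASCII; anything else is part of
            return true;     // a multi-byte sequence (10xxxxxx or 11xxxxxx)
    }
    return false;
}

// Returns true if the byte starts a multi-byte UTF-8 sequence
// (the 'leader' byte mentioned above, of the form 11xxxxxx).
bool IsUtf8LeaderByte(unsigned char c)
{
    return (c & 0xC0) == 0xC0;
}
```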
Apologies Mark, I interpreted the goal of the …
Oh, I see... No, my goal was simply to do the encoding using nothing but ASCII characters to minimize the changes to the USD library or applications that use the USD library. Your question makes me wonder if I wasn't clear about how the encoding would work? In my suggestion, Unicode character U+04A8 (Ҩ) used in a prim name like …
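To illustrate the rough shape of such a scheme (the #XXXX hex syntax below is only a guess for illustration, not a syntax anyone has committed to in this thread):

```cpp
#include <cstdio>
#include <string>

// Hypothetical ASCII-only escaping of a Unicode code point inside an
// identifier, e.g. U+04A8 -> "#04A8", so the stored name never leaves ASCII.
std::string EscapeCodePoint(unsigned int codePoint)
{
    char buf[8];
    std::snprintf(buf, sizeof(buf), "#%04X", codePoint);
    return std::string(buf);
}

// Example: a user-facing name containing U+04A8 (Ҩ) might be stored as
//   "Prim_" + EscapeCodePoint(0x04A8) + "_Name"  ==  "Prim_#04A8_Name"
// with a UI layer responsible for decoding it back for display.
```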
I very much sympathize with the pipelines/developers that don't want to deal with Unicode, @marktucker ... but even the C++ standard is building in support, and my read of the tea leaves (which I make no claim is the most informed) is that it is the future. On that assumption, to me it makes overbearing sense to encode identifiers directly as UTF-8 in USD, to leverage the many tools that do already support it natively, including Qt and many text editors that would otherwise be presenting gobbledegook in …
It's worth noting that pipelines that stick to only ASCII characters in their assets would be unaffected (modulo performance internal to USD parsing / sorting). To build on @spiffmon's observation about …
Offline discussions about sorting have evolved further, but this draft covers the issues comprehensively, and looks great.
…cal-camera updates based on WG discussion
Description of Proposal
Currently, identifiers in USD are limited to ASCII character sets. As such, Unicode-based languages cannot use their native character sets to name prims and properties. This proposal discusses what it would take to add Unicode support to core USD identifiers.
Supporting Materials
PixarAnimationStudios/OpenUSD#2120
Contributing