
Unicode UTF-8 Identifiers in USD Proposal #3

Merged

Conversation

@erslavin (Contributor) commented Mar 7, 2023

Description of Proposal

Currently, identifiers in USD are limited to the ASCII character set. As such, languages whose writing systems require non-ASCII characters cannot be used natively to name prims and properties. This proposal discusses what it would take to add Unicode support to core USD identifiers.

Supporting Materials

PixarAnimationStudios/OpenUSD#2120

Contributing

@marktucker commented Mar 8, 2023

It seems to me that an important alternative has not been discussed. You mention the limitations of "approximate approaches", but what about an ASCII-only exact encoding scheme?

By simply allowing one new character in identifiers (I'll say "#" as a for-instance for the rest of this post), you can encode any string, and even allow currently illegal ASCII-only identifier names (a rough sketch of the scheme follows the list below):

  • #X#, #XX# == represents the HEX byte 0X or XX
  • ## == #
  • # == # if it doesn't fall into any of the above cases (not sure if this is a good idea or if this should be considered illegal...)
  • Allow identifiers to start with #
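For concreteness, here is a minimal Python sketch of one way this escaping could work. The helper names are hypothetical, and the details (always emitting two hex digits, treating only `[A-Za-z0-9_]` as pass-through) are assumptions the bullets above leave open:

```python
def encode_identifier(name: str) -> str:
    """Escape an arbitrary string into an ASCII-only identifier (sketch)."""
    out = []
    for pos, byte in enumerate(name.encode("utf-8")):
        ch = chr(byte)
        if ch == "#":
            out.append("##")  # a literal '#' doubles itself
        elif ch.isascii() and (ch.isalpha() or ch == "_" or (ch.isdigit() and pos > 0)):
            out.append(ch)    # already a legal identifier character
        else:
            out.append(f"#{byte:02X}#")  # everything else becomes a hex escape
    return "".join(out)

def decode_identifier(encoded: str) -> str:
    """Invert encode_identifier back to the original string."""
    data, i = bytearray(), 0
    while i < len(encoded):
        if encoded[i] == "#":
            if encoded[i + 1] == "#":            # '##' -> literal '#'
                data.append(ord("#"))
                i += 2
            else:                                # '#X#' / '#XX#' -> raw byte
                end = encoded.index("#", i + 1)
                data.extend(bytes.fromhex(encoded[i + 1:end].zfill(2)))
                i = end + 1
        else:
            data.append(ord(encoded[i]))
            i += 1
    return data.decode("utf-8")

assert encode_identifier("12345") == "#31#2345"   # leading digit escaped
assert decode_identifier(encode_identifier("foo/bar")) == "foo/bar"
```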

Some advantages:

  • Identifier validation would just need to allow # as a valid character, allowed at the front of the identifier. Probably some change to USDA/schema parsing as well, but fairly limited. I believe no actual signature changes to any existing APIs would be required.
  • New APIs to encode/decode identifiers and paths from byte streams (which by decree could be ASCII or UTF8 strings) could be net new APIs that USD applications can ignore if they don't want to deal with UTF8
  • any identifier can be represented because otherwise invalid ASCII characters are also encodeable (12345 would be #XX#2345, foo/bar would be foo#XX#bar - sorry, I couldn't be bothered to look up the actual ASCII codes for 1 and /).
  • The translation to/from these encoded/decoded forms would be left entirely to the presentation/UI layer (where I think it belongs)
  • No sorting issues (USD internal sorting would just be ASCII, presentation UI could sort on the decoded names if it wants)
  • Easier for everyone to implement! (I'm a very, very lazy person) No changes to any existing USD author/consumer would be required, unless the application does its own identifier validation.

Some disadvantages:

  • not backward compatible (though only when dealing with files that already use # in identifiers; the same caveat applies to UTF8)
  • not a standard encoding (though maybe there is a standard encoding that would work and provide all the same advantages)
  • I know very little about unicode, so I may be wrong about the ease of converting UTF8 to this ASCII encoding.

I'm certain I'm glossing over other problems, and maybe this idea was discussed and dismissed? In which case maybe an explanation in the proposal would be warranted?

@erslavin (Contributor, Author) commented Mar 9, 2023

Thanks for the suggestion, @marktucker! It's an interesting approach, but as stated in the proposal, approximate approaches do not allow native non-Latin languages to be used directly and push many of the issues described in the proposal to the UI layer, resulting in many different implementations across the ecosystem in different tools. Many of the perceived advantages you describe here would still require fundamental changes to the lexer / parser in the same ways as required for native UTF-8 encodings, with the added disadvantage of using a non-standard encoding that is difficult to interchange across different ecosystems. While it may simplify the sorting issue, it ends up being almost identical to code point sorting (e.g., for the escaped characters you would be comparing code values for either ASCII or non-ASCII characters).

The advantages of using UTF-8 natively are enumerated in the UTF-8 Everywhere link, and I think using a new non-standard encoding in USD limits its ability to become standard across multiple ecosystems. This is especially true if you push issues to the UI layer, where many libraries already exist to display and otherwise deal with UTF-8 strings.

I'm not sure I see an easier implementation in this suggestion, given that the same fundamental parts still need to be modified (e.g. the lexer / parser, the rules on identifier validity, etc.). It could also introduce unforeseen complications, especially in the lexer / parser, where it is already semi-ambiguous (at best, context-sensitive) to distinguish between comments (starting with #) and the usda header (i.e., #usda 1.0) in the text file format parser when '#' specifically is the new character (other characters may introduce different ambiguities). I would argue the same is true for many approximate approaches.

@marktucker

> Many of the perceived advantages you describe here would still require fundamental changes to the lexer / parser in the same ways as required for native UTF-8 encodings

I see why you would say this, but I would disagree, and I should clarify my intent here. For anyone writing schemas/usda files/anything that feeds through the USD library, the onus would be on the author of that schema/usda file to pre-encode identifiers. I think this is okay because:

  • the number of users writing schemas, or writing USDA by hand, is tiny, and not a demographic worth worrying excessively about
  • I expect most schemas will still stick to ASCII identifiers, if they are meant to be publicly consumable and ever made part of the universal standard
  • Like USD, there could be authoring tools that handle this on the user's behalf

But maybe I'm missing use cases here? Or maybe you just disagree about the importance of this capability.

> I think using a new non-standard encoding in USD limits its ability to become standard across multiple ecosystems

Can't disagree with this, but I can't say that this bothers me at all :) I genuinely don't see why this should matter... The identifier data is still actually UTF-8 encoded... I'm only talking here about how to store UTF-8 strings in USD and USDA files.

> This is especially true if you push issues to the UI layer, where many libraries exist already to display and otherwise deal with UTF-8 strings

Indeed. The UI layer could choose to completely ignore the encoding and just show the encoded ASCII versions. Or it can translate to/from UTF-8 where it makes sense. Again, I see this as a strength, because I think the UI layer is where this effort belongs. But I certainly concede this claim is debatable.

@marktucker commented Mar 9, 2023

Re-reading your first paragraph again, I'm wondering if I am misunderstanding what you're trying to say about the lexer/parser... Perhaps you could explain why the addition of one new "acceptable" character would require changes to the lexer/parser on par with the UTF-8 related changes in your PR? Also, I assume we're exclusively talking about the USDA parser here, right? Is there some other component within USD that I'm not thinking about?

Also, conceding your point about # not being the correct character to use for this purpose. But I'm sure we could find one that isn't currently in use...

@erslavin (Contributor, Author) commented Mar 9, 2023

> I see why you would say this, but I would disagree, and I should clarify my intent here. For anyone writing schemas/usda files/anything that feeds through the USD library, the onus would be on the author of that schema/usda file to pre-encode identifiers.

I think we have to distinguish at least two things here - prim identifiers and schema type names / property identifiers. When we talk about the schema generator, we are primarily concerned with schema type names and property identifiers. In this case, anything we generate has to be compile-able when the schema is codeful (and for C++ / Python, this means something that starts with XID_Start and continues with XID_Continue). Allowing one of these to begin with the # character, for example, already makes this fail compilation. Solutions that decode it prior to sending it to the compiler would be bad, because any user written code against that compiled API would also have to be decoded prior to compilation to get the API call right (e.g., Get / Set a property value with the property identifier in the function name). UTF-8 encodings of the string are already understood by the compiler and can be used in a natural way as a developer against those APIs.
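As a small illustration of the compilation point (hypothetical schema and property names, shown in Python since the same XID rules apply there):

```python
# Python 3 identifiers follow XID_Start/XID_Continue, so a generated
# accessor for a property named "Ҩ" (U+04A8) compiles as-is:
class UsdExampleAPI:        # hypothetical generated schema class
    def GetҨAttr(self):     # legal: Ҩ is an XID character
        ...

# By contrast, a generated method named Get#04A8#Attr would be a syntax
# error: '#' starts a comment in Python and is not an identifier character
# in C++, so the escaped form can never appear in a codeful schema's
# compiled API without being decoded first.
```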

Prim identifiers are given by the user, and used to form SdfPath objects, which form the native way of navigating / traversing the scene. Would the expectation be that the user encodes the name manually? That they use an additional function somewhere to take their native representation (most likely encoded in UTF-8 anyway) and convert to this new one prior to creating a SdfPath with the value?

I'm trying to work through what a different encoding would offer over a native UTF-8 encoding, both in terms of changes needed to USD to support it and in terms of how a user would work with it and I'm not seeing any advantages this would offer over UTF-8 in either case.

> Can't disagree with this, but I can't say that this bothers me at all :)
> I genuinely don't see why this should matter... The identifier data is still actually UTF-8 encoded... I'm only talking here about how to store UTF-8 strings in USD and USDA files.

But it's not UTF-8 encoded - and not directly interpretable by existing tools / APIs. The encoding representation is already solved by UTF-8 (which is already a standard for transporting Unicode data across ecosystems). Unicode strings can already be stored natively in both USD and USDA files using UTF-8, so proposing a new non-standard encoding as an alternative doesn't give you an additional advantage here. Both encodings require something that interprets the byte stream, but many applications exist today that can natively interpret UTF-8 encodings (as it is a standard), making UTF-8 an attractive choice that 1) allows you to use the files you have with no change (as the encoding for ASCII characters is the same as their native values) and 2) allows you to interchange with a rich set of tooling across ecosystems consisting of multiple domains and users (many new to USD).
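A two-line demonstration of that compatibility point (the Ҩ character is just an arbitrary non-ASCII example):

```python
# ASCII text is byte-for-byte identical under UTF-8, so existing
# ASCII-only files are already valid UTF-8 with no changes:
assert "hello".encode("utf-8") == "hello".encode("ascii") == b"hello"

# Non-ASCII characters become multi-byte sequences; ASCII byte values
# never appear inside them, so ASCII content cannot be misparsed:
assert "abcҨdef".encode("utf-8") == b"abc\xd2\xa8def"
```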

> Re-reading your first paragraph again, I'm wondering if I am misunderstanding what you're trying to say about the lexer/parser... Perhaps you could explain why the addition of one new "acceptable" character would require changes to the lexer/parser on par with the UTF-8 related changes in your PR? Also, I assume we're exclusively talking about the USDA parser here, right? Is there some other component within USD that I'm not thinking about?

The lexer forms lexical tokens from sequences of characters. Any change to what you represent as an identifier needs to be reflected in the lexer rules for what an "identifier" token is, which means changing the state machine that accepts / rejects classes of tokens based on the characters it sees. In both the proposal laid out here and your own, changes would have to be made to the lexer to classify an identifier.

Depending on the additional characters belonging to the class, it may not be feasible for the lexer to completely validate what it thinks is an identifier token, so additional acceptance rules may need to be added to fully determine this. Also, depending on the additional class of characters, you may run into ambiguity over which token class a sequence belongs to, and may have to make additional changes to the parser to resolve that ambiguity. In your proposal, at minimum the # character would need to be added to the appropriate character classes and lex rules (and in fact it would be the new bracketing sequences you propose above, not just the single character).

There are two sets of lexers / parsers - one for the text file format (USDA) and one for the path parser (which has to interpret sequences of identifiers). Both the current proposal and your own have to go through the same set of changes to the lexer / parser and the valid-identifier rules, so we don't get additional advantages here from this type of encoding. Hopefully that clarifies what I meant by lexer / parser changes - I'd be happy to refer you to these changes in the associated USD PR as an example.
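As a rough illustration of how the identifier token class changes under each proposal - these regexes are simplified stand-ins, not USD's actual lex rules:

```python
import re

# Today's ASCII-only identifier rule, roughly:
ascii_ident = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# The '#' scheme: the lexer must also accept bracketed hex escapes (and
# disambiguate them from '#'-introduced comments and the '#usda' header):
escaped_ident = re.compile(r"(?:##|#[0-9A-Fa-f]{1,2}#|[A-Za-z0-9_])+")

# Native UTF-8: Python's \w is a crude stand-in for XID_Start/XID_Continue:
utf8_ident = re.compile(r"[^\W\d]\w*")

assert utf8_ident.fullmatch("abcҨdef")
assert escaped_ident.fullmatch("foo#2F#bar")
```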

@marktucker

> I'm trying to work through what a different encoding would offer over a native UTF-8 encoding, both in terms of changes needed to USD to support it and in terms of how a user would work with it and I'm not seeing any advantages this would offer over UTF-8 in either case.

My thoughts on this are all listed in my first post. But you've obviously thought about this a lot more deeply than I have. We've reached a point where you're mostly talking above my head (or talking past me, which I'm happy to assume is because you're seeing things that I'm not). So I'll just trust that you're right and I'm wrong :)

I just wanted to raise this alternative suggestion that I thought would be a whole lot simpler to implement and adopt. But as always the devil is in the details. Thanks for your time!

@erslavin (Contributor, Author) commented Mar 9, 2023

> I just wanted to raise this alternative suggestion that I thought would be a whole lot simpler to implement and adopt. But as always the devil is in the details. Thanks for your time!

This kind of discourse is what submitting something like this is all about, so feedback is greatly appreciated!

@nvmkuruc (Contributor) commented Mar 9, 2023

Hi @marktucker. In your earlier post, you mention "USD applications can ignore if they don't want to deal with UTF8". Are you thinking that applications might want to adopt UTF-8 on different timelines or are you thinking that some applications may never want to support UTF-8?

On making it easy for tooling to identify UTF-8 content at a per-string / per-token level: I'd argue that UTF-8 already offers this through its 'leader' bytes. Characters that can be represented in UTF-8 but not in traditional ASCII are identifiable by a leader byte (a byte starting with the bits 110, 1110, or 11110). A hypothetical TfInferStringEncoding that checks for ASCII vs. UTF-8 would, instead of scanning for #, scan the string for one of these leader bytes.
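A minimal sketch of that check; the function name mirrors the hypothetical TfInferStringEncoding above and is not an existing USD API:

```python
def infer_string_encoding(data: bytes) -> str:
    """Report 'utf-8' if any multi-byte leader byte is present, else 'ascii'."""
    for b in data:
        # UTF-8 leader bytes begin with the bit patterns 110, 1110, or 11110:
        if (b & 0xE0) == 0xC0 or (b & 0xF0) == 0xE0 or (b & 0xF8) == 0xF0:
            return "utf-8"
    return "ascii"

assert infer_string_encoding(b"hello") == "ascii"
assert infer_string_encoding("abcҨdef".encode("utf-8")) == "utf-8"
```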

@marktucker

> Are you thinking that applications might want to adopt UTF-8 on different timelines or are you thinking that some applications may never want to support UTF-8?

Both. I bet there are a lot of studio-made tools that would love to never have to think about this. Command line tools shouldn't have to be changed other than recompiling. And some developers may not be happy if they are forced to make their UIs unicode-aware, even if they never use unicode in their prim names. And I'm not saying that the current UTF-8 changes being proposed would do this. I have no idea what effect this PR would have on existing tools (though I'm very curious to find out).

> To make it easy for tooling to identify UTF-8 content on a per string / token level

I'm not sure what you're asking/suggesting here? Are you talking about a function that can determine if an identifier is "pure ASCII" or not? I'm not sure what such a function would be used for (though I am sure it would be trivial to implement with either UTF8 or a #-based encoding scheme).

@nvmkuruc (Contributor) commented Mar 9, 2023

Apologies Mark, I interpreted the goal of the # encoding you floated as making it easy to identify when and where a string contained non-ASCII characters. My intended observation was that UTF-8 is effectively encoded that way, but perhaps I misunderstood.

@marktucker

Oh, I see... No, my goal was simply to do the encoding using nothing but ASCII characters to minimize the changes to the USD library or applications that use the USD library. Your question makes me wonder if I wasn't clear about how the encoding would work? In my suggestion, Unicode character U+04A8 (Ҩ) used in a prim name like abcҨdef would be encoded as abc#04A8#def. And again, # is probably not the right character to use, but it would be some specific ASCII character not currently allowed in USD prim names. Maybe ! could have worked?
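A quick round-trip of that example in Python. Note this variant escapes the code point (U+04A8) rather than its individual UTF-8 bytes, matching the example here rather than the byte-oriented bullets in the first post:

```python
name = "abcҨdef"
encoded = "".join(
    c if c.isascii() and (c.isalnum() or c == "_") else f"#{ord(c):04X}#"
    for c in name
)
assert encoded == "abc#04A8#def"
```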

@spiffmon (Member)

I very much sympathize with the pipelines/developers that don't want to deal with unicode, @marktucker... but even the C++ standard is building in support, and my read of the tea leaves (which I make no claim is the most informed) is that it is the future. On that assumption, to me it makes overwhelming sense to encode identifiers directly as utf-8 in USD, to leverage the many tools that already support it natively, including Qt and many text editors that would otherwise be presenting gobbledegook in usdedit.

@nvmkuruc (Contributor)

It's worth noting that pipelines that stick to only ASCII characters in their assets would be unaffected (modulo performance internal to USD parsing / sorting).

To build on @spiffmon's observation about usdedit, the hypothetical # encoding could also obfuscate error messages and Python reprs. Instead of seeing Usd.Prim(/World/abcҨdef), users with non-ASCII prim names would see Usd.Prim(/World/abc#04A8#def) and Error detected at /World/abc#04A8#def.
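For comparison, with native UTF-8 identifiers the path prints as typed; a small sketch assuming a USD build in which this proposal's changes have landed:

```python
from pxr import Sdf

# The user-visible form and the stored form are the same string:
path = Sdf.Path("/World/abcҨdef")
print(path)  # /World/abcҨdef -- not /World/abc#04A8#def
```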

@spiffmon added the "Internationalization - Issues relating to internationalizing USD, e.g. UTF-8 related" label, self-assigned this, and self-requested a review on Mar 27, 2023.
@spiffmon (Member) left a review comment:

Offline discussions about sorting have evolved further, but this draft is comprehensive of the issues, and looks great.

@pixar-oss pixar-oss merged commit 31f09a4 into PixarAnimationStudios:main Mar 28, 2023