Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF8 search support in usdviewq #2890

Merged

Conversation

tylerm-nv
Copy link
Contributor

@tylerm-nv tylerm-nv commented Dec 27, 2023

Description of Change(s)

Normalizes prim and attributes search strings in usdviewq to make searching easier. Users can enter the normalized form of a string, e.g. VII instead of Ⅶ, and they will match.

NFKC applies a compatibility decomposition that NFC does not. NFKC was chosen over other normalization forms to make it easier for users to search for UTF8 identifiers with ASCII representations. Some data in the unicode strings is lost during the NFKC normalization, which is intended since the goal is to have loose matching.

This change supports #2848 but can be merged independently

Fixes Issue(s)

  • I have verified that all unit tests pass with the proposed changes
  • I have submitted a signed Contributor License Agreement

@tallytalwar
Copy link
Contributor

Filed as internal issue #USD-9121

Copy link
Contributor

@tallytalwar tallytalwar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could I also request to also update the PR description to give some overview of why NFKC is chosen over NFC, etc and what will be the shortcomings of the same, like what if someone wants to distinguish between "VII" and "Ⅶ", but with NFKC normalization the 2 will match.

@@ -2203,6 +2203,9 @@ def setFrameField(self, frame):

# Prim/Attribute search functionality =====================================

def _normalize_unicode(self, str: str, form = 'NFKC'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does make sense for usdview as a client to do this sort of normalization and not in the core USD, but we think it might still make sense to provide in brief in-code comments on why NFKC is being chosen for normalization.

Note that it was discussed with @mati-nvidia and others that at least for identifiers NFKC normalization should be discouraged, and encourage NFC normalization.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tallytalwar "Loose Matching" is one of the motivations for NFKC normalization
(https://unicode.org/faq/normalization.html, "Why should my program normalize strings?").

It's possible that regular expressions don't fit the definition of "loose matching", and different forms would be best for each.

I don't think we have to necessarily discourage users from NFKC normalization of their strings or having pipelines that enforce NFKC forms. NFKC normalized strings are also NFC normalized.

@sunyab sunyab added the usd-utf8-identifiers Issues/PRs for Unicode Identifiers in USD proposal label Jan 11, 2024
@pixar-oss pixar-oss merged commit 768ee98 into PixarAnimationStudios:dev Jan 16, 2024
3 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
usd-utf8-identifiers Issues/PRs for Unicode Identifiers in USD proposal
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants