Node name character set #56

Closed
alimanfoo opened this issue Mar 31, 2020 · 23 comments · Fixed by #196
Labels
core-protocol-v3.0 (Issue relates to the core protocol version 3.0 spec), protocol-extension (Protocol extension related issue), todo, pre-rfc

Comments

@alimanfoo
Member

Currently the core protocol v3.0 draft includes a section on node names which defines the set of characters that may be used to construct the name of a node (array or group).

The set of allowed characters is currently extremely narrow, being only a-z, A-Z, 0-9, and the characters "-", "_", and ".". There is no support for non-Latin characters, which obviously creates a barrier for many people. I'd like to relax this and allow any Unicode characters. Could we do this, and if we did, what problems would we need to anticipate and address in the core protocol spec?

Some points to bear in mind for this discussion:

Currently in the core protocol, node names (e.g., "bar") are used to form node paths (e.g., "/foo/bar"), which are then used to form storage keys for metadata documents (e.g., "meta/root/foo/bar.array") and data chunks (e.g., "data/foo/bar/0.0"). These storage keys may then be handled by a variety of different store types, including file system stores where storage keys are translated into file paths, cloud object stores where storage keys are translated into object keys, etc.

Different store types will have different abilities to support the full Unicode character set for node names. For example, although most file systems support Unicode file names, there are still reserved characters and words which cannot be used, which differ between operating systems and file system types. However, these constraints might not apply at all to other store types, such as cloud object stores. In general, other store types may have different constraints. Do we need to anticipate any of this, or can we delegate these issues to be dealt with in the different store specs?

One last thought: whatever we decide, the set of allowed characters should probably be defined with respect to some standard character set, e.g., Unicode. That is, we should probably reference the appropriate standard when discussing which characters are allowed.

@alimanfoo alimanfoo added the core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec label Mar 31, 2020
@Carreau
Contributor

Carreau commented May 5, 2020

Could we do this, and if we did, what problems would we need to anticipate and address in the core protocol spec?

The main problem you will hit, in my opinion, is Unicode normalisation: would é be the single precomposed character é, or e followed by a combining ´? I'm also worried about Unicode support in some edge-case languages like Fortran.
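
For illustration, here is a minimal Python sketch (standard library only) of the two representations in question and how NFC/NFD normalization reconciles them:

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point (U+00E9)
combining = "e\u0301"    # e followed by a combining acute accent (U+0301)

print(precomposed == combining)                                # False: different code point sequences
print(precomposed == unicodedata.normalize("NFC", combining))  # True: NFC composes e + ´ into é
print(unicodedata.normalize("NFD", precomposed) == combining)  # True: NFD decomposes é into e + ´
```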

I'm not sure about Unicode versions, as new characters get added. Thinking out loud: we can't really ignore code points we don't know about; if a new code point turns out to be combining (surrogates?), it may change the meaning of the subsequent bytes depending on the encoding. You might also kill performance in some implementations, since parsing Unicode and deciding the length, or whether to split on /, can be hard and depend on all the previous characters (is that true only for variable-width encodings like UTF-8?).

If anything relies on sorting as well, then you may become $LOCALE aware.

For FFI you may want to actually normalize on a given encoding.

On the UI side, you may get rendering and homoglyph/heteroglyph confusion, plus input issues. In Python, \epsilon and \varepsilon look different (depending on the font) but are the same variable:

(Screenshot, 2020-05-05.)

Web browsers are also really bad at respecting Unicode combining; Chrome in particular (depending on the version) treats \vec as a pre-combining character.

I would not worry too much about the ability of a filesystem/KV store to handle arbitrary Unicode, as long as you decouple the storage spec from the querying spec. For a file store we could decide to store the keys/path-parts as base64 in the worst case, and then casing/Unicode should not be a compatibility issue, though it might be a performance one.

I would err toward being more restrictive (at most, stay within the Basic Multilingual Plane); it's always easier to relax this later.

@Carreau
Contributor

Carreau commented May 5, 2020

Another point to consider: hostnames in URLs are usually encoded using Punycode. I don't know if that's of any concern or relevance here.
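
For reference, a small Python sketch (standard-library codecs, shown only as an illustration of Punycode/IDNA encoding):

```python
label = "münchen"
print(label.encode("punycode"))  # b'mnchen-3ya'      (raw Punycode, no prefix)
print(label.encode("idna"))      # b'xn--mnchen-3ya'  (IDNA form used in hostnames)
```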

@jstriebel
Member

Crosslinking #149 (comment)

@rabernat
Contributor

rabernat commented Dec 6, 2022

Where have we landed on this? Are we going to expand the set of allowed characters in node names? Over in #149, @jbms said

I would agree that it would be reasonable to say it is implementation defined what characters are allowed.

I agree as well. I think we should go back to a broader range of allowed characters. Perhaps it would be more useful to define what characters are explicitly forbidden.

For example, is whitespace allowed?

@jbms
Contributor

jbms commented Dec 6, 2022

I think it would be fine to allow, subject to the limitations of the underlying store, arbitrary Unicode code point sequences, but reserve "/" for the hierarchy and reserve a leading underscore after the last "/" for zarr's own usage (all metadata keys should use a leading underscore). We can suggest a subset of the full namespace that will have better compatibility across stores.
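
A hypothetical sketch of that rule (the helper name and checks are illustrative only, not part of any spec):

```python
def is_allowed_node_name(name: str) -> bool:
    """Allow arbitrary Unicode, but reserve "/" and a leading underscore."""
    if not name:
        return False
    if "/" in name:           # "/" is reserved as the hierarchy separator
        return False
    if name.startswith("_"):  # leading underscore reserved for zarr's own keys
        return False
    return True

print(is_allowed_node_name("températures"))  # True: arbitrary Unicode is fine
print(is_allowed_node_name("_private"))      # False: leading underscore reserved
```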

@jstriebel jstriebel moved this from Todo to In Discussion in ZEP1 Dec 6, 2022
@jstriebel jstriebel moved this from In Discussion to In Review in ZEP1 Dec 22, 2022
@ethanrd

ethanrd commented Dec 22, 2022

Unicode normalization was mentioned at some point in the discussion but I'm not seeing that reflected in PR #196.

The Python language also restricts the characters allowed in identifiers to a limited set of Unicode categories. For instance, it does not allow control characters (Cc) or format characters (Cf). Oddly, to me anyway, it also doesn't allow most of the punctuation categories.
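
As a sketch of what limiting by category looks like in practice (using Python's standard unicodedata module; the helper is illustrative only):

```python
import unicodedata

def has_forbidden_category(name: str, forbidden=("Cc", "Cf")) -> bool:
    """Return True if any character falls in a forbidden Unicode category."""
    return any(unicodedata.category(ch) in forbidden for ch in name)

print(has_forbidden_category("data\u0007"))  # True: U+0007 (BEL) is a control character (Cc)
print(has_forbidden_category("data\u200b"))  # True: U+200B (zero-width space) is a format character (Cf)
print(has_forbidden_category("données"))     # False
```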

NetCDF requires Unicode normalization for identifiers/names (and enforces it, I believe — @DennisHeimbigner?) but does not limit by Unicode categories. I suspect it should limit by categories, but I'm not sure what real problems that would avoid.

(Sorry if I've missed some of this already being addressed. My lurking here is pretty sporadic.)

@jstriebel
Member

Thanks for raising this point @ethanrd. Disallowing control and format characters seems fine to me, and Unicode normalization is a good idea too, IMO. I'd propose using Unicode Normalization Form C (NFC), requiring implementations to normalize any user input and to interact with the storage only via normalized Unicode.

@jbms
Contributor

jbms commented Jan 10, 2023

Is there an example of a use case where Unicode normalization is important?

Requiring Unicode normalization means applications must contain Unicode data tables. TensorStore, for instance, does not currently depend on Unicode data tables, so this would require adding them.

Given that we decided to make case sensitivity store-dependent, it seems that it would be reasonable to also leave Unicode normalization as store-dependent unless there is a compelling use case.

@jstriebel
Member

Requiring Unicode normalization means applications must contain Unicode data tables.

True, that would require additional complexity, so I would rather not make this required. Since #196 already proposes a safe subset, I feel that neither normalization nor limiting the character set needs to be enforced at the spec level. From #196:

To ensure consistent behaviour across different storage systems and programming
languages, we recommend to use only characters in the sets a-z, A-Z,
0-9, -, _, ..

Is there an example of a use case where Unicode normalization is important?

Simply typing a string in different editors or with different keyboards might, in rare cases, result in two different Unicode representations, even if they look alike. Therefore Unicode normalization would be nice to have if we want to encourage Unicode usage, but I think we're fine with the recommendation above. We could additionally recommend forbidding Cf/Cc characters and also recommend normalization, which seems easy in languages with built-in Unicode table support. cc @ethanrd @DennisHeimbigner @jbms
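
For concreteness, a minimal check of a node name against the recommended safe subset quoted above (the helper name is illustrative only):

```python
import re

SAFE_NODE_NAME = re.compile(r"^[a-zA-Z0-9._-]+$")

def is_in_safe_subset(name: str) -> bool:
    """True if the name uses only the recommended a-z, A-Z, 0-9, "-", "_", "." characters."""
    return bool(SAFE_NODE_NAME.match(name))

print(is_in_safe_subset("temperature_2m"))  # True
print(is_in_safe_subset("températures"))    # False: outside the recommended subset
```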

@ethanrd

ethanrd commented Jan 18, 2023

As I understand it, the goal of this proposal is to move beyond a-z, A-Z, 0-9, -, _, . and support a much broader, international set of characters for Zarr names/identifiers by supporting Unicode. If that is the case, to ensure consistent behavior across Zarr implementations, I think the specification needs to require some level of support for Unicode. Part of that, I believe, should be Unicode normalization, as it ensures that any two canonical-equivalent strings will have precisely the same binary representation and so compare as equal (see Unicode FAQ "Why should my program normalize strings?"). The Unicode FAQ "Which form of normalization ...?" suggests NFKC normalization for identifiers.

I do not think alternate encodings for the same characters are rare. It sounds like the existence of alternate encodings was for legacy and backward-compatibility reasons (see the Precomposed character Wikipedia page, second paragraph). One of the reasons was a clean mapping to/from ISO Latin-1 (see this Unicode email), which, I believe, could make this fairly common.

I'm not sure I understand the argument against requiring Unicode because it would require a Unicode library. Wouldn't a dependency on Unicode libraries fall on the Zarr libraries rather than the applications or underlying storage libraries that are using or being used by the Zarr libraries?

Also, (I don't know for sure but) I suspect that the Unicode character category limitations used for Python language identifiers correspond in some way to the lower case (a-z), upper case (A-Z), numbers (0-9), and limited punctuation (-_.) character set currently supported by Zarr.

@jbms
Contributor

jbms commented Jan 18, 2023

As I understand it, the goal of this proposal is to move beyond a-z, A-Z, 0-9, -, _, . and support a much broader, international set of characters for Zarr names/identifiers by supporting Unicode. If that is the case, to ensure consistent behavior across Zarr implementations, I think the specification needs to require some level of support for Unicode. Part of that, I believe, should be Unicode normalization as it ensures that any two canonical-equivalent strings will have precisely the same binary representation and so, compare as equal (see Unicode FAQ "Why should my program normalize strings?"). The Unicode FAQ "Which form of normalization ...?" suggests NFKC normalization for identifiers.

I do not think alternate encodings for the same characters are rare. It sounds like the existence of alternate encodings was for legacy and backward-compatibility reasons (see the Precomposed character Wikipedia page, second paragraph). One of the reasons was a clean mapping to/from ISO Latin-1 (see this Unicode email), which, I believe, could make this fairly common.

I can see that without normalization, it may be quite difficult and error-prone to manually type node names with non-ASCII characters, since the representation may differ depending on the operating system and input configuration. In practice, it would probably be necessary to either normalize in the application when writing and reading, or rely on listing and then do normalization-aware matching on the list results.

However, in addition to requiring extra data tables, there are other potential issues:

  • The closest analogue to this that I can think of is filesystem paths. On Linux, neither case folding nor Unicode normalization is commonly enabled by default, but the ext4 filesystem does support a case-folding mode that also does Unicode normalization. Microsoft's NTFS does case folding but not Unicode normalization. Apple's APFS supports either case-insensitive + Unicode-normalizing (the default on macOS) or case-sensitive + Unicode-normalizing. Notably, in all cases I have seen, the filesystem always preserves the original code point sequence (and that is what is returned when listing a directory), and just uses normalization/case folding for comparison (see the sketch after this list). Therefore, whether it uses NFC or NFD is purely an implementation detail with no user-visible impact. Preserving the original code point sequence is helpful if the keys correspond to keys in some other table/database/filesystem.
  • Unfortunately since zarr exists as a thin layer on top of an underlying store, it is not practical to support normalization insensitivity while still preserving the original code point sequence. Therefore, we would have to simply perform the normalization when writing.
  • This could raise compatibility issues with zarr v2, since we would not be able to preserve node names used in zarr v2 if they are not already in the chosen normalization form.
  • As far as which level of normalization to use, it seems like it would make sense to be consistent with filesystems and use canonical-equivalence, i.e. NFC or NFD, rather than NFKC/NFKD.
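
As a sketch of the "preserve the original code point sequence, normalize only for comparison" behavior described in the first bullet above (the normalization form and helper name here are illustrative, not a spec proposal):

```python
import unicodedata

def keys_equivalent(a: str, b: str) -> bool:
    """Compare two keys under canonical equivalence without rewriting either one."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

stored = "gro\u0302up"  # "grôup" written with a combining circumflex (as stored)
query = "gr\u00f4up"    # "grôup" typed as a precomposed character (as queried)

print(stored == query)                 # False: the code point sequences differ
print(keys_equivalent(stored, query))  # True: canonically equivalent under NFC
```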

I'm not sure I understand the argument against requiring Unicode because it would require a Unicode library. Wouldn't a dependency on Unicode libraries fall on the Zarr libraries rather than the applications or underlying storage libraries that are using or being used by the Zarr libraries?

Yes, I do mean that it imposes the dependency on zarr libraries, not directly on the application using the zarr library. For TensorStore this added dependency would not be that big of a deal. I could imagine that for a zarr implementation intended for an embedded device, it may be a larger issue.

Also, (I don't know for sure but) I suspect that the Unicode character category limitations used for Python language identifiers corresponds in some way to the lower case (a-z), upper case (A-Z), numbers (0-9), and limited punctuation (-_.) character set currently supported by Zarr.

In zarr v2 I don't believe there are currently any restrictions on node names, nor is any normalization performed.

@jstriebel
Member

In zarr v2 I don't believe there are currently any restrictions on node names, nor is any normalization performed.

At least the v2 spec specifies keys to be ASCII strings:

A Zarr array can be stored in any storage system that provides a key/value interface, where a key is an ASCII string

Unfortunately since zarr exists as a thin layer on top of an underlying store, it is not practical to support normalization insensitivity while still preserving the original code point sequence. Therefore, we would have to simply perform the normalization when writing.

👍 (we might even assume a store which is not able to list directories, which makes normalized matching impossible).

As I understand it, the goal of this proposal is to move beyond a-z, A-Z, 0-9, -, _, . and support a much broader, international set of characters for Zarr names/identifiers by supporting Unicode.

Yes, but we'd still recommend using the limited set due to incompatibilities between underlying stores. This is not an enforced restriction, though.

I think there are three options, and I'll try to add arguments for all of them here:

  1. Require Unicode normalization ("MUST")
    Since Unicode is allowed, it should be practical to use. To ensure implementations are compatible with each other, each must implement Unicode normalization.
  2. Recommend Unicode normalization
    Same as before, but some implementations might not want to add Unicode data tables to support normalization. Any non-normalized key might then not be readable by other implementations; reads themselves are not limited, though. I could imagine that such a library (e.g. one tailored towards embedded systems) would allow writing only ASCII names, while reads could accept any Unicode sequence, which the user must normalize beforehand. (One might argue that this implementation is compatible with 1, if it specifies the normalized key as a requirement for the user.)
  3. Nothing
    Since we already propose a safe subset of characters, anything beyond this is possible, but hard to specify at the moment without much experience with it.

I tend towards 1, making Unicode normalization required, with only normalized keys being written and read, but I have no strong opinion from my side. @rabernat @Carreau @alimanfoo more opinions/arguments? I'd like to settle this soon.

@jbms
Contributor

jbms commented Jan 19, 2023

There is one other issue with normalization that occurred to me:

Suppose a zarr implementation supports opening a zarr array by a local filesystem path:

/home/üser/páth/to/zarr/grôup/pãth/to/arrày

Let's say that /home/üser/páth/to/zarr/grôup/ is the top-most group. Then that portion of the path is outside of zarr's control and should not be normalized. But the path within that group, pãth/to/arrày needs to be normalized.

Thus we are back to needing to know the root even without storage transformers. Or maybe we should treat Unicode normalization as a storage transformer, which means we would need to use a two-part URL:

file:///home/üser/páth/to/zarr/grôup#pãth/to/arrày
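
A sketch of how such a two-part URL could be resolved, with only the within-group part normalized (the URL handling and the NFC choice are illustrative only):

```python
import unicodedata
from urllib.parse import urlsplit

url = "file:///home/üser/páth/to/zarr/grôup#pãth/to/arrày"
parts = urlsplit(url)

group_path = parts.path                                    # outside zarr's control: left exactly as given
node_path = unicodedata.normalize("NFC", parts.fragment)   # inside the group: normalized

print(group_path)  # /home/üser/páth/to/zarr/grôup
print(node_path)   # pãth/to/arrày (in NFC form)
```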

@jbms
Contributor

jbms commented Jan 19, 2023

Actually, I realized the same issue arises with the proposal to escape group member names that start with an underscore.

@jstriebel
Member

maybe we should treat Unicode normalization as a storage transformer

That would be great IMO!

@DennisHeimbigner

In practice, I think the reason to have normalization is for name equality testing. Doing a byte-by-byte comparison of a normalized vs. an un-normalized name will probably fail. If one is willing to live with the possibility of having two names that display the same but are not detected as equal by the zarr library, then I suspect normalization is not needed. In effect, it becomes the burden of the user of the library to normalize any names they use.

@jstriebel jstriebel moved this from In Review to Maybe extension in ZEP1 Jan 19, 2023
@jstriebel jstriebel added the protocol-extension Protocol extension related issue label Jan 19, 2023
@jstriebel
Member

@ethanrd What do you think about adding Unicode normalization as an extension? I think that might work nicely as a group storage transformer and would solve the problem @jbms mentioned above.

@ethanrd

ethanrd commented Jan 24, 2023

I think the multiple language support provided by Unicode is important enough that it should have the visibility of being in the core. On the other hand, given all the issues that have come up, I would now lean towards recommending rather than requiring Unicode support.

Perhaps just some text after the A-Z guidance (in the current PR) providing advice to both implementers and users that wish to implement or use Unicode. It could include links to a more detailed section/document or at least references to Unicode resources. Maybe something like:

Unicode support for Zarr node names allows users to name things using the language/alphabet of their choice. To accurately perform equality comparisons against node names, Unicode normalization is needed. There are also security and interoperability considerations that make it potentially useful to limit the set of Unicode characters supported. It is therefore recommended that implementors who want to support Unicode perform Unicode normalization on node names, and check that only allowed characters are used, before performing any operations with or on them. Similarly, users should be aware of the recommended limitations on Unicode characters when naming Zarr objects.

More information on how to implement Unicode support is available in the Unicode section below (or separate document? or just references to Unicode documents).

Of course, it's the details beyond this that get complicated. Thoughts?

Note: The Python PEP 3131 "Supporting Non-ASCII Identifiers" does a good job explaining the motivation for supporting Unicode and the decisions they made on limiting allowed characters. The Unicode Standard Annex #31 "Unicode Identifier and Pattern Syntax" provides a recommended default for defining identifiers; it includes some reasoning behind the default, as well as reasons and cautions for altering it.

@jbms
Contributor

jbms commented Jan 24, 2023

I think we haven't fully ironed out the "extension" vs "core" part, but I think the more relevant distinction here is "always enabled" vs "opt in via storage transformer". Even if it is a storage transformer, it could still be supported by all or most implementations.

Logically, Unicode normalization as a feature does behave as a storage transformer that impacts path resolution, and that has implications for how the array needs to be accessed.

We decided to eliminate the concept of an explicit root, in order to allow the zarr v2-like behavior of being able to access an array directly via its path, as long as there are no group-level storage transformers in place. For example, let's say we have:

/home/user/whatever/zarr_group/zarr_array

If zarr_group has a group-level storage transformer or other extension that impacts path resolution, then to access zarr_array we have to use a URL to the group plus a path to the array within the group, like: file:///home/user/whatever/zarr_group#zarr_array. But in the default case that zarr_group does not have any such storage transformer or extension, we can access the array directly as file:///home/user/whatever/zarr_group/zarr_array.

With Unicode normalization, we technically could access the array directly as file:///home/user/whatever/zarr_group/zarr_array as long as we ensured the path matches the stored representation, but it would be error-prone and easy to accidentally create an array with an improperly normalized path. Therefore, we are proposing that it be an opt-in storage transformer, so that we can preserve direct zarr v2-style access to arrays when it is disabled.

As for how to describe Unicode normalization in the spec, for interoperability between implementations it is critical to precisely specify the normalization form to use.

@ethanrd

ethanrd commented Jan 24, 2023

Hi @jbms - Yes, I'll just mention that I've been following Zarr for some time but I'm not deeply familiar with the details of the spec or implementations (other than netCDF at a non-Zarr level). So I don't really understand how Zarr extensions and storage transformers fit into the spec or into the various implementations.

In my last comment, I was suggesting a recommendation in the core because, to me, an extension sounds less visible than a mention in the core. Unicode support seems important enough that visibility is important, whether optional or not.

I agree, it is important to provide precise details about which normalization scheme to use and, possibly, any limits on the characters allowed. The Unicode Standard Annex #31 I mentioned in my earlier comment provides a recommended default syntax for the definition of identifiers that is probably a good starting point. Anything beyond that is well past my current understanding of Unicode. Reviewing the various Unicode documents has just reaffirmed for me that Unicode can get complicated very quickly.

@ethanrd

ethanrd commented May 30, 2024

I just noticed what I believe is a discrepancy in the Node Names section of the v3 spec. The third paragraph says

Node names are case sensitive, e.g., the names “foo” and “FOO” are not identical.

Then the fourth paragraph starts with

When using non-ASCII Unicode characters, we recommend users to use case-folded NFKC-normalized strings ...

My understanding is that case folding is similar to (but more complicated than) converting each character in a string to lowercase. So it would work against the case-sensitive nature of node names described in paragraph three. I think case sensitivity is the intent, so I suggest that "case-folded" be removed from paragraph four.

Here's a W3C reference for case folding (and case mapping): https://www.w3.org/TR/charmod-norm/#definitionCaseFolding
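
A small sketch of what the recommended "case-folded NFKC-normalized" step actually does to a name, illustrating the tension with case sensitivity (the example string is arbitrary):

```python
import unicodedata

name = "Grüße"
folded = unicodedata.normalize("NFKC", name.casefold())

print(folded)          # grüsse: case folding lowercases and expands ß to "ss"
print(name == folded)  # False: the user's original spelling is not preserved
```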

@LDeakin

LDeakin commented May 31, 2024

The first paragraph means implementations must be case-sensitive and case-preserving; the second is just a recommendation to users about how to name their nodes. So I don't think there is a discrepancy here.

@ethanrd

ethanrd commented May 31, 2024

Perhaps not a discrepancy. However, if node names are case sensitive, I don't think it makes sense to recommend that users use case folding, which changes the case of certain characters (for instance, case folding would change "ABCdefG" to "abcdefg"). If node names are indeed case sensitive, users should be able to use the case that is appropriate for each letter in the name they have decided to use.

Also, I don't think users should have to understand NFKC or case folding. Those developing implementations need to understand them and decide how and when to use them, but not users.

Projects
Status: Done
8 participants