Do not allow special characters in base table locations #8524
Comments
Do we have a definition of "special characters"? Would it be enough to apply e.g. percent encoding to all base locations?
I suspect this can become pretty nasty later: some locations (provided "externally") are "properly escaped" and some are not.
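To make the "some are escaped, some are not" concern concrete, here is a minimal Python sketch. It assumes percent-encoding is applied per path element with the standard library's `urllib.parse.quote`; the helper name `encode_element` is hypothetical and is not part of Nessie or Iceberg.

```python
from urllib.parse import quote, unquote


def encode_element(element: str) -> str:
    """Percent-encode one path element (hypothetical helper).

    With safe="" only RFC 3986 unreserved characters are left untouched.
    """
    return quote(element, safe="")


raw = "my#table"
once = encode_element(raw)    # '#' becomes '%23'
twice = encode_element(once)  # the '%' itself is encoded again as '%25'

print(once)
print(twice)
# Decoding the doubly encoded value once does not give back the raw name.
print(unquote(twice) == raw)
```

If an externally provided location is already escaped and gets encoded again, the stored path no longer decodes to the original name, which is the ambiguity raised above.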
I do not think URL encoding is interoperable with |
I think we should permit only unreserved (per RFC 3986) chars in base locations.
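A rough sketch of what such a check could look like, in Python. RFC 3986 defines the unreserved set as letters, digits, `-`, `.`, `_` and `~`. The function name `is_safe_location_path` is hypothetical, and the sketch assumes the check applies only to the path portion of a location, split on `/`, not to the scheme or bucket.

```python
import re

# RFC 3986 "unreserved" characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = re.compile(r"[A-Za-z0-9._~-]+")


def is_safe_location_path(path: str) -> bool:
    """Return True if every non-empty path segment uses only unreserved chars.

    Assumption: '/' is treated purely as a separator and empty segments
    (leading, trailing or doubled slashes) are ignored.
    """
    segments = [s for s in path.split("/") if s]
    return bool(segments) and all(UNRESERVED.fullmatch(s) for s in segments)


print(is_safe_location_path("warehouse/ns1/table_1"))
print(is_safe_location_path("warehouse/my#table"))
```

Note that under this rule any non-ASCII name is rejected as well, which ties into the Unicode concerns further down the thread.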
Ah, you're right. Then we can only implement "our own" "safe escaping" for namespace/content-key elements. Forbidding such characters would then mean that you cannot create tables or views (just because of such a character).
Something like that?
Unicode is tricky too... but something that works transparently with |
I think the transformation does not have to be reversible.
Are we ok with a destructive encoding function? I.e. if both |
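As an illustration of what "destructive" means here, the following Python sketch replaces every character outside the RFC 3986 unreserved set with `_`. The function `lossy_encode` is a hypothetical example and not a proposal from this thread.

```python
import re


def lossy_encode(element: str) -> str:
    """Replace every non-unreserved character with '_' (not reversible)."""
    return re.sub(r"[^A-Za-z0-9._~-]", "_", element)


# Two distinct names collapse into the same encoded element.
print(lossy_encode("a#b"))
print(lossy_encode("a?b"))
print(lossy_encode("a#b") == lossy_encode("a?b"))
```

Because distinct names can map to the same path element, such an encoding is only safe if something else in the path keeps locations unique.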
Exactly what I was thinking :) Is there a good OSS impl.?
Ugh - true. However, entities have their Iceberg-UUID in the name - so it should™️ not be ambiguous? I suspect we have to rigorously forbid special-chars in object-store locations (as in |
But no matter which encoding we use, we have to think about existing locations (which we must/should not change) and new locations.
Legacy system issues - not nice
Existing locations are covered by |
It seems the JDK has one: https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/net/IDN.html
From my POC the main concern with new locations is that stuff derived from Nessie |
With Unicode |
On the other hand, Punycode will make Unicode path elements unreadable to humans in storage paths, which defeats the whole idea of using |
Yes, and I'm also concerned by the fact that it encodes all the ASCII characters first, then all the rest after, thus altering the natural sort order of original names. E.g.
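The sort-order effect can be reproduced with a small Python sketch. It uses Python's built-in `punycode` codec (raw RFC 3492 Punycode, without the `xn--` prefix or the name preparation that the JDK `IDN` class performs), so it is only a stand-in for what a Java implementation would produce. The sample names are made up for illustration.

```python
def punycode(name: str) -> str:
    """Encode a name with Python's built-in RFC 3492 Punycode codec."""
    return name.encode("punycode").decode("ascii")


# "\u00e9" is 'é'; escapes are used so the example does not depend on
# the source file's encoding or Unicode normalization form.
names = ["zb", "\u00e9a"]

original_order = sorted(names)               # 'z' sorts before 'é'
encoded_order = sorted(names, key=punycode)  # ASCII part is emitted first

print(original_order)
print(encoded_order)
print(punycode("m\u00fcnchen"))
```

Since Punycode copies the ASCII characters to the front and appends the encoded non-ASCII insertions after a `-`, the name starting with `é` now sorts by its `a`, and the two names swap places relative to their natural order.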
Maybe collations et al?
Issue description
Coming from discussions on #8516.
Related to:
S3FileIO
when column names contain #
apache/iceberg#10279