Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not allow special characters in base table locations #8524

Open
dimas-b opened this issue May 16, 2024 · 20 comments
Open

Do not allow special characters in base table locations #8524

dimas-b opened this issue May 16, 2024 · 20 comments
Milestone

Comments

@dimas-b
Copy link
Member

dimas-b commented May 16, 2024

Issue description

Coming from discussions on #8516.

Related to:

@adutra
Copy link
Contributor

adutra commented May 16, 2024

Do we have a definition of "special characters"? Would it be enough to apply e.g. percent encoding to all base locations?

@snazy
Copy link
Member

snazy commented May 16, 2024

I suspect this can become pretty nasty later - some locations (provided "externally") are "properly escaped" and some are not.
Ideally, we/Nessie should store "properly escaped" locations - the big question is probably: how do we know when a location that's provided "to us/Nessie" is already escaped (so we don't escape it twice or more often).

@dimas-b
Copy link
Member Author

dimas-b commented May 16, 2024

I do not think URL encoding is interoperable with URI, Iceberg's S3URI and AWS S3Utilities... sadly... cf. apache/iceberg#10329 (comment)

@dimas-b
Copy link
Member Author

dimas-b commented May 16, 2024

I think we should permit only unreserved (per RFC 3986) chars in base locations.

@snazy
Copy link
Member

snazy commented May 16, 2024

Ah, you're right. Then we can only implement "our own" "safe escaping" for in namespace/content-key elements. Forbidding would then mean that you cannot create tables or views (just because of such a character).
I suspect, we have to work on some rules for this - something like:

  • If it's a quote char, skip it
  • If it's a # or % or ?, ignore/skip it - or replace with _

Some like that?

@dimas-b
Copy link
Member Author

dimas-b commented May 16, 2024

Unicode is tricky too... but something that works transparently with java.net.URI is probably ok.

@dimas-b
Copy link
Member Author

dimas-b commented May 16, 2024

I think the transformation does not have to be reversible.

@adutra
Copy link
Contributor

adutra commented May 16, 2024

Are we ok with a destructive encoding function? I.e. if both foo# and foo? become foo (or foo_), the encoded result becomes ambiguous.

@adutra
Copy link
Contributor

adutra commented May 16, 2024

@dimas-b
Copy link
Member Author

dimas-b commented May 16, 2024

Exactly what I was thinking :) Is there a good OSS impl.?

@snazy
Copy link
Member

snazy commented May 16, 2024

destructive encoding function / encoded result becomes ambiguous

Ugh - true. However, entities have their Iceberg-UUID in the name - so it should™️ not be ambiguous?
But your point's still valid.

I suspect we have to rigorously forbid special-chars in object-store locations (as in org.projectnessie.catalog.service.config.WarehouseConfig.location()) but map/escape/destrictive-encode content-key elements?

@snazy
Copy link
Member

snazy commented May 16, 2024

But no matter which encoding we use - we have to think about existing locations (which we must/should not change) and new locations.

@snazy
Copy link
Member

snazy commented May 16, 2024

Legacy system issues - not nice

@dimas-b
Copy link
Member Author

dimas-b commented May 16, 2024

Existing locations are covered by StorageUri (hopefully). I think if it used to work in Iceberg/Spark, it will keep working with Nessie.

@adutra
Copy link
Contributor

adutra commented May 16, 2024

Exactly what I was thinking :) Is there a good OSS impl.?

It seems the jdk has one: https://docs.oracle.com/en%2Fjava%2Fjavase%2F21%2Fdocs%2Fapi%2F%2F/java.base/java/net/IDN.html

@dimas-b
Copy link
Member Author

dimas-b commented May 16, 2024

From my POC the main concern with new locations is that stuff derived from Nessie ContentKey for new tables may still have # and %, which will then break something on the Iceberg side.

@dimas-b
Copy link
Member Author

dimas-b commented May 16, 2024

With Unicode URI.toString() does not percent-encode non-reseved Unicode chars (and is able to parse them back), but URI.toASCIIString() does encode them. The latter will then hit S3 interop problems, I'm afraid.

@dimas-b
Copy link
Member Author

dimas-b commented May 16, 2024

On the other hand, Punycode will make Unicode path elements unreadable to humans in storage paths, which defeats the whole idea of using ContentKey for base locations, WDYT?

@adutra
Copy link
Contributor

adutra commented May 16, 2024

Yes, and I'm also concerned by the fact that it encodes all the ASCII characters first, then all the rest after, thus altering the natural sort order of original names. E.g.

äbc -> bc-uia
žbc -> bc-1va

@snazy
Copy link
Member

snazy commented May 16, 2024

Maybe collations et al?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants