[FEATURE] Improve special character handling for generating Flint index name #192

Closed
dai-chen opened this issue Dec 12, 2023 · 1 comment
Labels
0.2 enhancement New feature or request

Comments

dai-chen commented Dec 12, 2023

Is your feature request related to a problem?

Spark restricts table names to letters, digits, and underscores. However, a table that does not come from the Spark/Hive catalog can have special characters in its name. In that case, the generated Flint index name will be rejected by OpenSearch because of the restriction below:

Index names can’t contain spaces, commas, or the following characters: :, ", *, +, /, \, |, ?, #, >, or <

Ref: https://opensearch.org/docs/latest/api-reference/index-apis/create-index/#index-naming-restrictions

What solution would you like?

Some extra encoding is required here to solve the problem:

  1. Base64 can represent any string as a sequence of characters that doesn't include spaces. However, standard Base64 output can still contain / and +, which are restricted in index names, so it might not be suitable in this case.
  2. Percent (URI) encoding converts each character to its byte value (its ASCII code for ASCII characters) and represents that value as a pair of hexadecimal digits preceded by a percent sign. Ref: https://en.wikipedia.org/wiki/Percent-encoding
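
To illustrate the Base64 concern in option 1, here is a minimal Python sketch. The helper name `restricted_in_base64` and the `RESTRICTED` set are made up for this example; the set is transcribed from the OpenSearch restriction quoted above. The inputs are chosen only because they are known to produce / and + in standard Base64.

```python
import base64

# Characters OpenSearch rejects in index names (transcribed from the
# restriction quoted in this issue; an assumption of this sketch).
RESTRICTED = set(' ,:"*+/\\|?#><')


def restricted_in_base64(name: str) -> set:
    """Return the restricted characters found in the standard Base64 form of name."""
    encoded = base64.b64encode(name.encode("utf-8")).decode("ascii")
    return set(encoded) & RESTRICTED


# "???" encodes to "Pz8/" and "??>" encodes to "Pz8+".
print(restricted_in_base64("???"))  # contains "/"
print(restricted_in_base64("??>"))  # contains "+"
```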

Because the percent sign is valid in an OpenSearch index name, percent-encoding may be the preferred solution. It also preserves readability for users who want to access the index directly.
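
A rough Python sketch of what such an encoding could look like is below. It is not the Flint implementation: the function name `encode_index_name` is hypothetical, the `RESTRICTED` set is transcribed from the OpenSearch restriction quoted above, and also escaping `%` itself (so the name can be decoded unambiguously) is an assumption of this sketch. Hex digits are emitted in lowercase, matching the lowercase form used in this issue.

```python
from urllib.parse import unquote

# Characters OpenSearch rejects in index names (assumption: transcribed
# from the restriction quoted in this issue).
RESTRICTED = set(' ,:"*+/\\|?#><')


def encode_index_name(name: str) -> str:
    """Percent-encode only the restricted characters (and '%') with lowercase hex."""
    out = []
    for ch in name:
        if ch in RESTRICTED or ch == "%":
            # One %xx group per UTF-8 byte of the character.
            out.extend(f"%{b:02x}" for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


encoded = encode_index_name("loggroup/test")
print(encoded)           # loggroup%2ftest
print(unquote(encoded))  # loggroup/test
```

Everything outside the restricted set is left untouched, so the encoded name stays close to the source table name.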

Here is a quick test with OpenSearch 2.9:

# Example: loggroup/test
# => loggroup%2Ftest   : encode slash
# => loggroup%2ftest   : lowercase f (FlintOSClient is doing this already)
# => loggroup%252ftest : encode % (automatic conversion by REST client)
PUT loggroup%252ftest

GET _cat/indices
# green open loggroup%2ftest    A8F3EVwqSdCseKOLuD6ULQ 5 2       0 0     3kb    1kb
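
The last step of the test above (escaping `%` as `%25` in the request path) can be reproduced with Python's standard library. This is only a sketch of the URL escaping, not of any OpenSearch client; `request_path` is a hypothetical helper name.

```python
from urllib.parse import quote, unquote


def request_path(index_name: str) -> str:
    """Build a URL path for an index whose name already contains percent signs."""
    # safe="" makes quote() escape every reserved character, including '%'.
    return "/" + quote(index_name, safe="")


path = request_path("loggroup%2ftest")
print(path)           # /loggroup%252ftest
print(unquote(path))  # /loggroup%2ftest
```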

In the Flint index metadata, there is a source field that stores the source table name. This can be used to double-check in case of a naming conflict.
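
A minimal sketch of that double check is below, in Python. The metadata is modeled as a plain dict with a top-level `"source"` key, which is an assumption of this example and not the actual Flint metadata layout; `check_index` and its return values are hypothetical as well.

```python
from typing import Optional


def check_index(existing_metadata: Optional[dict], table_name: str) -> str:
    """Decide what to do with an encoded index name for the given source table.

    existing_metadata is the metadata of an index that already has this name,
    or None if no such index exists (hypothetical shape: {"source": "<table>"}).
    """
    if existing_metadata is None:
        return "create"
    if existing_metadata.get("source") == table_name:
        return "reuse"
    # Same index name but a different source table: a naming conflict.
    return "conflict"


print(check_index(None, "loggroup/test"))                         # create
print(check_index({"source": "loggroup/test"}, "loggroup/test"))  # reuse
print(check_index({"source": "loggroup_test"}, "loggroup/test"))  # conflict
```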

What alternatives have you considered?

Define some custom encoding rule to solve this specific problem.

dai-chen added the enhancement (New feature or request) and untriaged labels and removed the untriaged label on Dec 12, 2023
penghuo commented Dec 18, 2023

More characters we should consider:

a-z, A-Z, 0-9, '_' (underscore), '-' (hyphen), '/' (forward slash), '.' (period), and '#' (number sign)
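
As a quick check, the characters in this list that collide with the OpenSearch restriction quoted earlier in the issue can be computed in Python. Both sets are transcribed by hand from this thread, so treat them as assumptions.

```python
import string

# Characters OpenSearch rejects in index names (transcribed from the issue).
RESTRICTED = set(' ,:"*+/\\|?#><')

# Characters listed in the comment above.
SOURCE_NAME_CHARS = set(string.ascii_letters + string.digits + "_-/.#")

needs_encoding = sorted(SOURCE_NAME_CHARS & RESTRICTED)
print(needs_encoding)  # ['#', '/']
```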
