
[#4895] docs(iceberg): add document for support not managed storages for Iceberg #4896

Merged
merged 5 commits into apache:main on Sep 12, 2024

Conversation

Contributor

@FANNG1 FANNG1 commented Sep 9, 2024

What changes were proposed in this pull request?

For other storages not built in with Gravitino, we should add a document about how to use them.

Why are the changes needed?

Fix: #4895

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

@FANNG1 FANNG1 changed the title [#4895] docs(iceberg): add document for support other storages for Iceberg [#4895] docs(iceberg): add document for support not managed storages for Iceberg Sep 9, 2024
| Configuration item               | Description                                                                         | Default value | Required | Since Version |
|----------------------------------|-------------------------------------------------------------------------------------|---------------|----------|---------------|
| `gravitino.iceberg-rest.io-impl` | The IO implementation for `FileIO` in Iceberg; use the fully qualified class name.  | (none)        | No       | 0.6.0         |

For other custom properties, such as a `security-token` to pass to `FileIO`, you can configure them directly under the same prefix, for example `gravitino.iceberg-rest.security-token`.
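For instance, a minimal sketch of the server configuration could look like the following; the `S3FileIO` class and the `security-token` key are only illustrative placeholders, not required values:

```
# Hypothetical fragment of the Gravitino Iceberg REST server configuration.
# Point io-impl at the fully qualified class name of the FileIO you need.
gravitino.iceberg-rest.io-impl = org.apache.iceberg.aws.s3.S3FileIO
# Custom keys under the gravitino.iceberg-rest. prefix are passed on to FileIO,
# here an illustrative security-token value.
gravitino.iceberg-rest.security-token = <your-security-token>
```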
Contributor

Can you explain more here? For example, if user has a custom FileIO implementation called "A", then set configuration like "gravitino.iceberg-rest.xxx", how does this "A" know this configuration "xxx"?

Contributor Author

updated

@@ -321,7 +332,7 @@ For example, we can configure Spark catalog options to use Gravitino Iceberg RES
--conf spark.sql.catalog.rest.uri=http://127.0.0.1:9001/iceberg/
```

You may need to adjust the Iceberg Spark runtime jar file name according to the real version number in your environment. If you want to access the data stored in S3, you need to download [Iceberg AWS bundle](https://mvnrepository.com/artifact/org.apache.iceberg/iceberg-aws-bundle) jar and place it in the classpath of Spark, no extra config is needed because S3 related properties is transferred from Iceberg REST server to Iceberg REST client automaticly.
You may need to adjust the Iceberg Spark runtime jar file name according to the real version number in your environment. If you want to access data stored in S3, you need to download the [Iceberg AWS bundle](https://mvnrepository.com/artifact/org.apache.iceberg/iceberg-aws-bundle) jar and place it in the classpath of Spark; no extra config is needed because S3-related properties are transferred from the Iceberg REST server to the Iceberg REST client automatically. For other storages not managed by Gravitino, you can specify the configuration explicitly to initialize the `FileIO` implementation, like `spark.sql.catalog.${catalog_name}.${configuration_key}`.
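As a hedged sketch, the extra options for the `rest` catalog above could look like this; the `S3FileIO` class and the `security-token` key are placeholders for whatever your `FileIO` implementation actually expects:

```shell
# Hypothetical extra Spark options for the catalog named `rest`
--conf spark.sql.catalog.rest.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--conf spark.sql.catalog.rest.security-token=<your-security-token>
```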
Contributor

Is it only for S3? How about other storages supported by Gravitino, and others not directly supported by us?

Can you please describe more, and write them down as a user? I don't think users can handle this with such simple words, at least for me.

Contributor Author

updated


#### Other storages

For storages that are not inherently integrated into Gravitino Iceberg REST service, you can manage them effectively through custom catalog properties.
Contributor

What's the meaning of "inherently integrated"?

Contributor Author

updated

- HDFS
- S3
- OSS
- Supports diverse storage like `S3`, `HDFS`, `OSS`, and provides the capability to support other storages.
Contributor
@jerryshao jerryshao Sep 11, 2024

I believe there's gcs, right? Please list all the supported cloud storage.

Contributor

"Supports different cloud storages...: "

Contributor Author

Using "Supports different storages", because HDFS is not cloud storage.

@@ -337,7 +348,7 @@ For example, we can configure Spark catalog options to use Gravitino Iceberg RES
--conf spark.sql.catalog.rest.uri=http://127.0.0.1:9001/iceberg/
```

You may need to adjust the Iceberg Spark runtime jar file name according to the real version number in your environment. If you want to access the data stored in cloud, you need to download corresponding jars (please refer to the cloud storage part) and place it in the classpath of Spark, no extra config is needed because related properties is transferred from Iceberg REST server to Iceberg REST client automatically.
You may need to adjust the Iceberg Spark runtime jar file name according to the real version number in your environment. If you want to access data stored in the cloud, you need to download the corresponding jars (please refer to the cloud storage part) and place them in the classpath of Spark; no extra config is needed because related properties are transferred from the Iceberg REST server to the Iceberg REST client automatically.

For other storages not managed by Gravitino, the properties won't be transferred from the server to the client automatically. If you want to pass custom properties to initialize `FileIO`, you can add them as `spark.sql.catalog.${iceberg_catalog_name}.${configuration_key}` = `{property_value}`.
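For example, a hedged sketch using `spark-defaults.conf` style entries for a catalog named `rest`, with `security-token` standing in for whatever custom key your `FileIO` implementation expects:

```
# Hypothetical spark-defaults.conf entries for the catalog named `rest`
spark.sql.catalog.rest.io-impl          org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.rest.security-token   <your-security-token>
```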
Contributor

Can you please split this long sentence into several paragraphs to make it clearer?

Contributor Author

done

@@ -119,6 +119,20 @@ Please make sure the credential file is accessible by Gravitino, like using `exp
Please set `warehouse` to `gs://{bucket_name}/${prefix_name}`, and download [Iceberg gcp bundle jar](https://mvnrepository.com/artifact/org.apache.iceberg/iceberg-gcp-bundle) and place it to `catalogs/lakehouse-iceberg/libs/`.
:::

#### Other storages

For other storages that are managed by Gravitino directly, you can manage them through custom catalog properties.
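As a hedged sketch, such catalog properties might look like the following; `io-impl` is assumed here to mirror the REST server property described earlier (check the catalog property table for the exact key), and `security-token` is only an illustrative custom key:

```
# Hypothetical custom catalog properties for a lakehouse-iceberg catalog
io-impl        = org.apache.iceberg.aws.s3.S3FileIO
security-token = <your-security-token>
```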
Contributor

“that are not managed...”

Contributor Author

done

@@ -162,6 +159,20 @@ You should place HDFS configuration file to the classpath of the Iceberg REST se
Builds with Hadoop 2.10.x. There may be compatibility issues when accessing Hadoop 3.x clusters.
:::

#### Other storages

For other storages that are managed by Gravitino directly, you can manage them through custom catalog properties.
Contributor

"are not managed by..."

Contributor Author

done

@jerryshao jerryshao merged commit 55ad5fd into apache:main Sep 12, 2024
19 checks passed
Successfully merging this pull request may close these issues.

[Subtask] add document for support other storages for Iceberg