[#3379] feat(catalog-hadoop): Add S3 support for Fileset Hadoop catalog #4232
Conversation
@xiaozcy
# Conflicts: # catalogs/catalog-hadoop/build.gradle.kts
@FANNG1 can you please also take a look at this? Besides, I think it would be better for the S3-related configurations to be catalog/schema/fileset properties; such properties are important to make a fileset on S3 work, and we'd better not hide them in the Hadoop site XML.
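A minimal sketch of that idea (the catalog property key names here are assumptions, not necessarily Gravitino's final keys; the `fs.s3a.*` keys are the standard hadoop-aws ones): catalog/schema/fileset properties get translated into the underlying Hadoop configuration instead of being buried in a site XML.

```java
import java.util.HashMap;
import java.util.Map;

class S3PropertyMapping {
    // Hypothetical catalog property keys mapped to the real hadoop-aws (S3A) keys.
    private static final Map<String, String> CATALOG_TO_HADOOP = Map.of(
        "s3-endpoint", "fs.s3a.endpoint",
        "s3-access-key-id", "fs.s3a.access.key",
        "s3-secret-access-key", "fs.s3a.secret.key");

    // Translate catalog/schema/fileset properties into Hadoop conf entries;
    // properties without a mapping are simply ignored here.
    static Map<String, String> toHadoopConf(Map<String, String> catalogProps) {
        Map<String, String> hadoopConf = new HashMap<>();
        for (Map.Entry<String, String> e : catalogProps.entrySet()) {
            String hadoopKey = CATALOG_TO_HADOOP.get(e.getKey());
            if (hadoopKey != null) {
                hadoopConf.put(hadoopKey, e.getValue());
            }
        }
        return hadoopConf;
    }
}
```

Keeping the mapping explicit makes the S3 settings visible at the catalog level while still feeding the same keys hadoop-aws expects.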
@xiaozcy could you provide a document about how to make S3 work for the Fileset Hadoop catalog?
# Conflicts: # integration-test-common/src/test/java/org/apache/gravitino/integration/test/util/AbstractIT.java
```java
import org.slf4j.LoggerFactory;

@Tag("gravitino-docker-test")
public class HadoopCatalogS3IT extends AbstractIT {
```
Ideally, we could do some abstraction to extract the shared test logic into HadoopCatalogCommonIT, with HadoopCatalogHDFSIT and HadoopCatalogS3IT creating their own environments, like SparkCommonIT and SparkHiveCatalogIT.
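One way to picture that abstraction (class and method names are illustrative, not what the PR finally used): the common IT owns the shared logic once, and each storage-specific subclass only prepares its own environment.

```java
// Sketch of the suggested test abstraction; names are illustrative.
abstract class HadoopCatalogCommonIT {
    // Each storage-specific subclass prepares its own environment and
    // reports the base location filesets are created under.
    protected abstract String startEnvironment();

    // Shared test logic lives once in the common class.
    public String filesetLocation(String filesetName) {
        return startEnvironment() + "/" + filesetName;
    }
}

class HadoopCatalogHDFSIT extends HadoopCatalogCommonIT {
    @Override
    protected String startEnvironment() {
        // A real IT would start an HDFS container here.
        return "hdfs://localhost:9000/user/hadoop";
    }
}

class HadoopCatalogS3IT extends HadoopCatalogCommonIT {
    @Override
    protected String startEnvironment() {
        // A real IT would start an S3 mock container here.
        return "s3a://test-bucket/warehouse";
    }
}
```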
@xiaozcy can you please address the comments here?
# Conflicts: # catalogs/catalog-hadoop/build.gradle.kts
@jerryshao, sorry for the late reply. I have already upgraded the Hadoop version and done some abstraction of the IT, and I'm still working on managing some S3-related configurations in Gravitino.
# Conflicts: # catalogs/catalog-hadoop/src/test/java/org/apache/gravitino/catalog/hadoop/integration/test/HadoopCatalogIT.java
To make fileset work on S3, we may have to add configurations like
@yuqi1129 what do you think?
I think it would be best to add a flag to clearly indicate the type of authentication we will be using. The Gravitino fileset currently supports simple and Kerberos authentication. Once we set the type, we can verify that the required properties have been provided before initializing the corresponding SDK client. (See gravitino/docs/hadoop-catalog.md, lines 26 to 48 at 2d57f63.)
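A minimal sketch of that flag-based check (the property key names are illustrative, not Gravitino's actual ones): look up the required keys for the configured authentication type and fail fast before any client is created.

```java
import java.util.List;
import java.util.Map;

class AuthTypeValidator {
    // Hypothetical property key; Gravitino's real key name may differ.
    static final String AUTH_TYPE = "authentication.type";

    // Required properties per authentication type, checked before any
    // SDK/FileSystem client is initialized.
    private static final Map<String, List<String>> REQUIRED = Map.of(
        "simple", List.of(),
        "kerberos", List.of("authentication.kerberos.principal",
                            "authentication.kerberos.keytab-uri"),
        "s3-credentials", List.of("s3-access-key-id", "s3-secret-access-key"));

    static void validate(Map<String, String> props) {
        String type = props.getOrDefault(AUTH_TYPE, "simple");
        List<String> required = REQUIRED.get(type);
        if (required == null) {
            throw new IllegalArgumentException("Unsupported auth type: " + type);
        }
        for (String key : required) {
            if (!props.containsKey(key)) {
                throw new IllegalArgumentException(
                    "Missing required property '" + key + "' for auth type " + type);
            }
        }
    }
}
```

Validating up front turns a confusing SDK-level failure into an immediate, actionable error message.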
@yuqi1129, could you help review this again?
Sure.
(Lines 92 to 94 at 13f3b3f.) I wonder why the value passed to the Hadoop conf is
No, I would rather have a thorough solution beforehand. Supporting one cloud storage is easy, but once we add more, maintaining them becomes a heavy burden, and that burden should fall on the community, not the user. The solution should not only cover the server-side Fileset catalog; the client-side GVFS and the unified-configuration question should also be part of it. So I would suggest we write a complete design doc about how to support multiple cloud storages, covering both the server and client sides, and discuss based on that. The design doc will make us clearer and more thorough about how to support different storages well: currently all our discussion is around S3, but what if the current solution cannot fit ADLS or GCS, how do we handle that? Besides, whether a pluggable framework should be introduced should be decided based on thorough investigation and discussion, not just on whether it is urgent or blocking.
Hi all, let me share some actual production cases:
What I want to say is that as the number of supported storages increases, dependency conflicts are inevitable, so on-demand loading may be the more reasonable approach. I still hope that one catalog can support multiple storage types, with the supported types determined by the maintainer.
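A sketch of the on-demand loading idea: a storage scheme is only considered supported when its bundle jar is actually on the classpath, probed reflectively. The FileSystem class names below are the real hadoop-aws/GCS/Aliyun ones, but the probing mechanism itself is an assumption, not Gravitino's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class StorageBundleProbe {
    // Scheme -> Hadoop FileSystem implementation class shipped in each bundle.
    private static final Map<String, String> SCHEME_TO_IMPL = new LinkedHashMap<>();
    static {
        SCHEME_TO_IMPL.put("s3a", "org.apache.hadoop.fs.s3a.S3AFileSystem");
        SCHEME_TO_IMPL.put("gs", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
        SCHEME_TO_IMPL.put("oss", "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem");
    }

    // A scheme is supported only if its bundle is on the classpath, so the
    // deployer decides which storages a catalog supports by choosing jars.
    static boolean isSupported(String scheme) {
        String impl = SCHEME_TO_IMPL.get(scheme);
        if (impl == null) {
            return false;
        }
        try {
            // initialize=false: only check presence, don't run static initializers.
            Class.forName(impl, false, StorageBundleProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }
}
```

Because nothing is loaded until a scheme is actually used, bundles with conflicting transitive dependencies never need to coexist at runtime unless the deployer ships both.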
# Conflicts: # catalogs/catalog-hadoop/build.gradle.kts # gradle/libs.versions.toml
We could reuse the S3 properties defined in https://github.com/apache/gravitino/pull/4897/files#diff-7434c367b3597195902b8b064a5efe5810cb0cf5e7f55228044bfcc4cd9b2abd
```kotlin
":bundles:aliyun-bundle:copyLibAndConfig",
":bundles:aws-bundle:copyLibAndConfig",
":bundles:gcp-bundle:copyLibAndConfig"
```
Why do we need this? We will not ship these jars with the Gravitino binary; if you want them for tests, you'd better figure out a different way.
OK, let me try another way to copy the bundle jars automatically when testing S3 in deploy mode.
Removed; I replaced it with a test dependency in hadoop-catalog and hadoop3-filesystem.
What changes were proposed in this pull request?
Add S3 support for the Fileset Hadoop catalog. We actually only add the hadoop-aws dependency; most of the work is in conducting tests.
Why are the changes needed?
Fix: #3379
Does this PR introduce any user-facing change?
No.
How was this patch tested?
IT.