[FLINK-18562] Support for Hadoop ABFS for Azure Datalake Gen2 accounts #16559
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
Automated Checks
Last check on commit 7b469bc (Wed Jul 21 21:13:00 UTC 2021)
Warnings:
Mention the bot in a comment to re-run the automated checks.
Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
Bot commands
The @flinkbot bot supports the following commands:
@flinkbot run azure
Tagging @AHeise who is helping review this one.
Thank you very much for contributing! The code looks quite good already. I made a suggestion to disentangle the abstract factory from the specific factories.
Please double-check the licenses - two new modules are added; just to be safe.
Concerning testing your changes (and actually also the existing FS): there is now azurite available which we could use inside a testcontainer to have an ad-hoc blob storage. It would be a blast if you could add respective ITCases. I can give more detailed pointers or we create a follow-up task.
.../flink-azure-fs-hadoop/src/main/java/org/apache/flink/fs/azurefs/AbstractAzureFSFactory.java
flink-filesystems/flink-fs-hadoop-shaded/src/main/resources/META-INF/NOTICE
...tems/flink-s3-fs-hadoop/src/main/java/org/apache/flink/fs/s3hadoop/HadoopS3AccessHelper.java
flink-filesystems/flink-s3-fs-presto/src/main/resources/META-INF/NOTICE
@srinipunuru I want to thank you very much for taking the time to work on this implementation. I am going to review the changes in more detail and I can work with you to update the Flink Documentation as soon as this PR is merged.
I will be reviewing your changes in this PR shortly.
Thanks again.
org.apache.hadoop.fs.FileSystem azureFS = new NativeAzureFileSystem();
azureFS.initialize(fsUri, hadoopConfig);
return azureFS;
} else {
Do we need to check whether the scheme starts with "abfs" here, so that we cover both the secure and non-secure schemes without including schemes that belong to non-Azure file systems?
Not necessarily. In any case, this code is going to be refactored and this if/else condition will go away.
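To illustrate the question in this thread, here is a minimal sketch of matching the URI scheme against an explicit allow-list instead of a prefix. The class `AzureSchemes` and its methods are hypothetical and are not part of this PR; they only show the idea. A plain `startsWith("abfs")` check would also accept an unrelated scheme such as `abfsx`, whereas exact matching against `abfs`/`abfss` does not.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of this PR): classifies Azure file system URI schemes. */
final class AzureSchemes {
    // Schemes of the legacy Blob Storage driver (plain and TLS).
    private static final Set<String> WASB = Set.of("wasb", "wasbs");
    // Schemes of the ADLS Gen2 driver (plain and TLS).
    private static final Set<String> ABFS = Set.of("abfs", "abfss");

    private AzureSchemes() {}

    /** Returns true only for the exact schemes "abfs" and "abfss" (case-insensitive). */
    static boolean isAbfs(URI uri) {
        return hasScheme(uri, ABFS);
    }

    /** Returns true only for the exact schemes "wasb" and "wasbs" (case-insensitive). */
    static boolean isWasb(URI uri) {
        return hasScheme(uri, WASB);
    }

    private static boolean hasScheme(URI uri, Set<String> allowed) {
        String scheme = uri.getScheme(); // null for relative URIs
        return scheme != null && allowed.contains(scheme.toLowerCase(Locale.ROOT));
    }
}
```

With this shape, a factory could branch on `AzureSchemes.isAbfs(fsUri)` and `AzureSchemes.isWasb(fsUri)` and reject everything else, though the refactoring mentioned above may make such a branch unnecessary.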
Thanks for the pointers, @AHeise. I did some digging on Azurite over the last few days. There is good news and bad news. The good news is that we could use it for testing against Azure storage with wasb:// (the legacy HDFS driver). The bad news is that it doesn't yet support abfs and ADLS Gen2 (Azure/Azurite#553). I could add a follow-up task to add an end-to-end test with Azurite for wasb. What do you think?
I suggest that once the commits are finished, you could do a final manual test with all 4 schemes and briefly mention it here.
Thanks a lot for reviewing this, @AHeise. I updated the PR and I think I have addressed all of the comments. I verified the latest PR end to end with wasb/wasbs and abfs/abfss. Please let me know if anything else is pending before this PR can be merged. Thanks for your help.
...re-fs-hadoop/src/main/resources/META-INF/services/org.apache.flink.core.fs.FileSystemFactory
Thank you again for your contribution and the quick responses. Except for the smaller suggestion from @izzyacademy, everything looks good. Could you please prefix all commits with [FLINK-18562][fs]? Then it's easier to see what the commits are about when they are merged into master.
08e8313 to 8149b8c
Thanks, @AHeise. I have addressed the comments and also updated the commit messages to follow the convention. Thanks again for taking the time to review this :)
LGTM. Thank you very much for the contribution. I'm merging now.
What is the purpose of the change
This pull request adds support for abfs to talk to ADLS Gen2 (Azure Data Lake Gen2) storage accounts. These are newer storage account types that support hierarchical namespaces, and abfs takes advantage of this capability. Azure recommends using abfs for reading from and writing to these newer storage accounts.
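For orientation, here is a minimal sketch of what an abfs path to such an account looks like, assuming the URI layout documented for the Hadoop ABFS driver: `abfs[s]://<container>@<account>.dfs.core.windows.net/<path>`. The class `AbfsPaths`, the method `abfsUri`, and the container and account names in the usage note are hypothetical and only serve as illustration; they are not part of this change.

```java
import java.net.URI;

/** Hypothetical helper (not part of this PR): builds ABFS URIs for an ADLS Gen2 account. */
final class AbfsPaths {
    private AbfsPaths() {}

    /**
     * Builds abfs[s]://<container>@<account>.dfs.core.windows.net/<path>.
     * "abfss" is the TLS-secured variant of the scheme.
     */
    static URI abfsUri(boolean secure, String container, String account, String path) {
        String scheme = secure ? "abfss" : "abfs";
        // Make sure the path component starts with exactly one leading slash.
        String normalized = path.startsWith("/") ? path : "/" + path;
        return URI.create(
                scheme + "://" + container + "@" + account + ".dfs.core.windows.net" + normalized);
    }
}
```

For example, `AbfsPaths.abfsUri(true, "data", "myaccount", "checkpoints/job1")` yields `abfss://data@myaccount.dfs.core.windows.net/checkpoints/job1`, the kind of path that could then be passed to a job as a checkpoint or sink location.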
Brief change log
Verifying this change
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes / no)
Documentation