[#5492] feat(hadoop-catalog): Support Azure blob storage for Gravitino server and GVFS Java client #5508

Merged
merged 12 commits into from
Nov 14, 2024

@yuqi1129 yuqi1129 commented Nov 7, 2024

What changes were proposed in this pull request?

Add support for Azure Blob Storage in the Gravitino server and the GVFS Java client.

Why are the changes needed?

It's a big improvement for fileset usage.

Fix: #5492

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

Integration tests (ITs).


public class ABSFileSystemProvider implements FileSystemProvider {

private static final String ABS_PROVIDER_SCHEME = "wasbs";
Contributor

What's the difference between wasb and wasbs?

Contributor Author

Paths with the wasbs prefix use SSL in transport; wasb paths do not. It's advisable to use wasbs in production environments.

Contributor

Azure provides several storage options, such as wasb and adls2. Can you please investigate further?

Contributor Author

In the early days, Azure only supported Blob Storage, and the corresponding protocol was wasb or wasbs. Later, Azure added ADLS, which supports HDFS-like directory management for big data ecosystems; its protocol is abfs or abfss (abfss is the secure variant of abfs).

Another point is that the Hadoop version we use is 3.1, which only supports wasb; we need to upgrade to 3.3 to support this feature. I have created an issue about the version upgrade:

#5532

@jerryshao Please help to verify whether we need to update the Hadoop version to 3.3 or only use 3.3 for hadoop-azure (I'm afraid there will be some problem if we use Hadoop common with version 3.1 and hadoop-azure 3.3).
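
The four schemes described in this thread (wasb, wasbs, abfs, abfss) can be told apart from the URI alone. The following is a minimal sketch of that classification, not code from this PR: the class name `AzureSchemeInfo` and its fields are hypothetical, and it relies only on `java.net.URI` from the JDK.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper, not part of the PR: classifies an Azure path by its URI scheme. */
final class AzureSchemeInfo {
  final String driver;  // "wasb" (Blob Storage driver) or "abfs" (ADLS driver)
  final boolean secure; // true when the scheme implies SSL transport

  private AzureSchemeInfo(String driver, boolean secure) {
    this.driver = driver;
    this.secure = secure;
  }

  static AzureSchemeInfo fromPath(String path) {
    String scheme = URI.create(path).getScheme();
    if (scheme == null) {
      throw new IllegalArgumentException("Path has no scheme: " + path);
    }
    switch (scheme.toLowerCase(Locale.ROOT)) {
      case "wasb":
        return new AzureSchemeInfo("wasb", false);
      case "wasbs":
        return new AzureSchemeInfo("wasb", true);
      case "abfs":
        return new AzureSchemeInfo("abfs", false);
      case "abfss":
        return new AzureSchemeInfo("abfs", true);
      default:
        throw new IllegalArgumentException("Not an Azure scheme: " + scheme);
    }
  }
}
```

For example, `AzureSchemeInfo.fromPath("abfss://container@account.dfs.core.windows.net/data")` yields driver `abfs` with `secure` set to true; the container and account names here are placeholders.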

Contributor

@yuqi1129 As far as I know, a Hadoop 3.1 environment with hadoop-azure 3.2.1 runs normally (the abfs/abfss protocol was first supported in hadoop-azure 3.2.0). However, whether hadoop-azure 3.3 is feasible needs further testing.

Contributor Author

I see. I'm not completely confident that Hadoop Common 3.1 is compatible with hadoop-azure 3.2.
Would you suggest trying Hadoop 3.1 with hadoop-azure 3.2, rather than using Hadoop 3.2 or above for both?

Contributor

I have no particular preference for this.
In terms of compatibility, our current production environment is using Hadoop 3.1 and Hadoop-Azure 3.2.1, which have been running for more than a year, and no compatibility issues have been found.
If we consider upgrading the Hadoop version, can you consider upgrading the Hadoop version of the entire project to a relatively stable version during this upgrade process? Currently, Hive still uses Hadoop 2.7. If we upgrade to a higher version, it may cause greater differences in the future.

@jerryshao
Contributor

Please also update the docs in this PR rather than splitting the work across several PRs.

Contributor

@mchades mchades left a comment

It's blob storage, not "block storage"; please update the title and description.

@yuqi1129 yuqi1129 changed the title from "[#5492] feat(hadoop-catalog): Support Azure block storage for Gravitino server and GVFS Java client" to "[#5492] feat(hadoop-catalog): Support Azure blob storage for Gravitino server and GVFS Java client" on Nov 12, 2024

public class AzureFileSystemProvider implements FileSystemProvider {

@VisibleForTesting public static final String ABS_PROVIDER_SCHEME = "abfss";
Contributor Author

Iceberg uses this protocol; however, wasbs is also used by other software such as Dremio.

Contributor

Let's support this first, then we can add more support later on.

Contributor

So if later on when we support wasb, can we still use this provider, or shall we create another provider?

Contributor Author

I plan to introduce a name alias method to support multiple protocols in one provider.

Set<String> schemeAlias() {
  // Additional schemes that should resolve to this provider.
  return ImmutableSet.of("wasb", "wasbs", "abfs");
}

If the scheme of a path is in this set, the path will also use this provider.
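
The alias idea proposed in this comment could be wired into provider lookup roughly as follows. This is only a sketch under stated assumptions: `SchemeAwareProvider`, `AzureProviderSketch`, and `ProviderRegistrySketch` are hypothetical names invented for illustration, and the real `FileSystemProvider` interface in the project may look different.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a provider interface with alias support. */
interface SchemeAwareProvider {
  String scheme();

  // Extra schemes that should also resolve to this provider; none by default.
  default Set<String> schemeAlias() {
    return Collections.emptySet();
  }
}

/** Hypothetical Azure provider: primary scheme abfss, aliases from the comment above. */
final class AzureProviderSketch implements SchemeAwareProvider {
  @Override
  public String scheme() {
    return "abfss";
  }

  @Override
  public Set<String> schemeAlias() {
    return new HashSet<>(Arrays.asList("wasb", "wasbs", "abfs"));
  }
}

/** Hypothetical registry that indexes providers by primary scheme and by alias. */
final class ProviderRegistrySketch {
  private final Map<String, SchemeAwareProvider> byScheme = new HashMap<>();

  ProviderRegistrySketch(List<SchemeAwareProvider> providers) {
    for (SchemeAwareProvider provider : providers) {
      register(provider.scheme(), provider);
      for (String alias : provider.schemeAlias()) {
        register(alias, provider);
      }
    }
  }

  private void register(String scheme, SchemeAwareProvider provider) {
    SchemeAwareProvider previous = byScheme.putIfAbsent(scheme, provider);
    if (previous != null && previous != provider) {
      // Two different providers claiming one scheme is treated as a setup error.
      throw new IllegalStateException("Scheme claimed by two providers: " + scheme);
    }
  }

  SchemeAwareProvider resolve(String scheme) {
    SchemeAwareProvider provider = byScheme.get(scheme);
    if (provider == null) {
      throw new IllegalArgumentException("No provider for scheme: " + scheme);
    }
    return provider;
  }
}
```

With this layout a path whose scheme is `wasbs` resolves to the same provider instance as one whose scheme is `abfss`. Failing fast on conflicting claims is a design choice of this sketch, not something stated in the thread.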

public static final String GRAVITINO_ABS_ACCOUNT_NAME = "abs-account-name";

// The account key of the Azure Blob Storage.
public static final String GRAVITINO_ABS_ACCOUNT_KEY = "abs-account-key";
Contributor

So this is similar to AKSK in Azure blob storage?

Contributor Author

Yes. Azure Blob Storage also supports two other authentication mechanisms, which are considerably more complex than the current one.

Contributor

Can you please confirm that this is a mainstream authentication method for ABS and that it is widely used by other systems?

Contributor Author

Yes. Like AKSK, it is the method most widely used by users and applications; others, such as SAS tokens and Azure Active Directory (AAD), are quite complicated and hard to configure.

import org.junit.jupiter.api.condition.EnabledIf;
import org.junit.platform.commons.util.StringUtils;

@EnabledIf("absEnabled")
Contributor

@jerryshao jerryshao Nov 13, 2024

Maybe we should also unify the name for abs here as before.

Contributor

Besides, can we move this test to the azure-bundle (also for other storages)?

Contributor Author

> Besides, can we move this test to the azure-bundle (also for other storages)?

Yes, it can be. I tried it, but it seems to introduce a lot of dependencies into the bundle, so I plan to handle it as a whole in a separate issue, #5565.
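
For readers unfamiliar with the `@EnabledIf("absEnabled")` annotation in the snippet under review: in JUnit 5 the string names a no-argument method returning `boolean`, and the annotated test class runs only when that method returns true. The condition method itself is not shown in this thread, so the following is a guess at its general shape, not the PR's code. The environment variable names are hypothetical, and the JUnit annotation appears only in a comment so that the sketch compiles without JUnit on the classpath.

```java
import java.util.Map;

/** Hypothetical condition holder; in a real test this logic would live in the test class. */
final class AbsTestCondition {
  // Assumed environment variable names, chosen for illustration only.
  static final String ENV_ACCOUNT_NAME = "ABS_ACCOUNT_NAME";
  static final String ENV_ACCOUNT_KEY = "ABS_ACCOUNT_KEY";

  // A test class annotated with @EnabledIf("absEnabled") would call a method like this one.
  static boolean absEnabled() {
    return absEnabled(System.getenv());
  }

  // Separated out so the decision can be checked against any map of settings.
  static boolean absEnabled(Map<String, String> env) {
    return isNotBlank(env.get(ENV_ACCOUNT_NAME)) && isNotBlank(env.get(ENV_ACCOUNT_KEY));
  }

  private static boolean isNotBlank(String value) {
    return value != null && !value.trim().isEmpty();
  }
}
```

Gating on credentials this way keeps the Azure integration tests skipped, rather than failing, on machines where no storage account is configured.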

Contributor

@jerryshao jerryshao left a comment

Please revisit your code and doc carefully to avoid any errors.


@VisibleForTesting public static final String ABS_PROVIDER_SCHEME = "abfss";

@VisibleForTesting public static final String ABS_PROVIDER_NAME = "abfs";
Contributor

@jerryshao jerryshao Nov 14, 2024

Would it be better to change this to "abs"? abfs is a scheme/protocol name; the service name should be abs, right?

Contributor

@jerryshao jerryshao Nov 14, 2024

You bring in several concepts, like abfs, abfss, and abs. I think you should clearly separate them and define them correctly.

Contributor Author

> Would it be better to change this to "abs"? abfs is a scheme/protocol name; the service name should be abs, right?

Agreed, I have adopted this.

Comment on lines +52 to +59
if (config.containsKey(ABSProperties.GRAVITINO_ABS_ACCOUNT_NAME)
    && config.containsKey(ABSProperties.GRAVITINO_ABS_ACCOUNT_KEY)) {
  hadoopConfMap.put(
      String.format(
          "fs.azure.account.key.%s.dfs.core.windows.net",
          config.get(ABSProperties.GRAVITINO_ABS_ACCOUNT_NAME)),
      config.get(ABSProperties.GRAVITINO_ABS_ACCOUNT_KEY));
}
Contributor

Can abfs work without these two configurations? If not, I think we should throw an exception here.

Contributor Author

If we let users pass other Azure configurations here, I think we should not throw an exception. For example, users can set SAS token or Azure Active Directory (AAD) related configuration through the bypass mechanism, though I'm not 100% sure whether users can successfully do so.
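
One possible middle ground between the two positions in this thread is to stay silent when neither credential is present (so other auth settings can be passed through) but to fail when only one of the two is set. The sketch below illustrates that idea; it is not what the PR implements. The class name `AbsConfSketch` is hypothetical, the property names and the `fs.azure.account.key.%s.dfs.core.windows.net` key format are copied from the snippets in this review, and the `gravitino.bypass.` prefix is an assumption about how the bypass mechanism names its keys.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of mapping catalog properties to Hadoop configuration entries. */
final class AbsConfSketch {
  static final String ACCOUNT_NAME = "abs-account-name"; // from the reviewed snippet
  static final String ACCOUNT_KEY = "abs-account-key";   // from the reviewed snippet
  // Assumption: keys with this prefix are handed to Hadoop with the prefix removed.
  static final String BYPASS_PREFIX = "gravitino.bypass.";

  static Map<String, String> toHadoopConf(Map<String, String> config) {
    Map<String, String> hadoopConf = new HashMap<>();

    // Pass bypass entries through so SAS or AAD settings can be supplied directly.
    for (Map.Entry<String, String> entry : config.entrySet()) {
      if (entry.getKey().startsWith(BYPASS_PREFIX)) {
        hadoopConf.put(entry.getKey().substring(BYPASS_PREFIX.length()), entry.getValue());
      }
    }

    boolean hasName = config.containsKey(ACCOUNT_NAME);
    boolean hasKey = config.containsKey(ACCOUNT_KEY);
    if (hasName != hasKey) {
      // Half-configured shared-key auth is almost certainly a mistake.
      throw new IllegalArgumentException(
          ACCOUNT_NAME + " and " + ACCOUNT_KEY + " must be set together");
    }
    if (hasName) {
      hadoopConf.put(
          String.format("fs.azure.account.key.%s.dfs.core.windows.net", config.get(ACCOUNT_NAME)),
          config.get(ACCOUNT_KEY));
    }
    return hadoopConf;
  }
}
```

An empty input produces an empty map and no exception, which preserves the author's point that other authentication modes should remain possible, while a lone account name or a lone key is reported early.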

Contributor

@jerryshao jerryshao left a comment

LGTM.

@jerryshao jerryshao merged commit 79c362c into apache:main Nov 14, 2024
26 checks passed
jerryshao added a commit that referenced this pull request Nov 26, 2024
…nt (#5538)

### What changes were proposed in this pull request?

Support accessing ADLS filesets from the GVFS Python client.

### Why are the changes needed?

This is a subsequent PR for #5508 

Fix: #5507 

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

IT locally.

---------

Co-authored-by: Jerry Shao <[email protected]>