Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Databricks Azure qual tool to set env variable for ABFS paths #1016

Merged
merged 10 commits into from
May 18, 2024

Conversation

cindyyuanjiang
Copy link
Collaborator

@cindyyuanjiang cindyyuanjiang commented May 14, 2024

Fixes #981

spark_rapids qualification --eventlogs=abfss://xxxxxxxxxx

Running the above command fails because spark_rapids qualification tool tries to detect platform from ABFS event logs but it cannot verify that without credentials.

Instead of asking the users to specify either AZURE_STORAGE_ACCOUNT_NAME or AZURE_STORAGE_CONNECTION_STRING env variables like:

export AZURE_STORAGE_ACCOUNT_NAME=xxxxxxxx
export AZURE_STORAGE_CONNECTION_STRING=xxxxxxxx
spark_rapids qualification --eventlogs=abfss://xxxxxxxxxx

This PR sets the env vars in the qualification tools so that users don't have to do that.

@cindyyuanjiang cindyyuanjiang self-assigned this May 14, 2024
@cindyyuanjiang cindyyuanjiang added bug Something isn't working user_tools Scope the wrapper module running CSP, QualX, and reports (python) labels May 14, 2024
Copy link
Collaborator

@amahussein amahussein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cindyyuanjiang !
Some few comments to keep the design flow.

Comment on lines 280 to 285
@classmethod
def init_env_vars(cls, path: Union[str, 'CspPath']) -> None:
if cls.protocol_prefix == AZURE_STORAGE_PREFIX:
account_name = get_abfs_account_name(str(path))
os.environ['AZURE_STORAGE_ACCOUNT_NAME'] = account_name

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very specific to AdlsPath class which extends CspPath. It is better to be inside adlspath.py.
I did not look deeper in the code, but my intuition is that this code block should be inside adlspath.py by overriding is_protocol_prefix

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this to adlspath.py, thanks!

Comment on lines 115 to 118
def get_abfs_account_name(path: str):
return path.split('@')[1].split('.')[0]


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • It is always useful to use type hints to get some checks on type safety and code incorrectness. Python started adopted this typing/atype-annotations more widely to overcome the problem of type-safety that makes python difficult to work with in large projects.
     def get_abfs_account_name(path: str) -> str :
  • I don't see util.py is the best place for this because it is a functionality that's specific to ADLS driver.
  • Should this method try to read the env-variable. if it is not set, then it extracts it from the URL?
  • Also, with defining this new method, I would expect 2 sites calling this method: 1- in pytools (old way we xtracted information for Azure); 2- During the initialization in the new spark_rapids_tools
    aka, DBAzureLocalRapidsJob.get_account_name strips the account-name from the URL; regardless of the initialization we did. It should read

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @amahussein! Updated this.

@@ -56,6 +57,7 @@


CspPathString = Annotated[str, StringConstraints(pattern=r'^\w+://.*')]
AZURE_STORAGE_PREFIX = 'abfss://'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this because this overrides the purpose of the design that each Class will be able to recognize its own URL.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed this, thanks!

Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Copy link
Collaborator

@parthosa parthosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cindyyuanjiang for this change. We are automatically setting AZURE_STORAGE_ACCOUNT_NAME in this PR.

In addition to this, AZURE_STORAGE_CONNECTION_STRING would also be required. Do we plan to handle this as well?

@cindyyuanjiang
Copy link
Collaborator Author

In addition to this, AZURE_STORAGE_CONNECTION_STRING would also be required. Do we plan to handle this as well?

Thanks @parthosa! Setting AZURE_STORAGE_ACCOUNT_NAME is enough to verify the abfss path.

parthosa
parthosa previously approved these changes May 16, 2024
Copy link
Collaborator

@parthosa parthosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cindyyuanjiang. LGTM

Comment on lines 32 to 37
@classmethod
def is_protocol_prefix(cls, value: str) -> bool:
if value.startswith(cls.protocol_prefix) and "AZURE_STORAGE_ACCOUNT_NAME" not in os.environ:
account_name = cls.get_abfs_account_name(value)
os.environ["AZURE_STORAGE_ACCOUNT_NAME"] = account_name
return super().is_protocol_prefix(value)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for more readability:

@classmethod
def is_protocol_prefix(cls, value: str) -> bool:
    valid_prefix = super().is_protocol_prefix(value)
    if valid_prefix:
        # check if the AZURE_STORAGE_ACCOUNT_NAME env_variable is defined.
        # If not, set it to avoid failures when user is not ..bla..bla
        # https://github.com/NVIDIA/spark-rapids-tools/issues/981
        if "AZURE_STORAGE_ACCOUNT_NAME" not in os.environ:
            account_name = cls.get_abfs_account_name(value)
            os.environ["AZURE_STORAGE_ACCOUNT_NAME"] = account_name
     return valid_prefix

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @amahussein! Updated this.

Signed-off-by: cindyyuanjiang <[email protected]>
Copy link
Collaborator

@amahussein amahussein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cindyyuanjiang
Great fix!

@cindyyuanjiang cindyyuanjiang merged commit 6fda70c into NVIDIA:dev May 18, 2024
15 checks passed
@cindyyuanjiang cindyyuanjiang deleted the spark-rapids-tools-981 branch May 18, 2024 00:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working user_tools Scope the wrapper module running CSP, QualX, and reports (python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Failed to run spark_rapids qualification tool with abfss event logs
4 participants