-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update Databricks Azure qual tool to set env variable for ABFS paths #1016
Update Databricks Azure qual tool to set env variable for ABFS paths #1016
Conversation
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cindyyuanjiang !
Some few comments to keep the design flow.
@classmethod | ||
def init_env_vars(cls, path: Union[str, 'CspPath']) -> None: | ||
if cls.protocol_prefix == AZURE_STORAGE_PREFIX: | ||
account_name = get_abfs_account_name(str(path)) | ||
os.environ['AZURE_STORAGE_ACCOUNT_NAME'] = account_name | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very specific to AdlsPath
class which extends CspPath
. It is better to be inside adlspath.py
.
I did not look deeper in the code, but my intuition is that this code block should be inside adlspath.py
by overriding is_protocol_prefix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved this to adlspath.py
, thanks!
def get_abfs_account_name(path: str): | ||
return path.split('@')[1].split('.')[0] | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- It is always useful to use type hints to get some checks on type safety and code incorrectness. Python started adopted this typing/atype-annotations more widely to overcome the problem of type-safety that makes python difficult to work with in large projects.
def get_abfs_account_name(path: str) -> str :
- I don't see
util.py
is the best place for this because it is a functionality that's specific to ADLS driver. - Should this method try to read the env-variable. if it is not set, then it extracts it from the URL?
- Also, with defining this new method, I would expect 2 sites calling this method: 1- in pytools (old way we xtracted information for Azure); 2- During the initialization in the new spark_rapids_tools
aka,DBAzureLocalRapidsJob.get_account_name
strips the account-name from the URL; regardless of the initialization we did. It should read
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @amahussein! Updated this.
@@ -56,6 +57,7 @@ | |||
|
|||
|
|||
CspPathString = Annotated[str, StringConstraints(pattern=r'^\w+://.*')] | |||
AZURE_STORAGE_PREFIX = 'abfss://' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need this because this overrides the purpose of the design that each Class will be able to recognize its own URL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed this, thanks!
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cindyyuanjiang for this change. We are automatically setting AZURE_STORAGE_ACCOUNT_NAME
in this PR.
In addition to this, AZURE_STORAGE_CONNECTION_STRING
would also be required. Do we plan to handle this as well?
Thanks @parthosa! Setting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cindyyuanjiang. LGTM
@classmethod | ||
def is_protocol_prefix(cls, value: str) -> bool: | ||
if value.startswith(cls.protocol_prefix) and "AZURE_STORAGE_ACCOUNT_NAME" not in os.environ: | ||
account_name = cls.get_abfs_account_name(value) | ||
os.environ["AZURE_STORAGE_ACCOUNT_NAME"] = account_name | ||
return super().is_protocol_prefix(value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for more readability:
@classmethod
def is_protocol_prefix(cls, value: str) -> bool:
valid_prefix = super().is_protocol_prefix(value)
if valid_prefix:
# check if the AZURE_STORAGE_ACCOUNT_NAME env_variable is defined.
# If not, set it to avoid failures when user is not ..bla..bla
# https://github.com/NVIDIA/spark-rapids-tools/issues/981
if "AZURE_STORAGE_ACCOUNT_NAME" not in os.environ:
account_name = cls.get_abfs_account_name(value)
os.environ["AZURE_STORAGE_ACCOUNT_NAME"] = account_name
return valid_prefix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @amahussein! Updated this.
Signed-off-by: cindyyuanjiang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cindyyuanjiang
Great fix!
Fixes #981
spark_rapids qualification --eventlogs=abfss://xxxxxxxxxx
Running the above command fails because
spark_rapids
qualification tool tries to detect platform fromABFS
event logs but it cannot verify that without credentials.Instead of asking the users to specify either
AZURE_STORAGE_ACCOUNT_NAME
orAZURE_STORAGE_CONNECTION_STRING
env variables like:This PR sets the env vars in the qualification tools so that users don't have to do that.