Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SpytDistributions #35

Closed
wants to merge 3 commits into from
Closed

Conversation

faucct
Copy link
Collaborator

@faucct faucct commented Oct 3, 2024

No description provided.

@faucct faucct requested a review from alextokarew October 3, 2024 10:04
@faucct faucct marked this pull request as ready for review October 3, 2024 10:04
def get_base_cluster_config(global_conf, spark_cluster_version, params, base_discovery_path=None, client=None):
def get_spyt_distributions(client, params=None) -> SpytDistributions:
if params is None:
params = {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should explicitly resolve yt_root parameter here and pass it to SpytDistributions constructor so it will be more readable

@@ -43,6 +43,7 @@ class SparkDefaultArguments(object):
@staticmethod
def get_params():
return {
"spyt_distributions": { "yt_root": "//home/spark" },
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rename this parameter to spyt_distribution or spyt_distribution_root and set it to //home/spark by default because it is the only sensible value in this structure for now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this entity parameter nested so it would be possible to support other way of distributing YT, for example – "spyt_distributions": {"local_path": "/opt/spyt" }. I don't think that this should be a flat string parameter.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But in this pull request you don't support it. I think it should be kept as simple as possible, and if we decide to support other ways of specifying distribution (btw, is there a need for this?) we just refactor it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've meant that if we will want to support it we will have to add a new conflicting parameter entity instead of extending an existing one. That's why I didn't want to make it a flat string.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But what's the possibility of such situation? I think it's almost none

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that in the future we will most likely want to support ways of distributing SPYT different from a YT catalog, for example in Docker images, like the Spark does.

@@ -26,6 +18,71 @@
logger = logging.getLogger(__name__)


class SpytDistributions:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rename this class and the get_spyt_distributions method to SpytDistribution because plural form confuses. It actually points to a different root, but it's not multiple distributions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it is really not singular, but plural, because this yt_root contains multiple SPYT and Spark distributions and to choose one you have to provide specific version to it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need to support multiple spyt and spark distributions for a single cluster or job? I think there must be specific versions for each YT operation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The catalog //home/spark is organized in a such way that there are multiple versions in it. If I am giving a way to change this path to something else, I am giving way to customize multiple SPYT distributions, not a single one, even though only one will be used.

class SpytDistributions:
def __init__(self, client: YtClient, yt_root: str):
self.client = client
self.yt_root = YPath(yt_root)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rename yt_root to base_path so it would be SpytDistribution.base_path, I think it makes more sense.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is called root in the publisher scripts: https://github.com/ytsaurus/ytsaurus-spyt/tree/382ae1656208601175b6f004fd31a39affb17b83/tools/release/publisher
Also it is not just any root or path, but a YT one.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I think base_path in the context of this class is more suitable

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In context of this class – maybe, but in context of public interface I believe that this should be coherent with the --root parameter from publisher scripts.

@@ -3,19 +3,11 @@
from spyt.dependency_utils import require_yt_client
require_yt_client()

from yt.wrapper import get, YPath, list as yt_list, exists # noqa: E402
from yt.wrapper import get, YPath, list as yt_list, exists, YtClient # noqa: E402
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can get rid of importing and using get method because now you use YtClient instead

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside the SpytDistributions – yes, but I am not sure about the external methods: client can still be None in those.

def latest_ytserver_proxy_path(self, cluster_version):
if cluster_version:
return None
global_conf = self.global_conf
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This local variable is redundant, you can directly use self.global_conf on the next line

"Please update your local ytsaurus-spyt".format(spark_cluster_version, SELF_VERSION))

def get_available_cluster_versions(self):
subdirs = yt_list(self.conf_base_path.join(RELEASES_SUBDIR), client=self.client)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subdirs can be obtained by calling self.get_available_spyt_versions() so it will remove code duplication

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two different paths used there, so this method cannot be used.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see, conf_base_path and spyt_base_path, you're right

def _get_version_conf_path(self, cluster_version):
return self.conf_base_path.join(self._version_subdir(cluster_version)).join(cluster_version).join("spark-launch-conf")

def _version_subdir(self, version):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function can be embedded into _get_version_conf_path because it is the only place where it is used.

@faucct faucct requested a review from alextokarew October 4, 2024 05:49
@faucct faucct closed this Oct 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants