Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API] Performance Logging #669

Merged
merged 13 commits into from
Nov 8, 2022
Merged

[API] Performance Logging #669

merged 13 commits into from
Nov 8, 2022

Conversation

JasonDing0401
Copy link
Contributor

@JasonDing0401 JasonDing0401 commented Nov 5, 2022

Remaining questions:

  • Client id is not persistent for uuid.getnode()
  • Where to disable usage since we don't have ~/.skyplane/config for API. But setting environment variable still works.
  • Performance logging for per tracker (job_list) or per job

Copy link
Contributor

@parasj parasj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR looks good, but we should calculate object size in the Chunker class and then reuse that value since list_objects is slow.

@@ -17,6 +17,7 @@ google-cloud-storage
cachetools
paramiko
rich
typer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does the gateway require typer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For setting the usage here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is only used in the client, use dependency injection like this:

@imports.inject("pandas")
def monitor_transfer(pd, self):

We can remove this dependency from the Docker image then

skyplane/api/dataplane.py Outdated Show resolved Hide resolved
@@ -56,6 +56,41 @@ def estimate_cost(self):
# TODO
raise NotImplementedError

def calculate_size(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should cache this from the Chunker class and reuse it. list_objects is slow so we should only call it once. This may mean we only report usage after a transfer completes which is OK.

@@ -152,6 +187,7 @@ class CopyJob(TransferJob):
multipart_transfer_list: list = field(default_factory=list)

def __post_init__(self):
self.type = "copy"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a "type" class attribute to the dataclass with a default value. That way, it will serialize cleanly.

@@ -229,6 +265,11 @@ def verify(self):

@dataclass
class SyncJob(CopyJob):

def __post_init__(self):
self.type = "sync"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a "type" class attribute to the dataclass with a default value. That way, it will serialize cleanly.

@JasonDing0401
Copy link
Contributor Author

@parasj, any suggestions for the questions I mentioned at the beginning?

@parasj
Copy link
Contributor

parasj commented Nov 6, 2022

  1. Client id is not persistent for uuid.getnode():

provisioner.py requires keys to be stored in ~/.skyplane as well as the AWS region config. Can we unify the UUID logic to read/write a ~/.skyplane/host_uuid file? This would be separate from the ~/.skyplane/config file which is used by SkyplaneConfig.

  1. Where to disable usage since we don't have ~/.skyplane/config for API. But setting environment variable still works.

Can we have a usage_enabled bit in TransferConfig? Not sure of a good design here.

  1. Performance logging for per tracker (job_list) or per job

Either is fine, whatever is easier.

Copy link
Contributor

@parasj parasj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. This is great!

self.topology = topology
self.src_region_tag = self.topology.source_region()
self.dst_region_tag = self.topology.sink_region()
self.max_instances = int(len(self.topology.gateway_nodes) / 2)
regions = Counter([node.region.split(":")[1] for node in self.topology.gateway_nodes])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory there shouldn't be collisions between clouds but maybe should we index it as ("aws", "us-east-1") instead?

if cr.chunk.chunk_id in self.job_complete_chunk_ids[job_uuid]
]
)
return sum(bytes_total_per_job.values()) / (2**30)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from rich import print as rprint
from typing import Optional, Dict

import skyplane
import skyplane.cli.usage.definitions
import skyplane.api.usage.definitions

# from skyplane.api.client import tmp_log_dir
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove unused imports

@JasonDing0401 JasonDing0401 merged commit d8d343b into main Nov 8, 2022
@JasonDing0401 JasonDing0401 deleted the api/usage branch November 8, 2022 00:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants